How To Build a GoogleBot (Web) Crawler

Video: How To Build a GoogleBot (Web) Crawler

If you haven’t checked out my story of how I got into building search engines, read that first. I won’t go into the specifics (programming languages, libraries, etc…) here, just the things I wish I had someone tell me when I built my first search engine.

Larry Page’s discovery of how audacious a task it was in 1996 to build a crawler to crawl the Web was an interesting story. Back then, there were approximately 10 million web pages on the Internet, and far less powerful hardware resources to crawl those pages. He called his crawler “BackRub”.

The original BackRub infrastructure (the beginnings of Google).

At the time, only two other major players had tackled the challenges of building a scalable crawler infrastructure: AltaVista and Excite. Larry Page and Sergey Brin were the next two to discover just how complicated a project like this was.

My story starts a few years later, still well before open source libraries or generic SaaS crawler offerings that can now solve many of the long tail problems that come with building a crawler on your own.

One of the major challenges is the sheer scale of things. The startup costs associated with such an undertaking is enormous. But building a crawler that works well across billions of web pages doesn’t take hardware…it takes time. Time to discover (and code) all of the “edge cases” that invariably surface as the crawler discovers more and more of the Internet. Every time you thought you had written a rules-based algorithm that covered all the cases you were going to come across, there would be a website that would send the crawler into a dizzying path of infinite loops or out of memory errors.

Then there is the sizable storage requirements that are required. Today, these are easily solvable using services like Amazon’s AWS. Back when I started though (during the beginnings of Google), there were no cloud storage services. You were buying off the shelf hardware at your local electronics warehouse (Fry’s if you lived in the Silicon Valley), and it wasn’t cheap.

Fast networking is also a requirement, even in prototyping stages. Back then, not every place had fast (and reliable) networking. Startup costs were considerable if you didn’t live on or near a campus like Stanford or Carnegie Mellon University.

Setting Limits

After two decades of building search engines, I’ve gained a lot of experience and foresight into what I could have done differently when I first started building website crawlers.

The first, most important lesson that I learned is that if you are starting from scratch, you should establish boundaries or limits on the crawling that your crawler will do. Some examples of these limits would be:

What type of content should I crawl? One of the first things you must do when crawling a URL is to identify what type of content you just crawled. You don’t have to crawl the entire content to do this: you can determine the type of an input stream based on the characters at the beginning of the input stream. There are many types of content. You should first start by just crawling “text/html”, but you can evolve to crawling other types that are necessary for a final production crawler. These types include “text/css”, “text/plain”, “text/html”, “text/csv”, “text/javascript”, “application/xml”, and “application/json”.
How many bytes should I crawl? Based on the content type, you will need to set a limit on the size for each URL. This way, no individual URL can cause your crawling to come to a complete stop. To determine what the limit should be, you’ll run into a chicken and the egg problem. You first need to determine what is one standard deviation from the mean size of all URLs. But to do that, you’ll need to crawl all the URLs! Instead, you can start with something large enough to catch 99% of the URLs, and size down the limit as you get more information.
How many redirects should I follow? One of the responses you can get back from a web server is a redirect code. Typically you will run into a 301 or 302 redirect code, but any code in the 300s will indicate that the content you tried to crawl is located at an alternative URL that you now need to fetch. There are also other mechanisms that this can be triggered by: the META “refresh” and the “location” header fields. These should also be considered redirects. Regardless of the mechanism, once you have discovered a redirect, you must consider putting a limit on consecutive redirects. Otherwise, the crawler will be stuck in an infinite loop. I found that three redirects was a nice limit. Once you go past this number, simply mark the URL as “TOO MANY REDIRECTS” and move on.
How long should I wait? Determining how long your crawler should wait for the web server to response is a tricky thing (especially when considering that your own networking could be the culprit, see below). This is again a chicken and the egg problem: you must determine what the average response time is for a URL. You can start simple, like in the case of the size limit, and get more sophisticated in the future by storing average response times for a specific IP address range, or by content type.

Checking Your Crawler’s Networking Status

Another challenge when crawling is determining a baseline for your crawler’s own networking bandwidth and speed. Since some of the META information that should be recorded are items like web server response time, your crawler must not accidentally impart its own networking delays into these metrics.

In order to accomplish this, you must have a separate process that can determine a baseline (and subsequent deviations from this baseline) of your networking status. This can be a known web server that is consistent enough that you always know the response time. Or, better yet, a known ping server that only requires a small packet of information and has very fast response times. The challenge of this is to not get banned for too many tests against these servers, so some basic logic should be provided to only check this when a suspected deviation has occurred (your URL is taking too long to respond as in the case of a timeout, or the URL response time was much longer than previously recorded).

One trick to not having to do too many of these requests is to be smarter about detecting situations that may cause your architecture to attempt repeated networking checks. An example of this may be to have a separate process that catalogues all of the timeouts occurring per web server. If a server starts returning too many timeouts, have your crawler skip these pages and come back later.

Determining The Character Encoding

The Internet spans many different character encodings due to the varying languages that are used. In order to later parse this content correctly, you must determine the encoding of the content you are crawling.

Another challenge I ran into when first building a crawler was the underlying storage sometimes had issues with certain characters in certain encodings. For instance, we used the PostgreSQL relational database, which had a bug dealing with storing the NUL entity character (/0 or &#0). These types of edge cases have to be dealt with depending on the infrastructure that you are working with.

Crawling Smartly

As you begin to crawl a website, you will undoubtedly come across efficiencies that enable you to speed up your overall crawling experience. Storing certain META information about your crawls can lead to a faster and leaner crawler, and in the world of crawlers this is what you always want.

An example here is the preferred subdomain of a website. Most websites use the standard “www” subdomain, but not all do. Some leave the “www” part off, even though when crawling the links on the content it serves says otherwise. Instead of having to continually redirect the crawler each time it attempts to crawl a new URL, a smarter crawler will instead realize that is unnecessary.

Another more modern example is the http:// vs. https:// protocol. Most sites are now switching to the secure protocol of https://, but they have not yet updated all of their links.

Once you get past the redirect issues, you’ll run into the includes files like CSS and Javascript. The crawler will need to fetch all of the resources of each page to correctly parse the content. But many of a website’s pages will typically share the same CSS and Javascript content and so therefore these files can be cached. Determining the length of caching and the triggers to go back and re-crawl these resources are part of building a crawler.

Crawling Fast Enough

The Internet changes fast. Really fast. In fact, one of the biggest challenges to building a crawler today isn’t storage or processing power. It’s the fact that by the time you’ve crawled all the pages in a website, more than likely some of those pages have already changed. Expand that thought to the entire Internet and you’ve got a hefty challenge on your hands.

Many non-programmers think serially; that is, they think to build the fastest crawler means you must have the fastest computers, or use the fastest programming language. This couldn’t be further from the truth.

One of the most important problems to solve when building a crawler is to design it in such a way that allows easy scalability. This used to mean hardware architecture; however, in modern times we have realized that this is not the optimum path. You see, no single unit of hardware is fast enough or big enough to solve the needs of a Internet-sized crawler. Instead, what we do today is to break up the Internet into “shards”, or logical groups of data that can be physically put on separate machines. When we want that data, we have a catalog that we can retrieve to lookup where the shard we want is located.

Does Your Crawler Need To Crawl Everything?

The answer to this is probably no. Unless you are trying to create the next Google, it is likely that you only need to crawl a subset of the Internet to accomplish your business goals.

At Market Brew, we are mostly dealing with creating models of specific parts of the Internet. So we only need to crawl a statistical sampling of those pages. Accordingly, we created an intelligent crawler that learned how many (and which) pages it needed to crawl to build a statistically significant sample.

An example would be crawling a subset, and then on the next pass determining if a slightly larger sample gained any additional accuracy. If not, then the next pass might decrease the sample, and measure again. Also, as in the case of our search engine models at Market Brew, if you are modeling a search result with a specific URL in it, you must ensure that this URL is part of the model. Also, you will need to make sure that you handle things like canonical and hreflang links, the target of which needs to be included.

Be forewarned though, if and when you take this approach, during the processing and scoring process, you will run into a rather insidious artifact of only crawling a subset of pages. That is, artificial dangling nodes in your link graph. You will need to come up with a rather intricate algorithm that not only determines where you stop your crawl but also where you stop your scoring.

Crawling Politely

Your crawler needs to be a good citizen. That means it must abide by the robots exclusion standard. Historically, this has not been a simple thing to do.

First, many websites will not correctly format the robots.txt file. You may have conflicting directives, ambiguous directives, and everything in between.

On top of this, Google and other search engines have added vendor-specific directives that have changed their behavior over time. You will need to decide which directives your crawler will follow, but the safest bet is to abide by the leading search engine’s robots.txt protocol.

Your crawler should also have a standard user-agent that describes what your crawler is. This can be accomplished by adding a URL into the user-agent description. This gives the IT department at the web server a chance to lookup what is going on.

In addition to crawling according to the robots.txt protocol, your crawler should be able to be throttled; that is, slowed down so as to not place any undo strain on the respective web server.

I may be biased, but I am in the camp that believes each and every public page on the Internet should be able to be retrieved in a manner that abides by the robots.txt protocol and also does not place any undo strain on that web server.

Some web servers will selectively decide who can access their public content. I can understand this: many travel sites, for instance, will block attempted “scrapers”. These scrapers are crawlers whose business interests are to copy and republish (i.e. steal) that content.

These crawlers will often use a proxy network to crawl, effectively showing the web server many different IP addresses. In addition, those crawlers will typically not display the true user-agent header that I mentioned should be a part of all polite crawlers.

Savvy websites will often find patterns or have a blacklist of IP addresses based on known proxy networks, to block these crawlers. That is why using a proxy network that can be looked up on the Internet will typically not work (for long).

Summary

The process of building a scalable crawler is a highly iterative one. This is because the Pareto Principle comes into play: 80% of the problems will come from 20% of the URLs. The only way to move your crawler from a prototype to a modern one will be to either use the many shortcuts available today (open source libraries or SaaS based crawlers that have already solved these issues) or to be aware that the crawler you have built is only as good as the next edge case that you will inevitably run into.

To give you an idea how long the process is: the crawler that we built for our search engine modeling company, Market Brew, began to stabilize only after about eight years. This was after crawling the majority of the Internet many thousands of times over. That is, we were continually finding edge cases where the crawler was stuck or had caused something like an out of memory failure. Eight years of continuously crawling!

The goal of your crawler should be to correctly fetch and store a true copy of the URLs content (and instructions on how that content may be updated before a user views it). Once this is successfully done, you are ready to move onto the next step: properly storing and indexing your crawls.