Properly Storing and Indexing Your Crawls

The first step in creating a search engine is to design and build the crawling layer architecture. Once that has been accomplished, the next step is to decide how and where you will store and index the web pages you crawl. Again, I won’t go into the specifics (programming languages, libraries, etc.) here, just the things I wish someone had told me when I built my first search engine.

The trick to building this layer of the search engine (from scratch) is that you need to know how you will be manipulating the data in the next layer: the scoring layer. The types of algorithms, and the data indexing those algorithms require, will determine how and where you store the data from the crawling layer.

For instance, what happens when you have two or more links on a web page with the same anchor text and the same target URL? You will need to store something else that uniquely identifies each link, such as its location on the page. The next time you crawl the page, if the content has changed, how do you know which link you are looking at? Was it the first one or the second? At Market Brew, we created a link matching algorithm, which included collision detection and resolution.
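Market Brew’s actual matching algorithm isn’t public, but a minimal sketch of the general idea might look like the Python below. It assumes each link is represented as a dict with source, target, anchor, and position fields; all of those names are illustrative, not anyone’s real schema.

```python
import hashlib

def link_key(source_url, target_url, anchor_text, position):
    """Uniquely identify a link; `position` (the link's ordinal in the DOM)
    disambiguates links that share anchor text and target URL."""
    raw = f"{source_url}|{target_url}|{anchor_text}|{position}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

def match_links(previous_links, current_links):
    """Pair links from the previous crawl with links from the current crawl.

    Matching is first done on (target URL, anchor text); when several links
    collide on that pair, the candidate with the closest on-page position
    wins, so stored per-link metrics carry over to the right link."""
    by_pair = {}
    for link in previous_links:
        by_pair.setdefault((link["target"], link["anchor"]), []).append(link)

    matched, new_links = {}, []
    for link in current_links:
        candidates = by_pair.get((link["target"], link["anchor"]), [])
        if candidates:
            # Collision resolution: closest position from the previous crawl.
            previous = min(candidates, key=lambda c: abs(c["position"] - link["position"]))
            matched[link_key(link["source"], link["target"], link["anchor"], link["position"])] = previous
        else:
            new_links.append(link)
    return matched, new_links
```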

One of the algorithms in a modern search engine should determine whether or not any “cloaking” is going on. Cloaking is when a web server serves one version of a page to a search engine crawler and another version to the human reading the content. Obviously, these attempts to subvert your algorithms should be penalized in your search engine results. But how will you know whether any cloaking is going on if you don’t plan for it from the beginning, during your crawling stage? You’ll need to crawl copies of the same content with different user-agents, as well as store different copies: before and after JavaScript manipulates the DOM.
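As a rough illustration, here is a sketch (using the third-party requests library) of fetching the same URL under two user-agent strings and storing a fingerprint of each response. The user-agent values and the store object are placeholders; a real system would also capture a rendered, post-JavaScript snapshot with a headless browser and would compare content far more carefully than a simple hash, since dynamic pages can legitimately differ between fetches.

```python
import hashlib
import requests

CRAWLER_UA = "MyBot/1.0 (+https://example.com/bot)"      # placeholder bot user-agent
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # placeholder browser user-agent

def fetch_fingerprint(url, user_agent):
    """Fetch a URL with a given user-agent and return a hash of the body."""
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    return hashlib.sha256(response.text.encode("utf-8")).hexdigest()

def store_cloaking_snapshots(url, store):
    """Store one fingerprint per user-agent so the scoring layer can compare
    them later. A post-JavaScript snapshot from a headless browser would be
    stored alongside these in the same way."""
    store[url] = {
        "as_crawler": fetch_fingerprint(url, CRAWLER_UA),
        "as_browser": fetch_fingerprint(url, BROWSER_UA),
    }
    # A mismatch is only a signal worth scoring later, not proof of cloaking.
    return store[url]["as_crawler"] != store[url]["as_browser"]
```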

Many linking algorithms in the scoring layer will require fast access to retrieve these links, and that means having a plan to store all “related” links physically close to each other in storage. This is such a crucial component that Google invented its own database architecture, Bigtable, to solve this problem.
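You can’t reuse Bigtable directly, but the locality-by-key idea is easy to imitate in any key-sorted store. A small sketch follows; the key layout is illustrative, though the reversed-hostname trick is the one described in Google’s Bigtable paper.

```python
from urllib.parse import urlparse

def link_row_key(source_url, position, target_url):
    """Build a sortable row key so that all links from the same site, and
    then the same page, sort lexicographically next to each other (and are
    therefore stored physically close in a Bigtable-style store).

    Reversing the hostname (www.example.com -> com.example.www) keeps a
    site's subdomains and pages together in the sort order."""
    parsed = urlparse(source_url)
    reversed_host = ".".join(reversed(parsed.hostname.split(".")))
    return f"{reversed_host}{parsed.path}#{position:06d}#{target_url}"

# Both links from the same page sort adjacently:
print(link_row_key("https://www.example.com/page", 1, "https://other.com/a"))
print(link_row_key("https://www.example.com/page", 2, "https://other.com/b"))
```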

Some decisions at the indexing layer will require you to adjust how you design the crawling layer. You may want to crawl URLs that aren’t even in the human-readable content of the web page. Part of a crawling architecture’s job is to discover every URL it possibly can. If you come across a properly formatted URL in non-human-readable content, should you store that information?
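A minimal sketch of that kind of discovery, assuming you scan the raw markup rather than just the parsed anchors (the pattern here is deliberately naive):

```python
import re

# Deliberately simple URL pattern; a production crawler would parse more strictly.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")

def extract_all_urls(raw_html):
    """Scan the raw markup, not just the visible anchors, so URLs hiding in
    comments, <script> blocks, and data attributes are also surfaced.
    Whether to store and crawl them is a separate policy decision."""
    return set(URL_PATTERN.findall(raw_html))
```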

Your crawler needs to be as smart as possible, and this means that some content needs to be stored and indexed immediately during the crawl, not as a separate process. This is because certain pieces of information are needed to direct the crawler on future tasks. In my discussion about building a crawling architecture, I mentioned that the crawler may only want to crawl a subset of the Internet. In doing that, things like dangling nodes will need to be handled. You will need a process that watches the crawl and directs when and where to start and stop. This may seem easy until you realize that there are things like canonical and hreflang tags within the content that can effectively alter the link and scoring graph.
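To make that concrete, here is a sketch of a crawl step that reads those tags immediately and lets them steer the frontier. The page, frontier, and index objects (and their methods) are hypothetical stand-ins for your parsed document, URL queue, and storage layer.

```python
from urllib.parse import urljoin

def process_page(url, page, frontier, index):
    """Index a page during the crawl and let in-content directives steer the
    crawl frontier. `page`, `frontier`, and `index` are hypothetical objects
    standing in for the parsed document, URL queue, and storage layer."""
    canonical = page.find_canonical()          # e.g. <link rel="canonical" href="...">
    if canonical and urljoin(url, canonical) != url:
        # This node collapses into its canonical target in the link graph:
        # record the relationship now and crawl the canonical URL, rather
        # than discovering the problem in a later batch pass.
        index.record_canonical(url, urljoin(url, canonical))
        frontier.enqueue(urljoin(url, canonical))
        return

    index.store_page(url, page)
    for alt_url in page.find_hreflang_urls():  # <link rel="alternate" hreflang="...">
        frontier.enqueue(urljoin(url, alt_url))
```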

Understanding Indexes

I won’t write a dissertation on how database indexes work, but when you build a search engine indexing layer, you will undoubtedly come across a number of issues related to the massive amounts of data you will need to store.

One of the biggest indexes in your database will be related to the link data you come across. Many of your link algorithms (and even content algorithms) will require you to quickly retrieve specific sets of links. The speed at which you can retrieve them will dramatically affect the speed at which you can score the data, and therefore how fresh and relevant your final results are.

Database indexes are a way to organize data so it can be retrieved efficiently, using one or more unique identifiers. For instance, retrieving a link may require you to specify the web page it is a part of, its position on that web page, and the web page it points to. These three identifiers would make up a key that could be used to quickly look up any link.
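A minimal sketch of that composite key, using SQLite for brevity (a real search engine would use a distributed store, but the idea is the same; the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect("links.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS links (
        source_url  TEXT    NOT NULL,  -- the web page the link is a part of
        position    INTEGER NOT NULL,  -- where on that page the link appears
        target_url  TEXT    NOT NULL,  -- the web page the link points to
        anchor_text TEXT
    )
""")
# The three identifiers form a composite key for fast lookups.
conn.execute("""
    CREATE UNIQUE INDEX IF NOT EXISTS idx_link_key
    ON links (source_url, position, target_url)
""")
# Lookups on the full key, or on any leading prefix of it, can use the index.
rows = conn.execute(
    "SELECT anchor_text FROM links WHERE source_url = ? AND position = ?",
    ("https://example.com/page", 3),
).fetchall()
```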

So why not create indexes everywhere? Because they are expensive in both resources and time. Each time an index is created, a separate storage area is dedicated to it. In addition, once indexes grow very large, it can be very expensive to insert new data into them.

During Market Brew’s nascent years, we used a MySQL database. We rolled out a BETA version of our search engine model, and everything worked…until all of a sudden we had thousands of customers using it! We quickly discovered that the MySQL database had a major INSERT bottleneck that prevented us from building large link indexes. After a certain point, our indexing (and crawling) came to an abrupt halt. In that case, we not only had to redesign our indexing architecture, but we also ended up switching database software.

All this is to say that when and where you create indexes is a crucial part of designing this layer of your search engine architecture.

Choosing Granularity

When we originally designed Market Brew’s search engine model, we started with 13 database columns for each link. After five years of building new scoring algorithms, that number grew to over 35 columns. That means there were more than 35 link characteristics that had to be stored separately for efficient processing at the scoring layer. This may seem innocuous, but adding new columns to a database can be a very expensive proposition.

One expensive aspect of changing the database structure after it has been populated with millions (or billions) of rows is that all of the affected indexes need to be updated (and sorted again). These are very time-consuming tasks, and depending on your business product, you may not have all the time in the world to make these changes.

When in doubt, always store more granular data (and indexes) than you need at the time. This will save you heartache down the line, and it will provide you with the flexibility that a search engineer needs when iterating a design as large as this.

Many characteristics of the data will not be available at the indexing stage; some will be filled in at a later scoring stage. An example is storing the number of duplicate links across a website (like a link that appears on every page in a header or footer) or across many websites (like a link ad that appears on affiliate sites). At Market Brew, we found that a link algorithm was necessary to demote these types of links in each page’s link ranking, so we later had to gather the count of all links we found with similar anchor text and/or target URL, and then store that count in a new link column in our database.
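Continuing the earlier SQLite sketch (same illustrative schema, not Market Brew’s actual one), a backfill of that kind of column might look like this:

```python
import sqlite3

conn = sqlite3.connect("links.db")

# The new column, added after the fact once the scoring layer needed it
# (the expensive kind of schema change described above).
conn.execute("ALTER TABLE links ADD COLUMN duplicate_count INTEGER DEFAULT 0")

# Backfill: count how many stored links share the same anchor text and
# target URL, then write that count onto every matching row so the
# link-ranking algorithm can demote boilerplate (header/footer/affiliate)
# links cheaply at scoring time.
conn.execute("""
    UPDATE links
    SET duplicate_count = (
        SELECT COUNT(*) FROM links AS dup
        WHERE dup.anchor_text = links.anchor_text
          AND dup.target_url  = links.target_url
    )
""")
conn.commit()
```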

Storing Useless Data

Part of efficiently designing the database layer is storing seemingly useless data. One of the first things I learned in my early software career was that you should never delete data. This has more to do with user interaction with your product than anything else: users sometimes delete stuff that they later need. And they will blame you for not having it!

But in this context, I’m talking about storing useless data that a search engine will actually need; not for the final product, but for the logistical crawling process. For instance, the number of web pages that end up outside of the link graph or scoring process is larger than you think.

In Market Brew’s search engine models, we found that 27% of the web pages we crawled were either broken, blocked, or redirected. These pages are still important to store: our crawlers must know which pages to avoid re-crawling in the future. Other web pages may point to them incorrectly, and we don’t want to compound those errors.
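A minimal sketch of that bookkeeping, assuming a simple per-URL status record (the statuses, field names, and retry policy here are all illustrative):

```python
from enum import Enum

class CrawlStatus(Enum):
    OK = "ok"
    BROKEN = "broken"          # 4xx/5xx or unreachable
    BLOCKED = "blocked"        # disallowed by robots.txt or noindex
    REDIRECTED = "redirected"  # points elsewhere; crawl the target instead

def should_enqueue(url, status_store, max_retries=3):
    """Consult previously stored crawl outcomes before re-crawling a URL.

    'Useless' pages are still recorded so that other pages linking to them
    do not repeatedly pull them back into the crawl frontier."""
    record = status_store.get(url)
    if record is None:
        return True                                  # never seen before
    if record["status"] is CrawlStatus.REDIRECTED:
        return False                                 # follow the target, not the source
    if record["status"] in (CrawlStatus.BROKEN, CrawlStatus.BLOCKED):
        return record["attempts"] < max_retries      # give up after a few tries
    return True                                      # healthy pages can be re-crawled
```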

Another thing you will find later, in the scoring process of a search engine, is that most of the web pages (nodes) in a link graph are useless in determining the ranking order. In fact, Google initially created its Supplemental Index to handle this specific artifact of scoring. They correctly determined that most of the content on the Internet was not useful (from a citation-ranking perspective) and created a separate physical database structure to keep the critical, speed-optimized indexes clear of this less important data.

You may want to design your indexes and database architecture with this in mind, but if you are starting from scratch it is certainly a challenge. How do you know which data is useless until you have designed all of the scoring algorithms? It is always better to iterate here, but just be aware that this is a big performance consideration at the scale of search engine data.

Summary

The way in which you store and index the data from your crawling layer may not be as sexy as writing scoring algorithms, but it turns out to be one of the most critical components of building a search engine. 

The performance of your search engine depends on correctly designing and placing the right indexes, and on choosing the appropriate granularity of the data that is stored in the first place.

You will most likely run into some very painful iterations along the way (unless you have done this a number of times), no matter how much advice people like me give you. 

Accordingly, it’s important to understand that a flexible hardware and software architecture is vital to this process. You will never be able to anticipate every design decision that needs to be made, but you can make it a less painful situation when you do iterate.

Almost in tandem with designing the indexing layer, you will be designing the search engine scoring layer.