The World Wide Web has been mainstream for around 20 years, and has grown phenomenally in that time. In the early 1990s, when the web was young, it was far more difficult for the average web user to create their own website. Websites were mainly hosted by tech-savvy companies or hobbyists.
In those days, there was no such thing as a ‘search engine’ -- websites were discovered by word of mouth, or one of the few ‘What’s new on the web?’ type pages that listed new sites. This was not very efficient to begin with, but as the web grew over the next couple of years it became clear that a solution was needed.
During 1993-94 the first web search engines sprang up, followed over the next couple of years by many commercial engines, including Excite, AltaVista and Yahoo. The number of web pages and users had grown to the point where discovering the content you were looking for was simply no longer manageable via a centralised list.
Google itself started in 1996, when Larry Page and Sergey Brin began working on a search engine called BackRub. They were the first to realise the power and potential of hyperlinks as a signal of trust and authority, an idea they discussed in depth in the research paper they published while at Stanford in 1998. Shortly after, PageRank was born and pushed Google ahead of its competitors on both the relevancy and quality of its results.
The World Wide Web now consists of billions of web pages, and search engines are a daily part of most people’s lives.
For a truly in-depth history of search engines, one that technically dates back to 1945, we’d recommend taking a look at Search Engine History.
Steps of Search Engines
There are 3 main areas to understand when looking at search engines: Crawling, Indexing and Ranking.
Crawling - This is the process that search engines use to discover new content. They have sophisticated programs that visit web pages and follow the links on them to find new pages.
Indexing - The search engines maintain a copy of the content of all web pages they have visited. This index is stored on a large collection of computers, in such a manner that it can be searched through very rapidly.
Ranking - This is the area of search engines that SEO is most concerned with. When a user performs a search on any search engine, the engine needs a ‘recipe’ (known as an algorithm) it can use to evaluate the pages in its index to determine which are most relevant, and thus determine in which position (rank) they are returned to the user.
For many years search engines determined which pages were most relevant for a given query based solely on the content of those pages, and how other pages on the web referred to them. All of the information that the search engines examined to make the determination of relevancy was encapsulated within the web itself.
Anyone searching for a specific word or search phrase would get the same results as everyone else who searched from within the same country.
However, over the last few years this has changed in two important ways:
Social Networking - Sites such as Facebook and Twitter provide the search engines with important clues about which webpages people are talking about, or have shared with each other. These clues give the search engines additional information, allowing them to change the ‘recipe’ used to determine a site’s ranking.
Personalised Search - Similarly, the search engines have been able to use a specific user’s social network usage, and their previous searches, to determine what is more important to them personally. This means that different users searching for the same search phrase might now see somewhat different results.
There have also been other major developments over recent years which have changed the way that people search. Google in particular has become much more advanced at using machine learning and user data to predict the best results for a given query. There are two Google features which demonstrate this ability and show the advances they have made:
Google Suggest - Launched in August 2008, Google Suggest uses advanced algorithms and machine learning to predict what you may be searching for. As you start typing your query, Google suggests keywords; this allows you to refine your query as you go and get ideas for what you may want.
Google Instant - Launched in September 2010, Google Instant significantly changed how people search by creating dynamic results as the user typed their query. The results would update “live” without the user even having to press enter.
As an SEO, it is important not only to be aware of these developments, but also to understand how they affect your work. In particular, you need to figure out how they may change the way people search and the types of keywords people use.
Before the Web became the most visible part of the Internet, there were already search engines in place to help people find information on the Net. Programs with names like "gopher" and "Archie" kept indexes of files stored on servers connected to the Internet, and dramatically reduced the amount of time required to find programs and documents. In the late 1980s, getting serious value from the Internet meant knowing how to use gopher, Archie, Veronica and the rest.
Before a search engine can tell you where a file or document is, it must be found. To find information on the hundreds of millions of Web pages that exist, a search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling. In order to build and maintain a useful list of words, a search engine's spiders have to look at a lot of pages.
How does any spider start its travels over the Web? The usual starting points are lists of heavily used servers and very popular pages. The spider will begin with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web.
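The crawling pattern just described — start from a popular seed page, index it, then follow every link while never revisiting a page — is essentially a breadth-first traversal. Here is a minimal sketch of that idea using a made-up, in-memory link graph in place of the live web (all page names are illustrative):

```python
from collections import deque

# A toy link graph standing in for the web: page -> pages it links to.
# All page names here are invented for illustration.
LINK_GRAPH = {
    "popular-site": ["news", "shop"],
    "news": ["article", "popular-site"],
    "article": [],
    "shop": ["article"],
}

def crawl(seed, link_graph):
    """Breadth-first crawl: start at a seed page, follow every link,
    and never visit the same page twice."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)           # "index" the page
        for link in link_graph.get(page, []):
            if link not in seen:     # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return order

print(crawl("popular-site", LINK_GRAPH))
# ['popular-site', 'news', 'shop', 'article']
```

A real spider would fetch pages over HTTP and parse links out of the HTML, but the spreading-out behaviour — nearby, heavily linked pages get reached first — is exactly what this traversal order shows.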
Google began as an academic search engine. In the paper that describes how the system was built, Sergey Brin and Lawrence Page give an example of how quickly their spiders can work. They built their initial system to use multiple spiders, usually three at one time. Each spider could keep about 300 connections to Web pages open at a time. At its peak performance, using four spiders, their system could crawl over 100 pages per second, generating around 600 kilobytes of data each second.
Keeping everything running quickly meant building a system to feed necessary information to the spiders. The early Google system had a server dedicated to providing URLs to the spiders. Rather than depending on an Internet service provider for the domain name server (DNS) that translates a server's name into an address, Google had its own DNS, in order to keep delays to a minimum.
When the Google spider looked at an HTML page, it took note of two things:
The words within the page
Where the words were found
Words occurring in the title, subtitles, meta tags and other positions of relative importance were noted for special consideration during a subsequent user search. The Google spider was built to index every significant word on a page, leaving out the articles "a," "an" and "the." Other spiders take different approaches.
These different approaches usually attempt to make the spider operate faster, allow users to search more efficiently, or both. For example, some spiders will keep track of the words in the title, sub-headings and links, along with the 100 most frequently used words on the page and each word in the first 20 lines of text. Lycos is said to use this approach to spidering the Web.
Other systems, such as AltaVista, go in the other direction, indexing every single word on a page, including "a," "an," "the" and other "insignificant" words. The push to completeness in this approach is matched by other systems in the attention given to the unseen portion of the Web page, the meta tags.
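The two indexing philosophies above — skip the articles versus index every single word — can be sketched side by side. The stop-word list and tokenizer here are simplified assumptions, not either engine's actual implementation:

```python
import re

STOP_WORDS = {"a", "an", "the"}   # the articles the early Google spider skipped

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def index_significant(text):
    """Google-style: index every significant word, leaving out the articles."""
    return [w for w in tokenize(text) if w not in STOP_WORDS]

def index_everything(text):
    """AltaVista-style: index every single word, articles included."""
    return tokenize(text)

page = "The spider follows a link to an archived page"
print(index_significant(page))
# ['spider', 'follows', 'link', 'to', 'archived', 'page']
print(index_everything(page))
# ['the', 'spider', 'follows', 'a', 'link', 'to', 'an', 'archived', 'page']
```

The trade-off is visible even in this tiny example: the first index is smaller, while the second can answer phrase queries that happen to contain "a", "an" or "the".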
Meta tags allow the owner of a page to specify key words and concepts under which the page will be indexed. This can be helpful, especially in cases in which the words on the page might have double or triple meanings -- the meta tags can guide the search engine in choosing which of the several possible meanings for these words is correct. There is, however, a danger in over-reliance on meta tags, because a careless or unscrupulous page owner might add meta tags that fit very popular topics but have nothing to do with the actual contents of the page. To protect against this, spiders will correlate meta tags with page content, rejecting the meta tags that don't match the words on the page.
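A correlation check of this kind can be sketched very simply: count how many of the declared meta keywords actually appear in the page's own text, and reject the tags when too few do. The threshold and matching rule here are assumptions for illustration, not any engine's real formula:

```python
def meta_tags_trustworthy(meta_keywords, page_text, threshold=0.5):
    """Accept the meta keywords only if at least `threshold` of them
    actually appear in the page's visible text (a simplified stand-in
    for the correlation check described above)."""
    words = set(page_text.lower().split())
    matches = sum(1 for kw in meta_keywords if kw.lower() in words)
    return matches / len(meta_keywords) >= threshold

page = "recipes for baking sourdough bread at home"
honest = ["baking", "bread", "sourdough"]
spammy = ["celebrity", "lottery", "bread"]
print(meta_tags_trustworthy(honest, page))  # True  (3/3 appear on the page)
print(meta_tags_trustworthy(spammy, page))  # False (only 1/3 appears)
```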
All of this assumes that the owner of a page actually wants it to be included in the results of a search engine's activities. Many times, the page's owner doesn't want it showing up on a major search engine, or doesn't want the activity of a spider accessing the page. Consider, for example, a game that builds new, active pages each time sections of the page are displayed or new links are followed. If a Web spider accesses one of these pages, and begins following all of the links for new pages, the game could mistake the activity for a high-speed human player and spin out of control. To avoid situations like this, the robot exclusion protocol was developed. This protocol, implemented in the meta-tag section at the beginning of a Web page, tells a spider to leave the page alone -- to neither index the words on the page nor try to follow its links.
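A spider honouring the meta-tag form of this protocol might check for a robots meta tag before indexing or following links, along the lines of the sketch below. The regex-based parsing is a simplification for illustration; a production crawler would use a proper HTML parser and also consult the site's robots.txt file:

```python
import re

def robots_directives(html):
    """Read noindex/nofollow directives from a page's robots meta tag.
    Absence of the tag means the spider may index and follow freely."""
    m = re.search(
        r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    if not m:
        return {"index": True, "follow": True}
    directives = {d.strip().lower() for d in m.group(1).split(",")}
    return {"index": "noindex" not in directives,
            "follow": "nofollow" not in directives}

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(robots_directives(page))
# {'index': False, 'follow': False}
```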
Building the Index
Once the spiders have completed the task of finding information on Web pages, the search engine must store the information in a way that makes it useful. There are two key components involved in making the gathered data accessible to users:
The information stored with the data
The method by which the information is indexed
In the simplest case, a search engine could just store the word and the URL where it was found. In reality, this would make for an engine of limited use, since there would be no way of telling whether the word was used in an important or a trivial way on the page, whether the word was used once or many times or whether the page contained links to other pages containing the word. In other words, there would be no way of building the ranking list that tries to present the most useful pages at the top of the list of search results.
To make for more useful results, most search engines store more than just the word and URL. An engine might store the number of times that the word appears on a page. The engine might assign a weight to each entry, with increasing values assigned to words as they appear near the top of the document, in sub-headings, in links, in the meta tags or in the title of the page. Each commercial search engine has a different formula for assigning weight to the words in its index. This is one of the reasons that a search for the same word on different search engines will produce different lists, with the pages presented in different orders.
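An inverted index that stores position-based weights, as described above, might look like the sketch below. The weight values and the shape of the postings are invented for illustration; as the text says, every commercial engine has its own formula:

```python
# Illustrative weights: a word in the title counts far more than one in
# the body. These numbers are assumptions, not any real engine's formula.
POSITION_WEIGHT = {"title": 10, "heading": 5, "body": 1}

def build_index(pages):
    """pages: {url: [(word, position_type), ...]}
    Returns an inverted index {word: {url: total_weight}}."""
    index = {}
    for url, occurrences in pages.items():
        for word, position in occurrences:
            index.setdefault(word, {}).setdefault(url, 0)
            index[word][url] += POSITION_WEIGHT[position]
    return index

pages = {
    "example.com/a": [("search", "title"), ("search", "body"), ("engine", "body")],
    "example.com/b": [("search", "body")],
}
index = build_index(pages)
print(index["search"])
# {'example.com/a': 11, 'example.com/b': 1}
```

Sorting the pages for a word by descending weight is then a first approximation of a ranked results list: the page that mentions "search" in its title outranks the page that only mentions it once in the body.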
Regardless of the precise combination of additional pieces of information stored by a search engine, the data will be encoded to save storage space. For example, the original Google paper describes using 2 bytes, of 8 bits each, to store information on weighting -- whether the word was capitalized, its font size, position, and other information to help in ranking the hit. Each factor might take up 2 or 3 bits within the 2-byte grouping (8 bits = 1 byte). As a result, a great deal of information can be stored in a very compact form. After the information is compacted, it's ready for indexing.
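Packing several small factors into a 2-byte value works by giving each factor its own slice of the 16 bits. The field widths below (1 bit for capitalization, 3 bits for font size, 12 bits for position) are illustrative choices, not Google's actual layout:

```python
def pack_hit(capitalized, font_size, position):
    """Pack three ranking factors into one 16-bit (2-byte) value:
    1 bit for capitalization, 3 bits for font size (0-7),
    12 bits for word position (0-4095). Field widths are illustrative."""
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (capitalized << 15) | (font_size << 12) | position

def unpack_hit(packed):
    """Recover the three factors from the packed 16-bit value."""
    return {
        "capitalized": bool(packed >> 15),
        "font_size": (packed >> 12) & 0b111,
        "position": packed & 0xFFF,
    }

hit = pack_hit(capitalized=1, font_size=3, position=42)
print(hit.bit_length() <= 16, unpack_hit(hit))
# True {'capitalized': True, 'font_size': 3, 'position': 42}
```

Stored this way, millions of hits fit in a fraction of the space that separate fields would need, which is exactly the compactness the paragraph above describes.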
An index has a single purpose: It allows information to be found as quickly as possible. There are quite a few ways for an index to be built, but one of the most effective ways is to build a hash table. In hashing, a formula is applied to attach a numerical value to each word. The formula is designed to evenly distribute the entries across a predetermined number of divisions. This numerical distribution is different from the distribution of words across the alphabet, and that is the key to a hash table's effectiveness.
In English, there are some letters that begin many words, while others begin fewer. You'll find, for example, that the "M" section of the dictionary is much thicker than the "X" section. This inequity means that finding a word beginning with a very "popular" letter could take much longer than finding a word that begins with a less popular one. Hashing evens out the difference, and reduces the average time it takes to find an entry. It also separates the index from the actual entry. The hash table contains the hashed number along with a pointer to the actual data, which can be sorted in whichever way allows it to be stored most efficiently. The combination of efficient indexing and effective storage makes it possible to get results quickly, even when the user creates a complicated search.
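The evening-out effect of hashing can be seen with even a deliberately simple formula: sum the character codes of a word and fold the result into a fixed number of buckets. Real engines use far stronger hash functions, and the bucket count here is an arbitrary choice for illustration:

```python
def hash_bucket(word, num_buckets=8):
    """A toy hashing formula: sum of character codes, folded into a
    fixed number of buckets. Words starting with the same letter no
    longer cluster together the way they do in a dictionary."""
    return sum(ord(c) for c in word) % num_buckets

# Three "M" words spread across different buckets, while an "X" word
# shares the table evenly with them instead of sitting in a thin section.
for w in ["map", "moon", "market", "xylophone"]:
    print(w, "->", hash_bucket(w))
```

Looking up a word now means computing its bucket number and scanning only that bucket, which takes roughly the same time whether the word begins with "M" or "X" — and, as the text notes, the bucket entry need only hold a pointer to wherever the actual posting data is stored.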