Friday, 13 February 2015

The Difference Between Crawling and Indexing

I talk a lot about crawling and indexing, but I think it’s worthwhile to back up and describe some of what’s going on.

The terms crawling and indexing (and indexing’s cousin, caching) are frequently used together, but they are not synonyms.
Exact definitions probably differ from person to person, but here is how I explain the processes:

Crawling is the process of an engine requesting, and successfully downloading, a unique URL. Obstacles to crawling include a lack of links to the URL, server downtime, robots exclusion, and links (such as some JavaScript links) from which bots cannot extract a valid URL.
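Of the obstacles above, robots exclusion is the easiest to check for yourself. Here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content and the example.com URLs are made up for illustration, and this only shows how a rule-following bot would read the file, not how any particular engine behaves.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt used only for this example.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

blocked = is_crawl_allowed(SAMPLE_ROBOTS, "Googlebot", "http://example.com/private/page.html")
allowed = is_crawl_allowed(SAMPLE_ROBOTS, "Googlebot", "http://example.com/public/page.html")
print(blocked)  # False: the path falls under the Disallow rule
print(allowed)  # True: no rule matches this path
```

To check a live site, the same module can download the file with set_url() and read() instead of parsing a string.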

Indexing follows successful crawling, but it is a separate step. I consider a URL to be indexed (by Google) when an info: or cache: query produces a result, signifying the URL’s presence in the Google index. Obstacles to indexing can include duplication (the engine might decide to index only one version of content for which it finds many nearly identical URLs), unreliable server delivery (the engine may decide not to index a page that it can reach on only one-third of its attempts), and so on.


What’s the difference between crawling and indexing, in terms of time?


For example:

I recently watched a newly introduced URL to see when it would be indexed. I ran a text cache query for the URL every four hours, starting when the URL went live on July 2. (This URL was one of a number of URLs linked from a new site map.)
On July 17, the cache query finally returned a result instead of “Your search – cache:[URL] – did not match any documents.” What was interesting is that the cached file showed the URL “as retrieved on 8 Jul 08.” In other words, the URL was crawled and cached nine days before it appeared in the index.
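To make the timeline concrete, here is the date arithmetic as a small sketch. The three dates come from the test above; the year 2008 is my assumption, read off the “8 Jul 08” cache stamp, and the variable names are just for illustration.

```python
from datetime import date

YEAR = 2008  # assumption: inferred from the "8 Jul 08" stamp on the cached copy

went_live = date(YEAR, 7, 2)          # URL first linked from the site map
cached_copy_dated = date(YEAR, 7, 8)  # retrieval date shown on the cached file
visible_in_index = date(YEAR, 7, 17)  # first cache: query that returned a result

live_to_crawl = (cached_copy_dated - went_live).days
crawl_to_index = (visible_in_index - cached_copy_dated).days
live_to_index = (visible_in_index - went_live).days

print(live_to_crawl)   # 6
print(crawl_to_index)  # 9
print(live_to_index)   # 15
```

Because I only checked every four hours, the July 17 figure could be off by up to four hours, which does not change the day counts.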
A better, more comprehensive test would be to watch the server logs and see how many times the file was requested, and how frequently, between the date of the first request and the date on which the cache query first showed results. Additional testing would look for ways to shorten that time, such as increasing the number (and prominence) of incoming links.
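The log-watching test could start with something like the sketch below. It assumes the server writes the common "combined" access log format, and the sample lines, IP addresses, and the /new-page.html path are invented for illustration. It also identifies the bot by its user-agent string alone, which anyone can spoof, so a careful test would also confirm the requesting IP (for example with a reverse DNS lookup).

```python
import re
from datetime import datetime

# Matches the combined log format: host, identity, user, [time],
# "request", status, size, and optionally "referer" and "user-agent".
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def bot_requests(log_lines, path, agent_substring="Googlebot"):
    """Return sorted timestamps of requests for path whose user agent contains agent_substring."""
    hits = []
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent") or ""
        if match.group("path") == path and agent_substring in agent:
            hits.append(datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"))
    return sorted(hits)


def gaps_in_hours(timestamps):
    """Return the hours elapsed between each pair of consecutive requests."""
    return [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(timestamps, timestamps[1:])]


# Hypothetical log lines: two bot requests and one ordinary browser request.
SAMPLE_LOG = [
    '66.249.66.1 - - [08/Jul/2008:03:12:45 +0000] "GET /new-page.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [09/Jul/2008:10:00:00 +0000] "GET /new-page.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1)"',
    '66.249.66.1 - - [11/Jul/2008:03:12:45 +0000] "GET /new-page.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

requests_seen = bot_requests(SAMPLE_LOG, "/new-page.html")
print(len(requests_seen))            # 2
print(gaps_in_hours(requests_seen))  # [72.0]
```

Comparing the first timestamp with the date the cache query starts returning results gives the crawl-to-index delay, and the gaps show how often the bot came back in between.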
