Interpreting the “Crawl Queue” report with SharePoint Search

We noticed something interesting a few weeks ago while working on a FAST SharePoint 2010 system.  One of the advanced Search Administration Reports, the Crawl Queue report, suddenly showed a sustained spike in the “Links to Process” counter.  This report is specific to the FAST Content SSA, and it shows two metrics:

  • Links to Process:  Incoming links to process
  • Transactions Queued:  Outgoing transactions queued

New crawls would still start and complete without problems, which led us to believe that the spike came from one specific crawl that had never completed.  We guessed that one of the crawls was in a Paused state, which turned out to be correct and saved us from writing SQL statements to work out what state the various crawls were in.  Once this particular crawl was resumed, Links to Process went down as expected.  The episode did give me a reason to explore what exactly happens when a crawl starts up and is then either Paused or Stopped.

My colleague Brian Pendergrass describes a lot of these details in the following articles:

http://blogs.msdn.com/b/sharepoint_strategery/archive/2012/10/30/sp2010-search-explained-crawling.aspx

http://blogs.msdn.com/b/sharepoint_strategery/archive/2014/02/10/sharepoint-search-and-deadlocks-in-sql-server.aspx

At a very high level, here is what happens when a crawl starts up:

  • The HTTP protocol handler starts by making an HTTP GET request to the URL specified by the content source.
  • If the file is successfully retrieved, a Finder() method enumerates the items-to-be-crawled from the WFE (Web Front-End), and all the links found in each document are added to the MSSCrawlQueue table.  The Gathering Manager (mssearch.exe) is responsible for that.  This is exactly what the “Links to Process” metric shows on the graph: the links found in each document but not yet crawled.  If there are still links to process, they can be seen by querying the MSSCrawlQueue table.
  • Items actually scheduled to be crawled and waiting on callbacks can also be seen in the MSSCrawlURL table.  This corresponds to the “Transactions Queued” metric in the graph.
  • Each Crawl Component involved with the crawl then pulls a subset of links from the MSSCrawlQueue table and attempts to retrieve each link from the WFE.
  • A link is removed from MSSCrawlQueue once it has been retrieved from a WFE and a callback indicates that the item has been processed/crawled.

Once a specific crawl was set to a Paused state, the enumerated items-to-be-crawled stayed in the MSSCrawlQueue table and were not cleared out, which is what kept the “Links to Process” metric elevated in the graph.  If we had instead Stopped the crawl, these links would have been cleared out of the table.
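To make the difference between Paused and Stopped concrete, here is a minimal toy model in Python.  It is only a sketch of the behavior described above, not SharePoint code: the ToyCrawl class, its method names, and the http://wfe/... URLs are all hypothetical, and the in-memory list merely stands in for the MSSCrawlQueue table.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCrawl:
    """Toy stand-in for one crawl and its queued links (hypothetical, not a SharePoint API)."""
    state: str = "Crawling"
    queue: list = field(default_factory=list)    # stands in for rows in MSSCrawlQueue
    in_flight: set = field(default_factory=set)  # pulled by a crawl component, awaiting callback

    def enumerate_links(self, links):
        """Enumeration step: newly found links are added to the queue."""
        self.queue.extend(links)

    def pull_batch(self, size):
        """A crawl component pulls a subset of links; a crawl that is not running hands out nothing."""
        if self.state != "Crawling":
            return []
        batch = [url for url in self.queue if url not in self.in_flight][:size]
        self.in_flight.update(batch)
        return batch

    def callback(self, url):
        """The link was retrieved and processed, so it leaves the queue."""
        self.in_flight.discard(url)
        if url in self.queue:
            self.queue.remove(url)

    def pause(self):
        # Paused: queued links stay where they are.
        self.state = "Paused"

    def resume(self):
        self.state = "Crawling"

    def stop(self):
        # Stopped: queued links are cleared out.
        self.state = "Stopped"
        self.queue.clear()
        self.in_flight.clear()

    @property
    def links_to_process(self):
        return len(self.queue)


def drain(crawl, batch_size=4):
    """Keep pulling and acknowledging batches until nothing more is handed out."""
    while batch := crawl.pull_batch(batch_size):
        for url in batch:
            crawl.callback(url)


if __name__ == "__main__":
    crawl = ToyCrawl()
    crawl.enumerate_links([f"http://wfe/doc{i}" for i in range(10)])
    for url in crawl.pull_batch(4):
        crawl.callback(url)
    crawl.pause()
    drain(crawl)
    print("Paused, links to process:", crawl.links_to_process)
    crawl.resume()
    drain(crawl)
    print("Resumed, links to process:", crawl.links_to_process)
```

In this model a paused crawl keeps its remaining links queued until it is resumed, which mirrors the sustained “Links to Process” level we saw in the report, while calling stop() empties the queue immediately.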

This behavior should be similar in SharePoint 2013 Search.

About Igor Veytskin

I have been working with Enterprise Search since 2005, ever since joining a company called FAST Search & Transfer. I'm currently working as a Premier Field Engineer with Microsoft, helping customers with large ESP, FS14 and SharePoint 2013 implementations.