How to tell that a Continuous Crawl is actually working

One of the great things about Continuous Crawl is that it’s supposed to just work: you enable it and ‘mini-crawls’ are kicked off periodically, keeping your index fresh.  However, it’s not easy to see what’s happening when something goes wrong, or if you simply need to find out why a specific document isn’t searchable yet, since the crawl logs are aggregated over 24 hours per Content Source.  This means that you can’t get information about each individual mini-crawl from the crawl logs themselves.

One way to see what’s happening is to look directly at SQL. The table is called MSSMiniCrawls and it resides in your SSA (Search Service Application) database. This will at least tell you when a specific mini-crawl was kicked off or will be kicked off (based on the configured interval), which can come in handy if someone mentions that they’ve just uploaded a new document and is wondering when it will get crawled/indexed.
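If you want to peek at the table directly, something like the sketch below works. This is a minimal example, not an official query: the SQL instance and database names are placeholders for your own environment, and since the MSSMiniCrawls column layout isn’t documented, it simply selects everything.

```powershell
# Hypothetical sketch: list recent rows from the MSSMiniCrawls table in the SSA database.
# "SQL01" and "SSA_DB" are placeholders -- substitute your SQL instance and SSA database name.
Invoke-Sqlcmd -ServerInstance "SQL01" -Database "SSA_DB" -Query @"
SELECT TOP 20 *
FROM dbo.MSSMiniCrawls
ORDER BY 1 DESC
"@
```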

[Figure: Continuous_Crawl]

 

Each mini-crawl is recorded in this table and given a MiniCrawlID. If you have many Content Sources, it’s also easy to find out which content source a ContentSourceID corresponds to via PowerShell.
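For example, a quick way (run from a SharePoint Management Shell on a server in the farm; assumes a single SSA) to map those IDs back to content source names:

```powershell
# Map ContentSourceID values from the MSSMiniCrawls table back to content source names.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
$ssa = Get-SPEnterpriseSearchServiceApplication
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
    Select-Object Id, Name, Type
```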


“About X results”: Sharepoint 2013 Search result counts keep changing

A question that’s asked fairly often concerns SharePoint Search showing inconsistent result counts: the total displayed on the very first page changes as you move through the result pages. Here is an example below:

[Figure: "About X results" screenshot]

Interestingly enough, this is pretty common for most search engines out there, and I’ll start by showing a quick example of similar behavior with Google search. I’m going to run a search for “DocumentumConnector”, which you can see returns “About 13,000 results” on my first page.

[Figure: DC_1]

So what happens when I get to Page 10? There are now “About 12,800 results” shown, which means these numbers are just estimates and should get more precise as we move closer to the tail of the result set:

[Figure: DC_2]

Let’s get back to SharePoint 2013 Search and explain what happens here. The “About x results” is presented when the total number of results is uncertain: Collapsing (i.e. duplicate detection, in the typical case) is enabled and the result set has only been partially processed.

  •  With TrimDuplicates = $true (the default), SharePoint Search uses a default CollapseSpecification = “DocumentSignature”. This means we are collapsing on what are considered to be duplicate documents.
  •  Processing all of the results is too costly in terms of quickly returning results, which is why an estimate is given.

To recap, this means that the full result set isn’t processed for duplicates, and the number returned as the total number of hits after duplicate removal is an estimate based on how many duplicates were found in the results that were de-duplicated (based on the collapsing done so far).

By default, your result page shows 10 results at a time. You may notice that for result sets with 10 or fewer results you always get an accurate number, and so “About” is omitted. For more than 10 results, you will see “About x results” on the first page and keep seeing it until you reach the very last page of results; only then do we actually know the exact number of results and can once again omit the “About”.
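If you want to see the effect yourself, here is a rough sketch using the Search REST API, comparing the reported total with and without duplicate trimming. The site URL and query text are placeholders, and this assumes you can call the REST endpoint with your Windows credentials.

```powershell
# Compare result counts with and without duplicate trimming (placeholder URL and query).
$site = "http://intranet.contoso.com"
foreach ($trim in "true", "false") {
    $url = "$site/_api/search/query?querytext='documentum'&trimduplicates=$trim"
    $r = Invoke-RestMethod -Uri $url -UseDefaultCredentials `
         -Headers @{ Accept = "application/json;odata=verbose" }
    "TrimDuplicates=$trim : TotalRows = " +
        $r.d.query.PrimaryQueryResult.RelevantResults.TotalRows
}
```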


Interpreting “Crawl Queue” report with Sharepoint search

We noticed something interesting a few weeks ago while working on a FAST for SharePoint 2010 system.  One of the advanced Search Administration Reports, the Crawl Queue report, was suddenly showing a consistent spike in the “Links to Process” counter.  This report is specific to the FAST Content SSA and the two metrics shown are:

  • Links to Process:  Incoming links to process
  • Transactions Queued:  Outgoing transactions queued

New crawls would then start and complete without problems, which led us to believe that this had to do with a specific crawl that never completed.  We took a guess that perhaps one of the crawls had been left in a Paused state, which ended up being correct and saved us from writing SQL statements to figure out what state various crawls were in.  Once this particular crawl was resumed, “Links to Process” went down as expected.  This did give me a reason to explore what exactly happens when a crawl starts up and is either Paused or Stopped.
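As an aside, a quick non-SQL way to check for a paused or stuck crawl is to look at the content sources from PowerShell. This is a sketch from memory — I believe the property is CrawlState, but verify the exact property names on your build:

```powershell
# Show the crawl state of each content source (property names may vary by build).
$ssa = Get-SPEnterpriseSearchServiceApplication
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
    Select-Object Name, CrawlState, CrawlStarted, CrawlCompleted
```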

My colleague Brian Pendergrass describes a lot of these details in the following articles:

http://blogs.msdn.com/b/sharepoint_strategery/archive/2012/10/30/sp2010-search-explained-crawling.aspx

http://blogs.msdn.com/b/sharepoint_strategery/archive/2014/02/10/sharepoint-search-and-deadlocks-in-sql-server.aspx

At a very high level, here is what happens when a crawl starts up:

  • The HTTP protocol handler starts by making an HTTP GET request to the URL specified by the content source.
  • If the file is successfully retrieved, a Finder() method enumerates items-to-be-crawled from the WFE (Web Front-End) and all the links found in each document are added to the MSSCrawlQueue table.  The Gathering Manager (mssearch.exe) is responsible for this.  This is exactly what the “Links to Process” metric shows on the graph: links found in each document but not yet crawled.  If there are links still to process, they can be seen by querying the MSSCrawlQueue table.
  • Items actually scheduled to be crawled and waiting on callbacks are seen in the MSSCrawlURL table.  This corresponds to the “Transactions Queued” metric in the graph.
  • Each Crawl Component involved with the crawl then pulls a subset of links from the MSSCrawlQueue table and actually attempts to retrieve each link from the WFE.
  • These links are removed from the MSSCrawlQueue once each link has been retrieved from a WFE and a callback indicates that the item has been processed/crawled.

Once a specific crawl was set to a Paused state, the enumerated items-to-be-crawled stayed in the MSSCrawlQueue table and were not cleared out, which corresponds to the “Links to Process” metric in the graph.  If we had instead Stopped the crawl, these links would actually have been cleared out from the table.
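If you do want to correlate the two report metrics with the underlying tables, here is a rough sketch. The server and crawl store database names are placeholders, and the exact schema may differ between builds:

```powershell
# Hypothetical sketch: row counts behind the "Links to Process" and "Transactions Queued" metrics.
Invoke-Sqlcmd -ServerInstance "SQL01" -Database "SSA_CrawlStoreDB" -Query @"
SELECT 'Links to Process'    AS Metric, COUNT(*) AS ItemCount FROM dbo.MSSCrawlQueue
UNION ALL
SELECT 'Transactions Queued' AS Metric, COUNT(*) AS ItemCount FROM dbo.MSSCrawlURL
"@
```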

This behavior should be similar in SharePoint 2013 Search.


Notes on securing data with Sharepoint 2013 Search

A few days ago a question came up about SharePoint 2013 Search from a security perspective, specifically about any file-storage paths where ingested content may be stored, temporarily or permanently. An example is a document that contains Personally Identifiable Information (PII): for auditing purposes it’s important to know where this document may end up on disk. We are leaving SharePoint databases out of this example.

Before talking about specific file paths, here are some general tidbits on this topic I’ve been able to gather.

  •  The SharePoint 2013 Search Service does not encrypt any data.
  •  All temporary files are secured by ACLs so that sensitive information on disk is only accessible to the relevant users and Windows services.
  • If the disk is encrypted at OS-level, this is transparent to SharePoint search. It’s important to carefully benchmark indexing and search performance when using OS-level encryption due to performance impact.
  • If you do need to use OS-level disk encryption, please first contact Microsoft support to get the official guidance from the Product Group (if official documentation is not yet available on TechNet). My understanding is that currently only Bitlocker drive encryption will work with Sharepoint 2013 Search.
  • Although both the Journal and index files are compressed, they should be considered readable.

Specific paths to where data is stored on disk at some point in time:

Index and Journal files:

C:\Program Files\Microsoft Office Servers\15.0\Data\Office Server\Applications\Search\Nodes\SomeNumber\IndexComponent_SomeNumber\storage\data

Crawler: 

1. The temp path, which is where the mssdmn.exe process initially writes the files it has gathered:
◾ [RegKey on the particular Crawl Component] HKLM\SOFTWARE\Microsoft\Office Server\15.0\Search\Global\Gathering Manager\TempPath

2. The Gatherer Data Path (shared with the Content Processing Component), which is where MSSearch.exe writes the files that were gathered by the mssdmn.exe process:
◾ [RegKey on the particular Crawl Component] HKLM\SOFTWARE\Microsoft\Office Server\15.0\Search\Components\CrawlComponent_<Number>\GathererDataPath
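A quick way to read both locations on a server hosting a crawl component, based on the registry keys above (a sketch only — run it elevated, and the component key names will include your own component numbers):

```powershell
# Crawler temp path from the Gathering Manager key.
$gmKey = "HKLM:\SOFTWARE\Microsoft\Office Server\15.0\Search\Global\Gathering Manager"
(Get-ItemProperty -Path $gmKey).TempPath

# GathererDataPath for each crawl component key.
Get-ChildItem "HKLM:\SOFTWARE\Microsoft\Office Server\15.0\Search\Components" |
    Where-Object { $_.PSChildName -like "CrawlComponent*" } |
    ForEach-Object { (Get-ItemProperty -Path $_.PSPath).GathererDataPath }
```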

Content Processing Component:
This needs to be tested a bit further and the actual path may need to be updated (I will update this later). Temporary storage for input/output data during parsing and document conversion in the Content Processing Component is under
C:\Program Files\Microsoft Office Servers\15.0\Data\Office Server\Applications\Search\Nodes\SomeNumber\ContentProcessingComponent_SomeNumber\Temp\


Crawling content with Sharepoint 2013 Search

Before we get any further with search, let’s discuss in more detail how content is gathered, and that’s via crawling.  Crawling is simply the process of gathering documents from various sources/repositories, making sure they obey various crawl rules, and sending them off for further processing to the Content Processing Component.

Let’s take a more in-depth look at how Sharepoint crawl works.

[Figure: Crawling_In_Depth_Architecture_updated]

Architecture:

There are 2 processes that you should be aware of when working with the SharePoint crawler/gatherer: MSSearch.exe and MSSDmn.exe.

  1. The MSSearch.exe process is responsible for crawling content from various repositories, such as SharePoint sites, HTTP sites, file shares, Exchange Server and more.
  2.  When a request is issued to crawl a ‘Content Source’, MSSearch.exe invokes a ‘Filter Daemon’ process called MSSDmn.exe, which loads the protocol handlers and filters necessary to connect to, fetch and parse the content.  Another way of putting it: MSSDmn.exe is a child process of MSSearch.exe that acts as a hosting process for protocol handlers.
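A trivial way to confirm both processes are present on a server hosting a crawl component (just a sanity check, nothing more):

```powershell
# Both gatherer processes should be running on a crawl server.
Get-Process -Name mssearch, mssdmn -ErrorAction SilentlyContinue |
    Select-Object Name, Id, StartTime
```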

The figure above should give you a feel for how the Crawl Component operates: it uses MSSearch.exe and MSSDmn.exe to load the necessary protocol handlers and gather documents from the various supported repositories, and then sends the crawled content via a Content Plug-in API to the Content Processing Component.  There is one temporary location listed in the figure that I should mention (as there is more than one), and that’s the network location where the crawler stores document blobs for the CPC to pick up.  It is a temporary on-disk location managed based on callbacks received by the crawler’s Content Plug-in from the indexer.

The last part of this architecture is the Crawl Store database.  It is used by the Crawler/Gatherer to manage crawl operations and to store crawl history, URLs, deletes, error data, etc.

Major Changes from SP2010 Crawler:

- The crawler is no longer responsible for parsing and extracting document properties or for tasks such as linguistic processing, as was the case with previous SharePoint Search versions.  Its job is now much closer to that of the FAST for SharePoint 2010 crawler, where the crawler is really just the gatherer of documents, tasked with shipping them off to the Content Processing Component for further processing.  This also means no more Property Store database.

- Crawl component and Crawl DB relationship.  As of SharePoint 2013, a crawl component will automatically communicate with all crawl databases if there is more than one (for a single host).  Previously, the fixed mapping of crawl components to crawl databases could result in a big difference in database sizes.

- Coming from FAST for SharePoint 2010, there is a single Search SSA that handles both content and people crawls.  There is no longer a need for a FAST Content SSA to crawl documents and a FAST Query SSA to crawl People data.

- Crawl Reports.  Better clarity from a troubleshooting perspective.

Protocol Handlers:

A protocol handler is the component used for each type of crawl target.  Here are the target types supported by the SharePoint 2013 crawler:

•  HTTP Protocol Handler:  accessing websites, public Exchange folders and SP sites. (http://)
•  File Protocol Handler: accessing file shares (file://)
•  BCS Protocol Handler: accessing Business Connectivity Services  – (bdc://)
•  STS3 Protocol Handler:  accessing SharePoint Server 2007 sites.
•  STS4 Protocol Handler: accessing SharePoint Server 2010 and 2013 sites.
•  SPS3 Protocol Handler: accessing people profiles in SharePoint 2007 and 2010.

Note that only the STS4 Protocol Handler will crawl SP sites as true SharePoint sites.  If using the HTTP protocol handler, SharePoint sites will still be crawled, but only as regular web sites.

 

Crawl Modes:

  • Full Crawl Mode – Discover and Crawl every possible document on the specific content source.
  • Incremental Crawl Mode – Discover and Crawl only documents that have changed since the last full or incremental crawl.

Both are defined on a per-content-source basis and are sequential and dedicated.  This means that they cannot run in parallel and that they process changes from the Content Source ‘change log’ in a top-down fashion.  This presents the following challenge.  Let’s say this is what we expect from an Incremental crawl in terms of the changes processed and the amount of time it should take:

[Figure: Incremental_crawl_expected]

However, there is a tendency for occasional “deep change” spikes (say, a wide security update) which alter this timeline and result in incremental crawls taking longer than expected.  Since these incremental crawls are sequential, the subsequent crawl cannot start until the previous crawl has completed, leading to missed crawl schedules set by the administrator.  The figure below shows the impact:

[Figure: Incremental_crawl_actual]

What is the best way for a search administrator to deal with this?  Enter the new Continuous Crawl mode:

  • Continuous Crawl Mode – Enables a continuous crawl of a content source.  It eliminates the need to define crawl schedules and automatically kicks off crawls as needed to process the latest changes and ensure index freshness.  Note that Continuous mode only works for SharePoint-type content sources.  Below is a figure that shows how using Continuous crawl mode, with its parallel sessions, keeps the index fresh even with unexpected deep content changes:

[Figure: Continuous_crawl]

There are a couple of things to keep in mind here regarding Continuous crawls:

- Each SSA will have only one Continuous Crawl running.

- Crawl sessions are automatically spun up every 15 minutes (the interval can be changed with PowerShell, as sketched below).

- You cannot pause or resume a Continuous crawl; it can only be stopped.
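Here is a hedged sketch of how to enable a continuous crawl and change the interval from PowerShell. The content source name is a placeholder, and while I believe -EnableContinuousCrawls and the ContinuousCrawlInterval SSA property are the right knobs in SP2013, double-check against current documentation before relying on this:

```powershell
# Enable continuous crawls on a SharePoint content source (name is a placeholder).
$ssa = Get-SPEnterpriseSearchServiceApplication
Set-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
    -Identity "Local SharePoint sites" -EnableContinuousCrawls $true

# Change the interval between continuous crawl sessions (minutes); 15 is the default.
$ssa.SetProperty("ContinuousCrawlInterval", 10)
```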

Scaling/Performance/Load:

Some notes here:

- Add crawl components for both fault tolerance and potentially better throughput (depending on the use case).  The number of crawl components figures into the calculation of how many sessions each crawl component will open with a Content Processing Component.

- Continuous crawls increase the load on the crawler and on crawl targets.  For each large content source for which you enable continuous crawls, it is recommended that you configure one or more front-end web servers as dedicated targets for crawling. For more information, take a look at http://technet.microsoft.com/en-us/library/dd335962(v=office.14).aspx

- There is a global setting that controls how many worker threads each crawl component will use to target a host.  The default setting is High, which is a change from SharePoint 2010 Search where this setting was Partially Reduced.  The reason for the change is that the crawler is now far less resource-intensive than in the past, since much of the functionality has moved to the Content Processing Component.  The Microsoft support team recommends adjusting Crawler Impact Rules rather than this setting, mainly because Crawler Impact Rules are host-based and not global.

  1. Reduced = 1 thread per CPU
  2. Partially reduced = number of CPUs + 4, but threads set to ‘low priority’, meaning another thread can make them wait for CPU time.
  3. High = number of CPUs + 4, with normal thread priority.  This is the default setting.

 

- Crawler Impact Rules:  These are not global; they are host-based.  For a given host you can either request a different number of simultaneous documents, or switch to requesting 1 document/thread at a time with a wait time of a specified number of seconds.

Choosing to make one request at a time while waiting for a specified time will most likely result in a pretty slow crawl.

[Figure: Impact_Rules]
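The rules shown above are normally set in Central Administration, but there is also a cmdlet for it. The sketch below is from memory (I believe the Behavior values are “SimultaneousRequests” and “DelayBetweenRequests”), and the host name and hit rate are examples only:

```powershell
# Sketch: limit the crawler to 8 simultaneous requests against a specific host.
$searchService = Get-SPEnterpriseSearchService
New-SPEnterpriseSearchSiteHitRule -SearchService $searchService `
    -Site "contoso.com" -Behavior "SimultaneousRequests" -HitRate 8
```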

We will tackle other search components in future posts, hopefully providing a very clear view of how all these components interact with each other.


Search Architecture with SharePoint 2013

I’d like to revisit the topics that Leo has so well described in his two previous posts, “Search 101” and “Search Architecture in SharePoint 2010”, but discuss them in the context of SharePoint 2013 Search.  This post will address the general architecture of SharePoint 2013 Search, describe all the components involved and briefly touch upon the biggest changes when coming from the FAST for SharePoint 2010 “world”.  Future posts will go deeper into each search component and provide both an overview and troubleshooting information.

  • Search 101: general concepts of search, including crawling, processing, indexing and searching (Leo’s initial post)
  • Search Architecture in SharePoint 2013: the overall architecture of search-related components in SharePoint 2013  (this post)
  • Planning and Scale (future post)
  • Installation / Deployment (future post)
  • Crawling
  • Processing (future post)
  • Indexing (future post)
  • Searching (future post)

Search 101

As Leo described in his previous post, if you are a complete ‘newbie’ as to how a search engine should work, the very basic tasks it should perform are:

  • Crawling: acquire content from wherever it may be located (web sites, intranet, file shares, email, databases, internal systems, etc.)
  • Processing: prepare this content to make it more “searchable”. Think of a Word document, where you will want to extract the text contained in the document, or an image that has some text that you want people to be able to search, or even a web page where you want to extract the title, the body and maybe some of its HTML metadata tags.
  • Indexing: this is the magic sauce of search and what makes it different from just storing the content in a database and searching it using SQL statements. The content in a search engine is stored in a way optimized for later retrieval. We typically call this optimized version of the content the search index.
  • Searching: the best-known part of search engines. You pass one or more query terms and the search engine returns results based on what is available in its search index.

Armed with this knowledge, let’s take a look at the SharePoint 2013 Search architecture, and we can immediately see that the main components do just that:

Search Architecture in SharePoint 2013

[Figure: SP2013_Search_Architecture]

- Crawling: SharePoint Crawler, via the SharePoint Search SSA

- Content Processing: Content Processing Component (CPC)

- Indexing: Index Component

- Searching: Query Processing Component (QPC)

You’ll notice that there is one more component that we didn’t describe in our ‘basics’, but it’s quite an important one.  The Analytics Processing Component is extremely important to SharePoint 2013 Search: it performs both Usage and Search Analytics, learning from usage by processing events such as ‘views’ and ‘clicks’.  It then enriches the index by updating index items, which impacts relevancy calculations based on the processed data, and provides valuable information in such forms as Recommendations and Usage reports.

- Analytics: Analytics Processing Component (APC)
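If you want to see how these roles are actually laid out in your own farm, the active search topology lists every component and the server it runs on:

```powershell
# List all search components (crawl, content processing, index, query, analytics, admin)
# in the active topology and where they run.
$ssa      = Get-SPEnterpriseSearchServiceApplication
$topology = Get-SPEnterpriseSearchTopology -SearchApplication $ssa -Active
Get-SPEnterpriseSearchComponent -SearchTopology $topology |
    Select-Object Name, ServerName
```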

Let’s take a brief look at each sub-system and its architecture:

Crawling

Simply put, the SharePoint 2013 crawler grabs content from various repositories, runs it through various crawl rules and sends it off to the Content Processing Components for further processing.  You can think of it as the initial step of your feeding chain, with the search index being the final destination.

Crawling can be scaled out using multiple crawl components and databases.  The new Continuous crawl mode helps keep the index fresh, while the architecture has been simplified compared to FAST for SharePoint: a single SharePoint Search Service Application handles both crawling and querying.

A Continuous Crawl can have multiple continuous crawl sessions running in parallel.  This capability enables the crawler to keep the search index fresher – for example, if a preceding Continuous Crawl session is busy processing a deep security change, a subsequent session can process content updates.  Unlike with Incremental crawls, there is no need to wait for completion before new changes can be picked up; these crawl sessions are spun up every 15 minutes and crawl the SharePoint “change logs”.

[Figure: SP2013_Crawl_Component]

  • Invokes Connectors/Protocol Handlers to content sources to retrieve data
  • Crawling is done via a single SharePoint Search SSA
  • Crawl Database is used to store information about crawled items and to track crawl history
  • Crawl modes:  Incremental, Full and Continuous

What’s new:

  • Incremental, Full and Continuous crawl modes
  • No need for Content SSA and Query SSA:  a single Search SSA
  • FAST Web Crawler no longer exists
  • Improved Crawl Reporting/Analytics

Content Processing

In SharePoint Search 2010, there was a single role involved in the feeding chain: the Crawler.  In FAST for SharePoint 2010, the feeding chain consisted of 3 additional components (other than the crawler): Content Distributor, Document Processor and Indexing Dispatcher.  With SharePoint 2013 Search, the Content Processing Component combines all three.

A simple way to describe the Content Processing Component is that it takes the content produced by the Crawler, does some analysis/processing on the content to prepare it for indexing and sends it off to the Index Component.  It takes crawled properties as input from the Crawler and produces output in the form of Managed Properties for the Indexer.

The Content Processing Component uses Flows and Operators to process the content.  If you’re coming from the FAST “world”, think of Flows as Pipelines and Operators as Stages.  Flows define how to process content, queries and results, and each flow processes one item at a time.  Flows consist of operators and connections organized as graphs.  This is really where all the “magic” happens: things like language detection, word breaking, security descriptors, content enrichment (web service callout), entity and metadata extraction, deep link extraction and so on.

 

[Figure: SP2013_ContentProcessingFlows]

The CPC comes with pre-defined flows and operators that currently cannot be changed in a supported way.  If you search hard enough, you will find blogs that describe how to customize flows and operators in an unsupported fashion.  The flow has branches that handle different operations, like inserts, deletes and partial updates.  Notice that security descriptors are now updated in a separate flow, which should make the dreaded “security-only” crawl perform better than in previous versions.

As I’ve mentioned, the CPC has an internal mechanism to load-balance items coming from the Crawler between the available flows (analogous to the old FAST Content Distributor).  It also has a mechanism at the very end of the flow to load-balance indexing across the available Index Components (analogous to the old FAST Indexing Dispatcher).  We will revisit this topic in more detail in subsequent posts.

  • Stateless node
  • Analyzes content for indexing
  • Enriches content as needed via Content Enrichment Web Service (web service callout)
  • Schema mapping.  Produces managed properties from crawled properties
  • Stores links and anchors in Link database(analytics)

What’s new:

  • Web Service callout only works on managed properties and not on crawled properties, as was done with Pipeline Extensibility in FAST for SharePoint 2010.
  • Flows have different branches that can handle operations like deletes or partial updates on security descriptors separately from main content, improving performance.
  • Content Parsing is now handled by Parsers and Format Handlers(will be described in later posts)

Note:  If Content Enrichment Web Service does not meet your current needs and you need more of an ESP-style functionality when it comes to pipeline customization, talk to Microsoft Consulting Services or Microsoft Premier Field Engineers regarding CEWS Pipeline Toolkit.

http://social.technet.microsoft.com/wiki/contents/articles/19376.sharepoint-cews-pipeline-toolkit.aspx

Indexing

The job of the indexer is to receive all processed content from the Content Processing Component, eventually persist it to disk (store it) and have it ready to be searchable via the Query Processing Component.  It’s the “heart” of your search engine; this is where your crawled content lives.  Your index resides on something called an Index Partition.  You may have multiple Index Partitions, with each one containing a unique subset of the index.  All of your partitions taken together make up your entire search index.  Each partition may have one or more Replicas, which contain an exact copy of the index for that partition.  There will always be at least one replica, meaning that each index partition also has a primary replica.  So when coming from FAST, think “partitions and replicas” instead of “columns and rows”.

Each index replica is an Index Component.  When we provision an Index Component, we associate it with an index partition.

Scaling:

- To handle increased query load or add fault tolerance: add more index replicas

- To handle increased content volume: add more index partitions

 

[Figure: IndexPartitions]
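As a sketch of what scaling out looks like in practice, here is roughly how you would add an index replica (a second index component for partition 0) by cloning the active topology. The server name and partition number are placeholders, and the target server’s search service instance must already be started:

```powershell
# Add an index replica for partition 0 on a second server by cloning the active topology.
$ssa    = Get-SPEnterpriseSearchServiceApplication
$active = Get-SPEnterpriseSearchTopology -SearchApplication $ssa -Active
$clone  = New-SPEnterpriseSearchTopology -SearchApplication $ssa -Clone -SearchTopology $active
$ssi    = Get-SPEnterpriseSearchServiceInstance -Identity "Server2"   # placeholder server name
New-SPEnterpriseSearchIndexComponent -SearchTopology $clone -SearchServiceInstance $ssi -IndexPartition 0
Set-SPEnterpriseSearchTopology -Identity $clone
```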

There are a couple of very important changes to internals of the indexer that I’d like to touch upon:

- There is NO MORE FIXML.  Just a reminder: FIXML stood for FAST Index XML and contained an XML representation of each document that the indexer used to create the binary index.  FIXML was stored locally on disk and was frequently used to re-create the binary index without having to re-feed from scratch.  There is now a new mechanism called a ‘partial update’, which replaces the need for FIXML.

- Instant Indexing: newly indexed items can be served in query results directly from memory, much more quickly, instead of waiting for them to be persisted to disk.

- Journaling:  Think of an RDBMS “transaction log”: a sequential history of all operations to each index partition and its replicas.  Together with checkpointing, it allows for the “instant indexing” feature above, as well as ACID properties (atomicity, consistency, isolation and durability).  For the end-user, this ensures that a document or a set of documents treated as a group is either fully indexed or not indexed at all.  We will discuss this in much more detail in subsequent posts.

- Update Groups/Partial Update Mechanism:  All document properties (managed properties) are split into Update Groups.  In the past with FAST, “partial updates” were quite expensive, as the indexer would have to read the whole FIXML document, find the element, update the file, save it and re-index the FIXML document.  Now, properties in one update group can be updated at a low cost without affecting the rest of the index.

There is also an updated mechanism for merging index parts, which you can loosely compare to how FAST handled and merged what were then called “index partitions”.

[Figure: Indexing_Merging]

Internally, the index is built up of several smaller inverted index parts, each one being an independent portion of the index.  From time to time, based on specific criteria, they need to be merged in order to free up the resources associated with maintaining many small indices.  Typically, smaller parts are merged more often, while larger ones are merged less frequently.

Keep in mind that Level/Part 0 is the in-memory section that directly allows for the “Instant Indexing” feature.   When documents come into the indexing subsystem, they come into 2 places at the same time:

  1. The Journal
  2. The Checkpoint section(Level 0 in the figure above)

The Checkpoint section contains documents that are in memory and not yet persisted to disk, yet are already searchable.  If search crashes, the in-memory portion will be lost but will be restored/replayed from the Journal on the next start-up.

Query Processing

The Query Processing Component is tasked with taking a user query that comes from a search front-end and submitting it to the Index Component.  It routes incoming queries to index replicas, one from each index partition.  Results are returned as a result set to the QPC, which in turn processes the result set prior to sending it back to the search front-end.  It also contains a set of flows and operators, similar to the Content Processing Component.  If coming from FAST, you can compare it to the QRServer with its query processing pipelines and stages.

[Figure: QueryProcessing]

  • Stateless node
  • Query-side flows/operators
  • Query federation
  • Query Transformation
  • Load-balancing and health checking
  • Configurations stored in Admin database

What’s new:

  • Result Sources/Query Rules

Analytics

Analytics Processing Component is a powerful component that allows for features such as Recommendations(‘if you like this you might like that’), anchor text/link analysis and much more.  It extracts both search analytics and usage analytics, analyzes all the data and returns the data in various forms, such as via reporting or by sending it to Content Processing Component to be included in the search index for improved relevance calculations and recall.

[Figure: AnalyticsComponent]

Let’s quickly define both search analytics and usage analytics:

- Search analytics is information such as links, anchor text, information related to people, metadata, click distance, social distance, etc. from the items that the APC receives via the Content Processing Component; this information is stored in the Link database.

- Usage analytics is information such as the number of times an item is viewed from the front-end event store and is stored in the Analytics Reporting database.

  • Learns by usage
  • Search Analytics
  • Usage Analytics
  • Enriches index for better relevance calculations and recall
  • Based on Map/Reduce framework – workers execute needed tasks.

What’s new:

  • Coming from FAST ESP/FAST for SharePoint, it combines many separate features and components such as FAST Recommendations, WebAnalyzer, Click-through analysis into a single component…and adds more.

 

 

I hope to do some deep dives into each component in future posts; feel free to drop me a note with any questions that come up.


Sharepoint 2013 Search Ranking and Relevancy Part 1: Let’s compare to FS14

I’m very happy to do some “guest” blogging for my good friend Leo and continue diving into various search-related topics.  In this and upcoming posts, I’d like to jump right into something that interests me very much, and that is taking a look at what makes some documents more relevant than others as well as what factors influence rank score calculations.

Since SharePoint 2013 is already out, I’d like to touch upon a question that comes up often when someone is considering moving from FAST ESP or FAST for SharePoint 2010 to SharePoint 2013:  “So how are rank scores calculated in SharePoint 2013 Search as opposed to previous FAST versions?”

In upcoming posts, I will go more into “internals” of the current Sharepoint 2013 ranking model as well as introduce the basics of relevancy calculation concepts that apply across many search engines and are not necessarily specific to FAST or Sharepoint Search.

There are some excellent blog posts out there that go in-depth on how Sharepoint 2013 Search rank models work, including the ones below from Alexey Kozhemiakin and Mikael Svenson.

http://powersearching.wordpress.com/2013/03/29/how-sharepoint-2013-ranking-models-work/

http://techmikael.blogspot.com/2013/04/rank-models-in-2013main-differences.html

 

To avoid being repetitive, what I’ve tried to do is create an easy-to-read comparison chart of the factors that influence rank calculations in FS14 versus SharePoint 2013 Search.  I may update this chart in the future to include FAST ESP, although the main factors involved in ESP and FS14 are fairly similar to each other, as opposed to SharePoint 2013 Search (which is more closely related to the SharePoint 2010 Search model).

One of the main differences is that SharePoint 2013 Search uses a two-stage process for rank calculations: a linear ranking model as the 1st stage and a Neural Network as the 2nd stage.  The 1st stage is “light” and we can afford to apply it to all documents in a result set; a specific set of rank features is part of this stage and applied to all documents.  The top 1000 documents (candidates) based on the Stage 1 rank are then input to Stage 2.  This stage is more performance-intensive and re-computes the rank score for the documents used as input, which is why it is only applied to a limited set.  It consists of all the same rank features as Stage 1 plus 4 additional proximity features.

For my comparison below, I mainly used a model called “Search Ranking Model with Two Linear Stages”, which was introduced as of the August 2013 CU.  This model is the recommended template when creating custom rank models, as it gives you proximity without a Neural Network.
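To see which rank models (including this one) exist in your farm, the ranking model cmdlet that has been around since SP2010 should still do the trick — treat this as a sketch and verify the cmdlet and parameters on your build:

```powershell
# List the ranking models registered in the Search Service Application.
$ssa = Get-SPEnterpriseSearchServiceApplication
Get-SPEnterpriseSearchRankingModel -SearchApplication $ssa
```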

 

Rank Factor comparison: FS14 vs. SP2013 Search

• Rank Models
  - FS14: 1 OOTB rank model
  - SP2013: 16 rank models

• Freshness
  - FS14: Available OOTB and customizable
  - SP2013: N/A OOTB; possible to configure

• Dynamic Ranking (field weighting / managed properties)
  - FS14: Context Boost – Title, DocSubject, Keywords, DocKeywords, urlkeywords, Description, Author, CreatedBy, ModifiedBy, MetadataAuthor, WorkEmail, Body, crawledpropertiescontent
  - SP2013: Document managed properties + usage/social data – Title, QLogClickedText, SocialTag, Filename, Author, AnchorText, body

• FileType
  - FS14: Field-Boost weight / Managed Property Boost (OOTB -4000 points). Format: Unknown Format, XML, XLS. FileExtension: CVS, TXT, MSG, OFT, ZIP, VSD, RTF. Also IsEmptyList, IsListItem
  - SP2013: FileType rank feature – PPT, SharePoint site, DOC, HTML, ListItems, Image, Message, XLS, TXT

• Language
  - FS14: N/A
  - SP2013: Dynamic Rank (query-based); LCID, i.e. locale ID, is used

• Social Distance
  - FS14: N/A
  - SP2013: Static Rank (colleague relationship to the person issuing the query). Bucket 0 – no colleague relationship; bucket 1 – first-level (direct) relationship; bucket 2 – second-level (indirect) relationship

• Static Rank Boost (Query-Independent)
  - FS14: Quality Weight components – hwboost, docrank, siterank, urldepthrank; Authority Weight – Partial and Complete
  - SP2013: Now part of the Analytics Processing Component. Static rank features calculated with Search and Usage Analytics – QLogClicks, QLogSkips, QLogLastClicks, EventRate

• Proximity
  - FS14: Enabled by default
  - SP2013: MinSpan (Neural Network 2nd stage; parameters for proximity minimal span)

• Anchortext (Query-Dependent)
  - FS14: Extnumocc – part of Dynamic Rank calculations; query-time hits in anchor text
  - SP2013: AnchortextComplete

• URLDepth (Query-Dependent)
  - FS14: N/A – in FS14 this was a static rank feature
  - SP2013: UrlDepth – depth of the document URL (number of slashes)

• Click-Through Weight (Query-Dependent)
  - FS14: Query Authority weight – click-through weight, dynamic rank
  - SP2013: N/A as a query-dependent feature – now part of the static rank features used in the Analytics Processing Component (QLogClicks, etc.)

Rank tuning comparison: FS14 vs. SP2013 Search

• GUI-based applications; ease of tuning rank calculations and user-friendliness
  - FS14: N/A. Rank calculations and scores can be seen either via the ranklog output or via Codeplex tools such as the FS4SP Query Logger. However, there isn’t a user-friendly tool to help you make changes and push them live, or preferably preview them offline. A separate ‘spreladmin’ tool is needed for click analysis.
  - SP2013: Rank Tuning App (coming soon). A GUI-based, user-friendly way to tune/customize ranking and impact relevancy. Includes a “preview”, i.e. offline, mode.

• Rank logging availability
  - FS14: Server-side – the ranklog is available via QRServer output, but only to admins with local access to QRServer port 13280. Client-side – N/A.
  - SP2013: Server-side – Rank Tuning App / ULS logs. Client-side – ExplainRank template available to clients: http://powersearching.wordpress.com/2013/01/25/explain-rank-in-sharepoint-2013-search/
