This is a question that I have received quite a few times, and this time I thought I would get some screenshots and detail the process a little bit in here so that other folks can take advantage of this info as well.
First of all, what is the “Remove Duplicate Results” feature? It is a feature that tells the search engine to collapse results that are perceived as duplicates (such as the same document located in different paths), so that only one instance of the document is returned in the search results (instead of showing the multiple copies to the end-user).
The setting that enables this feature (which is on by default) in a Search Center is available at the Search Core Results Web Part, under the section Result Query Options, as shown below:
Once this option is enabled, any duplicate items will be collapsed in the search results, as you can see in this example:
And if you want to see all the duplicates (in order to delete one of them in the source repository, for example), all you have to do is click in the “Duplicates (2)” link highlighted above. This will execute another query, filtering results to display only the duplicates of the item you selected:
Now let’s investigate how this feature works. To do this, we will go in reverse order (from search to processing) to understand all the pieces involved.
The first clue is that this is enabled/disabled during search time, so there must be some parameter being sent by the Search Center to FAST Search for SharePoint to inform that duplicate removal is enabled. Taking a look at one of the full search requests in the querylogs (%FASTSearch%\var\log\querylogs\) we can confirm this:
Note: the querylog shown above has some query parameters removed so we can focus on the items that matter to duplicate removal.
As you can see, there are two parameters sent to FAST Search for SharePoint indicating which property should be used for collapsing (batvdocumentsignature) and how many items should be kept after collapsing is performed (1). And if we want more information about these options, the MSDN documentation explains these two parameters used for duplicate removal (the names differ because the querylog shows the internal query parameter names received by FAST Search for SharePoint):
onproperty – Specifies the name of a non-default managed property to use as the basis for duplicate removal. The default value is the DocumentSignature managed property. The managed property must be of type Integer. By using a managed property that represents a grouping of items, you can use this feature for field collapsing.
keepcount – Specifies the number of items to keep for each set of duplicates. The default value is 1. It can be used for result collapsing use cases. If TrimDuplicates is based on a managed property that can be used as a group identifier (for example, a site ID), you can control how many results are returned for each group. The items returned are the items with the highest dynamic rank within each group.
The last parameter that can be used with the Duplicate Removal feature is also described in the MSDN article, explaining what happened behind the scenes when we clicked the “Duplicates (2)” link to display all the duplicates for that item:
includeid – Specifies the value associated with a collapse group, typically used when a user clicks the Duplicates (n) link of an item with duplicates. This value corresponds to the value of the fcoid managed property that is returned in query results.
Ok, so far we know that duplicate removal is enabled by default and is applied by collapsing results that have the same value for the managed property DocumentSignature. Let’s have a look at the settings for this managed property then:
As you can see, the type of this managed property is Integer (which the MSDN article defined as a requirement) and it is also configured both as Sortable and Queryable. The peculiar thing is that when looking at the crawled properties mapped to this managed property we get nothing as a result, which indicates that the value for this property is most likely being computed by FAST Search for SharePoint during content processing.
So let’s take a look at some lower-level configuration files to track this down. We start with the mother file of all configurations related to the content processing pipeline: %FASTSearch%\etc\PipelineConfig.xml. This is a file that can’t be modified (since it is not included here in the list of configuration files that can be modified), but nothing prevents us from just looking at it. After opening this configuration file and searching for “documentsignature”, you will find the definition for the stage responsible for assigning the value to this property:
<processor name="DocumentSignature" type="general" hidden="0">
<load module="processors.DuplicateId" class="DuplicateId"/>
<param name="Output" value="documentsignature" type="string"/>
<param name="Input" value="title:0:required body:1024:required documentsignaturecontribution" type="string"/>
The parameters that matter most to us are highlighted above:
- Input: which properties will be used to calculate the document signature –> title and the first 1024 bytes of body (as well as a property called documentsignaturecontribution that will also be used if it has any value)
- Output: our dear documentsignature property
And with this we get to the bottom of this duplicate removal feature, which means is a good time to recap everything we found out:
- During content processing, for every item being processed, FAST Search for SharePoint will obtain the value of title and the first 1024 bytes of body for this item, and use it to compute a numerical checksum that will be used as a document signature. This checksum is stored in the property documentsignature for every item processed.
- During query time, whenever “Remove Duplicate Results” is enabled, the Search Center tells FAST Search for SharePoint to collapse results using the documentsignature property, effectively eliminating any duplicates for items that have the same title+first-1024-bytes-of-body.
- When a user clicks on the “Duplicates (n)” link next to an item that has duplicates, another query is submitted to FAST Search for SharePoint, passing as an additional parameter the value of the fcoid managed property for the item selected, which will be used to return all items that contain the same checksum (aka “the duplicates”).
A few extra questions that may appear in your head after you read this:
- Is it possible to collapse results by anything other than documentsignature and use this feature for other types of collapsing (collapse by site, collapse by category, etc.)?
- Answer: Yes, it is absolutely possible. All you will need is an Integer, Sortable, Queryable managed property containing the values you want to collapse by + a custom web part (or an extended Search Core Results Web Part) where you request this managed property to be used for collapsing and how many items should be preserved (as explained in this MSDN article linked before)
- Can two items that are not identical be considered as duplicates?
- Answer: Yep. As we saw above, only the first 1024 bytes of body are used for calculating the checksum, which means that any other differences these documents may have beyond the first 1024 bytes will not be considered for the purposes of duplicate removal. (Note: roughly speaking, the body property will have just the text of the document, without any of the formatting)
- Can I change how the default document signature is computed?
- Answer: Yes, read the update to this post below.
No, since this is done by one of FAST Search for SharePoint out-of-the-box stages. But, what you can do is calculate your own checksum using the Pipeline Extensibility and then use your custom property for duplicate removal.
- When you have duplicates and only one item is shown in the results, how does FS4SP decides which item to show?
- Answer: The item shown is the item with the highest rank score among all the duplicates.
That’s it for today. If any other questions related to duplicate removal surface after this post is published, I will add them to the list above.
UPDATE (2011-12-09): Yes, you can influence how document signature is computed!
After I posted this article yesterday I was still thinking about that documentsignaturecontribution property mentioned, as I had a feeling that there was a way to use it to influence how the document signature is compute. Well, today I found some time to test it and yes, it works! Here is how to do it.
What you have to do is create a new managed property with exactly the name documentsignaturecontribution and then map to this managed property any values that you also want to use for the checksum computation (as with other managed properties, to assign a value to this property you must map a crawled property to it).
You need a managed property because the DocumentSignature stage is executed after the mapping from crawled properties to managed properties, so FAST Search for SharePoint is looking for a managed property named documentsignaturecontribution to use as part of the checksum computation. When you create this managed property and assign it some value, FAST Search for SharePoint simply uses this, along with title and the first 1024 bytes of body, to calculate the checksum.
I followed the great idea from Mikael Svenson to create two text files filled with the same content just to force them to be perceived as duplicates by the system. The key here was to create these two files with exactly the same name and content, but put them in different folders. This way I could guarantee that both items had the same title and body, which would result in them having the same checksum. This was confirmed after I crawled the folders with these items:
Both items had the same checksum, which could be checked by looking at the property fcoid that was returned with the results:
My next step was to create the managed property documentsignaturecontribution and map to it some value that would allow me to distinguish between the two files. In my case, that value was the path to the items, which were located in different folders. So, after creating my managed property documentsignaturecontribution of type Text, I mapped to it the same crawled properties that are mapped to the managed property path, just to make sure I would get the same value as that property:
With this done, I just had to perform another full crawl to force SharePoint to process both of my text files again, and confirm that they were not perceived as duplicates anymore (since I was also using their path to compute the checksum):
Another look into the fcoid property confirmed that both items did have different checksums now:
- file://demo2010a/c$/testleo/test.txt – fcoid: 334483385708934799
- file://demo2010a/c$/testleo/subdir/test.txt – fcoid: 452732803969334619
So what we learned with this is that you can influence how the document signature is computed by creating this managed property documentsignaturecontribution and mapping to it any value you want to be part of the checksum computation. And if you want to use the full body of the document to compute the checksum, you could accomplish this through the following steps:
- add to the Pipeline Extensibility a custom process that uses the body property as input and return a checksum crawled property based on the full contents of the body as the output
- map this checksum crawled property returned in the previous step to the managed property documentsignaturecontribution
- re-crawl your content and celebrate when it works