How “Remove Duplicate Results” works in FAST Search for SharePoint

This is a question that I have received quite a few times, and this time I thought I would get some screenshots and detail the process a little bit in here so that other folks can take advantage of this info as well.

First of all, what is the “Remove Duplicate Results” feature? It is a feature that tells the search engine to collapse results that are perceived as duplicates (such as the same document located in different paths), so that only one instance of the document is returned in the search results (instead of showing the multiple copies to the end-user).

The setting that enables this feature (which is on by default) in a Search Center is available at the Search Core Results Web Part, under the section Result Query Options, as shown below:

Search Core Results Web Part - Remove Duplicate Results

Once this option is enabled, any duplicate items will be collapsed in the search results, as you can see in this example:

FAST Search for SharePoint Duplicate Results Removal

And if you want to see all the duplicates (in order to delete one of them in the source repository, for example), all you have to do is click the “Duplicates (2)” link highlighted above. This will execute another query, filtering results to display only the duplicates of the item you selected:

FAST Search for SharePoint Duplicate Removal Example

 

Now let’s investigate how this feature works. To do this, we will go in reverse order (from search to processing) to understand all the pieces involved.

The first clue is that this feature is enabled or disabled at search time, so the Search Center must be sending some parameter to FAST Search for SharePoint to indicate that duplicate removal is enabled. Looking at one of the full search requests in the querylogs (%FASTSearch%\var\log\querylogs\), we can confirm this:

/cgi-bin/search?hits=10&resubmitflags=1&rpf_navigation:hits=50&query=sharepoint&spell=suggest&collapsenum=1&qtf_lemmatize=True…&collapseon=batvdocumentsignature&type=kwall…

Note: the querylog shown above has some query parameters removed so we can focus on the items that matter to duplicate removal.
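If you want to check your own querylogs for these two parameters, a minimal Python sketch like the one below could help. It only uses the parameter names visible in the request above (collapseon and collapsenum). The helper name collapse_settings is made up for illustration, the sample request is a trimmed copy of the log entry, and falling back to 1 when collapsenum is missing is an assumption.

```python
from urllib.parse import urlsplit, parse_qs

# Trimmed copy of the querylog entry above, keeping only a few parameters.
LOG_ENTRY = (
    "/cgi-bin/search?hits=10&query=sharepoint"
    "&collapsenum=1&collapseon=batvdocumentsignature"
)


def collapse_settings(request_line):
    """Return (collapseon, collapsenum) for a logged request, or None if absent."""
    params = parse_qs(urlsplit(request_line).query)
    if "collapseon" not in params:
        return None  # no collapsing was requested for this query
    prop = params["collapseon"][0]
    # Assumption: treat a missing collapsenum as 1.
    keep = int(params.get("collapsenum", ["1"])[0])
    return prop, keep


print(collapse_settings(LOG_ENTRY))  # ('batvdocumentsignature', 1)
```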

As you can see, there are two parameters sent to FAST Search for SharePoint indicating which property should be used for collapsing (batvdocumentsignature) and how many items should be kept after collapsing is performed (1). And if we want more information about these options, the MSDN documentation explains these two parameters used for duplicate removal (the names differ because the querylog shows the internal query parameter names received by FAST Search for SharePoint):

onproperty – Specifies the name of a non-default managed property to use as the basis for duplicate removal. The default value is the DocumentSignature managed property. The managed property must be of type Integer. By using a managed property that represents a grouping of items, you can use this feature for field collapsing.

keepcount – Specifies the number of items to keep for each set of duplicates. The default value is 1. It can be used for result collapsing use cases. If TrimDuplicates is based on a managed property that can be used as a group identifier (for example, a site ID), you can control how many results are returned for each group. The items returned are the items with the highest dynamic rank within each group.

The last parameter that can be used with the Duplicate Removal feature is also described in the MSDN article, explaining what happened behind the scenes when we clicked the “Duplicates (2)” link to display all the duplicates for that item:

includeid – Specifies the value associated with a collapse group, typically used when a user clicks the Duplicates (n) link of an item with duplicates. This value corresponds to the value of the fcoid managed property that is returned in query results.
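To make the three parameters more concrete, here is a small Python sketch of the behavior they describe. It is not how FAST Search for SharePoint implements collapsing internally; it only mimics the documented semantics on an in-memory list. The function names, the rank field, and the sample values (111, 222) are invented for illustration.

```python
from collections import defaultdict

# Made-up results: two items share a signature, one is unique.
RESULTS = [
    {"url": "file://a/test.txt", "documentsignature": 111, "rank": 0.9},
    {"url": "file://b/test.txt", "documentsignature": 111, "rank": 0.7},
    {"url": "file://c/other.txt", "documentsignature": 222, "rank": 0.8},
]


def collapse(results, on_property, keep_count=1):
    """Keep the keep_count highest-ranked results per value of on_property."""
    groups = defaultdict(list)
    for item in results:
        groups[item[on_property]].append(item)
    kept = []
    for members in groups.values():
        members.sort(key=lambda r: r["rank"], reverse=True)
        kept.extend(members[:keep_count])
    # Present the surviving items in rank order again.
    kept.sort(key=lambda r: r["rank"], reverse=True)
    return kept


def duplicates_of(results, on_property, include_id):
    """Mimic the Duplicates (n) link: return every item in one collapse group."""
    return [r for r in results if r[on_property] == include_id]


for item in collapse(RESULTS, "documentsignature"):
    print(item["url"])
print(len(duplicates_of(RESULTS, "documentsignature", 111)))
```

With keep_count left at 1, only the better-ranked of the two items sharing signature 111 survives, which matches the description of keepcount above.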

Ok, so far we know that duplicate removal is enabled by default and is applied by collapsing results that have the same value for the managed property DocumentSignature. Let’s have a look at the settings for this managed property then:

[Screenshot: settings of the DocumentSignature managed property]

As you can see, the type of this managed property is Integer (which the MSDN article defined as a requirement) and it is also configured both as Sortable and Queryable. The peculiar thing is that when looking at the crawled properties mapped to this managed property we get nothing as a result, which indicates that the value for this property is most likely being computed by FAST Search for SharePoint during content processing.

So let’s take a look at some lower-level configuration files to track this down. We start with the mother file of all configurations related to the content processing pipeline: %FASTSearch%\etc\PipelineConfig.xml. This is a file that can’t be modified (since it is not included in the list of configuration files that can be modified), but nothing prevents us from just looking at it. After opening this configuration file and searching for “documentsignature”, you will find the definition for the stage responsible for assigning the value to this property:

    <processor name="DocumentSignature" type="general" hidden="0">
      <load module="processors.DuplicateId" class="DuplicateId"/>
      <config>
       <param name="Output" value="documentsignature" type="string"/>
       <param name="Input"  value="title:0:required body:1024:required documentsignaturecontribution" type="string"/>
      </config>
    </processor>

The parameters that matter most to us are Input and Output:

  • Input: which properties will be used to calculate the document signature –> title and the first 1024 bytes of body (as well as a property called documentsignaturecontribution that will also be used if it has any value)
  • Output: our dear documentsignature property
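The effect of the 1024-byte limit can be sketched in a few lines of Python. The actual checksum algorithm used by the DuplicateId stage is not documented in this post, so SHA-1 reduced to a 63-bit integer is only a stand-in, and document_signature is a hypothetical function name. The sketch also ignores the optional documentsignaturecontribution input.

```python
import hashlib

BODY_PREFIX_BYTES = 1024  # taken from "body:1024" in the Input parameter


def document_signature(title, body):
    """Stand-in signature: hash of the title plus the first 1024 bytes of body."""
    body_prefix = body.encode("utf-8")[:BODY_PREFIX_BYTES]
    material = title.encode("utf-8") + b"\x00" + body_prefix
    digest = hashlib.sha1(material).digest()
    # Reduce to a non-negative 63-bit integer (assumed range for an Integer property).
    return int.from_bytes(digest[:8], "big") >> 1


# Two documents that share their first 1250 bytes and differ only afterwards.
shared_start = "Standard contract terms. " * 50
doc_a = shared_start + "Appendix A: pricing"
doc_b = shared_start + "Appendix B: delivery"

same = document_signature("Contract", doc_a) == document_signature("Contract", doc_b)
print(same)  # True
```

Because both bodies are identical within the first 1024 bytes, the two documents get the same signature even though their endings differ.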

And with this we get to the bottom of this duplicate removal feature, which means it is a good time to recap everything we found out:

  1. During content processing, for every item being processed, FAST Search for SharePoint will obtain the value of title and the first 1024 bytes of body for this item, and use it to compute a numerical checksum that will be used as a document signature. This checksum is stored in the property documentsignature for every item processed.
  2. During query time, whenever “Remove Duplicate Results” is enabled, the Search Center tells FAST Search for SharePoint to collapse results using the documentsignature property, effectively eliminating any duplicates for items that have the same title+first-1024-bytes-of-body.
  3. When a user clicks on the “Duplicates (n)” link next to an item that has duplicates, another query is submitted to FAST Search for SharePoint, passing as an additional parameter the value of the fcoid managed property for the item selected, which will be used to return all items that contain the same checksum (aka “the duplicates”).

A few extra questions that may appear in your head after you read this:

  • Is it possible to collapse results by anything other than documentsignature and use this feature for other types of collapsing (collapse by site, collapse by category, etc.)?
    • Answer: Yes, it is absolutely possible. All you will need is an Integer, Sortable, Queryable managed property containing the values you want to collapse by + a custom web part (or an extended Search Core Results Web Part) where you request this managed property to be used for collapsing and how many items should be preserved (as explained in this MSDN article linked before)
  • Can two items that are not identical be considered as duplicates?
    • Answer: Yep. As we saw above, only the first 1024 bytes of body are used for calculating the checksum, which means that any other differences these documents may have beyond the first 1024 bytes will not be considered for the purposes of duplicate removal. (Note: roughly speaking, the body property will have just the text of the document, without any of the formatting)
  • Can I change how the default document signature is computed?
    • Answer: Yes, read the update to this post below. (My original answer was: no, since this is done by one of the out-of-the-box FAST Search for SharePoint stages, but you can calculate your own checksum using the Pipeline Extensibility and then use your custom property for duplicate removal.)
  • When you have duplicates and only one item is shown in the results, how does FS4SP decide which item to show?
    • Answer: The item shown is the item with the highest rank score among all the duplicates.

That’s it for today. If any other questions related to duplicate removal surface after this post is published, I will add them to the list above.

UPDATE (2011-12-09): Yes, you can influence how document signature is computed!

After I posted this article yesterday I was still thinking about that documentsignaturecontribution property mentioned, as I had a feeling that there was a way to use it to influence how the document signature is computed. Well, today I found some time to test it and yes, it works! Here is how to do it.

What you have to do is create a new managed property with exactly the name documentsignaturecontribution and then map to this managed property any values that you also want to use for the checksum computation (as with other managed properties, to assign a value to this property you must map a crawled property to it).

You need a managed property because the DocumentSignature stage is executed after the mapping from crawled properties to managed properties, so FAST Search for SharePoint is looking for a managed property named documentsignaturecontribution to use as part of the checksum computation. When you create this managed property and assign it some value, FAST Search for SharePoint simply uses this, along with title and the first 1024 bytes of body, to calculate the checksum.

I followed the great idea from Mikael Svenson to create two text files filled with the same content just to force them to be perceived as duplicates by the system. The key here was to create these two files with exactly the same name and content, but put them in different folders. This way I could guarantee that both items had the same title and body, which would result in them having the same checksum. This was confirmed after I crawled the folders with these items:

FAST Search for SharePoint - Duplicate Removal - Example of duplicates

Both items had the same checksum, which could be checked by looking at the property fcoid that was returned with the results:

fcoid: 102360510986285564

My next step was to create the managed property documentsignaturecontribution and map to it some value that would allow me to distinguish between the two files. In my case, that value was the path to the items, which were located in different folders. So, after creating my managed property documentsignaturecontribution of type Text, I mapped to it the same crawled properties that are mapped to the managed property path, just to make sure I would get the same value as that property:

FAST Search for SharePoint - Duplicate Removal - documentsignaturecontribution managed property

With this done, I just had to perform another full crawl to force SharePoint to process both of my text files again, and confirm that they were not perceived as duplicates anymore (since I was also using their path to compute the checksum):

[Screenshot: search results showing both test files as separate results]

Another look into the fcoid property confirmed that both items did have different checksums now:

  • file://demo2010a/c$/testleo/test.txt – fcoid: 334483385708934799
  • file://demo2010a/c$/testleo/subdir/test.txt – fcoid: 452732803969334619
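The experiment above can be mimicked with a short Python sketch. Again, the real checksum algorithm is not known here, so SHA-1 is just a stand-in and the numbers it produces will not match the fcoid values shown above. The function name, the title, and the body text are illustrative assumptions; only the two paths come from the test.

```python
import hashlib


def document_signature(title, body, contribution=""):
    """Stand-in signature over title, first 1024 bytes of body, and contribution."""
    parts = [
        title.encode("utf-8"),
        body.encode("utf-8")[:1024],
        contribution.encode("utf-8"),
    ]
    digest = hashlib.sha1(b"\x00".join(parts)).digest()
    return int.from_bytes(digest[:8], "big") >> 1


paths = [
    "file://demo2010a/c$/testleo/test.txt",
    "file://demo2010a/c$/testleo/subdir/test.txt",
]
title, body = "test.txt", "same content in both files"  # illustrative values

# Without a contribution, both files collapse to a single signature.
without_path = {document_signature(title, body) for _ in paths}
# With the path mapped to documentsignaturecontribution, they differ.
with_path = {document_signature(title, body, p) for p in paths}

print(len(without_path), len(with_path))  # 1 2
```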

So what we learned with this is that you can influence how the document signature is computed by creating this managed property documentsignaturecontribution and mapping to it any value you want to be part of the checksum computation. And if you want to use the full body of the document to compute the checksum, you could accomplish this through the following steps:

  1. add to the Pipeline Extensibility a custom process that uses the body property as input and returns, as output, a checksum crawled property based on the full contents of the body
  2. map the checksum crawled property returned in the previous step to the managed property documentsignaturecontribution
  3. re-crawl your content and celebrate when it works
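Step 1 could look roughly like the Python sketch below. It assumes that the custom process is called with an input file and an output file, and that both use a Document element containing CrawledProperty elements; please check the Pipeline Extensibility documentation for the exact format before relying on this. The property names, the all-zero property set GUID, and the choice of SHA-1 are placeholders, not values taken from FS4SP.

```python
import hashlib
import sys
import xml.etree.ElementTree as ET

# Hypothetical placeholders: use the property set and names that you declare
# for this stage in your own Pipeline Extensibility configuration.
BODY_PROPERTY_NAME = "body"
CHECKSUM_PROPERTY_SET = "00000000-0000-0000-0000-000000000000"
CHECKSUM_PROPERTY_NAME = "fullbodychecksum"


def process(document):
    """Build an output <Document> holding a checksum of the full body text."""
    body = ""
    for prop in document.iter("CrawledProperty"):
        if prop.get("propertyName") == BODY_PROPERTY_NAME:
            body = prop.text or ""
    # SHA-1 is a stand-in; any stable checksum of the whole body would do.
    checksum = hashlib.sha1(body.encode("utf-8")).hexdigest()
    output = ET.Element("Document")
    out_prop = ET.SubElement(
        output,
        "CrawledProperty",
        propertySet=CHECKSUM_PROPERTY_SET,
        propertyName=CHECKSUM_PROPERTY_NAME,
        varType="31",  # assumed here to denote a text value
    )
    out_prop.text = checksum
    return output


if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumed invocation: <script> <input file> <output file>
    source = ET.parse(sys.argv[1]).getroot()
    ET.ElementTree(process(source)).write(
        sys.argv[2], encoding="utf-8", xml_declaration=True
    )
```

The checksum is emitted as text because the documentsignaturecontribution managed property in my test was created with type Text.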

About leonardocsouza

Mix together a passion for social media, search, recommendations, books, writing, movies, education, knowledge sharing plus a few other things and you get me as result :)

14 Responses to How “Remove Duplicate Results” works in FAST Search for SharePoint

  1. Excellent explanation. I wrote about the 1024 limit before in the MS forums back in July (http://pzl.no/syFN8O). In my opinion they should have done a CRC32 on all the content instead of just the first 1024 characters. I suspect larger organizations with a lot of appendixes, contracts, etc. will see duplicates where there actually aren’t any.

    And what content goes potentially into “documentsignaturecontribution”?

  2. Janette says:

    I really wanted to present this unique blog post, “How
    “Remove Duplicate Results” works in FAST Search for SharePoint | Search Unleashed” with my personal buddies on facebook.
    I personally simply planned to disperse your wonderful posting!
    Many thanks, Sherman

  3. Juergen says:

    Thank you for this very helpful explanation. Do you know if these arguments also apply to the Search Server (which is integrated in SP 2010) or only to FAST Search?
    I tried to manipulate the fcoid like you did, but it didn’t work; I’m still getting duplicates in my search results.

    • Hi Juergen!

      I’m not sure about how all of this works with Search Server, but considering that this collapsing functionality already existed in this form with FAST ESP (which was the basis for FS4SP), I wouldn’t expect Search Server to have it implemented in the same way.

      Best,
      Leo

  4. Daniel Sullivan says:

    Hi,

    I’m interested in making my own managed property and having FAST use that to filter duplicates instead of documentsignature. I have successfully created my crawled property and mapped it to a custom managed property but I’m having trouble understanding how to incorporate that into my search results. You mentioned “a custom web part (or an extended Search Core Results Web Part) where you request this managed property”. Will that have to be entirely custom, as in logic for checking if a document is a duplicate in the xslt or the web part? Or is there a way to simply sub-in my managed property?

    Thanks!

    • Hi Daniel!

      I’m sorry my post isn’t too clear on this. What I meant to indicate is that (at the time I researched this in FS4SP 2010) the only way to change which managed property is used for duplicate removal was to set the appropriate parameter when making requests using either the Query Object Model or the Query Webservice. There was no way I could find at the time to define the duplicate removal managed property in the ootb web parts, which meant that you would have to develop a custom web part (probably extending an existing one) in order to be able to modify this setting to use your own custom managed property.

      Hope that helps!

      Best,
      Leo

  5. Daniel Sullivan says:

    Hi Leo,

    Thanks for answering my previous question, I’ve got another for you. Let’s say I have three urls: mysite.com/one, mysite.com/one/two, and mysite.com/three/one. What I was trying to do by viewing this post was collapse results so that if a user searched for “one” they would see two results: mysite.com/one and mysite.com/three/one. Since mysite.com/one/two is a sub site of mysite.com/one, it wouldn’t get returned even though these sites wouldn’t be duplicates. Is there a way to configure this kind of result collapsing in FAST that you know about?

    Thanks very much for your help!

    • That’s a great question, Daniel.

      From what I understand then, it seems like you would like the collapse to happen based on the URL structure of your site, in a way that the results coming from the same level of the URL structure (mysite.com/one and mysite.com/one/two) would then be collapsed. In this case, if you have both mysite.com/three/one and mysite.com/three/one/four, then the collapse would happen as well.

      If that’s really the case, then you would need to:

      1. compute some hash for up to the first level of the URL structure (mysite.com/*something*)
      2. store this hash in a managed property
      3. use this managed property as the basis for your duplicate removal (as described in my previous reply)
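As a rough illustration of step 1, a Python sketch along these lines could compute such a hash. The function names are made up, SHA-1 cut down to a 63-bit integer is just one possible choice that yields a value suitable for an Integer managed property (the exact range is an assumption), and how the value gets into the managed property is left to your own processing.

```python
import hashlib
from urllib.parse import urlsplit


def first_level_key(url):
    """Return host plus the first path segment, e.g. 'mysite.com/one'."""
    parts = urlsplit(url if "//" in url else "//" + url)
    segments = [s for s in parts.path.split("/") if s]
    first = segments[0] if segments else ""
    return parts.netloc.lower() + "/" + first


def collapse_id(url):
    """Hash the first-level key into a non-negative 63-bit integer."""
    digest = hashlib.sha1(first_level_key(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> 1


for url in ["mysite.com/one", "mysite.com/one/two", "mysite.com/three/one"]:
    print(first_level_key(url), collapse_id(url))
```

Here mysite.com/one and mysite.com/one/two end up with the same value, while mysite.com/three/one gets a different one.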

      Also remember that when duplicate removal is applied, the result that is returned is the one with the highest ranking score. This means that in order to ensure that a specific record is returned instead of another, you would have to somehow define a boost/sorting option to make sure the item you want is returned first.

      Last, if I understood this wrong and you want to do something else entirely, then just keep in mind that in order to use duplicate removal, all you need is some managed property that has a value to be collapsed on. I know sites that collapse based on URL structure (such as Google, which sometimes will return just the top URL for a site, if there are multiple matches within the same site), other search applications collapse based on category (such as a news site collapsing news from the same category, to allow the user to see articles across multiple categories), etc. The only requirements are the managed property, and customizing the search request to use your specific managed property for duplicate removal.

      Hope that helps!

      Best,
      Leo

  6. Claudia says:

    Hi Leo,
    I have a problem with displaying duplicates. I noticed that when I click Duplicates (n) there is no result and I get “Scope in your query does not exist”. However, if I click Edit Page and then Stop Editing, I can see the duplicates. I didn’t change anything in Edit Page mode, so I’m confused and have no idea how to resolve it. Do you have any idea what is going on?

    Thanks,
    Claudia

    • Claudia says:

      I found that the problem is related to the URL. For some reason, the URL contains lots of spaces. If the space is right after “All%20Sites”, the URL is read the wrong way and the results are not displayed. Do you have any idea how the Duplicates (n) link is generated?

      • Claudia says:

        &dupid=

        This line of code causes the bug. Somebody added a tab before &dupid=.

      • I’m sorry about the delay in getting back to you on this, Claudia. But I’m happy you figured out the problem! These small mistakes are the really tough ones to detect.

        Good luck with your projects!

        Best,
        Leo
