SharePoint 2013 Search Ranking and Relevancy Part 1: Let’s compare to FS14

I’m very happy to do some “guest” blogging for my good friend Leo and continue diving into various search-related topics.  In this and upcoming posts, I’d like to jump right into something that interests me very much: what makes some documents more relevant than others, and which factors influence rank score calculations.

Since SharePoint 2013 is already out, I’d like to touch upon a question that comes up often when someone is considering moving from FAST ESP or FAST Search for SharePoint 2010 to SharePoint 2013: “So how are rank scores calculated in SharePoint 2013 Search as opposed to previous FAST versions?”

In upcoming posts, I will go deeper into the “internals” of the current SharePoint 2013 ranking model, as well as introduce the basics of relevancy calculation concepts that apply across many search engines and are not necessarily specific to FAST or SharePoint Search.

There are some excellent blog posts out there that go in-depth on how SharePoint 2013 Search rank models work, including the ones below from Alexey Kozhemiakin and Mikael Svenson.

http://powersearching.wordpress.com/2013/03/29/how-sharepoint-2013-ranking-models-work/

http://techmikael.blogspot.com/2013/04/rank-models-in-2013main-differences.html

 

To avoid being repetitive, what I’ve tried to do is create an easy-to-read comparison chart of the factors that influence rank calculations in FS14 versus SharePoint 2013 Search.  I may update this chart in the future to include FAST ESP, although the main factors involved in ESP and FS14 are fairly similar to each other, as opposed to SharePoint 2013 Search (which is more closely related to the SharePoint 2010 Search model).

One of the main differences is that SharePoint 2013 Search uses a two-stage process for rank calculations: a linear ranking model as the first stage and a neural network as the second stage.  The first stage is “light”, so we can afford to apply it to all documents in a result set; specific rank features belonging to this stage are applied to every document.  The top 1000 documents (candidates) based on the Stage 1 rank are the input to Stage 2.  This stage is more performance intensive and re-computes the rank score for the documents it receives, which is why it is only applied to a limited set.  It consists of all the same rank features as Stage 1 plus 4 additional proximity features.

For my comparison below, I was mainly using a model called “Search Ranking Model with Two Linear Stages”, which was introduced with the August 2013 CU.  This model is the recommended template when creating custom rank models, as it gives you proximity without a neural network.
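If you want to see which rank models exist in your own farm (and grab the ID of the two-linear-stages model to use as a template), a quick way is PowerShell. This is a minimal sketch, assuming you run it in the SharePoint 2013 Management Shell on a server in the farm; the “<model-guid>” placeholder is something you would copy from the listing yourself:

    # List all rank models registered in the Search service application
    $ssa = Get-SPEnterpriseSearchServiceApplication
    Get-SPEnterpriseSearchRankingModel -SearchApplication $ssa

    # Once you spot "Search Ranking Model with Two Linear Stages" in the listing,
    # retrieve it by its ID and use its XML as the starting point for a custom model:
    # Get-SPEnterpriseSearchRankingModel -SearchApplication $ssa -Identity "<model-guid>"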

 

Rank factors: FS14 vs. SP2013 Search

Rank Models
  • FS14: 1 OOTB rank model
  • SP2013 Search: 16 rank models

Freshness
  • FS14: available OOTB and customizable
  • SP2013 Search: not available OOTB, but possible to configure

Dynamic Ranking (field weighting/managed properties)
  • FS14: context boost on Title, DocSubject, Keywords, DocKeywords, urlkeywords, Description, Author, CreatedBy, ModifiedBy, MetadataAuthor, WorkEmail, Body, crawledpropertiescontent
  • SP2013 Search: document managed properties plus usage/social data: Title, QLogClickedText, SocialTag, Filename, Author, AnchorText, body

FileType
  • FS14: field-boost weight/managed property boost (OOTB -4000 points). Format: Unknown Format, XML, XLS. FileExtension: CSV, TXT, MSG, OFT, ZIP, VSD, RTF. Also IsEmptyList and IsListItem.
  • SP2013 Search: FileType rank feature: PPT, SharePoint site, DOC, HTML, ListItems, Image, Message, XLS, TXT

Language
  • FS14: N/A
  • SP2013 Search: dynamic rank (query-based); the LCID (locale ID) is used

Social Distance
  • FS14: N/A
  • SP2013 Search: static rank based on the colleague relationship to the person issuing the query. Bucket 0 – no colleague relationship; bucket 1 – first-level (direct) relationship; bucket 2 – second-level (indirect) relationship

Static Rank Boost (Query-Independent)
  • FS14: quality weight components (hwboost, docrank, siterank, urldepthrank) plus authority weight, partial and complete
  • SP2013 Search: now part of the Analytics Processing Component. Static rank features calculated from search and usage analytics: QLogClicks, QLogSkips, QLogLastClicks, EventRate

Proximity
  • FS14: enabled by default
  • SP2013 Search: MinSpan (neural network 2nd stage; parameters for the proximity minimal span)

Anchortext (Query-Dependent)
  • FS14: extnumocc, part of the dynamic rank calculations (query-time hits in anchor text)
  • SP2013 Search: AnchortextComplete

URLDepth (Query-Dependent)
  • FS14: N/A – in FS14 this was a static rank feature
  • SP2013 Search: UrlDepth – depth of the document URL (number of slashes)

Click-Through Weight (Query-Dependent)
  • FS14: query authority weight – click-through weight, part of dynamic rank
  • SP2013 Search: N/A as a query-dependent feature; now part of the static rank features used by the Analytics Processing Component (QLogClicks, etc.)

Rank tuning: FS14 vs. SP2013 Search

GUI-based applications (ease of tuning rank calculations, user-friendliness)
  • FS14: N/A. Rank calculations and scores can be seen either via the ranklog output or via Codeplex tools such as the FS4SP Query Logger. However, there isn’t a user-friendly tool to help you make changes and push them live, or preferably preview them offline first. A separate ‘spreladmin’ tool is needed for click analysis.
  • SP2013 Search: Rank Tuning App (coming soon). A GUI-based and user-friendly way to tune/customize ranking and impact relevancy. Includes a “preview”, i.e. offline, mode.

Rank logging availability
  • FS14: Server-side: the ranklog is available via the QRServer output, but only to admins with local access to QRServer port 13280. Client-side: N/A.
  • SP2013 Search: Server-side: Rank Tuning App/ULS logs. Client-side: the ExplainRank template is available to clients – see http://powersearching.wordpress.com/2013/01/25/explain-rank-in-sharepoint-2013-search/


The Myths and Perils in the Pursuit of Advanced Search Options

One question that I’ve heard a lot over the years working in the search space is: “How can I provide advanced search options for users, such as exposing boolean operators?”

My answer: don’t waste your time with it (and especially don’t do it in the first iteration of your search application).

Many search usability studies confirm that most users have no idea how to use advanced search, and instead rely only on simple keyword searches to try and find what they want. As Jakob Nielsen brilliantly put it in his article on “Converting Search into Navigation”:

In study after study, we see the same thing: most users reach for search, but they don’t know how to use it.

Given this fact, my recommendation for customers is always to start simple, with a search interface that lets users enter their keywords to find what they are looking for. Then, after collecting usage logs for a few weeks or months, you can look for query patterns that could be used to trigger more advanced search functionality. One example is the Costco case in Nielsen’s article: for certain queries where you know (from inspecting usage logs) that most users just want a category page, redirect them to that category page instead of a search results page.

This way you can gradually improve the performance of your search application, by “listening” to the search behaviors of your users and adjusting your search application accordingly.


The 4 Essential Concepts You Need to Know To Use Any Search Engine Efficiently

When you go to your insert-search-application-name-here, enter a query and hit the search button, what exactly are you searching on?

One of the hardest things to do in IT (or in any field, really) is to sometimes take a step back and look at the basics, at the foundational knowledge behind some things that we may use every day without necessarily understanding how they really work.

After realizing I’ve been having the same conversation with different customers and students to explain these same concepts over the last few years (both at FAST and now at cXense), I decided to write a little bit about these 4 essential concepts here:

  1. Type of query (AND, PHRASE, OR)
  2. Where to search (all fields, body, title, etc.)
  3. Field Importance
  4. Sorting

Type of Query

The first thing you have to think about when constructing your search interface is: how do I want the system to match the text/query specified by the user?

To answer this question you must understand the differences between the three search requests explained below.

Note: the examples below are query language-agnostic, so just replace them with whatever the proper syntax is for the search engine you are using. Even though the syntax may change, the concepts remain the same.

AND query

Example: venture AND capital

This query will match only documents that contain all terms in the query, which in this case means that any document, in order to be returned, must have both the term venture and the term capital. Those terms can be found together (e.g. “raised more venture capital money…”), separately (e.g. “is initiating a new venture with capital raised…”), or even in a different order (e.g. “for his new venture he raised capital from…”).

This is the most common operator used across search applications and also the default operator in many search platforms (e.g. FAST ESP, FS4SP, cX::search).

PHRASE query

Example: “venture capital”

This query, in contrast to the AND query, will only return documents that contain this exact phrase. This means that a document with a text like “raised more venture capital money…” will match, but a document with “is initiating a new venture with capital raised…” will not (due to the fact that there is an extra term – with – in between the two required terms).

This is an operator often used behind the scenes by search applications whenever a user puts some text between quotes in the search box. It’s very useful for scenarios where the user is looking for an exact phrase.

OR query

Example: venture OR capital

This last query is the most open of all, as it will return documents that contain any of the terms in the query. With this query, a document only needs to have the term venture or capital to be returned, without the need to have both (as was the case with the AND and PHRASE queries). This means that a document with the text “he decided to venture down the hall…” will be returned, as well as a document with the text “Brasília is the federal capital of Brazil”.

Where to search

Now that you have decided what type of queries you want to execute (and, phrase, or), the next step is to decide where you want this search to occur. When asked “where do you want to search?” people usually reply with “everywhere, of course!”. Yet it is important to step back and think about whether that’s really what you want.

Imagine you go to your search application, type “financial systems” (with or without the quotes) and click the search button. What will happen then? Where in the document do you believe this query will try to find the terms financial and systems?

The answer to these questions depends heavily on which search technology you are using behind the scenes:

  • in FAST ESP – this would be a query against the default composite field, which out-of-the-box would be comprised of fields such as body, title, url, keywords, etc.
  • in FAST Search for SharePoint – this would be a query against the fulltext index, which by default contains fields such as title, author, body, etc.
  • in cX::search – this would be a query against all searchable fields in the index

In the case of cX::search, if you do not define exactly which fields should be searched on, by default the search will be executed against all the searchable fields in the index. This means that cX::search will look for the terms financial and systems in the fields title and body, but also in fields such as category, related_content, or even unitsInStock which may not be exactly what you are looking for.

When I was teaching FAST Search for SharePoint, the main confusion for students was the fact that the default search was not across ALL fields, but instead just a subset of them, which meant that for every new managed property that you wanted to search by default (just by typing some terms in the search box, that is) you needed to make sure to add it to the fulltext index as well.

As you can see, even such a simple question can have very distinct answers depending on which search platform you are using, so the best way to avoid future problems is to first understand exactly how your specific search platform handles the default queries, and then use this knowledge to control exactly which fields you want to search on by default.

For cX::search, for example, this could be done by adding the desired list of fields before the query term:

?p_aq=query(title,body,description,tags,url,author:"financial systems", token-op=and)

In the example above we are being very clear about which fields should be used when looking for the query terms defined by the user, which makes it a lot easier to debug and answer questions like “why was this document returned in the results?”.

Field Importance

By now you should know how you want to search (and, phrase, or) and also where to search (title, body, etc.), so it’s time to decide which fields matter more to you among all the ones that were selected to be searched in the previous step. As a starting point, take a look at these document examples below:

Document 1
Title: Market Research Findings – 2012
Description: This document summarizes the findings from the 2012 market research study…
Tags: research, 2012

Document 2
Title: About the market crash of 1929
Description: All the available research on the market crash of 1929…
Tags: stock, market, 1929

Document 3
Title: XYZ begins to explore new market
Description: After a few years focused on research, company XYZ began exploring a new market…
Tags: XYZ

And now consider the following query: market AND research

Based on the sample query and documents above, which document would you expect to be ranked higher?

Most people would say Document 1 listed above should be ranked higher, and the reason is that users have been trained by search engines to expect, among other things, that anything found in the title of a document should have more relevance than something found somewhere in the body of the document. This is a very reasonable expectation, because we tend to accept that if someone went through the trouble of choosing specific terms to put in the title of a document, then those terms must be important.

So, depending on your search platform of choice, there are different ways for you to be explicit about what fields should have higher importance.

In cX::search, for example, the modified query would look like this:

?p_aq=query(title^5,tags^3,body:"market research", token-op=and)

The query above is defining that cX::search should:

  • look for documents containing the terms market and research;
  • these terms must be found in the title, tags or body fields; and, even more importantly
  • terms found in the title have 5 times (title^5) more importance than terms found in the body (the default field boost is 1)
  • terms found in the tags have 3 times (tags^3) more importance than terms found in the body

In a similar fashion, FAST ESP has the composite-rank piece of the rank profile, which allows you to define how much importance you want to give for each field that is part of a composite field.

In FAST Search for SharePoint, you also have options available, either through the UI or through PowerShell, to configure which importance level a managed property should belong to when mapped to a fulltext index, as shown in the screenshot below:

Fulltext Index Mapping
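If you prefer to script this instead of using the UI, here is a minimal sketch using the FS4SP PowerShell cmdlets (run in the FS4SP shell on the administration server); the managed property name “mycustomproperty” is just a hypothetical example:

    # Map a managed property into the default full-text index ("content") at importance level 5
    $mp  = Get-FASTSearchMetadataManagedProperty -Name "mycustomproperty"
    $fti = Get-FASTSearchMetadataFullTextIndex -Name "content"
    New-FASTSearchMetadataFullTextIndexMapping -ManagedProperty $mp -FullTextIndex $fti -Level 5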

As you can see from the examples above, using field boosts (or any similar feature in the search platform you are using) gives you the flexibility to be very precise about which fields matter most according to your specific business rules.

Sorting

The last important piece of this puzzle of configuring basic relevance settings for your search application is to decide how results should be sorted before being returned. This is crucial because, in the end, this is what decides what results will be displayed on top.

Remember the previous example above that used field boosts to define the importance of each field? Well, now take a look at this cX::search request below:

?p_aq=query(title^5,tags^3,body:"market research", token-op=and)&p_sm=publication_date:desc

As you can see above, this query is explicitly requesting that results be sorted by publication_date in descending order. What this means is that any field boosts are completely ignored by the search engine. Yes, they are simply ignored, since we directly requested results to be sorted based on a date field, instead of the default sorting that is based on the ranking score.

Sometimes this is exactly what you want, such as the case when the user has already drilled down to a subset of results and you want to allow him/her to just sort by price or average rating, for example (two options I often use when searching for products at Amazon).

And a last option is the case where you want to mix the two approaches, in a way that you can still use the ranking score, but with extra boosts that take into consideration how recent a document is (or how many units it has sold, or what its average rating is, etc.). Those are more advanced options that we will discuss another day, but for now just keep in mind that yes, that’s also possible :)


How to get authenticated/secure results through the QRServer in FAST Search for SharePoint

I received an email from an ex-student today that forced me to remember how to send an authenticated query to the QRServer in FAST Search for SharePoint.

The reason for doing this is that when you issue a query through the SharePoint UI, additional security parameters are sent to FAST along with the query. But when you go directly against the QRServer interface (accessible through http://localhost:13280 on the server running the query component in the FAST farm), the queries typed there are sent without any security parameters by default, which means you will not get back any results that require security permissions (such as all your crawled SharePoint content, for example).

I’ve sent instructions to students on how to get authenticated results from the QRServer many times in the past, and even commented about it in this post here, but I just realized I never posted this here on the blog, so I’m doing it now to make this information easier to be found.

Below are the steps to get secure results through the QRServer without having to modify qtf-config.xml (avoiding changes to that file is advisable):

Note: you will need to perform the steps below on a query server in your FAST farm

  1. Edit %FASTSEARCH%\components\sam\worker\user_config.xml
  2. Change:
    <add name="AllowNonCleanUpClaimsCacheForTestingOnly" value="false" type="System.Boolean" />
    To:
    <add name="AllowNonCleanUpClaimsCacheForTestingOnly" value="true" type="System.Boolean" />
  3. To pick up your changes, open a command prompt window and restart the samworker
    nctrl restart samworker
  4. Make sure the samworker is running. If it is not running, check your previous edits.
    nctrl status
  5. Execute a query through a search center in SharePoint and ensure results are returned. You will use the security credentials from this query to get secure results from the QRServer.
  6. Navigate to %FASTSEARCH%\var\log\querylogs and open your latest query log (if the file is locked, make a copy of the file and open the copy).
  7. Locate and copy this parameter: &qtf_securityfql:uid=<token>= (the trailing equal sign should be included)
  8. Navigate to the qrserver page: http://localhost:13280/
  9. In the additional parameters text box add:
    &qtf_securityfql:uid=<token>=
  10. Issue a query and ensure you get secure results back.
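To illustrate where the parameter ends up, the resulting request against the QRServer looks roughly like the following (the <token> value is the one you copied from the query log in step 7, and the query term is just an example):

http://localhost:13280/cgi-bin/search?query=sharepoint&qtf_securityfql:uid=<token>=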

Another way to get authenticated results (from outside the SharePoint UI) without making any modifications to your system is to use the terrific FAST Search for SharePoint 2010 Query Logger tool created by Mikael Svenson.

Enjoy! :)


Goodbye Microsoft, Hello New Year and New Ventures

“All changes, even the most longed for, have their melancholy; for what we leave behind us is a part of ourselves; we must die to one life before we can enter another.” ~ Anatole France

This has pretty much been my feeling for the past few weeks, as I gathered my things and organized everything for my departure from Microsoft. As of today (or rather, yesterday at midnight) I’m no longer working as a Senior Technical Instructor at Microsoft.

Combining my time at FAST, and then at Microsoft after the acquisition, I have been with the company for over 6 years. During this time I’ve made many friends, worked on challenging and exciting projects and also had a chance to spend a lot of time exploring, understanding and trying to help others understand (at least I hope so :)) the enterprise search world. In short, it was a lot of fun and I learned a LOT. Which is why I decided to change and keep learning…

Starting today I’m joining some great friends over at cXense as their Director of Customer Excellence in the Americas (US, Canada and Latin America), which is a very fancy title to say that I will continue doing what I love to do most: helping customers be successful.

As for the future of the blog, it will get even better. I will keep trying to help as much as I can, responding to questions and comments, as well as posting about enterprise search, this time also including posts about other flavors of search (such as Solr/Lucene and cXsearch, which I can’t wait to learn more about!). I’ve also invited a few friends, who are still working with SharePoint/FAST Search on a daily basis, to guest post here now and then. Questions about any of these search technologies will be more than welcome, as always!

6 years ago I got a call inviting me to join FAST and, by accepting that offer, I had one of my best professional experiences to date. When the same people called me again, this time with an invitation to join cXense, I had no choice but to say yes and look forward to many more adventures. :)


2011 in review – Annual Report by WordPress

It’s very interesting to look at the 2011 annual report for my blog that was put together by WordPress. Not surprisingly, the most read posts were the learning roadmaps. It seems like with so much information spread all over the place about SharePoint Search & FAST Search for SharePoint, it becomes even more important to have a clear path to follow to learn more.

Now let’s get ready for 2012! :)

The WordPress.com stats helper monkeys prepared a 2011 annual report for this blog.

Here’s an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 14,000 times in 2011. If it were a concert at Sydney Opera House, it would take about 5 sold-out performances for that many people to see it.


How “Remove Duplicate Results” works in FAST Search for SharePoint

This is a question that I have received quite a few times, and this time I thought I would get some screenshots and detail the process a little bit in here so that other folks can take advantage of this info as well.

First of all, what is the “Remove Duplicate Results” feature? It is a feature that tells the search engine to collapse results that are perceived as duplicates (such as the same document located in different paths), so that only one instance of the document is returned in the search results (instead of showing the multiple copies to the end-user).

The setting that enables this feature (which is on by default) in a Search Center is available at the Search Core Results Web Part, under the section Result Query Options, as shown below:

Search Core Results Web Part - Remove Duplicate Results

Once this option is enabled, any duplicate items will be collapsed in the search results, as you can see in this example:

FAST Search for SharePoint Duplicate Results Removal

And if you want to see all the duplicates (in order to delete one of them in the source repository, for example), all you have to do is click on the “Duplicates (2)” link highlighted above. This will execute another query, filtering results to display only the duplicates of the item you selected:

FAST Search for SharePoint Duplicate Removal Example

 

Now let’s investigate how this feature works. To do this, we will go in reverse order (from search to processing) to understand all the pieces involved.

The first clue is that this is enabled/disabled during search time, so there must be some parameter being sent by the Search Center to FAST Search for SharePoint to inform that duplicate removal is enabled. Taking a look at one of the full search requests in the querylogs (%FASTSearch%\var\log\querylogs\) we can confirm this:

/cgi-bin/search?hits=10&resubmitflags=1&rpf_navigation:hits=50&query=sharepoint&spell=suggest&collapsenum=1&qtf_lemmatize=True…&collapseon=batvdocumentsignature&type=kwall…

Note: the querylog shown above has some query parameters removed so we can focus on the items that matter to duplicate removal.

As you can see, there are two parameters sent to FAST Search for SharePoint indicating which property should be used for collapsing (batvdocumentsignature) and how many items should be kept after collapsing is performed (1). And if we want more information about these options, the MSDN documentation explains these two parameters used for duplicate removal (the names differ because the querylog shows the internal query parameter names received by FAST Search for SharePoint):

onproperty – Specifies the name of a non-default managed property to use as the basis for duplicate removal. The default value is the DocumentSignature managed property. The managed property must be of type Integer. By using a managed property that represents a grouping of items, you can use this feature for field collapsing.

keepcount – Specifies the number of items to keep for each set of duplicates. The default value is 1. It can be used for result collapsing use cases. If TrimDuplicates is based on a managed property that can be used as a group identifier (for example, a site ID), you can control how many results are returned for each group. The items returned are the items with the highest dynamic rank within each group.

The last parameter that can be used with the Duplicate Removal feature is also described in the MSDN article, explaining what happened behind the scenes when we clicked the “Duplicates (2)” link to display all the duplicates for that item:

includeid – Specifies the value associated with a collapse group, typically used when a user clicks the Duplicates (n) link of an item with duplicates. This value corresponds to the value of the fcoid managed property that is returned in query results.

Ok, so far we know that duplicate removal is enabled by default and is applied by collapsing results that have the same value for the managed property DocumentSignature. Let’s have a look at the settings for this managed property then:

DocumentSignature managed property settings

As you can see, the type of this managed property is Integer (which the MSDN article defined as a requirement) and it is also configured both as Sortable and Queryable. The peculiar thing is that when looking at the crawled properties mapped to this managed property we get nothing as a result, which indicates that the value for this property is most likely being computed by FAST Search for SharePoint during content processing.

So let’s take a look at some lower-level configuration files to track this down. We start with the mother file of all configurations related to the content processing pipeline: %FASTSearch%\etc\PipelineConfig.xml. This is a file that can’t be modified (since it is not included in the list of configuration files that can be modified), but nothing prevents us from just looking at it. After opening this configuration file and searching for “documentsignature”, you will find the definition for the stage responsible for assigning the value to this property:

    <processor name="DocumentSignature" type="general" hidden="0">
      <load module="processors.DuplicateId" class="DuplicateId"/>
      <config>
       <param name="Output" value="documentsignature" type="string"/>
       <param name="Input"  value="title:0:required body:1024:required documentsignaturecontribution" type="string"/>
      </config>
    </processor>

The parameters that matter most to us are Input and Output:

  • Input: which properties will be used to calculate the document signature –> title and the first 1024 bytes of body (as well as a property called documentsignaturecontribution that will also be used if it has any value)
  • Output: our dear documentsignature property

And with this we get to the bottom of this duplicate removal feature, which means it is a good time to recap everything we found out:

  1. During content processing, for every item being processed, FAST Search for SharePoint will obtain the value of title and the first 1024 bytes of body for this item, and use it to compute a numerical checksum that will be used as a document signature. This checksum is stored in the property documentsignature for every item processed.
  2. During query time, whenever “Remove Duplicate Results” is enabled, the Search Center tells FAST Search for SharePoint to collapse results using the documentsignature property, effectively eliminating any duplicates for items that have the same title+first-1024-bytes-of-body.
  3. When a user clicks on the “Duplicates (n)” link next to an item that has duplicates, another query is submitted to FAST Search for SharePoint, passing as an additional parameter the value of the fcoid managed property for the item selected, which will be used to return all items that contain the same checksum (aka “the duplicates”).

A few extra questions that may appear in your head after you read this:

  • Is it possible to collapse results by anything other than documentsignature and use this feature for other types of collapsing (collapse by site, collapse by category, etc.)?
    • Answer: Yes, it is absolutely possible. All you will need is an Integer, Sortable, Queryable managed property containing the values you want to collapse by, plus a custom web part (or an extended Search Core Results Web Part) where you specify this managed property for collapsing and how many items should be preserved (as explained in the MSDN article linked before); see the sketch after this list for an example.
  • Can two items that are not identical be considered as duplicates?
    • Answer: Yep. As we saw above, only the first 1024 bytes of body are used for calculating the checksum, which means that any other differences these documents may have beyond the first 1024 bytes will not be considered for the purposes of duplicate removal. (Note: roughly speaking, the body property will have just the text of the document, without any of the formatting)
  • Can I change how the default document signature is computed?
    • Answer: Yes, read the update at the end of this post. (My original answer was no, since this is done by one of FAST Search for SharePoint’s out-of-the-box stages; what you can always do, though, is calculate your own checksum using the Pipeline Extensibility and then use that custom property for duplicate removal.)
  • When you have duplicates and only one item is shown in the results, how does FS4SP decide which item to show?
    • Answer: The item shown is the item with the highest rank score among all the duplicates.
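To make the first answer above a bit more concrete, here is a rough PowerShell sketch of custom collapsing through the server object model; this is my own illustration, not code from the product documentation. It assumes an Integer, Sortable, Queryable managed property called siteidhash (a hypothetical name) that groups the items you want to collapse, and it is meant to run in the SharePoint 2010 Management Shell on a server in the farm:

    # Load the search assembly in case the shell has not loaded it already
    [void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.Office.Server.Search")
    Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

    $site  = Get-SPSite "http://intranet.contoso.com"   # hypothetical site collection URL
    $query = New-Object Microsoft.Office.Server.Search.Query.KeywordQuery($site)
    $query.ResultsProvider = [Microsoft.Office.Server.Search.Query.SearchProvider]::FASTSearch
    $query.QueryText = "sharepoint"

    # Collapse on a custom managed property instead of the default documentsignature,
    # keeping up to 3 items per group
    $query.TrimDuplicates           = $true
    $query.TrimDuplicatesOnProperty = "siteidhash"
    $query.TrimDuplicatesKeepCount  = 3

    $results = $query.Execute()   # returns the collapsed result tables
    $site.Dispose()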

That’s it for today. If any other questions related to duplicate removal surface after this post is published, I will add them to the list above. :)

UPDATE (2011-12-09): Yes, you can influence how document signature is computed!

After I posted this article yesterday I was still thinking about that documentsignaturecontribution property mentioned above, as I had a feeling that there was a way to use it to influence how the document signature is computed. Well, today I found some time to test it and yes, it works! Here is how to do it.

What you have to do is create a new managed property with exactly the name documentsignaturecontribution and then map to this managed property any values that you also want to use for the checksum computation (as with other managed properties, to assign a value to this property you must map a crawled property to it).

You need a managed property because the DocumentSignature stage is executed after the mapping from crawled properties to managed properties, so FAST Search for SharePoint is looking for a managed property named documentsignaturecontribution to use as part of the checksum computation. When you create this managed property and assign it some value, FAST Search for SharePoint simply uses this, along with title and the first 1024 bytes of body, to calculate the checksum.

I followed the great idea from Mikael Svenson to create two text files filled with the same content just to force them to be perceived as duplicates by the system. The key here was to create these two files with exactly the same name and content, but put them in different folders. This way I could guarantee that both items had the same title and body, which would result in them having the same checksum. This was confirmed after I crawled the folders with these items:

FAST Search for SharePoint - Duplicate Removal - Example of duplicates

Both items had the same checksum, which could be checked by looking at the property fcoid that was returned with the results:

fcoid: 102360510986285564

My next step was to create the managed property documentsignaturecontribution and map to it some value that would allow me to distinguish between the two files. In my case, that value was the path to the items, which were located in different folders. So, after creating my managed property documentsignaturecontribution of type Text, I mapped to it the same crawled properties that are mapped to the managed property path, just to make sure I would get the same value as that property:

FAST Search for SharePoint - Duplicate Removal - documentsignaturecontribution managed property
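For reference, the same managed property and mapping could also be created with the FS4SP PowerShell cmdlets instead of the UI. This is a minimal sketch of that approach (the cmdlet usage is my own illustration; it simply reuses the crawled properties that already feed the path managed property, as I did above):

    # Create the managed property the DocumentSignature stage looks for (type 1 = Text)
    $mp = New-FASTSearchMetadataManagedProperty -Name "documentsignaturecontribution" -Type 1

    # Reuse the crawled properties mapped to "path", so the item's path also contributes to the checksum
    $path = Get-FASTSearchMetadataManagedProperty -Name "path"
    foreach ($cp in $path.GetCrawledPropertyMappings()) {
        New-FASTSearchMetadataCrawledPropertyMapping -ManagedProperty $mp -CrawledProperty $cp
    }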

With this done, I just had to perform another full crawl to force SharePoint to process both of my text files again, and confirm that they were not perceived as duplicates anymore (since I was also using their path to compute the checksum):

Search results after the re-crawl, no longer showing the two text files as duplicates

Another look into the fcoid property confirmed that both items did have different checksums now:

  • file://demo2010a/c$/testleo/test.txt – fcoid: 334483385708934799
  • file://demo2010a/c$/testleo/subdir/test.txt – fcoid: 452732803969334619

So what we learned with this is that you can influence how the document signature is computed by creating this managed property documentsignaturecontribution and mapping to it any value you want to be part of the checksum computation. And if you want to use the full body of the document to compute the checksum, you could accomplish this through the following steps:

  1. add to the Pipeline Extensibility a custom process that takes the body property as input and returns a checksum crawled property, computed over the full contents of the body, as the output (a rough sketch of such a process follows after these steps)
  2. map the checksum crawled property returned in the previous step to the managed property documentsignaturecontribution
  3. re-crawl your content and celebrate when it works :)
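For step 1, the custom process registered in the Pipeline Extensibility is just an executable (or a script launched by one) that reads an input XML file containing the requested crawled properties and writes an output XML file with the new ones. Below is a very rough PowerShell sketch of such a script, only to illustrate the idea: the property set GUID, the output property name fullbodychecksum, the varType values and the registration details in pipelineextensibility.xml are assumptions you would need to adapt to your own configuration.

    # Invoked by the pipeline roughly as: powershell.exe -File fullbodychecksum.ps1 %(input)s %(output)s
    param([string]$inputFile, [string]$outputFile)

    # Read the input XML handed to us by FS4SP and grab the text of the "body" crawled property
    [xml]$doc = Get-Content -Path $inputFile -Encoding UTF8
    $body = ($doc.Document.CrawledProperty | Where-Object { $_.propertyName -eq "body" }).InnerText

    # Compute a simple integer checksum over the FULL body (first 4 bytes of an MD5 hash)
    $md5      = [System.Security.Cryptography.MD5]::Create()
    $hash     = $md5.ComputeHash([System.Text.Encoding]::UTF8.GetBytes([string]$body))
    $checksum = [System.BitConverter]::ToInt32($hash, 0)

    # Emit the output crawled property; afterwards, map it to documentsignaturecontribution
    $lines = @(
        '<Document>',
        ('  <CrawledProperty propertySet="00000000-0000-0000-0000-000000000000" propertyName="fullbodychecksum" varType="20">' + $checksum + '</CrawledProperty>'),
        '</Document>'
    )
    Set-Content -Path $outputFile -Value $lines -Encoding UTF8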

