Goodbye Microsoft, Hello New Year and New Ventures

“All changes, even the most longed for, have their melancholy; for what we leave behind us is a part of ourselves; we must die to one life before we can enter another.” ~ Anatole France

This has pretty much been my feeling for the past few weeks, as I gathered my things and organized everything for my departure from Microsoft. As of today (or, more precisely, as of yesterday at midnight) I’m no longer working as a Senior Technical Instructor at Microsoft.

Combining my time at FAST, and then at Microsoft after the acquisition, I have been with the company for over 6 years. During this time I’ve made many friends, worked on challenging and exciting projects, and also had a chance to spend a lot of time exploring, understanding and trying to help others understand (at least I hope so :)) the enterprise search world. In short, it was a lot of fun and I learned a LOT. Which is why I decided to change and keep learning…

Starting today I’m joining some great friends over at cXense as their Director of Customer Excellence in the Americas (US, Canada and Latin America), which is a very fancy title for saying that I will continue doing what I love to do most: helping customers be successful.

As for the future of the blog, it will get even better. I will keep trying to help as much as I can, responding to questions and comments, as well as posting about enterprise search, this time also including posts about other flavors of search (such as Solr/Lucene and cXsearch, which I can’t wait to learn more about!). I’ve also invited a few friends, who are still working with SharePoint/FAST Search on a daily basis, to guest post here now and then. Questions about any of these search technologies will be more than welcome, as always!

6 years ago I got a call inviting me to join FAST and, by accepting that offer, I had one of my best professional experiences to date. When the same people called me again, this time with an invitation to join cXense, I had no choice but to say yes and look forward to many more adventures. :)


2011 in review – Annual Report by WordPress

It’s very interesting to look at the 2011 annual report for my blog put together by WordPress. Not surprisingly, the most read posts were the learning roadmaps. It seems like with so much information about SharePoint Search & FAST Search for SharePoint spread all over the place, it becomes even more important to have a clear path to follow to learn more.

Now let’s get ready for 2012! :)

The WordPress.com stats helper monkeys prepared a 2011 annual report for this blog.

Here’s an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 14,000 times in 2011. If it were a concert at the Sydney Opera House, it would take about 5 sold-out performances for that many people to see it.


How “Remove Duplicate Results” works in FAST Search for SharePoint

This is a question that I have received quite a few times, and this time I thought I would grab some screenshots and detail the process a bit here so that other folks can take advantage of this info as well.

First of all, what is the “Remove Duplicate Results” feature? It is a feature that tells the search engine to collapse results that are perceived as duplicates (such as the same document located in different paths), so that only one instance of the document is returned in the search results (instead of showing the multiple copies to the end-user).

The setting that enables this feature (which is on by default) in a Search Center is available in the Search Core Results Web Part, under the Result Query Options section, as shown below:

Search Core Results Web Part - Remove Duplicate Results

Once this option is enabled, any duplicate items will be collapsed in the search results, as you can see in this example:

FAST Search for SharePoint Duplicate Results Removal

And if you want to see all the duplicates (in order to delete one of them in the source repository, for example), all you have to do is click the “Duplicates (2)” link highlighted above. This will execute another query, filtering results to display only the duplicates of the item you selected:

FAST Search for SharePoint Duplicate Removal Example

 

Now let’s investigate how this feature works. To do this, we will go in reverse order (from search to processing) to understand all the pieces involved.

The first clue is that this is enabled/disabled at search time, so there must be some parameter sent by the Search Center to tell FAST Search for SharePoint that duplicate removal is enabled. Taking a look at one of the full search requests in the querylogs (%FASTSearch%\var\log\querylogs\) we can confirm this:

/cgi-bin/search?hits=10&resubmitflags=1&rpf_navigation:hits=50&query=sharepoint&spell=suggest&collapsenum=1&qtf_lemmatize=True…&collapseon=batvdocumentsignature&type=kwall…

Note: the querylog shown above has some query parameters removed so we can focus on the items that matter to duplicate removal.

As you can see, there are two parameters sent to FAST Search for SharePoint indicating which property should be used for collapsing (batvdocumentsignature) and how many items should be kept after collapsing is performed (1). And if we want more information about these options, the MSDN documentation explains these two parameters used for duplicate removal (the names differ because the querylog shows the internal query parameter names received by FAST Search for SharePoint):

onproperty – Specifies the name of a non-default managed property to use as the basis for duplicate removal. The default value is the DocumentSignature managed property. The managed property must be of type Integer. By using a managed property that represents a grouping of items, you can use this feature for field collapsing.

keepcount – Specifies the number of items to keep for each set of duplicates. The default value is 1. It can be used for result collapsing use cases. If TrimDuplicates is based on a managed property that can be used as a group identifier (for example, a site ID), you can control how many results are returned for each group. The items returned are the items with the highest dynamic rank within each group.

The last parameter that can be used with the Duplicate Removal feature is also described in the MSDN article, explaining what happened behind the scenes when we clicked the “Duplicates (2)” link to display all the duplicates for that item:

includeid – Specifies the value associated with a collapse group, typically used when a user clicks the Duplicates (n) link of an item with duplicates. This value corresponds to the value of the fcoid managed property that is returned in query results.
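To make these three parameters more concrete, here is a minimal, hedged sketch of how they surface in the Query Object Model when you build your own query (the site URL is a placeholder, and the TrimDuplicates* property names should be checked against the MSDN article mentioned above before you rely on them):

# A hedged sketch, not production code: the three duplicate removal knobs in the
# Query Object Model (run in the SharePoint 2010 Management Shell on a farm with FS4SP).
[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.Office.Server.Search")

$site  = Get-SPSite "http://intranet"   # placeholder site URL
$query = New-Object Microsoft.Office.Server.Search.Query.KeywordQuery($site)
$query.QueryText   = "sharepoint"
$query.ResultTypes = [Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults

$query.TrimDuplicates           = $true                # duplicate removal on/off
$query.TrimDuplicatesOnProperty = "documentsignature"  # onproperty (the default)
$query.TrimDuplicatesKeepCount  = 1                    # keepcount (the default)
# $query.TrimDuplicatesIncludeId = <fcoid>             # includeid: expand one collapse group

$results = $query.Execute()
$table   = $results[[Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults]
Write-Host ("{0} results after duplicate collapsing" -f $table.RowCount)
$site.Dispose()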

Ok, so far we know that duplicate removal is enabled by default and is applied by collapsing results that have the same value for the managed property DocumentSignature. Let’s have a look at the settings for this managed property then:

FAST Search for SharePoint - DocumentSignature Managed Property Settings

As you can see, the type of this managed property is Integer (which the MSDN article defined as a requirement) and it is also configured both as Sortable and Queryable. The peculiar thing is that when looking at the crawled properties mapped to this managed property we get nothing as a result, which indicates that the value for this property is most likely being computed by FAST Search for SharePoint during content processing.

So let’s take a look at some lower-level configuration files to track this down. We start with the mother file of all configurations related to the content processing pipeline: %FASTSearch%\etc\PipelineConfig.xml. This is a file that can’t be modified (since it is not included here in the list of configuration files that can be modified), but nothing prevents us from just looking at it. After opening this configuration file and searching for “documentsignature”, you will find the definition for the stage responsible for assigning the value to this property:

    <processor name="DocumentSignature" type="general" hidden="0">
      <load module="processors.DuplicateId" class="DuplicateId"/>
      <config>
       <param name="Output" value="documentsignature" type="string"/>
       <param name="Input"  value="title:0:required body:1024:required documentsignaturecontribution" type="string"/>
      </config>
    </processor>

The two parameters that matter most to us are:

  • Input: the properties used to calculate the document signature, namely title and the first 1024 bytes of body (as well as a property called documentsignaturecontribution, which is also used if it has any value)
  • Output: our dear documentsignature property

And with this we get to the bottom of the duplicate removal feature, which means it is a good time to recap everything we found out:

  1. During content processing, for every item being processed, FAST Search for SharePoint will obtain the value of title and the first 1024 bytes of body for this item, and use it to compute a numerical checksum that will be used as a document signature. This checksum is stored in the property documentsignature for every item processed.
  2. During query time, whenever “Remove Duplicate Results” is enabled, the Search Center tells FAST Search for SharePoint to collapse results using the documentsignature property, effectively eliminating any duplicates for items that have the same title+first-1024-bytes-of-body.
  3. When a user clicks on the “Duplicates (n)” link next to an item that has duplicates, another query is submitted to FAST Search for SharePoint, passing as an additional parameter the value of the fcoid managed property for the item selected, which will be used to return all items that contain the same checksum (aka “the duplicates”).

A few extra questions that may come to mind after reading this:

  • Is it possible to collapse results by anything other than documentsignature and use this feature for other types of collapsing (collapse by site, collapse by category, etc.)?
    • Answer: Yes, it is absolutely possible. All you need is an Integer, Sortable, Queryable managed property containing the values you want to collapse by, plus a custom web part (or an extended Search Core Results Web Part) where you request this managed property to be used for collapsing and specify how many items should be preserved (as explained in the MSDN article linked before). See the sketch after this list.
  • Can two items that are not identical be considered as duplicates?
    • Answer: Yep. As we saw above, only the first 1024 bytes of body are used for calculating the checksum, which means that any other differences these documents may have beyond the first 1024 bytes will not be considered for the purposes of duplicate removal. (Note: roughly speaking, the body property will have just the text of the document, without any of the formatting)
  • Can I change how the default document signature is computed?
    • Answer: Yes, read the update to this post below. (My original answer was no, since this is done by one of the FAST Search for SharePoint out-of-the-box stages; what you can always do, though, is calculate your own checksum using the Pipeline Extensibility and then use your custom property for duplicate removal.)
  • When you have duplicates and only one item is shown in the results, how does FS4SP decide which item to show?
    • Answer: The item shown is the item with the highest rank score among all the duplicates.
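Here is the sketch promised in the first answer above: a hedged example (the property and crawled property names are hypothetical, and the type/sortable codes should be verified on your farm) of creating an Integer, Sortable, Queryable managed property to collapse on and mapping a crawled property to it. The web part side would then request this property for collapsing, just like the query parameters we saw in the querylog:

# Hypothetical managed property "sitegroupid" to be used as a collapse group.
$mp = New-FASTSearchMetadataManagedProperty -Name "sitegroupid" -Type 2   # 2 = Integer
$mp.Queryable    = $true
$mp.SortableType = 1   # 1 = SortableEnabled (verify the enum values on your farm)
$mp.Update()

# Map a crawled property (hypothetical name) that carries the group identifier.
$cp = Get-FASTSearchMetadataCrawledProperty -Name "ows_SiteGroupId" | Select-Object -First 1
New-FASTSearchMetadataCrawledPropertyMapping -ManagedProperty $mp -CrawledProperty $cp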

That’s it for today. If any other questions related to duplicate removal surface after this post is published, I will add them to the list above. :)

UPDATE (2011-12-09): Yes, you can influence how document signature is computed!

After I posted this article yesterday I was still thinking about that documentsignaturecontribution property mentioned, as I had a feeling that there was a way to use it to influence how the document signature is computed. Well, today I found some time to test it and yes, it works! Here is how to do it.

What you have to do is create a new managed property with exactly the name documentsignaturecontribution and then map to this managed property any values that you also want to use for the checksum computation (as with other managed properties, to assign a value to this property you must map a crawled property to it).

You need a managed property because the DocumentSignature stage is executed after the mapping from crawled properties to managed properties, so FAST Search for SharePoint is looking for a managed property named documentsignaturecontribution to use as part of the checksum computation. When you create this managed property and assign it some value, FAST Search for SharePoint simply uses this, along with title and the first 1024 bytes of body, to calculate the checksum.
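Putting this into PowerShell, here is a minimal sketch of the idea (hedged: verify the type code and the GetCrawledPropertyMappings call on your own farm). It creates the documentsignaturecontribution managed property and maps to it the same crawled properties that feed the built-in path managed property, which is the approach I describe in the test below:

# Create the documentsignaturecontribution managed property (1 = Text).
$contrib = New-FASTSearchMetadataManagedProperty -Name "documentsignaturecontribution" -Type 1

# Reuse the crawled properties mapped to the built-in "path" managed property,
# so the item path becomes part of the document signature.
$pathMp = Get-FASTSearchMetadataManagedProperty -Name "path"
foreach ($cp in $pathMp.GetCrawledPropertyMappings())
{
    New-FASTSearchMetadataCrawledPropertyMapping -ManagedProperty $contrib -CrawledProperty $cp
}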

I followed the great idea from Mikael Svenson to create two text files filled with the same content just to force them to be perceived as duplicates by the system. The key here was to create these two files with exactly the same name and content, but put them in different folders. This way I could guarantee that both items had the same title and body, which would result in them having the same checksum. This was confirmed after I crawled the folders with these items:

FAST Search for SharePoint - Duplicate Removal - Example of duplicates

Both items had the same checksum, which could be checked by looking at the property fcoid that was returned with the results:

fcoid: 102360510986285564

My next step was to create the managed property documentsignaturecontribution and map to it some value that would allow me to distinguish between the two files. In my case, that value was the path to the items, which were located in different folders. So, after creating my managed property documentsignaturecontribution of type Text, I mapped to it the same crawled properties that are mapped to the managed property path, just to make sure I would get the same value as that property:

FAST Search for SharePoint - Duplicate Removal - documentsignaturecontribution managed property

With this done, I just had to perform another full crawl to force SharePoint to process both of my text files again, and confirm that they were not perceived as duplicates anymore (since I was also using their path to compute the checksum):

FAST Search for SharePoint - Duplicate Removal - Items no longer detected as duplicates

Another look into the fcoid property confirmed that both items did have different checksums now:

  • file://demo2010a/c$/testleo/test.txt – fcoid: 334483385708934799
  • file://demo2010a/c$/testleo/subdir/test.txt – fcoid: 452732803969334619

So what we learned is that you can influence how the document signature is computed by creating the managed property documentsignaturecontribution and mapping to it any value you want to be part of the checksum computation. And if you want to use the full body of the document to compute the checksum, you could accomplish this through the following steps:

  1. add to the Pipeline Extensibility a custom process that uses the body property as input and returns a checksum crawled property based on the full contents of the body as the output (a rough sketch follows this list)
  2. map this checksum crawled property returned in the previous step to the managed property documentsignaturecontribution
  3. re-crawl your content and celebrate when it works :)
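As a rough illustration of step 1, here is a hedged sketch of such a Pipeline Extensibility script. The property sets and names are placeholders (the real body crawled property and your custom output property will differ), so treat it as a starting point rather than as working code:

# Hedged sketch of a Pipeline Extensibility script: read the full body crawled
# property and emit an MD5-based checksum crawled property. All property sets
# and names below are placeholders; substitute your own values.
param ([string]$inputFile, [string]$outputFile)

$bodyPropSet  = "00000000-0000-0000-0000-000000000000"  # placeholder: body crawled property set
$bodyPropName = "body"                                   # placeholder: body crawled property name
$outPropSet   = "11111111-1111-1111-1111-111111111111"  # placeholder: your custom property set
$outPropName  = "fullbodychecksum"                       # placeholder: custom output property

$xml  = [xml](Get-Content $inputFile -Encoding UTF8)
$node = $xml.Document.CrawledProperty | Where-Object { $_.propertySet -eq $bodyPropSet -and $_.propertyName -eq $bodyPropName }

# MD5 over the full body text, written out as a hex string.
$md5   = [System.Security.Cryptography.MD5]::Create()
$bytes = [System.Text.Encoding]::UTF8.GetBytes([string]$node.innerText)
$hash  = [System.BitConverter]::ToString($md5.ComputeHash($bytes)).Replace("-", "")

$out = @"
<?xml version="1.0" encoding="UTF-8"?>
<Document>
  <CrawledProperty propertySet="$outPropSet" propertyName="$outPropName" varType="31">$hash</CrawledProperty>
</Document>
"@
$out | Out-File $outputFile -Encoding UTF8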

Additional info:


How to force an item to be removed from the index immediately with FAST Search for SharePoint

Very important reminder: the tip below will explain how to remove an item from the index, yet this does not prevent it from being picked up during the next crawl (full/incremental). In order to prevent the item from being crawled again, additional steps must be taken (such as creating a crawl rule for the item). Big thanks to Mikael Svenson for the reminder!
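For the crawl rule part of that reminder, a hedged sketch (the SSA name and the path are placeholders) would look like this:

# Exclusion rule so the removed item is not picked up again by the next crawl.
$ssa = Get-SPEnterpriseSearchServiceApplication "FAST Content SSA"
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa -Path "http://intranet/docs/obsolete.docx" -Type ExclusionRule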

Earlier today I posted an article on Microsoft Learning’s Born to Learn website, and I thought it would be an interesting article for the audience here as well. The article covers how to force FAST Search for SharePoint to remove an item from the index immediately, without having to wait until the next crawl.

In the article linked above I showed how to obtain the Item ID (or ssic://<id>) using the Crawl Logs, so here in this post I will show how to obtain it directly through the Search Center. This will also be a neat way to show you how to get the raw XML for the search results, which is very helpful when you are troubleshooting issues with your search results (such as confirming that some property is being returned properly).

The first thing you need to do is execute a query that returns the item you want to remove from the index. Next you will need to customize the Search Core Results Web Part with this XSL (obtained from this MSDN article on how to get XML search results):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<xmp><xsl:copy-of select="*"/></xmp>
</xsl:template>
</xsl:stylesheet>

These are the detailed steps to configure the Search Core Results Web Part to use this XSL as well as return the additional property we need to run the command to remove the item from the index:

1) While editing the Search Core Results Web Part, uncheck the option “Use Location Visualization” under Display Properties

Use Location Visualization

2) Click to open the XSL Editor and replace all the contents of the XSL entered there with the XSL listed above

XSL Editor

3) Edit the Fetched Properties parameter, adding the following entry right after the opening <Columns> tag (and before the </Columns> tag): <Column Name="contentid"/>

4) Confirm the changes to the web part, save the page and check the new output of your search results page showing the full XML with the additional properties you wanted

Search results

Now, with the contentid in hand we can execute the command to remove this item from the index:

docpush -c sp -U -d ssic://4583

And with that you have a way to remove items from the FAST Search for SharePoint index whenever you need. It takes some work, but at least now you know how to accomplish it. :)


Understanding Crawled Properties, Managed Properties and Full-text Index – Part 1

A while ago I received an email from a previous student asking my thoughts about an issue in his environment. He was crawling some content that had a metadata field he wanted to use as the title for the indexed items, but even after doing the proper mapping to the Title managed property he was still not able to get his custom title to be used and appear in the results. “What could be going on?”, he wondered.

My first thought after reading his email was “I think I know what’s the issue”, followed closely by “Maybe I should look for another profession, since I was supposed to have taught this clearly during class”. And then I proceeded to explain to him what I will explain to you in this series of posts about Crawled Properties, Managed Properties, Full-text index, and how they all work together to make your search work.

To get to the bottom of this issue, we need to be perfectly clear on three concepts (ok, to tell you the truth you need to understand only two of them for this particular issue, but the third one is also very important, so please play along :)):

  • Crawled Properties (this post – Part 1)
  • Managed Properties (future post – Part 2)
  • Full-text Index (future post – Part 3)

Crawled Properties

As the official documentation states: “Crawled properties are automatically extracted from crawled content and grouped by category based on the protocol handler or IFilter used.”

What this means is that crawled properties are metadata associated with your content (such as title and url), which can be found in two ways: during crawling of the content and during content processing.

Crawled properties found during crawling

When you crawl content from a SharePoint Document Library, each column in the library is metadata associated with the corresponding document, so each column ends up being exposed as a crawled property. For example, in the document library below, each column will be exposed as a crawled property, including my custom column Department:

Document Library with custom column

After my first crawl of this content, I can quickly check that my custom column was exposed as a crawled property either through PowerShell or through Central Administration. To do it through PowerShell, I can run this simple cmdlet:

Get-FASTSearchMetadataCrawledProperty –Name ows_Department

Get-FASTSearchMetadataCrawledProperty output for ows_Department

In my case, it was easy to locate my property, since I already knew that all custom columns in a SharePoint document library are exposed as crawled properties with the prefix “ows_”, and they are also assigned to the category SharePoint. Now, what if I did not know this, or if this metadata was coming from somewhere else?

Then I could simply use the name of the column to do a wildcard search for it:

Get-FASTSearchMetadataCrawledProperty –Name *department
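A small, hedged addition: once the crawled property is located, the FS4SP admin object model can also tell you which managed properties it is already mapped to (the GetMappedManagedProperties call is assumed here, so verify it on your own farm):

# List the managed properties the crawled property is currently mapped to.
$cp = Get-FASTSearchMetadataCrawledProperty -Name "ows_Department"
$cp.GetMappedManagedProperties() | Select-Object Name, Type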

Crawled properties found during content processing

The second way crawled properties can be found is during content processing. One example is the set of metadata properties you can define for Office documents, such as the ones shown below for a Microsoft Word document:

Word document properties

Each one of these properties is also extracted during content processing and exposed as a crawled property. Take the Comments property, for example: any information stored in it will be extracted and exposed through the crawled property Office:6(Text).

Now you may be asking yourself: how am I supposed to know that a crawled property named Office:6(Text) will contain the values of the metadata property Comments from an Office document?

The only answer I have for that is this brilliant series of posts from Anne Stenberg, where she reverse-engineers the out-of-the-box crawled properties and their mappings to figure out what metadata they represent. In my case, all I had to do was check the post from her series that talks specifically about the crawled properties for Office documents.

Note: even though Anne’s articles mentioned above are all about SharePoint 2007 (MOSS 2007), so far they have been spot on every time I needed to find a specific crawled property in FAST Search for SharePoint.

Closing thoughts

Ok, so now we know that crawled properties are metadata found either through crawling or through content processing, and we also know how to identify specific crawled properties associated with our content. The real value, though, comes from understanding how we can use these crawled properties during search, as you can see below, where I’m searching for the contents I entered into the Comments property of my Word document:

Search for Comments

And this will be the subject of the next post in this series: Managed Properties. See you then! :)


Working with FAST Search for SharePoint and Multivalued Properties

Imagine the following scenario. You have some content in a database (or file share, or SharePoint site), and this content has some metadata composed of multiple values for one specific field, such as a list of authors for a book, a list of contributors for a project, or even a list of departments associated with an item. Your first question is: how do I configure this multivalued property (author, contributors, departments) to be crawled by FAST Search for SharePoint (FS4SP)?

After some thinking, you decide to return all of those values inside this one field, using a separator such as a semi-colon as a delimiter between each individual value. You run a full crawl against this content, find the crawled property associated with this multivalued metadata, map it to a new managed property and expose it in a refiner. All beautiful, correct?

Well, not quite. When you look at your resulting refiner, this is what you see:

FS4SP Multivalued Refiners

Notice how, instead of considering each individual value in the property, FS4SP is considering the whole property as one big value, which results in the refiner counters being all off.

The issue here is that FS4SP doesn’t know that this is a multivalued property, as the semi-colon is not a separator that it recognizes for multivalued items. To be able to get FS4SP to recognize your multivalued property and display the refiners correctly, you will need to follow a few steps:

  1. Configure the Managed Property with the correct options
  2. Create a custom processing component to apply the correct multivalued character separator
  3. Configure the Pipeline Extensibility to call your custom processing component and re-crawl your content
  4. Troubleshooting (let’s hope this won’t be needed :))


Configure the Managed Property with the correct options

The first thing you have to do is configure your multivalued Managed Property with the option MergeCrawledProperties set to true. You can do this through PowerShell using the Set-FASTSearchMetadataManagedProperty cmdlet, or you can do this through Central Administration, as shown below:

FS4SP MergeCrawledProperties setting

This is detailed in the MSDN documentation for the ManagedProperty Interface, where it defines:

MergeCrawledProperties

Specifies whether to include the contents of all crawled properties mapped to a managed property. If this setting is disabled, the value of the first non-empty crawled property is used as the contents of the managed property.

This property must also be set to True to include all values from a multivalued crawled property. If set to False, only the first value from a multivalued crawled property is mapped to the managed property.
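For reference, here is a hedged sketch of the PowerShell route (the managed property name departments is hypothetical, and you should confirm the exact cmdlet and object model members with Get-Help on your own farm):

# Enable MergeCrawledProperties through the FS4SP admin object model.
$mp = Get-FASTSearchMetadataManagedProperty -Name "departments"
$mp.MergeCrawledProperties = $true
$mp.Update()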


Create a custom processing component to apply the correct multivalued character separator

As I mentioned above, the main issue with the semi-colon character used as a separator is that FS4SP doesn’t recognize it as a multivalued separator, so in order to do this correctly you must create a custom processing component (in C#, PowerShell, or any other language) that replaces the simple string separator (in this case the semi-colon) with the special multivalued separator that FS4SP recognizes ("\u2029"). The detailed procedure to incorporate a custom processing component is described in this reference on MSDN.

In my specific case, I followed the great steps described by Mikael Svenson on how to use PowerShell to create quick-and-easy custom processing components. This proved to be a very quick approach to get my customization in place and be able to test it very quickly. Still, you should do this only for prototyping, as Mikael describes, because there is a performance penalty associated with the use of PowerShell, so it is recommended that you “port the code over to e.g. C# when you are done testing your code”.

My final custom code (directly inspired by Mikael’s post) to replace the semi-colon separator with the proper multivalued separator is shown below:

function CreateXml()
{
    param ([string]$set, [string]$name, [int]$type, $value)

    $resultXml = New-Object xml
    $doc = $resultXml.CreateElement("Document")

    $crawledProperty = $resultXml.CreateElement("CrawledProperty")
    $propSet = $resultXml.CreateAttribute("propertySet")
    $propSet.innerText = $set
    $propName = $resultXml.CreateAttribute("propertyName")
    $propName.innerText = $name
    $varType = $resultXml.CreateAttribute("varType")
    $varType.innerText = $type

    $crawledProperty.Attributes.Append($propSet) > $null
    $crawledProperty.Attributes.Append($propName) > $null
    $crawledProperty.Attributes.Append($varType) > $null

    $crawledProperty.innerText = $value

    $doc.AppendChild($crawledProperty) > $null
    $resultXml.AppendChild($doc) > $null
    $xmlDecl = $resultXml.CreateXmlDeclaration("1.0", "UTF-8", "")
    $el = $resultXml.psbase.DocumentElement
    $resultXml.InsertBefore($xmlDecl, $el) > $null

    return $resultXml
}

function DoWork()
{
    param ([string]$inputFile, [string]$outputFile)    
    $propertyGroupIn = "00130329-0000-0130-c000-000000131346" # SharePoint Crawled Property Category
    $propertyNameIn = "ows_DepartmentTest" # property name
    $dataTypeIn = 31 # string

    $propertyGroupOut = "00130329-0000-0130-c000-000000131346" # SharePoint Crawled Property Category
    $propertyNameOut = "ows_DepartmentTest" # property name
    $dataTypeOut = 31 # string

    $xmldata = [xml](Get-Content $inputFile -Encoding UTF8)
    $node = $xmldata.Document.CrawledProperty | Where-Object {  $_.propertySet -eq $propertyGroupIn -and  $_.propertyName -eq $propertyNameIn -and $_.varType -eq $dataTypeIn }
    $data = $node.innerText

    [char]$multivaluedsep = 0x2029
    [char]$currentsep = ';'
    
    #Replace current separator (semi-colon) with special multivalued separator
    $data = $data.Replace($currentsep, $multivaluedsep)
    
    $resultXml = CreateXml $propertyGroupOut $propertyNameOut $dataTypeOut $data
    $resultXml.OuterXml | Out-File $outputFile -Encoding UTF8
    
    #Copy-Item $inputFile C:\Users\Administrator\AppData\LocalLow
}

# pass input and output file paths as arguments
DoWork $args[0] $args[1]

The first section of interest in the DoWork function (the $propertyGroupIn/$propertyNameIn/$dataTypeIn and $propertyGroupOut/$propertyNameOut/$dataTypeOut variables) defines the input crawled property that will contain the items with the semi-colon separator, as well as the output crawled property that will store the updated content with the correct multivalued separator. In my case, both properties are the same, since I simply want to do an in-place replacement.

The second section of interest (the $multivaluedsep and $currentsep variables) defines the current separator (semi-colon) and the multivalued separator (0x2029 in PowerShell). In the line that follows, the replacement with the correct separator is applied to the input crawled property string.


Configure the Pipeline Extensibility to call your custom processing component and re-crawl your content

The next important step is to tell FS4SP that you want to call your custom processing component during content processing. To do this you must configure the %FASTSearch%\etc\pipelineextensibility.xml configuration file. This is how this file looked on my system:

<!-- For permissions and the most current information about FAST Search Server 2010 for SharePoint configuration files, see the online documentation, (http://go.microsoft.com/fwlink/?LinkId=1632279). -->

<PipelineExtensibility>
	<Run command="C:\Windows\System32\WindowsPowerShell\v1.0\PowerShell.exe C:\FASTSearch\bin\multivalued.ps1 %(input)s %(output)s">
		<Input>      
			<CrawledProperty propertySet="00130329-0000-0130-c000-000000131346" varType="31" propertyName="ows_DepartmentTest"/>
		</Input>
		<Output>
			<CrawledProperty propertySet="00130329-0000-0130-c000-000000131346" varType="31" propertyName="ows_DepartmentTest"/>
		</Output>
	</Run>
</PipelineExtensibility>

As you can see above, all I’m doing is defining that I want my custom PowerShell script to be called, receiving as an input crawled property my property that contains the contents with the semi-colon separator and then returning as output the same crawled property, in order to just replace its contents with the new-and-updated value, now using the multivalued separator.

After saving this configuration file, the next step is to force your Document Processors to reload their configuration so they can be aware of this new content processing component, which you can accomplish by executing psctrl reset in a command prompt.

With all the pieces in place, you can start a re-crawl of your content and then test your refiner after the crawl is complete. If all goes well, your refiner should now look exactly like you wanted!

FS4SP Multivalued Refiner - Correct


Troubleshooting

My main warning: pay close attention to the fact that the names of your input and output crawled properties (both in pipelineextensibility.xml and in the PowerShell script) are case-sensitive.

Many people have spent a very long time troubleshooting their code only to realize that it was a case-sensitive issue with the name of these properties. The best way I found to troubleshoot a new custom processing component is through these techniques:

  1. Investigate the contents of the input file sent to your custom code: as described in this post, the only path in the file system with full access for your custom code is the AppData\LocalLow directory for the account running the FAST Search Service. By uncommenting the Copy-Item line at the end of the DoWork function in the PowerShell script above, a copy of the input file received by the script will be created in the AppData\LocalLow directory. By looking at the contents of the input file you can see what the input crawled property actually contains. If the input crawled property doesn’t contain any value, and you are sure that your document has that property, check for issues with the case-sensitive property names.
  2. Validate the list of crawled properties received by FS4SP: you can accomplish this through the use of the optional processing stage FFDDumper.
  3. If both options 1 and 2 look ok, use the input file from step 1 to call your custom code directly and debug it to identify the error (you can debug the PowerShell script above using the ISE Editor)

And that’s it for today. Enjoy your coding and your multivalued properties! :)


SharePoint Search and FAST Search for SharePoint Architecture Diagrams – Fault Tolerance and Performance

Update: For those interested in watching a presentation of this content below you can download (right-click and select “Save target as..”) and watch this video here (200+ MB) that was recorded during a webcast on 2011-07-27. My presentation starts at 6min20sec.

In previous posts I showed and explained a few architecture diagrams of search in SharePoint 2010 for both SharePoint Search and FAST Search for SharePoint, shared my all-time-favorite resource on SharePoint Search Architecture and Scale for crawl and query, and (hopefully) helped you understand, scale and monitor Crawling / Processing / Indexing in FAST Search for SharePoint.

What I will try to do in this post is convert most of that content into additional diagrams that should help you “see” how changes related to fault tolerance and/or performance affect your search architecture.

These are the architecture diagrams discussed in this post:

SharePoint Search

FAST Search for SharePoint


SharePoint Search – Query Component (Fault Tolerance)

SharePoint 2010 - SharePoint Search Architecture Diagram - Query Component (Fault Tolerance)

In this diagram you see how your architecture would look after you add a new mirror Query Component for an existing Index Partition, which you do in order to provide fault tolerance for the lookup of matched items for full-text search queries against your index. The reasons for doing that are pretty simple (and detailed in here): if one server goes down, the other can still keep serving queries, and unless you configure the mirror server as “failover only” it will also distribute the load of incoming queries.
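If you prefer to script this instead of using Central Administration, the sketch below (hedged: server and SSA names are placeholders, and the topology activation step is omitted, so follow the full TechNet procedure) shows the idea of adding a mirrored Query Component with SharePoint 2010 PowerShell:

# Build a new query topology with one Index Partition served by two Query Components.
$ssa = Get-SPEnterpriseSearchServiceApplication "Search Service Application"
$qt  = New-SPEnterpriseSearchQueryTopology -SearchApplication $ssa -Partitions 1
$ip  = Get-SPEnterpriseSearchIndexPartition -QueryTopology $qt

$primary = Get-SPEnterpriseSearchServiceInstance -Identity "SearchServer1"
$mirror  = Get-SPEnterpriseSearchServiceInstance -Identity "SearchServer2"

New-SPEnterpriseSearchQueryComponent -QueryTopology $qt -IndexPartition $ip -SearchServiceInstance $primary
# Add -FailoverOnly if the mirror should serve queries only when the primary is down.
New-SPEnterpriseSearchQueryComponent -QueryTopology $qt -IndexPartition $ip -SearchServiceInstance $mirror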


SharePoint Search – Query Component (Performance)

SharePoint 2010 - SharePoint Search Architecture Diagram - Query Component (Performance)

In this diagram there is just a very subtle change from the previous one (marked in red), but it makes a lot of difference in your architecture: the additional Query Component has a different Index Partition. What this means is that now your content is divided between the two Index Partitions, so if for example you have a total of 6 million indexed items, then each Index Partition has 3 million items. This also means that your Query Processor will send requests in parallel to both Query Components and, since each one of them has to search against only half of the index (3 million out of 6 million total), they will be able to do this faster.

The supported number of indexed items is 100 million per search service application and 10 million for each Index Partition.


SharePoint Search – Property db (Performance)

SharePoint 2010 - SharePoint Search Architecture Diagram - Property Db (Performance)

Here things start to get interesting, with not only a new Query Component/Index Partition, but also a new Property db (added items marked in red). If you read this post (mentioned a dozen times by now :)) you understand that in order to provide search results, the Query Processor needs to perform a lookup not only in the Index Partition but also in the Property db, in order to retrieve the metadata associated with the results found. When you start to increase your indexed content, for example by having 20M items that you then split across 2 Index Partitions to improve your index lookup time, it may happen that your Property db becomes your bottleneck. A way to minimize this impact as the number of indexed items grows is to add a new Property db and assign a new Query Component/Index Partition to it. This way, each combination of Index Partition/Property db has to store and handle search requests for only half of the total number of indexed items.
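A hedged sketch of that change (database and SSA names are placeholders; the query topology being modified must not be the active one, and the usual activation steps still apply):

# Create a second Property db and point an Index Partition in the new (not yet
# active) query topology at it; assumes a new topology was created as in the earlier sketch.
$ssa = Get-SPEnterpriseSearchServiceApplication "Search Service Application"
$pdb = New-SPEnterpriseSearchPropertyDatabase -SearchApplication $ssa -DatabaseName "SearchSSA_PropertyDB2"

$qt = Get-SPEnterpriseSearchQueryTopology -SearchApplication $ssa | Where-Object { $_.State -ne "Active" } | Select-Object -First 1
$ip = Get-SPEnterpriseSearchIndexPartition -QueryTopology $qt | Select-Object -Last 1
Set-SPEnterpriseSearchIndexPartition -Identity $ip -PropertyDatabase $pdb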

It is also important to notice that all search-related databases (Property db, Search Admin db and Crawl db) can be configured for fault tolerance through the use of database mirroring.


SharePoint Search – Query Processor (Fault Tolerance and Performance)

SharePoint 2010 - SharePoint Search Architecture Diagram - Query Processor (Fault Tolerance and Performance)

Even after you have scaled out your Query Components, Index Partitions and Property dbs, another component that may require your attention is the Query Processor. This is the component that does the hard work of accessing the Query Component (to check items that match the query), the Property db (to get metadata associated with those items) and the Search Admin db (to get security descriptors in order to apply security trimming to the results). By adding a new Query Processor (marked in red and described in here), you divide the load of this task across multiple servers, increasing your query performance and providing fault tolerance (if one goes down, the other can still handle queries).
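In PowerShell terms this is the “Search Query and Site Settings Service”; a hedged sketch (the server name is a placeholder, and the exact pipebind behavior of these cmdlets should be verified) of starting it on a second server would be:

# Start the Search Query and Site Settings Service instance on an additional server.
Get-SPEnterpriseSearchQueryAndSiteSettingsServiceInstance -Identity "QueryServer2" |
    Start-SPEnterpriseSearchQueryAndSiteSettingsServiceInstance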


SharePoint Search – Crawl Component (Fault Tolerance and Performance)

SharePoint 2010 - SharePoint Search Architecture Diagram - Crawl Component (Fault Tolerance and Performance)

Now let’s take a look at the other side of search: Crawling/Processing/Indexing. You can see a new Crawl Component added in the diagram above; what does this mean? It means that both Crawl Components will split the load of crawling the content sources defined, and both will keep pulling from and updating the crawling queue stored in the Crawl db. For example, if your full crawl with one Crawl Component and one Crawl db was taking 4 days, by adding another Crawl Component (and assuming you have sufficient CPU/Memory/IO/bandwidth/etc. resources) the same full crawl should be reduced to around 2 days. And with two Crawl Components working from the same Crawl db, you also get fault tolerance in case one of them goes down.
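A hedged sketch of what adding that second Crawl Component looks like in SharePoint 2010 PowerShell (server and SSA names are placeholders, and the new crawl topology still has to be activated afterwards):

$ssa = Get-SPEnterpriseSearchServiceApplication "Search Service Application"
$ct  = New-SPEnterpriseSearchCrawlTopology -SearchApplication $ssa
$db  = Get-SPEnterpriseSearchCrawlDatabase -SearchApplication $ssa | Select-Object -First 1

# Two Crawl Components sharing the same Crawl db split the crawl load and give fault tolerance.
foreach ($server in "CrawlServer1", "CrawlServer2")
{
    $inst = Get-SPEnterpriseSearchServiceInstance -Identity $server
    New-SPEnterpriseSearchCrawlComponent -CrawlTopology $ct -CrawlDatabase $db -SearchServiceInstance $inst
}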


SharePoint Search – Crawl Component and Crawl db (Performance)

SharePoint 2010 - SharePoint Search Architecture Diagram - Crawl Component and Crawl Db (Performance)

What happens when you start to add many Crawl Components to the same Crawl db? Well, the db can easily become your bottleneck. One way to keep scaling out and increasing your crawling performance is through the use of an additional set of Crawl Component/Crawl db, as shown in the diagram above. In this way, distinct content sources (web applications, web sites, file shares, etc.) will be split among these two Crawl dbs, and their respective Crawl Components will have to handle (crawl/process/index) only part of the content, making it easier to deal with.

There are a lot of things that go into this, from how content to be crawled is split among multiple Crawl dbs to how you can manually define this mapping yourself (if you want to). All of this and more are detailed in this post here.


FAST Search for SharePoint – Content Processing (Fault Tolerance and Performance)

SharePoint 2010 - FAST Search Architecture Diagram - Content Processing (Fault Tolerance and Performance)

Since we are starting with content processing, you may be asking “what about the crawling part of FAST Search?”. Well, the good news is that if you are using the FAST Content SSA to crawl your content, then your crawling architecture looks pretty much like what we just saw for SharePoint Search above. The main difference is that the FAST Content SSA will be tasked only with crawling, since processing and indexing will be done in the FAST Search farm. And speaking of content processing, the first component that can be scaled out is the Content Distributor (shown above in red). What this gives you is just fault tolerance, since the FAST Content SSA will connect and send batches to only one Content Distributor at a time, and will switch to the other one only if it fails to submit batches to the “primary” Content Distributor (you must also make sure to configure the FAST Content SSA with both Content Distributors listed).

As for the Document Processors, you will definitely have more than one (you get 4 of them by default in a simple installation), which gives you both fault tolerance (in case one of them goes down) and performance (since they work in parallel). Also, if the “primary” Content Distributor goes down, the Document Processors will be smart enough to switch to the other available Content Distributor.


Indexer (Fault Tolerance)

SharePoint 2010 - FAST Search Architecture Diagram - Indexer (Fault Tolerance)

Remember the option to mirror an Index Partition in SharePoint Search to provide fault tolerance? FAST Search can do something similar, though with a name change: the documentation refers to this process as adding a backup indexer row. In this case both Indexers will have the same content, which means that if your primary Indexer goes down, the backup Indexer can be configured to become the new primary Indexer.


Indexer (Performance)

SharePoint 2010 - FAST Search Architecture Diagram - Indexer (Performance)

In the diagram above, instead of adding a new backup Indexer for fault tolerance, a new Indexer column was added to increase the volume of indexed content that can be stored in your search farm. In this scenario your content will be divided among the two Indexer columns (very similar to how we divided the content into separate Index Partitions for SharePoint Search).

The official guideline is to have one Indexer column for every 15 million items to be indexed.


Indexer and Search (Fault Tolerance)

SharePoint 2010 - FAST Search Architecture Diagram - Indexer and Search (Fault Tolerance)

Above is the diagram of a somewhat common deployment of FAST Search for SharePoint, where you have two servers and each one is configured with a combination of Indexer and Search in a way that one server is the primary Indexer and backup Search, and the other server is backup Indexer and primary Search. In this way, with just your two servers you are providing fault tolerance for both Indexer and Search.


Query Processing (Fault Tolerance)

SharePoint 2010 - FAST Search Architecture Diagram - Query Processing (Fault Tolerance)

In the diagram above, a Query Processing server (with QRServer, QRProxy and FSA Worker components) was added to the FAST Search farm and also properly configured in the FAST Query SSA by listing both servers in its setup. With this configuration, queries will be sent to both servers in round-robin fashion, and if one of the servers fails the FAST Query SSA will keep sending queries only to the active server.

Conclusion

There is a lot you can configure in both SharePoint Search and FAST Search for SharePoint to increase performance and/or provide fault tolerance for components of your search farm. The important thing is to understand what options are available for each platform and keep them in mind when you first design your search architecture as well as after your search project is in production, in case you need to scale out your deployment.
