How to get authenticated/secure results through the QRServer in FAST Search for SharePoint

I received an email from an ex-student today that forced me to remember how to send an authenticated query to the QRServer in FAST Search for SharePoint.

The reason for doing this is that when you issue a query through the SharePoint UI, additional security parameters are sent to FAST along with the query. But when you go directly against the QRServer interface (accessible through http://localhost:13280 directly on the server running the query component in the FAST farm), the queries typed in there are sent without any security parameters by default, which means you will not get back any results that require security permissions (such as all your crawled SharePoint content, for example).

I’ve sent instructions to students on how to get authenticated results from the QRServer many times in the past, and even commented about it in this post here, but I just realized I never posted this here on the blog, so I’m doing it now to make this information easier to find.

Below are the steps to get secure results through the QRServer without having to modify qtf-config.xml (which is not advisable anyway):

Note: you will need to perform the steps below on a query server in your FAST farm

  1. Edit %FASTSEARCH%\components\sam\worker\user_config.xml
  2. Change:
    <add name="AllowNonCleanUpClaimsCacheForTestingOnly" value="false" type="System.Boolean" />
    To:
    <add name="AllowNonCleanUpClaimsCacheForTestingOnly" value="true" type="System.Boolean" />
  3. To pick up your changes, open a command prompt window and restart the samworker
    nctrl restart samworker
  4. Make sure the samworker is running. If it is not running, check your previous edits.
    nctrl status
  5. Execute a query through a search center in SharePoint and ensure results are returned. You will use the security credentials from this query to get secure results from the QRServer.
  6. Navigate to %FASTSEARCH%\var\log\querylogs and open your latest query log (if the file is locked, make a copy of the file and open the copy).
  7. Locate and copy this parameter: &qtf_securityfql:uid=<token>= (the trailing equal sign should be included)
  8. Navigate to the qrserver page: http://localhost:13280/
  9. In the additional parameters text box add:
    &qtf_securityfql:uid=<token>=
  10. Issue a query and ensure you get secure results back.
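
If you prefer to script this instead of using the QRServer page, the same parameter can be appended to a direct query URL. Below is a minimal PowerShell sketch, assuming the token was copied from the query log exactly as logged; the query term is just an example:

# Minimal sketch: query the QRServer directly, passing the security token
# captured from the query log (run this on the query server itself)
$token = "<token>="   # paste the real value, including the trailing equal sign
$url = "http://localhost:13280/cgi-bin/search?query=sharepoint&qtf_securityfql:uid=$token"
(New-Object System.Net.WebClient).DownloadString($url)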

Another way to get authenticated results from outside the SharePoint UI, without making any modifications to your system, is to use the terrific FAST Search for SharePoint 2010 Query Logger tool created by Mikael Svenson.

Enjoy! 🙂


Goodbye Microsoft, Hello New Year and New Ventures

“All changes, even the most longed for, have their melancholy; for what we leave behind us is a part of ourselves; we must die to one life before we can enter another.” ~ Anatole France

This has pretty much been my feeling for the past few weeks, as I gathered my things and organized everything for my departure from Microsoft. As of today (or rather, as of yesterday at midnight) I’m no longer working as a Senior Technical Instructor at Microsoft.

Combining my time at FAST and then at Microsoft after the acquisition, I was with the company for over 6 years. During this time I’ve made many friends, worked on challenging and exciting projects, and also had a chance to spend a lot of time exploring, understanding and trying to help others understand (at least I hope so :)) the enterprise search world. In short, it was a lot of fun and I learned a LOT. Which is why I decided to change and keep learning…

Starting today I’m joining some great friends over at cXense as their Director of Customer Excellence in the Americas (US, Canada and Latin America), which is a very fancy title for saying that I will continue doing what I love to do most: helping customers be successful.

As for the future of the blog, it will get even better. I will keep trying to help as much as I can, responding to questions and comments, as well as posting about enterprise search, this time also including posts about other flavors of search (such as Solr/Lucene and cXsearch, which I can’t wait to learn more about!). I’ve also invited a few friends, who are still working with SharePoint/FAST Search on a daily basis, to guest post here now and then. Questions about any of these search technologies will be more than welcome, as always!

6 years ago I got a call inviting me to join FAST and, by accepting that offer, I had one of my best professional experiences to date. When the same people called me again, this time with an invitation to join cXense, I had no choice but to say yes and look forward to many more adventures. 🙂


2011 in review – Annual Report by WordPress

It’s very interesting to look at the 2011 annual report for my blog that was put together by WordPress. Not surprisingly, the most read posts were the learning roadmaps. It seems like with so much information about SharePoint Search & FAST Search for SharePoint spread all over the place, it becomes even more important to have a clear path to follow to learn more.

Now let’s get ready for 2012! 🙂

The WordPress.com stats helper monkeys prepared a 2011 annual report for this blog.

Here’s an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 14,000 times in 2011. If it were a concert at Sydney Opera House, it would take about 5 sold-out performances for that many people to see it.


How “Remove Duplicate Results” works in FAST Search for SharePoint

This is a question that I have received quite a few times, and this time I thought I would get some screenshots and detail the process a little bit here so that other folks can take advantage of this info as well.

First of all, what is the “Remove Duplicate Results” feature? It is a feature that tells the search engine to collapse results that are perceived as duplicates (such as the same document located in different paths), so that only one instance of the document is returned in the search results (instead of showing the multiple copies to the end-user).

The setting that enables this feature (which is on by default) in a Search Center is available at the Search Core Results Web Part, under the section Result Query Options, as shown below:

Search Core Results Web Part - Remove Duplicate Results

Once this option is enabled, any duplicate items will be collapsed in the search results, as you can see in this example:

FAST Search for SharePoint Duplicate Results Removal

And if you want to see all the duplicates (in order to delete one of them in the source repository, for example), all you have to do is click the “Duplicates (2)” link highlighted above. This will execute another query, filtering results to display only the duplicates of the item you selected:

FAST Search for SharePoint Duplicate Removal Example


Now let’s investigate how this feature works. To do this, we will go in reverse order (from search to processing) to understand all the pieces involved.

The first clue is that this is enabled/disabled at search time, so there must be some parameter being sent by the Search Center to FAST Search for SharePoint to indicate that duplicate removal is enabled. Taking a look at one of the full search requests in the querylogs (%FASTSearch%\var\log\querylogs\), we can confirm this:

/cgi-bin/search?hits=10&resubmitflags=1&rpf_navigation:hits=50&query=sharepoint&spell=suggest&collapsenum=1&qtf_lemmatize=True…&collapseon=batvdocumentsignature&type=kwall…

Note: the querylog shown above has some query parameters removed so we can focus on the items that matter to duplicate removal.

As you can see, there are two parameters sent to FAST Search for SharePoint indicating which property should be used for collapsing (batvdocumentsignature) and how many items should be kept after collapsing is performed (1). And if we want more information about these options, the MSDN documentation explains these two parameters used for duplicate removal (the names differ because the querylog shows the internal query parameter names received by FAST Search for SharePoint):

onproperty – Specifies the name of a non-default managed property to use as the basis for duplicate removal. The default value is the DocumentSignature managed property. The managed property must be of type Integer. By using a managed property that represents a grouping of items, you can use this feature for field collapsing.

keepcount – Specifies the number of items to keep for each set of duplicates. The default value is 1. It can be used for result collapsing use cases. If TrimDuplicates is based on a managed property that can be used as a group identifier (for example, a site ID), you can control how many results are returned for each group. The items returned are the items with the highest dynamic rank within each group.

The last parameter that can be used with the Duplicate Removal feature is also described in the MSDN article, explaining what happened behind the scenes when we clicked the “Duplicates (2)” link to display all the duplicates for that item:

includeid – Specifies the value associated with a collapse group, typically used when a user clicks the Duplicates (n) link of an item with duplicates. This value corresponds to the value of the fcoid managed property that is returned in query results.
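
These same options are also exposed through the Query Object Model, in case you want to experiment outside the Search Center. Here is a hedged PowerShell sketch (not from the MSDN article itself); run it in a SharePoint 2010 Management Shell, and note that the site URL and query text are just examples:

# Load the search assembly and issue a query with explicit duplicate trimming
[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.Office.Server.Search")
$site = Get-SPSite "http://intranet"   # example URL
$query = New-Object Microsoft.Office.Server.Search.Query.KeywordQuery($site)
$query.ResultsProvider = [Microsoft.Office.Server.Search.Query.SearchProvider]::FASTSearch
$query.QueryText = "sharepoint"
$query.TrimDuplicates = $true                          # becomes collapsenum/collapseon in the querylog
$query.TrimDuplicatesOnProperty = "documentsignature"  # the 'onproperty' parameter
$query.TrimDuplicatesKeepCount = 1                     # the 'keepcount' parameter
$results = $query.Execute()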

Ok, so far we know that duplicate removal is enabled by default and is applied by collapsing results that have the same value for the managed property DocumentSignature. Let’s have a look at the settings for this managed property then:

FAST Search for SharePoint - DocumentSignature managed property settings

As you can see, the type of this managed property is Integer (which the MSDN article defined as a requirement) and it is also configured both as Sortable and Queryable. The peculiar thing is that when looking at the crawled properties mapped to this managed property we get nothing as a result, which indicates that the value for this property is most likely being computed by FAST Search for SharePoint during content processing.
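
If you want to verify this yourself, here is a quick sketch from a FAST admin PowerShell shell (hedged: the exact property names shown in Format-List are my assumption of the most relevant ones on the ManagedProperty object):

$mp = Get-FASTSearchMetadataManagedProperty -Name documentsignature
$mp | Format-List Name, Type, Queryable, SortableType
$mp.GetCrawledPropertyMappings()   # returns nothing: no crawled properties are mapped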

So let’s take a look at some lower-level configuration files to track this down. We start with the mother of all configuration files related to the content processing pipeline: %FASTSearch%\etc\PipelineConfig.xml. This is a file that can’t be modified (it is not included in the list of configuration files that may be modified), but nothing prevents us from just looking at it. After opening this configuration file and searching for “documentsignature”, you will find the definition of the stage responsible for assigning the value to this property:

    <processor name="DocumentSignature" type="general" hidden="0">
      <load module="processors.DuplicateId" class="DuplicateId"/>
      <config>
       <param name="Output" value="documentsignature" type="string"/>
       <param name="Input"  value="title:0:required body:1024:required documentsignaturecontribution" type="string"/>
      </config>
    </processor>

The parameters that matter most to us are the following:

  • Input: the properties used to calculate the document signature -> title and the first 1024 bytes of body (as well as a property called documentsignaturecontribution, which is also used if it has any value)
  • Output: our dear documentsignature property

And with this we get to the bottom of this duplicate removal feature, which means it is a good time to recap everything we found out:

  1. During content processing, for every item being processed, FAST Search for SharePoint will obtain the value of title and the first 1024 bytes of body for this item, and use them to compute a numerical checksum that will be used as a document signature. This checksum is stored in the property documentsignature for every item processed.
  2. During query time, whenever “Remove Duplicate Results” is enabled, the Search Center tells FAST Search for SharePoint to collapse results using the documentsignature property, effectively eliminating any duplicates for items that have the same title+first-1024-bytes-of-body.
  3. When a user clicks on the “Duplicates (n)” link next to an item that has duplicates, another query is submitted to FAST Search for SharePoint, passing as an additional parameter the value of the fcoid managed property for the item selected, which will be used to return all items that contain the same checksum (aka “the duplicates”).

A few extra questions that may appear in your head after you read this:

  • Is it possible to collapse results by anything other than documentsignature and use this feature for other types of collapsing (collapse by site, collapse by category, etc.)?
    • Answer: Yes, it is absolutely possible. All you will need is an Integer, Sortable, Queryable managed property containing the values you want to collapse by + a custom web part (or an extended Search Core Results Web Part) where you request this managed property to be used for collapsing and how many items should be preserved (as explained in this MSDN article linked before)
  • Can two items that are not identical be considered as duplicates?
    • Answer: Yep. As we saw above, only the first 1024 bytes of body are used for calculating the checksum, which means that any other differences these documents may have beyond the first 1024 bytes will not be considered for the purposes of duplicate removal. (Note: roughly speaking, the body property will have just the text of the document, without any of the formatting)
  • Can I change how the default document signature is computed?
    • Answer: Yes, read the update at the end of this post. (My original answer was: No, since this is done by one of the FAST Search for SharePoint out-of-the-box stages. But what you can do is calculate your own checksum using the Pipeline Extensibility and then use your custom property for duplicate removal.)
  • When you have duplicates and only one item is shown in the results, how does FS4SP decide which item to show?
    • Answer: The item shown is the item with the highest rank score among all the duplicates.

That’s it for today. If any other questions related to duplicate removal surface after this post is published, I will add them to the list above. 🙂

UPDATE (2011-12-09): Yes, you can influence how document signature is computed!

After I posted this article yesterday I was still thinking about that documentsignaturecontribution property mentioned above, as I had a feeling that there was a way to use it to influence how the document signature is computed. Well, today I found some time to test it and yes, it works! Here is how to do it.

What you have to do is create a new managed property with exactly the name documentsignaturecontribution and then map to this managed property any values that you also want to use for the checksum computation (as with other managed properties, to assign a value to this property you must map a crawled property to it).

You need a managed property because the DocumentSignature stage is executed after the mapping from crawled properties to managed properties, so FAST Search for SharePoint is looking for a managed property named documentsignaturecontribution to use as part of the checksum computation. When you create this managed property and assign it some value, FAST Search for SharePoint simply uses this, along with title and the first 1024 bytes of body, to calculate the checksum.
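
If you prefer a FAST admin shell over Central Administration, here is a hedged sketch of these steps (the mapping cmdlet usage is my assumption; verify it against the FS4SP cmdlet help). It creates the managed property and copies over the crawled property mappings used by the path managed property, mirroring what I describe in my test below:

# Create the special managed property (Type 1 = Text)
$mp = New-FASTSearchMetadataManagedProperty -Name documentsignaturecontribution -Type 1
# Map the same crawled properties that feed the "path" managed property;
# adjust this to whatever carries the value you want added to the checksum
$pathMp = Get-FASTSearchMetadataManagedProperty -Name path
Get-FASTSearchMetadataCrawledPropertyMapping -ManagedProperty $pathMp | ForEach-Object {
    New-FASTSearchMetadataCrawledPropertyMapping -ManagedProperty $mp -CrawledProperty $_
}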

I followed the great idea from Mikael Svenson to create two text files filled with the same content just to force them to be perceived as duplicates by the system. The key here was to create these two files with exactly the same name and content, but put them in different folders. This way I could guarantee that both items had the same title and body, which would result in them having the same checksum. This was confirmed after I crawled the folders with these items:

FAST Search for SharePoint - Duplicate Removal - Example of duplicates

Both items had the same checksum, which could be checked by looking at the property fcoid that was returned with the results:

fcoid: 102360510986285564

My next step was to create the managed property documentsignaturecontribution and map to it some value that would allow me to distinguish between the two files. In my case, that value was the path to the items, which were located in different folders. So, after creating my managed property documentsignaturecontribution of type Text, I mapped to it the same crawled properties that are mapped to the managed property path, just to make sure I would get the same value as that property:

FAST Search for SharePoint - Duplicate Removal - documentsignaturecontribution managed property

With this done, I just had to perform another full crawl to force SharePoint to process both of my text files again, and confirm that they were not perceived as duplicates anymore (since I was also using their path to compute the checksum):

FAST Search for SharePoint - Duplicate Removal - Items no longer perceived as duplicates

Another look into the fcoid property confirmed that both items did have different checksums now:

  • file://demo2010a/c$/testleo/test.txt – fcoid: 334483385708934799
  • file://demo2010a/c$/testleo/subdir/test.txt – fcoid: 452732803969334619

So what we learned with this is that you can influence how the document signature is computed by creating this managed property documentsignaturecontribution and mapping to it any value you want to be part of the checksum computation. And if you want to use the full body of the document to compute the checksum, you could accomplish this through the following steps:

  1. add to the Pipeline Extensibility a custom process that uses the body property as input and returns a checksum crawled property, based on the full contents of the body, as the output (a fragment illustrating this is sketched after this list)
  2. map this checksum crawled property returned in the previous step to the managed property documentsignaturecontribution
  3. re-crawl your content and celebrate when it works 🙂
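
To make step 1 more concrete, here is a hypothetical fragment in the spirit of the pipeline extensibility script shown in the multivalued-properties post further down this page ($bodyText stands for the body text read from the input file; all names are illustrative):

# Derive an integer signature from the full body text
$md5 = [System.Security.Cryptography.MD5]::Create()
$hashBytes = $md5.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($bodyText))
$checksum = [System.BitConverter]::ToInt64($hashBytes, 0)
# emit $checksum as the output crawled property and map it to documentsignaturecontribution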



How to force an item to be removed from the index immediately with FAST Search for SharePoint

Very important reminder: the tip below will explain how to remove an item from the index, yet this does not prevent it from being picked up again during the next crawl (full/incremental). In order to prevent the item from being crawled again, additional steps must be taken (such as creating a crawl rule for the item). Big thanks to Mikael Svenson for the reminder!

Earlier today I posted an article on Microsoft Learning’s Born to Learn website, and I thought it would be an interesting article for the audience here as well. The article covers how to force FAST Search for SharePoint to remove an item from the index immediately, without having to wait until the next crawl.

In the article linked above I showed how to obtain the Item ID (or ssic://<id>) using the Crawl Logs, so here in this post I will show how to obtain it directly through the Search Center. This will also be a neat way to show you how to get the raw XML for the search results, which is very helpful when you are troubleshooting issues with your search results (such as confirming that some property is being returned properly).

The first thing you need to do is execute a query that returns the item you want to remove from the index. Next you will need to customize the Search Core Results Web Part with this XSL (obtained from this MSDN article on how to get XML search results):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<xmp><xsl:copy-of select="*"/></xmp>
</xsl:template>
</xsl:stylesheet>

These are the detailed steps to configure the Search Core Results Web Part to use this XSL as well as return the additional property we need to run the command to remove the item from the index:

1) While editing the Search Core Results Web Part, uncheck the option “Use Location Visualization” under Display Properties

Use Location Visualization

2) Click to open the XSL Editor and replace all the contents of the XSL entered there with the XSL listed above

XSL Editor

3) Edit the Fetched Properties parameter, adding the following entry right after the opening <Columns> tag (and before the </Columns> tag): <Column Name="contentid"/>

4) Confirm the changes to the web part, save the page and check the new output of your search results page showing the full XML with the additional properties you wanted

Search results

Now, with the contentid in hand we can execute the command to remove this item from the index:

docpush -c sp -U -d ssic://4583

And with that you have a way to remove items from the FAST Search for SharePoint index whenever you need to. It does take some work, but at least it gets the job done. 🙂
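
As an aside, if you would rather not touch the web part at all, the contentid can also be fetched through the Query Object Model. A hedged sketch, to be run in a SharePoint 2010 Management Shell (the site URL and query text are examples):

[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.Office.Server.Search")
$site = Get-SPSite "http://intranet"   # example URL
$query = New-Object Microsoft.Office.Server.Search.Query.KeywordQuery($site)
$query.ResultsProvider = [Microsoft.Office.Server.Search.Query.SearchProvider]::FASTSearch
$query.QueryText = "title of the item to remove"
[void]$query.SelectProperties.Add("contentid")
$results = $query.Execute()
$relevant = $results[[Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults]
$table = New-Object System.Data.DataTable
$table.Load($relevant)   # ResultTable implements IDataReader
$table | Select-Object title, contentid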


Understanding Crawled Properties, Managed Properties and Full-text Index – Part 1

A while ago I received an email from a previous student asking my thoughts about an issue in his environment. He was crawling some content that had a piece of metadata he wanted to use as the title for the indexed items, but even after doing the proper mapping to the Title managed property he was still not able to get his custom title to be used and appear in the results. “What could be going on?”, he wondered.

My first thought after reading his email was “I think I know what the issue is”, followed closely by “Maybe I should look for another profession, since I was supposed to have taught this clearly during class”. And then I proceeded to explain to him what I will explain to you in this series of posts about Crawled Properties, Managed Properties, the Full-text index, and how they all work together to make your search work.

To get to the bottom of this issue, we need a perfectly clear understanding of three concepts (ok, to tell you the truth you need to understand only two concepts for this particular issue, but the third one is also very important, so please play along 🙂):

  • Crawled Properties (this post – Part 1)
  • Managed Properties (future post – Part 2)
  • Full-text Index (future post – Part 3)

Crawled Properties

As the official documentation states: “Crawled properties are automatically extracted from crawled content and grouped by category based on the protocol handler or IFilter used.”

What this means is that crawled properties are metadata associated with your content (such as title and url), which can be found in two ways: during crawling of the content and during content processing.

Crawled properties found during crawling

When you crawl content from a SharePoint Document Library, each column in the library is metadata associated with the corresponding document, and therefore each ends up being exposed as a crawled property. For example, in the document library below, every column will be exposed as a crawled property, including my custom column Department:

Document Library with custom column

After my first crawl of this content, I can quickly check that my custom column was exposed as a crawled property either through PowerShell or through Central Administration. To do it through PowerShell, I can run this simple cmdlet:

Get-FASTSearchMetadataCrawledProperty -Name ows_Department

Get-FASTSearchMetadataCrawledProperty output

In my case, it was easy to locate my property, since I already knew that all custom columns in a SharePoint document library are exposed as crawled properties with the prefix “ows_”, and they are also assigned to the category SharePoint. Now, what if I did not know this, or if this metadata was coming from somewhere else?

Then I could simply use the name of the column to do a wildcard search for it:

Get-FASTSearchMetadataCrawledProperty -Name *department

Crawled properties found during content processing

The second way crawled properties can be found is during content processing. One example is the metadata properties you can define for Office documents, such as the ones shown below for a Microsoft Word document:

Word document properties

Each one of the properties shown above is also extracted during content processing and exposed as a crawled property. Take the property Comments shown above, for example. Any information stored in this property will be extracted and exposed through the crawled property Office:6(Text).

Now you may be asking yourself: how am I supposed to know that a crawled property named Office:6(Text) will contain the values of the metadata property Comments from an Office document?

The only answer I have for that is this brilliant series of posts from Anne Stenberg, where she reverse-engineers the out-of-the-box crawled properties and their mappings to figure out what metadata they represent. In my case, all I had to do was check this post here from her series that talks specifically about the crawled properties for Office documents.

Note: even though Anne’s articles mentioned above are all about SharePoint 2007 (MOSS 2007), so far they have been spot on every time I needed to find a specific crawled property in FAST Search for SharePoint.
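
Once you know the category and name, you can also inspect the property from a FAST admin shell. A hedged sketch, assuming the internal name is "6" in the "Office" category (following the Office:6(Text) naming convention):

# Enumerating all crawled properties can take a while on a large schema
$cp = Get-FASTSearchMetadataCrawledProperty |
    Where-Object { $_.CategoryName -eq "Office" -and $_.Name -eq "6" }
$cp | Format-List Name, CategoryName, VariantType
$cp.GetMappedManagedProperties()   # which managed properties receive this value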

Closing thoughts

Ok, so now we know that crawled properties are metadata found either through crawling or through content processing, and we also know how to identify specific crawled properties associated with our content. The real payoff, though, will come from understanding how we can use these crawled properties during search, as you can see below, where I’m searching for the content I entered into the Comments property of my Word document:

Search for Comments

And this will be the subject of the next post in this series: Managed Properties. See you then! 🙂


Working with FAST Search for SharePoint and Multivalued Properties

Imagine the following scenario. You have some content in a database (or file share, or SharePoint site), and this content has some metadata comprised of multiple values in one specific field, such as the list of authors of a book, the list of contributors to a project, or a list of departments associated with an item. Your first question is: how do I configure this multivalued property (authors, contributors, departments) to be crawled by FAST Search for SharePoint (FS4SP)?

After some thinking, you decide to return all of those values inside this one field, using a separator such as a semi-colon as a delimiter between each individual value. You run a full crawl against this content, find the crawled property associated with this multivalued metadata, map it to a new managed property and expose it in a refiner. All beautiful, correct?

Well, not quite. When you look at your resulting refiner, this is what you see:

FS4SP Multivalued Refiners

Notice how, instead of considering each individual value in the property, FS4SP is considering the whole property as one big value, which results in the refiner counters being all off.

The issue here is that FS4SP doesn’t know that this is a multivalued property, as the semi-colon is not a separator that it recognizes for multivalued items. To be able to get FS4SP to recognize your multivalued property and display the refiners correctly, you will need to follow a few steps:

  1. Configure the Managed Property with the correct options
  2. Create a custom processing component to apply the correct multivalued character separator
  3. Configure the Pipeline Extensibility to call your custom processing component and re-crawl your content
  4. Troubleshooting (let’s hope this won’t be needed 🙂)

Configure the Managed Property with the correct options

The first thing you have to do is configure your multivalued Managed Property with the option MergeCrawledProperties set to true. You can do this through PowerShell using the Set-FASTSearchMetadataManagedProperty cmdlet, or you can do this through Central Administration, as shown below:

FS4SP MergeCrawledProperties setting
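
For reference, here is the PowerShell route as a minimal sketch (the managed property name is an example, and the Update() call is how the FS4SP administration object model commits schema changes):

$mp = Get-FASTSearchMetadataManagedProperty -Name departmenttest   # example name
$mp.MergeCrawledProperties = $true
$mp.Update()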

This is detailed in the MSDN documentation for the ManagedProperty Interface, where it defines:

MergeCrawledProperties

Specifies whether to include the contents of all crawled properties mapped to a managed property. If this setting is disabled, the value of the first non-empty crawled property is used as the contents of the managed property.

This property must also be set to True to include all values from a multivalued crawled property. If set to False, only the first value from a multivalued crawled property is mapped to the managed property.

Create a custom processing component to apply the correct multivalued character separator

As I mentioned above, the main issue with the semi-colon character used as a separator is that FS4SP doesn’t recognize it as a multivalued separator, so in order to do this correctly you must create a custom processing component (in C#, in PowerShell, or any other language) that can replace the simple string separator (in this case the semi-colon) with the special multivalued separator that FS4SP can recognize ("\u2029"). The detailed procedure to incorporate a custom processing component is detailed in this reference on MSDN.

In my specific case, I followed the great steps described by Mikael Svenson on how to use PowerShell to create quick-and-easy custom processing components. This proved to be a very quick approach to get my customization in place and be able to test it very quickly. Still, you should do this only for prototyping, as Mikael describes, because there is a performance penalty associated with the use of PowerShell, so it is recommended that you “port the code over to e.g. C# when you are done testing your code”.

My final custom code (directly inspired by Mikael’s post) to replace the semi-colon separator with the proper multivalued separator is shown below:

function CreateXml()
{
    param ([string]$set, [string]$name, [int]$type, $value)

    $resultXml = New-Object xml
    $doc = $resultXml.CreateElement("Document")

    $crawledProperty = $resultXml.CreateElement("CrawledProperty")
    $propSet = $resultXml.CreateAttribute("propertySet")
    $propSet.innerText = $set
    $propName = $resultXml.CreateAttribute("propertyName")
    $propName.innerText = $name
    $varType = $resultXml.CreateAttribute("varType")
    $varType.innerText = $type

    $crawledProperty.Attributes.Append($propSet) > $null
    $crawledProperty.Attributes.Append($propName) > $null
    $crawledProperty.Attributes.Append($varType) > $null

    $crawledProperty.innerText = $value

    $doc.AppendChild($crawledProperty) > $null
    $resultXml.AppendChild($doc) > $null
    $xmlDecl = $resultXml.CreateXmlDeclaration("1.0", "UTF-8", "")
    $el = $resultXml.psbase.DocumentElement
    $resultXml.InsertBefore($xmlDecl, $el) > $null

    return $resultXml
}

function DoWork()
{
    param ([string]$inputFile, [string]$outputFile)    
    $propertyGroupIn = "00130329-0000-0130-c000-000000131346" # SharePoint Crawled Property Category
    $propertyNameIn = "ows_DepartmentTest" # property name
    $dataTypeIn = 31 # string

    $propertyGroupOut = "00130329-0000-0130-c000-000000131346" # SharePoint Crawled Property Category
    $propertyNameOut = "ows_DepartmentTest" # property name
    $dataTypeOut = 31 # string

    $xmldata = [xml](Get-Content $inputFile -Encoding UTF8)
    $node = $xmldata.Document.CrawledProperty | Where-Object {  $_.propertySet -eq $propertyGroupIn -and  $_.propertyName -eq $propertyNameIn -and $_.varType -eq $dataTypeIn }
    $data = $node.innerText

    [char]$multivaluedsep = 0x2029
    [char]$currentsep = ';'
    
    #Replace current separator (semi-colon) with special multivalued separator
    $data = $data.Replace($currentsep, $multivaluedsep)
    
    $resultXml = CreateXml $propertyGroupOut $propertyNameOut $dataTypeOut $data
    $resultXml.OuterXml | Out-File $outputFile -Encoding UTF8
    
    #Copy-Item $inputFile C:\Users\Administrator\AppData\LocalLow
}

# pass input and output file paths as arguments
DoWork $args[0] $args[1]

The first section to note (lines 34 through 40) defines the input crawled property that will contain the items with the semi-colon separator, as well as the output crawled property that will store the updated content with the correct multivalued separator. In my case, both properties are the same, since I simply want to do an in-place replacement.

The second section to note (lines 46 through 50) shows the definitions of the current separator (semi-colon) and of the multivalued separator (0x2029 in PowerShell). In the line that follows them, the replacement with the correct separator is applied to the input crawled property string.

Configure the Pipeline Extensibility to call your custom processing component and re-crawl your content

The next important step is to tell FS4SP that you want to call your custom processing component during content processing. To do this you must configure the %FASTSearch%\etc\pipelineextensibility.xml configuration file. This is how this file looked on my system:

<!-- For permissions and the most current information about FAST Search Server 2010 for SharePoint configuration files, see the online documentation, (http://go.microsoft.com/fwlink/?LinkId=1632279). -->

<PipelineExtensibility>
	<Run command="C:\Windows\System32\WindowsPowerShell\v1.0\PowerShell.exe C:\FASTSearch\bin\multivalued.ps1 %(input)s %(output)s">
		<Input>      
			<CrawledProperty propertySet="00130329-0000-0130-c000-000000131346" varType="31" propertyName="ows_DepartmentTest"/>
		</Input>
		<Output>
			<CrawledProperty propertySet="00130329-0000-0130-c000-000000131346" varType="31" propertyName="ows_DepartmentTest"/>
		</Output>
	</Run>
</PipelineExtensibility>

As you can see above, all I’m doing is defining that I want my custom PowerShell script to be called, receiving as input the crawled property that contains the contents with the semi-colon separator, and returning as output the same crawled property, in order to replace its contents with the updated value, now using the multivalued separator.

After saving this configuration file, the next step is to force your Document Processors to reload their configuration so they can be aware of this new content processing component, which you can accomplish by executing psctrl reset in a command prompt.

With all the pieces in place, you can start a re-crawl of your content and then test your refiner after crawl is complete. If all goes well, your refiner should now look exactly like you wanted!

FS4SP Multivalued Refiner - Correct

Troubleshooting

My main warning: pay close attention to the fact that the names of your input and output crawled properties (both in pipelineextensibility.xml and in the PowerShell script) are case-sensitive.

Many people have spent a very long time troubleshooting their code only to realize that it was a case-sensitivity issue with the names of these properties. The best way I found to troubleshoot a new custom processing component is through these techniques:

  1. Investigate the contents of the input file sent to your custom code: as described in this post, the only path in the file system with full access for your custom code is the AppData\LocalLow directory of the account running the FAST Search Service. By uncommenting line 55 in the PowerShell script above (the Copy-Item line), a copy of the input file received by the script will be created in the AppData\LocalLow directory. By looking at the contents of the input file you can see exactly what the input crawled property contains. If the input crawled property doesn’t contain any value, and you are sure that your document has that property, check for issues with case-sensitive property names.
  2. Validate the list of crawled properties received by FS4SP: you can accomplish this through the use of the optional processing stage FFDDumper (a sketch for enabling it follows this list).
  3. If both options 1 and 2 look ok, use the input file from step 1 to call your custom code directly and debug it to identify the error (you can debug the PowerShell script above using the PowerShell ISE)
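
And here is the promised sketch for enabling FFDDumper, assuming the optionalprocessing.xml layout described in the FS4SP documentation (back up the file first):

$file = Join-Path $env:FASTSEARCH "etc\config_data\DocumentProcessor\optionalprocessing.xml"
[xml]$cfg = Get-Content $file
($cfg.optionalprocessing.processor | Where-Object { $_.name -eq "FFDDumper" }).active = "yes"
$cfg.Save($file)
psctrl reset   # make the Document Processors pick up the change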

And that’s it for today. Enjoy your coding and your multivalued properties! 🙂


SharePoint Search and FAST Search for SharePoint Architecture Diagrams – Fault Tolerance and Performance

Update: For those interested in watching a presentation of the content below, you can download (right-click and select “Save target as…”) and watch this video here (200+ MB), which was recorded during a webcast on 2011-07-27. My presentation starts at 6min20sec.

In previous posts I showed and explained a few architecture diagrams of search in SharePoint 2010 for both SharePoint Search and FAST Search for SharePoint, I shared my all-time-favorite resource on SharePoint Search Architecture and Scale for crawl and query, and (hopefully) helped you understand, scale and monitor Crawling / Processing / Indexing in FAST Search for SharePoint.

What I will try to do in this post is convert most of that content into additional diagrams that should help you “see” how these changes related to fault tolerance and/or performance affect your search diagram.

These are the architecture diagrams discussed in this post:

SharePoint Search

FAST Search for SharePoint

SharePoint Search – Query Component (Fault Tolerance)

SharePoint 2010 - SharePoint Search Architecture Diagram - Query Component (Fault Tolerance)

In this diagram you see what your architecture would look like after you add a new mirror Query Component for an existing Index Partition, which you do in order to provide fault tolerance for the lookup of matched items for full-text search queries against your index. The reasons for doing that are pretty simple (and detailed in here): if one server goes down, the other can keep serving queries, and unless you configure the mirror server as “failover only” it will also share the load of incoming queries.

SharePoint Search – Query Component (Performance)

SharePoint 2010 - SharePoint Search Architecture Diagram - Query Component (Performance)

In this diagram there is just a very subtle change from the previous one (marked in red), but it makes a lot of difference in your architecture: the additional Query Component has a different Index Partition. What this means is that your content is now divided between the two Index Partitions, so if for example you have a total of 6 million indexed items, then each Index Partition has 3 million items. This also means that your Query Processor will send requests in parallel to both Query Components and, since each of them has to search against only half of the index (3 million out of 6 million total), they will be able to do this faster.

The supported number of indexed items is 100 million per search service application and 10 million for each Index Partition.

SharePoint Search – Property db (Performance)

SharePoint 2010 - SharePoint Search Architecture Diagram - Property Db (Performance)

Here things start to get interesting, with not only a new Query Component/Index Partition, but also a new Property db (added items marked in red). If you read this post (mentioned a dozen times by now 🙂) you understand that in order to provide search results, the Query Processor needs to perform a lookup not only in the Index Partition but also in the Property db, in order to retrieve the metadata associated with the results found. When you start to increase your indexed content, for example by having 20M items that you then split across 2 Index Partitions to improve your index lookup time, it may happen that your Property db becomes your bottleneck. A way to minimize this impact as the number of indexed items grows is to add a new Property db and assign a new Query Component/Index Partition to it. This way, each combination of Index Partition/Property db has to store and handle search requests for only half of the total number of indexed items.

It is also important to note that all search-related databases (Property db, Search Admin db and Crawl db) can be configured for fault tolerance through the use of database mirroring.

SharePoint Search – Query Processor (Fault Tolerance and Performance)

SharePoint 2010 - SharePoint Search Architecture Diagram - Query Processor (Fault Tolerance and Performance)

Even after you have scaled your Query Components, your Index Partitions and your Property dbs, another component that may require your attention is the Query Processor. This is the component that does the hard work of accessing the Query Component (to find items that match the query), the Property db (to get the metadata associated with those items) and the Search Admin db (to get the security descriptors used to apply security trimming to the results). By adding a new Query Processor (marked in red and described in here), you divide the load of this task across multiple servers, increasing your query performance and providing fault tolerance (if one goes down, the other can still handle queries).

SharePoint Search – Crawl Component (Fault Tolerance and Performance)

SharePoint 2010 - SharePoint Search Architecture Diagram - Crawl Component (Fault Tolerance and Performance)

Now let’s take a look at the other side of search: Crawling/Processing/Indexing. You can notice a new Crawl Component added in the diagram above; now what does this mean? It means that both Crawl Components will split the load of crawling the content sources defined, and both will keep pulling from and updating the crawl queue stored in the Crawl db. For example, if your full crawl with one Crawl Component and one Crawl db was taking 4 days, by adding another Crawl Component (and assuming you have sufficient CPU/memory/IO/bandwidth/etc. resources) the same full crawl should be reduced to around 2 days. Also, with two Crawl Components working from the same Crawl db, you get fault tolerance in case one of them goes down.

SharePoint Search – Crawl Component and Crawl db (Performance)

SharePoint 2010 - SharePoint Search Architecture Diagram - Crawl Component and Crawl Db (Performance)

What happens when you start to add many Crawl Components to the same Crawl db? Well, the db can easily become your bottleneck. One way to keep scaling out and increasing your crawling performance is through the use of an additional set of Crawl Component/Crawl db, as shown in the diagram above. This way, distinct content sources (web applications, web sites, file shares, etc.) will be split between these two Crawl dbs, and their respective Crawl Components will have to handle (crawl/process/index) only part of the content, making it easier to deal with.

There are a lot of things that go into this, from how content to be crawled is split among multiple Crawl dbs to how you can manually define this mapping yourself (if you want to). All of this and more are detailed in this post here.

FAST Search for SharePoint – Content Processing (Fault Tolerance and Performance)

SharePoint 2010 - FAST Search Architecture Diagram - Content Processing (Fault Tolerance and Performance)

Since we are starting with content processing, you may be asking “what about the crawling part of FAST Search?”. Well, the good news is that if you are using the FAST Content SSA to crawl your content, then your crawling architecture looks pretty much like what we just saw for SharePoint Search above. The main difference is that the FAST Content SSA will be tasked only with crawling, since processing and indexing will be done in the FAST Search farm. And talking about content processing, the first component that can be scaled out is the Content Distributor (shown above in red). What this gives you is just fault tolerance, since the FAST Content SSA will connect and send batches to only one Content Distributor at a time, and will switch to the other one only in case of failure to submit batches to the “primary” Content Distributor (you must also make sure to configure the FAST Content SSA listing both Content Distributors).

In regards to Document Processors, you will definitely have more than one (you get 4 of them by default in a simple installation), which gives you both fault tolerance (in case one of them goes down) and performance (since they work in parallel). Also, if the “primary” Content Distributor goes down, the Document Processors will be smart enough to switch to the other available Content Distributor.

Indexer (Fault Tolerance)

SharePoint 2010 - FAST Search Architecture Diagram - Indexer (Fault Tolerance)

Remember the option to mirror an Index Partition in SharePoint Search to provide fault tolerance? FAST Search can do something similar, but under a different name, since the documentation refers to this process as adding a backup indexer row. In this case both Indexers will have the same content, which means that if your primary Indexer goes down, the backup Indexer can be configured to become the new primary Indexer.

Indexer (Performance)

SharePoint 2010 - FAST Search Architecture Diagram - Indexer (Performance)

In the diagram above, instead of adding a new backup Indexer for fault tolerance, a new Indexer column was added to increase the volume of indexed content that can be stored in your search farm. In this scenario your content will be divided between the two Indexer columns (very similar to how we divided the content into separate Index Partitions for SharePoint Search).

The official guideline is to have one Indexer column for every 15 million items to index.

Indexer and Search (Fault Tolerance)

SharePoint 2010 - FAST Search Architecture Diagram - Indexer and Search (Fault Tolerance)

Above is the diagram of a somewhat common deployment of FAST Search for SharePoint, where you have two servers and each one is configured with a combination of Indexer and Search, in such a way that one server is the primary Indexer and backup Search, and the other server is the backup Indexer and primary Search. This way, with just your two servers you are providing fault tolerance for both Indexer and Search.

Query Processing (Fault Tolerance)

SharePoint 2010 - FAST Search Architecture Diagram - Query Processing (Fault Tolerance)

In the diagram above, a Query Processing server (with the QRServer, QRProxy and FSA Worker components) was added to the FAST Search farm and also properly configured in the FAST Query SSA by listing both servers in its setup. With this configuration, queries will be sent to both servers in round-robin fashion, and if one of the servers fails, the FAST Query SSA will keep sending queries to the remaining active server.

Conclusion

There is a lot you can configure in both SharePoint Search and FAST Search for SharePoint to increase performance and/or provide fault tolerance for components of your search farm. The important thing is to understand what options are available for each platform and keep them in mind when you first design your search architecture as well as after your search project is in production, in case you need to scale out your deployment.


Learning roadmap for Search in SharePoint 2010 (including FAST Search for SharePoint) – Part 2: Planning, Scale, Installation and Deployment, and Crawling

Did you enjoy your break since our last post in the series, when we finished up with some architecture diagrams for both SharePoint Search and FAST Search for SharePoint? Now let’s have a deeper look into some of those components, focusing on some considerations to properly plan and scale search solutions. Following up, we will cover some installation and deployment topics and then close with crawling. This should be enough to keep you entertained for a few days. 🙂

In case you want the full list of this roadmap, the planned sections (so far) are the following:

Planning and Scale

Ready to dig a little deeper into SharePoint Search? Then read these two out-of-this-world articles that explain not only how the architecture of SharePoint Search works, but also how to scale it. Believe me, these two posts have saved me more times than I can count. Extra points for those working with FAST, as almost everything related to the crawling components, including scaling, also applies to FS4SP:

In the links above you learned more about the SharePoint Search architecture; now, in this next step, you can expand your knowledge by looking at how these same things apply to FS4SP. It is important to note that scaling the FAST Query SSA is mostly done for failover reasons, as the hard work done at query time for FS4SP is executed in the FAST farm (and not in the SharePoint farm):

Now, if you got this far, you understand the crawling and query components running in the SharePoint farm, either for SharePoint Search or for FS4SP, so it is time to do some deep reading of the product documentation. I know hardly anyone likes to read the documentation (I don’t like it either 🙂), but there are great nuggets of useful information in the links below that will allow you to understand more about how to design the search solution and topology with FS4SP. The whole piece on performance and capacity management/testing/recommendations under the “Plan search topology” section is definitely worth a look (trust me, it will save you valuable time later on):

Advanced Material on Planning, Design, High Availability

A scenario that I get asked about fairly often is the idea of sharing the search service application across multiple SharePoint farms (something much discussed when you have dispersed SharePoint farms and want to provide a central Search farm). If that caught your attention, first you can read the official documentation, then you can go ahead and check the very detailed blog post covering step-by-step instructions on how to set this up for the User Profile Service Application and the Search Service Application. The same principles apply to both SharePoint Search and FS4SP (since you are publishing/consuming the SSAs on the SharePoint farm):

Installation / Deployment

First, review and understand the steps required to configure search in SharePoint 2010. Even for those who will only work with FAST, this still matters, as a lot of the overall guidance here also applies to FAST:

After you complete your reading above, you can go ahead and understand the steps required to deploy FS4SP from the official documentation:

Also, if you are planning to virtualize FS4SP, you better make sure to check the official recommendations here:

Crawling

First, learn the basics of configuring a new Content Source to crawl content in SharePoint 2010, since you will have to do this at some point. The best part? Most of what you learn here about defining content sources, crawl rules, and starting and stopping crawls is also valid for FS4SP. The video linked below shows the sequence of events when you trigger a full crawl (the part about crawling is the same for both SharePoint Search and FS4SP, but the part about processing and indexing is different in FS4SP).

For similar information but specific to FS4SP, this is the official documentation:

If you got through here and still manage to recall the FS4SP architecture diagram from the previous post, you probably noticed that in FS4SP there are a bunch of new components, each with its own function. As I mentioned above, the crawling piece of FS4SP, when you use the FAST Content SSA to define content sources, will work the same way as it does for SharePoint Search. Below is one of my previous posts explaining the crawling/processing/indexing flow in FS4SP:

Another difference in FS4SP is the ability to use one of the FAST Search specific connectors (Web content, Database content, Lotus Notes content). Those are the connectors that came from the previous standalone version of the FAST product, and to those not initiated in FAST administration, they may look a little strange (command-line utilities only? XML configuration files?). These FAST Search specific connectors are completely unknown to your SharePoint farm (SharePoint basically doesn’t even know they exist, as they reside directly on the FAST farm), which means that a SharePoint administrator will not have access to them through Central Administration, so you should be aware of that. My recommendation is that you always try to use the connectors available through Central Administration (FAST Content SSA), and go to the FAST Search specific connectors only if you need a specific functionality that you can only get with them (such as support for Lotus Notes security through the FAST Search Lotus Notes connector):

Now that you understand how to crawl standard content with both SharePoint Search and FS4SP, it is time to understand how to bring in content from other external sources (beyond web sites, file shares, etc.). So do yourself a big favor and learn about Business Connectivity Services (BCS) in SharePoint 2010. To me this is one of THE most important pieces of technology in SharePoint that can really make search shine, as it integrates with other sources in a company (databases, web services, whatever-you-want), bringing it all together inside SharePoint. The best part? It is a technology that works seamlessly with both SharePoint Search and FS4SP. The post below has the most detailed explanation I have ever found on how to create basic External Content Types (to get content from a database, probably the most common scenario):

If you are looking for extra credit as an applied student (as you should be 🙂), then you can not only learn about BCS for search, but also explore the broader capabilities that BCS brings to SharePoint overall, beyond search. Believe me, you won’t regret it.

Advanced Material on Crawling and Connectors

Through BCS you can also create your own connectors to link SharePoint with any external sources you want. The first post below is a great starting point on this, and is the exact post I first read to understand how this works:

This second reference is a small gem buried on MSDN that explains how to create something a lot of people want: a connector that aggregates metadata with an attached document and brings both together to be processed and indexed (such as indexing the metadata information for a candidate along with his/her resume, allowing users to search for both and get just one result). Powerful stuff.

Another frequently asked question is about the possibility of using BCS to crawl databases other than SQL Server. The article below explains how to do this for Oracle, but gives some clues to the fact that you could do something similar for any other database supporting OLE DB or ODBC:


This should keep you busy for a while. And remember that if you just want a quick way to get a server to try some of the things you read above, you can always play around with one of the MSDN Virtual Labs instances, such as this one here that will give you a VM with both SharePoint 2010 and FS4SP.

Didn’t understand some of the materials? Have other resources you want to share? By all means, feel free to comment below. 🙂


Learning roadmap for Search in SharePoint 2010 (including FAST Search for SharePoint) – Part 1: Search 101 and Architecture

If you want to learn about search in SharePoint 2010, there is so much information everywhere, spread across many sites, in different media formats that it would be a daunting task to try and make sense of it all. That’s why a lot of people come to our instructor-led trainings here at FAST University (the training division of FAST that came to Microsoft through the FAST Search & Transfer acquisition in 2008). Still, even after class, students often ask me for additional material they can explore, sometimes just to help them refresh the concepts and other times to help them deepen their knowledge.

For this purpose I created a OneNote notebook with a collection of my favorite reference links about FS4SP (and now also a bunch of links about SharePoint Search). Unfortunately, my OneNote notebook doesn’t seem to be enough anymore, now that it has grown to 11+ sections with dozens of links in each (!!).

With that in mind, I’m starting with this post a series of articles that, when put together, are intended to provide a “roadmap” with the references (articles, posts, videos, etc.) that I would follow if I had to start learning about search in SharePoint 2010 (including both SharePoint Search and FAST Search). I hope it helps others find their way too.

The planned sections are the following:

Search 101

In a VERY simplistic way, a search engine has to perform the following main tasks:

  • Crawling: acquire content from wherever it may be located (web sites, intranet, file shares, email, databases, internal systems, etc.)
  • Processing: prepare the content to make it more “searchable”. Think of a Word document from which you want to extract the text, an image containing text you want people to be able to search, or even a web page from which you want to extract the title, the body and maybe some of its HTML metadata tags.
  • Indexing: this is the magic sauce of search, and what makes it different from just storing the content in a database and querying it with SQL statements. The content in a search engine is stored in a way that is optimized for later retrieval; we typically call this optimized version of the content the search index (see the toy sketch right after this list).
  • Searching: the best-known part of a search engine. You pass in one or more query terms and the search engine returns results based on what is available in its search index.
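
To make the indexing idea concrete, here is a toy (and wildly simplified) sketch of an inverted index, the data structure at the heart of a search index, mapping each term to the documents that contain it. It is written in PowerShell purely for illustration; nothing here is how SharePoint or FAST actually implement their indexes:

    # Two tiny "documents"
    $docs = @{
        1 = "fast search for sharepoint"
        2 = "sharepoint search architecture"
    }

    # Build the inverted index: term -> list of document ids
    $index = @{}
    foreach ($id in $docs.Keys) {
        foreach ($term in ($docs[$id] -split '\s+' | Sort-Object -Unique)) {
            if (-not $index.ContainsKey($term)) { $index[$term] = @() }
            $index[$term] += $id
        }
    }

    # "Searching" is now a cheap lookup instead of scanning every document
    $index["sharepoint"]    # -> 1, 2
    $index["architecture"]  # -> 2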

If you are interested in exploring the subject further, you can check these resources:

Search Architecture in SharePoint 2010

To fully comprehend and use search well in SharePoint 2010, I highly recommend you begin by understanding the concept of Service Applications. You can do this by checking these resources:

If you read the articles above, you now know that SharePoint 2010 has a Search Service Application, and this is what you will use to configure your search environment. Now let’s have a look at what the architecture of search in SharePoint 2010 (without FAST) looks like:

[Image: SharePoint 2010 - SharePoint Search Architecture Diagram]

In a nutshell, this is what these components are (a quick PowerShell sketch for inspecting them in your own farm follows the list):

  • Search Service Application Proxy: the proxy for this service application, as explained in the articles about Service Applications listed above.
  • Admin Component: responsible for applying the changes you make to the system, such as adding new query components or new crawl databases.
  • Search Admin db: stores administrative information, such as the search topology, and also the security descriptors (ACLs).
  • Crawl db: stores crawl information, such as the crawl queue and the crawl log, and also social and anchor data.
  • Crawl Component: effectively crawls content from the data sources, processes it, builds indexes and ships them to the Index Partition(s), as well as storing metadata in the Property db.
  • Property db: stores metadata for items in the search index.
  • Query Component: runs full-text queries against the search index stored in its Index Partition.
  • Query Processor: sends full-text queries to the Query Component, grabs metadata from the Property db and applies security trimming to search results based on the ACL info stored in the Search Admin db.
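
If you have a SharePoint 2010 farm handy, you can see most of these pieces for yourself from the SharePoint 2010 Management Shell. A quick sketch, assuming a single Search Service Application in the farm:

    Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

    $ssa = Get-SPEnterpriseSearchServiceApplication

    # Crawl side: crawl components and the crawl databases behind them
    Get-SPEnterpriseSearchCrawlTopology -SearchApplication $ssa |
        ForEach-Object { $_.CrawlComponents }
    Get-SPEnterpriseSearchCrawlDatabase -SearchApplication $ssa

    # Query side: query components and their index partitions
    Get-SPEnterpriseSearchQueryTopology -SearchApplication $ssa |
        ForEach-Object { $_.QueryComponents }

    # The property database(s) storing item metadata
    Get-SPEnterpriseSearchPropertyDatabase -SearchApplication $ssa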

In future posts we will explore each one of these components in more detail, and point to places where you can get more information about them.

But as you may have noticed, the diagram we just saw is only valid when you have a “pure” SharePoint 2010 Search environment (that is, an installation without FAST Search). What does the search architecture look like when you have FAST Search for SharePoint on top of regular SharePoint Search?

[Image: SharePoint 2010 - FAST Search Architecture Diagram - short version]

Whoa! A bunch of new things appeared in this diagram. The first thing you probably noticed is that you now have not just one Search Service Application (SSA), but two! That’s right: when you want to use FAST Search for SharePoint, you must configure two SSAs to handle the communication between the SharePoint farm and the FAST Search farm.

Two farms as well? Yes. The key thing to understand about your architecture when you have SP2010 and FS4SP together is that you will configure and add servers to each of these farms independently. The bridge between the two farms is precisely the two SSAs mentioned above, and here is my quick breakdown of them:

  • FAST Content SSA: responsible for crawling content and pushing it to the FAST Search farm to be processed and indexed.
  • FAST Query SSA: responsible for receiving incoming queries from search applications and routing them to the appropriate search engine (SharePoint Search for People Search queries, FAST Search for all other content queries); also responsible for crawling people content. (See the query sketch right after this list.)
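
You can actually see this routing from code, since the query object model lets you choose the results provider explicitly. A minimal hedged sketch in PowerShell, run from a SharePoint server in the farm (the site URL is a placeholder):

    Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
    [void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.Office.Server.Search")

    $site = Get-SPSite "http://intranet"   # placeholder URL for a site in your farm

    $query = New-Object Microsoft.Office.Server.Search.Query.KeywordQuery($site)
    $query.QueryText = "sharepoint"
    # Send this query to FAST Search; a People Search query would use the
    # SharepointSearch provider instead
    $query.ResultsProvider = [Microsoft.Office.Server.Search.Query.SearchProvider]::FASTSearch
    $query.ResultTypes = [Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults

    $results = $query.Execute()
    $table = New-Object System.Data.DataTable
    $table.Load($results[[Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults])
    $table | Select-Object Title, Path -First 5

    $site.Dispose()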

To understand more about these two SSAs, check the excellent explanation in the resource below:

Now that you understand a little bit more about the SSAs, let’s check the remaining components in the FS4SP architecture diagram, the ones in the FAST Search farm (a sketch for poking at a live FAST farm follows the list):

  • FAST-specific Connectors: additional connectors (beyond the ones available through the FAST Content SSA) for crawling data from multiple data sources and sending it for processing. (Note: unless you explicitly configure one of these connectors to crawl your content, they will not be used at all.)
  • Item Processing: receives incoming batches of documents, processes them (to make them more easily searchable) and forwards them to Indexing.
  • Indexing: builds a search-optimized index of the processed content (the search index).
  • Query Processing: responsible for processing queries coming from the FAST Query SSA (e.g., modifying the query to add security filters based on the user who issued the request) and also for processing the results to be returned, performing activities such as removing duplicates.
  • FAST Search Authorization: contacted by Query Processing to return the security filters that should be added to the query.
  • Query Matching: performs the actual lookup in the search index to retrieve items that match the user’s query.
  • Administration: responsible for applying changes you make to the system, such as changes to the index schema.
  • FAST Search Admin db: stores information about the FAST Search environment, including the index schema.
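
If you want to see these subsystems in the flesh, open the “Microsoft FAST Search Server 2010 for SharePoint shell” on a FAST admin server and poke around; a short hedged sketch (treat the property names as a starting point):

    # Lists every FAST component and whether it is running
    nctrl status

    # Content collections; the default one used for SharePoint content is "sp"
    Get-FASTSearchContentCollection

    # A peek at the index schema managed by the Administration subsystem
    Get-FASTSearchMetadataManagedProperty | Select-Object -First 5 Name, Type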

Does it look like a lot of components? Well, in fact they are not really components per se, but rather subsystems that each perform a specific task (note how the SSAs don’t have all of their components listed, but are instead collapsed into a single entity).

The real components in a FAST Search farm are the ones listed in the diagram below, which shows each subsystem divided into its individual components:

[Image: SharePoint 2010 - FAST Search Architecture Diagram - detailed]

As you can see, there are a LOT of components just in the FAST Search farm (beyond the ones in each SSA on the SharePoint farm). Don’t worry about memorizing them all now, as we will cover each one in more detail in future posts.

If you managed to get all the way here in this post, congratulations! 🙂

If you feel like you still have a lot to learn, well: yes, you do. But the important thing for now is to understand the overall architecture of the system and how all these pieces fit together. If you get that, you are on the right track to start going deeper into the roles each of these components plays in your search environment, which is the subject of the following posts in this series.

Makes sense? It doesn’t? Just let me know in the comments.

Posted in FS4SP, SP2010 | Tagged | 9 Comments