A while ago I received an email from a former student asking for my thoughts about an issue in his environment. He was crawling some content that had a metadata field he wanted to use as the title for the indexed items, but even after mapping it properly to the Title managed property, he still could not get his custom title to appear in the results. “What could be going on?”, he wondered.
My first thought after reading his email was “I think I know what the issue is”, followed closely by “Maybe I should look for another profession, since I was supposed to have taught this clearly during class”. I then explained to him what I will explain to you in this series of posts about Crawled Properties, Managed Properties, and the Full-text Index, and how they all work together to make your search work.
To get to the bottom of this issue, we need to understand three concepts perfectly clearly (ok, to tell you the truth, you only need two of them for this particular issue, but the third one is also very important, so please play along):
- Crawled Properties (this post – Part 1)
- Managed Properties (future post – Part 2)
- Full-text Index (future post – Part 3)
As the official documentation states: “Crawled properties are automatically extracted from crawled content and grouped by category based on the protocol handler or IFilter used.”
What this means is that crawled properties are metadata associated with your content (such as title and URL), and they can be found in two ways: during crawling of the content and during content processing.
Crawled properties found during crawling
When you crawl content from a SharePoint Document Library, each column in the library is metadata associated with the corresponding document, so each one ends up being exposed as a crawled property. For example, in the document library below, each column will be exposed as a crawled property, including my custom column Department:
After my first crawl of this content, I can quickly check that my custom column was exposed as a crawled property either through PowerShell or through Central Administration. To do it through PowerShell, I can run this simple cmdlet:
Get-FASTSearchMetadataCrawledProperty -Name ows_Department
In my case, it was easy to locate my property, since I already knew that all custom columns in a SharePoint document library are exposed as crawled properties with the prefix “ows_”, and they are also assigned to the category SharePoint. Now, what if I did not know this, or if this metadata was coming from somewhere else?
Then I could simply use the name of the column to do a wildcard search for it:
Get-FASTSearchMetadataCrawledProperty -Name *department
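If you are not sure where the column name sits inside the crawled property name, or you just want to see everything that follows the “ows_” convention, the same wildcard support can be stretched a little further. What follows is only a sketch: it assumes you are running inside the FAST Search for SharePoint shell (or a session where the FAST Search snap-in is already loaded), and it assumes the returned objects expose a CategoryName property next to Name. If your version does not, that column will simply come back empty.

```powershell
# Assumption: the FAST Search cmdlets are already available in this session.

# List every crawled property that follows the "ows_" naming convention.
# CategoryName is an assumed property name; drop it if the column is blank.
Get-FASTSearchMetadataCrawledProperty -Name ows_* |
    Sort-Object Name |
    Format-Table Name, CategoryName -AutoSize

# A wildcard on both sides matches the column name anywhere in the
# crawled property name, not only at the end.
Get-FASTSearchMetadataCrawledProperty -Name *department* |
    Format-Table Name, CategoryName -AutoSize
```

On a farm with many crawled sites the first command can return a long list, so narrowing the wildcard is usually the quicker way to find one specific column.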
Crawled properties found during content processing
The second way crawled properties can be found is during content processing. One example is the set of metadata properties you can define for Office documents, such as the ones shown below for a Microsoft Word document:
Each one of the properties shown above is also extracted during content processing and exposed as a crawled property. Take the property Comments shown above, for example. Any information stored in this property will be extracted and exposed through the crawled property Office:6(Text).
Now you may be asking yourself: how am I supposed to know that a crawled property named Office:6(Text) will contain the values of the metadata property Comments from an Office document?
The only answer I have for that is this brilliant series of posts from Anne Stenberg, where she reverse-engineers the out-of-the-box crawled properties and their mappings to figure out what metadata they represent. In my case, all I had to do was check the post from her series that talks specifically about the crawled properties for Office documents.
Note: even though Anne’s articles mentioned above are all about SharePoint 2007 (MOSS 2007), so far they have been spot on every time I needed to find a specific crawled property in FAST Search for SharePoint.
Ok, so now we know that crawled properties are metadata found either through crawling or through content processing, and we also know how to identify specific crawled properties associated with our content. The real deal, though, will come from understanding how we can use these crawled properties during search, as you can see below, where I’m searching for the contents I entered into the Comments property of my Word document:
And this will be the subject of the next post in this series: Managed Properties. See you then!