Content Source vs. Content Collection in FS4SP

A very common source of confusion with FS4SP is the distinction between a Content Source and a Content Collection. If you just made a funny face while reading this, stick around and I’ll try to make things a little clearer for you.

A few months ago I saw the following thread on TechNet:

Search a Specific Content Collection

I have created a new content collection “news”. I am able to crawl content and feed it into that specific content collection, but when I do a query it always queries the default “sp” collection.

Is there a way to specifically tell my FAST site to query my “news” content collection?

When I looked at this question, I thought it would be a good idea to step back a little and define what a Content Collection is and what a Content Source is, since these are two different concepts that, depending on how you configure FS4SP, have distinct and important roles to play. With that in mind, this was my response:

The first question I would ask you is this: what are you trying to achieve by having multiple content collections?

Note that FAST Search for SharePoint has two distinct things:

  • Content Sources: those are a logical grouping defined through Central Administration. New content sources are created to crawl distinct types of content or to use different crawl schedules, for example. By default all content crawled through any of these content sources is sent and stored inside FAST in the content collection “sp”;
  • Content Collections: those are a logical grouping defined through Windows PowerShell and implemented directly on FAST. You can define multiple content collections to facilitate maintenance when you use one of the FAST-specific connectors (http://technet.microsoft.com/en-us/library/ff383278.aspx#About_FS_specific). Having a separate content collection for the FAST Search database connector, for example, would allow you to clear the contents of just that content collection in case that is needed.

The important thing is that no matter how you configure any of the above (content sources or content collections), they are just logical groupings of content. In the end it all goes to the same FAST index. Even more important, any queries by default will be executed against the entire index.

Now, if you do want to limit some of your queries to execute against just part of the content, you have two options:

  • To filter by Content Source you can search against “contentsource”. For example, if you have a Content Source named “FAST Contoso” defined through Central Administration, you would be able to search against only this content with a query like this (KQL syntax) -> contentsource:"FAST Contoso" <query term>
  • To filter by Content Collection you can search against “meta.collection”. For example, if you created a new Content Collection named “news” using Windows PowerShell, you would be able to search against this content with a query like this (KQL syntax) -> meta.collection:news <query term>
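The two filters above are just property restrictions prefixed to the normal query text. As a minimal sketch (the helper name `scoped_query` and the sample terms are hypothetical, not part of FS4SP), a client could build such KQL strings like this:

```python
def scoped_query(terms, content_source=None, collection=None):
    """Prefix a KQL query with the property filters described above."""
    filters = []
    if content_source is not None:
        # Quote the value so a name containing spaces stays one phrase.
        filters.append('contentsource:"{}"'.format(content_source))
    if collection is not None:
        filters.append("meta.collection:{}".format(collection))
    return " ".join(filters + [terms])


print(scoped_query("budget report", content_source="FAST Contoso"))
print(scoped_query("budget report", collection="news"))
```

This only assembles the query string; how it is submitted (search box, Query Web Service, object model) is left out on purpose.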

As you can see, you have several options; which one you use will depend on your business needs.

All of this reminds me of another thread that I just saw related to how you can get the contentsource property of a document in the Pipeline Extensibility (say you want to apply a certain processing rule only to documents coming through a specific content source, for example). But this is a topic for another day. 🙂

About leonardocsouza

Mix together a passion for social media, search, recommendations, books, writing, movies, education, knowledge sharing plus a few other things and you get me as result :)

9 Responses to Content Source vs. Content Collection in FS4SP

  1. rahul says:

    Hi, I have created a new collection. Can you please tell me how to crawl content on it.
    I am unable to find any option.

    • Hi Rahul,

      As I described above, you only need a new Content Collection when you plan to use one of the FAST Search specific connectors (FAST Search Web Crawler, FAST Search Database connector or FAST Search Lotus Notes connector). When configuring any of these connectors you will have a parameter on the configuration file defining the collection you want to send the content to. All you need to do is configure this parameter with the name of the new collection you created.

      Good luck!

      Best,
      Leo

  2. Francis says:

    Hi – I’ve got 2 Content Collections and would like to create a deep refiner called ‘Source’ that would list the 2 Content Collections (Intranet & Internet). I’ve been looking at the crawled and managed properties using PowerShell but can’t see a property from either list that would allow me to do this. Any suggestions would be appreciated.

    Thanks

    • Hi Francis,

      You have 2 Content Collections or 2 Content Sources?

      For Content Sources (defined through the FAST Content SSA), I believe you can use the managed property “contentsource”. By default it isn’t configured as a refiner, so you would have to change that and re-crawl all your content.

      For Content Collections (which you wouldn’t have unless you are using one of the FAST-specific connectors), as far as I know there is no out-of-the-box property that will show this to you. I believe the approach here would be to create a custom Pipeline Extensibility component that would “tag” the content with the appropriate Content Source value, depending on some characteristic that allows you to differentiate between content crawled through the FAST Content SSA and content crawled through one of the FAST-specific connectors.

      Hope that makes sense. If it doesn’t, just let me know. 🙂

      Best,
      Leo

  3. Francis says:

    Hi Leo – I’ve got 2 Content Collections as I’m using the FAST Search Web Crawler.
    The reason I thought there must be a way to use the Content Collection values as refiners was that I was looking at the XML returned from the QR server and saw a ‘collection’ attribute for each search result with the values I wanted to use (either ‘Internet’ or ‘Intranet’ depending on the Content Collection the result belonged to). e.g.:

    intranet

    But from your answer there doesn’t seem to be a way to access this through crawled or managed properties. I’m relatively new at this and haven’t looked into custom pipeline extensions yet but the link you provided looks promising.

    Thanks also for this article as it helped to, finally, straighten these concepts out.
    Francis

  4. Akshay prasad says:

    Hi Leo,

    I have a custom executable which runs on the content that will be crawled in FAST. I want to understand how I can restrict which content sources pass through the custom executable on their way to the indexer, while the others go directly to the FAST indexer without going through the custom executable. There are content sources of other web applications in the farm as well, which should not go through the custom executable. Could you please help me out with this?

    Thanks
    Akshay Prasad

    • Hi Akshay,

      I believe the best option for this scenario is for you to send an extra input crawled property to your custom executable and implement some logic in there to only process documents that match your expected value, ignoring all the other documents.

      You could do this by passing along the ContentSource property (which should be possible according to this thread here). Then your code would have some logic like this below to process only documents sent to Content Source XYZ:


      if (contentsource == "XYZ") {
          // do some processing
      } else {
          // do nothing to the document
      }

      Another option would be to use some other crawled property that will only have a value for the content sources that should be processed. This should be perfectly feasible as well, but I would recommend you try to use the content source name itself for this conditional processing, as this should make the content easier to understand/maintain later on.

      In any case, the main message is that your approach shouldn’t be to try to prevent some content sources from passing through your custom executable, but instead to change your custom executable’s logic so that it only processes content from the desired content sources and ignores the other ones.
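To make the idea a bit more concrete, here is a rough sketch of such an executable in Python. It assumes the executable is called with an input file path and an output file path, and that both files hold a `<Document>` element with `<CrawledProperty>` children. The property name "ContentSource", the placeholder GUID, and the output property "processedby" are illustrative assumptions, not values confirmed by FS4SP.

```python
import sys
import xml.etree.ElementTree as ET

TARGET_SOURCE = "XYZ"  # content source name from the example above


def content_source_of(doc):
    """Return the ContentSource crawled property value, or None if absent."""
    for prop in doc.iter("CrawledProperty"):
        # Assumption: the property arrives under the name "ContentSource".
        if prop.get("propertyName", "").lower() == "contentsource":
            return (prop.text or "").strip()
    return None


def process(input_path, output_path):
    doc = ET.parse(input_path).getroot()
    out = ET.Element("Document")
    if content_source_of(doc) == TARGET_SOURCE:
        # Hypothetical processing step: emit one new crawled property.
        tag = ET.SubElement(out, "CrawledProperty", {
            "propertySet": "00000000-0000-0000-0000-000000000000",  # placeholder
            "propertyName": "processedby",  # hypothetical name
            "varType": "31",  # assumed to mean a string value
        })
        tag.text = "custom-stage"
    # Otherwise the output document stays empty, so nothing is changed.
    ET.ElementTree(out).write(output_path, encoding="utf-8",
                              xml_declaration=True)


if __name__ == "__main__" and len(sys.argv) == 3:
    process(sys.argv[1], sys.argv[2])
```

Documents from any other content source still pass through the executable, but they come out with an empty property list, which is the "do nothing" branch of the snippet above.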

      I haven’t played with FS4SP in quite a while now, but I hope I’m not that rusty and that this is somewhat helpful to you. 🙂

      Best,
      Leo

  5. Akshay prasad says:

    Hi Leo,

    Thanks for your swift reply. Yes, you got the message absolutely right. I just have a difficulty with the implementation. I wanted to clarify one thing with you: do I need to put this “if/else” logic inside pipelineextensibility.xml or inside the custom executable? The custom executable is a 3rd party product and unfortunately modifying the exe will be difficult. Please guide me on this.

    Thanks
    Akshay Prasad

    • Hi again Akshay!

      First of all, I’m sorry for the delay in replying to you.

      In regards to your question, the only way I can think of for you to use a 3rd party executable ONLY for some content sources would be for you to create your own custom executable that in turn calls the 3rd party one only for those content sources you want.
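A wrapper like that could look roughly like the Python sketch below. It assumes the 3rd party executable takes the same two arguments as any pipeline executable (input file, output file). The path in `THIRD_PARTY_CMD`, the names in `ALLOWED_SOURCES`, and the "ContentSource" property name are placeholders to replace with your own values.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET

# Hypothetical values: replace with the real executable and source names.
THIRD_PARTY_CMD = ["C:\\path\\to\\thirdparty.exe"]
ALLOWED_SOURCES = {"XYZ"}


def content_source_of(input_path):
    """Read the ContentSource crawled property from the input file."""
    for prop in ET.parse(input_path).getroot().iter("CrawledProperty"):
        # Assumption: the property arrives under the name "ContentSource".
        if prop.get("propertyName", "").lower() == "contentsource":
            return (prop.text or "").strip()
    return None


def run_wrapper(input_path, output_path, run=subprocess.run):
    """Delegate to the 3rd party tool only for the allowed content sources."""
    if content_source_of(input_path) in ALLOWED_SOURCES:
        # The 3rd party tool reads the input and writes the output itself.
        run(THIRD_PARTY_CMD + [input_path, output_path], check=True)
        return True
    # Skip: write an empty document so no properties are changed.
    ET.ElementTree(ET.Element("Document")).write(
        output_path, encoding="utf-8", xml_declaration=True)
    return False


if __name__ == "__main__" and len(sys.argv) == 3:
    run_wrapper(sys.argv[1], sys.argv[2])
```

The `run` parameter exists only so the delegation can be tried out without the real executable; in the pipeline you would register the wrapper instead of the 3rd party exe and pass along the content source property as an input.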

      I can’t think of any other way of going around this at the moment. If someone else reading this has a better idea, feel free to jump in with your comment! 🙂

      Hope this helps!

      Best,
      Leo
