A question from one of my students earlier today reminded me that, even though I sometimes try to cover deeper aspects of the FAST Search for SharePoint (FS4SP) platform, there are still some architectural concepts and guidance that need to be fully absorbed before we can take that deep dive. So in this post I will cover the Crawling –> Processing –> Indexing architecture of FS4SP, as well as its ties to SharePoint 2010, to show which points you should pay attention to when configuring and administering your search architecture.
To put it in a simple way, this is the “diagram” of the components involved from the time a document is crawled until it is indexed:
FAST Content SSA (SP farm) –> Content Distributor (FAST farm) –> Document Processor(s) –> Indexing Dispatcher –> Indexer
FAST Content SSA (or FAST Search Content SSA, or FAST Search Connector, etc.): The sole purpose of this SSA is to crawl content (in the SP farm, since it is configured on a SP server) and send it to be processed in the FAST farm. Apart from the fact that the content is being sent to FS4SP to be processed, this SSA has the same crawling behavior as a regular SharePoint Search SSA, so the same rules of scaling and fault tolerance apply. What matters here is understanding how to scale this component and how to monitor its performance (to find out whether you need to scale).
How (and why) to scale: this post has some terrific guidance on how to architect and scale the crawling components of an SSA in SP2010, which also applies to the FAST Content SSA.
How (and what) to monitor: you can check the following Performance Counters in the server configured with the FAST Content SSA to see if you have lots of batches just waiting for the FAST farm to catch up (or if the FAST Content SSA itself is the one to blame): OSS Search FAST Content Plugin –> Batches Ready (if this value is 0, then the FAST farm is processing content faster than the current crawl rate). Other useful crawling performance counters to monitor are available on TechNet as well as the full list of performance counters available for the FAST Content SSA.
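To make the reading of Batches Ready a bit more concrete, here is a minimal Python sketch. It assumes you have already collected a series of samples of the counter by some other means (Performance Monitor export, a script, etc.); the function names and the tolerance parameter are made up for illustration and are not part of FS4SP.

```python
from statistics import mean


def fast_farm_is_keeping_up(batches_ready_samples, tolerance=0):
    """Return True when every 'Batches Ready' sample is at or below tolerance.

    A value of 0 means no batches are waiting for the FAST farm, i.e. the
    FAST farm is processing content faster than the current crawl rate.
    """
    if not batches_ready_samples:
        raise ValueError("need at least one sample")
    return all(sample <= tolerance for sample in batches_ready_samples)


def summarize(batches_ready_samples):
    """Summarize the samples and point at the side that deserves a closer look."""
    keeping_up = fast_farm_is_keeping_up(batches_ready_samples)
    return {
        "average_batches_ready": mean(batches_ready_samples),
        "peak_batches_ready": max(batches_ready_samples),
        # Hypothetical labels, only meant as a hint for where to investigate.
        "look_at": "crawl side (FAST Content SSA)" if keeping_up
                   else "FAST farm (processing/indexing)",
    }


if __name__ == "__main__":
    # Hypothetical samples taken during a crawl.
    print(summarize([0, 0, 0, 0]))
    print(summarize([3, 7, 12, 9]))
```

Treat the result as a hint only: a steady 0 tells you the FAST farm is not the one holding things up, while a value that keeps growing suggests looking at the Document Processors and the Indexer next.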
Additional info: even though I’m focusing only on the FAST Content SSA, I highly recommend you read the tale of the two Search Service Applications (SSAs) in FS4SP to understand how FS4SP connects to SP2010 through these two SSAs.
Content Distributor: this is the first component on the FAST farm. Its main role is to receive batches of documents from the FAST Content SSA and route them to the Document Processors to be processed. It is a very lightweight component that should not impact your system’s performance, so the main consideration here is scaling for fault tolerance.
How (and why) to scale: deploy this component on at least two servers in your FS4SP farm so that, if the primary Content Distributor fails, the second Content Distributor can pick up the work while you troubleshoot the failure on the other server. You can do this either during the initial deployment of your FAST farm or later by modifying your farm deployment. The important thing is to remember to add the address/port of both Content Distributors when configuring the FAST Content SSA (either through Central Administration or through the Set-SPEnterpriseSearchExtendedConnectorProperty cmdlet, separating the multiple hostname:port values with semicolons).
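As a small illustration of the semicolon-separated format, the Python sketch below builds the value from a list of host/port pairs. The host names are hypothetical, and the port 13391 is only a placeholder I am assuming for the example; use the Content Distributor port of your own deployment.

```python
def content_distributor_value(endpoints):
    """Join (host, port) pairs into 'host:port;host:port' form."""
    parts = []
    for host, port in endpoints:
        if not host or ":" in host or ";" in host:
            raise ValueError(f"invalid host name: {host!r}")
        port = int(port)
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid port: {port}")
        parts.append(f"{host}:{port}")
    if not parts:
        raise ValueError("at least one Content Distributor is required")
    return ";".join(parts)


if __name__ == "__main__":
    # Hypothetical servers; two entries give you fault tolerance.
    value = content_distributor_value([
        ("fast01.example.com", 13391),
        ("fast02.example.com", 13391),
    ])
    print(value)  # fast01.example.com:13391;fast02.example.com:13391
```

The resulting string is what you would paste into Central Administration or pass to the cmdlet mentioned above.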
Document Processor(s): this guy is always busy in a FS4SP deployment with ongoing crawling/feeding, as it has the tough role of processing the content before it is sent to be indexed. Among its tasks are language and encoding detection, tokenization/word processing, stemming/lemmatization, property extraction, document conversion (extracting content from PDF, Office documents, etc.), mapping crawled properties to managed properties, and so on. As I said, very busy guy. And all of this is done in memory, which means that the primary resources consumed by this process are memory and CPU.
How (and why) to scale: you will definitely want to add multiple instances of this component to your FAST farm, either during initial deployment or later on by simply opening a command prompt and issuing the command “nctrl add procserver” on any FS4SP server. This can be very helpful during the initial load of your system, when you can temporarily add multiple instances of this component to the search nodes (which are not yet being used by your users), and you can just as easily remove them later by executing “nctrl remove procserver”. A good rule of thumb for a dedicated processing server is 1 to 2 document processors per CPU. After some tests done by a friend, I would recommend being conservative, with just 1 per CPU. The recommendation to observe your CPU/memory usage during peak processing times is still valid, so you can properly assess your resource consumption and availability. On a non-dedicated server you need to make sure there are enough CPU resources for both document processing and the other processes.
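The rule of thumb above can be turned into a quick back-of-the-envelope calculation. This is just a sketch in Python: the function name and the reserved_cpus parameter (CPUs you set aside for other processes on a non-dedicated server) are my own assumptions for illustration, not anything FS4SP provides.

```python
import math


def recommended_procservers(cpus, per_cpu=1.0, reserved_cpus=0):
    """Estimate how many document processors a server can host.

    per_cpu       -- 1 is the conservative choice, 2 the upper end of the
                     rule of thumb for a dedicated processing server.
    reserved_cpus -- hypothetical number of CPUs kept free for the other
                     processes on a non-dedicated server.
    """
    if cpus < 0 or reserved_cpus < 0 or per_cpu <= 0:
        raise ValueError("cpus/reserved_cpus must be >= 0 and per_cpu > 0")
    usable = max(cpus - reserved_cpus, 0)
    return math.floor(usable * per_cpu)


def procservers_to_add(target, currently_running):
    """How many times 'nctrl add procserver' would need to be issued."""
    return max(target - currently_running, 0)


if __name__ == "__main__":
    target = recommended_procservers(cpus=8, per_cpu=1, reserved_cpus=2)
    print(target)                                            # 6
    print(procservers_to_add(target, currently_running=4))   # 2
```

Whatever number you end up with, validate it by watching CPU and memory during peak processing before making it permanent.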
How (and what) to monitor: the main indicator that this component is your bottleneck is CPU usage constantly reaching 90-100% on the servers hosting document processors. Each Document Processor component runs as a separate instance of procserver.exe. You can also monitor the Document Processor performance counters.
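If you want to put a number on “constantly”, a rough Python sketch could look like the one below. It assumes you already have CPU usage samples (in percent) for the server; the 90% threshold comes from the range above, while the min_fraction parameter is purely my assumption for what “constantly” might mean.

```python
def cpu_is_saturated(cpu_percent_samples, threshold=90.0, min_fraction=0.8):
    """Return True when at least min_fraction of the samples reach threshold.

    min_fraction is a hypothetical knob: 0.8 means 80% of the samples
    must be at or above the threshold to call the CPU saturated.
    """
    if not cpu_percent_samples:
        raise ValueError("need at least one sample")
    high = sum(1 for sample in cpu_percent_samples if sample >= threshold)
    return high / len(cpu_percent_samples) >= min_fraction


if __name__ == "__main__":
    # Hypothetical samples from a server hosting document processors.
    print(cpu_is_saturated([95, 97, 92, 99, 91]))  # True
    print(cpu_is_saturated([95, 40, 30, 92, 20]))  # False
```

A short spike to 100% is normal during feeding; it is the sustained pattern that tells you to add document processors (or servers).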
Indexing Dispatcher: remember what I said about the Content Distributor? The Indexing Dispatcher will have a similar role, receiving batches of processed content and forwarding them to the Indexer component to be effectively indexed. This component is also very lightweight and should not impact your performance.
How (and why) to scale: add at least two instances of this component across different servers for fault tolerance reasons, either during the initial deployment of your FAST farm or later on by reconfiguring it.
Indexer: I could spend a whole other post just talking about the Indexer component, but if you made it this far, just understand that this component receives the processed documents that were in memory, saves them to disk (by default in C:\FASTSearch\data\data_fixml) and then builds the actual optimized binary index (along with sorting tables, summary tables, etc.), also on disk (by default in C:\FASTSearch\data\data_index).
How (and why) to scale: as I said, this deserves its own post, but for now you can check this good reference about increasing indexing capacity as well as the astounding-and-official FS4SP capacity planning.
How (and what) to monitor: to check the activity in this component you can use the Indexer Performance counters. The main concept to understand is that all documents arrive in an Indexer at Partition 0 (a different concept and purpose than Index Partitions in SharePoint Search), where they get indexed and move to the upper partitions (Partition 1-4), so you will want to monitor how fast your Partition 0 can reindex itself, making the new content available to search as it arrives. If this seems confusing (and it most likely will), check the FS4SP capacity planning guide mentioned above as it explains the index partitions in more detail.
Additional reference that you must check: the entire section about Performance and capacity management available on TechNet is a must-read.
That’s it for now. Enjoy!