I’d like to revisit the topics that Leo has described so well in his two previous posts, “Search 101” and “Search Architecture in SharePoint 2010”, but discuss them in the context of SharePoint 2013 Search. This post covers the general architecture of SharePoint 2013 Search, describes all the components involved and briefly touches on the biggest changes for those coming from the FAST for SharePoint 2010 “world”. Future posts will go deeper into each search component and provide both an overview and troubleshooting information.
- Search 101: general concepts of search, including crawling, processing, indexing and searching (Leo’s initial post)
- Search Architecture in SharePoint 2013: the overall architecture of search-related components in SharePoint 2013 (this post)
- Planning and Scale (future post)
- Installation / Deployment (future post)
- Processing (future post)
- Indexing (future post)
- Searching (future post)
As Leo described in his previous post, if you are a complete ‘newbie’ as to how a search engine should work, the very basic tasks it should perform are:
- Crawling: acquire content from wherever it may be located (web sites, intranet, file shares, email, databases, internal systems, etc.)
- Processing: prepare this content to make it more “searchable”. Think of a Word document, where you will want to extract the text contained in the document, or an image that has some text that you want people to be able to search, or even a web page where you want to extract the title, the body and maybe some of its HTML metadata tags.
- Indexing: this is the magic sauce of search and what makes it different from just storing the content in a database and searching it using SQL statements. The content in a search engine is stored in a form optimized for later retrieval. We typically call this optimized version of the content the search index.
- Searching: the best-known part of search engines. You pass one or more query terms and the search engine returns results based on what is available in its search index.
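To make the four tasks concrete, here is a minimal Python sketch of a toy search engine. It is an illustration only, not how SharePoint implements any of these steps: the function names, the sample sources and the simple "all terms must match" query logic are assumptions made for this example.

```python
from collections import defaultdict

def crawl(sources):
    """Crawling: yield (doc_id, raw_content) pairs from each source."""
    for source in sources:
        for doc_id, raw in source.items():
            yield doc_id, raw

def process(raw):
    """Processing: reduce raw content to lowercase, punctuation-free tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in raw)
    return cleaned.split()

def build_index(docs):
    """Indexing: build an inverted index mapping each term to its documents."""
    index = defaultdict(set)
    for doc_id, raw in docs:
        for term in process(raw):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Searching: return the documents that contain every query term."""
    terms = process(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Two hypothetical content sources standing in for a file share and an intranet.
file_share = {"doc1": "Search architecture in SharePoint", "doc2": "Crawling file shares"}
intranet = {"page1": "SharePoint search: crawling and indexing"}

index = build_index(crawl([file_share, intranet]))
print(sorted(search(index, "sharepoint search")))
```

The inverted index is the point of the exercise: a query looks up terms directly instead of scanning every document, which is what separates a search index from a plain table of content.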
Armed with this knowledge, let’s take a look at the SharePoint 2013 Search architecture, where we can immediately see that the main components do just that:
Search Architecture in SharePoint 2013
- Crawling: SharePoint Crawler via the SharePoint Search SSA
- Content Processing: Content Processing Component (CPC)
- Indexing: Indexing Component
- Searching: Query Processing Component (QPC)
You’ll notice that there is one more component that we didn’t describe in our ‘basics’, but it’s quite an important one. The Analytics Processing Component is extremely important to SharePoint 2013 Search, as it does both Usage and Search Analytics and learns by usage and by processing various events such as ‘views’, ‘clicks’ and so on. It then enriches the index by updating index items, which impacts relevancy calculations based on processed data, and provides valuable information in such forms as Recommendations and Usage reports.
- Analytics: Analytics Processing Component (APC)
Let’s take a brief look at each sub-system and its architecture:
Simply put, the SharePoint 2013 crawler grabs content from various repositories, runs it through various crawl rules and sends it off to the Content Processing Component for further processing. You can think of it as the first step in your feeding chain, with the search index being the final destination.
Crawling can be scaled out using multiple crawl components and databases. The new Continuous crawl mode ensures index freshness, while the architecture has been simplified compared to FAST for SharePoint: a single SharePoint Search Service Application handles both crawling and querying.
A Continuous Crawl can have multiple continuous crawl sessions running in parallel. This capability enables the crawler to keep the search index fresher. For example, if a preceding Continuous Crawl session is busy processing a deep security change, the subsequent session can process content updates. Unlike with an Incremental crawl, there is no need to wait for completion before new changes can be picked up: these crawls are spun up every 15 minutes and crawl the “change logs”.
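The idea of overlapping sessions reading from a change log can be sketched in a few lines of Python. This is a conceptual model only: the `ChangeLog` class and its `claim_new_changes` method are hypothetical names for this example and say nothing about how SharePoint actually tracks change log positions.

```python
class ChangeLog:
    """A toy change log: an append-only list plus a marker for claimed entries."""

    def __init__(self):
        self.entries = []
        self.next_unclaimed = 0

    def record(self, change):
        self.entries.append(change)

    def claim_new_changes(self):
        """Hand out every change not yet claimed by an earlier crawl session."""
        claimed = self.entries[self.next_unclaimed:]
        self.next_unclaimed = len(self.entries)
        return claimed

log = ChangeLog()
log.record("deep security change on site A")

# Session 1 starts and stays busy with the security change.
session_1 = log.claim_new_changes()

# A content update arrives while session 1 is still running.
log.record("doc1 edited")

# Session 2 starts on schedule and picks up the new change without waiting.
session_2 = log.claim_new_changes()
print(session_1, session_2)
```

With an incremental crawl, the second batch of changes would have to wait until the first crawl finished; here each session simply takes whatever is new when it starts.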
- Invokes Connectors/Protocol Handlers to content sources to retrieve data
- Crawling is done via a single SharePoint Search SSA
- Crawl Database is used to store information about crawled items and to track crawl history
- Crawl modes: Incremental, Full and Continuous
- Incremental, Full and Continuous crawl modes
- No need for Content SSA and Query SSA: a single Search SSA
- FAST Web Crawler no longer exists
- Improved Crawl Reporting/Analytics
In SharePoint Search 2010, there was a single role involved in the feeding chain: the Crawler. In FAST for SharePoint 2010, the feeding chain consisted of three additional components (other than the crawler): Content Distributor, Document Processor and Indexing Dispatcher. With SharePoint 2013 Search, the Content Processing Component combines all three.
A simple way to describe the Content Processing Component is that it takes the content produced by the Crawler, does some analysis/processing on the content to prepare it for indexing and sends it off to the Indexing Component. It takes crawled properties as input from the Crawler and produces Managed Properties as output for the Indexer.
The Content Processing Component uses Flows and Operators to process the content. If coming from the FAST “world”, think of Flows as Pipelines and Operators as Stages. Flows define how to process content, queries and results, and each flow processes one item at a time. Flows consist of operators and connections organized as graphs. This is really where all the “magic” happens: things like language detection, word breaking, security descriptors, content enrichment (web service callout), entity and metadata extraction, deep link extraction and so on.
The CPC comes with pre-defined flows and operators that currently cannot be changed in a supported way. If you search hard enough, you will find blogs that describe how to customize flows and operators in an unsupported fashion. The flow has branches that handle different operations, like inserts, deletes and partial updates. Notice that security descriptors are now updated in a separate flow, which should make the dreaded “security-only” crawl perform better than in previous versions.
As I’ve mentioned, the CPC has an internal mechanism to load-balance items coming from the Crawler between available flows (analogous to the old FAST Content Distributor). It also has a mechanism at the very end of the flow to load-balance indexing across the available Indexing Components (analogous to the old FAST Indexing Dispatcher). We will revisit this topic in more detail in subsequent posts.
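As a rough mental model of flows, operators and the final dispatch step, consider the Python sketch below. Everything in it is an assumption for illustration: real flows are graphs rather than a simple list, the operator bodies are placeholders, the crawled-to-managed property mapping is made up, and round-robin is just one possible load-balancing strategy, not necessarily the one the CPC uses.

```python
from itertools import cycle

def detect_language(item):
    # Placeholder operator: a real one would inspect the text.
    item["language"] = "en"
    return item

def break_words(item):
    # Placeholder word breaker: lowercase and split on whitespace.
    item["tokens"] = item["crawled"].get("body", "").lower().split()
    return item

def map_schema(item):
    # Schema mapping: crawled property names -> managed property names.
    # The mapping itself is hypothetical.
    mapping = {"ows_Title": "Title", "ows_Author": "Author"}
    item["managed"] = {
        managed: item["crawled"][crawled]
        for crawled, managed in mapping.items()
        if crawled in item["crawled"]
    }
    return item

# A flow, simplified here to an ordered list of operators.
FLOW = [detect_language, break_words, map_schema]

def run_flow(item, flow=FLOW):
    """Push one item through every operator of the flow, in order."""
    for operator in flow:
        item = operator(item)
    return item

def dispatch(items, index_components):
    """Process each item, then hand it to the next index component in turn."""
    targets = cycle(index_components)
    return [(next(targets), run_flow(item)) for item in items]

crawled_item = {"crawled": {"ows_Title": "Budget", "body": "Quarterly budget numbers"}}
print(run_flow(crawled_item)["managed"])
```

Crawled properties go in on one side and managed properties come out on the other, which is the contract between the Crawler, the CPC and the Indexer described above.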
- Stateless node
- Analyzes content for indexing
- Enriches content as needed via Content Enrichment Web Service (web service callout)
- Schema mapping. Produces managed properties from crawled properties
- Stores links and anchors in the Link database (analytics)
- Web Service callout only works on managed properties, unlike Pipeline Extensibility in FAST for SharePoint 2010, which worked on crawled properties.
- Flows have different branches that can handle operations like deletes or partial updates on security descriptors separately from main content, improving performance.
- Content Parsing is now handled by Parsers and Format Handlers (will be described in later posts)
Note: If the Content Enrichment Web Service does not meet your current needs and you need more ESP-style functionality when it comes to pipeline customization, talk to Microsoft Consulting Services or Microsoft Premier Field Engineers about the CEWS Pipeline Toolkit.
The job of the indexer is to receive all processed content from the Content Processing Component, eventually persist it to disk (store it) and have it ready to be searched via the Query Processing Component. It’s the “heart” of your search engine; this is where your crawled content lives. Your index resides on something called an Index Partition. You may have multiple Index Partitions, each containing a unique subset of the index. All of your partitions taken together make up your entire search index. Each partition may have one or more Replicas, each containing an exact copy of the index from that partition. There will always be at least one replica per partition: the primary replica. So when coming from FAST, think “partitions and replicas” instead of “columns and rows”.
Each index replica is an Index Component. When we provision an Index Component, we associate it with an index partition.
Increase Query load or fault tolerance: Add more index replicas
Increase content volume: Add more index partitions
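The partition/replica layout can be modeled in a few lines of Python. This is a conceptual sketch only: the `IndexTopology` class is hypothetical, and assigning documents to partitions with a CRC32 hash is an assumption made for the example, not a description of how SharePoint distributes items.

```python
import zlib

class IndexTopology:
    """Toy model: partitions split the index, replicas copy a partition."""

    def __init__(self, partitions, replicas_per_partition):
        # replicas[p][r] is the set of doc ids held by replica r of partition p.
        self.replicas = [
            [set() for _ in range(replicas_per_partition)]
            for _ in range(partitions)
        ]

    def partition_for(self, doc_id):
        # Stable hash so a document always lands in the same partition.
        return zlib.crc32(doc_id.encode("utf-8")) % len(self.replicas)

    def add(self, doc_id):
        # Every replica of the chosen partition receives the same document.
        for replica in self.replicas[self.partition_for(doc_id)]:
            replica.add(doc_id)

    def all_documents(self):
        # One replica from each partition is enough to cover the whole index.
        docs = set()
        for partition in self.replicas:
            docs |= partition[0]
        return docs

topology = IndexTopology(partitions=2, replicas_per_partition=2)
for n in range(10):
    topology.add("doc%d" % n)
print(len(topology.all_documents()))
```

Adding a partition spreads the documents over more subsets (content volume), while adding a replica adds another full copy of one subset (query load and fault tolerance).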
There are a couple of very important changes to internals of the indexer that I’d like to touch upon:
- There is NO MORE FIXML. Just a reminder, FIXML stood for FAST Index XML and contained an XML representation of each document that the indexer used to create the binary index. FIXML was stored locally on disk and was frequently used to re-create the binary index without having to re-feed from scratch. There is now a new mechanism called a ‘partial update’, which replaces the need for FIXML.
- Instant Indexing: We can now serve queries on newly indexed items much sooner, directly from memory, instead of waiting for those items to be persisted to disk.
- Journaling: Think RDBMS “transaction log”: a sequential history of all operations to each index partition and its replicas. Together with checkpointing, it allows for the “instant indexing” feature above, as well as ACID features (atomicity, consistency, isolation and durability). For the end user, this ensures that a document, or a set of documents treated as a group, is either fully indexed or not indexed at all. We will discuss this in much more detail in subsequent posts.
- Update Groups/Partial Update Mechanism: All document properties (managed properties) are split into Update Groups. In the past with FAST, “partial updates” were quite expensive, as the indexer had to read the whole FIXML document, find the element, update the file, save it and re-index the FIXML document. Now, properties in one update group can be updated at a low cost without affecting the rest of the index.
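To illustrate why update groups make partial updates cheap, here is a small Python sketch. The `IndexedItem` class, the group names and the write counter are all hypothetical; the only point is that changing one group leaves the others untouched.

```python
class IndexedItem:
    """One indexed item whose managed properties are split into update groups."""

    def __init__(self, groups):
        # groups maps a group name to {managed property: value}.
        self.groups = {name: dict(props) for name, props in groups.items()}
        # Count writes per group to show which parts of the item were touched.
        self.writes = {name: 1 for name in self.groups}

    def partial_update(self, group, changes):
        """Rewrite only the named update group; other groups stay untouched."""
        self.groups[group].update(changes)
        self.writes[group] += 1

item = IndexedItem({
    "content": {"Title": "Budget 2013", "Body": "Quarterly numbers"},
    "security": {"Acl": ["finance"]},
})

# A security-only change rewrites the security group and nothing else.
item.partial_update("security", {"Acl": ["finance", "audit"]})
print(item.writes)
```

Compare this with the FIXML approach described above, where any change meant reading, modifying and re-indexing the whole document.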
There is also an updated mechanism for merging Index Parts, which you can loosely compare to how FAST merged what were then called “index partitions”.
Internally, the index is built up of several smaller inverted index parts, each one being an independent portion of the index. From time to time, based on specific criteria, they need to be merged in order to free up the resources associated with maintaining many small indices. Typically, smaller parts are merged more often, while larger ones are merged less frequently.
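One common way to get the "small parts merge often, large parts merge rarely" behavior is a level-based merge, sketched below in Python. This is a generic technique shown under stated assumptions, not SharePoint's actual merge policy: the `fan_in` threshold, the level layout and the function names are invented for this example.

```python
def merge_parts(parts):
    """Combine several inverted index parts (term -> doc ids) into one."""
    merged = {}
    for part in parts:
        for term, doc_ids in part.items():
            merged.setdefault(term, set()).update(doc_ids)
    return merged

def add_part(levels, part, fan_in=3):
    """Add a new part at level 0 and cascade merges upward as levels fill."""
    level = 0
    while True:
        if len(levels) <= level:
            levels.append([])
        levels[level].append(part)
        if len(levels[level]) < fan_in:
            return
        # This level is full: merge it into one larger part for the next level.
        part = merge_parts(levels[level])
        levels[level] = []
        level += 1

levels = []
for doc_id in range(9):
    add_part(levels, {"search": {doc_id}})
print([len(level) for level in levels])
```

With a fan-in of 3, level 0 is merged after every third part, level 1 after every ninth, and so on, so the largest parts are rewritten least often.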
Keep in mind that Level/Part 0 is the in-memory section that directly allows for the “Instant Indexing” feature. When documents come into the indexing subsystem, they come into 2 places at the same time:
- The Journal
- The Checkpoint section (Level 0 in the figure above)
The Checkpoint section contains documents that are in memory and already searchable but have not yet been persisted to disk. If search crashes, the in-memory portion is lost but will be restored (replayed) from the Journal on the next start-up.
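The interplay between the Journal and the in-memory section can be sketched as follows. This is a simplified model, not the real indexer: the `IndexPartition` class, its method names and the use of plain Python lists and dicts in place of durable storage are assumptions for illustration.

```python
class IndexPartition:
    """Toy model of journaling plus an in-memory, searchable Level 0."""

    def __init__(self):
        self.journal = []             # stands in for the durable, sequential journal
        self.checkpoint_section = {}  # Level 0: in memory, searchable, not yet on disk
        self.disk = {}                # persisted index parts
        self.persisted_upto = 0       # number of journal entries already on disk

    def add(self, doc_id, text):
        # A document lands in the journal and in memory at the same time.
        self.journal.append((doc_id, text))
        self.checkpoint_section[doc_id] = text

    def flush_to_disk(self):
        # Persist the in-memory documents and remember how far the journal is covered.
        self.disk.update(self.checkpoint_section)
        self.checkpoint_section = {}
        self.persisted_upto = len(self.journal)

    def crash_and_restart(self):
        # The in-memory part is lost, then rebuilt by replaying the journal tail.
        self.checkpoint_section = {}
        for doc_id, text in self.journal[self.persisted_upto:]:
            self.checkpoint_section[doc_id] = text

    def search(self, term):
        # Queries see both persisted and in-memory documents ("instant indexing").
        visible = {**self.disk, **self.checkpoint_section}
        return sorted(
            doc_id for doc_id, text in visible.items()
            if term.lower() in text.lower().split()
        )

partition = IndexPartition()
partition.add("a", "journal entry")
partition.flush_to_disk()
partition.add("b", "second journal entry")   # searchable right away, not yet on disk
partition.crash_and_restart()
print(partition.search("journal"))
```

Document "b" was never flushed, yet it survives the simulated crash because the journal still holds it.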
The Query Processing Component is tasked with taking a user query that comes from a search front-end and submitting it to the Index Component. It routes each incoming query to index replicas, one from each index partition. The index returns a result set for the processed query to the QPC, which in turn processes the result set before sending it back to the search front-end. The QPC also contains a set of flows and operators, similar to the Content Processing Component. If coming from FAST, you can compare it to the QRServer with its query processing pipelines and stages.
- Stateless node
- Query-side flows/operators
- Query federation
- Query Transformation
- Load-balancing and health checking
- Configurations stored in Admin database
- Result Sources/Query Rules
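The routing described above (one replica from each partition, with health checking) can be sketched in Python as follows. The data layout, the "first healthy replica" selection and the term-count scoring are all assumptions for this example and are not how the QPC actually ranks or balances.

```python
def pick_replica(partition, is_healthy):
    """Return the first healthy replica of a partition."""
    for replica in partition:
        if is_healthy(replica):
            return replica
    raise RuntimeError("no healthy replica available for this partition")

def run_query(term, partitions, is_healthy=lambda replica: replica["healthy"]):
    """Query one replica per partition and merge the hits into one result set."""
    hits = []
    for partition in partitions:
        replica = pick_replica(partition, is_healthy)
        for doc_id, text in replica["docs"].items():
            score = text.lower().split().count(term.lower())
            if score:
                hits.append((score, doc_id))
    # Highest score first; doc id breaks ties so the order is stable.
    return [doc_id for score, doc_id in sorted(hits, key=lambda h: (-h[0], h[1]))]

docs_p0 = {"a": "search search index", "b": "crawl"}
docs_p1 = {"c": "search"}
partitions = [
    # Partition 0: the first replica is down, so the second one serves the query.
    [{"healthy": False, "docs": docs_p0}, {"healthy": True, "docs": docs_p0}],
    # Partition 1: a single healthy replica.
    [{"healthy": True, "docs": docs_p1}],
]
print(run_query("search", partitions))
```

Because every partition holds a different subset of the index, skipping a partition would silently drop results, which is why a partition with no healthy replica raises an error here instead of returning a partial answer.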
The Analytics Processing Component is a powerful component that enables features such as Recommendations (‘if you like this you might like that’), anchor text/link analysis and much more. It extracts both search analytics and usage analytics, analyzes all the data and returns it in various forms, such as reports, or sends it to the Content Processing Component to be included in the search index for improved relevance calculations and recall.
Let’s quickly define both search analytics and usage analytics:
- Search analytics is information such as links, anchor text, information related to people, metadata, click distance, social distance, etc. The APC receives it from items via the Content Processing Component and stores it in the Link database.
- Usage analytics is information such as the number of times an item is viewed. It comes from the front-end event store and is stored in the Analytics Reporting database.
- Learns by usage
- Search Analytics
- Usage Analytics
- Enriches index for better relevance calculations and recall
- Based on Map/Reduce framework – workers execute needed tasks.
- Coming from FAST ESP/FAST for SharePoint, it combines many separate features and components such as FAST Recommendations, WebAnalyzer, Click-through analysis into a single component…and adds more.
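For readers unfamiliar with the map/reduce style mentioned above, here is a minimal Python sketch that counts usage events per item. It shows the general pattern only: the event format and function names are hypothetical, and the real APC distributes this kind of work across workers rather than running it in one process.

```python
from collections import defaultdict

def map_event(event):
    """Map step: emit ((item, event type), 1) for each event we care about."""
    if event["type"] in ("view", "click"):
        yield (event["item"], event["type"]), 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run_job(events):
    pairs = (pair for event in events for pair in map_event(event))
    return reduce_counts(pairs)

events = [
    {"item": "doc1", "type": "view"},
    {"item": "doc1", "type": "view"},
    {"item": "doc2", "type": "click"},
    {"item": "doc1", "type": "print"},   # ignored by the map step
]
usage = run_job(events)
print(usage)
```

Counts like these are the kind of signal that can then be fed back into the index to influence relevance, or surfaced in usage reports.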
I hope to do some deep dives into each component in future posts. Feel free to drop me a note with any questions that may come up.