Solr architecture
Solr is a search engine built on top of Apache Lucene that provides full-text search and indexing capabilities. It is used to implement fast and scalable search applications intended to work with large sets of structured and semi-structured data. Solr provides advanced search features like faceting, filtering, relevance ranking, results highlighting, and many others.
A few examples of Solr applicability:
- enterprise search applications;
- log and event search;
- e-commerce search with faceting and aggregation.
Features
The major Solr features are as follows:
- High-performance search. Using Lucene, Solr provides high-performance full-text search on large volumes of data.
- Powerful search capabilities. Solr has many built-in features like rich query syntax, complex filters, relevance ranking, results highlighting, spell checking, etc. It also provides language-specific capabilities like stemming, synonym matching, stop words exclusion, and so on.
- Faceting and filtering. With built-in faceted search, Solr can aggregate search results by a field, range, or any custom criteria. This produces more informative result sets, which is not only convenient for users but also extremely helpful for analytics.
- Scaling. In ADH, Solr scales horizontally by adding more Solr Server components via ADCM. By default, Solr runs in the high availability mode with the built-in replication mechanism enabled. Automatic failover and leader election are implemented via ZooKeeper.
- Flexible schema management. Solr supports both schema-based and schemaless approaches to data indexing. A well-designed schema with proper field types leads to a more compact index and significantly reduces search request time. For medium-sized datasets, where performance is not an issue, Solr can generate a schema automatically and ingest raw data without any predefined structure.
- Integration. Solr exposes a REST API, which is the primary communication channel for submitting indexing and search requests. Tools like Solr Cell, Apache Tika, and SolrJ allow programmatic interaction with a Solr core. In ADH, Solr is configured to work with HDFS, allowing you to run search operations over your HDFS storage.
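Because the REST API accepts all parameters in the URL query string, clients need to encode them correctly. The sketch below shows one way to build a Solr query URL with Python's standard library; the host and collection names are placeholders taken from the examples in this document, not fixed values:

```python
from urllib.parse import urlencode

# Hypothetical endpoint; host and collection names are placeholders.
SOLR_BASE = "http://ka-adh-1.ru-central1.internal:8983/solr"
COLLECTION = "test_collection"

def build_query_url(base, collection, params):
    """Construct a Solr /query URL with properly encoded parameters."""
    return f"{base}/{collection}/query?{urlencode(params)}"

url = build_query_url(SOLR_BASE, COLLECTION, {
    "q": "acc_id:1001",  # main query expression
    "fq": "txn_date:[2026-01-01T00:00:00Z TO 2026-01-31T23:59:59Z]",  # filter query
    "facet": "true",
    "facet.field": "acc_id",
})
print(url)
```

`urlencode` takes care of the characters that must not appear raw in a URL (the `:` in `acc_id:1001`, the spaces and brackets in the range filter), which is exactly what the percent-escapes in the curl examples below do by hand.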
Components
Below is a high-level diagram of the Solr architecture.
The major Solr concepts and components are as follows:
- Document. The basic unit of information in Solr, a set of data that describes something, for example, an e-shop product card with fields like name, price, description, and so on.
- Field. A document may have one or more fields, which carry specific information, for example, a product’s price. Fields can store data of various types — numeric, text, binary, and so on. Defining an appropriate data type in a schema allows Solr to perform search queries faster.
- Index. Solr stores all data in a Lucene index. Lucene uses an inverted index as the data structure for storing documents. An inverted index is similar to how words are indexed at the end of a book: for each distinct word, there is a list of pages where this word is mentioned. Storing data this way makes indexing slightly more costly, but it allows for lightning-fast lookups. Adding a document to the index is called indexing. Once indexed, a document becomes searchable.
- Collection. One or more documents grouped together in a single logical index using the same configuration and schema.
- Shard. A logical partition of a collection that holds a subset of the collection’s documents. Splitting a collection into shards allows the index to be distributed across several nodes.
- Replica. A physical copy of a shard. Replicas enhance fault tolerance by providing additional copies of the data and improve scalability by providing additional capacity for searching.
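The inverted index described above can be sketched in a few lines of Python. This is a simplified illustration of the idea only, not Lucene's actual on-disk format:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each distinct term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "first transaction",
    2: "second transaction completed",
}
index = build_inverted_index(docs)
# Lookup is a single dictionary access, regardless of corpus size.
print(index["transaction"])  # → [1, 2]
```

Building the index requires a pass over every term of every document (the extra indexing cost mentioned above), but answering "which documents contain this word?" afterwards is a single lookup.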
Execution workflow
Below are step-by-step descriptions of how Solr processes indexing and search requests.
Indexing request
- A client submits an indexing request to a Solr Server component. For example:

$ curl -X POST 'http://ka-adh-3.ru-central1.internal:8983/solr/test_collection/update?commit=true' \
    -H 'Content-Type: application/json' \
    --data @transactions.json

The transactions.json document:

[
    {
        "id": 1,
        "txn_id": "1",
        "acc_id": 1001,
        "txn_value": 75.0,
        "txn_date": "2026-01-02",
        "comment": "The first transaction."
    }
]

- The receiving Solr Server component is called a coordinator. The coordinator parses the request payload, determines the target collection, and generates document IDs if they are not supplied.
- The coordinator contacts ZooKeeper to get cluster state information, including the list of live Solr nodes. Solr applies a routing function (hashing) to determine the shards to which the documents from the payload should be sent.
- Once the shards are identified, Solr requests ZooKeeper for the current leader replica of every shard. If the leader is unknown, ZooKeeper triggers a leader election. At this step, Solr already knows to which shard each document should be sent.
- Solr forwards documents to the shards on different ADH hosts. When a document reaches its target shard, it goes through the UpdateRequestProcessorChain, where it is validated, matched against the schema, and assigned a version.
- Before modifying the Lucene index, the leader replica writes updates to the transaction log (tlog). From this moment, the update is considered durable, meaning that it can be replayed even if Solr crashes.
- The leader replica forwards updates to all active replicas of the shard. Replicas update their tlogs and acknowledge the receipt of the new documents.
- Once the leader replica has logged the update and the required number of non-leader replicas have done the same, the coordinator responds to the client with 200 OK. However, the document is not yet searchable, and no writes have been made to the Lucene index so far (a so-called near-real-time update).
- Right after responding to the client, Solr adds the documents to a Lucene in-memory buffer.
- The last step is the commit operation, which flushes Lucene segments to disk, making the submitted documents searchable. When a commit is triggered is defined by Solr commit settings.
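The routing step above, hashing a document ID to pick a target shard, can be sketched as follows. This is a simplified illustration of the idea, not Solr's actual compositeId router; the shard names are placeholders:

```python
import hashlib

# Hypothetical shard list for a three-shard collection.
SHARDS = ["shard1", "shard2", "shard3"]

def route(doc_id, shards):
    """Pick a target shard by hashing the document ID into the shard list.

    A stable hash (here MD5) guarantees that the same ID always maps
    to the same shard, so updates and lookups find the same partition.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# The same ID is always routed to the same shard.
assert route("1", SHARDS) == route("1", SHARDS)
print(route("1", SHARDS))
```

The key property is determinism: any coordinator node, given only the document ID and the cluster state from ZooKeeper, computes the same target shard without any central lookup table.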
Search request
- A client submits a search request with filtering and faceting parameters to a Solr Server component. For example:

$ curl -X GET 'http://ka-adh-1.ru-central1.internal:8983/solr/test_collection/query?q=acc_id:1001&q.op=OR&indent=true&fq=txn_date:%5B2026-01-01T00:00:00Z%20TO%202026-01-31T23:59:59Z%5D&facet=true&facet.field=acc_id'

- The receiving Solr Server component is called a query coordinator. The coordinator parses the request parameters and loads the schema and field types.
- The coordinator requests the Solr cluster state from ZooKeeper to identify the target shards and available replicas.
- Solr sends subqueries to replicas to run them in parallel. The following actions take place on each replica:
  - Every filter expression (fq) is processed, and the results are cached.
  - The main query expression is executed.
- Solr performs faceting. For this, it iterates over the cached results and computes interim facet counts. Then the coordinator receives the shard responses, merges the results, and sums up the per-shard facet values to get the final counts.
- The coordinator fetches matching documents from the Lucene index. It requests the fields of the required documents by ID and constructs the final response, for example:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":28,
    "params":{
      "q":"acc_id:1001",
      "facet.field":"acc_id",
      "indent":"true",
      "q.op":"OR",
      "fq":"txn_date:[2026-01-01T00:00:00Z TO 2026-01-31T23:59:59Z]",
      "facet":"true",
      "_":"1768832698829"}},
  "response":{"numFound":1,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[
      {
        "id":"1",
        "txn_id":[1],
        "acc_id":[1001],
        "txn_value":[75.0],
        "txn_date":["2026-01-02T00:00:00Z"],
        "comment":["The first transaction."],
        "_version_":1854745967899705344}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "acc_id":[
        "1001",1]},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
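The facet merge performed by the query coordinator, summing per-shard counts into the final values, can be sketched as follows. This is a simplified illustration with plain dictionaries, not Solr's internal data structures; the shard counts are hypothetical:

```python
from collections import Counter

def merge_facets(shard_responses):
    """Sum per-shard facet counts for one field into the final result,
    the way a query coordinator combines partial shard answers."""
    total = Counter()
    for counts in shard_responses:
        total.update(counts)  # Counter.update adds counts, it does not replace them
    return dict(total)

# Hypothetical per-shard facet counts for the acc_id field.
shard1 = {"1001": 1, "1002": 3}
shard2 = {"1001": 2}
print(merge_facets([shard1, shard2]))  # → {'1001': 3, '1002': 3}
```

Because every shard computes its counts independently over its own subset of the documents, a plain per-bucket sum is all the coordinator needs to produce the `facet_counts` section shown in the response above.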