Solr performance tuning

Konstantin Alpashkin

Contents

Overview
Index configuration
Memory and JVM parameters
Commit parameters
Query optimization

Overview

This article provides optimization techniques and best practices that can improve your Solr performance.

There is no definitive guide on improving Solr performance that would work for all deployments. Solr has a lot of moving parts that affect each other, so changing one configuration parameter can affect the behavior of others. When tuning Solr performance, test your setup iteratively and make only a few configuration changes between tests. This way you can be confident that you are moving in the right direction and can identify bottlenecks earlier.

The recommendations and optimizations described below fall into the following groups:

Index configuration
Memory and JVM parameters
Commit parameters
Query optimization

Index configuration

A proper indexing configuration allows Solr to store documents more efficiently in the Lucene index files and results in faster queries when the documents are retrieved from the index.

Schema design

Choose the field types that best match the data type stored in the fields. For example, storing dates in DatePointField fields rather than StrField can improve the performance both at index and query time.
Minimize the use of text_general (TextField) fields to store simple strings. Indexing and searching fields of this type trigger tokenization and analysis chains for each request, which consumes computing power. Instead, prefer types like string (StrField), which store strings without any additional processing.
Minimize the use of dynamic fields. Dynamic fields allow Solr to index fields that were not explicitly defined in the schema. However, such flexibility brings additional overhead, so reserve dynamic fields for cases where they pay off, for example, sparsely populated fields or fields whose type can vary.
Use docValues for fields that are frequently used in sorting or faceting.
When using custom fields, specify the field properties based on how the field will be queried in the future. The basic recommendations are as follows:
- Avoid unnecessary multiValued fields. This property indicates that a document might contain multiple values for the field.
- Do not use stored=true unless the field is needed for retrieval.
- Use indexed=false for fields that are not part of the search or filtering process.
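The field properties above can be sketched as Solr Schema API "add-field" commands. This is only an illustration: the field names (created_at, sku, raw_payload) are hypothetical, and the type names pdate and string are assumed to exist in your schema, as they do in Solr's default configset.

```python
import json


def add_field_command(name, field_type, *, indexed=True, stored=False,
                      doc_values=False, multi_valued=False):
    """Build one Schema API "add-field" command as a plain dict."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,
            "stored": stored,
            "docValues": doc_values,
            "multiValued": multi_valued,
        }
    }


# Hypothetical fields, chosen only to illustrate the recommendations.
commands = [
    # A date used for sorting/faceting: date type plus docValues.
    add_field_command("created_at", "pdate", doc_values=True),
    # A simple identifier: plain string, stored because it is returned.
    add_field_command("sku", "string", stored=True),
    # Returned to clients but never searched or filtered on.
    add_field_command("raw_payload", "string", indexed=False, stored=True),
]

for command in commands:
    print(json.dumps(command))
```

Each printed JSON object could then be sent in a POST request to the /solr/<collection>/schema endpoint of your Solr node; the sketch deliberately stops short of making the request.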

Index size

Avoid adding and storing redundant and duplicate information to the index. For more information on catching duplicates, see Deduplication in Solr.
Tune the merge policy, which defines the strategy for merging smaller Lucene segment files into larger ones. By default, TieredMergePolicy merges segments of approximately equal size, subject to an allowed number of segments per tier.

Sharding and replication

Distribute your data across multiple shards residing on different ADH hosts to handle more queries simultaneously. In a multi-sharded Solr index, the documents are distributed among shards so that every document is contained in exactly one shard.
Use replication for high availability and better fault tolerance.
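As a rough illustration of sharding and replication, the sketch below builds a Collections API CREATE request URL. It assumes Solr runs in SolrCloud mode; the host name solr-host, the port 8983, and the collection name products are hypothetical placeholders.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor):
    """Build a Collections API URL that creates a sharded, replicated collection."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,                  # documents are split across shards
        "replicationFactor": replication_factor,  # copies of each shard
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical host and collection name.
url = create_collection_url("http://solr-host:8983/solr", "products", 3, 2)
print(url)
```

With these values, every document would live in exactly one of three shards, and each shard would have two replicas.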

Memory and JVM parameters

One of the crucial factors affecting Solr performance is RAM. Solr requires sufficient memory for the Java heap and additionally needs free memory for the OS-level page cache.

Allocate sufficient Java heap memory for every Solr Server node. For this, use the Solr Server Heap Memory parameter in ADCM (Clusters → <clusterName> → Services → Solr → Primary configuration). However, simply allocating a larger Java heap does not guarantee speeding up your Solr setup. Your heap should be able to accommodate all caches and internal objects while leaving enough space to avoid frequent GC runs. Also, there must be host memory left to handle OS page caching activities to cache Solr index data.
Allocate a JVM heap equal to 40-70% of the host RAM; a portion of memory should always remain free. Apart from the heap, Solr uses free direct (off-heap) memory for faster caching and flushing of document data to index segment files. A heap that is too large can leave little memory for these operations and turn them into a bottleneck.
Use the garbage collection logs to detect abnormalities and monitor memory usage. By default, Solr stores garbage collection logs in /var/log/solr/solr_gc.log.* files. For more information, see Logging in Solr.
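The 40-70% guideline above comes down to simple arithmetic. The helper below is a hypothetical sketch (heap_range_mb is not part of Solr or ADCM) that computes the suggested heap range for a host and shows how much memory would remain for the OS page cache and off-heap operations.

```python
def heap_range_mb(host_ram_mb, low_pct=40, high_pct=70):
    """Return the (min, max) heap size in MB for the 40-70% guideline."""
    if host_ram_mb <= 0:
        raise ValueError("host_ram_mb must be positive")
    return host_ram_mb * low_pct // 100, host_ram_mb * high_pct // 100


host_ram_mb = 64 * 1024  # a host with 64 GB of RAM, as an example
low, high = heap_range_mb(host_ram_mb)
print(f"Heap between {low} MB and {high} MB")
print(f"At least {host_ram_mb - high} MB stays outside the heap")
```

The right value inside this range still has to be found by testing, since a larger heap does not by itself make Solr faster.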

Commit parameters

In Solr, commits are used to make updated documents visible for search queries. Until a commit is made, updated documents are unavailable for searching by other Solr clients. Solr performs two types of commits:

Hard commit. Makes updates available for searching by calling fsync on Lucene index files and flushing bits to disk. After a hard commit, Solr truncates the transaction log and opens a new one.
Soft commit. Makes index changes visible without calling fsync on the Lucene segment files. After a soft commit, updated document data is written to a transaction log but is not yet stored in a segment file.

Soft commits are a good choice for achieving near-real-time search results, as they make documents visible without waiting for a costly hard commit. Hard commits are essential for durability and ensuring that documents are not lost when a node goes down unexpectedly. You can change the commit settings in solrconfig.xml (<autoCommit> and <autoSoftCommit> sections).

Consider the following tips on tuning the commit mechanism:

Avoid overly frequent hard commits, as their heavy load on CPU and disk I/O can severely impact overall performance. Disk I/O can become the major bottleneck during hard commits if there is not enough memory for OS page caching. When choosing the interval for automatic hard commits, balance between committing as rarely as possible and meeting the time SLA for updates to become visible to Solr clients.
Although soft commits are cheaper than hard commits, they still consume CPU and RAM. It is recommended to run automatic soft commits more frequently than hard commits, while balancing the near-real-time update delay against indexing speed.
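The commit intervals can also be expressed as a Config API "set-property" payload instead of editing solrconfig.xml by hand. The sketch below assumes that your Solr version exposes the updateHandler.autoCommit.* and updateHandler.autoSoftCommit.* properties through the Config API; the interval values are illustrative, not recommendations.

```python
import json


def commit_settings(hard_commit_ms, soft_commit_ms):
    """Build a Config API payload for automatic hard and soft commits."""
    # Mirrors the tip above: soft commits should run more often than hard ones.
    if soft_commit_ms > hard_commit_ms:
        raise ValueError("soft commit interval should not exceed hard commit interval")
    return {
        "set-property": {
            "updateHandler.autoCommit.maxTime": hard_commit_ms,
            # Assumption: hard commits only persist data here, and visibility
            # is left to the more frequent soft commits.
            "updateHandler.autoCommit.openSearcher": False,
            "updateHandler.autoSoftCommit.maxTime": soft_commit_ms,
        }
    }


# Illustrative values: hard commit every 60 s, soft commit every 5 s.
payload = commit_settings(60000, 5000)
print(json.dumps(payload, indent=2))
```

The printed JSON could be sent in a POST request to the /solr/<collection>/config endpoint. Start from conservative intervals and adjust them between test runs.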

Query optimization

Complex queries can slow down Solr. If your query is complex and runs for too long, try splitting it into two simpler queries.
Change the query caching behavior. Solr automatically pre-populates caches for frequent and similar queries; the time this takes is called the cache warm-up time. However, when data is written to Solr, all caches are invalidated after each commit and have to be warmed up again, which consumes resources. Set autowarmCount=0 for the <filterCache>, <queryResultCache>, and <documentCache> settings in your solrconfig.xml to reduce cache warm-up cycles.
Do not use too many filters within one search request.
Avoid using expensive operations like wildcard searches at the beginning of a term.
Use lazy field loading (the <enableLazyFieldLoading> parameter) so that Solr loads fields that were not directly requested only later, when they are needed. This can improve performance if most queries request only a small subset of fields.
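The query-side tips can be sketched in two parts: a Config API payload for the cache and lazy-loading settings, and a search request that uses a small number of filters and asks only for the fields it needs. The query.* property names are assumed to be editable through the Config API in your Solr version; the host, collection, and field names are hypothetical.

```python
import json
from urllib.parse import urlencode


def query_settings():
    """Build a Config API payload that disables autowarming and enables lazy loading."""
    return {
        "set-property": {
            "query.filterCache.autowarmCount": 0,
            "query.queryResultCache.autowarmCount": 0,
            "query.enableLazyFieldLoading": True,
        }
    }


def select_url(base_url, collection, query, filters, fields, rows=10):
    """Build a /select URL with one fq parameter per filter and a narrow fl list."""
    params = [("q", query)]
    params += [("fq", f) for f in filters]   # keep the number of filters small
    params.append(("fl", ",".join(fields)))  # request only the fields you need
    params.append(("rows", rows))
    return f"{base_url}/{collection}/select?{urlencode(params)}"


print(json.dumps(query_settings(), indent=2))

# Hypothetical host, collection, and field names.
url = select_url(
    "http://solr-host:8983/solr",
    "products",
    "laptop",
    ["category:electronics", "in_stock:true"],
    ["id", "name"],
)
print(url)
```

Requesting only id and name here is what lets lazy field loading pay off, since the remaining stored fields do not have to be loaded for the response.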
