Solr indexing overview

This article describes Solr indexing — the process of adding new content to a Solr index to make the content searchable, as well as modifying and deleting existing content.

There are several ways to feed data to Solr for indexing:

  • Upload XML, JSON, or CSV files in the body of an HTTP request to a Solr server’s REST endpoint. Solr ships with the Post tool, a utility that simplifies uploading documents through this endpoint. This method is used in the examples throughout the article.

  • Use Solr Admin UI to upload documents through the web interface.

  • Create a client application that interacts with Solr over HTTP using Client APIs, for example, the SolrJ library for JVM-based applications.

  • Use a framework such as Solr Cell, which is designed for efficient data ingestion into Solr.

Regardless of the delivery method, all data submitted to a Solr index shares a common structure: it must be a document containing one or more named fields, each storing a content payload (which may be empty). The structure of documents is defined by a schema file; however, Solr can also accept schemaless documents. Typically, one of the fields is chosen to be unique (analogous to a primary key in traditional databases), although Solr does not strictly require a unique key.
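As an illustration of the first method, the sketch below prepares (but does not send) an HTTP request that uploads one document to a collection's update endpoint. It is a minimal sketch under stated assumptions, not a definitive client: the build_index_request helper and the localhost:8983 address are hypothetical, the /update/json/docs path is the one the Post tool reports in its output later in this article, and commit=true asks Solr to commit the change right after the upload.

```python
import json
import urllib.request


def build_index_request(base_url, collection, doc):
    """Prepare a POST request that uploads one JSON document to Solr.

    The request is only constructed here; pass it to
    urllib.request.urlopen() to actually send it.
    """
    # Hypothetical helper: the URL layout mirrors the Post tool output.
    url = f"{base_url}/solr/{collection}/update/json/docs?commit=true"
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# A document is a set of named fields; "id" plays the role of the unique key.
document = {"id": 1, "comment": "The first transaction."}
request = build_index_request("http://localhost:8983", "demo_index_collection", document)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload before anything reaches the server.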

Apart from JSON, XML, and CSV, Solr can accept "rich" file formats, such as HTML and PDF. In most cases, Solr detects the file type automatically and processes the data, so you do not need to specify the file type manually.

For more information about Solr indexing concepts, see Solr documentation.

Post tool

Every Solr Server component deployed by ADH includes a utility tool that facilitates POSTing documents to a Solr server. You can use this tool by running the /usr/lib/solr/bin/post script available on hosts with the Solr Server component.

The script usage syntax is as follows:

$ /usr/lib/solr/bin/post -c <collection_name> [OPTIONS] <file|directory|url|-d ["...",...]>

The supported OPTIONS flags are described in the table below.

-c <collection_name>

Solr collection to upload data for indexing. Either a collection name or a URL must be specified

-url <URL>

Solr server’s endpoint URL. If not specified, the tool generates the URL automatically, for example http://localhost:<solr_port>/solr/<collection_name>/update. Parameters embedded in the URL override the values provided as -c, -host, and -port options

-host <host_name>

The host name or IP address of the Solr server

-port <port>, -p <port>

The port number used for the connection

-user <user:pass>, -u <user:pass>

Basic authentication credentials for requests

-type <content-type>

Sets the Content-Type header. Defaults to application/xml

-filetypes <type1,…​>

Specifies the file types accepted by Solr. By default, the accepted file types are: xml, json, jsonl, csv, pdf, doc, docx, ppt, pptx, xls, xlsx, odt, odp, ods, ott, otp, ots, rtf, htm, html, txt, log

-out <yes|no>

Outputs Solr’s response to the console

Below are several examples of using the /usr/lib/solr/bin/post tool.

/usr/lib/solr/bin/post -c demo_index_collection transactions.json
/usr/lib/solr/bin/post -c demo_index_collection test-*.xml
/usr/lib/solr/bin/post -c demo_index_collection -out yes test.csv
/usr/lib/solr/bin/post -c demo_index_collection ~/test/solr-docs
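
When the Post tool is called from a script, the command line can be assembled programmatically. The following minimal sketch builds the argument list for one invocation using only the -c and -out options described above. The build_post_command helper is a hypothetical name, and nothing is executed by this snippet.

```python
import shlex

POST_TOOL = "/usr/lib/solr/bin/post"


def build_post_command(collection, targets, show_response=False):
    """Return the argument list for one Post tool invocation."""
    command = [POST_TOOL, "-c", collection]
    if show_response:
        # -out yes prints Solr's response to the console.
        command += ["-out", "yes"]
    command += list(targets)
    return command


cmd = build_post_command("demo_index_collection", ["test.csv"], show_response=True)
print(shlex.join(cmd))
```

On a host with the Solr Server component, the resulting list could be passed to subprocess.run(cmd, check=True). Passing a list instead of a single string avoids shell quoting problems with file names.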

Index data

The following scenario highlights major steps of the Solr indexing process. The steps show how to index a simple JSON document using the Post tool.

  1. Connect to an ADH host with the Solr Server component.

  2. Create a test Solr collection using the following command:

    $ /usr/lib/solr/bin/solr create -c demo_index_collection -s 2 -rf 2

    Output:

    Created collection 'demo_index_collection' with 2 shard(s), 2 replica(s) with config-set 'demo_index_collection'

    This creates a new collection (demo_index_collection) with two shards and the replication factor set to 2. The collection is created based on the _default configset that provides schemaless support.

    TIP
    You can also use the Solr web UI to create a new collection.

  3. Create a sample JSON document, as shown below, and save it as transactions.json (the file name used in the next step). You can create a similar document in XML or CSV format. Solr automatically detects the file type and processes the document accordingly.

    {
      "id": 1,
      "txn_id": "1",
      "acc_id": 1001,
      "txn_value": 75.0,
      "txn_date": "2024-01-02",
      "comment": "The first transaction."
    }

    In this sample document, the id field acts as a primary key to uniquely identify the document among other indexed documents. Using this ID, you can update or delete the document in the future. The id field is chosen because it is specified in the default schema file (the uniqueKey definition). If you remove the id field from the document above, Solr will auto-generate one for you; however, you will have to track Solr-generated IDs to be able to work with indexed documents in the future.

  4. Using the Post tool, submit the sample document to Solr with the following command:

    $ /usr/lib/solr/bin/post -c demo_index_collection transactions.json

    Running the script produces output similar to the following:

    java -classpath /usr/lib/solr/dist/solr-core-8.11.2.jar -Dauto=yes -Dc=demo_index_collection -Ddata=files org.apache.solr.util.SimplePostTool transactions.json
    SimplePostTool version 5.0.0
    Posting files to [base] url http://localhost:8983/solr/demo_index_collection/update...
    Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
    POSTing file transactions.json (application/json) to [base]/json/docs
    1 files indexed.
    COMMITting Solr index changes to http://localhost:8983/solr/demo_index_collection/update...
    Time spent: 0:00:01.341
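
Step 3 notes that a similar document can be prepared in XML or CSV. As a rough sketch, the snippet below renders the same sample document in both formats using the Python standard library. It assumes Solr's XML update layout (an add element wrapping doc and field elements) and a CSV file whose header row lists the field names; the helper names are illustrative.

```python
import csv
import io
import xml.etree.ElementTree as ET

doc = {
    "id": 1,
    "txn_id": "1",
    "acc_id": 1001,
    "txn_value": 75.0,
    "txn_date": "2024-01-02",
    "comment": "The first transaction.",
}


def to_solr_xml(document):
    """Serialize a document using the <add><doc><field> XML layout."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in document.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def to_csv(document):
    """Serialize a document as CSV: a header row, then one row of values."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(document), lineterminator="\n")
    writer.writeheader()
    writer.writerow(document)
    return buffer.getvalue()


print(to_solr_xml(doc))
print(to_csv(doc))
```

Either output can be saved to a file with the matching extension (.xml or .csv) and submitted with the Post tool in the same way as the JSON file.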

To verify that Solr has indexed the document successfully, follow these steps:

  1. In Solr Admin UI, select the required collection (demo_index_collection).

  2. Go to the collection’s Query page.

  3. Click Execute Query to run a query with default parameters. The query results are displayed as shown in the image below.

Query results

Update indexed data

To update an indexed document, you have to re-submit the modified document version to Solr and ensure that the modified document’s unique key matches the document you want to update. If you omit or provide an incorrect unique key, Solr will create a new document instead of updating the document you need. The default Solr schema (the one that belongs to _default configset) uses the id field as a unique key.
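
The effect of the unique key on updates can be pictured with a toy in-memory model. This is only an illustration of the add-or-replace behavior described above, not how Solr stores data internally; the upsert function and the dictionary-based index are hypothetical.

```python
def upsert(index, document, unique_key="id"):
    """Toy model of add-or-replace keyed by the unique key field."""
    # Keys are normalized to strings, since the sample responses return id as "1".
    key = str(document[unique_key])
    index[key] = document  # Same key: the stored document is replaced as a whole.
    return index


index = {}
upsert(index, {"id": 1, "comment": "The first transaction."})
# Same id: the existing document is replaced.
upsert(index, {"id": 1, "comment": "Updated transaction description."})
# Different id: a new document appears instead of an update.
upsert(index, {"id": 2, "comment": "Updated transaction description."})
print(len(index), index["1"]["comment"])
```

The third call shows the pitfall: with a mismatched key, the original document stays untouched and the index grows by one entry.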

The steps required to update the sample document are described below:

  1. Create an updated version of the document that is already indexed by Solr. For example, the transactions_updated.json contents:

    {
      "id": 1,
      "txn_id": "1",
      "acc_id": 1001,
      "txn_value": 75.0,
      "txn_date": "2024-01-02",
      "comment": "Updated transaction description."
    }

  2. Submit the updated document to Solr using the Post tool:

    $ /usr/lib/solr/bin/post -c demo_index_collection transactions_updated.json

    The output:

    java -classpath /usr/lib/solr/dist/solr-core-8.11.2.jar -Dauto=yes -Dc=demo_index_collection -Ddata=files org.apache.solr.util.SimplePostTool transactions_updated.json
    SimplePostTool version 5.0.0
    Posting files to [base] url http://localhost:8983/solr/demo_index_collection/update...
    Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
    POSTing file transactions_updated.json (application/json) to [base]/json/docs
    1 files indexed.
    COMMITting Solr index changes to http://localhost:8983/solr/demo_index_collection/update...
    Time spent: 0:00:00.307

  3. Using the Solr Admin UI, query the updated document by setting the q parameter to txn_id:1. This is equivalent to the SQL statement SELECT * FROM <collection_name> WHERE txn_id=1. The query response reflects the updated document fields, as shown in the example:

    {
      "responseHeader":{
        "zkConnected":true,
        "status":0,
        "QTime":8,
        "params":{
          "q":"txn_id:1",
          "indent":"true",
          "q.op":"OR",
          "_":"1723478346726"}},
      "response":{"numFound":1,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[
          {
            "id":"1",
            "txn_id":[1],
            "acc_id":[1001],
            "txn_value":[75.0],
            "txn_date":["2024-01-02T00:00:00Z"],
            "comment":["Updated transaction description."],
            "_version_":1807198162248531968}]
      }}

Delete indexed data

You can delete a document from a Solr index by POSTing a delete command to the update URL and specifying the document’s unique key or a query to match multiple documents. For example, to delete a document with the id=1 field, use the command:

$ /usr/lib/solr/bin/post -c demo_index_collection -d "<delete><id>1</id></delete>"

To delete documents that match a Solr query, use the following syntax:

$ /usr/lib/solr/bin/post -c demo_index_collection -d "<delete><query>txn_id:1</query></delete>"

The command that deletes all the documents from a collection looks as follows:

$ /usr/lib/solr/bin/post -c demo_index_collection -d "<delete><query>*:*</query></delete>"
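
If delete commands are generated by a script, building the XML with a library avoids broken payloads when a query contains characters such as & or <. The sketch below produces the same payloads as the commands above; the function names are illustrative, and the snippet only builds strings without contacting Solr.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Build a <delete> command that removes one document by unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Build a <delete> command that removes every document matching a query."""
    delete = ET.Element("delete")
    # ElementTree escapes characters such as & and < inside the query text.
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


print(delete_by_id(1))
print(delete_by_query("txn_id:1"))
print(delete_by_query("*:*"))
```

Each returned string can be passed to the Post tool as the value of the -d argument.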

Search indexed data

Solr search queries are regular HTTP requests in which search parameters are specified in the URL query string. To submit a search request to Solr, you can use an HTTP client such as curl, wget, or Postman, or use the Post tool. A sample search request URL looks as follows:

http://ka-adh-1.ru-central1.internal:8983/solr/demo_index_collection/select?indent=true&q.op=AND&q=*%3A*

Here, ka-adh-1.ru-central1.internal is the name of the host with a running Solr Server component.
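
Because parameter values must be percent-encoded, it is usually safer to compose such URLs in code than by hand. The following sketch rebuilds the sample URL with the Python standard library. The build_select_url helper is a hypothetical name; the host is the one used in this article's examples.

```python
from urllib.parse import quote, urlencode

SOLR_HOST = "http://ka-adh-1.ru-central1.internal:8983"


def build_select_url(collection, params, host=SOLR_HOST):
    """Compose a /select URL, percent-encoding each parameter value."""
    # quote() encodes spaces as %20; "*" is left readable, as in the sample URL.
    query_string = urlencode(params, safe="*", quote_via=quote)
    return f"{host}/solr/{collection}/select?{query_string}"


url = build_select_url(
    "demo_index_collection",
    {"indent": "true", "q.op": "AND", "q": "*:*"},
)
print(url)
```

The resulting URL can be opened with any HTTP client, for example urllib.request.urlopen(url).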

The response from Solr for such a query may look as follows:

{
    "responseHeader": {
        "zkConnected": true,
        "status": 0,
        "QTime": 6,
        "params": {
            "q": "*:*",
            "indent": "true"
        }
    },
    "response": { (1)
        "numFound": 1,
        "start": 0,
        "maxScore": 1.0,
        "numFoundExact": true,
        "docs": [ (2)
            {
                "id": "1",
                "txn_id": [
                    1
                ],
                "acc_id": [
                    1001
                ],
                "txn_value": [
                    75.0
                ],
                "txn_date": [
                    "2024-01-02T00:00:00Z"
                ],
                "comment": [
                    "Updated transaction description."
                ],
                "_version_": 1807199238892814336
            }
        ]
    }
}
1 The response section with query results.
2 An array of documents that match the search query.
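
A client typically reads numFound and then walks the docs array. In the sample response, every field except id and _version_ comes back as a single-element array, so the sketch below includes a small helper that unwraps such values. The response text is an abbreviated copy of the sample above, and first_value is a hypothetical helper name.

```python
import json

# Abbreviated copy of the sample response shown above.
raw_response = """
{
  "responseHeader": {"status": 0, "QTime": 6},
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {"id": "1", "txn_id": [1], "txn_value": [75.0],
       "comment": ["Updated transaction description."]}
    ]
  }
}
"""


def first_value(document, field):
    """Return a field value, unwrapping the list used for multi-valued fields."""
    value = document.get(field)
    if isinstance(value, list):
        return value[0] if value else None
    return value


payload = json.loads(raw_response)
assert payload["responseHeader"]["status"] == 0  # 0 means the request succeeded
for document in payload["response"]["docs"]:
    print(document["id"], first_value(document, "txn_value"), first_value(document, "comment"))
```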

You can also use the Solr Admin UI, which provides a query builder interface for constructing and running search queries.

Using the query builder interface, you can enter query parameters in the UI fields and get the resulting query URL as shown in the image below.

Query URL

Solr query parsers

When a search query reaches Solr, it is processed by a request handler — a plugin that defines the request processing logic. Then, the request handler activates a query parser that interprets the parameters of the received query. Solr supports several query parsers listed below. Follow the links to get more information on using each parser.

  • Standard Query Parser. The default query parser that supports robust and intuitive syntax. The major disadvantage is the parser’s intolerance of syntax errors.

  • DisMax. Allows processing simple phrases (without complex syntax) and searching for terms across several fields using a weighting mechanism based on the significance of each field.

  • Extended DisMax (eDismax). An extended version of the DisMax Query Parser that provides features like partial escaping, advanced stopword handling, etc.

  • Other parsers like Block Join Query Parser, Boolean Query Parser, Boost Query Parser, etc.
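
A parser other than the default is typically selected per request with the defType parameter. The sketch below builds (without sending) a query string that asks for eDisMax and weights two fields through the qf parameter. The title field and the weights are hypothetical and serve only as an illustration; check the parameter names against the Solr documentation for the deployed version.

```python
from urllib.parse import parse_qs, quote, urlencode

params = {
    "defType": "edismax",      # ask Solr to use the Extended DisMax parser
    "q": "first transaction",  # plain phrase without field syntax
    "qf": "comment^2 title",   # hypothetical fields: comment weighs twice as much
}

# quote() encodes spaces as %20 and "^" as %5E.
query_string = urlencode(params, quote_via=quote)
print(query_string)
```

The string can be appended to a collection's /select URL after the question mark.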

Search examples

The following examples demonstrate basic search capabilities using the default Standard Query Parser. For detailed information about search parameters supported by various Solr query parsers, see Solr query syntax.

Get all documents with sorting

 
The sample search request fetches all fields of all documents from a collection, sorting the results by the txn_id field in ascending order.

Request parameters
Request parameter   Value
q                   *:*
                    This default query string is equivalent to the SQL expression SELECT * FROM <collection_name>.
sort                txn_id asc

Sample request URL:

http://ka-adh-1.ru-central1.internal:8983/solr/demo_index_collection/select?indent=true&q.op=OR&q=*%3A*&sort=txn_id%20asc

Sample response:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":4,
    "params":{
      "q":"*:*",
      "indent":"true",
      "q.op":"OR",
      "sort":"txn_id asc",
      "_":"1723495198651"}},
  "response":{"numFound":3,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"1",
        "txn_id":[1],
        "acc_id":[1001],
        "txn_value":[75.0],
        "txn_date":["2024-01-02T00:00:00Z"],
        "comment":["The first transaction."],
        "_version_":1807214831002976256},
      {
        "id":"3",
        "txn_id":[5],
        "acc_id":[1003],
        "txn_value":[105.0],
        "txn_date":["2024-02-02T00:00:00Z"],
        "comment":["Yet another transaction."],
        "_version_":1807215090454233088},
      {
        "id":"2",
        "txn_id":[256],
        "acc_id":[1002],
        "txn_value":[55.0],
        "txn_date":["2024-02-03T00:00:00Z"],
        "comment":["Another transaction."],
        "_version_":1807214995052691456}]
  }}

Get specific fields with filtering

 
The sample search request returns only specific fields (fl=txn_id,acc_id,txn_value,comment) of the documents whose txn_date matches the specified date. Additionally, a filter query (fq) narrows the result set to the documents whose comment field contains the substring first.

Request parameters
Request parameter   Value
q                   txn_date:"2024-01-02T00:00:00Z"
fl                  txn_id,acc_id,txn_value,comment
fq                  comment:"*first*"

Sample request URL:

http://ka-adh-1.ru-central1.internal:8983/solr/demo_index_collection/select?fl=txn_id%2Cacc_id%2Ctxn_value%2Ccomment&fq=comment%3A%22*first*%22&indent=true&q.op=OR&q=txn_date%3A%222024-01-02T00%3A00%3A00Z%22

Sample response:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":5,
    "params":{
      "q":"txn_date:\"2024-01-02T00:00:00Z\"",
      "indent":"true",
      "fl":"txn_id,acc_id,txn_value,comment",
      "q.op":"OR",
      "fq":"comment:\"*first*\"",
      "_":"1723492286745"}},
  "response":{"numFound":1,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[
      {
        "txn_id":[1],
        "acc_id":[1001],
        "txn_value":[75.0],
        "comment":["The first transaction."]}]
  }}

Queries combination

 
By using boolean operators, you can combine multiple search conditions into more complex query logic. For example, the sample request fetches documents whose comment field contains the string transaction and whose txn_id field equals 1.

Request parameters
Request parameter   Value
q                   comment:transaction AND txn_id:1

Sample request URL:

http://ka-adh-1.ru-central1.internal:8983/solr/demo_index_collection/select?indent=true&q.op=OR&q=comment%3Atransaction%20AND%20txn_id%3A1

Sample response:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":4,
    "params":{
      "q":"comment:transaction AND txn_id:1",
      "indent":"true",
      "q.op":"OR",
      "_":"1723492286745"}},
  "response":{"numFound":1,"start":0,"maxScore":1.1307646,"numFoundExact":true,"docs":[
      {
        "id":"1",
        "txn_id":[1],
        "acc_id":[1001],
        "txn_value":[75.0],
        "txn_date":["2024-01-02T00:00:00Z"],
        "comment":["The first transaction."],
        "_version_":1807214831002976256}]
  }}

Range search

 
The sample request returns documents whose txn_value falls within the range from 100 to 200, including both bounds.

Request parameters
Request parameter   Value
q                   txn_value:[100 TO 200]

Sample request URL:

http://ka-adh-1.ru-central1.internal:8983/solr/demo_index_collection/select?indent=true&q.op=OR&q=txn_value%3A%5B100%20TO%20200%5D

Sample response:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":9,
    "params":{
      "q":"txn_value:[100 TO 200]",
      "indent":"true",
      "q.op":"OR",
      "_":"1723492286745"}},
  "response":{"numFound":1,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[
      {
        "id":"3",
        "txn_id":[5],
        "acc_id":[1003],
        "txn_value":[105.0],
        "txn_date":["2024-02-02T00:00:00Z"],
        "comment":["Yet another transaction."],
        "_version_":1807215090454233088}]
  }}
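
Range expressions follow a small grammar: square brackets include the bounds, curly braces exclude them, and an asterisk leaves a bound open. As a minimal sketch, the hypothetical range_clause helper below assembles such expressions for the q parameter; it only formats strings and does not validate field names or types.

```python
def range_clause(field, lower=None, upper=None, inclusive=True):
    """Build a range expression; None stands for an open-ended bound (*)."""
    low = "*" if lower is None else str(lower)
    high = "*" if upper is None else str(upper)
    # Square brackets include the bounds, curly braces exclude them.
    opening, closing = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{opening}{low} TO {high}{closing}"


print(range_clause("txn_value", 100, 200))
print(range_clause("txn_value", lower=100))
print(range_clause("txn_value", 100, 200, inclusive=False))
```

Note that an open upper bound (100 TO *) matches every value of 100 or more, which is a different query from the bounded range used in this example.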

Set response size and format

 
The sample request returns the first three documents that match the search query and instructs Solr to return the response as XML.

Request parameters
Request parameter   Value
q                   *:*
rows                3
wt                  xml

Sample request URL:

http://ka-adh-1.ru-central1.internal:8983/solr/demo_index_collection/select?indent=true&q.op=OR&q=*%3A*&rows=3&start=0&wt=xml

Sample response:

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">4</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="indent">true</str>
    <str name="start">0</str>
    <str name="q.op">OR</str>
    <str name="rows">3</str>
    <str name="wt">xml</str>
    <str name="_">1723492286745</str>
  </lst>
</lst>
<result name="response" numFound="3" start="0" maxScore="1.0" numFoundExact="true">
  <doc>
    <str name="id">2</str>
    <arr name="txn_id">
      <long>256</long>
    </arr>
    <arr name="acc_id">
      <long>1002</long>
    </arr>
    <arr name="txn_value">
      <double>55.0</double>
    </arr>
    <arr name="txn_date">
      <date>2024-02-03T00:00:00Z</date>
    </arr>
    <arr name="comment">
      <str>Another transaction.</str>
    </arr>
    <long name="_version_">1807214995052691456</long></doc>
  <doc>
    <str name="id">3</str>
    <arr name="txn_id">
      <long>5</long>
    </arr>
    <arr name="acc_id">
      <long>1003</long>
    </arr>
    <arr name="txn_value">
      <double>105.0</double>
    </arr>
    <arr name="txn_date">
      <date>2024-02-02T00:00:00Z</date>
    </arr>
    <arr name="comment">
      <str>Yet another transaction.</str>
    </arr>
    <long name="_version_">1807215090454233088</long></doc>
  <doc>
    <str name="id">1</str>
    <arr name="txn_id">
      <long>1</long>
    </arr>
    <arr name="acc_id">
      <long>1001</long>
    </arr>
    <arr name="txn_value">
      <double>75.0</double>
    </arr>
    <arr name="txn_date">
      <date>2024-01-02T00:00:00Z</date>
    </arr>
    <arr name="comment">
      <str>The first transaction.</str>
    </arr>
    <long name="_version_">1807214831002976256</long></doc>
</result>
</response>
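
An XML response can be processed with any standard XML parser. The sketch below, which assumes the element layout shown in the sample response, converts doc elements into dictionaries. The input is an abbreviated copy of the sample (one document, two fields), parse_docs is a hypothetical helper, and all values are kept as strings without type conversion.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the XML response shown above.
raw_xml = """
<response>
<result name="response" numFound="3" start="0">
  <doc>
    <str name="id">2</str>
    <arr name="txn_id">
      <long>256</long>
    </arr>
    <arr name="comment">
      <str>Another transaction.</str>
    </arr>
  </doc>
</result>
</response>
"""


def parse_docs(xml_text):
    """Convert <doc> elements into dictionaries keyed by field name."""
    root = ET.fromstring(xml_text.strip())
    result = root.find("result")
    docs = []
    for doc_el in result.findall("doc"):
        doc = {}
        for field in doc_el:
            if field.tag == "arr":
                # Multi-valued field: collect the text of every child element.
                doc[field.get("name")] = [item.text for item in field]
            else:
                doc[field.get("name")] = field.text
        docs.append(doc)
    return int(result.get("numFound")), docs


num_found, docs = parse_docs(raw_xml)
print(num_found, docs)
```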