Solr indexing overview
This article describes Solr indexing — the process of adding new content to a Solr index to make the content searchable, as well as modifying and deleting existing content.
You can feed data to Solr for indexing in several ways:
-
Upload XML/JSON/CSV files within an HTTP request body to a Solr server’s REST endpoint. Solr comes with a utility Post tool that facilitates uploading documents through the REST endpoint. This method is used in examples throughout the article.
-
Use Solr Admin UI to upload documents through the web interface.
-
Create a client application that interacts with Solr over HTTP using Client APIs, for example, the SolrJ library for JVM-based applications.
-
Use frameworks like Solr Cell that are designed for efficient data ingestion into Solr.
Regardless of the method used for delivering data to Solr, there is a common structure for all data submitted to a Solr index: this must be a document containing one or more named fields, each storing a content payload (which may be empty). The structure of documents is defined by a schema file; however, Solr can also accept schemaless documents. Typically, one of the fields is chosen to be unique (analogous to a primary key in traditional databases), although the use of unique keys is not strictly required by Solr.
Apart from JSON, XML, and CSV, Solr can accept "rich" file formats, such as HTML and PDF. In most cases, Solr detects the file type automatically and digests the data, so you do not need to specify the file type manually.
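As an illustration of the document structure described above, the following minimal Python sketch builds (but does not send) an HTTP request that carries a JSON document in its body. The host, port, collection name, and field names are assumptions made for the example; the `/update/json/docs` path and the `commit=true` parameter are standard Solr update options.

```python
import json
import urllib.request


def build_index_request(base_url, docs, commit=True):
    """Build (without sending) a POST request that carries JSON documents."""
    url = f"{base_url}/update/json/docs"
    if commit:
        # Ask Solr to commit right away so the documents become searchable.
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# A document is a set of named fields; "id" plays the role of the unique key.
doc = {"id": 1, "txn_value": 75.0, "comment": "The first transaction."}

# Assumed base URL: a local Solr server and a hypothetical collection name.
request = build_index_request(
    "http://localhost:8983/solr/demo_index_collection", [doc]
)
print(request.get_method(), request.full_url)
# To actually send the request: urllib.request.urlopen(request)
```

The request is only constructed here, so the sketch can be run without a Solr server.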
For more information about Solr indexing concepts, see Solr documentation.
Post tool
Every Solr Server component deployed by ADH includes a utility tool that facilitates POSTing documents to a Solr server. You can use this tool by running the /usr/lib/solr/bin/post script available on hosts with the Solr Server component.
The script usage syntax is as follows:
$ /usr/lib/solr/bin/post -c <collection_name> [OPTIONS] <file|directory|url|-d ["...",...]>
The supported OPTIONS flags are described in the table below.
-c <collection_name> |
Solr collection to upload data for indexing. Either a collection name or a URL must be specified |
-url <URL> |
Solr server’s endpoint URL.
If not specified, the tool generates the URL automatically, for example http://localhost:<solr_port>/solr/<collection_name>/update.
Parameters embedded in the URL override the values provided as |
-host <host_name> |
The host name or IP address of the Solr server |
-port <p>, -p |
The port number used for the connection |
-user <user:pass>, -u <user:pass> |
Basic authentication credentials for requests |
-type <content-type> |
Sets the Content-Type header.
Defaults to |
-filetypes <type1,…> |
Specifies the file types accepted by Solr. By default, the accepted file types are: xml, json, csv, pdf, doc, docx, ppt, pptx, xls, xlsx, odt, odp, ods, ott, otp, ots, rtf, htm, html, txt, log |
-out <yes|no> |
Outputs Solr’s response to the console |
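The way the target URL is derived from these options can be sketched in Python as follows. This is a simplified model of the behavior described in the table, not the tool's actual code; the function name is hypothetical, and the default port 8983 is an assumption.

```python
def resolve_update_url(collection=None, url=None, host="localhost", port=8983):
    """Model how the update URL is chosen from the -c, -url, -host, and -port options."""
    if url:
        # An explicitly provided URL is used as is.
        return url
    if not collection:
        raise ValueError("Either a collection name or a URL must be specified")
    # Otherwise, the URL is generated from the host, port, and collection name.
    return f"http://{host}:{port}/solr/{collection}/update"


print(resolve_update_url(collection="demo_index_collection"))
print(resolve_update_url(collection="demo_index_collection", host="solr-1", port=9000))
```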
Below are several examples of using the /usr/lib/solr/bin/post tool.
/usr/lib/solr/bin/post -c demo_index_collection transactions.json
/usr/lib/solr/bin/post -c demo_index_collection test-*.xml
/usr/lib/solr/bin/post -c demo_index_collection -out yes test.csv
/usr/lib/solr/bin/post -c demo_index_collection ~/test/solr-docs
Index data
The following scenario highlights major steps of the Solr indexing process. The steps show how to index a simple JSON document using the Post tool.
-
Connect to an ADH host with the Solr Server component.
-
Create a test Solr collection using the following command:
$ /usr/lib/solr/bin/solr create -c demo_index_collection -s 2 -rf 2
Output:
Created collection 'demo_index_collection' with 2 shard(s), 2 replica(s) with config-set 'demo_index_collection'
This creates a new collection (demo_index_collection) with two shards and the replication factor set to 2. The collection is created based on the _default configset that provides schemaless support.
TIP: You can also use Solr web UI to create a new collection.
-
Create a sample JSON document, as shown below. You can create a similar document using XML or CSV formats. Solr will automatically detect the file type and digest the document accordingly.
{ "id": 1, "txn_id": "1", "acc_id": 1001, "txn_value": 75.0, "txn_date": "2024-01-02", "comment": "The first transaction." }
In this sample document, the id field acts as a primary key that uniquely identifies the document among other indexed documents. Using this ID, you can update or delete the document in the future. The id field is chosen because it is specified in the default schema file (the uniqueKey definition). If you remove the id field from the document above, Solr will auto-generate one for you; however, you will have to track Solr-generated IDs to be able to work with indexed documents in the future.
-
Using the Post tool, submit the sample document to Solr with the following command:
$ /usr/lib/solr/bin/post -c demo_index_collection transactions.json
Running the script yields output similar to the following:
java -classpath /usr/lib/solr/dist/solr-core-8.11.2.jar -Dauto=yes -Dc=demo_index_collection -Ddata=files org.apache.solr.util.SimplePostTool transactions.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/demo_index_collection/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file transactions.json (application/json) to [base]/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/demo_index_collection/update...
Time spent: 0:00:01.341
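One possible way to avoid tracking Solr-generated IDs is to assign the unique key on the client side before submitting documents. The Python sketch below assumes that the unique key field is named id, as in the default schema; the helper name is hypothetical.

```python
import uuid


def ensure_ids(docs, key="id"):
    """Return copies of the documents, adding a UUID where the unique key is missing."""
    prepared = []
    for doc in docs:
        copy = dict(doc)  # do not modify the caller's document
        if key not in copy:
            copy[key] = str(uuid.uuid4())
        prepared.append(copy)
    return prepared


docs = [
    {"id": 1, "comment": "The first transaction."},
    {"comment": "A transaction without an explicit ID."},
]
prepared = ensure_ids(docs)
for doc in prepared:
    print(doc["id"], doc["comment"])
```

Since the IDs are generated before the upload, the application can store them and use them later to update or delete the corresponding documents.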
To verify that Solr has indexed the document successfully, follow these steps:
-
In Solr Admin UI, select the required collection (demo_index_collection).
-
Go to the collection’s Query page.
-
Click Execute Query to run a query with default parameters. The query results are displayed as shown in the image below.
Update indexed data
To update an indexed document, you have to re-submit the modified document version to Solr and ensure that the modified document’s unique key matches the document you want to update.
If you omit or provide an incorrect unique key, Solr will create a new document instead of updating the document you need.
The default Solr schema (the one that belongs to the _default configset) uses the id field as a unique key.
The steps required to update the sample document are described below:
-
Create an updated version of the document that is already indexed by Solr. For example, create the transactions_updated.json file with the following contents:
{ "id": 1, "txn_id": "1", "acc_id": 1001, "txn_value": 75.0, "txn_date": "2024-01-02", "comment": "Updated transaction description." }
-
Submit the updated document to Solr using the Post tool:
$ /usr/lib/solr/bin/post -c demo_index_collection transactions_updated.json
The output:
java -classpath /usr/lib/solr/dist/solr-core-8.11.2.jar -Dauto=yes -Dc=demo_index_collection -Ddata=files org.apache.solr.util.SimplePostTool transactions_updated.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/demo_index_collection/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file transactions_updated.json (application/json) to [base]/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/demo_index_collection/update...
Time spent: 0:00:00.307
-
Using Solr UI, query the updated document using the txn_id:1 query parameter (q). This is equivalent to the SQL statement SELECT * FROM <collection_name> WHERE txn_id=1; The query response reflects the updated document fields, as shown in the example:
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":8,
"params":{
"q":"txn_id:1",
"indent":"true",
"q.op":"OR",
"_":"1723478346726"}},
"response":{"numFound":1,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[
{
"id":"1",
"txn_id":[1],
"acc_id":[1001],
"txn_value":[75.0],
"txn_date":["2024-01-02T00:00:00Z"],
"comment":["Updated transaction description."],
"_version_":1807198162248531968}]
}}
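The update-by-unique-key behavior can be illustrated with a toy in-memory model in Python. This is only a sketch of the semantics described in this section (same key replaces the document, a different or missing key creates a new one); it does not reflect how Solr stores documents internally, and all names are hypothetical.

```python
import uuid


def add_document(index, doc, unique_key="id"):
    """Toy model: store a document under its unique key, replacing any previous version."""
    stored = dict(doc)
    if unique_key not in stored:
        # Model a generated key for documents submitted without one.
        stored[unique_key] = str(uuid.uuid4())
    key = str(stored[unique_key])
    index[key] = stored
    return key


index = {}
add_document(index, {"id": 1, "comment": "The first transaction."})
# Same unique key: the existing document is replaced.
add_document(index, {"id": 1, "comment": "Updated transaction description."})
# Different unique key: a new document appears instead of an update.
add_document(index, {"id": 11, "comment": "Updated transaction description."})

print(len(index))
print(index["1"]["comment"])
```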
Delete indexed data
You can delete a document from a Solr index by POSTing a delete command to the update URL and specifying the document’s unique key or a query to match multiple documents.
For example, to delete the document whose id field equals 1, use the command:
$ /usr/lib/solr/bin/post -c demo_index_collection -d "<delete><id>1</id></delete>"
To delete documents that match a Solr query, use the following command:
$ /usr/lib/solr/bin/post -c demo_index_collection -d "<delete><query>txn_id:1</query></delete>"
The command that deletes all the documents from a collection looks as follows:
$ /usr/lib/solr/bin/post -c demo_index_collection -d "<delete><query>*:*</query></delete>"
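When delete commands are generated by a program, the XML payloads shown above can be built with a small helper. The following Python sketch uses hypothetical function names and escapes XML special characters so that values like a&b do not break the payload.

```python
from xml.sax.saxutils import escape


def delete_by_id(doc_id):
    """Build an XML delete command for a single document ID."""
    return f"<delete><id>{escape(str(doc_id))}</id></delete>"


def delete_by_query(query):
    """Build an XML delete command for all documents matching a query."""
    return f"<delete><query>{escape(query)}</query></delete>"


print(delete_by_id(1))
print(delete_by_query("txn_id:1"))
print(delete_by_query("*:*"))
```

The resulting strings can be passed to the Post tool with the -d option or sent in the body of an HTTP request to the update URL.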
Search data
Solr search queries are regular HTTP requests where search parameters are specified within the URL query string. To submit a search request to Solr, you can use HTTP clients like curl, wget, Postman, or use the Post tool. A sample search request URL looks as follows:
http://ka-adh-1.ru-central1.internal:8983/solr/demo_index_collection/select?indent=true&q.op=AND&q=*%3A*
Where ka-adh-1.ru-central1.internal is the name of the host with a running Solr Server component.
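A search URL like the one above can be assembled programmatically. The Python sketch below URL-encodes the search parameters with the standard library; the function name and the default port are assumptions made for the example.

```python
from urllib.parse import quote, urlencode


def build_select_url(host, collection, params, port=8983):
    """Build a Solr select URL with URL-encoded query parameters."""
    # quote() encodes spaces as %20; "*" is left as is to keep the URL readable.
    query = urlencode(params, quote_via=quote, safe="*")
    return f"http://{host}:{port}/solr/{collection}/select?{query}"


url = build_select_url(
    "ka-adh-1.ru-central1.internal",
    "demo_index_collection",
    {"indent": "true", "q.op": "AND", "q": "*:*"},
)
print(url)
```

The URL can then be requested with any HTTP client, for example urllib.request.urlopen(url).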
The response from Solr for such a query may look as follows:
{
"responseHeader": {
"zkConnected": true,
"status": 0,
"QTime": 6,
"params": {
"q": "*:*",
"indent": "true"
}
},
"response": { (1)
"numFound": 1,
"start": 0,
"maxScore": 1.0,
"numFoundExact": true,
"docs": [ (2)
{
"id": "1",
"txn_id": [
1
],
"acc_id": [
1001
],
"txn_value": [
75.0
],
"txn_date": [
"2024-01-02T00:00:00Z"
],
"comment": [
"Updated transaction description."
],
"_version_": 1807199238892814336
}
]
}
}
1 | The response section with query results. |
2 | An array of documents that match the search query. |
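A client application typically reads the response section and iterates over the docs array. The Python sketch below parses a shortened copy of the response above; the helper names are hypothetical. Note that in this response most fields arrive as arrays, so a small helper is used to take the first value.

```python
import json

SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 6},
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {"id": "1", "txn_id": [1], "txn_value": [75.0],
       "comment": ["Updated transaction description."]}
    ]
  }
}
"""


def extract_docs(response_text):
    """Return the total number of matches and the list of returned documents."""
    response = json.loads(response_text)["response"]
    return response["numFound"], response["docs"]


def first_value(doc, field):
    """Return the first value of a field that may be multi-valued."""
    value = doc.get(field)
    if isinstance(value, list):
        return value[0] if value else None
    return value


num_found, docs = extract_docs(SAMPLE_RESPONSE)
for doc in docs:
    print(doc["id"], first_value(doc, "txn_value"), first_value(doc, "comment"))
```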
Also, you can use Solr Admin UI, which provides a query builder interface where you can construct and run search queries.
Solr query parsers
When a search query reaches Solr, it is processed by a request handler — a plugin that defines the request processing logic. Then, the request handler activates a query parser that interprets the parameters of the received query. Solr supports several query parsers listed below. Follow the links to get more information on using each parser.
-
Standard Query Parser. The default query parser that supports robust and intuitive syntax. The major disadvantage is the parser’s intolerance of syntax errors.
-
DisMax. Allows processing simple phrases (without complex syntax) and searching for terms across several fields using a weighting mechanism based on the significance of each field.
-
Extended DisMax (eDismax). An extended version of the DisMax Query Parser that provides features like partial escaping, advanced stopword handling, etc.
-
Other parsers like Block Join Query Parser, Boolean Query Parser, Boost Query Parser, etc.
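A query parser other than the default one can be selected per request with the defType parameter, and DisMax-style parsers accept per-field weights in the qf parameter. The Python sketch below only builds such a request URL; the host, the collection, and the title field are hypothetical and used purely for illustration.

```python
from urllib.parse import quote, urlencode

params = {
    "q": "first transaction",
    "defType": "edismax",      # select the Extended DisMax parser
    "qf": "comment^2 title",   # search both fields, weighting comment higher
}

# Assumed host and collection; "title" is a hypothetical field.
url = (
    "http://localhost:8983/solr/demo_index_collection/select?"
    + urlencode(params, quote_via=quote)
)
print(url)
```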
Search examples
The following examples demonstrate basic search capabilities using the default Standard Query Parser. For detailed information about search parameters supported by various Solr query parsers, see Solr query syntax.
The sample search request fetches all fields of all documents from a collection, sorting the results by the txn_id field in ascending order.
Request parameter | Value |
---|---|
q | *:* This default query string is equivalent to SQL’s |
sort | txn_id asc |
Sample request URL:
http://ka-adh-1.ru-central1.internal:8983/solr/demo_index_collection/select?indent=true&q.op=OR&q=*%3A*&sort=txn_id%20asc
Sample response:
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":4,
"params":{
"q":"*:*",
"indent":"true",
"q.op":"OR",
"sort":"txn_id asc",
"_":"1723495198651"}},
"response":{"numFound":3,"start":0,"numFoundExact":true,"docs":[
{
"id":"1",
"txn_id":[1],
"acc_id":[1001],
"txn_value":[75.0],
"txn_date":["2024-01-02T00:00:00Z"],
"comment":["The first transaction."],
"_version_":1807214831002976256},
{
"id":"3",
"txn_id":[5],
"acc_id":[1003],
"txn_value":[105.0],
"txn_date":["2024-02-02T00:00:00Z"],
"comment":["Yet another transaction."],
"_version_":1807215090454233088},
{
"id":"2",
"txn_id":[256],
"acc_id":[1002],
"txn_value":[55.0],
"txn_date":["2024-02-03T00:00:00Z"],
"comment":["Another transaction."],
"_version_":1807214995052691456}]
}}
The sample search request returns only specific fields (fl=txn_id,acc_id,txn_value,comment) of those documents whose txn_date matches the specified date.
Additionally, the result set gets filtered to retain only those records whose comment field contains the first substring.
Request parameter | Value |
---|---|
q | txn_date:"2024-01-02T00:00:00Z" |
fl | txn_id,acc_id,txn_value,comment |
fq | comment:"*first*" |
Sample request URL:
http://ka-adh-1.ru-central1.internal:8983/solr/demo_index_collection/select?fl=txn_id%2Cacc_id%2Ctxn_value%2Ccomment&fq=comment%3A%22*first*%22&indent=true&q.op=OR&q=txn_date%3A%222024-01-02T00%3A00%3A00Z%22
Sample response:
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":5,
"params":{
"q":"txn_date:\"2024-01-02T00:00:00Z\"",
"indent":"true",
"fl":"txn_id,acc_id,txn_value,comment",
"q.op":"OR",
"fq":"comment:\"*first*\"",
"_":"1723492286745"}},
"response":{"numFound":1,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[
{
"txn_id":[1],
"acc_id":[1001],
"txn_value":[75.0],
"comment":["The first transaction."]}]
}}
By using boolean operators, you can create complex query logic and combine multiple search conditions.
For example, the sample request fetches documents whose comment field contains the transaction string and whose txn_id field is 1.
Request parameter | Value |
---|---|
q | comment:transaction AND txn_id:1 |
Sample request URL:
http://ka-adh-1.ru-central1.internal:8983/solr/demo_index_collection/select?indent=true&q.op=OR&q=comment%3Atransaction%20AND%20txn_id%3A1
Sample response:
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":4,
"params":{
"q":"comment:transaction AND txn_id:1",
"indent":"true",
"q.op":"OR",
"_":"1723492286745"}},
"response":{"numFound":1,"start":0,"maxScore":1.1307646,"numFoundExact":true,"docs":[
{
"id":"1",
"txn_id":[1],
"acc_id":[1001],
"txn_value":[75.0],
"txn_date":["2024-01-02T00:00:00Z"],
"comment":["The first transaction."],
"_version_":1807214831002976256}]
}}
The sample request returns documents whose txn_value falls in the range of 100-200.
Request parameter | Value |
---|---|
q | txn_value:[100 TO 200] |
Sample request URL:
http://ka-adh-1.ru-central1.internal:8983/solr/demo_index_collection/select?indent=true&q.op=OR&q=txn_value%3A%5B100%20TO%20200%5D
Sample response:
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":9,
"params":{
"q":"txn_value:[100 TO 200]",
"indent":"true",
"q.op":"OR",
"_":"1723492286745"}},
"response":{"numFound":1,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[
{
"id":"3",
"txn_id":[5],
"acc_id":[1003],
"txn_value":[105.0],
"txn_date":["2024-02-02T00:00:00Z"],
"comment":["Yet another transaction."],
"_version_":1807215090454233088}]
}}
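Square brackets in a range query denote an inclusive range, so both boundary values match. The following Python sketch models this check on the client side for the three sample documents used in this section; it only illustrates the semantics and is not how Solr evaluates queries.

```python
def in_inclusive_range(value, low, high):
    """Model the [low TO high] range check: both bounds are included."""
    return low <= value <= high


# txn_value values of the sample documents (multi-valued fields, as in responses).
docs = [
    {"id": "1", "txn_value": [75.0]},
    {"id": "3", "txn_value": [105.0]},
    {"id": "2", "txn_value": [55.0]},
]

matches = [
    doc["id"]
    for doc in docs
    if any(in_inclusive_range(v, 100, 200) for v in doc["txn_value"])
]
print(matches)  # ['3']
```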
The sample request returns the first three documents that match the search query and instructs Solr to return the response as XML.
Request parameter | Value |
---|---|
q | *:* |
rows | 3 |
wt | xml |
Sample request URL:
http://ka-adh-1.ru-central1.internal:8983/solr/demo_index_collection/select?indent=true&q.op=OR&q=*%3A*&rows=3&start=0&wt=xml
Sample response:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<bool name="zkConnected">true</bool>
<int name="status">0</int>
<int name="QTime">4</int>
<lst name="params">
<str name="q">*:*</str>
<str name="indent">true</str>
<str name="start">0</str>
<str name="q.op">OR</str>
<str name="rows">3</str>
<str name="wt">xml</str>
<str name="_">1723492286745</str>
</lst>
</lst>
<result name="response" numFound="3" start="0" maxScore="1.0" numFoundExact="true">
<doc>
<str name="id">2</str>
<arr name="txn_id">
<long>256</long>
</arr>
<arr name="acc_id">
<long>1002</long>
</arr>
<arr name="txn_value">
<double>55.0</double>
</arr>
<arr name="txn_date">
<date>2024-02-03T00:00:00Z</date>
</arr>
<arr name="comment">
<str>Another transaction.</str>
</arr>
<long name="_version_">1807214995052691456</long></doc>
<doc>
<str name="id">3</str>
<arr name="txn_id">
<long>5</long>
</arr>
<arr name="acc_id">
<long>1003</long>
</arr>
<arr name="txn_value">
<double>105.0</double>
</arr>
<arr name="txn_date">
<date>2024-02-02T00:00:00Z</date>
</arr>
<arr name="comment">
<str>Yet another transaction.</str>
</arr>
<long name="_version_">1807215090454233088</long></doc>
<doc>
<str name="id">1</str>
<arr name="txn_id">
<long>1</long>
</arr>
<arr name="acc_id">
<long>1001</long>
</arr>
<arr name="txn_value">
<double>75.0</double>
</arr>
<arr name="txn_date">
<date>2024-01-02T00:00:00Z</date>
</arr>
<arr name="comment">
<str>The first transaction.</str>
</arr>
<long name="_version_">1807214831002976256</long></doc>
</result>
</response>
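An XML response like the one above can be processed with a standard XML parser. The Python sketch below extracts documents from a shortened copy of the response using the standard library; the function name is hypothetical, and all values are kept as strings for simplicity.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<response>
  <result name="response" numFound="3" start="0">
    <doc>
      <str name="id">2</str>
      <arr name="txn_id"><long>256</long></arr>
      <arr name="comment"><str>Another transaction.</str></arr>
    </doc>
  </result>
</response>
"""


def parse_docs(xml_text):
    """Convert <doc> elements of a Solr XML response into dictionaries."""
    root = ET.fromstring(xml_text)
    docs = []
    for doc_el in root.findall("./result/doc"):
        doc = {}
        for child in doc_el:
            name = child.get("name")
            if child.tag == "arr":
                # Multi-valued field: collect the text of every nested element.
                doc[name] = [item.text for item in child]
            else:
                doc[name] = child.text
        docs.append(doc)
    return docs


for doc in parse_docs(SAMPLE_XML):
    print(doc)
```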