Use bloom

Overview

The bloom module implements an index access method based on Bloom filters. A Bloom filter is a space-efficient data structure used to check whether an element is a member of a set. The index access method built on bloom filters allows you to quickly exclude non-matching tuples using signatures, the size of which is determined when creating the index.

A signature is an imprecise representation of the indexed columns (attributes) and, as a result, is prone to false positives. It means that it may be reported that an element is in the set, when it is not. So, index search results must always be rechecked using the actual attribute values of the heap entry. The larger the signature size, the lower the probability of false positives and the number of unnecessary heap accesses, but this entails a larger index and slower scans.

There are no false negatives when using a bloom index, so it can be used to effectively exclude the presence of an element in a table.

This type of index is most useful when a table has many columns and queries check arbitrary combinations of them. A traditional B-tree index is faster than a bloom index, but supporting all possible queries may require multiple B-tree indexes, while only one bloom index is needed. Note that bloom indexes only support equality checks, while B-tree indexes are also useful for inequality checks and range searches.

Limitations

The bloom indexes have the following limitations:

  • Only operator classes for the int4 and text types are included with the bloom module.

  • Only the equal operator (=) is supported for search.

  • The bloom index access method does not support UNIQUE indexes.

  • NULL values are ignored when creating bloom index.

Installation

The package required for the bloom installation is shipped with ADP. To use bloom, just run the CREATE EXTENSION command:

CREATE EXTENSION bloom;
NOTE
If the bloom extension is created in the template1 database used as the default template, all subsequently created databases will have this extension installed.

ADP uses the 1.0 bloom version. To check it, execute the following query:

SELECT extversion FROM pg_extension
    WHERE extname = 'bloom';
 extversion
------------
1.0

Once you create the bloom extension, its functionality will be available in the current database.

Create a bloom Index

To create a bloom index, use the command CREATE INDEX. The USING expression should specify the bloom method, and the following parameters can be set in the WITH expression:

  • length — length of each signature (index entry) in bits. It is rounded up to the nearest multiple of 16. The default value is 80; the maximum value is 4096.

  • col1 — col32 — number of bits generated for each index column. Each parameter name refers to the number of the index column for which it is set. The default value is 2, and the maximum value is 4095. Parameters for unused index columns are ignored.

CREATE INDEX <index_name> ON <table_name> USING bloom (<column1>, <column2>...<columnN>)
       WITH (length=<length_value>, col1=<col_value1>, col2=<col_value2>... colN=<col_valueN>);

Test a bloom index

This section compares the performance of multicolumn B-tree and bloom indexes. The test below is deliberately simplified and configured to show that the bloom index can be more effective for a specific query.

To test indexes, create a table and fill it with data:

CREATE TABLE table_bloom (field1 int,
                          field2 int,
                          field3 int,
                          field4 text,
                          field5 int);

INSERT INTO table_bloom SELECT (random() * 1000000)::int,
                               (random() * 1000000)::int,
                               (random() * 1000000)::int,
                               md5(i::text),
                               floor(random()* 10010)
                               FROM generate_series(1,100*1e6) i;

Use the pg_relation_size function to determine the table size:

SELECT pg_size_pretty(pg_relation_size('table_bloom'));
 pg_size_pretty
----------------
8054 MB

Create a B-tree index:

CREATE INDEX idx_btree ON table_bloom (field1, field2, field3, field4, field5);

Check the index size:

SELECT pg_size_pretty(pg_relation_size('idx_btree'));
 pg_size_pretty
----------------
 7327 MB

To ensure that the query planner uses indexes in the test, disable sequential scan:

SET enable_seqscan = off;

Run the ANALYZE command to collect statistics for the query planner:

 ANALYZE table_bloom;

Use the EXPLAIN ANALYZE command to estimate performance. Execute two queries to get more objective data:

EXPLAIN ANALYZE SELECT * FROM table_bloom WHERE field3 = 665955 and field5 = 9044;
                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using idx_btree on table_bloom  (cost=0.57..4751544.58 rows=1 width=49) (actual time=5827.032..5827.033 rows=0 loops=1)
   Index Cond: ((field3 = 665955) AND (field5 = 9044))
   Heap Fetches: 0
 Planning Time: 0.066 ms
 JIT:
   Functions: 1
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.121 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 0.121 ms
 Execution Time: 5827.240 ms
EXPLAIN ANALYZE SELECT * FROM table_bloom WHERE field2 = 293202 and field3 = 12624;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using idx_btree on table_bloom  (cost=0.57..4751544.74 rows=1 width=49) (actual time=5790.038..5790.040 rows=0 loops=1)
   Index Cond: ((field2 = 293202) AND (field3 = 12624))
   Heap Fetches: 0
 Planning Time: 0.068 ms
 JIT:
   Functions: 1
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.136 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 0.136 ms
 Execution Time: 5790.268 ms

The query execution time is 5827.240 ms and 5790.268 ms, the size of the B-tree index is 7327 MB.

Create a bloom index for the same columns:

CREATE INDEX idx_bloom ON table_bloom USING bloom (field1, field2, field3, field4, field5)
            WITH(length=64, col1=4, col2=4, col3=4, col4=4, col5=4);

Check the index size:

SELECT pg_size_pretty(pg_relation_size('idx_bloom'));
 pg_size_pretty
----------------
  1342 MB

Run the same queries:

EXPLAIN ANALYZE SELECT * FROM table_bloom WHERE field3 = 665955 and field5 = 9044;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on table_bloom  (cost=1687292.00..1687296.02 rows=1 width=49) (actual time=763.776..763.778 rows=0 loops=1)
   Recheck Cond: ((field3 = 665955) AND (field5 = 9044))
   Rows Removed by Index Recheck: 861
   Heap Blocks: exact=861
   ->  Bitmap Index Scan on idx_bloom  (cost=0.00..1687292.00 rows=1 width=0) (actual time=732.889..732.890 rows=861 loops=1)
         Index Cond: ((field3 = 665955) AND (field5 = 9044))
 Planning Time: 0.080 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.267 ms, Inlining 2.481 ms, Optimization 17.609 ms, Emission 10.009 ms, Total 30.366 ms
 Execution Time: 1839.632 ms
EXPLAIN ANALYZE SELECT * FROM table_bloom WHERE field2 = 293202 and field3 = 12624;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on table_bloom  (cost=1687292.00..1687296.02 rows=1 width=49) (actual time=1144.026..1144.028 rows=0 loops=1)
   Recheck Cond: ((field2 = 293202) AND (field3 = 12624))
   Rows Removed by Index Recheck: 930
   Heap Blocks: exact=930
   ->  Bitmap Index Scan on idx_bloom  (cost=0.00..1687292.00 rows=1 width=0) (actual time=1044.350..1044.351 rows=930 loops=1)
         Index Cond: ((field2 = 293202) AND (field3 = 12624))
 Planning Time: 0.078 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.229 ms, Inlining 2.192 ms, Optimization 13.375 ms, Emission 8.267 ms, Total 24.063 ms
 Execution Time: 1144.350 ms

The query execution time is 1839.632 ms and 1144.350 ms, the size of the bloom index is 1342 MB.

From the test above, it can be concluded that a multicolumn bloom index allows you to save disk space and speed up searching through combinations of columns when creating multiple B-tree indexes is impractical.

Found a mistake? Seleсt text and press Ctrl+Enter to report it