
HBase data model

HBase is a column-oriented database. Although it stores data in tables with rows and columns, as relational databases do, these tables should not be treated as sets of rows with a fixed schema. It is more helpful to think of HBase tables as multi-dimensional maps.

Basic objects

First, consider the data model terminology used in HBase:

  • Table. A set of rows, each of which can have one or more columns that belong to fixed column families. The table schema defines only the column families.

  • Row. Consists of a row key and columns with their associated values. Rows are stored sorted lexicographically by row key.

  • Row key. A unique row identifier, used as a primary key. Any access to HBase tables uses this key. You should think carefully about the design of your row keys.

    TIP
    As rows are sorted by the row key, the goal is to store related rows near each other. At the same time, keep the distribution of row keys uniform, so that some ranges of row keys are not accessed much more often than others.
  • Column. Consists of a column family and a column qualifier, which are delimited by the colon character, for example: cf:column1, cf:column2, etc.

  • Column family. A logical and physical grouping of columns and their values, often for performance reasons. Column families are fixed at table creation and can’t be changed on the fly. However, a given row does not have to store anything in a given column family.

    A column family can have a set of storage properties, such as whether its values should be cached in memory, how its data is compressed or its row keys are encoded, the time to live (TTL), and the maximum number of stored versions for values.

    CAUTION
    It is not recommended to define more than three column families for each table.
  • Column qualifier. A column identifier that indexes a given piece of data within a column family. Though column families are fixed, column qualifiers are mutable: they can be added on the fly and can differ greatly between rows of the same table.

  • Cell. A combination of a row, a column family, and a column qualifier. It contains a data value and a timestamp that represents the current value version.

  • Timestamp. A version identifier of a cell value. By default, it represents the time on the RegionServer when this data value was written to HBase. You can also specify a different timestamp value.

  • Value. A piece of data stored in HBase and identified by a specific combination: <Table, Row Key, Column Family, Column Qualifier, Timestamp>.

The HBase data model can be remembered as the following key/value match: <Table, Row key, Column family, Column qualifier, Timestamp> → Value
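
The mapping above can be sketched in Python as a flat dictionary keyed by the coordinate tuple. This is only an illustrative in-memory model, not HBase client code: the `store` dictionary, the `put` and `get` helpers, and the sample table name and values are all assumptions made for the sketch.

```python
import time

# Illustrative in-memory model only, not an HBase API:
# (table, row key, column family, column qualifier, timestamp) -> value
store = {}


def put(table, row, family, qualifier, value, timestamp=None):
    """Write one value version; default the timestamp to the current time in ms."""
    if timestamp is None:
        timestamp = int(time.time() * 1000)
    store[(table, row, family, qualifier, timestamp)] = value
    return timestamp


def get(table, row, family, qualifier, timestamp=None):
    """Return the requested version, or the newest one if no timestamp is given."""
    versions = {
        key[4]: value
        for key, value in store.items()
        if key[:4] == (table, row, family, qualifier)
    }
    if not versions:
        return None  # nothing is stored for a missing column
    if timestamp is None:
        timestamp = max(versions)
    return versions.get(timestamp)


put("t1", "row1", "cf", "column1", "old", timestamp=100)
put("t1", "row1", "cf", "column1", "new", timestamp=200)
print(get("t1", "row1", "cf", "column1"))       # newest version
print(get("t1", "row1", "cf", "column1", 100))  # explicitly requested version
print(get("t1", "row1", "cf", "column2"))       # column that was never written
```

Writing a second value to the same column does not overwrite the first one: it adds a new key with a different timestamp, which is how versioning appears in this model.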

Workflow overview

At the logical level, the data in HBase is organized into tables indexed by the row key, which acts as the primary key. You can store an unlimited number of columns for each row key. Columns are organized into groups called column families. As a rule, each column family combines columns with the same usage and storage pattern.

Columns store pieces of data called values. Each column is defined by its column family name and its own unique name, called the column qualifier. A column can store several versions of one value, and different versions have different timestamps.

Suppose that there is a table with two column families and two rows identified by their row keys. For each combination of the row key and the column family, you can define different column qualifiers. You can also add new column qualifiers on the fly.

The approximate logical view of such a table is shown below (where ts is an abbreviation of "timestamp").

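Logical view

Row Key   | Column Family 1                                 | Column Family 2
          | Column Qualifier 1     | Column Qualifier 2     | Column Qualifier 3 | Column Qualifier 4
----------+------------------------+------------------------+--------------------+------------------------
Row Key 1 | ts1:value1             | ts3:value3, ts2:value2 | ts6:value6         | -
Row Key 2 | ts5:value5, ts4:value4 | -                      | -                  | ts8:value8, ts7:value7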

The table above is sparse, because filling columns for each row key is optional. You don’t even have to fill every column family for your rows. A missing column causes no overhead for storing empty values, because the data is stored differently at the physical level than at the logical one.

Physically, all data values in HBase are stored sorted by row key. Blocks of nearby row keys form Regions, described in the HBase architecture article. The data that corresponds to different column families is stored separately, in different HFiles. Each column family uses at least one HFile. This makes it possible to read data only from the desired column family when necessary. Columns that belong to the same column family and the same row key are also stored as a sorted list, ordered by column qualifier.
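
The layout described above can be approximated in Python by grouping cells per column family and sorting each group. This is a simplified sketch under several assumptions: the sample `cells` are modeled on the example table (numeric timestamps 1 to 8 stand in for ts1 to ts8, and the short names `row1`, `cf1`, `cq1`, and so on are invented), the `physical_layout` helper is hypothetical, and a plain list stands in for an HFile, which in reality is a binary file with a more complex structure. Versions of one column are ordered newest first.

```python
from collections import defaultdict

# (row key, column family, column qualifier, timestamp, value), in arbitrary order
cells = [
    ("row2", "cf2", "cq4", 7, "value7"),
    ("row1", "cf1", "cq2", 2, "value2"),
    ("row1", "cf1", "cq1", 1, "value1"),
    ("row2", "cf1", "cq1", 5, "value5"),
    ("row1", "cf2", "cq3", 6, "value6"),
    ("row1", "cf1", "cq2", 3, "value3"),
    ("row2", "cf2", "cq4", 8, "value8"),
    ("row2", "cf1", "cq1", 4, "value4"),
]


def physical_layout(cells):
    """Group cells by column family, then sort each group by row key,
    column qualifier, and timestamp (newest first)."""
    files = defaultdict(list)
    for row, family, qualifier, ts, value in cells:
        files[family].append((row, qualifier, ts, value))
    for group in files.values():
        group.sort(key=lambda cell: (cell[0], cell[1], -cell[2]))
    return dict(files)


layout = physical_layout(cells)
for family in sorted(layout):
    print(family)
    for cell in layout[family]:
        print("   ", cell)
```

Note that no entry exists for the empty positions of the logical table: a column that was never written simply does not appear in any group.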

Any column can be missing or present for each row key — empty values are never stored in the database.

When you delete a column value, it is not physically deleted immediately. Instead, it is marked with a special marker called a tombstone. The physical deletion of the data occurs later, during a major compaction.
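
The difference between marking and physically deleting can be illustrated with a small Python model of a single column. It is a deliberately simplified sketch: the `versions` and `tombstones` structures and the helpers `delete_version`, `read_latest`, and `major_compact` are hypothetical names, and only the deletion of one specific version is modeled.

```python
# Illustrative model of one column: timestamp -> value, plus a set of tombstones.
versions = {100: "v1", 200: "v2", 300: "v3"}
tombstones = set()


def delete_version(timestamp):
    """Mark one version as deleted without removing its data."""
    tombstones.add(timestamp)


def read_latest():
    """Return the newest version that is not hidden by a tombstone."""
    visible = [ts for ts in versions if ts not in tombstones]
    return versions[max(visible)] if visible else None


def major_compact():
    """Physically drop the tombstoned versions and the tombstones themselves."""
    for ts in tombstones:
        versions.pop(ts, None)
    tombstones.clear()


delete_version(300)
print(read_latest())   # the deleted version is already hidden from reads
print(len(versions))   # but its data is still stored
major_compact()
print(len(versions))   # after compaction the data is physically gone
```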

In addition to deleting old value versions manually, you can configure automatic deletion. At the column family level, you can define parameters such as the time to live (TTL) and the maximum number of stored versions. If the difference between the current time and the timestamp of a particular version is greater than the TTL, that version is marked for deletion. If the number of versions for a column exceeds the maximum number of stored versions, the oldest excess versions are also marked for deletion.
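
The two rules can be expressed as a short Python function. This is a sketch of the rules as stated above, not of HBase internals: the function name `visible_versions`, its parameters, and the sample data are assumptions, and the TTL is given in milliseconds here only to match the timestamp unit.

```python
def visible_versions(versions, now_ms, ttl_ms, max_versions):
    """Return the versions of one column that survive both rules.

    versions: dict mapping timestamp (ms) -> value.
    """
    # Rule 1: a version expires when its age is greater than the TTL.
    fresh = [ts for ts in versions if now_ms - ts <= ttl_ms]
    # Rule 2: of the remaining versions, keep only the newest max_versions.
    kept = sorted(fresh, reverse=True)[:max_versions]
    return {ts: versions[ts] for ts in kept}


versions = {1_000: "a", 5_000: "b", 8_000: "c", 9_000: "d"}
survivors = visible_versions(versions, now_ms=10_000, ttl_ms=6_000, max_versions=2)
print(survivors)
```

In this run, the version written at 1000 ms is older than the TTL, and the version written at 5000 ms is dropped because only the two newest versions are kept.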

The approximate physical view of our example table is shown below. You can see that all the data is stored in one Region, and different column families belong to different HFiles inside of this Region.

Figure: Physical view

Example

This section illustrates the HBase data model with a concrete example.

Suppose there is a need to store text articles. Each article can have basic attributes, which are changed very rarely (author name, article header, creation datetime, etc.), and tags, which describe the article content and help to perform text search among all articles (arch, concepts, tutorials, ref, etc.). The tags are modified more often.

TIP
Since the data that corresponds to different column families is stored separately, try to distribute it according to the frequency of use.

In the given example, we could create the table articles with two column families: basic — for storing rarely modified data, and tags — for storing document tags. The following JSON illustrates the possible data model in this case.

Do not treat this code as an illustration of how the data is physically stored: it is only a mock-up that helps to understand the basic data objects of HBase. JSON is better suited than a regular two-dimensional table for describing the mappings used in HBase.

{
    "article1": { (1)
        "basic": { (2)
            "author": { (3)
                1637054560096: "Test author" (4)
            },
            "header": {
                1637056832082: "Test article. Version 3",
                1637055836875: "Test article. Version 2",
                1637054560118: "Test article"
            }
        },
        "tags": {
            "arch": {
                1637054560141: true
            },
            "concepts": {
                1637054560160: true
            },
            "tutorials": {
                1637054564066: true
            }
        }
    },
    "article2": {
        "basic": {
            "author": {
                1637054576501: "Test author2"
            },
            "header": {
                1637054576516: "Test article2"
            }
        },
        "tags": {
            "ref": {
                1637054577512: true
            }
        }
    }
}
1 The top level of key/value pairs describes table rows: article1 and article2 are row keys, unique for each row.
2 The second level describes column families, which are fixed for each table row: basic and tags.
3 The third level describes column qualifiers: author and header inside the column family basic; arch, concepts, tutorials, and ref inside the column family tags. Column qualifiers are not fixed: they can be added on the fly, and their set can differ from one row to another.
4 The bottom level describes cells — pairs of timestamps and values. Thanks to timestamps, versioning of values is supported.

So, to get a specific data value stored in HBase, you need to know the following path: Table → Row Key → Column Family → Column Qualifier → Timestamp. If you do not specify a timestamp, you get the most recent value. For example, if you get the value of the row with the key article1 and the column basic:header, the result is the most recent value, Test article. Version 3. But if you additionally specify the timestamp 1637054560118, you get the value Test article.

hbase(main):005:0> get 'articles', 'article1', {COLUMN => 'basic:header', TIMESTAMP => 1637054560118}
COLUMN                CELL
 basic:header         timestamp=1637054560118, value=Test article
1 row(s)
Took 0.0202 seconds
hbase(main):006:0> get 'articles', 'article1', 'basic:header'
COLUMN                CELL
 basic:header         timestamp=1637056832082, value=Test article. Version 3
1 row(s)
Took 0.0054 seconds
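
The same lookups can be reproduced against the mock-up in plain Python, which makes the path through the nested mapping explicit. This is a sketch over an in-memory dictionary, not a call to HBase: the `articles` dictionary copies the mock-up data, and `get_cell` is a hypothetical helper written for this illustration.

```python
# The articles mock-up as nested dicts:
# row key -> column family -> column qualifier -> timestamp -> value
articles = {
    "article1": {
        "basic": {
            "author": {1637054560096: "Test author"},
            "header": {
                1637056832082: "Test article. Version 3",
                1637055836875: "Test article. Version 2",
                1637054560118: "Test article",
            },
        },
        "tags": {
            "arch": {1637054560141: True},
            "concepts": {1637054560160: True},
            "tutorials": {1637054564066: True},
        },
    },
    "article2": {
        "basic": {
            "author": {1637054576501: "Test author2"},
            "header": {1637054576516: "Test article2"},
        },
        "tags": {"ref": {1637054577512: True}},
    },
}


def get_cell(table, row_key, column, timestamp=None):
    """Follow Row Key -> Column Family -> Column Qualifier -> Timestamp.

    Returns a (timestamp, value) pair, or None if nothing is stored there.
    """
    family, qualifier = column.split(":", 1)
    versions = table.get(row_key, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    if timestamp is None:
        timestamp = max(versions)  # no timestamp given: take the newest version
    if timestamp not in versions:
        return None
    return timestamp, versions[timestamp]


print(get_cell(articles, "article1", "basic:header", 1637054560118))
print(get_cell(articles, "article1", "basic:header"))
print(get_cell(articles, "article2", "tags:arch"))  # qualifier absent in this row
```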

The main commands, based on the data model shown in this example, are described in the Quick start with HBase shell article.
