PostgreSQL nodes used with Hive Metastore

External database sizing for the Apache Hive Metastore requires careful consideration of factors such as the number of tables, partitions, and users, as well as the overall workload. Although it is difficult to determine the exact size of an external database in advance, the following guidelines can help you make an informed decision:

  • Estimate the database size

    • The number of tables. Each table requires storage for its schema and metadata, typically from a few kilobytes to several megabytes per table, depending on the schema complexity.

    • The number of partitions. The storage requirements for partitions depend on the partition key data types and the number of partition columns. On average, you can estimate around 1 KB per partition.

    • User information and permissions. This requires a small amount of storage, usually a few kilobytes per user.

  • Historical and growth data. Consider your data’s historical growth rate and expected future growth. Regularly monitor the database size and scale it as needed.

  • Replication and backup. Allocate additional storage for backup and replication to maintain redundancy and ensure data durability. A backup quota of 50% of the total database size is recommended.

  • Workload. Estimate the concurrency of queries and the number of users who will access the system simultaneously. A higher workload may require more resources and storage for caching, query planning, and temporary storage.

  • Buffer for performance optimization. Leave some additional storage space for indexing, caching, and other performance optimizations, typically around 20-30% of the total database size.

While these guidelines can help you estimate the size of the external database for the Hive Metastore, it is crucial to keep monitoring usage and adjust the size accordingly.
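As a rough illustration of the guidelines above, the sketch below combines the per-object estimates with the recommended 50% backup quota and a 25% performance buffer (the midpoint of the suggested 20-30%). The average sizes per table and per user are hypothetical values chosen from within the ranges quoted in the text; adjust them to match your actual schemas.

```python
def estimate_metastore_db_size_gb(num_tables, num_partitions, num_users,
                                  avg_table_kb=100,      # hypothetical average; text says a few KB to several MB
                                  avg_partition_kb=1,    # ~1 KB per partition, per the guidelines
                                  avg_user_kb=2,         # hypothetical average; text says a few KB per user
                                  buffer_ratio=0.25,     # 20-30% performance buffer, per the guidelines
                                  backup_ratio=0.5):     # 50% backup quota, per the guidelines
    """Rough sizing estimate (in GB) for a Hive Metastore external database."""
    # Base metadata footprint: tables + partitions + user/permission records.
    base_kb = (num_tables * avg_table_kb
               + num_partitions * avg_partition_kb
               + num_users * avg_user_kb)
    # Reserve extra space for indexing, caching, and other optimizations.
    with_buffer_kb = base_kb * (1 + buffer_ratio)
    # Reserve the backup quota on top of the buffered size.
    backup_kb = with_buffer_kb * backup_ratio
    return round((with_buffer_kb + backup_kb) / 1024 ** 2, 2)

# Example: 10,000 tables, 1,000,000 partitions, 200 users
print(estimate_metastore_db_size_gb(10_000, 1_000_000, 200))
```

Treat the result as a starting point only; as the text notes, historical growth and ongoing monitoring should drive the final sizing.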

IMPORTANT

The following system requirements are minimums. Calculate the target sizing based on the customer requirements.

The minimum hardware requirements for the external database hosts are listed in the table below.

| Requirement | Small cluster | Medium cluster | Large cluster |
|---|---|---|---|
| Number of concurrent connections (user sessions) | 5-10 | 20-50 | 50+ |
| CPU | 64-bit is recommended*, 4+ dedicated cores | 64-bit, 4-8 cores | 64-bit, 8+ cores |
| RAM | 8-16 GB | 16-32 GB | 32+ GB |
| Disk space (minimum 1 separate disk, SSD, RAID 1 or RAID 10) | 50+ GB | 100 GB | 200+ GB |

CPUs with larger L3 caches are more suitable for larger sets of data.

* If you need to use a 32-bit binary, set LDR_CNTRL to MAXDATA=0xn0000000, where 1 <= n <= 8, before starting the PostgreSQL server. Try different values and postgresql.conf settings to find a configuration that operates satisfactorily. For more information, see Platform-Specific Notes.
