HDFS vs Ozone

Both Ozone (O3) and HDFS are open-source distributed storage systems suitable for Hadoop, but there are several key differences between them, which this article discusses.

Feature comparison

The table below compares the key features of the two systems.

| Feature | HDFS | Ozone |
| --- | --- | --- |
| Data model | A file-based storage system where data is stored in files and directories | An object store that works with large amounts of unstructured data and is optimized for cloud |
| Data replication | Replication among DataNodes to ensure fault tolerance by default | Software-defined storage that allows for custom data replication policies and data redundancy |
| Scalability | Good scalability for handling massive processing jobs | Designed to provide even better scalability than HDFS |
| Namespace management | Single namespace for the entire cluster | Multiple namespaces for different use cases |
| Object storage | No | Yes |
| Support for S3 and other object storage protocols | No | Yes |
| Access control | POSIX-style permissions | S3-style permissions and bucket-level access controls |
| Authentication and authorization | Kerberos | Kerberos, Ozone Token |
| Data consistency | Eventual consistency | Strong consistency due to protocols like RAFT |
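To make the namespace row more concrete: Ozone addresses data as keys inside buckets, which in turn live inside volumes. The sketch below splits an `ofs://` path into those three parts. It is only an illustration; the helper name `split_ofs_path`, the `OzonePath` type, and the host `om.example.com` are hypothetical and do not come from this article.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class OzonePath(NamedTuple):
    """Hypothetical container for the parts of an Ozone path."""
    volume: str
    bucket: str
    key: str


def split_ofs_path(uri: str) -> OzonePath:
    """Split an ofs:// URI into volume, bucket, and key (illustrative only)."""
    parsed = urlparse(uri)
    if parsed.scheme != "ofs":
        raise ValueError(f"expected an ofs:// URI, got: {uri!r}")
    # Path layout assumed here: /<volume>/<bucket>/<key...>
    parts = parsed.path.lstrip("/").split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError("URI must contain at least a volume and a bucket")
    key = parts[2] if len(parts) == 3 else ""
    return OzonePath(volume=parts[0], bucket=parts[1], key=key)


if __name__ == "__main__":
    # om.example.com is a placeholder Ozone Manager address.
    print(split_ofs_path("ofs://om.example.com/vol1/bucket1/logs/app.log"))
```

In HDFS, by contrast, the same file would be addressed by a single directory path in the one cluster-wide namespace.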

Pros and cons

HDFS

HDFS is the default file system in Hadoop, and it has the following pros:

  • massive data storage support;

  • quick detection and response to hardware failures;

  • support for data streaming;

  • simplified consistency model;

  • high fault tolerance and easy recovery;

  • designed to run on commodity hardware.

However, there are also some disadvantages to it:

  • not suitable for a large number of small files;

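  • doesn’t support in-place file modification (starting with HDFS 2.x, content can only be appended to files);

  • struggles once the namespace grows beyond roughly 400 million files, because the NameNode keeps all metadata in memory;

  • doesn’t support parallel writing to the same file.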
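The file-count limit can be illustrated with simple arithmetic. The sketch below estimates how much NameNode heap the metadata alone would take. It assumes the commonly quoted rule of thumb of about 150 bytes of heap per namespace object (file, directory, or block); that figure and the function name `estimate_namenode_heap_gb` are assumptions for illustration, not values taken from this article.

```python
# Assumed rule of thumb: ~150 bytes of NameNode heap per namespace object.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_gb(num_files: int, blocks_per_file: int = 1) -> float:
    """Rough metadata footprint in GB (10**9 bytes), ignoring directories."""
    # Each file contributes one file object plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1e9


if __name__ == "__main__":
    for files in (1_000_000, 100_000_000, 400_000_000):
        heap = estimate_namenode_heap_gb(files)
        print(f"{files:>11,} small files -> ~{heap:.1f} GB of heap")
```

Under these assumptions, 400 million single-block files need on the order of 120 GB of heap on one NameNode, which is why many small files hurt HDFS far more than the same volume of data stored in a few large files.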

Ozone

Because the limitations of HDFS make it a poor fit for modern big data storage needs, Ozone was created as a new solution with the following key advantages:

  • strong consistency;

  • designed to store more than 100 billion objects in a single cluster;

  • great scalability due to layered architecture;

  • just as fault tolerant and easily recoverable as HDFS;

  • can work alongside HDFS on the same hosts.

Since the project is relatively new, it also has some cons:

  • few deployment cases to learn from;

  • although it is designed to integrate with the Hadoop ecosystem, it is not yet widely supported, and some services may require additional configuration to work with Ozone;

  • no local socket, and overall performance is slower than that of HDFS.

Use cases

Apache Ozone has the advantage over HDFS in environments that require scalability for small files, S3 compatibility, or cloud-native capabilities. However, HDFS remains suitable for Hadoop workloads that do not need cloud integration and do not have to store large numbers of small files that cannot be combined.
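The guidance above can be summarized as a small decision helper. This is only a sketch of the rule of thumb stated in this section; the `Workload` fields and the `suggest_storage` function are hypothetical names, and a real choice would weigh many more factors.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a storage workload."""
    many_small_files: bool = False   # small files that cannot be combined
    needs_s3_api: bool = False       # clients expect S3 compatibility
    cloud_native: bool = False       # cloud integration is required


def suggest_storage(workload: Workload) -> str:
    """Apply the rule of thumb from this section (illustrative only)."""
    if workload.many_small_files or workload.needs_s3_api or workload.cloud_native:
        return "Ozone"
    return "HDFS"


if __name__ == "__main__":
    print(suggest_storage(Workload(needs_s3_api=True)))
    print(suggest_storage(Workload()))
```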
