Arenadata Hadoop

Arenadata Hadoop is a full-fledged enterprise distribution package based on Apache Hadoop and designed for storing and processing semi-structured and unstructured data.

Top 10 popular articles

Work with Hive tables

Hive provides several ways to work with tables. You can use data manipulation language (DML) queries to import or add data to a table. You can also load data directly into a Hive table using HDFS commands.

Use the Beeline shell with Hive

HiveServer2 supports the Beeline command shell, a JDBC client based on the SQLLine CLI.

Logging in Airflow

Airflow writes text logs used for analyzing errors that can occur while running DAGs. These logs are located in the logs subfolder of the Airflow home directory.
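
As a rough illustration of where these logs end up, the sketch below resolves the logs subfolder from the Airflow home directory. It assumes the stock Airflow convention that the home directory comes from the AIRFLOW_HOME environment variable and falls back to ~/airflow; the helper name airflow_logs_dir is hypothetical, and an actual installation may override the log location in its configuration.

```python
import os
from pathlib import Path


def airflow_logs_dir(env=None):
    """Return the expected path of the Airflow logs subfolder.

    Assumption: the Airflow home directory is taken from AIRFLOW_HOME
    and defaults to ~/airflow when the variable is not set.
    """
    env = os.environ if env is None else env
    home = env.get("AIRFLOW_HOME") or str(Path.home() / "airflow")
    return Path(home).expanduser() / "logs"


if __name__ == "__main__":
    # Print the logs folder for the current environment.
    print(airflow_logs_dir())
```

Passing an explicit mapping, for example airflow_logs_dir({"AIRFLOW_HOME": "/opt/airflow"}), makes it possible to check the path without touching the real environment.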

Connect to kerberized Hive using DBeaver

A guide on using DBeaver to connect to Hive with Kerberos authentication enabled.

Create a simple DAG

The article shows how to create and run your first DAG to process CSV files.

Airflow architecture

Airflow is a platform for developing, scheduling, running, and monitoring complex workflows. It is well suited to ETL/ELT processes and is also useful when you need to run processes periodically and monitor their execution.

Protect files in HDFS

In HDFS, you can restrict access to files or directories using a standard model based on POSIX with modifications. You can grant permissions to a file for its owner, a specified user group, and other users.
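
The owner/group/other model described above can be sketched in a few lines. This is a simplified illustration, not HDFS code: the FileStatus class and the is_allowed function are hypothetical names, and the sketch deliberately ignores the superuser, ACLs, and the sticky bit. It only shows the order in which the three permission classes are consulted.

```python
from dataclasses import dataclass

# Permission bits, as in POSIX: read = 4, write = 2, execute = 1.
READ, WRITE, EXECUTE = 4, 2, 1


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical stand-in for the metadata of a file or directory."""
    owner: str
    group: str
    mode: int  # octal mode, for example 0o640


def is_allowed(status, user, user_groups, wanted):
    """Check whether `user` holds all permission bits in `wanted`.

    Exactly one permission class applies: owner, then group, then other.
    """
    if user == status.owner:
        bits = (status.mode >> 6) & 0o7   # owner bits
    elif status.group in user_groups:
        bits = (status.mode >> 3) & 0o7   # group bits
    else:
        bits = status.mode & 0o7          # bits for all other users
    return (bits & wanted) == wanted


if __name__ == "__main__":
    report = FileStatus(owner="alice", group="analysts", mode=0o640)
    print(is_allowed(report, "alice", set(), READ | WRITE))
    print(is_allowed(report, "bob", {"analysts"}, WRITE))
```

Note that the classes are not cumulative: an owner whose owner bits deny an operation is refused even if the group or other bits would permit it.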

ADB Spark 3 Connector

ADB Spark 3 Connector enables high-speed, parallel data exchange between Spark 3 and Arenadata DB. The article contains a full description of the ADB Spark 3 Connector.

Solr overview

Solr is a search server designed to work with large data sets. Because Solr can also store data, it serves as both a NoSQL (non-relational) data store and a processing technology.

spark-submit & spark-shell

There are two main ways to launch Spark jobs on your cluster: with spark-submit and with spark-shell.
