Subqueries in Hive

A Hive subquery is a nested HiveQL statement that returns a result set during the execution of a parent (outer) query. As of ADH 3.1.2, Hive supports nesting subqueries inside FROM and WHERE clauses.

When Hive processes a query with a nested subquery, the subquery is evaluated first; then, its results are available in the parent query and can be referenced. The subquery mechanism is enabled by default and requires no manual activation.

Test database

Throughout this article, the following Hive tables employees and departments are used to demonstrate Hive subquery examples.

SELECT * FROM employees;
+--------------+----------------+---------------------+----------------------------+---------------+-------------------+
| employees.id | employees.name | employees.last_name |      employees.email       | employees.age | employees.dept_id |
+--------------+----------------+---------------------+----------------------------+---------------+-------------------+
| 1            | Ivan           | Ivanov              | ivan_ivanov123@mail.ru     | 30            | 1                 |
| 2            | Sarah          | Connor              | sarah_connor123@yahoo.com  | 35            | 2                 |
| 3            | Rick           | Sanchez             | rick_123@mail.ru           | 29            | 2                 |
| 4            | John           | Smith               | john_smith123@gmail.com    | 50            | 3                 |
+--------------+----------------+---------------------+----------------------------+---------------+-------------------+

SELECT * FROM departments;
+-----------------+-----------------------+-------------------------+------------------------+
| departments.id  | departments.dep_name  | departments.dep_office  | departments.dep_phone  |
+-----------------+-----------------------+-------------------------+------------------------+
| 1               | sales                 | 110                     | 9379992                |
| 2               | it                    | 115                     | 12345678               |
| 3               | support               | 125                     | 880080080              |
+-----------------+-----------------------+-------------------------+------------------------+
SQL for creating test tables

 
The following SQL can be used to create and populate test Hive tables using /bin/beeline.

CREATE DATABASE hr;

USE hr;

CREATE TABLE IF NOT EXISTS hr.employees (id int, name string, last_name string, email string, age int, salary int, dept_id int)
COMMENT 'Employee Table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

INSERT INTO hr.employees VALUES
(1, 'Ivan', 'Ivanov', 'ivan_ivanov123@mail.ru', 30, 1000, 1),
(2, 'Sarah', 'Connor', 'sarah_connor123@yahoo.com', 35, 1500, 2),
(3, 'Rick', 'Sanchez', 'rick_123@mail.ru', 29, 1300, 2),
(4, 'John', 'Smith', 'john_smith123@gmail.com', 29, 2000, 3);

CREATE TABLE IF NOT EXISTS hr.departments (id int, dep_name string, dep_office int, dep_phone string)
COMMENT 'Departments Table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

INSERT INTO hr.departments VALUES
(1, 'sales', 110, '9379992'),
(2, 'it', 115, '12345678'),
(3, 'support', 125, '880080080');

Subqueries in the WHERE clause

Only one subquery is allowed inside a WHERE clause. A subquery is called correlated if it contains a query predicate with the = operator and one side of the operator references at least one column from the parent query and the other side references at least one column from the subquery.

When used inside WHERE, the result of a subquery can be treated as input for IN/NOT IN statements.

Subqueries with IN

In the example below, the subquery gets department IDs from the employees table using a specific office number. Then, the outer query selects employees who work in this department.

SELECT e.name, e.last_name, e.email FROM employees e
WHERE e.dept_id IN
    (SELECT id FROM departments
    WHERE dep_office=110);

The result:

+---------+--------------+-------------------------+
| e.name  | e.last_name  |         e.email         |
+---------+--------------+-------------------------+
| Ivan    | Ivanov       | ivan_ivanov123@mail.ru  |
+---------+--------------+-------------------------+

Subqueries with NOT IN

In the example below, the subquery gets a department ID by a specific office number. Then, the outer query retrieves information about employees who are not located in the given office.

 SELECT e.name, e.last_name, e.email
 FROM employees e
 WHERE e.dept_id NOT IN
       (SELECT d.id
        FROM departments d
        WHERE d.dep_office=110)

The result:

+---------+--------------+----------------------------+
| e.name  | e.last_name  |          e.email           |
+---------+--------------+----------------------------+
| Sarah   | Connor       | sarah_connor123@yahoo.com  |
| Rick    | Sanchez      | rick_123@mail.ru           |
| John    | Smith        | john_smith123@gmail.com    |
+---------+--------------+----------------------------+

Subqueries with EXISTS

The following sample query returns office numbers of those employees who are of the same specific age. First, the subquery selects employees by age, and then the selection results are picked by EXISTS. For each record returned by the subquery, the outer query fetches the corresponding office number from the departments table.

SELECT d.dep_office
FROM departments d
WHERE EXISTS (
    SELECT * FROM employees e
    WHERE e.dept_id=d.id AND e.age=29
    );

The result:

+---------------+
| d.dep_office  |
+---------------+
| 115           |
| 125           |
+---------------+

Subqueries with aggregate functions

Hive provides limited support for aggregate functions within subqueries. The following sample query returns employees whose salary is higher than the average.

SELECT e.last_name, e.salary FROM employee e
WHERE salary > (
    SELECT AVG(salary)
    FROM employee);

The result:

+--------------+-----------+
| e.last_name  | e.salary  |
+--------------+-----------+
| Smith        | 2000      |
| Connor       | 1500      |
+--------------+-----------+

Subqueries in the FROM clause

Nested SELECT subqueries are typically used inside FROM clauses to fetch a draft dataset from one or more tables and then have it filtered/sort/modified in the outer query.

The number of subqueries nested inside a FROM clause can be arbitrary. SELECT subqueries nested inside FROM must be named since every table inside FROM must have a name. All columns in the subquery selection list must have unique names and are available in the outer query like columns of a table. For example:

SELECT t.name, t.email
FROM (SELECT e.name, e.email
      FROM employees e
      WHERE e.age=35) t; (1)
1 An alias (t) is required to qualify the nested SELECT query. This qualifier can be used in the parent query to reference columns selected by the subquery.

Nested SELECT examples

In the example below, the subquery fetches employees data from explicitly joined employees and departments tables. Then, the outer query assembles new email addresses using the columns fetched by the subquery.

SELECT t.old_email, concat(t.name, t.last_name, '_', t.department_name, '@new.domain.org') AS new_email
FROM
    (SELECT e.name, e.last_name, e.email AS old_email, d.dep_name AS department_name
     FROM employee e
        JOIN departments d ON e.dept_id=d.id
     ) t;

The result:

+----------------------------+-----------------------------------+
|        t.old_email         |            new_email              |
+----------------------------+-----------------------------------+
| sarah_connor123@yahoo.com  | SarahConnor_it@new.domain.org     |
| rick_123@mail.ru           | RickSanchez_it@new.domain.org     |
| john_smith123@gmail.com    | JohnSmith_support@new.domain.org  |
| ivan_ivanov123@mail.ru     | IvanIvanov_sales@new.domain.org   |
+----------------------------+-----------------------------------+
Found a mistake? Seleсt text and press Ctrl+Enter to report it