Subqueries in Hive
A Hive subquery is a nested HiveQL statement that returns a result set during the execution of a parent (outer) query. As of ADH 3.1.2, Hive supports nesting subqueries inside FROM and WHERE clauses.
When Hive processes a query with a nested subquery, the subquery is evaluated first; then, its results are available in the parent query and can be referenced. The subquery mechanism is enabled by default and requires no manual activation.
Test database
Throughout this article, the following Hive tables employees
and departments
are used to demonstrate Hive subquery examples.
SELECT * FROM employees; +--------------+----------------+---------------------+----------------------------+---------------+-------------------+ | employees.id | employees.name | employees.last_name | employees.email | employees.age | employees.dept_id | +--------------+----------------+---------------------+----------------------------+---------------+-------------------+ | 1 | Ivan | Ivanov | ivan_ivanov123@mail.ru | 30 | 1 | | 2 | Sarah | Connor | sarah_connor123@yahoo.com | 35 | 2 | | 3 | Rick | Sanchez | rick_123@mail.ru | 29 | 2 | | 4 | John | Smith | john_smith123@gmail.com | 50 | 3 | +--------------+----------------+---------------------+----------------------------+---------------+-------------------+ SELECT * FROM departments; +-----------------+-----------------------+-------------------------+------------------------+ | departments.id | departments.dep_name | departments.dep_office | departments.dep_phone | +-----------------+-----------------------+-------------------------+------------------------+ | 1 | sales | 110 | 9379992 | | 2 | it | 115 | 12345678 | | 3 | support | 125 | 880080080 | +-----------------+-----------------------+-------------------------+------------------------+
The following SQL can be used to create and populate test Hive tables using /bin/beeline.
CREATE DATABASE hr;
USE hr;
CREATE TABLE IF NOT EXISTS hr.employees (id int, name string, last_name string, email string, age int, salary int, dept_id int)
COMMENT 'Employee Table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
INSERT INTO hr.employees VALUES
(1, 'Ivan', 'Ivanov', 'ivan_ivanov123@mail.ru', 30, 1000, 1),
(2, 'Sarah', 'Connor', 'sarah_connor123@yahoo.com', 35, 1500, 2),
(3, 'Rick', 'Sanchez', 'rick_123@mail.ru', 29, 1300, 2),
(4, 'John', 'Smith', 'john_smith123@gmail.com', 29, 2000, 3);
CREATE TABLE IF NOT EXISTS hr.departments (id int, dep_name string, dep_office int, dep_phone string)
COMMENT 'Departments Table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
INSERT INTO hr.departments VALUES
(1, 'sales', 110, '9379992'),
(2, 'it', 115, '12345678'),
(3, 'support', 125, '880080080');
Subqueries in the WHERE clause
Only one subquery is allowed inside a WHERE
clause.
A subquery is called correlated if it contains a query predicate with the =
operator and one side of the operator references at least one column from the parent query and the other side references at least one column from the subquery.
When used inside WHERE
, the result of a subquery can be treated as input for IN
/NOT IN
statements.
Subqueries with IN
In the example below, the subquery gets department IDs from the employees
table using a specific office number.
Then, the outer query selects employees who work in this department.
SELECT e.name, e.last_name, e.email FROM employees e
WHERE e.dept_id IN
(SELECT id FROM departments
WHERE dep_office=110);
The result:
+---------+--------------+-------------------------+ | e.name | e.last_name | e.email | +---------+--------------+-------------------------+ | Ivan | Ivanov | ivan_ivanov123@mail.ru | +---------+--------------+-------------------------+
Subqueries with NOT IN
In the example below, the subquery gets a department ID by a specific office number. Then, the outer query retrieves information about employees who are not located in the given office.
SELECT e.name, e.last_name, e.email
FROM employees e
WHERE e.dept_id NOT IN
(SELECT d.id
FROM departments d
WHERE d.dep_office=110)
The result:
+---------+--------------+----------------------------+ | e.name | e.last_name | e.email | +---------+--------------+----------------------------+ | Sarah | Connor | sarah_connor123@yahoo.com | | Rick | Sanchez | rick_123@mail.ru | | John | Smith | john_smith123@gmail.com | +---------+--------------+----------------------------+
Subqueries with EXISTS
The following sample query returns office numbers of those employees who are of the same specific age.
First, the subquery selects employees by age, and then the selection results are picked by EXISTS
.
For each record returned by the subquery, the outer query fetches the corresponding office number from the departments
table.
SELECT d.dep_office
FROM departments d
WHERE EXISTS (
SELECT * FROM employees e
WHERE e.dept_id=d.id AND e.age=29
);
The result:
+---------------+ | d.dep_office | +---------------+ | 115 | | 125 | +---------------+
Subqueries with aggregate functions
Hive provides limited support for aggregate functions within subqueries. The following sample query returns employees whose salary is higher than the average.
SELECT e.last_name, e.salary FROM employee e
WHERE salary > (
SELECT AVG(salary)
FROM employee);
The result:
+--------------+-----------+ | e.last_name | e.salary | +--------------+-----------+ | Smith | 2000 | | Connor | 1500 | +--------------+-----------+
Subqueries in the FROM clause
Nested SELECT
subqueries are typically used inside FROM
clauses to fetch a draft dataset from one or more tables and then have it filtered/sort/modified in the outer query.
The number of subqueries nested inside a FROM
clause can be arbitrary.
SELECT
subqueries nested inside FROM
must be named since every table inside FROM
must have a name.
All columns in the subquery selection list must have unique names and are available in the outer query like columns of a table.
For example:
SELECT t.name, t.email
FROM (SELECT e.name, e.email
FROM employees e
WHERE e.age=35) t; (1)
1 | An alias (t ) is required to qualify the nested SELECT query. This qualifier can be used in the parent query to reference columns selected by the subquery. |
Nested SELECT examples
In the example below, the subquery fetches employees data from explicitly joined employees
and departments
tables.
Then, the outer query assembles new email addresses using the columns fetched by the subquery.
SELECT t.old_email, concat(t.name, t.last_name, '_', t.department_name, '@new.domain.org') AS new_email
FROM
(SELECT e.name, e.last_name, e.email AS old_email, d.dep_name AS department_name
FROM employee e
JOIN departments d ON e.dept_id=d.id
) t;
The result:
+----------------------------+-----------------------------------+ | t.old_email | new_email | +----------------------------+-----------------------------------+ | sarah_connor123@yahoo.com | SarahConnor_it@new.domain.org | | rick_123@mail.ru | RickSanchez_it@new.domain.org | | john_smith123@gmail.com | JohnSmith_support@new.domain.org | | ivan_ivanov123@mail.ru | IvanIvanov_sales@new.domain.org | +----------------------------+-----------------------------------+