Hive computation example
In this article, we use Hive to process criminal statistics based on the real incident reports provided by Boston Police Department (BPD). The dataset is provided by Analyze Boston. The fresh dataset is available on the Kaggle page.
Dataset
The dataset includes records starting from June 14, 2015 up to September 3, 2018. Each row represents a crime report including the type of crime, date and time, and location.
We have two files for processing:
-
crime.csv — the main file with crime records;
-
offense_codes.csv — the data file that contains crime codes.
After the processing we get the following information:
-
the most common types of crime;
-
the distribution of crimes by districts;
-
the frequency of different types of crimes.
Preparations
Before the processing, complete the steps below:
-
Create directories for the data files on the Hive server host using these commands:
$ hdfs dfs -mkdir /user/user1/crime $ hdfs dfs -mkdir /user/user1/offense_codes
-
Copy the data files into the directories you have created:
$ hdfs dfs -put ~/crime.csv /user/user1/crime $ hdfs dfs -put ~/offense_codes.csv /user/user1/offense_codes
Computations
-
Create the
offense_codes
table:CREATE EXTERNAL TABLE offense_codes ( CODE int, NAME string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/user1/offense_codes' tblproperties ("skip.header.line.count"="1");
-
Create the
crime
table:CREATE EXTERNAL TABLE crime ( INCIDENT_NUMBER string, OFFENSE_CODE int, OFFENSE_CODE_GROUP string, OFFENSE_DESCRIPTION string, DISTRICT string, REPORTING_AREA string, SHOOTING string, OCCURRED_ON_DATE string, YEAR int, MONTH int, DAY_OF_WEEK string, HOUR int, UCR_PART string, STREET string, Lat DECIMAL, Long DECIMAL, Location string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/user1/crime' tblproperties ("skip.header.line.count"="1");
TIPDo not forget to change the path where you store tables and data files. In this example, the sample path is /user/user1/. -
Run the computations with this SQL script:
with filtered_crimes as ( select * from crime cr where cr.district is not null ), crimes_w_codes as ( select cr.*, oc.*, trim(case when instr(oc.name, '-') != 0 then substr(oc.name, 1, instr(oc.name, '-') - 1) else oc.name end) as crime_type from filtered_crimes cr join offense_codes oc on cr.offense_code = oc.code ), crimes_district_analytics as ( select cr.district, count(incident_number) as crimes_total, avg(cr.lat) as lat, avg(cr.long) as lng from filtered_crimes cr group by cr.district ), crimes_by_district_by_month as ( select cr.district, cr.month, count(cr.incident_number) as crimes_cnt from filtered_crimes cr group by cr.district, cr.month ), crimes_district_median as ( select cr.district, percentile(cr.crimes_cnt,0.5) as crimes_monthly from crimes_by_district_by_month cr group by cr.district ), crimes_by_district_by_crime_type as ( select cr.district, cr.crime_type, count(cr.INCIDENT_NUMBER) as crimes_cnt from crimes_w_codes cr group by cr.district, cr.crime_type ), crimes_sort as ( select pre.district, collect_list(pre.crime_type) as frequent_crime_types from ( select cr.district, cr.crime_type, row_number() over (partition by cr.district order by cr.crimes_cnt desc) as rn from crimes_by_district_by_crime_type cr) pre where pre.rn <= 3 group by pre.district ) select cda.district, cda.crimes_total, cdm.crimes_monthly, cs.frequent_crime_types, cda.lat, cda.lng from crimes_district_analytics cda join crimes_district_median cdm on cda.district = cdm.district join crimes_sort cs on cda.district = cs.district;
Where:
-
crimes_total
— the total crimes number in the specified district. -
crimes_monthly
— the median crimes number per month in the specified district. -
frequent_crime_types
— the three most frequent crime types in the entire history of observations in the specified district, separated by a comma with a single space (,
), arranged in the descending order of frequency. -
crime_type
— the first part of theNAME
from the offense_codes.csv table, split by the-
separator (for example, ifNAME
isBURGLARY - COMMERICAL - ATTEMPT
, thencrime_type
isBURGLARY
). -
lat
— the latitude of the district coordinates, average for all incidents. -
lng
— the longitude of the district coordinates, average for all incidents.
After the processing, we get the following results:
No Stats for default@crime, Columns: district, incident_number, lat, long No Stats for default@crime, Columns: month No Stats for default@crime, Columns: offense_code No Stats for default@offense_codes, Columns: code, name Query ID = user1_20211112125439_22d654c1-caeb-47f8-adc9-8e2028602653 Total jobs = 1 Launching Job 1 out of 1 Status: Running (Executing on YARN cluster with App id application_1636705252031_0039) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 2 ...... container SUCCEEDED 3 3 0 0 0 0 Map 4 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 5 ...... container SUCCEEDED 3 3 0 0 0 0 Map 10 ......... container SUCCEEDED 1 1 0 0 0 0 Map 6 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 7 ...... container SUCCEEDED 3 3 0 0 0 0 Reducer 8 ...... container SUCCEEDED 2 2 0 0 0 0 Reducer 9 ...... container SUCCEEDED 3 3 0 0 0 0 Reducer 3 ...... container SUCCEEDED 3 3 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 10/10 [==========================>>] 100% ELAPSED TIME: 9,21 s ---------------------------------------------------------------------------------------------- OK cda.district cda.crimes_total cdm.crimes_monthly cs.frequent_crime_types cda.lat cda.lng ETC" 1522 369.5 ["\"WEAPON","FIREARM/WEAPON"] NULL 41.9364 ETC. " 1514 382.0 ["\"DRUGS"] NULL 41.8329 A7 13151 1062.5 ["SICK/INJURED/MEDICAL","INVESTIGATE PERSON","VANDALISM"] 41.9898 -70.9834 B2 48669 3876.0 ["VERBAL DISPUTE","SICK/INJURED/MEDICAL","INVESTIGATE PERSON"] 41.9943 -70.9907 C11 41460 3212.5 ["SICK/INJURED/MEDICAL","INVESTIGATE PERSON","VERBAL DISPUTE"] 41.9925 -70.9877 D4 41020 3221.5 ["PROPERTY","INVESTIGATE PERSON","SICK/INJURED/MEDICAL"] 41.9977 -70.9963 E5 12966 1022.5 ["SICK/INJURED/MEDICAL","INVESTIGATE PERSON","PROPERTY"] 41.9178 -70.8662 BRUSH 525 525.0 ["\"FIRE REPORT"] NULL NULL BUILDING 1269 1269.0 ["\"FIRE REPORT"] NULL NULL A15 6378 491.5 ["INVESTIGATE PERSON","VANDALISM","SICK/INJURED/MEDICAL"] 41.8010 -70.6760 1693 107.0 ["M/V ACCIDENT","M/V","INVESTIGATE PROPERTY"] 24.8485 -43.0789 ETC." 2597 680.0 ["\"DRUGS"] NULL 41.8254 FEET 1 1.0 ["\"A&B HANDS"] NULL NULL NON-NEGLIGIENT MANSLAUGHTER" 161 40.0 ["\"MURDER","MURDER NON"] NULL 41.7190 A1 34834 2698.5 ["PROPERTY","SICK/INJURED/MEDICAL","WARRANT ARREST"] 41.9741 -70.9578 B3 34618 2696.0 ["VERBAL DISPUTE","INVESTIGATE PERSON","MISSING PERSON"] 41.9961 -70.9936 C6 22903 1826.5 ["SICK/INJURED/MEDICAL","INVESTIGATE PERSON","TOWED MOTOR VEHICLE"] 41.8786 -70.8023 D14 19785 1581.0 ["TOWED MOTOR VEHICLE","SICK/INJURED/MEDICAL","INVESTIGATE PERSON"] 41.9931 -70.9888 E13 17010 1323.0 ["SICK/INJURED/MEDICAL","INVESTIGATE PERSON","TOWED MOTOR VEHICLE"] 41.9947 -70.9913 E18 16997 1289.5 ["SICK/INJURED/MEDICAL","INVESTIGATE PERSON","VERBAL DISPUTE"] 42.0000 -71.0000 Time taken: 12.691 seconds, Fetched: 20 row(s)