Конференция Arenadata
Новое время — новый Greenplum
Мы приглашаем вас принять участие в конференции, посвященной будущему Open-Source Greenplum 19 сентября в 18:00:00 UTC +3. Встреча будет проходить в гибридном формате — и офлайн, и онлайн. Онлайн-трансляция будет доступна для всех желающих.
Внезапное закрытие Greenplum его владельцем — компанией Broadcom - стало неприятным сюрпризом для всех, кто использует или планирует начать использовать решения на базе этой технологии. Многие ожидают выхода стабильной версии Greenplum 7 и надеются на её дальнейшее активное развитие.
Arenadata не могла допустить, чтобы разрабатываемый годами Open-Source проект Greenplum прекратил своё существование, поэтому 19 сентября мы представим наш ответ на данное решение Broadcom, а участники сообщества получат исчерпывающие разъяснения на все вопросы о дальнейшей судьбе этой технологии.

На конференции вас ждёт обсуждение следующих тем:

  • План возрождения Greenplum;
  • Дорожная карта;
  • Экспертное обсуждение и консультации.
Осталось до события

Hive computation example

In this article, we use Hive to process criminal statistics based on the real incident reports provided by Boston Police Department (BPD). The dataset is provided by Analyze Boston. The fresh dataset is available on the Kaggle page.

Dataset

The dataset includes records starting from June 14, 2015 up to September 3, 2018. Each row represents a crime report including the type of crime, date and time, and location.

We have two files for processing:

  • crime.csv — the main file with crime records;

  • offense_codes.csv — the data file that contains crime codes.

 

After the processing we get the following information:

  • the most common types of crime;

  • the distribution of crimes by districts;

  • the frequency of different types of crimes.

Preparations

Before the processing, complete the steps below:

  1. Create directories for the data files on the Hive server host using these commands:

    $ hdfs dfs -mkdir /user/user1/crime
    $ hdfs dfs -mkdir /user/user1/offense_codes
  2. Copy the data files into the directories you have created:

    $ hdfs dfs -put ~/crime.csv /user/user1/crime
    $ hdfs dfs -put ~/offense_codes.csv /user/user1/offense_codes

Computations

  1. Create the offense_codes table:

    CREATE EXTERNAL TABLE offense_codes (
      CODE int,
      NAME string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/user1/offense_codes' tblproperties ("skip.header.line.count"="1");
  2. Create the crime table:

    CREATE EXTERNAL TABLE crime (
        INCIDENT_NUMBER string,
        OFFENSE_CODE int,
        OFFENSE_CODE_GROUP string,
        OFFENSE_DESCRIPTION string,
        DISTRICT string,
        REPORTING_AREA string,
        SHOOTING string,
        OCCURRED_ON_DATE string,
        YEAR int,
        MONTH int,
        DAY_OF_WEEK string,
        HOUR int,
        UCR_PART string,
        STREET string,
        Lat DECIMAL,
        Long DECIMAL,
        Location string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/user1/crime' tblproperties ("skip.header.line.count"="1");
    TIP
    Do not forget to change the path where you store tables and data files. In this example, the sample path is /user/user1/.
  3. Run the computations with this SQL script:

    with filtered_crimes as (
        select *
        from crime cr
        where cr.district is not null
    ),
         crimes_w_codes as (
             select cr.*,
                    oc.*,
                    trim(case
                        when instr(oc.name, '-') != 0 then
                            substr(oc.name, 1, instr(oc.name, '-') - 1)
                        else oc.name end) as crime_type
             from filtered_crimes cr
                      join offense_codes oc
                           on cr.offense_code = oc.code
         ),
         crimes_district_analytics as (
             select cr.district, count(incident_number) as crimes_total, avg(cr.lat) as lat, avg(cr.long) as lng
             from filtered_crimes cr
             group by cr.district
         ),
         crimes_by_district_by_month as (
             select cr.district, cr.month, count(cr.incident_number) as crimes_cnt
             from filtered_crimes cr
             group by cr.district, cr.month
         ),
         crimes_district_median as (
             select cr.district, percentile(cr.crimes_cnt,0.5) as crimes_monthly
             from crimes_by_district_by_month cr
             group by cr.district
         ),
         crimes_by_district_by_crime_type as (
             select cr.district, cr.crime_type, count(cr.INCIDENT_NUMBER) as crimes_cnt
             from crimes_w_codes cr
             group by cr.district, cr.crime_type
         ),
         crimes_sort as (
             select pre.district, collect_list(pre.crime_type) as frequent_crime_types
             from (
                      select cr.district,
                             cr.crime_type,
                             row_number() over (partition by cr.district order by cr.crimes_cnt desc) as rn
                      from crimes_by_district_by_crime_type cr) pre
             where pre.rn <= 3
             group by pre.district
         )
    select cda.district, cda.crimes_total, cdm.crimes_monthly, cs.frequent_crime_types, cda.lat, cda.lng
    from crimes_district_analytics cda
             join crimes_district_median cdm on cda.district = cdm.district
             join crimes_sort cs on cda.district = cs.district;

Where:

  • crimes_total — the total crimes number in the specified district.

  • crimes_monthly — the median crimes number per month in the specified district.

  • frequent_crime_types — the three most frequent crime types in the entire history of observations in the specified district, separated by a comma with a single space (), arranged in the descending order of frequency.

  • crime_type — the first part of the NAME from the offense_codes.csv table, split by the - separator (for example, if NAME is BURGLARY - COMMERICAL - ATTEMPT, then crime_type is BURGLARY).

  • lat — the latitude of the district coordinates, average for all incidents.

  • lng — the longitude of the district coordinates, average for all incidents.

After the processing, we get the following results:

No Stats for default@crime, Columns: district, incident_number, lat, long
No Stats for default@crime, Columns: month
No Stats for default@crime, Columns: offense_code
No Stats for default@offense_codes, Columns: code, name
Query ID = user1_20211112125439_22d654c1-caeb-47f8-adc9-8e2028602653
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1636705252031_0039)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
Reducer 2 ...... container     SUCCEEDED      3          3        0        0       0       0
Map 4 .......... container     SUCCEEDED      1          1        0        0       0       0
Reducer 5 ...... container     SUCCEEDED      3          3        0        0       0       0
Map 10 ......... container     SUCCEEDED      1          1        0        0       0       0
Map 6 .......... container     SUCCEEDED      1          1        0        0       0       0
Reducer 7 ...... container     SUCCEEDED      3          3        0        0       0       0
Reducer 8 ...... container     SUCCEEDED      2          2        0        0       0       0
Reducer 9 ...... container     SUCCEEDED      3          3        0        0       0       0
Reducer 3 ...... container     SUCCEEDED      3          3        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 10/10  [==========================>>] 100%  ELAPSED TIME: 9,21 s
----------------------------------------------------------------------------------------------
OK
cda.district	cda.crimes_total	cdm.crimes_monthly	cs.frequent_crime_types	cda.lat	cda.lng
 ETC"	1522	369.5	["\"WEAPON","FIREARM/WEAPON"]	NULL	41.9364
 ETC. "	1514	382.0	["\"DRUGS"]	NULL	41.8329
A7	13151	1062.5	["SICK/INJURED/MEDICAL","INVESTIGATE PERSON","VANDALISM"]	41.9898	-70.9834
B2	48669	3876.0	["VERBAL DISPUTE","SICK/INJURED/MEDICAL","INVESTIGATE PERSON"]	41.9943	-70.9907
C11	41460	3212.5	["SICK/INJURED/MEDICAL","INVESTIGATE PERSON","VERBAL DISPUTE"]	41.9925	-70.9877
D4	41020	3221.5	["PROPERTY","INVESTIGATE PERSON","SICK/INJURED/MEDICAL"]	41.9977	-70.9963
E5	12966	1022.5	["SICK/INJURED/MEDICAL","INVESTIGATE PERSON","PROPERTY"]	41.9178	-70.8662
 BRUSH	525	525.0	["\"FIRE REPORT"]	NULL	NULL
 BUILDING	1269	1269.0	["\"FIRE REPORT"]	NULL	NULL
A15	6378	491.5	["INVESTIGATE PERSON","VANDALISM","SICK/INJURED/MEDICAL"]	41.8010	-70.6760
	1693	107.0	["M/V ACCIDENT","M/V","INVESTIGATE PROPERTY"]	24.8485	-43.0789
 ETC."	2597	680.0	["\"DRUGS"]	NULL	41.8254
 FEET	1	1.0	["\"A&B HANDS"]	NULL	NULL
 NON-NEGLIGIENT MANSLAUGHTER"	161	40.0	["\"MURDER","MURDER NON"]	NULL	41.7190
A1	34834	2698.5	["PROPERTY","SICK/INJURED/MEDICAL","WARRANT ARREST"]	41.9741	-70.9578
B3	34618	2696.0	["VERBAL DISPUTE","INVESTIGATE PERSON","MISSING PERSON"]	41.9961	-70.9936
C6	22903	1826.5	["SICK/INJURED/MEDICAL","INVESTIGATE PERSON","TOWED MOTOR VEHICLE"]	41.8786	-70.8023
D14	19785	1581.0	["TOWED MOTOR VEHICLE","SICK/INJURED/MEDICAL","INVESTIGATE PERSON"]	41.9931	-70.9888
E13	17010	1323.0	["SICK/INJURED/MEDICAL","INVESTIGATE PERSON","TOWED MOTOR VEHICLE"]	41.9947	-70.9913
E18	16997	1289.5	["SICK/INJURED/MEDICAL","INVESTIGATE PERSON","VERBAL DISPUTE"]	42.0000	-71.0000
Time taken: 12.691 seconds, Fetched: 20 row(s)
Found a mistake? Seleсt text and press Ctrl+Enter to report it