Sorting Data with in groups using DISTRIBUTE BY and SORT BY

#1

Let us create database training_nyse if it is not already existing and then create table stocks_eod

CREATE DATABASE IF NOT EXISTS training_nyse;
USE training_nyse;

CREATE TABLE stocks_eod (
  stockticker STRING,
  tradedate INT,
  openprice FLOAT,
  highprice FLOAT,
  lowprice FLOAT,
  closeprice FLOAT,
  VOLUME BIGINT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/data/nyse_all/nyse_data'
INTO TABLE stocks_eod;

Let us try creating new table stocks_eod_orderby using stocks_eod data sorting by tradedate and then volume descending.

  • Create table stocks_eod_orderby
  • Set number of reducers to 8
  • Insert into stocks_eod_orderby from stocks_eod using ORDER BY tradedate, volume DESC
  • Even though number of reducers are set to 8, it will use only 1 reducer as we have ORDER BY clause in our query.
    Here are the commands to create table stocks_eod_orderby and insert data into the table.
CREATE TABLE stocks_eod_orderby (
  stockticker STRING,
  tradedate INT,
  openprice FLOAT,
  highprice FLOAT,
  lowprice FLOAT,
  closeprice FLOAT,
  VOLUME BIGINT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

SET mapreduce.job.reduces=8;

INSERT INTO stocks_eod_orderby
SELECT * FROM orders
ORDER BY tradedate, volume DESC;

Now let us create the table stocks_eod_sortby where data is grouped and sorted with in each date by volume in descending order.

  • Create table stocks_eod_sortby
  • Set number of reducers to 8
  • Insert into stocks_eod_sortrby from stocks_eod using DISTRIBUTE BY tradedate SORT BY tradedate, volume DESC
  • Now data will be inserted into stocks_eod_sortby using 8 reducers.
  • Data will be distributed/grouped by tradedate and with in each tradedate data will be sorted by volume in descending order.
  • Data need not be globally sorted on the tradedate.

Here are the commands to create table stocks_eod_sortby.

CREATE TABLE stocks_eod_sortby (
  stockticker STRING,
  tradedate INT,
  openprice FLOAT,
  highprice FLOAT,
  lowprice FLOAT,
  closeprice FLOAT,
  VOLUME BIGINT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Let us take care of inserting the data using DISTRIBUTE BY and SORT BY.

SET mapreduce.job.reduces=8;

INSERT INTO stocks_eod_orderby
SELECT * FROM orders
DISTRIBUTE BY tradedate
SORT BY tradedate, volume DESC;
  • Now we can look into the location of the table and we will see 8 files as 8 reducers are used.

Practice hive on state of the art Big Data cluster - https://labs.itversity.com
You can sign up for our courses on Udemy using $10 coupons - Udemy Coupons - Big Data Courses


0 Likes