We can also ensure that data is sorted with in the bucket in the bucketed table.
- Number of files in the bucketed tables will be multiples of number of buckets.
- Using hash mod algorithm on top of bucket key, data will land into appropriate file.
- However data is not sorted with in bucket.
- We can sort the data with in the bucket by using SORTED BY while creating bucketed tables
- Let us get create table statement of orders_buck and recreate with SORTED BY clause.
Here is the example of creating bucketed table using sorting and then inserting data into it.
USE training_retail; CREATE TABLE orders_buck ( order_id INT, order_date STRING, order_customer_id INT, order_status STRING ) CLUSTERED BY (order_id) SORTED BY (order_id) INTO 8 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; INSERT INTO orders_buck SELECT * FROM orders;
- Use dfs -tail command to confirm data is sorted as expected.