Does anyone know the difference between GROUP BY and DISTRIBUTE BY in Hive?
Distribute by --> it guarantees that all records with the same value in the distribute column go to the same reducer.
E.g., let's say 5 records are processed in Hive with 2 reducers.
-- set the number of reducers
set mapreduce.job.reduces=2;
select * from sample distribute by id;
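So the records could be split like this, for example:
R1: x1, x3, x1
R2: x2, x4
Both x1 records go to R1, because DISTRIBUTE BY guarantees that rows with the same id land on the same reducer. It does not sort the rows within a reducer, though; add SORT BY if you need ordering.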
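To get ordering as well, a query along these lines should work. This is only a sketch that reuses the `sample` table and `id` column from the example above; which ids end up on which reducer depends on how Hive hashes them.

```sql
-- Assumes the `sample` table with an `id` column from the example above.
set mapreduce.job.reduces=2;

-- Rows with the same id go to the same reducer (DISTRIBUTE BY),
-- and each reducer sorts its own rows by id (SORT BY).
-- The output is sorted per reducer, not globally.
select * from sample
distribute by id
sort by id;

-- CLUSTER BY is shorthand for DISTRIBUTE BY and SORT BY on the same column.
select * from sample
cluster by id;
```

Note that every input row still comes back; DISTRIBUTE BY only decides where rows are processed, it does not collapse them.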
Group by --> it aggregates the result per distinct value of the specified column.
GROUP BY is a standard SQL clause that groups rows by column value, so that an aggregate function is computed once per group.
select deptname,count(id) from sample group by deptname;
Output: it,2 (the "it" department has 2 ids)
GROUP BY works with aggregate functions. In the SELECT list you combine aggregate expressions with plain column values, and every plain column must also be listed in the GROUP BY clause, since it acts as the key (in general, one key can have multiple values).
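The rule about plain columns can be sketched like this, again assuming the `sample` table has `deptname` and `id` columns as in the query above (the aliases `id_count` and `row_count` are just illustrative names):

```sql
-- Valid: deptname is the grouping key, count(id) is computed once per key.
select deptname, count(id) as id_count
from sample
group by deptname;

-- Invalid: id is neither aggregated nor listed in GROUP BY,
-- so Hive rejects the query at compile time.
-- select deptname, id, count(id) from sample group by deptname;

-- Valid: both plain columns are part of the grouping key.
select deptname, id, count(*) as row_count
from sample
group by deptname, id;
```

Unlike DISTRIBUTE BY, which returns every input row, each of these queries returns one row per distinct key.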
You can check this link for more clarification:
But I suggest you take your own table and try experimenting.