Get top 10 rated movies for each year using Hive

spark-sql

#1

we have below schema :
Ratings.dat :-UserID:: MovieID::Rating::Timestamp
Tags.dat :-UserID:: MovieID::Tag::Timestamp
Movies.dat :-MovieID::Title::Genres

Using Hive how I will find :-the top 10 ranking movies for the year ?

Thanks in Advance


Practice hive on our state of the art Big Data Cluster - https://labs.itversity.com



#2

You need to do the following:

  • Join movies and ratings
  • Use analytic functions to get top 10 movier per year

If you can share the data along with create table commands we can get back to you with query.


#3

Please find below link for data set
http://files.grouplens.org/datasets/movielens/ml-10m.zip

CREATE DATABASE movielens;

CREATE TABLE moviedata (
MovieID INT,
Title STRING,
Genres STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' 
WITH SERDEPROPERTIES ("field.delim"="::")
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/osgdev/Downloads/movies.dat' INTO TABLE moviedata;

CREATE TABLE ratingdata (
Ratings INT,
UserID INT,
MovieID INT,
RatingTS INT )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' 
WITH SERDEPROPERTIES ("field.delim"="::")
STORED AS TEXTFILE;

#4

I am able to create the tables and load the data. Timestamp is represented as integer, do you have any idea about it? We need to typecast it to date to get top 10 movies by year.

You need to figure it out and also it is better for you to come up with the query up to an extent you can build it and we can improvise on top of it.


#5

Solution is provided using Hive on our state of the art Big Data cluster - http://labs.itversity.com
It works in Hive as well as Spark SQL

Learn Spark SQL along with other concepts via our Udemy courses. Here is the link for the coupons.


Problem Statement:

  • Get top 10 rated movies per year. As part of the original question, @anandbabu have not mentioned criteria properly, hence we have chosen average rating as the criteria. If some one throw the actual criteria we can rewrite the query.

Getting Started

  • Tables are created using tables provided as part of the previous response from @anandbabu

  • Some corrections are made - such as renaming column timestamp to RatingTS as Timestamp is keyword and cannot be used for naming column. Also order of columns is not correct

    CREATE DATABASE movielens;

    CREATE TABLE moviedata (
    MovieID INT,
    Title STRING,
    Genres STRING )
    ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe’
    WITH SERDEPROPERTIES (“field.delim”="::")
    STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH ‘/home/itversity/ml-10M100K/movies.dat’ INTO TABLE moviedata;

    CREATE TABLE ratingdata (
    UserID INT,
    MovieID INT,
    Ratings INT,
    RatingTS INT )
    ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe’
    WITH SERDEPROPERTIES (“field.delim”="::")
    STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH ‘/home/itversity/ml-10M100K/ratings.dat’ INTO TABLE ratingdata;

  • Sample data

    • moviedata: 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
    • ratingdata: 1 122 5 838985046
  • RatingTS is represented as integer (unix timestamp)

Design

  • Convert RatingTS to date and then extract year
select from_unixtime(838985046);
1996-08-02 07:24:06
select substring(from_unixtime(838985046), 1, 4);
1996
  • Join moviedata and ratingdata
  • Get average rating for each movie in each year
  • Get top 10 rated movies using analytics function

Solutin

  • Add hive-contrib.jar

add jar /usr/hdp/2.5.0.0-1245/hive/lib/hive-contrib.jar;

  • Run simple join and preview the results
select * 
from moviedata
join ratingdata 
  on moviedata.movieid = ratingdata.movieid
limit 10;
  • Get average rating per title per year and preview the results
select 
  substring(from_unixtime(ratingts), 1, 4) ratingyear, 
  title, avg(ratings) avg_rating
from moviedata join ratingdata 
  on moviedata.movieid = ratingdata.movieid
group by 
  substring(from_unixtime(ratingts), 1, 4), title
limit 10;
  • Use analytics function to assign rank for each of the movie
select 
  title, ratingyear, 
  rank() over (partition by ratingyear order by avg_rating desc) rnk
from (select 
  substring(from_unixtime(ratingts), 1, 4) ratingyear, 
  title, avg(ratings) avg_rating
from moviedata join ratingdata 
  on moviedata.movieid = ratingdata.movieid
group by 
  substring(from_unixtime(ratingts), 1, 4), title) q1
limit 10;
  • Get top 10 movies each year by nesting the query generated in previous step
select * from (
select 
  title, ratingyear, 
  rank() over (partition by ratingyear order by avg_rating desc) rnk
from (select 
  substring(from_unixtime(ratingts), 1, 4) ratingyear, 
  title, avg(ratings) avg_rating
from moviedata join ratingdata 
  on moviedata.movieid = ratingdata.movieid
group by 
  substring(from_unixtime(ratingts), 1, 4), title) q1) q2
where rnk <= 10
order by ratingyear, rnk;

#7

how I will find the distinct movies which are associated with the Tag “fairy tale”

we have data with fairy tale in two places :-1:

792::Hungarian Fairy Tale, A (Hol volt, hol nem volt) (1987)::Fantasy
1654::FairyTale: A True Story (1997)::Children|Drama|Fantasy