YouTube video data analysis

Dear Itversity,

I took the entire exercise from this link: https://acadgild.com/blog/spark-use-case-youtube-data-analysis/

The problem statement is to find the top five categories with the maximum number of videos uploaded.

Data set sample

QuRYeRnAuXM EvilSquirrelPictures 1135 Pets & Animals 252 1075 4.96 46 86 gFa1YMEJFag nRcovJn9xHg 3TYqkBJ9YRk rSJ8QZWBegU 0TZqX5MbXMA UEvVksP91kg ZTopArY7Nbg 0RViGi2Rne8 HT_QlOJbDpg YZev1imoxX8 8qQrrfUTmh0 zQ83d_D2MGs u6_DQQjLsAw 73Wz9CQFDtE

Data set columns (all fields are tab-delimited)

Column 1: Video id of 11 characters.

Column 2: Uploader of the video.

Column 3: Interval between day of establishment of YouTube and the date of uploading of the video.

Column 4: Category of the video.

Column 5: Length of the video.

Column 6: Number of views for the video.

Column 7: Rating on the video.

Column 8: Number of ratings given for the video.

Column 9: Number of comments on the videos.

Column 10: Related video ids with the uploaded video.
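
The column layout above can be sketched as a small record type with a parser. This is only an illustration: the names `YoutubeVideo` and `YoutubeParser` are hypothetical, the numeric types are assumptions based on the sample row, and the related video ids are assumed to occupy every tab-separated field from column 10 onward. Lines that are too short or have non-numeric values yield `None` instead of an exception.

```scala
import scala.util.Try

// Hypothetical record type mirroring the ten columns listed above.
// Field types are assumptions inferred from the sample row.
final case class YoutubeVideo(
  videoId: String,
  uploader: String,
  uploadInterval: Int,
  category: String,
  length: Int,
  views: Long,
  rating: Double,
  ratingCount: Long,
  commentCount: Long,
  relatedIds: Seq[String]
)

object YoutubeParser {
  // Parse one tab-delimited line; return None for short or malformed lines.
  def parse(line: String): Option[YoutubeVideo] = {
    val f = line.split("\t")
    if (f.length < 9) None
    else Try(
      YoutubeVideo(
        f(0), f(1), f(2).toInt, f(3), f(4).toInt,
        f(5).toLong, f(6).toDouble, f(7).toLong, f(8).toLong,
        f.drop(9).toSeq // everything after column 9 is a related video id
      )
    ).toOption
  }
}
```

In the Spark shell, something like `youtubeRDD.flatMap(line => YoutubeParser.parse(line))` would then keep only the well-formed records.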

Below is my code:

val youtubeRDD = sc.textFile("/user/anuvenkatesheee/youtubedata.txt")

val categoryMap = youtubeRDD.map(rec => (rec.split("\\t")(3),1))

val topCategoriesVideo = categoryMap.reduceByKey((total,value) => total + value).sortByKey(false)

I am getting the following error:

scala> categoryMap .collect().foreach(println)
17/04/12 07:50:42 ERROR TaskSetManager: Task 0 in stage 4.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 30, wn02.itversity.com): java.lang.ArrayIndexOutOfBoundsException: 3
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:33)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:29)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)

Alternate code

val youtubeRDD = sc.textFile("/user/anuvenkatesheee/youtubedata.txt")

val category = youtubeRDD.map(line=>
{var YoutubeRecord = ""
val temp=line.split("\t")
if(temp.length >= 3)
{YoutubeRecord=temp(3)}
YoutubeRecord})

val categoryMap=category.map ( x => (x,1) )

val topvideoUpload=categoryMap.reduceByKey(_+_).map(item => item.swap).sortByKey(false).take(5)

Could you please explain why my first code does not work?
How does the logic below help in the alternate code? Please explain why it is mandatory.
val category = youtubeRDD.map(line=>
{var YoutubeRecord = ""
val temp=line.split("\t")
if(temp.length >= 3)
{YoutubeRecord=temp(3)}
YoutubeRecord})

Regards
venkat

@venkateshm It's purely down to the quality of the given data. In this case, the input file has a bad record, i.e. Video ID (column 1) = TMark2489, so you need to handle that. The alternate code takes care of it.

Hope this makes sense.
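
To expand on that a little: `split` on a line with fewer than four tab-separated fields returns a short array, so indexing position 3 throws `ArrayIndexOutOfBoundsException: 3`. Because `map` is lazy, the failure only shows up when an action such as `collect` runs. Below is a minimal sketch in plain Scala collections so it can be tried without a cluster. The sample lines and the names `categoryOf` and `topCategories` are made up for illustration. Note that the guard here is `length > 3`: the `>= 3` check in the alternate code would still fail on a line with exactly three fields. Separately, the first code's `sortByKey(false)` orders by category name, since the key is still the category. The `swap` in the alternate code is what makes the sort use the count.

```scala
object TopCategories {
  // Return the category (4th tab-separated field) only when the line has one.
  def categoryOf(line: String): Option[String] = {
    val fields = line.split("\t")
    if (fields.length > 3) Some(fields(3)) else None
  }

  // Count videos per category and keep the n categories with the most videos.
  def topCategories(lines: Seq[String], n: Int): Seq[(String, Int)] =
    lines.flatMap(l => categoryOf(l))
      .groupBy(identity)
      .map { case (cat, xs) => (cat, xs.size) }
      .toSeq
      .sortBy { case (cat, count) => (-count, cat) } // highest count first
      .take(n)
}

// Hypothetical sample lines; the second one mimics the bad record.
val lines = Seq(
  "id1\tuserA\t1135\tPets & Animals\t252",
  "TMark2489",
  "id2\tuserB\t900\tMusic\t180",
  "id3\tuserC\t901\tMusic\t200"
)

// "TMark2489".split("\t") has length 1, so (3) on it would throw.
println(TopCategories.categoryOf("TMark2489"))   // None
println(TopCategories.topCategories(lines, 5))

// The same guard in the Spark shell, assuming youtubeRDD from the question:
// val counts = youtubeRDD.flatMap(line => TopCategories.categoryOf(line))
//   .map(c => (c, 1)).reduceByKey(_ + _)
// counts.map(_.swap).sortByKey(false).take(5)
```

Using `Option` with `flatMap` drops the bad lines entirely, whereas the alternate code keeps them under an empty-string category. Which behaviour is preferable depends on whether you want bad records counted.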


Thanks for the reply. If possible, could I get more information?