I am trying to apply a filter transformation to the JSON below to obtain reviewerID, asin, and overall.

"reviewerID": "A3AF8FFZAZYNE5",
"asin": "0000000078",
"helpful": [1, 1],
"reviewText": "Conversations with God Book 1 is the single most extraordinary book I have ever read!!!It totally changed my life.",
"overall": 5.0,
"summary": "Impactful!",
"unixReviewTime": 1092182400,
"reviewTime": "08 11, 2004"
When I apply a filter transformation to it, I get an empty RDD. I can do this with DataFrames, but I want to use an RDD for it. Any help would be appreciated.

@Avinash Can you post the code you used?


I used the code below. I am not sure what delimiter to use here.

val raw_reviews = reviews.filter(x => x.startsWith("reviewerID:") || x.startsWith("asin:") || x.startsWith("overall:"))

@Avinash Just a quick question: how did you create the RDD?

By reading from HDFS, like sc.textFile("path to reviews file"),
or by creating the RDD directly from an object [reviews object]?

I have used sc.textFile("example.json")

@Avinash Try the one below:

val r = sc.textFile("/user/cloudera/example.json")

example.json file data
val f = r.filter(line => line.contains("rahul"))

I have more than 1,000,000 records. Also, when I use line.contains("rahul"), won't I still get the whole JSON object in which "rahul" matched, rather than only the filtered fields?
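To illustrate the point in this question: filter only keeps or drops whole elements, so a separate map step is needed to pull individual fields out of each record. Below is a minimal sketch in plain Scala, assuming each review is stored as one JSON object per line (the thread does not confirm this). The ReviewFields object and its regular expressions are hypothetical helpers, and a real JSON parser would be more robust than regexes.

```scala
object ReviewFields {
  // Hypothetical patterns for the three fields named in the question.
  private val ReviewerId = "\"reviewerID\"\\s*:\\s*\"([^\"]*)\"".r
  private val Asin       = "\"asin\"\\s*:\\s*\"([^\"]*)\"".r
  private val Overall    = "\"overall\"\\s*:\\s*([0-9]+(?:\\.[0-9]+)?)".r

  // Returns (reviewerID, asin, overall) only when all three fields are found on the line.
  def extract(line: String): Option[(String, String, Double)] =
    for {
      r <- ReviewerId.findFirstMatchIn(line)
      a <- Asin.findFirstMatchIn(line)
      o <- Overall.findFirstMatchIn(line)
    } yield (r.group(1), a.group(1), o.group(1).toDouble)

  def main(args: Array[String]): Unit = {
    val line = "{\"reviewerID\": \"A3AF8FFZAZYNE5\", \"asin\": \"0000000078\", \"overall\": 5.0}"
    println(extract(line))
  }
}
```

With an RDD of lines, the same function could probably be applied as reviews.flatMap(line => ReviewFields.extract(line)), which drops lines where any of the three fields is missing.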

I think when you apply the filter you get back only the records that pass it, not the entire JSON. Have you tried this in your code?

The filter on line was just a sample to show the idea.
If you have many records, you can also write a function in Scala and pass it to the RDD's filter. Each record is passed to that function, and only the records for which it returns true are kept.
Here is the example of passing functions.
Go to the link above and search the page for: Passing Functions to Spark.

This is one way of doing it.
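As a rough sketch of the function-passing idea described above: the predicate lives in a singleton object and is handed to filter by name. The object name ReviewFilters and the method hasWantedField are hypothetical, and the demo uses a plain Scala Seq so it runs without a cluster.

```scala
object ReviewFilters {
  // Field names from the question; the helper itself is a hypothetical example.
  private val wanted = Seq("reviewerID", "asin", "overall")

  // True when the line mentions one of the wanted field names in double quotes.
  def hasWantedField(line: String): Boolean =
    wanted.exists(name => line.contains("\"" + name + "\""))

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "\"asin\": \"0000000078\",",
      "\"summary\": \"Impactful!\","
    )
    // The method is passed by name, the same way it would be passed to an RDD filter.
    lines.filter(hasWantedField).foreach(println)
  }
}
```

On an RDD created with sc.textFile, the call would look like r.filter(ReviewFilters.hasWantedField). Note that this still returns whole lines, not individual values.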

@Avinash, check your JSON format: does it have a line that starts exactly with "reviewerID", or is there a space (or other character) before it? If there is, use .contains instead of .startsWith. Again, this will return the whole line that contains "reviewerID", because sc.textFile() gives you the file line by line, and you can process each line however you want.
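A small sketch of why the original startsWith check returns nothing, assuming the lines look like the sample in the question with some leading indentation (the two-space indent here is an assumption). In the sample, each key is wrapped in double quotes and the colon comes after the closing quote, so the text reviewerID: never appears at the start of a line. The helper startsWithKey is hypothetical.

```scala
object KeyMatch {
  // True when the line, after trimming surrounding whitespace, starts with the quoted key.
  def startsWithKey(line: String, key: String): Boolean =
    line.trim.startsWith("\"" + key + "\"")

  def main(args: Array[String]): Unit = {
    val line = "  \"reviewerID\": \"A3AF8FFZAZYNE5\","
    println(line.startsWith("reviewerID:"))    // false: leading spaces and the opening quote
    println(line.contains("reviewerID"))       // true
    println(startsWithKey(line, "reviewerID")) // true
  }
}
```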