Need help on Scala code snippet

#1

Hi Durga Sir and All,
I am executing the code below in Scala:
val customersRDD = sc.textFile("/user/cloudera/sqoop_import/customers")
val customerMap = customersRDD.map(x => x.split("\t").toString)
customerMap.take(5).foreach(println)

Output:
[Ljava.lang.String;@66a9b240
[Ljava.lang.String;@ef5bc68
[Ljava.lang.String;@27245784
[Ljava.lang.String;@42edaf2f
[Ljava.lang.String;@28858fd

But when I do the same in PySpark, I get the actual values.

Am I doing something wrong?

0 Likes

#2

The error is in this line:
val customerMap = customersRDD.map(x => x.split("\t").toString)

You will have to emit a key-value pair as the output of the map function.

In the same code, try running
customerMap.first()

It should display the first row of the RDD.

Try using
val customerMap = customersRDD.map(x=> (x.split("\t")(0),x))

0 Likes

#3

Hi All,

val customersRDD = sc.textFile("/user/cloudera/sqoop_import/customers")
val customerMap = customersRDD.map(x => x.split("\t").toString)

Next, I would like to apply a filter on, say, the first field. How can I proceed in Scala?

0 Likes

#4

You can try the following code:

val customerMap = customersRDD.map(rec => (rec.split("\t")(0), rec))

val filteredOrdersRDD = customerMap.filter(rec => rec._2.split("\t")(0) == "condition")
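
The pair-based approach above can be sketched without a cluster. This is only an illustration: plain Scala collections stand in for the RDD, and the object name, the sample lines and the key "2" are hypothetical, not taken from the customers data set. Since the key is already the first field, the filter can compare `_1` directly instead of splitting the record a second time.

```scala
// A minimal sketch, assuming plain Scala collections in place of an RDD;
// RDD.map and RDD.filter accept the same kind of function arguments.
object FilterKeyedRecords {
  // Hypothetical tab-delimited sample lines (not from the thread's data set).
  val lines: Seq[String] = Seq(
    "1\tRichard\tHernandez",
    "2\tMary\tBarrett",
    "3\tAnn\tSmith"
  )

  // Pair each line with its first field: the element type is (String, String).
  val keyed: Seq[(String, String)] = lines.map(rec => (rec.split("\t")(0), rec))

  // The key already holds the first field, so compare _1 directly.
  val matched: Seq[(String, String)] = keyed.filter(rec => rec._1 == "2")

  def main(args: Array[String]): Unit =
    matched.foreach(println)
}
```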

0 Likes

#5

I don't want to use val customerMap = customersRDD.map(rec => (rec.split("\t")(0), rec)).
I want to use only val customerMap = customersRDD.map(rec => rec.split("\t")) and then apply a filter…

0 Likes

#6

@Sumanth_Sharma I think map takes a value of type T and returns a value of type U. Here U can be T itself, a (key, value) pair, or something else. Isn't that right?

0 Likes

#7

@praveen Can you provide sample input data and the expected output, if possible? It would really help us answer you quickly.

0 Likes

#8

Sample data:
John,250,abcd,xyz
Ton,450,rst,pqrs
Jim,350,mno,abcd
Ton,700,abcd,pqrs

I want to keep only the records with the name Ton.

The first two steps are:
val customersRDD = sc.textFile("/user/cloudera/sqoop_import/customers")
val customerMap = customersRDD.map(x => x.split("\t").toString)

The third step is to apply a filter on customerMap to get the result.

0 Likes

#9

Sorry, the field delimiter used in the split should be ",".

0 Likes

#10

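@praveen Is there a reason for calling toString in the second line? In Scala, calling toString on an Array[String] does not return the contents. It returns the JVM's default representation: the array type name followed by @ and the object's hash code in hex, which is exactly what your output shows.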

Hope the following code works for you!

val customersRDD = sc.textFile("/user/cloudera/sqoop_import/customers")
val customerMap = customersRDD.map(x => x.split(","))
val customerTon = customerMap.filter(x => x(0).equals("Ton"))
customerTon.first
customerTon.count
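
The same steps can be tried on the sample data from this thread without a cluster. This is a minimal sketch, assuming plain Scala collections in place of the RDD (the object name is hypothetical); the functions passed to map and filter are the same ones an RDD would take. It also shows mkString as one way to print the fields of an array in readable form.

```scala
// A minimal sketch, assuming plain Scala collections stand in for the RDD.
object FilterByFirstField {
  // Sample records posted earlier in the thread.
  val lines: Seq[String] = Seq(
    "John,250,abcd,xyz",
    "Ton,450,rst,pqrs",
    "Jim,350,mno,abcd",
    "Ton,700,abcd,pqrs"
  )

  // Split each line into its fields; the element type is Array[String].
  val fields: Seq[Array[String]] = lines.map(_.split(","))

  // Keep only the records whose first field is "Ton".
  val ton: Seq[Array[String]] = fields.filter(rec => rec(0) == "Ton")

  def main(args: Array[String]): Unit = {
    // toString on an array prints the JVM default form, not the contents.
    println(fields.head.toString)
    // mkString joins the elements with a separator, so the values are readable.
    ton.foreach(rec => println(rec.mkString(",")))
  }
}
```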

0 Likes

#11

@venkata

Yes, now I get it. But I have one doubt.

val customerMap = customersRDD.map(x => x.split(",")) Does this statement produce a tuple or an array? According to your code it produces an array, but I thought it produced a tuple. So the map transformation does not always produce a tuple as output?

0 Likes

#12

@praveen Here the output of split() is also the output of map, which is Array[String].

It seems you have misunderstood the map function and what it can do. In general, map takes a value of type T and returns a value of type U.

In this particular case, map(x => x.split(",")), map takes a String as input and returns an Array[String].

map gives you a lot of flexibility: you can return almost any type you want. It could be a simple String or Int, a tuple, or some other object.
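
To make the point about T and U concrete, here is a small sketch that maps the same input lines to four different result types. It assumes plain Scala collections in place of the RDD, the object and value names are hypothetical, and treating the second field as an Int is an assumption about the sample data.

```scala
// A minimal sketch, assuming plain Scala collections stand in for the RDD.
object MapReturnTypes {
  // Two of the sample records posted earlier in the thread.
  val lines: Seq[String] = Seq("John,250,abcd,xyz", "Ton,450,rst,pqrs")

  // String => Array[String]: every field of the record.
  val asArrays: Seq[Array[String]] = lines.map(_.split(","))

  // String => String: only the first field.
  val names: Seq[String] = lines.map(_.split(",")(0))

  // String => Int: the second field, assumed here to be numeric.
  val numbers: Seq[Int] = lines.map(_.split(",")(1).toInt)

  // String => (String, Int): a tuple built from the first two fields.
  val pairs: Seq[(String, Int)] = lines.map { line =>
    val f = line.split(",")
    (f(0), f(1).toInt)
  }

  def main(args: Array[String]): Unit = {
    println(names)
    println(numbers)
    println(pairs)
  }
}
```

Only the function passed to map changes between the four cases; the result type U follows from whatever that function returns.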

Hope that clears things up :)

2 Likes

#13

Thank you very much venkat!!!

0 Likes

#14

Here is the original code snippet:

val customersRDD = sc.textFile("/user/cloudera/sqoop_import/customers")
val customerMap = customersRDD.map(x => x.split("\t").toString)
customerMap.take(5).foreach(println)

val customersRDD = sc.textFile("/user/cloudera/sqoop_import/customers")
This reads the data and creates an RDD.

val customerMap = customersRDD.map(x => x.split("\t").toString)
This is incorrect: x.split generates an array, and you are then converting that array to a string.

You have misunderstood how the map function works. It takes one record as input and generates another record as output. Here the output should be an array.

val customersRDD = sc.textFile("/Users/dgadiraju/Research/data/retail_db/products")
val customerMap = customersRDD.map(x => x.split(","))
customerMap.take(10).foreach(rec => println(rec(0) + ":" + rec(1)))

Python and Scala are two different languages, so you should not compare them the way you are doing.

1 Like

#15

You spotted it, @venkatreddy-amalla!

1 Like