CCA175 - Is storing string output in HDFS in Unicode via PySpark wrong?

Hi,

I recently took the CCA175 certification. While storing the output in HDFS through Spark, I did store it in text file format as expected, but the string variables were stored in Unicode format. Is that acceptable, or do we need to convert the Unicode to a normal string type?

Regards,
Venkat

Is there any delimiter or symbol you missed? Sometimes they ask for tab-delimited or space-delimited output, or output with some specific symbol.

Nope, they asked to store it back as a text file. One thing I observed is that PySpark stores text files with u' (Unicode) as a prefix for character fields, while Scala doesn't. The original input file didn't have any Unicode prefix, and I guess that's what the issue is. It was a simple aggregation problem, and I verified the answer with MySQL too, which matched the output.
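Here is a quick sketch of what I mean (assuming Python 2, which is what PySpark ran on at the time):

pair = (u'KA', 5)
print(str(pair))       # (u'KA', 5)  this repr is what saveAsTextFile writes for a tuple
print(str(pair[0]))    # KA          converting the field itself drops the u'' prefix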

Regards,
Venkat

The delimiter was the default ',', which I had in my output.

@s_venkatragavan, yes, PySpark works with a Unicode-like datatype by default; we need to convert it explicitly. Scala doesn't have this issue, since datatypes must be defined there. One more thing: in aggregations there shouldn't be a comma delimiter. Just try to remember your question. The result dataset shouldn't be stored as a tuple, just as a record.
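For instance (a rough sketch; the path is hypothetical):

lines = sc.textFile("/data/customers")    # hypothetical input path
print(type(lines.first()))                # <type 'unicode'> on Python 2
print(type(str(lines.first())))           # <type 'str'> after explicit conversion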

Ravi,

You are right, the output was saved as a tuple enclosed in (). Do you know how to convert a tuple to a normal record?

Regards,
Venkat

Yes, you are right that Unicode is what I missed. Since I used PySpark, I never thought about it, as the itversity topics all did the same; maybe they should make a point of converting to string in PySpark so other candidates get more clarity.
Regarding storing the result after aggregation, I followed the below, for example:

OriginalRec, let's say, is a customer table in the below format:
1,venkat, bangalore, KA
2,ravi, bangalore, KA
I need to aggregate the customers from each state. So let's say I did the following:

cust = sc.textFile("path")
custmap = cust.map(lambda x: (x.split(",")[3], 1))                           # (state, 1) pairs
custmap.reduceByKey(lambda x, y: x + y).saveAsTextFile("/output/somepath")   # count per state and save

I hope the above stores each record as the state and its count, with a comma as the default delimiter, and not as a tuple. Correct me if I am wrong.

Regards,
Venkat

@s_venkatragavan
I think it will save as a tuple of (state-name, count). As you are mapping only the 4th field, i.e. state, and assigning 1 to it, after reduceByKey it will give only the distinct states and their counts.
The state name is a string, so I think it will come out like u'state-name'.
But I need to confirm whether saving it like this with saveAsTextFile is correct or wrong, or whether we need to convert first and then save.
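For example, with the two sample customer records above, the part file written by saveAsTextFile would contain something like:

(u'KA', 2)

i.e. the tuple's repr, with the brackets and the u'' prefix included.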


Try revisiting the videos; Durga sir did mention the 'unicode' behaviour in the PySpark videos. That's how I learned about it.

This result will be stored as records, not tuples, but remember that aggregation results shouldn't be stored with ',' (comma) as the delimiter. They should be stored space- or '\t'-delimited. I am pretty sure about this.

You are right, it got saved as a tuple: (u'State', Count). Any idea how to convert this to a normal record, i.e.
State, Count without the brackets ()?

Regards,
Venkat

Yes, that part I figured out: rddname.map(lambda rec: (str(rec[0]), rec[1])) will remove the Unicode prefix. But how to remove the () is the concern now, because in the exam they said they want the output like below:

StateName Count
KA 5
KL 10

So is there no way to convert a tuple into a normal record? I am not sure if storing the output as a tuple is right. ravi.tejarockon also mentioned the result set should be stored as records, not tuples.

Regards,
Venkat


I am not sure whether it is possible to store without the (). As it is a tuple, it will be saved in (). But you can save without the u'Statename' prefix by using the below code:
rddname.map(lambda rec: (str(rec[0]), rec[1]))
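For example, reusing the custmap RDD from the earlier post (sketch):

stateCounts = custmap.reduceByKey(lambda x, y: x + y)
print(stateCounts.first())                                          # (u'KA', 2)
print(stateCounts.map(lambda rec: (str(rec[0]), rec[1])).first())   # ('KA', 2), still a tuple but no u''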

Try removing the round brackets as below.

rddname.map(lambda rec: str(rec[0]), rec[1])

Yatish,

That would give an error; we need to enclose str(rec[0]), rec[1] in parentheses, otherwise rec[1] is evaluated outside the lambda as a second argument to map(), where rec is not defined.
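You can see it in a plain shell (sketch):

>>> rdd.map(lambda rec: str(rec[0]), rec[1])
NameError: name 'rec' is not defined

Python evaluates rec[1] before the lambda ever runs, and rec does not exist at that point.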

Regards,
Venkat

Hi,

Yes, you are right. I worked on it and got the solution.

It should be like:
rddname.map(lambda x: ','.join(str(d) for d in x))
This works fine.

example:
ordersByStatus rdd:
(u'COMPLETE', 22899)
(u'PAYMENT_REVIEW', 729)
(u'PROCESSING', 8275)
(u'CANCELED', 1428)
(u'PENDING', 7610)
(u'CLOSED', 7556)
(u'PENDING_PAYMENT', 15030)
(u'SUSPECTED_FRAUD', 1558)
(u'ON_HOLD', 3798)

orderstring = ordersByStatus.map(lambda x: ','.join(str(d) for d in x))

orderstring rdd:
COMPLETE,22899
PAYMENT_REVIEW,729
PROCESSING,8275
CANCELED,1428
PENDING,7610
CLOSED,7556
PENDING_PAYMENT,15030
SUSPECTED_FRAUD,1558
ON_HOLD,3798
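If they ask for space- or tab-delimited output instead (as mentioned above for aggregation results), the same pattern works with a different separator, for example:

orderstring = ordersByStatus.map(lambda x: '\t'.join(str(d) for d in x))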

Yes, we need to convert them to strings and save; otherwise the text file will be saved with the Unicode prefix.

ordersRDD = sc.textFile("path")
ordersMap = ordersRDD.map(lambda rec: (str(rec.split(",")[3]), str(rec.split(",")[0])))   # (status, order_id) as plain str
ordersSave = ordersMap.map(lambda rec: (rec[0] + " " + rec[1]))                           # concatenate into one space-delimited record
for i in ordersSave.take(5): print(i)
CLOSED 1
PENDING_PAYMENT 2
COMPLETE 3
CLOSED 4
COMPLETE 5
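Then save it (the output path here is hypothetical):

ordersSave.saveAsTextFile("/output/ordersByStatus")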

It will save as records. Happy coding! :)


Please refer to the above post for an approach that is easier than using join.

It is not about the brackets you type; it is about how you build the record. If you keep the fields separated by ',' inside the lambda, Python makes a tuple automatically. Concatenate the fields with some delimiter instead. It is as simple as that.
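For example (sketch), both lines below use a comma, but only the second produces a plain record in the text file:

rdd.map(lambda rec: (str(rec[0]), str(rec[1])))         # tuple, saved as ('KA', '5')
rdd.map(lambda rec: str(rec[0]) + ',' + str(rec[1]))    # string, saved as KA,5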