How to convert a tuple into a record?

Hi,

Suppose I have a tuple in PySpark after doing some mapping, and I need to store it as a text file in HDFS. Currently, when I do saveAsTextFile, it gets saved in HDFS as:

(xxxx, 25)
(yyyy,55)
(gggg,10)

I don't want the enclosing parentheses (). I need the output as:

xxxx,25
yyyy,55
gggg,10

Please let me know how to achieve this.

Regards.
Venkat

@s_venkatragavan

Let's assume the following is your input RDD of tuples, dataRDDTuple:
(u'xxxx', 25)
(u'yyyy', 55)
(u'gggg', 10)

Below I convert the second field from int to str (the first field is already a string) and join the fields with a comma.

dataText = dataRDDTuple.map(lambda rec: rec[0] + "," + str(rec[1]))
for i in dataText.take(10): print(i)
xxxx,25
yyyy,55
gggg,10
dataText.saveAsTextFile("/user/gnanaprakasam/pyspark/Textdata")

hadoop fs -cat /user/gnanaprakasam/pyspark/Textdata/part*


Thanks @gnanaprakasam. What if the dataRDDTuple RDD has many columns/elements instead of just two? Is there a different way?

If you use Spark with Scala, you can create a case class and override its toString() method to do this. In the same way, in Python you can create a class with a custom method to do it.
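For example, here is a minimal Python sketch of that idea. The OrderRecord class name and fields are hypothetical, just for illustration; the point is that __str__ centralizes the formatting, so it works for a tuple of any length:

```python
# Hypothetical record class: __str__ produces the comma-delimited line,
# so the formatting logic lives in one place regardless of field count.
class OrderRecord:
    def __init__(self, *fields):
        self.fields = fields

    def __str__(self):
        # Convert every field to str and join with commas
        return ",".join(str(f) for f in self.fields)

rec = OrderRecord("xxxx", 25, "extra")
print(rec)  # xxxx,25,extra
```

In an RDD pipeline you would then map each tuple through the class before saving, e.g. something like dataRDDTuple.map(lambda t: str(OrderRecord(*t))).saveAsTextFile(path).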


Thanks @gnanaprakasam, it did work. But in this case both values are stored as strings in HDFS, while the expectation was for the count (the second column) to be numeric. Visually it looks fine, but if they verify the output by column type, will it fail or will we be good?

Regards,
Venkat

@Murali_Rachakonda,

ordersRDD = sc.textFile("path")
ordersMap2 = ordersRDD.map(lambda rec: rec.split(",")).map(lambda f: f[3] + " " + f[1] + " $" + f[0])
for i in ordersMap2.take(5): print(i)
CLOSED 2013-07-25 00:00:00.0 $1
PENDING_PAYMENT 2013-07-25 00:00:00.0 $2
COMPLETE 2013-07-25 00:00:00.0 $3
CLOSED 2013-07-25 00:00:00.0 $4
COMPLETE 2013-07-25 00:00:00.0 $5

This should answer your question.


@s_venkatragavan,

True, both are stored as strings, but in a good, readable format. Remember, the end user doesn't care about the data type you used; the data just needs to be understandable, readable, and reusable in BI. So what @gnanaprakasam did is right. Even Cloudera is fine with this and will accept this result.

After aggregations, once you have saved the output in the above format, it's fine with Cloudera.

I got your point: all they expect is that the output file is saved with the delimiter they mention and looks like the sample data shown in the question.


Thanks Ravi. This small detail cost me a couple of questions, but it's good to know at least now.


Everything you store in HDFS as a text file is stored as a string. It is the responsibility of the user who reads that data back from HDFS to cast it to an integer before performing numeric operations on it.
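As a small sketch of that consumer-side cast (assuming the two-column name,count layout from earlier in this thread):

```python
# Each line read back from the text file is a plain string like "xxxx,25";
# the consumer splits it and casts the count column back to int.
def parse_line(line):
    name, count = line.split(",")
    return (name, int(count))

parsed = parse_line("xxxx,25")
print(parsed)           # ('xxxx', 25)
print(type(parsed[1]))  # <class 'int'>
```

With an RDD this would look like sc.textFile(path).map(parse_line), after which the second element is numeric again.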


Hi @s_venkatragavan

If you want to iterate over many elements in a tuple, you can use this code in Scala:

val t = (1, 2, 3, 4, 5, 6) // Scala tuple
val builder = new StringBuilder // Scala StringBuilder
val result = t.productIterator.addString(builder, ",")
println(result) // output: 1,2,3,4,5,6

Sorry the code is in Scala; I guess we can find a similar thing in Python.
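The Python equivalent is indeed a one-liner: str.join over the tuple's elements, converting each to a string first.

```python
# Python analogue of productIterator.addString: join every element
# of an arbitrary-length tuple with a comma.
t = (1, 2, 3, 4, 5, 6)
result = ",".join(str(x) for x in t)
print(result)  # 1,2,3,4,5,6
```

The same expression works inside a map, e.g. rdd.map(lambda t: ",".join(str(x) for x in t)), for tuples of any width.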
