How to sum a specified column in an RDD


#1

Hello,

Here are two data files:

spark16/file1.txt
1,9,5
2,7,4
3,8,3

spark16/file2.txt
1,g,h
2,i,j
3,k,l

After the join, I have:
(1, ((9,5),(g,h)) )
(2, ((7,4),(i,j)) )
(3, ((8,3),(k,l)) )

I want the sum of the third column of file1: 5 + 4 + 3 = 12.

val file1 = sc.textFile("spark16/file1.txt").map(x => (x.split(",")(0).toInt, (x.split(",")(1), x.split(",")(2).toInt)))
val file2 = sc.textFile("spark16/file2.txt").map(x => (x.split(",")(0).toInt, (x.split(",")(1), x.split(",")(2))))

val joined = file1.join(file2)
val sorted = joined.sortByKey()

scala> sorted.first
res4: (Int, ((String, Int), (String, String))) = (1,((9,5),(g,h)))

scala> joined.reduce(_._2._1._2 + _._2._1._2)
<console>:34: error: type mismatch;
 found   : Int
 required: (Int, ((String, Int), (String, String)))
       joined.reduce(_._2._1._2 + _._2._1._2)

How can I get the sum over the `_._2._1._2` values?
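For context, here is a minimal sketch of what I think should work: `reduce` requires its function to return the same type as the RDD's elements, so the Int field has to be projected out with `map` first. The sketch below uses a plain Scala collection with the same shape as the joined RDD above, so it runs without a SparkContext; on the real RDD the same chain would be `joined.map(_._2._1._2).reduce(_ + _)`.

```scala
// Same element shape as the joined RDD: (key, ((col1, col2Int), (colA, colB)))
val joinedLocal = Seq(
  (1, (("9", 5), ("g", "h"))),
  (2, (("7", 4), ("i", "j"))),
  (3, (("8", 3), ("k", "l")))
)

// Project out the Int field first, then reduce over Ints only.
// reduce(_._2._1._2 + _._2._1._2) fails because reduce's function must
// return the full element type, not an Int.
val total = joinedLocal.map(_._2._1._2).reduce(_ + _)
println(total)  // 12
```

On an RDD, `joined.map(_._2._1._2).sum` would also work (via `DoubleRDDFunctions`), returning 12.0 as a Double.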

Thank you very much.