Need help with split() - getting an error

Input file:

1 matthew@test.com EN US
2 matthew@test2.com EN GB
3 matthew@test3.com FR FR

Fields are delimited by "\t" (Tab).

scala> val User = sc.textFile("file:///home/cloudera/ScalaTest/UserInfo.txt")
User: org.apache.spark.rdd.RDD[String] = file:///home/cloudera/ScalaTest/UserInfo.txt MapPartitionsRDD[11] at textFile at <console>:27

scala> val SUser = User.map(rec => rec.split("\t")(rec(0)))
SUser: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at map at <console>:29

Question 1:

scala> SUser.foreach(print)
The statement above throws an ArrayIndexOutOfBoundsException. What am I doing wrong?

Question 2:

scala> val SUser = User.map(rec => rec.split("\t")(rec(0), rec(2)))
The statement above does not compile. How can I select multiple fields in the map transformation?

Thanks in advance. Appreciate your help.

Q1:
In your version, rec(0) is a Char (e.g. '1'), which Scala widens to its numeric code point (49) when used as an index, so you were indexing the split array far out of bounds. Pass a literal index instead:
scala> val User = sc.textFile("file:///home/cloudera/test.txt")
User: org.apache.spark.rdd.RDD[String] = file:///home/cloudera/test.txt MapPartitionsRDD[5] at textFile at <console>:27

scala> val SUser = User.map(rec => rec.split("\t")(0))
SUser: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at map at <console>:29

scala> SUser.collect().foreach(println)
1
2
3

Q2:
To select multiple fields, return a tuple of the columns you want:
scala> val SUser = User.map(rec => (rec.split("\t")(0),rec.split("\t")(1)))
SUser: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[8] at map at <console>:29

scala> SUser.collect().foreach(println)
(1,matthew@test.com)
(2,matthew@test2.com)
(3,matthew@test3.com)
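One small refinement: the answer above calls rec.split("\t") twice per record. You can split once, bind the array, and reuse it. A plain-Scala sketch (no Spark needed; the sample lines below simply mirror the input file):

```scala
// Sample lines mirroring the tab-delimited input file.
val lines = Seq(
  "1\tmatthew@test.com\tEN\tUS",
  "2\tmatthew@test2.com\tEN\tGB",
  "3\tmatthew@test3.com\tFR\tFR"
)

// Split each record once, then index the resulting array.
val pairs = lines.map { rec =>
  val z = rec.split("\t")   // one split per record
  (z(0), z(1))              // id and email
}

pairs.foreach(println)
```

The same `map { rec => val z = rec.split("\t"); (z(0), z(1)) }` body works unchanged on the Spark RDD.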


Hi,
When you want to split file data into columns, it is best to use a case class. Here you go:

// The third and fourth columns in the sample data are strings (EN/US),
// so the case class fields are typed accordingly.
case class Test(col_1: Int, col_2: String, col_3: String, col_4: String) {
  override def toString = s"$col_1,$col_2,$col_3,$col_4"
}

val user = sc.textFile("file:///home/cloudera/test.txt")

val data = user.map { n =>
  val z = n.split("\t")
  Test(z(0).toInt, z(1), z(2), z(3))  // convert the id column to Int
}

data.collect().foreach(println)
(A case class gives each column a name, and the compiler-generated "unapply" method lets you destructure records in pattern matches; this is also called extraction.)

The above code may be a bit complicated for beginners, but it is good practice because the columns are named. If you ever need to add or change columns, this code is easy to extend.
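To make the "unapply"/extraction idea concrete, here is a minimal plain-Scala sketch (no Spark; the case class name and sample line are illustrative only):

```scala
// Case classes get a compiler-generated unapply (extractor) for free.
case class UserInfo(id: Int, email: String, lang: String, country: String)

// Parse one tab-delimited line into a typed record.
def parse(line: String): UserInfo = {
  val z = line.split("\t")
  UserInfo(z(0).toInt, z(1), z(2), z(3))
}

val u = parse("1\tmatthew@test.com\tEN\tUS")

// Pattern matching calls unapply to destructure the record by field.
val UserInfo(id, email, _, _) = u
println(s"$id,$email")
```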


Spark,

Thank you for your quick response. I am a beginner, and it is definitely difficult for me to understand. I copied your response; hopefully one day I will understand it.

Sri,

Thank you for your quick response. I did test your code and it is working as expected.