Filtering RDD in pySpark

#1

I have two files

content.csv and remove.csv

I am trying to filter words which are in remove.csv from content.csv

Can anyone help me to go forward. Not able to figure out how to do it in pySpark

content = spark.sparkContext.textFile("/user/kuldeepc/spark2/content.txt")
remove = spark.sparkContext.textFile("/user/kuldeepc/spark2/remove.txt")
removeRDD = remove.flatMap(lambda x: x.split(",")).map(lambda x: x.strip())
contentRDD = content.flatMap(lambda x: x.split())
bremove = spark.sparkContext.broadcast(removeRDD.map(lambda x: [x]).collect())

filterRDD = contentRDD.filter(lambda x: bremove.values in x )

Learn Spark 1.6.x or Spark 2.x on our state of the art big data labs

  • Click here for access to state of the art 13 node Hadoop and Spark Cluster

0 Likes

#2

@Kuldeep_Chitrakar

Please paste the content in those two files.

0 Likes