#1

Input data

3070811,1963,1096,,"US","CA",,1,
3022811,1963,1096,,"US","CA",,1,56
3033811,1963,1096,,"US","CA",,1,23

Expected Output

3070811,1963,1096,0,"US","CA",0,1,0
3022811,1963,1096,0,"US","CA",0,1,56
3033811,1963,1096,0,"US","CA",0,1,23

All the null (empty) values should be replaced with zero.

Looking for a PySpark solution for this.

0 Likes

#2

@balanightingale1994 Could you please reply with the code you are using?

0 Likes

#3

a = sc.textFile("/user/cloudera/spark40/file1.txt")
a.collect()
[u'3070811,1963,1096,,"US","CA",,1,', u'3022811,1963,1096,,"US","CA",,1,56', u'3033811,1963,1096,,"US","CA",,1,23']

b = a.map(lambda x: x.split(",", -1))
b.collect()
[[u'3070811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u''], [u'3022811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u'56'], [u'3033811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u'23']]

# This does not work: y is the whole row (a list), so it never equals " "
c = b.map(lambda y: "0" if y == " " else y).collect()
c
[[u'3070811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u''], [u'3022811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u'56'], [u'3033811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u'23']]

-- final output script

# The comparison must be done per field, not against the whole row list
c = b.map(lambda y: [u'0' if field == u'' else field for field in y]).collect()

c
[[u'3070811', u'1963', u'1096', u'0', u'"US"', u'"CA"', u'0', u'1', u'0'], [u'3022811', u'1963', u'1096', u'0', u'"US"', u'"CA"', u'0', u'1', u'56'], [u'3033811', u'1963', u'1096', u'0', u'"US"', u'"CA"', u'0', u'1', u'23']]
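For completeness, the per-field replacement can be pulled into a small helper that also joins the fields back into CSV lines. This is a sketch; the input and output paths are assumptions for illustration:

```python
# Pure helper: split a CSV line (keeping trailing empty fields), replace
# empty fields with "0", and re-join into a CSV line. Testable without Spark.
def replace_empty_fields(line):
    fields = line.split(",", -1)  # maxsplit=-1 means no limit on splits
    return ",".join("0" if f == "" else f for f in fields)

# With an RDD (paths are illustrative assumptions):
# a = sc.textFile("/user/cloudera/spark40/file1.txt")
# a.map(replace_empty_fields).saveAsTextFile("/user/cloudera/spark40/output")
```

Keeping the transformation as a plain function makes it easy to unit-test the logic locally before running it on the cluster.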

0 Likes

#4

@balanightingale1994

I suppose your input data would be

3070811, 1963, 1096, US, CA, , 1,
3022811, 1963, 1096, US, CA, , 1, 56
3033811, 1963, 1096, US, CA, , 1, 23

Hence the output would have the empty fields replaced with 0.

For this I came up with a solution using the Spark DataFrame API. Here's the code for your problem.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Replace with 0") \
    .getOrCreate()

df = spark.read \
    .format("csv")