PySpark: handling null values

pyspark
#1

Input data

3070811,1963,1096,,"US","CA",,1,
3022811,1963,1096,,"US","CA",,1,56
3033811,1963,1096,,"US","CA",,1,23

Expected Output

3070811,1963,1096,0,"US","CA",0,1,0
3022811,1963,1096,0,"US","CA",0,1,56
3033811,1963,1096,0,"US","CA",0,1,23

All the null (empty) fields should be replaced with zero.

Looking for a PySpark solution for this.
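The core of the task is per-field: split each line on commas (keeping empty trailing fields) and substitute "0" for every empty field. A minimal sketch of that per-line function in plain Python, assuming the raw lines contain the empty fields shown in the split output later in this thread; the same callable can then be passed to an RDD's `map`:

```python
def fill_empty_fields(line):
    """Split one CSV line on commas and replace every empty field with "0"."""
    # Python's str.split(",") keeps empty fields, including trailing ones,
    # so a line ending in "," yields a final '' element.
    fields = line.split(",")
    return ",".join(f if f != "" else "0" for f in fields)

lines = [
    '3070811,1963,1096,,"US","CA",,1,',
    '3022811,1963,1096,,"US","CA",,1,56',
    '3033811,1963,1096,,"US","CA",,1,23',
]
for line in lines:
    print(fill_empty_fields(line))
# prints:
# 3070811,1963,1096,0,"US","CA",0,1,0
# 3022811,1963,1096,0,"US","CA",0,1,56
# 3033811,1963,1096,0,"US","CA",0,1,23
```

In PySpark this would look like `sc.textFile(path).map(fill_empty_fields)`; the hypothetical `path` stands in for whatever HDFS location holds the file.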


#2

@balanightingale1994, could you please post the code you are using?


#3

a=sc.textFile("/user/cloudera/spark40/file1.txt")
a.collect()
[u'3070811,1963,1096,,"US","CA",,1,', u'3022811,1963,1096,,"US","CA",,1,56', u'3033811,1963,1096,,"US","CA",,1,23']
b=a.map(lambda x: x.split(",",-1))

b.collect()
[[u'3070811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u''], [u'3022811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u'56'], [u'3033811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u'23']]

c=b.map(lambda y: "0" if y==" " else y).collect()
c
[[u'3070811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u''], [u'3022811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u'56'], [u'3033811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u'23']]

-- final output script

c=b.map(lambda y: 0 if y=='' else y).collect()

c
[[u'3070811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u''], [u'3022811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u'56'], [u'3033811', u'1963', u'1096', u'', u'"US"', u'"CA"', u'', u'1', u'23']]
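The map above never changes anything because `b` is an RDD of lists: the lambda's `y` is an entire row (a list), so `y == ''` is always False and every row passes through unchanged. The comparison has to happen per field inside each row. A sketch of the corrected transformation, shown on plain Python lists so it runs without a SparkContext; the same callable works as `b.map(fix_row)` in PySpark:

```python
# Rows as produced by b.collect() above (forum quote characters normalized)
rows = [
    ['3070811', '1963', '1096', '', '"US"', '"CA"', '', '1', ''],
    ['3022811', '1963', '1096', '', '"US"', '"CA"', '', '1', '56'],
    ['3033811', '1963', '1096', '', '"US"', '"CA"', '', '1', '23'],
]

# The fix: test each field for emptiness, not the whole row.
# In PySpark this would be: c = b.map(fix_row).collect()
fix_row = lambda row: ['0' if field == '' else field for field in row]

for row in rows:
    print(fix_row(row))
# first row prints: ['3070811', '1963', '1096', '0', '"US"', '"CA"', '0', '1', '0']
```

To write the result back out as comma-separated lines, join each fixed row before saving, e.g. `b.map(fix_row).map(lambda r: ",".join(r))` followed by `saveAsTextFile`.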
