Looping over a Spark DataFrame using Python

#1

I want to loop through a Spark DataFrame, check whether a condition on the aggregated value of multiple rows is true or false, and then create a new DataFrame. Please see the code outline below. Can you help me fix the code? I'm pretty new to Spark and Python and struggling my way through it, so any help is greatly appreciated.

df = (spark.read.format("csv")
      .option("header", "true")
      .option("sep", ",")
      .load("trade_sample.csv"))

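# sort trades by Instrument and Date (in ascending order)
# note: show() returns None, so keep the sorted DataFrame and display it separately
dfsorted = df.orderBy("Instrument", "Date")
dfsorted.show()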

# new temp variable to keep track of the quantity sum
sumofquantity = 0
for each row in dfsorted:
    sumofquantity = sumofquantity + row['Quantity']
    # keep appending the rows looped thus far into this new dataframe called dftemp
    dftemp = dfsorted (how to write this?)
    if sumofquantity == 0:
        # once sumofquantity becomes zero, for all the rows in dftemp, add a new column with a unique sequential number
        # and append the rows into the final dataframe
        dffinal = dftemp.withColumn('trade#', assign a unique trade number)
        # reset sumofquantity back to 0
        sumofquantity = 0
        # clear dftemp
        dftemp clear (how to clear the dataframe so I can start with zero rows for the next iteration?)
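To pin down the logic the outline above is aiming for, here is a minimal sketch in plain Python (not Spark) that keeps a running sum of Quantity over the sorted rows and moves on to the next trade number each time the sum returns to zero. The function name `assign_trade_numbers`, the `first_trade_number` parameter, and the `SAMPLE` constant are hypothetical names chosen for illustration; the starting value 10001 is taken from the expected result in this post. It assumes the Date strings sort correctly as text, which holds for the sample rows.

```python
import csv
import io

# Sample rows copied from trade_sample.csv as shown in this post.
SAMPLE = """\
Customer ID,Instrument,Action,Date,Price,Quantity
U16,ADM6,BUY,20160516;214611,0.7337,2
U16,ADM6,SELL,20160516;214729,0.7337,-1
U16,ADM6,SELL,20160516;233333,0.9439,-1
U16,CLM6,BUY,20160516;214811,48.09,1
U16,CLM6,SELL,20160517;042418,48.08,-1
U16,ZSM6,BUY,20160517;214811,48.09,1
U16,ZSM6,SELL,20160518;042418,48.08,-1
"""


def assign_trade_numbers(rows, first_trade_number=10001):
    """Tag each row with a trade number (hypothetical helper).

    Rows are sorted by Instrument and Date. A trade is considered
    closed when the running sum of Quantity returns to zero, and the
    next row then starts a new trade number.
    """
    ordered = sorted(rows, key=lambda r: (r["Instrument"], r["Date"]))
    trade_number = first_trade_number
    running_quantity = 0
    result = []
    for row in ordered:
        running_quantity += int(row["Quantity"])
        # The closing row still belongs to the current trade.
        result.append({**row, "trade#": trade_number})
        if running_quantity == 0:
            trade_number += 1
    return result


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
numbered = assign_trade_numbers(rows)
for r in numbered:
    print(",".join(str(v) for v in r.values()))
```

There is no temporary DataFrame to clear in this version, because each row is tagged as it is visited. In Spark itself, a running total like this would more likely be expressed with a window function than with a row-by-row loop; the sketch only illustrates the intended grouping.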

trade_sample.csv (raw input file)

Customer ID,Instrument,Action,Date,Price,Quantity
U16,ADM6,BUY,20160516;214611,0.7337,2
U16,ADM6,SELL,20160516;214729,0.7337,-1
U16,ADM6,SELL,20160516;233333,0.9439,-1
U16,CLM6,BUY,20160516;214811,48.09,1
U16,CLM6,SELL,20160517;042418,48.08,-1
U16,ZSM6,BUY,20160517;214811,48.09,1
U16,ZSM6,SELL,20160518;042418,48.08,-1

Expected Result

Customer ID,Instrument,Action,Date,Price,Quantity,trade#
U16,ADM6,BUY,20160516;214611,0.7337,2,10001
U16,ADM6,SELL,20160516;214729,0.7337,-1,10001
U16,ADM6,SELL,20160516;233333,0.9439,-1,10001
U16,CLM6,BUY,20160516;214811,48.09,1,10002
U16,CLM6,SELL,20160517;042418,48.08,-1,10002
U16,ZSM6,BUY,20160517;214811,48.09,1,10003
U16,ZSM6,SELL,20160518;042418,48.08,-1,10003
