PySpark 2 pandas UDFs

Hi, please help me with pandas UDFs in PySpark.

I am getting the error "No module named pyarrow".

How can we install pyarrow on a cluster?
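
For context, pandas UDFs need pyarrow on the driver and on every executor, so it has to be installed cluster-wide, for example with pip install pyarrow or conda on each node, or through whatever package management the cluster provides. Below is a minimal sketch (the app name is made up) that reports where the module is missing:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyarrow-check").getOrCreate()

def pyarrow_version(_):
    # Try the import where this function runs; "missing" means pyarrow is not installed there.
    try:
        import pyarrow
        return pyarrow.__version__
    except ImportError:
        return "missing"

# Check the driver, then run the same import inside a few executor tasks.
print("driver:", pyarrow_version(None))
print("executors:", sorted(set(
    spark.sparkContext.parallelize(range(8), 8).map(pyarrow_version).collect())))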



Hi @Govind_Varma

Can you share the details of your code?

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("order_id int, order_date string, order_item_subtotal float", PandasUDFType.GROUPED_MAP)
def sum_total(df):
    # Compute the group's total and write it back onto every row of the group.
    _sum = df["order_item_subtotal"].sum()
    return df.assign(order_item_subtotal=_sum)

datafinal = data1.groupBy("order_id", "order_date").apply(sum_total)
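
For reference, here is a minimal self-contained sketch of the same grouped-map pattern on made-up sample data; the SparkSession, app name, and rows are assumptions, and the return type uses double so it matches pandas' default float64:

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("grouped-map-demo").getOrCreate()

# Made-up sample data standing in for data1.
data1 = spark.createDataFrame(
    [(1, "2013-07-25", 299.98),
     (1, "2013-07-25", 199.99),
     (2, "2013-07-26", 129.99)],
    ["order_id", "order_date", "order_item_subtotal"])

@pandas_udf("order_id int, order_date string, order_item_subtotal double",
            PandasUDFType.GROUPED_MAP)
def sum_total(pdf):
    # Replace each row's subtotal with the total for its (order_id, order_date) group.
    return pdf.assign(order_item_subtotal=pdf["order_item_subtotal"].sum())

data1.groupBy("order_id", "order_date").apply(sum_total).show()

This will only run if pyarrow is importable on the driver and all executors, which is exactly what the "No module named pyarrow" error indicates is not yet the case.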