Pyspark - Unable to cast column in a dataframe: AttributeError: 'str' object has no attribute 'alias'

pyspark
cca-175

#1

This is about the second problem scenario from Arun's Blog, but an error appears when I use a DataFrame to accomplish the task:

sqoop import \
  --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
  --password=cloudera \
  --username=retail_dba \
  --table products \
  --target-dir /user/cloudera/products \
  --as-textfile \
  --fields-terminated-by '|' \
  -m 1

productRdd=sc.textFile("/user/cloudera/problem2/products/part-m-00000")
productRdd.take(5)

prodFilt=productRdd.filter(lambda x: x.split("|")[4] != "")
prodFilt.take(5)

prodFiltCento=prodFilt.filter(lambda x: float(x.split("|")[4]) < 100)
prodFiltCento.take(5)

productSplit=prodFiltCento.map(lambda x: x.split("|"))
productSplit.take(10)

productRDD=productSplit.map(lambda x: (int(x[0]), int(x[1]), float(x[4])))

from pyspark.sql import Row

productDF=productRDD.map(lambda x: Row(table_id=int(x[0]), prod_cat_id=int(x[1]), prod_price=float(x[2]))).toDF()

productDF.printSchema()

root
 |-- prod_cat_id: long (nullable = true)
 |-- prod_price: double (nullable = true)
 |-- table_id: long (nullable = true)
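
As an aside, I think the schema lists the columns alphabetically (rather than in the order of my Row(...) call) because a Row built from keyword arguments sorts its fields by name before toDF() infers the schema, e.g.:

from pyspark.sql import Row

# My understanding: keyword-argument Rows sort their fields alphabetically,
# which is why the inferred schema order differs from the Row(...) call above.
r = Row(table_id=1, prod_cat_id=2, prod_price=9.99)
print(r)  # Row(prod_cat_id=2, prod_price=9.99, table_id=1)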

prodDfResultOK=productDF.groupBy('prod_cat_id') \
  .agg(max('prod_price').alias('Max_Price'), count('prod_cat_id'), avg('prod_price').alias('Avg_Price'), min('prod_price').alias('Min_Price')) \
  .orderBy('prod_cat_id')

This results in the error below:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
AttributeError: 'str' object has no attribute 'alias'

Where am I wrong?
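
For what it's worth, I never import anything from pyspark.sql.functions above, so my guess is that max here is Python's built-in max, which applied to the string 'prod_price' just returns its largest character, a plain str with no .alias method. A minimal sketch of what I think the aggregation should look like, assuming the aggregate functions are taken from pyspark.sql.functions (imported as F so they don't shadow the built-ins):

from pyspark.sql import functions as F

# Guess at a fix: use the Spark SQL aggregate functions instead of the Python
# built-ins, so each aggregate returns a Column that supports .alias().
prodDfResultOK = productDF.groupBy('prod_cat_id') \
    .agg(F.max('prod_price').alias('Max_Price'),
         F.count('prod_cat_id'),
         F.avg('prod_price').alias('Avg_Price'),
         F.min('prod_price').alias('Min_Price')) \
    .orderBy('prod_cat_id')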