Spark SQL: DataFrame unable to convert string to float (ValueError)


#1

I'm getting an error when I run the Spark SQL query.

ValueError: could not convert string to float. I've attached the commands I'm running.

If I try to convert the fields to float (order_items.subtotal and products.product_price) during DataFrame creation, I get the above error when I run the actual query (as shown in the attachment).

If I remove the float conversions in the DataFrame, I get the results.

Q1. Can't we do float conversion in a DataFrame?
Q2. Does any automatic type casting happen in Hive/Spark? Even with a String data type for revenue, I am able to aggregate.


#2

Yes, we can do float conversion in a DataFrame; you can try the syntax below (Scala):

val df2 = df.withColumn("newColName", df("originalColName").cast(typeCast)).drop("originalColName").withColumnRenamed("newColName", "originalColName")
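In PySpark (which the original poster is using below), a minimal sketch of the same pattern, assuming a generic DataFrame df with a string column named originalColName (both names are placeholders):

from pyspark.sql.functions import col

# Sketch: overwrite the string column with a float-typed copy. Passing the
# existing column name to withColumn replaces it in place, so the
# drop/rename steps from the Scala snippet are not needed.
df2 = df.withColumn("originalColName", col("originalColName").cast("float"))
df2.printSchema()  # originalColName should now show as float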

There is no automatic type casting in the Hive database; we only assign data types while creating a table. In Spark, however, we can use option("inferSchema", "true"), which instructs Spark to automatically infer the data type of each column when reading the file.
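For example, a minimal sketch of inferSchema in PySpark, assuming a Spark 2.x session and a hypothetical CSV file with a header row:

# Hypothetical path and Spark 2.x session; with inferSchema enabled,
# numeric columns come back as int/double instead of string.
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("/data/retail_db/order_items.csv")
df.printSchema()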


#3

I'm using PySpark; this is how I'm creating the DataFrame.

How do I convert order_item_subtotal to float here so that the query does not fail when I run it?

Order_items_DF = Order_itemsRDD.map(lambda p: Row(
    order_item_id=int(p.split(",")[0]),
    order_item_order_id=int(p.split(",")[1]),
    order_item_product_id=int(p.split(",")[2]),
    order_item_quantity=p.split(",")[3],
    order_item_subtotal=float(p.split(",")[4]),
    order_item_product_price=p.split(",")[5],
)).toDF()
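One possible approach (a sketch, assuming the failure comes from rows whose subtotal field Python's float() cannot parse): leave the field as a string in the Row and cast the column after toDF(); a Spark SQL cast turns unparseable values into null instead of raising ValueError.

from pyspark.sql.functions import col

# Sketch: keep order_item_subtotal as a string inside the map(), then cast
# the column to float on the DataFrame; rows that cannot be parsed become
# null rather than failing the whole job with ValueError.
Order_items_DF = Order_items_DF.withColumn(
    "order_item_subtotal", col("order_item_subtotal").cast("float")
)
Order_items_DF.printSchema()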


#4

OrdersDF = ordersRDD.map(lambda p: Row(
    order_id=int(p.split(",")[0]),
    order_date=p.split(",")[1],
    order_customer_id=int(p.split(",")[2]),
    order_status=p.split(",")[3]
)).toDF()

OrdersDF.printSchema()

sqlContext.sql("use kirtad")

OrdersDF.registerTempTable("orders_DF")
sqlContext.sql("select * from orders_DF").show()

Order_items_DF = Order_itemsRDD.map(lambda p: Row(
    order_item_id=int(p.split(",")[0]),
    order_item_order_id=int(p.split(",")[1]),
    order_item_product_id=int(p.split(",")[2]),
    order_item_quantity=p.split(",")[3],
    order_item_subtotal=float(p.split(",")[4]),
    order_item_product_price=p.split(",")[5],
)).toDF()

Order_items_DF.printSchema()

sqlContext.sql("use kirtad")

Order_items_DF.registerTempTable("order_items_DF")
sqlContext.sql("select * from order_items_DF").show()

Products_DF = productsRDD.map(lambda p: Row(
    product_id=int(p.split(",")[0]),
    product_category_id=int(p.split(",")[1]),
    product_name=p.split(",")[2],
    product_description=p.split(",")[3],
    product_price=float(p.split(",")[4]),
    product_image=p.split(",")[5],
)).toDF()

Products_DF.printSchema()

sqlContext.sql("use kirtad")

Products_DF.registerTempTable("products_DF")
sqlContext.sql("select * from products_DF").show()

-- Spark SQL

sqlContext.sql("select Ord.order_date , Ord.order_item_product_id , p.product_name , sum(Ord.order_item_subtotal ) as Revenue
from
(
select o.order_date, oi.order_item_product_id , oi.order_item_subtotal from
orders_df o inner join order_items_df oi
ON
o.order_id = oi.order_item_order_id
and
o.order_status in ( ‘COMPLETE’ , ‘CLOSED’ )
) Ord
inner join products_df p ON
Ord.order_item_product_id = p.product_id
group by Ord.order_date, Ord.order_item_product_id , p.product_name
distribute by Ord.order_date
sort by Ord.order_date ASC , Revenue desc " ).show()
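For reference, a sketch of the same aggregation with an explicit cast inside the SQL itself, assuming order_item_subtotal was registered as a string column (the distribute by / sort by clauses are omitted for brevity):

# Sketch: explicit cast(... as float) in SQL instead of converting
# in the RDD map(); unparseable values become null.
sqlContext.sql("""
select o.order_date, oi.order_item_product_id, p.product_name,
       sum(cast(oi.order_item_subtotal as float)) as Revenue
from orders_df o
inner join order_items_df oi
  on o.order_id = oi.order_item_order_id
 and o.order_status in ('COMPLETE', 'CLOSED')
inner join products_df p
  on oi.order_item_product_id = p.product_id
group by o.order_date, oi.order_item_product_id, p.product_name
""").show()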


#5

My question is: when I do not use any float conversion, the aggregation works fine.

But if I don't type cast, the type of order_item_subtotal is a string. How is Spark SQL then able to perform SUM on the string and give correct results?
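For illustration, a tiny self-contained sketch of the behavior described above; the SUM succeeds because Spark SQL implicitly coerces the string values to a numeric type before aggregating (column name and values are made up):

# Made-up two-row example: 'amount' is a string column, yet sum() works
# because Spark SQL implicitly casts it to double.
demo = sqlContext.createDataFrame([("10.5",), ("2.25",)], ["amount"])
demo.registerTempTable("demo")
sqlContext.sql("select sum(amount) as total from demo").show()  # total = 12.75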