pyspark-java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema

pyspark
apache-spark
spark-sql

#1

I’m running pyspark-sql code on Horton sandbox

18/08/11 17:02:22 INFO spark.SparkContext: Running Spark version 1.6.3

# code 
from pyspark.sql import *
from pyspark.sql.types import *
rdd1 = sc.textFile ("/user/maria_dev/spark_data/products.csv")
rdd2 = rdd1.map( lambda x : x.split("," ) )
df1 = sqlContext.createDataFrame(rdd2, ["id","cat_id","name","desc","price", "url"])
df1.printSchema()

root
 |-- id: string (nullable = true)
 |-- cat_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- desc: string (nullable = true)
 |-- price: string (nullable = true)
 |-- url: string (nullable = true)

df1.show() 
+---+------+--------------------+----+------+--------------------+
| id|cat_id|                name|desc| price|                 url|
+---+------+--------------------+----+------+--------------------+
|  1|     2|Quest Q64 10 FT. ...|    | 59.98|http://images.acm...|
|  2|     2|Under Armour Men'...|    |129.99|http://images.acm...|
|  3|     2|Under Armour Men'...|    | 89.99|http://images.acm...|
|  4|     2|Under Armour Men'...|    | 89.99|http://images.acm...|
|  5|     2|Riddell Youth Rev...|    |199.99|http://images.acm...|

# When I try to get counts I get the following error.
df1.count()

**Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 6 fields are required while 7 values are provided.**

# I get the same error for the following code as well
df1.registerTempTable("products_tab")
df_query = sqlContext.sql ("select id, name, desc from products_tab order by name, id ").show();

I see column desc is null, not sure if null column needs to be handled differently when creating data frame and using any method on it.

The same error occurs when running sql query. It seems sql error is due to “order by” clause, if I remove order by then query runs successfully.

Please let me know if you need more info and appreciate answer on how to handle this error.


#2

@harshadocp Try to use filter and run below command:

products = sc.textFile("/public/retail_db/products")
productsFiltered = products.filter(lambda p: p.split(",")[4] != “”)
productsMap = productsFiltered.map(lambda p: p.split(","))
df= sqlContext.createDataFrame(productsMap, [“id”, “cat_id”, “name”, “desc”, “price”, “url”])
df.count()


#3

Thanks a lot Annapurna, your solution works. The problematic record was filtered out.
Do you know a solution to load a record that has comma embeded in it ?

Thanks again
Harshad