Can't load text file format in Scala


#1

Hello,

I have the orders table imported into this folder in the default text file format:
/user/paslechoix/sqoop_import/retail_db/orders/

Here is the folder:
[paslechoix@gw01 ~]$ hdfs dfs -ls /user/paslechoix/sqoop_import/retail_db/orders/
Found 3 items
-rw-r--r-- 3 paslechoix hdfs 0 2018-01-15 13:23 /user/paslechoix/sqoop_import/retail_db/orders/_SUCCESS
-rw-r--r-- 3 paslechoix hdfs 1494591 2018-01-15 13:23 /user/paslechoix/sqoop_import/retail_db/orders/part-m-00000
-rw-r--r-- 3 paslechoix hdfs 1505353 2018-01-15 13:23 /user/paslechoix/sqoop_import/retail_db/orders/part-m-00001

and the file’s content:
[paslechoix@gw01 ~]$ hdfs dfs -tail /user/paslechoix/sqoop_import/retail_db/orders/part-m-00000

34418,2014-02-21 00:00:00.0,10326,PENDING_PAYMENT
34419,2014-02-21 00:00:00.0,11103,PENDING
34420,2014-02-21 00:00:00.0,5917,PROCESSING
34421,2014-02-21 00:00:00.0,7212,COMPLETE
34422,2014-02-21 00:00:00.0,1437,COMPLETE
34423,2014-02-21 00:00:00.0,1838,COMPLETE
34424,2014-02-21 00:00:00.0,8567,COMPLETE
34425,2014-02-21 00:00:00.0,4831,PENDING_PAYMENT
34426,2014-02-21 00:00:00.0,12024,CLOSED
34427,2014-02-21 00:00:00.0,8174,COMPLETE
34428,2014-02-21 00:00:00.0,6112,PENDING
34429,2014-02-21 00:00:00.0,431,PROCESSING
34430,2014-02-21 00:00:00.0,10215,PENDING_PAYMENT
34431,2014-02-21 00:00:00.0,4419,CLOSED
34432,2014-02-21 00:00:00.0,567,PROCESSING
34433,2014-02-21 00:00:00.0,8472,PENDING
34434,2014-02-21 00:00:00.0,8119,CLOSED
34435,2014-02-21 00:00:00.0,3192,PENDING
34436,2014-02-21 00:00:00.0,1169,CLOSED
34437,2014-02-22 00:00:00.0,8053,COMPLETE
34438,2014-02-22 00:00:00.0,8116,COMPLETE
34439,2014-02-22 00:00:00.0,7857,PROCESSING
34440,2014-02-22 00:00:00.0,5921,CLOSED
34441,2014-02-22 00:00:00.0,5778,CLOSED

I am loading them in Scala as below:
sqlContext.load("/user/paslechoix/sqoop_import/retail_db/orders").show

Here is the error I encountered:
Could not read footer: java.lang.RuntimeException: hdfs://nn01.itversity.com:8020/user/paslechoix/sqoop_import/retail_db/orders/part-m-00000 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [83, 69, 68, 10]

So I thought maybe I should specify a file instead of a folder, and I changed the code to:
sqlContext.load("/user/paslechoix/sqoop_import/retail_db/orders/part-m-00000").show
but I encountered the same error message.

What am I missing here?

Thank you.


#2

@paslechoix
When you use sqlContext.load, the default data source is Parquet unless you specify otherwise, which is why Spark tries to read a Parquet footer from your text files. To read a text file, use the following command:

sqlContext.read.text("/user/varunu28/retail_db/orders").show()

This shows all your data as a single string column named value.
Example output:

+--------------------+
|               value|
+--------------------+
|1,2013-07-25 00:0...|
|2,2013-07-25 00:0...|
|3,2013-07-25 00:0...|
|4,2013-07-25 00:0...|
|5,2013-07-25 00:0...|
|6,2013-07-25 00:0...|
|7,2013-07-25 00:0...|
|8,2013-07-25 00:0...|
|9,2013-07-25 00:0...|
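
As an aside on the original error, the byte values in the message can be decoded to see what Spark was comparing. The snippet below is only a small plain-Scala sketch (no Spark needed); the helper name decode is made up for illustration. It shows that the expected bytes spell the Parquet marker and the bytes actually found are just the end of the last text line.

```scala
// Hypothetical helper: turn the byte values quoted in the error into ASCII text.
def decode(bytes: Seq[Int]): String = bytes.map(_.toChar).mkString

// Bytes Spark expected at the tail of a Parquet file.
val expectedMagic = decode(Seq(80, 65, 82, 49))

// Bytes Spark actually found at the tail of part-m-00000.
val foundTail = decode(Seq(83, 69, 68, 10))

println(expectedMagic)   // the Parquet marker "PAR1"
println(foundTail.trim)  // "SED", the end of a status such as CLOSED plus a newline
```

So the file is ordinary comma-separated text and nothing is wrong with the import itself.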

To split the data into separate columns and run operations on it, you can build a DataFrame using a case class, or work with an RDD created with sc.textFile.
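
A rough sketch of the case class approach follows. The case class Order, its field names, and the helper parseOrder are assumptions made for illustration, based on the four comma-separated values visible in the sample rows; adjust them to your actual schema. The Spark part is shown as comments because it assumes a spark-shell session where sc and sqlContext already exist.

```scala
// Assumed schema: field names are illustrative, inferred from the sample rows.
case class Order(orderId: Int, orderDate: String, customerId: Int, status: String)

// Hypothetical helper: parse one comma-separated line into an Order.
// It throws an exception if the line has fewer than four fields
// or if the id fields are not numeric.
def parseOrder(line: String): Order = {
  val f = line.split(",")
  Order(f(0).toInt, f(1), f(2).toInt, f(3))
}

// In spark-shell (Spark 1.x, where sc and sqlContext are predefined),
// usage might look like this:
//   import sqlContext.implicits._
//   val orders = sc.textFile("/user/paslechoix/sqoop_import/retail_db/orders")
//     .map(parseOrder)
//     .toDF()
//   orders.show(5)
```

Keeping the date as a String avoids committing to a timestamp format here; it can be cast later if needed.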


#3

Thank you Varun. Now it is working as expected.

