Joining disparate datasets in Python

Hello Durga sir,
I was going through your video on joining datasets using PySpark.
I have 2 different virtual machines, and the one with Spark does not have Sqoop on it.
So to create the order and order_item tables, I created the tables in Hive, copied the data from your GitHub account, and loaded my Hive tables with LOAD DATA INPATH.
Now the problem is that when I split the records to create key-value pairs, I get an error.
For order_item I do not get an error, as none of its fields are of string type, but for the order table I get an error, as it has 2 fields of string type.

Below are my table structure, sample data, RDD, and error.
Please help, as I am stuck on this.

1)table structure
describe orders;
order_id int
order_date string
order_cust_id int
order_status string
Time taken: 0.409 seconds, Fetched: 4 row(s)

2)sample data
68861,2014-06-13 00:00:00.0,3031,PENDING_PAYMENT
68862,2014-06-15 00:00:00.0,7326,PROCESSING
68863,2014-06-16 00:00:00.0,3361,CLOSED
68864,2014-06-18 00:00:00.0,9634,ON_HOLD

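3)RDD
# orderRDD is the RDD read from the orders file (the name is assumed; it was lost from the post)
ordersplitRDD = orderRDD.map(lambda rec: (int(rec.split(",")[0]), rec))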

4)error:(it takes the data value as string)
yield next(iterator)
File "", line 1, in
ValueError: invalid literal for int() with base 10: ':00:00.0'

Please let me know if any further information is required.

In point 4) I mean it is trying to cast the DATE value to int, whereas I am only applying int() to the first field [0].

@itversity hello sir, please look into this when you get time.
Thank you.

Ideally, it should not throw an error, since only the first field is being cast. @Avinash_Parida, did you follow a video for this example? @itversity, any idea why this issue occurs?

Hello Pramod,
Yes, this example is from the video tutorial.
The only difference is that I did not import the table from MySQL, as I do not have Sqoop on my Spark machine.
So I created a table in Hive and loaded the data from a text file using LOAD DATA INPATH … (the records are the same as in the video; I copied them from the GitHub account).
Thank you.

Please check your delimiters. If the table uses Hive's default field delimiter, which is \u0001, and not "," (comma), then splitting on a comma in your PySpark command may be causing the issue.

Check DESCRIBE FORMATTED in Hive.
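To illustrate the delimiter point above, here is a minimal sketch in plain Python (no Spark needed). The helper name guess_delimiter and the \u0001 sample line are made up for illustration; only the comma-separated line comes from the sample data in the question.

```python
# One line from the question's sample data, plus a made-up copy of it that
# uses Hive's default field delimiter \u0001 (Ctrl-A) instead of commas.
comma_line = "68861,2014-06-13 00:00:00.0,3031,PENDING_PAYMENT"
ctrl_a_line = "68861\u00012014-06-13 00:00:00.0\u00013031\u0001PENDING_PAYMENT"


def guess_delimiter(line, candidates=(",", "\u0001", "\t")):
    """Hypothetical helper: return the candidate that occurs most often in the line."""
    return max(candidates, key=line.count)


print(repr(guess_delimiter(comma_line)))
print(repr(guess_delimiter(ctrl_a_line)))

# Splitting on the wrong delimiter leaves the whole record in field 0,
# so int() on that field would raise ValueError.
whole_record_in_field_0 = ctrl_a_line.split(",")[0] == ctrl_a_line
print(whole_record_in_field_0)
```

If the guess is \u0001, the split in the PySpark command would need to use that character rather than a comma.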

Oh yes, correct. I haven't checked that. I will check this too and let you know.
Thanks @N_Chakote.

I checked the Hive table, and it is using "," as the field separator and "\n" as the line terminator.
So I am not sure what the issue is here.

Can you post all your code, starting from the order RDD? Can you try removing the int type cast and check? Also check your source file; maybe the date column in one or more records is not separated properly in the source file.
Where are you reading the file from, i.e. the Hive directory or user space in HDFS?
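One way to follow the suggestion above is to leave out the cast and instead list the records whose first field does not parse as an integer. This is only a sketch in plain Python: has_int_key is a hypothetical helper, and the malformed line is invented for illustration.

```python
def has_int_key(rec, sep=","):
    """Hypothetical helper: True if the first field of the record parses as an int."""
    try:
        int(rec.split(sep)[0])
        return True
    except ValueError:
        return False


lines = [
    "68861,2014-06-13 00:00:00.0,3031,PENDING_PAYMENT",
    "not_an_id,2014-06-15 00:00:00.0,7326,PROCESSING",  # invented malformed record
    "68863,2014-06-16 00:00:00.0,3361,CLOSED",
]

# Keep the position of each offending record so it can be found in the source file.
bad = [(i, rec) for i, rec in enumerate(lines) if not has_int_key(rec)]
print(bad)
```

In PySpark the same function could presumably be passed to a filter, e.g. orderRDD.filter(lambda rec: not has_int_key(rec)).take(10), where orderRDD is an assumed name for the RDD read from the file.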

Hello @N_Chakote: I finally found the reason for this.
As I mentioned earlier, I loaded a text file with records copied from the GitHub account into the Hive table.
When I created the text file in vi, I actually had two copies of the same file open. When that happens, Linux gives a warning that another copy is open, and it wrote the date-time stamp and the status into the file.
So the text file I loaded into the Hive table had, as its first line of data, a time stamp and the status ON_HOLD.
By coincidence, the data I copied from GitHub (table: order) has the same format, i.e. a date-time stamp and an order status. So I overlooked the first line, thinking it was part of a record, whereas it was actually the warning that Linux had written when I opened the duplicate vi session (it was written as 2016-12-30 00:00:00, ON_HOLD).
That is why it was giving an error. I removed the first line from the text file, reran the code, and it works fine.
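For anyone who wants the key-value step to survive a stray line like this, a defensive version of the split could look like the sketch below. It is plain Python, to_pair is a hypothetical helper, and the assumption of exactly 4 fields comes from the describe output in the question.

```python
def to_pair(rec, sep=",", expected_fields=4):
    """Hypothetical helper: return (order_id, rec), or None for a malformed record."""
    fields = rec.split(sep)
    if len(fields) != expected_fields:
        return None  # wrong number of fields, e.g. a stray non-record line
    try:
        return (int(fields[0]), rec)
    except ValueError:
        return None  # first field is not an integer order_id


lines = [
    "2016-12-30 00:00:00, ON_HOLD",  # the stray first line described above
    "68861,2014-06-13 00:00:00.0,3031,PENDING_PAYMENT",
    "68862,2014-06-15 00:00:00.0,7326,PROCESSING",
]

# Build the pairs and drop anything that could not be parsed.
pairs = [p for p in map(to_pair, lines) if p is not None]
print(pairs)
```

Silently dropping records can hide real data problems, so it is probably worth counting or logging the skipped lines as well.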

It is really intriguing that the stray line and the actual data had the same format.

Thank you for your patience with this long post.

Oh good, you found the error. A good learning experience for me too, for the future.