Outer Join not showing expected result


#1

Support Team

I am running the commands below, as shown in videos 74 & 75 (Outer Join), where I create RDDs and then do an outer join. When I try to filter for records that are not in the right table after a leftOuterJoin, it doesn't return any records. I am wondering if the data set has changed from what is demonstrated in the video. I also see an exception in the log and am not sure what the issue is; the exception details are provided below.

orders=sc.textFile("/public/retail_db/orders")
orderitems=sc.textFile("/public/retail_db/order_items")

ordersmap = orders.map(lambda o: (int(o.split(",")[0]), o.split(",")[1]))

orderitemsmap = orderitems.map(lambda oi: (int(oi.split(",")[0]), float(oi.split(",")[4])))

ordersLeftOuterJoin=ordersmap.leftOuterJoin(orderitemsmap)

ordersLeftOuterJoinFilter = ordersLeftOuterJoin.filter(lambda o: o[1][1] == None)

for i in ordersLeftOuterJoinFilter.take(10): print(i)

Error:
18/06/03 15:35:56 WARN RetryInvocationHandler: Exception while invoking ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode over null. Not retrying because try once and fail.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /spark-history/application_1525279861629_32259.inprogress (inode 28924882): File is not open for writing. Holder DFSClient_NONMAPREDUCE_-1663007147_21 does not have any open files.


#2

Hi @sumit8724,
Change the split indexes as shown below and re-run the code. In order_items, column 0 is order_item_id, so keying on it joins against the wrong values; the join key has to be column 1 (order_item_order_id) so that it matches order_id in orders. (The value for ordersMap is also taken from column 3, order_status, as in the video.) The LeaseExpiredException in your log comes from the Spark history server's event log file and is most likely unrelated to the join result.

ordersMap = orders.map(lambda o:(int(o.split(",")[0]), o.split(",")[3]))

orderItemsMap = orderItems.map(lambda oi:(int(oi.split(",")[1]), float(oi.split(",")[4])))
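
For completeness, here is a minimal end-to-end sketch of the corrected pipeline, assuming the same /public/retail_db paths and the standard retail_db schema (order_id is column 0 of orders; order_item_order_id is column 1 of order_items). Filtering with `is None` is the idiomatic Python check, though `== None` also works here:

orders = sc.textFile("/public/retail_db/orders")
orderItems = sc.textFile("/public/retail_db/order_items")

# (order_id, order_status)
ordersMap = orders.map(lambda o: (int(o.split(",")[0]), o.split(",")[3]))

# (order_item_order_id, order_item_subtotal) -- key on column 1, not column 0
orderItemsMap = orderItems.map(lambda oi: (int(oi.split(",")[1]), float(oi.split(",")[4])))

# Orders with no matching order item come back as (order_id, (order_status, None))
ordersLeftOuterJoin = ordersMap.leftOuterJoin(orderItemsMap)
ordersWithoutItems = ordersLeftOuterJoin.filter(lambda o: o[1][1] is None)

for i in ordersWithoutItems.take(10): print(i)

With the key on order_item_order_id, the join matches order IDs on both sides, and the filter should now return the orders that have no order items.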