Problems with joins (Test Exam)


Dear all,

I’m currently try something out with joins and I have the following problem:

I would like to find out all customers which have not placed an order.

For that I use all orders for 1 customer and 2 different customers (so one has placed no order). I need the given and the family name.

Data are the data provided for the course.

Please find here the example:

orders = sc.textFile(“data/orders”)
customers = sc.textFile(“data/customers”)

ordersFilter = orders.filter(lambda o: o.split(",")[2] == “3097”)
customersFilter = customers.filter(lambda o: o.split(",")[0] in (“3097”, “3099”))


12254,2013-10-08 00:00:00.0,3097,PROCESSING

27662,2014-01-12 00:00:00.0,3097,CLOSED

57513,2014-07-23 00:00:00.0,3097,COMPLETE


3097,Douglas,Hanna,XXXXXXXXX,XXXXXXXXX,1112 Rustic Range,Caguas,PR,00725

3099,Brittany,Copeland,XXXXXXXXX,XXXXXXXXX,5735 Round Beacon Terrace,Caguas,PR,00725

What I need now is a data set like

3097, 1, Hanna, Douglas

3097, 1, Hanna, Douglas

3097, 1, Hanna, Douglas

3099, None, Copeland, Brittany

I tried that way:

ordersMap = o: ((o.split(",") [2]), 1))

customersMap = c: ((c.split(",")[0]), (c.split(",")[2]), (c.split(",")[1])))

customersJoin = ordersMap.fullOuterJoin(customersMap)
customersJoin2 = customersMap.fullOuterJoin(ordersMap)

and I get the following data:

(u’3099’, (None, u’Copeland’))

(u’3097’, (1, u’Hanna’))

(u’3097’, (1, u’Hanna’))

(u’3097’, (1, u’Hanna’))


(u’3099’, (u’Copeland’, None))

(u’3097’, (u’Hanna’, 1))

(u’3097’, (u’Hanna’, 1))

(u’3097’, (u’Hanna’, 1))

First question: How can I include the family name as well in the result file.

Second questions: how can I extract the “None” values for the result set?

I tried it with the following code, but it didn’t worked out.

customersNeverOrdered = customersJoin.filter(lambda cj: cj.split(",")[1].split(",")[0]==None)