How to join three dataframes?


#1

I have created three datasets as below;

Person:
Person_id,Person_name,Person_age,Person_gender,Person_City
1,John,35,M,Toronto
2,Smith,25,M,Chicago
3,Linda,32,F,Dallas
4,Matilda,37,F, New York
5,Steve,45,M,New York

Order:
Order_id,Product_id,Person_id,Units_Bought
1,1,3,10
2,1,1,5
3,3,2,7
4,2,4,7
5,2,5,12

Product:
Product_id,Product_name,Product_Category,Product_Brand,Product_Price
1,KitKat,Chocolate,Nestle,2
2,Diet Coke,Carbonated Drink,Coca-Cola,1.20
3,Lays Chips,Snacks,PepsiCo,2.1

I want to list products bought by persons with age group 34 to 48 who do not live in city “Chicago” and who live in “New York” and only those persons who never bought any product of category “Snacks”.

Also i need to see the products, number of transactions, total number of units and the total amount in $.


I created 3 dataframes

val productsDF = spark.read.format(“csv”).option(“header”, “true”).load(“datasets/product.csv”)
val personsDF = spark.read.format(“csv”).option(“header”, “true”).load(“datasets/person.csv”)
val ordersDF = spark.read.format(“csv”).option(“header”, “true”).load(“datasets/order.csv”)

Is there a way i can get my insights by applying logic to the above dataframes?


#3

I was able to join the 3 dataframes using the primary keys

val joined = personsDF.join(ordersDF, “Person_id”)

val fulltable = joined.join(productsDF, “Product_id”)

fulltable.createOrReplaceTempView(“customer”)

Now how do i derive
I want to list products bought by persons with age group 34 to 48 who do not live in city “Chicago” and who live in “New York” and only those persons who never bought any product of category “Snacks”.

Also i need to see the products, number of transactions, total number of units and the total amount in $.

Using sqlContext.sql?


#5

Please do not flag or tag to get the help in coding. If anyone comes forward voluntarily you will get help.