Pyspark: Create dataframes in a loop and then run a join among all of them

pyspark
apache-spark
aws

#1

Hello everyone,

I have a situation and I would like to get the community's advice and perspective. I'm working with PySpark 2.0 and Python 3.6 in an AWS environment with Glue.

I need to fetch historical data spanning many years and then join the results of several queries. So I decided to create a DF for each query, so that I could easily iterate over the years and months I want to go back and build the DF's on the fly.

The problem comes up when I need to join the DF's created in the loop: I reuse the same DF variable name on each iteration, and if I try to build a DF name as a formatted string inside the loop, that name is just a string, not a reference to a DataFrame, so I cannot join them later.
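A tiny pure-Python sketch of this pitfall (no Spark needed): formatting a string only produces a `str`, not a variable bound to the DataFrame, so the list ends up holding names that cannot be joined. The dict here is just a stand-in for the result of `spark.sql(...)`.

```python
# Pure-Python illustration of the pitfall (no Spark required):
# formatting a string yields a str, not a reference to the DataFrame.
frames_by_name = []
frames_by_object = []

for month in [1, 2]:
    name = 'cohort_2013_{}'.format(month)  # just a string
    frames_by_name.append(name)

    frame = {'month': month}               # stand-in for a spark.sql(...) result
    frames_by_object.append(frame)         # keep the object itself

print(type(frames_by_name[0]))    # <class 'str'>  -- cannot be joined
print(type(frames_by_object[0]))  # <class 'dict'> -- the real object
```

Keeping the DataFrame objects themselves in the list (instead of their names) is what makes the later join possible.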

So far my code looks like:

query_text = 'SELECT * FROM TABLE WHERE MONTH = {}'
months = [1, 2]
frame_list = []
 
 
for item in months:
    df_name = 'cohort_2013_{}'.format(item)
    query = query_text.format(item)
    frame_list.append(df_name)  # I intend to keep the DF's name in a list so I can recall it later
    df = spark.sql(query)
    df = DynamicFrame.fromDF(df, glueContext, "df")
    applyformat = ApplyMapping.apply(frame = df, mappings =
        [("field1","string","field1","string"),
         ("field2","string","field2","string")],
        transformation_ctx = "applyformat")
 
for df in frame_list:
    # create a join query across all of the created DFs
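One common way to join a whole list of DataFrames is to store the objects themselves in `frame_list` and fold over them with `functools.reduce`. The sketch below demonstrates the pattern with a minimal stand-in class so it runs without a Spark cluster; with real PySpark DataFrames the reduce line is the same shape, e.g. `reduce(lambda a, b: a.join(b, on='id'), frame_list)`, where `'id'` is a placeholder for your actual join key column(s).

```python
from functools import reduce

# Minimal stand-in for a Spark DataFrame so the pattern runs without a
# cluster: join() here merges two row-dicts keyed on 'id'.
class FakeFrame:
    def __init__(self, rows):
        self.rows = rows  # {id: {column: value}}

    def join(self, other, on='id', how='inner'):
        merged = {k: {**self.rows[k], **other.rows[k]}
                  for k in self.rows if k in other.rows}
        return FakeFrame(merged)

# Build one "DataFrame" per month, as the loop in the question does,
# but store the objects themselves in the list -- not their names.
frame_list = []
for month in [1, 2]:
    frame_list.append(FakeFrame({1: {'m{}'.format(month): month * 10}}))

# Fold the list into a single joined frame: ((df1 join df2) join df3) ...
joined = reduce(lambda a, b: a.join(b, on='id', how='inner'), frame_list)
print(joined.rows)  # {1: {'m1': 10, 'm2': 20}}
```

Because `reduce` chains the joins pairwise, this works for any number of months without naming each intermediate DataFrame.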

If someone knows how I could achieve this, please let me know your ideas.

thanks so much
