Parquet to CSV using PySpark 1.6.2

pyspark

#1

I read a Parquet file: parquet_df = sqlContext.read.parquet('/home/sanjay/submissions-parquet')
Now I want to write selected columns from parquet_df to a CSV file (Parquet2CSV) using PySpark 1.6.2.
Any help will be greatly appreciated.


#2

Hi Sanjay,
I came up with the solution below. There may be a more optimal way to handle the selective columns.

  1. pyspark --master yarn --conf spark.ui.port=12345 --num-executors 4 --packages
    com.databricks:spark-csv_2.10:1.5.0
  2. parquet_df = sqlContext.read.parquet('/home/sanjay/submissions-parquet')
  3. parquet_df.registerTempTable("parquet_table")
  4. col_parquet_df = sqlContext.sql("select col_1, col_2 from parquet_table")
  5. col_parquet_df.save("output dir", "com.databricks.spark.csv")
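
For reference, here is the same flow as a single sketch. It uses the DataFrameWriter API (df.write.format(...).save(...)), which in Spark 1.6 replaces the deprecated df.save call; "output_dir" and the column names are placeholders:

    # Read the Parquet file and expose it as a temporary table
    parquet_df = sqlContext.read.parquet('/home/sanjay/submissions-parquet')
    parquet_df.registerTempTable('parquet_table')

    # Keep only the columns of interest (col_1 and col_2 are placeholders)
    col_parquet_df = sqlContext.sql('select col_1, col_2 from parquet_table')

    # Write the result as CSV through the spark-csv package
    col_parquet_df.write.format('com.databricks.spark.csv').save('output_dir')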

Hope it helps.

Thanks
Aparna


#3

Thanks Aparna. I am not allowed to download external libraries on my cluster, but this gives me a clue.
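
For anyone with the same restriction, one package-free workaround (a rough sketch; it assumes plain values with no embedded commas, quotes, or newlines, and the paths and column names are placeholders) is to select the columns and then write comma-joined lines with saveAsTextFile:

    parquet_df = sqlContext.read.parquet('/home/sanjay/submissions-parquet')

    # Keep only the columns of interest (col_1 and col_2 are placeholders)
    selected_df = parquet_df.select('col_1', 'col_2')

    # Naive CSV rendering: join each Row's fields with commas; there is
    # no quoting or escaping, so values must not contain commas themselves
    csv_lines = selected_df.rdd.map(lambda row: ','.join(str(v) for v in row))
    csv_lines.saveAsTextFile('/home/sanjay/submissions-csv')  # placeholder path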


#4

Hi Sanjay,
Now I have another question: is there a way to keep only selected columns in the DataFrame directly, instead of registering a temporary table and then selecting them?
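Something like the sketch below is what I have in mind (select is the DataFrame method I am wondering about; the column names are placeholders):

    # select returns a new DataFrame restricted to the named columns,
    # so no temporary table is involved
    col_parquet_df = parquet_df.select('col_1', 'col_2')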

Thanks
Aparna