How to read excel data into a dataframe in spark/scala

apache-spark

#1

I have a requirement wherein I need to read an Excel file (with the .xlsx extension) in Spark/Scala, create a dataframe from the data read from it, and apply/write SQL queries on top of it to do some analysis. The Excel file has column headers/titles like “time_spend_company (Years)”, “average_monthly_hours (hours)” etc. which have spaces in the headers themselves, and these spaces are causing me problems when applying any SQL queries on the loaded dataframe.

I am using the com.crealytics.spark.excel library to parse the Excel contents, and my code looks like the below:

val empFile = "C:\\EmpDatasets.xlsx"

val employeesDF = spark.read                       // spark is the SparkSession available in spark-shell
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1")
  .option("useHeader", "true")                     // first row holds the column headers
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")                  // all columns are read as string
  .option("addColorColumns", "false")
  .load(empFile)

employeesDF.createOrReplaceTempView("EMP")
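
For reference, this is how I checked what the headers look like once loaded (the output below is only an illustration based on my file; with inferSchema set to false every column comes through as string):

// Quick sanity check on the column names picked up from the header row
employeesDF.printSchema()

// illustrative output:
// root
//  |-- time_spend_company (Years): string (nullable = true)
//  |-- average_monthly_hours (hours): string (nullable = true)
//  |-- left_company: string (nullable = true)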

I want to apply group by and other aggregate functions on these columns, but I am facing issues like the one below. My requirement is to apply a group by on the time_spend_company (Years) column and get a count of it.

val expLevel = spark.sql("Select 'time_spend_company (Years)' as 'Years_spent_in_company',count(1) from EMP where left_company = 1 group by 'time_spend_company (Years)'")
expLevel.show
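
To make the intent clearer, here is roughly the same aggregation expressed with the DataFrame API (just a sketch of what I am trying to achieve, not something I am sure is the right way to handle columns with spaces):

// Same group-by/count, written against the dataframe directly
val expLevelDF = employeesDF
  .filter(employeesDF("left_company") === 1)
  .groupBy(employeesDF("time_spend_company (Years)"))
  .count()

expLevelDF.show()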

I need help with:

  1. Is there a better way to load the Excel file, assign custom column names to it, and create a dataframe? (See the rough sketch after this list for the workaround I have been considering.)
  2. How do I write SQL queries against column names that have spaces in them?
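
For item 1, the only direction I have come up with so far is renaming the columns right after loading so the registered view has SQL-friendly names; a rough sketch (the new names here are just ones I made up):

// Strip the spaces/units from the headers before registering the temp view
val renamedDF = employeesDF
  .withColumnRenamed("time_spend_company (Years)", "time_spend_company")
  .withColumnRenamed("average_monthly_hours (hours)", "average_monthly_hours")

renamedDF.createOrReplaceTempView("EMP")

val expLevel = spark.sql(
  "select time_spend_company, count(1) from EMP where left_company = 1 group by time_spend_company")
expLevel.show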

Note: I need to read it as an Excel file only; I can’t convert it to CSV or any other file format.