Apache Spark - Reading text files with a custom record delimiter other than the new line character


#1

As part of this topic, let us see how we can read text files that use a custom record delimiter other than the new line character, such as \r, as well as complex field delimiters.

Custom Record Delimiter

  • sc.textFile or spark.read.csv works fine as long as the record delimiter is the new line character
  • But if the record delimiter is any character other than new line, then we have to use lower-level Hadoop APIs such as org.apache.hadoop.mapreduce.lib.input.TextInputFormat
  • The Spark Context (sc) has an API called newAPIHadoopFile to use these lower-level Hadoop APIs
  • newAPIHadoopFile takes 5 arguments
    • path
    • input format class
    • key class
    • value class
    • configuration
  • We need to first get the Hadoop configuration from the Spark context and set textinputformat.record.delimiter to the custom delimiter
  • The key and value classes are purely based on the input format. For the text input format, the key type is org.apache.hadoop.io.LongWritable and the value type is org.apache.hadoop.io.Text
  • To preview the data, we have to convert each value to a String (using toString) as part of a map
  • You can see a complete code snippet below
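
Putting the steps above together, here is a minimal sketch in Scala, assuming a spark-shell session (so sc is already available) and a hypothetical input path /path/to/data whose records end in \r:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Copy the Hadoop configuration from the Spark context and set the
// custom record delimiter (\r in this example).
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\r")

// newAPIHadoopFile takes the path, the input format class, the key
// class, the value class, and the configuration. For the text input
// format, the key is the byte offset of the record (LongWritable)
// and the value is the record itself (Text).
val raw = sc.newAPIHadoopFile(
  "/path/to/data",        // hypothetical input path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf
)

// Convert each value to a String as part of map before previewing;
// Hadoop Writables such as Text are not directly printable or
// serialization-friendly.
raw.map(_._2.toString).take(10).foreach(println)
```

If the records also use a complex field delimiter, a further map can be chained after the toString conversion, for example splitting each record string on a multi-character delimiter with split.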