Apache Spark Python - Data Processing Overview - Overview of Spark Read APIs

Apache Spark provides a variety of APIs to read data from files of different formats. These APIs are exposed under spark.read and offer flexibility in reading various types of data efficiently.

Available APIs

  • text: Reads text files into a single column named value, treating each line as one record. Passing wholetext=True treats each whole file as one record instead.
  • csv: Reads text files with delimiters. The default delimiter is a comma, but other delimiters can be specified.
  • json: Reads data from JSON files.
  • orc: Reads data from ORC files.
  • parquet: Reads data from Parquet files.
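
Each of these is available as a method on spark.read. A minimal sketch follows, assuming the spark session created later on this page; the Parquet and ORC paths are hypothetical placeholders.

# Each line of the text files becomes one row in a single 'value' column
spark.read.text('/public/retail_db/orders')

# Delimited data; the default separator is a comma
spark.read.csv('/public/retail_db/orders')

# One JSON record per line
spark.read.json('/public/retail_db_json/orders')

# Columnar formats carry their own schema (hypothetical paths)
# spark.read.parquet('/public/retail_db_parquet/orders')
# spark.read.orc('/public/retail_db_orc/orders')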

Additional Options

  • inferSchema: Infers the data types of columns based on the data.
  • header: Uses the first line as the header to derive column names when reading delimited (CSV) files.
  • schema: Allows for the explicit specification of the schema.
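
These options can be passed as keyword arguments (or via the option method). A minimal sketch follows; the header example uses a hypothetical path, since the retail_db files used below do not include a header line.

# Infer column types from the data (requires an extra pass over the files)
spark.read. \
    csv('/public/retail_db/orders', inferSchema=True). \
    printSchema()

# Use the first line of each file as column names (hypothetical path)
# spark.read.csv('/data/orders_with_header', header=True, inferSchema=True)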

Example: Reading Delimited Data from Text Files

from pyspark.sql import SparkSession

import getpass

username = getpass.getuser()

# Create a SparkSession on YARN with Hive support; spark.ui.port set to 0
# lets Spark pick an available port for the UI
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()

# Read the comma-delimited orders data, specifying the schema as a DDL-formatted string
spark.read.csv('/public/retail_db/orders',
               schema='''
                   order_id INT,
                   order_date STRING,
                   order_customer_id INT,
                   order_status STRING
               '''
              ). \
    show()
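
The same reader handles other delimiters via the sep argument; a minimal sketch, using a hypothetical pipe-delimited copy of the data set:

# sep overrides the default comma delimiter (path is hypothetical)
spark.read.csv('/public/hypothetical/orders_pipe',
               sep='|',
               schema='''
                   order_id INT,
                   order_date STRING,
                   order_customer_id INT,
                   order_status STRING
               '''
              ). \
    show()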

Example: Reading JSON Data from Text Files

spark.read. \
    json('/public/retail_db_json/orders'). \
    show()
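
The JSON reader infers column names and types from the attribute names and values in the records; the inferred schema can be inspected with printSchema.

spark.read. \
    json('/public/retail_db_json/orders'). \
    printSchema()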

For more detailed information on specific APIs, you can use help(spark.read.<API_NAME>), such as help(spark.read.csv).

Conclusion

Understanding the Spark read APIs is crucial for efficiently processing data from different file formats. By leveraging these APIs and their options, you can effectively read and analyze data in Apache Spark. Experiment with different APIs and engage with the community to enhance your skills in data processing and analysis with Spark.