Apache Spark Python - Basic Transformations - Overview of Sorting DataFrames

In this article, we will explore how to sort data in a DataFrame using Apache Spark. We will focus on sorting in ascending and descending order, sorting on multiple columns, and handling null and empty values appropriately.

Sorting by Count of Cancelled Flights

We will begin by sorting the data based on the count of cancelled flights. We can achieve this with the orderBy method of the Spark DataFrame API. Let’s start by getting the daily count of cancelled flights sorted in ascending order:

# lpad() zero-pads Month and DayOfMonth so the concatenated
# FlightDate string sorts chronologically
from pyspark.sql.functions import col, concat, lpad, lit, count

airtraffic. \
    filter('cancelled = 1'). \
    groupBy(
        concat(
            col("Year"),
            lpad(col("Month"), 2, "0"),
            lpad(col("DayOfMonth"), 2, "0")
        ).alias('FlightDate')
    ). \
    agg(count(lit(1)).alias('FlightCount')). \
    orderBy(col('FlightCount').asc()). \
    show(31)

If you want to sort in descending order instead, you can use the following code snippet:

# Same aggregation as above; only the sort direction changes to desc()
airtraffic. \
    filter('cancelled = 1'). \
    groupBy(
        concat(
            col("Year"),
            lpad(col("Month"), 2, "0"),
            lpad(col("DayOfMonth"), 2, "0")
        ).alias('FlightDate')
    ). \
    agg(count(lit(1)).alias('FlightCount')). \
    orderBy(col('FlightCount').desc()). \
    show(31)
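
orderBy also accepts an ascending keyword argument, which is convenient when the sort direction is chosen at run time. Here is a minimal equivalent sketch; the intermediate name daily_counts is introduced purely for illustration:

daily_counts = airtraffic. \
    filter('cancelled = 1'). \
    groupBy(
        concat(
            col("Year"),
            lpad(col("Month"), 2, "0"),
            lpad(col("DayOfMonth"), 2, "0")
        ).alias('FlightDate')
    ). \
    agg(count(lit(1)).alias('FlightCount'))

# ascending=False is equivalent to col('FlightCount').desc()
daily_counts.orderBy('FlightCount', ascending=False).show(31)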

Sorting by Scheduled Time

Next, we will project specific columns and sort the data by year, month, and day of the month in ascending order, with the scheduled departure time (CRSDepTime) in descending order within each day. Here’s how you can achieve this:

# Year, Month and DayOfMonth sort ascending by default;
# CRSDepTime sorts descending within each day
airtraffic. \
    select('Year', 'Month', 'DayOfMonth', 'CRSDepTime', 'Origin'). \
    orderBy('Year', 'Month', 'DayOfMonth', col('CRSDepTime').desc()). \
    show()
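
The same mixed-direction sort can also be written with the desc function from pyspark.sql.functions, which takes a column name directly. A minimal equivalent sketch:

from pyspark.sql.functions import desc

# desc('CRSDepTime') is equivalent to col('CRSDepTime').desc()
airtraffic. \
    select('Year', 'Month', 'DayOfMonth', 'CRSDepTime', 'Origin'). \
    orderBy('Year', 'Month', 'DayOfMonth', desc('CRSDepTime')). \
    show()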

Sorting by Nationality with Special Handling for United States

In this section, we will sort employee data by nationality while ensuring that records for the United States always come first. The trick is to derive a helper sort key and order by it before the nationality itself. Here’s how you can accomplish this:

from pyspark.sql.functions import expr

# Helper key: 0 for United States, 1 for everyone else, so US rows
# sort first; ties are then broken alphabetically by nationality
employeesDF. \
    withColumn(
        'sort_column', 
        expr("""CASE WHEN upper(nationality) = 'UNITED STATES' THEN 0 ELSE 1 END""")
    ). \
    orderBy('sort_column', 'nationality'). \
    show()
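
If you prefer to avoid SQL strings, the same key can be built with when and otherwise from the Column API. This sketch assumes the same employeesDF with a nationality column, and drops the helper column from the output:

from pyspark.sql.functions import when, upper, col

# Same logic as the CASE expression, expressed with the Column API
employeesDF. \
    withColumn(
        'sort_column',
        when(upper(col('nationality')) == 'UNITED STATES', 0).otherwise(1)
    ). \
    orderBy('sort_column', 'nationality'). \
    drop('sort_column'). \
    show()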

Sorting by Bonus with Nulls and Empty Values Last

Lastly, we will sort the employee data by the bonus column, ensuring that null and empty values come at the end. Casting bonus to int converts empty strings to nulls, so asc_nulls_last() pushes both kinds of missing values to the bottom of the result:

# cast('int') turns empty strings into nulls, and asc_nulls_last()
# sends all nulls to the end of the sorted output
employeesDF. \
    orderBy(col('bonus').cast('int').asc_nulls_last()). \
    show()
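
To see this behavior end to end, here is a minimal, self-contained sketch; the column names and sample rows are hypothetical and stand in for whatever employeesDF holds in your environment:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('sorting-demo').getOrCreate()

# Hypothetical data: bonus is a string column mixing numbers,
# an empty string, and a null
employeesDF = spark.createDataFrame(
    [(1, 'Scott', '1000'), (2, 'Henry', ''), (3, 'Nick', None), (4, 'Bill', '1500')],
    ['employee_id', 'first_name', 'bonus']
)

# Scott and Bill sort by bonus; Henry ('') and Nick (None) end up last
employeesDF. \
    orderBy(col('bonus').cast('int').asc_nulls_last()). \
    show()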

By working through these examples, you should have a better understanding of sorting DataFrames in Apache Spark. Feel free to experiment with different sorting criteria and explore the flexibility that Spark offers for data manipulation.

Conclusion

In this article, we explored various techniques for sorting data in a DataFrame using Apache Spark. We demonstrated how to sort by the count of cancelled flights, by scheduled departure time, by nationality with special handling for specific values, and by a numeric column containing null and empty values. These examples highlight the flexibility and power of the Spark DataFrame API for organizing and analyzing large datasets. Mastering these sorting techniques will strengthen your data manipulation skills and help you draw deeper insights from your data. Happy coding!