Let us understand the details of Spark SQL properties, which control the Spark SQL runtime environment.
- Spark SQL inherits the properties defined for Spark. There are some Spark SQL specific properties as well, and these apply to Data Frames too.
- We can review these properties using management tools such as the Ambari or Cloudera Manager web UIs.
- In clusters where Spark is integrated with Hadoop and Hive, Spark runtime behavior is also controlled by the HDFS, YARN, and Hive properties files.
- We can get all the properties that have been set explicitly by running `SET;` in the Spark SQL CLI, as shown below.
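Here is a minimal example in the Spark SQL CLI. `SET -v;` additionally lists properties that are still at their default values:

```sql
-- List the properties that have been set explicitly in this session
SET;

-- List all properties, including those still at their default values,
-- along with a short description of each
SET -v;
```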
Let us review some important properties in Spark SQL:

- `spark.sql.warehouse.dir`
- `spark.sql.catalogImplementation`
### `spark.sql.warehouse.dir` Property

The `spark.sql.warehouse.dir` property specifies the location of the default database in a Spark SQL environment. This is where Spark SQL will create the folders for default databases and tables.
To review the current value of this property, you can use the following command in the Spark SQL CLI:

```sql
SET spark.sql.warehouse.dir;
```
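To see the property in action, you can create a database and check where its folder is created. A minimal sketch, assuming you have permission to create databases and that `demo_db` (a hypothetical name) does not already exist:

```sql
-- Review the warehouse location (for example, /user/hive/warehouse)
SET spark.sql.warehouse.dir;

-- Create a database; its folder is created under the warehouse directory,
-- typically as <warehouse dir>/demo_db.db
CREATE DATABASE demo_db;

-- Shows the resolved location of the new database
DESCRIBE DATABASE demo_db;
```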
### `spark.sql.catalogImplementation` Property

The `spark.sql.catalogImplementation` property determines the implementation used to store metadata in Spark SQL. It can have the value `in-memory` (metadata is kept only for the lifetime of the session) or `hive` (metadata is persisted in the Hive metastore).
To review the current value of this property, you can use the following command in the Spark SQL CLI:

```sql
SET spark.sql.catalogImplementation;
```
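Note that in recent Spark versions this is a static configuration, so it typically cannot be changed with `SET` once a session is running; it is set when the session is launched (for example, by passing `--conf spark.sql.catalogImplementation=hive` to `spark-shell`). A minimal sketch of what this looks like in the Spark SQL CLI:

```sql
-- Review the catalog implementation for the current session
-- (spark-sql is typically launched with Hive support, so this is usually 'hive')
SET spark.sql.catalogImplementation;

-- Attempting to modify a static configuration at runtime is expected to fail
-- with an error such as "Cannot modify the value of a static config"
-- SET spark.sql.catalogImplementation=in-memory;
```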
### Overwriting Properties

Properties that are still at their default values may not show up in the `SET;` output. You can overwrite a property by setting a new value with the same `SET` command. For example, you can change the value of the `spark.sql.shuffle.partitions` property (the number of partitions used for shuffles, 200 by default) to 2 as follows:

```sql
SET spark.sql.shuffle.partitions=2;
```
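To confirm the property takes effect, you can run an aggregation after changing it and check the number of tasks in the shuffle stage in the Spark UI. A minimal sketch, assuming a hypothetical table named `orders` with an `order_status` column:

```sql
-- Review the current value (200 by default)
SET spark.sql.shuffle.partitions;

-- Use only 2 shuffle partitions, which is more appropriate for small data sets
SET spark.sql.shuffle.partitions=2;

-- The GROUP BY below now shuffles the data into 2 partitions;
-- the corresponding stage in the Spark UI runs 2 tasks
SELECT order_status, count(*) AS order_count
FROM orders
GROUP BY order_status;
```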
Hands-On Tasks
- Open Spark SQL CLI and review the values of
spark.sql.warehouse.dir
andspark.sql.catalogImplementation
properties. - Experiment with changing the value of a property, such as
spark.sql.shuffle.partitions
, using theSET
command.
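A possible sequence for these tasks in the Spark SQL CLI (the value 4 is an arbitrary choice for illustration):

```sql
-- Task 1: review the current values
SET spark.sql.warehouse.dir;
SET spark.sql.catalogImplementation;

-- Task 2: change a runtime property and verify the new value
SET spark.sql.shuffle.partitions=4;
SET spark.sql.shuffle.partitions;
```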
### Conclusion

In this article, we discussed the important Spark SQL properties that control the runtime behavior of a Spark SQL environment. Understanding and managing these properties effectively is essential for optimal performance. Experiment with different properties and values to customize the behavior of Spark SQL to your requirements. Practice more and engage with the community for further learning and insights.