Hive on Spark parameters
This section describes the parameters that control the interaction between Hive and Spark when Spark is used as the Hive execution engine. For more information on using these parameters, see the Spark and Hive section.
Parameter | Description | Default value
---|---|---
hive.spark.job.monitor.timeout | The timeout for the job monitor to get a Spark job state (in seconds) | 60
hive.spark.dynamic.partition.pruning | When set to `true`, enables dynamic partition pruning for Hive on Spark queries | false
hive.spark.dynamic.partition.pruning.map.join.only | Similar to `hive.spark.dynamic.partition.pruning`, but enables pruning only for joins that can be converted to map joins | false
hive.spark.dynamic.partition.pruning.max.data.size | The maximum data size for the dimension table that generates partition pruning information (in megabytes). If a table reaches this limit, the optimization is disabled | 100
hive.spark.exec.inplace.progress | Updates Spark job execution progress in place in the terminal | true
hive.spark.use.ts.stats.for.mapjoin | If set to `true`, the map join optimization uses statistics from the TableScan operators at the root of the operator tree instead of the parent ReduceSink operators of the Join operator | false
hive.spark.explain.user | Defines whether to show the EXPLAIN output at the user level for Hive on Spark queries | false
hive.prewarm.spark.timeout | Time to wait to finish pre-warming Spark executors when `hive.prewarm.enabled` is set to `true` (in milliseconds) | 5000
hive.spark.optimize.shuffle.serde | If set to `true`, Hive on Spark registers custom serializers for data types in shuffle, which can reduce the amount of shuffled data | false
hive.merge.sparkfiles | Merges small files at the end of a Spark DAG transformation | false
hive.spark.use.op.stats | Defines whether to use operator stats to determine reducer parallelism for Hive on Spark. If set to `false`, Hive uses the source table stats to determine reducer parallelism for first-level reduce tasks, and the maximum reducer parallelism of all parents for the remaining reduce tasks | true
hive.spark.use.groupby.shuffle | When set to `true`, Hive uses Spark's `groupByKey` transformation to perform GROUP BY aggregations; when set to `false`, it uses `repartitionAndSortWithinPartitions` instead | true
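As a quick sketch of how these parameters are typically applied, they can be overridden per session with `SET` statements in Beeline or the Hive CLI (or set cluster-wide in `hive-site.xml`). The values below are illustrative examples, not tuning recommendations:

```sql
-- Illustrative per-session overrides (example values, not recommendations)
SET hive.execution.engine=spark;
SET hive.spark.dynamic.partition.pruning=true;
SET hive.spark.dynamic.partition.pruning.max.data.size=200;
SET hive.spark.job.monitor.timeout=120;
```

Session-level `SET` overrides apply only to the current session and take precedence over values from `hive-site.xml`.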
Remote Spark Driver
The remote Spark driver is an application launched in the Spark cluster that submits the actual Spark jobs. It is long-lived: it is initialized upon the first query of the current user and runs until the user session is closed.
The following properties control the remote communication between the remote Spark driver and the Hive client that spawns it.
Parameter | Description | Default value
---|---|---
hive.spark.client.future.timeout | Timeout for requests from the Hive client to the remote Spark driver (in seconds) | 60
hive.spark.client.connect.timeout | Timeout for the remote Spark driver to connect back to the Hive client (in milliseconds) | 1000
hive.spark.client.server.connect.timeout | Timeout for the handshake between the Hive client and the remote Spark driver (in milliseconds). Checked by both processes | 90000
hive.spark.client.secret.bits | Number of bits of randomness in the generated secret for communication between the Hive client and the remote Spark driver. Rounded down to the nearest multiple of 8 | 256
hive.spark.client.rpc.server.address | The server address of the HiveServer2 host to be used for communication between the Hive client and the remote Spark driver | —
hive.spark.client.rpc.threads | The maximum number of threads for the remote Spark driver's RPC event loop | 8
hive.spark.client.rpc.max.size | The maximum message size in bytes for communication between the Hive client and the remote Spark driver | 52428800 (50 * 1024 * 1024, or 50 MB) |
hive.spark.client.channel.log.level | Channel logging level for the remote Spark driver. Possible values: `DEBUG`, `ERROR`, `INFO`, `TRACE`, `WARN` | —
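Because these properties govern how HiveServer2 and the remote Spark driver connect to each other, they are usually set cluster-wide in `hive-site.xml` rather than per session. A minimal illustrative fragment, assuming the example timeout values are appropriate for a slow network (they are not recommendations):

```xml
<!-- Illustrative hive-site.xml fragment; example values, not recommendations -->
<property>
  <name>hive.spark.client.connect.timeout</name>
  <!-- milliseconds; raise if the driver fails to connect back in time -->
  <value>5000</value>
</property>
<property>
  <name>hive.spark.client.server.connect.timeout</name>
  <!-- milliseconds; handshake timeout checked by both processes -->
  <value>120000</value>
</property>
```

After changing these values, HiveServer2 must be restarted for the new settings to take effect.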