Hive - Performance Tuning Tips
You might have seen a Hive developer screaming over queries that keep breaking. If you are one of them, or you want to
rescue one, this guide is for you.
We all know that Hive is a warehouse suited to offline batch
processing, where you can’t expect near-real-time responses. Still, with a few settings in place
we can certainly get better response times from Hive queries.
Let us think beyond partitioning,
bucketing, choosing the ORC file format, and so on.
Vectorization
set hive.vectorized.execution.enabled=true;
Vectorization allows Hive to process a batch of rows (1,024 by default)
together instead of processing one row at a time. Each batch consists of column vectors, one per column, and each vector is
usually an array of a primitive type. Operations are performed on the entire
column vector, which makes better use of the CPU's instruction pipelines and cache.
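A quick way to check whether a query actually runs vectorized is to read its plan. The sketch below assumes a hypothetical ORC-backed table named page_views_orc; the table and column names are made up for illustration. Vectorized execution was introduced for ORC data, so the sketch stores the table as ORC.

```sql
-- Hypothetical table, stored as ORC so that vectorized readers can be used.
CREATE TABLE IF NOT EXISTS page_views_orc (
  user_id BIGINT,
  country STRING,
  view_time TIMESTAMP
)
STORED AS ORC;

set hive.vectorized.execution.enabled=true;

-- Inspect the plan of a simple aggregation over the table.
EXPLAIN
SELECT country, COUNT(*) AS views
FROM page_views_orc
GROUP BY country;
```

If vectorization applies, the plan should contain a line reading "Execution mode: vectorized" for the affected stage. If that line is missing, the query probably uses a data type or expression that your Hive version cannot vectorize.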
Cost Based Optimization
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Hive optimizes each query’s
logical and physical execution plan before submitting it for final execution.
Historically, these optimizations were not based on the cost of the query.
Cost-based optimization (CBO), a more recent addition to Hive, performs further
optimizations based on query cost, which can lead to different
decisions: how to order joins, which type of join to perform, the degree of
parallelism, and others.
CBO works better when statistics have been gathered for the table with the ANALYZE
command.
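As a minimal sketch of gathering those statistics, assuming a hypothetical unpartitioned table named sales (the name is only for illustration):

```sql
-- Table-level statistics: row count, number of files, total size.
ANALYZE TABLE sales COMPUTE STATISTICS;

-- Column-level statistics (distinct values, min/max, null counts),
-- which the optimizer uses when estimating join and filter selectivity.
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;

-- Check what was recorded in the table parameters.
DESCRIBE FORMATTED sales;
```

Statistics go stale as data is loaded, so it is usually worth re-running ANALYZE after large inserts.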
MapJoin Optimization
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
Map joins are very efficient when
the table on the other side of a join is small enough to fit in memory. With the settings above, Hive automatically
converts an n-way join to a map join when the combined size of n-1 of the tables
is less than the number of bytes given by hive.auto.convert.join.noconditionaltask.size. For example, if 3 tables are joined and the total size of the
2 smaller tables is less than 10 MB, the join is converted to a
map join for better query performance. The size limit only takes effect when
automatic conversion (hive.auto.convert.join) is enabled in the Hive configuration.
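To see whether the conversion happens for a particular query, you can inspect its plan. The sketch below assumes three hypothetical tables: a large orders table and two small lookup tables, customers and regions. All table and column names are made up for illustration.

```sql
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- Roughly 10 MB: the combined size allowed for the small tables.
set hive.auto.convert.join.noconditionaltask.size=10000000;

EXPLAIN
SELECT o.order_id, c.customer_name, r.region_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN regions r ON c.region_id = r.region_id;
```

When the small tables fit under the limit, the plan should show a "Map Join Operator" instead of a reduce-side join. Hive relies on table size statistics to make this decision, so missing or stale statistics can prevent the conversion.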
Parallel execution
set hive.exec.parallel=true;
Hadoop
can execute MapReduce jobs in parallel, and several queries executed on Hive
automatically use this parallelism. However, a single complex Hive query
is commonly translated into a number of MapReduce jobs that are executed
sequentially by default. Often, though, some of a query’s MapReduce stages are not
interdependent and could be executed in parallel. They can then take advantage
of spare capacity on the cluster and improve cluster utilization while at the
same time reducing the overall query execution time.
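A typical case of independent stages is a UNION ALL whose branches each need their own job. The sketch below assumes two hypothetical tables, orders and customers; the names are only for illustration. The hive.exec.parallel.thread.number setting caps how many stages run at once (8 is its usual default).

```sql
set hive.exec.parallel=true;
-- Upper bound on the number of stages executed at the same time.
set hive.exec.parallel.thread.number=8;

-- The two aggregations do not depend on each other,
-- so their jobs can be launched side by side.
SELECT u.source, u.row_count
FROM (
  SELECT 'orders' AS source, COUNT(*) AS row_count FROM orders
  UNION ALL
  SELECT 'customers' AS source, COUNT(*) AS row_count FROM customers
) u;
```

On a busy cluster the gain may be small, because the parallel stages compete for the same free slots.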
Compressed map output
set mapred.compress.map.output=true;
set mapred.output.compress=true;
Compression
significantly reduces the volume of intermediate data, which in turn reduces the
amount of data transferred between mappers and reducers, generally
over the network. Compression can be applied to the mapper output and the reducer output
individually. Keep in mind that gzip-compressed files are not splittable, so
gzip should be applied with caution: a compressed file should not be
larger than a few hundred megabytes, otherwise it can lead to an
imbalanced job. Other compression codecs to consider are Snappy, LZO, and
bzip2.
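As a sketch of choosing a codec explicitly, the settings below use the newer Hadoop property names (the mapred.* names shown earlier are their older equivalents) together with Hive's own compression switches. The codec choices are assumptions for illustration, not a recommendation for every workload: Snappy for intermediate data because it is fast, and bzip2 for final output because it stays splittable.

```sql
-- Intermediate data passed between the stages of a query.
set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Final output written by the query.
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
```

bzip2 compresses well but is slow, so for final output the trade-off is CPU time against splittability. The Snappy codec also needs the native Snappy library to be available on the cluster nodes.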