Hive - Performance Tuning Tips




You have probably seen a Hive developer screaming about queries that keep breaking. If you are one of them, or you want to rescue one, this guide is for you. We all know that Hive is a data warehouse suited to offline batch processing, where you can't expect near-real-time responses, but with a few settings taken into account you can certainly get better response times from your Hive queries. Let us think beyond partitioning, bucketing, choosing the ORC file format, and so on.



      Vectorization



set hive.vectorized.execution.enabled=true;



Vectorization allows Hive to process a batch of rows together instead of one row at a time. Each batch is made up of column vectors, each of which is usually an array of a primitive type. Operations are performed on an entire column vector at once, which improves instruction pipelining and cache usage.
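
As a rough sketch of how to check whether a query is actually vectorized, the statements below enable vectorization for both the map and reduce sides and then ask Hive for the query plan. The table sales and its columns are hypothetical, and the table is assumed to be stored as ORC, the format vectorized execution was originally built for.

```sql
-- Enable vectorized execution on the map side and the reduce side
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;

-- "sales", "region" and "amount" are hypothetical names used for illustration
EXPLAIN
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;
```

In the EXPLAIN output, vectorized stages are normally marked with "Execution mode: vectorized". If that marker is missing, the query is probably falling back to row-at-a-time processing.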



      Cost Based Optimization



set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;



Hive optimizes each query's logical and physical execution plan before submitting it for execution. Originally, these optimizations did not take the cost of the query into account. Cost-based optimization (CBO), a later addition to Hive, performs further optimizations based on the estimated query cost, which can lead to different decisions: how to order joins, which type of join to perform, the degree of parallelism, and others.



CBO works best when statistics have been gathered for the tables involved, using the ANALYZE command.
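
A minimal sketch of gathering those statistics might look like the following. The table names sales and sales_by_day and the partition column dt are hypothetical; adjust them to your own schema.

```sql
-- Table-level statistics (row count, data size, number of files)
ANALYZE TABLE sales COMPUTE STATISTICS;

-- Column-level statistics, used together with hive.stats.fetch.column.stats
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;

-- Statistics for a single partition of a partitioned table
ANALYZE TABLE sales_by_day PARTITION (dt='2020-01-01') COMPUTE STATISTICS;

-- Inspect the collected statistics in the table parameters
DESCRIBE FORMATTED sales;
```

Statistics describe the data at the time ANALYZE ran, so it is worth re-running these statements after large loads.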



      MapJoin Optimization



set hive.auto.convert.join=true;

set hive.auto.convert.join.noconditionaltask = true;

set hive.auto.convert.join.noconditionaltask.size = 10000000;



Map joins are very efficient when the tables on one side of a join are small enough to fit in memory. With the settings above, Hive tries to convert an n-way join into a map join automatically when the combined size of the n-1 smaller tables is less than the number of bytes given by hive.auto.convert.join.noconditionaltask.size. For example, if 3 tables are joined and the total size of the 2 smaller tables is less than 10 MB, the join is automatically converted to a map join for better query performance. The size threshold only takes effect when hive.auto.convert.join and hive.auto.convert.join.noconditionaltask are both enabled.
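
To make the threshold concrete, here is a sketch with three hypothetical tables: a large orders table, a customers table of roughly 6 MB, and a products table of roughly 3 MB. The two small tables add up to about 9 MB, which is under the 10,000,000-byte limit, so Hive would be expected to pick a map join. The sizes and names are assumptions for illustration only.

```sql
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;

-- Hypothetical tables: orders (large), customers (~6 MB), products (~3 MB)
EXPLAIN
SELECT o.order_id, c.customer_name, p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id;
```

If the conversion happened, the plan should contain a Map Join Operator rather than a reduce-side join. Hive judges table size from the information available to it, so stale or missing statistics can change the outcome.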



Parallel execution



set hive.exec.parallel=true;



Hadoop can execute MapReduce jobs in parallel, and several queries running on Hive automatically benefit from this parallelism. However, a single complex Hive query is commonly translated into a number of MapReduce jobs that are executed sequentially by default. Often, some of a query's MapReduce stages do not depend on each other and could run in parallel. They can then take advantage of spare capacity on the cluster, improving cluster utilization while reducing the overall query execution time.
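
As an illustration, the two branches of the UNION ALL below do not depend on each other, so with parallel execution enabled Hive may run their stages at the same time. The tables orders and customers are hypothetical, and hive.exec.parallel.thread.number is an additional setting not mentioned above that caps how many jobs run concurrently.

```sql
set hive.exec.parallel=true;
-- Upper bound on the number of jobs executed in parallel (8 is the default)
set hive.exec.parallel.thread.number=8;

-- Hypothetical tables; each branch is an independent aggregation
SELECT u.src, u.row_count
FROM (
  SELECT 'orders' AS src, COUNT(*) AS row_count FROM orders
  UNION ALL
  SELECT 'customers' AS src, COUNT(*) AS row_count FROM customers
) u;
```

Whether the stages really overlap depends on free capacity in the cluster, so the gain is largest when the cluster is not already saturated.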



Compressed map output



set mapred.compress.map.output=true;

set mapred.output.compress=true;

Compression significantly reduces the volume of intermediate data, which in turn reduces the amount of data transferred between mappers and reducers. This transfer generally happens over the network. Compression can be applied to the mapper output and the reducer output individually. Keep in mind that gzip-compressed files are not splittable, so gzip should be applied with caution: a compressed file should not be larger than a few hundred megabytes, otherwise it can lead to an imbalanced job. Other compression codec options include snappy, lzo, and bzip2.
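
A possible configuration using Snappy is sketched below. It assumes a Hadoop 2 or later cluster, where the mapreduce.* property names are the current equivalents of the older mapred.* names shown above, and it assumes the Snappy native libraries are installed on the cluster nodes.

```sql
-- Compress intermediate data passed between stages of a Hive query
set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final output written by the query
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
```

Snappy favors speed over compression ratio, which usually suits short-lived intermediate data. For final output stored in a container format such as ORC, compression is normally handled by the file format itself.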
