
Sunday, May 22, 2016

Hive Queues, Sessions, Tez, and Hive Optimization Rules



To create queues within YARN depending on cluster size:
{Explanation: A container is the basic unit of processing capacity in YARN, and is an encapsulation of resource elements (for example, memory, CPU, and so on).}

Step 1: Go to $HADOOP_YARN_HOME and open capacity-scheduler.xml:
cli> cd $HADOOP_YARN_HOME
$HADOOP_YARN_HOME> cd etc/hadoop
etc/hadoop> vi capacity-scheduler.xml
{Explanation: Make changes as necessary. The configuration below allocates 50% of the cluster capacity to a default queue for batch jobs, and sets up two queues for interactive Hive queries, each assigned 25% of cluster resources, as shown below:}
yarn.scheduler.capacity.root.queues=default,hive1,hive2
yarn.scheduler.capacity.root.default.capacity=50
yarn.scheduler.capacity.root.hive1.capacity=25
yarn.scheduler.capacity.root.hive2.capacity=25
yarn.scheduler.capacity.root.default.maximum-capacity={ 50 or 100}
yarn.scheduler.capacity.root.default.user-limit-factor={ 1 or 2}
{Explanation: Setting maximum-capacity=50 restricts queue users to 50% of the queue capacity as a hard limit. If maximum-capacity is greater than 50%, the queue can use more than its configured capacity when there are idle resources elsewhere in the cluster. However, any single user can still use only the configured queue capacity. The default value of "1" for user-limit-factor means that any single user in the queue can occupy at most 1X the queue's configured capacity.}
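
For reference, here is a minimal sketch of how the same settings look inside the <configuration> element of capacity-scheduler.xml (the maximum-capacity and user-limit-factor values shown are just one of the choices discussed above):

  <!-- One default batch queue (50%) and two interactive Hive queues (25% each) -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,hive1,hive2</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.hive1.capacity</name>
    <value>25</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.hive2.capacity</name>
    <value>25</value>
  </property>
  <property>
    <!-- Hard cap at 50; set closer to 100 to let the queue borrow idle cluster capacity -->
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>50</value>
  </property>
  <property>
    <!-- A single user may occupy at most 1X the queue's configured capacity -->
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>1</value>
  </property>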
Once the configuration changes are done, save capacity-scheduler.xml and refresh the queue settings for them to take effect:

$HADOOP_YARN_HOME/bin> yarn rmadmin -refreshQueues
{Explanation:
YARN single queue with fair scheduling: the "fair scheduling" ordering policy in YARN was introduced in HDP 2.3. Fair scheduling enables all sessions running within a queue to get equal resources. It is specified by setting the Ordering policy to fair in the Capacity Scheduler View in Ambari (see the sketch below for the equivalent capacity-scheduler.xml property).
Setting up a time-based queue capacity change: to configure this scenario, schedule-based policies are used. This is an alpha Apache feature.}
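
If you are not using the Ambari view, the fair ordering policy can also be set per queue in capacity-scheduler.xml; a minimal sketch for the hive1 queue above (property path assumed from the queue layout in this post):

  <property>
    <!-- All sessions running in the hive1 queue get equal resources -->
    <name>yarn.scheduler.capacity.root.hive1.ordering-policy</name>
    <value>fair</value>
  </property>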

Tez:
{Explanation: Container re-use. In a Tez session, containers are re-used even across DAGs. Containers, when not in use, are kept around for a configurable period before being released back to YARN's ResourceManager.}
Use the following settings in tez-site.xml to configure container reuse in Tez:
tez.session.am.dag.submit.timeout.secs=900 (15 min x 60 seconds per minute)
tez.am.session.min.held-containers=<number_minimum_containers_to_retain>
{Explanation: However, these settings apply globally to all jobs running in the cluster. To ensure that the settings apply to only one application, use separate tez-site.xml files on separate HiveServer2 nodes.}
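
As a sketch, the same container-reuse settings expressed as tez-site.xml properties (the held-containers value of 10 is an arbitrary placeholder, not a recommendation):

  <property>
    <!-- Keep an idle Tez session AM alive for 15 minutes before shutting it down -->
    <name>tez.session.am.dag.submit.timeout.secs</name>
    <value>900</value>
  </property>
  <property>
    <!-- Minimum number of containers the session holds on to between DAGs -->
    <name>tez.am.session.min.held-containers</name>
    <value>10</value>
  </property>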

tez.am.grouping.split.waves=1.7 (if the mappers do not have the correct parallelism, the number of tasks set for a vertex equals 1.7 times the available containers in the queue)

------------------------------------------------------------------------

Hive settings:
Go to $HIVE_HOME/conf and edit hive-site.xml:
$HIVE_HOME/conf> gedit hive-site.xml
hive.execution.engine=tez
hive.server2.tez.default.queues=hive1,hive2
hive.server2.tez.initialize.default.sessions=true
hive.server2.tez.sessions.per.default.queue={1 or more if needed}
hive.server2.enable.doAs=false
hive.tez.exec.print.summary=true
{Explanation: A default session is used for jobs that use HiveServer2 even if they do not use Tez; to enable default sessions, set hive.server2.tez.initialize.default.sessions to "true". When doAs is set to false, queries execute as the Hive user and not the end user, so multiple queries running as the Hive user can share resources; otherwise, YARN does not allow resources to be shared across different users. When the Hive user executes all of the queries, a Tez session opened for one query that is holding onto resources can use those resources for the next query without re-allocation.}
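
A minimal sketch of the same session settings as hive-site.xml properties:

  <property>
    <name>hive.execution.engine</name>
    <value>tez</value>
  </property>
  <property>
    <!-- Pre-warm Tez sessions on the two interactive queues -->
    <name>hive.server2.tez.default.queues</name>
    <value>hive1,hive2</value>
  </property>
  <property>
    <name>hive.server2.tez.initialize.default.sessions</name>
    <value>true</value>
  </property>
  <property>
    <!-- Raise this if more concurrent queries per queue are needed -->
    <name>hive.server2.tez.sessions.per.default.queue</name>
    <value>1</value>
  </property>
  <property>
    <!-- Run queries as the Hive user so Tez sessions and their resources can be shared -->
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
  </property>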
Check garbage collection time against the CPU time by either enabling hive.tez.exec.print.summary, or by checking the Tez UI.
hive.exec.orc.default.buffer.size=64KB (or increase the container size, when inserting into a table that has a large number of columns)
hive.optimize.sort.dynamic.partition=true (when inserting a large number of records into multiple partitions at the same time; if there are fewer than 10 partitions, set it to false)
hive.exec.dynamic.partition.mode=nonstrict (when inserting records dynamically instead of statically)
hive.server2.tez.default.queues=hive1,hive2,hive3,hive4,hive5
hive.server2.tez.sessions.per.default.queue=3 (if the number of concurrent users increases to 15, you might achieve better performance by using 5 queues with 3 sessions per queue)
hive.auto.convert.join.noconditionaltask.size determines whether a table is broadcast or shuffled for a join. If the small table is larger than hive.auto.convert.join.noconditionaltask.size, a shuffle join is used.
hive.exec.reducers.bytes.per.reducer=130 (lower this number to increase parallelism for shuffle joins on large tables)
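
A few of the tuning knobs above in hive-site.xml form, as a sketch (values are the examples from this section, not universal recommendations; note that hive.exec.orc.default.buffer.size takes bytes, so 64KB is written as 65536):

  <property>
    <!-- 64KB ORC buffer, expressed in bytes -->
    <name>hive.exec.orc.default.buffer.size</name>
    <value>65536</value>
  </property>
  <property>
    <name>hive.optimize.sort.dynamic.partition</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
  </property>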

 ++++++++++++++++++++++++++++++++++++++

Start the HiveServer2 service:
cli> su $HIVE_USER
/usr/lib/hive/bin> hiveserver2 -hiveconf hive.metastore.uris=" " -hiveconf hive.log.file=hiveserver2.log > $HIVE_LOG_DIR/hiveserver2.out 2> $HIVE_LOG_DIR/hiveserver2.log &
Connect to the new HiveServer2 instance by using Beeline and validate that it is running.
Open the Beeline command line shell and establish a connection to the server:
    /usr/bin/beeline> !connect jdbc:hive2://$hive.server.full.hostname:10000 $HIVE_USER password org.apache.hive.jdbc.HiveDriver

Run sample commands at the Beeline shell:

  0: jdbc:hive2://bivm.ibm.com:10000> show databases;