
Sunday, May 22, 2016

Hive Queues, Sessions, Tez, and Hive Optimization Rules



To create queues within YARN depending on cluster size:
{Explanation: A container is the basic unit of processing capacity in YARN, and is an encapsulation of resource elements (for example, memory, CPU, and so on).}

Step 1: Go to $HADOOP_YARN_HOME and open capacity-scheduler.xml:
cli> cd $HADOOP_YARN_HOME
$HADOOP_YARN_HOME> cd etc/hadoop
etc/hadoop> vi capacity-scheduler.xml
{Explanation: Make changes as necessary. The configuration below allocates 50% of the cluster capacity to a default queue for batch jobs, and sets up two queues for interactive Hive queries, each assigned 25% of cluster resources, as shown below:}
yarn.scheduler.capacity.root.queues=default,hive1,hive2
yarn.scheduler.capacity.root.default.capacity=50
yarn.scheduler.capacity.root.hive1.capacity=25
yarn.scheduler.capacity.root.hive2.capacity=25
yarn.scheduler.capacity.root.default.maximum-capacity={ 50 or 100}
yarn.scheduler.capacity.root.default.user-limit-factor={ 1 or 2}
{Explanation: Setting maximum-capacity=50 restricts queue users to 50% of the queue capacity as a hard limit. If maximum-capacity is greater than 50%, the queue can use more than its configured capacity when there are idle resources elsewhere in the cluster. However, any single user can still use only the configured queue capacity. The default value of "1" for user-limit-factor means that any single user in the queue can occupy at most 1X the queue's configured capacity.}
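
For reference, here is a minimal sketch of how the same settings look inside the <configuration> element of capacity-scheduler.xml (the maximum-capacity and user-limit-factor values shown are just one of the choices discussed above):

  <!-- One default batch queue (50%) and two interactive Hive queues (25% each) -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,hive1,hive2</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.hive1.capacity</name>
    <value>25</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.hive2.capacity</name>
    <value>25</value>
  </property>
  <property>
    <!-- Hard cap at 50; set closer to 100 to let the queue borrow idle cluster capacity -->
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>50</value>
  </property>
  <property>
    <!-- A single user may occupy at most 1X the queue's configured capacity -->
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>1</value>
  </property>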
Once the configuration changes are done, save capacity-scheduler.xml and refresh the queue settings for them to take effect:

$HADOOP_YARN_HOME/bin> yarn rmadmin -refreshQueues
{Explanation:
YARN single queue with fair scheduling: the "fair scheduling" ordering policy in YARN was introduced in HDP 2.3. Fair scheduling enables all sessions running within a queue to get equal resources. It is specified by setting the Ordering policy to fair in the Capacity Scheduler View in Ambari (see the sketch below for the equivalent capacity-scheduler.xml property).
Setting up a time-based queue capacity change: to configure this scenario, schedule-based policies are used. This is an alpha Apache feature.}
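
If you are not using the Ambari view, the fair ordering policy can also be set per queue in capacity-scheduler.xml; a minimal sketch for the hive1 queue above (property path assumed from the queue layout in this post):

  <property>
    <!-- All sessions running in the hive1 queue get equal resources -->
    <name>yarn.scheduler.capacity.root.hive1.ordering-policy</name>
    <value>fair</value>
  </property>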

Tez:
{Explanation: Container re-use. In a Tez session, containers are re-used even across DAGs. Containers, when not in use, are kept around for a configurable period before being released back to YARN's ResourceManager.}
Use the following settings in tez-site.xml to configure container reuse in Tez:
tez.session.am.dag.submit.timeout.secs=900 (15 min x 60 seconds per minute)
tez.am.session.min.held-containers=<number_minimum_containers_to_retain>
{Explanation: However, these settings apply globally to all jobs running in the cluster. To ensure that the settings apply to only one application, use separate tez-site.xml files on separate HiveServer2 nodes.}
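
As a sketch, the same container-reuse settings expressed as tez-site.xml properties (the held-containers value of 10 is an arbitrary placeholder, not a recommendation):

  <property>
    <!-- Keep an idle Tez session AM alive for 15 minutes before shutting it down -->
    <name>tez.session.am.dag.submit.timeout.secs</name>
    <value>900</value>
  </property>
  <property>
    <!-- Minimum number of containers the session holds on to between DAGs -->
    <name>tez.am.session.min.held-containers</name>
    <value>10</value>
  </property>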

tez.am.grouping.split.waves=1.7 (if the mappers do not have the correct parallelism, the number of tasks set for a vertex equals 1.7 times the available containers in the queue)

------------------------------------------------------------------------

Hive settings:
Go to $HIVE_HOME/conf and edit hive-site.xml:
$HIVE_HOME/conf> gedit hive-site.xml
hive.execution.engine=tez
hive.server2.tez.default.queues=hive1,hive2
hive.server2.tez.initialize.default.sessions=true
hive.server2.tez.sessions.per.default.queue={1 or more if needed}
hive.server2.enable.doAs=false
hive.tez.exec.print.summary=true
{Explanation: A default session is used for jobs that use HiveServer2 even if they do not use Tez; to enable default sessions, set hive.server2.tez.initialize.default.sessions to "true". When doAs is set to false, queries execute as the Hive user and not the end user, so multiple queries running as the Hive user can share resources; otherwise, YARN does not allow resources to be shared across different users. When the Hive user executes all of the queries, a Tez session opened for one query that is holding onto resources can use those resources for the next query without re-allocation.}
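
A minimal sketch of the same session settings as hive-site.xml properties:

  <property>
    <name>hive.execution.engine</name>
    <value>tez</value>
  </property>
  <property>
    <!-- Pre-warm Tez sessions on the two interactive queues -->
    <name>hive.server2.tez.default.queues</name>
    <value>hive1,hive2</value>
  </property>
  <property>
    <name>hive.server2.tez.initialize.default.sessions</name>
    <value>true</value>
  </property>
  <property>
    <!-- Raise this if more concurrent queries per queue are needed -->
    <name>hive.server2.tez.sessions.per.default.queue</name>
    <value>1</value>
  </property>
  <property>
    <!-- Run queries as the Hive user so Tez sessions and their resources can be shared -->
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
  </property>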
Check garbage collection time against the CPU time by either enabling hive.tez.exec.print.summary, or by checking the Tez UI.
hive.exec.orc.default.buffer.size=64KB (or increase the container size, when inserting into a table that has a large number of columns)
hive.optimize.sort.dynamic.partition=true (when inserting a large number of records into multiple partitions at the same time; if there are fewer than 10 partitions, set it to false)
hive.exec.dynamic.partition.mode=nonstrict (when inserting records dynamically instead of statically)
hive.server2.tez.default.queues=hive1,hive2,hive3,hive4,hive5
hive.server2.tez.sessions.per.default.queue=3 (if the number of concurrent users increases to 15, you might achieve better performance by using 5 queues with 3 sessions per queue)
hive.auto.convert.join.noconditionaltask.size determines whether a table is broadcast or shuffled for a join. If the small table is larger than hive.auto.convert.join.noconditionaltask.size, a shuffle join is used.
hive.exec.reducers.bytes.per.reducer=130 (lower this number to increase parallelism for shuffle joins on large tables)
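
A few of the tuning knobs above in hive-site.xml form, as a sketch (values are the examples from this section, not universal recommendations; note that hive.exec.orc.default.buffer.size takes bytes, so 64KB is written as 65536):

  <property>
    <!-- 64KB ORC buffer, expressed in bytes -->
    <name>hive.exec.orc.default.buffer.size</name>
    <value>65536</value>
  </property>
  <property>
    <name>hive.optimize.sort.dynamic.partition</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
  </property>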

 ++++++++++++++++++++++++++++++++++++++

Start the HiveServer2 service:
cli> su $HIVE_USER
/usr/lib/hive/bin> hiveserver2 -hiveconf hive.metastore.uris=" " -hiveconf hive.log.file=hiveserver2.log > $HIVE_LOG_DIR/hiveserver2.out 2> $HIVE_LOG_DIR/hiveserver2.log &
Connect to the new HiveServer2 instance by using Beeline and validate that it is running.
Open the Beeline command line shell and establish a connection to the server:
    /usr/bin/beeline> !connect jdbc:hive2://$hive.server.full.hostname:10000 $HIVE_USER password org.apache.hive.jdbc.HiveDriver

Run sample commands at the Beeline shell:

  0: jdbc:hive2://bivm.ibm.com:10000> show databases;