To create queues within YARN depending on cluster size
{Explanation: A container is the basic unit of processing capacity in YARN, and is an encapsulation of resource elements (for example, memory, CPU, and so on).}
Step 1: Go to $HADOOP_YARN_HOME and open capacity-scheduler.xml:
cli> cd $HADOOP_YARN_HOME
$HADOOP_YARN_HOME> cd /etc/hadoop
/etc/hadoop> vi capacity-scheduler.xml
{Explanation: Make changes as necessary. Set up a configuration that allocates 50% of the cluster capacity to a default queue for batch jobs, and two queues for interactive Hive queries, with each assigned 25% of cluster resources, as shown below:}
yarn.scheduler.capacity.root.queues=default,hive1,hive2
yarn.scheduler.capacity.root.default.capacity=50
yarn.scheduler.capacity.root.hive1.capacity=25
yarn.scheduler.capacity.root.hive2.capacity=25
yarn.scheduler.capacity.root.default.maximum-capacity={ 50 or 100}
yarn.scheduler.capacity.root.default.user-limit-factor={ 1 or 2}
{Explanation: Setting maximum-capacity = 50 restricts queue users to 50% of the queue capacity as a hard limit. If maximum-capacity is set greater than 50%, the queue can use more than its configured capacity when there are idle resources elsewhere in the cluster. However, any single user can still use only the configured queue capacity. The default value of "1" for user-limit-factor means that any single user in the queue can occupy at most 1x the queue's configured capacity.}
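For reference, these key=value pairs live as <property> entries in capacity-scheduler.xml. A minimal sketch of the XML form, using the example values above (adjust capacities to your own cluster):

<property>
  <!-- Child queues under the root queue -->
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,hive1,hive2</value>
</property>
<property>
  <!-- 50% of cluster capacity for the default (batch) queue -->
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>50</value>
</property>
<property>
  <!-- Hard limit: the default queue cannot grow beyond 50% -->
  <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
  <value>50</value>
</property>
<property>
  <!-- A single user may occupy at most 1x the queue's configured capacity -->
  <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
  <value>1</value>
</property>
<property>
  <!-- 25% each for the two interactive Hive queues -->
  <name>yarn.scheduler.capacity.root.hive1.capacity</name>
  <value>25</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.hive2.capacity</name>
  <value>25</value>
</property>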
Once the configuration changes are done, save the file and refresh the queues so that the new settings in capacity-scheduler.xml take effect:
$HADOOP_YARN_HOME/bin> yarn rmadmin -refreshQueues
{Explanation: YARN single queue with fair scheduling. The "fair scheduling" policy within a queue was introduced in HDP 2.3. Fair scheduling enables all sessions running within a queue to get equal resources. Fair scheduling is specified by setting the Ordering Policy to fair in the Capacity Scheduler View in Ambari (see the sketch below).
Setting up time-based queue capacity change: to configure this scenario, schedule-based policies are used. This is an alpha Apache feature.}
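Outside of Ambari, the same ordering policy can be expressed directly in capacity-scheduler.xml through the per-queue ordering-policy property; a minimal sketch for the hive1 queue from the example above, assuming a Capacity Scheduler version (HDP 2.3 or later) that supports the fair ordering policy:

<property>
  <!-- Share the queue's resources equally among all sessions running in hive1 -->
  <name>yarn.scheduler.capacity.root.hive1.ordering-policy</name>
  <value>fair</value>
</property>

Run yarn rmadmin -refreshQueues again after changing this.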
Tez:
{Explanation: Container re-use. In a Tez session, containers are re-used even across DAGs. Containers, when not in use, are kept around for a configurable period before being released back to YARN's ResourceManager.}
Use the following settings in tez-site.xml to configure container reuse in Tez:
tez.session.am.dag.submit.timeout.secs=900 (15 minutes x 60 seconds per minute)
tez.am.session.min.held-containers=<number_minimum_containers_to_retain>
{Explanation: However, these settings apply globally to all jobs running in the cluster. To ensure that the settings apply to only one application, you must use separate tez-site.xml files on separate HiveServer2 nodes.}
tez.grouping.split-waves=1.7 (if the mappers do not have the correct parallelism, the number of tasks set for a vertex is 1.7 times the number of available containers in the queue)
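The container-reuse settings above look like the following in tez-site.xml; a minimal sketch in which the held-containers value of 10 is purely illustrative, and tez.am.container.reuse.enabled is shown only for completeness (it is normally true by default):

<property>
  <!-- Re-use containers across tasks and DAGs; normally already true by default -->
  <name>tez.am.container.reuse.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Keep the session AM waiting 15 minutes (900 s) for the next DAG before timing out -->
  <name>tez.session.am.dag.submit.timeout.secs</name>
  <value>900</value>
</property>
<property>
  <!-- Minimum number of idle containers to hold between DAGs; 10 is an illustrative value -->
  <name>tez.am.session.min.held-containers</name>
  <value>10</value>
</property>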
------------------------------------------------------------------------
Hive settings:
Go to $HIVE_HOME/etc/conf> gedit hive-site.xml
hive.execution.engine=tez
hive.server2.tez.default.queues=hive1,hive2
hive.server2.tez.initialize.default.sessions=true
hive.server2.tez.sessions.per.default.queue={1 or more if needed}
hive.server2.enable.doAs=false
hive.tez.exec.print.summary=true
{Explanation: A default session is used for jobs that use HiveServer2 even if they do not use Tez. To enable default sessions, set hive.server2.tez.initialize.default.sessions to "true". When doAs is set to false, queries execute as the Hive user and not the end user. When multiple queries run as the Hive user, they can share resources; otherwise, YARN does not allow resources to be shared across different users. When the Hive user executes all of the queries, a Tez session that was opened for one query and is still holding onto resources can use those resources for the next query without re-allocation.}
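In hive-site.xml these settings take the usual <property> form; a minimal sketch of the queue-related entries, reusing the hive1/hive2 queue names from the example above:

<property>
  <!-- Run Hive queries on the Tez execution engine -->
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
<property>
  <!-- Queues in which HiveServer2 pre-launches Tez sessions -->
  <name>hive.server2.tez.default.queues</name>
  <value>hive1,hive2</value>
</property>
<property>
  <!-- Pre-initialize Tez sessions so HiveServer2 jobs get a default session -->
  <name>hive.server2.tez.initialize.default.sessions</name>
  <value>true</value>
</property>
<property>
  <!-- One pre-warmed session per queue; raise if needed -->
  <name>hive.server2.tez.sessions.per.default.queue</name>
  <value>1</value>
</property>
<property>
  <!-- Run queries as the hive user so Tez sessions can share resources -->
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>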
Check garbage collection time against the CPU time either by enabling hive.tez.exec.print.summary, or by checking the Tez UI.
hive.exec.orc.default.buffer.size = 64KB (or increase the container size, when inserting into a table that has a large number of columns)
hive.optimize.sort.dynamic.partition = true (when inserting a large number of rows into many partitions at the same time; with fewer than 10 partitions, set it to false)
hive.exec.dynamic.partition.mode=nonstrict (when inserting records into partitions dynamically instead of statically)
hive.server2.tez.default.queues=hive1,hive2,hive3,hive4,hive5
hive.server2.tez.sessions.per.default.queue=3 (if the number of concurrent users increases to 15, you might achieve better performance by using 5 queues with 3 sessions per queue)
hive.auto.convert.join.noconditionaltask.size determines whether a table is broadcast or shuffled for a join. If the small table is larger than hive.auto.convert.join.noconditionaltask.size, a shuffle join is used.
hive.exec.reducers.bytes.per.reducer = 130 (lower this number to increase parallelism for shuffle joins on large tables)
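Both of these join/reducer thresholds are byte values in hive-site.xml; a minimal sketch with illustrative numbers close to the common defaults (not the values quoted in this post), to make the units explicit:

<property>
  <!-- Small tables below this size (in bytes) are broadcast for a map join; larger ones trigger a shuffle join -->
  <name>hive.auto.convert.join.noconditionaltask.size</name>
  <value>10000000</value>
</property>
<property>
  <!-- Bytes of input handled per reducer; lower it to get more reducers and more parallelism -->
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>256000000</value>
</property>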
++++++++++++++++++++++++++++++++++++++
Start the HiveServer2 service:
cli> su $HIVE_USER
/usr/lib/hive/bin> hiveserver2 -hiveconf hive.metastore.uris=" " -hiveconf hive.log.file=hiveserver2.log >$HIVE_LOG_DIR/hiveserver2.out 2>$HIVE_LOG_DIR/hiveserver2.log &
Connect to the new HiveServer2 instance by using Beeline and validate that it is running.
Open the Beeline command line shell to interact with HiveServer2 and establish a connection to the server:
/usr/bin/beeline> !connect jdbc:hive2://$hive.server.full.hostname:10000 $HIVE_USER password org.apache.hive.jdbc.HiveDriver
Run sample commands at the Beeline shell:
0: jdbc:hive2://bivm.ibm.com:10000> show databases;