Sunday, May 22, 2016

YARN work-preserving restarts for NodeManager and ResourceManager



YARN Work-Preserving Restarts
There are two types of work-preserving restarts: ResourceManager and NodeManager. The following properties enable both:


yarn.resourcemanager.recovery.enabled=true

yarn.resourcemanager.work-preserving-recovery.enabled=true

yarn.resourcemanager.store.class=org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore

yarn.resourcemanager.am.max-attempts=2

yarn.resourcemanager.zk-address=<host>:2181

yarn.resourcemanager.zk-state-store.parent-path=/rmstore

yarn.resourcemanager.zk-num-retries=1000

yarn.resourcemanager.zk-retry-interval-ms=1000

yarn.resourcemanager.zk-timeout-ms=10000

yarn.resourcemanager.zk-acl=world:anyone:rwcda

yarn.nodemanager.recovery.enabled=true

yarn.nodemanager.recovery.dir=/var/log/hadoop-yarn/nodemanager/recovery-state

yarn.nodemanager.address=0.0.0.0:45454
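
For reference, here is a rough sketch of how the settings above could look as entries in yarn-site.xml. The values are the ones listed in this post. `<host>` is still a placeholder for your own ZooKeeper host, and the comments are my reading of each setting rather than official descriptions.

```xml
<configuration>
  <!-- ResourceManager: enable recovery and keep running applications alive across a restart -->
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
    <value>true</value>
  </property>
  <!-- Store ResourceManager state in ZooKeeper -->
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
  <property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>2</value>
  </property>
  <!-- Placeholder: replace <host> with your ZooKeeper host -->
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>&lt;host&gt;:2181</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-state-store.parent-path</name>
    <value>/rmstore</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-num-retries</name>
    <value>1000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-retry-interval-ms</name>
    <value>1000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-timeout-ms</name>
    <value>10000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-acl</name>
    <value>world:anyone:rwcda</value>
  </property>

  <!-- NodeManager: enable recovery and choose a local directory for recovery state -->
  <property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.recovery.dir</name>
    <value>/var/log/hadoop-yarn/nodemanager/recovery-state</value>
  </property>
  <!-- A fixed port (not an ephemeral one) so the NodeManager comes back on the same address -->
  <property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:45454</value>
  </property>
</configuration>
```

Note that `world:anyone:rwcda` leaves the state store open to any ZooKeeper client, so a tighter ACL is probably worth considering outside of a test cluster.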

Another value that can be tuned, although this is not usually necessary:
yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms = 10000
(10,000 milliseconds = 10 seconds)
This determines how long the ResourceManager waits before allocating new containers after a work-preserving recovery, giving the NodeManagers time to “settle” before application work resumes.
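
As a yarn-site.xml entry, this wait setting might look like the sketch below. The value shown is the 10-second figure from above, and it goes inside the existing `<configuration>` element.

```xml
<!-- Wait 10 seconds after recovery before the ResourceManager allocates new containers -->
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms</name>
  <value>10000</value>
</property>
```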

A ResourceManager work-preserving restart allows for recovery from failure of the ResourceManager component for any reason (software or hardware failure) without the need to re-launch running applications.

A NodeManager work-preserving restart allows for recovery from a software-based failure and restart of the NodeManager component without the need to re-launch running containers or job tasks on that node, assuming no other components on the server were affected by whatever caused the NodeManager to restart.
