Showing posts with label save work for container failures.. Show all posts
Showing posts with label save work for container failures.. Show all posts

Sunday, May 22, 2016

YARN work preserving restarts for node manager and resource manager(YARN Work-Preserving Restarts)



YARN Work-Preserving Restarts
There are two types of workpreserving restarts: ResourceManager and NodeManager.


yarn.resourcemanager.recovery.enabled=true

yarn.resourcemanager.workpreserving-recovery.enabled =true

yarn.resourcemanager.store.class=org.apache.hadoop.yarn.server.resoucemanager.recovery.ZKRMStateStore

yarn.resourcemanager.am.max-attempts=2

yarn.resourcemanager.zkaddress=<host>:2181

yarn.resourcemanager.zkstate-store.parent-path =/rmstore

yarn.resourcemanager.zknum-retries=1000

yarn.resourcemanager.zkretry-interval-ms=1000

yarn.resourcemanager.zktimeout-ms=10000

yarn.resourcemanager.zk-acl= world:anyone:rwcda

yarn.nodemanager.recovery.enabled=true

yarn.nodemanager.recovery.dir = /var/log/hadoop-yarn/nodemanager/recovery-state

yarn.nodemanager.address= 0.0.0.0:45454

Another value than can be tuned but is not usually necessary:
yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms = 10000
(10,000 milliseconds = 10 seconds)
Determines how long the ResourceManager waits before allocating new containers on work-preserving recovery to give the NodeManagers time to “settle” before resuming application work.

A ResourceManager work-preserving restart allows for recovery from failure of the ResourceManager component for any reason (software or hardware failure) without the need to re-launch running applications.

NodeManager work-preserving restart allows for recovery from software-based failure and restart of the NodeManager component without the need to re-launch running containers or job tasks on that node, assuming no other components on the server were affected by whatever caused the NodeManager to restart.