YARN Work-Preserving Restarts
There are two types of work-preserving restarts: ResourceManager and NodeManager.
yarn.resourcemanager.recovery.enabled=true
yarn.resourcemanager.work-preserving-recovery.enabled=true
yarn.resourcemanager.store.class=org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
yarn.resourcemanager.am.max-attempts=2
yarn.resourcemanager.zk-address=<host>:2181
yarn.resourcemanager.zk-state-store.parent-path=/rmstore
yarn.resourcemanager.zk-num-retries=1000
yarn.resourcemanager.zk-retry-interval-ms=1000
yarn.resourcemanager.zk-timeout-ms=10000
yarn.resourcemanager.zk-acl=world:anyone:rwcda
yarn.nodemanager.recovery.enabled=true
yarn.nodemanager.recovery.dir=/var/log/hadoop-yarn/nodemanager/recovery-state
yarn.nodemanager.address=0.0.0.0:45454
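The settings above are written as key=value pairs, while Hadoop reads them from yarn-site.xml as <property> elements with <name> and <value> children. A minimal sketch of that conversion follows, assuming Python's standard library; the function name to_yarn_site_xml and the SETTINGS dictionary are illustrative and not part of Hadoop, and <host> remains a placeholder that must be replaced with a real ZooKeeper host.

```python
import xml.etree.ElementTree as ET

# Values copied from the list above. "<host>" is a placeholder and must be
# replaced with the real ZooKeeper host before the file is used.
SETTINGS = {
    "yarn.resourcemanager.recovery.enabled": "true",
    "yarn.resourcemanager.work-preserving-recovery.enabled": "true",
    "yarn.resourcemanager.store.class":
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore",
    "yarn.resourcemanager.am.max-attempts": "2",
    "yarn.resourcemanager.zk-address": "<host>:2181",
    "yarn.resourcemanager.zk-state-store.parent-path": "/rmstore",
    "yarn.resourcemanager.zk-num-retries": "1000",
    "yarn.resourcemanager.zk-retry-interval-ms": "1000",
    "yarn.resourcemanager.zk-timeout-ms": "10000",
    "yarn.resourcemanager.zk-acl": "world:anyone:rwcda",
    "yarn.nodemanager.recovery.enabled": "true",
    "yarn.nodemanager.recovery.dir":
        "/var/log/hadoop-yarn/nodemanager/recovery-state",
    "yarn.nodemanager.address": "0.0.0.0:45454",
}


def to_yarn_site_xml(settings):
    """Render name/value pairs as a Hadoop <configuration> document (illustrative helper)."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # ET.indent exists only on Python 3.9+; skip pretty-printing on older versions.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_yarn_site_xml(SETTINGS))
```

The output is a fragment to merge into an existing yarn-site.xml rather than a complete cluster configuration. ElementTree escapes the angle brackets of the <host> placeholder, which is another reason to substitute a real host name first.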
Another value that can be tuned, although doing so is not usually necessary:
yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms=10000
(10,000 milliseconds = 10 seconds)
Determines how long the ResourceManager waits before allocating new containers after a work-preserving recovery, giving the NodeManagers time to “settle” before application work resumes.
A ResourceManager work-preserving restart allows recovery from a failure of the ResourceManager component for any reason (software or hardware failure) without the need to relaunch running applications.
A NodeManager work-preserving restart allows recovery from a software-based failure and restart of the NodeManager component without the need to relaunch running containers or job tasks on that node, assuming no other components on the server were affected by whatever caused the NodeManager to restart.
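Because the two restart types are switched on by different properties, it can help to check which of them a given yarn-site.xml enables explicitly. The sketch below is one way to do that with Python's standard library; load_properties and restart_types_enabled are hypothetical helper names, and the check treats any unset property as disabled, which may not match the built-in defaults of a particular Hadoop release.

```python
import xml.etree.ElementTree as ET

# Properties that must be "true" for each restart type, per the settings listed earlier.
RM_KEYS = (
    "yarn.resourcemanager.recovery.enabled",
    "yarn.resourcemanager.work-preserving-recovery.enabled",
)
NM_KEYS = ("yarn.nodemanager.recovery.enabled",)


def load_properties(xml_text):
    """Parse a Hadoop <configuration> document into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def restart_types_enabled(props):
    """Report which restart types are explicitly enabled (unset counts as disabled)."""
    def all_true(keys):
        return all(props.get(key, "").lower() == "true" for key in keys)

    return {
        "resourcemanager": all_true(RM_KEYS),
        "nodemanager": all_true(NM_KEYS),
    }


# Small made-up sample: ResourceManager recovery on, NodeManager recovery off.
SAMPLE = """<configuration>
  <property><name>yarn.resourcemanager.recovery.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.work-preserving-recovery.enabled</name><value>true</value></property>
  <property><name>yarn.nodemanager.recovery.enabled</name><value>false</value></property>
</configuration>"""

if __name__ == "__main__":
    print(restart_types_enabled(load_properties(SAMPLE)))
```

This only inspects the file contents. It does not confirm that the state store is reachable or that the NodeManager recovery directory exists and is writable.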