How does your company upgrade its Hadoop cluster? Here is my question.
Rolling Upgrade Hadoop Cluster Question
In our company, one of the main Hadoop clusters (HDP) has about 600 nodes. It
gets upgraded almost monthly, plus some other maintenance. Each time, this
takes from hours to a couple of days, and all apps running on it have to be
shut off. I just cannot imagine that clusters doing such important work at
other companies get interrupted so often and for so long. I asked why we
don't do rolling upgrades. Here is the answer from one of our main
architects. Is it true? How are upgrades handled in your company?
================================================
Regarding rolling upgrades, I want to be careful that everyone
understands what happens during this process. Up to 12 nodes per hour get
upgraded to the next version of HDP. As this process continues, with each
passing hour the capacity of the cluster is reduced by the number of nodes
that get completed. When the cluster gets in the neighborhood of 75%, a
restart is required for most of the services. The core services, such as
MapReduce, HDFS, NameNode HA, ResourceManager HA, ZooKeeper, and Hive HA if
it is configured, are kept up during the upgrade. Spark, Kafka, Storm and
the other services are not covered by the no-downtime rolling upgrade. Express
upgrade has allowed our team to upgrade the clusters in a much faster
timeframe. The last upgrade of the cluster took 5 hours. I believe the
downtime you stated above, 2 days and 4 hours, would not be correct
for the actual HDP downtime. This is likely the entire maintenance window,
which would include Ambari upgrades, HDP upgrades, stopping jobs, sanity
checks, and restarting all of the jobs to catch up on batch processing. I
would suggest that your team follow the notifications that will be sent out
and stop your jobs when the upgrade starts, which will be on Saturday
morning. When the upgrade is completed, another notification will be sent
out and you will be able to start your jobs again.
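
As a rough sanity check on the numbers in the reply above, here is a small sketch of how long a rolling upgrade would take at the quoted rate. It assumes a constant 12 nodes per hour (the reply says "up to 12", so this is the best case) and about 600 nodes, and it ignores service restarts and sanity checks. The function name `rolling_upgrade_hours` is made up for illustration.

```python
import math


def rolling_upgrade_hours(total_nodes: int, nodes_per_hour: int) -> int:
    """Whole hours needed to upgrade every node at a fixed hourly rate.

    Assumption: the rate is constant and nodes are upgraded in hourly
    batches, so a partial last batch still costs a full hour.
    """
    if total_nodes < 0 or nodes_per_hour <= 0:
        raise ValueError("total_nodes must be >= 0 and nodes_per_hour > 0")
    return math.ceil(total_nodes / nodes_per_hour)


# Figures taken from the post: ~600 nodes, up to 12 nodes per hour.
hours = rolling_upgrade_hours(600, 12)
print(f"{hours} hours (~{hours / 24:.1f} days)")  # prints "50 hours (~2.1 days)"
```

So at that rate a full rolling upgrade would run for roughly two days, compared with the 5 hours quoted for the last express upgrade. That seems to be the trade-off the architect is describing: a long upgrade with core services up, versus a short one with everything down.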