Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
Programming - 葵花宝典
l*t
1
[Reposted from the NextGeneration board]
From: lazzycat (hardworking little bee), Board: NextGeneration
Subject: Is it okay to send a seven-month-old baby to daycare?
Posted at: BBS 未名空间站 (Fri May 14 00:05:46 2010, US Eastern)
Today I got a call from the daycare at my husband's university saying I can send our baby starting in late September, and it caught me completely off guard. I didn't expect it to happen this fast.
I joined the waitlist when I was seven months pregnant, because the wait was said to be two years. To my surprise our turn has already come up: the baby is not even three months old now and will be exactly seven months at the end of September, and in late September the paternal grandparents will take over from my parents and keep caring for the baby for another half year.
But if I pass on the spot I'll have to rejoin the waitlist, and who knows how long the next wait will be. My husband and I meet the university's low-income criteria, so the daycare would be free for us; money isn't the issue.
What worries me is that the baby is so young. Will the baby keep getting sick at daycare, or maybe get picked on by the bigger kids? I've also seen people on this board talk about teachers giving the babies cold milk, and my heart aches just thinking about it before we've even started. My husband, on the other hand, thinks the baby should get out early and socialize with other children.
Please give me some advice: send the baby or not?
k*i
2
My wife is on an H4 and will go back to China for a month this summer; she has been in the US for a year. My H1 job hasn't changed (it's last year's extension), and the green card application is still waiting on PERM. What should my wife pay attention to when getting her visa renewed for re-entry? Can she just go through the China CITIC Bank drop-box service?
Thanks!
S*U
3
"明" (clear knowing) means knowing the Four Noble Truths as they really are. Only an arahant's knowledge of the Four Noble Truths as they really are is called "higher wisdom" (增上慧); a stream-enterer's knowledge of the Four Noble Truths as they really are is "wisdom," not "higher wisdom."
Samyukta Agama, sutra 817:
Thus have I heard. At one time the Buddha was staying in Jeta's Grove, Anathapindika's Park, in Savatthi. Then the World-Honored One told the bhikkhus: "There are, further, three trainings. What are the three? The training in higher virtue, the training in higher mind, and the training in higher wisdom. What is the training in higher virtue? A bhikkhu abides in virtue, restrained by the Patimokkha, accomplished in conduct and resort, seeing danger in the slightest fault, undertaking and training in the precepts. What is the training in higher mind? A bhikkhu, secluded from sensual desire and unwholesome states, ... up to attaining and abiding in the fourth jhana. What is the training in higher wisdom? A bhikkhu knows the noble truth of suffering as it really is, and knows the noble truths of its origin, its cessation, and the path as they really are. This is called the training in higher wisdom."
Samyukta Agama, sutra 821:
Then the World-Honored One told the bhikkhus: "There are more than two hundred and fifty precepts, recited in sequence every half month as the Patimokkha sutra. If a son of good family wishes to train in them of his own accord, I teach him the three trainings; one who trains in these three thereby takes in all the training precepts. What are the three? The training in higher virtue, the training in higher mind, and the training in higher wisdom.
What is the training in higher virtue? A bhikkhu gives weight to virtue, and virtue is foremost; he does not give weight to concentration, and concentration is not foremost; he does not give weight to wisdom, and wisdom is not foremost. In each and every minor precept, ... up to undertaking and training in the precepts. Knowing thus and seeing thus, he cuts off the three fetters, namely identity view, attachment to rules and observances, and doubt; with greed, hatred, and delusion attenuated he attains the one-seed path. One who has not fully awakened at that stage is called a once-returner; one not yet there, a family-to-family attainer; one not yet there, a seven-more-existences attainer; one not yet there, a Dharma-follower; one not yet there, a faith-follower. This is called the training in higher virtue.
What is the training in higher mind? A bhikkhu gives weight to virtue, and virtue is foremost; gives weight to concentration, and concentration is foremost; but does not give weight to wisdom, and wisdom is not foremost. In each and every minor training precept, ... up to undertaking and training in the precepts. Knowing thus and seeing thus, he cuts off the five lower fetters, namely identity view, attachment to rules and observances, doubt, sensual desire, and ill will. Having cut off these five lower fetters, he can attain final nibbana in the interval; one who has not fully awakened at that stage attains nibbana on rebirth; one not yet there, nibbana without exertion; one not yet there, nibbana with exertion; one not yet there, goes upstream and attains nibbana there. This is called the training in higher mind.
What is the training in higher wisdom? A bhikkhu gives weight to virtue, and virtue is foremost; gives weight to concentration, and concentration is foremost; gives weight to wisdom, and wisdom is foremost. Knowing thus and seeing thus, his mind is liberated from the taint of sensual desire, liberated from the taint of existence, liberated from the taint of ignorance, and with the knowledge and vision of liberation he knows: birth is ended, the holy life has been fulfilled, what was to be done has been done, and he himself knows there is no further existence. This is called the training in higher wisdom."
w*z
4
https://aws.amazon.com/message/5467D2/?utm_content=buffere5a1e&utm_medium=social&utm_source=linkedin.com&utm_campaign=buffer
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
Early Sunday morning, September 20, we had a DynamoDB service event in the
US-East Region that impacted DynamoDB customers in US-East, as well as some
other services in the region. The following are some additional details on
the root cause, subsequent impact to other AWS services that depend on
DynamoDB, and corrective actions we’re taking.
Some DynamoDB Context
Among its many functions, DynamoDB stores and maintains tables for customers.
A single DynamoDB table is separated into partitions, each containing a
portion of the table’s data. These partitions are spread onto many servers,
both to provide consistent low-latency access and to replicate the data for
durability.
The specific assignment of a group of partitions to a given server is called
a “membership.” The membership of a set of table/partitions within a
server is managed by DynamoDB’s internal metadata service. The metadata
service is internally replicated and runs across multiple datacenters.
Storage servers hold the actual table data within a partition and need to
periodically confirm that they have the correct membership. They do this by
checking in with the metadata service and asking for their current
membership assignment. In response, the metadata service retrieves the list
of partitions and all related information from its own store, bundles this
up into a message, and transmits back to the requesting storage server. A
storage server will also get its membership assignment after a network
disruption or on startup. Once a storage server has completed processing its
membership assignment, it verifies that it has the table/partition data
locally stored, creates any new table/partitions assigned, and retrieves
data from other storage servers to replicate existing partitions assigned.
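To make the flow above concrete, here is a minimal sketch (in Python, with invented names such as `Membership`, `StorageServer`, and `reconcile`; this is not DynamoDB's actual code) of a storage server reconciling its local partitions against a membership assignment it has just received:

```python
# Illustrative sketch only: names like Membership and StorageServer are invented,
# not DynamoDB internals.
from dataclasses import dataclass, field


@dataclass
class Membership:
    assignments: dict  # partition_id -> set of replica server ids


@dataclass
class StorageServer:
    server_id: str
    local_partitions: set = field(default_factory=set)

    def reconcile(self, membership: Membership) -> None:
        """Apply a membership assignment received from the metadata service."""
        assigned = {pid for pid, replicas in membership.assignments.items()
                    if self.server_id in replicas}
        # Create any newly assigned partitions we don't hold locally
        # (in the real system, data would then be replicated from peer servers).
        for pid in assigned - self.local_partitions:
            self.local_partitions.add(pid)
        # Release partitions that are no longer assigned to this server.
        for pid in self.local_partitions - assigned:
            self.local_partitions.discard(pid)


if __name__ == "__main__":
    m = Membership(assignments={"p1": {"s1", "s2"}, "p2": {"s1", "s3"}})
    s = StorageServer("s1", local_partitions={"p2", "p9"})
    s.reconcile(m)
    print(sorted(s.local_partitions))  # ['p1', 'p2'] -- p9 released, p1 created
```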
The DynamoDB Event
On Sunday, at 2:19am PDT, there was a brief network disruption that impacted
a portion of DynamoDB’s storage servers. Normally, this type of networking
disruption is handled seamlessly and without change to the performance of
DynamoDB, as affected storage servers query the metadata service for their
membership, process any updates, and reconfirm their availability to accept
requests. If the storage servers aren’t able to retrieve this membership
data back within a specific time period, they will retry the membership
request and temporarily disqualify themselves from accepting requests.
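A minimal sketch of the renewal rule just described, assuming a hypothetical `fetch_membership` call to the metadata service and a server object like the one sketched above with an added `accepting_requests` flag; the timeout value is arbitrary and only illustrates the "retry and temporarily disqualify yourself" behavior:

```python
import time

MEMBERSHIP_TIMEOUT_S = 2.0  # arbitrary illustrative value, not DynamoDB's real limit


def renew_membership(server, fetch_membership):
    """Refresh membership; on timeout, stop serving requests and keep retrying.

    `fetch_membership` is a hypothetical metadata-service call assumed to raise
    TimeoutError when it cannot answer in time.
    """
    while True:
        start = time.monotonic()
        try:
            membership = fetch_membership(server.server_id, timeout=MEMBERSHIP_TIMEOUT_S)
        except TimeoutError:
            membership = None
        if membership is not None and time.monotonic() - start <= MEMBERSHIP_TIMEOUT_S:
            server.reconcile(membership)
            server.accepting_requests = True   # healthy again: back in the request path
            return membership
        # Could not confirm membership in time: disqualify ourselves and retry.
        server.accepting_requests = False
        time.sleep(0.1)  # naive fixed pacing; the event described below shows why this matters
```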
But, on Sunday morning, a portion of the metadata service responses exceeded
the retrieval and transmission time allowed by storage servers. As a result,
some of the storage servers were unable to obtain their membership data,
and removed themselves from taking requests. The reason these metadata
service requests were taking too long relates to a recent development in
DynamoDB. Over the last few months, customers have rapidly adopted a new
DynamoDB feature called Global Secondary Indexes (“GSIs”). GSIs allow
customers to access their table data using alternate keys. Because GSIs are
global, they have their own set of partitions on storage servers and
therefore increase the overall size of a storage server’s membership data.
Customers can add multiple GSIs for a given table, so a table with large
numbers of partitions could have its contribution of partition data to the
membership lists quickly double or triple. With rapid adoption of GSIs by a
number of customers with very large tables, the partitions-per-table ratio
increased significantly. This, in turn, increased the size of some storage
servers’ membership lists significantly. With a larger size, the processing
time inside the metadata service for some membership requests began to
approach the retrieval time allowed by storage servers. We did not have
detailed enough monitoring for this dimension (membership size), and didn’t
have enough capacity allocated to the metadata service to handle these much
heavier requests.
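The numbers below are invented purely to illustrate the scaling effect being described: each GSI carries its own partitions, so a table's contribution to a membership list grows roughly with partitions × (1 + number of GSIs).

```python
# Hypothetical numbers, only to show how GSIs multiply membership-list size.
base_partitions = 400        # partitions of one large table (made up)
gsi_count = 2                # two GSIs on that table
partitions_per_gsi = 400     # assume each GSI is partitioned similarly (made up)

membership_entries = base_partitions + gsi_count * partitions_per_gsi
print(membership_entries)    # 1200 -- triple the table's original contribution
```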
So, when the network disruption occurred on Sunday morning, and a number of
storage servers simultaneously requested their membership data, the metadata
service was processing some membership lists that were now large enough
that their processing time was near the time limit for retrieval. Multiple,
simultaneous requests for these large memberships caused processing to slow
further and eventually exceed the allotted time limit. This resulted in the
disrupted storage servers failing to complete their membership renewal,
becoming unavailable for requests, and retrying these requests. With the
metadata service now under heavy load, it also no longer responded as
quickly to storage servers uninvolved in the original network disruption,
who were checking their membership data in the normal cadence of when they
retrieve this information. Many of those storage servers also became
unavailable for handling customer requests. Unavailable servers continued to
retry requests for membership data, maintaining high load on the metadata
service. Though many storage servers’ renewal requests were succeeding,
healthy storage servers that had successfully processed a membership request
previously were having subsequent renewals fail and were transitioning back
to an unavailable state. By 2:37am PDT, the error rate in customer requests
to DynamoDB had risen far beyond any level experienced in the last 3 years,
finally stabilizing at approximately 55%.
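The feedback loop described above (timeouts produce retries, retries add load, more load produces more timeouts) can be shown with a toy calculation; every number here is invented and the load model is a crude approximation, not DynamoDB's real behavior.

```python
# Toy model of the retry feedback loop; all numbers are made up.
base_capacity = 100    # membership renewals/sec the metadata service serves when healthy
normal_demand = 80     # renewals/sec arriving in the normal cadence
burst = 60             # extra renewals triggered by the network disruption

failed = burst
for second in range(6):
    offered = normal_demand + failed            # normal cadence plus retries
    # Large memberships take longer to assemble under load, so effective
    # capacity drops as the backlog of retrying servers grows.
    capacity = base_capacity / (1 + failed / 100)
    served = min(offered, capacity)
    failed = offered - served                   # these servers stay unavailable and retry
    print(f"t={second}s offered={offered:.0f} capacity={capacity:.0f} still_failing={failed:.0f}")
```

Run it and the offered load climbs each second instead of draining, which is why pausing requests (described next) was needed to break the loop.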
Initially, we were unable to add capacity to the metadata service because it
was under such high load, preventing us from successfully making the
requisite administrative requests. After several failed attempts at adding
capacity, at 5:06am PDT, we decided to pause requests to the metadata
service. This action decreased retry activity, which relieved much of the
load on the metadata service. With the metadata service now able to respond
to administrative requests, we were able to add significant capacity. Once
these adjustments were made, we were able to reactivate requests to the
metadata service, put storage servers back into the customer request path,
and allow normal load back on the metadata service. At 7:10am PDT, DynamoDB
was restored to error rates low enough for most customers and AWS services
dependent on DynamoDB to resume normal operations.
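The recovery step described here amounts to an operator-controlled load-shedding switch in front of the metadata service. A minimal sketch of what such a switch could look like, with invented names; the post does not describe the actual mechanism AWS used:

```python
import threading


class MetadataFrontend:
    """Illustrative front end with an operator-controlled 'pause' switch."""

    def __init__(self):
        self._paused = threading.Event()

    def pause(self):
        """Operator action: shed membership traffic so recovery work can proceed."""
        self._paused.set()

    def resume(self):
        """Operator action: let normal membership traffic back in."""
        self._paused.clear()

    def handle_membership_request(self, server_id):
        if self._paused.is_set():
            # Fail fast instead of queueing more work; callers back off and retry later.
            raise RuntimeError("metadata service paused for recovery")
        return {"server": server_id, "partitions": []}   # placeholder lookup

    def handle_admin_request(self, action):
        # Administrative operations (e.g. adding capacity) are never shed.
        return f"admin action accepted: {action}"
```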
There’s one other bit worth mentioning. After we resolved the key issue on
Sunday, we were left with a low error rate, hovering between 0.15%-0.25%. We
knew there would be some cleanup to do after the event, and while this rate
was higher than normal, it wasn’t a rate that usually precipitates a
dashboard post or creates issues for customers. As Monday progressed, we
started to get more customers opening support cases about being impacted by
tables being stuck in the updating or deleting stage or higher than normal
error rates. We did not realize soon enough that this low overall error rate
was giving some customers disproportionately high error rates. It was
impacting a relatively small number of customers, but we should have posted
the green-i to the dashboard sooner than we did on Monday. The issue turned
out to be a metadata partition that was still not taking the amount of
traffic that it should have been taking. The team worked carefully and
diligently to restore that metadata partition to its full traffic volume,
and closed this out on Monday.
There are several actions we'll take immediately to avoid a recurrence of
Sunday's DynamoDB event. First, we have already significantly increased the
capacity of the metadata service. Second, we are instrumenting stricter
monitoring on performance dimensions, such as the membership size, to allow
us to thoroughly understand their state and proactively plan for the right
capacity. Third, we are reducing the rate at which storage nodes request
membership data and lengthening the time allowed to process queries. Finally
and longer term, we are segmenting the DynamoDB service so that it will
have many instances of the metadata service each serving only portions of
the storage server fleet. This will further contain the impact of software,
performance/capacity, or infrastructure failures.
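The last corrective action is essentially cell-based partitioning of the metadata service. One simple way to pin each storage server to a single metadata cell is deterministic hashing; the sketch below is a generic illustration, not how AWS actually segments the fleet:

```python
import hashlib

# Invented cell names; the point is only the deterministic mapping.
METADATA_CELLS = ["metadata-cell-1", "metadata-cell-2",
                  "metadata-cell-3", "metadata-cell-4"]


def cell_for_storage_server(server_id: str) -> str:
    """Pin each storage server to one metadata-service cell, so an overload or
    failure in one cell only affects its slice of the storage fleet."""
    digest = hashlib.sha256(server_id.encode("utf-8")).digest()
    return METADATA_CELLS[int.from_bytes(digest[:4], "big") % len(METADATA_CELLS)]


print(cell_for_storage_server("storage-node-0042"))
```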
Impact on Other Services
There are several other AWS services that use DynamoDB that experienced
problems during the event. Rather than list them all, which had similar
explanations for their status, we’ll list a few that customers most asked
us about or where the actions are more independent from DynamoDB’s
Correction of Errors (“COE”).
Simple Queue Service (SQS)
In the early stages of the DynamoDB event, the Amazon Simple Queue Service
was delivering slightly elevated errors and latencies. Amazon SQS uses an
internal table stored in DynamoDB to store information describing its queues.
While the queue information is cached within SQS, and is not in the direct
path for “send-message” and “receive-message” APIs, the caches are
refreshed frequently to accommodate creation, deletion, and reassignment
across infrastructure. When DynamoDB finished disabling traffic at 5:45am
PDT (to enable the metadata service to recover), the Simple Queue Service
was unable to read this data to refresh caches, resulting in significantly
elevated error rates. Once DynamoDB began re-enabling customer traffic at
7:10am PDT, the Simple Queue Service recovered. No data in queues, or
information describing queues was lost as a result of the event.
In addition to the actions being taken by the DynamoDB service, we will be
adjusting our SQS metadata caching to ensure that send and receive
operations continue even without prolonged access to the metadata table.
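One common way to get the behavior promised in that last sentence is a cache that keeps serving its last good copy when the backing store cannot be read. A generic serve-stale sketch, not SQS's actual implementation:

```python
import time


class ServeStaleCache:
    """Refresh from a backing store periodically, but keep serving the last
    good copy if the store (here, a queue-metadata table) cannot be read."""

    def __init__(self, loader, ttl_seconds=60):
        self._loader = loader          # function that reads the backing store
        self._ttl = ttl_seconds
        self._value = None
        self._loaded_at = 0.0

    def get(self):
        expired = time.monotonic() - self._loaded_at > self._ttl
        if self._value is None or expired:
            try:
                self._value = self._loader()
                self._loaded_at = time.monotonic()
            except Exception:
                if self._value is None:
                    raise                # nothing cached yet: surface the failure
                # Otherwise fall through and serve the stale copy.
        return self._value
```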
EC2 Auto Scaling
Between 2:15am PDT and 7:10am PDT, the EC2 Auto Scaling Service delivered
significantly increased API faults. From 7:10am PDT to 10:52am PDT, the Auto
Scaling service was substantially delayed in bringing new instances into
service, or terminating existing unhealthy instances. Existing instances
continued to operate properly throughout the event.
Auto Scaling stores information about its groups and launch configurations
in an internal table in DynamoDB. When DynamoDB began to experience elevated
error rates starting at 2:19am PDT, Auto Scaling could not update this
internal table when APIs were called. Once DynamoDB began recovery at 7:10am
PDT, the Auto Scaling APIs recovered. Recovery was incomplete at this time,
as a significant backlog of scaling activities had built up throughout the
event. The Auto Scaling service executes its launch and termination
activities in a background scheduling service. Throughout the event, a very
large amount of pending activities built up in this job scheduler and it
took until 10:52am PDT to complete all of these tasks.
In addition to the actions taken by the DynamoDB team, to ensure we can
recover quickly when a large backlog of scaling activities accumulate, we
will adjust the way we partition work on the fleet of Auto Scaling servers
to allow for more parallelism in processing these jobs, integrate mechanisms
to prune older scaling activities that have been superseded, and increase
the capacity available to process scaling activities.
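The "prune older scaling activities that have been superseded" idea can be illustrated with a small sketch: when several pending activities target the same Auto Scaling group, only the most recent desired state needs to be executed. The data shapes here are invented, not the real Auto Scaling scheduler:

```python
from collections import OrderedDict


def prune_superseded(pending):
    """Keep only the newest pending activity per Auto Scaling group.

    `pending` is a list of (group_name, desired_capacity) tuples, oldest first;
    a later entry for the same group supersedes the earlier ones.
    """
    latest = OrderedDict()
    for group, desired in pending:
        latest[group] = desired       # later entries overwrite earlier ones
        latest.move_to_end(group)     # order the survivors by recency
    return list(latest.items())


backlog = [("web-asg", 10), ("api-asg", 4), ("web-asg", 12), ("web-asg", 8)]
print(prune_superseded(backlog))      # [('api-asg', 4), ('web-asg', 8)]
```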
CloudWatch
Starting at 2:35am PDT, the Amazon CloudWatch Metrics Service began
experiencing delayed and missing EC2 Metrics along with slightly elevated
errors. CloudWatch uses an internal table stored in DynamoDB to add
information regarding Auto Scaling group membership to incoming EC2 metrics.
From 2:35am PDT to 5:45am PDT, the elevated DynamoDB failure rates caused
intermittent availability of EC2 metrics in CloudWatch. CloudWatch also
observed an abnormally low rate of metrics publication from other services
that were experiencing issues over this time period, further contributing to
missing or delayed metrics.
Then, from approximately 5:51am PDT to 7:10am PDT CloudWatch delivered
significantly elevated error rates for PutMetricData calls affecting all AWS
Service metrics and custom metrics. The impact was due to the significantly
elevated error rates in DynamoDB for the group membership additions
mentioned above. The CloudWatch Metrics Service was fully recovered at
7:29am PDT.
We understand how important metrics are, especially during an event. To
further increase the resilience of CloudWatch, we will adjust our caching
strategy for the DynamoDB group membership data and only require refresh for
the smallest possible set of metrics. We also have been developing faster
metrics delivery through write-through caching. This cache will provide the
ability to present metrics directly before persisting them and will, as a
side benefit, provide additional protection during an event.
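A write-through cache in this context means the metric is written to a cache that can serve reads at the same time it is sent to the durable store. A generic sketch under that reading; CloudWatch's real pipeline is not described in the post:

```python
class WriteThroughMetricsCache:
    """Write each metric to an in-memory cache and the durable store together,
    so reads can be served from the cache even if persistence lags or fails."""

    def __init__(self, durable_store):
        self._store = durable_store   # assumed to expose persist(name, value)
        self._cache = {}

    def put_metric(self, name, value):
        self._cache[name] = value     # readable immediately
        try:
            self._store.persist(name, value)
        except Exception:
            # Persistence can be retried later; the cache keeps serving reads.
            pass

    def get_metric(self, name):
        return self._cache.get(name)
```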
Console
The AWS Console was impacted for some customers from 5:45am PDT to 7:10am
PDT. Customers who were already logged into the Console would have continued
to remain connected. Customers attempting to log into the Console during
this period saw much higher latency in the login process. This was due to a
very long timeout being set on an API call that relied on DynamoDB. The API
call did not have to complete successfully to allow login to proceed but,
with the long timeout, it blocked progress for tens of seconds while it
waited to finish. It should have simply failed quickly and allowed progress
on login to continue.
The timeout had already been changed in a version of the login code that has
entered our test process. Unfortunately it wasn’t yet rolled out when the
event happened. We will make this change in the coming days. The reduced
timeout will mitigate any impact of latency in the API call on the Console.
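The fix described (a short timeout on a call that is not required for login to succeed) is a standard fail-open pattern. A minimal sketch with an invented `fetch_optional_profile_data` callable; the real Console code is of course not shown in the post:

```python
from concurrent.futures import ThreadPoolExecutor

OPTIONAL_CALL_TIMEOUT_S = 0.5               # short, illustrative timeout
_pool = ThreadPoolExecutor(max_workers=4)   # slow calls are abandoned, not awaited


def login(user, fetch_optional_profile_data):
    """Complete login even if the optional, DynamoDB-backed call is slow or failing."""
    future = _pool.submit(fetch_optional_profile_data, user)
    try:
        extras = future.result(timeout=OPTIONAL_CALL_TIMEOUT_S)
    except Exception:
        extras = None                       # fail open: proceed without the extra data
    return {"user": user, "extras": extras, "logged_in": True}
```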
Final Words
We apologize for the impact to affected customers. While we are proud of the
last three years of availability on DynamoDB (it’s effectively been 100%),
we know how critical this service is to customers, both because many use it
for mission-critical operations and because AWS services also rely on it.
For us, availability is the most important feature of DynamoDB, and we will
do everything we can to learn from the event and to avoid a recurrence in
the future.
d*u
5
If it were me, I would rather rejoin the waitlist than send the baby; no daycare.
A friend of mine has worked as a daycare teacher, and the things she told me convinced me that a child under one is better off being cared for at home. If the parents can't do it, the grandparents can, and that is certainly far better than daycare.
The main issue is that the child is simply too young. After two and a half, daycare becomes better than being looked after by the grandparents.
[Quoting l******t's post above]
c*k
6
Welcome to join the North America Chinese H4 mutual-help QQ group: 390895539
g*g
7
If they just ran a latency monkey against Dynamo now and then, a blowup this big wouldn't have happened.
[Quoting w**z's post above]
b*s
8
Exactly. Grandparents looking after a baby are so much more attentive; the baby gets far more language stimulation than at daycare, and they can take the baby out for walks in the stroller. What's so good about keeping a baby shut up in a daycare room?
[Quoting d**u's post above]
w*z
9
The client retry logic wasn't done right. I ran into something very similar once, at C*; in the end the only way to recover was to cut off all the clients. (A retry-backoff sketch follows below.)
[Quoting g*****g's post above]
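The retry-storm failure mode described in the AWS post, and in w**z's comment above, is usually tamed on the client side with capped exponential backoff plus jitter, so recovering services are not hit by synchronized retries. A generic sketch, not DynamoDB's actual storage-node or client code:

```python
import random
import time


def call_with_backoff(operation, max_attempts=8, base_delay=0.1, max_delay=30.0):
    """Retry `operation` with capped exponential backoff and full jitter, so a
    fleet of clients does not hammer a recovering service in lockstep."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))   # full jitter
```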
j*u
10
Although my kid started daycare at ten months, I still firmly believe it's better to wait until at least one and a half.
[Quoting l******t's post above]
x*1
11
A double fault: GSIs caused a geometric increase in metadata service queries, plus heavily partitioned tables, plus max burst enabled. The question is why the cache broke, and why capacity couldn't be added dynamically. DDB's failover design clearly has problems.
s*9
12
Could you share what your friend told you?
[Quoting d**u's post above]
x*1
13
For one thing, there was no emergency stop designed in: one switch to turn max burst off.
f*0
14
Our baby got a daycare spot off the waitlist at seven months, and in the end went only four days total (two weeks, two days a week). My parents were still here at the time, and we wanted the baby to start getting used to it. The weather was already quite warm by then, but after just two days at daycare the baby started having diarrhea; every time the baby went, the diarrhea would start again the next day, along with a constantly runny nose. In the end we decided not to send the baby anymore; we'd rather send the baby back to China for a while and bring the baby back after I graduate. The daycare I had found was one of the good ones, specially certified by the state, very well run and clean, but at the end of the day it's still daycare, and you can't expect the staff to look after your child as attentively as they would their own. If your family can take care of the baby, wait until the baby is older before sending; otherwise both the child and the adults get run ragged.
[Quoting l******t's post above]
V*8
16
I also thought about daycare at the beginning of the year, when our little one was just over nine months. In the end we kept her at home. I think the OP should also wait until after one and a half.
x*1
17
The cache broke because the on-call bounced the machines, which invalidated the cache; that's the moment the request storm formed.
The double fault of SN and RR is what led to this LSE.
g*g
18
Design mistakes and bugs of one kind or another are unavoidable; deliberately simulating outages and added latency in the production environment, in a controlled way, is an effective way to make a system more robust. (A fault-injection sketch follows below.)
[Quoting x*******1's post above]
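A minimal sketch of the kind of controlled fault injection discussed above: wrap a client call so that a small fraction of requests see extra latency or an error. This is a generic illustration, not Netflix's Latency Monkey or any actual AWS tooling:

```python
import random
import time


def inject_faults(call, latency_rate=0.01, error_rate=0.001, added_latency_s=2.0):
    """Wrap a client call so a small fraction of requests see extra latency or
    an injected error. Running this continuously (behind an off switch)
    exercises timeout and retry paths before a real event does."""
    def wrapped(*args, **kwargs):
        if random.random() < error_rate:
            raise RuntimeError("injected fault")
        if random.random() < latency_rate:
            time.sleep(added_latency_s)
        return call(*args, **kwargs)
    return wrapped

# Example: slow_get = inject_faults(table_get, latency_rate=0.05)
```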
x*1
19
S3 once had an LSE because a NIC was flipping bits, and the payload that happened to come through was the dynamic throttle rules. The infuriating part is that S3 didn't validate the rules. In the end.....
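The failure mode described (a corrupted rules payload applied without validation) is conventionally guarded against with an integrity check plus a basic shape check before applying. A generic sketch, not S3's actual mechanism:

```python
import hashlib
import json


def apply_throttle_rules(payload: bytes, expected_sha256: str, apply):
    """Check integrity and basic shape of a rules payload before applying it."""
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        raise ValueError("rules payload failed checksum; refusing to apply")
    rules = json.loads(payload)
    if not isinstance(rules, list) or not all(
        isinstance(r, dict) and "resource" in r and "limit" in r for r in rules
    ):
        raise ValueError("rules payload failed validation; refusing to apply")
    apply(rules)
```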