{"id":15530,"date":"2019-06-07T10:45:14","date_gmt":"2019-06-07T18:45:14","guid":{"rendered":"https:\/\/www.palada.net\/index.php\/2019\/06\/07\/news-9279\/"},"modified":"2019-06-07T10:45:14","modified_gmt":"2019-06-07T18:45:14","slug":"news-9279","status":"publish","type":"post","link":"http:\/\/www.palada.net\/index.php\/2019\/06\/07\/news-9279\/","title":{"rendered":"How a Google Cloud Catch-22 Broke the Internet"},"content":{"rendered":"<p><img decoding=\"async\" src=\"https:\/\/media.wired.com\/photos\/5cf98a7af2022442180c347d\/master\/pass\/Security%20-%20Feature%20Art%20-%20google%20cloud%20outage%20v2.jpg\"\/><\/p>\n<p><strong>Credit to Author: Brian Barrett| Date: Fri, 07 Jun 2019 16:26:01 +0000<\/strong><\/p>\n<p><span class=\"lede\">Five days ago, <\/span>the internet had a conniption. In broad patches around the globe, YouTube sputtered. Shopify stores shut down. Snapchat blinked out. And millions of people couldn\u2019t access their Gmail accounts. The disruptions all stemmed <a href=\"https:\/\/www.wired.com\/story\/google-needs-grow-cloud-business-carefully\/\">from Google Cloud<\/a>, which suffered a prolonged outage\u2014which also prevented Google engineers from pushing a fix. And so, for an entire afternoon and into the night, the internet was stuck in a crippling ouroboros: Google couldn\u2019t fix its cloud, because Google\u2019s cloud was broken.<\/p>\n<p>The root cause of the outage, as Google explained <a href=\"https:\/\/status.cloud.google.com\/incident\/cloud-networking\/19009\" target=\"_blank\">this week<\/a>, was fairly unremarkable. (And no, <a href=\"https:\/\/www.wired.com\/story\/facebook-down-dont-blame-hackers\/\">it wasn\u2019t hackers<\/a>.) At 2:45 pm ET on Sunday, the company initiated what should have been a routine configuration change, a maintenance event intended for a few servers in one geographic region. 
When that happens, Google routinely reroutes jobs those servers are running to other machines, like customers switching lines at Target when a register closes. Or sometimes, importantly, it just pauses those jobs until the maintenance is over.<\/p>\n<p class=\"paywall\">What happened next gets technically complicated\u2014a cascading combination of two misconfigurations and a software bug\u2014but had a simple upshot. Rather than that small cluster of servers blinking out temporarily, Google\u2019s automation software descheduled network control jobs in multiple locations. Think of the traffic running through Google\u2019s cloud like cars approaching the Lincoln Tunnel. In that moment, its capacity effectively went from six tunnels to two. The result: internet-wide gridlock.<\/p>\n<p class=\"paywall\">Still, even then, everything held steady for a couple of minutes. Google\u2019s network is designed to \u201cfail static,\u201d which means even after a control plane has been descheduled, it can function normally for a short period of time. It wasn\u2019t long enough. By 2:47 pm ET, that grace period had run out.<\/p>\n<p class=\"paywall\">In moments like this, not all traffic fails equally. Google has automated systems in place to ensure that when it starts sinking, the lifeboats fill up in a specific order. \u201cThe network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows,\u201d wrote Google vice president of engineering Benjamin Treynor Sloss in an incident <a href=\"https:\/\/cloud.google.com\/blog\/topics\/inside-google-cloud\/an-update-on-sundays-service-disruption\" target=\"_blank\">debrief<\/a>, \u201cmuch as urgent packages may be couriered by bicycle through even the worst traffic jam.\u201d See? Lincoln Tunnel.<\/p>\n<p class=\"paywall\">You can see how Google prioritized in the downtimes experienced by various services. 
According to Sloss, Google Cloud lost nearly a third of its traffic, which is why third parties like Shopify got nailed. YouTube lost 2.5 percent of views in a single hour. One percent of Gmail users ran into issues. And Google search skipped merrily along, at worst experiencing a barely perceptible slowdown in returning results.<\/p>\n<p class=\"paywall\">\u201cIf I type in a search and it doesn\u2019t respond right away, I\u2019m going to Yahoo or something,\u201d says Alex Henthorn-Iwane, vice president at digital experience monitoring company ThousandEyes. \u201cSo that was prioritized. It\u2019s latency-sensitive, and it happens to be the cash cow. That\u2019s not a surprising business decision to make on your network.\u201d<\/p>\n<p class=\"paywall\">But those decisions don\u2019t only apply to the sites and services you saw flailing last week. In those moments, Google has to triage among not just user traffic but also the network\u2019s control plane, which tells the network where to route traffic, and management traffic, which encompasses the sort of administrative tools that Google engineers would need to correct, say, a configuration problem that knocks a bunch of the internet offline.<\/p>\n<p class=\"paywall\">\u201cManagement traffic, because it can be quite voluminous, you\u2019re always careful. It\u2019s a little bit scary to prioritize that, because it can eat up the network if something wrong happens with your management tools,\u201d Henthorn-Iwane says. \u201cIt\u2019s kind of a Catch-22 that happens with network management.\u201d<\/p>\n<p class=\"paywall\">Which is exactly what played out on Sunday. Google says its engineers were aware of the problem within two minutes. And yet! 
\u201cDebugging the problem was significantly hampered by failure of tools competing over use of the now-congested network,\u201d the company wrote in a detailed <a href=\"https:\/\/status.cloud.google.com\/incident\/cloud-networking\/19009\" target=\"_blank\">postmortem<\/a>. \u201cFurthermore, the scope and scale of the outage, and collateral damage to tooling as a result of network congestion, made it initially difficult to precisely identify impact and communicate accurately with customers.\u201d<\/p>\n<p class=\"paywall\">That \u201cfog of war,\u201d as Henthorn-Iwane calls it, meant that Google didn\u2019t formulate a diagnosis until 6:01 pm ET, well over three hours after the trouble began. Another hour later, at 7:03 pm ET, it rolled out a new configuration to steady the ship. By 8:19 pm ET, the network started to recover; at 9:10 pm ET, it was back to business as usual.<\/p>\n<p class=\"paywall\">Google has taken some steps to ensure that a similar network brownout doesn\u2019t happen again. It took the automation software that deschedules jobs during maintenance offline, and says it won\u2019t bring it back until \u201cappropriate safeguards are in place\u201d to prevent a global incident. It has also lengthened the amount of time its systems stay in \u201cfail static\u201d mode, which will give Google engineers more time to fix problems before customers feel the impact.<\/p>\n<p class=\"paywall\">Still, it\u2019s unclear whether Google, or any cloud provider, can avoid collapses like this entirely. Networks don\u2019t have infinite capacity. They all make choices about what keeps working, and what doesn\u2019t, in times of stress. And what\u2019s remarkable about Google\u2019s cloud outage isn\u2019t the way the company prioritized, but that it has been so open and precise about what went wrong. 
Compare that to <a href=\"https:\/\/www.wired.com\/story\/facebook-down-dont-blame-hackers\/\">Facebook\u2019s hours of downtime<\/a> one day in March, which the company attributed to a \u201cserver configuration change that triggered a cascading series of issues,\u201d full stop.<\/p>\n<p class=\"paywall\">As always, take the latest cloud-based downtime as a reminder that much of what you experience as the internet lives in servers owned by a handful of companies, and that companies are run by humans, and that humans make mistakes, some of which can ripple out much further than seems anything close to reasonable.<\/p>\n","protected":false},"excerpt":{"rendered":"<p><img decoding=\"async\" src=\"https:\/\/media.wired.com\/photos\/5cf98a7af2022442180c347d\/master\/pass\/Security%20-%20Feature%20Art%20-%20google%20cloud%20outage%20v2.jpg\"\/><\/p>\n<p><strong>Credit to Author: Brian Barrett| Date: Fri, 07 Jun 2019 16:26:01 +0000<\/strong><\/p>\n<p>A Google Cloud outage that knocked huge portions of the internet offline also blocked access to the tools Google needed to fix 
it.<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"colormag_page_container_layout":"default_layout","colormag_page_sidebar_layout":"default_layout","footnotes":""},"categories":[10378,10607],"tags":[714,21357],"class_list":["post-15530","post","type-post","status-publish","format-standard","hentry","category-security","category-wired","tag-security","tag-security-security-news"],"_links":{"self":[{"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts\/15530","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/comments?post=15530"}],"version-history":[{"count":0,"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts\/15530\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/media?parent=15530"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/categories?post=15530"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/tags?post=15530"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}