Yesterday, a significant internet outage affected many users, with major services like Google and Cloudflare experiencing downtime for several hours. This event highlights the fragility of our interconnected digital world and prompts a look at the underlying causes and potential business impacts.
Key Takeaways
- The outage was likely caused by a change in Google’s API quotas, which cascaded to Cloudflare, impacting a large portion of internet traffic.
- The internet’s design, while robust, is increasingly reliant on a few major providers, increasing systemic risk.
- Businesses need to plan for internet disruptions, considering communication, operations, and resilience strategies.
Understanding Internet Outages
Internet outages, especially large-scale ones, usually stem from one of two main issues. The first is the Domain Name System (DNS). Think of DNS as the internet’s phonebook; it translates human-readable website names like google.com into the numerical IP addresses that computers use to find each other. At the top of this system sit 13 named root server addresses, each served by many physical machines around the world. If the root layer as a whole failed, the internet could effectively stop working for most users. This hasn’t happened in recent decades, but it remains a potential point of failure.
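To make the phonebook analogy concrete, here is a minimal Python sketch of a name lookup. The helper name `resolve` and its `lookup` parameter are illustrative assumptions, not part of any standard tool; the default lookup is the standard library's `socket.getaddrinfo`, which asks the operating system's resolver.

```python
import socket


def resolve(hostname, lookup=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses for hostname.

    Returns an empty list if the lookup fails, which is roughly what
    applications experience when DNS is unavailable.
    `lookup` is a hypothetical injection point so the function can be
    exercised without a network connection.
    """
    try:
        results = lookup(hostname, None)
    except socket.gaierror:
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address as a string.
    return sorted({sockaddr[0] for *_, sockaddr in results})
```

Calling `resolve("google.com")` on a connected machine should return one or more addresses; the exact values vary by location and time. When DNS is down the same call returns an empty list, even though the servers behind the name may be perfectly healthy.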
The second, more common cause of recent outages is the Border Gateway Protocol (BGP). BGP is how different internet providers tell each other how to route traffic. It’s like a traffic management system for the internet. If incorrect routing information is broadcast (for instance, if a provider mistakenly claims it can handle all traffic for a specific service), other networks may start sending traffic to the wrong place, which can lead to widespread disruptions. We saw this in 2008, when an incorrect BGP announcement took YouTube offline for much of the world.
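One reason a mistaken announcement can be so disruptive is that routers generally prefer the most specific matching route. The sketch below illustrates only that single rule (longest-prefix match); real BGP route selection involves many more factors. The function name, the provider labels, and the documentation-range addresses are all assumptions made for illustration.

```python
import ipaddress


def best_route(routes, destination):
    """Return the origin of the most specific route covering destination.

    `routes` is a list of (prefix, origin) pairs. Returns None when no
    route covers the destination address.
    """
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, origin in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network.prefixlen, origin))
    if not matches:
        return None
    # The longest prefix (largest prefixlen) is the most specific match.
    return max(matches, key=lambda match: match[0])[1]


# Hypothetical routing table: one legitimate announcement.
routes = [("192.0.2.0/24", "legitimate-provider")]
before = best_route(routes, "192.0.2.10")

# A second network mistakenly announces a more specific prefix.
routes.append(("192.0.2.0/25", "mistaken-provider"))
after = best_route(routes, "192.0.2.10")
```

Here `before` is `"legitimate-provider"` and `after` is `"mistaken-provider"`: the original route never went away, but traffic follows the more specific announcement.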
Yesterday’s Specific Cause
In yesterday’s case, the situation appears to be different. Google has stated that a change to its API quotas was the trigger. While the exact technical details of this change aren’t fully clear, it seems to have caused a widespread failure of Google’s own services. This failure then cascaded to Cloudflare, a company that provides essential services like content delivery for a huge number of websites. Cloudflare helps speed up websites by storing data closer to users, so when it went down, a significant portion of the internet, by some estimates around 20%, became inaccessible.
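The cascade described above can be modelled as a walk over a dependency map: start from the failed service and collect everything that relies on it, directly or indirectly. This is a simplified sketch; the service names and the `depends_on` map are hypothetical and do not describe the real architecture of any provider.

```python
def affected_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`.

    `depends_on` maps each service name to the set of services it needs.
    """
    affected = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for service, dependencies in depends_on.items():
            if current in dependencies and service not in affected:
                affected.add(service)
                # Anything relying on this service is now at risk too.
                frontier.append(service)
    return affected


# Hypothetical dependency map for a small business.
depends_on = {
    "cdn": {"cloud-quota-service"},
    "shop-website": {"cdn"},
    "status-page": {"cdn"},
    "offline-till": set(),
}
impact = affected_by("cloud-quota-service", depends_on)
```

In this example `impact` contains `cdn`, `shop-website`, and `status-page`, while `offline-till` is untouched because it depends on nothing that failed.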
The Trend Towards Centralisation
This event, alongside other recent incidents like a power outage in Spain, points to a growing issue: the increasing concentration of internet infrastructure. The internet was originally designed for decentralisation and resilience, but a few major players like Google, Microsoft, and Cloudflare now manage a vast amount of global traffic. This centralisation, while often efficient, creates single points of failure. With fewer providers carrying more of the traffic, a problem at any one of them has a much larger impact.
Preparing Your Business for Outages
Given the increasing reliance on the internet and the potential for more frequent and complex disruptions, especially with the rise of AI and cyber risks, businesses need to be prepared. It’s not just about whether the internet goes down, but how your business can continue to operate, or at least mitigate the impact.
Here are some steps to consider:
- Risk Assessment: Conduct a thorough review of how a day without internet would affect your business. Identify critical functions and potential bottlenecks.
- Communication Plan: How will you communicate with your customers and your team if normal channels are unavailable? Consider alternative methods.
- Operational Adjustments: What systems or processes can be put in place to reduce the impact? This might involve offline capabilities or alternative service providers.
- Resilience Testing: Don’t just assume your backup plans will work. Test them regularly. For example, if you rely on status pages to inform customers, ensure they are accessible even during an outage.
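The steps above can be turned into a simple tabletop exercise: list each critical function, the options you have for delivering it, and the providers each option relies on, then check which functions have no option left when a given provider fails. The sketch below assumes a hand-maintained inventory; every function name and provider label in it is hypothetical.

```python
def unprotected_functions(functions, failed_provider):
    """List critical functions with no option that survives the outage.

    `functions` maps a function name to a list of options, where each
    option is the set of providers that option relies on.
    """
    exposed = []
    for name, options in functions.items():
        # An option survives only if it does not rely on the failed provider.
        survives = any(failed_provider not in providers for providers in options)
        if not survives:
            exposed.append(name)
    return sorted(exposed)


# Hypothetical inventory for a small business.
functions = {
    "customer status page": [{"cdn-a"}],
    "team chat": [{"cloud-b"}, {"sms"}],
    "card payments": [{"cdn-a", "bank-api"}, {"offline-terminal"}],
}
gaps = unprotected_functions(functions, "cdn-a")
```

With this inventory, `gaps` is `["customer status page"]`: the status page has a single option and it sits behind the same provider that failed, which is exactly the situation the resilience testing step warns about.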
While it’s impossible to completely eliminate the risk of global infrastructure failures, taking proactive steps can significantly improve your business’s ability to withstand and recover from such events. Thinking about how to maintain essential functions, like customer communication or even basic transaction processing, can make a big difference.