Now that the outage is over, Amazon has come forward and said that one of its team members was performing maintenance intended to speed up the billing system. The employee entered a command incorrectly, and the mistake took more servers offline than intended.
“With a few mistaken keystrokes, the employee wound up knocking out systems that supported other systems that help AWS work properly,” says Brian Fung at the Washington Post.
This caused a cascading failure in which websites could not modify data and users could not load pages.
“In this instance, the tool used allowed too much capacity to be removed too quickly,” Amazon said. “We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.”
The outage was caused by human error and ended up costing Amazon’s cloud storage customers over $300 million. This incident may also lead your customers who use web services to question if, or when, this will happen again.
Rather than wait for another outage, help your customers to be proactive by taking the following actions:
1. Back Up Their Data To Multiple Services
Why would anyone assume that data no longer needs to be backed up just because a service provider houses it? Don’t allow your customers to take this kind of risk. Although it’s great that Amazon and the other services have multiple physical locations, in a case like the outage mentioned above, if your customer’s data exists on only one cloud service, it will be unavailable until the problem is resolved and (hopefully) access to the data is restored.
While having a redundant system is wise, you also need to keep backups of your customers’ data in other locations with other providers so that outages will have less impact on their services. A common option is synchronous or asynchronous replication, which lets your customer keep a current copy of their data available on another cloud service.
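As a rough illustration of the idea, the sketch below copies one backup file to several destinations and verifies each copy with a checksum. It makes some loud assumptions: the destinations are plain local directories standing in for different providers, and the names `replicate` and `sha256_of` are hypothetical helpers, not part of any cloud SDK. A real setup would swap the copy step for each provider's own upload tooling.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source, destinations):
    """Copy `source` into every destination directory and verify each copy.

    Destinations are local directories here (an assumption standing in for
    separate providers). Returns a dict mapping each destination to True
    when the copy's checksum matches the source, otherwise False.
    """
    source = Path(source)
    expected = sha256_of(source)
    results = {}
    for destination in destinations:
        target_dir = Path(destination)
        try:
            target_dir.mkdir(parents=True, exist_ok=True)
            copy_path = target_dir / source.name
            shutil.copy2(source, copy_path)
            results[str(destination)] = sha256_of(copy_path) == expected
        except OSError:
            # One failed destination must not stop the other copies.
            results[str(destination)] = False
    return results
```

The point of the per-destination result is that a failure at one location is reported without blocking the copies to the others, which is the property you want when one provider is having a bad day.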
2. Monitor And Identify Any Problems Quickly
Sometimes an outage may be obvious. In other cases, it may only negatively impact the user experience, with systems being slower than usual.
That’s why monitoring is extremely important. Make sure you are monitoring not only specific endpoints but entire cloud infrastructures. Then you can quickly identify issues and react. Unigma offers AWS Platform Monitoring as one of its features, and this helped many customers identify issues during the AWS outage. Get a sense of how Unigma’s Cloud Monitoring feature works by visiting our website.
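To make the difference between an obvious outage and a slow, degraded service concrete, here is a minimal sketch of a health probe that looks at response time as well as status. This is not how Unigma's monitoring works; it is a generic example using only the Python standard library, and the 2-second threshold and the names `classify` and `probe` are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request

SLOW_THRESHOLD_SECONDS = 2.0  # assumed cutoff; tune per service


def classify(status_code, elapsed_seconds, slow_threshold=SLOW_THRESHOLD_SECONDS):
    """Label one probe result as 'down', 'degraded', or 'healthy'.

    `status_code` is None when no HTTP response was received at all.
    """
    if status_code is None or status_code >= 500:
        return "down"
    if elapsed_seconds > slow_threshold:
        return "degraded"
    return "healthy"


def probe(url, timeout=5.0):
    """Request `url` once and classify the outcome."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error status
    except OSError:
        status = None  # DNS failure, refused connection, timeout, etc.
    return classify(status, time.monotonic() - start)
```

A check that only asks "did it respond?" would report a service taking several seconds per request as healthy; treating slowness as its own state is what lets you catch the less obvious cases.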
3. Maximize Their Availability (And Minimize Their Risk)
Use built-in redundancy with systems on standby: This way, when a failure occurs, the application automatically detects it and fails over to a system on standby. Unfortunately, with standby there can be a lag between when one system becomes unavailable and when the other system detects it and picks up the load.
Use active redundancy: You can avoid this lag with active redundancy, where a second system is already running alongside the first and can absorb the load immediately.
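The trade-off between the two approaches can be put into rough numbers. The sketch below is a simplified model, not a measurement of any real system: it assumes the standby detects failure through heartbeats sent at a fixed interval and declares the primary dead after a set number of consecutive misses, and it assumes load is split evenly among active systems. The function names and parameters are hypothetical.

```python
def standby_failover_gap(heartbeat_interval, missed_before_failover, promotion_time):
    """Approximate worst-case seconds of unavailability with a standby system.

    Assumes the primary fails just after a successful heartbeat, so the
    standby waits for `missed_before_failover` missed heartbeats and then
    needs `promotion_time` seconds to start taking traffic.
    """
    return heartbeat_interval * missed_before_failover + promotion_time


def active_load_after_failure(total_load, node_count, failed=1):
    """Load each surviving system carries when active systems share load evenly."""
    survivors = node_count - failed
    if survivors <= 0:
        raise ValueError("no surviving systems to absorb the load")
    return total_load / survivors


# Standby: 10 s heartbeats, 3 misses, 30 s to promote the standby.
print(standby_failover_gap(10, 3, 30))    # 60
# Active: 4 systems sharing 90 units of load, one fails.
print(active_load_after_failure(90.0, 4))  # 30.0
```

Under these assumptions, standby costs you time (a gap while the failure is detected), while active redundancy costs you headroom (each surviving system must be sized to carry a larger share).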
If your customer only deploys to one instance and it becomes unavailable, they won’t get the level of fault tolerance they need. Instead, they should spread out the workload using one of these three methods, listed in order of increasing protection:
- Use multiple availability zones within AWS: If you must stick with one service, moving to different availability zones will at least protect you from a localized outage (tornado, flooding, etc.). Of these three options, this one carries the highest risk for your customer.
- Spread your instances across multiple regions: With multiple regions, you can spread out your protection, but you can still be vulnerable to a provider-level outage.
- Use multiple service providers such as Google Cloud or Microsoft Azure as a backup service: Using this method, if one provider becomes unavailable, as in the example above, your customer’s services can keep running without interruption. This method carries the least risk for your customer.
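The three options above differ in the size of outage they can survive, which a small model makes explicit. The sketch below is only an illustration: the `Instance` class, the `survives` function, and the provider, region, and zone labels are all hypothetical, and it assumes a deployment survives as long as at least one instance sits outside whatever failed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    provider: str
    region: str
    zone: str


def survives(instances, failed_scope, failed_value):
    """True if at least one instance lies outside the failed scope.

    `failed_scope` is 'zone', 'region', or 'provider'; `failed_value` is
    the label of the zone, region, or provider that went down.
    """
    return any(getattr(inst, failed_scope) != failed_value for inst in instances)


# Illustrative deployments matching the three options in the list above.
multi_zone = [
    Instance("aws", "us-east-1", "us-east-1a"),
    Instance("aws", "us-east-1", "us-east-1b"),
]
multi_region = [
    Instance("aws", "us-east-1", "us-east-1a"),
    Instance("aws", "us-west-2", "us-west-2a"),
]
multi_provider = [
    Instance("aws", "us-east-1", "us-east-1a"),
    Instance("azure", "eastus", "eastus-1"),
]
```

In this model, `multi_zone` survives the loss of one zone but not its region, `multi_region` survives a regional outage but not the loss of the provider, and only `multi_provider` survives a provider-level outage like the one described at the start of this article.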
For the best protection against outages, use multiple cloud services so that your customer’s risk is greatly reduced. Don’t wait until the next outage to realize it’s time to protect your customer.
Unigma can help you simplify the cloud with monitoring features built to identify possible problems with your cloud resources. Learn how to properly manage your cloud accounts by requesting a complimentary 30-minute demo with Unigma.