Is AWS Reliable?
When choosing a cloud provider, one important thing to consider is how reliable that provider’s services are. Even if a cloud provider offers compelling services, they aren’t worth using if they aren’t reliable.
Is AWS Reliable? Amazon Web Services had only twelve major service events that impacted AWS service availability between June 8, 2011 and August 23, 2019, as shown on its post-event summaries page. By that measure it is an incredibly reliable cloud provider, and its services can be highly trusted to remain available.
Knowing that the Amazon Web Services cloud is reliable, you may want to get a better understanding of how that reliability is achieved. Each service offered by Amazon is built from the ground up in order to offer the highest reliability for their customers.
Amazon Simple Storage Service (S3) is designed to offer eleven nines of durability for stored data, or 99.999999999%. This means that if an Amazon S3 customer stores one million objects in an S3 bucket, the customer can expect, on average, to lose or have corrupted a single one of those objects over a one hundred thousand year period!
The service accomplishes this by storing multiple copies of the data in at least three different availability zones in a given Amazon region. These availability zones are separated by many miles from one another so that a local event should not affect the others. For example, a fire, flood, or similar natural disaster that affects one availability zone in a region should not cause issues for the other availability zones in that region.
The service is also designed to sustain multiple device failures at the same time, quickly detecting each failure and making new copies from the working devices to restore full redundancy. Objects stored within the service also have their integrity verified regularly by confirming that their checksums match what was recorded when the object was first created.
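The arithmetic behind that figure can be checked with a few lines of Python. This is only a sketch: it assumes the eleven nines describe an annual, per-object durability and that losses are independent, and the function name is made up for illustration.

```python
from fractions import Fraction

# Assumption: 99.999999999% is an annual, per-object durability figure,
# so the chance of losing any one object in a year is 1 in 10^11.
ANNUAL_LOSS_RATE = Fraction(1, 10**11)

def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses (illustrative helper)."""
    expected_losses_per_year = object_count * ANNUAL_LOSS_RATE
    return 1 / expected_losses_per_year

print(years_per_lost_object(1_000_000))  # 100000
```

Storing ten times as many objects shortens the expected interval by the same factor, so the figure always depends on how much data is stored.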
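The integrity check described above can be illustrated with a small sketch. The algorithm S3 uses internally is not stated here, so SHA-256 is an assumption, and the function names are hypothetical.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Hex digest of the data. SHA-256 is an assumption, not S3's documented choice."""
    return hashlib.sha256(data).hexdigest()

def is_intact(data: bytes, expected: str) -> bool:
    """Compare the current checksum with the one recorded at creation time."""
    return checksum(data) == expected

original = b"example object contents"
stored_checksum = checksum(original)  # recorded when the object is first created

print(is_intact(original, stored_checksum))                    # True
print(is_intact(b"example object c0ntents", stored_checksum))  # False
```

A copy that fails this kind of check would be the one replaced from a healthy copy.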
The Amazon Elastic Compute Cloud (EC2) service level agreement commits to 99.99% uptime for EC2 during a given month. If this uptime is not met, the Amazon customer is eligible to receive service credits for the shortfall.
One way that Amazon helps keep EC2 instances reliable is by providing multiple availability zones in a given region, so that customers can deploy copies of their instances across those zones. If a service is running on instances spread across the availability zones and one of those zones fails, the nodes in the working availability zones will continue to function properly.
Amazon recommends that a service usually be set up to run across three availability zones concurrently to help maintain uptime. With machines deployed across at least three availability zones, the cloud customer can sustain the failure of two zones at the same time without the service going down, provided the remaining zone has enough capacity. The load is usually spread across these zones with an Elastic Load Balancing load balancer, which is availability zone aware. This load balancer monitors the different availability zones, and when it detects that one or more are down, it stops routing traffic to them.
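To put 99.99% in perspective, the downtime it allows in a month can be worked out directly. The sketch below assumes a 30-day month, and the function name is illustrative.

```python
def allowed_downtime_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime that still satisfy the given monthly uptime percentage."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)

print(round(allowed_downtime_minutes(99.99), 2))  # 4.32
```

In other words, the uptime target leaves room for a little over four minutes of downtime in a 30-day month.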
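The zone-aware routing described above can be sketched in plain Python. This is a toy model, not how the AWS load balancer is implemented, and the instance IDs and zone names are hypothetical.

```python
def healthy_targets(instances: dict[str, str], down_zones: set[str]) -> list[str]:
    """Return the instance IDs whose availability zone is not marked down."""
    return [iid for iid, zone in instances.items() if zone not in down_zones]

def route(request_count: int, instances: dict[str, str], down_zones: set[str]) -> list[str]:
    """Spread requests round-robin across instances in healthy zones only."""
    targets = healthy_targets(instances, down_zones)
    if not targets:
        raise RuntimeError("no healthy availability zone left")
    return [targets[i % len(targets)] for i in range(request_count)]

# Hypothetical instances, one per availability zone.
instances = {"i-a": "zone-a", "i-b": "zone-b", "i-c": "zone-c"}

print(route(4, instances, down_zones={"zone-b"}))            # ['i-a', 'i-c', 'i-a', 'i-c']
print(route(2, instances, down_zones={"zone-a", "zone-b"}))  # ['i-c', 'i-c']
```

With two of the three zones down, every request lands on the one remaining instance, which is why that zone needs enough spare capacity to carry the full load.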
Amazon DynamoDB is another service provided by Amazon Web Services that was designed from the ground up to be highly reliable. It is designed to run on top of three availability zones just like the Amazon S3 service. Any data written into the Amazon DynamoDB database is stored with three copies, one in each of the three availability zones for the selected region. This helps make sure that even if two availability zones were to fail, the data contained in the database would still be available.
Another feature offered by Amazon DynamoDB is called Global Tables. This feature replicates the data contained within an Amazon DynamoDB table across multiple regions in the Amazon cloud. Given that the data is already replicated multiple times in a single region, having it replicated across other regions as well makes it virtually impossible to lose data stored in the DynamoDB table. Requests to the database are also unlikely to fail, as there are many locations that can be queried even if an entire region fails.
The Amazon DynamoDB service also offers point-in-time recovery on tables if enabled. This helps with reversing any accidental writes or deletions that might occur on the table. With this feature enabled, you can be quite sure that any data stored within this database is going to be there when you need it.
Amazon Route 53 is Amazon Web Services’ version of a DNS service. Cloud customers can use this service to configure any type of DNS setting for their Hosted Zones, or domains, within the service. It is also designed to be a global service, so that customers in any part of the world get quick response times to the DNS queries being issued for a domain configured in Amazon Route 53.
Amazon Route 53 uses a global anycast network of DNS servers hosted around the world in order to provide the lowest latency possible for almost any customer. This globally distributed network, with its many copies of the DNS records and many servers available to respond to any given request, makes Route 53 an incredibly reliable Amazon cloud service.
The Amazon Route 53 service also allows for configuring health checks on the different domain configurations. These health checks can be made to simulate what the request of a normal user might look like and if the health check fails, the Route 53 service can be configured to fail over, so that DNS requests only return results for currently healthy endpoints. This avoids sending users to a service that is currently not healthy.
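The effect of health-check-based failover can be sketched as a simple filter. This is an illustration of the idea rather than the Route 53 API; the function name is hypothetical and the addresses come from the ranges reserved for documentation.

```python
from typing import Callable

def answer_query(endpoints: list[str], is_healthy: Callable[[str], bool]) -> list[str]:
    """Return only the endpoints that currently pass their health check."""
    return [endpoint for endpoint in endpoints if is_healthy(endpoint)]

# Hypothetical endpoints for one domain.
endpoints = ["192.0.2.10", "198.51.100.20"]
failed = {"192.0.2.10"}  # pretend the first endpoint is failing its health check

print(answer_query(endpoints, lambda endpoint: endpoint not in failed))  # ['198.51.100.20']
```

Users resolving the domain would only ever be handed the second address until the first one passes its health check again.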
Amazon Simple Queue Service (SQS) is a service provided by Amazon Web Services which allows customers to queue up work, or tasks, that need to be accomplished. These queues give a central location for a worker node to find and retrieve the tasks that need to be completed.
This service is also built on top of the Amazon regions, taking advantage of the availability zones in the specific region that the queue is configured for. Like other Amazon services, this service makes multiple redundant copies of the queued data in at least three availability zones so that if any single zone were to fail, the data would still be available to retrieve. Even if multiple zones were to fail, the data would still be available. However, if an entire region were to go down, you would be out of luck when trying to get the queued data.
One difference between this service and the others previously mentioned is that the messages stored in an SQS queue have a maximum retention period of fourteen days. So even if the data is replicated across many availability zones, once the maximum retention period has passed, the data will be lost if a worker has not processed it.
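The retention rule is easy to reason about with dates. The sketch below only models the fourteen-day limit mentioned above; it does not call SQS, and the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

MAX_RETENTION = timedelta(days=14)  # the maximum retention period described above

def is_expired(sent_at: datetime, now: datetime, retention: timedelta = MAX_RETENTION) -> bool:
    """True once a message has been in the queue longer than the retention period."""
    return now - sent_at > retention

sent = datetime(2019, 8, 1, tzinfo=timezone.utc)
print(is_expired(sent, sent + timedelta(days=13)))  # False
print(is_expired(sent, sent + timedelta(days=15)))  # True
```

Workers therefore need to drain the queue well within the retention window, no matter how many copies of each message exist.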
AWS Lambda is a service provided by Amazon which allows customers to run small pieces of functional code in many different languages on top of compute infrastructure managed and hosted by Amazon. Instead of the cloud customer needing to run their code on self-managed servers, all of the setup and configuration of the underlying machines is managed and maintained by Amazon engineers.
Amazon makes this service reliable by keeping a large fleet of machines in each region that supports the service, ready to run Lambda functions on demand whenever they are triggered. The system is also designed so that if a call to a Lambda function fails to run properly, it can automatically be retried one or several times. So if the Lambda function were initially provisioned on a bad node, the retry could run on another working compute node which successfully runs the Lambda function for the customer.
Amazon also maintains the underlying compute nodes in several availability zones within the region where the Lambda service is running. With this setup, even if an availability zone were to fail or become unreachable, the customer’s Lambda functions could still run successfully, as there is a fleet of compute nodes in the other availability zones in that region that can pick up the work and execute it.
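The retry behavior can be pictured with a short sketch. It is a simplified model under stated assumptions, not Lambda's actual scheduler: the node names, the run_on function, and the limit of three attempts are all hypothetical.

```python
from typing import Callable, Iterable

def invoke_with_retries(func: Callable[[str], str], nodes: Iterable[str], max_attempts: int = 3) -> str:
    """Try the function on successive nodes until one attempt succeeds."""
    last_error = None
    for _, node in zip(range(max_attempts), nodes):
        try:
            return func(node)
        except RuntimeError as err:
            last_error = err  # remember the failure and move on to the next node
    raise RuntimeError("all attempts failed") from last_error

def run_on(node: str) -> str:
    """Stand-in for running a function on a compute node (hypothetical)."""
    if node == "bad-node":
        raise RuntimeError(f"{node} is unhealthy")
    return f"ran on {node}"

print(invoke_with_retries(run_on, ["bad-node", "good-node"]))  # ran on good-node
```

The first attempt fails on the bad node, and the second attempt succeeds on a healthy one without the caller having to do anything.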
Amazon Simple Notification Service (SNS) is a cloud-based service which is used by AWS customers to send notifications to their own customers. Amazon SNS can push these notifications to many endpoints including email, SMS, and even an SQS message queue. There are many potential use cases for this service for an Amazon cloud customer.
Just like many of the other services, the Amazon SNS service takes advantage of the Availability Zones in an Amazon region to durably store any message that it receives and needs to transmit. It makes multiple copies of these messages across the different Availability Zones so that an Availability Zone failure does not stop the service from delivering the message to the specified endpoint or customer.
Due to the reliability of this service, Amazon is able to guarantee the delivery of the messages sent through this service, as long as the receiving endpoint is accessible to the SNS service. The only other time this might fail is if there is a region-wide failure in the region where the SNS service is running. However, the chance of that happening is incredibly low.
The Amazon Simple Email Service (SES) is provided by Amazon so that customers can programmatically send out email to their own customers or systems. Initially customers start out in this service with sandbox-only privileges, but can request to be upgraded to enable sending a rate-limited amount of email to any email address on the internet. However, if the quality of emails being sent is deemed to be low by Amazon, limits may be placed on the SES account.
This is another service that takes advantage of the regional availability zones to make sure that it is always up and available within a given region. By having many servers running in each of the region’s availability zones, Amazon can be sure that the service will be up and available for customers when they need it.
If an availability zone were to fail or become unavailable, all SES requests would be routed to one of the remaining working availability zones. This keeps the service operational, even if at a lower capacity than when all of the availability zones are fully functional. For companies that need to send a lot of emails to their customers during normal operation, it is reassuring to know that this service will very rarely, if ever, be unavailable.
Amazon Elastic Container Service (ECS) is a cloud service that effectively runs on top of Amazon EC2. Because of this, it gains all of the reliability benefits that the EC2 service has. This service uses EC2 instances in the AWS customer’s account to run the container applications.
To make this service reliable, it is best to spread the EC2 instances across at least three of the Availability Zones in the region where the ECS service is running. Once that is configured, it is also likely a good idea to make the ECS service deploy multiple copies of an application container in each of these Availability Zones. This again guards against an Availability Zone failure while still keeping the service that the container provides up and available during that period.
Another offering by Amazon related to ECS is known as AWS Fargate. This is very similar to ECS, except that Amazon hosts the underlying compute instances in this setup. Again, Amazon hosts many compute nodes spread across all of the Availability Zones in the given region so that a zone failure will not break the service. However, even in this setup, it makes sense for the Amazon customer to deploy copies of their containers into at least three of these Availability Zones so that the service will survive at least two zone failures for the highest possible uptime.
Amazon Aurora is a cloud-based relational database built by Amazon from the ground up to take advantage of the cloud environment. It was specifically designed to use all of the reliability features previously described to make this database service extremely reliable.
One thing that Amazon Aurora does to make the service very reliable is break the data up into 10GB chunks, each replicated six ways across three availability zones on different drives. This allows the database to handle the simultaneous failure of up to two copies of the data without losing the ability to accept writes. It also allows the service to sustain the loss of up to three copies of the data without losing any read availability of the records stored in the database!
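Those failure tolerances follow from quorum arithmetic. The sketch below assumes that a write needs four of the six copies and a read needs three. These quorum sizes are not given in the text above, but they are consistent with the tolerances it describes. The function name is illustrative.

```python
COPIES = 6        # copies of each chunk across three availability zones
WRITE_QUORUM = 4  # assumption: copies that must acknowledge a write
READ_QUORUM = 3   # assumption: copies that must be reachable for a read

def availability(failed_copies: int) -> dict[str, bool]:
    """Report whether writes and reads are still possible after some copies fail."""
    healthy = COPIES - failed_copies
    return {"writes": healthy >= WRITE_QUORUM, "reads": healthy >= READ_QUORUM}

print(availability(2))  # {'writes': True, 'reads': True}
print(availability(3))  # {'writes': False, 'reads': True}
print(availability(4))  # {'writes': False, 'reads': False}
```

With two copies per availability zone, losing an entire zone removes two copies and still leaves writes available.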
The Amazon Aurora database is also self healing as it continuously scans the data stored within the database for errors or issues, and if any are discovered, the bad copies of the data are replaced with valid copies of the data on new drives.
There is also the option within Aurora to use the Aurora Global Database, which configures the database to make asynchronous copies of itself to other regions around the world. With this setup, even if an entire region were to fail, which is very unlikely, the database data would still be available to customers from the copy made in one or more secondary regions. If a region failure did occur, one of the other regional copies can be failed over to, so that it becomes the new primary region and takes over the main workload of the database.
As you can see, all of the services described in this post, and also many of the other Amazon services that weren’t mentioned, are very reliably built on top of the Amazon cloud infrastructure. You can also be confident that there are thousands of Amazon engineers across the world continuously monitoring, maintaining, upgrading and patching all of these different systems so that they remain reliable and available to the Amazon cloud customers when they need them.
You are also able to view the Amazon Web Services status page at any time to see the list of every service in every region with its current status. At almost any point in time, this page will be filled with green check marks. If there is ever an issue, it is usually not a region-wide issue, and because of that, applications and services that are designed to span Availability Zones will remain healthy.