Intro
First, let’s go over some concepts: region and availability zone.
Amazon Availability Zones are distinct physical locations inside the same region. They have low-latency network connectivity between them and are engineered to be insulated from failures that afflict other AZs.
Each availability zone runs on its own physically distinct, independent infrastructure and is engineered to be highly reliable, with independent power, cooling, network, and security.
Common points of failure like generators and cooling equipment are not shared across Availability Zones. Additionally, the zones are physically separate, so that even extremely uncommon disasters like fires, tornadoes, or flooding would only affect a single Availability Zone.[^1]
If your platform is working mostly in one area of the world, it makes sense to put your servers in that region. The region will then have multiple “Availability Zones”.
This means that you can put redundant servers in different zones within the same region, and as a result, you’ll have better availability.
The important twist here is that within the same region, network latency is minimal. So we have separate facilities with good interconnectedness. Here is an image of the available regions, along with the number of available zones on AWS. Two green circles are new regions that are opening soon (in Paris and Ningxia).
Scale Up
Things are simple here: you have one machine that serves all your traffic. When you notice that the server can’t handle the traffic, you simply shut down the machine, upgrade the CPU, RAM, and storage, and start it again. This approach is the cheapest and is ideal for the MVP stage.
Don’t be fooled, though, it can still get you very far, and I would most definitely always begin with this approach.
PROS | CONS |
---|---|
simplest | single point of failure (all eggs in one basket) |
cheapest | downtime when upgrading |
| | you can’t adjust dynamically (for spike traffic) |
Single Availability Zone
This is similar to having a single machine in the sense that if our availability zone goes down, production goes down too. So our server(s) live in a single region within a single availability zone. We are merely adding an Elastic Load Balancer that distributes traffic to multiple servers within the same availability zone.
PROS | CONS |
---|---|
possible to upgrade without downtime (multiple servers) | single point of failure |
possible to adjust dynamically (for spike traffic) |
Multiple Availability Zones
Amazon EC2/RDS instances have an uptime guarantee of 99.95% on a monthly basis. The maximum permissible downtime roughly equates to 22 minutes per month (assuming 30 days per month).[^2] When we combine multiple availability zones, a full outage becomes very unlikely. Elastic Load Balancer can detect problems in each zone and redirect traffic to healthy instances.
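
The downtime figure above can be checked with a few lines of arithmetic. The sketch below also estimates the combined availability of several zones, under the simplifying assumption (mine, not the SLA's) that zones fail independently and that each zone on its own offers the same 99.95%. The function names are made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Maximum downtime per period that still satisfies the given uptime percentage."""
    return (1 - sla_percent / 100) * days * MINUTES_PER_DAY


def combined_availability(per_zone_percent: float, zones: int) -> float:
    """Chance (in percent) that at least one zone is up.

    Assumption: zone failures are independent, which real outages may violate.
    """
    all_down = (1 - per_zone_percent / 100) ** zones
    return (1 - all_down) * 100


print(round(allowed_downtime_minutes(99.95), 1))  # 21.6 minutes in a 30-day month
print(combined_availability(99.95, 2))            # two zones, independence assumed
```

Treat the second number as an optimistic upper bound: it ignores shared dependencies such as the load balancer itself or a region-wide incident.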
PROS | CONS |
---|---|
possible to upgrade without downtime (multiple servers) | affected by whole region going down |
possible to adjust dynamically (for spike traffic) | |
possible to survive one or more availability zones going down |
This combination is a sweet spot for reasonable reliability and cost.
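
To make the idea of redirecting traffic to healthy instances more concrete, here is a minimal sketch of a round-robin balancer that skips unhealthy instances. It is not how Elastic Load Balancer is implemented; `Instance`, `ZoneAwareBalancer`, and the instance and zone names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    zone: str
    healthy: bool = True


class ZoneAwareBalancer:
    """Round-robin over instances, skipping any that are marked unhealthy."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._next = 0  # index where the next search starts

    def pick(self) -> Instance:
        n = len(self.instances)
        for offset in range(n):
            candidate = self.instances[(self._next + offset) % n]
            if candidate.healthy:
                self._next = (self._next + offset + 1) % n
                return candidate
        raise RuntimeError("no healthy instances in any zone")

    def mark_zone_down(self, zone: str) -> None:
        """Simulate a zone outage by flagging all of its instances."""
        for instance in self.instances:
            if instance.zone == zone:
                instance.healthy = False


balancer = ZoneAwareBalancer([
    Instance("web-1", "zone-a"),
    Instance("web-2", "zone-b"),
    Instance("web-3", "zone-a"),
    Instance("web-4", "zone-b"),
])
balancer.mark_zone_down("zone-a")
print([balancer.pick().name for _ in range(3)])  # only zone-b instances are returned
```

With instances interleaved across two zones, losing one zone halves capacity but keeps the service reachable, which is the property this setup is after.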
Multiple Regions - Active/Passive Failover
Although it is very rare for an entire AWS region to go down, it does happen. Many enterprises want to replicate their databases across regions, so that when a catastrophe does occur and the primary region goes down, infrastructure can be quickly set up in another region. [^3] Such a setup requires the database to be synced across regions. Total time from endpoint failure to DNS failover is about 3 minutes, so we can have a backup server running soon, preventing a big outage.
One possibility to cut costs is to use a passive setup as a staging area for testing prior to production rollout.
PROS | CONS |
---|---|
possible to upgrade without downtime (multiple servers) | partially affected by whole region going down |
possible to adjust dynamically (for spike traffic) | we need read replicas in a different region for disaster scenarios |
possible to survive whole region going down with little to no down time |
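
The failover behaviour can be sketched as a small model: a health checker counts consecutive failures, and once a threshold is reached, traffic is pointed at the standby region. The interval, threshold, and TTL values below are illustrative assumptions rather than figures from this article, and `FailoverRecord` and the host names are hypothetical.

```python
def estimated_failover_seconds(check_interval_s: int, failure_threshold: int, dns_ttl_s: int) -> int:
    """Rough worst-case time until clients reach the standby region.

    Detection takes `failure_threshold` failed checks, and cached DNS
    answers can live for up to one TTL after the record is switched.
    """
    return check_interval_s * failure_threshold + dns_ttl_s


class FailoverRecord:
    """Resolves to the primary until enough consecutive health checks fail."""

    def __init__(self, primary: str, secondary: str, failure_threshold: int = 3):
        self.primary = primary
        self.secondary = secondary
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def report_check(self, ok: bool) -> None:
        # A single successful check resets the counter.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

    @property
    def active(self) -> str:
        if self.consecutive_failures >= self.failure_threshold:
            return self.secondary
        return self.primary


record = FailoverRecord("primary.example.com", "standby.example.com")
for _ in range(3):
    record.report_check(ok=False)
print(record.active)                          # standby.example.com
print(estimated_failover_seconds(30, 3, 60))  # 150
```

With these assumed values the model gives 150 seconds, in the same ballpark as the roughly 3 minutes mentioned above. Promoting the replica database in the standby region is a separate step that this sketch does not cover.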
Multiple Regions - Active/Active Failover
When your server handles lots of customers across multiple regions, it makes sense to keep both regions active. In normal circumstances, you might use Amazon Route 53 Latency Based Routing (LBR) or Weighted Round Robin (WRR) to distribute load. In case of emergency, when an entire region goes down, you transfer the traffic over to a working region.
This means you get slower responses, but it certainly beats suffering complete downtime.
The configuration is exactly the same as in Active/Passive Failover, but both regions serve traffic and the load is distributed between them at all times, not just when one region goes down.
PROS | CONS |
---|---|
possible to upgrade without downtime (multiple servers) | we need read replicas in different regions |
possible to adjust dynamically (for spike traffic) | we probably need a database master in each region |
should survive whole region going down without major issues | |
allows region by region rollout to test new production |
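
The two routing policies mentioned above can be sketched in a few lines. This only illustrates the selection logic and is not the Route 53 API; the function names, latency numbers, and weights are assumptions made up for the example.

```python
import random


def pick_by_latency(latencies_ms: dict, healthy: set) -> str:
    """Latency-based routing: the healthy region with the lowest measured latency wins."""
    candidates = {region: ms for region, ms in latencies_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


def pick_by_weight(weights: dict, healthy: set, rng=random) -> str:
    """Weighted routing: healthy regions receive traffic in proportion to their weight."""
    regions = [r for r in weights if r in healthy and weights[r] > 0]
    if not regions:
        raise RuntimeError("no healthy region available")
    return rng.choices(regions, weights=[weights[r] for r in regions], k=1)[0]


latencies = {"eu-west-1": 25, "us-east-1": 110}  # illustrative measurements for one client
weights = {"eu-west-1": 70, "us-east-1": 30}     # illustrative 70/30 split

both = {"eu-west-1", "us-east-1"}
print(pick_by_latency(latencies, both))           # eu-west-1
print(pick_by_latency(latencies, {"us-east-1"}))  # us-east-1 once eu-west-1 is down
print(pick_by_weight(weights, {"us-east-1"}))     # us-east-1, the only healthy choice
```

In both policies a region that fails its health check simply drops out of the candidate set, which is what turns load distribution into failover.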
Common Concerns
For a big system, a major problem is always the database. So in a sense, you do everything you can to remove the burden from it:
- Read Replicas
- Caching of static and dynamic content
- Splitting data based on regions (multiple masters depending on region)

Another good tip is to protect web servers from being overloaded by using a CDN for static content delivery or streaming. DDoS protection is another valid concern.
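
As a rough illustration of the first two techniques for taking load off the database, the sketch below sends reads to replicas and keeps repeated lookups in a small in-memory cache. `DatabaseRouter`, `ReadThroughCache`, and the connection names are hypothetical; a real application would more likely rely on its framework's replica support and a dedicated cache server.

```python
from itertools import cycle


class DatabaseRouter:
    """Send SELECT statements to replicas in turn, everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = cycle(replicas) if replicas else None

    def route(self, sql: str):
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary


class ReadThroughCache:
    """Call the loader only on a cache miss and remember the result."""

    def __init__(self, load):
        self._load = load
        self._store = {}
        self.misses = 0

    def get(self, key):
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._load(key)
        return self._store[key]

    def invalidate(self, key) -> None:
        # Call this after a write so the next read reloads fresh data.
        self._store.pop(key, None)


router = DatabaseRouter("primary", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM users"))        # replica-1
print(router.route("UPDATE users SET name = ?"))  # primary

cache = ReadThroughCache(load=lambda key: f"row for {key}")
cache.get("user:1")
cache.get("user:1")
print(cache.misses)  # 1, the second lookup never reached the loader
```

One caveat with this kind of split: replicas lag slightly behind the primary, so a read issued right after a write may return stale data unless it is routed to the primary.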
Conclusion
Congratulations on making it all the way here. If you just jumped here, shame on you; otherwise, I hope you found this useful :)
If you are in search of an awesome RoR/Vue.js/Nuxt team, or you need help setting up your project, feel free to contact Kodius.
References:
- Exploring Amazon Availability Zones
- Does Amazon EC2 have an uptime guarantee
- New AWS Feature: Amazon RDS now supports cross-region replication
- Active-Active for Multi-Regional Resiliency
- A Beginner’s Guide To Scaling To 11 Million+ Users On Amazon’s AWS
- Amazon RDS for MySQL – Promote Read Replica
- Overview of Amazon Web Services
- Calculator S3
- Creating a Billing Alarm to Monitor Your Estimated AWS Charges
- Using regions availability zones