Scale for Speed and Availability

Intro

First, let’s go over two concepts: regions and availability zones.

Amazon Availability Zones are distinct physical locations inside the same region. They have low-latency network connectivity between them and are engineered to be insulated from failures that afflict other AZs.

Each availability zone runs on its own physically distinct, independent infrastructure and is engineered to be highly reliable, with independent power, cooling, network, and security.

A diagram showing the different layers of a network.

Common points of failure like generators and cooling equipment are not shared across Availability Zones. Additionally, the zones are physically separate, so that even extremely uncommon disasters like fires, tornadoes, or flooding would only affect a single Availability Zone.[^1]

If your platform mostly serves one area of the world, it makes sense to put your servers in that region. The region will then have multiple “Availability Zones”.

This means that you can put redundant servers in different zones within the same region, and as a result, you’ll have better availability.

The important twist here is that within the same region, network latency is minimal, so we get separate facilities that are still well connected. Here is an image of the available regions, along with the number of available zones on AWS. The two green circles are new regions that are opening soon (in Paris and Ningxia).

Scale Up

Things are simple here: you have one machine that serves all your traffic. When you notice that the server can’t handle the traffic, you simply shut down your machine, upgrade the CPU, RAM, and storage, and run it again. This approach is the cheapest and is ideal for the MVP stage.

Don’t be fooled, though: it can still get you very far, and I would most definitely always begin with this approach.

PROS	CONS
simplest	single point of failure (all eggs in one basket)
cheapest	downtime when upgrading
	you can’t adjust dynamically (for spike traffic)

Single Availability Zone

This is similar to having a single machine in the sense that if our availability zone goes down, production goes down too. Our servers live in a single region within a single availability zone; we are merely adding an Elastic Load Balancer that distributes traffic across multiple servers within that zone.

PROS	CONS
possible to upgrade without downtime (multiple servers)	single point of failure
possible to adjust dynamically (for spike traffic)
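
To make the load balancer's role a bit more concrete, here is a minimal sketch of round-robin distribution that skips unhealthy servers. This is a toy model, not how Elastic Load Balancer is implemented; the class name `RoundRobinBalancer` and the server names are made up for illustration.

```python
class RoundRobinBalancer:
    """Toy load balancer: rotates over servers, skipping unhealthy ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the server to try on the next pick

    def mark_down(self, server):
        """Record a failed health check for a server."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Record a passing health check for a known server."""
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server, or raise if there is none."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers")


# Hypothetical server names, for illustration only.
balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
balancer.mark_down("web-2")  # e.g. taken out of rotation for an upgrade
print([balancer.pick() for _ in range(4)])
```

Taking one server out of rotation while the others keep serving is what makes upgrades without downtime possible. If the whole zone fails, though, every server is marked down and `pick()` has nothing left to return.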

It is much better to use the next approach, with multiple zones. A single availability zone can serve as a stepping stone in the right direction when the load is so low that it requires only one server (which therefore has to be in one availability zone).

Multiple Availability Zones

Amazon EC2/RDS instances have an uptime guarantee of 99.95% on a monthly basis. The maximum permissible downtime equates to roughly 22 minutes per month (assuming 30 days per month).[^2] Combining multiple availability zones makes an outage very unlikely. Elastic Load Balancer can detect problems in each zone and redirect traffic to healthy instances.
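
The downtime figure is easy to check, and the same arithmetic shows why adding zones helps. The sketch below makes two assumptions that are not in the SLA itself: that 99.95% can be treated as the availability of a single zone, and that zones fail independently of each other. Real failures are not perfectly independent, so treat the combined number as an optimistic bound.

```python
def allowed_downtime_minutes(availability, days=30):
    """Minutes of downtime permitted in a period at the given availability."""
    return (1 - availability) * days * 24 * 60


def combined_availability(availability, zones):
    """Availability of `zones` redundant zones, ASSUMING independent failures.

    The setup is only down when every zone is down at the same time.
    """
    return 1 - (1 - availability) ** zones


print(round(allowed_downtime_minutes(0.9995), 1))  # 21.6 minutes per 30 days
print(combined_availability(0.9995, 2))            # two zones, idealized
```

Under these assumptions, two zones bring the expected downtime from about 21.6 minutes per month down to well under a minute.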

PROS	CONS
possible to upgrade without downtime (multiple servers)	affected by whole region going down
possible to adjust dynamically (for spike traffic)
possible to survive one or more availability zones going down

This combination is a sweet spot for reasonable reliability and cost.

Multiple Regions - Active/Passive Failover

Although it is very rare for an entire AWS region to go down, it does happen. Many enterprises want to replicate their databases across regions, so that when a catastrophe does occur and the primary region goes down, infrastructure can be quickly set up in another region.[^3] Such a setup requires the database to be synced across regions. Total time from endpoint failure to DNS failover is about 3 minutes, so a backup server can take over quickly, preventing a big outage.
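
To see where a failover time of a few minutes can come from, here is a rough sketch of health-check-driven failover. The interval, failure threshold, and DNS TTL values below are illustrative assumptions, not figures from this article or a guaranteed AWS configuration, and `FailoverMonitor` is a made-up name.

```python
def failover_seconds(check_interval, failure_threshold, dns_ttl):
    """Worst-case seconds until clients reach the backup region.

    The health checker needs `failure_threshold` consecutive failed checks,
    and clients may keep using the cached DNS answer for up to `dns_ttl`.
    """
    return check_interval * failure_threshold + dns_ttl


class FailoverMonitor:
    """Tracks consecutive failed checks and picks the active region."""

    def __init__(self, primary, secondary, failure_threshold=3):
        self.primary = primary
        self.secondary = secondary
        self.failure_threshold = failure_threshold
        self.failures = 0

    def record_check(self, ok):
        """Record one health check result and return the active region."""
        self.failures = 0 if ok else self.failures + 1
        return self.active

    @property
    def active(self):
        if self.failures >= self.failure_threshold:
            return self.secondary
        return self.primary


# Assumed values: a check every 30 s, 3 failures to fail over, 60 s DNS TTL.
print(failover_seconds(check_interval=30, failure_threshold=3, dns_ttl=60))
```

With these assumed values the total is 150 seconds, which is in the same ballpark as the "about 3 minutes" mentioned above. In this sketch a single passing check sends traffic back to the primary; a production setup may want a more conservative recovery rule.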

A diagram of a network with multiple platforms.

One way to cut costs is to use the passive setup as a staging area for testing prior to production rollout.

PROS	CONS
possible to upgrade without downtime (multiple servers)	partially affected by whole region going down
possible to adjust dynamically (for spike traffic)	we need read replicas in a different region for disaster scenarios
possible to survive whole region going down with little to no down time

Multiple Regions - Active/Active Failover

When your server handles lots of customers across multiple regions, it makes sense to keep both regions active. In normal circumstances, you might use Amazon Route 53 Latency Based Routing (LBR) or Weighted Round Robin (WRR) to distribute load. In case of emergency, when an entire region goes down, you transfer the traffic over to a working region.

This means you get slower responses, but it certainly beats suffering complete downtime.

The configuration is exactly the same as in Active/Passive Failover, but we use both regions and distribute the load between them at all times, not just when one region goes down.
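
The routing idea can be sketched in a few lines. This toy version uses weighted random selection among healthy regions, which approximates a weighted split over many requests; it is not the Route 53 API, and the function name `pick_region`, the weights, and the region names are illustrative assumptions.

```python
import random


def pick_region(weights, healthy, rng=random):
    """Pick a region by weight, considering only healthy regions.

    weights: dict mapping region name to a non-negative weight
    healthy: set of region names currently passing health checks
    """
    candidates = {r: w for r, w in weights.items() if r in healthy and w > 0}
    if not candidates:
        raise RuntimeError("no healthy region available")
    regions = list(candidates)
    region_weights = [candidates[r] for r in regions]
    return rng.choices(regions, weights=region_weights, k=1)[0]


# Hypothetical weights: send roughly 3 of every 4 requests to eu-west-1.
weights = {"eu-west-1": 3, "us-east-1": 1}

# Normal operation: both regions share the load.
print(pick_region(weights, healthy={"eu-west-1", "us-east-1"}))

# Emergency: eu-west-1 is down, so all traffic goes to us-east-1.
print(pick_region(weights, healthy={"us-east-1"}))
```

Because unhealthy regions are simply dropped from the candidate set, failover needs no separate code path: the surviving region absorbs all traffic, at the cost of the slower responses mentioned above.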

PROS	CONS
possible to upgrade without downtime (multiple servers)	we need read replicas in different regions
possible to adjust dynamically (for spike traffic)	we probably need a database master in each region
should survive whole region going down without major issues
allows region by region rollout to test new production

Common Concerns

For a big system, a major problem is always the database. So in a sense, you do everything you can to remove the burden from it:

Read Replicas
Caching of static and dynamic content
Splitting data based on regions (multiple masters depending on region)

Another good tip is to protect web servers from being overloaded by using a CDN for static content delivery or streaming. DDoS protection is another valid concern.
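
As an example of taking load off the database, here is a minimal sketch of read/write splitting: reads rotate across read replicas while everything else goes to the primary. The class name `QueryRouter` and the connection names are hypothetical, and the check on the first SQL keyword is deliberately naive.

```python
import itertools


class QueryRouter:
    """Sends SELECT statements to replicas and everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Rotate through replicas; fall back to the primary if there are none.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql):
        """Return the name of the database that should run this statement."""
        stripped = sql.strip()
        first_word = stripped.split(None, 1)[0].upper() if stripped else ""
        if first_word == "SELECT" and self._replicas is not None:
            return next(self._replicas)
        return self.primary


# Hypothetical connection names, for illustration only.
router = QueryRouter("primary-db", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM users"))
print(router.route("UPDATE users SET name = 'x' WHERE id = 1"))
```

A real setup also has to account for replication lag (a read right after a write may not see the new data) and for reads that must run on the primary, such as those inside a transaction.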

Conclusion

Congratulations on making it all the way here. If you just jumped here, shame on you, otherwise, I hope you found this useful :)

If you are in search of an awesome RoR/Vue.js/Nuxt team, or you need help setting up your project, feel free to contact Kodius.

References:
