The Closest Healthy Region

April 22, 2026 · 16 min read

Solutions Architect Associate · SAA-C03 · part of The Exam Room

A multi-region application needs requests routed to the closest region, failing over to the next-closest healthy region when the preferred one drops out, with no client-side retries and no separate health-check resources to maintain. Route 53 supports all of that in a single record set – but only with the right combination of routing policy, record type, and health-evaluation setting. Working out which combination that is means touring all seven routing policies and the attributes that separate them.

The situation

A company operates a web application with regional deployments in three AWS Regions – us-east-1, eu-west-1, and ap-southeast-2. Each Region has an Application Load Balancer fronting an Auto Scaling group of EC2 instances.

The team wants app.example.com to route each user to the Region with the lowest network latency for their resolver, and – when the preferred Region is unhealthy – to fail over to the next-lowest-latency healthy Region automatically. They do not want to rely on client-side retries or third-party DNS, and they are allergic to extra health-check resources to configure, bill, and maintain.

Today every region is independently reachable via its own hostname; there is no intelligent DNS layer in front of them. Picking one is the work of this scenario.

What we might want from this

Before opening the Route 53 console it’s worth asking what this routing layer is actually supposed to do, because “closest region with failover” hides a cluster of decisions.

The first question is what “closest” even means. Physical distance is the intuitive answer, and it’s usually wrong. A user in Istanbul is physically close to Frankfurt but might take a faster network path to Dublin depending on peering; a user in Perth is physically close to Singapore but the undersea cable topology can put Sydney at a similar RTT. The honest definition of “closest” for a web application is “lowest measured latency”, and the routing layer has to have some way of knowing that. A policy that picks on continent-code geography will be wrong for every user whose network path doesn’t match their passport.

The second question is how the routing layer learns that a Region is unhealthy. There are two paths in Route 53: a dedicated health-check resource that we configure, point at an endpoint, and pay for, or a derived signal lifted from an AWS resource we already own – an ALB’s own view of its target-group health. The second path is cheaper and less likely to drift out of sync with reality, because the ALB is the thing that knows whether any backend is serving traffic. The first path is more flexible (it can check any URL anywhere) but it’s another resource in the inventory.

The third question is what happens when every region is unhealthy. DNS does not have a clean way to say “the service is globally offline”. Returning an empty answer sounds honest but breaks clients that can’t distinguish “resolver broken” from “service broken”. Returning a wrong answer gives the client something to try. Route 53’s own choice here – return everything when nothing is healthy – is worth knowing because it’s the default we inherit, not a knob we tune.

The fourth question is whether the policy composes. Real traffic-flow graphs rarely fit inside one routing decision. “Closest region that’s healthy, but EU users only ever go to EU regions”, “closest region that’s healthy, with a per-region active/passive inner layer” – these are common shapes, and picking a policy that refuses to nest underneath or on top of another one writes the team into a corner later. Route 53 supports up to ten levels of nesting; the tool is ready even if the first problem only needs one level.

And finally there’s operational overhead. Each of these routing policies is a set of record configurations plus, sometimes, a Traffic Flow policy document plus, sometimes, health-check resources plus, sometimes, CloudWatch alarms wiring those checks to SNS. The cheapest answer on paper isn’t the cheapest answer in the on-call rotation if the pager goes off because someone edited the Traffic Flow JSON by hand.

The attributes that matter

Distilling that exploration into filters we can score each routing policy against:

  1. Locality-aware. The policy picks on measured network latency, not continent code, not weight, not physical distance.
  2. Health-aware. Unhealthy records drop out of the candidate set without a human editing DNS.
  3. Supports three or more records. Two-record policies are structurally insufficient for this three-Region shape.
  4. Low operational overhead. The health signal comes from a resource we already own – ideally the ALB itself – rather than a separate Route 53 health check configured, monitored, and billed.

The Route 53 routing-policy landscape

Route 53 ships seven routing policies. Each picks an answer on a different axis.

1. Simple routing. One record, no decision logic. Route 53 returns whatever is configured, regardless of who asked. Useful for single-Region services and static aliases. Can’t filter by health. Can’t do locality.

2. Weighted routing. Distributes answers across N records in proportion to integer weights. Supports health checks – unhealthy records drop from the pool. Ignores latency entirely. A resolver in Sydney with three equal-weight records would get a random continent on every new query. Useful for canary rollouts and gradual blue/green traffic shifts, not locality.

3. Latency-based routing. Returns the record pointing to the AWS Region with the lowest measured round-trip time for the resolver’s location. Supports one record per AWS Region. Supports health checks, including the lightweight EvaluateTargetHealth path on alias records, which reuses an existing resource’s health signal instead of configuring a separate health-check resource.

4. Geolocation routing. Returns the record matching the resolver’s geographic location. Granularity runs continent → country → US state, with a mandatory default record for resolvers that don’t match any configured entry. A resolver in Istanbul sits in Asia by IANA continent code, so it would be routed to the APAC record even when eu-west-1 has a lower network RTT. Failover is weaker too – a failing geographic record falls through only to the default, not to the next-nearest neighbour.

5. Geoproximity routing. Returns the record “closest” to the resolver by physical distance from a user-configured origin, with optional bias (-99 to +99) that stretches or shrinks each origin’s pull. Distance is physical, not network. Requires Route 53 Traffic Flow (a visual policy-editor layer on top of DNS records) to configure, so the setup is no longer a simple record set.

6. Failover routing. Two records, primary and secondary. Route 53 returns the primary while its health check passes, the secondary when it doesn’t. Supports exactly N = 2. Ignores latency – a Sydney user would hit us-east-1 at 220 ms when ap-southeast-2 would give them 5 ms. Useful for active/passive DR and as the inner layer of a nested policy, not as a standalone answer.

7. Multi-value answer routing. Returns up to eight healthy records per query; the client’s resolver picks one (typically the first). Doesn’t consider latency, weight, or geography – it’s DNS-level load balancing for clients that retry if the first answer fails.

The attribute table

Routing policy                                Locality-aware   Health-aware   N ≥ 3   Low ops overhead
Simple                                        ✗                ✗              ✗       ✓
Weighted                                      ✗                ✓              ✓       ✗
Latency-based (alias + EvaluateTargetHealth)  ✓                ✓              ✓       ✓
Geolocation                                   ✓ (weak)         ✓              ✓       ✗
Geoproximity                                  ✗                ✓              ✓       ✗
Failover                                      ✗                ✓              ✗       ✓
Multi-value answer                            ✗                ✓              ✓       ✗

One row is all ticks – latency-based routing with alias records and EvaluateTargetHealth = true.

Matching shape to policy

[Funnel diagram: each candidate drops out on a different attribute. Geolocation fails locality – continent code ≠ latency, so Istanbul lands in APAC by IANA code even when eu-west-1 is faster, and a failing geographic record falls through only to the default, not to the next-nearest neighbour. Failover fails N ≥ 3 – primary + secondary only, with no concept of "third-closest"; useful inside a nested latency policy, not on its own here. Latency-based with alias records and EvaluateTargetHealth passes every filter: one record set, three Regions, no separate health-check resources, closest healthy Region wins, empty-set fallback returns all, ready to nest if needs grow.]
Each candidate falls out of the funnel on a different attribute. Latency-based with alias records and EvaluateTargetHealth is the one that reaches the bottom intact.

Latency-based routing, in depth

What Route 53 calls “latency” is not the latency from the resolver to the ALB. It’s the latency from Route 53’s own measurement points to each AWS Region – not to the service, not to the ALB, to the Region. When a resolver queries, Route 53 looks up which Region has the lowest RTT for that resolver’s network position (based on its IP and Route 53’s internal map of measurement points) and returns the record pointing at that Region.

Two practical consequences. First, the latency readings are independent of the application’s performance. If the ALB is slow but the Region’s network paths are fast, Route 53 still treats the Region as fast – the health signal is what compensates. Second, the measurements cover AWS’s public-internet paths to its edge locations and Regions, which are the ones that matter for reaching the Region; per-service latency inside the Region is a separate problem.

The health-awareness wiring is EvaluateTargetHealth on an alias record. Aliases are an AWS extension to DNS that let a record point directly at an AWS resource – ALB, NLB, CloudFront distribution, S3 website endpoint, another Route 53 record – instead of a hard-coded IP. When a latency record is an alias to an ALB with EvaluateTargetHealth = true, Route 53 consults the ALB’s own health signal as part of the routing decision. For an ALB, “healthy” means at least one target is healthy in at least one of the ALB’s target groups. The ALB already knows this; there is nothing new to configure.

Compared to the alternative – a separate Route 53 health check pointing at an HTTP endpoint or IP – alias + EvaluateTargetHealth is cheaper, has one fewer moving part, and can’t drift out of sync with the actual backend health the way that a separately-configured health check can.
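Concretely, the whole record set is one change batch – three latency alias records, one per Region, each with EvaluateTargetHealth switched on. A minimal sketch using the boto3 Route 53 API shape; the zone ID, ALB DNS names, and ALB canonical hosted-zone IDs are placeholders to substitute with real values:

```python
# Placeholder Region -> (ALB DNS name, ALB canonical hosted-zone ID) map.
# Each ALB's canonical hosted-zone ID is region-specific; it comes from
# the EC2 console or elbv2 DescribeLoadBalancers -- these are NOT real.
REGIONS = {
    "us-east-1":      ("alb-use1.example.internal",  "ZALBUSE1PLACEHOLDER"),
    "eu-west-1":      ("alb-euw1.example.internal",  "ZALBEUW1PLACEHOLDER"),
    "ap-southeast-2": ("alb-apse2.example.internal", "ZALBAPSE2PLACEHOLDER"),
}

def latency_alias_change_batch(record_name: str) -> dict:
    """Build the ChangeBatch for one latency record set: one alias record
    per Region, EvaluateTargetHealth on, so the ALB's own target health
    drives the routing decision -- no separate health-check resources."""
    changes = []
    for region, (alb_dns, alb_zone_id) in REGIONS.items():
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "SetIdentifier": f"latency-{region}",  # unique per record in the set
                "Region": region,                      # the latency-routing key
                "AliasTarget": {
                    "HostedZoneId": alb_zone_id,
                    "DNSName": alb_dns,
                    "EvaluateTargetHealth": True,      # reuse the ALB's health signal
                },
            },
        })
    return {"Comment": "latency + alias + EvaluateTargetHealth", "Changes": changes}

# To apply (requires credentials and a real hosted zone):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z_EXAMPLE_ZONE",
#     ChangeBatch=latency_alias_change_batch("app.example.com"),
# )
```

Note what is absent: no HealthCheckId anywhere in the batch. The health signal rides on the alias target itself.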

When the latency-preferred record’s alias target reports unhealthy, Route 53 excludes it from the candidate set for the response and falls through to the next-lowest-latency record whose target is healthy. No client retry, no application-level awareness of the failover, no DNS record change from the administrator’s side.

One last-resort behaviour worth knowing. If every record in the set is unhealthy, Route 53 does not return an empty answer. It returns all of them, regardless of health. The reasoning is that a broken DNS response (NXDOMAIN or empty answer set) is strictly worse than a long-shot answer – the client might still succeed via retry, via a health check that’s lagging reality, or via an in-flight recovery. “Try something” beats “refuse to answer.”

A worked example: Madrid through four states

Madrid resolver, DNS TTL of 60 seconds on the latency record set.

State 1 – all three regions healthy. Route 53 evaluates candidates: eu-west-1 (~28 ms), us-east-1 (~95 ms), ap-southeast-2 (~210 ms). All three alias targets healthy. Response: eu-west-1. Client connects at ~28 ms.

State 2 – eu-west-1 ALB has no healthy targets. EvaluateTargetHealth = true on the eu-west-1 record excludes it from the candidate set. Next-lowest healthy: us-east-1. Response: us-east-1. Client connects at ~95 ms. End-to-end cutover is roughly the DNS TTL (60 s) plus Route 53’s internal health-propagation delay (~30 s).

State 3 – eu-west-1 and us-east-1 both unhealthy. Candidate set: only ap-southeast-2. Response: ap-southeast-2. Madrid connects at ~210 ms. Painful, but the application is intact.

State 4 – all three ALBs unhealthy. Candidate set empty – Route 53 returns all three records regardless of health. The client receives three addresses. First attempt fails; resolver behaviour varies from there. The point is the system doesn’t hand out DNS failures when the whole routing set is down.
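The selection semantics across those four states can be modeled in a few lines. This is a sketch of the behaviour described above, not Route 53's actual implementation, and the latencies are the illustrative Madrid figures:

```python
def route53_answer(candidates):
    """candidates: list of (region, latency_ms, healthy) tuples.
    Return the records answered: the single lowest-latency healthy record,
    or -- when nothing is healthy -- every record, health ignored."""
    healthy = [c for c in candidates if c[2]]
    if not healthy:
        return [c[0] for c in candidates]      # empty-set fallback: return all
    return [min(healthy, key=lambda c: c[1])[0]]

def madrid(eu_ok, us_ok, ap_ok):
    # Illustrative RTTs from a Madrid resolver, per the worked example.
    return [("eu-west-1", 28, eu_ok),
            ("us-east-1", 95, us_ok),
            ("ap-southeast-2", 210, ap_ok)]

print(route53_answer(madrid(True, True, True)))     # state 1: ['eu-west-1']
print(route53_answer(madrid(False, True, True)))    # state 2: ['us-east-1']
print(route53_answer(madrid(False, False, True)))   # state 3: ['ap-southeast-2']
print(route53_answer(madrid(False, False, False)))  # state 4: all three records
```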

Where this nests

Route 53 supports up to ten levels of nesting, and two-level patterns show up in the same scenario shape repeatedly. Latency → Failover puts a per-Region active/passive inside each latency leg – one record set giving “closest region AND in-region active/passive”. Geolocation → Latency pins GDPR-scoped users to the EU continent, then latency-selects among EU Regions inside that rule. Weighted → Latency runs a 10% canary globally with locality preserved inside both cohorts.

The useful skill is spotting the primary axis (the one the scenario optimises on) and the secondary axis (the one it constrains on). Once those are clear, the nesting writes itself.
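As a sketch, the Latency → Failover shape is two layers of records in the same zone: an inner failover pair at a Region-scoped name, and an outer latency record that is an in-zone alias to that name. The record names, ALB DNS names, and zone IDs below are hypothetical placeholders:

```python
def nested_latency_failover(apex, region, primary_dns, secondary_dns,
                            alb_zone_id, zone_id="Z_EXAMPLE_ZONE"):
    """One leg of a Latency -> Failover nest: an inner active/passive
    failover pair at <region>.<apex>, plus an outer latency alias at the
    apex pointing at that inner name. All identifiers are placeholders."""
    inner_name = f"{region}.{apex}"
    inner = [
        {"Name": inner_name, "Type": "A",
         "SetIdentifier": f"{region}-primary", "Failover": "PRIMARY",
         "AliasTarget": {"HostedZoneId": alb_zone_id, "DNSName": primary_dns,
                         "EvaluateTargetHealth": True}},
        {"Name": inner_name, "Type": "A",
         "SetIdentifier": f"{region}-secondary", "Failover": "SECONDARY",
         "AliasTarget": {"HostedZoneId": alb_zone_id, "DNSName": secondary_dns,
                         "EvaluateTargetHealth": True}},
    ]
    # Outer latency leg: for an alias to a record in the same hosted zone,
    # HostedZoneId is the zone's own ID.
    outer = {"Name": apex, "Type": "A",
             "SetIdentifier": f"latency-{region}", "Region": region,
             "AliasTarget": {"HostedZoneId": zone_id, "DNSName": inner_name,
                             "EvaluateTargetHealth": True}}
    return inner + [outer]
```

Repeat the function per Region and the result is "closest region AND in-region active/passive" in a single zone, still with no separate health-check resources.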

What’s worth remembering

  1. Seven routing policies exist – simple, weighted, latency, geolocation, geoproximity, failover, multi-value answer. Each optimises a different axis and most real setups nest two or more.
  2. “Latency” means Route 53’s measured RTT from its probes to each AWS Region – not resolver-to-ALB, not physical distance. Continent-code geolocation is not a substitute.
  3. Alias + EvaluateTargetHealth is the lightweight health-awareness path. It reuses the ALB’s own view of target health instead of a separately-configured Route 53 health-check resource.
  4. Unhealthy records drop out of the candidate set silently. Route 53 falls through to the next-lowest-latency healthy record with no DNS edit and no client retry.
  5. When every record is unhealthy, Route 53 returns them all. Empty-set fallback is the least-worst choice – a long-shot answer beats NXDOMAIN.
  6. Nesting goes up to ten levels deep. Latency-outer with Failover-inner is a common two-level shape for “closest region AND in-region active/passive”.
  7. DNS TTL plus propagation sets the cutover floor. A 60-second TTL and Route 53’s ~30-second propagation means clients cache the old answer for up to a minute and a half.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.