Dev.to
Surviving the region you run in: failover on Aurora DSQL, and what the demo proves
The thesis Quorum is built on is uncomfortable and true: the tools a team uses to coordinate an incident often live in the same region as the thing that is failing. When the region goes, the incident response goes with it. You are now coordinating a region outage over a status page that the region outage took down.
Quorum is an incident command plane designed to survive a region loss. This post is about how the failover works, what the live demo does and does not prove, and where the survival story currently ends, because a database audience will ask all three and they deserve a straight answer.
What DSQL gives you
A multi-region DSQL cluster in the US set spans three regions: two full regions, which for Quorum are us-east-1 and us-east-2, and a log-only witness in us-west-2 that has no cluster endpoint of its own. Both full-region endpoints present a single logical database with strong consistency, and the architecture is designed for 99.999% multi-region availability with no single point of failure and automated failure recovery.
The behavior that matters for an incident tool is stated plainly in the GA announcement: applications can keep reading and writing with strong consistency even when they are unable to connect to a region's cluster endpoint, and the third region acts as a log-only witness with no cluster resource or endpoint. The survivor keeps serving; the witness holds the log so the surviving region keeps commit quorum. Quorum is, in effect, a live demonstration of that reference behavior with an incident-command product wrapped around it.
Quorum's failover layer
AWS's guidance for multi-region DSQL is to put routing in front of the endpoints: either DNS-based routing with Route 53, or application-level routing logic, so traffic redirects automatically when an endpoint becomes unreachable. This is laid out in Implement multi-Region endpoint routing for Amazon Aurora DSQL. Quorum, a Next.js app on Vercel, does the application-level version: it detects an unreachable region and routes writes and reads to the healthy endpoint.
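To make the application-level option concrete, here is a minimal sketch of the routing decision, assuming a health probe has already run against each full-region endpoint. The names `RegionHealth`, `PREFERENCE`, and `pickRegion` are hypothetical and are not taken from Quorum's source; only the two region names come from this post.

```typescript
// Illustrative only: these types and names are assumptions, not Quorum's code.
type Region = "us-east-1" | "us-east-2";

interface RegionHealth {
  region: Region;
  reachable: boolean; // outcome of the most recent connectivity probe
}

// Assumed preference order: try us-east-1 first, then its peer.
const PREFERENCE: Region[] = ["us-east-1", "us-east-2"];

// Returns the first reachable region in preference order,
// or null when no full region can serve.
function pickRegion(health: RegionHealth[]): Region | null {
  for (const region of PREFERENCE) {
    const isUp = health.some((h) => h.region === region && h.reachable);
    if (isUp) return region;
  }
  return null;
}
```

Keeping the decision a pure function of the probe results makes it easy to exercise every combination of reachable and unreachable endpoints without touching a real cluster.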
The piece I am most satisfied with is that the health panel is itself failover-protected. A monitor Lambda re-validates failover on a schedule and writes a status snapshot through DSQL. So the component that tells you about the outage reads from the same database that survives the outage. The status display cannot become a casualty of the thing it is reporting on, which is the failure mode that makes most status pages useless at the exact moment you need them.
Ingestion works the same way. A CloudWatch alarm fires, EventBridge routes it, and an ingest Lambda writes the signal into DSQL as an event. Monitoring events become incidents through a path that does not hinge on a single region's data layer.
The capstone is recursive. Running a failover drill inside the product opens a real sev1 incident, "us-east-1 region impairment," which you then coordinate from the surviving region, in the same war room, on the same database, and resolve when the region restores. The drill exercises the exact flow a real region failure would. Because the event UUID is the idempotency key, the drill is safe to run repeatedly without leaving residue.
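The idempotency claim can be illustrated with a small in-memory stand-in for the event table. This is a sketch under assumptions: `IncidentEvent`, `EventLog`, and the sample UUID are hypothetical, and in a real database the same guarantee would typically come from a unique key on the event id rather than from application memory.

```typescript
// Illustrative only: an in-memory stand-in, not Quorum's schema or code.
interface IncidentEvent {
  eventId: string; // UUID used as the idempotency key
  incidentId: string;
  kind: string;
}

class EventLog {
  private byId: { [eventId: string]: IncidentEvent } = {};
  private order: string[] = [];

  // Appends the event once. Returns true if it was recorded,
  // false if this event id had already been seen.
  append(event: IncidentEvent): boolean {
    if (Object.prototype.hasOwnProperty.call(this.byId, event.eventId)) {
      return false;
    }
    this.byId[event.eventId] = event;
    this.order.push(event.eventId);
    return true;
  }

  size(): number {
    return this.order.length;
  }
}

// Replaying the same drill event leaves exactly one record behind.
const drillLog = new EventLog();
const drillEvent: IncidentEvent = {
  eventId: "0b8f6c1e-2a4d-4c55-9f10-7d3e5a1b2c3d", // made-up UUID
  incidentId: "drill-1",
  kind: "region_impairment_opened",
};
const firstAppend = drillLog.append(drillEvent); // true
const replayAppend = drillLog.append(drillEvent); // false
```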
What the demo proves, and what it does not
Now the precise part, because precision here is the whole point, and it cuts in my favor before it cuts against.
Here is what the demo proves, and it is all real: the application detects an unreachable region and routes to the survivor, the incident record does not fork under contention, the recursive drill opens and coordinates a real incident from the surviving side, and the health panel keeps reading because it reads through DSQL. Those are measured live, on the click. Recovery point is effectively zero, because strong consistency means a failover loses no committed data.
Here is the boundary. The chaos toggle simulates a region's endpoint becoming unreachable, which is AWS's own framing of the failure scenario, and it exercises the application-layer failover. It does not partition DSQL's internal commit quorum, because I cannot safely destroy a real AWS region to film a demo and you should not trust a tool that claimed to.
So the latencies the demo shows are happy-path latencies. On the happy path a commit needs a majority of the three cluster members, so it commits as soon as the two fastest acknowledge and never waits on the slowest. With one full region gone, the commit must reach the surviving region and the us-west-2 witness specifically, so it can no longer hide the slowest link behind the quorum, and commit latency in that degraded state runs higher than the demo's numbers.
That degraded-quorum behavior is AWS's to guarantee, and Marc Brooker, who led the DSQL team, documents how it stays consistent and available on the majority side of a partition. The application-layer survival is mine to demonstrate, and that is what the demo does. The number is real; what it measures is application failover, not DSQL committing through a degraded quorum.
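The quorum arithmetic is easy to check with a toy model. The sketch below assumes a commit completes once a majority of members have acknowledged, and it uses invented acknowledgement latencies purely for illustration. They are not measurements of DSQL, and `commitLatencyMs` is a hypothetical helper.

```typescript
// Toy model: a commit completes when `quorum` members have acknowledged,
// so its latency is the quorum-th fastest acknowledgement.
function commitLatencyMs(ackLatenciesMs: number[], quorum: number): number {
  if (ackLatenciesMs.length < quorum) {
    throw new Error("not enough reachable members to form a quorum");
  }
  const sorted = ackLatenciesMs.slice().sort((a, b) => a - b);
  const value = sorted[quorum - 1];
  if (value === undefined) {
    throw new Error("quorum must be at least 1");
  }
  return value;
}

// Invented latencies as seen from a writer in us-east-1:
// local member 1 ms, us-east-2 15 ms, us-west-2 witness 60 ms.
const happyPathMs = commitLatencyMs([1, 15, 60], 2); // 15: the slowest link is hidden
// us-east-2 unreachable: the witness must now be part of every quorum.
const degradedMs = commitLatencyMs([1, 60], 2); // 60: the slowest link is exposed
```

With all three members reachable the slowest acknowledgement never sits on the critical path. With only two, the quorum is both of them, so the slower one sets the commit time.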
The both-regions-down case
There is a state most demos would quietly fake: both full regions unreachable at once. Quorum does not fake it. When no region can serve, the product says so, and it says the true thing, that committed data stays safe via the witness's journal and writes resume when a region recovers. The demo proofs that would need to write step aside rather than claim a commit that cannot happen. A coordination tool that lies about its own state in the failure case is worse than no tool, because it lies precisely when you are relying on it most.
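One way to keep that state honest in code is to model it as its own case rather than as an error path. The sketch below is an assumption about shape, not Quorum's implementation: `ClusterState`, `describeState`, and `canWrite` are hypothetical names, and the message text paraphrases this post.

```typescript
// Illustrative only: a tagged union so "no region can serve" is a
// first-class state the UI must render, not an exception to swallow.
type ClusterState =
  | { kind: "serving"; region: string; degraded: boolean }
  | { kind: "unavailable"; message: string };

const FULL_REGIONS: string[] = ["us-east-1", "us-east-2"];

function describeState(reachable: { [region: string]: boolean }): ClusterState {
  let first: string | null = null;
  let upCount = 0;
  for (const region of FULL_REGIONS) {
    if (reachable[region] === true) {
      if (first === null) first = region;
      upCount += 1;
    }
  }
  if (first === null) {
    return {
      kind: "unavailable",
      message:
        "No region can serve. Committed data is safe; writes resume when a region recovers.",
    };
  }
  return { kind: "serving", region: first, degraded: upCount < FULL_REGIONS.length };
}

// Anything that would write checks this first and steps aside when false.
function canWrite(state: ClusterState): boolean {
  return state.kind === "serving";
}
```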
Where the survival ends, for now
The same honesty applies to the architecture, not only the demo. What survives a region loss today is the data plane. DSQL's multi-region cluster keeps the incident record available and strongly consistent on the surviving side, with no data loss. The application tier does not yet match it: the Vercel functions and the ingestion and monitor Lambdas, as deployed, run in a single region, so a real loss of that region would take the serving layer down even though the data underneath it survives.
AWS names the fix directly, in its writeup on DSQL for global-scale financial transactions: pair an active-active data layer with active-active application tiers so the full stack absorbs a regional disruption. That is the next step here: deploy the functions and the Lambdas across regions so the serving layer fails over the way the data already does. The hard part, strongly consistent coordination state that does not fork across regions, is done. The remaining part is standard multi-region deployment. I would rather name that boundary than imply the whole stack already survives.
Break it yourself at https://quorum-h0.vercel.app: run a drill and watch it fail over, then take both regions down and see the honest state. The source and the full decision log are on GitHub at https://github.com/hocmemini/quorum.
This post was created for the purposes of entering the H0 "Hack the Zero Stack" hackathon. It is one of three: a companion post covers the event-sourced data model and optimistic concurrency, and a third covers how the system was built by directing an AI agent under an append-only decision log. #H0Hackathon