Quorum uses append-only decision logs and Aurora DSQL for multi-region incident coordination

AI-Generated Summary

1 source

3 hours ago

Key Points

Quorum coordinates incidents using an event-sourced data model on a multi-region Amazon Aurora DSQL cluster: two full regions (us-east-1 and us-east-2) plus a log-only witness in us-west-2.
Incident state changes are appended as immutable events rather than in-place updates, and each event’s UUID doubles as primary key and idempotency key, so a retried write is a no-op.
Concurrent responder updates cannot fork the record: Aurora DSQL’s optimistic concurrency control rejects the losing transaction at commit time with a serialization error, and the application retries it.
An application-level failover layer routes reads and writes to a healthy DSQL endpoint when a region becomes unreachable, and the health panel reads its status from the same DSQL database that survives the outage.
Development is governed by an append-only architecture decision log that supplies the continuity the AI coding agent lacks, paired with a 50-test end-to-end merge gate and secrets-hygiene checks.

Quorum is an incident command plane designed to keep incident data consistent when a region fails and when multiple responders update the same record concurrently. Across three related posts, the developer describes the governance and database design choices behind it. Most of the code was written by directing an AI coding agent (Claude Code), and the developer argues that an unsupervised agent drifts into an incoherent system unless something outside it supplies a durable source of truth. That source is an append-only architecture decision log (DEC-001 onward). Each entry captures the context, the decision, references to prior decisions, and a status; entries are never edited, only superseded by new numbered entries, and the log is committed separately from the code that implements it. A 50-test end-to-end suite functions as the merge gate, alongside conventional commits, a clean working tree, a secrets scanner, and a CLI deploy preflight that verifies the right account is selected.

On the database side, Quorum uses event sourcing on Amazon Aurora DSQL across four tables, storing incident state as immutable events. Each event’s UUID serves as primary key and idempotency key, so a retried write collides on the key and becomes a no-op. Writes rely on Aurora DSQL’s optimistic concurrency control: a transaction reads a consistent snapshot, conflicts are detected at commit time, and the losing transaction receives a serialization error (SQLSTATE 40001) that the application retries. Because no row locks are held, there are no deadlocks and no stranded lock state to reconcile after a failure. The developer also details an application-level failover layer that routes reads and writes to the healthy DSQL endpoint; the health panel stays available because a monitor Lambda writes status snapshots through the same DSQL database. A live reliability and failover demo is provided, along with explicit boundaries: the chaos toggle simulates an unreachable endpoint rather than a degraded DSQL commit quorum, and the application tier currently runs in a single region.

How Outlets Covered This Story

Dev.to

I built a region-survivable system by directing an AI agent. An append-only decision log kept it coherent.

Most of the code in Quorum was written by directing Claude Code, an AI coding agent. That is not the interesting claim, and on its own it is not even a good one. An agent left to run unsupervised produces fast, plausible, locally-correct code that drifts into an incoherent system. The interesting part is the discipline that turned agent speed into a coherent, correct, multi-region database application. That discipline was an append-only architecture decision log.

The failure mode of agent-built software

An agent has no memory across sessions. It will happily contradict a decision it "made" yesterday, re-open a question that was settled last week, or quietly drift from the design because the local change in front of it looks fine. Each individual output is reasonable. The aggregate, without governance, is a system where the data model fights the access layer and the third change undoes the first.

This is the part people underestimate when they talk about AI coding velocity. Speed without a source of truth does not get you to a good system faster. It gets you to entropy faster. A fast writer with no memory and no sense of consequence is a liability at scale unless something outside the agent supplies the continuity.

The decision log

Quorum carries a file of numbered architecture decisions, DEC-001 onward, now past two dozen. Each entry has the same shape: the context that forced the decision, the decision itself, references to the prior decisions it refines or interacts with, and a status.

Three rules make it work:

Append-only. Entries are never edited. A later decision can supersede an earlier one, but it does so as a new numbered entry that references the old one. The history of why the system is shaped the way it is stays intact and readable, including the choices that were later reversed and why.

Committed separately from code. The decision and the code that implements it are different commits. The log reads as a clean narrative of intent, independent of the diffs that carried it out.

It is the contract. Every prompt I gave the agent carried the log as context. When a new instruction risked contradicting an earlier decision, the log was there to catch it, for the agent and for me.

This is not documentation written after the fact to make the project look organized. It is the input that keeps the next change consistent with every change before it. The log is the memory the agent does not have.

The other guardrails

The log is the spine. A few standing rules are the ribs, and every prompt carried them:

A 50-test end-to-end suite is the merge gate. Nothing lands that does not keep it green. The agent can write whatever it likes; it does not merge unless the proofs still pass on the real deployment.

Conventional commits and a clean working tree, so the history stays legible to a human reading it later.

Secrets hygiene as a hard rule: a secrets scanner runs clean before anything approaches a public branch, account identifiers live only in gitignored files, and deploys run through a CLI preflight that verifies the right account is selected.

None of this is exotic. It is the ordinary discipline of a careful engineer. The entire point is that the agent does not supply it. You do.

What got built this way

Under that governance, the agent built an event-sourced incident command plane on Amazon Aurora DSQL in a multi-region active-active configuration, with a Next.js front end on Vercel.

Optimistic-concurrency-based correctness so the incident record cannot fork under cross-region contention.

A chaos-aware failover demo that is precise about what it simulates rather than overclaiming.

Ingestion from CloudWatch through EventBridge and Lambda into DSQL.

Credential-free auth over IAM with OIDC, so no static database secrets exist in the system.

The decision log is public in the repository, so the architecture is not only shipped, it is explained. You can read the reason for every choice, and you can read the reversals.

The actual lesson for engineers

Agent-assisted development at a senior level is not about typing less. The agent is fast and competent at the local task; that is settled. What it lacks is judgment across time: the memory of why a thing was decided, the refusal to re-litigate it, the sense of what a change will cost three decisions from now. That is the part you keep for yourself.

My job on Quorum was the architecture and the governance: the decision log, the test gate, the boundaries the agent worked inside. The decisions that needed a human were the ones an agent cannot weigh: choosing an event-sourced model so the audit trail and idempotency came for free, making the event UUID the idempotency key so retries are safe by construction, designing the chaos demo to simulate the real failure mode rather than fake a region kill, and owning the line between what the data plane survives and what the application tier does not yet. The agent wrote the code for those decisions. It did not make any of them.

If you are on a team adopting agents, the question is not "how much can the agent write." It is "what is your decision log, what is your gate, and who owns the judgment the agent does not have." Answer those three and the velocity is real. Skip them and the velocity is a trap.

The system is live at https://quorum-h0.vercel.app. The repository, including the full decision log, is at https://github.com/hocmemini/quorum. Two companion posts go deeper: one on the event-sourced data model and optimistic concurrency, the other on the failover layer and what the chaos demo proves.

This post was created for the purposes of entering the H0 "Hack the Zero Stack" hackathon. #H0Hackathon
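
The three rules of the decision log described in this post can be sketched as a small data structure. This is an illustration only, not the format of the actual Quorum log: the field names, the `supersedes` field, the `appendDecision` and `isSuperseded` helpers, and the three-digit id padding are all assumptions made for the example.

```typescript
// Hypothetical shape of one decision entry. Field names are illustrative;
// the post only says an entry holds context, decision, references, and a status.
interface DecisionEntry {
  readonly id: string; // "DEC-001", "DEC-002", ...
  readonly context: string;
  readonly decision: string;
  readonly references: readonly string[]; // prior decisions this one refines
  readonly supersedes: readonly string[]; // assumption: explicit supersession links
  readonly status: string;
}

type DecisionDraft = Omit<DecisionEntry, "id">;

function formatDecisionId(n: number): string {
  return "DEC-" + (n < 1000 ? ("00" + n).slice(-3) : String(n));
}

// Append-only: returns a new log with the entry added at the end.
// Existing entries are never modified, and every reference must point
// at an entry that already exists.
function appendDecision(
  log: readonly DecisionEntry[],
  draft: DecisionDraft,
): DecisionEntry[] {
  const known = new Set(log.map((entry) => entry.id));
  for (const ref of [...draft.references, ...draft.supersedes]) {
    if (!known.has(ref)) {
      throw new Error("unknown decision reference: " + ref);
    }
  }
  const entry: DecisionEntry = { id: formatDecisionId(log.length + 1), ...draft };
  return [...log, entry];
}

// Supersession is derived from later entries, so the old entry stays untouched.
function isSuperseded(log: readonly DecisionEntry[], id: string): boolean {
  return log.some((entry) => entry.supersedes.indexOf(id) !== -1);
}
```

Because supersession is computed from later entries rather than written back into the old one, a reversed decision stays readable in its original form, which is the property the append-only rule is meant to protect.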

5 hours ago

Dev.to

Surviving the region you run in: failover on Aurora DSQL, and what the demo proves

The thesis Quorum is built on is uncomfortable and true: the tools a team uses to coordinate an incident often live in the same region as the thing that is failing. When the region goes, the incident response goes with it. You are now coordinating a region outage over a status page that the region outage took down.

Quorum is an incident command plane designed to survive a region loss. This post is about how the failover works, what the live demo does and does not prove, and where the survival story currently ends, because a database audience will ask all three and they deserve a straight answer.

What DSQL gives you

A multi-region DSQL cluster in the US set is three regions: two full regions, which for Quorum are us-east-1 and us-east-2, and a log-only witness in us-west-2 that has no cluster endpoint of its own. Both full-region endpoints present a single logical database with strong consistency, and the architecture is designed for 99.999% multi-region availability with no single point of failure and automated failure recovery.

The behavior that matters for an incident tool is stated plainly in the GA announcement: applications can keep reading and writing with strong consistency even when they are unable to connect to a region's cluster endpoint, and the third region acts as a log-only witness with no cluster resource or endpoint. The survivor keeps serving; the witness holds the log so the surviving region keeps commit quorum. Quorum is, in effect, a live demonstration of that reference behavior with an incident-command product wrapped around it.

Quorum's failover layer

AWS's guidance for multi-region DSQL is to put routing in front of the endpoints: either DNS-based routing with Route 53, or application-level routing logic, so traffic redirects automatically when an endpoint becomes unreachable. This is laid out in "Implement multi-Region endpoint routing for Amazon Aurora DSQL." Quorum, a Next.js app on Vercel, does the application-level version: it detects an unreachable region and routes writes and reads to the healthy endpoint.

The piece I am most satisfied with is that the health panel is itself failover-protected. A monitor Lambda re-validates failover on a schedule and writes a status snapshot through DSQL. So the component that tells you about the outage reads from the same database that survives the outage. The status display cannot become a casualty of the thing it is reporting on, which is the failure mode that makes most status pages useless at the exact moment you need them.

Ingestion works the same way. A CloudWatch alarm fires, EventBridge routes it, and an ingest Lambda writes the signal into DSQL as an event. Monitoring events become incidents through a path that does not hinge on a single region's data layer.

The capstone is recursive. Running a failover drill inside the product opens a real sev1 incident, "us-east-1 region impairment," which you then coordinate from the surviving region, in the same war room, on the same database, and resolve when the region restores. The drill exercises the exact flow a real region failure would. Because the event UUID is the idempotency key, the drill is safe to run repeatedly without leaving residue.

What the demo proves, and what it does not

Now the precise part, because precision here is the whole point, and it cuts in my favor before it cuts against.

Here is what the demo proves, and it is all real: the application detects an unreachable region and routes to the survivor, the incident record does not fork under contention, the recursive drill opens and coordinates a real incident from the surviving side, and the health panel keeps reading because it reads through DSQL. Those are measured live, on the click. Recovery point is effectively zero, because strong consistency means a failover loses no committed data.

Here is the boundary. The chaos toggle simulates a region's endpoint becoming unreachable, which is AWS's own framing of the failure scenario, and it exercises the application-layer failover. It does not partition DSQL's internal commit quorum, because I cannot safely destroy a real AWS region to film a demo and you should not trust a tool that claimed to. So the latencies the demo shows are happy-path latencies. On the happy path a commit needs a majority of the three cluster members, so it commits as soon as the two fastest acknowledge and never waits on the slowest. With one full region gone, the commit must reach the surviving region and the us-west-2 witness specifically, so it can no longer hide the slowest link behind the quorum, and commit latency in that degraded state runs higher than the demo's numbers. That degraded-quorum behavior is AWS's to guarantee, and Marc Brooker, who led the DSQL team, documents how it stays consistent and available on the majority side of a partition. The application-layer survival is mine to demonstrate, and that is what the demo does. The number is real; what it measures is application failover, not DSQL committing through a degraded quorum.

The both-regions-down case

There is a state most demos would quietly fake: both full regions unreachable at once. Quorum does not fake it. When no region can serve, the product says so, and it says the true thing, that committed data stays safe via the witness's journal and writes resume when a region recovers. The proofs that would write step aside rather than claim a commit that cannot happen. A coordination tool that lies about its own state in the failure case is worse than no tool, because it lies precisely when you are relying on it most.

Where the survival ends, for now

The same honesty applies to the architecture, not only the demo. What survives a region loss today is the data plane. DSQL's multi-region cluster keeps the incident record available and strongly consistent on the surviving side, with no data loss. The application tier does not yet match it: the Vercel functions and the ingestion and monitor Lambdas, as deployed, run in a single region, so a real loss of that region would take the serving layer down even though the data underneath it survives.

AWS names the fix directly, in its writeup on DSQL for global-scale financial transactions: pair an active-active data layer with active-active application tiers so the full stack absorbs a regional disruption. That is the next step here, deploy the functions and the Lambdas across regions so the serving layer fails over the way the data already does. The hard part, strongly consistent coordination state that does not fork across regions, is done. The remaining part is standard multi-region deployment. I would rather name that boundary than imply the whole stack already survives.

Break it yourself at https://quorum-h0.vercel.app: run a drill and watch it fail over, then take both regions down and see the honest state. The source and the full decision log are on GitHub at https://github.com/hocmemini/quorum.

This post was created for the purposes of entering the H0 "Hack the Zero Stack" hackathon. It is one of three: a companion post covers the event-sourced data model and optimistic concurrency, and a third covers how the system was built by directing an AI agent under an append-only decision log. #H0Hackathon
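
The application-level routing this post describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Quorum's actual failover code: `RegionEndpoint`, `Probe`, and `pickHealthyEndpoint` are hypothetical names, the hostnames are placeholders, and the probe stands in for whatever reachability check the real application performs (for example, a short-timeout query).

```typescript
// Hypothetical description of one full-region cluster endpoint.
interface RegionEndpoint {
  region: string;
  host: string; // placeholder hostnames below, not real DSQL endpoints
}

// A probe resolves to true when the endpoint answers in time.
// How reachability is actually checked is an assumption left to the caller.
type Probe = (endpoint: RegionEndpoint) => Promise<boolean>;

// Try endpoints in preference order and return the first healthy one.
// Returns null when no region can serve, so the caller can report that
// state honestly instead of pretending a write succeeded.
async function pickHealthyEndpoint(
  endpoints: RegionEndpoint[],
  probe: Probe,
): Promise<RegionEndpoint | null> {
  for (const endpoint of endpoints) {
    try {
      if (await probe(endpoint)) return endpoint;
    } catch {
      // A probe that throws is treated the same as an unreachable endpoint.
    }
  }
  return null;
}

// The two full regions named in the post; the witness has no endpoint to route to.
const fullRegions: RegionEndpoint[] = [
  { region: "us-east-1", host: "east-1.example.invalid" },
  { region: "us-east-2", host: "east-2.example.invalid" },
];
```

Returning `null` rather than throwing keeps the both-regions-down case an ordinary, visible state for the caller to render, in line with the post's point that the product should say so when no region can serve.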

5 hours ago

Dev.to

Optimistic concurrency is the whole design: event sourcing on Aurora DSQL

Quorum is an incident command plane built on Amazon Aurora DSQL. The failover story lives in another post. This one is about a narrower question that turned out to be the foundation: when several responders write to the same incident at the same moment, across regions, during the worst minutes of an outage, how do you guarantee the record never forks into two conflicting truths.

The answer is two design choices that are really one choice seen from two angles: event sourcing, and DSQL's optimistic concurrency control.

The data model is append-only

Quorum is event-sourced across four tables. Every state change is an immutable event appended to a log, not an in-place update. The current state of an incident is a fold over its events. There is no UPDATE incidents SET status = ...; there is an acknowledged event, a note event, a resolved event, and the status you render is computed from them.

The event's UUID is its primary key and its idempotency key at the same time. A retried write carrying the same UUID cannot double-apply: the insert collides on the primary key and becomes a no-op. That property sounds minor until you remember what kind of system this is. A tool designed to survive network failure retries writes constantly, and "the responder tapped resolve twice because the first response was slow" must not produce two resolutions.

Append-only also suits the domain directly. For an incident system the audit trail is the product, not a side effect. "Who acknowledged this, at what time, and what did the timeline look like at 02:14" is a first-class question for the post-incident review and a compliance requirement in regulated environments. Event sourcing gives you that for free. It also gives DSQL a write pattern it likes, which matters more than you would expect.

The stack, briefly

TypeScript end to end.

Kysely as a typed query builder rather than an ORM, because I wanted type safety without surrendering control of the SQL: on a distributed database the exact shape of a query has real consequences, and I did not want a query planner I could not see.

Next.js App Router on Vercel for the front end and the server-side data access.

DSQL as the database, reached over IAM using Vercel's OIDC federation to AWS, so there are no static database credentials anywhere in the system.

DSQL uses a PostgreSQL parser, planner, and type system, so the dialect is largely compatible and the standard Postgres driver and Kysely work with minimal ceremony. The places it diverges are documented and worth reading before you design a schema: "How Amazon Aurora DSQL differs from single-instance PostgreSQL."

Optimistic concurrency, the core

DSQL does not take row locks. A transaction reads a consistent snapshot, does its work, and the conflict check happens at commit time. When two transactions modify the same data, the one with the earliest commit time wins and the other receives a serialization error, the PostgreSQL SQLSTATE 40001 (DSQL also surfaces its own OC000 and OC001 codes), which the application is expected to retry. No locks are held for the duration of a transaction, and there are no deadlocks, ever. This is documented in "Concurrency control in Aurora DSQL."

So a DSQL application is not "write SQL and hope." It is "write SQL, catch the conflict, retry the whole transaction." Quorum wraps writes in a bounded retry with a small backoff. AWS's guidance is that the retried transaction should be idempotent, which closes a loop with the data model: the event UUID is already the idempotency key, so the retry is safe by construction rather than by hope.

A subtlety the docs call out, and worth internalizing: SELECT ... FOR UPDATE is syntactically supported but does not block. It surfaces as a commit-time conflict instead. If you carry over a Postgres habit of serializing access to a hot row with FOR UPDATE, that path becomes a retryable conflict rather than a blocking wait, and a row everyone updates at once becomes a retry storm. The schema fix is the one event sourcing already gives you: append new rows instead of updating a shared counter in place.

The insight that makes this more than a concurrency trick

Here is the part worth slowing down for. Optimistic concurrency is usually sold as a throughput story: no locks, no lock contention, no deadlocks, scales cleanly. All true. But Marc Brooker, who led the team that built DSQL, has written about a deeper consequence, which is that the lock-free design is also why DSQL's failure recovery is clean. His post on what DSQL does during a partition is the source worth reading.

Think about what a pessimistic, lock-based database has to do when it loses a region or a node mid-flight: there are locks held by transactions that were in progress when the failure hit, and that state has to be reconciled before the survivor can safely proceed. A lock-free system has no lock state to strand. Brooker is concrete about why this matters in DSQL: the component that orders commits, the adjudicator, holds no durable state, so when a region drops away the adjudicator leader moves to the surviving majority side, which already knows every committed transaction and so has everything it needs to recreate that state. There are no stranded locks to untangle, because there were never any locks.

That is the same mechanism that keeps Quorum's incident record from forking. When two responders contend on the same record, optimistic concurrency guarantees one commits and the other retries against the now-updated state. There is one truth. The property that lets the database survive a region loss and the property that keeps the incident record consistent under contention are not two features bolted together. They are one design choice.
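
The "catch the conflict, retry the whole transaction" pattern can be sketched as a bounded retry with exponential backoff. This is a minimal sketch, not Quorum's actual wrapper: `withOccRetry` and `isRetryableConflict` are hypothetical names, the default attempt count and delay are arbitrary, and the error check assumes a driver that exposes the SQLSTATE as a string `code` property (as node-postgres does), with the OC000/OC001 identifiers matched in the message text as a fallback.

```typescript
// Assumption: the driver puts the SQLSTATE on `code`, and DSQL's own
// OC000 / OC001 identifiers may appear in the error message.
function isRetryableConflict(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const { code, message } = err as { code?: unknown; message?: unknown };
  if (code === "40001") return true; // PostgreSQL serialization_failure
  return typeof message === "string" && /\bOC00[01]\b/.test(message);
}

interface RetryOptions {
  maxAttempts: number; // total tries, including the first
  baseDelayMs: number; // delay before the first retry; doubles each time
}

// Re-runs the entire transaction callback on a conflict. The callback must
// be idempotent, which the event-UUID-as-primary-key model provides.
async function withOccRetry<T>(
  runTransaction: () => Promise<T>,
  opts: RetryOptions = { maxAttempts: 5, baseDelayMs: 20 },
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await runTransaction();
    } catch (err) {
      // Non-conflict errors and exhausted budgets propagate to the caller.
      if (!isRetryableConflict(err) || attempt >= opts.maxAttempts) throw err;
      const delayMs = opts.baseDelayMs * Math.pow(2, attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

The bound matters as much as the retry: on a hot row, an unbounded loop would turn the retry storm the post warns about into a stuck request instead of a visible error.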

You can watch it

The Reliability surface on the live deployment runs this in front of you. The no-split-brain demo races two writers at the same record and shows that the record never diverges. A burst test fires fifty concurrent writes; they all commit durably, with conflicts resolved by retry rather than lost. Every number on that page is measured on the click, not canned, so the latency you see is the latency the database returned for that request. Run it yourself: https://quorum-h0.vercel.app/reliability.

Why Vercel fits

DSQL is serverless: there is no instance to size and no connection pool to manage, and it scales to zero between bursts. That removes the pool-babysitting that taxes traditional Postgres on serverless platforms, and it fits Next.js on Vercel, where server functions are short-lived and you do not want a connection proxy sitting in the middle. Pair that with OIDC for credential-free auth and the data tier and the deployment tier fit together without a secrets manager or a proxy between them.

The live demo is at https://quorum-h0.vercel.app. The source, including the full architecture decision log, is on GitHub at https://github.com/hocmemini/quorum.

This post was created for the purposes of entering the H0 "Hack the Zero Stack" hackathon. It is one of three: a companion post covers the failover layer and what the chaos demo proves, and a third covers how the system was built by directing an AI agent under an append-only decision log. #H0Hackathon
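
The append-only model from this post (state as a fold over events, with the event UUID as both primary key and idempotency key) can be illustrated with an in-memory sketch. It is not Quorum's schema or code: `InMemoryEventLog`, `foldIncident`, the event fields, and the rule that an acknowledgement never reopens a resolved incident are assumptions for the example, and a `Map` keyed by event id stands in for the primary-key constraint.

```typescript
// Hypothetical event shapes; the post names acknowledged, note, and resolved events.
type IncidentEvent =
  | { id: string; kind: "acknowledged"; by: string; at: string }
  | { id: string; kind: "note"; by: string; at: string; text: string }
  | { id: string; kind: "resolved"; by: string; at: string };

interface IncidentState {
  status: "open" | "acknowledged" | "resolved";
  notes: string[];
}

class InMemoryEventLog {
  private readonly byId = new Map<string, IncidentEvent>();

  // Mirrors a primary-key collision: a second append with the same id
  // changes nothing and reports false.
  append(event: IncidentEvent): boolean {
    if (this.byId.has(event.id)) return false;
    this.byId.set(event.id, event);
    return true;
  }

  // Events ordered by their ISO-8601 timestamp string.
  events(): IncidentEvent[] {
    return Array.from(this.byId.values()).sort((a, b) =>
      a.at < b.at ? -1 : a.at > b.at ? 1 : 0,
    );
  }
}

// Current state is computed from the events; nothing is updated in place.
function foldIncident(events: IncidentEvent[]): IncidentState {
  const state: IncidentState = { status: "open", notes: [] };
  for (const event of events) {
    switch (event.kind) {
      case "acknowledged":
        // Assumption: an acknowledgement does not reopen a resolved incident.
        if (state.status === "open") state.status = "acknowledged";
        break;
      case "note":
        state.notes.push(event.text);
        break;
      case "resolved":
        state.status = "resolved";
        break;
    }
  }
  return state;
}
```

In this sketch the "responder tapped resolve twice" case is the second `append` call with the same id: it returns `false`, the log is unchanged, and the folded state shows a single resolution.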

5 hours ago
