Your System is a State Machine | The Chris Frequency

The most common first question in a new system design is "what database should I use?" The second is usually "should this be a microservice?" Both are premature. Before you pick the infrastructure, model the domain.

Your system is a state machine whether you name it or not. Naming it gives you leverage: over complexity, over testing, over the conversations you have with your team about what the system actually does. This article is about that mental model. Not a framework, not a library, but a way of thinking about systems that makes the downstream choices obvious.

The architecture that follows from this model is described in Strategic Monolith + Satellites. This article is the foundation it rests on.

The Mental Model

Work items exist in a defined set of states, and they move between those states via transitions. Each transition is:

Atomic: it either fully happens or it does not.
Idempotent: triggering the same transition twice produces the same result as triggering it once.

This is not a novel idea. It is what most well-designed systems already do, often without naming it.
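
As a minimal sketch of what "atomic and idempotent" can look like in practice, the example below uses Python's built-in sqlite3 module and a conditional UPDATE. The orders table, the confirm_order function, and the choice to treat "already confirmed" as a no-op are assumptions made for illustration, not something this article prescribes.

```python
import sqlite3


def confirm_order(conn: sqlite3.Connection, order_id: int) -> str:
    """Move an order PENDING -> CONFIRMED; repeating the call is a no-op."""
    with conn:  # one transaction: commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE orders SET status = 'CONFIRMED' "
            "WHERE id = ? AND status = 'PENDING'",
            (order_id,),
        )
        if cur.rowcount == 1:
            return "CONFIRMED"  # the transition happened just now
        row = conn.execute(
            "SELECT status FROM orders WHERE id = ?", (order_id,)
        ).fetchone()
    if row is None:
        raise LookupError(f"no order {order_id}")
    if row[0] == "CONFIRMED":
        return "CONFIRMED"  # already applied earlier: same result, no change
    raise ValueError(f"cannot confirm an order in state {row[0]}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO orders VALUES (1, 'PENDING')")
conn.commit()

first = confirm_order(conn, 1)
second = confirm_order(conn, 1)  # triggering it twice gives the same result
```

The WHERE clause carries the prior-state check into the write itself, so the check and the change cannot be separated by a concurrent caller.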

A few examples:

An order: PENDING → CONFIRMED → SHIPPED → DELIVERED (or CANCELLED)
A KYC check: SUBMITTED → PROCESSING → APPROVED / REJECTED / REFERRED
A payment: INITIATED → AUTHORISED → SETTLED / FAILED / REVERSED
A loan application: DRAFT → SUBMITTED → UNDER_REVIEW → APPROVED / DECLINED

The states and transitions are the domain model. Get them right and everything else follows. Get them wrong and no amount of clever infrastructure compensates.
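
The order example above can be written down as a transition table. This is a sketch only: the OrderStatus and ALLOWED names are hypothetical, and the rule that an order can be cancelled only before it ships is an assumption, since the one-line lifecycle does not say where cancellation is allowed.

```python
from enum import Enum


class OrderStatus(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    SHIPPED = "shipped"
    DELIVERED = "delivered"
    CANCELLED = "cancelled"


# Assumption: cancellation is only possible before the order ships.
ALLOWED = {
    OrderStatus.PENDING: {OrderStatus.CONFIRMED, OrderStatus.CANCELLED},
    OrderStatus.CONFIRMED: {OrderStatus.SHIPPED, OrderStatus.CANCELLED},
    OrderStatus.SHIPPED: {OrderStatus.DELIVERED},
    OrderStatus.DELIVERED: set(),  # terminal
    OrderStatus.CANCELLED: set(),  # terminal
}


def can_transition(current: OrderStatus, target: OrderStatus) -> bool:
    """Return True if the table permits moving from current to target."""
    return target in ALLOWED[current]
```

Writing the table forces the questions that a diagram glosses over, such as whether a shipped order can still be cancelled.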

What Happens Without It

Every system has states and transitions. The question is whether they are explicit or accidental. When they are accidental, the same failure modes appear over and over.

Boolean field proliferation. Instead of an enum, the table has is_processed, is_approved, is_sent, is_failed. Half the combinations are impossible but nothing prevents them. What does is_processed = true, is_approved = false, is_failed = false mean? Nobody is sure. Every query that touches these fields needs to encode its own interpretation of which combinations are valid. This is sometimes called "boolean blindness": the type system cannot help you because it does not know what the booleans represent in combination.
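
To put a number on it, the sketch below enumerates what the four flags can represent and compares that with an enum. The Status enum and its member names are hypothetical; the point is only the size of the two spaces.

```python
from enum import Enum
from itertools import product

FLAGS = ("is_processed", "is_approved", "is_sent", "is_failed")

# Every row the four boolean columns can represent: 2 ** 4 combinations.
boolean_rows = [
    dict(zip(FLAGS, values))
    for values in product((False, True), repeat=len(FLAGS))
]


class Status(Enum):  # hypothetical replacement for the four flags
    PENDING = "pending"
    PROCESSED = "processed"
    APPROVED = "approved"
    SENT = "sent"
    FAILED = "failed"


print(len(boolean_rows))  # 16 representable combinations
print(len(Status))        # 5 named states, and nothing else
```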

Status columns that drift. A VARCHAR status column, updated from anywhere in the codebase. New values appear informally. Nobody knows all possible states without grepping. No guard prevents an item moving from DELIVERED back to PENDING. The column is technically a state machine; it is just one with no rules.

Partial state from non-atomic updates. A "state change" touches three tables in sequence. The process fails halfway through. The item is now half-approved: one table says yes, another says no. No single transaction boundary protected the change. Fixing these items is a manual, data-specific recovery job.
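
A sketch of the alternative, assuming a single relational database and using Python's built-in sqlite3 module: both writes sit inside one transaction, so a failure between them rolls the first one back. The table names, the approve_application function, and the fail_midway switch are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE applications (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("CREATE TABLE decisions (application_id INTEGER NOT NULL, outcome TEXT NOT NULL)")
conn.execute("INSERT INTO applications VALUES (1, 'UNDER_REVIEW')")
conn.commit()


def approve_application(conn: sqlite3.Connection, application_id: int,
                        fail_midway: bool = False) -> None:
    """Write the status change and the decision row as one unit."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute(
            "UPDATE applications SET status = 'APPROVED' WHERE id = ?",
            (application_id,),
        )
        if fail_midway:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute(
            "INSERT INTO decisions VALUES (?, 'APPROVED')",
            (application_id,),
        )


try:
    approve_application(conn, 1, fail_midway=True)
except RuntimeError:
    pass

# The first write was rolled back, so the item is not half-approved.
status_after_failure = conn.execute(
    "SELECT status FROM applications WHERE id = 1"
).fetchone()[0]
```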

Invisible transitions. State changes are side effects buried in application code. Understanding the lifecycle of a work item means reading the entire codebase. There is no single function you can point to and say "this is how an item moves from X to Y". Onboarding a new developer means walking them through the implicit workflow verbally, because it is not written down anywhere the code can enforce.

Mystery states in production. Items end up somewhere nobody recognises. "How did this get to PROCESSING with no verification result?" Nobody knows, because there was no guard to prevent it. The incident response is a forensic exercise across logs, database snapshots, and guesswork.

The accidental distributed system. Multiple services each own a slice of the workflow state. Nobody designed the coordination; it emerged. You now have eventual consistency problems without ever having chosen eventual consistency. Debugging requires correlating state across services, and the answer to "what state is this item in?" depends on which service you ask.

At a previous fintech company, I saw this pattern in a data processing pipeline built on multiple Kafka topics and aggregation stages. Each service would receive a document, process it, and forward the output. A final aggregator combined the results from all upstream stages to publish the system's output data points.

The problem: everything had to work flawlessly for any of it to work at all.

When an upstream service failed or produced bad data, the final aggregator had an incomplete or inaccurate view. Fixing it meant updating aggregator state ad hoc and, in the worst case, replaying entire input topics, which published potentially out-of-date data for days before eventual consistency kicked in and state normalised.

Hard to reason about, hard to identify affected cases, hard to test, and no definitive source of truth anywhere.

If any of this sounds familiar, naming the state machine is the fix.

Modelling States and Transitions

States as Enums

The first discipline is representing states as enums, not strings or booleans. An enum makes invalid states unrepresentable. The compiler (or runtime, depending on your language) becomes your first line of defence.

from enum import Enum

class KYCStatus(Enum):
    SUBMITTED  = "submitted"
    PROCESSING = "processing"
    APPROVED   = "approved"
    REJECTED   = "rejected"
    REFERRED   = "referred"

Five states. Each one is named, each one is distinct, and no combination of boolean fields can produce a state that does not exist in this list. If a new state is needed, it is added to the enum; in a statically typed language, or in Python with a type checker and exhaustive handling, the tooling can then surface every place in the codebase that needs to handle it.
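
In Python that help comes from a type checker rather than the interpreter. The sketch below shows one common way to get it: a function that handles every member and ends in an assert_never call, which a checker such as mypy flags if a member is left unhandled. The customer_message function and its wording are hypothetical; assert_never is defined locally here, and Python 3.11+ ships an equivalent as typing.assert_never.

```python
from enum import Enum
from typing import NoReturn


class KYCStatus(Enum):
    SUBMITTED = "submitted"
    PROCESSING = "processing"
    APPROVED = "approved"
    REJECTED = "rejected"
    REFERRED = "referred"


def assert_never(value: NoReturn) -> NoReturn:
    """Fail loudly at runtime; a type checker rejects any reachable call."""
    raise AssertionError(f"unhandled value: {value!r}")


def customer_message(status: KYCStatus) -> str:
    if status is KYCStatus.SUBMITTED:
        return "We have received your documents."
    if status is KYCStatus.PROCESSING:
        return "We are checking your documents."
    if status is KYCStatus.APPROVED:
        return "You are verified."
    if status is KYCStatus.REJECTED:
        return "We could not verify you."
    if status is KYCStatus.REFERRED:
        return "A specialist is reviewing your case."
    # If a sixth member is added to KYCStatus and not handled above,
    # the type checker reports an error on this line.
    assert_never(status)
```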

Immutable Domain Models

The domain model is an immutable record. No ORM, no save() methods; just the shape of the data at a given point in time.

from datetime import datetime
from typing import NamedTuple, Optional
from uuid import UUID

class KYCCheck(NamedTuple):
    id:           UUID
    customer_id:  UUID
    status:       KYCStatus
    submitted_at: datetime
    decided_at:   Optional[datetime]

A KYCCheck does not know how to persist itself. It does not contain a database connection. It is data, and nothing more. Transitions produce new records rather than mutating existing ones.
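
A short demonstration of that last sentence, repeating the definitions so the snippet runs on its own. The sample values are made up; _replace is the standard NamedTuple method for producing a modified copy.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import NamedTuple, Optional
from uuid import UUID, uuid4


class KYCStatus(Enum):
    SUBMITTED = "submitted"
    PROCESSING = "processing"


class KYCCheck(NamedTuple):
    id: UUID
    customer_id: UUID
    status: KYCStatus
    submitted_at: datetime
    decided_at: Optional[datetime]


original = KYCCheck(
    id=uuid4(),
    customer_id=uuid4(),
    status=KYCStatus.SUBMITTED,
    submitted_at=datetime.now(timezone.utc),
    decided_at=None,
)

# _replace returns a new record; the original is left untouched.
updated = original._replace(status=KYCStatus.PROCESSING)

# In-place mutation is rejected by the tuple itself.
try:
    original.status = KYCStatus.PROCESSING
    mutation_blocked = False
except AttributeError:
    mutation_blocked = True
```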

Explicit Transition Functions

Transitions are functions, not methods on the model. Each one asserts the required prior state, writes atomically through the data layer, and fires a signal to notify the rest of the system.

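from datetime import UTC, datetime  # UTC requires Python 3.11+

# Context, Signal and notify are application-level names defined elsewhere.
def begin_processing(check_id: UUID, context: Context) -> KYCCheck: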
    check = context.repositories.kyc_checks.get(check_id)
    assert check.status == KYCStatus.SUBMITTED

    # ... validate submission is complete, prepare for processing ...

    # transition
    updated = check._replace(status=KYCStatus.PROCESSING)
    context.repositories.kyc_checks.save(updated)

    notify(Signal.PROCESSING_STARTED, updated, context)
    return updated

def approve(check_id: UUID, context: Context) -> KYCCheck:
    check = context.repositories.kyc_checks.get(check_id)
    assert check.status == KYCStatus.PROCESSING

    # ... finalise approval, compute metadata ...

    # transition
    updated = check._replace(
        status=KYCStatus.APPROVED,
        decided_at=datetime.now(UTC),
    )
    context.repositories.kyc_checks.save(updated)

    notify(Signal.APPROVED, updated, context)
    return updated

The guard (assert check.status == ...) is the key. It makes illegal transitions impossible at runtime. If an item is already APPROVED, calling begin_processing on it fails immediately and explicitly rather than silently corrupting the data.

The context parameter is the application's Context object: a single container holding all repositories and services, injected at startup. Every transition function receives the full execution context. The full treatment of this pattern is in Strategic Monolith + Satellites.
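
One caveat about writing the guard as an assert: Python removes assert statements when it runs with the -O flag, so the check disappears in optimised mode. A sketch of a guard that survives, assuming you are free to introduce your own exception type; IllegalTransition and require_status are hypothetical names, not part of the code above.

```python
from enum import Enum


class KYCStatus(Enum):
    SUBMITTED = "submitted"
    PROCESSING = "processing"
    APPROVED = "approved"
    REJECTED = "rejected"
    REFERRED = "referred"


class IllegalTransition(Exception):
    """Raised when a transition is attempted from the wrong prior state."""


def require_status(actual: KYCStatus, expected: KYCStatus) -> None:
    """Guard that still runs under `python -O`, unlike an assert."""
    if actual is not expected:
        raise IllegalTransition(
            f"expected {expected.name}, found {actual.name}"
        )
```

Inside a transition function this would replace the assert line, for example require_status(check.status, KYCStatus.SUBMITTED), and callers can catch IllegalTransition specifically instead of a generic AssertionError.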

Guard Conditions

Not every transition from a given state should be unconditional. A guard condition is a rule that determines whether a transition is allowed beyond just the prior state.

def refer(check_id: UUID, reason: str, context: Context) -> KYCCheck:
    check = context.repositories.kyc_checks.get(check_id)
    assert check.status == KYCStatus.PROCESSING
    assert reason.strip(), "Referral requires a reason"

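    # transition: both writes below must share one transaction in the data
    # layer, otherwise a crash between them leaves a referral with no reason
    updated = check._replace(status=KYCStatus.REFERRED)
    context.repositories.kyc_checks.save(updated)
    context.repositories.kyc_checks.record_referral_reason(check_id, reason)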

    notify(Signal.REFERRED, updated, context)
    return updated

The guard is explicit and testable. If a referral requires a reason, the function enforces it. No external validation layer, no framework annotation; just an assertion in the transition function where the rule belongs.

Worked Example: KYC Onboarding

The building blocks above (enums, immutable models, transition functions, guards) are general. KYC (Know Your Customer) onboarding is a good place to show how they compose into a complete workflow, because the domain is genuinely complex, the state machine maps naturally, and the interactions with external systems are varied.

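The States

A KYC check moves through the five states of the KYCStatus enum defined earlier:

SUBMITTED → PROCESSING → APPROVED / REJECTED / REFERRED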

Signals and Orchestration

When a transition happens, interested code within the core needs to react. A signal is an in-process notification that a transition occurred; lightweight, synchronous by default, zero infrastructure. It is not a message bus. It is a function call with a subscriber list.

@signal_handler(Signal.PROCESSING_STARTED)
def run_verification(check: KYCCheck, context: Context) -> None:
    result = context.verification.verify(check.customer_id)
    if result.approved:
        approve(check.id, context)
    elif result.needs_review:
        refer(check.id, result.reason, context)
    else:
        reject(check.id, context)

@signal_handler(Signal.APPROVED)
def run_risk_scoring(check: KYCCheck, context: Context) -> None:
    score = context.risk_scoring.score(check.customer_id)
    context.repositories.kyc_checks.record_risk_score(check.id, score)

context.verification and context.risk_scoring are satellites: stateless services that do heavy compute (ML inference, rules engines) and return a result. They are accessed through the Context like everything else. Whether they are HTTP calls, gRPC, or in-process for testing is hidden behind the interface.

The distinction between signals, domain events, and queue items is important and frequently conflated. The full treatment is in a companion article, Events vs Signals.

Sub-State-Machines

Approval triggers a webhook to a partner system. That webhook has its own lifecycle:

PENDING → SENDING → DELIVERED / FAILED / RETRYING

This is a child state machine, not part of the parent. It gets its own model, its own enum, and its own transitions:

class WebhookStatus(Enum):
    PENDING   = "pending"
    SENDING   = "sending"
    DELIVERED = "delivered"
    FAILED    = "failed"
    RETRYING  = "retrying"

class WebhookDelivery(NamedTuple):
    id:        UUID
    check_id:  UUID   # linked to parent by foreign key
    status:    WebhookStatus
    endpoint:  str
    attempts:  int

The key rule: the child's state does not leak into the parent's enum. A KYC check is APPROVED regardless of whether the webhook succeeded or is still retrying. The parent triggered the child; it does not wait for it or change state based on it.

The same pattern applies to any stateful side effect: sending notifications, generating PDFs, syncing records to external systems. Each one gets its own table, its own enum, its own transition functions. Without this separation, the parent enum grows to encode every combination of business state and delivery state. Five business states multiplied by five delivery states is twenty-five combinations, most of which are meaningless.
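
A sketch of one child transition, kept as a pure function so it runs without a data layer. The retry policy is an assumption: MAX_ATTEMPTS, the record_failed_attempt name, and the rule that a failed attempt moves SENDING to RETRYING until the limit is reached are all invented for the example.

```python
from enum import Enum
from typing import NamedTuple
from uuid import UUID, uuid4


class WebhookStatus(Enum):
    PENDING = "pending"
    SENDING = "sending"
    DELIVERED = "delivered"
    FAILED = "failed"
    RETRYING = "retrying"


class WebhookDelivery(NamedTuple):
    id: UUID
    check_id: UUID
    status: WebhookStatus
    endpoint: str
    attempts: int


MAX_ATTEMPTS = 3  # hypothetical retry limit


def record_failed_attempt(delivery: WebhookDelivery) -> WebhookDelivery:
    """SENDING -> RETRYING, or SENDING -> FAILED once the limit is reached."""
    if delivery.status is not WebhookStatus.SENDING:
        raise ValueError(f"cannot fail a delivery in state {delivery.status.name}")
    attempts = delivery.attempts + 1
    next_status = (
        WebhookStatus.RETRYING if attempts < MAX_ATTEMPTS else WebhookStatus.FAILED
    )
    return delivery._replace(status=next_status, attempts=attempts)


delivery = WebhookDelivery(
    id=uuid4(),
    check_id=uuid4(),  # the parent KYC check; its status is never touched here
    status=WebhookStatus.SENDING,
    endpoint="https://partner.example/webhooks",
    attempts=0,
)
after_first_failure = record_failed_attempt(delivery)
```

Nothing in this function reads or writes the parent's status, which is the separation the rule above asks for.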

Testing

Because every dependency enters through the Context, testing a transition is a constructor call and an assertion:

def test_approved_check_gets_risk_score():
    context = make_test_context(
        verification=FakeVerificationService(
            results={CUSTOMER_ID: VerificationResult(approved=True)}
        ),
        risk_scoring=FakeRiskScoringService(default_score=75),
    )
    context.repositories.kyc_checks.insert(make_submitted_check(CUSTOMER_ID))

    begin_processing(CHECK_ID, context)

    check = context.repositories.kyc_checks.get(CHECK_ID)
    assert check.status == KYCStatus.APPROVED

    score = context.repositories.kyc_checks.get_risk_score(CHECK_ID)
    assert score == 75

No mock framework. No patching. No cleanup. The fakes are real implementations that accept the same types and return the same types. The test reads as a description of the workflow: submit, process, assert the outcome.
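
What such a fake might look like, as a sketch: the field names on VerificationResult follow the way the signal handler reads them (approved, needs_review, reason), but the defaults, the calls list, and the decision to raise KeyError for an unknown customer are assumptions.

```python
from typing import Dict, List, NamedTuple, Optional
from uuid import UUID, uuid4


class VerificationResult(NamedTuple):
    approved: bool = False
    needs_review: bool = False
    reason: Optional[str] = None


class FakeVerificationService:
    """In-memory stand-in with the same verify() shape as the real satellite."""

    def __init__(self, results: Dict[UUID, VerificationResult]) -> None:
        self._results = results
        self.calls: List[UUID] = []  # lets a test assert what was requested

    def verify(self, customer_id: UUID) -> VerificationResult:
        self.calls.append(customer_id)
        # An unknown customer raises KeyError so a misconfigured test fails loudly.
        return self._results[customer_id]


customer_id = uuid4()
service = FakeVerificationService(
    results={customer_id: VerificationResult(approved=True)}
)
result = service.verify(customer_id)
```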

When the Model Breaks Down

Not everything is a clean state machine. Some workflows are more graph than linear sequence. Long-running processes with external dependencies and unpredictable timeouts do not always fit neatly into a set of named states with deterministic transitions.

The pragmatic response is: model what you can, document what you cannot. A system where 80% of the workflows are explicit state machines and 20% are pragmatic deviations is vastly better than one where 100% of the state management is implicit. The named state machines give you a baseline of clarity; the deviations are visible precisely because the baseline exists.

Perfection is not the goal. Deliberateness is.

Summary

Pick the mental model before the stack.

A system is a state machine. States are enums. Transitions are explicit, atomic functions that assert the required prior state and write through the data layer. Side effects with their own lifecycle are child state machines, not extensions of the parent. Signals handle in-process orchestration. The Context object makes every dependency explicit and every component testable.

What this gives you:

A readable domain. Every possible state and transition is named and visible in the code.
Atomic writes. No partial state, no distributed transactions.
Idempotency by construction. The same transition on the same state produces the same result.
Clear error handling. Invalid transitions fail explicitly at the guard, not silently in the data.
Auditability. The model naturally supports capturing a record of every transition; the implementation options are explored in Audit Trails Done Properly.

The rest of the series builds on this foundation:

Strategic Monolith + Satellites: the architecture that follows from this model
How to Build a Data Access Layer: the abstraction behind the repositories
Your Database is Already a Queue: async processing within this architecture
Events vs Signals: the distinction between in-process signals, domain events, and queue items
Idempotency is Not Optional: the discipline that makes transitions safe
Audit Trails Done Properly: how to capture and query the record of what happened