Most audit logs are full of lies.

Not malicious ones. They are lies of omission: the row that got updated in place and took the history with it; the audit write that was added to three of five code paths and nobody noticed the gap; the log that records the final state but not how it got there. The system says the record is APPROVED. The audit log says nothing useful.

This is not hard to solve, but it requires upfront design; it is difficult to bolt on later.

Why Most Audit Logs Fail

The failure modes are consistent:

  • Mutable records: A status column gets overwritten, the previous value is gone, and there is no way to reconstruct what it was.
  • Partial coverage: Audit writes were added to the obvious code paths when the feature was built, but the edge cases and admin tools that make "corrections" were never instrumented.
  • Insufficient data: The log records that a transition happened but not what the entity looked like before or after it.
  • Late addition: The system went to production without an audit log, and retrofitting one means touching every write path, which nobody has time to do thoroughly.

The result is an audit log that cannot be trusted. Worse: one that cannot be trusted but looks like it can.

Design for Auditability from the Start

In a regulated industry, this is not optional. Financial services, healthcare, KYC/AML, payments: regulators require a complete, tamper-proof record of what happened, when, and by whom. "We didn't think about that" is not a defensible position here.

Outside regulated industries, the argument still holds. Customer disputes. Incident post-mortems. Support tickets that arrive six months after the fact. The question "what exactly changed, and when?" comes up constantly in production systems. The team without an audit log answers it by correlating application logs, database snapshots, and guesswork. The team with one writes a query.

The cost of building for auditability from day one is low: one extra table, one extra write per transition, one structural constraint. The cost of retrofitting it into a live system is high, and the result is never as trustworthy as the version built in from the start, because retrofits have gaps.

Design for it first. Treat it as infrastructure, not a feature.

Audit as a Cross-Cutting Concern

Audit logging touches every state-writing path in the system. That is the problem. Handle it naively and you scatter audit writes throughout the codebase: one in each orchestration function, one in each service method, each slightly different, some missing. The coverage is never complete, and the fields are never consistent.

Two patterns fit naturally with the rest of this series. They are not the only approaches: many frameworks offer their own decoupling mechanisms. But both work well without introducing new dependencies.

Pattern 1: Signal Handlers

If you are using the signal pattern from Your System is a State Machine and Events, Queues and Signals, the audit write is a signal handler.

Every state transition fires a signal. Registering an audit handler for that signal writes the log entry in the same transaction. The business logic does not change; adding the handler is one registration line:

registry.register(OrderConfirmedSignal, write_audit_log)
registry.register(KYCApprovedSignal,    write_audit_log)
registry.register(KYCRejectedSignal,    write_audit_log)

The handler receives the signal, which carries the old and new state. It writes the audit row. The business logic that fired the signal has no knowledge of or dependency on the audit system.

The strength of this pattern is semantic richness: each signal type is a named, meaningful domain event. The handler knows not just that a write occurred but that a KYC check was approved by a specific actor for a specific reason. The audit record reflects that.

The limitation is coverage: signals only fire for transitions you have explicitly modelled. Writes that do not go through the signal system are not captured.

Pattern 2: Auditing Proxy

The proxy pattern wraps the real data provider and writes an audit record after every write, transparently. The structure is the same as the migration proxy in How to Build a Data Access Layer, applied to audit concerns:

from datetime import datetime, UTC


class AuditingKYCCheckRepository(KYCCheckRepository):

    def __init__(
        self,
        inner:     KYCCheckRepository,
        audit_log: AuditLogRepository,
        actor:     str,
    ):
        self._inner     = inner
        self._audit_log = audit_log
        self._actor     = actor

    def save(self, old: KYCCheck, new: KYCCheck) -> None:
        self._inner.save(old, new)
        self._audit_log.record(
            entity_type  = "kyc_check",
            entity_id    = str(new.id),
            action       = "updated",
            old_status   = old.status.value,
            new_status   = new.status.value,
            performed_by = self._actor,
            performed_at = datetime.now(UTC),
            detail       = {"old": old._asdict(), "new": new._asdict()},
        )

save() takes both old and new: the proxy needs both to record the transition. With immutable domain models this is natural, since the caller always has both objects at the call site. The application calls save() on what it believes is a plain repository. The proxy writes to the real store, then writes the audit record. Nothing in the business logic changes.

The strength of this pattern is comprehensive coverage: every write through the data layer is captured, with no dependency on the signal system being consistently wired. It is well suited to adding audit coverage to an existing codebase, or as a safety net for write paths that fall outside the signal system.

The trade-off is context. The proxy knows what was written but not necessarily why. A signal handler can record the reason for a referral decision; the proxy records only that a save() call occurred.

Choosing between them: the two patterns can coexist. Use signal handlers where semantic context matters: state transitions, approvals, manual decisions. Use the auditing proxy as a safety net for everything else, or as the primary mechanism when instrumenting an existing codebase without a signal system.

What to Capture: The Retention Spectrum

Once you have decided how to wire audit logging in, you need to decide how much each record captures. There is a spectrum, and the right point depends on domain and regulatory requirements.

Full model snapshots. Serialise the old and new model in full as a JSONB column. Every field, at the moment of the transition. This gives complete reconstructibility: the exact state of any entity at any point in time, including fields that seemed unimportant when the schema was written.

The immutable model pattern from this series makes this natural. A KYCCheck is a NamedTuple; old._asdict() and new._asdict() are single calls. Before and after are already distinct objects, so capturing them costs almost nothing in code.

The cost is storage. A system writing full snapshots on every transition may store several times more data than one recording only state changes. On a high-throughput pipeline this matters; on a typical transactional system, storage is almost certainly cheaper than the engineering time spent answering questions the log cannot answer.

Selective field capture. Record old and new values only for the fields that matter: status, decision reason, risk score, key identifiers. A reasonable middle ground when storage is genuinely constrained and the compliance-relevant fields are well understood. The risk: the field you chose not to capture is the one that becomes important. Be conservative with what you exclude.

State transitions only. Record the fact that a transition occurred: entity, action, old status, new status, who, when. No field-level detail. Adequate for many domains where the state machine captures everything meaningful. For fintech and KYC/AML specifically, this is almost certainly insufficient: regulators want to know what data was presented at the time of the decision, not just that a decision was made.

Choose the level your domain requires, and document why. Changing strategy later means migrating historical records or accepting inconsistency in the log.


The Schema

Building on the audit log introduced in Idempotency Is Not Optional:

CREATE TABLE audit_log (
    id           VARCHAR(100)             NOT NULL,
    entity_type  VARCHAR(50)              NOT NULL,
    entity_id    VARCHAR(100)             NOT NULL,
    action       VARCHAR(50)              NOT NULL,
    old_status   VARCHAR(50),
    new_status   VARCHAR(50)              NOT NULL,
    performed_by VARCHAR(100)             NOT NULL,
    performed_at TIMESTAMP WITH TIME ZONE NOT NULL,
    detail       JSONB,
    PRIMARY KEY  (id),
    UNIQUE       (entity_type, entity_id, action, new_status, performed_at)
);

CREATE INDEX audit_log_entity
    ON audit_log (entity_type, entity_id, performed_at DESC);

CREATE INDEX audit_log_actor
    ON audit_log (performed_by, performed_at DESC);

The detail column carries whatever the retention strategy dictates: full model snapshots, selected fields, or nothing. The schema does not change between strategies; only what you put in detail does.

The two indexes cover the two most common query shapes: the full history of a specific entity, and the activity of a specific actor. Add further indexes as query patterns become clear in production.


Querying the Log

An append-only log is only useful if you can read it.

Full history of a single entity:

SELECT
    action,
    old_status,
    new_status,
    performed_by,
    performed_at,
    detail
FROM  audit_log
WHERE entity_type = 'kyc_check'
  AND entity_id   = '9f4e2b1a'
ORDER BY performed_at ASC;

Activity by actor over a period:

SELECT
    performed_by,
    COUNT(*)                                    AS total_actions,
    COUNT(*) FILTER (WHERE action = 'approved') AS approvals,
    COUNT(*) FILTER (WHERE action = 'rejected') AS rejections
FROM  audit_log
WHERE performed_at >= NOW() - INTERVAL '7 days'
GROUP BY performed_by
ORDER BY total_actions DESC;

Items that skipped a required step (KYC checks approved without passing through processing):

SELECT DISTINCT entity_id
FROM   audit_log
WHERE  entity_type = 'kyc_check'
  AND  new_status  = 'approved'
  AND  entity_id NOT IN (
      SELECT entity_id
      FROM   audit_log
      WHERE  entity_type = 'kyc_check'
        AND  new_status  = 'processing'
  );

Compliance and Retention

Keeping the audit log in the same database as your business data has a practical advantage: compliance queries are joins, not cross-system correlations. Finding all approved KYC checks where the customer account is now suspended is a single JOIN. That is not always possible, and a dedicated audit system is a legitimate choice, but the co-location benefit is real.

Compliance

GDPR introduces a genuine tension: Article 17 gives individuals the right to erasure, but an immutable audit log cannot delete anything. The two obligations conflict when PII is embedded in audit records. The resolution is to keep PII out of the log wherever possible: reference entities by ID, not by name or email. The detail column is the risk area; full model snapshots that include names, addresses, and document numbers embed PII in an immutable record.

Separate the PII from the audit record at write time, and an erasure request can pseudonymise the customer record without touching the log. The log remains intact, showing that a check was performed and a decision made, without retaining personal details in immutable form. In some jurisdictions this separation is not sufficient, and you will need to consult more widely for advice.
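One way to sketch the separation at write time (the field list is illustrative and should come from a reviewed, documented list for your domain): strip PII fields from the snapshot before it enters detail, keeping only references.

```python
# Fields that must never enter the immutable log.
# Illustrative only: derive the real list from a compliance review.
PII_FIELDS = {"full_name", "email", "address", "document_number"}

def audit_safe(snapshot: dict) -> dict:
    """Return a copy of a model snapshot with PII removed.

    The customer stays referenced by ID, so an erasure request can
    pseudonymise the customer record without touching the log.
    """
    return {k: v for k, v in snapshot.items() if k not in PII_FIELDS}

snapshot = {
    "customer_id": "cust-3321",
    "status":      "approved",
    "full_name":   "Jane Doe",
    "email":       "jane@example.com",
}
detail = audit_safe(snapshot)
```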

Retention

Most regulatory frameworks specify minimum retention periods: the FCA expects MiFID transaction records for five years; PCI DSS requires audit logs for at least one year with three months immediately accessible. Know your specific obligations.

Postgres table partitioning by month keeps the active dataset manageable and makes archiving or dropping old data a single DDL operation rather than a long-running bulk DELETE. Note that Postgres requires the partition key to appear in every primary key and unique constraint on a partitioned table, so the primary key becomes (id, performed_at):

CREATE TABLE audit_log (
    ...
    performed_at TIMESTAMP WITH TIME ZONE NOT NULL
) PARTITION BY RANGE (performed_at);

CREATE TABLE audit_log_2026_05
    PARTITION OF audit_log
    FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');

At end of retention, drop the partition or move it to cold storage. Either way, the operation is instantaneous and does not touch the live table.

Summary

Audit logging is one of those concerns that feels optional until it isn't. In a regulated industry the requirement is explicit; everywhere else it surfaces as a support ticket you cannot answer, an incident you cannot reconstruct, a customer dispute you cannot resolve.

The upfront investment is modest: one table, one write per transition, one structural decision about how much to capture. The patterns here keep it out of your business logic. Retrofitting it later, under pressure, into every write path is considerably more expensive, and the result is never quite trustworthy.

Think about it before you ship. It is much easier to build it in than to wish you had.

This is part of a series on pragmatic architecture for startups and scaleups.