"Don't use your database as a queue" is one of those pieces of engineering orthodoxy that gets repeated so often it starts to feel like a law. It isn't. It's a guideline with a scale constraint attached, and for most workloads that constraint is never reached.

The argument against is legitimate: at high throughput, or under strict sub-second latency requirements, a dedicated message broker will outperform a relational database. The argument in favour is equally legitimate, and almost never stated as plainly: your database is already running, already monitored, already backed up, and already the source of truth for your business data. A queue is just a table with a few index tricks. Adding one costs almost nothing operationally.

The case gets stronger when you consider what you give up with a dedicated broker:

  • One more component to run. Deployment, monitoring, upgrades, on-call runbooks.
  • Consumer group management. Schema registries, offset tracking, partition assignment.
  • The dual-write problem. Writing to your database and enqueuing in the same logical operation requires distributed transactions or careful idempotency design. With a database queue, it's a single INSERT in a single transaction.
  • Observability. When a queue item is failing in production, the cause is almost always in the data: a bad record, an unexpected foreign key, a customer account in a broken state. With a dedicated broker you're correlating message IDs across systems. With a database queue, you write a JOIN.

This article covers three practical patterns, each building on the last. All examples use Postgres; the patterns rely on Postgres-specific features (FOR UPDATE SKIP LOCKED, UPDATE ... RETURNING, partial indexes). If you do eventually outgrow this approach, the patterns here are structured to migrate cleanly: the queue table is decoupled from your business data, so swapping the backend is a well-scoped change rather than a rewrite.

Basic Pattern

The core pattern uses a dedicated queue table linked 1-to-1 to the item table, rather than adding queue columns directly to the item.

This separation is worth the small amount of extra schema. Queue state is transient and noisy: rows are constantly being updated (claimed), deleted (completed), and re-inserted (retried). Mixing that with your business data creates vacuum pressure, bloated updates, and indexes that grow to include items long since processed. A separate table keeps business data clean and lets the queue index stay lean: a partial index on next_action_at covers only items currently in the queue.

Three Postgres features do the heavy lifting:

  • FOR UPDATE SKIP LOCKED: allows multiple workers to pull from the queue simultaneously. Rather than blocking on a locked row, a worker skips it and claims the next available one.
  • UPDATE ... RETURNING: the claim is a single atomic query. There is no SELECT-then-UPDATE window in which two workers can claim the same item.
  • The lock-via-future-timestamp mechanic: this is the key insight. Claiming an item doesn't lock a database row indefinitely; it pushes next_action_at forward by an interval. If the worker crashes mid-processing, the item becomes visible again automatically when the interval expires. No heartbeat, no explicit lock release, no cleanup job required.
-- table of items that we want to process in some kind of queue
CREATE TABLE my_item (
  id VARCHAR(100) PRIMARY KEY,
  field_1 ...,
  field_2 ...,
);

-- dedicated queue table, linked to items table 1-to-1
CREATE TABLE my_item_queue (
  item_id VARCHAR(100) PRIMARY KEY REFERENCES my_item (id),
  next_action_at TIMESTAMP WITH TIME ZONE NULL,
  attempt INTEGER NOT NULL,
  attempt_last_error TEXT NULL
);

-- only store enqueued items that are due to be processed
CREATE INDEX my_item_queue_next_action_at
  ON my_item_queue (next_action_at)
  WHERE next_action_at IS NOT NULL;

-- optional: useful for searching errors from a monitoring dashboard
CREATE INDEX my_item_queue_last_error
  ON my_item_queue
  USING GIN (to_tsvector('english', attempt_last_error))
  WHERE attempt_last_error IS NOT NULL;

-- query to return the next item
WITH next_id AS (
  SELECT item_id
  FROM my_item_queue
  WHERE next_action_at <= NOW()
  ORDER BY next_action_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
UPDATE my_item_queue miq
  SET next_action_at = NOW() + INTERVAL '1 hour'
  FROM next_id ni
  WHERE ni.item_id = miq.item_id
RETURNING miq.item_id;

The worker loop is straightforward: claim an item, process it, delete it on success. On failure, increment the attempt counter, record the error, and re-enqueue with a recalculated next_action_at. Once the attempt count exceeds the maximum, the item is moved to the dead-letter queue rather than re-enqueued. The time.sleep(1) on an empty queue is simple backoff to avoid polling the database at full speed when there is nothing to do.

import time
from dataclasses import dataclass
from datetime import datetime
from typing import Generic, Optional, TypeVar

T = TypeVar("T")

@dataclass
class QueuedItem(Generic[T]):
    next_action_at: datetime
    attempt: int
    attempt_last_error: Optional[str]
    item: T

def process_items():
    while is_processing():
        queued_item = get_next_queued_item()
        if queued_item is None:
            time.sleep(1)  # simple backoff when the queue is empty
            continue

        try:
            process(queued_item.item)
            delete_from_queue(queued_item)

        except Exception as exc:
            # the claim happens outside the try block, so queued_item
            # is always bound here
            queued_item.attempt += 1
            queued_item.attempt_last_error = format_error(exc)
            if is_final_attempt(queued_item.attempt):
                move_to_dead_letter_queue(queued_item)
            else:
                queued_item.next_action_at = calculate_next_action_at(queued_item.attempt)
                add_to_queue(queued_item)

Failure Modes and Recovery

There are two distinct failure modes, and they are handled differently.

Worker crash. The lock interval handles this automatically. If the worker process dies mid-job, next_action_at is already set to an hour in the future; the item re-surfaces when that interval expires, with no application logic required. The INTERVAL '1 hour' in the claim query is the knob for this.
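The resurfacing behaviour is easy to convince yourself of with an in-memory sketch. Here a plain dict stands in for the queue table, and claim_next mimics the claim query; the names and fixed timestamps are illustrative only:

```python
from datetime import datetime, timedelta

LOCK_INTERVAL = timedelta(hours=1)

def claim_next(queue, now):
    """Claim the oldest due item; mimics UPDATE ... RETURNING on the queue table."""
    due = [item_id for item_id, ts in queue.items() if ts <= now]
    if not due:
        return None
    item_id = min(due, key=lambda i: queue[i])
    queue[item_id] = now + LOCK_INTERVAL  # lock via future timestamp
    return item_id

t0 = datetime(2024, 1, 1, 12, 0)
queue = {"item-1": t0}

assert claim_next(queue, t0) == "item-1"                          # claimed
assert claim_next(queue, t0 + timedelta(minutes=30)) is None      # worker "crashed": invisible
assert claim_next(queue, t0 + timedelta(minutes=61)) == "item-1"  # resurfaces automatically
```

No cleanup job runs between the second and third call; the item reappears purely because the clock passed its timestamp.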

Application exception. The except block handles this explicitly. The attempt counter is incremented and calculate_next_action_at determines the next retry window. This is where exponential backoff belongs: NOW() + INTERVAL '1 minute' * 2^(attempt - 1) gives you 1 minute, then 2, then 4, preventing your workers from hammering a failing downstream dependency on every retry cycle.
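calculate_next_action_at from the worker loop can be as small as the following sketch. The one-hour cap is an assumption, not something the pattern requires:

```python
from datetime import datetime, timedelta, timezone

def backoff_delay(attempt, base=timedelta(minutes=1), cap=timedelta(hours=1)):
    """Exponential backoff: 1 minute after the first failure, then 2, 4, ... capped."""
    return min(base * 2 ** (attempt - 1), cap)

def calculate_next_action_at(attempt):
    # the new next_action_at for a re-enqueued item
    return datetime.now(timezone.utc) + backoff_delay(attempt)
```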

Dead letters. Items that exceed a maximum attempt threshold need to be surfaced rather than silently re-queued forever. is_final_attempt encapsulates that threshold check; move_to_dead_letter_queue handles disposal. Two approaches for what disposal means:

  • Simple approach: after max attempts, park the item by setting next_action_at = NULL rather than re-queuing. Dead items are then detectable with SELECT * FROM my_item_queue WHERE next_action_at IS NULL. move_to_dead_letter_queue is just an update to the same row.
  • Explicit DLQ table: move the row to a dedicated dead-letter table on max attempts. Cleaner separation, easier to alert on independently, and the right choice if your dead-letter queue needs its own retention policy or tooling. move_to_dead_letter_queue becomes a delete-and-insert across two tables, wrapped in a transaction.
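For the simple approach, both operations are trivial. The in-memory sketch below (a dict standing in for the queue table, illustrative names) shows the shape of each:

```python
def park(queue, item_id):
    # simple approach: NULL next_action_at parks the item in place,
    # so it never matches next_action_at <= NOW() again
    queue[item_id] = None

def dead_letters(queue):
    # SELECT * FROM my_item_queue WHERE next_action_at IS NULL
    return [item_id for item_id, ts in queue.items() if ts is None]
```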

Fair Queuing Batches

The basic pattern works well when all items are equal. In a multi-tenant system where each customer has a different rate limit, it falls apart: a naive LIMIT 1 worker will process whichever customer has the most pending items, starving everyone else.

The solution is a single dispatcher that runs once per minute and claims a batch. For each customer, it uses RANK() OVER (PARTITION BY customer_id ORDER BY next_action_at ASC) to rank their pending items by age, then takes the top N items where N equals rate_per_minute. The result is a batch that respects per-customer limits across any number of customers, in a single declarative query.
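The same ranking logic, expressed in pure Python, may make the query easier to picture. fair_batch and the sample data are illustrative; in production the work happens in a single SQL statement:

```python
from collections import defaultdict

def fair_batch(pending, rates):
    """Rank each customer's pending items oldest-first, then keep the top
    rate_per_minute items per customer; mirrors RANK() OVER (PARTITION BY ...)."""
    by_customer = defaultdict(list)
    for item_id, customer_id, next_action_at in pending:
        by_customer[customer_id].append((next_action_at, item_id))
    batch = []
    for customer_id, items in by_customer.items():
        items.sort()  # ORDER BY next_action_at ASC
        batch.extend(item_id for _, item_id in items[: int(rates[customer_id])])
    return batch

pending = [
    ("a1", "acme", 1), ("a2", "acme", 2), ("a3", "acme", 3),
    ("b1", "bravo", 1), ("b2", "bravo", 2),
]
rates = {"acme": 2, "bravo": 1}
# acme contributes its 2 oldest items, bravo its 1 oldest
```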

Because there is only one dispatcher (running on a timer rather than concurrently), SKIP LOCKED is not needed. The query either runs or it doesn't; there is no race to claim the same items. This simplifies the query considerably. The trade-off is that the dispatcher is a single point of failure: if it crashes, items queue up until it restarts. For latency-insensitive workloads this is usually fine — a missed minute is a minor delay, not an outage — but it is worth being aware of if your processing deadlines are tight.

customer_id is denormalised into the queue table so the composite partial index on (customer_id, next_action_at) can cover the dispatcher query entirely, without touching my_item at all. As a side benefit, customer_id is a natural partition key: if you ever need to scale horizontally, the schema is shard-friendly by design.

CREATE TABLE customer (
  id VARCHAR(100) PRIMARY KEY,
  rate_per_minute FLOAT,
  ...
);

CREATE TABLE my_item (
  id VARCHAR(100) PRIMARY KEY,
  customer_id VARCHAR(100) REFERENCES customer (id),
  field_1 ...,
  field_2 ...,
);

-- customer_id is denormalised into the queue table so the index can cover it
CREATE TABLE my_item_queue (
  item_id VARCHAR(100) PRIMARY KEY REFERENCES my_item (id),
  customer_id VARCHAR(100) NOT NULL REFERENCES customer (id),
  next_action_at TIMESTAMP WITH TIME ZONE NULL,
  attempt INTEGER NOT NULL,
  attempt_last_error TEXT NULL
);

-- only store enqueued items that are due to be processed, with customer_id
CREATE INDEX my_item_queue_next_action_at
  ON my_item_queue (customer_id, next_action_at)
  WHERE next_action_at IS NOT NULL;

-- optional: useful for searching errors from a monitoring dashboard
CREATE INDEX my_item_queue_last_error
  ON my_item_queue
  USING GIN (to_tsvector('english', attempt_last_error))
  WHERE attempt_last_error IS NOT NULL;

-- query to return a batch of items, run once per minute with single dispatcher
WITH eligible (item_id, customer_id, rank) AS (
  SELECT
    item_id,
    customer_id,
    RANK() OVER (PARTITION BY customer_id ORDER BY next_action_at ASC)
  FROM my_item_queue
  WHERE next_action_at <= NOW()
),
next_ids AS (
  SELECT e.item_id, e.customer_id
  FROM eligible e
    INNER JOIN customer c ON c.id = e.customer_id
  WHERE e.rank <= c.rate_per_minute
)
UPDATE my_item_queue miq
  SET next_action_at = NOW() + INTERVAL '1 hour'
  FROM next_ids nis
  WHERE nis.item_id = miq.item_id
RETURNING miq.item_id, miq.customer_id;

The batch pattern has one characteristic limitation: all items in a batch fire simultaneously. If a downstream system is sensitive to bursts rather than just sustained rate, read on.

Fair Queuing Delay

The batch pattern fires N items at the top of each minute. For many use cases this is fine. For others (outbound webhooks, SMS delivery, calls to rate-limited third-party APIs) the downstream system cares about spacing, not just volume. Sending 60 messages in a burst at minute boundaries is meaningfully different from sending one per second, even if the average rate is identical.

The distinction in simple terms: batching is "N items now, wait a minute." Delay is "one item now, wait 1/N of a minute, one item, wait 1/N..."
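The arithmetic is small enough to sketch (function names are illustrative):

```python
from datetime import datetime, timedelta

def frontier_interval(rate_per_minute):
    """Spacing between sends: INTERVAL '1 minute' / rate_per_minute."""
    return timedelta(minutes=1) / rate_per_minute

def send_times(start, rate_per_minute, n):
    """First n send times under the delay pattern: one item per interval."""
    step = frontier_interval(rate_per_minute)
    return [start + i * step for i in range(n)]
```

A customer with rate_per_minute = 60 sends one item per second; at rate 2, items are spaced 30 seconds apart, rather than both firing at the top of the minute.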

This pattern introduces a frontier table that tracks when each customer is next eligible to send. It is declared UNLOGGED because it holds ephemeral state: if the database restarts, the frontier resets to NULL and every customer becomes immediately eligible, which is safe (they will simply fire their next item slightly early on restart).

Each claim query does three things atomically:

  1. Finds an item that is both due (next_action_at <= NOW()) and whose customer is past their frontier (frontier_at <= NOW(), or not yet set).
  2. Updates the customer's frontier to NOW() + (1 minute / rate_per_minute), scheduling the next eligible window.
  3. Claims the item by pushing next_action_at forward, as in the basic pattern.

FOR UPDATE OF miq SKIP LOCKED locks only the queue row, not the frontier row. Multiple workers can therefore run concurrently: each races to claim a different item, and the frontier update is committed atomically in the same transaction.

The trade-off versus the batch pattern is complexity: the query is harder to reason about, and it processes one item per transaction. Use it when burst behaviour causes real downstream problems; it is overkill otherwise.

CREATE TABLE customer (
  id VARCHAR(100) PRIMARY KEY,
  rate_per_minute FLOAT,
  ...
);

CREATE UNLOGGED TABLE customer_queue_frontier (
  id VARCHAR(100) PRIMARY KEY REFERENCES customer (id),
  frontier_at TIMESTAMP WITH TIME ZONE
);

CREATE TABLE my_item (
  id VARCHAR(100) PRIMARY KEY,
  customer_id VARCHAR(100) REFERENCES customer (id),
  field_1 ...,
  field_2 ...,
);

-- customer_id is denormalised into the queue table so the index can cover it
CREATE TABLE my_item_queue (
  item_id VARCHAR(100) PRIMARY KEY REFERENCES my_item (id),
  customer_id VARCHAR(100) NOT NULL REFERENCES customer (id),
  next_action_at TIMESTAMP WITH TIME ZONE NULL,
  attempt INTEGER NOT NULL,
  attempt_last_error TEXT NULL
);

-- only store enqueued items that are due to be processed, with customer_id
CREATE INDEX my_item_queue_next_action_at
  ON my_item_queue (customer_id, next_action_at)
  WHERE next_action_at IS NOT NULL;

-- optional: useful for searching errors from a monitoring dashboard
CREATE INDEX my_item_queue_last_error
  ON my_item_queue
  USING GIN (to_tsvector('english', attempt_last_error))
  WHERE attempt_last_error IS NOT NULL;

-- get the next eligible item
WITH next_id (item_id, customer_id) AS (
  SELECT miq.item_id, miq.customer_id
  FROM my_item_queue miq
  LEFT JOIN customer_queue_frontier cf ON cf.id = miq.customer_id
  WHERE miq.next_action_at <= NOW()
  AND (cf.frontier_at <= NOW() OR cf.frontier_at IS NULL)
  ORDER BY miq.next_action_at
  LIMIT 1
  FOR UPDATE OF miq SKIP LOCKED
),
customer_frontier_id (customer_id) AS (
  INSERT INTO customer_queue_frontier (id, frontier_at)
  SELECT ni.customer_id, NOW() + (INTERVAL '1 minute' / c.rate_per_minute)
  FROM next_id ni
  INNER JOIN customer c ON c.id = ni.customer_id
  ON CONFLICT (id) DO UPDATE SET frontier_at = EXCLUDED.frontier_at
  RETURNING id
)
UPDATE my_item_queue miq
  SET next_action_at = NOW() + INTERVAL '1 hour'
  FROM next_id ni
  INNER JOIN customer_frontier_id cfi ON cfi.customer_id = ni.customer_id
  WHERE miq.item_id = ni.item_id
RETURNING miq.item_id, miq.customer_id;

Observability

How much you invest in error introspection depends on your throughput.

At low throughput, the GIN index on attempt_last_error is a good fit. Full-text search across error messages is fast, and the write overhead of maintaining the index is negligible at low insert rates:

SELECT
  miq.item_id,
  miq.attempt,
  miq.attempt_last_error,
  miq.next_action_at
FROM my_item_queue miq
WHERE to_tsvector('english', miq.attempt_last_error) @@ to_tsquery('timeout | connection')
ORDER BY miq.next_action_at;

At moderate throughput, drop the GIN index but keep attempt_last_error TEXT NULL. You lose full-text search, but the column still enables something a dedicated broker cannot easily offer: ad-hoc SQL analysis against your live business data. Because the queue table lives in the same database as everything else, a single query can isolate exactly which customers are affected, what their account state looks like, and whether the failures share a pattern in the underlying data:

SELECT
  c.id,
  c.name,
  COUNT(*) AS failing_items,
  MAX(miq.attempt) AS max_attempts,
  miq.attempt_last_error
FROM my_item_queue miq
INNER JOIN my_item i ON i.id = miq.item_id
INNER JOIN customer c ON c.id = i.customer_id
WHERE miq.attempt_last_error IS NOT NULL
GROUP BY c.id, c.name, miq.attempt_last_error
ORDER BY failing_items DESC;

With a dedicated broker you would be correlating message IDs across systems. Here you write a JOIN. GIN index maintenance is expensive under write pressure, and a mass failure event is exactly when you can least afford the extra overhead — but the column itself remains cheap, and the SQL access it unlocks is often all you need to diagnose the problem quickly.

At high throughput, even writing error strings to a TEXT column on every failure adds up. At that scale your queue is moving fast enough that your application logs, metrics, and tracing tooling are better indicators of what's going wrong than a column on a database row. The attempt counter alone is often sufficient: a rising attempt count is a reliable signal that something is failing, and the cause is almost always visible in your observability stack.

Monitoring

Because the queue is just a table, any metric you want is a SQL query. Queue depth, in-flight items, dead letters, per-customer backlog, retry distribution — all expressible without any broker-specific SDK or metrics API:

SELECT
  COUNT(*) FILTER (WHERE next_action_at <= NOW())          AS pending,
  COUNT(*) FILTER (WHERE next_action_at > NOW())           AS in_flight,
  COUNT(*) FILTER (WHERE next_action_at IS NULL)           AS dead_letters,
  MAX(attempt)                                             AS max_attempts,
  AVG(attempt)                                             AS avg_attempts,
  NOW() - MIN(next_action_at)
    FILTER (WHERE next_action_at <= NOW())                 AS oldest_pending
FROM my_item_queue;

Because these are just SQL queries, exporting them as metrics requires nothing special: run the query on a schedule and push the results to whatever you already use for metrics. No separate agent, no broker-specific SDK, no specialist / vendor query language to learn.
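The export step can be as little as a few lines. In this sketch, fetch_row and push_gauge are hypothetical stand-ins for your database driver and metrics client:

```python
def export_queue_metrics(fetch_row, push_gauge):
    """Run the monitoring query and push each column as a gauge.

    fetch_row returns the query result as a dict of column name to value;
    push_gauge(name, value) forwards to whatever metrics backend you use.
    """
    row = fetch_row()
    for column in ("pending", "in_flight", "dead_letters", "max_attempts"):
        push_gauge(f"queue.{column}", row[column])
```

Run it from cron, a scheduler thread, or whatever periodic mechanism you already have; there is nothing queue-specific about the plumbing.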

The practical upshot: queue depth trending up, dead letter count non-zero, or max attempts climbing are all conditions you can alert on with a threshold rule against a gauge — the same as any other metric in your system. With a dedicated broker these signals come through a separate API, a separate dashboard, and a separate alerting integration.

The more interesting capability is the metrics that a dedicated broker simply cannot produce: anything that requires joining queue state to business data, for example:

  • Queue depth broken down by customer subscription tier
  • Failure rate for accounts created in the last 30 days
  • Items stuck in-flight belonging to customers whose accounts are now suspended

These cross-cutting questions would require pulling data from two separate systems and correlating them in application code or a data warehouse. With a database queue they are just queries — expressible in a single SQL statement, runnable on demand, and trivially added to the same scheduled export as the standard metrics above.

When You Hit the Ceiling

All three patterns share the same ceiling. They break down under sustained high throughput, strict sub-second latency requirements, or fan-out to a very large number of independent consumers. For the vast majority of workloads, that ceiling is never reached. As a rough guide: if you are processing a few thousand items per second — measured at the queue operations themselves, not the business logic — start paying attention to contention and vacuum pressure. That is the point to consider moving on, not necessarily the point to act.

If you do start approaching it, the progression is the same regardless of which pattern you are using:

  1. Tune first. Batch more aggressively, increase worker concurrency, optimise connection pooling. There is usually more headroom here than expected.
  2. Partition the table. Postgres table partitioning by customer_id (or a suitable key) is simpler than sharding, requires no application changes, and significantly reduces vacuum pressure.
  3. Graduate to a dedicated queue. RabbitMQ, SQS, or similar: the right move once queue throughput is genuinely the constraint, not everything around it.
  4. Shard across instances. Theoretical ceiling-raiser, but operationally heavy. Rarely the right next step.

Because the queue table is already decoupled from your business data, migration at step 3 is well-scoped: swap the enqueue, claim, and delete operations and leave everything else untouched. Most teams never reach step 2, let alone step 3.

Summary

The orthodoxy exists for good reason. At high throughput and strict latency, a dedicated queue wins. But for most workloads, that ceiling is never reached; and in the meantime, your database is already running, already monitored, and already the source of truth. A queue table adds almost nothing to your operational burden.

Three patterns, each suited to a different problem:

  • Basic pattern: crash-safe, multi-worker processing using FOR UPDATE SKIP LOCKED and the lock-via-future-timestamp mechanic. The right starting point for most use cases.
  • Fair queuing (batch): a single dispatcher using RANK() OVER (PARTITION BY customer_id) to distribute work fairly across customers at different rate limits. Simple and effective when per-minute latency is acceptable.
  • Fair queuing (delay): a frontier table to space items evenly across time rather than firing in bursts. More complex, but the right tool when downstream systems are sensitive to burst load.

The schema is consistent across all three: a queue table separate from your business data, customer_id denormalised where needed, partial indexes for efficiency, and a GIN index for error observability. When you do outgrow it, the decoupled design makes migration a well-scoped change rather than a rewrite.

Happy queuing.