Most engineering teams rewrite their systems three times. The first is largely unavoidable. The second is potentially avoidable. The third is self-inflicted, and it is the one that kills velocity.

The pattern is consistent enough that you can use it as a diagnostic. If a team is about to start a rewrite, the first question worth asking is: which one is this? The answer changes what you should actually do.

Rewrite 1: The Domain Rewrite

The first rewrite happens when the team discovers they were building the wrong thing.

Not wrong in the sense of building a product nobody wants; that is a different problem. Wrong in the sense that the data model encoded assumptions about the domain that turned out to be false. The Order model assumed one address per customer; now there are multiple. The User model assumed a single account type; now there are three. The state machine was a boolean; now it has seven states.
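The shape of this discovery can be sketched in a few lines. This is a hedged illustration, not code from any system described in this series; the Order and OrderState names are hypothetical, standing in for whatever the real domain turned out to be.

```python
from dataclasses import dataclass, field
from enum import Enum

# What v1 assumed: an order is either shipped or it is not,
# and a customer has exactly one address.
#
# @dataclass
# class OrderV1:
#     customer_id: int
#     address: str
#     shipped: bool = False

# What the domain actually required once real usage arrived:
# the boolean became a seven-state machine.
class OrderState(Enum):
    DRAFT = "draft"
    CONFIRMED = "confirmed"
    PAID = "paid"
    PICKING = "picking"
    SHIPPED = "shipped"
    DELIVERED = "delivered"
    RETURNED = "returned"

@dataclass
class Order:
    customer_id: int
    shipping_addresses: list[str] = field(default_factory=list)  # was one address
    state: OrderState = OrderState.DRAFT
```

The rewrite is the work of moving every piece of code that branched on the boolean onto the state machine, and every piece that assumed one address onto the list. How bounded that work is depends entirely on how many places held those assumptions.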

This rewrite is the tuition fee for learning the domain. It is not a failure of engineering; it is evidence that the team shipped fast enough to discover what they were actually building. The only way to avoid it entirely is to understand the domain before writing a line of code, which is typically not possible for genuinely novel products.

The damage is bounded if the codebase is small and the data model is reasonably isolated. It becomes expensive when:

  • The domain model leaked into the database schema in ways that are hard to migrate
  • The business logic is entangled with the storage layer, so changing one means changing both
  • The team kept building on top of the wrong model rather than correcting it early

The third point is the important one. Rewrite 1 becomes expensive not because it happens, but because it is delayed. Every week of building on a wrong model is a week of technical debt that compounds. The teams that navigate it well are the ones that course-correct as soon as the model is clearly wrong, not the ones that defer it until the next funding round.

The best mitigation is not to avoid rewrite 1. It is to design so that when it comes, it is a data model change, not an architectural one.

Rewrite 2: The Architecture Rewrite

The second rewrite happens when the system's structure makes it impossible to evolve without touching everything.

The symptoms are recognisable. Adding a new client requires duplicating infrastructure. Changing the storage backend for one part of the system means touching application code everywhere. A schema migration that should take a day takes a month because the database is accessed from forty different places with no consistent abstraction. The system has grown, but it has grown as a tangle rather than a structure.

This is the expensive rewrite. Not expensive in the sense of taking a long time (though it does); expensive in the sense that it happens at the worst possible moment. By the time the symptoms are undeniable, the team is under commercial pressure, the codebase is large, and the cost of the rewrite is compounded by the cost of maintaining the old system while the new one is built, with frustrated clients escalating to the product team throughout.

A previous article in this series describes a concrete case: a system built without a clean data layer that could not be migrated to a multi-tenanted architecture, because storage was inseparable from application logic. The team duplicated the entire infrastructure stack per client instead. The operational cost ran to millions of dollars per year more than a properly abstracted system would have required.

The cause is almost always the same: storage and application logic were not kept separate from the start. The ORM leaked into business logic. Queries were written inline. The database schema became the domain model. When the architecture needs to change, there is no seam to cut along.

Rewrite 2 is optional. A clean data layer from day one creates the seam. When a storage backend needs to change, you write a new provider implementation. The application code does not move. The rewrite becomes a targeted swap behind an interface rather than a systemic untangling.

This is not a large investment. A data layer is a few interfaces and a dependency injection pattern. The cost is a few days of upfront structure. The saving is rewrite 2 never happening.

Rewrite 3: The Complexity Rewrite

The third rewrite is the one that looks like progress.

The team has hit real scaling constraints, or believes it has. The decision is made to move to microservices, or to adopt an event-driven architecture, or to introduce a message broker. The reasoning feels sound: the system is under load, distributed systems handle load, therefore the system should be distributed.

Two years later, the system is harder to understand than it was before. Deployments are coordination exercises. Debugging requires correlating logs across eight services. A feature that would have been a function call is now a choreography of events across a queue. New engineers take months to become productive. The team is spending more time on infrastructure than on the business problem.

This is rewrite 3: adopting complexity that was not warranted by the actual constraints, and then living with the consequences.

The Spark migration described in this series is a version of this: a batch processing system migrated to a distributed compute framework to handle scale that had not yet arrived, and then retrofitted for a streaming requirement that the framework handled awkwardly. The engineering cost ran to low millions over the life of the system. The lesson was not that Spark was wrong in principle; it was that the decision was made for hypothetical scale rather than real constraints.

Microservices have the same failure mode at the architectural level. The cases where distribution genuinely helps are real but specific: services with meaningfully different scaling profiles, teams that need independent deployment, compute that is genuinely too heavy to run in-process. Outside those cases, distribution adds coordination overhead without adding capability. The monolith that was "too slow" is usually not slow because it is a monolith. It is slow because of a handful of specific bottlenecks that could be addressed without distributing state.

The irony of rewrite 3 is that it is often triggered by rewrite 2 not having happened. The system is a tangle. Adding the data layer now feels too risky; the coupling is everywhere. The path of least resistance is to extract the messy parts into services instead, hoping that a service boundary will provide the separation that good internal structure would have provided. It does not. What it provides is a distributed tangle, which is strictly harder to reason about than the original.

The Relationship Between Them

Rewrite 1 is paid for by discovery. It is a reasonable cost.

Rewrite 2 is paid for by a missing abstraction. It is an avoidable cost.

Rewrite 3 is paid for by the belief that complexity is a solution to a problem that was actually a missing abstraction. It is a compounding cost.

Teams that do rewrite 3 have usually skipped or deferred the work that would have made rewrite 2 unnecessary. The coupling that should have been addressed with a data layer is addressed instead with a service boundary, which moves the coupling rather than resolving it. The service boundary introduces distributed systems problems that did not previously exist. The team now has to solve both the original coupling problem and the distributed coordination problem.

The rewrite that was avoided would have cost one sprint. The one that replaced it costs six months, a degraded system in the interim, and a hiring constraint, because the specialist knowledge now required is deeper.

What To Do Instead

The objective is to do rewrite 1 cheaply, avoid rewrite 2 entirely, and never reach rewrite 3.

For rewrite 1: keep the domain model separate from the storage layer from day one. When the model turns out to be wrong (and it will), the change is bounded. The storage layer implements the new model; the application code uses the new types. No ORM annotations to migrate, no inline queries to update, no framework magic to unpick.
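One way that boundedness looks in practice, sketched under hypothetical names: the domain type is a plain dataclass, and the storage layer owns the translation between stored rows and domain objects. When the model changes, the translation function is the only place the old schema is still mentioned.

```python
from dataclasses import dataclass

# Domain model v2: a plain type, no ORM annotations to migrate.
@dataclass
class Customer:
    id: str
    addresses: list[str]  # v1 assumed exactly one address

# The storage layer owns the row <-> domain translation. The old
# single-address schema is absorbed here, not in application code.
def customer_from_row(row: dict) -> Customer:
    # Legacy rows carry a single "address" column; new rows carry
    # an "addresses" list. Both map onto the v2 domain type.
    if "addresses" in row:
        addresses = list(row["addresses"])
    else:
        addresses = [row["address"]] if row.get("address") else []
    return Customer(id=row["id"], addresses=addresses)
```

Application code only ever sees Customer. The schema migration can then proceed at whatever pace the data allows, because the translation function bridges old and new rows in the meantime.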

For rewrite 2: build a data layer before you need one. This is not premature optimisation; it is the minimum structure that keeps the exit routes open. When a touchpoint outgrows its backend, you write a new provider. When a schema changes, you change the provider; the engineering effort happens "under the hood". The application does not move.

For rewrite 3: earn distribution before adopting it. The bar is concrete: a component needs independent deployment, or its resource profile is genuinely incompatible with the core, or a team boundary requires a service boundary. Not "this might scale better" or "the job posting mentioned Kafka". When those concrete conditions are met, the strategic monolith with satellites gives you the topology to add distribution precisely, at the right boundary, without spreading state.

The teams that avoid rewrites 2 and 3 are not the ones that planned more carefully or hired more senior engineers. They are the ones that kept the structure simple early, built the right abstractions before they were needed under pressure, and resisted the pull towards complexity before the constraints were real.

Summary

Three rewrites. One is learning. One is a missing abstraction. One is the wrong solution to the missing abstraction.

The first is the cost of discovery: largely unavoidable, bounded if the domain model is kept separate from storage.

The second is the cost of coupling: avoidable with a clean data layer from day one.

The third is the cost of mistaking distribution for a solution to coupling: entirely self-inflicted, and considerably more expensive than the problem it was meant to solve.

The goal is not to build the perfect system first. It is to build a system that can absorb the changes that discovery demands, without requiring a rewrite every time the constraints shift.