From Monolith to Microservices

Most systems don't start as microservices.

In fact, many of the products we use every day began as a single application connected to a single database. That is usually the right decision.

Yet, if you spend enough time building software, there is a good chance you will eventually hear a familiar statement:

"We need to move to microservices."

The statement often arrives after the system has grown beyond its original expectations.

The engineering team has grown.

Features are being delivered faster.

Deployments are becoming riskier.

Different parts of the system are competing for resources.

What once felt simple now feels fragile.

At this point, many teams begin looking at microservices as the solution.

Unfortunately, this is where many of them make a costly mistake.

They focus on splitting the application into smaller services but fail to prepare for the realities that come with distributed systems.

Network failures.
Service dependencies.
Message duplication.
Observability challenges.
Complex debugging.
Cascading failures.

I have seen engineers successfully break apart a monolith only to discover that operating ten services is significantly harder than operating one.

This is why I believe the conversation around microservices is often incomplete.

The challenge is not moving from a monolith to microservices.

The challenge is building APIs that continue to work when services fail, traffic spikes, dependencies become unavailable, and production incidents occur.

"The challenge is not moving to microservices. The challenge is building APIs that survive production."

In this article, I will walk through the practical journey from monolith to microservices. More importantly, I will show how to design APIs that are resilient, observable, and capable of surviving production.

The Day Your Monolith Starts Fighting Back

I often tell engineers that there is nothing wrong with a monolith.

In fact, I would rather inherit a well-structured monolith than a poorly designed microservices architecture.

Many teams rush into microservices far too early.

They hear success stories from companies like Netflix, Amazon, and Uber and assume microservices are the natural next step for every application.

They are not.

A monolith is usually the fastest way to deliver value during the early stages of a product.

Everything lives in one codebase.

Everything is deployed together.

Debugging is straightforward.

Transactions are simple.

Development is faster.

A typical monolithic architecture looks something like this:

Everything appears manageable.

Until it isn't.

Imagine an e-commerce platform.

At the beginning, the application supports a few thousand users.

The backend handles user management, orders, payments, inventory, notifications, and reporting.

The system performs well.

Deployments are easy.

Everyone is happy.

Then the business grows.

The marketing team launches successful campaigns.

Traffic increases.

New engineers join the team.

Additional features are introduced.

What was once a clean application slowly becomes more difficult to manage.

Now a change in the notification module requires deploying the entire application.

A bug in reporting can affect order processing.

A spike in product searches forces the entire application to scale, even though only one module is experiencing heavy traffic.

Deployment windows become stressful.

Engineers become increasingly cautious about making changes.

At some point, the monolith starts fighting back.

💡

A monolith is not a problem. A monolith that has outgrown the assumptions it was designed for is the problem.

Not because monoliths are bad.

But because the needs of the business have evolved beyond the architecture's original assumptions.

This is usually the point where teams begin exploring microservices.

The problem is that many engineers think microservices exist primarily to scale servers.

In my experience, that is only part of the story.

The real reason microservices exist is much more interesting.

Microservices help organizations scale teams.

"Microservices exist to scale teams far more than they exist to scale servers."

And understanding that distinction changes how you design systems.

In the next section, I'll explain why microservices exist, when they make sense, and why moving to them too early can create more problems than it solves.

Why Microservices Exist (And Why Most Teams Misunderstand Them)

One of the biggest misconceptions I see is the belief that microservices exist primarily to help applications handle more traffic.

Traffic is certainly part of the conversation, but it is rarely the root problem.

I have seen monolithic applications comfortably handle millions of requests.

I have also seen microservices architectures struggle under workloads that a well-designed monolith could have managed without difficulty.

The question is not:

Can my application handle more traffic?

The question is:

Can my engineering team continue to deliver changes safely and efficiently as the product grows?

That is where microservices begin to make sense.

As systems grow, the bottleneck often shifts from infrastructure to people.

A team of three engineers can comfortably work within a monolith.

A team of thirty engineers working on the same codebase is a very different story.

The challenges become less about CPUs and memory and more about coordination.

Teams begin stepping on each other's toes.

Deployments become frequent.

Merge conflicts increase.

Release cycles slow down.

Changes that should take hours begin taking days.

At this point, the architecture is no longer serving the organization effectively.

This is where microservices can help.

Not because they magically improve performance.

But because they allow teams to move independently.

Scaling Servers vs Scaling Teams

Let's consider an e-commerce platform.

Initially, the system consists of a single backend application responsible for:

User Management
Product Catalog
Inventory
Payments
Notifications

A single engineering team owns everything.

This works well.

But as the business grows, things become more complicated.

Different teams emerge.

One team focuses on the product catalog and inventory.

Another team focuses on payments.

Another team focuses on customer communication.

Now imagine the payments team wants to release a critical fix.

Unfortunately, they must wait because another team is preparing a large deployment involving inventory logic.

The two changes are completely unrelated.

Yet they are tightly coupled because everything lives inside the same application.

This is one of the first signs that the architecture may be limiting organizational growth.

Microservices address this problem by allowing responsibilities to be separated into independently deployable units.

Instead of one large application, we begin moving toward something like this:

Now each service can evolve independently.

The Payment Service can be updated without redeploying the Order Service.

The Notification Service can scale independently during peak communication periods.

The Notification Service can adopt a different technology stack if there is a legitimate reason to do so.

The architecture begins to mirror the structure of the business itself.

💡

Good service boundaries usually reflect business boundaries. If your architecture and organizational structure are constantly fighting each other, one of them is probably wrong.

A Microservice Is Not Just a Smaller Application

This is another mistake I frequently see.

Many teams take a monolith and simply split it into smaller pieces.

Unfortunately, this often creates what I call a distributed monolith.

The services are physically separated, but they remain tightly coupled.

Every service depends on every other service.

Changes ripple through the system.

Deployments remain coordinated.

Failures spread quickly.

The team has inherited the complexity of microservices without gaining their benefits.

A good microservice is not defined by its size.

It is defined by its responsibility.

Each service should own a specific business capability.

Examples include:

User Service
Payment Service
Inventory Service
Notification Service
Order Service

Notice that each of these represents a business function.

Now consider these examples:

Utility Service
Common Service
Shared Service
Database Service

These names usually indicate unclear boundaries.

When service boundaries are unclear, ownership becomes unclear.

And when ownership becomes unclear, complexity increases.

A useful question I often ask is:

"What business capability disappears if this service is removed?"

If the answer is obvious, the boundary is probably healthy.

If the answer is vague, the service may need to be reconsidered.

The Database Conversation Nobody Likes

Many teams successfully split their application into multiple services.

Then they connect every service to the same database.

At that point, they have not really solved the coupling problem.

They have simply moved it.

A shared database creates hidden dependencies.

One service can unintentionally break another.

Schema changes become risky.

Teams lose autonomy.

The database becomes the new monolith.

A fundamental principle of microservices is data ownership.

Each service should own its data.

For example:

Order Service

Orders
Order Statuses

Payment Service

Transactions
Refunds

Notification Service

Message History
Delivery Status

This separation allows teams to evolve their services independently.

It also introduces new challenges around consistency, which we will discuss later.

Every architectural decision is a trade-off.

Microservices are no exception.

Before You Reach for Microservices

Whenever someone asks me whether they should adopt microservices, I usually ask a different question:

What problem are you trying to solve?

If the answer is:

We want to look modern.
Netflix uses microservices.
Everyone else is doing it.

Then I strongly recommend staying with the monolith.

If the answer is:

Teams are blocked by each other.
Deployments are becoming risky.
Different domains need to scale independently.
Organizational growth is creating bottlenecks.

Then microservices may be worth considering.

The goal is never to collect services.

The goal is to create a system that allows the business and engineering teams to move faster without sacrificing reliability.

That sounds simple.

In practice, this is where the real challenge begins.

Because the moment services become independent, they need a way to communicate.

And communication is where many microservices architectures either succeed or fail.

In the next section, I'll explore the communication patterns that power microservices, when to use synchronous communication, when to use asynchronous messaging, and the trade-offs that every engineering team must understand before choosing either approach.

Communication Between Services: Where Most Systems Start to Break

Once you move beyond a single application, the real work begins. At this point, I usually tell engineers that the architecture is no longer the hard part. The hard part is communication.

Because now, instead of functions calling functions, you have services calling services over the network. And the network, unlike in-process memory, is unreliable by default.

I have seen systems that looked perfectly clean on paper fall apart in production simply because communication patterns were chosen without enough thought. A simple user request ends up bouncing between services, and when something goes wrong, nobody can immediately tell where the failure started.

This is where you start to feel the difference between synchronous and asynchronous communication, not as theory, but as lived experience.

Synchronous Communication: Simple, But Fragile

Synchronous communication is usually where teams begin. It feels natural because it mirrors how monoliths work. One service calls another, waits for a response, and continues execution.

For example, in a simple order flow:

The Order Service receives a request, then calls the Payment Service to confirm payment before finalizing the order.

On the surface, this is straightforward. It is easy to reason about, easy to implement, and easy to debug in small systems.

But I have learned that simplicity in distributed systems can be deceptive.

Because now the Order Service is no longer just responsible for orders. It is also dependent on the availability, latency, and stability of the Payment Service. If the Payment Service slows down, the entire order flow slows down. If it fails, the order flow fails with it.

At scale, this creates a chain reaction. One slow service can degrade the entire system experience, even if everything else is working perfectly.

This is the first place I usually see teams underestimate the cost of microservices.

Asynchronous Communication: Where Systems Start to Breathe

At some point, teams begin to realize that not everything needs an immediate response. This is usually where asynchronous communication enters the picture.

Instead of calling another service directly, a service publishes an event and continues its work. Other services subscribe to that event and react independently.

A simple example is an order event. When an order is created, the Order Service publishes an OrderCreated event. The Notification Service listens and sends emails. The Analytics Service listens and updates metrics. The Inventory Service adjusts stock.

The interesting part here is not just the architecture, but the shift in thinking. The Order Service no longer cares who reacts to the event or how many services are involved. It only cares that the event is published successfully.
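As a rough illustration of the publish and subscribe flow described above, here is a minimal in-process sketch. The `EventBus` class and the handler bodies are assumptions made for illustration; a real system would put a message broker between the services.

```typescript
// Minimal in-process event bus; a stand-in for a real message broker.
type OrderCreated = { type: 'OrderCreated'; orderId: string; userId: string };
type Handler = (event: OrderCreated) => void;

class EventBus {
  private handlers: Handler[] = [];

  subscribe(handler: Handler): void {
    this.handlers.push(handler);
  }

  publish(event: OrderCreated): void {
    for (const handler of this.handlers) {
      try {
        handler(event);
      } catch (err) {
        // One failing subscriber must not block the others or the publisher.
        console.error('subscriber failed', err);
      }
    }
  }
}

const bus = new EventBus();
const handled: string[] = [];

// Hypothetical subscribers standing in for three separate services.
bus.subscribe((e) => handled.push(`notification:${e.orderId}`));
bus.subscribe((e) => handled.push(`analytics:${e.orderId}`));
bus.subscribe((e) => handled.push(`inventory:${e.orderId}`));

// The Order Service only publishes; it does not know who is listening.
bus.publish({ type: 'OrderCreated', orderId: 'ord_123', userId: 'usr_789' });
```

Adding a fourth subscriber requires no change to the publisher, which is the decoupling this pattern is after.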

This is where systems start to feel more flexible. They scale better, failures become more isolated, and services stop being tightly dependent on each other.

But I always caution engineers here. Asynchronous communication does not remove complexity. It moves it somewhere else.

Now you have to think about event ordering, duplication, eventual consistency, and debugging across time rather than across a single request flow.

I have seen engineers struggle more with this shift than with microservices themselves.

What Real Systems Usually Look Like

One of the biggest misconceptions in software architecture is that systems are either synchronous or asynchronous.

In reality, most successful systems are a mixture of both.

A ride booking request might look like this:

The payment verification remains synchronous because the passenger needs an immediate answer.

The notifications and analytics updates become asynchronous because they do not need to block the booking process.

This approach allows the system to remain responsive while reducing unnecessary dependencies.

Over time, I have found that this hybrid model is usually the most practical approach for production systems.

The goal is not to eliminate synchronous communication.

The goal is to reserve it for situations where immediate feedback is genuinely required.
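A hybrid flow along these lines could be sketched as follows. The `Deps` interface, `bookRide`, and the `RideBooked` event name are hypothetical; the point is only that the payment check is awaited while the event is published without waiting for its consumers.

```typescript
type BookingResult = { status: 'confirmed' | 'rejected'; rideId?: string };

// Hypothetical dependencies, injected so the sketch stays self-contained.
interface Deps {
  verifyPayment(userId: string): Promise<boolean>; // synchronous dependency
  publish(event: { type: string; rideId: string }): void; // fire-and-forget
}

async function bookRide(
  userId: string,
  rideId: string,
  deps: Deps,
): Promise<BookingResult> {
  // The passenger needs an immediate answer, so we wait for this call.
  const paid = await deps.verifyPayment(userId);
  if (!paid) return { status: 'rejected' };

  // Notifications and analytics react to the event later; we do not wait.
  deps.publish({ type: 'RideBooked', rideId });
  return { status: 'confirmed', rideId };
}
```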

The moment services begin communicating over a network, failures become inevitable.

Not possible.

Not likely.

Inevitable.

And that realization changes how you design systems.

In the next section, I'll explore why production-ready microservices must be designed with failure in mind from day one, and the resilience patterns that help APIs continue functioning even when parts of the system are failing.

Designing for Failure: Because Production Doesn't Care About Your Architecture

One of the most important lessons I learned about distributed systems is that failure is not an edge case.

It is a feature of the environment.

When engineers first move from a monolith to microservices, they often focus on service boundaries, deployment strategies, and communication patterns. Those things matter, but they are not what keeps systems running in production.

What keeps systems running is how they behave when things go wrong.

And things will go wrong.

A service will crash.

A network connection will drop.

A database will become unavailable.

A third-party provider will experience an outage.

A message broker will become overloaded.

None of these scenarios are unusual. In fact, if your system runs long enough, every single one of them will happen eventually.

The question is not whether failure will occur.

The question is how your system responds when it does.

I often tell engineers that the difference between a development environment and a production environment is simple:

Everything works in development.

Production introduces reality.

The First Mistake: Assuming Every Request Will Succeed

Imagine a ride-hailing platform.

A passenger requests a ride.

The Ride Service calls the Payment Service to validate the passenger's payment method before confirming the booking.

Everything works perfectly during testing.

The Payment Service responds within milliseconds.

The booking is confirmed.

The passenger receives a notification.

Everyone is happy.

Now imagine the same flow in production.

The Payment Service becomes temporarily unavailable.

Maybe the service is being deployed.

Maybe the database is under heavy load.

Maybe there is a network issue between services.

Whatever the cause, the result is the same.

The Ride Service is waiting for a response that never arrives.

The ride booking fails.

Not because the Ride Service is broken.

Not because the passenger did anything wrong.

But because the system assumed every dependency would always be available.

That assumption is one of the fastest ways to create fragile systems.

Timeouts: Teaching Systems When to Stop Waiting

One of the simplest resilience patterns I use is the timeout.

A timeout defines how long a service is willing to wait before giving up on a request.

Without a timeout, a service can wait indefinitely for a dependency that may never respond.

This sounds harmless until dozens, hundreds, or thousands of requests begin piling up.

Eventually, the service runs out of resources and becomes unhealthy itself.

What started as a problem in one service has now spread to another.

I have seen incidents where a single slow dependency caused an entire system to become unresponsive simply because requests were allowed to wait forever.

A timeout acts as a boundary.

It allows a service to fail quickly and preserve resources rather than waiting endlessly for something outside its control.

A failed request is unfortunate.

A system-wide outage is much worse.
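To make the idea concrete, here is a minimal timeout wrapper. The helper name `withTimeout` is an assumption for illustration, and this sketch only stops waiting; it does not cancel the underlying work. For HTTP calls, passing an `AbortSignal` to `fetch` can additionally cancel the request itself.

```typescript
// Rejects if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms,
    );
    work.then(
      (value) => {
        clearTimeout(timer); // settled in time, so cancel the deadline
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

The caller decides what a timeout means for the business flow: fail the request, fall back to a default, or queue the work for later.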

Retries: Giving Temporary Failures a Second Chance

Not every failure is permanent.

Sometimes a service is restarting.

Sometimes a network connection briefly drops.

Sometimes a dependency experiences a temporary spike in traffic.

These situations often resolve themselves within seconds.

This is where retries become useful.

Instead of immediately failing, the service attempts the request again.

At first glance, retries seem like an obvious improvement.

But I have learned that retries are dangerous when implemented carelessly.

Imagine a dependency that is already struggling under heavy load.

Now imagine thousands of services retrying requests simultaneously.

Instead of helping the dependency recover, the retries make the problem worse.

This is why I typically pair retries with exponential backoff.

Rather than retrying immediately, each attempt waits progressively longer before trying again.

The goal is not to overwhelm the failing service.

The goal is to give it room to recover.
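A sketch of retries with exponential backoff might look like this. The function name, the default values, and the injectable `sleep` parameter are illustrative assumptions. The random jitter is an addition not discussed above; it is included because it helps keep many clients from retrying at the same instant.

```typescript
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 100,
  sleep: Sleep = realSleep, // injectable so tests do not have to wait
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break;
      // Delay doubles on each attempt: 100ms, 200ms, 400ms, ...
      const delay = baseDelayMs * 2 ** (attempt - 1);
      // Random jitter (up to 50% extra) spreads out simultaneous retries.
      const jitter = Math.random() * delay * 0.5;
      await sleep(delay + jitter);
    }
  }
  throw lastError;
}
```

Retries like this are only safe for operations that can be repeated without side effects, which is why they are usually paired with idempotency.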

Circuit Breakers: Preventing Failure from Spreading

One of my favorite resilience patterns is the circuit breaker.

The name comes from electrical systems.

When an electrical circuit experiences a fault, the breaker trips and temporarily stops the flow of electricity to prevent further damage.

The same idea applies to software.

Imagine a Payment Service that has been failing continuously for several minutes.

Without protection, every incoming ride request continues attempting payment verification.

The result is predictable.

Resources are wasted.

Latency increases.

More services become affected.

Instead, the circuit breaker detects repeated failures and temporarily stops sending requests.

When the circuit is open, requests fail immediately rather than waiting for an already unhealthy dependency.

After a configured period, the system can attempt a small number of test requests to determine whether the dependency has recovered.

If recovery is successful, the circuit closes and normal traffic resumes.

This pattern has saved countless systems from cascading failures.
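The closed, open, and half-open states described above can be sketched in a small class. The class name, thresholds, and injectable clock are assumptions for illustration. This simplified version lets every request through while half-open, whereas production libraries usually limit the number of trial requests.

```typescript
type State = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: State = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,
    private readonly resetTimeoutMs = 30_000,
    private readonly now: () => number = () => Date.now(),
  ) {}

  async call<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        // Fail immediately instead of waiting on an unhealthy dependency.
        throw new Error('circuit open: failing fast');
      }
      this.state = 'half-open'; // reset period elapsed: allow a trial request
    }
    try {
      const result = await operation();
      this.failures = 0;
      this.state = 'closed'; // success closes the circuit again
      return result;
    } catch (err) {
      this.failures++;
      // A failed trial, or too many consecutive failures, opens the circuit.
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw err;
    }
  }

  getState(): State {
    return this.state;
  }
}
```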

Idempotency: Protecting Against Duplicate Operations

There is one resilience pattern that deserves special attention because it protects something more valuable than infrastructure.

It protects business correctness.

Consider a passenger making a payment for a ride.

The payment request reaches the Payment Service.

The payment succeeds.

Unfortunately, the response never reaches the client because of a network interruption.

The client assumes the request failed and sends it again.

Without protection, the passenger may be charged twice.

I have never met a customer who enjoys discovering duplicate charges.

This is where idempotency becomes essential.

An idempotent operation produces the same outcome regardless of how many times the same request is received.

Instead of processing the payment repeatedly, the service recognizes that it has already handled the request and returns the original result.

In distributed systems, duplicate requests are not unusual.

They are expected.

Retries, network interruptions, message redelivery, and client behavior can all produce duplicates.

Designing for idempotency ensures those duplicates do not become business problems.
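One common way to achieve this is an idempotency key supplied by the client. The sketch below assumes a hypothetical `PaymentProcessor` with an in-memory map; a real service would persist the keys and handle concurrent duplicates, which this example does not.

```typescript
type PaymentResult = { paymentId: string; amount: number; status: 'succeeded' };

class PaymentProcessor {
  // In-memory store for illustration; a real service would persist this.
  private readonly results = new Map<string, PaymentResult>();
  private nextId = 1;

  charge(idempotencyKey: string, amount: number): PaymentResult {
    const previous = this.results.get(idempotencyKey);
    // Duplicate request: return the original result, do not charge again.
    if (previous) return previous;

    const result: PaymentResult = {
      paymentId: `pay_${this.nextId++}`,
      amount,
      status: 'succeeded',
    };
    this.results.set(idempotencyKey, result);
    return result;
  }
}

const processor = new PaymentProcessor();
const first = processor.charge('key-abc', 2500);
const retried = processor.charge('key-abc', 2500); // client retry after a lost response
```

Here `retried` is the same result as `first`, so the passenger is charged once even though the request arrived twice.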

Resilience Is Not About Preventing Failure

When engineers first hear about resilience patterns, they often assume the goal is to eliminate failures.

That is impossible.

No architecture, framework, cloud provider, or technology stack can guarantee that failures will never occur.

The purpose of resilience is not to prevent failure.

The purpose is to contain failure.

A resilient system recognizes that dependencies will occasionally become unavailable, requests will occasionally fail, and infrastructure will occasionally behave unpredictably.

Instead of collapsing under those conditions, it continues operating in a controlled and predictable way.

That shift in mindset changed how I design systems.

I stopped asking:

"How do I make sure this never fails?"

And started asking:

"What happens when this fails?"

The answers to that question usually reveal more about a system's production readiness than its architecture diagrams ever will.

Of course, surviving failure is only half the battle.

Even the most resilient system becomes difficult to operate if engineers cannot understand what is happening inside it.

And that brings us to one of the most overlooked aspects of modern software architecture: observability.

Observability: Understanding What Your System Is Actually Doing

The first time I worked on a system with multiple services, I thought logs would be enough.

I was wrong.

A user reported that a payment was successful, but the order was never confirmed. The Payment Service looked healthy. The Order Service looked healthy. The Notification Service looked healthy.

Yet somehow the workflow had failed.

I spent hours jumping between log files trying to piece together what happened.

That experience taught me something important:

Building distributed systems is one challenge.

Understanding them in production is another challenge entirely.

And that is where observability comes in.

Observability is not a monitoring tool.

It is not a dashboard.

It is not a logging library.

Observability is your ability to understand the internal state of a system by examining its outputs.

When a customer says:

"Something isn't working."

Observability helps you answer:

"What happened, where did it happen, and why did it happen?"

Without guesswork.

The Three Pillars of Observability

When I think about observability, I think about three things:

Logs
Metrics
Traces

Each one answers a different question.

Pillar	Answers
Logs	What happened?
Metrics	How often is it happening?
Traces	Where did it happen?

Most teams have logs.

Fewer teams have useful logs.

Even fewer teams have traces.

Let's fix that.

Structured Logging: Stop Writing Logs for Humans Only

One of the most common mistakes I see is unstructured logging.

Things like:

Payment failed

Error processing order

These logs seem helpful when you're developing locally.

They become almost useless when thousands of requests are flowing through multiple services.

Instead, I prefer structured logs.

A structured log contains context.

{
  "service": "payment-service",
  "orderId": "ord_123",
  "paymentId": "pay_456",
  "userId": "usr_789",
  "status": "failed",
  "reason": "insufficient_funds"
}

Now I can search by:

orderId
paymentId
userId
status

and immediately find relevant events.

In a NestJS application, I typically use Pino.

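import { Injectable } from '@nestjs/common';
import { PinoLogger } from 'nestjs-pino';

@Injectable()
export class PaymentService {
  constructor(
    // PinoLogger exposes Pino's level methods (info, warn, error, ...)
    private readonly logger: PinoLogger,
  ) {}

  async processPayment(orderId: string) {
    // Log a structured object rather than a sentence
    this.logger.info({
      orderId,
      action: 'payment_processing_started',
    });

    // payment logic
  }
}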

Notice what I'm logging.

Not sentences.

Context.

Machines can search context.

Humans can interpret context.

You need both.

Correlation IDs: Following a Request Across Services

Structured logs alone are not enough.

Imagine a request moving through four services:

Each service generates logs.

The challenge becomes identifying which logs belong to the same user request.

This is where correlation IDs become invaluable.

When a request enters the system, generate a unique identifier.

x-correlation-id: 7f4a1f1b-95a4-4ec8-85ab-bd8e87c43f76

Every service forwards that value.

Now every log entry contains:

{
  "correlationId": "7f4a1f1b-95a4-4ec8-85ab-bd8e87c43f76",
  "service": "payment-service",
  "event": "payment_completed"
}

When something goes wrong, I search for the correlation ID and instantly reconstruct the entire request journey.

This single practice has saved me more debugging time than almost any other observability technique.

Implementing Correlation IDs in NestJS

A simple middleware can generate and propagate correlation IDs.

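import { Injectable, NestMiddleware } from '@nestjs/common';
import { NextFunction, Request, Response } from 'express';
import { v4 as uuid } from 'uuid';

@Injectable()
export class CorrelationIdMiddleware implements NestMiddleware {
  use(req: Request, res: Response, next: NextFunction) {
    // Reuse the caller's ID when present; otherwise start a new one.
    const incoming = req.headers['x-correlation-id'];
    const correlationId =
      (Array.isArray(incoming) ? incoming[0] : incoming) || uuid();

    // Make the ID available to handlers and loggers for this request.
    req['correlationId'] = correlationId;

    // Echo it back so callers can quote it when reporting a problem.
    res.setHeader('x-correlation-id', correlationId);

    next();
  }
}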

Every downstream service should preserve and forward the same value.

Think of it as a tracking number for a request.
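Forwarding the value on outgoing calls can be as small as a header helper. The function name `withCorrelationId` and the URL in the usage comment are hypothetical.

```typescript
// Builds headers for an outgoing call, carrying the correlation ID forward.
function withCorrelationId(
  correlationId: string,
  headers: Record<string, string> = {},
): Record<string, string> {
  return { ...headers, 'x-correlation-id': correlationId };
}

// Hypothetical usage inside a service that calls the Payment Service:
// await fetch('http://payment-service/payments', {
//   method: 'POST',
//   headers: withCorrelationId(correlationId, {
//     'content-type': 'application/json',
//   }),
//   body: JSON.stringify({ orderId }),
// });
```

The same ID should also be attached to any events the service publishes, so asynchronous work can be tied back to the original request.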

Metrics: Detecting Problems Before Users Do

Logs tell me what happened.

Metrics tell me whether I should be worried.

For example:

A single failed payment isn't necessarily a problem.

But if payment failures suddenly increase from 2% to 35% within five minutes, I want to know immediately.

Some of the metrics I monitor most often include:

Request Rate
Error Rate
Response Time
Queue Length
Database Connection Count
Memory Usage

A useful mental model is this:

Logs help with investigations.

Metrics help with detection.

Metrics answer:

Is something unusual happening?

before customers start opening support tickets.
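To show how an error rate over a time window can drive an alert, here is a small self-contained sketch. The `ErrorRateWindow` class and the 20% threshold are illustrative assumptions; in practice a metrics system such as Prometheus would compute this rather than application code.

```typescript
class ErrorRateWindow {
  private samples: { at: number; failed: boolean }[] = [];

  constructor(private readonly windowMs: number) {}

  record(failed: boolean, at: number = Date.now()): void {
    this.samples.push({ at, failed });
  }

  // Fraction of failed requests within the window ending at `now`.
  errorRate(now: number = Date.now()): number {
    // Drop samples that have fallen out of the window.
    this.samples = this.samples.filter((s) => now - s.at <= this.windowMs);
    if (this.samples.length === 0) return 0;
    const failures = this.samples.filter((s) => s.failed).length;
    return failures / this.samples.length;
  }
}

const FIVE_MINUTES = 5 * 60 * 1000;

// Hypothetical alert rule: fire when more than 20% of payments fail.
const shouldAlert = (rate: number, threshold = 0.2): boolean => rate > threshold;
```

With this rule, a 2% failure rate stays quiet while a jump to 35% inside the window triggers the alert.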

Distributed Tracing: The Missing Piece

Let's revisit the earlier scenario.

A customer says:

"My payment succeeded but my order was never confirmed."

Logs can help.

Correlation IDs can help.

But traces provide something even better.

They show the complete request journey.

A trace doesn't just show the path.

It shows timing.

Example:

API Gateway       12ms
Order Service     24ms
Payment Service   980ms
Notification      15ms

Immediately, the bottleneck becomes obvious.

The Payment Service is responsible for most of the latency.

No guessing required.

This is why I consider distributed tracing one of the most valuable investments for any microservices architecture.
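Conceptually, a trace is a list of timed spans, and finding the bottleneck is a matter of comparing durations. The `Span` type below is a simplified assumption; real tracing systems record start times, parent spans, and attributes as well.

```typescript
type Span = { service: string; durationMs: number };

// Returns the span with the longest duration, or undefined for an empty trace.
function slowestSpan(spans: Span[]): Span | undefined {
  return spans.reduce<Span | undefined>(
    (slowest, span) =>
      !slowest || span.durationMs > slowest.durationMs ? span : slowest,
    undefined,
  );
}

// The timings from the example trace above.
const trace: Span[] = [
  { service: 'API Gateway', durationMs: 12 },
  { service: 'Order Service', durationMs: 24 },
  { service: 'Payment Service', durationMs: 980 },
  { service: 'Notification', durationMs: 15 },
];

const bottleneck = slowestSpan(trace);
```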

OpenTelemetry: My Preferred Starting Point

When teams ask me where to begin with tracing, my answer is almost always the same:

Start with OpenTelemetry.

OpenTelemetry provides a standard way to collect:

Traces
Metrics
Logs

across services.

A minimal setup in NestJS might look like:

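import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  // Auto-instrumentation creates spans for HTTP, Express, and similar libraries
  instrumentations: [getNodeAutoInstrumentations()],
});

// Start the SDK before the application modules are loaded
sdk.start();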

In a real system, traces are exported to tools such as:

Jaeger
Tempo
Datadog
New Relic

Once configured, every request begins producing trace data automatically.

The result is visibility that would be nearly impossible to achieve through logs alone.

What Observability Looks Like in Practice

A mature production environment often looks something like this:

Each tool serves a different purpose.

OpenTelemetry collects telemetry data
Jaeger visualizes traces
Prometheus stores metrics
Grafana provides dashboards

Together, they provide a complete picture of system behavior.

Not assumptions.

Not guesses.

Evidence.

The Real Goal

The goal of observability is not collecting more data.

I have seen teams collect enormous amounts of logs and metrics while still struggling to diagnose incidents.

The goal is understanding.

When a customer reports a problem, I want answers within minutes, not hours.

I want to know:

Which service handled the request?
How long did it take?
Which dependency failed?
What happened before the failure?
What happened after the failure?

The most successful engineering teams I have worked with are not necessarily the teams that experience the fewest failures.

They are the teams that can quickly understand and recover from them.

And that ability begins with observability.

Now that we can see what's happening inside our system, the next challenge is operating microservices responsibly at scale.

Because building services that communicate, recover from failures, and expose useful telemetry is only part of the journey.

Eventually, we need to discuss what separates a collection of services from a production-ready microservices platform.

The Mistakes That Turn Microservices into Distributed Monoliths

One of the reasons I enjoy discussing microservices is that most conversations focus on success stories.

The diagrams are clean.

The services are neatly separated.

Everything appears elegant.

Production rarely looks like those diagrams.

Over the years, I have noticed that many teams do not fail because they chose microservices.

They fail because they unknowingly recreate the same coupling they were trying to escape.

They trade one monolith for a distributed monolith.

And the worst part is that they often do not realize it until the system becomes difficult to evolve.

Mistake #1: Splitting Services Without Clear Boundaries

The first sign of trouble usually appears during service decomposition.

A team takes a monolith and begins creating services:

User Service
Order Service
Product Service

Good so far.

Then:

Utility Service
Common Service
Shared Service

That is usually where I start asking questions.

A service should represent a business capability, not a collection of unrelated helper functions.

When service boundaries are unclear, ownership becomes unclear.

When ownership becomes unclear, complexity follows.

One question I often ask is:

"If this service disappeared tomorrow, what business capability would disappear with it?"

If nobody can answer confidently, the boundary probably needs more thought.

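Mistake #2: Sharing a Database Across Services

One of the most common anti-patterns I see is several services reading from and writing to the same database.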

Technically, the services are separate.

Operationally, they are still coupled.

The database becomes the new monolith.

A schema change from one team can break another team's service.

Independent deployments become difficult.

Autonomy disappears.

Microservices should own their data.

That ownership is what enables true independence.

Mistake #3: Making Everything Synchronous

This usually starts innocently.

A service needs information from another service.

A REST call is added.

Then another.

Eventually, a single request passes through a long chain of services, each one waiting on the next.

The architecture may look organized.

The latency certainly isn't.

One slow dependency can affect the entire chain.

I have seen systems where a single request required seven synchronous service calls before a response could be returned.

That is rarely sustainable.
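
To make the cost of a synchronous chain concrete, here is a rough back-of-the-envelope sketch. The numbers are hypothetical (seven hops, 40 ms each, 99.9% availability each), and it assumes the calls run strictly one after another and fail independently, which real systems only approximate.

```python
import math

def chain_latency_ms(latencies_ms):
    """Sequential calls: the caller waits for every hop, so latencies add up."""
    return sum(latencies_ms)

def chain_availability(availabilities):
    """Every hop must succeed, so (assuming independent failures) availabilities multiply."""
    return math.prod(availabilities)

# Hypothetical numbers, chosen only for illustration.
hops = 7
total_latency = chain_latency_ms([40] * hops)
total_availability = chain_availability([0.999] * hops)

print(f"latency: {total_latency} ms")              # latency: 280 ms
print(f"availability: {total_availability:.4f}")   # availability: 0.9930
```

Seven dependencies that each look healthy on their own combine into a request that is noticeably slower and less available than any single one of them.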

Mistake #4: Ignoring Observability Until Production

This mistake usually reveals itself during the first serious incident.

A customer reports a problem.

Logs exist.

Metrics exist.

But nobody can explain what actually happened.

At that moment, observability stops feeling optional.

It becomes a necessity.

I strongly prefer introducing structured logging, metrics, and tracing before the first production deployment.

Retrofitting observability later is significantly harder.
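
As a small illustration of what "structured logging before the first deployment" can look like, here is a minimal sketch using only Python's standard library. The field names (service, correlation_id) and the service name order-service are my own assumptions for the example, not a standard. A real setup would more likely use an established logging or tracing library.

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            # These two fields are hypothetical; they arrive via the `extra` argument.
            "service": getattr(record, "service", "unknown"),
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)

def build_logger(name):
    """Create a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

logger = build_logger("order-service")
logger.info(
    "order created",
    extra={"service": "order-service", "correlation_id": "req-123"},
)
```

If every service logs the same correlation ID for the same request, a single search can reconstruct what happened across service boundaries.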

Mistake #5: Choosing Microservices Too Early

This is probably my most controversial opinion.

Many systems should remain monoliths.

A well-structured monolith is not a failure.

It is often the simplest and most effective solution.

I would rather maintain a healthy monolith than operate ten unnecessary services.

Microservices introduce operational complexity.

The benefits must justify that complexity.

If they do not, the monolith is usually the better choice.

The Goal Was Never Microservices

If there is one thing I want every engineer reading this to understand, it is this:

Microservices were never the goal.

They were never the destination.

They were never the measure of good architecture.

They are simply one possible answer to a deeper problem.

Over the years, I have seen engineers obsess over architecture diagrams. I have seen teams celebrate the moment they “moved to microservices” as though it marked a new level of engineering maturity.

But I have also seen what happens after the excitement fades.

Systems become harder to understand.

Deployments become more delicate.

Debugging becomes slower.

Incidents take longer to resolve.

And slowly, quietly, teams begin to realize that they have not eliminated complexity.

They have relocated it.

What Actually Matters in Production Systems

When I reflect on the systems I consider truly well designed, I notice a pattern.

It has very little to do with whether they are monoliths or microservices.

Instead, it comes down to a few practical questions:

Can the system evolve without breaking itself?

Can teams ship changes without fear?

Can failures be understood quickly?

Can issues be isolated instead of spreading?

Can the system recover gracefully when things go wrong?

Those are the real indicators of good architecture.

Not the number of services.

Not the sophistication of the stack.

Not the popularity of the tools.

A System Is Only as Strong as Its Weakest Assumption

One lesson I keep returning to is that most production failures are not caused by broken code.

They are caused by incorrect assumptions.

Assumptions like:

A dependency will always be available
A network call will always succeed
A service will always respond within a reasonable time
A message will only be processed once
Data will always be consistent across services

Microservices amplify the cost of these assumptions because every one of them now has to hold across a network boundary.

That is why we spent so much time talking about:

Timeouts
Retries
Circuit breakers
Idempotency
Observability

These are not “advanced topics.”

They are survival mechanisms.
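
To show how two of these mechanisms fit together, here is a minimal sketch of retries with exponential backoff paired with an idempotent handler. All names here (TransientError, call_with_retries, PaymentHandler) are hypothetical, and the in-memory dictionary stands in for what would need to be a durable store in production. Timeouts and circuit breakers are left out to keep the example short.

```python
import time

class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""

def call_with_retries(operation, attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Run operation, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay_s * (2 ** attempt))

class PaymentHandler:
    """Applies each idempotency key at most once, even if it is delivered twice."""

    def __init__(self):
        self._results = {}  # in-memory stand-in for a durable store

    def handle(self, idempotency_key, amount):
        if idempotency_key in self._results:
            # Duplicate delivery or client retry: replay the stored result.
            return self._results[idempotency_key]
        result = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = result
        return result

handler = PaymentHandler()
first = call_with_retries(lambda: handler.handle("key-1", 50))
second = call_with_retries(lambda: handler.handle("key-1", 50))
print(first is second)  # the second call replays the first result
```

Retries are only safe because the handler is idempotent. Without the key check, every retry after an ambiguous failure could charge the customer again.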

What I Tell Engineers Who Want to “Do Microservices Right”

When engineers ask me how to properly adopt microservices, I usually do not start with technology.

I start with questions.

Do you understand your domain well enough to split it safely?

Can your team operate multiple services independently?

Do you have visibility into what happens in production today?

Can you trace a single request across your system without guessing?

Can you recover quickly when something fails?

If the answer to most of these questions is no, then microservices will not solve the problem.

They will simply expose it.

Architecture Is a Reflection of Discipline

One thing I have come to believe strongly is that architecture does not create discipline.

It reveals it.

A well-disciplined team can build a stable monolith.

A well-disciplined team can operate microservices at scale.

An undisciplined team can struggle with both.

The difference is not the architecture.

The difference is the engineering mindset behind it.

The Real Evolution

If I step back and look at the journey we have walked through in this article, it is not really a story about monoliths or microservices.

It is a story about maturity.

We started with a simple system.

We introduced boundaries.

We handled communication.

We prepared for failure.

We added observability.

We discussed production realities.

And finally, we reflected on when not to adopt complexity at all.

That is the real evolution.

Not from monolith to microservices.

But from assumptions to understanding.

Final Thought

If I had to summarize everything in one idea, it would be this:

A great system is not one that avoids failure.

A great system is one that understands failure, contains it, observes it, and recovers from it quickly.

If microservices help you achieve that, use them.

If they do not, you are not obligated to use them.

Because in the end, the goal was never microservices.

The goal was always the same:

To build systems that survive production.
