
Raul Junco on Why Boring Systems Scale Better

System Design

A deeper take on Raul Junco's viral LinkedIn post: single-writer domains, protected hot paths, and designing for failure by default.

LinkedIn content · viral posts · content strategy · system design · software architecture · scalability · reliability engineering · distributed systems · social media marketing

Raul Junco recently shared something that caught my attention: "Boring systems scale better." He followed it with "3 lessons I learned the hard way" and then laid out a blueprint that feels almost too simple in a world that rewards novelty.

What I like about Raul's framing is that it does not romanticize complexity. It points to a practical reality: the systems that survive growth are usually the ones that stay readable under stress, debuggable at 3am, and predictable when traffic, teams, and dependencies multiply.

In this post, I want to expand on Raul Junco's three lessons as if we were in a design review together: one writer-many readers, protect the hot paths, and make failure the default path. These ideas are "boring" on purpose, and that is exactly why they work.

The hidden cost of "exciting" architecture

A lot of architectural complexity enters with good intentions:

  • We want teams to move independently.
  • We want to avoid bottlenecks.
  • We want to ship faster.

But excitement often looks like this in production:

  • Multiple services writing to the same database tables or shared events without a clear owner.
  • Critical user journeys spread across too many hops.
  • External calls assumed to be reliable until they are not.

Raul Junco's post is basically a checklist for avoiding those traps before they turn into outages and organizational gridlock.

1) One Writer, Many Readers (and one source of truth)

Raul Junco wrote: "Try to go for: One Writer. Many Readers." Then he warned: "The fastest way to chaos is to have multiple services writing the same truth."

That line should be printed and taped above every microservices diagram.

Why multiple writers create chaos

When multiple services can mutate the same business facts (orders, subscriptions, balances, inventory), you quickly lose the ability to answer questions like:

  • Which service is responsible for correctness?
  • Where do we enforce constraints?
  • How do we replay or audit changes?

Two writers means two competing interpretations of "truth," and the system becomes a negotiation instead of a ledger.

What "one writer per domain" looks like

Raul also emphasized operational ownership:

  • One team owns migrations
  • One place for constraints
  • One log to replay

In practice, this can mean:

  • A single service owns writes for a domain aggregate (for example, the Orders service owns the order state machine).
  • Other services do not directly write that state. They request changes via API calls, commands, or events.
  • The owning service enforces invariants (unique constraints, state transitions, idempotency, and validation).

"Writes stay where idempotency lives. Reads scale off replicas and caches."

That last sentence is the scaling unlock. Writes should be correct and controlled. Reads should be cheap and massively scalable.
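As a minimal sketch of the single-writer idea (all names hypothetical), here is an Orders writer that owns its state machine, enforces legal transitions, keeps an append-only log to replay, and deduplicates commands by idempotency key:

```python
# Hypothetical sketch: a single-writer Orders service. Other services send
# commands; only this class mutates state, enforces transitions, and
# deduplicates retries by idempotency key.

ALLOWED = {
    "created": {"paid", "cancelled"},
    "paid": {"shipped", "refunded"},
    "shipped": set(),
}

class OrderWriter:
    def __init__(self):
        self.state = {}   # order_id -> status (the one source of truth)
        self.seen = set() # processed idempotency keys
        self.log = []     # append-only log to replay or audit

    def apply(self, order_id, target, idempotency_key):
        if idempotency_key in self.seen:
            return self.state[order_id]  # duplicate command: safe no-op
        current = self.state.get(order_id, "created")
        if target not in ALLOWED[current]:
            raise ValueError(f"illegal transition {current} -> {target}")
        self.state[order_id] = target
        self.seen.add(idempotency_key)
        self.log.append((order_id, current, target))
        return target
```

Because retries are no-ops and illegal transitions are rejected in one place, "where do we enforce constraints?" has exactly one answer.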

Practical patterns that keep this boring and effective

  • Read models and materialized views: Let downstream services build their own query-optimized views from events, but do not let them write back to the source of truth.
  • Database replicas: Offload heavy reads to replicas when consistency requirements allow.
  • Caches with explicit freshness: Cache what is safe, measure hit rates, and define acceptable staleness.

If you do this well, you get autonomy where it matters (read scaling, analytics, search) without giving up correctness where it hurts most (writes).
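To make "caches with explicit freshness" concrete, here is a hypothetical read-through cache where staleness is a declared number and hit rates are measured. Reads copy from the source of truth; they never write back to it:

```python
import time

# Hypothetical sketch: a read-through cache with an explicit freshness
# budget. Staleness is a declared parameter, not an accident, and hit/miss
# counters make the hit rate measurable.

class FreshCache:
    def __init__(self, load_fn, max_age_s):
        self.load_fn = load_fn        # reads from the source of truth
        self.max_age_s = max_age_s    # acceptable staleness, in seconds
        self.entries = {}             # key -> (value, fetched_at)
        self.hits = 0
        self.misses = 0

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self.entries.get(key)
        if entry and now - entry[1] <= self.max_age_s:
            self.hits += 1
            return entry[0]           # fresh enough: serve from cache
        self.misses += 1
        value = self.load_fn(key)     # expired or missing: reload
        self.entries[key] = (value, now)
        return value
```

The `now` parameter is there so freshness behavior can be tested deterministically, which is itself a boring-but-useful habit.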

2) Protect the Hot Paths (name them, budget them, guard them)

Raul Junco said: "Protect the Hot Paths. Most traffic flows through a few routes. Name them. Budget them. Guard them."

This is one of the most underused practices in system design. Teams often optimize what is loud (lots of endpoints, lots of dashboards) instead of what is financially or operationally critical.

What counts as a hot path?

Raul offered great prompts:

  • "Which path touches every dollar?"
  • "What dies if it slows 100ms?"
  • "What chain shows up in every trace?"

For many businesses, the hot paths are not the same as the highest-QPS endpoints. A lower-traffic checkout flow might matter more than a high-traffic feed endpoint because it is revenue-critical and latency-sensitive.

Turn hot paths into explicit contracts

Once you identify them, treat them like product requirements:

  • Latency SLOs (p95 and p99 targets)
  • Availability targets
  • Dependency budgets (max number of network calls)
  • Capacity assumptions and load-test scenarios

Raul's additional rules are a strong starting point:

  • Fewer hops
  • Proven caches only
  • Budgets per hop
  • Alerts on p95/p99

I would add one more: make it easy to see the hot path in a trace. If a new dependency sneaks in, you should spot it immediately.
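One way to make those rules executable (hop names and budget numbers here are hypothetical) is to encode the hot-path contract as data and check recorded traces against it, so an unbudgeted hop or a blown per-hop budget shows up immediately:

```python
# Hypothetical sketch: the hot-path contract as data. A recorded trace is
# checked hop by hop; unknown dependencies and blown budgets both fail.

CHECKOUT_CONTRACT = {
    "gateway": 20,    # per-hop latency budget, in milliseconds
    "checkout": 50,
    "fraud": 80,
    "payments": 150,
}

def check_trace(trace, contract=CHECKOUT_CONTRACT):
    """trace: list of (hop_name, latency_ms). Returns a list of violations."""
    violations = []
    for hop, latency_ms in trace:
        budget = contract.get(hop)
        if budget is None:
            violations.append(f"unbudgeted hop: {hop}")
        elif latency_ms > budget:
            violations.append(f"{hop} over budget: {latency_ms}ms > {budget}ms")
    return violations
```

Run this in CI against load-test traces, or wire it into alerting, and a new dependency sneaking onto the hot path becomes a failing check instead of a surprise.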

A concrete example

Imagine a payment authorization flow:

  1. API gateway
  2. Checkout service
  3. Fraud scoring
  4. Payment processor
  5. Order writer
  6. Notification

If you apply Raul's "fewer hops" principle, you might move notification off the synchronous path, cache fraud features safely, and enforce timeouts and fallbacks so the user does not wait on non-critical work. You keep the path boring by keeping it short.
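A hypothetical sketch of that shortened flow: the fraud call degrades to a cached score on timeout, and notification is enqueued for a background worker instead of being awaited, so the user never waits on non-critical work:

```python
import queue

# Hypothetical sketch of the authorization flow above. Notification is
# enqueued, not awaited; fraud scoring falls back to a cached score if the
# live call times out, so the purchase does not fail on non-critical work.

notifications = queue.Queue()  # drained by a background worker, off the hot path

def authorize(order, fraud_call, cached_score):
    try:
        score = fraud_call(order)  # a tight timeout lives inside this call
    except TimeoutError:
        score = cached_score       # degrade gracefully, don't block checkout
    if score > 0.9:
        return "declined"
    notifications.put(("order_authorized", order))  # async, non-blocking
    return "authorized"
```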

3) Make Failure the Default Path (design like dependencies are already down)

Raul Junco wrote: "Make Failure the Default Path." And he nailed the mindset: "The fastest way to a 3am page is assuming everything works."

This is the difference between systems that degrade gracefully and systems that fall off a cliff.

What "failure by default" actually means

It does not mean you expect everything to be broken all the time. It means every critical path is designed with explicit answers to:

  • What happens when this dependency dies?
  • Can we serve stale and still be useful?
  • What's the blast radius of one bad deploy?

If you cannot answer those quickly, you are relying on hope.

The boring reliability toolkit

Raul listed the essentials:

  • Circuit breakers on every external call
  • "90% Stale cache > no response"
  • Timeouts shorter than you think
  • Fallbacks that return value, not errors

Each one is worth expanding:

  • Circuit breakers: Stop hammering a failing dependency and protect your own thread pools and queues.
  • Stale cache: If you can serve slightly outdated data, you preserve user experience and reduce load during incidents.
  • Aggressive timeouts: A slow dependency can be as damaging as a down one. Tight timeouts prevent tail latency from cascading.
  • Value-returning fallbacks: Instead of propagating an error, return something useful: last known state, partial results, or a queued operation with clear messaging.
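These pieces compose naturally. As a minimal sketch (thresholds and names hypothetical), here is a circuit breaker that opens after repeated failures and serves a fallback, such as a stale cache read, while open:

```python
import time

# Hypothetical sketch: a minimal circuit breaker. After `threshold`
# consecutive failures it opens and short-circuits to the fallback,
# protecting our own thread pools instead of hammering a dead dependency.

class CircuitBreaker:
    def __init__(self, threshold=3, reset_after_s=30.0):
        self.threshold = threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_after_s:
                return fallback()      # open: skip the dependency entirely
            self.opened_at = None      # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now   # trip the breaker
            return fallback()          # return value, not an error
```

Note that the fallback is value-returning: a stale cache read or last known state, exactly in the "stale cache > no response" spirit.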

Make the blast radius small on purpose

Failure defaults also show up in deployment strategy:

  • Feature flags for risky changes
  • Gradual rollouts (canary, percentage-based)
  • Safe schema changes (expand-contract)
  • Bulkheads (separate pools for critical and non-critical work)
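For the percentage-based piece, a common trick (sketched here with hypothetical flag names) is to hash each user into a stable bucket, so the same user always sees the same variant and the blast radius of a bad change is capped at the rollout percentage:

```python
import hashlib

# Hypothetical sketch: deterministic percentage rollout. Hashing the flag
# name together with the user id gives a stable bucket per user, so ramping
# from 1% to 100% only ever adds users, never flips existing ones back.

def in_rollout(flag, user_id, percent):
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < percent
```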

When you combine these with good observability, you create the kind of uptime Raul described:

"Uptime isn't about preventing failure. It's about making failure boring too."

Why "boring" is a competitive advantage

Boring systems:

  • Reduce cognitive load for engineers.
  • Make incident response faster.
  • Allow teams to change parts without unpredictable side effects.
  • Scale with organization size, not just traffic.

The irony is that boring enables speed. When your core workflows are simple, you can innovate at the edges without risking the heart of the business.

A quick checklist you can apply this week

If you want to operationalize Raul Junco's lessons, here are a few prompts to bring to your next architecture review:

  • Do we have exactly one writer for each core domain entity?
  • Are migrations and constraints owned by that same domain boundary?
  • What are our top 3 hot paths, and what are their p95 and p99 targets?
  • How many network hops are on those paths, and which ones can be removed?
  • For every external call on a hot path: do we have a timeout, circuit breaker, and fallback?
  • Can we serve stale data for at least one critical read scenario?

If any of these answers are unclear, that is a signal to simplify.


This blog post expands on a viral LinkedIn post by Raul Junco, Simplifying System Design. View the original LinkedIn post →