Scaling is mostly boring. The exciting stories — the all-night war rooms, the heroic refactors, the weekend migrations — are usually evidence that something obvious was missed six months earlier. The best scaling work is invisible because it was planned for.

We've taken half a dozen products from zero to six-figure user counts. The decisions that moved the needle were almost never the sexy ones. Here's the playbook, stage by stage, with the mistakes we've made so you don't have to.

0–1k users: pick boring infrastructure

At this stage, every hour you spend on infrastructure is an hour not spent on the product. Your users don't exist yet. Your job is to find them, not to build a platform for them.

The stack we recommend at this stage has not changed in three years:

  • Postgres on a managed host (Vercel, Supabase, Neon, RDS — pick one and move on).
  • One application runtime, one region, one deploy target.
  • Cloudflare or Vercel in front of everything.
  • Zero custom infrastructure. No Kafka, no Kubernetes, no microservices.

If you're tempted to "architect for scale" at this stage, you're solving the wrong problem. Ship the product. You can always rewrite the backend. You can't always get another launch window.

1k–10k users: cache aggressively

This is the stage where most teams first feel pain. Pages get slower. Database queries climb. The instinct is to "scale up" — bigger database, more replicas. That instinct is usually wrong.

Eighty percent of your traffic is reading the same thing. Cache it.

What to cache, in order

  1. Edge caching of marketing and content pages. ISR in Next.js, plain HTML caching behind a CDN — either works. Most of your anonymous traffic should never reach your origin.
  2. Per-user dashboards with short TTLs. A 30-second cache on an expensive dashboard query is the difference between a working product and an outage.
  3. Session and auth state in Redis. Don't hit Postgres for every request just to check who someone is.
  4. Rate-limit counters in Redis. Critical for any public API; there's a sketch just after this list.
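
Items 3 and 4 are only a few lines once a Redis client is in place. Here is a minimal sketch of the rate-limit counter with ioredis; the key prefix, limit, and window size are placeholders, and the REDIS_URL environment variable is an assumption:

```ts
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL!); // assumes REDIS_URL points at your Redis

// Fixed-window limiter: allow `limit` requests per `windowSeconds` for a given key.
export async function allowRequest(
  apiKey: string,
  limit = 100,
  windowSeconds = 60
): Promise<boolean> {
  const window = Math.floor(Date.now() / 1000 / windowSeconds);
  const key = `ratelimit:${apiKey}:${window}`;

  const count = await redis.incr(key);
  if (count === 1) {
    // First hit in this window: set an expiry so old counters clean themselves up.
    await redis.expire(key, windowSeconds);
  }
  return count <= limit;
}
```

Call it at the top of every public API handler and return a 429 when it comes back false. A sliding-window or token-bucket limiter is fairer at the window boundaries, but this version is ten lines and good enough to stop the obvious abuse.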

If you add nothing else at this stage, add a Redis cache in front of your three slowest queries. The ROI is embarrassing.
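
A cache-aside helper keeps the change small at the call sites. A sketch, again with ioredis; the 30-second TTL matches the dashboard example above, and getDashboardStats is a stand-in for whatever your slow query actually is:

```ts
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL!);

// Cache-aside: try Redis, fall back to the real query, then store the result with a TTL.
export async function cached<T>(
  key: string,
  ttlSeconds: number,
  compute: () => Promise<T>
): Promise<T> {
  const hit = await redis.get(key);
  if (hit !== null) return JSON.parse(hit) as T;

  const value = await compute();
  await redis.setex(key, ttlSeconds, JSON.stringify(value)); // SET with an expiry
  return value;
}

// Stand-in for one of your three slowest queries.
async function getDashboardStats(userId: string): Promise<Record<string, number>> {
  return { activeProjects: 0, openInvoices: 0 }; // ...expensive joins in real life
}

// 30 seconds of staleness per user, in exchange for not running the query on every load.
export function dashboardStats(userId: string) {
  return cached(`dashboard:${userId}`, 30, () => getDashboardStats(userId));
}
```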

10k–100k users: split the read path

Now the database is the bottleneck. Even with caching, you're doing enough reads that a single Postgres instance is working hard. Time to split.

Read replicas, used carefully

Add one or two read replicas. Route analytics queries, full-text search, and anything read-only to them. Keep writes going to the primary. Accept that replicas are seconds behind — most reads don't care.

The trap: serving stale reads to flows that need to see their own writes immediately. Mark your user-facing flows explicitly: do they tolerate replica lag or not? Document it. Enforce it at the ORM layer if you can.
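
If your ORM doesn't support replica routing, even two named connection pools and a convention will do the job. A sketch with node-postgres; the PRIMARY_DATABASE_URL and REPLICA_DATABASE_URL environment variables are assumptions:

```ts
import { Pool } from "pg";

const primary = new Pool({ connectionString: process.env.PRIMARY_DATABASE_URL });
const replica = new Pool({ connectionString: process.env.REPLICA_DATABASE_URL });

// Writes always go to the primary.
export function write(sql: string, params: unknown[] = []) {
  return primary.query(sql, params);
}

// Reads that must see their own writes (checkout, settings, onboarding) also hit the primary.
export function readFresh(sql: string, params: unknown[] = []) {
  return primary.query(sql, params);
}

// Reads that tolerate seconds of lag (analytics, search, dashboards) go to the replica.
export function readLagged(sql: string, params: unknown[] = []) {
  return replica.query(sql, params);
}
```

The point is not the three-line wrappers; it's that every call site now states, in its function name, whether it tolerates lag.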

Queue non-critical writes

Email sends, webhook deliveries, image processing, analytics events — none of these need to happen synchronously with a user request. Push them into a queue (BullMQ, Trigger.dev, Inngest) and let a background worker handle them. Your request latency drops. Your error surface shrinks.
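
A sketch of the email case with BullMQ; the queue name, job shape, and the sendEmail stub are placeholders, and the Redis connection details are assumptions:

```ts
import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 }; // assumes Redis runs here

export const emailQueue = new Queue("email", { connection });

// In the request handler: enqueue and return immediately.
export async function queueWelcomeEmail(userId: string, address: string) {
  await emailQueue.add("welcome", { userId, address }, { attempts: 3 });
}

// Stand-in for whatever mailer you actually use.
async function sendEmail(to: string, subject: string, body: string) {
  console.log(`sending "${subject}" to ${to}: ${body}`);
}

// In a separate worker process: do the slow part off the request path, with retries.
new Worker(
  "email",
  async (job) => {
    const { address, userId } = job.data as { address: string; userId: string };
    await sendEmail(address, "Welcome!", `Thanks for signing up, ${userId}.`);
  },
  { connection }
);
```

Keep the worker in its own process (or container) so a slow email provider can't back up your web tier.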

Get an APM the day you cross 10k

Flying blind at this scale is the most expensive mistake you can make. Sentry for errors, a real APM (Datadog, New Relic, or Grafana Cloud) for traces. You will find things you did not know were slow. You will save weeks.
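
The error half is a few minutes of setup. A minimal sketch with @sentry/node; the sample rate is a starting point, not a recommendation:

```ts
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  // Trace a fraction of requests: enough to see real latency, cheap enough to leave on.
  tracesSampleRate: 0.1,
});
```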

The decision that matters most

Hire one person whose only job is reliability before you need them. Not a DevOps consultant, not a part-time SRE. One full-time engineer who owns on-call, observability, and capacity planning.

Every team we've seen burn out or lose trust with their users skipped this hire. They tried to spread on-call across five product engineers who already had full queues, and watched quality collapse over twelve months. The hire feels premature at 10k users. It is exactly correct.

A closing thought

Scale is a product of discipline, not cleverness. The teams that do it well make boring, unglamorous decisions early and keep making them. The teams that struggle are the ones who skipped those decisions because their product was special. It wasn't.