Use case · SaaS MVP to scale

From an MVP that works on Tuesday to a SaaS that survives Monday at 9 AM

Six audits, six fixes, in order. Auth and RLS hardened so a tenant cannot read another tenant's data. Rate limiting wired to every endpoint a customer can hit. A Redis cache layer that knows the difference between a stale price and a stale token. Stripe billing that handles failed cards, proration, and trial-to-paid. Observability that surfaces the next outage before the user emails support. A runbook the on-call person reads at 3 AM.

The problem

The MVP that won the first ten customers will lose the next hundred

The pattern is familiar. The founder shipped fast. Auth was one Supabase table, RLS was off because the dashboard worked, Stripe was a single `customer.subscription.created` webhook, the cache was either everything-in-memory or nothing, and every endpoint trusted the user. The product found ten paying customers. Then two things happened at once: someone tried a SQL-injection-shaped query against the public API, and a Friday-afternoon press mention drove 4,000 sign-ups in an hour. The MVP went down for a weekend and the founder shipped a hotfix branch that now powers production. We take that codebase, audit the six surfaces that decide whether a SaaS scales, fix them in order, and leave the team with a runbook that explains how to handle the next incident without paging the founder. The codebase ends the engagement boring; that is the goal.

Our approach

Six audits, six fixes, in order

Auth before billing. Billing before rate limiting. Rate limiting before caching. Caching before observability. Observability before the runbook. Skipping a step means the next step inherits the problem you skipped.

  1. Auth and tenant isolation

    We start by reading every Supabase table and every API route. RLS gets enabled on every tenant-scoped table. Every route either uses the user's JWT or has a documented reason for using `service_role`. We add a tenant-isolation test that signs in as user A, attempts to fetch user B's data, and fails the build when the response is non-empty. The first cutover ships when that test is green; even if every other surface is still broken, the system at least cannot leak data.

  2. Billing maturity

    Stripe gets rewired. The handler covers `invoice.payment_failed`, `customer.subscription.updated`, `invoice.upcoming`, and `customer.deleted`, not only `customer.subscription.created`. The billing portal becomes a link the customer can use; the support inbox stops being a refund desk. Proration, trial-to-paid transitions, and dunning are tested against a Stripe test clock, not against real customer money. Every billing event lands in an `events` table keyed by the Stripe event ID, so a replayed webhook is not applied twice.

  3. Rate limiting and abuse control

    Every public endpoint gets a rate limit. The default is per-IP and per-user, sliding window, Redis-backed. Auth endpoints get a stricter limit (5 requests per minute, then a 15-minute ban). Write endpoints get a soft limit per tenant. The 429 responses include a `Retry-After` header and a structured error the client can read. We document the limits in the runbook so the on-call person knows which knob to turn during a spike.

  4. Caching strategy

    Caches get a tier and a TTL written down. Hot reads (pricing, feature flags, public catalog) go to Redis with stale-while-revalidate. Tenant reads (dashboard, settings) go to TanStack Query with a per-user key and a five-minute `staleTime`. Mutations invalidate explicitly; nothing relies on cache TTL alone to refresh. Every Redis key has a typed config entry; no `redis.keys('*')` anywhere in the codebase.

  5. Observability

    Logs go from `console.log` to structured JSON. Error reporting goes to Sentry with release tagging and source maps. Latency gets exported to a metrics endpoint (Vercel Analytics, Datadog, or Posthog, depending on the stack). The four golden signals (latency, traffic, errors, saturation) each have a dashboard the founder bookmarks. Alerts fire on SLO burn rate, not on raw error counts; the on-call person sleeps through transient blips.

  6. Runbook and incident posture

    We write a runbook that covers the three incidents the team will hit in the first month: a Stripe webhook backlog, a Supabase connection-pool exhaustion, and a rate-limit-induced spike of 429s during a marketing push. Each one gets a one-page entry with the symptom, the dashboard to open, the command to run, and the rollback. The founder is not on call; the person on call has the runbook and the access.

What we deliver

RLS audit and remediation

A table-by-table review of Row Level Security policies, with a remediation plan for the tables that need policies tightened or added. The deliverable that decides whether a hostile customer can read a competitor's data; ships before anything else.

Tenant isolation test suite

A test suite that signs in as user A, attempts to access user B's resources across every API route, and fails on any non-empty response. Runs in CI on every PR. Catches the day a future developer ships an endpoint that forgets `where tenant_id = $1`.

Stripe webhook handler

A typed handler for the seven Stripe events that matter for a SaaS (`customer.subscription.created/updated/deleted`, `invoice.paid/payment_failed/upcoming`, `customer.deleted`). Idempotent, replayable, with signature verification on every call. Replaces the single-event handler the MVP shipped with.
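
The "idempotent, replayable" property can be shown in isolation. The sketch below is a minimal in-memory model, not the production handler: `EventLedger`, `processOnce`, and `BillingEvent` are hypothetical names, and a `Set` stands in for a database table with a unique constraint on the event ID. It shows two rules: a second delivery of the same event is a no-op, and a failed handler gives its claim back so a retry can still be processed.

```typescript
// Minimal stand-in for the event log. A real one would be a database table
// with a unique constraint on the event ID.
class EventLedger {
  private readonly claimed = new Set<string>()

  // Returns true only for the first caller to claim this event ID.
  claim(eventId: string): boolean {
    if (this.claimed.has(eventId)) return false
    this.claimed.add(eventId)
    return true
  }

  // Gives the ID back so a retry of a failed event can be processed.
  release(eventId: string): void {
    this.claimed.delete(eventId)
  }
}

interface BillingEvent {
  id: string
  type: string
}

type Outcome = 'processed' | 'duplicate'

async function processOnce(
  ledger: EventLedger,
  event: BillingEvent,
  handler: (event: BillingEvent) => Promise<void>,
): Promise<Outcome> {
  // A replayed event never reaches the handler.
  if (!ledger.claim(event.id)) return 'duplicate'
  try {
    await handler(event)
    return 'processed'
  } catch (err) {
    // Without this release, the retry would be skipped as a duplicate.
    ledger.release(event.id)
    throw err
  }
}
```

The release step is the part that is easy to miss: recording the event before the handler runs makes replays safe, but it also means a crash between the two steps has to undo the record.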

Billing test plan

A test plan against a Stripe test clock that walks the seven scenarios a SaaS billing engine has to handle (trial expiry, card decline, plan upgrade with proration, plan downgrade, manual cancellation, refund, customer deletion). The plan that the QA engineer runs before every billing change.

Rate-limit middleware

A Next.js middleware that wraps every public route with a sliding-window rate limit, Redis-backed, configurable per route. Returns 429 with `Retry-After` and a structured body. Logs every block to the events table for later review.
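
To make the sliding-window behaviour concrete, here is a minimal in-memory sketch. It is not the Redis-backed limiter the middleware uses; it keeps a log of recent timestamps per key, which is the simplest correct form of the technique. `SlidingWindowLimiter`, `check`, and `LimitResult` are hypothetical names chosen for illustration.

```typescript
interface LimitResult {
  allowed: boolean
  retryAfterSeconds: number // 0 when the request is allowed
}

class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>()
  private readonly limit: number
  private readonly windowMs: number

  constructor(limit: number, windowMs: number) {
    this.limit = limit
    this.windowMs = windowMs
  }

  check(key: string, nowMs: number): LimitResult {
    const windowStart = nowMs - this.windowMs
    // Keep only the timestamps that still fall inside the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > windowStart)

    if (recent.length >= this.limit) {
      this.hits.set(key, recent)
      // The oldest hit leaves the window first; that is when a slot frees up.
      const oldest = recent[0] ?? nowMs
      const retryAfterMs = oldest + this.windowMs - nowMs
      return { allowed: false, retryAfterSeconds: Math.max(1, Math.ceil(retryAfterMs / 1000)) }
    }

    recent.push(nowMs)
    this.hits.set(key, recent)
    return { allowed: true, retryAfterSeconds: 0 }
  }
}
```

With a limit of 5 per 60 seconds, five requests in the first four seconds are allowed, a sixth at the ten-second mark is refused with a retry hint of 50 seconds, and a request just after the first hit ages out is allowed again. The retry hint is what a `Retry-After` header would carry.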

Redis key configuration

A typed config file enumerating every Redis key, its TTL, and what it stores. The single source of truth for cache invalidation. Eliminates `redis.keys('*')` and the bugs that come with it.
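
One possible shape for such a file is sketched below. Everything in it is an assumption made for illustration: the `CacheKeySpec` interface, the `defineKey` helper, the key names, and the TTL values are hypothetical, not the engagement's actual configuration.

```typescript
interface CacheKeySpec<Args extends unknown[]> {
  build: (...args: Args) => string // the only place the key string is assembled
  ttlSeconds: number
  staleWhileRevalidateSeconds: number
  stores: string // human-readable description of the cached value
}

// Identity helper: lets TypeScript infer the argument list of each key builder.
function defineKey<Args extends unknown[]>(spec: CacheKeySpec<Args>): CacheKeySpec<Args> {
  return spec
}

// Hypothetical entries; names and durations are placeholders.
const CACHE_KEYS = {
  pricing: defineKey({
    build: () => 'pricing:v1',
    ttlSeconds: 300,
    staleWhileRevalidateSeconds: 3600,
    stores: 'Public pricing table',
  }),
  featureFlags: defineKey({
    build: (tenantId: string) => `flags:v1:${tenantId}`,
    ttlSeconds: 60,
    staleWhileRevalidateSeconds: 300,
    stores: 'Resolved feature flags for one tenant',
  }),
}

// Redis expiry covers both windows, so a stale entry is still readable.
function redisExpirySeconds(spec: { ttlSeconds: number; staleWhileRevalidateSeconds: number }): number {
  return spec.ttlSeconds + spec.staleWhileRevalidateSeconds
}
```

Because every key is built through its entry, invalidating a tenant's flags is `CACHE_KEYS.featureFlags.build(tenantId)` rather than a pattern scan, and changing a key shape means bumping the `v1` segment in one place.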

TanStack Query setup

A baseline TanStack Query configuration with `staleTime`, `gcTime`, retry policy, and a query-key factory. Replaces ad-hoc `useEffect` + `fetch` patterns across the codebase. Mutations include optimistic updates where the result is predictable.
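
A query-key factory is easiest to judge from an example. The sketch below assumes a `projects` resource and a per-user key; `projectKeys`, `queryDefaults`, and `hasPrefix` are hypothetical names, and the `gcTime` and `retry` values are placeholders. The defaults object uses the option names TanStack Query accepts under `defaultOptions.queries`, but the sketch deliberately does not import the library.

```typescript
type QueryKey = readonly unknown[]

// Baseline defaults, in the shape accepted under `defaultOptions.queries`.
const queryDefaults = {
  staleTime: 5 * 60 * 1000, // five minutes, in milliseconds
  gcTime: 30 * 60 * 1000, // assumption: keep unused data for 30 minutes
  retry: 2, // assumption: two retries before surfacing the error
}

// Every key for the resource is derived from `all`, so keys nest by construction.
const projectKeys = {
  all: (userId: string): QueryKey => ['projects', userId],
  lists: (userId: string): QueryKey => [...projectKeys.all(userId), 'list'],
  list: (userId: string, filters: { status?: string }): QueryKey => [
    ...projectKeys.lists(userId),
    filters,
  ],
  detail: (userId: string, projectId: string): QueryKey => [
    ...projectKeys.all(userId),
    'detail',
    projectId,
  ],
}

// Simplified prefix check: true when `key` starts with every element of `prefix`.
// It illustrates why invalidating `all(userId)` reaches lists and details alike.
function hasPrefix(key: QueryKey, prefix: QueryKey): boolean {
  return prefix.every((part, i) => JSON.stringify(key[i]) === JSON.stringify(part))
}
```

A mutation that renames a project can then invalidate `projectKeys.all(userId)` and know that every list and detail query for that user is covered, while another user's cache entries are untouched.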

Observability stack

Structured logging via Pino, error reporting via Sentry, metrics via the platform of choice. Each piece configured with release tagging, source maps, and PII filtering. The first dashboard ships with the four golden signals; the rest grow with the product.

SLO definitions and burn-rate alerts

Two SLOs to start: API latency P95 under 300 ms over a rolling 7-day window, and error rate under 0.5% over the same window. Burn-rate alerts (fast and slow) page only when the budget is at risk, not on every spike. Pageability gets earned.
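
The arithmetic behind a burn-rate alert is short enough to write down. The sketch below assumes the 0.5% error budget and 7-day window stated above; the function names and the paging threshold are hypothetical, and real thresholds depend on how much budget the team is willing to lose before someone is woken up.

```typescript
interface Slo {
  errorBudget: number // allowed error fraction, e.g. 0.005 for 0.5%
  windowHours: number // SLO window length, e.g. 168 for 7 days
}

// Burn rate 1 means the budget runs out exactly at the end of the window.
function burnRate(observedErrorRate: number, slo: Slo): number {
  return observedErrorRate / slo.errorBudget
}

// Fraction of the whole budget consumed if this burn rate holds for the alert window.
function budgetConsumed(rate: number, alertWindowHours: number, slo: Slo): number {
  return rate * (alertWindowHours / slo.windowHours)
}

// Page only when both a short and a long window agree; a blip moves the short
// window but not the long one. The threshold is an assumption, not a standard.
function shouldPage(shortWindowRate: number, longWindowRate: number, threshold: number): boolean {
  return shortWindowRate >= threshold && longWindowRate >= threshold
}
```

As a worked case: a 5% error rate against a 0.5% budget is a burn rate of about 10, and one hour at that rate consumes about 10/168, or roughly 6%, of the weekly budget. That is the kind of number a fast-burn alert exists to catch, while a two-minute spike that never moves the long window does not page anyone.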

Incident runbook

A 12-page runbook covering the three most likely first-month incidents (webhook backlog, connection-pool exhaustion, 429 spike) plus the generic "site is slow" and "site is down" entries. Each entry has a symptom, a dashboard, a command, and a rollback. The on-call person reads it; nothing more.

Deployment and rollback procedure

A documented deployment flow with a one-command rollback. Includes the cache invalidation steps after a deploy that changes a Redis key shape, and the database migration checklist for any schema change. Replaces the founder's mental model of `git push and pray`.

Capacity baseline

A simple load test against staging with three traffic levels (steady state, peak, abuse). The baseline shows where the system breaks and at what concurrency. Re-run quarterly; the chart over time tells the founder when the next round of scale-up work is due.
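
Reading the baseline comes down to one comparison per traffic level. The sketch below is an illustration under assumptions: the `LoadLevel` shape and function names are hypothetical, the percentile uses the nearest-rank method, and the latency target would be whatever the team's SLO says.

```typescript
interface LoadLevel {
  name: string
  concurrency: number
  latenciesMs: number[]
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('percentile needs at least one sample')
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100))
  return sorted[rank - 1] as number
}

// First level, in the order given, whose P95 exceeds the latency target; null if none does.
function firstBreakingLevel(levels: LoadLevel[], targetMs: number): LoadLevel | null {
  for (const level of levels) {
    if (percentile(level.latenciesMs, 95) > targetMs) return level
  }
  return null
}
```

Passing the levels in ascending concurrency makes the result the point where the system first breaks; tracking that concurrency run over run gives the chart the quarterly review needs.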

Five files that take a SaaS from MVP to scale

The five files below compose the scale-up pipeline: the tenant-isolation test that proves data does not leak, the Stripe webhook handler with idempotency, the rate-limit middleware, the cache wrapper with stale-while-revalidate, and the observability bootstrap that wires logs, errors, and metrics.

A SaaS scale-up is six audits and six fixes in a fixed order. The order matters because the surfaces compound: rate limiting on top of broken RLS still leaks data; caching on top of broken billing still loses money; observability on top of broken caching still tells you nothing useful. The work is mechanical once the order is set; the value is in the order.

The five files below are the scaffolding the engagement leaves behind. Each file is small. Each file is the one place the team edits when the system has to change.

1. The tenant-isolation test

The test signs in as user A, attempts to read user B's resources across every API route, and fails the build on any non-empty response. We seed two users at the start of the test and tear them down at the end; the test is hermetic. Runs on every PR.

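// tests/integration/tenant-isolation.spec.ts
import { test, expect } from '@playwright/test'
import { createClient } from '@supabase/supabase-js'

// ':id' is replaced at run time with the ID of a project that belongs to user B.
const ROUTES_THAT_RETURN_TENANT_DATA = [
  { method: 'GET', path: '/api/projects' },
  { method: 'GET', path: '/api/projects/:id' },
  { method: 'GET', path: '/api/invoices' },
  { method: 'GET', path: '/api/team' },
]

const PASSWORD = 'tenant-isolation-test-pw'
const NO_SESSION = { auth: { persistSession: false, autoRefreshToken: false } }

test('user A cannot read user B data', async ({ request }) => {
  const url = process.env.SUPABASE_URL!
  // Admin client seeds and tears down. It never signs in, so it keeps service_role privileges.
  const admin = createClient(url, process.env.SUPABASE_SERVICE_ROLE!, NO_SESSION)
  // Separate client for user A's session. SUPABASE_ANON_KEY is an assumed env var name.
  const asUserA = createClient(url, process.env.SUPABASE_ANON_KEY!, NO_SESSION)

  // Unique, well-formed emails per run; email_confirm lets the user sign in immediately.
  const run = Date.now()
  const emailA = `tenant-a-${run}@example.com`
  const emailB = `tenant-b-${run}@example.com`
  const userA = await admin.auth.admin.createUser({ email: emailA, password: PASSWORD, email_confirm: true })
  const userB = await admin.auth.admin.createUser({ email: emailB, password: PASSWORD, email_confirm: true })

  try {
    const { data: signIn } = await asUserA.auth.signInWithPassword({ email: emailA, password: PASSWORD })
    const tokenA = signIn!.session!.access_token

    // Seed a project for user B and keep its ID for the ':id' route.
    const { data: projectB } = await admin
      .from('projects')
      .insert({ owner: userB.data.user!.id, name: 'B project' })
      .select('id')
      .single()

    for (const route of ROUTES_THAT_RETURN_TENANT_DATA) {
      // Relative paths resolve against `use.baseURL` in the Playwright config.
      const res = await request.fetch(route.path.replace(':id', String(projectB!.id)), {
        method: route.method,
        headers: { Authorization: `Bearer ${tokenA}` },
      })
      if (route.path.includes(':id')) {
        // Assumption: a single foreign resource is refused with 403 or 404.
        expect([403, 404], `route ${route.path} leaked tenant data`).toContain(res.status())
        continue
      }
      const body = await res.json()
      expect(body.data ?? body, `route ${route.path} leaked tenant data`).toEqual([])
    }
  } finally {
    // Tear down even when an assertion fails, so the test stays hermetic.
    await admin.from('projects').delete().eq('owner', userB.data.user!.id)
    await admin.auth.admin.deleteUser(userA.data.user!.id)
    await admin.auth.admin.deleteUser(userB.data.user!.id)
  }
})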

2. The Stripe webhook handler

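The handler is one route. It switches on event type and writes every event to an events table (stripe_events in the listing) with the Stripe event ID as the idempotency key. A replayed webhook is a no-op. A handler that throws releases its event row and returns a 500, so Stripe's retry is processed instead of being skipped as a duplicate. We never modify customer state without an event row.

// app/api/stripe/webhook/route.ts
import { headers } from 'next/headers'
import Stripe from 'stripe'
import { createClient } from '@/lib/supabase/server'
// Assumed location: the per-event handlers are not part of this listing.
import {
  handleSubscriptionChange,
  handleSubscriptionDeleted,
  handlePaymentFailed,
  handlePaymentSucceeded,
  handleInvoiceUpcoming,
  handleCustomerDeleted,
} from '@/lib/billing/handlers'

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!)

export async function POST(req: Request): Promise<Response> {
  const signature = (await headers()).get('stripe-signature') ?? ''
  // Signature verification needs the raw, unparsed body.
  const body = await req.text()

  let event: Stripe.Event
  try {
    event = stripe.webhooks.constructEvent(body, signature, process.env.STRIPE_WEBHOOK_SECRET!)
  } catch (err) {
    return new Response(`signature invalid: ${(err as Error).message}`, { status: 400 })
  }

  // A webhook carries no user session, so this client must be allowed to write to stripe_events.
  const supabase = createClient()

  // Idempotency: the unique constraint on id rejects an event that was already recorded.
  const { error: idempErr } = await supabase
    .from('stripe_events')
    .insert({ id: event.id, type: event.type, created: event.created })
  if (idempErr?.code === '23505') {
    return new Response('already processed', { status: 200 })
  }
  if (idempErr) {
    // The event could not be recorded; a 500 makes Stripe retry later.
    return new Response('event log unavailable', { status: 500 })
  }

  try {
    switch (event.type) {
      case 'customer.subscription.created':
      case 'customer.subscription.updated':
        await handleSubscriptionChange(supabase, event.data.object)
        break
      case 'customer.subscription.deleted':
        await handleSubscriptionDeleted(supabase, event.data.object)
        break
      case 'invoice.payment_failed':
        await handlePaymentFailed(supabase, event.data.object)
        break
      case 'invoice.paid':
        await handlePaymentSucceeded(supabase, event.data.object)
        break
      case 'invoice.upcoming':
        await handleInvoiceUpcoming(supabase, event.data.object)
        break
      case 'customer.deleted':
        await handleCustomerDeleted(supabase, event.data.object)
        break
      default:
        break // recorded, but no state change needed
    }
  } catch {
    // Release the event row so Stripe's retry is processed, not skipped as a duplicate.
    await supabase.from('stripe_events').delete().eq('id', event.id)
    return new Response('handler failed', { status: 500 })
  }

  return new Response('ok', { status: 200 })
}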

3. The rate-limit middleware

A sliding-window rate limiter on Upstash Redis. The listing keys each bucket on client IP; the per-user bucket uses the same limiter keyed on the authenticated user ID. The middleware runs on every public route. The limits live in a config file; raising one is a code review, not a redis-cli session.

// middleware.ts
import { NextResponse, type NextRequest } from 'next/server'
import { Ratelimit } from '@upstash/ratelimit'
import { Redis } from '@upstash/redis'

const redis = Redis.fromEnv()

const RATE_LIMITS = {
  default: new Ratelimit({ redis, limiter: Ratelimit.slidingWindow(60, '1 m'), prefix: 'rl:default' }),
  auth: new Ratelimit({ redis, limiter: Ratelimit.slidingWindow(5, '1 m'), prefix: 'rl:auth' }),
  write: new Ratelimit({ redis, limiter: Ratelimit.slidingWindow(20, '1 m'), prefix: 'rl:write' }),
}

const WRITE_METHODS = ['POST', 'PUT', 'PATCH', 'DELETE']

// Auth routes get the strictest bucket, then writes, then everything else.
function pickLimit(path: string, method: string): keyof typeof RATE_LIMITS {
  if (path.startsWith('/api/auth')) return 'auth'
  if (path.startsWith('/api/') && WRITE_METHODS.includes(method)) return 'write'
  return 'default'
}

export async function middleware(req: NextRequest): Promise<NextResponse> {
  if (!req.nextUrl.pathname.startsWith('/api/')) return NextResponse.next()

  // First entry of x-forwarded-for is the client; requests without it share one bucket.
  const ip = req.headers.get('x-forwarded-for')?.split(',')[0]?.trim() ?? 'unknown'
  const limit = RATE_LIMITS[pickLimit(req.nextUrl.pathname, req.method)]
  const result = await limit.limit(ip)

  if (!result.success) {
    // result.reset is a Unix timestamp in milliseconds; never advertise less than one second.
    const retryAfterSeconds = Math.max(1, Math.ceil((result.reset - Date.now()) / 1000))
    return new NextResponse(
      JSON.stringify({ success: false, error: 'rate_limited' }),
      {
        status: 429,
        headers: {
          'Content-Type': 'application/json',
          'Retry-After': String(retryAfterSeconds),
        },
      },
    )
  }

  return NextResponse.next()
}

export const config = { matcher: '/api/:path*' }

4. The cache wrapper with stale-while-revalidate

A typed helper that wraps every cacheable read. It returns cached data immediately when fresh, returns stale data and refreshes in the background when within the SWR window, and falls through to the loader once both windows have passed. The three return paths are where the hit, stale, and miss counters attach; the listing leaves the metric calls out.

// lib/cache/cachedApiCall.ts
import { Redis } from '@upstash/redis'

const redis = Redis.fromEnv()

interface CacheEntry<T> {
  value: T
  fresh: number
  stale: number
}

interface CacheOptions {
  ttl: number              // seconds
  staleWhileRevalidate?: number  // additional seconds where stale data is served while background refresh runs
}

export async function cachedApiCall<T>(
  key: string,
  loader: () => Promise<T>,
  options: CacheOptions,
): Promise<T> {
  const now = Date.now()
  const cached = await redis.get<CacheEntry<T>>(key)

  if (cached && cached.fresh > now) {
    return cached.value
  }

  if (cached && cached.stale > now) {
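    // Serve stale, refresh in background; do not await.
    // On serverless runtimes, pass this promise to the platform's waitUntil-style hook so it is not cut off.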
    void refreshAndStore(key, loader, options).catch(() => undefined)
    return cached.value
  }

  const fresh = await loader()
  await refreshAndStore(key, async () => fresh, options)
  return fresh
}

async function refreshAndStore<T>(
  key: string,
  loader: () => Promise<T>,
  options: CacheOptions,
): Promise<void> {
  const value = await loader()
  const now = Date.now()
  const entry: CacheEntry<T> = {
    value,
    fresh: now + options.ttl * 1000,
    stale: now + (options.ttl + (options.staleWhileRevalidate ?? 0)) * 1000,
  }
  await redis.set(key, entry, { ex: options.ttl + (options.staleWhileRevalidate ?? 0) })
}
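
The two windows are easier to reason about with numbers. The sketch below is a standalone illustration, not part of the helper: `windowsFor` and `classify` are hypothetical names, and the comparisons follow the order described above (fresh first, then stale, then expired).

```typescript
interface CacheWindows {
  fresh: number // epoch ms until which the value is served as-is
  stale: number // epoch ms until which the value is served while a refresh runs
}

type CacheState = 'fresh' | 'stale' | 'expired'

// Both deadlines are fixed at write time from the TTL and the SWR allowance.
function windowsFor(storedAtMs: number, ttlSeconds: number, swrSeconds: number): CacheWindows {
  return {
    fresh: storedAtMs + ttlSeconds * 1000,
    stale: storedAtMs + (ttlSeconds + swrSeconds) * 1000,
  }
}

// A missing entry is treated the same as an expired one: the caller must load.
function classify(entry: CacheWindows | null, nowMs: number): CacheState {
  if (entry && entry.fresh > nowMs) return 'fresh'
  if (entry && entry.stale > nowMs) return 'stale'
  return 'expired'
}
```

For a value stored at time zero with a 60-second TTL and a 300-second SWR allowance, a read at 30 seconds is fresh, a read at two minutes is stale and triggers a background refresh, and a read after six minutes waits for the loader.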

5. The observability bootstrap

One file wires structured logging, error reporting, and metrics. Next.js calls register() from instrumentation.ts once, when the server starts. The team gets the four golden signals on day one; the rest grows from the same primitives.

// instrumentation.ts
import * as Sentry from '@sentry/nextjs'
import pino from 'pino'

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  redact: ['*.password', '*.token', 'authorization'],
  formatters: {
    level: (label) => ({ level: label }),
  },
})

export async function register(): Promise<void> {
  Sentry.init({
    dsn: process.env.SENTRY_DSN,
    environment: process.env.NODE_ENV,
    release: process.env.VERCEL_GIT_COMMIT_SHA,
    tracesSampleRate: 0.1,
    beforeSend(event) {
      // Strip credentials from the captured request headers before reporting.
      if (event.request?.headers) {
        delete event.request.headers.authorization
        delete event.request.headers.cookie
      }
      return event
    },
  })

  logger.info({ release: process.env.VERCEL_GIT_COMMIT_SHA }, 'instrumentation ready')
}

6. What this composes

The tenant test proves the data does not leak. The webhook handler proves the billing does not lose state. The rate limit proves the public surface does not melt under abuse. The cache proves the database does not get hit on every page view. The bootstrap proves the team knows what is happening when something goes wrong.

The MVP is no longer an MVP. The codebase is boring in the specific way a SaaS that scales is boring: the work happens in the right files, the metrics tell the truth, the on-call person reads a runbook instead of paging the founder, and the next time the press picks the product up, the site stays up.

Frequently asked questions

How do you decide what is critical versus what can wait?

We rank surfaces by blast radius. Anything that can leak a customer's data goes first (RLS, tenant isolation). Anything that can lose money goes second (billing). Anything that can take the site down goes third (rate limiting, capacity). Anything that lets the team sleep goes fourth (observability, runbook). Anything else is post-engagement work the team can do on its own.

Can you do this without taking the product offline?

Yes, and we do. The cutover for each surface happens behind a feature flag and is reversible. The RLS rollout uses `PERMISSIVE` Postgres policies with wide-open defaults during the transition, then tightens them table by table. The Stripe rewire ships the new handler alongside the old one and switches once a week of webhooks has flowed through both cleanly. No customer notices the work.

What if our MVP is on a stack you do not usually work with?

The six surfaces are stack-agnostic. The specific tools change (Auth0 instead of Supabase Auth, Stigg instead of Stripe Billing, Cloudflare Rate Limiting instead of Upstash), but the pattern is the same. We do not run this on stacks where we cannot read the code; if the MVP is in Elixir or Ruby and we cannot read that language, we say so.

How long does it take?

Eight to twelve weeks from kickoff. Auth and RLS take two weeks. Stripe rewire takes two weeks (most of it is testing against the test clock). Rate limiting and caching take two weeks combined. Observability takes one week. The runbook takes one week. The last two weeks are buffer for the inevitable surprise an audit surfaces.

Do we have to choose between RLS and an API gateway?

No. RLS is the floor; the API gateway is the ceiling. RLS ensures the database refuses to return a tenant's data even if the API has a bug. The gateway ensures most bugs never reach the database. Belt and suspenders; the engagement ships both.

What about background jobs and queues?

Part of the scope when the MVP has any. Most early-stage SaaS uses a Vercel cron or a Supabase Edge Function for jobs, both of which we audit and harden. Heavier workloads move to Trigger.dev, Inngest, or a self-hosted BullMQ depending on the constraints. We do not introduce a queue the MVP did not need; we make the existing one reliable.

Do you replace our backend engineer?

No. We work alongside them. The runbook and the code are written so the existing engineer owns them after we leave. The engagement ends with a half-day handover and a calendar invite for a 30-day follow-up call. Most teams need us once; the ones who need us twice are the ones who skipped the runbook.

What does it cost if we wait?

The cheapest scale-up engagement starts before the first paid incident. The next-cheapest starts after the first incident. The most expensive starts after the first churned enterprise customer who left because of an outage. The math is in the runbook; we will share it on the scoping call.

Scope your SaaS scale-up

A scoping call, an audit of the six surfaces in week one, a fixed scope and a number we hold. Eight to twelve weeks from kickoff to a SaaS that survives Monday at 9 AM.