Reliability

Our committed numbers, our actual numbers, and what we do when something breaks.

Committed targets

Uptime target (Growth/Scale)

99.9%

~43 min/month max downtime

Uptime target (Enterprise)

99.95%

~21 min/month max downtime

P95 latency budget (chat reply)

< 3.0s

WATCHDOG enforces over a rolling 24h window

P95 latency budget (voice TTS)

< 800ms

Real-time conversation quality

Snapshot retention

30 days

Encrypted, AWS ap-southeast-1

Recovery point objective (RPO)

< 24h

Worst-case data loss in DR

Recovery time objective (RTO)

< 4h

Time to restore service in DR

Incident response (SEV-0)

< 15 min

War-room declared + comms

Postmortem deadline

< 7 days

Blameless writeup + actions

DR drill cadence

Quarterly

Restore + smoke test, logged

Quality samples (SENTINEL)

5% / hourly

Per-module response audit

Cross-tenant isolation tests

Weekly

Automated RLS assertion
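Two of the targets above reduce to small calculations. The sketch below shows how the downtime figures follow from the uptime percentages, and how a rolling 24h P95 could be compared with the 3.0s chat budget. It assumes a 30-day month and a nearest-rank percentile. The function names and the (timestamp, latency) record shape are hypothetical and do not describe how WATCHDOG is actually implemented.

```python
from datetime import datetime, timedelta


def downtime_budget_minutes(uptime_pct: float, days: int = 30) -> float:
    """Maximum downtime (minutes) permitted by an uptime target over `days` days."""
    return days * 24 * 60 * (100.0 - uptime_pct) / 100.0


def rolling_p95(samples, now, window=timedelta(hours=24)):
    """Nearest-rank P95 of the latencies recorded inside the rolling window.

    `samples` is a list of (timestamp, latency_seconds) pairs. This shape is
    an assumption made for illustration only.
    """
    recent = sorted(lat for ts, lat in samples if now - window <= ts <= now)
    if not recent:
        return None
    rank = -(-95 * len(recent) // 100)  # ceil(0.95 * n) with integer math
    return recent[rank - 1]


# Uptime targets from the table above, converted to monthly downtime budgets.
print(f"{downtime_budget_minutes(99.9):.1f}")   # 43.2
print(f"{downtime_budget_minutes(99.95):.1f}")  # 21.6

# Hypothetical latency samples: 20 inside the window, 1 stale outlier outside it.
now = datetime(2024, 1, 2, 12, 0)
latencies = [1.0] * 18 + [2.5, 3.5]
samples = [(now - timedelta(hours=1), lat) for lat in latencies]
samples.append((now - timedelta(hours=30), 9.0))  # older than 24h, ignored

p95 = rolling_p95(samples, now)
print(p95, p95 < 3.0)  # 2.5 True
```

With a 30-day month the budgets come out at 43.2 and 21.6 minutes, which matches the rounded figures in the uptime cards.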

When something breaks

1. Detect: WATCHDOG detects the regression and auto-pauses the affected module (target mean time to detect: 15 minutes).

2. Declare: the on-call engineer declares the incident in /admin/incidents/new and opens a war-room.

3. Contain: isolate the affected surface, revoke compromised secrets, and snapshot logs.

4. Communicate: the status page is updated within 30 min, and affected tenants are emailed within 1h.

5. Recover: restore service in priority order and confirm recovery via SENTINEL replay.

6. Learn: a blameless postmortem is published within 7 days, with action items tracked to completion. The goal is that the same incident never happens twice.
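The time limits in the steps above can be checked mechanically against an incident timeline. The following is a minimal sketch, not the actual tooling: the milestone names and the `missed_deadlines` helper are hypothetical, and measuring every deadline from the moment of detection is an assumption, since the page does not say when each clock starts. The 15-minute war-room limit is the SEV-0 target from the commitments.

```python
from datetime import datetime, timedelta

# Limits taken from the page. Measuring them from detection time is an assumption.
DEADLINES = {
    "war_room_declared": timedelta(minutes=15),   # SEV-0 incident response
    "status_page_updated": timedelta(minutes=30),
    "tenants_emailed": timedelta(hours=1),
    "postmortem_published": timedelta(days=7),
}


def missed_deadlines(detected_at, events):
    """Return milestones that are missing or were reached after their limit."""
    missed = []
    for name, limit in DEADLINES.items():
        happened_at = events.get(name)
        if happened_at is None or happened_at - detected_at > limit:
            missed.append(name)
    return missed


# Hypothetical timeline: the status page update is 10 minutes late.
detected = datetime(2024, 3, 1, 9, 0)
timeline = {
    "war_room_declared": detected + timedelta(minutes=10),
    "status_page_updated": detected + timedelta(minutes=40),
    "tenants_emailed": detected + timedelta(minutes=55),
    "postmortem_published": detected + timedelta(days=5),
}
print(missed_deadlines(detected, timeline))  # ['status_page_updated']
```

A milestone that never happened is reported as missed, so an incomplete timeline cannot pass the check by omission.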