Committed targets
Uptime target (Growth/Scale): 99.9% (~43 min/month max downtime)
Uptime target (Enterprise): 99.95% (~21 min/month max downtime)
P95 latency budget (chat reply): < 3.0s (WATCHDOG enforces 24h rolling)
P95 latency budget (voice TTS): < 800ms (Real-time conversation quality)
Snapshot retention: 30 days (Encrypted, AWS ap-southeast-1)
Recovery point objective (RPO): < 24h (Worst-case data loss in DR)
Recovery time objective (RTO): < 4h (Time to restore service in DR)
Incident response (SEV-0): < 15 min (War-room declared + comms)
Postmortem deadline: < 7 days (Blameless writeup + actions)
DR drill cadence: Quarterly (Restore + smoke test, logged)
Quality samples (SENTINEL): 5% / hourly (Per-module response audit)
Cross-tenant isolation tests: Weekly (Automated RLS assertion)
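The uptime and chat-latency targets above reduce to simple arithmetic, sketched below. This is an illustration, not WATCHDOG's actual implementation: the 30-day month, the nearest-rank percentile method, the function names, and the (timestamp, latency) sample shape are all assumptions.

```python
from datetime import datetime, timedelta

# Assumption: a "month" is 30 days (43,200 minutes); the page does not say.
MINUTES_PER_MONTH = 30 * 24 * 60


def downtime_budget_minutes(uptime_percent: float) -> float:
    """Maximum monthly downtime, in minutes, allowed by an uptime target."""
    return MINUTES_PER_MONTH * (100.0 - uptime_percent) / 100.0


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def chat_latency_ok(samples: list[tuple[datetime, float]], now: datetime,
                    budget_s: float = 3.0) -> bool:
    """True if P95 latency over the trailing 24h is under the budget.

    `samples` holds hypothetical (timestamp, latency in seconds) pairs.
    An empty window is treated as passing.
    """
    cutoff = now - timedelta(hours=24)
    recent = [latency for ts, latency in samples if cutoff <= ts <= now]
    return not recent or p95(recent) < budget_s


print(round(downtime_budget_minutes(99.9), 1))
print(round(downtime_budget_minutes(99.95), 1))
```

On a 30-day month, 99.9% works out to 43.2 minutes and 99.95% to 21.6 minutes, in line with the approximate figures quoted above.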
Where to verify
status.onset.my
Live uptime + incident history. Subscribe via Atom/RSS or email.
Incident postmortems
Customer-readable blameless writeups for every SEV-1+ event.
DR drill log
Quarterly disaster recovery test + restore results.
Audit reports
SOC 2 Type I (Schellman, Mar 2026). ISO 42001 Stage 1 in Q3 2026.
When something breaks
1. Detect: WATCHDOG sees the regression and auto-pauses the affected module (15-minute MTTD).
2. Declare: the on-call engineer declares the incident in /admin/incidents/new and opens a war-room.
3. Contain: isolate the affected surface, revoke compromised secrets, and snapshot logs.
4. Communicate: the status page is updated within 30 min, and affected tenants are emailed within 1h.
5. Recover: restore service in priority order and confirm via SENTINEL replay.
6. Learn: publish a blameless postmortem within 7 days and track action items to closure so the same incident does not recur.
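The time commitments in the steps above can be turned into concrete deadlines. The sketch below is a minimal illustration, assuming every clock starts when the incident is declared (the page does not state the starting point). The `IncidentDeadlines` type and both function names are hypothetical, not part of any existing tooling.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class IncidentDeadlines:
    status_page_update: datetime  # step 4: within 30 min
    tenant_email: datetime        # step 4: within 1h
    postmortem: datetime          # step 6: within 7 days


def deadlines_for(declared_at: datetime) -> IncidentDeadlines:
    """Derive deadlines, assuming all clocks start at declaration time."""
    return IncidentDeadlines(
        status_page_update=declared_at + timedelta(minutes=30),
        tenant_email=declared_at + timedelta(hours=1),
        postmortem=declared_at + timedelta(days=7),
    )


def overdue(deadlines: IncidentDeadlines, now: datetime,
            done: set[str]) -> list[str]:
    """Names of commitments past their deadline and not yet in `done`."""
    return [name for name, due in asdict(deadlines).items()
            if name not in done and now > due]


declared = datetime(2026, 3, 1, 9, 0, tzinfo=timezone.utc)
print(deadlines_for(declared))
```

Using timezone-aware timestamps keeps the comparisons unambiguous when responders are in different regions.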