RELIABILITY · OPERATIONAL REALITY

Our committed numbers, our actual numbers, and what we do when something breaks.

The SLA tells you what you can hold us to. This page tells you how we actually run: the budgets, the cadence, the playbook. Everything here can be checked against status.onset.my.

99.9% Growth / Scale
99.95% Enterprise
RPO <24h
RTO <4h
DR drills quarterly

COMMITTED TARGETS · 12 NUMBERS

Twelve numbers, no hedging.

Uptime target (Growth / Scale): 99.9% (~43 min/month max downtime)
Uptime target (Enterprise): 99.95% (~21 min/month max downtime)
P95 latency budget, chat reply: < 3.0s (enforced on a 24h rolling window)
P95 latency budget, voice TTS: < 800ms (real-time conversation quality)
Snapshot retention: 30 days (encrypted · AWS ap-southeast-1)
Recovery point objective (RPO): < 24h (worst-case data loss in DR)
Recovery time objective (RTO): < 4h (time to restore service in DR)
Incident response, SEV0: < 15 min (war-room declared + comms started)
Postmortem deadline: < 7 days (blameless writeup + action items)
DR drill cadence: Quarterly (restore + smoke test, logged)
Quality sampling: 5% / hourly (per-module response audit, quality monitor)
Cross-tenant isolation tests: Weekly (automated RLS assertion)
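
The downtime figures quoted next to the uptime targets follow from simple arithmetic, and a P95 latency budget can be checked in a similar way. The following is a minimal Python sketch, assuming a 30-day month and a nearest-rank percentile. The function names are hypothetical and are shown only to illustrate the calculation, not how the production monitors are built.

```python
from typing import Sequence

# Assumption: a 30-day month, i.e. 43,200 minutes.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def downtime_budget_minutes(
    uptime_pct: float, period_minutes: int = MINUTES_PER_30_DAY_MONTH
) -> float:
    """Maximum downtime (in minutes) allowed in the period at this uptime target."""
    return period_minutes * (1 - uptime_pct / 100)


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty collection of latencies."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) computed in integer arithmetic to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def within_latency_budget(samples: Sequence[float], budget: float) -> bool:
    """True if the P95 of the window's samples is strictly under the budget."""
    return p95(samples) < budget


if __name__ == "__main__":
    print(downtime_budget_minutes(99.9))   # Growth / Scale target
    print(downtime_budget_minutes(99.95))  # Enterprise target
```

Under these assumptions, 99.9% allows about 43.2 minutes of downtime per 30-day month and 99.95% allows about 21.6 minutes. For the latency check, `samples` would hold the chat reply times (in seconds) from the last 24 hours and `budget` would be 3.0.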

WHEN SOMETHING BREAKS · 6 STAGES

Detect → Declare → Contain → Communicate → Recover → Learn.

Every SEV0/SEV1 follows this sequence. No improvisation, no heroics — the playbook is in tasks/incident-runbook.md and every on-call engineer can recite it.

1. Detect
Automated health checks flag the regression and auto-pause the affected module. Target mean time to detect: 15 minutes.
2. Declare
The on-call engineer opens the incident in audit_log, starts a Telegram war-room channel, and posts the SEV class.
3. Contain
Isolate the affected surface: revoke compromised secrets, snapshot logs, and freeze pipelines that depend on the broken module.
4. Communicate
status.onset.my is updated within 30 minutes of declaration. Affected tenants are emailed within 1 hour. SEV0/SEV1 incidents are also pushed via WhatsApp.
5. Recover
Restore service in priority order. Confirm via independent quality replay before un-pausing the module.
6. Learn
A blameless postmortem is published within 7 days. Action items are tracked in tasks/lessons.md. The same incident never happens twice; that is the contract.
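
The timing commitments in the sequence above can be expressed as deadlines derived from the moment an incident is declared. The following is a minimal Python sketch, assuming the tenant email and postmortem clocks also start at declaration (the text states this explicitly only for the status update). All names are hypothetical and do not come from tasks/incident-runbook.md.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Stage order as listed above.
STAGES = ("Detect", "Declare", "Contain", "Communicate", "Recover", "Learn")

STATUS_PAGE_WINDOW = timedelta(minutes=30)  # status.onset.my update
TENANT_EMAIL_WINDOW = timedelta(hours=1)    # email to affected tenants
POSTMORTEM_WINDOW = timedelta(days=7)       # assumed to run from declaration


@dataclass(frozen=True)
class CommsDeadlines:
    status_page_by: datetime
    tenant_email_by: datetime
    postmortem_by: datetime
    whatsapp_push: bool


def next_stage(current: str) -> Optional[str]:
    """Return the stage that follows `current`, or None after the last one."""
    i = STAGES.index(current)  # raises ValueError for an unknown stage
    return STAGES[i + 1] if i + 1 < len(STAGES) else None


def deadlines_for(declared_at: datetime, sev: str) -> CommsDeadlines:
    """Derive the communication deadlines from the declaration time."""
    return CommsDeadlines(
        status_page_by=declared_at + STATUS_PAGE_WINDOW,
        tenant_email_by=declared_at + TENANT_EMAIL_WINDOW,
        postmortem_by=declared_at + POSTMORTEM_WINDOW,
        whatsapp_push=sev in {"SEV0", "SEV1"},
    )
```

For example, `deadlines_for(datetime(2025, 1, 1, 12, 0), "SEV0")` yields a status page deadline of 12:30 and a tenant email deadline of 13:00 on the same day, with the WhatsApp push enabled.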

WHERE TO VERIFY

Don't take our word for it.