1. Architecture
Four-layer agentic system. Layer 1 modules are stateful AI Employees with perception, reasoning, action and escalation. Layer 2 wires modules together. Layer 3 (Flows) gives clients an IFTTT-style configurator. Layer 4 (Council) is the autonomous meta-agent. Every layer is observable, pausable and reversible.
2. Model selection
A model alias table is the single source of truth for model selection. It defines four aliases: PRIMARY_LLM (high-volume internal), SECONDARY_LLM (client-facing + regulated), FALLBACK_LLM (latency-critical recovery) and EMBED_MODEL (one model, no alternatives). The aliases roll forward centrally, so modules don't need to know which underlying provider is current. Sovereignty-sensitive workloads never leave audited regions.
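A minimal sketch of how such an alias table might be resolved in code. Only the four alias names come from the text above; the provider names, model names, regions and the names `ModelTarget`, `MODEL_ALIASES`, `AUDITED_REGIONS` and `resolve` are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTarget:
    provider: str
    model: str
    region: str


# Placeholder targets: only the alias names are taken from the document.
MODEL_ALIASES = {
    "PRIMARY_LLM": ModelTarget("provider-a", "chat-large", "ap-southeast-1"),
    "SECONDARY_LLM": ModelTarget("provider-b", "chat-regulated", "ap-southeast-1"),
    "FALLBACK_LLM": ModelTarget("provider-c", "chat-fast", "us-east-1"),
    "EMBED_MODEL": ModelTarget("provider-a", "embed-v1", "ap-southeast-1"),
}

# Assumed set of audited regions (illustrative only).
AUDITED_REGIONS = frozenset({"ap-southeast-1"})


def resolve(alias: str, sovereignty_sensitive: bool = False) -> ModelTarget:
    """Map an alias to its current target; callers never name a provider."""
    if alias not in MODEL_ALIASES:
        raise KeyError(f"unknown model alias: {alias}")
    target = MODEL_ALIASES[alias]
    # Refuse to route sovereignty-sensitive work outside audited regions.
    if sovereignty_sensitive and target.region not in AUDITED_REGIONS:
        raise PermissionError(f"{alias} resolves outside audited regions")
    return target
```

Under this sketch, rolling an alias forward means editing one table entry; modules keep requesting PRIMARY_LLM and are unaffected.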
3. Determinism over generation
Anything that's math (pricing, retention periods, blackout windows, payout calculations) runs in deterministic code. Anything that's judgement (tone, recommendation, summary) runs in an LLM with structured output schemas. LLMs never do arithmetic. Code nodes never write copy.
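To illustrate the split, here is a small sketch: a payout computed in exact decimal arithmetic, next to a validator for a structured LLM output. The fee model, the `Recommendation` fields, the allowed tones and all function names are assumptions made for the example, not the system's actual schema.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP


def payout(gross: Decimal, fee_rate: Decimal) -> Decimal:
    """Deterministic side: exact decimal math, rounded to cents."""
    net = gross * (Decimal("1") - fee_rate)
    return net.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


@dataclass(frozen=True)
class Recommendation:
    tone: str
    summary: str


# Hypothetical set of tones the schema accepts.
ALLOWED_TONES = frozenset({"formal", "friendly", "neutral"})


def parse_recommendation(raw: dict) -> Recommendation:
    """Judgement side: accept LLM output only if it matches the schema."""
    if set(raw) != {"tone", "summary"}:
        raise ValueError("unexpected or missing fields")
    if raw["tone"] not in ALLOWED_TONES:
        raise ValueError(f"tone not allowed: {raw['tone']!r}")
    if not isinstance(raw["summary"], str) or not raw["summary"].strip():
        raise ValueError("summary must be a non-empty string")
    return Recommendation(tone=raw["tone"], summary=raw["summary"])
```

The LLM never sees a number it has to compute, and the code path never produces free text; any figure shown to a client would be computed by `payout` and inserted into the copy afterwards.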
4. Cancel windows + approval gates
Mandate 2.4: every outbound, financial, public or destructive action goes through a 30-second Telegram cancel window. Class A changes (model swap, behaviour-changing prompt) require written owner approval plus a 7-day shadow run. A margin under 60% hard-blocks Module N proposals.
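One way a cancel window could be expressed, sketched with asyncio. The Telegram integration is not shown: the sketch assumes that whatever handles the cancel button sets an `asyncio.Event`. The names `run_with_cancel_window` and `cancel_requested` are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional, Tuple

CANCEL_WINDOW_SECONDS = 30.0  # window length from Mandate 2.4


async def run_with_cancel_window(
    action: Callable[[], Awaitable[Any]],
    cancel_requested: asyncio.Event,
    window_seconds: float = CANCEL_WINDOW_SECONDS,
) -> Tuple[str, Optional[Any]]:
    """Hold the action for the window; run it only if nobody cancels."""
    try:
        # Returns early if the cancel event is set during the window.
        await asyncio.wait_for(cancel_requested.wait(), timeout=window_seconds)
    except asyncio.TimeoutError:
        # Window elapsed with no cancel: perform the action.
        return ("executed", await action())
    return ("cancelled", None)
```

The window length is a parameter so tests can use a short value; in production it would stay at the 30-second default.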
5. Observability + evaluation
SENTINEL samples 5% of all responses hourly and scores them 1-10. Any response scoring below 7 is surfaced with the original message and a suggested fix. Eval sets are versioned in /admin/agents/eval-sets; a 90%+ pass rate is required to ship a Class A change. Self-hosted Langfuse captures every LLM trace.
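The sampling, flagging and ship-gate rules above can be sketched as three small functions. This is an illustration under assumptions: the function names are hypothetical, the sample size is rounded up to a whole response, and the scoring itself is not modelled.

```python
import random
from typing import List, Optional, Sequence, Tuple

SAMPLE_PERCENT = 5         # share of responses sampled each hour
SCORE_THRESHOLD = 7        # scores below this are surfaced
REQUIRED_PASS_PERCENT = 90  # eval pass rate needed for a Class A change


def sample_responses(responses: Sequence, rng: Optional[random.Random] = None) -> List:
    """Pick 5% of the hour's responses (rounded up) without replacement."""
    rng = rng or random.Random()
    k = (len(responses) * SAMPLE_PERCENT + 99) // 100  # integer ceiling
    return rng.sample(list(responses), k)


def flag_low_scores(scored: Sequence[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Keep (message, score) pairs whose score is below the threshold."""
    return [(message, score) for message, score in scored if score < SCORE_THRESHOLD]


def eval_gate(passed: int, total: int) -> bool:
    """True if the eval pass rate is at least 90% (integer comparison)."""
    if total <= 0:
        raise ValueError("eval set must not be empty")
    return passed * 100 >= total * REQUIRED_PASS_PERCENT
```

Integer arithmetic is used for both percentages so the thresholds behave exactly at the boundary (a score of 7 is not flagged; 90 of 100 passes).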
6. Circuit breakers
WATCHDOG monitors module_runs error rate every 15 min. Any module above 20% error rate in a rolling 1h window auto-pauses and pages on Telegram. TREASURY auto-flags any client margin under 60%. GUARDIAN auto-alerts at 18K PDPA records (DPO threshold is 20K).
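A sketch of the rolling-window check WATCHDOG is described as running. It assumes each run is available as a (finished_at, ok) pair; the real shape of module_runs is not specified here, and `should_pause` is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

ERROR_RATE_LIMIT_PERCENT = 20     # pause when the error rate is above this
WINDOW = timedelta(hours=1)       # rolling window length


def should_pause(runs: Iterable[Tuple[datetime, bool]], now: datetime) -> bool:
    """True if more than 20% of runs in the last hour failed."""
    recent = [ok for finished_at, ok in runs if now - WINDOW <= finished_at <= now]
    if not recent:
        return False  # no runs in the window, nothing to judge
    errors = sum(1 for ok in recent if not ok)
    # Strictly above the limit, compared in integers to avoid float edge cases.
    return errors * 100 > len(recent) * ERROR_RATE_LIMIT_PERCENT
```

Called every 15 minutes per module, a True result would trigger the auto-pause and the Telegram page.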
7. Multi-tenancy
RLS (Row Level Security) is enabled on every Postgres table, so tenant isolation is enforced at the database level before any application logic runs. A cross-tenant test runs weekly: log in as Tenant A, attempt to read Tenant B's rows, assert zero results. Any failure pages immediately.
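The weekly cross-tenant test could take roughly the following shape. This is only a sketch: `query_as` stands in for "open a session as one tenant and query another tenant's rows", and `make_fake_db` is an in-memory stand-in that imitates RLS filtering so the example runs without Postgres. Both names are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
QueryFn = Callable[[str, str], List[Row]]


def assert_isolated(query_as: QueryFn, tenant_a: str, tenant_b: str) -> None:
    """Log in as tenant_a, ask for tenant_b's rows, require zero results."""
    leaked = list(query_as(tenant_a, tenant_b))
    if leaked:
        raise AssertionError(
            f"{len(leaked)} row(s) of {tenant_b} visible to {tenant_a}"
        )


def make_fake_db(rows: List[Row]) -> QueryFn:
    """In-memory stand-in: filters by session tenant first, like RLS would."""
    def query_as(session_tenant: str, target_tenant: str) -> List[Row]:
        visible = [r for r in rows if r["tenant_id"] == session_tenant]
        return [r for r in visible if r["tenant_id"] == target_tenant]
    return query_as
```

In a real run, `query_as` would execute against Postgres under Tenant A's credentials, and the AssertionError path would be wired to the pager.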
8. Disaster recovery
Daily encrypted snapshots, retained 30 days. Quarterly DR drills (actual restore + smoke test, documented in /admin/runbook/dr-drills). RPO 24h, RTO 4h. Postmortems within 7 days of any SEV-1+ incident.
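The retention and RPO figures above translate into two simple checks, sketched here. The assumption is that snapshot completion times are available as a list of datetimes; `expired_snapshots` and `rpo_met` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import List, Sequence

RETENTION = timedelta(days=30)  # snapshots are kept for 30 days
RPO = timedelta(hours=24)       # at most 24h of data may be lost


def expired_snapshots(snapshot_times: Sequence[datetime], now: datetime) -> List[datetime]:
    """Snapshots older than the retention period, eligible for deletion."""
    return [t for t in snapshot_times if now - t > RETENTION]


def rpo_met(snapshot_times: Sequence[datetime], now: datetime) -> bool:
    """True if the newest snapshot is recent enough to satisfy the RPO."""
    if not snapshot_times:
        return False
    return now - max(snapshot_times) <= RPO
```

A check like `rpo_met` could run alongside the daily snapshot job, so a missed snapshot is noticed before the next DR drill rather than during it.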
9. Data residency
Primary: AWS ap-southeast-1 (Singapore) via Supabase. Backups stay in the same region, encrypted at rest. Customer data never trains any model; commercial API contracts explicitly forbid this. Sovereign on-prem (SV1-2) is available for regulated industries with air-gap requirements.
10. Governance
/admin/governance is auditor-ready: change register, risk register, RACI, AI policy, audit log and evidence pack. ISO 42001 Stage 1 audit (Schellman) scheduled for Q3 2026. SOC 2 Type II report expected Q4 2026.
