AP-009 — Robust and serviceable

The system must be operable by a small team. That single constraint shapes most of the design decisions that follow.

Operational complexity has a direct cost. Every alert that requires a human decision, every runbook that must be followed, every incident that is not self-healing consumes engineering capacity that could be spent on features. Design the operational burden out of the system.

Systems fail gracefully. Every alert is actionable. Runbooks exist and are tested, not just written. Every service exposes health checks and readiness probes. Circuit breakers protect downstream services.
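As an illustration of the circuit breaker idea, here is a minimal sketch in Python. It is not a prescribed implementation: the class name `CircuitBreaker`, the `failure_threshold` and `reset_timeout` parameters, and their default values are assumptions made for this example. After a run of consecutive failures the breaker stops calling the downstream service for a cooling-off period, then lets one trial call through.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Illustrative breaker; names and defaults are assumptions, not a standard."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests do not need to sleep
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # cooling-off period is over: allow a trial call
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            # Fail fast instead of adding load to a struggling dependency.
            raise CircuitOpenError("downstream call skipped: circuit is open")
        try:
            result = func()
        except Exception:
            self._record_failure()
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial call (half-open) re-opens the breaker immediately;
        # otherwise it opens once the threshold of consecutive failures is hit.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would wrap each downstream request, for example `breaker.call(lambda: fetch_inventory(sku))`, where `fetch_inventory` is a hypothetical client function, and treat `CircuitOpenError` as a signal to serve a fallback. The point of the design is that the failure mode is bounded and visible: the breaker state is a single value that can be exported as a metric and alerted on.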

The 2am test: Could a single engineer on call at 2am diagnose and resolve the most common failure modes without escalation? If not, fix the observability and automated recovery before going to production.

Observability for AI-assisted operations. Structured logs, distributed traces, and well-named metrics should be readable by an AI diagnostic assistant as well as a human. The standards are the same. A log that a language model can reason about is also a log a human can parse quickly.
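One way to make logs equally readable by a person and a diagnostic assistant is to emit one JSON object per line with stable field names. The sketch below uses only the Python standard library. The context field names (`service`, `trace_id`, `order_id`, `error_code`) are hypothetical examples, not a mandated schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    # Hypothetical context fields; replace with the team's agreed schema.
    CONTEXT_FIELDS = ("service", "trace_id", "order_id", "error_code")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage might look like `build_logger("checkout").error("payment authorization timed out", extra={"service": "checkout", "trace_id": "abc123", "error_code": "PAYMENT_TIMEOUT"})`. Keeping the message a plain sentence and putting identifiers in named fields lets both a human with `grep` and a language model correlate events by `trace_id` without parsing free text.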

KISS check: Robustness is achieved through simplicity. The most reliable systems have the fewest moving parts. Every component you do not build cannot fail.

Relationship to other principles

| Principle | Relationship |
| --- | --- |
| AP-001 KISS | Robustness is achieved through simplicity: fewer moving parts means fewer failure modes. |
| AP-005 Customer driven | Operational reliability is a customer experience metric: downtime and slow recovery directly hurt customers. |
| AP-006 Cost effective | Operational complexity has a direct cost in engineering time; simpler systems are cheaper to run. |
| AP-007 Evolution by design | Systems designed for easy regeneration are also easier to diagnose and recover in production. |

See the full architectural principles index.