Observability Dashboard
Definition
The Observability Dashboard summarizes operational data from request logs, evaluation runs, and security audit records.
Why It Exists In Aurelia Ledger
An enterprise AI platform must be operable. Teams need to know which agents are used, how slow requests are, how often security controls trigger, and whether recent evaluation runs pass.
Implementation Links
| Area | File | Lines | Why It Matters |
|---|---|---|---|
| Summary service | observability_service.py | L19-L81 | Aggregates logs, eval runs, and security audits |
| Empty-state response | observability_service.py | L82-L97 | Keeps dashboard stable when DB is empty |
| Distribution and p95 helpers | observability_service.py | L98-L128 | Computes route shares and latency percentiles |
| Request log model | models.py | L38-L49 | Stores agent, latency, sources, and cost |
| Evaluation run model | models.py | L94-L104 | Stores eval summary history |
| Observability eval cases | observability_cases.json | Full file | Validates summary endpoint behavior |
Core Workflow
flowchart LR
Requests[(Request Logs)] --> Summary[Observability Service]
Evals[(Evaluation Runs)] --> Summary
Security[(Security Audits)] --> Summary
Summary --> API[/api/observability/summary/]
API --> UI[Observability Dashboard]Technical Deep Dive
The dashboard uses PostgreSQL logs already created by the platform. It avoids Prometheus and Grafana for the MVP, but still shows operational health signals.
The service must handle empty states. A fresh database should return zeros and empty arrays instead of breaking the UI.
Formula / Scoring Model
Average latency:
latency_avg_ms = sum(latency_ms) / request_countP95 latency:
p95 = sorted(latencies)[ceil(0.95 * n) - 1]Route share:
route_share(agent) = count(agent) / total_requestsSecurity action share:
action_share(action) = count(action) / total_security_eventsExample Walkthrough
Request:
Invoke-RestMethod http://localhost:8000/api/observability/summaryExpected response includes:
- request count
- average latency
- p95 latency
- agent route distribution
- latest evaluation pass rate
- security action distribution
- recent requests
Design Tradeoffs
- PostgreSQL summary is simple and low-friction.
- Dedicated metrics infrastructure would be stronger for production.
- Compact UI avoids adding a charting dependency.
Failure Modes
- Logs grow without retention policy.
- PostgreSQL is not optimized as a high-volume metrics store.
- Averages can hide outliers, so p95 is included.
Exercises
Checkpoint: Explain why p95 latency is more useful than only average latency.
Hands-on: Inspect observability_service.py L98-L128 and explain distribution and percentile calculation.
Interview Drill: Explain why observability changes the project from a chatbot demo into an operable system.
Interview Explanation
Observability provides the operational evidence behind the agent system. It shows what the platform did, not just what answer appeared in the UI.