Monitoring

Observability and alerting for platform health.

Key Metrics

Ingestion Health

Messages per second
Error rate
Latency percentiles
Queue depth
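
As an illustration of how error rate and latency percentiles might be derived from raw samples, here is a minimal sketch using only the Python standard library. The function names and the choice of p50/p95/p99 are assumptions made for this example; they are not the platform's own metric definitions.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95 and p99 for a list of latency samples (milliseconds).

    Assumes at least two samples; statistics.quantiles raises
    StatisticsError on fewer in older Python versions.
    """
    # n=100 yields 99 cut points; cut point k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def error_rate(errors, total):
    """Fraction of failed messages; 0.0 when nothing was processed."""
    return errors / total if total else 0.0
```

Percentiles are usually more informative than averages here because a small number of slow messages can hide behind a healthy mean.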

Processing Health

Throughput
Processing latency
Error rate
Backlog size

Symptoms Engine Health

Active symptoms count
Detection latency
State changes per second
Memory utilization

API Health

Request rate
Response latency
Error rate
Active connections

Dashboards

Pre-built monitoring dashboards:

System overview
Ingestion monitoring
Processing pipeline
Symptoms engine
API performance

Alerting

The platform raises alerts for:

Error rate thresholds
Latency degradation
Resource exhaustion
Component failures
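
A threshold alert of the kind listed above can be sketched as follows. This is a hypothetical rule evaluator, not the platform's alerting engine: the `ThresholdRule` fields, the 5% threshold, and the "three consecutive breaches" policy are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    name: str             # hypothetical alert name
    metric: str           # hypothetical metric identifier
    threshold: float      # fire when the metric exceeds this value
    for_samples: int = 3  # consecutive breaching samples required


def should_fire(rule, values):
    """Return True when the last `for_samples` values all exceed the threshold."""
    recent = values[-rule.for_samples:]
    return len(recent) == rule.for_samples and all(
        v > rule.threshold for v in recent
    )


rule = ThresholdRule(name="high_error_rate", metric="error_rate", threshold=0.05)
firing = should_fire(rule, [0.01, 0.06, 0.07, 0.08])
```

Requiring several consecutive breaches is one common way to avoid alerting on a single noisy sample; the right window depends on how often the metric is sampled.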

Integration

Export metrics to:

Prometheus
Datadog
CloudWatch
Custom endpoints
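
For the Prometheus target, metrics are scraped in a plain-text exposition format. The sketch below renders gauges in that format by hand so the shape of the output is visible. The metric name `ingestion_queue_depth` and its label are hypothetical, and label values are assumed not to need escaping; a real exporter would normally use a Prometheus client library instead.

```python
def render_prometheus(metrics):
    """Render gauges in the Prometheus text exposition format.

    `metrics` maps a metric name to (help_text, value, labels).
    Label values are written as-is; escaping is omitted in this sketch.
    """
    lines = []
    for name, (help_text, value, labels) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


text = render_prometheus({
    # Hypothetical metric, for illustration only.
    "ingestion_queue_depth": (
        "Messages waiting in the ingestion queue",
        42,
        {"component": "ingestion"},
    ),
})
```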

Health Endpoints

Health check endpoints are available for:

Load balancer probes
Orchestration health
Dependency checks
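
To show how such endpoints are commonly structured, here is a minimal sketch using Python's built-in `http.server`. The paths `/healthz` and `/readyz`, the dependency names, and the JSON response shape are assumptions for illustration; the platform's actual endpoint paths and payloads are not specified here.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_dependencies():
    """Hypothetical dependency probes; replace with real connectivity checks."""
    return {"queue": True, "database": True}


def build_response(path, deps):
    """Map a request path to (status_code, body) for a health probe."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer.
        return 200, {"status": "ok"}
    if path == "/readyz":
        # Readiness: every dependency must be reachable.
        ok = all(deps.values())
        status = "ok" if ok else "degraded"
        return (200 if ok else 503), {"status": status, "dependencies": deps}
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        code, body = build_response(self.path, check_dependencies())
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Run the health server until interrupted (port is an assumed default)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Separating a cheap liveness check from a dependency-aware readiness check lets a load balancer stop routing traffic to a degraded instance without the orchestrator restarting it.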

Last updated on March 25, 2026
