Metrics
1. Latency & Performance Metrics
2. Error & Reliability Metrics
3. Throughput & Capacity Metrics
4. Resource Utilization Metrics
5. Dependency / External Service Metrics
6. Business & Custom Metrics
7. Tracing & Distributed Observability Metrics
8. SLO/SLI/SLA Metrics
9. Change & Deployment Metrics
Here's a structured overview of key observability metrics, grouped by the aspect of the system they reveal:
1. Latency & Performance Metrics
Measure responsiveness and user experience.
| Metric | Meaning | Typical Use |
|---|---|---|
| p50 / p90 / p95 / p99 latency | 50th, 90th, 95th, and 99th percentiles of response times | Understand normal and tail latency behavior |
| Average latency (mean) | Simple average of all response times | Quick overview, but hides tail issues |
| Max latency | Highest recorded latency | Detect severe outliers |
| Request rate (RPS/QPS) | Requests per second | Load characterization |
| Saturation | % of resource utilization (CPU, I/O, thread pool) | Detect approaching bottlenecks |
| Queue length / wait time | Pending requests in queue | Backpressure visibility |
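
To make the difference between the mean and the percentiles concrete, here is a minimal sketch using the nearest-rank percentile definition (one of several conventions; monitoring systems often interpolate or use histogram buckets instead). The sample latencies and the `percentile` helper are illustrative assumptions, not data from a real system.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [12, 14, 15, 15, 16, 17, 18, 20, 250, 900]

summary = {
    "mean": statistics.mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p90": percentile(latencies_ms, 90),
    "p99": percentile(latencies_ms, 99),
    "max": max(latencies_ms),
}
print(summary)
```

In this sample the mean lands far above the median because two outliers dominate it, while p90 and p99 point directly at the slow requests.
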
2. Error & Reliability Metrics
Quantify correctness and stability.
| Metric | Meaning | Typical Use |
|---|---|---|
| Error rate | % of failed requests | Detect reliability drops |
| Error budget consumption | Share of the failures allowed by the SLO (1 - target) that have already occurred | SLO tracking |
| Availability (%) | Successful requests / total requests | SLI for uptime |
| Retry rate | % of retried requests | Catch flaky dependencies |
| Timeouts & circuit breaker opens | Defensive mechanisms triggering | Detect degraded dependencies |
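
As a rough sketch of how error rate, availability, and error budget consumption relate, assuming a request-based SLO: the function name, the request counts, and the 99.9% target below are made up for illustration.

```python
def reliability_summary(total, failed, slo_target):
    """Derive error rate, availability, and error budget consumption from request counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    error_rate = failed / total
    availability = 1 - error_rate
    # The error budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * total
    budget_consumed = failed / allowed_failures if allowed_failures > 0 else float("inf")
    return {
        "error_rate": error_rate,
        "availability": availability,
        "budget_consumed": budget_consumed,
    }

# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% availability target.
result = reliability_summary(total=1_000_000, failed=400, slo_target=0.999)
print({name: round(value, 4) for name, value in result.items()})
```

With these numbers the SLO tolerates about 1,000 failures, so 400 failures consume roughly 40% of the budget.
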
3. Throughput & Capacity Metrics
Measure how much work the system can handle.
| Metric | Meaning | Typical Use |
|---|---|---|
| Requests per second (RPS) | Volume of incoming traffic | Capacity planning |
| Processed messages/sec | For queues, Kafka topics, etc. | Measure system throughput |
| Concurrent connections / sessions | Active clients | Detect overloads |
| Backlog depth | Pending jobs or queue size | Detect lag in consumers |
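
Throughput is usually derived from a monotonically increasing counter sampled at two points in time, and backlog depth becomes more useful when combined with the produce and consume rates. The helpers below are a hypothetical sketch of both calculations; they assume the rates stay constant and ignore counter resets.

```python
def rate_per_second(count_start, count_end, seconds):
    """Average events per second between two samples of a monotonic counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (count_end - count_start) / seconds

def backlog_drain_seconds(backlog, consume_rate, produce_rate):
    """Estimated time to clear the backlog, or None if consumers are not keeping up."""
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return None  # backlog is flat or growing
    return backlog / net_rate

# Hypothetical samples: counter went from 1000 to 1600 over 60 seconds.
print(rate_per_second(1000, 1600, 60))
# Hypothetical queue: 3000 pending jobs, consuming 150/s while 100/s arrive.
print(backlog_drain_seconds(3000, consume_rate=150.0, produce_rate=100.0))
```

A `None` result is the case worth alerting on: the backlog cannot drain at the current rates.
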
4. Resource Utilization Metrics
Track underlying system health.
| Resource | Common Metrics |
|---|---|
| CPU | Usage %, load average, throttling |
| Memory | Heap usage, GC pauses, OOM events |
| Disk / I/O | IOPS, read/write latency, disk full % |
| Network | In/out throughput, packet loss, retransmits |
| Thread pools | Active vs idle threads, queue size |
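
Two of these can be read with the Python standard library alone. This is a small sketch rather than a replacement for a metrics agent: the function names are illustrative, and `os.getloadavg` is only available on Unix-like systems, so the helper returns `None` elsewhere.

```python
import os
import shutil

def disk_full_percent(path="/"):
    """Used space on the filesystem containing `path`, as a percentage of its total size."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total * 100

def load_per_cpu():
    """1-minute load average divided by CPU count, or None where it is unavailable."""
    if not hasattr(os, "getloadavg"):
        return None
    try:
        one_minute, _, _ = os.getloadavg()
    except OSError:
        return None
    return one_minute / (os.cpu_count() or 1)

print(f"disk full: {disk_full_percent('.'):.1f}%")
print(f"load per CPU: {load_per_cpu()}")
```

A load-per-CPU value that stays above 1.0 suggests more runnable work than the cores can serve.
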
5. Dependency / External Service Metrics
Track downstream impact and SLA compliance.
| Metric | Example |
|---|---|
| Dependency latency | DB, cache, external API call duration |
| Cache hit ratio | Redis/memcached effectiveness |
| DB query time | Slow queries, lock contention |
| Third-party API availability | Detect external failures early |
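
One way to collect dependency latency is to wrap each outbound call in a timer. The sketch below uses a context manager and an in-memory dictionary; the names `timed`, `durations_ms`, and `cache_hit_ratio` are assumptions for illustration, and a real service would hand the measurements to its metrics library instead.

```python
import time
from contextlib import contextmanager

durations_ms = {}  # dependency name -> list of call durations in milliseconds

@contextmanager
def timed(dependency):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        durations_ms.setdefault(dependency, []).append(elapsed_ms)

def cache_hit_ratio(hits, misses):
    """Fraction of cache lookups served from the cache."""
    lookups = hits + misses
    return hits / lookups if lookups else 0.0

with timed("db"):
    time.sleep(0.01)  # stand-in for a real database query

print(durations_ms)
print(cache_hit_ratio(hits=90, misses=10))
```
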
6. Business & Custom Metrics
Tie observability to product outcomes.
| Metric | Example |
|---|---|
| Signups / logins per minute | Track core flows |
| Checkout success rate | Detect conversion issues |
| Order fulfillment lag | E2E pipeline latency |
| User-facing error % | From frontend telemetry |
7. Tracing & Distributed Observability Metrics
For microservices and async systems.
| Metric | Meaning |
|---|---|
| Trace latency breakdown | Time spent in each span |
| Span error counts | Failures per service/component |
| Service dependency graph | Identify slow hops in a request path |
| Cross-service correlation | End-to-end flow latency (via trace IDs) |
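
A trace latency breakdown can be sketched by subtracting the time spent in child spans from each parent span. The span records below are a made-up, simplified shape (not the format of any particular tracing system), and the calculation assumes child spans run sequentially inside their parent.

```python
from collections import defaultdict

# Hypothetical spans from one trace: gateway calls orders, which calls the database.
spans = [
    {"id": "a", "parent": None, "service": "gateway", "start_ms": 0, "end_ms": 120, "error": False},
    {"id": "b", "parent": "a", "service": "orders", "start_ms": 10, "end_ms": 100, "error": False},
    {"id": "c", "parent": "b", "service": "db", "start_ms": 20, "end_ms": 80, "error": True},
]

def self_time_by_service(spans):
    """Time each service spent on its own work, excluding time in its child spans."""
    child_time = defaultdict(int)
    for span in spans:
        if span["parent"] is not None:
            child_time[span["parent"]] += span["end_ms"] - span["start_ms"]
    totals = defaultdict(int)
    for span in spans:
        duration = span["end_ms"] - span["start_ms"]
        totals[span["service"]] += duration - child_time[span["id"]]
    return dict(totals)

def errors_by_service(spans):
    """Count of failed spans per service."""
    counts = defaultdict(int)
    for span in spans:
        if span["error"]:
            counts[span["service"]] += 1
    return dict(counts)

print(self_time_by_service(spans))
print(errors_by_service(spans))
```

Here the database accounts for 60 ms of the 120 ms request, which makes it the slow hop.
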
8. SLO/SLI/SLA Metrics
Used for setting reliability targets.
| Category | Example target |
|---|---|
| Availability | ≥99.9% of requests succeed |
| Latency | 95% of requests <200 ms |
| Freshness | Data updated within 1 min |
| Durability | Zero data loss after failure |
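
The latency row can be read as an SLI (the measured fraction of requests faster than 200 ms) compared against an SLO target (95%). A minimal sketch of that check, with hypothetical helper names and sample data:

```python
def latency_sli(latencies_ms, threshold_ms):
    """Fraction of requests that completed faster than the threshold."""
    if not latencies_ms:
        raise ValueError("no samples")
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return fast / len(latencies_ms)

def meets_slo(sli, target):
    """True when the measured SLI is at or above the SLO target."""
    return sli >= target

# Hypothetical window: 19 fast requests and 1 slow one.
window = [50] * 19 + [450]
sli = latency_sli(window, threshold_ms=200)
print(sli, meets_slo(sli, target=0.95))
```

One more slow request in the same window would drop the SLI to 0.90 and miss the target.
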
9. Change & Deployment Metrics
To correlate incidents with changes.
| Metric | Example |
|---|---|
| Deploy frequency | How often code is released |
| Change failure rate | % of deployments causing incidents |
| Mean time to recover (MTTR) | Speed of recovery from incidents |
| Mean time between failures (MTBF) | Stability over time |
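
These metrics can be computed from an incident log and a deploy count. The sketch below uses invented incident timestamps and treats MTBF as the mean gap between the end of one incident and the start of the next, which is one common convention; some teams divide total uptime by the number of failures instead.

```python
from datetime import datetime, timedelta

# Hypothetical incidents as (start, resolved) pairs, in chronological order.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 3, 10, 30), datetime(2024, 1, 3, 11, 0)),
    (datetime(2024, 1, 5, 11, 0), datetime(2024, 1, 5, 12, 0)),
]

def mttr(incidents):
    """Mean time from incident start to recovery."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)

def mtbf(incidents):
    """Mean healthy time between the end of one incident and the start of the next."""
    gaps = [later[0] - earlier[1] for earlier, later in zip(incidents, incidents[1:])]
    if not gaps:
        raise ValueError("need at least two incidents")
    return sum(gaps, timedelta()) / len(gaps)

def change_failure_rate(deploys, failed_deploys):
    """Fraction of deployments that caused an incident."""
    if deploys <= 0:
        raise ValueError("deploys must be positive")
    return failed_deploys / deploys

print(mttr(incidents))
print(mtbf(incidents))
print(change_failure_rate(deploys=40, failed_deploys=3))
```
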