Metrics

🧭 1. Latency & Performance Metrics
🔥 2. Error & Reliability Metrics
⚙️ 3. Throughput & Capacity Metrics
🧩 4. Resource Utilization Metrics
📦 5. Dependency / External Service Metrics
🧠 6. Business & Custom Metrics
🔍 7. Tracing & Distributed Observability Metrics
📊 8. SLO/SLI/SLA Metrics
🕰️ 9. Change & Deployment Metrics

Here's a structured overview of key observability metrics, grouped by the aspect of the system each one reveals:


🧭 1. Latency & Performance Metrics

Measure responsiveness and user experience.

| Metric | Meaning | Typical use |
| --- | --- | --- |
| p50 / p90 / p95 / p99 latency | 50th, 90th, 95th, 99th percentile of response times | Understand normal and tail latency behavior |
| Average latency (mean) | Simple average of all response times | Quick overview, but hides tail issues |
| Max latency | Highest recorded latency | Detect severe outliers |
| Request rate (RPS/QPS) | Requests per second | Load characterization |
| Saturation | % of resource utilization (CPU, I/O, thread pool) | Detect approaching bottlenecks |
| Queue length / wait time | Pending requests in queue | Backpressure visibility |
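
To make the "mean hides tail issues" point concrete, here is a minimal sketch that computes nearest-rank percentiles next to the mean. The sample data and the `percentile` helper are assumptions for illustration; production systems usually estimate percentiles from histograms rather than raw samples.

```python
import math
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [2000] * 2

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
    "max": max(latencies_ms),
}
print(summary)
```

With this data the mean is 59.6 ms and p50 and p95 are both 20 ms, while p99 is 2000 ms: only the high percentile exposes the slow requests.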

🔥 2. Error & Reliability Metrics

Quantify correctness and stability.

| Metric | Meaning | Typical use |
| --- | --- | --- |
| Error rate | % of requests that fail | Detect reliability drops |
| Error budget consumption | Share of the failures the SLO allows (1 - target) that has already been used | SLO tracking |
| Availability (%) | Successful requests / total requests | SLI for uptime |
| Retry rate | % of requests that are retried | Catch flaky dependencies |
| Timeouts & circuit breaker opens | Defensive mechanisms triggering | Detect degraded dependencies |
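
The relationship between error rate, availability, and error budget can be sketched as follows. The function name and the request counts are hypothetical; the budget is taken as the number of failures the SLO target permits over the same window.

```python
def reliability_summary(total, failed, slo_target):
    """Return (error_rate, availability, fraction_of_error_budget_consumed)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    error_rate = failed / total
    availability = 1 - error_rate
    allowed_failures = (1 - slo_target) * total  # error budget, in requests
    budget_consumed = failed / allowed_failures
    return error_rate, availability, budget_consumed

# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% availability target.
error_rate, availability, budget_consumed = reliability_summary(
    total=1_000_000, failed=250, slo_target=0.999
)
print(f"error rate {error_rate:.4%}, availability {availability:.4%}, "
      f"budget used {budget_consumed:.0%}")
```

Here the target allows about 1,000 failed requests, so 250 failures consume roughly a quarter of the budget.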

⚙️ 3. Throughput & Capacity Metrics

Measure how much work the system can handle.

| Metric | Meaning | Typical use |
| --- | --- | --- |
| Requests per second (RPS) | Volume of incoming traffic | Capacity planning |
| Processed messages/sec | For queues, Kafka topics, etc. | Measure system throughput |
| Concurrent connections / sessions | Active clients | Detect overloads |
| Backlog depth | Pending jobs or queue size | Detect lag in consumers |
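
Throughput is typically derived from two readings of a monotonically increasing counter, and backlog depth only becomes actionable when compared against the net drain rate. A rough sketch, with hypothetical counter values and helper names:

```python
def rate_per_second(count_start, count_end, seconds):
    """Average rate between two readings of a monotonically increasing counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (count_end - count_start) / seconds

def drain_time_seconds(backlog, produce_rate, consume_rate):
    """Seconds until the backlog empties, or None if it is not shrinking."""
    net = consume_rate - produce_rate
    if net <= 0:
        return None
    return backlog / net

# Hypothetical counter readings taken 60 seconds apart.
produce_rate = rate_per_second(10_000, 16_000, 60)  # messages/sec written
consume_rate = rate_per_second(8_200, 15_400, 60)   # messages/sec processed
backlog = 16_000 - 15_400                           # messages still pending
eta = drain_time_seconds(backlog, produce_rate, consume_rate)
print(produce_rate, consume_rate, backlog, eta)
```

A `None` result means consumers are not keeping up, which is the lag condition the backlog metric is meant to surface.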

🧩 4. Resource Utilization Metrics

Track underlying system health.

| Resource | Common metrics |
| --- | --- |
| CPU | Usage %, load average, throttling |
| Memory | Heap usage, GC pauses, OOM events |
| Disk / I/O | IOPS, read/write latency, disk full % |
| Network | In/out throughput, packet loss, retransmits |
| Thread pools | Active vs idle threads, queue size |
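
Two of these can be sampled with very little code. The sketch below uses the standard library's `shutil.disk_usage` for disk full % and a plain ratio for thread pool utilization; the helper names and the thread counts are assumptions, and a real agent would sample these on an interval and export them.

```python
import shutil

def disk_used_percent(path="."):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100 * usage.used / usage.total

def pool_utilization(active, idle):
    """Fraction of pool threads that are busy (active vs idle)."""
    size = active + idle
    return active / size if size else 0.0

print(f"disk used: {disk_used_percent('.'):.1f}%")
print(f"pool utilization: {pool_utilization(active=6, idle=2):.0%}")
```

The disk figure may differ slightly from tools such as `df`, which can account for reserved blocks differently.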

📦 5. Dependency / External Service Metrics

Track downstream impact and SLA compliance.

| Metric | Example |
| --- | --- |
| Dependency latency | DB, cache, external API call duration |
| Cache hit ratio | Redis/memcached effectiveness |
| DB query time | Slow queries, lock contention |
| Third-party API availability | Detect external failures early |
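
One way to collect dependency latency is to wrap each outbound call in a timer; cache hit ratio is then just hits over total lookups. This is a minimal in-process sketch: `timed`, `durations_ms`, and the counter values are hypothetical, and the `time.sleep` call stands in for a real query. With Redis, the hit and miss counts could come from the `keyspace_hits` and `keyspace_misses` fields of `INFO stats`.

```python
import time
from contextlib import contextmanager

durations_ms = {}  # dependency name -> list of call durations in milliseconds

@contextmanager
def timed(dependency):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        durations_ms.setdefault(dependency, []).append(elapsed_ms)

def cache_hit_ratio(hits, misses):
    """Fraction of lookups served from the cache."""
    lookups = hits + misses
    return hits / lookups if lookups else 0.0

with timed("db"):
    time.sleep(0.01)  # stand-in for a real database call

print(durations_ms)
print(cache_hit_ratio(hits=900, misses=100))
```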

🧠 6. Business & Custom Metrics

Tie observability to product outcomes.

| Metric | Example |
| --- | --- |
| Signups / logins per minute | Track core flows |
| Checkout success rate | Detect conversion issues |
| Order fulfillment lag | E2E pipeline latency |
| User-facing error % | From frontend telemetry |
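
Business metrics are often derived from an event stream rather than from request logs. As a rough illustration, assuming events arrive as `(seconds_since_window_start, event_name)` tuples (a made-up shape and made-up event names), per-minute counts and a checkout success rate could be computed like this:

```python
from collections import Counter

# Hypothetical events: (seconds since the start of the window, event name).
events = [
    (5, "signup"),
    (30, "checkout_started"),
    (40, "checkout_succeeded"),
    (65, "signup"),
    (70, "checkout_started"),
    (110, "signup"),
]

def per_minute(events, name):
    """Count events called `name` per one-minute bucket, keyed by bucket start."""
    return Counter(ts - ts % 60 for ts, event in events if event == name)

def success_rate(events, started, succeeded):
    """Completed flows divided by started flows."""
    names = Counter(event for _, event in events)
    return names[succeeded] / names[started] if names[started] else 0.0

signups = per_minute(events, "signup")
checkout_rate = success_rate(events, "checkout_started", "checkout_succeeded")
print(dict(signups), checkout_rate)
```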

🔍 7. Tracing & Distributed Observability Metrics

For microservices and async systems.

| Metric | Meaning |
| --- | --- |
| Trace latency breakdown | Time spent in each span |
| Span error counts | Failures per service/component |
| Service dependency graph | Identify slow hops in a request path |
| Cross-service correlation | End-to-end flow latency (via trace IDs) |
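
A trace latency breakdown usually means attributing time to each span after subtracting the time spent in its children. The sketch below does this for one hypothetical trace; the tuple layout is an assumption (real tracing systems export richer span objects), and it assumes the children of a span do not overlap in time.

```python
from collections import defaultdict

# Hypothetical spans from one trace: (span_id, parent_id, service, start_ms, end_ms).
spans = [
    ("a", None, "gateway", 0, 120),
    ("b", "a", "orders", 10, 100),
    ("c", "b", "db", 20, 80),
]

def self_time_by_service(spans):
    """Time per service, excluding time covered by each span's direct children."""
    child_total = defaultdict(int)
    for _, parent_id, _, start, end in spans:
        if parent_id is not None:
            child_total[parent_id] += end - start
    result = defaultdict(int)
    for span_id, _, service, start, end in spans:
        result[service] += (end - start) - child_total[span_id]
    return dict(result)

breakdown = self_time_by_service(spans)
print(breakdown)
```

For this trace the 120 ms end-to-end latency splits into 30 ms in the gateway, 30 ms in the orders service, and 60 ms in the database, which points at the database as the slow hop.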

📊 8. SLO/SLI/SLA Metrics

Used for setting reliability targets.

| Category | Example SLO (target on an SLI) |
| --- | --- |
| Availability | ≥99.9% of requests succeed |
| Latency | 95% of requests complete in <200 ms |
| Freshness | Data updated within 1 min |
| Durability | Zero data loss after failure |
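
The distinction between the measurement (SLI) and the target (SLO) is easier to see in code. Taking the latency row above as the example, a minimal sketch with made-up latency samples and hypothetical function names:

```python
def latency_sli(latencies_ms, threshold_ms):
    """SLI: fraction of requests that completed faster than the threshold."""
    if not latencies_ms:
        raise ValueError("no samples")
    good = sum(1 for value in latencies_ms if value < threshold_ms)
    return good / len(latencies_ms)

def meets_slo(sli, target):
    """SLO check: is the measured SLI at or above the target?"""
    return sli >= target

# Hypothetical window: 96 requests at 120 ms, 4 requests at 450 ms.
latencies_ms = [120] * 96 + [450] * 4
sli = latency_sli(latencies_ms, threshold_ms=200)
ok = meets_slo(sli, target=0.95)
print(sli, ok)
```

Here the SLI is 0.96, so the "95% of requests under 200 ms" target is met with one percentage point to spare.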

🕰️ 9. Change & Deployment Metrics

To correlate incidents with changes.

| Metric | Example |
| --- | --- |
| Deploy frequency | How often code is released |
| Change failure rate | % of deployments causing incidents |
| Mean time to recover (MTTR) | Speed of recovery from incidents |
| Mean time between failures (MTBF) | Stability over time |
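
These can be computed directly from deploy and incident records. The sketch below uses made-up records and takes MTBF as the mean uptime between the end of one incident and the start of the next, which is one common convention among several.

```python
from datetime import datetime, timedelta

# Hypothetical records for illustration only.
deploys = [
    {"id": "d1", "caused_incident": False},
    {"id": "d2", "caused_incident": True},
    {"id": "d3", "caused_incident": False},
    {"id": "d4", "caused_incident": False},
]
incidents = [  # (start, resolved), sorted by start time
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 3, 10, 30), datetime(2024, 1, 3, 11, 30)),
    (datetime(2024, 1, 6, 11, 30), datetime(2024, 1, 6, 12, 0)),
]

def mean_delta(deltas):
    """Arithmetic mean of a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)

change_failure_rate = sum(d["caused_incident"] for d in deploys) / len(deploys)
mttr = mean_delta([resolved - start for start, resolved in incidents])
# Uptime gaps: end of each incident to the start of the next one.
mtbf = mean_delta([
    nxt[0] - prev[1] for prev, nxt in zip(incidents, incidents[1:])
])
print(change_failure_rate, mttr, mtbf)
```

With these records the change failure rate is 25%, MTTR is 40 minutes, and MTBF is 60 hours.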