Metrics

🧭 1. Latency & Performance Metrics
🔥 2. Error & Reliability Metrics
⚙️ 3. Throughput & Capacity Metrics
🧩 4. Resource Utilization Metrics
📦 5. Dependency / External Service Metrics
🧠 6. Business & Custom Metrics
🔍 7. Tracing & Distributed Observability Metrics
📊 8. SLO/SLI/SLA Metrics
🕰️ 9. Change & Deployment Metrics

Here’s a structured overview of key observability metrics grouped by what aspect of the system they reveal:

🧭 1. Latency & Performance Metrics

Measure responsiveness and user experience.

Metric	Meaning	Typical Use
p50 / p90 / p95 / p99 latency	50th, 90th, 95th, 99th percentile of response times	Understand normal and tail latency behavior
Average latency (mean)	Simple average of all response times	Quick overview, but hides tail issues
Max latency	Highest recorded latency	Detect severe outliers
Request rate (RPS/QPS)	Requests per second	Load characterization
Saturation	How close a resource (CPU, I/O, thread pool) is to its capacity limit	Detect approaching bottlenecks
Queue length / wait time	Pending requests in queue and how long they wait before being served	Backpressure visibility
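
To make the "mean hides tail issues" point concrete, here is a minimal sketch that compares the mean with percentiles on a made-up sample. The `percentile` helper and the latency values are assumptions for illustration; it uses the nearest-rank method, which is only one of several percentile definitions, and real monitoring systems usually estimate percentiles from histograms instead.

```python
import math
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [2000] * 2

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
    "max": max(latencies_ms),
}
print(summary)
```

In this sample the median and p95 stay at 20 ms and the mean lands near 60 ms, while p99 exposes the 2000 ms outliers that a small share of users actually experience.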

🔥 2. Error & Reliability Metrics

Quantify correctness and stability.

Metric	Meaning	Typical Use
Error rate	% of failed requests	Detect reliability drops
Error budget consumption	Share of the failures the SLO allows (1 − SLO target) that has already been used up	SLO tracking
Availability (%)	Successful requests / total requests × 100	SLI for uptime
Retry rate	% of retried requests	Catch flaky dependencies
Timeouts & circuit breaker opens	Defensive mechanisms triggering	Detect degraded dependencies
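
The relationship between error rate, availability, and error budget can be sketched as follows. The function name, the request counts, and the 99.9% target are assumptions chosen for illustration, and the budget is computed over a single fixed window rather than a rolling one.

```python
def reliability_summary(total, failed, slo_target):
    """Return (error rate, availability, fraction of error budget consumed)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    error_rate = failed / total
    availability = 1 - error_rate
    allowed_error_rate = 1 - slo_target  # the error budget, expressed as a rate
    budget_consumed = error_rate / allowed_error_rate
    return error_rate, availability, budget_consumed

# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% availability SLO.
error_rate, availability, budget_consumed = reliability_summary(1_000_000, 400, 0.999)
print(f"error rate {error_rate:.4%}, availability {availability:.4%}, "
      f"budget consumed {budget_consumed:.0%}")
```

With a 99.9% target, 1,000 failures per million requests are allowed, so 400 failures use roughly 40% of the budget.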

⚙️ 3. Throughput & Capacity Metrics

Measure how much work the system can handle.

Metric	Meaning	Typical Use
Requests per second (RPS)	Volume of incoming traffic	Capacity planning
Processed messages/sec	For queues, Kafka topics, etc.	Measure system throughput
Concurrent connections / sessions	Active clients	Detect overloads
Backlog depth	Pending jobs or queue size	Detect lag in consumers
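
Throughput and backlog are usually derived from counters that only ever increase. The sketch below assumes two hypothetical counter samples taken 60 seconds apart (messages produced and messages consumed); the names and numbers are illustrative only.

```python
def rate_per_second(count_start, count_end, seconds):
    """Average rate between two samples of a monotonically increasing counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (count_end - count_start) / seconds

def backlog_depth(produced_total, consumed_total):
    """Messages produced but not yet consumed."""
    return produced_total - consumed_total

# Hypothetical counter samples taken 60 seconds apart.
produced_start, produced_end = 10_000, 16_000
consumed_start, consumed_end = 9_500, 14_300

produce_rate = rate_per_second(produced_start, produced_end, 60)
consume_rate = rate_per_second(consumed_start, consumed_end, 60)
backlog_before = backlog_depth(produced_start, consumed_start)
backlog_after = backlog_depth(produced_end, consumed_end)

print(f"in {produce_rate}/s, out {consume_rate}/s, "
      f"backlog {backlog_before} -> {backlog_after}")
```

When the produce rate stays above the consume rate, the backlog grows by the difference every second, which is the consumer lag signal the table refers to.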

🧩 4. Resource Utilization Metrics

Track underlying system health.

Resource	Common Metrics
CPU	Usage %, load average, throttling
Memory	Heap usage, GC pauses, OOM events
Disk / I/O	IOPS, read/write latency, disk full %
Network	In/out throughput, packet loss, retransmits
Thread pools	Active vs idle threads, queue size
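
A couple of these values can be read with the Python standard library alone, as a rough sketch. The helper names are made up for this example, `os.getloadavg` is only available on Unix-like systems, and a production setup would normally rely on a metrics agent rather than ad hoc calls like these.

```python
import os
import shutil

def disk_full_percent(path="."):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total * 100

def load_per_cpu():
    """1-minute load average divided by CPU count, or None where unavailable."""
    if not hasattr(os, "getloadavg"):
        return None  # e.g. on Windows
    try:
        one_minute, _, _ = os.getloadavg()
    except OSError:
        return None
    return one_minute / (os.cpu_count() or 1)

print(f"disk full: {disk_full_percent():.1f}%")
print(f"load per CPU: {load_per_cpu()}")
```

A load per CPU that stays above 1.0 suggests more runnable work than the cores can serve, which is a saturation signal rather than a plain usage figure.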

📦 5. Dependency / External Service Metrics

Track downstream impact and SLA compliance.

Metric	Example
Dependency latency	DB, cache, external API call duration
Cache hit ratio	Redis/memcached effectiveness
DB query time	Slow queries, lock contention
Third-party API availability	Detect external failures early
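
One simple way to collect dependency latency is to wrap each outbound call in a timer. The sketch below is an assumption-laden illustration: the `timed` context manager, the in-memory `durations_ms` store, and the `"db"` label are hypothetical, and the `time.sleep` call stands in for a real query.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Dependency name -> list of observed call durations in milliseconds.
durations_ms = defaultdict(list)

@contextmanager
def timed(dependency):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations_ms[dependency].append((time.perf_counter() - start) * 1000)

def cache_hit_ratio(hits, misses):
    """Fraction of lookups served from the cache; 0.0 when there were no lookups."""
    lookups = hits + misses
    return hits / lookups if lookups else 0.0

with timed("db"):
    time.sleep(0.01)  # stand-in for a real database query

print(f"db calls recorded: {len(durations_ms['db'])}")
print(f"cache hit ratio: {cache_hit_ratio(90, 10):.0%}")
```

Recording the duration in a `finally` block means failed and timed-out calls are measured too, which matters because slow failures are often the first sign of a degraded dependency.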

🧠 6. Business & Custom Metrics

Tie observability to product outcomes.

Metric	Example
Signups / logins per minute	Track core flows
Checkout success rate	Detect conversion issues
Order fulfillment lag	E2E pipeline latency
User-facing error %	From frontend telemetry

🔍 7. Tracing & Distributed Observability Metrics

For microservices and async systems.

Metric	Meaning
Trace latency breakdown	Time spent in each span
Span error counts	Failures per service/component
Service dependency graph	Identify slow hops in a request path
Cross-service correlation	End-to-end flow latency (via trace IDs)
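
A per-service latency breakdown and span error count can be derived from raw span records. The `Span` structure below is a simplified, hypothetical shape for illustration and is not the data model of any particular tracing library; the trace itself is invented.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    service: str
    start_ms: float
    end_ms: float
    error: bool = False

def breakdown(spans):
    """Sum span durations and count failed spans per service."""
    time_by_service = Counter()
    errors_by_service = Counter()
    for span in spans:
        time_by_service[span.service] += span.end_ms - span.start_ms
        if span.error:
            errors_by_service[span.service] += 1
    return time_by_service, errors_by_service

# One hypothetical trace: gateway calls orders, which calls the database.
spans = [
    Span("t1", "gateway", 0, 120),
    Span("t1", "orders", 10, 110),
    Span("t1", "db", 20, 90, error=True),
]

time_by_service, errors_by_service = breakdown(spans)
print(dict(time_by_service))
print(dict(errors_by_service))
```

Note that a parent span's duration includes the time spent in its children, so these totals show where time passes along the path rather than each service's own processing time.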

📊 8. SLO/SLI/SLA Metrics

Used for setting reliability targets.

Category	Example SLO target (measured by the matching SLI)
Availability	≥99.9% of requests succeed
Latency	95% of requests <200 ms
Freshness	Data updated within 1 min
Durability	Zero data loss after failure
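
Checking a latency target such as "95% of requests <200 ms" comes down to measuring the indicator and comparing it with the objective. This is a minimal sketch with invented sample data and hypothetical function names; the 200 ms threshold and 95% objective are taken from the example above.

```python
def latency_sli(latencies_ms, threshold_ms):
    """Fraction of requests that completed faster than the threshold."""
    if not latencies_ms:
        raise ValueError("no requests in window")
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return fast / len(latencies_ms)

def meets_slo(sli, objective):
    """True when the measured indicator reaches the objective."""
    return sli >= objective

# Hypothetical window: 96 requests at 50 ms and 4 requests at 500 ms.
window = [50] * 96 + [500] * 4

sli = latency_sli(window, threshold_ms=200)
print(f"SLI = {sli:.2%}, meets 95% objective: {meets_slo(sli, 0.95)}")
```

The measured fraction is the SLI, the 95% figure is the SLO, and an SLA would be the contractual commitment built on top of that objective.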

🕰️ 9. Change & Deployment Metrics

To correlate incidents with changes.

Metric	Example
Deploy frequency	How often code is released
Change failure rate	% of deployments causing incidents
Mean time to recover (MTTR)	Average time to restore service after an incident
Mean time between failures (MTBF)	Average operating time between consecutive failures
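
These figures can be computed from a deployment count and an incident log. The sketch below uses invented data, and it assumes one particular definition of MTBF (the gap from one incident's resolution to the next incident's start); teams define it in slightly different ways.

```python
from datetime import datetime, timedelta

def mean_timedelta(deltas):
    """Arithmetic mean of a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)

# Hypothetical data: 20 deployments, 3 of which caused an incident.
deployments, failed_deployments = 20, 3
change_failure_rate = failed_deployments / deployments

# Hypothetical incident log as (started, resolved) pairs, in chronological order.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 8, 12, 0), datetime(2024, 1, 8, 13, 30)),
]

# MTTR: average of (resolved - started) across incidents.
mttr = mean_timedelta([resolved - started for started, resolved in incidents])

# MTBF (assumed definition): average gap between a resolution and the next start.
gaps = [nxt[0] - prev[1] for prev, nxt in zip(incidents, incidents[1:])]
mtbf = mean_timedelta(gaps)

print(f"change failure rate: {change_failure_rate:.0%}")
print(f"MTTR: {mttr}, MTBF: {mtbf}")
```

Plotting these values next to the deploy timeline makes it easier to see whether an incident followed a specific change.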