Monitoring
Metrics, logs, traces, SLOs, alerting, and observability practices.
What is the difference between monitoring, logging, and tracing?
Monitoring tracks metrics to detect problems, logging records discrete events, and tracing follows a single request across services.
How does Prometheus collect metrics?
Prometheus uses a pull model: it periodically scrapes HTTP endpoints (conventionally /metrics) exposed by its targets.
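To make the scrape model concrete, here is a minimal sketch of the kind of plain-text body a target might serve for Prometheus to scrape. The function name render_metrics and the sample metric names are hypothetical, and label values are not escaped, so treat it as an illustration of the text exposition format rather than a replacement for an official client library.

```python
def render_metrics(samples):
    """Render samples as lines in the Prometheus text exposition format.

    `samples` is a list of (metric_name, labels_dict, value) tuples.
    Assumption: label values contain no quotes, backslashes, or newlines,
    so no escaping is performed in this sketch.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            # Labels are written as key="value" pairs inside braces.
            label_text = ",".join(
                f'{key}="{val}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    # The body ends with a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics([
        ("http_requests_total", {"method": "GET", "code": "200"}, 1027),
        ("up", {}, 1),
    ]), end="")
```

A real service would serve this text over HTTP and let Prometheus fetch it on each scrape interval.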
What is the difference between SLI, SLO, and SLA?
SLI is a measured indicator, SLO is a target for that indicator, and SLA is a formal agreement with consequences.
What is alert fatigue?
Alert fatigue happens when teams receive too many noisy or low-value alerts and begin ignoring them.
What are the four golden signals?
Latency, traffic, errors, and saturation.
What is the difference between black-box and white-box monitoring?
Black-box monitoring tests externally visible behavior as a user would see it, while white-box monitoring observes internal metrics and state.
What is observability?
Observability is the ability to understand a system's internal behavior from its outputs, such as metrics, logs, and traces.
What are metrics?
Metrics are numeric measurements collected over time.
What are logs?
Logs are timestamped records of events produced by systems or applications.
What are traces?
Traces show how a request moves through multiple services.
What is time-series data?
Time-series data is data indexed by time.
What is an alert?
An alert is a notification triggered when a monitored condition is met.
What is a threshold-based alert?
It triggers when a metric goes above or below a fixed value.
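A threshold check can be sketched in a few lines. The function name threshold_alert and its direction parameter are hypothetical; real alerting systems usually also require the condition to hold for some duration before firing.

```python
def threshold_alert(value, threshold, direction="above"):
    """Return True when `value` crosses a fixed `threshold`.

    `direction` selects whether the alert fires for values above or
    below the threshold.
    """
    if direction == "above":
        return value > threshold
    if direction == "below":
        return value < threshold
    raise ValueError("direction must be 'above' or 'below'")


if __name__ == "__main__":
    # CPU utilization of 93% against a 90% threshold fires the alert.
    print(threshold_alert(0.93, 0.90))
    # Free disk of 10 GB against a 20 GB floor fires the alert.
    print(threshold_alert(10, 20, direction="below"))
```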
What is the difference between symptom-based and cause-based alerting?
Symptom-based alerting focuses on user-visible impact, while cause-based alerting focuses on possible internal reasons.
What is Prometheus?
Prometheus is an open-source monitoring system for collecting and querying metrics.
What is PromQL?
PromQL is the query language used by Prometheus to analyze metrics.
What is an exporter in Prometheus?
An exporter exposes metrics from a system or service in a format Prometheus can scrape.
What does node_exporter do?
node_exporter exposes hardware and operating system metrics from Linux and other Unix-like hosts, such as CPU, memory, disk, and network usage.
What is Alertmanager?
Alertmanager receives alerts fired by Prometheus and deduplicates, groups, silences, and routes them to receivers such as email or chat.
What is Grafana?
Grafana is a platform for visualizing metrics, logs, and observability data.
Why are dashboards useful in monitoring?
Dashboards provide a visual overview of system health and trends.
What is high cardinality in monitoring?
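High cardinality means a metric has a very large number of unique label combinations. Each combination is stored as a separate time series, so memory use and query cost grow with it.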
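A rough way to see why a single label can blow up cardinality is to multiply the number of distinct values per label. The helper max_series_count and the label names below are hypothetical, and the result is only an upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod


def max_series_count(label_values):
    """Upper bound on time series for one metric.

    `label_values` maps each label name to the collection of values it
    can take. Every unique combination is a separate series.
    """
    return prod(len(values) for values in label_values.values())


if __name__ == "__main__":
    labels = {
        "method": ["GET", "POST", "PUT", "DELETE"],
        "status": ["200", "301", "404", "500", "503"],
    }
    print(max_series_count(labels))  # 4 * 5 combinations

    # Adding an unbounded label such as a user ID multiplies the total.
    labels["user_id"] = range(10_000)
    print(max_series_count(labels))
```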
What is sampling in observability?
Sampling means collecting only a subset of events or traces instead of all of them.
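One common approach is to decide per trace, based on a hash of the trace ID, so that every service makes the same keep-or-drop decision for a given request. The sketch below assumes string trace IDs; the function name keep_trace is hypothetical and this is not the algorithm of any particular tracing library.

```python
import hashlib


def keep_trace(trace_id, sample_rate):
    """Deterministically keep roughly `sample_rate` of all traces.

    The trace ID is hashed to a 64-bit integer and compared against a
    cutoff, so the same ID always yields the same decision.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # value in [0, 2**64)
    return bucket < sample_rate * 2**64


if __name__ == "__main__":
    kept = sum(keep_trace(f"trace-{i}", 0.1) for i in range(10_000))
    print(f"kept {kept} of 10000 traces")
```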
What is an SLI?
An SLI is a Service Level Indicator, a measured value that reflects service behavior.
What is an SLO?
An SLO is a Service Level Objective, a target value for an SLI.
What is an error budget?
An error budget is the amount of unreliability allowed while still meeting an SLO. For example, a 99.9% availability SLO leaves an error budget of 0.1% of requests.
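The arithmetic behind an error budget can be sketched as follows, assuming a request-based availability SLI. The function name error_budget and the example numbers are hypothetical.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (sli, allowed_failures, remaining_failures).

    `slo_target` is a fraction such as 0.999. The SLI here is the share
    of requests that succeeded.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    sli = (total_requests - failed_requests) / total_requests
    # The budget is whatever the SLO does not require to succeed.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining_failures = allowed_failures - failed_requests
    return sli, allowed_failures, remaining_failures


if __name__ == "__main__":
    sli, allowed, remaining = error_budget(0.999, 1_000_000, 600)
    print(f"SLI={sli:.4%} allowed={allowed:.0f} remaining={remaining:.0f}")
```

A negative remaining value would mean the SLO has already been missed for the period.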
Why are percentiles like p95 or p99 used for latency?
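Percentiles describe the tail of the latency distribution: p95 is the latency that 95% of requests stay at or below. Averages can hide a slow minority of requests that percentiles expose.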
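A small sketch shows how an average can hide tail latency. It uses the nearest-rank percentile definition, which is one of several in use; the function name percentile and the sample latencies are made up for illustration.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # 90 fast requests (100 ms) and 10 slow ones (2000 ms).
    latencies_ms = [100] * 90 + [2000] * 10
    print("mean:", statistics.mean(latencies_ms))
    print("p50:", percentile(latencies_ms, 50))
    print("p95:", percentile(latencies_ms, 95))
```

Here the median suggests everything is fast, while p95 reveals that the slowest requests take twenty times longer.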
What is an uptime check?
An uptime check verifies that a service is reachable and responding.
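A very basic HTTP uptime check might look like the sketch below, using only the Python standard library. The function name is_up, the timeout default, and the choice to treat 2xx and 3xx as healthy are assumptions; production checks usually also validate response content and run from several locations.

```python
import urllib.request


def is_up(url, timeout=5.0):
    """Return True if `url` answers with a 2xx or 3xx status in time.

    urlopen raises HTTPError for 4xx/5xx responses and URLError for
    connection failures; both are subclasses of OSError, as are timeouts.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical local target; nothing is expected to listen on port 1.
    print(is_up("http://127.0.0.1:1/", timeout=1.0))
```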
What is blackbox_exporter in Prometheus?
blackbox_exporter probes endpoints from the outside over protocols such as HTTP, HTTPS, TCP, DNS, and ICMP.
What is a service map?
A service map shows dependencies and traffic flow between services.
Why is correlation between metrics, logs, and traces valuable?
Correlation helps engineers move from symptoms to root cause faster.
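One simple way to enable correlation is to stamp every log record with the trace ID of the request that produced it, so logs can be filtered by the same ID seen in a trace. The helper names and field names below are hypothetical and assume JSON-formatted log lines.

```python
import json


def make_log_line(message, trace_id, **fields):
    """Build a JSON log line that carries the request's trace ID."""
    record = {"message": message, "trace_id": trace_id, **fields}
    return json.dumps(record, sort_keys=True)


def logs_for_trace(lines, trace_id):
    """Return the parsed log records that belong to one trace."""
    records = (json.loads(line) for line in lines)
    return [record for record in records if record.get("trace_id") == trace_id]


if __name__ == "__main__":
    lines = [
        make_log_line("db timeout", "t-1", service="orders"),
        make_log_line("request ok", "t-2", service="cart"),
        make_log_line("retrying", "t-1", service="orders"),
    ]
    for record in logs_for_trace(lines, "t-1"):
        print(record["service"], record["message"])
```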
What is burn rate alerting?
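Burn rate alerting fires when the error budget is being consumed faster than the SLO allows. A burn rate of 1 uses up the budget exactly at the end of the SLO window; higher values exhaust it sooner.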
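Burn rate can be computed as the observed error ratio divided by the error budget. The sketch below also shows a two-window check, where both a short and a long window must exceed the threshold before paging. The function names, window choice, and threshold value are assumptions for illustration, not a prescribed policy.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_page(short_window_ratio, long_window_ratio, slo_target, threshold):
    """Page only if both windows burn faster than `threshold`.

    Requiring the short window as well helps the alert stop soon after
    the problem is fixed.
    """
    return (
        burn_rate(short_window_ratio, slo_target) >= threshold
        and burn_rate(long_window_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # 1% errors against a 99.9% SLO burns budget about 10x too fast.
    print(round(burn_rate(0.01, 0.999), 2))
    print(should_page(0.02, 0.02, 0.999, threshold=14.4))
```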
What is synthetic monitoring?
Synthetic monitoring uses scripted checks to simulate user actions and validate service behavior.
What is APM?
APM stands for Application Performance Monitoring, which tracks application-level behavior such as response times, error rates, and request traces.
What is white-box monitoring?
White-box monitoring observes internal metrics and state from inside a system.
How do you reduce noisy alerts?
Reduce noisy alerts by improving thresholds, focusing on actionable symptoms, and using deduplication or silencing.
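Deduplication can be sketched as collapsing alerts that share the same name and label set. The alert dictionary shape and the function name deduplicate are hypothetical; tools such as Alertmanager implement their own, more complete grouping logic.

```python
def deduplicate(alerts):
    """Keep the first alert for each unique (name, labels) fingerprint."""
    seen = set()
    unique = []
    for alert in alerts:
        # Sort labels so that key order does not affect the fingerprint.
        fingerprint = (alert["name"], tuple(sorted(alert["labels"].items())))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(alert)
    return unique


if __name__ == "__main__":
    alerts = [
        {"name": "HighLatency", "labels": {"service": "api", "env": "prod"}},
        {"name": "HighLatency", "labels": {"env": "prod", "service": "api"}},
        {"name": "HighLatency", "labels": {"service": "web", "env": "prod"}},
    ]
    for alert in deduplicate(alerts):
        print(alert["name"], alert["labels"]["service"])
```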