Metric Meanings and Sampling Windows
Who this is for
Users who want to understand exactly what each metric in the monitoring dashboard measures, and what a healthy value looks like for each.
What you will complete
Learn what CPU, RAM, disk, load average, and network metrics mean, what healthy ranges look like, and how sampling windows affect what you see.
Metric: CPU Usage
What it measures: The percentage of CPU capacity being used across all processor cores, averaged over the sampling interval.
Unit: Percentage (0–100%)
Healthy range:
- Under 70%: normal for most workloads
- 70–85%: elevated but typically not urgent — monitor closely
- Above 85% sustained: investigate and consider scaling
- 100% sustained: critical — services may become unresponsive
What causes spikes: Deployments, backups, database queries, PHP processes, cron jobs. Short spikes (under 2 minutes) during these operations are normal.
What to watch for: Sustained CPU above 85% outside of known operations. Sudden CPU spikes with no corresponding deployment or backup activity.
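The CPU bands above can be expressed as a small lookup. The sketch below is illustrative only: the function names `classify_cpu` and `sustained_above` are hypothetical, and treating "sustained" as three consecutive 60-second samples (just over the 2-minute spike allowance) is an assumption, not a documented platform rule.

```python
def classify_cpu(percent: float) -> str:
    """Map a CPU percentage to the bands listed in this section."""
    if not 0 <= percent <= 100:
        raise ValueError("CPU usage must be between 0 and 100")
    if percent >= 100:
        return "critical"
    if percent > 85:
        return "investigate"
    if percent >= 70:
        return "elevated"
    return "normal"


def sustained_above(samples: list[float], threshold: float = 85.0,
                    min_samples: int = 3) -> bool:
    """True when the most recent `min_samples` readings all exceed `threshold`.

    Assumption: with one sample per minute, three samples in a row
    means the condition has outlasted a short, normal spike.
    """
    if len(samples) < min_samples:
        return False
    return all(s > threshold for s in samples[-min_samples:])
```

A single high reading during a deployment would classify as "investigate" on its own, but `sustained_above` stays false until the following samples are also high.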
Metric: RAM Usage
What it measures: The percentage of total available memory currently in use by processes.
Unit: Percentage (0–100%)
Healthy range:
- Under 75%: normal
- 75–90%: elevated — verify no memory leak is occurring
- Above 90%: high — processes may be killed by the OS if RAM is exhausted. Investigate immediately.
What causes high RAM: Accumulating PHP-FPM workers, memory leaks in long-running processes, database query caches, or a server that is undersized for the workload.
Note: Linux uses otherwise idle memory for disk caching and releases it when processes need it, so RAM usage of 70–80% is normal and does not indicate a problem on its own. The pattern to watch for is RAM that climbs continuously over time without dropping back, which suggests a memory leak.
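One rough way to tell a steady climb from ordinary fluctuation is to fit a straight line through recent RAM readings and look at its slope. This is a sketch under stated assumptions: the helper names and the `min_slope` cutoff of 0.5 percentage points per sample are hypothetical and would need tuning for a real workload.

```python
def slope_per_sample(values: list[float]) -> float:
    """Least-squares slope of the readings, in percentage points per sample."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two readings")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(values: list[float], min_slope: float = 0.5) -> bool:
    """Flag a continuous upward trend. `min_slope` is an illustrative cutoff."""
    return slope_per_sample(values) >= min_slope
```

Readings of 60, 61, 62, 63, 64 give a slope of 1.0 and are flagged, while readings that bounce between 70 and 72 give a slope near zero and are not.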
Metric: Disk Usage
What it measures: The percentage of total disk capacity currently used by files and data.
Unit: Percentage (0–100%)
Healthy range:
- Under 70%: normal
- 70–85%: elevated — plan a cleanup or disk expansion soon
- Above 85%: warning — take action within days
- Above 95%: critical — services may fail if disk becomes full
What fills disk: Application logs (the most common cause), database data growth, backup files, uploaded media, deployment artifacts.
Important: Disk usage only goes down when you delete files. It does not self-heal. Once a server runs out of disk space, web servers, databases, and the monitoring agent itself may stop functioning.
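To check the same figure directly on a server, Python's standard `shutil.disk_usage` reports total and used bytes for a path. The sketch below applies the bands from this section; the function names are hypothetical, and the result may differ slightly from the dashboard depending on how the agent counts reserved blocks.

```python
import shutil


def disk_percent_used(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def classify_disk(percent: float) -> str:
    """Map a disk percentage to the bands listed in this section."""
    if percent > 95:
        return "critical"
    if percent > 85:
        return "warning"
    if percent >= 70:
        return "elevated"
    return "normal"
```

For example, `classify_disk(disk_percent_used("/var/log"))` would report on the filesystem that holds application logs, assuming that path exists on the server.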
Metric: Load Average
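What it measures: The average number of processes running on or waiting for a CPU over the past 1 minute, 5 minutes, and 15 minutes. On Linux, the count also includes processes blocked in uninterruptible I/O waits. Load average is different from CPU percentage: it measures demand for the CPU, including the queue of waiting processes, rather than utilization alone.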
Unit: Unitless number (e.g., 0.80, 2.40, 5.10)
Interpreting load average:
- A load of 1.0 on a single-core server means the CPU is 100% utilized with no waiting processes.
- On a 4-core server, a load of 4.0 is 100% utilization. A load of 2.0 is 50%.
- Rule of thumb: Load average should be at or below the number of CPU cores on the server for healthy operation.
Healthy range:
- Load / CPU cores below 0.75: healthy
- Load / CPU cores 0.75–1.0: normal peak load
- Load / CPU cores above 1.0 sustained: processes are waiting — investigate
- Load / CPU cores above 2.0 sustained: server is overloaded
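The per-core arithmetic above can be sketched as follows. The function names are hypothetical; `os.getloadavg` and `os.cpu_count` are standard library calls, and `os.getloadavg` is only available on Unix-like systems.

```python
import os


def load_per_core(load: float, cores: int) -> float:
    """Divide a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


def classify_load(load: float, cores: int) -> str:
    """Map load per core to the bands listed in this section."""
    ratio = load_per_core(load, cores)
    if ratio > 2.0:
        return "overloaded"
    if ratio > 1.0:
        return "investigate"
    if ratio >= 0.75:
        return "normal peak"
    return "healthy"


def current_load_status() -> str:
    """Classify this machine's 5-minute load average (Unix only)."""
    _, five_minute, _ = os.getloadavg()
    return classify_load(five_minute, os.cpu_count() or 1)
```

A load of 2.0 on a 4-core server gives 0.5 per core and classifies as "healthy", while the same load of 2.0 on a single core gives 2.0 per core and classifies as "investigate".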
The three numbers (1m / 5m / 15m):
- 1m load spiking but 15m normal: short burst, likely fine
- All three numbers are high: sustained overload — requires immediate attention
- 1m well below a still-high 15m: the situation is recovering
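The three patterns above can be written as a simple comparison. This is a minimal sketch: the function name `describe_trend` is hypothetical, and treating "high" as "above the number of CPU cores" is an assumption taken from the rule of thumb earlier in this section.

```python
def describe_trend(one: float, five: float, fifteen: float, cores: int) -> str:
    """Interpret the 1m / 5m / 15m load averages relative to core count."""
    limit = float(cores)  # assumption: "high" means more load than cores
    if one > limit and five > limit and fifteen > limit:
        return "sustained overload"
    if one > limit and fifteen <= limit:
        return "short burst"
    if one <= limit and fifteen > limit:
        return "recovering"
    return "steady"
```

On a 4-core server, loads of 6.0 / 2.0 / 1.0 read as a short burst, 5.1 / 4.8 / 4.5 as sustained overload, and 1.0 / 3.0 / 5.0 as recovering.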
Metric: Network I/O
What it measures: Inbound and outbound network throughput on the server's primary network interface.
Unit: Kilobits per second (Kbps)
Healthy range: Depends entirely on your workload. A server receiving web traffic naturally shows higher inbound Kbps, and a running backup shows high outbound Kbps.
What to watch for: Sustained high network I/O with no corresponding workload (may indicate unintended traffic or a compromised server). Sudden drops to zero (may indicate a network or server outage).
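For readers comparing the dashboard with raw interface counters, the sketch below converts a change in a byte counter into Kbps. It assumes throughput is derived from two counter readings taken a known number of seconds apart and that 1 Kbps means 1,000 bits per second; how the agent actually computes the value is not documented here, and the function name `kbps` is hypothetical.

```python
def kbps(bytes_before: int, bytes_after: int, seconds: float) -> float:
    """Average throughput in kilobits per second between two counter readings."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if bytes_after < bytes_before:
        # Interface counters reset on reboot or overflow; skip that interval.
        raise ValueError("counter went backwards; discard this interval")
    bits = (bytes_after - bytes_before) * 8
    return bits / 1000 / seconds
```

For example, 7,500,000 bytes transferred in 60 seconds works out to 1,000 Kbps.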
Sampling windows
CloudAIPilot's monitoring agent sends metrics to the platform every 60 seconds. Charts covering longer time ranges display aggregated data:
| Chart time range | Data resolution | Sample interval |
|---|---|---|
| 1 hour view | 1-minute samples | Every 60 seconds |
| 6 hour view | 5-minute samples | Aggregated every 5 minutes |
| 24 hour view | 15-minute samples | Aggregated every 15 minutes |
| 7 day view | 1-hour samples | Aggregated hourly |
What this means for alerts: Alert rules evaluate the most recent 60-second sample. If your server's CPU spikes to 100% for 30 seconds and drops back, the alert may or may not fire depending on the sample timing.
What this means for charts: A spike that lasted only 45 seconds may appear smaller or be missed entirely in the 6-hour view. Use the 1-hour view for precise incident investigation.
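The effect on charts can be illustrated by downsampling an hour of 1-minute samples into 5-minute points. The table does not say how samples are aggregated, so this sketch assumes a simple mean; the function name `downsample` is hypothetical.

```python
from statistics import mean
from typing import Callable


def downsample(samples: list[float], bucket_size: int,
               reducer: Callable[[list[float]], float] = mean) -> list[float]:
    """Collapse consecutive groups of `bucket_size` samples into one value each."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    return [
        reducer(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


# One hour of 1-minute CPU samples at 20%, with a single 95% spike.
one_minute = [20.0] * 60
one_minute[17] = 95.0

five_minute = downsample(one_minute, 5)  # what a 6-hour view would plot
```

Under that assumption, the 95% reading is averaged with four 20% readings and appears as 35% in the 5-minute series, which is why the 1-hour view is the better choice for incident investigation.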