Metrics Meanings and Sampling Windows

Published · Last updated: May 2026 · 4 min read

Who this is for

Users who want to understand exactly what each metric in the monitoring dashboard measures, and what a healthy value looks like for each.

What you will complete

Learn what CPU, RAM, disk, load average, and network metrics mean, what healthy ranges look like, and how sampling windows affect what you see.

Metric: CPU Usage

What it measures: The percentage of CPU capacity being used across all processor cores, averaged over the sampling interval.

Unit: Percentage (0–100%)

Healthy range:

Under 70%: normal for most workloads
70–85%: elevated but typically not urgent — monitor closely
Above 85% sustained: investigate and consider scaling
100% sustained: critical — services may become unresponsive

What causes spikes: Deployments, backups, database queries, PHP processes, cron jobs. Short spikes (under 2 minutes) during these operations are normal.

What to watch for: Sustained CPU above 85% outside of known operations. Sudden CPU spikes with no corresponding deployment or backup activity.
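
As a rough illustration, the bands above and the idea of a "sustained" reading can be sketched in Python. The function names, and the choice of three consecutive 60-second readings as the cutoff for "sustained", are assumptions made for this example and are not part of the dashboard.

```python
from typing import Sequence


def cpu_band(percent: float) -> str:
    """Map a CPU percentage to the bands listed in this article."""
    if percent >= 100:
        return "critical"
    if percent > 85:
        return "investigate"
    if percent >= 70:
        return "elevated"
    return "normal"


def sustained_above(
    samples: Sequence[float], threshold: float = 85.0, min_samples: int = 3
) -> bool:
    """True when the most recent `min_samples` readings all exceed `threshold`.

    Assumption: one reading per 60 seconds, so 3 readings span more than
    the 2 minutes that this article treats as a short, normal spike.
    """
    recent = list(samples)[-min_samples:]
    return len(recent) == min_samples and all(s > threshold for s in recent)
```

For example, sustained_above([40, 90, 95, 99]) is True, while a single high reading followed by a drop is not treated as sustained.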

Metric: RAM Usage

What it measures: The percentage of total available memory currently in use by processes.

Unit: Percentage (0–100%)

Healthy range:

Under 75%: normal
75–90%: elevated — verify no memory leak is occurring
Above 90%: high — processes may be killed by the OS if RAM is exhausted. Investigate immediately.

What causes high RAM: PHP-FPM workers accumulating, memory leaks in long-running processes, database query caches, or a server that is underpowered for the workload.

Note: Linux systems typically use otherwise idle memory for disk caching, so a reading of 70–80% is often normal and does not indicate a problem on its own. Watch instead for RAM that climbs continuously over time, which is the typical memory leak pattern.
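
One way to tell a steady climb apart from a high but stable reading is a simple trend check over recent RAM samples. The sketch below is a heuristic for illustration only: the function name and the 10-point minimum rise are assumptions, and a real leak investigation would look at per-process memory as well.

```python
from typing import Sequence


def looks_like_leak(samples: Sequence[float], min_rise: float = 10.0) -> bool:
    """Heuristic: RAM % never falls between readings and rises overall.

    `samples` are RAM percentages in time order. `min_rise` (hypothetical
    default) is the minimum total increase, in percentage points, before
    the pattern is flagged.
    """
    if len(samples) < 3:
        return False  # too few readings to call it a trend
    never_falls = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return never_falls and (samples[-1] - samples[0]) >= min_rise
```

A series such as 60, 63, 67, 72, 78 is flagged, while a series that rises and falls with normal activity is not.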

Metric: Disk Usage

What it measures: The percentage of total disk capacity currently used by files and data.

Unit: Percentage (0–100%)

Healthy range:

Under 70%: normal
70–85%: elevated — plan a cleanup or disk expansion soon
Above 85%: warning — take action within days
Above 95%: critical — services may fail if disk becomes full

What fills disk: Application logs (the most common cause), database data growth, backup files, uploaded media, deployment artifacts.

Important: Disk usage only goes down when you delete files. It does not self-heal. Once a server runs out of disk space, web servers, databases, and the monitoring agent itself may stop functioning.
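
If you want to check disk usage yourself on the server, the following Python sketch reads it with the standard library and maps it to the bands above. It assumes usage is calculated as used / total for a single path. The platform's own calculation may differ slightly, for example in how it counts reserved blocks.

```python
import shutil


def disk_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0  # guard against pseudo-filesystems that report no capacity
    return usage.used / usage.total * 100


def disk_band(percent: float) -> str:
    """Map a disk percentage to the bands listed in this article."""
    if percent > 95:
        return "critical"
    if percent > 85:
        return "warning"
    if percent >= 70:
        return "elevated"
    return "normal"
```

Calling disk_band(disk_percent("/")) returns the band for the root filesystem.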

Metric: Load Average

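What it measures: The average number of processes running on or waiting for a CPU over the past 1 minute, 5 minutes, and 15 minutes. On Linux, the count also includes processes blocked in uninterruptible I/O wait. Load average is different from CPU percentage: it measures demand for the CPU, including the queue of waiting processes, rather than how busy the CPU is.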

Unit: Unitless number (e.g., 0.80, 2.40, 5.10)

Interpreting load average:

A load of 1.0 on a single-core server means the CPU is 100% utilized with no waiting processes.
On a 4-core server, a load of 4.0 is 100% utilization. A load of 2.0 is 50%.
Rule of thumb: Load average should be at or below the number of CPU cores on the server for healthy operation.

Healthy range:

Load / CPU cores below 0.75: healthy
Load / CPU cores 0.75–1.0: normal peak load
Load / CPU cores above 1.0 sustained: processes are waiting — investigate
Load / CPU cores above 2.0 sustained: server is overloaded
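
The per-core calculation can be sketched in Python as follows. os.getloadavg() and os.cpu_count() are standard library calls (getloadavg is available on Unix-like systems only). The function names and band labels are chosen for this example.

```python
import os


def load_band(load: float, cores: int) -> str:
    """Classify a load average by its ratio to the number of CPU cores."""
    ratio = load / cores
    if ratio > 2.0:
        return "overloaded"
    if ratio > 1.0:
        return "waiting"
    if ratio >= 0.75:
        return "normal peak"
    return "healthy"


def current_load_band() -> str:
    """Classify the live 1-minute load average of this machine (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return load_band(one_minute, cores)
```

For example, a load of 3.0 on a 4-core server gives a ratio of 0.75, which falls in the normal peak band.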

The three numbers (1m / 5m / 15m):

1m load spiking but 15m normal: short burst, likely fine
All three numbers are high: sustained overload — requires immediate attention
1m lower than a high 15m: load is falling and the situation is recovering
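
The three patterns above can be expressed as a small decision function. This is a sketch under one assumption: "high" means a per-core load above 1.0, which matches the healthy range in this article but is not an official threshold. The "steady" label for everything else is also an addition for the example.

```python
def load_trend(one: float, five: float, fifteen: float, cores: int) -> str:
    """Interpret the 1m / 5m / 15m load averages together."""

    def high(value: float) -> bool:
        # Assumption: "high" means more than 1.0 per CPU core.
        return value / cores > 1.0

    if high(one) and high(five) and high(fifteen):
        return "sustained overload"
    if high(one) and not high(fifteen):
        return "short burst"
    if high(fifteen) and one < fifteen:
        return "recovering"
    return "steady"
```

On a 4-core server, load_trend(5.0, 1.2, 0.8, 4) returns "short burst" and load_trend(1.0, 3.0, 5.0, 4) returns "recovering".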

Metric: Network I/O

What it measures: Inbound and outbound network throughput on the server's primary network interface.

Unit: Kilobits per second (Kbps)

Healthy range: Depends entirely on your workload. A server receiving web traffic naturally shows higher inbound Kbps. A backup running will show high outbound Kbps.

What to watch for: Sustained high network I/O with no corresponding workload (may indicate unintended traffic or a compromised server). Sudden drops to zero (may indicate a network or server outage).
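
For reference, a Kbps figure is typically derived from two readings of an interface's byte counter. The sketch below shows the arithmetic only. It assumes 1 kilobit = 1,000 bits and a counter that does not reset between readings. How the monitoring agent collects its counters is not documented here.

```python
def throughput_kbps(bytes_before: int, bytes_after: int, seconds: float) -> float:
    """Average throughput in kilobits per second between two counter readings."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    bits = (bytes_after - bytes_before) * 8  # 8 bits per byte
    return bits / 1000 / seconds  # assumption: 1 kilobit = 1,000 bits
```

For example, 125,000 bytes transferred in one second is 1,000 Kbps, and an unchanged counter over a full interval is 0 Kbps, the "drop to zero" case described above.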

Sampling windows

CloudAIPilot's monitoring agent sends metrics to the platform at regular intervals. The resolution you see depends on the chart's time range:

Chart time range	Data resolution	Sample interval
1 hour view	1-minute samples	Every 60 seconds
6 hour view	5-minute samples	Aggregated every 5 minutes
24 hour view	15-minute samples	Aggregated every 15 minutes
7 day view	1-hour samples	Aggregated hourly

What this means for alerts: Alert rules evaluate the most recent 60-second sample. If your server's CPU spikes to 100% for 30 seconds and drops back, the alert may or may not fire depending on the sample timing.

What this means for charts: A spike that lasted only 45 seconds may appear smaller or be missed entirely in the 6-hour view. Use the 1-hour view for precise incident investigation.
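
To see why short spikes shrink in wider views, the following Python sketch averages a simulated per-second CPU series into 1-minute and 5-minute windows. It assumes aggregation is a plain average. The platform may aggregate differently (for example by maximum), in which case the numbers would not match.

```python
from typing import List, Sequence


def window_averages(values: Sequence[float], size: int) -> List[float]:
    """Average `values` over consecutive, non-overlapping windows of `size`."""
    return [
        sum(values[i:i + size]) / size
        for i in range(0, len(values) - size + 1, size)
    ]


# Simulated data: 10 minutes at 20% CPU with a 45-second burst at 100%
# that starts 90 seconds in, so it straddles two 1-minute windows.
per_second = [20] * 600
for t in range(90, 135):
    per_second[t] = 100

one_minute = window_averages(per_second, 60)    # 1-hour view resolution
five_minute = window_averages(per_second, 300)  # 6-hour view resolution

print(one_minute[1], one_minute[2])  # 60.0 40.0
print(five_minute[0])                # 32.0
```

Under this averaging assumption, the 45-second burst at 100% appears as 60% and 40% in two adjacent 1-minute samples and as 32% in the 5-minute sample, so neither view shows the true peak.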
