Metric Meanings and Sampling Windows

Who this is for

Users who want to understand exactly what each metric in the monitoring dashboard measures, and what a healthy value looks like for each.

What you will complete

Learn what CPU, RAM, disk, load average, and network metrics mean, what healthy ranges look like, and how sampling windows affect what you see.


Metric: CPU Usage

What it measures: The percentage of CPU capacity being used across all processor cores, averaged over the sampling interval.

Unit: Percentage (0–100%)

Healthy range:

  • Under 70%: normal for most workloads
  • 70–85%: elevated but typically not urgent — monitor closely
  • Above 85% sustained: investigate and consider scaling
  • 100% sustained: critical — services may become unresponsive

What causes spikes: Deployments, backups, database queries, PHP processes, cron jobs. Short spikes (under 2 minutes) during these operations are normal.

What to watch for: Sustained CPU above 85% outside of known operations. Sudden CPU spikes with no corresponding deployment or backup activity.
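
The bands above can be sketched in a few lines of Python. This is an illustration only, not the platform's alerting logic: the function names are hypothetical, and the "sustained" check assumes one reading per minute, so three consecutive high readings span at least two minutes.

```python
def cpu_band(percent):
    """Map a CPU percentage to the bands listed in this article."""
    if percent >= 100:
        return "critical"
    if percent > 85:
        return "investigate"
    if percent >= 70:
        return "elevated"
    return "normal"


def is_sustained_high(samples, threshold=85.0, min_samples=3):
    """Return True if the last `min_samples` readings all exceed `threshold`.

    Assumes `samples` holds one reading per minute, oldest first.
    """
    recent = samples[-min_samples:]
    return len(recent) == min_samples and all(s > threshold for s in recent)
```

A single reading of 95% during a deployment would be classed as "investigate" by cpu_band, but is_sustained_high stays False until the readings remain above the threshold for several samples in a row.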


Metric: RAM Usage

What it measures: The percentage of total available memory currently in use by processes.

Unit: Percentage (0–100%)

Healthy range:

  • Under 75%: normal
  • 75–90%: elevated — verify no memory leak is occurring
  • Above 90%: high — processes may be killed by the OS if RAM is exhausted. Investigate immediately.

What causes high RAM: PHP-FPM workers accumulating, memory leaks in long-running processes, database query caches, or a server that is undersized for the workload.

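Note: Linux uses otherwise idle memory for disk caching and releases that cache when applications need it. If the reported figure includes cache, readings of 70–80% are normal and do not indicate a problem on their own. The more reliable warning sign is RAM that climbs continuously over time without dropping back (a memory leak pattern).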
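
To see how much of a high reading is disk cache, you can compare two figures from /proc/meminfo on a Linux server. The sketch below is an illustration under assumptions: the helper names are hypothetical, the sample text uses made-up numbers, and the dashboard's own calculation is not documented here and may differ.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_percentages(info):
    """Return (percent used counting cache, percent used excluding cache)."""
    total = info["MemTotal"]
    including_cache = 100.0 * (total - info["MemFree"]) / total
    excluding_cache = 100.0 * (total - info["MemAvailable"]) / total
    return including_cache, excluding_cache


# Made-up sample values for illustration only.
SAMPLE = """MemTotal:        8000000 kB
MemFree:         1600000 kB
MemAvailable:    4800000 kB
"""

including, excluding = memory_percentages(parse_meminfo(SAMPLE))
print(f"counting cache: {including:.0f}%, excluding cache: {excluding:.0f}%")
```

On a real server, pass the contents of /proc/meminfo instead of SAMPLE. A large gap between the two percentages suggests that most of the "used" memory is reclaimable cache rather than process memory.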


Metric: Disk Usage

What it measures: The percentage of total disk capacity currently used by files and data.

Unit: Percentage (0–100%)

Healthy range:

  • Under 70%: normal
  • 70–85%: elevated — plan a cleanup or disk expansion soon
  • Above 85%: warning — take action within days
  • Above 95%: critical — services may fail if disk becomes full

What fills disk: Application logs (the most common cause), database data growth, backup files, uploaded media, deployment artifacts.

Important: Disk usage only goes down when you delete files. It does not self-heal. Once a server runs out of disk space, web servers, databases, and the monitoring agent itself may stop functioning.
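
For a quick spot check outside the dashboard, the same percentage can be approximated with Python's standard library. This is a rough sketch: the function names are hypothetical, and the result may differ slightly from the dashboard or from df, which can account for reserved blocks differently.

```python
import shutil


def disk_used_percent(path="/"):
    """Return used space as a percentage of total capacity for `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def disk_band(percent):
    """Map a disk usage percentage to the bands listed in this article."""
    if percent > 95:
        return "critical"
    if percent > 85:
        return "warning"
    if percent >= 70:
        return "elevated"
    return "normal"
```

Calling disk_band(disk_used_percent("/")) on the server returns the band for the root filesystem. Pass a different path to check another mount point.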


Metric: Load Average

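What it measures: The average number of processes running on or waiting for a CPU over the past 1 minute, 5 minutes, and 15 minutes. On Linux the count also includes processes in uninterruptible sleep, which usually means they are waiting on disk I/O. Load average is different from CPU percentage: it measures demand on the server (queue depth), not utilization.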

Unit: Unitless number (e.g., 0.80, 2.40, 5.10)

Interpreting load average:

  • A load of 1.0 on a single-core server means the CPU is 100% utilized with no waiting processes.
  • On a 4-core server, a load of 4.0 is 100% utilization. A load of 2.0 is 50%.
  • Rule of thumb: Load average should be at or below the number of CPU cores on the server for healthy operation.

Healthy range:

  • Load / CPU cores below 0.75: healthy
  • Load / CPU cores 0.75–1.0: normal peak load
  • Load / CPU cores above 1.0 sustained: processes are waiting — investigate
  • Load / CPU cores above 2.0 sustained: server is overloaded
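
The per-core calculation can be sketched as follows. The function names are hypothetical and the bands are the ones listed above. os.getloadavg() is only available on Unix-like systems, so the live check is kept in its own function.

```python
import os


def load_per_core(load, cores):
    """Divide a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


def load_band(ratio):
    """Map a load-per-core ratio to the bands listed in this article."""
    if ratio > 2.0:
        return "overloaded"
    if ratio > 1.0:
        return "processes waiting"
    if ratio >= 0.75:
        return "normal peak"
    return "healthy"


def current_load_bands():
    """Return bands for the live 1m, 5m, and 15m load averages (Unix only)."""
    cores = os.cpu_count() or 1
    return [load_band(load_per_core(value, cores)) for value in os.getloadavg()]
```

For example, a load of 2.0 on a 4-core server gives a ratio of 0.5, which falls in the healthy band, while the same load on a single-core server gives 2.0 and means processes are waiting.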

The three numbers (1m / 5m / 15m):

  • 1m load spiking but 15m normal: short burst, likely fine
  • All three numbers are high: sustained overload — requires immediate attention
  • 1m lower than a still-high 15m: load is falling and the situation is recovering
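
One way to encode these three patterns is shown below. It is a simplified sketch, not the platform's logic: the function name is hypothetical, and "high" is assumed to mean a load above the number of CPU cores.

```python
def interpret_load(one, five, fifteen, cores):
    """Classify the 1m / 5m / 15m load averages using the patterns above.

    Assumption: a value counts as "high" when it exceeds the core count.
    """
    def high(value):
        return value > cores

    if not (high(one) or high(five) or high(fifteen)):
        return "normal"
    if high(one) and high(five) and high(fifteen):
        return "sustained overload"
    if high(one) and not high(fifteen):
        return "short burst"
    if high(fifteen) and one < fifteen:
        return "recovering"
    return "elevated"
```

On a 4-core server, interpret_load(6.0, 3.0, 1.0, 4) is classed as a short burst, while interpret_load(1.0, 5.0, 4.5, 4) is classed as recovering.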

Metric: Network I/O

What it measures: Inbound and outbound network throughput on the server's primary network interface.

Unit: Kilobits per second (Kbps)

Healthy range: Depends entirely on your workload. A server handling web traffic naturally shows higher inbound Kbps than an idle one, and a backup in progress shows high outbound Kbps.

What to watch for: Sustained high network I/O with no corresponding workload (may indicate unintended traffic or a compromised server). Sudden drops to zero (may indicate a network or server outage).
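
Throughput figures like these are typically derived from the interface's cumulative byte counters. The arithmetic can be sketched as below, assuming 1 kilobit = 1,000 bits. The function name and counter values are hypothetical, and the agent's actual method is not documented here.

```python
def throughput_kbps(previous_bytes, current_bytes, interval_seconds):
    """Convert two cumulative byte-counter readings into kilobits per second.

    Returns None if the counter went backwards (for example after a reboot).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = current_bytes - previous_bytes
    if delta < 0:
        return None
    return delta * 8 / 1000 / interval_seconds
```

For example, 7,500,000 bytes transferred over a 60-second interval works out to 1,000 Kbps.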


Sampling windows

CloudAIPilot's monitoring agent sends metrics to the platform at regular intervals. Charts display those samples at a resolution that depends on the selected time range:

Chart time range    Data resolution      Sample interval
1 hour view         1-minute samples     Every 60 seconds
6 hour view         5-minute samples     Aggregated every 5 minutes
24 hour view        15-minute samples    Aggregated every 15 minutes
7 day view          1-hour samples       Aggregated hourly

What this means for alerts: Alert rules evaluate the most recent 60-second sample. If your server's CPU spikes to 100% for 30 seconds and drops back, the alert may or may not fire depending on the sample timing.

What this means for charts: A spike that lasted only 45 seconds may appear smaller or be missed entirely in the 6-hour view. Use the 1-hour view for precise incident investigation.
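
The effect of aggregation on a short spike can be illustrated with a small calculation. This sketch assumes the wider views average the 1-minute samples in each bucket, which is an assumption for illustration. The platform may aggregate differently, and the sample values are made up.

```python
def downsample_mean(samples, bucket_size):
    """Average consecutive samples into buckets of `bucket_size`."""
    buckets = []
    for start in range(0, len(samples), bucket_size):
        bucket = samples[start:start + bucket_size]
        buckets.append(sum(bucket) / len(bucket))
    return buckets


# Fifteen 1-minute CPU readings: steady 20% with a single 100% spike.
one_minute = [20] * 15
one_minute[7] = 100

five_minute = downsample_mean(one_minute, 5)
print(max(one_minute))  # 100
print(five_minute)      # [20.0, 36.0, 20.0]
```

The 100% reading is clearly visible at 1-minute resolution but shows up as only 36% once it is averaged into a 5-minute bucket.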

