Common False-Positive Tuning Guide

Who this is for

Anyone experiencing alert fatigue — receiving too many notifications for alerts that do not require action.

What you will complete

Identify the most common sources of false-positive alerts and apply targeted fixes to eliminate noise without losing real alerts.


What is a false positive?

A false positive is an alert that fires correctly (the threshold was crossed) but does not represent a real problem requiring action. Common causes:

  • Thresholds set too low for the actual workload
  • Missing duration requirements (alerting on momentary spikes)
  • Missing hysteresis (alert oscillates between Firing and Resolved)
  • Normal operational activity triggering alerts (backups, deployments)

The five most common false-positive scenarios

1. CPU spikes during deployments

Symptom: CPU alert fires every time you deploy a new version of a site or app.

Cause: Deployments naturally spike CPU while restarting services, running composer/npm installs, and warming up caches.

Fix: Add a 5-minute duration requirement to your CPU alert rules. A deployment CPU spike rarely lasts 5 continuous minutes.

Steps:

  1. Go to Alerts → Rules.
  2. Edit your CPU rule.
  3. Set Duration to 5 minutes.
  4. Save.
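
To see why a duration requirement filters out deployment spikes, here is a minimal sketch of the idea. It assumes the rule fires only once the metric has stayed above the threshold for the full duration. The function name, the sample data, and the 60-second sampling interval are hypothetical and are not taken from the product.

```python
from typing import Iterable, Optional, Tuple


def first_fire_time(
    samples: Iterable[Tuple[int, float]],
    threshold: float,
    duration_s: int,
) -> Optional[int]:
    """Return the timestamp (seconds) at which the alert would fire, or None.

    Assumption: the alert fires only after the metric has been above the
    threshold continuously for at least duration_s seconds.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # start of a continuous breach
            if ts - breach_start >= duration_s:
                return ts
        else:
            breach_start = None  # any dip below the threshold resets the timer
    return None


# Hypothetical CPU readings (timestamp in seconds, CPU %), one per minute.
deploy_spike = [(0, 40), (60, 95), (120, 97), (180, 92), (240, 45), (300, 42)]
sustained = [(t, 93) for t in range(0, 601, 60)]

print(first_fire_time(deploy_spike, threshold=90, duration_s=300))  # no alert
print(first_fire_time(sustained, threshold=90, duration_s=300))     # alert fires
print(first_fire_time(deploy_spike, threshold=90, duration_s=0))    # fires on the spike
```

With no duration requirement the short deployment spike fires immediately. With a 5-minute requirement only the sustained load does.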

2. RAM appears high but server is healthy

Symptom: RAM alert fires at 80% but the server is responding normally.

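Cause: Linux uses otherwise idle memory for disk caching (the page cache), which inflates the RAM usage metric. The kernel releases this cache when applications need the memory, so this is normal behavior.

Fix: Raise your RAM warning threshold to 85–88% and add a 10-minute duration requirement. True memory exhaustion keeps usage above 90% for a sustained period rather than briefly touching it.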
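
To check how much of the reported usage is cache, you can compare two figures from /proc/meminfo on the server. The sketch below parses a hypothetical sample of that file. MemTotal, MemFree, and MemAvailable are real kernel fields, but the sample values and function names are made up for illustration, and how the alerting metric itself is calculated is not confirmed here.

```python
from typing import Dict, Tuple


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def memory_percentages(fields: Dict[str, int]) -> Tuple[float, float]:
    """Return (percent used counting cache, percent used excluding reclaimable cache)."""
    total = fields["MemTotal"]
    used_incl_cache = total - fields["MemFree"]
    used_excl_cache = total - fields["MemAvailable"]
    return 100 * used_incl_cache / total, 100 * used_excl_cache / total


# Hypothetical sample; on a real server, read the text from /proc/meminfo.
SAMPLE = """MemTotal:        8000000 kB
MemFree:          800000 kB
MemAvailable:    5200000 kB
Buffers:          200000 kB
Cached:          4000000 kB
"""

incl, excl = memory_percentages(parse_meminfo(SAMPLE))
print(f"counting cache: {incl:.0f}%  excluding reclaimable cache: {excl:.0f}%")
```

A large gap between the two figures suggests the alert is being driven by cache rather than by application memory pressure.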

3. Disk alert fires unexpectedly at midnight

Symptom: Disk warning alert fires at the same time each night.

Cause: A cron job (log rotation, backup staging, database dump) is writing temporary files that fill disk temporarily.

Fix: Investigate what runs at that time (for example, run crontab -l on the server for each user that schedules jobs). Then either increase disk space, clean up the files the job leaves behind, or raise the disk threshold if the temporary spike is safe.

4. Load average spikes on small servers

Symptom: Load average alerts fire frequently on a 1-core or 2-core server.

Cause: Small servers have little headroom, so they are more sensitive to load spikes. Even routine operations can push load average above 1.0.

Fix: Set the load threshold to 2× the number of cores for warning, and 3× for critical. For a 1-core server: warning at 2.0, critical at 3.0.
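
The rule of thumb above is simple arithmetic. As a minimal sketch (the function name is hypothetical), it can be computed from the core count like this:

```python
import os
from typing import Dict


def load_thresholds(cores: int) -> Dict[str, float]:
    """Apply the 2x (warning) and 3x (critical) per-core rule of thumb."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return {"warning": 2.0 * cores, "critical": 3.0 * cores}


print(load_thresholds(1))
print(load_thresholds(2))
# On the server itself, the core count can be read with os.cpu_count().
print(load_thresholds(os.cpu_count() or 1))
```

For a 2-core server this gives a warning at 4.0 and a critical at 6.0.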

5. Alert flapping (fires and resolves repeatedly)

Symptom: The same alert fires, resolves, fires, resolves several times within an hour.

Cause: The metric is oscillating just above and below the threshold. The default 10% hysteresis is not wide enough.

Fix: Choose one of the following:

  • Option A: Raise the alert threshold by 5–10%.
  • Option B: Set a custom resolve threshold that requires the metric to recover further before clearing.
  • Option C: Increase the duration requirement so the alert only fires after sustained elevated readings.
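
The effect of a wider resolve threshold (Option B) can be illustrated with a small simulation. This is a sketch under stated assumptions: the alert fires when the metric rises above the fire threshold and clears only when it drops below the resolve threshold, and a 10% hysteresis on an 80% threshold is taken to mean a resolve threshold of 72%. The product may define hysteresis differently, and the readings are hypothetical.

```python
from typing import Iterable


def count_firings(values: Iterable[float], fire_at: float, resolve_at: float) -> int:
    """Count how many times an alert fires across a series of readings."""
    firing = False
    firings = 0
    for v in values:
        if not firing and v > fire_at:
            firing = True
            firings += 1
        elif firing and v < resolve_at:
            firing = False  # the metric recovered far enough to clear the alert
    return firings


# Hypothetical readings swinging between roughly 70% and 84%.
readings = [82, 71, 83, 70, 84, 71, 82]

print(count_firings(readings, fire_at=80, resolve_at=72))  # narrow gap: flaps
print(count_firings(readings, fire_at=80, resolve_at=65))  # wider gap: fires once
```

With the wider gap the alert stays in the Firing state through the dips instead of resolving and re-firing.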


General tuning approach

Follow this process when tuning any alert rule:

  1. Note when it fires. Is there a pattern (time of day, after a deployment, always on one specific server)?
  2. Check the duration. Was the metric only briefly above threshold? Add or increase the duration requirement.
  3. Check the threshold. Is the threshold too close to the server's normal operating level? Raise it.
  4. Check the resolve threshold. Is the alert resolving and re-firing quickly? Widen the hysteresis.
  5. Check for scheduled tasks. Do backups, crons, or deployments correlate with the alert? If so, the spike is operational noise, not a real problem.
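
For steps 1 and 5, one quick way to spot a time-of-day pattern is to group the times an alert fired by hour. The sketch below assumes you have copied the firing timestamps out of your alert history as ISO 8601 strings. The function names, the 60% cutoff, and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import List, Optional


def firings_by_hour(timestamps: List[str]) -> Counter:
    """Count alert firings per hour of day (0-23)."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


def dominant_hour(timestamps: List[str], min_share: float = 0.6) -> Optional[int]:
    """Return the hour holding at least min_share of all firings, or None."""
    counts = firings_by_hour(timestamps)
    if not counts:
        return None
    hour, n = counts.most_common(1)[0]
    return hour if n / len(timestamps) >= min_share else None


# Hypothetical firing times for one disk alert.
fired_at = [
    "2024-03-01T00:02:10",
    "2024-03-02T00:01:45",
    "2024-03-03T00:03:02",
    "2024-03-03T14:20:00",
]

print(dominant_hour(fired_at))
```

If most firings land in the same hour, compare that hour against your scheduled backups, cron jobs, and deployment windows.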
