Common False-Positive Tuning Guide
Who this is for
Anyone experiencing alert fatigue, that is, receiving too many notifications for alerts that do not require action.
What you will complete
Identify the most common sources of false-positive alerts and apply targeted fixes to eliminate noise without losing real alerts.
What is a false positive?
A false positive is an alert that fires correctly (the threshold was crossed) but does not represent a real problem requiring action. Common causes:
- Thresholds set too low for the actual workload
- Missing duration requirements (alerting on momentary spikes)
- Missing hysteresis (alert oscillates between Firing and Resolved)
- Normal operational activity triggering alerts (backups, deployments)
The five most common false-positive scenarios
1. CPU spikes during deployments
Symptom: CPU alert fires every time you deploy a new version of a site or app.
Cause: Deployments naturally spike CPU while restarting services, running composer/npm installs, and warming up caches.
Fix: Add a 5-minute duration requirement to your CPU alert rules. A deployment CPU spike rarely lasts 5 continuous minutes.
Steps:
- Go to Alerts → Rules.
- Edit your CPU rule.
- Set Duration to 5 minutes.
- Save.
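To see why a duration requirement filters out deployment spikes, here is a minimal Python sketch. It assumes the rule fires once the metric has stayed above the threshold continuously for the configured duration; the function name, the sample data, and the 60-second sampling interval are hypothetical and are not taken from the product.

```python
from typing import List, Tuple

def fire_times(samples: List[Tuple[int, float]], threshold: float,
               duration_s: int) -> List[int]:
    """Return the timestamps (seconds) at which the alert would fire.

    Assumption: the alert fires once per breach, as soon as the metric
    has been above `threshold` for at least `duration_s` seconds.
    """
    fired = []
    breach_start = None      # start of the current run above threshold
    already_fired = False    # avoid firing twice within one breach
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
                already_fired = False
            if not already_fired and ts - breach_start >= duration_s:
                fired.append(ts)
                already_fired = True
        else:
            breach_start = None  # metric recovered; reset the timer
    return fired

# Hypothetical CPU readings, one per minute: a 2-minute deployment spike
# (60s-120s) followed by a sustained problem (240s-540s).
samples = [(0, 30), (60, 95), (120, 97), (180, 40),
           (240, 92), (300, 93), (360, 94), (420, 95),
           (480, 96), (540, 97), (600, 30)]

print(fire_times(samples, threshold=90, duration_s=0))    # [60, 240]
print(fire_times(samples, threshold=90, duration_s=300))  # [540]
```

With no duration the deployment spike triggers an alert; with a 5-minute duration only the sustained breach does, at the cost of a 5-minute delay before the real alert fires.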
2. RAM appears high but server is healthy
Symptom: RAM alert fires at 80% but the server is responding normally.
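Cause: Linux uses otherwise idle memory for disk caching, which inflates the RAM usage metric. The kernel releases this cache when applications need the memory, so a high reading on its own is normal behavior.
Fix: Raise your RAM warning threshold to 85–88% and add a 10-minute duration requirement. True memory exhaustion typically stays above 90% for a sustained period rather than touching it briefly.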
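To check on the server how much of the reported usage is reclaimable cache, the sketch below compares two ways of computing RAM usage from /proc/meminfo. Which formula your monitoring agent uses is an assumption you should verify. MemTotal, MemFree, and MemAvailable are real /proc/meminfo fields on current Linux kernels, while the function names and the sample numbers are illustrative.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value} (values in kB)."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields

def memory_usage_percent(fields: dict) -> tuple:
    """Return (naive_pct, effective_pct).

    naive:     counts disk cache as used   -> (MemTotal - MemFree)
    effective: excludes reclaimable cache  -> (MemTotal - MemAvailable)
    """
    total = fields["MemTotal"]
    naive = 100.0 * (total - fields["MemFree"]) / total
    effective = 100.0 * (total - fields["MemAvailable"]) / total
    return naive, effective

# Illustrative sample; on a real server use: open("/proc/meminfo").read()
sample = """MemTotal:        8000000 kB
MemFree:          800000 kB
MemAvailable:    4400000 kB
Buffers:          200000 kB
Cached:          3400000 kB
"""

naive, effective = memory_usage_percent(parse_meminfo(sample))
print(f"naive: {naive:.1f}%  effective: {effective:.1f}%")
# naive: 90.0%  effective: 45.0%
```

If the two numbers differ widely, as in this sample, the alert is most likely reacting to cache rather than to memory pressure.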
3. Disk alert fires unexpectedly at midnight
Symptom: Disk warning alert fires at the same time each night.
Cause: A cron job (log rotation, backup staging, database dump) is writing temporary files that fill disk temporarily.
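Fix: Investigate what runs at that time (run crontab -l on the server to list the current user's cron jobs, and check /etc/crontab and /etc/cron.d/ for system-wide jobs). Either increase disk space, clean up the files the cron job leaves behind, or raise the disk threshold if the temporary spike is safe.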
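If the cron listing does not make the culprit obvious, looking for large files written shortly before the alert can help. The following is a rough sketch, not a product feature: the function name, the one-hour window, and the example directory are assumptions.

```python
import os
import time

def recent_large_files(root: str, max_age_s: float, top_n: int = 10) -> list:
    """Return up to `top_n` (size_bytes, path) tuples, largest first, for
    files under `root` that were modified within the last `max_age_s` seconds."""
    cutoff = time.time() - max_age_s
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_mtime >= cutoff:
                found.append((st.st_size, path))
    found.sort(reverse=True)
    return found[:top_n]

# Example (hypothetical directory), run shortly after the nightly alert:
#   for size, path in recent_large_files("/var/backups", max_age_s=3600):
#       print(f"{size / 1e6:10.1f} MB  {path}")
```

Run it soon after the alert fires, because temporary files may already be deleted by the time the job finishes.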
4. Load average spikes on small servers
Symptom: Load average alerts fire frequently on a 1-core or 2-core server.
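Cause: Small servers are more sensitive to load spikes. Load average roughly counts the processes that are running or waiting to run, so on a 1-core server a load of 1.0 only means the single core is fully busy. Even routine operations push load above 1.0.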
Fix: Set the load threshold to 2× the number of cores for warning, and 3× for critical. For a 1-core server: warning at 2.0, critical at 3.0.
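The arithmetic behind this rule is simple enough to script. This sketch applies the 2× and 3× multipliers from the fix above; the function name is hypothetical, and os.cpu_count() reports logical CPUs, which may differ from the core count your hosting plan advertises.

```python
import os

def load_thresholds(cores: int) -> dict:
    """Warning at 2x cores and critical at 3x cores, per the rule above."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return {"warning": 2.0 * cores, "critical": 3.0 * cores}

print(load_thresholds(1))  # {'warning': 2.0, 'critical': 3.0}
print(load_thresholds(4))  # {'warning': 8.0, 'critical': 12.0}

# On the server itself, derive the values from the detected CPU count.
detected = os.cpu_count() or 1
print(detected, load_thresholds(detected))
```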
5. Alert flapping (fires and resolves repeatedly)
Symptom: The same alert fires, resolves, fires, resolves several times within an hour.
Cause: The metric is oscillating just above and below the threshold. The default 10% hysteresis is not wide enough.
Fix: Choose one of the following:
- Option A: Raise the alert threshold by 5–10%.
- Option B: Set a custom resolve threshold that requires the metric to recover further before clearing.
- Option C: Increase the duration requirement so the alert only fires after sustained elevated readings.
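To illustrate how a wider gap between the fire and resolve thresholds (Option B) reduces flapping, here is a small simulation. It assumes the alert fires when the metric rises above the fire threshold and clears only when it drops below the resolve threshold. The readings and threshold values are made up, and the way the product derives its default resolve threshold is not modeled.

```python
def count_fires(values, fire_at: float, resolve_at: float) -> int:
    """Count how many times an alert fires over a series of readings.

    Assumption: fires when a reading exceeds `fire_at`, and clears only
    when a later reading falls below `resolve_at`.
    """
    firing = False
    fires = 0
    for v in values:
        if not firing and v > fire_at:
            firing = True
            fires += 1
        elif firing and v < resolve_at:
            firing = False
    return fires

# Hypothetical readings oscillating around an 80% threshold.
readings = [78, 82, 71, 83, 70, 84, 71, 82, 55]

print(count_fires(readings, fire_at=80, resolve_at=72))  # 4 (narrow gap)
print(count_fires(readings, fire_at=80, resolve_at=65))  # 1 (wider gap)
```

With the narrow gap every dip to about 70 clears the alert and the next rise re-fires it; with the wider gap the alert stays open until the metric has genuinely recovered.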
General tuning approach
Follow this process when tuning any alert rule:
- Note when it fires. Is there a pattern (time of day, after a deployment, always on one specific server)?
- Check the duration. Was the metric only briefly above threshold? Add or increase the duration requirement.
- Check the threshold. Is the threshold too close to the server's normal operating level? Raise it.
- Check the resolve threshold. Is the alert resolving and re-firing quickly? Widen the hysteresis.
- Check for scheduled tasks. Do backups, crons, or deployments correlate with the alert? If so, the spike is operational noise, not a real problem.
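For the first step of this process, grouping firing times by hour of day is a quick way to spot a pattern. The sketch below assumes you have copied the firing timestamps out of your alert history as ISO 8601 strings; the sample timestamps and the function name are hypothetical.

```python
from collections import Counter
from datetime import datetime

def fires_by_hour(timestamps) -> Counter:
    """Count alert firings per hour of day (0-23) from ISO 8601 strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)

# Hypothetical firing times for one alert rule.
history = [
    "2024-03-01T00:02:10",
    "2024-03-02T00:01:45",
    "2024-03-02T14:30:00",
    "2024-03-03T00:03:05",
]

for hour, count in fires_by_hour(history).most_common():
    print(f"{hour:02d}:00  {count} firing(s)")
# 00:00  3 firing(s)
# 14:00  1 firing(s)
```

A strong cluster in one hour, as in this sample, points to a scheduled task rather than a real problem.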