Severity, Hysteresis, and Noise Reduction
Overview
A monitoring system is only as good as its signal-to-noise ratio. If CloudAIPilot alerts you every time a server's CPU spikes for 5 seconds, you will quickly suffer from alert fatigue and ignore critical warnings.
To prevent this, CloudAIPilot's alerting engine combines Severity Levels, Hysteresis (Anti-Flapping), Deduplication Cooldowns, and Maintenance Windows. This article explains how these mechanisms work behind the scenes so that you are notified only when real action is required.
1. Severity Levels
Every alert rule you create (or that is auto-created during server provisioning) is assigned a severity. The severity dictates how aggressively the platform escalates the issue.
- Info: Used for standard lifecycle events (e.g., "Backup Completed", "Server Provisioned"). Does not trigger the alerting engine; it appears only in the Activity Center.
- Warning: Indicates a threshold has been crossed, but the server/site is likely still functional (e.g., "Disk Usage > 80%"). Sends a single notification per cooldown period.
- Critical: Indicates severe degradation or imminent failure (e.g., "Site Unreachable", "Disk > 95%", "CPU > 99% for 5m"). Bypasses standard cooldowns and triggers the Escalation Engine (if configured).
Best Practice: Reserve Critical severity only for rules that require waking someone up or immediate manual intervention.
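The three levels above can be summarized as a small lookup table. This is only an illustrative sketch: the `Severity` and `Handling` names and fields are hypothetical and do not come from CloudAIPilot's API. They simply restate the behavior described in the list.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class Handling:
    """Hypothetical summary of how one severity level is treated."""
    triggers_alerting: bool   # does it reach the alerting engine at all?
    respects_cooldown: bool   # is it limited to one notification per cooldown?
    may_escalate: bool        # can it trigger the Escalation Engine (if configured)?


# Restates the list above; not an official data structure.
HANDLING = {
    Severity.INFO: Handling(triggers_alerting=False, respects_cooldown=False, may_escalate=False),
    Severity.WARNING: Handling(triggers_alerting=True, respects_cooldown=True, may_escalate=False),
    Severity.CRITICAL: Handling(triggers_alerting=True, respects_cooldown=False, may_escalate=True),
}
```

Reading the table row by row makes the Best Practice concrete: only `CRITICAL` skips the cooldown and can escalate, so it should be the rarest level in your rule set.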
2. Hysteresis (Anti-Flapping)
"Flapping" occurs when a metric oscillates around your threshold. For example, if you set an alert for CPU > 90% and the CPU fluctuates between 89% and 91% every minute, a basic system would send you an alert, then a resolve notification, then another alert, over and over.
CloudAIPilot prevents this using Hysteresis (built-in resolve thresholds):
- Upper Bound Rules (`gt`, greater than): The metric must fall 10% below the threshold value (threshold × 0.9) to resolve.
  - *Example:* Alert fires at 90% CPU. It will not resolve until the CPU drops below 81%.
- Lower Bound Rules (`lt`, less than): The metric must rise 10% above the threshold value (threshold × 1.1) to resolve.
  - *Example:* Alert fires if available memory drops below 500 MB. It will not resolve until available memory rises above 550 MB.
This ensures that the condition has truly stabilized before the alert is marked as resolved.
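The resolve thresholds above follow directly from the 10% margin. The sketch below is a minimal illustration, not CloudAIPilot's implementation: the function names are hypothetical, and the exact behavior at the boundary value (for example, exactly 81%) is an assumption.

```python
HYSTERESIS_MARGIN = 0.10  # the 10% margin described above


def resolve_threshold(operator: str, threshold: float) -> float:
    """Return the value the metric must cross before the alert resolves."""
    if operator == "gt":
        return threshold * (1 - HYSTERESIS_MARGIN)
    if operator == "lt":
        return threshold * (1 + HYSTERESIS_MARGIN)
    raise ValueError(f"unsupported operator: {operator!r}")


def is_firing(was_firing: bool, value: float, operator: str, threshold: float) -> bool:
    """Evaluate one sample, using the resolve threshold once the alert is firing."""
    resolve_at = resolve_threshold(operator, threshold)
    if operator == "gt":
        # Assumption: a firing alert stays firing until the value is strictly below resolve_at.
        return value >= resolve_at if was_firing else value > threshold
    # "lt" rule. Assumption: stays firing until the value is strictly above resolve_at.
    return value <= resolve_at if was_firing else value < threshold
```

With a `gt` rule at 90, a CPU that oscillates between 89% and 91% fires once and stays firing, because 89 never drops below the resolve threshold of about 81.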
3. Deduplication and Cooldowns
Even if an alert remains in a "Firing" state, you don't want an email every 60 seconds (our evaluation polling rate).
CloudAIPilot uses the NotificationDispatcher to enforce a strict Cooldown Period:
- By default, identical alert events for the same organization are deduplicated on a 5-minute cooldown.
- If CPU remains at 95% for an hour, you receive the initial alert and no further notifications unless the alert escalates, or it resolves and later fires again.
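A cooldown of this kind can be sketched as a small gate keyed by organization and alert. This is an assumption-laden illustration rather than the actual `NotificationDispatcher`: the `CooldownGate` class, its method names, and the choice of key are hypothetical, and it models only the 5-minute window, not escalation or firing-state tracking.

```python
from dataclasses import dataclass, field

DEFAULT_COOLDOWN_SECONDS = 5 * 60  # the default 5-minute cooldown


@dataclass
class CooldownGate:
    """Hypothetical deduplication gate: one notification per key per cooldown."""
    cooldown: float = DEFAULT_COOLDOWN_SECONDS
    _last_sent: dict = field(default_factory=dict)

    def should_notify(self, org_id: str, alert_key: str, now: float) -> bool:
        key = (org_id, alert_key)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # identical event inside the window: suppress it
        self._last_sent[key] = now  # remember when this event last went out
        return True
```

Passing `now` in explicitly (for example, from `time.time()`) keeps the gate easy to test. With a 60-second evaluation rate, the four evaluations that follow a notification inside the same 5-minute window would be suppressed.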
4. Maintenance Windows
If you are performing planned upgrades, running a heavy database migration, or running load tests, you can suppress noise using Maintenance Windows.
How it works:
- Navigate to the Server view and click Enable Maintenance Mode.
- Select a duration (e.g., 1 hour, 4 hours).
- What happens: The monitoring engine continues to collect metrics, and alerts will still technically "Fire" in the database so you have an accurate audit trail. However, the `NotificationDispatcher` is instructed to suppress all outbound notifications (Email, Slack, SMS) for that server until the window expires.
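The key point is that a maintenance window suppresses delivery but not recording. A minimal sketch of that split, assuming a hypothetical `MaintenanceAwareDispatcher` class (not the real `NotificationDispatcher`) with in-memory lists standing in for the database and the outbound channels:

```python
from dataclasses import dataclass, field


@dataclass
class MaintenanceAwareDispatcher:
    """Hypothetical dispatcher that always records alerts but may skip delivery."""
    window_ends: dict = field(default_factory=dict)  # server_id -> window end time (seconds)
    audit_log: list = field(default_factory=list)    # stands in for the alerts database
    sent: list = field(default_factory=list)         # stands in for Email/Slack/SMS

    def enable_maintenance(self, server_id: str, now: float, duration_seconds: float) -> None:
        self.window_ends[server_id] = now + duration_seconds

    def handle_alert(self, server_id: str, message: str, now: float) -> bool:
        # The alert is always recorded, so the audit trail stays complete.
        self.audit_log.append((now, server_id, message))
        if now < self.window_ends.get(server_id, float("-inf")):
            return False  # inside the window: suppress outbound notification
        self.sent.append((server_id, message))
        return True
```

Once the window's end time has passed, the same call delivers normally again without any explicit "disable" step, which mirrors the window expiring on its own.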
Common Issues & Troubleshooting
- Symptom: I set an alert for "Disk > 85%" and it fired, but my disk is now at 84% and the alert hasn't resolved.
Fix: This is Hysteresis working as intended. For an 85% threshold, the disk must drop to 76.5% (which is 85 * 0.9) to officially resolve. You can manually acknowledge or resolve the alert from the Dashboard if needed.
- Symptom: I am receiving duplicate alerts via webhook.
Fix: Ensure your custom webhook integration returns a 200 OK status within 5 seconds. If CloudAIPilot does not receive a 200 OK within that window, it assumes the webhook failed and records an error in the delivery log, and standard deduplication may not apply to that delivery.
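One common way to stay well inside the 5-second window is to acknowledge the request immediately and do any slow processing afterwards. The receiver below is a generic sketch using only Python's standard library. The payload format is not specified in this article, so it is treated as opaque JSON, and the names `pending`, `AlertWebhookHandler`, `make_receiver`, and `start_worker` are all hypothetical.

```python
import json
import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Deliveries are parked here so the HTTP response never waits on processing.
pending = queue.Queue()


class AlertWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = json.loads(body or b"{}")
        except ValueError:  # malformed JSON or undecodable bytes
            self.send_response(400)
            self.end_headers()
            return
        pending.put(payload)     # defer the slow work
        self.send_response(200)  # acknowledge right away
        self.end_headers()

    def log_message(self, format, *args) -> None:
        pass  # keep the sketch quiet


def make_receiver(host: str, port: int) -> HTTPServer:
    """Build the server; call serve_forever() on the result to start it."""
    return HTTPServer((host, port), AlertWebhookHandler)


def start_worker(process) -> threading.Thread:
    """Run process(payload) for each queued delivery on a background thread."""
    def loop() -> None:
        while True:
            process(pending.get())
    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

To try it locally, call `start_worker(print)` and then `make_receiver("0.0.0.0", 8080).serve_forever()`. The port and the `print` processor are placeholders for your own values.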