vRealize Operations can show us a lot of alarms from our environment. It’s pretty common to open it and find more than 10,000 generated alarms. I thought it would be a good idea to explain here how they are generated, canceled and deleted from the environment, and which parameters control those decisions. Adjusting these parameters can greatly help keep vROPs focused on the real problems of the environment and avoid false positives!
vROPs Collection Interval
First of all, we have to understand what a collection cycle is in vROPs. vROPs has a collection cycle of 5 minutes by default, which means that every 5 minutes it collects information from vCenter. It is important to recall that each collected value (the point on the chart) is an average of 15 vCenter samples taken at 20-second intervals (15 × 20 s = 300 s, or 5 minutes). Sunny Dua has a post that explains this math perfectly! Check it out HERE.
The default value is suitable for most environments. Shrinking the interval will consume more storage and CPU to process the additional data; increasing it will consume less. When in doubt, do not change it! You can confirm this setting in the path shown in the figures below.
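The averaging described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming 20-second vCenter samples rolled up into one 5-minute point; the function name and the sample values are made up, and this is not a description of how vROPs works internally.

```python
from statistics import mean

SAMPLE_SECONDS = 20      # period of one vCenter sample
SAMPLES_PER_CYCLE = 15   # 15 samples x 20 s = 300 s = one 5-minute cycle


def rollup(samples):
    """Average each complete group of 15 samples into one point per cycle."""
    points = []
    last_start = len(samples) - SAMPLES_PER_CYCLE
    for start in range(0, last_start + 1, SAMPLES_PER_CYCLE):
        points.append(mean(samples[start:start + SAMPLES_PER_CYCLE]))
    return points


# Hypothetical CPU usage (%): a one-minute spike to 100% (3 samples)
# inside an otherwise quiet 5-minute window.
cpu_samples = [10.0] * 12 + [100.0] * 3
print(rollup(cpu_samples))  # [28.0]
```

A one-minute spike to 100% ends up as a single 28% point on the chart, which is why short bursts rarely reach a 100% threshold.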
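To get a feeling for the trade-off, here is a rough sketch of how many data points a single metric produces per day at different collection intervals. It assumes that storage and processing cost grow roughly with the number of points, which is a simplification; the function name is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def points_per_day(interval_minutes):
    """Number of collected points per metric per day for a given interval."""
    return MINUTES_PER_DAY // interval_minutes


for interval in (1, 5, 10):
    print(f"{interval:>2} min interval -> {points_per_day(interval)} points/day")
```

With the default of 5 minutes each metric produces 288 points per day; dropping to 1 minute multiplies that by five for every metric of every object.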
Alarms and Symptoms
An alarm in vROPs is defined by one or more symptoms. For the alarm to be true all the conditions imposed by the symptoms must be true. Let's use the "Virtual Machine CPU Usage is 100% for an extended period of time" alarm in this article to understand its behavior. This alarm has only the "Virtual Machine sustained CPU Usage is 100%" symptom shown in the image below.
To understand the alarm, we have to see what makes the symptom true. To see this information simply follow the path shown in the image below.
Click the pencil icon to open the symptom settings. It will open the screen shown in the image below. Arrow number 1 shows which metric is being checked by the symptom. Arrow number 2 shows which threshold is used to trigger the symptom. This symptom checks whether the metric CPU | Usage (%) is equal to or greater than 100%. But being equal to or greater than 100% still does not make the symptom true!
Wait Cycle and Cancel Cycle
vROPs alarms and symptoms have two settings: Wait Cycle and Cancel Cycle. At the alert level, these settings can be checked in the path shown in the image below. At the symptom level, you can check them in the image below, in the area marked by square number 3.
Wait Cycle is the number of cycles for which the condition must be true before the symptom is triggered. In our example, the symptom is true when CPU | Usage (%) is equal to or greater than 100% for 6 cycles. As each cycle is a 5-minute interval, the virtual machine has to stay at 100% CPU for 30 minutes (6 × 5 minutes) for the symptom to be true.
Cancel Cycle is the opposite. It is the number of cycles for which the condition must be false before the symptom is canceled. In this case the CPU Usage metric has to be less than 100% for 6 cycles for the symptom to be canceled.
With Wait Cycle and Cancel Cycle you can customize how responsive the analysis of vRealize Operations will be. If you want a more sensitive alarm, simply turn down the Wait Cycle. Do you want a more conservative alarm? Increase the Wait Cycle. The same goes for Cancel Cycle.
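The behavior of Wait Cycle and Cancel Cycle can be modeled with a small simulation. This is a minimal sketch, not vROPs code: the function name and inputs are hypothetical, and it assumes the cycles have to be consecutive, so a single cycle on the other side of the threshold resets the count.

```python
def symptom_states(values, threshold=100.0, wait_cycle=6, cancel_cycle=6):
    """Return the symptom state (True = active) after each collection cycle."""
    active = False
    streak = 0  # consecutive cycles that contradict the current state
    states = []
    for value in values:
        breached = value >= threshold
        # Count cycles that argue for flipping the state; reset otherwise.
        streak = streak + 1 if breached != active else 0
        needed = cancel_cycle if active else wait_cycle
        if streak >= needed:
            active = not active
            streak = 0
        states.append(active)
    return states


CYCLE_MINUTES = 5
history = [100.0] * 6 + [50.0] * 6    # 30 min at 100%, then 30 min at 50%
states = symptom_states(history)
first_active = states.index(True) + 1  # cycle number, counting from 1
print(first_active * CYCLE_MINUTES)    # 30 (minutes until the symptom triggers)
```

Lowering `wait_cycle` to 2 in this model triggers the symptom after 10 minutes instead of 30, which is the "more sensitive" end of the trade-off.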
Remember I said that alarms also have Wait Cycle and Cancel Cycle settings?
By default, all alarms have the Wait Cycle set to 1 to ensure that the alarm is activated as soon as all the symptoms that form the alarm are true. In our example, as soon as the symptom is true, the alarm is activated and appears in the Alarms tab with the active status. They also have the Cancel Cycle set to 1 to ensure the alarm is canceled once the symptoms are no longer true.
The best way to control the sensitivity is to configure Wait Cycle and Cancel Cycle at the symptom level. Leave the alarm-level settings at the default value of 1!
Millions of Inactive alarms
We now understand how alarms become active and how they are canceled. After the alarm is canceled it will appear with an inactive status, as indicated in the image below.
The problem is that you can start to see many alarms in that state piling up on your Alarms tab. vROPs keeps canceled alarms and symptoms for 45 days after they are canceled (by Cancel Cycle or manually by a user). If 45 days is too much for your environment, you can change this value in the path shown in the image below. In my vROPs I had already changed this retention policy to 2 days (a very low value, just so I could test that the inactive alarms were deleted). Which value to use will depend on your company's information retention policies 😊
I hope this article makes it easier for you to understand vROPs alarms. If something is not right or you have any questions, don’t be shy! Use the comment section below!
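The retention rule can be pictured as a simple filter over the alarm list. The sketch below is only a model of the idea: the dictionary layout, the `canceled_at` field and the function name are hypothetical and do not reflect how vROPs stores alarms.

```python
from datetime import datetime, timedelta


def purge_canceled(alarms, now, retention_days=45):
    """Keep active alarms and canceled alarms still inside the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [
        alarm for alarm in alarms
        # canceled_at is None while the alarm is still active
        if alarm["canceled_at"] is None or alarm["canceled_at"] > cutoff
    ]


now = datetime(2024, 3, 1)
alarms = [
    {"name": "still active", "canceled_at": None},
    {"name": "canceled yesterday", "canceled_at": now - timedelta(days=1)},
    {"name": "canceled 10 days ago", "canceled_at": now - timedelta(days=10)},
    {"name": "canceled 60 days ago", "canceled_at": now - timedelta(days=60)},
]

print([a["name"] for a in purge_canceled(alarms, now)])
print([a["name"] for a in purge_canceled(alarms, now, retention_days=2)])
```

With the default of 45 days only the alarm canceled 60 days ago is dropped; with a 2-day retention the one canceled 10 days ago goes as well, while active alarms are never touched.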