Alerts

Overview

Alerting is about auditing events that happen across the system. An event in this context might be a particular metric value being collected, a resource being detected as down, or an operation failing. After raw events are generated, they are sent through the alerts processing engine, where they might be transformed, filtered, or correlated with data from other parts of the system. If an event makes it through all of those gates, it is recorded permanently, which might in turn trigger a notification to various interested parties.

Concepts & Terminology

Since this is a relatively large subsystem with a lot of unique concepts and constructs, it's important to begin with a set of terminology that will allow the rest of the alerts documentation to be written more concisely.

Action Filter

Filters that are applied to an alert definition after it triggers an alert are referred to as action filters. These filters can, in theory, perform a wide range of post-processing activities. They might change the preset notification policy for the definition, or they may even change the state of the definition itself. Below is a quick discussion of each of the available action filters:

With this filter applied, the alert definition disables itself upon firing an alert. This prevents it from triggering any more alerts while the problem that caused the alert to fire in the first place is remedied by developers or system administrators.

Alert

An alert is a form of auditing data that tells you that one or more alert conditions were known to be true at some point for some resource in inventory. Alerts are the end product of an alert definition. An alert will tell you specifically which conditions on the alert definition were true at the time it fired, and it will specify which parties were notified at that time.

Alert Definition

Each resource in inventory may have zero or more alert definitions. An alert definition defines how a specific list of alert conditions (known as a condition set) is evaluated, whether or not any filtering should take place, and what notifications to send to whom if an alert is fired from this definition.

Alert Condition

The alert condition is the main UI-facing construct that is used to integrate alerts with the other systems. See the Integrations section for more information.

Alert Dampening

Dampening is a form of filtering that occurs before an alert is fired. It is based entirely on the condition set construct. Instead of being alerted each and every time the processing engine finds the condition set to be true, you can suppress alerts in one of several ways. Below is a quick discussion of each of the available dampening options:

This is the default option, and indicates that no dampening should take place. In other words, every single time the condition set is known to be true, an alert will fire.

This form of dampening is good to use when you have a system that spikes a lot. If you know that your CPU will often momentarily hit some high point but then should return to normal processing levels, then you can use this dampening option to ensure that you only get an alert if the CPU remains spiked for 2 or 3 consecutive measurement collections.

Setting the X option for this dampening rule (the number of consecutive true evaluations required) to a high number isn't suggested, though there are a few valid use cases for doing so, because you're likely going to want to be notified sooner rather than later if one of your systems is experiencing such sustained problems. Remember, the condition set only needs to be found false once to reset the counter. So, using too high a number may effectively prevent alerts from ever being generated if you have just a few negative evaluations of your condition set in between lots of positive ones.
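
The counter behaviour described above can be sketched as follows. This is only an illustration, not RHQ's implementation: the class and method names are hypothetical, and resetting the counter after an alert fires is an assumption that the text does not confirm.

```python
class ConsecutiveDampener:
    """Fires only after `required` consecutive true evaluations (hypothetical helper)."""

    def __init__(self, required: int) -> None:
        self.required = required  # the X option
        self.count = 0            # consecutive true evaluations seen so far

    def evaluate(self, condition_set_true: bool) -> bool:
        if not condition_set_true:
            self.count = 0        # a single false evaluation resets the counter
            return False
        self.count += 1
        if self.count >= self.required:
            self.count = 0        # assumption: start counting afresh after firing
            return True
        return False


dampener = ConsecutiveDampener(required=3)
results = [dampener.evaluate(v) for v in [True, True, False, True, True, True]]
```

In this run only the final evaluation fires, because the single false value in the third position throws away the two positives before it.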

This dampening rule is very similar to the last one, but it's not as restrictive. This option allows the condition set to evaluate to false a few times over a long span of evaluations and will still fire an alert.

This could be useful when alerting on memory usage. If you want to set up an alert definition to see whether your free memory is being released periodically, you might use this dampening rule with X=5, Y=10. The alerts processing engine will then look back at the last ten evaluations, and if at least half of them were positive, that might indicate that your process is having trouble releasing memory back down to acceptable levels, and further investigation should be performed.
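
A sliding window is one way to picture this rule. The sketch below is a hypothetical illustration (the names are not taken from RHQ), and it assumes the window is not cleared after an alert fires, which the text does not specify.

```python
from collections import deque


class LastNDampener:
    """Fires when at least `required` of the last `window` evaluations were true."""

    def __init__(self, required: int, window: int) -> None:
        self.required = required             # the X option
        self.history = deque(maxlen=window)  # keeps only the last Y evaluations

    def evaluate(self, condition_set_true: bool) -> bool:
        self.history.append(condition_set_true)
        # True counts as 1 and False as 0, so sum() counts the positives.
        return sum(self.history) >= self.required


dampener = LastNDampener(required=5, window=10)
pattern = [True, False] * 5  # positives interleaved with negatives
fired = [dampener.evaluate(v) for v in pattern]
```

Here the rule first fires on the ninth evaluation, once five positives have accumulated in the window. A rule requiring five consecutive positives would never fire on the same sequence.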

This option is nearly identical to the previous one, except that it's based on time instead of the number of evaluations. This time-based evaluation, however, can be dangerous because the effects aren't always obvious.

Since the measurement subsystem is defined in terms of collection intervals, the alert processing engine is thus restricted to the same concept when computing condition sets.

So, let's say you use this dampening rule and set the options to X=5, Y=30 minutes. If the collection interval is currently 5 minutes, then Y=30 minutes effectively becomes Y=6 collections. However, if you change your collection interval to 10 minutes, then Y=3 collections, and a rule that needs 5 positive evaluations out of 3 can never be satisfied. This might not seem so bad if the person responsible for setting the collection intervals on this resource is also the person responsible for creating all of its alert definitions, but that is not always the case. If someone comes along and changes the collection interval for some measurement, he or she may inadvertently and implicitly change how one or more alert definitions are dampened, resulting in too many alerts, too few, or none at all.
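
The arithmetic in this example can be checked with a few lines of code. The function names are hypothetical, and rounding down to whole collections is an assumption about how a time window maps onto collection intervals.

```python
def collections_in_window(window_minutes: int, interval_minutes: int) -> int:
    """Number of whole collections that fit in the time window (assumed: round down)."""
    return window_minutes // interval_minutes


def can_ever_fire(required: int, window_minutes: int, interval_minutes: int) -> bool:
    """A rule needing `required` positives cannot fire if the window holds fewer collections."""
    return required <= collections_in_window(window_minutes, interval_minutes)


five_minute_interval = collections_in_window(30, 5)   # Y=30 minutes, 5-minute interval
ten_minute_interval = collections_in_window(30, 10)   # Y=30 minutes, 10-minute interval
```

With X=5, `can_ever_fire(5, 30, 5)` holds, while `can_ever_fire(5, 30, 10)` does not, which is the "no alerts at all" outcome described above.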

Alert Notification

After an alert is fired, the notification list for an alert definition is processed. Although alerts are always recorded in the RHQ system, the user will never know of them until he or she logs into the application, goes to a specific resource, and looks at the alerts that have been fired for that resource.

Instead of requiring users to perform this relatively slow method of checking on their alerts, RHQ supports the concept of notifications. Currently, the only supported notification method is email, but the email addresses can be registered against an alert definition in one of three ways: by direct email address, by role, or by user. Each of these is discussed in the following sections.

Email Notification

This type is used to specify a list of direct email addresses to notify if an alert is fired on some alert definition.

Role Notification

This type is used to specify a list of RHQ roles to notify if an alert is fired on some alert definition. Processing works here by finding the list of RHQ users in each role, and notifying each user in turn using the email address specified in that user's account.

User Notification

This type is used to specify a list of RHQ users to notify if an alert is fired on some alert definition. Processing works here by obtaining the email addresses from the users' accounts, and sending the notification email to each.
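
Taken together, the three notification types resolve to a list of email addresses. The sketch below illustrates that resolution with plain dictionaries; the function name and data layout are hypothetical, and removing duplicate addresses is an assumption rather than documented RHQ behaviour.

```python
def resolve_recipients(emails, roles, users, role_members, user_emails):
    """Collect the addresses to notify for one alert definition (illustrative only)."""
    recipients = set(emails)                      # direct email notification
    for role in roles:                            # role notification
        for user in role_members.get(role, []):
            if user in user_emails:
                recipients.add(user_emails[user])
    for user in users:                            # user notification
        if user in user_emails:
            recipients.add(user_emails[user])
    return sorted(recipients)                     # assumption: each address is mailed once


role_members = {"admins": ["alice", "bob"]}
user_emails = {
    "alice": "alice@example.com",
    "bob": "bob@example.com",
    "carol": "carol@example.com",
}
recipients = resolve_recipients(
    ["ops@example.com"], ["admins"], ["alice", "carol"], role_members, user_emails
)
```

In this example alice is reachable both through her role and directly as a user, yet her address appears only once in the result.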

Condition Expression

ALL

Regardless of the size of the condition set, all alert conditions need to be true before an alert will fire. It doesn't matter if one condition is known to be true several times in a row. The last known value for each condition must be true simultaneously before this alert definition will fire an alert. This is convenient when you have a system where individual events may happen frequently, but not actually indicate a problem. Here, you can specify that multiple conditions must be true before the system takes further action.

ANY

Regardless of the size of the condition set, only one alert condition needs to be true before an alert will fire. This is really a convenience option for administrators. Instead of having to define 5 different alert definitions, each with one condition based on the 5 different things that could happen in the system, now you only have to define one alert definition with 5 conditions and set the processing to ANY to get the same effect.
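
The difference between the two expressions comes down to "all" versus "any" over the last known value of each condition. A minimal sketch, using hypothetical names and treating an empty condition set as false (an assumption):

```python
from enum import Enum


class ConditionExpression(Enum):
    ALL = "ALL"
    ANY = "ANY"


def condition_set_is_true(expression, last_known_values):
    """Evaluate a condition set from the last known truth value of each condition."""
    values = list(last_known_values)
    if not values:
        return False  # assumption: a definition with no conditions never fires
    if expression is ConditionExpression.ALL:
        return all(values)
    return any(values)


mixed = [True, False, True]
all_result = condition_set_is_true(ConditionExpression.ALL, mixed)
any_result = condition_set_is_true(ConditionExpression.ANY, mixed)
```

For the mixed values above, the ALL expression is false and the ANY expression is true.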

Condition Log

A condition log is a piece of auditing data responsible for recording the time when a particular alert condition was met, as well as the specific runtime value that made the condition true.

Condition Set

A condition set is the entire list of alert conditions specified under some alert definition. It is the key construct used in alert dampening. To better understand the alert dampening options, you must first understand what it means for a condition set to be true, which depends entirely on the type of condition expression you selected for the alert definition.

Enable / Disable

When an alert definition is enabled (or active), it means that it is currently being checked by the alerts processing engine to see if its condition set and dampening rules have been met, and whether or not it should fire an alert. If an alert definition is disabled (or inactive), it will be hidden from the alerts processing engine, and will never fire an alert until it is re-enabled / activated.

An alert definition can always be disabled manually via the UI, and it will sometimes be disabled by a post-processing action filter. Regardless of how an alert definition is disabled, it can always be re-enabled manually via the UI. An automated recovery process can also kick in to re-enable it.

Recovery

Sometimes it's necessary to have two alert definitions work in tandem to provide a more intelligent dampening scenario. For instance, if some condition set becomes true, it may indicate some system state which cannot be automatically recovered from (i.e., human intervention is needed). So an alert is fired when the original condition set is met, and using an action filter this alert definition is automatically disabled.

A systems administrator gets the alert notification, and takes some manual intervention to correct the problem. Once the issue is resolved (and conditions return to normal), another alert definition is triggered. This second definition just so happens to be the "Recovery Alert" definition for the one that was disabled above. Thus, when this second definition fires, it will automatically re-enable the first one again.

Since recovery alert definitions are mainly used to indicate healthy scenarios, and in theory healthy scenarios should ensue as close to 100% of the time as possible, one should be hard pressed to find a good reason to add even one notification to this type of alert definition. Nonetheless, the option is there.
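
The interplay between a self-disabling definition and its recovery definition can be modelled in a few lines. This is a simplified sketch with hypothetical class and field names; it ignores conditions, dampening and notifications entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertDefinition:
    name: str
    enabled: bool = True
    disable_when_fired: bool = False              # action filter: disable itself on firing
    recovers: Optional["AlertDefinition"] = None  # definition to re-enable on firing

    def fire(self) -> bool:
        """Return True if an alert was fired, applying the post-processing steps."""
        if not self.enabled:
            return False                          # hidden from the processing engine
        if self.recovers is not None:
            self.recovers.enabled = True          # recovery alert re-enables its partner
        if self.disable_when_fired:
            self.enabled = False
        return True


problem = AlertDefinition("problem detected", disable_when_fired=True)
recovery = AlertDefinition("back to normal", recovers=problem)

first = problem.fire()            # fires, then disables itself
second = problem.fire()           # suppressed while disabled
recovered = recovery.fire()       # fires and re-enables the first definition
```

After the recovery definition fires, `problem.enabled` is true again and the cycle can repeat.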

Integrations

The alerts subsystem is a funnel for all different kinds of events that happen in the various other subsystems of RHQ. To date, alerts are integrated with and can record events that happen to the following types of data:

Availability

The integration with availability data is fairly straightforward. Since the availability of a resource can only ever be in one of two states - UP or DOWN - these are precisely the two states that the alerts subsystem uses.

However, being in one of these states is not as important as moving between these states. In other words, it doesn't make much sense to constantly be notified that some resource is UP...the resource is UP...the resource is UP, etc. Likewise, after a resource goes DOWN, notifications past that point while it is still down would be redundant.
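
Reacting to transitions rather than to every report can be sketched as below. The function name is hypothetical, and treating the first report as having no previous state (so it produces no change) is an assumption.

```python
def availability_changes(reports):
    """Return (previous, current) pairs for each change between UP and DOWN."""
    changes = []
    previous = None
    for state in reports:
        if previous is not None and state != previous:
            changes.append((previous, state))  # only a change of state is of interest
        previous = state
    return changes


changes = availability_changes(["UP", "UP", "DOWN", "DOWN", "UP"])
```

Five availability reports collapse into two changes here: one when the resource goes DOWN and one when it comes back UP.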

So, for availability, you can only be notified if the resource changes state. Below is a screenshot of what this part of the UI looks like:

Measurement

The integration with measurement data is the most feature-rich integration to date. Here, you're given several options for how you want the alerts processing engine to operate on the incoming measurement data.

First, no matter which alert condition option you choose, you must first select a specific measurement to operate against. The list of measurement choices is derived from the type of resource you are looking at. Below is an image that shows the available measurements for a Platform:

Let's take the "Used Physical Memory" above as an example when describing the rest of the measurement-related alert condition options:

Absolute

For the example, this option is a comparison of the most recently collected used physical memory value against an absolutely specified value chosen by the user.

Here, the comparison can use any of the following three operators: "greater than", "less than", or "equal to". For the "equal to" comparison, since we're dealing with floating point numbers, values are considered equal if they are within 10e-9 of one another.
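
The three operators, including the tolerance used for "equal to", might look like this in code. The function name is hypothetical, the tolerance is copied as written in the text, and treating values exactly at the tolerance as equal is an assumption.

```python
TOLERANCE = 10e-9  # value as written in the text


def compare(value: float, operator: str, threshold: float) -> bool:
    """Compare the most recently collected value against a user-chosen threshold."""
    if operator == "equal to":
        return abs(value - threshold) <= TOLERANCE  # assumption: boundary counts as equal
    if operator == "greater than":
        return value > threshold
    if operator == "less than":
        return value < threshold
    raise ValueError(f"unknown operator: {operator}")


nearly_equal = compare(0.1 + 0.2, "equal to", 0.3)  # differs only by rounding error
```

The tolerance matters because `0.1 + 0.2 == 0.3` is false in floating point arithmetic, even though the two values differ only by a rounding error.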

Percentage

For the example, the option is a comparison of the most recently collected used physical memory value against a percentage of its known baseline. This alert condition option lets you choose whether you want the comparison to be made against the minimum, mean, or maximum baseline value currently recorded in the system. Notice how the last drop-down menu will show the current min, mean, and max baseline values; this gives the users all the information they need on a single page, enabling them to create the entire alert definition without having to navigate around to other parts of the UI.

Again, you can choose any of the following three operators: "greater than", "less than", or "equal to". And, again, for the "equal to" comparison, since we're dealing with floating point numbers, values are considered equal if they are within 10e-9 of each other.
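
As a rough illustration of the percentage option, the threshold is a percentage of whichever baseline value was selected. The function name and the baseline figures below are made up for the example and do not come from RHQ.

```python
def baseline_threshold(baselines: dict, which: str, percent: float) -> float:
    """Return `percent` percent of the chosen baseline ("min", "mean" or "max")."""
    return baselines[which] * percent / 100.0


# Hypothetical baselines for used physical memory, in bytes.
baselines = {"min": 2.0e9, "mean": 3.0e9, "max": 4.0e9}

threshold = baseline_threshold(baselines, "mean", 120)  # 120% of the mean baseline
exceeds = 3.8e9 > threshold                             # "greater than" comparison
```

A collected value of 3.8e9 bytes exceeds 120% of the mean baseline in this example, so a "greater than" condition would be met.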

Relative

For the example, this option is a comparison of the most recently collected used physical memory value against its last known value.

The user interface portion for this option is shown above, though it's rather dull.

Operations

To date, the integration with operations deals only with the current state of some operation, not its operational data (arguments and results). The image below depicts what the UI will look like when creating / editing an alert definition for an RHQ Agent resource type:

On the left-hand side you see the list of operations for your resource. On the right-hand side you see the list of all the states an operation can possibly be in. Below is a quick discussion of each state:

  • INPROGRESS - this is the only begin state; all operations move through this en route to being processed
  • SUCCESS - this is a terminating state; operations move into this if no errors have occurred during processing
  • FAILURE - this is a terminating state; operations move into this if any error has occurred during processing
  • CANCELLED - this is a terminating state; operations move into this only if they are successfully cancelled via the UI
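
The four states listed above map naturally onto an enumeration. This is a sketch only; the `is_terminating` helper is a hypothetical addition for illustration.

```python
from enum import Enum


class OperationState(Enum):
    INPROGRESS = "INPROGRESS"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    CANCELLED = "CANCELLED"

    @property
    def is_terminating(self) -> bool:
        """Every state except the begin state ends the operation."""
        return self is not OperationState.INPROGRESS


terminating = [state.name for state in OperationState if state.is_terminating]
```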

Traits

The integration with traits is a subset of the functionality offered for measurement integration. Whereas measurement data is always numeric and tends to change very frequently, trait data is rather static and rarely changes. Thus, only relative comparison operations are supported for traits, which simply compare the latest value collected for that trait to its last known value.

Above is an image that shows the available traits that you can monitor for a Platform.
