Event Correlation with opEvents

By Paul McClendon - Support Engineer

As a support engineer with Opmantek, I work with many organizations that monitor thousands of devices across their networks. In complex network environments, thousands or even millions of events can be generated in a short period. These events range from critical to informational, and telling the two apart is key to keeping a network running efficiently.

Looking through event logs is tedious, and with so many events, it’s easy to miss the critical ones. Engineers have told me they stopped looking at event notifications because there were so many that they became desensitized to them. Opmantek’s event management solution, opEvents, not only reduces event spam but can also be used for effective time and event management.

A team of engineers at a large organization was being bombarded by events from hundreds of machines during its regularly scheduled Windows update period. Because events occurred so frequently during this time, the team ignored event notifications. During one such period, however, multiple notices reported that a group of servers, along with their services, had gone down. The event logs showed the outage and the IT staff were notified, but because the notices arrived during a typically busy event period, they were ignored. As a result, the servers stayed down until someone finally noticed hours later. The downtime resulted in lost revenue for the company and some very unhappy managers.

The cause turned out to be a router that had stopped working. The team looked for a solution and came upon Opmantek’s opEvents. opEvents analyzes, sorts, and correlates multiple events from various sources into a single event, reducing noise before any alert is created. This cuts event spam and clutter so your team can quickly identify which events are important and which are not. The team can now quickly see not only when a router is entirely dead but also when a router is underperforming, which helps prevent future downtime and makes the team more proactive.

The team of engineers in the example above discussed how opEvents could be used to prevent a situation like this from occurring again. They came up with an event correlation rule to notify them in similar cases. 

To create this type of correlation rule, start by navigating to the conf directory of your opEvents install and creating an entry in EventRules.nmis.

A simple event correlation rule consists of:

  • An event name, specifying the name of your newly created event.
  • A list of the event names to correlate.
  • A minimum count of events that must be detected to trigger the rule.
  • An optional list of groupby clauses. These define whether the count is interpreted globally for all named events, or separately within smaller groups.
  • An optional enrich clause. This adjusts the content of the newly created event.
  • A window parameter, which defines the time window, in seconds, to examine for matching events.

An example event correlation rule is shown below:

'3' => {
    name    => 'Customer Outage',
    events  => ["Node Down", "SNMP Down"],
    window  => '60',
    count   => 5,
    groupby => ['node.customer'], # count separately for every observed value of customer
    enrich  => {priority => 3, answer => 42}, # any such items get inserted into the new event
},

The example rule says: take the “Node Down” and “SNMP Down” events seen within a 60-second window and separate them into per-customer groups; if any group contains 5 or more events, create a new event called Customer Outage. This is only one example of a custom event correlation rule. Many more examples, use cases, and features are discussed on our opEvents Wiki page.
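
To make the window, count, and groupby semantics concrete, here is a minimal Python sketch of the counting logic described above. It is an illustration only and not how opEvents is implemented: the `correlate` function, the flat `customer` field, the numeric `time` field, and the choice to reset the counter after a rule fires are all assumptions made for this example.

```python
from collections import defaultdict, deque

# Hypothetical rule mirroring the fields of the example rule above.
RULE = {
    "name": "Customer Outage",
    "events": {"Node Down", "SNMP Down"},
    "window": 60,            # seconds
    "count": 5,
    "groupby": "customer",   # assumed flat field name for this sketch
    "enrich": {"priority": 3, "answer": 42},
}

def correlate(events, rule):
    """Return one synthetic event per group each time that group reaches
    rule['count'] matching events within rule['window'] seconds.

    `events` must be sorted by 'time' (seconds); each event is a dict
    with 'name', 'time', and the groupby field.
    """
    recent = defaultdict(deque)  # group value -> timestamps still in window
    created = []
    for ev in events:
        if ev["name"] not in rule["events"]:
            continue  # not one of the events named in the rule
        group = ev[rule["groupby"]]
        times = recent[group]
        times.append(ev["time"])
        # Drop timestamps that have fallen out of the time window.
        while times and ev["time"] - times[0] > rule["window"]:
            times.popleft()
        if len(times) >= rule["count"]:
            created.append({
                "name": rule["name"],
                "time": ev["time"],
                rule["groupby"]: group,
                **rule["enrich"],  # enrich items are copied into the new event
            })
            times.clear()  # assumption: start counting afresh after firing
    return created

# Five matching events for one customer inside 60 seconds, one for another.
sample = [{"name": "Node Down", "time": t, "customer": "acme"}
          for t in (0, 10, 20, 30, 40)]
sample.append({"name": "SNMP Down", "time": 15, "customer": "globex"})
sample.sort(key=lambda e: e["time"])

for new_event in correlate(sample, RULE):
    print(new_event)
```

Because the count is kept per customer, a burst of failures at one customer produces a single Customer Outage event, while isolated failures at other customers do not reach the threshold.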

Using this event management tool reduces event spam, allowing your team to quickly notice critical events that need action. Important events will be harder to overlook during event storms, and redundant events can be reduced by automating event handling. Save time, reduce operational costs, gain network insight, and keep your network performing smoothly. Expand your toolkit with these features and more in opEvents and take control of your network.

For more information on Opmantek’s Event Management Tools, other Opmantek solutions, or to schedule a demonstration, please visit our website at www.opmantek.com. You can also email us at contact@opmantek.com.