NMIS Metrics, Reachability, Availability and Health

Managing a large, complex environment with ever-changing operational states is challenging. To assist, NMIS, a Network Management System which performs performance management and fault management simultaneously, monitors the health and operational status of devices and creates several individual metrics as well as an overall metric for each device. This article explains what those metrics are and what they mean.

Summary

Consider this in the context that a network device offers a service, and the service it offers is connectivity. While a router or switch is up and all its interfaces are available, it is truly up, and when it has no CPU load it is healthy. As the interfaces get utilised and the CPU gets busy, it has less capacity remaining. The following statistics are considered part of the health of the device:

  • Reachability – is it up or not;
  • Availability – interface availability of all interfaces which are supposed to be up;
  • Response Time;
  • CPU;
  • Memory.

All of these metrics are weighted and a health metric is created. This metric, when compared over time, should always indicate the relative health of the device. Interfaces which aren’t being used should be shut down so that the health metric remains realistic. The exact calculations can be seen in the runReach subroutine in nmis.pl.

Metric Details

Many people wanted network availability, and many tools generated availability based on ping statistics and claimed success. This, however, was a poor solution. For example, the switch running the management server could be down and the management server would report that the whole network was down, which of course it wasn’t. Or worse, a device would be responding to a ping but many of its interfaces were down, so while it was reachable, it wasn’t really available.

So, it was determined that NMIS would use Reachability, Availability and Health to represent the network. Reachability is the pingability of the device. Availability is (in the context of network gear) whether the interfaces which should be up are up or not: interfaces which are “no shutdown” (ifAdminStatus = up) should be up, so a device with 10 interfaces of ifAdminStatus = up, of which 9 have ifOperStatus = up, would be 90% available.

Health is a composite metric, made up of many things depending on the device type, for example CPU and memory for a router. Something interesting here is that part of the health is made up of an inverse of interface utilisation, so an interface which has no utilisation will have a high health component, while an interface which is highly utilised will reduce that metric. So the health is a reflection of load on the device and will be very dynamic.

The overall metric of a device is a composite metric made up of weighted values of the other metrics being collected. The formula for this is configurable, so you can weight Reachability higher or lower than it currently is.

Availability, ifAdminStatus and ifOperStatus

Availability is the interface availability, which is reflected in the SNMP metric ifOperStatus:

  • If an interface is ifAdminStatus = up and ifOperStatus = up, that is 100% for that interface.
  • If a device has 10 interfaces and all are ifAdminStatus = up and ifOperStatus = up, that is 100% for the device.
  • If a device has 9 interfaces with ifAdminStatus = up and ifOperStatus = up, and 1 interface with ifAdminStatus = up and ifOperStatus = down, that is 90% availability.

It is the availability of the network services which the router/switch offers.
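As a rough illustration of this calculation, here is a minimal Python sketch. It is not the code NMIS itself uses; the function name and the dictionary layout are made up for the example. Availability is simply the share of admin-up interfaces which are also operationally up:

```python
def interface_availability(interfaces):
    """Percentage of admin-up interfaces that are also operationally up.

    `interfaces` is a list of dicts with ifAdminStatus and ifOperStatus keys
    (an assumed layout for this sketch, not an NMIS data structure).
    """
    admin_up = [i for i in interfaces if i["ifAdminStatus"] == "up"]
    if not admin_up:
        return None  # nothing is expected to be up, so there is nothing to measure
    oper_up = sum(1 for i in admin_up if i["ifOperStatus"] == "up")
    return 100.0 * oper_up / len(admin_up)


# Ten interfaces which should be up, nine of which actually are.
interfaces = [{"ifAdminStatus": "up", "ifOperStatus": "up"} for _ in range(9)]
interfaces.append({"ifAdminStatus": "up", "ifOperStatus": "down"})
# An interface which has been shut down does not count against the device.
interfaces.append({"ifAdminStatus": "down", "ifOperStatus": "down"})

availability = interface_availability(interfaces)
print(availability)  # 90.0
```

The shut down interface is ignored, which is why shutting down unused interfaces keeps the figure realistic.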

Configuring Metrics Weights

In the NMIS configuration file, Config.nmis, there are several configuration items for the metrics. These are as follows:
'metrics' => {
  'weight_availability' => '0.1',
  'weight_cpu' => '0.2',
  'weight_int' => '0.3',
  'weight_mem' => '0.1',
  'weight_response' => '0.2',
  'weight_reachability' => '0.1',
  'metric_health' => '0.4',
  'metric_availability' => '0.2',
  'metric_reachability' => '0.4',
  'average_decimals' => '2',
  'average_diff' => '0.1',
},

The health metric uses items starting with “weight_” to weight the values into the health metric. The overall metric combines health, availability and reachability into a single metric for each device and for each group and ultimately the entire network.

If more weight should be given to interface utilisation and less to interface availability, these metrics can be tuned. For example, weight_availability could become 0.05 and weight_int could become 0.35. The resulting weights (weight_*) should still add up to 1, that is, 100%.
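With the default values listed above, the weight_* items sum to 1.0 (100%). A small Python sketch of a sanity check for a retuned set of weights might look like the following. The helper names are hypothetical and nothing here is part of NMIS; it only shows the arithmetic of moving weight from one item to another so that the total is unchanged:

```python
import math

# Default health weights as listed in Config.nmis above.
weights = {
    "weight_availability": 0.1,
    "weight_cpu": 0.2,
    "weight_int": 0.3,
    "weight_mem": 0.1,
    "weight_response": 0.2,
    "weight_reachability": 0.1,
}


def weights_are_balanced(w):
    """True when the weight_* items sum to 1.0 (100%), allowing for float rounding."""
    return math.isclose(sum(w.values()), 1.0, abs_tol=1e-9)


def shift_weight(w, source, target, amount):
    """Hypothetical helper: move `amount` of weight from one item to another."""
    tuned = dict(w)
    tuned[source] -= amount
    tuned[target] += amount
    return tuned


# Give interface utilisation more weight at the expense of interface availability.
tuned = shift_weight(weights, "weight_availability", "weight_int", 0.05)
print(weights_are_balanced(weights), weights_are_balanced(tuned))
```

Changing only one of the two items would leave the total above or below 1.0, and the check would flag it.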

Other Metrics Configuration Options

Introduced in NMIS 8.5.2G are some additional configuration options to help tune how this all works, and to make it more or less responsive. The first two options are metric_comparison_first_period and metric_comparison_second_period, which are by default -8 hours and -16 hours.

These are the two main variables which control the comparisons you see in NMIS, the real-time health baselining. NMIS compares calculations made over the period from now back to metric_comparison_first_period (8 hours ago) with calculations made over the period from metric_comparison_first_period (8 hours ago) back to metric_comparison_second_period (16 hours ago).

This means NMIS is comparing, in real time, data from the last 8 hours with the 8 hour period before that. You can make these periods shorter or longer. In the lab I am running -4 hours and -8 hours, which makes the metrics a little more responsive to load and change.

The other new configuration option is metric_int_utilisation_above, which is -1 by default. This means that interfaces with 0 (zero) utilisation will be counted into the overall interface utilisation metrics. So if you have a switch with 48 interfaces, all active but with basically no utilisation, and two uplinks with 5 to 10% load, the average utilisation of the 48 interfaces is very low. To address this, NMIS now picks the highest of input and output utilisation for each interface and only adds interfaces with utilisation above this configured amount. Setting it to 0.5 should produce more dynamic health metrics.
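To see why the threshold matters, here is a simplified Python sketch of the idea. It assumes the interface component is 100 less the average of the per-interface peak utilisation; the real calculation lives in runReach and may differ in detail, and the function name and data layout are invented for the example:

```python
def interface_weight(interfaces, utilisation_above=-1):
    """Interface health component: 100 less the average peak utilisation.

    `interfaces` is a list of (input %, output %) pairs. Only interfaces whose
    peak utilisation is above `utilisation_above` are counted.
    """
    peaks = [max(util_in, util_out) for util_in, util_out in interfaces]
    counted = [p for p in peaks if p > utilisation_above]
    if not counted:
        return 100.0  # nothing counted, so no load to subtract
    return 100.0 - sum(counted) / len(counted)


# 46 idle access ports plus two uplinks at 5% and 10% peak load.
ports = [(0.0, 0.0)] * 46 + [(5.0, 2.0), (10.0, 4.0)]

print(interface_weight(ports))        # default -1: idle ports dilute the average
print(interface_weight(ports, 0.5))   # 0.5: only the two busy uplinks are counted
```

With the default of -1 the idle ports swamp the two uplinks and the component barely moves from 100. With 0.5 only the uplinks are averaged, so the component responds to their load.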

Metric Calculations Examples

Health Example

At the completion of a poll cycle for a node, some health metrics which have been cached are ready for calculating the health metric of a node, so let’s say the results for a router were:

  • CPU = 20%
  • Availability = 90%
  • All Interface Utilisation = 10%
  • Memory Free = 20%
  • Response Time = 50ms
  • Reachability = 100%

The first step is that the measured values are weighted so that they can be compared correctly. So if the CPU load is 20%, the weight for the health calculation will become 90%. If the response time is 100ms it will become 100%, but a response time of 500ms would become 60%. There is a subroutine, weightResponseTime, for this calculation.

So the weighted values would become:

  • Weighted CPU = 90%
  • Weighted Availability = 90% (does not require weighting, already in % where 100% is good)
  • Weighted Interface Utilisation = 90% (100 less the actual total interface utilisation)
  • Weighted Memory = 60%
  • Weighted Response Time = 100%
  • Weighted Reachability = 100% (does not require weighting, already in % where 100% is good)

NB. For servers, the interface weight is divided by two and used equally for interface utilisation and disk free.

These values are now dropped into the final calculation:

weight_cpu * 90 + weight_availability * 90 + weight_int * 90 + weight_mem * 60 + weight_response * 100 + weight_reachability * 100

which becomes “0.2 * 90 + 0.1 * 90 + 0.3 * 90 + 0.1 * 60 + 0.2 * 100 + 0.1 * 100”, resulting in 90% for the health metric.
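The same arithmetic can be written as a short Python sketch, using the default weights from Config.nmis and the weighted values from the example above. This is only a worked version of the formula, not the runReach code, and the dictionary names are made up for the example:

```python
# Default health weights from Config.nmis.
weights = {
    "cpu": 0.2,
    "availability": 0.1,
    "int": 0.3,
    "mem": 0.1,
    "response": 0.2,
    "reachability": 0.1,
}

# Weighted (normalised) values from the router example, where 100 is good.
weighted = {
    "cpu": 90,
    "availability": 90,
    "int": 90,
    "mem": 60,
    "response": 100,
    "reachability": 100,
}

# Each weighted value contributes in proportion to its configured weight.
health = sum(weights[name] * weighted[name] for name in weights)
print(round(health, 2))  # 90.0
```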

The calculations can be seen in the collect debug output, by running nmis.pl type=collect node=<NODENAME> debug=true:
09:08:36 runReach, Starting node meatball, type=router
09:08:36 runReach, Outage for meatball is
09:08:36 runReach, Getting Interface Utilisation Health
09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=200 count=1
09:08:36 runReach, Intf Summary in=0.06 out=0.55 intsumm=399.39 count=2
09:08:36 runReach, Intf Summary in=8.47 out=5.81 intsumm=585.11 count=3
09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=785.11 count=4
09:08:36 runReach, Intf Summary in=0.06 out=0.56 intsumm=984.49 count=5
09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=1184.49 count=6
09:08:36 runReach, Intf Summary in=8.47 out=6.66 intsumm=1369.36 count=7
09:08:36 runReach, Intf Summary in=0.05 out=0.56 intsumm=1568.75 count=8
09:08:36 runReach, Calculation of health=96.11
09:08:36 runReach, Reachability and Metric Stats Summary
09:08:36 runReach, collect=true (Node table)
09:08:36 runReach, ping=100 (normalised)
09:08:36 runReach, cpuWeight=90 (normalised)
09:08:36 runReach, memWeight=100 (normalised)
09:08:36 runReach, intWeight=98.05 (100 less the actual total interface utilisation)
09:08:36 runReach, responseWeight=100 (normalised)
09:08:36 runReach, total number of interfaces=24
09:08:36 runReach, total number of interfaces up=7
09:08:36 runReach, total number of interfaces collected=8
09:08:36 runReach, total number of interfaces coll. up=6
09:08:36 runReach, availability=75
09:08:36 runReach, cpu=13
09:08:36 runReach, disk=0
09:08:36 runReach, health=96.11
09:08:36 runReach, intfColUp=6
09:08:36 runReach, intfCollect=8
09:08:36 runReach, intfTotal=24
09:08:36 runReach, intfUp=7
09:08:36 runReach, loss=0
09:08:36 runReach, mem=61.5342941922784
09:08:36 runReach, operCount=8
09:08:36 runReach, operStatus=600
09:08:36 runReach, reachability=100
09:08:36 runReach, responsetime=1.32

Metric Example

The metric calculations are much more straightforward. These calculations are done in a subroutine called getGroupSummary in NMIS.pm. For each node, the availability, reachability and health are extracted from the node’s “reach” RRD file and then weighted according to the configuration weights.

So based on our example before, the node would have the following values:

  • Health = 90%
  • Availability = 90%
  • Reachability = 100%

The formula would become “metric_health * 90 + metric_availability * 90 + metric_reachability * 100”, resulting in “0.4 * 90 + 0.2 * 90 + 0.4 * 100 = 94”. So this node has a metric of 94, which is averaged with all the other nodes in its group, or in the whole network, to produce the metric for each group and for the entire network.
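As a sketch of the node and group calculation in Python (the function names are invented here, the second node is a made-up example, and the actual getGroupSummary code works from RRD data rather than plain numbers):

```python
# Default metric weights from Config.nmis.
metric_weights = {
    "metric_health": 0.4,
    "metric_availability": 0.2,
    "metric_reachability": 0.4,
}


def node_metric(health, availability, reachability, w=metric_weights):
    """Weighted combination of a node's health, availability and reachability."""
    return (w["metric_health"] * health
            + w["metric_availability"] * availability
            + w["metric_reachability"] * reachability)


def group_metric(node_metrics):
    """A group (or the whole network) is the plain average of its node metrics."""
    return sum(node_metrics) / len(node_metrics)


router = node_metric(health=90, availability=90, reachability=100)
# A second, hypothetical node which is perfectly healthy.
switch = node_metric(health=100, availability=100, reachability=100)

print(round(router, 2))                          # 94.0
print(round(group_metric([router, switch]), 2))  # 97.0
```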

Open Source Software and Chilli Con Carne

I am a big proponent of Open Source Software and all the things it has delivered for individuals, organisations and society. Where would we be today if it wasn’t for GNU, Linux, Apache, MySQL (MariaDB), MongoDB, JavaScript, JQuery, Perl, Python and PHP, not to mention NMIS and Open-AudIT?

These and so many earlier Open Source projects were foundational and fundamental to the Internet as it grew, and have been the grandparents, uncles and aunts of the more recent explosion of Open Source projects based around new innovations which could only have been created because of this heritage.

The classic birth of an Open Source project is, “well I really like this (software|language|database) but it does not meet all my requirements, I think I will write one” or “I have this problem and nothing existing really solves this problem the way I need it to, I think I will write one” or even better “this open source (software|language|database) is so good, how could I help to make it better”.

Open Source isn’t all about writing code, people can contribute in all kinds of ways, including testing, documentation, project management, requirements analysis and so much more.

Ultimately for me, Open Source Software is the awesome result of people with diverse backgrounds, skills, experiences and probably most importantly requirements working together to create a solution which embodies the definition of synergy. The result is something which is more generally useful to more people, because of the diversity of this input.

Which brings me to Chilli Con Carne. I love Mexican food, and have done ever since I first went to Montezuma’s Restaurant in Taringa, Queensland as a teenager. From travelling to the USA and then living in California for a while, I learnt about the different types of Mexican food and how different Tex-Mex is to Mexican food. On more recent trips to Mexico I have learnt how awesome and diverse Mexican cuisine is.

But Chilli Con Carne is not Mexican, it is really Tex-Mex and for me it also brings some of the slow food movement ideas by cooking what you need, using local produce in a traditional way.

I have been cooking Mexican food for years using meal kits and finally, I decided I could do better by doing something myself, so with the help of YouTube and Jamie Oliver, I found a great recipe, which I adapted to what I had and it produced an awesome result.

I was talking to my Opmantek colleagues about it, and they contributed some “code changes” to make it better. MarkD suggested smoked paprika instead of paprika, which was an amazing improvement. MarkH sent his Chilli Con Carne recipe and I adopted the brown sugar and chocolate, which added a richness and smoothness to the dish.

Cooking is the ultimate in iterative development, cook, test, taste, improve, repeat. The current iteration of my Chilli Con Carne recipe is included below and it keeps changing and developing as I get new ideas and input from others.

For me, Chilli Con Carne is just like Open Source Software, the product of synergy.

Open Sauce Chilli Con Carne Recipe

I would call this a mild recipe; my kids have eaten this with no problem. Adding more chilli flakes or using hotter chillies would make this as hot as you prefer.

This batch makes enough to feed 8 with some leftovers. I usually cook a big batch and freeze some for convenient meals later.

Ingredients

Mexi Spice Mix

  • 3 teaspoons smoked paprika
  • 3 teaspoons of cumin
  • 2 teaspoons of dried oregano
  • Pinch salt
  • Pinch pepper
  • Lemon zest
  • Juice from lemon

Vegetables Chopped Roughly

  • 2 rough cut onions
  • 1-2 red capsicums (bell peppers)
  • 1-2 yellow capsicums (bell peppers)
  • 1-2 green capsicums (bell peppers)

Chillies, cut up fine with seeds removed

(Leave the seeds in if you want some more heat)

  • 1 large Poblano chilli
  • OR 2 Aussie green chilli
  • OR your favourite chillies

Other things to add

  • 2 tins tomatoes
  • 1/2 tin water, use water from beans
  • Coriander (cilantro)
  • 1 cinnamon stick
  • 2 tins black beans including water
  • 2 tins red kidney beans
  • 1 tablespoon light brown sugar (optional)
  • 60 grams unsweetened baking choc pieces (optional)
  • 4 teaspoons hot chilli flakes

Butcher

  • 1.4kg beef chunks

Preparation

Marinate the Meat

Make the Mexi Spice Mix and combine it with the meat, making sure it is really spread through all the meat. Leave to marinate in the fridge for as long as you have time for; overnight is good, an hour or so is OK.

Cooking

If you don’t have time to marinate, that is OK; just prepare the same way and put it straight into the pan.

I cook using a large electric fry pan, which works well and I can leave it cooking overnight if I have time.

The intense part (10-15 mins)

  • High heat
  • Braise the beef on the stove top
  • If not already marinated add in Mexi spices
  • Add in veggies, then tomatoes and black beans and chillis
  • Break cinnamon stick

The easy part (~60 mins)

If you want the chilli thicker, cook uncovered, if you want it thinner, keep it covered.

  • Reduce heat and cook for 15 mins (level 9, 180C)
  • Stir and cook for another 15 mins
  • Reduce heat to simmer and check after 15 mins
  • Reduce heat as needed and check every 15 mins

Extra flavour as needed

While cooking, check the flavour and add as taste dictates, but add in small doses, stir through and taste again after 10-15 mins.

  • 1 teaspoon hot chilli flakes
  • 1 teaspoon of cumin
  • 1 teaspoon smoked paprika

The relaxing part (as long as you have time for)

  • Cover the dish
  • Reduce heat to a low simmer, probably the lowest setting you have
  • Leave for as long as you can, 2 hours good, 4 hours better, leaving overnight is awesome
  • Keep an eye on total moisture.

Soupy Tip

If it is too soupy, scoop off some of the liquid and keep it as a soup. You can add beans to it and cook it up a little longer; there is so much flavour in that soup.

Serving

Serve as you like, in a bowl, cover in cheese and add some sour cream, accompanied by corn chips is pretty good.

If you prefer a thicker chilli, serve in soft tacos or burrito wraps.

Enjoy.

[White Paper] An IT Managers guide to Network Process Automation

This guide is designed for IT Managers looking to implement Network Process Automation in their organisation.

Key Points:

  • Focus on good operational practices.
  • Picking the right tasks.
  • Handling of common issues through automation.
  • Mapping out the automation process.
  • Time savings.
  • Checklist.

The guide discusses the best approach for change management and team buy-in, provides a methodology framework to use when considering the automation of a manual task in a network environment and the steps to take in order to identify an effective test case for your organization.


Enhancing Event Management using live real world data

Overview

When dealing with thousands of events coming into your NOC from many locations and different customers, operators rely on getting useful information which will help them make sense of the events pouring through the NOC.

Using opEvents, it is relatively easy to bring just about any data source into your event feed so that the Operations team has improved context for what is happening and ultimately what might be the root cause of the network outage they are currently investigating.

Using Twitter Feeds for Event Management

If you look into Twitter, you will find many Government and other organisations using Twitter to issue alerts and make announcements. A little bit of Googling and I found some excellent Twitter feeds for severe weather, general weather and earthquake tweets. By monitoring for these in opEvents, the result is that you have tweets visualized in your overall Event Management view.

[Image: opEvents management view]

Useful Twitter Feeds

Severe Weather

Weather Tweet

Earthquake Tweet

Listening to Twitter Feeds

There are several ways to listen to Twitter feeds. The quickest one for me was to use Node-RED, something I use for Home Automation and IoT-like applications.  Configuring Node-RED with the feed data above and then creating an opEvents JSON event was very straightforward.

[Image: Node-RED configuration view]

The code included in the node “Make Event” is below. It creates a JSON document with the correct payload which is a compatible opEvents JSON event (which are a really great way to deal with events), then writes it to the file:

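// Only English-language tweets are turned into events; anything else is dropped.
if (msg.lang === "en") {
    // Keep the tweet text and topic before the payload is replaced.
    const details = msg.payload;
    const event = msg.topic;
    const timenow = Date.now();
    // One file per event, named by timestamp.
    msg.filename = "/data/json-events/event-" + timenow + ".json";
    // Replace the payload with an opEvents-compatible JSON event object.
    msg.payload = {
        node: "twitter",
        event: event,
        element: "Sentiment: " + msg.sentiment.score,
        details: details,
        sentiment_score: msg.sentiment.score
    };
    return msg;
}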

Getting Twitter Events into opEvents

Now that we have a well-formed JSON document with the necessary fields, opEvents will consume it once told which directory to look in.

I added the following to opCommon.nmis, in the section opevents_logs, and restarted the opEvents daemon, opeventsd.

'nmis_json_dir' => [
'/data/json-events'
],

The result can be seen well in opEvents when you drill into the “twitter” node (you could, of course, call this node anything you like, e.g. “weather” or “earthquake”).

[Image: opEvents Twitter feed]

Clicking on one of the weather events with a high sentiment score (more on that in a second), you can see more details about this event and what impact it might have.  Unfortunately we have a Tropical Cyclone in North Queensland at the moment; hopefully, no one will be injured.

[Image: opEvents event view]

Enriching the Tweet with a Sentiment Score

The sentiment score is a heuristic which calculates how positive or negative some text is, i.e., what is the sentiment of that text.  The text analysis looks for keywords and computes a score, then in opEvents, we use this score to set the priority of the event so that we can better see the more critical weather events because the sentiment of those tweets will be negative.

In the opEvents EventActions.nmis file, I included some event policy to set the event priority based on the sentiment score, which was an event property carried across from Node-RED.  This carries through the rest of opEvents automagically.

'15' => {
  IF => 'event.sentiment_score =~ /\d+/',
  THEN => {
    '5' => {
      IF => 'event.sentiment_score > 0',
      THEN => 'priority(2)',
      BREAK => 'false'
    },
    '10' => {
      IF => 'event.sentiment_score == -1',
      THEN => 'priority(3)',
      BREAK => 'false'
    },
    '20' => {
      IF => 'event.sentiment_score == -2',
      THEN => 'priority(4)',
      BREAK => 'false'
    },
    '30' => {
      IF => 'event.sentiment_score == -3',
      THEN => 'priority(5)',
      BREAK => 'false'
    },
    '40' => {
      IF => 'event.sentiment_score < -3',
      THEN => 'priority(8)',
      BREAK => 'false'
    },
  },
  BREAK => 'false'
},
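Read as plain logic, the policy above maps a sentiment score to a priority roughly as in the following Python sketch. This is only my reading of the rules for illustration (opEvents evaluates the policy itself, and the function name is made up); it does show that a neutral score of 0 matches none of the rules, so the priority is left untouched:

```python
def priority_for_sentiment(score):
    """Return the priority the policy would set, or None if no rule matches."""
    if score is None:
        return None  # outer rule: only events carrying a sentiment score are considered
    if score > 0:
        return 2
    if score == -1:
        return 3
    if score == -2:
        return 4
    if score == -3:
        return 5
    if score < -3:
        return 8
    return None  # e.g. a neutral score of 0


for score in (3, 0, -1, -2, -3, -6):
    print(score, priority_for_sentiment(score))
```

The more negative the tweet, the higher the priority, which is what pushes the severe weather events to the top of the view.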

Because opEvents uses several techniques to make integration easy, I was able to get the tweets into the system in less than one hour (originally I was monitoring tweets about the Tour de France), then I spent a little more time looking for interesting weather tweets and refining how the events were viewed (another hour or so).

Summing Up

If you would like an event management system which can easily integrate with any type of data from virtually any source into your workflow, then opEvents could be the right solution for you.  As a bonus, you can watch the popularity of worldwide sporting events like the Tour de France.

Monitoring Tour de France Tweets with opEvents

[Image: opEvents Tour de France view]

[White Paper] Configuration Management Systems Delivering Change and Compliance

The fundamental capability which configuration management provides is backup and archiving of critical configuration data from network and server equipment. This, along with collecting detailed inventory data, provides the basis for managing change and compliance.

This paper is to help Network Engineers, IT Managers and Executive Leadership understand the benefits of configuration management and how it contributes to change and compliance management at the business.
