Packet Pushers: Detect, Diagnose, And Act Podcast


Keith Sinclair, CTO and progenitor of NMIS, joins Greg Ferro on Packet Pushers.

They discuss:

  • What NMIS does and how it works
  • Protocol support including SNMP, WMI, SSH, RESTful APIs, and more
  • The persistence of SNMP
  • Opmantek’s approach of detect, diagnose, and act
  • Automation capabilities
  • How NMIS uses dashboards, portals, and maps

How to manage capacity, before it becomes a problem.

Capacity Management is the proactive management of any measurable finite resource.

 

This blog gives you a simple-to-follow outline for managing capacity properly, so that if you ever have to resolve capacity issues, you are ahead of the curve and ready to implement remediation.

 

Many consider capacity management difficult to achieve. But all worthwhile achievements take discipline to execute and accomplish. With careful consideration, monitoring and planning, you can ensure that it becomes manageable and deliverable.

Don’t forget that as part of any new deployment or upgrade, and as budget allows, the additional demand should be built into the design, with extra capacity ready to service the new peaks. That way the new peak load is accounted for and new baselines are created.

 

Analysis Paralysis

 

The overall concept is that you don’t create reports just to create reports. People might read them once and never again. But because the reports are automated, they keep being sent and remain unopened, filtered or archived. This is not the result you want.

 

The behaviour you want to drive is for people to use your reports, so create reports that drive actions. For example, node health reports can provide checklists that drive daily troubleshooting, flag maintenance check-ups, and prompt upkeep or repair of devices. Use daily event reports to help the engineering team understand the normal background noise and static across your network, or to drive a cleanup. Then there are weekly and monthly reports: a WAN/interface report supporting bandwidth and equipment investment might only need to be produced monthly, whereas a report on a resource whose capacity consumption is growing faster should be produced weekly.

 

Detecting capacity issues through threshold management.

 

The problem with capacity issues is that they can present themselves in so many different ways; the common result is that something isn’t working the way it was, or should be. Just as I described in my blog on bandwidth congestion, a user will report that “some application” doesn’t work like it did yesterday, or a capacity threshold alarm will have escalated. If you want to learn about root cause analysis, check out Mark’s webinar video.

 

Using Opmantek Products to manage capacity

 

Add your devices to NMIS (and while you’re at it, ensure that you have a naming convention to follow, have all your SNMP done and your network documented)

  1. IP, Name and Community String
  2. Assign roles to devices (use the built-in Core, Distribution, Access)

Preparing Visibility

  1. Set up regular reports using opReports
    1. If you manage a network choose the network reports
    2. If you manage servers use the capacity report
    3. If you manage servers and networks, do steps 1 and 2
    4. Set up the scheduling – Have them emailed once a week in time for your planning and performance review session.
  2. Set up capacity dashboards using TopN views in opCharts
    1. Add TopN and Network Maps to your view (good practice)
    2. Create charts for your most important resources

 

Simple Alarming and Notifications

  1. Enable notifications for critical resource capacity issues. Of the available levels (Normal/Warning/Minor/Major/Critical/Fatal), start with Critical and Fatal only, and add more later as you gain insight.
  2. Set up email notifications for Threshold events so that they reach the right teams, based on the Role (Core, Distribution, Access) or Type (Server, Router, Switch) of the device.

Trending – for predictive capacity planning

  1. Enable opTrend to find anomalies in usage (events) and resources which are continuously trending outside of normal (Billboard)
    1. Notify on critical opTrend threshold events.
    2. Review opTrend Top of The Pops Billboard at your regular capacity review meetings.

 

Simple steps when managing capacity issues as incidents.

 

While not ideal, issues/incidents seen at the helpdesk could potentially originate from a change that took place on the network or in the environment. In the real world, even with the best change management implementation, a change or outage may cause a capacity issue somewhere and trigger an alarm.

 

Ask: what has changed? Has something in the environment changed?

 

Typically a capacity threshold breach is an indicator of:

    1. A new service being added
    2. A new demand
    3. A network change
    4. Some other change
    5. A finite asset reaching a predetermined capacity

 

Approaches to Baselining for Monitoring and Support:

 

Review all your resources and categorise them by type, e.g. Internet connections, site links, etc. For each category, settle on baseline usage levels as percentages (Fatal, Critical, Major, etc.); these will be your starting baseline. It is critical to know your baseline, because all your threshold alarms will be triggered at the levels you set, and notifications should go out only for the more serious alarms. You don’t want to “cry wolf.”

 

Consider grouping your resources, for example: Core, Application, DMZ, Edge, Branch, Internet Links, General WAN etc.

 

And within each group, consider the following resources you want to monitor:

 

CPU, Memory, Bandwidth Utilisation

 

Start by using general thresholds for each based on the peak demands you have seen.

 

These are your proactive warnings that will send an alarm to your management platform. You may want to set some escalation rules for the resource for example:

 

85% – 95% → Major → Alarm Notification (business hours) → to the capacity team

>95% → Critical → Alarm Notification (24×7) → helpdesk/NOC
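
The two escalation rules above can be expressed as a small lookup. This is a minimal sketch in Python, not NMIS configuration: the `escalate` function and the `Escalation` tuple are hypothetical names, and treating exactly 95% as Major is an assumption, since the ranges above do not say which rule owns the boundary.

```python
from typing import NamedTuple, Optional


class Escalation(NamedTuple):
    severity: str
    notify_window: str
    recipient: str


def escalate(utilisation_pct: float) -> Optional[Escalation]:
    """Map a utilisation percentage to the escalation rule it triggers.

    Returns None below 85%, where no notification is sent.
    """
    if utilisation_pct > 95:
        return Escalation("Critical", "24x7", "helpdesk/NOC")
    if utilisation_pct >= 85:
        # Assumption: exactly 95% still counts as Major.
        return Escalation("Major", "business hours", "capacity team")
    return None


# Hypothetical utilisation samples, in percent.
for sample in (60.0, 88.5, 97.2):
    print(sample, escalate(sample))
```

Keeping the rules in one place like this makes it easy to review them at your regular capacity meeting and to tighten them as you gain insight.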

 

Using the trend analysis provided by opTrend, you can identify highly anomalous usage (for example, usage that is low when it would normally be high at that time of day) or proactively look at resources consistently trending up or down against their normal levels. This lets you review the resource ahead of time for an appropriate modification (upgrade, downgrade, offloading work, etc.). As the network continues to grow and support new services, the baseline will change over time (a sliding baseline), so capacity issues may “creep up” on you, because alarm thresholds may not be breached consistently enough to send an alert. It is therefore important to also look at the baseline’s “rate of change” over time to determine capacity needs (e.g. a 10% change over one week). When planning to increase capacity, be sure to allow for procurement and provisioning time.
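
The “rate of change” check can be illustrated with simple arithmetic. The sketch below is plain Python and does not describe how opTrend works internally; the function names, the sample baseline values and the assumption of steady linear growth are all hypothetical.

```python
def weekly_change_pct(previous_baseline: float, current_baseline: float) -> float:
    """Percentage change of the baseline from one week to the next."""
    if previous_baseline <= 0:
        raise ValueError("previous_baseline must be positive")
    return (current_baseline - previous_baseline) / previous_baseline * 100.0


def weeks_until(threshold: float, current_baseline: float, weekly_growth: float) -> float:
    """Weeks until the baseline reaches the threshold, assuming linear growth.

    weekly_growth uses the same units as the baseline (for example,
    utilisation percentage points per week). Returns 0.0 if the threshold
    is already met and infinity if the baseline is flat or shrinking.
    """
    if current_baseline >= threshold:
        return 0.0
    if weekly_growth <= 0:
        return float("inf")
    return (threshold - current_baseline) / weekly_growth


# Hypothetical link: baseline utilisation moved from 50% to 55% in one week.
change = weekly_change_pct(50.0, 55.0)
remaining = weeks_until(85.0, 55.0, 55.0 - 50.0)
print(f"{change:.1f}% change, about {remaining:.0f} weeks to the 85% threshold")
```

Comparing the projected number of weeks with your procurement and provisioning lead time shows whether the upgrade needs to be ordered now.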

I mentioned the sliding baseline and tracking rate of change of the baseline so the capacity issues don’t “creep up”

opReports v3.1.11 New Release

This has been a busy year for opReports and the product gets better and better with every update. There are new reports that have been created for each release and you will gain a better understanding of your network by installing opReports.

In this release we have introduced the following:

A new report type: Monitored Services Report that offers the following new options for Node Availability and Grouped Availability;

‘opreports_availability_average_packetloss’ in path/to/omk/conf/opCommon.nmis controls whether the previous ‘Packet Loss %’ (now renamed ‘Count Packet Loss %’) or ‘Average Packet Loss %’ is displayed in this report.

Uses a newly developed ‘known_reports_cache’ to speed up the loading of reports;

A new option ‘opreports_do_cache_known_reports’ in path/to/omk/conf/opCommon.nmis determines whether this cache is enabled or disabled. The cache is enabled by default.

Testing on a server with a moderately large number of generated reports found that the load time for viewing a large report improves from more than 9 minutes to less than 45 seconds with the cache enabled.

Full release notes are available – here.

opReports v3.1.8 New Release

opReports has been updated with four new reports that should help all organizations. All four belong to the interface capacity reports group.

To run any of these reports, choose Create Capacity Report and select the corresponding type.

Grouped Interface Capacity Report

The Grouped Interface Capacity Report displays a comparison between configured interface speeds and observed actual bandwidth figures.

Statistics are shown for all devices in order of the devices’ Group Membership (i.e. the NMIS configuration property ‘group’).

[Screenshot: opReports Grouped Interface Capacity Report]

Interface Unicast Packets Report

The Interface Unicast Packets report displays the ifInUcastPkts and ifOutUcastPkts statistics for one or more interfaces.

[Screenshot: opReports Interface Unicast Packets Report]

WAN Utilisation Distribution Report & WAN Utilisation Distribution Summary Report.

The WAN Utilisation Distribution Report displays the combined, input and output utilisation frequency distributions for configured distribution groups. The WAN Utilisation Distribution Summary Report displays only the combined utilisation frequency distribution for configured distribution groups.

Customized WAN Utilisation Distribution Levels:

Two default configured distribution groups are provided: Default WAN Distribution Levels Descending and Default WAN Distribution Levels Ascending.

The default groupings for both of these default options are:

  • <=30%
  • >30% and <=70%
  • >70% and <=90%
  • >90%.

To add customised groupings, add a distribution group under the report_wan_distributions setting in /path/to/omk/conf/opCommon.nmis, using the same format as either of the aforementioned default options. The new grouping then appears in the opReports WAN Utilisation Distribution Levels displayed under Create New Report >> Layout, described under Setup above.
Column order can be customised through the group names, which are sorted in ascending order: group1 displays before group2, group2 before group3, etc.

Here are the default configuration options as provided in opCommon.nmis:
'report_wan_distributions' => {
  'Default WAN Distribution Levels Descending' => {
    "group4" => {
      "description" => "<=30%",
      "min" => 0,
      "max" => 30,
    },
    "group3" => {
      "description" => ">30% <=70%",
      "min" => 30,
      "max" => 70,
    },
    "group2" => {
      "description" => ">70% <=90%",
      "min" => 70,
      "max" => 90,
    },
    "group1" => {
      "description" => ">90%",
      "min" => 90,
      "max" => 1000000,
    },
  },
  'Default WAN Distribution Levels Ascending' => {
    "group1" => {
      "description" => "<=30%",
      "min" => 0,
      "max" => 30,
    },
    "group2" => {
      "description" => ">30% <=70%",
      "min" => 30,
      "max" => 70,
    },
    "group3" => {
      "description" => ">70% <=90%",
      "min" => 70,
      "max" => 90,
    },
    "group4" => {
      "description" => ">90%",
      "min" => 90,
      "max" => 1000000,
    },
  },
}
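
To make the grouping concrete, here is a minimal Python sketch that counts utilisation samples per distribution group, using the same min/max values as the default levels. It is an illustration only, not opReports code: the `GROUPS` list, the `distribution` function and the sample values are hypothetical, and the boundary handling (a sample belongs to the first group whose max it does not exceed) is an assumption taken from the group descriptions.

```python
# (description, min, max) copied from the default distribution levels.
GROUPS = [
    ("<=30%", 0, 30),
    (">30% <=70%", 30, 70),
    (">70% <=90%", 70, 90),
    (">90%", 90, 1000000),
]


def distribution(samples):
    """Count how many utilisation samples fall into each group."""
    counts = {description: 0 for description, _low, _high in GROUPS}
    ordered = sorted(GROUPS, key=lambda group: group[2])
    for value in samples:
        for description, _low, high in ordered:
            # Assumption: upper bounds are inclusive, as in "<=30%".
            if value <= high:
                counts[description] += 1
                break
    return counts


# Hypothetical utilisation samples for one WAN interface, in percent.
print(distribution([12.0, 30.0, 45.5, 70.0, 82.0, 91.3, 99.9]))
```

A link that spends most of its samples in the top group is a candidate for an upgrade, while one that sits almost entirely in the lowest group may be over-provisioned.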

Here is an example screenshot of a WAN Utilisation Distribution Summary Report with Show 95th Percentile selected and using the Default WAN Distribution Levels Descending default configured distribution group:

[Screenshot: opReports WAN Utilisation Distribution Summary Report]

This version also introduces two new options under Sources >> Node Selection: ‘by Regular Expression for a Nodes and Interfaces Report per Group’ and ‘by Regular Expression for Groups, Nodes and Interfaces’.

(This option causes the generation of a separate report for each of the known groups. This option is available for scheduled reports only, excluding ‘once only’ scheduled reports.)

Groups must match the regular expression given for the group name
AND
Nodes must match the regular expression given for the node name
AND
Interface descriptions must also match a separate regular expression.

Only those interfaces are selected where all three regular expressions match. However, for reports where interfaces are not relevant, interfaces are disregarded.

The regular expression for interfaces is applied to both the interface’s ifDescr and Description properties in parallel, and a match for either or both selects the interface. (The NMIS GUI presents ifDescr as “Name” or “Name (ifDescr)”. Depending on the device and its modelling ifDescr may or may not be adjustable, but Description can be set easily within NMIS.)

In the GUI this option is called “by Regular Expression for a Nodes and Interfaces Report per Group”. The report schedule requires that you supply:

group_each_regexp = <regular expression>
AND
node_regexp = <regular expression>
AND
node_intf_regexp = <regular expression>.

All three regular expressions are evaluated at report creation time.
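
The selection logic described above can be sketched as follows. This is a hedged Python illustration, not the opReports implementation, whose regular expression dialect and matching details may differ. The `Interface` class, the `select_interfaces` function and the sample inventory are hypothetical, and unanchored matching with `re.search` is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class Interface:
    group: str
    node: str
    if_descr: str
    description: str


def select_interfaces(interfaces, group_each_regexp, node_regexp, node_intf_regexp):
    """Keep interfaces where the group, node and interface expressions all match.

    The interface expression is tried against both ifDescr and Description;
    a match on either one is enough.
    """
    group_re = re.compile(group_each_regexp)
    node_re = re.compile(node_regexp)
    intf_re = re.compile(node_intf_regexp)
    return [
        intf for intf in interfaces
        if group_re.search(intf.group)
        and node_re.search(intf.node)
        and (intf_re.search(intf.if_descr) or intf_re.search(intf.description))
    ]


# Hypothetical inventory for illustration.
inventory = [
    Interface("Branch_Sydney", "syd-rtr-01", "GigabitEthernet0/0", "WAN uplink"),
    Interface("Branch_Sydney", "syd-sw-01", "GigabitEthernet0/1", "WAN backup"),
    Interface("Core", "core-rtr-01", "Serial0/0", "WAN uplink"),
]
selected = select_interfaces(inventory, r"^Branch", r"rtr", r"WAN")
print([(intf.node, intf.if_descr) for intf in selected])
```

Only the first interface survives here: the second fails the node expression and the third fails the group expression, even though all three match the interface expression through their Description.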

Extending NMIS with Opmantek’s Modules

Extending NMIS with Opmantek’s Modules

NMIS has long been one of the most widely used open-source network management systems in the world. What many users don’t know is how easy it is to extend the core with the suite of add-on modules. These modules replace other network tools, allowing businesses to save on licensing costs and increase overall network performance visibility through system expansion and consolidation of applications.
[Figure: OMK Product Wheel]

Building Solutions with NMIS Modules

By combining NMIS with various other modules, Opmantek is able to provide software solutions to suit many different enterprise needs – here are a few of the popular combinations that are delivering strong results and allowing our customers to roll several stand-alone applications into one single NMIS licensing bundle.

 

Network Performance Management and Diagnostics

NMIS, opCharts and opReports

This combination of modules will provide you with the full NMIS capabilities for monitoring network health, capacity planning, event management and alerting, presented in interactive dashboards and reports that can be customised for user groups, so that business users can see relevant performance information and engineers can see more detailed operational information.

 

Configuration Management Database:

Open-AudIT Professional and opConfig

Looking to replace your CMDB?  This combination of modules is saving organisations thousands of dollars in licensing fees each year by automating device discovery and audit, storing configurations, monitoring changes and pushing configuration changes out to sets of devices.

 

Network Configuration and Compliance Automation

NMIS, Open-AudIT Professional/Enterprise, opConfig and opEvents

Save time and money on network administration by using process automation to manage inventory, remediate known issues, consolidate and deduplicate events, automatically gather network information, detect and roll back configuration and file changes, and more.

 

Traffic Management

NMIS and opFlow

This combination replaces other network monitoring and Netflow tools to give you a consolidated view of flow data including heat maps that visually indicate areas of congestion.

 

Anomaly Detection, Event Prediction and Remediation

NMIS, opTrend and opEvents

Identify issues and threats before they impact your business by leveraging the device and network data gathered by NMIS along with advanced machine learning to determine minute by minute standard baselines for your environment that can help you to identify new threats, unusual behaviour and escalating problems before they impact operations.

Remote Monitoring and Management

NMIS, opHA, opEvents and opCharts

Managed Service Providers can replace multi-million dollar RMM systems by combining NMIS with opHA and opCharts. opHA allows you to increase the performance of applications and deliver high scale and high availability environments, including geographical distribution of the system and overlapping IP address ranges. opCharts provides a single pane of glass and tiered user views, so that engineers can drill down from a full view of all managed customer equipment to a single device in a remote location, while customers can view their own sites privately and in real time.

 


 

There are a lot of options to improve your network; however, the easiest way to start is with our Virtual Machine. The VM comes preconfigured and is operational in under 5 minutes. Download the Virtual Machine and activate free 20 device licenses of each of the modules that interest you, or request a demo from one of our engineers.