Changes in the Wind at Opmantek


Like every CEO, I was anxious at the beginning of COVID-19. How would it affect the business and our staff? We expected the worst and, like many companies, prepared the business and the team for change.

 

Fortunately for us, as for many businesses, it gave the team time to think about what they were doing, where they were heading, and what was and wasn’t working for them. Like us, many businesses re-evaluated their direction and operations. Out of that exploration came improvements to their processes and ways to reduce waste. Everyone got a bit smarter.

 

During 2020, we had higher engagement than we expected from organisations that saw the importance of their networks and infrastructure. With more people relying on the business’s digital side to get work done, and with IT staff working from home, network management became a primary focus for CEOs, CIOs, and CTOs. Heads of IT in all verticals had to ensure that their applications and supporting infrastructure were robust and held no surprises. More than ever, they had to support their customers and provide the same level of support to their staff.

 

Network management and network management improvements are no longer on the back burner. They are now front and centre.

 

More and more people were reaching out to us and talking to our teams around the world. Some of the largest organisations joined our family. They chose us because they trusted our team and software to deliver outstanding visibility of their networks and infrastructure, flexibility to fit their business processes, and great value. Organisations such as NextLink Internet out of Texas signed a 10-year agreement with us. NASA is using us for their next moon mission, Artemis. Only three members of the Opmantek team had been born (yes, they were babies at the time) when Neil Armstrong landed on the Moon. Now we all get to do a little bit to get the first woman and next man on the Moon, a great honour for all of us.

 

We are proud that we have built great software that our customers recognise as the best. However, what makes us all pleased is that our customers believe in our team.

 

Introducing Programmable Button Actions


Opmantek has long believed that Operational Process Automation is one of the foundational pillars on which a successful network management strategy is built. One key piece of this is ensuring that actions are carried out consistently each time, with no variance from the standard protocol. opEvents has introduced programmable button actions that help organisations replicate troubleshooting actions and escalation procedures, further solidifying opEvents as a technical service desk.

The buttons use the same pipeline as scripts in EventActions, but operators can now manually kick off an action for an event. One of the most common actions will be to create a ticket in your issue tracking system; in our case we will create a Jira ticket.


Configuration

To start, create the following file in omk/conf/table_schemas/opEvents_action-buttons.json. The file must contain valid JSON or the buttons will fail to render. You should see an error in opEvents.log if this is the case.

    [
      {
        "description": "Example Events Button Action",
        "label": "Create Ticket",
        "fa_icon": "fas fa-jira",
        "script": "create_ticket",
        "tags": ["ticket"]
      }
    ]

Then add the following policy in omk/conf/EventActions.json or omk/conf/EventActions.nmis, which triggers show_button.tag(). Here the tag is "ticket", matching the "tags" entry of the button above.

EventActions.json

    "policy": {
      "5": {
        "IF": "event.any",
        "THEN": "show_button.ticket()",
        "BREAK": "true"
      }
    }

EventActions.nmis

    %hash = (
      'policy' => {
        '5' => {
          IF => 'event.any',
          THEN => 'show_button.ticket()',
          BREAK => 'true'
        },
      }
    );

These are the supported keys and how they change the operation and look of the button.
Key | Type | Required | Description
--- | --- | --- | ---
script | string | Yes | Name of the script defined in EventActions.json
label | string | Yes | Label which the button will display to the user
description | string | Optional | Tool-tip help text to be displayed when you mouse over the button
tags | array[string] | Optional | If no tags are defined, the button will show on all events. If tags are defined, the button will only show on events which have been tagged with show_button.tag_name()
run_once | boolean | Optional | If set to true, the button will look for the script.script_name key on the event and, if found, disable itself. This allows manual actions to be triggered only once. It will not influence any operations defined in EventActions.json.
fa_icon | string | Optional | Icon to be displayed from the Font Awesome library shipped with opEvents, for example "fas fa-table-tennis"
class | string | Optional | Define a CSS class to colour the button. See Notes on Button Classes below for a list of supported types.
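
Because a malformed file stops the buttons from rendering, it can help to check the file before deploying it. The following is a minimal sketch in Python, not part of opEvents: the function name validate_buttons and the error wording are hypothetical, and the key names and types are taken only from the table above.

```python
import json

# Key names and types come from the table above; everything else is illustrative.
REQUIRED_KEYS = {"script": str, "label": str}
OPTIONAL_KEYS = {
    "description": str,
    "tags": list,
    "run_once": bool,
    "fa_icon": str,
    "class": str,
}


def validate_buttons(text):
    """Return a list of problems found in a button definition file's text."""
    try:
        buttons = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(buttons, list):
        return ["top level must be a JSON array of button objects"]

    problems = []
    for index, button in enumerate(buttons):
        if not isinstance(button, dict):
            problems.append(f"button {index}: must be a JSON object")
            continue
        for key, expected in REQUIRED_KEYS.items():
            if key not in button:
                problems.append(f"button {index}: missing required key '{key}'")
            elif not isinstance(button[key], expected):
                problems.append(f"button {index}: key '{key}' has the wrong type")
        for key, expected in OPTIONAL_KEYS.items():
            if key in button and not isinstance(button[key], expected):
                problems.append(f"button {index}: key '{key}' has the wrong type")
        tags = button.get("tags", [])
        if isinstance(tags, list) and any(not isinstance(t, str) for t in tags):
            problems.append(f"button {index}: every tag must be a string")
    return problems
```

Running it over the file contents, for example validate_buttons(open(path).read()), returns an empty list when nothing is wrong. A common cause of failure is curly quotes pasted from a web page, which the JSON parser rejects.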

Notes on Font Awesome

opEvents-3.2.2 ships Font Awesome 5.12.1. opEvents-2.6.1 ships Font Awesome 5.8.2.

7 Steps to Network Management Automation & Engineer Sleep Insurance


 

Quietly, somewhere in an office downtown, bearings designed to last for 25,000 hours have been running non-stop for over forty-three thousand. The fan was cheaply made by machine from components sourced over several years across a dozen providers. It sat boxed for weeks before it was installed in the router chassis, which itself was boxed up. Two months at sea, packed tight in a shipping container, then more months bounced around and shuffled from truck to warehouse, and back to a parcel delivery. Finally, the device was configured, boxed and shipped to its final installation point. Stuffed into a too-tight closet with no air circulation, this mission-critical router has been running non-stop for the past five years. It’s a miracle really that it worked this long.

 

Fan speed was the first thing to be affected by the bearing failure.

Building friction on the fan’s impeller shaft caused the amperage draw to increase to compensate and maintain rotational speed. When the amperage draw maxed out, rotations per minute (RPM) dropped. Slower fan speed meant less airflow, and with less airflow the chassis temperature increased.

 

Complex devices, like routers, require low operating temperatures. The cooler it is, the easier it is for electrons to move. As the chassis temperature increased the router experienced issues processing the data packets traversing the interfaces. At first it was an error here or there, then routine traffic routing ran into problems and the router began discarding packets. From there things got much worse.

 

It’s late Saturday evening and your weekend has been restful so far. A night out with your significant other, a movie and dinner. It’s late now and you’re ready for bed when your phone chirps. The text message is short:

 

Device: Main Router

Event: Chassis high temperature with high discard output packets

Action Taken: Rerouted traffic by increasing OSPF cost

Action Required: Fan speed low, amperage high. Engineer investigate for repair/replacement.

 

A fan went bad, what’s next?

The system responded as you would have: it rerouted traffic off the affected interface, preventing a possible impact to system operation. You add a note to your calendar to investigate the router first thing Monday morning and turn in for a good night’s sleep.

 

Our Senior Engineer in Asia-PAC, Nick Day, likes to refer to Opmantek’s solutions as “engineer sleep insurance”. Coming from a background in managed service providers, I can appreciate the situation. Equipment always breaks during your vacation time, often when the on-call engineer is as far away as possible, and with little useful information from the NMS. This was a prime scenario we used when building out our Operational Process Automation (OPA) solution.

 

Building a Solution

opTrend identifies operational parameters that fall outside trended norms, and opEvents correlates the resulting events and automates remediation. With the addition of opConfig, configuration changes to network devices can also be automated. Operational Process Automation (OPA) builds on this statistical analysis and rules-based heuristics to automate troubleshooting and remediation of network events. This in turn reduces the negative impact on user experience.

 

 

Magicians never reveal their secrets…but we’ll make an exception.

Now let’s see how this was accomplished using the above example. At its roots opTrend is a statistical analysis engine. opTrend collects performance data from NMIS, Opmantek’s fault and performance system, and determines what is normal operation. Looking back over several weeks, usually twenty-six, opTrend determines what is normal for each parameter it processes. It does this hour by hour, considering each day of the week individually. So, Monday morning 9-10am has its own calculation, which is separate from 3-4pm Saturday afternoon. By looking across several weeks, opTrend can normalize things like holidays and vacation time.

 

Once a mean for each parameter is determined, opTrend calculates the standard deviation for the parameter and creates a window of three standard deviations above and below the mean. Any activity outside this window triggers an opTrend event into NMIS. These events can be in addition to those generated by NMIS’s Thresholding and Alert system, or in place of them.
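
The idea of a per-hour, per-weekday baseline with a three-standard-deviation window can be sketched in a few lines of Python. This is an illustration of the statistics described above, not opTrend’s implementation: the function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baselines(samples, sigmas=3.0):
    """Build a normal-operation window for each (weekday, hour) slot.

    samples is an iterable of (datetime, value) pairs covering several weeks.
    Returns {(weekday, hour): (low, high)}.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Monday 09:00-10:00 is kept separate from Saturday 15:00-16:00.
        buckets[(timestamp.weekday(), timestamp.hour)].append(value)

    baselines = {}
    for slot, values in buckets.items():
        centre = mean(values)
        spread = pstdev(values)  # assumption: population standard deviation
        baselines[slot] = (centre - sigmas * spread, centre + sigmas * spread)
    return baselines


def is_abnormal(baselines, timestamp, value):
    """True when value falls outside the window for that weekday and hour."""
    window = baselines.get((timestamp.weekday(), timestamp.hour))
    if window is None:
        return False  # no history for this slot, so nothing to compare against
    low, high = window
    return value < low or value > high
```

Fed with twenty-six weeks of chassis temperature readings, is_abnormal would flag a Monday 9am reading only if it falls outside what is usual for Monday 9am, rather than outside a single fixed threshold.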

 

In the example above, opTrend would have seen the chassis temperature exceed the normal window of operation. Had fan speed and/or amperage also been processed by opTrend (they are not by default, but can be configured if desired), these would have been reported as low fan speed and high amperage.

 

This event from opTrend would have been sent to NMIS, then shared with opEvents for processing. A set of rules, or Event Actions, looks for events that could be caused by high temperature; these are often related to interface packet errors or discards. With wireless devices (WiFi and RF), high temperature may affect signal strength and connection speed. A similar result could be achieved using a Correlation Rule, which would group multiple events across a window of time into a new parent event. Both methods are relevant and have their own pros and cons.

 

opEvents now uses the high temperature / high discards event to start a troubleshooting routine. This may include directing opConfig to connect to the device via SSH and execute CLI commands to collect additional troubleshooting information. The result of these commands can have their own operational life – being evaluated for error conditions, firing off new events and themselves starting Event Actions.

 

Let’s review the process flow:

  1. NMIS collects performance data from the device, including fan speed, temperature and interface performance metrics.
  2. opTrend processes the collected performance data from NMIS and determines what is normal/abnormal behavior for each parameter.
  3. Events are generated by opTrend in NMIS, which are then shared with opEvents.
  4. opEvents receives events from opTrend identifying out-of-normal temperature and interface output discards. These events are then correlated into a single synthetic event, given a higher priority, and evaluated for action.
  5. An Event Action rule matches for a performance impacting event on a Core device running a known OS. This calls opConfig to initiate Hourly and Daily configuration backups, then execute a configuration change to increase the OSPF cost on the interface forcing traffic to be rerouted off this interface.
  6. opEvents also opens a helpdesk ticket via a RESTful API, then texts the on-call technician with the actions taken, and recommended follow-on activities.
  7. Once traffic across the interface drops the discards error will clear, generating an Up-Notification text to the on-call technician.
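
Step 4 of the flow above, correlating related events into a single higher-priority synthetic event, can be sketched as follows. This is an illustrative Python model rather than opEvents code: the Event class, the event names, and the convention that a larger priority number is more urgent are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    node: str
    name: str
    time: float          # seconds since an arbitrary epoch
    priority: int = 3    # assumption: larger number means more urgent
    children: list = field(default_factory=list)


def correlate(events, required, window, parent_name):
    """Create one parent event per node when every required event name
    has been seen on that node within `window` seconds."""
    required = set(required)
    recent = {}   # node -> {event name: latest matching Event}
    parents = []
    for event in sorted(events, key=lambda e: e.time):
        if event.name not in required:
            continue
        seen = recent.setdefault(event.node, {})
        seen[event.name] = event
        # Drop events that have aged out of the correlation window.
        for name in [n for n, e in seen.items() if event.time - e.time > window]:
            del seen[name]
        if set(seen) == required:
            children = sorted(seen.values(), key=lambda e: e.time)
            parents.append(Event(
                node=event.node,
                name=parent_name,
                time=event.time,
                priority=max(c.priority for c in children) + 1,
                children=children,
            ))
            seen.clear()
    return parents
```

In the router scenario, a high temperature event followed two minutes later by an output discards event on the same device would produce one parent event, which the later steps (configuration change, ticket, text message) could then act on once instead of twice.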

 

This is an example of what we would consider a medium complexity automation. It is composed of several Opmantek solutions, each configured (most automatically) to work together. Across the seven steps, these solutions share and process fault and performance information, correlate resulting events, and apply a single set of event actions to gather additional information and configure around the event. When applying solution automations, we advocate a crawl-walk-run methodology where you start by collecting troubleshooting information (crawl), then automate simple single-step remediations (walk), then slowly deploy multi-path remediations with control points (run).

 

Contact Us & Start Automating Your Network Management

Contact our team of experts here if you would like to know more about how this solution was developed, or how Operational Process Automation can be leveraged to save man-hours and reduce Mean Time to Resolve (MTTR).

How to detect, diagnose, and fix issues with network bandwidth


Network bandwidth has always been a precious commodity, and with so many people now working from home, many companies have not had the bandwidth they need in the right places. This blog will help you with some strategies for detecting bandwidth issues, diagnosing them further, and taking action to relieve them.

Detecting network bandwidth issues through congestion management

Most issues related to network bandwidth will present as congestion: there is not enough bandwidth to satisfy the demands of the users and applications. Users will report that “some application” doesn’t work like it did yesterday. After you have confirmed the application is up, and the user reports are correct, where do you look next?

Check the network:

  1. Monitor the helpdesk cases raised, in particular where users are reporting problems with applications across the network; these are likely to indicate network congestion. Knowing whether the reports come from a branch, a remote site, or from home will shorten troubleshooting.
  2. Monitor utilisation of network links and raise alerts when bandwidth becomes heavily utilised.
  3. Make sure you monitor packet discards and errors.
  4. And finally, monitor Quality of Service (QoS) parameters available in the network device; in particular, you are looking for where QoS caused packet loss.
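
Items 2 and 3 in the list above come down to simple arithmetic on interface counters. The sketch below shows the calculation in Python under stated assumptions: the counters are 64-bit octet counters sampled twice, and the function names and the 80% warning level are hypothetical choices for illustration, not NMIS defaults.

```python
COUNTER64_MODULUS = 2 ** 64


def utilisation_percent(octets_start, octets_end, interval_seconds, link_bps):
    """Percentage of link capacity used between two octet counter samples."""
    if interval_seconds <= 0 or link_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    # The modulo handles a single wrap of a 64-bit counter between samples.
    delta_octets = (octets_end - octets_start) % COUNTER64_MODULUS
    bits_per_second = delta_octets * 8 / interval_seconds
    return 100.0 * bits_per_second / link_bps


def check_link(name, utilisation, discards_delta, warn_at=80.0):
    """Return alert messages for a link; warn_at is an illustrative threshold."""
    alerts = []
    if utilisation >= warn_at:
        alerts.append(f"{name}: utilisation {utilisation:.1f}% is at or above {warn_at:.0f}%")
    if discards_delta > 0:
        alerts.append(f"{name}: {discards_delta} packets discarded in the interval")
    return alerts
```

For example, a 100 Mbps link that moved 3,375,000,000 octets in a five-minute poll is running at 90% utilisation, which would raise the first alert even before users start calling the helpdesk.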

The first step to detection is to get NMIS installed and let it start collecting data NOW. DOWNLOAD NMIS

Diagnosing Network Bandwidth Issues

What issues are users reporting about the network? Is the application slow, or is it unusable? For example, is there a problem with voice over IP or video conferencing? Does it occur during file transfers? The more qualified information you get from your helpdesk, the faster you can get to work.

By monitoring the network for issues related to congestion, you are ready to start further diagnosis: determine what is causing those issues and look for solutions that first avoid the congestion or, failing that, control it.

Depending on the tools available to you, you should have an idea of those causes. For example, putting aside transmission errors, format errors, and device health issues, packet discards will generally be caused by QoS classes dropping packets, so the solution is to refine the QoS configuration to prevent the desired traffic from being discarded.

The symptoms depend on the application. Applications using TCP will retransmit the dropped packets, while voice and video will show voice clipping, slow-refreshing video, or video and voice falling out of sync.

Depending on the devices and operating systems being used, you should be able to see key performance indicators for this, which will be collected by your monitoring system, like NMIS. For example, you could monitor for TCP retransmissions on servers, which would indicate issues with those applications.

Systems like Cisco IPSLA are a great way to monitor for changes in latency or variability in latency (jitter). NMIS can collect your IPSLA data, providing graphs as well as alerts when it detects issues.
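
To make the two terms concrete, here is a small Python sketch that summarises a series of round-trip times. It uses one simple definition of jitter, the mean absolute difference between consecutive samples; that definition and the function name are assumptions for illustration and are not necessarily how IPSLA or NMIS calculate the value.

```python
from statistics import mean


def latency_summary(rtts_ms):
    """Summarise round-trip times given in milliseconds."""
    if len(rtts_ms) < 2:
        raise ValueError("need at least two samples to measure jitter")
    # Jitter here is the average change between one probe and the next.
    deltas = [abs(later - earlier) for earlier, later in zip(rtts_ms, rtts_ms[1:])]
    return {
        "mean_ms": mean(rtts_ms),
        "jitter_ms": mean(deltas),
        "max_ms": max(rtts_ms),
    }
```

Two links with the same mean latency can behave very differently for voice and video: the one with the larger jitter value is the one more likely to produce clipping and sync problems.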

Monitoring these metrics will guide where you need to look deeper. You might need to collect more detailed information from the devices to determine what the issues are, e.g. looking at command outputs for QoS or interface information to decide what changes are available to resolve the helpdesk reports.

If you identify QoS classes that are exceeding their configured limits, with resulting packet loss, you will need to consider changing the bandwidth allocations for those classes, for example increasing the available bandwidth for voice and video.

HOW TO DIAGNOSE: Use NMIS and opConfig to collect data, which can then be analysed. Here’s how to use opFlow to look at the application mix on a link.

OPA can help with the detection and diagnosis of congestion problems.

Actions to fix network bandwidth problems

Ultimately, to fix a bandwidth issue, you should upgrade the overall capacity at the site. If you are not able to upgrade or need to buy time, implement QoS features to identify which traffic is less important to the business and have it shaped or dropped during times of congestion.

Contrary to popular belief, QoS does not create more throughput. It does create better “goodput,” with critical applications protected, and applications that are hogging bandwidth, controlled.

Two standard policy options for QoS are shape or police. Policing will ensure bandwidth is never exceeded and drop the offending traffic. Shaping will delay traffic to smooth out the traffic over time. Note that as shaping limits are exceeded, it may result in dropped traffic.
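
The difference between the two policies can be seen in a toy simulation. The Python sketch below is a deliberately simplified model, assuming fixed time intervals, a flat rate limit with no burst allowance, and a bounded shaping queue; the function names and units are hypothetical, and real device implementations are more sophisticated.

```python
def police(arrivals, rate):
    """Policing: traffic above the rate in an interval is dropped at once."""
    sent, dropped = [], 0
    for offered in arrivals:
        allowed = min(offered, rate)
        sent.append(allowed)
        dropped += offered - allowed
    return sent, dropped


def shape(arrivals, rate, queue_limit):
    """Shaping: excess traffic waits in a queue and is sent in later intervals.

    Traffic is dropped only when the queue itself overflows.
    Returns (sent per interval, total dropped, amount still queued).
    """
    sent, dropped, queued = [], 0, 0
    for offered in arrivals:
        backlog = queued + offered
        transmitted = min(backlog, rate)
        remaining = backlog - transmitted
        if remaining > queue_limit:
            dropped += remaining - queue_limit
            remaining = queue_limit
        sent.append(transmitted)
        queued = remaining
    return sent, dropped, queued
```

With arrivals of 5, 15, 5 and 5 units against a rate of 10, the policer drops the 5 units of excess in the second interval, while a shaper with enough queue delays them into the third interval and drops nothing. Shrinking the queue limit shows the caveat noted above: once the shaper’s queue is exceeded, it starts dropping traffic too.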

Talk to us about how our solutions can give you the insight you need to make data-based decisions. You’ll reduce helpdesk stress and own your infrastructure, all while improving the user experience.

Book a Demo