7 Steps to Network Management Automation & Engineer Sleep Insurance

 

Quietly, somewhere in an office downtown, bearings designed to last for 25,000 hours have been running non-stop for over forty-three thousand. The fan was cheaply made by machine from components sourced over several years across a dozen providers. It sat boxed for weeks before it was installed in the router chassis, which itself was then boxed up. Two months at sea, packed tight in a shipping container, then more months bounced around and shuffled from truck to warehouse and back to a parcel delivery. Finally, the device was configured, boxed and shipped to its final installation point. Stuffed into a too-tight closet with no air circulation, this mission-critical router has been running non-stop for the past five years. It’s a miracle, really, that it worked this long.

 

Fan speed was the first thing to be affected by the bearing failure.

Building friction on the fan’s impeller shaft caused the amperage draw to increase to compensate and maintain rotational speed. When the amperage draw maxed out, rotations per minute (RPM) dropped. With the slower fan speed came less airflow, with lower airflow the chassis temperature increased.

 

Complex devices, like routers, require low operating temperatures. The cooler it is, the easier it is for electrons to move. As the chassis temperature increased the router experienced issues processing the data packets traversing the interfaces. At first it was an error here or there, then routine traffic routing ran into problems and the router began discarding packets. From there things got much worse.

 

It’s late Saturday evening and your weekend has been restful so far. A night out with your significant other, a movie and dinner. You’re ready for bed when your phone chirps. The text message is short:

 

Device: Main Router

Event: Chassis high temperature with high discard output packets

Action Taken: Rerouted traffic by increasing OSPF cost

Action Required: Fan speed low, amperage high. Engineer investigate for repair/replacement.

 

A fan went bad, what’s next?

The system has responded as you would have: it rerouted traffic off the affected interface, preventing a possible impact to system operation. Adding a note to your calendar to investigate the router first thing Monday morning, you turn in for a good night’s sleep.

 

Our Senior Engineer in Asia-PAC, Nick Day, likes to refer to Opmantek’s solutions as “engineer sleep insurance”. Coming from a background in managed service providers, I can appreciate the situation. Equipment always breaks during your vacation, often when the on-call engineer is as far away as possible, and with little useful information from the NMS. This was a prime scenario we used when building out our Operational Process Automation (OPA) solution.

 

Building a Solution

opTrend identifies operational parameters that fall outside trended norms, and opEvents correlates the resulting events and automates remediation. With the addition of opConfig, configuration changes to network devices can also be automated. Operational Process Automation (OPA) builds on this statistical analysis and rules-based heuristics to automate troubleshooting and remediation of network events. This in turn reduces the negative impact on user experience.

 

 

Magicians never reveal their secrets…but we’ll make an exception.

Now let’s see how this was accomplished using the above example. At its roots, opTrend is a statistical analysis engine. opTrend collects performance data from NMIS, Opmantek’s fault and performance system, and determines what is normal operation. Looking back over several weeks, usually twenty-six, opTrend determines what is normal for each parameter it processes. It does this hour by hour, considering each day of the week individually. So, Monday morning 9-10am has its own calculation, which is separate from 3-4pm Saturday afternoon. By looking across several weeks, opTrend can normalize things like holidays and vacation time.

 

Once a mean for each parameter is determined, opTrend calculates the standard deviation for the parameter and creates a window of three standard deviations above and below the mean. Any activity outside this window triggers an opTrend event into NMIS. These events can be in addition to those generated by NMIS’s Thresholding and Alert system, or in place of them.
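
To make the idea concrete, here is a minimal Python sketch of an hour-of-week baseline with a three-standard-deviation window. It is not opTrend’s actual implementation; the function names, the sample temperatures and the choice of a population standard deviation are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def build_baseline(samples):
    """Group (timestamp, value) samples by (weekday, hour) and return
    {(weekday, hour): (mean, std_dev)} for each bucket."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {key: (mean(vals), pstdev(vals)) for key, vals in buckets.items()}


def is_abnormal(baseline, ts, value, k=3.0):
    """True when value falls outside mean +/- k standard deviations
    for the matching weekday/hour bucket."""
    key = (ts.weekday(), ts.hour)
    if key not in baseline:
        return False  # no history for this slot, so no judgement is made
    mu, sigma = baseline[key]
    return abs(value - mu) > k * sigma


# Hypothetical data: 26 weeks of Saturday 22:00 chassis temperatures (deg C).
start = datetime(2024, 1, 6, 22, 0)  # a Saturday
history = [(start + timedelta(weeks=w), 40.0 + (w % 3)) for w in range(26)]
baseline = build_baseline(history)
check_time = start + timedelta(weeks=26)

print(is_abnormal(baseline, check_time, 41.5))  # inside the window
print(is_abnormal(baseline, check_time, 55.0))  # well outside the window
```

Because each weekday/hour slot keeps its own mean and deviation, a reading that is ordinary on Monday morning can still be flagged late on a Saturday night.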

 

In the example above, opTrend would have seen the chassis temperature exceed the normal window of operation. Had fan speed and/or amperage also been processed by opTrend (they are not by default, but this can be configured if desired), these would have been reported as low fan speed and high amperage.

 

This event from opTrend would have been sent to NMIS, then shared with opEvents for processing. A set of rules, or Event Actions, looked for events that could be caused by high temperature; often related to interface packet errors or discards. With wireless devices (WiFi and RF) this may affect signal strength and connection speed. A similar result could be handled using a Correlation Rule, which would group multiple events across a window of time into a new parent event. Both methods are relevant and have their own pros and cons.
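
The grouping idea behind a correlation rule can be sketched in a few lines of Python. This is only an illustration of the concept, not opEvents rule syntax; the `Event` fields, the event names and the five-minute window are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    node: str
    name: str
    time: float  # seconds since some reference point


def correlate(events, required, window_s=300.0):
    """Return one synthetic parent event per node each time every event
    name in `required` has been seen on that node within `window_s` seconds."""
    parents = []
    seen_by_node = {}
    for ev in sorted(events, key=lambda e: e.time):
        if ev.name not in required:
            continue
        seen = seen_by_node.setdefault(ev.node, {})
        seen[ev.name] = ev.time
        # Keep only sightings that are still inside the time window.
        recent = {n: t for n, t in seen.items() if ev.time - t <= window_s}
        if set(recent) == set(required):
            label = "Synthetic: " + " + ".join(sorted(required))
            parents.append(Event(ev.node, label, ev.time))
            recent = {}  # reset so the parent fires once per occurrence
        seen_by_node[ev.node] = recent
    return parents


required = {"High Temperature", "High Output Discards"}  # hypothetical names
events = [
    Event("main-router", "High Temperature", 0.0),
    Event("main-router", "High Output Discards", 120.0),
    Event("branch-router", "High Temperature", 130.0),
]
for parent in correlate(events, required):
    print(parent.node, parent.name)
```

Only the main router produces a parent event here, because the branch router reported a temperature event without matching discards.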

 

opEvents now uses the high temperature / high discards event to start a troubleshooting routine. This may include directing opConfig to connect to the device via SSH and execute CLI commands to collect additional troubleshooting information. The results of these commands can have their own operational life – being evaluated for error conditions, firing off new events and themselves starting Event Actions.

 

Let’s review the process flow:

  1. NMIS collects performance data from the device, including fan speed, temperature and interface performance metrics.
  2. opTrend processes the collected performance data from NMIS and determines what is normal/abnormal behavior for each parameter.
  3. Events are generated by opTrend in NMIS, which are then shared with opEvents.
  4. opEvents receives events from opTrend identifying out of normal temperature and interface output discards. These events are then correlated into a single synthetic event, given a higher priority, and evaluated for action.
  5. An Event Action rule matches for a performance impacting event on a Core device running a known OS. This calls opConfig to initiate Hourly and Daily configuration backups, then execute a configuration change to increase the OSPF cost on the interface forcing traffic to be rerouted off this interface.
  6. opEvents also opens a helpdesk ticket via a RESTful API, then texts the on-call technician with the actions taken, and recommended follow-on activities.
  7. Once traffic across the interface drops the discards error will clear, generating an Up-Notification text to the on-call technician.
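
Steps 5 and 6 amount to a rule that matches on properties of the correlated event and returns an ordered list of actions. The sketch below shows that decision logic only; the field names, the action names and the `KNOWN_OS` set are hypothetical and do not reflect how Event Actions are actually written in opEvents.

```python
KNOWN_OS = {"IOS", "IOS-XE", "JunOS"}  # hypothetical list of supported platforms


def plan_actions(event):
    """Return the ordered action names for a correlated event (a dict)."""
    actions = []
    remediate = (
        event.get("performance_impacting", False)
        and event.get("role") == "core"
        and event.get("os") in KNOWN_OS
    )
    if remediate:
        # Back up first so the change can be rolled back, then reroute.
        actions += ["backup_config", "raise_ospf_cost"]
    # A ticket and a text go out whether or not remediation ran.
    actions += ["open_helpdesk_ticket", "text_on_call"]
    return actions


core_event = {"node": "main-router", "role": "core", "os": "IOS",
              "performance_impacting": True}
print(plan_actions(core_event))
```

Keeping the backup ahead of the configuration change mirrors the order in step 5 and leaves a known-good configuration to return to once the fan is replaced.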

 

This is an example of what we would consider a medium complexity automation. It is composed of several Opmantek solutions, each configured (most automatically) to work together. These solutions share and process fault and performance information, correlate resulting events, and apply a single set of event actions to gather additional information and configure around the event. When applying solution automations, we advocate a crawl-walk-run methodology where you start by collecting troubleshooting information (crawl), then automate simple single-step remediations (walk), then slowly deploy multi-path remediations with control points (run).

 

Contact Us & Start Automating Your Network Management

Contact our team of experts here if you would like to know more about how this solution was developed, or how Operational Process Automation can be leveraged to save on man-hours and reduce Mean Time to Resolve (MTTR).

How to detect, diagnose, and fix issues with network bandwidth

Network bandwidth has always been a precious commodity and given our current circumstances with so many people working from home, many companies have not had the bandwidth they need in the right places. This blog will help you with some strategies on how to detect bandwidth issues, further diagnose those issues, and what actions you can take to relieve those bandwidth issues.

Detecting network bandwidth issues through congestion management.

Most issues related to network bandwidth will present as congestion, that is there is not enough bandwidth to satisfy the demands of the users and applications. Users will report that “some application” doesn’t work like it did yesterday. After you have confirmed the application is up, and the user reports are correct, where do you look next?

Check the network:

  1. Monitor the helpdesk cases raised, in particular where users are reporting problems with applications across the network; these are likely to indicate network congestion. Knowing whether the reports come from a branch, a remote site or from home will shorten troubleshooting.
  2. Monitor utilisation of network links and raise alerts when bandwidth becomes heavily utilised.
  3. Make sure you monitor packet discards and errors.
  4. And finally, monitor Quality of Service (QoS) parameters available in the network device; in particular, you are looking for where QoS caused packet loss.
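
As a rough illustration of the utilisation check in step 2, link utilisation can be derived from two readings of an interface octet counter. The function names and the 80% alert threshold below are assumptions for the example, not NMIS settings.

```python
def utilisation_percent(octets_start, octets_end, interval_s,
                        if_speed_bps, counter_bits=64):
    """Percentage of link capacity used between two octet-counter readings.
    The modulo copes with a single counter wrap between polls."""
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return delta_octets * 8 * 100.0 / (interval_s * if_speed_bps)


def is_heavily_utilised(percent, threshold=80.0):
    """True when utilisation meets or exceeds the (assumed) alert threshold."""
    return percent >= threshold


# 3,000,000,000 octets in 5 minutes on a 100 Mbps link.
pct = utilisation_percent(0, 3_000_000_000, 300, 100_000_000)
print(round(pct, 1), is_heavily_utilised(pct))
```

A monitoring system does this per poll and per direction; the point of the sketch is only that the alert is driven by the rate of change of the counter, not its absolute value.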

The first step to detection is to get NMIS installed and let it start collecting data NOW.

Diagnosing Network Bandwidth Issues

What issues are being reported by users about the network, is the application slow or is it unusable? For example, is there a problem with voice over IP or video conferencing? Does it occur during file transfers? The more qualified information you get from your helpdesk, the faster you can get to work.

By monitoring the network for issues related to congestion, you are ready to start further diagnosis to determine what is causing those issues and look for possible solutions that avoid the congestion or, failing that, control it.

Depending on the tools available to you, you should have an idea of those causes. For example, putting aside transmission errors, format errors and device health issues, packet discards will generally be caused by QoS classes dropping packets, so the solution is to refine the QoS configuration to prevent the desired traffic from being discarded.

The symptoms depend on the application: dropped packets cause retransmissions for applications using TCP, while for voice and video the symptoms are voice clipping, slow-refreshing video, or video and voice falling out of sync.

Depending on the devices and operating systems being used, you should be able to see key performance indicators for this, which will be collected by your monitoring system, like NMIS. For example, you could monitor for TCP retransmissions on servers; a rise would indicate issues with those applications.

Systems like Cisco IPSLA are a great way to monitor for changes in latency or variability in latency (jitter). NMIS can collect your IPSLA data, providing graphs as well as alerts when it detects issues.
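
To show what “variability in latency” means in numbers, the sketch below computes a simple jitter figure as the mean absolute difference between consecutive latency samples. This is one common simplification and an assumption on our part; it is not necessarily the calculation IPSLA or NMIS performs, and the sample values are made up.

```python
from statistics import mean


def jitter_ms(latencies_ms):
    """Mean absolute difference between consecutive latency samples."""
    if len(latencies_ms) < 2:
        return 0.0
    return mean(abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:]))


steady = [20, 21, 20, 21, 20]   # hypothetical round-trip times in ms
bursty = [20, 22, 21, 35, 20]
print(jitter_ms(steady), jitter_ms(bursty))
```

The two series have similar average latency, yet the second has far higher jitter, which is why voice and video can suffer on a link that looks acceptable on a latency graph alone.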

Monitoring these metrics will guide where you need to look deeper. You might need to collect more detailed information from the devices to determine what the issues are, e.g. looking at command outputs for QoS or interface information to decide what changes are available to resolve the helpdesk reports.

If you identify the QoS Classes which are exceeding their configuration limits with resulting packet loss, you will need to consider changing the bandwidth allocations for those classes, increasing the available bandwidth for voice and video, for example.

HOW TO DIAGNOSE: Use NMIS and opConfig to collect data, which can then be analysed. Here’s how to use opFlow to look at the application mix on a link.

OPA can help with the detection and diagnosis of congestion problems.

Actions to fix network bandwidth problems

Ultimately, to fix a bandwidth issue, you should upgrade the overall capacity at the site. If you are not able to upgrade or need to buy time, then implement QoS features to identify which traffic is less important to the business and have it shaped or dropped during times of congestion.

Contrary to popular belief, QoS does not create more throughput. It does create better “goodput,” with critical applications protected, and applications that are hogging bandwidth, controlled.

Two standard policy options for QoS are shape or police. Policing will ensure bandwidth is never exceeded and drop the offending traffic. Shaping will delay traffic to smooth out the traffic over time. Note that as shaping limits are exceeded, it may result in dropped traffic.
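
The difference between the two can be seen in a toy simulation. The sketch below works in whole intervals with a fixed rate and a simple queue; real policers and shapers use token buckets with burst allowances, so treat this as an illustration of the behaviour rather than a model of any device. All names and numbers are hypothetical.

```python
def police(arrivals, rate):
    """Policer: forward up to `rate` units per interval, drop the excess."""
    sent = [min(a, rate) for a in arrivals]
    dropped = sum(a - s for a, s in zip(arrivals, sent))
    return sent, dropped


def shape(arrivals, rate, queue_limit):
    """Shaper: queue the excess and send it later; drop only when the
    queue would grow beyond `queue_limit`."""
    sent, dropped, queue = [], 0, 0
    for a in arrivals:
        queue += a
        out = min(queue, rate)
        queue -= out
        if queue > queue_limit:
            dropped += queue - queue_limit
            queue = queue_limit
        sent.append(out)
    return sent, dropped, queue


arrivals = [5, 15, 5, 5]  # units of traffic offered in each interval
print(police(arrivals, rate=10))
print(shape(arrivals, rate=10, queue_limit=10))
print(shape(arrivals, rate=10, queue_limit=2))
```

With the same burst, the policer discards the excess immediately, the shaper with a deep queue delivers everything one interval later, and the shaper with a shallow queue still drops part of the burst.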

Talk to us about how our solutions can give you the insight you need to make data-based decisions. You’ll reduce helpdesk stress and own your infrastructure, all while improving the user experience.

Book a Demo

Optimising Your Network Experience for Video Conferencing

Over the last month, more and more businesses have found themselves working in predominantly online modes. Working from home, i.e. teleworking, is now the new normal. That means we have all had to be creative about how we structure our work days, build our work spaces and interact with our colleagues. Virtual meetings and video conferencing are now a standard part of daily life for many organisations. However, not all organisations are appropriately equipped to utilise this technology to its full potential. There is no use holding online meetings if the sound and video are jumpy, or the connection times out halfway through. Ensuring your network is optimised for video conferencing is vital to your organisation’s ongoing success during these uncertain times.

The differences between streaming video and hosting a live conference are vast. When streaming a standard cat video from say YouTube, the video downloads in small parts in advance, which mitigates any network instability during playback. When you host live video, data must be received consistently in real time so that the content is clear. On average, businesses need to account for around 100 users sharing the internet, with each making 5 to 6 calls and accounting for at minimum 5 percent of the concurrent bandwidth overhead utilisation. Add another ten percent if you are using online conferencing due to the additional buffer strength required.

It sounds complicated, but it doesn’t need to be. That’s why we have put together this handy guide on how you can optimise your network experience for video conferencing to tackle the nuances of this format.

First things first

Whether you are just starting to use video conferencing, or are using it more with more users, it is essential that you scale up your network bandwidth to meet the demands placed on it. The transmission capacity of a connection is an important factor when determining the quality and speed of a network or the internet connection. Ask yourself whether your service provider is giving you full bandwidth. One way you can tell is based on the quality of your video conferences. Meetings with the appropriate bandwidth will be stable and seamless. If the video and/or audio is sluggish, there might be a syncing problem between motion and audio. Or, content sharing could be experiencing a delay. There is a range of tools, including Opmantek’s opFlow, that will help enhance your overall video call and conference experience.

How can opFlow help with video conferencing?

When conducting a video conference between two users, each user needs about 2Mbps of both upload and download bandwidth. This is the minimum requirement to ensure the conference is smooth and clear with high quality audio and video. Opmantek opFlow gives valuable network insights that allow companies and users to see how much of the network is being used, by whom and in what way. This is vital information for troubleshooting any issues caused by bandwidth availability. Further, opFlow rapidly identifies any bottlenecks affecting bandwidth so that they can be quickly rectified. It produces summary reports to provide the greatest possible transparency on usage.
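
The 2Mbps figure makes for simple capacity arithmetic. The sketch below is a back-of-the-envelope calculation only; the function names, the example link size and the amount of other traffic are assumptions, and real conferencing tools vary their bitrate with resolution and network conditions.

```python
PER_CALL_MBPS = 2.0  # per-user figure quoted above, in each direction


def required_mbps(concurrent_calls, per_call_mbps=PER_CALL_MBPS):
    """Bandwidth needed in each direction for the given number of calls."""
    return concurrent_calls * per_call_mbps


def max_calls(link_mbps, other_traffic_mbps=0.0, per_call_mbps=PER_CALL_MBPS):
    """How many calls fit once other traffic on the link is accounted for."""
    usable = max(link_mbps - other_traffic_mbps, 0.0)
    return int(usable // per_call_mbps)


print(required_mbps(10))                       # ten simultaneous calls
print(max_calls(50.0))                         # an otherwise idle 50 Mbps link
print(max_calls(50.0, other_traffic_mbps=20))  # same link with 20 Mbps in use
```

The “other traffic” input is exactly the number that flow data provides: without knowing what else is on the link, the headroom for video is a guess.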

Traffic and security analysis is another important factor to consider when looking at bandwidth related interruptions. opFlow identifies any abnormalities in traffic patterns and detects security threats, allowing issues to be prevented before they arise. This includes managing congestion, checking areas of high data usage and homing in on suspicious behaviour. When it comes to navigating the future of daily operations, infrastructure planning and capacity management are vital. Building video conferencing into plans is easy with the key information provided through opFlow – for both planning and network capacity management. There is also the capability for reduced downtime through the rapid change impact identification feature.

For added convenience, opFlow is extremely affordable and is compatible with multiple vendors and protocols including:

– Cisco NetFlow
– NetFlow-Lite
– NSEL
– Juniper J-Flow
– sFlow
– IPFIX

In summary

Now is the time to diversify and embrace the evolution of new business practices and standards. For many, this is leading to positive change organisation-wide. For more tips on optimising your network, network management, or to find out more about how Opmantek’s opFlow low-cost features can help you start managing and analysing your Netflow, contact us today.

Identify and remedy a failing web server

A customer of ours reached out to us recently to help them solve and potentially reduce the outages they were experiencing to their public website. The first step to help remedy this situation was to identify the root cause of the fault.

Digging into the logs, we were able to identify there had been an accidental (perhaps) Distributed Denial of Service (DDoS) attack produced by crawlers from around 1200 IP addresses that overloaded both the web server and the application, requiring a server reboot. The resolution for this singular problem was to block that IP address range to prevent this from occurring again. This, however, was only a partial solution, as this could happen again from a separate range.

This is where the power of Opmantek software began to shine.

Firstly, the engineering team must shift their mindset from a reactive one to a proactive one: identify the issue before it becomes a problem and take automated action to prevent an outage. Depending on how your network is set up, your staffing situation and personal preferences, you may tackle this issue in a variety of different ways.

There are several methods that can be implemented to identify the root cause of the service impact. From NMIS, you could run a service check on the web server that looks to identify if the quantity of connections exceeds a preset threshold. You can test the number of open connections on the web server with a command such as:

netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n
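
For a service check, the same count needs to be compared against a threshold. The Python sketch below parses text in the format printed by `netstat -ntu` and reports remote addresses above a limit. It is an illustration rather than an NMIS service check; the function names, the sample output and the threshold are hypothetical, and splitting the port off at the last colon is a simplification aimed at IPv4 addresses.

```python
from collections import Counter


def connections_per_ip(netstat_output):
    """Count connections per remote address in `netstat -ntu` style text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the banner and column-header lines that netstat prints.
        if len(fields) < 5 or fields[0] not in ("tcp", "tcp6", "udp", "udp6"):
            continue
        remote_ip = fields[4].rsplit(":", 1)[0]  # strip the port
        counts[remote_ip] += 1
    return counts


def over_threshold(counts, threshold):
    """Remote addresses holding more than `threshold` connections."""
    return {ip: n for ip, n in counts.items() if n > threshold}


sample = """Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:80             203.0.113.7:51234       ESTABLISHED
tcp        0      0 10.0.0.5:80             203.0.113.7:51235       ESTABLISHED
tcp        0      0 10.0.0.5:80             198.51.100.9:40000      TIME_WAIT
"""
print(over_threshold(connections_per_ip(sample), threshold=1))
```

Unlike the shell pipeline, this version ignores the two header lines, so the counts contain only real connections.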

One step further and we can use a combination of NMIS and opTrend to monitor for a sudden increase in CPU/memory utilization on the server and raise an event from there.

Once the event condition is satisfied, the next step is to identify the attack vector and remediate. In this case opEvents could retrieve and parse the Apache logs, identifying the IP address range, then instruct opConfig to reconfigure the necessary firewalls and applications to block them. Nick Day, Opmantek’s Senior Network Engineer in Asia-PAC, helped another customer by leveraging automated remediation; you can find out how in this blog.
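
The log-parsing step might look something like the sketch below, which groups requests by /24 network and lists the networks above a request count. It assumes the client address is the first field on each line, as in Apache’s common and combined log formats; the function name, the sample lines and the threshold are hypothetical, and this is not how opEvents itself is configured.

```python
import ipaddress
from collections import Counter


def busy_networks(log_lines, min_requests, prefix=24):
    """Return IPv4 networks (as strings) with at least `min_requests`
    requests, busiest first."""
    counts = Counter()
    for line in log_lines:
        parts = line.split(None, 1)
        if not parts:
            continue
        try:
            addr = ipaddress.ip_address(parts[0])
        except ValueError:
            continue  # first field is a hostname or the line is malformed
        if addr.version != 4:
            continue  # the /24 grouping here only makes sense for IPv4
        network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        counts[network] += 1
    return [str(net) for net, n in counts.most_common() if n >= min_requests]


request = ' - - [10/Oct/2020:13:55:36 +0000] "GET / HTTP/1.1" 200 2326'
log_lines = [f"203.0.113.{i}{request}" for i in range(1, 6)]
log_lines.append("198.51.100.4" + request)
print(busy_networks(log_lines, min_requests=5))
```

The resulting list of networks is the kind of input a firewall change would need; a person or a control point should still review it before anything is blocked.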

Not comfortable with this level of automation? Once the event is properly identified, engineers could be notified of the situation and, using opConfig’s Virtual Operator, reconfigure the firewalls/applications to block the DDoS attack and restart any services/applications/servers, all without giving those operators command line access or sudo/root privilege.

Comparing and Contrasting Teleworking in the United States VS. in Australia

Unprecedented events are rapidly changing the way teams work together, and social distancing recommendations have caused a shift to telework globally. While working remotely is the right thing to do at this time to reduce the COVID-19 spread, it’s a disruptive transition to make. In this blog post we explore exactly what telework is, and what differences and similarities of telework there are between Australia and the US.

 

What is telework?

Telework is an agreement between an employer and an employee that daily work by the employee can be completed in a flexible location, usually their home. Telework usually takes place during set hours and should not be confused with freelance work. Teleworking involves giving employees access and authorisation to the tools they need to complete their work from anywhere in the world.

 

How do Australian and United States definitions of telework vary?

The US definition of telework is taken from the Telework Enhancement Act of 2010 and states that telework is a flexible arrangement where an employer authorises activities for an employee from an approved worksite. The Australian definition is similar, but the Australian government bases it on a 2013 APS telework trial and accompanying telework policy development. The Australian definition states that telework is a flexible work arrangement that is enabled through technology.

 

Telework and productivity

Many organisations hesitate to introduce telework, as they fear that without constant supervision, the productivity of their employees will decrease. However, if telework is introduced properly it should not have any negative effects on employee productivity. In fact, a 2019 study found that employees who are able to work remotely at least once a month are 24% more likely to feel happy and productive.

 

How to ensure employees remain productive while teleworking

• Communicate

Just because an employee is not physically present in an office, does not mean that they should be excluded from regular communication. Ensure to video chat or voice call them on a regular basis as some information can be lost in email and document chains. This will also ensure the employee still feels like part of the team, even though they are not physically present.

• Explain

Some employees may be confused as to why they have been chosen to telework and may think it is due to a negative perception of their character, which can lower their morale and productivity. If you are introducing telework to an employee, take the time to explain that they have been chosen due to their job functions and positive work performance. This should ensure they remain positive and productive.

How does teleworking productivity differ for US and Australian workers?

Between 2005 and 2015, the number of employees teleworking in the US increased by 115%. Four years later, over 26 million Americans (roughly 16% of the US workforce) were teleworking on either a full or part-time basis. This number has spiked due to the outbreak of the Coronavirus, and is expected to remain high after the effect of the virus is reduced. The rising percentage of teleworkers suggests that productivity has not been an issue for US employees, as otherwise telework would not continue to be adopted.

In Australia, the latest telework statistics available from the Australian Bureau of Statistics reveal that roughly one-quarter of Australian workers, around 24%, worked at least part-time from their home. Although this number appears to be higher than US workers, many people in the survey who completed work from home explained they did it to complete extra work and only 1% had a formal teleworking agreement with their employer. So, why are Australian organisations seemingly more hesitant to promote teleworking?

 

Potential limitations of teleworking in Australia

• Limited social interactions

Limited social interactions may have a negative effect on an employee’s mental health if they rely on work for social interactions.

• Longer working hours

Some employees may fear that without an office structure they may end up working longer hours.

• Harder for work to be completed

Some employers may fear that employees won’t be able to complete their work as efficiently at home due to not having access to as many resources.

With the right management software, many of these concerns can be eliminated. For example, RMM and multi-tenancy software can ensure an employee can access the work they need from anywhere in the world. Similarly, automated control can help manage employee workflow by giving specific employees access to user-specific portals.

 

Comparing the benefits of telework

If Australian and United States organisations use the right software and follow the aforementioned tips for improving employee productivity and operational management, they can enjoy the same benefits of teleworking; just some of these benefits include…

 

Lower overhead costs

With employees teleworking, most likely from their homes, an organisation can opt for a smaller office, or no office at all if all employees are teleworking. This will help organisations save a significant amount financially and lower their overhead costs.

 

Greater work/life balance

If employees are able to stick to regulated hours whilst working, they will be able to create a greater work/life balance for themselves. As employees will not need to commute to and from work they will have more free time to spend as they wish, for example, with their families.

 

Fewer office distractions

Working in an office can often be hectic, with everything from constant noise to everyday dramas. Teleworking can help employees create a work environment that they prosper in. For example, one employee may wish to work in complete silence whilst another may wish to listen to loud classical music.

 

Telework and Opmantek

Want to learn more about how Opmantek’s remote monitoring and management (RMM) solutions can increase your network visibility, deliver unmatched automation, save money for your managed service organisation and increase profitability? Request a personalised demo from one of our engineers here.

 

Sources:

U.S. Bureau of Labor Statistics

https://www.labiotech.eu/inside-labiotech/why-remote-working-is-helping-us-become-a-productive-team/

https://www.apsc.gov.au/teleworking

How RMM Solutions Can Prepare Enterprise Teleworkers For COVID-19

COVID-19 has caused a wide range of employers to reassess their workplace arrangements. More workers than ever before are being asked to ‘telework’ but many don’t have experience in this kind of work or don’t really understand what it means. Basically, teleworking doesn’t just relate to using the telephone. It is a remote style of work from home that utilises automation solutions, internet, telephone, and email resources. Now, most workplaces are trying to quickly adapt to teleworking arrangements, and there are many RMM solutions that can prepare enterprise teleworkers for the realities of COVID-19.

 

If we cast our minds back to 2018, according to the US Bureau of Labor Statistics, only 24 per cent of US employees did some or all of their work from home. This figure was slightly higher in Australia, with the Australian Bureau of Statistics reporting around 30 per cent of Australians telecommuting for at least part of their working week. In light of the global pandemic, now almost all workers who are still able to work are being asked to work from home.

 

Harvard epidemiologist William Hanage recently said that “everyone who can work from home should work from home”, and workplaces are undertaking rapid workforce planning to make this happen. Scalability and flexibility are key during these uncertain times, as no one is certain how long these measures are likely to last, and whether businesses might be expected to take their entire operations online for extended periods.

 

That’s where RMM solutions come in. RMM (Remote Monitoring and Management) software is a kind of software that has been designed to assist in managing IT services remotely. Basically, this means a computer or network can be managed from a remote location by installing software and monitoring or managing activities over a secure network. As COVID-19 continues to place strain and insecurity on global markets, it is important to understand how RMM can prepare employers for what is ahead.

 

Scalability and Control

A SaaS-based RMM system will give scalability and control as organisations’ working conditions change under COVID-19. Under this kind of system, management configuration can be adjusted as the size of the network increases. That means if the number of remote workers increases, the network will be able to cope and teleworkers won’t be subject to unnecessary interruptions. Software like Opmantek is deployed either from the cloud or on-premises, but ownership of the database and control over its architecture remains with the client. So even if the situation worsens and network strain increases, adjustments can be made and the quality will remain stable.

 

Flexibility of Integration

Right now, few of us know what will be happening tomorrow, let alone in the coming months. RMM software gives flexibility in integration so that not everything within the network environment needs to be replaced at once. Network environmental diversity is possible and allows for a gradual rollout of teleworking within an organisation with a range of integration options and support.

 

Unlimited Scalability

Switching to an RMM system will mean that teleworkers will not only be prepared for COVID-19, but for the inevitable digitally connected future ahead. Many organisations will likely find teleworking to be favourable to their business once implemented, with the added bonus that RMM software for MSPs (managed service providers) offers unlimited scalability. The number of connected devices is undoubtedly going to increase in the near future for most businesses, so the ability to grow and scale to meet operational needs is imperative.

 

It’s clear RMM has unlimited potential for assisting teleworkers to adapt their working habits and should be part of any remote workforce strategy. To better secure the success and increase the readiness of your business’s operations during these unprecedented times try Opmantek’s RMM solutions.