At Opmantek, we use our own software heavily to monitor our production and development systems, solving the same IT operations challenges we know our customers face. It also helps us develop products faster by testing them early in real-world environments.
We have been using Amazon’s Web Application Firewall (WAF) to help protect our web-facing infrastructure. One issue with the out-of-the-box solution is monitoring the firewall’s logs as part of your overall IT operations, and analysing those logs in the context of the workloads they relate to. This visibility matters for two reasons: first, to check that newly implemented rules are working as intended, and second, to provide quick diagnosis in the event of an attack.
We first tested a third-party product to visualise the logs and hopefully provide out-of-the-box insights into the data, but we found its total cost of ownership (TCO) was much higher than using the extensibility of the Opmantek products. Its results would also have been isolated from our overall network health visibility.
Our WAF is set up with the rule sets provided by the AWS Marketplace as well as internally developed custom rule sets, with reputation/IP blacklists that are constantly evolving.
Our architecture is as follows:
Our WAF is set up to send all logs through to our Kinesis Delivery Stream.
The AWS Kinesis delivery stream is set up to deliver batched requests over HTTPS to a specified endpoint within your own environment. We developed a small HTTP service in Go to securely ingest the batched logs from AWS, and we provide this ingestion service to customers on request. The service also remaps JSON keys before writing each record out to disk.
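A minimal sketch of that ingestion service is shown below. The endpoint path, the key-rename table, and the output file name are illustrative assumptions, not our production configuration; the `main` function exercises the handler with a sample batch rather than starting a TLS listener.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"strings"
)

// keyMap holds the JSON key renames applied before writing each record
// to disk. The renames here are illustrative only.
var keyMap = map[string]string{
	"terminatingRuleId": "terminating_rule",
	"httpRequest":       "http_request",
}

// remapKeys renames the top-level keys of a decoded JSON object
// according to keyMap, leaving unknown keys untouched.
func remapKeys(record map[string]any) map[string]any {
	out := make(map[string]any, len(record))
	for k, v := range record {
		if nk, ok := keyMap[k]; ok {
			k = nk
		}
		out[k] = v
	}
	return out
}

// handleBatch accepts a batched POST of WAF log records, remaps each
// record's keys and appends it, one JSON object per line, to the file
// a log-watching service (such as opEvents) can pick up.
func handleBatch(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	var records []map[string]any
	if err := json.Unmarshal(body, &records); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// "waf.json.log" stands in for the directory opEvents watches.
	f, err := os.OpenFile("waf.json.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer f.Close()
	enc := json.NewEncoder(f)
	for _, rec := range records {
		if err := enc.Encode(remapKeys(rec)); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	// Demonstrate the handler end-to-end with one sample record.
	batch := `[{"terminatingRuleId":"AWSManagedRulesCommonRuleSet","action":"BLOCK","country":"AU"}]`
	req := httptest.NewRequest(http.MethodPost, "/waf-logs", strings.NewReader(batch))
	rec := httptest.NewRecorder()
	handleBatch(rec, req)
	fmt.Println("status:", rec.Code)
	// In production you would instead expose the handler over HTTPS:
	//   http.HandleFunc("/waf-logs", handleBatch)
	//   log.Fatal(http.ListenAndServeTLS(":8443", certFile, keyFile, nil))
}
```

Writing one JSON object per line keeps the file friendly to tail-style log watchers, and the pass-through behaviour for unknown keys means new AWS log fields survive unmodified.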
To see what AWS publishes in its logs, you can find the field reference here.
The opEvents jsons_log service listens for filesystem changes, reads the new records, and runs each event through the opEvents engine.
We have added a new property, country, which is the ISO country code of the request.
- Element: mapped to the requestor’s IP address.
- Node: the name of our WAF in AWS.
- Description: the WAF action, the WAF rule that was triggered, the IP address and the country code. This gives opEvents unique enough data to create rolled-up event counts for WAF actions. Through opEvents’ dashboard you can quickly see, for example, which clients have made the most POST requests, or a bot attempting SQL injection against your site.
We use opEvents to store metadata about each WAF log entry: the headers, requesting IP, country, and which WAF rule terminated the request. Using the IP address we can quickly make an assumption about the request’s origin and know whether we have bots scraping us from data centres or users acting unlawfully. With this quick drill-down into the event data we can make rapid operational changes, implementing rules to stop certain traffic or adding entire subnets to our IP blacklist.
How we are using this information…
Debugging WAF rules
Implementing WAF rules can be challenging, especially when you have to go back and look at access history. opEvents stores 30 days of WAF logs, which we can quickly filter to find a blocked request, debug the rule, and either make an exception or change how our application works for better security.
Some crawlers generate quite a large amount of web traffic as they rapidly scan our domains. With an aggregate view of requests per IP address and the rules being triggered, it is easy to find the block of addresses causing issues. We then drill down into the request metadata, checking the headers, location, who owns the IP, and past request patterns. From this we can quickly ban malicious bot IP ranges.
Website usage statistics
With MongoDB backing opEvents it is easy to write queries, run them through the mongo shell, and aggregate usage data for more in-depth reporting: which country code uses this endpoint the most, or which user agent makes the most requests.
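The shape of such a group-and-sort query can be sketched as a standalone Go aggregation over decoded event records. The `Event` fields and sample values below are illustrative, not the opEvents schema; in practice you would express the same grouping as a MongoDB `$group`/`$sort` pipeline.

```go
package main

import (
	"fmt"
	"sort"
)

// Event mirrors the subset of WAF log fields we aggregate on; the
// field names are assumptions for this sketch.
type Event struct {
	Country   string
	UserAgent string
	URI       string
}

// kv is one aggregation row: a group key and its event count.
type kv struct {
	Key   string
	Count int
}

// countBy groups events by an arbitrary key function and returns the
// groups sorted by descending count -- the same result shape as a
// $group followed by $sort in a MongoDB aggregation pipeline.
func countBy(events []Event, key func(Event) string) []kv {
	counts := map[string]int{}
	for _, e := range events {
		counts[key(e)]++
	}
	out := make([]kv, 0, len(counts))
	for k, c := range counts {
		out = append(out, kv{k, c})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Count > out[j].Count })
	return out
}

func main() {
	events := []Event{
		{"US", "curl/8.0", "/login"},
		{"US", "Mozilla/5.0", "/login"},
		{"CN", "python-requests", "/wp-admin"},
	}
	// Which country code makes the most requests?
	for _, row := range countBy(events, func(e Event) string { return e.Country }) {
		fmt.Printf("%s\t%d\n", row.Key, row.Count)
	}
}
```

The same `countBy` call answers the user-agent question by swapping the key function, which is the appeal of keeping the grouping generic.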
Using Opmantek’s Operational Process Automation methodology, when we correlate sets of WAF events we trigger automated actions in our AWS environment to respond to incidents and avoid issues.
The opEvents engine can flexibly ingest any type of structured data, so we could quickly integrate it into our production monitoring and gain greater insight into our public-facing web systems.
If you would like to know more about using opEvents and processing web firewall logs, we offer live demos with our technical team here.
This guide is designed for businesses that would like to reduce manual effort through the automation of processes and tasks, reduce costs, and free up senior staff for more intellectual work.
Quietly, somewhere in an office downtown, bearings designed to last for 25,000 hours have been running non-stop for over forty-three thousand. The fan was cheaply machine-made from components sourced over several years across a dozen providers. It sat boxed for weeks before it was installed in the router chassis, which itself was then boxed up. Two months at sea, packed tight in a shipping container, then more months bounced around and shuffled from truck to warehouse and back to a parcel delivery. Finally, the device was configured, boxed, and shipped to its final installation point. Stuffed into a too-tight closet with no air circulation, this mission-critical router has been running non-stop for the past five years. It’s a miracle, really, that it worked this long.
Growing friction on the fan’s impeller shaft caused the amperage draw to increase to compensate and maintain rotational speed. When the amperage draw maxed out, rotations per minute (RPM) dropped. With the slower fan speed came less airflow; with lower airflow, the chassis temperature increased.
Complex devices, like routers, require low operating temperatures. The cooler it is, the easier it is for electrons to move. As the chassis temperature increased the router experienced issues processing the data packets traversing the interfaces. At first it was an error here or there, then routine traffic routing ran into problems and the router began discarding packets. From there things got much worse.
It’s late Saturday evening and your weekend has been restful so far. A night out with your significant other, a movie and dinner. It’s late now and you’re ready for bed when your phone chirps. The text message is short:
Device: Main Router
Event: Chassis high temperature with high discard output packets
Action Taken: Rerouted traffic by increasing OSPF cost
Action Required: Fan speed low, amperage high. Engineer investigate for repair/replacement.
The system had responded as you would have – it rerouted traffic off the affected interface, preventing a possible impact to system operation. Adding a note to your calendar to investigate the router first thing Monday morning, you turned in for a good night’s sleep.
Our Senior Engineer in Asia-PAC, Nick Day, likes to refer to Opmantek’s solutions as “engineer sleep insurance”. Coming from a background in managed service providers I can appreciate the situation. Equipment always breaks on your vacation time, often when the on-call engineer is as far away as possible, and with little useful information from the NMS. This was a prime scenario we used when building out our Operational Process Automation (OPA) solution.
The solution leverages opTrend to identify operational parameters outside of trended norms, while opEvents correlates events and automates remediation. With the addition of opConfig, configuration changes to network devices can also be automated. Operational Process Automation (OPA) builds on this statistical analysis and rules-based heuristics to automate troubleshooting and remediation of network events, which in turn reduces the negative impact on user experience.
Now let’s see how this was accomplished using the above example. At its root, opTrend is a statistical analysis engine. opTrend collects performance data from NMIS, Opmantek’s fault and performance system, and determines what normal operation looks like. Looking back over several weeks, usually twenty-six, opTrend determines what is normal for each parameter it processes. It does this hour by hour, considering each day of the week individually: Monday 9–10am has its own calculation, separate from Saturday 3–4pm. By looking across several weeks, opTrend can normalise things like holidays and vacation time.
Once a mean for each parameter is determined, opTrend calculates the standard deviation for that parameter and creates a window of three standard deviations above and below the mean. Any activity outside this window triggers an opTrend event into NMIS. These events can be in addition to those generated by NMIS’s thresholding and alert system, or in place of them.
In the example above, opTrend would have seen the chassis temperature exceed the normal window of operation. Had fan speed and/or amperage also been processed by opTrend (they are not by default, but can be configured to be), these would have been reported as low fan speed and high amperage.
This event from opTrend would have been sent to NMIS, then shared with opEvents for processing. A set of rules, or Event Actions, looked for events that could be caused by high temperature, often related to interface packet errors or discards. With wireless devices (WiFi and RF) this may affect signal strength and connection speed. A similar result could be achieved with a Correlation Rule, which groups multiple events across a window of time into a new parent event. Both methods are valid and have their own pros and cons.
opEvents then uses the high-temperature / high-discards event to start a troubleshooting routine. This may include directing opConfig to connect to the device via SSH and execute CLI commands to collect additional troubleshooting information. The results of these commands can have their own operational life: they are evaluated for error conditions, fire off new events, and themselves start Event Actions.
This is an example of what we would consider a medium-complexity automation. It comprises several Opmantek solutions, each configured (most automatically) to work together. These solutions share and process fault and performance information, correlate the resulting events, and apply a single set of event actions to gather additional information and configure around the event. When applying solution automations, we advocate a crawl-walk-run methodology: start by collecting troubleshooting information (crawl), then automate simple single-step remediations (walk), then slowly deploy multi-path remediations with control points (run).
Contact our team of experts here if you would like to know more about how this solution was developed, or how Operational Process Automation can be leveraged to save man-hours and reduce Mean Time to Resolve (MTTR).