Category Archives: Monitoring


Monitoring – The Challenge of Small Ops – Part 2

So you're building, or have built, a web service, and you've got a lot of challenges ahead. You've got to scale software and keep customers happy. Not surprisingly, that likely involves keeping your web service up, and that typically starts with setting up some form of monitoring to tell you when something goes wrong. In a large organization this may be the job of an entire team, but in a small organization it's likely everyone's job. So, given that everyone will tell you how much monitoring sucks, how do you make it better and, hopefully, manageable?

A Subject Matter Expert Election

Elect someone as your monitoring expert. I'd suggest someone with an Ops background, since they've probably used several of these tools and are familiar with the configuration and options for this type of software. It's important that this person is in the loop on product decisions, and is notified when applications are going to be deployed.

What Gets Monitored?

The simplistic answer you'll hear over and over again is "everything", but you're a small organization with limited headcount. Your monitoring expert should have an idea, since they're in the product loop. They should be making suggestions as to what metrics the application should expose, and getting the basic information for alerting from the development team (process names, ports, etc.). They should prioritize the list of tests and metrics, starting with what you need to know that the application is up, and finishing with the nice-to-haves.

Who Builds It?

Probably everyone. If that's all you want your monitoring expert to do, then have them do it, but it's far more efficient to spread the load. If people are having difficulty integrating with particular tools like Nagios or Ganglia, that's where your expert steps in. They should know the ins and outs of the tools you're using, and be able to train someone to extend a test, or get a metric collected, within a few minutes.
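
For a concrete sense of what that hand-off looks like, here's a minimal sketch of the kind of "is the application up" test a developer might describe to the monitoring expert, expressed as a Nagios port check. The command name, host name, and port below are hypothetical; check_tcp is the stock plugin.

define command {
    command_name check_app_port
    command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p $ARG1$
}

define service {
    use                 generic-service
    host_name           app-server-1
    service_description app-listening
    check_command       check_app_port!8080
}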

These metrics and tests should be treated as part of the product, and should follow the same process you use to get code into production. You should consider making them goals for the project, since if you wait until after shipping there will be some lag between pushing code and getting it monitored.

Wrapping It Up

It's really not that hard; you just need someone to take a breath and plan. The part of monitoring that sucks is the human part, and since your app will not tell people when it's up or down, you need a person to think about that for you.

Take a look at Part 1 if you liked this article.


Monitoring Your Customers with Selenium and Nagios

In a brief conversation with Noah Sussman at DevOps Days, while discussing the challenges of continuous deployment for B2B services with SLAs, we got sidetracked talking about using Selenium and Nagios in production.

A few years back, while working for a B2B company that was compensated based on attributable sales, I got on a phone call early in the morning to discuss fixing a client-side display issue. The previous night, after a release, an integration engineer had modified a config that prevented our service from rendering on almost every page at a single customer. The bug was fairly subtle, allowing what he was working on to display correctly while breaking every other div on the site. It was pushed in early spring, at off hours, and caught at the beginning of the day on the east coast.

At 9am PST, we held a post-mortem with all of our engineers. We discussed the impact of the issue on our revenue, which fortunately was pretty small, and laid out the timeline. Immediately, we discussed whether this was a testing issue or a monitoring failure. The CEO came back and said that while it was understandable that we missed the failure, our goal as an Ops team should be to catch any rendering issue within 5 minutes of a failure. I was a little annoyed, but agreed to build a series of tests to try and catch this.

Why Other Metrics Failed Us

We had a fairly sophisticated monitoring setup at this point. We tracked daily revenue per customer, and we would generally know within 30 minutes if we had a problem. Our customers' sites varied, but US-only sites typically had almost no traffic between 0:00 and 6:00 PST/PDT; in that window it wasn't unusual to have 0-2 sales. Once we got into a busier sales period the issue was spotted within 30 minutes, and we were alerted. During the overnight lull, though, our primary monitoring metric for this type of failure was useless.

QA Tools Can Solve Ops Problems Too

I was familiar with Selenium from some acceptance tests I had helped our QA guys write. I began to put together a test suite that met my needs (I can't provide the code for this, sorry). It consisted of:

  • rendering the main page
  • navigating to a page on which we displayed content
  • clicking a link we provided
  • verifying that we displayed our content on a new page

This worked fairly well, but I had to tweak some timings to make sure the page was "visible". I rigged this up to run through JUnit, and left a Selenium server running all the time. Every 5 minutes the test suite would execute, leaving behind a log of successes and failures. We eventually built a test suite for every sizable customer. Every 5 minutes we checked the JUnit output with a custom Nagios test that would tell us which customers had failures, and send an individual alert for each one.
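
Since I can't share the original code, here's a minimal, hypothetical reconstruction of one customer's suite, written against today's Selenium WebDriver and JUnit 4 APIs rather than the Selenium RC setup of the era. The URLs, element IDs, and the selenium-host address are all made up:

import static org.junit.Assert.assertFalse;

import java.net.URL;
import java.time.Duration;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxOptions;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class CustomerRenderTest {

    private WebDriver driver;

    @Before
    public void setUp() throws Exception {
        // Attach to the long-running Selenium server (hypothetical address).
        driver = new RemoteWebDriver(
                new URL("http://selenium-host:4444/wd/hub"), new FirefoxOptions());
    }

    @Test
    public void ourContentRendersAndClicksThrough() {
        // Render the customer's main page.
        driver.get("https://customer.example.com/");

        // Navigate to a page on which we displayed content.
        driver.get("https://customer.example.com/products");

        // Wait until our div is actually visible -- the "tweaked timings".
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
        wait.until(ExpectedConditions.visibilityOfElementLocated(By.id("our-widget")));

        // Click a link we provided, then verify our content on the new page.
        driver.findElement(By.cssSelector("#our-widget a")).click();
        assertFalse("our content missing after click-through",
                driver.findElements(By.className("our-content")).isEmpty());
    }

    @After
    public void tearDown() {
        if (driver != null) {
            driver.quit();
        }
    }
}

Run every 5 minutes, a suite like this leaves behind the per-customer pass/fail log that the custom Nagios check can scrape.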

Great Success!

I was really annoyed when I first had this conversation with the CEO; I thought this was a boondoggle that Ops should not be responsible for. Within the first month my annoyance turned to delight as I started getting paged when our customers had site issues. I typically called them before their own NOC had noticed, and most of the time these were issues they had introduced to their site. I'd do it again in a heartbeat, and recommend that anyone else give it a try.



Good Nagios Parenting Avoids a Noisy Pager

Monitoring configuration is complicated, and the depth to which you can configure alerts and tests seems endless. Investing in some options may seem like a waste of time, but others can really help you eliminate states that send hundreds of alerts. The end goal of your configuration is that any alert sent to the pager is immediately actionable, and that all other issues are ignored. Certain failure states, like failed switches and routers, can cause a flood of alerts, since they take down the network infrastructure and obscure the true cause of an outage.

Defining the Right Config

The first step you can take to prevent a flood of pages is to define all your routers, switches, and other network equipment in your Nagios config. Once you have that defined, you simply need to define a parent on the config object.
For example:

# Primary Switch in VRRP Group
define host {
    use        switch
    address    10.0.0.2
    host_name  switch-1
    hostgroups switches
}

# Secondary Switch in VRRP Group
define host {
    use        switch
    address    10.0.0.3
    host_name  switch-2
    hostgroups switches
}

define host {
    use        server
    address    10.0.0.100
    host_name  apache-server-1
    hostgroups servers, www
    parents    switch-1, switch-2
}

This configures the host apache-server-1 so that if both switch-1 and switch-2 fail, Nagios marks it UNREACHABLE rather than DOWN and silences its alerts. The alerts will remain off until either switch-1 or switch-2 becomes available again.

A Few Things to Keep in Mind

Nagios is pretty smart and can handle multiple parents, so a host's alerts will only be silenced if all of its parents become unavailable.

The availability of a parent host is determined by its host health check, most commonly a ping. If you need some other test of availability, make sure to define it in the host object.
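
For example, for a parent that drops ICMP you might use an SSH port check as the host health test. This is a hedged sketch; check_host_ssh is a made-up command name wrapping the stock check_tcp plugin, and the host is hypothetical:

define command {
    command_name check_host_ssh
    command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 22
}

define host {
    use           server
    address       10.0.0.101
    host_name     backup-server-1
    check_command check_host_ssh
    parents       switch-1, switch-2
}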

Parent all the objects you can, or at least all the ones it makes sense to parent. For example, a router or transport failure at a remote data center should only send a single alert. This means you should define your routers, switches, and possibly your provider's gateways. Do whatever you think makes sense, and take it as far as you can. Remember, your goal is to make the number of alerts manageable, so the better you define the topology, the less likely you are to get a useless page, or several hundred useless pages.
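
As a rough sketch of that topology (the host names and addresses are made up), a remote data center might be modeled like this:

# Provider gateway at the remote data center
define host {
    use       router
    address   203.0.113.1
    host_name provider-gw-east
}

# Our router, reachable only through the provider gateway
define host {
    use       router
    address   10.1.0.1
    host_name router-east-1
    parents   provider-gw-east
}

# A switch behind that router
define host {
    use       switch
    address   10.1.0.2
    host_name switch-east-1
    parents   router-east-1
}

With that chain defined, a transport failure at provider-gw-east marks everything behind it UNREACHABLE and sends one page instead of one per host.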


Three Monitoring Tenets

This week I was seeing a drop in average back-end performance at work: average page load time went from ~250ms to around 500ms. It seemed to be an intermittent problem, and we searched through our graphs in NewRelic with no clear culprit. Then we started looking at our internal MediaWiki profiling collector, and some of the various aggregation tools I had put together. After a few minutes it became clear that the connection time to one of our databases had increased 1000-fold. Having recently changed the spanning-tree configuration and moved some of our cross connects to 10Gb/s, I suspected this might be a spanning-tree issue. It turned out the Ganglia daemon (gmond) on that host had a memory leak, and had consumed enough memory to negatively affect system performance. Unfortunately, this was a pretty inefficient way to find the problem, and it reminded me of a few basic tenets of monitoring.

Monitor Latency

A simple MySQL test can tell you whether your server is up or down, and your alerts probably even have timeouts, but in most monitoring tools I've seen these are measured in seconds, not milliseconds. You should configure your alert to tell you when the service you're monitoring has degraded to an unacceptable level and may be affecting site performance. So your simple MySQL check should time out in 3 seconds, but alert you if it takes more than 100ms to establish a connection. Remember, if the latency is high enough, your service is effectively down.
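
As a sketch (the host and command names are hypothetical): check_mysql doesn't take millisecond thresholds, but the stock check_tcp plugin accepts fractional-second response times, so a connect-latency check against the MySQL port can warn at 100ms while still timing out at 3 seconds:

define command {
    command_name check_mysql_connect_latency
    command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 3306 -w 0.1 -c 1 -t 3
}

define service {
    use                 generic-service
    host_name           db-1
    service_description mysql-connect-latency
    check_command       check_mysql_connect_latency
}

Here -w 0.1 warns past 100ms, -c 1 goes critical past a full second, and -t 3 treats anything slower than 3 seconds as down.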


Monitoring Your Monitoring

Sometimes your monitoring can get out of whack. You may find that your tests are consuming so many resources that they are negatively affecting performance. You need to define acceptable parameters for these applications, and make sure each one is doing what you expect.
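
In the gmond story above, a resource bound on the monitoring agent itself would have caught the leak directly. Here's a hedged sketch using the stock check_procs plugin; the host and command names are hypothetical, and the thresholds (resident memory in kilobytes, so roughly 100MB and 400MB) are illustrative:

define command {
    command_name check_gmond_memory
    command_line $USER1$/check_procs -C gmond --metric=RSS -w 102400 -c 409600
}

define service {
    use                 generic-service
    host_name           db-1
    service_description gmond-memory
    check_command       check_gmond_memory
}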


Set Your Alerts Lower than You Need

Your alerts should go off before your services are broken. Ideally this would be done by alerting on warnings, but for a good number of people warnings are too noisy. If you're only going to alert on errors, set your threshold well below the service level you expect to provide. For example, if you have an HTTP service that you expect to answer within 100ms, and it typically answers within 25ms, your warnings should be set at something like 70ms and errors at 80ms. By alerting early, you're preventing a call from a customer, or an angry boss.
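
Continuing that example (host and command names hypothetical), check_http takes the same fractional-second thresholds as check_tcp, so the 70ms warning and 80ms error split looks like this:

define command {
    command_name check_www_latency
    command_line $USER1$/check_http -H $ARG1$ -w 0.07 -c 0.08 -t 10
}

define service {
    use                 generic-service
    host_name           www-server-1
    service_description www-latency
    check_command       check_www_latency!www.example.com
}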

So give these three things a try, and you should end up with a better monitoring setup.