Monthly Archives: March 2012

DevOps Monitoring

Three Monitoring Tenets

This week I saw a drop in average back-end performance at work: average page load time went from ~250ms to around 500ms. The problem seemed intermittent, and we searched through our graphs at NewRelic with no clear culprit. Then we started looking at our internal MediaWiki profiling collector and some of the aggregation tools that I had put together. After a few minutes it became clear that the connection time on one of our databases had increased 1000-fold. Having recently changed the spanning-tree configuration and moved some of our cross connects to 10Gb/s, I suspected a spanning-tree issue. It turned out that our Ganglia daemon (gmond) on that host had leaked enough memory to negatively affect system performance. Unfortunately this was a pretty inefficient way to find the problem, and it reminded me of a few basic tenets of monitoring.


Monitor Latency

A simple MySQL check might only tell you whether your server is up or down. Your alerts probably have timeouts, but in most monitoring tools I’ve seen these are measured in seconds, not milliseconds. You should configure your alerts to tell you when the service you’re monitoring has degraded to an unacceptable level and may be affecting site performance. So your simple MySQL check should time out at 3 seconds, but alert you if it takes more than 100ms to establish a connection. Remember, if the latency is high enough, your service is effectively down.
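As a rough illustration of the 3 second timeout and 100ms alert threshold described above, here is a minimal Python sketch. It only times a TCP connect, not a full MySQL handshake, and the function names and status strings are my own assumptions rather than part of any particular monitoring tool.

```python
import socket
import time

TIMEOUT_S = 3.0           # hard timeout: beyond this the service counts as down
LATENCY_ALERT_MS = 100.0  # alert threshold for connection time


def classify(elapsed_ms, alert_ms=LATENCY_ALERT_MS):
    """Map a connect time in ms (or None for a failed connect) to a status."""
    if elapsed_ms is None:
        return "DOWN"
    if elapsed_ms > alert_ms:
        return "SLOW"
    return "OK"


def connect_time_ms(host, port, timeout_s=TIMEOUT_S):
    """Return the TCP connect time in ms, or None on failure or timeout."""
    start = time.monotonic()
    try:
        # Timeouts, refused connections and DNS failures are all OSError.
        with socket.create_connection((host, port), timeout=timeout_s):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None
```

A check would then call something like classify(connect_time_ms("db.example.com", 3306)), where the host name is a placeholder and 3306 is the default MySQL port. A "SLOW" result is what pages you long before the 3 second timeout is ever reached.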


Monitoring your Monitoring

Sometimes your monitoring can get out of whack. You may find that your tests are consuming so many resources that they are negatively affecting your performance. You need to define acceptable parameters for these applications, and make sure that they are doing what you expect.
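One way to define "acceptable parameters" for a monitoring agent is a memory budget. The sketch below is a minimal Python example that assumes a Linux host, where /proc/<pid>/status exposes a VmRSS line; the function names and the idea of a fixed budget are illustrative assumptions, not how Ganglia itself reports memory.

```python
def parse_rss_kb(status_text):
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line looks like "VmRSS:     1234 kB"
            return int(line.split()[1])
    return None  # e.g. kernel threads have no VmRSS line


def read_rss_kb(pid):
    """Read the resident set size of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_rss_kb(fh.read())


def over_budget(rss_kb, budget_kb):
    """True when the agent uses more resident memory than its budget."""
    return rss_kb is not None and rss_kb > budget_kb
```

Run periodically against the agent's pid with a budget of your choosing, a check like this flags a slow leak while it is still small.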


Set Your Alerts Lower than You Need

Your alerts should go off before your services are broken. Ideally this would be done with alerts on warnings, but for a good number of people warnings are too noisy. If you’re only going to alert on errors, set your threshold well below the service level you expect to provide. For example, if you have an HTTP service that you expect to answer within 100ms, and that typically answers within 25ms, your warnings should be set at something like 70ms and errors at 80ms. By alerting early, you’re preventing a call from a customer or an angry boss.
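The HTTP example above can be sketched in a few lines of Python. The 70% and 80% figures simply reproduce the 70ms and 80ms thresholds for a 100ms service level; the function names are hypothetical, and the right percentages for your service are a judgment call.

```python
SLA_MS = 100.0  # the service level you expect to provide


def thresholds(sla_ms, warn_pct=70, error_pct=80):
    """Derive warning and error thresholds as percentages of the service level."""
    return sla_ms * warn_pct / 100, sla_ms * error_pct / 100


def status(latency_ms, sla_ms=SLA_MS):
    """Classify a measured response time against the early thresholds."""
    warn_ms, error_ms = thresholds(sla_ms)
    if latency_ms >= error_ms:
        return "ERROR"
    if latency_ms >= warn_ms:
        return "WARNING"
    return "OK"
```

With these numbers a typical 25ms response is "OK", while an 85ms response already raises an error even though it is still inside the 100ms service level.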

So give these three things a try, and you should end up with a better monitoring setup.

DevOps solr

Groovy, A Reasonable JVM Language for DevOps

I’ve worked in several environments where most of our product ran on the JVM. I’ve always used the information available to me in MBeans, but the overhead of exposing them to a monitoring system like Ganglia or Nagios has always been problematic. What I’ve been looking for is a simple JVM language that allows me to use any native object. In 2008 I found Groovy, and I have used it as my JVM glue ever since.

Meet Groovy

I won’t go into too much detail here, but Groovy is a simple, dynamically typed language for the JVM. Its syntax and style are similar to those of Ruby. You can find more information here.

Solr JMX to Ganglia Example

This is a quick script to get data from Solr’s JMX into Ganglia. The syntax is simplified, and it is missing the typical Java boilerplate required to run from the command line. Not only that, but because of the simplified typing model, it’s very easy for me to build strings and then execute a command. This allows me to quickly construct a script like I would in Perl or Python, but also gives me easy access to JMX and any Java library of my choosing. The primary drawback is the time it takes to instantiate the JVM, which you can work around if you can’t deal with the launch time.

So, if you’re working in a Java environment with a bunch of Java apps, think about giving Groovy a chance for writing some of your monitoring tests and metric collectors. It is a simpler language than Java for putting together those little applications that can tell you how your system is performing, and it is well within the reach of your average DevOps engineer.