Author Archives: papilion

Two Helpful Data Concepts

I’ve been batting around a couple of terms for years while talking about technical solutions, and I’ve found them useful when selecting and constructing solutions for managing data. They’ve helped me both build solutions and explain what I need from one.

1. Real-Timey-ness

When considering solutions for your data needs, you may say you need something real-time, but you’re unlikely to ever find such a solution. Most of them will incur some latency while persisting a record, and a record only becomes available once that write has completed. Aggregate data will be delayed further, since it depends on the latency of the underlying records. There are rarely any true real-time solutions; most solutions are real-timey. They’ll take some time, measured in milliseconds, seconds, minutes, or hours, before the data becomes usable. In addition, most data is not actionable for some time after it becomes available; a single data point is rarely enough to make a decision.

2. Correcty-ness

Is your answer correct? This is a fundamental question in dealing with data, and most people I’ve dealt with make one fundamentally incorrect assumption: that the data they collect is exactly right. For most metrics, the data you collect is only mostly correct. You will lose some transactions, you will overcount others, your software will have bugs, and these will all lead to inaccuracy in your data. To improve your correcty-ness you’ll likely have to trade resolution or latency in order to push yourself closer to your achievable accuracy. Correcty-ness, unlike real-timey-ness, can be improved over time. You can look at higher-resolution data and adjust your initial measurement for a time period.

A Practical Example

The scalability of your application will generally be determined by your requirements around real-timey-ness and correcty-ness. Take, for example, a visits counter on a web page. The most straightforward implementation would be to increment a counter for each page render and store it in a database (UPDATE counter_table SET views_count = views_count+1 WHERE page_id = 5). This solution will have a high degree of accuracy, but comes with limited scalability since we’d be locking a row in order to increment its value. Furthermore, the accuracy of this solution may actually degrade as usage increases, since aborted page loads may fail to increment the counter. This solution would have high real-timey-ness and correcty-ness, at the expense of scalability.
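As a rough illustration of the synchronous counter, here is a minimal sketch using Python’s built-in sqlite3 module. The table and column names come from the UPDATE statement above; the in-memory database and the record_view helper are assumptions made for the example, not part of any real application.

```python
import sqlite3

def record_view(conn, page_id):
    """Increment the counter during the page render and return the new count."""
    with conn:  # one transaction per render: commit on success, roll back on error
        conn.execute(
            "UPDATE counter_table SET views_count = views_count + 1 "
            "WHERE page_id = ?",
            (page_id,),
        )
    row = conn.execute(
        "SELECT views_count FROM counter_table WHERE page_id = ?", (page_id,)
    ).fetchone()
    return row[0]

# Hypothetical setup so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE counter_table "
    "(page_id INTEGER PRIMARY KEY, views_count INTEGER NOT NULL)"
)
conn.execute("INSERT INTO counter_table VALUES (5, 0)")
conn.commit()
```

Every render pays for a write transaction against the same row, which is where the contention comes from as traffic grows.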

A more complex solution would be to look at the request logs and increment the row in batches using an asynchronous process. This solution will update the counter in whatever time it takes to aggregate the last set of logs. The correcty-ness of the count at any given point will be lower than that of the above solution, as will the real-timey-ness, since you will only have the answer as of the last aggregation. However, the solution will support larger request volumes, since the aggregation of requests takes place outside of the page render.
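The batch approach could look something like the sketch below. The log line format and the /page/<id> URL scheme are hypothetical; the point is only that many page views collapse into one UPDATE per page per batch.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical log format; only successful (200) page renders are counted.
PAGE_VIEW = re.compile(r'"GET /page/(\d+) HTTP/[\d.]+" 200\b')

def aggregate(log_lines):
    """Count page views per page_id in one batch of request-log lines."""
    counts = Counter()
    for line in log_lines:
        match = PAGE_VIEW.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

def apply_batch(conn, counts):
    """Add a whole batch to the stored counters in a single transaction."""
    with conn:
        conn.executemany(
            "UPDATE counter_table SET views_count = views_count + ? "
            "WHERE page_id = ?",
            [(views, page_id) for page_id, views in counts.items()],
        )
```

The counter is only as fresh as the last call to apply_batch, which is the real-timey-ness cost described above.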

The first solution presented is perfect for a small web application. The small number of requests you receive at small scale can make asynchronous solutions look broken, and the latency incurred per page render is relatively small. At small scale it’s probably better to favor a higher degree of correcty-ness.

The second solution will perform much better at larger scale. Its lack of correcty-ness and real-timey-ness will be hidden by the hit counter incrementing by large numbers with each refresh. This solution would generally be called eventually consistent, but you can never really achieve consistency without looking at a fixed time window that is no longer being updated.

A Third Solution

During each page render a UDP packet could be sent to an application that increments an in-memory counter. The page could then pull this count from the secondary application and display the current count. To achieve consistency, the request logs could be aggregated on a given interval, with the result replacing the base value of the counter.
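A minimal sketch of this idea follows. The HitCounter class, the port number, and the wire format (the page id as ASCII text) are all assumptions for illustration. For simplicity, rebase assumes the aggregation runs right at the log cutoff, so any hits that arrive between the cutoff and the rebase would be dropped.

```python
import socket
import threading

class HitCounter:
    """In-memory counts: a base from the last log aggregation plus recent hits."""

    def __init__(self):
        self._lock = threading.Lock()
        self._base = {}    # page_id -> count from the last log aggregation
        self._recent = {}  # page_id -> UDP hits received since that aggregation

    def hit(self, page_id):
        with self._lock:
            self._recent[page_id] = self._recent.get(page_id, 0) + 1

    def rebase(self, page_id, aggregated_count):
        """Replace the base with the log-derived count and reset recent hits."""
        with self._lock:
            self._base[page_id] = aggregated_count
            self._recent[page_id] = 0

    def count(self, page_id):
        with self._lock:
            return self._base.get(page_id, 0) + self._recent.get(page_id, 0)

def send_hit(page_id, addr=("127.0.0.1", 9999)):
    """Fire-and-forget from the page render; a lost packet is a lost count."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(str(page_id).encode("ascii"), addr)

def serve(counter, addr=("127.0.0.1", 9999)):
    """Receive hits forever; malformed packets are ignored."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(addr)
        while True:
            data, _ = sock.recvfrom(64)
            try:
                counter.hit(int(data.decode("ascii")))
            except ValueError:
                pass
```

The render path never waits on the counter, and the periodic rebase is what pulls the number back toward the log-derived truth.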

This solution will have a high degree of real-timey-ness since page views will be aggregated immediately. However, the correcty-ness of the application will be lower than the first solution’s, since the data transmission method is less reliable. This is a fairly scalable solution that would balance actionable real-time data with the ability to correct measuring errors. That said, it is likely less scalable than the two previous solutions.

Great, What Now?

When designing an application, take the time to think about the acceptable real-timey-ness and correcty-ness. In general, high correcty-ness and real-timey-ness create slow applications at larger scale. So, when spec’ing out an application, consider assigning a real-timey-ness value to the data you present to users. I would typically define it as a time window, e.g. data presented in the UI will be at least 5 seconds old, and no more than 2 minutes old. As for correcty-ness, I would define it as the acceptable accuracy within a given time period. For example, data must have an error no larger than 50% within 5 seconds, 10% within 5 minutes, and 0.1% within 7 minutes.
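One way to make such a specification checkable is to write it down as data. The sketch below encodes the example numbers above; the SPEC structure and function names are hypothetical, and it reads “within N seconds” as “once the data is at least N seconds old”.

```python
# (age in seconds by which the bound applies, maximum relative error)
SPEC = [(5, 0.50), (300, 0.10), (420, 0.001)]

def allowed_error(age_seconds, spec=SPEC):
    """Return the tightest error bound that applies to data of this age.

    Returns None when the data is younger than the first deadline,
    meaning the spec makes no promise yet.
    """
    bound = None
    for deadline, max_error in spec:
        if age_seconds >= deadline:
            bound = max_error
    return bound

def meets_spec(reported, true_value, age_seconds, spec=SPEC):
    """Check a reported value against the true value for data of a given age."""
    bound = allowed_error(age_seconds, spec)
    if bound is None:
        return True
    if true_value == 0:
        return reported == 0
    return abs(reported - true_value) / true_value <= bound
```

A count of 95 against a true value of 100 would pass at 5 minutes (5% error against a 10% bound) but fail at 7 minutes, when the bound tightens to 0.1%.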

Deciding what these numbers should be is a different problem. You can generally address real-timey-ness at the expense of correcty-ness, but it’s hard to improve both correcty-ness and real-timey-ness at scale. I would generally look at the scale of the application to decide how important either is. Solutions that are low scale won’t generally have contention issues, so I would favor high real-timey-ness and correcty-ness. In addition, people are more likely to notice issues when the numbers increase by small amounts (e.g. the counter not going from 2 to 3 after 4 page views), and are more likely to complain about problems in the software. For large-scale solutions I would look at how long the data takes to become actionable. Being accurate within 1 to 5 minutes may be enough to help you derive a result from your data, but it may take 24 hours before you can conclude anything. Think about the amount of time this will take, and then build your specification accordingly.

Hopefully these concepts are useful, and can be put to use when designing solutions for data.



Keep it Simple Sysadmin

I’ve been thinking about what I hate about my configuration management system. When I want to make a change, I seem to spend a lot of time looking at the various resources in Chef, and sometimes I end up using providers like the Opscode apache2 cookbook to manage enabling and disabling of modules. A few days ago, while in a rush, I did the following:

A few days later, a co-worker decided this was verbose and replaced it with the following, using the previously mentioned apache2 cookbook syntax:

This seemed reasonable, and frankly is the approach I would have taken if I had taken the time to figure out whether we had a cookbook providing this resource (this was an emergency change). The only problem was that we’re using a broken version of that module that didn’t actually do anything (I’ve still not dug in, but I found a similar bug that was fixed some time ago). So, no errors, no warnings, and no configuration was applied or removed.

I’ve come to the decision that both are probably the wrong approach. Both of these approaches were trying to use the a2enmod command available in Ubuntu, or to provide similar functionality. It seems reasonable since it will manage dependencies, but why should I use this? The only reason would be to maintain compatibility with Ubuntu’s and Debian’s configuration management utilities, but I’ve already decided to outsource this to Chef, which does that pretty well. I’ve come to believe that I should have just done this:


The right approach to configuration management is to use the minimal features of your tool. Chef (in this case) provides awesome tools for file management, but when I use higher-level abstractions I’m actually introducing more obfuscation. The example given with the Apache module is painful because, when I look at what is really happening, I’m copying a file. Chef is really good at managing files, so why would I want to abstract that away? Do people really not understand how a2enmod works? Is this really a good thing if your operations team doesn’t know how Apache configs work?

Cron is another great example:
Do we really find this simpler than creating a cron file, and having it copied to /etc/cron.d? Isn’t that why we introduced cron.d: to get out of having cron jobs installed in users’ crontabs? It’s also difficult to ensure Chef removes the job, since a user can screw up that file. Not to mention that this has introduced a DSL to manage a DSL, with a more verbose syntax than the original, which seems absurd.

KISS – Keep it Simple Sysadmin

My frustration here is that for the most part we are just copying files around. The higher-level abstractions we use actually decrease clarity and understanding of how the underlying configuration is being applied. If there is a logical flaw (bug) in the implementation, we’re stuck suffering, needing to address the issue and sort out who is at fault. So, use the simple resources, and stay away from the magic ones, and you’ll actually be able to understand the configuration of your systems and hopefully get to the root of your problems faster.

Monitoring – The Challenge of Small Ops – Part 2

So you’re building or have built a web service, and you’ve got a lot of challenges ahead. You’ve got to scale software and keep customers happy. Not surprisingly, that likely involves keeping your web service up, and that typically starts by setting up some form of monitoring that tells you when something goes wrong. In a large organization this may be the job of an entire team, but in a small organization it’s likely everyone’s job. So, given that everyone will tell you how much monitoring sucks, how do you make it better and hopefully manageable?

A Subject Matter Expert Election

Elect someone as your monitoring expert. I’d suggest someone with an Ops background, since they have probably used several tools, and are familiar with the configuration and options for this type of software. It’s important that this person is in the loop with regard to product decisions, and they need to be notified when applications are going to be deployed.

What Gets Monitored?

The simplistic answer you’ll hear over and over again is “everything”, but you’re a small organization with limited headcount. Your monitoring expert should have an idea, since they’re in the product loop. They should be making suggestions as to what metrics the application should expose, and getting the basic information from the development team for alerting (process names, ports, etc.). They should prioritize the list of tests and metrics, starting with what you need to know that the application is up, and finishing with the nice-to-haves.

Who Builds It?

Probably everyone. If that’s all you want your monitoring expert to do, then have them do it, but it’s far more efficient to spread the load. If people are having difficulty integrating with particular tools like Nagios or Ganglia, that’s where your expert steps in. They should know the ins and outs of the tools you’re using, and be able to train someone to extend a test, or get a metric collected, within a few minutes.

These metrics and tests should be treated as part of the product, and should follow the same process you use to get code in production. You should consider making them goals for the project, since if you wait till after shipping there will be some lag between pushing code and getting it monitored.

Wrapping It Up

It’s really not that hard; you just need someone to take a breath and plan. The part of monitoring that sucks is the human part, and since your app will not tell people when it’s up or down, you need a person to think about that for you.

Take a look at Part 1 if you liked this article.

Three First Pass Security Steps

I’m no security expert, but in my experience these are three simple things you can do to avoid a security incident.

3. Fix Authentication

Don’t allow users to log in to your systems just using passwords. Passwords are easy to set up and get running, but are also easily lost. For SSH, use SSH keys at a minimum. If you have the money and time, implement a two-factor authentication system that requires a password plus a second identifier. For SSL VPNs use a cert and a password; for a system use something like SecurID and a password.

2. Use Prepared Statements

Wherever you can, use prepared statements. Since the statement is compiled before any input is bound to it, this helps cut down on the majority of SQL injection attacks. This doesn’t exempt you from catching bad SQL; it just makes it less likely to be an issue.
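As a small illustration, here is a sketch using parameter binding in Python’s built-in sqlite3 module. The users table and the find_user helper are made up for the example; the relevant part is the ? placeholder, which keeps user input out of the SQL text.

```python
import sqlite3

# Hypothetical table so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 1), ("bob", 0)])
conn.commit()

def find_user(conn, name):
    """Look up a user by name; the input is bound as data, never parsed as SQL."""
    return conn.execute(
        "SELECT name, is_admin FROM users WHERE name = ?", (name,)
    ).fetchall()
```

With this in place, a classic injection string such as bob' OR '1'='1 is simply treated as an odd user name that matches no rows.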

1. Reduce Your Attack Surface

Use a firewall, and use a host level firewall. The easiest way to prevent an attack is to lock the door, and if someone can’t reach your server via SSH they are unlikely to break in using SSH. You should make sure that you only expose the services your clients need to access your server. In addition, if you’ve done this at the edge of your network you should look at doing this on a host level as well.




Monitoring Your Customers with Selenium and Nagios

In a brief conversation with Noah Sussman at DevOps Days, while discussing the challenges of continuous deployment for B2B services with SLAs, we got sidetracked discussing using Selenium and Nagios in production.

A few years back, while working for a B2B company that was compensated by attributable sales, I got on a phone call early in the morning to discuss fixing a client-side display issue. The previous night, after a release, an integration engineer modified a config that broke our service from rendering on almost every page at a single customer. The bug was fairly subtle, allowing what he was working on to display correctly, but breaking every other div on the site. This was pushed in early spring, at off hours, and caught at the beginning of the day on the east coast.

At 9am PST, we held a post-mortem with all of our engineers. We discussed the impact of the issue on our revenue, which fortunately was pretty small, and laid out the timeline. Immediately, we discussed whether this was a testing issue or a monitoring failure. The CEO came back and said that, while it was understandable that we missed the failure, our goal as an Ops team should be to catch any rendering issue within 5 minutes of a failure. I was a little annoyed, but agreed to build a series of tests to try and catch this.

Why Other Metrics Failed Us

We had a fairly sophisticated monitoring setup at this point. We tracked daily revenue per customer, and we would generally know within 30 minutes if we had a problem. Our customers’ sites varied, but US-only sites typically had almost no traffic between 0:00 and 6:00 PST/PDT; in that time period it wasn’t unusual to have 0-2 sales. Once we got into a busier sales period the issue was spotted within 30 minutes, and we were alerted. During the quiet period, it turns out, our primary monitoring metric for this type of failure was useless.

QA Tools Can Solve Ops Problems Too

I was familiar with Selenium from some acceptance tests I helped our QA guys write. I began to put together a test suite that met my needs (can’t provide the code for this, sorry). It consisted of:

  • rendering the main page
  • navigating to a page on which we displayed content
  • clicking a link we provided
  • verifying that we displayed our content on a new page

This worked fairly well, but I had to tweak some timings to make sure the page was “visible”. I rigged this up to run through JUnit, and left a Selenium server running all the time. Every 5 minutes the test suite would execute, leaving behind a log of successes and failures. We eventually built a test suite for every sizable customer. Every 5 minutes we checked the output of JUnit with a custom Nagios test that would tell us which customers had failures, and send an individual alert for each one.
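The original check isn’t available, so what follows is only a sketch of what such a Nagios test might look like. It assumes a hypothetical results log with one line per run in the form “<unix timestamp> <customer> <PASS|FAIL>”, and one Nagios service per customer so that each gets its own alert. The exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin return codes.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_customer(lines, customer, now, max_age=600):
    """Return (exit_code, message) from the newest result for one customer."""
    latest = None
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or parts[1] != customer:
            continue
        try:
            timestamp = int(parts[0])
        except ValueError:
            continue  # skip malformed lines
        if latest is None or timestamp >= latest[0]:
            latest = (timestamp, parts[2])
    if latest is None:
        return UNKNOWN, f"UNKNOWN - no results for {customer}"
    timestamp, result = latest
    if now - timestamp > max_age:
        # A stale log means the test runner itself may be broken.
        return UNKNOWN, f"UNKNOWN - last result for {customer} is {now - timestamp}s old"
    if result == "PASS":
        return OK, f"OK - {customer} rendered correctly"
    return CRITICAL, f"CRITICAL - {customer} failed the rendering test"
```

A wrapper script would read the log, call check_customer with the current time, print the message, and exit with the returned code. Treating a stale log as UNKNOWN keeps a dead Selenium server from looking like a healthy customer.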

Great Success!

I was really annoyed when I first had this conversation with the CEO; I thought this was a boondoggle that ops should not be responsible for. Within the first month my annoyance turned to delight as I started getting paged when our customers had site issues. I typically called them before their NOC had noticed, and most of the time these were issues they had introduced to their site. I’d do it again in a heartbeat, and recommend that anyone else give it a try.


The Challenge of Small Ops (Part 1)

I missed an open session at DevOps Days, and I’m really disappointed that I did after hearing feedback from one of the conference participants. He said many people in the session were advocating for eliminating operations at small scale.

I realize that the world is changing, and that operations teams need to adjust to the changing world. We have skilled development teams, cloud services, and APIs for almost everything you need to build a moderately sized web service. It’s no wonder that some smaller (and some large) organizations are beginning to question the need for ops teams. So, I’m going to write a series of articles discussing the challenges that exist in operations for small and medium-sized teams, and how an operations expert can help solve these issues.

Ops as the Firefighter

In my discussion with a fellow conference-goer about this topic, when he said the general feeling was that you push responsibility to your vendors and eliminate ops, I suggested that perhaps we should think of ops as a firefighter.

Small towns still have firefighting teams, and they may be volunteers, but I’ll bet they were trained by a professional. You should think of an Operations Engineer as your company’s trainer. You should lean on them for the knowledge that can only be gained working in an operational environment.


Failure is the only constant for web services, and you should expect failures to happen. You will need to respond to failures in a calm and organized manner, but this is likely too much for a single individual. You’ll need a better approach.

A mid-level or senior operations engineer should be able to develop an on-call schedule for you. They should be able to identify how many engineers you need on-call in order to meet any SLA response requirement. In addition they can train your engineers how to respond, and make sure any procedure is followed that you might owe to customers. They can make everyone more effective in an emergency.

Vendor Management

Amazon, Heroku, and their friends all provide excellent reliable platforms, but from time to time they fail. Vendors typically like to restrict communications to as few people as possible, since it makes it easier for them to communicate. If you’re not careful you may find yourself spreading responsibility for vendors across your organization, as individuals add new vendors.

I believe it makes more sense to consolidate the knowledge in an operations engineer. An operations engineer is used to seeing vendors fail, and will understand the workflow required to report and escalate a problem. They understand how to read your vendor’s SLA, and how to hold them accountable for failures. Someone else can fill this role, but this person needs to be available at all hours, since failures occur randomly, and they will need to understand how to talk to the NOC on the other end.

The Advocate

Your platform provides a service, and you have customers that rely on you. Your engineering team often becomes focused on individual systems, and repairing failures in those systems. It is useful if someone plays the role of the advocate for the service, and I think operations is a perfect fit. A typical ops engineer will be able to determine if the service is still failing, and push for a resolution within the organization. They are generally familiar with the parts of the service and who is responsible for them.

Good Nagios Parenting Avoids a Noisy Pager

Monitoring configuration is complicated, and the depth to which you can configure alerts and tests seems endless. It may seem like a waste of time to invest in some options, but others can really help you eliminate states that send hundreds of alerts. Your end goal is a configuration where every alert sent to the pager is immediately actionable and all other issues are ignored. Certain failure states, like failed switches or routers, can cause a flood of alerts since they take down the network infrastructure, and they obscure the true cause of an outage.

Defining the Right Config

The first step you can take to prevent a flood of pages is to define all your routers, switches, and other network equipment in your Nagios config. After you have that defined, you simply need to define a parent on the config object.
For example:

# Primary Switch in VRRP Group
define host {
    use         switch
    host_name   switch-1
    hostgroups  switches
}

# Secondary Switch in VRRP Group
define host {
    use         switch
    host_name   switch-2
    hostgroups  switches
}

# Server reachable through either switch
define host {
    use         server
    host_name   apache-server-1
    hostgroups  servers, www
    parents     switch-1, switch-2
}

This will configure the host apache-server-1 such that if both switch-1 and switch-2 fail, alerts for apache-server-1 will be silenced. The alerts will remain off until either switch-1 or switch-2 becomes available again.

A Few Things to Keep in Mind

Nagios is pretty smart, and can handle multiple parents so that alerts will only be silenced if both parents become unavailable.
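The rule can be sketched in a few lines of Python. This is a simplified model for illustration, not Nagios code: a host whose check fails is treated as UNREACHABLE only when it has parents and none of them passes its own host check, and as DOWN otherwise. Whether UNREACHABLE actually sends a page depends on your notification options.

```python
def host_state(host, check_ok, parents):
    """Classify a host as UP, DOWN, or UNREACHABLE.

    check_ok maps host name -> result of its host check (True if it passed).
    parents maps host name -> list of parent host names.
    """
    if check_ok[host]:
        return "UP"
    host_parents = parents.get(host, [])
    if host_parents and not any(check_ok[p] for p in host_parents):
        return "UNREACHABLE"  # every path to the host is gone
    return "DOWN"             # at least one parent is up, so the host is at fault

# Topology from the example config above.
PARENTS = {"apache-server-1": ["switch-1", "switch-2"]}
```

With one switch still up, a failed check on apache-server-1 is a real DOWN and worth a page; with both switches failed, the switches are the thing to alert on.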

The availability of parent hosts is determined by the host health check, most commonly ping. If you need some other test of availability, make sure to define this in the host object.

Parent all the objects you can, or that make sense to parent. For example, a router or transport failure at a remote data center should only send a single alert. This means you should define your routers, switches, and possibly your provider’s gateways. Do whatever you think makes sense, and take it as far as you can. Remember, your goal is to make the number of alerts manageable, so the better you define the topology the less likely you are to get a useless page, or several hundred useless pages.

Redundancy Planning, more work than adding one of everything

Since I started my career, redundancy has been featured in almost every deployment discussion. The general best practice is to add an additional element for each service tier, also known as N+1 redundancy. This approach is straightforward, but many people would actually be surprised by how often these schemes fail. In a very famous incident in San Francisco, a data center lost power in the majority of its co-location suites due to the failure of its N+2 (one better than N+1) backup power generation scheme.

Start with the Easy Things

You start by looking at each individual component in your stack and deciding whether, if it fails, it can fail independently. If you do this with your stack, you’ll generally find that pieces that scale easily horizontally have failure boundaries isolated to the system itself. For example, if a web server fails, it generally has no impact on service, because concurrency is maintained elsewhere, but it will reduce capacity. This is the easiest place to plan for, because an extra server will typically take care of the issue.
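For a horizontally scaled tier, the N+1 arithmetic is simple enough to sketch. The function names and the request-rate numbers in the note below are hypothetical; the assumption is that load spreads evenly and each server handles a known request rate.

```python
import math

def servers_needed(peak_rps, per_server_rps, spares=1):
    """N servers to carry peak load, plus spares (N+1 by default)."""
    return math.ceil(peak_rps / per_server_rps) + spares

def survives_failures(server_count, peak_rps, per_server_rps, failures=1):
    """True if the tier still carries peak load after losing some servers."""
    return (server_count - failures) * per_server_rps >= peak_rps
```

At a peak of 900 requests per second and 250 per server, N is 4 and N+1 is 5; losing one of the five still leaves capacity for 1000 requests per second.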

Now with the hard things

When you look at components such as database masters or storage nodes, the story becomes more complex. This type of equipment generally has failure boundaries that extend beyond itself. A rack full of application servers may become useless when they are no longer able to access a database for writes. You don’t truly have redundancy here until you have a scheme for fail-over. Without planning, you may be trying to figure out slave promotion at 2:32 am.

Then with the hard and really expensive things

Core infrastructure needs love too. Again, things like rack power, networks, carriers, cloud providers, and buildings have failure boundaries as well. They unfortunately extend to several portions of your stack at once. They are very difficult to plan around, and often take a significant investment to make redundant. The datacenter mentioned above used 2 spare generators as redundancy for all the co-location suites; when 3 of the primary generators failed, so did their redundancy plan. They had let each suite become dependent on all of the other suites having normal power operations.

Finally, figure out what you have to do

Once you’ve identified all of your failure boundaries, it’s time for the fun part: financial discussions! Remember, while it’s important to have backups of all data, redundancy is a financial hedge. When planning, try to figure out what the cost of downtime is, and to what extent the business is willing to fund avoiding it. It’s not uncommon that multi-datacenter redundancy would require an application change to achieve, but it’s probably not worth the investment if you have no customers. Create a target and engineer a system that meets that goal for the budget.
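A back-of-the-envelope version of that financial discussion might look like the sketch below. The model is deliberately crude and entirely an assumption: downtime cost is outages per year times hours per outage times revenue lost per hour, and redundancy is worth funding when the downtime it avoids costs more than the redundancy itself.

```python
def expected_annual_downtime_cost(outages_per_year, hours_per_outage, revenue_per_hour):
    """Rough expected yearly loss from downtime."""
    return outages_per_year * hours_per_outage * revenue_per_hour

def redundancy_pays_off(annual_redundancy_cost, cost_without, cost_with):
    """True if the downtime cost avoided exceeds what the redundancy costs."""
    return (cost_without - cost_with) > annual_redundancy_cost
```

For example, two 4-hour outages a year at 5,000 per hour cost 40,000; if fail-over cuts each outage to 15 minutes, the cost drops to 2,500, so a scheme costing 30,000 a year pays for itself while one costing 50,000 does not.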

Three Monitoring Tenets

This week I was seeing a drop in average back-end performance at work: our average page load time went from ~250ms to around 500ms. This seemed to be an intermittent problem, and we searched through our graphs at New Relic with no clear culprit. Then we started looking at our internal MediaWiki profiling collector, and some of the various aggregation tools that I put together. After a few minutes it became clear that the connection time on one of our databases had increased 1000-fold. Having recently changed the spanning-tree configuration and moved some of our cross connects to 10Gb/s, I suspected this might be a spanning-tree issue. It turns out our Ganglia daemon (gmond) on that host had consumed enough memory, due to a memory leak, to negatively affect system performance. Unfortunately this was a pretty inefficient way to find the problem, and it reminded me of a few basic tenets of monitoring.


Monitor Latency

A simple MySQL test could just tell you whether your server is up or down. Your alerts probably even have timeouts, but in most monitoring tools I’ve seen these are measured in seconds, not milliseconds. You should have your alert configured to tell you when the service you’re monitoring has degraded to an unacceptable level and may be affecting site performance. So, your simple MySQL check should time out in 3 seconds, but alert you if it has taken more than 100ms to establish a connection. Remember, if the latency is high enough your service is effectively down.
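Here is a minimal sketch of a latency-aware check. As a simplification it times a plain TCP connect rather than a full MySQL handshake, and the function names are made up for the example; the 3 second timeout and 100ms limit are the numbers from the paragraph above.

```python
import socket
import time

OK, WARNING, CRITICAL = 0, 1, 2

def connect_time_ms(host, port, timeout=3.0):
    """Return the TCP connect time in milliseconds, or None on failure/timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:  # includes timeouts and refused connections
        return None

def classify(elapsed_ms, slow_ms=100.0):
    """Turn a connect time into a Nagios-style (exit_code, message) pair."""
    if elapsed_ms is None:
        return CRITICAL, "CRITICAL - connection failed or timed out"
    if elapsed_ms > slow_ms:
        return CRITICAL, f"CRITICAL - connect took {elapsed_ms:.0f}ms (limit {slow_ms:.0f}ms)"
    return OK, f"OK - connect took {elapsed_ms:.0f}ms"
```

Something like classify(connect_time_ms("db-1", 3306)) would then alert on a slow database as well as a dead one; the host name here is a placeholder.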


Monitoring your Monitoring

Sometimes your monitoring can get out of whack. You may find that your tests are consuming so many resources that they are negatively affecting your performance. You need to define acceptable parameters for these applications, and make sure that they’re doing what you expect.


Set Your Alerts Lower than You Need

Your alerts should go off before your services are broken. Ideally this would be done with alerts on warnings, but for a good number of people warnings are too noisy. If you’re only going to alert on errors, set your threshold well below the service level you expect to provide. For example, if you have an HTTP service that you expect to answer within 100ms, and that typically answers within 25ms, your warnings should be set at something like 70ms and errors at 80ms. By alerting early, you’re preventing a call from a customer, or an angry boss.
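The example numbers above can be generalized into a small helper. Treating 70% and 80% of the service level as the warning and error points is an assumption drawn from that one example, not a rule, and the function names are hypothetical.

```python
def thresholds(sla_ms, warn_percent=70, error_percent=80):
    """Derive warning and error thresholds as percentages of the service level."""
    return sla_ms * warn_percent / 100, sla_ms * error_percent / 100

def status(response_ms, sla_ms):
    """Classify a response time against thresholds set below the service level."""
    warn_ms, error_ms = thresholds(sla_ms)
    if response_ms >= error_ms:
        return "ERROR"
    if response_ms >= warn_ms:
        return "WARNING"
    return "OK"
```

With a 100ms service level, the typical 25ms response is OK, 75ms is a warning, and 90ms is already an error, even though the service level itself has not yet been breached.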

So give these three things a try, and you should end up with a better monitoring setup.

Groovy, A Reasonable JVM Language for DevOps

I’ve worked in several environments where most of our product was run through the JVM. I’ve always used the information available to me in MBeans, but the overhead of exposing them to a monitoring system like Ganglia or Nagios has always been problematic. What I’ve been looking for is a simple JVM language that allows me to use any native object. In 2008 I found Groovy, and have used it as my JVM glue ever since.

Meet Groovy

I’ll skip going into too much detail here, but Groovy is a simple dynamically typed language for the JVM. Its syntax and style are similar to those of Ruby. You can find more information here.

Solr JMX to Ganglia Example

This is a quick script to get data from Solr’s JMX into Ganglia. The syntax is simplified, and missing the typical Java boilerplate required to run from the command line. Not only that, but because of the simplified typing model, it’s very easy for me to build strings and then execute a command. This allows me to quickly construct a script like I would in Perl or Python, but also gives me easy access to JMX, and any Java library of my choosing. The primary drawback will be the time it takes to instantiate the JVM, which you can work around if you can’t deal with the launch time.

So, if you’re working in a Java environment with a bunch of Java apps, think about giving Groovy a chance for writing some of your monitoring tests and metric collectors. It is a simpler language than Java for putting together those little applications that can tell you how your system is performing, and well within the reach of your average DevOps engineer.