hypergeometric: a place for me

  • Monitoring – The Challenge of Small Ops – Part 2

    So you’re building or have built a web service, and you’ve got a lot of challenges ahead. You’ve got to scale software and keep customers happy. Not surprisingly, that likely involves keeping your web service up, and that typically starts with setting up some form of monitoring so you know when something goes wrong. In a large organization this may be the job of an entire team, but in a small organization it’s likely everyone’s job. So, given that everyone will tell you how much monitoring sucks, how do you make it better and hopefully manageable?

    A Subject Matter Expert Election

    Elect someone as your monitoring expert. I’d suggest someone with an Ops background, since they’ve probably used several tools and are familiar with the configuration and options for this type of software. It’s important that this person is in the loop on product decisions, and they need to be notified when applications are going to be deployed.

    What Gets Monitored?

    The simplistic answer you’ll hear over and over again is “everything”, but you’re a small organization with limited headcount. Your monitoring expert should have an idea, since they’re in the product loop. They should be making suggestions as to what metrics the application should expose, and getting the basic information from the development team for alerting (process names, ports, etc.). They should prioritize the list of tests and metrics, starting with what you need to know that the application is up, and finishing with the nice-to-haves.
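
    As a rough illustration of the “is the application up” end of that priority list, here is a minimal sketch of a Nagios-style check in Python; the host, port, and timeout are made-up placeholders, not values from any real deployment.

    import socket
    import sys
    import time

    # Hypothetical values; in practice these come from the development team.
    HOST = "app01.example.com"
    PORT = 8080
    TIMEOUT_SECONDS = 3

    start = time.time()
    try:
        # A bare TCP connect answers the most basic question: is anything listening?
        with socket.create_connection((HOST, PORT), timeout=TIMEOUT_SECONDS):
            elapsed_ms = (time.time() - start) * 1000
        print("OK: %s:%d answered in %.0fms" % (HOST, PORT, elapsed_ms))
        sys.exit(0)  # Nagios exit code for OK
    except OSError as err:
        print("CRITICAL: %s:%d unreachable (%s)" % (HOST, PORT, err))
        sys.exit(2)  # Nagios exit code for CRITICAL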

    Who Builds It?

    Probably everyone. If that’s all you want your monitoring expert to do, then have them do it, but it’s far more efficient to spread the load. If people are having difficulty integrating with particular tools like Nagios or Ganglia, that’s where your expert steps in. They should know the ins and outs of the tools you’re using, and be able to train someone to extend a test, or get a metric collected, within a few minutes.

    These metrics and tests should be treated as part of the product, and should follow the same process you use to get code into production. You should consider making them goals for the project, since if you wait until after shipping there will be some lag between pushing code and getting it monitored.

    Wrapping It Up

    It’s really not that hard; you just need someone to take a breath and plan. The part of monitoring that sucks is the human part, and since your app will not tell people when it’s up or down, you need a person to think about that for you.

    Take a look at Part 1 if you liked this article.


  • Three First Pass Security Steps

    I’m no security expert, but in my experience these are three simple things you can do to avoid a security incident.

    3. Fix Authentication

    Don’t allow users to log in to your systems using just passwords. Passwords are easy to set up and get running, but they are also easily lost. For SSH, use SSH keys at a minimum. If you have the money and time, implement a two-factor authentication system that requires a password plus an identifier. For SSL VPNs use a certificate and a password; for system logins use something like SecurID and a password.

    2. Use Prepared Statements

    Wherever you can, use prepared statements. Since the statement is compiled before the input is bound, this cuts out the majority of SQL injection attacks. It doesn’t exempt you from validating input; it just makes injection less likely to be an issue.
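
    A quick sketch of the difference in Python, using the standard library’s sqlite3 module (the table and the injection string are invented for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'alice@example.com')")

    user_input = "alice@example.com' OR '1'='1"  # classic injection attempt

    # Bad: string interpolation lets the input rewrite the query.
    # conn.execute("SELECT * FROM users WHERE email = '%s'" % user_input)

    # Good: a parameterized (prepared) statement keeps the input as data.
    rows = conn.execute("SELECT * FROM users WHERE email = ?", (user_input,))
    print(rows.fetchall())  # [] -- the injection attempt matches nothing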

    1. Reduce Your Attack Surface

    Use a firewall. The easiest way to prevent an attack is to lock the door: if someone can’t reach your server via SSH, they’re unlikely to break in using SSH. Make sure that you only expose the services your clients need in order to reach your server. And once you’ve done this at the edge of your network, look at doing the same at the host level as well.



  • Monitoring Your Customers with Selenium and Nagios

    In a brief conversation with Noah Sussman at DevOps Days, while discussing the challenges of continuous deployment for B2B services with SLAs, we got sidetracked into discussing the use of Selenium and Nagios in production.

    A few years back, while working for a B2B company that was compensated on attributable sales, I got on a phone call early in the morning to discuss fixing a client-side display issue. The previous night, after a release, an integration engineer modified a config in a way that prevented our service from rendering on almost every page at a single customer. The bug was fairly subtle: it allowed what he was working on to display correctly, but broke every other div on the site. This was pushed in early spring, at off hours, and caught at the beginning of the day on the East Coast.

    At 9am PST, we held a post-mortem with all of our engineers. We discussed the impact of the issue on our revenue, which fortunately was pretty small, and laid out the timeline. Immediately, we discussed whether this was a testing issue or a monitoring failure. The CEO came back and said that while it was understandable that we missed the failure, our goal as an Ops team should be to catch any rendering issue within 5 minutes of a failure. I was a little annoyed, but agreed to build a series of tests to try to catch this.

    Why Other Metrics Failed Us

    We had a fairly sophisticated monitoring setup at this point. We tracked daily revenue per customer, and we would generally know within 30 minutes if we had a problem. Our customers ran busy sites, but US-only sites typically had almost no traffic between 00:00 and 06:00 PST/PDT; in that window it wasn’t unusual to have 0–2 sales. Once we got into a busier sales period the issue was spotted within 30 minutes, and we were alerted. During that quiet window, it turns out, our primary monitoring metric for this type of failure was useless.

    QA Tools Can Solve Ops Problems Too

    I was familiar with Selenium from some acceptance tests I helped our QA guys write. I began to put together a test suite that met my needs (I can’t provide the code for this, sorry). It consisted of:

    • rendering the main page
    • navigating to a page on which we displayed content
    • clicking a link we provided
    • verifying that we displayed our content on a new page

    This worked fairly well, but I had to tweak some timings to make sure the page was “visible”. I rigged this up to run through JUnit, and left a Selenium server running all the time. Every 5 minutes the test suite would execute, leaving behind a log of successes and failures. We eventually built a test suite for every sizable customer. Every 5 minutes we checked the JUnit output with a custom Nagios test that would tell us which customers had failures, and send an individual alert for each one.
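
    I can’t reproduce the original JUnit suite, but a rough Python equivalent of the four steps above, written as a Nagios-style check, might look something like this (the URL, link texts, and element id are placeholders, and it assumes the selenium package and a local browser driver are installed):

    import sys
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    CUSTOMER_URL = "https://customer.example.com/"  # placeholder customer site
    PARTNER_LINK_TEXT = "Our Widget"                # placeholder link we injected
    CONTENT_ID = "partner-content"                  # placeholder element id

    driver = webdriver.Firefox()
    driver.implicitly_wait(10)  # crude stand-in for the timing tweaks mentioned above
    try:
        # 1. render the main page
        driver.get(CUSTOMER_URL)
        # 2. navigate to a page on which we displayed content
        driver.find_element(By.LINK_TEXT, "Products").click()
        # 3. click the link we provided
        driver.find_element(By.LINK_TEXT, PARTNER_LINK_TEXT).click()
        # 4. verify that our content displayed on the new page
        if driver.find_element(By.ID, CONTENT_ID).is_displayed():
            print("OK: partner content rendered")
            sys.exit(0)
        print("CRITICAL: partner content present but not visible")
        sys.exit(2)
    except Exception as err:
        print("CRITICAL: %s" % err)
        sys.exit(2)
    finally:
        driver.quit()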

    Great Success!

    I was really annoyed when I first had this conversation with the CEO; I thought this was a boondoggle that Ops should not be responsible for. Within the first month my annoyance turned to delight as I started getting paged when our customers had site issues. I typically called them before their NOC had noticed, and most of the time these were issues they had introduced to their own site. I’d do it again in a heartbeat, and recommend that anyone else give it a try.



  • The Challenge of Small Ops (Part 1)

    I missed an open session at DevOps Days, and I’m really disappointed that I did after hearing feedback from one of the conference participants. He said many people in the session were advocating for eliminating operations at the small scale.

    I realize that the world is changing, and that operations teams need to adjust to the changing world. We have skilled development teams, cloud services, and APIs for almost everything you need to build a moderately sized web service. It’s no wonder that some smaller (and some large) organizations are beginning to question the need for ops teams. So, I’m going to write a series of articles discussing the challenges that exist in operations for small and medium-sized teams, and how an operations expert can help solve these issues.

    Ops as the Firefighter

    In a discussion with a fellow conference-goer about this topic, he said the general feeling was that you push responsibility to your vendors and eliminate ops; I suggested that perhaps we should instead think of ops as a firefighter.

    Small towns still have firefighting teams, and they may be volunteers, but I’ll bet they were trained by a professional. You should think of an Operations Engineer as your company’s trainer. You should lean on them for the knowledge that can only be gained by working in an operational environment.

    On-Call

    Failure is the only constant for web services, and you should expect failures to happen. You will need to respond to them in a calm and organized manner, but this is likely too much for a single individual. You’ll need a better approach.

    A mid-level or senior operations engineer should be able to develop an on-call schedule for you. They should be able to identify how many engineers you need on call in order to meet any SLA response requirement. In addition, they can train your engineers how to respond, and make sure any procedures you owe to customers are followed. They can make everyone more effective in an emergency.

    Vendor Management

    Amazon, Heroku, and their friends all provide excellent reliable platforms, but from time to time they fail. Vendors typically like to restrict communications to as few people as possible, since it makes it easier for them to communicate. If you’re not careful you may find yourself spreading responsibility for vendors across your organization, as individuals add new vendors.

    I believe it makes more sense to consolidate this knowledge in an operations engineer. An operations engineer is used to seeing vendors fail, and will understand the workflow required to report and escalate a problem. They understand how to read your vendor’s SLA and hold them accountable for failures. Someone else can fill this role, but this person needs to be available at all hours, since failures occur randomly, and they will need to understand how to talk to the NOC on the other end.

    The Advocate

    Your platform provides a service, and you have customers that rely on you. Your engineering team often becomes focused on individual systems, and repairing failures in those systems. It is useful if someone plays the role of the advocate for the service, and I think operations is a perfect fit. A typical ops engineer will be able to determine if the service is still failing, and push for a resolution within the organization. They are generally familiar with the parts of the service and who is responsible for them.


  • Good Nagios Parenting Avoids a Noisy Pager

    Monitoring configuration is complicated, and the depth to which you can configure alerts and tests seems endless. It may seem like a waste of time to invest in some options, but others can really help you eliminate states that send hundreds of alerts. The end goal of your configuration is for any alert sent to the pager to be immediately actionable, and for all other issues to be ignored. Certain failure states, like failed switches or routers, can cause a flood of alerts, since they take down network infrastructure and obscure the true cause of an outage.

    Defining the Right Config

    The first step you can take to prevent a flood of pages is to define all your routers, switches, and other network equipment in your Nagios config. Once those are defined, you simply need to set a parent on each host object.
    For example:

    # Primary Switch in VRRP Group
    define host {
            use        switch
            address    10.0.0.2
            host_name  switch-1
            hostgroups switches
    }

    # Secondary Switch in VRRP Group
    define host {
            use        switch
            address    10.0.0.3
            host_name  switch-2
            hostgroups switches
    }

    define host {
            use        server
            address    10.0.0.100
            host_name  apache-server-1
            hostgroups servers, www
            parents    switch-1, switch-2
    }

    This configures the host apache-server-1 such that if both switch-1 and switch-2 fail, alerts from that host are silenced (Nagios marks it UNREACHABLE rather than DOWN). The alerts remain off until either switch-1 or switch-2 becomes available again.

    A Few Things to Keep in Mind

    Nagios is pretty smart, and can handle multiple parents, so that alerts are only silenced if all of a host’s parents become unavailable.

    The availability of parent hosts is determined by the host health check, most commonly ping. If you need some other test of availability, make sure to define this in the host object.
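
    For example, a parent whose availability is better judged by its SSH port than by ping might look something like this (a sketch; the host name is invented, and it assumes a check_tcp command is already defined in your Nagios commands file, as it is in most stock configurations):

    define host {
            use           switch
            address       10.0.0.4
            host_name     gateway-1
            check_command check_tcp!22
    }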

    Parent all the objects you can, or at least the ones that make sense to parent. For example, a router or transport failure at a remote data center should only send a single alert. This means you should define your routers, switches, and possibly your provider’s gateways. Do whatever you think makes sense, and take it as far as you can. Remember, your goal is to make the number of alerts manageable, so the better you define the topology, the less likely you are to get a useless page, or several hundred useless pages.


  • Redundancy Planning: More Work than Adding One of Everything

    Since I started my career, redundancy has featured in almost every deployment discussion. The general best practice is to add an additional element to each service tier, also known as N+1 redundancy. This approach is straightforward, but many people would be surprised by how often these schemes fail. In a very famous incident in San Francisco, a data center lost power in the majority of its colocation suites due to the failure of its N+2 (one better than N+1) backup power generation scheme.

    Start with the Easy Things

    You start by looking at each individual component in your stack and asking: if this system fails, can it fail independently? If you do this with your stack, you’ll generally find that pieces that scale horizontally easily have failure boundaries isolated to the system itself. For example, if a web server fails, it generally has no impact on service, because concurrency is maintained elsewhere, but it will reduce capacity. This is the easiest place to plan for, because an extra server will typically take care of the issue.

    Now with the hard things

    When you look at components such as database masters or storage nodes, the story becomes more complex. This type of equipment generally has failure boundaries that extend beyond the equipment itself. A rack full of application servers may become useless when they are no longer able to reach a database for writes. You don’t truly have redundancy here until you have a scheme for fail-over. Without planning, you may find yourself trying to figure out slave promotion at 2:32 am.

    Then with the hard and really expensive things

    Core infrastructure needs love too. Things like rack power, networks, carriers, cloud providers, and buildings have failure boundaries as well, and unfortunately they extend to several portions of your stack at once. They are very difficult to plan around, and often take a significant investment to make redundant. The data center mentioned above used two spare generators as redundancy for all of its colocation suites; when three of the primary generators failed, so did the redundancy plan. They had let each suite become dependent on all of the other suites having normal power operations.

    Finally, figure out what you have to do

    Once you’ve identified all of your failure boundaries, it’s time for the fun part: financial discussions! Remember, while it’s important to have backups of all data, redundancy is a financial hedge. When planning, try to figure out what the cost of downtime is, and to what extent the business is willing to fund protection against it. It’s not uncommon for multi-datacenter redundancy to require an application change to achieve, but it’s probably not worth the investment if you have no customers. Create a target and engineer a system that meets that goal within the budget.


  • Three Monitoring Tenets

    This week I was seeing a drop in average back-end performance at work; average page load time dropped from ~250ms to around 500ms. It seemed to be an intermittent problem, and we searched through our graphs in New Relic with no clear culprit. Then we started looking at our internal MediaWiki profiling collector and some of the various aggregation tools that I put together. After a few minutes it became clear that the connection time on one of our databases had increased 1000-fold. Having recently changed the spanning-tree configuration and moved some of our cross-connects to 10Gb/s, I suspected this might have been a spanning-tree issue. It turned out our Ganglia daemon (gmond) on that host had, due to a memory leak, consumed enough memory to negatively affect system performance. Unfortunately this was a pretty inefficient way to find the problem, and it reminded me of a few basic tenets of monitoring.


    Monitor Latency

    A simple MySQL test could just tell you whether your server is up or down. Your alerts probably even have timeouts, but in most monitoring tools I’ve seen these are measured in seconds, not milliseconds. You should have your alerts configured to tell you when the service you’re monitoring has degraded to an unacceptable level and may be affecting site performance. So your simple MySQL check should time out at 3 seconds, but alert you if it takes more than 100ms to establish a connection. Remember, if the latency is high enough, your service is effectively down.
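
    A minimal sketch of that kind of latency-aware check in Python (the host, credentials, and thresholds are placeholders, and it assumes a MySQL client library such as PyMySQL is installed):

    import sys
    import time
    import pymysql

    HOST = "db01.example.com"                 # placeholder
    USER, PASSWORD = "monitor", "change-me"   # placeholder credentials
    WARN_MS, CRIT_MS = 100, 250               # alert long before the hard timeout
    TIMEOUT_SECONDS = 3                       # the check itself gives up here

    start = time.time()
    try:
        conn = pymysql.connect(host=HOST, user=USER, password=PASSWORD,
                               connect_timeout=TIMEOUT_SECONDS)
        conn.close()
    except pymysql.MySQLError as err:
        print("CRITICAL: could not connect: %s" % err)
        sys.exit(2)

    elapsed_ms = (time.time() - start) * 1000
    if elapsed_ms >= CRIT_MS:
        print("CRITICAL: connect took %.0fms" % elapsed_ms)
        sys.exit(2)
    if elapsed_ms >= WARN_MS:
        print("WARNING: connect took %.0fms" % elapsed_ms)
        sys.exit(1)
    print("OK: connect took %.0fms" % elapsed_ms)
    sys.exit(0)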


    Monitoring your Monitoring

    Sometimes your monitoring can get out of whack. You may find that your tests are consuming so many resources that they are negatively affecting your performance. You need to define acceptable parameters for these applications, and make sure they are doing what you expect.


    Set Your Alerts Lower than You Need

    Your alerts should go off before your services are broken. Ideally this would be done with alerts on warnings, but for a good number of people warnings are too noisy. If you’re only going to alert on errors, set your threshold well below the service level you expect to provide. For example, if you have an HTTP service that you expect to answer within 100ms, and it typically answers within 25ms, your warnings should be set at something like 70ms and errors at 80ms. By alerting early, you’re preventing a call from a customer, or an angry boss.
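
    In Nagios terms that might look something like the following sketch, using the standard check_http plugin’s response-time flags (the host and service names are invented, and the generic-service template is assumed to exist in your config):

    define command {
            command_name check_http_latency
            command_line $USER1$/check_http -H $HOSTADDRESS$ -u / -w 0.070 -c 0.080
    }

    define service {
            use                 generic-service
            host_name           www-1
            service_description HTTP latency
            check_command       check_http_latency
    }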

    So give these three things a try, and you should end up with a better monitoring setup.


  • Groovy, A Reasonable JVM Language for DevOps

    I’ve worked in several environments where most of our product ran on the JVM. I’ve always used the information available to me in MBeans, but the overhead of exposing them to a monitoring system like Ganglia or Nagios has always been problematic. What I’ve been looking for is a simple JVM language that allows me to use any native object. In 2008 I found Groovy, and I have used it as my JVM glue ever since.

    Meet Groovy

    I’ll skip going into too much detail here, but Groovy is a simple dynamically typed language for the JVM. Its syntax and style are similar to those of Ruby. You can find more information here.

    Solr JMX to Ganglia Example

    This is a quick script to get data from JMX into Ganglia from Solr. The syntax is simplified, and it is missing the typical Java boilerplate required to run from the command line. Not only that, but because of the simplified typing model, it’s very easy for me to build strings and then execute a command. This allows me to quickly construct a script like I would in Perl or Python, but it also gives me easy access to JMX and any Java library of my choosing. The primary drawback is the time it takes to instantiate the JVM, which is something to work around if you can’t deal with the launch time.

    So, if you’re working in a Java shop with a bunch of Java apps, think about giving Groovy a chance for writing some of your monitoring tests and metric collectors. It is a simpler language than Java for putting together those little applications that can tell you how your system is performing, and it is well within the reach of your average DevOps engineer.


  • A few things you should know about EC2

    Availability Zones are Randomized Between Accounts

    I had someone from Amazon tell me this, so I assume it to be true. In order to prevent people from gaming the system and over-allocating instances in a single AZ, zone IDs are randomized across customers. So for any two accounts, us-east-1a != us-east-1a. Amazon promises that availability zones are separate within your account; it makes no promises about keeping them consistent across accounts. If you’re using multiple accounts, don’t assume you can choose the same availability zone.

    No Instance is Single Tenant

    We all want to game the system, and I’ve heard rumors that XL and 4XL instances are single tenant, one VM per hardware instance. I’ve come to believe that no EC2 instances are single tenant, not even the cluster compute instances. It’s a fair bet that systems with 96GB+ of memory can easily be purchased, so AWS has likely been using configurations like this for the past 2+ years. It’s always possible to have a noisy neighbor; don’t assume you can buy your way out.

    Micro Instances Aren’t Good for Production Use

    If you do anything at any kind of scale, don’t use micro instances. They have variable performance, and you shouldn’t rely on them for anything.

    EBS Should only be Allocated 1TB at a Time

    This is one area where it seems you can game the system. Many people have reported that by using 1TB volumes you get better performance. The conventional wisdom is that you are allocating an entire drive, or at least most of one. So don’t skimp; over-allocate if you need EBS.


  • Configuration Management Tools Still Fall Short

    I have a gripe with almost every configuration management tool I’ve used. I’m most familiar with Chef, but I’ve used Puppet a bit, so I apologize to the fine people at Opscode in advance, since my examples will be Chef-based.

    The Cake is a Lie

    Every time I run Chef I tell myself a lie: my system will be in a known state when Chef finishes running. The spirit of the DevOps movement is that we are building repeatable processes and tools, freeing our companies from unknown, undocumented production environments, but in practice we may be making it worse.

    The One Constant is Change

    This should be a surprise to no one, but occasionally broken recipes get checked in and run. Sometimes these affect state, sometimes they don’t; it really depends on the text of your recipe. Sometimes recipes run, and are then removed. This is the natural cycle, since environments change over time. We remove and fix these recipes cavalierly, to eliminate unneeded packages, cut run times, and make our configuration management tool work.

    The Server is an Accumulator Pattern Without Scope

    What we generally forget is that servers are JavaScript: accumulators without scope. We intend for all of our changes to modify the system in a known way, but since (particularly with persistent images) we may have run several generations of scripts, we may not know our starting state. From the moment we have an instance/server/image, we are accumulating changes that our configuration management utilities rely on to operate. Long-forgotten recipes may still be haunting your server with an old package or config file that, unknowingly, you are now relying on. A new instance may be equally hard to recreate because, despite your base assumptions, every Chef run modified state, and you’ve been relying on those side effects in every run since.

    Is it your Mise en Place or Chef’s?

    Chef doesn’t clean up; it leaves that to you. You have to be the disciplined one and make sure your workspace is clean. If you have physical hardware, this is more challenging than with virtual instances, but if you persist images you can suffer from the same problems as well.

    Whats Missing?

    All of these tools lack state verification. I’d love for these tools to be transactional, but I’m realistic; that will never happen. When a run is completed, I would like to verify that some state condition is met, rather than just knowing that all my commands succeeded. Unfortunately, I’m not sure this is realistic.
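
    To make the idea concrete, here is the sort of post-run assertion I have in mind, as a rough Python sketch (the package name and config path are invented, the package check assumes a Debian/Ubuntu host, and this is a stand-in for a real verification layer, not something Chef provides):

    import subprocess
    import sys

    def package_installed(name):
        """Ask dpkg whether a package is installed (Debian/Ubuntu assumption)."""
        result = subprocess.run(["dpkg-query", "-W", "-f=${Status}", name],
                                capture_output=True, text=True)
        return result.returncode == 0 and "install ok installed" in result.stdout

    def file_contains(path, needle):
        """Verify a config file ended up with the directive we expect."""
        try:
            with open(path) as handle:
                return needle in handle.read()
        except FileNotFoundError:
            return False

    checks = [
        ("nginx package installed", package_installed("nginx")),
        ("worker_processes configured",
         file_contains("/etc/nginx/nginx.conf", "worker_processes")),
    ]

    failed = [name for name, ok in checks if not ok]
    if failed:
        print("State verification FAILED: %s" % ", ".join(failed))
        sys.exit(1)
    print("State verification passed")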

    Protect Your Neck

    So, given that we have these accumulators, my preferred solution is to zero them out: reinstall early and often, or start new images whenever you can. The only truly known state is a clean install, so when you make major changes, reinstall.

