Category Archives: DevOps

DevOps Linux Uncategorized

Just Enough Ops for Devs

A few weeks ago I was reading through the Chef documentation and came across the page “Just Enough Ruby for Chef”. This inspired me to put together a quick article on how much Linux a developer needs to know. I’m going to be doing this as a series, putting out one of these a week.
 

How to use and generate SSH keys

I’ve covered how to create them here, but you should know how to create, distribute, and change SSH keys. This will make it easier to discuss access to production servers with your ops team, and will likely make things easier when you use services like GitHub.
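
A quick sketch of the basic workflow (the key type, comment, and server name are just placeholders):

# generate a key pair; accept the default path and set a passphrase
ssh-keygen -t rsa -b 4096 -C "you@example.com"
# install the public key on a server you can already log in to
ssh-copy-id user@server.example.com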
 

How to use |

If you’ve used Unix for some time, you might be familiar with this. The pipe, or |, sends the output of one process to the input of another. Here’s a good example of its usage:
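
For instance, this chains three commands together, each reading the previous command’s output (the process name is arbitrary):

# list processes, keep only the sshd lines, then count them
# (the [s]shd pattern stops grep from matching its own process)
ps aux | grep '[s]shd' | wc -l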

 

How to use tar

Tar is one of those basic Unix commands that you need to know. It’s the universal archiving tool for *nix systems (similar to Zip for Windows). You should know how to create an archive and expand an archive. I’m only covering this with compression enabled; if you don’t have gzip, or don’t want it, omit the z option.
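
Something like this, assuming a hypothetical directory and archive name:

# create a gzip-compressed archive
tar -czvf backup.tar.gz /path/to/dir
# extract it again
tar -xzvf backup.tar.gz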


 

The file command

File is magic. It will look at a file and give you its best guess as to what it is. Usage is:
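
For example (file names are placeholders; the outputs in the comments are typical, not exact):

file backup.tar.gz    # reports something like "gzip compressed data"
file /usr/bin/ssh     # reports something like "ELF 64-bit LSB executable"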

 

The strings command

Ever want to read the strings from a binary file? The strings command will do this for you. Just run “strings <filename>” and you’ll get a dump of all the printable strings from that file. This is particularly useful when looking for strings in old PCAP files, or when checking whether a binary has been tampered with.
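
For example, with hypothetical file names:

strings suspicious.bin | less              # page through every printable string in a binary
strings capture.pcap | grep -i password    # hunt for interesting strings in an old PCAP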

 

How to use grep

Grep can be used to extract lines of text matching a particular pattern from a file or stream. This is a really rich command and deserves a whole article of its own. Here are some very simple use cases.

To match a pattern:
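
For instance, with a hypothetical log file:

grep 'ERROR' /var/log/app.log    # print only the lines containing ERROR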

To pull all lines not matching a pattern:
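
Again with a hypothetical log file:

grep -v 'DEBUG' /var/log/app.log    # print every line that does not contain DEBUG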

 

How to count lines in a file

The wc command will count the lines, words, and bytes in a file. The default options return all three; if you only want to count the lines, use the -l option. Here is an example:
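
For example (the file name is arbitrary):

wc -l /var/log/syslog    # prints the line count followed by the file name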

 

Count the unique occurrences of something

It might seem like it’s out of reach for bash, but you can do this with a simple one-liner. You just need to type:
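
A common form of the one-liner, using a hypothetical log file:

sort access.log | uniq -c | sort -n    # sort first so uniq -c can count each distinct line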

This counts all the unique line occurrences, and then sorts them numerically.

 

Following the output of a file with tail

Tail is a very useful command; by default it will output the last 10 lines of a file. But sometimes you want to continuously watch a file, and fortunately tail can do that too. The -f option will print new lines as they’re added. Example:
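
With a hypothetical log file:

tail -f /var/log/nginx/access.log    # prints new lines as they are appended; Ctrl-C to stop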

 

I’ll follow this up a week from now with more Linux for devs. Hopefully you found this useful.

DevOps java solr

Solr Upgrade Surprise and Using Kill To Debug It

At work, we recently upgraded to the latest and greatest stable version of Solr (3.6) and moved from the dismax query parser to the edismax parser. The initial performance of Solr was very poor in our environment, and we removed the initial set of search features we had planned to deploy while trying to get CPU utilization under control.

Once we finally rolled back a set of features, Solr seemed to be behaving optimally. Below is what we were seeing as we looked at our search servers’ CPU:
Solr CPU Usage Pre and Post Fix
Throughout the day we had periods where we saw large CPU spikes, but they didn’t really seem to affect throughput or the average latency of the server. Nonetheless we suspected there was still an issue, and started looking for a root cause.
 

Kill -3 To The Rescue

 
If you’ve never used kill -3, it’s perhaps one of the most useful Java debugging utilities around. It tells the JVM to produce a full thread dump, which it then prints to the STDOUT of the process. I became familiar with this when trying to hunt down threads in a Tomcat container that were blocking the process from exiting. Issuing kill -3 would give you enough information to find the problematic thread and work with development to fix it.
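
The mechanics are simple; the pid lookup below is just one way to find the JVM (jps ships with the JDK, and 12345 is a placeholder pid):

jps -l          # list running JVMs and their pids
kill -3 12345   # send SIGQUIT to the pid found above; the thread dump lands on its stdout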

In this case, I was hunting for a hint as to what went wrong with our search. I issued kill -3 during a spike, and got something like this:

 

Looking at the output, I noticed that we had a lot of threads calling FuzzyTermEnum. I thought this was strange, and it sounded like an expensive search method. I talked with the developer; we had expected the tilde character to be ignored by edismax, or at the very least escaped by our library, since it was included in the characters to escape. I checked the request logs, and we had people searching for exact titles that contained ~. This turned a 300ms query into a query that timed out, due to the size of our index. Further inspection of the thread dump revealed that we were also allowing * in query terms. Terms like *s ended up being equally problematic.
 

A Solr Surprise

 
We hadn’t sufficiently tested edismax, and were surprised that it still processed ~, +, ^, and * even when escaped. I didn’t find any documentation that stated this directly, but I didn’t really expect to. We double-checked our Solr library to confirm it was properly escaping the special characters in the query, but they were still being processed by Solr. On a hunch we tried double escaping the characters, which resolved the issue.
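
Purely as an illustration of the fix, the difference looked roughly like this from the command line (the host, handler, and query term are made up):

curl 'http://localhost:8983/solr/select?defType=edismax&q=some\~title'
# single escape: the ~ was still treated as the fuzzy operator and the query crawled
curl 'http://localhost:8983/solr/select?defType=edismax&q=some\\~title'
# double escape: the ~ finally reached the index as a literal character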

I’m not sure if this is a well-known problem with edismax, but if you’re seeing odd CPU spikes it is definitely worth checking for. In addition, when trying to get to the root of a tough problem, kill -3 can be a great shortcut. It saved me a bunch of painful debugging, and eliminated almost all of my guesswork.

cron DevOps Linux

What I Wish Someone Had Told Me About Writing Cron Jobs

Much like Doc Brown and Marty McFly, cron and I go way back. It is without doubt one of the single most valuable tools you can use in Linux system management, though what I’ve learned over the years is that it can be hard to write jobs that reliably produce the results I want. I wish someone had told me a few of these things when I started writing cron jobs back in the 1990s.
 

Don’t Allow Two Copies of Your Job to Run at Once

 
A common problem with cron jobs is that the cron daemon will launch a new copy of a job while the old one is still running. Sometimes this doesn’t cause a problem, but generally you expect only one copy to run at a time. If you’re using cron to control jobs that launch every 5 or 10 minutes, but only want one running at a time, it’s useful to implement some type of locking. A simple method is to use something like this:
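
A minimal sketch of the idea (the lock path and job are placeholders, and this simple version is not fully race-proof):

#!/bin/bash
LOCKFILE=/var/run/myjob.lock
if [ -e "$LOCKFILE" ]; then
    echo "previous run still in progress, exiting" >&2
    exit 1
fi
trap 'rm -f "$LOCKFILE"' EXIT
touch "$LOCKFILE"

# ... the real work goes here ...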

You can get more sophisticated using flock or other atomic locking mechanisms, but for most purposes this is good enough.
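
If you do want the atomic version, flock turns it into a one-liner in the crontab itself (paths are placeholders):

*/5 * * * * flock -n /var/run/myjob.lock /usr/local/bin/myjob.sh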
 

Sleep for a Bit

 
Ever have a cron job overload a whole server tier because logs rotate at 4am? Or gotten a complaint from someone that you were overloading their application by having 200 servers contact them at once? A quick fix is to have the job sleep for a random amount of time after being launched. For example:
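
Something along these lines, with a hypothetical job and a ten-minute window:

#!/bin/bash
# wait a random number of seconds (0-599) so 200 servers don't all start at once
sleep $(( RANDOM % 600 ))
exec /usr/local/bin/expensive-report.sh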

This does a good job of spreading out the load from expensive jobs and avoids thundering herd problems. I generally pick an interval long enough that my servers will be distributed throughout the period, but that still meets my goal. For example, I might spread an expensive once-a-day job over an hour, but a job that runs every 5 minutes may only be spread over 90 seconds. Lastly, this should only be used for things that can accept a loose time window.
 

Log It


I’ll be the first to admit I do this all the time: I hate getting emails from cron, so I redirect a job’s output to /dev/null. In general you should avoid doing this. When everything is working it isn’t a big deal, but when something goes wrong, you’ve thrown away all the data that would have told you what happened. So redirect to a log file instead, and overwrite or rotate that file.
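
The two crontab lines below contrast the habit with the fix (the job and log paths are placeholders):

# the habit: silence cron entirely and throw the output away
*/10 * * * * /usr/local/bin/sync-job.sh > /dev/null 2>&1
# the fix: keep the output somewhere you can read it after a failure
*/10 * * * * /usr/local/bin/sync-job.sh >> /var/log/sync-job.log 2>&1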

Hopefully these tips help you out, and solve some of your cron pains.

DevOps

Rally Cars and Redundancy: Understand Your Failure Boundaries

I occasionally watch rally car racing, and if you haven’t seen it before it’s worth a watch. Guys drive small cars very fast down dirt roads, while a passenger reads driving notes to the driver. Occasionally they hit rocks, run off the road, and do all sorts of damage to their cars. They do a quick assessment of the damage, determine if they can continue, and then carry on or pull out. Generally the call they are making is whether the particular failure has compromised the safety or performance of the car so much that they cannot complete the race.

If you want to build a redundant system, you’ll need to take a look at each component and ask yourself what happens if it fails. Do you know how many components it will affect? Will it bring the site down, or degrade performance so much that the site will practically be down? Will your system recover automatically, or require intervention from an operator? Think through this in a very detailed manner, and make sure you understand the components.

Enter the Scientific Method

Develop a hypothesis, and a checklist of what you expect to see during a system failure. This should include how you expect performance to degrade, what you expect to do to recover, and what alerts should be sent to your people on call. Put down the time frames in which you expect things to happen, and most importantly note any place you expect there is a single point of failure.

Break It

Once you’ve completed your list, start shutting off pieces to test your theories. Did you get all the alerts you expected in a timely manner? More importantly, did you get extra alerts you didn’t expect? These are important because they may mislead you, or obscure a failure. Did anything else fail that you didn’t expect? And lastly, did you have to do anything you didn’t expect in order to recover?

Are You a Soothsayer

Summarize the differences, and document what happened. If you got too many alerts, see if you can develop a plan to deal with them. Then document what the true failure boundaries are. For example, if your firewall fails, do you lose access to your management network? After doing all this, decide if there is anything you can do to push failure boundaries back into a single component, and whether you can minimize the effect on the rest of your system. Basic infrastructure components like switches and routers usually have surprising failure boundaries when coupled with infrastructure decisions such as putting all of your database servers in a single rack.

This process takes time, and it’s hard to do at small and medium scales. Once you have infrastructure up and running it’s difficult to run these types of tests, but you should probably still advocate for it. It’s better to discover these types of problems under controlled conditions than in the middle of the night. You may be able to test parts using virtualization, but sometimes you’ll need to pull power plugs. Concoct any type of failure you can imagine, and look for soft failures (a MySQL server losing 10% of its packets) since they are the most difficult to detect. Remember, you’ll never really have a redundant system until you understand how each component works with the other components in your system.
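
For instance, one way to fake that kind of soft failure on a test box is with tc and netem (the interface name is an assumption):

# add 10% packet loss on eth0, run through your checklist, then remove it
tc qdisc add dev eth0 root netem loss 10%
tc qdisc del dev eth0 root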

Configuration Management DevOps

Keep it Simple Sysadmin

I’ve been thinking about what I hate about my configuration management system. When I want to make a change, I seem to spend a lot of time looking at the various resources in Chef, and sometimes I end up using providers like the Opscode apache2 cookbook to manage enabling and disabling of modules. A few days ago, while in a rush, I did the following:

A few days later, a co-worker decided this was verbose and replaced it, using the previously mentioned apache2 cookbook syntax:
This seemed reasonable, and frankly is the approach I would have taken if I had taken the time to figure out whether we had a cookbook providing this resource (this was an emergency change). The only problem was that we were using a broken version of that cookbook that didn’t actually do anything (I’ve still not dug in, but I found a similar bug that was fixed some time ago). So: no errors, no warnings, and no configuration was applied or removed.

I’ve come to the decision that both are probably the wrong approach. Both of these approaches were trying to use the a2enmod command available in Ubuntu, or to provide similar functionality. It seems reasonable since it will manage dependencies, but why should I use it? The only reason would be to maintain compatibility with Ubuntu and Debian’s configuration management utilities, but I’ve already decided to outsource this to Chef, which does that pretty well. I’ve come to believe I should have just managed the module’s config files directly with Chef’s basic file resources.

Why?

The right approach to configuration management is to use the minimal features of your tool. Chef (in this case) provides awesome tools for file management, but when I use higher-level abstractions I’m actually introducing more obfuscation. The example given with the Apache module is painful, because when I look at what is really happening, I’m copying a file. Chef is really good at managing files, so why would I want to abstract that away? Do people really not understand how a2enmod works? Is it really a good thing if your operations team doesn’t know how Apache configs work?
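
For the record, on Debian and Ubuntu a2enmod is essentially just creating symlinks (“expires” here is only an example module):

# roughly what "a2enmod expires" does under the hood
ln -s /etc/apache2/mods-available/expires.load /etc/apache2/mods-enabled/expires.load
service apache2 reload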

Cron is another great example: Chef’s built-in cron resource will manage entries in a user’s crontab for you.
Do we really find this simpler than creating a cron file and having it copied to /etc/cron.d? Isn’t that why we introduced cron.d: to get out of having cron jobs installed in users’ crontabs? It’s also difficult to ensure Chef removes the job, since a user can screw up that file. Not to mention that this introduces a DSL to manage a DSL, with a more verbose syntax than the original, which seems absurd.
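
For comparison, here is a sketch of the kind of plain cron.d file a simple file resource can drop into place (the job name, schedule, and script are hypothetical):

# /etc/cron.d/log-cleanup
MAILTO=""
30 3 * * * root /usr/local/bin/log-cleanup.sh >> /var/log/log-cleanup.log 2>&1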

KISS – Keep it Simple Sysadmin

My frustration here is that for the most part we are just copying files around. The higher-level abstractions we use actually decrease clarity and understanding of how the underlying configuration is being applied. If there is a logical flaw (bug) in the implementation, we’re stuck suffering, needing to address the issue, and sorting out who is at fault. So use the simple resources and stay away from the magic ones, and you’ll actually be able to understand the configuration of your systems and hopefully get to the root of your problems faster.

DevOps Monitoring

Monitoring – The Challenge of Small Ops – Part 2

So you’re building, or have built, a web service, and you’ve got a lot of challenges ahead. You’ve got to scale software and keep customers happy. Not surprisingly, that likely involves keeping your web service up, and that typically starts with setting up some form of monitoring so you know when something goes wrong. In a large organization this may be the job of an entire team, but in a small organization it’s likely everyone’s job. So, given that everyone will tell you how much monitoring sucks, how do you make it better and hopefully manageable?

A Subject Matter Expert Election

Elect someone as your monitoring expert. I’d suggest someone with an ops background, since they have probably used several tools and are familiar with the configuration and options for this type of software. It’s important that this person is in the loop with regard to product decisions, and they need to be notified when applications are going to be deployed.

What Gets Monitored?

The simplistic answer you’ll hear over and over again is “everything”, but you’re a small organization with limited headcount. Your monitoring expert should have an idea, since they’re in the product loop. They should be making suggestions as to what metrics the application should expose, and getting the basic information from the development team for alerting (process names, ports, etc.). They should prioritize the list of tests and metrics, starting with what you need to know that the application is up, and finishing with the nice-to-haves.

Who Builds It?

Probably everyone. If that’s all you want your monitoring expert to do, then have them do it, but it’s far more efficient to spread the load. If people are having difficulty integrating with particular tools like Nagios or Ganglia, that’s where your expert steps in. They should know the ins and outs of the tools you’re using, and be able to train someone to extend a test or get a metric collected within a few minutes.

These metrics and tests should be treated as part of the product, and should follow the same process you use to get code in production. You should consider making them goals for the project, since if you wait till after shipping there will be some lag between pushing code and getting it monitored.

Wrapping It Up

It’s really not that hard; you just need someone to take a breath and plan. The part of monitoring that sucks is the human part, and since your app will not tell people when it’s up or down, you need a person to think about that for you.

Take a look at Part 1 if you liked this article.

DevOps security

Three First Pass Security Steps

I’m no security expert, but in my experience these are three simple things you can do to avoid a security incident.

3. Fix Authentication

Don’t allow users to log in to your systems with just passwords. Passwords are easy to set up and get running, but they are also easily lost. For SSH, use SSH keys at a minimum. If you have the money and time, implement a two-factor authentication system that requires a password plus a second identifier. For SSL VPNs use a cert and a password; for system logins use something like SecurID and a password.
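
As a minimal sketch, turning off password logins for SSH is a couple of sshd_config lines (apply only after keys are distributed):

# /etc/ssh/sshd_config
PasswordAuthentication no
ChallengeResponseAuthentication no
PubkeyAuthentication yes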

2. Use Prepared Statements

Wherever you can, use prepared statements. Since the statement is compiled before the input is bound, this helps cut down on the majority of SQL injection attacks. This doesn’t exempt you from catching bad SQL; it just makes it less likely to be an issue.

1. Reduce Your Attack Surface

Use a firewall, and use a host level firewall. The easiest way to prevent an attack is to lock the door, and if someone can’t reach your server via SSH they are unlikely to break in using SSH. You should make sure that you only expose the services your clients need to access your server. In addition, if you’ve done this at the edge of your network you should look at doing this on a host level as well.
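
A bare-bones host-level firewall sketch with iptables (the management network range and open ports are assumptions about a typical web server):

iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 22 -j ACCEPT   # SSH from the management network only
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
iptables -P INPUT DROP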

 

 

 

DevOps Monitoring

Monitoring Your Customers with Selenium and Nagios

In a brief conversation with Noah Sussman at DevOps Days, while discussing the challenges of continuous deployment for B2B services with SLAs, we got sidetracked into discussing the use of Selenium and Nagios in production.

A few years back, while working for a B2B company that was compensated by attributable sales, I got on a phone call early in the morning to discuss fixing a client-side display issue. The previous night, after a release, an integration engineer modified a config that stopped our service from rendering on almost every page at a single customer. The bug was fairly subtle, allowing what he was working on to display correctly but breaking every other div on the site. This was pushed in early spring, at off hours, and caught at the beginning of the day on the East Coast.

At 9am PST, we held a post-mortem with all of our engineers. We discussed the impact of the issue on our revenue, which fortunately was pretty small, and laid out the timeline. Immediately, we discussed whether this was a testing issue or a monitoring failure. The CEO came back and said that while it was understandable that we missed the failure, our goal as an ops team should be to catch any rendering issue within 5 minutes of a failure. I was a little annoyed, but agreed to build a series of tests to try to catch this.

Why Other Metrics Failed Us

We had a fairly sophisticated monitoring setup at this point. We tracked daily revenue per customer, and we would generally know within 30 minutes if we had a problem. Our customers’ sites varied, but US-only sites typically had almost no traffic between 0:00 PST/PDT and 6:00 PST/PDT; in that window it wasn’t unusual to have 0-2 sales. Once we got into a busier sales period the issue was spotted within 30 minutes and we were alerted, but during those quiet hours, it turned out, our primary monitoring metric for this type of failure was useless.

QA Tools Can Solve Ops Problems Too

I was familiar with Selenium from some acceptance tests I had helped our QA guys write. I began to put together a test suite that met my needs (I can’t provide the code for this, sorry). It consisted of:

  • rendering the main page
  • navigating to a page which we displayed content
  • clicking a link we provided
  • verifying that we displayed our content on a new page

This worked fairly well, but I had to tweak some timings to make sure the page was “visible”. I rigged this up to run through JUnit, and left a Selenium server running all the time. Every 5 minutes the test suite would execute, leaving behind a log of successes and failures. We eventually built a test suite for every sizable customer. Every 5 minutes we checked the JUnit output with a custom Nagios test that would tell us which customers had failures, and send an individual alert for each one.
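
The glue on the Nagios side can be as small as a shell plugin that reads the latest result log and exits with the right status code (the paths and log format here are hypothetical, not the original code):

#!/bin/bash
# usage: check_selenium <customer>
RESULTS=/var/log/selenium/$1.log
if [ ! -f "$RESULTS" ]; then
    echo "UNKNOWN: no selenium results for $1"; exit 3
fi
if tail -n 1 "$RESULTS" | grep -q FAIL; then
    echo "CRITICAL: selenium checks failing for $1"; exit 2
fi
echo "OK: selenium checks passing for $1"; exit 0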

Great Success!

I was really annoyed when I first had this conversation with the CEO; I thought this was a boondoggle that ops should not be responsible for. Within the first month my annoyance turned to delight as I started getting paged when our customers had site issues. I typically called them before their NOC had noticed, and most of the time these were issues they had introduced to their own site. I’d do it again in a heartbeat, and recommend that anyone else give it a try.

 

DevOps Uncategorized

The Challenge of Small Ops (Part 1)

I missed an open session at DevOps Days, and I’m really disappointed that I did, after hearing feedback from one of the conference participants. He said many people in the session were advocating for eliminating operations at the small scale.

I realize that the world is changing, and that operations teams need to adjust to it. We have skilled development teams, cloud services, and APIs for almost everything you need to build a moderately sized web service. It’s no wonder that some smaller (and some large) organizations are beginning to question the need for ops teams. So I’m going to write a series of articles discussing the operational challenges that exist for small and medium-sized teams, and how an operations expert can help solve them.

Ops as the Firefighter

In my discussion with a fellow conference-goer about this topic, when he said the general feeling was that you push responsibility to your vendors and eliminate ops, I suggested that perhaps we should think of ops as a firefighter.

Small towns still have firefighting teams, and they may be volunteers, but I’ll bet they were trained by a professional. You should think of an operations engineer as your company’s trainer, and lean on them for the knowledge that can only be gained working in an operational environment.

On-Call

Failure is the only constant for web services, and you should expect failures to happen. You will need to respond to them in a calm and organized manner, but this is likely too much for a single individual. You’ll need a better approach.

A mid-level or senior operations engineer should be able to develop an on-call schedule for you. They should be able to identify how many engineers you need on-call in order to meet any SLA response requirement. In addition they can train your engineers how to respond, and make sure any procedure is followed that you might owe to customers. They can make everyone more effective in an emergency.

Vendor Management

Amazon, Heroku, and their friends all provide excellent reliable platforms, but from time to time they fail. Vendors typically like to restrict communications to as few people as possible, since it makes it easier for them to communicate. If you’re not careful you may find yourself spreading responsibility for vendors across your organization, as individuals add new vendors.

I believe it makes more sense to consolidate that knowledge in an operations engineer. An operations engineer is used to seeing vendors fail, and will understand the workflow required to report and escalate a problem. They understand how to read your vendor’s SLA and hold them accountable for failures. Someone else can fill this role, but that person needs to be available at all hours, since failures occur randomly, and they will need to understand how to talk to the NOC on the other end.

The Advocate

Your platform provides a service, and you have customers that rely on you. Your engineering team often becomes focused on individual systems, and repairing failures in those systems. It is useful if someone plays the role of the advocate for the service, and I think operations is a perfect fit. A typical ops engineer will be able to determine if the service is still failing, and push for a resolution within the organization. They are generally familiar with the parts of the service and who is responsible for them.

DevOps Monitoring

Good Nagios Parenting Avoids a Noisy Pager

Monitoring configuration is complicated, and the depth to which you can configure alerts and tests seems endless. It may seem like a waste of time to invest in some options, but others can really help you eliminate states that send hundreds of alerts. The end goal of your configuration is that any alert sent to the pager is immediately actionable, and that all other issues are ignored. Certain failure states, like failed switches or routers, can cause a flood of alerts, since they take down the network infrastructure and obscure the true cause of an outage.

Defining the Right Config

The first step you can take to prevent a flood of pages is to define all your routers, switches, and other network equipment in your Nagios config. After you have that defined, you simply need to define a parent on each host object.
For example:

# Primary switch in VRRP group
define host {
    use         switch
    address     10.0.0.2
    host_name   switch-1
    hostgroups  switches
}

# Secondary switch in VRRP group
define host {
    use         switch
    address     10.0.0.3
    host_name   switch-2
    hostgroups  switches
}

define host {
    use         server
    address     10.0.0.100
    host_name   apache-server-1
    hostgroups  servers,www
    parents     switch-1,switch-2
}

This configures the host apache-server-1 such that if both switch-1 and switch-2 fail, alerts for it will be silenced. The alerts will remain off until either switch-1 or switch-2 becomes available again.

A Few Things to Keep in Mind

Nagios is pretty smart, and can handle multiple parents so that alerts will only be silenced if both parents become unavailable.

The availability of parent hosts is determined by the host health check, most commonly ping. If you need some other test of availability, make sure to define this in the host object.

Parent all the objects you can, or that make sense to parent. For example, a router or transport failure at a remote data center should only send a single alert. This means you should define your routers, switches, and possibly your provider’s gateways. Do whatever you think makes sense, and take it as far as you can. Remember, your goal is to make the number of alerts manageable, so the better you define the topology, the less likely you are to get a useless page, or several hundred useless pages.