
Infrastructure – The Challenge of Small Ops – Part 3

Infrastructure is hard to build. This is true whether you are putting together compute clusters or dealing with roads and power lines. It typically involves increases in both operating and capital expenses, and a small mistake can be quite costly.
 

Limited Resources

 
All organizations have goals. Sometimes these goals are built around reliability, and sometimes they are built around budgets, but most of the time both are important. In a large organization a few extra servers don't usually carry a material cost impact, but in a small organization one extra server can double the cost of a project. Without a large budget, some reliability goals become quite challenging.
 

Reliability

 
Engineering is the art of making things as weak as they need to be to survive. When putting together infrastructure in a small environment it's helpful to really give someone the job of reliability engineering. They should look at your application and outline what is required to provide the basic redundancy your organization needs. Then they should see how they can line up the budget and the requirements, and get you a solution that meets both your uptime needs and your pocketbook.
 

N+1

 
This term is batted around often when discussing redundancy. In a very large organization N may equal 1000, so N+1 is 1001, but in a small organization N is often 1, making N+1 equal to 2. This is a hard problem when you may only be allowed to buy two servers for a three-tier application, but you can work around it. Virtualization can really help out, but it increases the planning demands. You will need to ensure that each piece of physical equipment has the capacity to meet your needs, and that system roles are always distributed such that they can fail independently. While this sounds simple, it really needs an owner to keep track of it, just to make sure you don't lose your primary and backup service at the same time.

The second issue with N+1 redundancy is that when N equals 1 you need to plan capacity carefully. The best solution in this case is an active-passive setup. If you use an active-active setup you need to be careful not to exceed 50% of your total capacity, since a failure will remove 50% of it.
 

Wrapping it Up

 
Infrastructure is one of the harder things to get right in a small org. Take your time and think about it, and always keep an eye on both your budget and your reliability goals.

This is the final installment in this series. Take a look at Part 2 and Part 1 if you liked this article.

Solr Upgrade Surprise and Using Kill To Debug It

At work, we've recently upgraded to the latest and greatest stable version of Solr (3.6) and moved from the dismax parser to the edismax parser. Solr's initial performance in our environment was very poor, and we removed the initial set of search features we had planned to deploy while trying to get the CPU utilization under control.

Once we finally rolled back a set of features, Solr seemed to be behaving optimally. Below is what we saw on our search servers' CPU:
Solr CPU Usage Pre and Post Fix
Throughout the day we had periods where we saw large CPU spikes, but they didn't really seem to affect throughput or the average latency of the server. Nonetheless we suspected there was still an issue, and started looking for a root cause.
 

Kill -3 To The Rescue

 
If you've never used kill -3, it's perhaps one of the most useful Java debugging utilities around. It tells the JVM to produce a full thread dump, which it then prints to the STDOUT of the process. I became familiar with this when trying to hunt down threads in a Tomcat container that were blocking the process from exiting. Issuing kill -3 would give you enough information to find the problematic thread and work with development to fix it.
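For example, against a Tomcat instance it looks something like this (the process match and log path are just illustrative, adjust for your environment):

    # send SIGQUIT to the JVM; the thread dump is appended to the process's stdout log
    kill -3 $(pgrep -f org.apache.catalina.startup.Bootstrap)
    tail -n 200 /var/log/tomcat6/catalina.out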

In this case, I was hunting for a hint as to what went wrong with our search. I issued kill -3 during a spike, and got something like this:

 

Looking at the output, I noticed that we had a lot of threads calling FuzzyTermEnum. I thought this was strange, and it sounded like an expensive search method. I talked with the developer; we had expected the tilde character to be ignored by edismax, or at the very least escaped by our library, since it was included in the characters to escape. I checked the request logs, and we had people looking for exact titles that contained ~. This turned a 300ms query into a query that timed out, due to the size of our index. Further inspection of the thread dump revealed that we were also allowing the * to be used in query terms. Terms like *s ended up being equally problematic.
 

A Solr Surprise

 
We hadn't sufficiently tested edismax, and were surprised that it still processed ~, +, ^, and * when escaped. I didn't find any documentation that stated this directly, but I didn't really expect to. We double-checked that our Solr library was properly escaping the special characters in the query, but they were still being processed by Solr. On a hunch we tried double escaping the characters, which resolved the issue.

I'm not sure if this is a well known problem with edismax, but if you're seeing odd CPU spikes it is definitely worth checking for. In addition, when trying to get to the root of a tough problem, kill -3 can be a great shortcut. It saved me a bunch of painful debugging, and eliminated almost all of my guesswork.

What I Wish Someone Had Told Me About Writing Cron Jobs

Much like Doc Brown and Marty McFly, cron and I go way back. It is without doubt one of the single most valuable tools you can use in Linux system management. What I've learned over the years, though, is that it can be hard to write jobs that reliably produce the results I want. I wish someone had told me a few of these things when I started writing cron jobs back in the 1990s.
 

Don’t Allow Two Copies of Your Job to Run at Once

 
A common problem with cron jobs is that the cron daemon will launch new jobs while the old job is still running. Sometimes this doesn't cause a problem, but generally you expect only one copy to run at a time. If you're using cron to control jobs that launch every 5 or 10 minutes, but only want one to run at a time, it's useful to implement some type of locking. A simple method is to use something like this:
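(A sketch; the lock path and the job body are placeholders.)

    #!/bin/bash
    # refuse to run if a previous copy is still alive
    LOCKFILE=/var/run/myjob.pid
    if [ -e "$LOCKFILE" ] && kill -0 "$(cat "$LOCKFILE")" 2>/dev/null; then
        echo "previous run still active, exiting" >&2
        exit 1
    fi
    echo $$ > "$LOCKFILE"
    trap 'rm -f "$LOCKFILE"' EXIT

    # ... the actual work goes here ...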

You can get more complicated using flock or other atomic locking mechanisms, but for most purposes this is good enough.
 

Sleep for a Bit

 
Ever have a cron job overload a whole server tier because logs rotate at 4am? Or get a complaint from someone that you were overloading their application by having 200 servers contact them at once? A quick fix is to have the job sleep for a random time after being launched. For example:
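(The 300-second window here is just an example.)

    # wait a random 0-299 seconds so 200 servers don't all hit the same target at once
    sleep $(( RANDOM % 300 ))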

This does a good job of spreading out the load for expensive jobs and avoids thundering-herd problems. I generally pick an interval long enough that my servers are distributed throughout the period but the job still meets its goal. For example, I might spread an expensive once-a-day job over an hour, while a job that runs every 5 minutes may only be spread over 90 seconds. Lastly, this should only be used for things that can accept a loose time window.
 

Log It


I'll be the first to admit I'm guilty of this all the time: I hate getting emails from cron, so I throw the output away. In general you should avoid doing this. When everything is working it isn't a big deal, but when something goes wrong you've thrown away all the data that would tell you what happened. So redirect to a log file instead, and overwrite or rotate that file.
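For example (the schedule, user, and paths are placeholders):

    # /etc/cron.d/myjob: append output to a log you can read after a failure;
    # use > instead of >> if you only want to keep the last run
    */10 * * * * app /usr/local/bin/myjob.sh >> /var/log/myjob.log 2>&1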

Hopefully these tips help you out, and solve some of your cron pains.

6 Phone Screen Questions for an Ops Candidate

My company is hiring, and I've been thinking a lot more about what types of questions are appropriate for a phone interview while still giving me enough detail to reach a conclusion about whether the person on the other end is competent. Having been on both sides of the table, I thought I might share what I think are a few good questions.

 

1. Tell me about a monitoring test you've written.

I decided long ago that I don't want to hire anyone who has never written a monitoring test. I don't care how simple or complicated the test was, but I want to make sure they've done it. Throughout my career I've come across so many specialized pieces of code or infrastructure that I take it for granted that sooner or later you're going to need to do this. I find that the people who care about uptime do it earlier in their career. It's good to follow up with several more questions about their specific implementation, and then ask if they had any unexpected issues with the test.

 

2. How would you remove files from a directory when ‘rm *’ tells you there are too many files?

Back in the 1990s, when Solaris shipped with a version of Sendmail that was an open relay, it wasn't unusual for me to have to wipe a mail queue directory for a customer. If someone had been really aggressive sending mail to it, it wasn't too unusual to be confronted with the message that the * expansion was too long to pass to rm. I can think of a few ways to do this:

  1. for i in `ls /dir/`; do rm $i ; done
  2. find /dir/ -exec rm {} \;
  3. rm -rf /dir; mkdir /dir

And I'm sure there are plenty more. After I get the answer I like to cover whether they think there is any issue with the method they've chosen.

I like this question since it shows a candidate's understanding of how the command line works, and whether they can think around some of its limitations.

 

3. How would you set up an automated installation of Linux?

A good candidate should have done this, and they should immediately be talking about setting up FAI or Kickstart. I like to make sure they cover the base pieces of infrastructure, like DHCP, TFTP, and PXE. Generally I will follow up and ask when they think it makes sense to set up this type of automation, since it does require quite a bit of initial infrastructure.
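For instance, the DHCP side of a PXE setup looks roughly like this (addresses and filenames are examples only):

    # ISC dhcpd: point PXE clients at the TFTP server and bootloader,
    # which then loads the installer kernel for Kickstart or FAI
    subnet 10.0.0.0 netmask 255.255.255.0 {
      range 10.0.0.100 10.0.0.200;
      next-server 10.0.0.5;
      filename "pxelinux.0";
    }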

 

4. How would you go about finding files on an installed system to add to configuration management?

This question is straightforward and quick, and I'm looking for two things from the candidate. First, I want them to tell me about using the package management system to locate modified config files, and second, I want to hear them talk about asking the development team what was copied onto the system.

This question tells me they've looked for changes on systems and have a basic understanding of what the package management tools provide, but also that they know there is a human component, and that it might be quicker to ask the dev team what they installed than to build a tool to find it.

 

5. If I gave you a production system with a PHP application running through Apache, what things would you monitor?

I like using this question because it gives you an idea of the thoroughness of the candidate's thought process. The easy answer is the URL the application is running on, but I like to push candidates for more. I'm generally looking for a list like:

  • The URL of the application
  • The Port Apache is running on
  • The Apache Processes
  • PING
  • SSH
  • NTP
  • Load Average / CPU utilization
  • Memory Utilization
  • Percentage of Apache Connections used
  • Etc..

I'm looking for both the application-specific and the basic probes. I cannot tell you how many times in my career I've started a job and found out SSH wasn't monitored. Since it wasn't part of the application, people didn't think it was needed.

This question tests the candidate's attention to detail. Monitoring is an important part of any production environment, and I want candidates who state the obvious.

6. If I asked you to back up a server's local filesystem, how would you do it?

Backups are, unfortunately, the bread and butter of operations work. A candidate should really have some experience running a backup, and so they should know the basics. Unfortunately, this is a really open-ended question. There are endless ways it can be done, and that makes it a little tough on both the candidate and the interviewer. A candidate could choose to use the tar command, but they could also choose tar with an LVM snapshot, or rsync to a remote server. It's really the follow-up question that makes this worthwhile: what are the disadvantages of your method, and can you think of another way you might do this to address those issues? Again, since this is the bread and butter of operations work, they should know the strengths and weaknesses of the scheme they select, and they should know at least one alternative.

This question checks to see if a candidate has performed typical operations work, but also if they have thought through the problems with it.

 

Rally Cars and Redundancy: Understand Your Failure Boundaries

I occasionally watch rally car racing, and if you haven't seen it before it's worth a watch. Drivers take small cars very fast down dirt roads while a co-driver reads pace notes to them. Occasionally they hit rocks, run off the road, and do all sorts of damage to their cars. They do a quick assessment of the damage, determine if they can continue, and then carry on or pull out. Generally the call they are making is whether the particular failure has compromised the safety or performance of the car so much that they cannot complete the race.

If you want to build a redundant system, you'll need to take a look at each component and ask yourself what happens if it fails. Do you know how many other components it will affect? Will it bring the site down, or degrade performance so much that the site will practically be down? Will your system recover automatically, or will it require intervention from an operator? Think through this in a very detailed manner, and make sure you understand the components.

Enter the Scientific Method

Develop a hypothesis, and develop a checklist of what you expect to see during a system failure. This should include how you expect performance to degrade, what you expect to do to recover, and what alerts should be sent to your people on call. Put down the time frames in which you expect things to happen, and most importantly note any place where you expect a single point of failure.

Break It

Once you've completed your list, start shutting off pieces to test your theories. Did you get all the alerts you expected in a timely manner? More importantly, did you get extra alerts you didn't expect? These are important because they may mislead you, or obscure a failure. Did anything else fail that you didn't expect? And lastly, did you have to do anything you didn't plan for in order to recover?

Are You a Soothsayer?

Summarize the differences, and document what happened. If you got too many alerts, see if you can develop a plan to deal with them. Then document what the true failure boundaries are. For example, if your firewall fails, do you lose access to your management network? After doing all this, decide if there is anything you can do to push failure boundaries back into a single component, and whether you can minimize the effect on the rest of your system. Basic infrastructure components like switches and routers usually have surprising failure boundaries when coupled with infrastructure decisions such as putting all of your database servers in a single rack.

This process takes time, and it's hard to do at small and medium scales. Once you have infrastructure up and running it's difficult to run these types of tests, but you should still advocate for it. It's better to discover these types of problems under controlled conditions than in the middle of the night. You may be able to test parts using virtualization, but sometimes you'll need to pull power plugs. Concoct any type of failure you can imagine, and look for soft failures (a MySQL server losing 10% of its packets) since they are the most difficult to detect. Remember, you'll never really have a redundant system until you understand how each component works with the other components in your system.

Two Helpful Data Concepts

I've been batting around a couple of terms while talking about various technical solutions for years, and I've found them useful while selecting and constructing technical solutions for managing data. They've helped me both build, and provide input on, what I need from a technical solution.

1. Real-Timey-ness

When considering solutions for your data needs, you may say you need something real-time, but you're unlikely to ever find such a solution. Most solutions will incur some latency while persisting a record, and the availability of each item will be bounded by the latency of writing it. Aggregate data will be further delayed, since it depends on the latency of the first record. There are rarely any true real-time solutions; most solutions are real-timey, with some delay measured in milliseconds, seconds, minutes, or hours before the data becomes usable. In addition, most data is not actionable for some time after it becomes available; a single data point is rarely enough to make a decision.

2. Correcty-ness

Is your answer correct? This is a fundamental question in dealing with data, and most people I've dealt with make one fundamentally incorrect assumption. For most metrics, the data you collect is only mostly correct. You will lose some transactions, you will over-count others, your software will have bugs, and these will all lead to inaccuracy in your data. To improve your correcty-ness you'll likely have to trade resolution or latency in order to push yourself closer to your achievable accuracy. Correcty-ness, unlike real-timey-ness, can be improved over time: you can look at higher-resolution data and adjust your initial measurement for a time period.

A Practical Example

Scalability of your application will generally be determined by your requirements around real-timey-ness and correcty-ness. Take for example a visits counter on a web page. The most straightforward implementation would be to increment a counter for each page render and store it in a database (UPDATE counter_table SET views_count = views_count+1 WHERE page_id = 5). This solution has a high degree of accuracy, but comes with limited scalability since we'd be locking a row in order to increment its value. Furthermore, the accuracy of this solution may actually degrade as usage increases, since aborted page loads may fail to increment the counter. This solution has high real-timey-ness and correcty-ness values, at the expense of scalability.

A more complex solution would be to look at the request logs and increment the row in batches using an asynchronous process. This solution will update the counter in whatever time it takes to aggregate the last set of logs. The correcty-ness of the count at any given point will be less than that of the above solution, as will the real-timey-ness, since you will only have the answer as of the last aggregation. However, the solution will support larger request volumes, since the aggregation of requests takes place outside of the page render.
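A rough sketch of what that asynchronous pass could look like (the log field, table, and column names here are invented, and quoting is glossed over):

    # count views per page from the last rotated log, then apply them in one batch
    awk '{ print $7 }' /var/log/apache2/access.log.1 \
      | sort | uniq -c \
      | while read views page; do
          echo "UPDATE counter_table SET views_count = views_count + $views WHERE page_path = '$page';"
        done | mysql webapp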

The first solution is perfect for a small web application. The small number of requests you receive at small scale can make asynchronous solutions look broken, and the latency incurred per page render is relatively small. At small scale it's probably better to favor a higher degree of correcty-ness.

The second solution will perform much better at larger scale. Its lack of correcty-ness and real-timey-ness will be hidden by the hit counter incrementing by large numbers with each refresh. This solution would generally be called eventually consistent, but you can never really achieve consistency without looking at a fixed time window that is no longer being updated.

A Third Solution

During each page render a UDP packet could be sent to an application that increments an in-memory counter. The page could then pull this count from the secondary application and display the current count. To achieve consistency, the request logs could be aggregated on a given interval, with the result replacing the base value of the counter.
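The fire-and-forget side of that can be as small as this (the host, port, and payload format are invented for the example):

    # one datagram per render; if it gets dropped, the periodic log aggregation
    # corrects the count later
    printf 'page_id=5\n' | nc -u -w1 counter.internal 9999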

This solution has a high degree of real-timey-ness, since page views are counted immediately. However, the correcty-ness of the application will be less than that of the first solution, since the data transmission method is less reliable. This is a fairly scalable solution that balances actionable real-time data with the ability to correct measuring errors, though it is likely less scalable than the two previous solutions.

Great, What Now?

When designing an application, take the time to think about the acceptable real-timey-ness and correcty-ness. In general, high correcty-ness and real-timey-ness create slow applications at larger scale. So, when spec'ing out an application, consider assigning a real-timey-ness value to the data you present to users. I would typically define it as a time window, i.e. data presented in the UI will be at least 5 seconds old, and no more than 2 minutes. As for correcty-ness, I would define it as the acceptable accuracy within a given time period. For example, data must have an error no larger than 50% within 5 seconds, 10% within 5 minutes, and 0.1% within 7 minutes.

Deciding what these numbers should be is a different problem. You can generally improve real-timey-ness at the expense of correcty-ness, but it's hard to improve both at scale. I would generally look at the scale of the application to decide how important either is. Low-scale solutions won't generally have contention issues, so I would favor high real-timey-ness and correcty-ness. In addition, people are more likely to notice issues when the numbers increase by small amounts (e.g. the counter not going from 2 to 3 after 4 page views), and are more likely to complain about problems in the software. For large-scale solutions I would look at how long the data takes to become actionable. Being accurate within 1 to 5 minutes may be enough to help you derive a result from your data, but it may take 24 hours before you can conclude anything. Think about the amount of time this will take, and then build your specification accordingly.

Hopefully these concepts are useful, and can be put to use when designing solutions for data.

 

 

Keep it Simple Sysadmin

I've been thinking about what I hate about my configuration management system. I seem to spend a lot of time, when I want to make a change, looking at the various resources in Chef, and sometimes I end up using providers like the Opscode apache2 cookbook to manage enabling and disabling modules. A few days ago, while in a rush, I did the following:
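(Roughly; the module name and guard here are stand-ins.)

    # shell out to Ubuntu's a2enmod from Chef
    execute "a2enmod expires" do
      not_if "test -e /etc/apache2/mods-enabled/expires.load"
      notifies :restart, "service[apache2]"
    end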

A few days later, a co-worker decided this was verbose and replaced it with the previously mentioned apache2 cookbook syntax:
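(Something like the cookbook's one-liner; the module name is again a stand-in.)

    apache_module "expires"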
This seemed reasonable, and frankly is the approach I would have taken if I had taken the time to figure out that we had a cookbook providing this resource (this was an emergency change). The only problem was that we were using a broken version of that cookbook that didn't actually do anything (I've still not dug in, but I found a similar bug that was fixed some time ago). So: no errors, no warnings, and no configuration applied or removed.

I've come to the decision that both are probably the wrong approach. Both of them were trying to use the a2enmod command available in Ubuntu, or to provide similar functionality. It seems reasonable since it will manage dependencies, but why should I use it? The only reason would be to maintain compatibility with Ubuntu and Debian's configuration management utilities, but I've already decided to outsource this to Chef, which does that pretty well. I've come to believe that I should have just done this:
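(A minimal sketch; the file names are examples, and it assumes a service[apache2] resource exists elsewhere in the cookbook.)

    # just manage the file Chef is good at managing
    cookbook_file "/etc/apache2/mods-enabled/expires.load" do
      source "expires.load"
      owner  "root"
      group  "root"
      mode   "0644"
      notifies :restart, "service[apache2]"
    end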

Why?

The right approach to configuration management is to use the minimal features of your tool. Chef (in this case) provides awesome tools for file management, but when I use higher-level abstractions I'm actually introducing more obfuscation. The Apache module example is painful because, when I look at what is really happening, I'm copying a file. Chef is really good at managing files; why would I want to abstract that away? Do people really not understand how a2enmod works? Is it really a good thing if your operations team doesn't know how Apache configs work?

Cron is another great example:
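(An illustrative resource; the job itself is made up.)

    cron "expire_sessions" do
      minute  "0"
      hour    "4"
      user    "www-data"
      command "/usr/local/bin/expire_sessions.sh"
    end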
Do we really find this simpler than creating a cron file and having it copied to /etc/cron.d? Isn't that why we introduced cron.d: to get out of having cron jobs installed in users' crontabs? It's also difficult to ensure Chef removes the job, since a user can screw up that file. Not to mention that this introduces a DSL to manage a DSL, with a more verbose syntax than the original, which seems absurd.

KISS – Keep it Simple Sysadmin

My frustration here is that for the most part we are just copying files around. The higher-level abstractions we use actually decrease clarity and understanding of how the underlying configuration is being applied. If there is a logical flaw (bug) in the implementation, we're stuck suffering, needing to address the issue, and sorting out who is at fault. So use the simple resources and stay away from the magic ones, and you'll actually be able to understand the configuration of your systems and hopefully get to the root of your problems faster.

Monitoring – The Challenge of Small Ops – Part 2

So you're building, or have built, a web service; you've got a lot of challenges ahead. You've got to scale software and keep customers happy. Not surprisingly, that likely involves keeping your web service up, and that typically starts with setting up some form of monitoring for when something goes wrong. In a large organization this may be the job of an entire team, but in a small organization it's likely everyone's job. So, given that everyone will tell you how much monitoring sucks, how do you make it better and hopefully manageable?

A Subject Matter Expert Election

Elect someone as your monitoring expert. I'd suggest someone with an Ops background, since they have probably used several tools and are familiar with the configuration and options for this type of software. It's important that this person is in the loop with regard to product decisions, and they need to be notified when applications are going to be deployed.

What Gets Monitored?

The simplistic answer you'll hear over and over again is "everything", but you're a small organization with limited headcount. Your monitoring expert should have an idea, since they're in the product loop. They should be making suggestions as to what metrics the application should expose, and getting the basic information from the development team for alerting (process names, ports, etc.). They should prioritize the list of tests and metrics, starting with what you need to know that the application is up, and finishing with nice-to-haves.

Who Builds It?

Probably everyone. If that's all you want your monitoring expert to do, then have them do it, but it's far more efficient to spread the load. If people are having difficulty integrating with particular tools like Nagios or Ganglia, that's where your expert steps in. They should know the ins and outs of the tools you're using, and be able to train someone to extend a test or get a metric collected within a few minutes.

These metrics and tests should be treated as part of the product, and should follow the same process you use to get code into production. You should consider making them goals for the project, since if you wait until after shipping there will be some lag between pushing code and getting it monitored.

Wrapping It Up

It's really not that hard; you just need someone to take a breath and plan. The part of monitoring that sucks is the human part, and since your app will not tell people when it's up or down, you need a person to think about that for you.

Take a look at Part 1 if you liked this article.

Three First Pass Security Steps

I’m no security expert, but in my experience these are three simple things you can do to avoid a security incident.

3. Fix Authentication

Don't allow users to log in to your systems using only passwords. Passwords are easy to set up and get running, but they are also easily lost. For SSH, use SSH keys at a minimum. If you have the money and time, implement a two-factor authentication system that requires a password plus an identifier: for SSL VPNs use a cert and a password, and for systems use something like SecurID and a password.
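For plain SSH, that can be as simple as a couple of sshd_config lines:

    # /etc/ssh/sshd_config
    PasswordAuthentication no
    ChallengeResponseAuthentication no
    PubkeyAuthentication yes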

2. Use Prepared Statements

Wherever you can, use prepared statements. Since the statement is compiled before user input is bound, this helps cut down on the majority of SQL injection attacks. It doesn't exempt you from catching bad SQL, it just makes injection less likely to be an issue.

1. Reduce Your Attack Surface

Use a firewall, and use a host-level firewall. The easiest way to prevent an attack is to lock the door: if someone can't reach your server via SSH, they are unlikely to break in using SSH. Make sure that you only expose the services your clients need in order to use your server. And if you've done this at the edge of your network, look at doing it at the host level as well.
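A minimal host-level example (the ports and the admin network are placeholders):

    # allow only what clients actually need, then default-deny everything else inbound
    iptables -A INPUT -i lo -j ACCEPT
    iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -p tcp --dport 22 -s 10.0.0.0/8 -j ACCEPT   # SSH from the admin network only
    iptables -A INPUT -p tcp --dport 443 -j ACCEPT                # the one public service
    iptables -P INPUT DROP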

 

 

 

Monitoring Your Customers with Selenium and Nagios

In a brief conversation with Noah Sussman at DevOps Days, while discussing the challenges of continuous deployment for B2B services with SLAs, we got sidetracked into discussing using Selenium and Nagios in production.

A few years back, while working for a B2B company that was compensated by attributable sales, I got on a phone call early in the morning to discuss fixing a client-side display issue. The previous night, after a release, an integration engineer modified a config that stopped our service from rendering on almost every page at a single customer. The bug was fairly subtle, allowing what he was working on to display correctly, but breaking every other div on the site. This was pushed in early spring, at off hours, and caught at the beginning of the day on the east coast.

At 9am PST we held a post-mortem with all of our engineers. We discussed the impact of the issue on our revenue, which fortunately was pretty small, and laid out the timeline. Immediately, we discussed whether this was a testing issue or a monitoring failure. The CEO came back and said that while it was understandable that we missed the failure, our goal as an Ops team should be to catch any rendering issue within 5 minutes of a failure. I was a little annoyed, but agreed to build a series of tests to try and catch this.

Why Other Metrics Failed Us

We had a fairly sophisticated monitoring setup at this point. We tracked daily revenue per customer, and we would generally know within 30 minutes if we had a problem. Our customers' sites varied, but US-only sites typically had almost no traffic between 0:00 and 6:00 PST/PDT; in that time period it wasn't unusual to have 0-2 sales. Once we got into a busier sales period the issue was spotted within 30 minutes and we were alerted. During the quiet hours, it turned out, our primary monitoring metric for this type of failure was useless.

QA Tools Can Solve Ops Problems Too

I was familiar with Selenium from some acceptance tests I helped our QA guys write. I began to put together a test suite that met my needs (can't provide the code for this, sorry). It consisted of:

  • rendering the main page
  • navigating to a page which we displayed content
  • clicking a link we provided
  • verifying that we displayed our content on a new page

This worked fairly well, but I had to tweak some timings to make sure the page was "visible". I rigged this up to run through JUnit, and left a Selenium server running all the time. Every 5 minutes the test suite would execute, leaving behind a log of successes and failures. We eventually built a test suite for every sizable customer. Every 5 minutes we checked the output of the JUnit run with a custom Nagios test that would tell us which customers had failures and send an individual alert for each one.
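The Nagios side of that can be quite small; something in this spirit (the paths and result format are assumptions, not the original check):

    #!/bin/bash
    # scan the latest per-customer JUnit XML results and go critical
    # if any suite recorded failures or errors
    RESULTS_DIR=/var/lib/selenium-monitoring/results
    failed=$(grep -l -E 'failures="[1-9]|errors="[1-9]' "$RESULTS_DIR"/*.xml 2>/dev/null)
    if [ -n "$failed" ]; then
        echo "CRITICAL: Selenium checks failing for: $(basename -a $failed | sed 's/\.xml$//' | tr '\n' ' ')"
        exit 2
    fi
    echo "OK: all customer Selenium suites passing"
    exit 0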

Great Success!

I was really annoyed when I first had this conversation with the CEO; I thought this was a boondoggle that ops should not be responsible for. Within the first month my annoyance turned to delight, as I started getting paged when our customers had site issues. I typically called them before their NOC had noticed, and most of the time these were issues they had introduced to their own site. I'd do it again in a heartbeat, and recommend that anyone else give it a try.