
Fork Less in Bash and See Performance Wins

So, if you haven’t seen this page you should take a look. It has a whole bunch of interesting techniques you can use to manipulate strings in bash. If you end up working with bash a lot you might find yourself doing this quite a bit, since it can save a lot of time.

Let’s take a pretty typical task: stripping the domain off of an email address.

So this poorly written program will split an email address at the first @ and print the first portion:
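Something along these lines, reading a list of addresses from a file and forking cut for each one (emails.txt is just a placeholder name):

  #!/bin/bash
  # emails.txt is a stand-in for whatever list of addresses you're processing.
  # Fork a cut process for every single address, one fork per line.
  while read -r email; do
      echo "$email" | cut -d@ -f1
  done < emails.txt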

Its counterpart, which does not fork, uses a bash built-in to remove everything after the @:
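A version that skips the fork might look like this, using the ${var%%@*} parameter expansion to strip the @ and everything after it:

  #!/bin/bash
  # Same placeholder input file as above. Parameter expansion happens inside
  # bash itself, so no extra processes are forked.
  while read -r email; do
      echo "${email%%@*}"
  done < emails.txt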

So, what’s the execution difference?
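You can check for yourself with time; the exact numbers depend on your machine and the size of the input, but the gap is dramatic. Here the script names are just the two sketches above saved to files:

  # split_with_cut.sh and split_with_builtin.sh are the two examples above.
  time ./split_with_cut.sh > /dev/null
  time ./split_with_builtin.sh > /dev/null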

It’s roughly 100x faster to skip the fork.

Now, granted this is a pretty dumb example, and it’s easy to rewrite this to perform better than the bash example (i.e. don’t use a loop and just use awk, which is 3x faster than the pure bash solution). So, think about what you’re doing, use a pipe over a loop, and if you can’t do that, try to find a built-in that can take care of your needs.

Getting Unique Counts From a Log File

Two colleagues of mine ask a very similar question in interviews. The question is not particularly hard, nor does it require a lot of thought to solve, but it’s something that, as a developer or an ops person, you might find yourself needing to do. The question is: given a log file of a particular format, tell me how many times something occurs in that log file. For example, tell me the number of unique IP addresses in an access log, and the number of times each IP has visited this system.

It’s amazing how many people don’t know what to do with this. One of my peers asks people to do this using the command line; the other tells the candidate they can do it any way they want. I like this question because it’s VERY practical; I do tasks like this every day, and I expect the people I work with to be able to do them too.

A More Concrete Example

I like the shell solution because it’s basically a one-liner. So let’s walk through it using access logs as an example.

Here is a very basic sample of a common access_log I threw together for this:
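(The lines below are invented for illustration, in the common log format; the IPs and paths don’t mean anything.)

  10.0.1.15 - - [12/Jun/2012:10:15:01 -0700] "GET /index.html HTTP/1.1" 200 5213
  10.0.1.15 - - [12/Jun/2012:10:15:04 -0700] "GET /about.html HTTP/1.1" 200 1822
  192.168.2.4 - - [12/Jun/2012:10:16:12 -0700] "GET /index.html HTTP/1.1" 200 5213
  172.16.0.9 - - [12/Jun/2012:10:17:33 -0700] "GET /missing.html HTTP/1.1" 404 312
  192.168.2.4 - - [12/Jun/2012:10:18:40 -0700] "GET /missing.html HTTP/1.1" 404 312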

Let’s say you want to count the number of times each unique IP address has visited this system. Using nothing more than awk, sort, and uniq you can find the answer. What you’ll want to do is pull the first field with awk, then pipe that through sort, and then uniq. This isn’t fancy, but it returns the result very quickly without a whole lot of fuss.

Like so:
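Assuming the log is in a file called access_log:

  # Pull field 1 (the client address), sort it, and count the unique values.
  awk '{print $1}' access_log | sort | uniq -c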


This gives you each hostname or IP, and the number of times they’ve contacted this server.

Upping the Complexity


Now for something more complex: let’s say you want to get the most commonly requested document that returns a 404. Again, we can do this all in a shell one-liner. We still need awk, sort, and uniq, but this time we’ll also use tail. We can use awk to examine the status field (9) and print the URL field (7) if the status returned was 404. We can then use sort, uniq, and sort again to order the results. Finally we’ll use tail to print only the last line, and awk to print the requested document.

So here is what this looks like:
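Again assuming the log file is called access_log:

  # Print the URL for every 404, count the unique URLs, sort by count,
  # keep the last (most frequent) line, and print just the URL.
  awk '$9 == 404 {print $7}' access_log | sort | uniq -c | sort -n | tail -1 | awk '{print $2}'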

Of course there are many other ways to do this. This is a totally simple way to do it, and the best part of this is that you can count on these tools being on almost every *nix system.


Just Enough Ops for Devs

A few weeks ago I was reading through the Chef documentation and I came across the page “Just Enough Ruby for Chef”. This inspired me to put together a quick article on how much Linux a developer needs to know. I’m going to be doing this as a series, putting out one of these a week.
 

How to use and generate SSH keys

I’ve covered how to create them here, but you should know how to create, distribute, and change SSH keys. This will make it easier to discuss access to production servers with your ops team, and likely make things easier when you use services like GitHub.
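For example, generating a key pair and pushing the public half to a server might look like this (the user and host names are placeholders):

  # Generate a new key pair; accept the defaults and set a passphrase.
  ssh-keygen -t rsa -b 2048
  # Copy the public key to a remote host so you can log in with it.
  ssh-copy-id user@server.example.com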
 

How to use |

If you’ve used Unix for some time, you might be familiar with this. The pipe, or |, can be used to send the output from one process to another. Here’s a good example of its usage:
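For instance, counting how many Apache processes are running (an arbitrary example; the process name will differ on your system):

  # ps sends its output to grep, which sends its matches to wc to be counted.
  ps aux | grep '[h]ttpd' | wc -l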

 

How to use tar

Tar is one of those basic Unix commands that you need to know. It’s the universal archiving tool for *nix systems (similar to Zip for Windows). You should know how to create an archive and expand an archive. I’m only covering this with compression enabled; if you don’t have gzip, or don’t want it, omit the z option.
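For example (the directory and archive names are placeholders):

  # Create a gzip-compressed archive of a directory.
  tar -czvf backup.tar.gz /path/to/dir
  # Expand it again in the current directory.
  tar -xzvf backup.tar.gz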


 

The file command

File is magic. It will look at a file and give you its best guess as to what it is. Usage is:
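For example (the path is a placeholder):

  # file prints its best guess at what the named file contains.
  file /path/to/mystery_file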

 

The strings command

Ever want to read the strings from a binary file? The strings command will do this for you. Just run “strings <file>” and you’ll get a dump of all the printable strings from that file. This is particularly useful when looking for strings in old PCAP files, or if a binary file has been tampered with.
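For example, piping through less so the output is readable (the target binary is arbitrary):

  strings /usr/bin/ls | less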

 

How to use grep

Grep can be used to extract lines of text matching a particular pattern from a file or stream. This is a really rich command, and deserves a whole article dedicated to it. Here are some very simple use cases.

To match a pattern:
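For example, to find every line containing 'error' (the pattern and file name are placeholders):

  grep 'error' /var/log/messages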

To pull all lines not matching a pattern:
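And the inverse, using -v (same placeholder file):

  grep -v 'error' /var/log/messages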

 

How to count lines in a file

The wc command will count the lines, words, and bytes in a file. The default options will return all three; if you only want to count the lines in a file, use the -l option, which outputs just the line count. Here is an example:
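(Here access_log is just a stand-in for whatever file you want to count.)

  wc -l access_log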

 

Count the unique occurrences of something

It might seem like it’s out of the reach of bash, but you can do this with a simple one-liner. You just need to type:
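Assuming the input is in a file called somefile:

  sort somefile | uniq -c | sort -n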

This counts all the unique line occurrences, and then sorts them numerically.

 

Following the output of a file with tail

Tail is a very useful command; it will output the last 10 lines of a file by default. But sometimes you want to continuously watch a file. Fortunately tail can solve this for you. The -f option will print new lines as they’re added. Example:
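(The path below is just an example; it varies by distribution.)

  tail -f /var/log/syslog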

 

I’ll follow this up a week from now with more Linux for devs. Hopefully you found this useful.

Thanks Mr. Jobs, But It Seems I Can Use a Linux Laptop Now

So, back in 1997 I installed my first copy of FreeBSD. I had to do some major research to get X Windows up and running, and for the next computer I bought I very carefully selected a video card to make things easier. I was happy and I was able to use gcc, but getting online via a 56k modem could be a bit of a chore.

So long little devil…

In early 1998 I started using Red Hat Linux. I could play MP3s, and easily run things like RealPlayer and Mathematica. My copy of Netscape Navigator was every bit as good as my Windows copy. However, I was too young to appreciate LaTeX, and needed a word processor to write papers. I tried every word processor I could find, but alas, they all sucked. So, I had to dual boot Linux and Windows.

The Sun Also Rises

In 1999, I had a Sun SPARCstation 5. I used it for a few years with little difficulty. At the time I used mutt for email and Netscape when I needed a browser. Cut and paste was still questionable, and viewing a Word or Excel doc took more work than I cared to admit. But the world was starting to change.

The Sun Also Sets

I was getting HTML email constantly. I got more and more attachments, and my boss was asking for better calendaring. I would go to websites and get a pleasant JavaScript pop-up saying I needed IE.

By 2001, I was using Windows full time. I needed Outlook, Word, and Excel. I wasn’t wild about it, but I could get things done.

And, we have a new Contender

In the spring of 2001 I bought my first Mac. It was a beautiful Titanium PowerBook G4 running OS 9. I could run my productivity apps, connect to my Windows shares, and still SSH to any Unix system that needed my attention.

For the next 11 years I used Macs as my personal computers, and Windows PCs for work. In 2008 I got my first work Mac and found my happy place. I described it as having a Linux computer without the hassle of trying to run Linux on a laptop.

In 2010 and 2011, I still used a Mac and told my co-workers who installed Ubuntu that they were wasting their time. They suffered with wireless problems, things like Bluetooth never worked, and battery life suffered. I couldn’t understand why anyone wouldn’t want to use OS X.

Nothing is forever

Two days ago I got my Dell XPS 13 as part of a Dell beta program called Project Sputnik. I got a special version of Ubuntu with some kernel patches and some patched packages for sleep and hibernation. After an hour of struggling to make a bootable USB drive from my Mac for the Dell (it turned out to be an issue with the USB drive itself), I had a working computer. By 8pm I had my development environment set up, Chef up and running, and even my VPN was working. I was amazed.

So far it’s been good; most apps I use are web apps. I spend 70% of my time in a terminal and 30% of my time in a web browser. Honestly, it’s the perfect computer for me right now. So I’m waving goodbye to the ecosystem Mr. Jobs built, and moving to the world of Linux full time.

***EDIT***

I realized after posting the article and watching the response that I should have mentioned I received a discount from Dell for the laptop (roughly 20%).

The Pitfalls of Web Caches

At Wikia we’ve written a lot of code and used many different tools to improve performance. We have profilers, we’ve used every major Linux web server, APC, XCache, memcache, and Redis, not to mention all the work we’ve put into the application itself. But nothing gets us as much bang for our buck as our caching layer built around Varnish. We’ve used it to solve problems big and small, and for the most part it does its job and saves us hundreds of thousands of dollars per year in server costs, but that does not mean there aren’t problems. If you’re looking at using something like Varnish, here are a few things to keep in mind.

Performance Problems

When deploying your caching layer, remember you really haven’t “fixed” anything. Your page loads will improve because you’ll skip the high cost of rendering a raw page, but on a cache miss the page will be as slow as it ever was. This can be important if you pass sessioned clients through to your backend application servers, since they will suffer the slowest load times. So use caching to cut load times, but also try to fix the underlying performance problems.

Rewrites

We use a lot of rewrites. They are extremely useful for normalizing URLs for purging and for hiding ugly URLs from end users; no one wants to go to wiki/index.php?title=Foo when they can just go to wiki/Foo. Unfortunately, applying this pattern leads to very complex logic that relies on using the request URL as an accumulator. This makes for very difficult to understand code, since multiple changes can apply to one URL. Rewrites are also very difficult to test, since they are built into your caching server. Don’t forget that you’re essentially making your cache part of your application; there isn’t much difference between a rewrite in Varnish and a route in Rails.

If you have the choice, don’t put rewrites in your caching layer; they are very difficult to debug and can be very fragile. Use redirects if at all possible; they may be slower, but they are easier to figure out when something goes wrong.

Complicated Logic

Rewrites aren’t the only place you can make things complicated. It’s very easy to build complex logic in something like Varnish. You can have several levels of if statements, and several points at which you move on to the next step in document delivery. This, just like rewrites, can lead to problems that are very difficult to debug, since you may have several conditions applied to the same request.

If you find yourself doing this, ask yourself if it’s needed. You may find that while it’s fast to implement in a caching layer, you’re better off building this into the application itself. Remember that caching layers are difficult to test, and you may not know that your logic is fully working until at least one eviction cycle has passed.

Wrapping it Up

My advice for using a web caching server is easy: keep it simple. Try to keep as much logic out of it as possible, don’t ignore your performance problems, and use it for what it’s best at: caching.

Infrastructure – The Challenge of Small Ops – Part 3

Infrastructure is hard to build. This is true when putting together compute clusters, or when dealing with roads or power lines. Typically this involves both increases in operating expenses and capital expenses, and a small mistake can be quite costly.
 

Limited Resources

 
All organizations have goals. Sometimes these goals are built around reliability, and sometimes they are built around budgets, but most of the time both are important. In large organizations a few extra servers don’t usually carry a material cost impact, but in a small organization one extra server can double the cost of a project. If you don’t have a large budget, some reliability goals can be quite challenging.
 

Reliability

 
Engineering is the art of making things as weak as they can be and still survive. When putting together infrastructure in a small environment it’s helpful to explicitly give someone the job of reliability engineering. They should look at your application and outline what is required to provide the basic redundancy your organization needs. Then they should see how they can line up the budget with the requirements and get you a solution that meets your uptime needs as well as your pocketbook.
 

N+1

 
This term is batted around often when discussing redundancy. In a very large organization N may equal 1000, so N + 1 is 1001, but in a small organization N is often 1, making N + 1 equal to 2. This is a hard problem to work around when you may only be allowed to buy two servers for a three-tiered application, but you can work around it. Virtualization can really help out, but it increases the planning demands. You will need to ensure that each piece of physical equipment has the capacity to meet your needs, and that system roles are laid out such that they can fail independently. While this sounds simple, it really needs an owner to keep track of it, just to make sure you don’t lose your primary and backup service at the same time.

The second issue with N + 1 redundancy is that when N equals 1 you need to plan capacity carefully. The best solution in this case is to use an active-passive setup. If you use an active-active setup you need to be careful that you don’t exceed 50% of your total capacity, since a failure will remove half of it.
 

Wrapping it Up

 
Infrastructure is one of the harder things to get right in a small org. Take your time and think about it. Always keep an eye on your budget and your reliability goals.

This is the final installment in this series. Take a look at Part 2 and Part 1 if you liked this article.

Solr Upgrade Surprise and Using Kill To Debug It

At work, we’ve recently upgraded to the latest and greatest stable version of Solr (3.6), and moved from using the dismax parser to the edismax parser. The initial performance of Solr was very poor in our environment, and we removed the initial set of search features we had planned to deploy while trying to get CPU utilization in order.

Once we finally rolled back a set of features, Solr seemed to be behaving optimally. Below is what we were seeing when we looked at our search servers’ CPU:
[Graph: Solr CPU usage pre- and post-fix]
Throughout the day we had periods where we saw large CPU spikes, but they didn’t really seem to affect the throughput or average latency of the server. Nonetheless we suspected there was still an issue, and started looking for a root cause.
 

Kill -3 To The Rescue

 
If you’ve never used kill -3, it’s perhaps one of the most useful Java debugging utilities around. It tells the JVM to produce a full thread dump, which it prints to the STDOUT of the process. I became familiar with this when trying to hunt down threads in a Tomcat container that were blocking the process from exiting. Issuing kill -3 would give you enough information to find the problematic thread and work with development to fix it.
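The mechanics are simple; for a Tomcat-style setup it might look something like this (the pgrep pattern and log path are examples, not a prescription):

  # Send SIGQUIT to the JVM, asking it for a full thread dump.
  kill -3 $(pgrep -f org.apache.catalina)
  # The dump goes to the process's stdout, e.g. Tomcat's catalina.out.
  less /var/log/tomcat6/catalina.out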

In this case, I was hunting for a hint as to what went wrong with our search. I issued kill -3 during a spike, and got something like this:

 

Looking at the output, I noticed that we had a lot of threads calling FuzzyTermEnum. I thought this was strange, since it sounded like an expensive search method. I talked with the developer, and we had expected the tilde character to be ignored by edismax, or at the very least escaped by our library, since it was included in the characters to escape. I checked the request logs, and we had people looking for exact titles that contained ~. This turned a 300ms query into a query that timed out, due to the size of our index. Further inspection of the thread dump revealed that we were also allowing * to be used in query terms. Terms like *s ended up being equally problematic.
 

A Solr Surprise

 
We hadn’t sufficiently tested edismax, and were surprised that it still processed ~, +, ^, and * even when they were escaped. I didn’t find any documentation that stated this directly, but I didn’t really expect to. We double-checked our Solr library to confirm that it was properly escaping the special characters in the query, but they were still being processed by Solr. On a hunch we tried double-escaping the characters, which resolved the issue.

I’m not sure if this is a well-known problem with edismax, but if you’re seeing odd CPU spikes it is definitely worth checking for. In addition, when trying to get to the root of a tough problem, kill -3 can be a great shortcut. It saved me a bunch of painful debugging, and eliminated almost all of my guesswork.

What I Wish Some Had Told Me About Writing Cron Jobs

Much like Doc Brown and Marty McFly, cron and I go way back. It is without doubt one of the single most valuable tools you can use in Linux system management. What I’ve learned over the years, though, is that it can be hard to write jobs that reliably produce the results I want. I wish someone had told me a few of these things when I started writing cron jobs back in the 1990s.
 

Don’t Allow Two Copies of Your Job to Run at Once

 
A common problem with cron jobs is that the cron daemon will launch a new copy of a job while the old one is still running. Sometimes this doesn’t cause a problem, but generally you expect only one copy to run at a time. If you’re using cron to control jobs that launch every 5 or 10 minutes, but only want one running at a time, it’s useful to implement some type of locking. A simple method is to use something like this:
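A minimal sketch using a lock directory (mkdir is atomic, so only one copy can ever create it; the paths and names are arbitrary):

  #!/bin/bash
  LOCKDIR=/var/tmp/myjob.lock
  # mkdir fails if the directory already exists, so a second copy exits early.
  if ! mkdir "$LOCKDIR" 2>/dev/null; then
      echo "myjob is already running, exiting" >&2
      exit 1
  fi
  # Remove the lock when the script exits, even if it dies with an error.
  trap 'rmdir "$LOCKDIR"' EXIT

  # ... the actual work goes here ...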

You can get more complicated using flock or other atomic locking mechanisms, but for most purposes this is good enough.
 

Sleep for a Bit

 
Ever have a cron job overload a whole server tier because logs rotate at 4am? Or gotten a complaint from someone that you were overloading their application by having 200 servers contact them at once? A quick fix is to have the job sleep for a random amount of time after being launched. For example:
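Something like this at the top of the job, here spreading start times over up to five minutes:

  # Sleep a random number of seconds between 0 and 299 before doing any work.
  sleep $((RANDOM % 300))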

This does a good job of spreading the load out for expensive jobs and avoids thundering herd problems. I generally pick an interval long enough that my servers will be distributed throughout the period, but that still meets my goal. For example, I might spread an expensive once-a-day job over an hour, but a job that runs every 5 minutes may only be spread over 90 seconds. Lastly, this should only be used for things where you can accept a loose time window.
 

Log It


I’ll be the first to admit I’m guilty of this all the time: I hate getting emails from cron, so I throw the output away. In general, though, you should avoid doing that. When everything is working it isn’t a big deal, but when something goes wrong you’ve thrown away all the data that would tell you what happened. So redirect to a log file, and overwrite or rotate that file.
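A crontab entry that keeps the output around might look like this (the script and log paths are placeholders):

  # Append stdout and stderr to a log file instead of mailing or discarding it.
  */5 * * * * /usr/local/bin/myjob.sh >> /var/log/myjob.log 2>&1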

Hopefully these tips help you out, and solve some of your cron pains.

6 Phone Screen Questions for an Ops Candidate

My company is hiring, and I’ve been thinking a lot more about what types of questions are appropriate for a phone interview, yet still give me enough detail to reach a conclusion as to whether the person on the other end is competent. Having been on both sides of the table, I thought I might share what I think are a few good questions.

 

1. Tell me about a monitoring test you’ve written?

I decided long ago that I don’t want to hire anyone who has never written a monitoring test. I don’t care how simple or complicated the test was, but I want to make sure they’ve done it. Throughout my career I’ve come across so many specialized pieces of code or infrastructure that I take it for granted that sooner or later you’re going to need to do this. I find that the people who care about uptime do it earlier in their careers. It’s good to follow up with several more questions about their specific implementation, and then ask if they had any unexpected issues with the test.

 

2. How would you remove files from a directory when ‘rm *’ tells you there are too many files?

Back in the 1990s, when Solaris shipped with a version of Sendmail that was an open relay, it wasn’t unusual for me to have to wipe a mail queue directory for a customer. If someone had been really aggressive sending mail to it, it wasn’t too unusual to be confronted with the message that the * expansion was too long to pass to rm. I can think of a few ways to do this:

  1. for i in `ls /dir/`; do rm $i ; done
  2. find /dir/ -exec rm {} \;
  3. rm -rf /dir; mkdir /dir

And I’m sure there are plenty more. After I get an answer, I like to ask whether they think there are any issues with the method they’ve chosen.

I like this question since it shows a candidate’s understanding of how the command line works, and whether they can think around some of its limitations.

 

3. How would you set up an automated installation of Linux?

A good candidate should have done this, and they should immediately be talking about setting up FAI or Kickstart. I like to make sure they cover the base pieces of infrastructure, like DHCP, TFTP, and PXE. Generally I will follow up and ask when they think it makes sense to set up this type of automation, since it does require quite a bit of initial infrastructure.

 

4. How would you go about finding files on an installed system to add to configuration management?

This question is straightforward and quick, and I’m looking for two things from the candidate. First, I want them to tell me about using the package management system to locate modified config files, and second, I want to hear them talk about asking the development team what was copied onto the system.
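For example, on the package-management side an answer might involve something like this (rpm for Red Hat-style systems, debsums for Debian-style ones; treat the exact flags as a sketch):

  # List files whose contents no longer match what the package shipped (RPM).
  rpm -Va | grep '^..5'
  # List changed config files on a Debian/Ubuntu system.
  debsums -ce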

This question tells me they’ve looked for changes on systems and have a basic understanding of what the package management tools provide, but also that they know there is a human component, and that it might be quicker to ask the dev team what they installed than to build a tool to find it.

 

5. If I gave you a production system with a PHP application running through Apache, what things would you monitor?

I like using this question because it gives you an idea of the thoroughness of the candidate’s thought process. The easy answer is the URL the application is running on, but I like to push candidates for more. I’m generally looking for a list like:

  • The URL of the application
  • The Port Apache is running on
  • The Apache Processes
  • PING
  • SSH
  • NTP
  • Load Average / CPU utilization
  • Memory Utilization
  • Percentage of Apache Connections used
  • Etc..

I’m looking for both the application-specific and the basic probes. I cannot tell you how many times in my career I’ve started a job and found out SSH wasn’t monitored. Since it wasn’t part of the application, people didn’t think it was needed.

This question tests the candidate’s attention to detail. Monitoring is an important part of any production environment, and I want candidates who state the obvious.

6. If I asked you to back up a server’s local filesystem, how would you do it?

Backups are, unfortunately, the bread and butter of operations work. A candidate should really have some experience running a backup, so they should know the basics. This is a really open-ended question; there are endless ways it can be done, and that makes it a little tough on both the candidate and the interviewer. One example a candidate could choose would be to use the tar command, but they could also choose to use tar with an LVM snapshot, or they could use rsync to a remote server. It’s really the follow-up question that makes this worthwhile: what are the disadvantages of your method, and can you think of another way you might do this to address those issues? Again, since it’s the bread and butter of operations work, they should know the strengths and weaknesses of the scheme they select, and they should know at least one alternative.
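For example, the plain tar answer and the rsync answer might look roughly like this (the paths, excludes, and remote host are invented for illustration):

  # A compressed tar of the root filesystem, skipping pseudo-filesystems.
  tar -czpf /backup/root-$(date +%F).tar.gz --exclude=/proc --exclude=/sys --exclude=/backup /
  # Or copy the filesystem to a remote backup server with rsync.
  rsync -aHx --delete / backupuser@backuphost:/backups/$(hostname)/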

This question checks to see if a candidate has performed typical operations work, but also if they have thought through the problems with it.

 

Rally Cars and Redundancy: Understand Your Failure Boundaries

I occasionally watch rally car racing, and if you haven’t seen it before it’s worth a watch. Drivers take small cars very fast down dirt roads while a co-driver reads pace notes to them. Occasionally they hit rocks, run off the road, and do all sorts of damage to their cars. They do a quick assessment of the damage, determine if they can continue, and then carry on or pull out. Generally the call they are making is whether the particular failure has compromised the safety or performance of the car so much that they cannot complete the race.

If you want to build a redundant system, you’ll need to take a look at each component and ask yourself what happens if it fails. Do you know how many other components it will affect? Will it bring the site down, or degrade performance so much that the site will practically be down? Will your system recover automatically, or will it require intervention from an operator? Think through this in a very detailed manner, and make sure you understand the components.

Enter the Scientific Method

Develop a hypothesis, and develop a checklist of what you expect to see during a system failure. This should include how you expect performance to degrade, what you expect to do to recover, and what alerts should be sent to your people on call. Put time frames on when you expect things to happen, and most importantly, note any place you expect there is a single point of failure.

Break It

Once you’ve completed your list, start shutting off pieces to test your theories. Did you get all the alerts you expected in a timely manner? More importantly, did you get extra alerts you didn’t expect? These are important because they may mislead you or obscure a failure. Did anything else fail that you didn’t expect? And lastly, did you have to do anything you didn’t expect in order to recover?

Are You a Soothsayer?

Summarize the differences, and document what happened. If you got too many alerts, see if you can develop a plan to deal with them. Then document what the true failure boundaries are. For example, if your firewall fails, do you lose access to your management network? After doing all this, decide if there is anything you can do to push failure boundaries back into a single component, and whether you can minimize the effect on the rest of your system. Basic infrastructure components like switches and routers usually have surprising failure boundaries when coupled with infrastructure decisions such as putting all of your database servers in a single rack.

This process takes time, and it’s hard to do at small and medium scales. Once you have infrastructure up and running, it’s difficult to run these types of tests, but you should still advocate for it. It’s better to discover these types of problems under controlled conditions than in the middle of the night. You may be able to test parts using virtualization, but sometimes you’ll need to pull power plugs. Concoct any type of failure you can imagine, and look for soft failures (a MySQL server losing 10% of its packets) since they are the most difficult to detect. Remember, you’ll never really have a redundant system until you understand how each component works with the other components in your system.