Category Archives: Uncategorized


Fork Less in Bash and See Performance Wins

So, if you haven’t seen this page you should take a look. It has a whole bunch of interesting techniques you can use to manipulate strings in bash. If you end up working with bash a lot you might find yourself doing this quite a bit, since it can save a lot of time.

Let's take a pretty typical task: stripping the domain off an email address.

So this poorly written program will split an email address at the first @ and print the first portion:
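A minimal sketch of what such a forking script looks like (the sample addresses are invented for illustration):

```shell
#!/bin/bash
# Strip the domain by forking a subshell plus cut(1) for every address.
emails="alice@example.com bob@example.org"   # illustrative data
for email in $emails; do
  echo "$email" | cut -d@ -f1                # forks per iteration
done
```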

Its counterpart, which does not fork, uses a bash built-in to remove everything after the @:
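A sketch of the fork-free version, using the `${var%%@*}` parameter expansion (same invented sample data):

```shell
#!/bin/bash
# Strip the domain with a parameter-expansion built-in -- no forks at all.
emails="alice@example.com bob@example.org"   # illustrative data
for email in $emails; do
  echo "${email%%@*}"   # delete the longest suffix matching @*
done
```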

So, what's the execution difference?
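One rough way to measure it yourself, using the shell's `time` keyword (the iteration count is arbitrary):

```shell
#!/bin/bash
email="alice@example.com"

# Forking version: a subshell and a cut process on every iteration.
time for ((i = 0; i < 1000; i++)); do
  echo "$email" | cut -d@ -f1 > /dev/null
done

# Built-in version: pure parameter expansion, no forks.
time for ((i = 0; i < 1000; i++)); do
  echo "${email%%@*}" > /dev/null
done
```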

It's about 100x faster to skip the fork.

Now, granted this is a pretty dumb example, and it's easy to rewrite this to perform better than the pure bash example (i.e. don't use a loop and just use awk, which is 3x faster than the pure bash solution). So, think about what you're doing, use a pipe over a loop, and if you can't do that, try to find a built-in that can take care of your needs.
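The pipe-over-loop alternative mentioned above might look like this (emails.txt is a hypothetical input file, created here for the demo):

```shell
printf 'alice@example.com\nbob@example.org\n' > emails.txt   # demo data

# One awk process handles the whole file -- no per-line forks.
awk -F@ '{print $1}' emails.txt
```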


Getting Unique Counts From a Log File

Two colleagues of mine ask a very similar question in interviews. The question is not particularly hard, nor does it require a lot of thought to solve, but it's something that, as a developer or an ops person, you might find yourself needing to do. The question is: given a log file of a particular format, tell me how many times something occurs in that log file. For example, tell me the number of unique IP addresses in an access log, and the number of times each IP has visited this system.

It's amazing how many people don't know what to do with this. One of my peers asks people to do this using the command line; the other tells the candidate they can do it any way they want. I like this question because it's VERY practical; I do tasks like this every day, and I expect the people I work with to be able to do them too.

A More Concrete Example

I like the shell solution, because it's basically a one-liner. So let's walk through it using access logs as an example.

Here is a very basic sample of a common access_log I threw together for this:
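Something like the following common-log-format lines (hosts, dates, and paths all invented):

```
10.0.0.1 - - [12/Jun/2012:10:15:01 -0700] "GET /index.html HTTP/1.1" 200 4523
10.0.0.2 - - [12/Jun/2012:10:15:03 -0700] "GET /about.html HTTP/1.1" 200 2134
10.0.0.1 - - [12/Jun/2012:10:15:09 -0700] "GET /missing.html HTTP/1.1" 404 287
10.0.0.3 - - [12/Jun/2012:10:16:22 -0700] "GET /missing.html HTTP/1.1" 404 287
```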

Let's say you want to count the number of times each unique IP address has visited this system. Using nothing more than awk, sort, and uniq you can find the answer. What you'll want to do is pull the first field with awk, then pipe that through sort, and then uniq. This isn't fancy, but it returns the result very quickly without a whole lot of fuss.

Like so:
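A sketch of that pipeline (access_log is a stand-in name, created here with demo data):

```shell
# Demo log: field 1 is the client host.
printf '10.0.0.1 ...\n10.0.0.2 ...\n10.0.0.1 ...\n' > access_log

# sort groups duplicate hosts together so uniq -c can count each one.
awk '{print $1}' access_log | sort | uniq -c
```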

This gives you each hostname or IP, and the number of times they’ve contacted this server.

Upping the Complexity

Now for something more complex: let's say you want to get the most commonly requested document that returns a 404. Again, we can do this all in a shell one-liner. We still need awk, sort, and uniq, but this time we'll also use tail. This time we can use awk to examine the status field (9), then print the URL field (7) if the status returned was 404. We can then use sort, uniq, and sort again to order the results. Finally we'll use tail to print only the last line, and awk to print the requested document.

So here is what this looks like:
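A sketch of that one-liner, again assuming common log format with the status in field 9 and the URL in field 7 (demo data created inline):

```shell
# Demo access_log in common log format.
cat > access_log <<'EOF'
10.0.0.1 - - [12/Jun/2012:10:15:01 -0700] "GET /a.html HTTP/1.1" 404 287
10.0.0.2 - - [12/Jun/2012:10:15:03 -0700] "GET /b.html HTTP/1.1" 404 287
10.0.0.1 - - [12/Jun/2012:10:15:09 -0700] "GET /a.html HTTP/1.1" 404 287
EOF

# Count 404'd URLs, sort by count, keep the most frequent, print its URL.
awk '$9 == 404 {print $7}' access_log \
  | sort | uniq -c | sort -n | tail -1 | awk '{print $2}'   # -> /a.html
```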

Of course there are many other ways to do this. This is a totally simple way to do it, and the best part of this is that you can count on these tools being on almost every *nix system.


Just Enough Ops for Devs

A few weeks ago I was reading through the Chef documentation and I came across the page “Just Enough Ruby for Chef”. This inspired me to put together a quick article on how much Linux a developer needs to know. I'm going to be doing this as a series, and putting out one of these a week.

How to use and generate SSH keys

I've covered how to create them here, but you should know how to create, distribute, and change ssh keys. This will make it easier to discuss access to production servers with your ops team, and likely make it easier when you use things like GitHub.
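A minimal sketch of the basics (the key size, comment, output path, and remote hostname are all illustrative; the empty passphrase is only to keep the example non-interactive):

```shell
# Generate a 4096-bit RSA key pair into ./demo_key and ./demo_key.pub.
rm -f ./demo_key ./demo_key.pub
ssh-keygen -t rsa -b 4096 -N "" -C "you@example.com" -f ./demo_key

# Then install the public key on a remote host (hypothetical hostname):
# ssh-copy-id -i ./demo_key.pub user@server.example.com
```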

How to use |

If you've used unix for some time, you might be familiar with this. The pipe, or |, can be used to send the output from one process to another. Here's a good example of its usage:
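A simple stand-in example:

```shell
# ls writes the directory listing to stdout; the pipe hands it to wc -l,
# which counts the lines, i.e. the number of entries in /etc.
ls /etc | wc -l
```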


How to use tar

Tar is one of those basic unix commands that you need to know. It's the universal archiving tool for *nix systems (similar to Zip for Windows). You should know how to create an archive and expand an archive. I'm only covering this with compression enabled; if you don't have gzip, or don't want it, omit the z option.
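The usual create/list/extract invocations look like this (mydir and backup.tar.gz are placeholder names, with demo data created first):

```shell
mkdir -p mydir && echo "hello" > mydir/notes.txt   # demo data

# c = create, z = gzip, f = archive filename
tar -czf backup.tar.gz mydir/

# t = list the contents without extracting
tar -tzf backup.tar.gz

# x = extract into the current directory
tar -xzf backup.tar.gz
```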


The file command

File is magic. It will look at a file and give you its best guess as to what it is. Usage is:
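For example (the targets are just common files; exact wording of the output varies by platform):

```shell
file /bin/ls       # reports an executable binary
file /etc/hosts    # reports something like "ASCII text"
```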


The strings command

Ever want to read the strings from a binary file? The strings command will do this for you. Just run “strings <file>” and you'll get a dump of all the printable strings from that file. This is particularly useful when looking for strings in old PCAP files, or if a binary file has been tampered with.
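A quick sketch (the target binary is arbitrary):

```shell
# Dump every printable string found in a binary.
strings /bin/ls

# Usually you'll pipe it through head or grep to make it manageable:
strings /bin/ls | head
```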


How to use grep

Grep can be used to extract lines of text from a file or stream matching a particular pattern. It's a really rich command, and deserves a whole article of its own. Here are some very simple use cases.

To match a pattern:
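For instance (demo.log is a hypothetical file, created here for the example):

```shell
printf 'ERROR: disk full\nall good\n' > demo.log   # demo data

# Print every line containing "error", ignoring case:
grep -i error demo.log   # -> ERROR: disk full
```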

To pull all lines not matching a pattern:
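For instance (again with a hypothetical demo.log):

```shell
printf 'debug: noisy\ninfo: started\n' > demo.log   # demo data

# -v inverts the match: print every line NOT containing "debug":
grep -v debug demo.log   # -> info: started
```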


How to count lines in a file

The wc command will count the lines, words, and bytes in a file. The default options return all three; if you only want to count the lines in a file, use the -l option, which outputs only the line count. Here is an example:
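A stand-in example (the target file is arbitrary):

```shell
wc -l /etc/passwd    # prints the line count followed by the filename
```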


Count the unique occurrences of something

It might seem like it's out of reach for bash, but you can do this with a simple one-liner. You just need to type:
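A sketch of the pipeline (fruit.txt is a placeholder, created here with demo data):

```shell
printf 'apple\nbanana\napple\napple\n' > fruit.txt   # demo data

# sort groups identical lines, uniq -c counts each group,
# and the final sort -n orders the counts numerically:
sort fruit.txt | uniq -c | sort -n   # banana once, apple three times
```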

This counts all the unique line occurrences, and then sorts them numerically.


Following the output of a file with tail

Tail is a very useful command; it will output the last 10 lines of a file by default. But sometimes you need to continuously watch a file. Fortunately tail can solve this for you. The -f option will print new lines as they're added. Example:
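In practice you'd run something like `tail -f /var/log/messages` and hit Ctrl-C when done; the self-contained demo below backgrounds the tail so it can stop itself:

```shell
echo "first line" > demo.log
tail -f demo.log &               # follow the file in the background
TAIL_PID=$!
sleep 1
echo "second line" >> demo.log   # appears in the tail output immediately
sleep 1
kill $TAIL_PID                   # stop following
```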


I'll follow this up a week from now with more Linux for devs. Hopefully you found this useful.


Thanks Mr. Jobs, But it Seems I Can Use a Linux Laptop Now

So, back in 1997 I installed my first copy of FreeBSD. I had to do some major research to get X Windows up and running, and for the next computer I bought I very carefully selected a video card to make things easier. I was happy, I was able to use gcc, but getting online via 56k modem could be a bit of a chore.

So long little devil…

In early 1998 I started using RedHat Linux. I could play mp3s, and easily run things like RealPlayer and Mathematica. My copy of Netscape Navigator was every bit as good as my Windows copy. However, I was too young to appreciate LaTeX, and needed a word processor to write papers. I tried every word processor I could find, but alas they all sucked. So, I had to dual boot Linux and Windows.

The Sun Also Rises

In 1999, I had a Sun Sparc 5 Workstation. I used it for a few years with little difficulty. At the time I used mutt for email, and Netscape when I needed a browser. Cut and paste was still questionable, and viewing a Word or an Excel doc took more work than I cared to admit. But the world was starting to change.

The Sun Also Sets

I was getting HTML email, constantly. I got more and more attachments, and my boss was asking for better calendaring. I would go to websites, and get a pleasant Javascript pop-up saying I needed IE.

By 2001, I was using Windows full time. I needed Outlook, Word, and Excel. I wasn’t wild about it, but I could get things done.

And, we have a new Contender

In spring of 2001 I bought my first Mac. It was a beautiful Titanium PowerBook G4 running OS 9. I could run my productivity apps, connect to my Windows shares, and still ssh to any unix system that needed my attention.

For the next 11 years I used Macs for a personal computer, and I used windows PCs for work. In 2008 I got my first work Mac and I found my happy place. I described it as having a linux computer without the hassle of trying to run linux on a laptop.

In 2010 and 2011, I still used a Mac and told my co-workers who installed Ubuntu that they were wasting their time. They suffered from wireless problems, things like bluetooth never worked, and battery life suffered. I couldn't understand why anyone wouldn't want to use OS X.

Nothing is forever

Two days ago I got my Dell XPS 13 as part of a Dell beta program called Project Sputnik. I got a special version of Ubuntu, with some kernel patches and some patched packages for sleep and hibernation. After an hour of struggling to make a bootable USB drive from my Mac for my Dell (it turns out it was an issue with the USB drive), I had a working computer. By 8pm I had my development environment set up, I had chef up and running, and even my VPN was working. I was amazed.

So far it's been good; most apps I use are web apps. I spend 70% of my time in a terminal, and 30% of my time in a web browser. Honestly it's the perfect computer for me right now. So, I'm waving goodbye to the ecosystem Mr. Jobs built, and moving to the world of linux full time.


I realized after posting the article and watching the response that I should have mentioned I received a discount from Dell for the laptop (roughly 20%).


Infrastructure – The Challenge of Small Ops – Part 3

Infrastructure is hard to build. This is true when putting together compute clusters, or when dealing with roads or power lines. Typically this involves both increases in operating expenses and capital expenses, and a small mistake can be quite costly.

Limited Resources

All organizations have goals. Sometimes these goals are built around reliability, and sometimes they are built around budgets, but most of the time both are important. In large organizations a few extra servers don't usually carry a material cost impact, but in a small organization one extra server can double the cost of a project. If you're missing a large budget, it can make some reliability goals quite challenging.


Engineering is the art of making things as weak as they need to be to survive. When putting together infrastructure in a small environment it's helpful to really give someone the job of reliability engineering. They should look at your application, and outline what is required to provide the basic redundancy your organization needs. Then they should see how they can line up the budget and the requirements, and get you a solution that meets your up-time needs as well as your pocketbook.


This is batted around often when discussing redundancy. In a very large organization N may equal 1000, so N + 1 is 1001, but in a small organization N is often 1, making N + 1 equal to 2. This is often a hard problem to work around when you may only be allowed to buy 2 servers for a three-tiered application, but you can work around it. Virtualization can really help out, but it increases the planning demands. You will need to ensure that you have the capacity in each piece of physical equipment to meet your needs, and that system roles are always laid out such that they can fail independently. While this sounds simple, it really needs an owner to keep track of it, just to make sure you don't lose your primary and backup service at the same time.

The second issue with N + 1 redundancy: when N equals 1 you need to plan capacity carefully. The best solution in this case is to use an active-passive setup. If you use an active-active setup you need to be careful that you don't exceed 50% of your total capacity, since a failure will remove 50% of it.

Wrapping it Up

Infrastructure is one of the harder things to get right in a small org. Take your time and think about it. Always keep an eye on your budget and reliability goals.

This is the final installment in this series. Take a look at Part 2 and Part 1 if you liked this article.


6 Phone Screen Questions for an Ops Candidate

My company is hiring, and I've been thinking a lot more about what types of questions are appropriate for a phone interview, but still give enough detail to lead me to a conclusion as to whether the person on the other end is competent. Having been on both sides of the table, I thought I might share what I think are a few good questions.


1. Tell me about a monitoring test you’ve written?

I decided long ago that I don't want to hire anyone who has never written a monitoring test. I don't care how simple or complicated the test was, but I want to make sure they've done it. Throughout my career, I've come across so many specialized pieces of code or infrastructure that I take it for granted that sooner or later you're going to need to do this. I find that the people who care about uptime do it earlier in their career. It's good to follow up with several more questions about their specific implementation, and then ask if they had any unexpected issues with the test.


2. How would you remove files from a directory when ‘rm *’ tells you there are too many files?

Back in the 1990’s when Solaris shipped with a version of Sendmail that was an open-relay, it wasn’t unusual for me to have to wipe a mail queue directory for a customer. If someone had been really aggressive sending mail to it, it wasn’t too unusual to be confronted with the message that the * expansion was too long to pass to rm.  I can think of a few ways to do this:

  1. for i in `ls /dir/`; do rm $i ; done
  2. find /dir/ -exec rm {} \;
  3. rm -rf /dir; mkdir /dir

And I’m sure there are plenty more. After I get the answer I like to cover if they think there is any issue with the method they’ve chosen.

I like this question since it shows a candidate's understanding of how the command line works, and whether they can think around some of its limitations.


3. How would you set up an automated installation of Linux?

A good candidate should have done this, and they should immediately be talking about setting up FAI or Kickstart. I like to make sure they cover the base pieces of infrastructure, like DHCP, TFTP, and PXE. Generally I will follow up and ask when they think it makes sense to set up this type of automation, since it does require quite a bit of initial infrastructure.


4. How would you go about finding files on an installed system to add to configuration management?

This question is straight forward and quick, and I’m looking for two things from the candidate. First, I want them to tell me about using the package management system to locate modified config files, and second I want to hear them tell me about talking to the development team as to what was copied on the system.

This question tells me they've looked for changes on systems, and have a basic understanding of what the package management tools provide. But also that they know there is a human component, and it might be quicker to ask the dev team what they installed than to build a tool to find it.


5. If I gave you a production system with a PHP application running through Apache, what things would you monitor?

I like using this question because it gives you an idea of the thoroughness of the candidate's thought process. The easy answer is the URL the application is running on, but I like to push candidates for more. I'm generally looking for a list like:

  • The URL of the application
  • The Port Apache is running on
  • The Apache Processes
  • PING
  • SSH
  • NTP
  • Load Average / CPU utilization
  • Memory Utilization
  • Percentage of Apache Connections used
  • Etc..

I'm looking for both the application-specific and the basic probes. I cannot tell you how many times in my career I've started a job and found out SSH wasn't monitored. Since it wasn't part of the application, people didn't think it was needed.

This question tests the candidate's attention to detail. Monitoring is an important part of any production environment, and I want candidates who state the obvious.

6. If I asked you to back up a server's local filesystem, how would you do it?

Backups are, unfortunately, the bread and butter of operations work. A candidate should really have some experience running a backup, so they should know the basics. Unfortunately, this is a really open-ended question. There are endless ways it can be done, and that makes it a little tough on both the candidate and the interviewer. One example a candidate could choose would be to use the tar command, but they could also choose to use tar with an LVM snapshot, or they could use rsync to a remote server. It's really the follow-up question that makes this worthwhile: what are the disadvantages of your method, and can you think of another way you might do this to address those issues? Again, since it's the bread and butter of operations work, they should know the strengths and weaknesses of the scheme they select, and they should know at least one alternative.

This question checks to see if a candidate has performed typical operations work, but also if they have thought through the problems with it.



The Challenge of Small Ops (Part 1)

I missed an open session at DevOps Days, and I'm really disappointed that I did after hearing feedback from one of the conference participants. He said many people in the session were advocating for eliminating operations at the small scale.

I realize that the world is changing, and that operations teams need to adjust to the changing world. We have skilled development teams, cloud services, and APIs for almost everything you need to build a moderately sized web service. It's no wonder that some smaller (and some large) organizations are beginning to question the need for ops teams. So, I'm going to write a series of articles discussing the challenges that exist in operations for small and medium sized teams, and how an operations expert can help solve these issues.

Ops as the Firefighter

In my discussion with a fellow conference-goer about this topic, when he said the general feeling was that you push responsibility to your vendors and eliminate ops, I suggested that perhaps we should think of ops as a firefighter.

Small towns still have firefighting teams, and they may be volunteers, but I'll bet they were trained by a professional. You should think of an Operations Engineer as your company's trainer. You should lean on them for the knowledge that can only be gained working in an operational environment.


Failure is the only constant for web services, and you should expect failures to happen. You will need to respond to them in a calm and organized manner, but this is likely too much for a single individual. You'll need a better approach.

A mid-level or senior operations engineer should be able to develop an on-call schedule for you. They should be able to identify how many engineers you need on-call in order to meet any SLA response requirement. In addition they can train your engineers how to respond, and make sure any procedure is followed that you might owe to customers. They can make everyone more effective in an emergency.

Vendor Management

Amazon, Heroku, and their friends all provide excellent reliable platforms, but from time to time they fail. Vendors typically like to restrict communications to as few people as possible, since it makes it easier for them to communicate. If you’re not careful you may find yourself spreading responsibility for vendors across your organization, as individuals add new vendors.

I believe it makes more sense to consolidate the knowledge in an operations engineer. An operations engineer is used to seeing vendors fail, and will understand the workflow required to report and escalate a problem. They understand how to read your vendor's SLA, and hold them accountable for failures. Someone else can fill this role, but this person needs to be available at all hours, since failures occur randomly, and they will need to understand how to talk to the NOC on the other end.

The Advocate

Your platform provides a service, and you have customers that rely on you. Your engineering team often becomes focused on individual systems, and repairing failures in those systems. It is useful if someone plays the role of the advocate for the service, and I think operations is a perfect fit. A typical ops engineer will be able to determine if the service is still failing, and push for a resolution within the organization. They are generally familiar with the parts of the service and who is responsible for them.


Language Importance for DevOps Engineers

First and foremost this is a biased article. These are all my opinions, and come from my working experience.

Bash (or POSIX shell)

Importance 10/10

If you're working with *nix and can't toss together a simple init.d script in 5 minutes, you haven't done enough bash. It's everywhere, and it should still be your first automation choice. It has a simple syntax, and is designed specifically to execute programs in a non-interactive manner. You'll be annoyed that it lacks unit tests and complex error handling, but it's purpose-built to automate administrative tasks.


Importance 9/10

This is the language that you will run into if you work in operations. There will be backup scripts, Nagios tests, and a large collection of digital duct tape written by co-workers that do very important jobs. Its syntax is ugly, and you may find yourself writing an eval to handle exceptions, but it's everywhere. CPAN makes it fairly easy to get things done, and you can't beat it for string handling.


C

Importance 5/10

This is the Latin of the *nix world, and is basically portable assembly language. I refrain from writing C whenever possible, since I rarely need the raw performance, and the security and stability consequences are pretty severe. You should understand the syntax (it's in the ALGOL family), and be able to read a simple application. It would be great if you could submit a patch to an open-source project, but I would never turn down an ops hire because they didn't know C well enough.


PHP

Importance 7/10

PHP more important than C?! Yep. Like Perl it's everywhere; people use it for prototyped webapps and full-blown production systems. It's another ALGOL-syntax language, except you can put together a simple web page in 2-3 minutes; it's almost as magical as the Twilio API. You'll find yourself poking at it on more than one occasion, so you might as well know what you're doing.


Ruby

Importance 6/10

Doing something with Puppet or Chef? You probably should know some Ruby, and in fact it's probably more important to know Ruby than Chef or Puppet. It's relatively easy to pick up, and so many of the automation tools people love are written in it. As an extra bonus, you could write Rails and Sinatra apps. It's good to have in your back pocket.


Python

Importance 4/10

People love to love Python, but the truth is that it's a bit of a diva. It's a language that favors reading over writing, and has a very bloated standard library with lots of broken components (which is the right HTTP library to use?). It wants to be a simpler Perl, but I never find it as useful, and it always takes longer. I know a lot of companies say they want to use it as their “scripting” language, but in practice I've not seen the value (I still want to rewrite everyone's code).


Chef and Puppet DSLs

Importance 2/10

These are DSLs for configuration management. They are supposed to be simple to learn, and if you can’t figure them out with a web browser and a few minutes, they are failing.


Java

Importance 6/10

More ALGOL syntax, and more prevalent in high-scale web applications. Minimally you should be able to read this language, but it's useful to be able to pound out a few lines of Java. It has many rich frameworks, and you'll likely find it sneaking into your stack where you need something done fast. Also, it is really useful when it comes time to tune the JVM.


Importance 0/10

When I’ve run into it running someplace serious I’ll update its score.


JavaScript

Importance 8/10

I hate this language, but I can't deny its growing importance. It's most commonly seen in a web browser, but it's starting to creep into the backend with things like node.js. If you can understand JavaScript, you can help resolve whether an issue is a frontend or backend problem; you will have total stack awareness.


SQL

Importance 10/10

You have to know SQL. You will work with SQL databases, and you will want to move things in and out of them. You may want to know a dialect like MySQL very well, but you should understand the basics, and at a minimum be able to join a few tables, or create an aggregate query.


Stupid Bash Expansion Trick

I got asked a question regarding filename expansion in bash the other day, and was stumped. It turns out to be something I should have considered a long time ago, and will always keep in mind when writing a script.

Question 1:

What does the following script do if there is a file abc in the current directory?

for i in a*; do
  echo $i
done


This a* matches abc and expands to abc, and the script outputs:

abc

Question 2:

What if you run the same script in a directory without any files?


The script outputs:

a*



According to The Bash Reference Manual:

Bash scans each word for the characters ‘*’, ‘?’, and ‘[’. If one of these characters appears, then the word is regarded as a pattern, and replaced with an alphabetically sorted list of file names matching the pattern. If no matching file names are found, and the shell option nullglob is disabled, the word is left unchanged.

So bash will output ‘a*’, because that is how filename expansion works.

Question 3:

What if you run the following script in a directory with no filenames beginning with a:

for i in a*; do
  echo /usr/bin/$i
done


The script outputs:

/usr/bin/a2p /usr/bin/a2p5.10.0 /usr/bin/a2p5.8.9 /usr/bin/aaf_install /usr/bin/aclocal /usr/bin/aclocal-1.10 /usr/bin/addftinfo /usr/bin/afconvert /usr/bin/afinfo /usr/bin/afmtodit /usr/bin/afplay /usr/bin/afscexpand /usr/bin/agvtool /usr/bin/alias /usr/bin/allmemory /usr/bin/amavisd /usr/bin/amavisd-agent /usr/bin/amavisd-nanny /usr/bin/amavisd-release /usr/bin/amlint /usr/bin/ant /usr/bin/applesingle /usr/bin/appletviewer /usr/bin/apply /usr/bin/apr-1-config /usr/bin/apropos /usr/bin/apt /usr/bin/apu-1-config /usr/bin/ar /usr/bin/arch /usr/bin/as /usr/bin/asa /usr/bin/at /usr/bin/atos /usr/bin/atq /usr/bin/atrm /usr/bin/atsutil /usr/bin/autoconf /usr/bin/autoheader /usr/bin/autom4te /usr/bin/automake /usr/bin/automake-1.10 /usr/bin/automator /usr/bin/autoreconf /usr/bin/autoscan /usr/bin/autospec /usr/bin/autoupdate /usr/bin/auval /usr/bin/auvaltool /usr/bin/awk


Because you're re-evaluating ‘/usr/bin/$i’, which is now ‘/usr/bin/a*’, which expands to the ordered list above due to shell filename expansion rules. If you want to avoid this you need to protect your variables using quotes. Here is the safe version of the script:

for i in a*; do
  echo /usr/bin/"$i"
done

Just something simple to think about when writing your bash scripts. Expect to enter loops on globs that don't match anything, always protect your variables, and consider setting the failglob option in your scripts.
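As a sketch, setting nullglob makes a non-matching glob expand to nothing, so the loop body simply never runs (failglob would instead abort with an error):

```shell
#!/bin/bash
mkdir -p empty_demo && cd empty_demo    # a directory with no files

shopt -s nullglob
# With nullglob set, a non-matching glob expands to nothing, so this
# loop runs zero times instead of once with the literal string "a*".
for i in a*; do
  echo "/usr/bin/$i"
done
```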


Dealing with Outages

No matter what service you're building, at some point you can expect to have an outage. Even if your software is designed and scaled perfectly, one of your service providers may let you down, leading to a flurry of calls from customers. Plus the internet has many natural enemies in life (rodents, backhoes, and bullets), and you may find yourself cut off from the rest of the net with a small twist of fate. Don't forget, even the best detection and redundancy schemes fail, and it's not unusual to have your first notification of an outage come from a rather upset customer. Your site will go down, and you should be prepared for it.

Initial Customer Contact

Your customer is upset. You've promised to provide some service that is now failing, and they are likely losing money because of your outage. They've called your support line, or sent an email, and they are looking for a response. What do you do?

Give yourself a second

Outages happen on their own schedules, and you may be at a movie, sleeping, at the gym, or eating dinner at the French Laundry. You need to give yourself 2-3 minutes to compose yourself, find internet access, and call the customer back. If you have an answering service you've likely met the terms of your SLA; if you don't, figure out how much time you can take. I think an answering service is a better option than voicemail, since it handles any issues you may have communicating with a customer in the first few minutes of the call. They may even be able to file a ticket for you with the basic information you need. This can cost a fair bit of money, and if this option is too pricey for your service, consider a voicemail number that will page your on-call team. It gives your team a small buffer, but they have to be prepared to talk to the customer quickly, since this may add up to 5 minutes between the initial call and the page. As a last resort, have your customer support number dial someone who is on-call. If you have the time and resources, make the email address you use for outage reports follow the same workflow as calls, so you don't need a second process.

Promises Can Be Hard to Keep

Track your customer's complaint; make sure it's recorded in your ticketing system. You want to start keeping a record from the moment they called you, and be able to reconstruct the incident later. This will also help you determine a start time for any damages clause that may be in your SLA. I'd make sure the following things are done:

  • Get a call back number.
  • Let them know you are looking into the issue.
  • Let them know when you expect to call them back.
  • Let them know the ticket / incident number you are using to track the issue.
  • And most importantly, don't promise anything that you can't guarantee will happen.


Have you met the terms of your SLA?

You only have one SLA agreement, right? If not, hopefully the basics are the same. Keep in mind what you’ve agreed to with your customers, and as early as possible identify if you’ve not met the terms of the service agreement. This is really just for tracking, but it can be useful if you have to involve an account manager and discuss any damage claims.

Houston, we don’t have a problem.

You've talked with the customer, you've created a ticket, you've managed expectations; now it's time to figure out if there is an issue.

  • Check your internal monitoring systems.
  • Check your external monitoring systems.
  • Check your logging.
  • Check your traffic.
  • Give your customer's use-case a try.

Does your service look ok, or do you see a problem? At this point you want to figure out if you have an issue or not. If you can't figure it out quickly, you need to escalate the issue to someone who can. If you don't have an issue, call the customer and see if they still have any issues, and if they'll agree to close the issue. If they are still having issues, escalate; and if you have doubts as to whether your service is working, escalate. If you know you have an issue, it's time to move on to resolving it.

Who Needs to Know?

It's important to let everyone on your team know your service is having issues. Before anything happens, you should know who you need to contact when there is an issue. This will save time, and help minimize duplication of work (in larger organizations, two people may be receiving calls about the same issue). A mail group or centralized chat server is an ideal solution, since it's fairly low latency, and you can record the communication to be reviewed later. You should be clear as to what the problem is, and provide a link to the ticket.

Who has your back?

The next thing you should be working out is who you need to solve your issue. Your product could be simple, or fairly complex. You may be the right person to address the problem, or you may need to call for backup. If you have an idea of who you need, get in touch with them now. Get them ready to help you solve your problem. It takes quite a bit of time to get people online, so if you might need their help it's better to call them sooner rather than later.

Herding Cats

Finally, now that you’ve let everyone know, and you have a team assembled to solve the issue, figure out how you’re going to communicate. The method should be low latency, and low effort. I prefer conference calls, but a chat server can work just as well plus you can cut and paste errors into the chat. You should have this figured out well in advance of an incident.

Come on you tiny horses!

You're ready to fix the problem. Just a few more things you should have figured out:

  • Who is doing the work?
  • Who is communicating with your customer?
  • Who is documenting the changes made?
  • Who will gather any additional people needed to resolve the issues?

This could be an easy answer if you only have one person, but working through almost any issue is much easier with two people. Ideally one person will act as the project manager, getting extra help and talking to the customer, while the other types furiously in a terminal to bring the service back up. If you have this worked out beforehand you'll save some time, but if you don't, come to an agreement quickly, and stick to your roles. You don't need 2 people talking to your customer telling them different things, or worse, two people bringing a service up and down.


So you’re finally back up…

Great only a few more things to do.

Open a ticket for the post-mortem. Link it to your outage ticket, and begin filling in any information that might be helpful. Be as detailed as possible, and even if it's inconvenient take a little time to document the issue and resolution. You should also immediately schedule a post-mortem meeting that takes place in the next 24 hours. People begin to forget what they did, and you need to capture as much of it as possible.

Once you’ve completed your meeting, produce a document explaining the outage. This should be as brief as possible with very little internal information included. Document the timeline leading to the issue, how the issue was discovered, and how it was resolved. Also, build a list of future actions to prevent a repeat of the incident. If your customer asks for it, or their SLA includes language promising it, send them the document to explain the outage.

So, spend time thinking about talking to your customer when you're down. Think through the process, so when they call you won't have to make it up. I've set up several of these processes, and I've found that these are the issues that always need to be looked at. It's worth the planning, and it's always important to look at what happened, so that you can improve the process.