Things fail; we all know that. What most people don’t take into account is that things fail in combination and in unexpected ways. We spend time and effort planning redundancy and failover schemes to seamlessly continue operations, but often neglect to fully test these plans before rolling services and equipment into production. What inevitably happens is that the service fails, because the failover plan never worked, or because nobody considered what issues might arise while failing over. So, borrowing the concept of User Acceptance Testing (UAT) from software development, we can develop a system of tests that lets us feel confident our redundancy plans will work when we need them.
Build a test plan; it’s that simple. Start by identifying the dependent components of your system, then look at all the typical failure scenarios that may happen in those components. If you have two switches, what happens if one dies? If you have bonded network interfaces, what happens if you lose an uplink on one of your switches?
After you identify the failure scenarios, specify the expected behavior for each scenario. If a switch dies, network traffic should continue to be sent through the remaining switch. If interface one loses its ability to route traffic, interface two should become the primary interface in the bond.
Combining the two pieces should give you a specification of how you expect the system to behave in the case of these failures. You can organize these any way you want, but I typically use a user-story-like format to describe the failure and expected outcome.
Example test cases:
- Switch 1 stops functioning
  - Switch 2 takes over the VRRP address
  - Switch 2 passes traffic with minimal interruption, within 3 seconds
  - Nagios alerts that switch 1 has failed
- App server loses DB connection
  - Load balancer detects the error and removes the host
  - Load balancer continues to pass traffic to the other app servers
  - Nagios alerts that the app server has failed
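If you would rather keep the plan as data than as free text, the same failure and expected-outcome pairs fit in a small structure. This is only a sketch: the `FailureScenario` class and the `PLAN` list are names I made up for illustration, not part of any tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureScenario:
    """One failure and the behavior we expect to see when it happens."""
    failure: str
    expected: List[str] = field(default_factory=list)

    def render(self) -> str:
        # Print the failure first, then each expected outcome beneath it.
        lines = [f"- {self.failure}"]
        lines += [f"  - {outcome}" for outcome in self.expected]
        return "\n".join(lines)


# The two example test cases from this article, expressed as data.
PLAN = [
    FailureScenario("Switch 1 stops functioning", [
        "Switch 2 takes over the VRRP address",
        "Switch 2 passes traffic with minimal interruption, within 3 seconds",
        "Nagios alerts that switch 1 has failed",
    ]),
    FailureScenario("App server loses DB connection", [
        "Load balancer detects the error and removes the host",
        "Load balancer continues to pass traffic to the other app servers",
        "Nagios alerts that the app server has failed",
    ]),
]

for scenario in PLAN:
    print(scenario.render())
```

Keeping the plan in this form makes it easy to paste into a wiki for review, and to count scenarios later when you check coverage.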
Once you’ve completed your plan, get buy-in for it. You’ll want a few of your peers to review it and look for any failures you may have missed. Once you have agreement that this is the right test set, it’s time for the next step.
Writing Artificial Tests
Start brainstorming ways to test failure modes. Simple, non-destructive tests are best: emulate a switch failure by unplugging the switch. To simulate a failed network interface on a host, block its port on the switch. To simulate a frozen system, block the load balancer from connecting to it with a host-level firewall. You may want to take things a step further, like pulling a disk to test RAID recovery.
Remember, you’re trying to test your failover plans, so you should not be terribly concerned if you break a configuration in the process; the same thing may happen when something really goes down. Write down all the steps of each test, and also write down how you get back to the known state.
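For tests that can be scripted, one way to make sure the "get back to the known state" step is never skipped is to pair it with the failure injection in code. The sketch below assumes you have one callable that injects the failure and one that undoes it; `injected_failure`, `block_load_balancer`, and `unblock_load_balancer` are hypothetical names, and the stand-ins here only record what they would do.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def injected_failure(inject: Callable[[], None],
                     restore: Callable[[], None]) -> Iterator[None]:
    """Inject a failure and run the restore step when the block exits."""
    inject()
    try:
        yield
    finally:
        # Runs even if the observation step raises, so the system is
        # returned to its known state before the next test.
        restore()


# Hypothetical stand-ins: a real test might prompt an operator to unplug a
# switch, or run a script that adds and removes a firewall rule.
log: List[str] = []


def block_load_balancer() -> None:
    log.append("block load balancer at host firewall")


def unblock_load_balancer() -> None:
    log.append("remove firewall block")


with injected_failure(block_load_balancer, unblock_load_balancer):
    log.append("observe: load balancer removes host")

print("\n".join(log))
```

For manual tests such as unplugging a switch, the same pairing works on paper: every written test gets an inject section and a restore section.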
Review your test cases, and make sure you have tests that address each failure mode. If it’s impossible to test a scenario, note it, and exclude it from your UAT. Once you’ve done that, you’re ready to test.
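The review itself can be a simple cross-check of failure modes against tests. Here is a rough sketch; the function name and the sample data are my own illustrations, and "datacenter power loss" is just a made-up example of a scenario you might decide you cannot test.

```python
from typing import Dict, List, Set


def uncovered_failure_modes(failure_modes: List[str],
                            tests: Dict[str, List[str]],
                            untestable: Set[str]) -> List[str]:
    """Return failure modes that have no test and are not marked untestable."""
    return [
        mode for mode in failure_modes
        if not tests.get(mode) and mode not in untestable
    ]


# Hypothetical plan data, loosely based on the examples in this article.
failure_modes = [
    "switch failure",
    "host NIC failure",
    "app server freeze",
    "datacenter power loss",
]
tests = {
    "switch failure": ["unplug switch 1"],
    "host NIC failure": ["block the host's port on the switch"],
}
untestable = {"datacenter power loss"}

print(uncovered_failure_modes(failure_modes, tests, untestable))
```

With this sample data the check reports "app server freeze" as the one failure mode that still needs a test, while the excluded scenario stays noted but out of the UAT.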
Performing the Tests
Anyone involved in the day-to-day technical operations should be able to run through the tests. It’s not a bad idea to have a whole team participate, so that people can get used to seeing how the system behaves when components are failing. Step through the tests methodically, and record whether each test passed or failed, and how the system behaved during the process. For example, if you’re testing the failure of an app server, did any errors show up on HTTP clients, and if so, for how long?
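To answer the "for how long?" question, it helps to probe the service at a steady interval during the test and work out the longest stretch of errors afterwards. The sketch below assumes you already have the probe results as (timestamp, ok) pairs; `longest_outage`, `UatResult`, the sample data, and the 3 second threshold applied to this test are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class UatResult:
    """What to record for each test: outcome plus observed behavior."""
    name: str
    passed: bool
    notes: str


def longest_outage(samples: List[Tuple[float, bool]]) -> float:
    """Return the longest run of failed probes, in seconds.

    samples are (timestamp_seconds, ok) pairs in time order. An outage is
    measured from the first failed probe to the next successful one, or to
    the last sample if the service never recovered.
    """
    longest = 0.0
    started: Optional[float] = None
    for timestamp, ok in samples:
        if not ok and started is None:
            started = timestamp
        elif ok and started is not None:
            longest = max(longest, timestamp - started)
            started = None
    if started is not None:
        longest = max(longest, samples[-1][0] - started)
    return longest


# Hypothetical probe data: one probe per second against the service.
SAMPLES = [(0.0, True), (1.0, False), (2.0, False), (3.0, True), (4.0, True)]
MAX_OUTAGE_SECONDS = 3.0  # assumed threshold for this example

outage = longest_outage(SAMPLES)
result = UatResult(
    name="App server failure",
    passed=outage <= MAX_OUTAGE_SECONDS,
    notes=f"HTTP clients saw errors for about {outage:.1f} seconds",
)
print(result)
```

The measured outage is only as precise as the probe interval, so pick an interval noticeably shorter than the interruption you are willing to accept.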
Tests are going to fail, and when they do it is time to figure out why. First, was the failure a configuration error, or an artifact of a previous test? If so, fix it, update your test plan, and start testing again. Did your redundancy plan have a fatal flaw? That’s OK too; that’s why we test. If you missed something in your plan, address the issue, and restart the tests from scratch. You’re much better off catching problems in UAT than after you’ve pushed the service to production.
Keep a copy of the UAT somewhere, so if questions come up later you can discuss it. I use wikis for this, but any document will do. Once you have that sorted, you can roll your fancy new service into production.
UAT is a useful concept for software development, and it is just as useful for production environments. Take your time and develop a good plan, and you’ll end up with longer uptimes and meet your SLA requirements. As an added bonus, you gain experience seeing how your equipment or instances behave when something has gone wrong.