<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>hypergeometric</title>
	<atom:link href="http://blog.hypergeometric.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.hypergeometric.com</link>
	<description></description>
	<lastBuildDate>Sat, 14 Apr 2012 15:35:49 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Redundancy Planning, more work than adding one of everything</title>
		<link>http://blog.hypergeometric.com/2012/04/14/redunancy-planning-more-work-than-adding-one-of-everything/</link>
		<comments>http://blog.hypergeometric.com/2012/04/14/redunancy-planning-more-work-than-adding-one-of-everything/#comments</comments>
		<pubDate>Sat, 14 Apr 2012 15:29:40 +0000</pubDate>
		<dc:creator>papilion</dc:creator>
				<category><![CDATA[DevOps]]></category>

		<guid isPermaLink="false">http://blog.hypergeometric.com/?p=109</guid>
		<description><![CDATA[Since I started my career, redundancy has been featured in almost every deployment discussion. The general best practice is to add an additional element for each service tier, also known as N+1 redundancy. This approach is straightforward, but many people would be surprised by how often these schemes fail. In a very famous <a href="http://blog.hypergeometric.com/2012/04/14/redunancy-planning-more-work-than-adding-one-of-everything/"> read more <span class="meta-nav">&#187;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Since I started my career, redundancy has been featured in almost every deployment discussion. The general best practice is to add an additional element for each service tier, also known as N+1 redundancy. This approach is straightforward, but many people would be surprised by how often these schemes fail. In a very famous incident in San Francisco, a data center lost power in the majority of its co-location suites due to the failure of its N+2 (one better than N+1) backup power generation scheme.</p>
<h3>Start with the Easy Things </h3>
<p>You start by looking at each individual component in your stack and deciding whether, if it fails, it can fail independently. If you do this with your stack, you&#8217;ll generally find that pieces that scale easily horizontally have failure boundaries isolated to the system itself. For example, if a web server fails, it generally has no impact on service, because concurrency is maintained elsewhere, but it will reduce capacity. This is the easiest case to plan for, because an extra server will typically take care of the issue.</p>
<h3>Now with the hard things</h3>
<p>When you look at components such as database masters or storage nodes, the story becomes more complex. This type of equipment generally has failure boundaries that extend beyond itself. A rack full of application servers may become useless when they are no longer able to access a database for writes. You don&#8217;t truly have redundancy here until you have a scheme for fail-over. Without planning, you may be trying to figure out slave promotion at 2:32 am.<br />
</p>
<h3>Then with the hard and really expensive things</h3>
<p>Core infrastructure needs love too. Things like rack power, networks, carriers, cloud providers, and buildings have failure boundaries as well. Unfortunately, they extend to several portions of your stack at once. They are very difficult to plan around, and redundancy for them often takes a significant investment. The datacenter mentioned above used 2 spare generators as redundancy for all of the co-location suites; when 3 of the primary generators failed, so did their redundancy plan. They had let each suite become dependent on all of the other suites having normal power operations.<br />
</p>
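<p>The arithmetic behind that generator failure can be sketched in a few lines. This is only an illustration: the count of 8 required generators is a made-up number, and the function name is hypothetical. The one figure taken from the incident is the pool of 2 shared spares.</p>

```python
REQUIRED = 8   # hypothetical: generators needed to carry the full load
SPARES = 2     # the shared N+2 pool described in the incident above


def survives(failed, required=REQUIRED, spares=SPARES):
    """True if enough generators remain to carry the full load."""
    remaining = required + spares - failed
    return remaining >= required


# Walk through 0..3 simultaneous generator failures.
for failed in range(4):
    print(failed, "failed:", "ok" if survives(failed) else "shortfall")
```

<p>However many suites share the pool, it absorbs only as many failures as there are spares. The third failure is a shortfall for everything that depends on that pool.</p>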
<h3>Finally, figure out what you have to do</h3>
<p>Once you&#8217;ve identified all of your failure boundaries, it&#8217;s time for the fun part: financial discussions! Remember, while it&#8217;s important to have backups of all data, redundancy is a financial hedge. When planning, try to figure out what the cost of downtime is, and to what extent the business is willing to fund protection against it. It&#8217;s not uncommon for multi-datacenter redundancy to require an application change, but it&#8217;s probably not worth the investment if you have no customers. Create a target and engineer a system that meets that goal within the budget.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.hypergeometric.com/2012/04/14/redunancy-planning-more-work-than-adding-one-of-everything/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Three Monitoring Tenets</title>
		<link>http://blog.hypergeometric.com/2012/03/08/three-monitoring-tenants/</link>
		<comments>http://blog.hypergeometric.com/2012/03/08/three-monitoring-tenants/#comments</comments>
		<pubDate>Thu, 08 Mar 2012 05:46:48 +0000</pubDate>
		<dc:creator>papilion</dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Monitoring]]></category>

		<guid isPermaLink="false">http://blog.hypergeometric.com/?p=148</guid>
		<description><![CDATA[This week I saw a drop in average back-end performance at work: average page load time went from ~250ms to around 500ms. This seemed to be an intermittent problem, and we searched through our graphs at New Relic with no clear culprit. Then we started looking at our internal MediaWiki profiling <a href="http://blog.hypergeometric.com/2012/03/08/three-monitoring-tenants/"> read more <span class="meta-nav">&#187;</span></a>]]></description>
			<content:encoded><![CDATA[<p>This week I saw a drop in average back-end performance at work: average page load time went from ~250ms to around 500ms. This seemed to be an intermittent problem, and we searched through our graphs at New Relic with no clear culprit. Then we started looking at our internal MediaWiki profiling collector, and some of the various aggregation tools that I had put together. After a few minutes it became clear that the connection time on one of our databases had increased 1000-fold. Having recently changed the spanning-tree configuration and moved some of our cross connects to 10Gb/s, I suspected this might be a spanning-tree issue. It turned out that the Ganglia daemon (gmond) on that host had leaked enough memory to negatively affect system performance. Unfortunately this was a pretty inefficient way to find the problem, and it reminded me of a few basic tenets of monitoring.</p>
<p>&nbsp;</p>
<h2>Monitor Latency</h2>
<p>A simple MySQL test could just tell you whether your server is up or down. Your alerts probably even have timeouts, but in most monitoring tools I&#8217;ve seen these are measured in seconds, not milliseconds. You should configure your alert to tell you when the service you&#8217;re monitoring has degraded to an unacceptable level and may be affecting site performance. So, your simple MySQL check should time out after 3 seconds, but alert you if it has taken more than 100ms to establish a connection. Remember, if the latency is high enough, your service is effectively down.</p>
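<p>A minimal sketch of such a check, under some assumptions: it times only the TCP connect to the MySQL port rather than a full MySQL handshake, and the function names and status strings are made up for illustration. The 3 second timeout and 100ms threshold come from the paragraph above.</p>

```python
import socket
import time

CONNECT_TIMEOUT_S = 3.0    # hard timeout: give up after 3 seconds
LATENCY_ALERT_MS = 100.0   # alert if the connect takes longer than 100 ms


def classify(elapsed_ms, connected):
    """Map a measured connect time to a check status."""
    if not connected:
        return "CRITICAL"   # refused, unreachable, or timed out
    if elapsed_ms > LATENCY_ALERT_MS:
        return "ALERT"      # up, but slow enough to hurt the site
    return "OK"


def check_tcp_latency(host, port=3306):
    """Time a TCP connect to host:port and return (status, elapsed_ms)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=CONNECT_TIMEOUT_S):
            connected = True
    except OSError:
        connected = False
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return classify(elapsed_ms, connected), elapsed_ms
```

<p>Calling check_tcp_latency with a database hostname returns the status together with the measured time, so the same number can be graphed as well as alerted on.</p>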
<p>&nbsp;</p>
<h2>Monitoring your Monitoring</h2>
<p>Sometimes your monitoring can get out of whack. You may find that your tests are consuming so many resources that they are negatively affecting your performance. You need to define acceptable parameters for these applications, and make sure that they are doing what you expect.</p>
<p>&nbsp;</p>
<h2>Set Your Alerts Lower than You Need</h2>
<p>Your alerts should go off before your services are broken. Ideally this would be done with alerts on warnings, but for a good number of people warnings are too noisy. If you&#8217;re only going to alert on errors, set your threshold well below the service level you expect to provide. For example, if you have an HTTP service that you expect to answer within 100ms, and that typically answers within 25ms, your warnings should be set at something like 70ms and errors at 80ms. By alerting early, you&#8217;re preventing a call from a customer, or an angry boss.</p>
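<p>The numbers from that example can be written down as a tiny threshold function. This is a sketch, not any monitoring tool&#8217;s configuration format; the constant and function names are hypothetical.</p>

```python
SLA_MS = 100.0     # the response time you expect to provide
WARN_MS = 70.0     # warn well before the service level is at risk
ERROR_MS = 80.0    # page someone while there is still headroom


def alert_level(latency_ms):
    """Return the alert level for one measured response time."""
    if latency_ms >= ERROR_MS:
        return "error"
    if latency_ms >= WARN_MS:
        return "warning"
    return "ok"


# A typical 25 ms response, then two degraded ones.
for sample in (25.0, 75.0, 85.0):
    print(sample, alert_level(sample))
```

<p>Both thresholds sit below the 100ms service level, so the error fires while the service is still technically within its target.</p>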
<p>So give these three things a try, and you should end up with a better monitoring setup.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.hypergeometric.com/2012/03/08/three-monitoring-tenants/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Groovy, A Reasonable JVM Language for DevOps</title>
		<link>http://blog.hypergeometric.com/2012/03/06/groovy-a-reasonable-jvm-language-for-devops/</link>
		<comments>http://blog.hypergeometric.com/2012/03/06/groovy-a-reasonable-jvm-language-for-devops/#comments</comments>
		<pubDate>Tue, 06 Mar 2012 06:02:12 +0000</pubDate>
		<dc:creator>papilion</dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[solr]]></category>

		<guid isPermaLink="false">http://blog.hypergeometric.com/?p=140</guid>
		<description><![CDATA[I&#8217;ve worked in several environments where most of our product was run on the JVM. I&#8217;ve always used the information available to me in MBeans, but the overhead of exposing them to a monitoring system like Ganglia or Nagios has always been problematic. What I&#8217;ve been looking for is a simple JVM language that allows <a href="http://blog.hypergeometric.com/2012/03/06/groovy-a-reasonable-jvm-language-for-devops/"> read more <span class="meta-nav">&#187;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve worked in several environments where most of our product was run on the JVM. I&#8217;ve always used the information available to me in MBeans, but the overhead of exposing them to a monitoring system like Ganglia or Nagios has always been problematic. What I&#8217;ve been looking for is a simple JVM language that allows me to use any native object. In 2008 I found Groovy, and have used it as my JVM glue ever since.</p>
<h2>Meet Groovy</h2>
<p>I won&#8217;t go into too much detail here, but Groovy is a simple dynamically typed language for the JVM. Its syntax and style are similar to Ruby&#8217;s. You can find more information <a href="http://groovy.codehaus.org/">here</a>.</p>
<h2>Solr JMX to Ganglia Example</h2>
<p><script src="https://gist.github.com/1983938.js"> </script></p>
<p>This is a quick script to get data from Solr&#8217;s JMX into Ganglia. The syntax is simplified, and it is missing the typical Java boilerplate required to run from the command line. Not only that, but because of the simplified typing model, it&#8217;s very easy for me to build strings and then execute a command. This lets me construct a script as quickly as I would in Perl or Python, while also giving me easy access to JMX and any Java library of my choosing. The primary drawback is the time it takes to instantiate the JVM, which you can work around if you can&#8217;t deal with the launch time. </p>
<p>So, if you&#8217;re working in a Java environment with a bunch of Java apps, think about giving Groovy a chance for writing some of your monitoring tests and metric collectors. It is a simpler language than Java for putting together those little applications that can tell you how your system is performing, and it is well within the reach of your average DevOps engineer. </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.hypergeometric.com/2012/03/06/groovy-a-reasonable-jvm-language-for-devops/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A few things you should know about EC2</title>
		<link>http://blog.hypergeometric.com/2012/02/27/a-few-things-you-should-know-about-ec2/</link>
		<comments>http://blog.hypergeometric.com/2012/02/27/a-few-things-you-should-know-about-ec2/#comments</comments>
		<pubDate>Tue, 28 Feb 2012 01:44:39 +0000</pubDate>
		<dc:creator>papilion</dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[EC2]]></category>

		<guid isPermaLink="false">http://blog.hypergeometric.com/?p=135</guid>
		<description><![CDATA[Availability Zones are Randomized Between Accounts I had someone from Amazon tell me this, so I assume this to be true. In order to prevent people from gaming the system&#8217;s availability and over-allocating instances in a single AZ, zone IDs are randomized across customers. So for any two accounts us-east-1a != us-east-1a. Amazon promises <a href="http://blog.hypergeometric.com/2012/02/27/a-few-things-you-should-know-about-ec2/"> read more <span class="meta-nav">&#187;</span></a>]]></description>
			<content:encoded><![CDATA[<h2>Availability Zones are Randomized Between Accounts</h2>
<p>I had someone from Amazon tell me this, so I assume this to be true. In order to prevent people from gaming the system&#8217;s availability and over-allocating instances in a single AZ, zone IDs are randomized across customers. So for any two accounts us-east-1a != us-east-1a. Amazon promises that availability zones are separate within your account, but it makes no promises about keeping them consistent across accounts. If you&#8217;re using multiple accounts, don&#8217;t assume you can choose the same availability zone. </p>
<h2>No Instance is Single Tenant</h2>
<p>We all want to game the system, and I&#8217;ve heard rumors that XL and 4XL instances are single tenant, one VM per hardware instance. I&#8217;ve come to believe that no EC2 instances are single tenant, even the cluster compute instances. It&#8217;s a fair bet that systems can easily be purchased with 96GB+ of memory, so AWS has likely been using configurations like this for the past 2+ years. It&#8217;s always possible to have a noisy neighbor; don&#8217;t assume you can buy your way out. </p>
<h2>Micro Instances Aren&#8217;t Good for Production Use</h2>
<p>If you do anything at any kind of scale, don&#8217;t use micro instances. They have variable performance, and you shouldn&#8217;t rely on them for anything.</p>
<h2>EBS Should only be Allocated 1TB at a Time</h2>
<p>This is one area where it seems you can game the system. Many people have reported that by using 1TB volumes you get better performance. The conventional wisdom is that you are allocating a drive, or at least most of one. So, don&#8217;t skimp; over allocate if you need EBS.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.hypergeometric.com/2012/02/27/a-few-things-you-should-know-about-ec2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Configuration Management Tools Still Fall Short</title>
		<link>http://blog.hypergeometric.com/2012/02/23/configuration-management-tools-still-fall-short/</link>
		<comments>http://blog.hypergeometric.com/2012/02/23/configuration-management-tools-still-fall-short/#comments</comments>
		<pubDate>Thu, 23 Feb 2012 23:48:20 +0000</pubDate>
		<dc:creator>papilion</dc:creator>
				<category><![CDATA[DevOps]]></category>

		<guid isPermaLink="false">http://blog.hypergeometric.com/?p=132</guid>
		<description><![CDATA[I have a gripe with almost every configuration management tool I&#8217;ve used. I&#8217;m most familiar with Chef, but I&#8217;ve used Puppet a bit, so I apologize to the fine people at Opscode in advance since my examples will be Chef based. The Cake is a Lie Every time I run Chef I tell myself <a href="http://blog.hypergeometric.com/2012/02/23/configuration-management-tools-still-fall-short/"> read more <span class="meta-nav">&#187;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I have a gripe with almost every configuration management tool I&#8217;ve used. I&#8217;m most familiar with Chef, but I&#8217;ve used Puppet a bit, so I apologize to the fine people at Opscode in advance since my examples will be Chef based.</p>
<h2>The Cake is a Lie</h2>
<p>Every time I run Chef I tell myself a lie: my system will be in a known state when Chef finishes running. The spirit of the DevOps movement is that we are building repeatable processes and tools, freeing our companies from unknown, undocumented production environments, but in practice we may be making it worse.</p>
<h2>The One Constant is Change</h2>
<p>This should be a surprise to no one, but occasionally broken recipes get checked in and run. Sometimes these affect state, sometimes they don&#8217;t; it really depends on the text of your recipe. Sometimes recipes run and are later removed. This is the natural cycle, since developing environments change over time. We remove and fix these recipes cavalierly, to eliminate unneeded packages, cut run times, and make our configuration management tool work.</p>
<h2>The Server is an Accumulator Pattern Without Scope</h2>
<p>What we generally forget is that servers behave like JavaScript without scope: every change lands in global state. We intend for all of our changes to modify the system in a known way, but since (particularly with persistent images) we may have run several generations of scripts, we may not know our starting state. From the moment we have an instance/server/image we are accumulating changes that our configuration management utilities rely on to operate. Long-forgotten recipes may still be haunting your server with an old package or config file that you are now unknowingly using. A new instance may be equally hard to recreate because, despite your base assumption, every Chef run modified state, and you&#8217;ve been relying on those side effects in every run since.</p>
<h2>Is it your Mise en Place or Chef&#8217;s?</h2>
<p>Chef doesn&#8217;t clean up, it leaves it to you. You have to be the disciplined one, and make sure your work place is clean. If you have physical hardware, this is more challenging than with virtual instances, but if you persist images you can suffer from the same problems as well.</p>
<h2>What&#8217;s Missing?</h2>
<p>All of these tools lack state verification. I&#8217;d love for these tools to be transactional, but I&#8217;m realistic: that will never happen. When a run is completed, I would like to verify that some state condition is met, rather than just knowing that all my commands succeeded. Unfortunately, I&#8217;m not sure this is realistic.</p>
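<p>To make the idea concrete, here is a rough sketch of what a post-run state check could look like. It is not a feature of Chef or Puppet; the function name and the package lists are hypothetical, and a real check would have to gather the observed state from the host&#8217;s package manager.</p>

```python
def state_drift(expected, actual):
    """Compare the state you intended with the state you observed."""
    expected, actual = set(expected), set(actual)
    return {
        "missing": sorted(expected - actual),      # should be there, is not
        "unexpected": sorted(actual - expected),   # left behind by old runs
    }


# Hypothetical package lists, for illustration only.
expected_packages = {"nginx", "openjdk-6-jre", "ganglia-monitor"}
installed_packages = {"nginx", "openjdk-6-jre", "ganglia-monitor", "apache2"}

drift = state_drift(expected_packages, installed_packages)
print(drift)
```

<p>The unexpected entry is the accumulator at work: every command in the latest run can succeed while a package from a long-removed recipe is still installed.</p>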
<h2>Protect Your Neck</h2>
<p>So, given that we have these accumulators, my preferred solution is to zero them out: reinstall early and often, or start new images whenever you can. The only state that is known is a clean install, so when you make major changes, reinstall.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.hypergeometric.com/2012/02/23/configuration-management-tools-still-fall-short/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>SSH Do&#8217;s and Don&#8217;ts</title>
		<link>http://blog.hypergeometric.com/2012/02/22/ssh-dos-and-donts/</link>
		<comments>http://blog.hypergeometric.com/2012/02/22/ssh-dos-and-donts/#comments</comments>
		<pubDate>Wed, 22 Feb 2012 06:20:41 +0000</pubDate>
		<dc:creator>papilion</dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://blog.hypergeometric.com/?p=127</guid>
		<description><![CDATA[Do Use SSH Keys Whenever you can, use a key for SSH. Once you create it, you can distribute the public side widely to enable access wherever you need it. Generating one is easy: ssh-keygen -t ed25519 Don&#8217;t Use a Blank Passphrase on Your Key This key is now your identity. Protect it. <a href="http://blog.hypergeometric.com/2012/02/22/ssh-dos-and-donts/"> read more <span class="meta-nav">&#187;</span></a>]]></description>
			<content:encoded><![CDATA[<h2>Do Use SSH Keys</h2>
<p>Whenever you can, use a key for SSH. Once you create it, you can distribute the public side widely to enable access wherever you need it. Generating one is easy:</p>
<p><code><br />
ssh-keygen -t ed25519<br />
</code></p>
<h2>Don&#8217;t Use a Blank Passphrase on Your Key</h2>
<p>This key is now your identity. Protect it. Select a sufficiently strong passphrase, and enter it when prompted. This is basic security, and it allows you to &#8220;safely&#8221; move your keys between hosts without compromising the key&#8217;s security.</p>
<h2>Do Use Multiple Keys</h2>
<p>It&#8217;s probably best to use a few keys when setting up access from different hosts. This makes it possible to shut down a key without locking yourself out.</p>
<h2>Don&#8217;t Copy Your Private Key Around</h2>
<p>Remember, this is your identity and your authorization to access systems. It&#8217;s never a good idea to copy it from system to system.</p>
<h2>Do Use SSH Agents</h2>
<p>Enabling the SSH agent on your laptop or desktop can save you from the tedium of passphrase entry. Launching the agent is easy; then you just need to add key files to it.</p>
<p><code><br />
# starts the agent, and sets up your environment variables<br />
exec ssh-agent bash<br />
# add your identities to the agent by using ssh-add<br />
ssh-add<br />
</code></p>
<h2>Don&#8217;t Leave Your Agents Running After You Log Out</h2>
<p>Leaving your agent running is like leaving your keys in a running car. Anyone who can gain access to your agent can now assume your identity.</p>
<h2>Do Make A Custom ~/.ssh/config</h2>
<p>You&#8217;ll find from time to time that you&#8217;ll need special settings. You have a few options, like entering a very long command string, or creating a custom ~/.ssh/config file. I use this for short hostnames when I&#8217;m on a VPN, or when my username on my system doesn&#8217;t match my account on the remote system.</p>
<p><code><br />
# A quick wildcard example<br />
Host *.production<br />
User geoffp<br />
IdentityFile ~/.ssh/prod_id_dsa<br />
ForwardAgent yes<br />
<br />
# Shortening a host's name<br />
# so ssh my-short-name will work<br />
Host my-short-name<br />
User gpapilion<br />
ForwardAgent yes<br />
Hostname my.fully.qualified.hostname.com<br />
</code></p>
<h2>Do Use ForwardAgent</h2>
<p>This approximates single sign-on using SSH keys. As long as you are forwarding agent requests back to your original host, you should never be prompted for a password. I set my ~/.ssh/config to do this, but I will also use ssh -A on remote systems to keep from reentering password information.</p>
<p>*** EDIT ***</p>
<p>I&#8217;ve received a lot of feedback about this point. Some people have pointed out that this should not be used on untrusted systems. Essentially, your agent will always answer a forwarded agent request with the response to the challenge. If an attacker has compromised the system, or the file system&#8217;s enforcement of permissions is poor, your credential can be used in a sophisticated man-in-the-middle attack.</p>
<p>Basically, don&#8217;t ever SSH to non-trusted systems with this option enabled, and I&#8217;d extend this and say don&#8217;t ever log in to non-trusted systems.</p>
<p><a href="http://unixwiz.net/techtips/ssh-agent-forwarding.html#fwd">This article</a> does a good job of explaining how agent forwarding works. This <a href="http://en.wikipedia.org/wiki/Ssh-agent#Security_issues">article</a> on Wikipedia explains the security issue.</p>
<h2>Don&#8217;t Only Keep Online Copies of Your Keys</h2>
<p>Keep an offline backup. You may need to get access to a private key, and it&#8217;s always good to keep an offline copy for an emergency.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.hypergeometric.com/2012/02/22/ssh-dos-and-donts/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Technical Debt Better Than Not Doing It</title>
		<link>http://blog.hypergeometric.com/2012/02/16/techincal-debt-better-than-not-doing-it/</link>
		<comments>http://blog.hypergeometric.com/2012/02/16/techincal-debt-better-than-not-doing-it/#comments</comments>
		<pubDate>Fri, 17 Feb 2012 03:58:32 +0000</pubDate>
		<dc:creator>papilion</dc:creator>
				<category><![CDATA[DevOps]]></category>

		<guid isPermaLink="false">http://blog.hypergeometric.com/?p=121</guid>
		<description><![CDATA[It&#8217;s time to admit that sometimes it&#8217;s okay to incur technical debt, particularly when it comes to getting it done. So many times, I&#8217;ve run into places that have constipated operations environments, or automation processes, because something is hard to do automatically. If you can&#8217;t automate it, don&#8217;t block all other tasks because of <a href="http://blog.hypergeometric.com/2012/02/16/techincal-debt-better-than-not-doing-it/"> read more <span class="meta-nav">&#187;</span></a>]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s time to admit that sometimes it&#8217;s okay to incur technical debt, particularly when it comes to getting it done. So many times, I&#8217;ve run into places that have constipated operations environments, or automation processes, because something is hard to do automatically.</p>
<p>If you can&#8217;t automate it, don&#8217;t block all other tasks because of one issue. It&#8217;s better to have a partially automated solution than none at all. Just make sure you document it, and come back later when you have more time. Don&#8217;t let your tools be your excuse for not doing it; it only makes you look bad.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.hypergeometric.com/2012/02/16/techincal-debt-better-than-not-doing-it/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>User Acceptance Testing for Successful Failovers</title>
		<link>http://blog.hypergeometric.com/2012/02/08/user-acceptance-testing-for-successful-failovers/</link>
		<comments>http://blog.hypergeometric.com/2012/02/08/user-acceptance-testing-for-successful-failovers/#comments</comments>
		<pubDate>Wed, 08 Feb 2012 05:50:14 +0000</pubDate>
		<dc:creator>papilion</dc:creator>
				<category><![CDATA[DevOps]]></category>

		<guid isPermaLink="false">http://blog.hypergeometric.com/?p=117</guid>
		<description><![CDATA[Things fail, we all know that. What most people don&#8217;t take into account is that things fail in combination and unexpected ways. We spend time and effort planning redundancy and failover schemes to seamlessly continue operations, but often neglect to fully test these plans before rolling services and equipment into production. What inevitably happens is <a href="http://blog.hypergeometric.com/2012/02/08/user-acceptance-testing-for-successful-failovers/"> read more <span class="meta-nav">&#187;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Things fail, we all know that. What most people don&#8217;t take into account is that things fail in combination and in unexpected ways. We spend time and effort planning redundancy and failover schemes to seamlessly continue operations, but often neglect to fully test these plans before rolling services and equipment into production. What inevitably happens is that the service fails, because the fail-over plan never worked, or had not considered what issues might arise while failing over. So, borrowing the concept of User Acceptance Testing (UAT) from software development, we can develop a system of tests that lets us feel confident our redundancy plans will work when we need them.</p>
<h2>Test Cases</h2>
<p>Build a test plan; it&#8217;s that simple. Start by identifying the dependent components of your system, then look at all the typical failure scenarios that may happen in those components. If you have two switches, what happens if one dies? With bonded network interfaces, what happens if you lose an uplink on one of your switches?</p>
<p>After you identify the failure scenarios, specify the expected behavior for each scenario. If a switch dies, network traffic should continue to be sent through the remaining switch. If interface one loses its ability to route traffic, interface two should become the primary interface in the bond.</p>
<p>Combining the two pieces should give you a specification of how you expect the system to behave in the case of these failures. You can organize these any way you want, but I typically use a user-story-like format to describe the failure and expected outcome.</p>
<p>Example Test case:</p>
<ul>
<li>Switch 1 stops functioning</li>
<ul>
<li>Switch 2 takes over VRRP address</li>
<li>Switch 2 passes traffic with minimal interruption, within 3 seconds.</li>
<li>Nagios alerts that switch 1 has failed</li>
</ul>
</ul>
<ul>
<li>App server loses DB connection</li>
<ul>
<li>load-balancer detects error, and removes host</li>
<li>load-balancer continues to pass traffic to other app-servers</li>
<li>Nagios alerts that app-server has failed</li>
</ul>
</ul>
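<p>If you would rather keep these cases in a script than in a document, the same user-story format can be sketched as a small data structure. The class and method names here are hypothetical, and it only records what a person observed during a test; it does not run the failure for you.</p>

```python
from dataclasses import dataclass, field


@dataclass
class FailureCase:
    """One failure scenario plus the behavior you expect to observe."""
    failure: str
    expected: list
    results: dict = field(default_factory=dict)

    def record(self, expectation, passed, notes=""):
        """Record the outcome of one expectation during a test run."""
        if expectation not in self.expected:
            raise ValueError("unknown expectation: " + expectation)
        self.results[expectation] = (passed, notes)

    def passed(self):
        """A case passes only if every expectation was checked and held."""
        return all(self.results.get(e, (False, ""))[0] for e in self.expected)


switch_case = FailureCase(
    failure="Switch 1 stops functioning",
    expected=[
        "Switch 2 takes over VRRP address",
        "Switch 2 passes traffic within 3 seconds",
        "Nagios alerts that switch 1 has failed",
    ],
)
```

<p>An expectation that nobody checked counts as a failure, so a case cannot pass by omission.</p>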
<p>Once you&#8217;ve completed your plan, get buy-in for it. You&#8217;ll want a few of your peers to review it and look it over for any failures you may have missed. Once you have agreement that this is the right test set, it&#8217;s time for the next step.</p>
<h2>Writing Artificial Tests</h2>
<p>Start brainstorming ways to test failure modes. Simple non-destructive tests are best: emulate a switch failure by unplugging a switch. To simulate a host&#8217;s network interface failing, block its port on the switch. To simulate a frozen system, block the load balancer from connecting to it via a host-level firewall. You may want to take things a step further, like pulling a disk to test RAID recovery. </p>
<p>Remember you&#8217;re trying to test your failover plans, and you should not be terribly concerned if you break a configuration in the process, because this may also happen when something really goes down. Write down all the steps of each test, and it&#8217;s also a good idea to write down how you get back to the known state. </p>
<p>Review your test cases, and make sure you have tests that address each failure mode. If it&#8217;s impossible to test a scenario, note it, and exclude it from your UAT. Once you&#8217;ve done that, you&#8217;re ready to test.</p>
<h2>Performing the Tests</h2>
<p>Anyone involved in the day-to-day technical operations should be able to run through the tests. It&#8217;s not a bad idea to have a whole team participate, so that people can get used to seeing how the system behaves when components are failing. Step through the tests methodically, and record whether each test passed or failed, and how the system behaved during the process. For example, if you&#8217;re testing the failure of an app server, did any errors show up on HTTP clients, and if so for how long?</p>
<h2>Failing</h2>
<p>This is going to happen, and when it does it is time to figure out why. Firstly, was this a configuration error, or the artifact of a previous test? If so, fix it, update your test plan, and start testing again. Did your redundancy plan have a fatal flaw? That&#8217;s ok too; that&#8217;s why we test. If you missed something in your plan, address the issue, and restart the test from scratch. You&#8217;re much better off catching problems in UAT than after you&#8217;ve pushed the service to production.</p>
<h2>Passing</h2>
<p>Keep a copy of the UAT somewhere, so if questions come up later you can discuss it. I use wikis for this, but any document will do. Once you have that sorted, you can roll your fancy new service into production.</p>
<h2>Summary</h2>
<p>UAT is a useful concept for software development, and it is also useful for production environments. Take your time and develop a good plan, and you&#8217;ll end up with longer up-times and a better chance of meeting your SLA requirements. As an added bonus, you gain experience seeing how your equipment or instances behave when something has gone wrong.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.hypergeometric.com/2012/02/08/user-acceptance-testing-for-successful-failovers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Solr Query Change Beats JVM Tuning</title>
		<link>http://blog.hypergeometric.com/2012/02/07/solr-query-change-beats-jvm-tuning/</link>
		<comments>http://blog.hypergeometric.com/2012/02/07/solr-query-change-beats-jvm-tuning/#comments</comments>
		<pubDate>Tue, 07 Feb 2012 15:57:02 +0000</pubDate>
		<dc:creator>papilion</dc:creator>
				<category><![CDATA[DevOps]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[devops]]></category>
		<category><![CDATA[java tuning]]></category>

		<guid isPermaLink="false">http://blog.hypergeometric.com/?p=110</guid>
		<description><![CDATA[I&#8217;ve been spending the last few days at work trying to improve our search performance, and have been banging my head against the dismax query target and parser in Solr. For those not familiar with the Dismax, it&#8217;s a simplified parser for Solr that eliminates the complexity from the Standard query parser. Instead of search <a href="http://blog.hypergeometric.com/2012/02/07/solr-query-change-beats-jvm-tuning/"> read more <span class="meta-nav">&#187;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been spending the last few days at work trying to improve our search performance, and have been banging my head against the dismax query target and parser in Solr. For those not familiar with <a href="http://wiki.apache.org/solr/DisMax">Dismax</a>, it&#8217;s a simplified parser for Solr that eliminates the complexity of the Standard query parser. Instead of search terms like &#8220;field_name:value&#8221; you can simply enter &#8220;value&#8221;, but you can no longer search for a specific term in a specific field.</p>
<p>Our search index has grown in the last few months by 20% and our JVM and Solr setups were beginning to groan under the weight of the data. I went through a few rounds of JVM tuning, which reduced garbage collection time to less than 2%, and with some Solr configuration options managed to bring our typical query back under 5 seconds. This felt like a major win, until I adjusted the query.</p>
<p>Looking at our query parameters on search, I noticed we were using the &#8220;<a href="http://wiki.apache.org/solr/CommonQueryParameters#fq">fq</a>&#8221; parameter to specify the id of the particular site we were looking for. These queries were taking anywhere from 5 to 15 seconds across our 360GB index, and I suspected that we were pulling data into the JVM only to filter it away. The garbage collection graphs seemed to indicate this as well, since we had a very slow-growing heap, and our eden space was emptying very quickly even with 20G allocated to it. When I changed from dismax to the standard target and specified the site id, my search time went from 5 seconds to 0.06 seconds, so I started reading, and came across <a href="http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/">an article on nested queries</a>. My idea was that this would allow me to apply a constraint to the initial set of data returned, using the standard search target, and then perform a full-text search using dismax and achieve the same results.</p>
<p>Original query (grossly simplified):<code><br />
http://search-server/solr/select?fl=title%2Csite_id%2Ctext&amp;qf=title%5E7+text&amp;qt=dismax&amp;fq=site_id:147&amp;timeAllowed=2500&amp;q=SearchTerm+&amp;start=0&amp;rows=20<br />
</code></p>
<p>Becomes the following nested query:<code><br />
http://search-server/solr/select?fl=title%2Csite_id%2Ctext&#038;qf=title%5E7+text&#038;timeAllowed=2500&#038;q=site_id:147+_query_:%22{!dismax}SearchTerm%22&#038;start=0&#038;rows=20<br />
</code></p>
<p>Original Query Time : 5 seconds<br />
Nested Query Time : 87 milliseconds</p>
<p>Both return identical results. So, if you&#8217;re performing a query against a large index and you want to use dismax, you should try using a nested search. You&#8217;re likely to see much better performance, particularly if you&#8217;re filtering based on a facet. It also gives you a relatively easy way to constrain the value of a field while still using a dismax query.</p>
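<p>If you build these URLs from code, the nested form is easy to get wrong by hand because of the quoting. Here is a minimal sketch in Python, assuming the same field names and parameters as the query above. The function name <code>nested_dismax_url</code> and the escaping of quotes in the search term are my own additions for illustration, not something Solr requires you to do this way.</p>

```python
from urllib.parse import urlencode


def nested_dismax_url(base_url, site_id, search_term, rows=20):
    """Build a select URL that constrains site_id with the standard parser
    and embeds the full-text part as a nested dismax query.

    Hypothetical helper: field names and parameter values mirror the
    example query in this post.
    """
    # Escape backslashes and double quotes so the term cannot break out
    # of the quoted _query_ clause (an assumption for this sketch).
    escaped = search_term.replace("\\", "\\\\").replace('"', '\\"')
    nested_q = 'site_id:{} _query_:"{{!dismax}}{}"'.format(site_id, escaped)
    params = [
        ("fl", "title,site_id,text"),
        ("qf", "title^7 text"),
        ("timeAllowed", 2500),
        ("q", nested_q),
        ("start", 0),
        ("rows", rows),
    ]
    # urlencode percent-encodes every reserved character, including ':'
    # and the braces, so the output looks noisier than the hand-written
    # URL but decodes to the same parameters.
    return base_url + "?" + urlencode(params)


url = nested_dismax_url("http://search-server/solr/select", 147, "SearchTerm")
```

<p>Note there is no <code>qt=dismax</code> and no <code>fq</code> here; the site constraint lives inside <code>q</code>, and dismax only handles the nested clause.</p>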
]]></content:encoded>
			<wfw:commentRss>http://blog.hypergeometric.com/2012/02/07/solr-query-change-beats-jvm-tuning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Language Importance for DevOps Engineers</title>
		<link>http://blog.hypergeometric.com/2012/02/05/language-importance-for-devops-engineers/</link>
		<comments>http://blog.hypergeometric.com/2012/02/05/language-importance-for-devops-engineers/#comments</comments>
		<pubDate>Sun, 05 Feb 2012 19:03:43 +0000</pubDate>
		<dc:creator>papilion</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.hypergeometric.com/?p=102</guid>
		<description><![CDATA[First and foremost this is a biased article. These are all my opinions, and come from my working experience. Bash(or Posix shell) Importance 10/10 If you&#8217;re working with *nix and can&#8217;t toss together a simple init.d script in 5 minutes, you haven&#8217;t done enough bash. It&#8217;s everywhere, and it should still be your first automation <a href="http://blog.hypergeometric.com/2012/02/05/language-importance-for-devops-engineers/"> read more <span class="meta-nav">&#187;</span></a>]]></description>
			<content:encoded><![CDATA[<p>First and foremost this is a biased article. These are all my opinions, and come from my working experience.</p>
<h2>Bash (or POSIX shell)</h2>
<p><strong>Importance 10/10</strong></p>
<p>If you&#8217;re working with *nix and can&#8217;t toss together a simple init.d script in 5 minutes, you haven&#8217;t done enough bash. It&#8217;s everywhere, and it should still be your first automation choice. It has a simple syntax, and is designed specifically to execute programs in a non-interactive manner. You&#8217;ll be annoyed that it lacks unit tests and complex error handling, but it&#8217;s purpose-built to automate administrative tasks.</p>
<h2>Perl</h2>
<p><strong>Importance 9/10</strong></p>
<p>This is the language that you will run into if you work in operations. There will be backup scripts, Nagios tests, and a large collection of digital duct tape written by co-workers, all doing very important jobs. Its syntax is ugly, and you may find yourself writing an eval to handle exceptions, but it&#8217;s everywhere. CPAN makes it fairly easy to get things done, and you can&#8217;t beat it for string handling.</p>
<h2>C/C++</h2>
<p><strong>Importance 5/10</strong></p>
<p>This is the Latin of the *nix world, and is basically portable assembly language. I refrain from writing C whenever possible, since I rarely need the raw performance, and the security and stability consequences are pretty severe. You should understand the syntax (it&#8217;s ALGOL, right?), and be able to read a simple application. It would be great if you could submit a patch to an open-source project, but I would never turn down an ops hire because they didn&#8217;t know C well enough.</p>
<h2>PHP</h2>
<p><strong>Importance 7/10</strong></p>
<p>PHP more important than C?! Yep. Like Perl, it&#8217;s everywhere; people use it for prototyped webapps and full-blown production systems. It&#8217;s another ALGOL-syntax language, except you can put together a simple web page in 2-3 minutes; it&#8217;s almost as magical as the Twilio API. You&#8217;ll find yourself poking at it on more than one occasion, so you might as well know what you&#8217;re doing.</p>
<h2>Ruby</h2>
<p><strong>Importance 6/10</strong></p>
<p>Doing something with Puppet or Chef? You probably should know some Ruby, and in fact it&#8217;s probably more important to know Ruby than Chef or Puppet. It&#8217;s relatively easy to pick up, and so many of the automation tools people love are written in it. As an extra bonus, you could write Rails and Sinatra apps. It&#8217;s good to have in your back pocket.</p>
<h2>Python</h2>
<p><strong>Importance 4/10</strong></p>
<p>People love to love Python, but the truth is that it&#8217;s a bit of a diva. It&#8217;s a language that favors reading over writing, and has a very bloated standard library with lots of broken components (which is the right HTTP library to use?). It wants to be a simpler Perl, but I never find it as useful, and it always takes longer. I know a lot of companies say they want to use it as their &#8220;scripting&#8221; language, but in practice I&#8217;ve not seen the value (I still want to rewrite everyone&#8217;s code).</p>
<h2>Chef/Puppet</h2>
<p><strong>Importance 2/10</strong></p>
<p>These are DSLs for configuration management. They are supposed to be simple to learn, and if you can&#8217;t figure them out with a web browser and a few minutes, they are failing.</p>
<h2>Java</h2>
<p><strong>Importance 6/10</strong></p>
<p>More ALGOL syntax, and more prevalent in high-scale web applications. Minimally you should be able to read this language, but it&#8217;s useful to be able to pound out a few lines of Java. It has many rich frameworks, and you&#8217;ll likely find it sneaking into your stack where you need something done fast. Also, it is really useful when it comes time to tune the JVM.</p>
<h2>Haskell</h2>
<p><strong>Importance 0/10</strong></p>
<p>When I&#8217;ve run into it running someplace serious I&#8217;ll update its score.</p>
<h2>JavaScript</h2>
<p><strong>Importance 8/10</strong></p>
<p>I hate this language, but I can&#8217;t deny its growing importance. It&#8217;s more commonly seen in a web browser, but it&#8217;s starting to creep into the backend with things like node.js. If you can understand JavaScript, you can help resolve whether an issue is a frontend or backend problem; you will have total stack awareness.</p>
<h2>SQL</h2>
<p><strong>Importance 10/10</strong></p>
<p>You have to know SQL. You will work with SQL databases, and you will want to move things in and out of them. You may want to know a dialect like MySQL very well, but you should understand the basics, and at a minimum be able to join a few tables, or create an aggregate query.</p>
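<p>As a rough yardstick for &#8220;the basics&#8221;, here is the sort of join plus aggregate I have in mind. It is a self-contained sketch using an in-memory SQLite database from Python, so the <code>sites</code> and <code>pages</code> tables and their contents are invented purely for illustration.</p>

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sites (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE pages (id INTEGER PRIMARY KEY, site_id INTEGER, bytes INTEGER);
    INSERT INTO sites VALUES (1, 'blog'), (2, 'shop');
    INSERT INTO pages VALUES (1, 1, 100), (2, 1, 250), (3, 2, 400);
""")

# Join pages to their site, then aggregate per site:
# number of pages and total bytes.
rows = conn.execute("""
    SELECT s.name, COUNT(p.id), SUM(p.bytes)
    FROM sites AS s
    JOIN pages AS p ON p.site_id = s.id
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()

for name, page_count, total_bytes in rows:
    print(name, page_count, total_bytes)
```

<p>The same query should run with little or no change on MySQL or PostgreSQL; the dialect differences tend to show up in the more exotic features, not here.</p>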
]]></content:encoded>
			<wfw:commentRss>http://blog.hypergeometric.com/2012/02/05/language-importance-for-devops-engineers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>


