My Ramblings

If you are reading this you must be pretty bored…

StubHub Lessons Learned

Here are a few good lessons shared by StubHub. There is a lot more information in the post, but this is what caught my eye.

  • Instrument requests so that a request can be traced through the entire stack
  • Nothing teaches you how to do things right like failing
  • Invest in continuous improvement by performing postmortems and ensuring that issues don't come up again
  • Build operational valves into the system so that you can easily swap out components when needed
  • Implement multiple solutions and have a bake-off in production to determine which version works better
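
The operational valve and bake-off lessons fit together naturally, so here is a minimal sketch of how they might look in code. This is not StubHub's implementation; the `Valve` class, the `search_v1` and `search_v2` functions, and the `CATALOG` data are all hypothetical names made up for illustration. The idea is a runtime switch that splits traffic between implementations and records how long each one takes.

```python
import random
import time
from collections import defaultdict

# Hypothetical data and implementations, used only for illustration.
CATALOG = ["ticket", "stadium", "seat"]


def search_v1(query):
    return sorted(w for w in CATALOG if query in w)


def search_v2(query):
    return sorted(filter(lambda w: query in w, CATALOG))


class Valve:
    """Runtime switch that routes each call to one of several implementations."""

    def __init__(self, implementations, weights, rng=None):
        # implementations: name -> callable; weights: name -> relative traffic share
        self.implementations = dict(implementations)
        self.weights = dict(weights)
        self.rng = rng or random.Random()
        self.timings = defaultdict(list)  # name -> elapsed seconds per call

    def set_weights(self, weights):
        """Turn the valve: change traffic shares without redeploying."""
        self.weights = dict(weights)

    def call(self, *args, **kwargs):
        # Only implementations with a positive weight receive traffic.
        names = [n for n, w in self.weights.items() if w > 0]
        if not names:
            raise RuntimeError("valve is fully closed: no implementation enabled")
        shares = [self.weights[n] for n in names]
        name = self.rng.choices(names, weights=shares, k=1)[0]
        start = time.perf_counter()
        try:
            return self.implementations[name](*args, **kwargs)
        finally:
            # Record the timing even if the implementation raised.
            self.timings[name].append(time.perf_counter() - start)

    def report(self):
        """Call count and mean latency per implementation, for the bake-off."""
        return {
            name: {"calls": len(t), "mean_seconds": sum(t) / len(t)}
            for name, t in self.timings.items()
        }


# Split traffic evenly, then compare the two versions.
valve = Valve({"v1": search_v1, "v2": search_v2}, {"v1": 50, "v2": 50})
for _ in range(100):
    valve.call("ticket")
print(valve.report())
```

If one version misbehaves, calling `valve.set_weights({"v1": 1, "v2": 0})` sends all traffic back to the other. In a real system the weights would more likely live in a shared configuration store than in process memory.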

http://highscalability.com/blog/2012/6/25/stubhub-architecture-the-surprising-complexity-behind-the-wo.html

Resilience Engineering

Here are some notes from a couple of interesting articles published by John Allspaw regarding Resilience Engineering.

  • Resilience is about being able to function, rather than being impervious to failure
  • Looking at the things that go right is a better strategy to improve resiliency
  • Failures in complex systems don't have a singular root cause
  • Identifying human error as a root cause should result in trying to figure out what led to the human error
  • Politically safe environments are required if you truly want to figure out what, how and why a human error occurred
  • In addition to learning why things go wrong we ought to learn just as much from why things go right
  • Safety is not the absence of incidents and failures but rather the presence of actions, behaviors, and culture that cause an organization to be safe
  • Anyone, at any time, no matter their seniority, can make a mistake or act under faulty assumptions
  • Making a mistake should be acceptable and admitting fault should be encouraged
  • Near-miss events are excellent learning opportunities: they are a small dose of failure that doesn't really hurt, they happen more frequently than real failures, and they are a powerful reminder that keeps up the "constant sense of unease" required to provide resilience in a system
  • The goal of a post-mortem should be to gather as much information as possible about an incident, mistake, etc. and to spread the observations within the organization so that it doesn't happen again in the future
  • Components in complex systems come together to behave in ways that they never would have on their own in isolation
  • The Four Cornerstones of Resilience
    • Anticipation - Knowing what to expect in the future
      • Architectural reviews
      • Operability reviews
      • Game day exercises
    • Monitoring - Knowing what to look for
      • System metrics
      • Business metrics
      • Metrics on operations and activities of both infrastructure and staff
    • Response - Knowing what to do
    • Learning - Knowing what has happened
      • Post-mortem

Deployment is just a part of dev/ops cooperation, not the whole thing

Since embarking on my operations career I have written a ton of scripts, tools, daemons, etc. to allow our team to focus on the real work at hand. Whether it be for data collection, monitoring, trending, or deployment processes, operations teams write a lot of code which is never seen but always used. Writing these tools is only half the challenge; getting engineering to actually implement, instrument, and consume these tools has been and still is quite a challenge. Oftentimes these tools are only used during tough times, after already getting burned by an issue. John makes some good points on collaboration and communication between operations and development teams in his blog post, and it is worth a read.