My Ramblings

StubHub Lessons Learned

02 July 2012

Here are a few good lessons shared by StubHub. There is a lot more information in the post but this is what caught my eye.

Instrument requests so that a request can be traced through the entire stack
There is nothing like failure to teach you how to do things right
Invest in continuous improvement by performing postmortems and ensuring that issues don't come up again
Build operational valves into the system so that you can easily swap out components when needed
Implement multiple solutions and have a bake-off in production to determine which version works better
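
As a rough illustration of the first lesson above (tracing a request through the entire stack), here is a minimal Python sketch. It is not how StubHub does it; the function names and the X-Request-ID header are assumptions chosen for the example. The idea is simply that one ID is minted at the edge, stamped on every log line, and handed to the next tier.

```python
import contextvars
import uuid

# Hypothetical storage for the ID of the request currently being handled.
_request_id = contextvars.ContextVar("request_id", default=None)


def start_request(incoming_id=None):
    """Reuse the caller's ID if one arrived, otherwise mint a new one."""
    rid = incoming_id or uuid.uuid4().hex
    _request_id.set(rid)
    return rid


def trace(layer, message):
    """Build a log line tagged with the current request ID."""
    return f"request_id={_request_id.get()} layer={layer} {message}"


def outbound_headers():
    """Headers to attach when calling the next tier (header name is an assumption)."""
    return {"X-Request-ID": _request_id.get()}


if __name__ == "__main__":
    start_request()                       # at the edge: no ID yet, so mint one
    print(trace("web", "received order lookup"))
    print(trace("db", "query finished"))  # same ID, different layer
    print(outbound_headers())
```

Searching the logs of every tier for a single request_id value then shows the whole path that one request took.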

http://highscalability.com/blog/2012/6/25/stubhub-architecture-the-surprising-complexity-behind-the-wo.html
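
The valve and bake-off lessons from the list above can be sketched in a few lines of Python. This is only a sketch under my own assumptions: the Valve class, its method names, and the weighting scheme are hypothetical and are not taken from the StubHub article. Traffic is split between interchangeable implementations by weight, per-implementation call and failure counts are kept for the bake-off, and setting a weight to 0 shuts a component off without a deploy.

```python
import random


class Valve:
    """Route calls to one of several interchangeable implementations."""

    def __init__(self, implementations, weights):
        self.implementations = dict(implementations)
        self.weights = dict(weights)
        # Simple bake-off scoreboard: calls and failures per implementation.
        self.calls = {name: 0 for name in self.implementations}
        self.failures = {name: 0 for name in self.implementations}

    def set_weight(self, name, weight):
        """Turn the valve: a weight of 0 shuts an implementation off."""
        if name not in self.implementations:
            raise KeyError(name)
        self.weights[name] = weight

    def call(self, *args, rng=random):
        """Pick an open implementation by weight and invoke it."""
        names = [n for n in self.implementations if self.weights.get(n, 0) > 0]
        if not names:
            raise RuntimeError("all implementations are shut off")
        chosen = rng.choices(names, weights=[self.weights[n] for n in names])[0]
        self.calls[chosen] += 1
        try:
            return self.implementations[chosen](*args)
        except Exception:
            self.failures[chosen] += 1
            raise


if __name__ == "__main__":
    # Two hypothetical search backends competing in production.
    valve = Valve(
        {"old_search": lambda q: f"old:{q}", "new_search": lambda q: f"new:{q}"},
        {"old_search": 9, "new_search": 1},  # roughly 10% of traffic to the new one
    )
    for _ in range(100):
        valve.call("tickets")
    print(valve.calls, valve.failures)
```

In a real system the weights would live in configuration that operations can change at runtime, and the scoreboard would feed the same metrics used for monitoring.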

Resilience Engineering

29 June 2012

Here are some notes from a couple of interesting articles published by John Allspaw on Resilience Engineering.

Resilience is about being able to function, rather than being impervious to failure
Looking at the things that go right is a better strategy to improve resiliency
Failures in complex systems don't have a singular root cause
Identifying human error as a root cause should result in trying to figure out what led to the human error
Politically safe environments are required if you truly want to figure out what, how and why a human error occurred
In addition to learning why things go wrong we ought to learn just as much from why things go right
Safety is not the absence of incidents and failures but rather the presence of actions, behaviors, and culture that cause an organization to be safe
Anyone, at any time, no matter their seniority, can make a mistake or act under faulty assumptions
Making a mistake should be acceptable and admitting fault should be encouraged
Near-miss events are excellent learning opportunities: they are a small dose of failure that doesn't really hurt, they happen more frequently than real failures, and they are a powerful reminder that keeps up the "constant sense of unease" required for resilience in a system
The goal of a post-mortem should be to gather as much information as possible about an incident, mistake, etc. and to spread the observations within the organization in order to prevent similar incidents in the future
Components in complex systems come together to behave in ways that they never would have on their own in isolation
The Four Cornerstones of Resilience
- Anticipation - Knowing what to expect in the future
  - Architectural reviews
  - Operability reviews
  - Game day exercises
- Monitoring - Knowing what to look for
  - System metrics
  - Business metrics
  - Metrics on operations and activities of both infrastructure and staff
- Response - Knowing what to do
- Learning - Knowing what has happened
  - Post-mortem

Links

Deployment is just a part of dev/ops cooperation, not the whole thing

19 September 2011

Since embarking on my operations career I have written a ton of scripts, tools, daemons, etc. to allow our team to focus on the real work at hand. Whether for data collection, monitoring, trending, or deployment, operations teams write a lot of code that is never seen but always used. Writing these tools is only half the challenge; getting engineering to actually implement, instrument, and consume them has been, and still is, quite a challenge. Often these tools are only used during tough times, after the team has already been burned by an issue. John makes some good points on collaboration and communication between operations and development teams in his blog post, and it is worth a read.
