I sometimes joke that I have been on call 24×7 ever since I joined Zynga in October of 2008. I have helped out on many incidents and outages, each with its own traits and difficulties. There are many valuable skills that an on-call engineer can apply to firefighting an incident. In today’s post I want to focus on the one that I believe is among the most important for an on-call engineer: the ability to be the driver behind an incident response.
Driving an Incident
When I say that being the driver is one of the most important skills, it can easily be misunderstood. Being the driver does not necessarily mean that you are the one applying every fix and patching every server. Rather, it means being the one who sees the issue all the way through to a successful resolution.
There are several key components to being a successful driver of an incident.
Communication is by far the most important component. Whether you are communicating a status report or gathering details from a co-worker, it is crucial to be as clear as possible so that details are not misunderstood. This may mean repeating back what someone tells you in slightly different words so that both of you can confirm the details are understood. It also means asking the important questions when getting a status update. Questions such as the ones below help you understand the real status of someone’s work:
- What needs to be done and by whom?
- What is the ETA for the work?
- Is there anything preventing you from doing the work?
When writing status reports, your communication style is critical. You need to be honest and clear about the details, leaving out as much “fluff” as possible so that everyone reading knows exactly what is going on. Often these reports focus on communicating the impact and the current plan of action. Be careful to present the issue as realistically as possible without downplaying the impact, even if it is your fault or your team’s fault. More often than not, people don’t care about fault. They care about the impact and the ETA for resolution.
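One way to keep a report free of fluff is to decide on its fields ahead of time. The sketch below is only an illustration of that idea, not a tool described in this post: the `StatusReport` class and its field names are hypothetical, chosen to mirror the impact, plan, ETA and blocker questions discussed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StatusReport:
    """Hypothetical status report limited to the facts readers care about."""
    impact: str                      # who or what is affected right now
    plan: str                        # the current plan of action
    eta: str                         # expected time to resolution
    blockers: List[str] = field(default_factory=list)  # anything preventing the work

    def render(self) -> str:
        # One line per field, with nothing extra, so the report stays short.
        blockers = ", ".join(self.blockers) if self.blockers else "none"
        return "\n".join([
            f"Impact: {self.impact}",
            f"Plan: {self.plan}",
            f"ETA: {self.eta}",
            f"Blockers: {blockers}",
        ])


if __name__ == "__main__":
    report = StatusReport(
        impact="Logins failing",
        plan="Roll back",
        eta="30 minutes",
    )
    print(report.render())
```

A fixed shape like this makes it harder to bury the impact under explanations of fault, which is the point of the advice above.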
Escalation is another important skill that is often overlooked by those who are new or inexperienced. They may be shy or nervous about escalating an issue, or they may feel that they should try to figure it out themselves before getting others involved. My personal philosophy is to raise an item with its proper owner as soon as possible and let that owner decide how urgent it is.
When escalating items to other teams, remember to do so without emotion. Yelling will never end well in the long run. Be clear about the issue you are seeing and what its impact is. Feel free to communicate the urgency from your side, but be respectful of the priorities that the other team may be facing. This might be a high priority for you, but the other team may also be dealing with items that have a larger impact.
As the incident driver, you will need to prioritize the various issues that are in your way. For example, during a major outage more than one item may be broken. You will need to put the item with the most impact first. Understanding the impact of the incident you are driving is crucial so that you can make sound decisions about which fixes to apply.
Prioritization also requires you to make calls about when work will be done and what type of work it will be. At times the incident you are driving may be after hours. You will need to separate the items that must be fixed then and there from the items that can wait until office hours. At times a proper fix may take a long time, while a quick temporary solution could restore health sooner. As the incident driver, you will need to prioritize temporary fixes versus permanent fixes.
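The ordering and the now-versus-later split described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a process from this post: the `Issue` class, the 1 to 10 impact score, and the `urgent_threshold` cutoff are all hypothetical, and real impact is rarely a single number. It does not try to model the temporary-versus-permanent decision, which remains a judgment call.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Issue:
    name: str
    impact: int  # hypothetical score: 1 (minor) to 10 (severe)


def triage(
    issues: List[Issue],
    after_hours: bool,
    urgent_threshold: int = 7,  # assumed cutoff for "cannot wait until office hours"
) -> Tuple[List[Issue], List[Issue]]:
    """Return (fix_now, can_wait), each ordered by highest impact first."""
    ordered = sorted(issues, key=lambda issue: issue.impact, reverse=True)
    fix_now: List[Issue] = []
    can_wait: List[Issue] = []
    for issue in ordered:
        # During office hours everything is worked in impact order.
        # After hours, only issues at or above the threshold are worked now.
        if not after_hours or issue.impact >= urgent_threshold:
            fix_now.append(issue)
        else:
            can_wait.append(issue)
    return fix_now, can_wait


if __name__ == "__main__":
    broken = [
        Issue("slow dashboard", 3),
        Issue("login outage", 9),
        Issue("email delay", 7),
    ]
    now, later = triage(broken, after_hours=True)
    print("Fix now:", [issue.name for issue in now])
    print("Can wait:", [issue.name for issue in later])
```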
So remember: when you are the one driving an incident, be the driver. Communicate clearly and without emotion. Do not hesitate to escalate issues to the appropriate team. Keep the incident prioritized around what needs to be resolved first. You will get better and better as time goes on.