Episode 10: Incident Response
by Imran Kasam & Steve Ledwith
This episode was published on 6 August 2020 and is approximately 51 minutes long. This episode is made possible by Glow Your Soul and Anchor.fm.
Overview
Are you looking to improve your incident response process for your applications? Have a recent outage that made for a bad night? This is the episode for you!
In this episode we discuss Incident Response at an Enterprise level! From monitoring, to response planning, troubleshooting, and how to handle an active event, we’ve got it covered. There’s even a starter template for your root cause analysis (RCA) available in the show notes.
This episode will help you be better prepared should your application fail.
Listen on Apple Podcasts
Listen on Anchor.fm
Listen on Spotify
Show Notes & Selected Links
To be good at incident response, you have to spend time thinking about it! This episode will provide you with all the information you need to build your response plan.
Incident Prevention
Monitoring is key.
On Call Person
- Who is responsible for watching the alerts from any systems?
- Who is first up for Pager Duty alerts?
- Think about how to help the on call resource when they have a long night
First Responder
- Highly trained to do the job
- The first person receiving an alert should be able to take action
- Have an escalation process, if the first responder doesn’t get there in a timely manner
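The escalation idea above can be sketched in a few lines of Python. This is only an illustration: the contact names, the 15-minute acknowledgement window, and the `contacts_to_notify` function are assumptions made for the example, not something prescribed in the episode.

```python
from datetime import datetime, timedelta

# Hypothetical escalation chain: first responder first, then backups.
ESCALATION_CHAIN = ["first-responder", "secondary-on-call", "engineering-manager"]

# Assumed acknowledgement window per level; pick what fits your team.
ACK_WINDOW = timedelta(minutes=15)


def contacts_to_notify(alert_time: datetime, now: datetime, acknowledged: bool) -> list[str]:
    """Return the contacts to notify for an alert; empty once it is acknowledged."""
    if acknowledged:
        return []  # someone is on it, so stop escalating
    elapsed = now - alert_time
    # One more level is added for every full window that passes unacknowledged.
    levels = min(len(ESCALATION_CHAIN), elapsed // ACK_WINDOW + 1)
    return ESCALATION_CHAIN[:levels]
```

With these assumed values, an alert that has gone unacknowledged for 20 minutes would reach both the first responder and the secondary on-call person.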
Duty Phone or Bat Phone
- A dedicated phone for use by the On Call team
- Have a process for ensuring the right person has the phone, if it’s a physical phone
- If it’s a forwarding system, you’ll need to have a process for updating the destination phone number
8 Min: Application Stack Training
- How do you help the First Responder take action?
- Create a “Run Book”
- As part of making an application ready for production, make sure you have a run book which explains some basic troubleshooting steps
- What’s the first thing which should be done?
- Who does the First Responder reach out to when the steps provided don’t fix the problem?
9:50 Min: Alerts which are White Noise
- You can’t have meaningless alerts, or your On Call team will start to ignore the issue
- Find a way to clean up spurious alerts
Severity and Priority of the alert determine the medium of the notification.
Not every system is important at 2 AM. Some sites might not be on your critical list for immediate action. Make sure your monitoring, alerts, and response processes take this into account.
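As a rough sketch of how severity could drive the notification medium, the Python below maps each level to a channel and only pages for systems on the critical list. The severity levels, channel names, and the `notification_channel` function are all hypothetical; your monitoring tool will have its own way of expressing this.

```python
from enum import Enum


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical mapping from severity to notification medium.
CHANNEL_BY_SEVERITY = {
    Severity.LOW: "email",
    Severity.MEDIUM: "chat",
    Severity.HIGH: "page",
}


def notification_channel(severity: Severity, system_is_critical: bool) -> str:
    """Pick a notification medium; only critical systems may page someone."""
    channel = CHANNEL_BY_SEVERITY[severity]
    if channel == "page" and not system_is_critical:
        return "chat"  # non-critical systems can wait until morning
    return channel
```

The design choice here is that a high-severity alert on a non-critical system is downgraded to chat instead of waking someone at 2 AM.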
12:45 Min: Everything Always Works, Right?
- No, unfortunately, shit happens.
“Don't wait until the third or fourth time you have an outage to put a plan in place. Do it after the first one! (Or before you have one!)”—Steve Ledwith
14:30 Min: Running an Outage
- When the First Responder can’t fix it, start getting others involved.
- Maybe you need to alert a third party, or open a ticket with a vendor
- Get the right subject matter experts involved
- Start a bridge line / conference call / hangout
Pro Tip: Do not have multiple silos of communication that require someone on the main call to go talk with another set of resources. Playing a support version of the game “Telephone” won’t make things better.
You never know who will have the answer to the problem at hand. Having everyone in the same place, be it a conference room, phone call, or hangout, allows everyone to hear the important (and not as important) information.
Working on the problem together magnifies the energy of the team. Don’t cheat yourself.
19 Min: Running an Incident
- Define the key players in the Run Book
What is a Run Book? According to Wikipedia a run book is a compilation of routine procedures and operations that the system administrator or operator carries out. It usually contains procedures to begin, stop, supervise, and debug the system.
Communications
- Both internal and external audiences need updates
- Have a regular, pre-defined, cadence for updates
- Every 30 minutes post an update to your public facing site
- Senior Leadership
- Ignore this group of stakeholders at your own peril! Let this group know what’s going on and what the current status is. Get in front of the problem.
- Keep the Senior Leadership in the know. Tell them when the next update will be too. This group wants to support you; help them do so.
“Communicate regularly when you are working through an issue. If you're not communicating, people assume there's no progress.”—Steve Ledwith
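A pre-defined cadence is easy to write down ahead of time. The Python sketch below lists the next few update times for an outage, using the 30-minute interval mentioned above as the default. The `update_schedule` function name and its parameters are assumptions made for this example.

```python
from datetime import datetime, timedelta


def update_schedule(
    start: datetime,
    count: int,
    every: timedelta = timedelta(minutes=30),
) -> list[datetime]:
    """Return the times of the next `count` status updates after `start`."""
    return [start + every * n for n in range(1, count + 1)]
```

For an outage that starts at 02:00, `update_schedule(start, 3)` gives 02:30, 03:00, and 03:30, which is also what you can tell Senior Leadership when they ask when the next update will be.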
Troubleshooting
- Have a person dedicated to reviewing the log files for the applications
- This person has to know enough to filter out the noise
- Make sure the people who need access have it; follow the proper channels to grant additional access, if necessary
- Every application is different, and the skills necessary to solve it will vary every time
- Learning how to troubleshoot is an art… It’s based on experience and insight, and knowing how things work together
- Is it hard to learn how to figure out the right things to look at?
- This is the hardest part of the process, and there’s no magic here
Prior Planning
- Have a plan for how you’re going to do all of these things
- Think about it like a fire drill safety plan from your in-person office, or your school when you were a kid
- Practice it. Review it. Make sure those involved know the plan.
- What’s your process for granting access to production?
- Have a script for what you want to accomplish when an outage starts
- Initial communication
- Grant access
- Download logs
You train the way you fight! – Imran Kasam
Shoot. Move. Communicate.
- Have a mantra for your team.
- Focus on the problem at hand, not all the rest of the things going on.
- Take a shot at fixing it.
- Move on to the next step.
- Communicate to your users and stakeholders
Pro Tip: Establish a Note Taker / Outage Tracker
- You want to have someone taking notes and keeping track of time.
- You need someone to call out when you’re repeating tasks, to make sure you’re not duplicating efforts.
- Huge help for doing the root cause analysis and working through your timeline.
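If the note taker works in a script or notebook instead of a document, a small structure like the Python sketch below could cover both jobs: a timestamped timeline and a check for repeated tasks. The `OutageLog` class and its methods are hypothetical, and matching actions by lowercased text is a deliberately naive assumption.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class OutageLog:
    """Timestamped notes kept by the note taker during an outage."""

    entries: list[tuple[datetime, str, str]] = field(default_factory=list)

    def record(self, when: datetime, who: str, action: str) -> None:
        """Add one timeline entry: when it happened, who did it, and what they did."""
        self.entries.append((when, who, action))

    def repeated_actions(self) -> list[str]:
        """Return actions logged more than once (compared case-insensitively)."""
        counts = Counter(action.strip().lower() for _, _, action in self.entries)
        return sorted(action for action, n in counts.items() if n > 1)
```

The recorded entries double as the raw timeline for the root cause analysis afterwards.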
34 Min: Root Cause Analysis (RCA)
- After an off hours outage, get some rest.
- When you’re back at it, figure out what happened.
- You probably have a Service Level Agreement (SLA) with your users, or customers.
- You will have executives who want to know the complete down time and impact to the business.
- Be prepared to investigate and learn from what happened.
- You could have contracts with penalties related to downtime which will require specifics.
What Does “5 Nines” Mean?
- 99.999% (“five nines”) means 5.26 minutes of downtime in a year
- See the full chart as part of High Availability on Wikipedia
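The downtime figure above is simple arithmetic, and the Python sketch below reproduces it for any availability target. It assumes an average year of 365.25 days; the `allowed_downtime_minutes` name is made up for this example.

```python
# Assumption: an average year of 365.25 days, which accounts for leap years.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per year")
```

For 99.999% this works out to about 5.26 minutes per year, matching the figure above. Dropping a single nine to 99.99% allows roughly ten times as much.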
Continuous Learning with the Root Cause Analysis
- As you’re working through the RCA look for things you can improve in your process.
- How can you update your run book?
- Are there new things to monitor?
- Document what you did, what you looked into, and what worked.
- Write down things which didn’t work.
Key Components of the RCA
- Brief description of the event
- Accurate timeline
- People who were involved
- Actions which were taken (things that worked, and those which didn’t)
- Root Cause of the Outage
- Remedy
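The components listed above can also be captured as a structured record, which makes it harder to skip a section. The Python sketch below is one possible shape, not the template from the show notes. The `RootCauseAnalysis` class, its field names, and the idea of deriving downtime from the first and last timeline entries are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class RootCauseAnalysis:
    """One field per key component of the RCA."""

    description: str
    timeline: list[tuple[datetime, str]]  # (when, what happened)
    people_involved: list[str]
    actions_that_worked: list[str]
    actions_that_failed: list[str]
    root_cause: str
    remedy: str

    def downtime_minutes(self) -> float:
        """Minutes between the first and last timeline entries (0 if empty)."""
        times = [when for when, _ in self.timeline]
        if not times:
            return 0.0
        return (max(times) - min(times)).total_seconds() / 60
```

This treats the first timeline entry as the start of the outage and the last as the resolution, which only holds if the note taker logged both.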
Hold an Agile Style Retrospective about your Response
- Have a retrospective around the response to the outage.
- What worked?
- What needs to be improved?
- Did you have the right people involved?
- How did your partners respond?
- Did you include your partners?
- Do you have the right level of service from your providers?
- Do you have the right partners? Maybe they can’t meet your expectations!
- Do you need to update any processes or procedures?
- How can you streamline your response?
- Was the response appropriate for the system?
- Is there a place where you can spend a little money to make this process better?
Other Topics
- Once you’ve done an RCA, you’re going to find things which are important to you. Incorporate those findings into your future conversations.
- For example, when you’re evaluating a third party with Service Level Agreements, make sure their SLA meets your needs and will fit in your process.
- Was there one more person who could have helped?
- Were there people involved who didn’t need to be?
Episode Wrap Up
Successfully managing an application outage requires prior planning, a documented process, and a dedication to follow-up and learning from past events.
- Prevention
- Monitoring
- On call
- Escalating Alerts
- Run Book
- Response process
- Bridge line
- Communication with multiple audiences
- Troubleshooting
- Tools
- Access
- People
- Post Response Analysis
- Root Cause Analysis (RCA)
- Retrospective
As mentioned in the episode, we have a template for you to start with. You’ll want to dress it up to fit your business needs, but these are the important sections.
The template is available in ODT | PDF | TXT and is free to use as you see fit.
Final thought: Don’t go dark. Respond. Communicate. Communicate. Make sure everyone knows you’re working on the problem. Communicate.