Implementing ChatOps into our Incident Management Procedure
Shopify BBloh
Production engineers (PEs) are expected to be incident management experts. Still, incident handling is difficult, often messy, and exhausting. We encounter new incidents, search high and low for possible explanations, sometimes tunnel on symptoms, and, under pressure, forget some best practices. At Shopify, we care not only about handling incidents quickly and efficiently, but also about PE well-being. We have a special IMOC (incident manager on call) rotation and an incident chatbot to assist IMOCs.

The IMOC’s role is to lead the incident response. In Shopify’s version of the Incident Command System (ICS) model, the hierarchy is simplified to three roles: the Incident Commander (which we call the IMOC), who leads the incident response; the Public Information Officer, called a Support Response Manager (SRM) at Shopify, who takes care of public communication; and an operations section that directs all the actions needed to solve the incident, usually the component experts in our case. It’s essential to note that the IMOC is on call for coordinating the incident response, not for fixing production issues (which is the component expert’s mission). The IMOC ensures that the incident goes through the following steps: failure detection, starting the incident, communication, fixing and mitigating, stopping the incident, and documenting the service disruption.

During an incident, incident response steps shouldn’t be left to memory, especially when the goal is to consistently offer an effective, streamlined experience. Our chatbot, Spy, makes this easier by assisting the IMOC through the incident response. Spy features a set of incident commands that help reduce manual effort and context switching. We integrated the bot with our conversation tool and several third-party tools (PagerDuty, StatusPage, GitHub) to send timely reminders.

Here is an overview of our current ChatOps setup. At Shopify, we use Slack as our chat app, and our main bot, Spy, stems from Lita. Lita is open source, written in Ruby, and easy to extend.
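To make the idea of extending a chat bot with commands more concrete, here is a minimal sketch of pattern-based command routing. It is written in Python purely for illustration; it is not Spy's code and does not reproduce Lita's Ruby API. The `route` decorator, the `dispatch` function, and the reply strings are all hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical route table: (compiled pattern, handler) pairs, checked in order.
Handler = Callable[["re.Match[str]"], str]
_routes: List[Tuple["re.Pattern[str]", Handler]] = []


def route(pattern: str) -> Callable[[Handler], Handler]:
    """Register a handler for chat messages that match `pattern`."""
    def decorator(func: Handler) -> Handler:
        _routes.append((re.compile(pattern, re.IGNORECASE), func))
        return func
    return decorator


@route(r"^spy page (?P<target>\S+) (?P<reason>.+)$")
def page(match: "re.Match[str]") -> str:
    # A real bot would call a paging service here; this only builds a reply.
    return f"Paging {match['target']}: {match['reason']}"


@route(r"^spy incident (?P<action>\w+)")
def incident(match: "re.Match[str]") -> str:
    return f"Incident action requested: {match['action']}"


def dispatch(message: str) -> Optional[str]:
    """Return the reply of the first matching handler, or None if nothing matches."""
    text = message.strip()
    for pattern, handler in _routes:
        match = pattern.match(text)
        if match:
            return handler(match)
    return None
```

Keeping each command as a small handler behind a pattern means new commands can be added without touching the dispatch loop.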
Spy has three main sets of commands that help with IMOC duties:
• spy page
• spy incident
• spy status
Together, they help the IMOC go through the incident response funnel steps.

Step 1: Failure detection

Step 2: Start incident
An alert will page the IMOC. Someone who notices the failure may also page the IMOC via the `spy page` command. Example: `spy page imoc order notifications not going out`
Spy will then bind the incident to a #war-room Slack channel where all the discussions will take place.

Step 3: Communication
After the IMOC starts the incident, communication is crucial to ensure that every stakeholder is aware of the issue. Through the `spy incident tldr` command, anyone at the company can ask Spy at any given moment which incidents are ongoing, see who is involved and when each one started, and read a brief summary.
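The start and communication steps can be sketched as a small in-memory incident log. This is a rough illustration in Python under stated assumptions, not Spy's implementation: the `Incident` and `IncidentLog` names, the fields, and the summary format are all hypothetical, and a real bot would persist its state and talk to Slack.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Incident:
    summary: str
    imoc: str
    channel: str
    started_at: datetime
    responders: List[str] = field(default_factory=list)
    stopped_at: Optional[datetime] = None


class IncidentLog:
    """Hypothetical in-memory store of incidents."""

    def __init__(self) -> None:
        self._incidents: List[Incident] = []

    def start(self, summary: str, imoc: str, now: Optional[datetime] = None) -> Incident:
        # Bind the new incident to the war-room channel named in the article.
        incident = Incident(
            summary=summary,
            imoc=imoc,
            channel="#war-room",
            started_at=now or datetime.now(timezone.utc),
        )
        self._incidents.append(incident)
        return incident

    def tldr(self) -> str:
        """Summarize ongoing incidents: what, since when, who is involved, where."""
        ongoing = [i for i in self._incidents if i.stopped_at is None]
        if not ongoing:
            return "No ongoing incidents."
        lines = []
        for i in ongoing:
            people = ", ".join([i.imoc] + i.responders)
            lines.append(
                f"{i.summary} | started {i.started_at:%Y-%m-%d %H:%M} UTC"
                f" | involved: {people} | {i.channel}"
            )
        return "\n".join(lines)
```

For example, after `log.start("order notifications not going out", imoc="alice")`, calling `log.tldr()` returns a one-line summary that anyone can read without interrupting the responders.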
Step 4: Fix and mitigate
Spy can perform different mitigation actions, as it is closely embedded in our infrastructure. Some examples include rebalancing traffic, failing over data centers, blackholing jobs, and locking deploy stacks.

Step 5: Stop incident
Once the fix has been shipped and verified in production, the IMOC can use the `spy incident stop` command, which generates a service disruption document for the IMOC to verify and post once ready.

Step 6: Document the service disruption
Spy will add any #war-room notes tagged with a notepad emoji or prefixed with the `spy incident note` command to the service disruption document and post the resulting document in a direct message to the IMOC.

Spy also sends timely reminders. For instance, if an incident has been ongoing for a while and the status page hasn’t been updated, Spy will send the IMOC a reminder. It also has a built-in on-call fatigue prevention mechanism: if the IMOC has been handling an incident for a pre-specified amount of time, Spy will reach out to the IMOC squad to help the current IMOC.
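The reminder and fatigue checks described above boil down to comparing elapsed time against thresholds. Here is a minimal Python sketch, assuming made-up threshold values (the article only says "a while" and "a pre-specified amount of time") and a hypothetical `due_reminders` function; it is not Spy's actual logic.

```python
from datetime import datetime, timedelta
from typing import List

# Assumed thresholds for illustration only; the real values are not given in the article.
STATUS_PAGE_REMINDER_AFTER = timedelta(minutes=15)
IMOC_FATIGUE_AFTER = timedelta(hours=2)


def due_reminders(started_at: datetime, status_page_updated: bool, now: datetime) -> List[str]:
    """Return the reminders that should be sent for an ongoing incident."""
    elapsed = now - started_at
    reminders: List[str] = []

    # Nudge the IMOC if the incident has run for a while with no status page update.
    if not status_page_updated and elapsed >= STATUS_PAGE_REMINDER_AFTER:
        reminders.append("remind IMOC: update the status page")

    # Fatigue prevention: ask the IMOC squad to help after a long-running incident.
    if elapsed >= IMOC_FATIGUE_AFTER:
        reminders.append("ask IMOC squad: offer help to the current IMOC")

    return reminders
```

A bot could run a check like this on a timer for every open incident and post each returned reminder to the right person or channel.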
Link: https://engineering.shopify.com/blogs/engineering/implementing-chatops-into-our-incident-management-procedure