Practical Considerations for AI Incident Reviews
I’ve seen a lot of interest recently in using large language models to generate incident review documents, but very few of the attempts I’ve observed have actually been successful. Why is that? Many people are approaching this problem without understanding the limitations of their tools, or what the point of doing incident reviews even is. In this post I’d like to expand on some things to consider if you want to take a stab at using LLMs for incident reviews without wasting anyone’s time.
The broadest category of problems that LLM incident reviews encounter is input data quality. Sometimes this problem is obvious: if you can’t get structured data about your incident into your LLM tooling at all (maybe you’re copying and pasting Slack messages into a chatbot text input, or even trying to one-shot an incident review document from just a description of the incident), then whatever output you get is likely to be much lower quality than even a cursory human-drafted incident review document.
Even once you start getting access to some sources of structured input data, like a model context protocol (MCP) server for Slack data, it’s easy to underestimate how much additional data you need. When I’m drafting a routine, quick incident review (one or two hours of work), I’m at minimum going to review Slack transcripts, telemetry data, and version control artifacts related to the incident. If you have access to all of these via MCP, integrated with one LLM, you are probably in the 99th percentile of AI adoption. If you don’t, you are going to find your generated incident summaries lacking the key context that would make them useful (more on this later).
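To make that concrete, here’s a rough sketch of the minimum context bundle I’d want assembled before asking a model to draft anything. The dataclass and the `fetch` calls are hypothetical stand-ins for whatever MCP servers or internal APIs you actually have:

```python
# Hypothetical sketch of the minimum context a human reviewer would read
# before drafting; the source objects and their fetch() methods are
# placeholders, not a real API.
from dataclasses import dataclass, field

@dataclass
class IncidentContext:
    slack_transcript: list = field(default_factory=list)  # channel messages and threads
    telemetry: list = field(default_factory=list)          # alerts, dashboards, traces
    version_control: list = field(default_factory=list)    # deploys, PRs, reverts

def gather_context(incident_id: str, sources: dict) -> IncidentContext:
    """Pull everything a human reviewer would read before drafting."""
    return IncidentContext(
        slack_transcript=sources["slack"].fetch(incident_id),
        telemetry=sources["telemetry"].fetch(incident_id),
        version_control=sources["vcs"].fetch(incident_id),
    )
```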
There are some smaller, funnier data quality problems you can run into as you’re trying to string together multiple data sources. My favorite is user identifiers: even for humans, it’s hard to tell that <@U123ABC> in a Slack transcript is the same person as @mygithubusername in a pull request and Jane Doe in a Zoom meeting. When we built AI-generated incident narratives in Jeli Narrative Builder, we could normalize user identifiers by querying a database that linked user accounts across all the data sources we ingested. Can you do the same?
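If you can build or borrow that kind of identity mapping, the normalization itself is straightforward. Here’s a minimal sketch; the table entries and matching rules are illustrative, not pulled from any real system:

```python
# Minimal sketch of identity normalization across data sources, assuming
# you maintain a lookup table that links each person's Slack ID, GitHub
# username, and meeting display name. All values below are made up.
import re

IDENTITY_TABLE = [
    {"canonical": "Jane Doe",
     "slack_id": "U123ABC",
     "github": "mygithubusername",
     "zoom_name": "Jane Doe"},
]

def canonical_name(raw: str) -> str:
    """Map a raw identifier from any source to a canonical person name."""
    slack_match = re.fullmatch(r"<@(\w+)>", raw)
    for person in IDENTITY_TABLE:
        if slack_match and slack_match.group(1) == person["slack_id"]:
            return person["canonical"]
        if raw.lstrip("@") == person["github"]:
            return person["canonical"]
        if raw == person["zoom_name"]:
            return person["canonical"]
    return raw  # leave unknown identifiers untouched

assert canonical_name("<@U123ABC>") == "Jane Doe"
assert canonical_name("@mygithubusername") == "Jane Doe"
```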
I’d be surprised if these problems exist in exactly the same form in a couple years – I like to say that MCP today reminds me of the early years of OpenTelemetry, and OTel adoption today has become vastly easier than it was when I first dipped my toes into it six years ago. These aren’t insurmountable issues, but it’s important to understand that the value you get out of your AI incident reviews will be directly proportional to the effort you put into building infrastructure to provide your AI tooling with high quality input data.
With that said, there are some other issues I’ve seen that have persisted across multiple generations of foundation models and that seem relatively sticky even as model capabilities improve across the board.
One issue I’ve been surprised to see little improvement on: one- or zero-shot incident review documents tend to have bad information architecture, and often summarize input data without grouping it in a meaningful way. Bad human writing often does this as well, of course, and I think it also mirrors the sorry state of incident review “best practices” that you see across the industry.
A broader problem, and one that I think contributes to the poor information architecture in these documents, is that a lot of the most useful data you can present in an incident review is interstitial: it lives in the gaps between your structured sources. You’ll run into a shallow version of this issue almost immediately: you almost certainly have structured data about your incident that starts when the incident was “opened”, but can you trace that back to when the incident was actually reported and the investigation actually started?
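To illustrate that shallow version: the official “opened” timestamp usually lags the first report, and reconciling the two is simple once you have both in hand. The message data below is made up, but the gap it exposes is exactly the kind of thing you want in the record:

```python
# Illustrative sketch of the "when did this actually start?" problem.
# The timestamps and messages are hypothetical; the point is that the
# formal "opened" event usually lags the first report by a real gap.
from datetime import datetime

incident_opened = datetime.fromisoformat("2024-05-01T14:32:00")

# Messages pulled from, say, a team Slack channel before any incident
# tooling was invoked.
earlier_messages = [
    {"ts": datetime.fromisoformat("2024-05-01T13:58:00"),
     "text": "anyone else seeing checkout latency spike?"},
    {"ts": datetime.fromisoformat("2024-05-01T14:05:00"),
     "text": "yeah, p99 is way up, looking now"},
]

first_report = min(m["ts"] for m in earlier_messages)
gap = incident_opened - first_report
print(f"Investigation started ~{gap} before the incident was formally opened.")
```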
It only gets harder from there. It’s easy to record that responders made a deductive leap connecting a particular symptom to a particular cause, but one of the most important things an incident review author can do is go ask those responders how they made that leap, and then record the answer for other people to learn from. Similarly, incidents often surface implicit tradeoffs that are left undocumented because they expose realities your organization would prefer not to acknowledge: how often do you document that you’re choosing not to invest in training or reliability because you’re prioritizing delivering features that actually drive revenue?
Just as smarter LLMs are getting better at expressing uncertainty rather than hallucinating, it seems like it should be possible to prompt a state-of-the-art model to identify these interstitial gaps in the official incident record so that a human can work with other humans to fill them in. Many people driving AI adoption for incident review don’t do this, though, because they don’t actually understand why incident reviews are valuable.
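Here’s a sketch of the kind of prompt I have in mind, asking the model to flag gaps rather than fill them. The `call_llm` function is a placeholder for whatever client your tooling uses, not a real API:

```python
# Sketch of a gap-finding prompt. `call_llm` is a stand-in for your own
# LLM client; nothing here is tied to a specific provider.

GAP_FINDING_PROMPT = """\
You are helping prepare an incident review. Below is the timeline and
transcript data we have so far.

Do NOT write a narrative or guess at causes. Instead, list the places
where the record is incomplete: deductive leaps no one explained,
time gaps with no recorded activity, decisions whose rationale is
missing, and tradeoffs that are implied but never stated. For each,
name the responder a human reviewer should follow up with.

--- INCIDENT DATA ---
{incident_data}
"""

def find_gaps(incident_data: str, call_llm) -> str:
    """Ask the model to point out holes in the record for humans to fill."""
    return call_llm(GAP_FINDING_PROMPT.format(incident_data=incident_data))
```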
A simple model I use for how organizations benefit from incident reviews is:
- One person learns a lot about the incident from writing an incident document, and maybe facilitating an incident review meeting
- That person is able to take their learnings and present them to the responders of the incident, who come away with a stronger understanding of the incident they were involved in
- The artifacts from the previous two steps become part of a long term record that a broader group of people in your organization can use to understand the systems they work on
As you go down that list, more people are affected by the process, but their engagement also becomes shallower. Critically, the whole process is anchored by a small group of investigators and responders who are the most invested in producing learnings from it.
If you’ve never thought about this before – and I am frequently shocked by the number of people who advocate for incident review processes but have never really considered how their organizations actually benefit from them – it’s possible to collapse the entire process by introducing an AI-generated incident review document in the wrong place. When you remove the person acting as a quality gate in the first step, there’s no one to drive engagement with the process in the second step. And then the outputs of the process sit neglected in a Google Drive folder somewhere.
Bringing this back to a more neutral stance: incident reviews are fundamentally a socio-technical process, and they provide no benefit if people don’t engage with them. When you’re thinking about introducing AI tooling into your incident reviews, it matters a lot whether people see it as an opportunity to extend your capacity for learning from incidents. If they don’t, it’s unlikely that teams who are already disengaged from your process will change their minds because you’ve built tooling that lets them engage with it even less.
To that end, I’m not particularly optimistic about “agentic” workflows for drafting incident reviews (if you disagree, I’d be happy to be proven wrong). I’d be more interested to see something like NotebookLM for incidents, where the goal is to help an analyst keep better track of all the sources of data they have for an incident. In particular, I think LLMs could be useful for analyzing recordings of incident calls, which are often long enough that it isn’t worth a human analyst’s effort to comb through them.
A couple of years ago I was pretty skeptical that AI tools would ever be valuable for incident reviews, and I’ve since softened my position a lot. But, as I’ve laid out in this post, I think the value from this tooling will go to organizations that already value learning from incidents, that are willing to put in the effort to build the data infrastructure that gives AI incident tools the context they need to function, and that understand incident reviews well enough to tell when their tooling is producing low-quality output. Conversely, if you aren’t learning from your incidents today, you probably don’t have the organizational muscle memory to get any benefit from producing a higher quantity of lower-quality incident documentation. The work of understanding your incidents is hard, but it isn’t too late to get started.