There was quite the discussion last week in the office, and to be honest, it caused a bit of a row. See, there are three camps of people in this world:
1) people who believe you should fix something as early as possible - the Event People
2) people who believe you should fix something when there's an incident - the Incident People
3) people who want to know what an event or incident is - the Blissfully Unaware People
So the argument, if you can call it that, is about when you should try to fix something. Is it when the event is triggered and caught, or when the incident is created?
I'm firmly of the belief that the incident is the right time, but it took a bit of yelling to get that through. Here's my rationale:
1) an event exists only for a moment in time, and if you try and fix things from an event, you may be fixing something that doesn't exist any more;
2) an event has no record of it ever occurring;
3) once you've tried to fix it, you may well cause a different or allied problem to occur;
4) keeping state on events is Hard;
5) If you have to wait for the incident to be created before dispatching it to a fix agent, you're going to wait longer.
6) The incident is the appropriate place to document the unfolding of the fix.
7) An incident gives you a longer-term record that can be used for trending and analysis.
[I wrote the above way back in 2016; everything from this point forward is new]
Even after 2 years of thinking, analysis and real-life experience, I still take the same view. The point isn't really whether it's incident or event: the point is that there needs to be a way to track, over time, what happens in your infrastructure, application or otherwise, so that you can make it better.
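To make that "track, over time" idea a bit more concrete, here's a minimal sketch in Python. Everything in it is made up for illustration: the Event, Incident and IncidentStore names, the fields, and the in-memory list standing in for whatever ticketing system or database you'd really use. It's one way to picture the difference, not a recommendation for how to build it.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """A momentary signal; nothing persists unless something records it."""
    source: str
    kind: str
    at: datetime


@dataclass
class Incident:
    """A durable record opened from an event, carrying the fix history."""
    kind: str
    source: str
    opened_at: datetime
    log: list = field(default_factory=list)  # (timestamp, note) pairs
    resolved: bool = False

    def note(self, text, at=None):
        # Document each step of the fix as it unfolds.
        self.log.append((at or datetime.now(timezone.utc), text))


class IncidentStore:
    """Hypothetical in-memory store; a real one would be a database or ticketing tool."""

    def __init__(self):
        self.incidents = []

    def open_from(self, event):
        incident = Incident(event.kind, event.source, event.at)
        incident.note("opened from event", at=event.at)
        self.incidents.append(incident)
        return incident

    def trend(self):
        # Count incidents per kind: the longer-term view for trending and analysis.
        return Counter(i.kind for i in self.incidents)


store = IncidentStore()
now = datetime.now(timezone.utc)
first = store.open_from(Event("web-01", "disk_full", now))
first.note("rotated logs, freed space")
first.resolved = True
store.open_from(Event("web-02", "disk_full", now))
store.open_from(Event("db-01", "replication_lag", now))
```

The event objects are thrown away as soon as they've been handled; what's left is the incident, with its log of what was tried and a count per kind that you can look back on later.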
Does any of this change in a [insert term du jour here] world? I don't believe so. I think the only thing that changes is whether a human or an AI looks at the incident.
Sunday, April 8, 2018