26 October 2021

🎉 Incident Management System (IMS)

So you've been using Cliff for a few weeks now. You love how you can spot unexpected changes in your metrics. You feel confident with your business processes.

But something's bothering you...

Every time you receive an alert for an unexpected spike or dip, you look towards the heavens and wonder: Is this it? There's got to be more to this.

And you're right! Getting the alert is good, but being able to take further action and address the anomaly right there would be absolutely amazing. And that's exactly what this release is all about.

We have redesigned our alert system from the ground up for cliff.ai users. We call it the Incident Management System.

Why build an Incident Management System?

Up until now, our alerting system was robust, but basic. After studying how our users work with Cliff, we identified these frequent pain points:

Problem 1: Some users didn't receive notifications when an anomaly occurred in a metric.

Reason: The “Alert Rules” weren't set up properly. This was because:

a. Users didn't know they were supposed to set up Alert Rules in order to receive notifications on their devices.

b. Or they simply forgot to set up Alert Rules.

Solution: Automate the Alert Rule setup process by providing a base Monitor by default (yes, we renamed Alert Rules to Monitors). Even if users don't set up any Monitor themselves at the start, the default Monitor makes sure they keep receiving notifications whenever anomalies occur.
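To make the fallback idea concrete, here is a minimal Python sketch. It is only an illustration, not Cliff's actual implementation: the `Monitor` fields, the `DEFAULT_MONITOR` recipient and the `effective_monitors` helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Monitor:
    name: str                 # hypothetical field: label shown to the user
    notify: Tuple[str, ...]   # hypothetical field: who receives notifications


# Assumption: the default Monitor notifies the account owner.
DEFAULT_MONITOR = Monitor(name="default", notify=("account-owner",))


def effective_monitors(user_monitors: Sequence[Monitor]) -> List[Monitor]:
    """Return the user's Monitors, or the default one if none were set up."""
    return list(user_monitors) if user_monitors else [DEFAULT_MONITOR]
```

With this shape, a user who never configures anything still has exactly one active Monitor, and a user who sets up their own Monitors simply replaces the default.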

[image]

Problem 2: Users could not choose whom to notify, and when, once an anomaly occurred.

Reason: Cliff had no way to inform specific people at specific times. For example: after an anomaly is detected, notify A. If she doesn't acknowledge it within the given time, notify B, and so on…

Solution: Establish an escalation policy that determines how, where and when the notifications will be escalated to specific team members or individuals when anomalies occur.
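The "notify A, then B if A doesn't acknowledge in time" logic can be sketched in a few lines of Python. This is a simplified model under our own assumptions, not the product's internals: `EscalationStep`, `wait_minutes` and `current_target` are hypothetical names, and time is measured in whole minutes since the incident was triggered.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EscalationStep:
    notify: str        # who to notify at this step (hypothetical identifier)
    wait_minutes: int  # how long to wait for an acknowledgement before escalating


def current_target(
    steps: Sequence[EscalationStep],
    minutes_since_trigger: int,
    acknowledged: bool,
) -> Optional[str]:
    """Return who should be notified right now, or None if nobody should be."""
    if acknowledged:
        return None  # someone has picked up the incident; stop escalating
    deadline = 0
    for step in steps:
        deadline += step.wait_minutes
        if minutes_since_trigger < deadline:
            return step.notify
    return None  # every step timed out without an acknowledgement


policy = [EscalationStep("A", 15), EscalationStep("B", 30)]
print(current_target(policy, 5, acknowledged=False))   # A
print(current_target(policy, 20, acknowledged=False))  # B
```

Each step owns a time window; once that window passes without an acknowledgement, the next person in the policy becomes the target.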

[image]

Problem 3: It was hard to draw insights and spot incident patterns across metrics.

Reason: There was no insights page or incident “repository” where users could draw quick, critical insights about metric incidents from a “single place”.

Solution: Provide an Incidents page where users can see the list of all incidents (by date) and notice key information at a glance, such as lifecycle status (triggered/acknowledged/resolved/surrendered), time of the incident and the people assigned to it. And finally, provide a dedicated incident details page.
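As a rough picture of the data behind such a page, here is a small Python sketch. The four status names come from the list above; everything else (the `Incident` fields and the `incidents_by_date` helper) is a hypothetical shape chosen for illustration, not Cliff's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional, Sequence, Tuple


class Status(Enum):
    TRIGGERED = "triggered"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"
    SURRENDERED = "surrendered"


@dataclass(frozen=True)
class Incident:
    metric: str                      # hypothetical field: affected metric
    occurred_at: datetime            # time of the incident
    status: Status                   # lifecycle status
    assignees: Tuple[str, ...] = ()  # people assigned to the incident


def incidents_by_date(
    incidents: Sequence[Incident], status: Optional[Status] = None
) -> List[Incident]:
    """List incidents newest first, optionally keeping only one status."""
    selected = [i for i in incidents if status is None or i.status is status]
    return sorted(selected, key=lambda i: i.occurred_at, reverse=True)
```

Calling `incidents_by_date(all_incidents, Status.TRIGGERED)` would give the "still open, newest first" view; dropping the second argument gives the full history.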

[image]

Product growth is a continuous process. There is always scope for improvement! We're trying to provide you with the best experience while you build observability into your business processes, and with this update we aim to do just that!