Sometimes AI must make a decision that affects human health. You are probably thinking about self-driving cars, but our story is simpler: we are developing a system that wakes people up at night.
Imagine a monitoring system that tracks health of ten services. What if all metric data stops coming into that system? What should we do, wake up ten sysadmins? That would be a mistake: obviously, the monitoring system itself is broken. But what if only five services stopped sending any metric data? Three?
Another example. If free space on your HDD is at 90%, that is good. If it is at 1%, it's probably bad. But what if there is no data? It's worse than having a lot of free space, but is it better than having 1%?
A monitoring system usually has a web UI or configuration files to set up alerts and notifications. But what if monitoring system also has an API that can be used to set up thousands of alerts? Will it lead to qualitative changes in user behavior or just make routine operations slightly easier?
When you are developing an alerting system, you have to make decisions that require expertise in development, operations and design (in the best sense of all these words). That's what we'll talk about. All design decisions that we'll talk about were invented and tried out while developing Moira alerting system that is used in production at Kontur, Avito and Yandex.Money.