How & Why of Expert Systems for IT & Cloud Operations
Why & how our expert system works
AIOps, machine learning, and AI itself are all the rage these days. But they are not always the most useful tools. Sometimes, you just need an expert to look at the problem and give you the most likely solutions.
The only problem? A rather dire shortage of experts. That is true in any industry, but especially in ours: large-scale cloud and IT operations.
Large Internet and cloud systems break all the time, as you may have noticed. As these systems get larger and more complex, they are harder and harder to fix, especially as the industry embraces clouds, containers, and microservices.
There are just too many moving parts, interacting in complicated ways, and it’s very easy to miss basic problems, let alone complex interactions among a hundred distributed components.
And while experts can look at both the whole system and individual parts, using their knowledge and intuition to figure out what’s wrong, junior engineers just cannot do this.
They cannot do it because they lack both the necessary experience with these types of systems and the detailed knowledge of how the specific parts of any given system fail.
Further, they are often both overwhelmed with data, metrics, and alerts during a failure and underwhelmed by the right data and insights they need to solve the issue.
From Wikipedia, “an expert system is a computer system that emulates the decision-making ability of a human expert.” Sounds easy (well, not really), and it has proven quite challenging in practice.
Many people have tried encoding complex rules, decision trees, and even AI/ML models to find faults and fix systems.
But we think it’s easier than this, if you approach the problem the right way.
The secret is to structure the system the same way an expert thinks, which is in causes. As in, what are the possible causes of this problem? How can I rule these various causes in or out, the way a doctor or engineer would when you walk into their office with a problem?
The starting point for the analysis is the problem as it presents, e.g. the website is dead or we are out of disk space. Usually, and ideally, this entry point comes via a monitoring system alert, though of course the alert may describe only a symptom, not the actual issue. But it’s our starting point.
The system has to be configured with the likely starting points, usually tied to the likely alerts from the monitoring systems.
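As a minimal sketch of that configuration step, the mapping from alerts to entry points could be as simple as a lookup table. The alert names and problem identifiers below are illustrative assumptions, not from any real monitoring system:

```python
# Hypothetical mapping from monitoring alerts to expert-system entry points.
# Alert names and problem IDs are made up for illustration.
ENTRY_POINTS = {
    "http_5xx_rate_high": "website_down",
    "disk_usage_critical": "out_of_disk_space",
    "load_average_high": "host_overloaded",
}

def problem_for_alert(alert_name):
    """Return the configured starting problem for an alert, if any."""
    return ENTRY_POINTS.get(alert_name)
```

An alert the system has not been configured for simply returns no entry point, so unconfigured alerts fall through to normal human triage.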
From there, and for each problem, we ask experts to list the likely causes, i.e. what could cause this. It’s important that this list be quite broad and exhaustive, as the beauty of the expert system is that it surfaces unlikely causes only when the data supports them.
Real experts usually try to narrow the possible causes very quickly, as it’s hard to think about and test many things while the system urgently needs fixing. But the expert system can consider hundreds or even thousands of items and logical checks, including those that rarely occur, or only apply to a small number of systems.
In fact, even experts find expert systems like this useful as they can detect and raise unlikely or rare causes that are often forgotten. For example, our systems rarely use NFS, but for those few systems that do, it can really cause problems. Our expert systems are always there to check if NFS is in use and to remind us to investigate it.
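A cause list for one problem might be sketched as plain data; the cause names below (including the rare NFS case mentioned above) are illustrative placeholders:

```python
# Illustrative cause catalog for one problem; these names are assumptions,
# not a real production catalog.
CAUSES = {
    "out_of_disk_space": [
        "log_files_grew_unbounded",
        "core_dumps_filled_disk",
        "deleted_files_held_open",   # space not freed until process exits
        "inode_exhaustion",
        "nfs_mount_hung",            # rare, but always checked
    ],
}
```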
Once we have a list of likely causes we ask experts how they would determine which cause was the problem, starting first with what data they’d need to feed their decision making.
Those data items are called Inputs and are configured from the vast array of data we have available, such as metrics, alerts, configurations, history, and more.
Just this step is valuable, as it helps junior engineers focus on the right data and the right things, so they don’t waste time hunting down useless items. Plus, often just looking at the right inputs will point them in the right direction and/or make the problem’s cause obvious.
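One way to sketch these Inputs is as named data items, each tied to a source kind and a fetcher. The source kinds, names, and canned values here are assumptions made for the example:

```python
# Hypothetical Input definitions: each names one data item the rule logic
# can read. The fetchers return canned values for this sketch; a real
# system would query metrics, alerts, configs, history, etc.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Input:
    name: str
    source: str                # "metric", "alert", "config", "history", ...
    fetch: Callable[[], Any]

inputs = [
    Input("disk_used_pct", "metric", lambda: 97.0),
    Input("swap_enabled", "config", lambda: False),
    Input("nfs_in_use", "config", lambda: True),
]

def gather(input_defs):
    """Collect all configured inputs into one dict for the rule logic."""
    return {i.name: i.fetch() for i in input_defs}
```

Gathering everything up front also gives the junior engineer one focused view of the data that matters, rather than the full firehose of metrics.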
Once we have the inputs identified, we ask about the logic for each cause: how can you rule this cause in or out? In practice it’s rarely that simple, so we also have a scoring system.
Some causes can simply be ruled out as impossible because they don’t apply to the situation; for example, you can’t be out of swap space if you don’t have swap enabled. For these causes, the logic can set an outcome of excluded.
The rule logic is a basic set of and/or and comparison (<, =, >) operations over the previously defined inputs. If the logic evaluates true, we can score one way; if false, another.
For other causes, we assign a score, either directly, such as 25 or 75 (out of 100), or by bumping the existing score up or down, as all the scores apply in the context of this cause.
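The exclusion and scoring scheme described above can be sketched as rules of (predicate, action) pairs. The rule shapes, score values, and the "excluded" marker are assumptions for this sketch, not the real engine:

```python
# Minimal sketch of per-cause rule evaluation, assuming rules are
# (predicate, action) pairs. An action is either "exclude", ("set", n),
# or ("bump", n); scores are clamped to 0-100.
EXCLUDED = "excluded"

def evaluate_cause(rules, facts):
    """Run one cause's rules against the gathered inputs and return
    either EXCLUDED or a 0-100 likelihood score."""
    score = 0
    for predicate, action in rules:
        if not predicate(facts):
            continue
        if action == "exclude":
            return EXCLUDED          # impossible in this situation
        kind, value = action
        score = value if kind == "set" else score + value
    return max(0, min(100, score))

# Example: "out of swap" is impossible when swap is disabled.
swap_rules = [
    (lambda f: not f["swap_enabled"], "exclude"),
    (lambda f: f["swap_used_pct"] > 90, ("set", 75)),
]
```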
In the end, the causes are presented in score-sorted order, followed by the excluded causes, so the user can see what the system thinks the causes of this problem are, in order of likelihood.
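The final ranking step might look like the sketch below, where scored causes sort highest first and excluded causes trail the list; the cause names and scores are illustrative:

```python
# Sketch of the score-sorted presentation: ranked causes first,
# excluded causes last. Cause names and scores are made up.
def rank_causes(results):
    """results: {cause_name: score or "excluded"} -> ordered report lines."""
    scored = [(c, s) for c, s in results.items() if s != "excluded"]
    excluded = [c for c, s in results.items() if s == "excluded"]
    scored.sort(key=lambda cs: cs[1], reverse=True)
    return ([f"{c}: {s}" for c, s in scored]
            + [f"{c}: excluded" for c in excluded])
```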
Once the cause list is presented, the engineer can look at repair options. These can also include further investigation, as the system is not expected to determine everything; in some cases it only points the way to deeper investigation.
Other repair options include links to fixing procedures, along with some semi-automated and automated solutions, such as restarting services, purging logs for disk space, etc.
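A hedged sketch of that repair-option table, where each cause maps to an action kind and a detail; the action names and runbook URL are hypothetical placeholders:

```python
# Hypothetical repair-option table: each cause maps to (kind, detail).
# Action names and the runbook URL are placeholders, not real endpoints.
REPAIRS = {
    "logs_grew": ("automated", "purge_old_logs"),            # safe to automate
    "service_crashed": ("semi_automated", "restart_service"),
    "nfs_hung": ("investigate", "https://runbooks.example/nfs"),
}

def repair_for(cause):
    """Return (kind, detail) for a cause, defaulting to manual investigation."""
    return REPAIRS.get(cause, ("investigate", "no runbook configured"))
```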
That’s it in a nutshell: how our expert system works in the real world of large-scale infrastructure operations.