April 20, 2022

Forecasting Capacity Outages Using Machine Learning to Bolster Application Resiliency

Aditya Arora, Associate, Production Runtime Experience

 

Introduction

Goldman Sachs has a private cloud spanning data centers around the world. These data centers house a large number of cooperating hardware and software components, such as hosts, storage devices, network links, databases, message queues, caches, and load balancers. All of these components have finite capacity, and the exhaustion of this capacity could impact the resilience of critical applications. Furthermore, provisioning additional capacity can take time, so in some cases it cannot be performed reactively (that is, immediately after exhaustion). Ideally, any approach to guaranteeing the availability of sufficient capacity must be proactive.

Viable approaches must also be sufficiently accurate, so as not to overwhelm users with notifications or lead to costly over-provisioning of resources, while not missing any potential capacity exhaustion scenarios. Such approaches must also be designed to handle the vagaries of resource utilization and be able to distill the signal from the noise.

In this blog post, we explore a novel machine learning-based forecasting pipeline that uses a diverse array of techniques, including a patent-pending change point detection algorithm, to predict capacity outages ahead of time, thereby enabling infrastructure owners to take preventive action. We also discuss how we apply the same forecasting approach to detect whether the real-time utilization of these components is anomalous and requires immediate attention. Such anomalous behavior largely arises when utilization surges significantly, bringing a component to the verge of exhaustion. In these cases, our system raises real-time alerts that notify owners about the anomalous behavior, enabling them to mitigate the issue temporarily while longer-term solutions are planned.

 

The Capacity Planning Problem Statement

In order to explain the approach we took when building the system, we will use the following problem statement: given the past space utilization for a database (see Figure 1), predict when its space utilization is likely to hit 100%. This information can be used to trigger alarms early and give database administrators enough time to mitigate the issue by provisioning additional capacity, reducing consumption through archiving, and so on.

Figure 1: Database utilization over the recent 120 days.

 

A Brief Overview of Our Approach

This problem statement is a classic, well-studied machine learning problem called time series forecasting. A time series captures the variation of a metric over time. For our problem, the metric is the database space utilization, and the time series describes its variation over a long period. Time series forecasting is the problem of making predictions about the future values of a metric by identifying patterns in its past behavior. For example, for the time series in Figure 2, we identify a steady upward growth, demarcated by the red line (more on this below). We use this upward growth pattern to make predictions into the future, represented by the green line in Figure 2.

In other words, the green line represents an approximate estimate for the utilization of this database in the near future, and this estimate can then be used to predict the future date of a potential capacity outage. This prediction is the valuable information that time series forecasting provides.
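To make this concrete, here is a minimal sketch of the simplest possible version of the idea: fit a straight line to synthetic utilization history and solve for the day it crosses 100%. The data and growth rate here are hypothetical, and the pipeline described in the rest of this post is considerably more sophisticated.

```python
import numpy as np

# Hypothetical utilization history (%) for the last 120 days, growing steadily.
days = np.arange(1, 121)
utilization = 40 + 0.25 * days  # assumption: ~0.25 percentage points of growth per day

# Fit a straight line (degree-1 polynomial) to the history.
slope, intercept = np.polyfit(days, utilization, deg=1)

# Solve slope * day + intercept = 100 for the day utilization reaches 100%.
exhaustion_day = (100 - intercept) / slope
print(f"Projected to reach 100% around day {exhaustion_day:.0f}")  # day 240
```

A real series is far noisier than this straight line, which is why the rest of the post deals with seasonality, change points, and regularized trend fitting.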

Figure 2: Patterns identified in the database utilization are used to predict future utilization.

As with most machine learning problems, our approach towards a solution began with the exploration of a large dataset of past utilization time series for various kinds of infrastructure components. We looked for various kinds of patterns that might exist (such as the steady upward growth that we just discussed). For each of these patterns, we researched and narrowed down the appropriate machine learning techniques to maximize the accuracy of our prediction. Finally, after numerous rounds of backtesting and validation with these techniques, we built a system that makes high-quality forecasts for infrastructure utilization.

 

Patterns Identified

Some of the fundamental time series patterns that we identified during data exploration are shown in Figure 3. Any time series we encounter can be a combination of one or more of these fundamental patterns. Moreover, each pattern presents its own challenge and needs distinct machine learning techniques to model. In this blog post, we will cover only some of the modeling techniques.

Figure 3(a): Steady growth. Pattern = Trend; Definition = Continuous growth / reduction in utilization.

Figure 3(b): A sudden jump on day 90. Pattern = Change point; Definition = Sudden shift in utilization in short spans of time.

Figure 3(c): Seasonal pattern. Pattern = Seasonality; Definition = Periodic patterns in utilization over time.

 

Machine Learning System Architecture

We will use the two time series in Figure 4 to explain the step-by-step architecture of the machine learning system. The time series in Figure 4(a) has seasonality (a repeating component) and a trend, while the time series in Figure 4(b) has seasonality, a trend, and a change point (sharp jump) around Day Number 50.

The time series that we will model. Figure 4(a): Time series with seasonality and trend. Figure 4(b): Time series with seasonality, change point, and trend.

 

Extracting a Seasonal Component

To begin, the model extracts the seasonal component (a periodic pattern of the kind displayed in Figure 3(c)) from these time series in Figure 4. This breaks down the overall time series into two pieces - the part of the time series that repeats periodically and whatever is left over. This is highlighted in Figure 5, where the top plot shows the percentage utilization for a disk, the middle shows its extracted seasonal component, and the bottom shows the leftover component.

Note that when we sum the two components (the middle seasonal component and the bottom leftover component), we get the top overall time series. That is,

Overall time series (Top) = A Seasonal Component (Middle) + A Leftover Component (Bottom)

In order to extract the middle seasonal component (the one that periodically repeats), we use a couple of well-studied machine learning techniques known as Seasonal-Trend Decomposition (more info here, here, and here) and Periodogram Estimation (more info here and here).

Figure 5: How time series can be broken down into seasonal and leftover components.

We break these time series down into seasonal and leftover components because it simplifies our machine learning problem. There are a number of well-studied machine learning techniques for modeling purely seasonal time series and for modeling time series that have no seasonality in them. By breaking our time series down in this manner, we can filter the noise (in this case, the repeating, predictable seasonal component) from the useful data (the anomalies in the leftover component).

 

Detecting Change Points

Once the seasonal component has been extracted, we take the leftover time series and run a proprietary algorithm that we developed to identify the time of the change point (a sharp jump) in the time series, if any. This algorithm is a modification of the well-known CUSUM change point detection algorithm, and further details can be found in our patent. Our algorithm is capable of detecting change points in the presence of an underlying trend. The results are shown in Figure 6: no change point was detected in the image on the left, while a change point was detected in the image on the right, on Day Number 50.

Figure 6: Change point detection on the leftover components.

Again, the rationale for detecting change points is so that our model can work around them: there exist well-studied techniques to model time series without change points or seasonal patterns, and these can now be used out-of-the-box.

 

Fitting Trends

Finally, once we know whether there is a change point and, if so, where, we can fit the trend using a classic machine learning technique known as ridge regression (more info here and here). When we fit the ridge regression model (or any machine learning model in general), it gives us a mathematical rule that describes the underlying pattern in the data. In Figure 7, the pattern identified by this rule is highlighted in red. Additionally, when change points exist in the time series, we fit our trend model only to the portion after the latest change point. This is because a change point is a place where the behavior of the time series changes suddenly; as a consequence, the data before the change point is not relevant to the behavior of the time series after it.

Figure 7: Fitting trend on the leftover components.

Note: Even though the examples are deliberately kept simple, our trend model can also detect non-linear (non-straight-line) trends using polynomial regression. An example of this is given in Figure 2, above.
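The trend-fitting step can be sketched with off-the-shelf tools: ridge regression on polynomial features of the day number, fit only to the data after an assumed latest change point. The synthetic data, change point location, polynomial degree, and regularization strength below are illustrative choices, not our production settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic leftover component after an assumed change point at day 50.
days = np.arange(50, 120).reshape(-1, 1)
leftover = 60 + 0.2 * (days.ravel() - 50)   # assumption: linear growth after the jump

# Ridge regression on polynomial features of the day number; degree=2 lets the
# model capture curved (non-linear) trends, as mentioned in the note above.
trend_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                            Ridge(alpha=1.0))
trend_model.fit(days, leftover)

# The fitted model is a mathematical rule: day number in, estimated utilization out.
future_days = np.arange(120, 150).reshape(-1, 1)
trend_forecast = trend_model.predict(future_days)
```

The ridge penalty keeps the polynomial coefficients small, which helps the extrapolation stay stable on noisy utilization data.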

 

Using Our Machine Learning System to Make Predictions

We have covered the steps to train the machine learning model to recognize trends. Now we can get to the exciting part: using our trained model to predict future utilization. To do so, we must first understand the mechanics behind machine learning models. In a nutshell, any model is simply a mathematical function that accepts some kind of data as input and produces estimates as output. In our case, the input is the sequence of day numbers (1, 2, 3, ..., 120), and the expected output is the value of the time series on each day. A good machine learning model is one that produces outputs which closely mirror reality (the actual values of the time series that we have).

To predict the future, we simply feed future day numbers (121, 122, ...) into our machine learning model, and it provides approximate estimates of the future values of the time series. Finally, since our machine learning model is made up of two pieces, a seasonal component and a trend component, its prediction of the future is simply the sum of these two components.
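The prediction mechanics described above can be sketched as follows, using a hypothetical fitted trend rule and extracted seasonal pattern standing in for the outputs of the earlier steps:

```python
import numpy as np

# Stand-ins for the outputs of the earlier steps (hypothetical names and values):
seasonal_pattern = 5 * np.sin(2 * np.pi * np.arange(7) / 7)  # one full weekly cycle

def trend_fn(day):
    # the fitted trend rule: day number in, utilization estimate out
    return 50 + 0.1 * day

history_length = 120
future_days = np.arange(history_length + 1, history_length + 31)  # days 121..150

# Seasonal prediction: the periodic pattern simply repeats into the future.
seasonal_forecast = seasonal_pattern[(future_days - 1) % len(seasonal_pattern)]

# Trend prediction: feed future day numbers into the fitted rule.
trend_forecast = trend_fn(future_days)

# Overall forecast = seasonal component + trend component.
forecast = seasonal_forecast + trend_forecast
```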

 

Predictions from the Seasonal Model

Since our seasonal model is merely a periodic and repetitive pattern, the future predictions are repetitions of the same pattern. We plot these repetitions for both time series in Figure 8.

Figure 8: Predicting the future - the seasonal model.

 

Predictions from the Trend Model

Similarly, given the mathematical model for our trend model, we can then use it to make predictions into the future. Those predictions are charted in Figure 9.

Figure 9: Predicting the future - the trend model.

 

Overall Model Predictions

By summing up the predictions from the above two components, we obtain our overall model predictions. These are shown in Figure 10.

Figure 10: Predicting the future.

 

The Entire Model Flow

The entire flow is summarized in Figure 11 below.

Figure 11: The entire model flow.

 

The Anomaly Detection Problem Statement

As we discussed earlier, meticulous capacity planning can prevent resource crunches most of the time, but sometimes application behavior can spike or fall suddenly and unpredictably, and such cases are strong indicators of potential application failures. Suppose that the available capacity of a network link or a database drops suddenly and drastically due to a surge or spike in usage. This kind of behavior can have a significant impact on applications, and hence it is critical to notify the right people as quickly as possible. This is broadly termed anomaly detection, and in order to explain it we shall consider the following problem statement:

If we have the past utilization data for a database, can we use it to identify whether the current utilization is anomalously high?

The approach that we take to tackle such a problem statement is to compare our forecasts with the current utilization. Given that our time series forecasting model provides a good estimate of the usual behavior of the time series, if the current utilization deviates significantly from the forecasted utilization then we can claim that this large deviation is an anomaly which might need human attention. In order to understand this better, let us consider the earlier case of the two time series.

Suppose that the actual future utilization turned out to be the orange line at the end of each plot in Figure 12. In the first case, the orange line is way off from the predicted utilization, while in the second, it is quite close to it. We can therefore say that the behavior of the first time series is anomalous and might require human attention, while the second time series most likely does not. In the second case, our model's prediction closely matched reality, so no alert is needed; in the first case, our model flags the deviation for human attention.
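This comparison can be sketched as a simple residual check: flag any day on which the actual utilization deviates from the forecast by more than a tolerance derived from historical forecast errors. The numbers and the three-standard-deviation threshold below are illustrative assumptions.

```python
import numpy as np

# Hypothetical inputs: the model's forecast and the actual observed utilization.
forecast = np.array([70.0, 71.0, 72.0, 73.0, 74.0])
actual = np.array([70.5, 70.8, 88.0, 72.6, 74.3])   # the third day surges unexpectedly

# Tolerance band derived from historical forecast errors (assumed std of 1.0).
residual_std = 1.0
threshold = 3 * residual_std        # flag deviations beyond three standard deviations

deviations = np.abs(actual - forecast)
anomalous_days = np.flatnonzero(deviations > threshold)
print(anomalous_days)               # only index 2 (the surge) is flagged
```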

Figure 12: Real-time anomaly detection.

 

Conclusion

The machine learning system which we discussed in this post underpins an extensively used platform that simplifies capacity planning and can perform effective time series anomaly detection.

Going forward, we are:

  • Working to open source our library for broader use.
  • Continuously focusing on enhancing our machine learning models. 

 

Credits

This model has been built and refined through the contributions of a number of team members including Madhavan Mukundan, Milan Someswar, Ankesh Jha, Nishu Kumari, Divyanshu Sharma, and Girija Vasireddy. Thank you also to Sayantan Sarkar, Anurag Ved, Krishnamurthy Vaidyanathan, and Bob Naccarella for the constant support and guidance.

 

Explore Opportunities at Goldman Sachs

We're hiring for several exciting roles. Visit our careers page to learn more.


See https://www.gs.com/disclaimer/global_email for important risk disclosures, conflicts of interest, and other terms and conditions relating to this blog and your reliance on information contained in it.