December 16, 2021

Harnessing the Power of Machine Learning to Improve Data Lake Client Happiness

Jaimita Bansal, Vice President, Data Lake Engineering

Data Lake is the firm's big data platform. It consists of a proprietary data store, services, and infrastructure components that house and process petabytes of data every day. As of today, the Data Lake comprises a 650+ node HDFS cluster with 200 TB of memory, 30K+ cores, and more than a hundred thousand datasets. Data flowing in and out of the Data Lake is crucial for enabling four key business segments of the firm - Global Markets, Consumer and Wealth Management, Asset Management, and Investment Banking. Data Lake has two types of users - Producers and Consumers. Producers produce data that is ingested into Data Lake; each batch of data is merged into an existing snapshot of data for that data store. Consumers can subscribe to multiple data stores and get the required data exported to multiple database technologies, including cloud-based ones (also called virtual warehouses).

The Challenge

The Data Lake platform scales to run hundreds of thousands of ingestions and more than a million exports per day (and we are constantly growing!). Fast processing at this scale is a financial need - speed is everything! Each of these ingestions and exports is time-sensitive, and the workloads they process vary tremendously in volume. For instance, the same producer can send anywhere from zero to millions of rows depending on the number of trades executed in that hour. The most difficult question to answer operationally has been: which particular ingestion or export has been delayed, and also has the potential to delay downstream consumer deadlines? If we could proactively monitor for these, we could manually intervene and prevent operational misses. This question led us to start estimating the time taken for each ingestion and export using a machine learning model, and subsequently creating an anomaly detection model based on its predictions to find outliers.

Modeling Approaches

We started with the simplest possible model - a static threshold (e.g. 2 hours) based on the 90th percentile duration of all jobs. This means that if an ingestion takes longer to complete than the static threshold, it is marked as delayed. This approach proved ineffective because batches with larger volumes of data (asymmetric workloads) always took longer than the static threshold.

We then considered training simple statistical models that adjust the threshold for each dataset based on previous run times for that particular dataset. However, all previous runs for a dataset could share a particular feature profile, such as ten rows sent in a daily batch. These models fail when a new batch in that dataset changes the parameters, such as sending millions of rows instead of ten. When that happens, previous runs of that dataset can no longer be relied upon (the asymmetric workloads issue again).
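The two baselines can be sketched as follows (the run times, thresholds, and the `is_delayed` helper are all hypothetical illustrations, not production code):

```python
import numpy as np

# Hypothetical historical run times (minutes) for one dataset,
# including one large-but-healthy batch that took 60 minutes.
history = np.array([4.1, 3.8, 5.0, 4.4, 60.2, 4.7, 3.9, 4.3])

# Baseline 1: one static threshold for the whole plant
# (e.g. the 90th percentile duration of all jobs).
STATIC_THRESHOLD = 120.0  # minutes

# Baseline 2: a per-dataset threshold from that dataset's own history.
per_dataset_threshold = np.percentile(history, 90)

def is_delayed(run_minutes, threshold):
    return run_minutes > threshold

# Both baselines break on asymmetric workloads: the static threshold
# misses slow small batches, while the per-dataset threshold learned
# from ten-row batches flags the first million-row batch even when it
# runs at a normal speed for its size.
print(is_delayed(60.2, per_dataset_threshold))
```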

We then moved on to identifying the feature set (~20+ features) that affects these processing times in close collaboration with the Data Lake Developers. Some of the more important features identified are:

  • Total rows in batch.
  • Number of datasets sent in that batch.
  • Merge size (size of the existing snapshot).
  • Number of inserts/updates/deletes.
  • Number of partitions.
  • Parallelism set for big data job.
  • Compression ratio of the dataset.
  • Schema of the dataset.
  • Time of day that a batch request is made.

Interestingly, we realized that there are daily windows when the resources/infrastructure components of Data Lake are processing big workloads, leading to higher wait times for the ingest/export jobs.
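For illustration, one row of such a feature set might be assembled like this (all column names and values are invented; the production schema has ~20+ features):

```python
import pandas as pd

# Hypothetical feature row for one ingestion batch. The names mirror
# the features listed above but are illustrative, not the real schema.
batch_features = pd.DataFrame([{
    "total_rows": 2_500_000,       # total rows in batch
    "datasets_in_batch": 3,        # datasets sent in that batch
    "merge_size_gb": 120.5,        # size of the existing snapshot
    "num_inserts": 2_000_000,
    "num_updates": 400_000,
    "num_deletes": 100_000,
    "num_partitions": 16,
    "parallelism": 8,              # parallelism set for the big data job
    "compression_ratio": 0.31,
    "num_columns": 42,             # proxy for schema complexity
    "hour_of_day": 14,             # captures the busy daily windows
}])
print(batch_features.shape)
```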

Predicting Ingestion Times using Machine Learning

We then explored supervised machine learning techniques (trained on millions of historical ingestion samples with ~20+ features) to predict the ingestion time, as the Baseline Model that we tried before didn't work very well. We experimented with:

  1. A Decision Tree Regressor for each store.
  2. One big Decision Tree Regressor for all ingestions (and another one for exports).

Comparing between the Baseline Model and the Decision Tree Regressor for each store, the Decision Tree Regressor had better MSE for 80% of the stores; performance was similar for the rest.

We found that the advantage of training Decision Tree Regressor models for every store (hundreds of thousands of models) is that they capture the micro trends of varying features within a store really well. However, a challenge for training these models was that the number of training samples varied vastly, since every store sends batches at different frequencies. We encountered data sparsity issues for low-frequency stores that send only 20-30 batches a year; the predictions made were not relevant, since the underlying code and infrastructure change over time as the Lake grows.

We then tried using one big Decision Tree Regressor trained on millions of ingestions (excluding failures) across all stores. It showed 8% lower MSE than the per-store Decision Tree Regressors, and produced an overall 0.92 R2 score on a test set (using an 80/20 train-test split), which was quite accurate. We also found other advantages with this model:

  • It captures the global trends/buckets of features across the plant (the Data Lake infrastructure and services as a whole are referred to as the "plant") and faces no data sparsity issues.
  • Better collaboration with developers because of easy interpretability.
  • Easy to maintain versions of one model over time in comparison to hundreds of thousands of models.
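A minimal sketch of this approach with scikit-learn, using a single synthetic feature and response in place of the real ~20+ feature training set of millions of ingestions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in for historical ingestions: duration grows with
# row count, plus noise. The real model uses many more features.
total_rows = rng.integers(10, 10_000_000, size=n)
X = total_rows.reshape(-1, 1).astype(float)
y = 0.5 + total_rows / 2_000_000 + rng.normal(0, 0.2, size=n)

# 80/20 train-test split, as in the post.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = DecisionTreeRegressor(max_depth=6, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(f"MSE={mean_squared_error(y_test, pred):.3f}, "
      f"R2={r2_score(y_test, pred):.3f}")
```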

Example Predictions on One Store

Notice the difference in predictions for the three models as staging row count (number of rows sent in that batch) changes.

We explored other machine learning techniques as well, such as regression, gradient-boosted decision trees, and neural networks, which gave lower or similar accuracy, but at the cost of interpretability. Easy visualization of the buckets (splits) based on features is very useful for understanding the features that drive the processing times. We decided to choose the big Decision Tree Regressor as the final model for our use case.

Visualization: Part of the Data Lake Ingestion Tree

Below is a snippet of the big Decision Tree Regressor on ingestions. As can be seen from the first split in the tree, any batch with fewer than 8 million total rows gets ingested in 0.58 minutes on average; otherwise, it gets ingested in 5.47 minutes on average.

  • std is standard deviation.
  • samples is the number of samples in that node.
  • value is ingestion duration in minutes.
  • Note: Read the tree from left to right.
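A first split like this can be reproduced on toy data with scikit-learn's `export_text` (the row counts and durations below are synthetic, chosen to mimic the 8 million row split described above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)

# Toy data: batches under 8 million rows ingest in ~0.58 minutes,
# larger ones in ~5.47 minutes, plus a little noise.
rows = rng.integers(1, 20_000_000, size=2000).reshape(-1, 1).astype(float)
minutes = np.where(rows[:, 0] < 8_000_000, 0.58, 5.47)
minutes = minutes + rng.normal(0, 0.05, size=2000)

tree = DecisionTreeRegressor(max_depth=1, random_state=0)
tree.fit(rows, minutes)

# Prints the tree as indented text; the root split lands near 8M rows.
print(export_text(tree, feature_names=["total_rows"]))
```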

Lake Happiness

We define the happiness of the Data Lake by how well we perform in real time against the expectations set by the model for every batch. The Data Lake is considered "happy" if we are able to process all incoming batches in the duration predicted by the model. In effect, we are creating an Anomaly Detection Model (explained later) based on the Decision Tree Regressor's predictions/expectations.

We use these models for:

  1. Real-time monitoring of expected performance vs reality across plant and across individual users through happiness metrics. This gives users an idea of expected performance and allows us to identify operational/systemic issues in Data Lake.
  2. Production alerting for batches running longer than expected.
  3. Prioritizing platform improvements with Data Lake Developers for specific clients who experience lags with their data.

Anomaly Detection (AD) Model

  • Model Threshold = Decision Tree Regressor prediction + max(2 * leaf std, hard_buffer), where leaf std is the standard deviation at the leaf node of the decision tree.
  • Breach minutes = Actual batch run time - Model Threshold, if this difference is > 0; otherwise 0.
  • Severity (SLI) = Breach minutes / Model Threshold.

We used two times the standard deviation at the leaf node of the Decision Tree Regressor to calculate the model threshold. This contains the volume of breaches generated - an application of Chebyshev's inequality. We kept a hard_buffer of 5 minutes in the AD model because we do not want to alert on a deviation from the norm of only a few minutes.
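Putting the three formulas together, a direct translation might look like this (the prediction and leaf std values are hypothetical; in production the leaf std comes from the matched leaf of the tree):

```python
def model_threshold(prediction, leaf_std, hard_buffer=5.0):
    """Threshold in minutes: prediction + max(2 * leaf std, hard_buffer)."""
    return prediction + max(2 * leaf_std, hard_buffer)

def breach_minutes(actual, threshold):
    """Minutes over threshold, floored at zero."""
    return max(actual - threshold, 0.0)

def severity(actual, threshold):
    """SLI = breach minutes / model threshold."""
    return breach_minutes(actual, threshold) / threshold

# A batch predicted at 10 min with a leaf std of 4 min gets an
# 18-minute threshold; finishing in 27 minutes breaches by 9 minutes,
# for an SLI of 0.5.
t = model_threshold(prediction=10.0, leaf_std=4.0)
print(t, breach_minutes(27.0, t), severity(27.0, t))
```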

A Happy Lake Client

A Data Lake client is happy if their batch finishes within the threshold predicted by the AD model. Since the model takes into account all the big data features and asymmetric workloads sent by the client (millions of rows or tens of rows, and other feature sets), we believe it is reasonable to set client expectations based on it.

Happiness Metric:

  • SLI = 0: client is happy; breach minutes is 0.
  • 0 < SLI < 1: client is annoyed; breach minutes is in (0, model_threshold).
  • SLI >= 1: client is angry; breach minutes is in [model_threshold, infinity), meaning that the actual run time is at least twice the threshold.
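The bucketing above can be sketched as a small helper:

```python
def happiness(sli):
    """Map an SLI value to the happiness buckets defined above."""
    if sli == 0:
        return "happy"     # finished within the model threshold
    if sli < 1:
        return "annoyed"   # breached, but by less than the threshold
    return "angry"         # ran at least twice the threshold

print([happiness(s) for s in (0.0, 0.4, 1.3)])
```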

We measure the percentage of clients who are happy/unhappy in real time. Any dip in the happiness graph indicates that some clients are affected by Data Lake lags; a major dip indicates a systemic issue in Data Lake, which generates internal alerts for the team. We also have different levels of views, such as a plant (overall) view and a warehouse view. Using these views, we can easily drill down into the clients and batches affected in real time, with the help of the prediction made for those batches and their respective breach minutes. We also have a developer-focused view for suggesting long-term improvements to Data Lake Developers, through a list of stores showing high breach minutes.

*Warehouse View colored in green to red by the severity (SLI) of any particular breach in ingestion for that warehouse.
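A toy version of these aggregated views (the warehouse names and SLI values are invented):

```python
import pandas as pd

# Hypothetical real-time batch results with their SLIs.
batches = pd.DataFrame({
    "warehouse": ["wh1", "wh1", "wh2", "wh2", "wh2"],
    "sli": [0.0, 0.0, 0.0, 0.6, 1.4],
})

# Plant view: percentage of batches that finished happy (SLI == 0).
plant_happiness = (batches["sli"] == 0).mean() * 100

# Warehouse view: worst severity per warehouse, for red/green coloring.
by_warehouse = batches.groupby("warehouse")["sli"].max()

print(plant_happiness)
print(by_warehouse)
```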

Overall, these models have proven to be extremely useful. They continue to generate accurate alerts and prevent delays. We run with >99% happiness on most days for both ingestions and exports!


Lessons Learned

  1. Partner with domain experts early on, and extract as many "features" as possible from their contextual experience. Our R2 score improved from 0.7 to >0.9 as new features were discovered and added after brainstorming with developers.
  2. Model simplicity is directly related to stakeholder and client trust. Focusing on accuracy is important but not enough, especially in operational settings where many of the stakeholders might not be engineers. For us, the project's success depended on whether the model was adopted and understood by all stakeholders, and was thus directly related to the simplicity of the machine learning technique used.

  3. Model result explanation is an operational expense. For machine learning engineers, it can take a full day's work to explain why the model made a certain prediction. We ended up spending time creating simple visualizations, which helped greatly.

  4. The model is only as good as the time window it is trained on. Major dev releases do necessitate model re-training and putting a new model version in production. On the flip side, the effect of any new release on clients can be measured by comparing performance against the existing model. Comparing the prediction results of the new model with the old model is also valuable for identifying specific store behaviors that have changed. The more complex question to solve has been how to compare two model versions and identify differences in splits and thresholds. Solving this would make the effects of any new release quite transparent to developers by showcasing which features/branches of the tree diverged in their behavior.

  5. Predictions from machine learning models can often be challenging to understand and even more complex to explain. Tree-based models worked nicely for us because they are relatively transparent in their splits and buckets, and highly accurate for our data. We also created custom visualizations of the trees to understand and explain them even better. In general, there is a trade-off between the interpretability of machine learning models and the accuracy achieved. An important focus for us was to find the balance where the usefulness of these models exceeded any overhead.

We are happy to see these models improving our operational efficiency and user happiness and continue to be excited about applying machine learning in the Big Data space.

Lake Opportunities

Do you enjoy solving interesting data engineering challenges? Our team is growing and we have several exciting opportunities!
