December 2, 2021

Observability at Scale

Graeme Bennett, Managing Director, Site Reliability Engineering

Observability is the foundation of reliability. Observable systems provide sufficient information about their internal state through outputs such as metrics, alerts, logs, or traces so that engineers can determine whether a system is healthy and providing expected service levels to users, and can optimize its performance and behavior. When things inevitably go wrong, observability enables engineers to quickly diagnose and fix issues. The more complex a system gets, and the higher user expectations of reliability are, the more important it becomes to invest in advanced observability methods to reason about what is going on.

Goldman Sachs is a leading global financial institution that delivers a broad range of financial services across Investment Banking, Global Markets, Asset Management, and Consumer & Wealth Management to customers all over the world. We run a diverse technology portfolio of highly complex, distributed software services that underpin the many business lines of our firm. These services incorporate a broad mix of technologies across web and mobile, dozens of languages and frameworks, containers, big data processing and analytics, machine learning, high-frequency trading, batch computing, and cloud-native applications, all of which operate in a highly regulated and performance-driven environment. It is impossible for any individual to understand the operation of our systems without the help of sophisticated observability platforms and developer tools.

To that end, the Site Reliability Engineering (SRE) team at Goldman Sachs is focused on providing high-quality, scalable observability platforms to our population of thousands of engineers. Our mission is to build observability at scale, and to provide our colleagues with the tools they need to engineer highly reliable systems for our business partners, who in turn provide excellent customer service to our clients. In 2020, SRE began a significant initiative to build a new generation of centralized observability platforms to further accelerate our vision to drive adoption of SRE practices across the firm. Central to this mission, we assembled a team of engineers with deep experience in developing and running highly complex global platforms in the technology and finance industries.

Our platform strategy has a strong preference towards the use of open standards such as OpenTelemetry, and open source software like Prometheus and similar technologies under the umbrella of the Cloud Native Computing Foundation (CNCF). We use cloud-native platforms like Amazon Managed Service for Prometheus (AMP) that support open integrations and industry-leading vendor solutions. We also develop bespoke tools and platforms to fill gaps where solutions do not yet exist, such as our innovative SLO Repo platform that centralizes service uptime signals based on Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and provides features to report, analyze, and regulate SLO data quality. 

The goal of our initiative is to deliver a unified observability plane that spans all of our on-premise and cloud computing environments, offering engineers a common developer experience and a central, global view of metrics, logs, and trace information with features like dashboarding, alert management, incident management, and more. This approach also means it is simpler to implement common data security and compliance controls, thus allowing our engineers to focus on building their products.

Our first priority was to build a centralized metrics-based monitoring platform based on the open source Prometheus ecosystem. Prometheus is a popular platform with increasingly broad industry support. Our Prometheus platform has since achieved significant adoption across our on-premise systems. We place a strong emphasis on the use of service domain monitoring practices using SLOs, which requires the metrics collection and time-series analysis capabilities that Prometheus provides.
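To make the SLO idea concrete, here is a minimal sketch of how an SLI can be compared against an SLO and turned into a remaining error budget. It is an illustration only and not the implementation of our platforms: the names `SloReport` and `evaluate_slo`, the request counts, and the 99.9% objective are all hypothetical, and in practice the good and total event counts would come from metrics collected over the SLO window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    """Result of comparing a measured SLI against an SLO (hypothetical type)."""
    sli: float                     # fraction of good events, 0.0 to 1.0
    objective: float               # target fraction, e.g. 0.999
    error_budget_remaining: float  # 1.0 = untouched, 0.0 = spent, < 0 = overspent
    met: bool                      # True when the SLI is at or above the objective


def evaluate_slo(good_events: int, total_events: int, objective: float) -> SloReport:
    """Compute the SLI and the share of the error budget still available."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    if not 0.0 < objective < 1.0:
        raise ValueError("objective must be strictly between 0 and 1")

    sli = good_events / total_events
    # The error budget is the number of bad events the objective tolerates.
    allowed_bad = (1.0 - objective) * total_events
    actual_bad = total_events - good_events
    remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(sli, objective, remaining, sli >= objective)


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 500 of them failed, 99.9% objective.
    report = evaluate_slo(good_events=999_500, total_events=1_000_000, objective=0.999)
    print(f"SLI={report.sli:.4%} met={report.met} "
          f"budget remaining={report.error_budget_remaining:.1%}")
```

In this hypothetical window the service fails 500 of the 1,000 requests the objective allows, so roughly half of the error budget remains. A negative remaining budget means the objective was missed for the window.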

Our next major priority is to extend coverage of monitoring to cloud-based applications, expanding the reach of our unified observability plane. Cloud computing presents new and interesting challenges for building an observability platform. For example, should we leverage a cloud-native observability service, build our own, license a SaaS-based vendor solution, or mix and match? There is a vast array of options today with many trade-offs to consider. Often a solution exists but is prohibitively expensive to deploy at scale, or works well in cloud but not on-premise, or vice versa. Some platforms contain certain features that encourage a focus on legacy methods that we on the SRE team aim to discourage. Others lack sufficient capacity to meet our processing requirements, or present a closed universe of configuration and data that prevents us from leveraging it outside that system. A particular challenge in cloud is dealing with the structure inherent in partitioned account, IAM, and VPC configurations, so that we can securely monitor across these structures. Given these various considerations and our target state goals, we have a strong focus on cloud services that enable open integration options and the use of open standards like OpenTelemetry, so we are better able to realize our unified observability plane vision on and off-prem. As we refine our cloud observability strategy, we are particularly excited about the emergence of cloud services that complement our initiatives within the Prometheus ecosystem. We look forward to working with various partners as they continue to develop this platform and similar observability-related products with the characteristics we consider necessary to achieve our vision.

The observability space is undergoing rapid change and growth at present. Over the past decade, and particularly the last three to five years, there has been a surge of new investment, products, and competition, and the trend only appears to be accelerating. More and more, observability is recognized as a foundational requirement that allows engineers to cross the watershed from three nines of availability towards four and five nines. Technology that emerged out of necessity at the big tech firms to operate distributed internet services for millions of users has become more mainstream through the open source community, and now, increasingly, through vendors and cloud service providers. We're excited to embrace these new platforms and paradigms at Goldman Sachs.
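To put the "nines" mentioned above in concrete terms, the short sketch below converts an availability target into the downtime it allows per year. It is simple arithmetic for illustration: the function name `allowed_downtime_minutes` is hypothetical, and it assumes a 365-day year with downtime measured as whole-service unavailability.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime allowed by an availability of N nines (3 means 99.9%)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    # N nines of availability leaves an unavailability of 10 ** -N.
    return MINUTES_PER_YEAR / 10 ** nines


if __name__ == "__main__":
    for n in (3, 4, 5):
        availability = 100 * (1 - 10 ** -n)
        print(f"{n} nines ({availability:.3f}%): "
              f"{allowed_downtime_minutes(n):.2f} minutes per year")
```

Under these assumptions, three nines allows about 525.6 minutes (roughly 8.8 hours) of downtime per year, four nines about 52.6 minutes, and five nines about 5.3 minutes, which is why each additional nine demands much faster detection and diagnosis.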


See https://www.gs.com/disclaimer/global_email for important risk disclosures, conflicts of interest, and other terms and conditions relating to this blog and your reliance on information contained in it.

© 2025 Goldman Sachs. All rights reserved.