December 2, 2021

Observability at Scale

Graeme Bennett, Managing Director, Site Reliability Engineering

Observability is the foundation of reliability. Observable systems expose enough information about their internal state, through outputs such as metrics, alerts, logs, and traces, for engineers to determine whether a system is healthy and delivering expected service levels to users, and to optimize its performance and behavior. When things inevitably go wrong, observability enables engineers to quickly diagnose and fix the issue. The more complex a system becomes, and the higher users' expectations of reliability, the more important it is to invest in advanced observability methods to reason about what is going on.

Goldman Sachs is a leading global financial institution that delivers a broad range of financial services across Investment Banking, Global Markets, Asset Management, and Consumer & Wealth Management to customers all over the world. We run a diverse technology portfolio of highly complex, distributed software services that underpin the many business lines of our firm. These services incorporate a broad mix of technologies across web and mobile, dozens of languages and frameworks, containers, big data processing and analytics, machine learning, high-frequency trading, batch computing, and cloud-native applications - all of which operate in a highly-regulated and performance-driven environment. It is impossible for any individual to understand the operation of our systems without the help of sophisticated observability platforms and developer tools.

To that end, the Site Reliability Engineering (SRE) team at Goldman Sachs is focused on providing high-quality, scalable observability platforms to our population of thousands of engineers. Our mission is to build observability at scale and to provide our colleagues with the tools they need to engineer highly reliable systems for our business partners, who in turn provide excellent service to our clients. In 2020, SRE began a significant initiative to build a new generation of centralized observability platforms and accelerate our vision of driving adoption of SRE practices across the firm. To support this mission, we assembled a team of engineers with deep experience in developing and running highly complex global platforms in the technology and finance industries.

Our platform strategy has a strong preference towards the use of open standards such as OpenTelemetry, and open source software like Prometheus and similar technologies under the umbrella of the Cloud Native Computing Foundation (CNCF). We use cloud-native platforms like Amazon Managed Service for Prometheus (AMP) that support open integrations and industry-leading vendor solutions. We also develop bespoke tools and platforms to fill gaps where solutions do not yet exist, such as our innovative SLO Repo platform that centralizes service uptime signals based on Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and provides features to report, analyze, and regulate SLO data quality. 

The goal of our initiative is to deliver a unified observability plane that spans all of our on-premise and cloud computing environments, offering engineers a common developer experience and a central, global view of metrics, logs, and trace information with features like dashboarding, alert management, incident management, and more. This approach also means it is simpler to implement common data security and compliance controls, thus allowing our engineers to focus on building their products.

Our first priority was to build a centralized metrics-based monitoring platform based on the open source Prometheus ecosystem. Prometheus is a popular platform with increasingly broad industry support. Our Prometheus platform has since achieved significant adoption across our on-premise systems. We place a strong emphasis on service domain monitoring practices using SLOs, which require the metrics collection and time-series analysis capabilities that Prometheus provides.
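To make the SLO idea concrete, here is a minimal sketch of how an availability SLI and the remaining error budget can be derived from request counts. It is an illustration only: the `SloWindow` type, the function names, and the request-count inputs are hypothetical, and the sketch does not represent our platform's data model or any Prometheus API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one service over one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def availability_sli(window: SloWindow) -> float:
    """Fraction of requests that succeeded; treated as 1.0 when there was no traffic."""
    if window.total_requests == 0:
        return 1.0
    return 1 - window.failed_requests / window.total_requests


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Share of the error budget still unspent; negative once the SLO is breached."""
    allowed_failures = (1 - slo_target) * window.total_requests
    if allowed_failures == 0:
        # No failures are permitted: the budget is intact only if nothing failed.
        return 0.0 if window.failed_requests else 1.0
    return 1 - window.failed_requests / allowed_failures


# Example: one million requests, 250 failures, measured against a 99.9% target.
window = SloWindow(total_requests=1_000_000, failed_requests=250)
print(f"SLI: {availability_sli(window):.5f}")
print(f"Error budget remaining: {error_budget_remaining(window, 0.999):.0%}")
```

In practice the counts would come from time-series queries over a rolling window rather than from hard-coded numbers; the arithmetic stays the same.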

Our next major priority is to extend monitoring coverage to cloud-based applications, expanding the reach of our unified observability plane. Cloud computing presents new and interesting challenges for building an observability platform. For example, should we leverage a cloud-native observability service, build our own, license a SaaS-based vendor solution, or mix and match? There is a vast array of options today, with many trade-offs to consider. Often a solution exists but is prohibitively expensive to deploy at scale, or works well in the cloud but not on-premise, or vice versa. Some platforms contain features that encourage a focus on legacy methods that we on the SRE team aim to discourage. Others lack sufficient capacity to meet our processing requirements, or present a closed universe of configuration and data that prevents us from using that data outside the system. A particular challenge in the cloud is dealing with the structure inherent in partitioned account, IAM, and VPC configurations so that we can securely monitor across these boundaries.

Given these considerations and our target-state goals, we have a strong focus on cloud services that enable open integration options and the use of open standards like OpenTelemetry, so that we are better able to realize our vision of a unified observability plane both on-premise and in the cloud. As we refine our cloud observability strategy, we are particularly excited about the emergence of cloud services that complement our initiatives within the Prometheus ecosystem. We look forward to working with various partners as they continue to develop these services and similar observability-related products with the characteristics we consider necessary to achieve our vision.

The observability space is undergoing rapid change and growth. Over the past decade, and particularly in the last three to five years, there has been a surge of new investment, products, and competition, and the trend only appears to be accelerating. More and more, observability is recognized as a foundational requirement that allows engineers to cross the watershed of three nines availability towards four and five nines. Technology that emerged out of necessity at the big tech firms to operate distributed internet services for millions of users has become more mainstream through the open source community and now, increasingly, through vendors and cloud service providers. We're excited to embrace these new platforms and paradigms at Goldman Sachs.
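For readers less familiar with the "nines" shorthand, the short sketch below converts an availability target into the downtime it permits over a period. The function name and the 365-day default are illustrative assumptions, not part of any platform described here.

```python
def allowed_downtime_minutes(availability: float, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted over the period at the given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_days * 24 * 60


targets = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
for label, target in targets:
    print(f"{label}: {allowed_downtime_minutes(target):.1f} minutes per year")
```

Three nines leaves roughly 8.8 hours of downtime per year, four nines about 53 minutes, and five nines just over 5 minutes, which is why each additional nine demands much faster detection and diagnosis.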


See https://www.gs.com/disclaimer/global_email for important risk disclosures, conflicts of interest, and other terms and conditions relating to this blog and your reliance on information contained in it.