At Goldman Sachs, Middleware Engineering builds and operates a broad database platform as a service. We deliver a large, heterogeneous service offering comprising thousands of databases across a dozen different database types to unlock the potential of all our engineering teams. While our footprint has grown tremendously over the years, we always do our best to deliver the solutions that our customers need. This size, diversity and customer focus are some of our key strengths.
On the other hand, these strengths are also a challenge for our engineers. Because of the breadth and diversity of our clients, there was no single perspective of service quality that could span each of our different offerings. Our intuition about the customer experience could only scale so far without concrete data, and so we set about curating trustworthy and timely service quality data that we could use to drive significant decisions about our whole platform.
Naturally, we already had excellent observability of technical details like CPU and memory usage, storage performance and replication throughput. These metrics were interesting but unsatisfying in isolation: none of them really captured the essence of "are our customers happy?", and they don't always correlate with customer happiness. For instance, a database under heavy load might have high CPU and memory usage but the customer can be perfectly happy with its throughput. Conversely, a customer might be quite upset when the database load is low - perhaps a network problem has made it inaccessible!
We found inspiration in the Site Reliability Engineering (SRE) discipline, which has been adopted by others at Goldman Sachs and across the wider industry, and set out to identify Service Level Indicators (SLIs) for our database platform. An SLI is a thoughtfully defined measurement of some aspect of a service offering which, ideally, correlates very strongly with an aspect of our customers' happiness. We decided to start with an Availability SLI: our customers are happy when their applications can reliably use their databases as intended.
Implementing our Availability SLI was no simple task. We had to measure something meaningful about using a database. The implementation had to scale to thousands of databases and several different database technologies. We also needed to account for our globally distributed network: our customers operate their businesses from nearly anywhere. In addition, we had to be highly reliable - more reliable than the databases we were monitoring. Failures in our measurement implementation had to have as small a blast radius as possible, if we were to trust the collected data.
Our first task was to define what Availability actually meant for our SLI. There are many different use cases for databases. Some database users expect to perform low-latency, high-speed transaction processing. Other users may be interested in performing complex analytical queries, either on a regular basis or driven by urgent business needs. There's no perfect definition of "available" that fits every use case. We can get pretty close though. We know that the ability to accept user connections and perform straightforward read and write operations is fundamental to database health. A database is certainly unhealthy if it cannot perform these basic tasks.
This led us to the idea of an Availability Prober. A prober is merely a tool that acts like a synthetic user and reports if it could achieve a task or not. Our database prober could attempt to perform read and write operations on a database and confirm success or failure. We could then count the number of successful probes and compare this to the number of attempted probes to calculate any database's Availability SLI.
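The ratio described above can be sketched in a few lines of Java. This is only an illustration, not the production code: the class name `AvailabilitySli` and its methods are hypothetical, and returning NaN before any probe has run is an assumed convention.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical per-database probe counter (illustrative names, not the production prober). */
final class AvailabilitySli {
    private final LongAdder attempted = new LongAdder();
    private final LongAdder succeeded = new LongAdder();

    /** Record a probe that connected, read and wrote successfully. */
    void recordSuccess() {
        attempted.increment();
        succeeded.increment();
    }

    /** Record a probe that failed or timed out. */
    void recordFailure() {
        attempted.increment();
    }

    /** Successful probes divided by attempted probes; NaN (an assumed convention) if none were attempted. */
    double value() {
        long total = attempted.sum();
        return total == 0 ? Double.NaN : (double) succeeded.sum() / total;
    }
}
```

With 999 successful probes out of 1,000 attempts, `value()` evaluates to 0.999, i.e. 99.9% availability over that window.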
This presented a number of interesting engineering challenges. While we run many different kinds of databases, we didn't want to write a lot of different kinds of probers to match. We wanted a single prober that we could operate anywhere our users could operate. A good proportion of our users use Java and JDBC (Java Database Connectivity) for their database interaction. All of our relational databases have a JDBC driver available. Java and JDBC seemed like the perfect choice given our expected scale. JDBC is the standard Java API for interacting with any relational database backend. In theory, a standard API with well-supported drivers for each of our database types would be a great productivity boost. We could write the prober once and only vary which driver we loaded or what specific commands we wanted to send to the database. However, as we began to implement this idea, the JDBC abstraction presented some unique challenges.
One challenge is that JDBC is a blocking, synchronous API. A blocking, synchronous API call could potentially bring an entire thread of execution to a screeching halt. Our prober has to be able to monitor thousands of databases - we couldn't let one misbehaving database impact the prober's measurements of other databases.
JDBC does try to provide some flexibility here. There are various ways to specify network-level timeouts, login-level timeouts, and even statement-level timeouts. However, there is no holistic ability to define a timeout for the complete sequence of operations we want our prober to perform. We also found that different JDBC driver vendors had varying levels of support for timeouts so that there was no "perfect" solution.
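For reference, the three timeout levels mentioned above map onto standard `java.sql` calls roughly as follows. The helper class `JdbcTimeouts` is hypothetical and exists only to group the calls; how faithfully each setting is honored depends on the driver, and none of them bounds the whole connect-read-write sequence.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.Executor;

/** Hypothetical helper grouping the standard JDBC timeout settings. */
final class JdbcTimeouts {
    /** Login level: maximum seconds a driver may wait while connecting. Global to DriverManager. */
    static void applyLoginTimeout(int seconds) {
        DriverManager.setLoginTimeout(seconds);
    }

    /** Network level: maximum milliseconds the connection waits for a reply from the database. */
    static void applyNetworkTimeout(Connection conn, Executor executor, int millis) throws SQLException {
        conn.setNetworkTimeout(executor, millis);
    }

    /** Statement level: maximum seconds a single statement may run before the driver gives up. */
    static void applyQueryTimeout(Statement stmt, int seconds) throws SQLException {
        stmt.setQueryTimeout(seconds);
    }
}
```

Each call covers only one phase, which is why a separate, end-to-end deadline is still needed around the whole probe.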
The naïve solution to this problem is to orchestrate supervision of each probe. One thread of execution can perform the probe while a distinct thread supervises its progress. If the probe succeeds, the supervisor can record a success. If the probe fails or fails to terminate after a deadline is reached, then the supervisor can record a probe failure and deal with any cleanup necessary. This approach is fairly complex and difficult to scale. A simple implementation requires at least 2N threads to monitor N databases and managing the shared state between probe-thread and supervisor-thread is an unenviable task.
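To make the cost of the naïve design concrete, here is a minimal sketch of one supervised probe using only `java.util.concurrent`. The class `SupervisedProbe` and the modeling of a probe as a `Callable<Boolean>` are assumptions for illustration. Note that the calling thread blocks as supervisor while a second thread runs the probe, which is where the 2N thread count comes from.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration of the thread-per-probe plus supervisor design. */
final class SupervisedProbe {
    /** Returns true only if the probe reports success before the deadline. */
    static boolean runOnce(Callable<Boolean> probe, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor(); // the probe thread
        Future<Boolean> outcome = worker.submit(probe);
        try {
            // The calling thread is the supervisor: it blocks until the deadline.
            return Boolean.TRUE.equals(outcome.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException | ExecutionException e) {
            return false; // the probe overran its deadline or threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        } finally {
            outcome.cancel(true); // best effort: a blocked JDBC call may ignore the interrupt
            worker.shutdownNow();
        }
    }
}
```

Even this small version has to handle cancellation, interruption and cleanup by hand, and a probe stuck inside a driver call may not respond to the interrupt at all.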
Our approach was to use Reactive Programming techniques - a naturally asynchronous programming paradigm - to bridge the synchronous JDBC world to our prober's requirements. Reactive Programming provides an abstraction that lets us model the prober as a sequence of events and reactions to them. Our prober is modeled as a stream of timer events, ticking at a fixed interval. Our reaction is to start a new probe in response to each tick and also to start a timeout to handle the case where that probe does not complete in time. We used the RxJava framework to express this simplified version of the core probe loop:
    Prober p = new Prober(database);
    Flowable.interval(probeInterval, TimeUnit.MILLISECONDS)
        .flatMapCompletable(
            tick -> p.probe()
                .timeout(timeoutMilliseconds, TimeUnit.MILLISECONDS, timeoutScheduler)
                .doOnSuccess(result -> supervisor.recordSuccess(result))
                .doOnError(err -> supervisor.recordFailure(err))
                .onErrorComplete()
        )
On line 1, a new Prober is created. This encapsulates everything we need to know to connect to, authenticate to, and probe a specific database target. On line 2, we ask RxJava to fire a timer event every probeInterval milliseconds. On line 3, we're reacting to a stream of those timer events by computing a "completable" result. A Completable is a Reactive abstraction that eventually completes a task successfully or unsuccessfully. This is exactly what our database probes should do. Lines 4-8 define how we respond to each of the event ticks from Flowable.interval. Line 4 starts the database probe while line 5 composes in a supervisor to cancel the probe after a timeout. The timeoutScheduler is shared across many database targets and is smart enough to use only a small number of threads to schedule timeouts for hundreds of targets. Lines 6 and 7 deal with the probe (or timeout) outcomes and record the results for our SLI. Line 8, onErrorComplete(), makes sure errors are not treated fatally as otherwise they would cancel the flow of ticks from Flowable.interval.
Our production implementation differs slightly in some ways. For example, we use a modified implementation of Flowable.interval that adds a little jitter to spread all of the prober's operations across the entire wall clock and minimize the thundering herd problem.
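The post does not describe how the jitter is computed, so the following is only one plausible sketch using the standard library. The class `JitteredSchedule`, the uniform random start offset and the symmetric per-tick jitter are all assumptions, not the production design.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical jitter helpers; the real modified Flowable.interval may work differently. */
final class JitteredSchedule {
    /** Random start offset in [0, intervalMillis) so that targets do not all fire at the same instant. */
    static long initialOffsetMillis(long intervalMillis) {
        if (intervalMillis <= 0) throw new IllegalArgumentException("interval must be positive");
        return ThreadLocalRandom.current().nextLong(intervalMillis);
    }

    /** Interval adjusted by a uniform random amount within +/- jitterFraction of its length. */
    static long nextDelayMillis(long intervalMillis, double jitterFraction) {
        if (intervalMillis <= 0) throw new IllegalArgumentException("interval must be positive");
        if (jitterFraction < 0 || jitterFraction >= 1) throw new IllegalArgumentException("jitterFraction must be in [0, 1)");
        long spread = (long) (intervalMillis * jitterFraction);
        if (spread == 0) return intervalMillis;
        return intervalMillis + ThreadLocalRandom.current().nextLong(-spread, spread + 1);
    }
}
```

For a 1,000 ms interval and a jitter fraction of 0.1, each delay falls between 900 and 1,100 ms, which spreads probes for different targets across the clock instead of bunching them on the same tick.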
Although this can seem complex at first glance, the RxJava library and the reactive programming paradigm really paid off. The more efficient use of threads and the easier event-driven programming model delivered a low-overhead prober that is easy and inexpensive to run at scale. Our prober regularly performs the equivalent of many hundreds of probes per second from a modest dual-core virtual machine with a heap of less than 2 GB. This is despite spending relatively little engineering time on performance or memory optimization.
Of course, there was still plenty for us to learn on our journey. Probers can end up confronting some really interesting failure modes and we quickly learned there are subtle nuances to consider when managing probe failures. One specific example: a database can become so unresponsive that a probe cannot be reliably cancelled. In these rare cases, the prober might open a new connection to the database when the time comes for the next probe. This could lead to more and more connections adding further pressure to an already degraded database instance. The prober imposes a hard limit on its database connection use to prevent this kind of runaway degradation. There are also some optimizations to reuse connections - but the prober also periodically opens new ones to be certain we're measuring the entire "connect, authenticate, read and write" use case in a meaningful way. This "light touch" keeps the prober's overhead on the customer database as low as possible while still providing meaningful availability information.
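One simple way to enforce a hard connection limit like this is a counting semaphore. The sketch below is an assumption about how such a limit could look, not the production mechanism; the class `ConnectionBudget`, the `Outcome` enum and the choice to report a skipped probe separately are all hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.BooleanSupplier;

/** Hypothetical per-database connection budget backed by a semaphore. */
final class ConnectionBudget {
    enum Outcome { SUCCESS, FAILURE, SKIPPED }

    private final Semaphore permits;

    ConnectionBudget(int maxConnections) {
        this.permits = new Semaphore(maxConnections);
    }

    /** Runs the probe only if a permit is free; otherwise skips without touching the database. */
    Outcome probeIfPermitted(BooleanSupplier probe) {
        if (!permits.tryAcquire()) {
            return Outcome.SKIPPED; // earlier probes still hold every permit
        }
        try {
            return probe.getAsBoolean() ? Outcome.SUCCESS : Outcome.FAILURE;
        } finally {
            permits.release(); // a probe that never returns keeps its permit, capping open connections
        }
    }
}
```

Because a hung probe never reaches `release()`, its permit stays taken, so the number of connections opened against a degraded database cannot exceed `maxConnections`. Whether a skipped probe should count as a failure in the SLI is a separate design decision that this sketch leaves open.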
In production, we run multiple probers. These probers are sharded by database technology, as well as by region and data center. This assures us that, should there be an unexpected bug in any database driver or any unplanned physical infrastructure outage, a meaningful subset of probers will still remain operational. Rollouts of the prober are performed automatically on a shard-by-shard basis. Rollouts are automatically paused if the newly deployed probers do not appear healthy, so that there are never any interruptions in our observations.
A globally distributed and highly available time-series database is used to collect and store each prober's findings. This allows us to calculate the Availability SLI for any particular database instance or even aggregate Availability SLIs across our entire platform. We are also able to use these SLIs as sources for alerts so that our engineers benefit from meaningful alerts with a very high signal-to-noise ratio. Putting it all together, for any database instance we can plot a near real-time status signal and the SLI as measured over a 4-week (28-day) period, and even estimate round-trip latencies from the prober to that instance.
Our experience building the Prober has been incredibly valuable. Our objective was to understand our customers' happiness through data, and in that respect it has been a huge success. It taught us a lot about some of the unique and subtly complex ways databases can fail and improved our problem detection and analysis capabilities. It has also unlocked a new data-driven approach to setting engineering reliability goals: the Service Level Objective or SLO.
See https://www.gs.com/disclaimer/global_email for important risk disclosures, conflicts of interest, and other terms and conditions relating to this blog and your reliance on information contained in it.