SecDb is Goldman Sachs’ platform for trading, valuation, and risk management. Originally developed by Global Markets, SecDb has been in continual use for over 30 years by over 3,000 developers. Those developers have produced over 200 million lines of Slang (the SecDb proprietary language) code, which runs in over 160 million daily jobs. SecDb applications have over 13,000 daily end users across the front, middle, and back offices. An abrupt change to how SecDb operates could have an impact on trading, risk reporting, and other bank operations.
To support those use cases, SecDb applications natively work with over 10,000 globally distributed custom object databases. This post focuses on those SecDb databases, the backbone of the ecosystem, and the observability journey that has made the platform more resilient and observable over the last three years. Our resiliency posture has changed over the years; the following sections describe that journey and how we approached changing the resiliency of a large and important legacy platform.
The SecDb ecosystem is composed of over 10,000 databases supporting over 2.5 billion connections, receiving 164 TB of messages, and serving around 8 PB of data.
SecDb databases are in-memory key-value stores, organized into an eventually consistent replication group called a ring. Applications may write to any member of a ring, which provides excellent distributed performance. Databases within a ring are synchronized by a process called SecSync, which distributes updates and detects any inconsistencies in the eventually consistent data.
Simplified diagram depicting a ring, database and SecSync
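To make the ring model concrete, here is a minimal, illustrative Python sketch (not Slang, and not how SecDb is actually implemented): a write goes to any ring member, and a SecSync-like pass later propagates the newest version of each key to the other members, using last-write-wins timestamps purely for illustration.

```python
import time

class Database:
    """Toy in-memory key-value store; values carry a timestamp for reconciliation."""
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value):
        self.data[key] = (time.time(), value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

class Ring:
    """A replication group: clients may write to any member."""
    def __init__(self, members):
        self.members = members

    def sync_pass(self):
        """SecSync-like pass: push the newest version of every key to all members."""
        latest = {}
        for db in self.members:
            for key, (ts, value) in db.data.items():
                if key not in latest or ts > latest[key][0]:
                    latest[key] = (ts, value)
        for db in self.members:
            db.data.update(latest)  # eventually consistent after the pass

ring = Ring([Database("ny"), Database("ldn"), Database("hk")])
ring.members[0].write("trade:123", {"px": 101.5})  # write to any member
ring.sync_pass()                                   # replicate across the ring
print(ring.members[2].read("trade:123"))           # now visible everywhere
```

The real SecSync also detects inconsistencies in the eventually consistent data, which this toy reconciliation glosses over.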
Over the last 30 years, SecDb has gone through several iterations of monitoring technologies. In recent history, SecDb metrics have been stored in TSDB, a time-series database solution tightly coupled with SecDb. The SecDb database team (known as SecDBAs) uses TSDB metrics during incidents and general investigations to diagnose a myriad of known issues, including database memory hard limits and connection limits. SecDBAs also leverage PlotTool, a time-series plotting tool that supplies a rich set of mathematical functions to perform operations on the data.
In this diagram, you can see the change in growth behavior for a ring with the current memory hard limit and new proposed hard limit:
Even though the physical machine may have more physical memory, each database has its own hard limit, since there may be other databases on the same host.
TSDB and PlotTool are great tools, but they’re not a monitoring system with real-time events and escalations. They were designed to store and analyze financial time-series data, not to generate real-time signals for an operational system. SecDBAs used both tools to generate reports and email alerts through batch jobs. As the number of databases increased over the years, so did the maintenance burden and the volume of emails that did not need immediate attention, which became noise during critical events on the platform.
We knew how valuable TSDB had been for the SecDBAs, so we wanted to continue to leverage it while adding a rule-based event engine with escalations. After discussion with the Site Reliability Engineering (SRE) team, we opted for Prometheus with Alert Manager so we could leverage the PromQL query language. Due to the criticality of the databases, we implemented a probing mechanism rather than changing the databases themselves.
The Metrics Probe collects metrics by interacting with a daemon process that runs on every machine we manage. The probe processes the data into Prometheus-ready metrics and exports them on a given port for the SRE-managed Prometheus instance to pick up.
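As an illustration only, a probe built on the standard prometheus_client library could expose the collected values on a port like this; the metric names, labels, and the daemon interface below are hypothetical, not the actual SecDb probe.

```python
import time
from prometheus_client import Gauge, start_http_server

# Hypothetical metric names and labels; the real probe talks to a per-host
# daemon and exports many more series.
DB_MEMORY = Gauge("secdb_db_memory_bytes", "Database memory usage", ["ring", "db"])
DB_CONNECTIONS = Gauge("secdb_db_connections", "Open client connections", ["ring", "db"])

def poll_daemon():
    """Placeholder for querying the per-host daemon; returns fake samples."""
    return [{"ring": "rates", "db": "rates_ny_1", "memory": 7.2e9, "connections": 1450}]

if __name__ == "__main__":
    start_http_server(9400)  # Prometheus scrapes this port
    while True:
        for sample in poll_daemon():
            labels = (sample["ring"], sample["db"])
            DB_MEMORY.labels(*labels).set(sample["memory"])
            DB_CONNECTIONS.labels(*labels).set(sample["connections"])
        time.sleep(15)  # align roughly with the scrape interval
```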
As a result of the probes, we constructed dashboards for rings, hosts, and databases. We leveraged PromQL rules to create events: some create operational risk trackers for non-critical items, while others are escalated through PagerDuty to our on-call engineer.
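To give a flavor of such a rule (the series names and threshold below are hypothetical), a PromQL expression that flags databases approaching their memory hard limit can be tried out against the Prometheus HTTP API before it is promoted to an alerting rule:

```python
import requests

PROMETHEUS = "http://prometheus.example.com:9090"  # hypothetical endpoint

# Hypothetical series: flag databases within 15% of their memory hard limit.
EXPR = "secdb_db_memory_bytes / secdb_db_memory_hard_limit_bytes > 0.85"

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": EXPR}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    ratio = float(result["value"][1])
    print(f"{labels.get('ring')}/{labels.get('db')} at {ratio:.0%} of hard limit")
```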
The metrics probe provided real-time telemetry for the databases and immediate escalations when issues occurred. The real-time nature of the alerts allowed SecDBAs to ask a new set of questions about the infrastructure, questions that required richer telemetry only the database’s internal state could provide. To supply that telemetry, we built the Metric Server, a native database component that treats database telemetry as a first-class citizen, just like the data we store for clients.
The database is a group of processes that leverage shared memory for their state and work. The Metric Server stores its telemetry in shared memory and exposes it through a port for Prometheus to poll. When deciding how to build the server, we first looked at available libraries, but concurrency and locking concerns led us to write the component ourselves.
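The sketch below is a heavily simplified Python illustration of that idea; the real Metric Server is a native component, and every name here is made up. Worker processes update counters in a shared-memory segment, and a small HTTP handler renders them in the Prometheus text exposition format.

```python
import struct
from http.server import BaseHTTPRequestHandler, HTTPServer
from multiprocessing import shared_memory

# Hypothetical layout: one 8-byte integer per metric at a fixed offset in the segment.
METRIC_OFFSETS = {"secdb_requests_total": 0, "secdb_replication_lag_ms": 8}
SHM = shared_memory.SharedMemory(name="secdb_metrics_demo", create=True, size=64)

def record(metric, value):
    # In the real design this is called from many database processes, which is
    # where the concurrency and locking concerns come from; here it is a plain write.
    struct.pack_into("q", SHM.buf, METRIC_OFFSETS[metric], value)

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Render the shared-memory counters in the Prometheus text exposition format.
        lines = []
        for name, offset in METRIC_OFFSETS.items():
            (value,) = struct.unpack_from("q", SHM.buf, offset)
            lines.append(f"{name} {value}")
        body = ("\n".join(lines) + "\n").encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    record("secdb_requests_total", 42)
    HTTPServer(("", 9500), MetricsHandler).serve_forever()  # Prometheus polls this port
```

A production version would also have to coordinate concurrent writers safely, which is exactly the locking concern that pushed us toward a purpose-built component.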
The new Metric Server allowed us to gather the internal telemetry of the database as transactions progress through its internal pipeline, giving us visibility we didn't have before.
Different database process request types
The Metric Server opened the door to granular telemetry, which we used to collect many more service-level indicators (SLIs) per database. What had been one endpoint per host became one endpoint per database. The globally hosted SRE Prometheus requires a relatively static endpoint configuration, which demanded another solution for SecDb: we move hundreds of databases across machines all the time and create new ones on demand.
To solve this problem, we created a SecDb regional Prometheus infrastructure as an intermediary to the Global SRE Prometheus. The regional collector allowed us to update its scrape configuration dynamically whenever our environment changed. It also helped us determine what to push to SRE Prometheus for long-term storage and what to keep temporarily for quick diagnostics.
SecDb regional Prometheus setup
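One common way to achieve that dynamism is Prometheus file-based service discovery: a small job regenerates a file_sd targets file whenever databases move or are created, and the scraping Prometheus picks up the change without a restart. The sketch below is illustrative only, with made-up hostnames and paths, and is not necessarily how the SecDb regional collector is configured.

```python
import json

def render_file_sd(inventory, path="/etc/prometheus/targets/secdb.json"):
    """Write a file_sd_configs target list from the current database inventory.

    `inventory` is a hypothetical list of (ring, db, host, port) records from
    whatever system tracks where each database is running right now.
    """
    targets = [
        {"targets": [f"{host}:{port}"], "labels": {"ring": ring, "db": db}}
        for ring, db, host, port in inventory
    ]
    with open(path, "w") as fh:
        json.dump(targets, fh, indent=2)

# Made-up entries; rerun (or run on a timer) whenever a database moves or is created.
render_file_sd([
    ("rates", "rates_ny_1", "db-host-17.example.com", 9500),
    ("fx",    "fx_ldn_3",   "db-host-42.example.com", 9500),
])
```

A scrape job that points file_sd_configs at the generated file picks up changes automatically, so no Prometheus restart is needed when databases move.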
Access to this telemetry let us monitor more aspects of the databases, which in turn generated more alerts. As part of the process, we started to curate the data and our responses.
The SecDb team follows an around-the-sun support model, with engineers on call to support the global nature of our businesses. Each supporting region has six on-call engineers who rotate weekly and typically synchronize with the global team on pending alerts and operational tasks. To facilitate on-call rotations and streamline alert management, we started to use SOS, which provides a seamless transition from region to region.
SOS - On-Call Rotation Tool
Adding new metrics has become part of our normal operating process. We add new metrics as a result of daily investigations, or as action items from our incident management process to prevent the same issue from occurring again. Defining new SLOs for our clients is another key driver of new metrics. The SecDb databases publish their SLO data to the SLO Repository, a product provided by the global SRE team that hosts the SLOs for all of the firm’s critical services.
The databases record two types of logs: connection logs and normal component logs. Connection logs provide details about the clients and applications connecting to the database; in some databases there are over 150,000 simultaneous connections at any given time. Processing and analyzing that data per ring over time has been difficult, so we started to ship it to BQL (BigQuery Logging), the centralized logging solution provided by the global SRE team.
Using BQL, we can aggregate data on demand and see when particular clients start making more connections than before. Note that we don’t alert on this data; for alerting we monitor the number_of_connections time series published to Prometheus through the Metric Server. The connection details, however, are invaluable when diagnosing issues.
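For illustration (the log fields here are invented, and in practice the aggregation runs in BQL), the analysis boils down to grouping connection records by client application per time bucket and watching for jumps:

```python
import json
from collections import Counter
from datetime import datetime

def connections_per_hour(log_lines):
    """Count connections per (client application, hour) from JSON-lines connection logs."""
    counts = Counter()
    for line in log_lines:
        rec = json.loads(line)  # hypothetical fields: ts, client_app, ring
        hour = datetime.fromisoformat(rec["ts"]).strftime("%Y-%m-%d %H:00")
        counts[(rec["client_app"], hour)] += 1
    return counts

sample = [
    '{"ts": "2024-03-01T09:12:03", "client_app": "risk_report", "ring": "rates"}',
    '{"ts": "2024-03-01T09:47:41", "client_app": "risk_report", "ring": "rates"}',
    '{"ts": "2024-03-01T10:03:10", "client_app": "pricing_ui", "ring": "rates"}',
]
for (app, hour), n in sorted(connections_per_hour(sample).items()):
    print(f"{hour}  {app:<12} {n}")
```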
We also ship application component logs to BQL, so that logging data remains available from a centralized service when machines fail or processes move from one machine to another.
Sample Database Connection Logs
Observability is a continuous process. We continue to add SLIs based on our day-to-day experience, as part of our incident management process, and as our clients need new SLOs. We also continue to evolve our solutions to keep pace with the ever-changing nature of the business and to further reduce toil.
In 2023 and 2024, we are extending the observability effort beyond the databases to all the other products under the SecDb Platform. This exciting journey will present a new set of challenges as we deal with a different set of products. Stay tuned for an update as we execute on it!
Interested in solving complex problems? Learn more about engineering careers at Goldman Sachs.
See https://www.gs.com/disclaimer/global_email for important risk disclosures, conflicts of interest, and other terms and conditions relating to this blog and your reliance on information contained in it.