September 23, 2021

Building Platforms for Data Engineering

Neema Raphael, Pierre De Belen, Eric Esterkin, Sarah Jankowski, Beeke-Marie Nelke - Data Engineering

As the volume, speed, complexity and demand of data accelerate dramatically, it has become increasingly difficult to ensure clients and internal businesses have timely access to actionable data. The ability to access more and varied data offers a competitive advantage, but it is balanced by the costs and complexities of managing data silos, data duplication and data quality.

Over the last seven years, the Data Engineering team at Goldman Sachs has built and open sourced a solution to this rapidly growing challenge – providing accurate, timely, and safe access to data with increased efficiency and reliability. Our solution, Legend*, is an open source data platform that breaks down data silos while connecting business and technology teams. Legend includes novel features that accelerate information curation and sharing through consistent data vocabularies (read: data models) as well as self-service capabilities for retrieving and working with data – all while respecting the entitlements of the underlying sources.

Sounds complex! What does data modeling actually mean?

In its simplest form, data modeling is just thinking about the meaning of your data and its relationships to other data - and then writing that down in a formal way! As a discipline, data modeling forces you to think about data design just as you would about system design.

The image below shows how Legend formalizes the hierarchical relationship between a firm and a legal entity (indicated by the arrow symbol). The LegalEntity parent class shares its attributes (such as name) with the Firm class. The Firm itself has several additional properties, including company type and employees.
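For readers who think in code, the same hierarchy can be sketched in Python. This is only an illustration of the idea, not Legend's own modeling syntax: the Person class, the CompanyType values and every field name other than name, company type and employees are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class CompanyType(Enum):
    # Hypothetical values; the article only says a Firm has a company type.
    LLC = "LLC"
    CORPORATION = "Corporation"


@dataclass
class Person:
    # Hypothetical shape for an employee record.
    first_name: str
    last_name: str


@dataclass
class LegalEntity:
    # Parent class: its attributes are shared with every subclass.
    name: str


@dataclass
class Firm(LegalEntity):
    # Firm inherits `name` and adds its own properties.
    company_type: CompanyType = CompanyType.LLC
    employees: List[Person] = field(default_factory=list)


firm = Firm(
    name="Example Co",
    company_type=CompanyType.CORPORATION,
    employees=[Person("Ada", "Lovelace")],
)
```

Because Firm extends LegalEntity, anything written against LegalEntity (a search by name, for instance) can also be given a Firm.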

Now this sounds simple! But how does data modeling work within the complexities of Goldman Sachs’ voluminous data?

Specifically, users can connect data sources and write declarative data validations. They can then build complex queries for data retrieval and run structural transformations on the fly, all executed within a resilient hosted environment. Within the model:

  • Data access is bound directly to the model, either via service definition, code generation or embedded code execution
  • Data producers and consumers use the same model definition, or a linked (mapped) model defined in code
  • Validation and data quality metrics are defined with direct reference to the model
  • The model is registered centrally and used as reference for data contracts between teams
  • The model is not specific to a physical implementation: it acts as a logical layer. Mappings add links to physical implementation (e.g. service endpoint, database schema)
  • The model is versioned to allow staged release of dependent code and services
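Two of the points above, validations defined against the model and mappings that link a logical model to a physical implementation, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Legend implements either feature: the Firm fields, the constraint names and the column names FIRM_NM and EMP_CNT are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Firm:
    # Logical model: no knowledge of tables, columns or endpoints.
    name: str
    employee_count: int


# Declarative validations: named predicates that refer only to model properties.
Constraint = Tuple[str, Callable[[Firm], bool]]
FIRM_CONSTRAINTS: List[Constraint] = [
    ("name_not_blank", lambda f: bool(f.name.strip())),
    ("employee_count_non_negative", lambda f: f.employee_count >= 0),
]

# Mapping: logical property -> physical column (hypothetical column names).
FIRM_MAPPING: Dict[str, str] = {"name": "FIRM_NM", "employee_count": "EMP_CNT"}


def from_row(row: Dict[str, object]) -> Firm:
    """Build a logical Firm from a physical row using the mapping."""
    return Firm(**{prop: row[col] for prop, col in FIRM_MAPPING.items()})


def failed_constraints(firm: Firm) -> List[str]:
    """Return the names of the constraints that the instance violates."""
    return [name for name, check in FIRM_CONSTRAINTS if not check(firm)]


good = from_row({"FIRM_NM": "Example Co", "EMP_CNT": 12})
bad = from_row({"FIRM_NM": "  ", "EMP_CNT": -1})
```

Swapping the database schema only means changing FIRM_MAPPING. The model and its constraints stay untouched, which is the sense in which the model acts as a logical layer.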

Finally, the data models can then be published in different systems for decision-making including business intelligence tools, search tools, UI query tools or APIs.

In practice, there will likely be several models mapped to one another to create more complex joins and to provide the user with an understanding of how different data is transformed. In the Legend example below, a deprecated model with a Person and Organization class is mapped to a new model with an Employee and Firm class to allow transforming the data from one model to the other. For example, a person's last and first name will be represented as the Employee's full name in the new model.
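The model-to-model mapping described here can be approximated in Python as a pair of pure functions. This is a sketch of the idea rather than Legend's mapping syntax; the field names (members, legal_name) and the "first last" ordering of the full name are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


# Deprecated source model.
@dataclass
class Person:
    first_name: str
    last_name: str


@dataclass
class Organization:
    name: str
    members: List[Person]


# New target model.
@dataclass
class Employee:
    full_name: str


@dataclass
class Firm:
    legal_name: str
    employees: List[Employee]


def to_employee(person: Person) -> Employee:
    # Assumed rule: full name is first name, a space, then last name.
    return Employee(full_name=f"{person.first_name} {person.last_name}")


def to_firm(org: Organization) -> Firm:
    # Map the container and each nested Person in one pass.
    return Firm(
        legal_name=org.name,
        employees=[to_employee(p) for p in org.members],
    )


old = Organization(name="Example Co", members=[Person("Ada", "Lovelace")])
new = to_firm(old)
```

Keeping the mapping as explicit, side-effect-free functions makes the transformation easy to read and test, which mirrors the point that mappings help users understand how data moves from one model to another.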

With an understanding of basic data modeling theory and practice, let’s look at two representative use cases within Goldman Sachs where data models play a key role in creating efficiencies, reducing time to market, embedding entitlements, enforcing data quality, and empowering decision-makers.

  • A quant fund looking for reference data infrastructure within their trading operation
  • An internal data scientist publishing transaction trends for their coverage team

There are hundreds of variations of similar workflows - integrating new vendor data for market making desks, delivering a statement to clients, building a new data 'feed' between internal teams, and more.

It is possible that every request could be tackled in a bespoke fashion with varied methods of data collection, storage, normalization, aggregation, transformation, validation, reconciliation, provenance and usage tracking, visibility, licensing, as well as explanation, sharing, and presentation. Instead, our data platform streamlines these exact end-to-end workflows for all data engineers, data analysts, and data scientists starting with both publishing and providing access to the right data in the right structure with the right permissions.

Leveraging Marquee** data models and the business-friendly user interface of the Legend platform, the quant researcher simply clicks 'Subscribe to 3 year rolling history' and has the full history of product data with symbology available in a hosted database within minutes. The data is refreshed automatically in real time, and the researcher can download it programmatically and use it in their back-testing and production algorithm immediately. This eliminates weeks of data wrangling.

The data scientist can now leverage our Legend browser UI to search using business terms they are familiar with (e.g. Trades, Deals, Portfolios, Positions, Allocations, Loans, etc.). The browser includes visibility and entitlement checks, various query tools to retrieve data, and seamless access to related concepts (e.g. Portfolio → Accounts → Positions → Trades). The whole experience is self-service, meaning engineers and non-engineers are able to request the appropriate permissions as well as ask questions of - and importantly understand - the data with ease. The scientist can now source all transactions they have access to, enrich them with new large-scale datasets they are evaluating, and publish a hosted dashboard for their client team.

Creating data models within Legend is recognized as non-trivially increasing engineering productivity, materially reducing time to market, and creating commercial opportunities to help our clients across the firm.

To watch some data modeling in action, check out our demo.

To learn more about Legend and our open sourcing partnership, check out our launch video.

To hear more about our vision for how Legend can be a key tool for improving data quality across Financial Services, and to find out how the industry has been embracing Legend in its first year as an open source project, check out our upcoming presentation at the Open Source Strategy Forum in New York City on November 10.

Our team is also growing and we have several exciting opportunities!

(*) The open source contributions mentioned in this article relate to data models. The resulting collaborations involve the exchange of non-proprietary, non-confidential and non-licensed information only.

(**) The Goldman Sachs Marquee® platform is for institutional and professional clients only. Some of the services and products described on this site may not be available in certain jurisdictions or to certain types of client. Please contact your Goldman Sachs sales representative with any questions. Nothing on this site constitutes an offer, or an invitation to make an offer from Goldman Sachs to purchase or sell a product.


See https://www.gs.com/disclaimer/global_email for important risk disclosures, conflicts of interest, and other terms and conditions relating to this blog and your reliance on information contained in it.
