September 23, 2021

Building Platforms for Data Engineering

Neema Raphael, Pierre De Belen, Eric Esterkin, Sarah Jankowski, Beeke-Marie Nelke - Data Engineering

As the volume, speed and complexity of data, and the demand for it, accelerate dramatically, it has become increasingly difficult to ensure clients and internal businesses have timely access to actionable data. The ability to access more and varied data offers a competitive advantage, but that advantage is offset by the costs and complexities of managing data silos, data duplication and data quality.

Over the last seven years, the Data Engineering team at Goldman Sachs has built and open sourced a solution to this rapidly growing challenge – providing accurate, timely, and safe access to data with increased efficiency and reliability. Our solution, Legend*, is an open source data platform that breaks down data silos while connecting business and technology teams. Legend includes novel features that accelerate information curation and sharing through consistent data vocabularies (read: data models) as well as self-service capabilities for retrieving and working with data – all while respecting the entitlements of the underlying sources.

Sounds complex! What does data modeling actually mean?

In its simplest form, data modeling is just thinking about the meaning of your data and its relationships to other data - and then writing that down in a formal way! As a discipline, data modeling forces you to think about data design just as you would about system design.

The image below shows how Legend formalizes the hierarchical relationship between a firm and legal entity (indicated by the arrow symbol). The LegalEntity parent class shares its attributes (such as name) with the Firm class. The Firm itself has several additional properties including company type and employees.
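For readers who think in code, the same relationship can be sketched in plain Python. This is an illustration only, not Legend's own modeling syntax; the `Person` class and the field names `company_type` and `employees` are assumptions based on the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LegalEntity:
    """Parent class: attributes shared by every legal entity."""
    name: str


@dataclass
class Person:
    """Hypothetical employee record; fields are illustrative only."""
    first_name: str
    last_name: str


@dataclass
class Firm(LegalEntity):
    """A Firm is a LegalEntity with additional properties of its own."""
    company_type: str = "Unknown"
    employees: List[Person] = field(default_factory=list)


# A Firm inherits `name` from LegalEntity and adds its own properties.
firm = Firm(
    name="Example Co",
    company_type="LLC",
    employees=[Person("Ada", "Lovelace")],
)
print(firm.name, firm.company_type, len(firm.employees))
```

The arrow in the diagram corresponds to the `Firm(LegalEntity)` inheritance: anything that accepts a `LegalEntity` can also accept a `Firm`.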

Now this sounds simple! But how does data modeling work within the complexities of Goldman Sachs’ voluminous data?

Specifically, users can connect data sources and write declarative data validations. Then they can build complex queries for data retrieval and run structural transformations on the fly, all executed within a resilient hosted environment. Within the model:

  • Data access is bound directly to the model, either via service definition, code generation or embedded code execution
  • Data producers and consumers use the same model definition, or a linked (mapped) model defined in code
  • Validation and data quality metrics are defined with direct reference to the model
  • The model is registered centrally and used as reference for data contracts between teams
  • The model is not specific to a physical implementation: it acts as a logical layer. Mappings add links to physical implementation (e.g. service endpoint, database schema)
  • The model is versioned to allow staged release of dependent code and services
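Two of the points above, validations defined against the model and mappings that link the logical model to a physical implementation, can be sketched in a few lines of Python. This is a minimal sketch and not how Legend implements either feature; the column names (`FIRM_NM`, `EMP_CNT`), the constraint names and the helper functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Firm:
    """Logical model: no knowledge of where the data is stored."""
    name: str
    employee_count: int


# Mapping from logical properties to physical column names (assumed names).
FIRM_TO_COLUMNS: Dict[str, str] = {
    "name": "FIRM_NM",
    "employee_count": "EMP_CNT",
}

# Declarative constraints: each rule is a name plus a predicate on the model.
FIRM_CONSTRAINTS: Dict[str, Callable[[Firm], bool]] = {
    "name_not_empty": lambda f: len(f.name.strip()) > 0,
    "employee_count_non_negative": lambda f: f.employee_count >= 0,
}


def load_firm(row: Dict[str, object]) -> Firm:
    """Build a logical Firm from a physical row using the mapping."""
    values = {prop: row[col] for prop, col in FIRM_TO_COLUMNS.items()}
    return Firm(**values)


def validate(firm: Firm) -> List[str]:
    """Return the names of the constraints that the instance violates."""
    return [rule for rule, check in FIRM_CONSTRAINTS.items() if not check(firm)]


good = load_firm({"FIRM_NM": "Example Co", "EMP_CNT": 12})
bad = load_firm({"FIRM_NM": "  ", "EMP_CNT": -1})
print(validate(good))  # no violations
print(validate(bad))   # both constraints violated
```

Because the constraints refer only to the logical `Firm`, swapping the physical source means changing `FIRM_TO_COLUMNS` and nothing else.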

Finally, the data models can then be published in different systems for decision-making including business intelligence tools, search tools, UI query tools or APIs.

In practice, there will likely be several models mapped to one another to create more complex joins and to provide the user with an understanding of how different data is transformed. In the Legend example below, a deprecated model with a Person and Organization class is mapped to a new model with an Employee and Firm class to allow transforming the data from one model to the other. For example, a person's last and first name will be represented as the Employee's full name in the new model.
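The name example can be written as a small model-to-model transformation. The sketch below uses plain Python rather than Legend's mapping syntax, and the property names (`first_name`, `last_name`, `full_name`, `legal_name`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


# Deprecated (source) model
@dataclass
class Person:
    first_name: str
    last_name: str


@dataclass
class Organization:
    name: str
    people: List[Person]


# New (target) model
@dataclass
class Employee:
    full_name: str


@dataclass
class Firm:
    legal_name: str
    employees: List[Employee]


def map_person(person: Person) -> Employee:
    """First and last name are combined into the Employee's full name."""
    return Employee(full_name=f"{person.first_name} {person.last_name}")


def map_organization(org: Organization) -> Firm:
    """Map the whole source object graph onto the target model."""
    return Firm(
        legal_name=org.name,
        employees=[map_person(p) for p in org.people],
    )


old = Organization(name="Example Co", people=[Person("Ada", "Lovelace")])
new = map_organization(old)
print(new)
```

Keeping the mapping as explicit functions between two typed models is what lets a reader see exactly how each target property is derived from the source.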

With an understanding of basic data modeling theory and practice, let’s look at two representative use cases within Goldman Sachs where data models play a key role in creating efficiencies, reducing time to market, embedding entitlements, enforcing data quality, and empowering decision-makers.

  • A quant fund looking for reference data infrastructure within their trading operation
  • An internal data scientist publishing transaction trends for their coverage team

There are hundreds of variations of similar workflows - integrating new vendor data for market making desks, delivering a statement to clients, building a new data 'feed' between internal teams, and more.

It is possible that every request could be tackled in a bespoke fashion with varied methods of data collection, storage, normalization, aggregation, transformation, validation, reconciliation, provenance and usage tracking, visibility, licensing, as well as explanation, sharing, and presentation. Instead, our data platform streamlines these exact end-to-end workflows for all data engineers, data analysts, and data scientists starting with both publishing and providing access to the right data in the right structure with the right permissions.

Leveraging Marquee** data models and the business-friendly user interface of the Legend platform, the quant researcher simply clicks 'Subscribe to 3 year rolling history' and has the full history of product data with symbology available in a hosted database within minutes. The data is refreshed automatically in real time, and the researcher can download it programmatically and use it in their back-testing and production algorithms immediately. This eliminates weeks of data wrangling.

The data scientist can now use our Legend browser UI to search using business terms they are familiar with (e.g. Trades, Deals, Portfolios, Positions, Allocations, Loans). The browser includes visibility and entitlement checks, various query tools to retrieve data, and seamless access to related concepts (e.g. Portfolio → Accounts → Positions → Trades). The whole experience is self-service, meaning engineers and non-engineers alike can request the appropriate permissions, ask questions of the data and, importantly, understand it with ease. The scientist can now source all transactions they have access to, enrich them with new large-scale datasets they are evaluating, and publish a hosted dashboard for their client team.

Across the firm, creating data models within Legend is recognized as meaningfully increasing engineering productivity, materially reducing time to market, and creating commercial opportunities to help our clients.

To watch some data modeling in action, check out our demo.

To learn more about Legend and our open sourcing partnership, check out our launch video.

To hear more about our vision for how Legend can be a key tool for improving data quality across Financial Services, and about how the industry has embraced Legend in its first year as an open source project, check out our upcoming presentation at the Open Source Strategy Forum in New York City on November 10.

Our team is also growing and we have several exciting opportunities!

(*) The open source contributions mentioned in this article relate to data models. The resulting collaborations involve the exchange of non-proprietary, non-confidential and non-licensed information only.

(**) The Goldman Sachs Marquee® platform is for institutional and professional clients only. Some of the services and products described on this site may not be available in certain jurisdictions or to certain types of client. Please contact your Goldman Sachs sales representative with any questions. Nothing on this site constitutes an offer, or an invitation to make an offer from Goldman Sachs to purchase or sell a product.


See https://www.gs.com/disclaimer/global_email for important risk disclosures, conflicts of interest, and other terms and conditions relating to this blog and your reliance on information contained in it.