November 9, 2022

Integrating Legend with Databricks' Lakehouse for Financial Services

Ephrim Stanley, VP, Data Engineering

Legend* is an open source data platform created by Goldman Sachs and contributed to the Fintech Open Source Foundation (FINOS).

We are excited to announce the integration of the Databricks Lakehouse platform with Legend.

This contribution from the Databricks team is a great example of the spirit of FINOS - collaboration and innovation in the financial services industry via open source software. Databricks is also a member of FINOS.

In this blog post, we will start with a primer on Legend's relational data modeling and data access capabilities. We will then move on to a quick discussion about the use of data models to establish consistent data vocabularies across applications, platforms, and organizations. Finally, we will wrap up with a peek into work streams that are underway in the Legend project.

Relational Data Modeling and Data Access

A data model is a formal way of describing the semantics of data and its relationship to other data. Our prior blog post, Building Platforms for Data Engineering, introduced a "Firm Employee" model that captures the semantics of a firm and its employees.

This model is an abstract concept. The actual data might be physically stored in a relational database (or a database that supports SQL), like Databricks. The power of Legend is that it allows data queries to be expressed in terms of logical model concepts (like "Firm", "Person") and not in terms of how "Firm", "Person" data is physically stored in the database. 

Relational data modeling and data access requires four components:

  1. Model - Description of the data concepts and their relationships
  2. Query - A query written in the Legend Pure language. The query navigates the data model to access model element attributes. The query below retrieves the name of the firm and the first and last names of its employees:

        Firm.all()->project([f|$f.legalName, f|$f.employees.firstName, f|$f.employees.lastName], ['Firm Name', 'Employee First Name', 'Employee Last Name'])
  3. Mapping - Bridges the logical and physical models. In this case, the elements of the logical data model are mapped to a physical relational model. The Studio screenshot below shows how the Firm and Employee model classes have been mapped to a Firm and Employee table respectively.
[Studio screenshot: the Firm and Employee model classes mapped to the Firm and Employee tables]

  4. Runtime - A runtime specifies where the data is physically stored. In this case, our data is stored in tables in a Databricks database/cluster.

[Studio screenshot: runtime configuration pointing at a Databricks database/cluster]

Using the above, Legend is able to translate the Pure query into a database specific SQL query, execute the query against the Databricks database, and return the results. 
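To make that translation concrete, the sketch below simulates the kind of join SQL Legend might generate for the Pure query above. The physical schema (firm/person tables, a firm_id foreign key) and column names are assumptions for illustration, not Legend's actual output, and it runs against an in-memory SQLite database rather than Databricks:

```python
import sqlite3

# Hypothetical physical schema standing in for the mapped Firm/Person tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE firm (id INTEGER PRIMARY KEY, legal_name TEXT);
    CREATE TABLE person (id INTEGER PRIMARY KEY, firm_id INTEGER,
                         first_name TEXT, last_name TEXT);
    INSERT INTO firm VALUES (1, 'Acme Corp');
    INSERT INTO person VALUES (1, 1, 'Ada', 'Lovelace');
""")

# Roughly the shape of SQL a mapping could produce for the Pure query:
#   Firm.all()->project([...legalName, ...firstName, ...lastName], [...])
sql = """
    SELECT f.legal_name AS "Firm Name",
           p.first_name AS "Employee First Name",
           p.last_name  AS "Employee Last Name"
    FROM firm f
    LEFT JOIN person p ON p.firm_id = f.id
"""
rows = conn.execute(sql).fetchall()
print(rows)  # [('Acme Corp', 'Ada', 'Lovelace')]
```

The point of the indirection is that the query author only ever wrote `Firm` and `employees`; the table names, join keys, and SQL dialect all come from the mapping and runtime.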

Databricks Relational Connector

Legend integrates with many databases and data platforms. Thanks to the contribution from Databricks, Legend can now integrate with Databricks databases.

The contribution from Databricks provides the following:

  1. Metamodel of a Databricks database - A way to describe the attributes of a Databricks database. This includes attributes such as host/port and authentication schemes.

  2. JDBC connector - A "connector" is a piece of Java code that connects to the database via JDBC and executes queries. While the JDBC specification offers a standard way to query relational databases with SQL, database connectivity and authentication are bespoke and need to be customized for different databases.

  3. SQL Generation - Not all databases support the same SQL dialect. This means that the SQL queries generated by Legend have to be tailored to match the database's dialect. In addition, the Pure language provides for "dynafunctions", which can either be pushed down to the database as SQL functions or expressed as computations in the Pure language.
The snippet below shows the 'concat' dynafunction being translated to the 'concat' SQL function.

forDynafunction('concat', [
    choice(DatabaseType.Databricks, $allStates, ^ToSql(format='concat%s', transform={p:String[*]|$p->joinStrings('(', ', ', ')')}))
])
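The transform in that snippet can be read as: join the arguments with '(', ', ', and ')', then splice the result into the 'concat%s' format string. A small Python sketch of that behavior (illustrative only, not Legend's actual code generator):

```python
def join_strings(parts, start, sep, end):
    # Mirrors the joinStrings('(', ', ', ')') call in the transform above.
    return start + sep.join(parts) + end

def concat_to_sql(args):
    # 'concat%s' with the parenthesized, comma-joined argument list spliced in.
    return "concat%s" % join_strings(args, "(", ", ", ")")

print(concat_to_sql(["legal_name", "' '", "first_name"]))
# concat(legal_name, ' ', first_name)
```

Databricks SQL happens to support `concat` natively, so the dynafunction can be pushed down; for a database without it, the same dynafunction would need a different `ToSql` choice or a Pure-side computation.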

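Returning to the JDBC connector described in item 2 above: the sketch below illustrates, in Python, why connectivity and authentication are bespoke per database. The class and registry names are invented for this post, and the URL shapes are simplified; Legend's actual connector SPI is Java and differs in detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ConnectionSpec:
    host: str
    port: int
    token: str  # e.g. a personal access token

# Each database type registers its own way of building a JDBC-style
# connection URL; the SQL it runs is standard, the connectivity is not.
ConnectorFactory = Callable[[ConnectionSpec], str]
_REGISTRY: Dict[str, ConnectorFactory] = {}

def register(db_type: str, factory: ConnectorFactory) -> None:
    _REGISTRY[db_type] = factory

def connection_url(db_type: str, spec: ConnectionSpec) -> str:
    return _REGISTRY[db_type](spec)

# Simplified, hypothetical URL shapes for illustration.
register("Databricks", lambda s: f"jdbc:databricks://{s.host}:{s.port};PWD={s.token}")
register("Postgres", lambda s: f"jdbc:postgresql://{s.host}:{s.port}/")

url = connection_url("Databricks", ConnectionSpec("dbc.example.com", 443, "dapi123"))
print(url)
```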
Consistent Data Models and Model Distribution 

Legend offers first-class support for data modeling, access, and governance. While Legend platform components like Studio offer a no-code solution to data modeling, the data models themselves are treated as machine-readable source code.

The models are stored in a Git (GitLab) repository managed by the Legend SDLC product. Legend SDLC's native integration with GitLab CI/CD allows Legend models to be managed, versioned, and distributed as code. 

Legend uses the Apache Maven protocol to distribute models as jar artifacts. These artifacts can then be used outside of the Legend suite of products. 

An example of this is the Databricks legend-delta showcase project that consumes Legend models and uses them to build Databricks pipelines. 

Note: Legend model artifacts can be consumed directly from Apache Maven repositories. The recently released and incubating legend-depot project offers a rich API to index and serve model elements. Check out the project for more details.

Future Roadmap

Goldman Sachs is actively contributing to the Legend projects on GitHub. Over the past two years, a total of 197 open source contributors have pushed over 6,400 commits to the Legend codebase and submitted 2,400 Pull Requests, adding 292,000 lines of code. In the spirit of open source software, we are committed to increasing contribution and participation from the rest of the community. In addition, we want to make it easy for contributors to add support for new databases and platforms.

In support of this goal, we are actively refactoring the Legend code base. The scope of the refactoring includes the following:

  1. Legend database plugins - Today, adding support for a new database platform requires a release of the 'legend-engine' project. We want to move to a model where platform support for a database can be dynamically loaded using a plugin architecture. This will enable new connectors to be added without having to wait for a release of the legend-engine project. 
  2. Reusing authentication code across stores - In Legend, a database is a 'store' of data. Legend supports other stores of data like a 'service store' which sources its data from a REST API. We want to extract and reuse cross cutting functionality such as authentication across different stores.
  3. Customizing database specific behavior - Code to customize SQL generation for different databases is scattered over many source files. We are reorganizing this code to introduce well-defined modules and integration points to add database specific customizations.
  4. Enhancing the database test suite - This refactoring will provide a suite of tests that can be run against a target database to quickly assess its compatibility with the Legend platform.
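The plugin direction above can be sketched in miniature. The sketch below is purely illustrative Python (Legend itself is Java, and the plugin SPI described here is still future work); it shows the general shape of database support discovered at runtime rather than compiled into a monolithic release:

```python
from typing import Dict, Protocol

class DatabasePlugin(Protocol):
    # Minimal hypothetical interface a database plugin might implement.
    name: str
    def quote_identifier(self, ident: str) -> str: ...

class DatabricksPlugin:
    name = "Databricks"
    def quote_identifier(self, ident: str) -> str:
        return f"`{ident}`"  # Databricks/Spark SQL quotes identifiers with backticks

PLUGINS: Dict[str, DatabasePlugin] = {}

def load_plugin(plugin: DatabasePlugin) -> None:
    # A real implementation might discover plugins via Java's ServiceLoader;
    # here we just register one explicitly.
    PLUGINS[plugin.name] = plugin

load_plugin(DatabricksPlugin())
print(PLUGINS["Databricks"].quote_identifier("legal_name"))  # `legal_name`
```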

With these changes, we hope to bring in more databases and data platforms into the Legend community!

Learn More and Get Started

Documentation and Open Source Code

More presentations, talks, and videos can be found on the Legend website.

To learn more about Goldman Sachs and explore opportunities visit our careers page.

(*) The open source contributions mentioned in this article relate to data models. The resulting collaborations involve the exchange of non-proprietary, non-confidential, and non-licensed information only.
