Legend* is an open source data platform created by Goldman Sachs and contributed to the Fintech Open Source Foundation (FINOS).
We are excited to announce the integration of the Databricks Lakehouse platform with Legend.
This contribution from the Databricks team is a great example of the spirit of FINOS - collaboration and innovation in the financial services industry via open source software. Databricks is also a member of FINOS.
In this blog post, we will start with a primer on Legend's relational data modeling and data access capabilities. We will then move on to a quick discussion about the use of data models to establish consistent data vocabularies across applications, platforms, and organizations. Finally, we will wrap up with a peek into work streams that are underway in the Legend project.
A data model is a formal way of describing the semantics of data and its relationship to other data. Our prior blog post - Building Platforms for Data Engineering - introduced a "Firm Employee" model that captures the semantics of a firm and its employees.
This model is an abstract concept. The actual data might be physically stored in a relational database (or a database that supports SQL), like Databricks. The power of Legend is that it allows data queries to be expressed in terms of logical model concepts (like "Firm" and "Person") and not in terms of how the "Firm" and "Person" data is physically stored in the database.
Relational data modeling and data access require four components:
4. Runtime - A runtime specifies where the data is physically stored. In this case, our data is stored in tables in a Databricks database/cluster.
Using the above, Legend is able to translate the Pure query into a database-specific SQL query, execute the query against the Databricks database, and return the results.
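To make the translation step concrete, here is a minimal Python sketch of the general idea: a query written against logical property names is rewritten into SQL over physical columns. This is not Legend's engine or the Pure language. The table name, column names, and the `ClassMapping` and `to_sql` helpers are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class ClassMapping:
    """Hypothetical link between a logical class and a physical table."""
    table: str
    columns: Dict[str, str]  # logical property name -> physical column name


# Assumed physical layout for the logical "Person" concept (illustrative only).
PERSON_MAPPING = ClassMapping(
    table="person_table",
    columns={"firstName": "first_name", "lastName": "last_name"},
)


def to_sql(mapping: ClassMapping, properties: Sequence[str]) -> str:
    """Rewrite a projection over logical properties into a SQL SELECT."""
    select_list = ", ".join(f"{mapping.columns[p]} AS {p}" for p in properties)
    return f"SELECT {select_list} FROM {mapping.table}"


print(to_sql(PERSON_MAPPING, ["firstName", "lastName"]))
```

The caller only ever names "firstName" and "lastName"; the physical table and column names appear in one place, so a change in storage would not require rewriting the query.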
Databricks Relational Connector
Legend integrates with many databases and data platforms. Thanks to the contribution from Databricks, Legend can now integrate with Databricks databases.
The contribution from Databricks provides the following:
The snippet below shows the 'concat' dynafunction being translated to the 'concat' SQL function.

    forDynafunction('concat', [
      ...
      choice(DatabaseType.Databricks, $allStates, ^ToSql(format='concat%s', transform={p:String[*]|$p->joinStrings('(', ', ', ')')}))
      ...
    ]

Legend offers first-class support for data modeling, access, and governance. While Legend platform components like Studio offer a no-code solution to data modeling, the data models themselves are treated as machine-readable source code.
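As a rough illustration of what that format and transform pair produces, the Python sketch below mimics it. It assumes joinStrings('(', ', ', ')') takes a prefix, a separator, and a suffix, and that the result is substituted for the '%s' in the format string. The `render_to_sql` helper is hypothetical and is not part of Legend.

```python
from typing import Sequence


def render_to_sql(fmt: str, params: Sequence[str]) -> str:
    """Mimic the ToSql entry: join the already-rendered SQL arguments with
    '(' as prefix, ', ' as separator and ')' as suffix, then substitute the
    result for %s in the format string."""
    joined = "(" + ", ".join(params) + ")"
    return fmt % joined


# Arguments are assumed to be SQL fragments that were rendered earlier.
print(render_to_sql("concat%s", ["first_name", "' '", "last_name"]))
# prints: concat(first_name, ' ', last_name)
```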
The models are stored in a Git (GitLab) repository managed by the Legend SDLC product. Legend SDLC's native integration with GitLab CI/CD allows Legend models to be managed, versioned, and distributed as code.
Legend uses the Apache Maven protocol to distribute models as jar artifacts. These artifacts can then be used outside of the Legend suite of products.
An example of this is the Databricks legend-delta showcase project that consumes Legend models and uses them to build Databricks pipelines.
Note: Legend model artifacts can be consumed directly from Apache Maven repositories. The recently released and incubating legend-depot project offers a rich API to index and serve model elements. Check out the https://github.com/finos/legend-depot project for more details.
Goldman Sachs is actively contributing to the Legend projects on GitHub. Over the past two years, a total of 197 open source contributors have pushed over 6,400 commits to the Legend codebase and submitted 2,400 Pull Requests, adding 292,000 lines of code. In the spirit of open source software, we are committed to increasing contribution and participation from the rest of the community. In addition, we want to make it easy for contributors to add support for a new database or platform.
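Because the models ship as ordinary Maven jar artifacts, any tool that understands the standard Maven repository layout can locate them. The Python sketch below builds the relative path of a jar from its coordinates. The group, artifact, and version values are hypothetical placeholders, not real Legend model coordinates.

```python
def maven_jar_path(group_id: str, artifact_id: str, version: str) -> str:
    """Relative path of a jar in a standard Maven repository layout:
    dots in the group id become directory separators."""
    group_path = group_id.replace(".", "/")
    return f"{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"


# Hypothetical coordinates, for illustration only.
print(maven_jar_path("org.example.models", "firm-employee-model", "1.0.0"))
# prints: org/example/models/firm-employee-model/1.0.0/firm-employee-model-1.0.0.jar
```

Appending this path to a repository base URL gives the download location a build tool would resolve.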
In support of this goal, we are actively refactoring the Legend code base. The scope of the refactoring includes the following:
With these changes, we hope to bring more databases and data platforms into the Legend community!
Documentation and Open Source Code
More presentations, talks, and videos can be found on the Legend website.
To learn more about Goldman Sachs and explore opportunities, visit our careers page.
(*) The open source contributions mentioned in this article relate to data models. The resulting collaborations involve the exchange of non-proprietary, non-confidential, and non-licensed information only.
See https://www.gs.com/disclaimer/global_email for important risk disclosures, conflicts of interest, and other terms and conditions relating to this blog and your reliance on information contained in it.