Scaling Open Policy Agent (OPA) to offer a centrally managed Cloud Entitlements Service

Problem

Open standards-based Cloud Entitlements Service (OCES) is the strategic authorization platform for Cloud-native applications across Goldman Sachs. OCES uses Open Policy Agent (OPA) as its policy engine and provides a centrally managed, externalized authorization service that enables developers to define, manage, and enforce entitlements and policies for their applications.

OPA was originally designed to be used as a lightweight policy engine that can be co-located and integrated with an application. However, running it as the policy engine of an externalized authorization service introduced new challenges related to secure and efficient policy delivery.

This blog post describes how the OCES platform evolved to address these challenges.

Overview of OCES

OCES is powered by Open Policy Agent (OPA). OPA is an open source, general-purpose policy engine that unifies policy enforcement across the stack. OPA provides a high-level declarative language (called Rego) that lets you specify policy as code, and simple APIs to offload policy decision-making from your application. These Rego policies, along with the reference data, are used to make authorization decisions. OCES draws its reference data from several governed data sources, whereas the Rego policies are authored by business units, referred to as tenants in OCES. This design separates policy management from the application lifecycle and delegates access control decisions to an external decision point.
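
To make "simple APIs to offload policy decision-making" concrete, here is a minimal sketch of a client asking an OPA server for a decision through OPA's REST Data API (POST /v1/data/<policy path> with an "input" document). The server URL, the policy path "example/authz/allow", and the input fields are illustrative assumptions, not OCES names.

```python
import json
import urllib.request


def build_decision_request(opa_url: str, policy_path: str, input_doc: dict) -> urllib.request.Request:
    """Build a POST request for OPA's Data API: POST /v1/data/<policy path>."""
    url = f"{opa_url.rstrip('/')}/v1/data/{policy_path.strip('/')}"
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_decision(opa_url: str, policy_path: str, input_doc: dict, timeout: float = 2.0):
    """Send the request and return the "result" field (None if the decision is undefined)."""
    request = build_decision_request(opa_url, policy_path, input_doc)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response).get("result")


# Hypothetical usage against a locally running OPA server:
# allowed = query_decision("http://localhost:8181", "example/authz/allow",
#                          {"user": "alice", "action": "read"})
```

The application only builds an input document and reads back a result; the policy logic itself stays in Rego on the OPA side.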

Figure 1: OPA policy engine

The multi-tenancy concept in OCES is modeled on business units (e.g., Global Investment Research, Transaction Banking) and is designed to enforce a vertical separation across the OCES ecosystem. This separation ensures that every entity in OCES (e.g., reference data, Rego policy, deployment) is mapped to an OCES tenant, which is crucial for:

Restricted visibility of reference data and Rego policies
Independent infrastructure and operations

Here’s a 10,000-foot view of OCES:

Figure 2: 10,000-foot overview of OCES architecture

The Rego policies are authored in Gitlab and published to OCES Core (OCES Policy and Reference Data Processing) for processing. Using Gitlab provides built-in version control, unit testing, approval workflows, and audit features. The policies are bundled into a tar archive and stored in S3. The reference data from various sources is also published to OCES Core for processing. It is likewise bundled into a tar archive and stored in S3.
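
As an illustration of the bundling step, the sketch below packs Rego sources and a data document into a gzipped tar archive laid out like an OPA bundle (".rego" files plus a root "data.json"). It is not the OCES Core implementation; the file names and the sample policy are assumptions, and the upload to S3 is omitted.

```python
import io
import json
import tarfile


def build_bundle(policies: dict[str, str], data: dict) -> bytes:
    """Return a gzipped tar archive containing the Rego files and a root data.json."""
    files = {path: source.encode("utf-8") for path, source in policies.items()}
    files["data.json"] = json.dumps(data, sort_keys=True).encode("utf-8")

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        # Sorted order keeps the archive layout stable between runs.
        for path in sorted(files):
            info = tarfile.TarInfo(name=path)
            info.size = len(files[path])
            archive.addfile(info, io.BytesIO(files[path]))
    return buffer.getvalue()


# Hypothetical usage:
# blob = build_bundle(
#     {"example/authz.rego": "package example.authz\n\ndefault allow := false\n"},
#     {"roles": {"alice": ["admin"]}},
# )
```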

OCES OPA tenants download these bundles and load them in-memory. OPA polls S3 every minute (configurable) to check if fresh bundles are available. OPA limits the maximum bundle size to 1 GB. In-memory bundles allow OPA to make fast (<10 ms) decisions on the authorization requests sent by client applications. However, the low latency comes at the cost of high memory utilization, as OPA's memory requirement is ~200x the bundle size.
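
The memory trade-off can be turned into back-of-the-envelope sizing arithmetic. The sketch below uses the ~200x multiplier quoted above as a default; the function names and sample sizes are hypothetical, and real memory usage depends on the shape of the data.

```python
def estimate_opa_memory_mb(bundle_size_mb: float, multiplier: float = 200.0) -> float:
    """Rough memory needed to hold a bundle in memory, using the ~200x rule of thumb."""
    return bundle_size_mb * multiplier


def max_bundle_size_mb(available_memory_mb: float, multiplier: float = 200.0) -> float:
    """Largest bundle that fits in the given memory budget under the same rule of thumb."""
    return available_memory_mb / multiplier


# Under this rule of thumb, a 50 MB bundle needs about 10,000 MB of memory,
# and a 16,000 MB memory budget accommodates a bundle of about 80 MB.
```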

OCES OPA tenant deployments support the heterogeneity of use-cases across the firm. OCES tenant deployments are multi-cloud (currently deployed in AWS and GCP) and support both central and sidecar deployment models, chosen based on the latency budget. A detailed overview is available in a previously published blog post on Cloud Entitlements Service.

The next section describes the challenges in using OPA in a centrally managed externalized authorization service.

Challenges In Using OPA For Externalized Authorization Service

OPA was originally designed to be used as a lightweight policy engine that can be co-located and integrated with an application, as a sidecar, host-level daemon, or library. This pattern enables fast authorization decisions but isn’t suitable for a centrally managed authorization service for the following reasons:

Application teams need to deploy and manage OPA servers and application developers need to build OPA competency.
The OPA servers for different applications can have different OPA versions and configurations, rendering Rego policy management difficult. For example, a policy bundle built with a higher OPA version can fail to load in an OPA server running a lower OPA version.
Audit and regulatory overhead on applications. For example, regulation requires storing the OPA decision logs for 7 years. Application teams would also be responsible for infrastructure and process audit, Business Continuity Planning evidence, Tech Risk design reviews, etc.

Conditioning OPA to make it suitable for OCES required:

Scaling the OPA servers

The OCES multi-tenancy model ensures that authorization query traffic is bounded at the tenant level, limiting the blast radius to a specific tenant.

Within a tenant, AWS autoscaling for ECS is configured to horizontally scale the OPA cluster size based on usage.

Scaling the OPA bundle delivery

While AWS autoscaling provides an easy solution to scale OPA servers, the challenging part is to scale the OPA bundle delivery.

Bundles contain the Rego policies and reference data. As OPA servers load the bundles from the S3 bucket, the challenge is to ensure that the Rego policies and reference data are delivered to OPA quickly and securely.

In the initial OCES launch (Policy Publication v1), the bundle delivery process required manual onboarding and a deployment for each new policy, which wasn’t scalable.

Figure 3: OPA configuration with each Rego policy as a bundle

The next section describes how we iteratively optimized the Rego policy bundle delivery.

Optimizing the Policy Bundle Delivery

Policy Publication v1 had two major issues:

Each Rego policy had to be onboarded individually.
Every onboarding needed a deployment.

In addition, TechRisk observed that there was no standard integrity check for the Rego policies (i.e., no way to verify that a policy had not been manipulated between authoring and enforcement).

The next section describes how OCES solved these issues with bundle discovery and signing.

OPA Bundle Discovery and Signing

OPA provides a discovery feature which helps you centrally manage the OPA configuration. When the discovery feature is enabled, OPA will periodically download a discovery bundle and process it to generate the rest of the configuration.

Building on the discovery feature provided by OPA, OCES launched a new version of policy publication (Policy Publication v2). The new policy publication pipeline didn't require onboarding and deployment of each Rego policy. Clients would only need to onboard their Rego Gitlab repository once. After that, they could publish any number of Rego policies without any onboarding.

Moreover, as soon as the Rego policies were published, they were immediately made available to the respective OCES OPA tenants, without any deployments.

Figure 4: OPA bundle discovery configuration
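
A minimal sketch of how a publisher could regenerate the configuration document carried by a discovery bundle, with one bundle entry per onboarded repository. The field names (bundles, service, resource, polling.min_delay_seconds, polling.max_delay_seconds, signing.keyid) follow OPA's configuration reference as we understand it; the service name, resource path layout, repository names, and key ID are hypothetical, not the OCES values. By default OPA evaluates the data document of the discovery bundle to produce its configuration, so this dictionary would be written to the bundle's data.json.

```python
import json


def build_discovery_config(service: str, repositories: list[str], key_id: str,
                           poll_seconds: int = 60) -> dict:
    """Build the OPA configuration that a discovery bundle would deliver."""
    bundles = {}
    for repo in sorted(repositories):
        bundles[repo] = {
            "service": service,                            # service defined in OPA's boot config
            "resource": f"bundles/{repo}/bundle.tar.gz",   # hypothetical S3 key layout
            "polling": {
                "min_delay_seconds": poll_seconds,
                "max_delay_seconds": poll_seconds,
            },
            "signing": {"keyid": key_id},                  # key used to verify the bundle
        }
    return {"bundles": bundles}


# Hypothetical usage: publishing a new repository only changes this document,
# so OPA picks up the new bundle on its next discovery poll without a redeploy.
# print(json.dumps(build_discovery_config("s3", ["repo-a", "repo-b"], "example_key"), indent=2))
```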

OPA supports digital signatures for policy bundles. Specifically, a signed bundle is a cryptographically secure OPA bundle that includes a file named “.signatures.json” that dictates which files should be included in the bundle and what their SHA hashes are.

OCES Policy Publication v2 used this standard bundle signing mechanism to sign bundles. During creation, bundles are signed with a secure private key. In the OCES OPA tenants, a public key is used to verify the integrity of the bundles.
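
To illustrate the idea behind the signed file list, the sketch below computes a SHA-256 digest per bundle file and checks a file set against such a list. It is a simplification under stated assumptions: in a real signed bundle the list is carried inside a signed JWT in ".signatures.json", OPA hashes structured files (JSON/YAML) in a structure-aware way rather than over raw bytes, and the supported route is OPA's own signing tooling. The function names are hypothetical.

```python
import hashlib
import hmac


def file_digests(files: dict[str, bytes]) -> list[dict]:
    """List each file with its SHA-256 digest, mirroring the shape of a signed file list."""
    return [
        {"name": name, "hash": hashlib.sha256(files[name]).hexdigest(), "algorithm": "SHA-256"}
        for name in sorted(files)
    ]


def verify_files(files: dict[str, bytes], expected: list[dict]) -> bool:
    """Return True only if the file set and every digest match the expected list."""
    actual = {entry["name"]: entry["hash"] for entry in file_digests(files)}
    signed = {entry["name"]: entry["hash"] for entry in expected}
    # A missing or extra file fails the check before any digest is compared.
    return actual.keys() == signed.keys() and all(
        hmac.compare_digest(actual[name], signed[name]) for name in actual
    )
```

Because the list covers both file names and digests, a modified, added, or removed file is detected, which is the integrity property the signing step is meant to provide.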

Figure 5: Architecture post bundle discovery and signing.

(1) Authored Rego policies are uploaded as an artifact to an S3 bucket.
(2) Each upload triggers a Lambda function, which calls the policy publisher service to notify it of the upload.
(3.1) The policy publisher downloads the artifact. (3.2) It is configured at startup with the private key used to sign the bundles. (3.3) The policy publisher stores the metadata in DynamoDB, creates a bundle out of the artifact and uploads it to S3. It also updates the discovery bundle for each tenant if required.
(4) OPA polls S3 for updates on the discovery bundle and regular bundles.
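
Step (2) can be sketched as a small Lambda handler that reads the uploaded object's location from the S3 event notification and passes it on. The event fields used here (Records, s3.bucket.name, s3.object.key) follow the standard S3 event notification structure; the notify callable stands in for the call to the policy publisher service, whose actual interface is not described in this post.

```python
from urllib.parse import unquote_plus


def uploaded_artifacts(event: dict) -> list[dict]:
    """Extract bucket and key for every object referenced in an S3 event notification."""
    artifacts = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        artifacts.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
        })
    return artifacts


def handler(event, context, notify=print):
    """Lambda entry point; notify is a placeholder for the policy publisher call."""
    artifacts = uploaded_artifacts(event)
    for artifact in artifacts:
        notify(artifact)
    return {"notified": len(artifacts)}
```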

Policy Publication v2 was widely adopted and OCES clients appreciated the seamless, fast, and secure bundle delivery.

However, some clients reported that their Rego policies were taking too long (~30 mins) to become available after publication.

The next section describes how we diagnosed and solved this delay in bundle delivery.

OPA Bundle Consolidation

The first diagnostic step was to profile the delay in bundle delivery. Using the Prometheus metrics published by OPA servers, we observed that OPA was taking a long time to activate the bundles after downloading them from the S3 bucket. The Policy Publication v2 pipeline created one OPA bundle for each Gitlab repository (whereas the v1 pipeline created one bundle for each Rego policy).

We ran many tests and confirmed that the delay was only observed in the OCES OPA tenants that had a comparatively high number of Rego repositories onboarded.

Figure 6: The ‘activation_delay_seconds’ is the number of seconds OPA took to activate a bundle after it’s downloaded. Ideally, this lag should be near-zero, but it was observed to be nearly 10 minutes.

We shared these results with the OPA support team and they concluded that though OPA supports multiple bundles, it’s not optimized to efficiently activate them. The root of the issue is that when OPA activates a new bundle, it also recompiles all the existing activated bundles. As this compilation is expensive, there exists a tipping point, dependent on the number and size of the bundles, after which the time taken by OPA to activate the bundles grows unsustainably.
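
A toy cost model shows why this behavior produces a tipping point. It assumes compilation cost is proportional to the total size of the policies compiled, which is a simplification rather than a measurement of OPA; the "units" are abstract.

```python
def separate_bundles_compile_units(bundle_sizes: list[int]) -> int:
    """Total work when each activation recompiles everything activated so far."""
    total = 0
    activated = 0
    for size in bundle_sizes:
        activated += size   # the new bundle joins the activated set
        total += activated  # the whole activated set is compiled again
    return total


def consolidated_compile_units(bundle_sizes: list[int]) -> int:
    """Total work when the same content ships as one bundle, compiled once."""
    return sum(bundle_sizes)


# For 50 equally sized bundles, this model gives 1 + 2 + ... + 50 = 1275 units
# for separate bundles versus 50 units for a single consolidated bundle.
```

Under this model the work for separate bundles grows quadratically with the number of bundles, while the consolidated bundle grows linearly.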

The only path forward was to consolidate the bundles. This meant creating one bundle per OCES OPA tenant, comprising all the reference data and Rego policies applicable to the tenant.

Bundle Consolidation Challenges

There were several risks and challenges in consolidating the bundles:

Blast radius of a bad Rego policy – OPA fails to activate a bad bundle. Prior to consolidation, this would only affect one bundle. With consolidation, this would affect the whole tenant, causing a wide outage.
Policy namespace conflicts – Rego policies across repositories may have namespace conflicts. This would result in a bad bundle.
Separate bundle generation process for reference data and Rego policies – The reference data and Rego policies bundles were generated by two separate processes. To generate a consolidated bundle, all these individual bundles needed to be packaged as a single valid bundle efficiently.

Bundle Consolidation Design

We first introduce the concept of an aggregate state. An aggregate state is maintained for each OCES OPA tenant and contains all of that tenant's individual Rego policy directories and data bundle directories. A metadata file contains additional information about the aggregate state.

The purpose of this aggregate state is to provide a staging area before consolidating the bundles. Before the consolidated bundle is generated from the state directory, the aggregate state must satisfy a series of invariants.

These invariants are designed to ensure that the consolidated bundle is always healthy. The invariants are:

No two policy projects in the aggregate should share a common Rego package.
- Example - If spr_123456 contains a Rego package "com.gs.hcm" and spr_109077 contains a Rego package "com.gs.hcm.operations", the resulting aggregate state becomes unhealthy because the package names are overlapping.
No two Rego repositories/reference data bundles should have any conflicting documents. This invariant is similar to the first one but applies to path conflicts between static data documents.
No import errors across different Rego policy repositories.
- Example - OCES provides the "com.gs.oces.appdir_client" Rego policy as a library for AppDir authorization calls, which is imported in the Rego policies of applications. If this dependency is missing in the consolidated bundle, it will result in an import error and fail to activate.
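
The first invariant can be sketched as a prefix check on package paths. This is a minimal illustration, not the OCES implementation: it assumes each policy project is represented by the list of Rego package names it declares, and it treats two packages as overlapping when one path is equal to, or a prefix of, the other.

```python
from itertools import combinations


def packages_overlap(a: str, b: str) -> bool:
    """True if one package path equals or is a prefix of the other, segment by segment."""
    parts_a, parts_b = a.split("."), b.split(".")
    shorter = min(len(parts_a), len(parts_b))
    return parts_a[:shorter] == parts_b[:shorter]


def find_package_conflicts(projects: dict[str, list[str]]) -> list[tuple[str, str, str, str]]:
    """Return (project, package, other project, other package) for every overlap."""
    conflicts = []
    for (proj_a, pkgs_a), (proj_b, pkgs_b) in combinations(sorted(projects.items()), 2):
        for pkg_a in pkgs_a:
            for pkg_b in pkgs_b:
                if packages_overlap(pkg_a, pkg_b):
                    conflicts.append((proj_a, pkg_a, proj_b, pkg_b))
    return conflicts


# With the example above:
# find_package_conflicts({"spr_123456": ["com.gs.hcm"],
#                         "spr_109077": ["com.gs.hcm.operations"]})
# reports one conflict, so the aggregate state would be rejected.
```

Comparing whole path segments rather than raw strings avoids false positives such as "com.gs.hcm" versus "com.gs.hcmx".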

The consolidated bundle is run in an OPA server before publication. Only if the bundle loads and activates successfully is it published to S3. This ensures that OCES OPA tenants always receive a healthy consolidated bundle.

Consolidating the reference data and Rego policy bundles at scale requires orchestration, which can be implemented with event streaming.

In this design, every policy publication pipeline trigger or reference data bundle change is modeled as an event, which is published to a Kafka topic. The events are consumed, and the aggregate state is modified accordingly. A successful modification, satisfying the invariants, results in a consolidated bundle.
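
The event-handling loop can be sketched as follows, with the Kafka consumer replaced by any iterable of events. The event shape ("type", "source", "content") and the validate callable are assumptions made for illustration; in the design described here, validation corresponds to checking the invariants on the aggregate state.

```python
import copy


def apply_event(state: dict, event: dict, validate) -> tuple[dict, bool]:
    """Apply one event to a copy of the aggregate state; keep it only if it validates."""
    candidate = copy.deepcopy(state)
    if event["type"] == "upsert":
        candidate[event["source"]] = event["content"]
    elif event["type"] == "delete":
        candidate.pop(event["source"], None)
    else:
        return state, False  # unknown event types leave the state untouched

    if not validate(candidate):
        return state, False  # invariants violated: previous healthy state is kept
    return candidate, True   # healthy: caller can build the consolidated bundle


def consume(events, state: dict, validate) -> dict:
    """Fold a stream of events into the aggregate state, skipping rejected changes."""
    for event in events:
        state, _accepted = apply_event(state, event, validate)
    return state
```

Working on a copy means a rejected change never leaves the aggregate state half-modified, so the last healthy consolidated bundle remains valid.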

Figure 7: Process flow diagram for bundle consolidation

With bundle consolidation, OCES launched Policy Publication v3. This release not only resolved the bundle activation latency issue, but also reduced the CPU and memory utilization of the OPA tenants, paving the way to reduce operational costs.

Figure 8: With bundle consolidation, the bundle activation lag reduced from ~10 minutes to near-zero. This means that the bundles are activated instantly upon download.

Figure 9: Bundle consolidation reduced the CPU utilization of OCES OPA tenants

Other Enhancements:

The OCES bundle generation service was optimized to generate new reference data bundles only if the reference data was modified.
The OCES Rego pipeline extensively supports unit testing of the policies including test coverage gates, test coverage reports and Git submodules integration to include dependencies e.g. common Rego libraries.
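
One way to implement the first enhancement, regenerating a reference data bundle only when the data changed, is to compare a content fingerprint against the one recorded for the last published bundle. This is a sketch of that general technique, not a description of the OCES service; the function names are hypothetical.

```python
import hashlib
import json
from typing import Optional


def content_fingerprint(reference_data: dict) -> str:
    """Hash a canonical JSON encoding so that key order does not affect the result."""
    canonical = json.dumps(reference_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_new_bundle(reference_data: dict, last_fingerprint: Optional[str]) -> bool:
    """True when no bundle was published yet or the data differs from the last one."""
    return content_fingerprint(reference_data) != last_fingerprint
```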

Notable Trade-offs

Decision latency vs Memory utilization: To minimize the authorization decision latency, we load the bundles in OPA memory. This decision favors low latency over infrastructure costs.

Data freshness vs Resource consumption: OPA polls S3 every minute to check for fresh bundles. This frequency results in high CPU and network usage and increased bundle download and activation overhead; however, it reduces the staleness of Rego policies and reference data. This decision favors data freshness over resource consumption costs.

Outcome

The iterative optimizations in OCES, particularly the introduction of bundle discovery, bundle signing and bundle consolidation, have significantly enhanced the efficiency, security, and performance of the OPA policy bundle delivery process. These results demonstrate the effectiveness of the strategies implemented to overcome the initial challenges.

See https://www.gs.com/disclaimer/global_email for important risk disclosures, conflicts of interest, and other terms and conditions relating to this blog and your reliance on information contained in it.

November 25, 2024

Ashwin Balakrishna, Vice President; Kamlakant Shukla, Vice President