Open standards-based Cloud Entitlements Service (OCES) is the strategic authorization platform for cloud-native applications across Goldman Sachs. OCES uses Open Policy Agent (OPA) as its policy engine and provides a centrally managed, externalized authorization service that enables developers to define, manage, and enforce entitlements and policies for their applications.
OPA was originally designed to be used as a lightweight policy engine that can be co-located and integrated with an application. However, running it as the policy engine of an externalized authorization service introduced new challenges related to secure and efficient policy delivery.
This blog post describes how the OCES platform evolved to address these challenges.
OCES is powered by Open Policy Agent (OPA). OPA is an open source, general-purpose policy engine that unifies policy enforcement across the stack. OPA provides a high-level declarative language (called Rego) that lets you specify policy as code, and simple APIs to offload policy decision-making from your application. These Rego policies, along with the reference data, are used to make authorization decisions. OCES draws its reference data from several governed data sources, whereas the Rego policies are authored by business units, referred to as tenants in OCES. This design separates policy management from the application lifecycle and delegates access control decisions to an external decision point.
Figure 1: OPA policy engine
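As a rough illustration of offloading a decision to OPA, the sketch below builds a query for OPA's REST Data API (`POST /v1/data/<path>` with an `input` document) and interprets the response. The host, the policy path `example_tenant/authz/allow`, and the input fields are hypothetical and are not taken from OCES. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_decision_request(opa_url: str, policy_path: str, input_doc: dict) -> urllib.request.Request:
    """Build a POST request for OPA's Data API: POST /v1/data/<policy path>."""
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url=f"{opa_url.rstrip('/')}/v1/data/{policy_path.strip('/')}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_decision(response_body: bytes) -> bool:
    """OPA wraps the decision in {"result": ...}.

    A missing "result" means the rule was undefined; this sketch treats that as deny.
    """
    return json.loads(response_body).get("result", False) is True


# Hypothetical tenant policy path and input, for illustration only.
request = build_decision_request(
    "http://localhost:8181",
    "example_tenant/authz/allow",
    {"user": "alice", "action": "read", "resource": "report-123"},
)
# Sending it would be: urllib.request.urlopen(request).read()
```

Treating an undefined result as a denial is a design choice of this sketch; it keeps the client fail-closed when a policy has not been loaded yet.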
The multi-tenancy concept in OCES is modeled on business units (e.g., Global Investment Research, Transaction Banking) and is designed to enforce a vertical separation across the OCES ecosystem. This separation ensures that every entity in OCES (e.g., reference data, Rego policy, deployment) is mapped to an OCES tenant, which is crucial for:
Here’s a 10,000-foot view of OCES:
Figure 2: 10,000-foot overview of OCES architecture
The Rego policies are authored in Gitlab and published to OCES Core (OCES Policy and Reference Data Processing) for processing. Using Gitlab provides built-in version control, unit testing, approval workflows, and audit features. The policies are bundled into a tar archive and stored in S3. The reference data from various sources is likewise published to OCES Core, bundled into a tar archive, and stored in S3.
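The bundling step can be sketched as follows. OPA loads bundles as gzipped tar archives that may carry a `.manifest` file declaring the bundle's roots. The file names, the tenant name `example_tenant`, and the policy content are hypothetical, and the upload to S3 is omitted.

```python
import io
import json
import tarfile


def build_bundle(files: dict[str, bytes]) -> bytes:
    """Pack {path: content} into a gzipped tar archive, the format OPA loads bundles from."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path in sorted(files):  # sorted for a stable member order
            info = tarfile.TarInfo(name=path)
            info.size = len(files[path])
            archive.addfile(info, io.BytesIO(files[path]))
    return buffer.getvalue()


def list_bundle(bundle: bytes) -> list[str]:
    """Return the member names of a gzipped tar bundle."""
    with tarfile.open(fileobj=io.BytesIO(bundle), mode="r:gz") as archive:
        return archive.getnames()


# Hypothetical bundle content, for illustration only.
bundle = build_bundle({
    ".manifest": json.dumps({"roots": ["example_tenant"]}).encode(),
    "example_tenant/authz/policy.rego": b"package example_tenant.authz\n\ndefault allow := false\n",
    "example_tenant/refdata/data.json": json.dumps({"groups": {"alice": ["readers"]}}).encode(),
})
```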
OCES OPA tenants download these bundles and load them in-memory. OPA polls S3 every minute (configurable) to check whether fresh bundles are available. OPA limits the maximum bundle size to 1 GB. In-memory bundles allow OPA to make fast (<10 ms) decisions on the authorization requests sent by client applications. However, the low latency comes at the cost of high memory utilization, as OPA's memory requirement is roughly 200x the bundle size.
OCES OPA tenant deployments support the heterogeneity of use cases across the firm. OCES tenant deployments are multi-cloud (currently deployed in AWS and GCP) and support both central and sidecar deployment models, chosen based on the latency budget. A detailed overview is available in a previously published blog post on Cloud Entitlements Service.
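To make the polling and memory figures concrete, here is a small sketch that produces a bundle entry for an OPA configuration file (OPA accepts JSON as well as YAML) and estimates memory from the ~200x factor quoted above. The service name `s3` and the resource path are hypothetical, and mapping the 1 GB limit onto OPA's `size_limit_bytes` setting is an assumption of this sketch.

```python
import json

GIB = 1024 ** 3


def bundle_config(service: str, resource: str, poll_seconds: int = 60) -> dict:
    """One entry for the `bundles` section of an OPA configuration file."""
    return {
        "service": service,
        "resource": resource,
        # OPA waits a random delay between min and max before each poll.
        "polling": {
            "min_delay_seconds": poll_seconds,
            "max_delay_seconds": poll_seconds * 2,
        },
        "size_limit_bytes": GIB,
    }


def estimated_memory_bytes(bundle_bytes: int, factor: int = 200) -> int:
    """Rough memory estimate using the ~200x multiplier from the text."""
    return bundle_bytes * factor


config = {"bundles": {"example_tenant": bundle_config("s3", "example_tenant/bundle.tar.gz")}}
config_json = json.dumps(config, indent=2)

# A 50 MB bundle would need on the order of 10 GB of memory under this estimate.
memory_for_50mb = estimated_memory_bytes(50_000_000)
```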
The next section describes the challenges in using OPA in a centrally managed externalized authorization service.
OPA was originally designed to be used as a lightweight policy engine that can be co-located and integrated with an application (as a sidecar, host-level daemon, or library). This pattern enables fast authorization decisions but isn’t suitable for a centrally managed authorization service for the following reasons:
Conditioning OPA to make it suitable for OCES required:
The OCES multi-tenancy model ensures that authorization query traffic is bound at tenant level, limiting the blast radius to a specific tenant.
Within a tenant, AWS autoscaling for ECS is configured to horizontally scale the OPA cluster size based on the usage.
While AWS autoscaling provides an easy solution to scale OPA servers, the challenging part is to scale the OPA bundle delivery.
Bundles contain the Rego policies and reference data. As OPA servers load the bundles from the S3 bucket, the challenge is to ensure that the Rego policies and reference data are delivered to OPA quickly and securely.
In the initial OCES launch (Policy Publication v1), every new Rego policy had to be manually onboarded and deployed before its bundle could be delivered, which wasn’t scalable.
Figure 3: OPA configuration with each Rego policy as a bundle
The next section describes how we iteratively optimized the Rego policy bundle delivery.
Policy Publication v1 had two major issues:
In addition, TechRisk observed that there was no standard integrity check for the Rego policies (i.e., no way to verify that a policy had not been manipulated between authoring and enforcement).
The next section describes how OCES solved these issues with bundle discovery and signing.
OPA provides a discovery feature which helps you centrally manage the OPA configuration. When the discovery feature is enabled, OPA will periodically download a discovery bundle and process it to generate the rest of the configuration.
Building on the discovery feature provided by OPA, OCES launched a new version of policy publication (Policy Publication v2). The new policy publication pipeline didn't require onboarding and deployment of each Rego policy. Clients only needed to onboard their Rego Gitlab repository once. After that, they could publish any number of Rego policies without further onboarding.
Moreover, as soon as the Rego policies were published, they were immediately made available to the respective OCES OPA tenants, without any deployments.
Figure 4: OPA bundle discovery configuration
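The difference between the two approaches can be sketched by comparing the configuration each OPA instance starts with. In the static style, every policy adds a `bundles` entry to the deployed configuration; with discovery, the deployed configuration only points at a discovery bundle, and the list of bundles lives in that bundle's data. The service URL, resource paths, and helper names below are hypothetical, and the exact layout of the data inside the discovery bundle is simplified.

```python
import json

# Hypothetical S3 endpoint, for illustration only.
S3_URL = "https://example-bucket.s3.amazonaws.com"


def static_config(policy_names: list[str]) -> dict:
    """v1 style: one bundle entry per policy, baked into the deployed configuration."""
    return {
        "services": {"s3": {"url": S3_URL}},
        "bundles": {
            name: {"service": "s3", "resource": f"policies/{name}.tar.gz"}
            for name in policy_names
        },
    }


def bootstrap_config(tenant: str) -> dict:
    """v2 style: the deployed configuration only names the discovery bundle."""
    return {
        "services": {"s3": {"url": S3_URL}},
        "discovery": {"service": "s3", "resource": f"{tenant}/discovery.tar.gz"},
    }


def discovery_document(policy_names: list[str]) -> dict:
    """Configuration served from the discovery bundle; regenerated on every publication."""
    return {
        "bundles": {
            name: {"service": "s3", "resource": f"policies/{name}.tar.gz"}
            for name in policy_names
        }
    }


before = bootstrap_config("example_tenant")
published = discovery_document(["policy_a", "policy_b", "policy_c"])
after = bootstrap_config("example_tenant")  # unchanged: no redeployment needed
```

Because `bootstrap_config` does not depend on the set of policies, publishing a new policy changes only the discovery bundle in S3, which OPA picks up on its next poll.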
OPA supports digital signatures for policy bundles. Specifically, a signed bundle is a cryptographically secure OPA bundle that includes a file named “.signatures.json” that dictates which files should be included in the bundle and what their SHA hashes are.
OCES Policy Publication v2 used this standard bundle signing mechanism to sign bundles. During creation, bundles are signed with a secure private key. In the OCES OPA tenants, a public key is used to verify the integrity of the bundles.
Figure 5: Architecture post bundle discovery and signing.
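The integrity part of bundle signing can be illustrated with the file list that the signed token inside `.signatures.json` carries: one entry per file with its name, hash, and hash algorithm. The sketch below builds and checks such a list for Rego files only. It is a simplification under stated assumptions: OPA hashes JSON and YAML files over their parsed content rather than their raw bytes, and the actual signing and verification of the token with the private and public keys is omitted. The file names are hypothetical.

```python
import hashlib


def file_entries(files: dict[str, bytes]) -> list[dict]:
    """Build the per-file entries listed in a signed bundle's token payload.

    Raw-byte hashing as done here applies to unstructured files such as .rego.
    """
    return [
        {"name": name, "hash": hashlib.sha256(content).hexdigest(), "algorithm": "SHA-256"}
        for name, content in sorted(files.items())
    ]


def verify_files(files: dict[str, bytes], entries: list[dict]) -> bool:
    """Accept the bundle only if it holds exactly the listed files with matching hashes."""
    expected = {entry["name"]: entry["hash"] for entry in entries}
    if set(expected) != set(files):
        return False  # a file was added or removed after signing
    return all(
        hashlib.sha256(files[name]).hexdigest() == digest
        for name, digest in expected.items()
    )


# Hypothetical policy file, for illustration only.
original = {"example_tenant/authz/policy.rego": b"package example_tenant.authz\n\ndefault allow := false\n"}
entries = file_entries(original)

tampered = {"example_tenant/authz/policy.rego": b"package example_tenant.authz\n\ndefault allow := true\n"}
```

Here `verify_files(original, entries)` succeeds while `verify_files(tampered, entries)` fails, which is the property that lets a tenant detect a policy changed between authoring and enforcement.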
Policy Publication v2 was widely adopted, and OCES clients appreciated the seamless, fast, and secure bundle delivery.
However, some clients reported that their Rego policies were taking too long (~30 mins) to become available after publication.
The next section describes how we diagnosed and solved this delay in bundle delivery.
The first diagnostic step was to profile the delay in bundle delivery. Using the Prometheus metrics published by OPA servers, we observed that OPA was taking a long time to activate the bundles after downloading them from the S3 bucket. The Policy Publication v2 pipeline created one OPA bundle for each Gitlab repository (pipeline v1 created one bundle for each Rego policy).
We ran many tests and confirmed that the delay was only observed in the OCES OPA tenants that had a comparatively high number of Rego repositories onboarded.
Figure 6: The ‘activation_delay_seconds’ is the number of seconds OPA took to activate a bundle after it’s downloaded. Ideally, this lag should be near-zero, but it was observed to be nearly 10 minutes.
We shared these results with the OPA support team, and they concluded that although OPA supports multiple bundles, it is not optimized to activate them efficiently. The root of the issue is that when OPA activates a new bundle, it also recompiles all the bundles that are already active. As this compilation is expensive, there is a tipping point, dependent on the number and size of the bundles, beyond which the time OPA takes to activate bundles grows unsustainably.
The only path forward was to consolidate the bundles. This meant creating one bundle per OCES OPA tenant, comprising all the reference data and Rego policies applicable to the tenant.
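A toy cost model shows why this tipping point appears. Assume, purely for illustration, that compiling a bundle costs a number of "units" proportional to its size, and that each activation recompiles everything already active. Activating bundles one by one then costs a running sum of sizes, which grows quadratically with the number of bundles, whereas compiling the same content as a single unit costs a plain sum. The function names and the unit cost are assumptions of this sketch, not measurements of OPA.

```python
def compile_units_separate(bundle_sizes: list[int]) -> int:
    """Units compiled when bundles activate one by one and each activation
    recompiles every bundle that is already active."""
    total = 0
    active = 0
    for size in bundle_sizes:
        active += size   # the newly activated bundle joins the active set
        total += active  # the whole active set is compiled again
    return total


def compile_units_single(bundle_sizes: list[int]) -> int:
    """Units compiled when the same content is activated as one unit."""
    return sum(bundle_sizes)


sizes = [1] * 100  # 100 equally sized bundles
separate = compile_units_separate(sizes)  # 1 + 2 + ... + 100
single = compile_units_single(sizes)
```

With 100 equal bundles the model gives 5,050 units against 100, a 50x gap that keeps widening as more repositories are onboarded.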
There were several risks and challenges in consolidating the bundles:
We first introduce the concept of an aggregate state. An aggregate state is maintained for each OCES OPA tenant and contains all of that tenant's individual Rego policy directories and data bundle directories. A metadata file contains additional information about the aggregate state.
The purpose of this aggregate state is to provide a staging area before consolidating the bundles. The state directory must satisfy a series of invariants before the consolidated bundle is generated from it.
These invariants are designed to ensure that the consolidated bundle is always healthy. The invariants are:
The consolidated bundle is loaded into an OPA server before publication. Only if the bundle is loaded and activated successfully is it published to S3. This ensures that OCES OPA tenants always receive a healthy consolidated bundle.
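This pre-publication gate might look like the sketch below. It relies on OPA's health endpoint, which can take bundle activation into account (`GET /health?bundles=true` returns 200 only once the configured bundles are activated). How the bundle is loaded into the staging OPA server and how it is uploaded to S3 are passed in as callables, since those details are not described here; all function and parameter names are hypothetical.

```python
from typing import Callable


def publish_if_healthy(
    bundle: bytes,
    load_bundle: Callable[[bytes], None],  # makes the bundle available to a staging OPA
    get_status: Callable[[str], int],      # performs an HTTP GET and returns the status code
    upload: Callable[[bytes], None],       # publishes the bundle, e.g. to S3
    staging_opa_url: str = "http://localhost:8181",
) -> bool:
    """Publish the bundle only if a staging OPA server activates it successfully."""
    load_bundle(bundle)
    health_url = f"{staging_opa_url.rstrip('/')}/health?bundles=true"
    if get_status(health_url) != 200:
        return False  # activation failed: nothing is published
    upload(bundle)
    return True
```

A production version would poll the health endpoint with a timeout rather than check it once, since activation is not instantaneous.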
Consolidating the reference data and Rego policy bundles at scale requires orchestration, which can be implemented with event streaming.
In this design, every policy publication pipeline trigger or reference data bundle change is modeled as an event, which is published to a Kafka topic. The events are consumed, and the aggregate state is modified accordingly. A successful modification, satisfying the invariants, results in a consolidated bundle.
Figure 7: Process flow diagram for bundle consolidation
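The consume, modify, validate, and consolidate loop can be sketched in a few lines. A plain list stands in for the Kafka topic, the aggregate state is an in-memory mapping of directories to files, and the single invariant shown (no empty directories) is a hypothetical placeholder rather than one of the actual OCES invariants. Event field names are likewise assumptions.

```python
from typing import Callable, Iterable

State = dict[str, dict[str, bytes]]  # directory -> {file name: content}
Invariant = Callable[[State], bool]


def apply_event(state: State, event: dict) -> State:
    """Return a copy of the state with the event's directory replaced or removed."""
    candidate = {directory: dict(files) for directory, files in state.items()}
    if event["type"] == "removed":
        candidate.pop(event["directory"], None)
    else:  # "policy_published" or "refdata_updated"
        candidate[event["directory"]] = dict(event["files"])
    return candidate


def consolidate(state: State) -> dict[str, bytes]:
    """Flatten the aggregate state into the file set of one consolidated bundle."""
    return {
        f"{directory}/{name}": content
        for directory, files in sorted(state.items())
        for name, content in sorted(files.items())
    }


def process(events: Iterable[dict], invariants: list[Invariant]) -> tuple[State, list[dict[str, bytes]]]:
    """Apply events in order; emit a consolidated bundle after each valid change."""
    state: State = {}
    bundles = []
    for event in events:
        candidate = apply_event(state, event)
        if all(check(candidate) for check in invariants):
            state = candidate  # commit only states that satisfy every invariant
            bundles.append(consolidate(state))
    return state, bundles


def no_empty_directories(state: State) -> bool:
    """Hypothetical invariant, for illustration only."""
    return all(files for files in state.values())


events = [
    {"type": "policy_published", "directory": "policies/repo_a", "files": {"authz.rego": b"package repo_a\n"}},
    {"type": "policy_published", "directory": "policies/repo_b", "files": {}},  # rejected
    {"type": "refdata_updated", "directory": "refdata/groups", "files": {"data.json": b"{}"}},
]
final_state, bundles = process(events, [no_empty_directories])
```

Validating a candidate copy before committing it means a rejected event leaves the previous healthy state, and therefore the last published bundle, untouched.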
With bundle consolidation, OCES launched Policy Publication v3. This release not only resolved the bundle activation latency issue, but also reduced the CPU and memory utilization of the OPA tenants, paving the way to reduce operational costs.
Figure 8: With bundle consolidation, the bundle activation lag reduced from ~10 minutes to near-zero. This means that the bundles are activated instantly upon download.
Figure 9: Bundle consolidation reduced the CPU utilization of OCES OPA tenants
Decision latency vs Memory utilization: To minimize the authorization decision latency, we load the bundles in OPA memory. This decision favors low latency over infrastructure costs.
Data freshness vs Resource consumption: OPA polls S3 every minute to check for fresh bundles. This frequency increases CPU and network usage as well as bundle download and activation overhead, but it reduces the staleness of Rego policies and reference data. This decision favors data freshness over resource consumption costs.
The iterative optimizations in OCES, particularly the introduction of bundle discovery, bundle signing, and bundle consolidation, have significantly enhanced the efficiency, security, and performance of the OPA policy bundle delivery process. Together, they resolved the delivery challenges that arose from running OPA as the engine of a centrally managed authorization service.