Entitlements is a paradigm that fundamentally tries to answer one question: “Given that we already know who a particular user is, what is the user entitled to do in an application?” Questions we typically ask are:
As Goldman Sachs continues to expand its footprint on the public cloud, it is imperative that the firm protects its data. This commitment has led to the genesis of an Open standards-based Cloud Entitlements Service (OCES). It offers a firmwide policy-based, attribute-driven entitlements platform. In this post, we will share approaches and best practices that can be applied to similar entitlements challenges at scale in financial services and other domains by leveraging serverless compute services and managed streaming products offered by Amazon Web Services (AWS), coupled with Open Policy Agent.
Prior to the introduction of Cloud Entitlements Service, there were several challenges including:
The firm required a platform on which any business unit could enforce entitlements on public cloud providers as easily as it could on-premises. Given the time-critical and sensitive nature of the activities the firm supports, it was paramount that any solution be available 24/7 with four nines (99.99%) of availability. Likewise, the platform had to support monitoring and observability not only of the service and its infrastructure but also of policy distribution and reference data events.
OCES provides a centralized policy engine hosted on AWS. It is based on Open Policy Agent (OPA), an open source, general-purpose policy engine that provides a high-level declarative language for specifying policy as code and simple APIs for offloading policy decision-making from application code. OCES provides a policy-as-code lifecycle for simple policy authoring, and supports monitoring and distribution of policy and reference data events at scale to OPA policy engines. OCES is multi-tenanted, offering dedicated OPA tenants for various business areas in the firm. Tenants get consistent service-level guarantees because the service builds on the managed streaming, elastic scaling, and operational efficiencies of serverless compute infrastructure in the cloud. Further, OPA has gained wide acceptance across industries as a cloud-native way to unify policy enforcement.
The overall architecture of OCES can be seen in the illustration below. The next section describes various components in detail.
The components of OCES can be viewed in reference to a Standards Control Architecture.
At a high level, they are:
Policy Enforcement Point (PEP)
The PEP is where entitlements are actually enforced. The supported enforcement points include:
Policy Decision Point (PDP)
OPA clusters provide a highly available, reliable, and scalable PDP. Policies authored by application owners and associated reference data for a policy are evaluated by OPA to provide decisions that can be enforced by applications and APIs that integrate with OPA.
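As a minimal sketch of how an application-side enforcement point might ask an OPA PDP for a decision, the Python below posts an input document to OPA's REST Data API (`POST /v1/data/<path>`) and reads the `result` field of the response. The endpoint URL, the policy path `tenant/authz/allow`, and the shape of the input document are assumptions made for illustration; the post does not describe how OCES clients actually call OPA.

```python
import json
import urllib.request


def build_decision_request(opa_url: str, policy_path: str, input_doc: dict) -> urllib.request.Request:
    """Build a POST to OPA's Data API; the body wraps the query input as {"input": ...}."""
    url = f"{opa_url.rstrip('/')}/v1/data/{policy_path.strip('/')}"
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_decision(response_body: bytes) -> bool:
    """OPA returns {"result": <value>}; an absent result means the rule is undefined.

    This sketch treats anything other than a literal true as a deny.
    """
    return json.loads(response_body).get("result") is True


def is_allowed(opa_url: str, policy_path: str, input_doc: dict, timeout: float = 2.0) -> bool:
    """Ask the PDP for a decision (requires a reachable OPA instance)."""
    request = build_decision_request(opa_url, policy_path, input_doc)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_decision(response.read())
```

A caller would use something like `is_allowed("http://localhost:8181", "tenant/authz/allow", {"user": "alice", "action": "read"})`, where the host, port, and path are again hypothetical. Defaulting to deny when the result is missing is a design choice of this sketch, not something the post prescribes.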
Policy Authoring Point (PAP)
OPA policies are authored in Rego, a declarative language inspired by Datalog. Within the OCES ecosystem, application owners author policies in GitLab, and a GitLab CI/CD pipeline subsequently pushes them to the OPA PDPs. Policies are subject to standard SDLC practices such as approval, testing, and baking in lower environments before being published to production.
Policy Information Point (PIP)
Policies allow application owners to prescribe rules on who can or cannot use their systems. However, policies by themselves make up only half of the input to a decision. Reference data is the other half and forms a critical component of policy evaluation. Examples of reference data include (1) role memberships, (2) attribute assignments on a subject accessing a privileged resource, and (3) metadata about the resources being accessed.
Diverse and well-governed reference data enables policy authors to write powerful policies and thereby enforce fine-grained entitlements. OCES integrates with several standard, governed reference data sources in Goldman Sachs and can be extended to further sources.
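To make the split between policy and reference data concrete, here is a small Python sketch of a decision that needs both. It is not Rego and not how OPA evaluates policies internally; it only mirrors the idea that a rule (which roles may perform which actions) is useless without data (who holds which role, and which business unit a resource belongs to). All names, roles, and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    user: str
    action: str
    resource: str


# Hypothetical reference data, of the kind a PIP would supply.
REFERENCE_DATA = {
    "role_memberships": {"alice": ["trader"], "bob": ["auditor"]},
    "user_attributes": {
        "alice": {"business_unit": "equities"},
        "bob": {"business_unit": "fixed-income"},
    },
    "resource_metadata": {"orders-api": {"business_unit": "equities"}},
}

# Hypothetical policy: the actions each role permits.
POLICY = {"trader": {"read", "write"}, "auditor": {"read"}}


def allow(request: AccessRequest, policy: dict, data: dict) -> bool:
    """Allow only if a role permits the action AND user and resource share a business unit."""
    roles = data["role_memberships"].get(request.user, [])
    role_permits = any(request.action in policy.get(role, set()) for role in roles)

    user_bu = data["user_attributes"].get(request.user, {}).get("business_unit")
    resource_bu = data["resource_metadata"].get(request.resource, {}).get("business_unit")
    same_business_unit = user_bu is not None and user_bu == resource_bu

    return role_permits and same_business_unit
```

With this data, `allow(AccessRequest("alice", "write", "orders-api"), POLICY, REFERENCE_DATA)` is permitted, while the same policy denies `bob` a read on `orders-api` purely because of his attribute data.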
The first challenge we faced was making entitlements-related reference data available on AWS for consumption. The solution needed to meet the following criteria: (i) reflect updates to the data in near real time, (ii) be easily accessible, and (iii) given the sensitivity of this use case, be reliable and consistent. Given those requirements, we selected Amazon Managed Streaming for Apache Kafka (Amazon MSK).
Apache Kafka is an open source, event-based stream processing platform used for high-throughput, low-latency data pipelines, streaming analytics, and out-of-the-box data integrations. When configured appropriately, Apache Kafka offers the mission-critical stability we needed: no message loss, guaranteed ordering within a partition, durable data storage, a great degree of scalability, and easy interaction through client libraries and interfaces. However, an improper setup can detract from these benefits. Amazon MSK alleviates this by abstracting away the nuances of managing an Apache Kafka infrastructure.
Apache Kafka topics are separated by business unit to ensure business data separation. Two topics are maintained per business unit: one for entitlement policy updates and another for reference data updates. Avro schemas and a schema registry are used to ensure that messages adhere to a specified contract when they are published to the Apache Kafka topics.
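The topic layout and contract check described above can be sketched roughly as follows. The topic naming convention and the message fields are hypothetical, and the validator is a plain-Python stand-in for the idea of a contract; it is not Avro and does not talk to a schema registry.

```python
from typing import Dict, List


def topics_for(business_unit: str) -> Dict[str, str]:
    """Return the two per-business-unit topics (the naming scheme is an assumption)."""
    return {
        "policy": f"{business_unit}.entitlements.policy-updates",
        "reference_data": f"{business_unit}.entitlements.reference-data-updates",
    }


# Hypothetical contract for a reference data event: field name -> expected type.
REFERENCE_DATA_CONTRACT = {
    "entity_id": str,
    "attribute": str,
    "value": str,
    "sequence": int,
}


def contract_violations(message: dict, contract: dict) -> List[str]:
    """List every way a message deviates from the contract; an empty list means it conforms."""
    errors = []
    for field, expected_type in contract.items():
        if field not in message:
            errors.append(f"missing field: {field}")
        elif not isinstance(message[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    for field in message:
        if field not in contract:
            errors.append(f"unexpected field: {field}")
    return errors
```

A publisher would run a check like this before sending and refuse to publish a message with violations. In a real deployment the Avro serializer and schema registry perform that role.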
OPA is an open source entitlements engine that can enforce any business entitlements policy, whether role-based or attribute-based. Policies are written in Rego, a high-level declarative language that allows policies to be expressed as code, and OPA exposes simple APIs that delegate complex policy decision-making away from application code. OCES leverages OPA because it was already a proven platform with robust support for flexible policy authoring and efficient request evaluation, and it met the firm's entitlement requirements.
OCES has been designed as a multi-tenanted platform offering each business unit its own dedicated OPA tenant, a model well suited to OPA's lightweight nature. This flexibility allows business units across Goldman Sachs to use the platform to its fullest extent without interfering with each other. The topology also supports data separation and limits the impact across the firm should any engine outage occur. OCES has also been developed as a multi-regional service so that the platform is resilient to regional outages. OPA has been deployed on an Amazon ECS cluster running on AWS Fargate, which greatly reduces the overhead of managing an already complex platform, especially when deploying several instances across multiple regions.
OPA provides two mechanisms for supplying an engine with the data and policies required for making entitlement decisions: (i) a push paradigm, where the OPA engine is loaded via a REST interface that it exposes out of the box, and (ii) a pull paradigm, where the engine is configured to periodically download a bundle (a tarball comprising the necessary data and policy files) and activate it. Given that these approaches cannot be used in conjunction with each other, we chose the pull-based approach as it was more suitable for OCES' requirements. The bundle is periodically recreated to include any new reference data or policy changes observed since the previous bundle generation.
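As a rough illustration of the pull paradigm, the sketch below generates an OPA configuration that tells an engine where to fetch its bundle and how often to poll. The keys (`services`, `bundles`, `service`, `resource`, `polling.min_delay_seconds`, `polling.max_delay_seconds`) follow OPA's documented configuration layout as best understood here and should be checked against the documentation for the OPA version in use. The service name, bundle name, bucket URL, and object path are hypothetical, and credentials for the bundle store are deliberately omitted.

```python
import json


def opa_bundle_config(store_url: str, bundle_path: str,
                      min_delay_seconds: int = 30, max_delay_seconds: int = 60) -> dict:
    """Build a pull-based OPA configuration for a single tenant bundle."""
    if min_delay_seconds > max_delay_seconds:
        raise ValueError("min_delay_seconds must not exceed max_delay_seconds")
    return {
        # Where bundles are downloaded from (hypothetical service name).
        "services": {"bundle_store": {"url": store_url}},
        # Which bundle to download and how often to poll for a new one.
        "bundles": {
            "tenant": {
                "service": "bundle_store",
                "resource": bundle_path,
                "polling": {
                    "min_delay_seconds": min_delay_seconds,
                    "max_delay_seconds": max_delay_seconds,
                },
            }
        },
    }


def render_config(config: dict) -> str:
    """Serialize the configuration as JSON text."""
    return json.dumps(config, indent=2, sort_keys=True)
```

The polling window bounds how stale an engine's data can be after a new bundle is written, so it is one of the knobs that trades freshness against download traffic.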
Performant reads and writes were essential for the rapid generation of these bundles. This requirement becomes especially important for bulk entitlement events (such as onboarding a new client), which can generate several hundred thousand events. Redis, an open source, in-memory key-value store, offered the performance and high availability necessary to fit into the OCES infrastructure, and Amazon ElastiCache let us rapidly stand up a Redis cluster that met the requirements. As new events are recorded on the Apache Kafka topics, they are written into Redis, and data isolation is maintained by partitioning the data in the cluster according to the topic the data originated from. A separate process runs on a schedule, reads all the data in a specific partition, and generates a corresponding data file for the engine. The generated data file is bundled into the tarball with the most recent version of the policy and written to Amazon S3, from which the engine downloads the bundle.
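The bundling step at the end of that pipeline can be sketched as below: given reference data already read from a partition and the latest policy source, build a gzipped tarball containing `data.json`, the `.rego` files, and a `.manifest` with a revision, which is the layout OPA expects for bundles. Reading from Redis and uploading to Amazon S3 are left out, and the function name, file paths, revision string, and example policy are all hypothetical.

```python
import io
import json
import tarfile
from typing import Dict

# Hypothetical minimal policy, used only to have something to package.
EXAMPLE_POLICY = "package tenant.authz\n\ndefault allow := false\n"


def build_bundle(reference_data: dict, policies: Dict[str, str], revision: str) -> bytes:
    """Package reference data and policy files into an in-memory .tar.gz bundle."""
    files = {
        ".manifest": json.dumps({"revision": revision}).encode("utf-8"),
        # sort_keys keeps the data file stable across runs for identical input.
        "data.json": json.dumps(reference_data, sort_keys=True).encode("utf-8"),
    }
    for path, source in policies.items():
        files[path] = source.encode("utf-8")

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()
```

The returned bytes are what a scheduled job would then write to the object store under the path the engines poll.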
1. Outages are a reality - building for resiliency is a necessity
Entitlements are a fundamental prerequisite for accessing any application. OCES provides a critical service for applications on the cloud and hence availability and resiliency are non-negotiable. As we started designing the solution and started thinking about Ops and Observability, building for resiliency through a multi-region topology became a core tenet.
2. The nature of a domain and its specific requirements constrain the flexibility offered by technology
A very important lesson that we learned while designing OCES is that the nature of the domain we operate in imposes constraints that would not otherwise apply when using a framework or technology. As an example, events that occur on any entitlement reference data must be strictly ordered when published to AWS. This ordering is important to ensure accurate entitlement decisions. The need for serial ordering limited the optimizations that we could otherwise have made to Kafka publication.
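A small Python sketch shows why ordering matters for entitlement events. It replays grant and revoke events into a role-membership map; the event shape (`op`, `user`, `role`) is an assumption for illustration and is not the OCES message format.

```python
from typing import Dict, Iterable, Set


def apply_events(events: Iterable[dict]) -> Dict[str, Set[str]]:
    """Replay grant/revoke events in the order given and return the resulting memberships."""
    memberships: Dict[str, Set[str]] = {}
    for event in events:
        roles = memberships.setdefault(event["user"], set())
        if event["op"] == "grant":
            roles.add(event["role"])
        elif event["op"] == "revoke":
            roles.discard(event["role"])
        else:
            raise ValueError(f"unknown operation: {event['op']}")
    return memberships


grant = {"op": "grant", "user": "alice", "role": "trader"}
revoke = {"op": "revoke", "user": "alice", "role": "trader"}

in_order = apply_events([grant, revoke])      # alice ends with no roles
out_of_order = apply_events([revoke, grant])  # alice wrongly keeps "trader"
```

The same two events produce opposite outcomes depending on delivery order, which is why reordering on the publication path would risk incorrect entitlement decisions.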
3. When presented with multiple conflicting options, avoid choices that lead to one-way doors
When thinking about scalability and maintenance of the platform, we were presented with options such as running OPA as tenants that applications connect to versus running OPA agents as sidecars for every application. Each approach has pros and cons. The obvious downside of the sidecar approach is that the number of agents can grow unchecked, with considerable cost and control implications. With a centralized OPA tenant cluster model, however, added latency might impact performance for certain applications. A lesson we learned when brainstorming these options is that building around one design choice should not close the door on a secondary design choice when there is a legitimate business need. Hence, while OCES offers centralized OPA tenants for businesses, the option to deploy sidecars remains open and available.
4. Business separation is highly valued and is often a fundamental requirement
One of the core tenets for launching OCES was offering a multi-tenanted environment that provides business data separation and isolation. This model provides dedicated OPA tenants for lines of business in the firm. Multi-tenancy offers the flexibility to load only the policies and reference data relevant to a tenant. We learned that there is a lot of value in this approach because we can shield each tenant from the usage patterns and load of other tenants.
The launch of OCES is just the beginning of a larger journey towards enabling entitlements management on a cloud environment operated by a firm such as Goldman Sachs. We are now focused on extending the initial offering of OCES to allow multiple businesses in the firm to model comprehensive business policies on the cloud. We are working towards features that would accelerate adoption across various businesses.
In this post, we discussed how we developed an entitlements platform for Goldman Sachs’ cloud-native applications. We hope it provides insight into how a variety of AWS services can be used in conjunction with OPA to orchestrate an entitlements platform that can be leveraged by a broad audience.