June 16, 2022

Building Automated Scalable Network Policy Management: A Look into PINACL

Kishore Kumar Anand, VP, Engineering and Irene Nydees, VP, Engineering

At Goldman Sachs, the infrastructure resources in our cloud can be managed using Infrastructure-as-Code (IaC). The engineering teams at Goldman Sachs request these resources, via code, on-demand and expect the provisioning to be completed in a matter of minutes. The various infrastructure resources include compute, network and storage along with other services. This post will explore how network connectivity requirements are fulfilled by PINACL.

PINACL is a GS Accelerate sponsored product that seeks to drive end-to-end automation of network security connectivity policies that control network traffic in an enterprise. PINACL allows connectivity policy to be authored and owned by application developers rather than network operators. It leverages network topology, built for the enterprise by PINACL, to discover the impact on the existing security posture and delivers the user's intent.

Enterprise Requirements

  • Scale: Application developers and network operators submit multiple connectivity requests, which result in a large number of connectivity flows.
  • Consistency: Data should be strictly consistent across the system because inconsistencies could lead to unauthorized access and security incidents.
  • Latency: Users would like to realize their infrastructure changes in a timely manner. Since a single connectivity request could impact multiple security enforcement points, the fulfillment system should have high throughput to keep the end-to-end latency low.

Solutions

  • Highly scalable distributed services that receive requests and deliver provisioning asynchronously.
  • PINACL uses CockroachDB as a datastore. CockroachDB is a distributed database with a SQL interface that promises consistency and partition tolerance.

PINACL Architecture

PINACL's architectural and design decisions are influenced by the challenges and solutions listed above. The product's capabilities can be broadly divided into two parts: upstream and downstream systems.

Upstream System

The upstream system accepts requests through REST API endpoints for realizing any change in the network security posture. It processes the events in an asynchronous manner and adds tasks to the policy fulfillment queue.

Please read the description below.

Description of the upstream system: A flowchart showing requests from IaC application developers and network operators, along with event streams and topology changes, funneling into the API server, then the processor queue, the pipeline processor, and the fulfillment queue. Each user (use case) has a dedicated, independent upstream system that processes its requests and changes.


High availability, low coupling, and high cohesion are the driving forces for the design of the upstream system.

Each use case, such as policy management by network engineers or by software developers, has been divided into individual services with a single responsibility. Each use case has its own dedicated upstream service, independent of the others, which makes it easier to maintain the system and to extend it to support new use cases in the future. All the upstream systems communicate with the downstream system through the same fulfillment queue.

The upstream system is further divided into two parts, the API server and the pipeline processor, to ensure high availability. Multiple instances (replicas) of the API server can continue accepting and queuing requests even when the pipeline processor is down due to issues with the processing of a request. All the queued requests will eventually be processed by the pipeline processor.

API Server: Authorized users can interact with PINACL using the RESTful APIs exposed by the server component to CREATE / READ / UPDATE / DELETE the security policies managed by them.

Pipeline Processor: Processes user connectivity intent by first detecting the impacted rules and then using the network topology to determine which security devices need to be reprovisioned. These security devices are then added to the fulfillment queue, which is processed by the downstream system.

For example, consider the following network topology where there is a security device (SD1) blocking the network traffic between Host A and Host B.

Please see description below.

A flow diagram showing the connectivity between Host A and Host B before and after the change. Before, the security device (SD1) blocks the traffic between the two hosts. After, the traffic is unblocked: the policy flows from PINACL to the security device, which is programmed to allow it.
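
The Host A / Host B example can be sketched in a few lines of Python. This is only an illustration of using topology to find the security devices on a path, not PINACL's actual implementation: the adjacency-list topology, the `SECURITY_DEVICES` set, and both function names are assumptions made for this sketch, and a real network would usually have several candidate paths rather than one.

```python
from collections import deque

# Hypothetical topology as an adjacency list (assumption for this sketch).
TOPOLOGY = {
    "HostA": ["SD1"],
    "SD1": ["HostA", "HostB"],
    "HostB": ["SD1"],
}
# Hypothetical set of nodes that enforce security policy.
SECURITY_DEVICES = {"SD1"}


def shortest_path(topology, src, dst):
    """Breadth-first search; returns one shortest path or None."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for neighbor in topology.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


def devices_to_reprovision(topology, security_devices, src, dst):
    """Return the security devices on the path between src and dst."""
    path = shortest_path(topology, src, dst)
    if path is None:
        return []
    return [node for node in path if node in security_devices]


impacted = devices_to_reprovision(TOPOLOGY, SECURITY_DEVICES, "HostA", "HostB")
print(impacted)
```

In this toy topology the only device between the two hosts is SD1, so it is the one that would be added to the fulfillment queue.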


Downstream / Fulfillment System

The downstream system processes the tasks in the policy fulfillment queue by preparing and pushing the latest policies to the security devices.

Please see descriptive text below.

A diagram showing multiple upstream systems flowing into a priority-based fulfillment queue, then into orchestrators from which agents request tasks, then down into the vendor-provided security devices.


The rate at which the upstream systems add tasks to the fulfillment queue can be high, driven by the volume of changes they receive. To keep the end-to-end latency low, horizontal scalability was the driving factor in designing the downstream system. To ensure scalability, the fulfillment system adopts the Orchestrator - Agent pattern with a task-stealing approach, where the agents request (pull) tasks from the orchestrator when they are free. We adopted a pull-based approach over a push-based one because the rate of processing and completion of each task varies widely based on the policy size and vendor. The agent nodes are elastic: during heavy load, new agent nodes can be dynamically added to the group to share the load and increase the throughput. Each task is a unit of work which involves pushing the security policy to a device.

The orchestrator acts as a mediator: when an agent requests a task, it groups the entries currently pending in the queue for the same security device into a single task. It also ensures that the same security device is not allocated to different agents at the same time, to avoid policy provisioning conflicts.
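
A minimal in-memory sketch of these two orchestrator responsibilities follows. It is an illustration under stated assumptions, not the PINACL code: the `Orchestrator` class, its method names, and the change identifiers are all hypothetical, and the sketch is single-threaded, whereas a real orchestrator would need locking or a transactional datastore.

```python
class Orchestrator:
    """Hypothetical in-memory orchestrator; not thread-safe."""

    def __init__(self):
        self._pending = {}       # device -> list of pending change ids
        self._in_flight = set()  # devices currently allocated to an agent

    def enqueue(self, device, change_id):
        # Entries for the same device accumulate under one key, so they
        # are handed out later as a single task.
        self._pending.setdefault(device, []).append(change_id)

    def request_task(self):
        """Called by a free agent (pull). Returns (device, changes) or None."""
        for device in list(self._pending):
            if device not in self._in_flight:
                changes = self._pending.pop(device)
                self._in_flight.add(device)
                return device, changes
        return None

    def complete(self, device):
        # The agent reports completion, releasing the device.
        self._in_flight.discard(device)


orc = Orchestrator()
orc.enqueue("SD1", "change-1")
orc.enqueue("SD2", "change-2")
orc.enqueue("SD1", "change-3")

first = orc.request_task()      # both SD1 entries grouped into one task
orc.enqueue("SD1", "change-4")  # arrives while SD1 is being provisioned
second = orc.request_task()     # SD1 is busy, so SD2 is handed out
third = orc.request_task()      # only SD1 is pending, and it is busy
orc.complete("SD1")
fourth = orc.request_task()     # SD1 is free again
```

Because agents pull work only when they are free, a slow device push does not back up behind a dispatcher; it only delays later changes for that same device.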

The agent composes the policy by optimizing and compressing the flows that will be pushed to a security device, reducing them from many millions to a few thousand. The agent then translates the policy and pushes it to the security device using the driver of the respective vendor.
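
The post does not describe the compression algorithm itself. One common technique for this kind of reduction is CIDR aggregation, in which adjacent or overlapping address ranges are merged into the smallest equivalent set of prefixes. The sketch below shows that idea using Python's standard `ipaddress` module; the function name and the sample addresses are assumptions made for illustration.

```python
import ipaddress


def compress_networks(cidrs):
    """Merge adjacent and overlapping CIDR blocks (all IPv4 or all IPv6)."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    return [str(net) for net in ipaddress.collapse_addresses(networks)]


# Hypothetical source ranges taken from many individual flows.
sources = [
    "10.0.0.0/25",
    "10.0.0.128/25",   # adjacent to the block above
    "10.0.1.0/24",     # adjacent to the merged 10.0.0.0/24
    "10.0.0.64/26",    # already contained in 10.0.0.0/25
    "192.168.1.5/32",  # unrelated host, stays as-is
]
compressed = compress_networks(sources)
print(compressed)
```

Here five entries collapse to two, `10.0.0.0/23` and `192.168.1.5/32`, without changing which addresses are matched.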

Queue Control: A priority can be assigned to each task, which influences the order of fulfillment. This is an effective way to control the queue, especially during heavy load.
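
As a rough illustration of priority-ordered fulfillment, the sketch below uses a binary heap from Python's standard library. The class name, the convention that a lower number means higher priority, and the first-in-first-out tie-break are assumptions; the post does not specify how PINACL encodes or orders priorities.

```python
import heapq
import itertools


class PriorityFulfillmentQueue:
    """Hypothetical priority queue; a lower number is served first."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()  # tie-breaker keeps FIFO order

    def put(self, device, priority):
        heapq.heappush(self._heap, (priority, next(self._sequence), device))

    def get(self):
        """Return the next device to provision, or None if empty."""
        if not self._heap:
            return None
        _, _, device = heapq.heappop(self._heap)
        return device


queue = PriorityFulfillmentQueue()
queue.put("SD1", priority=5)
queue.put("SD2", priority=1)  # urgent change jumps ahead
queue.put("SD3", priority=5)

order = [queue.get(), queue.get(), queue.get()]
print(order)
```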

Data Management in PINACL

At a high level, the PINACL product needs to keep track of a few key datasets:

  • Connectivity policy contains a collection of rules. Rules reference endpoints (or endpoint groups) and app protocol (or app protocol groups). Endpoint groups are defined as a collection of other endpoint groups (nested) or leaf endpoints. The rules and endpoint group relationships change often. They are entities with strong relationships between them and have their own lifetime and ownership.
  • Network Topology information, which changes less frequently.
  • The upstream system computes the impact caused by policy changes and determines which security devices need to be programmed. This requires joins between the entities and recursive Common Table Expressions (CTEs) to resolve nested dependencies.
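
To make the recursive CTE point concrete, the sketch below resolves a nested endpoint group down to its leaf endpoints. It uses Python's built-in `sqlite3` module purely so that the example is runnable anywhere; the table, column, and group names are hypothetical and are not PINACL's schema, and the query would need to be checked against CockroachDB before being used there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE group_member (
    parent  TEXT NOT NULL,    -- endpoint group name
    child   TEXT NOT NULL,    -- nested group name or leaf endpoint
    is_leaf INTEGER NOT NULL  -- 1 for a leaf endpoint, 0 for a group
);
INSERT INTO group_member VALUES
    ('web-tier', 'web-east', 0),
    ('web-tier', 'web-west', 0),
    ('web-east', '10.1.0.10', 1),
    ('web-east', '10.1.0.11', 1),
    ('web-west', '10.2.0.10', 1);
""")

# Start from the direct members of one group, then repeatedly expand any
# member that is itself a group. UNION (not UNION ALL) drops duplicates,
# which also stops the recursion if groups reference each other.
RESOLVE_LEAVES = """
WITH RECURSIVE members(child, is_leaf) AS (
    SELECT child, is_leaf FROM group_member WHERE parent = ?
    UNION
    SELECT gm.child, gm.is_leaf
    FROM group_member AS gm
    JOIN members AS m ON gm.parent = m.child
    WHERE m.is_leaf = 0
)
SELECT child FROM members WHERE is_leaf = 1 ORDER BY child;
"""

leaves = [row[0] for row in conn.execute(RESOLVE_LEAVES, ("web-tier",))]
print(leaves)
```

Resolving the nesting inside the database avoids one round trip per nesting level, which is the cost a client-side expansion would pay.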

Relational Database Choice

Both the upstream and downstream processing require significant data joins to fulfill the requests. With most document DBs, server-side joins are either non-existent or work only with non-sharded collections (e.g. MongoDB's $lookup operator) and are limited to equi-joins. This requires applications to perform client-side joins, which are relatively expensive due to network latency, the amount of data transferred over the network, and the number of calls to the DB.

While a database which doesn't enforce any schema is easier to develop, there is value in enforcing a schema and referential integrity at the database layer. Datasets tend to outlive the application, so ensuring the data integrity is paramount and having the database prevent data quality issues due to bugs in the application is valuable.

With denormalized documents in either JSON or BSON format, there is a significant overhead to the dataset size compared to a relational database. This is partly because denormalization introduces repeated values and partly because JSON/BSON requires repeated string keys.

Support for ACID transactions simplifies application development, because the application does not have to detect and work around various failure scenarios.

Selecting CockroachDB

CockroachDB is an open source database providing support for a relational data model, range-sharded data partitioning, replication, and high availability. CockroachDB uses MVCC to provide ACID transactions and, more specifically, the serializable isolation level. It supports the PostgreSQL wire protocol, allowing existing PostgreSQL clients (e.g. the JDBC driver) to work seamlessly with CockroachDB.

While there are lots of features and benefits with using CockroachDB, here are a few that were important when making a selection:

  • Supports transactions with serializable isolation level which simplifies application development.
  • Leverages its MVCC to support point-in-time queries to reduce contention with other writes in the system.
  • Range partitioning of data allows the data storage layer to scale by addition of more nodes to the cluster.
  • Provides high availability out of the box: every range is replicated N times across the cluster, and the data stays available as long as a quorum is available, i.e. a majority (N/2 + 1, rounded down) of those replicas are up.
  • Supports online backup by leveraging point-in-time query capabilities.
  • Supports rolling upgrades.
  • Supports IPv4 and IPv6 data types.
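
The quorum rule in the list above is simple majority arithmetic, worked through below for a few replication factors. The function names are hypothetical, and the sketch only shows the arithmetic; it says nothing about how CockroachDB places replicas.

```python
def quorum(replicas):
    """Smallest majority of `replicas` copies: floor(N / 2) + 1."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """How many replicas can be lost while a majority is still up."""
    return replicas - quorum(replicas)


# Replication factor -> (quorum size, failures tolerated)
table = {n: (quorum(n), tolerated_failures(n)) for n in (3, 4, 5)}
print(table)
```

Moving from 3 to 4 replicas raises the quorum from 2 to 3 without tolerating any additional failure, which is why odd replication factors are the usual choice.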

What's Next?

Today, PINACL supports multiple security device vendors and, with the growing need for multi-vendor solutions, we are planning to add support for more. With the increasing adoption of cloud infrastructure alongside on-premises infrastructure, PINACL seeks to extend its support to hybrid environments. The network topology also contains other kinds of devices that can enforce security, such as routers with ACLs. PINACL is envisioned to become the single solution for managing all kinds of security enforcement points.

PINACL is actively hiring software engineers in Bengaluru, see roles here


About GS Accelerate

Since launching in 2018, Goldman Sachs employees have submitted more than 1,800 ideas, many of which, including PINACL, have been funded, resulting in seven new products available for Goldman Sachs clients and people. GS Accelerate businesses are hiring across teams and functions in offices across the globe including New York, Dallas, and Birmingham. To view open roles, visit here.

PINACL technology is covered by pending and issued patents.


See https://www.gs.com/disclaimer/global_email for important risk disclosures, conflicts of interest, and other terms and conditions relating to this blog and your reliance on information contained in it.