The DevOps space is in a constant state of change and innovation. One area that has seen significant growth in recent years is the use of monolithic repositories, or monorepos. A monorepo stores many codebases in a single repository, whereas a multi-repo setup spreads them across many repositories that depend on one another. Several large technology companies, including Facebook, Google, and Uber, have incorporated monorepos as a basic component of their architecture.
The use of monorepos can help companies save time on management, testing, and debugging. Many of these benefits come from having all the code in a centralized, easily accessible place. This makes it easier to view and manage dependencies and interactions among the packages. There are also fewer cross-repository dependencies for developers to worry about when deploying new features or updates. Monorepos have been gaining popularity because they enable faster iterations in product development: changes can be made quickly, with less risk of merge conflicts.
In practice, working in a monorepo implies that one is always working out of HEAD (also known as the master, main, or trunk branch) instead of a separate development branch, as is typical in many Git workflows.
source: fpy.cz
Instead of working on a branch that is pulled from HEAD, which can quickly become outdated, everybody works on HEAD (in the above diagram, master). With multiple repos, libraries that are pinned to specific versions are scattered about. This makes it difficult to identify callers and update dependencies across projects. With monorepos, all libraries and their dependencies are accessible on HEAD. This makes identifying all current callers simple, so it is immediately clear which clients need to be updated.
Release branches are published off of HEAD, like taking an exit off of the development highway. Source: cloud.google.com/architecture/devops
Adopting a monorepo strategy also makes it possible to track changes and clean up dependency management throughout a codebase. Having all the code in one place makes it easier to find and fix errors, and to see which other parts of the project a change touches. Additionally, using a monorepo allows developers to more easily share code between projects, which can save time and development resources.
Despite all of their benefits, monorepos introduce new security risks to source code management and build processes. The following section covers a few of the potential security risks and mitigation techniques.
One of the key considerations for companies when implementing a monorepo is access control. When all of the code is stored in a single repository, it can be difficult to track who has access to which files and folders. At scale, it is also more difficult to revoke access when someone no longer needs certain files in different parts of the monorepo.
Enterprises typically handle their access controls and privileges at the repo level in the multi-repo world, but they must be scoped at the file or folder level in the monorepo world. Git itself does not provide this, but the code owners feature (a CODEOWNERS file) offered by hosting platforms such as GitLab and GitHub does. Individuals or teams that are in charge of files or directories in a repository are referred to as code owners. When someone opens a merge request (a pull request on GitHub) to modify code, the code owners are automatically requested to review it. Managing the lifecycles of code owner permissions and wiring them back to a central entitlements system is not simple, but it is critical to establishing a procedure for handling code owners' entitlements.
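To make path-scoped ownership concrete, here is a minimal Python sketch of how a path might be resolved to its owners. It assumes a simplified rule format in which the last matching pattern wins. The rule list, team names, and the `owners_for` helper are hypothetical, and real CODEOWNERS files use gitignore-style patterns with richer semantics than the shell-style matching used here.

```python
from fnmatch import fnmatchcase

# Hypothetical ownership rules in the spirit of a CODEOWNERS file:
# (path pattern, owners). Later rules override earlier ones.
RULES = [
    ("*", ["@platform-team"]),
    ("services/payments/*", ["@payments-team"]),
    ("services/payments/crypto/*", ["@payments-team", "@security-team"]),
]


def owners_for(path, rules=RULES):
    """Return the owners of the last rule whose pattern matches the path."""
    matched = []
    for pattern, owners in rules:
        # Simplification: fnmatch's "*" also matches "/" characters,
        # so a directory pattern covers everything beneath it.
        if fnmatchcase(path, pattern):
            matched = owners
    return matched
```

A mapping like this could also be exported to a central entitlements system so that ownership changes are reviewed in one place, though that integration is outside the scope of this sketch.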
When enterprises decide to use a monorepo, they must deal with performance and availability risks. As the codebase expands, it may become more difficult to maintain and scale, especially with hosted editions of version control needing to handle thousands of transactions (code changes) per hour. Additionally, as the repository becomes larger, it can take longer for developers to clone it and build it. In order to mitigate these risks, companies need to have a clear plan for scaling their monorepo. This plan should include measures for optimizing performance and ensuring availability.
Git and its ecosystem provide several features that help with large repositories, including shallow clones, sparse checkout, and Git LFS. An alternative is to move away from distributed version control systems such as Git and instead use a centralized version control tool such as Perforce, which keeps the local codebase in sync with a central server and can improve push/pull efficiency.
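As an illustration of combining a shallow clone with sparse checkout, the sketch below only builds the git command lines rather than running them. The repository URL, destination, and directory names are placeholders, and the `partial_clone_commands` helper is hypothetical. It assumes a Git version recent enough to support `git clone --sparse` and `git sparse-checkout`.

```python
import shlex


def partial_clone_commands(repo_url, dest, directories, depth=1):
    """Build git commands for a shallow, sparse clone limited to some directories."""
    # --depth limits history; --sparse starts with only top-level files checked out.
    clone = ["git", "clone", f"--depth={depth}", "--sparse", repo_url, dest]
    # sparse-checkout set then adds the directories this developer actually needs.
    sparse = ["git", "-C", dest, "sparse-checkout", "set", *directories]
    return [clone, sparse]


if __name__ == "__main__":
    for cmd in partial_clone_commands(
        "https://git.example.com/org/monorepo.git",  # placeholder URL
        "monorepo",
        ["services/payments", "libs/common"],
    ):
        print(shlex.join(cmd))
```

Printing the commands instead of executing them keeps the example safe to run anywhere; a team could pass each list to `subprocess.run` once the options suit its repository.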
One of the key advantages of using a monorepo is that it reduces development time by allowing for more code sharing between projects. Shared code and tooling also encourage standardization, which makes another team's codebase easier to understand and makes it easier to address issues or bugs in other projects that impact one's build.
However, this also introduces a new risk of breaking changes from work such as massive refactoring, code restructuring, or code cleanups. When code is shared between projects, changes made in one project can break the code in another project, as the code in a monorepo is interdependent. This can be especially risky if two projects used by different teams are in different stages of development. To mitigate this risk, companies need to have a clear plan for managing breaking changes. This plan should include measures for identifying and tracking breaking changes, testing breaking changes before they are merged into the main codebase, and rolling back breaking changes if they cause problems.
This is where build systems such as Bazel or Buck, which understand the dependencies between build targets, shine. Bazel builds a dependency graph of the codebase, which makes it possible to assess which targets a large change affects. With this approach, only the targets affected by changes since the prior build need to be rebuilt. Additionally, Bazel has a built-in caching mechanism that speeds up the build process, making it well suited to maintaining build hygiene in large codebases.
A sample dependency graph that Bazel has generated. Source: https://docs.bazel.build/versions/4.2.1/skylark/aspects.html
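The idea of using a dependency graph to find what a change affects can be sketched in a few lines of Python. This is not Bazel's implementation: the target labels and the `affected_targets` helper are hypothetical, and the graph is hard-coded rather than read from a build tool. Bazel's own query language offers `rdeps` for this kind of reverse-dependency question.

```python
from collections import defaultdict, deque

# Hypothetical targets mapped to their direct dependencies.
DEPS = {
    "//app:server": ["//lib:auth", "//lib:logging"],
    "//app:cli": ["//lib:logging"],
    "//lib:auth": ["//lib:crypto"],
    "//lib:logging": [],
    "//lib:crypto": [],
}


def affected_targets(deps, changed):
    """Return the changed targets plus everything that transitively depends on them."""
    # Invert the graph: for each target, record who depends on it.
    reverse = defaultdict(set)
    for target, direct in deps.items():
        for dep in direct:
            reverse[dep].add(target)

    # Breadth-first walk over reverse edges, starting from the changed targets.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in reverse[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

In this toy graph, a change to `//lib:crypto` affects `//lib:auth` and `//app:server` but leaves `//app:cli` untouched, so only the first three targets would need to be rebuilt and retested.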
Supply chain attacks are a real and present danger to enterprise security, and monorepos might increase the risk of these types of attacks. Because all the code is housed in one place, it's easier for attackers to find and exploit vulnerabilities in the code.
This underscores the importance of running vulnerability scanning and software composition analysis (common security practices for mitigating supply chain attacks) on software packages as part of CI pipelines in monorepos. However, many providers do not include vulnerability scanning as part of their Bazel build CI pipelines. This is because a Bazel build does not rely on the dependency manifest or lock files that package managers such as npm use. Instead, the configuration is maintained in BUILD files written in Starlark, a domain-specific language with Python-like syntax. This makes it difficult for modern vulnerability scanners to determine, out of the box, which packages (and transitive dependencies) the project requires.
This risk can be reduced by pointing the Bazel build to internal software artifact repositories and allowing only scanned libraries to be ingested. If maintaining such an up-to-date software asset inventory is difficult, it's a good idea to build a wrapper or an API around the software composition and vulnerability analysis platforms that takes the Bazel dependency graph as a payload and evaluates the provenance of each library and all transitive dependencies linked to the project. Such provenance may be established with SLSA software attestations and the in-toto attestation framework.
A sample flattened list of the dependency graph generated by Bazel. Source: https://docs.bazel.build/versions/main/query-how-to.html
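A wrapper of the kind described above could start from something as small as the following Python sketch, which checks a flattened dependency list against an inventory of scanned libraries. The labels, the `APPROVED` set, and the `unapproved_external_deps` helper are all hypothetical. The sketch assumes the list has already been exported from the build tool as one label per entry and that external dependencies are the labels beginning with `@`.

```python
# Hypothetical inventory of external libraries that have passed scanning.
APPROVED = {
    "@maven//:com_google_guava_guava",
    "@maven//:org_slf4j_slf4j_api",
}


def unapproved_external_deps(flattened_deps, approved=APPROVED):
    """Return external labels (those starting with '@') missing from the approved inventory."""
    external = {
        label.strip()
        for label in flattened_deps
        if label.strip().startswith("@")
    }
    return sorted(external - set(approved))
```

A CI step could fail the build whenever this function returns a non-empty list, and hand the flagged labels to a composition analysis platform for a closer look at their provenance.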
In the monorepo world, developers and CI systems typically use remote build caches to share build outputs. If your build is reproducible, the outputs from one machine can be safely reused on another machine, which can make builds significantly faster. However, these caches can also be susceptible to attack. An attacker could poison the build cache by inserting a malicious build output that is then reused by subsequent builds instead of being rebuilt from source, which might result in a supply chain compromise.
To minimize this threat, it is critical to protect the build caches for each project. Using authentication and authorization procedures to ensure that only authorized people can access the build cache is essential. It is also prudent to monitor the build cache for any unexpected changes that indicate an attack.
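One way to monitor a cache for unexpected changes is to check that stored content still matches the digest it is filed under. The Python sketch below assumes a content-addressed cache modeled as a dictionary from SHA-256 hex digests to bytes. This is an illustrative model, not the API of any particular cache server, and the helper names are hypothetical. It would not catch a poisoned mapping from a build action to its outputs, which is why authenticating and authorizing cache writes remains essential.

```python
import hashlib


def sha256_hex(blob):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(blob).hexdigest()


def tampered_entries(cache):
    """Return keys whose stored bytes no longer hash to the key they are filed under."""
    return sorted(key for key, blob in cache.items() if sha256_hex(blob) != key)


if __name__ == "__main__":
    good = b"compiled output"
    cache = {sha256_hex(good): good}
    # Simulate an attacker overwriting the content stored under a legitimate digest.
    cache[sha256_hex(b"other output")] = b"malicious payload"
    print(tampered_entries(cache))
```

Running a check like this on a schedule, and alerting on any non-empty result, is one simple form of the monitoring suggested above.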
Overall, there are several advantages to working with a monorepo. However, just as with any large codebase, specific procedures and tooling must be in place to keep it secure. By understanding the risks involved with monorepos and implementing the necessary controls to mitigate them, you can reap the benefits while still keeping your codebase secure.