January 12, 2022

Infrastructure and the Command Chain Pattern

Eugeniu Zaharia, Vice President, Storage Engineering

The Challenge

At Goldman Sachs, we manage a large private cloud that includes various compute, storage, networks, and higher level infrastructure services like relational databases and NoSQL databases. The clients of this private cloud are the internal engineering teams at Goldman Sachs that build services for the business. They expect infrastructure on-demand, and so the provisioning of resources is underpinned by APIs that are accessible directly or via higher level configuration constructs.

To provision these resources, the teams managing this infrastructure often have to execute read/write operations against multiple external vendor resources in a transactional manner. The operations need to either succeed together or be reverted - if there are errors, we need to leave the state the way we found it.

The external APIs we interact with are generally robust and stable, but we do encounter issues that result in partial failures, and therefore, an inconsistent state of our infrastructure. For example, when we provision storage on an array and update the inventory records at the same time, we want both of those operations to either succeed or fail together, so that the array and the inventory are in sync. If partial failures occur, we have to clean up. Cleaning up involves either reverting the successful operations or re-applying the failed ones. Partial failures are difficult to troubleshoot because of the complex resource topology. Whenever there was a partial failure, our team would have to manually troubleshoot the failed orders, and manually roll back the resources in order to keep the physical infrastructure and the internal inventory in sync. This was tedious and time-consuming.

The Solution

Most external resources are not XA compliant, so existing distributed transaction APIs cannot be used. We devised a solution called the Command Chain pattern, which blends concepts from the Chain of Responsibility and Command patterns, and used it to automate the cleanup of failures.

The Command Chain is used to manage transactionality against multiple external vendor resources. We implemented it in Java, so the sample code below is in Java.

Registering the Commands with the Chain

External resource operations are grouped into commands by implementing the Command interface. The commands are registered with the chain, together with their context, in the order in which they are to be executed.

CommandChain commandChain = new ReversibleCommandChain();

Context context = new SimpleContext();
        
commandChain.registerInContext(startOrder, context);
commandChain.registerInContext(updateInventoryBefore, context);
commandChain.registerInContext(callExternalVendor1, context);
commandChain.registerInContext(callExternalVendor2, context);
commandChain.registerInContext(updateInventoryAfter, context);      
commandChain.registerInContext(closeOrder, context);

commandChain.execute(context);

Command Context

The chain needs to keep track of the arguments needed to execute each command in order to be able to revert them if needed. The context is an abstraction that encapsulates all the arguments needed to execute and/or revert the commands.

private final List<AtomicCommand<?>> commands = new ArrayList<>();

public <T extends Context, C extends Command<T>> void registerInContext(final C command, final T context) {
    commands.add(new AtomicCommand<>((ReversibleCommand<T>) command, context));
}
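The post does not show how AtomicCommand binds a command to its context, so the following is only a minimal sketch of one way it could look. The State values, the CommandResult shape, the mapping of a thrown exception to FAILED, and the RecordingContext helper are all assumptions made for illustration, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical supporting types; the post does not show their real definitions.
enum State { COMPLETE, FAILED, REVERTED }

final class CommandResult {
    private final State state;
    CommandResult(State state) { this.state = state; }
    State getState() { return state; }
}

interface Context { }

interface Command<T> {
    void execute(T context);
}

interface ReversibleCommand<T> extends Command<T> {
    void revert(T context);
}

// Illustrative context that simply records what happened to it.
final class RecordingContext implements Context {
    final List<String> events = new ArrayList<>();
}

// Pairs a command with the context it was registered with, so the chain can
// execute or revert it later without being handed the arguments again.
final class AtomicCommand<T extends Context> {
    private final ReversibleCommand<T> command;
    private final T context;

    AtomicCommand(ReversibleCommand<T> command, T context) {
        this.command = command;
        this.context = context;
    }

    // Assumption: an exception thrown by the command is reported as FAILED.
    CommandResult executeInContext() {
        try {
            command.execute(context);
            return new CommandResult(State.COMPLETE);
        } catch (RuntimeException e) {
            return new CommandResult(State.FAILED);
        }
    }

    // Assumption: a failed revert is also reported as FAILED.
    CommandResult revertInContext() {
        try {
            command.revert(context);
            return new CommandResult(State.REVERTED);
        } catch (RuntimeException e) {
            return new CommandResult(State.FAILED);
        }
    }
}
```

Because the context travels with the command, the chain itself only needs to call executeInContext() and revertInContext() and inspect the returned state.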

Reversible Commands

The execute and revert methods of a reversible command must each perform a single operation and be as simple as possible. More complex operations need to be split into separate atomic commands. Commands that can be retried implement the retry mechanism inside their execute/revert methods.

public interface ReversibleCommand<TContext> extends Command<TContext> {

    // execute phase to move the estate forward
    void execute(TContext context);

    // revert phase to rollback the estate to the original state
    void revert(TContext context);

}
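As an illustration of a retry mechanism living inside execute, here is a minimal sketch of a decorator that retries a delegate a bounded number of times. RetryingCommand and maxAttempts are hypothetical names, the interfaces are re-declared only to keep the example self-contained, and the sketch assumes that a failed attempt leaves no side effects behind, so that repeating it is safe.

```java
// Re-declared here only so the sketch compiles on its own.
interface Command<T> {
    void execute(T context);
}

interface ReversibleCommand<T> extends Command<T> {
    void revert(T context);
}

// Hypothetical decorator: retries execute() up to maxAttempts times and
// delegates revert() unchanged.
final class RetryingCommand<T> implements ReversibleCommand<T> {
    private final ReversibleCommand<T> delegate;
    private final int maxAttempts;

    RetryingCommand(ReversibleCommand<T> delegate, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.delegate = delegate;
        this.maxAttempts = maxAttempts;
    }

    @Override
    public void execute(T context) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                delegate.execute(context);
                return; // success, stop retrying
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        // All attempts failed: surface the last error so the chain can revert.
        throw last;
    }

    @Override
    public void revert(T context) {
        delegate.revert(context);
    }
}
```

A real implementation would probably also wait between attempts and retry only on errors known to be transient; both are left out here to keep the sketch short.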

Command Execute

The Command Chain works through a list of commands. Each command implements an execute() method and, if the command is reversible, a revert() method. The chain executes the commands sequentially and pushes each successful command onto a last-in, first-out queue (a stack) that can later be used to bring the estate back to its original state.

private final List<AtomicCommand<?>> commands = new ArrayList<>();

private final Deque<AtomicCommand<?>> toBeReverted = new ArrayDeque<>();

public <T extends Context> void execute(final T context) {
        
   CommandResult result = null;
   for (final AtomicCommand<?> command : commands) {

       result = command.executeInContext();

       if (result.getState() == State.FAILED) {
               
           final CommandResult reversalResult = revert();
           .................
           throw commandExecutionException;

       } else if (result.getState() == State.COMPLETE) {
           toBeReverted.push(command);
       }
   }
}

Command Revert

When a failure occurs, the chain automatically reverts the commands that executed successfully: it pops them from the toBeReverted queue, which is used as a stack, and invokes revert() on each one in reverse order of execution.

private final Deque<AtomicCommand<?>> toBeReverted = new ArrayDeque<>();

private CommandResult revert() {

    final CommandResult result = new CommandResult(State.REVERTED);
    while (!toBeReverted.isEmpty()) {
       final AtomicCommand<?> command = toBeReverted.pop();
            
       final CommandResult commandResult = command.revertInContext();
       ............
    }

    return result;
}

If any of the revert operations fail, the Command Chain stops, and an alert is raised for someone to investigate and troubleshoot the errors. The Command Chain Command Log (see below) is very useful in such cases.
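The stop-on-failure behaviour can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: Revertible, RevertOutcome and RevertSketch are hypothetical names, a revert reports success with a boolean, and the alert is represented only by the returned list of commands that were not reverted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a command that can be reverted.
interface Revertible {
    String name();
    boolean revert(); // true when the revert succeeded
}

// Summary an alert could be built from.
final class RevertOutcome {
    final List<String> reverted = new ArrayList<>();
    final List<String> notReverted = new ArrayList<>();

    boolean isClean() {
        return notReverted.isEmpty();
    }
}

final class RevertSketch {
    // Small factory used only to build example commands.
    static Revertible named(final String name, final boolean succeeds) {
        return new Revertible() {
            @Override public String name() { return name; }
            @Override public boolean revert() { return succeeds; }
        };
    }

    // Pops commands in reverse order of execution and stops at the first
    // failed revert, leaving the remaining commands untouched.
    static RevertOutcome revertAll(Deque<Revertible> toBeReverted) {
        RevertOutcome outcome = new RevertOutcome();
        while (!toBeReverted.isEmpty()) {
            Revertible command = toBeReverted.pop();
            if (command.revert()) {
                outcome.reverted.add(command.name());
            } else {
                outcome.notReverted.add(command.name());
                // Everything still on the stack was never attempted.
                for (Revertible remaining : toBeReverted) {
                    outcome.notReverted.add(remaining.name());
                }
                break;
            }
        }
        return outcome;
    }
}
```

When isClean() returns false, the notReverted list tells the person investigating exactly which commands still need manual attention.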

Command Log

The execution status of each command is logged to a Command Log. This provides audit information and can be referred to during troubleshooting. It can be used to assess at which point in the chain the error occurred, which commands were reverted and which were not.

select * from COMMAND_LOG where REQUEST_ID ='REQUEST_152'
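To make the shape of such a log concrete, here is a minimal in-memory sketch. Only the COMMAND_LOG table and its REQUEST_ID column appear in the query above; the commandName and state fields, and the CommandLogEntry and InMemoryCommandLog classes, are assumptions for illustration and do not reflect the actual schema.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical row shape; only REQUEST_ID is taken from the original query.
final class CommandLogEntry {
    final String requestId;
    final String commandName;
    final String state; // for example COMPLETE, FAILED or REVERTED

    CommandLogEntry(String requestId, String commandName, String state) {
        this.requestId = requestId;
        this.commandName = commandName;
        this.state = state;
    }
}

// In-memory stand-in for the COMMAND_LOG table.
final class InMemoryCommandLog {
    private final List<CommandLogEntry> entries = new ArrayList<>();

    // Appends one entry per command state change, in the order they happen.
    void record(String requestId, String commandName, String state) {
        entries.add(new CommandLogEntry(requestId, commandName, state));
    }

    // Equivalent of: select * from COMMAND_LOG where REQUEST_ID = ?
    List<CommandLogEntry> findByRequestId(String requestId) {
        List<CommandLogEntry> matches = new ArrayList<>();
        for (CommandLogEntry entry : entries) {
            if (entry.requestId.equals(requestId)) {
                matches.add(entry);
            }
        }
        return matches;
    }
}
```

Reading the entries for one request in insertion order shows where the chain failed and which of the earlier commands were reverted.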

Command Chain Code Sample
public class ReversibleCommandChain implements CommandChain {

    private final List<AtomicCommand<?>> commands = new ArrayList<>();

    // maintain the commands to be reverted in a last-in, first-out queue (a stack) so that the revert happens in reverse order
    private final Deque<AtomicCommand<?>> toBeReverted = new ArrayDeque<>();

    public <T extends Context, C extends Command<T>> void registerInContext(final C command, final T context) {
        commands.add(new AtomicCommand<>((ReversibleCommand<T>) command, context));
    }
    
    // Executes all of the commands in the chain. If failures occur, revert the previously complete commands.
    public <T extends Context> void execute(final T context) {
        
        CommandResult result = null;
        for (final AtomicCommand<?> command : commands) {

            result = command.executeInContext();
            if (result.getState() == State.FAILED) {
               
                final CommandResult reversalResult = revert();
                //throw an exception after revert to stop the remaining commands
                throw commandExecutionException;

            } else if (result.getState() == State.COMPLETE) {
                toBeReverted.push(command);
            }
        }
    }

    // Reverts the commands from the toBeReverted queue.
    private CommandResult revert() {

        final CommandResult result = new CommandResult(State.REVERTED);
        while (!toBeReverted.isEmpty()) {
            final AtomicCommand<?> command = toBeReverted.pop();
            
            final CommandResult commandResult = command.revertInContext();
            //any additional revert steps go here
        }

        return result;
    }
}
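
The sample above depends on types that are not shown in this post, so it does not compile on its own. To watch the rollback happen end to end, here is a stripped-down, self-contained sketch of the same idea. Step, MiniChain and MiniChainDemo are hypothetical names, the context is reduced to a plain list of events, and a failure is signalled by an exception instead of a CommandResult.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified, hypothetical command shape: the context is just an event list.
interface Step {
    void execute(List<String> events);
    void revert(List<String> events);
}

final class MiniChain {
    private final List<Step> steps = new ArrayList<>();
    private final Deque<Step> toBeReverted = new ArrayDeque<>();

    void register(Step step) {
        steps.add(step);
    }

    // Runs the steps in order. On the first failure, reverts the completed
    // steps in reverse order and rethrows to stop the remaining steps.
    void execute(List<String> events) {
        for (Step step : steps) {
            try {
                step.execute(events);
                toBeReverted.push(step);
            } catch (RuntimeException failure) {
                while (!toBeReverted.isEmpty()) {
                    toBeReverted.pop().revert(events);
                }
                throw failure;
            }
        }
    }
}

final class MiniChainDemo {
    // Builds a step that records its actions, or fails on execute.
    static Step named(final String name, final boolean fails) {
        return new Step() {
            @Override public void execute(List<String> events) {
                if (fails) {
                    throw new IllegalStateException(name + " failed");
                }
                events.add("execute " + name);
            }
            @Override public void revert(List<String> events) {
                events.add("revert " + name);
            }
        };
    }

    // Registers three steps, the last of which fails, and returns the events.
    static List<String> run() {
        List<String> events = new ArrayList<>();
        MiniChain chain = new MiniChain();
        chain.register(named("updateInventoryBefore", false));
        chain.register(named("callExternalVendor1", false));
        chain.register(named("callExternalVendor2", true));
        try {
            chain.execute(events);
        } catch (IllegalStateException expected) {
            events.add("chain failed");
        }
        return events;
    }
}
```

Calling MiniChainDemo.run() yields the two executes, then the revert of callExternalVendor1, then the revert of updateInventoryBefore, and finally "chain failed", which is the reverse-order cleanup described above.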

Conclusion

In this post, we have explored how we manage the consistency of our infrastructure resources, and how the Command Chain pattern has helped us manage this. We have been able to reduce the amount of time spent on support and have eliminated the need for manual intervention in our production environment. Now, most partial failures are cleaned up automatically, and clients can re-execute the requests after the root cause is fixed.

This pattern can be improved upon, for example by adding the ability to replay the entire chain of commands automatically, without asking the client to retry the order. This would require persisting the Context of each Command in the Command Log so that the exact same inputs can be replayed. We hope that you have found this post helpful, and perhaps it will inspire you to implement the Replayable Command Chain. If you would like to learn more about opportunities at Goldman Sachs, we invite you to explore our careers page.


See https://www.gs.com/disclaimer/global_email for important risk disclosures, conflicts of interest, and other terms and conditions relating to this blog and your reliance on information contained in it.

© 2024 Goldman Sachs. All rights reserved.