January 12, 2022

Infrastructure and the Command Chain Pattern

Eugeniu Zaharia, Vice President, Storage Engineering

The Challenge

At Goldman Sachs, we manage a large private cloud that includes compute, storage, networking, and higher-level infrastructure services such as relational and NoSQL databases. The clients of this private cloud are the internal engineering teams at Goldman Sachs that build services for the business. They expect infrastructure on demand, so the provisioning of resources is underpinned by APIs that are accessible directly or via higher-level configuration constructs.

To provision these resources, the teams managing this infrastructure often have to execute read/write operations against multiple external vendor resources in a transactional manner. The operations need to either succeed together or be reverted: if there are errors, we need to leave the state the way we found it.

The external APIs we interact with are generally robust and stable, but we do encounter issues that result in partial failures and, therefore, an inconsistent state of our infrastructure. For example, when we provision storage on an array and update the inventory records at the same time, we want both operations to either succeed or fail together, so that the array and the inventory stay in sync. If a partial failure occurs, we have to clean up, which means either reverting the successful operations or re-applying the failed ones. Partial failures are difficult to troubleshoot because of the complex resource topology. Before we automated this, every partial failure meant that our team had to troubleshoot the failed orders and roll back the resources by hand to keep the physical infrastructure and the internal inventory in sync. This was tedious and time-consuming.

The Solution

As most external resources are not XA-compliant, existing distributed transaction APIs cannot be used. We devised a solution called the Command Chain pattern, which blends concepts from the Chain of Responsibility and Command patterns. With it, we automated the cleanup of failures.

The Command Chain is used to manage transactionality against multiple external vendor resources. We implemented it in Java, and the sample code below is in Java as well.

Registering the Commands with the Chain

External resource operations are grouped into commands by implementing the Command interface. The commands are registered with the chain, together with their context, in the order in which they are to be executed.

CommandChain commandChain = new ReversibleCommandChain();

Context context = new SimpleContext();
        
commandChain.registerInContext(startOrder, context);
commandChain.registerInContext(updateInventoryBefore, context);
commandChain.registerInContext(callExternalVendor1, context);
commandChain.registerInContext(callExternalVendor2, context);
commandChain.registerInContext(updateInventoryAfter, context);      
commandChain.registerInContext(closeOrder, context);

commandChain.execute(context);

Command Context

The chain needs to keep track of the arguments needed to execute each command in order to be able to revert them if needed. The context is an abstraction that encapsulates all the arguments needed to execute and/or revert the commands.

private final List<AtomicCommand<?>> commands = new ArrayList<>();

public <T extends Context, C extends Command<T>> void registerInContext(final C command, final T context) {
    commands.add(new AtomicCommand<>((ReversibleCommand<T>) command, context));
}
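
As an illustration of what such a context might hold, here is a minimal sketch. The post does not show the Context interface or SimpleContext, so the map-backed design, the class name MapContext, and its put/get methods are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical context: a key/value bag shared by the commands in a chain.
// The real Context interface is not shown in this post.
final class MapContext {

    private final Map<String, Object> values = new HashMap<>();

    // Store an argument (or something execute() produced) under a name.
    void put(final String key, final Object value) {
        values.put(key, value);
    }

    // Read a value back as the requested type; empty if the key is absent.
    // Throws ClassCastException if the stored value has a different type.
    <V> Optional<V> get(final String key, final Class<V> type) {
        return Optional.ofNullable(values.get(key)).map(type::cast);
    }
}

class MapContextDemo {
    public static void main(final String[] args) {
        final MapContext context = new MapContext();
        context.put("requestId", "REQUEST_152");
        context.put("sizeGb", 500);
        // execute() can record what it created so that revert() knows what to undo.
        context.put("volumeId", "vol-001");
        System.out.println(context.get("volumeId", String.class).orElse("none"));
    }
}
```

The point of the sketch is that anything revert() needs, such as the identifier of a resource created by execute(), has to live in the context rather than in local variables.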

Reversible Commands

The execute and revert methods of a reversible command each have to be a single operation and be as simple as possible. More complex operations need to be split into several atomic commands. If a command can be retried, it implements the retry mechanism as part of its execute/revert methods.

public interface ReversibleCommand<TContext> extends Command<TContext> {

    // execute phase to move the estate forward
    void execute(TContext context);

    // revert phase to rollback the estate to the original state
    void revert(TContext context);

}
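
To make the single-operation rule and the retry remark concrete, here is a sketch of one reversible command with a bounded retry inside execute. Everything except the two-method shape of the interface is hypothetical: InventoryContext, FakeInventory, AddInventoryRecord, and the limit of three attempts are invented for illustration, and the interface is redeclared locally (without its Command parent) so that the example compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified local copy of the interface above so the sketch is self-contained.
interface ReversibleCommand<TContext> {
    void execute(TContext context);
    void revert(TContext context);
}

// Hypothetical context carrying the arguments for one inventory update.
final class InventoryContext {
    final String volumeId;
    final int sizeGb;

    InventoryContext(final String volumeId, final int sizeGb) {
        this.volumeId = volumeId;
        this.sizeGb = sizeGb;
    }
}

// Hypothetical in-memory stand-in for an external inventory service.
final class FakeInventory {
    final Map<String, Integer> records = new HashMap<>();
    int failuresToInject; // number of upcoming add() calls that will throw

    void add(final String id, final int sizeGb) {
        if (failuresToInject > 0) {
            failuresToInject--;
            throw new IllegalStateException("inventory unavailable");
        }
        records.put(id, sizeGb);
    }

    void remove(final String id) {
        records.remove(id);
    }
}

// One single-purpose command: execute adds a record, revert removes it.
final class AddInventoryRecord implements ReversibleCommand<InventoryContext> {

    private static final int MAX_ATTEMPTS = 3; // assumed retry limit

    private final FakeInventory inventory;

    AddInventoryRecord(final FakeInventory inventory) {
        this.inventory = inventory;
    }

    @Override
    public void execute(final InventoryContext context) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                inventory.add(context.volumeId, context.sizeGb);
                return;
            } catch (final RuntimeException e) {
                last = e; // try again; a real implementation would likely back off here
            }
        }
        throw last; // all attempts failed, so the chain should see a failure
    }

    @Override
    public void revert(final InventoryContext context) {
        // Removing an absent record is a no-op, so this revert is safe to repeat.
        inventory.remove(context.volumeId);
    }
}
```

Keeping revert idempotent, as in this sketch, is a design choice rather than something the post prescribes; it makes a repeated cleanup harmless.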

Command Execute

The Command Chain works via a list of commands. Each command implements an execute() method and, if the command is reversible, a revert() method. The chain executes the commands sequentially and adds each successful command to a stack that can later be used to bring the estate back to its original state.

private final List<AtomicCommand<?>> commands = new ArrayList<>();

private final Deque<AtomicCommand<?>> toBeReverted = new ArrayDeque<>();

public <T extends Context> void execute(final T context) {
        
   CommandResult result = null;
   for (final AtomicCommand<?> command : commands) {

       result = command.executeInContext();

       if (result.getState() == State.FAILED) {
               
           final CommandResult reversalResult = revert();
           .................
           throw commandExecutionException;

       } else if (result.getState() == State.COMPLETE) {
           toBeReverted.push(command);
       }
   }
}
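
The execute loop relies on AtomicCommand, CommandResult, and State, whose internals the post does not show. The following is one possible shape for them, written as a sketch: the constructor, the decision to translate runtime exceptions into State.FAILED, and the simplified ReversibleCommand declaration are assumptions, while the method names executeInContext, revertInContext, and getState come from the snippets in this post.

```java
// Result states used by the chain; the three names appear in the post's code.
enum State { COMPLETE, FAILED, REVERTED }

final class CommandResult {
    private final State state;

    CommandResult(final State state) {
        this.state = state;
    }

    State getState() {
        return state;
    }
}

// Simplified local copy of the interface so the sketch is self-contained.
interface ReversibleCommand<TContext> {
    void execute(TContext context);
    void revert(TContext context);
}

// Binds a command to the context it was registered with, so the chain can
// execute or revert it later without knowing the context type.
final class AtomicCommand<T> {

    private final ReversibleCommand<T> command;
    private final T context;

    AtomicCommand(final ReversibleCommand<T> command, final T context) {
        this.command = command;
        this.context = context;
    }

    // Assumption: a thrown runtime exception is reported as FAILED.
    CommandResult executeInContext() {
        try {
            command.execute(context);
            return new CommandResult(State.COMPLETE);
        } catch (final RuntimeException e) {
            return new CommandResult(State.FAILED);
        }
    }

    // Assumption: a revert that throws is also reported as FAILED.
    CommandResult revertInContext() {
        try {
            command.revert(context);
            return new CommandResult(State.REVERTED);
        } catch (final RuntimeException e) {
            return new CommandResult(State.FAILED);
        }
    }
}
```

Under these assumptions, the chain only ever inspects a CommandResult and never has to catch exceptions from individual commands itself.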

Command Revert

When failures occur, the chain automatically reverts the commands that executed successfully: it pops them from the toBeReverted stack and invokes their revert() method, in reverse order of their execution.

private final Deque<AtomicCommand<?>> toBeReverted = new ArrayDeque<>();

private CommandResult revert() {

    final CommandResult result = new CommandResult(State.REVERTED);
    while (!toBeReverted.isEmpty()) {
       final AtomicCommand<?> command = toBeReverted.pop();
            
       final CommandResult commandResult = command.revertInContext();
       ............
    }

    return result;
}

If any of the revert operations fails, the Command Chain stops and an alert is raised for someone to investigate and troubleshoot the errors. The Command Log (see below) is very useful in such cases.
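
The stop-on-failed-revert behaviour can be sketched as follows. This is not the chain's actual code: UndoStep, RevertOutcome, and StopOnFailureReverter are hypothetical names, and how the alert is raised is not described in the post, so the sketch only returns the information an alert would need.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical undo step: a name plus an action that may throw.
final class UndoStep {
    final String name;
    final Runnable action;

    UndoStep(final String name, final Runnable action) {
        this.name = name;
        this.action = action;
    }
}

// What happened during the revert, in the order it happened.
final class RevertOutcome {
    final List<String> reverted = new ArrayList<>();
    final List<String> notReverted = new ArrayList<>();

    boolean isClean() {
        return notReverted.isEmpty();
    }
}

final class StopOnFailureReverter {

    // Pops steps in LIFO order. On the first failed revert it stops running
    // further reverts and reports every step that was left un-reverted.
    static RevertOutcome revertAll(final Deque<UndoStep> toBeReverted) {
        final RevertOutcome outcome = new RevertOutcome();
        while (!toBeReverted.isEmpty()) {
            final UndoStep step = toBeReverted.pop();
            try {
                step.action.run();
                outcome.reverted.add(step.name);
            } catch (final RuntimeException e) {
                outcome.notReverted.add(step.name);
                // Drain the stack without running anything: these steps are
                // left for whoever investigates the alert.
                while (!toBeReverted.isEmpty()) {
                    outcome.notReverted.add(toBeReverted.pop().name);
                }
            }
        }
        return outcome;
    }
}
```

A caller would raise its alert when isClean() returns false, passing along the notReverted list so that the investigator knows which resources still need attention.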

Command Log

The execution status of each command is logged to a Command Log. This provides audit information and can be referred to during troubleshooting. It can be used to determine at which point in the chain the error occurred, and which commands were reverted and which were not.

select * from COMMAND_LOG where REQUEST_ID ='REQUEST_152'
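
The post does not show the schema of COMMAND_LOG beyond the REQUEST_ID column. As a rough in-memory analogue of the query above, the sketch below assumes that each entry records a request ID, a command name, a state, and a timestamp; those fields and the class names CommandLogEntry and InMemoryCommandLog are hypothetical.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical log entry; only the request ID is known from the query above.
final class CommandLogEntry {
    final String requestId;
    final String commandName;
    final String state;
    final Instant loggedAt;

    CommandLogEntry(final String requestId, final String commandName,
                    final String state, final Instant loggedAt) {
        this.requestId = requestId;
        this.commandName = commandName;
        this.state = state;
        this.loggedAt = loggedAt;
    }
}

// In-memory stand-in for the COMMAND_LOG table.
final class InMemoryCommandLog {

    private final List<CommandLogEntry> entries = new ArrayList<>();

    // Append one status change for a command that belongs to a request.
    void record(final String requestId, final String commandName, final String state) {
        entries.add(new CommandLogEntry(requestId, commandName, state, Instant.now()));
    }

    // Equivalent of filtering the table by REQUEST_ID, in insertion order.
    List<CommandLogEntry> findByRequestId(final String requestId) {
        return entries.stream()
                .filter(entry -> entry.requestId.equals(requestId))
                .collect(Collectors.toList());
    }
}
```

Reading the entries for one request in order shows where the chain failed and which commands were subsequently reverted.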

Command Chain Code Sample

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ReversibleCommandChain implements CommandChain {

    // Commands in registration order, each bound to its context.
    private final List<AtomicCommand<?>> commands = new ArrayList<>();

    // Successfully executed commands, kept as a stack (LIFO) so that the revert happens in reverse order.
    private final Deque<AtomicCommand<?>> toBeReverted = new ArrayDeque<>();

    // Registers a command with its context. The cast assumes every registered command is a ReversibleCommand.
    public <T extends Context, C extends Command<T>> void registerInContext(final C command, final T context) {
        commands.add(new AtomicCommand<>((ReversibleCommand<T>) command, context));
    }

    // Executes all of the commands in the chain. If a command fails, reverts the previously completed commands.
    public <T extends Context> void execute(final T context) {

        CommandResult result = null;
        for (final AtomicCommand<?> command : commands) {

            result = command.executeInContext();
            if (result.getState() == State.FAILED) {

                final CommandResult reversalResult = revert();
                // Throw an exception after the revert to stop the remaining commands.
                // The construction of commandExecutionException is not shown in this sample.
                throw commandExecutionException;

            } else if (result.getState() == State.COMPLETE) {
                toBeReverted.push(command);
            }
        }
    }

    // Reverts the commands on the toBeReverted stack, most recently executed first.
    private CommandResult revert() {

        final CommandResult result = new CommandResult(State.REVERTED);
        while (!toBeReverted.isEmpty()) {
            final AtomicCommand<?> command = toBeReverted.pop();

            // Result of reverting this single command.
            final CommandResult commandResult = command.revertInContext();
            // any additional revert steps go here
        }

        return result;
    }
}
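
To see the execute-then-revert ordering in action, here is a small self-contained sketch that mimics the chain with a recording step. It is a simplified stand-in, not the class above: RecordingStep and MiniChain are hypothetical, steps report failure through a boolean rather than a CommandResult, and there is no context.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical step used only for this demo: it appends to a shared event list.
final class RecordingStep {
    final String name;
    final boolean failOnExecute;
    final List<String> events;

    RecordingStep(final String name, final boolean failOnExecute, final List<String> events) {
        this.name = name;
        this.failOnExecute = failOnExecute;
        this.events = events;
    }

    // Returns false to signal a failed execution.
    boolean execute() {
        if (failOnExecute) {
            events.add("fail " + name);
            return false;
        }
        events.add("execute " + name);
        return true;
    }

    void revert() {
        events.add("revert " + name);
    }
}

final class MiniChain {
    private final List<RecordingStep> steps = new ArrayList<>();
    private final Deque<RecordingStep> toBeReverted = new ArrayDeque<>();

    void register(final RecordingStep step) {
        steps.add(step);
    }

    // Returns true if every step succeeded; otherwise reverts the completed
    // steps in LIFO order and skips the steps that have not run yet.
    boolean execute() {
        for (final RecordingStep step : steps) {
            if (step.execute()) {
                toBeReverted.push(step);
            } else {
                while (!toBeReverted.isEmpty()) {
                    toBeReverted.pop().revert();
                }
                return false;
            }
        }
        return true;
    }
}

class MiniChainDemo {
    public static void main(final String[] args) {
        final List<String> events = new ArrayList<>();
        final MiniChain chain = new MiniChain();
        chain.register(new RecordingStep("updateInventoryBefore", false, events));
        chain.register(new RecordingStep("callExternalVendor1", false, events));
        chain.register(new RecordingStep("callExternalVendor2", true, events));
        chain.register(new RecordingStep("updateInventoryAfter", false, events));
        final boolean succeeded = chain.execute();
        System.out.println(succeeded + " " + events);
    }
}
```

Running the demo prints `false [execute updateInventoryBefore, execute callExternalVendor1, fail callExternalVendor2, revert callExternalVendor1, revert updateInventoryBefore]`: the failed step is not reverted, the step after it never runs, and the completed steps are undone most recent first.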

Conclusion

In this post, we have explored how we manage the consistency of our infrastructure resources, and how the Command Chain pattern has helped us do so. We have been able to reduce the amount of time spent on support and have eliminated the need for manual intervention in our production environment. Now, most partial failures are cleaned up automatically, and clients can re-execute the requests after the root cause is fixed.

This pattern can be improved upon, for example by adding the ability to replay the entire chain of commands automatically, without asking the client to retry the orders. This would require persisting the Context of each Command in the Command Log in order to replay the exact same inputs. We hope that you have found this post helpful, and that it may inspire you to implement a Replayable Command Chain. If you would like to learn more about opportunities at Goldman Sachs, we invite you to explore our careers page.

