December 8, 2021

Delivering Dynamic and Engaging Machine Learning-Driven Investment Research Experiences using Amazon SageMaker

Fan Liu, Pravallika Etoori, Henna Dialani, and Alex Woolley - GIR Engineering

The Global Investment Research (GIR) Engineering Team enables Research Analysts to author, publish, and distribute research content to institutional and certain other firm clients through GIR’s Research Portal. GIR publishes original research content hundreds of times a day, with publications providing fundamental analysis and insights on companies, industries, economies, and financial instruments for clients in the equity, fixed income, currency, and commodities markets. Our goal is to provide the best experience for clients across email, mobile, on our Research Portal, and through our Research APIs. 

GIR editors manually curate website pages and rank reports to highlight the best new content for broad client segments, such as European equity clients, or on a specific topic, such as cryptocurrency. However, it is impossible to manually curate the combination of interests relevant to each client. This forces clients to either attempt to create and maintain their own set of filters or to potentially miss content that is relevant to them amid the constant flow of new material.

To address the need for personalization that scales and adjusts as client interests change, we have leveraged the power of Machine Learning (ML). ML allows us to comb through billions of data points to provide personalization at an individual level that would never be possible with human curation. Based on a combination of client-identified interests, client usage, and the metadata and content of our publications, we are able to recommend reports clients might otherwise have missed and to suggest additional topics they might wish to follow. With this ability, we are able to create the best possible experience for each client.

ML has enabled us to release three significant features on the Research Portal over the past year: Recommended Reports, Related Reading, and Recommended Follows. 

Recommended Reports

Problem: One of the main purposes of any content site’s homepage is to showcase the most notable new pieces of content. Editorial curation ranks and organizes the material to sort the signal from the noise, but manual curation can only go so far. A currency trader based in the UK will have a different interpretation of “most notable” than an equity investor in Japan.

To combine curation with the specific interests of each client, we placed a component of recommended content powered by our ML model on our homepage, as well as other key parts of the Research Portal. This ensures that content which may be of particular relevance to the client is exposed side by side with the content selected by editors and trending among the wider audience.

Clients of GIR have access to a wealth of content covering a huge number of companies, industries, economies, and financial instruments. Depending on their area of focus and investment strategy, many clients have specific areas of interest. As GIR’s range of content expands to cover an increasingly diverse client base, it is important that we continue to connect clients to content they value. Our Recommended Reports model enables us to highlight impactful content personalized to an individual client.

We use supervised learning to automatically recommend reports to our clients using prior readership and document metadata as inputs.

To build Recommended Reports, we transform historical readership information into vectors representing readership patterns, and derive relevant document features from the metadata we store for every report. These inputs feed a supervised learning model. For Recommended Reports, we use XGBoost, a decision tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework.
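As a rough illustration of this transformation, the sketch below turns readership history and report metadata into labeled feature rows of the kind a gradient-boosted tree model could be trained on. The tags, the three features, and all function names are hypothetical and chosen only for illustration; they are not the production features.

```python
from collections import Counter

# Hypothetical report metadata: report id -> set of metadata tags.
REPORT_TAGS = {
    "r1": {"equity", "europe", "banks"},
    "r2": {"fx", "uk"},
    "r3": {"equity", "japan", "autos"},
    "r4": {"equity", "europe", "autos"},
}

# Hypothetical readership history: user id -> reports already read.
READ_HISTORY = {
    "alice": ["r1", "r3"],
    "bob": ["r2"],
}

def user_profile(user, exclude=None):
    """Count tags across the reports a user has read, skipping `exclude`."""
    profile = Counter()
    for report in READ_HISTORY.get(user, []):
        if report != exclude:
            profile.update(REPORT_TAGS[report])
    return profile

def feature_row(user, report):
    """Build one (user, report) feature vector from readership and metadata."""
    # Leave the candidate report out of the profile so the label does not
    # leak into the features.
    profile = user_profile(user, exclude=report)
    tags = REPORT_TAGS[report]
    tag_reads = sum(profile.values())
    matched = sum(profile[tag] for tag in tags)
    other_reads = [r for r in READ_HISTORY.get(user, []) if r != report]
    return [
        matched / tag_reads if tag_reads else 0.0,  # share of tag history matched
        len(tags & set(profile)) / len(tags),       # share of report tags seen before
        float(len(other_reads)),                    # how active the user is
    ]

def training_set():
    """Label each (user, report) pair 1 if the user read the report, else 0."""
    rows, labels = [], []
    for user, read in READ_HISTORY.items():
        for report in REPORT_TAGS:
            rows.append(feature_row(user, report))
            labels.append(1 if report in read else 0)
    return rows, labels
```

The resulting `rows` and `labels` have the shape expected by a classifier's `fit(X, y)` method, for example `xgboost.XGBClassifier`, which is left out here to keep the sketch dependency-free.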

To train and deploy this model, we leverage a number of AWS Services, most notably:

  • Amazon SageMaker is a managed service for preparing, building, training, and deploying machine learning models.
  • AWS Lambda provides serverless compute, enabling functions to be run without provisioning or managing servers.
  • AWS Step Functions is a workflow service for orchestrating and automating processes.

In our architecture, an AWS Lambda function generates a machine learning training workflow as an AWS Step Functions state machine. An Amazon CloudWatch alarm triggers this state machine on a regular basis to retrain the model on new data and expose the newly trained model through our API endpoint.

In this Step Function workflow:

  • We fetch training data from Snowflake, our data warehouse, and store it in an Amazon S3 bucket.
  • We trigger the training job, which reads the training data and performs data preprocessing and feature engineering before training the XGBoost model.
  • Once the model is trained successfully, it is saved to SageMaker, which provides an inference endpoint invoked by our component on page load.
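The steps above can be sketched as an Amazon States Language definition built in Python. This is a simplified illustration, not our actual workflow: the state names, the `train` channel, the instance type, and the function parameters are assumptions, and the endpoint creation or update steps are omitted. It also assumes execution names are valid SageMaker job and model names.

```python
import json

def build_training_workflow(bucket, role_arn, image_uri, fetch_lambda_arn):
    """Return a state machine definition for fetch -> train -> save model."""
    return {
        "Comment": "Retrain the recommendation model (illustrative sketch)",
        "StartAt": "FetchTrainingData",
        "States": {
            "FetchTrainingData": {
                # Hypothetical Lambda that copies training data to S3.
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": fetch_lambda_arn,
                    "Payload": {"target_bucket": bucket},
                },
                "Next": "TrainModel",
            },
            "TrainModel": {
                # The .sync suffix makes the workflow wait for the job to finish.
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
                "Parameters": {
                    "TrainingJobName.$": "$$.Execution.Name",
                    "RoleArn": role_arn,
                    "AlgorithmSpecification": {
                        "TrainingImage": image_uri,
                        "TrainingInputMode": "File",
                    },
                    "InputDataConfig": [{
                        "ChannelName": "train",
                        "DataSource": {"S3DataSource": {
                            "S3DataType": "S3Prefix",
                            "S3Uri": f"s3://{bucket}/train/",
                        }},
                    }],
                    "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/models/"},
                    "ResourceConfig": {
                        "InstanceCount": 1,
                        "InstanceType": "ml.m5.xlarge",
                        "VolumeSizeInGB": 30,
                    },
                    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
                },
                "Next": "SaveModel",
            },
            "SaveModel": {
                # Registers the trained artifact as a SageMaker model.
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createModel",
                "Parameters": {
                    "ModelName.$": "$$.Execution.Name",
                    "ExecutionRoleArn": role_arn,
                    "PrimaryContainer": {
                        "Image": image_uri,
                        "ModelDataUrl.$": "$.ModelArtifacts.S3ModelArtifacts",
                    },
                },
                "End": True,
            },
        },
    }

def state_order(definition):
    """Walk the definition from StartAt and return the state names in order."""
    order, name = [], definition["StartAt"]
    while name is not None:
        order.append(name)
        name = definition["States"][name].get("Next")
    return order
```

Calling `json.dumps(build_training_workflow(...))` yields the JSON document that Step Functions accepts as a state machine definition.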

The following diagram illustrates our end-to-end architecture in AWS: 

Measuring success: We measured the click-through rate of the Recommended Reports component powered by our ML model, comparing it to the click-through rate of components placed in similar positions on our pages.
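For readers unfamiliar with the metric, the comparison amounts to the arithmetic sketched below. The function names are ours and the numbers in the usage note are invented for illustration; they are not results from the Research Portal.

```python
def click_through_rate(clicks, impressions):
    """Clicks divided by impressions; 0.0 when the component was never shown."""
    if impressions == 0:
        return 0.0
    return clicks / impressions

def relative_lift(treatment_ctr, baseline_ctr):
    """Relative improvement of the treatment CTR over the baseline CTR."""
    if baseline_ctr == 0:
        raise ValueError("baseline CTR must be non-zero to compute a lift")
    return (treatment_ctr - baseline_ctr) / baseline_ctr
```

With made-up counts of 120 clicks in 2,000 impressions for the ML component and 80 clicks in 2,000 impressions for a comparable component, the click-through rates are 6% and 4%, a relative lift of roughly 50%.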

Related Reading

Problem: After reading a report, clients have no obvious next step in their user journey despite having access to years of historical research. Research Portal clients often access content directly through email and push notifications, where this limited user journey is particularly apparent. Editors curate relevant content links alongside a number of high-traffic reports, but this cannot be scaled to cover every report. 

To more broadly enhance our user journey, we built a Related Reading component, which uses ML to identify and surface related content among the several hundred thousand reports available on the Research Portal.

Our Related Reading component is powered by ensemble learning, consisting of collaborative filtering (CF) and universal sentence encoder (USE) models.

CF makes automatic predictions based on users' historical readership, allowing us to use behavioral patterns to suggest related content.

Meanwhile, the USE model is pre-trained on a large corpus and converts documents into vector embeddings. The USE model determines document similarity based on the content of reports, where we generate vector embeddings and compute vector similarities across all available documents. USE offsets the cold-start problem where a newly published document would otherwise not be suggested soon enough for reading.
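The similarity step can be illustrated with cosine similarity, a common choice for comparing sentence embeddings; the source does not state which similarity measure is used in production, so treat this as an assumption. The document ids and the three-dimensional toy vectors below are hypothetical stand-ins for real USE embeddings, which have 512 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(doc_id, embeddings, top_k=2):
    """Rank every other document by similarity to `doc_id`."""
    query = embeddings[doc_id]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != doc_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors only; "new-report" has just been published and has no readership.
EMBEDDINGS = {
    "oil-outlook": [0.9, 0.1, 0.0],
    "opec-preview": [0.8, 0.2, 0.1],
    "jpy-weekly": [0.0, 0.2, 0.9],
    "new-report": [0.85, 0.15, 0.05],
}
```

Because the ranking depends only on content, `most_similar("new-report", EMBEDDINGS)` can return neighbors for a report that nobody has read yet, which is how a content model sidesteps the cold-start problem.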

Two Step Function workflows orchestrate our ensemble model:

1. A SageMaker batch transform job generates USE embedding vectors for newly published documents every hour. Then, a processing job constructs a document-embeddings matrix to enable fast similarity computations at runtime.

2. Meanwhile, for fast CF, we use an AWS Lambda function together with a SageMaker training job to pull our readership stats more frequently, construct a document-user matrix, and apply singular value decomposition to it.
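To make the second workflow more concrete, here is a dependency-free sketch of a truncated singular value decomposition of a small document-user matrix. It extracts the leading singular directions by power iteration with deflation on the Gram matrix; a production job would more likely call a library routine such as `numpy.linalg.svd`. The readership matrix and all function names are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def dominant_eigenpair(sym, iterations=300):
    """Power iteration on a symmetric positive semi-definite matrix."""
    v = [float(i + 1) for i in range(len(sym))]  # fixed start: deterministic demo
    for _ in range(iterations):
        w = matvec(sym, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            return 0.0, [0.0] * len(sym)  # nothing left to extract
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(sym, v)))
    return eigenvalue, v

def document_factors(doc_user, rank):
    """Rank-`rank` document factors (U times Sigma) of the document-user matrix."""
    n = len(doc_user)
    # Gram matrix A A^T: its eigenvectors are the left singular vectors of A,
    # and its eigenvalues are the squared singular values.
    gram = [[sum(a * b for a, b in zip(doc_user[i], doc_user[j]))
             for j in range(n)] for i in range(n)]
    factors = [[] for _ in range(n)]
    for _ in range(rank):
        eigenvalue, u = dominant_eigenpair(gram)
        sigma = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            factors[i].append(sigma * u[i])
        # Deflate so the next pass finds the next singular direction.
        gram = [[gram[i][j] - eigenvalue * u[i] * u[j] for j in range(n)]
                for i in range(n)]
    return factors

def cosine(a, b):
    """Cosine similarity between two latent factor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

# Hypothetical readership: rows are documents, columns are users (1 = read).
DOC_USER = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
]
FACTORS = document_factors(DOC_USER, rank=2)
```

In this toy data the first two documents share most of their readers, so their latent factors point in nearly the same direction, while the first and third documents, which share no readers, end up far apart.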

The combination of both techniques enables us to serve fresh and relevant related readings for a user on a specific report. We combine the results of both at runtime using a SageMaker inference endpoint.
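One simple way to combine two sets of scores at request time is a weighted average with a content-only fallback for documents that have no readership yet. The post does not describe the actual combination rule, so the weighting scheme, the `cf_weight` value, and the function names below are assumptions made purely for illustration.

```python
def blend_scores(cf_scores, content_scores, cf_weight=0.6):
    """Blend CF and content scores; use content alone when CF has no score."""
    blended = {}
    for doc in set(cf_scores) | set(content_scores):
        content = content_scores.get(doc, 0.0)
        if doc in cf_scores:
            blended[doc] = cf_weight * cf_scores[doc] + (1 - cf_weight) * content
        else:
            # Cold start: a newly published document has no readership signal.
            blended[doc] = content
    return blended

def top_related(cf_scores, content_scores, k=3, cf_weight=0.6):
    """Return the k best document ids, breaking ties alphabetically."""
    blended = blend_scores(cf_scores, content_scores, cf_weight)
    return sorted(blended, key=lambda doc: (-blended[doc], doc))[:k]
```

For example, with CF scores `{"a": 0.9, "b": 0.2}` and content scores `{"a": 0.5, "b": 0.4, "new": 0.8}`, the unread document `"new"` still ranks first because its content score is used directly.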

The detailed component architecture of the solution is presented in the following diagram:

Measuring success: Where relevant, we measure the click-through rate of the Related Reading component powered by our ML model against manually crafted components from our global team of curators. We also measure the length of the user journey over time.

Recommended Follows

Problem: A popular feature of the Research Portal is the ability to follow authors, publications, companies, and more. With more than 1000 authors, and reports covering over 3500 companies and 450 different industries, it's challenging for clients to identify additional content they should follow and for which they want to be notified of new publications. 

To surface relevant tags to the clients and improve client engagement through follows, we built a personalized follow recommendation engine.

We train a collaborative filtering (CF) model for this use case. Using a trained Collaborative Filtering model, we can recommend items that a user might like on the basis of prior interactions and identified interests. The model is retrained twice a day to surface fresh recommendations to the clients.
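Collaborative filtering comes in many variants, and the post does not say which one powers Recommended Follows. As one minimal illustration, the sketch below uses a user-based neighborhood approach: clients with overlapping follows are weighted by Jaccard similarity, and the tags they follow that the target client does not are scored accordingly. The client ids, tag names, and function name are hypothetical.

```python
from collections import Counter

# Hypothetical follow data: client id -> set of followed tags.
FOLLOWS = {
    "u1": {"author:A", "industry:Autos"},
    "u2": {"author:A", "industry:Autos", "company:X"},
    "u3": {"author:A", "company:X"},
    "u4": {"industry:Banks"},
}

def recommend_follows(user, follows, top_k=3):
    """Suggest tags followed by similar clients that `user` does not follow yet."""
    mine = follows[user]
    scores = Counter()
    for other, theirs in follows.items():
        if other == user:
            continue
        overlap = len(mine & theirs)
        if overlap == 0:
            continue  # no shared interests, so no evidence to borrow from
        weight = overlap / len(mine | theirs)  # Jaccard similarity
        for tag in theirs - mine:
            scores[tag] += weight
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:top_k]]
```

Here `recommend_follows("u1", FOLLOWS)` suggests `company:X`, because both clients who share follows with `u1` also follow it, while `u4`, who shares nothing with anyone, receives no suggestions.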

For model training and deployment, we follow the same architecture as mentioned above in the report recommendations use case. 

Here is a simplified diagram that shows the flow of a user request to our SageMaker endpoint; requests are made to a client application hosted by ECS, which makes a call directly to the API Gateway fronting the SageMaker endpoint:

Measuring success: We measure individual follows and unfollows within our Recommended Follows component.

Challenges building GIR's Machine Learning Capabilities

Our team of developers was ambitious in that we wanted to do ML fast whilst also creating a reusable blueprint for future models. 

Automating Machine Learning Pipelines

Amazon SageMaker helped us a lot with the infrastructure heavy lifting; however, we still had to ensure our data pipelines and machine learning jobs were resilient and secure.

As the first team using SageMaker in Goldman Sachs, we spent hours setting up resources using Terraform. We had to set up our VPC and IAM roles appropriately to ensure connectivity to our Snowflake data warehouse and to unlock access to all the offerings and integrations that AWS, and specifically Amazon SageMaker, had to offer.

We use a Python Lambda with the AWS Step Functions Data Science SDK to create SageMaker resources from training to inference. This blueprint has allowed us to stand up subsequent models consistently and quickly. CloudWatch triggers fetch data from Snowflake and Adobe Analytics at a regular cadence, which then becomes available to all of our models. We also trigger automated retraining using CloudWatch alarms.

When working on a new model, we follow this SDLC architecture with some additional steps if required.

Performance

The use cases for our recommendation engine directly impact the Research Portal client experience. Therefore, we needed highly performant and reliable endpoints.

The first few iterations of our Recommended Reports model were painfully slow - recommendation responses took on average 8 seconds. Under load tests, our endpoint latency spiked further, with response times reaching 60 seconds. Adding extra compute made no difference, so we revisited every part of our model.

Through analysis, done with the help of Goldman Sachs' Core ML team, we found that our inference input data processing and feature engineering were the bottleneck. We moved our feature engineering step to application startup, pre-generating features for each document and user in bulk; this pre-calculation removed the bottleneck. By prioritizing real-time inference latency over startup time, we massively reduced our response times from ~8 seconds to a cool sub-100ms.
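The trade-off described here, paying the feature engineering cost once at startup so that each request becomes a lookup, can be sketched as follows. The class, the stand-in feature functions, and the call log are hypothetical and exist only to show the pattern; they are not our inference code.

```python
CALL_LOG = []  # records every expensive computation, for demonstration only

def slow_user_features(user_id):
    """Stand-in for expensive per-user feature engineering (hypothetical)."""
    CALL_LOG.append(("user", user_id))
    return [float(len(user_id))]

def slow_doc_features(doc_id):
    """Stand-in for expensive per-document feature engineering (hypothetical)."""
    CALL_LOG.append(("doc", doc_id))
    return [float(len(doc_id)), 1.0]

class PrecomputedFeatures:
    """Compute all features once at startup; serve them by lookup afterwards."""

    def __init__(self, user_ids, doc_ids, compute_user, compute_doc):
        # The expensive work happens here, in bulk, before any request arrives.
        self._user = {user: compute_user(user) for user in user_ids}
        self._doc = {doc: compute_doc(doc) for doc in doc_ids}

    def model_input(self, user_id, doc_id):
        # At request time: two dictionary lookups and a list concatenation.
        return self._user[user_id] + self._doc[doc_id]

# Built once when the application starts.
STORE = PrecomputedFeatures(["alice", "bob"], ["r1"],
                            slow_user_features, slow_doc_features)
```

After startup, repeated calls to `STORE.model_input("alice", "r1")` never invoke the slow functions again, so request latency no longer depends on the cost of feature engineering.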

Conclusion

Machine Learning plays an increasingly important role in connecting GS clients to valuable and relevant Research insights. We are able to deliver dynamic, personalized experiences based on a combination of user interests and readership. Recommended Reports and Recommended Follows are prominent across our homepage and are constantly surfacing engaging, fresh content to clients. Related Reading ensures our clients can get the most value out of each piece of content available on the Research Portal.

We are continuing to build on these foundations, extending our recommendations to cover content outside of Research and integrating into more of the firm's client facing divisional portals. Augmented by a worldwide editorial team, our Recommendations features help us ensure that clients get the best experience whether through push notifications, emails or directly on the homepage of the Research Portal.

Join Us

We're hiring for several exciting opportunities! Click here for more information.


See https://www.gs.com/disclaimer/global_email for important risk disclosures, conflicts of interest, and other terms and conditions relating to this blog and your reliance on information contained in it.