Using Amazon SageMaker to Build a Machine Learning Platform with Just Three Engineers

Sept. 30, 2019

Previously, we wrote about why we built our Machine Learning Platform and the impact it had on our workflow. It helped us ship models to production in days instead of months. In this post, we’ll dive deeper into exactly how we leveraged Amazon SageMaker to build this platform – with just three engineers! My hope is that you come away with the key reasons behind what we built, so you can ask similar questions in your own unique context.

Key Takeaways

    • Less is more: Leveraging great services and open source software can help you build a complex platform with very few engineers.
    • SageMaker BYO: Amazon SageMaker’s Build-Your-Own container feature makes it an extremely extensible ML compute cloud.
    • ML ~= Engineering: ML projects can be modeled like just another engineering project – they can leverage a large amount of shared tooling.

Quick Recap

[Figure: ML Platform tech stack – GitHub, Jenkins, Flask, Airflow, SageMaker, and Artifactory]

That’s a bunch of things that need to exist for an ML Platform!

 

In this post, we’ll look at a few specific pieces in depth: our data scientists’ Development Environment, how that relates to the Remote Pipeline Execution with SageMaker, and how we leveraged features in the Build System to enable a human centric workflow at low cost. Before we go into any of that, let’s make sure we all have the same idea of what those words mean.

  • Development Environment refers to what the data scientist uses to write and test code – it could be their laptops or a persistent remote server that they control. In our case, this explicitly means their laptops – we model it this way so the feedback loop is as tight as possible, and our data scientists are super-productive.
  • Remote Execution means some abstracted-away notion of scalable compute resources. Data scientists neither control nor manage these resources – it is provided as a magical cloud that makes their things go faster and helps them parallelize beyond their laptop. In our case, this is specifically Amazon SageMaker.
  • Build System is tooling that packages our projects in certain standardized ways so that they can be shipped away to remote servers and be executed to provide deterministic results. Think of it as a “project packager”, something that takes our code and makes it a reproducible executable. For some people, this is a Makefile with standard commands to install dependencies; for us, this is a pipeline built on Jenkins that publishes immutable *.tar.gz artifacts. These artifacts contain our project and all the dependencies they need to run.

Guiding Principles

When we started building out the platform, we had a few guiding principles.

These are important because they tie directly to the fact that we had a small number of engineers (three) working on this project. Making the system easy would reduce support load as new data scientists ramp up. Leveraging existing solutions would reduce time-to-ship, as well as ongoing support load. It’s important to keep this in mind because this is the Why behind the experience we built out. We’ll also come back to these over the course of the article as we see how they influenced our decisions.

What does our ML Platform enable?

Typical Development Experience

First, let’s anchor our discussion by looking at exactly what the platform enables. We’ll focus on model training here; as we’ll see later, the same approach scales to things like analysis or batch-prediction. You’ve already seen part of this in the previous post – we’ll go into more detail this time around. Here’s the Data Scientist’s experience as a user of the platform.

Setup

When a data scientist wants to start a new pipeline, they first ask NerdWallet’s build/deploy tool – indy – to create one for them.

This sets up a repository locally and in the NerdWallet Engineering Organization. The repo’s structure comes from a boilerplate template built with the Cookiecutter project. This ensures things are set up in a manner that fits generally with our build tools, and specifically with the ML Platform build tools.

Apart from these top-level directories, there are also some source files that come with well-documented, self-explanatory interfaces that the ML Platform relies on:

Note that these source files are where the real action happens – everything inside these functions is the domain of the data scientist. They’ll write parsing, featurization, and training code specific to their pipeline. They may have as many utilities and library imports as they desire – it is just another normal python project.

The ML Platform only cares that these interfaces are exposed to it, and that is done via setuptools entry-points. Setuptools is the Python ecosystem’s packaging and distribution library, and it provides entry-points as a way for Python packages to expose named interfaces.
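As a concrete (and hypothetical) illustration of the mechanism – the group name and module paths below are assumptions, not NerdWallet’s actual identifiers – a pipeline might declare its hooks like this, and the platform can then resolve them by name:

```python
from importlib.metadata import EntryPoint

# Hypothetical illustration -- the group name "nwml.interfaces" and the
# module/function names are assumptions, not NerdWallet's identifiers.
# A pipeline's setup.py might declare something like:
#
#     entry_points={
#         "nwml.interfaces": [
#             "parse = my_pipeline.pipeline:parse",
#             "featurize = my_pipeline.pipeline:featurize",
#             "train = my_pipeline.pipeline:train",
#         ],
#     }
#
# and the platform can resolve a named hook to a callable without
# knowing anything about the pipeline's internals:
ep = EntryPoint(name="train",
                value="my_pipeline.pipeline:train",
                group="nwml.interfaces")

print(ep.module)  # my_pipeline.pipeline
print(ep.attr)    # train
# ep.load() would import the module and return the function itself.
```

The key point is the inversion of control: the platform never imports pipeline modules directly, it only looks up well-known hook names.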

Local Development

As we’ll see later, these interfaces allow the platform to delegate to and use pipelines pretty much like black boxes. For more on setuptools, entry-points, and how they allow developers to create extensible applications, check out the official docs.

Once the data scientist has gotten familiar with the structure and maybe written a bit of their specific code, they can take their model for a spin locally. After placing a small subset of data in their local data folder, they do:

Remote Development

The `nwml` command shown here is part of the ML Platform’s tooling. It provides useful and consistent functionality to read data from the standard location on disk, pipe it to the pipeline’s interfaces in the right order, and finally write the result back onto a standard disk location. The key piece here is that the data scientist can test their code end to end locally with a single command. This gives data scientists a very similar development experience to what many other engineers have – rapid feedback leading to productivity gains, without having to wait for remote servers to boot up.
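The flow described above – read from the standard location, delegate to the pipeline’s interfaces in order, write the result back – could be sketched very roughly like this (a toy illustration, not the actual `nwml` source; the interface names are assumptions):

```python
# Toy sketch of what an `nwml train` run conceptually does
# (not the actual nwml source; interface names are assumptions).
from pathlib import Path

def run_training(project_dir, parse, featurize, train):
    # Read every raw input file from the standard data/ location.
    data_dir = Path(project_dir) / "data"
    raw = [p.read_text() for p in sorted(data_dir.glob("*.csv"))]

    # Delegate to the pipeline's own interfaces, in order, treating
    # the pipeline as a black box.
    records = parse(raw)
    features = featurize(records)
    model = train(features)

    # Persist the artifact back to a standard output location on disk.
    out = Path(project_dir) / "output" / "model.txt"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(repr(model))
    return model
```

Because every run – local or remote – funnels through one conceptual path like this, the data scientist can exercise their pipeline end to end with a single command.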

The data scientist is now confident that their code works for their sample data and that everything is wired up correctly. They are ready to run an actual training job using the entire dataset, on more powerful compute instances. To do this, they can open up a Pull Request on Github with their changes:

That is it! Eventually, they will get a comment back from the ML Platform with a link to their training job, logs, and – if the training was successful – the trained model as well:

Note that there’s a link here to the AWS Console – this allows the Data Scientist to easily leverage SageMaker’s great UI to poke around.

If you’ve used SageMaker before, you’re probably wondering – “What about the details of the job, like what resource to use or where to get the data from?”. This is indeed required, and is controlled by the Data Scientist using a configuration file in their repo:
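Such a configuration might look roughly like the following hypothetical sketch (the keys and values are illustrative assumptions, not NerdWallet’s actual schema):

```yaml
# Hypothetical sketch of a pipeline's training configuration --
# keys and values are illustrative, not NerdWallet's actual schema.
training:
  instance_type: ml.p3.2xlarge      # which SageMaker resource to use
  instance_count: 1
  input_data: s3://example-bucket/datasets/my-pipeline/v3/   # versioned with the code
  max_runtime_seconds: 86400
```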


Human Centric Workflow

This configuration file lets the Data Scientist express their intent in config, not code. It also ensures that things like the input dataset (and other pipeline configurations) are versioned as part of the code. This is an inexpensive way to tie all these various details about a training job together.

This workflow is central to why we built the platform the way we did, and specifically why we used Amazon SageMaker the way we did. We like to think of this as human centric in much the same way that one of our inspirations does – Netflix’s QCon 2018 talk on Metaflow. Instead of thinking first about how best to design the system, we thought about how to make it as easy as possible for our data scientists to use.

Local training involves invoking our `nwml` tool, which delegates key pieces to the pipeline itself:

Remote training is exactly the same thing – except the data is transferred from S3 into the standard folder on disk, and the whole workflow executes on SageMaker:

This mental model is really simple – remote training is exactly what is running locally, except with the whole dataset. Simplicity is key because it helps debugging: data scientists own the entire workflow and are self-sufficient.

With this workflow, data scientists don’t need to learn about EC2 instances and IAM permissions – and yet they have full insight into what is happening in the cloud, and can debug issues. The exact same command being run by the remote server in production can be run by a data scientist locally. Any data science errors that happen remotely can be reproduced locally by using the same artifacts – the state of code, or the data being used.

In this way, data scientists have full ownership of the training code or the trained model in production, while ML Engineering has full ownership of the plumbing that gets data to and from the model. This solves for three of our guiding principles: it removes friction by meeting the data scientist at their console, and makes it easy to do the right thing (and difficult to do the wrong thing) by setting up standard projects and using only versioned, standard code archives for training.

But we had another principle – to reuse existing solutions where possible. To see how we applied that, it’s important to see another view of the ML Platform that’s hidden from the data scientist – the actual system that powers this workflow.

What is our ML Platform really?

What’s in an ML Platform?

The ML Platform is pretty much the workflow seen by the Data Scientist above – with a few systems tying things together. It actually looks like this:

  • Any time a data scientist pushes code up to an active PR, the standard Github webhook will trigger a build job. 
  • For ML Pipelines, the build job eventually publishes an image, and then pings a Flask service that orchestrates training jobs.
  • This in turn triggers an ML Platform training job via code running on Airflow, and that sets things up by inspecting the pipeline’s configuration before starting a SageMaker training job using the image that was just published.
  • Once SageMaker has got things wired up with the datasets, it calls the standard train entrypoint.
  • This train entrypoint is implemented by the ML Platform tooling, which sets up some symlinks and then calls `nwml train`, just like the data scientist does locally.
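The step that actually starts the SageMaker job could be sketched against the SageMaker API roughly as follows – the function shape, config keys, ARNs, and names are all placeholders, not NerdWallet’s actual orchestration code:

```python
# Rough sketch of kicking off a SageMaker training job with a custom
# image via boto3 (placeholders throughout; not NerdWallet's code).

def build_training_request(job_name, image_uri, config):
    """Assemble create_training_job parameters from a pipeline's config."""
    return {
        "TrainingJobName": job_name,
        # Point SageMaker at the image published to ECR; on startup it
        # runs that image's `train` command.
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": config["role_arn"],
        # SageMaker transfers this S3 prefix onto the instance first.
        "InputDataConfig": [{
            "ChannelName": "training",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": config["input_data"],
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": config["output_path"]},
        "ResourceConfig": {
            "InstanceType": config["instance_type"],
            "InstanceCount": config["instance_count"],
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
    }

def start_training_job(job_name, image_uri, config):
    import boto3  # AWS SDK, assumed available on the Airflow workers
    boto3.client("sagemaker").create_training_job(
        **build_training_request(job_name, image_uri, config))
```

Notice that everything here comes from the pipeline’s versioned configuration – the orchestrator just translates config into an API call.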

Amazon SageMaker

Amazon SageMaker is a “fully managed machine learning service” that provides features related to exploration, training, and inference. At NerdWallet, we already leverage a number of excellent AWS services in our stack, so this was an obvious candidate for us when we designed the remote execution part of the platform.

There are two broad ways in which SageMaker can be used for training – the pre-packaged algorithms and the custom build-your-own solution.

Built-In Algorithms

SageMaker ships a number of common machine learning algorithms that are optimized to run on its hardware. There are a bunch of these available for some common use-cases, and they are quick to use as well. Their limitation is that these algorithms assume a paradigm of “config, not code” and are fairly restrictive in how they expect the input data to be provided. The expectation is that the data would be transformed into exactly the format the algorithm wants, and transformed back once the algorithm finishes. Such a remote-first approach also makes it almost impossible to run code locally; hence, data scientists need to set up transform jobs for their pipeline even while they do exploratory work.

Caveats

  1. For exploratory data analysis and rapid prototyping, SageMaker also provides hosted Jupyter Notebooks. While this is something that works in many cases, we found that there were a few issues with leveraging internal libraries, and that the transient nature of notebooks did not work for us.
  2. For some deep learning algorithms, SageMaker also provides an option to ship custom code that will execute remotely on their deep learning containers with optimized libraries. This option was released after we built the system, and is more flexible than the code-as-config approach of the pre-built algorithms. However, it still suffers from the same problem of making it difficult to debug and run things locally.

Build Your Own Containers

SageMaker also allows us to build and publish our own Docker images to the Elastic Container Registry (ECR), which it can use when setting up the training job instead of a pre-built image. It does this via a simple black-box interface with the container: on startup, SageMaker runs a `train` command inside the container.

The advantage of this solution is full flexibility – any code the data scientist writes can potentially be run on SageMaker. This means ML Pipelines can use the exact same development tools and leverage the same internal libraries as the rest of our projects. Since Python is a major backend language at NerdWallet, this unlocks a lot of value.

The disadvantage of this setup is the complexity of having to build and publish images to ECR. Further, this approach doesn’t let us leverage the great out-of-the-box functionality of the pre-optimized algorithms that Amazon provides for free.

Caveats

  1. It is possible – as we shall see later – to package custom code on top of the pre-optimized images that Amazon provides. This is only marginally more complex than building images from something like a vanilla Ubuntu base image.

How we use Amazon SageMaker

As we saw earlier, we had a strong desire to make our ML Platform simple to use, and a major part of that was easy local development which paralleled remote execution. The ability to build and use our own images with SageMaker let us achieve this.

The way we do this is by requiring our tooling repo, `nwml`, to be installed by each pipeline as part of its project. When we run CI for the pipeline, a couple of things happen:

  • A python executable (PEX) file is created and published as a code-archive
  • A docker image is constructed by layering this PEX file on top of a standard base image, and published to our internal ECR

The tooling repo exposes a couple of console_scripts – these are also a kind of setuptools entry_points – namely `nwml` itself, and one just called `train`. `nwml` is the CLI we’ve already seen. `train` is also implemented in the `nwml` repo, but is only ever meant to be called inside SageMaker. This `train` entry-point is key to our abstraction here, and does the following:

  • Examines some environment variables and files placed in standard locations by SageMaker, as well as some training-job information passed to it via job tags
  • Creates a temporary directory that looks just like the data scientists’ directory locally by pulling down the published code archive
  • Sets up the `data/*` paths by making symlinks to the input files SageMaker has wired up.
  • Calls `nwml train` on the temporary directory that has just been set up, which mimics `nwml train` on the data scientists’ laptop.
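SageMaker’s convention is to place each input channel’s files under `/opt/ml/input/data/<channel>`, so the symlinking step above might look roughly like this simplified sketch (assumed paths and names; not the actual `nwml` source):

```python
# Simplified sketch of the container-side symlink setup (assumed paths
# and names; not the actual nwml source). SageMaker places each input
# channel's files under /opt/ml/input/data/<channel>; mirroring them
# into the project's standard data/ folder means `nwml train` behaves
# exactly as it does on a laptop.
import os
import subprocess
from pathlib import Path

SAGEMAKER_INPUT = Path("/opt/ml/input/data")

def link_channels(workdir, input_root=SAGEMAKER_INPUT):
    data_dir = Path(workdir) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    # One symlink per file that SageMaker has wired up, so the pipeline
    # sees the same layout it would see locally.
    for channel in input_root.iterdir():
        for src in channel.iterdir():
            link = data_dir / src.name
            if not link.exists():
                os.symlink(src, link)
    return data_dir

def main(workdir="/opt/ml/code"):
    link_channels(workdir)
    # The exact same command a data scientist runs locally.
    subprocess.run(["nwml", "train"], cwd=workdir, check=True)
```

After the symlinks are in place, the pipeline code can’t tell whether it is running on SageMaker or on a laptop – which is the whole point of the abstraction.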

Don’t Build – Buy (for free if possible)

This looks like a ton of special sauce, right? Turns out it isn’t really all that much. Here’s what we got for free from our existing build/deploy tooling:

  • Tooling to create a new project from a boilerplate, and to set up a repo with the right permissions
  • Automated CI checks, with the ability to build PEX files that expose console scripts
  • Ability to publish images that bundle these PEX files (and other parts of the code archive)

Furthermore, here’s what we got for free with SageMaker, instead of using just another general compute cloud like AWS Batch:

  • Ability to leverage GPU-enabled instances, with images that are optimized to work on said instances
  • Ability to transfer data from S3 into the training instances seamlessly, with value added features like streaming and sharding
  • Ability to monitor specific metrics of the training instance (like CPU / GPU utilization) as well as the ability to emit custom metrics that will be graphed
  • Access to a rich GUI that helps data scientists explore various other pieces of their training job
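The custom-metric feature mentioned above works by giving SageMaker regexes to scrape from the training logs; a hypothetical definition might look like this (the metric names and log format are illustrative assumptions):

```python
# Hypothetical sketch: SageMaker graphs custom metrics by scraping the
# training logs with regexes supplied in the job definition. The metric
# names and log format here are illustrative assumptions.
import re

METRIC_DEFINITIONS = [
    # Any log line like "epoch=3 loss=0.4213" feeds the train:loss graph.
    {"Name": "train:loss", "Regex": r"loss=([0-9.]+)"},
    {"Name": "validation:auc", "Regex": r"val_auc=([0-9.]+)"},
]

log_line = "epoch=3 loss=0.4213 val_auc=0.91"
for metric in METRIC_DEFINITIONS:
    match = re.search(metric["Regex"], log_line)
    print(metric["Name"], match.group(1))
# prints:
#   train:loss 0.4213
#   validation:auc 0.91
```

The pipeline just prints to stdout; the graphing comes for free.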

Finally, note that we essentially use Github comments as the first GUI for the data scientist – which absolves us from having to build an admin tool right away. Extending this, we could probably push notifications about the training job to Slack, buying even more time before we need to build and maintain a custom admin interface. Leveraging these great external – and internal – tools is how we applied our guiding principle and managed to build the platform with just 3 engineers.

Case Study: Using Optimized Training Images

A great way to see how this leverage helped is to look at how we enabled data scientists to use optimized Docker images to speed up their training jobs.

  • Amazon publishes Deep Learning Containers for SageMaker which “provide optimized environments with TensorFlow or MXNet, Nvidia CUDA (for GPU instances), and Intel MKL (for CPU instances) libraries”; these are available in their public ECR.
  • We worked with our Production Engineering team to whitelist and create a pipeline for the optimized Docker image, just like we have for vanilla Ubuntu images.
  • Using the image-building pipeline we spoke about earlier, we then switched out the base image from vanilla Ubuntu (which all our other projects use) to the newly minted optimized image. The PR opened with this change succeeded in training with the optimized libraries.

Note that the ML Platform team doesn’t build or maintain any of the images, nor the tech that builds the images. We’re reusing shared components available to all engineering teams at NerdWallet, or via Amazon.

Summary

In this post, we looked at a few specific pieces of our ML Platform in depth: our data scientists’ Development Environment, how it relates to the Remote Pipeline Execution with SageMaker, and how we leveraged features in our internal Build System to enable a human centric workflow at low cost. We started with a few guiding principles for the system, and how they translated into the development workflow. We further saw how these influenced our decision to leverage a specific feature of SageMaker which allowed us to use our own Docker images for training. Finally, we examined how modeling ML Pipelines as another flavor of engineering projects helps reuse shared tooling, which lets us unlock a lot of value at a low cost.

Key Takeaways: Revisited

    • Less is more: Leveraging great services and open source software can help you build what seems like a complex platform with very few engineers.
    • SageMaker BYO: Amazon SageMaker’s Build-Your-Own container feature makes it an extremely extensible ML compute cloud.
    • ML ~= Engineering: ML projects can be modeled like just another engineering project – they can leverage a large amount of shared tooling.

What’s Next?

We hope you enjoyed this in-depth look at how we used SageMaker to provide an easy, “human centric” workflow for our data scientists. This system is very much a work in progress – off the top of our heads, here’s what we’d like to make better:

  • Leverage SageMaker’s inference features to launch services that serve real-time inferences efficiently
  • Make it easy for data scientists to compare metrics across versions of their models, and across different models
  • Make it easy for data scientists to surface their models’ insights in the product itself, without bespoke plumbing and the associated friction

If you’d like to join us as we build out this system – and many other cool systems like it – or feel like you’d want to use this system to build some cool models, check out our careers page!