Building an end-to-end MLOps pipeline using AWS SageMaker
Dr. Heiko Kromer, Dr. Philipp Warmer
Introduction
Most readers of this blog post have heard the statement that doing data science is 90% data wrangling and only 10% model building. This might be true for establishing a Proof of Concept of the task at hand. However, real-world benefit only arises when deploying and continuously improving the machine learning solution in production. Machine Learning Operations (MLOps) aims to ensure robust and sustainable machine learning (ML) model behavior by leveraging a variety of interconnected processes. “ML” is only a small part of an ML system that runs in a production environment (Figure 1).
This blogpost will cover the following topics:
- Exploring the dataset — Here we perform exploratory data analysis using a Jupyter notebook in AWS SageMaker Studio.
- Building the model — Using AWS SageMaker to build an XGBoost model and deploy it to an endpoint.
- Putting MLOps in action — Building an end-to-end MLOps pipeline with AWS SageMaker Studio.
In this blogpost we will show how to achieve world-class MLOps using AWS SageMaker and apply it to an obfuscated dataset that contains measurements of a set of wind turbines. The dataset was kindly provided by WinJi. The goal of this blogpost is to show how you can use the rich feature set of AWS SageMaker to build a complete, end-to-end ML pipeline almost from scratch — all you need is data and some code. AWS SageMaker is a proprietary solution from AWS. If you are curious to learn how to run end-to-end MLOps using an open source stack, refer to this blogpost: Machine learning for production — introducing D ONE’s MLOps repository.
Here in this blogpost, we are not focusing heavily on the “pure” ML code, i.e., we are not going to tune the model’s hyperparameters to edge out the last percentages on a given metric. Instead, we want to show a holistic view of an end-to-end ML production system as outlined in Figure 1.
Part 1: Exploratory Data Analysis (EDA)
AWS SageMaker is the one-stop-shop from AWS to build, train, and deploy machine learning models. It natively integrates with the other fully managed infrastructure, tools, and workflows of AWS. To get the data into AWS and upload a notebook containing the exploratory data analysis please follow the steps showcased in this slide deck.
We start our journey into MLOps with AWS SageMaker by inspecting the dataset to understand its structure. We focus on two aspects:
- Data composition with regards to structure and quality
- Associations between the different attributes
The dataset contains the measurements on a set of wind turbines. An overview of the column headers in the dataset is shown below in Figure 2.
The entire EDA code can be found in the GitHub repository under this link.
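To give a flavor of those first steps, here is a minimal sketch of how the inspection could start (the bucket name follows the convention used later in this post; the file name and the availability of s3fs in the kernel are assumptions):

```python
import pandas as pd

# Load the raw wind turbine measurements from S3 (file name is an assumption).
df = pd.read_csv("s3://sagemaker-done-mlops/wind_turbine_data.csv")

# Data composition: column types, null counts, and basic statistics.
print(df.info())
print(df.describe())

# Associations between attributes: pairwise correlations of the numeric columns.
print(df.corr())
```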
Understanding the data is crucial for the success of any ML project. We encourage you to visit the EDA notebook, because the steps below assume the reader knows what the data is about, e.g., what the features represent and their value ranges. With solid information about the dataset, we have half of what we need to bring to AWS SageMaker to build our MLOps pipeline. The second piece is the code, which we will dive into in the next section.
Let’s get started and understand the architecture of the data science workflow that we follow in the steps below.
In the second part of this blogpost (Step 1 in Figure 4), we will set up an AWS SageMaker Studio Notebook to build an XGBoost model to detect when a wind turbine is in a faulty state. In the third part (Steps 2 to 6 in Figure 4) we will modify an AWS SageMaker Pipelines template to build an end-to-end MLOps workflow including preprocessing, training, and deployment. In more detail, we will go through these steps:
- Building an XGBoost model using a Jupyter Notebook in AWS SageMaker Studio to detect when a wind turbine is in a faulty state.
- Preprocessing the wind turbine dataset.
- Training an XGBoost model with selected hyperparameters.
- Evaluating model performance.
- Registering the model in the model registry if the metric computed in the evaluation step is above a threshold.
- Manually approving the model to automatically start the deployment pipeline.
Part 2: Building an XGBoost model using a Jupyter Notebook in AWS SageMaker Studio to detect when a wind turbine is in a faulty state
Part 2 of this blogpost is completely independent from part 3. You do not need to follow along or execute the code in part 2 for part 3 to work.
With AWS SageMaker Studio, we can run Jupyter notebooks in a JupyterLab-like environment. Such a notebook can include instructions to spin up preconfigured containers for preprocessing, training, and inference. We will preprocess the dataset, train an XGBoost model to detect wind turbine failure states, and deploy the model to an endpoint. We do all of this in a single Jupyter notebook that you can find in the repository that accompanies this blogpost.
To follow the steps in the notebook, you need to make sure to have the dataset readily available in an S3 bucket as described in the introduction part. You can pick any bucket name that you want, but you have to make sure that the bucket name begins with “sagemaker”. The reason for this is that the policy attached to the default role that SageMaker assumes only permits access to buckets whose names begin with the string “sagemaker”.
The bucket name that we are using throughout this blogpost is “sagemaker-done-mlops”.
Overview
We are going to use SageMaker Studio to launch a Jupyter Notebook instance. In this notebook, we define preprocessing, training, evaluation, and deployment steps (see Figure 5). For preprocessing, we use an SKlearn container that is provided by AWS. The raw input data as well as the preprocessed data is stored in the S3 bucket that we specified. A pre-built XGBoost container is spun up for training the model on the dataset that is stored in S3. We make sure to store model artifacts on S3 in the evaluation step. After we are satisfied with the evaluation of the model performance, we deploy the model to an endpoint for inference.
Each step in the notebook is extensively commented on. Here in the blogpost, we will focus on a bird’s eye view of the notebook. You are encouraged to download the notebook into AWS SageMaker and go through it yourself or read along in the repository.
As laid out in the introduction, our goal is to present a holistic view of an end-to-end MLOps workflow. Because the notebook itself contains extensive explanations, we are only expanding on some aspects of the code in the notebook:
2.1. Preprocessing
2.2. Training
2.3. Evaluation
2.4. Invoking the endpoint
As you might notice, you cannot copy and execute the code snippets we included in this blogpost on their own. We are showing these code snippets that are taken from the notebook to illustrate some of the steps and comment on what is happening behind the scenes.
2.1. Preprocessing
For preprocessing of the dataset, we apply the following steps:
- Filtering out low power values.
- Processing the target column by filtering on relevant error types and filling null values.
- Filling nulls in the feature columns.
- Selecting the relevant columns and making sure the first column of the dataset is the target column. The preconfigured XGBoost container from AWS requires the target to be in the first column of the dataset.
The preprocessing is initialized and started with this code:
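A minimal sketch of that cell, using the AWS-provided SKLearnProcessor; the instance type, framework version, and the RAW_DATA_PATH variable follow the text, while the remaining names, S3 prefixes, and argument values are illustrative:

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Fully managed SKLearn container for the preprocessing job.
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,                      # execution role defined in the global variables
    instance_type="ml.m4.xlarge",
    instance_count=1,
)

# Run prepare_data.py on the raw data and copy the resulting splits back to S3.
sklearn_processor.run(
    code="prepare_data.py",
    inputs=[ProcessingInput(source=RAW_DATA_PATH, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train",
                         destination=f"s3://{BUCKET_NAME}/preprocessed/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation",
                         destination=f"s3://{BUCKET_NAME}/preprocessed/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test",
                         destination=f"s3://{BUCKET_NAME}/preprocessed/test"),
    ],
    arguments=["--test-days", "30", "--validation-days", "30"],
)
```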
This code spins up an “ml.m4.xlarge” EC2 instance using the AWS-provided SKLearn processor with the framework version that we specified. This container is fully managed by AWS SageMaker. We could also define our own container from scratch, but the preconfigured one will do fine. You can find more information on the SKLearn processor here.
Note that we provided an execution role that we defined earlier in the global variables. We send the code instructions that we wrote in the preprocessing cell along with the input data location (in the variable RAW_DATA_PATH) and the output data location. The output of the preprocessing steps will be copied into the specified location in S3. We do this because we do not want the results of the preprocessing (namely the training, validation, and test datasets) to be stored only on the instance that SageMaker spins up. In case the instance becomes unavailable (or is terminated), we would lose the output and the work would have been done (and paid for) in vain. When running the preprocessing, we can provide the number of test days and the number of validation days as arguments.
When the cell is running, we can inspect the preprocessing in the AWS console in the SageMaker service (select Processing on the left):
When clicking on the job name, we can, e.g., explore settings, logs, instance metrics, and trace potential errors. In the prepare_data.py script, we included some logging statements that we can see in the notebook output cell or in the AWS CloudWatch log stream:
This is one of the many convenient features of AWS SageMaker. Everything related to logging was created for us; SageMaker communicates with the other AWS services behind the curtain. We do not need to know how to set that up on our own.
2.2. Training
After preprocessing, we can start the XGBoost model training process. For that, we are using an AWS prebuilt container. AWS provides a number of prebuilt Docker containers; see, for example, here. We retrieve the container image with this code snippet:
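A sketch of that snippet; the region follows the text, while the container and Python versions are assumptions:

```python
from sagemaker import image_uris

# Retrieve the URI of the prebuilt XGBoost training container for our region.
xgboost_image_uri = image_uris.retrieve(
    framework="xgboost",
    region="eu-central-1",
    version="1.2-1",   # container version; an assumption
    py_version="py3",
)
```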
As you can see we specify what framework we want, the region, the version of the container, and the Python version. We are running this notebook in the “eu-central-1” region.
Next, we select the instance type for the estimator. In our case, we select one “ml.c4.4xlarge” instance. You can select a smaller instance if you want to save cost at the expense of a longer execution time. With the “rules” argument we can ask SageMaker to create a report for us.
This report provides a summary of the XGBoost model training evaluation results, insights into the model’s performance, and interactive graphs.
What is left is selecting the hyperparameters and the training and validation datasets. Lastly, we fit the XGBoost model:
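Putting the pieces from this section together, a hedged sketch of the estimator setup and fit; the instance type and report rule follow the text, while the hyperparameter values and S3 paths are illustrative:

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.debugger import Rule, rule_configs

xgb_estimator = Estimator(
    image_uri=xgboost_image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.c4.4xlarge",
    output_path=f"s3://{BUCKET_NAME}/model-artifacts",
    # Ask SageMaker Debugger to generate the XGBoost training report.
    rules=[Rule.sagemaker(rule_configs.create_xgboost_report())],
)

# Illustrative hyperparameters for a binary classification problem.
xgb_estimator.set_hyperparameters(
    objective="binary:logistic",
    eval_metric="auc",
    num_round=100,
    max_depth=5,
    eta=0.2,
)

# Fit on the preprocessed CSV splits written to S3 by the processing job.
xgb_estimator.fit({
    "train": TrainingInput(f"s3://{BUCKET_NAME}/preprocessed/train", content_type="text/csv"),
    "validation": TrainingInput(f"s3://{BUCKET_NAME}/preprocessed/validation", content_type="text/csv"),
})
```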
SageMaker will spin up the previously defined “ml.c4.4xlarge” instance and use the training and validation dataset to fit the XGBoost model. We can explore the training process in the AWS Console by navigating to the Training jobs:
As in the preprocessing step, we can, e.g., explore the settings, metrics, or logs during training while the job runs. Note that we did not provide a job name, so this is the default job name for training that SageMaker assigns.
After training, we can check the XGBoost model training report by running the commands below. It can take some time for the report to be created, so running the command immediately after the job finishes might not work. Just wait a few minutes and execute the cell again. The report will be copied to the location where the notebook is running so we can access it directly in the AWS SageMaker Studio environment.
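One way to do that copy from within the notebook, assuming the Debugger convention of writing rule output under the training job’s prefix:

```python
from sagemaker.s3 import S3Downloader

# The Debugger rule writes the report under <output_path>/<job name>/rule-output.
rule_output_path = (
    xgb_estimator.output_path
    + "/"
    + xgb_estimator.latest_training_job.job_name
    + "/rule-output"
)
S3Downloader.download(rule_output_path, ".")
```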
Using IPython, we can create a link to the report:
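For example (the folder name follows the Debugger’s CreateXgboostReport rule convention; treat the exact path as an assumption):

```python
from IPython.display import FileLink, display

# Clickable link to the generated HTML report inside SageMaker Studio.
display(FileLink("CreateXgboostReport/xgboost_report.html"))
```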
2.3. Evaluation
After the model is trained, we would like to evaluate it. For that we can consult the XGBoost model report, which gives us a confusion matrix and classification report:
We can see that the model scores very high on the training set metrics. We will evaluate the model’s performance on the test set later.
For now, let’s try to understand why the model makes the predictions it does (i.e., gain some insights into explainability); we can, for example, plot the feature importance.
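One way to do this is to pull the model artifact from S3 and plot the importances locally; a sketch, assuming the built-in XGBoost container’s pickled model format and that matplotlib is available:

```python
import pickle
import tarfile
import xgboost
from sagemaker.s3 import S3Downloader

# Download and unpack the model artifact produced by the training job.
S3Downloader.download(xgb_estimator.model_data, ".")
with tarfile.open("model.tar.gz") as tar:
    tar.extractall()

# The built-in XGBoost container stores a pickled Booster named "xgboost-model".
with open("xgboost-model", "rb") as f:
    booster = pickle.load(f)

xgboost.plot_importance(booster, importance_type="weight")
```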
We can see that the model assigns a high importance relative to other features to the wind speed and blade angle. This points to a potential problem with the dataset and how we are using it. It could be that the wind speed and average blade angle are not recorded when the wind turbine is not running. Because the dataset is obfuscated, this cannot be investigated further at this point. It is somewhat concerning that the model performs so well in terms of the metrics we checked in the XGBoost report. Before rolling this model out into production, it is worth double-checking that the data collection process is not flawed. If our suspicion holds true, we should drop the columns that “leak” the target variable.
2.4. Invoking the endpoint
As a last step, we deploy the model to an endpoint and use it to retrieve predictions. As an example, we will use the endpoint to calculate some of the model evaluation metrics that the XGBoost report produced for us.
Deploying the model to an endpoint is as easy as this:
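In sketch form, using the instance type discussed below:

```python
# Deploy the trained model to a real-time inference endpoint.
xgb_predictor = xgb_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",
)
endpoint_name = xgb_predictor.endpoint_name
```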
We select a smaller instance — “ml.t2.medium” — for inference because the workload is not as heavy as during training. In a real-life scenario, we would monitor and scale the instance appropriately or make use of AWS auto scaling capabilities.
To be able to call the endpoint, we need some helper functions to convert the dataframe object to a CSV payload that we can send to the model and another function to call the endpoint and read the response:
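A sketch of such helpers, calling the endpoint through the low-level SageMaker runtime client (the function names are assumptions):

```python
import io
import boto3
import numpy as np

runtime_client = boto3.client("sagemaker-runtime")

def df_to_csv_payload(df):
    """Serialize a feature dataframe (without the target column) into a CSV string."""
    buffer = io.StringIO()
    df.to_csv(buffer, header=False, index=False)
    return buffer.getvalue()

def predict(payload, endpoint_name):
    """Send the CSV payload to the endpoint and return the predicted probabilities."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload,
    )
    body = response["Body"].read().decode("utf-8")
    return np.array([float(value) for value in body.replace("\n", ",").split(",") if value])
```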
Because this is a binary prediction, we know that the return values are the probabilities of a sample being a faulty wind turbine state.
As an example, we use this function to create the confusion matrix for the test dataset.
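A sketch of that evaluation; the test file location and the 0.5 decision threshold are assumptions:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# The first column of the preprocessed test set is the target, the rest are features.
test_df = pd.read_csv(f"s3://{BUCKET_NAME}/preprocessed/test/test.csv", header=None)
y_true = test_df.iloc[:, 0]
features = test_df.iloc[:, 1:]

# Query the endpoint and turn probabilities into class labels.
# Note: large test sets would need to be sent in batches due to payload size limits.
probabilities = predict(df_to_csv_payload(features), endpoint_name)
y_pred = (probabilities > 0.5).astype(int)

print(confusion_matrix(y_true, y_pred))
```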
We see that the model also performs quite well on the test dataset. This raises further suspicions, as we discussed when checking the feature importances.
If you followed all the steps, make sure to delete the endpoint after you are done to avoid unnecessary cost by running the last cell in the Jupyter notebook. This can also be done from the AWS Console.
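The cleanup boils down to a single call on the predictor object:

```python
# Delete the endpoint (and its configuration) to stop incurring cost.
xgb_predictor.delete_endpoint()
```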
From here, we would begin, e.g., tweaking the model, trying out new features, performing hyperparameter tuning, and eventually putting our whole pipeline into production. We could achieve all of this with code written in Jupyter notebooks, but it is easy to quickly lose track. One would need to develop a customized solution to keep track of model development, training, evaluation, comparison with other versions, as well as deployment. Fortunately, we do not have to start from scratch: we can use AWS SageMaker to do all of this for us, providing only the code and the data.
Part 3: Building an end-to-end MLOps Pipeline with AWS SageMaker Studio
In this part, we will write an end-to-end pipeline with AWS SageMaker Studio almost from scratch. We do not have to worry too much about the heavy lifting, because one of the selling points of using AWS is that it comes with a rich set of templates, documentation, and prior work that can help us get a jump start. While there is literally a “one-click ML solution” called Amazon SageMaker JumpStart, we will use SageMaker Pipelines here.
We are going to copy an existing MLOps pipeline template. This template comes with two pipelines, a model build and a model deployment pipeline as illustrated here:
Overview
We are going to modify the “modelbuild” pipeline and slightly adapt the code that we developed in part 2, following these steps:
3.1. Create a new project using an AWS-provided template
3.2. Modify the code in the repository
3.3. Run the pipeline
3.4. Approve the model to start the deployment pipeline
3.5. Check deployment pipeline and endpoint
Note that AWS constantly enhances their service offerings. Depending on when you are reading this blogpost, the screenshots might be outdated and the information displayed might differ slightly.
3.1. Create a new project using an AWS-provided template
In AWS SageMaker Studio, we can create a new project by selecting the resources tab from the left panel and selecting “Projects” in the dropdown, then clicking “Create project”:
In the wizard, we want to make sure to select the template “MLOps template for model building, training, and deployment”, then hit “Select project template”:
This will create a project from the selected template. You can give it any name and description you would like while honoring the requirements.
When you are satisfied with what you have entered, you can hit the “Create project” button which will initiate project creation. This can take a few minutes. After the project is successfully created, we can clone the repository. Make sure to clone the repository that ends with “modelbuild” (this can be either the first or second row, so be careful). You can widen the columns by dragging the separator between the “Name” and the “Local path” headers:
For more information on the template, you can consult the SageMaker MLOps Project Walkthrough.
3.2. Modify the code in the AWS repository
We can now modify the code in the repository in SageMaker Studio to suit our needs. Do not clone the GitHub repo, but follow the steps described below. The file structure in the AWS repository is as follows:
|-- img/
|-- pipelines/
|   |-- __init__.py
|   |-- __version__.py
|   |-- _utils.py
|   |-- get_pipeline_definition.py
|   |-- run_pipeline.py
|   |-- abalone/
|       |-- __init__.py
|       |-- evaluate.py
|       |-- pipeline.py
|       |-- preprocess.py
|-- tests/
|-- codebuild-buildspec.yml
|-- CONTRIBUTING.md
|-- LICENSE
|-- README.md
|-- sagemaker-pipelines-project.ipynb
|-- setup.cfg
|-- setup.py
|-- tox.ini
We are going to modify the following files: __version__.py, codebuild-buildspec.yml, and the contents of the abalone/ folder (evaluate.py, pipeline.py, and preprocess.py). You can consult the README.md to find out more about the other files and the template in general. Make sure to download this folder from the GitHub repository.
We are replacing selected files in the AWS repository with the files in the GitHub repository. The AWS-provided template uses an “abalone” dataset; however, we would like to build our pipeline to consume the wind turbine dataset.
- In the AWS repository, the folder /pipelines/abalone/ contains the code for the pipeline, namely the pipeline code in pipeline.py, the code for preprocessing in preprocess.py, and the code in evaluate.py for model evaluation. We are going to replace all three of these files with the files in the GitHub repository and rename the abalone folder to “windturbine”. The code in these files is taken from the notebook outlined in part 2. There are just a few adaptations to define the individual steps in the pipeline. The code is documented, so you are invited to go through it.
- The __version__.py file contains metadata for the pipelines package. We can edit the title, description, version, author information and give other supplementary information about our pipeline. You can choose to edit the contents of this file in any way you see fit or not at all.
- codebuild-buildspec.yml orchestrates the pipeline. In this file, we only have to change one line to reflect the changed folder structure of this project.
It is possible to extend this pipeline by adding more steps, for example if you select another template that includes monitoring, you can include monitoring steps in the pipeline. We just have to adjust the code in the template to fit our needs. In this blogpost, we focus on model building, training, and deployment.
Make sure to replace the files listed above and change (do not forget to save) line 15 in codebuild-buildspec.yml to read:
run-pipeline --module-name pipelines.windturbine.pipeline
In my case, I have renamed /pipelines/abalone/ to /pipelines/windturbine/. In the pipeline.py file, you have to update the bucket name in line 41 to the bucket where you stored the data earlier (SageMaker will also store artifacts created during the pipeline run in this bucket):
BUCKET_NAME = "sagemaker-done-mlops"
You can choose any name that you like, however, the name of the bucket must begin with “sagemaker” and be globally unique. If you choose not to adhere to this, you have to edit the Identity and Access Management (IAM) policies attached to the role that SageMaker assumes.
After we replaced the files and made the change in the pipeline name, we can commit the changes by clicking on “Git” in the left panel and staging all the changes by clicking the “+” on the files that we edited. The “+” will appear when you hover over the filename. Label the commit with a meaningful name and optionally a description:
Do not forget to include the files that we just “created” in the new folder /pipelines/windturbine, including the __init__.py. This path did not exist before (it was /pipelines/abalone), hence we need to commit it explicitly.
3.3. Run the pipeline
After committing the changes and pushing them to the repository, a new run of the (updated) pipeline will be started automatically. This can take a few minutes. If nothing happens, you can open the AWS Console and check in the CodeCommit service whether the code in the repository has been pushed. It is easy to forget the push after the commit, in which case the code in the repository remains unchanged:
Check the code and commit history. If you encounter other errors, it might save time to check the build directly using the CodeBuild service of AWS. That way you do not have to wait until the build fails in the pipeline but you can monitor the logs “live”.
If everything went well, we can check the pipeline visually. To view the graph of the pipeline, double click on the pipeline name:
We can double click on the execution to inspect the pipeline diagram.
There we can see the different pipeline stages (the step names are defined in the pipeline.py file and you can change them if you want), we can check the settings, and we can see which parts of the pipeline are being executed. After the pipeline runs, everything should be green.
Note that there is an additional conditional step that was not present in the notebook we presented in part 2 of the blogpost. The reason is that we include a step that instructs SageMaker not to register the model if a chosen metric — we use accuracy — is below a defined threshold. This is a neat feature in case we make changes to the model that would result in poor performance. You can read more about the model registry here.
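In pipeline.py this is expressed with a SageMaker Pipelines ConditionStep; a simplified sketch, where step_evaluate, evaluation_report, step_register_model, the JSON path, and the threshold value are placeholders rather than the exact names from the template:

```python
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet

# Register the model only if the accuracy reported by the evaluation step
# clears the threshold; otherwise the pipeline simply stops here.
condition = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step_name=step_evaluate.name,
        property_file=evaluation_report,
        json_path="binary_classification_metrics.accuracy.value",
    ),
    right=0.8,  # illustrative threshold
)

step_check_accuracy = ConditionStep(
    name="CheckAccuracyThreshold",
    conditions=[condition],
    if_steps=[step_register_model],
    else_steps=[],
)
```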
3.4. Approve the model to start the deployment pipeline
Because we set the approval process to manual, we need to approve the model before it will be deployed to a staging endpoint. We do this by navigating to the “Model groups” tab and double clicking on the name:
We can see that the model version is pending, meaning that we manually need to approve the model before the deployment pipeline is triggered:
If we double click on a version, we can see the model quality results. We defined in evaluate.py that we would like to have the accuracy, recall, precision, and AUC reported. If we are satisfied, we can approve the model by clicking on the “Update status” button.
In the pop up that opens, we update the model version status to “Approved”:
3.5. Check deployment pipeline and endpoint
After the model is approved, the deployment pipeline will be automatically triggered. If everything worked well, this pipeline will deploy the model to a staging endpoint. We can check this in the AWS Console CodePipeline service (can be found by searching for CodePipeline in the AWS Console):
There we can check the progress of the deployment and debug in case it is necessary. Note that the pipeline will only work if a supported instance is selected in the pipeline.py file:
INFERENCE_INSTANCES = ['ml.t2.medium', 'ml.m5.large']
There we defined that the model supports these two instance types; if we want to use a different instance type for inference, we need to add it in the pipeline.py file. If you are curious, you can check the modeldeploy repository in CodeCommit and modify the inference instance type there (but make sure that the one you select is in the list of instances defined in the modelbuild pipeline.py file).
If everything worked well we can see the endpoint in the AWS Console in SageMaker:
And we can get the Amazon Resource Name (ARN) from there by clicking on the name of the endpoint:
We are now able to test the endpoint and if we are satisfied, bring it into a production environment.
To close everything down after you are done, make sure to remove the endpoint and close all running Jupyter kernels in AWS SageMaker Studio.
Conclusion
Congratulations on working through this data science project! We hope that this outline sparked your interest in MLOps with AWS SageMaker. In this blogpost we showed you (1) how easy it is to set up AWS SageMaker, (2) how to develop, train, and deploy an XGBoost model with Jupyter Notebooks in the SageMaker environment, and (3) how to build an end-to-end MLOps training pipeline with AWS SageMaker Pipelines. SageMaker and its ability to work seamlessly with other AWS services did the heavy lifting for us; we only needed to modify the code and point SageMaker to an S3 bucket. There was no need to orchestrate a fleet of virtual instances or worry about capacity or scaling, and we could make use of the debugging tools AWS offers behind the scenes. We did not have to write a single line of code for deployment; AWS created the whole deployment pipeline for us. We hope that this journey was valuable and insightful for you. If you have any questions, concerns, comments, or feedback, please do not hesitate to reach out to Heiko or Philipp. We are looking forward to sharing our experiences with you and to finding out how we can help you set up a fully functioning end-to-end MLOps pipeline with AWS SageMaker.