Hi. I'm David Arpin from the Amazon SageMaker team, and I'm here to talk to you about a new capability, SageMaker's automated model tuning. At a high level, Amazon SageMaker is a machine learning platform that we've designed to make it very easy to build, train, and deploy your machine learning models and get them from idea into production as quickly and easily as possible. These three components are interconnected but independent, so you can use one or more of them to suit your needs. For the build component, we have the ability to very quickly and easily set up an instance that's running a Jupyter notebook server, which is an interactive environment designed for data scientists to explore data, create Markdown documentation, and build interactive visualizations of data. The next component is training, which is a distributed, managed environment: when you create a training job, we spin up a cluster of training instances for you, load a Docker container that has an algorithm within it, bring in data from S3, train that algorithm, output the artifacts back to S3, and then tear down the cluster without you having to think about any of that. We manage that process on your behalf. The deploy component means that once you've trained your model, you can easily deploy it to a real-time production endpoint and then invoke that endpoint to get real-time predictions from that machine learning model.

On top of these core components, we have additional layers. We have custom-built SageMaker algorithms, which have been designed from the ground up with advancements in science and engineering. The methodology is slightly different: it's designed to be more scalable and more efficient, with engineering advancements like GPU acceleration and distributed training. We also have pre-built deep learning frameworks for TensorFlow, MXNet, PyTorch, and Chainer. These allow you to write the code you naturally would for those deep learning frameworks and then deploy it to SageMaker without having to think about managing the container, knowing that the pre-built container we've set up for you will let you train in a distributed setting and take advantage of other nice functionality within each framework. Finally, you have the ability to bring your own Docker container, so you can code up your own algorithm, package it in a Docker container, and still take advantage of SageMaker's managed training and hosting environments. On top of all of those is tuning. SageMaker's automated model tuning is a service that wraps up training jobs and works with the custom algorithms, the pre-built deep learning frameworks, or bring your own, in order to help you find the best hyperparameters and improve the performance of your machine learning model.

So let's dig into SageMaker's automated model tuning in more detail. First, what are hyperparameters? Hyperparameters help you tune your machine learning model in order to get the best performance. If you're building a neural network, you may want to tune your learning rate, which is the size of the updates you make to the weights on each iteration or pass through your network; the number of layers, so how deep or shallow your neural network is; and whether you're using regularization and dropout, where regularization penalizes large weights and dropout actually drops nodes out of your network to prevent overfitting.
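To make that concrete, here's a minimal MXNet Gluon sketch (the same framework used in the demo later). The specific values are illustrative, not recommendations; they're exactly the kinds of knobs a tuning job would search over.

```python
import mxnet as mx
from mxnet import gluon

# The depth of the network, the dropout rate, and the optimizer settings below
# are all hyperparameters: they are chosen up front rather than learned from data.
net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(128, activation='relu'),
        gluon.nn.Dropout(0.5),   # fraction of nodes randomly dropped each pass
        gluon.nn.Dense(10))      # output layer for a 10-class problem
net.initialize(mx.init.Xavier())

# Learning rate, momentum, and weight decay (a form of regularization) control
# how the weights are updated on every pass through the data.
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1, 'momentum': 0.9, 'wd': 1e-4})
```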
So these hyperparameters are important to make sure you get the best predictive performance out of your neural network. Hyperparameters are also used in other machine learning algorithms, things like trees. If you're fitting a decision tree, a random forest, or a gradient-boosted ensemble of trees, the number of trees is important: do you want one tree or many trees, and how deep should each tree be? Do you want a smaller number of very deep trees or a much larger number of very shallow trees? What is your boosting step size, so how much you change from round to round of boosting? These things are all very impactful, from a supervised learning perspective, in how your model performs. But even with unsupervised machine learning techniques like clustering, there are still hyperparameters: the number of clusters you want at the end, how you want to initialize the seeds in your clustering process, and whether you want to transform or pre-process the data prior to running the clustering algorithm.

So when we think about the hyperparameter space and what we need to do to change these hyperparameters in order to get a better fit, it's important to realize that hyperparameters have a very large influence on overall model performance. In the plot on the right, you can see that we're training a neural network and varying the embedding size and the hidden size, with the validation F1 score, which is a measure of our accuracy, plotted along the z-axis. We want the best accuracy we can get out of this model, and we've tested a lot of different potential options for these two hyperparameter values. You can see that the difference is 10 points on this F1 score. So varying these hyperparameters can make a very big difference in performance, which can obviously make a big difference in your bottom line from the predictions you're generating from these models. The hyperparameter space also grows exponentially. Here we've plotted two hyperparameters, but if we wanted to tune more hyperparameters, we would have to evaluate more and more points every time in order to create a plot like this, so it gets very hard to tune a large number of hyperparameters efficiently. There are also non-linearities and interactions between the hyperparameters. You can see that the embedding size and hidden size have some interaction: when you change one parameter without changing the other, you get one result, but when you change them both together, you get a different result. In addition, there are non-linearities, meaning that you can't just keep increasing one hyperparameter value and always expect better performance. You'll increase it up to a certain point, after which you'll see diminishing returns or even worse results. Finally, each point along this chart is an expensive evaluation: we have to retrain the model for each combination of hyperparameter values. So depending on our model complexity and our data size, it can be very expensive to evaluate different hyperparameter combinations.

So we want a process for tuning these hyperparameters, and there are several common ways of doing it. The first would be manual: you start with the default values of the hyperparameters, you guess and check a few times, and you eventually converge on a set of hyperparameter values that you're comfortable with.
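As a quick illustration of where those tree and clustering hyperparameters show up in practice, here's a small scikit-learn sketch; the specific values are placeholders rather than recommendations.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cluster import KMeans

# Supervised example: a gradient-boosted ensemble of trees.
# n_estimators (number of trees), max_depth (how deep each tree grows), and
# learning_rate (the boosting step size) are the hyperparameters discussed above.
gbt = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)

# Unsupervised example: k-means clustering.
# n_clusters (how many clusters you want) and init (how the seeds are initialized)
# are hyperparameters even though there is no label to predict.
km = KMeans(n_clusters=8, init='k-means++')
```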
Another manual approach is to rely on a data scientist's experience or intuition: if they've seen a problem like this in the past, they may be able to pick hyperparameters more successfully for future cases. There may also be heuristics, where you tune one hyperparameter first and then subsequently a second and then a third. These approaches tend to require someone with an advanced skill set in machine learning, though. The second method is brute force, and there are three common sub-classes: grid, random, and Sobol. With grid, we try a specific subset of values for each hyperparameter. You can see in the chart at the right that we've tried three values of one hyperparameter and three values of another, forming the nine possible hyperparameter pairs, and we try each of these in a brute-force sense. The next method is random, where we just randomly pick values for each of the two hyperparameters. Although this sounds naive, it's actually quite successful. Then there are also methodologies like Sobol sequences that try to blend the two, where you fill up the space like a grid but add some randomness in. The reason you want that randomness, and the reason random hyperparameter search can be so effective, is that very often one hyperparameter has a much larger impact on the objective, whether that's the training loss or the accuracy of your model, whatever value you're trying to improve most. Because of that, random search will actually try a much larger number of distinct values of that parameter and will explore that space better than grid, where you've predefined a very specific subset of values and may miss out on the best performance.

Finally, there's the meta-model approach, which actually builds another machine learning model on top of your first machine learning model to predict which hyperparameters are going to yield the best accuracy or objective metric results. That's the approach SageMaker takes: SageMaker uses a Gaussian process regression model to model the objective metric as a function of your hyperparameters. It's trying to predict what your accuracy or your training loss is going to be as a function of the hyperparameter values. This is a good method that assumes some smoothness, so for a small change in a hyperparameter value, you won't have a drastically wild change in your objective metric. It works well with very little data, which is important because, as we mentioned earlier, it's expensive to keep training these models. And it provides you with an estimate of confidence or uncertainty about what the objective metric would be at different hyperparameter values. Once we build that predictive model, we use Bayesian optimization to decide where we search next, which hyperparameter combination we test next. Bayesian optimization is great because it works in an explore-and-exploit fashion: you try out a lot of different hyperparameter values, and then when you find good ones, you test nearby points that may be just slightly better. So it does this very natural, nice combination of both techniques. It's also gradient-free, so we don't have to understand exactly how our objective metric relates to our hyperparameters, because usually that relationship is unknown. That makes a gradient-free optimization method important.
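To see why random search explores an influential hyperparameter better than grid search, here's a minimal numpy sketch; the ranges and the nine-trial budget are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid search: a predefined subset of values for each hyperparameter,
# giving 3 x 3 = 9 combinations, but only 3 distinct values of each one.
learning_rates = [0.001, 0.01, 0.1]
momentums = [0.0, 0.5, 0.9]
grid_trials = [(lr, m) for lr in learning_rates for m in momentums]

# Random search: the same budget of 9 trials, but every trial uses a new value
# of each hyperparameter, so the influential one is explored far more finely.
random_trials = [(10 ** rng.uniform(-3, -1), rng.uniform(0.0, 0.99))
                 for _ in range(9)]
```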
So going through a very simple example, the dashed green line here is our objective; in this case, maybe that's our machine learning model's training loss or validation loss, and we want to minimize it. We want the best model we can get on our validation dataset. We've sampled a couple of points, one at 0.5 and one at zero, and our Gaussian process regression has built that black line, which is its prediction of what the objective value might look like. You can see the bands that delineate how certain we are of the values in between the points we've tested, along this one hyperparameter that varies between zero and 1.0. The bottom plot shows our expected improvement, which is calculated by the Bayesian optimization portion. It blends our uncertainty with our prediction of how good the objective metric might be, in order to define the next point we test. We always test the next point at the peak of expected improvement. You can see that in this case it's all the way at 1.0, so we're trying the ends of the spectrum; we're still in the exploration phase of hyperparameter tuning. We test at 1.0 and find that our objective metric is not very good in that region. So the next point we test is in between our first two points, zero and 0.5, and we find that we don't have much improvement over our pre-existing best objective value. So now we test in between 0.5 and 1.0, and now we're starting to see a real improvement. With a very small number of iterations, we've landed on a hyperparameter setting that is very good at optimizing this objective function. Now we transition into the exploit phase, and it's all done naturally by the algorithm: we test very close to the new optimal point that we found. We test a little to the left, no improvement; so we test a little to the right and find an improvement. Within very few iterations, we've already found the optimum of this function, despite knowing nothing about it going in. We didn't know its functional form, so we couldn't use a regular optimization technique on it.

So how does this fit into SageMaker's automated model tuning capability? We mentioned before that it works with SageMaker algorithms, frameworks, and bring-your-own containers, and we think that's very important. It treats your algorithm like a black box that it can optimize over; it doesn't have to know exactly what's going on inside your algorithm to be effective at tuning its hyperparameters. We also require that flat hyperparameters are provided. Those hyperparameters can be continuous, taking a continuous numeric value; integer; or categorical, taking one of a set of potential distinct values. Finally, we need your objective metric logged to CloudWatch. This happens naturally with SageMaker training jobs: anything you output is reported in CloudWatch, so you can easily output the objective metric you want to optimize on and use a regex to scrape that information out. These are the only requirements to train a model with SageMaker's automated model tuning, and that is what allows it to be so flexible and work in so many different use cases.
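Here's a hedged sketch of that surrogate-plus-expected-improvement loop using scikit-learn and scipy. This is not SageMaker's internal implementation, just the general idea; the two observed losses and the kernel choice are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(candidates, gp, best_loss):
    """Expected improvement for a minimization problem."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)            # guard against division by zero
    z = (best_loss - mu) / sigma
    return (best_loss - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Two points already evaluated: hyperparameter value -> observed loss (made-up numbers).
X = np.array([[0.0], [0.5]])
y = np.array([0.7, 0.4])

# Fit the Gaussian process surrogate over the single hyperparameter in [0, 1].
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, y)

# Score a dense set of candidates and pick the peak of expected improvement
# as the next hyperparameter value to evaluate with a real training job.
candidates = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
ei = expected_improvement(candidates, gp, best_loss=y.min())
next_point = candidates[np.argmax(ei)]
```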
So with that, let's transition into a demonstration of what SageMaker automated model tuning looks like in practice. We've opened up Amazon SageMaker in the AWS console; I'll open up a notebook instance that we have and go to one of the available examples. In this case, we're going to use hyperparameter tuning with MXNet Gluon on CIFAR-10. As I mentioned, hyperparameter tuning works with the pre-built deep learning frameworks, the custom algorithms, and the bring-your-own capabilities, but for this case we'll use our pre-built MXNet container for the project. So we've opened up the example notebook, and you can see we've pre-run it because, as I mentioned, hyperparameter tuning runs a lot of different training jobs, which can take quite a while, so we'll walk through a pre-run version. To give you a little background on the task we're trying to complete: we're training a convolutional neural network, a ResNet-34 V2, on CIFAR-10, which is a very common computer vision classification benchmark. We're going to compare default hyperparameter values to brute-force tuning through random search, and then to Amazon SageMaker's automated model tuning as well.

The first step in any SageMaker notebook is setup. I'll import some of the libraries that I need and set up an IAM role, which grants me the permissions and access I need to get my data and create SageMaker training clusters. I'll import some more libraries. Here, I've brought in an MXNet estimator from the SageMaker Python SDK, along with pieces of the SDK that help us with hyperparameter tuning and running automated model tuning jobs in SageMaker. We've also written our own library called random tuner, which I'll walk through in more detail, and I've brought in some standard Python data science libraries like Pandas and Matplotlib.

Digging into the dataset, we're going to use CIFAR-10, which, as I mentioned, is a very common standard computer vision dataset of 32-by-32-pixel images. They're evenly distributed across 10 classes, with 50,000 images that we'll use to train our convolutional network and 10,000 images that we'll use to test its performance on images it hasn't seen. You can see a snapshot of a sampling of the images: small images with pretty well-defined pictures of airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. So it's a broad spectrum of items in the real world that the network has to classify into one of those 10 classes. We use a library to download that data and then upload it to S3, which allows SageMaker training to pick it up and use it in the training script.

The next step is to define our MXNet script. With the pre-built deep learning framework containers in SageMaker, we can write MXNet code naturally and then wrap it up in a few small functions to submit to our container, which lets us train in SageMaker's managed training environment. The first function we create is train: this function defines our network, bringing in ResNet-34 from the Gluon model zoo, brings in our data, and actually starts the training process.
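For reference, the setup and S3 upload cells just described typically look something like this; it's a minimal sketch, and the local path and key prefix are placeholders for whatever the notebook actually uses.

```python
import sagemaker

# The session and IAM role give the notebook access to S3 and permission
# to create SageMaker training jobs on our behalf.
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Upload the locally downloaded CIFAR-10 files so the training cluster can read them from S3.
inputs = sagemaker_session.upload_data(path='data', key_prefix='DEMO-gluon-cifar10')
print(inputs)  # e.g. s3://<default-bucket>/DEMO-gluon-cifar10
```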
The next function is save: once we've returned our trained neural network from that train function, we save it in a way that lets us access it later as a model artifact in S3. We also have a couple of other helper functions called get data, get test data, and get train data, which are just ways of taking the images that SageMaker loads from S3 onto the training cluster and bringing them into MXNet in an efficient, iterative fashion. Then we have our test function, which lets us monitor accuracy on our holdout sample of images and track the network's performance as we train it. There are also hosting functions here that you don't need for this example, because we're just doing hyperparameter tuning, but if you wanted to host the resulting model on a SageMaker real-time endpoint, you could do so using them.

The next step is an initial training job. We'll create an MXNet estimator, pass in that cifar10.py script we just talked through, which has our MXNet code, provide an IAM role for our access management, and provide the specifications for our training cluster: we're going to train on one p3.8xlarge node. We also provide a subset of hyperparameters. Here we specify the batch size, so it will do 1,024 images per mini-batch, and the number of epochs, so we'll train for 50 epochs, which is 50 full passes through the dataset. Then there are three key hyperparameters for our convolutional neural network: learning rate, which controls the size of the updates we make to the weights on each pass; momentum, which uses information from the previous step to inform the current step; and weight decay, which penalizes weights when they grow too large. So we've created this MXNet estimator, and then by calling .fit and passing in the S3 location of our images, we create the SageMaker training job. It will provision those instances, load the pre-built deep learning framework container, and start executing that cifar10.py script, and as it executes, logs from CloudWatch are printed here in the example notebook. We can see our setup and then track how our performance is doing from epoch to epoch; we start off with a pretty low validation accuracy of only about 10 percent, which is about as good as random guessing. It's a rather long training job, since we do 50 epochs. If we scroll all the way to the bottom to find out how the training job did, we can see that when we use the default values for MXNet's optimization routine, we only get an accuracy of about 53 percent, which isn't great. CIFAR-10 is a very hard problem, but we should still try to do better. So let's see if we can change the values for learning rate, momentum, and weight decay from the defaults to something that gets a better result.

The first thing we'll do is random search hyperparameter tuning. As we mentioned, it seems naive but performs surprisingly well, and it's a very good baseline for judging how well another hyperparameter tuning method does. We've created a simple random_tuner.py script with a few helper functions that we'll need.
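That estimator cell looks roughly like the following sketch. I'm using the SageMaker Python SDK v1-era parameter names from when this example was published; the optimizer hyperparameter names and values are illustrative and have to match whatever cifar10.py actually reads, and role and inputs come from the setup cells above.

```python
from sagemaker.mxnet import MXNet

# Baseline training job: batch size and epochs fixed, optimizer hyperparameters
# left at (or near) their defaults. Names and values here are illustrative.
estimator = MXNet(entry_point='cifar10.py',
                  role=role,
                  train_instance_count=1,
                  train_instance_type='ml.p3.8xlarge',
                  hyperparameters={'batch_size': 1024,
                                   'epochs': 50,
                                   'learning_rate': 0.1,
                                   'momentum': 0.9,
                                   'wd': 0.0001})

# .fit provisions the instance, pulls the pre-built MXNet container,
# trains against the data uploaded to S3, and streams the CloudWatch logs here.
estimator.fit(inputs)
```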
We start out by defining a few classes for categorical, integer, and continuous parameters, where we can specify the ranges or distinct values each parameter is allowed to take in the random search. Then we have two other functions: one that takes the hyperparameters you specify and randomly samples one value from the range or distinct set you've given, and a second that actually kicks off the training jobs, launching many of them, each with randomly sampled hyperparameter values. Then we have a couple more functions to help us analyze the output. As we mentioned, logs from SageMaker training jobs go to CloudWatch, so we've created a couple of functions to scrape those logs and, using a regex, pull out the validation accuracy that we saw at the bottom of our last training job.

To start using this, we define one more simple function, fit random, which takes a job name and a hyperparameter dictionary and starts up model training. We set wait equal to false here so that we can train as many models in the background as we'd like. Now we specify a dictionary of hyperparameters. Similar to above, we'll use the same batch size of 1,024 and the same number of epochs, but this time we define our learning rate as taking a value somewhere between 0.001, an order of magnitude smaller than our first example, and 0.5. Then we define momentum between 0 and 0.99, which is a pretty wide spectrum of possible values, and weight decay between 0 and 0.001. Now we start our random search, passing in the function that creates a training job, our hyperparameters with their defined ranges, how many jobs we want to run at most, and how many jobs we want to run at once. We could set max parallel jobs to 120 if we were comfortable spinning up 120 p3.8xlarge instances all at one time, but for the purposes of this notebook we'll just set it to eight. Once we submit that, it's going to kick off 120 training jobs, and then we can summarize and compare the validation accuracy across all of those jobs, each run with random hyperparameter values sampled from the ranges we specified. You can see we've used our other functions, table metrics and get metrics, to return a Pandas DataFrame of each training job and its objective score. At best, we're over 20 percentage points better than the default values, which is a huge improvement in accuracy for this problem. However, at worst we're only 23 percent accurate, so randomly sampling hyperparameter values still produces a lot of inaccurate results as well; we're just barely better than random guessing there. So there's a huge spectrum of possible objective values depending on the hyperparameter values. Had you happened to start with the values that give you 23 percent accuracy as your hyperparameters, you would be doomed to poor performance.
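The random_tuner helpers are specific to this notebook, so here's a generic, hedged sketch of the underlying idea: sample each hyperparameter from its range and launch one training job per sample without waiting. The ranges mirror the ones described above, and role and inputs come from the earlier setup cells.

```python
import time
import numpy as np
from sagemaker.mxnet import MXNet

rng = np.random.default_rng(0)

def sample_hyperparameters():
    # Draw each tunable hyperparameter uniformly from its allowed range (illustrative).
    return {'batch_size': 1024,
            'epochs': 50,
            'learning_rate': rng.uniform(0.001, 0.5),
            'momentum': rng.uniform(0.0, 0.99),
            'wd': rng.uniform(0.0, 0.001)}

# Launch a handful of jobs with wait=False so they run in the background;
# the real notebook also throttles how many run in parallel.
for i in range(8):
    job = MXNet(entry_point='cifar10.py',
                role=role,
                train_instance_count=1,
                train_instance_type='ml.p3.8xlarge',
                hyperparameters=sample_hyperparameters())
    job.fit(inputs, wait=False,
            job_name='random-search-{}-{}'.format(int(time.time()), i))
```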
The next thing we'll look at is the hyperparameter relationships. We've created scatter plots of our objective metric against each of our hyperparameters: learning rate, momentum, and weight decay. You can see that as the learning rate increases, we generally get better validation accuracy up to a certain point, and then, toward the end, as the learning rate gets too large, we see a bit more volatility in the potential accuracies. Momentum, on the other hand, has a little noise at low values, maybe peaks somewhere around 0.8, and then quickly diminishes, so we see a non-linear effect of momentum. Weight decay doesn't have as noticeable an effect. The last plot shows the job number against our objective, and you can see that we don't actually get any better over time; we're randomly sampling. This is exactly what we'd expect: as we randomly sample hyperparameter values, sometimes our best job might occur on the first try and sometimes on the 120th, and we don't have much control over that. We can also see that we're just randomly sampling the hyperparameter values relative to one another; we're not taking into account that learning rate and momentum are actually related to each other, we're just randomly trying values.

Now let's see what happens when we tune with SageMaker automated model tuning. As we mentioned, it uses a Gaussian process regression model to predict what good hyperparameter values might be, and Bayesian optimization, which does a good job of exploring and exploiting potential optima. The tuner is built into the SageMaker Python SDK, which we imported at the top. We start by defining a new MXNet estimator; again we pass the same training script, the same role, instance count, and instance type. We define our two static hyperparameters here, 1,024 as our batch size and 50 as our epochs again, and then we move on to defining our hyperparameter ranges. You can see that the classes here work very similarly; the random tuner was written with this in mind, to take a similar format to the SageMaker Python SDK. We define the minimum and maximum values of our learning rate in this continuous parameter class. Next we create an objective metric: we define what we want it to be, validation accuracy, and write the regex to parse it out of our CloudWatch logs. Now we create a tuner object from the SageMaker Python SDK and pass it our estimator, our objective metric definition, and the hyperparameter ranges we want to test. We also pass in the max jobs and the max parallel jobs. As you can see, we've defined a much smaller number of jobs, because SageMaker automated model tuning is going to be much more effective than random search at finding good values for the objective metric. We've also defined a much smaller number of parallel jobs: instead of eight, we're now training two at a time. That matters for SageMaker automated model tuning because it learns from past runs, so it's important not to run too many jobs in parallel; we might finish the overall tuning job faster, but we could lose out on accuracy because we can't predict far enough ahead.
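Putting those pieces together, the tuner cell looks roughly like this sketch (again using SDK v1-era names). The ranges mirror the random search above, and the metric regex is an assumption; it has to match exactly what the training script prints to its logs.

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# Ranges to explore (illustrative; they should mirror the random search ranges above).
hyperparameter_ranges = {'learning_rate': ContinuousParameter(0.001, 0.5),
                         'momentum': ContinuousParameter(0.0, 0.99),
                         'wd': ContinuousParameter(0.0, 0.001)}

# The objective metric is scraped from CloudWatch logs with a regex, so the
# pattern below is a placeholder that must match the script's actual output.
objective_metric_name = 'Validation-accuracy'
metric_definitions = [{'Name': 'Validation-accuracy',
                       'Regex': 'validation: accuracy=([0-9\\.]+)'}]

# 'estimator' is the one defined with only the static hyperparameters (batch size
# and epochs); the tuner supplies the tunable ones for every training job it runs.
tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            objective_type='Maximize',
                            max_jobs=30,
                            max_parallel_jobs=2)

tuner.fit(inputs)
```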
The next step is to summarize accuracy from our SageMaker model tuning job after it has run. You can see we have a final objective value of 74 percent, so within 30 jobs we've done better than random search did within 120. We have a worst performance of 0.41, still better than our worst training job in random search. So it's important to realize that SageMaker is actively trying to test good hyperparameter combinations, so in the worst case it's probably going to do better, and with one-fourth of the training jobs we actually found a model with better accuracy than random search. Now, if we compare the hyperparameter values across training jobs, we see the same relationships between our objective metric and the hyperparameter values, except there seems to be less noise, and the hyperparameters themselves don't have as random a scattering of values relative to one another. That's because, when we're exploring that space, we're looking to adjust multiple hyperparameter values at one time and understand the relationships between them. We can also see up here that, in general, our accuracy is getting more consistent and improving over time. That's another feature of SageMaker automated model tuning: as it learns which hyperparameter values are successful, it starts exploiting that and fitting ever better models.

So, in conclusion, we've seen that hyperparameter tuning is a very important aspect of machine learning, and that doing it naively can be quite costly: it took four times as many jobs to do random search as it did to use SageMaker automated model tuning. If you wanted to extend what we've done today, you could try another brute-force method like grid search to see how it performs relative to both random search and automated model tuning, you could try tuning a larger number of hyperparameters, you could use the first round of tuning to inform a second round, or you could apply hyperparameter tuning to your own problem. If you'd like more information on SageMaker automated model tuning, please see the example notebooks and documentation on AWS's website. Thank you for joining us today. I'm David Arpin from the Amazon SageMaker team. We hope you enjoy SageMaker automated model tuning and use it to improve your machine learning model fits today.