Welcome to An Introduction to Machine Learning. I'm Blaine Sundrud, a Senior Instructional Designer and Technical Trainer at Amazon Web Services (AWS), and I am thrilled to welcome you to one of the first sessions of today's AI Innovate event. Over the next hour, I'm going to help you build the framework you'll need to start developing machine learning solutions in your own work. I'll define some key terms, we'll talk about the different types of machine learning algorithms that can help you solve the business problems you'll face out in the real world, and then we'll walk through the machine learning pipeline.

Let's start this 60-minute journey off with a story: a use case of machine learning in action, one that took place right here at Amazon. Let me head over to the lightboard to illustrate it for you. Several years ago, Amazon.com needed to improve the way it routed customer service calls, so it looked to machine learning for help. The original routing system worked something like this: a customer calls in and is greeted by a menu. "Press 1 for returns. Press 2 for Kindle," and so on; you get the idea. The customer makes a selection and is sent to an agent, one who is trained in the right skills to help that customer.

The problem is, as you might have guessed from the kinds of things we do and sell here at Amazon, the list of things a customer could be calling about is practically endless. If the menu doesn't offer the right option, the customer is sent to a generalist rather than a specialist. The generalist has to figure out what they want, then sends them to another agent who is hopefully the right one. Maybe that agent has the right skills; maybe not, and the customer gets sent to yet another agent, and so on, until eventually they reach the person who is actually supposed to help them. For some businesses, that might not be the end of the world. But when you're dealing with hundreds of millions of customer calls every year, as Amazon is, that path is inefficient. It costs a lot of money, it wastes time, and worst of all, it's not a good way to get our customers the help they need.

You can probably guess the rest of the story. Amazon used machine learning to improve the whole routing system. The idea was to get rid of all those extra hops and send the customer straight to the agent who could help them. This made customers happier, it made the call center agents more productive, and basically everyone lives happily ever after. I'm going to show you how Amazon actually did it. We'll spend the next 60 minutes walking through the machine learning pipeline and explaining how we deployed this smarter, more intelligent customer service routing system, so that you can develop your own ML solutions moving forward.

First, what does the machine learning pipeline look like? It starts with collecting and integrating your data. Then you prepare the data and visualize it for analysis. Next you select the features you want to use and engineer some new ones as well. Then you train your model, evaluate it, and deploy it. So at this point, it's time to turn our business problem into a machine learning problem.
Let's start with the business problem. Our business problem in this case is: how are we going to route our customer calls successfully? We'll get to the machine learning problem in a second. In fact, before we can do any of the things we're about to talk about, we have to decide whether machine learning (ML) is even the right solution to deploy in the first place.

So, is machine learning an appropriate solution? Let's break it down. Machine learning is a subset of artificial intelligence, or AI. Machine learning uses data, that data is used to train a model, and the model is then used to make predictions. We can make those predictions from huge datasets because machine learning's strength lies in its ability to extract hidden patterns and structures from data.

A common use case for machine learning is credit card transactions. In this case, we have the appropriate data, because what we're looking to determine is fraud. The appropriate data is mined: we identify patterns among all of the card transactions, specifically patterns that indicate a fraudulent transaction. With these patterns, you can train an ML model to predict whether a future transaction is fraudulent: yes or no.

With that in mind, let's return to the question: is machine learning an appropriate solution for the business problem? In the case of the Amazon call center, it was. Amazon had millions of historical phone calls as a dataset, but there was no single indicator it could use to get a customer directly to the right agent in one step; it's more complicated than that. We needed to identify patterns within the whole range of customer data that could help us route customers to the right agents in a single step. Tons of data, all of which needed to be analyzed for patterns that Amazon could use to make accurate predictions. Knock-knock, who's there? Machine learning. This is exactly the kind of problem machine learning was built for. In this case, the machine learning problem is to predict the agent skill a call needs.

Now, there are different types of machine learning problems out there. Hypothetically speaking, let's say that with our call center problem, our original goal was just to predict whether a customer was calling about their Kindle or not calling about their Kindle. This type of problem is a binary classification: Kindle or not Kindle, simple as that, only two possible outcomes. It's a classification problem because we're predicting a category instead of a real number like a price. Although it's simple, this basic classification task supports a wide variety of elegant, scalable, and very powerful business solutions. For example: is this credit card transaction fraudulent or not?

The other type of classification problem you might run into is multi-class. With a multi-class solution, we're still predicting a category, but there are more than two possible outcomes. It's not just Kindle or not Kindle; we might be looking at any number of different choices. Today's example is actually a multi-class problem: there are many possible skills that might be needed to resolve a particular customer call.
Maybe they're calling about a Kindle, but maybe it's to return a product, or to ask a question about Alexa, or any of hundreds of other things. Those two examples, binary and multi-class, are classification problems. But there's also regression. In a regression problem, I'm no longer mapping to a set of defined categories; I'm predicting a continuous value, a number. An example of a machine learning regression problem is predicting the price of your company's stock.

Let's get back to our call center problem. We've determined our machine learning problem: it's a multi-class problem, with a whole set of possible outputs. At this stage, it's time to talk to our domain experts, gather more information, and challenge our assumptions. During this phase, some of the questions Amazon asked were: What exactly do these customer service agent skills represent? How much overlap is there between the skills? Are any of them similar enough that they could be combined? What happens when a customer is routed to an agent with the wrong skill; does that agent stand a chance of answering the question anyway? The more questions you ask during this discovery stage, and the more input the domain experts give you, the better your model is going to be.

All right, let's go back to the desk and put a few more pieces on this. Now it's time to get started with the ML pipeline, and now it's all about the data: collecting it and training on it so the model can make your predictions. Data is everywhere, and because it's everywhere, it can be collected from multiple sources: the internet, databases, other types of storage. Chances are very good, however, that some of the data your team collects is going to be noisy, possibly incomplete, even irrelevant. So wherever it comes from, it will need to be compiled and integrated, and most importantly, you have to clean the data. First, you need to collect and integrate the data that's relevant to your problem, and no matter what type of data you're collecting, you need the proper tools and knowledge to work with all the different data types.

Let's go back to our call center use case. The data we needed came from answering questions like: What were the customer's recent orders? Does the customer own a Kindle? Are they a Prime member? The historical customer data that answers questions like these are called features. Think of features as your inputs to the problem. The machine learning model's job during training is to learn which of these features are actually important for making the right prediction in the future. If the value you're trying to predict is known, as in supervised learning, that prediction is called a label. If the value isn't known, as in unsupervised learning, it's called a target. We'll talk more about supervised and unsupervised learning in a bit; don't worry about that for now. Just know that in our call center example, the label was the skill an agent needed to resolve the customer's call. Together, the features and the label make up a single data point, called an observation. Stack up a bunch of observations, and that's your dataset.
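To make that concrete, here's a minimal sketch in code, assuming pandas; the column names and values are hypothetical, loosely modeled on the call center example.

```python
# Each row of the DataFrame is one observation: its features plus its label.
import pandas as pd

dataset = pd.DataFrame({
    "recent_orders_30d": [3, 0, 7, 1],                          # feature
    "owns_kindle":       [True, False, True, False],            # feature
    "is_prime_member":   [True, True, False, False],            # feature
    "agent_skill":       ["kindle", "returns", "prime", "returns"],  # label
})

features = dataset.drop(columns=["agent_skill"])
labels = dataset["agent_skill"]
print(features.shape, labels.shape)  # (4, 3) (4,)
```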
Good data contains a signal about the phenomenon you're trying to model. For instance, let's say there's a merchant trying to forecast demand for products. They might track the number of sales they've had; that's a good start. But what if they forgot to log when certain products were out of stock? If you're trying to forecast demand, it's important to know when you were out of stock, and therefore critical to have data that represents that as one of your features. Here's a general rule of thumb: you need at least 10 times as many data points as features. So if you've got five features, you should have 50 data points minimum in your training data.

As you can see, sometimes that very first dataset isn't going to be enough for a good prediction. As developers, it's important to understand what data you're missing so that you can go get it. This is where the data preparation phase comes in. First step: take a small random sample of your data and really dig into it. You probably need between 20 and 50 observations, although again, that depends on how many features you have. Your job in the data prep phase is to manually and critically explore the data; you've got to look at it closely. Ask yourself questions like these: What features are there? That's step one. Does the data match your expectations? Is there enough information to make accurate predictions? Here's a good rule of thumb: if a human could look at a given data point and guess the correct label, then an ML algorithm should be able to succeed there too.

You'll also want to think critically about your labels. Ask yourself: are there any labels that you want to exclude from the model for business reasons? Are there any labels that aren't entirely accurate? In the call center use case, we asked domain experts key questions that helped inform this part of Amazon's analysis. For instance: How much overlap was there between skills? Were any skills similar enough to be combined? If we did our homework and properly answered those kinds of questions, we may have been able to simplify our model by excluding a few labels. For instance, instead of having labels that represent multiple Kindle skills, it might have made sense to combine those into one overarching Kindle skill label. That way, every customer who had a problem with a Kindle could be routed to an agent trained in all Kindle issues, rather than to a patchwork of tiny, specialized skills.

It can be hard to understand your data without seeing it. That's why you need to do more than just a manual analysis; you need a programmatic analysis too. That's what you get when you visualize the data. I love visualization. It's a technique that helps you understand the relationships within your dataset, which leads to better features and better models. When you can see the data in a chart or plotted out, it can unveil previously unseen patterns, and it reveals corrupt data or outliers you don't want: properties that could be very significant in your analysis. In the Amazon example, a programmatic analysis of the labels might have shown that 50 percent of the calls were related to returns, 30 percent to Prime membership, 10 percent to Kindle, and so on. Basic stats like these are a powerful way to get quick feature and label summaries.
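In code, a label summary like that is nearly a one-liner. A minimal sketch, reusing the hypothetical `dataset` from the earlier example:

```python
# Fraction of observations carrying each label.
print(dataset["agent_skill"].value_counts(normalize=True))
# returns    0.50
# kindle     0.25
# prime      0.25
```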
Two other common visualization techniques we're going to cover are histograms and scatter plots. Let's take a look. All right, let's talk about histograms. (By the way, thanks, Tom, for pre-drawing one for me.) Histograms are effective visualizations for spotting outliers in data. For example, let's say you're visualizing the distribution of hours per week that your company's employees actually work, because you're trying to make a prediction about salaries based on the number of hours your full-time employees put in. With this histogram, you can see that the majority of your employees are working between 35 and 55 hours a week. But you can also see a lower outlier over here: a couple of your employees are working 15 to 20 hours a week. Maybe you have some part-time employees who, for whatever reason, got mixed into your dataset of full-time employees. If you want to base your prediction on full-time employees only, then it's important to identify and remove these part-time employees from your dataset. In this case, you could just delete the outlier data, or you could cap it, so you don't see data for any employee who worked less than 35 hours a week. Either approach would help ensure you're only looking at the full-time employee set.

But there are other solutions. For instance, in a multi-class classification problem, you may want to figure out how to combine the outlier data with other classes rather than just ignoring or deleting it. In the call center example, we had multiple Kindle skills, and ultimately Amazon decided to combine the specific Kindle skills into a single general Kindle skill for its model training. If it's a regression problem, you can deal with outliers, or even missing data, by assigning a new value using imputation. Imputation makes a best guess, so to speak, as to what the value should be. For instance, you might take the mean of your data, say 45 hours, and use that in place of missing values.

In the salary prediction example, say the data looks something like this: Employee 1 worked 46 hours that week, Employee 2 worked 44 hours, and for Employee 3 there's no data at all. Rather than eliminating that row, or worse, putting in a zero (Employee 3 did not work zero hours; that would mess up your data), you can simply take the mean, which is 45, and use it for Employee 3. (I'm bad at drawing on this board.) It fills in the missing data: it's not a zero, I'm not ignoring Employee 3, and that observation still carries weight even though I don't know its true value.

Along with histograms, another visualization tool is the scatter plot. The idea of a scatter plot is to visualize the relationship between the features and the labels as a cloud of individual points. It's important to understand whether there's a strong correlation between features and labels. In this instance, a scatter plot might help us see the correlation between the number of hours worked and income levels, and in this case, yes, it's looking like a strong correlation. The sketch below shows both ideas, mean imputation and a correlation check, in code.
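A minimal sketch, assuming pandas and matplotlib; the hours are the hypothetical numbers from above, and the income values are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

hours = pd.DataFrame({
    "hours_worked": [46.0, 44.0, None],   # Employee 3 is missing
    "income":       [980.0, 940.0, 960.0],
})

# Impute the missing value with the column mean (45.0 here).
hours["hours_worked"] = hours["hours_worked"].fillna(hours["hours_worked"].mean())

# Check the feature-label correlation and eyeball it with a scatter plot.
print(hours["hours_worked"].corr(hours["income"]))  # close to 1.0: strong
hours.plot.scatter(x="hours_worked", y="income")
plt.show()
```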
Now, on the flip side, we might see a weak correlation if we plotted something like age against income: points scattered everywhere, nothing of value. When thinking about data preparation, keep in mind that if you don't address noisy data, it's going to hurt your model's performance. These visualization techniques and approaches are critical; your model will suffer because of noisy data points like outliers or missing values, and that results in less accurate predictions.

So far we've said that to get accurate predictions, you need clean data. But there's more to it than that: you also need an algorithm that makes sense for your business problem. Choosing the right algorithm for the job is another big step in this part of the ML pipeline, and it can be a challenge for any machine learning practitioner, especially given that there are several hundred algorithms out there. To help out, let's talk about four categories of machine learning algorithms: supervised, unsupervised, reinforcement, and deep learning.

Let's start with supervised learning. It's a popular type of machine learning because it's widely applicable and has many successful applications out in the world. The focus of supervised algorithms is on learning patterns by seeing the relationship between variables and known outcomes. It's called supervised learning because there needs to be a supervisor, a trainer, who can show the engine the right answers, so to speak. In machine learning, by the way, a trainer can be any sort of complex system: a machine, a human, or some other natural process. Imagine you're training a machine learning model to predict future earthquakes; in that case, the teacher, the ultimate source of truth, is nature herself.

Like any student, a supervised algorithm needs to learn by example. Essentially, it needs a teacher who uses training data to help it determine the patterns and relationships between the inputs and the outputs. Take this picture, for example: it's a car. (I'm bad at drawing cars.) It's got two wheels showing, a headlight, a windshield up front; that's a car, great. This one over here? That's a truck. After the training is finished, a successful learning algorithm can make decisions on its own. You no longer need a teacher to label things as car or truck; in the end, the model knows that's a car and that's a truck all by itself.

The call center use case is an example of supervised learning. We trained our model on a bunch of historical customer data that included the correct labels, the customer agent skills. That enabled the model to make its own predictions on similar data moving forward: for example, knowing that this particular call needs someone with a Kindle skill. We'll talk more about what it means for an algorithm to learn these relationships later, when we get to parameters and hyperparameters; for now, let's focus on the types of algorithms.

Supervised algorithms need good training datasets with properly labeled observations. Hang on, I need to emphasize something: this type of machine learning is only successful if the system we're trying to model is already functioning and easy to observe. If we want to train a model that labels cars, trucks, buses, or whatever, we need to make sure the training data is labeled. If it isn't, someone has to go through a large number of photos and label them manually. If such a human process isn't already in place, obtaining that ideal training dataset could be a problem, and might ultimately be a reason not to pursue a supervised learning algorithm. Assuming you do have labels, the sketch below shows the basic supervised pattern in code.
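A minimal sketch, assuming scikit-learn; the two features (wheel count, vehicle length in meters) and the labels are hypothetical stand-ins for the car/truck example.

```python
from sklearn.tree import DecisionTreeClassifier

X_train = [[4, 4.2], [4, 4.5], [6, 9.0], [8, 12.0]]   # features
y_train = ["car", "car", "truck", "truck"]            # labels from the "teacher"

# Training: the algorithm learns the feature-to-label relationship.
model = DecisionTreeClassifier().fit(X_train, y_train)

# After training, no teacher is needed to label new inputs.
print(model.predict([[4, 4.3], [6, 10.5]]))  # ['car' 'truck']
```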
So let's talk about what happens when there's no teacher in the room. Sometimes all we've got is the data: no labels provided, nobody telling you what anything is. Can something useful still be learned? Yes, and that's unsupervised learning. With unsupervised algorithms, we don't know all the variables and we don't know the patterns, so the machine simply looks at the data and tries to create labels all on its own.

A common type of unsupervised learning is called clustering. A clustering algorithm groups data points into different clusters based on similar features, in order to better understand the attributes of a specific group or cluster. For instance, let's say you sell office supplies to companies all over the world. In analyzing customer purchasing habits, an unsupervised model might identify two distinct groups, with no labels needed. Maybe one group mostly buys paper and pencils, and it turns out those are your smaller companies, whereas the other cluster buys conference tables, chairs, and big furniture items, and those turn out to be your larger companies. You never provided those labels; the purchasing habits divided the customers into buckets automatically, or more precisely, the engine did that. Clustering in this situation could help you realize that you need a different marketing strategy for each type of company.

Or consider fraud detection. A supervised algorithm can predict a particular threat that's already been classified, but the most dangerous attacks are the ones you don't see coming: the ones that haven't already been labeled. To detect an unclassified category of fraud in its early phases, such as a sudden large order from an unknown user or a suspicious shipping address, unsupervised algorithms can group malicious actors into a cluster and then analyze their connections to other accounts, all without ever knowing the actual labels of the attack. Here's a minimal clustering sketch in code.
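A minimal sketch, assuming scikit-learn; the two features (monthly paper spend, monthly furniture spend) are hypothetical, echoing the office-supplies example.

```python
from sklearn.cluster import KMeans

purchases = [
    [250, 0], [300, 40], [280, 10],          # pattern of the smaller companies
    [500, 4000], [450, 5200], [600, 4800],   # pattern of the larger companies
]

# Ask for two clusters; note that no labels are ever provided.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(purchases)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1]: the engine found the buckets itself
```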
All right, another category of algorithm that's been gaining a lot of popularity recently: reinforcement learning. Let me put this up on the board. We start with the agent, which takes an action in the environment; that produces a new state and a reward, and this becomes a loop. Unlike the first two categories, which both have an end state, reinforcement learning continually improves by mining feedback from previous iterations. In reinforcement learning, the agent continually learns through trial and error as it interacts with the environment. Reinforcement learning is broadly useful when the reward of a desired outcome is known but the path to achieve it isn't; discovering that path requires a lot of trial and error.

Let's think of Pac-Man here. (In your case, maybe it's a supply chain, but Pac-Man's more fun.) The action might be: does he go left, or does he go right? Depending on which way he goes, the reward and the state constantly change. The model is learning, and it's being graded rather than tagged or labeled. If I go left, maybe that's bad: a score of minus two. If I go right, that's good: plus two, whatever it is. Think about playing a new board game where you don't know the rules or the intricacies; you just know you've got to get to the other side of the board. As you move through the game and learn the values of certain actions, you get more familiar with the space: left bad, right good; "nope, fire-breathing dragon, negative two." The values you learn influence your future behavior. Well, I sure as heck am not going to make that move again; I'm not going to keep walking toward the dragon or the ghost. As a result, performance improves based on past experience. All right, that's reinforcement learning.

Now let's talk about deep learning algorithms. Yeah, here's a buzzword for you: deep learning. It's a reinvention of artificial neural networks. If you're thinking about a biological neural network, here's a neuron (pretend it's a neuron; I can't draw one) that connects to another nerve, and another, and so on. If you're thinking like that, you're on the right track, because just like in a biological neural network, each artificial neuron activates when the sum of its input signals exceeds a particular threshold. The thing is, a single neuron isn't sufficient for any practical classification need. Instead, we combine neurons into fully connected layers to produce artificial neural networks; we call these multilayer perceptrons. You might start with some inputs, each feeding a neuron, then a number of hidden layers, each with its own neurons, and eventually an output; the power is in the collection. How deep is deep learning in the real world? Some networks have thousands of layers of these perceptrons, and as you can imagine, the computational power required to train them is not cheap.

One important breakthrough in deep learning was the invention of convolutional neural networks, or CNNs for short, which are especially useful for image processing. The main idea of a CNN is to take nearby pixels in an image into account instead of treating each pixel as an entirely separate input. A special operation called a convolution is applied to entire subsections of the image, and when several convolutional layers are stacked one after another, each layer learns to recognize patterns of increasing complexity. Now, if instead we take the output of a neuron and feed it back as an input, to itself or to neurons in previous layers, so that everything isn't flowing in just one direction, we get what we call recurrent neural networks. It's as if the neuron remembers its output from a previous iteration, creating a kind of memory. A more complex recurrent network is the LSTM, which stands for long short-term memory; it's commonly used for speech recognition and translation, but that's a conversation for another time. Below is a tiny multilayer perceptron in code.
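A minimal sketch, assuming scikit-learn. The toy task is XOR, a classic example that a single neuron can't solve but one hidden layer can.

```python
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]  # XOR: not linearly separable, so a lone neuron fails

# One hidden layer of 8 neurons between the inputs and the output.
mlp = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=1000, random_state=1)
mlp.fit(X, y)
print(mlp.predict(X))  # ideally [0 1 1 0]
```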
Feature selection: this is our next important step, where you choose which features to use with your model. What you want is minimal correlation among your features, but maximum correlation between the features and the desired output. So select the features that correlate with your desired output.

Part of selecting the best features is recognizing when you need to engineer a feature. Feature engineering is the process of manipulating your original data into new, and potentially much more useful, features. It's arguably the most critical and time-consuming step of the ML pipeline, and it answers questions like: Do the features I'm using make sense for what I want to predict? How can I systematically take what I learned about my features during the visualization process and encode that information into new features?

For instance, looking at the raw data of our call center use case, you might have noticed that 50 percent of the customers were calling about tracking a package. After visualization, however, you might see that 25 percent of those package-tracking customers were located in the exact same city. That's a large number, and a potentially significant pattern. In this situation, you could engineer a feature for customers tracking packages in specific cities. That information might reveal patterns you otherwise wouldn't have seen.

We had some features that answered questions like: What was the customer's most recent order? What was the time of that order? Does the customer own a Kindle? When we feed these features into the model training algorithm, it can only learn from exactly what we show it. Here, for instance, we're showing the model that this purchase was made at 1:00 PM on Tuesday the 13th. Unless we want to predict something extremely specific, or we're doing a time series analysis, that raw timestamp isn't a meaningful feature to feed into our model. It would be much more meaningful to transform the timestamp into a feature that represents how long ago the order took place. Knowing, for instance, that your last purchase was months ago would help the model realize that your last purchase is probably not the reason you're calling today. Obviously, we can engineer that feature just by taking the difference between the order date-time and today's date-time; that's a much more helpful feature, and there's a short sketch of it below.

Here's another example, about image classification. Say you want to train a model to identify cars in a picture. You could do this by feeding in raw images of cars and training it to identify the car, but that won't be very effective, because those images are very complex combinations of pixels. The raw images don't include any higher-level features, such as edges, lines, and circles, the patterns a model can recognize. So during the feature engineering stage, you can preprocess the data to extract more granular features, then feed those features into the model and get better accuracy. We'll talk more about accuracy and precision in a little bit, but that's critical.
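A minimal sketch of that timestamp transformation, assuming pandas; the dates and the "today" value are hypothetical.

```python
import pandas as pd

orders = pd.DataFrame({
    "last_order_time": pd.to_datetime(["2019-04-13 13:00", "2019-01-02 09:30"]),
})

# A raw timestamp isn't meaningful to the model; days-since-order is.
now = pd.Timestamp("2019-04-16 12:00")  # stand-in for "today"
orders["days_since_last_order"] = (now - orders["last_order_time"]).dt.days
print(orders["days_since_last_order"].tolist())  # [2, 104]
```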
Now we're finally ready for training. The first step when you officially start training is to split your data. Splitting the data lets you confirm that your model is generalizable, that is, applicable outside of the training environment, on data that looks like what it will see in production. Let's head over to the board so we can investigate this a little more closely. (Once again, thanks, Tom, for doing the work for me.)

Typically, you want to split your data into three sections: training data, dev data, and test data. The training data includes both the features and the labels, and it feeds into the algorithm you've selected to produce your model. The model is then used to make predictions over the development dataset, which is where you'll likely notice things you want to tweak, tune, and change. Then, when you're ready, you run the test dataset, which only includes features, since you want the labels to be what the model predicts. The performance you get on the test dataset is what you can reasonably expect to see in production.

How much data you have determines how you ultimately split it up, but regardless, you'll want to train your model on as much data as possible, knowing that you'll need to reserve some of it for the dev phase and some for testing. If you have a lot of data, you might split it 70 percent for training, 15 percent for dev, and 15 percent for test. If you have less data, maybe it's 80, 10, and 10; you work it out as best you can.

Another important thing to note as you split your data: make sure you randomize it. This is critical; you've got to randomize the data during your split to help your model avoid bias, and that's especially true if your data arrives in a specific order. Say your data is listed sequentially. Your model will get used to that structure and adapt to the pattern as it learns, and then when you run the model against test data, that learned sequential pattern will bias your results. So to keep your model unbiased, feed it randomized data. A popular randomization technique is simply shuffling your data, and if you aren't familiar with how, no worries: there are a lot of great tools out there that will shuffle it for you, for example scikit-learn (see the sketch below).

Randomizing and splitting your training data is a critical step in the training process. A common mistake people make is not holding out test data at all, and simply testing on part of the data they trained with. That tells you nothing about how the model generalizes, and it can hide overfitting or underfitting. Let's talk about those.
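First, a minimal sketch of a shuffled 70/15/15 split, assuming scikit-learn; the features and labels here are placeholder data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # 100 hypothetical observations, 2 features
y = np.arange(100) % 2               # hypothetical binary labels

# Carve off 70% for training (shuffling is on by default)...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42)
# ...then split the remaining 30% evenly into dev and test.
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42)
print(len(X_train), len(X_dev), len(X_test))  # 70 15 15
```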
Overfitting is where your model learns the particulars of a dataset too well. It's essentially memorizing your training data rather than learning the relationships between the features and the labels, relationships it could use to build patterns that apply to new data in the future. Remember our stock data from earlier? Say the model learns the pattern that the stock price goes up at the end of the month and drops at the beginning: for example, 425 on the 30th, then 375 on the 1st. It might miss other important data that's likely impacting the price, such as the fact that April is tax season. It's clear that mixing up the rows, and looking at a lot more dates, is necessary to give the model an opportunity to learn other things from the data.

In addition to randomizing the data, it's also very important to collect as much relevant data as possible, because underfitting, on the other hand, can occur when you don't have enough features to model the data properly. That also prevents the model from generalizing, because it doesn't have enough information to predict correctly.

To really understand overfitting and underfitting, and how to avoid them, we need to talk about two things: bias and variance. Think of bias as the gap between your predicted value and the actual value, whereas variance describes how dispersed your predicted values are. That's a lot of jargon, so let's take a moment and look at it visually. A bull's-eye is a nice analogy to use here, because generally speaking, the center of the bull's-eye is where you aim your darts; in this analogy, the center of the bull's-eye is the label, the value your model is trying to predict, and each dot is a result your model produced during training.

Let me demonstrate. We start with a low bias, low variance model: everything is clustered tight, right there in the bull's-eye. Everything I predict lands in one area, without a lot of spread. Next, low variance but high bias: I'm not hitting what I want, but at least I'm getting a predictable series of responses. It's a tight cluster; I'm just not on the bull's-eye. On the other hand, high variance, low bias means I'm on target as far as the center of the spread goes, but the spread is wide; it's all over the place. Then high variance, high bias: yeah, that's the bad one. I'm all over the place and I'm not on target.

What's the ideal case? You guessed it: low bias and low variance. Realistically, though, there's a balancing act happening here. Bias and variance both contribute to errors, but what you're ultimately going for is to minimize the prediction error, not bias or variance specifically. That's the bias-variance trade-off. Bringing underfitting and overfitting back into the picture: underfitting is where you've got low variance and high bias; these models are overly simple and can't see the underlying patterns in the data. Overfitting is high variance and low bias; these models are overly complex, and while they can detect patterns in the training data, they aren't accurate outside of it.

Let's consider our use case as an example. Say, hypothetically, that we trained our model solely on data from customers who already had a Kindle, had a Prime account, and had asked a package tracking question at some point during their membership. Our model might detect a pattern showing that, say, 70 percent of Prime members call in about an Amazon device. But should the model use this pattern to make future predictions? You'd probably say no, and you'd be correct.
In this example, the model didn't even consider Alexa-related data, or DeepLens, or holiday data, or any number of other types of data points. The model is underfitted, because it hardly has sufficient information to predict, at a more granular level, why Prime members are actually calling in about an Amazon device. This is an oversimplified example, but the point remains: in testing and production, our model won't pay attention to those missing categories; it will skew the results toward only the data it was actually trained on.

One technique that can be used to combat underfitting and overfitting is hyperparameter tuning. In machine learning, there are parameters, and there are hyperparameters. Let's go back to the desk and pull up the slides. A parameter is internal to the model: it's something the model can learn or estimate purely from the data. Examples of parameters are the weights of an artificial neural network or the coefficients in a linear regression. The model has to have parameters to make predictions, and most often they aren't set by humans. Hyperparameters, on the other hand, are external to the model and can't be estimated from the data. Hyperparameters are set by humans, and typically you can't know the best value of a hyperparameter in advance, but you can use trial and error to get there. Think of hyperparameters as the knobs and levers you use to tune the machine learning algorithm and improve its performance; the right hyperparameters have to be chosen for the right type of problem. One example of a hyperparameter is the learning rate for training a neural network. Walking through this part of the process is one of the most effective ways of improving your model's performance, so make sure you take the time to conduct hyperparameter tuning thoroughly.

Speaking of which, now it's time to train your model. The process of training an ML model involves providing your algorithm with training data to learn from. As mentioned earlier, for supervised learning the training data must contain both the features and the correct prediction, which again we call the label. The learning algorithm finds patterns in the training data that map the features to the label, so that when you show the trained model new inputs, it returns accurately predicted labels. Then you can use the ML model to get predictions on new data for which you don't know the label.

For example, say you want to train an ML model to predict whether an email is spam or not spam. You provide your algorithm with training data containing emails and their known labels: spam or not spam. The algorithm trains the model on that data, resulting in a model that tries to predict whether a new email, one it hasn't seen before, is spam or not spam. We did the same thing with our call center example. We passed along features, such as whether the customer owns a Kindle (yes or no), together with the appropriate label, in this case the Kindle skill. The algorithm learned the relationships between these inputs and outputs and produced a model that could extrapolate those patterns to similar datasets. Below is a quick sketch of training with hyperparameter tuning in code.
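A minimal sketch, assuming scikit-learn; the spam-style features and labels are purely hypothetical.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_train = [[0.1, 1.0], [0.3, 2.2], [5.0, 0.2], [6.1, 1.1], [0.8, 1.7], [7.2, 0.3]]
y_train = ["spam", "spam", "not_spam", "not_spam", "spam", "not_spam"]

# C (regularization strength) is a hyperparameter: we set it, the model
# doesn't learn it. The coefficients the model learns are its parameters.
search = GridSearchCV(LogisticRegression(),
                      param_grid={"C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)           # the knob setting that worked best
print(search.predict([[0.2, 1.5]]))  # likely ['spam'] for this toy data
```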
As explained earlier, after the initial phase of training your model is done, you need to evaluate how accurate it is by running the development data you set aside through the model; this tells you how well the model generalizes. The test data can then be fed to the model for the most realistic measure of its predictions. In fact, let's circle back to this topic of accuracy and precision. While you're evaluating, you want a fit that generalizes to unseen data. Remember from our earlier discussion of overfitting: you should not fit the training data to obtain the maximum accuracy, which sounds counterintuitive, right? You do want a model that predicts accurately on previously unseen data, that's true. But if you train your model to be too accurate, it will be overfit to that specific training data.

For classification problems like the call center use case we've been dealing with all day, we're trying to predict whether a new observation will be classified as this customer agent skill or that one. One of the most effective ways to evaluate your model's accuracy, precision, and recall is to look at something called a confusion matrix. The confusion matrix analyzes the model and shows how many of the data points were predicted correctly and incorrectly. Let's take a look. In the bottom right is the class one, class one box: the true positives, where you predicted a one and the actual value was a one. For our call center case, this could mean that of all the calls your model should have routed to, say, an agent with strong Alexa skills, your model did exactly that 1,800 times. The top left is the class zero, class zero box: your true negatives. For instance, with our use case, you might predict that the model will not route certain calls to the Fresh department, and the model in fact did not route any calls to Amazon Fresh. The top right box is where you predicted a one but the actual value was a zero: your false positives. Finally, the bottom left box is where you predicted a zero but the actual value was a one: your false negatives.

To summarize: accuracy measures how close your predictions are to the truth, and it's the total number of right predictions divided by the total number of predictions. Precision is the ability to reproduce similar results, and it's defined as your true positives divided by the sum of true positives and false positives. There's a quick confusion matrix sketch below.

At this point, after you've trained your model and you're satisfied with its accuracy based on the techniques we've talked about, it's best practice to evaluate it by running a few different algorithms, ideally within the chosen algorithm category. So if you're working with a supervised classification algorithm like the call center's, try a different classification algorithm as well, for example a decision tree or k-nearest neighbors. This will give you a better idea of how to get the best fit and the best results for your model.
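A minimal sketch, assuming scikit-learn; the label vectors are hypothetical (1 = route to the Alexa skill, 0 = anything else).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # the model's predictions

print(confusion_matrix(y_true, y_pred))
# [[3 1]    row 0: true negatives, false positives
#  [1 3]]   row 1: false negatives, true positives
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total = 6/8 = 0.75
print(precision_score(y_true, y_pred))   # TP / (TP + FP)   = 3/4 = 0.75
```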
Okay: deployment and monitoring. Here we are. You've prepared your data, cleaned it, and visualized it. You've selected your features, split your data, tested your model, and tuned it several times (let's be honest). After you've done all that, and you're satisfied with the model's predictions on unseen data, it's time to deploy your model into production so it can begin making your predictions.

One of the primary ML tools for building, training, and deploying models is Amazon SageMaker. Amazon SageMaker is fully managed and covers the entire end-to-end pipeline we've just discussed. In the build module, SageMaker provides a hosted environment for you to work with your data, experiment with your algorithms, and visualize your output. The train module takes care of model training and tuning at high scale. Then there's the deploy module, which provides a managed environment for you to host and test models for inference, securely and with low latency. There are additional tools in SageMaker that help you label data, manage your compute costs, take care of forecasting, and much more, but I'll leave those to one of our machine learning tech leaders here at Amazon, who is going to discuss them in a different session later today. You don't want to miss that one; it's going to be important.

All right, getting back to deploying and monitoring. You'll want to remember to monitor your production data and retrain your model if necessary, because a newly deployed model needs to reflect current production data; you don't want it to get out of date. Since data distributions can drift over time, deploying a model is not a one-time exercise, it's a continuous process. (You're not going to be out of a job.) It's good practice to continually monitor the production data and retrain if you find that the production data distribution has deviated significantly from the training data distribution. Evaluating in a production setting is a little bit different, too: now you've got to have a very concrete success metric to measure against. In our call center use case, our routing experiments were predicated on the assumption that being able to predict skills more accurately would reduce the number of transfers. In production, we could actually put that assumption to the test.

Okay, that takes us to the end of the ML pipeline. If that felt like a bit of a whirlwind, that's because it was. There's a ton that goes into implementing an ML solution; it's a process that most often takes several weeks or months, so don't be scared of it. We've really just skimmed the surface, but hopefully you now have enough of a foundation in the process, the key terms, and the concepts that, for the rest of today's events, you'll be able to dive deeper into the content, services, and tools that most interest you. With that, I'm Blaine Sundrud. I hope you got something good out of today. Have an excellent rest of your day, and I'm going to throw it back to you all. Have a great day.