So we've talked about the loss function already, but that is used for model optimization. Now we need statistical testing to help us answer the question: does the model perform in a way that allows for useful action to be taken medically? So now that we've trained a classifier to minimize loss, and we've seen that that learning holds true on a holdout test set, we need to figure out some meaningful metrics to evaluate the performance of our new model with statistics. The initial temptation is to look at something like accuracy to determine how well a model fits the data set. To some extent, this tendency is related to our analogy of a final exam in school, where we want to know the proportion of correct answers out of the total number. But what if we had a medical data set that was sampled as a holdout test set for a rare disease?

>> If the number of positive cases was 1 in 10, then a machine learning model could achieve 90% accuracy by calling all of the cases in the data set negative. That accuracy doesn't reflect the fact that the model is actually not useful at all, because it will not catch any of the disease cases even though the reported accuracy is 90%. This is an exaggerated example, but it explains why we need many other measures to assess the true performance of a model on a given task. Picking the right metric is critical for evaluating machine learning models and avoiding situations like this, especially in healthcare, where the stakes are high.

>> The output of a trained machine learning classifier for a categorical labeling task will typically be a probability score, as we've talked about, usually between zero and one for a desired output label. We previously mentioned that the difference between the model's probability score and the true label is the basis for calculating the loss, or quantifying the error.
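To make that pitfall concrete, here's a minimal Python sketch. The 1-in-10 class balance matches the example above; the always-negative "model" is purely hypothetical:

```python
# An imbalanced test set: 1 positive case among 10 total.
labels = [1] + [0] * 9

# A useless "model" that calls every case negative.
preds = [0] * 10

# Accuracy looks great...
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

# ...but recall (the fraction of true positives caught) is zero.
recall = sum(p == 1 and y == 1 for p, y in zip(preds, labels)) / sum(labels)

print(accuracy)  # 0.9
print(recall)    # 0.0
```

The 90% accuracy hides the fact that the model catches none of the disease cases, which is exactly why we need metrics beyond accuracy.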
If we're trying to understand the performance of our classifier in more concrete terms, in particular the typical ways in which we might use it in the real world, then we'll have to choose a threshold that binarizes the output into a specific category prediction. In other words, we convert that probability to either a one or a zero. The most common approach here is to choose a threshold of 0.5 as the middle ground, so that anything greater than 0.5 is a positive decision for the label, and anything less than 0.5 is a negative decision for the label. With that threshold, the common metrics used in medical testing can then be calculated. But choosing a threshold without more useful information about our model's performance on a given data set or task doesn't really make sense. After all, 0.5 seems somewhat arbitrary given all the work we've put into the model so far. So we'll need something that can give us a more global understanding of our model at every possible threshold.

>> Let's now introduce ROC curves, or receiver operating characteristic curves, which are a wonderful way of evaluating model performance across a range of thresholds at one time. Algorithms that were trained for discrete labels, such as disease or no disease, are most suited to this approach. If a model can detect multiple classes, we apply an ROC curve for each one. So, for example, if you have three classes named X, Y, and Z, you'll have one ROC curve for X classified against Y and Z, another ROC curve for Y classified against X and Z, and a third for Z classified against X and Y. But before we can build an ROC curve, we should know about some basic metrics for determining the performance of the model. In order to understand the implications of an ROC analysis for a medical task, we should know the basics of statistical testing. We can start to calculate some of these metrics now, assuming a threshold of 0.5 for our trained model.
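As a sketch of what binarizing at a threshold, and then sweeping that threshold, looks like, here's some plain Python; the scores and labels are made-up illustrative values, not output from any real model:

```python
# Hypothetical model probability scores and their true labels.
scores = [0.10, 0.35, 0.40, 0.65, 0.80, 0.90]
labels = [0,    1,    0,    0,    1,    1]

def tpr_fpr(threshold):
    """Binarize scores at the threshold, then return (TPR, FPR)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    tpr = tp / sum(labels)                  # sensitivity
    fpr = fp / (len(labels) - sum(labels))  # 1 - specificity
    return tpr, fpr

# One (TPR, FPR) point per candidate threshold; plotting these
# points traces out the ROC curve.
roc_points = [tpr_fpr(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

In practice a library routine such as scikit-learn's `roc_curve` does this sweep for you, but the mechanics are just this: one confusion matrix per threshold.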
The fundamental analysis of performance for machine learning classification problems, where an output can be two or more classes, is a table that contains the different combinations of predicted and actual values. This is known as the confusion matrix. There's a very good chance that this content will be familiar to you, so we'll do a brief review and then tie it back into the ROC curve.

>> This table is a simple but incredibly powerful tool for understanding several metrics that are important to our medical classification task. Some of the most useful calculations we can do with this table are recall, precision, specificity, and accuracy, and, most importantly, we can use it to build the ROC curve. Let's understand true positive, false positive, false negative, and true negative using a classic scenario and some numbers. Say we've developed a smartphone app that can predict heart attack risk in 90 days using the heart rate function on a wearable device. We'll assume that we've already trained our machine learning model and are now going to see how it performs on a holdout test set of 200 cases. In this test set, there are 120 positives, where the user did have a heart attack, and 80 negatives, where the user did not have a heart attack. When we run our model, it predicts 100 negative and 100 positive. Let's see how this breaks down in our confusion matrix.

>> Now let's take a look at what we can calculate from this confusion table. The first term, true positive, refers to cases that were positive and that our model predicted positive. True negatives, on the other hand, are cases that were negative and that our model predicted negative. False positives are cases that were negative but that we predicted positive, and false negatives are the reverse: cases that were positive but that we predicted negative. So we can update our matrix with these new terms. And here are the meanings of the metrics we can derive from them.
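The example above gives only the marginal totals (120 actual positives, 80 actual negatives, 100 predicted positive, 100 predicted negative), not the individual cells, so the TP/FP/FN/TN breakdown in this sketch is an assumed one, chosen merely to be consistent with those totals:

```python
# Assumed cell counts -- the example only fixes the row/column totals.
tp, fn = 90, 30   # 120 actual positives: 90 caught, 30 missed
fp, tn = 10, 70   # 80 actual negatives: 10 false alarms, 70 correct

# Sanity checks against the totals stated in the example.
assert tp + fn == 120 and fp + tn == 80     # actual positives / negatives
assert tp + fp == 100 and fn + tn == 100    # predicted positives / negatives

# Rows: actual positive / actual negative.
# Columns: predicted positive / predicted negative.
confusion = [[tp, fn],
             [fp, tn]]
```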
The first is accuracy, which is the number of all correct predictions divided by the total number of examples in the data set. The best accuracy is one, whereas the worst is zero. Then we have sensitivity, or recall, which asks: out of all the positive data points, how many did the model predict as positive? Specificity asks: out of all the negative data points, how many did the model predict as negative? We have two other related terms to look at here. The first is precision, or positive predictive value, which is defined as how often the model is correct when it predicts positive. Negative predictive value, on the other hand, is how often the model is correct when it predicts negative. Note that both positive and negative predictive values are influenced by the prevalence of the condition in the test set, and this can be misleading.
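Putting those definitions into code: since the heart attack example only fixes the totals (120/80 actual, 100/100 predicted), the cell-level counts below are assumed, chosen only to be consistent with those totals:

```python
# Assumed confusion-matrix cells for a 200-case test set
# (90 + 30 = 120 actual positives, 10 + 70 = 80 actual negatives).
tp, fp, fn, tn = 90, 10, 30, 70

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # all correct / all cases
sensitivity = tp / (tp + fn)                   # recall: positives caught
specificity = tn / (tn + fp)                   # negatives caught
precision   = tp / (tp + fp)                   # positive predictive value
npv         = tn / (tn + fn)                   # negative predictive value

print(accuracy, sensitivity, specificity, precision, npv)
# 0.8 0.75 0.875 0.9 0.7
```

Note how precision (0.9) and NPV (0.7) depend on the 60/40 prevalence in this particular test set; with a rarer condition, the same sensitivity and specificity would yield very different predictive values.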