This week we're going to kick it off with support vector machines. The support vector machine is a commonly used approach for supervised machine learning. In its most basic form, we can think of an SVM model as trying to separate two classes of data which are color coded in a two-dimensional scatter plot. We want to separate them with a straight line, so the goal is really to find a linear equation that best separates these two classes. If we can find an equation that can separate them, then the classes are considered to be linearly separable. Let's look at an example of this, and for that we're going to go to Major League Baseball. I'm going to make the problem a bit easier and a little cleaner for us for teaching, but we'll see how we can deal with some ambiguity later on. Now the MLB actually captures a lot of data about pitches, though a fair bit of it is actually inferred data, so it's a little unclear where each element comes from. But my goal for this is as follows: can we use pitching data, specifically the speed of the ball and the amount of spin it has when leaving the pitcher's hand, to predict the type of pitch it will be? We'll even make this a bit easier. We'll just consider fastballs and curveballs, and we'll see if this is a problem which is linearly separable. So let's start by bringing in a couple of our module imports for our analysis. The data for this is stored in our assets folder as baseballsvmdata.zip, and contains all of the pitching data for a single season. I'm going to use matplotlib here as well, and there's a special module, zipfile, which we'll be using for imports. So let's bring this in and take a look. Our DataFrame is huge and it's filled with rich data. We've got player names, descriptions of plays, batting orders, and even the position of the ball over the plate. So we can do a little visual inspection of the attributes that we're interested in.
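A minimal sketch of that loading pattern follows. In the course assets the archive is baseballsvmdata.zip; here we build a tiny stand-in archive in memory (with assumed column names) so the zipfile-plus-pandas pattern is runnable anywhere.

```python
import io
import zipfile
import pandas as pd

# Stand-in for "assets/baseballsvmdata.zip"; the real file holds a full
# season of pitch data, these two rows are just illustrative.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("pitches.csv",
               "pitch_type,effective_speed,release_spin_rate\n"
               "FT,94.3,2250\n"
               "CU,78.1,2600\n")

# Read the first (and only) file inside the archive into a DataFrame
with zipfile.ZipFile(buf) as z:
    with z.open(z.namelist()[0]) as f:
        df = pd.read_csv(f)

print(df.shape)  # (2, 3)
```

With the real archive you would simply pass the path to `zipfile.ZipFile` instead of the in-memory buffer.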
So here, I'm just going to make a scatter plot, and I'm going to look at the effective speed and the release spin rate. I'm going to set the size of our dots to 1, so very small, and I'll change the size of the figure so it's 10 inches by 4 inches. All right, there we go, so this gives us some rough parameters for discussion. We see everything is sort of clumped together, and there doesn't seem to be a huge trend other than we have fast pitches that tend to have a spin rate which is homogeneous. When pitches slow down a bit, the spin rate can be more variable, so we need to make this a little easier. Let's color it by pitch type. I'm only interested in two pitches here, just to make it a little easier for us, at least to start with. So we're going to take out fastballs and curveballs, and they're coded in the data as FT or CU. Then I'm going to set the color of all of them to blue, and then I'm going to take our curveballs and turn them orange. One last cleanup: sometimes there's missing data because we've got a big data set. I'm just going to drop those observations, but you might want to impute data instead. Okay, so we'll drop those and then we'll plot it again. All right, so that's kind of interesting. This looks like something which is almost linearly separable. You can see that by reducing it to these two classes of pitches, we get one cluster which is high speed, this blue cluster over here, and another class which is lower speed or off-speed pitches over here. And you can see we might even be able to draw a line here and get a decent classification. Now usually we're interested in building models with more features than just two, but these two-dimensional plots make for a good demonstration. And perhaps it's confusing at first, but I'm going to more formally denote each feature with a subscript.
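Here is a runnable sketch of that filter-color-drop-plot sequence. The column names and the synthetic two-cluster data are assumptions standing in for the real pitch DataFrame.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in: 50 fastballs and 50 curveballs with plausible values
rng = np.random.default_rng(1337)
df = pd.DataFrame({
    "pitch_type": ["FT"] * 50 + ["CU"] * 50,
    "effective_speed": np.concatenate(
        [rng.normal(94, 1.5, 50), rng.normal(78, 2.0, 50)]),
    "release_spin_rate": np.concatenate(
        [rng.normal(2250, 100, 50), rng.normal(2550, 120, 50)]),
})

# Keep only fastballs and curveballs, and drop observations with missing data
df = df[df["pitch_type"].isin(["FT", "CU"])].dropna(
    subset=["effective_speed", "release_spin_rate"])

# Everything blue by default, then curveballs recolored orange
df["color"] = np.where(df["pitch_type"] == "CU", "orange", "blue")

fig, ax = plt.subplots(figsize=(10, 4))
ax.scatter(df["effective_speed"], df["release_spin_rate"],
           s=1, c=df["color"])
ax.set_xlabel("effective_speed")
ax.set_ylabel("release_spin_rate")
```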
So we've got feature one, effective speed, which is x sub 1, and feature two, the release spin rate, which is x sub 2. And then we can describe all of these features together in the variable capital X. You don't have to use this notation in your own work, and we didn't use it in the logistic regression. But it's common and it's going to be used in the documentation for most machine learning libraries, including scikit-learn, so I want to get you used to it. At the same time, it's common to use the variable y-hat to represent the output variable that we're trying to classify. We'll use that too, although in the code we'll just call it y. Now remember, the goal of an SVM classifier is to find the optimal line, which in general terms we call the hyperplane, to distinguish between our classes across all of the features in X. This line is called the maximum or large margin classifier, and is thought of as a linear band which separates the two classes. This band is sometimes called the street, and the edges of the street are given by parallel lines which pass through the first points from each class which are closest to the line. These points are called the support vectors, and that's where SVMs get their name, and we can actually take a look at them. So let's see an example. I'm going to bring in from scikit-learn the svm module. That's going to have our SVC class, or support vector classifier. Now you can do much more than linear classifiers with scikit-learn, and we're going to talk about that a bit later on. But right now we're going to create just this linear support vector classifier. Now in my code today and in the future, you'll see this random_state parameter set to 1337. That's just an arbitrary number that I picked so that the results I get here are reproducible as the results you'll see as well. Okay, so let's form our training and test sets. I'm going to use some smaller data here.
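To make the notation concrete, here is a small sketch: two made-up pitches written as rows of capital X, their labels as y, and the linear classifier we just described.

```python
import numpy as np
from sklearn import svm

# Each row of X is one pitch; column 0 is x1 (effective speed),
# column 1 is x2 (release spin rate).
X = np.array([[94.3, 2250.0],   # a fastball
              [78.1, 2600.0]])  # a curveball

# y holds the labels (the y-hat targets) we want to predict
y = np.array(["FT", "CU"])

# A linear support vector classifier; random_state=1337 is just an
# arbitrary seed so results are reproducible
clf = svm.SVC(kernel="linear", random_state=1337)
```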
So I'm going to take from our DataFrame the first 1,000 entries and put them in a variable called df_pitches. Then I'm going to separate this into X features and y target values, so I'll create X_train. That's going to be our two features, effective speed and release spin rate, and I'm just going to take the first 500 of those. Then y_train, which holds the labels that we want to predict, that's either fastball or curveball. Now we have to do the same thing with the test set, so I'll just take entries 500 to 1,000 for both. So essentially we have equal sized test and train sets. Now the beauty of scikit-learn's API becomes obvious: we fit and we evaluate the quality of this model just like we did with our logistic regression. So we'll call clf.fit and then we'll score it right away. All right, so we built a perfect classifier. Now this isn't very common in practice; usually you're looking at a harder problem to solve where there is much more noise. One of the interesting things we can do from here is get the support vectors for the model, the items that are closest to the street, or to the line which separates these two classes. So first, let's plot our data points again. This time we're only going to look at the pitch data we decided to use in the model, and I'm going to make these points a bit bigger for visual inspection. So here I'm going to set the size to be 5, and I'll set the color to be the color column. And now I want to circle the support vectors. We get the support vector list from the model: on the clf model itself, after it's been fitted, we can access the support vectors using the support_vectors_ attribute. So I'm going to create some new data points, and I'm actually going to plot the support vectors as if they were another scatter plot on top. This is a common matplotlib pattern, a design pattern if you will.
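Here is a sketch of that fit-and-score step. The data is a synthetic stand-in for the first 1,000 pitches (the real notebook slices the DataFrame), but the train/test split, the fit, the score, and the support_vectors_ lookup follow the same shape.

```python
import numpy as np
from sklearn import svm

# Two synthetic clusters standing in for fastballs and curveballs
rng = np.random.default_rng(1337)
speed = np.concatenate([rng.normal(94, 1.5, 500), rng.normal(78, 2.0, 500)])
spin = np.concatenate([rng.normal(2250, 100, 500), rng.normal(2550, 120, 500)])
X = np.column_stack([speed, spin])
y = np.array(["FT"] * 500 + ["CU"] * 500)

# Shuffle so both classes appear in each half, then split 500/500
order = rng.permutation(1000)
X, y = X[order], y[order]
X_train, y_train = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]

# Fit and score, just like with logistic regression
clf = svm.SVC(kernel="linear", random_state=1337)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))      # near 1.0 on this separable data

# The fitted model exposes its support vectors directly
print(clf.support_vectors_.shape)     # (n_support_vectors, 2)
```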
I'm going to plot them as giant empty circles: they're round, black on the outside so they render as rings, and quite large. The effect is that it should highlight the different items, and this is pretty interesting. You can see here we have our one class of pitches, these would be our curveballs, and we have our other class of pitches here, these are the fastballs. And you might be saying, well, this doesn't look like the image that we saw earlier. But remember, we're only looking at the first 1,000 items inside of this set. And you can see clearly here where some hyperplane would be between these; I find that image in particular really interesting. So I want to zoom in on those support vectors, and I want to actually calculate that hyperplane. Now you can just ignore this code, I've just put it here. It's just the calculation of a line and the plotting of a line inside of matplotlib based on the coefficients in the model. If you're interested, feel free to dig into that in more detail, but I'm just going to plot it directly on top of our figure. All right, so this is a zoomed-in portion. The zooming in was actually really easy: we just set our x and y limits, and that changes our viewport; the other points are still rendered, just outside of our viewport. And we can see here the hyperplane that separates the blue fastballs and the orange curveballs. Now I want you to look at this image for a minute, because I had to. The support vectors are supposed to be the points which are closest to the street, closest to this hyperplane. So why are some other points actually closer? For instance, these points are closer to the street, and these points are closer to the street. So think about that for a moment and think about how we might tackle the problem; we'll just pause for a minute. Okay, remember that for this, we plotted all of our points in the data set.
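For those who do want to dig into the line calculation, here is the idea in miniature. A fitted linear SVC's decision boundary satisfies w0·x1 + w1·x2 + b = 0, so to draw it we solve for x2 as a function of x1. The four training points are made up purely for illustration.

```python
import numpy as np
from sklearn import svm

# Tiny illustrative fit so the line calculation is runnable on its own
X = np.array([[0.0, 0.0], [1.0, 0.0],   # class 0 along x2 = 0
              [0.0, 2.0], [1.0, 2.0]])  # class 1 along x2 = 2
y = np.array([0, 0, 1, 1])
clf = svm.SVC(kernel="linear", random_state=1337).fit(X, y)

# Boundary: w0*x1 + w1*x2 + b = 0  =>  x2 = -(w0*x1 + b) / w1
w = clf.coef_[0]
b = clf.intercept_[0]

def boundary_x2(x1):
    return -(w[0] * x1 + b) / w[1]

print(boundary_x2(0.5))  # midway between the classes, about 1.0
```

In the notebook you would evaluate `boundary_x2` at the viewport's x-limits and pass the two endpoints to `ax.plot` to draw the hyperplane on the figure.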
But we only actually trained our model on a few of these, the first 500. Our test set has observations that our model hasn't been trained on. And most of those points closest to the street actually happen to be in the test set, and that's okay. It shows that our model is actually able to form a good generalization, at least with respect to this particular data. So let's plot that again, but now let's render our test set, the points that we held out to see really how good our model would be, and let's render them in red. So this is pretty interesting: the orange ones here and the blue ones over here, that's what we trained our model on, and based on that, the street was calculated to be here. Then we took another 500 points and we classified them as either being fastballs or curveballs. And these points here were actually classified correctly as curveballs. If our model had been able to see these points and had learned on them previously, then the hyperplane would actually be shifted over a little bit. And this shows the general approach of SVMs to trying to segment the space between these classes, so that it gives you some flexibility as you see new points. I think it's really useful to see how SVMs are built over time, and so I want to create a little frame-by-frame animation for you. Remember, we need to see at least one example of every class of data before we can train an SVM. So this means we need to sort the data so that fastballs and curveballs both appear right at the start and are interspersed throughout. It's a bit of ugly code, so you can just ignore it, but I'm going to do it here for you. And again, this is part of that authentic experience of investigating as a data scientist. So here you see we have a curveball, it's orange, and this was its effective speed and release spin rate. And it was followed by a fastball, it's blue, and so forth.
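One way that interleaving can be done is sketched below; this is an assumption about the approach, not the notebook's exact code, and the tiny DataFrame is made up. The point is just that after the shuffle, every prefix of the data contains both pitch classes.

```python
import pandas as pd

# Made-up frame: four fastballs followed by four curveballs
df = pd.DataFrame({
    "pitch_type": ["FT"] * 4 + ["CU"] * 4,
    "effective_speed": [94, 95, 93, 96, 78, 77, 79, 80],
})

# Split the classes and renumber each from 0
fast = df[df["pitch_type"] == "FT"].reset_index(drop=True)
curve = df[df["pitch_type"] == "CU"].reset_index(drop=True)

# Stack them and stably sort on the shared 0..n index,
# which alternates the rows: FT, CU, FT, CU, ...
interleaved = (pd.concat([fast, curve])
                 .sort_index(kind="stable")
                 .reset_index(drop=True))

print(interleaved["pitch_type"].tolist())
# ['FT', 'CU', 'FT', 'CU', 'FT', 'CU', 'FT', 'CU']
```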
All right, so we're going to go frame by frame; we'll probably take the first couple of frames together. Frame by frame, we're actually going to create the support vector machine. We'll create one for each of these frames, we'll render the frame, and then we'll look at it in an animation. So matplotlib has this handy animation function called FuncAnimation. Now this is actually really handy to create animated gifs, and that's what we're going to be doing here. As we've seen, our actual model fitting is pretty easy: we have some X data, these are our data frames, so a DataFrame will be passed in here, and we've got two different attributes that we're interested in. Then we have some y data, which is what we're trying to predict, and then we're going to return the fitted model. So the actual machine learning part of it is done very nicely in this tiny little function. What we're going to do to actually build the frames that we're going to animate is this function here. This function, update, is actually going to be called for every single frame that we want to build, and the frame number is going to be some number: 1 for the first one, 2 for the second one, and so forth. So there's a bunch of code that we have to write here to do that. First we want to clear whatever our current plot is; think about this as if you're plotting just a single frame, because essentially we are. Then I want to take some number of observations, so let's say the first time this is called the frame number is 1 or 2, and we want to add 2 to that, so we're only going to take a handful of observations. We're going to build the scatter plot just like we did before, we're going to fit our model on the exact same data, we're going to plot our support vectors just like we did before, plot our hyperplane just like we did before, set our x and y axis limits, so we'll set our viewport, and then we're going to return the axis.
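The steps above can be sketched like this. The data is a synthetic interleaved stand-in, the axis limits are assumed values, and for brevity this version skips the hyperplane line, but the fit_svm helper, the update function, and the FuncAnimation call follow the structure just described.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
import numpy as np
from sklearn import svm

# Interleaved stand-in data: even rows fastballs, odd rows curveballs
rng = np.random.default_rng(1337)
X = np.empty((100, 2))
X[0::2] = np.column_stack([rng.normal(94, 1.5, 50), rng.normal(2250, 100, 50)])
X[1::2] = np.column_stack([rng.normal(78, 2.0, 50), rng.normal(2550, 120, 50)])
y = np.array(["FT", "CU"] * 50)

fig, ax = plt.subplots(figsize=(10, 4))

def fit_svm(X, y):
    # The machine learning part really is this small
    return svm.SVC(kernel="linear", random_state=1337).fit(X, y)

def update(frame):
    ax.clear()                       # start the frame from a blank plot
    n = frame + 2                    # ensure at least one pitch per class
    clf = fit_svm(X[:n], y[:n])
    colors = np.where(y[:n] == "CU", "orange", "blue")
    ax.scatter(X[:n, 0], X[:n, 1], s=5, c=colors)
    sv = clf.support_vectors_        # ring the support vectors in black
    ax.scatter(sv[:, 0], sv[:, 1], s=200,
               facecolors="none", edgecolors="black")
    ax.set_xlim(70, 100)             # fix the viewport across frames
    ax.set_ylim(1900, 3000)
    return ax

# FuncAnimation calls update(0), update(1), ... to build each frame
anim = FuncAnimation(fig, update, frames=50)
# anim.save("svm.gif", writer="pillow")  # writing the gif needs Pillow
```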
Now all we actually have to do is run this; I'm going to run it for 350 frames. It's going to take a little while to run, and we're going to build the animation as an animated gif. Okay, so let's take a look at that image now here in the notebook. We can actually call the IPython display function and just display this svm.gif. So we can see here that the hyperplane is created, and the support vectors are really bisecting the classes. And then we can see that it starts to tilt, the slope changes as new support vectors are found. Not all of the support vectors are on the screen at one time, actually. Some of them, because of our viewport, were being rendered off screen, and that's why you would only see a couple of points at a time. And you can see here we have some stability, and then it bounces around a little bit as new points are added. So this is the basics of support vector machines. There's a lot more that we should be considering when we're actually building these models, but I wanted to give you the intuition of how an SVM works. Essentially, we have these two classes of data in a two-dimensional space, and we're just looking for the right line to bisect that. Let's look a little bit more in detail, though, at how SVMs can grow in complexity, and how we can apply new techniques to them to deal with more data classes and more features.