There are several Naive Bayes algorithms; we will look at the Bernoulli version. The features in the Bernoulli version are discrete values, 0 or 1, and they indicate the presence or the absence of a feature. It's either there or it's not. There's no 0.5; it's not half-way there.

Algorithms that try to learn the probability of y given x, such as linear regression, are called discriminative learning algorithms. Algorithms that instead try to model the other way around, the probability of x given y, are called generative learning algorithms. So for instance, if y indicates whether an example is a dog (0) or an elephant (1), then the probability of x given y = 0 models the distribution of dogs' features, and the probability of x given y = 1 models the distribution of elephants' features. Naive Bayes is an example of a generative learning algorithm. It's called naive because it makes a strong assumption: that all of the features, all of the x's, are independent of each other given the class y. This is rarely true in the real world, yet given this seemingly bold assumption, the algorithm often produces excellent results.

So in the linear regression example, were my two features, square footage and number of bedrooms, completely independent? You nodded your head yes. Are they completely independent? The bigger the square footage of your house, the more bedrooms you can put in it, right? So there is some relationship between those two. Here, we assume all the features are completely independent, so the algorithm makes this very bold, very strong assumption.

Getting back to the problem we want to solve: we want to determine whether a given email is spam or not spam. This is an example of a text classification problem. After modeling p(y), called the class prior, and p(x given y), the algorithm can use Bayes' theorem to derive the posterior, p(y given x), as shown in the sketch below.

So in my example, I first created a dictionary of words. These were all my words: little, tad, rufus, vacation, dinner, restaurant, eating, drinking, sleeping, equal, the. Let me go to the right a little more here. Price and buy were the last ones, over on the right; those are the two words that I used to help identify spam emails. Imagine I had eight non-spam emails, and I put 1s in the right places, so a 1 here says the word hello occurred, a 0 says goodbye did not occur, a 1 says my name, Dave, appeared, and so forth. And you'll notice that for all of these emails, price and buy were absent. These were emails that were not spam; we looked at them ahead of time. Then these rows are our p(x given y = 1) examples, the spam emails, so each contains one of the keywords price or buy. In this case, I had price and buy. This whole matrix is called X; it's all our features. The outcomes, y, are what we wanted to predict. It's a total of 16 emails, so there are 8 zeros here to say the first 8 emails in our training data were not spam, and the second 8 were spam.

I also wanted to comment that I switched over after coding up my first linear regression. It was interesting and it was fun, but as I got into it more, I discovered scikit-learn and SciPy and went off and started using these learning algorithms that you can just import from a library, call a function, and start looking at results.
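For reference, here is the posterior in symbols. This is the standard Bernoulli Naive Bayes formulation in my own notation, not copied from the slides: Bayes' theorem, with the naive assumption factoring the likelihood over the individual word features x_j.

```latex
p(y \mid x)
  = \frac{p(x \mid y)\, p(y)}{p(x)}
  = \frac{p(y) \prod_{j=1}^{n} p(x_j \mid y)}
         {\sum_{y' \in \{0,1\}} p(y') \prod_{j=1}^{n} p(x_j \mid y')}
```

The classifier simply picks whichever class, spam or not spam, gives the larger numerator.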
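As a rough sketch of what that 16-email training matrix could look like in code (the vocabulary, the random 0/1 fill, and the variable names are my own illustration, not the exact slide contents):

```python
import numpy as np

# Hypothetical vocabulary; the last two words, "price" and "buy",
# are the ones used to flag spam in this example.
vocab = ["little", "tad", "rufus", "vacation", "dinner", "restaurant",
         "eating", "drinking", "sleeping", "equal", "the", "price", "buy"]

rng = np.random.default_rng(0)

# 8 non-spam emails: random presence/absence of ordinary words,
# with "price" and "buy" (the last two columns) always absent.
X_ham = rng.integers(0, 2, size=(8, len(vocab)))
X_ham[:, -2:] = 0

# 8 spam emails: random fill, but guarantee each contains
# "price", "buy", or both.
X_spam = rng.integers(0, 2, size=(8, len(vocab)))
X_spam[X_spam[:, -2:].sum(axis=1) == 0, -2] = 1

# Stack into the full 16 x 13 feature matrix X and the label vector y:
# the first 8 rows are not spam (0), the second 8 are spam (1).
X = np.vstack([X_ham, X_spam])
y = np.array([0] * 8 + [1] * 8)
```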
I think all the rest of these examples are scikit-learn and SciPy based. This is now my test data, where I just randomly made up examples, putting 1s and 0s in different columns. This third email, email two counting from zero, has the word price; number 3 has buy; and this number 4 down here has both price and buy. That's my test data.

So it's pretty sleek. You just import the Bernoulli Naive Bayes class, create an instance of it by calling it as a function, and pass the training data X and y into it. In all these routines, the training method is called fit. The fit method for this object is called, and the training is done when that call returns; all the training happens inside that fit function. A lot of these routines also have hyperparameters that you have the choice of setting. I don't recall off the top of my head exactly what these are, but if you go to scikit-learn there's great documentation there, and you can see what they did; I chose to set the parameters this way.

So now we're going to give the algorithm our test data and have it make its predictions. I extract a row and reshape it; I think it comes out as a flat, one-dimensional array when you do this, and I needed to reshape it into a row vector, because that's what the predict method on the BernoulliNB object wanted. I think it yelled at me when I passed it a vector of the wrong shape. So it walks through all the test cases and makes a prediction on each one, and if the predicted y value is equal to zero, it prints "not spam"; otherwise it prints "spam". And it did pretty well, but it messed up on number two, the one that contained price. Index two contained price. So these first ones were not spam, this contained price, this contained buy, and this contained price and buy. It did pretty well; it got two out of three.

>> Is it because of the data [INAUDIBLE] less again, or.

>> I'm not sure. You can get different results by changing these hyperparameters up here, and I remember playing around, dialing these numbers, and some of my runs produced different results. Also, a lot of these algorithms will call the system's random functions to introduce randomness into the algorithm, so you'll call it one time and get a certain set of results, and then you'll call exactly the same method again and get different results. They're trying to introduce some randomness; I don't have a good explanation at this point for why they do that. There is a parameter you can pass in for the random seed, and you can stop that from happening. I saw an example where there's a named parameter you can pass in, like you see here, and you can pass in a certain value and then change it if you want, or not.

But with all of these, you always have to train it, give it some data it hasn't seen, and then measure how effective it is, and you decide if that's good enough or not. Is this good enough? I don't know. [LAUGH] We've identified four out of five emails correctly, two as not spam and two as spam, and then it was wrong on that one.
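Putting that whole workflow together, here's a minimal sketch, assuming the X, y, and vocab arrays from the earlier sketch. The hyperparameter values shown are just BernoulliNB's documented defaults, not necessarily the ones on the slide, and the test rows are my own made-up versions of the ones described above:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Create the classifier; alpha (Laplace/Lidstone smoothing) and
# fit_prior are among the hyperparameters you can dial.
clf = BernoulliNB(alpha=1.0, fit_prior=True)

# All the training happens inside this one call.
clf.fit(X, y)

# Five made-up test emails (same 13 columns as the training data):
# rows 0-1 have neither keyword, row 2 has "price", row 3 has "buy",
# and row 4 has both.
X_test = np.zeros((5, len(vocab)), dtype=int)
X_test[2, -2] = 1   # "price"
X_test[3, -1] = 1   # "buy"
X_test[4, -2:] = 1  # "price" and "buy"

for i in range(X_test.shape[0]):
    # predict() expects a 2-D array, so reshape the row
    # from shape (13,) to shape (1, 13).
    row = X_test[i].reshape(1, -1)
    y_pred = clf.predict(row)[0]
    print(f"email {i}: {'not spam' if y_pred == 0 else 'spam'}")
```

One note on the randomness discussion: BernoulliNB's training itself is deterministic, but for scikit-learn estimators that do use randomness, there is a random_state parameter you can set to make repeated runs reproducible.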