So far in this course, we've introduced a variety of evaluation, regression, and classification schemes, and I've shown different ways we might use regression and classification algorithms. So it's a good time to take a step back and ask: how can we actually apply some of these concepts with code? In this lecture, I want to introduce a simple codebase we can use throughout the rest of Course Three, for the purpose of evaluating regressors and classifiers, and later on for implementing our training, validation, and testing pipelines. A nice example to work through in this lecture is going to be building a model that implements sentiment analysis. In other words, we're going to try to build some kind of regressor or classifier that estimates star ratings based on the text, or the word choices, used in a review. The reason we choose this problem is because it's complex and high-dimensional, or rather it uses high-dimensional features: we have many, many features corresponding to all the different words that might be used in a review. That means things like tuning models, carefully evaluating models, and trading off different modeling decisions are going to become particularly important for a modeling task like this one.

So in order to build our sentiment analysis, evaluation, training, and testing codebase, we'll start by reading in our dataset. There's nothing new here; it's just our regular old Amazon gift card data, which we read by opening the gzip file and then treating it as a tab-separated values file. We read it in line by line in the third block here, and as we do so, we convert the relevant values to integers. So things like the star rating, helpful votes, or total votes we convert to an int as we're parsing the data. Other than that, it's all boilerplate (a rough sketch of this loading step is shown at the end of this segment).

Our goal is going to be to build a classifier that estimates sentiment, or in this case a star rating, which will be our label, based on the occurrence of words. It'll look something like the following: we're going to have a predictor, nothing other than a linear regressor or something like that, which takes features derived from the text and tries to estimate the rating. In particular, we're going to use a linear classifier, or a model based on linear regression, that says the rating is equal to some offset alpha, plus a summation over all of the words in the review of some feature, which is a count of how many times that word occurs, multiplied by some parameter theta, which says: what is the weight associated with that word? So you can see that's an example of a linear model. We have an offset parameter alpha, and then we have this inner product between all of our count features and our parameters theta, and each value of theta is then measuring: if I observe an occurrence of a particular word, how much is that associated with positive or negative sentiment, depending on whether it has a positive or negative value? So how do we go about building that model? Well, our first challenge is going to be to build a useful feature vector from the text of a review, and while the purpose of this class is not to teach natural language processing, we do need to make some careful decisions in order to compile a dictionary of words to use in our model, since it wouldn't be practical to consider every single word in the English language as a feature to make predictions. So how do we compile a relatively small dictionary of words that we might like to use?
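Before we get to the dictionary, here is a minimal sketch of what the data-loading step described above might look like. The file name and the field names (star_rating, helpful_votes, total_votes) are assumptions based on the Amazon gift card review data, so they may differ from the exact code shown in the lecture notebook.

```python
import gzip

# Assumed file name for the Amazon gift card review data (may differ in practice)
path = "amazon_reviews_us_Gift_Card_v1_00.tsv.gz"

f = gzip.open(path, 'rt', encoding="utf8")

# The first line is a header giving the names of the tab-separated fields
header = f.readline().strip().split('\t')

dataset = []
for line in f:
    fields = line.strip().split('\t')
    d = dict(zip(header, fields))
    # Convert the relevant values to integers as we parse each review
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    dataset.append(d)
```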
Just to show you that it really would be impractical to consider every word, let's count the number of unique words that occur in the body of all of the reviews in this gift card data. We do that just by building a defaultdict instance to count words, and we iterate through the data. We split each review on whitespace characters, which tokenizes it into words, and then each time we see a word, we increment the count for that word. We see here that there are close to 100,000 unique words used in this relatively small dataset of gift card reviews.

Well, we might have some notion that there's a lot of redundancy in that dictionary. If a word has different capitalization, say because it's at the start of a sentence, or is followed by a comma or a full stop, it's going to show up as a unique instance or a different word, whereas really we might treat those variants as being equivalent. So the next thing we might do is reduce our dictionary size by removing all capitalization and punctuation. That's pretty straightforward to do. We can use the list of punctuation characters from the string library, and then we just convert our reviews to lowercase. Then we filter, using a list comprehension, to take only those characters in each review that are not punctuation characters. Then we do the same thing as before, incrementing our word counts having removed capitalization and punctuation. Now we find there are about 46,000 unique words. It's still too many to deal with.

All right, another concept we might try is one called stemming. We might have many different instances of essentially the same word. For example, you might have words like drinks, drinking, and drinker, which the previous model would have treated as three different words, whereas you might want to treat them as the same word. Likewise, words like argue, argued, argues, arguing, and argus would all map to argu, which isn't actually a word itself, but is just saying that all of these words have a common stem. So we could maybe reduce our dictionary size by applying some kind of stemming technique. Here, we import the Porter stemmer from the Natural Language Toolkit. It's one of the most commonly used stemming algorithms, and it's essentially just a list of replacement rules that gradually reduce the endings of words until we find their common stems. If we do that, we get a somewhat reduced dictionary size. In fact, it didn't actually reduce the dictionary size by very much: it went from 46,000 to 37,000. It's maybe a matter of opinion whether stemming is useful. It's only reduced our dictionary size by a modest amount, and possibly we've actually discarded some useful information. There could be settings where different forms of a word correspond to quite different concepts, in which case we wouldn't want to collapse them by stemming.

Long story short though, we've tried removing punctuation and capitalization, we've tried stemming, and what we're left with is a dictionary that's still too large to deal with practically. We can't deal with a 37,000-word dictionary. So something simple, or quite brute-force, that we might try would be to take the subset of words corresponding to the most popular words in the dictionary. How do we do that? Well, first, like before, we just count all the word instances.
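As a recap of the steps just described, here is a small sketch that combines the word counting, capitalization and punctuation removal, and Porter stemming into one pass. It assumes the review text lives in a field called review_body, and uses NLTK's PorterStemmer; treat it as an illustration rather than the exact notebook code.

```python
import string
from collections import defaultdict
from nltk.stem.porter import PorterStemmer

punctuation = set(string.punctuation)
stemmer = PorterStemmer()

wordCount = defaultdict(int)
for d in dataset:
    # Lowercase the review and strip punctuation characters
    r = ''.join([c for c in d['review_body'].lower() if c not in punctuation])
    # Tokenize on whitespace and count each (stemmed) word
    for w in r.split():
        w = stemmer.stem(w)  # e.g. "drinks"/"drinking" -> "drink"
        wordCount[w] += 1

print(len(wordCount))  # number of unique word stems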
This is the same code as I had previously, in code block number 14 there, where I just say I'm going to remove capitalization and punctuation, not worry about stemming for the moment, and count each instance of each unique word. Then, in code block 15, I'm going to ask: what are the, for this example, 1,000 most popular words? So I store my counts along with the words themselves, I sort that, I reverse it, and then I take the top 1,000 instances, and that's going to be my dictionary. So I've removed capitalization and punctuation, and I've sorted words to keep the top 1,000 most popular. Finally, I have a couple of utility data structures that map each word to a unique ID. All that's going on when I build these data structures is to say, for each of those 1,000 most common words, I'm going to map it to a particular feature index, a number from 0-999, saying which position in my feature vector that word belongs to.

Having done that, that's going to help me define a function which actually computes my features. This is a function that takes a particular review, or a particular data point, and extracts my 1,000-dimensional word feature vector. I start by building a vector of all zeros. I then iterate through my review, removing capitalization and punctuation. Then I tokenize it into individual words, and each time I see a word that belongs to my word set, I increment the corresponding feature. The final thing I do, now that I've built my 1,000-dimensional word vector, is add one more dimension to the end, which is going to be my offset feature. That's a feature which always takes the value one, corresponding to the intercept term I should always have in a linear model. So that's it. We've used some fairly naive and straightforward techniques to build a fixed-length, in this case 1,001-dimensional, feature vector for each of our reviews.

We can now use this to train a sentiment analysis model, which corresponds to the equation from before: the rating is predicted by some offset alpha, plus my features, the word counts, multiplied by my parameters, the theta values for each word. Actually fitting the model is not going to be anything too unusual; we've already seen code to fit models before, and we'll adapt this in later lectures once we introduce things like training and testing pipelines. But for the moment, all we're going to do is randomly shuffle our dataset, extract features for each data point to build the matrix X, extract the labels for each data point, which are the star ratings, and then train a model via least squares. For now we won't worry about things like evaluation or training and test splits so much; we're just introducing the codebase in this lecture.

Having trained this model, we can, for example, look at which words are associated with the most positive or the most negative labels, based on which have the highest positive or negative coefficients. Maybe it's hard to see due to the font size, but you can also examine the code, which is provided. All we're doing here is saying that after we've estimated those values of theta, which happens in code block 21, we can sort them in terms of which have the highest or lowest values, and map those back to the 1,000 original words plus our offset feature. The most negative ones here are things like disappointing: every time I see the word disappointing in a review, my estimate of the rating goes down by 1.2 stars.
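Putting those steps together, a minimal version of the dictionary construction, feature extraction, and least-squares fit might look like the sketch below. The field names review_body and star_rating, and the use of numpy's lstsq for the least-squares solve, are assumptions; the lecture's notebook may differ in the details.

```python
import random
import string
import numpy as np
from collections import defaultdict

punctuation = set(string.punctuation)

# Count words (no stemming), having removed capitalization and punctuation
wordCount = defaultdict(int)
for d in dataset:
    r = ''.join([c for c in d['review_body'].lower() if c not in punctuation])
    for w in r.split():
        wordCount[w] += 1

# Keep the 1,000 most popular words as our dictionary
counts = [(wordCount[w], w) for w in wordCount]
counts.sort()
counts.reverse()
words = [w for _, w in counts[:1000]]

# Utility data structures: map each word to a feature index from 0-999
wordId = dict(zip(words, range(len(words))))
wordSet = set(words)

def feature(d):
    # 1,000 word-count features plus one constant offset feature
    feat = [0] * len(words)
    r = ''.join([c for c in d['review_body'].lower() if c not in punctuation])
    for w in r.split():
        if w in wordSet:
            feat[wordId[w]] += 1
    feat.append(1)  # offset / intercept term
    return feat

# Shuffle the data, build X and y, and fit the model via least squares
random.shuffle(dataset)
X = np.array([feature(d) for d in dataset])
y = np.array([d['star_rating'] for d in dataset])
theta, residuals, rank, s = np.linalg.lstsq(X, y, rcond=None)
```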
We also see words like disappointed, unable, and waste; you can imagine how these would be associated with negative sentiment about gift cards. Similarly for the most positive words: some of them are not so clear, as they may be appearing as parts of longer positive strings, but we see things like what's, problems, particular, worry, exelente, and excelent, where exelente and excelent correspond to excellent in different languages, and those are associated with very highly positive sentiment.

So that's enough to introduce our basic codebase, which we'll extend later. All we did in this lecture was develop a codebase for sentiment analysis, and in doing so, we introduced some of the challenges in capturing features from text. On your own, I would suggest taking this codebase and extending it to use alternate feature representations. See what happens if you remove or keep capitalization and punctuation, or use larger and smaller dictionary sizes, and see what impact those decisions have on the model's performance, whether that's its mean squared error, its accuracy, or something like that.
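As a starting point for those experiments, a small sketch like the following could be used to inspect the learned weights and compute the mean squared error for whatever feature representation you choose. It assumes the theta, words, X, and y variables from the fitting sketch above.

```python
import numpy as np

# Pair each learned weight with its word (the last entry of theta is the offset)
weights = list(zip(theta[:-1], words))
weights.sort()
print("Most negative words:", weights[:5])
print("Most positive words:", weights[-5:])

# Mean squared error of the model's predictions (here, on the training data)
predictions = X @ theta
mse = np.mean((predictions - y) ** 2)
print("MSE:", mse)
```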