Hello, my name is Tasio Guevara. I'm a solutions architect with AWS. In this session, I'm going to explain how you can build a text classification model using AWS Glue and Amazon SageMaker. Not only that, I want to make sure that you don't need to know that much about machine learning in order to accomplish this task. You may already have been using SageMaker and its sample notebooks. They are indeed a great tool to get started with the built-in algorithms or with bringing your own, but you may have noticed a pattern in them. Usually, you start by downloading some public dataset, working with it in the notebook instance, and eventually uploading it to S3. That is really great for learning and for small datasets. When you start working with real cases and real data, you will see that these datasets are much bigger, and you may find that notebook instances cannot scale up to the challenge, even though you can get up to a p3.16xlarge for a single notebook instance. This is where AWS Glue becomes really relevant, because with it you can work with enormous amounts of data and scale out the processing of that data so it can be consumed by SageMaker or any algorithm that you want to bring.

First, I would like to talk about text classification. Why is it important to you? Most of the data that we humans produce nowadays is pictures, videos, and audio, but we also produce a ton of text. If you go to Twitter, or any social network, we post some pictures, but mostly we write. Not only that, we read the news; you have news articles, you have posts from companies. You have tons of text that you may want to leverage in order to improve your business. Text classification is one specific use case of natural language processing. For example, think about sentiment analysis: what are people saying about your company? Are they saying good things or bad things? That's just a specific case of the broader text classification problem. You could also use it for triaging trouble tickets. Your customers may enter a description of a problem they have, but in many cases they also need to pick a category to help you fulfill their need. What if they could just write the problem down and you could automatically triage it to the right team to solve the issue? They would benefit, first, from not having to enter metadata that is necessary for you but not for them, and they would also get a quicker response.

Next, I would like to talk about the machine learning process. In this session, we're going to go through this process with a real case. We're going to take a publicly available dataset and end up with a model that we can use to do text classification. There is a process in between, and there are some pain points in this process that AWS helps you overcome, thanks to the services and the data platform that we provide. We'll see how you can leverage that to eliminate those pain points. This is what the machine learning process looks like. It has four different stages, and it starts with the business problem. You need a business problem that you want to solve in order to get started, because if you don't know where you're going, you'll never know when you've arrived. Once you have defined the problem, plus some requirements about accuracy, then you need data.
You need to work with data: you have to collect it, you have to integrate it, which means that you may have different sources and you want to put them together, and then you need to prepare and clean it in order to train a model. For training you need an algorithm, then you also want to visualize the data you're going to use, maybe do some feature engineering, and then train the model. But usually it doesn't work on the first go. You're probably going to need to tweak how the algorithm works in order to get a better model. Then you need to evaluate it. Once it's evaluated, you need to decide whether it's good for production or not. Once you decide it is suitable for production, you need an endpoint where you can run inferences, and you need to monitor how that is performing. Eventually, you're going to retrain. Mastering this cycle is crucial for achieving good results. No great model out there just worked on the first go. You go around this cycle over and over and iterate as quickly as you can to get better results.

Let's see this in practice. Let's go through these four phases one by one. First, we have the discovery phase. Let's frame our business problem. At the beginning, I talked about trouble tickets, and our business goal will be to triage trouble tickets. We'll see that I cheated a bit on the business problem, but it's a very similar problem that we'll solve. The goals you can achieve with this are that you can streamline the ticket creation process: as I said, usually when you have an issue with a service, you have to enter some metadata that helps the service provider triage that ticket. What if we can remove that? It would be a much better experience. Then you may reduce the ticket resolution time by getting to the right queue from the beginning, and you would minimize tickets bouncing between teams.

So what dataset can we use? I'm going to talk about the Amazon reviews dataset. As you know, amazon.com sells millions of different products, and each product has customer reviews. This is what a customer review looks like: it has some metadata like the author, the rating, a header, a text, which is the body of the review, how helpful it has been, plus other metadata. Amazon.com has made these reviews available on AWS. You have this dataset available from a public bucket in S3. It contains 20 years of product reviews from Amazon customers and accompanying metadata. There are more than 160 million reviews. That's close to 80 gigabytes of raw data, or about 51 gigabytes compressed as Parquet, and it's partitioned by product category. This is key for our use case, because what we're going to do is try to predict the product category from the text of the review. As I said, it's a very similar problem, because we're going to predict a label from a text; in this case it's going to be a product category from a review instead of a trouble ticket category from a problem description. Now, here are some examples I picked. I picked them specifically because they are hard to predict even for a human. "So disappointed, stopped working due to water after only two weeks." What kind of product could that be? "Ten-year-old loved it." That sounds like a toy, but it may be something different. "The metal is very cheap and bendable, but it works fine." This would be a hard task for a human, but let's see how a machine learning model deals with it. We'll see that at the end.
We already have the business problem defined; now we need to frame it with some requirements. How good should the model be? In our case, it doesn't really matter, right? It's just an example. So I am going to show you the perfect confusion matrix. A confusion matrix is a measure of how well a classification model works. On the y-axis we have the true labels of the data, and on the x-axis we have the predicted labels. In this case, we see a perfect one: everything is white except for the diagonal. That means we have 100 percent accuracy and 100 percent recall, a truly perfect model. This is something you'll only see in theory; it will never happen in real life.

Now we have finished our discovery and we can move on to the integration phase. We'll start by visualizing and analyzing the data. In order to do that, we can use an AWS Glue crawler. A crawler is a process that traverses our data and tries to extract a schema out of it. Once it's done, it creates a database and a table that you can use with other AWS services like Athena. The creation of our Glue crawler is rather simple: in our case, we just specify the path in an S3 bucket and give it a role that has access to that bucket and can create tables and databases in Glue. Once we have that, we can start querying the data. I mentioned Athena; you can also use QuickSight to build visualizations on data queried through Athena. This is the schema that was extracted from the public dataset. We see all of the columns that are there. We're going to focus on review_body, which is the text, and product_category, which is what we're going to try to predict, also called the target.

Now, let's have a look at some characteristics of the dataset. Using QuickSight, I could query how many reviews there were per product category, and we see this interesting graph showing that this is a very, very unbalanced dataset. That means there are many more reviews in some categories, for example books, than in others, like personal care appliances. Well, this makes sense. This dataset covers 20 years of Amazon.com reviews starting from 1995, and as you may remember, Amazon.com started by selling books, so it makes sense that there are many more reviews about books than about other categories. But training a model on this could lead to a very biased model. That means the predictions we would get would lean towards books, just because the algorithm has been trained on a lot of books. So we want to go from that imbalanced dataset to something where we have roughly the same number of reviews for each category. How do we do that? If we talk about ETL, the T is for transform, so we need to transform this dataset from an imbalanced one to a balanced one. We're going to do that with AWS Glue. Now, you can write Glue jobs directly, but it usually takes some time to spin up the compute capacity that runs a job, so it's a very good idea to start with a development endpoint. Recently, it was announced that you can use SageMaker notebooks to connect to these development endpoints. Before that, and you still can do it, you needed a Zeppelin notebook, so you had to deploy your own EC2 instance and do some configuration. But now it's a very seamless experience: you go to the Glue console and click on development endpoints.
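Going back to the crawler for a moment: if you prefer to script that setup instead of using the console wizard, here is a minimal boto3 sketch of what it could look like. The crawler name, database name, IAM role, and S3 path are all placeholders you would replace with your own values.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Placeholder names and role; the role needs read access to the bucket and
# permission to create databases and tables in the Glue Data Catalog.
glue.create_crawler(
    Name="amazon-reviews-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="amazon_reviews_db",
    Targets={"S3Targets": [{"Path": "s3://your-reviews-bucket/parquet/"}]},
)

# When the crawler finishes, the extracted table shows up in the catalog and
# can be queried from Athena, visualized in QuickSight, or read by Glue jobs.
glue.start_crawler(Name="amazon-reviews-crawler")
```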
Once you have deployed your development endpoint, you can launch a notebook instance and access it from the Glue console or from the SageMaker console. Okay, let's start with the balancing work. When we need to balance a dataset, there are different strategies. One of them is to duplicate records of the categories that have fewer reviews. Another approach is to remove a lot of records from the larger categories. There are other, more sophisticated techniques that data scientists use, but they usually don't work so well with text. So in this session, we're going to go ahead and remove records. How many, and which ones, do we remove? We're going to equalize on the category with the lowest count, which means we're going to remove a lot of reviews. That also helps with this exercise, because we're going to work with much less data, which means we'll be able to train models much faster. We're going to remove records at random.

So this is what we need to do: find the category with the lowest count, calculate a sampling factor for each other category, take a sample of N rows of each category, and then write those to S3 so they can be consumed by another Glue job. The first step is to find the lowest count and calculate a sampling factor. When we start with Glue, we need to get a DynamicFrame. This DynamicFrame is going to be used to read data from S3, but in this case we're going to use the Glue catalog, so we don't need to define an S3 bucket or anything; we just point to the database and the table that we got from the crawler. Now, a DynamicFrame is very similar to a Spark DataFrame, but AWS Glue provides some benefits on top of a regular DataFrame, so it can help you deal with messy data. Once we have this data source, we're going to convert it to a regular Spark DataFrame, because we're going to use a method on the Spark DataFrame that is not available on the DynamicFrame, in this case groupBy. We do a groupBy and count, and then collect the result. That gives us how many reviews we have per category. Once we have that, we find the category that has the fewest reviews, and then we calculate a factor for each category.

Now, once we have the factors, we need to take a sample of N reviews for each category. We start with an empty list for the samples and go through every category and its factor. Again, we read from the database and the table that we got from the crawler. Now, this is a very important lesson I learned by doing this: push_down_predicate allows us to push a query down to the service in a way that it can leverage the partitions in the data. As we said at the beginning, the dataset is partitioned by product category. By doing this, we're going to increase the performance of these queries by a lot. By a lot, I mean that I used to run this with 60 data processing units and, without the push_down_predicate, it took me five hours to run. By adding that simple line of code, I got it down to four minutes. So now that we have a reader, a DynamicFrame used to read, we take a sample of each category using the factors that we just calculated. We do that with the sample method that exists on a DataFrame; this is why we first convert the DynamicFrame into a DataFrame and then take the sample from it. We end by adding the sample to the list of samples.
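To make that sampling logic concrete, here is a condensed sketch of what the first part of this Glue job script could look like in PySpark. The database and table names are placeholders for whatever the crawler created in your catalog.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

DATABASE = "amazon_reviews_db"   # placeholder: database created by the crawler
TABLE = "reviews_parquet"        # placeholder: table extracted by the crawler

# Count reviews per category; groupBy lives on the Spark DataFrame,
# so we convert the DynamicFrame with toDF() first.
reviews_df = glue_context.create_dynamic_frame.from_catalog(
    database=DATABASE, table_name=TABLE
).toDF()
counts = {row["product_category"]: row["count"]
          for row in reviews_df.groupBy("product_category").count().collect()}

# Equalize on the smallest category: factor = lowest count / this category's count.
lowest = min(counts.values())
factors = {category: lowest / count for category, count in counts.items()}

# Take a random sample per category. The push_down_predicate makes Glue read
# only the S3 partition for that product_category, which is what brought the
# job from hours down to minutes.
samples = []
for category, factor in factors.items():
    partition = glue_context.create_dynamic_frame.from_catalog(
        database=DATABASE,
        table_name=TABLE,
        push_down_predicate="product_category == '{}'".format(category),
    )
    samples.append(partition.toDF().sample(withReplacement=False, fraction=factor))
```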
Then we can move on to the next step: writing those samples to S3. For that, we first do a union of every single sample DataFrame. Once we have that, we need a DynamicFrame writer, so we convert back from DataFrame to DynamicFrame and use the DynamicFrame writer to put the data into S3. The format we're going to use is Parquet, because it's very efficient and it preserves the partitioning that the dataset had. As you'll see, writing Parquet with Glue is very simple: we take the DynamicFrame writer, specify the format, specify the target, meaning the S3 bucket and prefix, and the partition key, and that's about it. Once we run this job script, we have exactly what we're looking for: a balanced dataset.

With this, we finish the integration phase and can move on to the training phase. For training, we need to choose an algorithm and prepare the data so it can be consumed by that algorithm. The algorithm I chose, and that you can use, is BlazingText, which is one of the built-in SageMaker algorithms. BlazingText comes in two modes. One is unsupervised, which allows you to take text and get word embeddings out of it; a word embedding is basically a representation of a word in the form of a vector, and as you may know, machine learning requires numbers in order to work. Then we have the supervised mode. This extends the well-known fastText classifier and is used for multi-class and multi-label text classification. This is the mode we will use. Now, in order for BlazingText to work, we need to provide it with data in a specific way. This is what BlazingText needs to consume to train a model: it requires a single preprocessed text file with space-separated tokens, where a token is basically just a word or a punctuation symbol. In that file, we need a single sentence per line, with the label alongside the sentence. A label is just a word prefixed by the string __label__. Finally, there is a validation channel; it's optional, but we'll definitely make use of it.

So these are the things that we need to do. We need to select only the fields that we're going to use; if you remember, we have this table with many fields and we just need two of them. Then we need to tokenize the review body and prepend the label in the required format. Then we need to split the dataset into training, validation, and in this case also a test subset, so we can later show how good our model is. Finally, we need to write each subset into a single object, because BlazingText requires a single file in S3. Basically, what we need is something that looks like this: a label, then a string of tokens, one sentence per line. As before, we're going to do this work in a new, different job, so we need a new script. We start, again, with the DynamicFrame reader. We point to the database and table, but in this case we're pointing to the balanced dataset. Then we select just the fields that we want: the product category and the review body. Then we can move on to the tokenization. How do we do that? The DynamicFrame has a method called map. You can apply map to the dataset, which means that a function is applied to every single row in that data. The function in this case is called tokenize.
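Going back to the write step for a moment: continuing the sampling script sketched earlier (reusing its glue_context and samples list), this is roughly how the union and the Parquet write could look. The output bucket and prefix are placeholders.

```python
from functools import reduce
from awsglue.dynamicframe import DynamicFrame

# Union every per-category sample DataFrame into a single balanced DataFrame.
balanced_df = reduce(lambda left, right: left.union(right), samples)

# Convert back to a DynamicFrame so we can use the Glue writer.
balanced_dyf = DynamicFrame.fromDF(balanced_df, glue_context, "balanced_reviews")

# Write Parquet partitioned by product_category (placeholder bucket/prefix).
glue_context.write_dynamic_frame.from_options(
    frame=balanced_dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://your-reviews-bucket/balanced/",
        "partitionKeys": ["product_category"],
    },
    format="parquet",
)
```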
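And for the new preparation job, here is a hedged preview of what the whole script could look like; the tokenize function and the other steps are explained next. The catalog and bucket names are placeholders, and it assumes NLTK and its tokenizer data are available on the Glue workers.

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
import nltk

glue_context = GlueContext(SparkContext.getOrCreate())
nltk.download("punkt")  # tokenizer models; assumes the environment can fetch them

def tokenize(record):
    # Produce a line like: __label__toys ten-year-old loved it !
    record["product_category"] = "__label__" + record["product_category"].lower()
    record["review_body"] = " ".join(nltk.word_tokenize(record["review_body"]))
    return record

# Placeholder catalog names for the balanced dataset written by the previous job.
balanced = glue_context.create_dynamic_frame.from_catalog(
    database="amazon_reviews_db", table_name="balanced"
).select_fields(["product_category", "review_body"])

# map applies tokenize to every row; the 60/20/20 split uses the Spark DataFrame.
tokenized_df = balanced.map(f=tokenize).toDF()
train, validation, test = tokenized_df.randomSplit([0.6, 0.2, 0.2])

# BlazingText wants one space-separated file per channel: repartition(1) gives a
# single object, and the CSV writer uses a space separator, no header, no quoting.
for name, subset in [("train", train), ("validation", validation), ("test", test)]:
    frame = DynamicFrame.fromDF(subset.repartition(1), glue_context, name)
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": "s3://your-reviews-bucket/blazingtext/" + name + "/"},
        format="csv",
        format_options={"separator": " ", "writeHeader": False, "quoteChar": -1},
    )
```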
The tokenize function takes the product category from the dynamic record, puts it in lowercase, prepends the label prefix, and transforms the review body in a specific way. What we're doing is simply using a tokenizer from a well-known NLP library called NLTK. We take a tokenizer, there's a family of them available in NLTK, apply it to the string, and then join the tokens with a space. Once we have tokenized every single sentence and prepended the label, we need to split the dataset into training, validation, and test. For that, we use the randomSplit function available on the Spark DataFrame, so we convert the tokenized DynamicFrame into a DataFrame and apply randomSplit. I chose 60 percent of the reviews for training, 20 percent for validation, and 20 percent for testing. Now we need to write each of these subsets to S3. If you remember the requirements, we need just a single file, but the source data is partitioned, which means we have multiple files under multiple prefixes. In this case, we really need just one. How do we do that? We can use the repartition method on a DataFrame to get exactly one file, one object in S3. Once we have that, we again take a DynamicFrame writer. In this case, we're going to use the CSV format, because we just have two columns, the label and the review, and we just need to separate them by a space. So we set the separator to a space, we don't need headers, and we don't need to quote the text, and that's it. After this, the data is available and ready for SageMaker to use with BlazingText.

Now we need to train the model and pass all of that data through the algorithm. For this, we need a SageMaker estimator. We're going to use the generic Amazon SageMaker Estimator, because right now the Python SDK doesn't have a specific one for BlazingText. We configure that estimator with the container that contains the BlazingText algorithm, we provide the channels that are going to be used for training and validation, we use a specific role, and we provide the instance configuration, the hyperparameters, and finally the location where the trained model artifacts are going to be stored. Let's see this in a Jupyter notebook. So let's jump into the console and go to a notebook that I prepared. Here we are in Jupyter. I have a notebook prepared to do the training, so let's get started. We start with some regular imports. We take a SageMaker session, which allows us to get some defaults; for example, here we get the default role. Then we specify the bucket where our data is, we specify a prefix, and we can proceed. First, we install NLTK, as we will use it later for running some tests, and we define the data channels. Data channels are just S3 inputs that we configure by saying where the training data and the validation data are. Then we define the output location; this is the S3 bucket and prefix where our training artifacts are going to be stored. We take the region and get the BlazingText container, the latest version, and we can get started with the estimator. We define our estimator, the generic Amazon SageMaker Estimator. We specify the container and the role. We're going to use just one instance, an ml.c5.4xlarge.
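As a reference for the notebook walkthrough, here is a rough sketch of the estimator and the hyperparameter tuning setup that is discussed next. It assumes the current SageMaker Python SDK (v2), so parameter names differ a bit from the SDK used in this session; the bucket, prefix, and epochs are placeholders, and the ranges mirror the values mentioned in the talk where given.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
role = sagemaker.get_execution_role()
region = session.boto_region_name
bucket = "your-reviews-bucket"        # placeholder
prefix = "blazingtext"

# Built-in BlazingText container image for this region.
container = sagemaker.image_uris.retrieve("blazingtext", region)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.c5.4xlarge",    # CPU is fine for a dataset under ~1 GB
    output_path="s3://{}/{}/output".format(bucket, prefix),
    sagemaker_session=session,
)
estimator.set_hyperparameters(mode="supervised", epochs=10)  # epochs is illustrative

# Ranges explored by hyperparameter tuning, roughly as in the session.
ranges = {
    "learning_rate": ContinuousParameter(0.01, 0.08),
    "vector_dim": IntegerParameter(100, 200),
    "word_ngrams": IntegerParameter(1, 3),
}

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:accuracy",
    hyperparameter_ranges=ranges,
    max_jobs=10,
    max_parallel_jobs=2,
)

tuner.fit({
    "train": TrainingInput("s3://{}/{}/train/".format(bucket, prefix), content_type="text/plain"),
    "validation": TrainingInput("s3://{}/{}/validation/".format(bucket, prefix), content_type="text/plain"),
})
```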
The reason for choosing that instance type is that our dataset is actually smaller than one gigabyte; the recommendation would be to use a P2 or P3 GPU instance if the dataset were bigger. Then we specify some other attributes, and that's about it. Then we can define the hyperparameters for this estimator. But we're not going to stick to a static set of hyperparameters; we're going to use a very nice feature of SageMaker called hyperparameter tuning. What hyperparameter tuning does is let us define ranges of hyperparameters, run training jobs in parallel, and explore the results to try to find better hyperparameter configurations. What we need to do is define ranges. For example, we say that the learning rate is going to be between 0.01 and 0.08, or that the number of dimensions of the word embedding vectors used by the algorithm is between 100 and 200, and then we define the objective metric. This is the metric that hyperparameter tuning will use to determine which model is better. Once we define this, we create a tuner object with the estimator and the hyperparameter ranges, and we say how many jobs are going to run and how many in parallel. That's it. We define the tuner, we call its fit function, and it starts the tuning job.

Here we are in the SageMaker console, and I have already run a few hyperparameter tuning jobs, so let's pick one of them. We see that it spawned 10 different training jobs and ran two in parallel. Then you see the objective metric here. If we go to the bottom, we see that it started at 62.9, then 64, and we see some improvement as it went. Then we can look at the best training job, and it tells us that it used, for example, a vector dimension of 183 and word-ngrams of two, and you can see everything that was used for training that specific model. With this, we have finished the training, so we've covered another phase of our machine learning process. Now, for deploying this model, we can create a model by clicking here, and then, from the console itself where we see the models, we can click on it and create an endpoint. That deploys our model to an endpoint that we can use for running inferences. That is the way to do it in the console, but we can definitely also do it using code, so we go back to the Jupyter notebook. Using code, we just call the tuner's deploy method with a specific count of instances, a specific instance type, and an endpoint name, and that's it. Once it's deployed, we're ready to run inferences against it.

So, remember those three reviews that we said would be challenging? Let's see how the model responds. We can run inferences and get a single prediction for each sentence. In our example, for the first sentence it says Lawn and Garden; that's actually not the right one. For the second it says Toys; that was the most obvious one, but it's not right either. The third one it predicted as Kitchen, which is the right one, so probably we're talking about a knife or something similar. But we can also get more than one label for each sentence. If we look at the first sentence, it says Lawn and Garden with almost 40 percent probability, but the second guess is Watches with almost 20 percent, and that is actually right; that review was about a watch.
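To show what that multi-label inference looks like in code, here is a small follow-up sketch that deploys the tuner's best model and asks BlazingText for the top two labels per sentence. The endpoint name and instance type are placeholders, and the sentences must be tokenized the same way as the training data.

```python
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer

predictor = tuner.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="reviews-classifier",   # placeholder
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

payload = {
    "instances": [
        "so disappointed , stopped working due to water after only two weeks",
        "ten-year-old loved it",
        "the metal is very cheap and bendable , but it works fine",
    ],
    "configuration": {"k": 2},   # return the top 2 labels with probabilities
}

print(predictor.predict(payload))
# -> a list of {"label": [...], "prob": [...]} entries, one per sentence
```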
Continuing with the second review, it says Toys, Apparel, or Luggage, and none of them are right, because we're talking about a gift card, and don't ask me why a 10-year-old gets excited about a gift card, but that was the review. Finally, we see that the last sentence could also be Home Improvement or Major Appliances, which actually makes sense. This is just using a few individual examples. What we want is to build a confusion matrix out of the test set. For that, we can write some code to run the whole test set through the endpoint and then build our confusion matrix. Since I iterated a few times over this, I'm going to show you the different results I got.

With that, we have deployed our model and completed the deployment phase. Now, are we finished? Well, as I said at the beginning, it's crucial to iterate over this model training in order to get a more accurate model, so I'm going to walk you through some of the things I learned by doing this. In one of the first iterations of building this text classification model, I got this confusion matrix. Can you spot something weird about it? I see a column that is darker than the rest, and that may show some bias towards predicting a specific label. I didn't know exactly where to look, so I started exploring the training dataset, and I saw that when I repartitioned earlier, the file ended up sorted by product category. That means that when I passed the data through the algorithm, it saw one category first, then another, then another, until the last one. The last category was what the algorithm learned most recently, and that shaped the final results. When I checked the file, I saw that that last category was exactly the one in the dark column. What do you need to do to avoid that? You need to shuffle. Before you repartition, the only thing you need to do is order by a random value and then repartition. Once you have that, everything in the training and validation datasets will be mixed, so the algorithm now learns about each category in every iteration. The results after that are much better. We still see some darker spots, but they could be there for many different reasons.

Now, one thing I learned is that I should have explored the data a bit better before running into all of this. So I thought, let's see what reviews we are getting, and I decided to select the ones that were shorter than 20 characters. If you look at them, well, "excellent", "great", "good", that doesn't tell you anything about a specific category. They may be helpful as a review, but for our purpose we can get rid of them, because they could be about anything. Once we do that, we get a slightly better model. This model gives us about 65 percent accuracy, which is pretty good for a model that has to choose between 36 categories.

Let's see how much this would actually cost. I've taken the main cost drivers, which in this case are AWS Glue and Amazon SageMaker. First, we start with the development endpoint. We can actually do all this work in around three hours, and I selected an endpoint with just two Data Processing Units (DPUs), which is what you pay for. The cost at the moment is $0.44 per DPU-hour, so that would total $2.64.
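Before finishing the cost breakdown, here is a minimal PySpark sketch of the two fixes just mentioned, reusing the tokenized DataFrame from the earlier preparation sketch: dropping reviews shorter than 20 characters and shuffling on a random value before collapsing to a single file.

```python
from pyspark.sql.functions import col, length, rand

# Drop very short reviews ("great", "good", ...) that carry no category signal.
filtered = tokenized_df.filter(length(col("review_body")) >= 20)

# Shuffle before collapsing to one file; otherwise repartition(1) can leave the
# rows grouped by product category and bias what the algorithm sees last.
shuffled = filtered.orderBy(rand()).repartition(1)
```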
Continuing with the costs: next we have the sampling job and the preparation job. I selected a different number of DPUs for each because of the parallelization and the type of data we're working with. The sampling job took six minutes. Think about it: it's taking 50 gigabytes of data that were in the North Virginia region, sampling it, and copying the samples over to Ireland, across regions, in about six minutes, and that costs about $2.20. The preparation job used only 20 DPUs, took about 25 minutes, and amounts to $3.67. For the notebook instance, I just chose the smallest one, which would run through the workshop in about three hours, and that would be just $0.15. Then, for the hyperparameter tuning job, which spawns 10 training jobs internally, we would average almost three hours in total with one ml.c5.4xlarge for each of the training jobs, and that would be $3.19. So in total, you can build a model for a little more than $10.

Now, you can optimize this. When you run Glue jobs, you get metrics, and these metrics can tell you things like how many executors you need. Executors can be translated into DPUs, and I would encourage you to go to the documentation to check how that calculation is done. In this case, for example, we can see that we could have sped up our sampling job by increasing the number of DPUs, and it would have taken a shorter time. Optimizing on cost could mean that, if you can run in a shorter time by using more DPUs, you need to find the sweet spot for you, because it really depends on the workload. Then, in this case, we saw that the needed executors were very close to the active ones, so this job cannot be optimized much further, but again, it really depends. I hope this was helpful. My name is Tasio Guevara, thank you for watching.