Hello, my name's Vikram Madan and I'm a senior product manager here at AWS on the deep learning team. Today, I want to tell you about Amazon SageMaker Ground Truth. Ground Truth is a new capability of Amazon SageMaker that makes it easy for you to efficiently and accurately label the datasets that are required for training machine learning systems. Ground Truth can automatically label part of the training dataset and then send the rest to human workers for labeling. Ground Truth also uses innovative algorithms and user-experience techniques to improve the accuracy of the labeling that's sent to human workers.

The current method for building a training dataset involves a lot of manual effort and is prone to errors. It often involves distributing labeling tasks across a large number of human workers. This can add significant overhead and cost, and leave room for human error and bias. Additionally, an operation like this is extremely complex to manage and can take months to complete.

So how does Ground Truth actually fix this problem? It makes it easy to efficiently perform highly accurate data labeling using your data stored in Amazon S3. Ground Truth provides a managed experience where you can set up an end-to-end labeling job with just a few clicks. You simply provide a pointer to your data in S3. Ground Truth then offers a set of templates for common labeling tasks, where you only need to make a few choices and provide some minimal instructions to the human workers to get your data labeled. Then you select one of three workforce options, and on completion of a labeling job, Ground Truth sends the labeled output data to your S3 bucket.

So when selecting a workforce to perform labeling, you have three choices.
You can send a labeling task to the public Mechanical Turk crowdsourced workforce, you can use one of these pre-approved third-party vendors that are listed on the AWS Marketplace, or you can even bring your own internal workers. What that actually means is that we host a labeling application on which you can onboard those workers.

So now, let's talk about how Ground Truth helps you improve the accuracy of your data labeling. There are two core aspects here. The first is innovative UX techniques that are built directly into the templates you can use for common labeling tasks. In addition, Ground Truth provides a set of built-in algorithms that help improve the accuracy of labeling by taking in the inputs of workers and outputting a high-fidelity label.

Next, let's talk about how Ground Truth improves the efficiency of data labeling. Ground Truth has an innovative feature called Automatic Labeling, which automatically labels a subset of your dataset. A portion of your dataset will still be labeled by humans. Let's double-click on how Automatic Labeling actually works. It uses an ML technique called Active Learning, which helps us understand which data is well understood and can potentially be automatically labeled, and which data is not well understood and may need to be looked at by humans for labeling. Ground Truth looks at your training dataset, identifies which data is not well understood and thus needs to be sent to humans, and also identifies which data is very well understood and can be automatically labeled. How all the magic works is that, under the hood, Ground Truth is actually training an ML model in your SageMaker account. This whole process is iterative: Ground Truth breaks up your data into batches and repeats this cycle until all your data is labeled.

Okay. So now, let's take a step back and understand what all of this means for you.
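To make the iterative loop above concrete, here is a minimal sketch of the active-learning idea in Python. This is an illustration only, not Ground Truth's actual implementation: the `predict` stub, the toy "model" dictionary, and the 0.9 confidence threshold are all assumptions.

```python
# Illustrative sketch of an active-learning labeling loop. In the real
# service, the model is retrained on each newly labeled batch; here the
# "model" is just a dict mapping items to (label, confidence) pairs.

def predict(model, item):
    """Hypothetical stub: return (label, confidence) for one data item."""
    return model.get(item, ("unknown", 0.0))

def label_batch(model, batch, human_label, threshold=0.9):
    """Auto-label items the model is confident about; send the rest to humans."""
    labeled = {}
    for item in batch:
        label, confidence = predict(model, item)
        if confidence >= threshold:
            labeled[item] = (label, "auto-annotated")
        else:
            labeled[item] = (human_label(item), "human-annotated")
    return labeled

# Toy usage: the model is confident about img1 but not img2,
# so img2 is routed to the (stubbed) human labeler.
model = {"img1.jpg": ("freckles", 0.97), "img2.jpg": ("freckles", 0.40)}
result = label_batch(model, ["img1.jpg", "img2.jpg"],
                     human_label=lambda item: "not-freckles")
```

The key design point is the confidence split: high-confidence predictions become labels directly, while low-confidence items go to humans, and the human answers feed the next training iteration.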
First and foremost, Ground Truth makes data labeling easy. It also helps you lower your total cost of data labeling by up to 70 percent. Ground Truth enables you to securely manage your datasets, and it can help you increase the accuracy of your labeled data. Now, let's take a look at Ground Truth in action to get a better feel for these features.

So now, I'm going to walk through how to create a data labeling job. It's just a few clicks and a handful of informational fields, and we can essentially kick off a data labeling job. There are three aspects to a data labeling job. First, you need to provide us the input dataset. Then you need to tell us what needs to be performed on that dataset, which is the actual labeling task. And finally, you need to configure a workforce. I'll walk through each of those three pieces, all in the context of me having 10 pictures in which I want to identify whether my family dog, whose name is Freckles, appears.

We'll walk through the input dataset first, and here you see we've actually already prepped that. You provide us an input manifest, which is a JSON document, and you can see the Freckles manifest here. You also tell us where we should output the labeled data, and here this is the same bucket. Let's quickly take a look at what that manifest looks like. Here we have the input manifest, and this is a simple JSON Lines document. You can see each of the images is its own line, and you provide the S3 URL for that image. That's how the input manifest looks, so I'm going to minimize this now; we're just providing a pointer to the data, and it lives in S3. Now, you also have to give the service access to these S3 locations. That's as simple as creating a role for those S3 buckets.
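For readers who want to see the shape of that input manifest, here is a small sketch that builds one. Each line is its own JSON object with a `source-ref` key pointing at an object in S3; the bucket and file names here are made up for illustration.

```python
import json

# Build a minimal Ground Truth-style input manifest: a JSON Lines file,
# one object per line, each with a "source-ref" pointing at an image in S3.
# Bucket and key names below are placeholders.
images = [
    "s3://my-example-bucket/freckles/img1.jpg",
    "s3://my-example-bucket/freckles/img2.jpg",
]
manifest_lines = [json.dumps({"source-ref": uri}) for uri in images]
manifest = "\n".join(manifest_lines)
print(manifest)
```

In practice you would write `manifest` to a file and upload it to the same S3 bucket you point the labeling job at.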
Here I've already provisioned a role. If I had to create a new one, I could just go to "Create role" and provide the names of the S3 buckets that I want to give access to. I'm not going to create that here, but I'm going to move on to the next step.

So now we've finished the first aspect of the data labeling job, which is providing the input dataset, and we're going to select the actual task type. This is what's going to be performed on the input dataset, and if you remember, I'm trying to identify whether my family dog named Freckles is in one of the images that I provided. So I'm going to do an image classification task, and this is pretty simple. Basically, I want to show an image and ask: is Freckles in this image, or is Freckles not in this image?

Next, we're going to move from that second step, the labeling task, to actually configuring the workforce, and here you see I've already decided to use the public Mechanical Turk workforce. I have to ensure that there's no adult content in my images. I also need to make sure that I'm not sending any sensitive data to the public workforce, because these images will be publicly accessible and shown to a globally distributed public workforce. So I've checked those two boxes and configured the public workforce.

Now, I'm going to return to the task setup and configure the actual task that the labeling workforce will see. Here is the task prompt. These are the options: "This is Freckles" and "This is not Freckles." Then we have the informational panel that will be shown to the labelers. Here's a good example: Freckles is a black cocker spaniel with white spots, she's 17 years old, and she has a red heart-shaped tag. And here's a bad example: for dogs that are not Freckles, please make sure you label them as "This is not Freckles." So let's take a look at what the labeler will see.
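Everything this console walkthrough configures can also be expressed as parameters to the SageMaker `create_labeling_job` API. Below is a hedged sketch of what that request might look like for this demo; the job names, bucket paths, and ARNs are placeholders, and the exact set of required fields (there are more than shown here, such as the UI template and consolidation Lambda) should be checked against the SageMaker API reference.

```python
# Sketch of parameters for sagemaker.create_labeling_job(), mirroring the
# console walkthrough. All names, paths, and ARNs are placeholders, and
# this is an incomplete subset of the fields the real API requires.
labeling_job_params = {
    "LabelingJobName": "find-freckles",
    "LabelAttributeName": "is-freckles",
    "InputConfig": {
        "DataSource": {
            "S3DataSource": {
                "ManifestS3Uri": "s3://my-example-bucket/freckles.manifest"
            }
        }
    },
    "OutputConfig": {"S3OutputPath": "s3://my-example-bucket/output/"},
    "RoleArn": "arn:aws:iam::123456789012:role/GroundTruthExampleRole",
    "HumanTaskConfig": {
        # Placeholder workteam ARN; the public workforce has its own ARN.
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:111122223333:workteam/public-crowd/default",
        "TaskTitle": "Is Freckles in this image?",
        "TaskDescription": "Decide whether the dog in the image is Freckles.",
        "NumberOfHumanWorkersPerDataObject": 3,
        "TaskTimeLimitInSeconds": 300,
    },
}

# To actually start the job you would pass these to a boto3 client, e.g.:
#   import boto3
#   boto3.client("sagemaker").create_labeling_job(**labeling_job_params)
```

The point is that the three console steps, input dataset, task type, and workforce, map directly onto `InputConfig`/`OutputConfig`, the label attribute, and `HumanTaskConfig`.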
So this is what the labeler will see. The image from the dataset will be rendered here, and if I select "This is Freckles," I can submit it and, voila, you have a completed task. But this is just a preview, so we'll go back to the workflow, and now we're ready to kick it off.

I did want to show one more thing: the Automatic Labeling feature. It's as simple as clicking a checkbox, and with that checkbox you've enabled automatic labeling, so Ground Truth will be able to automatically label a portion of the data.

Now that we're ready, we can submit this, and it will be sent out to the public Mechanical Turk workforce. Once the data labeling job is completed, Ground Truth will take the output labels, augment the initial manifest that you provided, and drop the result into your S3 bucket. We can take a quick look at how that augmented manifest looks. Here, if you remember, we have all the images that we had in the manifest, and now we have the associated label and the associated metadata for that label. You have a confidence score, you have the actual label category, and you have other information, like whether it was human-annotated or auto-annotated.

Thank you for learning about Ground Truth today. My name's Vikram Madan, and thanks for watching.
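As a footnote to the walkthrough above, one line of the augmented output manifest looks roughly like the following. The field names here follow the general Ground Truth output convention, but treat the exact keys as an approximation and the values as made up; the authoritative shape is in the Ground Truth output data documentation.

```python
import json

# Approximate shape of one line of the augmented output manifest: the
# original "source-ref" plus a label attribute and its metadata block.
# Key names and values are illustrative, not authoritative.
line = json.dumps({
    "source-ref": "s3://my-example-bucket/freckles/img1.jpg",
    "is-freckles": 0,  # label index chosen by the worker(s) or the model
    "is-freckles-metadata": {
        "class-name": "This is Freckles",
        "confidence": 0.95,
        "human-annotated": "yes",   # "no" when auto-labeled
        "type": "groundtruth/image-classification",
    },
})
record = json.loads(line)
```

Because each line is still valid JSON, downstream training code can stream the augmented manifest line by line, reading the image from `source-ref` and the label from the attribute and metadata fields.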