In this practical, we'll be writing the Naive Exact Matching algorithm that was presented in lecture, and using it to match artificial reads to the genome. So, let's begin by downloading the genome for the phi X organism. So I'm going to just do W get. We're going to download that. Now what I'm going to do is, I'm going to use the readGenome function which we wrote in one of the previous practicals to read this. So I don't have to reimplement that. I'm going to use that to just read the genome we just downloaded. Okay, now we've done that, let's write the naive exact matching algorithm. So I'm going to write a function, naive. And this will take two arguments, one is P, which is the pattern we're searching for. And then T which is the text to search. So, P, for example, could be a sequence, a read sequence, and T would be a genome that we want to match it against. So I'm going to create a list, occurrences. And this will keep track of all the indices where P matches against T. And what I want to do is I'm going to loop through every position where P could start. So for item range, from zero up to length of T, minus length of P, plus one. So this gives me every position in T where P could start without running past the end of T. And I'm going to create a Boolean variable match which I'm going to initialize to true, and I'm going to compare every character of P against the corresponding position in T, and if there's a mismatch I'll set match to false. So I'm going to say for J in range up to the length of P. If the character in T at I plus J is not equal to the character in P at J, then I'm going to set match equal to false. >> And there you're comparing the character at offset J within P, but to the character would offset I plus J within T to account for the fact that we're in the Ith alignment. So we're offset both because of what alignment we're trying and because of which character we're comparing. >> Right, and so if I find a mismatch, I'm going to set match to be false. And then, there is no sense in comparing the rest of P since we already found the mismatch, so I can break out of that group. Now when I've finished this, if our match variable is still equal to true, this means that P must match exactly against T in that position. So in this case, I'm going to append the index, I, to our list of occurrences. And when this is done, when we've looped through every possible position, just return a list of occurrences. >> Yep, so that will have the opposite of every match of P within T. >> Right. So let's try this out. Let's just create a small text string of nucleotides, and I'll do a short pattern. I'll just choose AG, and now I'll run naive on the pattern in the text like that, and1 we see it gets three different places where P matches against T. We can check this by just looking at those, so there is zero to two. You can look at five to seven. And we can look at nine to 11. And yet at that position T can letters AG. So now that we have this implemented, now that we have naive exact matching, I'm going to generate some random reads from the genome. And I already have a function to do that so I'm just going to paste that in here, and what this function does is it generates artificial reads from a genome by just taking some sequences of a genome string, taken from random positions in that string. So all of these artificial reads should match back exactly against the genome. So let's generate a bunch of reads using this method. I'm going to use the genome, which you already downloaded, and I'm going to generate 100 reads of length 100. And once I do this, now I'm going to count how many of these reads match back exactly against the genome. Since they're artificially generated, and there are no sequencing errors or mutations, they should all match exactly. So I'm going to create a variable numMatched, which will just count the number of reads that match. I'm going to initialize this to zero, and now for each read in our list, I'm going to to get the list of matches for genome to have to put the pattern first, so I'm going to start with the read R, and then the genome test that we're comparing it against. >> So the read is our pattern and the genome is our text >> Yeah. So it matches now is the list of indices where the read R matches against the genome. If this list has like zero, that means that our read didn't match at all. So if the length of matches is greater than zero, that means we found a match. So in this case, I'm going to increment my variable, num match. And when this is done I just want to print out the proportion of read that matched, so I'm going to say print, what's going to be the number of matches out of the number of reads That matched, and then put the variables here so the first is numMatched and the second one is the length of the reads list. So if you run this, 100 out of 100 reads match exactly. So that's cool, our artificial reads all matched the genome. >> Great.