In the last practical, we downloaded a set of sequencing reads and plotted the quality value, the distribution of quality values. Now let's use that same data set and let's plot the GC content at each position in the reads. Ben, what would be a reason why we would be interested in? >> The GC content, in other words, the fraction of the genome that contains G's or C's, that is something that is different from species to species. So different species will have different characteristic GC contents. Here though, we're really just using GC as a way of trying to figure out whether the mix of different bases is changing as we move along the read. We expect that it won't change very much. We expect it'll stay pretty even. But if something strange is going on, like if there was one particularly bad sequencing cycle, then we might see a very different mix of G's and C's relative to the other bases. So we just want to plot that for the entire read and just check to see whether anything strange is happening. >> Okay, so let's create this function. We'll call it findGCByPos, and we'll pass in a set of reads. In this case, we don't case about the quality scores. We just want the strings of bases. And what I'm going to do is I'm going to create two lists. One is going to be the number of GC bases that we've seen at each position in the reads. In this data set, all the reads are of length 100. So I'm going to make this list 100 long. So I'm going to keep track of the number of GC bases, and I'm also going to keep track of the total number of bases that we've seen at each position. And then, once I finish that I can just divide out each index. The GC by the total number to get the average GC content at that position. So now let's loop through each read in our set of reads. And in each read, I have to loop through each base. going to say for each index i in the length of the read, so first I'm going to check if that read is a C or a G. So ifiI is equal to C or read i is equal to G. Then I need to increment my GC array at index i. And regardless of what that base is I want to increment my totals array in index i. So once I've finished this loop, I'll have finished updating both of my arrays so I can just divide GC by totals to get the average GC content. So I"m going to do this for each base up to the length of the reads. And I want to say GCI, and I am going to divide it by totals i. And I'm converting totals i to a float just so it won't truncate the integers or anything like this. Now one thing I have to be careful of, is if totals i is equal to 0, I'm going to get an error, a divide by 0 error. So, I only want to do this if totals i is greater than 0. And then when this is done I can just return GC. Okay now let's run this function on our set of reads. So find GC by position on our set of sequences. And I'm going to use map plot lib once again to plot this. In this case I'm going to do a line plot. For the x values I'm just going to do the numbers from zero up to the length of the read. So those will be the indices in the sequence. I'm going to plot the average GC content. And I have to do plot.show. And here's our plot. So, yeah, like you said we would expect, Ben, if these sequences are drawn without bias from different positions in the genome, we would expect the GC content to be pretty constant. You can see that this is the case here. There is quite a bit of random fluctuation, which is probably due to noise. But overall it stays pretty constant. >> Yeah, we're not looking at very many sequencing reads here, so its not surprising that the fraction kind of jumps around a bit. But one thing you can see is that its consistently higher than 0.5. So, that's because this genome that we're working with, the human genome, has GC content, on average, greater than a half. Yeah looking at this you can see that the average GC content is probably somewhere between 0.56 and 0.58. Okay, now that we've done that let's just do one last thing. Let's look at the distribution of nucleotides, or sorry, the distribution of bases in these sequences. If you recall, I think it was in practical two, we used collections to do this, which is a Python module. So let's do that again. We're going to import collections and I'm going to create a variable, count, which is collections.Counter(). And now for each sequence in our list of sequences, I want to update this counter. So I'm going to do count.update and I pass in a string, which in this case is just the sequence of bases. And then let's print(count) when we're done. So you can see, like we said above, the G and C both occur at least 28,000 times, T and A only occur between 21 and 22,000 times. So GC content is above 0.5 in this organism. Something else that's interesting is we also have this N here which is not a base. What is that? >> So N is basically when the base caller has no confidence. It doesn't even want to make a call. Because there's no good evidence to support one base over the others. Sometimes the base caller will report in. You can think of it as meaning basically no confidence. >> Yeah, so thankfully, in this sequencing run, the caller, or the sequence are only reported an 18 times, which is very little. So yeah, that's good. >> Yeah, if you see ends, that means something's wrong, the sequencer was just not able to figure out what base was there. So it's good that it didn't happen too often here.