Hello, I'm Neil Clark, and in this talk we're going to look at enrichment. I'll start off by talking about overrepresentation analysis, which is, chronologically speaking, the first attempt at an analysis of enrichment. Overrepresentation analysis is a way of characterizing the composition of sets, typically gene sets. An experiment might result in a particular set of genes, and the investigator might be interested in characterizing the composition of that gene set. For example, the genes might have an associated category, such as membership of a given pathway or an associated Gene Ontology term. The investigator might want to know which of these categories are overrepresented in their gene set. We're going to look at a couple of statistical approaches to assessing that: the hypergeometric test and the Fisher exact test. Then, building on this idea of looking at relationships between sets, we're going to look at the Jaccard index.
More recently, this analysis of overrepresentation has developed into the analysis of the enrichment of gene sets. This is a method of analyzing differential gene expression not at the level of individual genes but at the level of collections of genes, and this has a number of advantages, some of them statistical and some to do with the interpretation of the results. I'm going to give a general formulation for this enrichment of gene sets before I go into detail about one particular method, Gene Set Enrichment Analysis, or GSEA.
Overrepresentation analysis is one of the earliest attempts at a kind of enrichment analysis in bioinformatics, and it might be best for us to look at a specific example in order to understand what is going on here. So, let's take an analysis from Rasche et al.
Here the authors performed a meta-analysis over a number of studies to identify 213 genes which had an apparent association with type II diabetes. The authors of this study then asked a question: given a KEGG pathway, let's look at these 213 genes and count how many of them are members of that specific KEGG pathway. Then we ask: is that number more than we would expect just by chance, or is it about the same? If it is more than we would expect by chance, this might lead us to conclude that this specific KEGG pathway is in some way relevant to, or dysregulated in, type II diabetes. On the right there's an illustration of the results of this analysis. The authors tested all KEGG pathways in this way and identified all of those that are significantly overrepresented in the 213 genes, and these are illustrated in the figure on the right.
Now, the question of overrepresentation is a statistical one, and there are a number of different statistical approaches to it. We're going to look at a couple of formulations of this problem. First, we're going to represent it as an urn model and use the hypergeometric test. We can also look at the problem from the point of view of a Fisher exact test. I will give a brief description of each of those two next. Some overrepresentation analyses also try to take into account the magnitude of the differential expression. This was done by producing a kind of score that combined, as a product, not only the significance (the p-value) for each gene but also its fold change, and this was also part of this work by Rasche et al. So, let's go on now and examine some of the statistical tests that are used in overrepresentation analysis.
An alternative formulation for the statistical test of overrepresentation is the Fisher exact test. This test takes two categorical variables and tests whether there is a significant relationship between the two.
It was claimed by Muriel Bristol that she could taste whether the milk was added before or after the tea in a teacup. Now, Fisher was suspicious of this and devised a test to settle it one way or the other. In this case, the two categorical variables are whether the milk was actually added first, so that's either yes or no, and whether Muriel Bristol, having tasted the tea, predicts that the milk was added first. Whether there is a relationship between the two, such that when Muriel Bristol says the milk was added first it usually was, could be tested with the Fisher exact test. The way it works is based on what's called a contingency table.
So let's go back to our example of the KEGG pathways and our analysis of whether our given gene set has an overrepresentation of a KEGG pathway. Our two categorical variables here are: the gene is a member of the KEGG pathway, or it's not; and the gene is present in our gene set, or it's not. One way to formulate the overrepresentation question is: if a gene is a member of the KEGG pathway, is it then also more likely to be present in our gene set, or is there actually no relationship between the two? We can write a contingency table for these two variables. For each gene, we put it in one of four places. If the gene is a member of the KEGG pathway and present in our gene set, then it contributes to this part of the contingency table, for example. If the gene is not a member of this KEGG pathway but is still in our gene set, then it contributes here. Let's say, for example, that of the 213 genes in our gene set, 100 are members of some specific KEGG pathway and the other 113 are not. Initially, we might think: wow, nearly half of the genes in our set are members of this KEGG pathway, it must be an important pathway.
But if we perform the statistical test, we can get a quantitative answer to whether we should be excited about this KEGG pathway or not. So let's complete the contingency table. Let's look at all those genes that are not present in our gene set and count how many of them are members of the KEGG pathway and how many are not. Let's say, just for simplicity's sake, we have a total of 20,000 human genes, and 9,000 of them are members of the KEGG pathway and 11,000 are not. These are very unrealistic numbers, chosen just to illustrate a point, the point being that, roughly speaking, about half of all human genes are members of the pathway and half are not. This is about the same proportion as we see in our gene set. So this contingency table is actually not very surprising. It would be quite likely to occur just by chance.
Now, the actual probability of observing any particular contingency table is given by the hypergeometric distribution we saw in the previous slide. Under the Fisher exact test, we ask the question: what is the chance of seeing a difference in proportion of this magnitude or greater? So with the Fisher exact test, we sum the hypergeometric probabilities of all contingency tables which are at least as extreme as the one that we observe. The result is a p-value: the probability of seeing a table at least this extreme if there were no relationship between the categorical variables. That tells us whether the KEGG pathway is significantly overrepresented or not.
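
As a rough sketch of the calculation just described, the following Python snippet builds the contingency table from the illustrative numbers in the talk and sums the hypergeometric probabilities of all tables at least as extreme as the observed one. It makes two assumptions that the talk does not spell out: that the 9,000 and 11,000 figures are totals over all 20,000 genes (including the 213 in the gene set), and that "at least as extreme" means a one-sided test for overrepresentation. The function names `hypergeom_pmf` and `fisher_overrepresentation_p` are made up for this illustration.

```python
from math import comb


def hypergeom_pmf(k, total, in_pathway, set_size):
    """Probability that exactly k of set_size genes, drawn without
    replacement from `total` genes, are members of the pathway."""
    return (comb(in_pathway, k) * comb(total - in_pathway, set_size - k)
            / comb(total, set_size))


def fisher_overrepresentation_p(k, total, in_pathway, set_size):
    """One-sided p-value: sum the hypergeometric probabilities of all
    tables with an overlap of k or more (assumed meaning of 'extreme')."""
    k_max = min(in_pathway, set_size)
    return sum(hypergeom_pmf(i, total, in_pathway, set_size)
               for i in range(k, k_max + 1))


# Illustrative numbers from the talk (deliberately unrealistic).
total_genes = 20_000    # all human genes
pathway_genes = 9_000   # assumed: pathway members among ALL genes
gene_set_size = 213     # genes associated with type II diabetes
overlap = 100           # gene-set genes that are pathway members

# 2x2 contingency table: rows = in gene set / not in gene set,
# columns = in pathway / not in pathway.
table = [
    [overlap, gene_set_size - overlap],
    [pathway_genes - overlap,
     total_genes - pathway_genes - (gene_set_size - overlap)],
]

# Overlap we would expect by chance if there were no relationship.
expected_overlap = gene_set_size * pathway_genes / total_genes

p_value = fisher_overrepresentation_p(
    overlap, total_genes, pathway_genes, gene_set_size)

print("contingency table:", table)
print("expected overlap by chance:", expected_overlap)
print("one-sided p-value:", p_value)
```

With these made-up numbers the expected overlap is 95.85 genes, close to the observed 100, so the p-value comes out well above the usual 0.05 threshold, in line with the remark that this table is not very surprising.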