Why does combining models work so well? When you integrate two or more models, you are increasing complexity. So doesn't that increase the risk of overfitting? Isn't the name of the game to limit model complexity? As we saw earlier, we need to prune down decision trees, for example. Wouldn't it be paradoxical if greatly building up model complexity actually helped? Well, in this video, I'll cover the basics of ensemble models, and then we'll clear up that vexing little paradox.

The effectiveness of ensembles is very much like the so-called collective intelligence we see across a group of human judges. Consider this contest to guess how many dollar bills are in this transparent container. This guy, Gary Pancho, was representing an analytics vendor on the expo hall floor of our Predictive Analytics World conference. This was in 2012. His job was to grab people's attention as they walked by. If your estimate was the best, you won all the money. The actual total was $362, and of 61 submissions, the winner was off by only $10. Not bad. But here's the crazy thing: if you take the average of all the entries, which was $365, you're only off by $3. The collective mind was better than even the best single mind. The group outsmarted every individual. Basically, people's overestimations and underestimations all kind of came out in the wash. This phenomenon is known as collective intelligence, or the wisdom of the crowd. It's repeatedly discovered when people hold these kinds of human judgment competitions.

And it works for models just the same as for human minds. Instead of the wisdom of the crowd of humans, it's the wisdom of the crowd of models. This was pointed out by leading consultant Dean Abbott, who said that the wisdom of the crowd concept motivates ensembles because it illustrates a key principle of ensembling: predictions can be improved by averaging the predictions of the many. If you make a bunch of simple models, like decision trees, and then have them all vote as a group, or average their scores, or combine them in a somewhat more intricate way, the overall group of models, called an ensemble, usually does better than all, or at least most, of the individual component models. This is a really powerful effect. It means you can elegantly pull together many models that may each be simple yet clunky, and you get a significant improvement. You supercharge your modeling capabilities without any advanced mathematical footwork. It's elegant yet sophisticated.

Whether each model is simple, like a decision tree, or complex, like those for the Netflix Prize, combining them is simple: just apply predictive modeling to learn how to combine them. Since each model comes about from machine learning, this is an act of learning on top of learning, sometimes called meta-learning. The new model sits above the two existing models like a manager. This new model then considers both component models' predictions on a case-by-case basis. For certain cases, it can give more credence to model A rather than B, or the other way around. By doing so, the ensemble model is trained to predict which cases are weak points for each component model. There may be many cases where the two models are actually in agreement. But where there's disagreement, teaming the models together provides the opportunity to improve performance. As with assessments made by people, the predictive scores produced by a model are imperfect. Some will be too high, and some will be too low.
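If you'd like to see the flavor of that meta-learning idea in code, here's a minimal sketch of "stacking" using scikit-learn on a synthetic data set. The particular component models, the data, and every variable name here are assumptions chosen just for illustration, not the exact setup described in this video.

```python
# A minimal stacking sketch: two component models, plus a "manager" model
# that learns, case by case, how to combine their scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# Train two component models (call them model A and model B).
model_a = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
model_b = GaussianNB().fit(X_train, y_train)

# The manager trains on the components' scores for held-out cases, learning
# when to give more credence to A and when to lean on B.
held_out_scores = np.column_stack([
    model_a.predict_proba(X_hold)[:, 1],
    model_b.predict_proba(X_hold)[:, 1],
])
manager = LogisticRegression().fit(held_out_scores, y_hold)

# To score a new case, collect both component scores and let the manager combine them.
new_case = X_hold[:1]
combined_score = manager.predict_proba(np.column_stack([
    model_a.predict_proba(new_case)[:, 1],
    model_b.predict_proba(new_case)[:, 1],
]))[:, 1]
print(combined_score)
```

Note that the manager is trained on scores for cases the component models didn't see during their own training, so it learns about their genuine weak points rather than about data they may have memorized.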
Because the models tend to make up for one another's errors, the overall ensemble usually does better than any one component model. This process scales nicely to combine many component models into a single ensemble model. And the model at the top can be simple, such as logistic regression, or it can actually be even simpler, like taking a vote or averaging together all the models' scores, which is in fact the more common approach. The component models could be a bunch of hand-selected models created in various ways: a neural network, a decision tree, and a logistic regression model, say, or a few complex models competing in a public modeling competition. Or the component models could each be the same kind of model, like a whole collection of decision trees. In that case, it's key to ensure there's some diversity among the models; they cannot all be the same. This is often achieved by randomly resampling the training data, which duplicates some rows and leaves out others, before training each component model, so that each one ends up unique. Once complete, to use the ensemble in deployment to score an individual case, you just apply each component model to render its predictive score, and then, at the top, the overall ensemble tallies up the results to calculate the overall score for that individual.

For example, let's revisit this use of a decision tree to classify whether a case is inside or outside a circular region. We showed this example back in the first course. A tree's decision boundary consists only of horizontal and vertical lines, so its approximation of a circle is rough. But when we create an ensemble of 100 decision trees, we get a smoother, more refined model. And check out this animation: as more trees are added to an ensemble, it better approximates a single diagonal decision boundary.

There are many variations on this overall approach. Specific ensemble modeling algorithms have some pretty cute, expressive names, including random forests (a bunch of decision trees makes a forest), bagging, bucket of models, bundling, committee of experts, and TreeNet. Ensembles have a firmly established reputation and are generally said to boost a simple model's performance by 5 to 30 percent.

Check out this comparison of five simple modeling methods as evaluated across six data sets. The horizontal positions are for each of the data sets, such as diagnosing diabetes and assessing financial credit applications in Germany. The vertical axis is the relative error achieved across the methods, scaled to range from zero to one; lower is better. The different colors are for the different modeling methods, including neural nets, logistic regression, decision trees, and a couple of others. Okay, the details don't matter much. The takeaway here is that for any one method, pick a color and look only at that color: you see that its rank, that method's rank compared to the other methods, varies like crazy. It's all over the place as you move from one data set to another. But the more exciting takeaway is coming in one second. What if, for each data set, we combined the five competing models into one ensemble model? Voila, look how much better. The four colors here show four competing methods, such as voting or averaging, to ensemble together the five methods compared in the previous image.
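Before we look at how those ensembles fared, here's a minimal sketch of the circle example and the resampling trick described above: each of 100 decision trees trains on its own bootstrap sample of the rows, and the ensemble simply averages their votes. The circle's radius, the tree depth, and the sample sizes are all made-up illustration choices, not the settings behind the animation.

```python
# A sketch of a "crowd" of decision trees for the inside-the-circle problem.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))                 # random points in the plane
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)    # 1 = inside the circular region

trees = []
for _ in range(100):
    # Bootstrap sample: draw rows at random with replacement, so some rows are
    # duplicated and others left out -- each tree ends up unique.
    rows = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[rows], y[rows]))

# To score new cases, average the trees' votes.
X_new = rng.uniform(-1, 1, size=(5, 2))
votes = np.mean([tree.predict(X_new) for tree in trees], axis=0)
print(votes)  # fraction of trees voting "inside" for each new point
```

Because every tree sees a slightly different version of the data, their boxy, axis-aligned boundaries disagree in different places, and averaging the votes smooths the combined boundary into something much closer to the circle.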
Those ensembling approaches each beat most of the simple methods on each data set, and they never do nearly as poorly as the lowest-ranked methods we saw a moment ago, the ones so far up in error, so far up the y-axis. And check out these comparisons of individual decision trees to an ensemble of decision trees. For each problem, the red bar shows the consistently lower error rate of an ensemble model in comparison to the simple models, whose higher error rates are shown by the blue bars.

But wait a minute. What about the question I asked at the beginning of this video? Why does increasing complexity help rather than overfit? Isn't that paradoxical? After all, an ensemble can grow to include thousands of component models, so it's a leap away from the keep-it-simple-stupid (KISS) principle, also known as Occam's razor. As John Elder put it in his paper "The Generalization Paradox of Ensembles," ensembles appear to increase complexity, so their ability to generalize better seems to violate the preference for simplicity summarized by Occam's razor. We've seen that building up a predictive model's complexity so that it more closely fits the training data can only go so far. After a certain point, true predictive performance, as measured over a held-aside test set, begins to suffer. But ensembles seem to be immune to this limitation.

John's paper resolves the apparent paradox by redefining complexity, measuring it by function rather than by form. Ensemble models look more complex, but do they act more complex? Instead of considering a model's structural complexity, how big it is or how many components it includes, he measures the complexity of the overall modeling method. He employs a measure called generalized degrees of freedom, which shows how adaptable a modeling method is: how much its resulting predictions change as a result of a small experimental change to the training data (a rough code sketch of this perturbation idea appears below). So if a small change in the data makes a big difference, the learning method may be brittle, susceptible to the whims of randomness and noise found within any data set. It turns out that this measure of complexity is lower for an ensemble of models than for individual models. Ensembles overadapt less. In this way, ensemble models exhibit less complex behavior, so their success in robustly learning without overfitting isn't paradoxical after all.

So ensemble models are a very popular, key technique for any modeling practitioner's toolbox. John Elder affectionately calls them a secret weapon. In the next video, we'll take a step back to compare and contrast the many methods that we've surveyed.
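And if you're curious what that kind of perturbation experiment might look like in code, here's a rough, purely illustrative sketch: add a little noise to the training targets, refit, and see how much the predictions move. This is a simplified stand-in for the generalized degrees of freedom measure, not John Elder's exact procedure, and the models, data, and noise scale are all assumptions.

```python
# Compare how much a single tree's predictions shift versus a forest's when
# the training targets are perturbed slightly -- a crude sensitivity measure.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=500)

def sensitivity(make_model, n_trials=10, eps=0.1):
    """Average change in training-set predictions per unit of noise added to y."""
    base = make_model().fit(X, y).predict(X)
    shifts = []
    for _ in range(n_trials):
        noise = rng.normal(scale=eps, size=len(y))
        refit = make_model().fit(X, y + noise).predict(X)
        shifts.append(np.mean(np.abs(refit - base)) / eps)
    return np.mean(shifts)

# A single fully grown tree typically chases the noise more than a forest does.
print("single tree:", sensitivity(lambda: DecisionTreeRegressor(random_state=0)))
print("forest:     ", sensitivity(lambda: RandomForestRegressor(n_estimators=100, random_state=0)))
```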