In this lecture, we go one step further than covariance and correlation: we will look at simple linear regression. To keep things simple, we will focus on just two variables; in further lectures, more variables will be considered. This video is quite long, but after this you really know a lot. The relationship between two variables can be represented by a covariance and a correlation. However, if one has the impression that values of one variable come prior to values of another variable, one may want to do a little bit more than just get an estimate of a correlation. In fact, you may want to use values of one variable to predict values of the other variable. Let us look at the scatter plot of prices and quantities in the following figure. The horizontal axis gives the quantity sold of the first 21 stamps of Spain in the period 1850 to 1853. On the vertical axis, we present the current price in euros according to the 2017 Michel catalogue for the mint-condition stamps. This not very clear picture shows that some stamps were issued in the multiple millions, and we also see that some of those first 21 stamps are now very expensive. All these stamps carry a picture of Queen Isabel II. Here she is. The data are in this table. Clearly, the distributions of the numbers, both for quantity and for price, are heavily skewed. This means that it is better to transform the data using the natural logarithm, with base e, the natural number, and to look at the resulting plots and the scatter plot of these two transformed variables. Now it is evident that larger quantities are associated with lower prices and smaller quantities with higher prices, and this makes sense. For the next series of stamps, from number 22 onwards, you may perhaps want to predict what the current collectors' price is. A useful tool for this purpose is called simple regression.
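The log transformation described above can be sketched in a few lines. The quantities and prices below are made-up illustrative values, not the actual catalogue data:

```python
import numpy as np

# Hypothetical quantities and prices for four stamps (illustrative values,
# not the actual catalogue data).
quantity = np.array([1_000, 50_000, 2_000_000, 5_000_000])
price = np.array([20_000.0, 3_000.0, 40.0, 5.0])

# Both variables are heavily right-skewed, so we take natural logarithms (base e).
log_quantity = np.log(quantity)
log_price = np.log(price)

# The logs compress the huge range: these quantities now span roughly 6.9 to 15.4.
print(log_quantity.round(1))
```

Plotting log_price against log_quantity then gives a scatter that is much easier to read than the raw data.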
Simple regression is the starting point for many econometric methods and techniques. The method of simple regression does not require advanced technical tools. Still, the ideas of simple regression are fundamental. Look again at the scatter plot of log prices versus log quantities. One would expect that a larger quantity in those days leads to a lower price today. An econometrician tries to quantify the magnitude of changes in prices due to changes in quantities. A straight line can be fitted reasonably well to the points in this diagram, as we see in the next figure. The actual points are the dots, and the predicted points are on the line. A given quantity does not always perfectly associate with the price level. The observed data are reasonably close to the line, but note that they do not lie exactly on it. The line is the following: for a given log quantity, the predicted log price is equal to the value of a plus the value of b times log quantity. We denote the difference between the actual log price and the predicted log price by the residual e. The coefficient b measures the slope or marginal effect, that is, the change in log price when log quantity changes by one unit. Clearly, the slope b here is negative. For this case of log-transformed data, the slope b is a price elasticity. In simple regression, the focus is on two variables of interest that we denote by y and x, where one variable, x, is thought to be helpful to predict the other variable, y. This helpful variable x we call the regressor variable or the explanatory factor, and the variable y, which is what we want to predict, is called the dependent variable or the explained variable. One may now assume that the 21 observations on log price obey a Gaussian distribution, like this one.
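The elasticity interpretation of b can be sketched as follows. The coefficients a = 12 and b = -0.4 below are hypothetical round numbers, chosen only to illustrate the log-log logic:

```python
import numpy as np

# Hypothetical coefficients for a fitted line (illustrative values only).
a, b = 12.0, -0.4

def predicted_log_price(log_quantity):
    # Predicted log price: a plus b times log quantity.
    return a + b * log_quantity

# Because both variables are in logs, b acts as an elasticity:
# raising the quantity by 1% lowers the predicted price by about 0.4%.
p1 = np.exp(predicted_log_price(np.log(1000)))
p2 = np.exp(predicted_log_price(np.log(1010)))   # quantity up by 1%
pct_change = (p2 / p1 - 1) * 100                  # close to b, i.e. about -0.4
```

The percentage change in the predicted price is (1.01 to the power b minus 1) times 100, which is approximately b for small changes.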
The observations are considered to be independent draws from the same normal distribution (denoted by the letter N, but better called the Gaussian distribution) with mean mu and standard deviation sigma. Note that we use the Greek letters mu and sigma for parameters that we do not know and that we want to estimate from the observed data; their data equivalents are m and s. The probability distribution of log price here is described by just two parameters, the mean and the variance. For the normal distribution with mean mu, the best prediction for a new observation on log price is equal to the mean mu. An estimator of the population mean is given by the sample mean. The sample mean is the prediction of log price when it does not depend on any other variable. In many cases, however, it helps to use additional information to improve the prediction. In our example, log quantity may help to predict log price, because a larger quantity will most likely lead to a lower price. When y is distributed as normal with mean mu and variance sigma squared, then the mean is the expected value, with notation E of y, and the variance sigma squared is the expected value of y minus mu squared, as follows. An estimator of the population mean is given by the sample mean, and an estimator of the variance is the sample variance. No need to worry, by the way, about why we use n minus 1 instead of n. The idea of using one variable to predict another, instead of just using the sample mean, means that we move from an unconditional mean to a conditional mean, that is, the mean of y given a value of x. For example, the conditional mean can be written like this. We thus move from an unconditional prediction to a conditional prediction, that is, from the first to the second expression on this slide. An alternative way of writing the conditional prediction follows from demeaning y by subtracting the linear term alpha plus beta x i, such that a normally distributed error term with mean 0 emerges like this.
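The unconditional estimators can be sketched for a small hypothetical sample of log prices (the values below are made up for illustration):

```python
import numpy as np

# Hypothetical sample of n observations on log price (illustrative values only).
y = np.array([7.2, 8.1, 6.9, 7.8, 7.5])

n = len(y)
m = y.sum() / n                        # sample mean: estimator of mu
s2 = ((y - m) ** 2).sum() / (n - 1)    # sample variance: estimator of sigma squared
                                       # (note the n - 1 in the denominator)

# Without any regressor, the best prediction of a new observation is simply m.
prediction = m
```

This is the unconditional prediction; the conditional prediction replaces m by alpha plus beta times x.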
This rewritten form will become useful when we want to estimate the coefficients alpha and beta from observed data, as will be demonstrated next. Until now, you acquired insights into two aspects of predicting values of a dependent variable y based on an explanatory variable x. First, there are the coefficients a and b that can be useful in practice for actual data, and then there are the parameters alpha and beta. We will now see how to obtain values for a and b from a given set of observations. We use the observed data on y and x to find optimal values of the coefficients a and b. The line y equals a plus b times x is called a regression line. We have n pairs of observations on y and x, and we want to find the line that gives the best fit to these points. The idea is that we want to explain the variation in the outcomes of the variable y by the variation in the explanatory variable x. When we use the linear function a plus b times x to predict y, we get residuals e, and we want to choose the fitted line such that these residuals are small. Thus, minimizing the residuals seems a sensible strategy to find the best possible values for a and b, and a useful objective function is the sum of squared residuals, like this one. This way of finding the values for a and b is called the method of least squares. Minimizing with respect to a and b gives the following results. The first expression shows that b is equal to the sample covariance of y and x divided by the sample variance of x. Clearly, there is a link between the expression for b and the expression for the correlation which we saw earlier. When we fit a straight line to a scatter of data, we also want to know how well the line fits the data, and one measure for this is called the R-squared. The line emerges from explaining the variation in the outcomes of y by means of the variation in the explanatory variable x.
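A minimal sketch of these least-squares formulas, on made-up data (the numbers are illustrative, not the stamp data):

```python
import numpy as np

# Illustrative data: five pairs of observations on x and y (hypothetical values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 3.9, 3.2, 1.8, 1.1])

# b equals the sample covariance of y and x divided by the sample variance of x.
b = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
# a follows because the fitted line passes through the point of sample means.
a = y.mean() - b * x.mean()

e = y - (a + b * x)                                      # residuals
r2 = 1 - (e ** 2).sum() / ((y - y.mean()) ** 2).sum()    # R-squared
```

The same coefficients come out of a routine such as np.polyfit(x, y, 1), which minimizes the identical sum of squared residuals.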
The R-squared is thus defined as 1 minus the fraction of the variation in y that is not explained by the regression model, like this. When the R-squared is 0, there is no fit at all, and when the R-squared is 1, the fit is perfect. Finally, the variance of the residuals is estimated as follows, where, again, there is no need to worry about the term n minus 2 in this case. Let us now see how it all works out for the data on the Spanish stamps. The average of log price is 7.841 and the average of log quantity is 11.177. The variance of log quantity is 7.032 and the covariance between log quantity and log price is -2.769. Taking these numbers to the equations for a and b, we get that b is -0.394 and a is about 12.24. The sum of squared residuals is 2.284, and hence s is equal to 0.347. Finally, the variance of y is equal to 1.199, and this makes the R-squared equal to 0.909. With these numbers, we can now also make a prediction given a certain value for x. That is, for a value x0 we can find a prediction for y0, written as y0 hat, like this. Of course, we are uncertain about this prediction. We can also report a prediction interval for this forecast, which is given by y0 hat plus or minus k times s, where s is the standard deviation of the estimated residuals and k is the number of standard deviations that you find reasonable. So suppose the quantity is only 500, say a very rare stamp. Log quantity is then equal to 6.215, and the prediction for the log price is 9.794. When we set k equal to 2, the associated prediction interval is between 9.1 and 10.5. Translating this back to prices by applying the exponential transformation to these two boundary values gives an interval from close to 9,000 to close to 36,000 euros, with a middle value of 22,000 euros. The value k equal to 2 is obtained from the Gaussian distribution in the previous video.
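Using the summary statistics quoted above, the whole calculation can be reproduced in a few lines (the variable names are my own):

```python
import numpy as np

# Summary statistics for the 21 Spanish stamps, as quoted in the lecture.
mean_y, mean_x = 7.841, 11.177     # average log price, average log quantity
var_x, cov_xy = 7.032, -2.769      # variance of log quantity, covariance with log price
ssr, n = 2.284, 21                 # sum of squared residuals, sample size

b = cov_xy / var_x                 # slope: about -0.394
a = mean_y - b * mean_x            # intercept: about 12.24
s = np.sqrt(ssr / (n - 2))         # residual standard deviation: about 0.347

x0 = np.log(500)                   # a very rare stamp: quantity 500
y0_hat = a + b * x0                # predicted log price: about 9.79

k = 2                                           # roughly 95% under the Gaussian
lo, hi = y0_hat - k * s, y0_hat + k * s         # prediction interval in logs
price_lo, price_hi = np.exp(lo), np.exp(hi)     # back to euros: about 9,000 to 36,000
```

Note the exponential transformation at the end, which converts the interval for log price back into an interval for the price itself.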
Hence, one can say with approximately 95% confidence that the predicted price, given an issued quantity of 500, is between 9,000 and 36,000 euros. In the next lecture, I will show you how to examine whether the parameter beta is equal to 0, which is the final important aspect of the simple regression model.