So far, we've used regression to test linear associations between our explanatory variables and our response variable. By linear, we mean that the association is best described by a straight line. Here is a scatterplot showing a linear association between urban rate and Internet use rate from the GapMinder data set. That is, we can draw a straight line through the scatterplot, and this regression line does a pretty good job of capturing the association. But what if the association is not linear? That is, what if the association is curvilinear? For example, take the relationship between anxiety and performance. It is a well-established phenomenon that both low anxiety and high anxiety are related to poor performance, but that a moderate level of anxiety is optimal for performing at your best. If you drew a straight linear regression line through these points, most points would be quite far from the line, meaning that there is a lot of prediction error. The best-fitting line is not straight; rather, it is one that curves to capture the non-linear nature of the association. Here's an example of a less extreme curvilinear association, between urban rate and female employment rate, with a linear regression line. Returning to the SAS code for the GapMinder data set, the code to produce this scatterplot is here. We use the SGPLOT procedure to create a scatterplot for our x variable, urban rate, and our y variable, female employment rate. After the slash, we add some options for the regression line using the LINEATTRS option, which stands for line attributes. Specifically, we set LINEATTRS equal to, in parentheses, COLOR=blue and THICKNESS=2. These options ask for a blue regression line that is a little thicker than the default of THICKNESS=1. Outside the parentheses, we add the CLM option, which asks SAS to print the 95% confidence limits for the regression line, followed by a semicolon.
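The code described above might look something like the following minimal sketch; the dataset name (gapminder) and the exact variable names (urbanrate, femaleemployrate) are assumptions based on the GapMinder data set, not shown in the transcript.

```
/* Sketch of the scatterplot with a linear regression line and
   95% confidence limits (CLM); dataset and variable names are assumed. */
PROC SGPLOT DATA=gapminder;
  REG X=urbanrate Y=femaleemployrate / LINEATTRS=(COLOR=blue THICKNESS=2) CLM;
  XAXIS LABEL="Urbanization Rate";
  YAXIS LABEL="Female Employment Rate";
RUN;
```

The REG statement in PROC SGPLOT draws both the scatter markers and the fitted line, so no separate SCATTER statement is needed here.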
Then we label our axes and type RUN to run the code. You can see that female employment rate appears to decrease as urbanization rate increases, but that at urban rates of around 80 or higher, female employment rate appears to increase again. So it looks like a kind of U-shaped association. Just as with the anxiety and performance association, a straight linear regression line isn't doing a good job of picking up on the curvilinear part of the association. We can fit a line that curves to better capture this association by adding a polynomial term. For example, we could add a second-order polynomial, or quadratic, term to draw a line of best fit that captures the curvature we are seeing. To do this, we use the same SGPLOT procedure code that was used to draw the scatterplot with a linear regression line. The difference is that we're asking for two lines to be plotted: a straight linear regression line and a curved quadratic regression line. In the second line of code, we ask for a linear regression line by adding DEGREE=1 to the options following the slash. DEGREE=1 asks for a first-order polynomial, that is, a straight line. In the third line of code, we ask for a quadratic regression line by adding DEGREE=2 to the options following the slash. DEGREE=2 asks for a second-order polynomial, or quadratic, line. You can also see that I added some line attributes to make the color of this regression line green and to increase its thickness to 2. Now my scatterplot shows the original linear regression line in blue and the quadratic regression line in green. Notice how the quadratic line does a better job of capturing the association at lower and higher urbanization rates. The points at these levels are closer to the quadratic, or second-order polynomial, curve, meaning that the expected or predicted values are closer to the actual observed values.
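The two-line plot described above could be sketched as follows; again, the dataset and variable names are assumptions.

```
/* Sketch: linear (DEGREE=1) and quadratic (DEGREE=2) fits on one plot;
   dataset and variable names are assumed. NOMARKERS keeps the second
   REG statement from drawing the scatter points a second time. */
PROC SGPLOT DATA=gapminder;
  REG X=urbanrate Y=femaleemployrate / DEGREE=1
      LINEATTRS=(COLOR=blue THICKNESS=2);
  REG X=urbanrate Y=femaleemployrate / DEGREE=2 NOMARKERS
      LINEATTRS=(COLOR=green THICKNESS=2);
  XAXIS LABEL="Urbanization Rate";
  YAXIS LABEL="Female Employment Rate";
RUN;
```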
So, based on just looking at the two curves, it looks like the green quadratic curve fits the data better than the blue linear regression line. But we can be even more confident in this conclusion if we test whether adding a second-order polynomial term to our regression model gives us a significantly better-fitting model. We do this by simply adding another explanatory variable, the squared value of our explanatory variable x, that is, x squared, to the regression model. First, let's test a regression model for just the linear association between urbanization rate and female employment rate using the GLM procedure. Note that we have centered our quantitative explanatory variable, urban rate, creating the variable urbanrate_c. Centering is especially important when you're testing a polynomial regression model, because it makes the regression coefficients considerably easier to interpret. If we look at the results, we can see from the significant p-value and negative parameter estimate that female employment rate is negatively associated with urbanization rate. So the linear association, the blue line in the scatterplot, is statistically significant. But the R-square is 0.09, indicating that the linear association with urban rate captures only about 9% of the variability in female employment rate. But what happens if we allow that straight line to curve by adding a second-order polynomial to the regression equation? The SAS code to do this is here. As you can see, it's the same code as for the linear regression model, with the exception that we've added another explanatory variable, urbanrate_c*urbanrate_c. This gives us the square of the centered urbanrate variable, which is a second-order polynomial, or quadratic, term. When we look at the table of results, we see that the coefficient for the linear urbanrate term is negative and its p-value is less than 0.05.
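The centering step and the two GLM models described above might be sketched as follows. The dataset and variable names are assumptions, and the use of PROC SQL to subtract the sample mean is one of several ways to center a variable in SAS.

```
/* Sketch: center urbanrate by subtracting its sample mean,
   then fit the linear and quadratic models. Names are assumed. */
PROC SQL;
  CREATE TABLE gapminder_c AS
  SELECT *, urbanrate - MEAN(urbanrate) AS urbanrate_c
  FROM gapminder;
QUIT;

/* Linear model */
PROC GLM DATA=gapminder_c;
  MODEL femaleemployrate = urbanrate_c / SOLUTION;
RUN;

/* Quadratic model: add the squared (centered) term */
PROC GLM DATA=gapminder_c;
  MODEL femaleemployrate = urbanrate_c urbanrate_c*urbanrate_c / SOLUTION;
RUN;
```

The SOLUTION option asks PROC GLM to print the parameter estimates, and writing urbanrate_c*urbanrate_c in the MODEL statement lets GLM form the squared term without creating a new variable in a DATA step.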
In addition, the quadratic term is positive and significant, indicating that the curvilinear pattern we observed in our scatterplot is statistically significant. A negative linear coefficient combined with a positive quadratic coefficient indicates a convex curve: one that starts high, goes down, and then starts to go up again. This is consistent with the bowl-shaped pattern we saw. In addition, we see that the R-square increased to 0.16, which means that adding the quadratic term for urban rate increased the amount of variability in female employment rate that can be explained by urbanization rate from 9% to 16%. Together, these results suggest that the best-fitting line for this association is one that includes some curvature. You might wonder whether putting in a squared term for a variable already in the model creates multicollinearity, which is a high association, or correlation, between the explanatory variables. Ordinarily, if we have two highly correlated explanatory variables, we would want to put only one of them in the model. Obviously, the second-order polynomial, which is urbanrate squared, is correlated with urbanrate. However, in this case, we want to keep it in the model, because we specifically want to capture the curvilinear relationship that was evident in our scatterplot. This brings us to another important benefit of centering: centering substantially reduces the correlation between the linear and quadratic variables in a polynomial regression model. We can also test for more complex, non-linear associations by adding higher-order polynomials. For example, you can add cubic (third-order polynomial) or even quartic (fourth-order polynomial) terms to the model to account for more complex curves. For example, this scatterplot shows more than one curve. In this case, adding a cubic, or third-order polynomial, term might improve the fit of the model.
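Extending the quadratic model with a cubic term follows the same pattern; this sketch reuses the assumed names from above and is not code shown in the lecture.

```
/* Sketch: adding a cubic (third-order) term to the polynomial model;
   dataset and variable names are assumed. */
PROC GLM DATA=gapminder_c;
  MODEL femaleemployrate = urbanrate_c
                           urbanrate_c*urbanrate_c
                           urbanrate_c*urbanrate_c*urbanrate_c / SOLUTION;
RUN;
```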
One thing to keep in mind, though, is that modeling more complex curves in the sample data will often improve the fit of the model for that sample. However, it also increases the risk of something called overfitting. Overfitting occurs when you get a model to explain your response variable really well in the sample, but that model becomes very specific to the sample data. That is, it capitalizes on the variability that's present in that particular sample. The downside is that an overfitted model that fits the sample really well may not fit well at all on another sample drawn from the same population. An overfitted model is biased toward the sample it was developed on, and consequently, the conclusions we draw from that model may not be representative of the population. So we try to strike a balance, identifying a model that fits our sample well but would also fit well if we tested it on another sample from the same population. This tension is called the bias-variance tradeoff, and we will discuss it in more detail in the fourth course of the specialization, Machine Learning for Data Analysis.