Linear regression is a powerful data science tool, and one you definitely need to be familiar with. If you’re not, that’s okay… Read last week’s article here for a good overview of the topic and its use cases.
So, you want to describe the linear relationship between a set of features and an outcome. You decide that linear regression is your plan of attack, and boot up RStudio or your favourite Python editor. Next, you import your machine learning libraries and write some code. You run the model and, success, 15 lines of complicated-looking console output are provided for your troubles…
You may think it’s now time for interpretation. However, there’s a key step we must undertake before continuing. Linear models have 4 key assumptions that should be satisfied before you can confidently interpret your output.
- Linear relationship between predictors and outcome.
- Independent residuals
- Normality of residuals
- Constant variance of residuals
Truth be told, these are assumptions, so we can never be completely confident that all 4 have held. However, we can check our model for any clear evidence that they have been compromised. We should definitely do this.
Linear relationship between predictors and outcome.
First, we can simply plot each predictor against our outcome in a pairwise fashion. It will be relatively clear if a variable does not have a linear relationship with the outcome, because you will struggle to draw a straight line that describes the relationship. This is a visual inspection (see below), and there is no need to go overboard here.
If it looks to be the case that a predictor isn’t linearly related to the outcome, don’t despair. We can trial some basic transformations, such as a log transformation, and replot the graph. This may resolve the violation (again see below), however you will have made your model ultimately more difficult to interpret (more on that later).
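To make the idea concrete, here is a minimal sketch using NumPy. The data is made up for illustration (an outcome that grows with the logarithm of the predictor), and the correlation coefficient is only a rough numeric stand-in for the visual check, not a replacement for looking at the plot.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the outcome grows with the logarithm of the predictor.
x = rng.uniform(1, 100, size=200)
y = 3.0 * np.log(x) + rng.normal(0, 0.3, size=200)

# Pearson correlation measures how well a straight line describes each pairing.
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(np.log(x), y)[0, 1]

print(f"correlation with x:      {r_raw:.3f}")
print(f"correlation with log(x): {r_log:.3f}")
```

If the transformed predictor lines up with the outcome far better than the raw one, that is a hint the transformation is worth keeping, at the cost of a coefficient that now describes a change in log(x) rather than in x.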
Independent residuals
Residuals are the difference between your predicted value and its paired ground truth. Another way of thinking about residuals is the vertical distance (if outcome is on the y axis) between the ground truth and the fitted regression line. The residuals shouldn’t be correlated with each other, and therefore you shouldn’t be able to easily establish a pattern in their appearance. A common issue arises with time series data, where residuals are correlated with time and drift upwards or downwards over the course of the series.
Most modern programming languages used for data science are able to create a plot of a model’s fitted values vs residuals. We should expect to see the residuals spread evenly across the fitted values, with no visible pattern (see below).
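If you would like a number to sit alongside the plot, one option (not covered above, so treat it as an optional extra) is the Durbin-Watson statistic, which checks whether each residual is correlated with the one before it. The sketch below computes it by hand with NumPy on two made-up residual series. It assumes the residuals are ordered in time, or in whatever order the data was collected.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest no lag-1 autocorrelation."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(1)
n = 500

# Independent residuals: pure noise.
independent = rng.normal(0, 1, size=n)

# Correlated residuals: each value carries over 90% of the previous one.
noise = rng.normal(0, 1, size=n)
correlated = np.zeros(n)
for t in range(1, n):
    correlated[t] = 0.9 * correlated[t - 1] + noise[t]

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)

print(f"independent residuals: {dw_independent:.2f}")
print(f"correlated residuals:  {dw_correlated:.2f}")
```

A value well below 2 points to positive correlation between neighbouring residuals, which is the pattern you would typically see with time series data.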
Normality of residuals
Here we are checking that the residuals are normally distributed.
To do this we create a Q-Q plot. Again, this is straightforward in Python or R. We want to see the points roughly following the straight diagonal reference line. Don’t worry too much about the start and finish of the line; mostly inspect the middle portion for any deviation. If there appears to be an issue, then we should search for outliers in the data and consider omitting them if need be. We can also apply a transformation to the data and rerun our Q-Q plot.
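Under the hood, a Q-Q plot simply pairs each sorted residual with the value we would expect at that position if the residuals were normal. Here is a rough sketch of that calculation using NumPy and the standard library. The helper `qq_points` and both residual series are made up for illustration, and the correlation of the points is just a quick summary of how closely they hug the reference line.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, sample) quantiles for a normal Q-Q plot."""
    sample = np.sort(np.asarray(residuals, dtype=float))
    n = len(sample)
    # Plotting positions (i - 0.5) / n avoid the infinite quantiles at 0 and 1.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return theoretical, sample

rng = np.random.default_rng(2)
residual_sets = {
    "normal": rng.normal(0, 1, size=300),
    "skewed": rng.exponential(1.0, size=300) - 1.0,  # right-skewed
}

qq_corr = {}
for name, resid in residual_sets.items():
    theoretical, sample = qq_points(resid)
    # Closer to 1 means the points sit closer to a straight line.
    qq_corr[name] = np.corrcoef(theoretical, sample)[0, 1]
    print(f"{name}: Q-Q correlation = {qq_corr[name]:.3f}")
```

Plotting `theoretical` against `sample` gives the Q-Q plot itself; the skewed residuals will visibly bend away from the line.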
Constant variance of residuals
Our final assumption refers to the constant variance of our residuals. We assess this by once again inspecting the model’s fitted values vs residuals plot.
We would like to see an even vertical spread of the data points along the horizontal axis. There are a few things one can trial when evidence of heteroscedasticity (non-constant variance) arises, but they are outside the scope of this article.
Okay, phew... Our model has run successfully and we are satisfied that all 4 of the model’s assumptions have held. Now time for the fun stuff…
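As a rough numeric companion to the plot, we can compare how spread out the residuals are at low fitted values versus high fitted values. The sketch below uses NumPy with made-up data; the helper `spread_ratio` is a hypothetical name and an informal check rather than a formal statistical test.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(1, 10, size=n)

# Hypothetical outcomes: one with constant noise, one whose noise grows with x.
y_constant = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=n)
y_growing = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=n) * x

def spread_ratio(x, y):
    """Residual spread in the upper half of fitted values divided by the lower half."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    residuals = y - fitted
    upper = fitted > np.median(fitted)
    return residuals[upper].std() / residuals[~upper].std()

ratio_constant = spread_ratio(x, y_constant)
ratio_growing = spread_ratio(x, y_growing)

print(f"constant noise: {ratio_constant:.2f}")
print(f"growing noise:  {ratio_growing:.2f}")
```

A ratio close to 1 is what an even vertical spread looks like in numbers, while a ratio well above or below 1 matches the funnel shape you would spot on the plot.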
For a basic interpretation of the output we can consider 2 terms:
- Beta coefficient
- Significance value (P-value)
In linear regression, the beta coefficient of a predictor represents the change in the outcome for a one-unit change in the predictor. For example, suppose we are trying to predict the weight of a cancer tumour (measured in grams) using a patient’s age (measured in years). If the beta coefficient for age is 5, then for every 1 year increase in age we would predict a 5 gram increase in tumour weight.
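Here is a small sketch of the tumour example using NumPy. The numbers are simulated purely for illustration (they are not real clinical data), and the true slope is set to 5 so we can see the fitted coefficient recover it.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data for illustration only (not real clinical data).
age = rng.uniform(30, 80, size=200)                              # years
tumour_weight = 20.0 + 5.0 * age + rng.normal(0, 10, size=200)   # grams

# np.polyfit with deg=1 returns [slope, intercept] for a straight-line fit.
beta_age, intercept = np.polyfit(age, tumour_weight, deg=1)

# Predictions for two patients one year apart differ by exactly the slope.
pred_50 = intercept + beta_age * 50
pred_51 = intercept + beta_age * 51

print(f"beta for age: {beta_age:.2f} grams per year")
print(f"predicted difference, age 51 vs 50: {pred_51 - pred_50:.2f} grams")
```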
The p-value in this circumstance is a proxy for our confidence in that result. More precisely, it is the probability of seeing a coefficient at least as far from zero as ours if there were truly no relationship between the predictor and the outcome. In statistics, we set an arbitrary limit for when we consider a result significant and “to be trusted”. In medicine, this is usually the 5% level and corresponds to the 95% confidence level. Therefore, in the above example, where we were predicting cancer tumour weight, we “trust” the result when our p-value falls below 0.05. Again, this is just an arbitrary limit, and there is a more recent push in academia to move away from p-values altogether.
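To get a feel for what a p-value measures, here is a sketch that estimates one by shuffling. Note that this is a permutation test, which is not how R or Python regression output usually arrives at its p-values (those typically come from a t-test); it is used here only because it needs nothing beyond NumPy and makes the logic visible. The data is simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data for illustration only: a true slope of 5 grams per year.
age = rng.uniform(30, 80, size=100)
tumour_weight = 20.0 + 5.0 * age + rng.normal(0, 40, size=100)

def slope(x, y):
    """Slope of the least-squares straight line through (x, y)."""
    return np.polyfit(x, y, deg=1)[0]

observed = slope(age, tumour_weight)

# Shuffling the outcome breaks any real link with age, so the shuffled slopes
# show what "no relationship" looks like for this sample.
n_perm = 2000
shuffled = np.array(
    [slope(age, rng.permutation(tumour_weight)) for _ in range(n_perm)]
)

# Two-sided p-value: share of shuffled slopes at least as extreme as observed.
p_value = (np.sum(np.abs(shuffled) >= abs(observed)) + 1) / (n_perm + 1)

print(f"observed slope: {observed:.2f}")
print(f"permutation p-value: {p_value:.4f}")
```

If almost none of the shuffled slopes are as large as the one we observed, the p-value is small and the result clears the 0.05 limit.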
To complicate things just a little more before we finish, if you have many predictors in your linear regression, then we need to modify our interpretation slightly. In this case, the beta coefficient is the change in outcome per unit increase in one predictor when the other predictors are held at fixed values. Building on our earlier example, suppose we used both age and BMI to predict cancer tumour weight. If our beta coefficient for age remained at 5, our interpretation would be a 5 gram increase in tumour weight for a one year increase in age, when BMI is held constant. This is sometimes called controlling or adjusting for variables.
There are other terms that can be included in our model, such as interaction, fixed and random effects terms, but I think we will leave them for another week!
Now that we are able to interpret our model’s output at its most basic level, we can move on to making predictions from new data points based on what we’ve learnt.
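The “held constant” idea is easier to see with two imaginary patients who share the same BMI but differ in age by one year. The sketch below fits the two-predictor model with NumPy’s least-squares solver. All values are simulated for illustration, and the coefficients used to generate the data (5 for age, 2 for BMI) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300

# Synthetic data for illustration only; BMI is made to drift upward with age.
age = rng.uniform(30, 80, size=n)
bmi = 18.0 + 0.1 * age + rng.normal(0, 3, size=n)
tumour_weight = 10.0 + 5.0 * age + 2.0 * bmi + rng.normal(0, 10, size=n)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(n), age, bmi])
coef, *_ = np.linalg.lstsq(X, tumour_weight, rcond=None)
intercept, beta_age, beta_bmi = coef

# Two hypothetical patients with the same BMI, one year apart in age.
patient_a = np.array([1.0, 60.0, 25.0])
patient_b = np.array([1.0, 61.0, 25.0])
difference = patient_b @ coef - patient_a @ coef

print(f"beta for age: {beta_age:.2f}, beta for BMI: {beta_bmi:.2f}")
print(f"predicted difference at equal BMI: {difference:.2f} grams")
```

The predicted difference between the two patients equals the age coefficient, which is exactly what “a one year increase in age, when BMI is held constant” means.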
Thanks for reading and please subscribe here for weekly content like this.