Linear Regression: As simple as G&T
Machine learning without linear regression is like gin without tonic. It’s one of the first weapons in a data scientist’s arsenal and remains a powerful and popular tool in research as well as in commercial applications.
Linear regression was first used in the 1800s to predict planetary movements! It has since been applied in fields ranging from economics to environmental science, successfully describing linear relationships between observations and their outcome. It is robust to different forms of input data, but the outcome data must be continuous. Like most things, it makes more sense with an example and some intuitive graphs…
Let’s take 50 patients for whom we know both their age and their depression score. We can plot their data on a 2D graph as follows:
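Since the patient dataset behind this plot isn’t published, here is a minimal sketch that generates comparable synthetic data and plots it. The trend and noise parameters are made up purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
age = rng.uniform(20, 80, size=50)  # 50 synthetic patients, ages 20-80

# Hypothetical underlying trend plus noise; the slope, intercept and
# noise level are invented for this sketch.
depression_score = 0.5 * age + 5 + rng.normal(0, 5, size=50)

plt.scatter(age, depression_score)
plt.xlabel("Age (years)")
plt.ylabel("Depression score")
plt.title("Age vs depression score (synthetic data)")
plt.show()
```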
As you can see, there is clearly a relationship between a patient’s age and their depression score. Now what if we want to describe this relationship objectively? I think we can agree that the relationship looks visually linear, but the exact equation of this linear relationship is unknown. In real life, it may be useful to predict a new patient’s depression score based on some key demographics, such as age. This may help a general practitioner quickly screen many patients, identify those at higher risk of depression, and prioritise consultations accordingly. In order to make this prediction, we need to know the equation of the line that fits the data best. Obviously, this is an oversimplification for comprehension, but stay with me.
The first key part of this problem is that we assume the relationship between our input data and our outcome data is linear. This is one of the four major assumptions required by linear regression (the others being independence of observations, constant variance of the errors, and normally distributed residuals) and is not always as straightforward as one might think. If the relationship between our data points looks more complex than a linear descriptor (y = mx + c), then perhaps a different tool should be used.
Linear relationship → For every unit increase in X, Y changes by a constant amount: the slope, m.
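In code, that constant slope is the entire model. A tiny sketch, where the slope m and intercept c are made-up values standing in for the unknown true line:

```python
# Hypothetical parameters: the true slope and intercept are unknown until we fit.
m, c = 0.5, 5.0

def predict_score(age_years):
    """Predicted depression score under the line y = m*x + c."""
    return m * age_years + c

print(predict_score(40))  # 25.0
print(predict_score(41))  # 25.5 -- one extra year adds exactly m = 0.5
```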
So, we are happy to assume our relationship is linear; what next? Well, now we need to decide where our “line of best fit” should go. We could decide this by trial and error. This would involve drawing many lines, such as the line below, and visually inspecting which line looks best.
As you can imagine, visually checking many lines and subjectively judging how well each describes the data is time-consuming and inaccurate. Linear regression offers us an objective alternative to this process.
In linear regression, we measure the error between each of our observations and the corresponding prediction that the “line of best fit” would make. Error can be measured in many ways, but most commonly we use the least squares method, which in this example involves an error metric called the mean squared error (MSE).
Let’s take our red line drawn by trial and error above. Each observation sits some distance away from the red line along the y-axis. We can take our observed value of y (the depression score) and subtract our predicted value of y (the predicted depression score given by the red line). This value is then squared to remove any directional information from the metric. We repeat this process for every observation and take the sum, then divide by the total number of observations to give us our overall error metric. We now have an objective measure of how well the red line describes the linear relationship in our data.
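Here is that calculation as a short Python sketch. It reuses synthetic data in place of the real patient records, and the trial slope and intercept standing in for the red line are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
age = rng.uniform(20, 80, size=50)                        # synthetic patients
depression_score = 0.5 * age + 5 + rng.normal(0, 5, 50)   # synthetic outcomes

def mean_squared_error(y_true, y_pred):
    residuals = y_true - y_pred         # observed minus predicted
    squared = residuals ** 2            # squaring removes directional information
    return squared.sum() / len(y_true)  # sum, then divide by n observations

# A trial "red line" with arbitrary slope and intercept
y_pred = 0.8 * age - 10.0
print(mean_squared_error(depression_score, y_pred))
```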
The goal now is to find the line that minimises the mean squared error. That line is shown below in blue…
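In the simple one-variable case, this minimisation has a well-known closed-form solution, and numpy’s polyfit computes it directly. A sketch on the same synthetic data as above:

```python
import numpy as np

rng = np.random.default_rng(42)
age = rng.uniform(20, 80, size=50)
depression_score = 0.5 * age + 5 + rng.normal(0, 5, size=50)

# polyfit with degree 1 returns the slope and intercept that minimise
# the sum of squared errors: the least-squares line of best fit.
m_best, c_best = np.polyfit(age, depression_score, deg=1)
print(f"best fit: y = {m_best:.2f}x + {c_best:.2f}")
```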
I’m sure you can agree that our new line visually fits the data much better, and we can feel more confident that a prediction based on this line will be more informative.
In a 2-dimensional space the above approach works well; however, as alluded to earlier, life often isn’t this simple. In healthcare, we are likely to get better predictions if we include many clinico-demographic variables, such as gender, exercise levels, dietary measures and comorbidities. To do this, we must move into an n-dimensional space, where the error-minimisation problem that underpins linear regression is solved computationally rather than by eye. This is where the true utility of linear regression comes from, and as usual I will include an example here of it being used in healthcare to impact patient outcomes.
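As a sketch of what that looks like in practice, scikit-learn’s LinearRegression handles the n-dimensional minimisation for us. All of the predictors, coefficients and patient values below are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Hypothetical clinico-demographic predictors, all synthetic
age = rng.uniform(20, 80, size=n)
exercise_hours = rng.uniform(0, 10, size=n)    # weekly exercise
comorbidities = rng.integers(0, 5, size=n)     # number of conditions

X = np.column_stack([age, exercise_hours, comorbidities])
# Synthetic outcome built from made-up coefficients plus noise
y = 0.5 * age - 1.2 * exercise_hours + 2.0 * comorbidities + rng.normal(0, 5, size=n)

model = LinearRegression().fit(X, y)   # least-squares fit across 3 predictors
print(model.coef_, model.intercept_)

# Predict for a new patient: age 55, 2 h exercise/week, 1 comorbidity
print(model.predict([[55, 2, 1]]))
```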
Today, I just wanted to give an overview of linear regression and walk you through its simplest case. In future articles, we will take a deep dive into all the components of linear regression and its different use cases in academia and industry.
Subscribe here for weekly articles, and find my other articles explaining machine learning in a healthcare setting. Also, if you’re more of a visual learner, consider subscribing to my YouTube channel here.