Decision Trees in 5 minutes
Decision trees are simple yet powerful. They offer advanced machine learning with relatively high interpretability (in contrast to powerful “black box” algorithms such as neural networks). If you’ve ever wondered how computers learn to answer important questions for themselves, then stick around for a few minutes.
My particular interest is in how Artificial Intelligence can be applied to Healthcare, hence I’ll be using examples of this to show you how decision trees function in the wild.
Decision trees were first considered in a statistical sense by a British statistician named William Belson (You can find the paper here). They’ve grown in popularity in recent years due to their implementation within the machine learning world and specifically their use within random forests. Hence, I’ll focus on their application within this sphere.
Fundamentally, you can think of a decision tree as a set of discriminating questions that separate input data. Decision trees classify observations, using categorical or continuous features, into meaningful groups based on their outcome. Their 4 main components are the following:
- Root nodes
- Internal nodes
- Leaf nodes
- Branches
Let’s begin with a problem. Respiratory disease affects one of our major organ systems (our lungs) and contributes substantially to the ill health of all populations. Being able to predict which patients are at a heightened risk of developing respiratory disease can help clinicians (like myself) and researchers reduce the burden of disease by targeting health care resources, amongst other things. This is no easy task, and a decision tree is a natural tool with which to tackle it.
The name of the game is to predict, or group, patients with certain health characteristics into those with respiratory disease vs those without. With this in mind, we can start to build our tree.
For our example, we have collected the following data about 50 patients:
- Smoking status (binary variable)
- Gender (binary variable)
- Age (continuous variable)
- Respiratory disease (binary outcome variable)
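To make the walkthrough concrete, here is a minimal sketch of what such a dataset might look like in code. The 50 patients below are randomly generated stand-ins (not real clinical data), and the relationship between the features and the outcome is invented purely so the tree has something to learn:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_patients = 50

# Synthetic stand-ins for our four variables (not real patient data)
smoker = rng.integers(0, 2, size=n_patients)   # 0 = non-smoker, 1 = smoker
gender = rng.integers(0, 2, size=n_patients)   # binary encoding
age = rng.integers(18, 90, size=n_patients)    # years

# Invented outcome: loosely driven by smoking and age, plus noise
risk = 0.5 * smoker + 0.01 * age + rng.normal(0, 0.2, size=n_patients)
respiratory_disease = (risk > 0.8).astype(int)

X = np.column_stack([smoker, gender, age])  # features
y = respiratory_disease                     # outcome
print(X.shape, y.shape)  # (50, 3) (50,)
```

Any real analysis would of course start from collected patient records rather than simulated numbers, but this shape (a feature matrix plus an outcome vector) is all a decision tree needs.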
All 50 of our observations (patients) must move into the root node. At this point our tree decides the best feature to separate our data and move it into the internal nodes. The branches begin to separate the data, but how do we decide which feature the branches should rely on? There are several established methods for this task.
I’ll address the most commonly used method: Gini impurity…
Gini impurity is a way of determining the optimal path for our data to take. In our example, we would take our list of features (health characteristics) and calculate the Gini impurity for each variable. For a group of observations, the Gini impurity is 1 − Σᵢ pᵢ², where pᵢ is the proportion of the group belonging to outcome class i, but you only need to superficially grasp what Gini impurity represents to understand the power of decision trees. It represents the degree to which a question about a group of observations separates those observations with regard to their outcome: a score of 0 means the group is completely pure (everyone shares the same outcome), while higher scores mean the group is more mixed.
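As a quick sketch, the Gini calculation can be written as a small helper function (here each pᵢ is the proportion of the group in one outcome class):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions.

    0.0 means the group is pure (all one outcome);
    0.5 is the worst case for a binary outcome (a 50/50 mix).
    """
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

print(gini_impurity([1, 1, 1, 1]))  # 0.0 -> pure group
print(gini_impurity([0, 0, 1, 1]))  # 0.5 -> maximally mixed
print(gini_impurity([0, 1, 1, 1]))  # 0.375
```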
Returning to our respiratory illness example, we can calculate the degree to which each of our features classifies our patients as having respiratory disease or not. Whichever feature is optimal for this task (i.e. has the lowest Gini impurity score) forms the first branches from the root node to the internal nodes. Conventional wisdom would suggest this is likely to be smoking status. 20 out of our 50 patients smoke, so one of our branches and its internal node now represent the 20 smokers of the group.
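The comparison between candidate features can be sketched as follows: score each binary split by the size-weighted average of its children’s impurities, then pick the feature with the lowest score. The ten-patient node below uses made-up values purely for illustration, not the article’s actual cohort:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini_of_split(feature_values, outcomes):
    """Score a split: size-weighted average of each child node's impurity."""
    n = len(outcomes)
    score = 0.0
    for side in set(feature_values):
        child = [o for f, o in zip(feature_values, outcomes) if f == side]
        score += (len(child) / n) * gini(child)
    return score

# A toy node of 10 patients (illustrative values only)
smoker  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
gender  = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
disease = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]

# Smoking separates the outcomes well -> lower (better) score than gender
print(weighted_gini_of_split(smoker, disease))
print(weighted_gini_of_split(gender, disease))
```

In this toy node, splitting on smoking status yields the lower weighted Gini score, so smoking would win the branch, matching the intuition in the example above.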
To continue growing our tree, we simply repeat the above process for each of the new internal nodes. Let’s take our group of smokers. We ask which of the features that haven’t already been used on our path back to the root node (age and gender) has the lowest Gini impurity. We also need to compare that lowest Gini score (let’s assume age had the lowest in this case) with the current node’s own level of outcome discrimination. Can you see how we might calculate this? We calculate the Gini impurity of the internal node itself. Now let’s say splitting on age further improves our Gini score. We add a new branch and a new internal node, and we ask the same question again: which of the remaining features (health characteristics) in our data best separates our group of patients by the outcome (respiratory disease)?
If you’re following the example (well done), then you may be wondering “how does this process ever stop?” or “what about the leaf nodes?”. Should we find at any point that an internal node is optimally grouped given its current architecture of branches and nodes, then we set that node to be a leaf node. This represents the final resting place for our observations (…patients). When all unbranched nodes become leaf nodes, we have successfully trained our model.
With this model stored in a computer’s memory, we now have a powerful prediction algorithm for looking at new data and deciding whether it is probable that a patient will develop respiratory illness. We can also inspect our decision tree to see which factors are most important in predicting respiratory disease.
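In practice you would rarely grow a tree by hand; libraries such as scikit-learn wrap the whole train–predict–inspect cycle. Here is a hedged sketch (assuming scikit-learn is installed) using the same kind of synthetic data as before; the “new patient” and the feature names are illustrative inventions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic cohort of 50 patients (invented, as before)
rng = np.random.default_rng(seed=0)
n = 50
smoker = rng.integers(0, 2, size=n)
gender = rng.integers(0, 2, size=n)
age = rng.integers(18, 90, size=n)
y = ((0.5 * smoker + 0.01 * age + rng.normal(0, 0.2, size=n)) > 0.8).astype(int)
X = np.column_stack([smoker, gender, age])

# Train the tree; max_depth is one common stopping rule
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Predict for a new (hypothetical) patient: a 70-year-old smoker
new_patient = [[1, 1, 70]]
print(tree.predict(new_patient))

# Inspect which features the tree leaned on, and its question structure
print(dict(zip(["smoker", "gender", "age"], tree.feature_importances_)))
print(export_text(tree, feature_names=["smoker", "gender", "age"]))
```

The printed feature importances and tree diagram are exactly the kind of interpretability that sets decision trees apart from black-box models.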
Here is an amazing example of decision trees being used to predict heart attacks following angiography. The authors achieved a prediction accuracy that rivalled that of a logistic regression model (which we’ll discuss another time).
The plot thickens when we start to augment our decision trees. One of the most popular methods for this is random forests. They’re as spooky as they sound. My next article will address the strange world of random forests and of course their application in healthcare!
If you enjoyed this and found it useful, please give it a clap and consider subscribing for more like this. Here is a link to my YouTube channel to learn more.