A Random Forest A Day Keeps The Doctor Away…

James Parkin
6 min readFeb 11, 2021

The future of medicine is decision augmentation using novel machine learning algorithms in parallel with expert clinical experience. Once we incorporate the massive advancements in personalised medicine, made possible by Omics association studies, we will have a futuristic healthcare service rivalled only by Star Trek’s “Tricorder”…

Ignoring the cheesy Star Trek reference, it is truly exciting (in my humble opinion) to be on the brink of the next revolution in healthcare. Millennial doctors are flocking in their hundreds to the field of Artificial Intelligence in Healthcare, and for good reason. I aim to teach whoever will listen, through YouTube videos and articles such as this one, the fundamentals that will allow them to propel their careers into warp drive… Okay, no more Star Trek references.

Last week I briefly outlined how decision trees function and their utility in healthcare. Random forests build on this concept, so if you are unfamiliar with decision trees, give last week's article a read.

We can thank Leo Breiman and Adele Cutler for modernising the Random forest method, which was first described by Tin Kam Ho in 1995.

I feel the major components of Random forests are as follows:

  • Bootstrapping data
  • Decision trees
  • Random variable selection
  • Bagging

In machine learning, if you want to achieve state-of-the-art performance, common practice is to ensemble several models together. A Random forest is essentially an ensemble of decision trees: we compute many decision trees and ask each of them, one by one, to predict our outcome.
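To make this concrete, here is a minimal sketch (using scikit-learn and an invented data set, both my own choices rather than anything from this article) showing that a trained Random forest really is just a collection of decision trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 "patients", 3 features.
X, y = make_classification(n_samples=200, n_features=3, n_informative=2,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# The fitted forest is literally a list of fitted decision trees.
print(len(forest.estimators_))                # 100
print(type(forest.estimators_[0]).__name__)   # DecisionTreeClassifier
```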

Bootstrapping the data and aggregating the trees' predictions is, rather unfortunately, termed "bagging"…

If you are caught up with decision trees, you may well be asking: Surely, we will just get the same prediction many times if we compute multiple decision trees? After all, the Gini impurity will remain constant unless the data changes… This is where the “Random” in Random forests begins to set in.

Our data set will go through a process of bootstrapping before we compute each decision tree. Bootstrapping consists of drawing many random samples from our data, each the same size as the original set, to train the trees. The randomness comes from sampling with replacement: every observation that is selected is "replaced" back into the pool so that it can potentially be picked again. This means one sample can contain several copies of the same observation (although many copies is unlikely), and different samples end up with very different combinations of observations. It is this replacement that separates bootstrapping from methods such as cross-validation.
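As a rough illustration of sampling with replacement (plain NumPy, with ten made-up records standing in for patients):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)            # pretend these are 10 patient records

# One bootstrap sample: same size as the original, drawn WITH replacement,
# so some records appear more than once and others not at all.
sample = rng.choice(data, size=len(data), replace=True)
print(sample)

# The records never drawn are left out of this sample (more on that later).
left_out = np.setdiff1d(data, sample)
print(left_out)
```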

To begin planting our forest, let's outline our data set and imagine we've collected the respiratory disease status of 50 individuals. Let's also say we know their age, gender and smoking status. We take our set of 50 observations and bootstrap them as outlined above, producing 100 bootstrap samples, each of size 50. From this bootstrapped data, we grow 100 decision trees.
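Here is a hedged sketch of that step, with invented numbers standing in for the 50 patients (the ages, genders and smoking statuses are random values, not real observations):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_patients, n_trees = 50, 100

# Invented features: age, gender (0/1), smoker (0/1), plus a disease label.
X = np.column_stack([rng.integers(18, 90, n_patients),
                     rng.integers(0, 2, n_patients),
                     rng.integers(0, 2, n_patients)])
y = rng.integers(0, 2, n_patients)

forest = []
for _ in range(n_trees):
    idx = rng.integers(0, n_patients, n_patients)   # bootstrap sample of size 50
    tree = DecisionTreeClassifier()
    tree.fit(X[idx], y[idx])
    forest.append(tree)

print(len(forest))   # 100 trees, each grown on its own bootstrap sample
```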

We almost have our Random forest… However, there's a nifty trick we employ to improve the robustness of our predictions. If you remember from last week's article, we talked about the stepwise process of asking the most discriminatory question (with respect to separating the outcome) at each node of the tree, comparing every feature of the data. In Random forests, we develop our decision trees with an extra element of randomness: at every node we consider only a random subset of the features when computing and comparing Gini impurities. This results in lots of very different decision trees grown from our bootstrapped data.
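To see the effect of this per-node feature sampling, here is a small sketch using scikit-learn's max_features parameter on invented data (my own illustration, not the article's data set); two trees trained on identical data end up choosing different splits:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((50, 3))            # stand-ins for age, gender, smoking status
y = rng.integers(0, 2, 50)

# max_features=1: at every split, only one randomly chosen feature is
# considered when comparing Gini impurities, rather than all three.
tree_a = DecisionTreeClassifier(max_features=1, random_state=1).fit(X, y)
tree_b = DecisionTreeClassifier(max_features=1, random_state=2).fit(X, y)

# Even on identical data, the two trees pick different split features.
print(tree_a.tree_.feature[:5])
print(tree_b.tree_.feature[:5])
```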

Our Random forest has now been trained. It was trained on the data selected by the bootstrapping procedure, and could technically be stored in a computer's memory and used to predict the respiratory disease status of new individuals from the features used to train the model (age, gender and smoking status). To predict with this model, we run each new observation through every decision tree in the forest. Each tree gives us a prediction of the outcome (whether the patient has respiratory disease or not). We then take the most common prediction (the mode) as our final estimate.
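A minimal sketch of that voting step, again on invented data and with a hand-rolled forest like the one above (the "new patient" and the 0/1 encoding are purely illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(18, 90, 50),   # age
                     rng.integers(0, 2, 50),     # gender
                     rng.integers(0, 2, 50)])    # smoking status
y = rng.integers(0, 2, 50)

# Grow 100 trees on bootstrap samples, one random feature per split.
forest = []
for _ in range(100):
    idx = rng.integers(0, 50, 50)
    forest.append(DecisionTreeClassifier(max_features=1).fit(X[idx], y[idx]))

# Predict for a hypothetical new patient: every tree votes,
# and the most common prediction (the mode) wins.
new_patient = np.array([[40, 1, 1]])
votes = np.array([tree.predict(new_patient)[0] for tree in forest])
print("Votes for disease:", int(votes.sum()), "/ 100")
print("Final prediction:", int(votes.mean() >= 0.5))
```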

When building models with elements of randomness we should always validate our performance (really we should do this regardless, but hear me out). We want to know whether our model will work well with new/unseen data. Fortunately, the bootstrapping procedure leaves each tree with a set of observations that weren't used to train it. On average this is roughly a third of the data (about 37%), and it mimics the train/test split commonly found in data science. The official term for this unused data is "out-of-bag" data, and by running each observation through only the trees that never saw it, then taking the modal prediction, we can assess how well our Random forest performs on data it wasn't trained on. The wider procedure of bootstrapping the data and aggregating the trees' predictions is, rather unfortunately, termed "bagging".
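If you use scikit-learn, this out-of-bag estimate is available directly; a hedged sketch on synthetic data of my own:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=3, n_informative=2,
                           n_redundant=0, random_state=0)

# oob_score=True scores each observation using only the trees
# whose bootstrap samples never contained it.
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X, y)
print("Out-of-bag accuracy:", forest.oob_score_)
```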

Well done. You now know the fundamental mechanics of a Random forest. There are tuning parameters we can adjust, such as the number of randomly selected features to assess at each node, but I think we've covered enough for this week's article.
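For the curious, one illustrative way to explore that parameter is to watch the out-of-bag score as max_features changes (synthetic data and values of my own choosing):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

for max_features in (1, 3, "sqrt", None):   # None = consider every feature
    forest = RandomForestClassifier(n_estimators=200, max_features=max_features,
                                    oob_score=True, random_state=0).fit(X, y)
    print(max_features, round(forest.oob_score_, 3))
```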

I want to leave you, as always, with a real-life example of the methods described today.

Prostate cancer is one of the most common cancers on the planet (even though it only affects men). Despite, or indeed because of, its commonality, most men die with prostate cancer, not from it. Those who do die from prostate cancer often, by definition, have very aggressive disease. Presently, our best test for screening men for prostate cancer is measuring a blood marker called PSA. PSA is incredibly sensitive but lacks specificity. This means it can be raised for many reasons other than cancer, and if used for screening it will lead to many men being referred for prostatic biopsy. The problem with biopsy is that it's a particularly invasive line of questioning and comes with a whole host of adverse complications. Further to this, the majority of men who endure a prostatic biopsy don't go on to develop aggressive forms of prostate cancer. This paradox has plagued urological public health departments across the developed world since the advent of screening programmes.

L. Xiao et al. have used Random forests to combine clinico-demographic data with two imperfect screening investigations (serum PSA and transrectal ultrasound) and achieved a specificity of 93.8%. Incredible! This is another demonstration of the power of applying novel ML approaches to real-world clinical problems. The model will of course need to be validated on many new test sets, but should it stay the course, it will prove a useful tool for helping clinicians identify high-risk individuals for prostatic biopsy.

Thank you for reading. I hope you found it useful, and give it a clap if you did!

Follow here for more like this and subscribe to my YouTube channel to learn more.


James Parkin

Medical Doctor and Data Scientist living in London. I write about novel Machine Learning techniques being used to solve Healthcare’s biggest problems.