Decision Trees Part 1: Mammal Classification

Ramzi Saud
5 min read · Apr 16, 2021

A no-math guide to understanding decision trees for beginners.

My olive sapling

Often used in game theory, stats 101 classes, and corporate decision making, decision trees are an intuitive algorithm that we all use, even if we don't realize it. By the end of this post you'll have taken a strong first step toward understanding decision trees, no matter your prior knowledge. Let's get started!

Interpreting Trees

The best thing about decision trees is that they're easy to both visualize and understand. A decision tree has a few essential pieces: a root node, branches, internal nodes, and leaf nodes. The root node is your starting point with some question to be answered, branches carry the answers and connect the nodes, internal nodes pose the next question, and leaf nodes hold the end result. A particularly shallow person might use the following decision tree to decide whether they would date somebody.

Fig. 1 Decision tree used to evaluate whether or not you may date someone

In our dating example, the root node asks whether or not someone is attractive. The yes branch leads straight to a leaf node where our shallow guy/gal decides to date them, while the no branch leads to further questions to work through before deciding whether to pursue a date. From this example, you can see how we employ this thinking process in our daily lives.

Growing Trees

Our previous example has simple features with yes/no outcomes, but how do we split if we have continuous features? And how do we determine which feature to use at each node?

To choose the best feature for our root node, we iterate through all of our features and all of the candidate split values within those features, and calculate a metric of choice (most commonly Gini impurity or information gain) at each value. The split with the best score (the minimum for Gini impurity, the maximum for information gain) determines the root feature and its threshold. Once the data has been split, we repeat the same search within each resulting subset, splitting again and again until we reach a stopping point regulated either by our complexity parameter or by a preselected maximum tree depth.
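To make that search concrete, here is a minimal R sketch that scores every candidate threshold of a single continuous feature by weighted Gini impurity. It's only an illustration of the idea, not the routine rpart runs internally, and gini_impurity and best_split are made-up helper names.

    gini_impurity <- function(labels) {
      # Gini impurity of one set of class labels: 1 minus the sum of squared class shares
      p <- table(labels) / length(labels)
      1 - sum(p^2)
    }

    best_split <- function(feature, labels) {
      thresholds <- sort(unique(feature))
      scores <- sapply(thresholds, function(t) {
        left  <- labels[feature <= t]
        right <- labels[feature > t]
        # Weight each child's impurity by its share of the observations
        (length(left) * gini_impurity(left) +
           length(right) * gini_impurity(right)) / length(labels)
      })
      list(threshold = thresholds[which.min(scores)], score = min(scores))
    }

Running best_split over every feature and keeping the lowest score gives you the root node's feature and threshold; the same search is then repeated inside each subset.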

The complexity parameter is there so the tree does not grow endlessly: if a candidate split doesn't improve the fit by at least the complexity parameter, the split isn't made and that branch ends in a leaf that classifies observations from there. If you're interested in the details of the splitting criteria, there is no shortage of literature on Gini impurity vs. information gain and how each is calculated.
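In rpart, both stopping rules live in rpart.control. The values below are just placeholders to show the knobs, not the settings used for the tree later in this post.

    library(rpart)

    # cp: a split must improve the overall fit by at least this fraction to be kept
    # maxdepth: hard cap on how many levels deep the tree may grow
    # The resulting object is passed to rpart() via its control argument
    ctrl <- rpart.control(cp = 0.01, maxdepth = 5)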

Let’s take a look at an example where we want to classify mammals into their biological order using five different features from AnAge, a database containing records on animal longevity.

  • Maximum Longevity: Max lifespan of an animal in years
  • Adult Weight: The weight of an adult at maturity in kilograms
  • Gestation: How long a fetus is carried in the womb in days
  • Female Maturity: The age at which females reach sexual maturity
  • Male Maturity: The age at which males reach sexual maturity

The tree is constructed in R using the rpart and rpart.plot libraries, with the data split 80/20 into training and test sets to evaluate the tree's performance.
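Here is a rough sketch of that setup. It isn't the exact code from my GitHub repo (linked at the end); in particular, the data frame and column names below (mammals, order, max_longevity, and so on) are placeholders for illustration.

    library(rpart)
    library(rpart.plot)

    # mammals: one row per species with its biological order and the five features
    set.seed(42)
    train_idx <- sample(nrow(mammals), size = 0.8 * nrow(mammals))
    train <- mammals[train_idx, ]
    test  <- mammals[-train_idx, ]

    mammal_tree <- rpart(
      order ~ max_longevity + adult_weight + gestation +
              female_maturity + male_maturity,
      data   = train,
      method = "class"   # classification tree; rpart splits on Gini impurity by default
    )

    rpart.plot(mammal_tree)   # draws a tree along the lines of Fig. 2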

Fig. 2 Classifying mammalian order using a decision tree

This tree carries some bonus information in its nodes. From the top of a node down, we have the predicted class, the correct classification rate, and lastly the percentage of observations in the node. For example, take the result node on the right side of the tree, branching from maximum longevity: we predict that the observation is a rodent; of the 354 animals that fall into this node, 211 are in fact rodents; and 33% of all observations in the training set land in this node.

The algorithm identifies the split on adult weight at 0.78 kilograms as having the minimum Gini impurity score, so that becomes our root node. From there the data is divided into two subsets, and within each we again find the split with the minimum Gini score, settling on gestation at 114 days and maximum longevity at 12 years. This process continues until every branch ends in a leaf that assigns a biological order. Also note that variables can be reused: gestation is split on twice in a row, helping to separate broader ranges of gestation time.
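If you'd rather read those thresholds off the fitted object than squint at the plot, printing the model lists every node and split, and rpart also stores a rough measure of how much each feature contributed.

    print(mammal_tree)               # text listing of each node, its split, and class counts
    mammal_tree$variable.importance  # how much each feature contributed across all splits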

On the training data, this tree has an accuracy of 63.25%, meaning that it places 63.25% of mammals into the correct order. It is best at predicting bats and worst at predicting marsupials, with true positive rates of 97.62% and 45.71% respectively.

We can get a better read on performance by running the test data down the tree. On the test set we reach an accuracy of 61.89%, only a couple of percentage points below the training set; that's a good sign that our tree isn't overfitting. Bats and marsupials are now predicted at 92.86% and 21.05% true positive rates; marsupials might be too similar to another order, and adding another variable to the tree may help to predict them.
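Those numbers come from a confusion matrix of predicted versus actual orders. Here is a minimal sketch of that calculation, again with placeholder names (evaluate is a made-up helper, and order is the assumed label column).

    # Accuracy and per-class true positive rate (recall) from a confusion matrix.
    # Assumes `order` is a factor with the same levels in both data sets.
    evaluate <- function(model, data) {
      pred <- predict(model, newdata = data, type = "class")
      cm   <- table(actual = data$order, predicted = pred)
      list(
        accuracy = sum(diag(cm)) / sum(cm),   # overall share classified correctly
        tpr      = diag(cm) / rowSums(cm)     # recall per order (e.g. bats, marsupials)
      )
    }

    evaluate(mammal_tree, train)
    evaluate(mammal_tree, test)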

Drawbacks of Decision Trees

Though decision trees are easy to interpret and provide a nice visualization, they have two important downsides. First, a small change in the training data can produce a completely different tree, which is why it's important to use cross validation when fitting decision trees on their own! Second, decision trees are weak learners, meaning they're usually only a little better than guessing. We can account for this by combining the output of many trees with either boosted trees or random forests, which I will go into next time!
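On the cross validation point: rpart already runs 10-fold cross validation while growing the tree, so one common way to keep a single tree honest is to check the cross-validated error at each candidate complexity parameter and prune back to the best one. A rough sketch, continuing with the mammal_tree object from above:

    printcp(mammal_tree)   # table of cp values with cross-validated error (xerror)
    plotcp(mammal_tree)    # plot of cross-validated error against tree size

    # Prune back to the cp with the lowest cross-validated error
    best_cp <- mammal_tree$cptable[which.min(mammal_tree$cptable[, "xerror"]), "CP"]
    pruned_tree <- prune(mammal_tree, cp = best_cp)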

The data is sourced from the Human Ageing Genomic Resources (HAGR) project, which is focused on the study of human ageing and longevity.

My R code to create the trees and compute the performance metrics can be found on my GitHub.

