Data Science: Implementing a Random Forest in Rust.

applied.math.coding
5 min read · May 26

This story is part of my Data Science series.

Random forests come in many variations. The common idea is that instead of building a single tree model, one trains an ensemble of tree models, each with slightly different input parameters. These input parameters could, for instance, be a randomly selected subset of the features, or a random sample drawn from the training data.

Prediction with such an ensemble is typically done by ‘majority voting’ in the case of classification, and by averaging in the case of regression.
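Both aggregation rules are easy to state in code. The following is a minimal sketch (not from the article) of the two combiners, assuming each tree in the ensemble has already produced either a class label in {0, 1} or a numeric prediction:

```rust
// Sketch of the two standard ensemble aggregation rules.

// Classification: the class predicted by the majority of trees wins.
fn majority_vote(votes: &[u8]) -> u8 {
    let ones = votes.iter().filter(|&&v| v == 1).count();
    if ones * 2 >= votes.len() { 1 } else { 0 }
}

// Regression: the mean of the individual tree predictions.
fn average(preds: &[f64]) -> f64 {
    preds.iter().sum::<f64>() / preds.len() as f64
}

fn main() {
    assert_eq!(majority_vote(&[1, 0, 1, 1, 0]), 1);
    assert!((average(&[1.0, 2.0, 3.0]) - 2.0).abs() < 1e-12);
}
```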

In this story I present an implementation of a random forest for a classification problem based on these data here.

This database is quite large, and training one single tree on all of it would be impractical. For this reason I choose to build an ensemble of trees, each based on a randomly selected sample of the training data.
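To make the sampling step concrete, here is a hedged sketch of drawing a random row sample for one tree. It uses a tiny hand-rolled linear congruential generator purely so the example needs no external crates; in a real project one would more likely reach for the `rand` crate. The function name `sample_rows` is my own, not taken from the article:

```rust
// Minimal deterministic pseudo-random generator (illustration only;
// prefer the `rand` crate in real code).
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        // Constants from Knuth's MMIX LCG.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

// Draw `n` rows with replacement from the training data (bootstrap sample).
fn sample_rows(data: &[Vec<f64>], n: usize, seed: u64) -> Vec<Vec<f64>> {
    let mut rng = Lcg(seed);
    (0..n)
        .map(|_| data[(rng.next() % data.len() as u64) as usize].clone())
        .collect()
}

fn main() {
    let data = vec![vec![0.0, 1.0], vec![1.0, 2.0], vec![0.0, 3.0]];
    let sample = sample_rows(&data, 5, 42);
    assert_eq!(sample.len(), 5);
}
```

Sampling with replacement is the classic bootstrap setup; sampling without replacement would also work here and only changes the indexing logic.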

Implementation (Tree):

Before implementing a random forest, we first need to implement a tree model:

use std::sync::Arc;

#[derive(Clone)]
pub struct Tree {
    pub feature_idx: usize,
    pub p: f64,
    pub split: f64,
    pub left: Option<Arc<Tree>>, // <-- recursive type!
    pub right: Option<Arc<Tree>>,
}

p refers to the estimated probability in the current node, split is the feature split value that decides whether to follow the left or the right sub-tree, and feature_idx is the index of the feature to which the split refers.
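These fields are enough to walk the tree for a prediction. The following is a sketch of such a traversal, repeating the struct so the example is self-contained; the convention that values below the split go left, and that a missing child means we have reached a leaf, is my assumption and not stated in the article:

```rust
use std::sync::Arc;

#[derive(Clone)]
pub struct Tree {
    pub feature_idx: usize,
    pub p: f64,
    pub split: f64,
    pub left: Option<Arc<Tree>>,
    pub right: Option<Arc<Tree>>,
}

// Walk down the tree: go left when the feature value is below the split
// (assumed convention), and return the estimated probability once there
// is no child to descend into.
pub fn predict(tree: &Tree, row: &[f64]) -> f64 {
    let child = if row[tree.feature_idx] < tree.split {
        &tree.left
    } else {
        &tree.right
    };
    match child {
        Some(sub) => predict(sub, row),
        None => tree.p,
    }
}

fn main() {
    let leaf = Arc::new(Tree { feature_idx: 0, p: 0.9, split: 0.0, left: None, right: None });
    let root = Tree { feature_idx: 0, p: 0.5, split: 1.0, left: Some(leaf), right: None };
    assert!((predict(&root, &[0.5]) - 0.9).abs() < 1e-12); // goes left, hits leaf
    assert!((predict(&root, &[2.0]) - 0.5).abs() < 1e-12); // no right child, stays at root
}
```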

The data are expected to be in the form Vec<Vec<f64>>, where the first column corresponds to the binary outcome. In the actual implementation I fetch these data from a DuckDB database, but this is not essential for this article.

In order to accelerate the search for an optimal feature split, I first compute, for each feature, intervals of equal measure with respect to the feature’s empirical probability distribution, i.e. quantiles:

fn compute_quantiles(data: &mut Vec<Vec<f64>>, n_quantiles: usize) 
-> Vec<Vec<f64>> {
let…
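The function body above is truncated in the source. As an illustration only, here is one plausible sketch of such a quantile computation: for each feature column, sort its values and pick equally spaced order statistics as candidate split points. Note that this sketch takes the data by shared reference and copies each column, whereas the original signature takes `&mut Vec<Vec<f64>>` (presumably to sort in place); the details are my assumptions, not the article's code:

```rust
// Sketch: per-feature candidate split points via empirical quantiles.
// Column 0 is assumed to hold the binary outcome and is skipped.
fn compute_quantiles(data: &[Vec<f64>], n_quantiles: usize) -> Vec<Vec<f64>> {
    let n_features = data[0].len();
    (1..n_features)
        .map(|j| {
            // Copy and sort the feature column.
            let mut col: Vec<f64> = data.iter().map(|row| row[j]).collect();
            col.sort_by(|a, b| a.partial_cmp(b).unwrap());
            // Pick n_quantiles - 1 equally spaced order statistics.
            (1..n_quantiles)
                .map(|q| col[q * (col.len() - 1) / n_quantiles])
                .collect()
        })
        .collect()
}

fn main() {
    let data = vec![
        vec![0.0, 1.0],
        vec![1.0, 2.0],
        vec![0.0, 3.0],
        vec![1.0, 4.0],
    ];
    // One feature column, split at (roughly) its median.
    assert_eq!(compute_quantiles(&data, 2), vec![vec![2.0]]);
}
```

Restricting the split search to these quantile cut points turns an O(n) scan over all distinct values into a scan over a small fixed number of candidates per feature.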