Data Science: Implementing a Random Forest in Rust.
--
This story is part of my Data Science series.
Random forests come in many variations. The common idea is that, instead of building a single tree model, one trains an ensemble of tree models where each member gets slightly different inputs. These inputs could, for instance, be a randomly selected subset of the features, or a random sample of the training data.
Prediction with such an ensemble is typically done by ‘majority voting’ in the case of classification, and by averaging in the case of regression.
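As a minimal sketch of the voting step (a hypothetical helper, not part of the implementation below), majority voting over binary class predictions could look like this:
fn majority_vote(votes: &[u8]) -> u8 {
    // Count the votes for class 1 and compare against half the ensemble size.
    let ones = votes.iter().filter(|&&v| v == 1).count();
    if 2 * ones >= votes.len() { 1 } else { 0 }
}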
In this story I present an implementation of a random forest for a classification problem based on the data here.
This database is quite large and thus not well suited to training one single tree on it. For this reason I choose to build an ensemble of trees, each based on a randomly selected sample of the training data.
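A hedged sketch of such a sampling step, assuming the rand crate (the actual sampling code may differ):
use rand::seq::SliceRandom;

// Draw n distinct rows from the training data at random (without replacement).
fn sample_rows(data: &[Vec<f64>], n: usize) -> Vec<Vec<f64>> {
    let mut rng = rand::thread_rng();
    data.choose_multiple(&mut rng, n).cloned().collect()
}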
Implementation (Tree):
Before implementing a random forest, we first need to implement a tree model:
use std::sync::Arc;

#[derive(Clone)]
pub struct Tree {
    pub feature_idx: usize,      // index of the feature this node splits on
    pub p: f64,                  // estimated probability in this node
    pub split: f64,              // threshold deciding left vs. right
    pub left: Option<Arc<Tree>>, // <-- recursive type!
    pub right: Option<Arc<Tree>>,
}
p refers to the estimated probability in the current node, split is the feature threshold that decides whether to follow the left or the right sub-tree, and feature_idx is the index of the feature to which the split refers.
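To make the traversal concrete, here is a minimal sketch of how a prediction for a single sample could walk such a tree (a hypothetical helper; the actual prediction code may differ):
fn predict(tree: &Tree, x: &[f64]) -> f64 {
    // Follow the left branch if the sample's feature value is below the split.
    let child = if x[tree.feature_idx] < tree.split { &tree.left } else { &tree.right };
    match child {
        Some(sub_tree) => predict(sub_tree, x),
        None => tree.p, // leaf reached: return the node's estimated probability
    }
}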
The data are expected to be in the form Vec<Vec<f64>>, where the first column corresponds to the binary outcome. In the actual implementation I fetch these data from a DuckDB database, but that is not essential for this article.
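For illustration, a single row could look like this (values are made up):
let row: Vec<f64> = vec![1.0, 0.37, 5.2]; // outcome = 1, followed by two feature values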
To accelerate the search for an optimal feature split, I first compute, for each feature, intervals of equal measure with respect to the probability distribution induced by that feature’s data, i.e. quantile intervals:
fn compute_quantiles(data: &mut Vec<Vec<f64>>, n_quantiles: usize)
-> Vec<Vec<f64>> {
let…
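The listing is truncated here. As a hedged sketch of what such a function might compute, assuming it returns, per feature, the boundaries of equal-measure (quantile) intervals of the empirical distribution (the actual implementation may differ):
fn compute_quantiles_sketch(data: &[Vec<f64>], n_quantiles: usize) -> Vec<Vec<f64>> {
    let n_rows = data.len();
    let n_features = data[0].len() - 1; // column 0 holds the binary outcome
    (1..=n_features)
        .map(|f| {
            // Sort the values of feature f and read off n_quantiles - 1 equally
            // spaced boundaries, i.e. intervals of equal empirical measure.
            let mut col: Vec<f64> = data.iter().map(|row| row[f]).collect();
            col.sort_by(|a, b| a.partial_cmp(b).unwrap());
            (1..n_quantiles).map(|q| col[q * n_rows / n_quantiles]).collect()
        })
        .collect()
}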