Data Science: Feature Selection by Comparing Histogram Distances

applied.math.coding
3 min read · May 17

This story is part of my Data Science series.

Typically, when we create a classification or regression model, we try to keep it as simple as possible. Simplification can be achieved in many ways; one is to reduce the considered features to as small a subset as possible.

In this short account I want to present a rather lesser-known approach that I think is very interesting in certain scenarios.

Classification problems revolve around the following quantity:

P(Y = 1 | X = d)

That is, we want to estimate the probability that Y takes the value 1, given that an observed feature X takes the value d.

Since, by Bayes' rule,

P(Y = 1 | X = d) = P(X = d | Y = 1) · P(Y = 1) / P(X = d)

the posterior is, up to constants involving the class priors, proportional to the likelihood:

P(Y = 1 | X = d) ~ P(X = d | Y = 1)

P(Y = 0 | X = d) ~ P(X = d | Y = 0)

A variable X distinguishes the outcomes 1 resp. 0 of Y well if the values of P(X = d | Y = 1) and P(X = d | Y = 0) differ from each other. Otherwise, if they were the same and the classes balanced, then since

P(Y = 1 | X = d) + P(Y = 0 | X = d) = 1

this would lead to P(Y = 1 | X = d) = 1/2. (With unequal class priors, equal likelihoods would instead give P(Y = 1 | X = d) = P(Y = 1), so X would still carry no information about Y.)
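For instance, with equal class priors, if P(X = d | Y = 1) = 0.3 and P(X = d | Y = 0) = 0.1, then P(Y = 1 | X = d) = 0.3 / (0.3 + 0.1) = 0.75, so observing X = d clearly favors the outcome Y = 1.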

How much these two conditional probabilities deviate from each other can be estimated from the training data.

For this, we compute both histograms, one corresponding to the conditional distribution of X under Y = 0, and the other under Y = 1.
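As a minimal sketch of this step (the array names x, y and the bin count are my own choices, not from the original), one can compute both histograms on shared bin edges so that they are directly comparable:

```python
import numpy as np

def conditional_histograms(x, y, n_bins=30):
    """Histograms of feature x under y == 0 and y == 1 on shared bin edges,
    each normalized to sum to 1, i.e. empirical P(X in bin | Y)."""
    bins = np.histogram_bin_edges(x, bins=n_bins)  # shared edges -> comparable bins
    h0, _ = np.histogram(x[y == 0], bins=bins)
    h1, _ = np.histogram(x[y == 1], bins=bins)
    return h0 / h0.sum(), h1 / h1.sum(), bins
```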

To measure the difference between the two distributions, we can use the so-called histogram distance. You can find some theoretical and computational aspects about the latter here.
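The linked story discusses several concrete choices; as one plausible stand-in (not necessarily the exact metric used in the experiment below), here is the total variation distance, which is half the L1 distance between two normalized histograms:

```python
import numpy as np

def total_variation_distance(h0, h1):
    # 0 for identical histograms, 1 for histograms with disjoint supports
    return 0.5 * float(np.abs(np.asarray(h0) - np.asarray(h1)).sum())
```

It takes values between 0 (identical distributions) and 1 (disjoint supports), so a larger value indicates a feature that separates the two classes better.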

Let us test if this really works.

For this, I take the data from here and build a logistic regression model over the first million records, using all the features.

This yields a ROC-AUC of 0.685.
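A sketch of how such a baseline could be built with scikit-learn; the file name data.csv and the label column target are placeholders, since the original only links to the dataset:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder file/column names -- substitute the linked dataset's actual ones.
df = pd.read_csv("data.csv", nrows=1_000_000)
X, y = df.drop(columns="target"), df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```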

Next, I compute the histogram distances of the conditional distributions for all 28 features. The data are quite large…
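Building on the sketches above (and the placeholder names X and y from the previous snippet), ranking all 28 features by their histogram distance could look like this:

```python
# Rank features by the distance between their two conditional histograms.
distances = {
    col: total_variation_distance(
        *conditional_histograms(X[col].to_numpy(), y.to_numpy())[:2]
    )
    for col in X.columns
}

for name, dist in sorted(distances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {dist:.3f}")
```

Features with the largest distances are the most promising candidates to keep.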
