# Data Science: Feature Selection by Comparing Histogram Distances

--

*This story is part of my **Data Science** series.*

Typically, when we create a classification or regression model, we try to keep it as simple as possible. Simplification can be achieved in many ways; one is to reduce the set of considered features to as small a subset as possible.

In this short account I want to present a rather little-known approach that I think is very interesting in certain scenarios.

Classification problems are all about the following term:

`P(Y = 1 | X = d)`

That is, we estimate the probability that `Y` takes the value `1`, given that an observed feature `X` takes the value `d`.

By Bayes' rule,

`P(Y = 1 | X = d) ~ P(X = d | Y = 1)`

`P(Y = 0 | X = d) ~ P(X = d | Y = 0)`

where `~` denotes proportionality, with the class priors and the common factor `P(X = d)` absorbed into the constants. A variable `X` therefore distinguishes the outcomes `1` and `0` of `Y` well if the values of `P(X = d | Y = 1)` and `P(X = d | Y = 0)` differ from each other. If instead they were equal (and the classes balanced), then since

`P(Y = 1 | X = d) + P(Y = 0 | X = d) = 1`

this would lead to `P(Y = 1 | X = d) = 1/2`, i.e. `X` would tell us nothing about `Y`.

How much these two conditional distributions deviate from each other can be estimated from the training data.

For this, we compute two histograms: one corresponding to the conditional distribution of `X` under `Y = 0`, and the other under `Y = 1`.
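The two conditional histograms can be sketched as follows. This is a minimal example with hypothetical synthetic data standing in for a real feature; the key point is that both histograms must share the same bin edges to be comparable:

```python
import numpy as np

# Hypothetical data: feature values x and binary labels y.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.0, 1.0, 500)])
y = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])

# Shared bin edges computed over all values, so both histograms
# are defined on the same grid.
edges = np.histogram_bin_edges(x, bins=20)

# Empirical conditional histograms for P(X = d | Y = 0) and P(X = d | Y = 1),
# normalized so each sums to 1.
h0, _ = np.histogram(x[y == 0], bins=edges)
h1, _ = np.histogram(x[y == 1], bins=edges)
p0 = h0 / h0.sum()
p1 = h1 / h1.sum()
```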

To measure the difference between the two distributions, we can use the so-called **histogram distance**. Some theoretical and computational aspects of the latter can be found here.
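As one concrete instance of such a distance, here is the total variation (normalized L1) distance between two histograms. This is only one common choice; the metric discussed in the linked reference may differ:

```python
import numpy as np

def histogram_distance(p, q):
    """Total variation distance between two histograms.

    Both inputs are normalized to sum to 1; the result ranges from 0
    (identical distributions) to 1 (disjoint supports).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p / p.sum() - q / q.sum()).sum()

# Identical histograms -> 0; histograms with disjoint support -> 1.
d_same = histogram_distance([1, 2, 3], [1, 2, 3])
d_disjoint = histogram_distance([1, 0], [0, 1])
```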

Let us test if this really works.

For this, I take the data from here and build a logistic regression model over the first million records, using all the features.

This yields a ROC-AUC of `0.685`.
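A baseline of this kind can be sketched as below. Since the article's dataset is only linked, synthetic data with 28 features stands in for it here; the shape of the workflow (fit on all features, score by ROC-AUC on held-out data) is the point, not the numbers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 28 features, binary target driven by a linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 28))
w = rng.normal(size=28)
y = (X @ w + rng.normal(size=5000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Logistic regression over all features, scored by ROC-AUC.
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"ROC-AUC: {auc:.3f}")
```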

Next, I compute for all 28 features the histogram distances of the conditional distributions. That data are quite large…
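The per-feature computation described above can be sketched as follows. The `feature_distances` helper is hypothetical (not from the article), and it uses the total variation distance as the histogram metric, which may differ from the article's choice:

```python
import numpy as np

def histogram_distance(p, q):
    # Total variation distance between two normalized histograms
    # (one common choice of histogram distance).
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p / p.sum() - q / q.sum()).sum()

def feature_distances(X, y, bins=20):
    """For each feature j, distance between the histograms of
    X[:, j] | Y = 0 and X[:, j] | Y = 1."""
    dists = []
    for j in range(X.shape[1]):
        edges = np.histogram_bin_edges(X[:, j], bins=bins)
        h0, _ = np.histogram(X[y == 0, j], bins=edges)
        h1, _ = np.histogram(X[y == 1, j], bins=edges)
        dists.append(histogram_distance(h0, h1))
    return np.array(dists)

# Larger distance = the feature separates the classes better,
# so ranking features by this value gives a selection order.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
X = rng.normal(size=(1000, 3))
X[:, 0] += 2.0 * y  # feature 0 is informative, the others are noise
d = feature_distances(X, y)
```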