Comparing Feature Selectors

Hi! Do you want to compare the performance of multiple feature selectors? This example notebook shows you how to do such an analysis.

Prerequisites

We are going to use roughly the same configuration as in the Quick start example, but with more Feature Selectors. Again, start by downloading the example project: comparing-feature-selectors.zip

Installing the required packages

Now, let's install the required packages. Make sure you are in the comparing-feature-selectors folder, which contains the requirements.txt file, and then run the following:

pip install -r requirements.txt

Running the experiment

Run the following command to start the experiment:

python benchmark.py --multirun ranker="glob(*)" +callbacks.to_sql.url="sqlite:////tmp/results.sqlite"
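Here, --multirun tells Hydra to run a sweep, and ranker="glob(*)" expands to every ranker config in the project, so a single command benchmarks all Feature Selectors. The +callbacks.to_sql.url override makes the to_sql callback write all results to the given SQLite database. If you only want to benchmark a subset, Hydra also accepts an explicit comma-separated list; a sketch, assuming ranker configs named boruta and xgboost exist in the project (the names are illustrative):

```
python benchmark.py --multirun ranker=boruta,xgboost +callbacks.to_sql.url="sqlite:////tmp/results.sqlite"
```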

Analyzing the results

There should now be a .sqlite file at /tmp/results.sqlite:

```
$ ls -al /tmp/results.sqlite
-rw-r--r-- 1 vscode vscode 20480 Sep 21 08:16 /tmp/results.sqlite
```
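Before diving into pandas, you can optionally check what the benchmark wrote to the database. A minimal sketch using SQLAlchemy, which pandas also uses under the hood:

```
from sqlalchemy import create_engine, inspect

# Connect to the results database and list its tables;
# we expect "experiments" and "validation_scores" among them.
engine = create_engine("sqlite:////tmp/results.sqlite")
print(inspect(engine).get_table_names())
```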

Is that the case? Then let's now analyze the results! 📈

We will install plotly-express so we can make nice plots later.

%pip install plotly-express nbconvert --quiet

Next, let's point to where our results are stored. In this case, that is the local SQLite database at /tmp/results.sqlite.

con: str = "sqlite:////tmp/results.sqlite"
con
'sqlite:////tmp/results.sqlite'

Now, we can read the experiments table.

import pandas as pd

experiments: pd.DataFrame = pd.read_sql_table("experiments", con=con, index_col="id")
experiments
| id | dataset | dataset/n | dataset/p | dataset/task | dataset/group | dataset/domain | ranker | validator | local_dir | date_created |
|----|---------|-----------|-----------|--------------|---------------|----------------|--------|-----------|-----------|--------------|
| 3lllxl48 | My synthetic dataset | 10000 | 20 | classification | None | None | ANOVA F-value | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:28:27.506838 |
| 1944ropg | My synthetic dataset | 10000 | 20 | classification | None | None | Boruta | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:28:31.230633 |
| 31gd56gf | My synthetic dataset | 10000 | 20 | classification | None | None | Chi-Squared | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:29:19.633012 |
| a8washm5 | My synthetic dataset | 10000 | 20 | classification | None | None | Decision Tree | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:29:23.459190 |
| 27i7uwg4 | My synthetic dataset | 10000 | 20 | classification | None | None | Infinite Selection | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:29:27.506974 |
| 3velt3b9 | My synthetic dataset | 10000 | 20 | classification | None | None | MultiSURF | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:29:31.758090 |
| 3fdrxlt6 | My synthetic dataset | 10000 | 20 | classification | None | None | Mutual Info | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:35:04.289361 |
| 14lecx0g | My synthetic dataset | 10000 | 20 | classification | None | None | ReliefF | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:35:08.614262 |
| 3sggjvu3 | My synthetic dataset | 10000 | 20 | classification | None | None | Stability Selection | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:35:59.121416 |
| dtt8bvo5 | My synthetic dataset | 10000 | 20 | classification | None | None | XGBoost | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:36:23.385401 |
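For a quick overview of which rankers ran and in what order, you could select just two columns from this table; a small sketch:

```
# Both columns exist in the experiments table shown above
experiments[["ranker", "date_created"]].sort_values("date_created")
```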

Let's also read in the validation_scores.

validation_scores: pd.DataFrame = pd.read_sql_table("validation_scores", con=con, index_col="id")
validation_scores
| id | index | n_features_to_select | fit_time | score | bootstrap_state |
|----|-------|----------------------|----------|-------|-----------------|
| 3lllxl48 | 0 | 1 | 0.004433 | 0.7955 | 1 |
| 3lllxl48 | 0 | 2 | 0.004227 | 0.7910 | 1 |
| 3lllxl48 | 0 | 3 | 0.005183 | 0.7950 | 1 |
| 3lllxl48 | 0 | 4 | 0.003865 | 0.7965 | 1 |
| 3lllxl48 | 0 | 5 | 0.002902 | 0.7950 | 1 |
| … | … | … | … | … | … |
| dtt8bvo5 | 0 | 16 | 0.000670 | 0.7805 | 1 |
| dtt8bvo5 | 0 | 17 | 0.000480 | 0.7725 | 1 |
| dtt8bvo5 | 0 | 18 | 0.003159 | 0.7760 | 1 |
| dtt8bvo5 | 0 | 19 | 0.000848 | 0.7650 | 1 |
| dtt8bvo5 | 0 | 20 | 0.000565 | 0.7590 | 1 |

160 rows × 5 columns
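Those 160 rows add up: each experiment that produced scores validated feature subsets of sizes 1 through 20, and 8 experiments × 20 subset sizes = 160. A small sanity-check sketch:

```
# Count distinct experiment IDs and total validation rows
print(validation_scores.index.nunique(), "experiments,", len(validation_scores), "rows")
```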

We can now merge the two tables. Because we set the experiment ID as the index on both, we can use pd.DataFrame.join to do this.

validation_scores_with_experiment_info = experiments.join(
    validation_scores
)
validation_scores_with_experiment_info.head(1)
| id | dataset | dataset/n | dataset/p | dataset/task | dataset/group | dataset/domain | ranker | validator | local_dir | date_created | index | n_features_to_select | fit_time | score | bootstrap_state |
|----|---------|-----------|-----------|--------------|---------------|----------------|--------|-----------|-----------|--------------|-------|----------------------|----------|-------|-----------------|
| 14lecx0g | My synthetic dataset | 10000 | 20 | classification | None | None | ReliefF | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:35:08.614262 | NaN | NaN | NaN | NaN | NaN |

Cool! That will be all the information that we need. Let's first create an overview for all the rankers we benchmarked.

validation_scores_with_experiment_info \
    .groupby("ranker") \
    .mean(numeric_only=True) \
    .sort_values("score", ascending=False)
| ranker | dataset/n | dataset/p | index | n_features_to_select | fit_time | score | bootstrap_state |
|--------|-----------|-----------|-------|----------------------|----------|-------|-----------------|
| Infinite Selection | 10000.0 | 20.0 | 0.0 | 10.5 | 0.004600 | 0.818925 | 1.0 |
| XGBoost | 10000.0 | 20.0 | 0.0 | 10.5 | 0.002998 | 0.818575 | 1.0 |
| Decision Tree | 10000.0 | 20.0 | 0.0 | 10.5 | 0.002810 | 0.817675 | 1.0 |
| Stability Selection | 10000.0 | 20.0 | 0.0 | 10.5 | 0.002406 | 0.803325 | 1.0 |
| Chi-Squared | 10000.0 | 20.0 | 0.0 | 10.5 | 0.002548 | 0.795975 | 1.0 |
| ANOVA F-value | 10000.0 | 20.0 | 0.0 | 10.5 | 0.003745 | 0.789275 | 1.0 |
| Mutual Info | 10000.0 | 20.0 | 0.0 | 10.5 | 0.002314 | 0.786475 | 1.0 |
| Boruta | 10000.0 | 20.0 | 0.0 | 10.5 | 0.002366 | 0.518075 | 1.0 |
| MultiSURF | 10000.0 | 20.0 | NaN | NaN | NaN | NaN | NaN |
| ReliefF | 10000.0 | 20.0 | NaN | NaN | NaN | NaN | NaN |

Already, we notice that MultiSURF and ReliefF are missing. This is because those experiments failed. That can happen in a big benchmark! We will ignore this for now and continue with the other Feature Selectors.
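If you prefer to drop the failed runs explicitly, rather than relying on downstream tools to skip NaN values, a one-line sketch:

```
# Keep only rows from experiments that actually produced validation scores
completed = validation_scores_with_experiment_info.dropna(subset=["score"])
```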

👀 We can already observe that the average classification accuracy is highest for Infinite Selection. Although it would be premature to call it the best, this is an indication that it did well for this dataset.
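To put that observation on slightly firmer footing than the mean alone, a minimal sketch that also computes the per-ranker standard deviation of the score across subset sizes:

```
# Mean and spread of the classification accuracy per ranker
score_summary = (
    validation_scores_with_experiment_info
    .groupby("ranker")["score"]
    .agg(["mean", "std"])
    .sort_values("mean", ascending=False)
)
score_summary
```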

Let's plot the results per n_features_to_select. Note that n_features_to_select means a validation step was run using a feature subset of that size.

import plotly.express as px

px.line(
    validation_scores_with_experiment_info,
    x="n_features_to_select",
    y="score",
    color="ranker",
)

[Figure: feature selectors comparison plot]

Indeed, we can see that XGBoost, Infinite Selection, and Decision Tree are solid contenders for this dataset.
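If you run this analysis outside an interactive notebook, you may want to save the figure instead of only displaying it. A sketch; the output filename is just an example:

```
fig = px.line(
    validation_scores_with_experiment_info,
    x="n_features_to_select",
    y="score",
    color="ranker",
)
fig.write_html("ranker_comparison.html")  # hypothetical output path
```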

🙌🏻


This has shown how easy it is to do a large benchmark with fseval. Cheers!