The pipeline

The `fseval` pipeline executes a predefined number of steps to benchmark your Feature Selector or Feature Ranker. See the schematic illustration below:

The steps (1-6) can be described as follows.
1. The pipeline configuration (`PipelineConfig`) is processed using Hydra. Hydra is a powerful tool for creating Command Line Interfaces in Python, allowing a hierarchical representation of the configuration. Configuration can be defined in YAML files, Python files, or a combination of the two. The top-level config is enforced to be of the `PipelineConfig` interface, allowing Hydra to perform type-checking.
2. The config is passed to the `run_pipeline` function.
3. The dataset is loaded, as defined in the `DatasetConfig` object.
4. A Cross Validation (CV) split is made, as defined in the `CrossValidatorConfig`. Each cross validation fold is executed in a separate run of the pipeline.
5. The training subset is passed to the `fit()` step.
6. The testing subset is passed to the `score()` step.
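The cross-validation, fit, and score steps can be sketched in plain scikit-learn. This is an illustrative sketch only, not the actual fseval API: the dataset, estimator, and fold count are stand-ins, and in fseval each fold would run as a separate pipeline run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# Step 3 (sketch): load a dataset; fseval would use the DatasetConfig instead.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Step 4 (sketch): make a CV split; fseval configures this via CrossValidatorConfig,
# and executes every fold as a separate run of the pipeline.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in cv.split(X):
    ranker = RandomForestClassifier(random_state=0)
    # Step 5 (sketch): fit on the training subset.
    ranker.fit(X[train_idx], y[train_idx])
    # Step 6 (sketch): score on the testing subset.
    scores.append(ranker.score(X[test_idx], y[test_idx]))
```

Each entry of `scores` corresponds to one fold; fseval aggregates such per-fold results via its callbacks.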
Benchmark
To get a better idea of what is happening in the pipeline, we can take a closer look at the benchmark steps of the pipeline (steps a-d).
In the pipeline, the following steps are performed:
- The data is (optionally) resampled. This is useful, for example, to perform a bootstrap; this way, the stability of an algorithm can be determined. The resampling is configured using the `ResampleConfig`.
- A Feature Ranker is fit. Any Feature Selector or Feature Ranker is defined in the `EstimatorConfig`.
- Depending on which attributes the Feature Ranker or Selector estimates, different validations are run:
  - When the ranker estimates the `feature_importances_` or `ranking_` attribute, the estimated ranking is validated as follows. According to the `all_features_to_select` parameter in the `PipelineConfig`, various feature subsets are validated. By default, at most 50 subsets are validated using another estimator. First, the validation estimator is fit on a subset containing only the highest-ranked feature, then on a subset containing only the two highest-ranked features, et cetera.
  - When the ranker estimates the `support_` attribute, that selected feature subset is validated.
- Once the ranker has been fit on the dataset and the validation estimator has been fit on all the feature subsets, the pipeline is scored. This means the ranker fitting times and the validation scores are aggregated wherever applicable, and stored into tables according to the enabled Callbacks.
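The incremental subset validation described above can be sketched as follows. This is a hypothetical illustration, not fseval's implementation: the ranker, validation estimator, and dataset are stand-ins, and here the ranking is derived from scikit-learn's `feature_importances_` attribute.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a feature ranker; its `feature_importances_` attribute yields a ranking.
ranker = RandomForestClassifier(random_state=0).fit(X_train, y_train)
ranking = np.argsort(ranker.feature_importances_)[::-1]  # best feature first

# Validate the top-k feature subsets: first only the highest-ranked feature,
# then the two highest-ranked, and so on, each with a fresh validation estimator.
subset_scores = {}
for k in range(1, len(ranking) + 1):
    subset = ranking[:k]
    validator = LogisticRegression(max_iter=1000)
    validator.fit(X_train[:, subset], y_train)
    subset_scores[k] = validator.score(X_test[:, subset], y_test)
```

The resulting `subset_scores` mapping (subset size to validation score) is the kind of per-subset result that would be aggregated into tables in the scoring step.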