The pipeline

fseval executes a predefined number of steps to benchmark your Feature Selector or Feature Ranker.

See the schematic illustration below:

[Figure: Pipeline main architecture]

The steps (1-6) can be described as follows.

  1. First, the pipeline configuration (PipelineConfig) is processed using Hydra.

    Hydra is a powerful tool for creating Command Line Interfaces in Python, allowing a hierarchical representation of the configuration. Configuration can be defined in YAML files, Python files, or a combination of the two. The top-level config is enforced to conform to the PipelineConfig interface, allowing Hydra to perform type-checking.

  2. The config is passed to the run_pipeline function.

  3. The dataset is loaded, as defined in the DatasetConfig object.

  4. A Cross Validation (CV) split is made.

    The split is done as defined in the CrossValidatorConfig. Each cross validation fold is executed in a separate run of the pipeline.

  5. The training subset is passed to the fit() step.

  6. The testing subset is passed to the score() step.
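The steps above can be sketched roughly in code. This is a minimal sketch only: the config dataclasses mirror the names mentioned in the text (PipelineConfig, CrossValidatorConfig), but their fields, and the run_pipeline body, are illustrative assumptions rather than fseval's exact API.

```python
from dataclasses import dataclass, field

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier


@dataclass
class CrossValidatorConfig:
    n_splits: int = 5


@dataclass
class PipelineConfig:
    cv: CrossValidatorConfig = field(default_factory=CrossValidatorConfig)


def run_pipeline(cfg: PipelineConfig):
    # Step 3: load the dataset (a synthetic stand-in here).
    X, y = make_classification(n_samples=100, n_features=10, random_state=0)

    # Step 4: make the CV split; each fold is a separate run of the pipeline.
    cv = KFold(n_splits=cfg.cv.n_splits)
    scores = []
    for train_idx, test_idx in cv.split(X):
        estimator = DecisionTreeClassifier(random_state=0)
        # Step 5: fit on the training subset.
        estimator.fit(X[train_idx], y[train_idx])
        # Step 6: score on the testing subset.
        scores.append(estimator.score(X[test_idx], y[test_idx]))
    return scores


# Steps 1-2: Hydra would build the PipelineConfig and pass it to run_pipeline.
scores = run_pipeline(PipelineConfig())
```

In the real pipeline, Hydra composes the PipelineConfig from the command line and config files before step 2; here it is simply instantiated directly.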

Benchmark

To get a better idea of what is happening in the pipeline, we can take a closer look at the benchmark steps of the pipeline (steps a-d).

In the pipeline, the following steps are performed:

  1. The data is (optionally) resampled. This is useful, for example, to perform a bootstrap; this way, the stability of an algorithm can be determined. The resampling is configured using the ResampleConfig.
  2. A Feature Ranker is fit. Any Feature Selector or Feature Ranker is defined in the EstimatorConfig.
  3. Depending on which attributes the Feature Ranker or Selector estimates, different validations are run.
    1. When the ranker estimates the feature_importances_ or ranking_ attribute, the estimated ranking is validated as follows. According to the all_features_to_select parameter in the PipelineConfig, various feature subsets are validated; by default, at most 50 subsets are validated using another estimator. First, the validation estimator is fit on a subset containing only the highest-ranked feature, then on the two highest-ranked features, and so on.
    2. In the case that a ranker estimates the support_ attribute, that selected feature subset is validated.
  4. Once the ranker has been fit on the dataset and the validation estimator has been fit on all the feature subsets, the pipeline is scored. This means the ranker fitting times and the validation scores are aggregated wherever applicable and stored in tables according to the enabled Callbacks.
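The benchmark steps above can be sketched as follows. This is an illustrative sketch, not fseval's implementation: the estimators and the subset-validation loop are stand-ins, using scikit-learn's resample for the bootstrap and a RandomForestClassifier's feature_importances_ as the ranking.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: (optionally) resample the training data, e.g. a bootstrap,
# to assess the stability of the ranker across resamples.
X_train, y_train = resample(X_train, y_train, random_state=0)

# Step 2: fit the Feature Ranker.
ranker = RandomForestClassifier(random_state=0).fit(X_train, y_train)
ranking = np.argsort(ranker.feature_importances_)[::-1]  # best feature first

# Step 3a: validate growing feature subsets: first the top-1 feature,
# then the top-2, and so on, each with a fresh validation estimator.
subset_scores = []
for k in range(1, len(ranking) + 1):
    subset = ranking[:k]
    validator = LogisticRegression(max_iter=1000)
    validator.fit(X_train[:, subset], y_train)
    subset_scores.append(validator.score(X_test[:, subset], y_test))

# Step 4: the per-subset scores (and fitting times) would then be
# aggregated into tables by the enabled Callbacks.
```

For a ranker exposing support_ instead (step 3b), the loop collapses to a single validation on the one selected subset.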