
PipelineConfig

class fseval.config.PipelineConfig(
    dataset: DatasetConfig = MISSING,
    cv: CrossValidatorConfig = MISSING,
    resample: ResampleConfig = MISSING,
    ranker: EstimatorConfig = MISSING,
    validator: EstimatorConfig = MISSING,
    storage: StorageConfig = MISSING,
    callbacks: Dict[str, Any] = field(default_factory=lambda: {}),
    metrics: Dict[str, Any] = field(default_factory=lambda: {}),
    n_bootstraps: int = 1,
    n_jobs: Optional[int] = 1,
    all_features_to_select: str = "range(1, min(50, p) + 1)",
    defaults: List[Any] = field(
        default_factory=lambda: [
            "_self_",
            {"dataset": MISSING},
            {"cv": "kfold"},
            {"resample": "shuffle"},
            {"ranker": MISSING},
            {"validator": MISSING},
            {"storage": "local"},
            {"callbacks": []},
            {"metrics": ["feature_importances", "ranking_scores", "validation_scores"]},
            {"override hydra/job_logging": "colorlog"},
            {"override hydra/hydra_logging": "colorlog"},
        ]
    )
)

The complete configuration needed to run the fseval pipeline.
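
A config like this is typically consumed through a Hydra entry point. The sketch below is a minimal, hedged example: it assumes the run_pipeline function in fseval.main (as in the project's quick-start) and a config named my_config stored under conf/; verify both against your installed version.

import hydra
from fseval.config import PipelineConfig
from fseval.main import run_pipeline  # assumed entry point; check your fseval version

@hydra.main(config_path="conf", config_name="my_config")
def main(cfg: PipelineConfig) -> None:
    # Hydra composes the PipelineConfig from the defaults list, YAML files
    # and command-line overrides, then hands it to the pipeline runner.
    run_pipeline(cfg)

if __name__ == "__main__":
    main()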

Attributes:

dataset : DatasetConfig
    Determines the dataset to use for this experiment.
cv : CrossValidatorConfig
    The CV method and split to use in this experiment.
resample : ResampleConfig
    Dataset resampling; e.g. with or without replacement.
ranker : EstimatorConfig
    A Feature Ranker or Feature Selector.
validator : EstimatorConfig
    An estimator used to validate the feature subsets.
storage : StorageConfig
    A storage method used to store the fitted estimators.
callbacks : Dict[str, Any]
    Callbacks. Provide hooks for storing the config or results.
metrics : Dict[str, Any]
    Metrics allow custom computation after any pipeline stage.
n_bootstraps : int
    Number of 'bootstraps' to run. A bootstrap means running the pipeline again, but with a resampled (see resample) version of the dataset. This allows estimating stability, for example.
n_jobs : Optional[int]
    Number of CPUs to use for computing the bootstraps. The bootstraps are distributed over the available CPUs.
all_features_to_select : str
    Determines the feature subsets to validate with the validation estimator. The parameter is a string containing an arbitrary Python expression that must evaluate to a List[int] object. Each number in the list is passed to sklearn.feature_selection.SelectFromModel as the max_features parameter; see the evaluation sketch after this list.
      • For example, all_features_to_select="[1, 2]" means two feature subsets are evaluated with the validation estimator: the first containing only the highest ranked feature and the second containing the two highest ranked features.
      • For example, all_features_to_select="range(1, p + 1)" means that all feature subsets are evaluated.
    By default, this parameter is set to all_features_to_select="range(1, min(50, p) + 1)", meaning at most 50 subsets containing the highest ranked features are validated.
defaults : List[Any]
    Default values for the attributes above. See the Hydra docs on the Defaults List.
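
To illustrate how an all_features_to_select expression turns into feature subset sizes, the snippet below evaluates the default expression by hand. This is a hypothetical illustration only: p stands for the number of features in the dataset, and fseval's actual evaluation code may differ.

# Hypothetical illustration of evaluating an `all_features_to_select`
# expression; `p` is assumed to be the dataset's number of features.
p = 20  # e.g. a dataset with 20 features
expr = "range(1, min(50, p) + 1)"  # the default expression
subset_sizes = list(eval(expr, {"p": p}))
print(subset_sizes)  # [1, 2, ..., 20]; each size is passed to SelectFromModel as max_features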

Experiments can be configured in two ways.

  1. Using YAML files stored in a directory
  2. Using Python (Structured Configs)

Examples

conf/my_config.yaml
defaults:
  - base_pipeline_config
  - _self_
  - override dataset: synthetic
  - override validator: knn
  - override /callbacks:
      - to_sql

n_bootstraps: 1

Using the override keyword is required when overriding an existing config group; see the Hydra documentation on the Defaults List for details.
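
The same experiment can also be configured in Python using Structured Configs, by registering a config node with Hydra's ConfigStore. The sketch below is a hedged, minimal example: the PipelineConfig fields are taken from the signature above, but the available config group options (such as dataset: synthetic or validator: knn) depend on what is registered in your setup.

# Sketch of the Structured Config (Python) approach, analogous to the
# YAML example above.
from hydra.core.config_store import ConfigStore
from fseval.config import PipelineConfig

cs = ConfigStore.instance()
cs.store(
    name="my_config",
    node=PipelineConfig(
        n_bootstraps=1,
        all_features_to_select="range(1, p + 1)",  # validate every subset size
    ),
)

Config groups left at MISSING (dataset, ranker, validator) can then be chosen at runtime, e.g. with command-line overrides such as dataset=synthetic validator=knn, assuming those options are registered; this mirrors the override entries in the YAML example above.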