PipelineConfig
class fseval.config.PipelineConfig(
    dataset: DatasetConfig = MISSING,
    cv: CrossValidatorConfig = MISSING,
    resample: ResampleConfig = MISSING,
    ranker: EstimatorConfig = MISSING,
    validator: EstimatorConfig = MISSING,
    storage: StorageConfig = MISSING,
    callbacks: Dict[str, Any] = field(default_factory=lambda: {}),
    metrics: Dict[str, Any] = field(default_factory=lambda: {}),
    n_bootstraps: int = 1,
    n_jobs: Optional[int] = 1,
    all_features_to_select: str = "range(1, min(50, p) + 1)",
    defaults: List[Any] = field(
        default_factory=lambda: [
            "_self_",
            {"dataset": MISSING},
            {"cv": "kfold"},
            {"resample": "shuffle"},
            {"ranker": MISSING},
            {"validator": MISSING},
            {"storage": "local"},
            {"callbacks": []},
            {"metrics": ["feature_importances", "ranking_scores", "validation_scores"]},
            {"override hydra/job_logging": "colorlog"},
            {"override hydra/hydra_logging": "colorlog"},
        ]
    ),
)
The complete configuration needed to run the fseval pipeline.
Attributes:
- dataset (DatasetConfig): The dataset to use for this experiment.
- cv (CrossValidatorConfig): The cross-validation method and split to use in this experiment.
- resample (ResampleConfig): Dataset resampling, e.g. with or without replacement.
- ranker (EstimatorConfig): A feature ranker or feature selector.
- validator (EstimatorConfig): An estimator used to validate the feature subsets.
- storage (StorageConfig): A storage method used to store the fit estimators.
- callbacks (Dict[str, Any]): Callbacks. Provide hooks for storing the config or results.
- metrics (Dict[str, Any]): Metrics allow custom computation after any pipeline stage.
- n_bootstraps (int): Number of 'bootstraps' to run. A bootstrap means running the pipeline again with a resampled (see resample) version of the dataset. This allows estimating stability, for example.
- n_jobs (Optional[int]): Number of CPUs to use for computing the bootstraps. The bootstraps are thus distributed over the available CPUs.
- all_features_to_select (str): Determines the feature subsets to validate with the validation estimator. The value is a string containing an arbitrary Python expression that must evaluate to a List[int]. Each number in the list is passed to sklearn.feature_selection.SelectFromModel as the max_features parameter. The default, all_features_to_select="range(1, min(50, p) + 1)", means that at most 50 subsets containing the highest-ranked features are validated; a sketch of how such an expression resolves is given after this list.
- defaults (List[Any]): Default values for the above. See the Hydra documentation on the Defaults List.
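To make the all_features_to_select expression concrete, the sketch below shows how such an expression could resolve into subset sizes for a dataset with p features. The use of eval and the evaluation context (only p being bound) are assumptions for illustration; fseval's internal handling may differ.

# A minimal sketch, assuming the expression is evaluated with `p` bound to the
# number of features in the dataset. Hypothetical values; not fseval internals.
all_features_to_select = "range(1, min(50, p) + 1)"
p = 20  # hypothetical number of features

subset_sizes = list(eval(all_features_to_select, {"p": p}))
print(subset_sizes)  # [1, 2, ..., 20]; each value is passed to
                     # sklearn.feature_selection.SelectFromModel as `max_features`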
Experiments can be configured in two ways.
- Using YAML files stored in a directory
- Using Python (Structured Configs)
Examples

YAML (conf/my_config.yaml):
defaults:
  - base_pipeline_config
  - _self_
  - override dataset: synthetic
  - override validator: knn
  - override /callbacks:
      - to_sql

n_bootstraps: 1
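As a sketch of how such a YAML file can be consumed, a Hydra entry point can compose the configuration from the conf directory. The file name run.py and the assumption that importing fseval registers base_pipeline_config and its config groups in Hydra's ConfigStore are illustrative; the actual fseval entry point may differ.

# run.py - a minimal sketch, assuming importing fseval registers
# `base_pipeline_config` and its config groups with Hydra.
import hydra
from omegaconf import DictConfig, OmegaConf

import fseval.config  # noqa: F401 (assumed to register the base configs)

@hydra.main(config_path="conf", config_name="my_config")
def main(cfg: DictConfig) -> None:
    # Print the fully composed pipeline configuration for inspection.
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()

Any value in the composed configuration can then be overridden from the command line, e.g. python run.py n_bootstraps=8.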
Structured Config (conf/my_config.py):
from omegaconf import MISSING

from fseval.config import PipelineConfig

# To set PipelineConfig defaults in a Structured Config, we must redefine the
# entire defaults list.
my_config = PipelineConfig(
    n_bootstraps=1,
    defaults=[
        "_self_",
        {"dataset": "synthetic"},
        {"cv": "kfold"},
        {"resample": "shuffle"},
        {"ranker": MISSING},
        {"validator": "knn"},
        {"storage": "local"},
        {"callbacks": ["to_sql"]},
        {"metrics": ["feature_importances", "ranking_scores", "validation_scores"]},
        {"override hydra/job_logging": "colorlog"},
        {"override hydra/hydra_logging": "colorlog"},
    ],
)
Using the override keyword is required when overriding a config group. See the Hydra documentation on the Defaults List and on overriding config group defaults.
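To make the Structured Config above usable by Hydra, it must be registered in the ConfigStore. The sketch below shows one way to do so; the config name my_config is arbitrary, and how the composed configuration is then handed to fseval's pipeline runner is not shown here.

# A minimal sketch, continuing from conf/my_config.py above. Registering the
# config under the name "my_config" lets @hydra.main compose it.
import hydra
from hydra.core.config_store import ConfigStore
from omegaconf import DictConfig

cs = ConfigStore.instance()
cs.store(name="my_config", node=my_config)

@hydra.main(config_name="my_config")
def main(cfg: DictConfig) -> None:
    ...  # hand `cfg` to the fseval pipeline runner

if __name__ == "__main__":
    main()

Because ranker is left MISSING in the defaults list, a ranker config group must still be supplied at launch time, for example as a command-line override.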