
DatasetConfig

class fseval.config.DatasetConfig(
name: str=MISSING,
task: Task=MISSING,
adapter: Any=MISSING,
adapter_callable: str="get_data",
feature_importances: Optional[Dict[str, float]]=None,
group: Optional[str]=None,
domain: Optional[str]=None,
)

Configures a dataset for use in the pipeline. The data can be loaded from various sources through an 'adapter'.

Attributes:

name : str
    Human-readable name of the dataset.
task : Task
    Either Task.classification or Task.regression.
adapter : Any
    Dataset adapter. Must be of type fseval.types.AbstractAdapter, i.e. it must implement a get_data() -> (X, y) method. Can also be a callable, in which case the callable must return a tuple (X, y).
adapter_callable : str
    Name of the method to call on the instantiated adapter class to fetch the data (X, y). Defaults to "get_data". Ignored when the target itself is a function.
feature_importances : Optional[Dict[str, float]]
    Weightings indicating relevant features or instances. Should be a dict in which each key is a NumPy selector on X and each value is a float, following the pattern X[<numpy selector>] = <float>. Example: X[:, 0:3] = 1.0 marks the first three features (indices 0, 1 and 2) as maximally relevant and all others as minimally relevant.
group : Optional[str]
    An optional group attribute, used to group datasets in the analytics stage.
domain : Optional[str]
    Dataset domain, e.g. medicine, finance, etc.
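
To make the selector semantics of feature_importances concrete, the sketch below expands (row slice, column slice) selectors into a grid of relevance weights using only the standard library. The helper relevance_matrix and the Selector alias are hypothetical names introduced for illustration; this is not how fseval itself evaluates the selectors, only a minimal model of what a selector such as X[:, 0:3] = 1.0 covers.

```python
from typing import List, Tuple

# (row slice, column slice), mirroring the two axes of X[rows, cols].
Selector = Tuple[slice, slice]


def relevance_matrix(
    n_samples: int,
    n_features: int,
    importances: List[Tuple[Selector, float]],
) -> List[List[float]]:
    """Build an n_samples x n_features grid of relevance weights.

    Hypothetical helper for illustration only; not part of fseval.
    """
    # Every cell starts out as minimally relevant.
    grid = [[0.0] * n_features for _ in range(n_samples)]
    for (rows, cols), weight in importances:
        # Slicing a range applies standard Python slice semantics,
        # so 0:3 covers indices 0, 1 and 2 (the end is exclusive).
        for i in range(n_samples)[rows]:
            for j in range(n_features)[cols]:
                grid[i][j] = weight
    return grid


# Equivalent of `X[:, 0:3] = 1.0` on a dataset with 2 samples and 5 features.
grid = relevance_matrix(2, 5, [((slice(None), slice(0, 3)), 1.0)])
print(grid[0])  # [1.0, 1.0, 1.0, 0.0, 0.0]
```

Row slices work the same way, so a selector like X[:5000, 0:4] restricts the weighting to the first 5000 samples.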

Adapters

To load data, you need to define an adapter. Several are available.

OpenMLDataset

class fseval.config.adapters.OpenMLDataset(
dataset_id: int=MISSING,
target_column: str=MISSING,
drop_qualitative: bool=False,
)

Allows loading a dataset from OpenML.

Attributes:

dataset_id : int
    The dataset ID.
target_column : str
    Which column to use as a target. This column will be used as y.
drop_qualitative : bool
    Whether to drop any column that is not numeric.

Example

For example, to load the Iris dataset:

conf/dataset/iris.yaml
name: Iris Flowers
task: classification
adapter:
  _target_: fseval.adapters.openml.OpenML
  dataset_id: 61
  target_column: class

WandbDataset

class fseval.config.adapters.WandbDataset(
artifact_id: str=MISSING
)

Loads a dataset from the Weights and Biases artifacts store. Data must be stored in two tables X and Y.

Requires being logged into the Weights and Biases CLI (in other words, having WANDB_API_KEY set) and having the wandb Python package installed.

Attributes:

artifact_id : str
    The ID of the artifact to fetch. Has to be of the following form: <entity>/<project>/<artifact_name>:<artifact_version>. For example, dunnkers/synthetic-datasets/switch:v0 would be a valid artifact_id.
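
As a quick way to check an artifact_id against the <entity>/<project>/<artifact_name>:<artifact_version> form before putting it in a config, the following sketch splits it into its four parts. The function parse_artifact_id and the ArtifactRef tuple are hypothetical names, and the rule that no part may contain "/" or ":" is an assumption made for this sketch rather than a documented Weights and Biases constraint.

```python
import re
from typing import NamedTuple


class ArtifactRef(NamedTuple):
    entity: str
    project: str
    name: str
    version: str


# <entity>/<project>/<artifact_name>:<artifact_version>
# Assumption: none of the four parts contains "/" or ":".
_ARTIFACT_ID = re.compile(r"([^/:]+)/([^/:]+)/([^/:]+):([^/:]+)")


def parse_artifact_id(artifact_id: str) -> ArtifactRef:
    """Split an artifact ID into its parts (hypothetical helper, not part of fseval)."""
    match = _ARTIFACT_ID.fullmatch(artifact_id)
    if match is None:
        raise ValueError(
            f"expected <entity>/<project>/<artifact_name>:<artifact_version>, "
            f"got {artifact_id!r}"
        )
    return ArtifactRef(*match.groups())


ref = parse_artifact_id("dunnkers/synthetic-datasets/switch:v0")
print(ref.project, ref.version)  # synthetic-datasets v0
```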

Example

For example, we could load the following artifact:

[Image: wandb adapter artifact example]

using the following config:

conf/dataset/chen.yaml
name: Switch (Chen et al.)
task: regression
adapter:
  _target_: fseval.adapters.wandb.Wandb
  artifact_id: dunnkers/synthetic-datasets/switch:v0
feature_importances:
  X[:5000, 0:4]: 1.0
  X[5000:, 4:8]: 1.0

<> Functions

We can also use functions as adapters, as long as they return a tuple (X, y). For example, using sklearn.datasets.make_classification as an adapter:

conf/dataset/synthetic.yaml
name: My synthetic dataset
task: classification
adapter:
  _target_: sklearn.datasets.make_classification
  n_samples: 10000
  n_informative: 2
  n_classes: 2
  n_features: 20
  n_redundant: 0
  random_state: 0
  shuffle: false
feature_importances:
  X[:, 0:2]: 1.0
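
The two kinds of adapter (an object exposing a data-fetching method, or a plain function returning (X, y)) could be handled along the following lines. This is a minimal sketch of the rule described for adapter and adapter_callable, not fseval's actual loading code; load_data, ListAdapter and make_data are hypothetical names used only here.

```python
from typing import Any, Tuple


def load_data(adapter: Any, adapter_callable: str = "get_data") -> Tuple[Any, Any]:
    """Fetch (X, y) from either kind of adapter (hypothetical helper)."""
    method = getattr(adapter, adapter_callable, None)
    if callable(method):
        # Class-based adapter: call the method named by adapter_callable.
        X, y = method()
    elif callable(adapter):
        # Function adapter: adapter_callable is ignored, call it directly.
        X, y = adapter()
    else:
        raise TypeError(
            f"adapter must be callable or define {adapter_callable}()"
        )
    return X, y


class ListAdapter:
    """Toy class-based adapter."""

    def get_data(self):
        return [[0.0], [1.0]], [0, 1]


def make_data():
    """Toy function adapter."""
    return [[2.0], [3.0]], [1, 0]


X_a, y_a = load_data(ListAdapter())
X_b, y_b = load_data(make_data)
print(y_a, y_b)  # [0, 1] [1, 0]
```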

⚙️ Custom adapter

To load datasets from different sources, we can use different adapters. You can create an adapter by implementing this interface:

class AbstractAdapter(ABC, BaseEstimator):
    @abstractmethod
    def get_data(self) -> Tuple[List, List]:
        ...

For example:

benchmark.py
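from dataclasses import dataclass
from typing import List, Tuple

from fseval.types import AbstractAdapter

@dataclass
class CustomAdapter(AbstractAdapter):
    def get_data(self) -> Tuple[List, List]:
        # Placeholder data: replace with real loading logic.
        X = [[]]  # samples, one inner list per row
        Y = []  # targets, one entry per row

        return X, Y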
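
For a slightly more concrete picture, here is a sketch of a custom adapter that parses CSV text held in memory. So that it runs without fseval installed, it defines a local stand-in for fseval.types.AbstractAdapter (without the BaseEstimator base); CsvTextAdapter and its fields csv_text and target_column are hypothetical names, and the sketch assumes every feature column is numeric.

```python
import csv
import io
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Tuple


class AbstractAdapter(ABC):
    """Local stand-in for fseval.types.AbstractAdapter, for illustration only."""

    @abstractmethod
    def get_data(self) -> Tuple[List, List]:
        ...


@dataclass
class CsvTextAdapter(AbstractAdapter):
    """Hypothetical adapter that reads a dataset from CSV text."""

    csv_text: str
    target_column: str

    def get_data(self) -> Tuple[List, List]:
        reader = csv.DictReader(io.StringIO(self.csv_text))
        # Every column except the target is treated as a numeric feature.
        feature_names = [
            name for name in (reader.fieldnames or []) if name != self.target_column
        ]
        X: List[List[float]] = []
        y: List[str] = []
        for row in reader:
            X.append([float(row[name]) for name in feature_names])
            y.append(row[self.target_column])
        return X, y


adapter = CsvTextAdapter(csv_text="a,b,label\n1,2,yes\n3,4,no\n", target_column="label")
X, y = adapter.get_data()
print(X)  # [[1.0, 2.0], [3.0, 4.0]]
print(y)  # ['yes', 'no']
```

In a real project the class would subclass fseval.types.AbstractAdapter instead of the stand-in, and could then be referenced from a dataset config through its _target_ path.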

More examples

For more dataset configs, see the repo.