base
Base classes and utilities for datasets.
Module Attributes
WRITERS | Supported dataset writers.
DEFAULT_WRITER | Default dataset writer.
EXECUTORS | Supported dataset executors.
DEFAULT_EXECUTOR | Default dataset executor.
Functions
Build an argument parser for processing a dataset.
The CLI entrypoint for parsing a dataset.
Classes
Dataset | The base class for all Undertale datasets.
- class undertale.datasets.base.Dataset(writer: str = 'parquet', executor: str = 'local', logging_directory: str | None = None)
Bases: object
The base class for all Undertale datasets.
- Parameters:
writer – The name of the dataset writer to use.
executor – The name of the dataset executor to use.
logging_directory – A path to the directory to use for logging.
- schema: Schema | None = None
The schema class that this dataset implements.
This should be the literal class from the schema module.
- get_executor(pipeline: List[PipelineStep], **kwargs) PipelineExecutor
Returns an executor for the current pipeline.
- Parameters:
pipeline – A list of pipeline steps.
- abstractmethod get_pipeline(input: str, writer: List[PipelineStep], parallelism: int = 1) PipelineExecutor
Build and return the dataset processing pipeline.
This should make use of the get_executor method to wrap the configured executor.
- Parameters:
input – Some input data from the user (path, name, etc.).
writer – A series of output writer steps to add to the pipeline.
parallelism – The degree of parallelism; dataset authors can choose to implement this however they want.
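The contract between get_pipeline and get_executor can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: `DatasetStandIn` and `ExecutorStandIn` are hypothetical stand-ins that only mirror the documented signatures, pipeline steps are modeled as plain functions, and the `tasks` keyword is an invented executor option. It does not import or reproduce the real undertale or datatrove classes.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, List, Optional

# Assumption: a step is modeled as a function from rows to rows.
# Real pipeline steps are richer objects; this keeps the sketch runnable.
Step = Callable[[List[str]], List[str]]


class ExecutorStandIn:
    """Hypothetical stand-in for a pipeline executor."""

    def __init__(self, pipeline: List[Step], **kwargs: Any) -> None:
        self.pipeline = pipeline
        self.options = kwargs

    def run(self) -> List[str]:
        rows: List[str] = []
        for step in self.pipeline:
            # Feed the output of each step into the next one.
            rows = step(rows)
        return rows


class DatasetStandIn(ABC):
    """Hypothetical stand-in mirroring the documented Dataset interface."""

    def __init__(
        self,
        writer: str = "parquet",
        executor: str = "local",
        logging_directory: Optional[str] = None,
    ) -> None:
        self.writer = writer
        self.executor = executor
        self.logging_directory = logging_directory

    def get_executor(self, pipeline: List[Step], **kwargs: Any) -> ExecutorStandIn:
        # Assumption: the real method selects an executor class by name.
        return ExecutorStandIn(pipeline, **kwargs)

    @abstractmethod
    def get_pipeline(
        self, input: str, writer: List[Step], parallelism: int = 1
    ) -> ExecutorStandIn:
        ...


class MnemonicDataset(DatasetStandIn):
    """Toy dataset: splits the input string and upper-cases each token."""

    def get_pipeline(self, input, writer, parallelism=1):
        def read(_rows: List[str]) -> List[str]:
            return input.split()

        def normalize(rows: List[str]) -> List[str]:
            return [row.upper() for row in rows]

        # Dataset-specific steps come first, then the supplied writer steps.
        # The result is wrapped via get_executor, as the documentation asks.
        return self.get_executor([read, normalize, *writer], tasks=parallelism)


collected: List[str] = []


def collect(rows: List[str]) -> List[str]:
    """Toy writer step that records the rows it receives."""
    collected.extend(rows)
    return rows


executor = MnemonicDataset().get_pipeline("push pop ret", [collect], parallelism=2)
executor.run()
print(collected)
```

In this sketch the subclass decides how parallelism is used (here it is simply forwarded as an executor option), which matches the documented freedom given to dataset authors.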
- static load(path: str) Dataset
Load a dataset from the given path.
- Parameters:
path – The path from which this dataset should be loaded.
- Returns:
A dataset loaded from the given path.
- static store(dataset: Dataset, path: str) None
Save a dataset to the given path.
- Parameters:
dataset – The dataset to save.
path – The path where this dataset should be saved.
- undertale.datasets.base.WRITERS = {'parquet': <function <lambda>>, 'pretraining': <function <lambda>>}
Supported dataset writers.
- undertale.datasets.base.DEFAULT_WRITER = 'parquet'
Default dataset writer.
- undertale.datasets.base.EXECUTORS = {'local': <class 'datatrove.executor.local.LocalPipelineExecutor'>, 'slurm': <class 'datatrove.executor.slurm.SlurmPipelineExecutor'>}
Supported dataset executors.
- undertale.datasets.base.DEFAULT_EXECUTOR = 'local'
Default dataset executor.
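The module attributes above follow a common pattern: a dictionary maps a name to a factory, and a separate constant names the default entry. The sketch below shows one way such a registry might be resolved. It is an assumption-laden illustration: the `WRITERS` dictionary here is a local stand-in whose lambdas return strings rather than real writer steps, and `resolve` is a hypothetical helper that is not part of the documented module.

```python
from typing import Callable, Dict, Optional

# Local stand-in registry (assumption): the real WRITERS values are lambdas
# whose arguments and return values are not shown in this documentation.
WRITERS: Dict[str, Callable[[str], str]] = {
    "parquet": lambda output: f"parquet-writer:{output}",
    "pretraining": lambda output: f"pretraining-writer:{output}",
}
DEFAULT_WRITER = "parquet"


def resolve(
    registry: Dict[str, Callable[[str], str]],
    name: Optional[str],
    default: str,
) -> Callable[[str], str]:
    """Hypothetical helper: look up a factory by name, falling back to the default."""
    key = default if name is None else name
    if key not in registry:
        supported = ", ".join(sorted(registry))
        raise ValueError(f"unsupported name {key!r}; expected one of: {supported}")
    return registry[key]


# No name given: the default entry is used.
default_factory = resolve(WRITERS, None, DEFAULT_WRITER)
print(default_factory("out"))

# An explicit name selects a different entry.
pretraining_factory = resolve(WRITERS, "pretraining", DEFAULT_WRITER)
print(pretraining_factory("out"))
```

Failing early with the list of supported names keeps a misspelled writer or executor name from surfacing later as an obscure error inside the pipeline.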