base

Base classes and utilities for datasets.

Module Attributes

WRITERS

Supported dataset writers.

DEFAULT_WRITER

Default dataset writer.

EXECUTORS

Supported dataset executors.

DEFAULT_EXECUTOR

Default dataset executor.

Functions

adapt_to_flatten(self, document)

adapt_to_flatten_for_pretraining(self, document)

build_parser(parser)

Build an argument parser for processing a dataset.

main(cls)

The CLI entrypoint for parsing a dataset.

Classes

Dataset([writer, executor, logging_directory])

The base class for all Undertale datasets.

class undertale.datasets.base.Dataset(writer: str = 'parquet', executor: str = 'local', logging_directory: str | None = None)

Bases: object

The base class for all Undertale datasets.

Parameters:
  • writer – The name of the dataset writer to use.

  • executor – The name of the dataset executor to use.

  • logging_directory – A path to the directory to use for logging.

schema: Schema | None = None

The schema class that this dataset implements.

This should be the literal class from the schema module.

get_executor(pipeline: List[PipelineStep], **kwargs) → PipelineExecutor

Return an executor configured to run the given pipeline.

Parameters:

pipeline – A list of pipeline steps.

abstractmethod get_pipeline(input: str, writer: List[PipelineStep], parallelism: int = 1) → PipelineExecutor

Build and return the dataset processing pipeline.

Implementations should pass the assembled pipeline steps to the get_executor method, which wraps them in the configured executor.

Parameters:
  • input – Some input data from the user (path, name, etc.).

  • writer – A series of output writer steps to add to the pipeline.

  • parallelism – The degree of parallelism; dataset authors can choose to implement this however they want.
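The get_pipeline/get_executor contract above can be sketched with plain Python stand-ins. MyDataset, read_step, and the function-based "steps" below are illustrative assumptions, not Undertale or datatrove APIs; the point is only the shape: get_pipeline assembles steps (reader first, writer steps appended) and hands them to get_executor.

```python
# Hypothetical sketch of a Dataset subclass.  All names here are stand-ins;
# the real base class wraps datatrove pipeline steps and executors.
class MyDataset:
    def get_executor(self, pipeline, **kwargs):
        # Stand-in for Dataset.get_executor: wrap the steps in a runner
        # that threads documents through each step in order.
        def run():
            documents = []
            for step in pipeline:
                documents = step(documents)
            return documents
        return run

    def get_pipeline(self, input, writer, parallelism=1):
        # Read from the user-supplied input, then append the writer steps.
        def read_step(_):
            return [{"path": input, "text": "example document"}]
        return self.get_executor([read_step, *writer], tasks=parallelism)

# A trivial "writer" step that just collects documents.
collected = []
def collect_writer(documents):
    collected.extend(documents)
    return documents

executor = MyDataset().get_pipeline("data/raw", writer=[collect_writer])
executor()
```

The writer steps arrive pre-built (selected from WRITERS by the base class), so a subclass only decides where they sit in its pipeline.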

static load(path: str) → Dataset

Load a dataset from the given path.

Parameters:

path – The path from which this dataset should be loaded.

Returns:

A dataset loaded from the given path.

static store(dataset: Dataset, path: str) → None

Save a dataset to the given path.

Parameters:
  • dataset – The dataset to save.

  • path – The path where this dataset should be saved.

undertale.datasets.base.WRITERS = {'parquet': <function <lambda>>, 'pretraining': <function <lambda>>}

Supported dataset writers.

undertale.datasets.base.DEFAULT_WRITER = 'parquet'

Default dataset writer.

undertale.datasets.base.EXECUTORS = {'local': <class 'datatrove.executor.local.LocalPipelineExecutor'>, 'slurm': <class 'datatrove.executor.slurm.SlurmPipelineExecutor'>}

Supported dataset executors.

undertale.datasets.base.DEFAULT_EXECUTOR = 'local'

Default dataset executor.
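The WRITERS/EXECUTORS attributes above follow a common registry pattern: a dict maps short names to factories, and a default name selects one. A minimal sketch of that pattern, where LocalRunner and make_executor are illustrative stand-ins (the real registry maps to datatrove executor classes):

```python
# Illustrative registry pattern; LocalRunner stands in for the real
# datatrove LocalPipelineExecutor referenced in EXECUTORS above.
class LocalRunner:
    def __init__(self, pipeline):
        self.pipeline = pipeline

EXECUTORS = {"local": LocalRunner}
DEFAULT_EXECUTOR = "local"

def make_executor(pipeline, name=DEFAULT_EXECUTOR):
    # Look up the executor factory by name and construct it.
    return EXECUTORS[name](pipeline)

runner = make_executor([])
```

Keeping factories (classes or lambdas) in the dict, rather than instances, lets each dataset construct a fresh writer or executor with its own arguments.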

undertale.datasets.base.main(cls: Type[Dataset]) → None

The CLI entrypoint for parsing a dataset.

This should be called in __main__ for dataset modules.

Parameters:

cls – A dataset class to interact with.
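A self-contained sketch of what a main(cls)-style entrypoint does, based on the documented constructor parameters. The flag names (--writer, --executor, --logging-directory) and sketch_main/DemoDataset are assumptions for illustration, not the actual Undertale CLI:

```python
import argparse

# Hypothetical equivalent of main(cls): parse CLI arguments, then
# instantiate the dataset class with the documented parameters.
def sketch_main(cls, argv):
    parser = argparse.ArgumentParser(description="Process a dataset.")
    parser.add_argument("input", help="input data (path, name, etc.)")
    parser.add_argument("--writer", default="parquet")       # assumed flag
    parser.add_argument("--executor", default="local")       # assumed flag
    parser.add_argument("--logging-directory", default=None)  # assumed flag
    args = parser.parse_args(argv)
    dataset = cls(writer=args.writer, executor=args.executor,
                  logging_directory=args.logging_directory)
    return dataset, args.input

# Minimal stand-in dataset class accepting the documented parameters.
class DemoDataset:
    def __init__(self, writer, executor, logging_directory=None):
        self.writer = writer
        self.executor = executor
        self.logging_directory = logging_directory

dataset, input_path = sketch_main(DemoDataset, ["corpus/", "--executor", "slurm"])
```

In a real dataset module, per the docs above, you would instead call main(YourDataset) under `if __name__ == "__main__":`.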