base

Base classes and utilities for datasets.

Module Attributes

WRITERS

Supported dataset writers.

DEFAULT_WRITER

Default dataset writer.

EXECUTORS

Supported dataset executors.

DEFAULT_EXECUTOR

Default dataset executor.

Functions

adapt_to_flatten(self, document)

adapt_to_flatten_for_pretraining(self, document)

build_parser(parser)

Build an argument parser for processing a dataset.

main(cls)

The CLI entrypoint for parsing a dataset.

Classes

Dataset([writer, executor, logging_directory])

The base class for all Undertale datasets.

class undertale.datasets.base.Dataset(writer: str = 'parquet', executor: str = 'local', logging_directory: str | None = None)

Bases: object

The base class for all Undertale datasets.

Parameters:
  • writer – The name of the dataset writer to use.

  • executor – The name of the dataset executor to use.

  • logging_directory – A path to the directory to use for logging.

schema: Schema | None = None

The schema class that this dataset implements.

This should be the literal class from the schema module.

get_executor(pipeline: List[PipelineStep], **kwargs) → PipelineExecutor

Return an executor configured to run the given pipeline.

Parameters:

pipeline – A list of pipeline steps.

abstractmethod get_pipeline(input: str, writer: List[PipelineStep], parallelism: int = 1) → PipelineExecutor

Build and return the dataset processing pipeline.

Implementations should pass the assembled pipeline steps to the get_executor method, which wraps them in the configured executor.

Parameters:
  • input – Some input data from the user (path, name, etc.).

  • writer – A series of output writer steps to add to the pipeline.

  • parallelism – The degree of parallelism; dataset authors can choose to implement this however they want.
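The get_pipeline/get_executor contract above can be sketched with plain Python stand-ins. MyDataset, read_step, and the function-based "steps" below are illustrative assumptions, not Undertale or datatrove APIs; the point is only the shape: get_pipeline assembles steps (reader first, writer steps appended) and hands them to get_executor.

```python
# Hypothetical sketch of a Dataset subclass.  All names here are stand-ins;
# the real base class wraps datatrove pipeline steps and executors.
class MyDataset:
    def get_executor(self, pipeline, **kwargs):
        # Stand-in for Dataset.get_executor: wrap the steps in a runner
        # that threads documents through each step in order.
        def run():
            documents = []
            for step in pipeline:
                documents = step(documents)
            return documents
        return run

    def get_pipeline(self, input, writer, parallelism=1):
        # Read from the user-supplied input, then append the writer steps.
        def read_step(_):
            return [{"path": input, "text": "example document"}]
        return self.get_executor([read_step, *writer], tasks=parallelism)

# A trivial "writer" step that just collects documents.
collected = []
def collect_writer(documents):
    collected.extend(documents)
    return documents

executor = MyDataset().get_pipeline("data/raw", writer=[collect_writer])
executor()
```

The writer steps arrive pre-built (selected from WRITERS by the base class), so a subclass only decides where they sit in its pipeline.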

static load(path: str) → Dataset

Load a dataset from the given path.

Parameters:

path – The path from which this dataset should be loaded.

Returns:

A dataset loaded from the given path.

static store(dataset: Dataset, path: str) → None

Save a dataset to the given path.

Parameters:
  • dataset – The dataset to save.

  • path – The path where this dataset should be saved.

undertale.datasets.base.WRITERS = {'parquet': <function <lambda>>, 'pretraining': <function <lambda>>}

Supported dataset writers.

undertale.datasets.base.DEFAULT_WRITER = 'parquet'

Default dataset writer.

undertale.datasets.base.EXECUTORS = {'local': <class 'datatrove.executor.local.LocalPipelineExecutor'>, 'slurm': <class 'datatrove.executor.slurm.SlurmPipelineExecutor'>}

Supported dataset executors.

undertale.datasets.base.DEFAULT_EXECUTOR = 'local'

Default dataset executor.
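The WRITERS/EXECUTORS attributes above follow a common registry pattern: a dict maps short names to factories, and a default name selects one. A minimal sketch of that pattern, where LocalRunner and make_executor are illustrative stand-ins (the real registry maps to datatrove executor classes):

```python
# Illustrative registry pattern; LocalRunner stands in for the real
# datatrove LocalPipelineExecutor referenced in EXECUTORS above.
class LocalRunner:
    def __init__(self, pipeline):
        self.pipeline = pipeline

EXECUTORS = {"local": LocalRunner}
DEFAULT_EXECUTOR = "local"

def make_executor(pipeline, name=DEFAULT_EXECUTOR):
    # Look up the executor factory by name and construct it.
    return EXECUTORS[name](pipeline)

runner = make_executor([])
```

Keeping factories (classes or lambdas) in the dict, rather than instances, lets each dataset construct a fresh writer or executor with its own arguments.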

undertale.datasets.base.main(cls: Type[Dataset]) → None

The CLI entrypoint for parsing a dataset.

This should be called in __main__ for dataset modules.

Parameters:

cls – A dataset class to interact with.
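A self-contained sketch of what a main(cls)-style entrypoint does, based on the documented constructor parameters. The flag names (--writer, --executor, --logging-directory) and sketch_main/DemoDataset are assumptions for illustration, not the actual Undertale CLI:

```python
import argparse

# Hypothetical equivalent of main(cls): parse CLI arguments, then
# instantiate the dataset class with the documented parameters.
def sketch_main(cls, argv):
    parser = argparse.ArgumentParser(description="Process a dataset.")
    parser.add_argument("input", help="input data (path, name, etc.)")
    parser.add_argument("--writer", default="parquet")       # assumed flag
    parser.add_argument("--executor", default="local")       # assumed flag
    parser.add_argument("--logging-directory", default=None)  # assumed flag
    args = parser.parse_args(argv)
    dataset = cls(writer=args.writer, executor=args.executor,
                  logging_directory=args.logging_directory)
    return dataset, args.input

# Minimal stand-in dataset class accepting the documented parameters.
class DemoDataset:
    def __init__(self, writer, executor, logging_directory=None):
        self.writer = writer
        self.executor = executor
        self.logging_directory = logging_directory

dataset, input_path = sketch_main(DemoDataset, ["corpus/", "--executor", "slurm"])
```

In a real dataset module, per the docs above, you would instead call main(YourDataset) under `if __name__ == "__main__":`.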