Datasets¶
Build a Dataset¶
To build a dataset, run its pipeline script from the pipelines/datasets
directory:
python pipelines/datasets/{dataset}.py {input} {output}
For example, to build the HumanEval-X dataset from the raw dataset snapshot at
humaneval-x-raw/20251114-100300.tgz and save it to a directory called
humaneval-x/, run:
# Process the HumanEval-X dataset.
python pipelines/datasets/humaneval-x.py \
humaneval-x-raw/20251114-100300.tgz \
humaneval-x
Environments¶
Customizing your environment can make it easier to control Undertale
parameters for your specific setup. A couple of example environment files are
included in the environments directory and can be activated as follows:
source environments/example.env
The environments directory includes a simple example for local development
as well as a more complex example representing a SLURM cluster for distributed
pipelines and training.
Parallelism¶
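All dataset commands can be run in parallel. Use the --parallelism option to
set the level of parallelism, and add --cluster slurm when running on a SLURM
cluster.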
# Process HumanEval-X with custom local parallelism.
python pipelines/datasets/humaneval-x.py \
humaneval-x-raw/20251114-100300.tgz \
humaneval-x \
--parallelism 8
# On a SLURM cluster.
srun python pipelines/datasets/humaneval-x.py \
humaneval-x-raw/20251114-100300.tgz \
humaneval-x \
--cluster slurm \
--parallelism 16
Explore a Dataset with a Shell¶
To load a dataset into a Python shell, run:
python -m undertale.utils.datasets.shell {path}
# Load the HumanEval-X dataset into a shell.
python -m undertale.utils.datasets.shell humaneval-x/
Large Datasets¶
The undertale.utils.datasets.shell utility uses pandas to load the dataset,
which requires the entire dataset to fit in memory. If you’re working with a
larger-than-memory dataset, you can use the polars shell instead to get a
LazyFrame (see the Polars Documentation for more details):
python -m undertale.utils.datasets.shell.polars {path}
# Load the HumanEval-X dataset into a polars.LazyFrame.
python -m undertale.utils.datasets.shell.polars humaneval-x/
Use a Dataset in a Script¶
Final datasets are simply large directories of Parquet files. Datasets can be
loaded in Python in any of the usual ways you would load Parquet, for example
with pandas:
import pandas
dataset = pandas.read_parquet(path)
...
Where path is the path to the saved dataset directory.
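Because a dataset is just a directory of Parquet files, it can be inspected before loading, for instance to judge whether it will fit in memory. The following is a minimal sketch using only the standard library. The helper name summarize_dataset is hypothetical, and the ".parquet" file extension is an assumption about how chunk files are named, not something this page confirms.

```python
from pathlib import Path


def summarize_dataset(path):
    """Return (chunk_count, total_bytes) for a dataset directory.

    Assumption: chunk files end in ".parquet". Adjust the pattern if the
    dataset on disk is named differently.
    """
    # Search recursively so nested chunk directories are also counted.
    chunks = sorted(Path(path).rglob("*.parquet"))
    total_bytes = sum(chunk.stat().st_size for chunk in chunks)
    return len(chunks), total_bytes
```

For example, summarize_dataset("humaneval-x/") would report how many chunk files the directory holds and their combined on-disk size. Note that the in-memory size of a loaded dataset is usually larger than its compressed size on disk.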
Split a Dataset¶
Splitting a dataset into training, validation, and test sets ahead of training
can be more efficient and ensures that the split is deterministic. A helper
utility is available for this.
# Two-way split: 90% training, 10% validation (default).
#
# Writes to humaneval-x-training/ and humaneval-x-validation/.
python -m undertale.utils.datasets.split \
humaneval-x/ \
humaneval-x
# Three-way split: 80% training, 10% validation, 10% test.
#
# Writes to humaneval-x-training/, humaneval-x-validation/, and
# humaneval-x-testing/.
python -m undertale.utils.datasets.split \
humaneval-x/ \
humaneval-x \
--splits training:80 validation:10 testing:10
Percentages must sum to 100. See the --seed option to control split
randomization.
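To illustrate why a fixed seed makes the split reproducible, here is a minimal sketch of a seeded percentage split over row indices. It is not the split utility's actual implementation. The function name assign_splits and its parameters are hypothetical, and the real utility may assign rows differently.

```python
import random


def assign_splits(num_rows, splits, seed=0):
    """Assign row indices to named splits, deterministically for a seed.

    `splits` maps split names to integer percentages that must sum to 100.
    Illustrative only: this is an assumption about how such a split could
    work, not the utility's real logic.
    """
    if sum(splits.values()) != 100:
        raise ValueError("percentages must sum to 100")

    indices = list(range(num_rows))
    # A dedicated Random instance: the same seed yields the same shuffle.
    random.Random(seed).shuffle(indices)

    assignment = {}
    start = 0
    cumulative = 0
    for name, percent in splits.items():
        cumulative += percent
        # Cumulative boundaries guarantee every row lands in some split.
        end = num_rows * cumulative // 100
        assignment[name] = sorted(indices[start:end])
        start = end
    return assignment
```

Calling assign_splits(1000, {"training": 80, "validation": 10, "testing": 10}, seed=42) twice returns identical assignments, which is the property that makes a split performed ahead of training repeatable.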
Repartition a Dataset¶
To repartition a dataset into a fixed number of chunks or by target chunk size,
use the repartition utility.
# Repartition to exactly 32 chunk files.
python -m undertale.utils.datasets.repartition \
humaneval-x/ \
humaneval-x-repartitioned \
--chunks 32
# Repartition by target chunk size.
python -m undertale.utils.datasets.repartition \
humaneval-x/ \
humaneval-x-repartitioned \
--size 25MB
Exactly one of --chunks or --size must be specified.
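The relationship between the two options can be sketched as follows: a target size implies a chunk count once the total dataset size is known. This is an illustration under stated assumptions, not the repartition utility's code. The names parse_size and chunk_count are hypothetical, and decimal units (1 MB = 1,000,000 bytes) are assumed since this page does not say how the utility interprets size suffixes.

```python
import math

# Assumption: decimal units. "B" is listed last so that it does not match
# the trailing "B" of "KB", "MB", or "GB" first.
UNITS = {"KB": 1000, "MB": 1000**2, "GB": 1000**3, "B": 1}


def parse_size(text):
    """Parse a size string such as "25MB" into a number of bytes."""
    text = text.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)  # no suffix: treat as a plain byte count


def chunk_count(total_bytes, chunks=None, size=None):
    """Resolve the number of output chunks from exactly one option."""
    if (chunks is None) == (size is None):
        raise ValueError("exactly one of chunks or size must be specified")
    if chunks is not None:
        return chunks
    # Round up so that no chunk exceeds the target size; at least one chunk.
    return max(1, math.ceil(total_bytes / parse_size(size)))
```

Under these assumptions, a 100,000,000-byte dataset repartitioned with a 25MB target would yield 4 chunks.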
Drop or Keep Columns¶
To drop or keep specific columns from a dataset, use the drop utility.
# Drop specific columns.
python -m undertale.utils.datasets.drop \
humaneval-x/ \
humaneval-x-filtered \
--drop metadata source
# Keep only specific columns.
python -m undertale.utils.datasets.drop \
humaneval-x/ \
humaneval-x-filtered \
--keep id solution
Exactly one of --drop or --keep must be specified.
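The mutually exclusive semantics of the two options can be sketched in a few lines. This is a minimal illustration, not the drop utility's implementation. The function name select_columns is hypothetical, and raising an error for unknown column names is an assumption about desirable behavior.

```python
def select_columns(columns, drop=None, keep=None):
    """Return the columns that remain after applying drop or keep.

    Exactly one of `drop` or `keep` must be given. The original column
    order is preserved in both cases.
    """
    if (drop is None) == (keep is None):
        raise ValueError("exactly one of drop or keep must be specified")

    requested = drop if drop is not None else keep
    missing = [name for name in requested if name not in columns]
    if missing:
        # Assumption: unknown names are treated as an error, not ignored.
        raise KeyError(f"unknown columns: {missing}")

    if drop is not None:
        return [name for name in columns if name not in drop]
    return [name for name in columns if name in keep]
```

For a dataset with columns id, solution, metadata, and source, dropping metadata and source leaves the same columns as keeping id and solution.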
Rename Columns¶
To rename one or more columns in a dataset, use the rename utility.
# Rename a single column.
python -m undertale.utils.datasets.rename \
humaneval-x/ \
humaneval-x-renamed \
--rename source:origin
# Rename multiple columns at once.
python -m undertale.utils.datasets.rename \
humaneval-x/ \
humaneval-x-renamed \
--rename source:origin metadata:info
The output dataset preserves the same chunk structure as the input.
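The old:new pairs passed to --rename amount to a column mapping. The sketch below shows one way such pairs could be parsed and applied to a list of column names. The helpers parse_renames and rename_columns are hypothetical and are not part of the rename utility.

```python
def parse_renames(pairs):
    """Turn ["old:new", ...] strings into an {old: new} mapping."""
    mapping = {}
    for pair in pairs:
        old, separator, new = pair.partition(":")
        if not separator or not old or not new:
            raise ValueError(f"expected old:new, got {pair!r}")
        mapping[old] = new
    return mapping


def rename_columns(columns, mapping):
    """Apply the mapping, leaving unmapped columns unchanged."""
    return [mapping.get(name, name) for name in columns]
```

The same mapping can be applied to a dataset already loaded with pandas by using dataset.rename(columns=mapping).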