Datasets

Build a Dataset

To build a dataset, run its module directly, passing the path to the raw data and the output directory:

python -m undertale.datasets.{dataset} {input} {output}

For example, to build the HumanEval-X dataset from the raw dataset at humanevalx-raw/ and save it to a directory called humanevalx/, run:

# Parse the HumanEval-X dataset.
python -m undertale.datasets.humanevalx humanevalx-raw/ humanevalx/

# Parse the HumanEval-X dataset with 8 parallel processes.
python -m undertale.datasets.humanevalx humanevalx-raw/ humanevalx/ --parallelism 8
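
If builds are launched from Python rather than typed by hand, the same command line can be assembled programmatically. The following is a minimal sketch that assumes only the module path pattern and the --parallelism flag shown above. The helper name build_dataset_command is hypothetical and is not part of undertale.

```python
import shlex
import sys
from typing import List, Optional


def build_dataset_command(
    dataset: str,
    input_path: str,
    output_path: str,
    parallelism: Optional[int] = None,
) -> List[str]:
    """Assemble the argument list for `python -m undertale.datasets.{dataset}`.

    Hypothetical helper for illustration only; it is not part of undertale.
    """
    command = [
        sys.executable,
        "-m",
        f"undertale.datasets.{dataset}",
        input_path,
        output_path,
    ]
    # Only pass --parallelism when a value was requested.
    if parallelism is not None:
        command += ["--parallelism", str(parallelism)]
    return command


if __name__ == "__main__":
    command = build_dataset_command(
        "humanevalx", "humanevalx-raw/", "humanevalx/", parallelism=8
    )
    # Print the command instead of running it; pass the list to
    # subprocess.run(command, check=True) to actually start the build.
    print(shlex.join(command))
```

Building the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.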

Explore a Dataset with a Shell

To load a dataset into a Python shell, run:

python -m undertale.datasets.scripts.shell {path}

# Load the HumanEval-X dataset into a shell.
python -m undertale.datasets.scripts.shell humanevalx/

Use a Dataset in a Script

To use a dataset that has already been built in your own script, load it with Dataset.load:

from undertale.datasets.base import Dataset

dataset = Dataset.load(path)

...

Here, path is the path to the saved dataset directory.

vLLM Server Integration

Some datasets use a vLLM server to generate summaries of code. To build a dataset whose pipeline includes a vLLM step, you need a running vLLM server (see the vLLM documentation for setup details) and must set the VLLM_SERVER_ADDRESS environment variable to its address:

export VLLM_SERVER_ADDRESS=http://my.vllm.server:8000/v1

If your vLLM server requires an API key, also set the VLLM_API_KEY environment variable.
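
A pre-flight check can catch a missing server address before a long build starts. The following is a minimal sketch that assumes only the two environment variable names given above. The function read_vllm_settings is hypothetical, and undertale itself may read these variables differently.

```python
import os
from typing import Mapping, Optional, Tuple


def read_vllm_settings(
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Optional[str]]:
    """Return (server address, API key or None) from the environment.

    Hypothetical pre-flight check for illustration; not part of undertale.
    """
    address = env.get("VLLM_SERVER_ADDRESS")
    if not address:
        raise RuntimeError(
            "VLLM_SERVER_ADDRESS is not set; export it before building "
            "a dataset whose pipeline includes a vLLM step."
        )
    # The API key is optional: it is only needed if the server requires one.
    api_key = env.get("VLLM_API_KEY") or None
    return address, api_key
```

Calling read_vllm_settings() with no arguments reads the current process environment. Passing a plain dictionary instead makes the check easy to exercise in tests.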