Datasets
Build a Dataset
To build a dataset, call the dataset module directly:
python -m undertale.datasets.{dataset} {input} {output}
For example, to build the HumanEval-X dataset from the raw dataset at
humanevalx-raw/ and save it to a directory called humanevalx/, run:
# Parse the HumanEval-X dataset.
python -m undertale.datasets.humanevalx humanevalx-raw/ humanevalx/
# Parse the HumanEval-X dataset with 8 parallel processes.
python -m undertale.datasets.humanevalx humanevalx-raw/ humanevalx/ --parallelism 8
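When several datasets need to be built in one run, the same command can be driven from Python. The following is a minimal sketch, not part of undertale: build_command and build_dataset are hypothetical helper names, and the code assumes only the module path pattern and the --parallelism flag shown above.

```python
import subprocess
import sys


def build_command(dataset, input_path, output_path, parallelism=None):
    """Assemble the argument list for building one dataset.

    Hypothetical helper: it only mirrors the command-line pattern
    `python -m undertale.datasets.{dataset} {input} {output}`.
    """
    command = [
        sys.executable,  # the interpreter running this script
        "-m",
        f"undertale.datasets.{dataset}",
        input_path,
        output_path,
    ]
    # Only pass --parallelism when the caller asked for it.
    if parallelism is not None:
        command += ["--parallelism", str(parallelism)]
    return command


def build_dataset(dataset, input_path, output_path, parallelism=None):
    """Run the build and raise CalledProcessError if it exits non-zero."""
    command = build_command(dataset, input_path, output_path, parallelism)
    subprocess.run(command, check=True)
```

For instance, build_dataset("humanevalx", "humanevalx-raw/", "humanevalx/", parallelism=8) runs the same command as the second example above.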
Explore a Dataset with a Shell
To load a dataset into a Python shell, run:
python -m undertale.datasets.scripts.shell {path}
# Load the HumanEval-X dataset into a shell.
python -m undertale.datasets.scripts.shell humanevalx/
Use a Dataset in a Script
To write a script that uses a dataset that has already been parsed, load it with Dataset.load:
from undertale.datasets.base import Dataset
dataset = Dataset.load(path)
...
Here, path is the path to the saved dataset directory.
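A slightly fuller script might take the dataset directory from the command line. This is a minimal sketch under stated assumptions: parse_arguments and main are hypothetical names, and the only undertale call used is the Dataset.load shown above. Nothing else about the Dataset object is assumed, so the script just prints it.

```python
import argparse


def parse_arguments(argv=None):
    """Parse the command line; `path` is the saved dataset directory."""
    parser = argparse.ArgumentParser(
        description="Load a parsed dataset and print it."
    )
    parser.add_argument("path", help="path to the saved dataset directory")
    return parser.parse_args(argv)


def main(argv=None):
    arguments = parse_arguments(argv)

    # Imported here so that argument parsing (and --help) works even
    # when undertale is not installed.
    from undertale.datasets.base import Dataset

    dataset = Dataset.load(arguments.path)

    # Printing relies only on the object's default representation; no
    # other Dataset methods are assumed.
    print(dataset)
```

Calling main(["humanevalx/"]) loads the dataset built earlier. In a standalone script, call main() under the usual if __name__ == "__main__": guard.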
vLLM Server Integration
Some datasets use a vLLM server to generate summaries of code. Assuming you
already have a vLLM server running (see the vLLM documentation for details on
how to set one up), set the VLLM_SERVER_ADDRESS environment variable before
building a dataset pipeline that includes a vLLM step:
export VLLM_SERVER_ADDRESS=http://my.vllm.server:8000/v1
If your vLLM server requires an API key, also set the VLLM_API_KEY environment
variable.
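A quick pre-flight check can catch a missing or malformed address before a build starts. The sketch below is not part of undertale: read_vllm_settings is a hypothetical helper that uses only the standard library and assumes nothing beyond the two environment variable names given above.

```python
import os
from urllib.parse import urlparse


def read_vllm_settings(environ=None):
    """Return (address, api_key) read from the environment.

    Hypothetical helper: raises if VLLM_SERVER_ADDRESS is missing or is
    not an HTTP(S) URL. The API key is optional and may be None.
    """
    environ = os.environ if environ is None else environ

    address = environ.get("VLLM_SERVER_ADDRESS")
    if not address:
        raise RuntimeError("VLLM_SERVER_ADDRESS is not set")

    # Require a scheme and a host, e.g. http://my.vllm.server:8000/v1
    parsed = urlparse(address)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(
            f"VLLM_SERVER_ADDRESS is not an HTTP(S) URL: {address!r}"
        )

    # Treat an unset or empty key as "no key required".
    api_key = environ.get("VLLM_API_KEY") or None
    return address, api_key
```

Calling read_vllm_settings() with no arguments reads the current process environment; passing a dictionary instead makes the check easy to exercise in isolation.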