Modeling

Tokenizer Training

The first step in training any of our models is to train a tokenizer. To train a tokenizer on, for example, the HumanEval-X dataset, run the tokenizer training pipeline script:

# Train a tokenizer on the HumanEval-X dataset.
python pipelines/models/train-tokenizer.py \
    humaneval-x/ \
    tokenizer
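
The pipeline's tokenizer internals are not shown here, but subword tokenizers are commonly trained with byte-pair encoding (BPE), which repeatedly merges the most frequent adjacent symbol pair in the corpus. A minimal pure-Python sketch of the idea, using a toy corpus (this is an illustration, not the pipeline's actual implementation):

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy sketch)."""
    # Represent each word as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere the pair occurs.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

merges = train_bpe(["mov", "mov", "movzx", "xor"], num_merges=2)
print(merges)  # → [('m', 'o'), ('mo', 'v')]
```

Real tokenizers add special tokens (e.g., a mask token for pre-training) and serialize the learned vocabulary and merges, as the pipeline's tokenizer output does.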

See Parallelism for controlling parallel workers and cluster backends.

Tokenization

With your trained tokenizer you can now tokenize an entire dataset to prepare for pre-training.

# Tokenize the HumanEval-X dataset.
#
# Only retain the minimal fields necessary for pre-training.
python pipelines/models/tokenize-dataset.py \
    humaneval-x/ \
    humaneval-x-pretraining \
    --tokenizer tokenizer.json \
    --minimal
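
Conceptually, tokenizing with --minimal maps each raw record to just the token ids needed for pre-training and discards everything else. A toy sketch of that mapping (the field names and vocabulary here are illustrative assumptions, not the pipeline's actual schema):

```python
# Toy vocabulary; a real run would load the trained tokenizer.json.
vocab = {"[UNK]": 0, "xor": 1, "rax": 2, "mov": 3}

def tokenize_record(record):
    """Map a raw record to the minimal fields needed for pre-training."""
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in record["text"].split()]
    return {"input_ids": ids}  # all other fields are dropped

records = [{"text": "xor rax rax", "path": "a.s", "license": "MIT"}]
print([tokenize_record(r) for r in records])
# → [{'input_ids': [1, 2, 2]}]
```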

Consider splitting off a portion of your dataset (e.g., 10%) for validation.
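
The pipeline does not include a splitting utility in this walkthrough; a minimal sketch of a seeded 90/10 split over a list of examples (the 10% fraction matches the suggestion above, the rest is an assumption):

```python
import random

def split_dataset(examples, validation_fraction=0.1, seed=0):
    """Shuffle and split examples into training and validation sets."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # seeded for reproducibility
    n_val = max(1, int(len(examples) * validation_fraction))
    return examples[n_val:], examples[:n_val]

train, val = split_dataset(range(100))
print(len(train), len(val))  # → 90 10
```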

See Parallelism for controlling parallel workers and cluster backends.

Pre-Training (Masked Language Modeling)

With your tokenized training dataset (and optional validation split) you are now ready to begin pretraining a model.
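
Masked language modeling trains the model to reconstruct tokens that have been hidden from the input. A rough sketch of the masking step (the 15% masking rate is the common convention, not necessarily the pipeline's setting, and the token ids are invented):

```python
import random

MASK_ID = 103  # illustrative [MASK] token id

def mask_tokens(input_ids, mask_prob=0.15, seed=1):
    """Replace a random subset of tokens with [MASK]; the originals
    become the labels the model must reconstruct."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in input_ids:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)
            labels.append(tok)   # predict the original token here
        else:
            masked.append(tok)
            labels.append(-100)  # ignored position (common convention)
    return masked, labels

masked, labels = mask_tokens([5, 6, 7, 8, 9, 10])
print(masked)  # → [103, 6, 7, 8, 9, 10] (with this seed, only one mask)
```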

# Start a pretraining run locally (as an example).
#
# Results will be written to maskedlm/.
python pipelines/models/pretrain-maskedlm.py \
    --tokenizer tokenizer.json \
    humaneval-x-pretraining/ \
    maskedlm

# Include validation data (pre-split).
python pipelines/models/pretrain-maskedlm.py \
    --tokenizer tokenizer.json \
    humaneval-x-pretraining-training/ \
    --validation humaneval-x-pretraining-validation/ \
    maskedlm

# Use multiple accelerators on the same host.
python pipelines/models/pretrain-maskedlm.py \
    --devices 4 \
    --tokenizer tokenizer.json \
    humaneval-x-pretraining-training/ \
    --validation humaneval-x-pretraining-validation/ \
    maskedlm

# Distributed training on a SLURM Cluster.
#
# This SLURM script requires certain environment variables
# to be configured - see `environments/example-slurm.env`
# for more details or customize the SLURM script to your
# environment.
source environments/example-slurm.env
sbatch pipelines/models/pretrain-maskedlm.slurm

Several other parameters are configurable for other training scenarios; see the --help output for a full list.

Saved model checkpoints are available in the output directory.

See Environments for details on configuring the local environment - in particular for distributed SLURM training.

Tensorboard

The pretraining pipeline produces TensorBoard-compatible logging in the output directory. To host a TensorBoard server and monitor training progress, run:

tensorboard --logdir maskedlm/

Inference

With a trained model checkpoint, you can predict masked tokens in a piece of disassembly input.

# Predict masked tokens in a piece of disassembly.
python pipelines/models/infer-maskedlm.py \
    --tokenizer tokenizer.json \
    --checkpoint maskedlm/checkpoint.ckpt \
    "xor rax [MASK]"
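
Under the hood, masked-token prediction scores every vocabulary entry at the [MASK] position and returns the highest-scoring one. A toy sketch with hand-made scores (the real model produces logits over its full vocabulary; the vocabulary and numbers here are invented):

```python
import math

vocab = ["rax", "rbx", "rcx"]  # tiny invented vocabulary

def predict_mask(tokens, logits_at_mask):
    """Softmax the logits at the [MASK] position and return the
    most probable vocabulary entry with its probability."""
    assert "[MASK]" in tokens
    exps = [math.exp(l) for l in logits_at_mask]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs[best]

# Invented logits favoring "rax" (e.g., "xor rax, rax" zeroes a register).
token, prob = predict_mask(["xor", "rax", "[MASK]"], [4.0, 1.0, 0.5])
print(token)  # → rax
```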

Fine-Tuning (Multi-Modal Summarization)

Coming soon…

Inference

Coming soon…