Modeling¶
Pretoken Processing¶
Before a tokenizer can be trained on a dataset, disassembly must be processed into pretokens that the tokenizer can consume. For example, to pretokenize the HumanEval-X dataset, run:
python -m undertale.datasets.scripts.pretokenize humanevalx/ humanevalx-pretokenized/
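Conceptually, pretokenization normalizes raw disassembly into whitespace-separated pretokens. The following is a minimal illustrative sketch only; the normalization rules shown here are assumptions, not the actual logic in undertale.datasets.scripts.pretokenize:

import re

def pretokenize(instruction: str) -> str:
    """Toy illustration: split one disassembly instruction into
    whitespace-separated pretokens (hypothetical rules)."""
    # Separate punctuation such as commas and brackets so each
    # operand piece becomes its own pretoken.
    spaced = re.sub(r"([\[\],+*])", r" \1 ", instruction)
    return " ".join(spaced.lower().split())

print(pretokenize("XOR RAX, QWORD PTR [RBX+8]"))
# -> "xor rax , qword ptr [ rbx + 8 ]"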
Tokenizer Training¶
Next, you can train a tokenizer on the pretokenized dataset:
python -m undertale.models.item.tokenizer \
humanevalx-pretokenized/ \
item.tokenizer.json
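You can sanity-check the resulting item.tokenizer.json before tokenizing the full dataset. This sketch assumes the file is saved in the Hugging Face tokenizers JSON format, which is an assumption; check the training script if loading fails:

from tokenizers import Tokenizer

# Assumes item.tokenizer.json is in Hugging Face `tokenizers`
# JSON format; adjust if undertale uses its own serialization.
tokenizer = Tokenizer.from_file("item.tokenizer.json")

encoding = tokenizer.encode("xor rax rbx")
print(encoding.tokens)  # subword tokens
print(encoding.ids)     # corresponding vocabulary ids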
Tokenization¶
With your trained tokenizer, you can now tokenize an entire dataset:
python -m undertale.datasets.scripts.tokenize \
-t item.tokenizer.json \
-w pretraining \
humanevalx-pretokenized/ \
humanevalx-tokenized/
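If the tokenized output is written as a Hugging Face dataset directory (an assumption based on the directory layout; verify against the tokenize script), you can inspect it before training:

from datasets import load_from_disk

# Assumes the tokenize script writes a Hugging Face dataset to disk;
# load_from_disk returns a Dataset (or a DatasetDict of splits).
dataset = load_from_disk("humanevalx-tokenized/")
print(dataset)  # inspect columns and row counts before training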
Pre-Training (Masked Language Modeling)¶
With a trained tokenizer and a tokenized dataset, you can now proceed with the first phase of training:
python -m undertale.models.item.pretrain-maskedlm \
-t item.tokenizer.json \
humanevalx-tokenized/ \
pretrain-maskedlm/
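Pre-training uses the standard masked language modeling objective: a fraction of input tokens is replaced with a [MASK] token and the model is trained to recover the originals. Here is a generic sketch of the masking step; the mask token id and the 15% rate are the conventional BERT-style defaults, not necessarily undertale's:

import random

MASK_ID = 4              # assumed id of the [MASK] token
MASK_PROBABILITY = 0.15  # conventional BERT-style masking rate

def mask_tokens(input_ids: list[int]) -> tuple[list[int], list[int]]:
    """Replace ~15% of tokens with [MASK]; labels are -100 (the usual
    ignored index for the loss) everywhere except masked positions."""
    masked, labels = [], []
    for token in input_ids:
        if random.random() < MASK_PROBABILITY:
            masked.append(MASK_ID)
            labels.append(token)  # predict the original token here
        else:
            masked.append(token)
            labels.append(-100)   # position excluded from the loss
    return masked, labels

masked, labels = mask_tokens([17, 42, 8, 99, 23, 5])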
Inference¶
With a pre-trained model, you can now run masked language modeling inference on a given pretokenized text:
python -m undertale.models.item.infer-maskedlm \
-t item.tokenizer.json \
-c pretrain-maskedlm/version_0/checkpoints/model.ckpt \
"xor rax [MASK]"
Fine-Tuning (Contrastive Embeddings)¶
Coming soon…
Inference¶
Coming soon…
Fine-Tuning (Multi-Modal Summarization)¶
Coming soon…
Inference¶
Coming soon…