tokenizer
Tokenizer implementation and training script.
Functions
- Preprocess some disassembly into pretokens.
- Process a single dataset chunk.
- Train our custom tokenizer on a given dataset.
- undertale.models.item.tokenizer.train(dataset, parallelism: int = 1, vocab_size: int = 4096)
Train our custom tokenizer on a given dataset.
This tokenizer builds a dictionary of tokens covering all instruction mnemonics and registers present in the given dataset, then trains a byte pair encoding (BPE) model to represent immediate values, which keeps the overall vocabulary size bounded (a minimal sketch follows the parameter list below).
- Parameters:
dataset – The path to the dataset on which to train.
parallelism – The number of parallel processes to use for tokenizer training.
vocab_size – The vocabulary size for the immediate BPE model. This is a hyperparameter that could be tuned to optimize the token representation.
- Returns:
A trained tokenizer.
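
The split between a fixed dictionary (mnemonics and registers) and a BPE model (immediates) can be pictured with a small, hedged sketch. Everything except the final train() call is an assumption made for illustration: the pretokenize helper, the <imm> marker, and the dataset path are not part of this module's documented API.

```python
# Illustrative sketch only: pretokenize, the <imm> marker, and the dataset
# path are assumptions for demonstration, not part of the documented module.
import re

IMMEDIATE = re.compile(r"0x[0-9a-fA-F]+|\d+")


def pretokenize(disassembly: str) -> list[str]:
    """Split disassembly into pretokens: mnemonics and registers stay whole
    (one dictionary entry each), while immediates are marked so a BPE model
    can break them into subword pieces."""
    pretokens = []
    for raw in disassembly.replace(",", " ").split():
        if IMMEDIATE.fullmatch(raw):
            pretokens.append(f"<imm>{raw}")  # immediate: handled by the BPE model
        else:
            pretokens.append(raw)  # mnemonic or register: fixed dictionary entry
    return pretokens


print(pretokenize("mov eax, 0x7fffffff"))
# ['mov', 'eax', '<imm>0x7fffffff']

# Training the real tokenizer goes through the documented entry point; the
# path below is a placeholder for an actual preprocessed dataset.
from undertale.models.item.tokenizer import train

tokenizer = train("data/disassembly", parallelism=8, vocab_size=4096)
```

Note that vocab_size governs only the immediate BPE model, so the total vocabulary is that budget plus the mnemonic and register dictionary derived from the dataset.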