tokenizer¶
Tokenizer implementation and utilities.
Functions

| load | Load a trained tokenizer from a file. |
| merge_preprocessed_tokens | Merge preprocessed token files into a single file for training. |
| preprocess_tokens | Preprocess a dataset of disassembly. |
| save | Save a trained tokenizer to a file. |
| tokenize | Tokenize a given dataset with a trained tokenizer. |
| train_tokenizer | Train a tokenizer on a given dataset. |
- undertale.models.tokenizer.preprocess_tokens(input: str, output: str) str¶
Preprocess a dataset of disassembly.
- Parameters:
input – Path to disassembly dataset.
output – Path where the preprocessed tokens should be written.
- Returns:
The path to the preprocessed tokens file.
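The on-disk formats are not specified here, but conceptually preprocessing splits each disassembled instruction into mnemonic and operand tokens. A minimal illustrative sketch of that idea (the splitting rules and token format are assumptions, not the library's actual behavior):

```python
def split_instruction(line: str) -> list[str]:
    # Illustrative only: split a disassembled instruction into a mnemonic
    # token followed by operand tokens (registers and immediates).
    mnemonic, _, operands = line.strip().partition(" ")
    tokens = [mnemonic]
    for operand in operands.split(","):
        operand = operand.strip()
        if operand:
            tokens.append(operand)
    return tokens

print(split_instruction("add eax, 0x10"))  # ['add', 'eax', '0x10']
```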
- undertale.models.tokenizer.merge_preprocessed_tokens(inputs: List[str], output: str) str¶
Merge preprocessed token files into a single file for training.
- Parameters:
inputs – Paths to preprocessed token files.
output – Merged output path.
- Returns:
The path to the merged preprocessed token file.
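Merging is conceptually a concatenation of the per-shard preprocessed files. A stdlib sketch under that assumption (the real function may order or deduplicate differently):

```python
import os
import tempfile

def merge_files(inputs: list[str], output: str) -> str:
    # Concatenate preprocessed token files line by line into one output file.
    with open(output, "w") as merged:
        for path in inputs:
            with open(path) as shard:
                for line in shard:
                    merged.write(line)
    return output

# Usage with temporary files standing in for preprocessed shards:
with tempfile.TemporaryDirectory() as d:
    paths = []
    for i, text in enumerate(["mov eax, 1\n", "ret\n"]):
        p = os.path.join(d, f"shard{i}.txt")
        with open(p, "w") as f:
            f.write(text)
        paths.append(p)
    out = merge_files(paths, os.path.join(d, "merged.txt"))
    print(open(out).read())
```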
- undertale.models.tokenizer.train_tokenizer(input: str, output: str, sequence_length: int = 512, vocabulary_size: int = 4096, silent: bool = True) str¶
Train a tokenizer on a given dataset.
This tokenizer first computes a dictionary of tokens for all instruction mnemonics and registers present in the given dataset, and then trains a byte pair encoding (BPE) model to represent immediate values, which constrains the size of the tokenized dataset.
- Parameters:
input – The path to the preprocessed token file on which to train.
output – The path where the trained tokenizer file should be saved.
sequence_length – The sequence length for padding and truncation.
vocabulary_size – The vocabulary size for the immediate BPE model. This is a hyperparameter that could be tuned to optimize the token representation.
silent – If True, suppress progress bar display.
- Returns:
The path to the trained tokenizer file.
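The BPE step described above can be sketched with a minimal stdlib implementation: repeatedly count adjacent symbol pairs and merge the most frequent one. This illustrates BPE in general, not the actual training code, and the digit-level splitting of immediates is an assumption:

```python
from collections import Counter

def bpe_train(sequences: list[list[str]], num_merges: int) -> list[tuple[str, str]]:
    # Learn BPE merges: at each step, count adjacent symbol pairs across
    # all sequences and merge the most frequent pair into a single symbol.
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the learned merge to every sequence in place.
        for i, seq in enumerate(sequences):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            sequences[i] = out
    return merges

# Immediate values split into character symbols, as a BPE model might see them:
data = [list("0x10"), list("0x1000"), list("0x14")]
print(bpe_train(data, 2))  # [('0', 'x'), ('0x', '1')]
```

In the real trainer, `vocabulary_size` bounds how many such merges (plus base symbols) the model may learn.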
- undertale.models.tokenizer.tokenize(input: str, output: str, tokenizer: str) str¶
Tokenize a given dataset with a trained tokenizer.
- Parameters:
input – Path to disassembly dataset.
output – Path where the tokenized dataset should be written.
tokenizer – Path to the trained tokenizer that should be used.
- Returns:
The path to the tokenized dataset.
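Conceptually, tokenization maps each token to a vocabulary id, then truncates or pads to the configured sequence length. A sketch with a hand-built vocabulary (the id scheme and special-token ids are assumptions):

```python
def encode(tokens: list[str], vocab: dict[str, int],
           sequence_length: int, pad_id: int = 0, unk_id: int = 1) -> list[int]:
    # Map tokens to ids, truncate to sequence_length, then pad with pad_id.
    ids = [vocab.get(t, unk_id) for t in tokens][:sequence_length]
    return ids + [pad_id] * (sequence_length - len(ids))

vocab = {"add": 2, "eax": 3, "0x10": 4}
print(encode(["add", "eax", "0x10"], vocab, sequence_length=6))  # [2, 3, 4, 0, 0, 0]
```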
- undertale.models.tokenizer.save(tokenizer: Tokenizer, path: str) None¶
Save a trained tokenizer to a file.
- Parameters:
tokenizer – A trained tokenizer.
path – The path where the trained tokenizer should be saved.
- undertale.models.tokenizer.load(path: str) Tokenizer¶
Load a trained tokenizer from a file.
- Parameters:
path – The path to a trained tokenizer file to load.
- Returns:
A trained tokenizer loaded from path.
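The serialized format is not documented here; if Tokenizer is a Hugging Face tokenizers object, save and load would wrap its built-in JSON serialization. A stdlib sketch of the same round-trip idea using a plain vocabulary dict (the file format is an assumption for illustration):

```python
import json
import os
import tempfile

def save_vocab(vocab: dict[str, int], path: str) -> None:
    # Serialize the vocabulary to a JSON file.
    with open(path, "w") as f:
        json.dump(vocab, f)

def load_vocab(path: str) -> dict[str, int]:
    # Restore the vocabulary from a JSON file.
    with open(path) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "tokenizer.json")
    save_vocab({"add": 2, "eax": 3}, path)
    print(load_vocab(path))  # {'add': 2, 'eax': 3}
```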