tokenizer

Tokenizer implementation and training script.

Functions

load(path[, sequence_length])

pretokenize(disassembly)
    Preprocess some disassembly into pretokens.

process(path)
    Process a single dataset chunk.

train(dataset[, parallelism, vocab_size])
    Train our custom tokenizer on a given dataset.

undertale.models.item.tokenizer.train(dataset, parallelism: int = 1, vocab_size: int = 4096)

Train our custom tokenizer on a given dataset.

This tokenizer essentially computes a dictionary of tokens for all instruction mnemonics and registers present in the given dataset, then trains a byte pair encoding (BPE) model to represent immediate values, which constrains the size of the tokenized dataset.
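As a rough illustration only (not this module's implementation), the same scheme can be sketched with the Hugging Face ``tokenizers`` library, assuming a whitespace-separated pretokenized corpus; the corpus lines, the immediate-value filter, and the vocabulary size below are invented for the example::

    # Sketch of the scheme described above, NOT the project's implementation:
    # mnemonics and registers become whole, indivisible vocabulary entries,
    # while immediate values are covered by a learned BPE model.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    # Invented pretokenized corpus: one instruction per line, whitespace-separated.
    corpus = [
        "mov rax 0x7fffffffe0",
        "add rax rbx",
        "jmp 0x401000",
    ]

    # Every non-immediate field (mnemonic or register) becomes a dictionary token.
    dictionary = sorted(
        {t for line in corpus for t in line.split() if not t.startswith("0x")}
    )

    # BPE handles what remains, i.e. the immediate values.
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=4096, special_tokens=["[UNK]"] + dictionary)
    tokenizer.train_from_iterator(corpus, trainer=trainer)

    print(tokenizer.encode("mov rax 0x401000").tokens)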

Parameters:
  • dataset – The path to the dataset on which to train.

  • parallelism – The number of parallel processes to use for tokenizer training.

  • vocab_size – The vocabulary size for the immediate BPE model. This is a hyperparameter that could be tuned to optimize the token representation.

Returns:

A trained tokenizer.
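
A hedged usage sketch follows; the dataset path, tokenizer path, and parameter values are placeholders, and it assumes the trained tokenizer has been saved to a file that load() can read::

    from undertale.models.item import tokenizer

    # Train on a dataset directory (placeholder path and hyperparameters).
    trained = tokenizer.train("path/to/dataset", parallelism=8, vocab_size=4096)

    # Later, reload a saved tokenizer for use with a fixed sequence length.
    tok = tokenizer.load("path/to/tokenizer", sequence_length=512)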