tokenizer

Tokenizer implementation and training script.

Functions

load(path[, sequence_length])

pretokenize(disassembly)
    Preprocess some disassembly into pretokens.

process(path)
    Process a single dataset chunk.

train(dataset[, parallelism, vocab_size])
    Train our custom tokenizer on a given dataset.

undertale.models.item.tokenizer.train(dataset, parallelism: int = 1, vocab_size: int = 4096)

Train our custom tokenizer on a given dataset.

This tokenizer essentially computes a dictionary of tokens for all instruction mnemonics and registers present in the given dataset, then trains a byte pair encoding (BPE) model to represent immediate values, which constrains the size of the tokenized dataset.
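As a rough illustration only (not this module's implementation), the same scheme can be sketched with the Hugging Face ``tokenizers`` library, assuming a whitespace-separated pretokenized corpus; the corpus lines, the immediate-value filter, and the vocabulary size below are invented for the example::

    # Sketch of the scheme described above, NOT the project's implementation:
    # mnemonics and registers become whole, indivisible vocabulary entries,
    # while immediate values are covered by a learned BPE model.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    # Invented pretokenized corpus: one instruction per line, whitespace-separated.
    corpus = [
        "mov rax 0x7fffffffe0",
        "add rax rbx",
        "jmp 0x401000",
    ]

    # Every non-immediate field (mnemonic or register) becomes a dictionary token.
    dictionary = sorted(
        {t for line in corpus for t in line.split() if not t.startswith("0x")}
    )

    # BPE handles what remains, i.e. the immediate values.
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=4096, special_tokens=["[UNK]"] + dictionary)
    tokenizer.train_from_iterator(corpus, trainer=trainer)

    print(tokenizer.encode("mov rax 0x401000").tokens)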

Parameters:
  • dataset – The path to the dataset on which to train.

  • parallelism – The number of parallel processes to use for tokenizer training.

  • vocab_size – The vocabulary size for the immediate BPE model. This is a hyperparameter that could be tuned to optimize the token representation.

Returns:

A trained tokenizer.
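
A hedged usage sketch follows; the dataset path, tokenizer path, and parameter values are placeholders, and it assumes the trained tokenizer has been saved to a file that load() can read::

    from undertale.models.item import tokenizer

    # Train on a dataset directory (placeholder path and hyperparameters).
    trained = tokenizer.train("path/to/dataset", parallelism=8, vocab_size=4096)

    # Later, reload a saved tokenizer for use with a fixed sequence length.
    tok = tokenizer.load("path/to/tokenizer", sequence_length=512)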