tokenizer

Tokenizer implementation and utilities.

Functions

load(path)

Load a trained tokenizer from a file.

merge_preprocessed_tokens(inputs, output)

Merge preprocessed token files into a single file for training.

preprocess_tokens(input, output)

Preprocess a dataset of disassembly.

save(tokenizer, path)

Save a trained tokenizer to a file.

tokenize(input, output, tokenizer)

Tokenize a given dataset with a trained tokenizer.

train_tokenizer(input, output[, ...])

Train a tokenizer on a given dataset.

undertale.models.tokenizer.preprocess_tokens(input: str, output: str) → str

Preprocess a dataset of disassembly.

Parameters:
  • input – Path to disassembly dataset.

  • output – Path where the preprocessed tokens should be written.

Returns:

The path to the preprocessed tokens file.
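A plausible preprocessing step is splitting each disassembly line into mnemonic, register, and immediate tokens so that immediates can later be handled by the BPE model. The sketch below is a stdlib-only illustration; the `<imm:...>` marker and the splitting heuristics are assumptions, not undertale's actual implementation.

```python
import re

def preprocess_line(line: str) -> list[str]:
    """Split one disassembly line into mnemonic, register, and immediate tokens.

    Illustrative only -- undertale's actual preprocessing may differ.
    """
    mnemonic, _, operands = line.strip().partition(" ")
    tokens = [mnemonic]
    for operand in re.split(r"[,\s]+", operands.strip()):
        if not operand:
            continue
        if re.fullmatch(r"(0x)?[0-9a-fA-F]+", operand):
            # Immediate values get a marker so they can be BPE-encoded later.
            tokens.append(f"<imm:{operand}>")
        else:
            # Registers and other symbols are kept verbatim.
            tokens.append(operand)
    return tokens

print(preprocess_line("mov rax, 0x10"))  # → ['mov', 'rax', '<imm:0x10>']
```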

undertale.models.tokenizer.merge_preprocessed_tokens(inputs: List[str], output: str) → str

Merge preprocessed token files into a single file for training.

Parameters:
  • inputs – Paths to preprocessed token files.

  • output – Merged output path.

Returns:

The path to the merged preprocessed token file.
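If the preprocessed token files are newline-delimited, merging amounts to concatenation. The following stdlib sketch shows the shape of such a helper; the file format is an assumption, not undertale's actual on-disk representation.

```python
from pathlib import Path

def merge_token_files(inputs: list[str], output: str) -> str:
    """Concatenate newline-delimited token files into one file (illustrative sketch)."""
    with open(output, "w") as out:
        for path in inputs:
            # Each input file is assumed to end with a trailing newline.
            out.write(Path(path).read_text())
    return output
```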

undertale.models.tokenizer.train_tokenizer(input: str, output: str, sequence_length: int = 512, vocabulary_size: int = 4096, silent: bool = True) → str

Train a tokenizer on a given dataset.

This tokenizer computes a fixed dictionary of tokens for all instruction mnemonics and registers present in the given dataset, then trains a byte pair encoding (BPE) model to represent immediate values, constraining the size of the vocabulary.

Parameters:
  • input – The path to the preprocessed token file on which to train.

  • output – The path where the trained tokenizer file should be saved.

  • sequence_length – The sequence length for padding and truncation.

  • vocabulary_size – The vocabulary size for the immediate BPE model. This is a hyperparameter that could be tuned to optimize the token representation.

  • silent – If True, suppress progress bar display.

Returns:

The path to the trained tokenizer file.
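The BPE component described above can be illustrated with a minimal merge-learning loop over immediate values treated as hex digit sequences. This is a stdlib-only sketch of the BPE idea, not undertale's training code (which presumably delegates to a tokenizer library).

```python
from collections import Counter

def learn_bpe_merges(immediates: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules over hex digit sequences (minimal sketch)."""
    corpus = [list(imm) for imm in immediates]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for word in corpus:
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Merge the most frequent pair into a single symbol everywhere.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        for word in corpus:
            i = 0
            while i < len(word) - 1:
                if (word[i], word[i + 1]) == best:
                    word[i:i + 2] = [merged]
                else:
                    i += 1
    return merges
```

A larger `num_merges` (analogous to `vocabulary_size` above) yields longer, more specific immediate fragments at the cost of a bigger vocabulary.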

undertale.models.tokenizer.tokenize(input: str, output: str, tokenizer: str) → str

Tokenize a given dataset with a trained tokenizer.

Parameters:
  • input – Path to disassembly dataset.

  • output – Path where the tokenized dataset should be written.

  • tokenizer – Path to the trained tokenizer that should be used.

Returns:

The path to the tokenized dataset.
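At tokenization time, learned BPE merge rules would be applied to each immediate in order. The sketch below shows that application step in isolation; it is a hypothetical helper, not a function exported by this module.

```python
def apply_merges(immediate: str, merges: list[tuple[str, str]]) -> list[str]:
    """Encode an immediate value by replaying learned BPE merges in order (sketch)."""
    tokens = list(immediate)
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                # Collapse the matched pair into one token.
                tokens[i:i + 2] = [a + b]
            else:
                i += 1
    return tokens

print(apply_merges("00ff", [("0", "0"), ("f", "f")]))  # → ['00', 'ff']
```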

undertale.models.tokenizer.save(tokenizer: Tokenizer, path: str) → None

Save a trained tokenizer to a file.

Parameters:
  • tokenizer – A trained tokenizer.

  • path – The path where the trained tokenizer should be saved.

undertale.models.tokenizer.load(path: str) → Tokenizer

Load a trained tokenizer from a file.

Parameters:
  • path – The path to a trained tokenizer file to load.

Returns:

A trained tokenizer loaded from path.