schema

Dataset schema definition and enforcement.

Classes

Function()

A dataset of individual functional pieces of code.

PairwiseContrasting()

A pairwise contrastive learning dataset.

Schema()

A base schema class for a particular type of dataset.

SummarizedFunction()

Function with a functional summary.

TripletContrasting()

A triplet loss contrastive learning dataset.

WholeBinary()

A dataset of entire binaries.

Exceptions

InvalidSchemaError

Raised when a dataset does not match the schema.

exception undertale.datasets.schema.InvalidSchemaError

Bases: Exception

Raised when a dataset does not match the schema.

class undertale.datasets.schema.Schema

Bases: object

A base schema class for a particular type of dataset.

This defines the included fields on a given dataset.

classmethod get_members()

Get members of this class.

Returns:

A list of (name, value) tuples, one for each non-class, non-function member of this class.
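
The filtering described above can be sketched with the standard `inspect` module. This is a minimal illustration, not the library's implementation: `SchemaSketch`, `ExampleSchema`, and the local `Value` class are hypothetical stand-ins, and skipping double-underscore names is an assumption.

```python
import inspect


class Value:
    """Hypothetical stand-in for a feature type such as Value('string')."""

    def __init__(self, dtype):
        self.dtype = dtype

    def __repr__(self):
        return f"Value({self.dtype!r})"


class SchemaSketch:
    """Illustrative base class; not the real undertale Schema."""

    @classmethod
    def get_members(cls):
        members = []
        for name, value in inspect.getmembers(cls):
            # Assumption: dunder attributes are never schema fields.
            if name.startswith("__"):
                continue
            # Drop nested classes (such as Optional) and methods.
            if inspect.isclass(value) or inspect.isroutine(value):
                continue
            members.append((name, value))
        return members


class ExampleSchema(SchemaSketch):
    code = Value("binary")
    disassembly = Value("string")

    class Optional(SchemaSketch):
        id = Value("string")


members = ExampleSchema.get_members()
names = [name for name, _ in members]
```

In this sketch the nested `Optional` class is excluded from the outer result and can be queried separately with `ExampleSchema.Optional.get_members()`.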

classmethod to_features()

Produce a set of required features.

Returns:

A Features object in Datasets library form.

classmethod validate(dataset)

Validate that a dataset matches this schema.

Raises:

InvalidSchemaError – if required fields are missing or unknown fields are included.
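
The check described above amounts to two set comparisons. The following is a rough sketch operating on plain column names: `validate_columns` is a hypothetical helper, the local `InvalidSchemaError` is a stand-in for the documented exception, and treating the members of a nested `Optional` class as the permitted extra fields is an assumption.

```python
class InvalidSchemaError(Exception):
    """Stand-in for the exception documented above."""


def validate_columns(columns, required, optional=()):
    """Raise InvalidSchemaError on missing or unknown column names."""
    columns = set(columns)
    required = set(required)
    allowed = required | set(optional)

    missing = sorted(required - columns)
    unknown = sorted(columns - allowed)

    if missing:
        raise InvalidSchemaError(f"missing required fields: {missing}")
    if unknown:
        raise InvalidSchemaError(f"unknown fields: {unknown}")


# Both required fields plus one optional field: accepted.
validate_columns(
    ["code", "disassembly", "id"],
    required=["code", "disassembly"],
    optional=["id", "source"],
)

# A required field is absent: rejected.
try:
    validate_columns(["code"], required=["code", "disassembly"])
    rejected = False
except InvalidSchemaError as error:
    rejected = True
    message = str(error)
```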

class undertale.datasets.schema.WholeBinary

Bases: Schema

A dataset of entire binaries.

This dataset requires further processing before it is useful for training, but this schema might still be useful.

binary = Value('binary')

The entire binary as bytes.

class Optional

Bases: Schema

source = Value('string')

The source code used to build the entire binary.

This should be a serialized JSON object mapping file paths relative to the root of a project directory to their content. Ideally it should be possible to reconstruct an entire source tree from this field.

Example:

{
    'hello.cpp': 'void main() { printf("hello world\n"); }',
    'assets/data.csv': 'name,description,comment\nfoo,bar,baz',
}
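
One way to produce and consume such a string, as a sketch using only the standard library. `pack_source_tree` and `unpack_source_tree` are hypothetical helper names, and the sketch assumes every file in the project is text that can be read with the default encoding.

```python
import json
import tempfile
from pathlib import Path


def pack_source_tree(root):
    """Serialize every file under root into a JSON string keyed by relative path."""
    root = Path(root)
    tree = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # POSIX-style separators keep keys stable across platforms.
            tree[path.relative_to(root).as_posix()] = path.read_text()
    return json.dumps(tree)


def unpack_source_tree(source, root):
    """Recreate the files described by a packed source string under root."""
    root = Path(root)
    for relative, content in json.loads(source).items():
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)


with tempfile.TemporaryDirectory() as original, tempfile.TemporaryDirectory() as restored:
    (Path(original) / "assets").mkdir()
    (Path(original) / "hello.cpp").write_text('void main() { printf("hello world\\n"); }')
    (Path(original) / "assets" / "data.csv").write_text("name,description,comment\nfoo,bar,baz")

    source = pack_source_tree(original)
    unpack_source_tree(source, restored)
    # Packing the restored tree should give back the same string.
    round_trip_ok = pack_source_tree(restored) == source
```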

architecture = Value('string')

The architecture this binary was compiled for.

compiler = Value('string')

The compiler used to build this binary.

class undertale.datasets.schema.Function

Bases: Schema

A dataset of individual functional pieces of code.

Note

The name Function is a bit of a misnomer: this is really any subgraph of the Inter-Procedural Control Flow Graph (IPCFG). It will often be a single function, but it does not have to be.

code = Value('binary')

The executable bytes of this function.

disassembly = Value('string')

Instructions disassembled from the code field.

Instructions should be newline-separated, lowercase, in Intel syntax, and in address order.

This disassembly should exclude unreachable instructions if possible.
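
The ordering and casing rules can be applied after disassembly, as in the sketch below. The `(address, text)` input format and the `format_disassembly` helper are hypothetical, and collapsing repeated whitespace is an extra normalization the schema does not ask for. The sketch does not convert between syntaxes or remove unreachable instructions; those steps are left to the disassembler.

```python
def format_disassembly(instructions):
    """Join (address, text) pairs into a newline-separated, lowercase listing."""
    # Sort by address so the listing appears in address order.
    ordered = sorted(instructions, key=lambda pair: pair[0])
    # Lowercase each instruction and collapse runs of whitespace.
    return "\n".join(" ".join(text.lower().split()) for _, text in ordered)


# Hypothetical disassembler output, deliberately out of order and mixed case.
raw = [
    (0x1001, "MOV  RBP, RSP"),
    (0x1000, "PUSH RBP"),
    (0x1004, "RET"),
]

disassembly = format_disassembly(raw)
```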

class Optional

Bases: Schema

id = Value('string')

An identifier relevant to this dataset.

source = Value('string')

The source code for this function, as a single string.

architecture = Value('string')

The architecture this function was compiled for.

compiler = Value('string')

The compiler used to build this function.

function_name = Value('string')

The name of this function.

class undertale.datasets.schema.PairwiseContrasting

Bases: Schema

A pairwise contrastive learning dataset.

Contains pairs of samples mapped to ground-truth similarity values for pairwise contrastive loss.

first

The first sample.

alias of Function

second

The second sample.

alias of Function
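
For context, one common margin-based formulation of pairwise contrastive loss is sketched below. The documentation does not state which loss is applied to this dataset, so treat the formula, the `margin` default, and the use of plain lists as embeddings as assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_contrastive_loss(first, second, similarity, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    similarity is 1.0 for a matching pair and 0.0 for a non-matching pair.
    """
    distance = euclidean(first, second)
    # Matching pairs are penalized for being far apart.
    pull = similarity * distance ** 2
    # Non-matching pairs are penalized only when closer than the margin.
    push = (1.0 - similarity) * max(0.0, margin - distance) ** 2
    return pull + push


similar = pairwise_contrastive_loss([0.0, 0.0], [0.0, 0.0], similarity=1.0)
dissimilar_close = pairwise_contrastive_loss([0.0, 0.0], [0.6, 0.0], similarity=0.0)
dissimilar_far = pairwise_contrastive_loss([0.0, 0.0], [3.0, 4.0], similarity=0.0)
```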

class undertale.datasets.schema.TripletContrasting

Bases: Schema

A triplet loss contrastive learning dataset.

Contains an anchor, a positive sample, and a negative sample for triplet contrastive loss.

anchor

The anchor sample.

alias of Function

positive

Another sample with a high degree of similarity.

alias of Function

negative

Another sample with a low degree of similarity.

alias of Function
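
The roles of the three fields are easiest to see in the standard triplet loss, sketched here on plain lists. The distance function, the `margin` default, and the embeddings are assumptions for illustration; the documentation does not specify how the loss is computed.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least margin farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# The positive coincides with the anchor and the negative is far away: no loss.
satisfied = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])

# The roles are swapped, so the loss is positive.
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```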

class undertale.datasets.schema.SummarizedFunction

Bases: Function

Function with a functional summary.

This dataset can be used for multi-modal fine-tuning for a full summarization pipeline.