schema
Dataset schema definition and enforcement.
Classes
Function | A dataset of individual functional pieces of code. |
PairwiseContrasting | A pairwise contrastive learning dataset. |
Schema | A base schema class for a particular type of dataset. |
Function with a functional summary. |
A triplet loss contrastive learning dataset. |
WholeBinary | A dataset of entire binaries. |
Exceptions
InvalidSchemaError | Raised when a dataset does not match the schema. |
- exception undertale.datasets.schema.InvalidSchemaError
Bases: Exception
Raised when a dataset does not match the schema.
- class undertale.datasets.schema.Schema
Bases: object
A base schema class for a particular type of dataset.
This defines the included fields on a given dataset.
- classmethod get_members()
Get members of this class.
- Returns:
A list of (name, value) tuples, one for each non-class, non-function member of this class.
- classmethod to_features()
Produce a set of required features.
- Returns:
A Features object in the form used by the Datasets library.
- classmethod validate(dataset)
Validate that a dataset matches this schema.
- Raises:
InvalidSchemaError – if required fields are missing or unknown fields are included.
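The check described for validate() can be illustrated with a small, self-contained sketch. This is not the library's implementation: check_columns, REQUIRED, and OPTIONAL are hypothetical names, the exception class is a local stand-in, and the sketch assumes that optional fields are allowed and that "unknown" means neither required nor optional. The field names are those of the Function schema documented on this page.

```python
class InvalidSchemaError(Exception):
    """Local stand-in for undertale.datasets.schema.InvalidSchemaError."""


def check_columns(columns, required, optional=()):
    """Raise if required columns are missing or unknown columns are present.

    Hypothetical helper: it only compares column names, not value types.
    """
    present = set(columns)
    missing = set(required) - present
    unknown = present - set(required) - set(optional)
    if missing:
        raise InvalidSchemaError(f"missing required fields: {sorted(missing)}")
    if unknown:
        raise InvalidSchemaError(f"unknown fields: {sorted(unknown)}")


# Field names from the Function schema (required and Optional members).
REQUIRED = ["code", "disassembly"]
OPTIONAL = ["id", "source", "architecture", "compiler", "function_name"]

# All required fields present, one optional field included: no exception.
check_columns(["code", "disassembly", "compiler"], REQUIRED, OPTIONAL)

# A required field is missing, so the missing-field check fires first.
try:
    check_columns(["code", "comment"], REQUIRED, OPTIONAL)
except InvalidSchemaError as error:
    message = str(error)
```

In this sketch missing fields are reported before unknown ones; the order in which the real validate() reports problems is not stated in this reference.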
- class undertale.datasets.schema.WholeBinary
Bases: Schema
A dataset of entire binaries.
This dataset requires further processing before it is useful for training, but this schema might still be useful.
- binary = Value('binary')
The entire binary as bytes.
- class Optional
Bases: Schema
- source = Value('string')
The source code used to build the entire binary.
This should be a serialized JSON object mapping file paths relative to the root of a project directory to their content. Ideally it should be possible to reconstruct an entire source tree from this field.
Example:
    {
        'hello.cpp': 'void main() { printf("hello world\n"); }',
        'assets/data.csv': 'name,description,comment\nfoo,bar,baz',
    }
- architecture = Value('string')
The architecture this binary was compiled on.
- compiler = Value('string')
The compiler used to build this binary.
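One way to produce the serialized JSON mapping described for the source field is sketched below, using only the Python standard library. The helper names serialize_source_tree and restore_source_tree are hypothetical and not part of undertale; the sketch also assumes every file in the tree is UTF-8 text.

```python
import json
import tempfile
from pathlib import Path


def serialize_source_tree(root):
    """Map each file path (relative to root, POSIX style) to its content."""
    root = Path(root)
    tree = {
        path.relative_to(root).as_posix(): path.read_text(encoding="utf-8")
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
    return json.dumps(tree)


def restore_source_tree(serialized, root):
    """Recreate the source tree under root from a serialized mapping."""
    root = Path(root)
    for relative, content in json.loads(serialized).items():
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")


# Build a tiny project directory mirroring the example above, then serialize it.
with tempfile.TemporaryDirectory() as project:
    Path(project, "assets").mkdir()
    Path(project, "hello.cpp").write_text(
        'void main() { printf("hello world\\n"); }', encoding="utf-8"
    )
    Path(project, "assets", "data.csv").write_text(
        "name,description,comment\nfoo,bar,baz", encoding="utf-8"
    )
    source = serialize_source_tree(project)
```

Storing paths relative to the project root, with forward slashes, keeps the mapping portable, and restore_source_tree shows that the tree can be rebuilt from the field alone.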
- class undertale.datasets.schema.Function
Bases: Schema
A dataset of individual functional pieces of code.
Note
The name Function is a bit of a misnomer: a sample can be any subgraph of the Inter-Procedural Control Flow Graph (IPCFG). It will often be a single function, but it doesn't have to be.
- code = Value('binary')
The executable bytes of this function.
- disassembly = Value('string')
Instructions disassembled from the code field.
Instructions should be newline-separated, lowercase, in Intel syntax, and in address order.
This disassembly should exclude unreachable instructions if possible.
- class Optional
Bases: Schema
- id = Value('string')
An identifier relevant to this dataset.
- source = Value('string')
The source code for this function, as a single string.
- architecture = Value('string')
The architecture this binary was compiled on.
- compiler = Value('string')
The compiler used to build this function.
- function_name = Value('string')
The name of this function.
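The formatting rules for the disassembly field (newline-separated, lowercase, address order) can be applied with a short normalization step such as the sketch below. format_disassembly is a hypothetical helper, not part of undertale. It assumes the disassembler has already produced Intel-syntax text as (address, text) pairs and that unreachable instructions were filtered out beforehand. Collapsing repeated whitespace is an extra assumption not required by the schema.

```python
def format_disassembly(instructions):
    """Join (address, text) pairs into a disassembly field value.

    Sorts by address, lowercases each instruction, collapses runs of
    whitespace, and separates instructions with newlines.
    """
    ordered = sorted(instructions, key=lambda pair: pair[0])
    return "\n".join(" ".join(text.lower().split()) for _, text in ordered)


# Instructions deliberately out of address order and in mixed case.
listing = [
    (0x1004, "MOV  RBP, RSP"),
    (0x1000, "PUSH RBP"),
    (0x1007, "RET"),
]

disassembly = format_disassembly(listing)
```

Here disassembly holds three lines: push rbp, mov rbp, rsp, and ret, in that order.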
- class undertale.datasets.schema.PairwiseContrasting
Bases: Schema
A pairwise contrastive learning dataset.
Contains pairs of samples mapped to ground-truth similarity values for pairwise contrastive loss.