parquet

Parquet parsing.

Functions

modify_parquet(input, output, operations[, ...])

Modify a parquet dataset by applying a sequence of operations.

Classes

Deduplicate(columns)

Remove duplicate rows by a set of columns.

Drop(columns)

Drop specific columns from the dataset.

HashColumn(column, target)

Add a column containing the hash of an existing column.

Keep(columns)

Keep only specific columns from the dataset.

ParquetOperation()

Abstract base class for parquet DataFrame transformations.

Rename(mapping)

Rename columns in the dataset.

Repartition([chunks, size])

Repartition the dataset by number of chunks or target chunk size.

class undertale.pipeline.parquet.ParquetOperation

Bases: ABC

Abstract base class for parquet DataFrame transformations.
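The concrete operation classes below all follow the same pattern: each one is a small, configured object that transforms a table. The actual abstract method of ParquetOperation is not shown in this reference, so the sketch below uses a hypothetical apply method and a plain dict-of-lists table purely to illustrate the pattern:

```python
# Hypothetical sketch of the operation pattern. The real abstract
# method name and table type used by ParquetOperation are not
# documented here; `apply` and the dict-of-lists table are stand-ins.
from abc import ABC, abstractmethod


class Operation(ABC):
    """Stand-in for ParquetOperation: transforms a column-oriented table."""

    @abstractmethod
    def apply(self, table: dict) -> dict:
        """Return a transformed copy of table (column name -> list of values)."""


class Uppercase(Operation):
    """Example operation: uppercase every value in one string column."""

    def __init__(self, column: str):
        self.column = column

    def apply(self, table: dict) -> dict:
        out = dict(table)
        out[self.column] = [value.upper() for value in table[self.column]]
        return out


table = {"name": ["ada", "alan"]}
print(Uppercase("name").apply(table)["name"])  # ['ADA', 'ALAN']
```

Because each operation is a self-contained object, a pipeline is simply a list of configured instances applied in order, which is exactly what modify_parquet accepts.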

class undertale.pipeline.parquet.HashColumn(column: str, target: str)

Bases: ParquetOperation

Add a column containing the hash of an existing column.

Parameters:
  • column – Name of the column to hash.

  • target – Name of the new hash column to create.
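A hash column like this is typically used as a stable key for joining or deduplicating text data. The sketch below shows the idea with the standard library; the actual hash algorithm used by HashColumn is not documented here, so SHA-256 is an assumption for illustration:

```python
import hashlib


def hash_column(table: dict, column: str, target: str) -> dict:
    """Add `target` holding a hash of each value in `column`.

    Illustrative only: SHA-256 is assumed here; the algorithm used by
    the real HashColumn operation is not specified in this reference.
    """
    out = dict(table)
    out[target] = [
        hashlib.sha256(str(value).encode()).hexdigest()
        for value in table[column]
    ]
    return out


table = {"text": ["hello", "world"]}
hashed = hash_column(table, "text", "text_hash")
```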

class undertale.pipeline.parquet.Deduplicate(columns: List[str])

Bases: ParquetOperation

Remove duplicate rows by a set of columns.

Parameters:

columns – Column names to deduplicate by (unique together).
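"Unique together" means a row is a duplicate only if it matches an earlier row on every listed column at once. A minimal in-memory sketch of that semantics (the real operation runs over Dask partitions, not Python lists):

```python
def deduplicate(rows: list[dict], columns: list[str]) -> list[dict]:
    """Keep the first row for each unique combination of `columns`."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[column] for column in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"a": 1, "b": "x", "c": 10},
    {"a": 1, "b": "x", "c": 99},  # duplicate on (a, b): dropped
    {"a": 1, "b": "y", "c": 10},  # unique on (a, b): kept
]
deduped = deduplicate(rows, ["a", "b"])
print(len(deduped))  # 2
```

Note that column c differs between the first two rows, but they still count as duplicates because only (a, b) participates in the key.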

class undertale.pipeline.parquet.Drop(columns: List[str])

Bases: ParquetOperation

Drop specific columns from the dataset.

Parameters:

columns – Column names to drop.

class undertale.pipeline.parquet.Keep(columns: List[str])

Bases: ParquetOperation

Keep only specific columns from the dataset.

Parameters:

columns – Column names to keep.
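Keep and Drop are complements: keeping a set of columns is equivalent to dropping all the others. A stdlib sketch of the two projections on a dict-of-lists table:

```python
def keep(table: dict, columns: list[str]) -> dict:
    """Retain only the named columns (Keep semantics)."""
    return {name: values for name, values in table.items() if name in columns}


def drop(table: dict, columns: list[str]) -> dict:
    """Remove the named columns (Drop semantics)."""
    return {name: values for name, values in table.items() if name not in columns}


table = {"id": [1, 2], "text": ["a", "b"], "tmp": [0, 0]}
print(keep(table, ["id", "text"]) == drop(table, ["tmp"]))  # True
```

Prefer Keep when the surviving columns are few and stable; prefer Drop when you want the schema to pass through mostly unchanged.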

class undertale.pipeline.parquet.Rename(mapping: Dict[str, str])

Bases: ParquetOperation

Rename columns in the dataset.

Parameters:

mapping – A mapping of old column names to new column names.

class undertale.pipeline.parquet.Repartition(chunks: int | None = None, size: int | str | None = None)

Bases: ParquetOperation

Repartition the dataset by number of chunks or target chunk size.

Parameters:
  • chunks – Number of chunk files to generate.

  • size – The maximum chunk size in bytes or string representation (e.g., “25MB”).

Raises:

ValueError – If not exactly one of chunks or size is specified.

undertale.pipeline.parquet.modify_parquet(input: str | List[str], output: str, operations: List[ParquetOperation], compression: str | None = None) → List[str]

Modify a parquet dataset by applying a sequence of operations.

This function is memory-efficient and supports larger-than-memory parquet datasets by processing them with Dask.

Note

When using Repartition with chunks, the number of chunks is guaranteed, but the number of rows per chunk may vary. If chunks exceeds the number of rows in the dataset, chunks parquet files will still be created, but some of them will be empty.

Parameters:
  • input – Path to the parquet dataset directory or a list of paths to each chunk of the dataset.

  • output – Path to the target directory.

  • operations – A list of ParquetOperation instances to apply in order.

  • compression – If provided, the name of the compression algorithm to use (e.g., ‘snappy’). By default, no compression is used. See the pyarrow documentation for the list of supported compression methods.

Returns:

A list of paths to the generated files.
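The sequence-of-operations design means modify_parquet is just a fold of each operation over the dataset. The in-memory sketch below illustrates those semantics on a dict-of-lists table (the real function streams parquet chunks through Dask; apply_operations and the plain-function operations here are illustrative stand-ins, with SHA-256 assumed for the hash):

```python
import hashlib


def apply_operations(table: dict, operations: list) -> dict:
    """Apply each operation to the table in order, as modify_parquet
    does per chunk (illustrative, in-memory stand-in)."""
    for operation in operations:
        table = operation(table)
    return table


def hash_text(table: dict) -> dict:
    """Stand-in for HashColumn('text', 'text_hash')."""
    out = dict(table)
    out["text_hash"] = [
        hashlib.sha256(value.encode()).hexdigest() for value in table["text"]
    ]
    return out


def keep_id_and_hash(table: dict) -> dict:
    """Stand-in for Keep(['id', 'text_hash'])."""
    return {name: values for name, values in table.items()
            if name in {"id", "text_hash"}}


result = apply_operations(
    {"id": [1, 2], "text": ["a", "b"]},
    [hash_text, keep_id_and_hash],
)
print(sorted(result))  # ['id', 'text_hash']
```

Order matters: reversing the two operations would fail, because Keep would discard the text column before the hash step could read it.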