Text File I/O

Utilities for working with text files in Apache Beam.

class resiliparse.beam.textio.ReadAllFromText(shuffle_splits: bool = True, shuffle_names: bool = False, desired_split_size: int = 67108864, min_split_size: int = 1048576, compression_type: apache_beam.io.filesystem.CompressionTypes = 'auto', coder: apache_beam.coders.coders.Coder = StrUtf8Coder)

Bases: apache_beam.transforms.ptransform.PTransform

Read lines from a given PCollection of FileMetadata objects in parallel.

See ReadFormText for more information.

Parameters
  • shuffle_splits – shuffle input splits to break fusion

  • shuffle_names – shuffle matched file names before deriving splits (adds one more fusion break, should be enabled if file_pattern matches many files or if input files are compressed and cannot be split)

  • desired_split_size – desired file split size in bytes

  • min_split_size – minimum file split size in bytes

  • compression_type – file compression type, will determine whether individual files are splittable

  • coder – coder for decoding file contents

class resiliparse.beam.textio.ReadFromText(file_pattern: str, empty_match_treatment: apache_beam.io.fileio.EmptyMatchTreatment = 'ALLOW_IF_WILDCARD', shuffle_splits: bool = True, shuffle_names: bool = False, desired_split_size: int = 67108864, min_split_size: int = 1048576, compression_type: apache_beam.io.filesystem.CompressionTypes = 'auto', coder: apache_beam.coders.coders.Coder = StrUtf8Coder)

Bases: apache_beam.transforms.ptransform.PTransform

Read lines from text files in parallel.

Can be used to parallelize the processing of large individual text files with newline-delimited records. Unlike apache_beam.io.textio.ReadFromText, ReadFromText can prevent fusion of the file splits by opportunistically generating splits and shuffling them before actually reading the file contents. This prevents input bottlenecks on certain runners that do not automatically distribute splits for parallel processing such as the FlinkRunner.

The only supported newline separator at the moment is \n.

Parameters
  • file_pattern – input file glob

  • empty_match_treatment – what to do with empty glob matches

  • shuffle_splits – shuffle input splits to break fusion

  • shuffle_names – shuffle matched file names before deriving splits (adds one more fusion break, should be enabled if file_pattern matches many files or if input files are compressed and cannot be split)

  • desired_split_size – desired file split size in bytes

  • min_split_size – minimum file split size in bytes

  • compression_type – file compression type, will determine whether individual files are splittable

  • coder – coder for decoding file contents

class resiliparse.beam.coders.StrUtf8Coder

Bases: apache_beam.coders.coders.StrUtf8Coder

More resilient version of apache_beam.coders.StrUtf8Coder, which can handle encoding errors.

Uses the bytes_to_str encoding helpers for encoding and decoding text.

decode(value)

Decodes the given byte string into the corresponding object.