Text File I/O
Utilities for working with text files in Apache Beam.
- class resiliparse.beam.textio.ReadAllFromText(shuffle_splits: bool = True, shuffle_names: bool = False, desired_split_size: int = 67108864, min_split_size: int = 1048576, compression_type: CompressionTypes = 'auto', coder: Coder = StrUtf8Coder)
Bases:
PTransform
Read lines from a given
PCollection
ofFileMetadata
objects in parallel.See
ReadFormText
for more information.- Parameters:
shuffle_splits – shuffle input splits to break fusion
shuffle_names – shuffle matched file names before deriving splits (adds one more fusion break, should be enabled if
file_pattern
matches many files or if input files are compressed and cannot be split)desired_split_size – desired file split size in bytes
min_split_size – minimum file split size in bytes
compression_type – file compression type, will determine whether individual files are splittable
coder – coder for decoding file contents
- class resiliparse.beam.textio.ReadFromText(file_pattern: str, empty_match_treatment: ~apache_beam.io.fileio.EmptyMatchTreatment = 'ALLOW_IF_WILDCARD', shuffle_splits: bool = True, shuffle_names: bool = False, desired_split_size: int = 67108864, min_split_size: int = 1048576, compression_type: ~apache_beam.io.filesystem.CompressionTypes = 'auto', coder: <module 'apache_beam.coders' from '/home/docs/checkouts/readthedocs.org/user_builds/chatnoir-resiliparse/envs/stable/lib/python3.8/site-packages/apache_beam/coders/__init__.py'> = StrUtf8Coder)
Bases:
PTransform
Read lines from text files in parallel.
Can be used to parallelize the processing of large individual text files with newline-delimited records. Unlike
apache_beam.io.textio.ReadFromText
,ReadFromText
can prevent fusion of the file splits by opportunistically generating splits and shuffling them before actually reading the file contents. This prevents input bottlenecks on certain runners that do not automatically distribute splits for parallel processing such as the FlinkRunner.The only supported newline separator at the moment is
\n
.- Parameters:
file_pattern – input file glob
empty_match_treatment – what to do with empty glob matches
shuffle_splits – shuffle input splits to break fusion
shuffle_names – shuffle matched file names before deriving splits (adds one more fusion break, should be enabled if
file_pattern
matches many files or if input files are compressed and cannot be split)desired_split_size – desired file split size in bytes
min_split_size – minimum file split size in bytes
compression_type – file compression type, will determine whether individual files are splittable
coder – coder for decoding file contents
- class resiliparse.beam.coders.StrUtf8Coder
Bases:
StrUtf8Coder
More resilient version of
apache_beam.coders.StrUtf8Coder
, which can handle encoding errors.Uses the
bytes_to_str
encoding helpers for encoding and decoding text.- decode(value)
Decodes the given byte string into the corresponding object.