WARC I/O
Utilities for working with WARC files in Apache Beam.
- class resiliparse.beam.warcio.ReadAllWarcs(warc_args: Dict[str, Any] = None, with_filename: bool = True, freeze: bool = True, always_keep_meta: bool = False)
Bases:
PTransform
Read WARC records from a given PCollection of FileMetadata objects.
- Parameters:
warc_args – arguments to pass to fastwarc.warc.ArchiveIterator
with_filename – keep the input filename as a key (otherwise return only the record)
freeze – freeze returned records (required if returned records are not consumed immediately)
always_keep_meta – always return record metadata, even for records that exceed max_content_length (from warc_args), but strip those records of their payload
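As a rough illustration of the parameters above, the following sketch feeds ReadAllWarcs from Beam's fileio.MatchFiles, which emits FileMetadata objects. The helper build_read_all_kwargs, the function run, and the 1 MiB limit are hypothetical names and values chosen for this example. It assumes that, with with_filename=False, each output element is a single WARC record exposing a record_id attribute. Imports are deferred into run so the helper can be inspected without Beam installed.

```python
def build_read_all_kwargs(keep_filename=False, max_content_length=1 << 20):
    """Hypothetical helper: assemble keyword arguments for ReadAllWarcs."""
    return {
        # Forwarded to fastwarc.warc.ArchiveIterator.
        "warc_args": {"max_content_length": max_content_length},
        # False: emit bare records instead of (filename, record) pairs.
        "with_filename": keep_filename,
        # Freeze records because they are consumed in a later pipeline step.
        "freeze": True,
    }


def run(file_pattern):
    """Sketch of a pipeline that prints the ID of every WARC record."""
    import apache_beam as beam
    from apache_beam.io import fileio
    from resiliparse.beam.warcio import ReadAllWarcs

    with beam.Pipeline() as pipeline:
        (
            pipeline
            # MatchFiles yields one FileMetadata object per matching file.
            | "MatchFiles" >> fileio.MatchFiles(file_pattern)
            | "ReadAllWarcs" >> ReadAllWarcs(**build_read_all_kwargs())
            # Assumption: each element is a record with a record_id attribute.
            | "RecordId" >> beam.Map(lambda record: record.record_id)
            | "Print" >> beam.Map(print)
        )
```

Calling run with a glob pattern for local WARC files would execute the pipeline on Beam's default runner.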
- class resiliparse.beam.warcio.ReadWarcs(file_pattern: str, warc_args: Dict[str, Any] = None, with_filename: bool = True, freeze: bool = True, always_keep_meta: bool = False, shuffle_files: bool = False)
Bases:
PTransform
Read WARC records from files matching a glob pattern.
- Parameters:
file_pattern – input file glob pattern
warc_args – arguments to pass to fastwarc.warc.ArchiveIterator
with_filename – keep the input filename as a key (otherwise return only the record)
freeze – freeze returned records (required if the records are not consumed immediately)
always_keep_meta – always return record metadata, even for records that exceed max_content_length (from warc_args), but strip those records of their payload
shuffle_files – shuffle matched file names (useful if the Beam runner does not support automatic shuffling of input source splits)
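A minimal sketch of how ReadWarcs might be used as the root of a pipeline is shown below. The helper build_read_kwargs, the function run, and the 1 MiB limit are hypothetical and exist only for this example. It assumes that ReadWarcs can be applied directly to the pipeline object and that, with with_filename=True, each output element is a (filename, record) pair whose record exposes a record_id attribute. Imports are deferred into run so the helper can be inspected without Beam installed.

```python
def build_read_kwargs(max_content_length=1 << 20, shuffle_files=False):
    """Hypothetical helper: assemble keyword arguments for ReadWarcs."""
    return {
        # Forwarded to fastwarc.warc.ArchiveIterator.
        "warc_args": {"max_content_length": max_content_length},
        # Emit (filename, record) pairs.
        "with_filename": True,
        # Freeze records because they are consumed in a later pipeline step.
        "freeze": True,
        # Keep metadata of oversized records, without their payload.
        "always_keep_meta": True,
        "shuffle_files": shuffle_files,
    }


def run(file_pattern):
    """Sketch of a pipeline that prints (filename, record ID) pairs."""
    import apache_beam as beam
    from resiliparse.beam.warcio import ReadWarcs

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "ReadWarcs" >> ReadWarcs(file_pattern, **build_read_kwargs())
            # Assumption: elements are (filename, record) tuples.
            | "RecordId" >> beam.Map(lambda kv: (kv[0], kv[1].record_id))
            | "Print" >> beam.Map(print)
        )
```

Setting shuffle_files=True in the helper may help on runners that do not redistribute input splits on their own.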