WARC I/O
Utilities for working with WARC files in Apache Beam.
- class resiliparse.beam.warcio.ReadAllWarcs(warc_args=None, with_filename=True, freeze=True, always_keep_meta=False)
Bases: PTransform
Read WARC records from a given PCollection of FileMetadata objects.
- Parameters:
warc_args (Dict[str, Any]) – arguments to pass to fastwarc.warc.ArchiveIterator
with_filename (bool) – keep the input filename as a key (otherwise return only the record)
freeze (bool) – freeze returned records (required if returned records are not consumed immediately)
always_keep_meta (bool) – always return record metadata, even for records that exceed max_content_length (from warc_args), but strip those records of their payload
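A minimal usage sketch for ReadAllWarcs, assuming the input PCollection of FileMetadata objects comes from Beam's fileio.MatchFiles and that elements arrive as (filename, record) pairs when with_filename=True. The helper names, the 1 MiB limit and the use of the record's record_id attribute are illustrative assumptions, not part of the documented API above.

```python
def read_all_warcs_kwargs(max_content_length=1 << 20):
    """Build keyword arguments for ReadAllWarcs (hypothetical helper).

    warc_args is forwarded to fastwarc.warc.ArchiveIterator; the limit
    used here is an arbitrary example value.
    """
    return {
        'warc_args': {'max_content_length': max_content_length},
        'with_filename': True,      # emit (filename, record) pairs
        'freeze': True,             # records are passed on to later steps
        'always_keep_meta': True,   # keep oversized records, minus payload
    }


def print_record_ids(file_pattern):
    """Match files, read their WARC records and print (filename, record ID)."""
    # Imports are deferred so the helper above can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.io import fileio
    from resiliparse.beam.warcio import ReadAllWarcs

    with beam.Pipeline() as pipeline:
        (pipeline
         | 'MatchFiles' >> fileio.MatchFiles(file_pattern)
         | 'ReadAllWarcs' >> ReadAllWarcs(**read_all_warcs_kwargs())
         # Assumption: the record object exposes a record_id attribute.
         | 'RecordIds' >> beam.Map(lambda kv: (kv[0], kv[1].record_id))
         | 'Print' >> beam.Map(print))
```

Calling print_record_ids with a glob of your own (for example a local directory of .warc.gz files) would run the pipeline on the default runner.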
- class resiliparse.beam.warcio.ReadWarcs(file_pattern, warc_args=None, with_filename=True, freeze=True, always_keep_meta=False, shuffle_files=False)
Bases: PTransform
Read WARC records from files matching a glob pattern.
- Parameters:
file_pattern (str) – input file glob pattern
warc_args (Dict[str, Any]) – arguments to pass to fastwarc.warc.ArchiveIterator
with_filename (bool) – keep the input filename as a key (otherwise return only the record)
freeze (bool) – freeze returned records (required if the records are not consumed immediately)
always_keep_meta (bool) – always return record metadata, even for records that exceed max_content_length (from warc_args), but strip those records of their payload
shuffle_files (bool) – shuffle matched file names (useful if the Beam runner does not support automatic shuffling of input source splits)
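As a rough sketch of how ReadWarcs might be used, the following counts records per input file. It assumes ReadWarcs can be applied directly to the pipeline root and that, with with_filename=True, each element is a (filename, record) pair; filename_of and count_records_per_file are hypothetical helper names.

```python
def filename_of(element):
    """Return the key of a (filename, record) pair."""
    filename, _record = element
    return filename


def count_records_per_file(file_pattern):
    """Read all WARCs matching file_pattern and print (filename, record count)."""
    # Imports are deferred so filename_of can be used without Beam installed.
    import apache_beam as beam
    from resiliparse.beam.warcio import ReadWarcs

    with beam.Pipeline() as pipeline:
        (pipeline
         # shuffle_files=True spreads matched files across workers on runners
         # that do not shuffle input source splits themselves.
         | 'ReadWarcs' >> ReadWarcs(file_pattern,
                                    with_filename=True,
                                    shuffle_files=True)
         # Drop the record early so only file names reach the counting step.
         | 'Filename' >> beam.Map(filename_of)
         | 'Count' >> beam.combiners.Count.PerElement()
         | 'Print' >> beam.Map(print))
```

Reducing each element to its filename before counting avoids sending record objects through the grouping step.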