WARC I/O

Utilities for working with WARC files in Apache Beam.

class resiliparse.beam.warcio.ReadAllWarcs(warc_args: Dict[str, Any] = None, with_filename: bool = True, freeze: bool = True, always_keep_meta: bool = False)

Bases: PTransform

Read WARC records from a given PCollection of FileMetadata objects.

Parameters:
  • warc_args – arguments to pass to fastwarc.warc.ArchiveIterator

  • with_filename – keep the input filename as a key (otherwise return only the record)

  • freeze – freeze returned records (required if returned records are not consumed immediately)

  • always_keep_meta – always return record metadata, even for records that exceed max_content_length (from warc_args); such records are returned without their payload
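
A minimal sketch of how ReadAllWarcs might be wired into a pipeline, assuming the FileMetadata input is produced by Apache Beam's own apache_beam.io.fileio.MatchAll. The helper target_uri, the function run, and any glob patterns passed to it are hypothetical names introduced for illustration. The Beam and Resiliparse imports are deferred into run so that the helper can be exercised on its own.

```python
from typing import Iterable, Optional


def target_uri(record) -> Optional[str]:
    """Return the WARC-Target-URI header of a record, or None if it is absent."""
    return record.headers.get("WARC-Target-URI")


def run(patterns: Iterable[str]) -> None:
    """Match files for each glob pattern and print the target URI of every record."""
    # Imported here so the module loads even where Beam is not installed.
    import apache_beam as beam
    from apache_beam.io import fileio
    from resiliparse.beam.warcio import ReadAllWarcs

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Patterns" >> beam.Create(list(patterns))
            | "Match" >> fileio.MatchAll()  # emits FileMetadata objects
            # with_filename=False: elements are bare records, not (filename, record) pairs
            | "Read" >> ReadAllWarcs(with_filename=False)
            | "TargetUri" >> beam.Map(target_uri)
            | "Print" >> beam.Map(print)
        )
```

Because with_filename is set to False here, the downstream step receives the record alone. With the default of True, each element would instead be keyed by the input filename.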

class resiliparse.beam.warcio.ReadWarcs(file_pattern: str, warc_args: Dict[str, Any] = None, with_filename: bool = True, freeze: bool = True, always_keep_meta: bool = False, shuffle_files: bool = False)

Bases: PTransform

Read WARC records from files matching a glob pattern.

Parameters:
  • file_pattern – input file glob pattern

  • warc_args – arguments to pass to fastwarc.warc.ArchiveIterator

  • with_filename – keep the input filename as a key (otherwise return only the record)

  • freeze – freeze returned records (required if the records are not consumed immediately)

  • always_keep_meta – always return record metadata, even for records that exceed max_content_length (from warc_args); such records are returned without their payload

  • shuffle_files – shuffle matched file names (useful if Beam runner does not support automatic shuffling of input source splits)
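
As a rough illustration of the parameters above, the following sketch reads records from a glob pattern, caps the payload size through warc_args, and keeps the metadata of oversized records. It assumes ReadWarcs can be applied directly to the pipeline object and that elements arrive as (filename, record) pairs when with_filename is True. WARC_ARGS, summarize, run, and the 1 MiB limit are hypothetical choices, not part of the library.

```python
from typing import Any, Dict, Tuple

# Forwarded to fastwarc.warc.ArchiveIterator; the 1 MiB limit is an arbitrary example value.
WARC_ARGS: Dict[str, Any] = {"max_content_length": 1024 * 1024}


def summarize(element) -> Tuple[str, str, int]:
    """Reduce a (filename, record) pair to a small, printable tuple."""
    filename, record = element
    return filename, record.record_id, record.content_length


def run(file_pattern: str) -> None:
    """Read all WARC files matching file_pattern and print one summary per record."""
    # Imported here so the module loads even where Beam is not installed.
    import apache_beam as beam
    from resiliparse.beam.warcio import ReadWarcs

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> ReadWarcs(
                file_pattern,
                warc_args=WARC_ARGS,
                always_keep_meta=True,  # keep metadata of records over the size limit
                shuffle_files=True,     # for runners that do not shuffle source splits
            )
            | "Summarize" >> beam.Map(summarize)
            | "Print" >> beam.Map(print)
        )
```

Mapping each record to a plain tuple early in the pipeline keeps later stages independent of the record object itself.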