WARC I/O

Utilities for working with WARC files in Apache Beam.

class resiliparse.beam.warcio.ReadAllWarcs(warc_args=None, with_filename=True, freeze=True, always_keep_meta=False)

Bases: PTransform

Read WARC records from a given PCollection of FileMetadata objects.

Parameters:
  • warc_args (Dict[str, Any]) – arguments to pass to fastwarc.warc.ArchiveIterator

  • with_filename (bool) – keep the input filename as a key (otherwise return only the record)

  • freeze (bool) – freeze returned records (required if the records are not consumed immediately)

  • always_keep_meta (bool) – always return record metadata, even for records that exceed max_content_length (from warc_args); such records are returned with their payload stripped
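
A minimal sketch of how ReadAllWarcs might be wired into a pipeline. It assumes the FileMetadata objects are the ones produced by Beam's own apache_beam.io.fileio.MatchFiles, and that each record exposes record_id and content_length as FastWARC's WarcRecord does. The names record_summary and run are hypothetical helpers, not part of the library.

```python
def record_summary(record):
    """Reduce a WARC record to a small tuple of (record ID, content length)."""
    return record.record_id, record.content_length


def run(file_pattern):
    # Imported lazily so the helper above stays usable without Beam installed.
    import apache_beam as beam
    from apache_beam.io import fileio
    from resiliparse.beam.warcio import ReadAllWarcs

    with beam.Pipeline() as pipeline:
        (
            pipeline
            # MatchFiles emits one FileMetadata object per matched file.
            | "Match" >> fileio.MatchFiles(file_pattern)
            # with_filename=False: elements are bare records, not keyed pairs.
            | "Read" >> ReadAllWarcs(with_filename=False)
            | "Summarize" >> beam.Map(record_summary)
            | "Print" >> beam.Map(print)
        )
```

Calling run() with a glob that matches real WARC files starts the pipeline on the default runner. The default freeze=True is left in place here because the records are handed to a downstream step rather than consumed inside the reader.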

class resiliparse.beam.warcio.ReadWarcs(file_pattern, warc_args=None, with_filename=True, freeze=True, always_keep_meta=False, shuffle_files=False)

Bases: PTransform

Read WARC records from files matching a glob pattern.

Parameters:
  • file_pattern (str) – input file glob pattern

  • warc_args (Dict[str, Any]) – arguments to pass to fastwarc.warc.ArchiveIterator

  • with_filename (bool) – keep the input filename as a key (otherwise return only the record)

  • freeze (bool) – freeze returned records (required if the records are not consumed immediately)

  • always_keep_meta (bool) – always return record metadata, even for records that exceed max_content_length (from warc_args); such records are returned with their payload stripped

  • shuffle_files (bool) – shuffle matched file names (useful if Beam runner does not support automatic shuffling of input source splits)
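
The parameters above could be combined roughly as follows. This is a sketch under assumptions: with with_filename=True, each element is taken to be a (filename, record) pair, and the record attributes record_id and content_length are those of FastWARC's WarcRecord. The helpers make_warc_args, describe, and run, and the 1,000,000-byte default, are illustrative only.

```python
def make_warc_args(max_content_length):
    """Build the keyword arguments forwarded to fastwarc.warc.ArchiveIterator."""
    return {"max_content_length": max_content_length}


def describe(element):
    """Turn a (filename, record) pair into a tab-separated summary line."""
    filename, record = element
    return f"{filename}\t{record.record_id}\t{record.content_length}"


def run(file_pattern, max_content_length=1_000_000):
    # Imported lazily so the helpers above stay usable without Beam installed.
    import apache_beam as beam
    from resiliparse.beam.warcio import ReadWarcs

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> ReadWarcs(
                file_pattern,
                warc_args=make_warc_args(max_content_length),
                # Keep metadata for oversized records instead of dropping them.
                always_keep_meta=True,
                # Shuffle file names in case the runner does not do so itself.
                shuffle_files=True,
            )
            | "Describe" >> beam.Map(describe)
            | "Print" >> beam.Map(print)
        )
```

Setting always_keep_meta=True alongside max_content_length means oversized records still appear in the output, but without their payload, so downstream steps should not assume a payload is present.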