FastWARC (Python)

Resiliparse FastWARC API documentation (Python bindings).

WARC

final class fastwarc.warc.WarcRecordType(*values)

Enum indicating a WARC record’s type as given by its WARC-Type header.

Multiple types can be combined with boolean operators for filtering records in ArchiveIterator.

unknown = 512

Special type: unknown record type (filter only)

any_type = 65535

Special type: any record type (filter only)

no_type = 0

Special type: no record type (filter only)

warcinfo = 2
response = 4
resource = 8
request = 16
metadata = 32
revisit = 64
conversion = 128
continuation = 256
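
The bitmask combination described above can be sketched without FastWARC installed. `RecordType` below is a hypothetical stand-in that only mirrors the documented numeric values; with the real library you would combine `WarcRecordType` members directly and pass the result as `record_types`.

```python
from enum import IntEnum


class RecordType(IntEnum):
    """Hypothetical stand-in mirroring the documented WarcRecordType values."""
    no_type = 0
    warcinfo = 2
    response = 4
    resource = 8
    request = 16
    metadata = 32
    revisit = 64
    conversion = 128
    continuation = 256
    unknown = 512
    any_type = 65535


def matches(record_type: int, mask: int) -> bool:
    """Return True if record_type shares at least one bit with mask."""
    return bool(record_type & mask)


# Select request and response records with a single mask.
wanted = RecordType.request | RecordType.response
```

Because every concrete type occupies its own bit, `any_type` matches all of them and `no_type` matches none.
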
final class fastwarc.warc.WarcHeader

Pre-defined set of standard WARC 1.1 headers. This enum can be used in place of bytes or str values in HeaderMap methods to avoid misspellings.

WARC_TYPE: ClassVar[WarcHeader]
WARC_RECORD_ID: ClassVar[WarcHeader]
WARC_DATE: ClassVar[WarcHeader]
CONTENT_LENGTH: ClassVar[WarcHeader]
CONTENT_TYPE: ClassVar[WarcHeader]
WARC_CONCURRENT_TO: ClassVar[WarcHeader]
WARC_BLOCK_DIGEST: ClassVar[WarcHeader]
WARC_PAYLOAD_DIGEST: ClassVar[WarcHeader]
WARC_IP_ADDRESS: ClassVar[WarcHeader]
WARC_REFERS_TO: ClassVar[WarcHeader]
WARC_REFERS_TO_TARGET_URI: ClassVar[WarcHeader]
WARC_REFERS_TO_DATE: ClassVar[WarcHeader]
WARC_TARGET_URI: ClassVar[WarcHeader]
WARC_TRUNCATED: ClassVar[WarcHeader]
WARC_WARCINFO_ID: ClassVar[WarcHeader]
WARC_FILENAME: ClassVar[WarcHeader]
WARC_PROFILE: ClassVar[WarcHeader]
WARC_IDENTIFIED_PAYLOAD_TYPE: ClassVar[WarcHeader]
WARC_SEGMENT_ORIGIN_ID: ClassVar[WarcHeader]
WARC_SEGMENT_NUMBER: ClassVar[WarcHeader]
WARC_SEGMENT_TOTAL_LENGTH: ClassVar[WarcHeader]
final class fastwarc.warc.ArchiveIterator(stream, record_types=any_type, parse_http=True, min_content_length=None, max_content_length=None, func_filter=None, verify_digests=False, quirks_mode=False, auto_decode='none', max_header_len=32768, stream_detect=True, buffer_size=65536, inplace=False, fsspec_args=None, *, strict_mode=True)

Bases: Iterable[WarcRecord]

WARC record stream iterator.

The iterator can be initialized from a file-like Python object, a path-like object, or a URL string. If installed, fsspec is used for opening paths and URLs unless fsspec_args=False.

Parameters:
  • stream (WarcReader | BinaryIO | _GenericReader | PathLike[str] | str) – input stream, file-like object, file name, or URL

  • record_types (WarcRecordType) – bitmask of WarcRecordType values to return

  • parse_http (bool) – parse HTTP records automatically

  • min_content_length (int | None) – skip records smaller than this length, or None to disable

  • max_content_length (int | None) – skip records larger than this length, or None to disable

  • func_filter (Callable[[WarcRecord], bool] | None) – Python callable taking a WarcRecord and returning bool

  • verify_digests (bool) – skip records with missing or invalid block digests

  • quirks_mode (bool) – enable lenient parsing for malformed records

  • auto_decode (Literal['none', 'content', 'transfer', 'all']) – automatically decode HTTP payload encodings

  • max_header_len (int) – maximum allowed header length (raises an error if exceeded)

  • stream_detect (bool) – auto-detect gzip, zstd, or lz4 compressed streams

  • buffer_size (int) – buffer size to use for reading from files (has no effect if stream is not a path-like object)

  • inplace (bool) – Reuse and mutate the same WarcRecord instance instead of creating a new instance in each iteration.

  • fsspec_args – arguments for fsspec, or False to disable it

  • strict_mode (bool) – this argument is deprecated and ignored. Use quirks_mode instead.

Return type:

Self

__iter__()

Implement iter(self).

Return type:

Iterator[WarcRecord]

__next__()

Implement next(self).

Return type:

WarcRecord
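
A typical consumer loops over the iterator and handles each record before advancing. The helper below is a minimal sketch that relies only on the `record_type` and `content_length` properties documented for `WarcRecord`. `bytes_per_record_type` and the `SimpleNamespace` demo records are hypothetical and exist only so the sketch runs without FastWARC or a WARC file.

```python
from collections import Counter
from types import SimpleNamespace


def bytes_per_record_type(records):
    """Sum the remaining content length of each record, grouped by record type."""
    totals = Counter()
    for record in records:
        # Read what you need inside the loop: a record's reader may become
        # stale once the iterator advances (see WarcRecord.freeze()).
        totals[record.record_type] += record.content_length
    return totals


# Stand-in records; with FastWARC installed, pass an ArchiveIterator instead.
demo_records = [
    SimpleNamespace(record_type="response", content_length=120),
    SimpleNamespace(record_type="request", content_length=30),
    SimpleNamespace(record_type="response", content_length=80),
]
totals = bytes_per_record_type(demo_records)
```

With the library available, the same helper should accept an `ArchiveIterator` constructed with, for example, `record_types=WarcRecordType.response | WarcRecordType.request` in place of `demo_records`.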

final class fastwarc.warc.HeaderMap(encoding='utf-8')

Bases: _HeaderMap

Dict-like type representing a WARC or HTTP header block.

Parameters:

encoding (str) – header source encoding

Return type:

Self

final class fastwarc.warc.WarcHeaderMap(encoding='utf-8')

Bases: _HeaderMap

Dict-like type representing a WARC or HTTP header block.

Parameters:

encoding (str) – header source encoding

Return type:

Self

class fastwarc.warc.WarcRecord

Bases: object

A WARC record.

WARC records are pickleable. Pickling preserves the current record state, including parsed HTTP headers if they have already been parsed.

Return type:

Self

consume(n=None)

Consume payload bytes without returning them.

Parameters:

n (int | None) – maximum number of bytes to consume, or None for all remaining bytes

Return type:

int

freeze()

Freeze the record payload.

Freezing copies the remaining payload bytes into memory so the record can outlive the iterator stream and support backward seeking.

Return type:

bool

init_headers(record_type=Ellipsis, record_urn=None, *, content_length=None)

Initialize mandatory headers in a fresh WarcRecord instance.

The content_length keyword argument is accepted for compatibility but is deprecated and ignored. The value of the Content-Length header and the content_length property are determined automatically by the length of the record payload.

Parameters:
  • record_type (WarcRecordType) – WARC-Type

  • record_urn (bytes | None) – WARC-Record-ID as URN without '<urn:' and '>'

  • content_length (int | None) – deprecated compatibility argument, ignored
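
Because `record_urn` excludes the surrounding `<urn:` and `>`, a UUID-based ID is passed as `uuid:...` bytes. The sketch below shows the relationship between that argument and the full header value using only the standard library. `make_record_urn` and `to_header_value` are hypothetical helpers, not part of FastWARC.

```python
import uuid


def make_record_urn(record_uuid: uuid.UUID) -> bytes:
    """Build the bare URN body expected by the record_urn argument."""
    return f"uuid:{record_uuid}".encode("ascii")


def to_header_value(record_urn: bytes) -> str:
    """Wrap a bare URN body the way it appears in a WARC-Record-ID header."""
    return "<urn:" + record_urn.decode("ascii") + ">"


fixed = uuid.UUID("12345678-1234-5678-1234-567812345678")
urn = make_record_urn(fixed)
header = to_header_value(urn)
```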

parse_http(auto_decode='none', max_header_len=32768, quirks_mode=False, *, strict_mode=True)

Parse HTTP headers and advance the payload reader.

It is safe to call this method multiple times, even if the record is not an HTTP record.

If a parsed header exceeds max_header_len, an error is raised.

Quirks mode allows parsing of headers terminated with only LF instead of CRLF.

Parameters:
  • auto_decode (Literal['none', 'content', 'transfer', 'all']) – automatically decode HTTP payload encodings (accepted values: 'none', 'content', 'transfer', 'all')

  • max_header_len (int) – maximum allowed header length (raises an error if exceeded)

  • quirks_mode (bool) – enable parsing of LF-only headers.

  • strict_mode (bool) – this argument is deprecated and ignored.

parse_warc_headers(quirks_mode=False, max_header_len=32768)

Parse the WARC header block from the attached stream.

Parameters:
  • quirks_mode (bool) – enable lenient parsing

  • max_header_len (int) – maximum allowed header length (raises an error if exceeded)

Returns:

number of bytes read

Return type:

int

set_bytes_content(content)

Set the WARC payload as bytes.

Parameters:

content (bytes) – payload as bytes

set_bytes_payload(content)

Set the WARC payload as bytes.

Parameters:

content (bytes) – payload as bytes

verify_block_digest(consume=False)

Verify whether WARC-Block-Digest matches the current record block.

Returns False for missing or invalid digest metadata and raises OSError only for stream I/O failures.

Parameters:

consume (bool) – consume the remaining record payload instead of preserving it

Return type:

bool

verify_payload_digest(consume=False)

Verify whether WARC-Payload-Digest matches the current HTTP payload.

HTTP headers must have been parsed first with parse_http(). Returns False for missing or invalid digest metadata and raises OSError only for stream I/O failures.

Parameters:

consume (bool) – consume the remaining payload instead of preserving it

Return type:

bool
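
To make the digest checks concrete, the sketch below computes and compares a digest in the `sha1:<Base32>` form that is widely used in WARC files. It is a standard-library illustration under that assumption only. Which algorithms and encodings FastWARC itself accepts is not stated here, and `sha1_base32_digest` and `digest_matches` are hypothetical names.

```python
import base64
import hashlib
from typing import Optional


def sha1_base32_digest(data: bytes) -> str:
    """Return a digest string in the common 'sha1:<Base32>' form."""
    encoded = base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")
    return "sha1:" + encoded


def digest_matches(header_value: Optional[str], data: bytes) -> bool:
    """Return False for a missing, malformed, or mismatching digest."""
    if not header_value or ":" not in header_value:
        return False
    algorithm, _, expected = header_value.partition(":")
    if algorithm.lower() != "sha1":
        return False  # this sketch only handles SHA-1
    return sha1_base32_digest(data) == "sha1:" + expected.upper()
```

Like the documented methods, the sketch reports missing or unusable digest metadata as `False` rather than raising.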

write(stream, checksum_data=False, payload_digest=None, chunk_size=16384)

Write this record to a stream.

Parameters:
  • stream (WarcWriter | BinaryIO | _GenericWriter) – output stream

  • checksum_data (bool) – calculate and add block and payload digests

  • payload_digest (bytes | None) – optional SHA-1 payload digest bytes

  • chunk_size (int) – write block size

Returns:

number of bytes written

Return type:

int

property content_length: int

Remaining WARC record length in bytes.

This is not necessarily the same as the WARC Content-Length header.

Type:

int

property headers: HeaderMap

WARC record headers.

Mutating the returned HeaderMap updates the record directly.

Type:

HeaderMap

property http_charset: str | None

HTTP charset/encoding returned by the server.

The returned value is guaranteed to be a valid Python encoding name.

Type:

str or None

property http_content_type: str | None

Plain HTTP Content-Type without fields such as charset=.

Type:

str or None
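
The split between `http_content_type` and `http_charset` can be illustrated with the standard library. This is only a sketch of the idea, assuming a header value such as `text/html; charset=UTF-8`. FastWARC's own parsing and normalization rules may differ, and `split_content_type` is a hypothetical helper.

```python
import codecs
from email.message import Message


def split_content_type(value: str):
    """Return (media type, Python codec name or None) for a Content-Type value."""
    message = Message()
    message["Content-Type"] = value
    media_type = message.get_content_type()
    charset = message.get_content_charset()
    try:
        # Keep the charset only if Python knows a codec by that name.
        charset = codecs.lookup(charset).name if charset else None
    except LookupError:
        charset = None
    return media_type, charset
```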

property http_date: datetime | None

Parsed HTTP Date header.

Type:

datetime.datetime or None

property http_headers: HeaderMap | None

Parsed HTTP headers, if available.

Type:

HeaderMap or None

property http_last_modified: datetime | None

Parsed HTTP Last-Modified header.

Type:

datetime.datetime or None

property is_frozen: bool

Whether this record has been frozen.

Type:

bool

property is_http: bool

Whether this record is an HTTP record.

Modifying this property also updates the WARC Content-Type header.

Type:

bool

property is_http_parsed: bool

Whether HTTP headers have been parsed.

Type:

bool

property reader: WarcRecordPayloadReader | None

Reader for the remaining WARC record payload.

Type:

WarcRecordPayloadReader or None

property record_date: datetime | None

WARC Date.

Type:

datetime.datetime or None

property record_id: str | None

Record ID.

This is the same as headers[WarcHeader.WARC_RECORD_ID] if present.

Type:

str or None

property record_type: WarcRecordType

Record type.

Type:

WarcRecordType

property stream_pos: int

WARC record start offset in the original input stream.

Type:

int

final class fastwarc.warc.WarcRecordPayloadReader

Bases: WarcReader

Reader for the remaining WARC record payload.

This object is tied to the lifetime of its parent WarcRecord. If the record belongs to an active ArchiveIterator, the reader becomes stale once iteration advances unless the record has been frozen with WarcRecord.freeze().

Return type:

Self

consume(size=Ellipsis)

Consume payload bytes without returning them.

Parameters:

size (int) – maximum number of bytes to consume, or -1 for all remaining bytes

Return type:

int

readline(max_line_len=8192)

Read a single payload line.

Parameters:

max_line_len (int) – maximum line length

Return type:

bytes

fastwarc.warc.has_block_digest(record)

Filter predicate for checking if a record has a block digest.

Parameters:

record (WarcRecord) – WARC record

Return type:

bool

fastwarc.warc.has_content_length_gte(min)

Parameterized filter predicate for checking if a record’s Content-Length is greater than or equal to min.

This predicate is equivalent to using min_content_length in ArchiveIterator.

Parameters:

min (int) – minimum Content-Length

Returns:

WARC record filter

Return type:

Callable[[WarcRecord], bool]

fastwarc.warc.has_content_length_lte(max)

Parameterized filter predicate for checking if a record’s Content-Length is less than or equal to max.

This predicate is equivalent to using max_content_length in ArchiveIterator.

Parameters:

max (int) – maximum Content-Length

Returns:

WARC record filter

Return type:

Callable[[WarcRecord], bool]

fastwarc.warc.has_payload_digest(record)

Filter predicate for checking if a record has a payload digest.

Parameters:

record (WarcRecord) – WARC record

Return type:

bool

fastwarc.warc.has_record_type(record_type_bitmask)

Parameterized filter predicate for checking if a record’s record type matches the given bitmask.

This predicate is equivalent to using record_types in ArchiveIterator.

Parameters:

record_type_bitmask (WarcRecordType | int) – WarcRecordType or bitmask of types

Returns:

WARC record filter

Return type:

Callable[[WarcRecord], bool]

fastwarc.warc.has_valid_block_digest(record)

Filter predicate for checking if a record has a valid block digest.

Parameters:

record (WarcRecord) – WARC record

Return type:

bool

fastwarc.warc.has_valid_payload_digest(record)

Filter predicate for checking if a record has a valid payload digest.

Parameters:

record (WarcRecord) – WARC record

Return type:

bool

fastwarc.warc.is_concurrent(record)

Filter predicate for checking if a record is concurrent to another record.

Parameters:

record (WarcRecord) – WARC record

Return type:

bool

fastwarc.warc.is_http(record)

Filter predicate for checking if a record is an HTTP record.

Parameters:

record (WarcRecord) – WARC record

Return type:

bool

fastwarc.warc.is_warc_10(record)

Filter predicate for checking if a record is a WARC/1.0 record.

Parameters:

record (WarcRecord) – WARC record

Return type:

bool

fastwarc.warc.is_warc_11(record)

Filter predicate for checking if a record is a WARC/1.1 record.

Parameters:

record (WarcRecord) – WARC record

Return type:

bool
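
`func_filter` takes a single callable, so several predicates have to be combined into one. A minimal sketch of that pattern follows. `all_of`, `content_length_between`, and `is_http_stand_in` are hypothetical helpers, and the `SimpleNamespace` objects merely stand in for `WarcRecord` instances so the block runs without FastWARC.

```python
from types import SimpleNamespace


def all_of(*predicates):
    """Combine record predicates into one callable suitable for func_filter."""
    def combined(record):
        return all(predicate(record) for predicate in predicates)
    return combined


def content_length_between(low, high):
    """Parameterized predicate in the style of has_content_length_gte/lte."""
    def predicate(record):
        return low <= record.content_length <= high
    return predicate


def is_http_stand_in(record):
    """Stand-in for fastwarc.warc.is_http that reads the is_http property."""
    return record.is_http


record_filter = all_of(is_http_stand_in, content_length_between(100, 1000))

small = SimpleNamespace(is_http=True, content_length=50)
medium = SimpleNamespace(is_http=True, content_length=500)
non_http = SimpleNamespace(is_http=False, content_length=500)
```

With the real library, `all_of(is_http, has_content_length_gte(100), has_content_length_lte(1000))` should be usable as `func_filter`, since each of those is documented as a callable from `WarcRecord` to `bool`.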

StreamIO

exception fastwarc.stream_io.FastWARCError(**kwargs)

Bases: OSError

exception fastwarc.stream_io.ReaderStaleError(**kwargs)

Bases: OSError

exception fastwarc.stream_io.StreamError(**kwargs)

Bases: OSError

class fastwarc.stream_io.BrotliReader(inner, buffer_size=65536, fsspec_args=None)

Bases: WarcReader

Brotli reader.

Parameters:
  • inner (WarcReader | BinaryIO | _GenericReader | PathLike[str] | str) – raw input stream, file-like object, file name, or URL

  • buffer_size (int) – input buffer size

  • fsspec_args – arguments for fsspec, or False to disable it

Return type:

Self

class fastwarc.stream_io.BrotliWriter(inner, buffer_size=8192, fsspec_args=None)

Bases: WarcWriter

Brotli writer.

Parameters:
  • inner (WarcWriter | BinaryIO | _GenericWriter | PathLike[str] | str) – raw output stream, file-like object, file name, or URL

  • buffer_size (int) – compression buffer size

  • fsspec_args – arguments for fsspec, or False to disable it

Return type:

Self

class fastwarc.stream_io.ChunkedReader(inner, buffer_size=4096, fsspec_args=None)

Bases: WarcReader

HTTP chunked-transfer reader.

Parameters:
  • inner (WarcReader | BinaryIO | _GenericReader | PathLike[str] | str) – raw input stream, file-like object, file name, or URL

  • buffer_size (int) – input buffer size

  • fsspec_args – arguments for fsspec, or False to disable it

Return type:

Self

class fastwarc.stream_io.ChunkedWriter(inner, min_chunk_size=1024, fsspec_args=None)

Bases: WarcWriter

HTTP chunked-transfer writer.

Parameters:
  • inner (WarcWriter | BinaryIO | _GenericWriter | PathLike[str] | str) – raw output stream, file-like object, file name, or URL

  • min_chunk_size (int) – minimum chunk size

  • fsspec_args – arguments for fsspec, or False to disable it

Return type:

Self
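
For orientation, this is the wire format that HTTP chunked transfer encoding uses: each chunk is a hexadecimal size line, the data, and a CRLF, and the body ends with a zero-size chunk. The standard-library sketch below is not FastWARC's implementation, ignores trailers, and uses the hypothetical names `encode_chunked` and `decode_chunked`.

```python
def encode_chunked(parts):
    """Frame each non-empty part as one HTTP/1.1 chunk and add the terminator."""
    out = bytearray()
    for part in parts:
        if part:  # an empty chunk would be read as the terminator
            out += f"{len(part):x}\r\n".encode("ascii") + part + b"\r\n"
    out += b"0\r\n\r\n"
    return bytes(out)


def decode_chunked(data):
    """Reassemble the body from chunked framing (chunk extensions are skipped)."""
    body = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        size_field = data[pos:line_end].split(b";", 1)[0]
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            return bytes(body)
        body += data[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF


wire = encode_chunked([b"Wiki", b"pedia"])
```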

class fastwarc.stream_io.GzipReader(inner, buffer_size=65536, zlib=False, fsspec_args=None)

Bases: WarcReader

Gzip reader.

Parameters:
  • inner (WarcReader | BinaryIO | _GenericReader | PathLike[str] | str) – raw input stream, file-like object, file name, or URL

  • buffer_size (int) – input buffer size

  • zlib (bool) – use zlib-wrapped deflate instead of gzip framing

  • fsspec_args – arguments for fsspec, or False to disable it

Return type:

Self

class fastwarc.stream_io.GzipWriter(inner, compression_level=9, buffer_size=8192, zlib=False, fsspec_args=None)

Bases: WarcWriter

Gzip writer.

Parameters:
  • inner (WarcWriter | BinaryIO | _GenericWriter | PathLike[str] | str) – raw output stream, file-like object, file name, or URL

  • compression_level (int) – compression level

  • buffer_size (int) – compression buffer size

  • zlib (bool) – use zlib-wrapped deflate instead of gzip framing

  • fsspec_args – arguments for fsspec, or False to disable it

Return type:

Self
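
The `zlib` flag selects between two containers around the same deflate data. The difference is visible with Python's own `zlib` module, where `wbits=31` produces gzip framing and `wbits=15` produces a zlib wrapper. This only illustrates the two formats and says nothing about FastWARC's internals; `deflate` is a hypothetical helper.

```python
import zlib


def deflate(data: bytes, use_zlib_wrapper: bool) -> bytes:
    """Compress data with either a zlib wrapper or gzip framing."""
    wbits = 15 if use_zlib_wrapper else 31
    compressor = zlib.compressobj(level=9, wbits=wbits)
    return compressor.compress(data) + compressor.flush()


payload = b"WARC/1.1\r\n" * 100
as_gzip = deflate(payload, use_zlib_wrapper=False)
as_zlib = deflate(payload, use_zlib_wrapper=True)

# Each container has to be read back with the matching setting.
from_gzip = zlib.decompress(as_gzip, wbits=31)
from_zlib = zlib.decompress(as_zlib, wbits=15)
```

The two outputs are not interchangeable, so a reader and writer pair needs the same `zlib` setting.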

class fastwarc.stream_io.Lz4Reader(inner, buffer_size=65536, fsspec_args=None)

Bases: WarcReader

LZ4 reader.

Parameters:
  • inner (WarcReader | BinaryIO | _GenericReader | PathLike[str] | str) – raw input stream, file-like object, file name, or URL

  • buffer_size (int) – input buffer size

  • fsspec_args – arguments for fsspec, or False to disable it

Return type:

Self

class fastwarc.stream_io.Lz4Writer(inner, buffer_size=8192, fsspec_args=None)

Bases: WarcWriter

LZ4 writer.

Parameters:
  • inner (WarcWriter | BinaryIO | _GenericWriter | PathLike[str] | str) – raw output stream, file-like object, file name, or URL

  • buffer_size (int) – compression buffer size

  • fsspec_args – arguments for fsspec, or False to disable it

Return type:

Self

class fastwarc.stream_io.WarcReader

Bases: ContextManager[WarcReader, bool | None]

Abstract base class for reader objects in fastwarc.stream_io.

Return type:

Self

close()

Close the stream.

Return type:

None

frame_start_position()

Return the start offset of the current compression frame or member, if supported.

Return type:

int | None

inner_seek(offset, whence=0)

Seek within the wrapped inner stream.

Parameters:
  • offset (int) – seek offset

  • whence (int) – seek mode (0 = start, 1 = current, 2 = end)

Return type:

int

inner_tell()

Return the current inner stream offset.

Return type:

int

read(size=Ellipsis)

Read bytes from the stream.

Parameters:

size (int) – maximum number of bytes to read, or -1 for all remaining bytes

Return type:

bytes

seek(offset, whence=0)

Seek within the decoded stream.

Parameters:
  • offset (int) – seek offset

  • whence (int) – seek mode (0 = start, 1 = current, 2 = end)

Return type:

int

tell()

Return the current decoded stream offset.

Return type:

int
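
The distinction between decoded offsets (`seek()`, `tell()`) and inner offsets (`inner_seek()`, `inner_tell()`) matters for compressed streams. The sketch below shows the same idea with Python's `gzip.GzipFile` over an `io.BytesIO` as an analogy: one position counts decompressed bytes, the other counts bytes of the wrapped stream. It does not use FastWARC, and nothing is assumed about how FastWARC readers buffer.

```python
import gzip
import io

payload = b"x" * 10_000
compressed = gzip.compress(payload)
raw = io.BytesIO(compressed)

with gzip.GzipFile(fileobj=raw) as decoded:
    decoded.read(100)
    decoded.seek(10, io.SEEK_CUR)   # whence=1: relative to the current position
    decoded_pos = decoded.tell()    # offset in the decompressed data
    inner_pos = raw.tell()          # offset in the compressed bytes
```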

class fastwarc.stream_io.WarcWriter

Bases: ContextManager[WarcWriter, bool | None]

Abstract base class for writer objects in fastwarc.stream_io.

Return type:

Self

close()

Close the stream.

Return type:

None

finish()

Finish the current compression member or frame, if supported.

Return type:

None
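
Writing each record as its own compression member is what makes per-record offsets useful: a reader can start decompressing at any member boundary. The sketch below demonstrates this with gzip members using only the standard library. It is an analogy for `finish()` and `frame_start_position()`, not FastWARC code, and `write_members` and `read_member_at` are hypothetical helpers.

```python
import gzip
import zlib


def write_members(records):
    """Compress each record as a separate gzip member and note its start offset."""
    out = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(out))
        out += gzip.compress(record)
    return bytes(out), offsets


def read_member_at(data, offset):
    """Decompress exactly one gzip member starting at the given offset."""
    decompressor = zlib.decompressobj(wbits=31)
    # Decompression stops at the end of the member; later members are
    # left untouched in decompressor.unused_data.
    return decompressor.decompress(data[offset:])


archive, offsets = write_members([b"first record", b"second record"])
second = read_member_at(archive, offsets[1])
```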

flush()

Flush buffered output.

Return type:

None

write(data)

Write bytes to the stream.

Parameters:

data (bytes) – bytes to write

Returns:

number of bytes written

Return type:

int

class fastwarc.stream_io.ZstdReader(inner, buffer_size=65536, fsspec_args=None, dictionary=None)

Bases: WarcReader

Zstandard reader.

Parameters:
  • inner (WarcReader | BinaryIO | _GenericReader | PathLike[str] | str) – raw input stream, file-like object, file name, or URL

  • buffer_size (int) – input buffer size

  • fsspec_args – arguments for fsspec, or False to disable it

  • dictionary (bytes | None) – optional decompression dictionary

Return type:

Self

class fastwarc.stream_io.ZstdWriter(inner, buffer_size=8192, compression_level=3, fsspec_args=None, dictionary=None, compress_dictionary_frame=False)

Bases: WarcWriter

Zstandard writer.

Parameters:
  • inner (WarcWriter | BinaryIO | _GenericWriter | PathLike[str] | str) – raw output stream, file-like object, file name, or URL

  • buffer_size (int) – compression buffer size

  • fsspec_args – arguments for fsspec, or False to disable it

  • dictionary (bytes | None) – optional compression dictionary

  • compress_dictionary_frame (bool) – include dictionary frames in compressed output

  • compression_level (int) – compression level

Return type:

Self

fastwarc.stream_io.zstd_train_dictionary_from_continuous(sample_data, sample_sizes, max_size)

Train a Zstandard dictionary from a stream of samples.

Parameters:
  • sample_data (bytes) – continuous stream of sample bytes

  • sample_sizes (list[int]) – sample boundaries

  • max_size (int) – maximum dictionary size

Returns:

dictionary as bytes

Return type:

bytes
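
A continuous sample buffer can be built from separate samples as sketched below. This assumes that `sample_sizes` holds the byte length of each consecutive sample in `sample_data`, which is how zstd's own dictionary trainer describes its input; the wording "sample boundaries" here does not confirm it. `pack_samples` and `unpack_samples` are hypothetical helpers.

```python
def pack_samples(samples):
    """Concatenate samples and record the length of each one."""
    sample_data = b"".join(samples)
    sample_sizes = [len(sample) for sample in samples]
    return sample_data, sample_sizes


def unpack_samples(sample_data, sample_sizes):
    """Split a continuous buffer back into samples using their lengths."""
    samples = []
    pos = 0
    for size in sample_sizes:
        samples.append(sample_data[pos:pos + size])
        pos += size
    return samples


originals = [b"WARC/1.1\r\n", b"WARC-Type: response\r\n", b"\r\n"]
sample_data, sample_sizes = pack_samples(originals)
```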

fastwarc.stream_io.zstd_train_dictionary_from_files(filenames, max_size)

Train a Zstandard dictionary from a set of files.

Parameters:
  • filenames (list[str]) – input file names

  • max_size (int) – maximum dictionary size

Returns:

dictionary as bytes

Return type:

bytes

fastwarc.stream_io.zstd_train_dictionary_from_samples(samples, max_size)

Train a Zstandard dictionary from a set of samples.

Parameters:
  • samples (list[bytes]) – list of byte samples

  • max_size (int) – maximum dictionary size

Returns:

dictionary as bytes

Return type:

bytes