FastWARC (Python)
Resiliparse FastWARC API documentation (Python bindings).
WARC
- final class fastwarc.warc.WarcRecordType(*values)
Enum indicating a WARC record’s type as given by its WARC-Type header.
Multiple types can be combined with bitwise operators (such as |) for filtering records in ArchiveIterator.
- unknown = 512
Special type: unknown record type (filter only)
- any_type = 65535
Special type: any record type (filter only)
- no_type = 0
Special type: no record type (filter only)
- warcinfo = 2
- response = 4
- resource = 8
- request = 16
- metadata = 32
- revisit = 64
- conversion = 128
- continuation = 256
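The member values above are distinct powers of two, which is what lets several types be combined into one filter mask. The following sketch copies the documented values into a standard-library `enum.IntFlag` to show the mechanics. `RecordTypeSketch` and `matches` are hypothetical names used only for illustration; they are not part of FastWARC.

```python
from enum import IntFlag


class RecordTypeSketch(IntFlag):
    """Hypothetical stand-in that copies the values documented above."""
    no_type = 0
    warcinfo = 2
    response = 4
    resource = 8
    request = 16
    metadata = 32
    revisit = 64
    conversion = 128
    continuation = 256
    unknown = 512
    any_type = 65535


def matches(record_type, mask):
    """Return True if record_type shares at least one bit with mask."""
    return (record_type & mask) != 0


# Combine two types into a single filter mask.
wanted = RecordTypeSketch.request | RecordTypeSketch.response
```

With this mask, `matches(RecordTypeSketch.response, wanted)` is true and `matches(RecordTypeSketch.metadata, wanted)` is false, while `any_type` matches every named type because all of its low 16 bits are set.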
- final class fastwarc.warc.WarcHeader
Pre-defined set of standard WARC 1.1 headers. This enum can be used in place of bytes or str values in HeaderMap methods to avoid misspellings.
- WARC_TYPE: ClassVar[WarcHeader]
- WARC_RECORD_ID: ClassVar[WarcHeader]
- WARC_DATE: ClassVar[WarcHeader]
- CONTENT_LENGTH: ClassVar[WarcHeader]
- CONTENT_TYPE: ClassVar[WarcHeader]
- WARC_CONCURRENT_TO: ClassVar[WarcHeader]
- WARC_BLOCK_DIGEST: ClassVar[WarcHeader]
- WARC_PAYLOAD_DIGEST: ClassVar[WarcHeader]
- WARC_IP_ADDRESS: ClassVar[WarcHeader]
- WARC_REFERS_TO: ClassVar[WarcHeader]
- WARC_REFERS_TO_TARGET_URI: ClassVar[WarcHeader]
- WARC_REFERS_TO_DATE: ClassVar[WarcHeader]
- WARC_TARGET_URI: ClassVar[WarcHeader]
- WARC_TRUNCATED: ClassVar[WarcHeader]
- WARC_WARCINFO_ID: ClassVar[WarcHeader]
- WARC_FILENAME: ClassVar[WarcHeader]
- WARC_PROFILE: ClassVar[WarcHeader]
- WARC_IDENTIFIED_PAYLOAD_TYPE: ClassVar[WarcHeader]
- WARC_SEGMENT_ORIGIN_ID: ClassVar[WarcHeader]
- WARC_SEGMENT_NUMBER: ClassVar[WarcHeader]
- WARC_SEGMENT_TOTAL_LENGTH: ClassVar[WarcHeader]
- final class fastwarc.warc.ArchiveIterator(stream, record_types=any_type, parse_http=True, min_content_length=None, max_content_length=None, func_filter=None, verify_digests=False, quirks_mode=False, auto_decode='none', max_header_len=32768, stream_detect=True, buffer_size=65536, inplace=False, fsspec_args=None, *, strict_mode=True)
Bases:
Iterable[WarcRecord]
WARC record stream iterator.
The iterator can be initialized from a file-like Python object, a path-like object, or a URL string. If installed, fsspec is used for opening paths and URLs unless fsspec_args=False.
- Parameters:
stream (WarcReader | BinaryIO | _GenericReader | PathLike[str] | str) – input stream, file-like object, file name, or URL
record_types (WarcRecordType) – bitmask of WarcRecordType values to return
parse_http (bool) – parse HTTP records automatically
min_content_length (int | None) – skip records smaller than this length, or None to disable
max_content_length (int | None) – skip records larger than this length, or None to disable
func_filter (Callable[[WarcRecord], bool] | None) – Python callable taking a WarcRecord and returning bool
verify_digests (bool) – skip records with missing or invalid block digests
quirks_mode (bool) – enable lenient parsing for malformed records
auto_decode (Literal['none', 'content', 'transfer', 'all']) – automatically decode HTTP payload encodings
max_header_len (int) – maximum allowed header length (throws an error if exceeded)
stream_detect (bool) – auto-detect gzip, zstd, or lz4 compressed streams
buffer_size (int) – default buffer size to use for reading from files (has no effect if stream is not a path-like object)
inplace (bool) – reuse and mutate the same WarcRecord instance instead of creating a new instance in each iteration
fsspec_args – arguments for fsspec, or False to disable it
strict_mode (bool) – this argument is deprecated and ignored. Use quirks_mode instead.
- Return type:
Self
- __iter__()
Implement iter(self).
- Return type:
Iterator[WarcRecord]
- __next__()
Implement next(self).
- Return type:
WarcRecord
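As a rough illustration of how the constructor arguments fit together, the sketch below iterates over HTTP response records and reports a few of the properties documented for WarcRecord. The function name `summarize_responses` is hypothetical, and the FastWARC import is placed inside the function so that the snippet can be loaded even where the package is not installed. It assumes that `is_http` can be passed as `func_filter`, as its description as a filter predicate suggests.

```python
def summarize_responses(path):
    """Yield (stream_pos, record_id, content_length) for HTTP response records.

    Hypothetical helper, not part of FastWARC.
    """
    # Imported lazily so this sketch can be defined without FastWARC installed.
    from fastwarc.warc import ArchiveIterator, WarcRecordType, is_http

    with open(path, "rb") as stream:
        records = ArchiveIterator(
            stream,
            record_types=WarcRecordType.response,  # bitmask of wanted types
            func_filter=is_http,                   # keep HTTP records only
        )
        for record in records:
            # Read these values before advancing; with inplace=True the same
            # WarcRecord instance would be reused on the next iteration.
            yield record.stream_pos, record.record_id, record.content_length
```

A caller would consume the generator with something like `for pos, rid, length in summarize_responses("crawl.warc.gz"): ...`, where the file name is a placeholder.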
- final class fastwarc.warc.HeaderMap(encoding='utf-8')
Bases:
_HeaderMap
Dict-like type representing a WARC or HTTP header block.
- Parameters:
encoding (str) – header source encoding
- Return type:
Self
- final class fastwarc.warc.WarcHeaderMap(encoding='utf-8')
Bases:
_HeaderMap
Dict-like type representing a WARC or HTTP header block.
- Parameters:
encoding (str) – header source encoding
- Return type:
Self
- class fastwarc.warc.WarcRecord
Bases:
object
A WARC record.
WARC records are pickleable. Pickling preserves the current record state, including parsed HTTP headers if they have already been parsed.
- Return type:
Self
- consume(n=None)
Consume payload bytes without returning them.
- Parameters:
n (int | None) – maximum number of bytes to consume, or None for all remaining bytes
- Return type:
int
- freeze()
Freeze the record payload.
Freezing copies the remaining payload bytes into memory so the record can outlive the iterator stream and support backward seeking.
- Return type:
bool
- init_headers(record_type=Ellipsis, record_urn=None, *, content_length=None)
Initialize mandatory headers in a fresh WarcRecord instance.
The content_length keyword argument is accepted for compatibility but is deprecated and ignored. The value of the Content-Length header and the content_length property are determined automatically by the length of the record payload.
- Parameters:
record_type (WarcRecordType) – WARC-Type
record_urn (bytes | None) – WARC-Record-ID as URN without '<urn:' and '>'
content_length (int | None) – deprecated compatibility argument, ignored
- parse_http(auto_decode='none', max_header_len=32768, quirks_mode=False, *, strict_mode=True)
Parse HTTP headers and advance the payload reader.
It is safe to call this method multiple times, even if the record is not an HTTP record.
If a parsed header exceeds max_header_len, an error is raised.
Quirks mode allows parsing of headers terminated with only LF instead of CRLF.
- Parameters:
auto_decode (Literal['none', 'content', 'transfer', 'all']) – automatically decode HTTP payload encodings (accepted values: 'none', 'content', 'transfer', 'all')
max_header_len (int) – maximum allowed header length (throws an error if exceeded)
quirks_mode (bool) – enable parsing of LF-only headers
strict_mode (bool) – this argument is deprecated and ignored
- parse_warc_headers(quirks_mode=False, max_header_len=32768)
Parse the WARC header block from the attached stream.
- Parameters:
quirks_mode (bool) – enable lenient parsing
max_header_len (int) – maximum allowed header length (throws an error if exceeded)
- Returns:
number of bytes read
- Return type:
int
- set_bytes_content(content)
Set the WARC payload as bytes.
- Parameters:
content (bytes) – payload as bytes
- set_bytes_payload(content)
Set the WARC payload as bytes.
- Parameters:
content (bytes) – payload as bytes
- verify_block_digest(consume=False)
Verify whether WARC-Block-Digest matches the current record block.
Returns False for missing or invalid digest metadata and raises OSError only for stream I/O failures.
- Parameters:
consume (bool) – consume the remaining record payload instead of preserving it
- Return type:
bool
- verify_payload_digest(consume=False)
Verify whether WARC-Payload-Digest matches the current HTTP payload.
HTTP headers must have been parsed first with parse_http(). Returns False for missing or invalid digest metadata and raises OSError only for stream I/O failures.
- Parameters:
consume (bool) – consume the remaining payload instead of preserving it
- Return type:
bool
- write(stream, checksum_data=False, payload_digest=None, chunk_size=16384)
Write this record to a stream.
- Parameters:
stream (WarcWriter | BinaryIO | _GenericWriter) – output stream
checksum_data (bool) – calculate and add block and payload digests
payload_digest (bytes | None) – optional SHA-1 payload digest bytes
chunk_size (int) – write block size
- Returns:
number of bytes written
- Return type:
int
- property content_length: int
Remaining WARC record length in bytes.
This is not necessarily the same as the WARC Content-Length header.
- Type:
int
- property headers: HeaderMap
WARC record headers.
Mutating the returned HeaderMap updates the record directly.
- Type:
HeaderMap
- property http_charset: str | None
HTTP charset/encoding returned by the server.
The returned value is guaranteed to be a valid Python encoding name.
- Type:
str or None
- property http_content_type: str | None
Plain HTTP Content-Type without fields such as charset=.
- Type:
str or None
- property http_date: datetime
Parsed HTTP Date header.
- Type:
datetime.datetime or None
- property http_last_modified: datetime
Parsed HTTP Last-Modified header.
- Type:
datetime.datetime or None
- property is_frozen: bool
Whether this record has been frozen.
- Type:
bool
- property is_http: bool
Whether this record is an HTTP record.
Modifying this property also updates the WARC Content-Type header.
- Type:
bool
- property is_http_parsed: bool
Whether HTTP headers have been parsed.
- Type:
bool
- property reader: WarcRecordPayloadReader
Reader for the remaining WARC record payload.
- Type:
WarcRecordPayloadReader or None
- property record_date: datetime
WARC Date.
- Type:
datetime.datetime or None
- property record_id: str | None
Record ID.
This is the same as headers[WarcHeader.WARC_RECORD_ID] if present.
- Type:
str or None
- property record_type: WarcRecordType
Record type.
- Type:
WarcRecordType
- property stream_pos: int
WARC record start offset in the original input stream.
- Type:
int
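The WarcRecord methods above can be chained to build and serialize a record in memory. The sketch below is one plausible sequence using only the documented signatures of `init_headers()`, `set_bytes_content()`, and `write()`. The helper name `build_resource_record` is hypothetical, and item assignment on `headers` is an assumption based on HeaderMap being described as dict-like. The FastWARC import is deferred to call time so the snippet loads without the package.

```python
import io


def build_resource_record(payload: bytes, target_uri: str) -> bytes:
    """Serialize a single 'resource' record to bytes (hypothetical helper)."""
    # Imported lazily so this sketch can be defined without FastWARC installed.
    from fastwarc.warc import WarcRecord, WarcRecordType

    record = WarcRecord()
    # content_length is not passed: it is documented as deprecated and ignored.
    record.init_headers(record_type=WarcRecordType.resource)
    # Assumption: HeaderMap supports dict-style item assignment.
    record.headers["WARC-Target-URI"] = target_uri
    record.set_bytes_content(payload)

    buffer = io.BytesIO()
    # checksum_data=True asks write() to calculate and add digests.
    record.write(buffer, checksum_data=True)
    return buffer.getvalue()
```

Writing to an `io.BytesIO` keeps the example free of file handling; any of the writer classes from `fastwarc.stream_io` could be passed as `stream` instead, per the documented parameter type.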
- final class fastwarc.warc.WarcRecordPayloadReader
Bases:
WarcReader
Reader for the remaining WARC record payload.
This object is tied to the lifetime of its parent WarcRecord. If the record belongs to an active ArchiveIterator, the reader becomes stale once iteration advances unless the record has been frozen with WarcRecord.freeze().
- Return type:
Self
- consume(size=Ellipsis)
Consume payload bytes without returning them.
- Parameters:
size (int) – maximum number of bytes to consume, or -1 for all remaining bytes
- Return type:
int
- readline(max_line_len=8192)
Read a single payload line.
- Parameters:
max_line_len (int) – maximum line length
- Return type:
bytes
- fastwarc.warc.has_block_digest(record)
Filter predicate for checking if a record has a block digest.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.has_content_length_gte(min)
Parameterized filter predicate for checking if a record’s Content-Length is greater than or equal to min.
This predicate is equivalent to using min_content_length in ArchiveIterator.
- Parameters:
min (int) – minimum Content-Length
- Returns:
WARC record filter
- Return type:
Callable[[WarcRecord], bool]
- fastwarc.warc.has_content_length_lte(max)
Parameterized filter predicate for checking if a record’s Content-Length is less than or equal to max.
This predicate is equivalent to using max_content_length in ArchiveIterator.
- Parameters:
max (int) – maximum Content-Length
- Returns:
WARC record filter
- Return type:
Callable[[WarcRecord], bool]
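"Parameterized" here means that the function is called once with a threshold and returns the actual predicate, which is then applied to each record. The pure-Python sketch below illustrates that closure pattern and one way to combine several predicates into a single callable. All names in it (`RecordSketch`, `length_gte`, `length_lte`, `all_of`) are hypothetical stand-ins and do not reflect how FastWARC implements its filters.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RecordSketch:
    """Hypothetical stand-in exposing only a content_length attribute."""
    content_length: int


def length_gte(minimum: int) -> Callable[[RecordSketch], bool]:
    """Return a predicate that accepts records of at least `minimum` bytes."""
    def predicate(record: RecordSketch) -> bool:
        return record.content_length >= minimum
    return predicate


def length_lte(maximum: int) -> Callable[[RecordSketch], bool]:
    """Return a predicate that accepts records of at most `maximum` bytes."""
    def predicate(record: RecordSketch) -> bool:
        return record.content_length <= maximum
    return predicate


def all_of(*predicates):
    """Combine predicates into one that requires all of them to hold."""
    def combined(record) -> bool:
        return all(p(record) for p in predicates)
    return combined


in_range = all_of(length_gte(100), length_lte(1000))
records = [RecordSketch(n) for n in (50, 100, 500, 1001)]
kept = [r.content_length for r in records if in_range(r)]
```

A combined callable of this shape is what the `func_filter` argument of ArchiveIterator expects: it takes a record and returns a bool.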
- fastwarc.warc.has_payload_digest(record)
Filter predicate for checking if a record has a payload digest.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.has_record_type(record_type_bitmask)
Parameterized filter predicate for checking if a record’s record type matches the given bitmask.
This predicate is equivalent to using record_types in ArchiveIterator.
- Parameters:
record_type_bitmask (WarcRecordType | int) – WarcRecordType or bitmask of types
- Returns:
WARC record filter
- Return type:
Callable[[WarcRecord], bool]
- fastwarc.warc.has_valid_block_digest(record)
Filter predicate for checking if a record has a valid block digest.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
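To make "valid block digest" concrete, the sketch below checks a digest header value against a block of bytes using only the standard library. It assumes the widespread convention of a `sha1:` label followed by a Base32-encoded SHA-1 value; which algorithms and encodings FastWARC itself accepts is not stated in this reference. The function names are hypothetical. Like the documented verification methods, the check returns False for missing or malformed digest values instead of raising.

```python
import base64
import hashlib
from typing import Optional


def labelled_sha1(block: bytes) -> str:
    """Return 'sha1:<Base32 digest>' for the given bytes."""
    digest = hashlib.sha1(block).digest()
    return "sha1:" + base64.b32encode(digest).decode("ascii")


def digest_matches(header_value: Optional[str], block: bytes) -> bool:
    """Return True only if header_value is a sha1 digest matching block."""
    if not header_value or ":" not in header_value:
        return False  # missing or malformed header
    algorithm, _, expected = header_value.partition(":")
    if algorithm.lower() != "sha1":
        return False  # this sketch handles SHA-1 only
    actual = labelled_sha1(block).split(":", 1)[1]
    return expected.upper() == actual
```

A SHA-1 digest is 20 bytes, so its Base32 form is exactly 32 characters with no padding.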
- fastwarc.warc.has_valid_payload_digest(record)
Filter predicate for checking if a record has a valid payload digest.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.is_concurrent(record)
Filter predicate for checking if a record is concurrent to another record.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.is_http(record)
Filter predicate for checking if a record is an HTTP record.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.is_warc_10(record)
Filter predicate for checking if a record is a WARC/1.0 record.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.is_warc_11(record)
Filter predicate for checking if a record is a WARC/1.1 record.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
StreamIO
- exception fastwarc.stream_io.FastWARCError(**kwargs)
Bases:
OSError
- exception fastwarc.stream_io.ReaderStaleError(**kwargs)
Bases:
OSError
- exception fastwarc.stream_io.StreamError(**kwargs)
Bases:
OSError
- class fastwarc.stream_io.BrotliReader(inner, buffer_size=65536, fsspec_args=None)
Bases:
WarcReader
Brotli reader.
- Parameters:
inner (WarcReader | BinaryIO | _GenericReader | PathLike[str] | str) – raw input stream, file-like object, file name, or URL
buffer_size (int) – input buffer size
fsspec_args – arguments for fsspec, or False to disable it
- Return type:
Self
- class fastwarc.stream_io.BrotliWriter(inner, buffer_size=8192, fsspec_args=None)
Bases:
WarcWriter
Brotli writer.
- Parameters:
inner (WarcWriter | BinaryIO | _GenericWriter | PathLike[str] | str) – raw output stream, file-like object, file name, or URL
buffer_size (int) – compression buffer size
fsspec_args – arguments for fsspec, or False to disable it
- Return type:
Self
- class fastwarc.stream_io.ChunkedReader(inner, buffer_size=4096, fsspec_args=None)
Bases:
WarcReader
HTTP chunked-transfer reader.
- Parameters:
inner (WarcReader | BinaryIO | _GenericReader | PathLike[str] | str) – raw input stream, file-like object, file name, or URL
buffer_size (int) – input buffer size
fsspec_args – arguments for fsspec, or False to disable it
- Return type:
Self
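For readers unfamiliar with the wire format that a chunked-transfer reader has to undo, here is a minimal standard-library decoder for an HTTP/1.1 chunked body held fully in memory. It is a sketch of the encoding itself, not of ChunkedReader's implementation: the name `decode_chunked` is hypothetical, trailers are ignored, and malformed input is not handled gracefully.

```python
def decode_chunked(data: bytes) -> bytes:
    """Decode a complete HTTP/1.1 chunked body (hypothetical helper)."""
    body = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        # The chunk size is hexadecimal; anything after ';' is an extension.
        size = int(data[pos:line_end].split(b";", 1)[0], 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk reached; trailer fields are ignored here
        body += data[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)


encoded = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
decoded = decode_chunked(encoded)
```

Each chunk is a hexadecimal length line, the chunk bytes, and a CRLF; a zero-length chunk terminates the body.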
- class fastwarc.stream_io.ChunkedWriter(inner, min_chunk_size=1024, fsspec_args=None)
Bases:
WarcWriter
HTTP chunked-transfer writer.
- Parameters:
inner (WarcWriter | BinaryIO | _GenericWriter | PathLike[str] | str) – raw output stream, file-like object, file name, or URL
min_chunk_size (int) – minimum chunk size
fsspec_args – arguments for fsspec, or False to disable it
- Return type:
Self
- class fastwarc.stream_io.GzipReader(inner, buffer_size=65536, zlib=False, fsspec_args=None)
Bases:
WarcReader
Gzip reader.
- Parameters:
inner (WarcReader | BinaryIO | _GenericReader | PathLike[str] | str) – raw input stream, file-like object, file name, or URL
buffer_size (int) – input buffer size
zlib (bool) – use zlib-wrapped deflate instead of gzip framing
fsspec_args – arguments for fsspec, or False to disable it
- Return type:
Self
- class fastwarc.stream_io.GzipWriter(inner, compression_level=9, buffer_size=8192, zlib=False, fsspec_args=None)
Bases:
WarcWriter
Gzip writer.
- Parameters:
inner (WarcWriter | BinaryIO | _GenericWriter | PathLike[str] | str) – raw output stream, file-like object, file name, or URL
compression_level (int) – compression level
buffer_size (int) – compression buffer size
zlib (bool) – use zlib-wrapped deflate instead of gzip framing
fsspec_args – arguments for fsspec, or False to disable it
- Return type:
Self
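The zlib flag on GzipReader and GzipWriter selects between two containers around the same deflate data. The standard-library sketch below shows how the two framings differ on the wire and why a reader must be told which one to expect. It uses Python's `gzip` and `zlib` modules only and makes no claim about FastWARC internals.

```python
import gzip
import zlib

payload = b"WARC/1.1\r\n" * 10

gzip_bytes = gzip.compress(payload)   # gzip framing (RFC 1952)
zlib_bytes = zlib.compress(payload)   # zlib-wrapped deflate (RFC 1950)

# The containers are recognizable by their first bytes.
gzip_magic = gzip_bytes[:2]           # b"\x1f\x8b"
zlib_first = zlib_bytes[0]            # 0x78 for the default 32 KiB window

# wbits tells the decompressor which framing to expect.
from_gzip = zlib.decompress(gzip_bytes, wbits=zlib.MAX_WBITS | 16)
from_zlib = zlib.decompress(zlib_bytes, wbits=zlib.MAX_WBITS)

# Feeding zlib-framed data to a gzip decoder fails.
try:
    gzip.decompress(zlib_bytes)
    mismatch_detected = False
except OSError:
    mismatch_detected = True
```

Both decompression calls recover the original payload, while the mismatched call raises an OSError subclass.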
- class fastwarc.stream_io.Lz4Reader(inner, buffer_size=65536, fsspec_args=None)
Bases:
WarcReader
LZ4 reader.
- Parameters:
inner (WarcReader | BinaryIO | _GenericReader | PathLike[str] | str) – raw input stream, file-like object, file name, or URL
buffer_size (int) – input buffer size
fsspec_args – arguments for fsspec, or False to disable it
- Return type:
Self
- class fastwarc.stream_io.Lz4Writer(inner, buffer_size=8192, fsspec_args=None)
Bases:
WarcWriter
LZ4 writer.
- Parameters:
inner (WarcWriter | BinaryIO | _GenericWriter | PathLike[str] | str) – raw output stream, file-like object, file name, or URL
buffer_size (int) – compression buffer size
fsspec_args – arguments for fsspec, or False to disable it
- Return type:
Self
- class fastwarc.stream_io.WarcReader
Bases:
ContextManager[WarcReader, bool | None]
Abstract base class for reader objects in fastwarc.stream_io.
- Return type:
Self
- close()
Close the stream.
- Return type:
None
- frame_start_position()
Return the start offset of the current compression frame or member, if supported.
- Return type:
int | None
- inner_seek(offset, whence=0)
Seek within the wrapped inner stream.
- Parameters:
offset (int) – seek offset
whence (int) – seek mode (0 = start, 1 = current, 2 = end)
- Return type:
int
- inner_tell()
Return the current inner stream offset.
- Return type:
int
- read(size=Ellipsis)
Read bytes from the stream.
- Parameters:
size (int) – maximum number of bytes to read, or -1 for all remaining bytes
- Return type:
bytes
- seek(offset, whence=0)
Seek within the decoded stream.
- Parameters:
offset (int) – seek offset
whence (int) – seek mode (0 = start, 1 = current, 2 = end)
- Return type:
int
- tell()
Return the current decoded stream offset.
- Return type:
int
- class fastwarc.stream_io.WarcWriter
Bases:
ContextManager[WarcWriter, bool | None]
Abstract base class for writer objects in fastwarc.stream_io.
- Return type:
Self
- close()
Close the stream.
- Return type:
None
- finish()
Finish the current compression member or frame, if supported.
- Return type:
None
- flush()
Flush buffered output.
- Return type:
None
- write(data)
Write bytes to the stream.
- Parameters:
data (bytes) – bytes to write
- Returns:
number of bytes written
- Return type:
int
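The notion of a "compression member or frame" used by finish() and by WarcReader.frame_start_position() is easiest to see with gzip. A gzip file may consist of several independently compressed members concatenated back to back, and decompression can start at any member boundary. The sketch below demonstrates this with the standard library only. Whether and when a caller needs to invoke finish() explicitly in FastWARC is not stated in this reference.

```python
import gzip

records = [b"first record\r\n", b"second record\r\n"]

# Compress each record as its own gzip member and note where it starts.
members = [gzip.compress(r) for r in records]
offsets = []
position = 0
for member in members:
    offsets.append(position)
    position += len(member)
archive = b"".join(members)

# A multi-member stream decompresses to the concatenated payloads.
whole = gzip.decompress(archive)

# Starting at a member boundary yields just the data from that point on.
second_only = gzip.decompress(archive[offsets[1]:])
```

Recording member start offsets is what makes random access into a compressed archive possible without decompressing everything before the record of interest.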
- class fastwarc.stream_io.ZstdReader(inner, buffer_size=65536, fsspec_args=None, dictionary=None)
Bases:
WarcReader
Zstandard reader.
- Parameters:
inner (WarcReader | BinaryIO | _GenericReader | PathLike[str] | str) – raw input stream, file-like object, file name, or URL
buffer_size (int) – input buffer size
fsspec_args – arguments for fsspec, or False to disable it
dictionary (bytes | None) – optional decompression dictionary
- Return type:
Self
- class fastwarc.stream_io.ZstdWriter(inner, buffer_size=8192, compression_level=3, fsspec_args=None, dictionary=None, compress_dictionary_frame=False)
Bases:
WarcWriter
Zstandard writer.
- Parameters:
inner (WarcWriter | BinaryIO | _GenericWriter | PathLike[str] | str) – raw output stream, file-like object, file name, or URL
buffer_size (int) – compression buffer size
fsspec_args – arguments for fsspec, or False to disable it
dictionary (bytes | None) – optional compression dictionary
compress_dictionary_frame – include dictionary frames in compressed output
compression_level (int)
- Return type:
Self
- fastwarc.stream_io.zstd_train_dictionary_from_continuous(sample_data, sample_sizes, max_size)
Train a Zstandard dictionary from a stream of samples.
- Parameters:
sample_data (bytes) – continuous stream of sample bytes
sample_sizes (list[int]) – sample boundaries
max_size (int) – maximum dictionary size
- Returns:
dictionary as bytes
- Return type:
bytes
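The "continuous" input form pairs one concatenated byte string with a list describing where each sample ends. The sketch below prepares such a pair from individual samples and splits it back again. It assumes that `sample_sizes` holds the length of each sample in order, which is how the underlying Zstandard training API describes its size array; this reference only says "sample boundaries". The helper `split_samples` is hypothetical, and no FastWARC function is called.

```python
samples = [
    b"WARC/1.1\r\n",
    b"WARC-Type: response\r\n",
    b"Content-Length: 0\r\n",
]

# Continuous form: one buffer plus the per-sample lengths (assumed meaning).
sample_data = b"".join(samples)
sample_sizes = [len(s) for s in samples]


def split_samples(data: bytes, sizes: list) -> list:
    """Recover the individual samples from the continuous form."""
    out = []
    pos = 0
    for size in sizes:
        out.append(data[pos:pos + size])
        pos += size
    return out


recovered = split_samples(sample_data, sample_sizes)
```

Under that assumption, `sample_data` and `sample_sizes` would be the first two arguments to the training function, with `max_size` chosen by the caller.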
- fastwarc.stream_io.zstd_train_dictionary_from_files(filenames, max_size)
Train a Zstandard dictionary from a set of files.
- Parameters:
filenames (list[str]) – input file names
max_size (int) – maximum dictionary size
- Returns:
dictionary as bytes
- Return type:
bytes
- fastwarc.stream_io.zstd_train_dictionary_from_samples(samples, max_size)
Train a Zstandard dictionary from a set of samples.
- Parameters:
samples (list[bytes]) – list of byte samples
max_size (int) – maximum dictionary size
- Returns:
dictionary as bytes
- Return type:
bytes