FastWARC

Resiliparse FastWARC API documentation.

WARC

class fastwarc.warc.WarcRecordType(value)

An enumeration.

Enum indicating a WARC record’s type as given by its WARC-Type header. Multiple types can be combined with bitwise operators (such as |) to build a bitmask for filtering records.

unknown = 512

Special type: unknown record type (filter only)

any_type = 65535

Special type: any record type (filter only)

no_type = 0

Special type: no record type (filter only)

warcinfo = 2
response = 4
resource = 8
request = 16
metadata = 32
revisit = 64
conversion = 128
continuation = 256
class fastwarc.warc.ArchiveIterator(self, stream, record_types=any_type, parse_http=True, min_content_length=-1, max_content_length=-1, func_filter=None, verify_digests=False, strict_mode=True, auto_decode='none')

Bases: object

WARC record stream iterator.

Parameters:
  • stream – input stream (preferably an IOStream, but any file-like Python object is fine)

  • parse_http (bool) – whether to parse HTTP records automatically (disable for better performance if not needed)

  • record_types (int) – bitmask of WarcRecordType record types to return (others will be skipped)

  • min_content_length (int) – skip records with Content-Length less than this

  • max_content_length (int) – skip records with Content-Length larger than this

  • func_filter (Callable) – Python callable taking a WarcRecord and returning a bool for further record filtering

  • verify_digests (bool) – skip records which have no or an invalid block digest

  • strict_mode (bool) – enforce strict spec compliance (setting this to False will enable quirks such as LF instead of CRLF for headers, missing Content-Length or otherwise malformed records)

  • auto_decode (str) – automatically decode record body if Content-Encoding or Transfer-Encoding headers are set (accepted values: 'none' [default], 'content', 'transfer', 'all', has no effect if parse_http is False)

__iter__(self)

Iterate all WarcRecord items in the current WARC stream.

Return type:

t.Iterable[WarcRecord]

__next__(self)

Implements an iterator that can be used with next().

Return type:

WarcRecord

class fastwarc.warc.WarcHeaderMap(self, encoding='utf-8')

Bases: object

Dict-like type representing a WARC or HTTP header block.

Parameters:

encoding (str) – header source encoding

__iter__(self)

Iterate all header map items.

Return type:

t.Iterable[(str, str)]

append(self, key, value)

Append header (use if header name is not unique).

Parameters:
  • key (str) – header key

  • value (str) – header value

asdict(self)

Headers as Python dict.

If multiple headers have the same key, only the last occurrence will be returned.

Return type:

t.Dict[str, str]

astuples(self)

Headers as a series of tuples, including multiple headers with the same key. Use this over asdict() if header keys are not necessarily unique.

Return type:

((str, str),)

clear(self)

Clear all headers.

get(self, key, default=None)

Get header value or default.

If multiple headers have the same key, only the last occurrence will be returned.

Parameters:
  • key (str) – header key

  • default (str) – default value if key not found

Return type:

str

items(self)

Item view of keys and values.

If multiple headers have the same key, only the last occurrence will be returned.

Return type:

t.Iterable[(str, str)]

keys(self)

Iterable of header keys.

If multiple headers have the same key, only the last occurrence will be returned.

Return type:

t.Iterable[str]

values(self)

Iterable of header values.

If multiple headers have the same key, only the last occurrence will be returned.

Return type:

t.Iterable[str]

write(stream)

Write header block into stream.

reason_phrase

HTTP reason phrase (unset if header block is not an HTTP header block).

Type:

str or None

status_code

HTTP status code (unset if header block is not an HTTP header block).

Type:

int or None

status_line

Header status line.

Type:

str

class fastwarc.warc.WarcRecord(self)

Bases: object

A WARC record.

WARC records are picklable, but pickling will freeze() the WARC record.

freeze(self)

“Freeze” a record by baking in the remaining payload stream contents.

Freezing a record makes the WarcRecord instance copyable and reusable by decoupling it from the underlying raw WARC stream. Instead of reading directly from the raw stream, a frozen record maintains an internal buffer the size of the remaining payload stream contents at the time of calling freeze().

Freezing a record will advance the underlying raw stream.

init_headers(self, content_length=0, record_type=no_type, record_urn=None)

Initialize mandatory headers in a fresh WarcRecord instance.

Parameters:
  • content_length (int) – WARC record body length in bytes

  • record_type (WarcRecordType) – WARC-Type

  • record_urn (bytes) – WARC-Record-ID as URN without '<', '>' (if unset, a random URN will be generated)

parse_http(self, strict_mode=True, auto_decode='none')

Parse HTTP headers and advance content reader.

It is safe to call this method multiple times, even if the record is not an HTTP record.

Parameters:
  • strict_mode (bool) – enforce CRLF line endings (setting this to False also allows plain LF)

  • auto_decode (str) – automatically decode record body if Content-Encoding or Transfer-Encoding headers are set (accepted values: 'none' [default], 'content', 'transfer', 'all')

set_bytes_content(self, b)

Set WARC body.

Parameters:

b (bytes) – body as bytes

verify_block_digest(self, consume=False)

Verify whether record block digest is valid.

Parameters:

consume (bool) – do not create an in-memory copy of the record stream (will fully consume the rest of the record)

Returns:

True if digest exists and is valid

Return type:

bool

verify_payload_digest(self, consume=False)

Verify whether record payload digest is valid.

Parameters:

consume (bool) – do not create an in-memory copy of the record stream (will fully consume the rest of the record)

Returns:

True if record is HTTP record and digest exists and is valid

Return type:

bool

write(self, stream, checksum_data=False, chunk_size=16384)

Write WARC record onto a stream.

Parameters:
  • stream – output stream

  • checksum_data (bool) – add block and payload digest headers

  • payload_digest (bytes) – optional SHA-1 payload digest as bytes

  • chunk_size (int) – write block size

Returns:

number of bytes written

Return type:

int

content_length

Remaining WARC record length in bytes (not necessarily the same as the Content-Length header).

Type:

int

headers

WARC record headers.

Type:

WarcHeaderMap

http_charset

HTTP charset/encoding as returned by the server or None if no valid charset is set. A returned string is guaranteed to be a valid Python encoding name.

Type:

str or None

http_content_type

Plain HTTP Content-Type without additional fields such as charset=.

Type:

str or None

http_date

Parsed HTTP Date header or None if server did not return a valid HTTP date.

Type:

datetime.datetime | None

http_headers

HTTP headers if the record is an HTTP record and its HTTP headers have already been parsed.

Type:

WarcHeaderMap or None

http_last_modified

Parsed HTTP Last-Modified header or None if server did not return a valid HTTP modification date.

Type:

datetime.datetime | None

is_http

Whether record is an HTTP record.

Modifying this property will also affect the Content-Type of this record.

Type:

bool

is_http_parsed

Whether HTTP headers have been parsed.

Type:

bool

reader

Reader for the remaining WARC record content.

Type:

BufferedReader

record_date

WARC Date.

Type:

datetime.datetime | None

record_id

Record ID (same as headers['WARC-Record-ID']).

Type:

str

record_type

Record type (same as headers['WARC-Type']).

Type:

WarcRecordType

stream_pos

WARC record start offset in the original (uncompressed) stream.

Type:

int

class fastwarc.warc.WarcRecordType(value)

Bases: IntFlag

An enumeration.

fastwarc.warc.has_block_digest(record)

Filter predicate for checking if record has a block digest.

Parameters:

record (WarcRecord) – WARC record

Return type:

bool

fastwarc.warc.has_payload_digest(record)

Filter predicate for checking if record has a payload digest.

Parameters:

record (WarcRecord) – WARC record

Return type:

bool

fastwarc.warc.is_concurrent(record)

Filter predicate for checking if record is concurrent to another record.

Parameters:

record (WarcRecord) – WARC record

Return type:

bool

fastwarc.warc.is_http(record)

Filter predicate for checking if record is an HTTP record.

Parameters:

record (WarcRecord) – WARC record

Return type:

bool

fastwarc.warc.is_warc_10(record)

Filter predicate for checking if record is a WARC/1.0 record.

Parameters:

record (WarcRecord) – WARC record

Return type:

bool

fastwarc.warc.is_warc_11(record)

Filter predicate for checking if record is a WARC/1.1 record.

Parameters:

record (WarcRecord) – WARC record

Return type:

bool

StreamIO

exception fastwarc.stream_io.FastWARCError

Bases: Exception

Generic FastWARC exception.

exception fastwarc.stream_io.ReaderStaleError

Bases: FastWARCError

FastWARC reader stale error.

exception fastwarc.stream_io.StreamError

Bases: FastWARCError

FastWARC stream error.

class fastwarc.stream_io.BrotliStream(self, raw_stream, quality=11, lgwin=22, lgblock=0)

Bases: CompressingStream

Brotli IOStream implementation.

The implementation relies on Google’s brotli Python package and will be ported to a native C implementation in a later release.

Parameters:
  • raw_stream – raw data stream

  • quality (int) – compression quality (higher quality means better compression, but less speed)

  • lgwin (int) – Base 2 logarithm of the sliding window size in the range 16 to 24

  • lgblock (int) – Base 2 logarithm of the maximum input block size in the range 16 to 24 (will be set based on quality if value is 0)

begin_member(self)

Begin compression member / frame (if not already started).

Returns:

bytes written

Return type:

int

close(self)

Close the stream.

end_member(self)

End compression member / frame (if one has been started).

Returns:

bytes written

Return type:

int

flush(self)

Flush stream buffer.

seek(self, offset)

Seek to specified offset.

Parameters:

offset (int) – seek offset

tell(self)

Return current stream offset.

Returns:

stream offset

Return type:

int

class fastwarc.stream_io.BufferedReader(self, stream, buf_size=16384, negotiate_stream=True)

Bases: object

Buffered reader operating on an IOStream instance.

Parameters:
  • stream (IOStream) – stream to operate on

  • buf_size (int) – internal buffer size

  • negotiate_stream (bool) – whether to auto-negotiate stream type

close(self)

Close stream.

consume(self, size=-1)

Consume up to size bytes from the input stream without allocating a buffer for it.

Parameters:

size (int) – number of bytes to read (default means read remaining stream)

Returns:

number of bytes consumed

Return type:

int

read(self, size=-1)

Read up to size bytes from the input stream.

Parameters:

size (int) – number of bytes to read (default means read remaining stream)

Returns:

consumed buffer contents as bytes (or empty string if EOF)

Return type:

bytes

readline(self, crlf=True, max_line_len=8192)

Read a single line from the input stream.

Parameters:
  • crlf (bool) – whether lines are separated by CRLF or LF

  • max_line_len (int) – maximum line length (longer lines will still be consumed, but the return value will not be larger than this)

Returns:

line contents (or empty string if EOF)

Return type:

bytes

tell(self)

Offset on the input stream.

Returns:

offset

Return type:

int

class fastwarc.stream_io.BytesIOStream(self, initial_data=None)

Bases: IOStream

IOStream that uses an in-memory buffer.

Parameters:

initial_data (bytes) – fill internal buffer with this initial data

close(self)

Close the stream.

getvalue(self)

Get buffer value.

Returns:

buffer value

Return type:

bytes

seek(self, offset)

Seek to specified offset.

Parameters:

offset (int) – seek offset

tell(self)

Return current stream offset.

Returns:

stream offset

Return type:

int

class fastwarc.stream_io.CompressingStream

Bases: IOStream

Base class for compressed IOStream types.

begin_member(self)

Begin compression member / frame (if not already started).

Returns:

bytes written

Return type:

int

end_member(self)

End compression member / frame (if one has been started).

Returns:

bytes written

Return type:

int

class fastwarc.stream_io.FileStream(self, filename=None, mode='rb')

Bases: IOStream

Fast alternative to Python file objects for local files.

Parameters:
  • filename (str) – input filename

  • mode (str) – file open mode

close(self)

Close the stream.

flush(self)

Flush stream buffer.

seek(self, offset)

Seek to specified offset.

Parameters:

offset (int) – seek offset

tell(self)

Return current stream offset.

Returns:

stream offset

Return type:

int

class fastwarc.stream_io.GZipStream(self, raw_stream, compression_level=9, zlib=False)

Bases: CompressingStream

GZip IOStream implementation.

Parameters:
  • raw_stream – raw data stream

  • compression_level (int) – GZip compression level (for compression only)

  • zlib (bool) – use raw deflate / zlib format instead of gzip

begin_member(self)

Begin compression member / frame (if not already started).

Returns:

bytes written

Return type:

int

close(self)

Close the stream.

end_member(self)

End compression member / frame (if one has been started).

Returns:

bytes written

Return type:

int

flush(self)

Flush stream buffer.

prepopulate(self, initial_data)

Fill internal working buffer with initial data. Use if some initial data of the stream have already been consumed (e.g., for stream content negotiation). Has to be called before the first read().

Parameters:
  • deflate (int) – True if data is uncompressed, False if data is compressed GZip data.

  • initial_data (bytes) – data to pre-populate

tell(self)

Return current stream offset.

Returns:

stream offset

Return type:

int

class fastwarc.stream_io.IOStream

Bases: object

IOStream base class.

close(self)

Close the stream.

flush(self)

Flush stream buffer.

read(self, size)

Read size bytes from stream.

Parameters:

size (int) – bytes to read

Returns:

read bytes

Return type:

bytearray

seek(self, offset)

Seek to specified offset.

Parameters:

offset (int) – seek offset

tell(self)

Return current stream offset.

Returns:

stream offset

Return type:

int

write(self, data)

Write bytes to stream.

Parameters:

data (bytes) – data to write

Returns:

number of bytes written

Return type:

int

class fastwarc.stream_io.LZ4Stream(self, raw_stream, compression_level=12, favor_dec_speed=True)

Bases: CompressingStream

LZ4 IOStream implementation.

Parameters:
  • raw_stream – raw data stream

  • compression_level (int) – LZ4 compression level (for compression only)

  • favor_dec_speed (bool) – favour decompression speed over compression speed and size

begin_member(self)

Begin compression member / frame (if not already started).

Returns:

bytes written

Return type:

int

close(self)

Close the stream.

end_member(self)

End compression member / frame (if one has been started).

Returns:

bytes written

Return type:

int

flush(self)

Flush stream buffer.

prepopulate(self, initial_data)

Fill internal working buffer with initial data. Use if some initial data of the stream have already been consumed (e.g., for stream content negotiation). Has to be called before the first read().

Parameters:

initial_data (bytes) – data to pre-populate

tell(self)

Return current stream offset.

Returns:

stream offset

Return type:

int

class fastwarc.stream_io.PythonIOStreamAdapter(self, py_stream)

Bases: IOStream

IOStream adapter for file-like Python objects.

Parameters:

py_stream – input Python stream object

close(self)

Close the stream.

flush(self)

Flush stream buffer.

seek(self, offset)

Seek to specified offset.

Parameters:

offset (int) – seek offset

tell(self)

Return current stream offset.

Returns:

stream offset

Return type:

int

fastwarc.stream_io.wrap_stream(raw_stream)

Wrap raw_stream into a PythonIOStreamAdapter if it is a Python object or return raw_stream unmodified if it is an IOStream already.

Parameters:

raw_stream – stream to wrap

Returns:

wrapped stream

Return type:

IOStream