FastWARC

Resiliparse FastWARC API documentation.

WARC

class fastwarc.warc.WarcRecordType(value)

Enum indicating a WARC record’s type as given by its WARC-Type header. Multiple types can be combined with boolean operators for filtering records.

unknown = 512

Special type: unknown record type (filter only)

any_type = 65535

Special type: any record type (filter only)

no_type = 0

Special type: no record type (filter only)

warcinfo = 2
response = 4
resource = 8
request = 16
metadata = 32
revisit = 64
conversion = 128
continuation = 256
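
Because each regular type occupies its own bit, a filter mask can be built with bitwise OR and tested with bitwise AND. The sketch below copies the numeric values listed above into a stand-in enum so that it runs without the library installed; `RecordTypeSketch` and `matches` are illustrative names, not part of FastWARC.

```python
from enum import IntEnum


class RecordTypeSketch(IntEnum):
    """Stand-in that copies the documented values; not the FastWARC class."""
    no_type = 0
    warcinfo = 2
    response = 4
    resource = 8
    request = 16
    metadata = 32
    revisit = 64
    conversion = 128
    continuation = 256
    unknown = 512
    any_type = 65535


def matches(mask: int, record_type: int) -> bool:
    """Return True if record_type shares at least one bit with the filter mask."""
    return bool(mask & record_type)


# Combine two types into one filter mask.
http_mask = RecordTypeSketch.request | RecordTypeSketch.response
```

With the real enum, the same kind of combined value is what the record_types parameter of ArchiveIterator expects.
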
class fastwarc.warc.ArchiveIterator(self, stream, record_types=any_type, parse_http=True, min_content_length=-1, max_content_length=-1, func_filter=None, verify_digests=False)

Bases: object

WARC record stream iterator.

Parameters
  • stream – input stream (preferably an IOStream, but any file-like Python object is fine)

  • parse_http (bool) – whether to parse HTTP records automatically (disable for better performance if not needed)

  • record_types (int) – bitmask of WarcRecordType record types to return (others will be skipped)

  • min_content_length (int) – skip records with Content-Length less than this

  • max_content_length (int) – skip records with Content-Length larger than this

  • func_filter (Callable) – Python callable taking a WarcRecord and returning a bool for further record filtering

  • verify_digests (bool) – skip records which have no or an invalid block digest
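
Taken together, the filter parameters act like a chain of checks applied to each record. The following sketch models that chain on plain values. `RecordSketch` and `keep_record` are hypothetical stand-ins, and treating -1 as "no limit" is an assumption based on the default values rather than a statement about FastWARC internals.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RecordSketch:
    """Minimal stand-in for a record: a type bit and a Content-Length."""
    record_type: int
    content_length: int


def keep_record(record: RecordSketch,
                record_types: int = 65535,
                min_content_length: int = -1,
                max_content_length: int = -1,
                func_filter: Optional[Callable[[RecordSketch], bool]] = None) -> bool:
    """Apply the documented filters in turn; -1 is assumed to disable a bound."""
    if not record.record_type & record_types:
        return False
    if min_content_length >= 0 and record.content_length < min_content_length:
        return False
    if max_content_length >= 0 and record.content_length > max_content_length:
        return False
    if func_filter is not None and not func_filter(record):
        return False
    return True


# Keep only "response" records (bit 4) of at most 1000 bytes.
records = [RecordSketch(4, 120), RecordSketch(16, 40), RecordSketch(4, 5000)]
kept = [r for r in records if keep_record(r, record_types=4, max_content_length=1000)]
```
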

__iter__(self)

Iterate all WarcRecord items in the current WARC stream.

Return type

t.Iterable[WarcRecord]

__next__(self)

Return the next WarcRecord in this WARC stream (called by next()).

Return type

WarcRecord

class fastwarc.warc.WarcHeaderMap(self, encoding='utf-8')

Bases: object

Dict-like type representing a WARC or HTTP header block.

Parameters

encoding (str) – header source encoding

__iter__(self)

Iterate all header map items.

Return type

t.Iterable[(str, str)]

append(self, key, value)

Append header (use if header name is not unique).

Parameters
  • key (str) – header key

  • value (str) – header value

asdict(self)

Headers as Python dict.

If multiple headers have the same key, only the last occurrence will be returned.

Return type

t.Dict[str, str]

astuples(self)

Headers as a series of tuples, including multiple headers with the same key. Use this over asdict() if header keys are not necessarily unique.

Return type

((str, str),)
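
The difference between asdict() and astuples() matters for headers that repeat. A plain-Python illustration follows, using a tuple of pairs as a stand-in for the header block; the header names and values are made up for the example.

```python
# Stand-in for a header block in which one key occurs twice.
header_tuples = (
    ("Content-Type", "text/html"),
    ("Set-Cookie", "a=1"),
    ("Set-Cookie", "b=2"),
)

# Tuple view: every occurrence is preserved, in order.
cookies = [value for key, value in header_tuples if key == "Set-Cookie"]

# Dict view: later occurrences overwrite earlier ones, so only the last survives.
as_dict = dict(header_tuples)
```
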

clear(self)

Clear all headers.

get(self, key, default=None)

Get header value or default.

If multiple headers have the same key, only the last occurrence will be returned.

Parameters
  • key (str) – header key

  • default (str) – default value if key not found

Return type

str

items(self)

Item view of keys and values.

If multiple headers have the same key, only the last occurrence will be returned.

Return type

t.Iterable[(str, str)]

keys(self)

Iterable of header keys.

If multiple headers have the same key, only the last occurrence will be returned.

Return type

t.Iterable[str]

values(self)

Iterable of header values.

If multiple headers have the same key, only the last occurrence will be returned.

Return type

t.Iterable[str]

write()

Write header block into stream.

status_code

HTTP status code (unset if header block is not an HTTP header block).

Type

int or None

status_line

Header status line.

Type

str

class fastwarc.warc.WarcRecord(self)

Bases: object

A WARC record.

WARC records are picklable, but pickling will freeze() the WARC record.

freeze(self)

“Freeze” a record by baking in the remaining payload stream contents.

Freezing a record makes the WarcRecord instance copyable and reusable by decoupling it from the underlying raw WARC stream. Instead of reading directly from the raw stream, a frozen record maintains an internal buffer the size of the remaining payload stream contents at the time of calling freeze().

Freezing a record will advance the underlying raw stream.
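
The idea behind freezing can be illustrated with standard-library streams: a reader that shares a raw stream is replaced by a private buffer holding whatever was still unread. This is a conceptual sketch only. `freeze_remaining` is a hypothetical helper and does not reflect FastWARC's implementation.

```python
import io


def freeze_remaining(raw: io.BufferedIOBase, remaining: int) -> io.BytesIO:
    """Copy the next `remaining` bytes of `raw` into an independent buffer."""
    return io.BytesIO(raw.read(remaining))


# Two made-up "records" stored back to back in one raw stream.
raw_stream = io.BytesIO(b"payload-of-record-1" b"record-2...")
raw_stream.read(8)                          # part of record 1 was already read
frozen = freeze_remaining(raw_stream, 11)   # buffer the rest of record 1

# The raw stream has advanced past record 1 ...
next_bytes = raw_stream.read(8)
# ... while the frozen copy can still be read independently.
frozen_bytes = frozen.read()
```
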

init_headers(self, content_length=0, record_type=no_type, record_urn=None)

Initialize mandatory headers in a fresh WarcRecord instance.

Parameters
  • content_length (int) – WARC record body length in bytes

  • record_type (WarcRecordType) – WARC-Type

  • record_urn (bytes) – WARC-Record-ID as URN without '<', '>' (if unset, a random URN will be generated)
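
For illustration, a URN of the expected shape can be produced with the standard `uuid` module. Whether FastWARC generates its random URN this way is not stated here. The snippet only shows the value with and without the angle brackets, encoded as bytes to match the documented parameter type.

```python
import uuid

# A random UUID-based URN without angle brackets, as bytes.
record_urn = uuid.uuid4().urn.encode("ascii")

# The same identifier wrapped in '<' and '>', as it would appear in a header value.
header_value = b"<" + record_urn + b">"
```
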

parse_http(self)

Parse HTTP headers and advance content reader.

It is safe to call this method multiple times, even if the record is not an HTTP record.

set_bytes_content(self, b)

Set WARC body.

Parameters

b (bytes) – body as bytes

verify_block_digest(self, consume=False)

Verify whether record block digest is valid.

Parameters

consume (bool) – do not create an in-memory copy of the record stream (will fully consume the rest of the record)

Returns

True if digest exists and is valid

Return type

bool
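
WARC digest headers commonly take the form `algorithm:value`, with SHA-1 digests written in Base32. Assuming that convention, a block digest check can be sketched with the standard library. `block_digest_matches` is a hypothetical helper, not the FastWARC routine, and it handles only the `sha1` case.

```python
import base64
import hashlib


def block_digest_matches(header_value: str, block: bytes) -> bool:
    """Check a 'sha1:<base32>' digest header value against the block bytes."""
    algorithm, _, expected = header_value.partition(":")
    if algorithm.lower() != "sha1" or not expected:
        return False  # a missing or unsupported digest counts as not verified
    actual = base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")
    return actual == expected.upper()


# Build a matching header value for a made-up record block.
block = b"HTTP/1.1 200 OK\r\n\r\nhello"
good_header = "sha1:" + base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")
```
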

verify_payload_digest(self, consume=False)

Verify whether record payload digest is valid.

Parameters

consume (bool) – do not create an in-memory copy of the record stream (will fully consume the rest of the record)

Returns

True if record is HTTP record and digest exists and is valid

Return type

bool

write(self, stream, checksum_data=False, payload_digest=None, chunk_size=16384)

Write WARC record onto a stream.

Parameters
  • stream – output stream

  • checksum_data (bool) – add block and payload digest headers

  • payload_digest (bytes) – optional SHA-1 payload digest as bytes

  • chunk_size (int) – write block size

Returns

number of bytes written

Return type

int

content_length

Remaining WARC record length in bytes (not necessarily the same as the Content-Length header).

Type

int

headers

WARC record headers.

Type

WarcHeaderMap

http_charset

HTTP charset/encoding as returned by the server or None if no valid charset is set. A returned string is guaranteed to be a valid Python encoding name.

Type

str or None

http_content_type

Plain HTTP Content-Type without additional fields such as charset=.

Type

str or None
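
The relationship between a raw Content-Type header, the plain media type, and a validated charset can be sketched with the standard library. `split_content_type` is a hypothetical helper. It mirrors the documented behaviour of returning None for a charset that is absent or not a valid Python encoding name, but it is not FastWARC's parser.

```python
import codecs
from email.message import Message
from typing import Optional, Tuple


def split_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Return (media type, charset); charset is None if absent or unknown to Python."""
    msg = Message()
    msg["Content-Type"] = header_value
    charset = msg.get_content_charset()
    if charset is not None:
        try:
            codecs.lookup(charset)   # raises LookupError for unknown encodings
        except LookupError:
            charset = None
    return msg.get_content_type(), charset


media_type, charset = split_content_type("text/html; charset=UTF-8")
```
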

http_headers

HTTP headers if the record is an HTTP record and its HTTP headers have already been parsed.

Type

WarcHeaderMap or None

is_http

Whether record is an HTTP record.

Modifying this property will also affect the Content-Type of this record.

Type

bool

is_http_parsed

Whether HTTP headers have been parsed.

Type

bool

reader

Reader for the remaining WARC record content.

Type

BufferedReader

record_id

Record ID (same as headers['WARC-Record-ID']).

Type

str

record_type

Record type (same as headers['WARC-Type']).

Type

WarcRecordType

stream_pos

WARC record start offset in the original (uncompressed) stream.

Type

int

fastwarc.warc.has_block_digest(record)

Filter predicate for checking if record has a block digest.

Parameters

record (WarcRecord) – WARC record

Return type

bool

fastwarc.warc.has_payload_digest(record)

Filter predicate for checking if record has a payload digest.

Parameters

record (WarcRecord) – WARC record

Return type

bool

fastwarc.warc.is_concurrent(record)

Filter predicate for checking if record is concurrent to another record.

Parameters

record (WarcRecord) – WARC record

Return type

bool

fastwarc.warc.is_http(record)

Filter predicate for checking if record is an HTTP record.

Parameters

record (WarcRecord) – WARC record

Return type

bool

fastwarc.warc.is_warc_10(record)

Filter predicate for checking if record is a WARC/1.0 record.

Parameters

record (WarcRecord) – WARC record

Return type

bool

fastwarc.warc.is_warc_11(record)

Filter predicate for checking if record is a WARC/1.1 record.

Parameters

record (WarcRecord) – WARC record

Return type

bool

StreamIO

exception fastwarc.stream_io.FastWARCError

Bases: Exception

Generic FastWARC exception.

exception fastwarc.stream_io.ReaderStaleError

Bases: fastwarc.stream_io.FastWARCError

FastWARC reader stale error.

exception fastwarc.stream_io.StreamError

Bases: fastwarc.stream_io.FastWARCError

FastWARC stream error.

class fastwarc.stream_io.BufferedReader(self, stream, buf_size=16384, negotiate_stream=True)

Bases: object

Buffered reader operating on an IOStream instance.

Parameters
  • stream (IOStream) – stream to operate on

  • buf_size (int) – internal buffer size

  • negotiate_stream (bool) – whether to auto-negotiate stream type

close(self)

Close stream.

consume(self, size=-1)

Consume up to size bytes from the input stream without allocating a buffer for it.

Parameters

size (int) – number of bytes to read (default means read remaining stream)

Returns

number of bytes consumed

Return type

int

read(self, size=-1)

Read up to size bytes from the input stream.

Parameters

size (int) – number of bytes to read (default means read remaining stream)

Returns

consumed buffer contents as bytes (or an empty bytes object if EOF)

Return type

bytes

readline(self, crlf=True, max_line_len=8192)

Read a single line from the input stream.

Parameters
  • crlf (bool) – whether lines are separated by CRLF or LF

  • max_line_len (int) – maximum line length (longer lines will still be consumed, but the return value will not be larger than this)

Returns

line contents (or an empty bytes object if EOF)

Return type

bytes
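
The effect of max_line_len can be modelled with a standard-library stream: the whole line is consumed, but only a bounded prefix is returned. `readline_capped` is a hypothetical helper, and truncating to the first max_line_len bytes is an assumption about which part of an over-long line is kept.

```python
import io


def readline_capped(stream: io.BufferedIOBase, max_line_len: int = 8192) -> bytes:
    """Consume one full line but return at most max_line_len bytes of it."""
    line = stream.readline()      # reads through the next b"\n" (or to EOF)
    return line[:max_line_len]


stream = io.BytesIO(b"WARC-Type: response\r\nContent-Length: 5\r\n")
first = readline_capped(stream, max_line_len=9)   # truncated return value
position_after_first = stream.tell()              # but the full line was consumed
second = readline_capped(stream)
at_eof = readline_capped(stream)                  # empty bytes at end of stream
```
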

tell(self)

Offset on the input stream.

Returns

offset

Return type

int

class fastwarc.stream_io.BytesIOStream(self, initial_data=None)

Bases: fastwarc.stream_io.IOStream

IOStream that uses an in-memory buffer.

Parameters

initial_data (bytes) – fill internal buffer with this initial data

class fastwarc.stream_io.CompressingStream

Bases: fastwarc.stream_io.IOStream

Base class for compressed IOStream types.

class fastwarc.stream_io.FileStream(self, filename=None, mode='rb')

Bases: fastwarc.stream_io.IOStream

Fast alternative to Python file objects for local files.

Parameters
  • filename (str) – input filename

  • mode (str) – file open mode

class fastwarc.stream_io.GZipStream(self, raw_stream, compression_level=9)

Bases: fastwarc.stream_io.CompressingStream

GZip IOStream implementation.

Parameters
  • raw_stream – raw data stream

  • compression_level (int) – GZip compression level (for compression only)

class fastwarc.stream_io.IOStream

Bases: object

IOStream base class.

class fastwarc.stream_io.LZ4Stream(self, raw_stream, compression_level=12, favor_dec_speed=True)

Bases: fastwarc.stream_io.CompressingStream

LZ4 IOStream implementation.

Parameters
  • raw_stream – raw data stream

  • compression_level (int) – LZ4 compression level (for compression only)

  • favor_dec_speed (bool) – favour decompression speed over compression speed and size

class fastwarc.stream_io.PythonIOStreamAdapter(self, py_stream)

Bases: fastwarc.stream_io.IOStream

IOStream adapter for file-like Python objects.

Parameters

py_stream – input Python stream object

fastwarc.stream_io.wrap_stream(raw_stream)

Wrap raw_stream into a PythonIOStreamAdapter if it is a Python object or return raw_stream unmodified if it is an IOStream already.

Parameters

raw_stream – stream to wrap

Returns

wrapped stream

Return type

IOStream
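
The wrap-or-pass-through behaviour described for wrap_stream can be sketched with stand-in classes. `IOStreamSketch`, `AdapterSketch`, and `wrap_stream_sketch` are illustrative names that only mimic the documented decision. They are not the FastWARC types.

```python
import io


class IOStreamSketch:
    """Stand-in for the native stream base class."""


class AdapterSketch(IOStreamSketch):
    """Stand-in adapter that delegates reads to a file-like Python object."""

    def __init__(self, py_stream):
        self.py_stream = py_stream

    def read(self, size: int = -1) -> bytes:
        return self.py_stream.read(size)


def wrap_stream_sketch(raw_stream):
    """Return native streams unchanged; wrap anything else in the adapter."""
    if isinstance(raw_stream, IOStreamSketch):
        return raw_stream
    return AdapterSketch(raw_stream)


wrapped = wrap_stream_sketch(io.BytesIO(b"WARC/1.1\r\n"))
rewrapped = wrap_stream_sketch(wrapped)   # already a stream: returned as-is
```
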