FastWARC
Resiliparse FastWARC API documentation.
WARC
- class fastwarc.warc.WarcRecordType(value)
An enumeration.
Enum indicating a WARC record’s type as given by its WARC-Type header. Multiple types can be combined with bitwise operators (such as |) for filtering records.
- unknown = 512
Special type: unknown record type (filter only)
- any_type = 65535
Special type: any record type (filter only)
- no_type = 0
Special type: no record type (filter only)
- warcinfo = 2
- response = 4
- resource = 8
- request = 16
- metadata = 32
- revisit = 64
- conversion = 128
- continuation = 256
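Since every concrete type occupies its own bit, a filter mask can be built by OR-ing the wanted types together, and a type passes when it shares a bit with the mask. The sketch below shows only that arithmetic. RecordTypeSketch and matches are illustrative stand-ins that copy the values listed above; they are not part of FastWARC, and the library's internal check may differ.

```python
from enum import IntEnum


class RecordTypeSketch(IntEnum):
    """Stand-in that copies the documented values; not the library's class."""
    no_type = 0
    warcinfo = 2
    response = 4
    resource = 8
    request = 16
    metadata = 32
    revisit = 64
    conversion = 128
    continuation = 256
    unknown = 512
    any_type = 65535


def matches(record_type: int, mask: int) -> bool:
    """Return True if record_type shares at least one bit with mask."""
    return bool(record_type & mask)


# Combine two types into one filter mask (4 | 16 == 20).
wanted = RecordTypeSketch.response | RecordTypeSketch.request

print(matches(RecordTypeSketch.response, wanted))                     # True
print(matches(RecordTypeSketch.metadata, wanted))                     # False
print(matches(RecordTypeSketch.metadata, RecordTypeSketch.any_type))  # True
print(matches(RecordTypeSketch.metadata, RecordTypeSketch.no_type))   # False
```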
- class fastwarc.warc.ArchiveIterator(self, stream, record_types=any_type, parse_http=True, min_content_length=-1, max_content_length=-1, func_filter=None, verify_digests=False, strict_mode=True)
Bases:
object
WARC record stream iterator.
- Parameters
stream – input stream (preferably an IOStream, but any file-like Python object is fine)
parse_http (bool) – whether to parse HTTP records automatically (disable for better performance if not needed)
record_types (int) – bitmask of WarcRecordType record types to return (others will be skipped)
min_content_length (int) – skip records with Content-Length less than this
max_content_length (int) – skip records with Content-Length larger than this
func_filter (Callable) – Python callable taking a WarcRecord and returning a bool for further record filtering
verify_digests (bool) – skip records with a missing or invalid block digest
strict_mode (bool) – enforce strict spec compliance (setting this to False will enable quirks such as LF instead of CRLF for headers)
- __iter__(self)
Iterate all WarcRecord items in the current WARC stream.
- Return type
t.Iterable[WarcRecord]
- __next__(self)
Return an iterator object for this WARC stream that can be used with next().
- Return type
t.Iterator[WarcRecord]
- class fastwarc.warc.WarcHeaderMap(self, encoding='utf-8')
Bases:
object
Dict-like type representing a WARC or HTTP header block.
- Parameters
encoding (str) – header source encoding
- __iter__(self)
Iterate all header map items.
- Return type
t.Iterable[(str, str)]
- append(self, key, value)
Append header (use if header name is not unique).
- Parameters
key (str) – header key
value (str) – header value
- asdict(self)
Headers as Python dict.
If multiple headers have the same key, only the last occurrence will be returned.
- Return type
t.Dict[str, str]
- astuples(self)
Headers as a series of tuples, including multiple headers with the same key. Use this over asdict() if header keys are not necessarily unique.
- Return type
((str, str),)
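The difference between the two views matters as soon as a key repeats. The sketch below reproduces the documented semantics with plain Python containers rather than a real WarcHeaderMap; the header values are invented for illustration.

```python
# Hypothetical header lines with a repeated key, in stream order.
header_pairs = [
    ("WARC-Type", "response"),
    ("WARC-Concurrent-To", "<urn:uuid:aaaa>"),
    ("WARC-Concurrent-To", "<urn:uuid:bbbb>"),
]

# dict() keeps the last value per key, like the documented asdict() behaviour.
as_dict = dict(header_pairs)

# A tuple of pairs keeps every occurrence, like astuples().
as_tuples = tuple(header_pairs)

print(as_dict["WARC-Concurrent-To"])  # <urn:uuid:bbbb>
print(len(as_tuples))                 # 3
```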
- clear(self)
Clear all headers.
- get(self, key, default=None)
Get header value or default.
If multiple headers have the same key, only the last occurrence will be returned.
- Parameters
key (str) – header key
default (str) – default value if key not found
- Return type
str
- items(self)
Item view of keys and values.
If multiple headers have the same key, only the last occurrence will be returned.
- Return type
t.Iterable[(str, str)]
- keys(self)
Iterable of header keys.
If multiple headers have the same key, only the last occurrence will be returned.
- Return type
t.Iterable[str]
- values(self)
Iterable of header values.
If multiple headers have the same key, only the last occurrence will be returned.
- Return type
t.Iterable[str]
- write()
Write header block into stream.
- status_code
HTTP status code (unset if header block is not an HTTP header block).
- Type
int or None
- status_line
Header status line.
- Type
str
- class fastwarc.warc.WarcRecord(self)
Bases:
object
A WARC record.
WARC records are picklable, but pickling will freeze() the WARC record.
- freeze(self)
“Freeze” a record by baking in the remaining payload stream contents.
Freezing a record makes the WarcRecord instance copyable and reusable by decoupling it from the underlying raw WARC stream. Instead of reading directly from the raw stream, a frozen record maintains an internal buffer the size of the remaining payload stream contents at the time of calling freeze().
Freezing a record will advance the underlying raw stream.
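The idea behind freezing can be pictured with a toy model built on io.BytesIO. LiveRecordSketch is a hypothetical stand-in, not FastWARC code: it only illustrates why copying the remaining payload both decouples the record from the raw stream and moves that stream forward.

```python
import io


class LiveRecordSketch:
    """Toy record that reads its payload lazily from a shared raw stream."""

    def __init__(self, raw: io.BytesIO, length: int):
        self.raw = raw
        self.remaining = length

    def read(self) -> bytes:
        data = self.raw.read(self.remaining)
        self.remaining -= len(data)
        return data

    def freeze(self) -> io.BytesIO:
        # Copy what is left of the payload into a private buffer.
        # Reading it advances the shared raw stream as a side effect.
        return io.BytesIO(self.read())


first, second = b"first-payload", b"second-payload"
raw = io.BytesIO(first + second)

record = LiveRecordSketch(raw, len(first))
frozen = record.freeze()
pos_after_freeze = raw.tell()  # raw stream now sits at the end of the first payload

rest = raw.read()              # the raw stream moves on to later data
print(frozen.read() == first)  # True: the frozen copy no longer depends on raw
```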
- init_headers(self, content_length=0, record_type=no_type, record_urn=None)
Initialize mandatory headers in a fresh WarcRecord instance.
- Parameters
content_length (int) – WARC record body length in bytes
record_type (WarcRecordType) – WARC-Type
record_urn (bytes) – WARC-Record-ID as URN without '<', '>' (if unset, a random URN will be generated)
- parse_http(self, strict_mode=True)
Parse HTTP headers and advance content reader.
It is safe to call this method multiple times, even if the record is not an HTTP record.
- Parameters
strict_mode (bool) – enforce CRLF line endings; setting this to False also allows plain LF
- set_bytes_content(self, b)
Set WARC body.
- Parameters
b (bytes) – body as bytes
- verify_block_digest(self, consume=False)
Verify whether record block digest is valid.
- Parameters
consume (bool) – do not create an in-memory copy of the record stream (will fully consume the rest of the record)
- Returns
True if digest exists and is valid
- Return type
bool
- verify_payload_digest(self, consume=False)
Verify whether record payload digest is valid.
- Parameters
consume (bool) – do not create an in-memory copy of the record stream (will fully consume the rest of the record)
- Returns
True if record is HTTP record and digest exists and is valid
- Return type
bool
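To make "exists and is valid" concrete, here is a standalone sketch of a digest check using only the standard library. It assumes the header value has the form sha1:<Base32>, a format commonly found in WARC files. Other algorithms and encodings exist, and the exact comparison FastWARC performs is not described here. The function names are illustrative.

```python
import base64
import hashlib
from typing import Optional


def sha1_base32(data: bytes) -> str:
    """Format a SHA-1 digest as 'sha1:<Base32>' (assumed header format)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def digest_is_valid(header_value: Optional[str], data: bytes) -> bool:
    """True only if a digest header exists and matches the data."""
    if header_value is None:
        return False
    return header_value == sha1_base32(data)


block = b"hello world"
print(digest_is_valid(sha1_base32(block), block))        # True
print(digest_is_valid(sha1_base32(block), b"tampered"))  # False
print(digest_is_valid(None, block))                      # False
```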
- write(self, stream, checksum_data=False, chunk_size=16384)
Write WARC record onto a stream.
- Parameters
stream – output stream
checksum_data (bool) – add block and payload digest headers
payload_digest (bytes) – optional SHA-1 payload digest as bytes
chunk_size (int) – write block size
- Returns
number of bytes written
- Return type
int
- content_length
Remaining WARC record length in bytes (not necessarily the same as the Content-Length header).
- Type
int
- headers
WARC record headers.
- Type
WarcHeaderMap
- http_charset
HTTP charset/encoding as returned by the server or None if no valid charset is set. A returned string is guaranteed to be a valid Python encoding name.
- Type
str or None
- http_content_type
Plain HTTP Content-Type without additional fields such as charset=.
- Type
str or None
- http_date
Parsed HTTP Date header or None if server did not return a valid HTTP date.
- Type
datetime.datetime | None
- http_headers
HTTP headers if record is an HTTP record and HTTP headers have already been parsed.
- Type
WarcHeaderMap or None
- is_http
Whether record is an HTTP record.
Modifying this property will also affect the Content-Type of this record.
- Type
bool
- is_http_parsed
Whether HTTP headers have been parsed.
- Type
bool
- reader
Reader for the remaining WARC record content.
- Type
BufferedReader
- record_date
WARC Date.
- Type
datetime.datetime | None
- record_id
Record ID (same as headers['WARC-Record-ID']).
- Type
str
- record_type
Record type (same as headers['WARC-Type']).
- Type
WarcRecordType
- stream_pos
WARC record start offset in the original (uncompressed) stream.
- Type
int
- fastwarc.warc.has_block_digest(record)
Filter predicate for checking if record has a block digest.
- Parameters
record (WarcRecord) – WARC record
- Return type
bool
- fastwarc.warc.has_payload_digest(record)
Filter predicate for checking if record has a payload digest.
- Parameters
record (WarcRecord) – WARC record
- Return type
bool
- fastwarc.warc.is_concurrent(record)
Filter predicate for checking if record is concurrent to another record.
- Parameters
record (WarcRecord) – WARC record
- Return type
bool
- fastwarc.warc.is_http(record)
Filter predicate for checking if record is an HTTP record.
- Parameters
record (WarcRecord) – WARC record
- Return type
bool
- fastwarc.warc.is_warc_10(record)
Filter predicate for checking if record is a WARC/1.0 record.
- Parameters
record (WarcRecord) – WARC record
- Return type
bool
- fastwarc.warc.is_warc_11(record)
Filter predicate for checking if record is a WARC/1.1 record.
- Parameters
record (WarcRecord) – WARC record
- Return type
bool
StreamIO
- exception fastwarc.stream_io.FastWARCError
Bases:
Exception
Generic FastWARC exception.
- exception fastwarc.stream_io.ReaderStaleError
Bases:
fastwarc.stream_io.FastWARCError
FastWARC reader stale error.
- exception fastwarc.stream_io.StreamError
Bases:
fastwarc.stream_io.FastWARCError
FastWARC stream error.
- class fastwarc.stream_io.BufferedReader(self, stream, buf_size=16384, negotiate_stream=True)
Bases:
object
Buffered reader operating on an IOStream instance.
- Parameters
stream (IOStream) – stream to operate on
buf_size (int) – internal buffer size
negotiate_stream (bool) – whether to auto-negotiate stream type
- close(self)
Close stream.
- consume(self, size=-1)
Consume up to size bytes from the input stream without allocating a buffer for it.
- Parameters
size (int) – number of bytes to read (default means read remaining stream)
- Returns
number of bytes consumed
- Return type
int
- read(self, size=-1)
Read up to size bytes from the input stream.
- Parameters
size (int) – number of bytes to read (default means read remaining stream)
- Returns
consumed buffer contents as bytes (or empty string if EOF)
- Return type
bytes
- readline(self, crlf=True, max_line_len=8192)
Read a single line from the input stream.
- Parameters
crlf (bool) – whether lines are separated by CRLF or LF
max_line_len (int) – maximum line length (longer lines will still be consumed, but the return value will not be larger than this)
- Returns
line contents (or empty string if EOF)
- Return type
bytes
- tell(self)
Offset on the input stream.
- Returns
offset
- Return type
int
- class fastwarc.stream_io.BytesIOStream(self, initial_data=None)
Bases:
fastwarc.stream_io.IOStream
IOStream that uses an in-memory buffer.
- Parameters
initial_data (bytes) – fill internal buffer with this initial data
- class fastwarc.stream_io.CompressingStream
Bases:
fastwarc.stream_io.IOStream
Base class for compressed IOStream types.
- class fastwarc.stream_io.FileStream(self, filename=None, mode='rb')
Bases:
fastwarc.stream_io.IOStream
Fast alternative to Python file objects for local files.
- Parameters
filename (str) – input filename
mode (str) – file open mode
- class fastwarc.stream_io.GZipStream(self, raw_stream, compression_level=9)
Bases:
fastwarc.stream_io.CompressingStream
GZip IOStream implementation.
- Parameters
raw_stream – raw data stream
compression_level (int) – GZip compression level (for compression only)
- class fastwarc.stream_io.IOStream
Bases:
object
IOStream base class.
- class fastwarc.stream_io.LZ4Stream(self, raw_stream, compression_level=12, favor_dec_speed=True)
Bases:
fastwarc.stream_io.CompressingStream
LZ4 IOStream implementation.
- Parameters
raw_stream – raw data stream
compression_level (int) – LZ4 compression level (for compression only)
favor_dec_speed (bool) – favour decompression speed over compression speed and size
- class fastwarc.stream_io.PythonIOStreamAdapter(self, py_stream)
Bases:
fastwarc.stream_io.IOStream
IOStream adapter for file-like Python objects.
- Parameters
py_stream – input Python stream object
- fastwarc.stream_io.wrap_stream(raw_stream)
Wrap raw_stream into a PythonIOStreamAdapter if it is a Python object or return raw_stream unmodified if it is an IOStream already.
- Parameters
raw_stream – stream to wrap
- Returns
wrapped stream
- Return type
IOStream