FastWARC
Resiliparse FastWARC API documentation.
WARC
- class fastwarc.warc.WarcRecordType(value)
An enumeration.
Enum indicating a WARC record’s type as given by its WARC-Type header. Multiple types can be combined with bitwise operators for filtering records.
- unknown = 512
Special type: unknown record type (filter only)
- any_type = 65535
Special type: any record type (filter only)
- no_type = 0
Special type: no record type (filter only)
- warcinfo = 2
- response = 4
- resource = 8
- request = 16
- metadata = 32
- revisit = 64
- conversion = 128
- continuation = 256
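The member values above are distinct bits, so a filter mask can be built by OR-ing types together. The following minimal sketch mirrors the listed values in a local stand-in enum so that it runs without the library. RecordType and matches are illustrative names, not part of FastWARC; with FastWARC installed, WarcRecordType from fastwarc.warc would take the place of the stand-in.

```python
from enum import IntFlag


class RecordType(IntFlag):
    """Local stand-in mirroring the WarcRecordType values listed above."""
    no_type = 0
    warcinfo = 2
    response = 4
    resource = 8
    request = 16
    metadata = 32
    revisit = 64
    conversion = 128
    continuation = 256
    unknown = 512
    any_type = 65535


def matches(record_type, mask):
    """Return True if the record type's bit is set in the filter mask."""
    return bool(record_type & mask)


# Combine two types with | to build a filter mask (4 | 16).
wanted = RecordType.request | RecordType.response
```

A mask built this way is the kind of value the record_types parameter of ArchiveIterator is documented to accept.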
- class fastwarc.warc.ArchiveIterator
Bases:
object
WARC record stream iterator.
- Parameters:
stream – input stream (preferably an IOStream, but any file-like Python object is fine)
parse_http (bool) – whether to parse HTTP records automatically (disable for better performance if not needed)
record_types (int) – bitmask of WarcRecordType record types to return (others will be skipped)
min_content_length (int) – skip records with Content-Length less than this
max_content_length (int) – skip records with Content-Length larger than this
func_filter (Callable) – Python callable taking a WarcRecord and returning a bool for further record filtering
verify_digests (bool) – skip records which have no or an invalid block digest
strict_mode (bool) – enforce strict spec compliance (setting this to False will enable quirks such as LF instead of CRLF for headers, missing Content-Length or otherwise malformed records)
auto_decode (str) – automatically decode record body if Content-Encoding or Transfer-Encoding headers are set (accepted values: 'none' [default], 'content', 'transfer', 'all'; has no effect if parse_http is False)
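The parameters above can be combined in a single iterator call. The following is a rough usage sketch, not a definitive recipe: it assumes FastWARC is installed, and iter_http_responses and min_length are hypothetical names chosen for illustration. It also assumes that http_headers is populated for every yielded record because parse_http is enabled and the is_http predicate is used as filter.

```python
def iter_http_responses(stream, min_length=0):
    """Yield (target URI, HTTP status code, body bytes) per HTTP response record."""
    # Imported inside the function so that this module still loads
    # where FastWARC is not installed.
    from fastwarc.warc import ArchiveIterator, WarcRecordType, is_http

    for record in ArchiveIterator(
        stream,
        parse_http=True,                       # needed for record.http_headers
        record_types=WarcRecordType.response,  # skip all other record types
        min_content_length=min_length,         # skip records shorter than this
        func_filter=is_http,                   # keep HTTP records only
    ):
        uri = record.headers.get('WARC-Target-URI')
        status = record.http_headers.status_code
        body = record.reader.read()            # remaining record content
        yield uri, status, body
```

Any file-like object opened in binary mode could be passed as stream, for example the result of open(path, 'rb') for some local WARC file.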
- __iter__()
Iterate all WarcRecord items in the current WARC stream.
- Return type:
t.Iterable[WarcRecord]
- __next__()
Implements an iterator that can be used with next().
- Return type:
- class fastwarc.warc.WarcHeaderMap
Bases:
object
Dict-like type representing a WARC or HTTP header block.
- Parameters:
encoding (str) – header source encoding
- __iter__()
Iterate all header map items.
- Return type:
t.Iterable[(str, str)]
- append(key, value)
Append header (use if header name is not unique).
- Parameters:
key (str) – header key
value (str) – header value
- asdict() Dict[str, str]
Headers as Python dict.
If multiple headers have the same key, only the last occurrence will be returned.
- Return type:
t.Dict[str, str]
- astuples() Tuple[Tuple[str, str]]
Headers as a series of tuples, including multiple headers with the same key. Use this over asdict() if header keys are not necessarily unique.
- Return type:
((str, str),)
- clear()
Clear all headers.
- get(key, default=None) str
Get header value or default.
If multiple headers have the same key, only the last occurrence will be returned.
- Parameters:
key (str) – header key
default (str) – default value if key is not found
- Return type:
str
- items()
Item view of keys and values.
If multiple headers have the same key, only the last occurrence will be returned.
- Return type:
t.Iterable[(str, str)]
- keys()
Iterable of header keys.
If multiple headers have the same key, only the last occurrence will be returned.
- Return type:
t.Iterable[str]
- values()
Iterable of header values.
If multiple headers have the same key, only the last occurrence will be returned.
- Return type:
t.Iterable[str]
- write(stream)
Write header block into stream.
- reason_phrase
HTTP reason phrase (unset if header block is not an HTTP header block).
- Type:
str or None
- status_code
HTTP status code (unset if header block is not an HTTP header block).
- Type:
int or None
- status_line
Header status line.
- Type:
str
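Several accessors above return only the last occurrence of a repeated header, while astuples() keeps every occurrence. The difference can be sketched with plain Python tuples and a dict standing in for a WarcHeaderMap. The header names and values below are made up for illustration, and no FastWARC call is involved.

```python
# Header lines as (name, value) pairs, in the shape astuples() is documented to return.
header_pairs = (
    ("Content-Type", "text/html"),
    ("Set-Cookie", "a=1"),
    ("Set-Cookie", "b=2"),
)

# Building a dict keeps only the last value per key, which corresponds to the
# documented behaviour of asdict(), get(), items(), keys(), and values().
as_dict = dict(header_pairs)

# To see every value of a repeated header, filter the tuple form instead.
cookies = [value for name, value in header_pairs if name == "Set-Cookie"]
```

For headers that may legitimately repeat, the tuple form is therefore the safer starting point, and append() is the documented way to add such headers.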
- class fastwarc.warc.WarcRecord
Bases:
object
A WARC record.
WARC records are picklable, but pickling will freeze() the WARC record.
- freeze()
“Freeze” a record by baking in the remaining payload stream contents.
Freezing a record makes the WarcRecord instance copyable and reusable by decoupling it from the underlying raw WARC stream. Instead of reading directly from the raw stream, a frozen record maintains an internal buffer the size of the remaining payload stream contents at the time of calling freeze().
Freezing a record will advance the underlying raw stream.
- init_headers(content_length=0, record_type=<WarcRecordType.no_type: 0>, record_urn=None)
Initialize mandatory headers in a fresh WarcRecord instance.
- Parameters:
content_length (int) – WARC record body length in bytes
record_type (WarcRecordType) – WARC-Type
record_urn (bytes) – WARC-Record-ID as URN without '<', '>' (if unset, a random URN will be generated)
- parse_http(strict_mode=True, auto_decode='none')
Parse HTTP headers and advance content reader.
It is safe to call this method multiple times, even if the record is not an HTTP record.
- Parameters:
strict_mode (bool) – enforce CRLF line endings; setting this to False will also allow plain LF
auto_decode (str) – automatically decode record body if Content-Encoding or Transfer-Encoding headers are set (accepted values: 'none' [default], 'content', 'transfer', 'all')
- set_bytes_content(b)
Set WARC body.
- Parameters:
b (bytes) – body as bytes
- verify_block_digest(consume=False)
Verify whether record block digest is valid.
- Parameters:
consume (bool) – do not create an in-memory copy of the record stream (will fully consume the rest of the record)
- Returns:
True if digest exists and is valid
- Return type:
bool
- verify_payload_digest(consume=False)
Verify whether record payload digest is valid.
- Parameters:
consume (bool) – do not create an in-memory copy of the record stream (will fully consume the rest of the record)
- Returns:
True if record is HTTP record and digest exists and is valid
- Return type:
bool
- write(stream, checksum_data=False, payload_digest=None, chunk_size=16384)
Write WARC record onto a stream.
- Parameters:
stream – output stream
checksum_data (bool) – add block and payload digest headers
payload_digest (bytes) – optional SHA-1 payload digest as bytes
chunk_size (int) – write block size
- Returns:
number of bytes written
- Return type:
int
- content_length
Remaining WARC record length in bytes (not necessarily the same as the Content-Length header).
- Type:
int
- headers
WARC record headers.
- Type:
- http_charset
HTTP charset/encoding as returned by the server or None if no valid charset is set. A returned string is guaranteed to be a valid Python encoding name.
- Type:
str or None
- http_content_type
Plain HTTP Content-Type without additional fields such as charset=.
- Type:
str or None
- http_date
Parsed HTTP Date header or None if server did not return a valid HTTP date.
- Type:
datetime.datetime | None
- http_headers
HTTP headers if the record is an HTTP record and its HTTP headers have already been parsed.
- Type:
WarcHeaderMap or None
- http_last_modified
Parsed HTTP Last-Modified header or None if server did not return a valid HTTP modification date.
- Type:
datetime.datetime | None
- is_http
Whether record is an HTTP record.
Modifying this property will also affect the Content-Type of this record.
- Type:
bool
- is_http_parsed
Whether HTTP headers have been parsed.
- Type:
bool
- reader
Reader for the remaining WARC record content.
- Type:
- record_date
WARC Date.
- Type:
datetime.datetime | None
- record_id
Record ID (same as headers['WARC-Record-ID']).
- Type:
str
- record_type
Record type (same as headers['WARC-Type']).
- Type:
- stream_pos
WARC record start offset in the original (uncompressed) stream.
- Type:
int
- class fastwarc.warc.WarcRecordType(value)
Bases:
IntFlag
An enumeration.
- fastwarc.warc.has_block_digest(record)
Filter predicate for checking if record has a block digest.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.has_payload_digest(record)
Filter predicate for checking if record has a payload digest.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.is_concurrent(record)
Filter predicate for checking if record is concurrent to another record.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.is_http(record)
Filter predicate for checking if record is an HTTP record.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.is_warc_10(record)
Filter predicate for checking if record is a WARC/1.0 record.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.is_warc_11(record)
Filter predicate for checking if record is a WARC/1.1 record.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
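Because func_filter is documented as accepting any Python callable that takes a record and returns a bool, several of the predicates above can be combined with a small wrapper. The sketch below is an illustration only: all_of is a hypothetical helper, not part of FastWARC, and it works with any single-argument predicates.

```python
def all_of(*predicates):
    """Return a predicate that is true only if every given predicate is true."""
    def combined(record):
        return all(predicate(record) for predicate in predicates)
    return combined


# With FastWARC installed, the combined predicate could be passed as, for example:
#   ArchiveIterator(stream, func_filter=all_of(is_http, has_block_digest))
```

A plain lambda would serve equally well for one-off filters; the wrapper mainly keeps longer predicate chains readable.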
StreamIO
- exception fastwarc.stream_io.FastWARCError
Bases:
Exception
Generic FastWARC exception.
- exception fastwarc.stream_io.ReaderStaleError
Bases:
FastWARCError
FastWARC reader stale error.
- exception fastwarc.stream_io.StreamError
Bases:
FastWARCError
FastWARC stream error.
- class fastwarc.stream_io.BrotliStream
Bases:
CompressingStream
Brotli IOStream implementation.
Implementation relies on Google’s brotli Python package and will be ported to a native C version in a later release.
- Parameters:
raw_stream – raw data stream
quality (int) – compression quality (higher quality means better compression, but less speed)
lgwin (int) – Base 2 logarithm of the sliding window size in the range 16 to 24
lgblock (int) – Base 2 logarithm of the maximum input block size in the range 16 to 24 (will be set based on quality if value is 0)
- begin_member()
Begin compression member / frame (if not already started).
- Returns:
bytes written
- Return type:
int
- close()
Close the stream.
- end_member()
End compression member / frame (if one has been started).
- Returns:
bytes written
- Return type:
int
- flush()
Flush stream buffer.
- seek(offset)
Seek to specified offset.
- Parameters:
offset (int) – seek offset
- tell()
Return current stream offset.
- Returns:
stream offset
- Return type:
int
- class fastwarc.stream_io.BufferedReader
Bases:
object
Buffered reader operating on an IOStream instance.
- Parameters:
stream (IOStream) – stream to operate on
buf_size (int) – internal buffer size
negotiate_stream (bool) – whether to auto-negotiate stream type
- close()
Close stream.
- consume(size=18446744073709551615)
Consume up to size bytes from the input stream without allocating a buffer for it.
- Parameters:
size (int) – number of bytes to read (default means read remaining stream)
- Returns:
number of bytes consumed
- Return type:
int
- read(size=18446744073709551615)
Read up to size bytes from the input stream.
- Parameters:
size (int) – number of bytes to read (default means read remaining stream)
- Returns:
consumed buffer contents as bytes (or empty string if EOF)
- Return type:
bytes
- readline(crlf=True, max_line_len=8192)
Read a single line from the input stream.
- Parameters:
crlf (bool) – whether lines are separated by CRLF or LF
max_line_len (int) – maximum line length (longer lines will still be consumed, but the return value will not be larger than this)
- Returns:
line contents (or empty string if EOF)
- Return type:
bytes
- tell()
Offset on the input stream.
- Returns:
offset
- Return type:
int
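The negotiate_stream option above refers to detecting the stream type automatically. How FastWARC does this internally is not described here; the following sketch only illustrates the general idea of sniffing well-known magic bytes at the start of a seekable stream, using the standard library. sniff_compression is a hypothetical helper name, and the magic numbers are the standard ones for gzip and the LZ4 frame format.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"               # first two bytes of a gzip member
LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"  # LZ4 frame magic number, little-endian


def sniff_compression(stream):
    """Guess the compression format from the first bytes of a seekable stream."""
    start = stream.tell()
    head = stream.read(4)
    stream.seek(start)  # rewind so the caller still sees the full stream
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(LZ4_FRAME_MAGIC):
        return "lz4"
    return "uncompressed"


# Example: a gzip-compressed buffer is recognised by its leading bytes.
sample = io.BytesIO(gzip.compress(b"WARC/1.1\r\n"))
kind = sniff_compression(sample)
```

On a stream that cannot be rewound, the bytes read for detection would have to be handed back to the decompressor, which is the situation the prepopulate() methods of GZipStream and LZ4Stream are documented for.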
- class fastwarc.stream_io.BytesIOStream
Bases:
IOStream
IOStream that uses an in-memory buffer.
- Parameters:
initial_data (bytes) – fill internal buffer with this initial data
- close()
Close the stream.
- getvalue()
Get buffer value.
- Returns:
buffer value
- Return type:
bytes
- seek(offset)
Seek to specified offset.
- Parameters:
offset (int) – seek offset
- tell()
Return current stream offset.
- Returns:
stream offset
- Return type:
int
- class fastwarc.stream_io.CompressingStream
Bases:
IOStream
Base class for compressed IOStream types.
- begin_member()
Begin compression member / frame (if not already started).
- Returns:
bytes written
- Return type:
int
- end_member()
End compression member / frame (if one has been started).
- Returns:
bytes written
- Return type:
int
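begin_member() and end_member() delimit compression members (frames). To illustrate what a member is, the sketch below uses the standard library gzip module rather than the FastWARC stream classes: each record is compressed as its own gzip member, and the concatenated members still form a valid gzip stream. Compressing one record per member is a common WARC convention; whether a given writer does so is an assumption here, and the record contents are made up.

```python
import gzip

records = [b"first record\n", b"second record\n"]

# Compress each record as its own gzip member and concatenate the members.
archive = b"".join(gzip.compress(record) for record in records)

# A multi-member gzip stream decompresses to the concatenated payloads.
restored = gzip.decompress(archive)
```

Keeping members separate is what allows a reader to start decompressing at the offset of an individual record instead of at the beginning of the file.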
- class fastwarc.stream_io.FileStream
Bases:
IOStream
Fast alternative to Python file objects for local files.
- Parameters:
filename (str) – input filename
mode (str) – file open mode
- close()
Close the stream.
- flush()
Flush stream buffer.
- seek(offset)
Seek to specified offset.
- Parameters:
offset (int) – seek offset
- tell()
Return current stream offset.
- Returns:
stream offset
- Return type:
int
- class fastwarc.stream_io.GZipStream
Bases:
CompressingStream
GZip IOStream implementation.
- Parameters:
raw_stream – raw data stream
compression_level (int) – GZip compression level (for compression only)
zlib (bool) – use raw deflate / zlib format instead of gzip
- begin_member()
Begin compression member / frame (if not already started).
- Returns:
bytes written
- Return type:
int
- close()
Close the stream.
- end_member()
End compression member / frame (if one has been started).
- Returns:
bytes written
- Return type:
int
- flush()
Flush stream buffer.
- prepopulate(deflate, initial_data)
Fill internal working buffer with initial data. Use if some initial data of the stream have already been consumed (e.g., for stream content negotiation). Has to be called before the first read().
- Parameters:
deflate (int) – True if data is uncompressed, False if data is compressed GZip data.
initial_data (bytes) – data to pre-populate
- tell()
Return current stream offset.
- Returns:
stream offset
- Return type:
int
- class fastwarc.stream_io.IOStream
Bases:
object
IOStream base class.
- close()
Close the stream.
- flush()
Flush stream buffer.
- read(size)
Read size bytes from stream.
- Parameters:
size (int) – bytes to read
- Returns:
read bytes
- Return type:
bytearray
- seek(offset)
Seek to specified offset.
- Parameters:
offset (int) – seek offset
- tell()
Return current stream offset.
- Returns:
stream offset
- Return type:
int
- write(data)
Write bytes to stream.
- Parameters:
data (bytes) – data to write
- Returns:
number of bytes written
- Return type:
int
- class fastwarc.stream_io.LZ4Stream
Bases:
CompressingStream
LZ4 IOStream implementation.
- Parameters:
raw_stream – raw data stream
compression_level (int) – LZ4 compression level (for compression only)
favor_dec_speed (bool) – favour decompression speed over compression speed and size
- begin_member()
Begin compression member / frame (if not already started).
- Returns:
bytes written
- Return type:
int
- close()
Close the stream.
- end_member()
End compression member / frame (if one has been started).
- Returns:
bytes written
- Return type:
int
- flush()
Flush stream buffer.
- prepopulate(initial_data)
Fill internal working buffer with initial data. Use if some initial data of the stream have already been consumed (e.g., for stream content negotiation). Has to be called before the first read().
- Parameters:
initial_data (bytes) – data to pre-populate
- tell()
Return current stream offset.
- Returns:
stream offset
- Return type:
int
- class fastwarc.stream_io.PythonIOStreamAdapter
Bases:
IOStream
IOStream adapter for file-like Python objects.
- Parameters:
py_stream – input Python stream object
- close()
Close the stream.
- flush()
Flush stream buffer.
- seek(offset)
Seek to specified offset.
- Parameters:
offset (int) – seek offset
- tell()
Return current stream offset.
- Returns:
stream offset
- Return type:
int
- fastwarc.stream_io.wrap_stream(raw_stream)
Wrap raw_stream into a PythonIOStreamAdapter if it is a Python object or return raw_stream unmodified if it is an IOStream already.
- Parameters:
raw_stream – stream to wrap
- Returns:
wrapped stream
- Return type: