FastWARC
Resiliparse FastWARC API documentation.
WARC
- class fastwarc.warc.WarcRecordType(value)
Enum indicating a WARC record’s type as given by its WARC-Type header. Multiple types can be combined with bitwise operators (such as |) for filtering records.
- unknown = 512
Special type: unknown record type (filter only)
- any_type = 65535
Special type: any record type (filter only)
- no_type = 0
Special type: no record type (filter only)
- warcinfo = 2
- response = 4
- resource = 8
- request = 16
- metadata = 32
- revisit = 64
- conversion = 128
- continuation = 256
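The flag values listed above combine like any Python IntFlag. The following is a minimal sketch using a stand-in enum: RecordTypeSketch and matches() are hypothetical names that only mirror the documented values, and with FastWARC installed fastwarc.warc.WarcRecordType would take the stand-in's place. It shows how a bitmask of record types selects some records and skips others.

```python
from enum import IntFlag


class RecordTypeSketch(IntFlag):
    """Stand-in mirroring the documented WarcRecordType values (not the real class).

    The special filter-only values (unknown, any_type) are omitted here.
    """
    no_type = 0
    warcinfo = 2
    response = 4
    resource = 8
    request = 16
    metadata = 32
    revisit = 64
    conversion = 128
    continuation = 256


def matches(mask: int, record_type: int) -> bool:
    """Return True if record_type is selected by the bitmask."""
    return bool(mask & record_type)


# Select request and response records only.
mask = RecordTypeSketch.request | RecordTypeSketch.response
```

A mask built this way is the kind of value the record_types parameter of ArchiveIterator is documented to accept.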
- class fastwarc.warc.ArchiveIterator
Bases: object
WARC record stream iterator.
The iterator can be initialized from a fastwarc.stream_io.IOStream, a file-like Python object, or a string. A string is treated as a file path or a URL. If fsspec is installed, it is used for opening the file or URL (unless fsspec_args=False) and the stream is wrapped in a PythonIOStreamAdapter. If fsspec is not installed, a FileStream is used instead.
- Parameters:
stream – input stream (IOStream or file-like Python object) or file name/URL as string
parse_http (bool) – whether to parse HTTP records automatically (disable for better performance if not needed)
record_types (int) – bitmask of WarcRecordType record types to return (others will be skipped)
min_content_length (int) – skip records with Content-Length less than this
max_content_length (int) – skip records with Content-Length larger than this
func_filter (Callable) – Python callable taking a WarcRecord and returning a bool for further record filtering
verify_digests (bool) – skip records which have no or an invalid block digest
strict_mode (bool) – enforce strict spec compliance (setting this to False will enable quirks such as LF instead of CRLF for headers, missing Content-Length or otherwise malformed records)
auto_decode (str) – automatically decode record body if Content-Encoding or Transfer-Encoding headers are set (accepted values: 'none' [default], 'content', 'transfer', 'all'; has no effect if parse_http is False)
fsspec_args – dict of arguments to pass to fsspec, or False to disable fsspec
- __iter__()
Iterate all WarcRecord items in the current WARC stream.
- Return type:
t.Iterable[WarcRecord]
- __next__()
Implements an iterator that can be used with next().
- Return type:
WarcRecord
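As a rough illustration of the parameters described above, the sketch below counts request and response records in a WARC file. The function name count_record_types and the path argument are hypothetical; the FastWARC import is done lazily inside the function so the snippet can be loaded without the package installed, and HTTP parsing is turned off because only the record type is needed.

```python
from collections import Counter


def count_record_types(path: str) -> Counter:
    """Count request/response records in a WARC file by record type name."""
    # Imported lazily so this sketch can be loaded without FastWARC installed.
    from fastwarc.warc import ArchiveIterator, WarcRecordType

    counts = Counter()
    # Bitmask of the record types to return; all other records are skipped.
    wanted = WarcRecordType.request | WarcRecordType.response
    with open(path, 'rb') as stream:
        for record in ArchiveIterator(stream, record_types=wanted, parse_http=False):
            counts[record.record_type.name] += 1
    return counts
```

Passing an open binary file object is one of the documented input forms; an IOStream or a path string would be alternatives.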
- class fastwarc.warc.WarcHeaderMap
Bases: object
Dict-like type representing a WARC or HTTP header block.
- Parameters:
encoding (str) – header source encoding
- __iter__()
Iterate all header map items.
- Return type:
t.Iterable[(str, str)]
- append(key, value)
Append header (use if header name is not unique).
- Parameters:
key (str) – header key
value (str) – header value
- asdict() → Dict[str, str]
Headers as Python dict.
If multiple headers have the same key, only the last occurrence will be returned.
- Return type:
t.Dict[str, str]
- astuples() → Tuple[Tuple[str, str]]
Headers as a series of tuples, including multiple headers with the same key. Use this over asdict() if header keys are not necessarily unique.
- Return type:
((str, str),)
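The difference between the dict form and the tuple form can be illustrated with plain Python data. This is a sketch of the documented semantics only, not of the WarcHeaderMap implementation; the header values are made up for illustration.

```python
# Header pairs in the shape astuples() is documented to return (made-up values).
pairs = (
    ('WARC-Type', 'response'),
    ('WARC-Concurrent-To', '<urn:uuid:aaaa>'),
    ('WARC-Concurrent-To', '<urn:uuid:bbbb>'),
)

# Building a dict keeps only the last value per key, which matches the
# documented behaviour of asdict() for repeated header names.
as_dict = dict(pairs)

# Collecting every value of a repeated header requires the tuple form.
concurrent_to = [value for key, value in pairs if key == 'WARC-Concurrent-To']
```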
- clear()
Clear all headers.
- get(key, default=None) → str
Get header value or default.
If multiple headers have the same key, only the last occurrence will be returned.
- Parameters:
key (str) – header key
default (str) – default value if key not found
- Return type:
str
- items()
Item view of keys and values.
If multiple headers have the same key, only the last occurrence will be returned.
- Return type:
t.Iterable[(str, str)]
- keys()
Iterable of header keys.
If multiple headers have the same key, only the last occurrence will be returned.
- Return type:
t.Iterable[str]
- values()
Iterable of header values.
If multiple headers have the same key, only the last occurrence will be returned.
- Return type:
t.Iterable[str]
- write(stream)
Write header block into stream.
- reason_phrase
HTTP reason phrase (unset if header block is not an HTTP header block).
- Type:
str or None
- status_code
HTTP status code (unset if header block is not an HTTP header block).
- Type:
int or None
- status_line
Header status line.
- Type:
str
- class fastwarc.warc.WarcRecord
Bases: object
A WARC record.
WARC records are picklable, but pickling will freeze() the WARC record.
- freeze()
“Freeze” a record by baking in the remaining payload stream contents.
Freezing a record makes the WarcRecord instance copyable and reusable by decoupling it from the underlying raw WARC stream. Instead of reading directly from the raw stream, a frozen record maintains an internal buffer the size of the remaining payload stream contents at the time of calling freeze().
Freezing a record will advance the underlying raw stream.
- init_headers(content_length=0, record_type=<WarcRecordType.no_type: 0>, record_urn=None)
Initialize mandatory headers in a fresh WarcRecord instance.
- Parameters:
content_length (int) – WARC record body length in bytes
record_type (WarcRecordType) – WARC-Type
record_urn (bytes) – WARC-Record-ID as URN without '<', '>' (if unset, a random URN will be generated)
- parse_http(strict_mode=True, auto_decode='none')
Parse HTTP headers and advance content reader.
It is safe to call this method multiple times, even if the record is not an HTTP record.
- Parameters:
strict_mode (bool) – enforce CRLF line endings; setting this to False will also allow plain LF
auto_decode (str) – automatically decode record body if Content-Encoding or Transfer-Encoding headers are set (accepted values: 'none' [default], 'content', 'transfer', 'all')
- set_bytes_content(b)
Set WARC body.
- Parameters:
b (bytes) – body as bytes
- verify_block_digest(consume=False)
Verify whether record block digest is valid.
- Parameters:
consume (bool) – do not create an in-memory copy of the record stream (will fully consume the rest of the record)
- Returns:
True if digest exists and is valid
- Return type:
bool
- verify_payload_digest(consume=False)
Verify whether record payload digest is valid.
- Parameters:
consume (bool) – do not create an in-memory copy of the record stream (will fully consume the rest of the record)
- Returns:
True if record is HTTP record and digest exists and is valid
- Return type:
bool
- write(stream, checksum_data=False, payload_digest=None, chunk_size=16384)
Write WARC record onto a stream.
- Parameters:
stream – output stream
checksum_data (bool) – add block and payload digest headers
payload_digest (bytes) – optional SHA-1 payload digest as bytes
chunk_size (int) – write block size
- Returns:
number of bytes written
- Return type:
int
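Combining init_headers(), set_bytes_content() and write(), a record can be assembled and serialized in memory. The following is a minimal sketch under the assumption that BytesIOStream can be constructed without initial data; build_resource_record is a hypothetical helper name, and FastWARC is imported lazily so the snippet loads without the package installed.

```python
def build_resource_record(body: bytes) -> bytes:
    """Serialize a single 'resource' record holding body and return the raw bytes."""
    # Imported lazily so this sketch can be loaded without FastWARC installed.
    from fastwarc.stream_io import BytesIOStream
    from fastwarc.warc import WarcRecord, WarcRecordType

    # Create a fresh record and fill in the mandatory headers.
    record = WarcRecord()
    record.init_headers(content_length=len(body), record_type=WarcRecordType.resource)
    record.set_bytes_content(body)

    # Write the record into an in-memory stream and return its contents.
    out = BytesIOStream()
    record.write(out)
    return out.getvalue()
```

To write compressed output instead, the in-memory stream could be replaced by one of the compressing streams documented in the StreamIO section.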
- content_length
Remaining WARC record length in bytes (not necessarily the same as the Content-Length header).
- Type:
int
- headers
WARC record headers.
- Type:
WarcHeaderMap
- http_charset
HTTP charset/encoding as returned by the server or None if no valid charset is set. A returned string is guaranteed to be a valid Python encoding name.
- Type:
str or None
- http_content_type
Plain HTTP Content-Type without additional fields such as charset=.
- Type:
str or None
- http_date
Parsed HTTP Date header or None if server did not return a valid HTTP date.
- Type:
datetime.datetime | None
- http_headers
HTTP headers if the record is an HTTP record and its HTTP headers have already been parsed.
- Type:
WarcHeaderMap or None
- http_last_modified
Parsed HTTP Last-Modified header or None if server did not return a valid HTTP modification date.
- Type:
datetime.datetime | None
- is_http
Whether record is an HTTP record.
Modifying this property will also affect the Content-Type of this record.
- Type:
bool
- is_http_parsed
Whether HTTP headers have been parsed.
- Type:
bool
- reader
Reader for the remaining WARC record content.
- Type:
BufferedReader
- record_date
WARC Date.
- Type:
datetime.datetime | None
- record_id
Record ID (same as headers['WARC-Record-ID']).
- Type:
str
- record_type
Record type (same as headers['WARC-Type']).
- Type:
WarcRecordType
- stream_pos
WARC record start offset in the original (uncompressed) stream.
- Type:
int
- class fastwarc.warc.WarcRecordType(value)
Bases: IntFlag
An enumeration.
- fastwarc.warc.has_block_digest(record)
Filter predicate for checking if record has a block digest.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.has_payload_digest(record)
Filter predicate for checking if record has a payload digest.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.is_concurrent(record)
Filter predicate for checking if record is concurrent to another record.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.is_http(record)
Filter predicate for checking if record is an HTTP record.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.is_warc_10(record)
Filter predicate for checking if record is a WARC/1.0 record.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
- fastwarc.warc.is_warc_11(record)
Filter predicate for checking if record is a WARC/1.1 record.
- Parameters:
record (WarcRecord) – WARC record
- Return type:
bool
StreamIO
- exception fastwarc.stream_io.FastWARCError
Bases: Exception
Generic FastWARC exception.
- exception fastwarc.stream_io.ReaderStaleError
Bases: FastWARCError
FastWARC reader stale error.
- exception fastwarc.stream_io.StreamError
Bases: FastWARCError
FastWARC stream error.
- class fastwarc.stream_io.BrotliStream
Bases: CompressingStream
Brotli IOStream implementation.
The implementation relies on Google’s brotli Python package; a port to a native C version is planned for a later release.
- Parameters:
raw_stream – raw data stream or file name / URL
quality (int) – compression quality (higher quality means better compression, but less speed)
lgwin (int) – Base 2 logarithm of the sliding window size in the range 16 to 24
lgblock (int) – Base 2 logarithm of the maximum input block size in the range 16 to 24 (will be set based on quality if value is 0)
fsspec_args (dict) – dict of arguments to pass to fsspec (set to False to disable fsspec)
- begin_member()
Begin compression member / frame (if not already started).
- Returns:
bytes written
- Return type:
int
- close()
Close the stream.
- end_member()
End compression member / frame (if one has been started).
- Returns:
bytes written
- Return type:
int
- flush()
Flush stream buffer.
- seek(offset)
Seek to specified offset.
- Parameters:
offset (int) – seek offset
- tell()
Return current stream offset.
- Returns:
stream offset
- Return type:
int
- class fastwarc.stream_io.BufferedReader
Bases: object
Buffered reader operating on an IOStream instance.
- Parameters:
stream (IOStream) – stream to operate on
buf_size (int) – internal buffer size
negotiate_stream (bool) – whether to auto-negotiate stream type
- close()
Close stream.
- consume(size=18446744073709551615)
Consume up to size bytes from the input stream without allocating a buffer for it.
- Parameters:
size (int) – number of bytes to read (default means read remaining stream)
- Returns:
number of bytes consumed
- Return type:
int
- read(size=18446744073709551615)
Read up to size bytes from the input stream.
- Parameters:
size (int) – number of bytes to read (default means read remaining stream)
- Returns:
consumed buffer contents as bytes (or empty string if EOF)
- Return type:
bytes
- readline(crlf=True, max_line_len=8192)
Read a single line from the input stream.
- Parameters:
crlf (bool) – whether lines are separated by CRLF or LF
max_line_len (int) – maximum line length (longer lines will still be consumed, but the return value will not be larger than this)
- Returns:
line contents (or empty string if EOF)
- Return type:
bytes
- tell()
Offset on the input stream.
- Returns:
offset
- Return type:
int
- class fastwarc.stream_io.BytesIOStream
Bases: IOStream
IOStream that uses an in-memory buffer.
- Parameters:
initial_data (bytes) – fill internal buffer with this initial data
- close()
Close the stream.
- getvalue()
Get buffer value.
- Returns:
buffer value
- Return type:
bytes
- seek(offset)
Seek to specified offset.
- Parameters:
offset (int) – seek offset
- tell()
Return current stream offset.
- Returns:
stream offset
- Return type:
int
- class fastwarc.stream_io.CompressingStream
Bases: IOStream
Base class for compressed IOStream types.
- begin_member()
Begin compression member / frame (if not already started).
- Returns:
bytes written
- Return type:
int
- end_member()
End compression member / frame (if one has been started).
- Returns:
bytes written
- Return type:
int
- class fastwarc.stream_io.FileStream
Bases: IOStream
Fast alternative to Python file objects for local files.
- Parameters:
filename (str) – input filename
mode (str) – file open mode
- close()
Close the stream.
- flush()
Flush stream buffer.
- seek(offset)
Seek to specified offset.
- Parameters:
offset (int) – seek offset
- tell()
Return current stream offset.
- Returns:
stream offset
- Return type:
int
- class fastwarc.stream_io.GZipStream
Bases: CompressingStream
GZip IOStream implementation.
- Parameters:
raw_stream – raw data stream or file name / URL
compression_level (int) – GZip compression level (for compression only)
zlib (bool) – use raw deflate / zlib format instead of gzip
fsspec_args (dict) – dict of arguments to pass to fsspec (set to False to disable fsspec)
- begin_member()
Begin compression member / frame (if not already started).
- Returns:
bytes written
- Return type:
int
- close()
Close the stream.
- end_member()
End compression member / frame (if one has been started).
- Returns:
bytes written
- Return type:
int
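The notion of a compression member can be illustrated with Python's standard gzip module. This is a sketch of the general gzip concept that begin_member() and end_member() refer to, not of how GZipStream is implemented; the record contents are made up.

```python
import gzip

records = [b'first record\r\n', b'second record\r\n']

# Compress each record as its own gzip member and concatenate the members.
members = [gzip.compress(r) for r in records]
archive = b''.join(members)

# A multi-member gzip stream decompresses to the concatenation of its members.
restored = gzip.decompress(archive)

# Because every member is self-contained, a reader can start at a member
# boundary and decompress only the data from that point on.
offset_of_second = len(members[0])
second_only = gzip.decompress(archive[offset_of_second:])
```

Compressing each WARC record as a separate member is what makes it possible to seek to a record offset in a compressed archive and read a single record from there.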
- flush()
Flush stream buffer.
- prepopulate(deflate, initial_data)
Fill internal working buffer with initial data. Use if some initial data of the stream have already been consumed (e.g., for stream content negotiation). Has to be called before the first read().
- Parameters:
deflate (int) – True if data is uncompressed, False if data is compressed GZip data.
initial_data (bytes) – data to pre-populate
- tell()
Return current stream offset.
- Returns:
stream offset
- Return type:
int
- class fastwarc.stream_io.IOStream
Bases: object
IOStream base class.
- close()
Close the stream.
- flush()
Flush stream buffer.
- read(size)
Read size bytes from stream.
- Parameters:
size (int) – bytes to read
- Returns:
read bytes
- Return type:
bytearray
- seek(offset)
Seek to specified offset.
- Parameters:
offset (int) – seek offset
- tell()
Return current stream offset.
- Returns:
stream offset
- Return type:
int
- write(data)
Write bytes to stream.
- Parameters:
data (bytes) – data to write
- Returns:
number of bytes written
- Return type:
int
- class fastwarc.stream_io.LZ4Stream
Bases: CompressingStream
LZ4 IOStream implementation.
- Parameters:
raw_stream – raw data stream or file name / URL
compression_level (int) – LZ4 compression level (for compression only)
favor_dec_speed (bool) – favour decompression speed over compression speed and size
fsspec_args (dict) – dict of arguments to pass to fsspec (set to False to disable fsspec)
- begin_member()
Begin compression member / frame (if not already started).
- Returns:
bytes written
- Return type:
int
- close()
Close the stream.
- end_member()
End compression member / frame (if one has been started).
- Returns:
bytes written
- Return type:
int
- flush()
Flush stream buffer.
- prepopulate(initial_data)
Fill internal working buffer with initial data. Use if some initial data of the stream have already been consumed (e.g., for stream content negotiation). Has to be called before the first read().
- Parameters:
initial_data (bytes) – data to pre-populate
- tell()
Return current stream offset.
- Returns:
stream offset
- Return type:
int
- class fastwarc.stream_io.PythonIOStreamAdapter
Bases: IOStream
IOStream adapter for file-like Python objects.
- Parameters:
py_stream – input Python stream object
- close()
Close the stream.
- flush()
Flush stream buffer.
- seek(offset)
Seek to specified offset.
- Parameters:
offset (int) – seek offset
- tell()
Return current stream offset.
- Returns:
stream offset
- Return type:
int
- fastwarc.stream_io.wrap_stream(raw_stream, mode='rb', fsspec_args=None)
Wrap raw_stream into a PythonIOStreamAdapter if it is a file-like Python object, or return raw_stream unmodified if it is already an IOStream. Instead of a stream, you can also pass a string, which is treated as a file path or a URL. If fsspec is installed, it is used for opening the file or URL (unless fsspec_args=False), and a PythonIOStreamAdapter is returned. If fsspec is not installed, a FileStream is returned instead.
- Parameters:
raw_stream – stream to wrap
mode (str) – stream mode for fsspec open
fsspec_args (dict) – dict of arguments to pass to fsspec (set to False to disable fsspec)
- Returns:
wrapped stream
- Return type:
IOStream