HTTP Tools

Helper functions for parsing raw HTTP payloads.

Read Chunked HTTP Payloads

Unlike WARCIO, Resiliparse’s FastWARC does not automatically decode chunked HTTP responses. This is a deliberate design decision in favour of simplicity, since decoding chunked HTTP payloads is the crawler’s job. In the Common Crawl, for example, all chunked payloads are already decoded and the original Transfer-Encoding header is preserved as X-Crawler-Transfer-Encoding: chunked. We do acknowledge, however, that in some cases a WARC file still contains chunked payloads that need decoding, which is why Resiliparse provides read_http_chunk() as a helper function for this.
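
As a rough illustration of how to tell the two cases apart, the following sketch checks a set of HTTP headers for a Transfer-Encoding value that still lists chunked. The helper name is_still_chunked() and the plain dict of headers are assumptions made for this example; they are not part of Resiliparse or FastWARC.

```python
def is_still_chunked(headers):
    """Return True if the headers indicate a payload that is still chunk-encoded.

    `headers` is assumed to be a plain dict mapping header names to values.
    """
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalized = {name.lower(): value.lower() for name, value in headers.items()}
    # Transfer-Encoding may list several codings, e.g. "gzip, chunked".
    codings = [c.strip() for c in normalized.get('transfer-encoding', '').split(',')]
    return 'chunked' in codings


# Already decoded by the crawler: the header was renamed, as in the Common Crawl.
print(is_still_chunked({'X-Crawler-Transfer-Encoding': 'chunked'}))  # False
# Stored as received: the payload still needs decoding.
print(is_still_chunked({'Transfer-Encoding': 'chunked'}))  # True
```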

The function accepts a buffered reader (either a fastwarc.stream_io.BufferedReader or a file-like Python object that implements readline(), such as io.BytesIO) and is meant to be called repeatedly until it produces no further output. Each call returns a single chunk, which can be appended to the chunks read so far:

from fastwarc.stream_io import BufferedReader, BytesIOStream
from resiliparse.parse.http import read_http_chunk

chunked = b'''c\r\n\
Resiliparse \r\n\
6\r\n\
is an \r\n\
8\r\n\
awesome \r\n\
5\r\n\
tool.\r\n\
0\r\n\
\r\n'''

reader = BufferedReader(BytesIOStream(chunked))
decoded = b''
while chunk := read_http_chunk(reader):
    decoded += chunk

# b'Resiliparse is an awesome tool.'
print(decoded)
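
To make the per-call behaviour more concrete, here is a minimal pure-Python sketch of what reading a single chunk involves. It is not Resiliparse's actual implementation: the function name read_chunk_sketch() is hypothetical, and the sketch assumes well-formed input without trailer fields and does no error handling.

```python
import io


def read_chunk_sketch(stream):
    """Read one chunk from a chunk-encoded stream; return b'' at the end."""
    size_line = stream.readline()
    if not size_line:
        return b''  # stream exhausted
    # The chunk size is hexadecimal; anything after ';' is a chunk extension.
    size = int(size_line.split(b';', 1)[0].strip(), 16)
    if size == 0:
        # Last chunk: consume the terminating blank line (no trailers assumed).
        stream.readline()
        return b''
    data = stream.read(size)
    stream.readline()  # consume the CRLF that follows the chunk data
    return data


stream = io.BytesIO(
    b'c\r\nResiliparse \r\n6\r\nis an \r\n8\r\nawesome \r\n5\r\ntool.\r\n0\r\n\r\n')
decoded = b''
while chunk := read_chunk_sketch(stream):
    decoded += chunk

# b'Resiliparse is an awesome tool.'
print(decoded)
```

The empty return value at the end of the stream is what lets the while loop terminate, which mirrors the calling pattern shown for read_http_chunk() above.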

For convenience, you can also use iterate_http_chunks(), a generator that wraps read_http_chunk() and fully consumes the chunked stream:

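from resiliparse.parse.http import iterate_http_chunks

# Use a fresh reader: the loop above has already consumed the previous one
reader = BufferedReader(BytesIOStream(chunked))

# b'Resiliparse is an awesome tool.'
print(b''.join(iterate_http_chunks(reader)))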