FastWARC CLI

Besides the Python API, FastWARC also provides a command line interface via the fastwarc command, which enables working with WARC files on the console.

Run fastwarc [COMMAND] --help for detailed help listings.


Top-Level Commands

The following is a short listing of the top-level commands:

fastwarc

FastWARC Command Line Interface.

fastwarc [OPTIONS] COMMAND [ARGS]...

Commands

benchmark

Benchmark FastWARC read performance.

check

Verify WARC record consistency.

extract

Extract WARC record by offset.

index

Index WARC records as CDXJ.

recompress

Recompress a WARC file.


Full Command Listing

Below is a full description of all available commands:

fastwarc

FastWARC Command Line Interface.

fastwarc [OPTIONS] COMMAND [ARGS]...

benchmark

Benchmark FastWARC read performance.

The FastWARC CLI comes with a benchmarking tool for measuring WARC record decompression and parsing performance on your own machine. The benchmarking results can be compared directly with WARCIO.

Supported WARC sources are local files, S3, and HTTP(S) URLs. Supported compression algorithms are GZip, LZ4, or uncompressed.

The read benchmarking tool has additional options, such as reading WARCs directly from a remote S3 data source using Boto3.

See FastWARC Benchmarks for more information and example benchmarking results.

fastwarc benchmark [OPTIONS] [INPUT_URL]...

Options

-d, --decompress-alg <decompress_alg>

Decompression algorithm

Default

auto

Options

gzip | lz4 | uncompressed | auto

-e, --endpoint-url <endpoint_url>

S3 endpoint URL

Default

https://s3.amazonaws.com

-a, --aws-access-key <aws_access_key>

AWS access key for s3:// URLs

-s, --aws-secret-key <aws_secret_key>

AWS secret key for s3:// URLs (leave empty to read from STDIN)

--is-prefix

Treat input URL as prefix (only for S3)

-p, --use-python-stream

Use slower Python I/O instead of native FileStream for local files

-H, --parse-http

Parse HTTP headers

-v, --verify-digests

Verify record block digests

-f, --filter-type <filter_type>

Filter for specific WARC record types

Default

any_type

Options

warcinfo | response | resource | request | metadata | revisit | conversation | continuation | any_type

-w, --bench-warcio

Compare FastWARC performance with WARCIO

Arguments

INPUT_URL

Optional argument(s)
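
As a rough illustration of how the options above combine, the following Python sketch assembles a fastwarc benchmark command line. The helper name benchmark_argv and the file name example.warc.gz are hypothetical and not part of FastWARC; only flags listed above are used.

```python
import shlex

# Record types accepted by --filter-type, as listed above.
FILTER_TYPES = {"warcinfo", "response", "resource", "request", "metadata",
                "revisit", "conversation", "continuation", "any_type"}


def benchmark_argv(urls, filter_type="any_type", parse_http=False,
                   verify_digests=False, bench_warcio=False):
    """Build an argument list for `fastwarc benchmark` (hypothetical helper)."""
    if filter_type not in FILTER_TYPES:
        raise ValueError(f"unknown record type: {filter_type}")
    argv = ["fastwarc", "benchmark", "--filter-type", filter_type]
    if parse_http:
        argv.append("--parse-http")
    if verify_digests:
        argv.append("--verify-digests")
    if bench_warcio:
        argv.append("--bench-warcio")
    # INPUT_URL arguments go last; several may be given.
    return argv + list(urls)


# example.warc.gz is a placeholder file name.
print(shlex.join(benchmark_argv(["example.warc.gz"], filter_type="response",
                                parse_http=True, bench_warcio=True)))
```

The resulting list can be passed to subprocess.run to execute the benchmark on a machine where fastwarc is installed.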

check

Verify WARC record consistency.

Verify the digests of all records in the given WARC file. Block digests are checked by default; payload digests are checked as well if --verify-payloads is given. The command prints a summary of all corrupted and (optionally) all intact records.

The command will exit with a non-zero exit code if at least one record fails verification.

fastwarc check [OPTIONS] INFILE

Options

-d, --decompress-alg <decompress_alg>

Decompression algorithm

Default

auto

Options

gzip | lz4 | uncompressed | auto

-p, --verify-payloads

Also verify payload digests

-q, --quiet

Do not print progress information

-o, --output <output>

Output file with verification details

Arguments

INFILE

Required argument
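
Because the command signals failed verification through its exit code, it can be used as a gate in scripts. The following is a minimal sketch of that idea; check_argv and warc_is_intact are hypothetical helper names, and the run parameter exists only so that the logic can be exercised without an actual WARC file.

```python
import subprocess


def check_argv(infile, verify_payloads=False, details_file=None):
    """Build an argument list for `fastwarc check` (hypothetical helper)."""
    argv = ["fastwarc", "check", "--quiet"]
    if verify_payloads:
        argv.append("--verify-payloads")
    if details_file is not None:
        argv += ["--output", details_file]
    return argv + [infile]


def warc_is_intact(infile, run=subprocess.run, **options):
    """Return True if the check command exits with code 0.

    Any non-zero exit code is treated as a failure here, including errors
    unrelated to digests (for example, a missing input file).
    """
    return run(check_argv(infile, **options)).returncode == 0


# Demonstration with a stub runner that simulates a failed verification.
class _FailedRun:
    returncode = 1


print(warc_is_intact("example.warc.gz", run=lambda argv: _FailedRun()))
```

With the default run argument, warc_is_intact("example.warc.gz", verify_payloads=True) would invoke the installed fastwarc command.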

extract

Extract WARC record by offset.

You can extract individual records at a given byte offset with either just headers, payload, or both.

fastwarc extract [OPTIONS] INFILE OFFSET

Options

-o, --output <output>

Output file, default is stdout

--payload

Output only record payload (transfer and/or content encoding are preserved)

--headers

Output only record (and HTTP) headers

Arguments

INFILE

Required argument

OFFSET

Required argument
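
The choice between headers, payload, or both can be sketched as follows. The helper extract_argv is hypothetical, and the sketch assumes that passing neither --headers nor --payload outputs the complete record; the file name and offset are placeholders.

```python
def extract_argv(infile, offset, part="both", output=None):
    """Build an argument list for `fastwarc extract` (hypothetical helper).

    part is "headers", "payload", or "both". Assumption: omitting both
    --headers and --payload outputs the complete record.
    """
    if part not in ("headers", "payload", "both"):
        raise ValueError(f"unknown part: {part}")
    if offset < 0:
        raise ValueError("offset must be a non-negative byte offset")
    argv = ["fastwarc", "extract"]
    if part != "both":
        argv.append(f"--{part}")
    if output is not None:
        argv += ["--output", output]
    # Positional arguments: INFILE first, then OFFSET.
    return argv + [infile, str(offset)]


print(extract_argv("example.warc.gz", 1024, part="headers"))
```

Suitable offsets can be obtained from an index that includes the offset field.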

index

Index WARC records as CDXJ.

WARC files can be indexed to the CDXJ format with a configurable set of fields.

fastwarc index [OPTIONS] [INFILES]...

Options

-o, --output <output>

Output file, default is stdout

-f, --fields <FIELDS>

Comma-separated list of indexed fields, e.g. “offset”, “length”, “filename”, “http:status”, “http:<http-header>”, or “<warc-record-header>”

Default

offset,warc-type,warc-target-uri

--preserve-multi-header

Preserve multiple values of HTTP headers as JSON list

Arguments

INFILES

Optional argument(s)
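
To illustrate how the --fields value is composed, the sketch below joins a list of field names into the comma-separated form described above. The helper index_argv and the file names are hypothetical; the field names follow the patterns given in the option description.

```python
# Default field list as documented for --fields.
DEFAULT_FIELDS = ("offset", "warc-type", "warc-target-uri")


def index_argv(infiles, fields=DEFAULT_FIELDS, output=None,
               preserve_multi_header=False):
    """Build an argument list for `fastwarc index` (hypothetical helper)."""
    fields = [f.strip() for f in fields]
    # A field name must be non-empty and must not contain the separator.
    if not fields or any(not f or "," in f for f in fields):
        raise ValueError("fields must be non-empty names without commas")
    argv = ["fastwarc", "index", "--fields", ",".join(fields)]
    if preserve_multi_header:
        argv.append("--preserve-multi-header")
    if output is not None:
        argv += ["--output", output]
    return argv + list(infiles)


# Index offset, length, HTTP status, and one HTTP header of a placeholder file.
print(index_argv(["example.warc.gz"],
                 fields=["offset", "length", "http:status", "http:content-type"],
                 output="example.cdxj"))
```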

recompress

Recompress a WARC file.

This command allows you to recompress a WARC file if it is uncompressed or not compressed properly at the record level, or if you want to recompress a GZip WARC as LZ4 or vice versa.

fastwarc recompress [OPTIONS] INFILE OUTFILE

Options

-c, --compress-alg <compress_alg>

Compression algorithm to use for output file

Default

auto

Options

gzip | lz4 | uncompressed | auto

-d, --decompress-alg <decompress_alg>

Decompression algorithm for decoding input file (auto tries to detect based on file extension)

Default

auto

Options

gzip | lz4 | uncompressed | auto

-l, --compress-level <compress_level>

Compression level (defaults to max)

-q, --quiet

Do not print progress information

Arguments

INFILE

Required argument

OUTFILE

Required argument
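
As an example of the GZip to LZ4 case, the following sketch builds a recompress command line and validates the algorithm names against the choices listed above. The helper recompress_argv and both file names are hypothetical.

```python
# Algorithm names accepted by --compress-alg and --decompress-alg.
ALGORITHMS = ("gzip", "lz4", "uncompressed", "auto")


def recompress_argv(infile, outfile, compress_alg="auto",
                    decompress_alg="auto", compress_level=None, quiet=False):
    """Build an argument list for `fastwarc recompress` (hypothetical helper)."""
    for alg in (compress_alg, decompress_alg):
        if alg not in ALGORITHMS:
            raise ValueError(f"unknown algorithm: {alg}")
    argv = ["fastwarc", "recompress",
            "--compress-alg", compress_alg,
            "--decompress-alg", decompress_alg]
    # Leaving the level unset keeps the documented default (max).
    if compress_level is not None:
        argv += ["--compress-level", str(compress_level)]
    if quiet:
        argv.append("--quiet")
    return argv + [infile, outfile]


# Recompress a placeholder GZip WARC as LZ4.
print(recompress_argv("example.warc.gz", "example.warc.lz4",
                      compress_alg="lz4", quiet=True))
```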