FastWARC CLI
Besides the Python API, FastWARC also provides a command line interface via the fastwarc
command, which enables working with WARC files on the console.
Run fastwarc [COMMAND] --help for detailed help listings.
Top-Level Commands
The top-level commands are listed below:
fastwarc
FastWARC Command Line Interface.
fastwarc [OPTIONS] COMMAND [ARGS]...
Commands
- benchmark
Benchmark FastWARC read performance.
- check
Verify WARC record consistency.
- extract
Extract WARC record by offset.
- index
Index WARC records as CDXJ.
- recompress
Recompress a WARC file.
Full Command Listing
Below is a full description of all available commands:
benchmark
Benchmark FastWARC read performance.
The FastWARC CLI comes with a benchmarking tool for measuring WARC record decompression and parsing performance on your own machine. The results can be compared directly with WARCIO.
Supported WARC sources are local files and S3 or HTTP(S) URLs. Input may be GZip-compressed, LZ4-compressed, or uncompressed.
Additional options control how the input is read, such as reading WARCs directly from a remote S3 data source using Boto3.
See FastWARC Benchmarks for more information and example benchmarking results.
fastwarc benchmark [OPTIONS] [INPUT_URL]...
Options
- -d, --decompress-alg <decompress_alg>
Decompression algorithm
- Default:
auto
- Options:
gzip | lz4 | uncompressed | auto
- -e, --endpoint-url <endpoint_url>
S3 endpoint URL
- Default:
https://s3.amazonaws.com
- -a, --aws-access-key <aws_access_key>
AWS access key for s3:// URLs
- -s, --aws-secret-key <aws_secret_key>
AWS secret key for s3:// URLs (leave empty to read from STDIN)
- --is-prefix
Treat input URL as prefix (only for S3)
- -p, --use-python-stream
Use slower Python I/O instead of native FileStream for local files
- -H, --parse-http
Parse HTTP headers
- -v, --verify-digests
Verify record block digests
- -f, --filter-type <filter_type>
Filter for specific WARC record types
- Default:
any_type
- Options:
warcinfo | response | resource | request | metadata | revisit | conversation | continuation | any_type
- -w, --bench-warcio
Compare FastWARC performance with WARCIO
Arguments
- INPUT_URL
Optional argument(s)
check
Verify WARC record consistency.
Verify the block digests (and, optionally, the payload digests) of all records in the given WARC file, then print a summary of all corrupted and (optionally) all intact records.
The command will exit with a non-zero exit code if at least one record fails verification.
fastwarc check [OPTIONS] INFILE
Options
- -d, --decompress-alg <decompress_alg>
Decompression algorithm
- Default:
auto
- Options:
gzip | lz4 | uncompressed | auto
- -p, --verify-payloads
Also verify payload digests
- -q, --quiet
Do not print progress information
- -o, --output <output>
Output file with verification details
Arguments
- INFILE
Required argument
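Because the exit code signals whether verification succeeded, the command lends itself to scripting. The following is a minimal Python sketch, not part of FastWARC: the helper names build_check_command and warc_is_intact are hypothetical, only the options documented above are used, and the fastwarc executable is assumed to be on PATH.

```python
import subprocess
from typing import List, Optional


def build_check_command(infile: str, verify_payloads: bool = False,
                        quiet: bool = False,
                        output: Optional[str] = None) -> List[str]:
    """Assemble the argument list for `fastwarc check` (hypothetical helper)."""
    cmd = ["fastwarc", "check"]
    if verify_payloads:
        cmd.append("--verify-payloads")
    if quiet:
        cmd.append("--quiet")
    if output is not None:
        cmd += ["--output", output]
    cmd.append(infile)
    return cmd


def warc_is_intact(infile: str, **options) -> bool:
    """Run the check and map the exit code to a boolean.

    Any non-zero exit code counts as a failure here, which also covers
    errors unrelated to digests (for example a missing input file).
    """
    result = subprocess.run(build_check_command(infile, **options))
    return result.returncode == 0
```

A call such as warc_is_intact("crawl.warc.gz", verify_payloads=True, quiet=True) would then return False as soon as one record fails verification; the file name is only an example.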
extract
Extract WARC record by offset.
You can extract an individual record at a given byte offset and output only its headers, only its payload, or both.
fastwarc extract [OPTIONS] INFILE OFFSET
Options
- -o, --output <output>
Output file, default is stdout
- --payload
Output only record payload (transfer and/or content encoding are preserved)
- --headers
Output only record (and HTTP) headers
Arguments
- INFILE
Required argument
- OFFSET
Required argument
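Since the output goes to stdout by default, a record can be captured from a script. The sketch below illustrates this in Python under stated assumptions: build_extract_command and extract_record are hypothetical helper names, the fastwarc executable is assumed to be on PATH, and the offset is assumed to come from a previously built index.

```python
import subprocess
from typing import List


def build_extract_command(infile: str, offset: int,
                          part: str = "both") -> List[str]:
    """Assemble the argument list for `fastwarc extract` (hypothetical helper).

    `part` selects "headers", "payload", or "both" (no flag).
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    cmd = ["fastwarc", "extract"]
    if part == "payload":
        cmd.append("--payload")
    elif part == "headers":
        cmd.append("--headers")
    elif part != "both":
        raise ValueError("part must be 'headers', 'payload', or 'both'")
    cmd += [infile, str(offset)]
    return cmd


def extract_record(infile: str, offset: int, part: str = "both") -> bytes:
    """Run the command and return whatever it wrote to stdout as bytes."""
    result = subprocess.run(build_extract_command(infile, offset, part),
                            check=True, capture_output=True)
    return result.stdout
```

The payload is returned as raw bytes because, as noted for --payload, any transfer or content encoding is left in place.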
index
Index WARC records as CDXJ.
WARC files can be indexed to the CDXJ format with a configurable set of fields.
fastwarc index [OPTIONS] [INFILES]...
Options
- -o, --output <output>
Output file, default is stdout
- -f, --fields <FIELDS>
Comma-separated list of indexed fields, e.g. “offset”, “length”, “filename”, “http:status”, “http:<http-header>”, or “<warc-record-header>”
- Default:
offset,warc-type,warc-target-uri
- --preserve-multi-header
Preserve multiple values of HTTP headers as JSON list
Arguments
- INFILES
Optional argument(s)
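The resulting index can be post-processed with a few lines of Python. The sketch below assumes the common CDXJ layout, in which each line consists of a plain-text prefix (typically a sortable URL key and a timestamp) followed by a JSON object holding the selected fields. The function name parse_cdxj_line and the sample line are hypothetical and do not come from real FastWARC output.

```python
import json
from typing import Any, Dict, Tuple


def parse_cdxj_line(line: str) -> Tuple[str, Dict[str, Any]]:
    """Split a CDXJ line into its plain-text prefix and its JSON fields.

    Assumes the prefix itself contains no "{" character.
    """
    prefix, brace, rest = line.strip().partition("{")
    if not brace:
        raise ValueError("no JSON object found in CDXJ line")
    return prefix.strip(), json.loads(brace + rest)


# Hypothetical line, shaped like an entry for the default field list.
sample = ('com,example)/ 20240101000000 '
          '{"offset": "1024", "warc-type": "response", '
          '"warc-target-uri": "https://example.com/"}')
prefix, fields = parse_cdxj_line(sample)

# int() accepts both a JSON number and a numeric string, so this works
# regardless of how the offset is serialized.
offset = int(fields["offset"])
```

The offset recovered this way is the kind of value that fastwarc extract expects as its OFFSET argument.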
recompress
Recompress a WARC file.
This command allows you to recompress a WARC file if it is uncompressed or not compressed properly at the record level, or if you want to recompress a GZip WARC as LZ4 or vice versa.
fastwarc recompress [OPTIONS] INFILE OUTFILE
Options
- -c, --compress-alg <compress_alg>
Compression algorithm to use for output file
- Default:
auto
- Options:
gzip | lz4 | uncompressed | auto
- -d, --decompress-alg <decompress_alg>
Decompression algorithm for decoding input file (auto tries to detect based on file extension)
- Default:
auto
- Options:
gzip | lz4 | uncompressed | auto
- -l, --compress-level <compress_level>
Compression level (defaults to max)
- -q, --quiet
Do not print progress information
Arguments
- INFILE
Required argument
- OUTFILE
Required argument
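When converting many files, it helps to derive the output name from the input name so that extension-based detection keeps working afterwards. The following Python sketch assumes the conventional .gz and .lz4 suffixes; the helper names recompressed_name and build_recompress_command are hypothetical and not part of FastWARC.

```python
from typing import List

# Assumed mapping from compression algorithm to file suffix.
SUFFIXES = {"gzip": ".gz", "lz4": ".lz4", "uncompressed": ""}


def recompressed_name(infile: str, compress_alg: str) -> str:
    """Swap the compression suffix of `infile` for the one of `compress_alg`."""
    if compress_alg not in SUFFIXES:
        raise ValueError(f"unsupported algorithm: {compress_alg}")
    base = infile
    for suffix in (".gz", ".lz4"):
        if base.endswith(suffix):
            base = base[:-len(suffix)]
            break
    return base + SUFFIXES[compress_alg]


def build_recompress_command(infile: str, compress_alg: str) -> List[str]:
    """Assemble the argument list for `fastwarc recompress`."""
    outfile = recompressed_name(infile, compress_alg)
    if outfile == infile:
        # Refuse to build a command that writes onto its own input.
        raise ValueError("input and output would be the same file")
    return ["fastwarc", "recompress", "--compress-alg", compress_alg,
            infile, outfile]
```

The returned list can be passed to subprocess.run once the fastwarc executable is available on PATH.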