Resiliparse CLI

The resiliparse command line utility provides tools for maintaining and benchmarking Resiliparse. At the moment, these tools are aimed primarily at developers of Resiliparse. General-purpose tools geared towards users of the library may be added later.

To install the Resiliparse CLI tool, add the cli extra (and optionally the cli-benchmark extra for third-party benchmarking dependencies) to your pip install command:

$ pip install 'resiliparse[cli]'
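
To pull in the optional benchmarking dependencies as well, both extras can be combined in a single command:

$ pip install 'resiliparse[cli,cli-benchmark]'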

Once installed, run resiliparse [COMMAND] --help for detailed help listings.
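
For example, to show the help listing for the lang subcommand:

$ resiliparse lang --help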


Top-Level Commands

The following is a short listing of the top-level commands:

resiliparse

Resiliparse Command Line Interface.

resiliparse [OPTIONS] COMMAND [ARGS]...

Commands

encoding

Encoding module tools.

html

HTML module tools.

lang

Language module tools.


Full Command Listing

Below is a full description of all available commands:

resiliparse

Resiliparse Command Line Interface.

resiliparse [OPTIONS] COMMAND [ARGS]...

encoding

Encoding module tools.

resiliparse encoding [OPTIONS] COMMAND [ARGS]...

download-whatwg-mapping

Download WHATWG encoding mapping.

Download the current WHATWG encoding mapping, parse and transform it, and then print it as a copyable Python dict.

resiliparse encoding download-whatwg-mapping [OPTIONS]
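
Since the mapping is printed to standard output, it can simply be redirected into a file (the file name here is only an example):

$ resiliparse encoding download-whatwg-mapping > whatwg_mapping.py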

html

HTML module tools.

resiliparse html [OPTIONS] COMMAND [ARGS]...

benchmark

Benchmark Resiliparse HTML parser.

Benchmark Resiliparse HTML parsing by extracting the titles from all HTML pages in a WARC file.

You can compare the performance with Selectolax (both the old MyHTML and the new Lexbor engine) and BeautifulSoup4 by installing the PyPI packages selectolax and beautifulsoup4. Install Resiliparse with the cli-benchmark extra to pull in all optional third-party dependencies automatically.

See Resiliparse HTML Parser Benchmarks for more details and example benchmarking results.

resiliparse html benchmark [OPTIONS] WARC_FILE

Arguments

WARC_FILE

Required argument
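
A typical invocation looks like this (the WARC file name is a placeholder):

$ resiliparse html benchmark crawl-samples.warc.gz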

lang

Language module tools.

resiliparse lang [OPTIONS] COMMAND [ARGS]...

benchmark

Benchmark Resiliparse against FastText and Langid.

Each package you want to compare against must be installed. Install Resiliparse with the cli-benchmark extra to pull in all optional third-party dependencies automatically.

resiliparse lang benchmark [OPTIONS] INFILE

Options

-r, --rounds <rounds>

Number of rounds to benchmark

Default:

10000

-f, --fasttext-model <fasttext_model>

FastText model to benchmark

Arguments

INFILE

Required argument
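
For example, to benchmark 5000 rounds against a local fastText model (both file names are placeholders):

$ resiliparse lang benchmark -r 5000 -f lid.176.bin examples.txt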

create-dataset

Create a language detection dataset.

Create a language detection dataset from a set of extracted Wikipedia article dumps.

The expected input is a directory containing one subdirectory per language (named after the language, e.g. “en” or “enwiki”), each with any number of subdirectories and wiki_* plaintext files. Use Wikiextractor to create the plaintext directories for each language.

Empty lines and <doc> tags will be stripped from the plaintext; otherwise, the texts are expected to be clean already.

The created dataset will consist of one directory for each language, each containing three files for train, validation, and test with one example per line. The order of the lines is randomized.
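
For illustration, the input and output directories might be laid out as follows (the exact names of the split files are an assumption based on the description above; the AA subdirectory is typical Wikiextractor output):

indir/
    en/
        AA/
            wiki_00
            wiki_01
    de/
        AA/
            wiki_00

outdir/
    en/
        train
        val
        test
    de/
        train
        val
        test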

resiliparse lang create-dataset [OPTIONS] INDIR OUTDIR

Options

--val-size <val_size>

Portion of the data to use for validation

--test-size <test_size>

Portion of the data to use for testing

--min-examples <min_examples>

Minimum number of examples per language

Default:

10000

-j, --jobs <jobs>

Parallel jobs

Arguments

INDIR

Required argument

OUTDIR

Required argument
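
A possible invocation (directory names are placeholders):

$ resiliparse lang create-dataset -j 4 wiki-extracted/ dataset/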

download-wiki-dumps

Download Wikipedia dumps for language detection.

Download the first Wikipedia article multistream part for each of the specified languages.

The downloaded dumps can then be extracted with Wikiextractor.

resiliparse lang download-wiki-dumps [OPTIONS] DUMPDATE

Options

-l, --langs <langs>

Comma-separated list of languages to download

-o, --outdir <outdir>

Output directory

-j, --jobs <jobs>

Parallel download jobs (3 is the Wikimedia rate limit)

Arguments

DUMPDATE

Required argument
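
For example, to download the English, German, and French dumps with three parallel jobs (assuming the usual YYYYMMDD Wikimedia dump date format; the date is a placeholder):

$ resiliparse lang download-wiki-dumps -l en,de,fr -o dumps/ -j 3 20230601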

evaluate

Evaluate language prediction performance.

resiliparse lang evaluate [OPTIONS] INDIR

Options

-s, --split <split>

Which input split to use

Default:

val

Options:

val | test

-l, --langs <langs>

Restrict languages to this comma-separated list

-c, --cutoff <cutoff>

Prediction cutoff

Default:

1200

-t, --truncate <truncate>

Truncate examples to this length

-f, --fasttext-model <fasttext_model>

Use the specified FastText model for samples above cutoff

--sort-lang

Sort by language instead of F1

--print-cm

Print confusion matrix (may be very big)

Arguments

INDIR

Required argument
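
For example, to evaluate on the test split, restricted to three languages (the dataset directory is a placeholder):

$ resiliparse lang evaluate -s test -l en,de,fr dataset/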

train-vectors

Train and print vectors for fast language detection.

Expects the directory structure produced by the create-dataset command (one directory per language, each containing train, validation, and test split files).

resiliparse lang train-vectors [OPTIONS] INDIR

Options

-s, --split <split>

Which input split to use

Default:

train

Options:

train | test | val

-f, --out-format <out_format>

Output format (raw vectors or C code)

Default:

raw

Options:

raw | c

-l, --vector-size <vector_size>

Output vector size

Default:

256

Arguments

INDIR

Required argument
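
For example, to train 256-dimensional vectors on the train split and emit them as C code (redirecting the output to a header file is only illustrative):

$ resiliparse lang train-vectors -f c -l 256 dataset/ > lang_vectors.h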