Resiliparse CLI

The resiliparse command line utility provides tools for maintaining and benchmarking Resiliparse. At the moment, these tools are aimed primarily at developers of Resiliparse. General-purpose tools geared towards users of the library may be added later.

To install the Resiliparse CLI tool, specify the cli flag (and optionally the cli-benchmark flag for any third-party benchmarking dependencies) in your pip install command:

$ pip install 'resiliparse[cli]'

Once installed, run resiliparse [COMMAND] --help for detailed help listings.

Top-Level Commands

The following is a short listing of the top-level commands:

resiliparse

Resiliparse Command Line Interface.

Usage

resiliparse [OPTIONS] COMMAND [ARGS]...

Commands

encoding: Encoding module tools.

html: HTML module tools.

lang: Language module tools.

Full Command Listing

Below is a full description of all available commands:

resiliparse

Resiliparse Command Line Interface.

Usage

resiliparse [OPTIONS] COMMAND [ARGS]...

encoding

Encoding module tools.

Usage

resiliparse encoding [OPTIONS] COMMAND [ARGS]...

download-whatwg-mapping

Download WHATWG encoding mapping.

Download the current WHATWG encoding mapping, parse and transform it, and then print it as a copyable Python dict.
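
As a rough illustration of the parse-and-transform step, the sketch below flattens data in the shape of the WHATWG encodings.json index into a label-to-name dict and pretty-prints it. The inline sample is heavily abbreviated, the helper name build_label_mapping is hypothetical, and the direction of the mapping is an assumption; the transformation the command actually applies may differ.

```python
import json
import pprint

# Abbreviated sample in the shape of the WHATWG encodings.json index:
# a list of groups, each holding encodings with a canonical name and labels.
SAMPLE_JSON = """
[
  {
    "heading": "The Encoding",
    "encodings": [
      {"name": "UTF-8", "labels": ["unicode-1-1-utf-8", "utf-8", "utf8"]}
    ]
  },
  {
    "heading": "Legacy single-byte encodings",
    "encodings": [
      {"name": "windows-1252", "labels": ["ascii", "latin1", "windows-1252"]}
    ]
  }
]
"""


def build_label_mapping(groups):
    """Flatten the grouped index into a {label: canonical name} dict (hypothetical helper)."""
    mapping = {}
    for group in groups:
        for encoding in group["encodings"]:
            for label in encoding["labels"]:
                mapping[label.strip().lower()] = encoding["name"]
    return mapping


mapping = build_label_mapping(json.loads(SAMPLE_JSON))
# pprint emits a dict literal that can be pasted into Python source.
pprint.pprint(mapping)
```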

Usage

resiliparse encoding download-whatwg-mapping [OPTIONS]

html

HTML module tools.

Usage

resiliparse html [OPTIONS] COMMAND [ARGS]...

benchmark

Benchmark Resiliparse HTML parser.

Benchmark Resiliparse HTML parsing by extracting the titles from all HTML pages in a WARC file.

You can compare the performance to Selectolax (both the old MyHTML and the new Lexbor engine) and BeautifulSoup4 by installing the PyPI packages selectolax and beautifulsoup4. Install Resiliparse with the cli-benchmark flag to install all optional third-party dependencies automatically.

See Resiliparse HTML Parser Benchmarks for more details and example benchmarking results.

Usage

resiliparse html benchmark [OPTIONS] WARC_FILE

Arguments

WARC_FILE: Required argument

lang

Language module tools.

Usage

resiliparse lang [OPTIONS] COMMAND [ARGS]...

benchmark

Benchmark Resiliparse against FastText and Langid.

Either package must be installed for this comparison. Install Resiliparse with the cli-benchmark flag to install all optional third-party dependencies automatically.

Usage

resiliparse lang benchmark [OPTIONS] INFILE

Options

-r, --rounds <rounds>

Number of rounds to benchmark

Default: 10000

-f, --fasttext-model <fasttext_model>: FastText model to benchmark

Arguments

INFILE: Required argument

create-dataset

Create a language detection dataset.

Create a language detection dataset from a set of extracted Wikipedia article dumps.

The input is expected to be a directory containing one subdirectory per language (named after the language, e.g. “en” or “enwiki”), each with any number of subdirectories and wiki_* plaintext files. Use Wikiextractor to create the plaintext directories for each language.

Empty lines and <doc> tags are stripped from the plaintext; apart from that, the texts are expected to be clean already.
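
The stripping step can be pictured with a minimal sketch like the one below. It is not the command's own code: the regular expression and the helper name clean_lines are assumptions, and the sample lines only imitate the shape of Wikiextractor output.

```python
import re

# Wikiextractor wraps each article in <doc ...> ... </doc> lines.
DOC_TAG = re.compile(r"^</?doc\b[^>]*>$")


def clean_lines(lines):
    """Yield stripped lines, skipping empty ones and <doc> / </doc> tag lines."""
    for line in lines:
        line = line.strip()
        if not line or DOC_TAG.match(line):
            continue
        yield line


# Illustrative input imitating one extracted article.
sample = [
    '<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">\n',
    "Anarchism\n",
    "\n",
    "Anarchism is a political philosophy.\n",
    "</doc>\n",
]
cleaned = list(clean_lines(sample))
```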

The created dataset will consist of one directory for each language, each containing three files for train, validation, and test with one example per line. The order of the lines is randomized.
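
A shuffled three-way split of this kind might look roughly as follows. This is a sketch under assumptions: the function name split_examples, the fixed seed, and the 0.1 portions are illustrative only, since the defaults of --val-size and --test-size are not stated here.

```python
import random


def split_examples(examples, val_size=0.1, test_size=0.1, seed=0):
    """Shuffle examples and split them into train, validation, and test lists."""
    shuffled = list(examples)
    # A seeded generator keeps the illustration reproducible.
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_size)
    n_test = int(len(shuffled) * test_size)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


examples = [f"example sentence {i}" for i in range(100)]
train, val, test = split_examples(examples)
```

Each of the three lists would then be written to its own file with one example per line.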

Usage

resiliparse lang create-dataset [OPTIONS] INDIR OUTDIR

Options

--val-size <val_size>: Portion of the data to use for validation

--test-size <test_size>: Portion of the data to use for testing

--min-examples <min_examples>

Minimum number of examples per language

Default: 10000

-j, --jobs <jobs>: Parallel jobs

Arguments

INDIR: Required argument

OUTDIR: Required argument

download-wiki-dumps

Download Wikipedia dumps for language detection.

Download the first Wikipedia article multistream part for each of the specified languages.

The downloaded dumps can then be extracted with Wikiextractor.

Usage

resiliparse lang download-wiki-dumps [OPTIONS] DUMPDATE

Options

-l, --langs <langs>: Comma-separated list of languages to download

-o, --outdir <outdir>: Output directory

-j, --jobs <jobs>: Parallel download jobs (3 is the Wikimedia rate limit)

Arguments

DUMPDATE: Required argument

evaluate

Evaluate language prediction performance.

Usage

resiliparse lang evaluate [OPTIONS] INDIR

Options

-s, --split <split>

Which input split to use

Default: 'val'
Options: val | test

-l, --langs <langs>: Restrict languages to this comma-separated list

-c, --cutoff <cutoff>

Prediction cutoff

Default: 1200

-t, --truncate <truncate>: Truncate examples to this length

-f, --fasttext-model <fasttext_model>: Use the specified FastText model for samples above cutoff

--sort-lang: Sort by language instead of F1

--print-cm: Print confusion matrix (may be very big)

Arguments

INDIR: Required argument

train-vectors

Train and print vectors for fast language detection.

Expects the following directory structure:

Usage

resiliparse lang train-vectors [OPTIONS] INDIR

Options

-s, --split <split>

Which input split to use

Default: 'train'
Options: train | test | val

-f, --out-format <out_format>

Output format (raw vectors or C code)

Default: 'raw'
Options: raw | c

-l, --vector-size <vector_size>

Output vector size

Default: 256

Arguments

INDIR: Required argument