Resiliparse CLI

The resiliparse command line utility provides tools for maintaining and benchmarking Resiliparse. At the moment, these tools are aimed primarily at developers of Resiliparse. General-purpose tools geared towards users of the library may be added later.

Run resiliparse [COMMAND] --help for detailed help listings.


Top-Level Commands

The following is a short listing of the top-level commands:

resiliparse

Resiliparse Command Line Interface.

resiliparse [OPTIONS] COMMAND [ARGS]...

Commands

encoding

Encoding module tools.

html

HTML module tools.

lang

Language module tools.


Full Command Listing

Below is a full description of all available commands:

resiliparse

Resiliparse Command Line Interface.

resiliparse [OPTIONS] COMMAND [ARGS]...

encoding

Encoding module tools.

resiliparse encoding [OPTIONS] COMMAND [ARGS]...

download-whatwg-mapping

Download WHATWG encoding mapping.

Download the current WHATWG encoding mapping, parse and transform it, and then print it as a copyable Python dict.

resiliparse encoding download-whatwg-mapping [OPTIONS]
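The transformation this command performs can be sketched as follows. The WHATWG Encoding Standard publishes its mapping as encodings.json, a list of groups, each holding encodings with their labels and canonical names. The sketch below flattens a small inline sample of that structure into a label-to-name dict; the real command downloads the live file, and the helper name build_label_map is illustrative, not part of Resiliparse:

```python
import json

# Small inline sample mimicking the structure of the WHATWG encodings.json
# file (a list of groups, each with an "encodings" list of
# {"labels": [...], "name": ...} entries). The real command fetches the
# current file from the WHATWG spec.
SAMPLE = json.loads("""
[
  {"heading": "The Encoding",
   "encodings": [{"labels": ["unicode-1-1-utf-8", "utf-8", "utf8"], "name": "UTF-8"}]},
  {"heading": "Legacy single-byte encodings",
   "encodings": [{"labels": ["ascii", "us-ascii", "windows-1252"], "name": "windows-1252"}]}
]
""")

def build_label_map(spec):
    """Flatten the grouped spec into a label -> canonical name dict."""
    mapping = {}
    for group in spec:
        for enc in group["encodings"]:
            for label in enc["labels"]:
                mapping[label] = enc["name"]
    return mapping

mapping = build_label_map(SAMPLE)
print(mapping["utf8"])   # UTF-8
```

The resulting dict is what the command prints in copyable form.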

html

HTML module tools.

resiliparse html [OPTIONS] COMMAND [ARGS]...

benchmark

Benchmark Resiliparse HTML parser.

Benchmark Resiliparse HTML parsing by extracting the titles from all HTML pages in a WARC file.

You can compare the performance to Selectolax (both the old MyHTML and the new Lexbor engine) and BeautifulSoup4 by installing the PyPI packages selectolax and beautifulsoup4.

See Resiliparse HTML Parser Benchmarks for more details and example benchmarking results.

resiliparse html benchmark [OPTIONS] WARC_FILE

Arguments

WARC_FILE

Required argument
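The measurement loop of such a benchmark can be sketched with the standard library alone. The stand-in TitleParser below substitutes for the real parsers (Resiliparse's HTML parser, Selectolax, BeautifulSoup4), and the inline page list substitutes for the HTML responses read from the WARC file; only the timing structure mirrors what the command does:

```python
import time
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Minimal stand-in parser: collects the text of the <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(html):
    p = TitleParser()
    p.feed(html)
    return p.title.strip()

# Time title extraction over all pages, as the benchmark does for every
# HTML response in the WARC file.
pages = ['<html><head><title>Page %d</title></head></html>' % i for i in range(100)]
start = time.perf_counter()
titles = [extract_title(p) for p in pages]
elapsed = time.perf_counter() - start
print('%d titles in %.4fs' % (len(titles), elapsed))
```

Swapping extract_title for each parser under test yields directly comparable timings.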

lang

Language module tools.

resiliparse lang [OPTIONS] COMMAND [ARGS]...

benchmark

Benchmark Resiliparse against FastText and Langid.

resiliparse lang benchmark [OPTIONS] INFILE

Options

-r, --rounds <rounds>

Number of rounds to benchmark

Default

10000

-f, --fasttext-model <fasttext_model>

FastText model to benchmark

Arguments

INFILE

Required argument

create-dataset

Create a language detection dataset.

Create a language detection dataset from a set of extracted Wikipedia article dumps.

The expected input is a directory containing one subdirectory per language (named after the language, e.g. “en” or “enwiki”), each with any number of subdirectories and wiki_* plaintext files. Use Wikiextractor to create the plaintext directories for each language.

Empty lines and <doc> tags will be stripped from the plaintext; otherwise, the texts are expected to be clean already.

The created dataset will consist of one directory for each language, each containing three files for train, validation, and test with one example per line. The order of the lines is randomized.

resiliparse lang create-dataset [OPTIONS] INDIR OUTDIR

Options

--val-size <val_size>

Portion of the data to use for validation

--test-size <test_size>

Portion of the data to use for testing

--min-examples <min_examples>

Minimum number of examples per language

Default

10000

-j, --jobs <jobs>

Parallel jobs

Arguments

INDIR

Required argument

OUTDIR

Required argument
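The randomized train/validation/test split described above can be sketched as follows. The function name split_dataset, the portion defaults, and the fixed seed are illustrative assumptions, not the command's actual implementation:

```python
import random

def split_dataset(lines, val_size=0.1, test_size=0.1, seed=42):
    """Shuffle examples, then carve off validation and test portions;
    the remainder becomes the training split."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    n_val = int(len(lines) * val_size)
    n_test = int(len(lines) * test_size)
    val = lines[:n_val]
    test = lines[n_val:n_val + n_test]
    train = lines[n_val + n_test:]
    return train, val, test

examples = ['example %d' % i for i in range(100)]
train, val, test = split_dataset(examples, val_size=0.1, test_size=0.2)
print(len(train), len(val), len(test))   # 70 10 20
```

The real command applies such a split per language and writes each portion to its own file, one example per line.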

download-wiki-dumps

Download Wikipedia dumps for language detection.

Download the first Wikipedia article multistream part for each of the specified languages.

The downloaded dumps can then be extracted with Wikiextractor.

resiliparse lang download-wiki-dumps [OPTIONS] DUMPDATE

Options

-l, --langs <langs>

Comma-separated list of languages to download

-o, --outdir <outdir>

Output directory

-j, --jobs <jobs>

Parallel download jobs (3 is the Wikimedia rate limit)

Arguments

DUMPDATE

Required argument
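The parallel download structure, with concurrency capped at the Wikimedia rate limit of three jobs, can be sketched like this. The download function is a stand-in that only formats a result string; the real command fetches the first multistream dump part per language from dumps.wikimedia.org:

```python
from concurrent.futures import ThreadPoolExecutor

def download(lang):
    # Stand-in for fetching the first multistream part of the dump
    # for this language.
    return '%swiki: ok' % lang

langs = 'en,de,fr,it'.split(',')

# Cap concurrency at 3 parallel jobs, matching the Wikimedia rate limit
# noted for --jobs.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(download, langs))
print(results)
```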

evaluate

Evaluate language prediction performance.

resiliparse lang evaluate [OPTIONS] INDIR

Options

-s, --split <split>

Which input split to use

Default

val

Options

val | test

-l, --langs <langs>

Restrict languages to this comma-separated list

-c, --cutoff <cutoff>

Prediction cutoff

Default

1200

-t, --truncate <truncate>

Truncate examples to this length

-f, --fasttext-model <fasttext_model>

Use the specified FastText model for samples above cutoff

--sort-lang

Sort by language instead of F1

--print-cm

Print confusion matrix (may be very big)

Arguments

INDIR

Required argument
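The per-language F1 scores this command reports (and can sort by) follow the standard definition; a self-contained sketch, with the helper name per_language_f1 as an illustrative assumption:

```python
from collections import Counter

def per_language_f1(true_langs, pred_langs):
    """Compute per-language F1 from parallel lists of true and
    predicted language labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_langs, pred_langs):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p, but was t
            fn[t] += 1   # missed an instance of t
    scores = {}
    for lang in set(true_langs) | set(pred_langs):
        prec = tp[lang] / (tp[lang] + fp[lang]) if tp[lang] + fp[lang] else 0.0
        rec = tp[lang] / (tp[lang] + fn[lang]) if tp[lang] + fn[lang] else 0.0
        scores[lang] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

scores = per_language_f1(['en', 'en', 'de', 'de'],
                         ['en', 'de', 'de', 'de'])
```

With --print-cm, the underlying confusion counts are printed directly instead of being reduced to F1.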

train-vectors

Train and print vectors for fast language detection.

Expects the directory structure produced by create-dataset: one directory per language, each containing train, validation, and test files with one example per line.

resiliparse lang train-vectors [OPTIONS] INDIR

Options

-s, --split <split>

Which input split to use

Default

train

Options

train | test | val

-f, --out-format <out_format>

Output format (raw vectors or C code)

Default

raw

Options

raw | c

-l, --vector-size <vector_size>

Output vector size

Default

256

Arguments

INDIR

Required argument
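One plausible way to build fixed-size language profile vectors (matching the 256-dimension default of --vector-size) is to hash character bigrams into a count vector and normalize it. This sketch is an illustration of the general technique, not necessarily the vector construction Resiliparse's detector actually uses:

```python
def text_to_vector(text, vector_size=256):
    """Hash character bigrams into a fixed-size count vector and
    normalize it to unit sum."""
    vec = [0.0] * vector_size
    for a, b in zip(text, text[1:]):
        # Simple deterministic bigram hash into the vector.
        vec[(ord(a) * 31 + ord(b)) % vector_size] += 1.0
    total = sum(vec)
    if total:
        vec = [v / total for v in vec]
    return vec

v = text_to_vector('ein kleiner deutscher beispieltext')
print(len(v))   # 256
```

Averaging such vectors over a language's training split yields one profile vector per language, which the command can then print as raw values or as C code.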