Character Encoding

Resiliparse encoding utilities API documentation.

class resiliparse.parse.encoding.EncodingDetector

Bases: object

Universal character encoding detector based on uchardet.

uchardet is a C wrapper and a continuation of Mozilla’s Universal Charset Detector library.

encoding(self, html5_compatible=True)

Get a Python-compatible name of the encoding that was detected and reset the detector.

By default, the detected encoding is remapped based on the WHATWG encoding specification, which is primarily suitable for web content. To disable this behaviour, set html5_compatible=False. For more information, see: map_encoding_to_html5().

If WHATWG remapping is enabled, UTF-8 is returned as a fallback encoding. Otherwise, the method returns None on failure to detect the encoding.

Parameters: html5_compatible (bool) – remap encoding names according to WHATWG
Returns: detected encoding or None on failure
Return type: str or None

reset(self): Manually reset the encoding detector state.

update(self, data)

Update charset detector with more data.

The detector will shortcut processing when it has enough data to reach certainty, so you don’t need to worry too much about limiting the input data.

Parameters: data (bytes) – input data
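The update/encoding cycle described above can be sketched as a small helper for chunked input. This is a minimal sketch and not part of Resiliparse: detect_from_chunks is a hypothetical name, and the helper only assumes an object exposing the update() and encoding() methods documented here.

```python
from typing import Iterable, Optional


def detect_from_chunks(detector, chunks: Iterable[bytes],
                       html5_compatible: bool = True) -> Optional[str]:
    """Feed byte chunks to a detector, then read its result.

    `detector` is assumed to expose update(data) and
    encoding(html5_compatible), as documented for EncodingDetector.
    """
    for chunk in chunks:
        detector.update(chunk)
    # Per the documentation, encoding() also resets the detector,
    # so the same instance can be reused for the next input.
    return detector.encoding(html5_compatible)
```

With Resiliparse installed, this would presumably be called as detect_from_chunks(EncodingDetector(), chunks), where chunks could come from reading a file in fixed-size blocks.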

resiliparse.parse.encoding.bytes_to_str(data, encoding='utf-8', errors='ignore', fallback_encodings=('utf-8', 'cp1252'), strip_bom=True)

Helper for decoding a byte string into a unicode string using a given encoding. This encoding should be determined beforehand, e.g., with detect_encoding().

bytes_to_str() tries to decode the byte string with encoding. If that fails, it will fall back to UTF-8 and Windows-1252 (or whichever encodings were given in fallback_encodings). If all fallbacks fail as well, the string will be double-decoded with encoding and invalid characters will be treated according to errors, which has the same options as for bytes.decode() (i.e., "ignore" or "replace"). The double-decoding step ensures that the resulting string is sane and can be re-encoded without errors.

This function also takes care to strip BOMs from the beginning of the string if strip_bom=True.

Parameters:

data (bytes) – input byte string
encoding (str) – desired encoding
errors (str) – error handling for invalid characters
fallback_encodings (t.Iterable[str]) – list of fallback encodings to try if the primary encoding fails
strip_bom (bool) – strip BOM sequences from beginning of the string

Returns:

decoded string

Return type:

str
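The fallback chain described above can be approximated in pure Python. This is a minimal sketch under stated assumptions, not the library's implementation: decode_with_fallbacks is a hypothetical name, and the "double-decoding" step is interpreted here as a lossy decode followed by an encode/decode round trip.

```python
from typing import Iterable


def decode_with_fallbacks(data: bytes, encoding: str = "utf-8",
                          errors: str = "ignore",
                          fallback_encodings: Iterable[str] = ("utf-8", "cp1252"),
                          strip_bom: bool = True) -> str:
    text = None
    # Try the requested encoding first, then each fallback, all strictly.
    for candidate in (encoding, *fallback_encodings):
        try:
            text = data.decode(candidate)
            break
        except (UnicodeDecodeError, LookupError):
            continue
    if text is None:
        # Last resort (assumed meaning of "double-decoding"): decode
        # lossily, then round-trip once so the result can be re-encoded.
        lossy = data.decode(encoding, errors=errors)
        text = lossy.encode(encoding, errors=errors).decode(encoding, errors=errors)
    # Drop a leading byte order mark if requested.
    if strip_bom and text.startswith("\ufeff"):
        text = text[1:]
    return text
```

For example, the Windows-1252 bytes b"caf\xe9" are not valid UTF-8, so the sketch only succeeds at the cp1252 fallback.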

resiliparse.parse.encoding.detect_encoding(data, max_len=131072, html5_compatible=True, from_html_meta=False)

Detect the encoding of a byte string. This is a convenience wrapper around EncodingDetector that uses a single global instance.

The string that is passed to the EncodingDetector will be no longer than max_len bytes to prevent slow-downs and keep memory usage low. If the string is longer than this limit, only the first max_len / 2 bytes and the last max_len / 2 bytes of the string will be used. This is a tradeoff between performance and accuracy. If you need higher accuracy, increase the limit to feed more data into the EncodingDetector (the default should be more than enough in most cases).
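The truncation rule can be illustrated with a few lines of Python. This is a sketch of the behaviour as described, not the library's code; truncate_for_detection is a hypothetical name, and the integer division is an assumption about how max_len / 2 is rounded.

```python
def truncate_for_detection(data: bytes, max_len: int = 131072) -> bytes:
    """Keep at most max_len bytes: half from the start, half from the end."""
    # max_len == 0 means "no limit", as documented for detect_encoding().
    if max_len <= 0 or len(data) <= max_len:
        return data
    half = max_len // 2  # assumed rounding
    # Slice the tail by absolute index so that half == 0 yields nothing.
    return data[:half] + data[len(data) - half:]
```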

The EncodingDetector relies on uchardet as its encoding detection engine. If the input string is an HTML document, you can also use the available information from the HTML meta charset tag instead. With from_html_meta=True, detect_encoding() will try to use the charset meta tag in the HTML string if one is available and ASCII-readable within the first 1024 bytes. If this fails, it will fall back to auto-detection with uchardet.
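As a rough illustration of meta tag sniffing, the following sketch looks for a charset declaration in the first 1024 bytes. The regular expression and the name sniff_meta_charset are assumptions made for this example; the matching rules Resiliparse actually applies may differ.

```python
import re
from typing import Optional

# Matches both <meta charset="..."> and the http-equiv form with
# content="text/html; charset=...". Only ASCII label characters are captured.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:\-]+)', re.IGNORECASE)


def sniff_meta_charset(data: bytes, window: int = 1024) -> Optional[str]:
    """Return the lower-cased charset label from a meta tag, or None."""
    match = _META_CHARSET.search(data[:window])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

A None result corresponds to the case where detection would fall back to uchardet.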

By default, the detected encoding name is remapped according to the WHATWG encoding specification, which is primarily suitable for web content. To disable this behaviour, set html5_compatible=False. For more information, see: map_encoding_to_html5(). Encodings detected from HTML meta tags are always remapped, no matter the value of html5_compatible, to ensure valid encoding names.

If WHATWG remapping is enabled, UTF-8 is returned as a fallback encoding. Otherwise, the function returns None on failure to detect the encoding.

Parameters:

data (bytes) – input string for which to detect the encoding
max_len (int) – maximum number of bytes to feed to detector (0 for no limit)
html5_compatible (bool) – Remap encoding names according to WHATWG
from_html_meta (bool) – if string is an HTML document, use meta tag info

Returns:

detected encoding or None on failure

Return type:

str or None

resiliparse.parse.encoding.detect_mime(data, max_unprintable=0.05)

Try to detect common internet MIME types based on the initial magic byte sequence of data.

The check is very basic: it only inspects the starting bytes and the ratio of unprintable bytes. It does not replace a full-blown MIME type detection engine like Apache Tika at the moment.

Parameters:

data (bytes) – input bytes
max_unprintable (float) – maximum allowable ratio of unprintable characters for text

Returns:

detected MIME type

Return type:

str
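The general technique (magic bytes first, then an unprintable-byte ratio for text) can be sketched as follows. Everything here is an assumption for illustration: the name sniff_mime, the short signature table, the definition of "unprintable" and the fallback return values are not taken from Resiliparse.

```python
# A few well-known file signatures (illustrative subset only).
_MAGIC = (
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
)

# Control bytes that commonly occur in text: tab, line feed, carriage return.
_TEXT_CONTROL = frozenset((0x09, 0x0A, 0x0D))


def sniff_mime(data: bytes, max_unprintable: float = 0.05) -> str:
    for magic, mime in _MAGIC:
        if data.startswith(magic):
            return mime
    if not data:
        return "application/octet-stream"
    # Assumed definition: ASCII control bytes other than tab/LF/CR, plus DEL.
    unprintable = sum(
        1 for byte in data
        if (byte < 0x20 and byte not in _TEXT_CONTROL) or byte == 0x7F)
    if unprintable / len(data) <= max_unprintable:
        return "text/plain"
    return "application/octet-stream"
```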

resiliparse.parse.encoding.map_encoding_to_html5(encoding, fallback_utf8=True)

Map an encoding name to a subset of names allowed by the HTML5 standard.

This function will remap the given name according to the mapping definition given in Section 4.2 of the WHATWG encoding specification. The returned value will always be a valid Python encoding name, but the supplied input name does not necessarily have to be.

The WHATWG mapping is designed to boil down the many possible encoding names to a smaller subset of canonical names while taking into account common encoding mislabelling practices. The main purpose of this function is to map encoding names extracted from HTTP headers or websites to their canonical names. It also makes sense to apply the mapping to an auto-detected encoding name, since it remaps some encodings based on observed practices on the web. For example, ISO-8859-1 is mapped to Windows-1252, which is more likely to be correct even if both options are possible. EncodingDetector.encoding() already remaps its detected encodings to the WHATWG set by default.

The mapping does not involve Python’s encoding alias names, but instead uses an adjusted WHATWG mapping. Inputs not defined in this mapping are remapped to UTF-8. Hence, the function always produces a valid output, but the mapped encoding is not guaranteed to be compatible with the original encoding. Use bytes_to_str() to avoid decoding errors. You can also set fallback_utf8=False to return None instead if the supplied encoding is unknown.

The adjusted encoding mapping differs from the WHATWG spec in the following details:

ISO-8859-8-I name replaced with ISO-8859-8

WINDOWS-874 name replaced with ISO-8859-11

x-mac-cyrillic is unsupported

x-user-defined is unsupported

No “replacement” mapping for 7-bit versions of ISO/IEC 2022

Parameters:

encoding (str) – input encoding name
fallback_utf8 (bool) – Whether to fall back to UTF-8 or return None for unknown encodings

Returns:

mapped output encoding name

Return type:

str or None
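The idea of label remapping can be shown with a tiny lookup table. This is a sketch only: map_label is a hypothetical name, the table holds just a handful of WHATWG labels, and the exact output spellings that Resiliparse returns are not confirmed by this documentation.

```python
from typing import Optional

# Small excerpt of WHATWG labels, mapped to Python codec names.
# The last two groups follow the adjustments listed above.
_LABELS = {
    "utf8": "utf-8",
    "utf-8": "utf-8",
    "unicode-1-1-utf-8": "utf-8",
    "ascii": "cp1252",
    "us-ascii": "cp1252",
    "latin1": "cp1252",
    "iso-8859-1": "cp1252",
    "windows-1252": "cp1252",
    "iso-8859-8-i": "iso-8859-8",
    "windows-874": "iso-8859-11",
    "tis-620": "iso-8859-11",
}


def map_label(label: str, fallback_utf8: bool = True) -> Optional[str]:
    # WHATWG labels are matched case-insensitively after stripping
    # leading and trailing ASCII whitespace.
    key = label.strip("\t\n\x0c\r ").lower()
    name = _LABELS.get(key)
    if name is None:
        return "utf-8" if fallback_utf8 else None
    return name
```

Because unknown labels are mapped to UTF-8 by default, the result is always usable as a codec name but may not match the data, which is why decoding through bytes_to_str() is recommended above.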