Character Encoding
Resiliparse encoding utilities API documentation.
- class resiliparse.parse.encoding.EncodingDetector
Bases: object
Universal character encoding detector based on uchardet.
uchardet is a C wrapper and a continuation of Mozilla’s Universal Charset Detector library.
- encoding(self, html5_compatible=True)
Get a Python-compatible name of the encoding that was detected and reset the detector.
By default, the detected encoding is remapped based on the WHATWG encoding specification, which is primarily suitable for web content. To disable this behaviour, set html5_compatible=False. For more information, see map_encoding_to_html5().
If WHATWG remapping is enabled, UTF-8 is returned as a fallback encoding. Otherwise, the method returns None on failure to detect the encoding.
- Parameters:
html5_compatible (bool) – Remap encoding names according to WHATWG
- Returns:
detected encoding or None on failure
- Return type:
str or None
- reset(self)
Manually reset the encoding detector state.
- update(self, data)
Update charset detector with more data.
The detector will shortcut processing when it has enough data to reach certainty, so you don’t need to worry too much about limiting the input data.
- Parameters:
data (bytes) – input data
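A minimal usage sketch for the incremental interface, using only the methods documented above. The helper name detect_from_chunks is hypothetical, and the import sits inside the function as an assumption-driven convenience so that the sketch loads even where Resiliparse is not installed.

```python
def detect_from_chunks(chunks, html5_compatible=True):
    """Feed byte chunks to an EncodingDetector and return the detected name."""
    # Imported lazily so this sketch can be loaded without Resiliparse installed.
    from resiliparse.parse.encoding import EncodingDetector

    detector = EncodingDetector()
    for chunk in chunks:
        # update() accepts bytes; the detector shortcuts once it is certain.
        detector.update(chunk)

    # encoding() returns the detected name and resets the detector state,
    # so the same instance could be reused for the next document.
    return detector.encoding(html5_compatible=html5_compatible)
```

Calling detect_from_chunks() with the byte chunks of a streamed response would return an encoding name, or None if detection fails and html5_compatible=False.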
- resiliparse.parse.encoding.bytes_to_str(data, encoding='utf-8', errors='ignore', fallback_encodings=('utf-8', 'cp1252'), strip_bom=True)
Helper for decoding a byte string into a unicode string using a given encoding. This encoding should be determined beforehand, e.g., with detect_encoding().
bytes_to_str() tries to decode the byte string with encoding. If that fails, it will fall back to UTF-8 and Windows-1252 (or whichever encodings were given in fallback_encodings). If all fallbacks fail as well, the string will be double-decoded with encoding, and invalid characters will be treated according to errors, which has the same options as for bytes.decode() (i.e., "ignore" or "replace"). The double-decoding step ensures that the resulting string is sane and can be re-encoded without errors.
This function also takes care to strip BOMs from the beginning of the string if strip_bom=True.
- Parameters:
data (bytes) – input byte string
encoding (str) – desired encoding
errors (str) – error handling for invalid characters
fallback_encodings (t.Iterable[str]) – list of fallback encodings to try if the primary encoding fails
strip_bom (bool) – strip BOM sequences from beginning of the string
- Returns:
decoded string
- Return type:
str
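The fallback order described above can be sketched in pure Python. This is an illustration only, not Resiliparse's implementation: the function name decode_with_fallbacks is hypothetical, the BOM handling (dropping a leading U+FEFF after decoding) is an assumption, and the final round trip is one possible reading of the "double-decoding" step.

```python
def decode_with_fallbacks(data, encoding="utf-8", errors="ignore",
                          fallback_encodings=("utf-8", "cp1252"),
                          strip_bom=True):
    """Illustrative sketch of the documented fallback order (hypothetical)."""

    def finish(text):
        # Assumption: BOM stripping means dropping a leading U+FEFF.
        if strip_bom and text.startswith("\ufeff"):
            return text[1:]
        return text

    # 1. Try the requested encoding, then each fallback, with strict decoding.
    for candidate in (encoding, *fallback_encodings):
        try:
            return finish(data.decode(candidate))
        except (UnicodeDecodeError, LookupError):
            continue

    # 2. Last resort: decode leniently, then round-trip the result so that
    #    the returned string can be re-encoded without errors.
    text = data.decode(encoding, errors=errors)
    text = text.encode(encoding, errors=errors).decode(encoding, errors=errors)
    return finish(text)


print(decode_with_fallbacks(b"caf\xe9"))  # not valid UTF-8, decodes as cp1252
```

The strict attempts come first so that a clean decode is always preferred over a lossy one; the errors policy only applies once every candidate encoding has been rejected.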
- resiliparse.parse.encoding.detect_encoding(data, max_len=131072, html5_compatible=True, from_html_meta=False)
Detect the encoding of a byte string. This is a convenience wrapper around EncodingDetector that uses a single global instance.
The string that is passed to the EncodingDetector will be no longer than max_len bytes to prevent slow-downs and keep memory usage low. If the string is longer than this limit, only max_len / 2 bytes from the start and max_len / 2 bytes from the end of the string will be used. This is a tradeoff between performance and accuracy. If you need higher accuracy, increase the limit to feed more data into the EncodingDetector (the default should be more than enough in most cases).
The EncodingDetector relies on uchardet as its encoding detection engine. If the input string is an HTML document, you can also use the available information from the HTML meta charset tag instead. With from_html_meta=True, detect_encoding() will try to use the charset meta tag in the HTML string if one is available and ASCII-readable within the first 1024 bytes. If this fails, it will fall back to auto-detection with uchardet.
By default, the detected encoding name is remapped according to the WHATWG encoding specification, which is primarily suitable for web content. To disable this behaviour, set html5_compatible=False. For more information, see map_encoding_to_html5(). Encodings detected from HTML meta tags are always remapped, no matter the value of html5_compatible, to ensure valid encoding names.
If WHATWG remapping is enabled, UTF-8 is returned as a fallback encoding. Otherwise, the function returns None on failure to detect the encoding.
- Parameters:
data (bytes) – input string for which to detect the encoding
max_len (int) – maximum number of bytes to feed to detector (0 for no limit)
html5_compatible (bool) – Remap encoding names according to WHATWG
from_html_meta (bool) – if string is an HTML document, use meta tag info
- Returns:
detected encoding
- Return type:
str
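The two input-reduction steps described above (sampling the head and tail of long inputs, and looking for a meta charset in the first 1024 bytes) can be sketched with the standard library. Both function names and the regular expression are hypothetical; how Resiliparse actually samples the input or parses the meta tag is not specified here, so treat this as a sketch of the idea only.

```python
import re


def sample_for_detection(data, max_len=131072):
    """Keep at most max_len bytes: half from the start, half from the end."""
    if max_len <= 0 or len(data) <= max_len:
        return data  # 0 means "no limit"; short inputs are used in full
    half = max_len // 2
    # Explicit end offset avoids the data[-0:] pitfall when half == 0.
    return data[:half] + data[len(data) - half:]


# Assumption: a simple pattern covering <meta charset="..."> and the
# http-equiv form, searched only within the first 1024 bytes.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([a-zA-Z0-9_:.-]+)', re.IGNORECASE)


def sniff_meta_charset(data):
    """Return the lower-cased charset label from an HTML meta tag, or None."""
    match = _META_CHARSET.search(data[:1024])
    return match.group(1).decode("ascii").lower() if match else None


print(sample_for_detection(b"abcdefghij", max_len=4))
print(sniff_meta_charset(b'<html><head><meta charset="ISO-8859-1"></head>'))
```

Sampling both ends rather than only the start helps when a document opens with a long ASCII-only header and the distinguishing bytes appear later.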
- resiliparse.parse.encoding.detect_mime(data, max_unprintable=0.05)
Try to detect common internet MIME types based on the initial magic byte sequence of data.
The check is very basic and only looks at the starting bytes as well as the number of unprintable bytes. It is currently not a replacement for a full-blown MIME type detection engine such as Apache Tika.
- Parameters:
data (bytes) – input bytes
max_unprintable (float) – maximum allowable ratio of unprintable characters for text
- Returns:
detected MIME type
- Return type:
str
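To illustrate the kind of check described above, here is a small standard-library sketch that combines a magic-byte lookup with an unprintable-byte ratio. The function names, the three magic signatures, the definition of "unprintable", and the returned fallback types are all assumptions chosen for illustration; they are not the table or rules Resiliparse uses.

```python
# A few well-known magic signatures (illustrative subset only).
_MAGIC = (
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF8", "image/gif"),
)

# Assumption: control bytes other than tab, LF, and CR count as unprintable.
_ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def unprintable_ratio(data):
    """Fraction of bytes that are control characters (hypothetical rule)."""
    if not data:
        return 0.0
    count = sum(
        1 for b in data
        if (b < 0x20 and b not in _ALLOWED_CONTROLS) or b == 0x7F
    )
    return count / len(data)


def guess_mime(data, max_unprintable=0.05):
    """Check magic bytes first, then fall back to a text/binary heuristic."""
    for signature, mime in _MAGIC:
        if data.startswith(signature):
            return mime
    if unprintable_ratio(data) <= max_unprintable:
        return "text/plain"
    return "application/octet-stream"


print(guess_mime(b"%PDF-1.7\n"))
print(guess_mime(b"hello world\n"))
```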
- resiliparse.parse.encoding.map_encoding_to_html5(encoding, fallback_utf8=True)
Map an encoding name to a subset of names allowed by the HTML5 standard.
This function will remap the given name according to the mapping definition given in Section 4.2 of the WHATWG encoding specification. The returned value will always be a valid Python encoding name, but the supplied input name does not necessarily have to be.
The WHATWG mapping is designed to boil down the many possible encoding names to a smaller subset of canonical names while taking into account common encoding mislabelling practices. The main purpose of this function is to map encoding names extracted from HTTP headers or websites to their canonical names, but it also makes sense to apply the mapping to an auto-detected encoding name, since it remaps some encodings based on observed practices on the web, such as the mapping from ISO-8859-1 to Windows-1252, which is more likely to be correct, even if both options are possible.
EncodingDetector.encoding() already remaps its detected encodings to the WHATWG set by default.
The mapping does not involve Python’s encoding alias names, but instead uses an adjusted WHATWG mapping. Inputs not defined in this mapping are remapped to UTF-8. Hence, the function always produces a valid output, but the mapped encoding is not guaranteed to be compatible with the original encoding. Use bytes_to_str() to avoid decoding errors. You can also set fallback_utf8=False to return None instead if the supplied encoding is unknown.
The adjusted encoding mapping differs from the WHATWG spec in the following details:
ISO-8859-8-I name replaced with ISO-8859-8
WINDOWS-874 name replaced with ISO-8859-11
x-mac-cyrillic is unsupported
x-user-defined is unsupported
No “replacement” mapping for 7-bit versions of ISO/IEC 2022
- Parameters:
encoding (str) – input encoding name
fallback_utf8 (bool) – Whether to fall back to UTF-8 or return None for unknown encodings
- Returns:
mapped output encoding name
- Return type:
str or None
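The remapping idea can be illustrated with a tiny lookup table. This sketch is not the library's mapping: the function name map_label is hypothetical, the table holds only a handful of labels that the WHATWG encoding specification assigns to windows-1252 and UTF-8, and the output names ("cp1252", "utf-8") are assumed here simply because they are valid Python codec names.

```python
import codecs

# Illustrative subset: WHATWG treats these labels as windows-1252 or UTF-8.
_WEB_LABELS = {
    "iso-8859-1": "cp1252",
    "latin1": "cp1252",
    "ascii": "cp1252",
    "us-ascii": "cp1252",
    "windows-1252": "cp1252",
    "utf8": "utf-8",
    "utf-8": "utf-8",
}


def map_label(label, fallback_utf8=True):
    """Map an encoding label to a canonical name (hypothetical sketch)."""
    mapped = _WEB_LABELS.get(label.strip().lower())
    if mapped is None:
        # Unknown labels either fall back to UTF-8 or are reported as None.
        return "utf-8" if fallback_utf8 else None
    return mapped


for name in ("ISO-8859-1", "x-unknown"):
    result = map_label(name)
    # Every name produced by the table is a codec Python can look up.
    codecs.lookup(result)
    print(name, "->", result)
```

Because an unknown label falls back to UTF-8 regardless of the actual bytes, the mapped name should be treated as a hint and paired with a tolerant decoder such as bytes_to_str().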