Utilities for detecting and working with (potentially broken) text encodings.
Character Encoding Detection
from resiliparse.parse.encoding import EncodingDetector det = EncodingDetector() det.update(b'\xff\xfeH\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00') enc = det.encoding() # utf-16-le det.update(b'Autres temps, autres m\x9curs.') enc = det.encoding() # cp1252 (Windows-1252)
You can call
update() multiple times to feed more input. The more data the detector has to sample from, the more accurate the prediction will be. Calling
encoding() will return the predicted encoding name as a string and reset the internal state so that the detector can be reused for a different document.
By default, the detected encoding name is remapped according to the WHATWG encoding specification (see: Map Encodings to WHATWG Specification for a detailed description). If you do not want this remapping to take place, set
html5_compatible=False on the call to
encoding(). In this case, the returned value can be
None if the encoding could not be determined. With WHATWG remapping enabled, unknown encodings are mapped to UTF-8. This is convenient most of the time, but it also means that the source text is not necessarily decodable without errors, so care should be taken here. You can use
bytes_to_str() to avoid decoding errors (see Convert Byte String to Unicode for details).
from resiliparse.parse.encoding import detect_encoding enc = detect_encoding(b'Potrzeba jest matk\xb1 wynalazk\xf3w.') # iso8859-2
Map Encodings to WHATWG Specification
Before decoding the contents of a web page with an encoding that was extracted from either the HTTP
Content-Type header or the HTML body itself, it often makes sense to remap the encoding according to WHATWG encoding specification. The WHATWG mapping is designed to boil down the many possible encoding names to a smaller subset of canonical names while taking into account common encoding mislabelling practices. The mapping is primarily designed for author-supplied encoding names, but it also makes sense to apply it to auto-detected encoding names, since it remaps some encodings based on observed practices on the web, such as the mapping from ISO-8859-1 to Windows-1252, which is more likely to be correct, even if both are possible. You can remap a given encoding name as follows:
from resiliparse.parse.encoding import map_encoding_to_html5 print(map_encoding_to_html5('iso-8859-1')) # cp1252 print(map_encoding_to_html5('csisolatin9')) # iso8859-15 print(map_encoding_to_html5('oops')) # utf-8
You see that the given input name does not necessarily have to be a valid Python encoding name, but the returned output will be. Unknown or invalid encodings are mapped to UTF-8. Set
fallback_utf8=False if you prefer to get
None back instead.
Convert Byte String to Unicode
Detecting the encoding of a byte string is one thing, but the next step is to actually decode it into a Unicode string. Resiliparse provides
bytes_to_str(), which does exactly that.
The function takes the raw byte string and a desired encoding name and tries to decode it into a Python Unicode string. If the decoding fails (due to undecodable characters), it will try to fall back to UTF-8 and Windows-1252. If both fallbacks fail as well, the string will be decoded with the originally intended encoding and invalid characters will either be skipped or replaced with a suitable replacement character (controllable via the
errors parameter, which accepts the same values as Python’s
from resiliparse.parse.encoding import detect_encoding, bytes_to_str bytestr = b'\xc3\x9cbung macht den Meister' decoded = bytes_to_str(bytestr, detect_encoding(bytestr)) # 'Übung macht den Meister'
Of course a simple
bytestr.decode() would be sufficient for such a trivial example, but sometimes, the supplied encoding is inaccurate or the string turns out to contain mixed or broken encodings. In that case there is no other option than to try multiple encodings and to ignore any errors if all of them fail. The default fallback encodings for this situation (UTF-8 and Windows-1252) can be overridden with the
When setting custom fallback encodings, keep in mind that single-byte encodings without undefined codepoints (such as IANA ISO-8859-1) will never fail, so it does not make sense to have more than one of those in the fallback list. In fact, even very dense encodings such as Windows-1252 are very unlikely to ever fail.
bytes_to_str() also ensures that the resulting string can be re-encoded as UTF-8 without errors, which is not always the case when doing a simple
from resiliparse.parse.encoding import bytes_to_str # This will produce the unencodable string 'ઉ\udd7a笞': unencodeable = b'+Condensed'.decode('utf-7', errors='ignore') # OK, but somewhat broken: b'+Condense-' unencodeable.encode('utf-7') # Error: UnicodeEncodeError: 'utf-8' codec can't encode character '\udd7a' in position 1: surrogates not allowed unencodeable.encode()
bytes_to_str(), these issues can be avoided:
# Produces '+Condensed', because UTF-8 fallback can decode the string without errors bytes_to_str(b'+Condensed', 'utf-7') # But even without fallbacks, we get 'ઉ笞', which can at least be re-encoded as UTF-8 bytes_to_str(b'+Condensed', 'utf-7', fallback_encodings=)