HTML2Text
Resiliparse HTML2Text is a very fast, rule-based plain text extractor for HTML pages. It is built on the Resiliparse DOM parser.
Basic Plain Text Conversion
The simplest and fastest way to convert an HTML page to plain text is to call the extract_plain_text() helper without any further parameters. This extracts all visible text nodes inside the HTML document’s <body>. Only <script>, <style>, and a few other (generally) invisible elements are skipped, and very basic ASCII formatting is applied:
from resiliparse.extract.html2text import extract_plain_text
html = """<!doctype html>
<html>
<head>
<title>Foo</title>
<meta charset="utf-8">
</head>
<body>
<section id="wrapper">
<nav>
<ul>
<li><a href="/">Index</a></li>
<li><a href="/contact">Contact</a></li>
</ul>
</nav>
<main id="foo">
<h1>foo <a href="#foo" aria-hidden="true">Link</a></h1>
<p>baz<br>bar</p>
<img src="" alt="Some image">
<input type="hidden" value="foo">
<input type="text" value="Some text" placeholder="Insert text">
<input type="text" placeholder="Insert text">
</main>
<script>alert('Hello World!');</script>
<noscript>Sorry, your browser doesn't support JavaScript!</noscript>
<div><div><div><footer id="global-footer">
Copyright (C) 2021 Foo Bar
</footer></div></div></div>
</section>
</body>
</html>"""
print(extract_plain_text(html))
Output:
• Index
• Contact
foo Link
baz
bar
Some image
Copyright (C) 2021 Foo Bar
Instead of the raw HTML as a string, you can also pass an HTMLTree instance.
To customize the generated plain text, extract_plain_text() accepts several parameters that control individual aspects of its output, such as the extraction of alt texts (enabled by default), link href targets, form fields, or noscript elements.
# Without alt texts:
extract_plain_text(html, alt_texts=False)
# Skips: "Some image"
# With href targets:
extract_plain_text(html, links=True)
# Adds:
# • Index (/)
# • Contact (/contact)
#
# foo Link (#foo)
# With form fields:
extract_plain_text(html, form_fields=True)
# Adds:
# [ Some text ] [ Insert text ]
# With noscript elements:
extract_plain_text(html, noscript=True)
# Adds:
# Sorry, your browser doesn't support JavaScript!
If you don’t like list bullets, you can turn them off as well:
print(extract_plain_text(html, list_bullets=False))
Output:
Index
Contact
foo Link
baz
bar
Some image
Copyright (C) 2021 Foo Bar
If you want the most compact extraction possible without any formatting, set preserve_formatting=False:
print(extract_plain_text(html, preserve_formatting=False))
Output:
Index Contact foo Link baz bar Some image Copyright (C) 2021 Foo Bar
Main Content Extraction
HTML2Text can also perform very simple and fast rule-based main content extraction (also called boilerplate removal). Setting main_content=True applies a set of rules for removing page elements such as navigation blocks, sidebars, footers, some ads, and (as far as they can be detected without rendering the page) invisible elements:
print(extract_plain_text(html, main_content=True))
Output:
foo
baz
bar
Some image
Of course, the same options for adjusting the output as above can be applied here as well:
print(extract_plain_text(html,
                         main_content=True,
                         alt_texts=False,
                         preserve_formatting=False,
                         noscript=True))
Output:
foo baz bar Sorry, your browser doesn't support JavaScript!