HTML2Text
Resiliparse simple HTML (main) content extraction API documentation.
- resiliparse.extract.html2text.extract_plain_text(html, preserve_formatting=True, main_content=False, list_bullets=True, alt_texts=False, links=True, form_fields=False, noscript=False, comments=None, skip_elements=None)
Perform a simple plain-text extraction from the given DOM node and its children.
Extracts all visible text (excluding script/style elements, comment nodes etc.) and collapses consecutive white space characters. If preserve_formatting is True, line breaks, paragraphs, other block-level elements, list elements, and <pre>-formatted text will be preserved.
Extraction of particular elements and attributes such as links, alt texts, or form fields can be configured individually by setting the corresponding parameter to True. Defaults to False for most elements (i.e., only basic text will be extracted).
- Parameters:
html (HTMLTree or str) – HTML as DOM tree or Unicode string
preserve_formatting (bool) – preserve basic block-level formatting
main_content (bool) – apply simple heuristics for extracting only “main-content” elements
list_bullets (bool) – insert bullets / numbers for list items
alt_texts (bool) – preserve alternative text descriptions
links (bool) – extract link target URLs
form_fields (bool) – extract form fields and their values
noscript (bool) – extract contents of <noscript> elements
comments (bool) – treat comment sections as main content
skip_elements (t.Iterable[str] or None) – list of CSS selectors for elements to skip
- Returns:
extracted plain text
- Return type:
str