HTML2Text

Resiliparse simple HTML (main) content extraction API documentation.

resiliparse.extract.html2text.extract_plain_text(html, preserve_formatting=True, main_content=False, list_bullets=True, alt_texts=True, links=False, form_fields=False, noscript=False, comments=True, skip_elements=None)

Perform a simple plain-text extraction from the given DOM node and its children.

Extracts all visible text (excluding script/style elements, comment nodes, etc.) and collapses consecutive whitespace characters.

If preserve_formatting is True, line breaks, paragraphs, other block-level elements, list elements, and pre-formatted text will be preserved. Use the special value 'minimal_html' to add reduced HTML markup to the formatted output, preserving headings (<h1> to <h6>), paragraphs (<p>), lists (<ul>, <ol>), <pre> text, <br> line breaks, and links (<a>, if links=True).

Extraction of particular elements and attributes such as links, alt texts, or form fields can be configured individually by setting the corresponding parameter to True. Most of these parameters default to False (i.e., only basic text is extracted).
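The per-element switches described above can be sketched as a small wrapper that spells out every option explicitly, so the result does not depend on the library defaults. SAMPLE_HTML, TEXT_ONLY, build_options, and extract are hypothetical names introduced for this illustration only; the sole Resiliparse call used is extract_plain_text with keyword arguments taken from the signature above.

```python
# Made-up sample markup, for illustration only.
SAMPLE_HTML = """<!doctype html>
<html><body>
  <h1>Title</h1>
  <p>A paragraph with a <a href="https://example.com/">link</a>.</p>
  <ul><li>one</li><li>two</li></ul>
  <img src="cat.png" alt="A cat">
</body></html>"""

# Hypothetical baseline: every switch is pinned so nothing relies on defaults.
TEXT_ONLY = {
    "preserve_formatting": True,
    "main_content": False,
    "list_bullets": True,
    "alt_texts": False,
    "links": False,
    "form_fields": False,
    "noscript": False,
}


def build_options(**overrides):
    """Return a copy of TEXT_ONLY with the given switches changed."""
    unknown = set(overrides) - set(TEXT_ONLY)
    if unknown:
        raise TypeError(f"unknown option(s): {sorted(unknown)}")
    return {**TEXT_ONLY, **overrides}


def extract(html, **overrides):
    """Run extract_plain_text with the pinned options plus any overrides."""
    # Imported lazily so build_options() stays usable without Resiliparse installed.
    from resiliparse.extract.html2text import extract_plain_text
    return extract_plain_text(html, **build_options(**overrides))
```

With Resiliparse installed, extract(SAMPLE_HTML) would request basic text only, extract(SAMPLE_HTML, links=True, alt_texts=True) would additionally request link targets and image descriptions, and extract(SAMPLE_HTML, preserve_formatting='minimal_html') would request the reduced HTML markup described above.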

Parameters:

html (HTMLTree or str) – HTML as DOM tree or Unicode string
preserve_formatting (bool or t.Literal['minimal_html']) – preserve basic block-level formatting (use 'minimal_html' for minimal HTML markup in output)
main_content (bool) – apply simple heuristics for extracting only “main-content” elements
list_bullets (bool) – insert bullets / numbers for list items
alt_texts (bool) – preserve alternative text descriptions
links (bool) – extract link target URLs
form_fields (bool) – extract form fields and their values
noscript (bool) – extract contents of <noscript> elements
comments (bool) – treat comment sections as main content
post_meta (bool) – preserve blog post / article meta data in main content extract
hidden_elements (bool) – keep elements hidden by inline CSS or class names
skip_elements (t.Iterable[str] or None) – list of CSS selectors for elements to skip

Returns:

extracted plain text

Return type:

str
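
As a rough sketch of how main_content and skip_elements might be combined, the following helper parses a Unicode string into an HTMLTree first and then passes a list of CSS selectors to skip. DEFAULT_SKIP, merge_selectors, and extract_article are hypothetical names, and the selectors are placeholders that would need to match the actual markup being processed.

```python
# Hypothetical selectors for page chrome; adjust to the markup at hand.
DEFAULT_SKIP = ("nav", "footer", ".cookie-banner")


def merge_selectors(extra=()):
    """Combine DEFAULT_SKIP with extra selectors, dropping blanks and
    duplicates while keeping first-seen order."""
    merged = []
    for selector in (*DEFAULT_SKIP, *extra):
        selector = selector.strip()
        if selector and selector not in merged:
            merged.append(selector)
    return merged


def extract_article(html_text, extra_skip=()):
    """Extract main-content text from a Unicode HTML string."""
    # Imported lazily so merge_selectors() works without Resiliparse installed.
    from resiliparse.parse.html import HTMLTree
    from resiliparse.extract.html2text import extract_plain_text

    tree = HTMLTree.parse(html_text)  # html may be a tree or a str; a tree is used here
    return extract_plain_text(
        tree,
        main_content=True,   # apply the main-content heuristics
        comments=False,      # do not treat comment sections as main content
        skip_elements=merge_selectors(extra_skip),
    )
```

Passing the string directly instead of the parsed tree should be equivalent according to the parameter description; parsing once is mainly useful when the same tree is reused for several extractions.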