HTML2Text

Resiliparse simple HTML (main) content extraction API documentation.

resiliparse.extract.html2text.extract_plain_text(tree, preserve_formatting=True, main_content=False, list_bullets=True, alt_texts=True, links=False, form_fields=False, noscript=False, comments=True, skip_elements=None)

Perform a simple plain-text extraction from the given DOM node and its children.

Extracts all visible text (excluding script/style elements, comment nodes etc.) and collapses consecutive white space characters. If preserve_formatting is True, line breaks, paragraphs, other block-level elements, list elements, and <pre>-formatted text will be preserved.
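As a rough illustration of what "visible text with collapsed white space" means, the following standard-library sketch skips script/style contents and comment nodes and then squeezes runs of white space. It is not how Resiliparse implements the extraction (the library works on a parsed DOM tree and handles block-level formatting); the names VisibleTextParser and naive_plain_text are hypothetical and exist only for this example.

```python
import re
from html.parser import HTMLParser

# Elements whose text is never rendered (assumption: a minimal set for this sketch).
INVISIBLE = {"script", "style"}


class VisibleTextParser(HTMLParser):
    """Collect text nodes that lie outside of non-rendered elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in INVISIBLE:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in INVISIBLE and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Comment nodes never arrive here; HTMLParser routes them to handle_comment().
        if self._hidden_depth == 0:
            self.chunks.append(data)


def naive_plain_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Join text nodes with a space (this may split words that span inline tags),
    # then collapse every run of white space into a single space.
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


sample = (
    "<p>Hello   <b>world</b></p>\n"
    "<script>var x = 1;</script><!-- note -->"
    "<p>Second\n\tparagraph</p>"
)
print(naive_plain_text(sample))  # prints "Hello world Second paragraph"
```

Unlike this sketch, extract_plain_text with preserve_formatting=True keeps paragraph and line breaks instead of flattening everything onto one line.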

Extraction of particular elements and attributes such as links, alt texts, or form fields can be configured individually by setting the corresponding parameter to True. Most of these parameters default to False (i.e., only basic text will be extracted).

Parameters
  • tree (DOMNode) – HTML DOM tree

  • preserve_formatting (bool) – preserve basic block-level formatting

  • main_content (bool) – apply simple heuristics for extracting only “main-content” elements

  • list_bullets (bool) – insert bullets / numbers for list items

  • alt_texts (bool) – preserve alternative text descriptions

  • links (bool) – extract link target URLs

  • form_fields (bool) – extract form fields and their values

  • noscript (bool) – extract contents of <noscript> elements

  • comments (bool) – treat comment sections as main content

  • skip_elements (t.Iterable[str] or None) – list of CSS selectors for elements to skip

Returns

extracted plain text

Return type

str