HTML2Text

Resiliparse simple HTML (main) content extraction API documentation.

extract_plain_text(base_node, preserve_formatting=True, main_content=False, list_bullets=True, alt_texts=False, links=False, form_fields=False, noscript=False, skip_elements=None)

Perform a simple plain-text extraction from the given DOM node and its children.

Extracts all visible text (excluding script/style elements, comment nodes etc.) and collapses consecutive white space characters. If preserve_formatting is True, line breaks, paragraphs, other block-level elements, list elements, and <pre>-formatted text will be preserved.

Extraction of particular elements and attributes such as links, alt texts, or form fields can be configured individually by setting the corresponding parameter to True. Most of these parameters default to False (i.e., only basic text will be extracted).

Parameters
  • base_node (DOMNode) – base DOM node whose subtree is to be extracted

  • preserve_formatting (bool) – preserve basic block-level formatting

  • main_content (bool) – apply simple heuristics for extracting only “main-content” elements

  • list_bullets (bool) – insert bullets / numbers for list items

  • alt_texts (bool) – preserve alternative text descriptions

  • links (bool) – extract link target URLs

  • form_fields (bool) – extract form fields and their values

  • noscript (bool) – extract contents of <noscript> elements

  • skip_elements (t.Iterable[str] or None) – names of elements to skip (defaults to head, script, style)

Returns

extracted plain text

Return type

str