HTML2Text
Resiliparse simple HTML (main) content extraction API documentation.
- resiliparse.extract.html2text.extract_plain_text(html, preserve_formatting=True, main_content=False, list_bullets=True, alt_texts=False, links=True, form_fields=False, noscript=False, comments=None, skip_elements=None)
Perform a simple plain-text extraction from the given DOM node and its children.
Extracts all visible text (excluding script/style elements, comment nodes etc.) and collapses consecutive white space characters.
If preserve_formatting is True, line breaks, paragraphs, other block-level elements, list elements, and pre-formatted text will be preserved. Use the special value 'minimal_html' to add reduced HTML markup to the formatted output, preserving headings (<h1> to <h6>), paragraphs (<p>), lists (<ul>, <ol>), <pre> text, <br> line breaks, and links (<a>, if links=True).
Extraction of particular elements and attributes such as links, alt texts, or form fields can be configured individually by setting the corresponding parameter to True. Defaults to False for most elements (i.e., only basic text will be extracted).
- Parameters:
html (HTMLTree or str) – HTML as DOM tree or Unicode string
preserve_formatting (bool or t.Literal['minimal_html']) – preserve basic block-level formatting (use 'minimal_html' for minimal HTML markup in output)
main_content (bool) – apply simple heuristics for extracting only “main-content” elements
list_bullets (bool) – insert bullets / numbers for list items
alt_texts (bool) – preserve alternative text descriptions
links (bool) – extract link target URLs
form_fields (bool) – extract form fields and their values
noscript (bool) – extract contents of <noscript> elements
comments (bool) – treat comment sections as main content
post_meta (bool) – preserve blog post / article meta data in main content extract
hidden_elements (bool) – keep elements hidden by inline CSS or class names
skip_elements (t.Iterable[str] or None) – list of CSS selectors for elements to skip
- Returns:
extracted plain text
- Return type:
str