HTML Parsing

Resiliparse comes with a lightweight and fast HTML parsing and DOM processing library for HTML web pages, based on the Lexbor web browser engine.

Warning

The HTML parsing module is experimental. While the code is mostly well-tested, a few Lexbor bugs have been fixed upstream but have not yet been released. You may want to build your own Resiliparse binaries against the latest Lexbor Git master for the best experience.

To parse a Python Unicode string into a DOM tree, construct a new HTMLTree object by calling the static factory method parse():

from resiliparse.parse.html import HTMLTree

html = """<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Example page</title>
  </head>
  <body>
    <main id="foo">
      <p id="a">Hello <span class="bar">world</span>!</p>
      <p id="b" class="dom">Hello <span class="bar baz">DOM</span>!</p>
     </main>
  </body>
</html>"""

tree = HTMLTree.parse(html)

If your HTML contents are encoded as bytes, use parse_from_bytes() instead of parse(), which takes a bytes object and an encoding:

from resiliparse.parse.encoding import detect_encoding

html_bytes = html.encode('utf-16')
tree = HTMLTree.parse_from_bytes(html_bytes, detect_encoding(html_bytes))

It is sufficient if the encoding name is a “best guess”, since the name will be remapped according to the WHATWG specification (using map_encoding_to_html5()) and the decoding is done with bytes_to_str(), which tries several fallback encodings if the originally intended encoding fails (see Convert Byte String to Unicode for more information).
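The fallback idea can be illustrated with a minimal stdlib-only sketch. This is an illustration of the concept, not Resiliparse's actual implementation; the function name and fallback list are made up:

```python
def decode_with_fallbacks(data: bytes, encoding: str,
                          fallbacks=('utf-8', 'windows-1252')) -> str:
    """Try the intended encoding first, then a list of fallbacks."""
    for enc in (encoding, *fallbacks):
        try:
            return data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: decode lossily instead of failing
    return data.decode('utf-8', errors='replace')

# The intended encoding (ASCII) fails, so UTF-8 is tried next:
print(decode_with_fallbacks('héllo'.encode('utf-8'), 'ascii'))
# >>> héllo
```

Resiliparse's bytes_to_str() additionally applies the WHATWG label remapping first, so that common aliases and misspelled encoding names still resolve to a usable codec.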

DOM Selection

Resiliparse provides a basic set of standard DOM functions for selecting element nodes and attributes in the DOM tree.

Elements

The document root element node can be accessed through the HTMLTree.document property. The HTMLTree.body and HTMLTree.head properties also exist for conveniently accessing a document’s <body> and <head> elements (if they exist). These anchors and all other element nodes are represented by DOMNode objects, which support the standard DOM selectors get_element_by_id(), get_elements_by_tag_name(), get_elements_by_class_name(), query_selector(), and query_selector_all() for matching further elements either by attribute or by CSS selector.

These element selectors behave just as you would expect from other languages or libraries and return either a single DOMNode object or a DOMCollection of all matching DOMNode objects. The only exception is matches(), which returns a boolean indicating whether the subtree contains any element matching the given CSS selector. In addition to these standard DOM functions, Resiliparse provides a generic get_elements_by_attr() function for selecting elements by arbitrary attribute names and values.

Note

If you want to match only a single element, always use the dedicated single-match selectors (e.g., use query_selector() instead of query_selector_all() etc.). These functions have built-in early stopping optimizations and are therefore more efficient than matching all elements in the tree and discarding unwanted elements in the resulting collection.

Here are a few examples of how to match elements by ID, tag name, class name, or CSS selector:

# Match single node by ID:
print(repr(tree.body.get_element_by_id('foo')))
# >>> <main id="foo">

# Match multiple nodes by tag name:
print(repr(tree.head.get_elements_by_tag_name('meta')))
# >>> {<meta charset="utf-8">}

# Match multiple nodes by class name:
print(repr(tree.body.get_elements_by_class_name('bar')))
# >>> {<span class="bar">, <span class="bar baz">}

# Match single node by CSS selector:
print(repr(tree.document.query_selector('body > main p:last-child')))
# >>> <p id="b" class="dom">

# Match multiple nodes by CSS selector:
print(repr(tree.body.query_selector_all('main *')))
# >>> {<p id="a">, <span class="bar">, <p id="b" class="dom">, <span class="bar baz">}

# Check whether there is any element matching this CSS selector:
print(tree.body.matches('.bar'))
# >>> True

DOMCollection objects are iterable, indexable, and sliceable. The size of a collection can be checked with len(). If a slice is requested, the returned object is another DOMCollection:

coll = tree.body.query_selector_all('main *')

# First element
print(repr(coll[0]))
# >>> <p id="a">

# Last element
print(repr(coll[-1]))
# >>> <span class="bar baz">

# First two elements
print(repr(coll[:2]))
# >>> {<p id="a">, <span class="bar">}

DOMCollection objects have the same DOM methods for selecting objects as DOMNode objects. This can be used for efficiently matching elements in the subtree(s) of the previously selected elements. The selection methods behave just like their DOMNode counterparts and return either a single DOMNode or another DOMCollection:

coll = tree.body.get_elements_by_class_name('dom')

# Only matches within the subtrees of elements in `coll`:
print(repr(coll.get_elements_by_class_name('bar')))
# >>> {<span class="bar baz">}

Attributes

Attributes of element nodes can be accessed either via DOMNode.getattr() or by dict-like access:

meta = tree.head.query_selector('meta[charset]')
if meta is not None:
    print(meta.getattr('charset'))
    # >>> utf-8

    # Or:
    print(meta['charset'])
    # >>> utf-8

The dict access method will raise a KeyError exception if the attribute does not exist.

The id and class attributes of an element are also available through the id and class_name or class_list properties:

p = tree.body.get_element_by_id('b')
print(p.id)
# >>> b

span = p.query_selector('span')
print(span.class_name)
# >>> bar baz

print(span.class_list)
# >>> ['bar', 'baz']

A list of existing attributes on an element is provided by its attrs property:

a = tree.create_element('div')

a.id = 'a-id'
a.class_name = 'a-class'
a['href'] = 'https://example.com'

print(a.attrs)
# >>> ['id', 'class', 'href']

HTML and Text Serialization

All DOMNode objects have a text and html property for accessing their plaintext or HTML serialization:

print(tree.body.get_element_by_id('a').text)
# >>> Hello world!

print(tree.body.get_element_by_id('a').html)
# >>> <p id="a">Hello <span class="bar">world</span>!</p>

Alternatively, you can also simply cast a DOMNode to str, which is equivalent to DOMNode.html:

print(tree.body.get_element_by_id('a'))
# >>> <p id="a">Hello <span class="bar">world</span>!</p>

For extracting specifically the text contents of the document’s <title> element, there is also the HTMLTree.title property:

print(tree.title)
# >>> Example page

DOM Tree Traversal

The DOM subtree of any node can be traversed in pre-order by iterating over a DOMNode instance. Different types of nodes can be distinguished by their type property.

from resiliparse.parse.html import NodeType

root = tree.body.get_element_by_id('a')

tag_names = [e.tag for e in root]
tag_names_elements_only = [e.tag for e in root if e.type == NodeType.ELEMENT]

print(tag_names)
# >>> ['p', '#text', 'span', '#text', '#text']

print(tag_names_elements_only)
# >>> ['p', 'span']

To iterate only the immediate children of a node, loop over its child_nodes property instead of the node itself:

for e in tree.body.get_element_by_id('foo').child_nodes:
    if e.type == NodeType.ELEMENT:
        print(e.text)

Output:

Hello world!
Hello DOM!

In addition, any DOMNode object exposes navigation properties for accessing its parent, its first and last child, and its previous and next sibling nodes. These can be used for traversing in a custom order or with custom logic, though the callback-based traversal mechanism described in the next section, Advanced Traversal, is usually more convenient.

Advanced Traversal

Besides the simple iterable interface, Resiliparse also supports a more advanced and flexible callback-based method for traversing the DOM tree. The traverse_dom() helper function accepts a DOMNode and a callback function that is called for each individual DOM node with a DOMContext context object as its only parameter.

The following example prints the values of all text nodes:

from resiliparse.parse.html import *

def start_cb(ctx: DOMContext):
    if ctx.node.type == NodeType.TEXT and ctx.node.value.strip():
        print(ctx.node.value.strip(), end=' ')

traverse_dom(tree.body, start_cb)

Output:

Hello world ! Hello DOM !

In addition to the start element callback, you can also specify an end element callback that is invoked every time the DOM tree traverser encounters an element’s end tag (i.e., every time the traverser moves back up one node level).

The following example prints all start and end tags without their textual contents or attributes:

def start_cb(ctx: DOMContext):
    if ctx.node.type == NodeType.ELEMENT:
        print(f'<{ctx.node.tag}>', end='')

def end_cb(ctx: DOMContext):
    if ctx.node.type == NodeType.ELEMENT:
        print(f'</{ctx.node.tag}>', end='')

traverse_dom(tree.body, start_cb, end_cb)

Output:

<body><main><p><span></span></p><p><span></span></p></main></body>

Besides a reference to the current node, the context object also keeps track of the traversal depth, so the following is possible:

def start_cb(ctx: DOMContext):
    if ctx.node.type == NodeType.ELEMENT:
        print(f'{"  " * ctx.depth}<{ctx.node.tag}>')
    if ctx.node.type == NodeType.TEXT and ctx.node.value.strip():
        print(f'{"  " * ctx.depth}{ctx.node.text.strip()}')

def end_cb(ctx: DOMContext):
    if ctx.node.type == NodeType.ELEMENT:
        print(f'{"  " * ctx.depth}</{ctx.node.tag}>')

traverse_dom(tree.body, start_cb, end_cb)

Output:

<body>
  <main>
    <p>
      Hello
      <span>
        world
      </span>
      !
    </p>
    <p>
      Hello
      <span>
        DOM
      </span>
      !
    </p>
  </main>
</body>

The context object is the same object throughout the whole traversal, so besides the node and depth attributes, it can be mutated arbitrarily to maintain your own state. If you need to, you can also pass a pre-initialized context object to traverse_dom(). The following example converts the HTML body into nested Python lists:

def start_cb(ctx: DOMContext):
    if ctx.node.type == NodeType.ELEMENT:
        t = (ctx.node.tag, [])
        ctx.list_stack_current[-1].append(t)
        ctx.list_stack_current.append(t[1])
    elif ctx.node.type == NodeType.TEXT:
        txt = ctx.node.value.strip()
        if txt:
            ctx.list_stack_current[-1].append(txt)

def end_cb(ctx: DOMContext):
    if ctx.node.type == NodeType.ELEMENT:
        ctx.list_stack_current.pop()

ctx = DOMContext()
ctx.list_stack = []
ctx.list_stack_current = [ctx.list_stack]
traverse_dom(tree.body, start_cb, end_cb, ctx)

print(ctx.list_stack)

Output:

[('body', [('main', [('p', ['Hello', ('span', ['world']), '!']), ('p', ['Hello', ('span', ['DOM']), '!'])])])]

DOM Tree Manipulation

Resiliparse supports DOM manipulation and the creation of new nodes with a basic set of well-known DOM functions.

Warning

A DOMNode object is valid only for as long as its parent tree has not been modified or deallocated. Thus, DO NOT use existing instances after any sort of DOM tree manipulation! Doing so may result in Python crashes or (worse) security vulnerabilities due to dangling pointers (use after free). This is a known Lexbor limitation for which there is no workaround at the moment.

Elements

The following is an example of how to create new DOM elements and text nodes and insert them into the tree:

# Create a new <p> element node
new_element = tree.create_element('p')

# Create a new text node
new_text = tree.create_text_node('Hello Resiliparse!')

# Insert nodes into DOM tree
main_element = tree.body.query_selector('main')
main_element.append_child(new_element)
new_element.append_child(new_text)

print(main_element)

Output:

<main id="foo">
      <p id="a">Hello <span class="bar">world</span>!</p>
      <p id="b" class="dom">Hello <span class="bar baz">DOM</span>!</p>
     <p>Hello Resiliparse!</p></main>

In addition to append_child(), nodes also provide insert_before() for inserting a child node before another child instead of appending it at the end, and replace_child() for replacing an existing child node in the tree with another.

Use remove_child() to remove a node from the tree:

main_element.remove_child(new_element)

To fully delete a node, use decompose() on the node itself. This will remove it from the tree (if not already done) and delete the node and its entire subtree recursively:

new_element.decompose()
# From here on, this element and all elements in its subtree are invalid!!!

Attributes

Attributes can be added or modified via setattr() or by assigning directly to the element’s dict entry:

element = tree.create_element('img')
element['src'] = 'https://example.com/foo.png'
element.setattr('alt', 'Foo')

print(element)
# >>> <img src="https://example.com/foo.png" alt="Foo">

For id and class attributes, you can also use id and class_name or class_list:

element = tree.create_element('div')

element.id = 'my-id'
element.class_name = 'class-a'
element.class_list.add('class-b')

print(element)
# >>> <div id="my-id" class="class-a class-b"></div>

print(element.class_list)
# >>> ['class-a', 'class-b']

element.class_list.remove('class-a')
print(element)
# >>> <div id="my-id" class="class-b"></div>

Inner HTML and Inner Text

An easier, but less efficient, way of manipulating the DOM is to assign a string directly to a node’s html or text property. This replaces the node’s inner HTML or inner text with the new value:

main_element.html = '<p>New inner HTML content</p>'
print(main_element)
# >>> <main id="foo"><p>New inner HTML content</p></main>

main_element.text = '<p>New inner text content</p>'
print(main_element)
# >>> <main id="foo">&lt;p&gt;New inner text content&lt;/p&gt;</main>

Benchmarks

The Resiliparse CLI comes with a small HTML parser benchmarking tool that can measure the parsing engine’s performance and compare it to other Python HTML parsing libraries. Supported third-party libraries are Selectolax (both the old MyHTML and the new Lexbor engine) and BeautifulSoup4 (lxml engine only, which is the fastest BS4 backend).

Here are the results of extracting the titles from all web pages in an uncompressed 42,015-document WARC file on a Ryzen Threadripper 2920X machine:

$ resiliparse html benchmark CC-MAIN-*.warc

HTML parser benchmark <title> extraction:
=========================================
Resiliparse (Lexbor):  42015 documents in 36.55s (1149.56 documents/s)
Selectolax (Lexbor):   42015 documents in 37.46s (1121.52 documents/s)
Selectolax (MyHTML):   42015 documents in 53.82s (780.72 documents/s)
BeautifulSoup4 (lxml): 42015 documents in 874.40s (48.05 documents/s)

Not surprisingly, the two parsers based on the Lexbor engine perform almost identically, whereas BeautifulSoup4 with lxml is by far the slowest, trailing by a factor of about 24.