HTML Parsing

Resiliparse HTML parsing and DOM traversal utilities API documentation.

class resiliparse.parse.html.NodeType(*values)

DOM node type enum.

ELEMENT = 1

ATTRIBUTE = 2

TEXT = 3

CDATA_SECTION = 4

ENTITY_REFERENCE = 5

ENTITY = 6

PROCESSING_INSTRUCTION = 7

COMMENT = 8

DOCUMENT = 9

DOCUMENT_TYPE = 10

DOCUMENT_FRAGMENT = 11

NOTATION = 12

class resiliparse.parse.html.DOMCollection(self)

Collection of DOM nodes that are the result set of an element matching operation.

A node collection is only valid for as long as the owning HTMLTree is alive and the DOM tree hasn’t been modified. Do not access DOMCollection instances after any sort of DOM tree manipulation.

__getitem__(self, key)

Return the DOMNode at the given index in this collection or another DOMCollection if key is a slice object. Negative indexing is supported.

Parameters:

key – index or slice

Return type:

DOMNode or DOMCollection

Raises:

IndexError – if key is out of range
TypeError – if key is not an int or slice

__iter__(self)

Iterate the DOM node collection.

Return type:: t.Iterable[DOMNode]

get_element_by_id(self, element_id, case_insensitive=False)

Within all elements in this collection, find and return the element whose ID attribute matches element_id.

Parameters:

element_id (str) – element ID
case_insensitive (bool) – match ID case-insensitively

Returns:

matching element or None if no such element exists

Return type:

DOMNode or None

get_elements_by_attr(self, attr_name, attr_value, case_insensitive=False)

Within all elements in this collection, find the elements matching the given arbitrary attribute name and value and return a DOMCollection with the aggregated results.

Parameters:

attr_name (str) – attribute name
attr_value – attribute value
case_insensitive (bool) – match attribute value case-insensitively

Returns:

collection of matching elements

Return type:

DOMCollection or None

get_elements_by_class_name(self, element_class, case_insensitive=False)

Within all elements in this collection, find the elements matching the given class name and return a DOMCollection with the aggregated results.

Parameters:

element_class (str) – class name to match
case_insensitive (bool) – match class name case-insensitively

Returns:

collection of matching elements

Return type:

DOMCollection or None

get_elements_by_tag_name(self, tag_name)

Within all elements in this collection, find the elements with the given tag name and return a DOMCollection with the aggregated results.

Parameters:: tag_name (str) – tag name for matching elements
Returns:: collection of matching elements
Return type:: DOMCollection

matches(self, selector)

Within all elements in this collection, check whether any element in the DOM tree matches the given CSS selector. This is more efficient than matching with query_selector_all() and checking the size of the returned collection.

Parameters:: selector (str) – CSS selector
Returns:: boolean value indicating whether a matching element exists
Return type:: bool

query_selector(self, selector)

Within all elements in this collection, find and return the first element matching the given CSS selector. This is more efficient than matching with query_selector_all() and discarding additional elements.

Parameters:: selector (str) – CSS selector
Returns:: matching element or None
Return type:: DOMNode or None

query_selector_all(self, selector)

Within all elements in this collection, find the elements matching the given CSS selector and return a DOMCollection with the aggregated results.

Parameters:: selector (str) – CSS selector
Returns:: collection of matching elements
Return type:: DOMCollection

class resiliparse.parse.html.DOMContext

DOM node traversal context object.

The context object has two attributes that are set by the traversal function for keeping track of the current DOMNode and the current traversal depth. Besides these, the context object is arbitrarily mutable and can be used for maintaining custom state.

Variables:

node (DOMNode) – the current DOMNode
depth (int) – the current traversal depth

class resiliparse.parse.html.DOMElementClassList

Class name list of an Element DOM node.

__getitem__(): __contains__(self, item)

__iter__(self)

add(self, class_name)

Add new class name to Element node if not already present.

Parameters:: class_name (str) – new class name

remove(self, class_name)

Remove a class name from this Element node.

Parameters:: class_name (str) – class name to remove

class resiliparse.parse.html.DOMNode(self)

DOM node.

A DOM node is only valid as long as the owning HTMLTree is alive and the DOM tree hasn’t been modified. Do not access DOMNode instances after any sort of DOM tree manipulation.

__getitem__(self, attr_name)

Get the value of an attribute.

Parameters:

attr_name – attribute name

Return type:

str

Raises:

KeyError – if no such attribute exists
ValueError – if node is not an Element node

__iter__(self)

Traverse the DOM tree in pre-order starting at the current node.

Return type:: t.Iterable[DOMNode]

__setitem__(self, attr_name, attr_value)

Insert or update an attribute with the given name to the given value.

Parameters:

attr_name (str) – attribute name
attr_value (str) – attribute value

Returns:

attribute value

Raises:

ValueError – if node is not an Element node

append_child(self, node)

Append a new child node to this DOM node.

Parameters:: node (DOMNode) – DOM node to append as new child node
Returns:: the appended child node
Return type:: DOMNode
Raises:: ValueError – if trying to append node to itself

decompose(self): Delete the current node and all its children.

delattr(self, attr_name)

Remove the given attribute if it exists.

Parameters:: attr_name (str) – attribute to remove
Raises:: ValueError – if node is not an Element node

get_element_by_id(self, element_id, case_insensitive=False)

Find and return the element whose ID attribute matches element_id.

Parameters:

element_id (str) – element ID
case_insensitive (bool) – match ID case-insensitively

Returns:

matching element or None if no such element exists

Return type:

DOMNode or None

get_elements_by_attr(self, attr_name, attr_value, case_insensitive=False)

Find all elements matching the given arbitrary attribute name and value and return a DOMCollection with the results.

Parameters:

attr_name (str) – attribute name
attr_value – attribute value
case_insensitive (bool) – match attribute value case-insensitively

Returns:

collection of matching elements

Return type:

DOMCollection or None

get_elements_by_class_name(self, element_class, case_insensitive=False)

Find all elements matching the given class name and return a DOMCollection with the results.

Parameters:

element_class (str) – class name to match
case_insensitive (bool) – match class name case-insensitively

Returns:

collection of matching elements

Return type:

DOMCollection or None

get_elements_by_tag_name(self, tag_name)

Find all elements with the given tag name and return a DOMCollection with the results.

Parameters:: tag_name (str) – tag name for matching elements
Returns:: collection of matching elements
Return type:: DOMCollection

getattr(self, attr_name, default_value=None)

Get the value of the attribute attr_name or default_value if the element has no such attribute.

Parameters:

attr_name (str) – attribute name
default_value (str or None) – default value to return if attribute is unset

Returns:

attribute value

Return type:

str or None

Raises:

ValueError – if node is invalid or not an Element node

hasattr(self, attr_name)

Check if node has an attribute with the given name.

Parameters:: attr_name (str) – attribute name
Return type:: bool
Raises:: ValueError – if node is not an Element node

insert_before(self, node, reference)

Insert node before reference as a new child node. The reference node must be a child of this node or None. If reference is None, the new node will be appended after the last child node.

Parameters:

node (DOMNode) – DOM node to insert as new child node
reference (DOMNode) – child node before which to insert the new node or None

Returns:

the inserted child node

Return type:

DOMNode

Raises:

ValueError – if trying to add node as its own child or if reference is not a child

matches(self, selector)

Check whether any element in the DOM tree matches the given CSS selector. This is more efficient than matching with query_selector_all() and checking the size of the returned collection.

Parameters:: selector (str) – CSS selector
Returns:: boolean value indicating whether a matching element exists
Return type:: bool

query_selector(self, selector)

Find and return the first element matching the given CSS selector. This is more efficient than matching with query_selector_all() and discarding additional elements.

Parameters:: selector (str) – CSS selector
Returns:: matching element or None
Return type:: DOMNode or None

query_selector_all(self, selector)

Find all elements matching the given CSS selector and return a DOMCollection with the results.

Parameters:: selector (str) – CSS selector
Returns:: collection of matching elements
Return type:: DOMCollection

remove_child(self, node)

Remove the child node node from the DOM tree and return it.

Parameters:: node (DOMNode) – DOM node to remove
Returns:: the removed child node
Return type:: DOMNode
Raises:: ValueError – if node is not a child of this node

replace_child(self, new_child, old_child)

Replace the child node old_child with new_child.

Parameters:

new_child (DOMNode) – new child node to insert
old_child (DOMNode) – old child node to replace

Returns:

the old child node

Return type:

DOMNode

Raises:

ValueError – if old_child is not a child of this node

setattr(self, attr_name, attr_value)

Insert or update an attribute with the given name to the given value.

Parameters:

attr_name (str) – attribute name
attr_value (str) – attribute value

Returns:

attribute value

Raises:

ValueError – if node is not an Element node

attrs

List of attribute names if node is an Element node.

Type:: t.List[str] or None

child_nodes

List of child nodes.

Type:: t.List[DOMNode]

class_list

List of class names set on this Element node.

Type:: DOMElementClassList

class_name

Class name attribute of this Element node (empty string if unset).

Type:: str

first_child

First child node of this DOM node.

Type:: DOMNode or None

first_element_child

First element child of this DOM node.

Type:: DOMNode or None

html

HTML contents of this DOM node and its children.

The DOM node’s inner HTML can be modified by assigning to this property.

Type:: str

id

ID attribute of this Element node (empty string if unset).

Type:: str

last_child

Last child node of this DOM node.

Type:: DOMNode or None

last_element_child

Last element child of this DOM node.

Type:: DOMNode or None

next

Next sibling node.

Type:: DOMNode or None

next_element

Next sibling element node.

Type:: DOMNode or None

parent

Parent of this node.

Type:: DOMNode or None

prev

Previous sibling node.

Type:: DOMNode or None

prev_element

Previous sibling element node.

Type:: DOMNode or None

tag

DOM element tag or node name.

Type:: str or None

text

Text contents of this DOM node and its children.

The DOM node’s inner text can be modified by assigning to this property.

Type:: str

type

DOM node type.

Type:: NodeType

value

Node text value.

Type:: str or None

class resiliparse.parse.html.HTMLTree(self)

HTML DOM tree parser.

classmethod parse(self, document)

Parse HTML from a Unicode string into a DOM tree.

Parameters:: document – input HTML document
Returns:: HTML DOM tree
Return type:: HTMLTree
Raises:: ValueError – if HTML parsing fails for unknown reasons

classmethod parse_from_bytes(self, document, encoding='utf-8', errors='ignore')

Decode a raw HTML byte string and parse it into a DOM tree.

The decoding routine uses bytes_to_str() to take care of decoding errors, so it is sufficient if encoding is just a best guess of what the actual input encoding is. The encoding name will be remapped according to the WHATWG specification by calling map_encoding_to_html5() before trying to decode the byte string with it.

Parameters:

document – input byte string
encoding – encoding for decoding byte string
errors – decoding error policy (same as str.decode())

Returns:

HTML DOM tree

Return type:

HTMLTree

Raises:

ValueError – if HTML parsing fails for unknown reasons

create_element(self, tag_name)

Create a new DOM Element node.

Parameters:: tag_name (str) – element tag name
Returns:: new Element node
Return type:: DOMNode

create_text_node(self, text)

Create a new DOM text node.

Parameters:: text (str) – string contents of the new text node
Returns:: new text node
Return type:: DOMNode

body

HTML body element or None if document has no body.

Type:: DOMNode or None

document

Document root node.

Type:: DOMNode or None

head

HTML head element or None if document has no head.

Type:: DOMNode or None

title

The HTML document title.

Type:: str or None

resiliparse.parse.html.traverse_dom(base_node, start_callback, end_callback=None, context=None, elements_only=False)

DOM traversal helper.

Traverses the DOM tree starting at base_node in pre-order and calls start_callback at each child node. If end_callback is not None, it will be called each time a DOM element’s end tag is encountered.

The callbacks are expected to take exactly one DOMContext context parameter, which keeps track of the current node and traversal depth. The context object will be the same throughout the whole traversal process, so it can be mutated with custom data.

Parameters:

base_node (DOMNode) – root node of the traversal
start_callback (t.Callable[[DOMContext], None]) – callback for each DOM node on the way (takes the DOMContext as its only parameter)
end_callback (t.Callable[[DOMContext], None] or None) – optional callback for element node end tags (takes the DOMContext as its only parameter)
context (DOMContext) – optional pre-initialized context object
elements_only (bool) – traverse only element nodes