HTML Parsing

Resiliparse HTML parsing and DOM traversal utilities API documentation.

class resiliparse.parse.html.NodeType(value)

DOM node type enumeration.

ELEMENT = 1
ATTRIBUTE = 2
TEXT = 3
CDATA_SECTION = 4
ENTITY_REFERENCE = 5
ENTITY = 6
PROCESSING_INSTRUCTION = 7
COMMENT = 8
DOCUMENT = 9
DOCUMENT_TYPE = 10
DOCUMENT_FRAGMENT = 11
NOTATION = 12
LAST_ENTRY = 13

class resiliparse.parse.html.DOMCollection(self)

Collection of DOM nodes that are the result set of an element matching operation.

A node collection is only valid for as long as the owning HTMLTree is alive and the DOM tree hasn’t been modified. Do not access DOMCollection instances after any sort of DOM tree manipulation.

__getitem__(self, key)

Return the DOMNode at the given index in this collection or another DOMCollection if key is a slice object. Negative indexing is supported.

Parameters:

key – index or slice

Return type:

DOMNode or DOMCollection

Raises:
  • IndexError – if key is out of range

  • TypeError – if key is not an int or slice

__iter__(self)

Iterate the DOM node collection.

Return type:

t.Iterable[DOMNode]

get_element_by_id(self, element_id, case_insensitive=False)

Within all elements in this collection, find and return the element whose ID attribute matches element_id.

Parameters:
  • element_id (str) – element ID

  • case_insensitive (bool) – match ID case-insensitively

Returns:

matching element or None if no such element exists

Return type:

DOMNode or None

get_elements_by_attr(self, attr_name, attr_value, case_insensitive=False)

Within all elements in this collection, find the elements matching the given arbitrary attribute name and value and return a DOMCollection with the aggregated results.

Parameters:
  • attr_name (str) – attribute name

  • attr_value – attribute value

  • case_insensitive (bool) – match attribute value case-insensitively

Returns:

collection of matching elements

Return type:

DOMCollection or None

get_elements_by_class_name(self, element_class, case_insensitive=False)

Within all elements in this collection, find the elements matching the given class name and return a DOMCollection with the aggregated results.

Parameters:
  • element_class (str) – element class

  • case_insensitive (bool) – match class name case-insensitively

Returns:

collection of matching elements

Return type:

DOMCollection or None

get_elements_by_tag_name(self, tag_name)

Within all elements in this collection, find the elements with the given tag name and return a DOMCollection with the aggregated results.

Parameters:

tag_name (str) – tag name for matching elements

Returns:

collection of matching elements

Return type:

DOMCollection

matches(self, selector)

Within all elements in this collection, check whether any element in the DOM tree matches the given CSS selector. This is more efficient than matching with query_selector_all() and checking the size of the returned collection.

Parameters:

selector (str) – CSS selector

Returns:

boolean value indicating whether a matching element exists

Return type:

bool

query_selector(self, selector)

Within all elements in this collection, find and return the first element matching the given CSS selector. This is more efficient than matching with query_selector_all() and discarding additional elements.

Parameters:

selector (str) – CSS selector

Returns:

matching element or None

Return type:

DOMNode or None

query_selector_all(self, selector)

Within all elements in this collection, find the elements matching the given CSS selector and return a DOMCollection with the aggregated results.

Parameters:

selector (str) – CSS selector

Returns:

collection of matching elements

Return type:

DOMCollection

class resiliparse.parse.html.DOMContext

DOM node traversal context object.

The context object has two attributes that are set by the traversal function for keeping track of the current DOMNode and the current traversal depth. Besides these, the context object is arbitrarily mutable and can be used for maintaining custom state.

Variables:
  • node (DOMNode) – the current DOMNode

  • depth (int) – the current traversal depth

class resiliparse.parse.html.DOMElementClassList

Class name list of an Element DOM node.

__getitem__()

__contains__(self, item)

__iter__(self)

add(self, class_name)

Add new class name to Element node if not already present.

Parameters:

class_name (str) – new class name

remove(self, class_name)

Remove a class name from this Element node.

Parameters:

class_name (str) – class name to remove

class resiliparse.parse.html.DOMNode(self)

DOM node.

A DOM node is only valid as long as the owning HTMLTree is alive and the DOM tree hasn’t been modified. Do not access DOMNode instances after any sort of DOM tree manipulation.

__getitem__(self, attr_name)

Get the value of an attribute.

Parameters:

attr_name – attribute name

Return type:

str

Raises:
  • KeyError – if no such attribute exists

  • ValueError – if node is not an Element node

__iter__(self)

Traverse the DOM tree in pre-order starting at the current node.

Return type:

t.Iterable[DOMNode]

__setitem__(self, attr_name, attr_value)

Set the attribute with the given name to the given value, inserting the attribute if it does not exist yet.

Parameters:
  • attr_name (str) – attribute name

  • attr_value (str) – attribute value

Returns:

attribute value

Raises:

ValueError – if node is not an Element node

append_child(self, node)

Append a new child node to this DOM node.

Parameters:

node (DOMNode) – DOM node to append as new child node

Returns:

the appended child node

Return type:

DOMNode

Raises:

ValueError – if trying to append node to itself

decompose(self)

Delete the current node and all its children.

delattr(self, attr_name)

Remove the given attribute if it exists.

Parameters:

attr_name (str) – attribute to remove

Raises:

ValueError – if node is not an Element node

get_element_by_id(self, element_id, case_insensitive=False)

Find and return the element whose ID attribute matches element_id.

Parameters:
  • element_id (str) – element ID

  • case_insensitive (bool) – match ID case-insensitively

Returns:

matching element or None if no such element exists

Return type:

DOMNode or None

get_elements_by_attr(self, attr_name, attr_value, case_insensitive=False)

Find all elements matching the given arbitrary attribute name and value and return a DOMCollection with the results.

Parameters:
  • attr_name (str) – attribute name

  • attr_value – attribute value

  • case_insensitive (bool) – match attribute value case-insensitively

Returns:

collection of matching elements

Return type:

DOMCollection or None

get_elements_by_class_name(self, element_class, case_insensitive=False)

Find all elements matching the given class name and return a DOMCollection with the results.

Parameters:
  • element_class (str) – element class

  • case_insensitive (bool) – match class name case-insensitively

Returns:

collection of matching elements

Return type:

DOMCollection or None

get_elements_by_tag_name(self, tag_name)

Find all elements with the given tag name and return a DOMCollection with the results.

Parameters:

tag_name (str) – tag name for matching elements

Returns:

collection of matching elements

Return type:

DOMCollection

getattr(self, attr_name, default_value=None)

Get the value of the attribute attr_name or default_value if the element has no such attribute.

Parameters:
  • attr_name (str) – attribute name

  • default_value (str or None) – default value to return if attribute is unset

Returns:

attribute value

Return type:

str or None

Raises:

ValueError – if node is invalid or not an Element node

hasattr(self, attr_name)

Check if node has an attribute with the given name.

Parameters:

attr_name (str) – attribute name

Return type:

bool

Raises:

ValueError – if node is not an Element node

insert_before(self, node, reference)

Insert node before reference as a new child node. The reference node must be a child of this node or None. If reference is None, the new node will be appended after the last child node.

Parameters:
  • node (DOMNode) – DOM node to insert as new child node

  • reference (DOMNode) – child node before which to insert the new node or None

Returns:

the inserted child node

Return type:

DOMNode

Raises:

ValueError – if trying to add node as its own child or if reference is not a child

matches(self, selector)

Check whether any element in the DOM tree matches the given CSS selector. This is more efficient than matching with query_selector_all() and checking the size of the returned collection.

Parameters:

selector (str) – CSS selector

Returns:

boolean value indicating whether a matching element exists

Return type:

bool

query_selector(self, selector)

Find and return the first element matching the given CSS selector. This is more efficient than matching with query_selector_all() and discarding additional elements.

Parameters:

selector (str) – CSS selector

Returns:

matching element or None

Return type:

DOMNode or None

query_selector_all(self, selector)

Find all elements matching the given CSS selector and return a DOMCollection with the results.

Parameters:

selector (str) – CSS selector

Returns:

collection of matching elements

Return type:

DOMCollection

remove_child(self, node)

Remove the child node node from the DOM tree and return it.

Parameters:

node (DOMNode) – DOM node to remove

Returns:

the removed child node

Return type:

DOMNode

Raises:

ValueError – if node is not a child of this node

replace_child(self, new_child, old_child)

Replace the child node old_child with new_child.

Parameters:
  • new_child (DOMNode) – new child node to insert

  • old_child (DOMNode) – old child node to replace

Returns:

the old child node

Return type:

DOMNode

Raises:

ValueError – if old_child is not a child of this node

setattr(self, attr_name, attr_value)

Set the attribute with the given name to the given value, inserting the attribute if it does not exist yet.

Parameters:
  • attr_name (str) – attribute name

  • attr_value (str) – attribute value

Returns:

attribute value

Raises:

ValueError – if node is not an Element node

attrs

List of attribute names if node is an Element node.

Type:

t.List[str] or None

child_nodes

List of child nodes.

Type:

t.List[DOMNode]

class_list

List of class names set on this Element node.

Type:

DOMElementClassList

class_name

Class name attribute of this Element node (empty string if unset).

Type:

str

first_child

First child node of this DOM node.

Type:

DOMNode or None

first_element_child

First element child of this DOM node.

Type:

DOMNode or None

html

HTML contents of this DOM node and its children.

The DOM node’s inner HTML can be modified by assigning to this property.

Type:

str

id

ID attribute of this Element node (empty string if unset).

Type:

str

last_child

Last child node of this DOM node.

Type:

DOMNode or None

last_element_child

Last element child of this DOM node.

Type:

DOMNode or None

next

Next sibling node.

Type:

DOMNode or None

next_element

Next sibling element node.

Type:

DOMNode or None

parent

Parent of this node.

Type:

DOMNode or None

prev

Previous sibling node.

Type:

DOMNode or None

prev_element

Previous sibling element node.

Type:

DOMNode or None

tag

DOM element tag or node name.

Type:

str or None

text

Text contents of this DOM node and its children.

The DOM node’s inner text can be modified by assigning to this property.

Type:

str

type

DOM node type.

Type:

NodeType

value

Node text value.

Type:

str or None

class resiliparse.parse.html.HTMLTree(self)

HTML DOM tree parser.

create_element(self, tag_name)

Create a new DOM Element node.

Parameters:

tag_name (str) – element tag name

Returns:

new Element node

Return type:

DOMNode

create_text_node(self, text)

Create a new DOM text node.

Parameters:

text (str) – string contents of the new text node

Returns:

new text node

Return type:

DOMNode

classmethod parse(self, document)

Parse HTML from a Unicode string into a DOM tree.

Parameters:

document – input HTML document

Returns:

HTML DOM tree

Return type:

HTMLTree

Raises:

ValueError – if HTML parsing fails for unknown reasons

classmethod parse_from_bytes(self, document, encoding='utf-8', errors='ignore')

Decode a raw HTML byte string and parse it into a DOM tree.

The decoding routine uses bytes_to_str() to take care of decoding errors, so it is sufficient if encoding is just a best guess of what the actual input encoding is. The encoding name will be remapped according to the WHATWG specification by calling map_encoding_to_html5() before trying to decode the byte string with it.

Parameters:
  • document – input byte string

  • encoding – encoding for decoding byte string

  • errors – decoding error policy (same as str.decode())

Returns:

HTML DOM tree

Return type:

HTMLTree

Raises:

ValueError – if HTML parsing fails for unknown reasons

body

HTML body element or None if document has no body.

Type:

DOMNode or None

document

Document root node.

Type:

DOMNode or None

head

HTML head element or None if document has no head.

Type:

DOMNode or None

title

The HTML document title.

Type:

str or None

resiliparse.parse.html.traverse_dom(base_node, start_callback, end_callback=None, context=None, elements_only=False)

DOM traversal helper.

Traverses the DOM tree starting at base_node in pre-order and calls start_callback at each child node. If end_callback is not None, it will be called each time a DOM element’s end tag is encountered.

The callbacks are expected to take exactly one DOMContext context parameter, which keeps track of the current node and traversal depth. The context object will be the same throughout the whole traversal process, so it can be mutated with custom data.

Parameters:
  • base_node (DOMNode) – root node of the traversal

  • start_callback (t.Callable[[DOMContext], None]) – callback for each DOM node on the way (takes the DOMContext as its only parameter)

  • end_callback (t.Callable[[DOMContext], None] or None) – optional callback for element node end tags (takes the DOMContext as its only parameter)

  • context (DOMContext) – optional pre-initialized context object

  • elements_only (bool) – traverse only element nodes