HTML Parsing

API documentation for the Resiliparse HTML parsing and DOM traversal utilities.

class resiliparse.parse.html.NodeType(value)

DOM node type enum.

ELEMENT = 1
ATTRIBUTE = 2
TEXT = 3
CDATA_SECTION = 4
ENTITY_REFERENCE = 5
ENTITY = 6
PROCESSING_INSTRUCTION = 7
COMMENT = 8
DOCUMENT = 9
DOCUMENT_TYPE = 10
DOCUMENT_FRAGMENT = 11
NOTATION = 12
LAST_ENTRY = 13
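
The values from ELEMENT through NOTATION are the same numbers as the standard DOM nodeType constants. As a minimal sketch, the stand-in enum below (a hypothetical NodeTypeSketch, not the Resiliparse class) copies the values listed above and compares them with the constants in Python's standard xml.dom.Node. LAST_ENTRY has no DOM counterpart and is skipped.

```python
from enum import IntEnum
from xml.dom import Node


class NodeTypeSketch(IntEnum):
    """Stand-in that copies the values listed above; not the Resiliparse class."""
    ELEMENT = 1
    ATTRIBUTE = 2
    TEXT = 3
    CDATA_SECTION = 4
    ENTITY_REFERENCE = 5
    ENTITY = 6
    PROCESSING_INSTRUCTION = 7
    COMMENT = 8
    DOCUMENT = 9
    DOCUMENT_TYPE = 10
    DOCUMENT_FRAGMENT = 11
    NOTATION = 12
    LAST_ENTRY = 13


# Compare each member with the matching xml.dom.Node constant, e.g. ELEMENT_NODE.
mismatches = [
    member.name
    for member in NodeTypeSketch
    if member is not NodeTypeSketch.LAST_ENTRY
    and getattr(Node, member.name + "_NODE") != member.value
]
print(mismatches)  # []
```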

class resiliparse.parse.html.DOMCollection(self)

Collection of DOM nodes that are the result set of an element matching operation.

A node collection is only valid for as long as the owning HTMLTree is alive and the DOM tree hasn’t been modified. Do not access DOMCollection instances after any sort of DOM tree manipulation.

__getitem__(self, key)

Return the DOMNode at the given index in this collection or another DOMCollection if key is a slice object. Negative indexing is supported.

Parameters

key – index or slice

Return type

DOMNode or DOMCollection

Raises
  • IndexError – if key is out of range

  • TypeError – if key is not an int or slice
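
The indexing contract described above (a single item for an integer, another collection for a slice, negative indices, IndexError and TypeError) can be illustrated with a small stand-in. CollectionSketch is a hypothetical class that wraps a plain list of strings; it only mirrors the documented behaviour and says nothing about how DOMCollection is implemented.

```python
class CollectionSketch:
    """Hypothetical stand-in that mimics the documented indexing contract."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # A slice yields another collection, not a plain list.
            return CollectionSketch(self._items[key])
        if isinstance(key, int):
            # List indexing handles negative indices and raises IndexError.
            return self._items[key]
        raise TypeError("key must be an int or a slice")

    def __iter__(self):
        return iter(self._items)


nodes = CollectionSketch(["html", "head", "body"])
last = nodes[-1]          # single item: "body"
tail = nodes[1:]          # another CollectionSketch
tail_items = list(tail)   # ["head", "body"]

try:
    nodes[3]
    out_of_range = False
except IndexError:
    out_of_range = True
print(last, tail_items, out_of_range)  # body ['head', 'body'] True
```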

__iter__(self)

Iterate the DOM node collection.

Return type

t.Iterable[DOMNode]

get_element_by_id(self, element_id, case_insensitive=False)

Within all elements in this collection, find and return the element whose ID attribute matches element_id.

Parameters
  • element_id (str) – element ID

  • case_insensitive (bool) – match ID case-insensitively

Returns

matching element or None if no such element exists

Return type

DOMNode or None

get_elements_by_attr(self, attr_name, attr_value, case_insensitive=False)

Within all elements in this collection, find the elements matching the given arbitrary attribute name and value and return a DOMCollection with the aggregated results.

Parameters
  • attr_name (str) – attribute name

  • attr_value – attribute value

  • case_insensitive (bool) – match attribute value case-insensitively

Returns

collection of matching elements

Return type

DOMCollection or None

get_elements_by_class_name(self, element_class, case_insensitive=False)

Within all elements in this collection, find the elements matching the given class name and return a DOMCollection with the aggregated results.

Parameters
  • element_class (str) – element class

  • case_insensitive (bool) – match class name case-insensitively

Returns

collection of matching elements

Return type

DOMCollection or None

get_elements_by_tag_name(self, tag_name)

Within all elements in this collection, find the elements with the given tag name and return a DOMCollection with the aggregated results.

Parameters

tag_name (str) – tag name for matching elements

Returns

collection of matching elements

Return type

DOMCollection

matches(self, selector)

Within all elements in this collection, check whether any element in the DOM tree matches the given CSS selector. This is more efficient than matching with query_selector_all() and checking the size of the returned collection.

Parameters

selector (str) – CSS selector

Returns

boolean value indicating whether a matching element exists

Return type

bool

query_selector(self, selector)

Within all elements in this collection, find and return the first element matching the given CSS selector. This is more efficient than matching with query_selector_all() and discarding additional elements.

Parameters

selector (str) – CSS selector

Returns

matching element or None

Return type

DOMNode or None

query_selector_all(self, selector)

Within all elements in this collection, find the elements matching the given CSS selector and return a DOMCollection with the aggregated results.

Parameters

selector (str) – CSS selector

Returns

collection of matching elements

Return type

DOMCollection

class resiliparse.parse.html.DOMElementClassList

Class name list of an Element DOM node.

__getitem__()

__contains__(self, item)

__iter__(self)

add(self, class_name)

Add new class name to Element node if not already present.

Parameters

class_name (str) – new class name

remove(self, class_name)

Remove a class name from this Element node.

Parameters

class_name (str) – class name to remove
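
As a rough illustration of the add and remove semantics, the sketch below works on the whitespace-separated value of an HTML class attribute. The helpers add_class and remove_class are hypothetical stand-ins written for this example, and the sketch assumes that the class list mirrors the class attribute; it is not the Resiliparse implementation.

```python
def add_class(class_attr: str, name: str) -> str:
    """Append name unless it is already one of the whitespace-separated classes."""
    names = class_attr.split()
    if name not in names:
        names.append(name)
    return " ".join(names)


def remove_class(class_attr: str, name: str) -> str:
    """Drop every occurrence of name from the class attribute value."""
    return " ".join(n for n in class_attr.split() if n != name)


attr = "bar baz"
attr = add_class(attr, "baz")     # already present, value stays "bar baz"
attr = add_class(attr, "qux")     # appended: "bar baz qux"
attr = remove_class(attr, "bar")  # removed: "baz qux"
print(attr)  # baz qux
```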

class resiliparse.parse.html.DOMNode(self)

DOM node.

A DOM node is only valid as long as the owning HTMLTree is alive and the DOM tree hasn’t been modified. Do not access DOMNode instances after any sort of DOM tree manipulation.

__getitem__(self, attr_name)

Get the value of an attribute.

Parameters

attr_name – attribute name

Return type

str

Raises
  • KeyError – if no such attribute exists

  • ValueError – if node is not an Element node

__iter__(self)

Traverse the DOM tree in pre-order starting at the current node.

Return type

t.Iterable[DOMNode]
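
Pre-order means that a node is visited before its children, and the children are then visited from first to last. The sketch below shows that order on a hypothetical SketchNode class, not on a real DOMNode, and assumes the starting node itself is yielded first.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class SketchNode:
    """Hypothetical stand-in for a DOM node: a name and a list of children."""
    name: str
    children: List["SketchNode"] = field(default_factory=list)

    def __iter__(self) -> Iterator["SketchNode"]:
        # Pre-order: yield the node itself, then each subtree from left to right.
        yield self
        for child in self.children:
            yield from child


# Models <body><main><p id="a"/><p id="b"/></main><footer/></body>
body = SketchNode("body", [
    SketchNode("main", [SketchNode("p#a"), SketchNode("p#b")]),
    SketchNode("footer"),
])

order = [node.name for node in body]
print(order)  # ['body', 'main', 'p#a', 'p#b', 'footer']
```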

__setitem__(self, attr_name, attr_value)

Set the attribute with the given name to the given value, inserting the attribute if it does not exist yet.

Parameters
  • attr_name (str) – attribute name

  • attr_value (str) – attribute value

Returns

attribute value

Raises

ValueError – if node is not an Element node

append_child(self, node)

Append a new child node to this DOM node.

Parameters

node (DOMNode) – DOM node to append as new child node

Returns

the appended child node

Return type

DOMNode

Raises

ValueError – if trying to append node to itself

decompose(self)

Delete the current node and all its children.

delattr(self, attr_name)

Remove the given attribute if it exists.

Parameters

attr_name (str) – attribute to remove

Raises

ValueError – if node is not an Element node

get_element_by_id(self, element_id, case_insensitive=False)

Find and return the element whose ID attribute matches element_id.

Parameters
  • element_id (str) – element ID

  • case_insensitive (bool) – match ID case-insensitively

Returns

matching element or None if no such element exists

Return type

DOMNode or None

get_elements_by_attr(self, attr_name, attr_value, case_insensitive=False)

Find all elements matching the given arbitrary attribute name and value and return a DOMCollection with the results.

Parameters
  • attr_name (str) – attribute name

  • attr_value – attribute value

  • case_insensitive (bool) – match attribute value case-insensitively

Returns

collection of matching elements

Return type

DOMCollection or None

get_elements_by_class_name(self, element_class, case_insensitive=False)

Find all elements matching the given class name and return a DOMCollection with the results.

Parameters
  • element_class (str) – element class

  • case_insensitive (bool) – match class name case-insensitively

Returns

collection of matching elements

Return type

DOMCollection or None

get_elements_by_tag_name(self, tag_name)

Find all elements with the given tag name and return a DOMCollection with the results.

Parameters

tag_name (str) – tag name for matching elements

Returns

collection of matching elements

Return type

DOMCollection

getattr(self, attr_name, default_value=None)

Get the value of the attribute attr_name or default_value if the element has no such attribute.

Parameters
  • attr_name (str) – attribute name

  • default_value (str) – default value to return if attribute is unset

Returns

attribute value

Return type

str

Raises

ValueError – if node is not an Element node

hasattr(self, attr_name)

Check if node has an attribute with the given name.

Parameters

attr_name (str) – attribute name

Return type

bool

Raises

ValueError – if node is not an Element node

insert_before(self, node, reference)

Insert node before reference as a new child node. The reference node must be a child of this node or None. If reference is None, the new node will be appended after the last child node.

Parameters
  • node (DOMNode) – DOM node to insert as new child node

  • reference (DOMNode) – child node before which to insert the new node or None

Returns

the inserted child node

Return type

DOMNode

Raises

ValueError – if trying to add node as its own child or if reference is not a child
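
The behaviour described for insert_before matches the standard DOM insertBefore operation, including the rule that a reference of None appends the node. The sketch below shows that contract with Python's standard xml.dom.minidom purely as an analogy; it does not use Resiliparse, and the minidom method names (insertBefore, createElement) differ from the ones documented here.

```python
from xml.dom.minidom import parseString

doc = parseString("<ul><li>b</li></ul>")
ul = doc.documentElement
li_b = ul.firstChild

li_a = doc.createElement("li")
li_a.appendChild(doc.createTextNode("a"))
li_c = doc.createElement("li")
li_c.appendChild(doc.createTextNode("c"))

# Analogous to ul.insert_before(li_a, li_b): placed in front of the reference child.
ul.insertBefore(li_a, li_b)
# Analogous to ul.insert_before(li_c, None): appended after the last child.
ul.insertBefore(li_c, None)

result = ul.toxml()
print(result)  # <ul><li>a</li><li>b</li><li>c</li></ul>
```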

matches(self, selector)

Check whether any element in the DOM tree matches the given CSS selector. This is more efficient than matching with query_selector_all() and checking the size of the returned collection.

Parameters

selector (str) – CSS selector

Returns

boolean value indicating whether a matching element exists

Return type

bool
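
One plausible reason an existence check is cheaper than collecting every match is early termination: the search can stop at the first hit and no result collection has to be built. The text above does not state the mechanism, so the plain-Python sketch below only illustrates that assumption; is_match and the element list are hypothetical.

```python
visited = []


def is_match(item: str) -> bool:
    """Record each inspected item so the amount of work is visible."""
    visited.append(item)
    return item == "a"


elements = ["div", "a", "p", "a", "span"]

# Existence check: any() stops at the first match.
found = any(is_match(e) for e in elements)
checked_by_any = len(visited)

visited.clear()

# Collect, then count: every element is inspected and a list is built.
all_matches = [e for e in elements if is_match(e)]
checked_by_all = len(visited)

print(found, checked_by_any, checked_by_all)  # True 2 5
```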

query_selector(self, selector)

Find and return the first element matching the given CSS selector. This is more efficient than matching with query_selector_all() and discarding additional elements.

Parameters

selector (str) – CSS selector

Returns

matching element or None

Return type

DOMNode or None

query_selector_all(self, selector)

Find all elements matching the given CSS selector and return a DOMCollection with the results.

Parameters

selector (str) – CSS selector

Returns

collection of matching elements

Return type

DOMCollection

remove_child(self, node)

Remove the given child node from the DOM tree and return it.

Parameters

node (DOMNode) – DOM node to remove

Returns

the removed child node

Return type

DOMNode

Raises

ValueError – if node is not a child of this node

replace_child(self, new_child, old_child)

Replace the child node old_child with new_child.

Parameters
  • new_child (DOMNode) – new child node to insert

  • old_child (DOMNode) – old child node to replace

Returns

the old child node

Return type

DOMNode

Raises

ValueError – if old_child is not a child of this node

setattr(self, attr_name, attr_value)

Set the attribute with the given name to the given value, inserting the attribute if it does not exist yet.

Parameters
  • attr_name (str) – attribute name

  • attr_value (str) – attribute value

Returns

attribute value

Raises

ValueError – if node is not an Element node

attrs

List of attribute names if node is an Element node.

Type

t.List[str] or None

child_nodes

List of child nodes.

Type

t.List[DOMNode]

class_list

List of class names set on this Element node.

Type

DOMElementClassList

class_name

Class name attribute of this Element node (empty string if unset).

Type

str

first_child

First child element of this DOM node.

Type

DOMNode or None

html

HTML contents of this DOM node and its children.

The DOM node’s inner HTML can be modified by assigning to this property.

Type

str

id

ID attribute of this Element node (empty string if unset).

Type

str

last_child

Last child element of this DOM node.

Type

DOMNode or None

next

Next sibling node.

Type

DOMNode or None

parent

Parent of this node.

Type

DOMNode or None

prev

Previous sibling node.

Type

DOMNode or None

tag

DOM node tag name if node is an Element node.

Type

str or None

text

Text contents of this DOM node and its children.

The DOM node’s inner text can be modified by assigning to this property.

Type

str

type

DOM node type.

Type

NodeType

class resiliparse.parse.html.HTMLTree(self)

HTML DOM tree parser.

create_element(self, tag_name)

Create a new DOM Element node.

Parameters

tag_name (str) – element tag name

Returns

new Element node

Return type

DOMNode

create_text_node(self, text)

Create a new DOM text node.

Parameters

text (str) – string contents of the new text node

Returns

new text node

Return type

DOMNode

parse(self, document)

Parse HTML from a Unicode string into a DOM tree.

Parameters

document – input HTML document

Returns

HTML DOM tree

Return type

HTMLTree

Raises

ValueError – if HTML parsing fails for unknown reasons

parse_from_bytes(self, document, encoding='utf-8', errors='ignore')

Decode a raw HTML byte string and parse it into a DOM tree.

The decoding routine uses bytes_to_str() to take care of decoding errors, so it is sufficient if encoding is just a best guess of what the actual input encoding is. The encoding name will be remapped according to the WHATWG specification by calling map_encoding_to_html5() before trying to decode the byte string with it.

Parameters
  • document – input byte string

  • encoding – encoding for decoding byte string

  • errors – decoding error policy (same as str.decode())

Returns

HTML DOM tree

Return type

HTMLTree

Raises

ValueError – if HTML parsing fails for unknown reasons
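
To show why a best-guess encoding can be sufficient, the sketch below decodes bytes in a forgiving way using only the standard library: strict attempts first, then a lossy last resort that applies the errors policy. The helper decode_forgiving and its fallback list are assumptions made for this example; this is not the implementation of bytes_to_str() and it does not remap encoding names the way map_encoding_to_html5() does.

```python
def decode_forgiving(data: bytes, encoding: str = "utf-8", errors: str = "ignore",
                     fallbacks=("utf-8", "cp1252")) -> str:
    """Hypothetical helper: try strict decoding, then fall back to a lossy decode."""
    for candidate in (encoding, *fallbacks):
        try:
            return data.decode(candidate)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: apply the error policy with the original guess.
    return data.decode(encoding, errors=errors)


# The guess "utf-8" is wrong for these bytes, but the cp1252 fallback decodes them.
legacy = "<p>café</p>".encode("cp1252")
print(decode_forgiving(legacy, "utf-8"))   # <p>café</p>

# Byte 0x81 is invalid in UTF-8 and undefined in cp1252, so it is dropped.
print(decode_forgiving(b"ab\x81", "utf-8"))  # ab
```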

body

HTML body element or None if document has no body.

Type

DOMNode or None

document

Document root node.

Type

DOMNode or None

head

HTML head element or None if document has no head.

Type

DOMNode or None

title

The HTML document title.

Type

str or None