HTML Parsing
Resiliparse HTML parsing and DOM traversal utilities API documentation.
- class resiliparse.parse.html.NodeType(value)
DOM node type enumeration.
- ELEMENT = 1
- ATTRIBUTE = 2
- TEXT = 3
- CDATA_SECTION = 4
- ENTITY_REFERENCE = 5
- ENTITY = 6
- PROCESSING_INSTRUCTION = 7
- COMMENT = 8
- DOCUMENT = 9
- DOCUMENT_TYPE = 10
- DOCUMENT_FRAGMENT = 11
- NOTATION = 12
- LAST_ENTRY = 13
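As an illustration of how these values are typically used, the sketch below keeps only the text nodes encountered while iterating over a parsed tree. It assumes that DOMNode exposes a `type` property holding a NodeType member and that HTMLTree exposes a `body` property; neither is listed in this section, so treat both as assumptions. The import is guarded so that the snippet does nothing when Resiliparse is not installed.

```python
try:
    from resiliparse.parse.html import HTMLTree, NodeType
except ImportError:
    HTMLTree = None  # Resiliparse is not installed; the demo below is skipped

text_chunks = None
if HTMLTree is not None:
    tree = HTMLTree.parse("<body><p>Hello <b>world</b></p></body>")
    # Iterating a DOMNode walks its subtree in pre-order.
    # Assumption: node.type holds a NodeType member (not listed in this section).
    text_chunks = [node.text for node in tree.body if node.type == NodeType.TEXT]
    print(text_chunks)
```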
- class resiliparse.parse.html.DOMCollection(self)
Collection of DOM nodes that are the result set of an element matching operation.
A node collection is only valid for as long as the owning HTMLTree is alive and the DOM tree hasn’t been modified. Do not access DOMCollection instances after any sort of DOM tree manipulation.
- __getitem__(self, key)
Return the DOMNode at the given index in this collection or another DOMCollection if key is a slice object. Negative indexing is supported.
- Parameters:
key – index or slice
- Return type:
DOMNode or DOMCollection
- Raises:
IndexError – if key is out of range
TypeError – if key is not an int or slice
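The indexing rules can be sketched as follows: an integer index yields a single DOMNode, a negative index counts from the end, and a slice yields another DOMCollection. The `body` property of HTMLTree used here is an assumption, as it is not listed in this section, and the guarded import makes the snippet a no-op when Resiliparse is not installed.

```python
try:
    from resiliparse.parse.html import HTMLTree
except ImportError:
    HTMLTree = None  # Resiliparse is not installed; the demo below is skipped

first_id = last_id = sliced_ids = None
if HTMLTree is not None:
    tree = HTMLTree.parse('<body><p id="a">1</p><p id="b">2</p><p id="c">3</p></body>')
    paragraphs = tree.body.get_elements_by_tag_name("p")

    first_id = paragraphs[0].id    # integer index -> DOMNode
    last_id = paragraphs[-1].id    # negative index counts from the end
    tail = paragraphs[1:]          # slice -> another DOMCollection
    sliced_ids = [tail[0].id, tail[1].id]
    print(first_id, last_id, sliced_ids)
```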
- get_element_by_id(self, element_id, case_insensitive=False)
Within all elements in this collection, find and return the element whose ID attribute matches element_id.
- Parameters:
element_id (str) – element ID
case_insensitive (bool) – match ID case-insensitively
- Returns:
matching element or None if no such element exists
- Return type:
DOMNode or None
- get_elements_by_attr(self, attr_name, attr_value, case_insensitive=False)
Within all elements in this collection, find the elements matching the given arbitrary attribute name and value and return a DOMCollection with the aggregated results.
- Parameters:
attr_name (str) – attribute name
attr_value – attribute value
case_insensitive (bool) – match attribute value case-insensitively
- Returns:
collection of matching elements
- Return type:
DOMCollection or None
- get_elements_by_class_name(self, class_name, case_insensitive=False)
Within all elements in this collection, find the elements matching the given class name and return a DOMCollection with the aggregated results.
- Parameters:
class_name (str) – element class
case_insensitive (bool) – match class name case-insensitively
- Returns:
collection of matching elements
- Return type:
DOMCollection or None
- get_elements_by_tag_name(self, tag_name)
Within all elements in this collection, find the elements with the given tag name and return a DOMCollection with the aggregated results.
- Parameters:
tag_name (str) – tag name for matching elements
- Returns:
collection of matching elements
- Return type:
DOMCollection
- matches(self, selector)
Within all elements in this collection, check whether any element in the DOM tree matches the given CSS selector. This is more efficient than matching with query_selector_all() and checking the size of the returned collection.
- Parameters:
selector (str) – CSS selector
- Returns:
boolean value indicating whether a matching element exists
- Return type:
bool
- query_selector(self, selector)
Within all elements in this collection, find and return the first element matching the given CSS selector. This is more efficient than matching with query_selector_all() and discarding additional elements.
- Parameters:
selector (str) – CSS selector
- Returns:
matching element or None
- Return type:
DOMNode or None
- query_selector_all(self, selector)
Within all elements in this collection, find the elements matching the given CSS selector and return a DOMCollection with the aggregated results.
- Parameters:
selector (str) – CSS selector
- Returns:
collection of matching elements
- Return type:
DOMCollection
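Because the matching methods are also available on collections, queries can be chained. The following sketch first collects all paragraphs and then searches within them for elements with a given class. The `body` property of HTMLTree is an assumption, as it is not listed in this section. The guarded import makes the snippet a no-op when Resiliparse is not installed.

```python
try:
    from resiliparse.parse.html import HTMLTree
except ImportError:
    HTMLTree = None  # Resiliparse is not installed; the demo below is skipped

bar_tags = first_bar_tag = None
if HTMLTree is not None:
    html = ('<body><p id="a">Hello <span class="bar">world</span>!</p>'
            '<p id="b">Hello <a href="https://example.com" class="bar baz">DOM</a>!</p></body>')
    tree = HTMLTree.parse(html)

    paragraphs = tree.body.get_elements_by_tag_name("p")
    # Search inside every paragraph of the collection; results are aggregated.
    bars = paragraphs.query_selector_all(".bar")
    bar_tags = [bars[0].tag, bars[1].tag]
    # query_selector() stops at the first match instead of collecting all of them.
    first_bar_tag = paragraphs.query_selector(".bar").tag
    print(bar_tags, first_bar_tag)
```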
- class resiliparse.parse.html.DOMContext
DOM node traversal context object.
The context object has two attributes that are set by the traversal function for keeping track of the current DOMNode and the current traversal depth. Besides these, the context object is arbitrarily mutable and can be used for maintaining custom state.
- class resiliparse.parse.html.DOMElementClassList
Class name list of an Element DOM node.
- __getitem__()
- __contains__(self, item)
- __iter__(self)
- add(self, class_name)
Add new class name to Element node if not already present.
- Parameters:
class_name (str) – new class name
- remove(self, class_name)
Remove a class name from this Element node.
- Parameters:
class_name (str) – class name to remove
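A short sketch of the class list in use: it adds one class, removes another, and checks membership with the `in` operator. The class list is obtained through the `class_list` property of DOMNode; the `body` property of HTMLTree is an assumption, as it is not listed in this section. The guarded import makes the snippet a no-op when Resiliparse is not installed.

```python
try:
    from resiliparse.parse.html import HTMLTree
except ImportError:
    HTMLTree = None  # Resiliparse is not installed; the demo below is skipped

has_lead = has_intro = classes = None
if HTMLTree is not None:
    tree = HTMLTree.parse('<body><p id="a" class="intro">Hi</p></body>')
    p = tree.body.get_element_by_id("a")

    p.class_list.add("lead")              # no-op if the class were already present
    has_lead = "lead" in p.class_list     # uses __contains__
    p.class_list.remove("intro")
    has_intro = "intro" in p.class_list
    classes = list(p.class_list)          # uses __iter__
    print(classes)
```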
- class resiliparse.parse.html.DOMNode(self)
DOM node.
A DOM node is only valid as long as the owning HTMLTree is alive and the DOM tree hasn’t been modified. Do not access DOMNode instances after any sort of DOM tree manipulation.
- __getitem__(self, attr_name)
Get the value of an attribute.
- Parameters:
attr_name – attribute name
- Return type:
str
- Raises:
KeyError – if no such attribute exists
ValueError – if node is not an Element node
- __iter__(self)
Traverse the DOM tree in pre-order starting at the current node.
- Return type:
t.Iterable[DOMNode]
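Pre-order means that a node is visited before its children, and children are visited from left to right. The sketch below records the element tags in visiting order; the filter on known tag names is only there to leave out text nodes and the starting node. The `body` property of HTMLTree is an assumption, as it is not listed in this section, and the guarded import makes the snippet a no-op when Resiliparse is not installed.

```python
try:
    from resiliparse.parse.html import HTMLTree
except ImportError:
    HTMLTree = None  # Resiliparse is not installed; the demo below is skipped

visited = None
if HTMLTree is not None:
    tree = HTMLTree.parse(
        "<body><div><p>a</p><span>b</span></div><footer>c</footer></body>")
    wanted = {"div", "p", "span", "footer"}
    # A <div> is visited before its children <p> and <span>,
    # and the whole <div> subtree is visited before the sibling <footer>.
    visited = [node.tag for node in tree.body if node.tag in wanted]
    print(visited)
```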
- __setitem__(self, attr_name, attr_value)
Insert or update an attribute with the given name to the given value.
- Parameters:
attr_name (str) – attribute name
attr_value (str) – attribute value
- Returns:
attribute value
- Raises:
ValueError – if node is not an Element node
- append_child(self, node)
Append a new child node to this DOM node.
- decompose(self)
Delete the current node and all its children.
- delattr(self, attr_name)
Remove the given attribute if it exists.
- Parameters:
attr_name (str) – attribute to remove
- Raises:
ValueError – if node is not an Element node
- get_element_by_id(self, element_id, case_insensitive=False)
Find and return the element whose ID attribute matches element_id.
- Parameters:
element_id (str) – element ID
case_insensitive (bool) – match ID case-insensitively
- Returns:
matching element or None if no such element exists
- Return type:
DOMNode or None
- get_elements_by_attr(self, attr_name, attr_value, case_insensitive=False)
Find all elements matching the given arbitrary attribute name and value and return a DOMCollection with the results.
- Parameters:
attr_name (str) – attribute name
attr_value – attribute value
case_insensitive (bool) – match attribute value case-insensitively
- Returns:
collection of matching elements
- Return type:
DOMCollection or None
- get_elements_by_class_name(self, class_name, case_insensitive=False)
Find all elements matching the given class name and return a DOMCollection with the results.
- Parameters:
class_name (str) – element class
case_insensitive (bool) – match class name case-insensitively
- Returns:
collection of matching elements
- Return type:
DOMCollection or None
- get_elements_by_tag_name(self, tag_name)
Find all elements with the given tag name and return a DOMCollection with the results.
- Parameters:
tag_name (str) – tag name for matching elements
- Returns:
collection of matching elements
- Return type:
DOMCollection
- getattr(self, attr_name, default_value=None)
Get the value of the attribute attr_name or default_value if the element has no such attribute.
- Parameters:
attr_name (str) – attribute name
default_value (str or None) – default value to return if attribute is unset
- Returns:
attribute value
- Return type:
str or None
- Raises:
ValueError – if node is invalid or not an Element node
- hasattr(self, attr_name)
Check if node has an attribute with the given name.
- Parameters:
attr_name (str) – attribute name
- Return type:
bool
- Raises:
ValueError – if node is not an Element node
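The attribute accessors described so far can be combined as in the following sketch. Subscript access raises KeyError for a missing attribute, whereas getattr() falls back to a default value. The `body` property of HTMLTree is an assumption, as it is not listed in this section, and the guarded import makes the snippet a no-op when Resiliparse is not installed.

```python
try:
    from resiliparse.parse.html import HTMLTree
except ImportError:
    HTMLTree = None  # Resiliparse is not installed; the demo below is skipped

href = fallback = has_title = href_missing = None
if HTMLTree is not None:
    tree = HTMLTree.parse('<body><a id="link" href="https://example.com">x</a></body>')
    link = tree.body.get_element_by_id("link")

    href = link["href"]                          # __getitem__
    fallback = link.getattr("title", "untitled") # default for a missing attribute
    link["title"] = "Example"                    # __setitem__ inserts or updates
    has_title = link.hasattr("title")
    link.delattr("href")
    try:
        link["href"]
        href_missing = False
    except KeyError:
        href_missing = True                      # attribute was removed
    print(href, fallback, has_title, href_missing)
```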
- insert_before(self, node, reference)
Insert node before reference as a new child node. The reference node must be a child of this node or None. If reference is None, the new node will be appended after the last child node.
- matches(self, selector)
Check whether any element in the DOM tree matches the given CSS selector. This is more efficient than matching with query_selector_all() and checking the size of the returned collection.
- Parameters:
selector (str) – CSS selector
- Returns:
boolean value indicating whether a matching element exists
- Return type:
bool
- query_selector(self, selector)
Find and return the first element matching the given CSS selector. This is more efficient than matching with query_selector_all() and discarding additional elements.
- Parameters:
selector (str) – CSS selector
- Returns:
matching element or None
- Return type:
DOMNode or None
- query_selector_all(self, selector)
Find all elements matching the given CSS selector and return a DOMCollection with the results.
- Parameters:
selector (str) – CSS selector
- Returns:
collection of matching elements
- Return type:
DOMCollection
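The three CSS selector methods differ only in what they return, which the following sketch tries to make concrete: matches() yields a boolean, query_selector() a single node or None, and query_selector_all() a collection. The `body` property of HTMLTree is an assumption, as it is not listed in this section, and the guarded import makes the snippet a no-op when Resiliparse is not installed.

```python
try:
    from resiliparse.parse.html import HTMLTree
except ImportError:
    HTMLTree = None  # Resiliparse is not installed; the demo below is skipped

any_bar = first_id = all_ids = no_table = None
if HTMLTree is not None:
    html = ('<body><main id="foo">'
            '<p id="a">Hello <span class="bar">world</span>!</p>'
            '<p id="b" class="dom">Hello <a href="https://example.com" class="bar baz">DOM</a>!</p>'
            '</main></body>')
    tree = HTMLTree.parse(html)

    any_bar = tree.body.matches(".bar")                  # existence check only
    first_id = tree.body.query_selector("main > p").id   # first match
    paragraphs = tree.body.query_selector_all("main > p")
    all_ids = [paragraphs[0].id, paragraphs[1].id]       # every match, in document order
    no_table = tree.body.query_selector("table")         # None if nothing matches
    print(any_bar, first_id, all_ids, no_table)
```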
- remove_child(self, node)
Remove the child node node from the DOM tree and return it.
- replace_child(self, new_child, old_child)
Replace the child node old_child with new_child.
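The tree manipulation methods can be combined roughly as follows. New nodes are created through HTMLTree.create_element() and HTMLTree.create_text_node(), documented further below. In line with the validity note at the top of this class, the sketch fetches the list element again after the tree has been modified. The `body` property of HTMLTree is an assumption, as it is not listed in this section, and the guarded import makes the snippet a no-op when Resiliparse is not installed.

```python
try:
    from resiliparse.parse.html import HTMLTree
except ImportError:
    HTMLTree = None  # Resiliparse is not installed; the demo below is skipped

items_text = None
if HTMLTree is not None:
    tree = HTMLTree.parse('<body><ul id="list"><li id="second">two</li></ul></body>')
    ul = tree.body.get_element_by_id("list")
    second = tree.body.get_element_by_id("second")

    # Build <li>one</li> and insert it in front of the existing item.
    first = tree.create_element("li")
    first.append_child(tree.create_text_node("one"))
    ul.insert_before(first, second)

    # The tree has changed, so look the list up again before the next edit.
    ul = tree.body.get_element_by_id("list")
    third = tree.create_element("li")
    third.append_child(tree.create_text_node("three"))
    ul.insert_before(third, None)  # a None reference appends after the last child

    items_text = tree.body.get_element_by_id("list").text
    print(items_text)
```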
- setattr(self, attr_name, attr_value)
Insert or update an attribute with the given name to the given value.
- Parameters:
attr_name (str) – attribute name
attr_value (str) – attribute value
- Returns:
attribute value
- Raises:
ValueError – if node is not an Element node
- attrs
List of attribute names if node is an Element node.
- Type:
t.List[str] or None
- class_list
List of class names set on this Element node.
- Type:
DOMElementClassList
- class_name
Class name attribute of this Element node (empty string if unset).
- Type:
str
- html
HTML contents of this DOM node and its children.
The DOM node’s inner HTML can be modified by assigning to this property.
- Type:
str
- id
ID attribute of this Element node (empty string if unset).
- Type:
str
- tag
DOM element tag or node name.
- Type:
str or None
- text
Text contents of this DOM node and its children.
The DOM node’s inner text can be modified by assigning to this property.
- Type:
str
- value
Node text value.
- Type:
str or None
- class resiliparse.parse.html.HTMLTree(self)
HTML DOM tree parser.
- create_element(self, tag_name)
Create a new DOM Element node.
- Parameters:
tag_name (str) – element tag name
- Returns:
new Element node
- Return type:
DOMNode
- create_text_node(self, text)
Create a new DOM text node.
- Parameters:
text (str) – string contents of the new text node
- Returns:
new text node
- Return type:
DOMNode
- classmethod parse(self, document)
Parse HTML from a Unicode string into a DOM tree.
- Parameters:
document – input HTML document
- Returns:
HTML DOM tree
- Return type:
HTMLTree
- Raises:
ValueError – if HTML parsing fails for unknown reasons
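A minimal sketch of parsing a document from a string and reading a few properties back. The `title` property is documented below; the `body` property of HTMLTree is an assumption, as it is not listed in this section. The guarded import makes the snippet a no-op when Resiliparse is not installed.

```python
try:
    from resiliparse.parse.html import HTMLTree
except ImportError:
    HTMLTree = None  # Resiliparse is not installed; the demo below is skipped

page_title = body_text = None
if HTMLTree is not None:
    document = ("<!doctype html><head><title>Example page</title></head>"
                "<body><p>Hello <b>world</b>!</p></body>")
    tree = HTMLTree.parse(document)   # classmethod, takes a Unicode string

    page_title = tree.title
    body_text = tree.body.text        # text of the node and all its children
    print(page_title, "|", body_text)
```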
- classmethod parse_from_bytes(self, document, encoding='utf-8', errors='ignore')
Decode a raw HTML byte string and parse it into a DOM tree.
The decoding routine uses bytes_to_str() to take care of decoding errors, so it is sufficient if encoding is just a best guess of what the actual input encoding is. The encoding name will be remapped according to the WHATWG specification by calling map_encoding_to_html5() before trying to decode the byte string with it.
- Parameters:
document – input byte string
encoding – encoding for decoding byte string
errors – decoding error policy (same as bytes.decode())
- Returns:
HTML DOM tree
- Return type:
HTMLTree
ValueError – if HTML parsing fails for unknown reasons
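For raw bytes, for example an HTTP response body, the encoding can be passed as a best guess. The sketch below encodes a small document as Windows-1252 and parses it using the label "cp1252". The guarded import makes the snippet a no-op when Resiliparse is not installed.

```python
try:
    from resiliparse.parse.html import HTMLTree
except ImportError:
    HTMLTree = None  # Resiliparse is not installed; the demo below is skipped

decoded_title = None
if HTMLTree is not None:
    raw = "<title>Café</title><p>Grüße</p>".encode("cp1252")
    # The encoding is only a hint; decoding errors are handled internally.
    tree = HTMLTree.parse_from_bytes(raw, encoding="cp1252")
    decoded_title = tree.title
    print(decoded_title)
```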
- title
The HTML document title.
- Type:
str or None
- resiliparse.parse.html.traverse_dom(base_node, start_callback, end_callback=None, context=None, elements_only=False)
DOM traversal helper.
Traverses the DOM tree starting at base_node in pre-order and calls start_callback at each child node. If end_callback is not None, it will be called each time a DOM element’s end tag is encountered.
The callbacks are expected to take exactly one DOMContext context parameter, which keeps track of the current node and traversal depth. The context object will be the same throughout the whole traversal process, so it can be mutated with custom data.
- Parameters:
base_node (DOMNode) – root node of the traversal
start_callback (t.Callable[[DOMContext], None]) – callback for each DOM node on the way (takes a DOMContext as its only parameter)
end_callback (t.Callable[[DOMContext], None] or None) – optional callback for element node end tags (takes a DOMContext as its only parameter)
context (DOMContext) – optional pre-initialized context object
elements_only (bool) – traverse only element nodes
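The callback protocol can be sketched as follows. The start callback records each element together with its depth, and the end callback records the order in which elements are closed. The attribute names `node` and `depth` on the context object, and the `body` property of HTMLTree, are assumptions: this section describes the context attributes without naming them. The guarded import makes the snippet a no-op when Resiliparse is not installed.

```python
try:
    from resiliparse.parse.html import HTMLTree, traverse_dom
except ImportError:
    HTMLTree = None  # Resiliparse is not installed; the demo below is skipped

opened, closed = [], []

def on_start(ctx):
    # Assumption: the context exposes the current node as ctx.node
    # and the traversal depth as ctx.depth.
    opened.append((ctx.depth, ctx.node.tag))

def on_end(ctx):
    # Called when an element's end tag is reached.
    closed.append(ctx.node.tag)

if HTMLTree is not None:
    tree = HTMLTree.parse("<body><div><p>a</p></div></body>")
    traverse_dom(tree.body, on_start, on_end, elements_only=True)
    for depth, tag in opened:
        print("  " * depth + "<" + tag + ">")
```

Start callbacks fire in pre-order (parent before child), whereas end callbacks fire in the reverse nesting order (child before parent), which makes the pair suitable for serializers that need to emit closing markup.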