HTML Parsing
Resiliparse HTML parsing and DOM traversal utilities API documentation.
- class resiliparse.parse.html.NodeType(value)
DOM node type enum.
- ELEMENT = 1
- ATTRIBUTE = 2
- TEXT = 3
- CDATA_SECTION = 4
- ENTITY_REFERENCE = 5
- ENTITY = 6
- PROCESSING_INSTRUCTION = 7
- COMMENT = 8
- DOCUMENT = 9
- DOCUMENT_TYPE = 10
- DOCUMENT_FRAGMENT = 11
- NOTATION = 12
- LAST_ENTRY = 13
- class resiliparse.parse.html.DOMCollection(self)
Collection of DOM nodes that are the result set of an element matching operation.
A node collection is only valid for as long as the owning HTMLTree is alive and the DOM tree hasn’t been modified. Do not access DOMCollection instances after any sort of DOM tree manipulation.
- __getitem__(self, key)
Return the DOMNode at the given index in this collection, or another DOMCollection if key is a slice object. Negative indexing is supported.
- Parameters
key – index or slice
- Return type
DOMNode or DOMCollection
- Raises
IndexError – if key is out of range
TypeError – if key is not an int or slice
- get_element_by_id(self, element_id, case_insensitive=False)
Within all elements in this collection, find and return the element whose ID attribute matches element_id.
- Parameters
element_id (str) – element ID
case_insensitive (bool) – match ID case-insensitively
- Returns
matching element or None if no such element exists
- Return type
DOMNode or None
- get_elements_by_attr(self, attr_name, attr_value, case_insensitive=False)
Within all elements in this collection, find the elements matching the given arbitrary attribute name and value and return a DOMCollection with the aggregated results.
- Parameters
attr_name (str) – attribute name
attr_value – attribute value
case_insensitive (bool) – match attribute value case-insensitively
- Returns
collection of matching elements
- Return type
DOMCollection or None
- get_elements_by_class_name(self, element_class, case_insensitive=False)
Within all elements in this collection, find the elements matching the given class name and return a DOMCollection with the aggregated results.
- Parameters
class_name (str) – element class
case_insensitive (bool) – match class name case-insensitively
- Returns
collection of matching elements
- Return type
DOMCollection or None
- get_elements_by_tag_name(self, tag_name)
Within all elements in this collection, find the elements with the given tag name and return a DOMCollection with the aggregated results.
- Parameters
tag_name (str) – tag name for matching elements
- Returns
collection of matching elements
- Return type
DOMCollection
- matches(self, selector)
Within all elements in this collection, check whether any element in the DOM tree matches the given CSS selector. This is more efficient than matching with query_selector_all() and checking the size of the returned collection.
- Parameters
selector (str) – CSS selector
- Returns
boolean value indicating whether a matching element exists
- Return type
bool
- query_selector(self, selector)
Within all elements in this collection, find and return the first element matching the given CSS selector. This is more efficient than matching with query_selector_all() and discarding additional elements.
- Parameters
selector (str) – CSS selector
- Returns
matching element or None
- Return type
DOMNode or None
- query_selector_all(self, selector)
Within all elements in this collection, find the elements matching the given CSS selector and return a DOMCollection with the aggregated results.
- Parameters
selector (str) – CSS selector
- Returns
collection of matching elements
- Return type
DOMCollection
- class resiliparse.parse.html.DOMContext
DOM node traversal context object.
The context object has two attributes that are set by the traversal function for keeping track of the current DOMNode and the current traversal depth. Besides these, the context object is arbitrarily mutable and can be used for maintaining custom state.
- class resiliparse.parse.html.DOMElementClassList
Class name list of an Element DOM node.
- __getitem__()
- __contains__(self, item)
- __iter__(self)
- add(self, class_name)
Add new class name to Element node if not already present.
- Parameters
class_name (str) – new class name
- remove(self, class_name)
Remove a class name from this Element node.
- Parameters
class_name (str) – class name to remove
- class resiliparse.parse.html.DOMNode(self)
DOM node.
A DOM node is only valid as long as the owning HTMLTree is alive and the DOM tree hasn’t been modified. Do not access DOMNode instances after any sort of DOM tree manipulation.
- __getitem__(self, attr_name)
Get the value of an attribute.
- Parameters
attr_name – attribute name
- Return type
str
- Raises
KeyError – if no such attribute exists
ValueError – if node is not an Element node
- __iter__(self)
Traverse the DOM tree in pre-order starting at the current node.
- Return type
t.Iterable[DOMNode]
- __setitem__(self, attr_name, attr_value)
Insert or update an attribute with the given name to the given value.
- Parameters
attr_name (str) – attribute name
attr_value (str) – attribute value
- Returns
attribute value
- Raises
ValueError – if node is not an Element node
- append_child(self, node)
Append a new child node to this DOM node.
- decompose(self)
Delete the current node and all its children.
- delattr(self, attr_name)
Remove the given attribute if it exists.
- Parameters
attr_name (str) – attribute to remove
- Raises
ValueError – if node is not an Element node
- get_element_by_id(self, element_id, case_insensitive=False)
Find and return the element whose ID attribute matches element_id.
- Parameters
element_id (str) – element ID
case_insensitive (bool) – match ID case-insensitively
- Returns
matching element or None if no such element exists
- Return type
DOMNode or None
- get_elements_by_attr(self, attr_name, attr_value, case_insensitive=False)
Find all elements matching the given arbitrary attribute name and value and return a DOMCollection with the results.
- Parameters
attr_name (str) – attribute name
attr_value – attribute value
case_insensitive (bool) – match attribute value case-insensitively
- Returns
collection of matching elements
- Return type
DOMCollection or None
- get_elements_by_class_name(self, element_class, case_insensitive=False)
Find all elements matching the given class name and return a DOMCollection with the results.
- Parameters
class_name (str) – element class
case_insensitive (bool) – match class name case-insensitively
- Returns
collection of matching elements
- Return type
DOMCollection or None
- get_elements_by_tag_name(self, tag_name)
Find all elements with the given tag name and return a DOMCollection with the results.
- Parameters
tag_name (str) – tag name for matching elements
- Returns
collection of matching elements
- Return type
DOMCollection
- getattr(self, attr_name, default_value=None)
Get the value of the attribute attr_name or default_value if the element has no such attribute.
- Parameters
attr_name (str) – attribute name
default_value (str) – default value to return if attribute is unset
- Returns
attribute value
- Return type
str
- Raises
ValueError – if node is not an Element node
- hasattr(self, attr_name)
Check if node has an attribute with the given name.
- Parameters
attr_name (str) – attribute name
- Return type
bool
- Raises
ValueError – if node is not an Element node
- insert_before(self, node, reference)
Insert node before reference as a new child node. The reference node must be a child of this node or None. If reference is None, the new node will be appended after the last child node.
- matches(self, selector)
Check whether any element in the DOM tree matches the given CSS selector. This is more efficient than matching with query_selector_all() and checking the size of the returned collection.
- Parameters
selector (str) – CSS selector
- Returns
boolean value indicating whether a matching element exists
- Return type
bool
- query_selector(self, selector)
Find and return the first element matching the given CSS selector. This is more efficient than matching with query_selector_all() and discarding additional elements.
- Parameters
selector (str) – CSS selector
- Returns
matching element or None
- Return type
DOMNode or None
- query_selector_all(self, selector)
Find all elements matching the given CSS selector and return a DOMCollection with the results.
- Parameters
selector (str) – CSS selector
- Returns
collection of matching elements
- Return type
DOMCollection
- remove_child(self, node)
Remove the child node node from the DOM tree and return it.
- replace_child(self, new_child, old_child)
Replace the child node old_child with new_child.
- setattr(self, attr_name, attr_value)
Insert or update an attribute with the given name to the given value.
- Parameters
attr_name (str) – attribute name
attr_value (str) – attribute value
- Returns
attribute value
- Raises
ValueError – if node is not an Element node
- attrs
List of attribute names if node is an Element node.
- Type
t.List[str] or None
- class_list
List of class names set on this Element node.
- Type
DOMElementClassList
- class_name
Class name attribute of this Element node (empty string if unset).
- Type
str
- html
HTML contents of this DOM node and its children.
The DOM node’s inner HTML can be modified by assigning to this property.
- Type
str
- id
ID attribute of this Element node (empty string if unset).
- Type
str
- tag
DOM element tag or node name.
- Type
str or None
- text
Text contents of this DOM node and its children.
The DOM node’s inner text can be modified by assigning to this property.
- Type
str
- value
Node text value.
- Type
str or None
- class resiliparse.parse.html.HTMLTree(self)
HTML DOM tree parser.
- create_element(self, tag_name)
Create a new DOM Element node.
- Parameters
tag_name (str) – element tag name
- Returns
new Element node
- Return type
DOMNode
- create_text_node(self, text)
Create a new DOM text node.
- Parameters
text (str) – string contents of the new text node
- Returns
new text node
- Return type
DOMNode
- parse(self, document)
Parse HTML from a Unicode string into a DOM tree.
- Parameters
document – input HTML document
- Returns
HTML DOM tree
- Return type
HTMLTree
- Raises
ValueError – if HTML parsing fails for unknown reasons
- parse_from_bytes(self, document, encoding='utf-8', errors='ignore')
Decode a raw HTML byte string and parse it into a DOM tree.
The decoding routine uses bytes_to_str() to take care of decoding errors, so it is sufficient if encoding is just a best guess of what the actual input encoding is. The encoding name will be remapped according to the WHATWG specification by calling map_encoding_to_html5() before trying to decode the byte string with it.
- Parameters
document – input byte string
encoding – encoding for decoding byte string
errors – decoding error policy (same as str.decode())
- Returns
HTML DOM tree
- Return type
HTMLTree
- Raises
ValueError – if HTML parsing fails for unknown reasons
- title
The HTML document title.
- Type
str or None
- resiliparse.parse.html.traverse_dom(base_node, start_callback, end_callback=None, context=None, elements_only=False)
DOM traversal helper.
Traverses the DOM tree starting at base_node in pre-order and calls start_callback at each child node. If end_callback is not None, it will be called each time a DOM element’s end tag is encountered.
The callbacks are expected to take exactly one DOMContext context parameter, which keeps track of the current node and traversal depth. The context object will be the same throughout the whole traversal process, so it can be mutated with custom data.
- Parameters
base_node (DOMNode) – root node of the traversal
start_callback (t.Callable[[DOMContext], None]) – callback for each DOM node on the way (takes a DOMContext as its only parameter)
end_callback (t.Callable[[DOMContext], None] or None) – optional callback for element node end tags (takes a DOMContext as its only parameter)
context (DOMContext) – optional pre-initialized context object
elements_only (bool) – traverse only element nodes