ChatNoir Resiliparse
stable

User Manual

  • Installation Instructions
    • Building Resiliparse From Source
  • Resiliparse Parsing Utilities
    • Character Encoding
      • Character Encoding Detection
      • Map Encodings to WHATWG Specification
      • Convert Byte String to Unicode
      • Detect MIME Types
    • HTTP Tools
      • Read Chunked HTTP Payloads
    • HTML Parsing
      • DOM Selection
        • Elements
        • Attributes
        • HTML and Text Serialization
      • DOM Tree Traversal
        • Advanced Traversal
      • DOM Tree Manipulation
        • Elements
        • Attributes
        • Inner HTML and Inner Text
      • Benchmarks
    • Language Tools
      • Fast Language Detection
        • Benchmarks
        • Supported Languages
  • Resiliparse Extraction Utilities
    • HTML2Text
      • Basic Plain Text Conversion
      • Main Content Extraction
  • Resiliparse Process Guards
    • TimeGuard
      • Interrupt Escalation Behaviour
      • Reporting Progress
      • Progress Loops
      • Using TimeGuard as a Context Manager
      • TimeGuard Check Interval
    • MemGuard
      • Using MemGuard as a Context Manager
      • MemGuard Check Interval
  • Resiliparse Itertools
    • Exception Loops
    • WARC Retry Loops
  • Resiliparse Beam Transforms
    • Installing Resiliparse Beam Transforms
    • Reading WARC Files
    • Reading Text Files
    • Bulk-indexing to Elasticsearch
  • FastWARC
    • Why FastWARC and not WARCIO?
    • Installing FastWARC
    • Building FastWARC From Source
    • Iterating WARC Files
    • Filtering Records
      • Record Type Filter
      • Content-Length Filter
      • Function Filter
      • Digest Filter
    • Record Properties
    • Verifying Record Digests
    • Benchmarks

API Documentation

  • Resiliparse Parsing Utilities
    • Character Encoding
      • EncodingDetector
        • EncodingDetector.encoding()
        • EncodingDetector.reset()
        • EncodingDetector.update()
      • bytes_to_str()
      • detect_encoding()
      • detect_mime()
      • map_encoding_to_html5()
    • HTTP Tools
      • iterate_http_chunks()
      • read_http_chunk()
    • HTML Parsing
      • NodeType
        • NodeType.ELEMENT
        • NodeType.ATTRIBUTE
        • NodeType.TEXT
        • NodeType.CDATA_SECTION
        • NodeType.ENTITY_REFERENCE
        • NodeType.ENTITY
        • NodeType.PROCESSING_INSTRUCTION
        • NodeType.COMMENT
        • NodeType.DOCUMENT
        • NodeType.DOCUMENT_TYPE
        • NodeType.DOCUMENT_FRAGMENT
        • NodeType.NOTATION
        • NodeType.LAST_ENTRY
      • DOMCollection
        • DOMCollection.__getitem__()
        • DOMCollection.__iter__()
        • DOMCollection.get_element_by_id()
        • DOMCollection.get_elements_by_attr()
        • DOMCollection.get_elements_by_class_name()
        • DOMCollection.get_elements_by_tag_name()
        • DOMCollection.matches()
        • DOMCollection.query_selector()
        • DOMCollection.query_selector_all()
      • DOMContext
      • DOMElementClassList
        • DOMElementClassList.__getitem__()
        • DOMElementClassList.__iter__()
        • DOMElementClassList.add()
        • DOMElementClassList.remove()
      • DOMNode
        • DOMNode.__getitem__()
        • DOMNode.__iter__()
        • DOMNode.__setitem__()
        • DOMNode.append_child()
        • DOMNode.decompose()
        • DOMNode.delattr()
        • DOMNode.get_element_by_id()
        • DOMNode.get_elements_by_attr()
        • DOMNode.get_elements_by_class_name()
        • DOMNode.get_elements_by_tag_name()
        • DOMNode.getattr()
        • DOMNode.hasattr()
        • DOMNode.insert_before()
        • DOMNode.matches()
        • DOMNode.query_selector()
        • DOMNode.query_selector_all()
        • DOMNode.remove_child()
        • DOMNode.replace_child()
        • DOMNode.setattr()
        • DOMNode.attrs
        • DOMNode.child_nodes
        • DOMNode.class_list
        • DOMNode.class_name
        • DOMNode.first_child
        • DOMNode.first_element_child
        • DOMNode.html
        • DOMNode.id
        • DOMNode.last_child
        • DOMNode.last_element_child
        • DOMNode.next
        • DOMNode.next_element
        • DOMNode.parent
        • DOMNode.prev
        • DOMNode.prev_element
        • DOMNode.tag
        • DOMNode.text
        • DOMNode.type
        • DOMNode.value
      • HTMLTree
        • HTMLTree.create_element()
        • HTMLTree.create_text_node()
        • HTMLTree.parse()
        • HTMLTree.parse_from_bytes()
        • HTMLTree.body
        • HTMLTree.document
        • HTMLTree.head
        • HTMLTree.title
      • traverse_dom()
    • Language Tools
      • detect_fast()
      • supported_langs()
      • train_language_examples()
  • Resiliparse Extraction Utilities
    • HTML2Text
      • extract_plain_text()
  • Resiliparse Process Guards
    • InterruptType
      • InterruptType.exception
      • InterruptType.signal
      • InterruptType.exception_then_signal
    • ExecutionTimeout
    • MemoryLimitExceeded
    • ResiliparseGuardException
    • MemGuard
    • TimeGuard
      • TimeGuard.progress()
    • mem_guard()
    • progress()
    • progress_loop()
    • time_guard()
  • Resiliparse Itertools
    • exc_loop()
    • progress_loop()
    • warc_retry()
  • Resiliparse Beam Transforms
    • Elasticsearch
      • ElasticsearchBulkIndex
      • delete_action()
      • ensure_index()
      • index_action()
      • update_action()
    • File I/O
      • MatchFiles
    • Text File I/O
      • ReadAllFromText
      • ReadFromText
      • StrUtf8Coder
        • StrUtf8Coder.decode()
    • WARC I/O
      • ReadAllWarcs
      • ReadWarcs
  • FastWARC
    • WARC
      • WarcRecordType
        • WarcRecordType.unknown
        • WarcRecordType.any_type
        • WarcRecordType.no_type
        • WarcRecordType.warcinfo
        • WarcRecordType.response
        • WarcRecordType.resource
        • WarcRecordType.request
        • WarcRecordType.metadata
        • WarcRecordType.revisit
        • WarcRecordType.conversion
        • WarcRecordType.continuation
      • ArchiveIterator
        • ArchiveIterator.__iter__()
        • ArchiveIterator.__next__()
      • WarcHeaderMap
        • WarcHeaderMap.__iter__()
        • WarcHeaderMap.append()
        • WarcHeaderMap.asdict()
        • WarcHeaderMap.astuples()
        • WarcHeaderMap.clear()
        • WarcHeaderMap.get()
        • WarcHeaderMap.items()
        • WarcHeaderMap.keys()
        • WarcHeaderMap.values()
        • WarcHeaderMap.write()
        • WarcHeaderMap.reason_phrase
        • WarcHeaderMap.status_code
        • WarcHeaderMap.status_line
      • WarcRecord
        • WarcRecord.freeze()
        • WarcRecord.init_headers()
        • WarcRecord.parse_http()
        • WarcRecord.set_bytes_content()
        • WarcRecord.verify_block_digest()
        • WarcRecord.verify_payload_digest()
        • WarcRecord.write()
        • WarcRecord.content_length
        • WarcRecord.headers
        • WarcRecord.http_charset
        • WarcRecord.http_content_type
        • WarcRecord.http_date
        • WarcRecord.http_headers
        • WarcRecord.http_last_modified
        • WarcRecord.is_http
        • WarcRecord.is_http_parsed
        • WarcRecord.reader
        • WarcRecord.record_date
        • WarcRecord.record_id
        • WarcRecord.record_type
        • WarcRecord.stream_pos
      • has_block_digest()
      • has_payload_digest()
      • is_concurrent()
      • is_http()
      • is_warc_10()
      • is_warc_11()
    • StreamIO
      • FastWARCError
      • ReaderStaleError
      • StreamError
      • BrotliStream
        • BrotliStream.begin_member()
        • BrotliStream.close()
        • BrotliStream.end_member()
        • BrotliStream.flush()
        • BrotliStream.seek()
        • BrotliStream.tell()
      • BufferedReader
        • BufferedReader.close()
        • BufferedReader.consume()
        • BufferedReader.read()
        • BufferedReader.readline()
        • BufferedReader.tell()
      • BytesIOStream
        • BytesIOStream.close()
        • BytesIOStream.getvalue()
        • BytesIOStream.seek()
        • BytesIOStream.tell()
      • CompressingStream
        • CompressingStream.begin_member()
        • CompressingStream.end_member()
      • FileStream
        • FileStream.close()
        • FileStream.flush()
        • FileStream.seek()
        • FileStream.tell()
      • GZipStream
        • GZipStream.begin_member()
        • GZipStream.close()
        • GZipStream.end_member()
        • GZipStream.flush()
        • GZipStream.prepopulate()
        • GZipStream.tell()
      • IOStream
        • IOStream.close()
        • IOStream.flush()
        • IOStream.read()
        • IOStream.seek()
        • IOStream.tell()
        • IOStream.write()
      • LZ4Stream
        • LZ4Stream.begin_member()
        • LZ4Stream.close()
        • LZ4Stream.end_member()
        • LZ4Stream.flush()
        • LZ4Stream.prepopulate()
        • LZ4Stream.tell()
      • PythonIOStreamAdapter
        • PythonIOStreamAdapter.close()
        • PythonIOStreamAdapter.flush()
        • PythonIOStreamAdapter.seek()
        • PythonIOStreamAdapter.tell()
      • wrap_stream()

CLI Documentation

  • Resiliparse CLI
    • Top-Level Commands
      • resiliparse
    • Full Command Listing
      • resiliparse
        • encoding
        • html
        • lang
  • FastWARC CLI
    • Top-Level Commands
      • fastwarc
    • Full Command Listing
      • fastwarc
        • benchmark
        • check
        • extract
        • index
        • recompress
ChatNoir Resiliparse
  • Resiliparse Beam Transforms
  • Edit on GitHub

Resiliparse Beam Transforms

Utility PTransforms and helpers for big data processing in Apache Beam.

Apache Beam Utilities

  • Elasticsearch
    • ElasticsearchBulkIndex
    • delete_action()
    • ensure_index()
    • index_action()
    • update_action()
  • File I/O
    • MatchFiles
  • Text File I/O
    • ReadAllFromText
    • ReadFromText
    • StrUtf8Coder
      • StrUtf8Coder.decode()
  • WARC I/O
    • ReadAllWarcs
    • ReadWarcs
Previous Next

© Copyright 2021, Janek Bevendorff. Revision 00a124c1.

Built with Sphinx using a theme provided by Read the Docs.
Read the Docs v: stable
Versions
latest
stable
Downloads
On Read the Docs
Project Home
Builds