eaiovnaovbqoebvqoeavibavo html5lib ======== .. image:: https://travis-ci.org/html5lib/html5lib-python.png?branch=master :target: https://travis-ci.org/html5lib/html5lib-python html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers. Usage ----- Simple usage follows this pattern: .. code-block:: python import html5lib with open("mydocument.html", "rb") as f: document = html5lib.parse(f) or: .. code-block:: python import html5lib document = html5lib.parse("
Hello World!") By default, the ``document`` will be an ``xml.etree`` element instance. Whenever possible, html5lib chooses the accelerated ``ElementTree`` implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x). Two other tree types are supported: ``xml.dom.minidom`` and ``lxml.etree``. To use an alternative format, specify the name of a treebuilder: .. code-block:: python import html5lib with open("mydocument.html", "rb") as f: lxml_etree_document = html5lib.parse(f, treebuilder="lxml") When using with ``urllib2`` (Python 2), the charset from HTTP should be pass into html5lib as follows: .. code-block:: python from contextlib import closing from urllib2 import urlopen import html5lib with closing(urlopen("http://example.com/")) as f: document = html5lib.parse(f, transport_encoding=f.info().getparam("charset")) When using with ``urllib.request`` (Python 3), the charset from HTTP should be pass into html5lib as follows: .. code-block:: python from urllib.request import urlopen import html5lib with urlopen("http://example.com/") as f: document = html5lib.parse(f, transport_encoding=f.info().get_content_charset()) To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use: .. code-block:: python import html5lib with open("mydocument.html", "rb") as f: parser = html5lib.HTMLParser(strict=True) document = parser.parse(f) When you're instantiating parser objects explicitly, pass a treebuilder class as the ``tree`` keyword argument to use an alternative document format: .. code-block:: python import html5lib parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom")) minidom_document = parser.parse("
Hello World!")
More documentation is available at https://html5lib.readthedocs.io/.
Installation
------------
html5lib works on CPython 2.7+, CPython 3.3+ and PyPy. To install it,
use:
.. code-block:: bash
$ pip install html5lib
Optional Dependencies
---------------------
The following third-party libraries may be used for additional
functionality:
- ``datrie`` can be used under CPython to improve parsing performance
(though in almost all cases the improvement is marginal);
- ``lxml`` is supported as a tree format (for both building and
walking) under CPython (but *not* PyPy where it is known to cause
segfaults);
- ``genshi`` has a treewalker (but not builder); and
- ``chardet`` can be used as a fallback when character encoding cannot
be determined.
Bugs
----
Please report any bugs on the `issue tracker