How to parse restructuredtext in python?

Is there any module that can parse restructuredtext into a tree model?
Can docutils or sphinx do this?

I'd like to extend upon the answer from Gareth Latty. "What you probably want is the parser at docutils.parsers.rst" is a good starting point of the answer, but what's next? Namely:
How to parse restructuredtext in python?
Below is the exact answer for Python 3.6 and docutils 0.14:
import docutils.nodes
import docutils.parsers.rst
import docutils.utils
import docutils.frontend
def parse_rst(text: str) -> docutils.nodes.document:
    parser = docutils.parsers.rst.Parser()
    components = (docutils.parsers.rst.Parser,)
    settings = docutils.frontend.OptionParser(components=components).get_default_values()
    document = docutils.utils.new_document('<rst-doc>', settings=settings)
    parser.parse(text, document)
    return document
The resulting document can then be processed with, for example, the visitor below, which prints all references in the document:
class MyVisitor(docutils.nodes.NodeVisitor):
    def visit_reference(self, node: docutils.nodes.reference) -> None:
        """Called for "reference" nodes."""
        print(node)
    def unknown_visit(self, node: docutils.nodes.Node) -> None:
        """Called for all other node types."""
        pass
Here's how to run it:
doc = parse_rst('spam spam lovely spam')
visitor = MyVisitor(doc)
doc.walk(visitor)
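If the hand-built settings object is not needed, a shorter route is docutils.core.publish_doctree, which parses a string and returns the document tree after running the standard transforms. The sketch below is an alternative to parse_rst, not the method from the answer above; the helper name reference_uris and the sample text are made up for illustration.

```python
import docutils.core
import docutils.nodes


def reference_uris(text):
    """Return the target URI of every hyperlink reference in an RST string."""
    # report_level 5 suppresses system messages that would otherwise go to stderr.
    doctree = docutils.core.publish_doctree(
        text, settings_overrides={"report_level": 5}
    )
    # Node.findall() was added in docutils 0.18.1; older releases only have traverse().
    find = getattr(doctree, "findall", None) or doctree.traverse
    return [
        node["refuri"]
        for node in find(docutils.nodes.reference)
        if node.get("refuri")
    ]


sample = "Spam\n====\n\nSee `Python <https://www.python.org>`_ for details.\n"
uris = reference_uris(sample)
print(uris)
```

The plain 'spam spam lovely spam' string used above contains no references, which is why a sample with an explicit hyperlink is used here.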

Docutils does indeed contain the tools to do this.
What you probably want is the parser at docutils.parsers.rst
See this page for details on what is involved. There are also some examples at docutils/examples.py - particularly check out the internals() function, which is probably of interest.

Related

accessing bookmarks using python-docx

I am using the python-docx module to read and edit a .docm file,
The file contains bookmarks, how do I access all the bookmarks already stored using that module, there doesnt seem to be any methods within the doc object.
As commented by @D Malan, as of November 2021 this is still an open issue in python-docx.
Meanwhile we can live with our own implementation.
Please create a file named docxbookmark.py in a folder accessible as an import:
from docx.document import Document as _innerdoclass
from docx import Document as _innerdocfn
from docx.oxml.shared import qn
from lxml.etree import Element as El
class Document(_innerdoclass):
    def _bookmark_elements(self, recursive=True):
        if recursive:
            # Walk the whole tree and collect every w:bookmarkStart element.
            startag = qn('w:bookmarkStart')
            bkms = []
            def _bookmark_elements_recursive(parent):
                if parent.tag == startag:
                    bkms.append(parent)
                for el in parent:
                    _bookmark_elements_recursive(el)
            _bookmark_elements_recursive(self._element)
            return bkms
        else:
            # XPath needs the prefixed name; the {namespace}tag form from qn() is not valid there.
            return self._element.xpath('//w:bookmarkStart')
    def bookmark_names(self):
        """
        Gets a list of bookmarks
        """
        return [v for bkmkels in self._bookmark_elements() for k,v in bkmkels.items() if k.endswith('}name')]
    def add_bookmark(self, bookmarkname):
        """
        Adds a bookmark with name bookmarkname to the end of the file
        """
        # self._element[0] is the document body; pick its last paragraph.
        el = [el for el in self._element[0] if el.tag.endswith('}p')][-1]
        el.append(El(qn('w:bookmarkStart'),{qn('w:id'):'0',qn('w:name'):bookmarkname}))
        el.append(El(qn('w:bookmarkEnd'),{qn('w:id'):'0'}))
    def __init__(self, innerDocInstance = None):
        super().__init__(Document, None)
        if innerDocInstance is not None and type(innerDocInstance) is _innerdoclass:
            # Reuse the state of the wrapped python-docx document.
            self.__body = innerDocInstance.__body
            self._element = innerDocInstance._element
            self._part = innerDocInstance._part
def DocumentCreate(docx=None):
    """
    Return a |Document| object loaded from *docx*, where *docx* can be
    either a path to a ``.docx`` file (a string) or a file-like object. If
    *docx* is missing or ``None``, the built-in default document "template"
    is loaded.
    """
    return Document(_innerdocfn(docx))
Now we can use our facade implementation just like the old one, along with those new add_bookmark and bookmark_names.
To add a bookmark in a new file, import our implementation and use add_bookmark on the document object:
from docxbookmark import DocumentCreate as Document
doc = Document()
doc.add_paragraph('First Paragraph')
doc.add_bookmark('FirstBookmark')
doc.add_paragraph('Second Paragraph')
doc.save('docwithbookmarks.docx')
To see bookmarks in a document, import our implementation and use bookmark_names on the document object:
from docxbookmark import DocumentCreate as Document
doc = Document('docwithbookmarks.docx')
doc.bookmark_names()
The returned list is simpler than other objects: it contains only strings, not objects. There is also an internal _bookmark_elements method, which returns lxml nodes; these are not the same as python-docx objects.
Only a few tests were made, so this will probably not work in many cases. Please tell me in the comments if it didn't work for you.
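As a cross-check that does not depend on python-docx at all, the bookmark names can also be read straight from the package with the standard library. This is only a sketch: it assumes the main document part sits at the conventional path word/document.xml, and the in-memory archive built at the end is a stripped-down stand-in, not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the one the w: prefix stands for).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def bookmark_names(docx_file):
    """List bookmark names found in the main body part of a .docx/.docm file."""
    with zipfile.ZipFile(docx_file) as archive:
        # Assumption: the main part is stored at the conventional location.
        root = ET.fromstring(archive.read("word/document.xml"))
    # ElementTree spells namespaced names as {uri}local, the same form qn() returns.
    return [
        el.get(f"{{{W_NS}}}name")
        for el in root.iter(f"{{{W_NS}}}bookmarkStart")
    ]


# Build a tiny stand-in archive in memory so the sketch runs without a real file.
xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:bookmarkStart w:id="0" w:name="FirstBookmark"/>'
    '<w:bookmarkEnd w:id="0"/>'
    "</w:p></w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", xml)
buffer.seek(0)

names = bookmark_names(buffer)
print(names)
```

With a real file, bookmark_names('docwithbookmarks.docx') should list the same names as the facade's bookmark_names() method.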

How to link typing-like nested classes and other urls with Sphinx and RST

Using intersphinx and autodoc, having:
:param stores: Array of objects
:type stores: list[dict[str,int]]
Would result in an entry like:
stores (list[dict[str,int]]) - Array of objects.
Is there a way to convert list[dict[str,int]] outside of the autodoc :param: derivative (or others like :rtype:) with raw RST (within the docstring) or programmatically given a 'list[dict[str,int]]' string?
Additionally, is it possible to use external links within the aforementioned example?
Example
Consider a script.py file:
def some_func(arg1):
    """
    This is a head description.

    :param arg1: The type of this param is hyperlinked.
    :type arg1: list[dict[str,int]]

    Is it possible to hyperlink this, here: dict[str,list[int]]
    Or even add custom references amongst the classes: dict[int,ref]
    Where *ref* links to a foreign, external source.
    """
Now in the Sphinx conf.py file add:
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.intersphinx'
]
intersphinx_mapping = {
    'py': ('https://docs.python.org/3', None),
}
In your index.rst, add:
Title
=====

.. toctree::
   :maxdepth: 2

.. autofunction:: script.some_func
And now simply make the html for the page.
The list[dict[str,int]] next to :type arg1: will be hyperlinked as shown at the beginning of this question, but dict[str,list[int]] obviously won't. Is there a way to make the latter behave like the former?
I reached a solution after digging around sphinx's code.
Injecting External References (:param:)
I created a custom extension that connects to the missing-reference event and attempts to resolve unknown references.
Code of reflinks.py:
import docutils.nodes as nodes
_cache = {}
def fill_cache(app):
    # reflinks defaults to None, so fall back to an empty mapping.
    _cache.update(app.config.reflinks or {})
def missing_reference(app, env, node, contnode):
    target = node['reftarget']
    try:
        uri = _cache[target]
    except KeyError:
        # Unknown target: return None so other handlers can try.
        return
    newnode = nodes.reference('', '', internal = False, refuri = uri)
    if not node.get('refexplicit'):
        name = target.replace('_', ' ') # style
        contnode = contnode.__class__(name, name)
    newnode.append(contnode)
    return newnode
def setup(app):
    app.add_config_value('reflinks', None, False)
    app.connect('builder-inited', fill_cache)
    app.connect('missing-reference', missing_reference, priority = 1000)
Explanation
I consulted intersphinx's methodology for resolving unknown references and connected the function with a high priority value so that it's hopefully only consulted as a last resort.
Followup
Include the extension.
Adding to conf.py:
reflinks = {'google': 'https://google.com'}
Allowed for script.py:
def some_func(arg1):
    """
    :param arg1: Google homepages.
    :type arg1: dict[str, google]
    """
Where dict[str, google] are now all hyperlinks.
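The "include the extension" step above amounts to making the module importable and listing it in extensions. A possible conf.py fragment is sketched here, assuming reflinks.py was saved in a hypothetical _ext folder next to conf.py; the folder name is an assumption, not something the answer specifies.

```python
import os
import sys

# Assumption: reflinks.py lives in ./_ext relative to conf.py; adjust as needed.
sys.path.insert(0, os.path.abspath('_ext'))

extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.intersphinx',
    'reflinks',  # the custom extension module from reflinks.py
]

# Maps a bare reference target to the external URL it should link to.
reflinks = {'google': 'https://google.com'}
```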
Formatting Nested Types
There were instances where I wanted to use type structures like list[dict[str,myref]] outside of fields like :param:, :rtype:, etc. Another short extension did the trick.
Code of nestlinks.py:
import sphinx.domains.python as domain
import docutils.parsers.rst.roles as roles
_field = domain.PyTypedField('class')
def handle(name, rawtext, text, lineno, inliner, options = {}, content = []):
    refs = _field.make_xrefs('class', 'py', text)
    return (refs, [])
def setup(app):
    roles.register_local_role('nref', handle)
Explanation
After reading this guide on roles, and digging here and here I realised that all I needed was a dummy field to handle the whole reference-making work and pretend like it's trying to reference classes.
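The reference-making work that the dummy field takes over can be pictured without Sphinx: a nested type string is cut at its brackets and commas, and each remaining name becomes a candidate cross-reference. The sketch below is a simplified stand-in written for illustration, not Sphinx's own implementation; split_type and its delimiter pattern are assumptions.

```python
import re

# Brackets and commas (with optional surrounding whitespace) separate type names.
_DELIMS = re.compile(r'(\s*[\[\],]\s*)')


def split_type(text):
    """Split a nested type string into ('name', ...) and ('delim', ...) tokens."""
    parts = []
    for token in _DELIMS.split(text):
        if not token:
            continue  # re.split yields empty strings between adjacent delimiters
        kind = 'delim' if _DELIMS.fullmatch(token) else 'name'
        parts.append((kind, token))
    return parts


tokens = split_type('list[dict[str,google]]')
names = [value for kind, value in tokens if kind == 'name']
print(names)
```

Each name would then be looked up as a reference target (which is where the missing-reference handler comes in for unknown names such as google), while the delimiters are emitted as plain text.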
Followup
Include the extension.
Now script.py:
def some_func(arg1):
    """
    :param arg1: Google homepages.
    :type arg1: dict[str, google]

    Now this :nref:`list[dict[str,google]]` is hyperlinked!
    """
Notes
I am using intersphinx and autodoc to link to python's types and document my function's docstrings.
I am not well-versed in Sphinx's underlying mechanisms so take my methodology with a grain of salt.
The examples provided have been adjusted to be reusable and generic, and have not been tested.
The usability of such features is obviously questionable and only necessary when libraries like extlinks don't cover your needs.

Can I call sphinx.parsers.Parser() directly and parse a fragment of reST?

I need to parse a stand-alone fragment of reST content to a doctree (for later processing). I can do it via docutils easily enough, e.g.:
# ref: http://stackoverflow.com/questions/12883428/
import docutils.frontend
import docutils.nodes
import docutils.parsers.rst
import docutils.utils
def parse_rst(text: str) -> docutils.nodes.document:
    parser = docutils.parsers.rst.Parser()
    components = (docutils.parsers.rst.Parser,)
    settings = docutils.frontend.OptionParser(
        components=components).get_default_values()
    document = docutils.utils.new_document('<rst-doc>', settings=settings)
    parser.parse(text, document)
    return document
class MyVisitor(docutils.nodes.NodeVisitor):
    def visit_reference(self, node: docutils.nodes.reference) -> None:
        """Called for "reference" nodes."""
        print(node)
    def unknown_visit(self, node: docutils.nodes.Node) -> None:
        """Called for all other node types."""
        print(node)
if __name__ == '__main__':
    doc = parse_rst('spam spam lovely spam')
    visitor = MyVisitor(doc)
    doc.walk(visitor)
That works unless the reST content includes Sphinx-specific directives/roles (e.g. .. glossary::, :term:, etc).
So I need the Sphinx parser rather than the docutils one.
I tried subclassing sphinx.parsers.Parser; using from sphinx.application import Sphinx then app.build() to fake a sphinx docs project.
But references and workarounds get complicated so I suspect I'm approaching it the wrong way. Can I use the sphinx parser outside of the full-blown sphinx-build workflow?

What is the meaning of "html.parser" when doing BeautifulSoup(source_code, 'html.parser')?

I am not getting the BeautifulSoup's syntax, especially the purpose of HTML parser inside the parenthesis.
BeautifulSoup(source_code, 'html.parser')
This argument defines which library is used to parse source_code. Check out the options in the docs and how they compare.
From what I understand, "html.parser" uses the html.parser module from the Python 3 standard library.
More reading on parsers:
Parser differences demo
A diagnostics method which shows you which packages are used
You can check out the BeautifulSoup source code to understand the constructor parameters and how they are used. Here is an excerpt from the constructor of the BeautifulSoup class in __init__.py:
def __init__(self, markup="", features=None, builder=None,
             parse_only=None, from_encoding=None, exclude_encodings=None,
             **kwargs):
    ...
    if builder is None:
        original_features = features
        if isinstance(features, basestring):
            features = [features]
        if features is None or len(features) == 0:
            features = self.DEFAULT_BUILDER_FEATURES
        builder_class = builder_registry.lookup(*features)
        if builder_class is None:
            raise FeatureNotFound(
                "Couldn't find a tree builder with the features you "
                "requested: %s. Do you need to install a parser library?"
                % ",".join(features))
        builder = builder_class()
        if not (original_features == builder.NAME or
                original_features in builder.ALTERNATE_NAMES):
            if builder.is_xml:
                markup_type = "XML"
            else:
                markup_type = "HTML"
The first argument is the markup (e.g. HTML code) and the second argument specifies how to parse that markup. If it is omitted, BeautifulSoup falls back to DEFAULT_BUILDER_FEATURES and picks the best HTML parser that is installed, but this can be overridden:
You can override this by specifying one of the following:
What type of markup you want to parse. Currently supported are “html”, “xml”, and “html5”.
The name of the parser library you want to use. Currently supported options are “lxml”, “html5lib”, and “html.parser” (Python’s built-in HTML parser).
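The name "html.parser" refers to the standard library's html.parser module, which BeautifulSoup wraps in one of its tree builders. To get a feel for what that underlying parser does on its own, here is a small sketch that uses it directly, without BeautifulSoup; the LinkCollector class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            self.links.extend(
                value for name, value in attrs if name == "href" and value
            )


collector = LinkCollector()
collector.feed(
    '<p>See <a href="https://example.com">this</a> and <a href="/docs">that</a>.</p>'
)
collector.close()
print(collector.links)
```

The standard library parser only emits events such as "start tag" and "end tag"; building a navigable tree out of those events is the part BeautifulSoup adds on top.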

Adding an HTML class to a Sphinx directive with an extension

I'm trying to write a basic extension to create a new directive (output) based on the only directive provided by Sphinx. This new directive simply needs to add an HTML class to the standard result of the Only directive.
So for this I have tried the following code based on this collapse-code-block extension.
Since docutils has virtually no documentation and I'm not a very experienced Python developer, I'm struggling to make this work.
Here is what I have tried, among other variations, none of which gave any real indication of what the issue is:
from docutils import nodes
from sphinx.directives.other import Only
class output_node(nodes.General, nodes.Element):
    pass
class output_directive(Only):
    option_spec = Only.option_spec
    def run(self):
        env = self.state.document.settings.env
        node = output_node()
        output = Only.run(self)
        node.setup_child(output)
        node.append(output)
        return [node]
def html_visit_output_node(self, node):
    self.body.append(self.starttag(node, 'div', '', CLASS='internalonly'))
def html_depart_output_node(self, node):
    self.body.append('</div>')
def setup(app):
    app.add_node(
        output_node,
        html=(
            html_visit_output_node,
            html_depart_output_node
        )
    )
    app.add_directive('output', output_directive)
I don't think it should be more complicated than that but this just doesn't cut it.
Any idea?
