Programmatically converting/parsing LaTeX code to plain text - python

I have a couple of code projects in C++/Python in which LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks. However, we also have some plain text outputs, such as an HTML version of the documentation (I already have code to write minimal markup for that) and a non-TeX-enabled plot renderer.
For these I would like to eliminate the TeX markup that is necessary for e.g. representing physical units. This includes non-breaking (thin) spaces, \text, \mathrm etc. It would also be nice to parse down things like \frac{#1}{#2} into #1/#2 for the plain text output (and use MathJax for the HTML). Due to the system that we've got at the moment, I need to be able to do this from Python, i.e. ideally I'm looking for a Python package, but a non-Python executable which I can call from Python and catch the output string would also be fine.
I'm aware of the similar question on the TeX StackExchange site, but there weren't any really programmatic solutions there: I've looked at detex, plasTeX and pytex, but they all seem a bit dead and don't really do what I need: programmatic conversion of a TeX string to a representative plain-text string.
I could try writing a basic TeX parser using e.g. pyparsing, but a) that might be pitfall-laden and help would be appreciated and b) surely someone has tried that before, or knows of a way to hook into TeX itself to get a better result?
Update: Thanks for all the answers... it does indeed seem to be a bit of an awkward request! I can make do with less than general parsing of LaTeX, but the reason for considering a parser rather than a load of regexes in a loop is that I want to be able to handle nested macros and multi-arg macros nicely, and get the brace matching to work properly. Then I can e.g. reduce txt-irrelevant macros like \text and \mathrm first, and handle txt-relevant ones like \frac last... maybe even with appropriate parentheses! Well, I can dream... for now regexes are not doing such a terrible job.
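For reference, here is roughly the kind of regexes-in-a-loop reduction I mean. It's only a sketch tailored to the unit-style markup above (a small set of wrapper macros, thin spaces, \frac), it copes with just one level of brace nesting, and the macro list is an assumption, so don't treat it as a general LaTeX parser.
import re

# Layout-only wrappers to drop (keeping their argument), thin/non-breaking
# spaces to collapse, and \frac{a}{b} to rewrite as (a)/(b).  The wrapper
# pattern allows one level of nested braces inside the argument (enough
# for things like \mathrm{m\,s^{-1}}), nothing deeper.
WRAPPERS = re.compile(r'\\(?:textrm|text|mathrm)\s*\{((?:[^{}]|\{[^{}]*\})*)\}')
FRAC = re.compile(r'\\frac\s*\{([^{}]*)\}\s*\{([^{}]*)\}')
SPACES = re.compile(r'\\[,;!]|~')

def tex_to_plain(s, max_passes=10):
    """Repeatedly apply the reductions so nested macros collapse outwards."""
    for _ in range(max_passes):
        new = WRAPPERS.sub(r'\1', s)
        new = FRAC.sub(r'(\1)/(\2)', new)
        new = SPACES.sub(' ', new)
        if new == s:
            break
        s = new
    return s

print(tex_to_plain(r'10\,\mathrm{m\,s^{-1}}'))       # 10 m s^{-1}
print(tex_to_plain(r'\frac{\text{d}x}{\text{d}t}'))  # (dx)/(dt)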

I understand this is an old post, but since it comes up often in latex-python-parsing searches (as evidenced by Extract only body text from arXiv articles formatted as .tex), I'm leaving this here for folks down the line: here's a LaTeX parser in Python that supports search over and modification of the parse tree, https://github.com/alvinwan/texsoup. Taken from the README, here is sample text and how you can interact with it via TexSoup.
from TexSoup import TexSoup
soup = TexSoup(r"""
\begin{document}
\section{Hello \textit{world}.}
\subsection{Watermelon}
(n.) A sacred fruit. Also known as:
\begin{itemize}
\item red lemon
\item life
\end{itemize}
Here is the prevalence of each synonym.
\begin{tabular}{c c}
red lemon & uncommon \\
life & common
\end{tabular}
\end{document}
""")
Here's how to navigate the parse tree.
>>> soup.section # grabs the first `section`
\section{Hello \textit{world}.}
>>> soup.section.name
'section'
>>> soup.section.string
'Hello \\textit{world}.'
>>> soup.section.parent.name
'document'
>>> soup.tabular
\begin{tabular}{c c}
red lemon & uncommon \\
life & common
\end{tabular}
>>> soup.tabular.args[0]
'c c'
>>> soup.item
\item red lemon
>>> list(soup.find_all('item'))
[\item red lemon, \item life]
Disclaimer: I wrote this lib, but for similar reasons to yours. Regarding the post below by Little Bobby Tales about \def: TexSoup doesn't handle definitions.

A word of caution: it is much more difficult to write a complete parser for plain TeX than you might think. The TeX-level (not LaTeX) \def command actually extends TeX's syntax. For example, \def\foo #1.{{\bf #1}} will expand \foo goo. into goo set in boldface. Notice that the dot became a delimiter for the \foo macro! Therefore, if you have to deal with any form of TeX, without restrictions on which packages may be used, I do not recommend relying on simple parsing. You need real TeX rendering; catdvi is what I use, although it is not perfect.
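To make that rendering route concrete, here is a rough sketch of driving latex and catdvi from Python via subprocess. The wrapper document and file names are illustrative assumptions, both tools must be installed, and catdvi's output quality varies.
import pathlib
import subprocess
import tempfile

def tex_snippet_to_text(snippet):
    """Compile a small LaTeX fragment to DVI, then extract text with catdvi."""
    with tempfile.TemporaryDirectory() as tmp:
        source = pathlib.Path(tmp, "snippet.tex")
        source.write_text(
            "\\documentclass{article}\n"
            "\\pagestyle{empty}\n"          # avoid a stray page number in the output
            "\\begin{document}\n" + snippet + "\n\\end{document}\n")
        subprocess.run(["latex", "-interaction=nonstopmode", source.name],
                       cwd=tmp, check=True, capture_output=True)
        result = subprocess.run(["catdvi", "snippet.dvi"],
                                cwd=tmp, check=True, capture_output=True, text=True)
        return result.stdout.strip()

print(tex_snippet_to_text(r"$10\,\mathrm{m\,s^{-1}}$"))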

Try detex (shipped with most *TeX distributions), or the improved version: http://code.google.com/p/opendetex/
Edit: oh, I see you tried detex already. Still, opendetex might work for you.
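If it does, calling it from Python and catching the output string (as asked for in the question) is just a subprocess pipe. A sketch, assuming the detex binary is on PATH and accepts LaTeX on stdin:
import subprocess

def strip_tex(tex_source):
    """Pipe a LaTeX string through detex and return the plain-text result."""
    result = subprocess.run(["detex"], input=tex_source,
                            capture_output=True, text=True, check=True)
    return result.stdout

print(strip_tex(r"A speed of $10\,\mathrm{m\,s^{-1}}$ is \textbf{fast}."))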

I would try pandoc: http://johnmacfarlane.net/pandoc/index.html. It is written in Haskell, but it is a really nice LaTeX-to-whatever converter.
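Calling pandoc from Python is also just a subprocess pipe. A sketch, assuming pandoc is on PATH; -t plain gives stripped text, and swapping it for -t html (optionally with --mathjax) covers the HTML side mentioned in the question.
import subprocess

def latex_to_plain(tex_source):
    """Convert a LaTeX string to plain text using pandoc."""
    result = subprocess.run(["pandoc", "-f", "latex", "-t", "plain"],
                            input=tex_source, capture_output=True,
                            text=True, check=True)
    return result.stdout

print(latex_to_plain(r"\emph{This} is $\frac{1}{2}$ of the story."))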

As you're considering using TeX itself for the rendering, I suspect that performance is not an issue. In that case you've got a couple of options: dvi2txt to fetch your text from a single DVI file (be prepared to generate one for each label), or even rendering the DVI into raster images, if that's OK for you - that's how hevea or latex2html treat formulas.

Necroing this old thread, but found this nifty library called pylatexenc that seems to do almost exactly what the OP was after:
from pylatexenc.latex2text import LatexNodes2Text
LatexNodes2Text().latex_to_text(r"""\
\section{Euler}
\emph{This} bit is \textbf{very} clever:
\begin{equation}
\mathrm{e}^{i \pi} + 1 = 0 % wow!!
\end{equation}
where
\[
\mathrm{e} = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n
\]
""")
which produces
§ EULER
This bit is very clever:
e^i π + 1 = 0
where
e = lim_n →∞(1 + 1/n)^n
As you can see, the result is not perfect for the equations, but it does a great job of stripping and converting all the tex commands.
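Applied to the unit-style strings from the question, the same one-liner API works; the exact rendering depends on the pylatexenc version, so the inputs below are just examples rather than guaranteed output.
from pylatexenc.latex2text import LatexNodes2Text

converter = LatexNodes2Text()
for s in (r"10\,\mathrm{m\,s^{-1}}",
          r"\frac{\Delta x}{\Delta t}",
          r"\text{speed} \approx 3\,\mathrm{km/h}"):
    print(converter.latex_to_text(s))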

Building on the other post by Eduardo Leoni: I was looking at pandoc, and I see that it comes with a standalone executable, but also, on this page, it promises a way to build it as a C-callable system library. Perhaps this is something that you can live with?

LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks
This is your mistake. You shouldn't have done that.
Use RST or some other -- better -- markup language.
Use Docutils to create LaTeX and HTML from the RST source.
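If you do go the RST route, a minimal Docutils call looks roughly like this (a sketch using the stock 'html' and 'latex' writers; publish_string returns bytes by default):
from docutils.core import publish_string

rst_source = "A speed of 10 m/s is *fast*."
html = publish_string(rst_source, writer_name='html')
latex = publish_string(rst_source, writer_name='latex')
print(html.decode('utf-8')[:80])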

Related

Use Python to modify words in a LaTeX file, ignoring LaTeX markup

I want to run an automated "spell checker" over some LaTeX files (in addition to spelling, it detects certain custom words, etc.). I need to read the LaTeX file, find certain words in the document text (i.e. ignoring words that are part of the LaTeX markup), then wrap each word in additional LaTeX highlighting markup and write the file back out. E.g.
\title{My Document}
...
I won the title!
If I search for "title", then it should ignore "\title".
This is so that, when rendered, the modified LaTeX will display found words using the highlighting I add e.g.:
\title{My Document}
...
I won the \colorbox{red}{title}!
A library would be helpful since I may eventually require additional parsing/control features, but simple modification is all I need for now.
It seems the hard part is discerning LaTeX commands, comments, etc. from actual body text.
Thanks.
You need a Python LaTeX parser to do this. https://github.com/alvinwan/TexSoup looks like a good candidate, though there are several available.
As with BeautifulSoup, there are search functions that allow you to find all the text nodes; you can then use regular Python split/search functions to find your misspelled words, and replace the text node with a new set of LaTeX nodes (with the wrapping syntax around the selected words).
TexSoup's documentation is a little unclear as to how to write the document back out, but looking at their source code they appear to override the repr function, so:
with open('out.tex', 'w') as f:
    f.write(repr(soup))
Should do it for you.
EDIT:
If you look at the descendants generator:
>>> [x for x in soup.descendants if isinstance(x, str)]
['\x08egin', '(n.) A sacred fruit. Also known as:', '\x08egin', 'Here is the prevalence of each synonym.', '\x08egin', 'red lemon & uncommon ', 'Hello \textit', '.', 'Watermelon', 'red lemon', 'life', 'itemize', '& common', 'tabular', 'document']
The "children" are a mix of strs and TexNodes. You can pick out the pure strings there for your check, and just walk the tree yourself. The children attribute bizzarely only includes the TextNode elements.
As I understand what you need, Python may not be the best-fitting instrument. I think what you need is sed or vim plus a group of editing scripts. That would work faster and be easier to maintain than writing a Python script.

How to apply a string method to a regular expression match in Python

I have a markdown file which is a little bit broken: links and images that are too long have line breaks in them. I would like to remove the line breaks from them.
Example:
from:
See for example the
[installation process for Ubuntu
Trusty](https://wiki.diasporafoundation.org/Installation/Ubuntu/Trusty). The
project offers a Vagrant installation too, but the documentation only admits
that you know what you do, that you are a developer. If it is difficult to
![https://diasporafoundation.org/assets/pages/about/network-
distributed-e941dd3e345d022ceae909beccccbacd.png](data/images/network-
distributed-e941dd3e345d022ceae909beccccbacd.png)
_A pretty decentralized network (Source: <https://diasporafoundation.org/>)_
to:
See for example the
[installation process for Ubuntu Trusty](https://wiki.diasporafoundation.org/Installation/Ubuntu/Trusty). The
project offers a Vagrant installation too, but the documentation only admits
that you know what you do, that you are a developer. If it is difficult to
![https://diasporafoundation.org/assets/pages/about/network-distributed-e941dd3e345d022ceae909beccccbacd.png](data/images/network-distributed-e941dd3e345d022ceae909beccccbacd.png)
_A pretty decentralized network (Source: <https://diasporafoundation.org/>)_
As you can see in this snippet, I managed to match all the links and images with the right pattern: https://regex101.com/r/uL8pO4/2
But now, what is the syntax in Python to use a string method like string.trim() on what I have captured with regular expression?
For the moment, I'm stuck with this:
fix_newlines = re.compile(r'\[([\w\s*:/]*)\]\(([^()]+)\)')
# Capture the links and remove line-breaks from their urls
# Something like r'[\1](\2)'.trim() ??
post['content'] = fix_newlines.sub(r'[\1](\2)', post['content'])
Edit: I updated the example to be more explicit about my problem.
Thank you for your answer
strip works similarly to trim. As you need to trim the newlines, use strip('\n'):
fin.readline().strip('\n')
This will work also:
>>> s = """
... ![https://diasporafoundation.org/assets/pages/about/network-
... distributed-e941dd3e345d022ceae909beccccbacd.png](data/images/network-
... distributed-e941dd3e345d022ceae909beccccbacd.png)
... """
>>> new_s = "".join(s.strip().split('\n'))
>>> new_s
'![https://diasporafoundation.org/assets/pages/about/network-distributed-e941dd3e345d022ceae909beccccbacd.png](data/images/network-distributed-e941dd3e345d022ceae909beccccbacd.png)'
>>>
Oftentimes built-in string functions will do, and they are easier to read than figuring out regexes. In this case, strip removes leading and trailing whitespace, then split returns a list of the items between newlines, and join puts them back together into a single string.
Alright, I finally found what I was searching for. With the snippet below, I can capture a string with a regex and then apply the treatment to each match.
import re

def remove_newlines(match):
    return "".join(match.group().strip().split('\n'))

links_pattern = re.compile(r'\[([\w\s*:/\-\.]*)\]\(([^()]+)\)')
post['content'] = links_pattern.sub(remove_newlines, post['content'])
Thank you for your answers and sorry if my question wasn't explicit enough.

Python markdown edge case: /* */

Is there a way to get Python to interpret Markdown the same way as it is interpreted here on stackoverflow:
This is a C comment: /* */ tada!
and on github? https://gist.github.com/jason-s/fc81280dc6108f9ec3a8
Python's markdown module interprets the * * as italics:
>>> import markdown
>>> markdown.markdown('This is a C comment: /* */ tada!')
u'<p>This is a C comment: /<em> </em>/ tada!</p>'
(Babelmark 2 shows some of the differences. Looks like there are different interpretations of the markdown syntax.)
The /* */ syntax is not standard Markdown. In fact, it is not mentioned at all in the syntax rules. Therefore, it is less likely to be handled consistently among different Markdown implementations.
If it is a C comment, then it is "code" and should probably be marked up as such. Either in a code block or using inline code backticks (`/* */`). As mentioned in a comment to the OP, it could also be escaped with backslashes if you really don't want it marked up as code. Personally, I would instruct the author to fix their documents (regardless of parser behavior).
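For what it's worth, both of those workarounds are easy to check with Python-Markdown itself; backslash escapes are standard Markdown, so the escaped asterisks come out literally.
import markdown

# Escaped asterisks are not treated as emphasis markers...
print(markdown.markdown(r'This is a C comment: /\* \*/ tada!'))
# ...and inline code sidesteps emphasis parsing entirely.
print(markdown.markdown('This is a C comment: `/* */` tada!'))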
In fact, the Markdown parsers that do ignore it do so by accident. In an effort to avoid matching a few edge cases that should not be interpreted as emphasis, they require a word boundary before the opening asterisk (but not after it) and a word boundary after the closing asterisk (but not before it) to consider it as emphasis. Because the C comment has a slash before the opening asterisk (and a space after it) and a slash after the closing asterisk (and a space before it), some parsers do not see it as emphasis. I suspect you will find that those same parsers fail to identify a few edge cases as emphasis that should be. And as the Syntax Rules are silent on these edge cases, each implementation handles them slightly differently. I would even go so far as to say that the implementations that do not see that as emphasis are potentially in the wrong here. But this is not the place to debate that.
That said, you are using Python-Markdown, which has a comprehensive Extension API. If a suitable third-party extension does not already exist (see below), you can create your own. You may add your own pattern to match the C comment specifically and handle it however you like. Or you may override the parser's default handling of emphasis and make it match some other implementation whose behavior you desire.
Actually, the BetterEm Extension (which, for some reason, is not on the list of third-party extensions) might do the latter and give you the behavior you want. Unfortunately, it does not ship by itself, but as part of a larger package which includes multiple extensions. Of course, you only need to use the one. To get it working you need to install it. Unfortunately, it does not appear to be hosted on PyPI, so you'll have to download it directly from GitHub. The following command should download and install it all in one go:
pip install https://github.com/facelessuser/pymdown-extensions/archive/master.zip
Once you have successfully installed it just do the following in Python:
>>> import markdown
>>> html = markdown.markdown(yourtext, extensions=['pymdownx.betterem'])
Or from the commandline:
python -m markdown -x 'pymdownx.betterem' yourinputfile.md > output.html
Note: I have not tested the BetterEm Extension. It may or may not give you the behavior you want. According to the docs, "the behavior will be very similar in feel to GFM bold and italic (but not necessarily exact)."
well, it's ugly, but this post-Markdown-processing patch seems to do what I want.
>>> import markdown
>>> mdtext = 'This is a C comment: /* */ tada!'
>>> mdhtml = markdown.markdown(mdtext)
>>> mdhtml
u'<p>This is a C comment: /<em> </em>/ tada!</p>'
>>> import re
>>> mdccommentre = re.compile('/<em>( | .* )</em>/')
>>> mdccommentre.sub('/*\\1*/',mdhtml)
u'<p>This is a C comment: /* */ tada!</p>'

Regex to find a string python

I have a string
<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />
What is the regex to find ABCDXYZ in Python?
Don't use regex to parse HTML. Use BeautifulSoup.
from bs4 import BeautifulSoup as BS
text = '''<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'''
soup = BS(text, 'html.parser')
print(soup.find('img').attrs['alt'])
If you're looking for the value of that alt attribute, you can do this:
>>> r = r'alt="(.*?)"'
Then:
>>> m = re.search(r, mystring)
>>> m.group(1)
'ABCDXYZ'
And you can use re.findall if you want to find more than one.
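For example, on a made-up snippet with two images:
import re

html = '<img src="a.jpg" alt="First" /> <img src="b.jpg" alt="Second" />'
print(re.findall(r'alt="(.*?)"', html))   # ['First', 'Second']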
However, this code will be easily fooled by something like this:
<span>Here's some text explaining how to do alt="foo" in an img tag.</span>
On the other hand, it'll also fail to pick up something like this:
<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />
How do you deal with that? The short answer is: You don't. XML and HTML are not regular languages.
It's worth backing up here to point out that Python's re engine is not actually a true regular expression engine—and, on top of that, it's embedded in a Turing-complete programming language. So obviously it is possible to build an HTML parser around Python and re. This answer shows part of a parser written in perl, where regexes do most of the heavy lifting. But that doesn't mean you should do it this way. You shouldn't be writing a parser in the first place, given that perfectly good ones already exist, and if you did, you shouldn't be forcing yourself to use regexes even when there's an easier way to do what you want. For quick&dirty playing around, regex is fine. For a production program, it's almost always the wrong answer.
One way to convince your boss to let you use a parser is by crafting a suite of tests that are all obviously valid, and that cannot possibly be handled by any regex-based solution short of a full parser. If you can come up with a test that can be parsed, but only using exponential backtracking, and therefore takes 12 hours with regex vs. 0.1 seconds with bs4, even better, but that's a bit trickier…
Of course it's also worth looking for articles online (and SO questions like this and this and the 300 other dups) and picking the best ones to show your boss.
If you really can't convince your boss otherwise, then you're done at this point. Given what's been specified, this works. Given what may or may not actually be intended, nothing short of mind-reading will work. As you find more and more real-life cases that fail, you can hack it up by adding more and more complex alternations and/or context onto the regex itself, or possibly use a series of regexes and post-filters, until finally you get sick of it and find yourself a better job.
First, a disclaimer: you shouldn't be using regular expressions to parse HTML. You can use BeautifulSoup for this.
Next, if you are actually serious about using regular expressions and the above is the exact case you want, then you could do something like:
<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/]+" alt="([a-zA-Z0-9/]+)" />
and you could access the captured text via the match object's group(1) (or groups()) method.

How do you enable block folding for Python comments in TextMate?

In TextMate 1.5.10 r1623, you get little arrows that allow you to fold method blocks:
Unfortunately, if you have a multi-line Python comment, it doesn't recognize it, so you can't fold it:
def foo():
    """
    How do
    I fold
    these comments?
    """
    print("bar")
TextMate's site has a page on how to customize folding: http://manual.macromates.com/en/navigation_overview#customizing_foldings
...but I'm not skilled enough with regex to do anything about it. TextMate uses the Oniguruma regex API, and I'm using the default Python.tmbundle, updated to the newest version via GetBundles.
Does anyone have an idea of how to do this? Thanks in advance for your help! :)
Adding the default foldingStartMarker and foldingStopMarker regex values for Python.tmbundle under the Python language in the Bundle Editor:
foldingStartMarker = '^\s*(def|class)\s+([.a-zA-Z0-9_ <]+)\s*(\((.*)\))?\s*:|\{\s*$|\(\s*$|\[\s*$|^\s*"""(?=.)(?!.*""")';
foldingStopMarker = '^\s*$|^\s*\}|^\s*\]|^\s*\)|^\s*"""\s*$';
It appears that multi-line comment folding does work in TextMate, but you must line up your quotes exactly like so:
""" Some sort of multi
line comment, which needs quotes
in just the right places to work. """
That seems to do it:
According to this Textmate Mailing list thread, if you follow it to the end, proper code folding for Python is not supported. Basically, regular expressions as implemented in the foldingStartMarker and foldingStopMarker do not allow for captures, thus the amount of spacing at the beginning of the "end fold" cannot be matched to the "begin fold".
The issue was never officially addressed by TextMate's creator, Allan Odgaard; however, since the thread is from 2005, I assume it is a dead issue and not one that will be supported.
