Python markdown edge case: /* */


Is there a way to get Python to interpret Markdown the same way as it is interpreted here on stackoverflow:
This is a C comment: /* */ tada!
and on github? https://gist.github.com/jason-s/fc81280dc6108f9ec3a8
Python's markdown module interprets the * * as italics:
>>> import markdown
>>> markdown.markdown('This is a C comment: /* */ tada!')
u'<p>This is a C comment: /<em> </em>/ tada!</p>'
(Babelmark 2 shows some of the differences. Looks like there are different interpretations of the markdown syntax.)

The /* */ syntax is not standard Markdown. In fact, it is not mentioned at all in the syntax rules. Therefore, it is less likely to be handled consistently among different Markdown implementations.
If it is a C comment, then it is "code" and should probably be marked up as such. Either in a code block or using inline code backticks (`/* */`). As mentioned in a comment to the OP, it could also be escaped with backslashes if you really don't want it marked up as code. Personally, I would instruct the author to fix their documents (regardless of parser behavior).
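A minimal sketch of the backslash-escaping option, assuming each C comment sits on a single line and is not already inside a code span. The helper name escape_c_comments is hypothetical and not part of Python-Markdown:

```python
import re

# Matches a single-line C comment such as /* */ or /* text */
C_COMMENT = re.compile(r"/\*.*?\*/")


def escape_c_comments(text):
    """Backslash-escape the asterisks inside every /* ... */ span."""
    return C_COMMENT.sub(lambda m: m.group(0).replace("*", r"\*"), text)


escaped = escape_c_comments("This is a C comment: /* */ tada!")
print(escaped)
```

The escaped text can then be handed to markdown.markdown() as usual; Markdown renders a backslash-escaped asterisk as a literal one.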
In fact, the Markdown parsers that do ignore it do so by accident. In an effort to avoid matching a few edge cases that should not be interpreted as emphasis, they require a word boundary before the opening asterisk (but not after it) and a word boundary after the closing asterisk (but not before it) to consider it as emphasis. Because the C comment has a slash before the opening asterisk (and a space after it) and a slash after the closing asterisk (and a space before it), some parsers do not see it as emphasis. I suspect you will find that those same parsers fail to identify a few edge cases as emphasis that should be. And as the Syntax Rules are silent on these edge cases, each implementation gets them slightly different. I would even go so far as to say that the implementations that do not see that as emphasis are potentially in the wrong here. But this is not the place to debate that.
That said, you are using Python-Markdown, which has a comprehensive Extension API. If an existing third party extension does not already exist (see below), you can create your own. You may add your own pattern to match the C comment specifically and handle it however you like. Or you may override the parser's default handling of emphasis and make it match some other implementation whose behavior you desire.
Actually, the BetterEm Extension (which, for some reason, is not on the list of third party extensions) might do the latter and give you the behavior you want. Unfortunately, it does not ship by itself, but as part of a larger package which includes multiple extensions. Of course, you only need to use the one. To get it working you need to install it. Unfortunately, it does not appear to be hosted on PyPI, so you'll have to download it directly from GitHub. The following command should download and install it all in one go:
pip install https://github.com/facelessuser/pymdown-extensions/archive/master.zip
Once you have successfully installed it just do the following in Python:
>>> import markdown
>>> html = markdown.markdown(yourtext, extensions=['pymdownx.betterem'])
Or from the commandline:
python -m markdown -x 'pymdownx.betterem' yourinputfile.md > output.html
Note: I have not tested the BetterEm Extension. It may or may not give you the behavior you want. According to the docs, "the behavior will be very similar in feel to GFM bold and italic (but not necessarily exact)."

Well, it's ugly, but this patch, applied after Markdown processing, seems to do what I want.
>>> import markdown
>>> mdtext = 'This is a C comment: /* */ tada!'
>>> mdhtml = markdown.markdown(mdtext)
>>> mdhtml
u'<p>This is a C comment: /<em> </em>/ tada!</p>'
>>> import re
>>> mdccommentre = re.compile('/<em>( | .* )</em>/')
>>> mdccommentre.sub('/*\\1*/',mdhtml)
u'<p>This is a C comment: /* */ tada!</p>'
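If a line holds more than one comment, the greedy .* in the pattern above would swallow everything between the first <em> and the last </em>. A non-greedy variant, as a sketch only (restore_c_comments is a hypothetical name, and it assumes genuine emphasis is never wrapped directly in slashes):

```python
import re

# Non-greedy, so each /<em>...</em>/ pair is handled on its own
EM_IN_SLASHES = re.compile(r"/<em>(.*?)</em>/")


def restore_c_comments(html):
    """Turn /<em>...</em>/ back into /*...*/ in rendered HTML."""
    return EM_IN_SLASHES.sub(r"/*\1*/", html)


print(restore_c_comments("<p>/<em> a </em>/ and /<em> b </em>/</p>"))
```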

Related

How to let pyflakes ignore some errors?

I am using SublimePythonIDE which is using pyflakes.
There are some errors that I would like it to ignore like:
(E501) line too long
(E101) indentation contains mixed spaces and tabs
What is the easiest way to do that?

Configuring a plugin in Sublime almost always uses the same procedure: Click on Preferences -> Package Settings -> Plugin Name -> Settings-Default to open the (surprise surprise) default settings. This file generally contains all the possible settings for the plugin, usually along with comments explaining what each one does. This file cannot be modified, so to customize any settings you open Preferences -> Package Settings -> Plugin Name -> Settings-User. I usually copy the entire contents of the default settings into the user file, then customize as desired, then save and close.
In the case of this particular plugin, while it does use pyflakes (as advertised), it also makes use of pep8, a style checker that makes use of the very same PEP-8 official Python style guide I mentioned in the comments. This knowledge is useful because pyflakes does not make use of specific error codes, while pep8 does.
So, upon examination of the plugin's settings file, we find a "pep8_ignore" option as well as a "pyflakes_ignore" one. Since the error codes are coming from pep8, we'll use that setting:
"pep8_ignore": [
    "E501", // line too long
    "E303", // too many blank lines (3)
    "E402"  // module level import not at top of file
]
Please note that codes E121, E123, E126, E133, E226, E241, E242, and E704 are ignored by default because they are not rules unanimously accepted, and PEP 8 does not enforce them.
Regarding long lines:
Sometimes, long lines are unavoidable. PEP-8's recommendation of 79-character lines is based in ancient history when terminal monitors only had 80 character-wide screens, but it continues to this day for several reasons: it's backwards-compatible with old code, some equipment is still being used with those limitations, it looks good, it makes it easier on wider displays to have multiple files open side-by-side, and it is readable (something that you should always be keeping in mind when coding). If you prefer to have a 90- or 100-character limit, that's fine (if your team/project agrees with it), but use it consistently, and be aware that others may use different values. If you'd like to set pep8 to a larger value than its default of 80, just modify the "pep8_max_line_length" setting.
There are many ways to either decrease the character count of lines to stay within the limit, or split long lines into multiple shorter ones. In the case of your example in the comments:
flag, message = FacebookUserController.AddFBUserToDB(iOSUserId, fburl, fbsecret, code)
you can do a couple of things:
# shorten the module/class name
fbuc = FacebookUserController
# or
import FacebookUserController as fbuc
flag, message = fbuc.AddFBUserToDB(iOSUserId, fburl, fbsecret, code)
# or eliminate it all together
from FacebookUserController import AddFBUserToDB
flag, message = AddFBUserToDB(iOSUserId, fburl, fbsecret, code)
# split the function's arguments onto separate lines
flag, message = FacebookUserController.AddFBUserToDB(iOSUserId,
                                                     fburl,
                                                     fbsecret,
                                                     code)
# There are multiple ways of doing this, just make sure the subsequent
# line(s) are indented. You don't need to escape newlines inside of
# braces, brackets, and parentheses, but you do need to outside of them.
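The continuation rule in the last comment can be checked with a small stand-alone sketch. The function add and the values below are made up for illustration:

```python
def add(a, b, c, d):
    return a + b + c + d


# Inside parentheses a newline needs no escaping
total = add(1,
            2,
            3,
            4)

# Outside brackets the newline must be escaped with a backslash
label = "flag" + \
        "message"

# Wrapping the expression in parentheses avoids the backslash
label_again = ("flag" +
               "message")

print(total, label, label_again)
```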

As others suggest, possibly heed the warnings. But in those cases where you can't, you can add # NOQA to the end of the offending lines. Note the two spaces before the #, as that too is a style thing that will be complained about.
And if pyflakes is wrapped in flake8, you can ignore specific errors.
For example, put the following in a tox.ini file in the project (or add it to an existing one):
[flake8]
exclude = .tox,./build
filename = *.py
ignore = E501,E101
This is possibly a duplicate of How do I get Pyflakes to ignore a statement?

How to make Sphinx understand Sage doctests?

I have a package that is primarily Python, and mostly meant to be used with Python. But also there are a few extra functions that are available when the module is used under Sage. The problem is that Sage doctests must be prefixed by sage: rather than >>>, and Sphinx doesn't pick these up when generating the documentation.
Is there a way to get Sphinx to recognize the sage: prefix as being equivalent to >>> when generating the HTML (or other) docs?

Well, you can use Sage's built-in version of Sphinx and its documentation builder. Work in progress for Sage at http://trac.sagemath.org/ticket/13679 allows for building the documentation for a single Python file which is not in Sage's source tree, so you could try that.

I finally found out how to preprocess the docstring, to change sage: to >>>. The following goes into my project's doc/conf.py:
# Adapted from http://stackoverflow.com/a/11746519/1048959
import re

def process_docstring(app, what, name, obj, options, lines):
    # Rewrite "sage: " prompts to ">>> " in place so Sphinx can parse them
    for i in range(len(lines)):
        lines[i] = re.sub(r'^(\s*)sage: ', r'\1>>> ', lines[i])

def setup(app):
    app.connect('autodoc-process-docstring', process_docstring)
At least now Sphinx can parse my docstrings without generating errors. I'm still leaving this question open though because there is still a problem: the generated documentation shows >>> rather than sage:, which can be misleading to the reader.
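The substitution itself can be tried outside Sphinx. This is only a sketch of the same regex applied to a list of docstring lines; sage_to_python_prompts is a hypothetical name, and the list is modified in place because that is how the autodoc-process-docstring handler is expected to hand back its changes:

```python
import re

SAGE_PROMPT = re.compile(r'^(\s*)sage: ')


def sage_to_python_prompts(lines):
    """Rewrite leading 'sage: ' prompts to '>>> ' in place."""
    for i, line in enumerate(lines):
        lines[i] = SAGE_PROMPT.sub(r'\1>>> ', line)
    return lines


doc = ["Example::", "", "    sage: x[0, 0]", "    0j", "the sage: prefix"]
sage_to_python_prompts(doc)
print(doc)
```

Only prompts at the start of a line (after optional indentation) are touched, so prose that merely mentions sage: is left alone.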

Okay, here's another idea: try prefacing your indented blocks with double colons. For example, in slices.rst, change
You can use numpy style indexes:

    >>> x[0, 0]
    0j
to
You can use numpy style indexes::

    sage: x[0, 0]
    0j
(I added the double colon, and I changed the prompt to sage:.) I tried this with your code, but commenting out your modification to conf.py. See the Sphinx docs for source code blocks.
Then you need to modify one Sphinx file:
diff -ur sphinx/highlighting.py sphinx/highlighting.py
--- sphinx/highlighting.py 2010-08-11 17:17:48.000000000 +0200
+++ sphinx/highlighting.py 2010-11-28 12:04:44.068642703 +0100
@@ -161,7 +161,7 @@
         # find out which lexer to use
         if lang in ('py', 'python'):
-            if source.startswith('>>>'):
+            if source.startswith('>>>') or source.startswith('sage: '):
                 # interactive session
                 lexer = lexers['pycon']
             else:

Programmatically converting/parsing LaTeX code to plain text

I have a couple of code projects in C++/Python in which LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks. However, we also have some plain text outputs, such as an HTML version of the documentation (I already have code to write minimal markup for that) and a non-TeX-enabled plot renderer.
For these I would like to eliminate the TeX markup that is necessary for e.g. representing physical units. This includes non-breaking (thin) spaces, \text, \mathrm etc. It would also be nice to parse down things like \frac{#1}{#2} into #1/#2 for the plain text output (and use MathJax for the HTML). Due to the system that we've got at the moment, I need to be able to do this from Python, i.e. ideally I'm looking for a Python package, but a non-Python executable which I can call from Python and catch the output string would also be fine.
I'm aware of the similar question on the TeX StackExchange site, but there weren't any really programmatic solutions to that: I've looked at detex, plasTeX and pytex, which all seem a bit dead and don't really do what I need: programmatic conversion of a TeX string to a representative plain text string.
I could try writing a basic TeX parser using e.g. pyparsing, but a) that might be pitfall-laden and help would be appreciated and b) surely someone has tried that before, or knows of a way to hook into TeX itself to get a better result?
Update: Thanks for all the answers... it does indeed seem to be a bit of an awkward request! I can make do with less than general parsing of LaTeX, but the reason for considering a parser rather than a load of regexes in a loop is that I want to be able to handle nested macros and multi-arg macros nicely, and get the brace matching to work properly. Then I can e.g. reduce txt-irrelevant macros like \text and \mathrm first, and handle txt-relevant ones like \frac last... maybe even with appropriate parentheses! Well, I can dream... for now regexes are not doing such a terrible job.
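To make the brace-matching idea concrete, here is a rough sketch of a recursive converter that unwraps \text and \mathrm, turns \frac{a}{b} into a/b with parentheses where needed, and maps the thin space \, to a plain space. All names (read_group, paren, tex_to_text, UNWRAP) are made up for this illustration, unknown macros are copied through unchanged, and escaped braces such as \{ are not handled, so treat it as a starting point rather than a general LaTeX parser:

```python
UNWRAP = {"text", "mathrm"}  # macros replaced by their converted argument


def read_group(s, i):
    """Return (content, index after '}') for the brace group opening at s[i]."""
    if i >= len(s) or s[i] != "{":
        raise ValueError("expected '{' at position %d" % i)
    depth = 0
    for j in range(i, len(s)):
        if s[j] == "{":
            depth += 1
        elif s[j] == "}":
            depth -= 1
            if depth == 0:
                return s[i + 1:j], j + 1
    raise ValueError("unbalanced braces")


def paren(expr):
    """Parenthesise anything that is not a single alphanumeric token."""
    return expr if expr.isalnum() else "(" + expr + ")"


def tex_to_text(s):
    out = []
    i = 0
    while i < len(s):
        if s[i] != "\\":
            out.append(s[i])
            i += 1
            continue
        # Read the macro name (letters only) that follows the backslash
        j = i + 1
        while j < len(s) and s[j].isalpha():
            j += 1
        name = s[i + 1:j]
        if name in UNWRAP:
            arg, i = read_group(s, j)
            out.append(tex_to_text(arg))  # recurse to handle nesting
        elif name == "frac":
            num, k = read_group(s, j)
            den, i = read_group(s, k)
            out.append(paren(tex_to_text(num)) + "/" + paren(tex_to_text(den)))
        elif s[i:i + 2] == "\\,":
            out.append(" ")  # thin space becomes a plain space
            i += 2
        else:
            # Unknown macro or control symbol: copy it through unchanged
            end = max(j, i + 2)
            out.append(s[i:end])
            i = end
    return "".join(out)


print(tex_to_text(r"10\,\mathrm{m}"))
print(tex_to_text(r"\frac{\frac{a}{b}}{c}"))
```

Because each argument is converted recursively before it is inserted, nested macros are reduced from the inside out, which is the part a loop of flat regexes struggles with.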

I understand this is an old post, but since this post comes up often in latex-python-parsing searches (as evidenced by Extract only body text from arXiv articles formatted as .tex), I'm leaving this here for folks down the line: Here's a LaTeX parser in Python that supports search over and modification of the parse tree, https://github.com/alvinwan/texsoup. Taken from the README, here is sample text and how you can interact with it via TexSoup.
from TexSoup import TexSoup
soup = TexSoup(r"""
\begin{document}
\section{Hello \textit{world}.}
\subsection{Watermelon}
(n.) A sacred fruit. Also known as:
\begin{itemize}
\item red lemon
\item life
\end{itemize}
Here is the prevalence of each synonym.
\begin{tabular}{c c}
red lemon & uncommon \\
life & common
\end{tabular}
\end{document}
""")
Here's how to navigate the parse tree.
>>> soup.section # grabs the first `section`
\section{Hello \textit{world}.}
>>> soup.section.name
'section'
>>> soup.section.string
'Hello \\textit{world}.'
>>> soup.section.parent.name
'document'
>>> soup.tabular
\begin{tabular}{c c}
red lemon & uncommon \\
life & common
\end{tabular}
>>> soup.tabular.args[0]
'c c'
>>> soup.item
\item red lemon
>>> list(soup.find_all('item'))
[\item red lemon, \item life]
Disclaimer: I wrote this lib, but it was for similar reasons. Regarding the post by Little Bobby Tales (regarding def), TexSoup doesn't handle definitions.

A word of caution: It is much more difficult to write a complete parser for plain TeX than what you might think. The TeX-level (not LaTeX) \def command actually extends TeX's syntax. For example, \def\foo #1.{{\bf #1}} will expand \foo goo. into goo - Notice that the dot became a delimiter for the foo macro! Therefore, if you have to deal with any form of TeX, without restrictions on which packages may be used, it is not recommended to rely on simple parsing. You need TeX rendering. catdvi is what I use, although it is not perfect.

Try detex (shipped with most *TeX distributions), or the improved version: http://code.google.com/p/opendetex/
Edit: oh, I see you tried detex already. Still, opendetex might work for you.

I would try pandoc (http://johnmacfarlane.net/pandoc/index.html). It is written in Haskell, but it is a really nice LaTeX-to-whatever converter.

As you're considering using TeX itself for doing the rendering, I suspect that performance is not an issue. In this case you've got a couple of options: dvi2txt to fetch your text from a single dvi file (be prepared to generate one for each label) or even rendering dvi into raster images, if it's ok for you - that's how hevea or latex2html treats formulas.

Necroing this old thread, but found this nifty library called pylatexenc that seems to do almost exactly what the OP was after:
from pylatexenc.latex2text import LatexNodes2Text
LatexNodes2Text().latex_to_text(r"""
\section{Euler}
\emph{This} bit is \textbf{very} clever:
\begin{equation}
\mathrm{e}^{i \pi} + 1 = 0 % wow!!
\end{equation}
where
\[
\mathrm{e} = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n
\]
""")
which produces
§ EULER
This bit is very clever:
e^i π + 1 = 0
where
e = lim_n →∞(1 + 1/n)^n
As you can see, the result is not perfect for the equations, but it does a great job of stripping and converting all the tex commands.

Building on the other post by Eduardo Leoni, I was looking at pandoc, and I see that it comes with a standalone executable, but on this page it also promises a way to build it as a C-callable system library. Perhaps this is something that you can live with?

LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks
This is your mistake. You shouldn't have done that.
Use RST or some other -- better -- markup language.
Use Docutils to create LaTeX and HTML from the RST source.

Common coding style for Python?

I'm pretty new to Python, and I want to develop my first serious open source project. I want to ask what is the common coding style for python projects. I'll put also what I'm doing right now.
1.- What is the most widely used column width? (the eternal question)
I'm currently sticking to 80 columns (and it's a pain!)
2.- What quotes to use? (I've seen everything and PEP 8 does not mention anything clear)
I'm using single quotes for everything but docstrings, which use triple double quotes.
3.- Where do I put my imports?
I'm putting them at file header in this order.
import sys
import -rest of python modules needed-
import whatever
import -rest of application modules-
<code here>
4.- Can I use "import whatever.function as blah"?
I saw some documents that disregard doing this.
5.- Tabs or spaces for indenting?
Currently using 4 spaces tabs.
6.- Variable naming style?
I'm using lowercase for everything but classes, which I put in camelCase.
Anything you would recommend?

PEP 8 is pretty much "the root" of all common style guides.
Google's Python style guide has some parts that are quite well thought of, but others are idiosyncratic (the two-space indents instead of the popular four-space ones, and the CamelCase style for functions and methods instead of the camel_case style, are pretty major idiosyncrasies).
On to your specific questions:
1.- What is the most widely used column width? (the eternal question) I'm currently sticking to 80 columns (and it's a pain!)
80 columns is most popular
2.- What quotes to use? (I've seen everything and PEP 8 does not mention anything clear) I'm using single quotes for everything but docstrings, which use triple double quotes.
I prefer the style you're using, but even Google was not able to reach a consensus about this:-(
3.- Where do I put my imports? I'm putting them at file header in this order.
import sys
import -rest of python modules needed-
import whatever
import -rest of application modules-
Yes, excellent choice, and popular too.
4.- Can I use "import whatever.function as blah"? I saw some documents that disregard doing this.
I strongly recommend you always import modules -- not specific names from inside a module. This is not just style -- there are strong advantages e.g. in testability in doing that. The as clause is fine, to shorten a module's name or avoid clashes.
5.- Tabs or spaces for indenting? Currently using 4 spaces tabs.
Overwhelmingly most popular.
6.- Variable naming style? I'm using lowercase for everything but classes, which I put in camelCase.
Almost everybody names classes with uppercase initial and constants with all-uppercase.
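On point 4, one common reading of the testability argument is that a name bound with from ... import keeps pointing at the original object even after the module attribute is patched. A small sketch using only the standard library; via_module and via_name are hypothetical names for this illustration:

```python
import os
from os import getcwd as imported_getcwd
from unittest import mock


def via_module():
    # Looked up on the module at call time, so a patch is seen
    return os.getcwd()


def via_name():
    # Bound at import time, so patching os.getcwd does not affect it
    return imported_getcwd()


with mock.patch("os.getcwd", return_value="/fake"):
    patched = (via_module(), via_name())

print(patched)
```

With module-level imports the test only has to patch one place; with imported names it has to patch every module that copied the name.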

1.- Most everyone has a 16:9 or 16:10 monitor nowadays. Even if they don't have a wide-screen, they have lots of pixels; 80 cols isn't a big practical deal breaker like it was when everyone was hacking at the command line in a remote terminal window on a 4:3 monitor at 320 X 240. I usually end the line when it gets too long, which is subjective. I am at 2048 X 1152 on a 23" Monitor X 2.
2.- Single quotes by default so you don't have to escape Double quotes, Double quotes when you need to embed single quotes, and Triple quotes for strings with embedded newlines.
3.- Put them at the top of the file; sometimes you put them in the main function if they aren't needed globally in the module.
4.- It is a common idiom to rename some modules. A good example is the following.
import sys

try:
    # for Python 2.6.x
    import json
except ImportError:
    # for previous Pythons
    try:
        import simplejson as json
    except ImportError:
        sys.exit('easy_install simplejson')
but the preferred way to import just a class or function is from module import xxx with the optional as yyy if needed
5.- Always use SPACES! 2 or 4 as long as no TABS
6.- Classes should be UpperCaseCamelStyle; variables are lowercase, sometimes lowerCamelCase or sometimes all_lowercase_separated_by_underscores, as are function names. "Constants" should be ALL_UPPER_CASE_SEPARATED_BY_UNDERSCORES
When in doubt, refer to PEP 8, the Python source, and existing conventions in a code base. But the most important thing is to be as internally consistent as possible. All Python code should look like it was written by the same person whenever possible.

Since I'm really crazy about "styling" I'll write down the guidelines that I currently use in a near 8k SLOC project with about 35 files, most of it matches PEP8.
PEP8 says 79(WTF?), I go with 80 and I'm used to it now. Less eye movement after all!
Docstrings and stuff that spans multiple lines in '''. Everything else in ''. Also I don't like double quotes, I only use single quotes all the time... guess that's because I came from the JavaScript corner, where it's just easier to use '', because that way you don't have to escape all the HTML stuff :O
At the head, built-in before custom application code. But I also go with a "fail early" approach, so if there's something that's version dependent (GTK for example) I'd import that first.
Depends, most of the time I go with import foo and from foo import, but there are certain cases (e.g. the name is already defined by another import) where I use from foo import bar as bla too.
4 Spaces. Period. If you really want to use tabs, make sure to convert them to spaces before committing when working with SCM. BUT NEVER(!) MIX TABS AND SPACES!!! It can AND WILL introduce horrible bugs.
some_method or foo_function, a CONSTANT, MyClass.
Also you can argue about indentation in cases where a method call or something spans multiple lines, and you can argue about which line continuation style you will use. Either surround everything with () or do the \ at the end of the line thingy. I do the latter, and I also place operators and other stuff at the start of the next line.
# always insert a newline after a wrapped one
from bla import foo, test, goo, \
                another_thing

def some_method_thats_too_long_for_80_columns(foo_argument, bar_argument, bla_argument,
                                              baz_argument):
    do_something(test, bla, baz)

value = 123 * foo + ten \
        - bla

if test > 20 \
   and x < 4:
    test_something()
elif foo > 7 \
     and bla == 2 \
     or me == blaaaaaa:
    test_the_megamoth()
Also I have some guidelines for comparison operations: I always use is (not) to check against None, True and False, and I never do an implicit boolean comparison like if foo:, I always do if foo is True:. Dynamic typing is nice, but in some cases I just want to be sure that the thing does the right thing!
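The difference between the implicit test and the identity checks shows up with values that are truthy or falsy without being True, False or None. A small sketch, where checks is a hypothetical helper:

```python
def checks(value):
    """Report how a value fares under the three kinds of test."""
    return {
        "if value": bool(value),
        "value is True": value is True,
        "value is not None": value is not None,
    }


for sample in (True, 1, "", None):
    print(repr(sample), checks(sample))
```

So if foo is True: rejects 1 and non-empty strings that if foo: would accept, which is the strictness this style is after.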
Another thing that I do is to never use empty strings! They are in a constants file, in the rest of the code I have stuff like username == UNSET_USERNAME or label = UNSET_LABEL it's just more descriptive that way!
I also have some strict whitespace guidelines and other crazy stuff, but I like it(because I'm crazy about it), I even wrote a script which checks my code:
http://github.com/BonsaiDen/Atarashii/blob/master/checkstyle
WARNING(!): It will hurt your feelings! Even more than JSLint does...
But that's just my 2 cents.

## in python using notepad++ syntax coloring

In my editor (notepad++) in Python script edit mode, a line
## is this a special comment or what?
Turns a different color (yellow) than a normal #comment.
What's special about a ##comment vs a #comment?

From the Python point of view, there's no difference. However, Notepad++'s highlighter considers the ## sequence as a STRINGEOL, which is why it colours it this way. See this thread.
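That Python itself sees no difference can be checked with the standard tokenize module. In this sketch comment_tokens is a hypothetical helper that lists the comment tokens found in a piece of source:

```python
import io
import tokenize


def comment_tokens(source):
    """Return the text of every COMMENT token in the given source."""
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.COMMENT]


print(comment_tokens("# one\n## two\nx = 1  ## three\n"))
```

Both forms come back as ordinary COMMENT tokens; the extra hash is just part of the comment text.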

I thought the difference had something to do with usage:
#this is a code block header
vs.
##this is a comment
I know Python doesn't care one way or the other, but I thought it was just convention to do it that way.

Also, in a different situation:
Comment whose first line is a double hash:
This is used by doxygen and Fredrik Lundh's PythonDoc. In doxygen,
if there's text on the line with the double hash, it is treated as
a summary string. I dislike this convention because it seems too
likely to result in false positives. E.g., if you comment-out a
region with a comment in it, you get a double-hash.
