How to write unicode characters to a LaTeX document via pylatex - python

Assume the string myStr, which contains three special characters:
myStr = "emdash —; delta Δ; thin space:  ;"
Further assume that we wish to write this string to a LaTeX document via pylatex.
If we write the string as is to a LaTeX document, errors occur during its compilation:
import pylatex
doc = pylatex.Document()
with doc.create(pylatex.Section('myStr -- not encoded')):
    doc.append(myStr)
doc.generate_pdf("myStr_notEncoded", clean_tex=False)
...
! Package inputenc Error: Unicode character Δ (U+0394)
(inputenc) not set up for use with LaTeX.
...
! Package inputenc Error: Unicode character   (U+2009)
(inputenc) not set up for use with LaTeX.
...
If we first encode the string via pylatexenc, the special characters are either replaced by their respective LaTeX macros (emdash, delta) or encoded in a way that is unclear to me (thin space).
import pylatexenc
from pylatexenc import latexencode
myStr_latex = pylatexenc.latexencode.unicode_to_latex(myStr)
doc = pylatex.Document()
with doc.create(pylatex.Section('myStr')):
    doc.append(myStr_latex)
doc.generate_pdf("myStr", clean_tex=False)
How do I have to write the string into the LaTeX document so that the special characters are printed as the actual characters when compiling with pdflatex?
Edit 1:
I also tried to change the default encoding inside the LaTeX document for the unencoded pathway but it results in a series of compilation errors as well.
doc.preamble.append(pylatex.NoEscape("\\usepackage[utf8]{inputenc}"))

You were close with your pylatexenc solution. When you encode LaTeX yourself, e.g. with pylatexenc.latexencode.unicode_to_latex(), you have to tell pylatex that the string should not be escaped a second time. To wit:
Using regular LaTeX strings may not be as simple as it seems though, because by default almost all strings are escaped[...] there are cases where raw LaTeX strings should just be used directly in the document. This is why the NoEscape string type exists. This is just a subclass of str, but it will not be escaped
In other words, to solve this, make sure to use NoEscape to tell pylatex that your string is already encoded as LaTeX and should not be escaped again:
import pylatex
from pylatexenc import latexencode
myStr_latex = latexencode.unicode_to_latex(myStr)
doc = pylatex.Document()
with doc.create(pylatex.Section('myStr')):
    doc.append(pylatex.utils.NoEscape(myStr_latex))
doc.generate_pdf("myStr", clean_tex=False)

Until a better solution is found, a workable (though by no means preferable) approach is to declare each offending character via the package newunicodechar:
doc.preamble.append(pylatex.Command('usepackage', 'newunicodechar'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{²}{\ensuremath{{}^2}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{ }{\,}'))
doc.preamble.append(pylatex.NoEscape(r"\newunicodechar{′}{'}"))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{−}{\ensuremath{-}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{∶}{\ensuremath{:}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{≤}{\ensuremath{\leq}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{≥}{\ensuremath{\geq}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{α}{\ensuremath{\alpha}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{β}{\ensuremath{\beta}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{Δ}{\ensuremath{\Delta}}')) ## (U+0394)
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{∆}{\ensuremath{\Delta}}')) ## (U+2206)
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{λ}{\ensuremath{\lambda}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{μ}{\ensuremath{\mu}}'))
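If the list of characters keeps growing, the same preamble lines can be generated from a single mapping. A minimal sketch, using only the pylatex calls already shown above (the mapping is just an excerpt of the declarations):

import pylatex

doc = pylatex.Document()
doc.preamble.append(pylatex.Command('usepackage', 'newunicodechar'))

# Excerpt of the character -> LaTeX replacement table from above.
replacements = {
    '²': r'\ensuremath{{}^2}',
    ' ': r'\,',                    # thin space (U+2009)
    'Δ': r'\ensuremath{\Delta}',   # (U+0394)
    'α': r'\ensuremath{\alpha}',
    'μ': r'\ensuremath{\mu}',
}
for char, latex in replacements.items():
    doc.preamble.append(
        pylatex.NoEscape(r'\newunicodechar{%s}{%s}' % (char, latex)))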

Related

How can I write exponents in a PySide QLabel?

I'm writing a Qt interface for a computing program and I would like to write in the units for an area (i.e. the LaTeX output of m^2, or m²).
If I use the special ² character in this code: area_label = QtGui.QLabel("m²"), it will display the following in the GUI: mÂ².
I suspect this could be an encoding issue, what would be the way to write the squared exponent I'm looking for?
Additional question: is there a way to output any exponent, any one not defined as a special character (say m^8)?
Additional info:
I'm working on python 2.7.2, with PySide version 1.1.1, and Qt 4.7.4. Working in Windows 7, SP1, but I would like my code to be cross-platform if possible.
Also, as I'm working on windows and I use french accents (like à and é), I'm using this encoding line at the beginning of my file: # -*-coding:Latin-1 -*.
Your encoding problem appears to be that you're passing UTF-8 strings, which PySide/Qt is trying to interpret according to your system encoding, which is something Latin-1 compatible (like cp1252, the traditional Windows default for western European languages) rather than UTF-8. You can see this pretty easily:
>>> print u'm\u00b2'.encode('utf-8').decode('latin-1')
mÂ²
PySide can take unicode strings everywhere. So, if you just use unicode everywhere instead of str/bytes, including at the interface to PySide, you should be fine.
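For example, a minimal sketch (PySide 1.x on Python 2) where the label text is an explicit unicode literal, so PySide never has to guess an encoding:

# -*- coding: utf-8 -*-
import sys
from PySide import QtGui

app = QtGui.QApplication(sys.argv)
# An explicit unicode literal; no byte string is left for Qt to decode.
area_label = QtGui.QLabel(u"m\u00b2")  # the same as u"m²"
area_label.show()
sys.exit(app.exec_())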
is there a way to output any exponent, any one not defined as a special character (say m^8)?
Well, ⁸ (U+2078) is defined as a special character, as evidenced by the fact that I was able to type it here.
However, you will have to write your own code to parse your expressions and generate proper superscript characters.
The superscripts and subscripts block at U+2070 to U+209F has most of the characters you need, except for 1, 2 and 3, which were left in their Latin-1-compatible positions at U+00B9, U+00B2 and U+00B3. (U+2071 is actually a superscript letter i, and U+2072 and U+2073 are unassigned. Some fonts display look-alike digit glyphs at those positions, but those fonts are not correct, so you shouldn't depend on that. You may want to print out the whole list and see which characters look better for you.)
The function to turn each digit into a superscript looks like this:
def superscript(digit):
    # 1, 2 and 3 kept their Latin-1 positions (U+00B9, U+00B2, U+00B3);
    # the other digits live in the U+2070 block.
    if digit == 1:
        return unichr(0x00B9)
    elif digit in (2, 3):
        return unichr(0x00B0 + digit)
    else:
        return unichr(0x2070 + digit)
So, a really simple wrapper would be:
def term(base, exponent):
    return base + u''.join(superscript(int(digit)) for digit in exponent)
Now:
>>> print term('x', '123')
x¹²³
However, if you want something more flexible, you're probably going to want to generate HTML instead of plain text. Recent versions of Qt can take HTML directly in a QLabel.
If you can generate MathML, Latex, etc. from your expressions, there are tools that generate HTML from those formats.
But for a really trivial example:
def term(base, exponent):
    return u'{}<sup>{}</sup>'.format(base, exponent)
When printed out, this will just show x<sup>123</sup>, but when stuck in a QLabel (or a Stack Overflow answer), the 123 is rendered as a proper superscript.
I'm using this encoding line: # -*-coding:Latin-1 -*.
Why? If it's at all possible for you to edit text files in UTF-8, that will make your life a lot easier. For one thing, Latin-1 doesn't have characters for any superscripts but 1, 2, and 3, which means you will have to write things like u'm\u2074' instead of just writing u'm⁴'.
Also, it's a bit misleading to use a coding declaration which is almost, but not quite, in emacs format. Either use emacs format (with the final hyphen and proper spacing):
# -*- coding: Latin-1 -*-
… or don't:
# coding=Latin-1
At any rate, all the encoding line does is to tell Python how to interpret your string literals. If you create non-unicode literals (without the u prefix), you still have to decode them at some point. And, if you don't do that decoding yourself, PySide will have to guess, and it will guess your system encoding (which is probably cp1252—which is close enough to Latin-1 for superscripts, but not close enough to UTF-8).
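A quick illustration of that point (a sketch, assuming the file really is saved as Latin-1):

# -*- coding: Latin-1 -*-
s = "m²"                  # byte string: the coding line makes this the bytes 'm\xb2'
u = s.decode("latin-1")   # the explicit decode you still have to do yourself
print repr(s), repr(u)    # 'm\xb2' u'm\xb2'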
So, to solve all of your problems:
Use UTF-8 encoding if possible.
If you can't use UTF-8 encoding, use explicit Unicode escapes or dynamic generation of strings to handle the characters Latin-1 is missing in your literals.
Make all of your literals Unicode.
Use Unicode strings wherever possible in your code.
If you do need byte strings anywhere, explicitly encode/decode them rather than letting Python/PySide/Qt guess for you.

Loading document as raw string in yaml with PyYAML

I want to parse yaml documents like the following
meta-info-1: val1
meta-info-2: val2
---
Plain text/markdown content!
jhaha
If I load_all this with PyYAML, I get the following
>>> list(yaml.load_all(open('index.yml')))
[{'meta-info-1': 'val1', 'meta-info-2': 'val2'}, 'Plain text/markdown content! jhaha']
What I am trying to achieve here is that the yaml file should contain two documents, and the second one is supposed to be interpreted as a single string document, more specifically any large body of text with markdown formatting. I don't want it to be parsed as YAML syntax.
In the above example, PyYAML returns the second document as a single string. But if the second document has a : character in place of the ! for instance, I get a syntax error. This is because PyYAML is parsing the stuff in that document.
Is there a way I can tell PyYAML that the second document is a just a raw string and not to parse it?
Edit: A few excellent answers there. While using quotes or the literal syntax solves the said problem, I'd like the users to be able to write the plain text without any extra cruft: just the three -'s (or .'s), then a large body of plain text, which might itself include quotes. So, I'd like to know if I can tell PyYAML to parse only one document and give me the second one raw.
Edit 2: So, adapting agf's idea, instead of using a try/except (since the second document could be valid YAML syntax):
config_content, body_content = open(filename).read().split('\n---', 1)
config = yaml.load(config_content)
body = body_content  # keep the second document as one raw string
Thanks agf.
You can do
raw = open(filename).read()
docs = []
for raw_doc in raw.split('\n---'):
    try:
        docs.append(yaml.load(raw_doc))
    except yaml.YAMLError:  # PyYAML signals parse errors with YAMLError, not SyntaxError
        docs.append(raw_doc)
If you won't have control over the format of the original document.
From the PyYAML docs,
Double-quoted is the most powerful style and the only style that can express any scalar value. Double-quoted scalars allow escaping. Using escaping sequences \x** and \u****, you may express any ASCII or Unicode character.
So it sounds like there is no way to represent an arbitrary scalar in the parsing if it's not double quoted.
If all you want is to escape the colon character in YAML, then enclose it within single or double quotes. Also, you can try literal style for your second document, which would then be treated as a single scalar.
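For instance, a minimal sketch of the literal-style suggestion: introduce the second document with --- | and PyYAML hands it back as one raw string, colons included:

import yaml

source = u"""\
meta-info-1: val1
meta-info-2: val2
--- |
  Plain text/markdown content: with colons!
  jhaha
"""
meta, body = yaml.load_all(source)
print(body)  # one plain string; nothing inside it is parsed as YAML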

Replace numeric character references in XML document using Python

I am struggling with the following issue: I have an XML string that contains the following tag and I want to convert this, using cElementTree, to a valid XML document:
<tag>&#55296;&#57136;&#55296;&#57149;&#55296;&#57139;&#55296;&#57136;&#55296;&#57151;&#55296;&#57154;&#55296;&#57136;</tag>
When these numeric character references are rendered, the output looks like: 𐌰𐌽𐌳𐌰𐌿𐍂𐌰
This is a unicode string and the encoding is UTF-8. I want to discard these numeric character references because they are not legal XML in a valid XML document (see Parser error using Perl XML::DOM module, "reference to invalid character number")
I have tried different regular expressions to match these numeric character references. For example, I have tried the following (Python) regex:
RE_NUMERIC_CHARACTER = re.compile('&#[\d{1,5}]+;')
This does work in a regular Python session, but as soon as I use the same regex in my code it doesn't work, presumably because those numeric characters have been interpreted (and are shown as boxes or question marks).
I have also tried the unescape function from http://effbot.org/zone/re-sub.htm but that does not work either.
Thus: how can I match, using a regular expression in Python, these numeric character references and create a valid XML document?
Eurgh. You've got surrogates (UTF-16 code units in the range D800-DFFF), which some fool has incorrectly encoded individually instead of using a pair of code units for a single character. It would be ideal to replace this mess with what it should look like:
<tag>&#66352;&#66365;&#66355;&#66352;&#66367;&#66370;&#66352;</tag>
Or, just as valid, in literal characters (if you've got a font that can display the Gothic alphabet):
<tag>𐌰𐌽𐌳𐌰𐌿𐍂𐌰</tag>
Usually, it would be best to do replacement operations like this on parsed text nodes, to avoid messing up non-character-reference sequences in other places like comments or PIs. However of course that's not possible in this case since this isn't really XML at all. You could try to fix it up with a crude regex, though it would be better to find out where the invalid input is coming from and kick the person responsible until they fix it.
>>> import re
>>> def lenient_deccharref(m):
...     return unichr(int(m.group(1)))
...
>>> tag = u'<tag>&#55296;&#57136;&#55296;&#57149;&#55296;&#57139;&#55296;&#57136;&#55296;&#57151;&#55296;&#57154;&#55296;&#57136;</tag>'
>>> re.sub(r'&#(\d+);', lenient_deccharref, tag).encode('utf-8')
'<tag>\xf0\x90\x8c\xb0\xf0\x90\x8c\xbd\xf0\x90\x8c\xb3\xf0\x90\x8c\xb0\xf0\x90\x8c\xbf\xf0\x90\x8d\x82\xf0\x90\x8c\xb0</tag>'
This is the correct UTF-8 encoding of 𐌰𐌽𐌳𐌰𐌿𐍂𐌰. The utf-8 codec allows you to encode a sequence of surrogates to correct UTF-8 even on a wide-Unicode platform where the surrogates should not have appeared in the string in the first place.
>>> _.decode('utf-8')
u'<tag>\U00010330\U0001033d\U00010333\U00010330\U0001033f\U00010342\U00010330</tag>'
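To then get the valid XML document the question asks for, the fixed-up string can be handed to cElementTree as usual. A short sketch (with a shortened sample string):

import re
import xml.etree.cElementTree as etree

def lenient_deccharref(m):
    return unichr(int(m.group(1)))

raw = u'<tag>&#55296;&#57136;&#55296;&#57149;</tag>'  # shortened sample
fixed = re.sub(r'&#(\d+);', lenient_deccharref, raw)

# After the fix-up this is well-formed XML, so it parses cleanly.
root = etree.fromstring(fixed.encode('utf-8'))
print repr(root.text)  # the two Gothic letters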

Python: getting \u00bd correctly in editor

I would like to do the following:
1) Serialize my class
2) Also manually edit the serialization dump file to remove certain objects of my class which I find unnecessary.
I am currently using Python with simplejson. As you know, simplejson converts all non-ASCII characters to unicode escapes. As a result, when I dump a particular object with simplejson, the unicode characters become something like "\u00bd" for 好.
I am interested in manually editing the simplejson file for convenience. Does anyone here know a workaround for me to do this?
My requirements for this serialization format:
1) Easy to use (just dump and load - done)
2) Allows me to edit them manually without much hassle.
3) Able to display chinese character
I use vim. Does anyone know a way to convert "\u00bd" to 好 in vim?
I don't know anything about simplejson or the serialisation part of the question, but you asked about converting "\u00bd" to 好 in Vim. Here are some vim tips for working with unicode:
You'll need the correct encoding set up in vim, see:
:help 'encoding'
:help 'fileencoding'
Entering unicode characters by number is simply a case of going into insert mode, pressing Ctrl-V and then typing u followed by the four digit number (or U followed by an 8-digit number). See:
:help i_CTRL-V_digit
Also bear in mind that in order for the character to display correctly in Vim, you'll need a fixed-width font containing that character. It appears as a wide space in Envy Code R and as various boxes in Lucida Console, Consolas and Courier New.
To replace \uXXXX with unicode character XXXX (where X is any hexadecimal digit), type this when in normal mode (where <ENTER> means press the ENTER key, don't type it literally):
:%s/\\u\x\{4\}/\=eval('"' . submatch(0) . '"')/g<ENTER>
Note however that u00bd appears to be unicode character ½ (1/2 in case that character doesn't display correctly on your screen), not the 好 character you mentioned (which is u597D I think). See this unicode table. Start vim and type these characters (where <Ctrl-V> is produced by holding CTRL, pressing V, releasing V and then releasing CTRL):
i<Ctrl-V>u00bd
You should see a small character looking like 1/2, assuming your font supports that character.
If you want json/simplejson to produce unicode output instead of str output with Unicode escapes, you need to pass ensure_ascii=False to dump()/dumps(), and then either encode the result yourself before saving or use a file-like object from codecs.
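A minimal sketch of that route (the file name is just an example):

import codecs
import simplejson as json

data = {'text': u'好'}
out = codecs.open('dump.json', 'w', encoding='utf-8')  # example file name
out.write(json.dumps(data, ensure_ascii=False))  # writes 好 literally, not \uXXXX
out.close()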

UTF in Python Regex

I'm aware that Python 3 fixes a lot of UTF issues. I am not, however, able to use Python 3; I am using 2.5.1.
I'm trying to regex a document but the document has UTF hyphens in it – rather than -. Python can't match these and if I put them in the regex it throws a wobbly.
How can I force Python to use a UTF string or in some way match a character such as that?
Thanks for your help
You have to escape the character in question (–) and put a u in front of the string literal to make it a unicode string.
So, for example, this:
re.compile("–")
becomes this:
re.compile(u"\u2013")
After a quick test and a visit to PEP 0263: Defining Python Source Code Encodings, I see you may need to tell Python the whole file is UTF-8 encoded by adding a comment like this to the first line.
# encoding: utf-8
Here's the test file I created and ran on Python 2.5.1 / OS X 10.5.6
# encoding: utf-8
import re
x = re.compile(u"–")
print x.search(u"xxx–x").start()
Don't use UTF-8 in a regular expression. UTF-8 is a multibyte encoding where some unicode code points are encoded by 2 or more bytes. You may match parts of your string that you didn't plan to match. Instead use unicode strings as suggested.
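A small sketch of the failure mode (assuming a UTF-8 encoded source file): as a byte string the en dash is the three bytes \xe2\x80\x93, so a byte-level character class really matches single bytes, which can hit fragments of other multibyte characters:

# -*- coding: utf-8 -*-
import re

byte_class = re.compile("[–]")    # actually a class of the bytes \xe2, \x80, \x93
print byte_class.findall("—")     # the em dash shares \xe2\x80 -> spurious matches

uni_class = re.compile(u"[\u2013]")     # a class of exactly one character
print uni_class.findall(u"a – b — c")  # [u'\u2013']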
