I'm writing a Qt interface for a computing program and I would like to write in the units for an area (ie, the LaTex output of m^2, or m².)
If I use the special ² character in this code: area_label = QtGui.QLabel("m²"), it will display the following in the GUI: mÂ².
I suspect this could be an encoding issue, what would be the way to write the squared exponent I'm looking for?
Additional question: is there a way to output any exponent, any one not defined as a special character (say m^8)?
Additional info:
I'm working on python 2.7.2, with PySide version 1.1.1, and Qt 4.7.4. Working in Windows 7, SP1, but I would like my code to be cross-platform if possible.
Also, as I'm working on windows and I use french accents (like à and é), I'm using this encoding line at the beginning of my file: # -*-coding:Latin-1 -*.
Your encoding problem appears to be that you're passing UTF-8 strings, which PySide/Qt is trying to interpret according to your system encoding, which is something Latin-1 compatible (like cp1252, the traditional Windows default for western European languages) rather than UTF-8. You can see this pretty easily:
>>> print u'm\u00b2'.encode('utf-8').decode('latin-1')
mÂ²
PySide can take unicode strings everywhere. So, if you just use unicode everywhere instead of str/bytes, including at the interface to PySide, you should be fine.
is there a way to output any exponent, any one not defined as a special character (say m^8)?
Well, ⁸ (U+2078) is defined as a special character, as evidenced by the fact that I was able to type it here.
However, you will have to write your own code to parse your expressions and generate proper superscript characters.
The superscripts and subscripts block in U+2070 to U+209F has all of the characters you need, except for 1, 2 and 3, which were left in their Latin-1-compatible positions at U+00B9, U+00B2 and U+00B3. (Some fonts will display U+2072 and U+2073 as equivalent characters, but those code points are unassigned, so you shouldn't depend on that. Also, U+2071 is not a superscript 1 but a superscript i, ⁱ, which looks similar in some fonts. You may want to print out the whole list and see how it looks in your font.)
The function to turn each digit into a superscript looks like this:
def superscript(digit):
    # 1, 2 and 3 live in Latin-1 Supplement; the other digits are at U+2070 + digit
    if digit == 1:
        return u'\u00b9'
    elif digit in (2, 3):
        return unichr(0x00B0 + digit)
    else:
        return unichr(0x2070 + digit)
So, a really simple wrapper would be:
def term(base, exponent):
    return base + u''.join(superscript(int(digit)) for digit in exponent)
Now:
>>> print term('x', '123')
x¹²³
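For exponents with more than one digit or with a sign, a lookup table may be easier to maintain than code point arithmetic. The following is only a minimal sketch, not part of the original answer: the names SUPERSCRIPTS, to_superscript and unit are hypothetical, and it is written so that it should run on both Python 2.7 and Python 3.

```python
# Hypothetical helper: map ASCII digits and signs to Unicode superscripts.
# 1, 2 and 3 come from Latin-1 Supplement, everything else from U+2070..U+207B.
SUPERSCRIPTS = {
    u'0': u'\u2070', u'1': u'\u00b9', u'2': u'\u00b2', u'3': u'\u00b3',
    u'4': u'\u2074', u'5': u'\u2075', u'6': u'\u2076', u'7': u'\u2077',
    u'8': u'\u2078', u'9': u'\u2079', u'+': u'\u207a', u'-': u'\u207b',
}


def to_superscript(exponent):
    """Return the exponent (an int or a digit string) as superscript characters."""
    return u''.join(SUPERSCRIPTS[ch] for ch in u'{}'.format(exponent))


def unit(base, exponent):
    """Build a unit label from a base symbol and an exponent."""
    return base + to_superscript(exponent)


area = unit(u'm', 2)          # u'm\u00b2'
per_second = unit(u's', -1)   # u's\u207b\u00b9'
```

Any character outside the table (a decimal point, a letter) raises KeyError here, which seems safer than silently passing it through.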
However, if you want something more flexible, you're probably going to want to generate HTML instead of plain text. Recent versions of Qt can take HTML directly in a QLabel.
If you can generate MathML, Latex, etc. from your expressions, there are tools that generate HTML from those formats.
But for a really trivial example:
def term(base, exponent):
    return u'{}<sup>{}</sup>'.format(base, exponent)
When printed out, this will just show x<sup>123</sup>, but when stuck in a QLabel (or a Stack Overflow answer), that shows as x¹²³.
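If the expressions already use caret notation such as m^8, the HTML can be produced with a regular expression. This is a rough sketch under assumptions: the names EXPONENT and to_rich_text are made up for illustration, exponents are assumed to be plain signed integers, and the input is assumed to contain no characters that need HTML escaping.

```python
import re

# Matches a caret followed by an optionally signed integer, e.g. ^8 or ^-2.
EXPONENT = re.compile(r'\^([+-]?\d+)')


def to_rich_text(expression):
    """Rewrite caret exponents as HTML <sup> tags."""
    return EXPONENT.sub(r'<sup>\1</sup>', expression)


label_text = to_rich_text(u'kg.m^2.s^-2')
# label_text could then be handed to a label, e.g. QtGui.QLabel(label_text)
```

The result is an ordinary unicode string, so it can be built and checked without a running GUI.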
I'm using this encoding line: # -*-coding:Latin-1 -*.
Why? If it's at all possible for you to edit text files in UTF-8, that will make your life a lot easier. For one thing, Latin-1 doesn't have characters for any superscripts but 1, 2, and 3, which means you will have to write things like u'm\u2074' instead of just writing u'm⁴'.
Also, it's a bit misleading to use a coding declaration which is almost, but not quite, in emacs format. Either use emacs format (with the final hyphen and proper spacing):
# -*- coding: Latin-1 -*-
… or don't:
# coding=Latin-1
At any rate, all the encoding line does is to tell Python how to interpret your string literals. If you create non-unicode literals (without the u prefix), you still have to decode them at some point. And, if you don't do that decoding yourself, PySide will have to guess, and it will guess your system encoding (which is probably cp1252—which is close enough to Latin-1 for superscripts, but not close enough to UTF-8).
So, to solve all of your problems:
Use UTF-8 encoding if possible.
If you can't use UTF-8 encoding, use explicit Unicode escapes or dynamic generation of strings to handle the characters Latin-1 is missing in your literals.
Make all of your literals Unicode.
Use Unicode strings wherever possible in your code.
If you do need byte strings anywhere, explicitly encode/decode them rather than letting Python/PySide/Qt guess for you.
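The last point can be illustrated with a small sketch. The byte values below are made-up example data, and the snippet should behave the same on Python 2.7 and Python 3:

```python
# Bytes as they might arrive from a Latin-1/cp1252 source (assumed example data).
raw = b'm\xb2'

# Decode once, at the boundary, using the encoding the bytes are known to be in.
text = raw.decode('cp1252')           # u'm\u00b2'

# Encode explicitly when bytes are needed again, e.g. for a UTF-8 file.
utf8_bytes = text.encode('utf-8')     # b'm\xc2\xb2'

# Decoding with the wrong encoding is what produces the garbled label text.
garbled = utf8_bytes.decode('cp1252')  # u'm\u00c2\u00b2'
```

Everything between the decode and the encode works on Unicode strings only, so no guessing is left to Python, PySide, or Qt.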
I am wondering why the output for the following code is changing:
N = 128
print(chr(N))
file = open('output.txt', 'w')
file.write(chr(N))
file.close()
In output.txt the output is a box with two zeros on the top row and an 8 and a 0 on the bottom row (the character doesn't show up here), whereas in my IDE the output is an empty square: □. Can someone explain why these two outputs don't match?
I am using Ubuntu 16.04 and my IDE is PyCharm CE. Also, the situation does not change if I try encoding:
file = open('output.txt', 'w', encoding = 'utf-8')
There’s nothing wrong with your code, or the file, or anything else.
You are correctly writing chr(128), aka U+0080, aka a Unicode control character, as UTF-8. The file will have the UTF-8 encoding of that character (the two bytes \xc2\x80).
When you view it in the unspecified first program (maybe you’re just catting it to whatever your terminal is?), it’s correctly reading those two bytes as the UTF-8 for the character U+0080 and displaying whatever image its selected font has for that character.
When you view it in PyCharm, it’s also correctly reading U+0080 and displaying it using whatever its selected font is.
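You can check this from Python itself instead of relying on how a font draws the character. A small sketch for Python 3 (the version the question appears to use), with only standard library calls:

```python
import unicodedata

ch = chr(128)                     # the character U+0080
encoded = ch.encode('utf-8')      # the bytes that end up in output.txt

print(hex(ord(ch)))               # 0x80
print(encoded)                    # b'\xc2\x80'
print(unicodedata.category(ch))   # Cc, the category for control characters
print(ch.isprintable())           # False
```

Both programs see the same two bytes; only the glyph chosen for a non-printable character differs.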
The only difference is that they’re using different fonts. Different fonts do different things for non-printable control characters. (There's no standard rendering for this character: it has no specific meaning in Unicode, but is mapped to the Latin-1 Supplement character 0x80, which is defined as control character "PAD", short for "Padding Character". [1]) Different things could be useful, so different fonts do different things:
Showing you the hex value of the control character could be useful for, e.g., the kind of people who work with Unicode at the shell, so your terminal (or whatever) is configured to use a font that shows them that way.
Just showing you that this is something you probably didn’t want to print by using the generic replacement box [2] could also be reasonable, so PyCharm is configured with a font that does that.
Just displaying it as a space could also be reasonable, especially in a fixed-width font. That's what I get when I cat it, or print it from my Python REPL, on my terminal.
Displaying the traditional Latin-1 name for the control character (PAD) in a box could also be useful. This is what Unifont has.
Displaying it as a Euro sign could be useful for cases where you're dealing with a bunch of old Java or Win32 code, for backward compatibility reasons. [3]
1. Technically, that's no longer quite true; Unicode defines it in terms of ISO-15924 code 998, "Zyyy: Code for undetermined script", not as part of ISO-8859 at all. But practically, it's either PAD, or it's an indeterminate meaningless character, which isn't exactly more useful.
2. What you actually pasted into your question is neither U+0080 nor U+FFFD but U+25A1, aka "White Square". Presumably either PyCharm recognized that its font didn't have a glyph for U+0080 and manually substituted U+25A1, or something on the chain from your clipboard to your browser to Stack Overflow did the same thing…
3. After the Euro sign was created, but before Unicode 2.1 added U+20AC and ISO-8859 added the Latin-9 encoding, people had to have some way of displaying Euros. And one of the two most common non-standard encodings was to use Latin-1 80/Unicode U+0080. (The other was A4/U+00A4). And there are a few Java and Win32 code applications written for Unicode 2.0, using this hack, still being used in the wild, and fonts to support them.
Python uses UTF-8 for its encoding. The function chr returns the corresponding character for each input value. However, not all characters can be shown; some characters are only for control purposes. In your case, 128 is the Padding Character. Since it cannot be shown, each environment treats it differently. Hence, your file editor shows its value in hex and your IDE simply doesn't show it. Nevertheless, both editor and IDE realize what character it is.
I read the symbol °C using xlrd library. I get the unicode value as u'\xb0C'. However I want to use it as a normal string.
I went through a couple of posts including the below link
Convert a Unicode string to a string in Python (containing extra symbols)
It seems to work for many special symbols, but in this case I am seeing only C, without the ° (degree). Any help would be much appreciated.
Maybe I don't understand something, but:
>>> print u'\xb0C'.encode("UTF-8")
°C
If by "normal string" you mean ASCII encoded string, then you can't do exactly what you want. The degree symbol is not part of the ASCII character set, so the best you can hope to do is either drop it or convert it to a best approximation character from the ASCII character set. You could choose a different encoding, however you have to be sure that whatever systems you are interacting with will work with the encoding you choose. UTF-8 is usually a safe bet, and can encode pretty much any character you'll ever likely run in to.
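To make the trade-offs concrete, here is a short sketch of the options just described. The 'deg ' stand-in is an arbitrary choice for illustration, not a standard mapping:

```python
text = u'\xb0C'

utf8 = text.encode('utf-8')                 # keeps the degree sign as two bytes
dropped = text.encode('ascii', 'ignore')    # silently drops the degree sign
marked = text.encode('ascii', 'replace')    # substitutes a question mark
# Hand-picked ASCII approximation (an assumption, choose whatever suits your data).
approx = text.replace(u'\xb0', u'deg ').encode('ascii')
```

Only the UTF-8 version can be decoded back to the original text; the other three lose information.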
Is there a comprehensive character replacement module for python that finds all non-ascii or non-unicode characters in a string and replaces them with ascii or unicode equivalents? This comfort with the "ignore" argument during encoding or decoding is insane, but likewise so is a '?' in every place that a non translated character was.
I'm looking for one module that finds irksome characters and conforms them to whatever standard is requested.
I realize that the amount of extant alphabets and encodings makes this somewhat impossible, but surely someone has taken a stab at it? Even a rudimentary solution would be better than the status quo.
The simplification for data transfer that this would mean is enormous.
i don't think what you want is really possible - but i think there is a decent option.
unicodedata has a 'normalize' method that can gracefully degrade text for you...
import unicodedata
def gracefully_degrade_to_ascii( text ):
    return unicodedata.normalize('NFKD',text).encode('ascii','ignore')
assuming the charset you're using is already mapped into unicode - or at least can be mapped into unicode - you should be able to degrade the unicode version of that text down to ascii or utf-8 with this module ( it's part of the standard library too )
Full Docs - http://docs.python.org/library/unicodedata.html
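A quick look at what this does, and does not, preserve may help. The sample strings below are invented for illustration; the function is the one from the answer, repeated so that the snippet is self-contained:

```python
import unicodedata


def gracefully_degrade_to_ascii(text):
    # NFKD splits characters into base + combining marks (or compatibility
    # equivalents); encoding with 'ignore' then drops whatever is not ASCII.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')


samples = [u'caf\u00e9', u'm\u00b2', u'\u201chi\u201d']
results = [gracefully_degrade_to_ascii(s) for s in samples]
# results == [b'cafe', b'm2', b'hi']
```

Accents and superscripts degrade sensibly, but characters with no decomposition, such as the curly quotes in the last sample, simply disappear.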
To look at any individual character and guess its encoding would be hard and probably not very accurate. However, you can use chardet to try and detect the encoding of an entire file. Then you can use the string decode() and encode() methods to convert its encoding to UTF-8.
http://pypi.python.org/pypi/chardet
And UTF-8 is backwards compatible with ASCII so that won't be a big deal.
I know I've done this before at another job, but I can't remember what I did.
I have a database that is full of varchar and memo fields that were cut and pasted from Office, webpages, and who knows where else. This is starting to cause encoding errors for me. Since Python has a very nice "decode" function to take a byte stream and translate it into Unicode, I thought I would just write my own encoding to fix this up. (For example, to take "smart quotes" and turn them into "standard quotes".)
But I can't remember how to get started. I think I copied one of the encodings that was close (cp1252.py) and then updated it.
Can anyone put me on the right path? Or suggest a better path?
I've expanded this with a bit more detail.
If you are reasonably sure of the encoding of the text in the database, you can do text.decode('cp1252') to get a Unicode string. If the guess is wrong this will likely blow up with an exception, or the decoder will 'disappear' some characters.
Creating a decoder along the lines you describe (modifying cp1252.py) is easy. You just need to define the translation table from bytes to Unicode characters.
However, if not all of the text in the database has the same encoding, your decoder will need some rules to decide which is the correct mapping. In this case you may want to punt and use the chardet module, which can scan the text and make a guess at the encoding.
Maybe the best approach would be to try to decode using the most likely encoding (cp1252) and, if that fails, fall back to using chardet to guess the correct encoding.
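That try-then-fall-back strategy might look roughly like the sketch below. The function name decode_text and the guess_encoding hook are hypothetical; the hook is where a detector such as chardet could be plugged in, and without one the sketch falls back to Latin-1, which can decode any byte sequence.

```python
def decode_text(raw, primary='cp1252', guess_encoding=None):
    """Decode bytes with the most likely encoding, falling back on failure."""
    try:
        return raw.decode(primary)
    except UnicodeDecodeError:
        # guess_encoding is a hypothetical callback taking the raw bytes and
        # returning an encoding name; one option would be a small wrapper
        # around chardet's detect function.
        fallback = guess_encoding(raw) if guess_encoding else 'latin-1'
        return raw.decode(fallback)


quoted = decode_text(b'\x93it\x92s fine\x94')   # smart quotes from cp1252
odd = decode_text(b'caf\x81')                   # 0x81 is undefined in cp1252
```

Note that a wrong guess does not always raise: many byte sequences decode without error under several encodings, so spot-checking the results is still worthwhile.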
If you use text.decode() and/or chardet, you'll end up with a Unicode string. Below is a simple routine which can translate characters in a Unicode string, e.g. "convert curly quotes to ASCII":
CHARMAP = [
(u'\u201c\u201d', '"'),
(u'\u2018\u2019', "'")
]
# replace with text.decode('cp1252') or chardet
text = u'\u201cit\u2019s probably going to work\u201d, he said'
_map = dict((c, r) for chars, r in CHARMAP for c in list(chars))
fixed = ''.join(_map.get(c, c) for c in text)
print fixed
Output:
"it's probably going to work", he said
I would like to do the following:
1) Serialize my class
2) Also manually edit the serialization dump file to remove certain objects of my class which I find unnecessary.
I am currently using python with simplejson. As you know, simplejson converts all characters to unicode. As a result, when I dump a particular object with simplejson, the unicode characters become something like "\u00bd" for 好.
I am interested to manually edit the simplejson file for convenience. Anyone here know a work around for me to do this?
My requirements for this serialization format:
1) Easy to use (just dump and load - done)
2) Allows me to edit them manually without much hassle.
3) Able to display chinese character
I use vim. Does anyone know a way to convert "\u00bd" to 好 in vim?
I don't know anything about simplejson or the Serialisation part of the question, but you asked about converting "\u00bd" to 好 in Vim. Here are some vim tips for working with unicode:
You'll need the correct encoding set up in vim, see:
:help 'encoding'
:help 'fileencoding'
Entering unicode characters by number is simply a case of going into insert mode, pressing Ctrl-V and then typing u followed by the four digit number (or U followed by an 8-digit number). See:
:help i_CTRL-V_digit
Also bear in mind that in order for the character to display correctly in Vim, you'll need a fixed-width font containing that character. It appears as a wide space in Envy Code R and as various boxes in Lucida Console, Consolas and Courier New.
To replace \uXXXX with unicode character XXXX (where X is any hexadecimal digit), type this when in normal mode (where <ENTER> means press the ENTER key, don't type it literally):
:%s/\\u\x\{4\}/\=eval('"' . submatch(0) . '"')/g<ENTER>
Note however that u00bd appears to be unicode character ½ (1/2 in case that character doesn't display correctly on your screen), not the 好 character you mentioned (which is u597D I think). See this unicode table. Start vim and type these characters (where <Ctrl-V> is produced by holding CTRL, pressing V, releasing V and then releasing CTRL):
i<Ctrl-V>u00bd
You should see a small character looking like 1/2, assuming your font supports that character.
If you want json/simplejson to produce unicode output instead of str output with Unicode escapes then you need to pass ensure_ascii=False to dump()/dumps(), then either encode before saving or use a file-like from codecs.
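As a rough illustration of that last suggestion, here is a sketch using the standard library json module on Python 3; simplejson is assumed to accept the same ensure_ascii keyword. The sample data and the file name dump.json are made up for the example.

```python
import io
import json
import os
import tempfile

data = {u'name': u'\u597d'}

escaped = json.dumps(data)                        # contains the \u597d escape
readable = json.dumps(data, ensure_ascii=False)   # contains the character itself

# Write the readable form as UTF-8 so it can be edited by hand in an editor.
path = os.path.join(tempfile.mkdtemp(), 'dump.json')
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(readable)

# Loading works the same way regardless of which form was written.
with io.open(path, encoding='utf-8') as handle:
    restored = json.load(handle)
```

The editor then has to open the file as UTF-8 as well, which in Vim is governed by the 'fileencoding' setting mentioned above.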