Python: getting \u00bd correctly in editor

I would like to do the following:
1) Serialize my class
2) Also manually edit the serialization dump file to remove certain objects of my class which I find unnecessary.
I am currently using Python with simplejson. As you know, simplejson escapes all non-ASCII characters to \uXXXX sequences. As a result, when I dump a particular object with simplejson, a Unicode character becomes something like "\u00bd" for 好.
I would like to be able to edit the simplejson file by hand for convenience. Does anyone know a workaround that lets me do this?
My requirements for this serialization format:
1) Easy to use (just dump and load - done)
2) Allows me to edit them manually without much hassle.
3) Able to display chinese character
I use vim. Does anyone know a way to convert "\u00bd" to 好 in vim?

I don't know anything about simplejson or the serialisation part of the question, but you asked about converting "\u00bd" to 好 in Vim. Here are some Vim tips for working with Unicode:
You'll need the correct encoding set up in vim, see:
:help 'encoding'
:help 'fileencoding'
Entering Unicode characters by number is simply a case of going into insert mode, pressing Ctrl-V and then typing u followed by the four-digit hexadecimal code point (or U followed by an eight-digit one). See:
:help i_CTRL-V_digit
Also bear in mind that in order for the character to display correctly in Vim, you'll need a fixed-width font containing that character. It appears as a wide space in Envy Code R and as various boxes in Lucida Console, Consolas and Courier New.
To replace \uXXXX with unicode character XXXX (where X is any hexadecimal digit), type this when in normal mode (where <ENTER> means press the ENTER key, don't type it literally):
:%s/\\u\x\{4\}/\=eval('"' . submatch(0) . '"')/g<ENTER>
Note however that u00bd appears to be unicode character ½ (1/2 in case that character doesn't display correctly on your screen), not the 好 character you mentioned (which is u597D I think). See this unicode table. Start vim and type these characters (where <Ctrl-V> is produced by holding CTRL, pressing V, releasing V and then releasing CTRL):
i<Ctrl-V>u00bd
You should see a small character looking like 1/2, assuming your font supports that character.

If you want json/simplejson to produce unicode output instead of str output full of Unicode escapes, then you need to pass ensure_ascii=False to dump()/dumps(), and then either encode the result before saving it or use a file-like object from codecs.
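A minimal sketch of that workflow (Python 2 era, to match the question; the data and filename are illustrative):

import codecs
import simplejson as json

data = {u'text': u'好'}

# ensure_ascii=False makes dumps() return a unicode object containing the
# real characters instead of an ASCII str full of \uXXXX escapes.
text = json.dumps(data, ensure_ascii=False)

# codecs.open returns a file-like object that encodes to UTF-8 on write.
with codecs.open('dump.json', 'w', encoding='utf-8') as f:
    f.write(text)

The resulting file contains 好 literally, so you can edit it in Vim directly.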


Trouble with chr and encoding issues

I am wondering why the output of the following code differs depending on where I view it:
N = 128
print(chr(N))
file = open('output.txt', 'w')
file.write(chr(N))
file.close()
In output.txt the output is a box with two zeros on the top row and an 8 and a 0 on the bottom row (the character doesn't show up here), whereas in my IDE the output is an empty square: □. Can someone explain why these two outputs don't match?
I am using Ubuntu 16.04 and my IDE is PyCharm CE. Also, the situation does not change if I try encoding:
file = open('output.txt', 'w', encoding = 'utf-8')
There’s nothing wrong with your code, or the file, or anything else.
You are correctly writing chr(128), aka U+0080, aka a Unicode control character, as UTF-8. The file will have the UTF-8 encoding of that character (the two bytes \xc2\x80).
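To see this concretely, read the file back in binary mode (a quick check against the question's output.txt):

with open('output.txt', 'rb') as f:
    raw = f.read()

print(raw)                  # b'\xc2\x80' -- the UTF-8 encoding of U+0080
print(raw.decode('utf-8'))  # the PAD control character itself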
When you view it in the unspecified first program (maybe you’re just catting it to whatever your terminal is?), it’s correctly reading those two bytes as the UTF-8 for the character U+0080 and displaying whatever image its selected font has for that character.
When you view it in PyCharm, it’s also correctly reading U+0080 and displaying it using whatever its selected font is.
The only difference is that they’re using different fonts. Different fonts do different things for non-printable control characters. (There's no standard rendering for this character; it has no specific meaning in Unicode, but is mapped to the Latin-1 Supplement character 0x80, which is defined as the control character "PAD", short for "Padding Character".[1]) Different things could be useful, so different fonts do different things:
Showing you the hex value of the control character could be useful for, e.g., the kind of people who work with Unicode at the shell, so your terminal (or whatever) is configured to use a font that shows them that way.
Just showing you that this is something you probably didn’t want to print by using the generic replacement box[2] could also be reasonable, so PyCharm is configured with a font that does that.
Just displaying it as a space could also be reasonable, especially in a fixed-width font. That's what I get when I cat it, or print it from my Python REPL, on my terminal.
Displaying the traditional Latin-1 name for the control character (PAD) in a box could also be useful. This is what Unifont has.
Displaying it as a Euro sign could be useful for cases where you're dealing with a bunch of old Java or Win32 code, for backward compatibility reasons.[3]
[1] Technically, that's no longer quite true; Unicode defines it in terms of ISO-15924 code 998, "Zyyy: Code for undetermined script", not as part of ISO-8859 at all. But practically, it's either PAD, or it's an indeterminate meaningless character, which isn't exactly more useful.
[2] What you actually pasted into your question is neither U+0080 nor U+FFFD but U+25A1, aka "White Square". Presumably either PyCharm recognized that its font didn't have a glyph for U+0080 and manually substituted U+25A1, or something on the chain from your clipboard to your browser to Stack Overflow did the same thing…
[3] After the Euro sign was created, but before Unicode 2.1 added U+20AC and ISO-8859 added the Latin-9 encoding, people had to have some way of displaying Euros. And one of the two most common non-standard encodings was to use Latin-1 80/Unicode U+0080. (The other was A4/U+00A4.) And there are a few Java and Win32 applications written for Unicode 2.0, using this hack, still being used in the wild, and fonts to support them.
On Ubuntu, Python 3's default file encoding is UTF-8. The function chr returns the character corresponding to its input value. However, not all characters can be shown; some exist only for control purposes. In your case, 128 (U+0080) is the Padding Character. Since it cannot be shown, each environment treats it differently: your file editor shows its value in hex, and your IDE simply doesn't render it. Nevertheless, both the editor and the IDE recognize which character it is.

How can I write exponents in a PySide QLabel?

I'm writing a Qt interface for a computing program and I would like to write in the units for an area (i.e., the LaTeX rendering of m^2: m²).
If I use the special ² character in this code: area_label = QtGui.QLabel("m²"), it will display the following in the GUI: mÂ².
I suspect this could be an encoding issue, what would be the way to write the squared exponent I'm looking for?
Additional question: is there a way to output any exponent, any one not defined as a special character (say m^8)?
Additional info:
I'm working on python 2.7.2, with PySide version 1.1.1, and Qt 4.7.4. Working in Windows 7, SP1, but I would like my code to be cross-platform if possible.
Also, as I'm working on windows and I use french accents (like à and é), I'm using this encoding line at the beginning of my file: # -*-coding:Latin-1 -*.
Your encoding problem appears to be that you're passing UTF-8 strings, which PySide/Qt is trying to interpret according to your system encoding, which is something Latin-1 compatible (like cp1252, the traditional Windows default for western European languages) rather than UTF-8. You can see this pretty easily:
>>> print u'm\u00b2'.encode('utf-8').decode('latin-1')
mÂ²
PySide can take unicode strings everywhere. So, if you just use unicode everywhere instead of str/bytes, including at the interface to PySide, you should be fine.
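A minimal sketch of that (assuming PySide 1.x on Python 2, as in the question):

# -*- coding: utf-8 -*-
import sys
from PySide import QtGui

app = QtGui.QApplication(sys.argv)
# A unicode literal: PySide receives a unicode object, so there is no
# guessing about the encoding of a byte string.
label = QtGui.QLabel(u'm\u00b2')
label.show()
app.exec_()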
is there a way to output any exponent, any one not defined as a special character (say m^8)?
Well, ⁸ (U+2078) is defined as a special character, as evidenced by the fact that I was able to type it here.
However, you will have to write your own code to parse your expressions and generate proper superscript characters.
The superscripts and subscripts block from U+2070 to U+209F has almost all of the characters you need; the exceptions are 1, 2 and 3, which were left in their Latin-1-compatible positions at U+00B9, U+00B2 and U+00B3. (U+2071 is the superscript letter i, not the digit one. Some fonts will display the reserved positions U+2072 and U+2073 as superscript 2 and 3, but those fonts are not correct, so you shouldn't depend on that. You may want to print out the whole list and see which ones look best for you.)
The function to turn each digit into a superscript looks like this:
def superscript(digit):
    # 1, 2 and 3 kept their Latin-1 positions; the rest are at U+2070 + digit
    if digit == 1:
        return unichr(0x00B9)
    elif digit in (2, 3):
        return unichr(0x00B0 + digit)
    else:
        return unichr(0x2070 + digit)
So, a really simple wrapper would be:
def term(base, exponent):
    return base + u''.join(superscript(int(digit)) for digit in exponent)
Now:
>>> print term('x', '123')
x¹²³
However, if you want something more flexible, you're probably going to want to generate HTML instead of plain text. Recent versions of Qt can take HTML directly in a QLabel.
If you can generate MathML, Latex, etc. from your expressions, there are tools that generate HTML from those formats.
But for a really trivial example:
def term(base, exponent):
    return u'{}<sup>{}</sup>'.format(base, exponent)
When printed out as plain text, this will just show x<sup>123</sup>, but when stuck in a QLabel (or a Stack Overflow answer), it renders as x with a superscripted 123.
I'm using this encoding line: # -*-coding:Latin-1 -*.
Why? If it's at all possible for you to edit text files in UTF-8, that will make your life a lot easier. For one thing, Latin-1 doesn't have characters for any superscripts but 1, 2, and 3, which means you will have to write things like u'm\u2074' instead of just writing u'm⁴'.
Also, it's a bit misleading to use a coding declaration which is almost, but not quite, in emacs format. Either use emacs format (with the final hyphen and proper spacing):
# -*- coding: Latin-1 -*-
… or don't:
# coding=Latin-1
At any rate, all the encoding line does is to tell Python how to interpret your string literals. If you create non-unicode literals (without the u prefix), you still have to decode them at some point. And, if you don't do that decoding yourself, PySide will have to guess, and it will guess your system encoding (which is probably cp1252—which is close enough to Latin-1 for superscripts, but not close enough to UTF-8).
So, to solve all of your problems:
Use UTF-8 encoding if possible.
If you can't use UTF-8 encoding, use explicit Unicode escapes or dynamic generation of strings to handle the characters Latin-1 is missing in your literals.
Make all of your literals Unicode.
Use Unicode strings wherever possible in your code.
If you do need byte strings anywhere, explicitly encode/decode them rather than letting Python/PySide/Qt guess for you.
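For that last point, a sketch of explicit decoding at the boundary (the byte string here is illustrative):

raw = 'm\xc2\xb2'            # UTF-8 bytes read from a file, a socket, etc.
text = raw.decode('utf-8')   # now a unicode object: u'm\xb2'
# ... hand `text` to PySide ...
out = text.encode('utf-8')   # encode explicitly when writing back out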

Convert unicode special symbol in python

I read the symbol °C using the xlrd library. I get the Unicode value u'\xb0C'. However, I want to use it as a normal string.
I went through a couple of posts including the below link
Convert a Unicode string to a string in Python (containing extra symbols)
It seems to work for many special symbols, but in this case I see only the C, without the ° (degree sign). Any help would be much appreciated.
Maybe I don't understand something, but:
>>> print u'\xb0C'.encode("UTF-8")
°C
If by "normal string" you mean ASCII encoded string, then you can't do exactly what you want. The degree symbol is not part of the ASCII character set, so the best you can hope to do is either drop it or convert it to a best approximation character from the ASCII character set. You could choose a different encoding, however you have to be sure that whatever systems you are interacting with will work with the encoding you choose. UTF-8 is usually a safe bet, and can encode pretty much any character you'll ever likely run in to.

Search and replace characters in a file with Python

I am trying to do transliteration, where I need to replace every English source character from a file with its equivalent in another language, taken from a dictionary (in Unicode format) that I define in the source code. I am now able to read character by character from the English file; how do I look up each character's equivalent in the dictionary and make sure it is written to a new, transliterated output file? Thank you :).
The translate method of Unicode objects is the simplest and fastest way to perform the transliteration you require. (I assume you're using Unicode, not plain byte strings which would make it impossible to have characters such as 'पत्र'!).
All you have to do is lay out your transliteration dictionary in a precise way, as specified in the docs to which I pointed you:
each key must be an integer, the codepoint of a Unicode character; for example, 0x0904 is the codepoint for ऄ, AKA "DEVANAGARI LETTER SHORT A", so for transliterating it you would use as the key in the dict the integer 0x0904 (equivalently, decimal 2308). (For a table with the codepoints for many South-Asian scripts, see this pdf).
the corresponding value can be a Unicode ordinal, a Unicode string (which is presumably what you'll use for your transliteration task, e.g. u'a' if you want to transliterate the Devanagari letter short A into the English letter 'a'), or None (if during the "transliteration" you want to simply remove instances of that Unicode character).
Characters that aren't found as keys in the dict are passed on untouched from the input to the output.
Once your dict is laid out like that, output_text = input_text.translate(thedict) does all the transliteration for you -- and pretty darn fast, too. You can apply this to blocks of Unicode text of any size that will fit comfortably in memory; basically, doing one text file at a time will be just fine on most machines (e.g., the wonderful -- and huge -- Mahabharata takes at most a few tens of megabytes in any of the freely downloadable forms -- Sanskrit [[cross-linked with both Devanagari and roman-transliterated forms]], English translation -- available from this site).
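A minimal sketch (Python 2 unicode strings; the two-entry mapping is purely illustrative):

# Keys are integer codepoints; values are unicode strings (or None to delete).
thedict = {
    0x0905: u'a',   # DEVANAGARI LETTER A  -> 'a'
    0x092A: u'p',   # DEVANAGARI LETTER PA -> 'p'
}

input_text = u'\u092a\u0905'
output_text = input_text.translate(thedict)
print output_text   # prints: pa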
Note: Updated after clarifications from questioner. Please read the comments from the OP attached to this answer.
Something like this:
for syllable in input_text.split_into_syllables():
    output_file.write(d[syllable])
Here output_file is a file object, open for writing, d is a dictionary where the keys are your source characters and the values are the output characters, and split_into_syllables stands in for whatever routine segments your input. You can also try to read your file line by line instead of reading it all in at once.

Unicode utf-8/utf-16 encoding in Python

In python:
u'\u3053\n'
Is it utf-16?
I'm not really aware of all the unicode/encoding stuff, but this type of thing is coming up in my dataset,
like if I have a=u'\u3053\n'.
print gives an exception and
decoding gives an exception.
a.encode("utf-16") > '\xff\xfeS0\n\x00'
a.encode("utf-8") > '\xe3\x81\x93\n'
print a.encode("utf-8") > πüô
print a.encode("utf-16") >  ■S0
What's going on here?
It's a Unicode character that doesn't seem to be displayable in your terminal's encoding. print tries to encode the unicode object in the encoding of your terminal, and if that can't be done you get an exception.
On a terminal that can display utf-8 you get:
>>> print u'\u3053'
こ
Your terminal doesn't seem to be able to display UTF-8; otherwise at least the print a.encode("utf-8") line would produce the correct character.
You ask:
u'\u3053\n'
Is it utf-16?
The answer is no: it's a unicode object, not bytes in any particular encoding. UTF-16 is an encoding.
To print a Unicode string effectively to your terminal, you need to find out what encoding that terminal is willing to accept and able to display. For example, the Terminal.app on my laptop is set to UTF-8 and with a rich font, so:
[screenshot (source: aleax.it): Terminal.app rendering the Hiragana character correctly]
...the Hiragana letter displays correctly. On a Linux workstation I have a terminal program that keeps resetting to Latin-1, so it would mangle things somewhat like yours -- I can set it to UTF-8, but it doesn't have a huge number of glyphs in its font, so it would display somewhat-useless placeholder glyphs instead.
The character is U+3053, "HIRAGANA LETTER KO".
The \xff\xfe at the start of the UTF-16 output is the encoded byte order mark (U+FEFF); then "S0" is \x53\x30, and then there's the \n from the original string. (Each character has its bytes "reversed" because this is little-endian UTF-16.)
The UTF-8 form represents the same Hiragana character in three bytes, with the bit pattern as documented here.
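A quick check of those byte layouts (Python 2 syntax, to match the question):

a = u'\u3053\n'

print repr(a.encode('utf-16'))     # '\xff\xfeS0\n\x00' -- BOM, then little-endian bytes
print repr(a.encode('utf-16-le'))  # 'S0\n\x00'         -- no BOM once you pick the endianness
print repr(a.encode('utf-8'))      # '\xe3\x81\x93\n'   -- three bytes for U+3053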
Now, as for whether you should really have it in your data set... where is this data coming from? Is it reasonable for it to have Hiragana characters in it?
Here's the Unicode HowTo Doc for Python 2.6.2:
http://docs.python.org/howto/unicode.html
Also see the links in the Reference section of that document for other explanations, including one by Joel Spolsky.
