Trouble with chr and encoding issues - python

I am wondering why the output for the following code is changing:
N = 128
print(chr(N))
file = open('output.txt', 'w')
file.write(chr(N))
file.close()
In output.txt the output is: (<- character not showing up, but it's a box with two zeros on the top row and an 8 and a 0 on the bottom row) however in my IDE the output is an empty square: □. Can someone explain why these two outputs do not match?
I am using Ubuntu 16.04 and my IDE is PyCharm CE. Also, the situation does not change if I try encoding:
file = open('output.txt', 'w', encoding = 'utf-8')

There’s nothing wrong with your code, or the file, or anything else.
You are correctly writing chr(128), aka U+0080, aka a Unicode control character, as UTF-8. The file will have the UTF-8 encoding of that character (the two bytes \xc2\x80).
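For example, a quick way to confirm what actually lands on disk is to read the file back in binary mode (a small sketch, assuming Python 3):
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(chr(128))
with open('output.txt', 'rb') as f:
    print(f.read())   # b'\xc2\x80', the two-byte UTF-8 encoding of U+0080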
When you view it in the unspecified first program (maybe you’re just catting it to whatever your terminal is?), it’s correctly reading those two bytes as the UTF-8 for the character U+0080 and displaying whatever image its selected font has for that character.
When you view it in PyCharm, it’s also correctly reading U+0080 and displaying it using whatever its selected font provides.
The only difference is that they’re using different fonts. Different fonts do different things for non-printable control characters. (There's no standard rendering for this character—it has no specific meaning in Unicode, but is mapped to the Latin-1 Supplement character 0x80, which is defined as control character "PAD", short for "Padding Character".1) Different things could be useful, so different fonts do different things:
Showing you the hex value of the control character could be useful for, e.g., the kind of people who work with Unicode at the shell, so your terminal (or whatever) is configured to use a font that shows them that way.
Just showing you that this is something you probably didn’t want to print by using the generic replacement box2 could also be reasonable, so PyCharm is configured with a font that does that.
Just displaying it as a space could also be reasonable, especially in a fixed-width font. That's what I get when I cat it, or print it from my Python REPL, on my terminal.
Displaying the traditional Latin-1 name for the control character (PAD) in a box could also be useful. This is what Unifont has.
Displaying it as a Euro sign could be useful for cases where you're dealing with a bunch of old Java or Win32 code, for backward compatibility reasons.3
1. Technically, that's no longer quite true; Unicode defines it in terms of ISO-15924 code 998, "Zyyy: Code for undetermined script", not as part of ISO-8859 at all. But practically, it's either PAD, or it's an indeterminate meaningless character, which isn't exactly more useful.
2. What you actually pasted into your question is neither U+0080 nor U+FFFD but U+25A1, aka "White Square". Presumably either PyCharm recognized that its font didn't have a glyph for U+0080 and manually substituted U+25A1, or something on the chain from your clipboard to your browser to Stack Overflow did the same thing…
3. After the Euro sign was created, but before Unicode 2.1 added U+20AC and ISO-8859 added the Latin-9 encoding, people had to have some way of displaying Euros. And one of the two most common non-standard encodings was to use Latin-1 80/Unicode U+0080. (The other was A4/U+00A4). And there are a few Java and Win32 code applications written for Unicode 2.0, using this hack, still being used in the wild, and fonts to support them.

On your system, Python writes the file as UTF-8. The function chr() returns the corresponding character for each input value. However, not all characters can be shown; some characters exist only for control purposes. In your case, 128 is the Padding Character. Since it cannot be shown, each environment treats it differently. Hence, your file editor shows its value in hex and your IDE simply doesn't show it. Nevertheless, both the editor and the IDE recognize what character it is.

Why do UNICODE characters trigger an EncodeError in some environments and not in others?

I am making a Python project that needs to print, edit and return strings containing Greek characters.
On my main PC, which has the Greek language installed, everything runs fine, but when I run the same program with the same version of Python on my English laptop, an encode error is triggered. Specifically this one:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2:
ordinal not in range(128)
The error happens due to this code
my_string = "Δίας"
print(my_string)
Why is this happening and what do I need to do to fix it?
Why is this happening? You are using Python 2 and although it supports Unicode, it makes you jump through a few more hoops than Python 3 does. The string you provide contains characters that fall outside the normal first 128 ASCII characters, which is what is causing the problem.
The print statement tries to encode the string as standard ascii, but it runs into characters it doesn't understand and by that point, it does not know what encoding the characters are supposed to be in. You might think this is obvious: "the same encoding the file is in!" or "always UTF-8!", but Python 2 wants you to make it explicit.
What do you need to do to fix it? One solution would be to use Python 3 and not worry about it, if all you need is a quick solution. Python 3 really is the way forward at this point and using Python 2 makes you solve problems that many Python programmers today don't have to solve (although they should be able to, in the end).
If you want to keep using Python 2, you should change your code to this:
# coding=utf-8
my_string = u"Δίας"
print(my_string.encode('utf-8'))
The first line tells the interpreter explicitly what encoding the source file was written in. This helps your IDE as well, to make sure it is showing you the code correctly. The second line has the u in front of the string, telling Python my_string is in fact a unicode string. And the third line explicitly tells Python that you want the output to be utf-8 encoded as well.
A more complete explanation of all this is here https://docs.python.org/2/howto/unicode.html
If you're wondering why it works on your Greek computer but not on your English computer: the default encoding on the Greek computer actually has the code points for the characters you're using, while the default encoding on the English computer does not. Python can handle the string as a series of Unicode code points, but by the time it needs to encode them for output, it doesn't know what encoding to use, and the standard (English) default doesn't have the characters in the string.
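As a quick check on each machine, you can ask Python what encodings it plans to use; a small sketch (the output will differ between the Greek and the English setups):
import sys, locale
print(sys.stdout.encoding)             # what print encodes unicode strings to
print(locale.getpreferredencoding())   # what file I/O defaults to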

Unicode Emojis in Python from CSV files

I have some CSV data of some users' tweets.
In Excel it is displayed like this:
‰ÛÏIt felt like they were my friends and I was living the story with them‰Û #retired #IAN1
I imported this CSV file into Python, and there the same tweet appears like this (I am using PuTTY to connect to a server and I copied this from PuTTY's screen):
▒▒▒It felt like they were my friends and I was living the story with them▒۝ #retired #IAN1
I am wondering how to display these emoji characters properly. I am trying to separate all the words in this tweet but I am not sure how I can separate those emoji unicode characters.
In fact, you have almost certainly lost data…
I don’t know how you got your CSV file from the users’ tweets (you might explain that). But generally, CSV files are encoded in "cp1252" (or "windows-1252"), sometimes in "iso-8859-1" encoding. Nowadays, you can also find CSV files encoded in "utf-8".
If your tweets are encoded in "cp1252" or any other 8-bit single-byte coded character set, the Emojis are lost (replaced by "?") or badly converted.
Then, if you open your CSV file in Excel, it will use its default encoding ("cp1252") and load the file with corrupted characters. You can try with LibreOffice; it has a dialog box which allows you to choose your encoding more easily.
The copy/paste from PuTTY will also convert your characters depending on your console encoding… which is even worse!
If your CSV file uses "utf-8" encoding (or "utf-16", "utf-32") you have a better chance of preserving the Emojis. But there is still a problem: most Emojis have a code point greater than U+FFFF (65535 in decimal). For instance, Grinning Face "😀" has the code point U+1F600.
Characters like this are handled awkwardly in Python 2; try this:
# coding: utf8
from __future__ import unicode_literals
emoji = u"😀"
print(u"emoji: " + emoji)
print(u"repr: " + repr(emoji))
print(u"len: {}".format(len(emoji)))
You’ll get (if your console allows it):
emoji: 😀
repr: u'\U0001f600'
len: 2
The first line won’t print if your console doesn’t allow Unicode,
The \U escape sequence is similar to the \u one, but expects 8 hex digits, not 4.
Yes, this character has a length of 2! (On a narrow Python 2 build it is stored as a pair of UTF-16 surrogates.)
EDIT: With Python 3, you get:
emoji: 😀
repr: '😀'
len: 1
No escape sequence for repr(),
the length is 1!
What you can do is post a fragment of your CSV file as an attachment; then someone could analyse it…
See also Unicode Literals in Python Source Code in the Python 2.7 documentation.
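In the meantime, if your CSV file really is "utf-8", reading it with an explicit encoding should preserve the Emojis; a small Python 3 sketch ('tweets.csv' is a placeholder name):
import csv
with open('tweets.csv', encoding='utf-8', errors='replace', newline='') as f:
    for row in csv.reader(f):
        print(row)   # each cell is a str; emoji arrive as ordinary characters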
First of all, you shouldn't work with text copied from a console (let alone from a remote connection) because of formatting differences and how unreliable clipboards are. I'd suggest exporting your CSV and reading it directly.
I'm not quite sure what you are trying to do, but Twitter renders emojis as images on its website; in a plain console they may not display, because the console font often has no glyphs for them. Would you mind explaining your issue further?
I would personally treat the whole string as Unicode, separate each character into a list, and then rebuild the words based on spaces.
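A rough sketch of that idea (Python 3, where emoji are just ordinary characters in a str; the example tweet is made up):
tweet = "It felt like they were my friends \U0001F600 #retired #IAN1"
words = tweet.split()   # split on whitespace; any emoji stay attached to their word
print(words)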

How can I write exponents in a PySide QLabel?

I'm writing a Qt interface for a computing program and I would like to write in the units for an area (ie, the LaTex output of m^2, or m².)
If I use the special ² character in this code: area_label = QtGui.QLabel("m²"), it will display the following in the GUI: m².
I suspect this could be an encoding issue, what would be the way to write the squared exponent I'm looking for?
Additional question: is there a way to output any exponent, any one not defined as a special character (say m^8)?
Additional info:
I'm working on python 2.7.2, with PySide version 1.1.1, and Qt 4.7.4. Working in Windows 7, SP1, but I would like my code to be cross-platform if possible.
Also, as I'm working on windows and I use french accents (like à and é), I'm using this encoding line at the beginning of my file: # -*-coding:Latin-1 -*.
Your encoding problem appears to be that you're passing UTF-8 strings, which PySide/Qt is trying to interpret according to your system encoding, which is something Latin-1 compatible (like cp1252, the traditional Windows default for western European languages) rather than UTF-8. You can see this pretty easily:
>>> print u'm\u00b2'.encode('utf-8').decode('latin-1')
m²
PySide can take unicode strings everywhere. So, if you just use unicode everywhere instead of str/bytes, including at the interface to PySide, you should be fine.
is there a way to output any exponent, any one not defined as a special character (say m^8)?
Well, ⁸ (U+2078) is defined as a special character, as evidenced by the fact that I was able to type it here.
However, you will have to write your own code to parse your expressions and generate proper superscript characters.
The Superscripts and Subscripts block at U+2070 to U+209F has most of the characters you need, except for 1, 2 and 3, which were left in their Latin-1-compatible positions at U+00B9, U+00B2 and U+00B3. (Don't be tempted by U+2071 through U+2073: U+2071 is actually superscript i, and U+2072 and U+2073 are unassigned, even if some fonts draw something plausible there. You may want to print out the whole block and see which glyphs look right in your font.)
The function to turn each digit into a superscript looks like this:
def superscript(digit):
    # 1, 2 and 3 keep their Latin-1 code points (U+00B9, U+00B2, U+00B3);
    # the rest live in the Superscripts and Subscripts block at U+2070.
    if digit == 1:
        return unichr(0x00B9)
    elif digit in (2, 3):
        return unichr(0x00B0 + digit)
    else:
        return unichr(0x2070 + digit)
So, a really simple wrapper would be:
def term(base, exponent):
    return base + u''.join(superscript(int(digit)) for digit in exponent)
Now:
>>> print term('x', '123')
x¹²³
However, if you want something more flexible, you're probably going to want to generate HTML instead of plain text. Recent versions of Qt can take HTML directly in a QLabel.
If you can generate MathML, Latex, etc. from your expressions, there are tools that generate HTML from those formats.
But for a really trivial example:
def term(base, exponent):
    return u'{}<sup>{}</sup>'.format(base, exponent)
When printed out, this will just show x<sup>123</sup>, but when stuck in a QLabel (or a Stack Overflow answer), it renders as x with the 123 in superscript.
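For completeness, here is a minimal PySide sketch (assuming a working PySide/Qt install) that puts such an HTML string into a QLabel:
import sys
from PySide import QtCore, QtGui

app = QtGui.QApplication(sys.argv)
label = QtGui.QLabel(u'm<sup>8</sup>')
label.setTextFormat(QtCore.Qt.RichText)   # make sure the HTML is interpreted, not shown literally
label.show()
app.exec_()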
I'm using this encoding line: # -*-coding:Latin-1 -*.
Why? If it's at all possible for you to edit text files in UTF-8, that will make your life a lot easier. For one thing, Latin-1 doesn't have characters for any superscripts but 1, 2 and 3, which means you will have to write things like u'm\u2074' instead of just writing u'm⁴'.
Also, it's a bit misleading to use a coding declaration which is almost, but not quite, in emacs format. Either use emacs format (with the final hyphen and proper spacing):
# -*- coding: Latin-1 -*-
… or don't:
# coding=Latin-1
At any rate, all the encoding line does is to tell Python how to interpret your string literals. If you create non-unicode literals (without the u prefix), you still have to decode them at some point. And, if you don't do that decoding yourself, PySide will have to guess, and it will guess your system encoding (which is probably cp1252—which is close enough to Latin-1 for superscripts, but not close enough to UTF-8).
So, to solve all of your problems:
Use UTF-8 encoding if possible.
If you can't use UTF-8 encoding, use explicit Unicode escapes or dynamic generation of strings to handle the characters Latin-1 is missing in your literals.
Make all of your literals Unicode.
Use Unicode strings wherever possible in your code.
If you do need byte strings anywhere, explicitly encode/decode them rather than letting Python/PySide/Qt guess for you, as in the sketch below.
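A minimal sketch of that last point (the byte string here is just an illustration):
raw = 'm\xc2\xb2'             # UTF-8 bytes read from a file, socket, etc.
text = raw.decode('utf-8')    # decode explicitly at the boundary...
data = text.encode('utf-8')   # ...and encode explicitly on the way back out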

Encoding error in Python with Chinese characters

I'm a beginner having trouble decoding several dozen CSV files with numbers + (Simplified) Chinese characters to UTF-8 in Python 2.7.
I do not know the encoding of the input files so I have tried all the possible encodings I am aware of -- GB18030, UTF-7, UTF-8, UTF-16 & UTF-32 (LE & BE). Also, for good measure, GBK and GB2312, though these should be a subset of GB18030. The UTF ones all stop when they get to the first Chinese characters. The other encodings stop somewhere in the first line except GB18030. I thought this would be the solution because it read through the first few files and decoded them fine. Part of my code, reading line by line, is:
line = line.decode("GB18030")
The first 2 files I tried to decode worked fine. Midway through the third file, Python spits out
UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 168-169: illegal multibyte sequence
In this file, there are about 5 such errors in about a million lines.
I opened the input file in a text editor and checked which characters were giving the decoding errors, and the first few all had Euro signs in a particular column of the CSV files. I am fairly confident these are typos, so I would just like to delete the Euro characters. I would like to examine types of encoding errors one by one; I would like to get rid of all the Euro errors but do not want to just ignore others until I look at them first.
Edit: I used chardet which gave GB2312 as the encoding with .99 confidence for all files. I tried using GB2312 to decode which gave:
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 108-109: illegal multibyte sequence
""" ... GB18030. I thought this would be the solution because it read through the first few files and decoded them fine.""" -- please explain what you mean. To me, there are TWO criteria for a successful decoding: firstly that raw_bytes.decode('some_encoding') didn't fail, secondly that the resultant unicode when displayed makes sense in a particular language. Every file in the universe will pass the first test when decoded with latin1 aka iso_8859_1. Many files in East Asian languages pass the first test with gb18030, because mostly the frequently used characters in Chinese, Japanese, and Korean are encoded using the same blocks of two-byte sequences. How much of the second test have you done?
Don't muck about looking at the data in an IDE or text editor. Look at it in a web browser; they usually make a better job of detecting encodings.
How do you know that it's a Euro character? By looking at the screen of a text editor that's decoding the raw bytes using what encoding? cp1252?
How do you know it contains Chinese characters? Are you sure it's not Japanese? Korean? Where did you get it from?
Chinese files created in Hong Kong, Taiwan, maybe Macao, and other places off the mainland use big5 or big5_hkscs encoding -- try that.
In any case, take Mark's advice and point chardet at it; chardet usually makes a reasonably good job of detecting the encoding used if the file is large enough and correctly encoded Chinese/Japanese/Korean -- however if someone has been hand editing the file in a text editor using a single-byte charset, a few illegal characters may cause the encoding used for the other 99.9% of the characters not to be detected.
You may like to do print repr(line) on say 5 lines from the file and edit the output into your question.
If the file is not confidential, you may like to make it available for download.
Was the file created on Windows? How are you reading it in Python? (show code)
Update after OP comments:
Notepad etc don't attempt to guess the encoding; "ANSI" is the default. You have to tell it what to do. What you are calling the Euro character is the raw byte "\x80" decoded by your editor using the default encoding for your environment -- the usual suspect being "cp1252". Don't use such an editor to edit your file.
Earlier you were talking about the "first few errors". Now you say you have 5 errors total. Please explain.
If the file is indeed almost correct gb18030, you should be able to decode the file line by line, and when you get such an error, trap it, print the error message, extract the byte offsets from the message, print repr(two_bad_bytes), and keep going. I'm very interested in which of the two bytes the \x80 appears. If it doesn't appear at all, the "Euro character" is not part of your problem. Note that \x80 can appear validly in a gb18030 file, but only as the 2nd byte of a 2-byte sequence starting with \x81 to \xfe.
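A rough sketch of that trapping loop (Python 2; 'input.csv' is a placeholder name, and the exception's start/end attributes are used instead of parsing the message):
f = open('input.csv', 'rb')
for lino, line in enumerate(f, 1):
    try:
        uline = line.decode('gb18030')   # decoded OK, carry on
    except UnicodeDecodeError as e:
        print('line %d: %s' % (lino, e))
        print(repr(line[e.start:e.end]))   # the offending bytes
f.close()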
It's a good idea to know what your problem is before you try to fix it. Trying to fix it by bashing it about with Notepad etc in "ANSI" mode is not a good idea.
You have been very coy about how you decided that the results of gb18030 decoding made sense. In particular I would be closely scrutinising the lines where gbk fails but gb18030 "works" -- there must be some extremely rare Chinese characters in there, or maybe some non-Chinese non-ASCII characters ...
Here's a suggestion for a better way to inspect the damage: decode each file with raw_bytes.decode(encoding, 'replace') and write the result (encoded in utf8) to another file. Count the errors by result.count(u'\ufffd'). View the output file with whatever you used to decide that the gb18030 decoding made sense. The U+FFFD character should show up as a white question mark inside a black diamond.
If you decide that the undecodable pieces can be discarded, the easiest way is raw_bytes.decode(encoding, 'ignore')
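For example, the inspect-the-damage suggestion might look like this (a sketch; both file names are placeholders):
raw_bytes = open('input.csv', 'rb').read()
text = raw_bytes.decode('gb18030', 'replace')
print('%d undecodable sequences' % text.count(u'\ufffd'))
open('inspected.txt', 'wb').write(text.encode('utf-8'))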
Update after further information
All those \\ are confusing. It appears that "getting the bytes" involves repr(repr(bytes)) instead of just repr(bytes) ... at the interactive prompt, do either bytes (you'll get an implicit repr()), or print repr(bytes) (which won't get the implicit repr())
The blank space: I presume that you mean that '\xf8\xf8'.decode('gb18030') is what you interpret as some kind of full-width space, and that the interpretation is done by visual inspection using some unnameable viewer software. Is that correct?
Actually, '\xf8\xf8'.decode('gb18030') -> u'\ue28b'. U+E28B is in the Unicode PUA (Private Use Area). The "blank space" presumably means that the viewer software unsurprisingly doesn't have a glyph for U+E28B in the font it is using.
Perhaps the source of the files is deliberately using the PUA for characters that are not in standard gb18030, or for annotation, or for transmitting pseudosecret info. If so, you will need to resort to the decoding tambourine, an offshoot of recent Russian research reported here.
Alternative: the cp939-HKSCS theory. According to the HK government, HKSCS big5 code FE57 was once mapped to U+E28B but is now mapped to U+28804.
The "euro": You said """Due to the data I can't share the whole line, but what I was calling the euro char is in: \xcb\xbe\x80\x80" [I'm assuming a \ was omitted from the start of that, and the " is literal]. The "euro character", when it appears, is always in the same column that I don't need, so I was hoping to just use "ignore". Unfortunately, since the "euro char" is right next to quotes in the file, sometimes "ignore" gets rid of both the euro character as well [as] quotes, which poses a problem for the csv module to determine columns"""
It would help enormously if you could show the patterns of where these \x80 bytes appear in relation to the quotes and the Chinese characters -- keep it readable by just showing the hex, and hide your confidential data e.g. by using C1 C2 to represent "two bytes which I am sure represent a Chinese character". For example:
C1 C2 C1 C2 cb be 80 80 22 # `\x22` is the quote character
Please supply examples of (1) where the " is not lost by 'replace' or 'ignore' (2) where the quote is lost. In your sole example to date, the " is not lost:
>>> '\xcb\xbe\x80\x80\x22'.decode('gb18030', 'ignore')
u'\u53f8"'
And the offer to send you some debugging code (see example output below) is still open.
>>> import decode_debug as de
>>> def logger(s):
... sys.stderr.write('*** ' + s + '\n')
...
>>> import sys
>>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'replace', logger)
*** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
*** input[3:5] ('\x80"') doesn't start with a plausible code sequence
u'\u53f8\ufffd\ufffd"'
>>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'ignore', logger)
*** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
*** input[3:5] ('\x80"') doesn't start with a plausible code sequence
u'\u53f8"'
>>>
Eureka: -- Probable cause of sometimes losing the quote character --
It appears there is a bug in the gb18030 decoder replace/ignore mechanism: \x80 is not a valid gb18030 lead byte; when it is detected the decoder should attempt to resync with the NEXT byte. However it seems to be ignoring both the \x80 AND the following byte:
>>> '\x80abcd'.decode('gb18030', 'replace')
u'\ufffdbcd' # the 'a' is lost
>>> de.decode_debug('\x80abcd', 'gb18030', 'replace', logger)
*** input[0:4] ('\x80abc') doesn't start with a plausible code sequence
u'\ufffdabcd'
>>> '\x80\x80abcd'.decode('gb18030', 'replace')
u'\ufffdabcd' # the second '\x80' is lost
>>> de.decode_debug('\x80\x80abcd', 'gb18030', 'replace', logger)
*** input[0:4] ('\x80\x80ab') doesn't start with a plausible code sequence
*** input[1:5] ('\x80abc') doesn't start with a plausible code sequence
u'\ufffd\ufffdabcd'
>>>
You might try chardet.
Try this:
codecs.open(file, encoding='gb18030', errors='replace')
Don't forget the errors parameter; you can also set it to 'ignore'.
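For example (a sketch; 'data.csv' is a placeholder name):
import codecs
with codecs.open('data.csv', encoding='gb18030', errors='replace') as f:
    for line in f:
        print(repr(line))   # undecodable bytes show up as u'\ufffd'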

Python: getting \u00bd correctly in editor

I would like to do the following:
1) Serialize my class
2) Also manually edit the serialization dump file to remove certain objects of my class which I find unnecessary.
I am currently using Python with simplejson. As you know, simplejson converts all characters to unicode. As a result, when I dump a particular object with simplejson, the unicode characters become something like "\u00bd" for 好.
I am interested in manually editing the simplejson file for convenience. Does anyone here know a workaround for me to do this?
My requirements for this serialization format:
1) Easy to use (just dump and load - done)
2) Allows me to edit them manually without much hassle.
3) Able to display chinese character
I use vim. Does anyone know a way to convert "\u00bd" to 好 in vim?
I don't know anything about simplejson or the Serialisation part of the question, but you asked about converting "\u00bd" to 好 in Vim. Here are some vim tips for working with unicode:
You'll need the correct encoding set up in vim, see:
:help 'encoding'
:help 'fileencoding'
Entering unicode characters by number is simply a case of going into insert mode, pressing Ctrl-V and then typing u followed by the four digit number (or U followed by an 8-digit number). See:
:help i_CTRL-V_digit
Also bear in mind that in order for the character to display correctly in Vim, you'll need a fixed-width font containing that character. It appears as a wide space in Envy Code R and as various boxes in Lucida Console, Consolas and Courier New.
To replace \uXXXX with unicode character XXXX (where X is any hexadecimal digit), type this when in normal mode (where <ENTER> means press the ENTER key, don't type it literally):
:%s/\\u\x\{4\}/\=eval('"' . submatch(0) . '"')/g<ENTER>
Note however that u00bd appears to be unicode character ½ (1/2 in case that character doesn't display correctly on your screen), not the 好 character you mentioned (which is u597D I think). See this unicode table. Start vim and type these characters (where <Ctrl-V> is produced by holding CTRL, pressing V, releasing V and then releasing CTRL):
i<Ctrl-V>u00bd
You should see a small character looking like 1/2, assuming your font supports that character.
If you want json/simplejson to produce unicode output instead of str output with Unicode escapes, then you need to pass ensure_ascii=False to dump()/dumps(), then either encode before saving or use a file-like object from codecs.
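A minimal sketch of that (assuming simplejson is installed; the file name is a placeholder):
# -*- coding: utf-8 -*-
import codecs
import simplejson as json

data = {u'word': u'好'}
with codecs.open('dump.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(data, ensure_ascii=False))   # writes 好 itself, not \uXXXX escapes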
