convert Unicode to normal string [duplicate] - python

When I parse this XML with p = xml.parsers.expat.ParserCreate():
<name>Fortuna Düsseldorf</name>
The character parsing event handler includes u'\xfc'.
How can u'\xfc' be turned into u'ü'?
This is the main question in this post, the rest just shows further (ranting) thoughts about it
Isn't Python unicode broken since u'\xfc' shall yield u'ü' and nothing else?
u'\xfc' is already a unicode string, so converting it to unicode again doesn't work!
Converting it to ASCII as well doesn't work.
The only thing that I found works is: (This cannot be intended, right?)
exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')
Replacing 8859 with utf-8 fails! What is the point of that?
Also, what is the point of the Python Unicode HOWTO? It only gives examples of failures instead of showing the conversions that people (especially the hundreds who ask similar questions here) actually use in real-world practice.
Unicode is no magic - why do so many people here have issues?
The underlying problem of unicode conversion is dirt simple:
One bidirectional lookup table '\xFC' <-> u'ü'
unicode( 'Fortuna D\xfcsseldorf' )
What is the reason why the creators of Python think it is better to show an error instead of simply producing this: u'Fortuna Düsseldorf'?
Also, why did they make it non-reversible?
>>> u'Fortuna Düsseldorf'.encode('utf-8')
'Fortuna D\xc3\xbcsseldorf'
>>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
u'Fortuna D\xfcsseldorf'

You already have the value. Python simply tries to make debugging easier by giving you a representation that is ASCII friendly. Echoing values in the interpreter gives you the result of calling repr() on the result.
In other words, you are confusing the representation of the value with the value itself. The representation is designed to be safely copied and pasted around, without worry about how other systems might handle non-ASCII codepoints. As such the Python string literal syntax is used, with any non-printable and non-ASCII characters replaced by \xhh and \uhhhh escape sequences. Pasting those strings back into a Python string or interactive Python session will reproduce the exact same value.
As such ü has been replaced by \xfc, because 0xFC is the Unicode codepoint of that character: U+00FC LATIN SMALL LETTER U WITH DIAERESIS.
If your terminal is configured correctly, you can just use print and Python will encode the Unicode value to your terminal codec, resulting in your terminal display giving you the non-ASCII glyphs:
>>> u'Fortuna Düsseldorf'
u'Fortuna D\xfcsseldorf'
>>> print u'Fortuna Düsseldorf'
Fortuna Düsseldorf
If your terminal is configured for UTF-8, you can also write the UTF-8 bytes directly to your terminal, after encoding explicitly:
>>> u'Fortuna Düsseldorf'.encode('utf8')
'Fortuna D\xc3\xbcsseldorf'
>>> print u'Fortuna Düsseldorf'.encode('utf8')
Fortuna Düsseldorf
The alternative is for you to upgrade to Python 3; there repr() only uses escape sequences for codepoints that have no printable glyphs (control codes, reserved codepoints, surrogates, etc.; if the codepoint is not a space but falls in a C* or Z* general category, it is escaped). The new ascii() function still gives you the Python 2 repr() behaviour.
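To make the value-versus-representation point concrete, here is a minimal sketch in Python 3 syntax (an assumption on my part, since the sessions above are Python 2). The variable names are only illustrative.

```python
# Python 3 sketch: the escape sequence and the glyph are the same value.
name = 'Fortuna D\xfcsseldorf'

# Both spellings of the literal produce the identical string.
same_value = (name == 'Fortuna Düsseldorf')

# repr() in Python 3 keeps printable non-ASCII characters as they are,
# while ascii() escapes them the way Python 2's repr() did.
shown_by_repr = repr(name)
shown_by_ascii = ascii(name)

# Encoding to UTF-8 and decoding again gives back the original text.
round_trip = name.encode('utf-8').decode('utf-8')

print(same_value)            # True
print(shown_by_ascii)        # 'Fortuna D\xfcsseldorf'
print(round_trip == name)    # True
```

So the conversion is reversible; only the echoed representation differs from what print shows.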

Related

python u'\u00b0' returns u'\xb0'. Why?

I use python 2.7.10.
On dealing with character encoding, and after reading a lot of stack-overflow etc. etc. on the subject, I encountered this behaviour which looks strange to me. Python interpreter input
>>> u'\u00b0'
results in the following output:
u'\xb0'
I could repeat this behaviour using a dos window, the idle console, and the wing-ide python shell.
My assumptions (correct me if I am wrong):
The "degree symbol" has unicode 0x00b0, utf-8 code 0xc2b0, latin-1 code 0xb0.
The Python docs say that a string literal with a u prefix is a Unicode string.
Question: Why is the result converted to a unicode-string-literal with a byte-escape-sequence which matches the latin-1 encoding, instead of persisting the unicode escape sequence ?
Thanks in advance for any help.
Python uses some rules for determining what to output from repr for each character. The rule for Unicode character codepoints in the 0x0080 to 0x00ff range is to use the sequence \xdd where dd is the hex code, at least in Python 2. There's no way to change it. In Python 3, all printable characters will be displayed without converting to a hex code.
As for why it looks like Latin-1 encoding, it's because Unicode started with Latin-1 as the base. All the codepoints up to 0xff match their Latin-1 counterpart.
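A quick way to check the Latin-1 overlap yourself is sketched below, written for Python 3 (an assumption, since the question uses 2.7.10). It also shows that the UTF-8 bytes for the degree sign are a different matter from its codepoint.

```python
# Python 3 sketch: the first 256 Unicode codepoints mirror Latin-1.
all_bytes = bytes(range(256))
decoded = all_bytes.decode('latin-1')

# Each byte value maps to the codepoint with the same number.
matches = all(ord(ch) == byte for ch, byte in zip(decoded, all_bytes))

# '\u00b0' and '\xb0' are two spellings of the same one-character string.
degree_same = ('\u00b0' == '\xb0')

# The encoded forms differ: one byte in Latin-1, two bytes in UTF-8.
degree_latin1 = '\u00b0'.encode('latin-1')
degree_utf8 = '\u00b0'.encode('utf-8')

print(matches, degree_same)        # True True
print(degree_latin1, degree_utf8)  # b'\xb0' b'\xc2\xb0'
```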

Reading unicode characters from file/sqlite database and using it in Python

I have a list of variables with unicode characters, some of them for chemicals like ozone gas, e.g. 'O\u2083'. All of them are stored in a sqlite database which is read by Python code to produce O₃. However, when I read it I get 'O\\u2083'. The sqlite database is created from a CSV file that contains the string 'O\u2083' among others. I understand that \u2083 is not being stored in the sqlite database as a unicode character but as 6 unicode characters (which would be \,u,2,0,8,3). Is there any way to recognize unicode characters in this context? For now, my first option to solve it is to create a function that recognizes such sets of characters and replaces them with unicode characters. Is there anything like this already implemented?
SQLite allows you to read/write Unicode text directly. u'O\u2083' is two characters u'O' and u'\u2083' (your question has a typo: 'u\2083' != '\u2083').
I understand that u\2083 is not being stored in sqlite database as unicode character but as 6 unicode characters (which would be u,\,2,0,8,3)
Don't confuse u'u\2083' and u'\u2083': the latter is a single character while the former is 4-character sequence: u'u', u'\x10' ('\20' is interpreted as octal in Python), u'8', u'3'.
If you save a single Unicode character u'\u2083' into a SQLite database; it is stored as a single Unicode character (the internal representation of Unicode inside the database is irrelevant as long as the abstraction holds).
On Python 2, if there is no from __future__ import unicode_literals at the top of the module then 'abc' string literal creates a bytestring instead of a Unicode string -- in that case both 'u\2083' and '\u2083' are sequences of bytes, not text characters (\uxxxx is not recognized as a unicode escape sequence inside bytestrings).
If you have a byte string (length 7), decode the Unicode escape.
>>> s = 'O\u2083'
>>> len(s)
7
>>> s
'O\\u2083'
>>> print(s)
O\u2083
>>> u = s.decode('unicode-escape')
>>> len(u)
2
>>> u
u'O\u2083'
>>> print(u)
O₃
Caveat: Your console/IDE used to print the character needs to use an encoding that supports the character or you'll get a UnicodeEncodeError when printing. The font must support the symbol as well.
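On Python 3 the str type has no decode method, so the same idea needs one extra step. A minimal sketch, assuming the stored text is pure ASCII apart from the backslash escapes (the variable names are illustrative):

```python
# Python 3 sketch: turn literal backslash escapes into real characters.
raw = 'O\\u2083'          # seven characters: O, \, u, 2, 0, 8, 3

# unicode_escape decodes bytes, so go through ASCII bytes first.
# encode('ascii') raises UnicodeEncodeError if raw has non-ASCII text.
text = raw.encode('ascii').decode('unicode_escape')

print(len(raw), len(text))   # 7 2
print(ascii(text))           # 'O\u2083'
```

If the data can also contain real non-ASCII characters, this shortcut is not safe as written and the escapes would have to be replaced more selectively.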
It's important to remember everything is bytes. To pull bytes into something useful to you, you kind of have to know what encoding is used when you pull in data. There are too many ambiguous cases to determine encoding by analyzing the data. When you send data out of your program, it's all back out to bytes again. Depending on whether you're using Python 2.x or 3.x you'll have a very different experience with Unicode and Python.
You can, however, attempt the encoding and simply use "replace" for errors. For example the_string.encode("utf-8","replace") will try to encode as utf-8 and will replace problems with a ? -- You could also anticipate problem characters and replace them beforehand, but that gets unmanageable quickly. Take a look at the codecs module for more replacement options.

Python Unicode Bug

I'm making a virtual machine in RPython using PyPy. The problem is, when I tried to add unicode support I found an unusual problem. I'll use the letter "á" in my examples.
# The char in the example is á
print len(char)
OUTPUT:
2
I understand how the letter "á" takes two bytes, hence the length of 2. But the problem is when I use this example below I am faced with the problem.
# In this example instr = "á" (including the quotes)
for char in instr:
    print hex(int(ord(char)))
OUTPUT:
0x22
0xc3
0xa1
0x22
As you can see, there are 4 numbers. The two 0x22 values are for the quotes, but there is only 1 letter in between the quotes and yet there are two numbers for it. My question is about the fact that some machines I tested this script on produced this output instead:
OUTPUT:
0x22
0xe1
0x22
Is there any way to make the output the same on both machines? The script is exactly the same on each.
The program is not being given the same input on the two machines:
In [154]: '\xe1'.decode('cp1252').encode('utf_8') == '\xc3\xa1'
Out[154]: True
When you type á in a console, you may see the glyph á, but the console is translating that into bytes. The particular bytes it translates that into depends on the encoding used by the console. On a Windows machine, that may be cp1252, while on a Unix machine it is likely to be utf-8.
So you may see the input as the same, but the console (and thus the program) receives different input.
If your program were to decode the bytes with the appropriate encoding, and then work with unicode, then both programs will operate the same after that point. If you are receiving the bytes from sys.stdin, then sys.stdin.encoding will be the encoding Python detects the console is using.
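As a rough illustration of that last point, here is a Python 3 sketch with the two byte sequences from the question hard-coded. The encodings cp1252 and utf-8 are the assumed console encodings from above; a real program would take the codec from sys.stdin.encoding or from configuration.

```python
# Python 3 sketch: same character, different bytes per console encoding.
from_windows = b'\xe1'        # "á" as sent by a cp1252 console
from_unix = b'\xc3\xa1'       # "á" as sent by a UTF-8 console

# Decode each input with the encoding that produced it.
text_a = from_windows.decode('cp1252')
text_b = from_unix.decode('utf-8')

# After decoding, both machines hold the same one-character string.
print(text_a == text_b)               # True
print([hex(ord(c)) for c in text_a])  # ['0xe1']
```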
You have this question tagged "Python-3.x" -- is it possible that some machines are running Python 2.x, and others are running Python 3.x?
The character á is in fact U+00E1, so on a Python 3.x system, I would expect to see your second output. Since strings are Unicode in Python3 by default, len(char) will be 3 (including the quotes).
In Python 2.x, that same character in a string will be two bytes long, and (depending on your input method) will be represented in UTF-8 as \xc3\xa1. On that system, len(char) will be 4, and you would see your first output.
The issue is that you use bytestrings to work with text data. You should use Unicode instead.
It implies that you need to know the character encoding of your input data -- There Ain't No Such Thing As Plain Text.
If you know the character encoding then it is easy to convert a bytestring to Unicode e.g.:
unicode_text = bytestring.decode(encoding)
It should resolve your initial issue.
There are also Unicode normalization forms e.g.:
import unicodedata
norm_text = unicodedata.normalize('NFC', unicode_text)
If I don't change the encoding in the program how can I output unicode characters for example?
You might mean that you have a sequence of bytes e.g., '\xc3\xa1' (two bytes) that can be interpreted as text using some character encoding e.g., it is the U+00E1 Unicode codepoint in utf-8. It may be something different in a different character encoding. Please read the article linked above, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
Unless by accident your terminal uses the same character encoding as data in your input file; you need to be able to convert from one character encoding to another. Otherwise the output will be corrupted e.g., instead of á you might get ├б on the screen.
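The corrupted output mentioned above can be reproduced with a short Python 3 sketch. The codec cp866 is my guess at the mismatched terminal encoding, chosen because it maps these two bytes to exactly those two characters.

```python
# Python 3 sketch: reproduce the mojibake by decoding with the wrong codec.
data = 'á'.encode('utf-8')            # b'\xc3\xa1'

wrong = data.decode('cp866')          # assumed (mismatched) terminal encoding
right = data.decode('utf-8')          # the encoding the data really uses

print(ascii(wrong))   # '\u251c\u0431'
print(ascii(right))   # '\xe1'
```

The bytes never change; only the codec used to interpret them does.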
In ordinary Python, you could use bytes.decode, unicode.encode methods (or codecs module directly). I don't know whether it is possible in RPython.

Python - The Standard Library - ascii( ) Function

I have begun to look through the Python Standard Library (http://docs.python.org/3/library/functions.html) in an attempt to further familiarise myself with basic Python. When it comes to the explanation of the ascii( ) function, I'm not finding it clear.
Is someone able to supply a concise explanation giving examples of useful situations in which one may use the ascii( ) function please?
ascii() is a function that encodes the output of repr() to use escape sequences for any codepoint in the output produced by repr() that is not within the ASCII range.
So a Latin 1 codepoint like ë is represented by the Python escape sequence \xeb instead.
This was the standard representation in Python 2; Python 3 repr() leaves most Unicode codepoints as their actual value in the output, as long as it is a printable character:
>>> print(repr('ë'))
'ë'
>>> print(ascii('ë'))
'\xeb'
Both outputs are valid Python string literals, but the latter uses just ASCII characters, while the former requires a Unicode-compatible encoding.
For unicode codepoints between U+0100 and U+FFFF \uxxxx escape code sequences are used, for anything over that the \Uxxxxxxxx form is used. See the available escape code syntax for Python string literals.
Like repr(), ascii() is a very helpful debugging tool, especially when it comes to exact contents of a string. Unlike repr(), the ascii() output makes many Unicode gotchas much more visible.
Take de-normalised codepoints for example; The ë character can be represented in two ways, as the U+00EB codepoint, or as an ASCII e plus combining diaeresis ¨ (codepoint U+0308):
>>> import unicodedata
>>> one, two = 'ë', unicodedata.normalize('NFD', 'ë')
>>> print(one, two)
ë ë
>>> print(repr(one), repr(two))
'ë' 'ë'
>>> print(ascii(one), ascii(two))
'\xeb' 'e\u0308'
Only with ascii() is it clear that two consists of two distinct codepoints.
ascii() can be useful for finding out exactly what is in a string. If a string has whitespace or unprintable characters, or if the terminal is turning the string into mojibake because of a character-encoding mismatch, it is useful to look at the ascii representation of the string since it provides a visible and unambiguous representation for those otherwise unreadable characters which will print the same way on everyone's terminals.
There are frequent questions on Stackoverflow regarding incorrectly printed strings, and sometimes it is hard to tell what's going on because the question only shows the mojibake and not an unambiguous representation of the string. When the questioner shows the ascii representation (or the repr in Python 2) then the situation becomes much clearer.
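A small sketch of that debugging use: two strings that look alike when printed but differ in invisible characters. The strings are made up for illustration.

```python
# Python 3 sketch: ascii() exposes characters that print invisibly.
plain = 'price: 10'
tricky = 'price:\u00a010\u200b'   # no-break space and zero-width space

# The two strings look alike on most terminals but are not equal.
print(plain == tricky)   # False

# ascii() spells out the characters that cause the mismatch.
print(ascii(plain))      # 'price: 10'
print(ascii(tricky))     # 'price:\xa010\u200b'
```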

Python - Unicode to ASCII conversion

I am unable to convert the following Unicode to ASCII without losing data:
u'ABRA\xc3O JOS\xc9'
I tried encode and decode and they won’t do it.
Does anyone have a suggestion?
The Unicode characters u'\xc3' and u'\xc9' do not have any corresponding ASCII values. So, if you don't want to lose data, you have to encode that data in some way that's valid as ASCII. Options include:
>>> print s.encode('ascii', errors='backslashreplace')
ABRA\xc3O JOS\xc9
>>> print s.encode('ascii', errors='xmlcharrefreplace')
ABRA&#195;O JOS&#201;
>>> print s.encode('unicode-escape')
ABRA\xc3O JOS\xc9
>>> print s.encode('punycode')
ABRAO JOS-jta5e
All of these are ASCII strings, and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them are all that pretty for an end-user (and none of them can be reversed just by decode('ascii')).
See str.encode, Python Specific Encodings, and Unicode HOWTO for more info.
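For what it's worth, here is a Python 3 sketch of how three of these ASCII forms can be reversed. Pairing each encoding with unicode_escape, html.unescape, and the punycode codec is my own choice of matching decoder, not something stated above.

```python
import html

s = 'ABRA\xc3O JOS\xc9'

# Encode to ASCII-only bytes in three lossless ways.
escaped = s.encode('ascii', errors='backslashreplace')
xml_refs = s.encode('ascii', errors='xmlcharrefreplace')
puny = s.encode('punycode')

# Each form needs its own matching decoder to get the text back.
from_escaped = escaped.decode('unicode_escape')
from_xml = html.unescape(xml_refs.decode('ascii'))
from_puny = puny.decode('punycode')

print(escaped)    # b'ABRA\\xc3O JOS\\xc9'
print(xml_refs)   # b'ABRA&#195;O JOS&#201;'
print(from_escaped == from_xml == from_puny == s)  # True
```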
As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:
>>> s.encode('utf-8')
'ABRA\xc3\x83O JOS\xc3\x89'
>>> s.encode('cp1252')
'ABRA\xc3O JOS\xc9'
>>> s.encode('iso-8859-15')
'ABRA\xc3O JOS\xc9'
The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes it, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is, say, the open function or a web browser that you're serving a page to or something else, things are more complicated, and there's no easy answer without a lot more information.
I found the Unidecode library (https://pypi.org/project/Unidecode/) very useful:
>>> from unidecode import unidecode
>>> unidecode('ko\u017eu\u0161\u010dek')
'kozuscek'
>>> unidecode('30 \U0001d5c4\U0001d5c6/\U0001d5c1')
'30 km/h'
>>> unidecode('\u5317\u4EB0')
'Bei Jing '
I needed to calculate the MD5 hash of a unicode string received in HTTP request. MD5 was giving UnicodeEncodeError and python built-in encoding methods didn't work because it replaces the characters in the string with corresponding hex values for the characters thus changing the MD5 hash.
So I came up with the following code, which keeps the string intact while converting from unicode.
unicode_string = ''.join([chr(ord(x)) for x in unicode_string]).strip()
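This strips the unicode type from the string, but it is equivalent to unicode_string.encode('latin-1'): it keeps the data intact only when every codepoint is below U+0100, and chr() raises ValueError for anything higher.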
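A more general approach is to encode the text explicitly before hashing. The sketch below is Python 3 and assumes UTF-8 is the encoding the other side hashes with; that is an assumption, and whichever encoding is chosen must match on both ends.

```python
import hashlib

text = 'ABRA\xc3O JOS\xc9'

# Hash functions work on bytes, so pick an encoding explicitly.
# UTF-8 is an assumption here; both sides must use the same one.
digest_utf8 = hashlib.md5(text.encode('utf-8')).hexdigest()
digest_latin1 = hashlib.md5(text.encode('latin-1')).hexdigest()

# Different encodings give different bytes, hence different digests.
print(digest_utf8 != digest_latin1)   # True
print(len(digest_utf8))               # 32
```

Unlike the chr(ord(x)) trick, encoding to UTF-8 also works for codepoints above U+00FF.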
