Python Character Encoding

I have a Python script that retrieves information from a web service and then looks up data in a MySQL db. The data is Unicode when I receive it, but I want the SQL statement to use the actual character (Băcioi in the example below). As you can see, when I try to encode it to UTF-8 the result is still not what I'm looking for.
>>> x = u'B\u0103cioi'
>>> x
u'B\u0103cioi'
>>> x.encode('utf-8')
'B\xc4\x83cioi'
>>> print x
Băcioi ## << What I want!

Your encoding is working fine. Python is simply showing you the repr()'d version of the string at the interactive prompt, which uses \x escapes. You can tell because it also displays the quotes around the string.
print does not mutate the string in any way: if it prints the character you want, that is what is actually in the string's contents.
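You can see both behaviors side by side (assuming a terminal that renders UTF-8):
>>> x = u'B\u0103cioi'
>>> x.encode('utf-8')         # the repr() of the bytes, with \x escapes and quotes
'B\xc4\x83cioi'
>>> print x.encode('utf-8')   # the bytes themselves, rendered by the terminal
Băcioi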

Related

Python: Correct Way to refer to index of unicode string

Not sure if this is exactly the problem, but I'm trying to insert a tag at the first letter of a unicode string, and it seems this is not working. Could this be because unicode indices work differently than those of regular strings?
Right now my code is this:
for index, paragraph in enumerate(intro[2:-2]):
    intro[index] = bold_letters(paragraph, 1)

def bold_letters(string, index):
    return "<b>" + string[0] + "</b>" + string[index:]
And I'm getting output like this:
<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים.
It seems the unicode gets messed up when I try to insert the HTML tag. I tried messing with the insert position but didn't make any progress.
Example desired output (Hebrew goes right to left):
>>> first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"
BTW, this is for Python 2
You are right: indices work over each byte when you are dealing with raw bytes, i.e. str in Python 2.x.
To work seamlessly with Unicode data, you first need to let Python 2.x know that you are dealing with Unicode, then do the string manipulation. You can finally convert it back to raw bytes to keep the behavior abstracted, i.e. you take a str and you return a str.
Ideally you should convert all the data from raw UTF-8 bytes to Unicode objects at the very beginning of your code (I am assuming your source encoding is UTF-8, because that is the standard most applications use these days), and convert back to raw bytes at the very end, e.g. when saving to the DB or responding to a client. Some frameworks handle that for you so that you don't have to worry.
def bold_letters(string, index):
    string = string.decode('utf8')
    string = "<b>" + string[0] + "</b>" + string[index:]
    return string.encode('utf8')
This will also work for ASCII because UTF-8 is a superset of ASCII. You can get a better understanding of how Unicode works, and in Python specifically, by reading http://nedbatchelder.com/text/unipain.html
In Python 3.x, str is already a Unicode object, so you don't have to do anything explicitly.
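For instance, on Python 3 the function from the question works unchanged (a minimal sketch):
def bold_letters(string, index):
    # str is already Unicode in Python 3, so indexing is per character
    return "<b>" + string[0] + "</b>" + string[index:]

print(bold_letters("הקדמה", 1))  # <b>ה</b>קדמה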
You should use Unicode strings. Byte strings in UTF-8 use a variable number of bytes per character, while Unicode strings use one code unit per character (at least for characters in the BMP on Python 2, i.e. the first 65536):
# coding: utf8
import io

s = u"הקדמה"
t = u'<b>' + s[0] + u'</b>' + s[1:]
print(t)
with io.open('out.htm', 'w', encoding='utf-8-sig') as f:  # io.open accepts encoding on Python 2
    f.write(t)
Output:
<b>ה</b>קדמה
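As a quick check of the variable-width point:
>>> s = u"הקדמה"
>>> len(s)                   # 5 characters
5
>>> len(s.encode('utf-8'))   # 10 bytes: each Hebrew letter takes two bytes in UTF-8
10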
But my Chrome browser displays out.htm as: [screenshot omitted]

Lxml trying to extract data with windows-1250 characters

Hello, I am experimenting with Python and lxml, and I am stuck on the problem of extracting data from a webpage that contains windows-1250 characters like ž and ć.
tree = html.fromstring(new.text, parser=hparser)
title = tree.xpath('//strong[text()="Title"]')
opis[g] = opis[g].tail.encode('utf-8')[2:]
I get text responses containing something like this:
\xc2\x9ea
instead of the characters. Then I have a problem storing this into the database.
So how can I accomplish this? I tried putting 'windows-1250' instead of utf-8, without success. Can I somehow convert this back to the original characters?
Try:
text = "\xc2\x9ea"
print text.decode('windows-1250').encode('utf-8')
Output:
ža
And save the proper characters in your DB.
If encoding to UTF-8 results in b'\xc2\x9ea', then that means the original string was '\x9ea'. Whether lxml didn't do things correctly, or something happened on your end (perhaps a parser configuration issue), the fact is that you get the equivalent of this (Python 3.x syntax):
>>> '\x9ea'.encode('utf-8')
b'\xc2\x9ea'
How do you fix it? One error-prone way would be to encode as something other than UTF-8 that can properly handle the characters. It's error-prone because while something might work in one case, it might not in another. Instead, you could extract the character ordinals and work with those:
>>> list(map((lambda n: hex(n)[2:]), map(ord, '\x9ea')))
['9e', '61']
That gets us somewhere because the bytes type has a fromhex method that can decode a string containing hexadecimal values to the equivalent byte values:
>>> bytes.fromhex(''.join(map((lambda n: hex(n)[2:]), map(ord, '\x9ea'))))
b'\x9ea'
You can use decode('cp1250') on the result of that to get ža, which I believe is the string you wanted. If you are using Python 2.x, the equivalent would be
from binascii import unhexlify
unhexlify(u''.join(map((lambda n: hex(n)[2:]), map(ord, u'\x9ea'))))
Note that this is highly destructive as it forces all characters in a Unicode string to be interpreted as bytes. For this reason, it should only be used on strings containing Unicode characters that fit in a single byte. If you had something like '\x9e\u724b\x61', that code would result in joining ['9e', '724b', '61'] as '9e724b61', and interpreting that using a single-byte character set such as CP1250 would result in something like 'žrKa'.
For that reason, more reliable code would replace ord with a function that throws an exception if 0 <= ord(ch) < 0x100 is false, but I'll leave that for you to code.
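One possible sketch of that guard (hypothetical safe_ord name, Python 3.x syntax; '%02x' also zero-pads ordinals below 0x10, which hex(n)[2:] would not):
def safe_ord(ch):
    # Only characters that fit in a single byte may be reinterpreted as bytes.
    n = ord(ch)
    if not 0 <= n < 0x100:
        raise ValueError('%r does not fit in one byte' % ch)
    return n

raw = bytes.fromhex(''.join('%02x' % safe_ord(c) for c in '\x9ea'))
print(raw.decode('cp1250'))  # ža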

Character Encoding, XML, Excel, python

I am reading a list of strings that were imported into an Excel XML file from another software program. I am not sure what the encoding of the Excel file is, but I am pretty sure it's not windows-1252, because when I try to use that encoding I wind up with a lot of errors.
The specific word that is causing me trouble right now is: "Zmysłowska, Magdalena" (notice the "l" is not a standard "l", but rather, has a slash through it).
I have tried a few things; I'll mention three of them here:
(1)
page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
page = page.encode("utf-8", "ignore")
Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: Zmysłowska, Magdalena
(2)
page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
Output: Zmys\u0142owska, Magdalena
Output after print statement: Zmysłowska, Magdalena
Note: this is great, but I need to encode it back to utf-8 before putting the string into my db. When I do that, by running page.encode("utf-8", "ignore"), I end up with Zmysłowska, Magdalena again.
(3)
Do nothing (no normalization, no decode, no encode). It seems like the string is already utf-8 when it comes in. However, when I do nothing, the string ends up with the following output again:
Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: Zmysłowska, Magdalena
Is there a way for me to convert this string to utf-8?
Your problem isn't your encoding and decoding. Your code correctly takes a UTF-8 string, and converts it to an NFKD-normalized UTF-8 string. (You might want to use page.decode("utf-8") instead of unicode(page, "utf-8") just for future-proofing in case you ever go to Python 3, and to make the code a bit easier to read because the encode and decode are more obviously parallel, but you don't have to; the two are equivalent.)
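A quick check of that equivalence (using a sample byte string):
>>> page = 'Zmys\xc5\x82owska'
>>> page.decode("utf-8") == unicode(page, "utf-8")
True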
Your actual problem is that you're printing UTF-8 strings to some context that isn't UTF-8. Most likely you're printing to the cmd window, which defaults to Windows-1252. So cmd tries to interpret the UTF-8 bytes as Windows-1252 and gets garbage.
There's a pretty easy way to test this. Make Python decode the UTF-8 string as if it were Windows-1252 and see if the resulting Unicode string looks like what you're seeing.
>>> print page.decode('windows-1252')
Zmysłowska, Magdalena
>>> print repr(page.decode('windows-1252'))
u'Zmys\xc5\u201aowska, Magdalena'
There are two ways around this:
Print Unicode strings and let Python take care of it.
Print strings converted to the appropriate encoding.
For option 1:
print page.decode("utf-8") # of unicode(page, "utf-8")
For option 2, it's going to be one of the following:
print page.decode("utf-8").encode("windows-1252")
print page.decode("utf-8").encode(sys.getdefaultencoding())
Of course if you keep the intermediate Unicode string around, you don't need all those decode calls:
upage = page.decode("utf-8")
upage = unicodedata.normalize("NFKD", upage)
page = upage.encode("utf-8", "ignore")
print upage

Python UTF-8 conversion problem

In my database, I have stored some UTF-8 characters. E.g. 'α' in the "name" field
Via Django ORM, when I read this out, I get something like
>>> p.name
u'\xce\xb1'
>>> print p.name
α
I was hoping for 'α'.
After some digging, I think if I did
>>> a = 'α'
>>> a
'\xce\xb1'
So when Python is trying to display '\xce\xb1' I get alpha, but when it's trying to display u'\xce\xb1', it's double encoding?
Why did I get u'\xce\xb1' in the first place? Is there a way I can just get back '\xce\xb1'?
Thanks. My UTF-8 and unicode handling knowledge really need some help...
Try putting the Unicode prefix u before your string, e.g. u'YOUR_ALFA_CHAR', and check your database encoding, because Django always supports UTF-8.
What you seem to have is the individual bytes of a UTF-8 encoded string interpreted as unicode codepoints. You can "decode" your string out of this strange form with:
p.name = ''.join(chr(ord(x)) for x in p.name)
or perhaps
p.name = ''.join(chr(ord(x)) for x in p.name).decode('utf8')
One way to get your strings "encoded" into this form is
''.join(unichr(ord(x)) for x in '\xce\xb1')
although I have a feeling your strings actually got in this state by different components of your system disagreeing on the encoding in use.
You will probably have to fix the source of your bad "encoding" rather than just fixing the data currently in your database. And the code above might be okay to convert your bad data once, but I would advise you don't insert this code into your Django app.
The problem is that p.name was not correctly stored and/or read in from the database.
Unicode small alpha is U+03B1, so p.name should have printed as u'\u03b1' (or, on a Unicode-capable terminal, the actual alpha symbol itself in quotes). Note the difference between u'\xce\xb1' and u'\u03b1': the former is a two-character string, the latter a single-character string. The bytes \xce\xb1 are in fact the UTF-8 encoding of U+03B1; somewhere along the way they were decoded one byte per character (e.g. as Latin-1) instead of as UTF-8.
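A minimal demonstration of how a string ends up in that state (Python 2):
>>> u'\u03b1'.encode('utf-8')                    # the UTF-8 bytes for alpha
'\xce\xb1'
>>> u'\u03b1'.encode('utf-8').decode('latin-1')  # mis-decoded one byte per character
u'\xce\xb1'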
You can turn any byte sequence into internal unicode representation through the decode function:
print '\xce\xb1'.decode('utf-8')
This allows you to import a byte sequence from any source and then turn it into a Python unicode string.
Reference: http://docs.python.org/library/stdtypes.html#string-methods
Try converting the encoding with p.name.encode('latin-1'). Here's a demonstration:
>>> print u'\xce\xb1'
α
>>> print u'\xce\xb1'.encode('latin-1')
α
>>> print '\xce\xb1'
α
>>> '\xce\xb1' == u'\xce\xb1'.encode('latin1')
True
For more information, see str.encode and Standard Encodings.

Latin-1 and the unicode factory in Python

I have a Python 2.6 script that is gagging on special characters, encoded in Latin-1, that I am retrieving from a SQL Server database. I would like to print these characters, but I'm somewhat limited because I am using a library that calls the unicode factory, and I don't know how to make Python use a codec other than ascii.
The script is a simple tool to return lookup data from a database without having to execute the SQL directly in a SQL editor. I use the PrettyTable 0.5 library to display the results.
The core of the script is this bit of code. The tuples I get from the cursor contain integer and string data, and no Unicode data. (I'd use adodbapi instead of pyodbc, which would get me Unicode, but adodbapi gives me other problems.)
x = pyodbc.connect(cxnstring)
r = x.cursor()
r.execute(sql)

t = PrettyTable(columns)
for rec in r:
    t.add_row(rec)
r.close()
x.close()

t.set_field_align("ID", 'r')
t.set_field_align("Name", 'l')
print t
But the Name column can contain characters that fall outside the ASCII range. I'll sometimes get an error message like this, in line 222 of prettytable.pyc, when it gets to the t.add_row call:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 12: ordinal not in range(128)
This is line 222 in prettytable.py. It uses unicode, which is the source of my problems, and not just in this script, but in other Python scripts that I have written.
for i in range(0, len(row)):
    if len(unicode(row[i])) > self.widths[i]:   # This is line 222
        self.widths[i] = len(unicode(row[i]))
Please tell me what I'm doing wrong here. How can I make unicode work without hacking prettytable.py or any of the other libraries that I use? Is there even a way to do this?
EDIT: The error occurs not at the print statement, but at the t.add_row call.
EDIT: With Bastien Léonard's help, I came up with the following solution. It's not a panacea, but it works.
x = pyodbc.connect(cxnstring)
r = x.cursor()
r.execute(sql)

t = PrettyTable(columns)
for rec in r:
    urec = [s.decode('latin-1') if isinstance(s, str) else s for s in rec]
    t.add_row(urec)
r.close()
x.close()

t.set_field_align("ID", 'r')
t.set_field_align("Name", 'l')
print t.get_string().encode('latin-1')
I ended up having to decode on the way in and encode on the way out. All of this makes me hopeful that everybody ports their libraries to Python 3.x sooner rather than later!
Add this at the beginning of the module:
# coding: latin1
Or decode the string to Unicode yourself.
[Edit]
It's been a while since I played with Unicode, but hopefully this example will show how to convert from Latin1 to Unicode:
>>> s = u'ééé'.encode('latin1') # a string you may get from the database
>>> s.decode('latin1')
u'\xe9\xe9\xe9'
[Edit]
Documentation:
http://docs.python.org/howto/unicode.html
http://docs.python.org/library/codecs.html
Maybe try to decode the latin1-encoded strings into unicode?
t.add_row((value.decode('latin1') for value in rec))
After a quick peek at the source for PrettyTable, it appears that it works on unicode objects internally (see _stringify_row, add_row and add_column, for example). Since it doesn't know what encoding your input strings are using, it uses the default encoding, usually ascii.
Now ascii is a subset of latin-1, which means if you're converting from ascii to latin-1, you shouldn't have any problems. The reverse however, isn't true; not all latin-1 characters map to ascii characters. To demonstrate this:
>>> s = '\xed\x31\x32\x33'
>>> print unicode(s)
# FAILS: unicode() calls s.decode('ascii'), but the ascii codec can't decode '\xed'
>>> print s.decode('ascii')
# FAILS: same as above
>>> print s.decode('latin-1')
í123
Explicitly converting the strings to unicode (like you eventually did) fixes things, and makes more sense, IMO: you're more likely to know what charset your data is using than the author of PrettyTable :). BTW, unicode(s, 'latin-1') is equivalent to s.decode('latin-1') for byte strings, but you still need the isinstance check in your list comprehension, because passing a non-string value (such as an integer ID) to unicode() with an explicit encoding raises a TypeError.
One last thing: don't forget to check the character set of your database and tables -- you don't want to assume 'latin-1' in code, when the data is actually being stored as something else ('utf-8'?) in the database. In MySQL, you can use the SHOW CREATE TABLE <table_name> command to find out what character set a table is using, and SHOW CREATE DATABASE <db_name> to do the same for a database.
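For example (hypothetical table and database names, via any DB-API cursor):
cur.execute("SHOW CREATE TABLE people")     # 'people' is a placeholder table name
print(cur.fetchone()[1])                    # look for DEFAULT CHARSET=... in the DDL
cur.execute("SHOW CREATE DATABASE mydb")    # 'mydb' is a placeholder database name
print(cur.fetchone()[1])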
