Unicode latin1 string encode / decode - python

While fetching data from an unknown/old/non-consistent Mysql database to a Postgres utf-8 db using Python (Django) ORM I have sometimes faulty encoded data as a result.
Target: grégory
> a
u'gr\xe3\xa9gory'
> print a
grã©gory
I tried several decode/encode tricks without success:
> print a.encode('utf-8').decode('latin1')
grã©gory
> print a.encode('utf-8').decode('latin1')
grã©gory
> print a.decode('latin-1')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)
Even with some
unicode_escape

I guess the string has been incorrectly converted to lowercase at some point, changing \xc3 to \xe3. The lowercase conversion has assumed latin1 encoding when it was actually utf-8.
>>> print 'gr\xc3\xa9gory'.decode('utf8')
grégory

Since the problem was the lower(), I could fix it doing:
print a.upper().encode('latin1').lower()

Try this:
print a.decode('latin1')

Related

How to solve garbled characters starting with "\u0e" in the Robot Framework log [duplicate]

I'm sanitizing a pandas dataframe and encounters unicode string that has a u inside it with a backslash than I need to replace e.g.
u'\u2014'.replace('\u','')
Result: u'\u2014'
I've tried encoding it as utf-8 then decoding it but that didn't work and I feel there must be an easier way around this.
pandas code
merged['Rank World Bank'] = merged['Rank World Bank'].astype(str)
Error
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 0: ordinal not in range(128)
u'\u2014' is actually -. It's not a number. It's a utf-8 character. Try using print keyword to print it . You will know
This is the output in ipython:
In [4]: print("val = ", u'\u2014')
val = —
Based on your comment, here is what you are doing wrong
"-" is not same as "EM Dash" Unicode character(u'\u2014')
So, you should do the following
print(u'\u2014'.replace("\u2014",""))
and that will work
EDIT:
since you are using python 2.x, you have to encode it with utf-8 as follows
u'\u2014'.encode('utf-8').decode('utf-8').replace("-","")
Yeah, Because it is taking '2014' followed by '\u' as a unicode string and not a string literal.
Things that can help:
Converting to ascii using .encode('ascii', 'ignore')
As you are using pandas, you can use 'encoding' parameter and pass 'ascii' there.
Do this instead : u'\u2014'.replace(u'\u2014', u'2014').encode('ascii', 'ignore')
Hope this helps.

Python unicode accent a (à) hex

I have a string from bs4 that is
s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
\u00c3\u00a0should be accent a (à) I have gotten it to show up in the console partly correct as
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
with
str2 = u'%s' % s
print(str2.encode('utf-8').decode('unicode-escape'))
but it's decoding c3 and a0 separately, so I get a tilde A instead of an accent a. I know that c3 a0 is the hex utf-8 for accent a. I have no idea what's going on and I got to here using Google and the combinatory approach to the answers I got. This entire character encoding thing seems like a big mess to me.
The way it is supposed to be is
311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
EDIT:
Andrey's method worked when printing it out, but trying to use urlopen with the string I get UnicodeEncodeError: 'ascii' codec can't encode character '\xe0' in position 60: ordinal not in range(128)
After using unquote(str,":/") it gives UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128).
Transform the string back into bytes using .encode('latin-1'), then decode the unicode-escapes \u, transform everything into bytes again using the "wrong" 'latin-1' encoding, and finally, decode "properly" as 'utf-8':
s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
s.encode('latin-1').decode('raw_unicode_escape').encode('latin-1').decode('utf-8')
gives:
'vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html'
It works for the same reason as explained in this answer.
Assuming Python 2:
This is a byte string with Unicode escapes. The Unicode escapes were incorrectly generated for some UTF-8-encoded data:
>>> s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
>>> s.decode('unicode-escape')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'
Now it is a Unicode string but now appears mis-decoded since the code points resemble UTF-8 bytes. It turns output the latin1 (also iso-8859-1) codec maps the first 256 code points directly to bytes 0-255, so use this trick to convert back to a byte string:
>>> s.decode('unicode-escape').encode('latin1')
'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'
Now it can be decoded correctly as UTF-8:
>>> s.decode('unicode-escape').encode('latin1').decode('utf8')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xe0-la-me-creatura.html'
It is a Unicode string, so Python displays its repr() value, which shows code points above U+007F as escape codes. print it to see the actual value assuming your terminal is correctly configured with an encoding that supports the characters printed:
>>> print(s.decode('unicode-escape').encode('latin1').decode('utf8'))
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
Ideally, fix the problem that generated this string incorrectly in the first place instead of working around the mess.

UCS2 coding and decoding using Python

s = "خالد".encode("utf-16be")
uni = s.decode("utf-16be")
print (uni)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-7: ordinal not in range(128).
Any suggestion?
In Python 3 what you have would work already, because string literals are unicode by default.
In Python 2, you can make a unicode string literal with the u prefix.
s = u"خالد".encode("utf-16be")
uni = s.decode("utf-16be")
print (uni)
Result:
خالد
Ok, you have an unicode encode error with the ascii charset. This error should not have been raised on any of your two first lines, because none is trying to encode an unicode string as ascii.
So I assume that it is caused by the print in the third line. Depending on your system, and your exact Python version, the print will try to encode with a default encoding which happens to be ascii here.
You must find what encoding is supported by your terminal, or if you can use 'UTF-8'.
Then you can print with
print(uni.encode("utf-8", errors="replace")) # or the encoding supported by your terminal

Troubles in printing unicode string that I have in byte format

Reading from a database I get the following value
b'd\xe2\x80\x99int'
How can I print it to get the string d’int (note that this is different from d'int)?
I tried with print(b'd\xe2\x80\x99int'.decode('utf-8')) but I get the error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 1: ordinal not in range(128)
EDIT: thanks to the comment I understood that the problem is not in my Python code but in emacs, I am having exactly the same problem as described here Unicode conversion issue using Python in Emacs
I will close the question
You can use bytes.decode()
>>> bytes.decode(b'd\xe2\x80\x99int', 'utf8')
'd’int'
or analogously, the .decode() method over the bytes object itself:
>>> b'd\xe2\x80\x99int'.decode('utf-8')
'd’int'

exceptions.UnicodeDecodeError - 'ascii' codec can't decode byte

I keep getting this error:
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
args = ('ascii', '\xe2\x9d\xb6 Senn =)', 0, 1, 'ordinal not in range(128)')
encoding = 'ascii'
end = 1
message = ''
object = '\xe2\x9d\xb6 Senn =)'
reason = 'ordinal not in range(128)'
start = 0
Using this code:
steamFriend = data['response']['players'][i]
n = steamUser(steamFriend['personaname'].encode("utf-8"), steamFriend['steamid'], steamFriend['avatarfull'], steamFriend['profileurl'], steamFriend['personastate'], False)
Some things to note here:
steamFriend is a JSON object
I get this error only sometimes, because the steamFriend['personaname'] contains some weird symbols (for example ❶), and I don't know how to parse this correctly so I don't get errors.
Any help is greatly appreciated.
Also, \xe2\x9d\xb6 Senn =) is supposed to represent ❶ Senn =), if that helps.
Without seeing the full code it is hard to tell, but it seems that steamUser expects ascii input. If that is the problem, you can solve it by:
streamFriend['personaname'].encode("ascii", errors="ignore")
or
streamFriend['personaname'].encode("ascii", errors="replace")
Obviously you will lose unicode characters in the process.
If the quoted error is occurring on the n=... line, the implication is that steamFriend['personaname'] is a byte string, not a Unicode string.
Consequently when you ask to .encode it, Python has to decode the string to Unicode in order to be able to encode it back to bytes. An implicit decoding happens using the default encoding, which is ASCII, so because the byte string does not contain only ASCII you get a failure.
Are you sure you didn't mean to do:
steamFriend['personaname'].decode("utf-8")
decoding the byte string '\xe2\x9d\xb6 Senn =)' using UTF-8 would give you the Unicode string u'\u2776 Senn =)', where U+2776=❶ so that would seem more like what you are after.
(Normally, however, JSON strings are explicitly Unicode, so it's not clear where you would have got the byte string from. How are you loading the JSON content?)

Categories