How to decode a cp1252 string? - python

I am getting an mp3 tag (ID V1) with eyeD3 and would like to understand its encoding. Here is what I try:
>>> print(type(mp3artist_v1))
<type 'unicode'>
>>> print(type(mp3artist_v1.encode('utf-8')))
<type 'str'>
>>> print(mp3artist_v1)
Zåìôèðà
>>> print(mp3artist_v1.encode('utf-8').decode('cp1252'))
ZåìôèðÃ
>>> print(u'Zемфира'.encode('utf-8').decode('cp1252'))
Zемфира
If I use an online tool to decode the value, it says that the value Zемфира could be converted to the correct value Zемфира by changing encodings CP1252 → UTF-8, and the value Zåìôèðà by changing encodings CP1252 → CP1251.
What should I do to get Zемфира from mp3artist_v1? .encode('cp1252').decode('cp1251') works well, but how can I detect the right encoding automatically (only 3 encodings are possible: cp1251, cp1252, utf-8)? I was planning to use the following code:
def forceDecode(string, codecs=['utf-8', 'cp1251', 'cp1252']):
    for i in codecs:
        try:
            print(i)
            return string.decode(i)
        except:
            pass
    print "cannot decode url %s" % ([string])
but it does not help since I should encode with one charset first and then decode with another.

This works:
s = u'Zåìôèðà'
print s.encode('latin1').decode('cp1251')
# Zемфира
Explanation: Zåìôèðà is mistakenly treated as a unicode string, while it's actually a sequence of bytes, which means Zемфира in cp1251. By applying encode('latin1') we convert this "unicode" string back to bytes, using the codepoint numbers as byte values, and then convert these bytes back to unicode, telling decode to use cp1251.
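In Python 3, where str is always Unicode, the same repair is a one-liner; a minimal sketch, assuming the mojibake arrived as a str:

```python
# The cp1251 bytes of "Zемфира" were mis-decoded as latin-1/cp1252,
# producing the mojibake below. encode('latin-1') recovers the raw
# bytes, and decode('cp1251') reads them with the intended codec.
garbled = 'Zåìôèðà'
fixed = garbled.encode('latin-1').decode('cp1251')
print(fixed)  # Zемфира
```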
As to automatic decoding, the following brute force approach seems to work with your examples:
import re, itertools

def guess_decode(s):
    encodings = ['cp1251', 'cp1252', 'utf8']
    for steps in range(2, 10, 2):
        for encs in itertools.product(encodings, repeat=steps):
            r = s
            try:
                for enc in encs:
                    r = r.encode(enc) if isinstance(r, unicode) else r.decode(enc)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            if re.match(ur'^[\w\sа-яА-Я]+$', r):
                print 'debug', encs, r
                return r
print guess_decode(u'Zемфира')
print guess_decode(u'Zåìôèðà')
print guess_decode(u'ZåìôèðÃ\xA0')
Results:
debug ('cp1252', 'utf8') Zемфира
Zемфира
debug ('cp1252', 'cp1251') Zемфира
Zемфира
debug ('cp1252', 'utf8', 'cp1252', 'cp1251') Zемфира
Zемфира
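For reference, a Python 3 port of the same brute-force guesser is sketched below. Note that in Python 3 the \w class is Unicode-aware (it would match the mojibake too), so the plausibility test is narrowed to Latin and Cyrillic letters plus whitespace:

```python
import itertools
import re

def guess_decode3(s, encodings=('cp1251', 'cp1252', 'utf-8')):
    """Try chains of encode/decode steps until the result looks readable."""
    for steps in range(2, 10, 2):
        for encs in itertools.product(encodings, repeat=steps):
            r = s
            try:
                for enc in encs:
                    # str -> bytes on even steps, bytes -> str on odd steps
                    r = r.encode(enc) if isinstance(r, str) else r.decode(enc)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            if isinstance(r, str) and re.fullmatch(r'[A-Za-zа-яА-Я\s]+', r):
                return r
    return None

print(guess_decode3('Zåìôèðà'))  # Zемфира
```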

Related

unprintable python unicode string

I retrieved some exif info from an image and got the following:
{ ...
37510: u'D2\nArbeitsamt\n\xc3\x84nderungsbescheid'
...}
I expected it to be
{ ...
37510: u'D2\nArbeitsamt\nÄnderungsbescheid'
... }
I need to convert the value to a str, but I couldn't get it to work. I always get something like this (using Python 2.7):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-15: ordinal not in range(128)
Any ideas how I can handle this?
UPDATE:
I tried it with Python 3 and no error is thrown, but the result is now
{ ...
37510: 'D2\nArbeitsamt\nÃ\x84nderungsbescheid',
... }
which is still not what I expected.
It seems to be utf8 which was incorrectly decoded as latin1 and then placed in a unicode string. You can use .encode('iso8859-1') to reverse the incorrect decoding.
>>> my_dictionary = {37510: u'D2\nArbeitsamt\n\xc3\x84nderungsbescheid'}
>>> print(my_dictionary[37510].encode('iso8859-1'))
D2
Arbeitsamt
Änderungsbescheid
You can print it out just fine now, but you might then also decode it as unicode, so it ends up with the correct type for further processing:
>>> type(my_dictionary[37510].encode('iso8859-1'))
<type 'str'>
>>> print(my_dictionary[37510].encode('iso8859-1').decode('utf8'))
D2
Arbeitsamt
Änderungsbescheid
>>> type(my_dictionary[37510].encode('iso8859-1').decode('utf8'))
<type 'unicode'>
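The same fix in Python 3, where the mis-decoded value is an ordinary str; a sketch using the dictionary from the question:

```python
my_dictionary = {37510: 'D2\nArbeitsamt\n\xc3\x84nderungsbescheid'}

# Undo the wrong latin-1 decoding to get back the raw UTF-8 bytes,
# then decode them with the correct codec.
raw = my_dictionary[37510].encode('latin-1')   # ... b'\xc3\x84' ...
fixed = raw.decode('utf-8')
print(fixed)  # D2 / Arbeitsamt / Änderungsbescheid on three lines
```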

Python urlencode special character

I have this variable here
reload(sys)
sys.setdefaultencoding('utf8')
foo = u'"Esp\xc3\xadrito"'
which translates to "Espírito". But when I pass my variable to urlencode like this
urllib.urlencode({"q": foo}) # q=%22Esp%C3%83%C2%ADrito%22'
The special character is being "represented" wrongly in the URL.
How should I fix this?
You got the wrong encoding of "Espírito". I don't know where you got it, but this is the right one:
>>> s = u'"Espírito"'
>>>
>>> s
u'"Esp\xedrito"'
Then encoding your query:
>>> urllib.urlencode({'q': s.encode('utf-8')})
'q=%22Esp%C3%ADrito%22'
This should give you back the right encoding of your string.
EDIT: This is regarding right encoding of your query string, demo:
>>> s = u'"Espírito"'
>>> print s
"Espírito"
>>> s.encode('utf-8')
'"Esp\xc3\xadrito"'
>>> s.encode('latin-1')
'"Esp\xedrito"'
>>>
>>> print "Esp\xc3\xadrito"
Espí­rito
>>> print "Esp\xedrito"
Espírito
This clearly shows that the right encoding for your string is most probably latin-1 (cp1252 works as well). Now, as far as I understand, urlparse.parse_qs either assumes the default encoding utf-8 or your system default encoding, which, as per your post, you also set to utf-8.
Interestingly, I was playing with the query you provided in your comment, I got this:
>>> q = "q=Esp%C3%ADrito"
>>>
>>> p = urlparse.parse_qs(q)
>>> p['q'][0].decode('utf-8')
u'Esp\xedrito'
>>>
>>> p['q'][0].decode('latin-1')
u'Esp\xc3\xadrito'
#Clearly not ASCII encoding.
>>> p['q'][0].decode()
Traceback (most recent call last):
File "<pyshell#320>", line 1, in <module>
p['q'][0].decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
>>>
>>> p['q'][0]
'Esp\xc3\xadrito'
>>> print p['q'][0]
Espírito
>>> print p['q'][0].decode('utf-8')
Espírito
urllib and urlparse work with byte strings in Python 2. To get unicode strings, encode and decode using utf-8.
Here's an example of a round-trip:
import urllib, urlparse

data = {'q': u'Espírito'}

# to query string:
bdata = {k: v.encode('utf-8') for k, v in data.iteritems()}
qs = urllib.urlencode(bdata)
# qs = 'q=Esp%C3%ADrito'

# back to dict:
bdata = urlparse.parse_qs(qs)
data = {k: map(lambda s: s.decode('utf-8'), v)
        for k, v in bdata.iteritems()}
# data = {'q': [u'Espírito']}
Note the different meaning of escape sequences: in 'Esp\xc3\xadrito' (a string), they represent bytes, while in u'"Esp\xedrito"' (a unicode object) they represent Unicode code points.
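In Python 3, urllib.parse accepts str directly and defaults to UTF-8, so the manual encode/decode steps disappear; a sketch of the same round-trip:

```python
from urllib.parse import urlencode, parse_qs

# str in, percent-encoded UTF-8 out
qs = urlencode({'q': 'Espírito'})
print(qs)    # q=Esp%C3%ADrito

# parse_qs decodes back to str, using UTF-8 by default
data = parse_qs(qs)
print(data)  # {'q': ['Espírito']}
```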

How to permissively decode a UTF-8 bytearray?

I need to decode a UTF-8 sequence, which is stored in a bytearray, to a string.
The UTF-8 sequence might contain erroneous parts. In this case I need to decode as much as possible and (optionally?) substitute invalid parts by something like "?".
# First part decodes to "ABÄC"
b = bytearray([0x41, 0x42, 0xC3, 0x84, 0x43])
s = str(b, "utf-8")
print(s)
# Second part, invalid sequence, wanted to decode to something like "AB?C"
b = bytearray([0x41, 0x42, 0xC3, 0x43])
s = str(b, "utf-8")
print(s)
What's the best way to achieve this in Python 3?
There are several built-in error-handling schemes for encoding and decoding between str and bytes/bytearray, e.g. with bytearray.decode(). For example:
>>> b = bytearray([0x41, 0x42, 0xC3, 0x43])
>>> b.decode('utf8', errors='ignore') # discard malformed bytes
'ABC'
>>> b.decode('utf8', errors='replace') # replace with U+FFFD
'AB�C'
>>> b.decode('utf8', errors='backslashreplace') # replace with backslash-escape
'AB\\xc3C'
In addition, you can write your own error handler and register it:
import codecs

def my_handler(exception):
    """Replace unexpected bytes with '?'."""
    return '?', exception.end

codecs.register_error('my_handler', my_handler)
>>> b.decode('utf8', errors='my_handler')
'AB?C'
All of these error handling schemes can also be used with the str() constructor as in your question:
>>> str(b, 'utf8', errors='my_handler')
'AB?C'
... although it's more idiomatic to use str.decode() explicitly.
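One more option, not covered above but worth knowing when the invalid bytes must survive a round-trip unchanged rather than being replaced: the errors='surrogateescape' handler. A sketch:

```python
b = bytes([0x41, 0x42, 0xC3, 0x43])

# Each invalid byte becomes a lone surrogate (here 0xC3 -> U+DCC3) ...
s = b.decode('utf-8', errors='surrogateescape')

# ... and encoding with the same handler restores the original bytes.
restored = s.encode('utf-8', errors='surrogateescape')
assert restored == b
```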

How could list decode to 'UTF-8'

I have a list [0x97, 0x52], not a unicode object. It is the Unicode code of the character '青' (u'\u9752'). How can I convert this list to a unicode object first, then encode it to UTF-8?
bytes = [0x97, 0x52]
code = bytes[0] * 256 + bytes[1] # build the 16-bit code
char = unichr(code) # convert code to unicode
utf8 = char.encode('utf-8') # encode unicode as utf-8
print utf8 # prints '青'
Not sure if this is the most elegant way, but it works for this particular example.
>>> ''.join([chr(x) for x in [0x97, 0x52]]).decode('utf-16be')
u'\u9752'
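In Python 3 the bytes type makes both answers shorter; a sketch:

```python
pair = [0x97, 0x52]

# Interpret the two bytes as one big-endian UTF-16 code unit
char = bytes(pair).decode('utf-16-be')
print(char)   # 青

utf8 = char.encode('utf-8')
print(utf8)   # b'\xe9\x9d\x92'
```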

How can I iterate over every character in a given encoding using Python?

Is there a way to iterate over every character in a given encoding and print its code? Say, UTF-8?
All Unicode characters can be represented in UTF-n for all defined n. What are you trying to achieve?
If you really want to do something like print all the valid characters in a particular encoding, without needing to know whether the encoding is "single byte" or "multi byte" or whether its size is fixed or not:
import unicodedata as ucd
import sys

def dump_encoding(enc):
    for i in xrange(sys.maxunicode):
        u = unichr(i)
        try:
            s = u.encode(enc)
        except UnicodeEncodeError:
            continue
        try:
            name = ucd.name(u)
        except:
            name = '?'
        print "U+%06X %r %s" % (i, s, name)

if __name__ == "__main__":
    dump_encoding(sys.argv[1])
Suggestions: Try it out on something small, like cp1252. Redirect stdout to a file.
dude, do you have any idea how many code points there are in unicode...
btw, from the Python docs:
chr( i )
Return the string representing a character whose Unicode code point is the integer i. For example, chr(97) returns the string 'a', while chr(8364) returns the string '€'. This is the inverse of ord().
The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16). ValueError will be raised if i is outside that range.
So
import sys

for i in range(sys.maxunicode + 1):
    char = chr(i)
    print(repr(char))  # print('\ud800') causes a UnicodeEncodeError
For single-byte encodings you can use:
''.join(chr(x) for x in range(256)).decode(encoding, 'ignore')
to get a string containing all the valid characters in the given encoding.
For fixed-size multibyte encodings careful use of struct.pack() in place of chr() should work.
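In Python 3 the single-byte trick reads a little differently (bytes to str rather than str to unicode); a sketch:

```python
def valid_chars(encoding):
    """Return every character the given single-byte encoding can decode."""
    return bytes(range(256)).decode(encoding, errors='ignore')

print(len(valid_chars('ascii')))    # 128
print(len(valid_chars('latin-1')))  # 256
```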
I'm using python3.7
import unicodedata as ucd
import sys

def dump_encoding(enc):
    for i in range(sys.maxunicode):
        u = chr(i)
        try:
            s = u.encode(enc)
        except UnicodeEncodeError:
            continue
        try:
            name = ucd.name(u)
        except ValueError:
            name = '?'
        print(i, s, u, name)

if __name__ == "__main__":
    dump_encoding(sys.argv[1])
Make sure you are using the correct version of Python to run the script; one argument is required:
python3.7 ./iterate_over_charset.py utf-8 > unicode_all.txt
Sample output:
4473 b'\xe1\x85\xb9' ᅹ HANGUL JUNGSEONG YA-YO
4474 b'\xe1\x85\xba' ᅺ HANGUL JUNGSEONG EO-O
4475 b'\xe1\x85\xbb' ᅻ HANGUL JUNGSEONG EO-U
4476 b'\xe1\x85\xbc' ᅼ HANGUL JUNGSEONG EO-EU
4477 b'\xe1\x85\xbd' ᅽ HANGUL JUNGSEONG YEO-O
4478 b'\xe1\x85\xbe' ᅾ HANGUL JUNGSEONG YEO-U
4479 b'\xe1\x85\xbf' ᅿ HANGUL JUNGSEONG O-EO
4480 b'\xe1\x86\x80' ᆀ HANGUL JUNGSEONG O-E
4481 b'\xe1\x86\x81' ᆁ HANGUL JUNGSEONG O-YE
