I need to decode a "UNICODE" encoded string:
>>> id = u'abcdß'
>>> encoded_id = id.encode('utf-8')
>>> encoded_id
'abcd\xc3\x9f'
The problem I have is:
Using Pylons routing, I get the encoded_id variable as a unicode string u'abcd\xc3\x9f' instead of just a regular byte string 'abcd\xc3\x9f'.
Using Python, how can I decode my encoded_id variable, which is a unicode string?
>>> encoded_id = u'abcd\xc3\x9f'
>>> encoded_id.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/test/vng/lib64/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128)
You have UTF-8 encoded data (there is no such thing as UNICODE encoded data).
Encode the unicode value to Latin-1, then decode from UTF8:
encoded_id.encode('latin1').decode('utf8')
Latin-1 maps the first 256 Unicode code points one-to-one to bytes.
Demo:
>>> encoded_id = u'abcd\xc3\x9f'
>>> encoded_id.encode('latin1').decode('utf8')
u'abcd\xdf'
>>> print encoded_id.encode('latin1').decode('utf8')
abcdß
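If more than one route parameter arrives mangled like this, a small helper keeps the intent readable. This is only a sketch building on the answer above; the function name and the byte-string fallback are my own additions:
def fix_mojibake(value):
    # The route gives us a unicode string whose code points are really
    # UTF-8 bytes. Latin-1 maps code points U+0000..U+00FF back to those
    # bytes one-to-one, and the result can then be decoded as UTF-8.
    if isinstance(value, unicode):
        return value.encode('latin1').decode('utf8')
    # Plain byte strings are assumed to already be UTF-8 encoded.
    return value.decode('utf8')

print(fix_mojibake(u'abcd\xc3\x9f'))  # prints abcd followed by the sharp s (U+00DF)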
I need to convert a str to text in Python 2.7
a = u'"\u0274\u1d1c\u0274\u1d04\u1d00 \u1d00\u028f\u1d1c\u1d05\u1d07s \u1d00 \u1d1c\u0274 \u0274\u026a\xf1\u1d0f \u1d0f \u1d1c\u0274\u1d00 \u0274\u026a\xf1\u1d00 \u1d04\u1d0f\u0274 \u1d1c\u0274\u1d00 \u1d1b\u1d00\u0280\u1d07\u1d00 \u1d07\u0274 \u029f\u1d00 \u01eb\u1d1c\u1d07 s\u026a\u1d07\u0274\u1d1b\u1d07 \u01eb\u1d1c\u1d07 \u1d18\u1d1c\u1d07\u1d05\u1d07 \u1d1b\u1d07\u0274\u1d07\u0280 \u1d07x\u026a\u1d1b\u1d0f"'
I tried a.decode('utf8'), but the truth is I don't know what encoding the str a uses.
The output I need is:
"ɴᴜɴᴄᴀ ᴀʏᴜᴅᴇs ᴀ ᴜɴ ɴɪñᴏ ᴏ ᴜɴᴀ ɴɪñᴀ ᴄᴏɴ ᴜɴᴀ ᴛᴀʀᴇᴀ ᴇɴ ʟᴀ ǫᴜᴇ sɪᴇɴᴛᴇ ǫᴜᴇ ᴘᴜᴇᴅᴇ ᴛᴇɴᴇʀ ᴇxɪᴛᴏ"
ERROR:
>>> print(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "F:\WinPython-64bit-2.7.13.1Zero\python-2.7.13.amd64\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-5: character maps to <undefined>
Since you are on Python 2, you have to encode the string contents, which are already text, to your terminal's encoding.
So, if you are on Windows, use something like print(a.encode("cp850")) (your traceback shows the console is using cp437, and these small-caps characters exist in neither code page, so you may also need an errors argument such as "replace"); if you are on Linux, macOS, or another OS: print(a.encode("utf-8")).
On Python 3 the encoding should be done automatically.
Also, it is important to understand that characters written as \uNNNN escapes in Python correspond to Unicode code points, not to specific character encodings like "utf-8", "latin1" or "utf-16". In Python 3, most printable characters written this way are shown as themselves even in the string's internal representation, which is what a Python interactive session displays by default (otherwise use the built-in repr call to see it). By using the built-in str or a call to print, you see the rendered string, and all \uXXXX, \UXXXXXXXX, \xNN and \N{unicode character name} escapes are rendered as the actual characters. (In Python 2 you need to manually encode this representation to the character encoding used by your device.)
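A slightly more defensive Python 2 variant of the same idea (a sketch only; it assumes a is the unicode string from the question and that printing replacement characters is acceptable when the console cannot show a glyph):
import sys

# Fall back to UTF-8 when stdout has no declared encoding (e.g. when piped).
encoding = sys.stdout.encoding or 'utf-8'
# errors='replace' substitutes '?' for any character the code page cannot
# represent, instead of raising UnicodeEncodeError as in the traceback above.
print(a.encode(encoding, 'replace'))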
In other words, if you are using Python 3, this is as simple as:
In [15]: a = u'"\u0274\u1d1c\u0274\u1d04\u1d00 \u1d00\u028f\u1d1c\u1d05\u1d07s \u1d00 \u1d1c\u0274 \u0274\u026a\xf1\u1d0f \u1d0f \u1d1c\u0274\u1d00 \u0274\u026a\xf1\u1d00 \u1d04\u1d0f\u0274 \u1d1c\u0274\u1d00 \u1d1b\u1d00\u0280\u1d07\u1d00 \u1d07\u0274 \u029f\u1d00 \u01eb\u1d1c\u1d07 s\u026a\u1d07\u0274\u1d1b\u1d07 \u01eb\u1d1c\u1d07 \u1d18\u1d1c\u1d07\u1d05\u1d07 \u1d1b\u1d07\u0274\u1d07\u0280 \u1d07x\u026a\u1d1b\u1d0f"'
...:
In [16]: a
Out[16]: '"ɴᴜɴᴄᴀ ᴀʏᴜᴅᴇs ᴀ ᴜɴ ɴɪñᴏ ᴏ ᴜɴᴀ ɴɪñᴀ ᴄᴏɴ ᴜɴᴀ ᴛᴀʀᴇᴀ ᴇɴ ʟᴀ ǫᴜᴇ sɪᴇɴᴛᴇ ǫᴜᴇ ᴘᴜᴇᴅᴇ ᴛᴇɴᴇʀ ᴇxɪᴛᴏ"'
Or:
In [17]: print(a)
"ɴᴜɴᴄᴀ ᴀʏᴜᴅᴇs ᴀ ᴜɴ ɴɪñᴏ ᴏ ᴜɴᴀ ɴɪñᴀ ᴄᴏɴ ᴜɴᴀ ᴛᴀʀᴇᴀ ᴇɴ ʟᴀ ǫᴜᴇ sɪᴇɴᴛᴇ ǫᴜᴇ ᴘᴜᴇᴅᴇ ᴛᴇɴᴇʀ ᴇxɪᴛᴏ"
I'm trying to save some data I scraped to an Excel sheet, and I'm having unicode decode problems with one particular piece, which has the following form:
work_info['title'] = Darimān-i afsaradgī : rāhnamā-yi kāmil bira-yi hamah-ʼi khānvādahʹhā
The code that is causing the error is:
data.write(b + book + accumulated_books + 2, 43, work_info['title'])
wb.save('/Users/apple/Downloads/WC Scrape_trialfortwo.csv')
And the error is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5: ordinal not in range(128)
I've tried several different encoding/decoding techniques, but nothing has worked so far. Any suggestions would be extremely appreciated.
Thanks!
It looks like you are using Python 2, and Python 2's unicode/bytes handling is causing the problem.
>>> s = 'Darimān-i afsaradgī : rāhnamā-yi kāmil bira-yi hamah-ʼi khānvādahʹhā'
>>> from xlwt import Workbook
>>> wb = Workbook()
>>> ws = wb.add_sheet('test')
>>> ws.write(1, 0, s)
>>> wb.save('test.xls')
Traceback (most recent call last):
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5: ordinal not in range(128)
xlwt assumes that s is an ascii-encoded string and tries to decode it to unicode, but fails:
>>> s.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5: ordinal not in range(128)
In fact, s is encoded as utf-8:
>>> s.decode('utf-8')
u'Darim\u0101n-i afsaradg\u012b : r\u0101hnam\u0101-yi k\u0101mil bira-yi hamah-\u02bci kh\u0101nv\u0101dah\u02b9h\u0101'
The simplest solution may be to encode your workbook as utf-8:
>>> wb = Workbook(encoding='utf-8')
>>> ws = wb.add_sheet('test')
>>> ws.write(1, 0, s)
>>> wb.save('test.xls')
If you need a finer-grained approach, you could explicitly decode the string to unicode before writing it to the worksheet:
>>> wb = Workbook()
>>> ws = wb.add_sheet('test')
>>> ws.write(1, 0, s.decode('utf-8'))
>>> wb.save('test.xls')
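Applied back to the code from the question, that would look something like this (assuming work_info['title'] is a UTF-8 encoded byte string, as the traceback suggests):
data.write(b + book + accumulated_books + 2, 43, work_info['title'].decode('utf-8'))
wb.save('/Users/apple/Downloads/WC Scrape_trialfortwo.csv')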
import csv

def main():
    client = ##client_here
    db = client.brazil
    rio_bus = client.tweets
    result_cursor = db.tweets.find()
    first = result_cursor[0]
    ordered_fieldnames = first.keys()
    with open('brazil_tweets.csv', 'wb') as csvfile:
        csvwriter = csv.DictWriter(csvfile, fieldnames=ordered_fieldnames, extrasaction='ignore')
        csvwriter.writeheader()
        for x in result_cursor:
            print x
            csvwriter.writerow({k: str(x[k]).encode('utf-8') for k in x})
    #[ csvwriter.writerow(x.encode('utf-8')) for x in result_cursor ]

if __name__ == '__main__':
    main()
Basically the issue is that the tweets contain a bunch of characters in Portuguese. I tried to correct for this by encoding everything into unicode values before putting them in the dictionary that was to be added to the row. However this doesn't work. Any other ideas for formatting these values so that csv reader and dictreader can read them?
str(x[k]).encode('utf-8') is the problem.
str(x[k]) will convert a Unicode string to a byte string using the default ascii codec in Python 2:
>>> x = u'résumé'
>>> str(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
Non-Unicode values, like booleans, will be converted to byte strings, but then Python will implicitly decode the byte string to a Unicode string before calling .encode(), because you can only encode Unicode strings. This usually won't cause an error because most non-Unicode objects have an ASCII representation. Here's an example where a custom object returns a non-ASCII str() representation:
>>> class Test(object):
... def __str__(self):
... return 'r\xc3\xa9sum\xc3\xa9'
...
>>> x=Test()
>>> str(x)
'r\xc3\xa9sum\xc3\xa9'
>>> str(x).encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Note the above was a decode error instead of an encode error.
If str() is only there to coerce non-string values such as booleans to strings, coerce them to Unicode strings with unicode() instead:
unicode(x[k]).encode('utf-8')
Non-Unicode values will be converted to Unicode strings, which can then be correctly encoded, but Unicode strings will remain unchanged, so they will also be encoded correctly.
>>> x = True
>>> unicode(x)
u'True'
>>> unicode(x).encode('utf8')
'True'
>>> x = u'résumé'
>>> unicode(x).encode('utf8')
'r\xc3\xa9sum\xc3\xa9'
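Applied to the writerow call from the question, that would look something like:
csvwriter.writerow({k: unicode(x[k]).encode('utf-8') for k in x})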
P.S. Python 3 does not do implicit encode/decode between byte and Unicode strings and makes these errors easier to spot.
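A quick Python 3 illustration of that point (a sketch to run under Python 3, not part of the original answer):
# In Python 3, str is always text and bytes are always bytes; there is no
# implicit ASCII conversion between them.
print(str(True).encode('utf8'))   # b'True'
print('résumé'.encode('utf8'))    # b'r\xc3\xa9sum\xc3\xa9'
# Mixing the two types fails loudly with TypeError instead of a confusing
# UnicodeDecodeError later on:
# b'abc' + 'def'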
After using pandas to read a JSON object into a pandas Series, we only want to print the first year in each row. E.g. if we have 2013-2014(2015), we want to print 2013.
x = '{"0":"1985\\u2013present","1":"1985\\u2013present",......}'
a = pd.read_json(x, typ='series')
for i, row in a.iteritems():
    print row.split('-')[0].split('—')[0].split('(')[0]
the following error occurs:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-1333-d8ef23860c53> in <module>()
1 for i, row in a.iteritems():
----> 2 print row.split('-')[0].split('—')[0].split('(')[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
Why is this happening? How can we fix the problem?
Your JSON data strings are unicode strings, which you can see, for example, by just inspecting one of the values:
In: a[0]
Out: u'1985\u2013present'
Now you try to split the string at the unicode character \u2013 (EN DASH), but the separator you give to split is not a unicode string (hence the error 'ascii' codec can't decode byte 0xe2: the EN DASH is not an ASCII character).
To make your example work, you could use:
for i, row in a.iteritems():
    print row.split('-')[0].split(u'—')[0].split('(')[0]
Notice the u in front of the unicode dash. You could also write u'\u2013' to split the string.
For details on unicode in Python, see https://docs.python.org/2/howto/unicode.html
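If you prefer a single split over three chained ones, a regular expression works too. This is only a sketch, not part of the original answer; it assumes every separator is a plain hyphen, an EN DASH or an opening parenthesis:
import re

splitter = re.compile(u'[-\u2013(]')  # hyphen, EN DASH or '('
for i, row in a.iteritems():
    print splitter.split(row)[0]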
I capture the string from a html source file using regex:
import re

f = open(rrfile, 'r')
p = re.compile(r'"name":"([^"]+)","head":"([^"]+)"')
match = re.findall(p, f.read())
And I've tried:
>>> u'\\u4f60\\u4f60'.replace('\\u', '\u')
u'\\u4f60\\u4f60'
>>> u'\\u4f60\\u4f60'.replace(u'\\u', '\u')
u'\\u4f60\\u4f60'
>>> u'\\u4f60\\u4f60'.replace('\\u', u'\u')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: end of string in escape sequence
Could this be done with str.replace(), or do I need something more complex?
>>> u'\\u4f60\\u4f60'.decode('unicode_escape')
u'\u4f60\u4f60'
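To apply that to the strings captured by the regex above, something like this should work (a sketch; it assumes the captured values contain literal \uXXXX escape sequences, as in the question, rather than raw UTF-8 bytes):
names = [name.decode('unicode_escape') for name, head in match]
print names[0]
If the data you are scraping is actually embedded JSON, parsing it with the json module instead of a regex would handle these escapes for you.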