python 2.7 string.join() with unicode

I have a bunch of byte strings (str, not unicode, in Python 2.7) containing Unicode data encoded as UTF-8.
I am trying to join them (with "".join(utf8_strings) or u"".join(utf8_strings)), which throws
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 0: ordinal not in range(128)
Is there any way to use the .join() method on non-ASCII strings? Sure, I can concatenate them in a for loop, but that wouldn't be efficient.

Joining byte strings using ''.join() works just fine; the error you see would only appear if you mixed unicode and str objects:
>>> utf8 = [u'\u0123'.encode('utf8'), u'\u0234'.encode('utf8')]
>>> ''.join(utf8)
'\xc4\xa3\xc8\xb4'
>>> u''.join(utf8)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
>>> ''.join(utf8 + [u'unicode object'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
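The exceptions above are raised when using the Unicode value u'' as the joiner and when adding a Unicode string to the list of strings to join, respectively. In both cases Python 2 implicitly decodes the byte strings with the default ASCII codec, which fails on the first non-ASCII byte.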
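If you want a unicode result, one option is to decode every byte string explicitly before joining, so that no implicit ASCII decode takes place. A minimal sketch, assuming the input really is UTF-8 (the name utf8_strings comes from the question; the sample data is made up):

```python
# Sample byte strings, built the same way as in the session above.
utf8_strings = [u'\u0123'.encode('utf8'), u'\u0234'.encode('utf8')]

# Option 1: stay in bytes. Joining with a bytes separator never decodes.
joined_bytes = b''.join(utf8_strings)

# Option 2: decode each piece explicitly, then join as unicode text.
joined_text = u''.join(s.decode('utf8') for s in utf8_strings)

print(repr(joined_bytes))
print(repr(joined_text))
```

Either way the join itself runs once over the whole sequence, so there is no need to fall back to concatenation in a loop.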

"".join(...) will work if each parameter is a str (whatever the encoding may be).
The issue you are seeing is probably not related to the join, but the data you supply to it. Post more code so we can see what's really wrong.

Related

Inconsistent character decoding errors in Python

In both cp1250 and latin2, there is no character corresponding to the byte \x88 (cf. gray cells in the code page tables). Yet, if I try to decode this byte using the two encodings in Python 3, I get different results. The first encoding yields an error:
>>> b"\x88".decode("cp1250")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/encodings/cp1250.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x88 in position 0: character maps to <undefined>
Which makes sense, since there really is no character defined for that byte. However, the second encoding returns a character corresponding to Unicode codepoint \u0088, even though it shouldn't be defined either:
>>> b"\x88".decode("latin2")
'\x88'
>>> "\x88" == "\u0088"
True
Why is that?
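A likely explanation, which this excerpt does not give: Python's latin2 codec is ISO 8859-2, and its mapping table assigns the bytes 0x80 to 0x9F to the C1 control characters U+0080 to U+009F, whereas the cp1250 table leaves 0x88 unmapped. The sketch below checks this; the variable names are illustrative only.

```python
import unicodedata

raw = b"\x88"

# latin2 (ISO 8859-2) maps the byte to a C1 control character.
latin2_char = raw.decode("latin2")
category = unicodedata.category(latin2_char)  # "Cc" means "control"

# cp1250 has no mapping for 0x88, so strict decoding fails...
try:
    raw.decode("cp1250")
    cp1250_failed = False
except UnicodeDecodeError:
    cp1250_failed = True

# ...but a non-strict error handler substitutes U+FFFD instead.
cp1250_replaced = raw.decode("cp1250", errors="replace")

print(hex(ord(latin2_char)), category, cp1250_failed, repr(cp1250_replaced))
```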

Python - UnicodeDecode Error - Resolving [duplicate]

I'm getting an odd error with unicode. I was handling unicode fine, but when I ran my code this morning one item, u'\u201d', raised
UnicodeError: ASCII encoding error: ordinal not in range(128)
I looked up the code point and apparently it's UTF-32, but when I try to decode it in the interpreter:
c = u'\u201d'
c.decode('utf-32', 'replace')
or do any other operation with it, for that matter, no codec recognizes it, even though I found it listed as "RIGHT DOUBLE QUOTATION MARK".
I get:
Traceback (most recent call last):
  File "<pyshell#154>", line 1, in <module>
    c.decode('utf-32')
  File "C:\Python27\lib\encodings\utf_32.py", line 11, in decode
    return codecs.utf_32_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 0: ordinal not in range(128)

You already have a unicode string; there is no need to decode it to a unicode string again.
What happens in that case is that Python 2 helpfully tries to encode it for you first, so that it can then be decoded from UTF-32. It uses the default encoding to do so, which happens to be ASCII. Here is an explicit encode that raises the same exception:
>>> u'\u201d'.encode('ASCII')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 0: ordinal not in range(128)
In short, when you have a unicode literal such as u'\u201d', there is no need to decode it.
Read up on Unicode, encodings, and default settings in the Python Unicode HOWTO. Another invaluable article on the subject is Joel Spolsky's Minimum Unicode knowledge post.
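To make the direction of each call concrete, here is a minimal sketch (the variable names are illustrative): the character is looked up by name, encoded to UTF-32 bytes, and only those bytes are decoded.

```python
import unicodedata

c = u'\u201d'  # already text, nothing to decode

# The character database confirms which character this is.
name = unicodedata.name(c)

# encode: text -> bytes. Only the resulting bytes can be decoded.
as_utf32 = c.encode('utf-32')
round_trip = as_utf32.decode('utf-32')

print(name)
print(round_trip == c)
```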

Confusion about Python Decode method

I'm trying to run the command u'\xe1'.decode("utf-8") in python and I get this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
Why does it say I'm trying to decode ascii when I'm passing utf-8 as the first argument? In addition to this, is there any way I can get the character á from u'\xe1' and save it in a string?

decode takes a byte string and converts it to unicode (e.g. "\xc2\xb0".decode("utf8") ==> u"\xb0")
encode takes unicode and converts it to a byte string (e.g. u"\xb0".encode("utf8") ==> "\xc2\xb0")
Neither has much to do with how a string is rendered; they only change its internal representation.
Try
print u"\xe1"
(your terminal needs to support Unicode; IDLE will work, a DOS terminal not so much)
>>> print u"\xe1"
á
>>> print repr(u"\xe1".encode("utf8"))
'\xc3\xa1'
>>> print repr("\xc3\xa1".decode("utf8"))
u'\xe1'
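As for saving the character: one way, sketched here under the assumption that the target is a UTF-8 text file (the file name and the temporary directory are made up for the example), is to let io.open do the encoding on write.

```python
import io
import os
import tempfile

text = u"\xe1"  # the character á as a unicode object

# Hypothetical output path inside a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# io.open encodes on write, so no manual .encode() call is needed.
with io.open(path, "w", encoding="utf8") as handle:
    handle.write(text)

# Reading the file back in binary mode shows the UTF-8 bytes on disk.
with io.open(path, "rb") as handle:
    raw = handle.read()

print(repr(raw))
```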

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)

I'm having trouble encoding characters in UTF-8. I'm using Django, and I get this error when I try to send an Android notification containing non-ASCII text. I tracked the error down and found that its source is not in my project.
In the Python shell, I type:
'ç'.encode('utf8')
and I get this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
I get the same errors with:
'á'.encode('utf-8')
unicode('ç')
'ç'.encode('utf-8','ignore')
I get errors with smart_text, force_text and smart_bytes too.
Is that a problem with Python, my OS, or another thing?
I'm running Python 2.6.6 on a Red Hat version 4.4.7-3

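You're trying to encode byte strings, not Unicode strings. Calling .encode() on a byte string makes Python 2 decode it with the ASCII codec first, and that implicit decode is what fails. The following statements do work: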
u'ç'.encode('utf8')
u'á'.encode('utf-8')
unicode(u'ç')
u'ç'.encode('utf-8','ignore')

Use u'...'; without the u prefix it is a byte string, not a unicode string:
>>> u'ç'.encode('utf8')
'\xc3\xa7'
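If the text arrives as a byte string that you cannot rewrite as a u'...' literal, it has to be decoded with the encoding it actually uses before it can be re-encoded. The byte 0xe7 in the error message is how ç is encoded in Latin-1, so the sketch below assumes that encoding. That is a guess about the asker's terminal, and to_utf8 is a hypothetical helper, not a Django function.

```python
def to_utf8(raw, source_encoding="latin-1"):
    """Re-encode a byte string from source_encoding to UTF-8.

    source_encoding is an assumption; pass the real one if it is known.
    """
    text = raw.decode(source_encoding)  # bytes -> unicode
    return text.encode("utf-8")         # unicode -> UTF-8 bytes


# 0xe7 is the byte reported in the traceback above.
result = to_utf8(b"\xe7")
print(repr(result))
```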
