convert unicode ucs4 into utf8 - python

I have a value like u'\U00000958' being returned back from a database and I want to convert this string to utf8. I try something like this:
cp = u'\\U00000958'
value = cp.decode('unicode-escape').encode('utf-8')
print 'Value: " + value
I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position
0: ordinal not in range(128)
What can I do to properly convert this value?
More Detail. I'm in 2.7.10 which uses ucs2.

For unicode issues, it often helps to specify python 2 versus python 3, and also how one came to get a particular representation.
It is not clear from the first sentence what the actual value is, as opposed to how it is displayed. It is unclear if the value like u'\\U00000958' is a 1 char unicode string, a 10 char unicode string, a 14 char (ascii) byte string, or something else. Use of len and type can be used to be sure of what you have.
By trying to decode cp, you are implying that you know that cp is bytes, but what encoding? The error message says that it is not ascii bytes. 0xe0 is a typical start byte for utf-8 encoding. The following interaction
>>> s = "u'\\U00000958'"
>>> se = eval(s)
>>> se
u'\u0958'
>>> se.encode(encoding='utf-8')
'\xe0\xa5\x98'
>>>
suggests to me that cp, starting with \xe0 is 3 utf-8 encoded bytes and that u'\\U00000958' is an evaluable representation of its unicode decoding.

Related

Python unicode accent a (à) hex

I have a string from bs4 that is
s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
\u00c3\u00a0should be accent a (à) I have gotten it to show up in the console partly correct as
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
with
str2 = u'%s' % s
print(str2.encode('utf-8').decode('unicode-escape'))
but it's decoding c3 and a0 separately, so I get a tilde A instead of an accent a. I know that c3 a0 is the hex utf-8 for accent a. I have no idea what's going on and I got to here using Google and the combinatory approach to the answers I got. This entire character encoding thing seems like a big mess to me.
The way it is supposed to be is
311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
EDIT:
Andrey's method worked when printing it out, but trying to use urlopen with the string I get UnicodeEncodeError: 'ascii' codec can't encode character '\xe0' in position 60: ordinal not in range(128)
After using unquote(str,":/") it gives UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128).
Transform the string back into bytes using .encode('latin-1'), then decode the unicode-escapes \u, transform everything into bytes again using the "wrong" 'latin-1' encoding, and finally, decode "properly" as 'utf-8':
s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
s.encode('latin-1').decode('raw_unicode_escape').encode('latin-1').decode('utf-8')
gives:
'vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html'
It works for the same reason as explained in this answer.
Assuming Python 2:
This is a byte string with Unicode escapes. The Unicode escapes were incorrectly generated for some UTF-8-encoded data:
>>> s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
>>> s.decode('unicode-escape')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'
Now it is a Unicode string but now appears mis-decoded since the code points resemble UTF-8 bytes. It turns output the latin1 (also iso-8859-1) codec maps the first 256 code points directly to bytes 0-255, so use this trick to convert back to a byte string:
>>> s.decode('unicode-escape').encode('latin1')
'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'
Now it can be decoded correctly as UTF-8:
>>> s.decode('unicode-escape').encode('latin1').decode('utf8')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xe0-la-me-creatura.html'
It is a Unicode string, so Python displays its repr() value, which shows code points above U+007F as escape codes. print it to see the actual value assuming your terminal is correctly configured with an encoding that supports the characters printed:
>>> print(s.decode('unicode-escape').encode('latin1').decode('utf8'))
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
Ideally, fix the problem that generated this string incorrectly in the first place instead of working around the mess.

why does pythons `s.encode('ascii', 'replace')` fails encoding

Why does using replace here:
s = s.encode('ascii', 'replace')
Give me this error?:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 6755: ordinal not in range(128)
Isn't the whole point of 'replace' or 'ignore' to not fail when it can't decode a byte. Am I not understanding this?
(sorry I can't provide the actual string, the corpus is very large)
In any case, how do I tell python to ignore or replace characters that aren't ascii?
Note that you're getting a UnicodeDecodeError, not a UnicodeEncodeError.
That's because s.encode() takes a unicode string as input, but in this case you're not giving it one; you're giving it a bytestring instead.
Thus, it's encoding the bytestring you're handing it to unicode before trying to decode it, and it's in that initial encode that the error occurs.
This three-way round-trip is silly, but if you really wanted to do it:
s_bytes = '\xcb' # standard Python 2 string, aka a Python 3 bytestring
s_unicode = s_bytes.decode('ascii', 'replace') # a unicode string now
s_ascii = s_unicode.encode('ascii', 'replace') # a bytestring again

How to allow encode('utf-8') twice without getting error in python?

I have a legacy code segment that always encode('utf-8') for me when I pass in an unicode string (directly from database), is there a way to change unicode string to other format to allow it to be encoded to 'utf-8' again without getting an error, since I am not allowed to change the legacy code segment.
I've tried decoding it first but it returns this error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
If I leave the unicode string as is it returns
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 986: ordinal not in range(128)
If I change the legacy code to not encode('utf-8') it works, but this is not a viable option
Edit:
Here is the code snippet
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
if __name__ == "__main__":
# 1
a = u'贸易'
# 2
a = a.decode('utf-8')
# 3
a.encode('utf-8')
For some reason if I skip #2 I don't get the error that I mentioned above, I double check the type for the string, it seems like both is unicode, and both is the same character, but the code I am working on does not allow me to encode or decode to utf-8 , while the same character in some snippet allows me to do that.
Consider the following cases:
If you want a unicode string, and you already have a unicode string, you need do nothing.
If you want a bytestring, and you already have a bytestring, you need do nothing.
If you have a unicode string and want a bytestring, you encode it.
If you have a bytestring and want a unicode string, you decode it.
In none of these cases is it appropriate to encode or decode more than once.
In order for encode('utf-8') to make sense, the string must be a unicode string (or contain all-ASCII characters...). So, unless it's a unicode instance already, you have to decode it first from whatever encoding it's in to a unicode string, after which you can pass it into your legacy interface.
At no point does it make sense for anything to be double-encoded -- encoding takes a string and transforms it to a series of bytes; decoding takes a series of bytes and transforms them back into a string. The confusion only arises because Python 2 uses the str for both plain-ASCII strings and byte sequences.
>>> u'é'.encode('utf-8') # unicode string
'\xc3\xa9' # bytes, not unicode string
>>> '\xc3\xa9'.decode('utf-8')
u'\xe9' # unicode string
>>> u'\xe9' == u'é'
True

encode and decode for a specific character set

There is no difference for the printing results, what is the usage of encoding and decoding for utf-8?
And is it encode('utf8') or encode('utf-8')?
u ='abc'
print(u)
u=u.encode('utf-8')
print(u)
uu = u.decode('utf-8')
print(uu)
str.encode encodes the string (or unicode string) into a series of bytes. In Python 3 this is a bytearray, in Python 2 it's str again (confusingly). When you encode a unicode string, you are left with bytes, not unicode—remember that UTF-8 is not unicode, it's an encoding method that can turn unicode codepoints into bytes.
str.decode will decode the serialized byte stream with the selected codec, picking the proper unicode codepoints and giving you a unicode string.
So, what you're doing in Python 2 is: 'abc' > 'abc' > u'abc', and in Python 3 is:
'abc' > b'abc' > 'abc'. Try printing repr(u) or type(u) in addition to see what's changing where.
utf_8 might be the most canonical, but it doesn't really matter.
Usually Python will first try to decode it to unicode before it can encode it back to UTF-8.There are encording which doesnt have anything to do with the character sets which can be applied to 8 bit strings
For eg
data = u'\u00c3' # Unicode data
data = data.encode('utf8')
print data
'\xc3\x83' //the output.
Please have a look through here and here.It would be helpful.

converting string to unicode type in python

I'm trying this code:
s = "سلام"
'{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))
but this error occurs:
'{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd3 in position 0: ordinal not in range(128)
I tried '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16)) but nothing changed.
what should I do?
Since you're using python 2, s = "سلام" is a byte string (in whatever encoding your terminal uses, presumably utf8):
>>> s = "سلام"
>>> s
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
You cannot encode byte strings (as they are already "encoded"). You're looking for unicode ("real") strings, which in python2 must be prefixed with u:
>>> s = u"سلام"
>>> s
u'\u0633\u0644\u0627\u0645'
>>> '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))
'1101100010110011110110011000010011011000101001111101100110000101'
If you're getting a byte string from a function such as raw_input then your string is already encoded - just skip the encode part:
'{:b}'.format(int(s.encode('hex'), 16))
or (if you're going to do anything else with it) convert it to unicode:
s = s.decode('utf8')
This assumes that your input is UTF-8 encoded, if this might not be the case, check sys.stdin.encoding first.
i10n stuff is complicated, here are two articles that will help you further:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

Categories