I have the following code snippet in Python (2.7.8) on Windows:
text1 = 'áéíóú'
text2 = text1.encode("utf-8")
and I get the following exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
Any ideas?
You forgot to specify that you are dealing with a unicode string:
text1 = u'áéíóú' #prefix string with "u"
text2 = text1.encode("utf-8")
In Python 3 this behavior has changed: every string is Unicode, so you don't need the prefix.
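As a minimal sketch of the Python 3 behavior described in this answer (assuming the source file is saved as UTF-8, which is the Python 3 default), the same two lines work without the prefix:

```python
# Python 3: a plain string literal is already a Unicode str.
text1 = 'áéíóú'
text2 = text1.encode('utf-8')  # bytes object, two bytes per accented letter

print(type(text1).__name__, type(text2).__name__)
print(text2)
```

Here encode() goes from text to bytes and never needs an implicit ASCII decode first, which is the step that fails in Python 2.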
I have tried the following code in Linux with Python 2.7:
>>> text1 = 'áéíóú'
>>> text1
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
>>> type(text1)
<type 'str'>
>>> text1.decode("utf-8")
u'\xe1\xe9\xed\xf3\xfa'
>>> print '\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
áéíóú
>>> print u'\xe1\xe9\xed\xf3\xfa'
áéíóú
>>> u'\xe1\xe9\xed\xf3\xfa'.encode('utf-8')
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba' is the UTF-8 encoding of áéíóú, and u'\xe1\xe9\xed\xf3\xfa' is the same text as Unicode code points.
text1 is a UTF-8 encoded byte string, so the only way to turn it into a unicode string is to decode it:
text1.decode("utf-8")
A unicode string can be encoded to a UTF-8 byte string:
u'\xe1\xe9\xed\xf3\xfa'.encode('utf-8')
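The byte 0xa0 in the question's error message suggests that the Windows console was not using UTF-8. The sketch below is written for Python 3 and assumes a legacy code page such as cp850 (an assumption; the question does not say which one was active). It shows how the same text produces different bytes under different codecs, and why the matching codec is needed to decode them:

```python
text = 'áéíóú'

utf8_bytes = text.encode('utf-8')    # what the Linux session above shows
cp850_bytes = text.encode('cp850')   # assumed Windows console code page

print(utf8_bytes)
print(cp850_bytes)

# Decoding only round-trips with the codec that produced the bytes.
assert utf8_bytes.decode('utf-8') == text
assert cp850_bytes.decode('cp850') == text

# ASCII cannot decode either one, which mirrors the Python 2 error.
try:
    cp850_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)
```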
Related
I have a Unicode character like 🏆 and I want to get back its \Uxxxxxxxx form, but so far I couldn't find an easy way. I already tried:
text = '🏆'
text.encode('utf-32').decode('utf-8')
returns error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
text.encode('utf-32').decode('unicode-escape')
returns ÿþ
How can I make it return \U000XXXXX? I know I can get the character from \U000XXXXX with:
string = "foo bar foo \U000XXXXX"
string.encode('utf-8').decode('unicode-escape')
returns "foo bar foo 🏆"
To get the escape as a byte string:
>>> text = '🏆'
>>> text.encode('unicode-escape')
b'\\U0001f3c6'
To get it as a Unicode string:
>>> text.encode('unicode-escape').decode('ascii')
'\\U0001f3c6'
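Note that the unicode-escape codec picks the shortest escape for each character (\xNN, \uNNNN, or \UNNNNNNNN) and leaves printable ASCII alone. If a uniform \UXXXXXXXX form is wanted for every character, one option is to format the code point directly. A minimal Python 3 sketch; to_u_escape is a hypothetical helper name, not a standard function:

```python
def to_u_escape(text):
    """Return every character of text as an 8-digit uppercase-U escape."""
    # ord() gives the code point; format it as 8 zero-padded hex digits.
    return ''.join('\\U{:08x}'.format(ord(ch)) for ch in text)

print(to_u_escape('🏆'))
print(to_u_escape('a🏆'))
```

The result is pure ASCII, so it can be turned back into the original text with .encode('ascii').decode('unicode-escape').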
Now I have a string like "\\u0061", which is interpreted as six separate characters (a backslash, u, 0, 0, 6, 1). How can I convert it into the single Unicode character 'a'?
You're looking for the unicode-escape codec:
>>> import codecs
>>> print(r'\u2603')
\u2603
>>> print(codecs.decode(r'\u2603', 'unicode-escape'))
☃
Even easier:
>>> "\\u0061".encode().decode('unicode-escape')
'a'
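One caveat with the shorter form: encode() produces UTF-8, but the unicode-escape decoder reads non-escaped bytes as Latin-1, so any non-ASCII character already in the string gets mangled. A hedged sketch of one workaround in Python 3, where unescape is a hypothetical helper name:

```python
def unescape(text):
    """Decode backslash escapes in text without mangling non-ASCII characters."""
    # latin-1 keeps code points 0-255 as single bytes, which is what the
    # unicode-escape decoder expects; anything else becomes an ASCII escape.
    return text.encode('latin-1', 'backslashreplace').decode('unicode-escape')

# The naive approach turns the two UTF-8 bytes of "é" into two characters.
naive = 'café \\u0061'.encode().decode('unicode-escape')
print(ascii(naive))
print(ascii(unescape('café \\u0061')))
```

For strings that contain only ASCII plus escapes, both approaches give the same result.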
I have a problem trying to encode non-ASCII characters.
I have this function:
#function to treat special characters
tagsA=["À","Á","Â","à","á","â","Æ","æ"]
tagsC=["Ç","ç"]
tagsE=["È","É","Ê","Ë","è","é","ê","ë"]
tagsI=["Ì","Í","Î","Ï","ì","í","î","ï"]
tagsN=["Ñ","ñ"]
tagsO=["Ò","Ó","Ô","Œ","ò","ó","ô","œ"]
tagsU=["Ù","Ú","Û","Ü","ù","ú","û","ü"]
tagsY=["Ý","Ÿ","ý","ÿ"]
def toASCII(word):
    for i in range (0, len(word),1):
        if any(word[i] in s for s in tagsA):
            word[i]="a"
        if any(word[i] in s for s in tagsC):
            word[i]="c"
        if any(word[i] in s for s in tagsE):
            word[i]="e"
        if any(word[i] in s for s in tagsI):
            word[i]="i"
        if any(word[i] in s for s in tagsN):
            word[i]="n"
        if any(word[i] in s for s in tagsO):
            word[i]="o"
        if any(word[i] in s for s in tagsU):
            word[i]="u"
        if any(word[i] in s for s in tagsY):
            word[i]="y"
    print word
    return word
I usually get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
I tried changing the source encoding to UTF-8 with the declaration below, but it doesn't fix the issue.
# -*- coding: utf-8 -*-
You can use the unicodedata module to remove all the accents from a string.
Example:
import unicodedata
print unicodedata.normalize('NFKD', u"ÀÁ").encode('ASCII', 'ignore')
Output:
AA
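For Python 3, a variant of the same idea can filter out the combining marks instead of encoding to ASCII with 'ignore'. This is a sketch under that assumption, not the answer's exact method; strip_accents is a hypothetical helper name:

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks after NFKD decomposition; keep everything else."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('ÀÁçñü'))
```

One difference worth knowing: letters such as Æ and Œ have no decomposition, so this variant leaves them unchanged, whereas encode('ASCII', 'ignore') silently drops them. Neither approach maps them to "a" or "o" as the question's lists do.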
I want to generate a list of all UTF-8 characters.
I wrote the code below, but it doesn't work as expected.
I think the reason is that chr() expects a Unicode code point, but I gave it the UTF-8 byte value.
I think I have to convert the UTF-8 byte value to a Unicode code point, but I don't know how.
How can I do this? Or do you know a better way?
def utf8_2byte():
    characters = []
    # first byte range: [C2-DF]
    for first in range(0xC2, 0xDF + 1):
        # second byte range: [80-BF]
        for second in range(0x80, 0xBF + 1):
            num = (first << 8) + second
            line = [hex(num), chr(num)]
            characters.append(line)
    return characters
I expect:
# UTF8 code number, UTF8 character
[0xc380,À]
[0xc381,Á]
[0xc382,Â]
But I actually get:
[0xc380,쎀]
[0xc381,쎁]
[0xc382,쎂]
In Python 3, chr() takes Unicode code points, not UTF-8 byte values. U+C380 is in the Hangul range. Instead, you can build a bytearray and decode it:
>>> bytearray((0xc3, 0x80)).decode('utf-8')
'À'
There are other methods too, such as struct or ctypes. Anything that assembles the integer values into bytes will do.
Unicode is a character set, while UTF-8 is an encoding: an algorithm that converts Unicode code points to bytes and back.
The code point U+C380 is 쎀 in the Unicode standard.
The bytes 0xC3 0x80 are À when decoded as UTF-8.
>>> s = "쎀"
>>> hex(ord(s))
'0xc380'
>>> b = bytes.fromhex("C3 80")
>>> b
b'\xc3\x80'
>>> b.decode("utf8")
'À'
>>> bytes((0xc3, 0x80)).decode("utf8")
'À'
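Putting the two answers together, the question's function could be adapted roughly as follows. This is a Python 3 sketch that keeps the question's loop bounds and swaps chr() for a bytes decode:

```python
def utf8_2byte():
    """List every two-byte UTF-8 sequence with the character it decodes to."""
    characters = []
    for first in range(0xC2, 0xDF + 1):        # lead byte range: [C2-DF]
        for second in range(0x80, 0xBF + 1):   # continuation byte range: [80-BF]
            raw = bytes((first, second))
            characters.append([raw.hex(), raw.decode('utf-8')])
    return characters

table = utf8_2byte()
print(len(table))
print(ascii(table[64]))
```

Every combination in these two ranges is a valid sequence, so no error handling is needed here. Three-byte and four-byte sequences have gaps (surrogates, overlong forms), so a similar loop for those would need to catch UnicodeDecodeError.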
I have a string.
s = u"<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
How can I convert s into a UTF-8 string? I have tried s.decode('gbk').encode('utf-8'), but Python reports an error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 35-50: ordinal not in range(128)
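In Python 2, calling .decode() on a unicode object first encodes it implicitly with the ASCII codec, which is where the error comes from. To convert your unicode string, encode it as latin-1 to recover the original bytes, then decode those as GBK: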
>>> s.encode('latin-1').decode('gbk')
u"<script language=javascript>alert('\u8bf7\u8f93\u5165\u6b63\u786e\u9a8c\u8bc1\u7801,\u8c22\u8c22!');location='index.asp';</script></script>"
Then you can encode to UTF-8 as you wish:
>>> s.encode('latin-1').decode('gbk').encode('utf-8')
"<script language=javascript>alert('\xe8\xaf\xb7\xe8\xbe\x93\xe5\x85\xa5\xe6\xad\xa3\xe7\xa1\xae\xe9\xaa\x8c\xe8\xaf\x81\xe7\xa0\x81,\xe8\xb0\xa2\xe8\xb0\xa2!');location='index.asp';</script></script>"
You are mixing apples and oranges. The GBK-encoded string is not a Unicode string and should hence not end up in a u'...' string.
This is the correct way to do it in Python 2.
g = ('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,'
     '\xd0\xbb\xd0\xbb!').decode('gbk')
# Parentheses let the expression continue across lines.
s = (u"<script language=javascript>alert('" + g +
     u"');location='index.asp';</script></script>")
Notice how the initializer for g which is passed to .decode('gbk') is not represented as a Unicode string, but as a plain byte string.
See also http://nedbatchelder.com/text/unipain.html
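The same distinction carries over to Python 3, where bytes and str are separate types. The sketch below is an illustration under that assumption and reuses the message bytes from the question. It shows both the direct decode and the latin-1 round trip for text that was already mis-decoded:

```python
# Case 1: the data is still bytes, so decode it with the right codec directly.
raw = b'\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!'
text = raw.decode('gbk')

# Case 2: the bytes were already mis-read as individual code points.
garbled = raw.decode('latin-1')
# latin-1 maps code points 0-255 back to the same byte values.
repaired = garbled.encode('latin-1').decode('gbk')

utf8 = text.encode('utf-8')
print(ascii(text))
print(repaired == text)
```

The round trip in case 2 only works because latin-1 is a one-to-one mapping between bytes and the first 256 code points. If the text was mis-decoded with a different codec, that codec would be needed for the encode step.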
If you can keep the alert in a separate string "a":
a = '\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!'.decode("gbk")
s = u"<script language=javascript>alert('"+a+"');location='index.asp';</script></script>"
print s
Then it will print:
<script language=javascript>alert('请输入正确验证码,谢谢!');location='index.asp';</script></script>
If you want to decode the whole byte string in one go:
s = "<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
s = s.decode("gbk")  # the ASCII markup decodes unchanged, so no splitting is needed
print s
will print:
<script language=javascript>alert('请输入正确验证码,谢谢!');location='index.asp';</script></script>
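Take a look at unicodedata. One thing to try is the following, although note that normalization only rewrites the existing code points before encoding them; it does not reinterpret the characters as GBK: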
import unicodedata
s = u"<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
unicodedata.normalize('NFKD', s).encode('utf-8','ignore')
I had the same question, with a string like this:
name = u'\xb9\xc5\xbd\xa3\xc6\xe6\xcc\xb7'
I want to convert it to
u'\u53e4\u5251\u5947\u8c2d'
Here is my solution:
new_name = name.encode('iso-8859-1').decode('gbk')
And I tried it with your string:
s = u"alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';"
print s
alert('ÇëÊäÈëÕýÈ·ÑéÖ¤Âë,лл!');location='index.asp';
Then:
_s = s.encode('iso-8859-1').decode('gbk')
print _s
alert('请输入正确验证码,谢谢!');location='index.asp';
Hope this helps.