Python - Can't concatenate more than 1 non-ASCII string

I'm trying to create a new string containing more than 1 string with special characters in it. This doesn't work:
# -*- coding: utf-8 -*-
str1 = "I am"
str2 = "español"
str3 = "%s %s %s" % (str1, u'–', str2)
print str3
>> Traceback (most recent call last):
File "myscript.py", line 5, in <module>
str3 = "%s %s %s" % (str1, u'–', str2)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
The strange thing is that if I delete the ñ or the – character, it creates the string correctly:
# -*- coding: utf-8 -*-
str1 = "I am"
str2 = "espaol"
str3 = "%s %s %s" % (str1, u'–', str2)
print str3
>> I am – espaol
or:
# -*- coding: utf-8 -*-
str1 = "I am"
str2 = "español"
str3 = "%s %s" % (str1, str2)
print str3
>> I am español
What is wrong with it?

You are mixing Unicode strings and byte strings. Don't do that. Make sure all your strings are of the same type. Preferably, that's unicode.
When mixing str and unicode, Python implicitly will decode or encode one or the other type using the ASCII codec. Avoid implicit operations by explicitly encoding or decoding to make everything one type.
This is what is causing your UnicodeDecodeError exception; you are mixing a unicode object (u'–') with two str objects (byte strings, str1 and str2), so Python implicitly decodes the byte strings as ASCII, but only str1 can be decoded as ASCII. str2 contains UTF-8 data (the two bytes that encode ñ), and thus decoding fails. Explicitly creating unicode strings or decoding your data makes things work:
str1 = u"I am" # Unicode strings
str2 = u"español" # Unicode strings
str3 = u"%s %s %s" % (str1, u'–', str2)
print str3
or
str1 = "I am"
str2 = "español"
str3 = u"%s %s %s" % (str1.decode('utf-8'), u'–', str2.decode('utf-8'))
print str3
Note that I used a Unicode string literal for the formatting string too!
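For comparison, here is a minimal sketch of the same "decode first, then combine" rule written for Python 3, where byte strings are the separate bytes type. The helper name to_text is hypothetical (it is not part of the original answer), and UTF-8 is assumed as the source encoding:

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string.

    to_text is a hypothetical helper for illustration; the default
    encoding is an assumption about where the bytes came from.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Byte strings, as they might arrive from a file or a socket.
str1 = b"I am"
str2 = "espa\u00f1ol".encode("utf-8")

# Decode every piece explicitly, then format using a text template.
str3 = "%s %s %s" % (to_text(str1), "\u2013", to_text(str2))
print(str3)
```

The point of the design is that decoding happens once, at the boundary, so the formatting step only ever sees text.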
You really should read up on Unicode and codecs and Python. I strongly recommend the following articles:
Ned Batchelder's Pragmatic Unicode
Joel Spolsky's The Absolute Minimum Every Programmer Must Know About Unicode
The Python Unicode HOWTO

Related

Python and encoding, again

I have the following code snippet in Python (2.7.8) on Windows:
text1 = 'áéíóú'
text2 = text1.encode("utf-8")
and I get the following exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
Any ideas?
You forgot to specify that you are dealing with a unicode string:
text1 = u'áéíóú' #prefix string with "u"
text2 = text1.encode("utf-8")
In Python 3 this behavior has changed: every string literal is a Unicode string, so you don't need the prefix.
I have tried the following code in Linux with Python 2.7:
>>> text1 = 'áéíóú'
>>> text1
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
>>> type(text1)
<type 'str'>
>>> text1.decode("utf-8")
u'\xe1\xe9\xed\xf3\xfa'
>>> print '\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
áéíóú
>>> print u'\xe1\xe9\xed\xf3\xfa'
áéíóú
>>> u'\xe1\xe9\xed\xf3\xfa'.encode('utf-8')
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba is the UTF-8 encoding of áéíóú, and \xe1\xe9\xed\xf3\xfa are the Unicode code points of áéíóú.
text1 is a UTF-8 encoded byte string, so it can only be turned into a unicode object by decoding it:
text1.decode("utf-8")
A unicode string can be encoded to a UTF-8 byte string:
u'\xe1\xe9\xed\xf3\xfa'.encode('utf-8')
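The round trip above can be sketched in Python 3 as follows. The second half is an assumption about the question, not something the asker confirmed: byte 0xa0 is how the cp850 console code page (common on Windows) encodes á, which would explain why the error mentions 0xa0 rather than 0xc3.

```python
text = "\u00e1\u00e9\u00ed\u00f3\u00fa"  # the text from the question

# Text -> bytes -> text: encode and decode are inverse operations.
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")
print(len(text), len(encoded))  # characters versus bytes

# Assumption: the Windows console used code page 850, where the
# first character of the text is the single byte 0xa0.
console_bytes = text[0].encode("cp850")

# Python 2's str.encode() first decodes the bytes as ASCII, which fails.
try:
    console_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)
```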

UnicodeDecodeError: 'ascii' codec can't decode '\xc3\xa8' together with '\xe8'

I am having this strange problem below:
>>> a=u'Pal-Andr\xe8'
>>> b='Pal-Andr\xc3\xa8'
>>> print "%s %s" % (a,b) # boom
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)
>>> print "%s" % a
Pal-Andrè
>>> print "%s" % b
Pal-Andrè
I can print a and b separately, but not both together.
What's the problem? How can I print them both?
The actual problem is
b = 'Pal-Andr\xc3\xa8'
Here, b is a plain (byte) string literal, not a unicode literal. So when you format them separately, the result for a is a unicode string and the result for b is a byte string.
>>> "%s" % a
u'Pal-Andr\xe8'
>>> "%s" % b
'Pal-Andr\xc3\xa8'
Note that the u prefix is missing from the second result. You can confirm this with type:
>>> type("%s" % b)
<type 'str'>
>>> type("%s" % a)
<type 'unicode'>
But when you format them together, the result is promoted to a unicode string, so Python 2 implicitly decodes b with the ASCII codec. \xc3 is not a valid ASCII byte, and that is why the code fails.
To fix it, declare b as a unicode literal too. Note that a unicode literal must contain the code point \xe8, not the UTF-8 bytes \xc3\xa8 (those would be read as two separate characters):
>>> a=u'Pal-Andr\xe8'
>>> b=u'Pal-Andr\xe8'
>>> "%s" % a
u'Pal-Andr\xe8'
>>> "%s" % b
u'Pal-Andr\xe8'
>>> "%s %s" % (a, b)
u'Pal-Andr\xe8 Pal-Andr\xe8'
I am not sure what the real issue is here, but one thing is certain: a is a unicode string and b is a byte string.
You will have to encode or decode one of them before printing them both.
Here is an example:
>>> b = b.decode('utf-8')
>>> print u"%s %s" % (a,b)
Pal-Andrè Pal-Andrè
Having a mix of Unicode and byte strings makes the combined print try to promote everything to Unicode strings. You've got to decode the byte string with the correct codec, else Python 2 will default to ascii. b is a byte string encoded in UTF-8. The format string is promoted as well, but it happens to work decoded from ASCII. Best to use Unicode everywhere:
>>> print u'%s %s' % (a,b.decode('utf8'))
Pal-Andrè Pal-Andrè
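To make the implicit step visible, here is a small Python 3 sketch that performs by hand the ASCII decode that Python 2 attempts silently, and then the explicit UTF-8 decode that fixes it. The variable names failure and combined are illustrative only:

```python
a = "Pal-Andr\xe8"        # text (what Python 2 calls unicode)
b = b"Pal-Andr\xc3\xa8"   # UTF-8 encoded bytes (Python 2's str)

# What Python 2 does implicitly when the two are mixed: decode as ASCII.
try:
    b.decode("ascii")
except UnicodeDecodeError as exc:
    # Record which codec failed, where, and on which byte.
    failure = (exc.encoding, exc.start, exc.object[exc.start:exc.start + 1])
else:
    failure = None

# The explicit fix: decode with the codec the bytes were written in.
combined = "%s %s" % (a, b.decode("utf-8"))
print(failure)
print(combined)
```

The failing position reported here matches the "position 8" in the traceback from the question, since the first eight bytes of b are plain ASCII.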

Convert GBK to utf8 string in python

I have a string.
s = u"<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
How can I translate s into a utf-8 string? I have tried s.decode('gbk').encode('utf-8') but python reports error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 35-50: ordinal not in range(128)
In Python 2, try this to convert your unicode string:
>>> s.encode('latin-1').decode('gbk')
u"<script language=javascript>alert('\u8bf7\u8f93\u5165\u6b63\u786e\u9a8c\u8bc1\u7801,\u8c22\u8c22!');location='index.asp';</script></script>"
then you can encode to utf-8 as you wish.
>>> s.encode('latin-1').decode('gbk').encode('utf-8')
"<script language=javascript>alert('\xe8\xaf\xb7\xe8\xbe\x93\xe5\x85\xa5\xe6\xad\xa3\xe7\xa1\xae\xe9\xaa\x8c\xe8\xaf\x81\xe7\xa0\x81,\xe8\xb0\xa2\xe8\xb0\xa2!');location='index.asp';</script></script>"
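Why this works: Latin-1 maps the code points U+0000 to U+00FF one-to-one onto the bytes 0x00 to 0xFF, so encoding the mis-decoded text as Latin-1 recovers the original GBK bytes unchanged. A minimal Python 3 sketch of the same repair, using only the message part of the string (the variable names are illustrative):

```python
# GBK bytes that were wrongly stored as code points U+0000..U+00FF.
mojibake = ("\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9"
            "\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!")

# Step 1: Latin-1 turns each code point back into the byte of the same value.
raw_bytes = mojibake.encode("latin-1")

# Step 2: decode those bytes with the codec they were really written in.
text = raw_bytes.decode("gbk")

# Step 3: encode to UTF-8 for output or storage.
utf8_bytes = text.encode("utf-8")
print(text)
```

This only works when every character of the damaged string is below U+0100; anything else makes the Latin-1 encode step raise UnicodeEncodeError.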
You are mixing apples and oranges. The GBK-encoded string is not a Unicode string and should hence not end up in a u'...' string.
This is the correct way to do it in Python 2.
g = '\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,' \
'\xd0\xbb\xd0\xbb!'.decode('gbk')
s = u"<script language=javascript>alert('" + g + \
u"');location='index.asp';</script></script>"
Notice how the initializer for g which is passed to .decode('gbk') is not represented as a Unicode string, but as a plain byte string.
See also http://nedbatchelder.com/text/unipain.html
If you can keep the alert in a separate string "a":
a = '\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!'.decode("gbk")
s = u"<script language=javascript>alert('"+a+"');location='index.asp';</script></script>"
print s
Then it will print:
<script language=javascript>alert('请输入正确验证码,谢谢!');location='index.asp';</script></script>
If you want to automatically extract the substring in one go:
s = "<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
s = unicode("'".join((s.decode("gbk").split("'",2))))
print s
will print:
<script language=javascript>alert('请输入正确验证码,谢谢!');location='index.asp';</script></script>
Take a look at unicodedata but I think one way to do this is:
import unicodedata
s = u"<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
unicodedata.normalize('NFKD', s).encode('utf-8','ignore')
I had the same problem, with a string like this:
name = u'\xb9\xc5\xbd\xa3\xc6\xe6\xcc\xb7'
I want to convert it to
u'\u53e4\u5251\u5947\u8c2d'
Here is my solution:
new_name = name.encode('iso-8859-1').decode('gbk')
I also tried it with your string:
s = u"alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';"
print s
alert('ÇëÊäÈëÕýÈ·ÑéÖ¤Âë,лл!');location='index.asp';
Then:
_s = s.encode('iso-8859-1').decode('gbk')
print _s
alert('请输入正确验证码,谢谢!');location='index.asp';
Hope this helps.

Regex sub function not working with unicode string

I'm trying to use Python's sub function and I'm having a problem getting it to work. From the troubleshooting I've been doing I believe it has something to do with the unicode characters in the string.
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import re
def someFunction(string):
    string = string.decode('utf-8')
    match = re.search(ur'éé', string)
    if match:
        print >> sys.stderr, "It was found"
    else:
        print >> sys.stderr, "It was NOT found"
    if isinstance(string, str):
        print >> sys.stderr, 'string is a string object'
    elif isinstance(string, unicode):
        print >> sys.stderr, 'string is a unicode object'
    new_string = re.sub(ur'éé', ur'é:', string)
    return new_string
stringNew = 'éégktha'
returnedString = someFunction(stringNew)
print >> sys.stderr, "After printing it: " + returnedString
#At this point in the code string = 'éégktha'
returnString = someFunction(string)
print >> sys.stderr, "After printing it: " + returnedString
So I would like 'é:gktha'. Below is what is printed to the error log when I run this code.
It was found
string is a unicode object
é:gktha
It was NOT found
string is a unicode object
éégktha
So I'm thinking it must be something with the string that is passed into my function. When I declare it as a unicode string, or as a string literal that I then decode, the pattern is found. But the pattern is not being found in the string being passed in. I was thinking my string = string.decode('utf-8') statement would convert any string passed into the function and then it would work.
I tried to do this in the python interpreter to work through this and when I declare string as a unicode string it works.
string = u'éégktha'
So to simulate the function, I declared the string, decoded it, and then tried my regex statement, and it worked.
string = 'éégktha'
newString = string.decode('utf8')
string = re.sub(ur'éé', ur'é:', newString)
print string #é:gktha
This is a web app that works with a lot of unicode characters. This is Python 2.5 and I've always had a hard time when working with unicode characters. Any help and knowledge is greatly appreciated.
You should print what is returned by someFunction.
>>> string = 'éégktha'
>>> def someFunction(string):
...     #string = 'éégktha'
...     string = string.decode('utf8')
...     new_string = re.sub(ur'éé', ur'é:', string)
...     return new_string
>>> import re
>>> someFunction(string)
u'\xe9:gktha'
>>> print someFunction(string)
é:gktha
Your function is fine. In the simulation you are using print, which shows what __str__ returns, whereas when you only evaluate the return value, the interpreter shows what __repr__ returns for new_string/newString.
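The difference between the two displays can be sketched in Python 3 as follows. One caveat: in Python 3, repr() leaves printable non-ASCII characters alone, so the built-in ascii() is used here to get the escaped form that the Python 2 interpreter shows. The function name replace_double_e is hypothetical, and UTF-8 input is assumed:

```python
import re


def replace_double_e(data):
    """Decode UTF-8 bytes if needed, then replace the doubled letter.

    Hypothetical Python 3 counterpart of someFunction from the question.
    """
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    return re.sub("\u00e9\u00e9", "\u00e9:", data)


result = replace_double_e("\u00e9\u00e9gktha".encode("utf-8"))

print(result)         # human-readable form, what print shows
print(ascii(result))  # escaped form, like the Python 2 interpreter echo
```

Both lines describe the same string; only the presentation differs.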

String to unicode code point escape sequences in Python

Here is my problem... I have a "normal" String like :
Hello World
And unlike all the other subjects I have found, I WANT to print it as its Unicode code point escape values!
The output I am looking for is something like this:
\u0015\u0123
If anyone has an idea :)
You are encoding ASCII codepoints only. UTF-8 is a superset of ASCII, any ASCII codepoints are encoded to the same bytes as ASCII would use. What you are printing is correct, that is UTF-8.
Use some non-ASCII codepoints to see the difference:
>>> 'Hello world with an em-dash: \u2014\n'.encode('utf8')
b'Hello world with an em-dash: \xe2\x80\x94\n'
Python will just use the characters themselves when it shows you a bytes value with printable ASCII bytes in it. Any byte value that is not printable is shown as a \x.. escape code, or a single-character escape sequence if there is one (\n for newline).
From your example output, on the other hand, you seem to be expecting to output Python unicode literal escape codes:
>>> '\u0015\u0123'
'\x15ģ'
Since U+0123 is printable, Python 3 just shows it; the non-printable U+0015 (NEGATIVE ACKNOWLEDGE) is a codepoint in the 0x00-0xFF range and is shown using the shorter \x.. escape notation.
To show only unicode escape sequences for your text, you need to process it character by character:
>>> input_text = 'Hello World!'
>>> print(''.join('\\u{:04x}'.format(ord(c)) for c in input_text))
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064\u0021
>>> input_text = 'Hello world with an em-dash: \u2014\n'
>>> print(''.join('\\u{:04x}'.format(ord(c)) for c in input_text))
\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u006f\u0072\u006c\u0064\u0020\u0077\u0069\u0074\u0068\u0020\u0061\u006e\u0020\u0065\u006d\u002d\u0064\u0061\u0073\u0068\u003a\u0020\u2014\u000a
It is important to stress that this is not UTF-8, however.
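One limitation of the four-digit format above is that characters beyond U+FFFF do not fit in a \uXXXX escape. A possible extension, sketched here for Python 3 with a hypothetical function name, switches to the eight-digit \UXXXXXXXX form for those characters, so the output remains valid Python escape syntax:

```python
def to_unicode_escapes(text):
    """Return text with every character written as a Python escape.

    Hypothetical helper: uses \\uXXXX up to U+FFFF and \\UXXXXXXXX above.
    """
    parts = []
    for char in text:
        code_point = ord(char)
        if code_point <= 0xFFFF:
            parts.append("\\u{:04x}".format(code_point))
        else:
            parts.append("\\U{:08x}".format(code_point))
    return "".join(parts)


escaped = to_unicode_escapes("Hi\u2014\U0001F600")
print(escaped)

# The escapes are plain ASCII, so the unicode_escape codec can reverse them.
restored = escaped.encode("ascii").decode("unicode_escape")
```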
You can use ord to convert the encoded bytes into numbers and use string formatting to display their hex values.
>>> s = u'Hello World \u0664\u0662'
>>> print s
Hello World ٤٢
>>> print ''.join('\\x%02X' % ord(c) for c in s.encode('utf-8'))
\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x20\xD9\xA4\xD9\xA2
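The snippet above is Python 2. In Python 3, iterating over a bytes object already yields integers, so the ord call is not needed. A minimal equivalent sketch (the variable name hex_escapes is illustrative):

```python
s = "Hello World \u0664\u0662"

# Each item produced by iterating over bytes is already an int in Python 3.
hex_escapes = "".join("\\x%02X" % byte for byte in s.encode("utf-8"))
print(hex_escapes)
```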
