(python utf-8) using 'à','ç','é','è','ê','ë','î','ô','ù' - python

I am having trouble with accented characters in Python.
I wrote # -*- coding: utf-8 -*- so that Python can recognize the accents.
But sometimes it still doesn't work: I get '?', and when I use the character afterwards I get the error "SyntaxError: Non-ASCII character '\xc3'".
Why? What should I change? Thanks.
(It doesn't work for any of these characters: 'à','ç','é','è','ê','ë','î','ô','ù',"‘","’")
This is my code:
# -*- coding: utf-8 -*-
testList = ['à','ç','é','è','ê','ë','î','ô','ù',"‘","’"]
testCharacter = raw_input('test a character : ') # example : é
print(testCharacter) # getting é
print(testCharacter[0]) # getting ?
print(testCharacter + testCharacter[0]) # getting é?
testCharacterPosition = testList.index(testCharacter)
print(testCharacterPosition) #getting 2
This is the result on my console:
test a character : é
é
?
é?
2

It seems you are still using Python 2 (you should consider switching to Python 3, since Python 2 is discontinued).
If you paste a UTF-8 string into a Python 2 byte string, it stays encoded and therefore consists of multiple bytes, e.g.:
>>> s = 'à'
>>> s
'\xc3\xa0'
>>> s[0]
'\xc3'
Of course, printing one of these bytes on its own gives a question mark, since a single byte doesn't make up the full character:
>>> print(s + s[0])
à�
You can convert this to a unicode string, which then consists of one character:
>>> s.decode('utf-8')
u'\xe0'
>>> print(s.decode('utf-8'))
à
You can avoid the decode step by using unicode literals directly in Python 2:
>>> s = u'à'
>>> s
u'\xe0'
Better still, use Python 3, which simplifies the whole thing to:
>>> s = 'à'
>>> s
'à'
>>>
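The difference between characters and encoded bytes described above can be reproduced in Python 3 as well. The following is a minimal sketch, not the asker's code; the variable names are illustrative assumptions, and ascii() is used only so the output stays readable on any console.

```python
# Python 3 sketch: str holds code points, bytes holds the UTF-8 encoding.
text = "\u00e0"                     # the character 'à', written as an escape
encoded = text.encode("utf-8")      # two bytes: 0xC3 0xA0

print(len(text))      # number of characters
print(len(encoded))   # number of bytes

# Indexing the str yields a whole character; slicing the bytes yields one byte.
first_char = text[0]
first_byte = encoded[:1]

# A lone lead byte is not valid UTF-8, so with errors="replace" it decodes
# to U+FFFD, the replacement character that consoles often show as '?'.
lonely = first_byte.decode("utf-8", errors="replace")
print(ascii(lonely))

# list.index compares whole characters, as in the question's lookup.
test_list = ["à", "ç", "é", "è", "ê", "ë", "î", "ô", "ù", "‘", "’"]
position = test_list.index("é")
print(position)
```

In Python 2 the same lookup only works reliably if both the list entries and the input are unicode strings, or both are byte strings in the same encoding.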

Related

How do I replace \xc3 etc. with umlauts?

I have an output of spannkr \xc3\xa4ftig, da\xc3\x9f unser in Python. How do I replace this with umlauts?
The German characters are already there, but encoded as utf-8. If you want to see the umlauts etc in the interpreter then you can decode to str:
>>> bs = b'spannkr \xc3\xa4ftig, da\xc3\x9f unser'
>>> s = bs.decode('utf-8')
>>> print(s)
spannkr äftig, daß unser
It's possible that you are dealing with a str that somehow contains utf-8 encoded data. In this case you need to perform an extra step:
>>> s = 'spannkr \xc3\xa4ftig, da\xc3\x9f unser'
>>> bs = s.encode('raw-unicode-escape') # encode to bytes without double-encoding
>>> print(bs)
b'spannkr \xc3\xa4ftig, da\xc3\x9f unser'
>>> decoded = bs.decode('utf-8')
>>> print(decoded)
spannkr äftig, daß unser
There isn't an easy way to distinguish between incorrectly embedded spaces and the spaces between words. You would need to use some kind of spellchecker or natural language application.
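The extra encode step for a str that contains UTF-8 data can be wrapped in a small helper. This is a sketch under the assumption that every character of the damaged string is below U+0100, so that Latin-1 maps each one back to a single byte; the function name repair_utf8_mojibake is hypothetical.

```python
def repair_utf8_mojibake(text):
    """Return text with UTF-8 bytes that were stored as characters decoded properly.

    If the string does not look like mis-decoded UTF-8, it is returned unchanged.
    """
    try:
        # Latin-1 maps code points 0-255 to the same byte values, so this
        # recovers the original bytes without double-encoding them.
        raw = text.encode("latin-1")
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters above U+00FF, or the bytes are not
        # valid UTF-8; in both cases leave the input alone.
        return text


broken = "spannkr \xc3\xa4ftig, da\xc3\x9f unser"
fixed = repair_utf8_mojibake(broken)
print(ascii(fixed))
```

The helper does not touch the stray space inside the first word; as noted above, that needs a different kind of tool.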

Russian character decoding in python

This question is only about Python:
I have a city name in Russian in a string, in Unicode escape form, like
\u041C\u043E\u0441\u043A\u0432\u0430
which means
Москва
How do I get the original text instead of the Unicode escapes?
Note: do not use any imported module.
>>> a=u"\u041C\u043E\u0441\u043A\u0432\u0430"
>>> print a
Москва
The \u escapes are only interpreted inside a unicode literal, so in Python 2 you should prefix the string with u. Without the prefix it is a regular byte string, and each \uXXXX sequence counts as six separate ASCII characters:
>>> len(a)
6
>>> b="\u041C\u043E\u0441\u043A\u0432\u0430"
>>> len(b)
36
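In Python 3 every string literal is a unicode string, so the 36-character case only arises when the backslashes are literally present in the data, for example when the text was read from a file. A minimal sketch of converting such text without importing anything, assuming the escaped text itself is pure ASCII:

```python
# A raw literal keeps the backslashes, giving 36 ASCII characters.
escaped = r"\u041C\u043E\u0441\u043A\u0432\u0430"
print(len(escaped))

# Encode to bytes, then let the unicode_escape codec interpret the \u sequences.
decoded = escaped.encode("ascii").decode("unicode_escape")
print(len(decoded))
print(ascii(decoded))
```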
In addition to vz0's answer: pay attention to the script's encoding.
This file works fine:
# coding: utf-8
s = u"\u041C\u043E\u0441\u043A\u0432\u0430"
print(s)
But this one will lead to a UnicodeEncodeError:
# coding: ASCII
s = u"\u041C\u043E\u0441\u043A\u0432\u0430"
print(s)
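A UnicodeEncodeError of this kind can be reproduced by encoding explicitly. The sketch below is Python 3 and assumes the failure comes from converting the text to an ASCII-only encoding; it also shows the standard errors="backslashreplace" handler as one possible fallback.

```python
city = "\u041c\u043e\u0441\u043a\u0432\u0430"

# Cyrillic letters have no ASCII representation, so a strict encode fails.
try:
    city.encode("ascii")
    raised = False
except UnicodeEncodeError:
    raised = True
print(raised)

# With backslashreplace, unencodable characters become \uXXXX escapes instead.
fallback = city.encode("ascii", errors="backslashreplace")
print(fallback)
```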

Weird behaviour when trying to print characters of a byte string

Why does this short code behave differently from one run to another?
# -*- coding: utf-8 -*-
for c in 'aɣyul':
    print c
The outputs that I have in each run are:
# nothing
---
a
---
l
---
u
l
---
a
y
u
l
...etc
EDIT:
I know how to solve the problem; the question is just why Python prints a different part of the string, instead of the same part, on each run.
You need to add a leading u to your string, which makes Python treat it as a unicode string and handle each character as a whole when printing:
>>> for c in u'aɣyul':
...     print c
...
a
ɣ
y
u
l
Note that without the u prefix, Python 2 stores the non-ASCII character as two separate bytes, and the representation of the string shows these byte values as hex escapes:
>>> 'aɣyul'
'a\xc9\xa3yul'
  ^   ^
If you want to know why Python splits the character into two byte values: in Python 2, instances of str contain raw 8-bit values, while this character needs more than 8 bits and takes two bytes in UTF-8.
You can also decode the hex values manually:
>>> print '\xc9\xa3'.decode('utf8')
ɣ
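The same split can be observed in Python 3 by comparing a str with its UTF-8 encoding. This is an illustrative sketch with assumed variable names; it shows the byte layout only and does not explain the run-to-run differences reported in the question.

```python
text = "a\u0263yul"              # 'aɣyul', with ɣ written as an escape
data = text.encode("utf-8")

# Iterating a str yields characters; iterating bytes yields integers.
chars = list(text)
byte_values = list(data)
print(len(chars), len(byte_values))

# The two bytes after 'a' belong together and decode to the single character.
pair = data[1:3]
print(ascii(pair.decode("utf-8")))
```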

Convert GBK to utf8 string in python

I have a string.
s = u"<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
How can I translate s into a utf-8 string? I have tried s.decode('gbk').encode('utf-8') but python reports error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 35-50: ordinal not in range(128)
In Python 2, try this to convert your unicode string:
>>> s.encode('latin-1').decode('gbk')
u"<script language=javascript>alert('\u8bf7\u8f93\u5165\u6b63\u786e\u9a8c\u8bc1\u7801,\u8c22\u8c22!');location='index.asp';</script></script>"
Then you can encode to UTF-8 as you wish:
>>> s.encode('latin-1').decode('gbk').encode('utf-8')
"<script language=javascript>alert('\xe8\xaf\xb7\xe8\xbe\x93\xe5\x85\xa5\xe6\xad\xa3\xe7\xa1\xae\xe9\xaa\x8c\xe8\xaf\x81\xe7\xa0\x81,\xe8\xb0\xa2\xe8\xb0\xa2!');location='index.asp';</script></script>"
You are mixing apples and oranges. The GBK-encoded string is not a Unicode string and should hence not end up in a u'...' string.
This is the correct way to do it in Python 2.
g = '\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,' \
    '\xd0\xbb\xd0\xbb!'.decode('gbk')
# Parentheses let the expression continue onto the next line.
s = (u"<script language=javascript>alert('" + g +
     u"');location='index.asp';</script></script>")
Notice how the initializer for g which is passed to .decode('gbk') is not represented as a Unicode string, but as a plain byte string.
See also http://nedbatchelder.com/text/unipain.html
If you can keep the alert in a separate string "a":
a = '\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!'.decode("gbk")
s = u"<script language=javascript>alert('"+a+"');location='index.asp';</script></script>"
print s
Then it will print:
<script language=javascript>alert('请输入正确验证码,谢谢!');location='index.asp';</script></script>
If you want to automatically extract the substring in one go:
s = "<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
s = unicode("'".join((s.decode("gbk").split("'",2))))
print s
will print:
<script language=javascript>alert('请输入正确验证码,谢谢!');location='index.asp';</script></script>
Take a look at unicodedata but I think one way to do this is:
import unicodedata
s = u"<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
unicodedata.normalize('NFKD', s).encode('utf-8','ignore')
I had the same question, with a string like this:
name = u'\xb9\xc5\xbd\xa3\xc6\xe6\xcc\xb7'
I want to convert it to
u'\u53e4\u5251\u5947\u8c2d'
Here is my solution:
new_name = name.encode('iso-8859-1').decode('gbk')
I also tried it with yours:
s = u"alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';"
print s
alert('ÇëÊäÈëÕýÈ·ÑéÖ¤Âë,лл!');location='index.asp';
Then:
_s = s.encode('iso-8859-1').decode('gbk')
print _s
alert('请输入正确验证码,谢谢!');location='index.asp';
Hope this helps.
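For comparison, here is a Python 3 sketch of both situations discussed in this thread: GBK data that is still bytes, and GBK data that was wrongly turned into a str one character per byte. It reuses the byte values from the name example above; the variable names are assumptions.

```python
# Case 1: the data is still bytes, so decode it with the right codec.
gbk_bytes = b"\xb9\xc5\xbd\xa3\xc6\xe6\xcc\xb7"
text = gbk_bytes.decode("gbk")
utf8_bytes = text.encode("utf-8")     # re-encode as UTF-8 for output
print(len(text), len(utf8_bytes))

# Case 2: the bytes were decoded as Latin-1 by mistake (one character per byte).
mojibake = gbk_bytes.decode("latin-1")
# Latin-1 round-trips bytes 0-255 exactly, so encoding restores the raw bytes.
repaired = mojibake.encode("latin-1").decode("gbk")
print(repaired == text)
```

Keeping undecoded data as bytes until the encoding is known avoids the second case entirely.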

Python3 : unescaping non ascii characters

(Python 3.3.2) I have to unescape some non-ASCII escaped characters returned by a call to re.escape(). I found methods here and here that don't work. I'm working in a 100% UTF-8 environment.
import codecs
# pure ASCII string : ok
mystring = "a\n" # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print( cod(mystring) )
# non ASCII string : method #1
mystring = "€\n"
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)
# non ASCII string : method #2
mystring = "€\n"
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\202¬ INSTEAD OF "€\n"
Is this a bug ? Have I misunderstood something ?
Any help would be appreciated !
PS : I edited my post thanks to the Michael Foukarakis' remark.
I guess the actual string you need to process is mystring = €\\n?
mystring = "€\n" # that's 2 char, "€" and new line
mystring = "€\\n" # that's 3 char, "€", "\" and "n"
I don't really understand what goes wrong inside encode() and decode() in Python 3, but my friend solved this problem while we were writing some tools.
What we did was bypass encode("utf_8") after the escape processing is done.
>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n' # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n' # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'
We can see that although the result of decode("unicode_escape") looks weird, its code points are exactly the correct bytes of your string (in UTF-8 encoding), in this case "\xe2\x82\xac\n".
So we neither print the str object directly nor call encode("utf_8") on it; instead we use ord() to build the bytes object b'\xe2\x82\xac\n'.
You can then get the correct str from this bytes object by passing it to str() together with the "utf_8" encoding.
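The ord() round trip can also be written with the Latin-1 codec, which maps code points 0 to 255 onto the same byte values. The helper below is a sketch with a hypothetical name, unescape_keep_utf8. It assumes the escapes in the input only produce ASCII characters; an escape such as \u20ac or \xe9 would make the Latin-1 or UTF-8 step fail.

```python
def unescape_keep_utf8(text):
    """Interpret backslash escapes in text while keeping non-ASCII characters intact."""
    raw = text.encode("utf-8")                 # real UTF-8 bytes plus literal escapes
    mangled = raw.decode("unicode_escape")     # escapes resolved, each byte read as Latin-1
    repaired = mangled.encode("latin-1")       # same effect as bytes([ord(c) for c in mangled])
    return repaired.decode("utf-8")            # reassemble the multi-byte characters


result = unescape_keep_utf8("\u20ac\\n")       # euro sign, backslash, 'n'
print(ascii(result))
```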
BTW, the tool my friend and I wanted to make is a wrapper that allows the user to input a C-like string literal and converts the escape sequences automatically.
User input:\n\x61\x62\n\x20\x21 # 20 characters, which represent 6 characters semantically
output: # \n
ab # \x61\x62\n
! # \x20\x21
That's a powerful tool for entering non-printable characters in a terminal.
Our final tool is:
#!/usr/bin/env python3
import sys
for line in sys.stdin:
    sys.stdout.buffer.write(bytes([ord(char) for char in line[:-1].encode().decode('unicode_escape')]))
    sys.stdout.flush()
You seem to misunderstand encodings. To be protected against common errors, we usually encode a string when it leaves our application, and decode it when it comes in.
Firstly, let's look at the documentation for unicode_escape, which states:
Produce[s] a string that is suitable as Unicode literal in Python source code.
Here is what you would get from the network or a file that claims its contents are Unicode escaped:
b'\\u20ac\\n'
Now, you have to decode this to use it in your app:
>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'
and if you wanted to write it back to, say, a Python source file:
with open('/tmp/foo', 'wb') as fh: # binary mode
    fh.write(b'print("' + s.encode('unicode_escape') + b'")')
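The decode-on-input, encode-on-output round trip described above can be checked without touching the file system. A minimal sketch; the variable names are illustrative.

```python
text = "\u20ac\n"                               # euro sign followed by a newline

# Encoding with unicode_escape yields pure ASCII bytes suitable for source code.
escaped = text.encode("unicode_escape")
print(escaped)

# Decoding the escaped bytes gives back the original string.
restored = escaped.decode("unicode_escape")
print(restored == text)

# The bytes that the file-writing example would produce.
source_line = b'print("' + escaped + b'")'
print(source_line)
```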
import string
printable = string.printable
printable = printable + '€'
def cod(c):
    return c.encode('unicode_escape').decode('ascii')
def unescape(s):
    return ''.join(c if ord(c)>=32 and c in printable else cod(c) for c in s)
mystring = "€\n"
print(unescape(mystring))
Unfortunately string.printable only includes ASCII characters. You can make a copy as I did here and extend it with any Unicode characters that you'd like, such as €.
