I have a string with escaped data like
escaped_data = '\\x50\\x51'
print escaped_data # gives '\x50\x51'
What Python function would unescape it so I would get
raw_data = unescape( escaped_data)
print raw_data # would print "PQ"
You can decode with string-escape.
>>> escaped_data = '\\x50\\x51'
>>> escaped_data.decode('string-escape')
'PQ'
In Python 3.0 there's no string-escape, but you can use unicode_escape.
From a bytes object:
>>> escaped_data = b'\\x50\\x51'
>>> escaped_data.decode("unicode_escape")
'PQ'
From a Unicode str object:
>>> import codecs
>>> escaped_data = '\\x50\\x51'
>>> codecs.decode(escaped_data, "unicode_escape")
'PQ'
You could use the 'unicode_escape' codec:
>>> '\\x50\\x51'.decode('unicode_escape')
u'PQ'
Alternatively, 'string-escape' will give you a classic Python 2 string (bytes in Python 3):
>>> '\\x50\\x51'.decode('string_escape')
'PQ'
escaped_data.decode('unicode-escape') helps?
Try:
eval('"' + raw_data + '"')
It should work.
Related
I'm using selenium to insert text input with german umlauts in a web formular. The declared coding for the python script is utf-8. The page uses utf-8 encoding. When i definine a string like that everything works fine:
q = u"Hällö" #type(q) returns unicode
...
textbox.send_keys(q)
But when i try to read from a config file using ConfigParser (or another kind of file) i get malformed output in the webformular (Hällö). This is the code i use for that:
the_encoding = chardet.detect(q)['encoding'] #prints utf-8
q = parser.get('info', 'query') # type(q) returns str
q = q.decode('unicode-escape') # type(q) returns unicode
textbox.send_keys(q)
Whats the difference between the both q's given to the send_keys function?
This is probably bad encoding. Try printing q before the last statement, and see if it's equal. This line q = parser.get('info', 'query') # type(q) returns str should return the string 'H\xc3\xa4ll\xc3\xb6'. If it's different, then you are using the wrong coding.
>>> q = u"Hällö" # unicode obj
>>> q
u'H\xe4ll\xf6'
>>> print q
Hällö
>>> q.encode('utf-8')
'H\xc3\xa4ll\xc3\xb6'
>>> a = q.encode('utf-8') # str obj
>>> a
'H\xc3\xa4ll\xc3\xb6' # <-- this should be the value of the str
>>> a.decode('utf-8') # <-- unicode obj
u'H\xe4ll\xf6'
>>> print a.decode('utf-8')
Hällö
>>>
from ConfigParser import SafeConfigParser
import codecs
parser = SafeConfigParser()
with codecs.open('cfg.ini', 'r', encoding='utf-8-sig') as f:
parser.readfp(f)
greet = parser.get('main', 'greet')
print 'greet:', greet.encode('utf-8-sig')
greet: Hällö
cfg.ini file
[main]
greet=Hällö
I have this variable here
reload(sys)
sys.setdefaultencoding('utf8')
foo = u'"Esp\xc3\xadrito"'
which translates to "Espírito". But when I pass my variable to urlencode like this
urllib.urlencode({"q": foo}) # q=%22Esp%C3%83%C2%ADrito%22'
The special character is being "represented" wrongly in the URL.
How should I fix this?
You got the wrong encoding of "Espírito", I don't know where you get that, but this is the right one:
>>> s = u'"Espírito"'
>>>
>>> s
u'"Esp\xedrito"'
Then encoding your query:
>>> u.urlencode({'q':s.encode('utf-8')})
'q=%22Esp%C3%ADrito%22'
This should give you back the right encoding of your string.
EDIT: This is regarding right encoding of your query string, demo:
>>> s = u'"Espírito"'
>>> print s
"Espírito"
>>> s.encode('utf-8')
'"Esp\xc3\xadrito"'
>>> s.encode('latin-1')
'"Esp\xedrito"'
>>>
>>> print "Esp\xc3\xadrito"
EspÃrito
>>> print "Esp\xedrito"
Espírito
This clearly shows that the right encoding for your string is most probably latin-1 (even cp1252 works as well), now as far as I understand, urlparse.parse_qs either assumes default encoding utf-8 or your system default encoding, which as per your post, you set it to utf-8 as well.
Interestingly, I was playing with the query you provided in your comment, I got this:
>>> q = "q=Esp%C3%ADrito"
>>>
>>> p = urlparse.parse_qs(q)
>>> p['q'][0].decode('utf-8')
u'Esp\xedrito'
>>>
>>> p['q'][0].decode('latin-1')
u'Esp\xc3\xadrito'
#Clearly not ASCII encoding.
>>> p['q'][0].decode()
Traceback (most recent call last):
File "<pyshell#320>", line 1, in <module>
p['q'][0].decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
>>>
>>> p['q'][0]
'Esp\xc3\xadrito'
>>> print p['q'][0]
EspÃrito
>>> print p['q'][0].decode('utf-8')
Espírito
urllib and urlparse appear to work with byte string in Python 2. To get unicode strings, encode and decode using utf-8.
Here's an example of a round-trip:
data = { 'q': u'Espírito'}
# to query string:
bdata = {k: v.encode('utf-8') for k, v in data.iteritems()}
qs = urllib.urlencode(bdata)
# qs = 'q=Esp%C3%ADrito'
# to dict:
bdata = urlparse.parse_qs(qs)
data = { k: map(lambda s: s.decode('utf-8'), v)
for k, v in bdata.iteritems() }
# data = {'q': [u'Espídrito']}
Note the different meaning of escape sequences: in 'Esp\xc3\xadrito' (a string), they represent bytes, while in u'"Esp\xedrito"' (a unicode object) they represent Unicode code points.
Is there a build-in function in python 3 to let me get b from a?
a = '\\xe9\\x82\\xa3'
b = b'\xe9\x82\xa3'
You can use unicode-escape encoding:
>>> a = '\\xe9\\x82\\xa3'
>>> a.encode().decode('unicode-escape').encode('latin1')
b'\xe9\x82\xa3'
>>> import codecs
>>> codecs.decode(a, 'unicode-escape').encode('latin1')
b'\xe9\x82\xa3'
Template of the list is:
EmployeeList = [u'<EmpId>', u'<Name>', u'<Doj>', u'<Salary>']
I would like to convert from this
EmployeeList = [u'1001', u'Karick', u'14-12-2020', u'1$']
to this:
EmployeeList = ['1001', 'Karick', '14-12-2020', '1$']
After conversion, I am actually checking if "1001" exists in EmployeeList.values().
Encode each value in the list to a string:
[x.encode('UTF8') for x in EmployeeList]
You need to pick a valid encoding; don't use str() as that'll use the system default (for Python 2 that's ASCII) which will not encode all possible codepoints in a Unicode value.
UTF-8 is capable of encoding all of the Unicode standard, but any codepoint outside the ASCII range will lead to multiple bytes per character.
However, if all you want to do is test for a specific string, test for a unicode string and Python won't have to auto-encode all values when testing for that:
u'1001' in EmployeeList.values()
[str(x) for x in EmployeeList] would do a conversion, but it would fail if the unicode string characters do not lie in the ascii range.
>>> EmployeeList = [u'1001', u'Karick', u'14-12-2020', u'1$']
>>> [str(x) for x in EmployeeList]
['1001', 'Karick', '14-12-2020', '1$']
>>> EmployeeList = [u'1001', u'करिक', u'14-12-2020', u'1$']
>>> [str(x) for x in EmployeeList]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
We can use map function
print map(str, EmployeeList)
Just simply use this code
EmployeeList = eval(EmployeeList)
EmployeeList = [str(x) for x in EmployeeList]
how about:
def fix_unicode(data):
if isinstance(data, unicode):
return data.encode('utf-8')
elif isinstance(data, dict):
data = dict((fix_unicode(k), fix_unicode(data[k])) for k in data)
elif isinstance(data, list):
for i in xrange(0, len(data)):
data[i] = fix_unicode(data[i])
return data
Just use
unicode_to_list = list(EmployeeList)
There are several ways to do this. I converted like this
def clean(s):
s = s.replace("u'","")
return re.sub("[\[\]\'\s]", '', s)
EmployeeList = [clean(i) for i in str(EmployeeList).split(',')]
After that you can check
if '1001' in EmployeeList:
#do something
Hope it will help you.
You can do this by using json and ast modules as follows
>>> import json, ast
>>>
>>> EmployeeList = [u'1001', u'Karick', u'14-12-2020', u'1$']
>>>
>>> result_list = ast.literal_eval(json.dumps(EmployeeList))
>>> result_list
['1001', 'Karick', '14-12-2020', '1$']
Just json.dumps will fix the problem
json.dumps function actually converts all the unicode literals to string literals and it will be easy for us to load the data either in json file or csv file.
sample code:
import json
EmployeeList = [u'1001', u'Karick', u'14-12-2020', u'1$']
result_list = json.dumps(EmployeeList)
print result_list
output: ["1001", "Karick", "14-12-2020", "1$"]
Given a unicode object:
u'[obj1,obj2,ob3]'
How do you convert it to a normal list of objects?
import ast
s = u'[obj1,obj2,ob3]'
n = ast.literal_eval(s)
n
[obj1, obj2, ob3]
Did you mean this? Converting a unicode string to a list of strings.
BTW, you need to know the encoding when dealing with unicode. Here I have used utf-8
>>> s = u'[obj1,obj2,ob3]'
>>> n = [e.encode('utf-8') for e in s.strip('[]').split(',')]
>>> n
['obj1', 'obj2', 'ob3']
What you posted is a unicode string.
To encode it e.g. as UTF-8 use yourutf8str = yourunicodestr.encode('utf-8')
When unicode data doesn't show unicode u...
On exporting data from an excel table using openpyxl, my unicode was invisible. Use print repr(s) to see it
>>>print(data)
>>>print(type(data))
["Independent", "Primary/Secondary Combined", "Coed", "Anglican", "Boarding"]
<type 'unicode>
>>>print repr(data)
u'["Independent", "Primary/Secondary Combined", "Coed", "Anglican", "Boarding"]'
The fix:
>>>import ast
>>>data = ast.literal_eval(entry)
>>>print(data)
>>>print(type(data))
["Independent", "Primary/Secondary Combined", "Coed", "Anglican", "Boarding"]
<type 'list'>