Python encoding unicode utf-8

I'm using Selenium to enter text containing German umlauts into a web form. The declared encoding of the Python script is UTF-8, and the page uses UTF-8 as well. When I define a string like this, everything works fine:
q = u"Hällö" #type(q) returns unicode
...
textbox.send_keys(q)
But when I try to read the value from a config file using ConfigParser (or from another kind of file), I get malformed output in the web form (Hällö). This is the code I use for that:
q = parser.get('info', 'query') # type(q) returns str
the_encoding = chardet.detect(q)['encoding'] # prints utf-8
q = q.decode('unicode-escape') # type(q) returns unicode
textbox.send_keys(q)
What's the difference between the two q values passed to the send_keys function?

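This is probably a wrong decoding step. Try printing repr(q) before the last statement and compare it with the working value. The line q = parser.get('info', 'query') # type(q) returns str should return the byte string 'H\xc3\xa4ll\xc3\xb6'. If it does, decode it with q.decode('utf-8') rather than 'unicode-escape', which treats each byte as a separate Latin-1 character. If it returns something different, the file is not being read as UTF-8.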
>>> q = u"Hällö" # unicode obj
>>> q
u'H\xe4ll\xf6'
>>> print q
Hällö
>>> q.encode('utf-8')
'H\xc3\xa4ll\xc3\xb6'
>>> a = q.encode('utf-8') # str obj
>>> a
'H\xc3\xa4ll\xc3\xb6' # <-- this should be the value of the str
>>> a.decode('utf-8') # <-- unicode obj
u'H\xe4ll\xf6'
>>> print a.decode('utf-8')
Hällö
>>>

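Another option is to open the file with an explicit encoding, so that the parser returns unicode values:
from ConfigParser import SafeConfigParser
import codecs

parser = SafeConfigParser()
# 'utf-8-sig' reads UTF-8 and also strips a leading BOM if the file has one
with codecs.open('cfg.ini', 'r', encoding='utf-8-sig') as f:
    parser.readfp(f)
greet = parser.get('main', 'greet')  # unicode
# encode with plain 'utf-8' for output; 'utf-8-sig' would prepend a BOM
print 'greet:', greet.encode('utf-8')
Output:
greet: Hällö
cfg.ini file:
[main]
greet=Hällö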
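For readers on Python 3, the same idea can be sketched with the standard configparser module, whose read() method accepts an encoding argument. This is only an illustration under assumptions: the temporary file stands in for cfg.ini, and the section and option names are taken from the example above.

```python
import configparser
import os
import tempfile

# Write a small UTF-8 config file with a BOM, standing in for cfg.ini.
with tempfile.NamedTemporaryFile('w', suffix='.ini', delete=False,
                                 encoding='utf-8-sig') as f:
    f.write('[main]\ngreet=Hällö\n')
    path = f.name

parser = configparser.ConfigParser()
# 'utf-8-sig' decodes UTF-8 and drops the BOM if one is present.
parser.read(path, encoding='utf-8-sig')
greet = parser.get('main', 'greet')  # already a str (unicode) in Python 3
os.remove(path)

print(greet)
```

In Python 3 the value needs no further decode step before being handed to something like send_keys.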

Related

How to decode a string representation of a bytes object?

I have a string which includes encoded bytes inside it:
str1 = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
I want to decode it, but I can't since it has become a string. Therefore I want to ask whether there is any way I can convert it into
str2 = b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'
Here str2 is a bytes object which I can decode easily using
str2.decode('utf-8')
to get the final result:
'Output file 문항분석.xlsx Created'
You could use ast.literal_eval:
>>> print(str1)
b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'
>>> type(str1)
<class 'str'>
>>> from ast import literal_eval
>>> literal_eval(str1).decode('utf-8')
'Output file 문항분석.xlsx Created'
Based on the SyntaxError mentioned in your comments, you may be hitting a testing issue when printing: stdout may be set to ASCII in your console, and the console may not support some of the characters you are trying to print. You can try something like the following to set sys.stdout to UTF-8 and see what your console prints (this uses a string slice and encode to get the bytes, rather than the ast.literal_eval approach already suggested):
import codecs
import sys
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer)
s = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
# Latin-1 maps each character back to the byte with the same value;
# a plain .encode() would produce UTF-8 and double-encode the text
b = s[2:-1].encode('latin-1').decode('utf-8')
print(b)
A simple way is to assume that every character of the initial string has a code point in the [0, 256) range and stands for the byte with that same value, which is exactly the Latin-1 mapping.
The conversion is then trivial:
str1[2:-1].encode('Latin1').decode('utf8')
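The assumption behind this one-liner can be checked explicitly. The following Python 3 sketch wraps it in a hypothetical helper (repr_of_bytes_to_text is not from the thread) that refuses input with characters outside the Latin-1 range instead of failing halfway through:

```python
def repr_of_bytes_to_text(s):
    """Recover text from a str shaped like "b'...'" whose payload
    characters each stand for one UTF-8 byte."""
    payload = s[2:-1]  # strip the leading b' and the trailing '
    if any(ord(ch) > 255 for ch in payload):
        raise ValueError('payload has characters outside the Latin-1 range')
    # Latin-1 turns each character into the byte with the same value.
    return payload.encode('latin-1').decode('utf-8')


str1 = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
result = repr_of_bytes_to_text(str1)
print(result)
```

If the string instead contains literal backslash sequences (a backslash followed by x and two hex digits), this helper does not apply and ast.literal_eval is the better fit.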
Finally, I found an answer: I use a function to convert the string to bytes without encoding it. Given the string
str1 = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
I take only the actual encoded text inside it
str1[2:-1]
and pass it to a function that converts the string to bytes without re-encoding its values:
import struct

def rawbytes(s):
    """Convert a string to raw bytes without encoding"""
    outlist = []
    for cp in s:
        num = ord(cp)
        if num <= 0xFF:
            # fits in one byte
            outlist.append(struct.pack('B', num))
        elif num <= 0xFFFF:
            # two bytes, big-endian
            outlist.append(struct.pack('>H', num))
        else:
            # above U+FFFF: high byte first, then the low 16 bits
            b = (num & 0xFF0000) >> 16
            H = num & 0xFFFF
            outlist.append(struct.pack('>BH', b, H))
    return b''.join(outlist)
Calling the function converts the text to bytes, which can then be decoded:
rawbytes(str1[2:-1]).decode('utf-8')
This gives the correct output:
'Output file 문항분석.xlsx Created'

Python urlencode special character

I have this variable here
reload(sys)
sys.setdefaultencoding('utf8')
foo = u'"Esp\xc3\xadrito"'
which translates to "Espírito". But when I pass my variable to urlencode like this
urllib.urlencode({"q": foo}) # 'q=%22Esp%C3%83%C2%ADrito%22'
The special character is being "represented" wrongly in the URL.
How should I fix this?
Your value for "Espírito" is wrongly encoded: the unicode literal contains the UTF-8 bytes as if they were code points. I don't know where you got it from, but this is the right one:
>>> s = u'"Espírito"'
>>>
>>> s
u'"Esp\xedrito"'
Then encoding your query:
>>> urllib.urlencode({'q':s.encode('utf-8')})
'q=%22Esp%C3%ADrito%22'
This should give you back the right encoding of your string.
EDIT: This is regarding right encoding of your query string, demo:
>>> s = u'"Espírito"'
>>> print s
"Espírito"
>>> s.encode('utf-8')
'"Esp\xc3\xadrito"'
>>> s.encode('latin-1')
'"Esp\xedrito"'
>>>
>>> print "Esp\xc3\xadrito"
Espí­rito
>>> print "Esp\xedrito"
Espírito
This shows that the right encoding for your string is most probably latin-1 (cp1252 works as well). As far as I understand, urlparse.parse_qs assumes either UTF-8 or your system default encoding, which, per your post, you have also set to UTF-8.
Interestingly, when I played with the query you provided in your comment, I got this:
>>> q = "q=Esp%C3%ADrito"
>>>
>>> p = urlparse.parse_qs(q)
>>> p['q'][0].decode('utf-8')
u'Esp\xedrito'
>>>
>>> p['q'][0].decode('latin-1')
u'Esp\xc3\xadrito'
>>> # Clearly not ASCII encoding.
>>> p['q'][0].decode()
Traceback (most recent call last):
  File "<pyshell#320>", line 1, in <module>
    p['q'][0].decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
>>>
>>> p['q'][0]
'Esp\xc3\xadrito'
>>> print p['q'][0]
Espírito
>>> print p['q'][0].decode('utf-8')
Espírito
urllib and urlparse appear to work with byte strings in Python 2. To get unicode strings, encode to UTF-8 before calling urlencode and decode from UTF-8 after calling parse_qs.
Here's an example of a round-trip:
import urllib
import urlparse

data = {'q': u'Espírito'}
# to query string:
bdata = {k: v.encode('utf-8') for k, v in data.iteritems()}
qs = urllib.urlencode(bdata)
# qs = 'q=Esp%C3%ADrito'
# to dict:
bdata = urlparse.parse_qs(qs)
data = {k: map(lambda s: s.decode('utf-8'), v)
        for k, v in bdata.iteritems()}
# data = {'q': [u'Espírito']}
Note the different meaning of escape sequences: in 'Esp\xc3\xadrito' (a byte string), they represent bytes, while in u'"Esp\xedrito"' (a unicode object) they represent Unicode code points.
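For comparison, here is a minimal Python 3 sketch of the same round trip. It assumes the standard urllib.parse module, where urlencode and parse_qs use UTF-8 by default, so no manual encode or decode is needed. The last lines reproduce the doubled encoding from the question by feeding in the mis-decoded string.

```python
from urllib.parse import urlencode, parse_qs

data = {'q': 'Espírito'}
qs = urlencode(data)    # non-ASCII characters are percent-encoded as UTF-8
parsed = parse_qs(qs)   # percent-escapes are decoded as UTF-8

print(qs)
print(parsed)

# The string from the question holds UTF-8 bytes as if they were code
# points, so each of them gets encoded a second time.
broken = urlencode({'q': 'Esp\xc3\xadrito'})
print(broken)
```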

CSV to dict, dict not finding the item

I am converting a CSV to dict, all the values are loaded correctly but with one issue.
CSV :
Testing testing\nwe are into testing mode
My\nServer This is my server.
When I convert the CSV to a dict and look a key up with dict.get(), it returns None.
When I debug, I get the following output:
{'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
The My\nServer key has an extra backslash.
If I call .get("My\nServer"), I get None.
Can anyone help me?
#!/usr/bin/env python
import os
import codecs
import json
from csv import reader

def get_dict(path):
    with codecs.open(path, 'r', 'utf-8') as msgfile:
        data = msgfile.read()
        data = reader([r.encode('utf-8') for r in data.splitlines()])
        newdata = []
        for row in data:
            newrow = []
            for val in row:
                newrow.append(unicode(val, 'utf-8'))
            newdata.append(newrow)
    return dict(newdata)
thanks
You either need to escape the newline properly, using \\n:
>>> d = {'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
>>> d.get('My\\nServer')
'This is my server.'
or you can use a raw string literal which doesn't need extra escaping:
>>> d.get(r'My\nServer')
'This is my server.'
Note that raw string will treat all the backslash escape sequences this way, not just the newline \n.
In case you are getting the values dynamically, you can use str.encode with string_escape or unicode_escape encoding:
>>> k = 'My\nServer' # API call result
>>> k.encode('string_escape')
'My\\nServer'
>>> d.get(k.encode('string_escape'))
'This is my server.'
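The string_escape codec exists only in Python 2. As a rough Python 3 equivalent, assuming the keys contain only ASCII text plus escapes such as \n, the unicode_escape codec can be used and its bytes result decoded back to str:

```python
d = {'Testing': 'testing\\nwe are into testing mode',
     'My\\nServer': 'This is my server.'}

k = 'My\nServer'  # contains a real newline, e.g. from an API call

# unicode_escape turns the newline into the two characters '\' and 'n';
# the result is bytes, so decode it to get a str usable as a dict key.
escaped = k.encode('unicode_escape').decode('ascii')
value = d.get(escaped)

print(repr(escaped))
print(value)
```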
"\n" is newline.
If you want to represent text like "---\n---" in Python without an actual newline in it, you have to escape the backslash.
How you write a string in code differs from how it is printed: in code you have to write "\\n" (unless you use a raw string), and when printed, the extra backslash is not shown.
So in your code, the lookup should be:
>>> dct = {'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
>>> dct.get("My\\nServer")
'This is my server.'

Easy way to convert a unicode list to a list containing python strings?

Template of the list is:
EmployeeList = [u'<EmpId>', u'<Name>', u'<Doj>', u'<Salary>']
I would like to convert from this
EmployeeList = [u'1001', u'Karick', u'14-12-2020', u'1$']
to this:
EmployeeList = ['1001', 'Karick', '14-12-2020', '1$']
After conversion, I am actually checking if "1001" exists in EmployeeList.values().
Encode each value in the list to a string:
[x.encode('UTF8') for x in EmployeeList]
You need to pick a valid encoding; don't use str() as that'll use the system default (for Python 2 that's ASCII) which will not encode all possible codepoints in a Unicode value.
UTF-8 is capable of encoding all of the Unicode standard, but any codepoint outside the ASCII range will lead to multiple bytes per character.
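The "multiple bytes per character" point can be illustrated with a short Python 3 sketch (the sample characters are arbitrary choices, not taken from the question):

```python
# One character each from the ASCII, Latin-1, BMP and astral ranges.
samples = ['1', 'ä', '€', '😀']

# Number of bytes each character occupies once encoded as UTF-8.
lengths = {ch: len(ch.encode('utf-8')) for ch in samples}

for ch, n in lengths.items():
    print(repr(ch), n)
```

ASCII characters stay at one byte each, which is why the encoded employee list above looks unchanged.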
However, if all you want to do is test for a specific string, test for a unicode string, so that Python doesn't have to auto-encode the values during the comparison (if EmployeeList is a plain list, drop the .values() call):
u'1001' in EmployeeList.values()
[str(x) for x in EmployeeList] would do a conversion, but it would fail if the unicode string characters do not lie in the ascii range.
>>> EmployeeList = [u'1001', u'Karick', u'14-12-2020', u'1$']
>>> [str(x) for x in EmployeeList]
['1001', 'Karick', '14-12-2020', '1$']
>>> EmployeeList = [u'1001', u'करिक', u'14-12-2020', u'1$']
>>> [str(x) for x in EmployeeList]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
We can use the map function (like str() above, this fails for values outside the ASCII range):
print map(str, EmployeeList)
Just simply use this code
EmployeeList = eval(EmployeeList)
EmployeeList = [str(x) for x in EmployeeList]
How about:
def fix_unicode(data):
    if isinstance(data, unicode):
        return data.encode('utf-8')
    elif isinstance(data, dict):
        data = dict((fix_unicode(k), fix_unicode(data[k])) for k in data)
    elif isinstance(data, list):
        for i in xrange(0, len(data)):
            data[i] = fix_unicode(data[i])
    return data
Just use
unicode_to_list = list(EmployeeList)
There are several ways to do this. I converted it like this:
import re

def clean(s):
    s = s.replace("u'", "")
    # strip brackets, quotes and whitespace left over from the list repr
    return re.sub(r"[\[\]'\s]", '', s)

EmployeeList = [clean(i) for i in str(EmployeeList).split(',')]
After that you can check:
if '1001' in EmployeeList:
    # do something
Hope it will help you.
You can do this by using json and ast modules as follows
>>> import json, ast
>>>
>>> EmployeeList = [u'1001', u'Karick', u'14-12-2020', u'1$']
>>>
>>> result_list = ast.literal_eval(json.dumps(EmployeeList))
>>> result_list
['1001', 'Karick', '14-12-2020', '1$']
json.dumps alone may be enough if you only need to write the data out.
json.dumps serializes the list to a single JSON-formatted str (not a list of str), which makes it easy to write the data to a JSON or CSV file.
Sample code:
import json
EmployeeList = [u'1001', u'Karick', u'14-12-2020', u'1$']
result_list = json.dumps(EmployeeList)
print result_list
output: ["1001", "Karick", "14-12-2020", "1$"]

Python to show special characters

I know there are tons of threads regarding this issue but I have not managed to find one which solves my problem.
I am trying to print a string but when printed it doesn't show special characters (e.g. æ, ø, å, ö and ü). When I print the string using repr() this is what I get:
u'Von D\xc3\xbc' and u'\xc3\x96berg'
Does anyone know how I can convert this to Von Dü and Öberg? It's important to me that these characters are not ignored, e.g. myStr.encode("ascii", "ignore").
EDIT
This is the code I use. I use BeautifulSoup to scrape a website. The contents of a cell (<td>) in a table (<table>), is put into the variable name. This is the variable which contains special characters that I cannot print.
web = urllib2.urlopen(url)
soup = BeautifulSoup(web)
tables = soup.find_all("table")
scene_tables = [2, 3, 6, 7, 10]
scene_index = 0
# Iterate over the <table>s we want to work with
for scene_table in scene_tables:
    i = 0
    # Iterate over <td> to find time and name
    for td in tables[scene_table].find_all("td"):
        if i % 2 == 0: # td contains the time
            time = remove_whitespace(td.get_text())
        else: # td contains the name
            name = remove_whitespace(td.get_text()) # This is the variable containing "nonsense"
            print "%s: %s" % (time, name,)
        i += 1
    scene_index += 1
Prevention is better than cure. What you need is to find out how that rubbish is being created. Please edit your question to show the code that creates it, and then we can help you fix it. It looks like somebody has done:
your_unicode_string = original_utf8_encoded_bytestring.decode('latin1')
The cure is simply to reverse that step and then decode with the right codec:
correct_unicode_string = your_unicode_string.encode('latin1').decode('utf8')
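To make the cause and the cure concrete, here is a small Python 3 sketch. The helper name fix_mojibake is hypothetical, and it assumes the damage came from exactly one wrong latin1 decode. It leaves the text unchanged when the round trip is not possible, so strings that were already correct are not corrupted.

```python
def fix_mojibake(text):
    """Undo a mistaken latin1 decode of UTF-8 bytes, if that is what happened."""
    try:
        return text.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in latin1, or not valid UTF-8 afterwards:
        # assume the text was fine to begin with.
        return text


raw = 'Öberg'.encode('utf8')       # the bytes the website really sent
garbled = raw.decode('latin1')     # the mistaken decode
repaired = fix_mojibake(garbled)
untouched = fix_mojibake('Öberg')  # already correct, left as-is

print(repr(garbled), repaired, untouched)
```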
Update: based on the code that you supplied, the probable cause is that the website declares that it is encoded in ISO-8859-1 (aka latin1) but is in reality encoded in UTF-8. Please update your question to show us the URL.
If you can't show it, read the BS docs; it looks like you'll need to use:
BeautifulSoup(web, from_encoding='utf8')
Unicode support in many languages is confusing, so your error here is understandable. Those strings are UTF-8 bytes, which would work properly if you drop the u at the front:
>>> err = u'\xc3\x96berg'
>>> print err
Ã?berg
>>> x = '\xc3\x96berg'
>>> print x
Öberg
>>> u = x.decode('utf-8')
>>> u
u'\xd6berg'
>>> print u
Öberg
For lots more information:
http://www.joelonsoftware.com/articles/Unicode.html
http://docs.python.org/howto/unicode.html
You should really really read those links and understand what is going on before proceeding. If, however, you absolutely need to have something that works today, you can use this horrible hack that I am embarrassed to post publicly:
def convert_fake_unicode_to_real_unicode(string):
    return ''.join(map(chr, map(ord, string))).decode('utf-8')
The contents of the strings are not really unicode text; they are UTF-8 encoded bytes.
>>> print u'Von D\xc3\xbc'
Von Dü
>>> print 'Von D\xc3\xbc'
Von Dü
>>> print unicode('Von D\xc3\xbc', 'utf-8')
Von Dü
>>>
Edit:
>>> print '\xc3\x96berg' # no unicode identifier, works as expected because it's an UTF-8 encoded string
Öberg
>>> print u'\xc3\x96berg' # has unicode identifier, means print uses the unicode charset now, outputs weird stuff
Ãberg
# Look at the differing object types:
>>> type('\xc3\x96berg')
<type 'str'>
>>> type(u'\xc3\x96berg')
<type 'unicode'>
>>> '\xc3\x96berg'.decode('utf-8') # this command converts from UTF-8 to unicode, look at the unicode identifier in the output
u'\xd6berg'
>>> unicode('\xc3\x96berg', 'utf-8') # this does the same thing
u'\xd6berg'
>>> unicode(u'foo bar', 'utf-8') # trying to convert a unicode string to unicode will fail as expected
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: decoding Unicode is not supported
