I know there are tons of threads regarding this issue but I have not managed to find one which solves my problem.
I am trying to print a string but when printed it doesn't show special characters (e.g. æ, ø, å, ö and ü). When I print the string using repr() this is what I get:
u'Von D\xc3\xbc' and u'\xc3\x96berg'
Does anyone know how I can convert this to Von Dü and Öberg? It's important to me that these characters are not ignored, e.g. myStr.encode("ascii", "ignore").
EDIT
This is the code I use. I use BeautifulSoup to scrape a website. The contents of a cell (<td>) in a table (<table>), is put into the variable name. This is the variable which contains special characters that I cannot print.
web = urllib2.urlopen(url);
soup = BeautifulSoup(web)
tables = soup.find_all("table")
scene_tables = [2, 3, 6, 7, 10]
scene_index = 0
# Iterate over the <table>s we want to work with
for scene_table in scene_tables:
i = 0
# Iterate over < td> to find time and name
for td in tables[scene_table].find_all("td"):
if i % 2 == 0: # td contains the time
time = remove_whitespace(td.get_text())
else: # td contains the name
name = remove_whitespace(td.get_text()) # This is the variable containing "nonsense"
print "%s: %s" % (time, name,)
i += 1
scene_index += 1
Prevention is better than cure. What you need is to find out how that rubbish is being created. Please edit your question to show the code that creates it, and then we can help you fix it. It looks like somebody has done:
your_unicode_string = original_utf8_encoded_bytestring.decode('latin1')
The cure is to reverse the process, simply, and then decode.
correct_unicode_string = your_unicode_string.encode('latin1').decode('utf8')
Update Based on the code that you supplied, the probable cause is that the website declares that it is encoded in ISO-8859-1 (aka latin1) but in reality it is encoded in UTF-8. Please update your question to show us the url.
If you can't show it, read the BS docs; it looks like you'll need to use:
BeautifulSoup(web, from_encoding='utf8')
Unicode support in many languages is confusing, so your error here is understandable. Those strings are UTF-8 bytes, which would work properly if you drop the u at the front:
>>> err = u'\xc3\x96berg'
>>> print err
Ã?berg
>>> x = '\xc3\x96berg'
>>> print x
Öberg
>>> u = x.decode('utf-8')
>>> u
u'\xd6berg'
>>> print u
Öberg
For lots more information:
http://www.joelonsoftware.com/articles/Unicode.html
http://docs.python.org/howto/unicode.html
You should really really read those links and understand what is going on before proceeding. If, however, you absolutely need to have something that works today, you can use this horrible hack that I am embarrassed to post publicly:
def convert_fake_unicode_to_real_unicode(string):
return ''.join(map(chr, map(ord, string))).decode('utf-8')
The contents of the strings are not unicode, they are UTF-8 encoded.
>>> print u'Von D\xc3\xbc'
Von Dü
>>> print 'Von D\xc3\xbc'
Von Dü
>>> print unicode('Von D\xc3\xbc', 'utf-8')
Von Dü
>>>
Edit:
>>> print '\xc3\x96berg' # no unicode identifier, works as expected because it's an UTF-8 encoded string
Öberg
>>> print u'\xc3\x96berg' # has unicode identifier, means print uses the unicode charset now, outputs weird stuff
Ãberg
# Look at the differing object types:
>>> type('\xc3\x96berg')
<type 'str'>
>>> type(u'\xc3\x96berg')
<type 'unicode'>
>>> '\xc3\x96berg'.decode('utf-8') # this command converts from UTF-8 to unicode, look at the unicode identifier in the output
u'\xd6berg'
>>> unicode('\xc3\x96berg', 'utf-8') # this does the same thing
u'\xd6berg'
>>> unicode(u'foo bar', 'utf-8') # trying to convert a unicode string to unicode will fail as expected
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: decoding Unicode is not supported
Related
I'm using selenium to insert text input with german umlauts in a web formular. The declared coding for the python script is utf-8. The page uses utf-8 encoding. When i definine a string like that everything works fine:
q = u"Hällö" #type(q) returns unicode
...
textbox.send_keys(q)
But when i try to read from a config file using ConfigParser (or another kind of file) i get malformed output in the webformular (Hällö). This is the code i use for that:
the_encoding = chardet.detect(q)['encoding'] #prints utf-8
q = parser.get('info', 'query') # type(q) returns str
q = q.decode('unicode-escape') # type(q) returns unicode
textbox.send_keys(q)
Whats the difference between the both q's given to the send_keys function?
This is probably bad encoding. Try printing q before the last statement, and see if it's equal. This line q = parser.get('info', 'query') # type(q) returns str should return the string 'H\xc3\xa4ll\xc3\xb6'. If it's different, then you are using the wrong coding.
>>> q = u"Hällö" # unicode obj
>>> q
u'H\xe4ll\xf6'
>>> print q
Hällö
>>> q.encode('utf-8')
'H\xc3\xa4ll\xc3\xb6'
>>> a = q.encode('utf-8') # str obj
>>> a
'H\xc3\xa4ll\xc3\xb6' # <-- this should be the value of the str
>>> a.decode('utf-8') # <-- unicode obj
u'H\xe4ll\xf6'
>>> print a.decode('utf-8')
Hällö
>>>
from ConfigParser import SafeConfigParser
import codecs
parser = SafeConfigParser()
with codecs.open('cfg.ini', 'r', encoding='utf-8-sig') as f:
parser.readfp(f)
greet = parser.get('main', 'greet')
print 'greet:', greet.encode('utf-8-sig')
greet: Hällö
cfg.ini file
[main]
greet=Hällö
I am converting a CSV to dict, all the values are loaded correctly but with one issue.
CSV :
Testing testing\nwe are into testing mode
My\nServer This is my server.
When I convert the CSV to dict and if I try to use dict.get() method it is returning None.
When I debug, I get the following output:
{'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
The My\nServer key is having an extra backslash.
If I do .get("My\nServer"), I am getting the output as None.
Can anyone help me?
#!/usr/bin/env python
import os
import codecs
import json
from csv import reader
def get_dict(path):
with codecs.open(path, 'r', 'utf-8') as msgfile:
data = msgfile.read()
data = reader([r.encode('utf-8') for r in data.splitlines()])
newdata = []
for row in data:
newrow = []
for val in row:
newrow.append(unicode(val, 'utf-8'))
newdata.append(newrow)
return dict(newdata)
thanks
You either need to escape the newline properly, using \\n:
>>> d = {'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
>>> d.get('My\\nServer')
'This is my server.'
or you can use a raw string literal which doesn't need extra escaping:
>>> d.get(r'My\nServer')
'This is my server.'
Note that raw string will treat all the backslash escape sequences this way, not just the newline \n.
In case you are getting the values dynamically, you can use str.encode with string_escape or unicode_escape encoding:
>>> k = 'My\nServer' # API call result
>>> k.encode('string_escape')
'My\\nServer'
>>> d.get(k.encode('string_escape'))
'This is my server.'
"\n" is newline.
If you want to represent a text like "---\n---" in Python, and not having there newline, you have to escape it.
The way you write it in code and how it gets printed differs, in code, you will have to write "\" (unless u use raw string), when printed, the extra slash will not be seen
So in your code, you shall ask:
>>> dct = {'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
>>> dct.get("My\\nServer")
'This is my server.'
I'm falling the unicode hell.
My environment in on unix, python 2.7.3
LC_CTYPE=zh_TW.UTF-8
LANG=en_US.UTF-8
I'm trying to dump hex encoded data in human readable format, here is simplified code
#! /usr/bin/env python
# encoding:utf-8
import sys
s=u"readable\n" # previous result keep in unicode string
s2="fb is not \xfb" # data read from binary file
s += s2
print s # method 1
print s.encode('utf-8') # method 2
print s.encode('utf-8','ignore') # method 3
print s.decode('iso8859-1') # method 4
# method 1-4 display following error message
#UnicodeDecodeError: 'ascii' codec can't decode byte 0xfb
# in position 0: ordinal not in range(128)
f = open('out.txt','wb')
f.write(s)
I just want to print out the 0xfb.
I should describe more here. The key is 's += s2'.
Where s will keep my previous decoded string.
And the s2 is next string which should append into s.
If I modified as following, it occurs on write file.
s=u"readable\n"
s2="fb is not \xfb"
s += s2.decode('cp437')
print s
f=open('out.txt','wb')
f.write(s)
# UnicodeEncodeError: 'ascii' codec can't encode character
# u'\u221a' in position 1: ordinal not in range(128)
I wish the result of out.txt is
readable
fb is not \xfb
or
readable
fb is not 0xfb
[Solution]
#! /usr/bin/env python
# encoding:utf-8
import sys
import binascii
def fmtstr(s):
r = ''
for c in s:
if ord(c) > 128:
r = ''.join([r, "\\x"+binascii.hexlify(c)])
else:
r = ''.join([r, c])
return r
s=u"readable"
s2="fb is not \xfb"
s += fmtstr(s2)
print s
f=open('out.txt','wb')
f.write(s)
I strongly suspect that your code is actually erroring out on the previous line: the s += s2 one. s2 is just a series of bytes, which can't be arbitrarily tacked on to a unicode object (which is instead a series of code points).
If you had intended the '\xfb' to represent U+FB, LATIN SMALL LETTER U WITH CIRCUMFLEX, it would have been better to assign it like this instead:
s2 = u"\u00fb"
But you said that you just want to print out \xHH codes for control characters. If you just want it to be something humans can understand which still makes it apparent that special characters are in a string, then repr may be enough. First, don't have s be a unicode object, because you're treating your strings here as a series of bytes, not a series of code points.
s = s.encode('utf-8')
s += s2
print repr(s)
Finally, if you don't want the extra quotes on the outside that repr adds, for nice pretty printing or whatever, there's not a simple builtin way to do that in Python (that I know of). I've used something like this before:
import re
controlchars_re = re.compile(r'[\x00-\x31\x7f-\xff]')
def _show_control_chars(match):
txt = repr(match.group(0))
return txt[1:-1]
def escape_special_characters(s):
return controlchars_re.sub(_show_control_chars, s.replace('\\', '\\\\'))
You can pretty easily tweak the controlchars_re regex to define which characters you care about escaping.
I have a problem involving encoding/decoding.
I read text from file and compare it with text from database (Postgres)
Compare is done within two lists
from file i get "jo\x9a" for "još" and from database I get "jo\xc5\xa1" for same value
common = [a for a in codes_from_file if a in kode_prfoksov]
# Items in one but not the other
only1 = [a for a in codes_from_file if not a in kode_prfoksov]
#Items only in another
only2 = [a for a in kode_prfoksov if not a in codes_from_file ]
How to solve this? Which encoding should be set when comparing this two strings to solve the issue?
thank you
The first one seems to be windows-1250, and the second is utf-8.
>>> print 'jo\x9a'.decode('windows-1250')
još
>>> print 'jo\xc5\xa1'.decode('utf-8')
još
>>> 'jo\x9a'.decode('windows-1250') == 'jo\xc5\xa1'.decode('utf-8')
True
Your file strings seems to be Windows-1250 encoded. Your database seems to contain UTF-8 strings.
So you can either convert first all strings to unicode:
codes_from_file = [a.decode("windows-1250") for a in codes_from_file]
kode_prfoksov] = [a.decode("utf-8") for a in codes_from_file]
or if you do not want unicode strings, just convert the file string to UTF-8:
codes_from_file = [a.decode("windows-1250").encode("utf-8") for a in codes_from_file]
ACTIVATE_THIS = """
eJx1UsGOnDAMvecrIlYriDRlKvU20h5aaY+teuilGo1QALO4CwlKAjP8fe1QGGalRoLEefbzs+Mk
Sb7NcvRo3iTcoGqwgyy06As+HWSNVciKaBTFywYoJWc7yit2ndBVwEkHkIzKCV0YdQdmkvShs6YH
E3IhfjFaaSNLoHxQy2sLJrL0ow98JQmEG/rAYn7OobVGogngBgf0P0hjgwgt7HOUaI5DdBVJkggR
3HwSktaqWcCtgiHIH7qHV+esW2CnkRJ+9R5cQGsikkWEV/J7leVGs9TV4TvcO5QOOrTHYI+xeCjY
JR/m9GPDHv2oSZunUokS2A/WBelnvx6tF6LUJO2FjjlH5zU6Q+Kz/9m69LxvSZVSwiOlGnT1rt/A
77j+WDQZ8x9k2mFJetOle88+lc8sJJ/AeerI+fTlQigTfVqJUiXoKaaC3AqmI+KOnivjMLbvBVFU
1JDruuadNGcPmkgiBTnQXUGUDd6IK9JEQ9yPdM96xZP8bieeMRqTuqbxIbbey2DjVUNzRs1rosFS
TsLAdS/0fBGNdTGKhuqD7mUmsFlgGjN2eSj1tM3GnjfXwwCmzjhMbR4rLZXXk+Z/6Hp7Pn2+kJ49
jfgLHgI4Jg==
""".decode("base64").decode("zlib")
my code:
import zlib
print 'dsss'.decode('base64').decode('zlib')#error
Traceback (most recent call last):
File "D:\zjm_code\b.py", line 4, in <module>
print 'dsss'.decode('base64').decode('zlib')
File "D:\Python25\lib\encodings\zlib_codec.py", line 43, in zlib_decode
output = zlib.decompress(input)
zlib.error: Error -3 while decompressing data: unknown compression method
a='dsss'.encode('zlib')
print a
a.encode('base64')
print a
a.decode('base64')#error
print a
a.decode('zlib')
print a
x\x9cK)..Traceback (most recent call last):
File "D:\zjm_code\b.py", line 7, in <module>
a.decode('base64')
File "D:\Python25\lib\encodings\base64_codec.py", line 42, in base64_decode
output = base64.decodestring(input)
File "D:\Python25\lib\base64.py", line 321, in decodestring
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('zlib')
print a#why can't print 'dsss'
x\x9cK)..
a='dsss'
a=a.encode('zlib')
#print a
a=a.decode('zlib')
print a#its ok
i think the 'print a' encode the a with 'uhf-8'.
so:
#encoding:utf-8
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('utf-8')#but error.
a=a.decode('zlib')
print a#
x\x9cK)..Traceback (most recent call last):
File "D:\zjm_code\b.py", line 5, in <module>
a=a.decode('utf-8')
File "D:\Python25\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c in position 1: unexpected code byte
The data in the strings is encoded and compressed binary data. The .decode("base64").decode("zlib") unencodes and decompresses it.
The error you got was because 'dsss' decoded from base64 is not valid zlib compressed data.
What is the purpose of x.decode(”base64”).decode(”zlib”) for x in ("sss", "dsss", random_garbage)? Excuse me, you should know; you are the one who is doing it!
Edit after OP's addition of various puzzles
Puzzle 1
a='dsss'.encode('zlib')
print a
a.encode('base64')
print a
a.decode('base64')#error
print a
a.decode('zlib')
print a
Resolution: all 3 statements of the form
a.XXcode('encoding')
should be
a = a.XXcode('encoding')
Puzzle 2
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('zlib')
print a#why can't print 'dsss'
x\x9cK)..
But it does print 'dsss':
>>> a='dsss'
>>> a=a.encode('zlib')
>>> print a
x£K)..♠ ♦F☺¥
>>> a=a.decode('zlib')
>>> print a#why can't print 'dsss'
dsss
>>>
Puzzle 3
"""i think the 'print a' encode the a with 'uhf-8'."""
Resolution: You think extremely incorrectly. What follows the print is an expression. There are no such side effects. What do you imagine happens when you do this:
print 'start text ' + a + 'end text'
?
What do you imagine happens if you do print a twice? Encoding the already-encoded text again? Why don't you stop imagining and try it out?
In any case, note that the output of str.encode('zlib') is an str object, not a unicode object:
>>> print repr('dsss'.encode('zlib'))
'x\x9cK)..\x06\x00\x04F\x01\xbe'
Getting from that to UTF-8 is going to be somewhat difficult ... it would have to be decoded into unicode first -- with what codec? ascii and utf8 are going to have trouble with the '\x9c' and the '\xbe' ...
It is the reverse of:
original_message.encode('zlib').encode('base64')
zlib is a binary compression algorithm. base64 is a text encoding of binary data, which is useful to send binary message through text protocols like SMTP.
After 'dsss' was decoded from base64 (the three bytes 76h, CBh, 2Ch), the result was not valid zlib compressed data so it couldn't be decoded.
Try printing ACTIVATE_THIS to see the result of the decoding. It turns out to be some Python code.
.decode('base64') can be called only on a string that's encoded as "base-64, in order to retrieve the byte sequence that was there encoded. Presumably that byte sequence, in the example you bring, was zlib-compressed, and so the .decode('zlib') part decompresses it.
Now, for your case:
>>> 'dsss'.decode('base64')
'v\xcb,'
But 'v\xcv,' is not a zlib-compressed string! And so of course you cannot ask zlib to "decompress" it. Fortunately zlib recognizes the fact (that 'v\xcv,' could not possibly have been produced by applying any of the compression algorithms zlib knows about to any input whatsoever) and so gives you a helpful error message (instead of a random-ish string of bytes, which you might well have gotten if you had randomly supplied a different but equally crazy input string!-)
Edit: the error in
a.encode('base64')
print a
a.decode('base64')#error
is obviously due to the fact that strings are immutable: just calling a.encode (or any other method) does not alter a, it produces a new string object (and here you're just printing it).
In the next snippet, the error is only in the OP's mind:
>>> a='dsss'
>>> a=a.encode('zlib')
>>> print a
x?K)..F?
>>> a=a.decode('zlib')
>>> print a#why can't print 'dsss'
dsss
>>>
that "why can't print" question is truly peculiar, applied to code that does print 'dsss'. Finally,
i think the 'print a' encode the a
with 'uhf-8'.
You think wrongly: there's no such thing as "uhf-8" (you mean "utf-8" maybe?), and anyway print a does not alter a, any more than just calling a.encode does.