Problem with encode decode. Python. Django. BeautifulSoup - python

In this code:
soup=BeautifulSoup(program.Description.encode('utf-8'))
name=soup.find('div',{'class':'head'})
print name.string.decode('utf-8')
error happening when i'm trying to print or save to database.
dosnt metter what i'm doing:
print name.string.encode('utf-8')
or just
print name.string
Traceback (most recent call last):
File "./manage.py", line 16, in <module>
execute_manager(settings)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/__init__.py", line 362, in execute_manager
utility.execute()
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/__init__.py", line 303, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/base.py", line 195, in run_from_argv
self.execute(*args, **options.__dict__)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/base.py", line 222, in execute
output = self.handle(*args, **options)
File "/usr/local/cluster/dynamic/website/video/remmedia/management/commands/remmedia.py", line 50, in handle
self.FirstTimeLoad()
File "/usr/local/cluster/dynamic/website/video/remmedia/management/commands/remmedia.py", line 115, in FirstTimeLoad
print name.string.decode('utf-8')
File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-5: ordinal not in range(128)
This is repr(name.string)
u'\u0412\u044b\u043f\u0443\u0441\u043a \u043e\u0442 27 \u0434\u0435\u043a\u0430\u0431\u0440\u044f'

I don't know what you are trying to do with name.string.decode('utf-8'). As the BeautifulSoup documentation eloquently points out, "BeautifulSoup gives you Unicode, dammit". So name.string is already decoded - it is in unicode. You can encode it back to utf-8 if you want to, but you can't decode it any further.

You can try:
print name.string.encode('ascii', 'replace')
The output should be accepted whatever the encoding of sys.stdout is (including None).
In fact, the file-like object that you are printing to might not accept UTF-8. Here is an example: if you have the apparently benign program
# -*- coding: utf-8 -*-
print u"hérisson"
then running it in a terminal that can print accented characters works fine:
lebigot#weinberg /tmp % python2.5 test.py
hérisson
but printing to a standard output connected to a Unix pipe does not:
lebigot#weinberg /tmp % python2.5 test.py | cat
Traceback (most recent call last):
File "test.py", line 3, in <module>
print u"hérisson"
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
because sys.stdout has encoding None, in this case: Python considers that the program that reads through the pipe should receive ASCII, and the printing fails because ASCII cannot represent the word that we want to print. A solution like the one above solves the problem.
Note: You can check the encoding of your standard output with:
print sys.stdout.encoding
This can help you debug encoding problems.

Edit: name.string comes from BeautifulSoup, so it is presumably already a unicode string.
However, your error message mentions 'ascii':
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-5:
ordinal not in range(128)
According to the PrintFails Python wiki page, if Python does not know or
can not determine what kind of encoding your output device is expecting, it sets
sys.stdout.encoding to None and print attempts to encode its arguments with
the 'ascii' codec.
I believe this is the cause of your problem. You can can confirm this by seeing
if print sys.stdout.encoding prints None.
According to the same page, linked above, you can circumvent the problem by
explicitly telling Python what encoding to use. You do that be wrapping
sys.stdout in an instance of StreamWriter:
For example, you could try adding
import sys
import locale
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)
to your script before the print statement. You may have to change
locale.getpreferredencoding() to and explicit encoding (e.g. 'utf-8',
'cp1252', etc.). The right encoding to use depends on your output device.
It should be set to whatever encoding your output device is expecting. If
you are outputing to a terminal, the terminal may have a menu setting to allow
the user to set what type of encoding the terminal should expect.
Original answer: Try:
print name.string
or
print name.string.encode('utf-8')

try
text = text.decode("utf-8", "replace")

Related

Python function to turn internationalized domain name from U-Label to A-Label? [duplicate]

I have a long list of domain names which I need to generate some reports on. The list contains some IDN domains, and although I know how to convert them in python on the command line:
>>> domain = u"pfarmerü.com"
>>> domain
u'pfarmer\xfc.com'
>>> domain.encode("idna")
'xn--pfarmer-t2a.com'
>>>
I'm struggling to get it to work with a small script reading data from the text file.
#!/usr/bin/python
import sys
infile = open(sys.argv[1])
for line in infile:
print line,
domain = unicode(line.strip())
print type(domain)
print "IDN:", domain.encode("idna")
print
I get the following output:
$ ./idn.py ./test
pfarmer.com
<type 'unicode'>
IDN: pfarmer.com
pfarmerü.com
Traceback (most recent call last):
File "./idn.py", line 9, in <module>
domain = unicode(line.strip())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 7: ordinal not in range(128)
I have also tried:
#!/usr/bin/python
import sys
import codecs
infile = codecs.open(sys.argv[1], "r", "utf8")
for line in infile:
print line,
domain = line.strip()
print type(domain)
print "IDN:", domain.encode("idna")
print
Which gave me:
$ ./idn.py ./test
Traceback (most recent call last):
File "./idn.py", line 8, in <module>
for line in infile:
File "/usr/lib/python2.6/codecs.py", line 679, in next
return self.reader.next()
File "/usr/lib/python2.6/codecs.py", line 610, in next
line = self.readline()
File "/usr/lib/python2.6/codecs.py", line 525, in readline
data = self.read(readsize, firstline=True)
File "/usr/lib/python2.6/codecs.py", line 472, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-5: unsupported Unicode code range
Here is my test data file:
pfarmer.com
pfarmerü.com
I'm very aware of my need to understand unicode now.
Thanks,
Peter
you need to know in which encoding you file was saved. This would be something like 'utf-8' (which is NOT Unicode) or 'iso-8859-1' or 'cp1252' or alike.
Then you can do (assuming 'utf-8'):
infile = open(sys.argv[1])
for line in infile:
print line,
domain = line.strip().decode('utf-8')
print type(domain)
print "IDN:", domain.encode("idna")
print
Convert encoded strings to unicode with decode. Convert unicode to string with encode. If you try to encode something which is already encoded, python tries to decode first, with the default codec 'ascii' which fails for non-ASCII-values.
Your first example is fine, except that:
domain = unicode(line.strip())
you have to specify a particular encoding here: unicode(line.strip(), 'utf-8'). Otherwise you get the default encoding which for safety is 7-bit ASCII, hence the error. Alternatively you can spell it line.strip().decode('utf-8') as in knitti's example; there is no difference in behaviour between the two syntaxes.
However judging by the error “can't decode byte 0xfc”, I think you haven't actually saved your test file as UTF-8. Presumably this is why the second example, that also looks OK in principle, fails.
Instead it's ISO-8859-1 or the very similar Windows code page 1252. If it's come from a text editor on a Western Windows box it will certainly be the latter; Linux machines use UTF-8 by default instead nowadays. Either make sure to save your file as UTF-8, or read the file using the encoding 'cp1252' instead.

Python ignores encoding argument in favor of cp1252

I have a lengthy json file that contains utf-8 characters (and is encoded in utf-8). I want to read it in python using the built-in json module.
My code looks like this:
dat = json.load(open("data.json"), "utf-8")
Though I understand the "utf-8" argument should be unnecessary as it is assumed as the default. However, I get this error:
Traceback (most recent call last):
File "winratio.py", line 9, in <module>
dat = json.load(open("data.json"), "utf-8")
File "C:\Python33\lib\json\__init__.py", line 271, in load
return loads(fp.read(),
File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 28519: ch
aracter maps to <undefined>
My question is: Why does python seem to ignore my encoding specification and try to load the file in cp1252?
Try this:
import codecs
dat = json.load(codecs.open("data.json", "r", "utf-8"))
Also here are described some tips about a writing mode in context of the codecs library: Write to UTF-8 file in Python

python, vobject, encoding, vcards

I am using vobject in python. I am attempting to parse the vcard located here:
http://www.mayerbrown.com/people/vCard.aspx?Attorney=1150
to do this, I do the following:
import urllib
import vobject
vcard = urllib.urlopen("http://www.mayerbrown.com/people/vCard.aspx?Attorney=1150").read()
vcard_object = vobject.readOne(vcard)
Whenever I do this, I get the following error:
Traceback (most recent call last):
File "<pyshell#86>", line 1, in <module>
vobject.readOne(urllib.urlopen("http://www.mayerbrown.com/people/vCard.aspx?Attorney=1150").read())
File "C:\Python27\lib\site-packages\vobject-0.8.1c-py2.7.egg\vobject\base.py", line 1078, in readOne
ignoreUnreadable, allowQP).next()
File "C:\Python27\lib\site-packages\vobject-0.8.1c-py2.7.egg\vobject\base.py", line 1031, in readComponents
vline = textLineToContentLine(line, n)
File "C:\Python27\lib\site-packages\vobject-0.8.1c-py2.7.egg\vobject\base.py", line 888, in textLineToContentLine
return ContentLine(*parseLine(text, n), **{'encoded':True, 'lineNumber' : n})
File "C:\Python27\lib\site-packages\vobject-0.8.1c-py2.7.egg\vobject\base.py", line 262, in __init__
self.value = str(self.value).decode('quoted-printable')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 29: ordinal not in range(128)
I have tried a number of other variations on this, such as converting vcard into unicode, using various encodings,etc. But I always get the same, or a very similar, error message.
Any ideas on how to fix this?
It's failing on line 13 of the vCard because the ADR property is incorrectly marked as being encoded in the "quoted-printable" encoding. The ü character should be encoded as =FC, which is why vobject is throwing the error.
File is downloaded as UTF-8 (i think) encoded string, but library tries to interpret it as ASCII.
Try adding following line after urlopen:
vcard = vcard.decode('utf-8')
vobject library readOne method is pretty awkward.
To avoid problems I decided to persist in my database the vcards in form of quoted-printable data, which the one likes.
assuming some_vcard is string with UTF-8 encoding
quopried_vcard = quopri.encodestring(some_vcard)
and the quopried_vcard gets persisted, and when needed just:
vobj = vobject.readOne(quopried_vcard)
and then to get back decoded data, e.g for fn field in vcard:
quopri.decodestring(vobj.fn.value)
Maybe somebody can handle UTF-8 with readOne better. If yes I would love to see it.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)

I want to parse my XML document. So I have stored my XML document as below
class XMLdocs(db.Expando):
id = db.IntegerProperty()
name=db.StringProperty()
content=db.BlobProperty()
Now my below is my code
parser = make_parser()
curHandler = BasketBallHandler()
parser.setContentHandler(curHandler)
for q in XMLdocs.all():
parser.parse(StringIO.StringIO(q.content))
I am getting below error
'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 517, in __call__
handler.post(*groups)
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/base_handler.py", line 59, in post
self.handle()
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 168, in handle
scan_aborted = not self.process_entity(entity, ctx)
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 233, in process_entity
handler(entity)
File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 71, in process
parser.parse(StringIO.StringIO(q.content))
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 136, in characters
print ch
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
The actual best answer for this problem depends on your environment, specifically what encoding your terminal expects.
The quickest one-line solution is to encode everything you print to ASCII, which your terminal is almost certain to accept, while discarding characters that you cannot print:
print ch #fails
print ch.encode('ascii', 'ignore')
The better solution is to change your terminal's encoding to utf-8, and encode everything as utf-8 before printing. You should get in the habit of thinking about your unicode encoding EVERY time you print or read a string.
Just putting .encode('utf-8') at the end of object will do the job in recent versions of Python.
It seems you are hitting a UTF-8 byte order mark (BOM). Try using this unicode string with BOM extracted out:
import codecs
content = unicode(q.content.strip(codecs.BOM_UTF8), 'utf-8')
parser.parse(StringIO.StringIO(content))
I used strip instead of lstrip because in your case you had multiple occurences of BOM, possibly due to concatenated file contents.
This worked for me:
from django.utils.encoding import smart_str
content = smart_str(content)
The problem according to your traceback is the print statement on line 136 of parseXML.py. Unfortunately you didn't see fit to post that part of your code, but I'm going to guess it is just there for debugging. If you change it to:
print repr(ch)
then you should at least see what you are trying to print.
The problem is that you're trying to print an unicode character to a possibly non-unicode terminal. You need to encode it with the 'replace option before printing it, e.g. print ch.encode(sys.stdout.encoding, 'replace').
An easy solution to overcome this problem is to set your default encoding to utf8. Follow is an example
import sys
reload(sys)
sys.setdefaultencoding('utf8')

Converting domain names to idn in python

I have a long list of domain names which I need to generate some reports on. The list contains some IDN domains, and although I know how to convert them in python on the command line:
>>> domain = u"pfarmerü.com"
>>> domain
u'pfarmer\xfc.com'
>>> domain.encode("idna")
'xn--pfarmer-t2a.com'
>>>
I'm struggling to get it to work with a small script reading data from the text file.
#!/usr/bin/python
import sys
infile = open(sys.argv[1])
for line in infile:
print line,
domain = unicode(line.strip())
print type(domain)
print "IDN:", domain.encode("idna")
print
I get the following output:
$ ./idn.py ./test
pfarmer.com
<type 'unicode'>
IDN: pfarmer.com
pfarmerü.com
Traceback (most recent call last):
File "./idn.py", line 9, in <module>
domain = unicode(line.strip())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 7: ordinal not in range(128)
I have also tried:
#!/usr/bin/python
import sys
import codecs
infile = codecs.open(sys.argv[1], "r", "utf8")
for line in infile:
print line,
domain = line.strip()
print type(domain)
print "IDN:", domain.encode("idna")
print
Which gave me:
$ ./idn.py ./test
Traceback (most recent call last):
File "./idn.py", line 8, in <module>
for line in infile:
File "/usr/lib/python2.6/codecs.py", line 679, in next
return self.reader.next()
File "/usr/lib/python2.6/codecs.py", line 610, in next
line = self.readline()
File "/usr/lib/python2.6/codecs.py", line 525, in readline
data = self.read(readsize, firstline=True)
File "/usr/lib/python2.6/codecs.py", line 472, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-5: unsupported Unicode code range
Here is my test data file:
pfarmer.com
pfarmerü.com
I'm very aware of my need to understand unicode now.
Thanks,
Peter
you need to know in which encoding you file was saved. This would be something like 'utf-8' (which is NOT Unicode) or 'iso-8859-1' or 'cp1252' or alike.
Then you can do (assuming 'utf-8'):
infile = open(sys.argv[1])
for line in infile:
print line,
domain = line.strip().decode('utf-8')
print type(domain)
print "IDN:", domain.encode("idna")
print
Convert encoded strings to unicode with decode. Convert unicode to string with encode. If you try to encode something which is already encoded, python tries to decode first, with the default codec 'ascii' which fails for non-ASCII-values.
Your first example is fine, except that:
domain = unicode(line.strip())
you have to specify a particular encoding here: unicode(line.strip(), 'utf-8'). Otherwise you get the default encoding which for safety is 7-bit ASCII, hence the error. Alternatively you can spell it line.strip().decode('utf-8') as in knitti's example; there is no difference in behaviour between the two syntaxes.
However judging by the error “can't decode byte 0xfc”, I think you haven't actually saved your test file as UTF-8. Presumably this is why the second example, that also looks OK in principle, fails.
Instead it's ISO-8859-1 or the very similar Windows code page 1252. If it's come from a text editor on a Western Windows box it will certainly be the latter; Linux machines use UTF-8 by default instead nowadays. Either make sure to save your file as UTF-8, or read the file using the encoding 'cp1252' instead.

Categories