Python ignores encoding argument in favor of cp1252 - python

I have a lengthy json file that contains utf-8 characters (and is encoded in utf-8). I want to read it in python using the built-in json module.
My code looks like this:
dat = json.load(open("data.json"), "utf-8")
Though I understand the "utf-8" argument should be unnecessary as it is assumed as the default. However, I get this error:
Traceback (most recent call last):
File "winratio.py", line 9, in <module>
dat = json.load(open("data.json"), "utf-8")
File "C:\Python33\lib\json\__init__.py", line 271, in load
return loads(fp.read(),
File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 28519: ch
aracter maps to <undefined>
My question is: Why does python seem to ignore my encoding specification and try to load the file in cp1252?

Try this:
import codecs
dat = json.load(codecs.open("data.json", "r", "utf-8"))
Also here are described some tips about a writing mode in context of the codecs library: Write to UTF-8 file in Python

Related

Python function to turn internationalized domain name from U-Label to A-Label? [duplicate]

I have a long list of domain names which I need to generate some reports on. The list contains some IDN domains, and although I know how to convert them in python on the command line:
>>> domain = u"pfarmerü.com"
>>> domain
u'pfarmer\xfc.com'
>>> domain.encode("idna")
'xn--pfarmer-t2a.com'
>>>
I'm struggling to get it to work with a small script reading data from the text file.
#!/usr/bin/python
import sys
infile = open(sys.argv[1])
for line in infile:
print line,
domain = unicode(line.strip())
print type(domain)
print "IDN:", domain.encode("idna")
print
I get the following output:
$ ./idn.py ./test
pfarmer.com
<type 'unicode'>
IDN: pfarmer.com
pfarmerü.com
Traceback (most recent call last):
File "./idn.py", line 9, in <module>
domain = unicode(line.strip())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 7: ordinal not in range(128)
I have also tried:
#!/usr/bin/python
import sys
import codecs
infile = codecs.open(sys.argv[1], "r", "utf8")
for line in infile:
print line,
domain = line.strip()
print type(domain)
print "IDN:", domain.encode("idna")
print
Which gave me:
$ ./idn.py ./test
Traceback (most recent call last):
File "./idn.py", line 8, in <module>
for line in infile:
File "/usr/lib/python2.6/codecs.py", line 679, in next
return self.reader.next()
File "/usr/lib/python2.6/codecs.py", line 610, in next
line = self.readline()
File "/usr/lib/python2.6/codecs.py", line 525, in readline
data = self.read(readsize, firstline=True)
File "/usr/lib/python2.6/codecs.py", line 472, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-5: unsupported Unicode code range
Here is my test data file:
pfarmer.com
pfarmerü.com
I'm very aware of my need to understand unicode now.
Thanks,
Peter
you need to know in which encoding you file was saved. This would be something like 'utf-8' (which is NOT Unicode) or 'iso-8859-1' or 'cp1252' or alike.
Then you can do (assuming 'utf-8'):
infile = open(sys.argv[1])
for line in infile:
print line,
domain = line.strip().decode('utf-8')
print type(domain)
print "IDN:", domain.encode("idna")
print
Convert encoded strings to unicode with decode. Convert unicode to string with encode. If you try to encode something which is already encoded, python tries to decode first, with the default codec 'ascii' which fails for non-ASCII-values.
Your first example is fine, except that:
domain = unicode(line.strip())
you have to specify a particular encoding here: unicode(line.strip(), 'utf-8'). Otherwise you get the default encoding which for safety is 7-bit ASCII, hence the error. Alternatively you can spell it line.strip().decode('utf-8') as in knitti's example; there is no difference in behaviour between the two syntaxes.
However judging by the error “can't decode byte 0xfc”, I think you haven't actually saved your test file as UTF-8. Presumably this is why the second example, that also looks OK in principle, fails.
Instead it's ISO-8859-1 or the very similar Windows code page 1252. If it's come from a text editor on a Western Windows box it will certainly be the latter; Linux machines use UTF-8 by default instead nowadays. Either make sure to save your file as UTF-8, or read the file using the encoding 'cp1252' instead.

XML Encoding error while writing it in to file

I think I am following the right approach but I am still getting an encoding error:
from xml.dom.minidom import Document
import codecs
doc = Document()
wml = doc.createElement("wml")
doc.appendChild(wml)
property = doc.createElement("property")
wml.appendChild(property)
descriptionNode = doc.createElement("description")
property.appendChild(descriptionNode)
descriptionText = doc.createTextNode(description.decode('ISO-8859-1'))
descriptionNode.appendChild(descriptionText)
file = codecs.open('contentFinal.xml', 'w', encoding='ISO-8859-1')
file.write(doc.toprettyxml())
file.close()
The description node contains some characters in ISO-8859-1 encoding, this is encoding specified by the site it self in meta tag. But when doc.toprettyxml() starts writing in file I got following error:
Traceback (most recent call last):
File "main.py", line 467, in <module>
file.write(doc.toprettyxml())
File "C:\Python27\lib\xml\dom\minidom.py", line 60, in toprettyxml
return writer.getvalue()
File "C:\Python27\lib\StringIO.py", line 271, in getvalue
self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 10: ordinal not in range(128)
Why am I getting this error as I am decoding and encoding with same standard?
Edited
I have following deceleration in my script file:
#!/usr/bin/python
# -*- coding: utf-8 -*-
may be this is conflicting?
Ok i have found a solution. When ever data is in other foriegn language you just need to defined the proper encoding in xml header. You do not need to describe encoding in file.write(doc.toprettyxml(encoding='ISO-8859-1')) not even when you are opening a file for writing file = codecs.open('contentFinal.xml', 'w', encoding='ISO-8859-1'). Below is the technique which i used. May be This is not a professional method but that works for me.
file = codecs.open('abc.xml', 'w')
xm = doc.toprettyxml()
xm = xm.replace('<?xml version="1.0" ?>', '<?xml version="1.0" encoding="ISO-8859-1"?>')
file.write(xm)
file.close()
May be there is a method to set default encoding in header but i could not find it.
Above method does not bring any error on browser and all data display perfectly.

UnicodeDecodeError reading string in CSV

I'm having a problem reading some chars in python.
I have a csv file in UTF-8 format, and I'm reading, but when script read:
Preußen Münster-Kaiserslautern II
I get this error:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 515, in __call__
handler.get(*groups)
File "/Users/fermin/project/gae/cuotastats/controllers/controllers.py", line 50, in get
f.name = unicode( row[1])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
I tried to use Unicode functions and convert string to Unicode, but I haven't found the solution. I tried to use sys.setdefaultencoding('utf8') but that doesn't work either.
Try the unicode_csv_reader() generator described in the csv module docs.

Converting domain names to idn in python

I have a long list of domain names which I need to generate some reports on. The list contains some IDN domains, and although I know how to convert them in python on the command line:
>>> domain = u"pfarmerü.com"
>>> domain
u'pfarmer\xfc.com'
>>> domain.encode("idna")
'xn--pfarmer-t2a.com'
>>>
I'm struggling to get it to work with a small script reading data from the text file.
#!/usr/bin/python
import sys
infile = open(sys.argv[1])
for line in infile:
print line,
domain = unicode(line.strip())
print type(domain)
print "IDN:", domain.encode("idna")
print
I get the following output:
$ ./idn.py ./test
pfarmer.com
<type 'unicode'>
IDN: pfarmer.com
pfarmerü.com
Traceback (most recent call last):
File "./idn.py", line 9, in <module>
domain = unicode(line.strip())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 7: ordinal not in range(128)
I have also tried:
#!/usr/bin/python
import sys
import codecs
infile = codecs.open(sys.argv[1], "r", "utf8")
for line in infile:
print line,
domain = line.strip()
print type(domain)
print "IDN:", domain.encode("idna")
print
Which gave me:
$ ./idn.py ./test
Traceback (most recent call last):
File "./idn.py", line 8, in <module>
for line in infile:
File "/usr/lib/python2.6/codecs.py", line 679, in next
return self.reader.next()
File "/usr/lib/python2.6/codecs.py", line 610, in next
line = self.readline()
File "/usr/lib/python2.6/codecs.py", line 525, in readline
data = self.read(readsize, firstline=True)
File "/usr/lib/python2.6/codecs.py", line 472, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-5: unsupported Unicode code range
Here is my test data file:
pfarmer.com
pfarmerü.com
I'm very aware of my need to understand unicode now.
Thanks,
Peter
you need to know in which encoding you file was saved. This would be something like 'utf-8' (which is NOT Unicode) or 'iso-8859-1' or 'cp1252' or alike.
Then you can do (assuming 'utf-8'):
infile = open(sys.argv[1])
for line in infile:
print line,
domain = line.strip().decode('utf-8')
print type(domain)
print "IDN:", domain.encode("idna")
print
Convert encoded strings to unicode with decode. Convert unicode to string with encode. If you try to encode something which is already encoded, python tries to decode first, with the default codec 'ascii' which fails for non-ASCII-values.
Your first example is fine, except that:
domain = unicode(line.strip())
you have to specify a particular encoding here: unicode(line.strip(), 'utf-8'). Otherwise you get the default encoding which for safety is 7-bit ASCII, hence the error. Alternatively you can spell it line.strip().decode('utf-8') as in knitti's example; there is no difference in behaviour between the two syntaxes.
However judging by the error “can't decode byte 0xfc”, I think you haven't actually saved your test file as UTF-8. Presumably this is why the second example, that also looks OK in principle, fails.
Instead it's ISO-8859-1 or the very similar Windows code page 1252. If it's come from a text editor on a Western Windows box it will certainly be the latter; Linux machines use UTF-8 by default instead nowadays. Either make sure to save your file as UTF-8, or read the file using the encoding 'cp1252' instead.

Problem with encode decode. Python. Django. BeautifulSoup

In this code:
soup=BeautifulSoup(program.Description.encode('utf-8'))
name=soup.find('div',{'class':'head'})
print name.string.decode('utf-8')
error happening when i'm trying to print or save to database.
dosnt metter what i'm doing:
print name.string.encode('utf-8')
or just
print name.string
Traceback (most recent call last):
File "./manage.py", line 16, in <module>
execute_manager(settings)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/__init__.py", line 362, in execute_manager
utility.execute()
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/__init__.py", line 303, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/base.py", line 195, in run_from_argv
self.execute(*args, **options.__dict__)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/base.py", line 222, in execute
output = self.handle(*args, **options)
File "/usr/local/cluster/dynamic/website/video/remmedia/management/commands/remmedia.py", line 50, in handle
self.FirstTimeLoad()
File "/usr/local/cluster/dynamic/website/video/remmedia/management/commands/remmedia.py", line 115, in FirstTimeLoad
print name.string.decode('utf-8')
File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-5: ordinal not in range(128)
This is repr(name.string)
u'\u0412\u044b\u043f\u0443\u0441\u043a \u043e\u0442 27 \u0434\u0435\u043a\u0430\u0431\u0440\u044f'
I don't know what you are trying to do with name.string.decode('utf-8'). As the BeautifulSoup documentation eloquently points out, "BeautifulSoup gives you Unicode, dammit". So name.string is already decoded - it is in unicode. You can encode it back to utf-8 if you want to, but you can't decode it any further.
You can try:
print name.string.encode('ascii', 'replace')
The output should be accepted whatever the encoding of sys.stdout is (including None).
In fact, the file-like object that you are printing to might not accept UTF-8. Here is an example: if you have the apparently benign program
# -*- coding: utf-8 -*-
print u"hérisson"
then running it in a terminal that can print accented characters works fine:
lebigot#weinberg /tmp % python2.5 test.py
hérisson
but printing to a standard output connected to a Unix pipe does not:
lebigot#weinberg /tmp % python2.5 test.py | cat
Traceback (most recent call last):
File "test.py", line 3, in <module>
print u"hérisson"
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
because sys.stdout has encoding None, in this case: Python considers that the program that reads through the pipe should receive ASCII, and the printing fails because ASCII cannot represent the word that we want to print. A solution like the one above solves the problem.
Note: You can check the encoding of your standard output with:
print sys.stdout.encoding
This can help you debug encoding problems.
Edit: name.string comes from BeautifulSoup, so it is presumably already a unicode string.
However, your error message mentions 'ascii':
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-5:
ordinal not in range(128)
According to the PrintFails Python wiki page, if Python does not know or
can not determine what kind of encoding your output device is expecting, it sets
sys.stdout.encoding to None and print attempts to encode its arguments with
the 'ascii' codec.
I believe this is the cause of your problem. You can can confirm this by seeing
if print sys.stdout.encoding prints None.
According to the same page, linked above, you can circumvent the problem by
explicitly telling Python what encoding to use. You do that be wrapping
sys.stdout in an instance of StreamWriter:
For example, you could try adding
import sys
import locale
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)
to your script before the print statement. You may have to change
locale.getpreferredencoding() to and explicit encoding (e.g. 'utf-8',
'cp1252', etc.). The right encoding to use depends on your output device.
It should be set to whatever encoding your output device is expecting. If
you are outputing to a terminal, the terminal may have a menu setting to allow
the user to set what type of encoding the terminal should expect.
Original answer: Try:
print name.string
or
print name.string.encode('utf-8')
try
text = text.decode("utf-8", "replace")

Categories