Difference between Python 3.3 and 3.4 `open` default encoding? - python

I have a file with some non-ASCII characters.
$ file -bi companies.txt
text/plain; charset=utf-8
On my desktop with Python 3.4 I can open this file with no problems:
>>> open('companies.txt').read()
'...'
On a CI system with Python 3.3 I get this:
>>> open('companies.txt').read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.3/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1223: ordinal not in range(128)
But if I explicitly specify encoding='utf8', it works:
>>> open('companies.txt', encoding='utf8').read()
'...'
On both systems, sys.getdefaultencoding() returns 'utf-8'.
Any ideas what is causing the systems to behave differently? Why is the CI system trying to use ascii?

The default encoding for text files is determined by locale.getpreferredencoding(False), not sys.getdefaultencoding(). The two machines evidently have different locales; a CI system often runs under the POSIX/C locale, whose preferred encoding is ASCII.
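A quick way to see the difference on any machine (a minimal sketch; the throwaway temp file stands in for the question's companies.txt):

```python
import locale
import os
import sys
import tempfile

# open() with no encoding argument uses the locale's preferred encoding,
# NOT sys.getdefaultencoding() (which only governs str<->bytes defaults).
print(locale.getpreferredencoding(False))  # e.g. 'ANSI_X3.4-1968' under a C/POSIX locale
print(sys.getdefaultencoding())            # 'utf-8' on any Python 3

# Portable fix: pass the encoding explicitly whenever it is known.
path = os.path.join(tempfile.mkdtemp(), 'companies.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('Preu\xdfen M\xfcnster')
with open(path, encoding='utf-8') as f:    # works regardless of locale
    data = f.read()
print(data)
```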

Related

Can't install l18n, UnicodeDecodeError: 'cp950' codec can't decode byte

The error occurs when I run
pip install l18n
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\X\AppData\Local\Temp\pip-install-8urtlamu\l18n\setup.py", line 99, in <module>
long_description=open(os.path.join('README.rst')).read(),
UnicodeDecodeError: 'cp950' codec can't decode byte 0xc3 in position 2135: illegal multibyte sequence
Tried, but didn't work:
chcp 65001
Alternative console: cmder
Config:
Windows 7
Python 3.6.4
Pip 10.0.1
Thanks!
It is manifestly a bug inside l18n: in setup.py, the long_description parameter is built by reading the README.rst file (a classic way to do it).
The traceback says: 'cp950' codec can't decode byte 0xc3 in position 2135. This is a classic error with UTF-8 encoded text that contains non-ASCII characters.
The source code is stored in Bitbucket:
long_description=open(os.path.join('README.rst')).read(),
The behavior of the open function changed in Python 3: with no explicit encoding argument it uses the locale's preferred encoding (cp950 on this Windows setup), so you must set the file encoding, which is utf-8 here.
A portable way to solve that is to define a function:
import io

def read(path):
    with io.open(path, mode='r', encoding='utf-8') as f:
        return f.read()
And to use it like this:
long_description=read('README.rst')
There is an issue about that.

Python - cannot decode html (urllib)

I'm trying to write the HTML from a webpage to a file, but I have a problem decoding characters:
import urllib.request
response = urllib.request.urlopen("https://www.google.com")
charset = response.info().get_content_charset()
print(response.read().decode(charset))
Last line causes error:
Traceback (most recent call last):
File "script.py", line 7, in <module>
print(response.read().decode(charset))
UnicodeEncodeError: 'ascii' codec can't encode character '\u015b' in
position 6079: ordinal not in range(128)
response.info().get_content_charset() returns iso-8859-2, but if I check the content of the response without decoding (print(response.read())), the HTML meta tag declares "utf-8". If I use "utf-8" in the decode call, there is a similar problem:
Traceback (most recent call last):
File "script.py", line 7, in <module>
print(response.read().decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position
6111: invalid start byte
What's going on?
You can ignore invalid characters using
response.read().decode("utf-8", 'ignore')
Instead of ignore there are other options, e.g. replace
https://www.tutorialspoint.com/python/string_encode.htm
https://docs.python.org/3/howto/unicode.html#the-string-type
(There is also str.encode(encoding='UTF-8',errors='strict') for strings.)
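Side by side, using the stray 0xb6 byte from the traceback (a small illustrative sketch; note the same byte is valid in iso-8859-2, the charset the server actually declared, where it maps to 'ś', the U+015B from the first traceback):

```python
bad = b'abc\xb6def'

# 'ignore' silently drops the undecodable byte
print(bad.decode('utf-8', 'ignore'))     # abcdef

# 'replace' substitutes U+FFFD so you can see where data was lost
print(bad.decode('utf-8', 'replace'))    # abc�def

# the same byte decodes fine with the server-declared charset
print(b'\xb6'.decode('iso-8859-2'))      # ś
```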

representing µs in Python 2.7

I'm parsing a csv and writing part of its contents to an xls file using xlwt.
Every time µs pops up in the original file, I get a UnicodeDecodeError from xlwt:
File "C:\SW_DevSandbox\E2\FlightTestInstrumentation\ICDforFTI\ICDforFTI.py", line 243, in generateICD
icd.write(icdLine,icdTitle.index('Unit'),entry['Unit'])
File "C:\espressoE2\tools\OpenVIB\1.2\python\lib\site-packages\xlwt\Worksheet.py", line 1030, in write
self.row(r).write(c, label, style)
File "C:\espressoE2\tools\OpenVIB\1.2\python\lib\site-packages\xlwt\Row.py", line 240, in write
StrCell(self.__idx, col, style_index, self.__parent_wb.add_str(label))
File "C:\espressoE2\tools\OpenVIB\1.2\python\lib\site-packages\xlwt\Workbook.py", line 326, in add_str
return self.__sst.add_str(s)
File "C:\espressoE2\tools\OpenVIB\1.2\python\lib\site-packages\xlwt\BIFFRecords.py", line 24, in add_str
s = unicode(s, self.encoding)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 0: invalid start byte
I think the root problem is the following:
In Python 3, I can easily represent µs:
>>> '\xb5s'
'µs'
>>>
In Python 2, apparently not:
>>> '\xb5s'
'\xb5s'
>>> u'\xb5s'
u'\xb5s'
>>> unicode('\xb5s')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: ordinal not in range(128)
>>> unicode('\xb5s','utf8')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\espressoE2\tools\OpenVIB\1.2\python\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 0: invalid start byte
>>>
Edit: print u'\xb5s' works in Python 2, thanks @cdarke. But print does not solve the problem; it's not an internal representation I can then feed to xlwt.
end of Edit.
So how can I represent µs in Python 2?
Notepad++ displays the csv file fine, with µs. The "Encoding" menu shows its encoding as "ANSI", and if I change to "UTF-8" I start seeing "B5" all over the text.
Python 2 Unicode has no encoding called "ANSI".
Is there a Python 2 Unicode encoding equivalent to what Notepad++ calls "ANSI"?
ANSI in Notepad++ is the native locale for Windows. If you are using US Windows, that locale is cp1252. Your file is probably encoded in cp1252 and not utf8. If you are using another version of Windows, locale.getpreferredencoding() will tell you what Windows considers ANSI.
>>> '\xb5s'.decode('cp1252')
u'\xb5s'
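The same decode as a runnable Python 3 sketch (bytes literals make the Python 2 str/unicode split explicit):

```python
raw = b'\xb5s'               # the bytes as read from the "ANSI" csv
text = raw.decode('cp1252')  # cp1252 maps 0xb5 to U+00B5 (micro sign)
print(text)                  # µs
```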

Prevent encoding errors in Python

I have scripts which print out messages by the logging system or sometimes print commands. On the Windows console I get error messages like
Traceback (most recent call last):
File "C:\Python32\lib\logging\__init__.py", line 939, in emit
stream.write(msg)
File "C:\Python32\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 4537:character maps to <undefined>
Is there a general way to make all encodings in the logging system, print commands, etc. fail-safe (ignore errors)?
The problem is that your terminal/shell (cmd, since you are on Windows) cannot print every Unicode character.
You can fail-safe encode your strings with the errors argument of the str.encode method. For example, you can replace unsupported characters with ? by setting errors='replace'.
>>> s = u'\u2019'
>>> print s
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position
0: character maps to <undefined>
>>> print s.encode('cp850', errors='replace')
?
See the documentation for other options.
Edit If you want a general solution for the logging, you can subclass StreamHandler:
class CustomStreamHandler(logging.StreamHandler):
    def emit(self, record):
        # A LogRecord cannot be encoded directly: merge the args into the
        # message first, then encode fail-safe before the base class writes it
        record.msg = record.getMessage().encode('cp850', errors='replace')
        record.args = None
        logging.StreamHandler.emit(self, record)
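A self-contained variant of the same idea that also runs on Python 3 (there the stream expects str, so this sketch round-trips through the target charset instead of writing bytes; the class name and the 'demo' logger are illustrative):

```python
import io
import logging

class ReplacingStreamHandler(logging.StreamHandler):
    """StreamHandler that replaces characters the target charset cannot encode."""
    def __init__(self, stream=None, charset='cp850'):
        logging.StreamHandler.__init__(self, stream)
        self.charset = charset

    def emit(self, record):
        # Encode with errors='replace', then decode back so the base
        # class still receives a str it can write to the stream.
        safe = record.getMessage().encode(self.charset, errors='replace')
        record.msg = safe.decode(self.charset)
        record.args = None
        logging.StreamHandler.emit(self, record)

buf = io.StringIO()
logger = logging.getLogger('demo')
logger.addHandler(ReplacingStreamHandler(stream=buf))
logger.warning('curly quote: \u2019')   # U+2019 has no cp850 mapping
print(buf.getvalue())                   # curly quote: ?
```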

UnicodeDecodeError reading string in CSV

I'm having a problem reading some characters in Python.
I have a csv file in UTF-8 format, and I'm reading it, but when the script reads:
Preußen Münster-Kaiserslautern II
I get this error:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 515, in __call__
handler.get(*groups)
File "/Users/fermin/project/gae/cuotastats/controllers/controllers.py", line 50, in get
f.name = unicode( row[1])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
I tried to use unicode functions and convert the string to Unicode, but I haven't found a solution. I tried sys.setdefaultencoding('utf8'), but that doesn't work either.
Try the unicode_csv_reader() generator described in the examples section of the Python 2 csv module docs; the csv module in Python 2 cannot read Unicode input directly.
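That recipe, sketched here so it also runs under Python 3 for illustration (on Python 3, csv.reader consumes str rows natively, so the UTF-8 round-trip is only needed on Python 2):

```python
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    try:
        text_type = unicode   # Python 2
    except NameError:
        text_type = str       # Python 3
    if text_type is str:
        # Python 3: the csv module handles str rows directly
        for row in csv.reader(unicode_csv_data, dialect=dialect, **kwargs):
            yield row
    else:
        # Python 2: feed csv UTF-8 bytes, then decode each cell back
        encoded = (line.encode('utf-8') for line in unicode_csv_data)
        for row in csv.reader(encoded, dialect=dialect, **kwargs):
            yield [text_type(cell, 'utf-8') for cell in row]

rows = list(unicode_csv_reader([u'Preu\xdfen M\xfcnster,Kaiserslautern II']))
print(rows)
```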
