Representing µs in Python 2.7

I'm parsing a CSV file and writing part of its contents to an .xls file using xlwt.
Every time µs pops up in the original file, I get a UnicodeDecodeError from xlwt:
File "C:\SW_DevSandbox\E2\FlightTestInstrumentation\ICDforFTI\ICDforFTI.py", line 243, in generateICD
icd.write(icdLine,icdTitle.index('Unit'),entry['Unit'])
File "C:\espressoE2\tools\OpenVIB\1.2\python\lib\site-packages\xlwt\Worksheet.py", line 1030, in write
self.row(r).write(c, label, style)
File "C:\espressoE2\tools\OpenVIB\1.2\python\lib\site-packages\xlwt\Row.py", line 240, in write
StrCell(self.__idx, col, style_index, self.__parent_wb.add_str(label))
File "C:\espressoE2\tools\OpenVIB\1.2\python\lib\site-packages\xlwt\Workbook.py", line 326, in add_str
return self.__sst.add_str(s)
File "C:\espressoE2\tools\OpenVIB\1.2\python\lib\site-packages\xlwt\BIFFRecords.py", line 24, in add_str
s = unicode(s, self.encoding)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 0: invalid start byte
I think the root problem is the following:
In Python 3, I can easily represent µs:
>>> '\xb5s'
'µs'
>>>
In Python 2, apparently not:
>>> '\xb5s'
'\xb5s'
>>> u'\xb5s'
u'\xb5s'
>>> unicode('\xb5s')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: ordinal not in range(128)
>>> unicode('\xb5s','utf8')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\espressoE2\tools\OpenVIB\1.2\python\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 0: invalid start byte
>>>
Edit: print u'\xb5s' works in Python 2, thanks @cdarke. But print does not solve the problem; it does not give me an internal representation I can then feed to xlwt.
End of edit.
So how can I represent µs in Python 2?
Notepad++ displays the CSV file fine, with µs. The "Encoding" menu shows its encoding as "ANSI", and if I change it to "UTF-8" I start seeing "B5" all over the text.
Python 2 Unicode has no encoding called "ANSI".
Is there a Python 2 Unicode encoding equivalent to what Notepad++ calls "ANSI"?

"ANSI" in Notepad++ refers to the native Windows locale, i.e. the system code page. If you are using US Windows, that locale is cp1252, so your file is probably encoded in cp1252 and not UTF-8. If you are using another version of Windows, locale.getpreferredencoding() will tell you what Windows considers ANSI.
>>> '\xb5s'.decode('cp1252')
u'\xb5s'
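Applied to the xlwt code from the traceback, a minimal sketch of the fix (assuming the CSV really is cp1252-encoded; icd, icdLine, icdTitle and entry are the names from the question):
unit = entry['Unit'].decode('cp1252')              # byte string -> unicode
icd.write(icdLine, icdTitle.index('Unit'), unit)   # xlwt accepts unicode labels directly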


"UnicodeEncodeError: 'charmap' codec can't encode characters" when trying to parse .xlsx by openpyxl

--- update ---
I think this console log nails the issue; however, it's still not clear how to fix it:
>>> workbook = openpyxl.load_workbook('data.xlsx')
>>> worksheet = workbook.active
>>> worksheet['A2'].value
u'\u041c\u0435\u0448\u043e\u043a \u0434\u0435\u043d\u0435\u0433'
>>> print worksheet['A2'].value
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>
--- end update ---
I'm trying to print the values of some .xlsx cells using openpyxl:
import openpyxl
workbook = openpyxl.load_workbook(filename='puzzles.xlsx')
worksheet = workbook.active
for row in worksheet.iter_rows('A2:K5'):
    print row[0].value
Which results in the following error:
Traceback (most recent call last):
File "xls_import.py", line 8, in <module>
print row[0].value
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>
As far as I know, XLSX is encoded as UTF-8, however:
print row[0].value.decode('utf-8')
does not help either:
Traceback (most recent call last):
File "xls_import.py", line 8, in <module>
print row[0].value.decode('utf-8')
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
Any suggestions?
I'm running Python 2.7 and openpyxl 2.2.5.
openpyxl returns unicode strings (the underlying XML is UTF-8 encoded), so you don't need to decode them (decoding goes from bytes in some encoding to unicode); instead, encode them in the encoding of your choice before printing.
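A minimal sketch of that, assuming you want to print to the current console (the 'replace' error handler is one possible choice and simply substitutes unmappable characters):
import sys
value = worksheet['A2'].value                              # openpyxl returns a unicode object
print value.encode(sys.stdout.encoding or 'utf-8', 'replace')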

Difference between Python 3.3 and 3.4 `open` default encoding?

I have a file with some non-ASCII characters.
$ file -bi companies.txt
text/plain; charset=utf-8
On my desktop with Python 3.4 I can open this file with no problems:
>>> open('companies.txt').read()
'...'
On a CI system with Python 3.3 I get this:
>>> open('companies.txt').read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.3/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1223: ordinal not in range(128)
But if I explicitly specify encoding='utf8', it works:
>>> open('companies.txt', encoding='utf8').read()
'...'
On both systems, sys.getdefaultencoding() returns 'utf-8'.
Any ideas what is causing the systems to behave differently? Why is the CI system trying to use ascii?
The default encoding for text files opened with open() is determined by locale.getpreferredencoding(False), not by sys.getdefaultencoding(). Your CI system is most likely running under the C/POSIX locale, whose preferred encoding is ASCII.
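A quick way to check, as a sketch (the values shown are illustrative; a C/POSIX locale typically reports an ASCII charmap):
>>> import locale, sys
>>> sys.getdefaultencoding()
'utf-8'
>>> locale.getpreferredencoding(False)
'ANSI_X3.4-1968'
Either export a UTF-8 locale (LANG/LC_ALL) on the CI machine or always pass encoding='utf8' to open().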

Python fails with parsing file using re

I have a file that is mostly ASCII, but some non-ASCII characters appear in it occasionally. I want to parse these files and extract the lines that are marked in a certain way. Previously I used sed for this, but now I need to do the same in Python. (Of course I could still use os.system, but I'm hoping for something more convenient.)
I'm doing the following.
p = re.compile(".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")
f = open("capture_8_8_8__1_2_3.log", encoding="ascii")
fl = filter(lambda line: p.match(line), f)
len(list(fl))
And on the last line I get the following error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position 2227: ordinal not in range(128)
If I remove the encoding parameter from the second line, i.e. use the default encoding, which is UTF-8, the error is the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 2227: invalid start byte
Could you please tell me what I can do here, other than calling sed from Python?
UPD.
Thanks to @Wooble I found the answer.
The correct code looks like the following:
p = re.compile(rb".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")
f = open("capture_8_8_8__1_2_3.log", "rb")
fl = filter(lambda line: p.match(line), f)
len(list(fl))
I opened the file in binary mode and also compiled the regex from a byte-string pattern.
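An alternative sketch that stays in text mode (Python 3): open the file with an error handler so stray bytes don't abort decoding. The 'surrogateescape' handler used below is one possible choice; the file name and pattern are the ones from the question.
import re

p = re.compile(r".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")
with open("capture_8_8_8__1_2_3.log", encoding="ascii", errors="surrogateescape") as f:
    matching = [line for line in f if p.match(line)]  # lines are str, pattern is str
print(len(matching))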

Why do some strings encode in utf-16, while others only encode in utf-8?

>>> unicode('восстановление информации', 'utf-16')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xb8 in position 48: truncated data
>>> unicode('восстановление информации', 'utf-8')
u'\u0432\u043e\u0441\u0441\u0442\u0430\u043d\u043e\u0432\u043b\u0435\u043d\u0438\u0435 \u0438\u043d\u0444\u043e\u0440\u043c\u0430\u0446\u0438\u0438'
Why do these Russian words encode in UTF-8 fine, but not UTF-16?
You are asking the unicode function to decode a byte string and then giving it the wrong encoding.
Pasting your string into Python 2.7 on OS X gives:
>>> 'восстановление информации'
'\xd0\xb2\xd0\xbe\xd1\x81\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb8\xd0\xbd\xd1\x84\xd0\xbe\xd1\x80\xd0\xbc\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8'
At this stage it is already a UTF-8 encoded byte string (your terminal's encoding most likely determined this), so you can decode it by specifying the utf-8 codec:
>>> 'восстановление информации'.decode('utf-8')
u'\u0432\u043e\u0441\u0441\u0442\u0430\u043d\u043e\u0432\u043b\u0435\u043d\u0438\u0435 \u0438\u043d\u0444\u043e\u0440\u043c\u0430\u0446\u0438\u0438'
But not UTF-16, as the bytes are not valid UTF-16:
>>> 'восстановление информации'.decode('utf-16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xb8 in position 48: truncated data
If you want to encode a unicode string to UTF-8 or UTF-16, then use
>>> u'восстановление информации'.encode('utf-16')
'\xff\xfe2\x04>\x04A\x04A\x04B\x040\x04=\x04>\x042\x04;\x045\x04=\x048\x045\x04 \x008\x04=\x04D\x04>\x04#\x04<\x040\x04F\x048\x048\x04'
>>> u'восстановление информации'.encode('utf-8')
'\xd0\xb2\xd0\xbe\xd1\x81\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb8\xd0\xbd\xd1\x84\xd0\xbe\xd1\x80\xd0\xbc\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8'
Notice that the input strings are unicode (they have a u prefix), while the outputs are byte strings (no u prefix) containing the text encoded in the respective formats.
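Putting the two steps together for the original example, as a sketch (output abbreviated):
>>> 'восстановление информации'.decode('utf-8').encode('utf-16')
'\xff\xfe2\x04>\x04A\x04A\x04...'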

UnicodeDecodeError reading string in CSV

I'm having a problem reading some characters in Python.
I have a CSV file in UTF-8 format, and I'm reading it, but when the script reads:
Preußen Münster-Kaiserslautern II
I get this error:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 515, in __call__
handler.get(*groups)
File "/Users/fermin/project/gae/cuotastats/controllers/controllers.py", line 50, in get
f.name = unicode( row[1])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
I tried to use Unicode functions and convert the string to Unicode, but I haven't found a solution. I also tried sys.setdefaultencoding('utf8'), but that doesn't work either.
Try the unicode_csv_reader() generator described in the csv module docs.
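That recipe, roughly as it appears in the Python 2.7 csv documentation (a sketch: csv cannot consume unicode directly, so the input is re-encoded to UTF-8 for parsing and each cell is decoded back afterwards):
import csv

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

Each row then comes back as a list of unicode objects, so row[1] can be assigned to f.name without any further decoding.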
