UnicodeDecodeError when trying to save an Excel File with Python xlwt - python

I'm running a Python script that writes HTML code found using BeautifulSoup into multiple rows of an Excel spreadsheet column.
[...]
Col_HTML = 19
w_sheet.write(row_index, Col_HTML, str(HTML_Code))
wb.save(output)
When trying to save the file, I get the following error message:
Traceback (most recent call last):
File "C:\Users\[..]\src\MYCODE.py", line 201, in <module>
wb.save(output)
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\Workbook.py", line 662, in save
doc.save(filename, self.get_biff_data())
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\Workbook.py", line 637, in get_biff_data
shared_str_table = self.__sst_rec()
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\Workbook.py", line 599, in __sst_rec
return self.__sst.get_biff_record()
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\BIFFRecords.py", line 76, in get_biff_record
self._add_to_sst(s)
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\BIFFRecords.py", line 91, in _add_to_sst
u_str = upack2(s, self.encoding)
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\UnicodeUtils.py", line 50, in upack2
us = unicode(s, encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5181: ordinal not in range(128)
I've successfully written Python script in the past to write into worksheets. It's the first time I try to write a string of HTML into cells and I'm wondering what is causing the error and how I could fix it.

Use this line before passing HTML_Code to w_sheet.write
HTML_Code = HTML_Code.decode('utf-8')
Because, in the error line UnicodeDecodeError: 'ascii' codec can't decode, Python is trying to decode unicode into ascii, so you need to decode unicode using the proper encoding format, that is, utf-8.
So, you have:
Col_HTML = 19
HTML_Code = HTML_Code.decode('utf-8')
w_sheet.write(row_index, Col_HTML, str(HTML_Code))

Related

dvc.api.read() raises an "UnicodeDecodeError"

I am trying to acess a DICOM file [image saved in the Digital Imaging and Communications in Medicine (DICOM) format]:
import dvc.api
path = 'dir/image.dcm'
remote = 'remote_name'
repo = 'git_repo'
mode = 'r'
data = dvc.api.read(path = path, remote = remote, repo = repo, mode = mode)
When I run the previous code, and after the "downloading progress bar" is complete, I get the following error:
Traceback (most recent call last): File "draft.py", line 7, in <module> mode ='r') File "C:\Users\lbrandao\anaconda3\envs\my_env\lib\site-packages\dvc\api.py", line 91, in read return fd.read() File "C:\Users\lbrandao\anaconda3\envs\my_env\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 764: character maps to <undefined>
I tried to overcome this issue by using the encoding argument:
data = dvc.api.read(path = path, remote = remote, repo = repo, mode = mode, encoding='ANSI')
Since, when I open a DICOM file using for example Notepad++, this is the encoding specified. However, it raises the error:
Exception ignored in: <bound method Pool.__del__ of <dvc.fs.pool.Pool object at 0x0000021D1347A160>> Traceback (most recent call last): File "C:\Users\lbrandao\anaconda3\envs\my_env\lib\site-packages\dvc\fs\pool.py", line 42, in __del__ File "C:\Users\lbrandao\anaconda3\envs\my_env\lib\site-packages\dvc\fs\pool.py", line 46, in close File "C:\Users\lbrandao\anaconda3\envs\my_env\lib\site-packages\dvc\fs\ssh\connection.py", line 71, in close File "C:\Users\lbrandao\anaconda3\envs\my_env\lib\site-packages\paramiko\sftp_client.py", line 194, in close File "C:\Users\lbrandao\anaconda3\envs\my_env\lib\site-packages\paramiko\sftp_client.py", line 185, in _log File "C:\Users\lbrandao\anaconda3\envs\my_env\lib\site-packages\paramiko\sftp.py", line 158, in _log File "C:\Users\lbrandao\anaconda3\envs\my_env\lib\logging\__init__.py", line 1372, in log File "C:\Users\lbrandao\anaconda3\envs\my_env\lib\logging\__init__.py", line 1441, in _log File "C:\Users\lbrandao\anaconda3\envs\my_env\lib\logging\__init__.py", line 1411, in makeRecord TypeError: 'NoneType' object is not callable
I also tried encoding = 'utf-8', but the "UnicodeDecodeError" continues to appear:
Traceback (most recent call last): File "draft.py", line 7, in <module> mode ='r', encoding='utf-8') File "C:\Users\lbrandao\anaconda3\envs\ccab_env_dev\lib\site-packages\dvc\api.py", line 91, in read return fd.read() File "C:\Users\lbrandao\anaconda3\envs\ccab_env_dev\lib\codecs.py", line 321, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 140: invalid continuation byte
Can anyone please help? Thanks.
I am not familiar with dvc.api, but a lot of people who deal with DICOMs use either pydicom or simpleITK. I wrote a library which can get most of the the data using either (cleanX) and you can look at an example notebook here.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 1: invalid start byte

I had built chatbot following this github:
https://github.com/llSourcell/tensorflow_chatbot
Also I get data on: https://github.com/suriyadeepan/easy_seq2seq/tree/master/data
I use tensorflow 0.12 with python 3.5. Can someone help me fix this problem: >> Mode : test
Traceback (most recent call last):
File "execute.py", line 324, in
decode()
File "execute.py", line 220, in decode
enc_vocab, _ = data_utils.initialize_vocabulary(enc_vocab_path)
File "D:\My_document\AI\Chatbot_Conversation\tensorflow_chatbot-master\data_utils.py", line 86, in initialize_vocabulary
rev_vocab.extend(f.readlines())
File "C:\Users\Hoang\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 131, in readlines
s = self.readline()
File "C:\Users\Hoang\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 124, in readline
return compat.as_str_any(self._read_buf.ReadLineAsString())
File "C:\Users\Hoang\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\util\compat.py", line 106, in as_str_any
return as_str(value)
File "C:\Users\Hoang\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\util\compat.py", line 84, in as_text
return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 1: invalid start byte
This error mean that I need to revise my data to utf-8?
I appreciate any help. Thank you!
Please try to read the data using encoding='unicode_escape'. For example
df= pd.read_csv('file_name.csv',encoding ='unicode_escape')
It worked for me.

Umlauts in JSON files lead to errors in Python code created by ANTLR4

I've created python modules from the JSON grammar on github / antlr4 with
antlr4 -Dlanguage=Python3 JSON.g4
I've written a main program "JSON2.py" following this guide: https://github.com/antlr/antlr4/blob/master/doc/python-target.md
and downloaded the example1.json also from github.
python3 ./JSON2.py example1.json # works perfectly, but
python3 ./JSON2.py bookmarks-2017-05-24.json # the bookmarks contain German Umlauts like "ü"
...
File "/home/xyz/lib/python3.5/site-packages/antlr4/FileStream.py", line 27, in readDataFrom
return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 227: ordinal not in range(128)
The offending line in JSON2.py is
input = FileStream(argv[1])
I've searched stackoverflow and tried this instead of using the above FileStream:
fp = codecs.open(argv[1], 'rb', 'utf-8')
try:
input = fp.read()
finally:
fp.close()
lexer = JSONLexer(input)
stream = CommonTokenStream(lexer)
parser = JSONParser(stream)
tree = parser.json() # This is line 39, mentioned in the error message
Execution of this program ends with an error message, even if the input file doesn't contain Umlauts:
python3 ./JSON2.py example1.json
Traceback (most recent call last):
File "./JSON2.py", line 46, in <module>
main(sys.argv)
File "./JSON2.py", line 39, in main
tree = parser.json()
File "/home/x/Entwicklung/antlr/links/JSONParser.py", line 108, in json
self.enterRule(localctx, 0, self.RULE_json)
File "/home/xyz/lib/python3.5/site-packages/antlr4/Parser.py", line 358, in enterRule
self._ctx.start = self._input.LT(1)
File "/home/xyz/lib/python3.5/site-packages/antlr4/CommonTokenStream.py", line 61, in LT
self.lazyInit()
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 186, in lazyInit
self.setup()
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 189, in setup
self.sync(0)
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 111, in sync
fetched = self.fetch(n)
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 123, in fetch
t = self.tokenSource.nextToken()
File "/home/xyz/lib/python3.5/site-packages/antlr4/Lexer.py", line 111, in nextToken
tokenStartMarker = self._input.mark()
AttributeError: 'str' object has no attribute 'mark'
This parses correctly:
javac *.java
grun JSON json -gui bookmarks-2017-05-24.json
So the grammar itself is not the problem.
So finally the question: How should I process the input file in python, so that lexer and parser can digest it?
Thanks in advance.
Make sure your input file is actually encoded as UTF-8. Many problems with character recognition by the lexer are caused by using other encodings. I just took a testbed application, added ëto the list of available characters for an IDENTIFIER and it works again. UTF-8 is the key -- and make sure your grammar also allows these characters where you want to accept them.
I solved it by passing the encoding info:
input = FileStream(sys.argv[1], encoding = 'utf8')
If without the encoding info, I will have the same issue as yours.
Traceback (most recent call last):
File "test.py", line 20, in <module>
main()
File "test.py", line 9, in main
input = FileStream(sys.argv[1])
File ".../lib/python3.5/site-packages/antlr4/FileStream.py", line 20, in __init__
super().__init__(self.readDataFrom(fileName, encoding, errors))
File ".../lib/python3.5/site-packages/antlr4/FileStream.py", line 27, in readDataFrom
return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 1: ordinal not in range(128)
Where my input data is
[今明]天(台南|高雄)的?天氣如何

UnicodeDecodeError while cursor.fetchall()

I was trying to export some data from Google Cloud SQL database to an excel file using Python xlsxwriter , webapp2, appengine in a deferred task.
The data to be written has to be retrieved from the database.
The query is executing fine but when I try to fetch the data from the query either using cursor.fetchall() or by iterating over cursor it is throwing the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9f in position 4: invalid start byte
The stacktrace is :
for row in cursor:
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 689, in fetchone
self._FetchMoreRows()
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 606, in _FetchMoreRows
self._DoExec(request)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 448, in _DoExec
return self._HandleResult(response.result)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 487, in _HandleResult
new_rows = self._GetRows(result)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 532, in _GetRows
tuple_proto.values[value_index]))
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 402, in _DecodeVariable
return converter(value)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/converters.py", line 126, in Str2Unicode
return unicode(arg, 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9f in position 4: invalid start byte
The same code works if I try to locally run it using MySQLdb instead of rdbms.
There could be some encoding issue in data but that should come up while writing to the file.
I tried finding some data that may be corrupt but was not able to find any.
You do not have text that is encoded in utf8. What is it encoded in? If it is latin1, the the 9F stands for Ÿ; would that make sense? Find the line with the hex 9f; let's see the context.

How to Parse HTML with Non-ASCII Characters using BeautifulSoup?

I keep getting the following error when trying to parse some html using BeautifulSoup:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
I've tried decoding the html using the solution to the questions below, but keep getting the same error. I've tried all the solutions to the questions below but none of them work (posting so that I don't get duplicate answers and in case they help anyone to find a solution by viewing related approaches to the problem).
Anybody know where I'm going wrong here? Is this a bug in BeautifulSoup and should I install an earlier version?
EDIT: code and traceback below:
from BeautifulSoup import BeautifulSoup as bs
soup = bs(html)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 1282, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 946, in __init__
self._feed()
File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 971, in _feed
SGMLParser.feed(self, markup)
File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
self.goahead(0)
File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
EDIT: error message per comment below:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 1282, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 946, in __init__
self._feed()
File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 971, in _feed
SGMLParser.feed(self, markup)
File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
self.goahead(0)
File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
Thanks for your help!
'ascii' codec error in beautifulsoup
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
How do I convert a file's format from Unicode to ASCII using Python?
python UnicodeEncodeError > How can I simply remove troubling unicode characters?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
You say in a comment: """I just looked up the content-type of the html I'm trying to parse to see if it was something I hadn't tried (earlier I just assumed it was UTF-8) but sure enough it was UTF-8 so another dead end."""
Sigh. This is exactly why I have been trying to get you to divulge the HTML that you are trying to parse. The error message indicates that the (first) problem byte is \xae which is definitely NOT a valid lead byte in a UTF-8 sequence.
Either divulge the link to your HTML, or do some basic debugging:
Does uc = html.decode('utf8') work or fail? If fail, with what error message?
You also said: """I'm starting to think this is a bug in BS, which they allude to in the docs, and can be seen here: crummy.com/software/BeautifulSoup/CHANGELOG.html."""
I can't imagine which of the vague entries in the changelog you are referring to. Consider debugging your problem before you rush to update.
Update Looks like an obscure bug in sgmllib.py. In line 394, change 255 to 127 and it appears to work. Corner case: HTML char ref (®) in an attribute value AND with 128 <= ordinal < 255.
Further comments Rather than hack your copy of sgmllib.py, grab a copy of the latest sgmllib.py from the 2.7 branch -- BS 3.0.4 ran OK for me on Python 2.7.1. Even better, upgrade your Python to 2.7.
I tried to use pyquery on the html and the result is fine.
import urllib
from pyquery import PyQuery
html = urllib.urlopen('http://www.6pm.com/onitsuka-tiger-by-asics-ultimate-81').read()
pq = PyQuery(html)
print pq('span#price').text() # "$39.00 40% off MSRP $65.00"
pyquery is based on lxml so it's also much faster than beautifulsoup.

Categories