Python Mechanize: UnicodeEncodeError when uploading UTF-8 file. 'ascii' codec - python

The mechanize framework works great for automating the first couple of web screens. The problem arises where it needs to upload a file within a form.
Here is the section of code just before the error:
br.select_form(name="form.uploadXMLDataWizardForm")
xmlFile = codecs.open("MyFile.xml", "rt", "utf8")
br.form.add_file(file_object=xmlFile, content_type="text/xml", filename="MyFile.xml", name="dataFile")
br.submit(name="$action:next")
It results in the following error at runtime:
br.submit(name="$action:next")
File "build/bdist.macosx-10.6-universal/egg/mechanize/_mechanize.py", line 541, in submit
File "build/bdist.macosx-10.6-universal/egg/mechanize/_mechanize.py", line 530, in click
File "build/bdist.macosx-10.6-universal/egg/mechanize/_form.py", line 2999, in click
File "build/bdist.macosx-10.6-universal/egg/mechanize/_form.py", line 3201, in _click
File "build/bdist.macosx-10.6-universal/egg/mechanize/_form.py", line 2350, in _click
File "build/bdist.macosx-10.6-universal/egg/mechanize/_form.py", line 3269, in _switch_click
File "build/bdist.macosx-10.6-universal/egg/mechanize/_form.py", line 3252, in _request_data
File "build/bdist.macosx-10.6-universal/egg/mechanize/_form.py", line 1341, in _write_mime_data
UnicodeEncodeError: 'ascii' codec can't encode characters in position 650-651: ordinal not in range(128)
Any idea how to make mechanize handle upload of a UTF-8 file?

Mechanize seems to expect the file data as raw bytes, not Unicode data. Try opening the file with the built-in open() function in binary mode, so that nothing is decoded:
...
xmlFile = open("MyFile.xml", "rb")
...
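A minimal sketch of the difference, using only the standard library and Python 3 syntax. The file contents are made up for illustration, and the mechanize call itself is not exercised here; the point is only what each way of opening the file hands back.

```python
import codecs
import os
import tempfile

# Hypothetical sample file containing a non-ASCII character.
path = os.path.join(tempfile.mkdtemp(), "MyFile.xml")
with open(path, "wb") as f:
    f.write(u"<name>Zo\u00eb</name>".encode("utf-8"))

# codecs.open with an encoding decodes on read and yields Unicode text.
with codecs.open(path, "r", "utf8") as f:
    decoded = f.read()

# Binary mode yields the raw, still-encoded bytes.
with open(path, "rb") as f:
    raw = f.read()

print(type(decoded).__name__, type(raw).__name__)
```

Code that later writes the data into an HTTP request body generally wants the second form, since the bytes are already encoded.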

Related

indian rupee symbol UnicodeEncodeError while uploading file to s3 using pandas

I have scraped some data from a website for my assignment. It contains the Indian rupee character, "₹". When I save the data to a CSV file in UTF-8 on my local machine using pandas, it saves without problems. When I change the delimiters and try to save the same data to S3 using pandas, it raises a UnicodeEncodeError. I'm scraping the web page using the Scrapy framework.
Earlier I was trying to save the file in Latin-1, i.e. "ISO-8859-1", and then changed to "utf-8", but the same error occurs. I'm using Python 3.7 for development.
The code below saves to the local machine and works:
result_df.to_csv(filename+str2+'.csv',index=False)
Below code is used to save the file to S3:
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',encoding = 'utf-8',line_terminator='^',sep='~',index=False)
Below is the error while saving the file to S3:
2019-10-29 19:24:27 [scrapy.utils.signal] ERROR: Error caught on signal handler: <function Spider.close at 0x0000019CD3B1AA60>
Traceback (most recent call last):
File "c:\programdata\anaconda3\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
result = f(*args, **kw)
File "c:\programdata\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 94, in close
return closed(reason)
File "C:\local_path\spiders\Pduct_Scrape.py", line 430, in closed
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',encoding = 'utf-8',line_terminator='^',sep='~',index=False)
File "c:\programdata\anaconda3\lib\site-packages\pandas\core\generic.py", line 3020, in to_csv
formatter.save()
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 172, in save
self._save()
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 288, in _save
self._save_chunk(start_i, end_i)
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 315, in _save_chunk
self.cols, self.writer)
File "pandas/_libs/writers.pyx", line 75, in pandas._libs.writers.write_csv_rows
File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20b9' in position 2661: character maps to <undefined>
I am very new to StackOverflow; please let me know if more information is needed.
The error shows that the code tries to encode filename_str2.csv in cp1252. From your stack trace:
...File "C:\local_path\spiders\Pduct_Scrape.py", line 430, in closed
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',......
File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
I do not know the reason, because you explicitly ask for UTF-8 encoding. The codecs page in the Python Standard Library reference gives utf_8 (with an underscore instead of a hyphen) as the canonical name and does not list utf-8 among the aliases, so I would first try utf_8, although Python's codec lookup normally treats the two spellings as equivalent. If it still uses cp1252, then you will have to give the exact versions of Python and pandas that you are using.
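A quick standard-library check of the two points above, as a sketch rather than a diagnosis of the pandas behaviour: whether the two spellings resolve to the same codec, and whether cp1252 (the codec named in the traceback) can represent the rupee sign at all.

```python
import codecs

rupee = "\u20b9"  # the character reported in the traceback

# Both spellings are looked up through the same codec registry.
same_codec = codecs.lookup("utf-8").name == codecs.lookup("utf_8").name

# UTF-8 can represent the rupee sign ...
utf8_bytes = rupee.encode("utf-8")

# ... but cp1252 cannot, which matches the 'charmap' error above.
try:
    rupee.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

print(same_codec, utf8_bytes, cp1252_ok)
```

If the encoding argument turns out to be ignored on the S3 path, one possible workaround is to call to_csv without a path, which returns the CSV as a string, encode that string to UTF-8 yourself, and upload the resulting bytes with whatever S3 client is already in use.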

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 14: invalid start byte

I am doing a File upload test with Django REST.
Python3.6.2
Django1.11
djangorestframework==3.6.4
Excel-OSX 15.38(170902)
OSX 10.12.6
It used to work with ordinary photo files.
This time it is an Excel file from a website. Here is my test case, copied from the references.
def test_upload_and_process_data_complete_case(self):
    from django.core.files import File
    from django.core.files.uploadedfile import SimpleUploadedFile
    from soken_web.apps.imported_files.models import ImportFile
    file = File(open(str(settings.BASE_DIR) + '/apps/zipcodes/complete.xlsx'))
    uploaded_file = SimpleUploadedFile('new_image.xlsx', file.read(), content_type='multipart/form-data')
    data = {
        'attribute': {'author': 'Sigh'},
        'type': ImportFile.FileType.zipcode,
        'file': uploaded_file
    }
    response = self.client.post(reverse('api:import_file-list'), data, format='multipart')
    response.render()
    self.assertEqual(status.HTTP_201_CREATED, response.status_code)
It is almost a straight copy, except that this time I downloaded a mock file from https://www.mockaroo.com/.
Here is the error raised when I execute file.read():
file
<File: /Users/el/Code/norak-cutter/soken/soken-web/soken_web/apps/zipcodes/complete.xlsx>
file.read()
Traceback (most recent call last):
File "/Users/el/Library/Application Support/JetBrains/Toolbox/apps/PyCharm-P/ch-0/172.3968.37/PyCharm.app/Contents/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "/Users/el/.pyenv/versions/3.6.2/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 14: invalid start byte
Confirmations:
1. I can upload file from my web browser
2. I can open that file without any warning messages.
Question:
Is there anything I forgot to consider?
References:
How can I test binary file uploading with django-rest-framework's test client?
Django REST UnitTest No file was submitted
The default mode for opening files is "r", which means text (non-binary) read. Python assumes your file is an encoded text file and tries to decode the contents. But it isn't a text file; it's a binary data file.
Change:
open(str(settings.BASE_DIR) + '/apps/zipcodes/complete.xlsx')
to:
open(str(settings.BASE_DIR) + '/apps/zipcodes/complete.xlsx', 'rb')
and it will probably work.
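The effect can be reproduced without Django. The sketch below writes a few arbitrary bytes that are not valid UTF-8 (a made-up stand-in for the real .xlsx file) and reads them back in both modes; the encoding is passed explicitly so the result does not depend on the platform default.

```python
import os
import tempfile

# Hypothetical stand-in for an .xlsx file: bytes that are not valid UTF-8.
payload = b"PK\x03\x04\x14\x00\xb9\xff"
path = os.path.join(tempfile.mkdtemp(), "complete.xlsx")
with open(path, "wb") as f:
    f.write(payload)

# Text mode tries to decode the bytes and fails.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    text_mode_failed = False
except UnicodeDecodeError:
    text_mode_failed = True

# Binary mode returns the bytes untouched.
with open(path, "rb") as f:
    data = f.read()

print(text_mode_failed, data == payload)
```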

Umlauts in JSON files lead to errors in Python code created by ANTLR4

I've created python modules from the JSON grammar on github / antlr4 with
antlr4 -Dlanguage=Python3 JSON.g4
I've written a main program "JSON2.py" following this guide: https://github.com/antlr/antlr4/blob/master/doc/python-target.md
and downloaded the example1.json also from github.
python3 ./JSON2.py example1.json # works perfectly, but
python3 ./JSON2.py bookmarks-2017-05-24.json # the bookmarks contain German Umlauts like "ü"
...
File "/home/xyz/lib/python3.5/site-packages/antlr4/FileStream.py", line 27, in readDataFrom
return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 227: ordinal not in range(128)
The offending line in JSON2.py is
input = FileStream(argv[1])
I've searched stackoverflow and tried this instead of using the above FileStream:
fp = codecs.open(argv[1], 'rb', 'utf-8')
try:
    input = fp.read()
finally:
    fp.close()
lexer = JSONLexer(input)
stream = CommonTokenStream(lexer)
parser = JSONParser(stream)
tree = parser.json() # This is line 39, mentioned in the error message
Execution of this program ends with an error message, even if the input file doesn't contain Umlauts:
python3 ./JSON2.py example1.json
Traceback (most recent call last):
File "./JSON2.py", line 46, in <module>
main(sys.argv)
File "./JSON2.py", line 39, in main
tree = parser.json()
File "/home/x/Entwicklung/antlr/links/JSONParser.py", line 108, in json
self.enterRule(localctx, 0, self.RULE_json)
File "/home/xyz/lib/python3.5/site-packages/antlr4/Parser.py", line 358, in enterRule
self._ctx.start = self._input.LT(1)
File "/home/xyz/lib/python3.5/site-packages/antlr4/CommonTokenStream.py", line 61, in LT
self.lazyInit()
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 186, in lazyInit
self.setup()
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 189, in setup
self.sync(0)
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 111, in sync
fetched = self.fetch(n)
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 123, in fetch
t = self.tokenSource.nextToken()
File "/home/xyz/lib/python3.5/site-packages/antlr4/Lexer.py", line 111, in nextToken
tokenStartMarker = self._input.mark()
AttributeError: 'str' object has no attribute 'mark'
This parses correctly:
javac *.java
grun JSON json -gui bookmarks-2017-05-24.json
So the grammar itself is not the problem.
So finally the question: How should I process the input file in python, so that lexer and parser can digest it?
Thanks in advance.
Make sure your input file is actually encoded as UTF-8. Many problems with character recognition by the lexer are caused by using other encodings. I just took a testbed application, added ë to the list of available characters for an IDENTIFIER, and it works again. UTF-8 is the key; also make sure your grammar allows these characters where you want to accept them.
I solved it by passing the encoding info:
input = FileStream(sys.argv[1], encoding = 'utf8')
Without the encoding info, I get the same issue as you:
Traceback (most recent call last):
File "test.py", line 20, in <module>
main()
File "test.py", line 9, in main
input = FileStream(sys.argv[1])
File ".../lib/python3.5/site-packages/antlr4/FileStream.py", line 20, in __init__
super().__init__(self.readDataFrom(fileName, encoding, errors))
File ".../lib/python3.5/site-packages/antlr4/FileStream.py", line 27, in readDataFrom
return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 1: ordinal not in range(128)
Where my input data is
[今明]天(台南|高雄)的?天氣如何
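The traceback shows the file contents being passed through codecs.decode(bytes, encoding, errors). The sketch below imitates that step with a hypothetical helper, read_data_from, that defaults to ASCII; it is modeled on the traceback only and is not the ANTLR runtime's actual code.

```python
import codecs
import os
import tempfile

def read_data_from(path, encoding="ascii", errors="strict"):
    """Hypothetical stand-in for the decode step seen in the traceback."""
    with open(path, "rb") as f:
        raw = f.read()
    return codecs.decode(raw, encoding, errors)

# Write the sample input from the answer as UTF-8.
sample = "[今明]天(台南|高雄)的?天氣如何"
path = os.path.join(tempfile.mkdtemp(), "input.txt")
with open(path, "wb") as f:
    f.write(sample.encode("utf-8"))

# The ASCII default fails on the first non-ASCII byte.
try:
    read_data_from(path)
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start

# Passing the real encoding decodes the whole file.
text = read_data_from(path, encoding="utf8")
print(failed_at, text == sample)
```

This also suggests why the codecs.open attempt in the question fails differently: it produces a plain str, while the lexer expects a stream object with methods such as mark().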

Spyder crashes at start: UnicodeDecodeError

During a Spyder session my Linux froze. After startup, I could not start Spyder; I got the following error instead:
(trusty)dreamer#localhost:~$ spyder
Traceback (most recent call last):
File "/home/dreamer/anaconda2/bin/spyder", line 2, in <module>
from spyderlib import start_app
File "/home/dreamer/anaconda2/lib/python2.7/site-packages/spyderlib/start_app.py", line 13, in <module>
from spyderlib.config import CONF
File "/home/dreamer/anaconda2/lib/python2.7/site-packages/spyderlib/config.py", line 736, in <module>
subfolder=SUBFOLDER, backup=True, raw_mode=True)
File "/home/dreamer/anaconda2/lib/python2.7/site-packages/spyderlib/userconfig.py", line 215, in __init__
self.load_from_ini()
File "/home/dreamer/anaconda2/lib/python2.7/site-packages/spyderlib/userconfig.py", line 260, in load_from_ini
self.readfp(configfile)
File "/home/dreamer/anaconda2/lib/python2.7/ConfigParser.py", line 324, in readfp
self._read(fp, filename)
File "/home/dreamer/anaconda2/lib/python2.7/ConfigParser.py", line 479, in _read
line = fp.readline()
File "/home/dreamer/anaconda2/lib/python2.7/codecs.py", line 690, in readline
return self.reader.readline(size)
File "/home/dreamer/anaconda2/lib/python2.7/codecs.py", line 545, in readline
data = self.read(readsize, firstline=True)
File "/home/dreamer/anaconda2/lib/python2.7/codecs.py", line 492, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfe in position 2: invalid start byte
(trusty)dreamer#localhost:~$
I have found this solution, which sounds very much like my problem, but am curious if there are others, and whether anyone knows why this occurred.
My guess is that your spyder configuration file somehow got corrupted. This is the file spyder.ini, which resides in a directory like ~/.spyder2 (the exact name of the directory depends on the version you have installed). Maybe the encoding of the configuration file changed or a Unicode byte order mark was somehow introduced.
Possible solutions: use an editor to convert the file back to UTF-8; delete the configuration file; delete the whole directory containing the configuration file. The last two obviously delete any changes you made to the configuration.
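Before deleting anything, it may help to look at the raw bytes of the configuration file. The helpers below are hypothetical names introduced for this sketch; they report where UTF-8 decoding first fails and whether the data starts with a UTF-16 byte order mark. The sample bytes are invented, and the real file would be read with open(path, "rb").

```python
import codecs

def first_invalid_utf8_offset(data):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None

def starts_with_utf16_bom(data):
    """Return True if the data begins with a UTF-16 byte order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Invented sample bytes standing in for the contents of spyder.ini.
corrupted = b"[m\xfeain]"
print(first_invalid_utf8_offset(corrupted), starts_with_utf16_bom(corrupted))
```

Copying the file aside before deleting or converting it keeps the old settings recoverable.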

UnicodeDecodeError when trying to save an Excel File with Python xlwt

I'm running a Python script that writes HTML code found using BeautifulSoup into multiple rows of an Excel spreadsheet column.
[...]
Col_HTML = 19
w_sheet.write(row_index, Col_HTML, str(HTML_Code))
wb.save(output)
When trying to save the file, I get the following error message:
Traceback (most recent call last):
File "C:\Users\[..]\src\MYCODE.py", line 201, in <module>
wb.save(output)
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\Workbook.py", line 662, in save
doc.save(filename, self.get_biff_data())
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\Workbook.py", line 637, in get_biff_data
shared_str_table = self.__sst_rec()
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\Workbook.py", line 599, in __sst_rec
return self.__sst.get_biff_record()
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\BIFFRecords.py", line 76, in get_biff_record
self._add_to_sst(s)
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\BIFFRecords.py", line 91, in _add_to_sst
u_str = upack2(s, self.encoding)
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\UnicodeUtils.py", line 50, in upack2
us = unicode(s, encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5181: ordinal not in range(128)
I've successfully written Python scripts in the past to write into worksheets. This is the first time I have tried to write a string of HTML into cells, and I'm wondering what is causing the error and how I could fix it.
Use this line before passing HTML_Code to w_sheet.write
HTML_Code = HTML_Code.decode('utf-8')
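The error line, UnicodeDecodeError: 'ascii' codec can't decode, means that xlwt is trying to decode a byte string as ASCII (see the unicode(s, encoding) call at the bottom of the traceback), and the string contains non-ASCII bytes. So you need to decode the byte string yourself using the proper encoding, that is, utf-8, assuming HTML_Code is a UTF-8 byte string. Pass the resulting Unicode string to write directly; wrapping it in str() again would, in Python 2, try to encode it back to ASCII.
So, you have:
Col_HTML = 19
HTML_Code = HTML_Code.decode('utf-8')
w_sheet.write(row_index, Col_HTML, HTML_Code)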
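When it is not certain whether a value is already Unicode text or still a byte string, a small helper can make the decoding step explicit before the value reaches a cell. The name to_text and the sample bytes are assumptions made for this sketch, and UTF-8 is assumed as the source encoding.

```python
def to_text(value, encoding="utf-8"):
    """Return value as Unicode text, decoding byte strings with the given encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Invented sample: HTML bytes containing a UTF-8 right single quotation mark.
html_bytes = b"<p>It\xe2\x80\x99s here</p>"
cell_text = to_text(html_bytes)
print(repr(cell_text))
```

The result of to_text could then be passed as the cell value in place of the raw byte string.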
