I was trying to export some data from Google Cloud SQL database to an excel file using Python xlsxwriter , webapp2, appengine in a deferred task.
The data to be written has to be retrieved from the database.
The query is executing fine but when I try to fetch the data from the query either using cursor.fetchall() or by iterating over cursor it is throwing the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9f in position 4: invalid start byte
The stacktrace is :
for row in cursor:
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 689, in fetchone
self._FetchMoreRows()
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 606, in _FetchMoreRows
self._DoExec(request)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 448, in _DoExec
return self._HandleResult(response.result)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 487, in _HandleResult
new_rows = self._GetRows(result)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 532, in _GetRows
tuple_proto.values[value_index]))
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 402, in _DecodeVariable
return converter(value)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/converters.py", line 126, in Str2Unicode
return unicode(arg, 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9f in position 4: invalid start byte
The same code works if I try to locally run it using MySQLdb instead of rdbms.
There could be some encoding issue in data but that should come up while writing to the file.
I tried finding some data that may be corrupt but was not able to find any.
You do not have text that is encoded in utf8. What is it encoded in? If it is latin1, the the 9F stands for Ÿ; would that make sense? Find the line with the hex 9f; let's see the context.
Related
I have scraped some data from a website for my assignment. It consists of Indian rupee character - "₹". The data when I'm trying to save into CSV file in utf-8 characters on local machine using pandas, it is saving effortlessly. The same file, I have changed the delimiters and tried to save the file to s3 using pandas, but it gave "UnicodeEncodeError" error. I'm scraping the web page using scrapy framework.
Earlier I was trying to save the file in Latin-1 i.e. "ISO-8859-1" formatting and hence changed to "utf-8" but the same error is occurring. I'm using pythn 3.7 for the development.
Below code used for saving on the local machine which is working:
result_df.to_csv(filename+str2+'.csv',index=False)
Below code is used to save the file to S3:
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',encoding = 'utf-8',line_terminator='^',sep='~',index=False)
Below is the error while saving the file to S3:
2019-10-29 19:24:27 [scrapy.utils.signal] ERROR: Error caught on signal handler: <function Spider.close at 0x0000019CD3B1AA60>
Traceback (most recent call last):
File "c:\programdata\anaconda3\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
result = f(*args, **kw)
File "c:\programdata\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 94, in close
return closed(reason)
File "C:\local_path\spiders\Pduct_Scrape.py", line 430, in closed
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',encoding = 'utf-8',line_terminator='^',sep='~',index=False)
File "c:\programdata\anaconda3\lib\site-packages\pandas\core\generic.py", line 3020, in to_csv
formatter.save()
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 172, in save
self._save()
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 288, in _save
self._save_chunk(start_i, end_i)
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 315, in _save_chunk
self.cols, self.writer)
File "pandas/_libs/writers.pyx", line 75, in pandas._libs.writers.write_csv_rows
File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20b9' in position 2661: character maps to <undefined>
I am very new to this StackOverflow platform and please let me know if more information is to be presented.
The error gives an evidence that the code tries to encode the filename_str2.csv file in cp1252. From your stack trace:
...File "C:\local_path\spiders\Pduct_Scrape.py", line 430, in closed
search_df.to_csv('s3://my-bucket/folder_path/ filename_str2.csv ',......
File "c:\programdata\anaconda3\lib\encodings\ cp1252.py ", line 19, in encode
The reason I do not know, because you explicitely ask for an utf-8 encoding. But as the codecs page in the Python Standard Library reference says that the canonical name for utf8 is utf_8 (notice the underline instead of minus sign) and does not list utf-8 in allowed aliases, I would first try to use utf_8. If it still uses cp1252, then you will have to give the exact versions of Python and pandas that you are using.
I'm running a Python script that writes HTML code found using BeautifulSoup into multiple rows of an Excel spreadsheet column.
[...]
Col_HTML = 19
w_sheet.write(row_index, Col_HTML, str(HTML_Code))
wb.save(output)
When trying to save the file, I get the following error message:
Traceback (most recent call last):
File "C:\Users\[..]\src\MYCODE.py", line 201, in <module>
wb.save(output)
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\Workbook.py", line 662, in save
doc.save(filename, self.get_biff_data())
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\Workbook.py", line 637, in get_biff_data
shared_str_table = self.__sst_rec()
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\Workbook.py", line 599, in __sst_rec
return self.__sst.get_biff_record()
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\BIFFRecords.py", line 76, in get_biff_record
self._add_to_sst(s)
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\BIFFRecords.py", line 91, in _add_to_sst
u_str = upack2(s, self.encoding)
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\UnicodeUtils.py", line 50, in upack2
us = unicode(s, encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5181: ordinal not in range(128)
I've successfully written Python script in the past to write into worksheets. It's the first time I try to write a string of HTML into cells and I'm wondering what is causing the error and how I could fix it.
Use this line before passing HTML_Code to w_sheet.write
HTML_Code = HTML_Code.decode('utf-8')
Because, in the error line UnicodeDecodeError: 'ascii' codec can't decode, Python is trying to decode unicode into ascii, so you need to decode unicode using the proper encoding format, that is, utf-8.
So, you have:
Col_HTML = 19
HTML_Code = HTML_Code.decode('utf-8')
w_sheet.write(row_index, Col_HTML, str(HTML_Code))
I am using Solr 3.3 to index stuff from my database. I compose the JSON content in Python. I manage to upload 2126 records which add up to 523246 chars (approx 511kb). But when I try 2027 records, Python gives me the error:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "D:\Technovia\db_indexer\solr_update.py", line 69, in upload_service_details
request_string.append(param_list)
File "C:\Python27\lib\json\__init__.py", line 238, in dumps
**kw).encode(obj)
File "C:\Python27\lib\json\encoder.py", line 203, in encode
chunks = list(chunks)
File "C:\Python27\lib\json\encoder.py", line 425, in _iterencode
for chunk in _iterencode_list(o, _current_indent_level):
File "C:\Python27\lib\json\encoder.py", line 326, in _iterencode_list
for chunk in chunks:
File "C:\Python27\lib\json\encoder.py", line 384, in _iterencode_dict
yield _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 68: invalid start byte
Ouch. Is 512kb worth of bytes a fundamental limit? Is there any high-volume alternative to the existing JSON module?
Update: its a fault of some data as trying to encode *biz_list[2126:]* causes an immediate error. Here is the offending piece:
'2nd Floor, Gurumadhavendra Towers,\nKadavanthra Road, Kaloor,\nCochin \x96 682 017'
How can I configure it so that it can be encodable into JSON?
Update 2: The answer worked as expected: the data came from a MySQL table encoded in "latin-1-swedish-ci". I saw significance in a random number. Sorry for spontaneously channeling the spirit of a headline writer when diagnosing the fault.
Simple, just don't use utf-8 encoding if your data is not in utf-8
>>> json.loads('["\x96"]')
....
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte
>>> json.loads('["\x96"]', encoding="latin-1")
[u'\x96']
json.loads
If s is a str instance and is encoded with an ASCII based
encoding other than utf-8 (e.g. latin-1) then an appropriate
encoding name must be specified. Encodings that are not ASCII
based (such as UCS-2) are not allowed and should be decoded to
unicode first.
Edit: To get proper unicode value of "\x96" use "cp1252" as Eli Collins mentioned
>>> json.loads('["\x96"]', encoding="cp1252")
[u'\u2013']
I have a GAE project in Python where I am setting a cookie in one of my RequestHandlers with this code:
self.response.headers['Set-Cookie'] = 'app=ABCD; expires=Fri, 31-Dec-2020 23:59:59 GMT'
I checked in Chrome and I can see the cookie listed, so it appears to be working.
Then later in another RequestHandler, I get the cookie to check it:
appCookie = self.request.cookies['app']
This line gives the following error when executed:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1962: ordinal not in range(128)
It seems that it is trying to decode the incoming cookie info using an ASCII codec rather than UTF-8.
How do I force Python to use UTF-8 to decode this?
Are there any other Unicode-related gotchas that I need to be aware of as a newbie to Python and Google App Engine (but an experienced programmer in other languages)?
Here is the full Traceback:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 4144, in _HandleRequest
self._Dispatch(dispatcher, self.rfile, outfile, env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 4049, in _Dispatch
base_env_dict=env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 616, in Dispatch
base_env_dict=base_env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3120, in Dispatch
self._module_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3024, in ExecuteCGI
reset_modules = exec_script(handler_path, cgi_path, hook)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 2887, in ExecuteOrImportScript
exec module_code in script_module.__dict__
File "/Users/ken/hgdev/juicekit/main.py", line 402, in <module>
main()
File "/Users/ken/hgdev/juicekit/main.py", line 399, in main
run_wsgi_app(application)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/util.py", line 98, in run_wsgi_app
run_bare_wsgi_app(add_wsgi_middleware(application))
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/util.py", line 116, in run_bare_wsgi_app
result = application(env, _start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 721, in __call__
response.wsgi_write(start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 296, in wsgi_write
body = self.out.getvalue()
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/StringIO.py", line 270, in getvalue
self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1962: ordinal not in range(128)
You're looking to use the decode function somewhat like this (cred #agf:):
self.request.cookies['app'].decode('utf-8')
From official python documentation (plus a couple added details):
Python’s 8-bit strings have a .decode([encoding], [errors]) method that interprets the string using the given encoding. The following example shows the string as it goes to unicode and then back to 8-bit string:
>>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
>>> type(u), u # Examine
(<type 'unicode'>, u'\ua000abcd\u07b4')
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
>>> type(utf8_version), utf8_version # Examine
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
>>> u == u2 # The two strings match
True
First, encode any unicode value you set in the cookies. You also need to quote them in case they can break the header:
import urllib
# This is the value we want to set.
initial_value = u'äëïöü'
# WebOb version that comes with SDK doesn't quote cookie values
# in the Response, neither webapp.Response. So we have to do it.
quoted_value = urllib.quote(initial_value.encode('utf-8'))
rsp = webapp.Response()
rsp.headers['Set-Cookie'] = 'app=%s; Path=/' % quoted_value
Now let's read the value. To test it, create a fake Request to test the cookie we have set. This code was extracted from a real unittest:
cookie = rsp.headers.get('Set-Cookie')
req = webapp.Request.blank('/', headers=[('Cookie', cookie)])
# The stored value is the same quoted value from before.
# Notice that here we use .str_cookies, not .cookies.
stored_value = req.str_cookies.get('app')
self.assertEqual(stored_value, quoted_value)
Our value is still encoded and quoted. We must do the reverse to get the initial one:
# And we can get the initial value unquoting and decoding.
final_value = urllib.unquote(stored_value).decode('utf-8')
self.assertEqual(final_value, initial_value)
If you can, consider using webapp2. webob.Response does all the hard work of quoting and setting cookies, and you can set unicode values directly. See a summary of these issues here.
the mechanize framework works great for automating the first couple of web screens. The problem is where it needs to upload a file with in a form.
Here is the section of code just before the error:
br.select_form(name="form.uploadXMLDataWizardForm")
xmlFile = codecs.open("MyFile.xml", "rt", "utf8")
br.form.add_file(file_object=xmlFile, content_type="text/xml", filename="MyFile.xml", name="dataFile")
br.submit(name="$action:next")
It results in the following error at runtime:
br.submit(name="$action:next")
File "build/bdist.macosx-10.6-universal/egg/mechanize/_mechanize.py", line 541, in submit
File "build/bdist.macosx-10.6-universal/egg/mechanize/_mechanize.py", line 530, in click
File "build/bdist.macosx-10.6-universal/egg/mechanize/_form.py", line 2999, in click
File "build/bdist.macosx-10.6-universal/egg/mechanize/_form.py", line 3201, in _click
File "build/bdist.macosx-10.6-universal/egg/mechanize/_form.py", line 2350, in _click
File "build/bdist.macosx-10.6-universal/egg/mechanize/_form.py", line 3269, in _switch_click
File "build/bdist.macosx-10.6-universal/egg/mechanize/_form.py", line 3252, in _request_data
File "build/bdist.macosx-10.6-universal/egg/mechanize/_form.py", line 1341, in _write_mime_data
UnicodeEncodeError: 'ascii' codec can't encode characters in position 650-651: ordinal not in range(128)
Any idea how to make mechanize handle upload of a UTF-8 file?
Mechanize seems to expect the file data as raw bytes, not Unicode data. Try opening the file using the usual open() function:
...
xmlFile = open("MyFile.xml", "rt")
...