Errors when downloading data from Google App Engine using bulkloader - python

I am trying to download some data from the datastore using the following
command:
appcfg.py download_data --config_file=bulkloader.yaml --application=myappname
--kind=mykindname --filename=myappname_mykindname.csv
--url=http://myappname.appspot.com/_ah/remote_api
When I didn't have much data in this particular kind/table I could
download the data in one shot - occasionally running into the
following error:
.................................[ERROR ] [Thread-11] ExportProgressThread:
Traceback (most recent call last):
File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkloader.py", line 1448, in run
self.PerformWork()
File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkloader.py", line 2216, in PerformWork
item.key_end)
File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkloader.py", line 2011, in StoreKeys
(STATE_READ, unicode(kind), unicode(key_start), unicode(key_end)))
OperationalError: unable to open database file
This is what I see in the server log:
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/remote_api/handler.py", line 277, in post
response_data = self.ExecuteRequest(request)
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/remote_api/handler.py", line 308, in ExecuteRequest
response_data)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 86, in MakeSyncCall
return stubmap.MakeSyncCall(service, call, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 286, in MakeSyncCall
rpc.CheckSuccess()
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 126, in CheckSuccess
raise self.exception
ApplicationError: ApplicationError: 4 no matching index found.
When that error appeared I would simply re-run the download and things
would work out well.
Of late, I am noticing that as the size of my kind increases, the
download tool fails much more often. For instance, with a kind with
~3500 entities I had to run the command 5 times - only the last of
which succeeded. Is there a way around this error? Previously, my only
worry was I wouldn't be able to automate downloads in a script because
of the occasional failures - now I am scared I won't be able to get my
data out at all.
This issue was discussed previously here
but the post is old and I am not sure what the suggested flag does -
hence posting my similar query again.
Some additional details.
As mentioned here, I tried the suggestion to proceed with interrupted downloads (in the section Downloading Data from App Engine). When I resume after the interruption, I get no errors, but the number of rows downloaded is less than the entity count the datastore admin shows me. This is the message I get:
[INFO ] Have 3220 entities, 3220 previously transferred
[INFO ] 3220 entities (1003 bytes) transferred in 2.9 seconds
The datastore admin tells me this particular kind has ~4300 entities. Why aren't the remaining entities getting downloaded?
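For reference, resuming looks something like this on my end - the progress file is the bulkloader-progress-TIMESTAMP.sql3 file that appcfg.py generated during the interrupted run (TIMESTAMP standing in for whatever is in the generated name):
appcfg.py download_data --config_file=bulkloader.yaml --application=myappname
--kind=mykindname --filename=myappname_mykindname.csv
--db_filename=bulkloader-progress-TIMESTAMP.sql3
--url=http://myappname.appspot.com/_ah/remote_api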
Thanks!

I am going to make a completely uneducated guess at this, based solely on the fact that I saw the word "unicode" in the first error: I had an issue that was related to my data being user-generated from the web. A user put in a few unicode characters and a whole load of stuff started breaking - probably my fault, as I had implemented pretty-looking repr functions and a load of other stuff. If you can, take a quick scan of your data via the console utility in your live app; since it's only ~4k records, maybe try converting all of the data to ASCII strings to find any that don't conform.
And after that, I started "sanitising" user inputs (sorry, but my "public handle" field needs to be ASCII only, players!).
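Something along these lines in the console should do the scan - a rough sketch only, where MyKind and the property list are placeholders for your actual model:

# Rough sketch for the console: flag entities whose string properties
# don't survive an ASCII round-trip. MyKind and 'name' are placeholders.
for entity in MyKind.all():
    for prop in ['name']:  # list the string properties you want to check
        value = getattr(entity, prop, None)
        if isinstance(value, unicode):
            try:
                value.encode('ascii')
            except UnicodeEncodeError:
                print entity.key(), prop, repr(value)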

Related

python Twitch-chatbot MONKALOT encounters json error on startup

Presently I'm trying to make MONKALOT run on a PythonAnywhere account (customized Web Developer). I have basic knowledge of Linux, unfortunately no experience developing Python scripts, but advanced experience developing Java (hope that helps).
My success log so far:
After upgrading my account to Web Developer level I finally made pip download the requirements (https://github.com/NMisko/monkalot/blob/master/requirements.txt) and half the internet (2 of 5 GB used). All modules and dependencies seem to have been installed successfully.
I configured my own monkalot-channel including OAuth which serves as a staging instance for now. The next challenge was how to get monkalot starting up. Using python3.7 instead of python or any other python3 environment did the trick.
But now I'm stuck. After "completing the training stage" the monkalot script ends prematurely with the following message:
[22:14] ...chat bot finished training.
Traceback (most recent call last):
File "monkalot.py", line 72, in <module>
bots.append(TwitchBot(path))
File "/home/Chessalot/monkalot/bot/bot.py", line 56, in __init__
self.users = self.twitch.get_chatters()
File "/home/Chessalot/monkalot/bot/data_sources/twitch.py", line 25, in get_chatters
data = requests.get(USERLIST_API.format(self.channel)).json()
File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 900, in json
return complexjson.loads(self.text, **kwargs)
File "/usr/local/lib/python3.7/site-packages/simplejson/__init__.py", line 525, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python3.7/site-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/usr/local/lib/python3.7/site-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
By now I figured out that monkalot tries to load the chatters list and expects at least an empty JSON array as the result, but actually seems to receive an empty string.
So my question is: What can I do to make the monkalot script work? Is monkalot's current version incompatible with the current Twitch API? Are there any outdated Python libraries which may cause the incompatibility? Or is there an unrecognized configuration issue preventing the script from running successfully?
Thank you all in advance. Any ideas provided by you are highly appreciated.
The most likely cause of that is that you are using a free PythonAnywhere account and have not configured monkalot to use the proxy. Check the documentation of monkalot to determine how you can configure it to use a proxy. See https://help.pythonanywhere.com/pages/403ForbiddenError/ for the proxy details.
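If monkalot uses requests under the hood (the traceback suggests it does), setting the standard proxy environment variables before it starts may be enough; the host and port below are the ones given on that help page:

import os

# On free PythonAnywhere accounts, outbound traffic must go through the
# proxy; requests picks these variables up automatically.
os.environ["HTTP_PROXY"] = "http://proxy.server:3128"
os.environ["HTTPS_PROXY"] = "http://proxy.server:3128"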
Only a quick thought, might not be the problem you are encountering, but it may be due to the project name. E.g.:
From github:
... I believed that the issue was something other than the project name, since I get a different error if I use a project name that doesn't exist. However, I just tried using ben-heil/saged instead of just saged for the project name and that seems to have fixed it.
EDIT: your HTTP 400 error was caused by this:
File "monkalot.py", line 72, in <module>
bots.append(TwitchBot(path))
Now this points out that the function called with path is raising the error. Especially since you see a lot of decode in the traceback, you can deduce it has something to do with the characters you inputted.
Other errors in your traceback that point this out:
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
JSONDecodeError: Expecting value: line 1 column 1 (char 0) occurs when we try to parse something that is not valid JSON as if it were. To solve the error, make sure the response or the file is not empty or conditionally check for the content type before parsing.
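For example, a defensive version of the failing call might look like this - just a sketch, with USERLIST_API and channel taken from monkalot's twitch.py:

import requests

# Check the status and content type before parsing, instead of calling
# .json() on whatever comes back.
response = requests.get(USERLIST_API.format(channel))
if response.ok and "json" in response.headers.get("Content-Type", ""):
    data = response.json()
else:
    data = {}  # empty fallback instead of a JSONDecodeError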
In most cases, a json.loads JSONDecodeError: Expecting value: line 1 column 1 (char 0) error is due to one of the following:
non-JSON conforming quoting
XML/HTML output (that is, a string starting with <), or
incompatible character encoding
In this case, it was the letter case of the channel name that caused the error (check the content type!).
Related sources:
python json decoder
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
After I inspected the response, I found out that I received an HTTP 400, Bad Request error WITHOUT any data in the HTTP response body. Since monkalot expects a JSON answer, the errors were raised. This was due to the fact that in the channel configuration I had used an uppercase letter, whereas Twitch expects all letters lowercase.

Cannot read Statistics Canada sdmx file

I am trying to read some Canadian census data from Statistics Canada
(the XML option for the "Canada, provinces and territories" geographic level). I see that the XML file is in the SDMX format and that there is a structure file provided, but I cannot figure out how to read the data from the XML file.
It seems there are 2 options in Python, pandasdmx and sdmx1, both of which say they can read local files. When I try
import sdmx
datafile = '~/Documents/Python/Generic_98-401-X2016059.xml'
canada = sdmx.read_sdmx(datafile)
It appears to read the first 903 lines and then produces the following:
Traceback (most recent call last):
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/xml.py", line 238, in read_message
raise NotImplementedError(element.tag, event) from None
NotImplementedError: ('{http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message}GenericData', 'start')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/__init__.py", line 126, in read_sdmx
return reader().read_message(obj, **kwargs)
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/xml.py", line 259, in read_message
raise XMLParseError from exc
sdmx.exceptions.XMLParseError: NotImplementedError: ('{http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message}GenericData', 'start')
Is this happening because I've not loaded the structure of the sdmx file (Structure_98-401-X2016059.xml in the zip file from the StatsCan link above)? If so, how do I go about loading that and telling sdmx to use that when reading datafile?
The documentation for sdmx and pandasdmx only shows examples of loading files from online providers, not from local files, so I'm stuck. I have limited familiarity with Python, so any help is much appreciated.
For reference, I can read the file in R using the instructions from the rsdmx github. I would like to be able to do the same/similar in Python.
Thanks in advance.
From a cursory inspection of the documentation, it seems that Statistics Canada is not one of the sources that are included by default. There is, however, an sdmx.add_source function. I suggest you try that (before loading the data).
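Untested, but the registration would look something like this - the id, name, and url values below are illustrative guesses, not a real endpoint:

import sdmx

# Illustrative sketch: register Statistics Canada as a custom source
# before reading. The values here are placeholders, not tested ones.
sdmx.add_source({
    "id": "STATCAN",
    "name": "Statistics Canada",
    "url": "https://example.statcan.gc.ca/rest",  # hypothetical endpoint
})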
As per the sdmx1 developer, StatsCan is using the older, unsupported version of SDMX (v2.0). The current version is 2.1, and sdmx1 only supports that (support is also heading towards the upcoming v3.0).

Large file upload fails

I'm in the process of writing a Python module to POST files to a server. I can upload files of up to 500 MB, but when I tried to upload a 1 GB file the upload failed; if I use something like cURL it doesn't fail. I got the code after googling how to upload multipart form data using Python; the code can be found here. I just ran that code, and the error I'm getting is this:
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
opener.open("http://127.0.0.1/test_server/upload",params)
File "C:\Python27\lib\urllib2.py", line 392, in open
req = meth(req)
File "C:\Python27\MultipartPostHandler.py", line 35, in http_request
boundary, data = self.multipart_encode(v_vars, v_files)
File "C:\Python27\MultipartPostHandler.py", line 63, in multipart_encode
buffer += '\r\n' + fd.read() + '\r\n'
MemoryError
I'm new to Python and having a hard time grasping it. I also came across another program here; I'll be honest, I don't know how to run it. I tried running it by guessing based on the function names, but that didn't work.
The script in question isn't very smart and builds the POST body in memory.
Thus, to POST a 1 GB file, you'll need 1 GB of memory just to hold that data, plus room for the HTTP headers, boundaries, Python itself and the rest of the code.
You'd have to rework the script to use mmap instead: first construct the whole body in a temp file, then wrap that file in a mmap.mmap value and pass it to request.add_data.
See Python: HTTP Post a large file with streaming for hints on how to achieve that.
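A minimal sketch of that idea, for Python 2 as in the question - write_multipart_body is a hypothetical helper that would stream v_vars and v_files to disk; everything else is stdlib:

import mmap
import tempfile
import urllib2

# Build the multipart body on disk instead of in a str.
tmp = tempfile.TemporaryFile()
write_multipart_body(tmp, v_vars, v_files, boundary)  # hypothetical helper
tmp.flush()

# mmap gives urllib2 a file-like view of the body without holding it
# all in memory at once.
body = mmap.mmap(tmp.fileno(), 0, access=mmap.ACCESS_READ)

request = urllib2.Request("http://127.0.0.1/test_server/upload")
request.add_header("Content-Type",
                   "multipart/form-data; boundary=%s" % boundary)
request.add_data(body)
urllib2.urlopen(request)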

Google App Engine: "Cannot create a file when that file already exists"

I'm running the Google App Engine devserver 1.3.3 on Windows 7.
Usually, this method works fine, but this time it gave an error:
def _deleteType(type):
    results = type.all().fetch(1000)
    while results:
        db.delete(results)
        results = type.all().fetch(1000)
The error:
File "src\modelutils.py", line 38, in _deleteType
db.delete(results)
File "C:\Program Files\Google\google_appengine\google\appengine\ext\db\__init__.py", line 1302, in delete
datastore.Delete(keys, rpc=rpc)
File "C:\Program Files\Google\google_appengine\google\appengine\api\datastore.py", line 386, in Delete
'datastore_v3', 'Delete', req, datastore_pb.DeleteResponse(), rpc)
File "C:\Program Files\Google\google_appengine\google\appengine\api\datastore.py", line 186, in _MakeSyncCall
rpc.check_success()
File "C:\Program Files\Google\google_appengine\google\appengine\api\apiproxy_stub_map.py", line 474, in check_success
self.__rpc.CheckSuccess()
File "C:\Program Files\Google\google_appengine\google\appengine\api\apiproxy_rpc.py", line 149, in _WaitImpl
self.request, self.response)
File "C:\Program Files\Google\google_appengine\google\appengine\api\datastore_file_stub.py", line 667, in MakeSyncCall
response)
File "C:\Program Files\Google\google_appengine\google\appengine\api\apiproxy_stub.py", line 80, in MakeSyncCall
method(request, response)
File "C:\Program Files\Google\google_appengine\google\appengine\api\datastore_file_stub.py", line 775, in _Dynamic_Delete
self.__WriteDatastore()
File "C:\Program Files\Google\google_appengine\google\appengine\api\datastore_file_stub.py", line 610, in __WriteDatastore
self.__WritePickled(encoded, self.__datastore_file)
File "C:\Program Files\Google\google_appengine\google\appengine\api\datastore_file_stub.py", line 656, in __WritePickled
os.rename(tmpfile.name, filename)
WindowsError: [Error 183] Cannot create a file when that file already exists
What am I doing wrong? How could this have failed this time, but usually it doesn't?
UPDATE: I restarted the devserver, and when it came back online, the datastore was empty.
Unfortunately, 1.3.3 is too far back for me to look at its sources and try to diagnose your problem precisely - the SDK has no 1.3.3 release tag and I can't guess which revision of datastore_file_stub.py was in 1.3.3. Can you upgrade to the current version, 1.3.5, and try again? Running old versions (especially 2+ versions back) is not recommended since they may be a little out of sync with what's actually available on Google's servers, anyway (and/or have bugs that are fixed in later versions). Anyway...
On Windows, os.rename doesn't work if the destination exists -- but the revisions I see are careful to catch the OSError that results (WindowsError derives from it), remove the existing file, and try renaming again. So I don't know what could explain your bug -- if the sources of the SDK you're running have that careful arrangement, and I think they do.
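The careful arrangement I mean is roughly this - a from-memory sketch, not the SDK's actual code:

import os

try:
    os.rename(tmp_name, filename)
except OSError:
    # On Windows, rename fails when the destination exists:
    # remove it and retry.
    os.remove(filename)
    os.rename(tmp_name, filename)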
Plus, I'd recommend --use_sqlite (see Nick Johnson's blog announcing it here) in lieu of the file stub for your SDK datastore - it just seems to make more sense!-)
(disclaimer: i'm not answering your question but helping you optimize the code you're running)
your code seems to be massively deleting objects. in the SDK/dev server, you can accomplish wiping out the datastore using this command as a quicker and more convenient alternative:
$ dev_appserver.py -c helloworld
now, that is, if you want to wipe your entire SDK datastore. if not, then of course, don't use it. :-)
more importantly, you can make your code run faster and use less CPU on production if you change your query to be something like:
results = type.all(keys_only=True).fetch(SIZE)
this works the same as yours except it only fetches the keys, as you don't need full entities retrieved from the datastore in order to delete them. also, your code is currently setting SIZE=1000, but you can make it larger than that, esp. if you have an idea of how many entities you have in your system... the 1000-result limit was lifted in 1.3.1 http://bit.ly/ahoLQp
one minor nit... try not to use type as a variable name... that's one of the most important objects and built-in/factory functions in Python. your code may act odd if you do this -- in your case, it's only fractionally better since you're inside a function/method, but that's not going to be true for a global variable.
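putting those together, the loop might look like this (a sketch; model_class just renames your type parameter, and batch_size replaces the hard-coded 1000):

def _delete_type(model_class, batch_size=1000):
    # keys_only avoids pulling full entities just to delete them
    keys = model_class.all(keys_only=True).fetch(batch_size)
    while keys:
        db.delete(keys)
        keys = model_class.all(keys_only=True).fetch(batch_size)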
hope this helps!

WindowsError: priviledged instruction when saving a FreeImagePy Image in script, works in IDLE

I'm working on a program to do some image wrangling in Python for work. I'm using FreeImagePy because PIL doesn't support multi-page TIFFs. Whenever I try to save a file with it from my program I get this error message (or something similar depending on which way I try to save):
Error returned. TIFF FreeImage_Save: failed to open file C:/OCRtmp/ocr page0
Traceback (most recent call last):
File "C:\Python25\Projects\OCRPageUnzipper\PageUnzipper.py", line 102, in <module>
OCRBox.convertToPages("C:/OCRtmp/ocr page",FIPY.FIF_TIFF)
File "C:\Python25\lib\site-packages\FreeImagePy\FreeImagePy\FreeImagePy.py", line 2080, in convertToPages
self.Save(FIF, dib, fileNameOut, flags)
File "C:\Python25\lib\site-packages\FreeImagePy\FreeImagePy\FreeImagePy.py", line 187, in Save
return self.__lib.Save(typ, bitmap, fileName, flags)
WindowsError: exception: priviledged instruction
When I try and do the same things from IDLE, it works fine.
Looks like a permissions issue; make sure you don't have the file open in another application, and that you have write permissions to the file location you're trying to write to.
That's what I thought too, but I figured it out a couple of hours ago. Apparently, if the directory/file I'm trying to write to doesn't exist, FreeImagePy isn't smart enough to create it (most of the time - creating a new multipage image seems to work fine), but I guess when running it within IDLE, IDLE figures it out and takes care of it or something. I managed to work around it by using os.mkdir to explicitly make sure the things I need exist.
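In other words, something like this before the convertToPages call (a sketch; os.makedirs also creates intermediate directories, unlike os.mkdir):

import os

out_dir = "C:/OCRtmp"
# Make sure the output directory exists before FreeImagePy tries to
# save into it; it won't create the directory itself.
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)
OCRBox.convertToPages(out_dir + "/ocr page", FIPY.FIF_TIFF)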
