We're trying to make heavy use of MapReduce in our project. Now we have this problem: there are a lot of 'DeadlineExceededError' errors in the log...
One example (the traceback differs a bit each time):
Traceback (most recent call last):
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 207, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
rv = self.router.dispatch(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
return handler.dispatch()
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~sba/1.362471299468574812/mapreduce/base_handler.py", line 65, in post
self.handle()
File "/base/data/home/apps/s~sba/1.362471299468574812/mapreduce/handlers.py", line 208, in handle
ctx.flush()
File "/base/data/home/apps/s~sba/1.362471299468574812/mapreduce/context.py", line 333, in flush
pool.flush()
File "/base/data/home/apps/s~sba/1.362471299468574812/mapreduce/context.py", line 221, in flush
self.__flush_ndb_puts()
File "/base/data/home/apps/s~sba/1.362471299468574812/mapreduce/context.py", line 239, in __flush_ndb_puts
ndb.put_multi(self.ndb_puts.items, config=self.__create_config())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/model.py", line 3625, in put_multi
for future in put_multi_async(entities, **ctx_options)]
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 323, in get_result
self.check_success()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 318, in check_success
self.wait()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 302, in wait
if not ev.run1():
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/eventloop.py", line 219, in run1
delay = self.run0()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/eventloop.py", line 181, in run0
callback(*args, **kwds)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 365, in _help_tasklet_along
value = gen.send(val)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/context.py", line 274, in _put_tasklet
keys = yield self._conn.async_put(options, datastore_entities)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1560, in async_put
for pbs, indexes in pbsgen:
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1350, in __generate_pb_lists
incr_size = pb.lengthString(pb.ByteSize()) + 1
DeadlineExceededError
My questions are:
How can we avoid this error?
What happens with the job, does it get retried (if so, how can we control it?) or not?
Does it cause data inconsistency in the end?
Apparently you are doing more puts than it is possible to insert in one datastore call. You have multiple options here:
If this is a relatively rare event, ignore it. MapReduce will retry the slice and will lower the put pool size. Make sure that your map is idempotent.
Take a look at http://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/context.py - in your main.py you can lower DATASTORE_DEADLINE, MAX_ENTITY_COUNT or MAX_POOL_SIZE to lower the size of the pool for the whole MapReduce (see the sketch below).
If you're using an InputReader, you might be able to adjust the default batch_size to reduce the number of entities processed by each task.
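A minimal sketch of that tuning in main.py, assuming the bundled mapreduce library exposes these module-level constants as described above (the concrete values are illustrations only, not recommendations):

from mapreduce import context

# Lower the mutation-pool limits before any job starts; tune the values
# down until the DeadlineExceededError goes away.
context.DATASTORE_DEADLINE = 15       # seconds allowed per datastore RPC
context.MAX_ENTITY_COUNT = 100        # entities buffered before an automatic flush
context.MAX_POOL_SIZE = 500 * 1000    # bytes buffered before an automatic flush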
I believe the task queue will retry tasks, but you probably don't want it to, since it will likely hit the same DeadlineExceededError.
Data inconsistencies are possible.
See this question as well: App Engine - Task Queue Retry Count with Mapper API
I am running a google.py scraping script to get data. The script reads a CSV file and, for each line of the CSV file, scrapes a page. After the scraping is done, the script saves the result back to the same CSV file.
The dataframe is several thousand lines long.
After it started getting captcha results on several lines, I interrupted the scraping with Ctrl+C.
I re-ran the script just after, and the dataframe read from the CSV file was 3929 lines shorter.
This is the output of the Ctrl+C:
^CTraceback (most recent call last):
File "google.py", line 255, in <module>
Scraping().scrape()
File "google.py", line 239, in scrape
self.write_dataframe(df_psys,psy,tel_list, mail_list)
File "google.py", line 143, in write_dataframe
df_psys.to_csv('psychologues.csv',sep=';',index=False)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/generic.py", line 3563, in to_csv
return DataFrameRenderer(formatter).to_csv(
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1180, in to_csv
csv_formatter.save()
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 261, in save
self._save()
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 266, in _save
self._save_body()
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 304, in _save_body
self._save_chunk(start_i, end_i)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 311, in _save_chunk
res = df._mgr.to_native_types(**self._number_format)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 473, in to_native_types
return self.apply("to_native_types", **kwargs)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 304, in apply
applied = getattr(b, f)(**kwargs)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 634, in to_native_types
result = to_native_types(self.values, na_rep=na_rep, quoting=quoting, **kwargs)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 2207, in to_native_types
mask = isna(values)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/dtypes/missing.py", line 143, in isna
return _isna(obj)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/dtypes/missing.py", line 172, in _isna
return _isna_array(obj, inf_as_na=inf_as_na)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/dtypes/missing.py", line 254, in _isna_array
result = _isna_string_dtype(values, inf_as_na=inf_as_na)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/dtypes/missing.py", line 278, in _isna_string_dtype
result = libmissing.isnaobj2d(values, inf_as_na=inf_as_na)
KeyboardInterrupt
It seems the interrupt hit during the to_csv call, so I am wondering whether the missing data comes from that or from a hack/physical intervention on my computer. I have another keyboard interrupt from a previous run of the script, and there is no to_csv in it:
^CTraceback (most recent call last):
File "google.py", line 255, in <module>
Scraping().scrape()
File "google.py", line 238, in scrape
(tel_list, mail_list) = self.google_scraping(psy, counter)
File "google.py", line 166, in google_scraping
if a.text == "Que s'est-il passé ?":
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 77, in text
return self._execute(Command.GET_ELEMENT_TEXT)['value']
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 740, in _execute
return self._parent.execute(command, params)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 428, in execute
response = self.command_executor.execute(driver_command, params)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/selenium/webdriver/remote/remote_connection.py", line 347, in execute
return self._request(command_info[0], url, body=data)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/selenium/webdriver/remote/remote_connection.py", line 369, in _request
response = self._conn.request(method, url, body=body, headers=headers)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/urllib3/request.py", line 74, in request
return self.request_encode_url(
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/urllib3/request.py", line 96, in request_encode_url
return self.urlopen(method, url, **extra_kw)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/urllib3/poolmanager.py", line 376, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/http/client.py", line 1337, in getresponse
response.begin()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
KeyboardInterrupt
I didn't notice the missing lines until recently, and I ran the script 3 times after the missing lines appeared. I have tried to look into my terminal's command history with history -E 1 | grep google.py, but I don't have the times I ran the command; only one command shows up, which is the last one I ran.
So I don't know exactly when the deletion of data happened (in the last 24 hours for sure). I would like to check other system log files, but if the deletion comes from a pandas bug I won't look further in my logs...
What do you think?
Is there a way I can prevent the Ctrl+C interrupt from provoking this error?
This is write_dataframe:
def write_dataframe(self, df, psy, tel_list, mail_list):
    index = df[df['psy'] == psy].index.values[0]
    print('writing dataframes')
    df.loc[index, 'tel_google'] = tel_list
    df.loc[index, 'mail_google'] = mail_list
    df.to_csv('file.csv', sep=';', index=False)
If I do
try:
    write_dataframe(args)
except KeyboardInterrupt:
    sys.exit()
will it be enough to prevent loss of data on a keyboard interrupt?
Thank you
Reading the comments and the downvotes, the answer is yes.
Reading this post, How to prevent a block of code from being interrupted by KeyboardInterrupt in Python?, I have implemented the method that seemed most reliable for preventing a keyboard interrupt:

import signal

s = signal.signal(signal.SIGINT, signal.SIG_IGN)
# code that must not be interrupted
signal.signal(signal.SIGINT, s)
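Applied to this script, a minimal sketch could wrap just the CSV write, assuming it runs in the main thread (the helper name and file path are illustrative, not from the original code):

import signal

def write_csv_uninterruptible(df, path='file.csv'):
    # Ignore Ctrl+C for the duration of the write, then restore the old handler,
    # so the file is never left half-written by an interrupt.
    previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        df.to_csv(path, sep=';', index=False)
    finally:
        signal.signal(signal.SIGINT, previous)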
I am using this Python script to migrate data from one ElastiCache Redis instance to another. It uses Redis pipelining to migrate data in chunks.
https://gist.github.com/thomasst/afeda8fe80534a832607
But I am getting this strange error:
Traceback (most recent call last):########### | ETA: 0:00:12
File "migrate-redis.py", line 95, in <module>
migrate()
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 664, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File "migrate-redis.py", line 74, in migrate
results = pipeline.execute(False)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 2593, in execute
return execute(conn, stack, raise_on_error)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 2446, in _execute_transaction
all_cmds = connection.pack_commands([args for args, _ in cmds])
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 637, in pack_commands
output.append(SYM_EMPTY.join(pieces))
MemoryError
There are no issues with RAM, as the node has 6 GB of it.
The memory profile of the source Redis is as follows:
used_memory:1483900120
used_memory_human:1.38G
used_memory_rss:1945829376
used_memory_peak:2431795528
used_memory_peak_human:2.26G
used_memory_lua:86016
mem_fragmentation_ratio:1.31
mem_allocator:jemalloc-3.6.0
What could be the cause of this?
From your error log, this has no relation to your Redis server. The error happens in your Redis client when it packs all the commands into a memory buffer.
Maybe you could try decreasing the SCAN count option in your migrate-redis.py to check whether the chunks are too large to pack.
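As a rough illustration of that idea (not the gist's actual code), scanning with a smaller count and flushing the pipeline every few hundred keys keeps the buffer that pack_commands has to build small; the parameter values below are assumptions:

import redis

def migrate(src, dst, scan_count=100, chunk=500):
    # Copy keys with DUMP/RESTORE, executing the pipeline in small chunks
    # so no single execute() has to pack a huge list of commands.
    # Assumes the destination does not already contain the keys.
    pipe = dst.pipeline(transaction=False)
    pending = 0
    for key in src.scan_iter(count=scan_count):
        data = src.dump(key)
        if data is None:
            continue
        ttl = max(src.pttl(key), 0)   # 0 means "no expiry" for RESTORE
        pipe.restore(key, ttl, data)
        pending += 1
        if pending >= chunk:
            pipe.execute()
            pending = 0
    if pending:
        pipe.execute()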
I just installed Pootle and I'm getting the message "Some data on this page is currently being calculated, and the page will be refreshed automatically in x seconds". Upon going to the admin page, I found out that there is a failed job, so I ran pootle retry_failed_jobs on my command line.
And this is what it says :/
DoesNotExist: Directory matching query does not exist.
Traceback (most recent call last):
File "/var/www/pootle/env/local/lib/python2.7/site-packages/rq/worker.py", line 568, in perform_job
rv = job.perform()
File "/var/www/pootle/env/local/lib/python2.7/site-packages/rq/job.py", line 495, in perform
self._result = self.func(*self.args, **self.kwargs)
File "/var/www/pootle/env/local/lib/python2.7/site-packages/pootle/core/mixins/treeitem.py", line 683, in update_cache_job
instance._update_cache_job(keys, decrement)
File "/var/www/pootle/env/local/lib/python2.7/site-packages/pootle/core/mixins/treeitem.py", line 534, in _update_cache_job
create_update_cache_job_wrapper(p, keys_for_parent, decrement)
File "/var/www/pootle/env/local/lib/python2.7/site-packages/pootle/core/mixins/treeitem.py", line 693, in create_update_cache_job_wrapper
connection.on_commit(_create_update_cache_job)
File "/var/www/pootle/env/local/lib/python2.7/site-packages/transaction_hooks/mixin.py", line 31, in on_commit
func()
File "/var/www/pootle/env/local/lib/python2.7/site-packages/pootle/core/mixins/treeitem.py", line 692, in _create_update_cache_job
create_update_cache_job(queue, instance, keys, decrement=decrement)
File "/var/www/pootle/env/local/lib/python2.7/site-packages/pootle/core/mixins/treeitem.py", line 707, in create_update_cache_job
last_job_key = instance.get_last_job_key()
File "/var/www/pootle/env/local/lib/python2.7/site-packages/pootle/core/mixins/treeitem.py", line 299, in get_last_job_key
key = self.get_cachekey()
File "/var/www/pootle/env/local/lib/python2.7/site-packages/pootle/apps/pootle_translationproject/models.py", line 373, in get_cachekey
return self.directory.pootle_path
File "/var/www/pootle/env/local/lib/python2.7/site-packages/django/db/models/fields/related.py", line 572, in __get__
rel_obj = qs.get()
File "/var/www/pootle/env/local/lib/python2.7/site-packages/django/db/models/query.py", line 357, in get
self.model._meta.object_name)
DoesNotExist: Directory matching query does not exist.
This actually happened when I deleted the project's language using the admin panel, which apparently also deleted that language's folder on the filesystem. What I did was create a new project and copy the translation files over. So I didn't resolve the problem, but I did get rid of the constant refreshing.
The stats in Pootle are managed by Redis. Pootle can sometimes get into a state where the stats are broken. Issues like broken files can cause this. You can clean up the stats using this guide.
I'd also report the situation and any tracebacks to the Pootle developers so that they can make the stats calculations more robust.
We're trying to make heavy use of MapReduce in our project. Now we have this problem: there are a lot of 'InternalError: internal error.' errors in the log...
One example of it:
"POST /mapreduce/worker_callback HTTP/1.1" 500 0 "http://appname/mapreduce/worker_callback" "AppEngine-Google;
(+http://code.google.com/appengine)" "appname.appspot.com" ms=18856 cpu_ms=15980
queue_name=default task_name=appengine-mrshard-15828822618486744D69C-11-195
instance=00c61b117c47e0cba49bc5e5c7f9d328693e95ce
W 2012-10-24 06:51:27.140
suspended generator _put_tasklet(context.py:274) raised InternalError(internal error.)
W 2012-10-24 06:51:27.153
suspended generator put(context.py:703) raised InternalError(internal error.)
E 2012-10-24 06:51:27.207
internal error.
Traceback (most recent call last):
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1511, in __call__
rv = self.handle_exception(request, response, e)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
rv = self.router.dispatch(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
return handler.dispatch()
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 547, in dispatch
return self.handle_exception(e, self.app.debug)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~appname/1.362664407983567993/mapreduce/base_handler.py", line 65, in post
self.handle()
File "/base/data/home/apps/s~appname/1.362664407983567993/mapreduce/handlers.py", line 208, in handle
ctx.flush()
File "/base/data/home/apps/s~appname/1.362664407983567993/mapreduce/context.py", line 333, in flush
pool.flush()
File "/base/data/home/apps/s~appname/1.362664407983567993/mapreduce/context.py", line 221, in flush
self.__flush_ndb_puts()
File "/base/data/home/apps/s~appname/1.362664407983567993/mapreduce/context.py", line 239, in __flush_ndb_puts
ndb.put_multi(self.ndb_puts.items, config=self.__create_config())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/model.py", line 3650, in put_multi
for future in put_multi_async(entities, **ctx_options)]
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 325, in get_result
self.check_success()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 368, in _help_tasklet_along
value = gen.throw(exc.__class__, exc, tb)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/context.py", line 703, in put
key = yield self._put_batcher.add(entity, options)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 368, in _help_tasklet_along
value = gen.throw(exc.__class__, exc, tb)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/context.py", line 274, in _put_tasklet
keys = yield self._conn.async_put(options, datastore_entities)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 454, in _on_rpc_completion
result = rpc.get_result()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 834, in get_result
result = rpc.get_result()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 604, in get_result
return self.__get_result_hook(self)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1569, in __put_hook
self.check_rpc_success(rpc)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1224, in check_rpc_success
raise _ToDatastoreError(err)
InternalError: internal error.
queue.yaml:
queue:
- name: default
  rate: 500/s
  bucket_size: 100
  max_concurrent_requests: 400
  retry_parameters:
    min_backoff_seconds: 5
    max_backoff_seconds: 120
    max_doublings: 2
MapReduce mapper params:
'shard_count': 16,
'processing_rate': 200,
'batch_size': 20
We would like to increase these numbers since we need more processing speed, but as soon as we increase them the error rate goes up...
Blobstore files count: several (some of them contain millions of lines)
Frontend instance class: F4
Processing flow:
We use only the mapper for this particular processing.
We use BlobstoreLineInputReader (the blob contains a text file).
Each line represents a new entry we need to create if it does not exist already (some of them we update).
My questions are:
How can we avoid these errors?
Are there any tips/hints on how we can choose/balance the mapper params (shard_count, processing_rate, batch_size)?
What happens with the job, does it get retried (if so, how can we control it?) or not?
BTW, we tried to play with some of the suggestions provided here (controlling batch_size), but we still see this.
This looks like a timeout error - check your logs to see how long the request runs before this happens.
If it is, try reducing the number of items you pass to put_multi() (i.e. reduce your batch size), and add a timer check so that when your average time per put_multi() call gets close to the request time limit, you stop and let another task pick up the rest.
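A minimal sketch of such a timer check, assuming the writes happen inside a task-queue request with roughly a 10-minute budget (the constant, helper name and batch size are illustrative):

import time
from google.appengine.ext import ndb

REQUEST_BUDGET = 540  # seconds; assumed task-queue request limit

def put_with_deadline_check(entities, start_time, batch_size=20):
    # Write in small batches; stop early when the request is close to its
    # time limit and return whatever is left so a later task can finish it.
    for i in range(0, len(entities), batch_size):
        if time.time() - start_time > REQUEST_BUDGET * 0.8:
            return entities[i:]
        ndb.put_multi(entities[i:i + batch_size])
    return []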
I have created a backend for my google app that looks like this:
backends:
- name: dbops
options: dynamic
and I've created an admin handler for it:
- url: /backend/.*
script: backend.app
login: admin
Now I understand that admin jobs should be able to run forever and I'm launching this job with a TaskQueue, but for some reason mine is not. My job is simply creating a summary table in datastore from a much larger table. This table holds about 12000 records and it takes several minutes for it to process the job on the development server, but it works fine. When I push the code out to appspot and try to get it to run the same job, I'm getting what looks like datastore timeouts.
Traceback (most recent call last):
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 1536, in __call__
rv = self.handle_exception(request, response, e)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 1530, in __call__
rv = self.router.dispatch(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 1102, in __call__
return handler.dispatch()
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 572, in dispatch
return self.handle_exception(e, self.app.debug)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~myzencoder/dbops.362541511260492787/backend.py", line 626, in get
for asset in assets:
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 2314, in next
return self.__model_class.from_entity(self.__iterator.next())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2816, in next
next_batch = self.__batcher.next()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2678, in next
return self.next_batch(self.AT_LEAST_ONE)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2715, in next_batch
batch = self.__next_batch.get_result()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 604, in get_result
return self.__get_result_hook(self)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2452, in __query_result_hook
self._batch_shared.conn.check_rpc_success(rpc)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1224, in check_rpc_success
raise _ToDatastoreError(err)
Timeout: The datastore operation timed out, or the data was temporarily unavailable.
Anyone got any suggestions on how to make this work?
While the backend request can run for a long time, a single query can only run for 60 seconds. You'll have to loop over your query results with cursors, for example:
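Here is a minimal sketch of that pattern with the old db API; the Asset model and the summary work are placeholders for your own code:

from google.appengine.ext import db

class Asset(db.Model):                # placeholder for the real model
    summarized = db.BooleanProperty(default=False)

def process_all_assets(batch_size=500):
    query = Asset.all()
    cursor = None
    while True:
        if cursor:
            query.with_cursor(cursor)
        batch = query.fetch(batch_size)
        if not batch:
            break
        for asset in batch:
            asset.summarized = True   # stand-in for the real summary work
        db.put(batch)                 # write each page back in one batch call
        cursor = query.cursor()       # resume point for the next iteration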
Mapreduce will get you a result quicker by doing the queries in parallel.
In production you use the HR datastore and you can run into contention problems. See this article.
https://developers.google.com/appengine/articles/scaling/contention?hl=nl
And have a look at mapreduce for creating a report. Maybe this is a better solution.
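If you go the mapreduce route, a rough sketch of starting a mapper over your kind with the appengine-mapreduce library could look like this (the job name, handler path, kind and shard count are assumptions, not your actual code):

from mapreduce import control
from mapreduce import operation as op

def summarize_asset(entity):
    # Map function: yield a datastore mutation so the framework batches the puts.
    entity.summarized = True          # illustrative field
    yield op.db.Put(entity)

def start_summary_job():
    return control.start_map(
        name="build-summary",
        handler_spec="backend.summarize_asset",
        reader_spec="mapreduce.input_readers.DatastoreInputReader",
        mapper_parameters={"entity_kind": "models.Asset"},
        shard_count=8)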