GAE app update: how to avoid abruptly stopping a long-running process? - python

I have a GAE app that spawns a long-running process in another module (managed by basic_scaling).
This long process handles DeadlineExceededError correctly by spawning a deferred method that saves the current state of the long process so it can be resumed later.
Today I discovered that when I run appcfg.py -A <YOUR_PROJECT_ID> update myapp/, it abruptly stops the long process. It just stops: no DeadlineExceededError (there goes my hope), nothing.
Is there some event triggered by GAE before it stops the app that would let me save the current state of my long process, write data to files (via S3, so a bit slow), and re-queue the process to be re-run later? (Or something like this?)
Thank you for your help.

From Scaling types and instance classes, both manual and basic scaling appear to behave identically from the instance-shutdown perspective:
As with manual scaling, an instance that is stopped (with appcfg stop or from the Cloud Platform Console) has 30 seconds to finish handling requests before it is forcibly terminated.
I assume the same shutdown method is used when the app is updated.
And from Shutdown:
There are two ways for an app to determine if a manual scaling
instance is about to be shut down. First, the is_shutting_down()
method from google.appengine.api.runtime starts returning true. Second
(and preferred), you can register a shutdown hook, as described below.
When App Engine begins to shut down an instance, existing requests are
given 30 seconds to complete, and new requests immediately return 404.
If an instance is handling a request, App Engine pauses the request
and runs the shutdown hook. If there is no active request, App Engine
sends an /_ah/stop request, which runs the shutdown hook. The
/_ah/stop request bypasses normal handling logic and cannot be handled
by user code; its sole purpose is to invoke the shutdown hook. If you
raise an exception in your shutdown hook while handling another
request, it will bubble up into the request, where you can catch it.
If you have enabled concurrent requests by specifying threadsafe: true
in app.yaml (which is the default), raising an exception from a
shutdown hook copies that exception to all threads. The following code
sample demonstrates a basic shutdown hook:
from google.appengine.api import apiproxy_stub_map
from google.appengine.api import runtime

def my_shutdown_hook():
    apiproxy_stub_map.apiproxy.CancelApiCalls()
    save_state()
    # May want to raise an exception

runtime.set_shutdown_hook(my_shutdown_hook)
Alternatively, the following sample demonstrates how to use the
is_shutting_down() method:
while more_work_to_do and not runtime.is_shutting_down():
    do_some_work()
    save_state()
Note: It's important to recognize that the shutdown hook is not always
able to run before an instance terminates. In rare cases, an outage
can occur that prevents App Engine from providing 30 seconds of
shutdown time. Thus, we recommend periodically checkpointing the state
of your instance and using it primarily as an in-memory cache rather
than a reliable data store.
Based on my assumption above, I expect these methods to work for your case as well, so give them a try.
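Putting the pieces together for your case, a minimal sketch (collect_current_state, save_state_to_s3 and resume_long_process are hypothetical names; re-queueing via the deferred library mirrors what you already do for DeadlineExceededError):

from google.appengine.api import apiproxy_stub_map
from google.appengine.api import runtime
from google.appengine.ext import deferred

def my_shutdown_hook():
    # Cancel in-flight API calls to reclaim as much of the ~30s window as possible.
    apiproxy_stub_map.apiproxy.CancelApiCalls()
    state = collect_current_state()      # hypothetical: snapshot progress
    save_state_to_s3(state)              # hypothetical: your (slow-ish) S3 write
    deferred.defer(resume_long_process)  # re-queue the work to run later

runtime.set_shutdown_hook(my_shutdown_hook)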

It looks like you're replacing an existing version of your app (the default version). When you do this, existing processing is not handled gracefully.
Whenever I update the production version of my app, I deploy it as a new version. I use the current date for the version name (e.g., 2016-05-13). I then go to the Google Cloud console and make that new version the default. This way, the old version continues to run in parallel.
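For example, using the same appcfg workflow as in the question (the -V flag overrides the version from app.yaml; the dated name is just a convention):

appcfg.py -A <YOUR_PROJECT_ID> -V 2016-05-13 update myapp/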
I asked a similar question a couple years ago that you can see here.

Related

python raven timing out when using django logging from a celery worker

I am using raven to log from my Celery jobs to Sentry. I am finding that whenever I use the Django logging system to log to Sentry, each logging call can take minutes (though the log succeeds). If I remove Sentry from my logging configuration, it is instant.
I tried reverting to using raven directly via:
import raven
client = raven.Client("DSN")
client.captureMessage("message")
This works with no delay inside the worker.
But if I try to use the Django-specific client instead, as below, the delay exists:
from raven.contrib.django.raven_compat.models import client
client.captureMessage("message")
It is usually a little over 2 minutes, so it looks like a timeout, but the operation succeeds.
The delays are adding up and making my job queue unreliable.
If you're using the default Celery worker model, things should generally just work. If you're using something else, that may be less true.
By default, the Python client uses a threaded worker: upon instantiation, it creates a queue and a thread to process messages asynchronously. Depending on how this happens, it can cause problems (e.g. under a pre-forking server, or if you're using something like gevent without patching threads).
You can try changing the transport to be synchronous to confirm this is related:
https://docs.getsentry.com/hosted/clients/python/transports/
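A quick way to test that, sketched below (HTTPTransport is raven's blocking transport; "DSN" is a placeholder):

from raven import Client
from raven.transport.http import HTTPTransport

# With a synchronous transport, captureMessage blocks until the send finishes,
# so a hang here points at the network call rather than the worker thread.
client = Client("DSN", transport=HTTPTransport)
client.captureMessage("message")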

Managed VM Background thread loses environment

I am having some problems using background threads in a Managed VM on Google App Engine.
I am getting callbacks from a library linked via ctypes which need to be executed in the background, as I explained in a previous question.
The problem is: the application loses its execution context (the WSGI application) and is missing environment variables like the application ID. Without those, calls to the database fail.
I start the background thread like this:
background_thread.start_new_background_thread(saveItemsToDatabase, [])
Is there a way to copy the environment to the background thread or maybe execute the task in a different context?
Update: the traceback, which makes the problem clear:
raise _ToDatastoreError(err)
google.appengine.api.datastore_errors.BadRequestError: Application Id (app) format is invalid: '_'
The application context is thread-local in App Engine when created through the standard app handler. Remember that App Engine applications running on python27 with threadsafe enabled already use threads, so each WSGI call's environment variables have to be thread-local, or information would leak between handled requests.
This means that any additional threads you create need to be passed the app context explicitly.
In fact, the docs on background threads are pretty clear about what is going on, https://cloud.google.com/appengine/docs/python/modules/#Python_Background_threads - "A background thread's os.environ and logging entries are independent of those of the spawning thread."
So you have to copy the env (os.environ), or the parts you need, and pass it to the thread as arguments. The problem may not be limited to the app ID; you may find that's only the first thing missing. For instance, if you use namespaces.
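A minimal sketch of that approach (assumes saveItemsToDatabase can accept the snapshot as an argument; copy whichever keys your code actually depends on):

import os
from google.appengine.api import background_thread

def saveItemsToDatabase(env_snapshot):
    # Restore the spawning request's context before touching the datastore.
    os.environ.update(env_snapshot)
    # ... datastore calls go here ...

# In the request handler, while os.environ is still the request's thread-local copy:
env_snapshot = dict(os.environ)
background_thread.start_new_background_thread(saveItemsToDatabase, [env_snapshot])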

Random python module loading failures with threading

I'm trying to debug an error where Python import statements randomly fail; at other times they run cleanly.
This is an example of the exceptions I see. Sometimes I'll see this one, sometimes another one in a different module, though it always seems to hit one of four modules.
ERROR:root:/home/user/projecteat/django/contrib/auth/management/__init__.py:25: RuntimeWarning: Parent module 'django.contrib.auth.management' not found while handling absolute import
from django.contrib.contenttypes.models import ContentType
Because of the random nature, I'm almost certain it's a threading issue, but I don't understand why I would get import errors, so I'm not sure what to look for when debugging. Could this be caused by filesystem contention if different threads are trying to load the same modules?
I'm trying to get Django 1.4's LiveServerTestCase working on Google App Engine's development server. The main thread runs django's test framework. When it loads up a LiveServerTestCase based test class, it spawns a child thread which launches the App Engine dev_appserver, which is a local webserver. The main thread continues to run the test, using the Selenium driver to make HTTP requests, which are handled by dev_appserver on the child thread.
The test framework may run a few tests in the LiveServerTestCase based class before tearing down the testcase class. At teardown, the child thread is ended.
It looks like the exceptions are happening in the child (HTTP server) thread, mostly between tests within a single testcase class.
The code for the App Engine LiveServerTestCase class is here: https://github.com/dragonx/djangoappengine/blob/django-1.4/test.py
It's pretty hard to provide all the debugging info required for this question. I'm mostly looking for suggestions as to why python import statements would give RuntimeWarning errors.
I have a partial answer to my own question. What's going on is that I have two threads running.
Thread 1 runs the main internal function inside dev_appserver (dev_appserver_main), which handles HTTP requests.
Thread 2 runs the Selenium-based test cases. This thread sends commands to the browser to do something (which indirectly generates an HTTP request and re-enters Thread 1). It then either issues more requests to Selenium to check status, or makes a datastore query to check for a result.
I think the problem is that while handling each HTTP request, Thread 1 (dev_appserver) changes the environment so that certain folders are not accessible (folders excluded in app.yaml, as well as parts of the environment that are not part of App Engine). If Thread 2 happens to run some code during this window, imports of modules located in those folders can fail.
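If that diagnosis is right, one possible mitigation (an assumption, not something I've verified) is to eagerly import the fragile modules in the main thread before the dev_appserver thread starts, so later imports are served from sys.modules instead of the temporarily restricted filesystem:

# At test-suite startup, before spawning the dev_appserver thread.
# Once cached in sys.modules, these imports no longer hit the filesystem;
# the module names are just the ones from the warning above.
import django.contrib.auth.management
import django.contrib.contenttypes.models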

Flask/Werkzeug debugger, process model, and initialization code

I'm writing a Python web application using Flask. My application establishes a connection to another server at startup, and communicates with that server periodically in the background.
If I don't use Flask's builtin debugger (invoking app.run with debug=False), no problem.
If I do use the builtin debugger (invoking app.run with debug=True), Flask starts a second Python process with the same code. It's the child process that ends up listening for HTTP connections and generally behaving as my application is supposed to, and I presume the parent is just there to watch over it when the debugger kicks in.
However, this wreaks havoc with my startup code, which runs in both processes; I end up with 2 connections to the external server, 2 processes logging to the same logfile, and in general, they trip over each other.
I presume that I should not be doing real work before the call to app.run(), but where should I put this initialization code (which I only want to run once per Flask process group, regardless of the debugger mode, but which needs to run at startup and independent of client requests)?
I found this question about "Flask auto-reload and long-running thread", which is somewhat related but somewhat different, and the answer didn't help me. (I too have a separate long-running thread marked as a daemon thread, and it is killed when the reloader kicks in; but the problem I'm trying to solve arises before any reload happens. I'm not concerned with the reload; I'm concerned with the extra process, and the right way to avoid executing unnecessary code in the parent process.)
I confirmed this behavior is due to Werkzeug, not Flask proper, and it is related to the reloader. You can see this in Werkzeug's serving.py: in run_simple(), if use_reloader is true, it invokes make_server via a helper function run_with_reloader() / restart_with_reloader(), which does a subprocess.call(sys.executable) after setting the environment variable WERKZEUG_RUN_MAIN, which the subprocess inherits.
I worked around it with a fairly ugly hack: in my main function, before creating the WSGI application object and calling app.run(), I look for WERKZEUG_RUN_MAIN:
if use_reloader and not os.environ.get('WERKZEUG_RUN_MAIN'):
    logger.warning('startup: pid %d is the werkzeug reloader' % os.getpid())
else:
    logger.warning('startup: pid %d is the active werkzeug' % os.getpid())
    # my real init code is invoked from here
I have a feeling this would be better done from inside the application object, if there's a method that's called before Werkzeug starts serving it. I don't know of such a method, though.
This all boils down to: in Werkzeug's serving.py, there's only going to be one eventual call to make_server().serve_forever(), but there may be two calls to run_simple() (and the entire call stack up to that point) before we make it to make_server().
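As an aside, newer Werkzeug versions expose this same check as a helper, which would make the guard less of a hack (hedged: confirm your Werkzeug version has it; init_app is a hypothetical name for the one-time startup code):

from werkzeug.serving import is_running_from_reloader

# Run one-time startup only in the process that actually serves requests:
# either the reloader is off, or this is the reloaded child process.
if not use_reloader or is_running_from_reloader():
    init_app()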

Can AppEngine python threads last longer than the original request?

We're trying to use the new python 2.7 threading ability in Google App Engine and it seems like the created thread is getting killed before it finishes running. Our scenario:
User sends a message to the server
We update the user's data
We spawn a thread to do some more heavy duty processing
We return a response to the user before waiting for the heavy duty processing to finish
My assumption was that the thread would continue to run after the request had returned, as long as it did not exceed the total request time limit. What we're seeing, though, is that the thread is randomly killed partway through its execution. No exceptions, no errors, nothing. It just stops running.
Are threads allowed to exist after the response has been returned? This does not repro on the dev server, only on live servers.
We could of course use a task queue instead, but that's a real pain, since we'd have to set up a URL for the action and serialize/deserialize the data.
The 'Sandboxing' section of this page:
http://code.google.com/appengine/docs/python/python27/using27.html#Sandboxing
indicates that threads cannot run past the end of the request.
Deferred tasks are the way to do this. You don't need a URL or serialization to use them:
from google.appengine.ext import deferred
deferred.defer(myfunction, arg1, arg2)
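A slightly fuller sketch for the scenario in the question (heavy_processing and its arguments are hypothetical; _queue is an optional deferred.defer parameter):

from google.appengine.ext import deferred

def heavy_processing(user_id, message):
    # Runs later on a task-queue worker, so it survives the end of the
    # original request; arguments only need to be picklable.
    pass

# In the request handler, instead of spawning a thread:
deferred.defer(heavy_processing, user_id, message, _queue='default')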
