I'm writing a Python web application using Flask. My application establishes a connection to another server at startup, and communicates with that server periodically in the background.
If I don't use Flask's builtin debugger (invoking app.run with debug=False), no problem.
If I do use the builtin debugger (invoking app.run with debug=True), Flask starts a second Python process with the same code. It's the child process that ends up listening for HTTP connections and generally behaving as my application is supposed to, and I presume the parent is just there to watch over it when the debugger kicks in.
However, this wreaks havoc with my startup code, which runs in both processes; I end up with 2 connections to the external server, 2 processes logging to the same logfile, and in general, they trip over each other.
I presume that I should not be doing real work before the call to app.run(), but where should I put this initialization code (which I only want to run once per Flask process group, regardless of the debugger mode, but which needs to run at startup and independent of client requests)?
I found this question about "Flask auto-reload and long-running thread", which is somewhat related but somewhat different, and the answer didn't help me. (I too have a separate long-running thread marked as a daemon thread, and it is killed when the reloader kicks in, but the problem I'm trying to solve occurs before any reload needs to happen. I'm not concerned with the reload; I'm concerned with the extra process, and the right way to avoid executing unnecessary code in the parent process.)
I confirmed that this behavior is due to Werkzeug, not Flask proper, and that it is related to the reloader. You can see this in Werkzeug's serving.py -- in run_simple(), if use_reloader is true, it invokes make_server() via the helpers run_with_reloader() / restart_with_reloader(), which do a subprocess.call(sys.executable) after setting the environment variable WERKZEUG_RUN_MAIN, which is inherited by the subprocess.
I worked around it with a fairly ugly hack: in my main function, before creating the wsgi application object and calling app.run(), I look for WERKZEUG_RUN_MAIN:
if use_reloader and not os.environ.get('WERKZEUG_RUN_MAIN'):
    logger.warning('startup: pid %d is the werkzeug reloader' % os.getpid())
else:
    logger.warning('startup: pid %d is the active werkzeug' % os.getpid())
    # my real init code is invoked from here
I have a feeling this would be better done from inside the application object, if there's a method that's called before Werkzeug starts serving it. I don't know of such a method, though.
This all boils down to: in Werkzeug's serving.py, there's only going to be one eventual call to make_server().serve_forever(), but there may be two calls to run_simple() (and the entire call stack up to that point) before we make it to make_server().
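(For what it's worth, since I'm not concerned with the reload itself, one option is to keep the debugger but turn the reloader off; use_reloader is an option that Flask passes straight through to Werkzeug's run_simple():)

# keeps the interactive debugger but skips the reloader, so only one process runs
app.run(debug=True, use_reloader=False)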
Related
I am new to gunicorn multiprocessing (enabled by calling gunicorn --workers=X).
I am using it with Flask to provide the WSGI implementation for our production frontend. To use multiprocessing, we pass the above-mentioned parameter to gunicorn.
Our Flask application also uses APScheduler (via Flask-APScheduler) to run a cron task every Y hours. This task searches for new database entries to process, and when it finds them, starts processing them one by one.
The task should obviously only be run by one worker. But because of gunicorn, X workers are now spawned, each running the task every Y hours, creating race conditions.
Is there a way to make the code atomic so that I can set the "processed" variable in the DB entry to true? Or, maybe, tell gunicorn to only run that specific code on the parent process, or first spawned worker?
Thanks for every input! :-)
The --preload parameter for gunicorn gives you an opportunity to run code only in the parent (master) process.
All the code that runs before app.run() (i.e. at module import time, before gunicorn picks up your Flask() application object) is apparently run in the parent process.
I didn't find any documentation on this unfortunately, but this post led me to it.
So, running the APScheduler code there makes sure that it's only run (or registered, in this case) once.
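A rough sketch of the pattern (using plain APScheduler's BackgroundScheduler here; process_new_entries and the 6-hour interval are placeholders for your own task and schedule):

# myapp.py -- run with: gunicorn --preload --workers=4 myapp:app
from apscheduler.schedulers.background import BackgroundScheduler
from flask import Flask

app = Flask(__name__)

def process_new_entries():
    pass  # placeholder: look for unprocessed DB rows and handle them

# Module-level code runs once in the gunicorn master when --preload is used,
# so the job is registered (and its scheduler thread runs) in a single process only.
scheduler = BackgroundScheduler()
scheduler.add_job(process_new_entries, 'interval', hours=6)
scheduler.start()

@app.route('/')
def index():
    return 'ok'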
I have a GAE app that spawns a long-running process via another module (managed by basic_scaling).
This long process handles DeadlineExceededError correctly by spawning a deferred method that saves the current state of the long process so it can be resumed later.
Today I discovered that when I do an appcfg.py -A <YOUR_PROJECT_ID> update myapp/, it abruptly stops the long process. It just stops: no DeadlineExceededError (there goes my hope), nothing.
Are there any events triggered by GAE before it stops the app that would let me save the current state of my long process, write data to files (via S3, so a bit slow), and re-queue the process to be re-run later (or something like that)?
Thank you for your help.
From Scaling types and instance classes, both manual and basic scaling appear to behave identically from the instance-shutdown perspective:
As with manual scaling, an instance that is stopped (with appcfg stop or from the Cloud Platform Console) has 30 seconds to finish handling requests before it is forcibly terminated.
I assume the same shutdown method is used when the app is updated.
And from Shutdown:
There are two ways for an app to determine if a manual scaling
instance is about to be shut down. First, the is_shutting_down()
method from google.appengine.api.runtime starts returning true. Second
(and preferred), you can register a shutdown hook, as described below.
When App Engine begins to shut down an instance, existing requests are
given 30 seconds to complete, and new requests immediately return 404.
If an instance is handling a request, App Engine pauses the request
and runs the shutdown hook. If there is no active request, App Engine
sends an /_ah/stop request, which runs the shutdown hook. The
/_ah/stop request bypasses normal handling logic and cannot be handled
by user code; its sole purpose is to invoke the shutdown hook. If you
raise an exception in your shutdown hook while handling another
request, it will bubble up into the request, where you can catch it.
If you have enabled concurrent requests by specifying threadsafe: true
in app.yaml (which is the default), raising an exception from a
shutdown hook copies that exception to all threads. The following code
sample demonstrates a basic shutdown hook:
from google.appengine.api import apiproxy_stub_map
from google.appengine.api import runtime

def my_shutdown_hook():
    apiproxy_stub_map.apiproxy.CancelApiCalls()
    save_state()
    # May want to raise an exception

runtime.set_shutdown_hook(my_shutdown_hook)
Alternatively, the following sample demonstrates how to use the
is_shutting_down() method:
while more_work_to_do and not runtime.is_shutting_down():
    do_some_work()
    save_state()
Note: It's important to recognize that the shutdown hook is not always
able to run before an instance terminates. In rare cases, an outage
can occur that prevents App Engine from providing 30 seconds of
shutdown time. Thus, we recommend periodically checkpointing the state
of your instance and using it primarily as an in-memory cache rather
than a reliable data store.
Based on my assumption above, I expect these methods to work for your case as well; give them a try.
It looks like you are replacing an existing version of your app (the default version). When you do this, existing processing is not handled gracefully.
Whenever I update the production version of my app, I deploy it as a new version. I use the current date for the version name (e.g., 2016-05-13). I then go to the Google Cloud Console and make that new version the default. This way, the old version continues to run in parallel.
I asked a similar question a couple years ago that you can see here.
I am having some problems using background threads in a Managed VM on Google App Engine.
I am getting callbacks from a library linked via ctypes which need to be executed in the background, as I explained in a previous question.
The problem is: the application loses its execution context (the WSGI application) and is missing environment variables like the application id. Without those, calls to the database fail.
I start the background thread like this:
background_thread.start_new_background_thread(saveItemsToDatabase, [])
Is there a way to copy the environment to the background thread or maybe execute the task in a different context?
Update: the traceback, which makes the problem clear:
_ToDatastoreError(err)google.appengine.api.datastore_errors.BadRequestError: Application Id (app) format is invalid: '_'
The application context is thread-local in App Engine when it is created through the standard app handler. Remember that applications in App Engine running on python27 with threading enabled already use multiple threads, so each WSGI call's environment variables have to be thread-local, or information would leak between the requests being handled.
This means that any additional threads you create will need to be passed the app context explicitly.
In fact, when you start reading the docs on background threads, it is pretty clear what is going on, https://cloud.google.com/appengine/docs/python/modules/#Python_Background_threads - "A background thread's os.environ and logging entries are independent of those of the spawning thread."
So you have to copy the env (os.environ), or the parts you need, and pass it to the thread as arguments. The problem may not be limited to the app id; you may find that's only the first thing missing, for instance if you use namespaces.
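A rough sketch of that idea, reusing the start_new_background_thread call from the question (copying all of os.environ here; in practice you may only need the application id and namespace keys):

import os
from google.appengine.api import background_thread

def start_background_save():
    parent_env = dict(os.environ)       # snapshot while still in the request thread

    def _run():
        os.environ.update(parent_env)   # restore the application id, namespace, etc.
        saveItemsToDatabase()           # the original target function

    background_thread.start_new_background_thread(_run, [])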
I've recently started experimenting with using Python for web development. So far I've had some success using Apache with mod_wsgi and the Django web framework for Python 2.7. However I have run into some issues with having processes constantly running, updating information and such.
I have written a script I call "daemonManager.py" that can start and stop all or individual python update loops (Should I call them Daemons?). It does that by forking, then loading the module for the specific functions it should run and starting an infinite loop. It saves a PID file in /var/run to keep track of the process. So far so good. The problems I've encountered are:
Now and then one of the processes will just quit. I check ps in the morning and the process is simply gone. No errors were logged (I'm using the logging module), and I'm catching every exception I can think of and logging them. Also, I don't think these quitting processes have anything to do with my code, because all my processes run completely different code and exit at pretty similar intervals. I could be wrong of course. Is it normal for Python processes to just die after they've run for days/weeks? How should I tackle this problem? Should I write another daemon that periodically checks whether the other daemons are still running? What if that daemon stops? I'm at a loss on how to handle this.
How can I programmatically know if a process is still running or not? I'm saving the PID files in /var/run and checking whether the PID file is there to determine whether or not the process is running. But if the process dies of unexpected causes, the PID file remains. I therefore have to delete these files every time a process crashes (a couple of times per week), which sort of defeats the purpose. I guess I could check whether a process is running at the PID in the file, but what if another process has started and been assigned the PID of the dead process? My daemon would think that the process is running fine even if it's long dead. Again, I'm at a loss as to how to deal with this.
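(For reference, the kind of liveness check I have in mind is roughly this; it still doesn't solve the PID-reuse problem described above:)

import errno
import os

def pid_alive(pid):
    """Rough check whether a PID currently exists (Unix only)."""
    try:
        os.kill(pid, 0)                # signal 0 probes existence without sending anything
    except OSError as e:
        return e.errno == errno.EPERM  # process exists but belongs to another user
    return True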
I will accept any useful answer on how best to run long-lived Python processes, ideally one that also sheds some light on the problems above.
I'm using Apache 2.2.14 on an Ubuntu machine.
My Python version is 2.7.2
I'll open by stating that this is just one way to manage a long-running process (LRP) -- not the de facto approach by any stretch.
In my experience, the best possible product comes from concentrating on the specific problem you're dealing with, while delegating supporting tech to other libraries. In this case, I'm referring to the act of backgrounding processes (the art of the double fork), monitoring, and log redirection.
My favorite solution is http://supervisord.org/
Using a system like supervisord, you basically write a conventional python script that performs a task while stuck in an "infinite" loop.
#!/usr/bin/python
import sys
import time

def main_loop():
    while 1:
        # do your stuff...
        time.sleep(0.1)

if __name__ == '__main__':
    try:
        main_loop()
    except KeyboardInterrupt:
        print >> sys.stderr, '\nExiting by user request.\n'
        sys.exit(0)
Writing your script this way makes it simple and convenient to develop and debug (you can easily start/stop it in a terminal, watching the log output as events unfold). When it comes time to throw into production, you simply define a supervisor config that calls your script (here's the full example for defining a "program", much of which is optional: http://supervisord.org/configuration.html#program-x-section-example).
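To make that concrete, a minimal program section might look roughly like this (the program name and paths are placeholders; everything beyond command is optional):

[program:mydaemon]
command=/usr/bin/python /opt/myapp/mydaemon.py
autostart=true
autorestart=true              ; restart the process if it exits unexpectedly
redirect_stderr=true
stdout_logfile=/var/log/mydaemon.log

With that in place, supervisorctl start/stop/status mydaemon replaces the hand-rolled PID-file handling.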
Supervisor has a bunch of configuration options so I won't enumerate them, but I will say that it specifically solves the problems you describe:
Backgrounding/Daemonizing
PID tracking (can be configured to restart a process should it terminate unexpectedly)
Log redirection: log normally in your script (a stream handler if you're using the logging module, rather than printing) and let supervisor redirect the output to a file for you.
You should consider Python processes as able to run "forever" assuming you don't have any memory leaks in your program, the Python interpreter, or any of the Python libraries / modules that you are using. (Even in the face of memory leaks, you might be able to run forever if you have sufficient swap space on a 64-bit machine. Decades, if not centuries, should be doable. I've had Python processes survive just fine for nearly two years on limited hardware -- before the hardware needed to be moved.)
Ensuring programs restart when they die used to be very simple back when Linux distributions used SysV-style init -- you just add a new line to the /etc/inittab and init(8) would spawn your program at boot and re-spawn it if it dies. (I know of no mechanism to replicate this functionality with the new upstart init-replacement that many distributions are using these days. I'm not saying it is impossible, I just don't know how to do it.)
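For reference, a SysV respawn entry looked roughly like this (the id field and script path are placeholders; the script must stay in the foreground, i.e. not daemonize itself, for respawn to work):

# /etc/inittab -- format: id:runlevels:action:process
pyd1:2345:respawn:/usr/bin/python /usr/local/bin/your_daemon.py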
But even the init(8) mechanism of years gone by wasn't as flexible as some would have liked. The daemontools package by DJB is one example of process control-and-monitoring tools intended to keep daemons living forever. The Linux-HA suite provides another similar tool, though it might provide too much "extra" functionality to be justified for this task. monit is another option.
I assume you are running Unix/Linux, but you don't really say. I have no direct advice on your issue, so I don't expect this to be the "right" answer to the question. But there is something to explore here.
First, if your daemons are crashing, you should fix that. Only programs with bugs should crash. Perhaps you should launch them under a debugger and see what happens when they crash (if that's possible). Do you have any trace logging in these processes? If not, add it. That might help diagnose your crashes.
Second, are your daemons providing services (opening pipes and waiting for requests) or are they performing periodic cleanup? If they are periodic cleanup processes, you should use cron to launch them periodically rather than have them run in an infinite loop. Cron processes should be preferred over daemon processes. Similarly, if they are services that open ports and handle requests, have you considered making them work with inetd? Again, a single daemon (inetd) should be preferred to a bunch of daemon processes.
Third, saving a PID in a file is not very effective, as you've discovered. Perhaps a shared IPC mechanism, like a semaphore, would work better. I don't have any details here, though.
Fourth, sometimes I need stuff to run in the context of the website. For that I use a cron job that calls wget with a maintenance URL. You set a special cookie and include the cookie info on the wget command line. If the special cookie doesn't exist, return 403 rather than performing the maintenance process. The other benefit here is that logging in to the database and other environmental concerns are avoided, since the same code that serves normal web pages serves the maintenance process.
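The web-app side of that pattern might look roughly like this in Django (the cookie name, secret, and view are placeholders), driven by a crontab entry along the lines of wget -q -O /dev/null --header='Cookie: MAINT=secret' http://example.com/maintenance/:

# sketch of the maintenance view: refuse to run unless the special cookie matches
from django.http import HttpResponse, HttpResponseForbidden

MAINT_SECRET = 'change-me'  # placeholder; keep it out of source control in practice

def maintenance(request):
    if request.COOKIES.get('MAINT') != MAINT_SECRET:
        return HttpResponseForbidden()   # 403 rather than performing maintenance
    run_periodic_cleanup()               # placeholder for your actual maintenance work
    return HttpResponse('maintenance complete')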
Hope that gives you ideas. I think avoiding daemons if you can is the best place to start. If you can run your python within mod_wsgi that saves you having to support multiple "environments". Debugging a process that fails after running for days at a time is just brutal.
In a regular application (like on Windows), when objects/variables are created at the global level, they are available to the entire program for the whole time the program is running.
In a web application written in PHP, for instance, all variables/objects are destroyed at the end of the script, so everything has to be written to the database.
a) So what about python running under apache/modwsgi? How does that work in regards to the memory?
b) How do you create objects that persist between web page requests and how do you ensure there isn't threading issues in apache/modwsgi?
Go read the following from the official mod_wsgi documentation:
http://code.google.com/p/modwsgi/wiki/ProcessesAndThreading
It explains the various modes things can be run in and gives some general guidelines about data scope and sharing.
All Python globals are created when the module is imported. When the module is imported again within the same process, the existing module and its globals are reused.
Python web server setups like this typically do not use threading, but pre-forked processes, so there are no threading issues with Apache in that configuration.
The lifecycle of Python processes under Apache varies. Apache has settings for how many child processes are spawned, kept in reserve, and killed. This means you can use globals in a Python process for caching (an in-process cache), but the process may terminate after any request, so you cannot put any persistent data in globals. The process does not necessarily terminate after each request, though, and in this regard Python is much more efficient than PHP (the source code is not parsed on every request; during development, however, you need to run the server in reload mode for it to pick up source code changes).
Since globals are per-process and there can be N processes, the processes share "web server global" state using mechanisms like memcached.
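For example, with the python-memcached client (a rough sketch; the key name and loader function are placeholders):

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def get_site_config():
    cfg = mc.get('site_config')               # shared across all Apache child processes
    if cfg is None:
        cfg = load_config_from_db()           # placeholder for your own loader
        mc.set('site_config', cfg, time=300)  # cache for 5 minutes
    return cfg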
Usually Python globals only contain:
Settings set during process initialization
Cached data (session/user-neutral)
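To illustrate the in-process caching point, a module-level global used as a per-process cache might look like this (a sketch; the lock only matters if your mod_wsgi configuration runs multiple threads per process):

import threading

_cache = {}
_cache_lock = threading.Lock()

def get_cached(key, compute):
    # compute() is called at most once per key per process; each Apache
    # child process has its own independent copy of _cache
    with _cache_lock:
        if key not in _cache:
            _cache[key] = compute(key)
        return _cache[key]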