I have a Python web application (using WSGI) deployed on OpenShift. The application is quite memory-greedy. What I have noticed is that there are several instances of the Apache httpd service running at all times. That means the memory usage of my gear is multiplied by the number of these processes, and the application crashes pretty often.
I don't have lots of traffic yet, so there is no need to have multiple httpd running.
Is there any way to configure Python cartridge to limit it to a single httpd process?
If you are using the OpenShift Python cartridge with its default setup, only two of those processes should actually have copies of your application running in them. The other httpd processes are the parent monitor process and the Apache child worker processes, which proxy requests through to the processes that are actually running your web application.
If you need control over that and want to reduce it down to a single process, then you would need to follow:
http://blog.dscpl.com.au/2015/01/using-alternative-wsgi-servers-with.html
to override the standard setup and use mod_wsgi-express instead. This will default to using a single process for your application and lets you control both the number of processes and the number of threads used for the application processes.
If you are seeing a lot of memory use, then it could just be your application code, or there is an outside chance you are seeing memory issues due to use of an older mod_wsgi, as there are some odd corner cases which can cause extra memory usage because of how Apache works. If you use mod_wsgi-express, it will use the latest version and avoid those problems.
So try mod_wsgi-express, and if you still have memory issues, I suggest you get on the mod_wsgi mailing list to get help debugging it.
I am aware of the memory limitations of the Heroku platform, and I know that it is far more scalable to separate an app into web and worker dynos. However, I still would like to run asynchronous tasks alongside the web process for testing purposes. Dynos are costly and I would like to prototype on the free instance that Heroku provides.
Are there any issues with spawning a new job as a process or subprocess in the same dyno as a web process?
On the newer Cedar stack, there are no issues with spawning multiple processes. Each dyno is a virtual machine and has no particular limitations except in memory and CPU usage (about 512 MB of memory, I think, and 1 CPU core). Following the newer installation instructions for some stacks such as Python will result in a configuration with multiple (web server) processes out of the box.
Software installed on web dynos may vary depending on what buildpack you are using; if your subprocesses need special software then you may have to either bundle it with your application or (better) roll your own buildpack.
At this point I would normally remind you that running asynchronous tasks on worker dynos instead of web dynos, with a proper task queue system, is strongly encouraged, but it sounds like you know that already. Do keep in mind that accounts with only one web dyno (typically this means, "free" accounts) will have that dyno spun down after an hour or so of not receiving any web requests, and that any background processes running on the dyno at that time will necessarily be killed. Accounts with multiple web dynos are not subject to this restriction.
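For prototyping, a minimal sketch of kicking off a background job from the web process on the same dyno might look like the following. The file name worker.py and the function name are illustrative, not taken from your setup:

```python
import atexit
import subprocess

def start_background_job():
    # Launch the job as a separate OS process on the same dyno. It shares
    # the dyno's ~512 MB of memory with the web process and will be killed
    # when a free dyno idles out or the dyno restarts.
    proc = subprocess.Popen(["python", "worker.py"])
    # Make sure the child does not outlive the web process on shutdown.
    atexit.register(proc.terminate)
    return proc
```

You would call this once from your web process (for example at startup) rather than per request, since each call spawns another process competing for the same memory budget.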
I have a bunch of Python projects with untrusted WSGI apps inside them. I need to run them simultaneously and safely. So I need restrictions on directory access and Python module usage, as well as limits on CPU and memory.
I consider two approaches:
Import the WSGI object from a given file via the imp module and run it with pysandbox. At the moment I get SandboxError: Read only object when doing:
self.config = SandboxConfig('stdout')
self.sandbox = Sandbox(self.config)
self.s = imp.get_suffixes()
# pathname should be the file's full path; self.s[2] is assumed to be the ('.py', 'r', PY_SOURCE) suffix
wsgi_obj = imp.load_module("run", open(path + "/run.py", "r"), path + "/run.py", self.s[2]).app
…
return self.sandbox.call(wsgi_obj, environ, start_response)
Modify the Python interpreter to exclude potentially risky modules, run the apps in parallel processes, and communicate via ZMQ/Unix sockets. I don't even know where to start here.
What could you recommend?
I would run your applications with gunicorn, with a separate process and configuration for each app, and with user-level permissions (each untrusted app running as a different user). Each gunicorn instance would serve on localhost on a user-range port, and nginx or another web server could proxy to them to route and serve them to the web.
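As a rough sketch of that per-app setup (the port, user name, and file name here are made up for illustration), each app could get its own gunicorn configuration file, which is itself just Python:

```python
# gunicorn.conf.py for one untrusted app (values are illustrative)
bind = "127.0.0.1:8001"     # listen on localhost only, on a user-range port
workers = 2                 # one gunicorn instance per app
user = "untrusted_app1"     # drop worker processes to this app's dedicated unprivileged user
group = "untrusted_app1"
```

You would then start each instance with something like `gunicorn -c gunicorn.conf.py app:application` and point nginx at the corresponding localhost port.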
Heroku takes this a step further and sandboxes each gunicorn instance (or unicorn or apache or arbitrary other server) in a virtual machine. This is probably the most secure possible way to do things, and definitely the best option for reliably limiting CPU and memory usage, but you may not need to go that far depending on your requirements.
One of the advantages of this kind of approach is that each application can run on a different version of Python if appropriate; with the virtual machine sandbox they can even run on different operating systems entirely.
Edit: To limit memory usage without using a VM sandbox approach, see this question. To limit CPU usage, tweak the gunicorn settings -- spin up one gevent-style worker per core an application is allowed to use.
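One common way to cap memory per worker without a VM (a hedged sketch of the general technique, not necessarily what the linked question proposes) is to set an address-space limit in gunicorn's post_fork hook:

```python
# In gunicorn.conf.py: cap each worker's address space after it is forked.
import resource

def post_fork(server, worker):
    limit = 256 * 1024 * 1024          # 256 MB; adjust per app
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
```

With this in place, an app that tries to allocate past the limit gets a MemoryError in its own worker instead of starving the other apps on the box.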
Edit again: One completely different approach would be to use PyPy's sandboxing mechanism, which should be much more secure than CPython plus a sandboxing module. However, I would still prefer the gunicorn or gunicorn-plus-virtual-machine approach.
I've been able to deploy a test application using Pyramid with pserve and running pceleryd (I just send an email without blocking while it is sent).
But there's one point that I don't understand: I want to run my application with mod_wsgi, and I don't know whether I can do that without having to run pceleryd from a shell, or whether I can do something in the virtual host configuration instead.
Is it possible? How?
There are technically ways you could use Apache/mod_wsgi to manage a process distinct from that handling web requests, but the pain point is that Celery will want to fork off further worker processes. Forking further processes from a process managed by Apache can cause problems at times and so is not recommended.
You are thus better off starting the Celery process separately. One option is to use supervisord to start it up and manage it.
I think in the past Python scripts would run off CGI, which would create a new thread for each process.
I am a newbie so I'm not really sure. What options do we have?
Is the web server pipeline that Python works under any more or less efficient than, say, PHP's?
You can still use CGI if you want, but the normal approach these days is using WSGI on the Python side, e.g. through mod_wsgi on Apache or via bridges to FastCGI on other web servers. At least with mod_wsgi, I know of no inefficiencies with this approach.
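For reference, a WSGI application is just a callable with a fixed signature; a minimal sketch that mod_wsgi (or any WSGI bridge) can serve looks like this:

```python
def application(environ, start_response):
    # environ is a dict of CGI-style request variables;
    # start_response sends the status line and response headers.
    body = b"Hello from WSGI"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

Unlike CGI, the process hosting this callable stays alive between requests, which is where most of the efficiency difference comes from.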
BTW, your description of CGI ("create a new thread for each process") is inaccurate: what it actually does is create a new process to service each request (and that process typically needs to open a database connection, import all needed modules, and so on, which is what may make it slow even on platforms where forking a process, per se, is pretty fast, such as all Unix variants).
I would suggest Django: http://www.djangoproject.com. It is very convenient to use and has everything you need for making web services. The most efficient way to use it is to run it via Apache's mod_wsgi, and make Apache itself serve the static files.
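A hedged sketch of the glue module mod_wsgi would point at for a Django project (mysite is a placeholder project name, not anything from your setup):

```python
# wsgi.py -- entry point that mod_wsgi loads for the Django project
import os

from django.core.wsgi import get_wsgi_application

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")

# mod_wsgi looks for a callable named "application" in this file.
application = get_wsgi_application()
```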
I suggest CherryPy (http://www.cherrypy.org/). It is very convenient to use, has everything you need for making web services, but is still quite simple (no mega-framework). The most efficient way to use it is to run it as a self-contained server on localhost, put it behind Apache via a Proxy statement, and make Apache itself serve the static files.
This generally has better performance than solutions such as CGI and mod-python, as the Python process running the web service runs separate from the main web server, so it can cache stuff and easily re-use resources (like DB handles).
Also, you can then tweak the number of worker threads for Apache and your web application separately, resulting in better scalability.
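A minimal sketch of that setup (the class name and port are illustrative): run CherryPy bound to localhost only, and let Apache proxy public traffic to it while serving static files itself.

```python
import cherrypy

class Root(object):
    @cherrypy.expose
    def index(self):
        return "Hello from CherryPy"

if __name__ == "__main__":
    # Bind to localhost only; Apache proxies requests to this port.
    cherrypy.config.update({
        "server.socket_host": "127.0.0.1",
        "server.socket_port": 8080,
    })
    cherrypy.quickstart(Root())
```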
I have written a Django app that makes use of Python threading to create a web spider; the spider operates as a series of threads that check links.
When I run this app using the built-in Django test server, the app runs fine and the threads seem to start and stop on time.
However, when running the app on Apache it seems the threads aren't kicking off and running (after about 80 seconds there should be a queued database update, and these changes aren't occurring).
Does anyone have an idea what I'm missing here?
-- Edit: My question is, how does Apache handle threaded applications, i.e. is there a limit on how many threads can be run from a single app?
Any help would be appreciated!
Most likely, you are overlooking the creation of new processes. Apache does not run in a single process, but forks new processes to handle requests every now and then (depending on a dozen or so configuration parameters). If you run Django in each of these processes, they will share no memory, and the results produced in one worker won't be visible to any of the others. In addition, an Apache process might terminate (when idle, or after a certain time), discarding your in-memory results.
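To make the failure mode concrete, here is a hedged illustration (the names, including the LinkCheck model, are made up): any module-level state your spider threads build up lives only in the worker process that created it.

```python
# Illustrative only: this dict exists once per Apache worker process.
# Results collected by spider threads in one worker are invisible to the
# other workers and are lost when Apache recycles that process.
RESULTS = {}

def record_result(url, status):
    RESULTS[url] = status          # visible to this process only
    # To make the update durable and visible to every process, write it
    # to the database (or hand it to an external worker) instead, e.g.:
    # LinkCheck.objects.create(url=url, status=status)
```

This is also why long-running background work like a spider is usually moved out of the Apache-managed processes entirely, into a separate worker process or task queue.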