What is the best way to share the SQLAlchemy session between my Pyramid application and Celery tasks while only instantiating the database engine once? I looked at this answer here; however, I don't want to have to create another engine (that also happens to be global) since this is not very DRY. Also, during Pyramid application startup the application .ini settings are passed into the main function, so I would like to be able to configure the engine from this method but also have it available to all Celery tasks. Perhaps I am going about things the wrong way when it comes to Celery integration with Pyramid? Thanks for your help!
A major motivation behind using a message broker (celery) in the first place is that your web app and workers do not operate in the same process. Because of this, I'm going to suggest that you back up a bit and think of your system as separate processes that are not sharing the same database connection.
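One way to stay DRY under that model is to share the configuration rather than the connection: the web app and every worker read the same .ini settings and each process lazily builds its own connection. The sketch below is only an illustration of that idea. It uses the standard library's sqlite3 as a stand-in for a SQLAlchemy engine, and the `db.path` key and the helper names are hypothetical.

```python
import configparser
import os
import sqlite3

# Cache keyed by process id, so a forked worker never reuses its parent's connection.
_connections = {}


def load_settings(ini_text):
    """Parse the shared .ini text and return the [app:main] section as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return dict(parser["app:main"])


def get_connection(settings):
    """Return this process's connection, creating it on first use."""
    pid = os.getpid()
    if pid not in _connections:
        # "db.path" is a hypothetical key; with SQLAlchemy this is where the
        # engine would be created from the settings instead.
        _connections[pid] = sqlite3.connect(settings["db.path"])
    return _connections[pid]


EXAMPLE_INI = """
[app:main]
db.path = :memory:
"""

if __name__ == "__main__":
    settings = load_settings(EXAMPLE_INI)
    conn = get_connection(settings)
    print(conn.execute("SELECT 1").fetchone())
```

The web process would call `get_connection` from its startup function and a worker would call it from its own initialisation hook; both end up with the same configuration but separate connections.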
Related
So, I am currently working on a Django project hosted on PythonAnywhere that includes a notifications feature and also receives external data from sensors through AWS. I have been thinking about the best practice for implementing this.
I currently have a simple implementation: a view that checks all notifications and performs whatever actions are required, with an always-on task (which simply means a script that is running independently) sending a REST request to the server every minute.
Server side:
views.py:
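    from django.http import HttpResponse

    def checkNotifications(request):
        # flat=True yields plain values instead of 1-tuples, so each value
        # can be passed straight to filter() below.
        notificationsObject = notifications.objects.order_by('thing').values_list('thing', flat=True).distinct()
        thingsList = list(notificationsObject)
        for thing in thingsList:
            valuesDic = returnAllField(thing)
            thingNotifications = notifications.objects.filter(thing=thing)
            # Do stuff for each notification
        return HttpResponse(status=204)  # a view must return a response
urls:
    path('notifications/', views.checkNotifications, name="checkNotification")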
and the client just sends a GET request to my URL/notifications/, which works.
Now, while researching I saw some other options such as the ones discussed here with django background tasks and/or celery:
How to initialize repeating tasks using Django Background Tasks?
Celery task best practices in Django/Python
as well as some other options.
My question is: Is there a benefit to moving from my first implementation to this one? The only direct benefit I can see is avoiding abuse from another service hitting my URL to check notifications too often, but I already require authentication to prevent that. And is there a certain "best practice" here? Since I am running this repeating check so often, it feels like there should be a more proper/cleaner solution. For one, I am not sure whether running a repeating task is the best option on PythonAnywhere.
(https://help.pythonanywhere.com/pages/AsyncInWebApps/ suggests using always-on tasks, but it also mentions django background tasks)
Thank you
To use Django background tasks on PythonAnywhere you need to run it using an always-on task, so it is not an alternative, but just the other use of always-on tasks.
You can also access your Django code in your always-on task directly with some kind of long-running management command, so you do not need to hit your web app with a special request.
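As a rough sketch of what such a long-running loop could look like, assuming the check itself is wrapped in a function: `run_periodically` and `check_notifications` are hypothetical names, and in a Django project the loop would typically be started from a management command so that the models can be used directly.

```python
import time


def run_periodically(job, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call job() repeatedly, sleeping interval_seconds between runs.

    max_runs=None loops forever, which is what an always-on task wants;
    a finite value is handy for testing.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as exc:
            # Keep the loop alive if a single run fails.
            print(f"job failed: {exc!r}")
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


def check_notifications():
    # Hypothetical stand-in: the real version would query the Django models
    # directly instead of going through an HTTP request.
    print("checking notifications")


if __name__ == "__main__":
    run_periodically(check_notifications, interval_seconds=60, max_runs=2,
                     sleep=lambda seconds: None)
```

With this shape the web app no longer needs a special URL for the check, so there is nothing for another service to hit.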
I know the GIL prevents Python from running its threads across cores. If so, why is Python used in web servers, and how do companies like YouTube and Instagram handle it?
PS: I know alternatives like multiprocessing can solve it. But it would be great if anyone could describe a scenario they have handled themselves.
Python is used for server-side handling in web servers, but not (usually) as the web server itself.
In a normal setup, Apache or another web server sits in front and manages many server-side processes (Python usually plugs in through WSGI). Note that Apache usually serves "static" files directly. So we have one Apache server, many parallel Apache processes (to handle connections and basic HTTP), and many Python processes, each of which handles one connection at a time.
These processes are independent of each other (they just use the same resources), so you can program the server-side part easily, without worrying about deadlocks. It is mostly a trade-off between the raw performance of the code and being able to produce code easily and quickly without huge problems. But web servers running Python usually scale very well (even on large sites), and servers are cheaper than programmers.
Note: security is also increased by having just one request in a process.
The GIL exists in CPython (the reference interpreter, written in C, and the most widely used). Other implementations such as Jython or IronPython don't have this problem, because they don't have a GIL.
Even with CPython you can still get parallelism: do the heavy work in C and call it from your Python code, as NumPy and similar libraries do, since C extensions can release the GIL while they run.
Another thing: even though your page is built with Flask or Django, when you set it up on a production server you have Apache, Nginx, etc. in front of it, with a real load balancer that can serve the page to many people at the same time.
Take it from the Flask docs (link):
Flask’s built-in server is not suitable for production as it doesn’t scale well and by default serves only one request at a time.
[...]
If you want to deploy your Flask application to a WSGI server not listed here, look up the server documentation about how to use a WSGI app with it. Just remember that your Flask application object is the actual WSGI application.
Although a bit late, I will try to give a generic and useful answer.
@Giacomo Catenazzi's answer is a good one, but some parts of it are factually incorrect.
API requests (or other forms of web request) are served from an already running process. The creation of this 'already running' process is handled by a server such as gunicorn, which on startup creates a specified number of processes. These run the code in your web application and wait continuously to serve any incoming request.
Needless to say, each of these processes is limited by the GIL to running only one thread at a time. But over its lifetime one process handles more than one (normally many) request. Here it helps to understand the flow of a request.
We will take flask as an example, but this applies to most web frameworks. When a request comes in from Nginx, it is handed over to gunicorn, which interacts with your web application via wsgi. When the request reaches the framework, an app context is created and some variables are pushed onto it. Then it follows the normal route that most people are familiar with: routing, db calls, response creation and so on. The response is then handed back to gunicorn, again via wsgi. As the response is handed over, the app context is torn down. So it's the app context, not the process, that is created on every new request.
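To make the flow a little more concrete, here is a minimal sketch of the wsgi side, using a bare WSGI callable instead of flask. The `call_app` helper is hypothetical and only imitates what a server such as gunicorn does for each request; the process id in the response shows that consecutive requests land in the same, already running process.

```python
import os


def app(environ, start_response):
    """A minimal WSGI application: the callable the server invokes per request."""
    body = f"handled {environ['PATH_INFO']} in process {os.getpid()}\n".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(path):
    """Drive the app roughly the way a WSGI server would; return (status, body)."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    # Two requests, one process: only the per-request state is new each time.
    for path in ("/a", "/b"):
        print(call_app(path))
```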
Also, I have talked only about the sync worker in gunicorn, but it also offers async workers, which can handle multiple requests concurrently through coroutines. But that's a separate topic.
So answering your question:
Nginx (Capable of handling multiple requests at a time)
Gunicorn creates a pool of n processes at startup and also manages the pool, in the sense that if a process exits or gets stuck, it kills and recreates it and adds the replacement to the pool.
Each process handling 1 request at a time.
Read more about gunicorn's design and how it can be used to help you achieve your requirements. This is a good thread for understanding gunicorn with flask. And this is a great resource for understanding the flask app context.
What would be the best practice in this scenario?
I have an App Engine Python app, with multiple cron jobs. Instantiated by user requests and cron jobs, push notifications might be sent. This could easily scale up to a total of +- 100 pushes per minute.
Setting up and tearing down a connection to APNs for every batch is not what I want, and Apple advises against it too. So I would like to keep the connection alive, even when a user request or a cron job finishes. Possibly with a timeout (2 minutes without pushes, then close the connection).
Reading the GAE documentation, I couldn't figure out if there even is such a thing available. Also, I might need this to be available in different apps and/or modules.
You can put the messages in a pull task queue and have a backend instance (or a cron job) process the tasks.
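A rough sketch of the consuming side of that idea, under loud assumptions: the standard library's `queue.Queue` stands in for the pull queue (this is not the App Engine task queue API), and `send_batch` is a hypothetical function that would write a batch to one long-lived APNs connection.

```python
import queue


def drain_batch(q, max_items):
    """Pull up to max_items pending messages off the queue without blocking."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


def process_pending(q, send_batch, max_items=100):
    """Send everything currently queued, one batch per send_batch call."""
    sent = 0
    while True:
        batch = drain_batch(q, max_items)
        if not batch:
            return sent
        send_batch(batch)
        sent += len(batch)


if __name__ == "__main__":
    pending = queue.Queue()
    for n in range(5):
        pending.put(f"push-{n}")
    process_pending(pending, send_batch=print, max_items=2)
```

Because only the consumer talks to APNs, request handlers and cron jobs just enqueue messages and never open a connection themselves.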
First, please take a look at Google Cloud Messaging. It's cool, and it is easier to use than the APNs protocol.
If you cannot use GCM (because of code refactoring, etc.), I think App Engine Managed VMs are suitable for your situation. A Managed VM sits somewhere between App Engine and Compute Engine.
You can use the datastore (optionally fronted by memcache for performance) to persist all the necessary APN (or any other) connection/protocol status/context info, so that multiple related requests can share the same connection as if your app were a long-lived one.
Maybe not trivial, but definitely feasible.
Some requests may need to be postponed temporarily, depending on the shared connection status/context, that's true.
I've been trying to make a decision about my student project before going further. The main idea is to get disk usage data, active Linux user data, and so on from multiple internal servers and publish them with Django.
Before I came to RabbitMQ I was thinking about developing a client application for each Linux server and getting this data through a socket. But I want to keep the student project simple. Also, I don't know how difficult it is to make a socket connection via Django.
So, I thought I could solve my problem with RabbitMQ without socket programming. Basically, I send a message to rabbit queue. Then get whatever I want from the consumer server.
On the Django side, the client will select one of the internal servers and click the "details" button. Then I want to show this information on web page.
I have already read almost all the documentation about rabbitmq, celery and pika. Sending messages to all internal servers (clients) and calculating the information I want is fine, but I can't figure out how to put this data on a web page with Django.
How would you figure out that problem if you were me?
Thank you.
I solved my problem on my own. The solution is a RabbitMQ RPC call. You can execute your Python code on a remote server and get the result back via RPC requests. Details can be found here.
http://www.rabbitmq.com/tutorials/tutorial-six-python.html
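The pattern in that tutorial boils down to two things: each request carries a reply queue and a correlation id, and the caller waits for the reply whose id matches. The sketch below imitates this inside a single process with the standard library, so `queue.Queue` and the thread are stand-ins for the broker and the remote server, and `disk_usage_stub` is a made-up handler.

```python
import queue
import threading
import uuid

request_queue = queue.Queue()  # stands in for the broker's request queue


def server_loop(handler):
    """Consume requests and publish each result to the caller's reply queue."""
    while True:
        message = request_queue.get()
        if message is None:  # sentinel used to stop the sketch's server
            break
        result = handler(message["body"])
        message["reply_to"].put(
            {"correlation_id": message["correlation_id"], "body": result}
        )


def rpc_call(body, timeout=5):
    """Send a request and block until the reply with a matching id arrives."""
    reply_queue = queue.Queue()  # per-call callback queue
    correlation_id = str(uuid.uuid4())
    request_queue.put(
        {"body": body, "reply_to": reply_queue, "correlation_id": correlation_id}
    )
    while True:
        reply = reply_queue.get(timeout=timeout)
        if reply["correlation_id"] == correlation_id:
            return reply["body"]


def disk_usage_stub(hostname):
    # Hypothetical handler; a real one would inspect the machine it runs on.
    return {"host": hostname, "disk_used_percent": 42}


if __name__ == "__main__":
    worker = threading.Thread(target=server_loop, args=(disk_usage_stub,), daemon=True)
    worker.start()
    print(rpc_call("server-1"))
    request_queue.put(None)
    worker.join()
```

A Django view could call something like `rpc_call` when the "details" button is clicked and pass the returned dictionary to the template.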
Thank you guys.
Looks like you have already done the hard work (celery, rabbit, etc.) but are missing the Django basics. Go through the polls tutorial and getting started with django, or the many other resources on the web, and it will be quite simple. Basically:
create the models (objects represented in db)
declare urls
setup views to pass the data from the model to the webpage template
create the templates (or do it with client side framework and create a JSON response)
EDIT: (after you clarified the question) Actually, I just hit the same problem too. The answer is to run another Python process in parallel with the Django process (in the same virtualenv). In this process you can set up a rabbit consumer (using pika, puka, kombu or whatever) and call specific Django functions/methods to do something with the information from rabbitmq. You can also just call celery tasks from there to be executed in the Django app context.
A Procfile, for example (just illustrating; you can run both processes in many other ways):
    web: python manage.py runserver
    worker: python listen_from_servers.py
Notice that you'll have to set the DJANGO_SETTINGS_MODULE environment variable to point at your settings module for the Django imports to work.
You need the following two programs running at all times:
The producer, which will populate the queue. This is the program that will collect the various messages and then post them on the queue.
The consumer, which will process messages from the queue. This consumer's job is to read the message and do something with it; so that it is processed and removed from the queue. The function that this consumer does is entirely up to you, but what you want to do in this scenario is write information from the message to a database model; the same database that is part of your django app.
As the producer pushes messages and the consumer removes them from the queue, your database will get updated.
On the django side, the process is simply to filter this database and display records for a particular machine. As such, django does not need to be aware of how the records are being populated in the database - all django is doing is fetching, filtering, sending to the template and rendering the views.
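As a small illustration of that split, with sqlite3 standing in for the database behind the Django models: the table layout, the message format and the function names below are all assumptions made for the sketch.

```python
import json
import sqlite3


def open_db(path=":memory:"):
    """Open the shared database and make sure the readings table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reading ("
        "host TEXT NOT NULL, metric TEXT NOT NULL, value REAL NOT NULL)"
    )
    return conn


def handle_message(conn, raw_message):
    """Consumer side: decode one queue message and store it as a row."""
    data = json.loads(raw_message)
    conn.execute(
        "INSERT INTO reading (host, metric, value) VALUES (?, ?, ?)",
        (data["host"], data["metric"], data["value"]),
    )
    conn.commit()


def readings_for(conn, host):
    """Web side: fetch the rows for one machine, ready to hand to a template."""
    rows = conn.execute(
        "SELECT metric, value FROM reading WHERE host = ? ORDER BY rowid", (host,)
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = open_db()
    handle_message(db, json.dumps({"host": "srv1", "metric": "disk", "value": 71.5}))
    print(readings_for(db, "srv1"))
```

The web side never touches the queue; it only reads what the consumer has already written.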
The question then becomes how best (well, actually, how most easily) to populate the database. You can do it the traditional way, using Python's well-documented DB-API and writing your own SQL statements; but since celery is so well integrated with django, you can use django's ORM to do this work for you as well.
I hope this gets you going in the right direction.
How can I limit the number of concurrent sessions allowed at the app level?
Basically I want a limit to how many concurrent requests to this url to keep the server from getting congested.
I guess some middleware hack?
Thanks.
Don't do this in django, but in Apache / nginx / whatever webserver you have in front of Django. They have specific modules exactly for such tasks.
A possible solution for Apache would be: mod_limitipconn2 - http://dominia.org/djao/limitipconn2.html
Django stores session information in the database by default. You could use a middleware which checks the number of rows (roughly speaking) in that table. Keep in mind that django won't purge expired sessions from the DB automatically.
On clearing the session table:
http://docs.djangoproject.com/en/dev/topics/http/sessions/#clearing-the-session-table
You're not asking about sessions, you're asking about requests. What you want is known as throttling. However, it would be quite difficult to do it inside the app, because Apache manages multiple processes and threads, so you'd need some external process to keep track of these in order to enable the throttling.
Basically, this sort of thing is best done within Apache itself, by something like mod_throttle.
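To illustrate why an in-app limit falls short, here is a sketch of what the "middleware hack" would look like as plain WSGI middleware; `ConcurrencyLimit` is a hypothetical name. The semaphore lives in one process, so with several worker processes each one would enforce its own separate limit, which is exactly the problem described above.

```python
import threading


class ConcurrencyLimit:
    """WSGI middleware rejecting requests beyond max_concurrent in this process."""

    def __init__(self, app, max_concurrent):
        self.app = app
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def __call__(self, environ, start_response):
        if not self._slots.acquire(blocking=False):
            body = b"Too many concurrent requests\n"
            headers = [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ]
            start_response("503 Service Unavailable", headers)
            return [body]
        try:
            # Materialise the body so the slot is held until the response is complete.
            return list(self.app(environ, start_response))
        finally:
            self._slots.release()
```

Wrapping a WSGI application as `ConcurrencyLimit(application, 10)` would cap that one process at ten requests in flight; the server-level modules mentioned in the answers apply the limit across all processes instead.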