I am working on an application that fetches multiple REST APIs and, in turn, exposes the data from those fetches through a REST API of its own.
The APIs I need to fetch are slow and do not share the same structure. All fetches need to recur daily.
I am thinking of using:
-Django REST Framework to expose the API, with its parsers and serializers to process the data received, and PostgreSQL to store the aggregated data to be exposed.
-Celery to launch workers and/or child processes in parallel.
-Celery beat to fetch regularly and keep the data up to date.
-RabbitMQ as a message broker for Celery.
Can I use Celery to make the fetch calls to the APIs, or do I need to write a special fetch script?
And how do I get the result back from Celery to Django REST Framework to store it? Do I need another queue and Celery worker for the response?
Is it better to have 1 worker with multiple child processes, or multiple workers with 1 child process each?
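A minimal sketch of the flow described above, assuming the task lives in tasks.py, the aggregated data goes into a hypothetical Django model AggregatedRecord, and a local RabbitMQ broker. The worker writes straight to PostgreSQL through the ORM, so nothing has to be sent back to the Django process:

import requests
from celery import Celery

from myapp.models import AggregatedRecord  # hypothetical Django model

celery_app = Celery("myproject", broker="amqp://localhost//")

@celery_app.task(bind=True, max_retries=3)
def fetch_source(self, source_url):
    try:
        response = requests.get(source_url, timeout=30)
        response.raise_for_status()
    except requests.RequestException as exc:
        # slow or flaky upstream API: retry a minute later
        raise self.retry(exc=exc, countdown=60)
    # store the normalized payload directly in PostgreSQL via the ORM
    AggregatedRecord.objects.update_or_create(
        source=source_url, defaults={"payload": response.json()},
    )

# Celery beat re-queues the fetch once a day.
celery_app.conf.beat_schedule = {
    "daily-fetch-example": {
        "task": "tasks.fetch_source",
        "schedule": 60 * 60 * 24,
        "args": ("https://api.example.com/v1/data",),
    },
}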
I have an API that does the operations below. I am using Python, the Django framework, and Gunicorn/Nginx for deployment.
The API is deployed on AWS Lightsail. A request comes in every 2 seconds.
1. Receives data from the client.
2. Creates a record in the local SQLite database and sends the response.
3. Runs a task asynchronously in a thread. The entire task takes about 1 second on average.
a. Gets the updated record from step 2 by ID. (0 sec)
b. Posts the data to another API using requests. (~0.5 sec)
c. Updates the database (AWS RDS). (~0.5 sec)
Setup:
I have a ThreadPoolExecutor with max_workers=12.
Gunicorn has one worker, since the instance has 1 vCPU. I don't use Gunicorn workers with threads, since I have to perform some other tasks within the API.
The reason asyncio is not used is that database updates in Django are not supported with it, so I kept the POST to the other API in the threading approach instead.
Each request is unique; no request is repeated.
Even if I keep max_workers at 1 in the thread pool, CPU usage bursts past the 10% mark on the $5 AWS instance, even though the API only receives a request every 2 seconds.
I am not able to profile where this CPU usage is coming from.
There are a couple of reasons I can think of:
The Gunicorn master is constantly checking on the worker.
The OS is managing the threads and doing context switching.
Any pointers on how to profile this would be helpful.
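One way to narrow this down is to profile the work done inside the thread pool itself. A minimal sketch with cProfile, where process_record is a placeholder for step 3:

import cProfile
import io
import pstats
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

def process_record(record_id):
    """Placeholder for step 3: read the record, POST to the other API, update RDS."""
    pass

def profiled(func, *args, **kwargs):
    # Wrap each submitted task in a profiler and dump the top call sites.
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        return func(*args, **kwargs)
    finally:
        profiler.disable()
        out = io.StringIO()
        pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(15)
        print(out.getvalue())  # or write to a log file

# In the view, instead of executor.submit(process_record, record_id):
# executor.submit(profiled, process_record, record_id)

For whole-process sampling without code changes, a sampling profiler such as py-spy attached to the Gunicorn worker PID is another option.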
This is a bit of a high-level question, as I may just have a poorly designed Flask app, but I have currently built an app where, after a user submits a form, a Celery worker formats the input into a SQL query and executes that query.
Now I have set up a Redis backend to keep the results, so when the task is done I can display the results by getting them from Redis. However, I don't want these results to persist in Redis forever, but I do want them to persist for an entire user session in the app, so the user can view the results of previous queries while in the session.
How would I go about doing this? My instinct is to add the individual task IDs to a session and, when the user quits the app, use forget() in Celery to remove all the task IDs.
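A minimal sketch of that instinct, assuming a Redis result backend; the helper names remember_task and forget_session_tasks are placeholders:

from celery import Celery
from celery.result import AsyncResult
from flask import Flask, session

app = Flask(__name__)
app.secret_key = "change-me"  # required for Flask sessions
celery_app = Celery(__name__, backend="redis://localhost:6379/0")

def remember_task(task_id):
    # Keep the task id in the user's session so earlier results
    # can still be looked up while the session is alive.
    session.setdefault("task_ids", []).append(task_id)
    session.modified = True  # we mutated a list stored in the session

def forget_session_tasks():
    # Call this when the user leaves the app: drop each stored result.
    for task_id in session.pop("task_ids", []):
        AsyncResult(task_id, app=celery_app).forget()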
You can use the result_expires configuration setting and set it to ~5 minutes. For more information: http://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-result_expires
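For reference, a minimal sketch of that setting, assuming the Celery app object is created in the usual way; result_expires is given in seconds:

from celery import Celery

celery_app = Celery(
    "app",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)
celery_app.conf.result_expires = 300  # stored results are cleaned up after ~5 minutes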
I have a Flask app which uses SocketIO to communicate with users currently online. I keep track of them by mapping the user ID to a session ID, which I can then use to communicate with them:
online_users = {'uid...':'sessionid...'}
I declare this in my run.py file, where the app is launched, and then I import it when I need it, as such:
from app import online_users
I'm using Celery with RabbitMQ for task deployment, and I need to use this dict from within the tasks. So I import it as above, but when I use it, it is empty even when I know it is populated. I realize, after reading this, that it is because each task runs asynchronously in a separate process, which gets its own empty copy of the dict, so my best bet is to use some sort of database or cache.
I'd rather not run an additional service, and I only need to read from the dict (I won't be writing to it from the tasks). Is a cache/database my only option here?
That depends on what you have in the dict. If it's something you can serialize to a string, you can serialize it to JSON and pass it as an argument to that task. If it's an object you cannot serialize, then yes, you need to use a cache/database.
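A minimal sketch of that suggestion, with hypothetical task and app names; the dict is dumped to JSON in the web process and handed to the task as a plain argument:

import json
from celery import Celery

celery_app = Celery("app", broker="amqp://localhost//")

@celery_app.task
def notify_users(online_users_json):
    # Rebuild the read-only mapping inside the worker process.
    online_users = json.loads(online_users_json)
    for uid, session_id in online_users.items():
        pass  # use the mapping here

# In the web process, where the dict is actually populated:
# notify_users.delay(json.dumps(online_users))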
I came across this discussion which seems to be a solution for exactly what I'm trying to do.
Communication through a message queue is now implemented in the package python-socketio, through the use of Kombu, which provides a common API for working with several message queues, including Redis and RabbitMQ.
Supposedly an official release will be out soon, but as of now it can be done using an additional package.
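A minimal sketch of what emitting from a Celery task could look like with that approach, assuming Flask-SocketIO is used on the web side and both processes point at the same RabbitMQ broker; the event name, URLs, and task name are placeholders:

from celery import Celery
from flask_socketio import SocketIO

celery_app = Celery("app", broker="amqp://localhost//")

# No Flask app is passed here: this instance only publishes to the queue.
# The web process creates its SocketIO with the same message_queue URL.
socketio = SocketIO(message_queue="amqp://localhost//")

@celery_app.task
def push_update(session_id, payload):
    # The message is routed through RabbitMQ to the web process,
    # which delivers it to the client with that session id.
    socketio.emit("update", payload, room=session_id)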
I am currently trying to develop something using Google App Engine. I am using Python as my runtime and need some advice on setting up the following.
I am running a web server that provides JSON data to clients. The data comes from an external service from which I have to pull it.
What I need to be able to do is run a background process that checks memcache to see if there are any required IDs; if there is an ID, I need to fetch some data for that ID from the external source and place the data in memcache.
If there are multiple IDs (say 30 or more), I need to be able to make all of those requests as quickly and efficiently as possible.
I am new to Python development and App Engine, so any advice you could give would be great.
Thanks.
You can use "backends" or "task queues" to run processes in the background. Tasks have a 10-minute run time limit, while backends have no run time limit. There is also a cron mechanism that can trigger requests at regular intervals.
You can fetch the data from external servers with the "URLFetch" service.
Note that using memcache as the communication mechanism between the front end and the back end is unreliable: the contents of memcache may be partially or fully erased at any time (and it does happen from time to time).
Also note that you can't query memcache if you don't know the exact keys ahead of time. It's probably better to use the task queue to queue up requests instead of using memcache, or to use the datastore as a storage mechanism.
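A minimal sketch of the task-queue approach on the (Python 2) App Engine standard runtime; handler paths, the external URL, and key names are assumptions:

from google.appengine.api import memcache, taskqueue, urlfetch
import webapp2

EXTERNAL_URL = "https://example.com/data/%s"  # placeholder external service

class EnqueueHandler(webapp2.RequestHandler):
    def get(self):
        # Instead of leaving IDs in memcache for a poller to find,
        # push one task per ID; App Engine runs them in the background.
        for item_id in self.request.get_all("id"):
            taskqueue.add(url="/tasks/fetch", params={"id": item_id})

class FetchWorker(webapp2.RequestHandler):
    def post(self):
        item_id = self.request.get("id")
        result = urlfetch.fetch(EXTERNAL_URL % item_id, deadline=30)
        if result.status_code == 200:
            memcache.set("data:%s" % item_id, result.content)

app = webapp2.WSGIApplication([
    ("/enqueue", EnqueueHandler),
    ("/tasks/fetch", FetchWorker),
])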
The documentation under 'Deleting Tasks' says
You can delete an individual task or a list of tasks using the REST method Tasks.delete.
but doesn't indicate how to do this. The discovery doc for the method shows only a single string parameter, task.
Is there something that I'm missing, or is the only solution to send multiple delete requests?