How to cache individual Django REST API POSTs for bulk_create? - python

I have a Django REST API endpoint. It receives a JSON payload eg.
{ "data" : [0,1,2,3] }
This is decoded in a views.py function and generates a new database object like so (pseudocode):
newobj = MyObj(col0=0, col1=1, col2=2, col3=3)
newobj.save()
In tests, it is about 20x faster to build a list of 1000 new objects and then do a single bulk create:
MyObj.objects.bulk_create(newobjs, batch_size=1000)
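where newobjs is the list built up beforehand, e.g. (illustrative, reusing the column names from the pseudocode above; batch_of_payloads is a placeholder):
newobjs = [MyObj(col0=d[0], col1=d[1], col2=d[2], col3=d[3]) for d in batch_of_payloads]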
So, the question is: how do we save individual POSTs somewhere in Django, ready for a batch write once 1000 of them have accumulated?

You can cache it with Memcached or with Redis, for example.
But you will need to write some kind of service that checks how many new items are in the cache and, if there are more than e.g. 1000, inserts them.
So:
POSTs populate a cache.
A service pulls the new items from the cache and inserts them into the persistent database (a rough sketch of such a service is below).
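A rough sketch of the service half, written here as a Django management command that drains a Redis list. The command name, queue key, and model import are illustrative assumptions, not something given in the question:

# myapp/management/commands/flush_queue.py  (illustrative path)
import json

import redis
from django.core.management.base import BaseCommand

from myapp.models import MyObj  # assumed app/model from the question


class Command(BaseCommand):
    help = "Bulk-insert queued POST payloads once at least 1000 are waiting"

    def handle(self, *args, **options):
        r = redis.Redis(host="localhost", port=6379)
        while r.llen("pending") >= 1000:
            rows = [json.loads(r.lpop("pending")) for _ in range(1000)]
            MyObj.objects.bulk_create(
                [MyObj(col0=d[0], col1=d[1], col2=d[2], col3=d[3]) for d in rows],
                batch_size=1000,
            )

The views would RPUSH each decoded "data" array onto the "pending" list, and this command would be run from cron (or a loop) as the "service".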
Do you really need it?
What will happen if the data already exists? If the data is corrupted? How will the user know about this?

save individual POSTs somewhere in Django ready for batch writes when we have 1000 of them
You can:
use Django's cache framework,
maintain a CSV file using Python's csv module (a sketch follows this list), or
since you probably want to maintain the order of the posts, use the persist-queue package.
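For the CSV option, a minimal sketch using only the standard library (the file path and helper names are illustrative; the periodic batch write would read the file back and bulk_create from it):

import csv

CSV_PATH = "/tmp/pending_posts.csv"  # illustrative location

def append_post(payload):
    # append one decoded POST per line; arrival order is preserved
    with open(CSV_PATH, "a", newline="") as f:
        csv.writer(f).writerow(payload["data"])

def read_pending():
    with open(CSV_PATH, newline="") as f:
        return [[int(x) for x in row] for row in csv.reader(f)]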
But as Victor mentioned as well, why? Why are you so concerned about the speed of SQL inserts, which are pretty fast anyway?
Of course, bulk_create is much faster because it takes a single network call to your DB server and adds all the rows in a single SQL transaction, but it only makes sense to use it when you actually have a bunch of data to be added together. In the end, you must save the data somewhere, which is going to take some processing time one way or another.
Because there are many disadvantages to your approach:
you risk losing the data,
you will not be able to enforce UNIQUE or any other constraint on your table at POST time,
your users won't get instant feedback on creating a post, and
you cannot show or access the posts in a useful way while they are not yet stored in your primary DB.
EDIT
Use a fast cache like Redis to maintain a list of the entries. In your api_view you can call cache.get to get the current list, append the new object to it, and then call cache.set to write it back. After this, add a check: whenever len(list) >= 1000, call bulk_create. You might also want to consider using Elasticsearch for such an enormous amount of data.
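A rough sketch of that, using Django's cache framework backed by Redis; the cache key, model import and batch size are illustrative, and note that the get-append-set sequence is not atomic without a lock:

from django.core.cache import cache

from myapp.models import MyObj  # assumed app/model from the question

BATCH_SIZE = 1000

def queue_for_bulk_create(payload):
    entries = cache.get("pending_entries", [])
    entries.append(payload["data"])
    if len(entries) >= BATCH_SIZE:
        MyObj.objects.bulk_create(
            [MyObj(col0=d[0], col1=d[1], col2=d[2], col3=d[3]) for d in entries],
            batch_size=BATCH_SIZE,
        )
        entries = []
    cache.set("pending_entries", entries, timeout=None)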

Thanks for the above responses. This answer includes some of what was suggested, but is a superset, so here's a summary.
This is really about creating a FIFO. Memcached turns out to be unsuitable (after trying it) because only Redis has the list operations that enable this, explained nicely here.
Also note that Django's built-in cache framework does not support the Redis list API calls.
So we need a new docker-compose.yml entry to add redis:
  redis:
    image: redis
    ports:
      - 6379:6379/tcp
    networks:
      - app-network
Then in views.py we add (note the use of the Redis RPUSH command):
import json
import os

import redis
...
redis_host = os.environ['REDIS_HOST']
redis_port = 6379
redis_password = ""
r = redis.StrictRedis(host=redis_host, port=redis_port, password=redis_password,
                      decode_responses=True)
...
def write_post_to_redis(request):
    payload = json.loads(request.body)
    r.rpush("df", json.dumps(payload))
This pushes the received payload onto a Redis in-memory list. We now need to read (or pop) it and write it to the Postgres database, so we need a process that wakes up every n seconds and checks the list. For this we use django-background-tasks. First, install it with:
pipenv install django-background-tasks
And add it to INSTALLED_APPS in settings.py:
INSTALLED_APPS = [
    ...
    'background_task',
]
Then run a migrate to add the background task tables:
python manage.py migrate
Now in views.py, add:
from background_task import background
from background_task.models import CompletedTask
Now add the function that writes the cached data to the Postgres database. Note the decorator, which schedules it to run in the background (5 seconds after it is queued), and the use of Redis LPOP.
@background(schedule=5)
def write_cached_samples():
    ...
    payload = json.loads(r.lpop('df'))
    # now do your write of payload to postgres
    ...
    # delete the completed tasks or we'll have a big db leak
    CompletedTask.objects.all().delete()
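A fleshed-out version of that task is sketched below: it drains whatever is on the list and writes it with bulk_create, which was the original goal. The model and its fields are assumed from the question, and MyObj is assumed to be imported from the app's models:

@background(schedule=5)
def write_cached_samples():
    # drain everything currently queued on the Redis list
    rows = []
    while True:
        item = r.lpop('df')
        if item is None:
            break
        rows.append(json.loads(item))
    if rows:
        MyObj.objects.bulk_create(
            [MyObj(col0=d['data'][0], col1=d['data'][1],
                   col2=d['data'][2], col3=d['data'][3]) for d in rows],
            batch_size=1000,
        )
    # delete the completed tasks or we'll have a big db leak
    CompletedTask.objects.all().delete()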
In order to start the process up, add the following to the base of urls.py:
write_cached_samples(repeat=10, repeat_until=None)
Finally, because the background task needs a separate process, we duplicate the Django docker container in docker-compose.yml but replace the ASGI server run command with the background-process run command:
  django_bg:
    image: my_django
    command: >
      sh -c "python manage.py process_tasks"
    ...
In summary, we add two new docker containers: one for the Redis in-memory cache, and one to run the Django background tasks. We use the Redis list commands RPUSH and LPOP to create a FIFO, with the API receiver pushing and a background task popping.
There was a small issue where nginx was hooking up to the wrong Django container, rectified by stopping and restarting the background container; it appears to be an issue where Docker network routing initialises incorrectly.
Next I am replacing the Django HTTP API endpoint with a Go one to see how much of a speed up we get, as the Daphne ASGI server is hitting max CPU at only 100 requests per sec.

Related

Concurrent requests Flask python

I have some Flask application. It works with a database; I'm using SQLAlchemy for this. So I have one question:
Flask handles requests one by one. So, for example, I have two users, A and B, who are modifying the same record in a database table concurrently.
How can I tell user B that user A has changed this record? There must be some message to user B.
In the development server version, when you do app.run(), you get a single synchronous process, which means at most 1 request is processed at a time. So you cannot accept multiple users at the same time.
However, gunicorn is a solid, easy-to-use WSGI server that will let you spawn multiple workers (separate processes), and even comes with asynchronous workers when you need to deploy your application.
However, to answer your question: since they run in separate threads, the data that exists in the database at the specific time the query is run in that thread will be used/returned.
I hope this answers your query.
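None of the above actually tells user B that user A changed the record. A common way to detect that conflict is optimistic locking via SQLAlchemy's version counter; a minimal sketch assuming SQLAlchemy 1.4+, with an illustrative model, engine URL and function name:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base
from sqlalchemy.orm.exc import StaleDataError

Base = declarative_base()
engine = create_engine("sqlite:///app.db")  # illustrative URL

class Record(Base):
    __tablename__ = "records"
    id = Column(Integer, primary_key=True)
    value = Column(String)
    version_id = Column(Integer, nullable=False)
    # every UPDATE checks and bumps version_id
    __mapper_args__ = {"version_id_col": version_id}

def save_change(record_id, new_value):
    with Session(engine) as session:
        rec = session.get(Record, record_id)
        rec.value = new_value
        try:
            session.commit()
        except StaleDataError:
            # another session updated the row first -- this is the signal for user B
            session.rollback()
            return "record was changed by someone else"
        return "saved"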

Reading/writing Python dictionary to Redis cache on AWS crashes server

I am developing a computation-intensive Django application, using Celery to perform time-consuming tasks, with Redis as a broker and for cache purposes.
The Redis cache is used to share a large dictionary structure across Celery tasks.
I have a REST API that frequently writes/updates a Python dictionary in the Redis cache (about every second). Each API call initiates a new task.
On localhost it all works fine. But on AWS, the Elastic Beanstalk app crashes after running for some time.
It does not crash when the dictionary structure is empty. Here is the code for how I update the cache:
r = redis.StrictRedis(host=Constants.REDIS_CACHE_ADDRESS, port=6379, db=0)
mydict_obj = r.get("mydict")
if mydict_obj:
    mydict = eval(str(mydict_obj))
else:
    mydict = {}
for hash_instance in all_hashes:
    if hash_instance[1] in mydict:
        mydict[hash_instance[1]].append((str(hash_instance[0]), str(data.recordId)))
    else:
        mydict[hash_instance[1]] = [(str(hash_instance[0]), str(data.recordId))]
r.set("mydict", mydict)
I can't find a solution for why the Elastic Beanstalk app crashes on AWS. It works fine on localhost.
From the documentation:
Don't use more memory than the specified amount of bytes.
When the memory limit is reached Redis will try to remove keys
according to the eviction policy selected (see maxmemory-policy).
If Redis can't remove keys according to the policy, or if the policy is
set to 'noeviction', Redis will start to reply with errors to commands
that would use more memory, like SET, LPUSH, and so on, and will continue
to reply to read-only commands like GET.
This option is usually useful when using Redis as an LRU cache, or to set
a hard memory limit for an instance (using the 'noeviction' policy).
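So the likely culprit is Redis running out of memory as the dictionary grows. Separately from that (and not part of the quoted answer), serialising the dictionary explicitly instead of relying on eval/str makes the round trip predictable and surfaces the OOM error Redis returns once maxmemory is reached; a sketch reusing the names from the question (Constants.REDIS_CACHE_ADDRESS):

import json
import logging

import redis

logger = logging.getLogger(__name__)
r = redis.StrictRedis(host=Constants.REDIS_CACHE_ADDRESS, port=6379, db=0)

def load_mydict():
    raw = r.get("mydict")
    return json.loads(raw) if raw else {}

def save_mydict(mydict):
    try:
        r.set("mydict", json.dumps(mydict))
    except redis.exceptions.ResponseError as exc:
        # once maxmemory is reached under the 'noeviction' policy, SET replies
        # with an OOM error; handle it instead of letting the app crash
        logger.error("Redis rejected write: %s", exc)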

Writing a Django backend program that runs indefinitely -- what to keep in mind?

I am trying to write a Django app that queries a remote database for some data, performs some calculations on a portion of this data and stores the results (in the local database using Django models). It also filters another portion and stores the result separately. My front end then queries my Django database for these processed data and displays them to the user.
My questions are:
How do I write an agent program that continuously runs in the backend, downloads data from the remote database, does calculations/ filtering and stores the result in the local Django database ? Particularly, what are the most important things to keep in mind when writing a program that runs indefinitely?
Is using cron for this purpose a good idea ?
The data retrieved from the remote database belongs to multiple users, and each user's data must be kept/stored separately in my local database as well. How do I achieve that? Using row-level/class-instance-level permissions maybe? Remember that the backend agent does the storage, updates and deletes; the front end only reads data (through HTTP requests).
And finally, I allow the creation of new users. If a new user has valid credentials for the remote database, the user should be allowed to use my app, in which case my backend will download this particular user's data from the remote database, perform calculations/filtering and present the results to the user. How can I handle the dynamic creation of objects/database tables for the new users? And how can I differentiate between users' data when retrieving it?
Would very much appreciate answers from experienced programmers with knowledge of Django. Thank you.
For:
1) The standard go-to solution for timed and background tasks is Celery, which has Django integration (a sketch combining this with 2) follows below). There are others, like Huey: https://github.com/coleifer/huey
2) The usual solution is that each row contains a user_id column indicating which user the data belongs to. This maps to the User model using Django ORM's ForeignKey field. Do your users need to query the database directly, or do they have direct database accounts? If not, then this solution should be enough. It sounds like your front end has one database connection and all permission logic is handled by the front end, not the database itself.
3) See 2
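A minimal sketch combining 1) and 2): a periodic Celery task pulls each user's data from the remote source and stores it against a ForeignKey. Everything here is illustrative; fetch_remote_data and compute are placeholders for the remote query and the calculation, and app is assumed to be your existing Celery instance:

# models.py
from django.conf import settings
from django.db import models

class ProcessedResult(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    value = models.FloatField()
    created = models.DateTimeField(auto_now_add=True)

# tasks.py
from celery import shared_task
from django.contrib.auth import get_user_model

@shared_task
def refresh_all_users():
    for user in get_user_model().objects.all():
        rows = fetch_remote_data(user)  # placeholder for the remote query
        ProcessedResult.objects.bulk_create(
            [ProcessedResult(user=user, value=compute(row)) for row in rows]
        )

# celery.py -- run the task every 10 minutes with Celery beat
app.conf.beat_schedule = {
    "refresh-all-users": {"task": "myapp.tasks.refresh_all_users", "schedule": 600},
}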

With Python, is there a way to listen for changes when an insert or update is made in MongoDB?

I am building a small system which serves data from a MongoDB collection; it already works fine, but I have to restart it every time I make changes.
I already have a monitor that detects changes and restarts the server automatically, but I want to do something like this with MongoDB changes.
I am currently using CentOS 5, Nginx, uWSGI & Python 2.7.
I'd look into using tailable cursors, which remain alive after they've reached the end of a collection, and can block until a new object is available.
Using PyMongo, you can call Collection.find with a tailable=True option to enable this behavior. This blog post gives some good examples of its usage.
Additionally, instead of just querying the collection, which will only alert you to new objects added to that collection, you may want to query the database's oplog, which is a collection of all inserts, updates, and deletes called on any collection in the database. Note that replication must be enabled for mongo to keep an oplog. Check out this blog post for info about the oplog and enabling replication.
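A rough sketch of tailing the oplog with PyMongo; note that PyMongo 3.x+ uses a cursor_type argument rather than the older tailable=True flag mentioned above, and handle_change is a placeholder for your own code:

import time

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # illustrative URI
oplog = client.local.oplog.rs  # requires a replica set, as noted above

cursor = oplog.find(cursor_type=pymongo.CursorType.TAILABLE_AWAIT)
while cursor.alive:
    for doc in cursor:
        # doc["op"] is 'i' (insert), 'u' (update) or 'd' (delete)
        handle_change(doc)
    time.sleep(1)  # nothing new yet; poll again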

Django statelessness?

I'm just wondering if Django was designed to be a fully stateless framework?
It seems to encourage statelessness and external storage mechanisms (databases and caches), but I'm wondering if it is possible to store some things in the server's memory while my app is in development and runs via manage.py runserver.
Sure it's possible. But if you are writing a web application you probably won't want to do that because of threading issues.
That depends on what you mean by "store things in the server's memory." It also depends on the type of data. If you can, you're better off storing "global data" in a database or in the file system somewhere. Unless it is needed every request it doesn't really make sense to store it in the Django instance itself. You'll need to implement some form of locking to prevent race conditions, but you'd need to worry about race conditions if you stored everything on the server object anyway.
Of course, if you're talking about user-by-user data, Django does support sessions. Or, and this is another perfectly good option if you're willing to make the user save the data, cookies.
The best way to maintain state in a Django app on a per-user basis is request.session (see Django sessions), which is a dictionary you can use to remember things about the current user.
For application-wide state you should use a persistent datastore (database or key/value store).
example view for sessions:
def my_view(request):
    pages_viewed = request.session.get('pages_viewed', 1) + 1
    request.session['pages_viewed'] = pages_viewed
    ...
If you wanted to maintain local variables on a per-app-instance basis, you can just define module-level variables like so:
# number of times my_view has been served by this server
# instance since the last restart
served_since_restart = 0

def my_view(request):
    global served_since_restart
    served_since_restart += 1
    ...
If you wanted to maintain some server state across ALL app servers (like total number of pages viewed EVER) you should probably use a persistent key/value store like redis, memcachedb, or riak. There is a decent comparison of all these options here: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
You can do it with Redis (via redis-py) like so (assuming your Redis server is at "127.0.0.1" (localhost) and its port is 6379, the default):
import redis

def my_view(request):
    r = redis.Redis(host='127.0.0.1', port=6379)
    served = int(r.get('pages_served_all_time') or 0)
    served += 1
    r.set('pages_served_all_time', served)
    # (r.incr('pages_served_all_time') would do the same thing atomically)
    ...
There is the LocMemCache cache backend that stores data in-process. You can use it with sessions (but with great care: this cache is not cross-process, so you will have to use a single process for deployment, because otherwise it is not guaranteed that subsequent requests will be handled by the same process). Global variables may also work (use threadlocals if they shouldn't be shared across all of a process's threads; the warning about cross-process communication also applies here).
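For reference, enabling that backend is just a CACHES entry in settings.py (this is the standard Django configuration; the LOCATION name is only needed to separate multiple locmem caches):

# settings.py
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
        "LOCATION": "unique-snowflake",
    }
}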
By the way, what's wrong with external storage? External storage provides easy cross-process data sharing and other features (like memory-limiting algorithms for caches or persistence with databases).
