Reading/writing Python dictionary to Redis cache on AWS crashes server - python

I am developing a computation-intensive Django application. I am using Celery to perform time-consuming tasks, and Redis both as the broker and for caching purposes.
The Redis cache is used to share a large dictionary structure across Celery tasks.
I have a REST API that frequently writes/updates a Python dictionary in the Redis cache (roughly every second). Each API call initiates a new task.
On localhost it all works fine, but on AWS the Elastic Beanstalk app crashes after running for some time.
It does not crash when the dictionary structure is empty. Here is the code I use to update the cache:
r = redis.StrictRedis(host=Constants.REDIS_CACHE_ADDRESS, port=6379, db=0)
mydict_obj = r.get("mydict")
if mydict_obj:
    mydict = eval(str(mydict_obj))
else:
    mydict = {}
for hash_instance in all_hashes:
    if hash_instance[1] in mydict:
        mydict[hash_instance[1]].append((str(hash_instance[0]), str(data.recordId)))
    else:
        mydict[hash_instance[1]] = [(str(hash_instance[0]), str(data.recordId))]
r.set("mydict", mydict)
I can't figure out why the Elastic Beanstalk app crashes on AWS when it works fine on localhost.

From the Redis documentation (the maxmemory setting in redis.conf):
Don't use more memory than the specified amount of bytes.
When the memory limit is reached Redis will try to remove keys
according to the eviction policy selected (see maxmemory-policy).
If Redis can't remove keys according to the policy, or if the policy is
set to 'noeviction', Redis will start to reply with errors to commands
that would use more memory, like SET, LPUSH, and so on, and will continue
to reply to read-only commands like GET.
This option is usually useful when using Redis as an LRU cache, or to set
a hard memory limit for an instance (using the 'noeviction' policy).
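If the Redis instance behind the Elastic Beanstalk environment has a small maxmemory and the noeviction policy, the ever-growing "mydict" value will eventually cause every SET to be rejected. A minimal redis-py sketch for checking the limit and catching that error (the host is a placeholder, and CONFIG may be disabled on managed Redis such as ElastiCache):
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)  # placeholder host

# Inspect the configured limit and current usage (CONFIG may be unavailable on managed Redis).
print(r.config_get("maxmemory"), r.config_get("maxmemory-policy"))
print(r.info("memory")["used_memory_human"])

try:
    r.set("mydict", "...a very large serialized dictionary...")
except redis.exceptions.ResponseError as exc:
    # With maxmemory reached and the 'noeviction' policy, Redis replies with
    # "OOM command not allowed when used memory > 'maxmemory'".
    print("Redis refused the write:", exc)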

Related

How to cache individual Django REST API POSTs for bulk_create?

I have a Django REST API endpoint. It receives a JSON payload, e.g.
{ "data" : [0,1,2,3] }
This is decoded in a views.py function and generates a new database object like so (pseudo-code):
newobj = MyObj(col0=0, col1=1, col2=2, col3=3)
newobj.save()
In tests, it is 20x faster to create a list of 1000 new objects and then do a single bulk create:
MyObj.objects.bulk_create(newobjs, 1000)
So, the question is how to save individual POSTs somewhere in Django ready for batch writes when we have 1000 of them?
You can cache it with Memcached or with Redis, for example.
But you will need to write some kind of service that checks how many new items are in the cache and, once there are more than e.g. 1000, inserts them.
So:
POSTs populate a cache;
a service gets new items from the cache and then inserts them into the persistent database.
Do you really need it?
What will happen if the data already exists? If the data is corrupted? How will the user know about this?
save individual POSTs somewhere in Django ready for batch writes when we have 1000 of them
You can:
use Django's cache framework,
maintain a CSV file using Python's csv module,
or, since you probably want to maintain the order of the posts, use the persist-queue package.
But as Victor mentioned as well, why? Why are you so concerned about the speed of SQL inserts, which are pretty fast anyway?
Of course, bulk_create is much faster because it takes a single network call to your DB server and adds all the rows in a single SQL transaction, but it only makes sense to use it when you actually have a bunch of data to be added together. In the end, you must save the data somewhere, which is going to take some processing time one way or another.
Because there are many disadvantages to your approach:
you risk losing the data
you will not be able to enforce UNIQUE or any other constraint on your table.
your users won't get instant feedback on creating a post.
you cannot show/access the posts in a useful way if they are not stored in your primary DB.
EDIT
Use a fast cache like Redis to maintain a list of the entries. In your api_view you can call cache.get to get the current list, append the object to it, and then call cache.set to update it. After this, add a check so that whenever len(list) >= 1000, you call bulk_create. You might also want to consider using Elasticsearch for such an enormous amount of data.
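A minimal sketch of that idea, assuming the Django cache is backed by Redis; the app label myapp and the view name buffered_create are made up, while MyObj and col0..col3 come from the question:
from django.core.cache import cache
from rest_framework.decorators import api_view
from rest_framework.response import Response

from myapp.models import MyObj  # hypothetical app label, model from the question

BATCH_SIZE = 1000

@api_view(['POST'])
def buffered_create(request):
    # Fetch the buffered rows, append the new payload, and write the list back.
    pending = cache.get('pending_rows', [])
    pending.append(request.data['data'])
    cache.set('pending_rows', pending, timeout=None)

    # Flush once enough rows have accumulated.
    if len(pending) >= BATCH_SIZE:
        objs = [MyObj(col0=d[0], col1=d[1], col2=d[2], col3=d[3]) for d in pending]
        MyObj.objects.bulk_create(objs, BATCH_SIZE)
        cache.set('pending_rows', [], timeout=None)
    return Response({'buffered': len(pending)})
Note that the get/append/set sequence is not atomic across workers, which is one reason the summary below switches to Redis list operations (RPUSH/LPOP) instead.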
Thanks for the above responses. The solution includes some of what was suggested, but is a superset, so here's a summary.
This is really about creating a FIFO. memcached turns out to be unsuitable (after trying it) because only Redis has the list functionality that enables this, explained nicely here.
Also note that Django's built-in cache framework does not expose the Redis list API calls.
So we need a new docker-compose.yml entry to add redis:
redis:
  image: redis
  ports:
    - 6379:6379/tcp
  networks:
    - app-network
Then in views.py we add: (note the use of redis rpush)
import redis
...
redis_host = os.environ['REDIS_HOST']
redis_port = 6379
redis_password = ""
r = redis.StrictRedis(host=redis_host, port=redis_port, password=redis_password, decode_responses=True)
...
def write_post_to_redis(request):
    payload = json.loads(request.body)
    r.rpush("df", json.dumps(payload))
This pushes the received payload into the Redis in-memory cache. We now need to read (or pop) it and write it to the Postgres database, so we need a process that wakes up every n seconds and checks. For this we use django-background-tasks. First, install it with:
pipenv install django-background-tasks
And add it to INSTALLED_APPS in settings.py:
INSTALLED_APPS = [
    ...
    'background_task',
]
Then run a migrate to add the background task tables:
python manage.py migrate
Now in views.py, add:
from background_task import background
from background_task.models import CompletedTask
And add the function that writes the cached data to the Postgres database. Note the decorator, which schedules it to run in the background, and the use of Redis LPOP.
@background(schedule=5)
def write_cached_samples():
    ...
    payload = json.loads(r.lpop('df'))
    # now do your write of payload to postgres
    ...
    # and delete the completed tasks or we'll have a big db leak
    CompletedTask.objects.all().delete()
In order to start the process up, add the following to the base of urls.py:
write_cached_samples(repeat=10, repeat_until=None)
Finally, because the background task needs a separate process, we duplicate the Django docker container in docker-compose.yml but replace the ASGI server run command with the background process run command.
django_bg:
  image: my_django
  command: >
    sh -c "python manage.py process_tasks"
  ...
In summary, we add two new docker containers: one for the Redis in-memory cache, and one to run the Django background tasks. We use the Redis list functions RPUSH and LPOP to create a FIFO, with the API receiver pushing and a background task popping.
There was a small issue where nginx was hooking up to the wrong Django container, rectified by stopping and restarting the background container; apparently some issue where Docker network routing was wrongly initialising.
Next I am replacing the Django HTTP API endpoint with a Go one to see how much of a speed up we get, as the Daphne ASGI server is hitting max CPU at only 100 requests per sec.

Python, sqlalchemy: how to improve performance of encrypted sqlite database?

I have a simple service application: Python, Tornado web server, SQLite database. The database is encrypted.
The problem is that processing even a very simple HTTP request takes about 300 ms.
From the logs I can see that most of that time is spent processing the very first SQL request, no matter how simple that request is. Subsequent SQL requests are processed much faster. But then the server starts processing the next HTTP request, and again the first SQL request is very slow.
If I turn off the database encryption the problem is gone: the processing time of SQL requests no longer depends on whether the request is the first one or not, and my server response time decreases by a factor of 10 to 15.
I do not quite understand what's going on. It looks like SQLAlchemy reads and decrypts the database file each time it starts a new session. Is there any way to work around this problem?
Due to how pysqlite (the sqlite3 module) works, SQLAlchemy defaults to using a NullPool with file-based databases. This explains why your database is decrypted on each request: a NullPool discards connections as they are closed. The reason this is done is that pysqlite's default behaviour disallows using a connection in more than one thread, and, without encryption, creating new connections is very fast.
Pysqlite does have an undocumented flag check_same_thread that can be used to disable the check, but sharing connections between threads should be handled with care and the SQLAlchemy documentation makes a passing mention that the NullPool works well with SQLite's file locking.
Depending on your web server you could use a SingletonThreadPool, which means that all connections in a thread are the same connection:
from sqlalchemy import create_engine
from sqlalchemy.pool import SingletonThreadPool

engine = create_engine('sqlite:///my.db',
                       poolclass=SingletonThreadPool)
If you feel adventurous and your web server does not share connections / sessions between threads while in use (for example using a scoped session), then you could try using a different pooling strategy paired with check_same_thread=False:
from sqlalchemy.pool import QueuePool

engine = create_engine('sqlite:///my.db',
                       poolclass=QueuePool,
                       connect_args={'check_same_thread': False})
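For the "scoped session" arrangement mentioned above, a rough sketch (handle_request is only an illustration of a per-request code path) might look like:
from sqlalchemy import create_engine, text
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.pool import QueuePool

engine = create_engine('sqlite:///my.db',
                       poolclass=QueuePool,
                       connect_args={'check_same_thread': False})

# Each thread gets its own session (and therefore its own connection),
# so connections are never shared between threads while in use.
Session = scoped_session(sessionmaker(bind=engine))

def handle_request():
    session = Session()  # thread-local session
    try:
        session.execute(text('SELECT 1'))
    finally:
        Session.remove()  # hand the connection back at the end of the request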
To encrypt the database, SQLCipher derives a key from the passphrase I provided. This key derivation is resource-consuming by design.
But it is possible to use a 256-bit raw key instead of a passphrase. In that case SQLCipher does not have to derive the encryption key.
Originally my code was:
session.execute('PRAGMA KEY = "MY_PASSPHRASE";')
To use a raw key I changed this line to:
session.execute('''PRAGMA KEY = "x'<the key>'";''')
where <the key> is a 64-character string of hexadecimal digits.
The result is a 20+ times speed-up on small requests.
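A minimal sketch of generating such a raw key in Python (illustrative only; in practice the key must be generated once and stored securely, not regenerated per run):
import os

# 32 random bytes = 256 bits, rendered as 64 hexadecimal characters.
raw_key = os.urandom(32).hex()

# SQLCipher's x'...' syntax marks the value as a raw key, so no key derivation is performed.
pragma = '''PRAGMA KEY = "x'{}'";'''.format(raw_key)
# session.execute(pragma)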
Just for reference: to convert an existing database to the new encryption key, the following commands should be executed:
PRAGMA KEY = "MY_PASSPHRASE";
PRAGMA REKEY = "x'<the key>'";
Related question: python, sqlite, sqlcipher: very poor performance processing first request
Some info about sqlcipher commands and difference between keys and raw keys: https://www.zetetic.net/sqlcipher/sqlcipher-api/

Can Redis write out to a database like PostgreSQL?

I've been using PostgreSQL for the longest time. All of my data lives inside Postgres. I've recently looked into Redis and it has a lot of powerful features that would otherwise take a couple of lines in Django (Python) to do. Redis data is persistent as long as the machine it's running on doesn't go down, and you can configure it to write the data it's storing to disk every 1000 keys or every 5 minutes or so, depending on your choice.
Redis would make a great cache and it would certainly replace a lot of functions I have written in Python (upvoting a user's post, viewing their friends list, etc.). But my concern is that all of this data would somehow need to be translated over to Postgres. I don't trust storing this data in Redis. I see Redis as a temporary storage solution for quick retrieval of information. It's extremely fast, and this far outweighs doing repetitive queries against Postgres.
I'm assuming the only way I could technically write the Redis data to the database is to save() whatever I get from the 'get' query from Redis to the Postgres database through Django.
That's the only solution I could think of. Do you know of any other solutions to this problem?
Redis is increasingly used as a caching layer, much like a more sophisticated memcached, and is very useful in this role. You usually use Redis as a write-through cache for data you want to be durable, and write-back for data you might want to accumulate then batch write (where you can afford to lose recent data).
PostgreSQL's LISTEN and NOTIFY system is very useful for doing selective cache invalidation, letting you purge records from Redis when they're updated in PostgreSQL.
For combining it with PostgreSQL, you will find the Redis foreign data wrapper that Andrew Dunstan and Dave Page are working on very interesting.
I'm not aware of any tool that makes Redis into a transparent write-back cache for PostgreSQL. Their data models are probably too different for this to work well. Usually you write changes to PostgreSQL and invalidate their Redis cache entries using listen/notify to a cache manager worker, or you queue changes in Redis then have your app read them out and write them into Pg in chunks.
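As a rough illustration of the write-through pattern described above (a sketch only: Post is a hypothetical Django model with a votes field, and the cache key scheme is made up):
import json
import redis

from myapp.models import Post  # hypothetical model

r = redis.Redis(host='127.0.0.1', port=6379, db=0)

def upvote_post(post_id):
    # Write-through: the durable write goes to PostgreSQL first...
    post = Post.objects.get(pk=post_id)
    post.votes += 1
    post.save(update_fields=['votes'])
    # ...then the cache entry is refreshed so reads stay fast.
    r.set('post:%d' % post_id, json.dumps({'id': post.id, 'votes': post.votes}))

def get_post(post_id):
    cached = r.get('post:%d' % post_id)
    if cached is not None:
        return json.loads(cached)
    # Cache miss: fall back to PostgreSQL and repopulate Redis.
    post = Post.objects.get(pk=post_id)
    data = {'id': post.id, 'votes': post.votes}
    r.set('post:%d' % post_id, json.dumps(data))
    return data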
Redis is persistent if configured to be so, both through snapshots and a kind of WAL called AOF. Loads of people use it as a primary datastore.
https://redis.io/topics/persistence
If one is referring to the greater world of Redis-compatible (RESP protocol) datastores, many are not limited to in-memory storage:
https://keydb.dev/
http://ssdb.io/
and many more...

Google AppEngine and Threaded Workers

I am currently trying to develop something using Google App Engine. I am using Python as my runtime and require some advice on setting up the following.
I am running a web server that provides JSON data to clients. The data comes from an external service that I have to pull the data from.
What I need to be able to do is run a background system that checks memcache to see if there are any required IDs; if there is an ID, I need to fetch some data for that ID from the external source and place the data in memcache.
If there are multiple IDs, say more than 30, I need to be able to pull all 30 requests as quickly and efficiently as possible.
I am new to Python development and App Engine, so any advice you could give would be great.
Thanks.
You can use "backends" or "task queues" to run processes in the background. Tasks have a 10-minute run time limit, and backends have no run time limit. There's also a cronjob mechanism which can trigger requests at regular intervals.
You can fetch the data from external servers with the "URLFetch" service.
Note that using memcache as the communication mechanism between front-end and back-end is unreliable -- the contents of memcache may be partially or fully erased at any time (and it does happen from time to time).
Also note that you can't query memcache if you don't know the exact keys ahead of time. It's probably better to use the task queue to queue up requests instead of using memcache, or to use the datastore as a storage mechanism.
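A minimal sketch of that suggestion on the Python 2 App Engine runtime, using the task queue and URLFetch APIs (the /fetch_worker route and EXTERNAL_URL are made up):
import webapp2
from google.appengine.api import memcache, taskqueue, urlfetch

EXTERNAL_URL = 'https://example.com/data'  # placeholder for the external service

def enqueue_fetches(ids):
    # Queue one task per ID instead of leaving IDs in memcache.
    for item_id in ids:
        taskqueue.add(url='/fetch_worker', params={'id': item_id})

class FetchWorker(webapp2.RequestHandler):
    def post(self):
        item_id = self.request.get('id')
        result = urlfetch.fetch('%s?id=%s' % (EXTERNAL_URL, item_id))
        if result.status_code == 200:
            # Cache the fetched payload for the front end to serve.
            memcache.set('data:%s' % item_id, result.content)

app = webapp2.WSGIApplication([('/fetch_worker', FetchWorker)])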

Django statelessness?

I'm just wondering if Django was designed to be a fully stateless framework?
It seems to encourage statelessness and external storage mechanisms (databases and caches), but I'm wondering if it is possible to store some things in the server's memory while my app is in development and runs via manage.py runserver.
Sure it's possible. But if you are writing a web application you probably won't want to do that because of threading issues.
That depends on what you mean by "store things in the server's memory." It also depends on the type of data. If you can, you're better off storing "global data" in a database or in the file system somewhere. Unless it is needed every request it doesn't really make sense to store it in the Django instance itself. You'll need to implement some form of locking to prevent race conditions, but you'd need to worry about race conditions if you stored everything on the server object anyway.
Of course, if you're talking about user-by-user data, Django does support sessions. Or, and this is another perfectly good option if you're willing to make the user save the data, cookies.
The best way to maintain state in a Django app on a per-user basis is request.session (see Django sessions), which is a dictionary you can use to remember things about the current user.
For application-wide state you should use a persistent datastore (database or key/value store).
Example view for sessions:
def my_view(request):
    pages_viewed = request.session.get('pages_viewed', 1) + 1
    request.session['pages_viewed'] = pages_viewed
    ...
If you wanted to maintain local variables on a per-app-instance basis, you can just define module-level variables like so (note the global declaration, without which the += would raise an UnboundLocalError):
# number of times my_view has been served by this server
# instance since the last restart
served_since_restart = 0

def my_view(request):
    global served_since_restart
    served_since_restart += 1
    ...
If you wanted to maintain some server state across ALL app servers (like total number of pages viewed EVER) you should probably use a persistent key/value store like redis, memcachedb, or riak. There is a decent comparison of all these options here: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
You can do it with redis (via redis-py) like so (assuming your redis server is at "127.0.0.1" (localhost) and its port is 6379, the default):
import redis

def my_view(request):
    r = redis.Redis(host='127.0.0.1', port=6379)
    # GET returns bytes (or None if the key does not exist), so convert before incrementing
    served = int(r.get('pages_served_all_time') or 0)
    served += 1
    r.set('pages_served_all_time', served)
    ...
There is the LocMemCache cache backend, which stores data in-process. You can use it with sessions (but with great care: this cache is not cross-process, so you will have to use a single process for deployment, because otherwise it is not guaranteed that subsequent requests will be handled by the same process). Global variables may also work (use thread-locals if they shouldn't be shared by all process threads; the warning about cross-process communication also applies here).
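A minimal settings.py sketch for that setup (development only, single process, as cautioned above):
# settings.py (development only: LocMemCache is per-process)
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
        'LOCATION': 'dev-cache',
    }
}

# Store sessions in the cache configured above instead of the database.
SESSION_ENGINE = 'django.contrib.sessions.backends.cache'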
By the way, what's wrong with external storage? External storage provides easy cross-process data sharing and other features (like memory-limiting algorithms for the cache or persistence with databases).
