Caching a Django queryset for the calendar date - python

I have a query whose results only change once a day. It seems like a waste to be performing that query on every request I get for that page. I am investigating using memcached for this.
How would I begin? Anyone have any suggestions or pitfalls I should avoid in using Django's caching? Should I cache at the template or at the view?
This question might seem vague but it's only because I've never dealt with caching before. So if there's something I could elaborate on, please just ask.
Elaboration
Per Ken Cochrane:
How often does this data change: The relevant data would be locked in for that calendar date. So, for example, I'll pull the data for 1/30/2011 and I'm okay with serving that cached copy for the whole day until 1/31/2011, when it would be refreshed.
Do I use this data in more than one place: Only in one view.
How much data is it going to be: An average of 10 model objects that contain about 15 fields with the largest being a CharField(max_length=120). I will trim the number of fields down using values() to about half of those.

Normally before I decide where to do the caching I ask myself a few questions.
How often does this data change?
Do I use this data in more than one place?
How much data is it going to be?
Since I don't know all of the details for your application, I'm going to make some assumptions.
you have a view that either takes in a date or uses the current date to query the database for all of the calendar events on that date,
you only display this information on one template,
the amount of data isn't too large (fewer than 100 entries).
With these assumptions you have 3 options.
1. cache the templates
2. cache the view
3. cache the queryset
Normally when I do my caching I cache the queryset; this gives me greater control over how I want to cache the data, and I can reuse the same cached data in more than one place.
The easiest way that I have found to cache the queryset is to do it in the ModelManager for the model in question. I would create a method like get_calendar_by_date(date) that will handle the query and caching for me. Here is a rough mockup:
from django.core.cache import cache
from django.db import models

CACHE_TIMEOUT_SECONDS = 60 * 60 * 24  # this is 24 hours

class CalendarManager(models.Manager):
    def get_calendar_by_date(self, by_date):
        """ assuming by_date is a datetime.date object """
        date_key = by_date.strftime("%m_%d_%Y")
        cache_key = 'CAL_DATE_%s' % date_key
        cal_date = cache.get(cache_key)
        if cal_date is not None:
            return cal_date
        # not in cache, get from the database
        cal_date = self.filter(event_date=by_date)
        # set cal_date in the cache for later use
        # (pickling the queryset for memcached forces it to evaluate)
        cache.set(cache_key, cal_date, CACHE_TIMEOUT_SECONDS)
        return cal_date
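To use a manager like this you attach it to the model. A minimal sketch of the wiring; the Calendar model and its fields are assumptions on my part, not part of the original answer:
class Calendar(models.Model):
    event_date = models.DateField()
    title = models.CharField(max_length=120)

    # attach the custom manager so the caching method is available
    objects = CalendarManager()

# usage in a view (hypothetical):
# todays_events = Calendar.objects.get_calendar_by_date(datetime.date.today())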
Some things to look out for when caching
Make sure the objects that you are storing in the cache can be pickled
Since memcached doesn't know what day it is, you need to make sure you don't over-cache. For example, if it is noon on Jan 21st and you cache for 24 hours, that calendar information will keep showing up until noon on Jan 22nd, which is probably not what you want. So either set the timeout to a small value so the entry expires quickly, or calculate how long to cache so that it expires exactly when you want it to (see the sketch after this list).
Make sure you know the size of the objects you want to cache. If your memcached instance only has 16MB of storage but you want to store 32MB of data, the cache isn't going to do you much good.
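For the "expire at the end of the calendar day" case, one way to calculate the timeout is to count the seconds left until midnight and pass that to cache.set. A minimal sketch; the helper name is mine, not from the original answer:
import datetime

def seconds_until_midnight(now=None):
    """Return how many seconds are left in the current calendar day."""
    now = now or datetime.datetime.now()
    midnight = datetime.datetime.combine(
        now.date() + datetime.timedelta(days=1), datetime.time.min
    )
    return int((midnight - now).total_seconds())

# then, instead of a fixed 24-hour timeout:
# cache.set(cache_key, cal_date, seconds_until_midnight())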
When caching the template or view you need to watch out for the following
Set your cache timeout so that it isn't too large. With the template cache tag I don't think you can change the timeout programmatically; it is hard-coded in the template, so if you set it too high you will end up serving a page that is out of date. With view caching you should be able to change the cache time programmatically, so it is a little safer (see the sketch after this list).
If you are caching the template and there is other information on the template that is dynamic and changes all the time, make sure that you only put the cache tags around the section of the page you want cached for a while. If you put them in the wrong place you might end up with the wrong result.
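For reference, per-view caching in Django is done with the cache_page decorator, which takes the timeout in seconds. A short sketch; the view name is hypothetical:
from django.views.decorators.cache import cache_page

@cache_page(60 * 60 * 24)  # cache this view's response for 24 hours
def calendar_view(request):
    ...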
Hopefully that gives you enough information to get started. Good Luck.

Try reading this first of all.
Django has a template tag for this: {% cache for_seconds fragment_name %}.
Just use the cache tag.
http://docs.djangoproject.com/en/dev/topics/cache/
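For example, a fragment that only needs to change once a day could be wrapped like this; the fragment name and timeout are just illustrative:
{% load cache %}
{% cache 86400 daily_calendar %}
    ... render the calendar data here ...
{% endcache %}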

You can cache the results of a function by date with Python's built-in lru_cache, as long as the parameter is a plain "2021-09-22" date and not a timestamp:
import datetime
from functools import lru_cache

@lru_cache(maxsize=1)
def my_func(date: datetime.date):
    # datetime.datetime is a subclass of datetime.date, so reject timestamps explicitly
    if type(date) is not datetime.date:
        raise ValueError(f"This method is cached by calendar date, but received date {date}.")
    # ... run the expensive query here and return the result
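A short usage sketch, assuming my_func runs the daily query: because every call on the same calendar day passes an equal date value, only the first call actually does the work.
import datetime

today = datetime.date.today()
events = my_func(today)        # first call this day: runs the query
events_again = my_func(today)  # same date value: served from lru_cache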

Related

Django ORM Limit QuerySet By Using Start Limit

I have recently been building a Django project which deals with a result set of 20k rows.
All this data is returned as JSON, and I am parsing it to use in a template.
Currently, I am using objects.all() from the Django ORM.
I would like to know if we can get the complete result set in parts.
Say, if the result is 10k rows, then split it into chunks of 2k rows each.
My approach would be to lazy-load the data, using a limit variable incremented by 2k at a time.
Is this approach feasible? Any help in this regard would be appreciated.
Yes, you can make use of .iterator(…) [Django-doc], for example:
for obj in MyModel.objects.all().iterator(chunk_size=2000):
    # … do something with obj
    pass
This will fetch the records in chunks of 2,000 in this case. If you set the chunk_size higher, it will fetch more records per query, but then you need more memory to hold all of those records at once. Setting the chunk_size lower will result in less memory usage, but more queries to the database.
You might, however, be interested in pagination [Django-doc] instead. In that case the request contains a page number, and you return only a limited number of records. This is often better since not all clients need all the data, and the client will usually need to process the data itself; if the chunks are too large, the client can get flooded as well.
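A minimal sketch of what that could look like with Django's Paginator and a JSON response; the MyModel import, the view name, and the page size of 2,000 are assumptions, not from the question:
from django.core.paginator import Paginator
from django.http import JsonResponse

from myapp.models import MyModel  # hypothetical app/model

def my_model_list(request):
    page_number = request.GET.get("page", 1)
    # paginate a values() queryset so each row is a JSON-serialisable dict
    paginator = Paginator(MyModel.objects.order_by("pk").values(), 2000)
    page = paginator.get_page(page_number)
    return JsonResponse({
        "results": list(page.object_list),
        "page": page.number,
        "num_pages": paginator.num_pages,
    })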

Posting data to database through a "workflow" (Ex: on field changed to 20, create new record)

I'm looking to post new records on a user-triggered basis (i.e. a workflow). I've spent the last couple of days researching the best way to approach this and so far I've come up with the following ideas:
(1) Utilize Django signals to check for conditions on a field change, and then post data originating from my Django app.
(2) Utilize JS/AJAX on the front-end to post data to the app based upon a user changing certain fields.
(3) Utilize a prebuilt workflow app like http://viewflow.io/, again based upon changes triggered by users.
Of the three options above, is there a best practice? Are there any other options I'm not considering for how to take this workflow-based approach to posting new records?
The second approach, monitoring the changes on the front end and then calling a backend view to update the database, would be better, because doing the detection on the backend (or on another service) puts that processing on the server and slows the site down, whereas the second approach is more of a client-side solution and keeps the server relieved.
I do not think there will be data loss; you are just trying to monitor a change, and as soon as it changes your view will update the database. You could also use cookies or sessions to keep appending values to a list and update the database when the site closes. Also, when Django gives HTTP errors you could put proper try/except conditions in place for that case as well. Anyway, I think cookies would be a good approach.
For anyone that finds this post: I ended up deciding to take the Signals route. Essentially I'm utilizing Signals to track when users change a field, and based on the field that changes I'm performing certain actions on the database.
For testing purposes this has been working well. When I reach production with this project I'll try to update this post with any challenges I run into.
Example:
from django.db.models.signals import pre_save
from django.dispatch import receiver

# subTaskChecklist, FieldToTrack and DoSomethingWithChangedField come from my project
@receiver(pre_save, sender=subTaskChecklist)
def do_something_if_changed(sender, instance, **kwargs):
    try:
        obj = sender.objects.get(pk=instance.pk)  # obj is the "old" object, before the change
    except sender.DoesNotExist:
        pass  # the instance is new, so there is nothing to compare against
    else:
        previous_Value = obj.FieldToTrack
        new_Value = instance.FieldToTrack  # instance is the "new", changed object
        DoSomethingWithChangedField(new_Value)
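One thing to keep in mind with this route: the module containing the receiver has to be imported at startup, or the signal never gets connected. A common pattern, with the app and module names here being assumptions, is to do the import in the app config's ready() method:
# myapp/apps.py  (hypothetical app name)
from django.apps import AppConfig

class MyAppConfig(AppConfig):
    name = "myapp"

    def ready(self):
        from . import signals  # noqa: F401 -- registers the pre_save receiver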

Storing queryset after fetching it once

I am new to Django and web development.
I am building a website with a database of considerable size.
A large amount of data has to be shown on many pages, and a lot of it is repeated; I mean I need to show the same data on many pages.
Is it a good idea to query the database for the data on every GET request? It takes many seconds to get the data every time I refresh the page or request another page that shows the same data.
Is there a way to fetch the data once, store it somewhere, and just display it on every page, only refetching it when some update is made?
I thought about the session, but I found that it is limited to 5MB, which is too small for my data.
Any suggestions?
Thank you.
Django's cache - as mentioned by Leistungsabfall - can help, but like most cache systems it has some drawbacks too if you use it naively for this kind of problem (long queries/computations): when the cache expires, the next request has to recompute the whole thing, which might take some time, during which every new request will trigger a recomputation... Also, proper cache invalidation can be really tricky.
Actually there's no one-size-fits-all answer to your question; the right solution is often a mix of different solutions (code optimisation, caching, denormalisation etc.), based on your actual data, how often it changes, how many visitors you have, how critical it is to have up-to-date data etc., but the very first steps would be to:
check the code fetching the data and find out if there are possible optimisations at this level using QuerySet features (.select_related() / .prefetch_related(), .values() and/or .values_list(), annotations etc.) to avoid issues like the "n+1 queries" problem, fetching whole records and building whole model instances when you only need a single field's value, or doing computations at the Python level when they could be done at the database level (see the sketch after this list)
check your db schema's indexes - well-used indexes can vastly improve performance, badly used ones can vastly degrade it...
and of course use the right tools (db query logging, Python's profiler etc) to make sure you identify the real issues.
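As an illustration of the first point, a small sketch; the Book/Author models and field names are assumptions, not from the question:
from django.db.models import Count

from myapp.models import Author, Book  # hypothetical models

# "n+1 queries": one extra query per book to fetch its author
for book in Book.objects.all():
    print(book.author.name)

# one query with a JOIN instead
for book in Book.objects.select_related("author"):
    print(book.author.name)

# fetch only the field you need instead of whole model instances
titles = Book.objects.values_list("title", flat=True)

# let the database do the counting rather than Python
authors = Author.objects.annotate(num_books=Count("book"))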

Django - queryset caching request-independent?

I need to cache a mid-sized queryset (about 500 rows). I had a look at some solutions, django-cache-machine being the most promising.
Since the queryset is pretty much static (it's a table of cities that's been populated in advance and gets updated only by me and anyway, almost never), I just need to serve the same queryset at every request for filtering.
In my search, one detail was really not clear to me: is the cache a sort of singleton object, which is available to every request? By which I mean, if two different users access the same page, and the queryset is evaluated for the first user, does the second one get the cached queryset?
I could not figure out exactly what problem you are facing. What you are describing is the classical use case for caching. Memcached and Redis are the two most popular options. You just need to write a method or function which first tries to load the result from the cache, and if it is not there, queries the database. E.g.:
from django.contrib.auth.models import User  # or whichever model you are caching
from django.core.cache import cache

def cache_user(userid):
    key = "user_{0}".format(userid)
    value = cache.get(key)
    if value is None:
        # fetch the value from the db and cache it for next time
        value = User.objects.get(pk=userid)
        cache.set(key, value)
    return value
Although for simplicity I have written this as a function, ideally it should be a manager method on the model concerned.

Attribute Cache in Django - What's the point?

I was just looking over EveryBlock's source code and I noticed this code in the alerts/models.py code:
def _get_user(self):
    if not hasattr(self, '_user_cache'):
        from ebpub.accounts.models import User
        try:
            self._user_cache = User.objects.get(id=self.user_id)
        except User.DoesNotExist:
            self._user_cache = None
    return self._user_cache
user = property(_get_user)
I've noticed this pattern around a bunch, but I don't quite understand the use. Is the whole idea to make sure that when accessing the FK on self (self = alert object), you only grab the user object once from the db? Why wouldn't you just rely upon the db caching and Django's ForeignKey() field? I noticed that the model definition only holds the user id and not a foreign key field:
class EmailAlert(models.Model):
    user_id = models.IntegerField()
    ...
Any insights would be appreciated.
I don't know why this is an IntegerField; it looks like it definitely should be a ForeignKey(User) field--you lose things like select_related() here and other things because of that, too.
As to the caching, many databases don't cache results--they (or rather, the OS) will cache the data on disk needed to get the result, so looking it up a second time should be faster than the first, but it'll still take work.
It also still takes a database round-trip to look it up. In my experience with Django, an item lookup can take around 0.5 to 1ms for the SQL command to a local PostgreSQL server, plus the sometimes nontrivial overhead of the QuerySet machinery. 1ms is a lot if you don't need it--do that a few times and you can turn a 30ms request into a 35ms request.
If your SQL server isn't local and you actually have network round-trips to deal with, the numbers get bigger.
Finally, people generally expect accessing a property to be fast; when they're complex enough to cause SQL queries, caching the result is generally a good idea.
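For what it's worth, the same pattern can be written more compactly with Django's cached_property decorator. A sketch along the lines of the question's model; it uses django.contrib.auth's User here, whereas EveryBlock's code uses its own ebpub.accounts.models.User:
from django.contrib.auth.models import User
from django.db import models
from django.utils.functional import cached_property

class EmailAlert(models.Model):
    user_id = models.IntegerField()

    @cached_property
    def user(self):
        # hits the database once per instance; later accesses reuse the result
        try:
            return User.objects.get(id=self.user_id)
        except User.DoesNotExist:
            return None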
Although databases do cache things internally, there's still an overhead in going back to the db every time you want to check the value of a related field - setting up the query within Django, the network latency in connecting to the db and returning the data over the network, instantiating the object in Django, etc. If you know the data hasn't changed in the meantime - and within the context of a single web request you probably don't care if it has - it makes much more sense to get the data once and cache it, rather than querying it every single time.
One of the applications I work on has an extremely complex home page containing a huge amount of data. Previously it was carrying out over 400 db queries to render. I've refactored it now so it 'only' uses 80, using very similar techniques to the one you've posted, and you'd better believe that it gives a massive performance boost.
