I need to cache a mid-sized queryset (about 500 rows). I've looked at a few solutions, django-cache-machine being the most promising.
Since the queryset is pretty much static (it's a table of cities that was populated in advance and is updated only by me, and almost never at that), I just need to serve the same queryset on every request for filtering.
In my search, one detail was really not clear to me: is the cache a sort of singleton object, which is available to every request? By which I mean, if two different users access the same page, and the queryset is evaluated for the first user, does the second one get the cached queryset?
I could not quite figure out what problem you are facing. What you describe is the classic use case for caching. Memcached and Redis are the two most popular options. You just need to write a method or function that first tries to load the result from the cache and, if it is not there, queries the database. E.g.:
from django.core.cache import cache

def cache_user(userid):
    key = "user_{0}".format(userid)
    value = cache.get(key)
    if value is None:
        value = User.objects.get(pk=userid)  # fetch value from db
        cache.set(key, value)
    return value
Although for simplicity I have written this as a function, ideally it should be a manager method on the model concerned.
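The same cache-aside pattern can be sketched in plain Python, independent of Django. `SimpleCache` and `load_user_from_db` below are hypothetical stand-ins for `django.core.cache` and the real database query; the point is only the get / miss / set flow:

```python
class SimpleCache:
    """Minimal dict-backed stand-in for django.core.cache."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


def load_user_from_db(userid):
    # Hypothetical stand-in for User.objects.get(pk=userid).
    return {"id": userid, "name": "user_{0}".format(userid)}


cache = SimpleCache()

def cache_user(userid):
    key = "user_{0}".format(userid)
    value = cache.get(key)
    if value is None:
        value = load_user_from_db(userid)  # only runs on a cache miss
        cache.set(key, value)
    return value
```

The second `cache_user(5)` call returns the cached object without touching the "database" at all, which is exactly what you want for a table that almost never changes.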
Related
So I'm currently working in Python/Django and I have a problem where Django caches querysets "within a session".
If I run python manage.py shell and do so:
>>> from myproject.services.models import *
>>> test = TestModel.objects.filter(pk = 5)
>>> print test[0].name
John
Now, if I then update it directly in SQL to Bob and run it again, it'll still say John. If I however CTRL+D out (exit) and run the same thing, it will have updated and will now print Bob.
My problem is that I'm running a SOAP service in a screen and it'll always return the same result, even if the data gets changed.
I need a way to force the query to actually pull the data from the database again, not just pull the cached data. I could just use raw queries but that doesn't feel like a solution to me, any ideas?
The queryset is not cached 'within a session'.
The Django documentation: Caching and QuerySets mentions:
Each QuerySet contains a cache to minimize database access. Understanding how it works will allow you to write the most efficient code.
In a newly created QuerySet, the cache is empty. The first time a QuerySet is evaluated – and, hence, a database query happens – Django saves the query results in the QuerySet’s cache and returns the results that have been explicitly requested (e.g., the next element, if the QuerySet is being iterated over). Subsequent evaluations of the QuerySet reuse the cached results.
Keep this caching behavior in mind, because it may bite you if you don’t use your QuerySets correctly.
(emphasis mine)
For more information on when querysets are evaluated, refer to this link.
If it is critical for your application that the queryset is up to date, you have to re-evaluate it each time, be it within a single view function or via Ajax.
It is like running the SQL query again and again, as in the old days before querysets existed, when you kept the data in some structure that you had to refresh yourself.
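The behaviour can be modelled in a few lines. `FakeQuerySet` below is a simplified sketch of the result-cache idea, not Django's actual class; re-running `TestModel.objects.filter(pk=5)` (a new queryset) is what forces a fresh database hit:

```python
class FakeQuerySet:
    """Toy sketch of a queryset's result cache; not Django's real class."""
    def __init__(self, fetch):
        self._fetch = fetch             # callable standing in for the SQL query
        self._result_cache = None

    def __iter__(self):
        if self._result_cache is None:  # first evaluation hits the "database"
            self._result_cache = self._fetch()
        return iter(self._result_cache)


db = {"name": "John"}                   # pretend database row
qs = FakeQuerySet(lambda: [dict(db)])

first = list(qs)        # evaluated: reads from the "database"
db["name"] = "Bob"      # simulate an UPDATE done directly in SQL
second = list(qs)       # cached results reused: still John
fresh = list(FakeQuerySet(lambda: [dict(db)]))  # new queryset: Bob
```

This mirrors the shell session above: the long-lived `qs` keeps answering "John", while building the queryset afresh sees "Bob".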
I am relatively new to Django and Python, but I have not been able to quite figure this one out.
I essentially want to query the database using filter for a large group of users, and then make a bunch of queries on just that section of users. So I thought it would be most efficient to first query with my broader filter parameters, and then run my separate filter queries against that set. In code, it looks like this:
#Get the big groups of users, like all people with brown hair.
group_of_users = Data.objects.filter(......)
#Now get all the people with brown hair and blue eyes, and then all with green eyes, etc.
for each haircolor:
    subset_of_group = group_of_users.filter(....)
That is just pseudo-code, by the way; I am not that inept. I thought this would be more efficient, but it seems that if I eliminate the first query and simply build the querysets in the for loop, it is much faster (I actually timed it).
I fear this is because when I filter first, and then filter again in the for loop, it is actually running both sets of filter queries on each loop execution, so really doing twice the amount of work I want. I thought that with caching this would not matter, as the first filter's results would be cached and it would still be faster, but again, I timed it with multiple tests and the single filter is faster. Any ideas?
EDIT:
So it seems that querying for a set of data, and then trying to further query only against that set of data, is not possible. Rather, I should query for a set of data and then further parse that data using regular Python.
As garnertb and lanzz said, it doesn't matter where you call the filter function; the only thing that matters is when you evaluate the query (see when querysets are evaluated). My guess is that in your tests you evaluate the queryset somewhere in your code, and that your test with the separate filter calls ends up doing more evaluations.
Whenever a queryset is evaluated, its results are cached. However, this cache does not carry over when you call another method, such as filter or order_by, on the queryset. So you can't evaluate the bigger set and then filter that queryset to retrieve the smaller sets without doing another query.
If you only have a small set of haircolours, you can get away with doing a query for each haircolour. However, if you have many of them, the number of queries will have a severe impact on performance. In that case it might be better to do a single query for the full set of users you want, and then do the subsequent processing in Python:
qs = Data.objects.filter(hair='brown')
objects = dict()
for obj in qs:
    objects.setdefault(obj.haircolour, []).append(obj)
for (k, v) in objects.items():
    print "Objects for colour '%s':" % k
    for obj in v:
        print "- %s" % obj
Filtering Django querysets does not perform any database operation, until you actually try to access the result. Filtering only adds conditions to the queryset, which are then used to build the final query when you access the result of the query.
When you assign group_of_users = Data.objects.filter(...), no data is retrieved from the database; you just get a queryset that knows you want records satisfying a specific condition (the filtering parameters you supplied to Data.objects.filter), but it does not pre-fetch those actual users. After that, when you assign subset_of_group = group_of_users.filter(....), you don't filter just that previous group of users; you only add more conditions to the queryset. Still no data has been retrieved from the database at this point.
Only when you actually try to access the results of the queryset (e.g. by iterating over it, slicing it, or accessing a single index in it) will the queryset build a (usually single) query that retrieves only the user records satisfying all the filtering conditions you have accumulated up to that point. It will still need to filter your entire users table to find those matching users; it cannot take advantage of "previously retrieved" users from the group_of_users = Data.objects.filter(...) queryset, because nothing was actually retrieved at that point.
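A toy model of this lazy behaviour in plain Python (`LazyQuery` is an illustration, not Django's implementation): each filter call only records a condition, and the full "table" is scanned once, at evaluation time, with all conditions applied together:

```python
class LazyQuery:
    """Toy sketch of lazy filtering; not Django's implementation."""
    def __init__(self, rows, conditions=()):
        self._rows = rows
        self._conditions = conditions

    def filter(self, pred):
        # Returns a NEW query with one more condition; nothing is scanned yet.
        return LazyQuery(self._rows, self._conditions + (pred,))

    def __iter__(self):
        # One pass over the full "table", applying every accumulated condition.
        return iter([r for r in self._rows
                     if all(c(r) for c in self._conditions)])


people = [{"hair": "brown", "eyes": "blue"},
          {"hair": "brown", "eyes": "green"},
          {"hair": "blond", "eyes": "blue"}]

group = LazyQuery(people).filter(lambda r: r["hair"] == "brown")
subset = group.filter(lambda r: r["eyes"] == "blue")  # still no scan
```

Note that evaluating `subset` rescans the whole list of people; it gains nothing from `group` having been defined first, which is the same reason the two-step filter in the question saved no work.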
Your approach is exactly right and it is efficient. The Querysets don't touch the database until they are evaluated, so you can add as many filters as you like and the database won't be touched. Django's excellent documentation provides all the information you need to figure out what operations cause the Queryset to be evaluated.
Django states in its docs that querysets are automatically cached: https://docs.djangoproject.com/en/dev/topics/db/queries/#caching-and-querysets. But the docs aren't super specific about the details of this behaviour.
The example that they gave was to save the qs in a python variable, and subsequent calls after the first will be taken from the cache.
queryset = Entry.objects.all()
print([p.headline for p in queryset]) # Evaluate the query set.
print([p.pub_date for p in queryset]) # Re-use the cache from the evaluation.
So if two identical queryset calls are made on separate requests, without re-using a saved variable, when a user loads a view, would the results not be cached?
# When the user loads the homepage, call number one (not cached)
def home(request):
    entries = Entry.objects.filter(something)
    return render_to_response(...)

# Call number two, is this cached automatically? Or do I need to import cache and
# manually do it? This is the same method as above, called twice
def home(request):
    entries = Entry.objects.filter(something)
    return render_to_response(...)
Sorry if this is confusing; I pasted the method twice to make it look like the user is calling it twice, but it's just one method. Are the entries automatically cached?
Thanks
The queryset example you have given rightly indicates that querysets are evaluated lazily, i.e. the first time they are used. When the same variable is used again within the same flow, the stored results are reused rather than re-evaluated. This is not caching in the usual sense, but the re-use of an already evaluated expression for as long as it is available.
For the kind of caching you are looking at, i.e. the same view called twice, you will need to manually cache the database results when they are fetched the first time. Memcached is good for this. Then check and fetch as in the example below:
def view(request):
    results = cache.get(request.user.id)
    if not results:
        results = do_a_ton_of_work()
        cache.set(request.user.id, results)
There are of course a lot of other ways to do caching at different levels, right from your proxy server down to per-URL caching. Whatever works best for you. Here is a good read on this topic.
It is not cached for two reasons:
When you use just filter but don't "loop" through the results, the queryset is not yet evaluated, which means its cache is still empty.
Even if it were evaluated, it would not be cached across calls, because when you call the function the second time the queryset is recreated (a new local variable), even though you already created it the first time you called the function. The second function call simply does not "know" what you did before; it is simply a new queryset instance. In this case you might rely on the database's own cache, though.
Memcached support is built into Django and works nicely, but you can also try johnny-cache for better results.
You can get more info here:
http://packages.python.org/johnny-cache/
I have a query whose results only change once a day. It seems like a waste to be performing that query on every request I get for that page. I am investigating using memcached for this.
How would I begin? Anyone have any suggestions or pitfalls I should avoid in using Django's caching? Should I cache at the template or at the view?
This question might seem vague but it's only because I've never dealt with caching before. So if there's something I could elaborate on, please just ask.
Elaboration
Per Ken Cochrane:
How often does this data change: The relevant data would be locked in for that calendar date. So, for example, I'll pull the data for 1/30/2011 and I'm okay with serving that cached copy for the whole day until 1/31/2011, when it would be refreshed.
Do I use this data in more than one place: Only in one view.
How much data is it going to be: An average of 10 model objects that contain about 15 fields with the largest being a CharField(max_length=120). I will trim the number of fields down using values() to about half of those.
Normally before I decide where to do the caching I ask myself a few questions.
How often does this data change?
Do I use this data in more than one place?
How much data is it going to be?
Since I don't know all of the details for your application, I'm going to make some assumptions.
you have a view that either takes in a date or uses the current date to query the database and pull out all of the calendar events for that date.
you only display this information on one template.
The amount of data isn't too large (fewer than 100 entries).
With these assumptions you have 3 options.
1. cache the templates
2. cache the view
3. cache the queryset
Normally when I do my caching I cache the queryset; this allows me greater control over how I want to cache the data, and I can reuse the same cached data in more than one place.
The easiest way I have found to cache the queryset is to do it in the ModelManager for the model in question. I would create a method like get_calendar_by_date(date) that handles the query and the caching for me. Here is a rough mockup:
CACHE_TIMEOUT_SECONDS = 60 * 60 * 24  # this is 24 hours

class CalendarManager(models.Manager):
    def get_calendar_by_date(self, by_date):
        """ assuming by_date is a datetime object """
        date_key = by_date.strftime("%m_%d_%Y")
        cache_key = 'CAL_DATE_%s' % date_key
        cal_date = cache.get(cache_key)
        if cal_date is not None:
            return cal_date
        # not in cache, get from database
        cal_date = self.filter(event_date=by_date)
        # set cal_date in cache for later use
        cache.set(cache_key, cal_date, CACHE_TIMEOUT_SECONDS)
        return cal_date
Some things to look out for when caching
Make sure the objects that you are storing in the cache can be pickled
Since memcache doesn't know what day it is, you need to make sure you don't over-cache. For example, if it is noon on Jan 21st and you cache for 24 hours, that calendar information will keep showing up until noon on Jan 22nd, which might not be what you want. So when you set the cache time, either use a small value so entries expire quickly, or calculate how long to cache so entries expire exactly when you want them to.
Make sure you know the size of the objects you want to cache. If your memcache instance only has 16MB of storage but you want to store 32MB of data, the cache isn't going to do you much good.
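For the over-caching concern above, the timeout can be computed so that the entry dies at the next midnight rather than a fixed 24 hours later. A small sketch using only the standard library:

```python
import datetime

def seconds_until_midnight(now=None):
    """Timeout that makes a cache entry expire at the next local midnight."""
    now = now or datetime.datetime.now()
    tomorrow = now.date() + datetime.timedelta(days=1)
    midnight = datetime.datetime.combine(tomorrow, datetime.time.min)
    return int((midnight - now).total_seconds())
```

In the manager above you would then call something like cache.set(cache_key, cal_date, seconds_until_midnight()) instead of the fixed CACHE_TIMEOUT_SECONDS, so the Jan 21st calendar can never leak into Jan 22nd.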
When caching the template or view you need to watch out for the following
Set your cache timeout so that it isn't too large. I don't think you can programmatically change the template cache timeout (it is hard-coded), so if you set it too high you will end up serving a page that is out of date. With the view, you should be able to change the cache time programmatically, so it is a little safer.
If you are caching the template and there is other information on the template that is dynamic and changes all the time, make sure you only put the cache tags around the section of the page you want cached for a while. If you put them in the wrong place you might end up with the wrong result.
Hopefully that gives you enough information to get started. Good Luck.
Read this first of all.
Django has the ability to cache template fragments with {% cache for_seconds fragment_name %}.
Just use cache tag.
http://docs.djangoproject.com/en/dev/topics/cache/
You can cache the results of a function by date with Python's built-in lru_cache, as long as the date param is a plain "2021-09-22"-style datetime.date and not a timestamp:
import datetime
from functools import lru_cache
@lru_cache(maxsize=1)
def my_func(date: datetime.date):
    if type(date) is not datetime.date:
        raise ValueError(f"This method is cached by calendar date, but received date {date}.")
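A runnable sketch of that pattern, with the expensive query simulated by a counter so the cache hit is visible (the counter and the returned string are illustrations, not part of the original answer):

```python
import datetime
from functools import lru_cache

calls = {"count": 0}   # counts the simulated expensive queries

@lru_cache(maxsize=1)
def my_func(date):
    if type(date) is not datetime.date:
        raise ValueError("This method is cached by calendar date, but received date {0}.".format(date))
    calls["count"] += 1                     # stands in for the expensive query
    return "results for {0}".format(date)

today = datetime.date(2021, 9, 22)
my_func(today)
my_func(today)   # second call is served from the cache; no second "query"
```

The type check matters because datetime.datetime is a subclass of datetime.date: passing a timestamp would give every call a distinct cache key and the cache would never hit.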
I was just looking over EveryBlock's source code and I noticed this code in the alerts/models.py code:
def _get_user(self):
    if not hasattr(self, '_user_cache'):
        from ebpub.accounts.models import User
        try:
            self._user_cache = User.objects.get(id=self.user_id)
        except User.DoesNotExist:
            self._user_cache = None
    return self._user_cache
user = property(_get_user)
I've noticed this pattern around a bunch, but I don't quite understand its use. Is the whole idea to make sure that when accessing the FK on self (self = alert object), you only grab the user object once from the db? Why wouldn't you just rely on the db's caching and Django's ForeignKey() field? I noticed that the model definition only holds the user id and not a foreign key field:
class EmailAlert(models.Model):
    user_id = models.IntegerField()
    ...
Any insights would be appreciated.
I don't know why this is an IntegerField; it looks like it definitely should be a ForeignKey(User) field. You lose things like select_related() and other features because of that, too.
As to the caching, many databases don't cache results--they (or rather, the OS) will cache the data on disk needed to get the result, so looking it up a second time should be faster than the first, but it'll still take work.
It also still takes a database round-trip to look it up. In my experience, with Django, doing an item lookup can take around 0.5 to 1ms, for an SQL command to a local Postgresql server plus sometimes nontrivial overhead of QuerySet. 1ms is a lot if you don't need it--do that a few times and you can turn a 30ms request into a 35ms request.
If your SQL server isn't local and you actually have network round-trips to deal with, the numbers get bigger.
Finally, people generally expect accessing a property to be fast; when they're complex enough to cause SQL queries, caching the result is generally a good idea.
Although databases do cache things internally, there's still an overhead in going back to the db every time you want to check the value of a related field - setting up the query within Django, the network latency in connecting to the db and returning the data over the network, instantiating the object in Django, etc. If you know the data hasn't changed in the meantime - and within the context of a single web request you probably don't care if it has - it makes much more sense to get the data once and cache it, rather than querying it every single time.
One of the applications I work on has an extremely complex home page containing a huge amount of data. Previously it was carrying out over 400 db queries to render. I've refactored it now so it 'only' uses 80, using very similar techniques to the one you've posted, and you'd better believe that it gives a massive performance boost.
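The _user_cache pattern from the question generalizes to any expensive property. A plain-Python sketch (the lookups counter is an illustration standing in for the SQL query): because of the hasattr check, the lookup runs at most once per instance, no matter how many times the property is read within a request.

```python
class EmailAlert(object):
    """Plain-Python sketch of the memoized-property pattern from the question."""
    lookups = 0   # counts simulated database hits, for illustration only

    def __init__(self, user_id):
        self.user_id = user_id

    def _get_user(self):
        if not hasattr(self, '_user_cache'):
            EmailAlert.lookups += 1              # stands in for the SQL query
            self._user_cache = {"id": self.user_id}
        return self._user_cache

    user = property(_get_user)


alert = EmailAlert(7)
alert.user   # first access performs the "query"
alert.user   # second access returns the cached object
```

The cache lives on the instance, so it naturally expires with the object, which in a Django view usually means at the end of the request; that is exactly the scope within which you don't care about staleness.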