I'd like to have a Django application record how much time each SQL query took.
The first problem is that SQL queries differ, even when they originate from the same code. That can be solved by normalizing them, so that
SELECT first_name, last_name FROM people WHERE NOW() - birth_date < interval '20' years;
would become something like
SELECT $ FROM people WHERE $ - birth_date < $;
After getting that done, we could just log the normalized query and the query timing to a file, syslog or statsd (for statsd, I'd probably also use a hash of the query as a key, and keep an index of hash->query relations elsewhere).
The bigger problem, however, is figuring out where that action can be performed. The best place I could find for it is this: https://github.com/django/django/blob/b5bacdea00c8ca980ff5885e15f7cd7b26b4dbb9/django/db/backends/util.py#L46 (note: we do use that ancient version of Django, but I'm fine with suggestions that are relevant only to newer versions).
Ideally, I'd like to make this a Django extension rather than modifying the Django source code. It sounds like I could make another backend, inheriting from the one we currently use, and have its CursorWrapper class's execute method record the timing and a counter.
Is that the right approach, or should I be using some other primitives, like QuerySet or something?
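For concreteness, here is roughly what I have in mind: an untested sketch in which the module path, the logger name, and the choice of the postgresql_psycopg2 base backend are all my assumptions (the exact hook point also differs between Django versions).

# myproject/timing_backend/base.py  (hypothetical module path)
import logging
import time

from django.db.backends.postgresql_psycopg2 import base as pg_base  # assumed base backend

logger = logging.getLogger('sql.timing')


class TimingCursorWrapper(object):
    """Delegates to a real cursor, logging elapsed time for each execute()."""

    def __init__(self, cursor):
        self.cursor = cursor

    def execute(self, sql, params=None):
        start = time.time()
        try:
            return self.cursor.execute(sql, params)
        finally:
            elapsed_ms = (time.time() - start) * 1000
            # Query normalization and hashing would happen here before logging.
            logger.info('%.3f ms: %s', elapsed_ms, sql)

    def __getattr__(self, attr):
        return getattr(self.cursor, attr)

    def __iter__(self):
        return iter(self.cursor)


class DatabaseWrapper(pg_base.DatabaseWrapper):
    def cursor(self, *args, **kwargs):
        cursor = super(DatabaseWrapper, self).cursor(*args, **kwargs)
        return TimingCursorWrapper(cursor)

Pointing DATABASES['default']['ENGINE'] at the package containing this base module (here, 'myproject.timing_backend') should then make Django pick up the subclassed wrapper.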
The Django Debug Toolbar has a panel that shows "SQL queries including time to execute and links to EXPLAIN each query":
http://django-debug-toolbar.readthedocs.io/en/stable/panels.html#sql
I am kind of new to Python and Django.
I am using bulk_create to insert a lot of rows, and as a former DBA I would very much like to see what INSERT statements are being executed. I know that for queries you can use .query, but for insert statements I can't find a command.
Is there something I'm missing or is there no easy way to see it? (A regular print is fine by me.)
The easiest way is to set DEBUG = True and check connection.queries after executing the query. This stores the raw queries and the time each query takes.
from django.db import connection  # query logging requires DEBUG = True
MyModel.objects.bulk_create(...)
# Each entry in connection.queries has 'sql' and 'time' keys.
print(connection.queries[-1]['sql'])
There's more information in the docs.
A great tool to make this information easily accessible is the django-debug-toolbar.
I've created a bulk delete function that updates all enabled items' is_active flag. I've tried updating 5000 records with the following statement
Item.objects.filter(owner=request.user.profile, enabled=True, is_active=True).update(is_active=False)
But it is painfully slow and I'm afraid that this is causing my server to run out of memory.
I've previously had the following and it was still quite slow.
items = Item.objects.filter(owner=request.user.profile, enabled=True, is_active=True)
for item in items:
    item.is_active = False
    item.save()
The database being used is SQLite and I am using Django 1.7.
I wish to optimize this operation as much as possible. Any pointers or good query optimization docs would be appreciated.
You say that you are deleting, but in your code you are updating the rows rather than deleting them. Aside from that, the format you are using in the first snippet is the way to go.
To increase performance you can use index_together with the owner, enabled and is_active fields (note that this adds some overhead when inserting items).
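A minimal sketch of what that could look like, assuming the field names from the filter in the question (the ForeignKey target 'Profile' is a guess; newer Django versions would use Meta.indexes instead):

class Item(models.Model):
    owner = models.ForeignKey('Profile')  # guessed target model
    enabled = models.BooleanField(default=True)
    is_active = models.BooleanField(default=True)

    class Meta:
        # One composite index covering the three columns used in the filter.
        index_together = [('owner', 'enabled', 'is_active')]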
But, as #Selcuk commented, if you are aiming for performance, use a more serious database backend such as PostgreSQL.
Btw, take a look at the database optimization docs Django offers so you can learn some tricks for future implementations ;).
Short Question
What is the default order of a list returned from a Django filter call when connected to a PostgreSQL database?
Background
By my own admission, I made a poor assumption at the application layer: that the order in which a list is returned is constant even without using 'order_by'. The list of items I was querying is not in alphabetical order or any other deliberate order; I assumed it would stay in the same order in which the items were added to the database.
This assumption held true for hundreds of queries, but my application reported a failure when the order changed unexpectedly. To my knowledge, none of these records were touched during this time, as I am the only person who maintains the DB. To add to the confusion, when running the Django app on Mac OS X it still worked as expected, but on Win XP the order changed. (Note that the hundreds of queries mentioned above were run on Win XP.)
Any insight into this would be helpful, as I could not find anything in the Django or PostgreSQL documentation that explains the difference between operating systems.
Example Call
required_tests = Card_Test.objects.using(get_database()).filter(name__icontains=key)
EDIT
After speaking with some colleagues of mine today, I came up with the same answer as Björn Lindqvist.
Looking back, I definitely understand why this is done wrong so often. One of the benefits of using an ORM (Django, SQLAlchemy, or whatever) is that you can write commands without having to know or understand (in detail) the database it's connected to. Admittedly, I happen to have been one of those users. The flip side, however, is that without knowing the database in detail, debugging errors like this is quite troublesome and potentially catastrophic.
There is NO DEFAULT ORDER, a point that cannot be emphasized enough because everyone gets it wrong.
A table in a database is not an ordinary HTML table; it is an unordered set of tuples. This often surprises programmers used only to MySQL, because in that particular database the order of the rows is often predictable, due to it not taking advantage of some advanced optimization techniques. For example, it is not possible to know which rows will be returned, or in what order, in any of the following queries:
select * from table limit 10
select * from table limit 10 offset 10
select * from table order by x limit 10
In the last query, the order is only predictable if all values in column x are unique. The RDBMS is free to return rows in any order it pleases as long as it satisfies the conditions of the SELECT statement.
Though you may add a default ordering on the Django level, which causes it to add an order by clause to every non-ordered query:
class Table(models.Model):
    ...

    class Meta:
        ordering = ['name']
Note that it may be a performance drag, if for some reason you don't need ordered rows.
If you want to have them returned in the order they were inserted:
Add the following to your model:
created = models.DateTimeField(auto_now_add=True, db_index=True)
# last_modified = models.DateTimeField(auto_now=True, db_index=True)

class Meta:
    ordering = ['created']
    # ordering = ['-last_modified']  # sort last modified first
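With that in place, non-ordered queries pick up the ORDER BY automatically, and you can still override it per call. Reusing the example call from the question (get_database() and key come from the original snippet, and this assumes Card_Test gains the created field above):

required_tests = (Card_Test.objects.using(get_database())
                  .filter(name__icontains=key)
                  .order_by('created'))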
I was just looking over EveryBlock's source code and I noticed this code in the alerts/models.py code:
def _get_user(self):
    if not hasattr(self, '_user_cache'):
        from ebpub.accounts.models import User
        try:
            self._user_cache = User.objects.get(id=self.user_id)
        except User.DoesNotExist:
            self._user_cache = None
    return self._user_cache
user = property(_get_user)
I've noticed this pattern around a bunch, but I don't quite understand the use. Is the whole idea to make sure that, when accessing the FK on self (self = alert object), you only grab the user object once from the db? Why wouldn't you just rely upon the db caching and Django's ForeignKey() field? I noticed that the model definition only holds the user id and not a foreign key field:
class EmailAlert(models.Model):
    user_id = models.IntegerField()
    ...
Any insights would be appreciated.
I don't know why this is an IntegerField; it looks like it should definitely be a ForeignKey(User) field--and because of that you also lose things like select_related() here, among other things.
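A hypothetical rewrite with a real ForeignKey, to show what you'd gain (the import path follows the question's code; some_alert_id is a placeholder):

from django.db import models
from ebpub.accounts.models import User

class EmailAlert(models.Model):
    user = models.ForeignKey(User)  # Django 2.0+ also requires on_delete=...
    ...

# One query fetches the alert together with its user:
alert = EmailAlert.objects.select_related('user').get(id=some_alert_id)
alert.user  # no extra query here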
As to the caching, many databases don't cache results--they (or rather, the OS) will cache the data on disk needed to get the result, so looking it up a second time should be faster than the first, but it'll still take work.
It also still takes a database round-trip to look it up. In my experience with Django, an item lookup can take around 0.5 to 1 ms for an SQL command to a local PostgreSQL server, plus the sometimes nontrivial overhead of QuerySet. 1 ms is a lot if you don't need it--do that a few times and you can turn a 30 ms request into a 35 ms request.
If your SQL server isn't local and you actually have network round-trips to deal with, the numbers get bigger.
Finally, people generally expect accessing a property to be fast; when they're complex enough to cause SQL queries, caching the result is generally a good idea.
Although databases do cache things internally, there's still an overhead in going back to the db every time you want to check the value of a related field - setting up the query within Django, the network latency in connecting to the db and returning the data over the network, instantiating the object in Django, etc. If you know the data hasn't changed in the meantime - and within the context of a single web request you probably don't care if it has - it makes much more sense to get the data once and cache it, rather than querying it every single time.
One of the applications I work on has an extremely complex home page containing a huge amount of data. Previously it was carrying out over 400 db queries to render. I've refactored it now so it 'only' uses 80, using very similar techniques to the one you've posted, and you'd better believe that it gives a massive performance boost.
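For what it's worth, newer Django versions ship a helper that implements this exact pattern more concisely. A sketch, assuming django.utils.functional.cached_property is available (Django 1.4+):

from django.db import models
from django.utils.functional import cached_property

class EmailAlert(models.Model):
    user_id = models.IntegerField()

    @cached_property
    def user(self):
        # Hits the database on first access only; the result is then stored
        # on the instance for the lifetime of the object (typically one request).
        from ebpub.accounts.models import User
        try:
            return User.objects.get(id=self.user_id)
        except User.DoesNotExist:
            return None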
I am trying to make a search view in Django. It is a search form with freetext input + some options to select, so that you can filter on years and so on. This is some of the code I have in the view so far, the part that does the filtering. And I would like some input on how expensive this would be on the database server.
soknad_list = Soknad.objects.all()
if var1:
    soknad_list = soknad_list.filter(pub_date__year=var1)
if var2:
    soknad_list = soknad_list.filter(muncipality__name__exact=var2)
if var3:
    soknad_list = soknad_list.filter(genre__name__exact=var3)

# TEXT SEARCH
stop_word_list = re.compile(STOP_WORDS, re.IGNORECASE)
search_term = '%s' % request.GET['q']
cleaned_search_term = stop_word_list.sub('', search_term)
cleaned_search_term = cleaned_search_term.strip()
if len(cleaned_search_term) != 0:
    soknad_list = soknad_list.filter(Q(dream__icontains=cleaned_search_term) | Q(tags__icontains=cleaned_search_term) | Q(name__icontains=cleaned_search_term) | Q(school__name__icontains=cleaned_search_term))
So what I do is first make a queryset of all objects, then check which variables exist (I fetch these with GET at an earlier point) and filter the results accordingly. But this doesn't seem too elegant; it probably does a lot of queries to achieve the result, so is there a better way to do this?
It does exactly what I want, but I guess there is a better/smarter way to do this. Any ideas?
filter itself doesn't execute a query; no query is executed until you explicitly fetch items from the queryset (e.g. via get(), iterating over it, or calling list() on it).
You can see the query that will be generated by using:
soknad_list.query.as_sql()[0]
You can then put that into your database shell to see how long the query takes, or use EXPLAIN (if your database backend supports it) to see how expensive it is.
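On more recent Django versions there are a couple of equivalent ways to inspect the SQL (a sketch; connection.queries requires DEBUG = True):

# str() on the underlying Query object renders the SQL Django would send
# (parameters are interpolated naively, so treat it as a debugging aid):
print(soknad_list.query)

# Or inspect what actually ran:
from django.db import connection
list(soknad_list)  # force evaluation
print(connection.queries[-1])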
As Aaron mentioned, you should get hold of the query text that is going to be run against the database and use EXPLAIN (or some other method) to view the query execution plan. Once you have the execution plan for the query, you can see what is going on in the database itself. There are a lot of operations that seem very expensive to run through procedural code but are trivial for any database, especially if you provide indexes the database can use to speed up your query.
If I read your question correctly, you're retrieving a result set of all rows in the Soknad table. Once you have these results back, you use the filter() method to trim your results down to meet your criteria. From looking at the Django documentation, it looks like this will do an in-memory filter rather than re-query the database (of course, this really depends on which data access layer you're using and not on Django itself).
The most optimal solution would be to use a full-text search engine (Lucene, Ferret, etc.) to handle this for you. If that is not available or practical, the next best option would be to construct a query predicate (WHERE clause) before issuing your query to the database and let the database perform the filtering.
However, as with all things that involve the database, the real answer is 'it depends.' The best suggestion is to try out several different approaches using data that is close to production and benchmark them over at least 3 iterations before settling on a final solution to the problem. It may be just as fast, or even faster, to filter in memory rather than filter in the database.