I am running a web service where a user sends a word as a request, and I use that word to filter entries in my database (the default Django SQLite backend). The word-to-entry relationship is one-to-one.
That means there are two possible cases:
Case 1: The word exists in the database -> return the associated Entry.
Case 2: The word doesn't exist -> raise an exception.
The following lookup should then return a QuerySet with 1 or 0 objects:
Entry.objects.filter(word__iexact=word)
Expected Behavior:
Cases 1 and 2 do not differ perceptibly in speed.
Current Behavior:
Case 1 takes at most half a second.
Case 2 takes forever, around 1-2 minutes.
I find this puzzling. If an existing word can be looked up regardless of where it is in the database, then why does case 2 take forever? I am not a django or database expert, so I feel like I'm missing something here. Is it worth just setting up a different type of database to see if that helps?
Here is the relevant portion of my code. I'm defining a helper function that gets called from a view:
mysite/myapp/utils.py
from .models import Entry
def get_entry(word):
    if Entry.objects.filter(word__iexact=word).exists():
        queryset = Entry.objects.filter(
            word__iexact=word
        )  # Case-insensitive exact lookup
        entry = queryset[0]  # Retrieve entry from queryset
        return entry
    else:
        raise IndexError
This is normal, especially with a few million records on SQLite and, I'm assuming, no index on the word column.
A missing word always has to be checked against every record if there is no usable index, whereas a search for a word that exists can stop as soon as it is found. (There would be no noticeable difference if the word you were looking for happened to be the last one in table order.)
The early exit is possible because you're using a slice: the slice translates to a LIMIT clause, so the database can stop looking at the first match.
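If you control the model, one way to make both the hit and the miss fast is to store a normalized (lowercased) copy of the word and index it, so the lookup becomes an exact match that the index can answer. A minimal sketch, assuming a simple Entry model; the field names and lengths here are illustrative, not the asker's actual schema:

from django.db import models

class Entry(models.Model):
    word = models.CharField(max_length=100)
    # Assumed extra column: a lowercased copy with an index, so both
    # existing and missing words resolve via an index lookup instead of
    # a full table scan.
    word_normalized = models.CharField(max_length=100, db_index=True)

    def save(self, *args, **kwargs):
        # Keep the normalized column in sync on every save.
        self.word_normalized = self.word.lower()
        super(Entry, self).save(*args, **kwargs)

The lookup then becomes Entry.objects.filter(word_normalized=word.lower()), which finds existing words and rules out missing ones without scanning the whole table.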
Related
My goal is to optimize the retrieval of the count of objects I have in my Django model.
I have two models:
Users
Prospects
It's a one-to-many relationship. One User can create many Prospects. One Prospect can only be created by one User.
I'm trying to get the Prospects created by the user in the last 24 hours.
Prospects model has roughly 7 millions rows on my PostgreSQL database. Users only 2000.
My current code is taking too much time to get the desired results.
I tried to use filter() and count():
import datetime
# get the date, but 24 hours earlier
date_example = datetime.datetime.now() - datetime.timedelta(days=1)
# Filter Prospects that are created by user_id_example
# and whose create_date is greater than or equal to date_example
today_prospects = Prospect.objects.filter(user_id='user_id_example', create_date__gte=date_example)
# get the count of prospects that got created in the past 24 hours by user_id_example
# this is the problematic call that takes too long to process
count_total_today_prospects = today_prospects.count()
It works, but it takes too much time (5 minutes), because it's checking the entire database instead of checking only what I thought it would: the prospects created in the last 24 hours by that user.
I also tried using annotate, but it's equally slow, because it's ultimately doing the same thing as the regular .count():
today_prospects.annotate(Count('id'))
How can I get the count in a more optimized way?
Assuming that you don't have one already, I suggest adding an index that includes both the user and date fields. Make sure they are in this order (first the user, then the date), because for the user you are looking for an exact match, but for the date you only have a starting point. That should speed up the query.
For example:
class Prospect(models.Model):
    ...

    class Meta:
        ...
        indexes = [
            models.Index(fields=['user', 'create_date']),
        ]
        ...
Adding the index produces a new migration file (run makemigrations to create it and migrate to apply it), which adds the index to the database.
After that, your same code should run a bit faster:
count_total_today_prospects = Prospect.objects\
.filter(user_id='user_id_example', create_date__gte=date_example)\
.count()
Django's documentation:
A count() call performs a SELECT COUNT(*) behind the scenes, so you should always use count() rather than loading all of the records into Python objects and calling len() on the result (unless you need to load the objects into memory anyway, in which case len() will be faster).
Note that if you want the number of items in a QuerySet and are also retrieving model instances from it (for example, by iterating over it), it’s probably more efficient to use len(queryset) which won’t cause an extra database query like count() would.
If the queryset has already been fully retrieved, count() will use that length rather than perform an extra database query.
Take a look at this link: https://docs.djangoproject.com/en/3.2/ref/models/querysets/#count.
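For illustration only, reusing the names from the question, here is roughly what that trade-off looks like in code:

# count() issues a SELECT COUNT(*) and never loads rows into Python.
num = Prospect.objects.filter(user_id='user_id_example',
                              create_date__gte=date_example).count()

# len() forces the whole QuerySet to be fetched and turned into model
# instances first; only worth it if you also need the objects themselves.
prospects = Prospect.objects.filter(user_id='user_id_example',
                                    create_date__gte=date_example)
num = len(prospects)  # evaluates the queryset, then reuses the cached rows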
Try to use len().
This may be trivial, but I loaded a local DynamoDB instance with 30GB worth of Twitter data that I aggregated.
The primary key is id (tweet_id from the Tweet JSON), and I also store the date/text/username/geocode.
I basically am interested in mentions of two topics (let's say "Bees" and "Booze"). I want to get a count of each of those by state by day.
So by the end, I should know for each state, how many times each was mentioned on a given day. And I guess it'd be nice to export that as a CSV or something for later analysis.
Some issues I had with doing this...
First, the geocode info is a tuple of [latitude, longitude] so for each entry, I need to map that to a state. That I can do.
Second, is the most efficient way to do this to go through each entry and manually check if it contains a mention of either keyword and then have a dictionary for each that maps the date/location/count?
EDIT:
Since it took me 20 hours to load all the data into my table, I don't want to delete and re-create it. Perhaps I should create a global secondary index (?) and use that to search other fields in a query? That way I don't have to scan everything. Is that the right track?
EDIT 2:
Well, since the table is on my computer locally I should be OK with just using expensive operations like a Scan right?
So if I did something like this:
query = table.scan(
FilterExpression=Attr('text').contains("Booze"),
ProjectionExpression='id, text, date, geo',
Limit=100)
And did one scan for each keyword, then I would be able to go through the resulting filtered list and get a count of mentions of each topic for each state on a given day, right?
EDIT3:
response = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    Limit=100)
# do something with this first page of results

while 'LastEvaluatedKey' in response:
    response = table.scan(
        FilterExpression=Attr('text').contains("Booze"),
        Limit=100,
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    # do something with each batch of up to 100 entries
So something like that, for both keywords. That way I'll be able to go through the resulting filtered set and do what I want (in this case, figure out the location and day and create a final dataset with that info). Right?
EDIT 4
If I add:
ProjectionExpression='date, location, user, text'
into the scan request, I get an error saying "botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the Scan operation: Invalid ProjectionExpression: Attribute name is a reserved keyword; reserved keyword: location". How do I fix that?
NVM I got it. Answer is to look into ExpressionAttributeNames (see: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ExpressionPlaceholders.html)
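For anyone hitting the same error, a rough sketch of that fix (the attribute names mirror the question; giving every projected attribute a placeholder is harmless and sidesteps the reserved-word check entirely):

from boto3.dynamodb.conditions import Attr

response = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    # Placeholders for attribute names that may collide with reserved words.
    ProjectionExpression='#d, #loc, #u, #t',
    ExpressionAttributeNames={
        '#d': 'date',
        '#loc': 'location',
        '#u': 'user',
        '#t': 'text',
    },
    Limit=100,
)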
Yes, scanning the table for "Booze" and counting the items in the result should give you the total count. Please note that you need to keep scanning, passing the LastEvaluatedKey back in, until LastEvaluatedKey is no longer present in the response.
Refer to ExclusiveStartKey as well.
Scan
EDIT:
Yes, the code looks good. One thing to note: the result set won't always contain 100 items. Please refer to the Limit definition below (it is not the same as LIMIT in a SQL database).
Limit — (Integer) The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Query and Scan in the Amazon DynamoDB Developer Guide.
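Putting the pagination and the counting together, a rough sketch (assuming table is the boto3 Table resource from the question) that sums the per-page Count values until LastEvaluatedKey disappears:

from boto3.dynamodb.conditions import Attr

total = 0
scan_kwargs = {'FilterExpression': Attr('text').contains("Booze")}

while True:
    response = table.scan(**scan_kwargs)
    total += response['Count']  # number of matching items on this page
    if 'LastEvaluatedKey' not in response:
        break  # no more pages
    scan_kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

print(total)  # total mentions of "Booze" across the whole table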
I am aware that both a regular queryset and the iterator() queryset method evaluate and return the entire data set in one go.
For instance, take this:
my_objects = MyObject.objects.all()

for rows in my_objects:  # Way 1
    ...

for rows in my_objects.iterator():  # Way 2
    ...
Question
In both methods all the rows are fetched in a single go. Is there any way in Django that the queryset rows can be fetched one by one from the database?
Why this weird Requirement
At present my query fetches, let's say, n rows, but sometimes I get the Python and Django OperationalError (2006, 'MySQL server has gone away').
So, to work around this, I am currently using a weird while-looping logic. I was wondering if there is any native or built-in method for this, or is my question even logical in the first place? :)
I think you are looking to limit your query set.
Quote from above link:
Use a subset of Python’s array-slicing syntax to limit your QuerySet to a certain number of results. This is the equivalent of SQL’s LIMIT and OFFSET clauses.
In other words, if you start with a count, you can then loop over and take slices as you require them:
cnt = MyObject.objects.count()
start_point = 0
inc = 5

while start_point < cnt:
    # Slice from start_point to start_point + inc (LIMIT/OFFSET under the hood)
    filtered = MyObject.objects.all()[start_point:start_point + inc]
    # ... process the slice here ...
    start_point += inc
Of course, you may need to add more error handling to this.
Fetching row by row might be worse. You might want to retrieve rows in batches of 1000 or so. I have used this Django snippet (not my work) successfully with very large querysets. It doesn't eat up memory, and I've had no trouble with connections going away.
Here's the snippet from that link:
import gc

def queryset_iterator(queryset, chunksize=1000):
    '''
    Iterate over a Django QuerySet ordered by the primary key.

    This method loads a maximum of chunksize (default: 1000) rows into
    memory at the same time, while Django would normally load all rows
    into memory. Using the iterator() method only causes it to not
    preload all the model instances.

    Note that the implementation of the iterator does not support
    ordered query sets.
    '''
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row
        gc.collect()
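For illustration, calling it might look like this (MyObject and process are hypothetical placeholders, not part of the snippet above):

# Iterate a large table in chunks of 1000 rows without loading it all at once.
for row in queryset_iterator(MyObject.objects.all(), chunksize=1000):
    process(row)  # hypothetical per-row handler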
To solve the (2006, 'MySQL server has gone away') problem, your approach is not that logical. If you hit the database for each entry, it is going to increase the number of queries, which will itself create problems in the future as the usage of your application grows.
I think you should close the MySQL connection after iterating over all elements of the result; then, if you try to make another query, Django will create a new connection.
from django.db import connection
connection.close()
Refer to this for more details.
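A minimal sketch of that idea (MyObject and process are hypothetical placeholders; the point is simply that the next ORM query after close() gets a fresh connection):

from django.db import connection

for row in MyObject.objects.all().iterator():
    process(row)  # hypothetical long-running per-row work

# Drop the (possibly stale) connection; Django transparently opens a new
# one the next time a query is executed.
connection.close()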
New to using Python NDB.
I have something like:
class User(ndb.Model):
    seen_list = ndb.KeyProperty(kind=Survey, repeated=True)

class Survey(ndb.Model):
    same = ndb.StringProperty(required=True)
I want to be able to query for users that have not seen certain surveys.
What I am doing now is:
users = User.query(seen_list != 'survey name').fetch()
This does not work. What would be the proper way to do this? Should I first query the Survey list to get the key of the survey with a certain name? Is the != part correct?
I could not find any examples similar to this.
Thanks.
Unfortunately, if your survey list is a repeated property, it won't work that way. When you query a repeated property, the datastore tries EVERY entry in your list, and if one matches, it returns the item. So when you say != 'survey name 1', if you have at least ONE entry in your list that isn't 'survey name 1', the user comes back as a match, even if another entry IS 'survey name 1'.
It's counterintuitive if you come from a SQL background, I know. The only way to get around that is to filter programmatically, evaluating the results your query returns. It comes from the fact that, for repeated values, Bigtable "flattens" your results, which means it creates one entry for EVERY value in your repeated attribute. So as it scans, it eventually finds one "correct" row with your info, grabs the object's key from there, and returns the object.
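A rough sketch of that programmatic filtering, using the question's models (looking the survey up by its same field is an assumption about how you identify it):

# Find the survey's key, then keep only users whose seen_list lacks it.
survey = Survey.query(Survey.same == 'survey name').get()
unseen_users = [user for user in User.query().fetch()
                if survey.key not in user.seen_list]

This fetches every User and filters in Python, which is exactly the cost described above; it is fine for modest user counts but will not scale indefinitely.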
At the moment I query my database this way:
for author in session.query(Author).filter(Author.queried==0).slice(0, 1000):
    print "Processing:", author
    # do stuff and commit later on
This means every 1000 authors I have to restart the script.
Is it possible to make the script run endless (or as long as there are authors)? I mean by that, if it is possible to turn
session.query(Author).filter(Author.queried==0).slice(0, 1000)
into some sort of generator which yields the next author for which queried==0 is true.
Query objects can be treated as iterators as-is. SQL will be executed once you start using data from the query iterator. Example:
for author in session.query(Author).filter(Author.queried==0):
    print "Processing: ", author
Your question uses the word "endlessly", so a word of caution. SQL is not an event-processing API; you cannot simply do a query that runs "forever" and spits out each new row as it is added to a table. I wish that were possible, but it isn't.
If your intent is to detect new rows, you will have to poll regularly with the same query, and design an indicator into your data model that allows you to tell you which rows are new. You seem to be representing that now with a queried column. In that case, within the for loop above you might set author.queried = 1 and session.add(author). But you can't session.commit() inside the loop.
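A rough sketch of that polling pattern under the question's model (poll_interval is an assumed knob; the commit deliberately happens once per batch, outside the row loop):

import time

poll_interval = 60  # seconds between polls; an assumption for illustration

while True:
    batch = session.query(Author).filter(Author.queried == 0).all()
    for author in batch:
        print "Processing:", author
        author.queried = 1   # mark as handled so the next poll skips it
        session.add(author)
    session.commit()         # one commit per batch, not per row
    time.sleep(poll_interval)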
Since the query is turned into an equivalent SQL SELECT statement you will only be able to get the set of Author rows where queried is 0 that existed when this transaction started. Any updates to the Author queried column will not change the current SELECT set.
If you want to continue processing all the Author rows, even if there are more than 1000, then you could do:

for author in session.query(Author).filter(Author.queried==0):
    print "Processing:", author
The __iter__ method on the Query object will be called automatically, and it returns the same iterator as calling instances() on the Query object.