Fetching queryset data one by one - python

I am aware that both a regular queryset and the iterator() queryset method evaluate and return the entire data set in one shot.
For instance, take this:
my_objects = MyObject.objects.all()
for rows in my_objects: # Way 1
for rows in my_objects.iterator(): # Way 2
Question
In both methods all the rows are fetched in a single go. Is there any way in Django that the queryset rows can be fetched one by one from the database?
Why this weird Requirement
At present my query fetches, let's say, n rows, but sometimes I get the Python/Django OperationalError (2006, 'MySQL server has gone away').
So, as a workaround, I am currently using a weird while-loop logic. I was wondering if there is any native or built-in method for this, or whether my question is even logical in the first place! :)

I think you are looking to limit your query set.
Quote from above link:
Use a subset of Python’s array-slicing syntax to limit your QuerySet to a certain number of results. This is the equivalent of SQL’s LIMIT and OFFSET clauses.
In other words, if you start with a count you can then loop over and take slices as you require them:
cnt = MyObject.objects.count()
start_point = 0
inc = 5
while start_point < cnt:
    filtered = MyObject.objects.all()[start_point:start_point + inc]
    # ... process the objects in `filtered` here ...
    start_point += inc
Of course you may need to add more error handling to this.

Fetching row by row might be worse. You might want to retrieve in batches of 1000 or so instead. I have used this Django snippet (not my work) successfully with very large querysets. It doesn't eat up memory, and I had no trouble with connections going away.
Here's the snippet from that link:
import gc

def queryset_iterator(queryset, chunksize=1000):
    '''
    Iterate over a Django Queryset ordered by the primary key.

    This method loads a maximum of chunksize (default: 1000) rows in
    memory at the same time, while Django normally would load all rows
    into memory. Using the iterator() method only causes it not to
    preload all the classes.

    Note that the implementation of the iterator does not support
    ordered query sets.
    '''
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row
        gc.collect()
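For example, in your case you could then consume the queryset like this (do_something is just a placeholder for your own per-row logic):
for row in queryset_iterator(MyObject.objects.all()):
    do_something(row)  # placeholder for whatever you do with each row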

To solve the (2006, 'MySQL server has gone away') problem, your approach is not that logical. If you hit the database for each entry, it is going to increase the number of queries, which itself will create problems in the future as usage of your application grows.
I think you should close the MySQL connection after iterating over all the elements of the result; then, if you try to make another query, Django will create a new connection.
from django.db import connection
connection.close()
Refer to this for more details.
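A rough sketch of what I mean, reusing the slicing approach from above (process_batch is just a placeholder for your own logic):
from django.db import connection

cnt = MyObject.objects.count()
inc = 1000
for start in range(0, cnt, inc):
    batch = list(MyObject.objects.all()[start:start + inc])
    process_batch(batch)   # placeholder for whatever you do with the rows
    connection.close()     # Django opens a fresh connection on the next query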

Related

Return fixed number of items from dynamo db query with filter expression

I am trying to retrieve a fixed number (let's take it as 5 for now) of items from a DynamoDB table.
This is the code I am using.
response = table.query(
    KeyConditionExpression=Key('pk').eq('goo'),
    Limit=5,
    FilterExpression=Attr('goo').eq('bar'))
I am getting only 4 items from this. But if I remove the FilterExpression, the item count will be 5. So is there any other way to get a fixed number of items even if I am using a FilterExpression?
Filter Expressions are applied after items are read from the table in order to reduce the number of records sent over the wire. The Limit is applied during the query operation, i.e. before the filter expression.
If the Query reads 5 items and only 4 of them match the FilterExpression, you're getting only 4 items back.
The pragmatic thing would be to remove the limit from the Query and apply the limit client-side. The drawback is that you may pay for more Read Capacity Units.
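A minimal sketch of that client-side approach (it ignores pagination via LastEvaluatedKey, which the paginator in the answer below takes care of, and it may read more than you need):
response = table.query(
    KeyConditionExpression=Key('pk').eq('goo'),
    FilterExpression=Attr('goo').eq('bar'))   # no Limit here
items = response['Items'][:5]                 # trim to 5 on the client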
If you want to avoid that, you may have to reconsider your data model - a generic solution is difficult here.
In your specific case, you could create a Global Secondary Index with the partition key pk and the sort key goo (it doesn't have to be unique for GSIs). You can then fire your Query against the GSI with Limit 5 and it will give you what you want. But: you pay for the GSI storage + throughput.
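For illustration, a query against such a GSI could look like this (pk-goo-index is a made-up index name; use whatever you call yours):
response = table.query(
    IndexName='pk-goo-index',   # hypothetical GSI: partition key pk, sort key goo
    KeyConditionExpression=Key('pk').eq('goo') & Key('goo').eq('bar'),
    Limit=5)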
Edit: This question is pretty much a duplicate except for the Python code
This is the answer I found.
paginator = dynamo_db_client.get_paginator('query')
response_iterator = paginator.paginate(
    TableName='table_name',
    KeyConditionExpression='#P=:p',
    FilterExpression='#T=:t',
    ExpressionAttributeNames={'#P': 'pk', '#T': 'goo'},
    ExpressionAttributeValues={
        ':p': {'S': 'goo'},
        ':t': {'S': 'bar'}
    },
    PaginationConfig={
        'MaxItems': 5
    }
)
for page in response_iterator:
    print(len(page['Items']))
Link for the paginator doc: DynamoDB.Paginator.Query

How to retrieve count objects faster on Django?

My goal is to optimize the retrieval of the count of objects I have in my Django model.
I have two models:
Users
Prospects
It's a one-to-many relationship. One User can create many Prospects. One Prospect can only be created by one User.
I'm trying to get the Prospects created by the user in the last 24 hours.
The Prospects model has roughly 7 million rows in my PostgreSQL database; Users only 2,000.
My current code is taking too much time to get the desired results.
I tried to use filter() and count():
import datetime
# get the date but 24 hours earlier
date_example = datetime.datetime.now() - datetime.timedelta(days=1)
# Filter Prospects that are created by user_id_example
# and filter Prospects that got a date greater than date_example (so equal or sooner)
today_prospects = Prospect.objects.filter(user_id='user_id_example', create_date__gte=date_example)
# get the count of prospects that got created in the past 24 hours by user_id_example
# this is the problematic call that takes too long to process
count_total_today_prospects = today_prospects.count()
It works, but it takes too much time (5 minutes), because it's checking the entire database instead of just checking what I thought it would: only the prospects that were created in the last 24 hours by the user.
I also tried using annotate, but it's equally slow, because it's ultimately doing the same thing as the regular .count():
today_prospects.annotate(Count('id'))
How can I get the count in a more optimized way?
Assuming that you don't have it already, I suggest adding an index that includes both user and date fields (make sure that they are in this order, first the user and then the date, because for the user you are looking for an exact match but for the date you only have a starting point). That should speed up the query.
For example:
class Prospect(models.Model):
    ...
    class Meta:
        ...
        indexes = [
            models.Index(fields=['user', 'create_date']),
        ]
    ...
This should create a new migration file (run makemigrations and migrate) where it adds the index to the database.
After that, your same code should run a bit faster:
count_total_today_prospects = Prospect.objects\
    .filter(user_id='user_id_example', create_date__gte=date_example)\
    .count()
Django's documentation:
A count() call performs a SELECT COUNT(*) behind the scenes, so you should always use count() rather than loading all of the record into Python objects and calling len() on the result (unless you need to load the objects into memory anyway, in which case len() will be faster).
Note that if you want the number of items in a QuerySet and are also retrieving model instances from it (for example, by iterating over it), it’s probably more efficient to use len(queryset) which won’t cause an extra database query like count() would.
If the queryset has already been fully retrieved, count() will use that length rather than perform an extra database query.
Take a look at this link: https://docs.djangoproject.com/en/3.2/ref/models/querysets/#count.
Try to use len().
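That only pays off if you are loading the prospects into memory anyway, as the quoted docs say; for example:
today_prospects = list(today_prospects)               # evaluates the queryset once
count_total_today_prospects = len(today_prospects)    # no extra COUNT(*) query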

Exists check takes forever in Django SQLite

I am running a webservice where a user sends a word as a request, and I use that word to filter entries in my database (the default Django SQLite). The relationship word-to-entry is one-to-one.
That means there are two possible cases:
The word exists in the database -> Return the associated Entry.
The word doesn't exist -> Throw exception.
The following lookup should then return a QuerySet with 1 or 0 objects:
Entry.objects.filter(word__iexact=word)
Expected Behavior:
Cases 1 and 2 do not differ perceptibly in speed.
Current Behavior:
Case 1 takes at most half a second.
Case 2 takes forever, around 1-2 minutes.
I find this puzzling. If an existing word can be looked up regardless of where it is in the database, then why does case 2 take forever? I am not a django or database expert, so I feel like I'm missing something here. Is it worth just setting up a different type of database to see if that helps?
Here is the relevant portion of my code. I'm defining a helper function that gets called from a view:
mysite/myapp/utils.py
from .models import Entry

def get_entry(word):
    if Entry.objects.filter(word__iexact=word).exists():
        queryset = Entry.objects.filter(
            word__iexact=word
        )  # Case-insensitive exact lookup
        entry = queryset[0]  # Retrieve entry from queryset
        return entry
    else:
        raise IndexError
This is normal, especially with a few million records on SQLite and, I'm assuming, without an index.
A missing word will always have to go through all records if there is no usable index. A word that is found terminates the search as soon as it is found; there's no noticeable difference only when the word you are looking for happens to be the last word in table order.
And it terminates early because you're using a slice: the slice uses LIMIT, so the database can stop looking at the first match.
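As a side note, you can collapse the exists() check and the second query into one limited query; a sketch of the same helper:
def get_entry(word):
    entry = Entry.objects.filter(word__iexact=word).first()  # single query with LIMIT 1
    if entry is None:
        raise IndexError
    return entry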

How do I tell if the returned cursor is the last cursor in App Engine

I apologize if I am missing something really obvious.
I'm making successive calls to App Engine using cursors. How do I tell if I'm on the last cursor? The current way I'm doing it is to save the last cursor and then test whether that cursor equals the currently returned cursor. This requires an extra call to the datastore, though, which is probably unnecessary.
Is there a better way to do this?
Thanks!
I don't think there's a way to do this with ext.db in a single datastore call, but with ndb it is possible. Example:
query = Person.query(Person.name == 'Guido')
result, cursor, more = query.fetch_page(10)
If using the returned cursor will result in more records, more will be True. This is done smartly, in a single RPC call.
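And if more is True, you can feed the cursor back in to get the next page (a sketch, using the same page size of 10):
if more:
    next_results, next_cursor, more = query.fetch_page(10, start_cursor=cursor)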
Since you say 'last cursor' I assume you are using cursors for some kind of pagination, which implies you will be fetching results in batches with a limit.
In that case you know you are on the last cursor when you have fewer results returned than your limit.
limit = 100
results = Entity.all().with_cursor('x').fetch(limit)
if len(results) < limit:
    # then there's no point trying to fetch another batch after this one
    pass
If you mean "has this cursor hit the end of the search results", then no, not without picking the cursor up and trying it again. If more entities are added that match the original search criteria, such that they logically land "after" the cursor (e.g., a query that sorts by an ascending timestamp), then reusing that saved cursor will let you retrieve those new entities.
I use the same technique Chris Familoe describes, but set the limit to 1 more than I wish to return. So, in Chris' example, I would fetch 101 entities; 101 returned means I have another page with at least 1 entity on it.
recs = db_query.fetch(limit + 1, offset)
# if fewer records returned than requested, we've reached the end
if len(recs) < limit + 1:
    lastpage = True
    entries = recs
else:
    lastpage = False
    entries = recs[:-1]
I know this post is kind of old but I was looking for a solution to the same problem. I found it in this excellent book:
http://shop.oreilly.com/product/0636920017547.do
Here is the tip:
results = query.fetch(RESULTS_FOR_PAGE)
new_cursor = query.cursor()
query.with_cursor(new_cursor)
has_more_results = query.count(1) == 1

SQLAlchemy: Scan huge tables using ORM?

I am currently playing around with SQLAlchemy a bit, which is really quite neat.
For testing I created a huge table containing my pictures archive, indexed by SHA1 hashes (to remove duplicates :-)), which was impressively fast...
For fun I did the equivalent of a select * over the resulting SQLite database:
session = Session()
for p in session.query(Picture):
print(p)
I expected to see hashes scrolling by, but instead it just kept scanning the disk. At the same time, memory usage was skyrocketing, reaching 1GB after a few seconds. This seems to come from the identity map feature of SQLAlchemy, which I thought was only keeping weak references.
Can somebody explain this to me? I thought that each Picture p would be collected after the hash is written out!?
Okay, I just found a way to do this myself. Changing the code to
session = Session()
for p in session.query(Picture).yield_per(5):
print(p)
loads only 5 pictures at a time. It seems like the query will load all rows at once by default. However, I don't yet understand the disclaimer on that method. Quote from the SQLAlchemy docs:
WARNING: use this method with caution; if the same instance is present in more than one batch of rows, end-user changes to attributes will be overwritten.
In particular, it’s usually impossible to use this setting with eagerly loaded collections (i.e. any lazy=False) since those collections will be cleared for a new load when encountered in a subsequent result batch.
So if using yield_per is actually the right way (tm) to scan over copious amounts of SQL data while using the ORM, when is it safe to use it?
Here's what I usually do for this situation:
def page_query(q):
    offset = 0
    while True:
        r = False
        for elem in q.limit(1000).offset(offset):
            r = True
            yield elem
        offset += 1000
        if not r:
            break

for item in page_query(Session.query(Picture)):
    print(item)
This avoids the various buffering that DBAPIs do as well (such as psycopg2 and MySQLdb). It still needs to be used appropriately if your query has explicit JOINs, although eagerly loaded collections are guaranteed to load fully since they are applied to a subquery which has the actual LIMIT/OFFSET supplied.
I have noticed that Postgresql takes almost as long to return the last 100 rows of a large result set as it does to return the entire result (minus the actual row-fetching overhead) since OFFSET just does a simple scan of the whole thing.
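If that OFFSET scan becomes a problem, one variation is to paginate on the primary key instead; a sketch, assuming Picture has an integer primary key column named id:
def keyset_query(q, batch_size=1000):
    last_id = 0
    while True:
        # filter on the primary key instead of using a growing OFFSET
        batch = q.filter(Picture.id > last_id).order_by(Picture.id).limit(batch_size).all()
        if not batch:
            break
        for elem in batch:
            yield elem
        last_id = batch[-1].id

for item in keyset_query(Session.query(Picture)):
    print(item)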
You can defer the picture column so that it is only retrieved on access. You can do it on a query-by-query basis, like this:
session = Session()
for p in session.query(Picture).options(sqlalchemy.orm.defer("picture")):
    print(p)
Or you can do it in the mapper:
mapper(Picture, pictures, properties={
    'picture': deferred(pictures.c.picture)
})
How to do it is covered in the documentation here.
Doing it either way will make sure that the picture is only loaded when you access the attribute.
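To be explicit about what that means (the sha1 column name below is only an assumption about your mapping):
for p in session.query(Picture).options(sqlalchemy.orm.defer("picture")):
    print(p.sha1)    # assumed hash column name; no picture blob loaded yet
    # p.picture      # touching this attribute would emit a separate SELECT for the blob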
