Specifying limit and offset in a Django QuerySet won't work - python

I'm using Django 1.6.5 and have MySQL's general query log on, so I can see the SQL hitting MySQL.
I noticed that specifying a larger limit on a Django QuerySet does not work:
>>> from blog.models import Author
>>> len(Author.objects.filter(pk__gt=0)[0:999])
>>> len(Author.objects.all()[0:999])
MySQL's general log showed that both queries ran with LIMIT 21.
A limit smaller than 21 works as expected, e.g. len(Author.objects.all()[0:10]) produces SQL with LIMIT 10.
Why is that? Is there something I need to configure?

This happens when you make queries from the shell: the LIMIT clause is added to stop your terminal from filling up with thousands of records while debugging:
You were printing (or, at least, trying to print) the repr() of the
queryset. To avoid people accidentally trying to retrieve and print a
million results, we (well, I) changed that to only retrieve and print
the first 20 results and print "remainder truncated" if there were more.
This is achieved by limiting the query to 21 results (if there are 21
results there are more than 20, so we print the "truncated" message).
That only happens in the repr() -- i.e. it's only for diagnostic
printing. No normal user code has this limit included automatically, so
you happily create a queryset that iterates over a million results.
(Source)
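In other words, the LIMIT 21 only shows up when the shell prints the queryset's repr(); forcing evaluation yourself uses the slice you asked for. A minimal sketch of the difference, reusing the Author model from the question:
>>> qs = Author.objects.all()[0:999]
>>> qs          # shell prints repr(qs)  -> SQL runs with LIMIT 21
>>> list(qs)    # explicit evaluation    -> SQL runs with LIMIT 999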

Django implements OFFSET using Python's slice syntax. If you want to skip the first OFFSET elements and then show the next LIMIT elements, use:
MyModel.objects.all()[OFFSET:OFFSET+LIMIT]
For example, if you wanted to fetch 5 authors after an offset of 10, your code would look something like this:
Author.objects.all()[10:15]
You can read more about this in the official Django documentation on limiting QuerySets.
I have also written a blog post around this concept; you can read more there.
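If you want to see the SQL a slice will produce, the queryset's query attribute can be printed from the shell (the exact SELECT list depends on your model, but the tail should show the limit and offset):
>>> print(Author.objects.all()[10:15].query)   # ... LIMIT 5 OFFSET 10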

LIMIT and OFFSET don't work in Django the way we might expect them to from SQL.
For example, if we have to read the next 10 rows starting from the 10th row and we specify:
Author.objects.all()[10:10]
It will return an empty record list, because the slice is interpreted as [start:stop], not as [offset:count]. In order to fetch the next 10 rows, we have to add the offset to the limit:
Author.objects.all()[10:10+10]
And it will return the next 10 rows, starting from the 10th row.
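One way to avoid getting that arithmetic wrong is to wrap it in a small helper. This is only an illustrative sketch; the page() function below is hypothetical, not part of Django:
def page(queryset, page_number, page_size=10):
    # translate (page_number, page_size) into an [offset:offset+limit] slice
    offset = page_number * page_size
    return queryset[offset:offset + page_size]

# rows 10-19 of all authors, i.e. the second page of 10
authors = page(Author.objects.all(), 1)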

For offset and limit I used slicing and it worked for me:
MyModel.objects.all()[offset:offset+limit]
For example, one result after skipping the first (offset 1, limit 1):
Post.objects.filter(Post_type=typeId)[1:2]

It does work, but Django uses an iterator; it does not load all objects at once.

Related

Use pywikibot to download complete list of pages from a Mediawiki server without iterating through pages

I have a large (50K+ pages) Mediawiki wiki, and I need to efficiently get a list of all pages, sorted by last update time. I'm working in Python using pywikibot. The documentation hints that this is possible, but I haven't decoded how to do it yet. (I can download up to 500 pages easily enough.) Is there a reasonably efficient way to do this that's better than downloading batches of 500 in alphabetic order, getting update times page by page, and merging the batches?
MediaWiki does not directly expose a list of pages sorted by last edit time. You could just download all pages and sort them locally (in Python or in some kind of database, depending on how many pages there are):
import pywikibot

site = pywikibot.Site()
for namespace in site.namespaces():
    for page in site.allpages(namespace=namespace):
        pass  # process page.title() and page.editTime() here
Or use the allrevisions API, which can sort by time but returns all revisions of all pages; a query like action=query&generator=allrevisions&prop=revisions (with pywikibot.data.api.QueryGenerator) would also return the current revision of each page, so you can discard the old revisions. Or use the SQL support in Pywikibot with a query like SELECT page_ns, page_title FROM page JOIN revision ON page_latest = rev_id ORDER BY rev_timestamp (which will result in an inefficient filesort-based query, but for a small wiki that might not matter).
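For what it's worth, a hedged sketch of the "sort locally" approach, collecting (timestamp, title) pairs with the same site.allpages() and page.editTime() calls used above and sorting them in Python:
import pywikibot

site = pywikibot.Site()
pages = []
for namespace in site.namespaces():
    for page in site.allpages(namespace=namespace):
        pages.append((page.editTime(), page.title()))

# most recently edited pages first
pages.sort(reverse=True)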
After some digging and a lot of experimenting, I found a solution using pywikibot which generates a list of all pages sorted by time of last update:
from datetime import timedelta
import pywikibot

wiki = pywikibot.Site()
current_time = wiki.server_time()
# Not for all time, just for roughly the last 68 years...
iterator = wiki.recentchanges(start=current_time,
                              end=current_time - timedelta(hours=600000))
listOfAllWikiPages = []
for v in iterator:
    listOfAllWikiPages.append(v)

# This has an entry for each revision.
# Get rid of the older instances of each page by creating a dictionary which
# only contains the latest version.
temp = {}
for p in listOfAllWikiPages:
    if p["title"] in temp:
        if p["timestamp"] > temp[p["title"]]["timestamp"]:
            temp[p["title"]] = p
    else:
        temp[p["title"]] = p

# Recreate listOfAllWikiPages from the de-duplicated dictionary
listOfAllWikiPages = list(temp.values())
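If an explicitly ordered list is needed, the de-duplicated entries can be sorted by timestamp afterwards (recentchanges already yields the newest changes first, so this is mostly a safeguard):
listOfAllWikiPages.sort(key=lambda p: p["timestamp"], reverse=True)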

Neo4J / py2neo -- cursor-based query?

If I do something like this:
from py2neo import Graph
graph = Graph()
stuff = graph.cypher.execute("""
match (a:Article)-[p]-n return a, n, p.weight
""")
on a database with lots of articles and links, the query takes a long time and uses all my system's memory, presumably because it's copying the entire result set into memory in one go. Is there some kind of cursor-based version where I could iterate through the results one at a time without having to have them all in memory at once?
EDIT
I found the stream function:
stuff = graph.cypher.stream("""
match (a:Article)-[p]-n return a, n, p.weight
""")
which seems to be what I want according to the documentation but now I get a timeout error (py2neo.packages.httpstream.http.SocketError: timed out), followed by the server becoming unresponsive until I kill it using kill -9.
Have you tried implementing a paging mechanism? Perhaps with the skip keyword: http://neo4j.com/docs/stable/query-skip.html
This is similar to using LIMIT / OFFSET in a Postgres or MySQL query.
EDIT: I previously said that the entire result set was stored in memory, but it appears this is not the case when using api streaming - per Nigel's (Neo engineer) comment below.
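For reference, a hedged sketch of the SKIP-based paging idea, using the same graph.cypher.execute() call as the question; the page size and parameter names are arbitrary choices, and without an ORDER BY the paging order is not guaranteed to be stable:
PAGE_SIZE = 1000
skip = 0
while True:
    records = graph.cypher.execute(
        "match (a:Article)-[p]-n "
        "return a, n, p.weight "
        "skip {skip} limit {limit}",
        {"skip": skip, "limit": PAGE_SIZE})
    if len(records) == 0:
        break
    for record in records:
        pass  # process one record at a time here
    skip += PAGE_SIZE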

py2neo: depending batch insertion

I use py2neo (v 1.9.2) to write data to a neo4j db.
from py2neo import neo4j

batch = neo4j.WriteBatch(graph_db)
current_relationship_index = graph_db.get_or_create_index(neo4j.Relationship, "Current_Relationship")
touched_relationship_index = graph_db.get_or_create_index(neo4j.Relationship, "Touched_Relationship")

get_rel = current_relationship_index.get(some_key1, some_value1)
if len(get_rel) == 1:
    batch.add_indexed_relationship(touched_relationship_index, some_key2, some_value2, get_rel[0])
elif len(get_rel) == 0:
    created_rel = current_relationship_index.create(some_key3, some_value3, (my_start_node, "KNOWS", my_end_node))
    batch.add_indexed_relationship(touched_relationship_index, some_key4, "touched", created_rel)

batch.submit()
Is there a way to replace current_relationship_index.get(..) and current_relationship_index.create(...) with batch commands? I know there is one, but the problem is that I need to act depending on the return values of these commands. And I would like to have all statements in one batch for performance reasons.
I have read that it is rather uncommon to index relationships but the reason I do it is the following: I need to parse some (text) file everyday and then need to check if any of the relations have changed towards the previous day, i.e. if a relation does not exist in the text file anymore I want to mark it with a "replaced" property in neo4j. Therefore, I add all "touched" relationships to the appropriate index, so I know that these did not change. All relations that are not in the touched_relationship_index obviously do not exist anymore so I can mark them.
I can't think of an easier way to do so, even though I'm sure that py2neo offers one.
EDIT: Considering Nigel's comment I tried this:
my_rel = batch.get_or_create_indexed_relationship(current_relationship_index, some_key, some_value, my_start_node, my_type, my_end_node)
batch.add_indexed_relationship(touched_relationship_index, some_key2, some_value2, my_rel)
batch.submit()
This obviously does not work, because I can't refer to "my_rel" inside the batch. How can I solve this? Refer to the result of the previous batch statement with "0"? But consider that the whole thing is supposed to run in a loop, so the numbers are not fixed. Maybe use some variable "batch_counter" which refers to the current batch statement and is incremented whenever a statement is added to the batch?
Have a look at WriteBatch.get_or_create_indexed_relationship. That can conditionally create a relationship based on whether or not one currently exists and operates atomically. Documentation link below:
http://book.py2neo.org/en/latest/batches/#py2neo.neo4j.WriteBatch.get_or_create_indexed_relationship
There are a few similar uniqueness management facilities in py2neo that I recently blogged about here that you might want to read about.
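For illustration, a hedged sketch of that suggestion, reusing the names from the question's EDIT. It side-steps the "refer to my_rel inside the same batch" problem by submitting the get-or-create first and indexing the returned relationship in a second batch:
batch = neo4j.WriteBatch(graph_db)
batch.get_or_create_indexed_relationship(current_relationship_index, some_key,
                                         some_value, my_start_node, my_type,
                                         my_end_node)
my_rel = batch.submit()[0]   # the relationship, whether found or newly created

touch_batch = neo4j.WriteBatch(graph_db)
touch_batch.add_indexed_relationship(touched_relationship_index, some_key2,
                                     "touched", my_rel)
touch_batch.submit()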

Why are my Datastore Write Ops so high?

I busted through my daily free quota on a new project this weekend. For reference, that's .05 million writes, or 50,000 if my math is right.
Below is the only code in my project that is making any Datastore write operations.
old = Streams.query().fetch(keys_only=True)
ndb.delete_multi(old)

try:
    r = urlfetch.fetch(url=streams_url, method=urlfetch.GET)
    streams = json.loads(r.content)
    for stream in streams['streams']:
        stream = Streams(channel_id=stream['_id'],
                         display_name=stream['channel']['display_name'],
                         name=stream['channel']['name'],
                         game=stream['channel']['game'],
                         status=stream['channel']['status'],
                         delay_timer=stream['channel']['delay'],
                         channel_url=stream['channel']['url'],
                         viewers=stream['viewers'],
                         logo=stream['channel']['logo'],
                         background=stream['channel']['background'],
                         video_banner=stream['channel']['video_banner'],
                         preview_medium=stream['preview']['medium'],
                         preview_large=stream['preview']['large'],
                         videos_url=stream['channel']['_links']['videos'],
                         chat_url=stream['channel']['_links']['chat'])
        stream.put()
    self.response.out.write("Done")
except urlfetch.Error, e:
    self.response.out.write(e)
This is what I know:
There will never be more than 25 "stream" entries in "streams", so it's guaranteed to call .put() exactly 25 times.
I delete everything from the table at the start of this call because everything needs to be refreshed every time it runs.
Right now, this code is on a cron running every 60 seconds. It will never run more often than once a minute.
I have verified all of this by enabling Appstats and I can see the datastore_v3.Put count go up by 25 every minute, as intended.
I have to be doing something wrong here, because 25 a minute is 1,500 writes an hour, not the ~50,000 that I'm seeing now.
Thanks
You are mixing up two different things here: write API calls (what your code calls) and low-level datastore write operations. See the billing docs for how they relate: Pricing of Costs for Datastore Calls (second section).
This is the relevant part:
New Entity Put (per entity, regardless of entity size) = 2 writes + 2 writes per indexed property value + 1 write per composite index value
In your case Streams has 15 indexed properties resulting in: 2 + 15 * 2 = 32 write OPs per write API call.
Total per hour: 60 (requests/hour) * 25 (puts/request) * 32 (operations/put) = 48,000 datastore write operations per hour
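As a quick sanity check of that arithmetic (the 15 indexed properties come from counting the fields in the question's Streams model):
indexed_properties = 15
writes_per_put = 2 + 2 * indexed_properties   # 32 write ops per entity
puts_per_hour = 60 * 25                       # cron every minute, 25 puts each run
print(writes_per_put * puts_per_hour)         # 48000 write ops per hour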
It seems as though I've finally figured out what was going on, so I wanted to update here.
I found this older answer: https://stackoverflow.com/a/17079348/1452497.
I had missed somewhere along the line that indexed properties multiply the writes by a factor of 10 or more; I did not expect that. I didn't need everything indexed, and after turning indexing off in my model, the write ops dropped dramatically, down to about where I expect them.
Thanks guys!
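A hedged sketch of that fix: in ndb, properties that are never filtered or sorted on can be declared with indexed=False, so each put() stops writing their index rows. The property types below are guesses based on the question's code; the real Streams model may differ:
from google.appengine.ext import ndb

class Streams(ndb.Model):
    channel_id = ndb.StringProperty()                 # still indexed, used in queries
    display_name = ndb.StringProperty(indexed=False)
    name = ndb.StringProperty(indexed=False)
    game = ndb.StringProperty(indexed=False)
    status = ndb.TextProperty()                       # TextProperty is never indexed
    viewers = ndb.IntegerProperty(indexed=False)
    logo = ndb.StringProperty(indexed=False)
    channel_url = ndb.StringProperty(indexed=False)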
It is 1,500 * 24 = 36,000 writes/day, which is very close to the daily quota.

GeoModel with Google App Engine - queries

I'm trying to use GeoModel python module to quickly access geospatial data for my Google App Engine app.
I just have a few general questions for issues I'm running into.
There are two main methods, proximity_fetch and bounding_box_fetch, that you can use to run queries. They actually return a result set, not a filtered query, which means you need to fully prepare a filtered query before passing it in. It also stops you from iterating lazily over the query set, since the results are already fetched, and you don't have the option to pass an offset into the fetch.
Short of modifying the code, can anyone recommend a solution for specifying an offset into the query? My problem is that I need to check each result against a variable to see if I can use it, otherwise throw it away and test the next one. I may run into cases where I need to do an additional fetch, but starting with an offset.
You can also work directly with the location_geocells of your model.
from geospatial import geomodel, geocell, geomath

# query is a db.GqlQuery
# location is a db.GeoPt
# A resolution of 4 is a box of approximately 150 km
bbox = geocell.compute_box(geocell.compute(geo_point.location, resolution=4))
cell = geocell.best_bbox_search_cells(bbox, geomodel.default_cost_function)
query.filter('location_geocells IN', cell)

# I want only results within 100 km.
FETCHED = 200
DISTANCE = 100

def _func(x):
    x.dist = geomath.distance(geo_point.location, x.location)
    return x.dist

results = sorted(query.fetch(FETCHED), key=_func)
results = [x for x in results if x.dist <= DISTANCE]
There's no practical way to do this, because a call to geoquery devolves into multiple datastore queries, which it merges together into a single result set. If you were able to specify an offset, geoquery would still have to fetch and discard the first n results before returning the ones you requested.
A better option might be to modify geoquery to support cursors, but each query would have to return a set of cursors, not a single one.
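To illustrate the cursor mechanics the answer alludes to, here is a hedged sketch using the plain old db query API that GeoModel builds on. MyGeoModel and cell are placeholder names, and this only covers a single geocell query; the IN form GeoModel actually issues expands to multiple queries, which is exactly why a set of cursors would be needed:
q = MyGeoModel.all().filter('location_geocells =', cell)
first_batch = q.fetch(100)
cursor = q.cursor()           # opaque marker for "after the first 100 results"

# later, or in another request: resume where the previous fetch stopped
q2 = MyGeoModel.all().filter('location_geocells =', cell)
q2.with_cursor(cursor)
next_batch = q2.fetch(100)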
