How to implement Google-style pagination on app engine? - python

See the pagination on the app gallery? It has page numbers and a 'start' parameter which increases with the page number. Presumably this app was made on GAE. If so, how did they do this type of pagination? ATM I'm using cursors but passing them around in URLs is as ugly as hell.

You can simply pass in the 'start' parameter as an offset to the .fetch() call on your query. This gets less efficient as people dive deeper into the results, but if you don't expect people to browse past 1000 or so, it's manageable. You may also want to consider keeping a cache, mapping queries and offsets to cursors, so that repeated queries can fetch the next set of results efficiently.
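A minimal sketch of that approach on the old google.appengine.ext.db API - Greeting is a placeholder model and PAGE_SIZE an arbitrary choice:

from google.appengine.ext import db

class Greeting(db.Model):  # placeholder model for illustration
    date = db.DateTimeProperty(auto_now_add=True)

PAGE_SIZE = 10

def get_page(start):
    # 'start' comes straight from the URL, e.g. ?start=20 for page 3
    query = Greeting.all().order('-date')
    # fetch() accepts an offset, but the datastore still reads and
    # skips the preceding rows, so cost grows linearly with 'start'
    return query.fetch(PAGE_SIZE, offset=start)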

Ben Davies's outstanding PagedQuery class will do everything that you want and more.

Related

Storing queryset after fetching it once

I am new to Django and web development.
I am building a website with a fairly large database.
A large amount of data has to be shown on many pages, and much of it is repeated; that is, I need to show the same data on many pages.
Is it a good idea to query the database for the data on every GET request? It takes many seconds to get the data every time I refresh the page or request another page that shows the same data.
Is there a way to fetch the data once, store it somewhere, and just display it on every page, refetching it only when updates are made?
I thought about the session, but I found that it is limited to 5MB, which is too small for my data.
Any suggestions?
Thank you.
Django's cache - as mentioned by Leistungsabfall - can help, but like most cache systems it has some drawbacks too if you use it naively for this kind of problem (long queries/computations): when the cache expires, the next request will have to recompute the whole thing - which might take some time, during which every new request will trigger a recomputation... Also, proper cache invalidation can be really tricky.
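A minimal sketch of that naive cache-aside pattern with Django's cache framework, where 'expensive_data' is an illustrative key and compute_expensive_data() a hypothetical stand-in for the slow query:

from django.core.cache import cache

def get_data():
    data = cache.get('expensive_data')
    if data is None:
        # cache miss: every request arriving here recomputes, which is
        # exactly the expiry problem described above
        data = compute_expensive_data()  # hypothetical slow query/computation
        cache.set('expensive_data', data, timeout=60 * 15)  # keep for 15 min
    return data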
Actually there's no one-size-fits-all answer to your question; the right solution is often a mix of different solutions (code optimisation, caching, denormalisation etc.), based on your actual data, how often they change, how many visitors you have, how critical it is to have up-to-date data etc. But the very first steps would be to:
check the code fetching the data and find out if there are possible optimisations at this level, using QuerySet features (.select_related() / .prefetch_related(), .values() and/or .values_list(), annotations etc.) to avoid issues like the "n+1 queries" problem, fetching whole records and building whole model instances when you only need a single field's value, doing computations at the Python level when they could be done at the database level etc. (see the sketch after this list);
check your db schema's indexes - well-used indexes can vastly improve performance, badly used ones can vastly degrade it;
and of course use the right tools (db query logging, Python's profiler etc.) to make sure you identify the real issues.
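To illustrate the QuerySet features from the first step, a hedged sketch with placeholder Author/Book models:

from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=100)

class Book(models.Model):
    title = models.CharField(max_length=200)
    author = models.ForeignKey(Author, on_delete=models.CASCADE)

# the "n+1 queries" problem: one query for the books, then one more per book
for book in Book.objects.all():
    print(book.author.name)

# select_related() pulls the author in with a single JOINed query
for book in Book.objects.select_related('author'):
    print(book.author.name)

# values_list() when you only need one field - no model instances are built
titles = Book.objects.values_list('title', flat=True)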

Persistent object with Django?

So I have a site that, on a per-user basis, is expected to query a very large database and flip through the results. Because of the number of entries returned, I run the query once (which takes some time...), store the result in a global, and let folks iterate through the results (or download them) as they want.
Of course, this isn't scalable, as the globals are shared across sessions. What is the correct way to do this in Django? I looked at session management, but I always ran into the "xyz is not serializable to JSON" issue. Do I look into how to do this correctly using sessions, or is there another preferred way?
If the user is flipping through the results, you probably don't want to pull back and render any more than you have to. Most SQL dialects have TOP and LIMIT clauses that will let you pull back a limited range of results, as long as your data is ordered consistently. Django's Pagination classes are a nice abstraction of this on top of Django Model classes: https://docs.djangoproject.com/en/dev/topics/pagination/
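A minimal sketch of a paginated view, assuming a placeholder Entry model:

from django.core.paginator import Paginator, EmptyPage
from django.shortcuts import render

from myapp.models import Entry  # placeholder model and app

def entry_list(request, page_no=1):
    entries = Entry.objects.order_by('id')  # consistent ordering is essential
    paginator = Paginator(entries, 25)      # 25 rows per page
    try:
        page = paginator.page(page_no)      # issues a LIMIT/OFFSET query
    except EmptyPage:
        page = paginator.page(paginator.num_pages)
    return render(request, 'entries.html', {'page': page})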
I would be careful of storing large amounts of data in user sessions, as it won't scale as your number of users grows, and user sessions can stay around for a while after the user has left the site. If you're set on this option, make sure you read about clearing the expired sessions. Django doesn't do it for you:
https://docs.djangoproject.com/en/1.7/topics/http/sessions/#clearing-the-session-store

Good way to make a SQL based activity feed faster

I need a way to improve the performance of my website's SQL-based activity feed. We are using Django on Heroku.
Right now we are using actstream, which is a Django App that implements an activity feed using Generic Foreign Keys in the Django ORM. Basically, every action has generic foreign keys to its actor and to any objects that it might be acting on, like this:
Action:
(Clay - actor) wrote a (comment - action object) on (Andrew's review of Starbucks - target)
As we've scaled, it's become way too slow, which is understandable because it relies on big, expensive SQL joins.
I see at least two options:
Put a Redis layer on top of the SQL database and get activity feeds from there.
Try to circumvent the Django ORM and do all the queries in raw SQL, which I understand can improve performance.
If anyone has thoughts on either of these two, or other ideas, I'd love to hear them.
You might want to look at Materialized Views. Since you're on Heroku, which generally uses PostgreSQL, you could look at Materialized View Support for PostgreSQL. It is not as mature as on other database servers, but as far as I understand, it can be made to work. To work with the Django ORM, you would probably have to create a new "entity" (I'm not familiar with Django, so modify as needed) for the feed, and then run queries over it as if it were a table. Manual management of the view is a consideration, so look into it carefully before you commit to it.
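In Django terms, one way to query the view as if it were a table is an unmanaged model - a hedged sketch, where the view name activity_feed_mv and its columns are assumptions:

from django.db import models

class ActivityFeedEntry(models.Model):
    actor = models.CharField(max_length=100)
    verb = models.CharField(max_length=50)
    target = models.CharField(max_length=200)
    created = models.DateTimeField()

    class Meta:
        managed = False                # Django won't create or migrate it
        db_table = 'activity_feed_mv'  # name of the materialized view

# Queries then work as on any table:
#   feed = ActivityFeedEntry.objects.filter(actor='Clay')[:20]
# Refreshing stays manual, e.g. a periodic job running:
#   REFRESH MATERIALIZED VIEW activity_feed_mv;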
Hope this helps!
You said redis? Everything is better with redis.
Caching is one of the best ideas in software development; no matter whether you use Materialized Views, you should also consider caching the results. Believe me, your users will notice the difference.
Went with an approach that sort of combined the two suggestions.
We created a master list of every action in the database, which included all the information we needed about the actions, and stuck it in Redis. Given an action id, we can now do a Redis lookup and get back a dictionary object that is ready to be returned to the front end.
We also created action id lists that correspond to all the different types of activity streams available to a user. So given a user id, we have his friends' activity, his own activity, favorite places activity, etc., available for lookup. (These correspond somewhat to materialized views, I guess, although they live in Redis, not in PSQL.)
So we get a user's feed as a list of action ids, then fetch the details of those actions by looking up the ids in the master action list, and return the feed to the front end.
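In redis-py terms, the scheme looks roughly like this (key names such as action:<id> and feed:user:<id> are illustrative):

import json
import redis

r = redis.Redis()

def store_action(action_id, action_dict):
    # master list: one serialized entry per action, keyed by id
    r.set('action:%s' % action_id, json.dumps(action_dict))

def push_to_feed(user_id, action_id):
    # each per-user stream is just a list of action ids
    r.lpush('feed:user:%s' % user_id, action_id)

def get_feed(user_id, start=0, count=20):
    ids = r.lrange('feed:user:%s' % user_id, start, start + count - 1)
    # resolve each id against the master action list
    return [json.loads(r.get('action:%s' % i.decode())) for i in ids]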
Thanks for the suggestions, guys.

Python: RE vs. Query

I am building a website using Django, and this website uses blocks which are enabled for a certain page.
Right now I use a TextField containing the paths where a block is enabled. When a page is requested, Django retrieves all blocks from the database and does re.search on the TextField.
However, I was wondering whether, in terms of overhead, it would be a better idea to use a separate DB table for block/path pairs, where each row contains a single path and a reference to a block.
A separate DB table is definitely the "right" way to do it, because MySQL has to send all the data from your TEXT fields every time you query. As you add more rows and the TEXT fields get bigger, you'll start to notice performance issues and eventually crash the server. Also, you'll be able to use VARCHAR and add a unique index to the paths, making lookups lightning fast.
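A hedged sketch of that table as a Django model - Block and BlockPath are placeholder names:

from django.db import models

class Block(models.Model):
    name = models.CharField(max_length=100)

class BlockPath(models.Model):
    block = models.ForeignKey(Block, on_delete=models.CASCADE)
    path = models.CharField(max_length=255, db_index=True)  # indexed VARCHAR

    class Meta:
        unique_together = ('block', 'path')  # one row per block/path pair

# Finding the blocks enabled for a request becomes a single indexed query:
#   blocks = Block.objects.filter(blockpath__path=request.path)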
I am not exactly familiar with Django, but if I am understanding the situation correctly, you should use a table.
In fact this is exactly the kind of use that DB software is designed and optimized for.
No worries. It will actually be faster.
By doing the search yourself, you are trying to implement part of the DB logic on your own. Fun, certainly, but not so fast. :)
Here are some nice links on designing a database:
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
http://en.wikipedia.org/wiki/Third_normal_form
Hope this helps. Good luck. :-)

Stateless pagination in CouchDB?

Most of the research I've seen on pagination with CouchDB suggests that what you need to do is take the first ten (or however many) items from your view, then record the last document's docid and pass it on to the next page (sketched in code below). Unfortunately, I can see a few glaring issues with that method.
It apparently makes it impossible to skip around within the set of pages (if someone jumps directly to page 100, you would have to run the queries for pages 2-99 so you would know how to load page 100).
It requires you to pass around possibly a lot of state information between your pages.
It's difficult to properly code.
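For concreteness, that linked-list scheme looks roughly like this in couchdb-python, with 'app/by_date' as a placeholder view name:

import couchdb

db = couchdb.Server()['mydb']  # placeholder database name

def get_page(startkey=None, startkey_docid=None, page_size=10):
    opts = {'limit': page_size + 1}  # fetch one extra row as the next anchor
    if startkey is not None:
        opts['startkey'] = startkey
        opts['startkey_docid'] = startkey_docid
    rows = list(db.view('app/by_date', **opts))
    next_anchor = rows[page_size] if len(rows) > page_size else None
    return rows[:page_size], next_anchor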
Unfortunately, my research has shown that using skip introduces considerable slowdown for datasets of 5000 records or larger, and would be positively crippling once you reached anything really huge (going to page 20000 with 10 records to a page would take about 20 seconds - and yes, there are datasets that big in production). So that's not really an option.
So, what I'm asking is, is there an efficient way to paginate view results in CouchDB that can get all the items from an arbitrary page? (I'm using couchdb-python, but hopefully there isn't anything about this that would be client-dependent.)
I'm new to CouchDB, but I think I might be able to help. I read the following from CouchDB: The Definitive Guide:
One drawback of the linked list style pagination is that... jumping to a specific page doesn’t really work... If you really do need jump to page over the full range of documents... you can still maintain an integer value index as the view index and have a hybrid approach at solving pagination. — http://books.couchdb.org/relax/receipts/pagination
If I'm reading that right, the approach in your case is going to be:
Embed a numeric sequence into your document set.
Extract that numeric sequence to a numeric view index.
Use arithmetic to calculate the correct start/end numeric keys for your arbitrary page.
For step 1, you need to actually add something like "page_seq" as a field to your document. I don't have a specific recommendation on how you obtain this number, and am curious to know what people think. For this scheme to work, it has to increment by exactly 1 for each new record, so RDBMS sequences are probably out (the ones I'm familiar with may skip numbers).
For step 2, you'd write a view with a map function that's something like this (in Javascript):
function(doc) {
  emit(doc.page_seq, doc);
}
For step 3, you'd write your query something like this (assuming the page_seq and page numbering sequences start at 1):
results = db.view("name_of_view")
page_size = ... # say, 20
page_no = ... # 1 = page 1, 2 = page 2, etc.
begin = ((page_no - 1) * page_size) + 1
# couchdb-python slices view results by key (startkey/endkey) rather than
# by row position, and CouchDB's endkey is inclusive, so stop one short
end = begin + page_size - 1
my_page = results[begin:end]
and then you can iterate through my_page.
A clear drawback of this is that page_seq assumes you're not filtering the data set for your view, and you'll quickly run into trouble if you're trying to get this to work with an arbitrary query.
Comments/improvements welcome.
We have solved this problem by using CouchDB Lucene for search listings.
The 0.6 snapshot is stable enough that you should try it:
CouchDB Lucene repository
