Stateless pagination in CouchDB? - python

Most of the research I've seen on pagination with CouchDB suggests that what you need to do is take the first ten (or however many) items from your view, then record the last document's docid and pass it on to the next page. Unfortunately, I can see a few glaring issues with that method.
It apparently makes it impossible to skip around within the set of pages (if someone jumps directly to page 100, you would have to run the queries for pages 2-99 so you would know how to load page 100).
It requires you to pass around possibly a lot of state information between your pages.
It's difficult to properly code.
Unfortunately, my research has shown that using skip causes considerable slowdown for datasets of 5000 records or more, and would be positively crippling once you reached anything really huge (going to page 20000 with 10 records to a page would take about 20 seconds - and yes, there are datasets that big in production). So that's not really an option.
So, what I'm asking is, is there an efficient way to paginate view results in CouchDB that can get all the items from an arbitrary page? (I'm using couchdb-python, but hopefully there isn't anything about this that would be client-dependent.)

I'm new to CouchDB, but I think I might be able to help. I read the following from CouchDB: The Definitive Guide:
One drawback of the linked list style pagination is that... jumping to a specific page doesn’t really work... If you really do need jump to page over the full range of documents... you can still maintain an integer value index as the view index and have a hybrid approach at solving pagination. — http://books.couchdb.org/relax/receipts/pagination
If I'm reading that right, the approach in your case is going to be:
Embed a numeric sequence into your document set.
Extract that numeric sequence to a numeric view index.
Use arithmetic to calculate the correct start/end numeric keys for your arbitrary page.
For step 1, you need to actually add something like "page_seq" as a field to your document. I don't have a specific recommendation on how you obtain this number, and am curious to know what people think. For this scheme to work, it has to increment by exactly 1 for each new record, so RDBMS sequences are probably out (the ones I'm familiar with may skip numbers).
For step 2, you'd write a view with a map function that's something like this (in Javascript):
function(doc) {
    emit(doc.page_seq, doc);
}
For step 3, you'd write your query something like this (assuming the page_seq and page numbering sequences start at 1):
results = db.view("name_of_view")
page_size = ...  # say, 20
page_no = ...    # 1 = page 1, 2 = page 2, etc.
begin = ((page_no - 1) * page_size) + 1
end = begin + page_size - 1  # the end key is inclusive by default, so stop at the page's last record
my_page = results[begin:end]
and then you can iterate through my_page.
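The page arithmetic can be checked on its own, without a database. This is a minimal sketch, assuming page_seq starts at 1, has no gaps, and that the key range is inclusive at both ends; page_key_range is a hypothetical helper name, not part of couchdb-python.

```python
def page_key_range(page_no, page_size):
    """Return the (first, last) page_seq keys for a 1-based page number."""
    if page_no < 1 or page_size < 1:
        raise ValueError("page_no and page_size must be >= 1")
    first = (page_no - 1) * page_size + 1
    last = first + page_size - 1  # inclusive upper bound
    return first, last


print(page_key_range(1, 20))
print(page_key_range(20000, 10))
```

The two values would then be used as the start and end keys of the view query, so the cost of reaching page 20000 no longer depends on how many rows come before it.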
A clear drawback of this is that page_seq assumes you're not filtering the data set for your view, and you'll quickly run into trouble if you're trying to get this to work with an arbitrary query.
Comments/improvements welcome.

We have solved this problem by using CouchDB Lucene for search listings.
The 0.6 Snapshot is stable enough you should try it :
CouchDB Lucene repository

Related

Filter results from browse_release_groups by artist_id to get discography, python

I'm trying to retrieve discographies for various artists. Wikipedia and the manual web interface for MusicBrainz.org seem to agree on what albums make this up, for the artists I've checked. My first thought was to attempt to screen-scrape either of these resources, but that looks like hard work to do it properly.
Direct queries of the musicbrainz data seemed to offer a quicker way to get clean data. I would ideally construct a request like this ...
data = get_release_groups(artist=mbid,
primary_type='Album',
status='Official',
includes=['first_release_date',
'title',
'secondary_type_list'])
I chose to use the python wrapper musicbrainzngs, as I am fairly experienced with python. It gave me a choice of three methods, get_, search_ and browse_. Get_ will not return sufficient records. Browse_ appeared to be what I wanted, so I tried that first, especially as search_ was documented around looking for text in the python examples, rather than the mb_id, which I already had.
When I did a browse_release_groups(artist=artist_id,,,), I got a list of release groups, each containing the data I wanted, which was album title, type and year. However, I also got a large number of other release groups that don't appear in the manual web results (for example, for The Rolling Stones: https://musicbrainz.org/artist/b071f9fa-14b0-4217-8e97-eb41da73f598).
There didn't appear to be any way to filter in the query for status='official', or to include the status as part of the results so I could manually filter.
In response to this question, Wieland has suggested I use the search_ query. I have tested search_release_groups(arid=mbid, status='official', primarytype='Album', strict=True, limit=...) and this returns many fewer release groups. As far as studio albums are concerned, it matches 1:1. There are still a few minor discrepancies in the compilations, which I can live with. However, this query did not return the first-release-date, and so far, it has been resistant to my attempts to find how to include it. I notice in the server search code linked to that every query starts off manipulating rgm.first_release_date_year etc, but it's not clear how/when this gets returned from a query.
It's just occurred to me that I can use both a browse_ and a search_ , as together they give me all the information. So I have a work around, but it feels rather agricultural.
TL;DR I want release groups (titles, dates, types, status) by artist ID. If I browse, I get dates, but can't include or filter by status. If I search, I can filter by status, but don't get dates. How can I get both in one query?
I'm not entirely sure what your question is, but the find_by_artist method of release groups (source here) is what's doing the filtering of release groups for the artist pages, in particular:
# Show only RGs with official releases by default, plus all-status-less ones so people fix the status
unless ($show_all) {
    push @$conditions, "(EXISTS (SELECT 1 FROM release where release.release_group = rg.id AND release.status = '1') OR
        NOT EXISTS (SELECT 1 FROM release where release.release_group = rg.id AND release.status IS NOT NULL))";
}
Unfortunately, I think it's not possible to express that condition in a normal web service call. You can, however, use the search web service to filter for release groups by the rolling stones that contain at least one "official" release: http://musicbrainz.org/ws/2/release-group/?query=arid:b071f9fa-14b0-4217-8e97-eb41da73f598%20AND%20status:official&offset=0. In python-musicbrainzngs, the call for this is
search_release_groups(arid="b071f9fa-14b0-4217-8e97-eb41da73f598", status="official", strict=True)
Unfortunately, the search results don't include the first-release-date field. There's an open ticket about it, but it's not going to be fixed in the near future.
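Until the search results carry the date, the two-call workaround mentioned in the question can be reduced to a join on the release group ID. The following is only a sketch: merge_release_groups is a hypothetical helper, the sample dictionaries are made-up placeholders rather than real MusicBrainz data, and the "id" and "first-release-date" keys are assumptions about the shape of the parsed responses that should be checked against what the library actually returns.

```python
def merge_release_groups(browsed, searched):
    """Keep only the searched release groups, adding dates from the browse results."""
    dates_by_id = {rg["id"]: rg.get("first-release-date") for rg in browsed}
    merged = []
    for rg in searched:
        combined = dict(rg)  # copy so the inputs are left untouched
        combined["first-release-date"] = dates_by_id.get(rg["id"])
        merged.append(combined)
    return merged


# Placeholder data standing in for the browse and search responses.
browsed = [
    {"id": "rg-1", "title": "Album A", "first-release-date": "1970-01-01"},
    {"id": "rg-2", "title": "Bootleg B", "first-release-date": "1971-05-05"},
]
searched = [{"id": "rg-1", "title": "Album A", "primary-type": "Album"}]

result = merge_release_groups(browsed, searched)
print(result)
```

The search call does the status filtering, and the browse call only contributes the date, so release groups that the search excluded are dropped.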

Storing queryset after fetching it once

I am new to django and web development.
I am building a website with a fairly large database.
A large amount of data has to be shown on many pages, and a lot of it is repeated. I mean I need to show the same data on many pages.
Is it a good idea to query the database for the data on every GET request? It takes many seconds to get the data every time I refresh the page or request another page that shows the same data.
Is there a way to fetch the data once, store it somewhere, just display it on every page, and only refetch it when some updates are made?
I thought about the session but I found that it is limited to 5MB which is small for my data.
Any suggestions?
Thank you.
Django's cache - as mentioned by Leistungsabfall - can help, but like most cache systems it has some drawbacks too if you use it naively for this kind of problem (long queries/computations): when the cache expires, the next request will have to recompute the whole thing - which might take some time, during which every new request will trigger a recomputation... Also, proper cache invalidation can be really tricky.
Actually there's no one-size-fits-all answer to your question, the right solution is often a mix of different solutions (code optimisation, caching, denormalisation etc), based on your actual data, how often they change, how many visitors you have, how critical it is to have up-to-date data etc, but the very first steps would be to
check the code fetching the data and find out if there are possible optimisations at this level using QuerySet features (.select_related() / .prefetch_related(), .values() and/or .values_list(), annotations etc) to avoid issues like the "n+1 queries" problem, fetching whole records and building whole model instances when you only need a single field's value, doing computations at the Python level when they could be done at the database level etc
check your db schema's indexes - well used indexes can vastly improve performance, badly used ones can vastly degrade it...
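To make the expiry problem concrete, here is a rough sketch of a cached value that is recomputed by only one caller at a time and can be invalidated explicitly after an update. It uses only the standard library rather than Django's cache API, CachedValue and slow_query are hypothetical names, and it only works within a single process, so treat it as an illustration of the idea rather than a drop-in solution.

```python
import threading
import time


class CachedValue:
    """Cache one expensive result, recomputing it on expiry or invalidation."""

    def __init__(self, compute, ttl_seconds):
        self._compute = compute
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._value = None
        self._expires_at = None  # None means "nothing cached yet"

    def get(self):
        # The lock is held while recomputing, so concurrent callers wait
        # for the single recomputation instead of each starting their own.
        with self._lock:
            now = time.monotonic()
            if self._expires_at is None or now >= self._expires_at:
                self._value = self._compute()
                self._expires_at = time.monotonic() + self._ttl
            return self._value

    def invalidate(self):
        # Call this after the underlying data has been updated.
        with self._lock:
            self._expires_at = None


calls = []


def slow_query():
    # Stand-in for the expensive database query.
    calls.append(1)
    return ["row-%d" % len(calls)]


report = CachedValue(slow_query, ttl_seconds=300)
first = report.get()
second = report.get()   # served from the cache
report.invalidate()
third = report.get()    # recomputed after invalidation
```

The trade-off is that callers block while the value is being rebuilt; serving the stale value during the rebuild is another option, at the price of briefly showing outdated data.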
and of course use the right tools (db query logging, Python's profiler etc) to make sure you identify the real issues.

How to implement Google-style pagination on app engine?

See the pagination on the app gallery? It has page numbers and a 'start' parameter which increases with the page number. Presumably this app was made on GAE. If so, how did they do this type of pagination? ATM I'm using cursors but passing them around in URLs is as ugly as hell.
You can simply pass in the 'start' parameter as an offset to the .fetch() call on your query. This gets less efficient as people dive deeper into the results, but if you don't expect people to browse past 1000 or so, it's manageable. You may also want to consider keeping a cache, mapping queries and offsets to cursors, so that repeated queries can fetch the next set of results efficiently.
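The cache of offsets to cursors could look roughly like the sketch below. It is deliberately independent of the App Engine API: cursors are treated as opaque strings, and CursorCache, remember and best_start are hypothetical names. The idea is to resume from the closest cached cursor at or before the requested offset, so only the remaining rows have to be skipped.

```python
class CursorCache:
    """Remember which cursor corresponds to which offset of a query."""

    def __init__(self):
        self._cursors = {}  # (query_key, offset) -> opaque cursor string

    def remember(self, query_key, offset, cursor):
        self._cursors[(query_key, offset)] = cursor

    def best_start(self, query_key, offset):
        """Return (cursor, remaining_offset) for the nearest cached offset."""
        candidates = [
            cached_offset
            for (key, cached_offset) in self._cursors
            if key == query_key and cached_offset <= offset
        ]
        if not candidates:
            return None, offset  # no cursor known: fall back to a plain offset
        nearest = max(candidates)
        return self._cursors[(query_key, nearest)], offset - nearest


cache = CursorCache()
cache.remember("apps-by-rating", 100, "cursor-at-100")
print(cache.best_start("apps-by-rating", 130))
print(cache.best_start("apps-by-rating", 50))
```

The URL still carries only the readable 'start' number; the cursor stays on the server side.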
Ben Davies's outstanding PagedQuery class will do everything that you want and more.

Get records before and after current selection in Django query

It sounds like an odd one but it's a really simple idea. I'm trying to make a simple Flickr for a website I'm building. This specific problem comes when I want to show a single photo (from my Photo model) on the page but I also want to show the image before it in the stream and the image after it.
If I were only sorting these streams by date, or was only sorting by ID, that might be simpler... But I'm not. I want to allow the user to sort and filter by a whole variety of methods. The sorting is simple. I've done that and I have a result-set, containing 0-many Photos.
If I want a single Photo, I start off with that filtered/sorted/etc stream. From it I need to get the current Photo, the Photo before it and the Photo after it.
Here's what I'm looking at, at the moment.
prev = None
next = None
photo = None
count = filtered_queryset.count()
for i in range(count):  # queryset indexes start at 0
    if filtered_queryset[i].pk == desired_pk:
        if i > 0: prev = filtered_queryset[i-1]
        if i < count - 1: next = filtered_queryset[i+1]
        photo = filtered_queryset[i]
        break
It just seems disgustingly messy. And inefficient. Oh my lord, so inefficient. Can anybody improve on it though?
Django queries are late-binding, so it would be nice to make use of that though I guess that might be impossible given my horrible restrictions.
Edit: it occurs to me that I can just chuck in some SQL to re-filter queryset. If there's a way of selecting something with its two (or one, or zero) closest neighbours with SQL, I'd love to know!
You could try the following:
Evaluate the filtered/sorted queryset and get the list of photo ids, which you hold in the session. These ids all match the filter/sort criteria.
Keep the current index into this list in the session too, and update it when the user moves to the previous/next photo. Use this index to get the prev/current/next ids to use in showing the photos.
When the filtering/sorting criteria change, re-evaluate the list and set the current index to a suitable value (e.g. 0 for the first photo in the new list).
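Once the ordered list of ids is available, finding the neighbours is a plain list lookup. A minimal sketch, assuming the ids were obtained with something like list(filtered_queryset.values_list("pk", flat=True)) and stored in the session; neighbours is a hypothetical helper name.

```python
def neighbours(photo_ids, current_id):
    """Return (previous_id, current_id, next_id); missing neighbours are None."""
    index = photo_ids.index(current_id)  # raises ValueError if the id is absent
    prev_id = photo_ids[index - 1] if index > 0 else None
    next_id = photo_ids[index + 1] if index < len(photo_ids) - 1 else None
    return prev_id, current_id, next_id


ids_in_stream_order = [7, 3, 9]
print(neighbours(ids_in_stream_order, 3))
print(neighbours(ids_in_stream_order, 7))
```

The three ids can then be fetched in a single query, which avoids indexing into the queryset repeatedly.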
I see the following possibilities:
Your URL query parameters contain the sort/filtering information and some kind of 'item number', which is the item number within your filtered queryset. This is the simple case - previous and next are item number minus one and plus one respectively (plus some bounds checking)
You want the URL to be a permalink, and contain the photo primary key (or some unique ID). In this case, you are presumably storing the sorting/filtering in one of:
the URL as query parameters. In this case you don't have true permalinks, and so you may as well stick the item number in the URL as well, getting you back to option 1.
hidden fields in the page, and using POSTs for links instead of normal links. In this case, stick the item number in the hidden fields as well.
session data/cookies. This will break if the user has two tabs open with different sorts/filtering applied, but that might be a limitation you don't mind - after all, you have envisaged that they will probably just be using one tab and clicking through the list. In this case, store the item number in the session as well. You might be able to do something clever to "namespace" the item number for the case where they have multiple tabs open.
In short, store the item number wherever you are storing the filtering/sorting information.

Dealing with URLs in Django

So, basically what I'm trying to do is a hockey pool application, and there are a ton of ways I should be able to filter to view the data. For example, filter by free agent, goals, assists, position, etc.
I'm planning on doing this with a bunch of query strings, but I'm not sure what the best approach would be to pass along these query strings. Let's say I wanted to be on page 2 (as I'm using pagination for splitting the pages), sort by goals, and only show forwards, I would have the following query string:
?page=2&sort=g&position=f
But if I was on that page, and it was showing me all this corresponding info, and I clicked, say, points instead of goals, I would still want all my other filters intact, so like this:
?page=2&sort=p&position=f
Since HTTP is stateless, I'm having trouble working out what the best approach to this would be. If anyone has some good ideas they would be much appreciated, thanks ;)
Shawn J
Firstly, think about whether you really want to save all the parameters each time. In the example you give, you change the sort order but preserve the page number. Does this really make sense, considering you will now have different elements on that page? Even more, if you change the filters, the currently selected page number might not even exist.
Anyway, assuming that is what you want, you don't need to worry about state or cookies or any of that, seeing as all the information you need is already in the GET parameters. All you need to do is to replace one of these parameters as required, then re-encode the string. Easy to do in a template tag, since GET parameters are stored as a QueryDict which is basically just a dictionary.
Something like (untested):
@register.simple_tag
def url_with_changed_parameter(request, param, value):
    params = request.GET.copy()  # request.GET is immutable, so work on a mutable copy
    params[param] = value
    return "%s?%s" % (request.path, params.urlencode())
and you would use it in your template:
{% url_with_changed_parameter request "page" 2 %}
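The same replace-and-re-encode step can be tried outside Django with the standard library. This is a sketch under assumptions: it takes the path and raw query string as plain arguments instead of a request object, "/players/" is a made-up path, and because it uses a plain dict it keeps only the last value of any repeated parameter.

```python
from urllib.parse import parse_qsl, urlencode


def url_with_changed_parameter(path, query_string, param, value):
    """Return path plus the query string with one parameter set to value."""
    params = dict(parse_qsl(query_string, keep_blank_values=True))
    params[param] = value  # replaces an existing key in place, or appends a new one
    return "%s?%s" % (path, urlencode(params))


print(url_with_changed_parameter("/players/", "page=2&sort=g&position=f", "sort", "p"))
```

All other filters survive unchanged, which is the behaviour asked for in the question.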
Have you looked at django-filter? It's really awesome.
Check out filter mechanism in the admin application, it includes dealing with dynamically constructed URLs with filter information supplied in the query string.
In addition - consider saving actual state information in cookies/sessions.
If you want to save all the "parameters", I'd say they are resource identifiers and should normally be part of the URI.
