Solr & User data - python

Let's assume I am developing a service that provides a user with articles. Users can favourite articles and I am using Solr to store these articles for search purposes.
However, when the user adds an article to their favourites list, I would like to be able to figure out which articles the user has added to favourites so that I can highlight the favourite button.
I am thinking of two approaches:
1) Fetch articles from Solr and then loop through each article, fetching its favourite status for this specific user from MySQL.
2) Whenever a user favourites an article, add that user's ID to a multi-valued field in Solr, and check whether the current user's ID is in this field or not.
I don't know the capacity of a multi-valued field... and I also don't think the second approach is good practice (saving user-related data in the index).
What other options do I have, if any? Is approach 2 a correct approach?

I'd go with a modified version of the first approach: it keeps user-specific data that isn't going to be used for search out of the index, at least for now (although if you foresee a case where you want to search within favourited articles, it would be an interesting field to have in the index). For pure display purposes like this, I'd take all the IDs returned from Solr, fetch their favourite status in one SQL statement from the database, and then set the UI values from that. It's a fast and easy solution.
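A minimal sketch of that one-query lookup, assuming pysolr and MySQLdb as the client libraries and a favourites(user_id, article_id) table (all names here are hypothetical):

import MySQLdb
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/articles/")
db = MySQLdb.connect(db="myapp")

def search_with_favourites(query, user_id):
    results = list(solr.search(query, rows=20))
    article_ids = [doc["id"] for doc in results]

    # One IN query fetches the favourite status for the whole page.
    favourited = set()
    if article_ids:
        placeholders = ", ".join(["%s"] * len(article_ids))
        cur = db.cursor()
        cur.execute(
            "SELECT article_id FROM favourites"
            " WHERE user_id = %s AND article_id IN ({})".format(placeholders),
            [user_id] + article_ids,
        )
        favourited = {row[0] for row in cur.fetchall()}

    # Tag each result so the UI can highlight the favourite button.
    for doc in results:
        doc["is_favourite"] = doc["id"] in favourited
    return results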
If you foresee "search only in my favourited articles" as a use case, I would try to get that information into the index as well (likewise for other filters that depend on whether a specific user has favourited an article). Even then, I'd try to avoid indexing anything more than the ID of each user who favourited the article.
Both solutions would work, although the latter requires more code, and the response from Solr could grow large if many users favourite the same article, so in that case I'd try to avoid returning the whole set of user IDs.

Filter results from browse_release_groups by artist_id to get discography, python

I'm trying to retrieve discographies for various artists. Wikipedia and the manual web interface for MusicBrainz.org seem to agree on what albums make these up, for the artists I've checked. My first thought was to screen-scrape either of these resources, but that looks like hard work to do properly.
Direct queries of the MusicBrainz data seemed to offer a quicker way to get clean data. Ideally, I would construct a request like this:
data = get_release_groups(artist=mbid,
                          primary_type='Album',
                          status='Official',
                          includes=['first_release_date',
                                    'title',
                                    'secondary_type_list'])
I chose to use the Python wrapper musicbrainzngs, as I am fairly experienced with Python. It gave me a choice of three kinds of method: get_, search_ and browse_. get_ will not return sufficient records. browse_ appeared to be what I wanted, so I tried that first, especially as the Python examples documented search_ around looking for text rather than the MBID, which I already had.
When I did browse_release_groups(artist=artist_id, ...), I got a list of release groups, each containing the data I wanted (album title, type and year). However, I also got a large number of other release groups that don't appear in the manual web results, for example for The Rolling Stones: https://musicbrainz.org/artist/b071f9fa-14b0-4217-8e97-eb41da73f598
There didn't appear to be any way to filter the query by status='Official', or to include the status in the results so that I could filter manually.
In response to this question, Wieland has suggested I use the search_ query. I have tested search_release_groups(arid=mbid, status='official', primarytype='Album', strict=True, limit=...) and this returns far fewer release groups. As far as studio albums are concerned, it matches 1:1. There are still a few minor discrepancies in the compilations, which I can live with. However, this query does not return the first-release-date, and so far it has resisted my attempts to find out how to include it. I notice in the server search code linked to that every query starts off manipulating rgm.first_release_date_year etc., but it's not clear how or when this gets returned from a query.
It has just occurred to me that I can use both a browse_ and a search_, as together they give me all the information. So I have a workaround, but it feels rather agricultural.
TL;DR I want release groups (titles, dates, types, statuses) by artist ID. If I browse, I get dates but can't include or filter by status. If I search, I can filter by status but don't get dates. How can I get both in one query?
I'm not entirely sure what your question is, but the find_by_artist method of release groups (source here) is what does the filtering of release groups for the artist pages, in particular:
# Show only RGs with official releases by default, plus all-status-less ones so people fix the status
unless ($show_all) {
    push @$conditions, "(EXISTS (SELECT 1 FROM release where release.release_group = rg.id AND release.status = '1') OR
                         NOT EXISTS (SELECT 1 FROM release where release.release_group = rg.id AND release.status IS NOT NULL))";
}
Unfortunately, I think it's not possible to express that condition in a normal web service call. You can, however, use the search web service to filter for release groups by The Rolling Stones that contain at least one "official" release: http://musicbrainz.org/ws/2/release-group/?query=arid:b071f9fa-14b0-4217-8e97-eb41da73f598%20AND%20status:official&offset=0. In python-musicbrainzngs, the call for this is
search_release_groups(arid="b071f9fa-14b0-4217-8e97-eb41da73f598", status="official", strict=True)
Unfortunately, the search results don't include the first-release-date field. There's an open ticket about it, but it's not going to be fixed in the near future.
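If two requests are acceptable, the browse_ + search_ combination mentioned in the question can be merged in a few lines. A sketch with musicbrainzngs (pagination beyond the first 100 results is omitted):

import musicbrainzngs as mb

mb.set_useragent("discography-example", "0.1", "you@example.com")
mbid = "b071f9fa-14b0-4217-8e97-eb41da73f598"  # The Rolling Stones

# search_ can filter by status but omits first-release-date...
found = mb.search_release_groups(
    arid=mbid, status="official", primarytype="Album", strict=True, limit=100
)
official_ids = {rg["id"] for rg in found["release-group-list"]}

# ...while browse_ returns the dates but can't filter by status, so merge the two.
browsed = mb.browse_release_groups(artist=mbid, release_type=["album"], limit=100)
for rg in browsed["release-group-list"]:
    if rg["id"] in official_ids:
        print(rg.get("first-release-date"), rg["title"])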

Dynamic database tables in django

I am working on a project which requires me to create a table for every user who registers on the website, using that user's username as the table name. The columns in the table are the same for every user.
While researching I found this: Django dynamic model fields. I am not sure how to use django-mutant to accomplish this. Also, is there any way I could do this without using any external apps?
PS: The backend that I am using is MySQL.
An interesting question, which might be of wider interest.
Creating one table per user is a maintenance nightmare. You should instead define a single table to hold all users' data, and then use the database's capabilities to retrieve only those rows pertaining to the user of interest (after checking permissions if necessary, since it is not a good idea to give any user unrestricted access to another user's data without specific permissions having been set).
Adopting your proposed solution requires you to construct SQL statements containing the relevant user's table name. Successive queries to the database will mostly be different, and this will slow the work down, because every SQL statement has to be "prepared" (the syntax has to be checked, the names of tables and columns have to be verified, the requesting user's permission to access the named resources has to be authorized, and so on).
By using a single table (model) the same queries can be used repeatedly, with parameters used to vary specific data values (in this case the name of the user whose data is being sought). Your database work will move along faster, you will only need a single model to describe all users' data, and database management will not be a nightmare.
A further advantage is that Django (which you appear to be using) has an extensive user-based permission model, and can easily be used to authenticate user logins (once you know how). These advantages are so compelling I hope you will recant from your heresy and decide you can get away with a single table (and, if you are planning to use standard Django logins, a relationship with the User model that comes as a central part of any Django project).
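A minimal sketch of the single-table approach (the model and field names here are made up):

from django.conf import settings
from django.db import models

class Article(models.Model):
    # One table for everyone; rows are scoped to a user by a foreign key.
    owner = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    title = models.CharField(max_length=200)
    body = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

def articles_for(user):
    # The same prepared query serves every user; only the parameter varies.
    return Article.objects.filter(owner=user)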
Please feel free to ask more questions as you proceed. It seems you are new to database work, so I have tried to present an appropriate level of detail. There are many pitfalls such as this if you cannot access knowledgeable advice. People on SO will help you.
This page shows how to create a model and install its table into the database on the fly. So you could use type('table_with_username', (models.Model,), attrs) to create a model, and django.core.management to install it into the database.
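For what it's worth, a sketch of that dynamic-model recipe on current Django, using schema_editor rather than the older management-command route (the app label "myapp" is assumed to be an installed app):

from django.db import connection, models

def make_user_model(username):
    # Same columns for every user; only the table/model name varies.
    attrs = {
        "__module__": "myapp.models",
        "Meta": type("Meta", (), {"app_label": "myapp"}),
        "note": models.TextField(),
    }
    return type("table_{}".format(username), (models.Model,), attrs)

UserModel = make_user_model("alice")
with connection.schema_editor() as editor:
    editor.create_model(UserModel)  # issues the CREATE TABLE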

CouchDB - How to create dynamic design docs based on end user input

I have been reading up more on CouchDB and really like it (master:master replication is one of the primary reasons I want to use it).
However, I have a query to ask of you guys... I came from PHP, and used the Drupal CMS fairly often. One of my favourite plugins (probably of the Drupal community as a whole) was the 'Views' plugin written by MerlinOfChaos. The idea is that an admin can use the Views UI system to create a dynamic stream of content from the database. This content could be from any content type (blog posts, articles, users, images, etc.) and could be filtered, ordered, arranged in grids, and so on. One simple example is creating a source of content for an animating slider, where the admin could go in at any time and change what is shown there. Typically, though, I would set it up as the five most recent items of content type X.
So with something like Mongo, I can kind of see how you could do this: a fairly advanced parser that converts what the admin wants into a db query. Since Mongo is all based on dynamic querying, it is very doable. However, I want to use Couch.
I have seen that I can create a view that takes a parameter and returns results based on it (such as a parameter holding the five article IDs you want displayed). But what if I want to be able to build something more advanced from the UI? Would I just add more parameters? For example, say the view selects all documents with 'contentType' = 'post' and the argument is the ID/page title. But what if I want the end user to also be able to choose the content type the view queries against? Or the five most recent articles, as long as the content type is one of three different values?
Another thing this makes me think of: once a view like this is created, saved to the db, and called for the first time, it spends time building the results. Would you do this on a production/live system?
Part of the idea is that I want an end user to be able to create a custom feed of content on their profile page, based on articles and posts on the site, and to be able to filter them and make their own categories, so to speak, and label them, such as their 'tech' feed and their 'food' feed.
I am still new to Couch and still have reading to do, but this is something that has been bugging me and I am trying to wrap my head around it, since the product I have in mind will be heavily dynamic, based on end user input.
The application itself will be written in Python.
In a nutshell, you would need to emit something like this in the view:
emit([doc.contentType, doc.addDate], doc); // emit the entire doc; addDate is a timestamp (assuming)
or
emit([doc.contentType, doc.addDate], null); // use with include_docs=true
Then, when you need to fetch the listing:
startkey=["post",9999999999]&endkey=["post",0]&limit=5&descending=true
Explain:
startkey = ["post",9999999999] = contentType is post, and addDate <= 9999999999
endkey = ["post",0] = contentType is post, and addDate >= 0
(with descending=true, CouchDB walks the index backwards, so startkey must be the higher key)
limit = 5 = return at most five posts
descending = true = sort descending, which means sort by addDate, newest first
To overcome the drawback of updating views on a live db, you can also create a new design (view) doc, so that your existing code and views aren't affected. Only after the new view has been built do you deploy the code that switches to it, and then you can retire the old view.
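A sketch of how this could look from Python with the couchdb package (the database, design doc, and field names here are all made up):

import couchdb

server = couchdb.Server("http://localhost:5984/")
db = server["content"]

# Deploying a new design doc leaves the old one, and the code using it, untouched.
db["_design/content_v2"] = {
    "views": {
        "by_type_and_date": {
            "map": "function (doc) { emit([doc.contentType, doc.addDate], null); }"
        }
    }
}

# Five most recent posts; {} sorts after any number, so it works as a high key.
rows = db.view(
    "content_v2/by_type_and_date",
    startkey=["post", {}],
    endkey=["post", 0],
    descending=True,
    limit=5,
    include_docs=True,
)
for row in rows:
    print(row.doc["title"])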

MongoDB: Embedded users into comments

I can't find the "best" solution for a very simple problem (or maybe not so simple).
I have a classical set of data: posts attached to users, and comments attached to a post and to a user.
Now I can't decide how to build the schema/classes.
One way is to store the user_id inside comments and inside posts.
But what happens when I have 200 comments on a page?
Or when I have N posts on a page?
I mean, that would be 200 additional requests to the database just to display user info (such as name and avatar).
Another solution is to embed user data into each comment and each post.
But first, it is a huge overhead; second, the model system gets corrupted (I'm using mongoalchemy); third, a user can change his info (like his avatar). And what then? As I understand it, an update operation on huge collections of comments or posts is not a simple operation...
What would you suggest? Are 200 requests per page to MongoDB OK (I must aim for performance)?
Or maybe I am just missing something...
You can avoid the N+1-problem of hundreds of requests using $in-queries. Consider this:
Post {
    PosterId: ObjectId,
    Text: string,
    Comments: [ObjectId, ObjectId, ...] // option 1
}

Comment {
    PostId: ObjectId, // option 2 (better)
    Created: dateTime,
    AuthorName: string,
    AuthorId: ObjectId,
    Text: string
}
Now you can find a post's comments with an $in query, and you can also easily find all comments made by a specific author.
Of course, you could also store the comments as an embedded array in the post, and perform an $in query on the user information when you fetch the comments. That way, you don't need to denormalize user names and still don't need hundreds of queries.
If you choose to denormalize the user names, you will have to update all comments ever made by a user when that user changes e.g. his name. On the other hand, if such operations don't occur very often, it shouldn't be a big deal. Or maybe it's even better to store the name the user had when he made the comment, depending on your requirements.
A general problem with embedding is that different writers will write to the same object, so you will have to use the atomic modifiers (such as $push). This is sometimes harder to use with mappers (I don't know mongoalchemy, though), and generally less flexible.
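To make the $in idea concrete, a sketch with pymongo (the collection and field names follow the schema above; the database name is made up):

from pymongo import MongoClient

db = MongoClient().blog

def page_of_comments(post_id, page_size=200):
    comments = list(
        db.comments.find({"PostId": post_id}).sort("Created", -1).limit(page_size)
    )

    # One $in query resolves all distinct authors instead of one query per comment.
    author_ids = list({c["AuthorId"] for c in comments})
    users = {u["_id"]: u for u in db.users.find({"_id": {"$in": author_ids}})}

    for c in comments:
        author = users[c["AuthorId"]]
        print(author["name"], c["Text"])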
What I would do with MongoDB is embed the user id into the comments (which are part of the structure of the "post" document).
Three simple hints for better performance:
1) Make sure there is an index on user_id.
2) Use a comment pagination method to avoid querying the database 200 times.
3) Caching is your friend. You could cache your user objects so you don't have to query the database each time.
I like the idea of embedding user data into each post, but then you have to think about what happens when a user's profile is updated; you have to make sure that no post is missed.
I would recommend starting out just by skimming how Mongo recommends you handle schemas. Generally, for "contains" relationships between entities, embedding should be chosen. Use linking when embedding would result in duplication of data.
http://www.mongodb.org/display/DOCS/Schema+Design
There's a pretty good use case from the MongoDB docs: http://docs.mongodb.org/manual/use-cases/storing-comments/
Conveniently it's also written in Python :-)

How to implement searching for a specified user's documents?

In my current project, users can like songs, and now I'm going to add a song search so that a user can search for some song she has liked before.
I have implemented search engines using xapian before, which involves building indexes of documents periodically.
In my case, do I have to build indexes for every user's songs independently?
If I want the search results to be more real-time, does this mean that I need to build indexes incrementally every short period of time?
To take your questions separately.
Do I have to build indexes for every user's songs independently?
No; a common technique for this kind of situation is to index each like separately, with both the information about the song and, additionally, the identifier of the user. Then when you search, you filter the results of the user's free-text search by the identifier of the user who is actually logged in.
In Xapian you'd do this by adding a term representing the user (with a suitable prefix, so you might have XU175 for a user with id 175, perhaps), and then using OP_FILTER to restrict the search to just likes by the logged-in user.
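A rough sketch of both halves with the Xapian Python bindings (the database path, prefix, and user id are just examples):

import xapian

# Indexing: one document per "like", tagged with the liker's id.
db = xapian.WritableDatabase("likes.db", xapian.DB_CREATE_OR_OPEN)
doc = xapian.Document()
termgen = xapian.TermGenerator()
termgen.set_stemmer(xapian.Stem("en"))
termgen.set_document(doc)
termgen.index_text("song title and other searchable text")
doc.add_term("XU175")  # XU prefix + user id, as suggested above
db.add_document(doc)
db.commit()

# Searching: parse the user's query, then filter to user 175's likes.
qp = xapian.QueryParser()
qp.set_stemmer(xapian.Stem("en"))
text_query = qp.parse_query("some song")
query = xapian.Query(xapian.Query.OP_FILTER, text_query, xapian.Query("XU175"))

enquire = xapian.Enquire(xapian.Database("likes.db"))
enquire.set_query(query)
for match in enquire.get_mset(0, 10):
    print(match.docid)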
Do I need to build indexes incrementally every short period of time [to support real-time indexing]?
This is entirely dependent on the search system you're using. With Xapian you can either do that and periodically 'compact' the databases generated into one base one; or you can index live into the database -- although since Xapian is single-writer, you'd want to find a way of serialising this, such as by putting new likes onto a queue and having a single process that pops them off and indexes into the database. One largely off-the-shelf solution to this would be to use Restpose, written by one of the Xapian developers, which fills the same sort of role as Solr does for Lucene.
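A toy version of that serialised single-writer pattern, using an in-process queue (a real deployment would want something more durable than this):

import queue
import threading
import xapian

likes = queue.Queue()

def indexer(path="likes.db"):
    # The only writer: pops likes off the queue and indexes them one by one.
    db = xapian.WritableDatabase(path, xapian.DB_CREATE_OR_OPEN)
    while True:
        user_id, song_text = likes.get()
        doc = xapian.Document()
        termgen = xapian.TermGenerator()
        termgen.set_document(doc)
        termgen.index_text(song_text)
        doc.add_term("XU%d" % user_id)
        db.add_document(doc)
        db.commit()  # commit per like here; batch commits in production

threading.Thread(target=indexer, daemon=True).start()
likes.put((175, "Some song title"))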
You can also get fancier by indexing into one database, then replicating that to another and searching the replicated version, which also gives you options to scale horizontally in future. There's a discussion of replication in the Xapian documentation.
