How to implement searching for a specified user's documents? - python

In my current project, users can like songs, and now I'm going to add a song search so that a user can search for some song she has liked before.
I have implemented search engines using xapian before, which involves building indexes of documents periodically.
In my case, do I have to build indexes for every user's songs independently?
If I want the search results to be more real-time, does this mean that I need to build indexes incrementally every short period of time?

To take your questions separately.
Do I have to build indexes for every user's songs independently?
No; a common technique for this kind of situation is to index each like separately, with both the information about the song and the identifier of the user. Then, when you search, you filter the results of the user's free-text search by the identifier of the user who is actually logged in.
In Xapian you'd do this by adding a term representing the user (with a suitable prefix, so you might have XU175 for a user with id 175, perhaps), and then using OP_FILTER to restrict the search to just likes by the logged-in user.
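A minimal sketch of both sides of that in Python, using the XU prefix from above (the database path, song text and user id are illustrative only):

# Sketch only: index a like with a per-user term, then filter searches by that term.
import xapian

db = xapian.WritableDatabase("likes.db", xapian.DB_CREATE_OR_OPEN)

# Indexing one like: the song text plus a term identifying the user.
doc = xapian.Document()
term_generator = xapian.TermGenerator()
term_generator.set_stemmer(xapian.Stem("en"))
term_generator.set_document(doc)
term_generator.index_text("Paint It Black The Rolling Stones")
doc.add_term("XU175")          # user 175 liked this song
db.add_document(doc)
db.commit()

# Searching: parse the user's text query, then restrict it to the logged-in user's likes.
query_parser = xapian.QueryParser()
query_parser.set_stemmer(xapian.Stem("en"))
query_parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
text_query = query_parser.parse_query("paint it black")
filtered = xapian.Query(xapian.Query.OP_FILTER, text_query, xapian.Query("XU175"))

enquire = xapian.Enquire(db)
enquire.set_query(filtered)
for match in enquire.get_mset(0, 10):
    print(match.docid)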
Do I need to build indexes incrementally every short period of time [to support real-time indexing]?
This is entirely dependent on the search system you're using. With Xapian you can either do that, periodically 'compacting' the generated databases into one base database, or you can index live into the database. Since Xapian is single-writer, you'd want to find a way of serialising the writes, such as putting new likes onto a queue and having a single process pop them off and index them into the database. One largely off-the-shelf solution would be to use Restpose, written by one of the Xapian developers, which fills the same sort of role for Xapian as Solr does for Lucene.
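A minimal sketch of that single-writer pattern, assuming likes arrive on an in-process queue (a real deployment would more likely use a shared task queue or a table of pending likes):

# Sketch only: one writer loop drains a queue of likes and indexes them into Xapian.
import queue
import xapian

like_queue = queue.Queue()  # filled elsewhere when a user likes a song

def indexer_loop(db_path="likes.db"):
    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    term_generator = xapian.TermGenerator()
    term_generator.set_stemmer(xapian.Stem("en"))

    while True:
        like = like_queue.get()          # e.g. {"user_id": 175, "title": "Paint It Black"}
        doc = xapian.Document()
        term_generator.set_document(doc)
        term_generator.index_text(like["title"])
        doc.add_term("XU%d" % like["user_id"])   # per-user filter term, as above
        db.add_document(doc)
        db.commit()
        like_queue.task_done()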
You can also get fancier by indexing into one database, then replicating that to another and searching the replicated version, which also gives you options to scale horizontally in future. There's a discussion of replication in the Xapian documentation.

Related

Database Design for tracking user flow

I want to design a user tracking system spanning different models. There is a Module table which is related to sections, and sections are related to sub-sections. Each sub-section can relate to different independent tables. The independent tables store different items (slides, videos, texts) which the user goes through in a linear flow, just like a Coursera course. The flow is predetermined by us but can be changed by us, so the linear flow between the sub-sections can't be hardcoded. I also need to keep track of the user's progress through these sub-sections.
For Example:
A sub-section can point to a game table where user_info, game_score and game_completion_date_time are stored, or the sub-section might point to a slides table storing slide_text, slide_url and user_info.
I want to keep track of changes in those tables, so what should my approach be? Below I have posted an image showing how I am approaching the problem.
You might make a 'courseflows' table with the order stored as a sequence of comma-separated integer ids in the proper order, then parse through that sequence when presenting sections to the user. Add whatever other information you need, and a timestamp so that any new users are assigned the most recent flow. By linking each user to a specific flow, you can also preserve a user's flow in case you change it while users are halfway through.
I would also consider using different secondary tables to mark user completion of each module, section, and subsection instead of just one.
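A minimal sketch of parsing such a flow, assuming a hypothetical courseflows table with a comma-separated subsection_order column and a created_at timestamp:

# Sketch only: table and column names (courseflows, subsection_order, created_at) are assumptions.
import sqlite3

conn = sqlite3.connect("course.db")
cur = conn.cursor()

# Fetch the most recent flow; new users get assigned this one.
cur.execute("SELECT id, subsection_order FROM courseflows ORDER BY created_at DESC LIMIT 1")
flow_id, order_csv = cur.fetchone()

# Parse "12,5,9,..." into an ordered list of sub-section ids.
subsection_ids = [int(part) for part in order_csv.split(",") if part.strip()]

for position, subsection_id in enumerate(subsection_ids, start=1):
    print(position, subsection_id)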

Filter results from browse_release_groups by artist_id to get discography, python

I'm trying to retrieve discographies for various artists. Wikipedia and the manual web interface for MusicBrainz.org seem to agree on what albums make these up, for the artists I've checked. My first thought was to screen-scrape either of these resources, but doing that properly looks like hard work.
Direct queries of the musicbrainz data seemed to offer a quicker way to get clean data. I would ideally construct a request like this ...
data = get_release_groups(artist=mbid,
                          primary_type='Album',
                          status='Official',
                          includes=['first_release_date',
                                    'title',
                                    'secondary_type_list'])
I chose to use the Python wrapper musicbrainzngs, as I am fairly experienced with Python. It gave me a choice of three methods: get_, search_ and browse_. get_ will not return sufficient records. browse_ appeared to be what I wanted, so I tried that first, especially as search_ was documented around looking for text in the Python examples, rather than the mb_id, which I already had.
When I did a browse_release_groups(artist=artist_id,,,), I got a list of release groups, each containing the data I wanted: album title, type and year. However, I also got a large number of other release groups that don't appear in the manual web results (for example, for The Rolling Stones): https://musicbrainz.org/artist/b071f9fa-14b0-4217-8e97-eb41da73f598
There didn't appear to be any way to filter in the query for status='official', or to include the status as part of the results so I could manually filter.
In response to this question, Wieland has suggested I use the search_ query. I have tested search_release_groups(arid=mbid, status='official', primarytype='Album', strict=True, limit=...) and this returns many fewer release groups. As far as studio albums are concerned, it matches 1:1. There are still a few minor discrepancies in the compilations, which I can live with. However, this query did not return the first-release-date, and so far, it has been resistant to my attempts to find how to include it. I notice in the server search code linked to that every query starts off manipulating rgm.first_release_date_year etc, but it's not clear how/when this gets returned from a query.
It's just occurred to me that I can use both a browse_ and a search_, as together they give me all the information. So I have a workaround, but it feels rather agricultural.
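A rough sketch of that combined workaround using musicbrainzngs (the result dictionary keys follow what the library returns, which may vary slightly between versions, and pagination beyond the first 100 release groups is omitted):

# Sketch only: combine search_ (for status filtering) with browse_ (for first-release-date).
import musicbrainzngs

musicbrainzngs.set_useragent("discography-example", "0.1", "me@example.com")

mbid = "b071f9fa-14b0-4217-8e97-eb41da73f598"  # The Rolling Stones

# search_ allows filtering by status but does not return first-release-date.
search = musicbrainzngs.search_release_groups(
    arid=mbid, status="official", primarytype="Album", strict=True, limit=100)
official_ids = {rg["id"] for rg in search["release-group-list"]}

# browse_ returns first-release-date but cannot filter by status.
browse = musicbrainzngs.browse_release_groups(artist=mbid, limit=100)
for rg in browse["release-group-list"]:
    if rg["id"] in official_ids:
        print(rg.get("first-release-date"), rg.get("title"), rg.get("type"))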
TL;DR I want release groups (titles, dates, types, status) by artist ID. If I browse, I get dates, but can't include or filter by status. If I search, I can filter by status, but don't get dates. How can I get both in one query?
I'm not entirely sure what your question is, but the find_by_artist method of release groups (source here) is what's doing the filtering of release groups for the artist pages, in particular:
# Show only RGs with official releases by default, plus all-status-less ones so people fix the status
unless ($show_all) {
    push @$conditions, "(EXISTS (SELECT 1 FROM release where release.release_group = rg.id AND release.status = '1') OR
        NOT EXISTS (SELECT 1 FROM release where release.release_group = rg.id AND release.status IS NOT NULL))";
}
Unfortunately, I think it's not possible to express that condition in a normal web service call. You can, however, use the search web service to filter for release groups by the rolling stones that contain at least one "official" release: http://musicbrainz.org/ws/2/release-group/?query=arid:b071f9fa-14b0-4217-8e97-eb41da73f598%20AND%20status:official&offset=0. In python-musicbrainzngs, the call for this is
search_release_groups(arid="b071f9fa-14b0-4217-8e97-eb41da73f598", status="official", strict=True)
Unfortunately, the search results don't include the first-release-date field. There's an open ticket about it, but it's not going to be fixed in the near future.

Solr & User data

Let's assume I am developing a service that provides a user with articles. Users can favourite articles and I am using Solr to store these articles for search purposes.
However, when the user adds an article to their favourites list, I would like to be able to figure out which articles the user has added to favourites so that I can highlight the favourite button.
I am thinking of two approaches:
Fetch articles from Solr and then loop through each article to fetch the "favourite-status" of this article for this specific user from MySQL.
Whenever a user favourites an article, add this user's ID to a multi-valued column in Solr and check whether the ID of the current user is in this column or not.
I don't know the capacity of the multivalued column... and I also don't think the second approach would be "good practice" (saving user-related data in the index).
What other options do I have, if any? Is approach 2 a correct approach?
I'd go with a modified version of the first one - for now, it'll keep user-specific data that's not going to be used for search out of the index (although if you foresee a case where you want to search only favourited articles, it would probably be an interesting field to have in the index). For just display purposes, as in this case, I'd take all the ids returned from Solr, fetch them in one SQL statement from the database and then set the UI values depending on that. It's a fast and easy solution.
If you foresee "search only in my favourited articles" as a use case, I would try to get that information into the index as well (or support other filtering against whether a specific user has favourited an article). In that case I'd try to avoid indexing anything more than the id of the user who favourited the article.
Both solutions would work, however, although the latter requires more code - and the response from Solr could grow large if many users favourite an article, so I'd try to avoid having to return a set of user ids in that case (many favourites for a single article).
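For what it's worth, a minimal sketch of that modified first approach, assuming pysolr on the Solr side and a hypothetical favourites table keyed by user_id and article_id in MySQL:

# Sketch only: connection details, field names and the favourites table are assumptions.
import pysolr
import pymysql

solr = pysolr.Solr("http://localhost:8983/solr/articles/")
db = pymysql.connect(host="localhost", user="app", password="secret", database="app")

def search_articles(query, user_id):
    docs = list(solr.search(query, rows=20))
    article_ids = [doc["id"] for doc in docs]
    if not article_ids:
        return docs

    # One query to fetch the favourite status of all returned articles for this user.
    placeholders = ", ".join(["%s"] * len(article_ids))
    with db.cursor() as cur:
        cur.execute(
            "SELECT article_id FROM favourites WHERE user_id = %s AND article_id IN (" + placeholders + ")",
            [user_id] + article_ids)
        favourited = {row[0] for row in cur.fetchall()}

    for doc in docs:
        doc["is_favourite"] = doc["id"] in favourited  # drives the highlighted button in the UI
    return docs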

github api - fetch all commits of all repos of an organisation based on date

Assembla provides a simple way to fetch all commits of an organisation using api.assembla.com/v1/activity.json, which takes to and from parameters allowing you to get commits for a selected date range from all the spaces (repos) the user is participating in.
Is there any similar way in GitHub?
I found these for Github:
/repos/:owner/:repo/commits
Accepts since and until parameters for getting commits in a selected date range. But since I want commits from all repos, I have to loop over all those repos and fetch commits for each one (a sketch of this loop follows the list below).
/users/:user/events
This shows the commits of a user. I don't have any problem looping over all the users in the org, but how can I get events for a particular date?
/orgs/:org/events
This shows commits of all users across all repos, but I don't know how to fetch those for a particular date either.
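A minimal sketch of that per-repo loop against the first endpoint, with a placeholder token and org name, and pagination truncated to the first page for brevity:

# Sketch only: loop over an organisation's repos and fetch commits in a date window.
import requests

TOKEN = "ghp_xxx"                      # hypothetical personal access token
HEADERS = {"Authorization": "token " + TOKEN}
ORG = "my-org"                         # hypothetical organisation
SINCE = "2024-01-01T00:00:00Z"
UNTIL = "2024-01-31T23:59:59Z"

repos = requests.get("https://api.github.com/orgs/%s/repos" % ORG,
                     headers=HEADERS, params={"per_page": 100}).json()

for repo in repos:
    resp = requests.get("https://api.github.com/repos/%s/%s/commits" % (ORG, repo["name"]),
                        headers=HEADERS,
                        params={"since": SINCE, "until": UNTIL, "per_page": 100})
    if resp.status_code != 200:        # e.g. empty repositories return an error
        continue
    for commit in resp.json():
        print(repo["name"], commit["sha"], commit["commit"]["author"]["date"])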
The problem with using the /users/:user/events endpoint is that you don't just get PushEvents, so you would have to skip over the non-commit events and perform more calls to the API. Assuming you're authenticated, you should be safe so long as your users aren't hyperactive.
For /orgs/:org/events I don't think they accept parameters for anything, but I can check with the API designers.
And just in case you aren't familiar, these are all paginated results, so you can go back until the beginning with the Link headers. My library (github3.py) provides iterators to do this for you automatically. You can also tell it how many events you'd like (same with commits, etc.). But yeah, I'll come back and edit after talking to the API guys at GitHub.
Edit: Conversation
You might want to check out the GitHub Archive project -- http://www.githubarchive.org/, and the ability to query the archive using Google's BigQuery. Sounds like it would be a perfect tool for the job -- I'm pretty sure you could get exactly what you want with a single query.
The other option is to call the GitHub API -- iterate over all events for the organization and filter out the ones that don't satisfy your date range criteria and event type criteria (commits). But since you can't specify date ranges in the API call, you will probably make a lot of calls to get the events that interest you. Notice that you don't have to iterate over every page starting from 0 to find the page that contains the first result in the date range -- just do a (variation of) binary search over page numbers to find any page that contains a commit in the date range, and then iterate in both directions until you break out of the date range. That should reduce the number of API calls you make.
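A rough sketch of that binary-search idea, assuming the org events feed comes back newest-first (lower page numbers hold newer events) and the date range still falls inside what the events API retains; walking to neighbouring pages once a hit is found is left out for brevity:

# Sketch only: token, org name and date range are placeholders.
from datetime import datetime, timezone
import requests

HEADERS = {"Authorization": "token ghp_xxx"}
URL = "https://api.github.com/orgs/my-org/events"
START = datetime(2024, 1, 1, tzinfo=timezone.utc)
END = datetime(2024, 1, 31, tzinfo=timezone.utc)
MAX_PAGES = 10  # the events feed only retains a limited window, so a small cap suffices

def fetch_page(page):
    resp = requests.get(URL, headers=HEADERS, params={"page": page, "per_page": 30})
    resp.raise_for_status()
    return resp.json()

def created(event):
    return datetime.fromisoformat(event["created_at"].replace("Z", "+00:00"))

lo, hi, hit = 1, MAX_PAGES, None
while lo <= hi and hit is None:
    mid = (lo + hi) // 2
    events = fetch_page(mid)
    if not events:                      # past the end of the feed: try newer (lower) pages
        hi = mid - 1
    elif created(events[0]) < START:    # newest event on page is too old: try newer pages
        hi = mid - 1
    elif created(events[-1]) > END:     # oldest event on page is too new: try older pages
        lo = mid + 1
    else:
        hit = mid                       # this page overlaps the date range

if hit is not None:
    for event in fetch_page(hit):
        when = created(event)
        if event["type"] == "PushEvent" and START <= when <= END:
            for commit in event["payload"]["commits"]:
                print(when, commit["sha"], commit["message"])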

Search unreliable author names

We have scanned thousands of old documents and entered key data into a database. One of the fields is author name.
We need to search for documents by a given author but the exact name might have been entered incorrectly as on many documents the data is handwritten.
I thought of searching for only the first few letters of the surname and then presenting a list for the user to select from. I don't know at this stage how many distinct authors there are; I suspect it will be in the hundreds rather than hundreds of thousands. There will be hundreds of thousands of documents.
Is there a better way? Would an SQL database handle it better?
The software is Python, and there will be a list of documents, each with an author.
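A minimal sketch of the surname-prefix idea just described, in plain Python (the author list and the surname heuristic are illustrative assumptions):

# Sketch only: match on the first few letters of the surname and present the candidates.
def surname(author_name):
    # Naive assumption: the surname is the last whitespace-separated token.
    return author_name.strip().split()[-1].lower()

def candidate_authors(authors, query, prefix_len=3):
    prefix = query.strip().lower()[:prefix_len]
    return sorted({a for a in authors if surname(a).startswith(prefix)})

authors = ["J. Black", "Jane Blaek", "John Blake", "A. Brown"]  # distinct names from the database
print(candidate_authors(authors, "Black"))  # the user then picks the intended author(s)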
I think you can use MongoDB, where you can store a list field with all the possible spellings of an author's name. For example, if you have the handwritten name "black" and you can't tell whether a letter is a "c" or an "e", you can store the original name as "black" and add "blaek" to the list of possible names.
You could use Sunburnt which is a Python-Solr library which accesses Solr which is built on top of Lucene.
An excerpt of what Solr is:
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.
It will give you all you want for searching documents, including partial hits and potential matches on whatever your search criteria is.
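A rough sketch of how indexing and querying an author field might look with Sunburnt, assuming a running Solr core whose schema defines id and author fields (the exact response-iteration details may differ between Sunburnt versions):

# Sketch only: Solr URL, field names and documents are assumptions.
import sunburnt

si = sunburnt.SolrInterface("http://localhost:8983/solr/")

# Index the scanned documents' author names.
si.add({"id": "doc-1", "author": "J. Black"})
si.add({"id": "doc-2", "author": "Jane Blaek"})
si.commit()

# Query by author; the schema can be extended with phonetic or n-gram
# filters for sloppier matching of handwritten names.
for result in si.query(author="black").execute():
    print(result)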
