Let's say I have 5 entries in my Redis database:
news::id: the ID of the last news;
news::list: a list of all news IDs;
news:n where n is the news ID: a hash containing fields such as title, url, etc.;
news:n:upvotes: a list of the IDs of all users who upvoted the news; its length gives the number of upvotes.
news:n:downvotes: a list of the IDs of all users who downvoted the news; its length gives the number of downvotes.
Then I have multiple ranking algorithms, where rank =:
upvotes_count;
upvotes_count - downvotes_count;
upvotes_count - downvotes_count - age;
upvotes_count / downvotes_count;
age.
Now how do I sort these news items according to each of these algorithms?
I thought about computing the different ranks on every vote, but then if I introduce a new algorithm I would need to recompute the rank for all the news.
EVAL could help, but it won't be available until v2.6, and I certainly don't want to wait for that.
Alternatively, I could retrieve all the news and sort them in a Python list. But that translates into high memory usage, on top of the fact that Redis already keeps its data in memory.
So is there a proper way to do this or should I just move to MongoDB?
You can sort by constants stored in keys.
In your example, algorithm 1 can be sorted almost trivially with Redis. If you store the values of the other expressions after computing them, you can sort by them too. For 1., you will need to store the list count somewhere; I will assume it lives in news:n:upvotes:count.
The catch is to use the SORT command. For instance, the first sort would be:
SORT news::list BY news:*:upvotes:count GET news:*->title GET news:*->url
...to get titles and URLs sorted by upvotes, in ascending order.
There are modifiers too, for alphabetical sorting and ASC/DESC ordering. Read the command page in full; it is worthwhile.
PS: You can wrap the count, store, sort and possibly deletion of count in a MULTI/EXEC environment (a transaction).
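For readers unfamiliar with SORT's BY and GET clauses, their effect can be imitated in plain Python. This is only an illustration of the semantics (the key names mirror the hypothetical schema above, and the data is invented), not a substitute for running SORT on the server:

```python
# Plain-Python illustration of SORT news::list BY news:*:upvotes:count
# GET news:*->title: sort the IDs by an external weight key, then
# project a field out of each hash.
def sort_by_external_weight(ids, weights, records, field):
    """Sort ids ascending by weights[id] (SORT's default order) and
    return the given field of each record, like GET news:*->field."""
    ordered = sorted(ids, key=lambda i: weights[i])
    return [records[i][field] for i in ordered]

news_list = [1, 2, 3]                              # news::list
counts = {1: 7, 2: 42, 3: 19}                      # news:n:upvotes:count
news = {1: {"title": "a"}, 2: {"title": "b"}, 3: {"title": "c"}}

print(sort_by_external_weight(news_list, counts, news, "title"))
# ['a', 'c', 'b']
```

Add DESC to the SORT command (or reverse=True here) for the usual "highest first" leaderboard ordering.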
Related
I have a mongodb collection with close to 100,000 records, and each record has around 5,000 keys. A lot of this data is empty. How can I find (and maybe visually represent) how sparse the data is?
In other words, I would like to analyze the type of values stored under each key. What would be the right approach for this?
You could take a look at MongoDB's aggregation strategies; check out $group in particular.
From the way you describe your problem, I could easily see an accumulator over the number of keys in each record.
As an example, with the appropriate thresholds and transformations, such an operation could return the records grouped by number of keys (or an array populated solely with the number of keys for each record).
Such an approach could also allow you to perform some data analysis over the keys used for each record.
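As a sketch of the grouping idea, here it is done client-side in plain Python rather than in the aggregation pipeline (the field names and the notion of "empty" are invented for illustration; in practice you would push this into $group):

```python
from collections import Counter

def emptiness_profile(records):
    """Group records by how many of their keys hold a non-empty value."""
    def populated(doc):
        # Treat None, "", [] and {} as "empty"; adjust to taste.
        return sum(1 for v in doc.values() if v not in (None, "", [], {}))
    return Counter(populated(doc) for doc in records)

docs = [
    {"a": 1, "b": "", "c": None},   # 1 populated key
    {"a": 1, "b": 2, "c": 3},       # 3 populated keys
    {"a": "", "b": "", "c": ""},    # 0 populated keys
]
print(emptiness_profile(docs))
# Counter mapping populated-key count -> number of records
```

The resulting histogram (populated keys per record vs. record count) is also easy to plot, which addresses the "visually represent" part of the question.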
I'm using SQLAlchemy (being relatively new both to it and SQL) and I want to get a list of all comments posted to a set of things, but I'm only interested in comments that have been posted since a certain date, and the date is different for each thing:
To clarify, here's what I'm doing now: I begin with a dictionary that maps the ID code of each thing I'm interested in to the date I'm interested in for that thing. I do a quick list comprehension to get a list of just the codes (thingCodes) and then do this query:
things = meta.Session.query(Thing)\
.filter(Thing.objType.in_(['fooType', 'barType']))\
.filter(Thing.data.any(and_(Data.key == 'thingCode',Data.value.in_(thingCodes))))\
.all()
which returns a list of the thing objects (I do need those in addition to the comments). I then iterate through this list, and for each thing do a separate query:
comms = meta.Session.query( Thing ) \
.filter_by(objType = 'comment').filter(Thing.data.any(wc('thingCode', code))) \
.filter(Thing.date >= date) \
.order_by('-date').all()
This works, but it seems horribly inefficient to be doing all these queries separately. So, I have 2 questions:
a) Rather than running the second query n times for an n-length list of things, is there a way I could do it in a single query while still returning a separate set of results for each ID (presumably in the form of a dictionary of ID's to lists)? I suppose I could do a value_in(listOfIds) to get a single list of all the comments I want and then iterate through that and build the dictionary manually, but I have a feeling there's a way to use JOINs for this.
b) Am I over-optimizing here? Would I be better off with the second approach I just mentioned? And is it even that important that I roll them all into a single transaction? The bulk of my experience is with Neo4j, which is pretty good at transparently nesting many small transactions into larger ones - does SQL/SQLAlchemy have similar functionality, or is it definitely in my interest to minimize the number of queries?
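One way to approach (a), sketched in plain Python with invented names: run a single query for all comments whose thingCode is in the list (as the question suggests with value_in), then bucket the rows by code and drop those older than that thing's cutoff date. The grouping step the question worries about is cheap:

```python
from collections import defaultdict

def comments_by_thing(comments, cutoff_dates):
    """Bucket comments per thing code, keeping only those at or after
    that thing's cutoff date, newest first.

    comments     -- iterable of (thing_code, date, text) tuples, e.g. the
                    rows of one IN-based query instead of n separate queries
    cutoff_dates -- dict mapping thing_code -> earliest date of interest
    """
    buckets = defaultdict(list)
    for code, date, text in comments:
        if date >= cutoff_dates[code]:
            buckets[code].append((date, text))
    for code in buckets:
        buckets[code].sort(reverse=True)   # newest first, like order_by('-date')
    return dict(buckets)

rows = [("t1", 3, "x"), ("t1", 1, "y"), ("t2", 5, "z")]
print(comments_by_thing(rows, {"t1": 2, "t2": 2}))
# {'t1': [(3, 'x')], 't2': [(5, 'z')]}
```

Note that because each thing has a different cutoff date, the per-code date filter cannot easily be expressed as one simple WHERE clause anyway, so fetching by code and filtering by date client-side is a reasonable trade.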
I want to build a backend for a mobile game that includes a "real-time" global leaderboard for all players, for events that last a certain number of days, using Google App Engine (Python).
A typical usage would be as follows:
- User starts and finishes a combat, acquiring points (2-5 mins for a combat)
- Points are accumulated in the player's account for the duration of the event.
- Player can check the leaderboard anytime.
- Leaderboard will return top 10 players, along with 5 players just above and below the player's score.
Now, there is no real constraint on the real-time aspect, the board could be updated every 30 seconds, to every hour. I would like for it to be as "fast" as possible, without costing too much.
Since I'm not very familiar with GAE, this is the solution I've thought of:
Each Player entity has a event_points attribute
Using a cron job, at a regular interval, a query is made to the datastore for all players whose score is not zero. The query is sorted by score.
The cron job then iterates through the query results, writing back the rank in each Player entity.
When I think of this solution, it feels very "brute force".
The problem with this solution lies with the cost of reads and writes for all entities.
If we end up with 50K active users, this would mean a sorted query of 50K+1 reads and 50K+1 writes at regular intervals, which could be very expensive (depending on the interval).
I know that memcache can be a way to prevent some reads and some writes, but if some entities are not in memcache, does it make sense to query it at all?
Also, I've read that memcache can be flushed at any time anyway, so unless there is a way to "back it up" cheaply, it seems like a dangerous use, since the data is relatively important.
Is there a simpler way to solve this problem?
You don't need 50,000 reads or 50,000 writes. The solution is to set a sorting order on your points property. Every time you update it, the datastore will update its order automatically, which means that you don't need a rank property in addition to the points property. And you don't need a cron job, accordingly.
Then, when you need to retrieve a leaderboard, you run two queries: one for 6 entities with a number of points greater than or equal to your user's, and a second for 6 entities with a number of points less than or equal to it. Merge the results, and this is what you want to show to your user.
As for your top 10 query, you may want to put its results in Memcache with an expiration time of, say, 5 minutes. When you need it, you first check Memcache. If not found, run a query and update the Memcache.
EDIT:
To clarify the query part. You need to set the right combination of a sort order and inequality filter to get the results that you want. According to App Engine documentation, the query is performed in the following order:
1. Identifies the index corresponding to the query's kind, filter properties, filter operators, and sort orders.
2. Scans from the beginning of the index to the first entity that meets all of the query's filter conditions.
3. Continues scanning the index, returning each entity in turn, until it encounters an entity that does not meet the filter conditions, reaches the end of the index, or has collected the maximum number of results requested by the query.
Therefore, you need to combine ASCENDING order with GREATER_THAN_OR_EQUAL filter for one query, and DESCENDING order with LESS_THAN_OR_EQUAL filter for the other query. In both cases you set the limit on the results to retrieve at 6.
One more note: you set a limit at 6 entities, because both queries will return the user itself. You can add another filter (userId NOT_EQUAL to your user's id), but I would not recommend it - the cost is not worth the savings. Obviously, you cannot use GREATER_THAN/LESS_THAN filters for points, because many users may have the same number of points.
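The merge of the two neighbour queries can be pictured in plain Python: with scores kept in sorted order, the window around a player falls out of a couple of slices (bisect stands in here for the indexed datastore scans; the function name and data are invented):

```python
import bisect

def neighbours(sorted_scores, player_score, k=5):
    """Return up to k scores below and k above player_score, mimicking the
    LESS_THAN_OR_EQUAL / GREATER_THAN_OR_EQUAL query pair."""
    lo = bisect.bisect_left(sorted_scores, player_score)
    hi = bisect.bisect_right(sorted_scores, player_score)
    below = sorted_scores[max(0, lo - k):lo]   # the "descending" query, reversed
    above = sorted_scores[hi:hi + k]           # the "ascending" query
    return below, above

scores = [10, 20, 30, 40, 50, 60, 70]
print(neighbours(scores, 40, k=2))
# ([20, 30], [50, 60])
```

The datastore does the equivalent scan in index order, which is why no rank property or cron job is needed: the index is already sorted.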
Here is a Google Developer article explaining a similar problem and a solution using the Google Code Jam ranking library. Further help and extensions to this library can be discussed in the Google Groups forum.
The library basically creates an N-ary tree, with each node containing the count of the scores in a particular range. The score ranges are subdivided all the way down to the leaf nodes, each of which holds a single score. A tree traversal (O(log n)) can then find the number of players with a score higher than a given score; that is the player's rank. The article also suggests aggregating score-submission requests in a pull task queue and processing them in batches in a background thread on a backend.
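The same O(log n) rank query can be sketched with a Fenwick (binary indexed) tree over integer score buckets, a flat cousin of the N-ary tree that library uses. The class name, score range, and API below are invented for illustration:

```python
class ScoreRanker:
    """Fenwick tree counting players per integer score in [0, max_score]."""
    def __init__(self, max_score):
        self.n = max_score + 1
        self.tree = [0] * (self.n + 1)      # 1-based internal indexing

    def add(self, score, delta=1):
        """Record delta more players at this score."""
        i = score + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def count_at_most(self, score):
        """Number of players with score <= score (prefix sum)."""
        i, total = score + 1, 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def rank(self, score, total_players):
        """1-based rank: players strictly above this score, plus one."""
        return total_players - self.count_at_most(score) + 1

r = ScoreRanker(100)
for s in [50, 75, 75, 90]:
    r.add(s)
print(r.rank(75, 4))  # 2 (only the 90 is strictly above; ties share a rank)
```

Score updates are add(old, -1) followed by add(new, +1), so both writes and rank reads stay logarithmic in the score range.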
Whether this is simpler or not is debatable.
I have assumed that ranking is not just a matter of ordering an accumulation of points; in that case it's just a simple query. I assume ranking involves factors other than the current score.
I would consider writing out an Event record for each update of a user's points (effectively a queue). Tasks run periodically, collecting all the current Event records. In addition, you maintain a set of records representing the top of the leaderboard; adjust this set based on the incoming Event records, and discard each Event record once processed. This limits your reads and writes to only the active events in a small time window. The leaderboard itself could probably be a single entity, fetched by key and cached.
I assume you may have different ranking schemes, like current active rank (for the last 7 days) versus all-time rank (i.e. players who haven't played for a while won't have a good current rank).
As players view their rank, you can do that with two simple queries, Players.query(Players.score > somescore).fetch(5) and Players.query(Players.score < somescore).fetch(5). This shouldn't cost too much, and you could cache the results.
The following code produces a "First ordering property must be the same as inequality filter property" error when executed, because you can't order first by a property that isn't the one used in the inequality filter.
q = Score.all()
q.filter("levelname = ", levelname)
q.filter("submitted >", int(time.time()) - (86400*7))
q.order("-score")
scoreList = q.fetch(10)
What I need to do is find the top 10 scores that are less than a week old. There could be tens of thousands (if not more) of scores, so I can't just fetch them all and sort in Python.
Is there a way to do this?
In general, every time a question of counting comes up, the consensus is that with GAE you should precompute everything you can. The way I'd approach your specific requirement of top 10 scores is to create an entity that holds the top scores and update its positions whenever a new score outweighs one of the current top 10.
When you compute a score, query for how many other scores are greater than it. If the count is 10 or more, the new score is outside the top 10 and you don't need to update anything; this will be the majority of the time. If the count is less than 10, you need to update the order: take your current top 10 and insert the new score in the appropriate position.
To handle the time component, I'd have some process running that checks daily whether a score should be evicted from the top 10 (because it is now older than a week); if so, grab the next highest score to replace it.
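A minimal sketch of that precomputed board in plain Python (the class and method names are invented; on GAE the list would live in a single entity): keep the top N as a small sorted list and only touch it when a new score beats the current floor, so the common case costs one comparison and no write:

```python
import bisect

class TopScores:
    """Keep only the N best scores; cheap check before any update."""
    def __init__(self, n=10):
        self.n = n
        self.scores = []          # ascending; self.scores[0] is the floor

    def offer(self, score):
        """Insert score if it belongs in the top N. Returns True if stored."""
        if len(self.scores) < self.n:
            bisect.insort(self.scores, score)
            return True
        if score <= self.scores[0]:
            return False          # the common case: no write needed
        bisect.insort(self.scores, score)
        self.scores.pop(0)        # evict the old floor
        return True

board = TopScores(n=3)
for s in [5, 12, 7, 3, 20]:
    board.offer(s)
print(board.scores)  # [7, 12, 20]
```

The daily eviction described above would remove an expired score from the list and backfill it with one datastore query for the next-highest fresh score.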
Here's a bunch of answers on a similar subject that address the design patterns and logic appropriate for GAE datastore: What's the best way to count results in GQL?
As Sologoub mentions, precomputing is the way to go.
You can have many equality filters, though, so an alternative to keeping a separate list of the top scores would be to give each entity a flag (say, a boolean value) that says whether it is eligible for the top-scores list (in your case, no older than a week), and have a daily cron job that retrieves all entities with the eligibility flag set, checks the date, and changes the flag where required.
This costs more storage space (one more field per entity), but I suppose an advantage can be that you can choose dynamically how many top scores you want to return. (You can also have several of these flags, say one for all time high score and one just for the last week, etc.)
I've been working on a feature of my application to implement a leaderboard - basically stack-ranking users according to their score. I'm currently tracking the score on an individual basis. My thought is that this leaderboard should be relative instead of absolute, i.e. instead of the top 10 highest-scoring users across the site, it's a top 10 among a user's friend network. This seems better because everyone has a chance to be #1 in their own network, and there is a form of friendly competition for those who are interested in this sort of thing. I'm already storing the score for each user, so the challenge is how to compute the rank of that score in real time in an efficient way. I'm using Google App Engine, so there are some benefits and limitations (e.g., IN [array] queries perform a sub-query for every element of the array and are limited to 30 elements per statement).
For example
1st Jack 100
2nd John 50
Here are the approaches I came up with, but they all seem inefficient, and I thought this community could come up with something more elegant. My sense is that any solution will likely be done with a cron job, and that I will store a daily rank and list order to optimize read operations, but it would be cool if there were something more lightweight and real-time.
1. Pull the list of all users of the site, ordered by score.
For each user pick their friends out of that list and create new rankings.
Store the rank and list order.
Update daily.
Cons - If I get a lot of users this will take forever
2a. For each user pick their friends and for each friend pick score.
Sort that list.
Store the rank and list order.
Update daily.
Record the last position of each user so that the pre-existing list can be used for re-ordering at the next update, making it more efficient (this may save sorting time).
2b. Same as above, except only compute the rank and list order for people whose profiles have been viewed in the last day.
Cons - rank is only up to date for the 2nd person that views the profile
If writes are very rare compared to reads (a key assumption in most key-value stores, and not just in those;-), then you might prefer to take a time hit when you need to update scores (a write) rather than to get the relative leaderboards (a read). Specifically, when a user's score change, queue up tasks for each of their friends to update their "relative leaderboards" and keep those leaderboards as list attributes (which do keep order!-) suitably sorted (yep, the latter's a denormalization -- it's often necessary to denormalize, i.e., duplicate information appropriately, to exploit key-value stores at their best!-).
Of course you'll also update the relative leaderboards when a friendship (user to user connection) disappears or appears, but those should (I imagine) be even rarer than score updates;-).
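The fan-out-on-write idea can be sketched in plain Python (the names and data structures are invented; on App Engine each friend update would be a queued task, and the board would be a list property on the user entity):

```python
# Sketch of write-time fan-out: when a user's score changes, rebuild the
# small, presorted leaderboard denormalized onto each friend's record.

def update_score(users, friendships, user, new_score):
    users[user]["score"] = new_score
    for friend in friendships[user]:            # one queued task per friend
        rebuild_board(users, friendships, friend)
    rebuild_board(users, friendships, user)

def rebuild_board(users, friendships, user):
    circle = friendships[user] | {user}
    board = sorted(circle, key=lambda u: users[u]["score"], reverse=True)
    users[user]["board"] = board               # denormalized, read-optimized

users = {u: {"score": 0, "board": []} for u in "abc"}
friendships = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
update_score(users, friendships, "c", 50)
update_score(users, friendships, "a", 30)
print(users["b"]["board"])  # ['c', 'a', 'b']
```

Reading a relative leaderboard is then a single fetch of the precomputed list, which is exactly the read/write trade described above.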
If writes are pretty frequent, since you don't need perfectly precise up-to-the-second info (i.e., it's not financials/accounting stuff;-), you still have many viable approaches to try.
E.g., big score changes (rarer) might trigger the relative-leaderboards recomputes, while smaller ones (more frequent) get stashed away and only applied once in a while "when you get around to it". It's hard to be more specific without ballpark numbers about frequency of updates of various magnitude, typical network-friendship cluster sizes, etc, etc. I know, like everybody else, you want a perfect approach that applies no matter how different the sizes and frequencies in question... but, you just won't find one!-)
There is a python library available for storing rankings:
http://code.google.com/p/google-app-engine-ranklist/