I've been working on a feature of my application to implement a leaderboard - basically stack-rank users according to their score. I'm currently tracking the score on an individual basis. My thought is that this leaderboard should be relative instead of absolute, i.e. instead of having the top 10 highest-scoring users across the site, it's a top 10 among a user's friend network. This seems better because everyone has a chance to be #1 in their network, and there is a form of friendly competition for those that are interested in this sort of thing. I'm already storing the score for each user, so the challenge is how to compute the rank of that score in real time in an efficient way. I'm using Google App Engine, so there are some benefits and limitations (e.g., IN [array] queries perform a sub-query for every element of the array and are also limited to 30 elements per statement).
For example
1st Jack 100
2nd John 50
Here are the approaches I came up with, but they all seem to be inefficient, and I thought that this community could come up with something more elegant. My sense is that any solution will likely be done with a cron job, and that I will store a daily rank and list order to optimize read operations, but it would be cool if there were something more lightweight and real-time.
1. Pull the list of all users of the site ordered by score.
For each user pick their friends out of that list and create new rankings.
Store the rank and list order.
Update daily.
Cons - If I get a lot of users this will take forever
2a. For each user pick their friends and for each friend pick score.
Sort that list.
Store the rank and list order.
Update daily.
Record the last position of each user so that the pre-existing list can be used for re-ordering on the next update, to make it more efficient (may save sorting time).
2b. Same as above, except only compute the rank and list order for people whose profiles have been viewed in the last day.
Cons - rank is only up to date for the 2nd person that views the profile
If writes are very rare compared to reads (a key assumption in most key-value stores, and not just in those;-), then you might prefer to take a time hit when you need to update scores (a write) rather than when you get the relative leaderboards (a read). Specifically, when a user's score changes, queue up tasks for each of their friends to update their "relative leaderboards" and keep those leaderboards as list attributes (which do keep order!-) suitably sorted (yep, the latter's a denormalization -- it's often necessary to denormalize, i.e., duplicate information appropriately, to exploit key-value stores at their best!-).
Of course you'll also update the relative leaderboards when a friendship (user to user connection) disappears or appears, but those should (I imagine) be even rarer than score updates;-).
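For concreteness, here is a minimal sketch of that fan-out-on-write idea in Python. The Player model, its friends/leaderboard properties, and the update_score/rebuild_leaderboard helpers are hypothetical names for illustration, not an existing API (deferred also requires the builtin deferred handler enabled in app.yaml):

from google.appengine.ext import deferred, ndb

class Player(ndb.Model):
    score = ndb.IntegerProperty(default=0)
    friends = ndb.KeyProperty(repeated=True)      # keys of friend Players
    leaderboard = ndb.KeyProperty(repeated=True)  # friends sorted by score (denormalized)

def update_score(player_key, new_score):
    player = player_key.get()
    player.score = new_score
    player.put()
    # Fan out: each friend's relative leaderboard is re-sorted in a
    # background task, so the cost is paid at write time, not read time.
    for friend_key in player.friends:
        deferred.defer(rebuild_leaderboard, friend_key)

def rebuild_leaderboard(player_key):
    player = player_key.get()
    friends = ndb.get_multi(player.friends)
    friends.sort(key=lambda p: p.score, reverse=True)
    player.leaderboard = [p.key for p in friends]
    player.put()

Reading a user's leaderboard is then a single get of their entity plus an ndb.get_multi on the stored keys.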
If writes are pretty frequent, since you don't need perfectly precise up-to-the-second info (i.e., it's not financials/accounting stuff;-), you still have many viable approaches to try.
E.g., big score changes (rarer) might trigger the relative-leaderboards recomputes, while smaller ones (more frequent) get stashed away and only applied once in a while "when you get around to it". It's hard to be more specific without ballpark numbers about frequency of updates of various magnitude, typical network-friendship cluster sizes, etc, etc. I know, like everybody else, you want a perfect approach that applies no matter how different the sizes and frequencies in question... but, you just won't find one!-)
There is a Python library available for storing rankings:
http://code.google.com/p/google-app-engine-ranklist/
Related
I was wondering: if I had 1 million users, how much time would it take to loop over every account to check whether the username and email are already used on the registration page, or whether the username and password are correct on the login page? Won't it take ages if I do it in a traditional for loop?
Rather than give a detailed technical answer, I will try to give a theoretical illustration of how to address your concerns. What you seem to be saying is this:
Linear search might be too slow when logging users in.
I imagine what you have in mind is this: a user types in a username and password, clicks a button, and then you loop through a list of username/password combinations and see if there is a matching combination. This takes time that is linear in terms of the number of users in the system; if you have a million users in the system, the loop will take about a thousand times as long as when you just had a thousand users... and if you get a billion users, it will take a thousand times longer over again.
Whether this is a performance problem in practice can only be determined through testing and requirements analysis. However, if it is determined to be a problem, then there is a place for theory to come to the rescue.
Imagine one small improvement to our original scheme: rather than storing the username/password combinations in arbitrary order and looking through the whole list each time, imagine storing these combinations in alphabetic order by username. This enables us to use binary search, rather than linear search, to determine whether there exists a matching username:
check the middle element in the list
if the target element is equal to the middle element, you found a match
otherwise, if the target element comes before the middle element, repeat binary search on the left half of the list
otherwise, if the target element comes after the middle element, repeat binary search on the right half of the list
if you run out of list without finding the target, it's not in the list
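A direct transcription of those steps into Python - a sketch, where the (username, credentials) pair layout is just an assumption:

def find_user(sorted_pairs, target_username):
    # sorted_pairs is a list of (username, credentials) tuples,
    # sorted alphabetically by username.
    lo, hi = 0, len(sorted_pairs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        username, credentials = sorted_pairs[mid]
        if username == target_username:
            return credentials           # found a match
        elif target_username < username:
            hi = mid - 1                 # repeat on the left half
        else:
            lo = mid + 1                 # repeat on the right half
    return None                          # ran out of list: not present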
The time complexity of this is logarithmic in terms of the number of users in the system: if you go from a thousand users to a million users, the time taken goes up by a factor of roughly ten, rather than one thousand as was the case for linear search. This is already a vast improvement over linear search and for any realistic number of users is probably going to be efficient enough. However, if additional performance testing and requirements analysis determine that it's still too slow, there are other possibilities.
Imagine now creating a large array of username/password pairs and whenever a pair is added to the collection, a function is used to transform the username into a numeric index. The pair is then inserted at that index in the array. Later, when you want to find whether that entry exists, you use the same function to calculate the index, and then check just that index to see if your element is there. If the function that maps the username to indices (called a hash function; the index is called a hash) is perfect - different strings don't map to the same index - then this unambiguously tells you whether your element exists. Notably, under somewhat reasonable assumptions, the time to make this determination is mostly independent from the number of users currently in the system: you can get (amortized) constant time behavior from this scheme, or something reasonably close to it. That means the performance hit from going from a thousand to a million users might be negligible.
This answer does not delve into the ugly real-world minutiae of implementing these ideas in a production system. However, real-world systems exist that implement these ideas (and many more) for precisely the kind of situation presented.
EDIT: comments asked for some pointers on actually implementing a hash table in Python. Here are some thoughts on that.
So there is a built-in hash() function that can be made to work if you disable the security feature that causes it to produce different hashes for different executions of the program. Otherwise, you can import hashlib and use some hash function there and convert the output to an integer using e.g. int.from_bytes. Once you get your number, you can take the modulus (or remainder after division, using the % operator) w.r.t. the capacity of your hash table. This gives you the index in the hash table where the item gets put. If you find there's already an item there - i.e. the assumption we made in theory that the hash function is perfect turns out to be incorrect - then you need a strategy. Two strategies for handling collisions like this are:
Instead of putting items at each index in the table, put a linked list of items. Add items to the linked list at the index computed by the hash function, and look for them there when doing the search.
Modify the index using some deterministic method (e.g., squaring and taking the modulus) up to some fixed number of times, to see if a backup spot can easily be found. Then, when searching, if you do not find the value you expected at the index computed by the hash method, check the next backup, and so on. Ultimately, you must fall back to something like method 1 in the worst case, though, since this process could fail indefinitely.
As for how large to make the capacity of the table: I'd recommend studying existing recommendations, but intuitively it seems like creating it larger than necessary by some constant multiplicative factor is generally the best bet. Once the hash table begins to fill up, this can be detected and the capacity expanded (all hashes will have to be recomputed and items re-inserted at their new positions - a costly operation, but if you increase capacity in a multiplicative fashion then I imagine this will not be too frequent an issue).
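Putting the EDIT together, here is a toy hash table using hashlib, separate chaining (strategy 1 above), and multiplicative resizing. It is purely illustrative - in practice you would just use Python's built-in dict:

import hashlib

class HashTable(object):
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.size = 0
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        # Hash the key and take the modulus w.r.t. the table capacity.
        digest = hashlib.sha256(key.encode('utf-8')).digest()
        return int.from_bytes(digest, 'big') % self.capacity

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)      # overwrite existing entry
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > 0.75 * self.capacity:  # table is filling up
            self._resize()

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def _resize(self):
        items = [kv for bucket in self.buckets for kv in bucket]
        self.capacity *= 2                    # grow multiplicatively
        self.buckets = [[] for _ in range(self.capacity)]
        for k, v in items:                    # re-hash everything
            self.buckets[self._index(k)].append((k, v))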
I want to build a backend for a mobile game that includes a "real-time" global leaderboard for all players, for events that last a certain number of days, using Google App Engine (Python).
A typical usage would be as follows:
- User starts and finishes a combat, acquiring points (2-5 mins for a combat)
- Points are accumulated in the player's account for the duration of the event.
- Player can check the leaderboard anytime.
- Leaderboard will return top 10 players, along with 5 players just above and below the player's score.
Now, there is no hard constraint on the real-time aspect; the board could be updated anywhere from every 30 seconds to every hour. I would like it to be as "fast" as possible without costing too much.
Since I'm not very familiar with GAE, this is the solution I've thought of:
Each Player entity has an event_points attribute
Using a cron job, at a regular interval, a sorted query is made to the datastore for all players whose score is not zero.
The cron job then iterates through the query results, writing back the rank in each Player entity.
When I think of this solution, it feels very "brute force".
The problem with this solution lies with the cost of reads and writes for all entities.
If we end up with 50K active users, this would mean a sorted query of 50K+1 reads, and 50k+1 writes at regular intervals, which could be very expensive (depending on the interval)
I know that memcache can be a way to prevent some reads and some writes, but if some entities are not in memcache, does it make sense to query it at all?
Also, I've read that memcache can be flushed at any time anyway, so unless there is a way to "back it up" cheaply, it seems like a dangerous use, since the data is relatively important.
Is there a simpler way to solve this problem?
You don't need 50,000 reads or 50,000 writes. The solution is to set a sort order on your points property: every time you update it, the datastore will update its index order automatically, which means that you don't need a rank property in addition to the points property. And accordingly, you don't need a cron job.
Then, when you need to retrieve a leaderboard, you run two queries: one for 6 entities with a number of points greater than or equal to your user's; a second for 6 entities with a number of points less than or equal. Merge the results, and this is what you show to your user.
As for your top 10 query, you may want to put its results in Memcache with an expiration time of, say, 5 minutes. When you need it, you first check Memcache. If not found, run a query and update the Memcache.
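That caching pattern might look something like this - a sketch, where the Player model and the Memcache key name are assumptions:

from google.appengine.api import memcache
from google.appengine.ext import ndb

class Player(ndb.Model):
    event_points = ndb.IntegerProperty(default=0)

def top_ten():
    top = memcache.get('top_ten')
    if top is None:  # cache miss: run the query and repopulate
        top = Player.query().order(-Player.event_points).fetch(10)
        memcache.set('top_ten', top, time=300)  # expire after 5 minutes
    return top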
EDIT:
To clarify the query part: you need to set the right combination of a sort order and an inequality filter to get the results that you want. According to the App Engine documentation, a query is performed in the following order:
1. Identifies the index corresponding to the query's kind, filter properties, filter operators, and sort orders.
2. Scans from the beginning of the index to the first entity that meets all of the query's filter conditions.
3. Continues scanning the index, returning each entity in turn, until it encounters an entity that does not meet the filter conditions, reaches the end of the index, or has collected the maximum number of results requested by the query.
Therefore, you need to combine ASCENDING order with GREATER_THAN_OR_EQUAL filter for one query, and DESCENDING order with LESS_THAN_OR_EQUAL filter for the other query. In both cases you set the limit on the results to retrieve at 6.
One more note: you set the limit at 6 entities because both queries will return the user itself. You could add another filter (userId NOT_EQUAL your user's id), but I would not recommend it - the cost is not worth the savings. Obviously, you cannot use GREATER_THAN/LESS_THAN filters for points, because many users may have the same number of points.
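In ndb, the two queries might look like this - a sketch reusing the same hypothetical Player model as above; note that each sort order matches its inequality filter:

from google.appengine.ext import ndb

class Player(ndb.Model):
    event_points = ndb.IntegerProperty(default=0)

def neighbours(player):
    above = (Player.query(Player.event_points >= player.event_points)
             .order(Player.event_points)       # ASCENDING + >= filter
             .fetch(6))
    below = (Player.query(Player.event_points <= player.event_points)
             .order(-Player.event_points)      # DESCENDING + <= filter
             .fetch(6))
    # Both results include the player; merge, de-duplicate, re-sort.
    merged = {p.key: p for p in above + below}
    return sorted(merged.values(), key=lambda p: p.event_points, reverse=True)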
Here is a Google Developer article explaining a similar problem and a solution using the Google Code Jam ranking library. Further help and extensions to this library can be discussed in the Google Groups forum.
The library basically creates an N-ary tree, with each node containing the count of the scores in a particular range. The score ranges are subdivided all the way down to the leaf nodes, where each leaf is a single score. A tree traversal (O(log n)) can then be used to find the number of players with a score higher than a given score - that is the rank of the player. The article also suggests aggregating score submission requests in a pull task queue and processing them in batches in a background thread on a backend.
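The library's actual tree is more general (ranges subdivided N ways); as a hedged illustration of the same counting idea, here is a simpler analogue using a Fenwick (binary indexed) tree over bounded integer scores, which answers "how many players scored higher" in O(log n). This is not the library's code or API:

class ScoreRanker(object):
    def __init__(self, max_score):
        self.max_score = max_score
        self.tree = [0] * (max_score + 1)  # 1-based Fenwick tree of counts
        self.total = 0

    def add_score(self, score):            # score must be in [1, max_score]
        self.total += 1
        i = score
        while i <= self.max_score:
            self.tree[i] += 1
            i += i & (-i)

    def _count_at_most(self, score):       # players with score <= given score
        count, i = 0, score
        while i > 0:
            count += self.tree[i]
            i -= i & (-i)
        return count

    def rank(self, score):
        # 1-based rank: players strictly above this score, plus one.
        return self.total - self._count_at_most(score) + 1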
Whether this is simpler or not is debatable.
I have assumed that ranking is not just a matter of ordering an accumulation of points; in that case it's just a simple query. I assume ranking involves other factors rather than just the current score.
I would consider writing out an Event record for each update of points for a User (effectively a queue). Tasks run periodically, collecting all the current Event records. In addition, you maintain a set of records representing the top of the leaderboard; adjust this set based on the incoming Event records, and discard Event records once processed. This will limit your reads and writes to only the active events in a small time window. The leaderboard could probably be a single entity, fetched by key and cached.
I assume you may have different ranking schemes, like current active rank (for the current 7 days) vs. all-time rank (i.e., players not playing for a while won't have a good current rank).
As the players view their rank, you can do that with two simple queries: Players.query(Players.score > somescore).order(Players.score).fetch(5) and Players.query(Players.score < somescore).order(-Players.score).fetch(5). This shouldn't cost too much, and you could cache the results.
I'm writing an app in Python for Google App Engine where each user can submit a post, and each post has a ranking which is determined by its votes and comment count. The ranking is just a simple calculation based on these two parameters. I am wondering whether I should store this value in the datastore (and take up space there) or just calculate it every time I need it. Just FYI, the posts will be sorted by ranking, so that needs to be taken into account.
I am mostly thinking about efficiency, and trying to decide whether I should save datastore room or save read/write quota.
I would think it would be better to simply store it but then I need to recalculate and rewrite the ranking value every time anyone votes or comments on the post.
Any input would be great.
What about storing the ranking as a property in the post? That would make sense for querying/sorting, wouldn't it?
If you store the ranking at the same time (meaning in the same entity) as you store the votes/comment count, then the only increase in write cost would be for the index (OK, there's an initial write cost too, but that is, what, 2 writes? Very small anyway).
You need to do a database operation every time anyone votes or comments on the post anyway, right?! How else can you track votes/comments?
Actually though, I imagine you will end up using text search to find data in the posts. If so, I would look into storing the ranking as a property in the search index and using it to rank matching results.
Don't we need to consider how you are selecting the posts to display? Is ranking by votes and comments the only criterion?
Caching is most useful when the calculation is expensive. If the calculation is simple and cheap, you might as well recalculate as needed.
If you're depending on keeping a running vote count in an entity, then you either have to be willing to lose an occasional vote, or you have to use transactions. If you use transactions, you're rate limited as to how many transactions you can do per second. (See the doc on transactions and entity groups). If you're liable to have a high volume of votes, rate limiting can be a problem.
For a low rate of votes, keeping a count in an entity might work fine. But if you expect any significant peaks in voting rate, storing separate Vote entities that periodically get rolled up into a cached count, perhaps adjusted by (possibly unreliable) incremental counts kept in memcache, might work better for you.
It really depends on what you want to optimize for. If you're trying to minimize disk writes by keeping a vote count cached non-transactionally, you risk losing votes.
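One way the rolled-up scheme might look - a sketch, where the model and key names are illustrative, not an existing API:

from google.appengine.api import memcache
from google.appengine.ext import ndb

class Post(ndb.Model):
    vote_count = ndb.IntegerProperty(default=0)   # rolled-up count

class Vote(ndb.Model):
    post = ndb.KeyProperty(kind=Post)

def cast_vote(post_key):
    Vote(post=post_key).put()                     # no contention on the Post entity
    # Unreliable incremental counter, just for fresher display.
    memcache.incr('votes:' + post_key.urlsafe(), initial_value=0)

def rollup(post_key):                             # run periodically, e.g. from cron
    vote_keys = Vote.query(Vote.post == post_key).fetch(keys_only=True)
    post = post_key.get()
    post.vote_count += len(vote_keys)
    post.put()
    ndb.delete_multi(vote_keys)                   # only deletes what was counted
    memcache.delete('votes:' + post_key.urlsafe())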
The following code produces a First ordering property must be the same as inequality filter property error when executed, because you can't order first by a field that was not the inequality filter property.
q = Score.all()
q.filter("levelname = ", levelname)
q.filter("submitted >", int(time.time()) - (86400*7))
q.order("-score")
scoreList = q.fetch(10)
What I need to do is find the top 10 scores that are less than a week old. There could be tens of thousands (if not more) of scores, so I can't just fetch them all and sort in Python.
Is there a way to do this?
In general, every time a question of counting comes up, the consensus is that with GAE you should precompute everything you can. The way I'd approach your specific requirement of top 10 scores is to create an entity that holds the top scores and update it whenever you have a new score that outweighs the current top 10.
When you compute a score, you can query for how many other scores are greater than it. If that count is 10 or more, the new score is not in the top 10 and you don't need to update anything - this will be the majority of the time. If the count is less than 10, the new score belongs in the top 10, so you update the order and insert the new score as appropriate.
To handle the time component, I'd have some process running daily that checks whether a score should be evicted from the top 10 (because it is now older than a week); if so, grab the next highest to replace it.
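A sketch of that flow, checking against the stored list rather than running a count query (model and property names are assumptions; concurrent submissions would want a transaction around the update):

from google.appengine.ext import ndb

class TopScores(ndb.Model):
    scores = ndb.IntegerProperty(repeated=True)   # kept sorted, descending

def submit_score(new_score):
    top = TopScores.get_or_insert('singleton')
    # Most scores fail this test and cost no extra write at all.
    if len(top.scores) < 10 or new_score > top.scores[-1]:
        top.scores = sorted(top.scores + [new_score], reverse=True)[:10]
        top.put()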
Here's a bunch of answers on a similar subject that address the design patterns and logic appropriate for GAE datastore: What's the best way to count results in GQL?
As Sologoub mentions, precomputing is the way to go.
You can have many equality filters, though, so an alternative to keeping a separate list of the top scores could be to give each entity a flag (say, a boolean value) that says whether it is eligible to be on the top-scores list (in your case, no older than a week), and have a daily cron job that retrieves all entities with the eligibility flag set, checks the date, and clears the flag if required.
This costs more storage space (one more field per entity), but I suppose an advantage can be that you can choose dynamically how many top scores you want to return. (You can also have several of these flags, say one for all time high score and one just for the last week, etc.)
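A sketch of the flag approach (model and property names are assumptions):

import datetime
from google.appengine.ext import ndb

class Score(ndb.Model):
    points = ndb.IntegerProperty()
    submitted = ndb.DateTimeProperty(auto_now_add=True)
    eligible = ndb.BooleanProperty(default=True)   # young enough for the board?

def top_scores(limit=10):
    # Equality filter plus sort order: no inequality needed at read time.
    return Score.query(Score.eligible == True).order(-Score.points).fetch(limit)

def expire_old_scores():                           # daily cron job
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=7)
    stale = Score.query(Score.eligible == True,
                        Score.submitted < cutoff).fetch()
    for s in stale:
        s.eligible = False
    ndb.put_multi(stale)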
Let's say I have 5 entries in my Redis database:
news::id: the ID of the last news;
news::list: a list of all news IDs;
news:n where n is the news ID: a hash containing fields such as title, url, etc.;
news:n:upvotes: a list of all users' IDs who upvoted the news, thus giving the number of upvotes.
news:n:downvotes: a list of all users' IDs who downvoted the news, thus giving the number of downvotes.
Then I have multiple ranking algorithms, where rank =:
upvotes_count;
upvotes_count - downvotes_count;
upvotes_count - downvotes_count - age;
upvotes_count / downvotes_count;
age.
Now how do I sort those news according to each of these algorithms?
I thought about computing the different ranks on every vote, but then if I introduce a new algorithm I need to compute the new rank for all the news.
EVAL could help, but it won't be available until v2.6, which I surely don't want to wait for.
Eventually, I could retrieve all the news and put them in a Python list. But again, that translates into high memory usage, not to mention the fact that Redis already stores its data in memory.
So is there a proper way to do this or should I just move to MongoDB?
You can sort by constants stored in keys.
In your example, I can sort by 1. almost trivially using Redis. If you store the other expression values after calculating them, you can sort by them too. For 1., you will need to store the list count somewhere; I will assume news:n:upvotes:count.
The catch is to use the SORT command. For instance, the first sort would be:
SORT news::list BY news:*:upvotes:count GET news:*->title GET news:*->url
...to get titles and URLs sorted by upvotes, in ascending order.
There are modifiers too, for alpha sorting and asc/desc sorting. Read the command page in full; it is worthwhile.
PS: You can wrap the count, store, sort, and possibly the deletion of the count in a MULTI/EXEC block (a transaction).
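The same call from Python with redis-py might look like this, assuming the key layout above and that the upvote count has been stored under news:n:upvotes:count:

import redis

r = redis.Redis()

ranked = r.sort(
    'news::list',
    by='news:*:upvotes:count',             # weight each ID by its stored count
    get=['news:*->title', 'news:*->url'],  # pull fields from the news:n hashes
    desc=True,                             # highest-voted first
)
# `ranked` is a flat list: [title1, url1, title2, url2, ...]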