Global leaderboard in Google App Engine - python

I want to build a backend for a mobile game that includes a "real-time" global leaderboard for all players, for events that last a certain number of days, using Google App Engine (Python).
A typical usage would be as follows:
- User starts and finishes a combat, acquiring points (2-5 mins for a combat)
- Points are accumulated in the player's account for the duration of the event.
- Player can check the leaderboard anytime.
- Leaderboard will return the top 10 players, along with the 5 players just above and the 5 just below the player's score.
Now, there is no hard constraint on the real-time aspect; the board could be updated anywhere from every 30 seconds to every hour. I would like it to be as "fast" as possible without costing too much.
Since I'm not very familiar with GAE, this is the solution I've thought of:
- Each Player entity has an event_points attribute.
- Using a cron job, at a regular interval, a query is made to the datastore for all players whose score is not zero. The query is sorted.
- The cron job then iterates through the query results, writing back the rank in each Player entity.
When I think of this solution, it feels very "brute force".
The problem with this solution lies with the cost of reads and writes for all entities.
If we end up with 50K active users, this would mean a sorted query of 50K+1 reads and 50K+1 writes at each interval, which could be very expensive (depending on the interval).
I know that memcache can be a way to prevent some reads and some writes, but if some entities are not in memcache, does it make sense to query it at all?
Also, I've read that memcache can be flushed at any time anyway, so unless there is a way to "back it up" cheaply, it seems like a dangerous use, since the data is relatively important.
Is there a simpler way to solve this problem?

You don't need 50,000 reads or 50,000 writes. The solution is to define a sort order on your points property: every time you update the points, the datastore updates its index automatically, which means you don't need a rank property in addition to the points property. And, accordingly, you don't need a cron job.
Then, when you need to retrieve a leaderboard, you run two queries: one for the 6 entities with points greater than or equal to your user's, and a second for the 6 entities with points less than or equal to your user's. Merge the results, and that is what you show to your user.
As for your top 10 query, you may want to put its results in Memcache with an expiration time of, say, 5 minutes. When you need it, you first check Memcache. If not found, run a query and update the Memcache.
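A minimal sketch of that caching pattern, assuming a Player model with an event_points property as described in the question (the cache key name is arbitrary):

from google.appengine.api import memcache
from google.appengine.ext import db

class Player(db.Model):
    event_points = db.IntegerProperty(default=0)

TOP10_CACHE_KEY = 'leaderboard:top10'  # hypothetical cache key

def get_top10():
    top10 = memcache.get(TOP10_CACHE_KEY)
    if top10 is None:
        # Cache miss: run the sorted query once and cache it for ~5 minutes.
        top10 = Player.all().order('-event_points').fetch(10)
        memcache.set(TOP10_CACHE_KEY, top10, time=300)
    return top10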
EDIT:
To clarify the query part. You need to set the right combination of a sort order and inequality filter to get the results that you want. According to App Engine documentation, the query is performed in the following order:
- Identifies the index corresponding to the query's kind, filter properties, filter operators, and sort orders.
- Scans from the beginning of the index to the first entity that meets all of the query's filter conditions.
- Continues scanning the index, returning each entity in turn, until it encounters an entity that does not meet the filter conditions, reaches the end of the index, or has collected the maximum number of results requested by the query.
Therefore, you need to combine ASCENDING order with GREATER_THAN_OR_EQUAL filter for one query, and DESCENDING order with LESS_THAN_OR_EQUAL filter for the other query. In both cases you set the limit on the results to retrieve at 6.
One more note: you set a limit at 6 entities, because both queries will return the user itself. You can add another filter (userId NOT_EQUAL to your user's id), but I would not recommend it - the cost is not worth the savings. Obviously, you cannot use GREATER_THAN/LESS_THAN filters for points, because many users may have the same number of points.
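As a sketch with the (old) db API, reusing the Player model from the snippet above (6 results per query, since the user appears in both):

def leaderboard_window(player):
    # ASCENDING order with a >= filter: the player and up to 5 just above.
    above = (Player.all()
             .filter('event_points >=', player.event_points)
             .order('event_points')
             .fetch(6))
    # DESCENDING order with a <= filter: the player and up to 5 just below.
    below = (Player.all()
             .filter('event_points <=', player.event_points)
             .order('-event_points')
             .fetch(6))
    # Merge, dropping any entity that already appeared in the first list.
    seen = set(p.key() for p in below)
    return below[::-1] + [p for p in above if p.key() not in seen]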

Here is a Google Developers article explaining a similar problem and a solution using the Google Code Jam ranking library. Further help and extensions to this library can be discussed in the Google Groups forum.
The library basically creates an N-ary tree in which each node holds the count of scores in a particular range. The score ranges are subdivided further at each level, all the way down to the leaf nodes, each of which covers a single score. A tree traversal (O(log n)) can then be used to find the number of players with a score higher than a given score, which gives that player's rank. It also suggests aggregating score-submission requests in a pull task queue and processing them in batches in a background thread on a backend.
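For intuition, the counting scheme can be sketched as a tiny binary version of that tree; this is a toy illustration, not the library's actual API:

class RankNode(object):
    """Counts how many stored scores fall in the half-open range [lo, hi)."""
    def __init__(self, lo, hi):
        self.lo, self.hi, self.count = lo, hi, 0
        self.left = self.right = None   # children are created lazily

    def insert(self, score):
        self.count += 1
        if self.hi - self.lo > 1:       # not a leaf: recurse into one half
            mid = (self.lo + self.hi) // 2
            if score < mid:
                self.left = self.left or RankNode(self.lo, mid)
                self.left.insert(score)
            else:
                self.right = self.right or RankNode(mid, self.hi)
                self.right.insert(score)

    def count_greater(self, score):
        """How many stored scores are strictly greater than `score`."""
        if score < self.lo:
            return self.count
        if score >= self.hi - 1:
            return 0
        mid = (self.lo + self.hi) // 2
        if score < mid:
            greater = self.right.count if self.right else 0
            return greater + (self.left.count_greater(score) if self.left else 0)
        return self.right.count_greater(score) if self.right else 0

# rank = number of players with a higher score, plus one
tree = RankNode(0, 1 << 20)     # assumed maximum score
tree.insert(1200); tree.insert(900); tree.insert(1500)
print(tree.count_greater(900) + 1)   # -> 3 (two players score higher)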

Whether this is simpler or not is debatable.
I have assumed that ranking is not just a matter of ordering an accumulation of points; in that case it's just a simple query. I assume ranking involves other factors rather than just the current score.
I would consider writing out an Event record for each update of points for a User (effectively a queue). Tasks run periodically, collecting all the current Event records. In addition, you maintain a set of records representing the top of the leaderboard; adjust this set based on the incoming Event records, then discard the Event records once processed. This limits your reads and writes to only the active events in a small time window. The leaderboard could probably be a single entity, fetched by key and cached.
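A much-simplified sketch of that shape with ndb (entity and property names are invented, and the fold here only keeps the top N entries):

from google.appengine.ext import ndb

TOP_N = 10

class ScoreEvent(ndb.Model):
    """Hypothetical queue record: one per points update for a user."""
    user_id = ndb.StringProperty()
    points = ndb.IntegerProperty()

class Leaderboard(ndb.Model):
    """A single entity holding the top of the board, fetched by key and cacheable."""
    user_ids = ndb.StringProperty(repeated=True)
    scores = ndb.IntegerProperty(repeated=True)

def process_events():
    # Run from a task or cron: fold pending events into the board, then
    # discard them, so reads and writes cover only a small window.
    board = Leaderboard.get_or_insert('global')
    totals = dict(zip(board.user_ids, board.scores))
    events = ScoreEvent.query().fetch(500)
    for ev in events:
        totals[ev.user_id] = totals.get(ev.user_id, 0) + ev.points
    top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:TOP_N]
    board.user_ids = [u for u, _ in top]
    board.scores = [s for _, s in top]
    board.put()
    ndb.delete_multi([e.key for e in events])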
I assume you may have different ranking schemes, like current active rank (for the current 7 days) vs. all-time rank (i.e. players not playing for a while won't have a good current rank).
When players view their rank, you can do that with two simple queries, Players.query(Players.score > somescore).fetch(5) and Players.query(Players.score < somescore).fetch(5). This shouldn't cost too much, and you could cache the results.
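One detail worth hedging: without an explicit sort order those fetches return five arbitrary matching players rather than the nearest scores, so a sketch might add an order (Players model name from the snippet above; properties are assumptions):

from google.appengine.ext import ndb

class Players(ndb.Model):
    name = ndb.StringProperty()
    score = ndb.IntegerProperty()

def neighbours(somescore):
    # Ascending from the player's score upwards, descending downwards,
    # so the five nearest scores on each side come back.
    above = Players.query(Players.score > somescore).order(Players.score).fetch(5)
    below = Players.query(Players.score < somescore).order(-Players.score).fetch(5)
    return above, below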

Related

Django Rest Framework large number of queries for nested relationship

My problem
I have a large number of nested model serializers (5 levels deep) via one-to-many relationships. The total query time is not that high according to Django Debug Toolbar (maybe 100ms), but it takes about 6s of CPU time to run because the database is being hit many hundreds of times per request.
The problem is, 3 of those tables are huge, with millions of rows. As such, whenever I run 'prefetch_related' for anything it takes far longer to execute, sometimes minutes. select_related isn't an option, as they are all one-to-many.
Imagine something like a pedometer that takes steps, and also ties multiple environment readings and accelerometer readings to a step.
Example Schema
Person
  id
  name
Pedometer
  id
  name
  owner(Person)
Step
  id
  info
  timestamp
  pedometer
Environment_Readings
  id
  pressure
  step
Accelerometer_Readings
  id
  force
  step
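For reference, that schema maps onto Django models along these lines (field types and related_name values are guesses):

from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=100)

class Pedometer(models.Model):
    name = models.CharField(max_length=100)
    owner = models.ForeignKey(Person, related_name='pedometers',
                              on_delete=models.CASCADE)

class Step(models.Model):
    info = models.TextField()
    timestamp = models.DateTimeField()
    pedometer = models.ForeignKey(Pedometer, related_name='steps',
                                  on_delete=models.CASCADE)

class EnvironmentReading(models.Model):
    pressure = models.FloatField()
    step = models.ForeignKey(Step, related_name='environment_readings',
                             on_delete=models.CASCADE)

class AccelerometerReading(models.Model):
    force = models.FloatField()
    step = models.ForeignKey(Step, related_name='accelerometer_readings',
                             on_delete=models.CASCADE)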
What I want to achieve
So I want to select all people > an array of their pedometers (often multiple) > an array of all steps they have taken in the last minute > array of environment readings and array of accelerometer readings.
I am not seeing a way to prefetch environment_readings or accelerometer_readings as neither have a timestamp. The only ones that can be prefetched appear to be pedometer and step. And for some reason prefetching pedometer seems to slow it down.
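What's being described is roughly a filtered, nested Prefetch along these lines (a sketch only, using the guessed related_name values from the model sketch above):

from datetime import timedelta

from django.db.models import Prefetch
from django.utils import timezone

recent = timezone.now() - timedelta(minutes=1)

# Person and Step as in the model sketch above.
people = Person.objects.prefetch_related(
    Prefetch(
        'pedometers__steps',
        queryset=Step.objects.filter(timestamp__gte=recent).prefetch_related(
            'environment_readings', 'accelerometer_readings'),
    )
)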
Any ideas?

store a calculated value in the datastore or just calculate it on the fly

I'm writing an app in Python for Google App Engine where each user can submit a post, and each post has a ranking determined by its votes and comment count. The ranking is just a simple calculation based on these two parameters. I am wondering whether I should store this value in the datastore (and take up space there) or simply calculate it every time I need it. Just FYI, the posts will be sorted by ranking, so that needs to be taken into account.
I am mostly thinking about efficiency, trying to balance whether I should save datastore space or save read/write quota.
I would think it would be better to simply store it but then I need to recalculate and rewrite the ranking value every time anyone votes or comments on the post.
Any input would be great.
What about storing the ranking as a property of the post? That would make sense for querying/sorting, wouldn't it?
If you store the ranking at the same time (meaning in the same entity) as you store the vote/comment counts, then the only increase in write cost would be for the index (OK, the initial write cost too, but that's what, 2 writes? Very small anyway).
You need to do a database operation every time anyone votes or comments on the post anyway, right!?! How else can you track votes/comments?
Actually, though, I imagine you will end up using text search to find data in the posts. If so, I would look into storing the ranking as a property in the search index and using it to rank matching results.
Don't we need to consider how you are selecting the posts to display? Is ranking by votes and comments the only criterion?
Caching is most useful when the calculation is expensive. If the calculation is simple and cheap, you might as well recalculate as needed.
If you're depending on keeping a running vote count in an entity, then you either have to be willing to lose an occasional vote, or you have to use transactions. If you use transactions, you're rate limited as to how many transactions you can do per second. (See the doc on transactions and entity groups). If you're liable to have a high volume of votes, rate limiting can be a problem.
For a low rate of votes, keeping a count in an entity might work fine. But if you have any significant peaks in voting rate, storing separate Vote entities that periodically get rolled up into a cached count, perhaps adjusted by (possibly unreliable) incremental counts kept in memcache, might work better for you.
It really depends on what you want to optimize for. If you're trying to minimize disk writes by keeping a vote count cached non-transactionally, you risk losing votes.
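A rough sketch of that Vote roll-up pattern (model names, the memcache hint, and the ranking formula are all invented for illustration):

from google.appengine.api import memcache
from google.appengine.ext import ndb

class Vote(ndb.Model):
    """One entity per vote: cheap writes with no transactional contention."""
    post_key = ndb.KeyProperty()

class Post(ndb.Model):
    vote_count = ndb.IntegerProperty(default=0)     # rolled-up count
    comment_count = ndb.IntegerProperty(default=0)
    ranking = ndb.FloatProperty(default=0.0)        # stored so it can be sorted on

def cast_vote(post_key):
    Vote(post_key=post_key).put()
    # Optional, possibly lossy hint for near-real-time display.
    memcache.incr('votes:%s' % post_key.id(), initial_value=0)

def roll_up(post_key):
    # Periodic task: fold Vote entities into the Post and recompute the ranking.
    post = post_key.get()
    vote_keys = Vote.query(Vote.post_key == post_key).fetch(keys_only=True)
    post.vote_count += len(vote_keys)
    post.ranking = post.vote_count + 2.0 * post.comment_count   # made-up formula
    post.put()
    ndb.delete_multi(vote_keys)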

Filter and sort by different fields in Google App Engine

The following code produces a First ordering property must be the same as inequality filter property error when executed, because you can't order first by a property that isn't the one used in the inequality filter.
q = Score.all()
q.filter("levelname = ", levelname)
q.filter("submitted >", int(time.time()) - (86400*7))
q.order("-score")  # fails: the first sort order must be on "submitted", the inequality property
scoreList = q.fetch(10)
What I need to do is find the top 10 scores that are less than a week old. There could be tens of thousands of scores (if not more), so I can't just fetch them all and sort in Python.
Is there a way to do this?
In general, every time a question of counting comes up, the consensus is that with GAE you should precompute everything you can. The way I'd approach your specific requirement of top 10 scores is to create an entity that holds the top scores, and update it whenever a new score breaks into the top 10.
When you compute a score, you can query for how many other scores are greater than it. If the count is 10 or more, the new score doesn't make the top 10 and you don't need to update anything; this will be the majority of the time. If the count is less than 10, the new score belongs in the top 10, so you insert it in the appropriate position and drop the lowest entry.
To handle the time component, I'd have some process that checks daily to see whether a score should be evicted from the top 10 for being too old; if so, grab the next highest score to replace it.
Here's a bunch of answers on a similar subject that address the design patterns and logic appropriate for GAE datastore: What's the best way to count results in GQL?
As Sologoub mentions, precomputing is the way to go.
You can have many equality filters, though, so an alternative to keeping a separate list of the top scores could be to let each entity have a flag (say, a boolean value) that says whether it is eligible to be on the top scores list (in your case, no older than a week), and have a daily cron job that retrieves all entities with the eligibility flag, checks the date, and changes the flag if required.
This costs more storage space (one more field per entity), but I suppose an advantage can be that you can choose dynamically how many top scores you want to return. (You can also have several of these flags, say one for all time high score and one just for the last week, etc.)
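As a sketch, with the question's Score model and levelname variable, the flag approach might look like this (the current flag and the batch size are assumptions):

import time

from google.appengine.ext import db

# With only equality filters, ordering by a different property is allowed:
q = Score.all()
q.filter("levelname =", levelname)
q.filter("current =", True)          # eligibility flag kept up to date by cron
q.order("-score")
scoreList = q.fetch(10)

# Daily cron job: clear the flag on scores older than a week.
cutoff = int(time.time()) - 86400 * 7
expired = []
for s in Score.all().filter("current =", True).fetch(1000):
    if s.submitted < cutoff:
        s.current = False
        expired.append(s)
db.put(expired)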

Dynamic sort with Redis

Let's say I have 5 entries in my Redis database:
news::id: the ID of the last news;
news::list: a list of all news IDs;
news:n where n is the news ID: a hash containing fields such as title, url, etc.;
news:n:upvotes: a list of all users' IDs who upvoted the news, thus giving the number of upvotes.
news:n:downvotes: a list of all users' IDs who downvoted the news, thus giving the number of downvotes.
Then I have multiple ranking algorithms, where rank =:
1. upvotes_count;
2. upvotes_count - downvotes_count;
3. upvotes_count - downvotes_count - age;
4. upvotes_count / downvotes_count;
5. age.
Now how do I sort those news according to each of these algorithms?
I thought about computing the different ranks on every vote, but then if I introduce a new algorithm I need to recompute the rank for all the news.
EVAL could help but it won't be available until v2.6, which surely I don't want to wait for.
As a last resort, I could retrieve all the news and put it in a Python list, but again that translates into high memory usage, not to mention the fact that Redis already stores its data in memory.
So is there a proper way to do this or should I just move to MongoDB?
You can sort by values stored in external keys.
In your example, I can sort by 1. almost trivially using Redis. If you store the other expressions' values after calculating them, you can sort by them too. For 1., you will need to store the list's count somewhere; I will assume news:n:upvotes:count.
The catch is to use the SORT command. For instance, the first sort would be:
SORT news::list BY news:*:upvotes:count GET news:*->title GET news:*->url
...to get titles and URLs sorted by upvote count, in ascending order.
There are modifiers too, for alpha sorting and ASC/DESC sorting. Read the command page entirely; it is worthwhile.
PS: You can wrap the count, store, sort and possibly deletion of count in a MULTI/EXEC environment (a transaction).
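Since the question works in Python, the same SORT via redis-py might look roughly like this (key names as above):

import redis

r = redis.StrictRedis(host='localhost', port=6379)

ranked = r.sort(
    'news::list',
    by='news:*:upvotes:count',                 # external keys holding the counts
    get=['news:*->title', 'news:*->url'],      # pull hash fields for each news ID
    desc=True,                                 # highest-ranked news first
)
# With two GET patterns, results come back as a flat list: title1, url1, title2, url2, ...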

Real time update of relative leaderboard for each user among friends

I've been working on a feature of my application to implement a leaderboard - basically stack-rank users according to their score. I'm currently tracking the score on an individual basis. My thought is that this leaderboard should be relative instead of absolute, i.e. instead of having the top 10 highest-scoring users across the site, it's a top 10 among a user's friend network. This seems better because everyone has a chance to be #1 in their network and there is a form of friendly competition for those that are interested in this sort of thing. I'm already storing the score for each user, so the challenge is how to compute the rank of that score in real time in an efficient way. I'm using Google App Engine, so there are some benefits and limitations (e.g., IN [array] queries perform a sub-query for every element of the array and are also limited to 30 elements per statement).
For example
1st Jack 100
2nd John 50
Here are the approaches I came up with, but they all seem to be inefficient, and I thought that this community could come up with something more elegant. My sense is that any solution will likely be done with a cron job and that I will store a daily rank and list order to optimize read operations, but it would be cool if there is something more lightweight and real time.
1. Pull the list of all users of the site ordered by score.
For each user, pick their friends out of that list and create new rankings.
Store the rank and list order.
Update daily.
Cons - if I get a lot of users this will take forever.
2a. For each user, pick their friends, and for each friend pick their score.
Sort that list.
Store the rank and list order.
Update daily.
Record the last position of each user so that the pre-existing list can be used for re-ordering on the next update, to make it more efficient (may save sorting time).
2b. Same as above, except only compute the rank and list order for people whose profiles have been viewed in the last day.
Cons - rank is only up to date for the second person who views the profile.
If writes are very rare compared to reads (a key assumption in most key-value stores, and not just in those;-), then you might prefer to take a time hit when you need to update scores (a write) rather than to get the relative leaderboards (a read). Specifically, when a user's score change, queue up tasks for each of their friends to update their "relative leaderboards" and keep those leaderboards as list attributes (which do keep order!-) suitably sorted (yep, the latter's a denormalization -- it's often necessary to denormalize, i.e., duplicate information appropriately, to exploit key-value stores at their best!-).
Of course you'll also update the relative leaderboards when a friendship (user to user connection) disappears or appears, but those should (I imagine) be even rarer than score updates;-).
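As a sketch of that fan-out (entity, property, and handler names here are hypothetical):

from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class Profile(ndb.Model):
    """Per-user record with a denormalized relative leaderboard."""
    friend_ids = ndb.StringProperty(repeated=True)
    board_user_ids = ndb.StringProperty(repeated=True)   # kept sorted by score
    board_scores = ndb.IntegerProperty(repeated=True)

def on_score_change(user_id, new_score):
    profile = ndb.Key(Profile, user_id).get()
    for friend_id in profile.friend_ids:
        # One task per friend; the handler re-sorts that friend's board.
        taskqueue.add(url='/tasks/update_board',
                      params={'owner': friend_id,
                              'user': user_id,
                              'score': str(new_score)})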
If writes are pretty frequent, since you don't need perfectly precise up-to-the-second info (i.e., it's not financials/accounting stuff;-), you still have many viable approaches to try.
E.g., big score changes (rarer) might trigger the relative-leaderboards recomputes, while smaller ones (more frequent) get stashed away and only applied once in a while "when you get around to it". It's hard to be more specific without ballpark numbers about frequency of updates of various magnitude, typical network-friendship cluster sizes, etc, etc. I know, like everybody else, you want a perfect approach that applies no matter how different the sizes and frequencies in question... but, you just won't find one!-)
There is a Python library available for storing rankings:
http://code.google.com/p/google-app-engine-ranklist/
