Django: How to create a leaderboard - python

Let's say I have around 1,000,000 users. I want to find out what position any given user is in, and which users are around him. A user can get a new achievement at any time, and if he could see his standing update, that would be wonderful.
Honestly, every way I think of doing this would be horrendously expensive in time and/or memory. Ideas? My closest idea so far is to order the users offline and build percentile buckets, but that can't show a user his exact position.
Some code, if that helps you Django people:
from datetime import datetime, timedelta

from django.core.cache import cache
from django.db import models

class Alias(models.Model):
    awards = models.ManyToManyField('Award', through='Achiever')

    @property
    def points(self):
        p = cache.get('alias_points_' + str(self.id))
        if p is not None:
            return p
        points = 0
        for a in self.achiever_set.all():
            points += a.award.points * a.count
        cache.set('alias_points_' + str(self.id), points, 60 * 60)  # 1 hour
        return points
class Award(MyBaseModel):
    owner_points = models.IntegerField(help_text="A non-normalized point value. Very subjective but try to be consistent. Should be proportional: 2x points = 2x effort (or skill)")
    true_points = models.FloatField(help_text="The true value of this award. Recalculated with a cron job. Based on number of people who won it", editable=False, null=True)

    @property
    def points(self):
        if self.true_points:
            # blend true_points into real points over 30 days
            age = datetime.now() - self.created
            blend_days = 30
            if age > timedelta(days=blend_days):
                age = timedelta(days=blend_days)
            num_days = 1.0 * age.days / blend_days
            r = self.true_points * num_days + self.owner_points * (1 - num_days)
            return int(r * 10) / 10.0
        else:
            return self.owner_points
class Achiever(MyBaseModel):
    award = models.ForeignKey(Award)
    alias = models.ForeignKey(Alias)
    count = models.IntegerField(default=1)

I think Counterstrike solves this by requiring users to meet a minimum threshold to become ranked--you only need to accurately sort the top 10% or whatever.
If you want to sort everyone, consider that you don't need to sort them perfectly: sort them to 2 significant figures. With 1M users you could update the leaderboard for the top 100 users in real time, the next 1000 users to the nearest 10, then the masses to the nearest 1% or 10%. You won't jump from place 500,000 to place 99 in one round.
It's meaningless to get the 10-user context above and below place 500,000--the ordering of the masses will be incredibly jittery from round to round due to the exponential distribution.
Edit: Take a look at the SO leaderboard. Now go to page 500 out of 2500 (roughly the 20th percentile). Is there any point in telling the people with rep '157' that the 10 people on either side of them also have rep '157'? You'll jump 20 places either way if your rep goes up or down a point. More extreme: right now the bottom 1056 pages (out of 2538), or the bottom 42% of users, are tied at rep 1. You get one more point, and you jump up 1055 pages, which is roughly a 37,000 increase in rank. It might be cool to tell them "you can beat 37k people if you get one more point!" but does it matter how many significant figures the 37k number has?
There's no value in knowing your peers on a ladder until you're already at the top, because anywhere but the top, there's an overwhelming number of them.
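The 2-significant-figure idea above could be sketched like this (display_rank is a hypothetical helper; the thresholds are the ones suggested above):

```python
def display_rank(rank, total_users):
    """Round a raw rank for display: exact for the top 100,
    nearest 10 for the next 1000, a whole percentile below that."""
    if rank <= 100:
        return str(rank)
    if rank <= 1100:
        return f"~{round(rank / 10) * 10}"
    # the masses: report a percentile instead of an exact position
    return f"top {max(1, round(100 * rank / total_users))}%"

print(display_rank(42, 1000000))      # "42"
print(display_rank(567, 1000000))     # "~570"
print(display_rank(500000, 1000000))  # "top 50%"
```

Only ranks in the precisely-sorted buckets ever need real-time updates; everything else can be refreshed lazily.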

One million is not so much; I would try the easy way first. If the points property is the thing you are sorting on, it needs to be a database column. Then you can get a user's rank with a simple count of the rows whose points exceed that user's. To get the people near a given user, query for people with higher points, sort ascending, and limit to the number of neighbors you want.
The tricky thing will be calculating the points on save. You need to use the current time as a bonus multiplier: one point now needs to turn into a number that is less than one point five days from now. If your users frequently gain points, you will need a queue to handle the load.
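A minimal sketch of the rank-by-count idea, using sqlite3 instead of the Django ORM (the table, names, and point values are made up for illustration):

```python
import sqlite3

# Toy standings table; in the real app this would be the alias table
# with a denormalized points column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alias (name TEXT, points INTEGER)")
conn.executemany("INSERT INTO alias VALUES (?, ?)",
                 [("a", 10), ("b", 30), ("c", 20), ("d", 40), ("e", 25)])

def rank_of(points):
    # rank = 1 + number of users with strictly more points
    higher = conn.execute("SELECT COUNT(*) FROM alias WHERE points > ?",
                          (points,)).fetchone()[0]
    return higher + 1

def neighbors_above(points, n=2):
    # the n users immediately above, closest first
    rows = conn.execute("SELECT name FROM alias WHERE points > ? "
                        "ORDER BY points ASC LIMIT ?",
                        (points, n)).fetchall()
    return [r[0] for r in rows]

print(rank_of(20))          # 3 users have more than 20 points -> rank 4
print(neighbors_above(20))  # ['e', 'b']
```

With an index on the points column, both queries stay cheap even at a million rows.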

Related

Selecting 6 teams out of 12 to make the playoffs, given each team has a different probability

I'm trying to run a Monte Carlo simulation using Python to determine multiple playoff scenarios of teams in my fantasy football league. But I haven't gotten that deep into it yet as I'm stuck on the first step of the algorithm.
I've determined that the first thing I want to do in the algorithm is reliably pick the 6 playoff teams, in a way that, if repeated 10,000 times, should pretty closely line up with the probabilities I've assigned the teams.
For example, if I gave team A a probability of 78% to make the playoffs, I want my team selection algorithm to choose team A 7800 times out of 10,000 with a reasonable degree of error maybe +/- 1% so like 7700 - 7900 times out of 10,000.
The algorithm in english is basically:
"for every team in the league, choose a random percent (1-100 / 100) and compare it to the probability that the team makes the playoffs. If the die roll is less than or equal to the team's probability to make the playoffs, then add them to a list. If after running through all 12 teams, this list has a length of exactly 6, return the result. Otherwise clear the list, start over, and keep going until you have exactly 6".
Unfortunately, the actual results I'm getting are that the teams above 50% probability end up with slightly more playoff appearances and the teams under 50% probability end up with slightly fewer. So the 33% team regularly shows up 2800-2900 times (28-29%), but the team with a 68% probability regularly shows up 7100-7200 times (71-72%).
If I remove the part of the code that checks if the list is an exact length of 6, and just allow lists of all lengths, the output becomes more in line with the probabilities and are within the 1% margin of error I'm looking for. But I need for this list to have an exact length of 6. So how do I get around this?
I am pretty new to Python, so be gentle :) Advice is greatly appreciated, thank you!
import random

# List is (team abbreviation, playoff odds)
AA = ["AA", .33]
BG = ["BG", .99]
BSC = ["BSC", .68]
BT = ["BT", .95]
CHA = ["CHA", .97]
DDB = ["DDB", .11]
EJ = ["EJ", .48]
KCT = ["KCT", .82]
MTA = ["MTA", .00]
NSR = ["NSR", .01]
TDP = ["TDP", .57]
THR = ["THR", .09]
teams = [AA, BG, BSC, BT, CHA, DDB, EJ, KCT, MTA, NSR, TDP, THR]

# Algorithm to pick the 6 playoff teams
# Dice roll turned into a percentage and compared against each team's probability
# Throw out any results that don't have exactly 6 teams
def playoffs():
    while True:
        playoff_list = []
        for i in teams:
            diceroll = random.randint(1, 100) / 100
            if i[1] >= diceroll:
                playoff_list.append(i[0])
        if len(playoff_list) == 6:
            break
    return playoff_list

# Function to run the above selection algorithm 10,000 times
# Add every 6-team scenario to one giant list with 60,000 items
def playoffs_check():
    playoff_teams = []
    playoff_test = 0
    while playoff_test < 10000:
        playoff_teams.extend(playoffs())
        playoff_test += 1
    return playoff_teams

# Run the algorithm
# Count the results so you can compare to the original probabilities
playoffs_list = playoffs_check()
for x in teams:
    print(x[0], playoffs_list.count(x[0]))

Birthday Paradox, incorrect output by about 1

I'm relatively new to python and wanted to test myself, by tackling the birthday problem. Rather than calculating it mathematically, I wanted to simulate it, to see if I would get the right answer. So I assigned all boolean values in the list sieve[] as False and then randomly pick a value from 0 to 364 and change it to True, if it's already True then it outputs how many times it had to iterate as an answer.
For some reason, every time I run the code, I get a value between 24.5 and 24.8
The expected result for 50% is 23 people, so why is my result 6% higher than it should be? Is there an error in my code?
import random

def howManyPeople():
    sieve = [False] * 365
    count = 1
    while True:
        newBirthday = random.randint(0, 364)
        if sieve[newBirthday]:
            return count
        else:
            sieve[newBirthday] = True
            count += 1

def multipleRun():
    global timesToRun
    results = []
    for i in range(timesToRun):
        results.append(howManyPeople())
    finalResultAverage = sum(results)
    return (finalResultAverage / timesToRun)

timesToRun = int(input("How many times would you like to run this code?"))
print("Average of all solutions = " + str(multipleRun()) + " people")
There's no error in your code. You're computing the mean of your sample of howManyPeople return values, when what you're really interested in (and what the birthday paradox tells you about) is the median of the distribution.
That is, you've got a random process where you incrementally add people to a set, then report the total number of people in that set on the first birthday collision. The birthday paradox implies that at least 50% of the time, your set will have 23 or fewer people. That's not the same thing as saying the expected number of people in the set is 23.0 or smaller.
Here's what I see from one million samples of your howManyPeople function.
In [4]: sample = [howManyPeople() for _ in range(1000000)]
In [5]: import numpy as np
In [6]: np.median(sample)
Out[6]: 23.0
In [7]: np.mean(sample)
Out[7]: 24.617082
In [8]: np.mean([x <= 23 for x in sample])
Out[8]: 0.506978
Note that there's a (tiny) amount of luck here: the median of the distribution of howManyPeople return values is 23 (at least according to Wikipedia's definition), but there's a chance that an unusual sample could have different median, purely through randomness. In this particular case, that chance is entirely negligible. And as user2357112 points out in comments, things are a bit messier in the 2-day year example, where any real number between 2.0 and 3.0 (inclusive) is a valid distribution median, and we could reasonably expect a sample median to be either 2 or 3.
Instead of sampling, we can also compute the probabilities of each output of howManyPeople directly: for any positive integer k, the probability that the output is strictly larger than k is the same as the probability that the first k people have distinct birthdays, which is given (in Python syntax) by factorial(365) / (factorial(365 - k) * 365**k), and we can use that to compute the probabilities of individual outputs. Here I'm using the name X for the random variable represented by howManyPeople. Some inefficient code:
from math import factorial

def prob_x_greater_than(k):
    """Probability that the output of howManyPeople is > k."""
    if k <= 0:
        return 1.0
    elif k > 365:
        return 0.0
    else:
        # group the divisions so the single int-to-float division
        # stays within float range even for large k
        return factorial(365) / (factorial(365 - k) * 365**k)

def prob_x_equals(k):
    """Probability that the output of howManyPeople is == k."""
    return prob_x_greater_than(k - 1) - prob_x_greater_than(k)
With this, we can get the exact (well, okay, exact up to numerical errors) mean and verify that it roughly matches what we got from the sample:
In [18]: sum(k*prob_x_equals(k) for k in range(1, 366))
Out[18]: 24.616585894598863
And the birthday paradox in this case should tell us that the sum of the probabilities for k <= 23 is greater than 0.5:
In [19]: sum(prob_x_equals(k) for k in range(1, 24))
Out[19]: 0.5072972343239854
What you're seeing is normal. There may be a >50% chance of having a duplicate birthday in a room of 23 random people (ignoring leap years and nonuniform birthday distributions), but that doesn't mean that if you add people to a room one by one, the mean point at which you get a duplicate will be 23.
To get an intuitive feel for this, imagine if years only had two days. In this case, it's clear that there's a 50% chance of having a duplicate birthday in a room with 2 people. However, if you add random people to the room one by one, you're going to need at least two people - 50% chance of stopping at 2 and 50% of 3. The mean stopping point is 2.5, not 2.
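That two-day-year intuition is easy to check with a quick simulation (stopping_point is a hypothetical mini version of howManyPeople with a configurable year length):

```python
import random

def stopping_point(days=2):
    """Add random birthdays one at a time; return the head count
    at the first collision."""
    seen = set()
    count = 1
    while True:
        b = random.randrange(days)
        if b in seen:
            return count
        seen.add(b)
        count += 1

random.seed(0)
sample = [stopping_point(2) for _ in range(100000)]
print(sum(sample) / len(sample))  # mean ~2.5, even though the median is 2
```

Half the runs stop at 2 people and half at 3, so the mean sits at 2.5 while the 50% point is 2.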

Why adding variables won't work

I'm making a game where you run your own coffee shop. You choose how many ingredients you want to buy, then it charges you for them as long as they don't cost too much.
Rather than adding the variables up as numbers, it comes up with a huge number that is 1000s of times larger than the expected value. I have no clue why (I'm fairly new to Python, so forgive me if it's obvious; this is also my first time using StackOverflow, so if I've forgotten to add any info, let me know).
var1 = 11
var2 = 15
print(str(var1 + var2))
The problem is float((NoStaff * 30)): when NoStaff is the string '1', NoStaff * 30 is string repetition, producing '111111111111111111111111111111', which then gets converted to a number. You want
float(NoStaff) * 30
instead: convert to a number first, then multiply.
Additionally, you may want to address the following logic issues:
You can buy partial staff members (0.5)
Your bean count is reset every time you buy new beans
You get the beans/milk even if they cost too much
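A quick illustration of the difference (here NoStaff stands in for the string that input() returns):

```python
NoStaff = '1'               # input() always returns a string

print(NoStaff * 30)         # string repetition: thirty '1' characters
print(float(NoStaff * 30))  # that repeated string converted: a huge number
print(float(NoStaff) * 30)  # convert first, then multiply: 30.0
```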

Randomly SELECTing rows based on certain criteria

I'm building a media player for the office, and so far so good but I want to add a voting system (kinda like Pandora thumbs up/thumbs down)
To build the playlist, I am currently using the following code, which pulls 100 random tracks that haven't been played recently (we make sure all tracks have around the same play count), ensures we don't hear the same artist within 10 songs, and builds a playlist of 50 songs.
max_value = Items.select(fn.Max(Items.count_play)).scalar()
query = (Items
         .select()
         .where(Items.count_play < max_value, Items.count_skip_vote < 5)
         .order_by(fn.Rand())
         .limit(100))
if query.count() < 1:
    max_value = max_value - 1
    query = (Items
             .select()
             .where(Items.count_play < max_value, Items.count_skip_vote < 5)
             .order_by(fn.Rand())
             .limit(100))

artistList = []
playList = []
for item in query:
    if len(playList) == 50:
        break
    if item.artist not in artistList:
        playList.append(item.path)
        if len(artistList) < 10:
            artistList.append(item.artist)
        else:
            artistList.pop(0)
            artistList.append(item.artist)

for path in playList:
    client.add(path.replace("/music/Library/", ""))
I'm trying to work out the best way to use the up/down votes.
I want to see less with downvotes and more with upvotes.
I'm not after direct code because I'm pretty OK with python, it's more of the logic that I can't quite nut out (that being said, if you feel the need to improve my code, I won't stop you :) )
Initially give each track a weight w, e.g. 10 -- an upvote increases it, a downvote reduces it (but never to 0). Then when deciding which track to play next:
Calculate the total of all the weights, generate a random number between 0 and this total, and step through the tracks from 0-49, adding up their w until you exceed the random number. Play that track.
The exact weighting algorithm (e.g. how much an upvote/downvote changes w) will of course affect how often tracks (re)appear. Wasn't it Apple who had to change the 'random' shuffle on their early iPods because it could play the same track twice (or close enough together for a user to notice)? They had to make it less random, which I presume means also weighting by how recently a track was played -- in that case the time since last play would also be taken into account when choosing the next track. Make sure you cover the end cases where everyone downvotes 49 (or all 50, if they want silence) of the tracks. Or maybe that's what you want...
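A minimal sketch of that cumulative-weight selection (the track names and weights are made up; random.choices does the same stepping internally if you prefer the stdlib):

```python
import random

def weighted_pick(tracks):
    """tracks: list of (path, weight) pairs. Return one path, chosen
    with probability proportional to its weight."""
    total = sum(w for _, w in tracks)
    r = random.uniform(0, total)
    running = 0
    for path, w in tracks:
        running += w
        if running >= r:
            return path
    return tracks[-1][0]  # guard against float rounding at the top end

random.seed(1)
playlist = [("a.mp3", 10), ("b.mp3", 30), ("c.mp3", 1)]
picks = [weighted_pick(playlist) for _ in range(1000)]
for name in ("a.mp3", "b.mp3", "c.mp3"):
    print(name, picks.count(name))
```

Over many picks, the counts track the 10:30:1 weight ratio, so downvoted tracks still appear occasionally but heavily upvoted ones dominate.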

Statistics: Optimizing probability calculations within python

Setup:
The question is a complex form of a classic probability question:
70 colored balls are placed in an urn, 10 for each of the seven rainbow colors.
What is the expected number of distinct colors in 20 randomly picked balls?
My solution uses python's itertools library:
combos = itertools.combinations(urn, 20)
print sum([1 for x in combos])
(where urn is a list of the 70 balls in the urn).
I can unpack the iterator up to a length of combinations(urn, 8); past that, my computer can't handle it.
Note: I know this wouldn't give me the answer, this is only the road block in my script, in other words if this worked my script would work.
Question: How could I find the expected number of colors accurately, without the world's fastest supercomputer? Is my way even computationally possible?
Since a couple of people have asked to see the mathematical solution, I'll give it. This is one of the Project Euler problems that can be done in a reasonable amount of time with pencil and paper. The answer is
7(1 - (60 choose 20)/(70 choose 20))
To get this write X, the count of colors present, as a sum X0+X1+X2+...+X6, where Xi is 1 if the ith color is present, and 0 if it is not present.
E(X)
= E(X0+X1+...+X6)
= E(X0) + E(X1) + ... + E(X6) by linearity of expectation
= 7E(X0) by symmetry
= 7 * probability that a particular color is present
= 7 * (1- probability that a particular color is absent)
= 7 * (1 - (# ways to pick 20 avoiding a color)/(# ways to pick 20))
= 7 * (1 - (60 choose 20)/(70 choose 20))
Expectation is always linear. So, when you are asked to find the average value of some random quantity, it often helps to try to rewrite the quantity as a sum of simpler pieces such as indicator (0-1) random variables.
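The closed form above is quick to evaluate (using math.comb, available in Python 3.8+):

```python
from math import comb

# E(X) = 7 * (1 - P(a particular color is absent from the 20 drawn))
expected = 7 * (1 - comb(60, 20) / comb(70, 20))
print(expected)  # about 6.82 distinct colors on average
```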
This does not say how to make the OP's approach work. Although there is a direct mathematical solution, it is good to know how to iterate through the cases in an organized and practicable fashion. This could help if you next wanted a more complicated function of the set of colors present than the count. Duffymo's answer suggested something that I'll make more explicit:
You can break up the ways to draw 20 calls from 70 into categories indexed by the counts of colors. For example, the index (5,5,10,0,0,0,0) means we drew 5 of the first color, 5 of the second color, 10 of the third color, and none of the other colors.
The set of possible indices is contained in the collection of 7-tuples of nonnegative integers with sum 20. Some of these are impossible, such as (11,9,0,0,0,0,0) by the problem's assumption that there are only 10 balls of each color, but we can deal with that. The set of 7-tuples of nonnegative numbers adding up to 20 has size (26 choose 6)=230230, and it has a natural correspondence with the ways of choosing 6 dividers among 26 spaces for dividers or objects. So, if you have a way to iterate through the 6 element subsets of a 26 element set, you can convert these to iterate through all indices.
You still have to weight the cases by the counts of the ways to draw 20 balls from 70 to get that case. The weight of (a0,a1,a2,...,a6) is (10 choose a0)*(10 choose a1)*...*(10 choose a6). This handles the case of impossible indices gracefully, since 10 choose 11 is 0, so the product is 0.
So, if you didn't know about the mathematical solution by the linearity of expectation, you could iterate through 230230 cases and compute a weighted average of the number of nonzero coordinates of the index vector, weighted by a product of small binomial terms.
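The iteration described above can be sketched like this: enumerate divider positions, convert each to a tuple of color counts, and weight by the product of binomials (function name and defaults are mine):

```python
from itertools import combinations
from math import comb

def expected_colors(colors=7, per_color=10, draw=20):
    """Weighted average of the number of colors present, iterating over
    all tuples of nonnegative counts summing to `draw` via stars and bars."""
    slots = draw + colors - 1  # 26 positions holding 20 balls and 6 dividers
    total_weight = 0
    weighted_sum = 0
    for dividers in combinations(range(slots), colors - 1):
        # turn divider positions into per-color counts
        counts, prev = [], -1
        for d in dividers:
            counts.append(d - prev - 1)
            prev = d
        counts.append(slots - 1 - prev)
        # weight = number of ways to draw exactly these counts;
        # impossible counts (> 10) contribute comb(10, c) == 0
        w = 1
        for c in counts:
            w *= comb(per_color, c)
        total_weight += w
        weighted_sum += w * sum(1 for c in counts if c > 0)
    return weighted_sum / total_weight

print(expected_colors())  # matches 7 * (1 - comb(60, 20) / comb(70, 20))
```

The 230230 iterations run in a few seconds, and the weighted average agrees with the closed form from the linearity-of-expectation argument.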
Wouldn't it just be combinations with repetition?
http://www.mathsisfun.com/combinatorics/combinations-permutations.html
Make an urn with 10 of each color.
Decide on the number of trials you want.
Make a container to hold the result of each trial.
For each trial, pick a random sample of twenty items from the urn, make a set of those items, and add the length of that set to the results.
Find the average of the results.
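That recipe in code (random.sample does the drawing without replacement):

```python
import random

# 10 balls of each of the 7 rainbow colors
urn = [color for color in range(7) for _ in range(10)]

def trial():
    drawn = random.sample(urn, 20)  # 20 balls, without replacement
    return len(set(drawn))          # number of distinct colors seen

random.seed(0)
trials = 20000
results = [trial() for _ in range(trials)]
print(sum(results) / trials)  # should land near the exact answer, ~6.82
```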
