I think this problem can be solved using either itertools or cartesian, but I'm fairly new to Python and am struggling to use these:
I have a portfolio of 5 stocks, where each stock can have a weighting of -0.4, -0.2, 0, 0.2 or 0.4, with weightings adding up to 0. How do I create a function that produces a list of every possible combination of weights, e.g. [-0.4, 0.2, 0, 0.2, 0], etc.?
Ideally, the function would work for n stocks, as I will eventually want to do the same process for 50 stocks.
edit: To clarify, I'm looking for all combinations of length n (in this case 5), summing to 0. The values can repeat, e.g. [0.2, 0.2, -0.4, 0, 0], [0.4, 0, -0.2, -0.2, 0], [0, 0, 0, 0.2, -0.2], [0, 0.4, -0.4, 0.2, -0.2] etc. So [0, 0, 0, 0, 0] would be a possible combination. The fact that there are 5 possible weightings and 5 stocks is a coincidence (which I should have avoided!); the same question could be asked with 5 possible weightings and 3 stocks or 7 stocks. Thanks.
Something like this, although it's not really efficient.
from decimal import Decimal
import itertools

# possible optimization: use integers rather than Decimal
weights = [Decimal("-0.4"), Decimal("-0.2"), Decimal(0), Decimal("0.2"), Decimal("0.4")]

def possible_weightings(n = 5, target = 0):
    for all_bar_one in itertools.product(weights, repeat = n - 1):
        final = target - sum(all_bar_one)
        if final in weights:
            yield all_bar_one + (final,)
I repeat from comments, you cannot do this for n = 50. The code yields the right values, but there isn't time in the universe to iterate over all the possible weightings.
This code isn't brilliant. It does some unnecessary work examining cases where, for example, the sum of all but the first two is already greater than 0.8 and so there's no point separately checking all the possibilities for the first of those two.
So, this does n = 5 in nearly no time, but there is some value of n where this code becomes infeasibly slow, and you could get further with better code. You still won't get to 50. I'm too lazy to write that better code, but basically instead of all_bar_one you can make recursive calls to possible_weightings with successively smaller values of n and a value of target equal to the target you were given, minus the sum you have so far. Then prune all the branches you don't need to take, by bailing out early in cases where target is too large (positive or negative) to be reached using only n values.
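A rough sketch of the recursive, pruned version described above; it is not a tuned implementation. It assumes the weights are passed as integers (units of 0.2, so -2..2 stands for -0.4..0.4) so that sums are exact. The names pruned_weightings and _recurse are made up for this example.

```python
def pruned_weightings(weights, n, target=0):
    """Yield every length-n tuple drawn from `weights` that sums to `target`.

    Assumes integer weights (e.g. units of 0.2) so the sums are exact.
    """
    lo, hi = min(weights), max(weights)

    def _recurse(remaining, target):
        if remaining == 0:
            if target == 0:
                yield ()
            return
        # Prune: `remaining` picks can only reach sums in
        # [remaining * lo, remaining * hi]; anything else is a dead branch.
        if target < remaining * lo or target > remaining * hi:
            return
        for w in weights:
            for rest in _recurse(remaining - 1, target - w):
                yield (w,) + rest

    return _recurse(n, target)


int_weights = [-2, -1, 0, 1, 2]  # stands for -0.4, -0.2, 0, 0.2, 0.4
results = list(pruned_weightings(int_weights, 5))
print(len(results))  # 381 weightings for n = 5
```

Pruning only removes dead branches. The number of valid weightings itself still grows exponentially with n, so n = 50 stays out of reach.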
I understand the values can repeat, but all have to sum to zero, therefore the solution might be:
>>> from itertools import permutations
>>> weights = [-0.4, -0.2, 0, 0.2, 0.4]
>>> result = (com for com in permutations(weights) if sum(com)==0)
>>> for i in result: print(i)
edit:
You might use product, as @Steve Jassop suggested:
combi = (i for i in itertools.product(weights, repeat=len(weights)) if not sum(i))
for c in combi:
    print(c)
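One caveat with the float versions above: a sum such as 0.4 + 0.2 - 0.2 - 0.4 may not come out as exactly 0 in binary floating point, so an exact test against 0 can silently drop valid rows. A minimal sketch that sidesteps this by counting in integer units of 0.2 (the constants INT_WEIGHTS and UNITS_PER_ONE and the function name are assumptions for this example):

```python
import itertools

INT_WEIGHTS = [-2, -1, 0, 1, 2]  # units of 0.2, matching the question's weights
UNITS_PER_ONE = 5                # 1.0 == 5 units of 0.2


def zero_sum_weightings(n):
    """Yield all length-n weightings (as floats) whose integer units sum to 0."""
    for combo in itertools.product(INT_WEIGHTS, repeat=n):
        if sum(combo) == 0:  # exact integer comparison, no round-off
            yield tuple(u / UNITS_PER_ONE for u in combo)


rows = list(zero_sum_weightings(5))
print(len(rows))  # 381
```

This is still brute force over 5**n candidates, so it only helps with correctness, not with speed.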
I like using the filter function:
from itertools import permutations

w = [-0.4, -0.2, 0, 0.2, 0.4]

def foo(w):
    perms = list(permutations(w))
    # filter() returns a lazy iterator in Python 3, so materialize it
    sum0 = list(filter(lambda x: sum(x)==0, perms))
    return sum0

print(foo(w))
Different approach.
1 Figure out all sequences of the weights that add up to zero, in order.
for example, these are some possibilities (using whole numbers to type less):
[0, 0, 0, 0, 0]
[-4, 0, 0, +2, +2]
[-4, 0, 0, 0, +4]
[-4, +4, 0, 0, 0] is incorrect because weights are not picked in order.
2 Permute what you got above, because the permutations will add up to zero as well.
This is where you'd get your [-4, 0, 0, 0, +4] and [-4, +4, 0, 0, 0]
OK, being lazy. I am going to pseudo-code/comment-code a good deal of my solution. Not that strong at recursion, the stuff is too tricky to code quickly and I have doubts that this type of solution scales up to 50.
i.e. I don't think I am right, but it might give someone else an idea.
def find_solution(weights, length, last_pick, target_sum):
    # returns a list of solutions, in growing order, of weights adding up to the target_sum
    # weights are the sequence of possible weights - IN ORDER, NO REPEATS
    # length is how many weights we are adding up
    # last_pick - the weight picked by the caller
    # target_sum is what we are aiming for, which will always be >=0
    solutions = []
    if length > 1:
        #since we are picking in order, having picked 0 "disqualifies" -4 and -2.
        if last_pick > weights[0]:
            weights = [w for w in weights if w >= last_pick]
        #all remaining weights are possible
        for weight in weights:
            child_target_sum = target_sum + weight
            #basic idea, we are picking in growing order
            #if we start out picking +2 in a [-4,-2,0,+2,+4] list in order, then we are constrained to finding -2
            #with just 2 and 4 as choices. won't work.
            if child_target_sum <= 0:
                break
            child_solutions = find_solution(weights, length=length-1, last_pick=weight, target_sum=child_target_sum)
            [solutions.append([weight] + child) for child in child_solutions if child]
    else:
        #only 1 item to pick left, so it has to be the target_sum
        if target_sum in weights:
            return [[target_sum]]
    return solutions
weights = list(set(weights))
weights.sort()
#those are not permutated yet
#(pseudo-code: len(solution) stands for the number of weights to pick)
solutions = find_solution(weights, len(solution), -999999999, 0)
permutated = []
for solution in solutions:
    permutated.extend(itertools.permutations(solution))
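The two steps above (find the in-order picks that sum to zero, then permute them) can be sketched with the standard library instead of hand-written recursion. This is an illustration of the idea, not the code the answer had in mind; zero_sum_orderings is a hypothetical name, and whole-number weights are used as in the examples.

```python
from itertools import combinations_with_replacement, permutations


def zero_sum_orderings(weights, n):
    """Step 1: in-order picks summing to 0. Step 2: distinct permutations of each."""
    for picks in combinations_with_replacement(sorted(weights), n):
        if sum(picks) != 0:
            continue
        # set() removes duplicate orderings caused by repeated weights
        for ordering in set(permutations(picks)):
            yield ordering


orderings = list(zero_sum_orderings([-4, -2, 0, 2, 4], 5))
print(len(orderings))  # 381
```

Note that permutations() still generates all n! orderings of each pick before set() removes the duplicates, so this does not scale to large n either.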
If you just want a list of all the combinations, use itertools.combinations:
import itertools

w = [-0.4, -0.2, 0, 0.2, 0.4]
l = len(w)
if __name__ == '__main__':
    for i in range(1, l+1):
        for p in itertools.combinations(w, i):
            print(p)
If you want to count the different weights that can be created with these combinations, it's a bit more complicated.
First, you generate combinations with 1, 2, 3, ... elements. Then you take the sum of each one. Then you add the sum to the set (this does nothing if the number is already present, and is a very fast operation). Finally you convert the set to a list and sort it.
from itertools import combinations

def round_it(n, p):
    """rounds n, to have maximum p figures to the right of the decimal point"""
    return int((10**p)*n)/float(10**p)

w = [-0.4, -0.2, 0, 0.2, 0.4]
l = len(w)
res = set()
if __name__ == '__main__':
    for i in range(1, l+1):
        for p in combinations(w, i):
            res.add(round_it(sum(p), 10)) # rounding necessary to avoid artifacts
    print(sorted(res))
Is this what you are looking for?
L = [-0.4, 0.2, 0, 0.2, 0]
AllCombi = itertools.permutations(L)
for each in AllCombi:
    print(each)
Related
I have been struggling to mock the IoT Sensor data. I need a list of floats which will increase and decrease sequentially.
For example [0.1, 0.12, 0.13, 0.18, 1.0, 1.2, 1.0, 0.9, 0.6]
Right now I have generated the list with max and min range using this,
for k in dts:
    x = round(random.uniform(j["min"], j["max"]), 3)
    random_float_list.append(x)
The list generated from this code is not in a sequence. I need something that generates random floats within a range without abrupt changes. Values can increase and decrease in sequence.
You can generate multiple random sequences and glue them together. Something like this:
import numpy as np

def gen_floats(count, min_step_size, max_step_size, max_seq_len):
    # Start around 0
    res = [np.round(np.random.rand() - 0.5, 2)]
    while len(res) < count:
        step_size = np.random.uniform(min_step_size, max_step_size)
        # Generate random number of steps for sequence
        remaining = count - len(res)
        steps = np.random.randint(1, remaining + 1 if remaining < max_seq_len else max_seq_len)
        # Generate additive or subtractive sequence using previous values
        if np.random.rand() > 0.5:
            vals = np.round(np.linspace(res[-1] + step_size, res[-1] + steps * step_size, steps), 2)
        else:
            # start one step below the previous value so the run decreases from its first element
            vals = np.round(np.linspace(res[-1] - step_size, res[-1] - steps * step_size, steps), 2)
        res.extend(vals)
    return res
Then print(gen_floats(20, 0.1, 0.5, 10)) generates something like: [0.4, 0.86, 0.25, -0.37, -0.99, -1.61, -2.23, -2.85, -2.64, -2.95, -3.26, -3.57, -3.88, -3.63, -3.38, -3.19, -2.89, -2.63, -3.15, -3.68]. You can play with params to match desired output.
Something like this should work if you want random values where you can control the min, the max, and the maximum difference between consecutive values.
It first draws a random value between start and end and appends it to the output list. Each subsequent value is a random value within +-max_diff of the last value in the output list.
import random

def rand(start,end,max_diff,elements,output):
    elements -= 1
    if output:
        if output[-1]-max_diff < start: #To not get a value smaller than start
            output.append(round(random.uniform(start,output[-1]+max_diff),3))
        elif output[-1]+max_diff > end: #To not get a value bigger than end
            output.append(round(random.uniform(output[-1]-max_diff,end),3))
        else:
            output.append(round(random.uniform(output[-1]-max_diff,output[-1]+max_diff),3))
    else:
        output.append(round(random.uniform(start,end),3))
    if elements > 0:
        output = rand(start,end,max_diff,elements,output)
    return output

print(rand(1,2,0.1,3,[])) #[1.381, 1.375, 1.373]
You can generate random numbers with a uniform distribution, and then sort the numbers into ascending order in the first part, and into descending order in the second part.
import numpy as np
np.random.seed(0)

def gen_rnd_sensor_data(low: float,
                        high: float,
                        n_incr: int,
                        n_decr: int) -> np.ndarray:
    incr = np.random.uniform(low=low, high=high, size=n_incr)
    incr.sort()
    decr = np.random.uniform(low=low, high=high, size=n_decr)
    decr[::-1].sort()
    return np.concatenate((incr, decr))
Then you can call this function with:
print(gen_rnd_sensor_data(0, 1, 5, 3))
This generates data between 0 and 1; the first 5 values are increasing and the last 3 are decreasing. Because the seed is fixed, every call within one run gives different results, but rerunning the program reproduces the same results, which makes debugging easier.
I was given this problem. Given a list percentage = [0.1, 0.1, 0.8] and number = 9, find every possible list (each element ranges from 0.25 to 10 in increments of 0.25) such that multiplying it elementwise by the percentage list, summing the products, and rounding to 1 decimal place gives number = 9. I solved this by brute force with the help of itertools.product, but brute force this way is pretty slow. I'm trying to find bounds (the lower and upper bound in range(lower bound, upper bound, 25)) for my 'for' loop. Can you suggest a way to find them?
import itertools
ranges = []
n = int(input()) #number of element in percentage list
percent = []
for i in range(n):
    percent.append(float(input())) #input the percentage list
total = float(input()) #the number mentioned above
for i in range(n):
    ranges.append(range(25,1025,25)) #find boundary for this line
for xs in itertools.product(*ranges):
    avg = 0
    for i in range(n):
        avg += xs[i]*percent[i]
    if avg < (total*100+5) and avg >= (total*100-5):
        for each in xs:
            print(each/100, end = ' ')
        print()
It's a little bit hard for me to explain the algorithm concisely T.T
So the detailed explanation is in the code comments below.
The basic idea is to do this recursively (DFS, depth-first search). The function should be something like recursion(percent_list, result_list, target).
Initially, the call is recursion([0.1, 0.1, 0.8], [], 9).
If we try the first value to be 3.25, then we update the target value to 9 - 3.25*0.1 = 8.675. So we next call recursion([0.1, 0.8], [3.25], 8.675);
Then, we try the second value to be 4.00, and update the target value to 8.675 - 4.0*0.1 = 8.275. So we call recursion([0.8], [3.25, 4.0], 8.275);
Finally, we try the third value; only 9.75 and 10 are valid, since the totals are 8.525 and 8.725 respectively, which both round to 9. So we append [3.25, 4.0, 9.75] and [3.25, 4.0, 10.0] to the result list.
After that, we try the second value to be 0.25, ..., 3.75, 4.25, 4.5, ..., 10.
Then we try the first value to be 0.25, ..., 3.0, 3.5, 3.75, ..., 10.
To avoid too many recursive calls, at each step we compute the range of values that can still lead to a valid result, and cut the branches that are impossible.
The actual function signature is somewhat different, in order to handle the rounding.
import numpy as np

def recursion(percent_list, value_list, previous_results, target, tolerate_lower, tolerate_upper, result_list):
    # percent_list: changes on every call; value_list: 0.25 ~ 10; previous_results: changes on every call
    # target: changes on every call; tolerate_lower: 0.5 = 9.5-9; tolerate_upper: 0.4999 < 9-8.5; result_list: your answer
    # init: [0.1,0.1,0.8] [] 9
    # if reach the target within tolerate
    if len(percent_list) == 0:
        # print(previous_results)
        result_list.append(previous_results)
        return
    # otherwise, cut impossible branches, check minimum and maximum value acceptable for current percent_list
    percent_sum = percent_list.sum() # sum up current percent list, **O(n)**, should be optimized by pre-generating a sum list
    value_min = value_list[0] # minimum value from data list (this problem 0.25)
    value_max = value_list[-1] # maximum value from data list (this problem 10.0)
    search_min = (target - tolerate_lower - (percent_sum - percent_list[0]) * value_max) / percent_list[0] # minimum value acceptable as result
    search_max = (target + tolerate_upper - (percent_sum - percent_list[0]) * value_min) / percent_list[0] # maximum value acceptable as result
    idx_min = np.searchsorted(value_list, search_min, "left") # index of minimum value (for data list)
    idx_max = np.searchsorted(value_list, search_max, "right") # index of maximum value (for data list)
    # recursion step
    for i in range(idx_min, idx_max):
        # update result list
        current_results = previous_results + [value_list[i]]
        # remove the current state for variables `percent_list`, and update `target` for next step
        recursion(percent_list[1:], value_list, current_results, target - percent_list[0] * value_list[i], tolerate_lower, tolerate_upper, result_list)
To solve this current problem,
result = []
recursion(np.array([0.1, 0.1, 0.8]), np.arange(0.25, 10.25, 0.25), [], 9, 0.5, 0.49999, result)
There are 4806 possible results in total. To validate that every result sums to about 9 (this cannot validate that the results are complete),
for l in result:
    if not (8.5 <= (np.array([0.1, 0.1, 0.8]) * np.array(l)).sum() < 9.5):
        print("Wrong code!")
I think the worst-case complexity is still O(m^n * n), where m is the length of the data list (0.25, 0.5, ..., 10) and n is the length of the percent list (0.1, 0.1, 0.8). It could be further optimized to O(m^n * log(m)) by not summing the percent list on every recursive call, and to O(m^n) if we fully exploit the fact that the data list is an arithmetic sequence.
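The "pre-generated sum list" mentioned above could look something like this: compute every tail sum of the percent list once, then look it up by depth instead of calling sum() on each recursive call. This is only a sketch; suffix_sums is a hypothetical helper, and the initial= argument of itertools.accumulate needs Python 3.8 or later.

```python
from itertools import accumulate


def suffix_sums(percent_list):
    """Return s where s[i] == sum(percent_list[i:]); s[-1] is 0 for the empty tail."""
    running = list(accumulate(reversed(percent_list), initial=0))
    return running[::-1]


sums = suffix_sums([1, 2, 3])
print(sums)  # [6, 5, 3, 0]
```

At recursion depth d, sums[d + 1] would then replace the expression percent_sum - percent_list[0].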
I have a file with some probabilities for different values e.g.:
1 0.1
2 0.05
3 0.05
4 0.2
5 0.4
6 0.2
I would like to generate random numbers using this distribution. Is there an existing module that handles this? It's fairly simple to code on your own (build the cumulative distribution function, generate a random value in [0,1] and pick the corresponding value), but it seems like this should be a common problem and someone has probably created a function/module for it.
I need this because I want to generate a list of birthdays (which do not follow any distribution in the standard random module).
scipy.stats.rv_discrete might be what you want. You can supply your probabilities via the values parameter. You can then use the rvs() method of the distribution object to generate random numbers.
As pointed out by Eugene Pakhomov in the comments, you can also pass a p keyword parameter to numpy.random.choice(), e.g.
numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2])
If you are using Python 3.6 or above, you can use random.choices() from the standard library – see the answer by Mark Dickinson.
Since Python 3.6, there's a solution for this in Python's standard library, namely random.choices.
Example usage: let's set up a population and weights matching those in the OP's question:
>>> from random import choices
>>> population = [1, 2, 3, 4, 5, 6]
>>> weights = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]
Now choices(population, weights) generates a single sample, contained in a list of length 1:
>>> choices(population, weights)
[4]
The optional keyword-only argument k allows one to request more than one sample at once. This is valuable because there's some preparatory work that random.choices has to do every time it's called, prior to generating any samples; by generating many samples at once, we only have to do that preparatory work once. Here we generate a million samples, and use collections.Counter to check that the distribution we get roughly matches the weights we gave.
>>> million_samples = choices(population, weights, k=10**6)
>>> from collections import Counter
>>> Counter(million_samples)
Counter({5: 399616, 6: 200387, 4: 200117, 1: 99636, 3: 50219, 2: 50025})
An advantage to generating the list using CDF is that you can use binary search. While you need O(n) time and space for preprocessing, you can get k numbers in O(k log n). Since normal Python lists are inefficient, you can use array module.
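A minimal sketch of the CDF-plus-binary-search approach using only the standard library. The name make_sampler and the rng parameter are made up for this example; plain lists are used here for simplicity.

```python
import bisect
import itertools
import random


def make_sampler(values, probs, rng=random):
    """Return a function that draws one value per call: O(n) setup, O(log n) per draw."""
    cdf = list(itertools.accumulate(probs))  # running totals of the probabilities
    total = cdf[-1]                          # about 1.0, or any positive sum for raw weights
    last = len(values) - 1

    def draw():
        r = rng.random() * total             # uniform in [0, total)
        idx = bisect.bisect_right(cdf, r)    # first interval whose end point exceeds r
        return values[min(idx, last)]        # min() guards against float round-off

    return draw


draw = make_sampler([1, 2, 3, 4, 5, 6], [0.1, 0.05, 0.05, 0.2, 0.4, 0.2])
sample = [draw() for _ in range(10)]
print(sample)
```

Multiplying by total means the weights do not have to sum to exactly 1.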
If you insist on constant space, you can do the following; O(n) time, O(1) space.
import random

def random_distr(l):
    r = random.uniform(0, 1)
    s = 0
    for item, prob in l:
        s += prob
        if s >= r:
            return item
    return item # Might occur because of floating point inaccuracies
(OK, I know you are asking for shrink-wrap, but maybe those home-grown solutions just weren't succinct enough for your liking. :-)
pdf = [(1, 0.1), (2, 0.05), (3, 0.05), (4, 0.2), (5, 0.4), (6, 0.2)]
cdf = [(i, sum(p for j,p in pdf if j < i)) for i,_ in pdf]
R = max(i for r in [random.random()] for i,c in cdf if c <= r)
I pseudo-confirmed that this works by eyeballing the output of this expression:
sorted(max(i for r in [random.random()] for i,c in cdf if c <= r)
for _ in range(1000))
Maybe it is kind of late. But you can use numpy.random.choice(), passing the p parameter:
val = numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2])
I wrote a solution for drawing random samples from a custom continuous distribution.
I needed this for a similar use-case to yours (i.e. generating random dates with a given probability distribution).
You just need the function random_custDist and the line samples=random_custDist(x0,x1,custDist=custDist,size=1000). The rest is decoration ^^.
import numpy as np
#function
def random_custDist(x0,x1,custDist,size=None, nControl=10**6):
    #generate a list of size random samples, obeying the distribution custDist
    #suggests random samples between x0 and x1 and accepts the suggestion with probability custDist(x)
    #custDist does not need to be normalized. Add this condition to increase performance.
    #Best performance for max_{x in [x0,x1]} custDist(x) = 1
    samples=[]
    nLoop=0
    while len(samples)<size and nLoop<nControl:
        x=np.random.uniform(low=x0,high=x1)
        prop=custDist(x)
        assert prop>=0 and prop<=1
        if np.random.uniform(low=0,high=1) <=prop:
            samples += [x]
        nLoop+=1
    return samples
#call
x0=2007
x1=2019
def custDist(x):
    if x<2010:
        return .3
    else:
        return (np.exp(x-2008)-1)/(np.exp(2019-2007)-1)
samples=random_custDist(x0,x1,custDist=custDist,size=1000)
print(samples)
#plot
import matplotlib.pyplot as plt
#hist
bins=np.linspace(x0,x1,int(x1-x0+1))
hist=np.histogram(samples, bins )[0]
hist=hist/np.sum(hist)
plt.bar( (bins[:-1]+bins[1:])/2, hist, width=.96, label='sample distribution')
#dist
grid=np.linspace(x0,x1,100)
discCustDist=np.array([custDist(x) for x in grid]) #discrete version
discCustDist*=1/(grid[1]-grid[0])/np.sum(discCustDist)
plt.plot(grid,discCustDist,label='custom distribution (custDist)', color='C1', linewidth=4)
#decoration
plt.legend(loc=3,bbox_to_anchor=(1,0))
plt.show()
The performance of this solution is improvable for sure, but I prefer readability.
Make a list of items, based on their weights:
items = [1, 2, 3, 4, 5, 6]
probabilities= [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]
# if the list of probs is normalized (sum(probs) == 1), omit this part
prob = sum(probabilities) # find sum of probs, to normalize them
c = (1.0)/prob # a multiplier to make a list of normalized probs
probabilities = [c*x for x in probabilities]
print(probabilities)
ml = max(probabilities, key=lambda x: len(str(x)) - str(x).find('.'))
ml = len(str(ml)) - str(ml).find('.') -1
amounts = [ int(x*(10**ml)) for x in probabilities]
itemsList = list()
for i in range(0, len(items)): # iterate through original items
    itemsList += items[i:i+1]*amounts[i]
# choose from itemsList randomly
print(itemsList)
An optimization may be to normalize amounts by the greatest common divisor, to make the target list smaller.
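The GCD idea could be sketched like this, assuming amounts is a list of non-negative integers as built above (reduce_amounts is a hypothetical helper name):

```python
from functools import reduce
from math import gcd


def reduce_amounts(amounts):
    """Divide every amount by their greatest common divisor to shrink the item list."""
    divisor = reduce(gcd, amounts)
    if divisor == 0:  # all amounts are zero; nothing to reduce
        return list(amounts)
    return [a // divisor for a in amounts]


print(reduce_amounts([10, 5, 5, 20, 40, 20]))  # [2, 1, 1, 4, 8, 4]
```

The relative proportions stay the same, so sampling from the shorter list gives the same distribution.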
Also, this might be interesting.
Another answer, probably faster :)
import random

distribution = [(1, 0.2), (2, 0.3), (3, 0.5)]
# init distribution
dlist = []
sumchance = 0
for value, chance in distribution:
    sumchance += chance
    dlist.append((value, sumchance))
assert abs(sumchance - 1.0) < 1e-9 # tolerance, because exact float equality is not a good assert

def get_random_value():
    # get random value
    r = random.random()
    # for small distributions use linear search
    # (for large ones a binary search over dlist would be faster; not implemented)
    for value, sumchance in dlist:
        if r < sumchance:
            return value
    return dlist[-1][0] # guard against floating point round-off
from __future__ import division
import random
from collections import Counter
def num_gen(num_probs):
    # calculate minimum probability to normalize
    min_prob = min(prob for num, prob in num_probs)
    lst = []
    for num, prob in num_probs:
        # keep appending num to lst, proportional to its probability in the distribution
        for _ in range(int(prob/min_prob)):
            lst.append(num)
    # all elems in lst occur proportional to their distribution probabilities
    while True:
        # pick a random index from lst
        ind = random.randint(0, len(lst)-1)
        yield lst[ind]
Verification:
gen = num_gen([(1, 0.1),
(2, 0.05),
(3, 0.05),
(4, 0.2),
(5, 0.4),
(6, 0.2)])
lst = []
times = 10000
for _ in range(times):
    lst.append(next(gen))
# Verify the created distribution:
for item, count in Counter(lst).items():
    print('%d has %f probability' % (item, count/times))
1 has 0.099737 probability
2 has 0.050022 probability
3 has 0.049996 probability
4 has 0.200154 probability
5 has 0.399791 probability
6 has 0.200300 probability
Based on the other solutions: you generate the cumulative distribution (as integers or floats, whichever you like), then you can use bisect to make the lookup fast.
This is a simple example (I used integers here):
import bisect
import random

l=[(20, 'foo'), (60, 'banana'), (10, 'monkey'), (10, 'monkey2')]

def get_cdf(l):
    ret=[]
    c=0
    for i in l:
        c+=i[0]
        ret.append((c, i[1]))
    return ret

def get_random_item(cdf):
    # randint is inclusive on both ends; starting at 1 gives each item exactly its weight in outcomes
    return cdf[bisect.bisect_left(cdf, (random.randint(1, cdf[-1][0]),))][1]

cdf=get_cdf(l)
for i in range(100): print(get_random_item(cdf), end=' ')
The get_cdf function converts 20, 60, 10, 10 into 20, 20+60, 20+60+10, 20+60+10+10.
Now we pick a random number up to 20+60+10+10 using random.randint, then use bisect to get the actual value quickly.
you might want to have a look at NumPy Random sampling distributions
None of these answers is particularly clear or simple.
Here is a clear, simple method that is guaranteed to work.
accumulate_normalize_values takes a dictionary p that maps symbols to probabilities OR frequencies. It outputs a usable list of tuples from which to do selection.
def accumulate_normalize_values(p):
    pi = p.items() if isinstance(p,dict) else p
    accum_pi = []
    accum = 0
    for i in pi:
        accum_pi.append((i[0],i[1]+accum))
        accum += i[1]
    if accum == 0:
        raise Exception( "You are about to explode the universe. Continue ? Y/N " )
    normed_a = []
    for a in accum_pi:
        normed_a.append((a[0],a[1]*1.0/accum))
    return normed_a
Yields:
>>> accumulate_normalize_values( { 'a': 100, 'b' : 300, 'c' : 400, 'd' : 200 } )
[('a', 0.1), ('c', 0.5), ('b', 0.8), ('d', 1.0)]
Why it works
The accumulation step turns each symbol into an interval between itself and the previous symbol's probability or frequency (or 0 in the case of the first symbol). These intervals can be used to select from (and thus sample the provided distribution) by simply stepping through the list until the random number in the interval 0.0 -> 1.0 (prepared earlier) is less than or equal to the current symbol's interval end-point.
The normalization releases us from the need to make sure everything sums to some value. After normalization the "vector" of probabilities sums to 1.0.
The rest of the code, for selecting and for generating an arbitrarily long sample from the distribution, is below:
def select(symbol_intervals,random):
    print(symbol_intervals,random)
    i = 0
    while random > symbol_intervals[i][1]:
        i += 1
        if i >= len(symbol_intervals):
            raise Exception( "What did you DO to that poor list?" )
    return symbol_intervals[i][0]

def gen_random(alphabet,length,probabilities=None):
    from random import random
    from itertools import repeat
    if probabilities is None:
        probabilities = dict(zip(alphabet,repeat(1.0)))
    elif len(probabilities) > 0 and isinstance(probabilities[0],(int,float)):
        probabilities = dict(zip(alphabet,probabilities)) #ordered
    usable_probabilities = accumulate_normalize_values(probabilities)
    gen = []
    while len(gen) < length:
        gen.append(select(usable_probabilities,random()))
    return gen
Usage :
>>> gen_random (['a','b','c','d'],10,[100,300,400,200])
['d', 'b', 'b', 'a', 'c', 'c', 'b', 'c', 'c', 'c'] #<--- some of the time
Here is a more effective way of doing this:
Just call the following function with your 'weights' array (assuming the indices as the corresponding items) and the no. of samples needed. This function can be easily modified to handle ordered pair.
Returns indexes (or items) sampled/picked (with replacement) using their respective probabilities:
import random

def resample(weights, n):
    num_items = len(weights) # number of items, which may differ from the number of samples n
    beta = 0
    # Caveat: Assign max weight to max*2 for best results
    max_w = max(weights)*2
    # Pick an item uniformly at random, to start with
    current_item = random.randint(0,num_items-1)
    result = []
    for i in range(n):
        beta += random.uniform(0,max_w)
        while weights[current_item] < beta:
            beta -= weights[current_item]
            current_item = (current_item + 1) % num_items # cyclic
        result.append(current_item)
    return result
A short note on the concept used in the while loop.
We reduce the current item's weight from cumulative beta, which is a cumulative value constructed uniformly at random, and increment current index in order to find the item, the weight of which matches the value of beta.
I have three lists, each one with several possible values.
probs = ([0.1,0.1,0.2], \
[0.7,0.9], \
[0.5,0.4,0.1])
I want to test all possible combinations of choosing one element from each list. So, 3*2*3=18 possible combinations in this example. In the end, I want to choose the most favourable combinations according to some criteria. This is:
[<index in row 0> , <index in row 1> , <index in row 2> , <criteria value>]
I can accomplish my task by using three nested for loops (which I did). However, in the real application of this code, I will have a variable number of lists. Because of that, it seems the solution would be using a recursive function with a for loop inside it (which I did as well). The code:
# three rows. Test all combinations of one element from each row
# This is [value form row0, value from row1, value from row2]
# So: 3*2*3 = 18 possible combinations
probs = ([0.1,0.1,0.2], \
[0.7,0.9], \
[0.5,0.4,0.1])
meu = [] # The list that will store the best combinations in the recursion
#######################################################
def main():
    choice = [] #the list that will store the best comb in the nested for
    # accomplish by nested for loops
    for n0 in range(len(probs[0])):
        for n1 in range(len(probs[1])):
            for n2 in range(len(probs[2])):
                w = probs[0][n0] * probs[1][n1] * probs[2][n2]
                cmb = [n0,n1,n2,w]
                if len(choice) == 0:
                    choice.append(cmb)
                elif len(choice) < 5:
                    for i in range(len(choice)+1):
                        if i == len(choice):
                            choice.append(cmb)
                            break
                        if w < choice[i][3]:
                            choice.insert(i,cmb)
                            break
                else:
                    for i in range(len(choice)):
                        if w < choice[i][3]:
                            choice.insert(i,cmb)
                            del choice[-1]
                            break
    # using recursive function
    combinations(0,[])
    #both results
    print('By loops:')
    print(choice)
    print('By recursion:')
    print(meu)
#######################################################
def combinations(step,cmb):
    # Why does 'meu' needs to be global
    if step < len(probs):
        for i in range(len(probs[step])):
            cmb = cmb[0:step] # I guess this is the same problem I dont understand recursion
            # But, unlike 'meu', here I could use this workaround
            cmb.append(i)
            combinations(step+1,cmb)
    else:
        w = 1
        for n in range(len(cmb)):
            w *= probs[n][cmb[n]]
        cmb.append(w)
        if len(meu) == 0:
            meu.append(cmb)
        elif len(meu) < 5:
            for i in range(len(meu)+1):
                if i == len(meu):
                    meu.append(cmb)
                    break
                if w < meu[i][-1]:
                    meu.insert(i,cmb)
                    break
        else:
            for i in range(len(meu)):
                if w < meu[i][-1]:
                    meu.insert(i,cmb)
                    del meu[-1]
                    break
    return
######################################################
main()
It outputs, as I wanted:
By loops:
[[0, 0, 2, 0.006999999999999999], [1, 0, 2, 0.006999999999999999], [0, 1, 2, 0.009000000000000001], [1, 1, 2, 0.009000000000000001], [2, 0, 2, 0.013999999999999999]]
By recursion:
[[0, 0, 2, 0.006999999999999999], [1, 0, 2, 0.006999999999999999], [0, 1, 2, 0.009000000000000001], [1, 1, 2, 0.009000000000000001], [2, 0, 2, 0.013999999999999999]]
Initially, I wanted to use the 'meu' list as internal of the function, because, I thought, it would be better to avoid global variables (perhaps not... I'm a newbie). The problem was I could not come up with a code that would pass both 'meu' and 'cmb' between depths to give the same effect of the nested loops.
How could I implement a recursive function with internal 'meu' instead of being a global list? What am I missing from recursion concept? Thanks.
++++++++++++++++++++++++++++++++++
Example of a failed function:
def combinations(choice,step,cmb):
    if step < len(probs):
        for i in range(len(probs[step])):
            cmb = cmb[0:step] #workaround for cmb
            cmb.append(i)
            choice = combinations(choice,step+1,cmb)
    else:
        w = 1
        for n in range(len(cmb)):
            w *= probs[n][cmb[n]]
        cmb.append(w)
        if len(choice) == 0:
            choice.append(cmb)
        elif len(choice) < 5:
            for i in range(len(choice)+1):
                if i == len(choice):
                    choice.append(cmb)
                    break
                if w < choice[i][-1]:
                    choice.insert(i,cmb)
                    break
        else:
            for i in range(len(choice)):
                if w < choice[i][-1]:
                    choice.insert(i,cmb)
                    del choice[-1]
                    break
    return choice
Called by:
choice = combinations([],0,[])
Don't reinvent the wheel (recursively or not): use the included batteries. The problem you are trying to solve is extremely common and so a solution is included in Python's standard library.
What you want—every combination of every value from some number of lists—is called the Cartesian product of those lists. itertools.product exists to generate those for you.
import itertools

probs = ([0.1, 0.1, 0.2],
         [0.7, 0.9],
         [0.5, 0.4, 0.1])

for prob in itertools.product(*probs):
    print(prob)
    # prob is a tuple containing one combination of the variables
    # from each of the input lists, do with it what you will
If you want to know what index each item comes from, the easiest way is to just pass the indices to product() rather than the values. You can easily get that using range().
for indices in itertools.product(*(range(len(p)) for p in probs)):
    # get the values corresponding to the indices
    prob = [probs[x][indices[x]] for x in range(len(probs))]
    print(indices, prob)
Or you could use enumerate() -- this way, each item in the product is a tuple containing its index and its values (not two separate lists the way you get them in the above method):
for item in itertools.product(*(enumerate(p) for p in probs)):
    print(item)
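To get all the way to the question's result (the 5 combinations with the smallest product, each as [indices..., product]) without a global list or recursion, the index product can be fed into heapq.nsmallest. This is a sketch under that reading of the question; best_combinations is a hypothetical name, and math.prod needs Python 3.8 or later.

```python
import heapq
import itertools
import math


def best_combinations(probs, k=5):
    """Return the k index combinations with the smallest product, as [i0, i1, ..., w]."""
    index_ranges = [range(len(row)) for row in probs]

    def scored():
        for indices in itertools.product(*index_ranges):
            w = math.prod(row[i] for row, i in zip(probs, indices))
            yield list(indices) + [w]

    # nsmallest keeps only k candidates in memory and returns them in ascending order
    return heapq.nsmallest(k, scored(), key=lambda c: c[-1])


probs = ([0.1, 0.1, 0.2],
         [0.7, 0.9],
         [0.5, 0.4, 0.1])
best = best_combinations(probs)
for row in best:
    print(row)
```

This works for any number of rows, since nothing in it depends on there being exactly three lists.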
I'd like to create a function that takes a (sorted) list as its argument and outputs a list containing each element's corresponding percentile.
For example, fn([1,2,3,4,17]) returns [0.0, 0.25, 0.50, 0.75, 1.00].
Can anyone please either:
Help me correct my code below? OR
Offer a better alternative than my code for mapping values in a list to their corresponding percentiles?
My current code:
def median(mylist):
length = len(mylist)
if not length % 2:
return (mylist[length / 2] + mylist[length / 2 - 1]) / 2.0
return mylist[length / 2]
###############################################################################
# PERCENTILE FUNCTION
###############################################################################
def percentile(x):
    """
    Find the corresponding percentile of each value relative to a list of values.
    where x is the list of values
    Input list should already be sorted!
    """
    # sort the input list
    # list_sorted = x.sort()
    # count the number of elements in the list
    list_elementCount = len(x)
    #obtain set of values from list
    listFromSetFromList = list(set(x))
    # count the number of unique elements in the list
    list_uniqueElementCount = len(set(x))
    # define extreme quantiles
    percentileZero = min(x)
    percentileHundred = max(x)
    # define median quantile
    mdn = median(x)
    # create empty list to hold percentiles
    x_percentile = [0.00] * list_elementCount
    # initialize unique count
    uCount = 0
    for i in range(list_elementCount):
        if x[i] == percentileZero:
            x_percentile[i] = 0.00
        elif x[i] == percentileHundred:
            x_percentile[i] = 1.00
        elif x[i] == mdn:
            x_percentile[i] = 0.50
        else:
            subList_elementCount = 0
            for j in range(i):
                if x[j] < x[i]:
                    subList_elementCount = subList_elementCount + 1
            x_percentile[i] = float(subList_elementCount / list_elementCount)
            #x_percentile[i] = float(len(x[x > listFromSetFromList[uCount]]) / list_elementCount)
        if i == 0:
            continue
        else:
            if x[i] == x[i-1]:
                continue
            else:
                uCount = uCount + 1
    return x_percentile
Currently, if I submit percentile([1,2,3,4,17]), the list [0.0, 0.0, 0.5, 0.0, 1.0] is returned.
I think your example input/output does not correspond to typical ways of calculating percentile. If you calculate the percentile as "proportion of data points strictly less than this value", then the top value should be 0.8 (since 4 of 5 values are less than the largest one). If you calculate it as "proportion of data points less than or equal to this value", then the bottom value should be 0.2 (since 1 of 5 values equals the smallest one). Thus the percentiles would be [0, 0.2, 0.4, 0.6, 0.8] or [0.2, 0.4, 0.6, 0.8, 1]. Your definition seems to be "the number of data points strictly less than this value, considered as a proportion of the number of data points not equal to this value", but in my experience this is not a common definition (see for instance Wikipedia).
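The three definitions just described can be compared side by side with a minimal pure-Python sketch (Python 3 assumed, so / is true division). The variable names strict, weak and implied are illustrative labels, and the implied definition assumes the data contains at least two distinct values.

```python
data = [1, 2, 3, 4, 17]
n = len(data)

# Proportion of data points strictly less than each value.
strict = [sum(other < v for other in data) / n for v in data]

# Proportion of data points less than or equal to each value.
weak = [sum(other <= v for other in data) / n for v in data]

# The definition implied by the question: points strictly below,
# divided by the number of points not equal to this value.
implied = [
    sum(other < v for other in data) / sum(other != v for other in data)
    for v in data
]

print(strict)   # [0.0, 0.2, 0.4, 0.6, 0.8]
print(weak)     # [0.2, 0.4, 0.6, 0.8, 1.0]
print(implied)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

This is quadratic in the length of the list, so it is meant only to make the definitions concrete, not as an efficient implementation.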
With the typical percentile definitions, the percentile of a data point is equal to its rank divided by the number of data points. (See for instance this question on Stats SE asking how to do the same thing in R.) Differences in how to compute the percentile amount to differences in how to compute the rank (for instance, how to rank tied values). The scipy.stats.percentileofscore function provides four ways of computing percentiles:
>>> from scipy import stats
>>> x = [1, 1, 2, 2, 17]
>>> [stats.percentileofscore(x, a, 'rank') for a in x]
[30.0, 30.0, 70.0, 70.0, 100.0]
>>> [stats.percentileofscore(x, a, 'weak') for a in x]
[40.0, 40.0, 80.0, 80.0, 100.0]
>>> [stats.percentileofscore(x, a, 'strict') for a in x]
[0.0, 0.0, 40.0, 40.0, 80.0]
>>> [stats.percentileofscore(x, a, 'mean') for a in x]
[20.0, 20.0, 60.0, 60.0, 90.0]
(I used a dataset containing ties to illustrate what happens in such cases.)
The "rank" method assigns tied groups a rank equal to the average of the ranks they would cover (i.e., a three-way tie for 2nd place gets a rank of 3 because it "takes up" ranks 2, 3 and 4). The "weak" method assigns a percentile based on the proportion of data points less than or equal to a given point; "strict" is the same but counts proportion of points strictly less than the given point. The "mean" method is the average of the latter two.
As Kevin H. Lin noted, calling percentileofscore in a loop is inefficient since it has to recompute the ranks on every pass. However, these percentile calculations can be easily replicated using different ranking methods provided by scipy.stats.rankdata, letting you calculate all the percentiles at once:
>>> from scipy import stats
>>> stats.rankdata(x, "average")/len(x)
array([ 0.3, 0.3, 0.7, 0.7, 1. ])
>>> stats.rankdata(x, 'max')/len(x)
array([ 0.4, 0.4, 0.8, 0.8, 1. ])
>>> (stats.rankdata(x, 'min')-1)/len(x)
array([ 0. , 0. , 0.4, 0.4, 0.8])
In the last case the ranks are adjusted down by one to make them start from 0 instead of 1. (I've omitted "mean", but it could easily be obtained by averaging the results of the latter two methods.)
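If scipy is not available, the same "all at once" idea can be approximated in pure Python. The sketch below is an assumption-laden illustration rather than a replacement for rankdata: it sorts once, then uses the standard library's bisect module to count the points below and at-or-below each value, and it also produces the "mean" variant by averaging those two counts. The function name all_percentiles is hypothetical.

```python
from bisect import bisect_left, bisect_right

def all_percentiles(values):
    """Return (strict, weak, mean) percentile lists as fractions in [0, 1]."""
    ordered = sorted(values)
    n = len(ordered)
    # bisect_left counts points strictly below v; bisect_right counts points <= v.
    left = [bisect_left(ordered, v) for v in values]
    right = [bisect_right(ordered, v) for v in values]
    strict = [l / n for l in left]
    weak = [r / n for r in right]
    # "mean" is the average of the strict and weak proportions.
    mean = [(l + r) / (2 * n) for l, r in zip(left, right)]
    return strict, weak, mean

strict, weak, mean = all_percentiles([1, 1, 2, 2, 17])
print(strict)  # [0.0, 0.0, 0.4, 0.4, 0.8]
print(weak)    # [0.4, 0.4, 0.8, 0.8, 1.0]
print(mean)    # [0.2, 0.2, 0.6, 0.6, 0.9]
```

The results are fractions rather than percentages, to match the rankdata examples above; multiply by 100 to compare with percentileofscore.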
I did some timings. With small data such as that in your example, using rankdata is somewhat slower than Kevin H. Lin's solution (presumably due to the overhead scipy incurs in converting things to numpy arrays under the hood) but faster than calling percentileofscore in a loop as in reptilicus's answer:
In [11]: %timeit [stats.percentileofscore(x, i) for i in x]
1000 loops, best of 3: 414 µs per loop
In [12]: %timeit list_to_percentiles(x)
100000 loops, best of 3: 11.1 µs per loop
In [13]: %timeit stats.rankdata(x, "average")/len(x)
10000 loops, best of 3: 39.3 µs per loop
With a large dataset, however, the performance advantage of numpy takes effect and using rankdata is 10 times faster than Kevin's list_to_percentiles:
In [18]: x = np.random.randint(0, 10000, 1000)
In [19]: %timeit [stats.percentileofscore(x, i) for i in x]
1 loops, best of 3: 437 ms per loop
In [20]: %timeit list_to_percentiles(x)
100 loops, best of 3: 1.08 ms per loop
In [21]: %timeit stats.rankdata(x, "average")/len(x)
10000 loops, best of 3: 102 µs per loop
This advantage will only become more pronounced on larger and larger datasets.
I think you want scipy.stats.percentileofscore
Example:
>>> from scipy.stats import percentileofscore
>>> percentileofscore([1, 2, 3, 4], 3)
75.0
>>> percentiles = [percentileofscore(data, i) for i in data]  # data is your list
In terms of complexity, I think reptilicus's answer is not optimal. It takes O(n^2) time.
Here is a solution that takes O(n log n) time.
def list_to_percentiles(numbers):
    # pair each value with its original index, then sort the pairs by value
    pairs = sorted(zip(numbers, range(len(numbers))), key=lambda p: p[0])
    result = [0] * len(numbers)
    for rank, (_, original_index) in enumerate(pairs):
        # rank 0 maps to 0.0 and the last rank maps to 100.0 (needs len(numbers) > 1)
        result[original_index] = rank * 100.0 / (len(numbers) - 1)
    return result
I'm not sure, but I think this is the optimal time complexity you can get. The rough reason I think it's optimal is because the information of all of the percentiles is essentially equivalent to the information of the sorted list, and you can't get better than O(n log n) for sorting.
EDIT: Depending on your definition of "percentile" this may not always give the correct result. See BrenBarn's answer for more explanation and for a better solution which makes use of scipy/numpy.
Pure numpy version of Kevin's solution
As Kevin said, the optimal solution runs in O(n log n) time. Here is a fast numpy version of his code, which takes almost the same time as stats.rankdata:
percentiles = numpy.argsort(numpy.argsort(array)) * 100. / (len(array) - 1)
PS. This is one of my favourite tricks in numpy.
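For readers wondering why sorting the indices twice yields ranks, here is a minimal pure-Python sketch of the same idea. The helper argsort is a hypothetical stand-in for numpy.argsort, written with sorted() so the example needs no third-party imports.

```python
def argsort(seq):
    # Indices that would sort seq (stable, because sorted() is stable).
    return sorted(range(len(seq)), key=seq.__getitem__)

data = [23, 4, 1, 43, 6]
order = argsort(data)   # order[k] is the index of the k-th smallest value
ranks = argsort(order)  # ranks[i] is the 0-based rank of data[i]
percentiles = [r * 100.0 / (len(data) - 1) for r in ranks]

print(order)        # [2, 1, 4, 0, 3]
print(ranks)        # [3, 1, 0, 4, 2]
print(percentiles)  # [75.0, 25.0, 0.0, 100.0, 50.0]
```

The second argsort inverts the permutation produced by the first one, which is exactly the rank of each element. Note that tied values receive distinct consecutive ranks with this approach, so it matches the other definitions only when all values are different.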
This might look oversimplified, but what about this:
def percentile(x):
    pc = float(1)/(len(x)-1)
    return ["%.2f"%(n*pc) for n, i in enumerate(x)]
EDIT:
def percentile(x):
    # sort the unique values: set iteration order is not guaranteed to be ascending
    unique = sorted(set(x))
    mapping = {}
    pc = float(1)/(len(unique)-1)
    for n, i in enumerate(unique):
        mapping[i] = "%.2f"%(n*pc)
    return [mapping.get(el) for el in x]
I tried scipy's percentileofscore, but it turned out to be very slow for one of my tasks, so I simply implemented it this way. It can be modified if a weak ranking is needed.
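import numpy as np
def assign_pct(X):
    mp = {}
    X_tmp = np.sort(X)
    cnt = 0
    # give each distinct value its 0-based rank among the distinct values
    for v in X_tmp:
        if v not in mp:
            mp[v] = cnt
            cnt += 1
    # divide by cnt - 1 so the largest value maps to 1.0
    # (assumes Python 3 division and at least two distinct values)
    return [mp[v] / (cnt - 1) for v in X]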
Calling the function
assign_pct([23,4,1,43,1,6])
Output of function
[0.75, 0.25, 0.0, 1.0, 0.0, 0.5]
If I understand you correctly, you want to find the percentile each element represents in the array, i.e. how much of the array comes before that element. For example, [1, 2, 3, 4, 5]
should give [0.0, 0.25, 0.5, 0.75, 1.0]
I believe the following code is enough:
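def percentileListEdited(List):
    # sort the unique values so each one's position is its rank
    uniqueList = sorted(set(List))
    increase = 1.0/(len(uniqueList)-1)
    newList = {}
    for index, value in enumerate(uniqueList):
        # key by value (not by index) so the lookup in the return line works
        newList[value] = 0.0 + increase * index
    return [newList[val] for val in List]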
For me the best solution is to use QuantileTransformer in sklearn.preprocessing.
import numpy as np
from sklearn.preprocessing import QuantileTransformer
fn = lambda input_list : QuantileTransformer(n_quantiles=100).fit_transform(np.array(input_list).reshape([-1,1])).ravel().tolist()
input_raw = [1, 2, 3, 4, 17]
output_perc = fn( input_raw )
print("Input=", input_raw)
print("Output=", np.round(output_perc,2))
Here is the output
Input= [1, 2, 3, 4, 17]
Output= [ 0. 0.25 0.5 0.75 1. ]
Note: this approach has two salient features:
the raw input data does not need to be sorted.
the raw input data does not need to be a single column.
This version also lets you pass the exact percentile values used for ranking:
import numpy as np
def what_pctl_number_of(x, a, pctls=np.arange(1, 101)):
    return np.argmax(np.sign(np.append(np.percentile(x, pctls), np.inf) - a))
So you can find out which of the provided percentiles a value falls under:
_x = np.random.randn(100, 1)
what_pctl_number_of(_x, 1.6, [25, 50, 75, 100])
Output:
3
So the value falls in the 75 to 100 percentile range.
For a pure Python function that calculates the percentile score of a given item relative to a population distribution (a list of scores), I pulled this from the scipy source code and removed all references to numpy:
def percentileofscore(a, score, kind='rank'):
    n = len(a)
    if n == 0:
        return 100.0
    left = len([item for item in a if item < score])
    right = len([item for item in a if item <= score])
    if kind == 'rank':
        pct = (right + left + (1 if right > left else 0)) * 50.0/n
        return pct
    elif kind == 'strict':
        return left / n * 100
    elif kind == 'weak':
        return right / n * 100
    elif kind == 'mean':
        pct = (left + right) / n * 50
        return pct
    else:
        raise ValueError("kind can only be 'rank', 'strict', 'weak' or 'mean'")
source: https://github.com/scipy/scipy/blob/v1.2.1/scipy/stats/stats.py#L1744-L1835
Calculating percentiles is trickier than one would think, but it does not need the full scipy/numpy/scikit stack, so this version is the best fit for lightweight deployment. The original code filters for only nonzero values better, but otherwise the math is the same. The optional kind parameter controls how values equal to the given score (ties) are counted.
For this use case, one can call this function for each item in a list using the map() function.
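As a rough illustration of the map() approach, the sketch below restates the function above in condensed form (so the example runs on its own) and applies it to every item of a list. Using functools.partial to fix the population and the kind argument is just one convenient option, and the names scores, rank_pct and strict_pct are illustrative. Python 3 division is assumed.

```python
from functools import partial

def percentileofscore(a, score, kind='rank'):
    # Condensed restatement of the pure-Python function shown above.
    n = len(a)
    if n == 0:
        return 100.0
    left = sum(item < score for item in a)    # points strictly below score
    right = sum(item <= score for item in a)  # points at or below score
    if kind == 'rank':
        return (right + left + (1 if right > left else 0)) * 50.0 / n
    if kind == 'strict':
        return left / n * 100
    if kind == 'weak':
        return right / n * 100
    if kind == 'mean':
        return (left + right) / n * 50
    raise ValueError("kind can only be 'rank', 'strict', 'weak' or 'mean'")

scores = [1, 2, 3, 4, 17]

# map() calls the function once per item; the lambda supplies the population.
rank_pct = list(map(lambda s: percentileofscore(scores, s), scores))

# partial() fixes the population and the kind, leaving only the score to map over.
strict_pct = list(map(partial(percentileofscore, scores, kind='strict'), scores))

print(rank_pct)  # [20.0, 40.0, 60.0, 80.0, 100.0]
```

Each call scans the whole list, so mapping over every item is quadratic overall; that is acceptable for short lists but slow for large ones.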