Finding boundary with given number and a list - python

I was given this problem: given a list percentage = [0.1, 0.1, 0.8] and number = 9, find every possible list of values (each element ranges from 0.25 to 10 in increments of 0.25) such that multiplying it element-wise with the percentage list, summing the products, and rounding to 1 decimal place gives number = 9. I solved it by brute force with itertools.product, but that is pretty slow. I'm trying to find tighter bounds (the lower and upper bound in range(lower, upper, 25)) for each of my for loops. Can you suggest a way to find them?
import itertools
ranges = []
n = int(input()) #number of element in percentage list
percent = []
for i in range(n):
    percent.append(float(input())) #input the percentage list
total = float(input()) #the number mentioned above
for i in range(n):
    ranges.append(range(25,1025,25)) #find boundary for this line
for xs in itertools.product(*ranges):
    avg = 0
    for i in range(n):
        avg += xs[i]*percent[i]
    if avg < (total*100+5) and avg >= (total*100-5):
        for each in xs:
            print(each/100, end = ' ')
        print()

It's a little hard for me to explain the algorithm concisely, so the detailed explanation is in the code comments below.
The basic idea is to do this recursively (DFS, depth-first search). The function should be something like recursion(percent_list, result_list, target).
Initially, it is called as recursion([0.1, 0.1, 0.8], [], 9).
If we try 3.25 as the first value, we update the target to 9 - 3.25*0.1 = 8.675, so the next call is recursion([0.1, 0.8], [3.25], 8.675).
Then, if we try 4.00 as the second value, we update the target to 8.675 - 4.0*0.1 = 8.275, so we call recursion([0.8], [3.25, 4.0], 8.275).
Finally, we try the third value, and only 9.75 and 10 are valid, since the totals are 8.525 and 8.725, respectively, and both round to 9. So we append [3.25, 4.0, 9.75] and [3.25, 4.0, 10.0] to the result list.
After that, we try the other second values 0.25, ..., 3.75, 4.25, 4.5, ..., 10.
Then we try the other first values 0.25, ..., 3.0, 3.5, 3.75, ..., 10.
To avoid too many recursive calls, at every step we compute the range of values that can still lead to a valid result, and cut the branches that are impossible.
The actual function signature is somewhat different, in order to handle the rounding.
import numpy as np
def recursion(percent_list, value_list, previous_results, target, tolerate_lower, tolerate_upper, result_list):
    # percent_list: remaining percentages, initially [0.1, 0.1, 0.8]
    # value_list: sorted candidate values, here 0.25 ~ 10
    # target: remaining target, initially 9
    # tolerate_lower / tolerate_upper: accepted deviation below / above the target
    # (0.5 and 0.49999 here, i.e. 8.5 <= sum < 9.5)
    # every percentage is consumed: the pruning below guarantees the sum is within tolerance
    if len(percent_list) == 0:
        # print(previous_results)
        result_list.append(previous_results)
        return
    # otherwise, cut impossible branches, check minimum and maximum value acceptable for current percent_list
    percent_sum = percent_list.sum() # sum up current percent list, **O(n)**, should be optimized by pre-generating a sum list
    value_min = value_list[0] # minimum value from data list (this problem 0.25)
    value_max = value_list[-1] # maximum value from data list (this problem 10.0)
    search_min = (target - tolerate_lower - (percent_sum - percent_list[0]) * value_max) / percent_list[0] # minimum value acceptable as result
    search_max = (target + tolerate_upper - (percent_sum - percent_list[0]) * value_min) / percent_list[0] # maximum value acceptable as result
    idx_min = np.searchsorted(value_list, search_min, "left") # index of minimum value (for data list)
    idx_max = np.searchsorted(value_list, search_max, "right") # index of maximum value (for data list)
    # recursion step
    for i in range(idx_min, idx_max):
        # update result list
        current_results = previous_results + [value_list[i]]
        # remove the current state for variables `percent_list`, and update `target` for next step
        recursion(percent_list[1:], value_list, current_results, target - percent_list[0] * value_list[i], tolerate_lower, tolerate_upper, result_list)
To solve this current problem,
result = []
recursion(np.array([0.1, 0.1, 0.8]), np.arange(0.25, 10.25, 0.25), [], 9, 0.5, 0.49999, result)
There are 4806 possible results in total. To validate that every result sums to about 9 (this cannot verify that no valid result is missing),
for l in result:
    if not (8.5 <= (np.array([0.1, 0.1, 0.8]) * np.array(l)).sum() < 9.5):
        print("Wrong code!")
I think the worst-case complexity is still O(m^n * n), where m is the length of the data list (0.25, 0.5, ..., 10) and n is the length of the percent list (0.1, 0.1, 0.8). It could be reduced to O(m^n * log(m)) by not summing the percent list on every recursive call, and to O(m^n) by fully exploiting the fact that the data list is an arithmetic sequence.
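The two optimizations mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the answer's own code: the name bounded_combos is hypothetical, the percentages are scaled to integer tenths ([1, 1, 8]) and the values to quarter steps (k = 1..40 stands for 0.25..10), so every sum is scaled by 40 and the condition 8.5 <= sum < 9.5 becomes 340 <= scaled sum < 380. Suffix sums are computed once, and the valid range of each slot comes from integer division instead of a binary search.

```python
def bounded_combos(weights, k_min, k_max, lo, hi):
    """Yield every tuple k with k_min <= k[i] <= k_max and
    lo <= sum(weights[i] * k[i]) < hi. All arguments are integers and
    every weight must be positive."""
    n = len(weights)
    # suffix[i] = weights[i] + ... + weights[n-1], computed once
    suffix = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + weights[i]

    def dfs(i, partial, chosen):
        if i == n:
            yield tuple(chosen)
            return
        w = weights[i]
        rest_min = suffix[i + 1] * k_min  # smallest sum the later slots can add
        rest_max = suffix[i + 1] * k_max  # largest sum the later slots can add
        # ceil((lo - partial - rest_max) / w) without floats
        low_k = max(k_min, -((partial + rest_max - lo) // w))
        # floor((hi - 1 - partial - rest_min) / w)
        high_k = min(k_max, (hi - 1 - partial - rest_min) // w)
        for k in range(low_k, high_k + 1):
            chosen.append(k)
            yield from dfs(i + 1, partial + w * k, chosen)
            chosen.pop()

    yield from dfs(0, 0, [])


# percentages in tenths, values in quarter steps (1..40), so sums are scaled by 40
combos = list(bounded_combos([1, 1, 8], 1, 40, 340, 380))
# convert the quarter-step indices back to the real values
results = [tuple(k / 4 for k in combo) for combo in combos]
```

Working in integers also sidesteps the floating-point tolerance (0.49999) used above, at the cost of requiring that the percentages and the step can be scaled to whole numbers.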

Related

Generate List of random Floats in a sequence for Mocking IoT Sensor

I have been struggling to mock the IoT Sensor data. I need a list of floats which will increase and decrease sequentially.
For example [0.1, 0.12, 0.13, 0.18, 1.0, 1.2, 1.0, 0.9, 0.6]
Right now I have generated the list with max and min range using this,
for k in dts:
    x = round(random.uniform(j["min"], j["max"]), 3)
    random_float_list.append(x)
The list generated from this code is not a smooth sequence. I need something that generates random floats within a range without abrupt changes between neighbours. The values may both increase and decrease along the sequence.
You can generate multiple random sequences and glue them together. Something like this:
import numpy as np
def gen_floats(count, min_step_size, max_step_size, max_seq_len):
    # Start around 0
    res = [np.round(np.random.rand() - 0.5, 2)]
    while len(res) < count:
        step_size = np.random.uniform(min_step_size, max_step_size)
        # Generate random number of steps for sequence
        remaining = count - len(res)
        steps = np.random.randint(1, remaining + 1 if remaining < max_seq_len else max_seq_len)
        # Generate additive or subtractive sequence using previous values
        if np.random.rand() > 0.5:
            vals = np.round(np.linspace(res[-1] + step_size, res[-1] + steps * step_size, steps), 2)
        else:
            vals = np.round(np.linspace(res[-1] + step_size, res[-1] - steps * step_size, steps), 2)
        res.extend(vals)
    return res
Then print(gen_floats(20, 0.1, 0.5, 10)) generates something like: [0.4, 0.86, 0.25, -0.37, -0.99, -1.61, -2.23, -2.85, -2.64, -2.95, -3.26, -3.57, -3.88, -3.63, -3.38, -3.19, -2.89, -2.63, -3.15, -3.68]. You can play with params to match desired output.
Something like this should work if you want a random where you can control the min, max and max difference between the values.
It will first random a value between start and end and append it to the list output. The next value will be a random value between the last value in the output list +-max_diff.
import random
def rand(start,end,max_diff,elements,output):
    elements -= 1
    if output:
        if output[-1]-max_diff < start: #To not get a value smaller than start
            output.append(round(random.uniform(start,output[-1]+max_diff),3))
        elif output[-1]+max_diff > end: #To not get a value bigger than end
            output.append(round(random.uniform(output[-1]-max_diff,end),3))
        else:
            output.append(round(random.uniform(output[-1]-max_diff,output[-1]+max_diff),3))
    else:
        output.append(round(random.uniform(start,end),3))
    if elements > 0:
        output = rand(start,end,max_diff,elements,output)
    return output
print(rand(1,2,0.1,3,[])) #[1.381, 1.375, 1.373]
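For long sequences the recursive version will eventually hit Python's recursion limit, so the same idea may be easier to use as a loop. The following is a minimal sketch, not the answer's own code: the name bounded_walk and the seed parameter are hypothetical, and each new value is drawn from the previous value +-max_diff, clamped to [start, end].

```python
import random


def bounded_walk(start, end, max_diff, count, seed=None):
    """Return `count` floats in [start, end] where neighbours differ
    by at most max_diff (before rounding to 3 decimals)."""
    if count <= 0:
        return []
    rng = random.Random(seed)  # a private generator, so a seed makes runs repeatable
    values = [rng.uniform(start, end)]
    while len(values) < count:
        prev = values[-1]
        lo = max(start, prev - max_diff)  # never drop below start
        hi = min(end, prev + max_diff)    # never rise above end
        values.append(rng.uniform(lo, hi))
    return [round(v, 3) for v in values]


print(bounded_walk(1, 2, 0.1, 10, seed=0))
```

Because the rounding happens only at the end, two printed neighbours can differ by up to max_diff + 0.001.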
You can generate random numbers with a uniform distribution, and then sort the numbers into ascending order in the first part, and into descending order in the second part.
import numpy as np
np.random.seed(0)
def gen_rnd_sensor_data(low: float,
                        high: float,
                        n_incr: int,
                        n_decr: int) -> np.ndarray:
    incr = np.random.uniform(low=low, high=high, size=n_incr)
    incr.sort()
    decr = np.random.uniform(low=low, high=high, size=n_decr)
    decr[::-1].sort()
    return np.concatenate((incr, decr))
Then you can call this function with:
print(gen_rnd_sensor_data(0, 1, 5, 3))
This generates data within 0. and 1., the first 5 values are increasing, the last 3 are decreasing. Within the program, every time you call the function, you get different results, but if you rerun your program, you get the same results, so you can debug your program.

Iterate through ranges and return those not in any range?

I have a list of floats.
values = [2.3, 6.4, 11.3]
What I want to do is find a range from each value in the list of size delta = 2, then iterate through another range of floats and compare each float to each range, then return the floats that do not fall in any ranges.
What I have so far is,
not_in_range =[]
for x in values:
    pre = float(x - delta)
    post = float(x + delta)
    for y in numpy.arange(0,15,0.5):
        if (pre <= y <= post) == True:
            pass
        else:
            not_in_range.append(y)
But obviously, this does not work for several reasons: redundancy, does not check all ranges at once, etc. I am new to coding and I am struggling to think abstractly enough to solve this problem. Any help in formulating a plan of action would be greatly appreciated.
EDIT
For clarity, what I want is a list of ranges from each value (or maybe a numpy array?) as
[0.3, 4.3]
[4.4, 8.4]
[9.3, 13.3]
And to return any float from 0 - 15 in increments of 0.5 that do not fall in any of those ranges, so the final output would be:
not_in_ranges = [0, 8.5, 9, 13.5, 14, 14.5]
To generate the list of ranges, you could do a quick list comprehension:
ranges = [[x-2, x+2] for x in values]
## [[0.3, 4.3], [4.4, 8.4], [9.3, 13.3]]
Then, to return any float from 0 to 15 (in increments of 0.5) that don't fall in any of the ranges, you can use:
not_in_ranges = []
for y in numpy.arange(0, 15, 0.5): # for all desired values to check
    if not any(pre < y and y < post for pre, post in ranges):
        not_in_ranges.append(y) # if it is in none of the intervals, append it
## [0.0, 8.5, 9.0, 13.5, 14.0, 14.5]
Explanation: This loops through each of the possible values and appends it to the not_in_ranges list if it is not in any of the intervals. To check if it is in the intervals, I use the builtin python function any to check if there are any pre and post values in the list ranges that return True when pre < y < post (i.e. if y is in any of the intervals). If this is False, then it doesn't fit into any of the intervals and so is added to the list of such values.
Alternatively, if you only need the result (and not both of the lists), you can combine the two with something like:
not_in_ranges = []
for y in numpy.arange(0, 15, 0.5):
    if not any(x-2 < y and y < x+2 for x in values):
        not_in_ranges.append(y)
You could even use list comprehension again, giving the very pythonic looking:
not_in_ranges = [y for y in numpy.arange(0, 15, 0.5) if not any(x-2 < y and y < x+2 for x in values)]
Note that the last one is likely the fastest to run since the append call is quite slow and list comprehension is almost always faster. Though it certainly might not be the easiest to understand at a glance if you aren't already used to python list comprehension format.
I ran a comparison (in a Jupyter notebook). Here are the results.
# First cell
import numpy as np
values = np.random.randn(1000000)
values.shape
# Second cell
%%time
not_in_range =[]
for x in values:
    pre = float(x - 2)
    post = float(x + 2)
    for y in np.arange(0,15,0.5):
        if (pre <= y <= post) == True:
            pass
        else:
            not_in_range.append(y)
# Second cell output - Wall time: 37.2 s
# Third cell
%%time
pre = values - 2
post = values + 2
whole_range = np.arange(0,15,0.5)
whole_range
search_range = []
for pr, po in zip(pre, post):
    pr = (int(pr) + 0.5) if (pr%5) else int(pr)
    po = (int(po) + 0.5) if (po%5) else int(po)
    search_range += list(np.arange(pr, po, 0.5))
whole_range = set(whole_range)
search_range = set(search_range)
print(whole_range.difference(search_range))
# Third cell output - Wall time: 3.99 s
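When there are many values, another option is to sort the centres once and binary-search each candidate, since a candidate can only fall inside the range of one of its two nearest centres. The sketch below uses only the standard library; the function name not_in_any_range is hypothetical, and it uses the same strict comparison (pre < y < post) as the answer above. No timing is claimed for it.

```python
from bisect import bisect_left


def not_in_any_range(values, delta, candidates):
    """Return the candidates y for which no x in values has |y - x| < delta."""
    centers = sorted(values)
    result = []
    for y in candidates:
        i = bisect_left(centers, y)
        # the nearest centres are the ones just below and just above y
        nearest = centers[max(i - 1, 0):i + 1]
        if all(abs(y - c) >= delta for c in nearest):
            result.append(y)
    return result


# build 0, 0.5, ..., 14.5 from integers to avoid accumulated float error
candidates = [i * 0.5 for i in range(30)]
print(not_in_any_range([2.3, 6.4, 11.3], 2, candidates))
# [0.0, 8.5, 9.0, 13.5, 14.0, 14.5]
```

Each lookup costs O(log n) instead of a scan over all n ranges.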
You can use the interval library intvalpy
from intvalpy import Interval
import numpy as np
values = [2.3, 6.4, 11.3]
delta = 2
intervals = values + Interval(-delta, delta)
not_in_ranges = []
for k in np.arange(0, 15, 0.5):
    if not k in intervals:
        not_in_ranges.append(k)
print(not_in_ranges)
Intervals are created according to the constructive definitions of interval arithmetic operations.
The in operator checks whether a point (or an interval) is contained within another interval.

Python: get every possible combination of weights for a portfolio

I think this problem can be solved using either itertools or cartesian, but I'm fairly new to Python and am struggling to use these:
I have a portfolio of 5 stocks, where each stock can have a weighting of -0.4, -0.2, 0, 0.2 or 0.4, with weightings adding up to 0. How do I create a function that produces a list of every possible combination of weights. e.g. [-0.4, 0.2, 0, 0.2, 0]... etc
Ideally, the function would work for n stocks, as I will eventually want to do the same process for 50 stocks.
edit: To clarify, I'm looking for all combinations of length n (in this case 5), summing to 0. The values can repeat: e.g: [0.2, 0.2, -0.4, 0, 0], [ 0.4, 0, -0.2, -0.2, 0.4], [0,0,0,0.2,-0.2], [0, 0.4, -0.4, 0.2, -0.2] etc. So [0,0,0,0,0] would be a possible combination. The fact that there are 5 possible weightings and 5 stocks is a coincidence (which i should have avoided!), this same question could be with 5 possible weightings and 3 stocks or 7 stocks. Thanks.
Something like this, although it's not really efficient.
from decimal import Decimal
import itertools
# possible optimization: use integers rather than Decimal
weights = [Decimal("-0.4"), Decimal("-0.2"), Decimal(0), Decimal("0.2"), Decimal("0.4")]
def possible_weightings(n = 5, target = 0):
    for all_bar_one in itertools.product(weights, repeat = n - 1):
        final = target - sum(all_bar_one)
        if final in weights:
            yield all_bar_one + (final,)
I repeat from comments, you cannot do this for n = 50. The code yields the right values, but there isn't time in the universe to iterate over all the possible weightings.
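To get a feel for the size of the problem without enumerating anything, the number of valid weightings can be counted with a small dynamic program over partial sums. This is a sketch under assumptions, not part of the answer: the name count_weightings is hypothetical, and the weights are written as integers in units of 0.2 (so -2 stands for -0.4).

```python
from collections import Counter


def count_weightings(n, weights=(-2, -1, 0, 1, 2), target=0):
    """Count the length-n sequences of weights that sum to target."""
    sums = Counter({0: 1})  # partial sum -> number of ways to reach it
    for _ in range(n):
        nxt = Counter()
        for s, ways in sums.items():
            for w in weights:
                nxt[s + w] += ways
        sums = nxt
    return sums[target]


print(count_weightings(5))   # 381
print(count_weightings(50))  # far too many to ever list one by one
```

Counting takes a fraction of a second even for n = 50, while listing every weighting does not scale.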
This code isn't brilliant. It does some unnecessary work examining cases where, for example, the sum of all but the first two is already greater than 0.8 and so there's no point separately checking all the possibilities for the first of those two.
So, this does n = 5 in nearly no time, but there is some value of n where this code becomes infeasibly slow, and you could get further with better code. You still won't get to 50. I'm too lazy to write that better code, but basically instead of all_bar_one you can make recursive calls to possible_weightings with successively smaller values of n and a value of target equal to the target you were given, minus the sum you have so far. Then prune all the branches you don't need to take, by bailing out early in cases where target is too large (positive or negative) to be reached using only n values.
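The recursive version described above could look roughly like this. It is a sketch, not the answerer's code: the name pruned_weightings is hypothetical, and the weights are integers in units of 0.2 so that sums compare exactly. A branch is abandoned as soon as the remaining target can no longer be reached with the picks that are left.

```python
def pruned_weightings(weights, n, target=0):
    """Yield every length-n tuple of weights that sums to target."""
    lo, hi = min(weights), max(weights)

    def rec(k, remaining):
        if k == 0:
            if remaining == 0:
                yield ()
            return
        # bail out early: k picks can only reach sums in [k*lo, k*hi]
        if not (k * lo <= remaining <= k * hi):
            return
        for w in weights:
            for tail in rec(k - 1, remaining - w):
                yield (w,) + tail

    return rec(n, target)


for combo in pruned_weightings([-2, -1, 0, 1, 2], 3):
    print([w * 0.2 for w in combo])  # convert back to the real weights
```

The pruning saves work on dead branches, but the number of results still grows exponentially with n.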
I understand the values can repeat, but all have to sum to zero, therefore the solution might be:
>>> from itertools import permutations
>>> weights = [-0.4, -0.2, 0, 0.2, 0.4]
>>> result = (com for com in permutations(weights) if sum(com)==0)
>>> for i in result: print(i)
edit:
you might use product as @Steve Jassop suggested.
combi = (i for i in itertools.product(weights, repeat= len(weights)) if not sum(i))
for c in combi:
    print(c)
I like using the filter function:
from itertools import permutations
w = [-0.4, -0.2, 0, 0.2, 0.4]
def foo(w):
    perms = list(permutations(w))
    sum0 = filter(lambda x: sum(x)==0, perms)
    return sum0
print(list(foo(w)))  # list() because filter returns an iterator in Python 3
Different approach.
1 Figure out all sequences of the weights that add up to zero, in order.
for example, these are some possibilities (using whole numbers to type less):
[0, 0, 0, 0, 0]
[-4, 0, 0, +2, +2]
[-4, 0, 0, 0, +4]
[-4, +4, 0, 0, 0] is incorrect because weights are not picked in order.
2 Permute what you got above, because the permutations will add up to zero as well.
This is where you'd get your [-4, 0, 0, 0, +4] and [-4, +4, 0, 0, 0]
OK, being lazy. I am going to pseudo-code/comment-code a good deal of my solution. Not that strong at recursion, the stuff is too tricky to code quickly and I have doubts that this type of solution scales up to 50.
i.e. I don't think I am right, but it might give someone else an idea.
import itertools

def find_solution(weights, length, last_pick, target_sum):
    # returns a list of solutions, in growing order, of weights adding up to the target_sum
    # weights are the sequence of possible weights - IN ORDER, NO REPEATS
    # length is how many weights we are still adding up
    # last_pick - the weight picked by the caller
    # target_sum is what the remaining picks must add up to
    # since we are picking in order, having picked 0 "disqualifies" -4 and -2.
    weights = [w for w in weights if w >= last_pick]
    if length == 1:
        # only 1 item to pick left, so it has to be the target_sum
        return [[target_sum]] if target_sum in weights else []
    solutions = []
    for weight in weights:
        child_target_sum = target_sum - weight
        # basic idea, we are picking in growing order
        # if we start out picking +2 in a [-4,-2,0,+2,+4] list in order, then we are constrained to finding -2
        # with just 2 and 4 as choices. won't work, and neither will any larger pick.
        if child_target_sum < weight * (length - 1):
            break
        child_solutions = find_solution(weights, length - 1, weight, child_target_sum)
        solutions.extend([weight] + child for child in child_solutions)
    return solutions

# whole numbers, as in the examples above: -4 stands for -0.4, etc.
weights = sorted(set([-4, -2, 0, 2, 4]))
# those are not permutated yet
solutions = find_solution(weights, 5, weights[0], 0)
# a set, because permuting repeated weights produces duplicates
permutated = set()
for solution in solutions:
    permutated.update(itertools.permutations(solution))
If you just want a list of all the combinations, use itertools.combinations:
import itertools
w = [-0.4, -0.2, 0, 0.2, 0.4]
l = len(w)
if __name__ == '__main__':
    for i in range(1, l+1):
        for p in itertools.combinations(w, i):
            print(p)
If you want to count the different weights that can be created with these combinations, it's a bit more complicated.
First, you generate combinations with 1, 2, 3, ... elements. Then you take the sum of each one. Then you add the sum to the set (this does nothing if the number is already present, and is a very fast operation). Finally you convert the set to a list and sort it.
from itertools import combinations
def round_it(n, p):
    """rounds n, to have maximum p figures to the right of the comma"""
    return int((10**p)*n)/float(10**p)
w = [-0.4, -0.2, 0, 0.2, 0.4]
l = len(w)
res = set()
if __name__ == '__main__':
    for i in range(1, l+1):
        for p in combinations(w, i):
            res.add(round_it(sum(p), 10)) # rounding necessary to avoid artifacts
    print(sorted(list(res)))
Is this what you are looking for:
import itertools
L = [-0.4, 0.2, 0, 0.2, 0]
AllCombi = itertools.permutations(L)
for each in AllCombi:
    print(each)

How to generate an evenly-spaced series of rational numbers in Python? [duplicate]

Is there a range() equivalent for floats in Python?
>>> range(0.5,5,1.5)
[0, 1, 2, 3, 4]
>>> range(0.5,5,0.5)
Traceback (most recent call last):
File "<pyshell#10>", line 1, in <module>
range(0.5,5,0.5)
ValueError: range() step argument must not be zero
You can either use:
[x / 10.0 for x in range(5, 50, 15)]
or use lambda / map:
map(lambda x: x/10.0, range(5, 50, 15))
I don't know a built-in function, but writing one like [this](https://stackoverflow.com/a/477610/623735) shouldn't be too complicated.
def frange(x, y, jump):
    while x < y:
        yield x
        x += jump
---
As the comments mention, this could produce unpredictable results like:
>>> list(frange(0, 100, 0.1))[-1]
99.9999999999986
To get the expected result, you can use one of the other answers in this question, or as @Tadhg mentioned, you can use decimal.Decimal as the jump argument. Make sure to initialize it with a string rather than a float.
>>> import decimal
>>> list(frange(0, 100, decimal.Decimal('0.1')))[-1]
Decimal('99.9')
Or even:
import decimal
def drange(x, y, jump):
    while x < y:
        yield float(x)
        x += decimal.Decimal(jump)
And then:
>>> list(drange(0, 100, '0.1'))[-1]
99.9
[editor's note: if you only use a positive jump and integer start and stop (x and y), this works fine. For a more general solution see here.]
I used to use numpy.arange but had some complications controlling the number of elements it returns, due to floating point errors. So now I use linspace, e.g.:
>>> import numpy
>>> numpy.linspace(0, 10, num=4)
array([ 0. , 3.33333333, 6.66666667, 10. ])
Pylab has frange (a wrapper, actually, for matplotlib.mlab.frange):
>>> import pylab as pl
>>> pl.frange(0.5,5,0.5)
array([ 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])
Eagerly evaluated (2.x range):
[x * .5 for x in range(10)]
Lazily evaluated (2.x xrange, 3.x range):
itertools.imap(lambda x: x * .5, xrange(10)) # or range(10) as appropriate
Alternately:
itertools.islice(itertools.imap(lambda x: x * .5, itertools.count()), 10)
# without applying the `islice`, we get an infinite stream of half-integers.
using itertools: lazily evaluated floating point range:
>>> from itertools import count, takewhile
>>> def frange(start, stop, step):
        return takewhile(lambda x: x< stop, count(start, step))
>>> list(frange(0.5, 5, 1.5))
# [0.5, 2.0, 3.5]
I helped add the function numeric_range to the package more-itertools.
more_itertools.numeric_range(start, stop, step) acts like the built in function range but can handle floats, Decimal, and Fraction types.
>>> from more_itertools import numeric_range
>>> tuple(numeric_range(.1, 5, 1))
(0.1, 1.1, 2.1, 3.1, 4.1)
There is no such built-in function, but you can use the following (Python 3 code) to do the job as safe as Python allows you to.
from fractions import Fraction
def frange(start, stop, jump, end=False, via_str=False):
    """
    Equivalent of Python 3 range for decimal numbers.

    Notice that, because of arithmetic errors, it is safest to
    pass the arguments as strings, so they can be interpreted to exact fractions.

    >>> assert Fraction('1.1') - Fraction(11, 10) == 0.0
    >>> assert Fraction( 0.1 ) - Fraction(1, 10) == Fraction(1, 180143985094819840)

    Parameter `via_str` can be set to True to transform inputs in strings and then to fractions.
    When inputs are all non-periodic (in base 10), even if decimal, this method is safe as long
    as approximation happens beyond the decimal digits that Python uses for printing.

    For example, in the case of 0.1, this is the case:

    >>> assert str(0.1) == '0.1'
    >>> assert '%.50f' % 0.1 == '0.10000000000000000555111512312578270211815834045410'

    If you are not sure whether your decimal inputs all have this property, you are better off
    passing them as strings. String representations can be in integer, decimal, exponential or
    even fraction notation.

    >>> assert list(frange(1, 100.0, '0.1', end=True))[-1] == 100.0
    >>> assert list(frange(1.0, '100', '1/10', end=True))[-1] == 100.0
    >>> assert list(frange('1', '100.0', '.1', end=True))[-1] == 100.0
    >>> assert list(frange('1.0', 100, '1e-1', end=True))[-1] == 100.0
    >>> assert list(frange(1, 100.0, 0.1, end=True))[-1] != 100.0
    >>> assert list(frange(1, 100.0, 0.1, end=True, via_str=True))[-1] == 100.0
    """
    if via_str:
        start = str(start)
        stop = str(stop)
        jump = str(jump)
    start = Fraction(start)
    stop = Fraction(stop)
    jump = Fraction(jump)
    while start < stop:
        yield float(start)
        start += jump
    if end and start == stop:
        yield(float(start))
You can verify all of it by running a few assertions:
assert Fraction('1.1') - Fraction(11, 10) == 0.0
assert Fraction( 0.1 ) - Fraction(1, 10) == Fraction(1, 180143985094819840)
assert str(0.1) == '0.1'
assert '%.50f' % 0.1 == '0.10000000000000000555111512312578270211815834045410'
assert list(frange(1, 100.0, '0.1', end=True))[-1] == 100.0
assert list(frange(1.0, '100', '1/10', end=True))[-1] == 100.0
assert list(frange('1', '100.0', '.1', end=True))[-1] == 100.0
assert list(frange('1.0', 100, '1e-1', end=True))[-1] == 100.0
assert list(frange(1, 100.0, 0.1, end=True))[-1] != 100.0
assert list(frange(1, 100.0, 0.1, end=True, via_str=True))[-1] == 100.0
assert list(frange(2, 3, '1/6', end=True))[-1] == 3.0
assert list(frange(0, 100, '1/3', end=True))[-1] == 100.0
Code available on GitHub
As kichik wrote, this shouldn't be too complicated. However this code:
def frange(x, y, jump):
    while x < y:
        yield x
        x += jump
Is inappropriate because of the cumulative effect of errors when working with floats.
That is why you receive something like:
>>>list(frange(0, 100, 0.1))[-1]
99.9999999999986
While the expected behavior would be:
>>>list(frange(0, 100, 0.1))[-1]
99.9
Solution 1
The cumulative error can simply be reduced by using an index variable. Here's the example:
from math import ceil
def frange2(start, stop, step):
    n_items = int(ceil((stop - start) / step))
    return (start + i*step for i in range(n_items))
This example works as expected.
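The difference between the two strategies can be measured directly. The following sketch is only an illustration with hypothetical helper names (accumulate, multiply): it generates the same 1000 values by repeated addition and by index multiplication, then compares each against i / 10, which is the closest float to the exact value.

```python
def accumulate(start, step, n):
    """Repeated addition: every += adds a fresh rounding error."""
    x, out = start, []
    for _ in range(n):
        out.append(x)
        x += step
    return out


def multiply(start, step, n):
    """Index multiplication: one rounding error per element, never carried over."""
    return [start + i * step for i in range(n)]


acc = accumulate(0.0, 0.1, 1000)
mul = multiply(0.0, 0.1, 1000)
err_acc = max(abs(a - i / 10) for i, a in enumerate(acc))
err_mul = max(abs(m - i / 10) for i, m in enumerate(mul))
print(err_acc, err_mul)  # the accumulated error is the larger one
```

The multiplication error stays at the level of one rounding step, while the accumulated error grows with the number of additions.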
Solution 2
No nested functions. Only a while and a counter variable:
def frange3(start, stop, step):
    res, n = start, 1
    while res < stop:
        yield res
        res = start + n * step
        n += 1
This function will work well too, except for the cases when you want the reversed range. E.g:
>>>list(frange3(1, 0, -.1))
[]
Solution 1 in this case will work as expected. To make this function work in such situations, you must apply a hack, similar to the following:
from operator import gt, lt
def frange3(start, stop, step):
    res, n = start, 1  # n starts at 1: start itself is yielded first
    predicate = lt if start < stop else gt
    while predicate(res, stop):
        yield res
        res = start + n * step
        n += 1
With this hack you can use these functions with negative steps:
>>>list(frange3(1, 0, -.1))
[1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3999999999999999, 0.29999999999999993, 0.19999999999999996, 0.09999999999999998]
Solution 3
You can go even further with plain standard library and compose a range function for the most of numeric types:
from itertools import count
from itertools import takewhile
def any_range(start, stop, step):
    start = type(start + step)(start)
    return takewhile(lambda n: n < stop, count(start, step))
This generator is adapted from the Fluent Python book (Chapter 14. Iterables, Iterators and generators). It will not work with decreasing ranges. You must apply a hack, like in the previous solution.
You can use this generator as follows, for example:
>>> list(any_range(Fraction(2, 1), Fraction(100, 1), Fraction(1, 3)))[-1]
Fraction(299, 3)
>>> list(any_range(Decimal('2.'), Decimal('4.'), Decimal('.3')))
[Decimal('2'), Decimal('2.3'), Decimal('2.6'), Decimal('2.9'), Decimal('3.2'), Decimal('3.5'), Decimal('3.8')]
And of course you can use it with float and int as well.
Be careful
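If you want to use these functions with negative steps, you should add a check for the step sign, e.g.:
no_proceed = (start < stop and step < 0) or (start > stop and step > 0)
if no_proceed: return iter(())
Returning an empty iterator is the best option here if you want to mimic the range function itself, which simply yields nothing in that case. Raising StopIteration explicitly is not safe: inside a generator it is turned into a RuntimeError since Python 3.7.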
Mimic range
If you would like to mimic the range function interface, you can provide some argument checks:
def any_range2(*args):
    if len(args) == 1:
        start, stop, step = 0, args[0], 1.
    elif len(args) == 2:
        start, stop, step = args[0], args[1], 1.
    elif len(args) == 3:
        start, stop, step = args
    else:
        raise TypeError('any_range2() requires 1-3 numeric arguments')
    # here you can check for isinstance numbers.Real or use more specific ABC or whatever ...
    start = type(start + step)(start)
    return takewhile(lambda n: n < stop, count(start, step))
I think, you've got the point. You can go with any of these functions (except the very first one) and all you need for them is python standard library.
Why Is There No Floating Point Range Implementation In The Standard Library?
As made clear by all the posts here, there is no floating point version of range(). That said, the omission makes sense if we consider that the range() function is often used as an index (and of course, that means an accessor) generator. So, when we call range(0,40), we're in effect saying we want 40 values starting at 0, up to 40, but non-inclusive of 40 itself.
When we consider that index generation is as much about the number of indices as it is their values, the use of a float implementation of range() in the standard library makes less sense. For example, if we called the function frange(0, 10, 0.25), we would expect both 0 and 10 to be included, but that would yield a generator with 41 values, not the 40 one might expect from 10/0.25.
Thus, depending on its use, an frange() function will always exhibit counter intuitive behavior; it either has too many values as perceived from the indexing perspective or is not inclusive of a number that reasonably should be returned from the mathematical perspective. In other words, it's easy to see how such a function would appear to conflate two very different use cases – the naming implies the indexing use case; the behavior implies a mathematical one.
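The two expectations can be put side by side with a pair of small helpers. This is only a sketch with hypothetical names (half_open, inclusive): the first follows the indexing convention of range() and excludes the stop value, the second follows the mathematical convention and includes it.

```python
def half_open(start, stop, step):
    """Indexing view: values in [start, stop), like range().
    Assumes (stop - start) is a whole multiple of step."""
    count = round((stop - start) / step)
    return [start + i * step for i in range(count)]


def inclusive(start, stop, num):
    """Mathematical view: num evenly spaced values, both ends included."""
    if num < 2:
        raise ValueError("num must be at least 2")
    step = (stop - start) / (num - 1)
    # the last value is set explicitly so it is exactly stop
    return [start + i * step for i in range(num - 1)] + [float(stop)]


a = half_open(0, 10, 0.25)
b = inclusive(0, 10, 41)
print(len(a), a[-1])  # 40 9.75
print(len(b), b[-1])  # 41 10.0
```

Both lists share the same first 40 values; they differ only in whether 10.0 is appended.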
The Mathematical Use Case
With that said, as discussed in other posts, numpy.linspace() performs the generation from the mathematical perspective nicely:
numpy.linspace(0, 10, 41)
array([ 0. , 0.25, 0.5 , 0.75, 1. , 1.25, 1.5 , 1.75,
2. , 2.25, 2.5 , 2.75, 3. , 3.25, 3.5 , 3.75,
4. , 4.25, 4.5 , 4.75, 5. , 5.25, 5.5 , 5.75,
6. , 6.25, 6.5 , 6.75, 7. , 7.25, 7.5 , 7.75,
8. , 8.25, 8.5 , 8.75, 9. , 9.25, 9.5 , 9.75, 10.
])
The Indexing Use Case
And for the indexing perspective, I've written a slightly different approach with some tricksy string magic that allows us to specify the number of decimal places.
# Float range function - string formatting method
def frange_S (start, stop, skip = 1.0, decimals = 2):
    for i in range(int(start / skip), int(stop / skip)):
        yield float(("%0." + str(decimals) + "f") % (i * skip))
Similarly, we can also use the built-in round function and specify the number of decimals:
# Float range function - rounding method
def frange_R (start, stop, skip = 1.0, decimals = 2):
    for i in range(int(start / skip), int(stop / skip)):
        yield round(i * skip, ndigits = decimals)
A Quick Comparison & Performance
Of course, given the above discussion, these functions have a fairly limited use case. Nonetheless, here's a quick comparison:
def compare_methods (start, stop, skip):
    string_test = frange_S(start, stop, skip)
    round_test = frange_R(start, stop, skip)
    for s, r in zip(string_test, round_test):
        print(s, r)
compare_methods(-2, 10, 1/3)
The results are identical for each:
-2.0 -2.0
-1.67 -1.67
-1.33 -1.33
-1.0 -1.0
-0.67 -0.67
-0.33 -0.33
0.0 0.0
...
8.0 8.0
8.33 8.33
8.67 8.67
9.0 9.0
9.33 9.33
9.67 9.67
And some timings:
>>> import timeit
>>> setup = """
... def frange_s (start, stop, skip = 1.0, decimals = 2):
...     for i in range(int(start / skip), int(stop / skip)):
...         yield float(("%0." + str(decimals) + "f") % (i * skip))
... def frange_r (start, stop, skip = 1.0, decimals = 2):
...     for i in range(int(start / skip), int(stop / skip)):
...         yield round(i * skip, ndigits = decimals)
... start, stop, skip = -1, 8, 1/3
... """
>>> min(timeit.Timer('string_test = frange_s(start, stop, skip); [x for x in string_test]', setup=setup).repeat(30, 1000))
0.024284090992296115
>>> min(timeit.Timer('round_test = frange_r(start, stop, skip); [x for x in round_test]', setup=setup).repeat(30, 1000))
0.025324633985292166
Looks like the string formatting method wins by a hair on my system.
The Limitations
And finally, a demonstration of the point from the discussion above and one last limitation:
# "Missing" the last value (10.0)
for x in frange_R(0, 10, 0.25):
    print(x)
0.0
0.25
0.5
0.75
1.0
...
9.0
9.25
9.5
9.75
Further, when the stop value is not a whole multiple of the skip parameter, there can be a yawning gap between the last value and the stop value:
# Clearly we know that 10 - 9.43 is equal to 0.57
for x in frange_R(0, 10, 3/7):
    print(x)
0.0
0.43
0.86
1.29
...
8.14
8.57
9.0
9.43
There are ways to address this issue, but at the end of the day, the best approach would probably be to just use Numpy.
A solution without numpy etc dependencies was provided by kichik but due to the floating point arithmetics, it often behaves unexpectedly. As noted by me and blubberdiblub, additional elements easily sneak into the result. For example naive_frange(0.0, 1.0, 0.1) would yield 0.999... as its last value and thus yield 11 values in total.
A bit more robust version is provided here:
def frange(x, y, jump=1.0):
    '''Range for floats.'''
    i = 0.0
    x = float(x) # Prevent yielding integers.
    x0 = x
    epsilon = jump / 2.0
    yield x # yield always first value
    while x + epsilon < y:
        i += 1.0
        x = x0 + i * jump
        if x < y:
            yield x
Because of the multiplication, the rounding errors do not accumulate. The use of epsilon takes care of a possible rounding error in the multiplication, even though issues might of course arise at the very small and very large ends. Now, as expected:
> a = list(frange(0.0, 1.0, 0.1))
> a[-1]
0.9
> len(a)
10
And with somewhat larger numbers:
> b = list(frange(0.0, 1000000.0, 0.1))
> b[-1]
999999.9
> len(b)
10000000
The code is also available as a GitHub Gist.
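The point about multiplication versus accumulation can be illustrated with a small comparison. This is only a sketch; the helper names accumulate and multiply are made up for the illustration.

```python
def accumulate(start, step, count):
    """Build the series by repeatedly adding step (rounding errors pile up)."""
    values, x = [], float(start)
    for _ in range(count):
        values.append(x)
        x += step
    return values

def multiply(start, step, count):
    """Build the series as start + i * step (one rounding per element)."""
    return [start + i * step for i in range(count)]

added = accumulate(0.0, 0.1, 11)
scaled = multiply(0.0, 0.1, 11)
print(added[-1])   # 0.9999999999999999
print(scaled[-1])  # 1.0
```

After ten additions of 0.1 the accumulated value has already drifted below 1.0, while the multiplied value has not.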
This can be done with numpy.arange(start, stop, stepsize)
import numpy as np
np.arange(0.5,5,1.5)
>> array([0.5, 2. , 3.5])
# OBS the stop value (5) is excluded. With a non-integer step, floating-point rounding
# can sometimes add an unexpected extra element or give slightly-off values such as 0.50000001,
# so you need to decide whether that's not an issue for you, or how you are going to catch it.
Note 1:
From the discussion in the comment section here, "never use numpy.arange() (the numpy documentation itself recommends against it). Use numpy.linspace as recommended by wim, or one of the other suggestions in this answer"
Note 2:
I have read the discussion in a few comments here, but after coming back to this question for the third time now, I feel this information should be placed in a more readable position.
A simpler library-less version
Aw, heck -- I'll toss in a simple library-less version. Feel free to improve on it[*]:
def frange(start=0, stop=1, jump=0.1):
    nsteps = int((stop-start)/jump)
    dy = stop-start
    # f(i) goes from start to stop as i goes from 0 to nsteps
    return [start + float(i)*dy/nsteps for i in range(nsteps)]
The core idea is that nsteps is the number of steps to get you from start to stop and range(nsteps) always emits integers so there's no loss of accuracy. The final step is to map [0..nsteps] linearly onto [start..stop].
edit
If, like alancalvitti you'd like the series to have exact rational representation, you can always use Fractions:
from fractions import Fraction
def rrange(start=0, stop=1, jump=0.1):
    nsteps = int((stop-start)/jump)
    return [Fraction(i, nsteps) for i in range(nsteps)]
[*] In particular, frange() returns a list, not a generator. But it sufficed for my needs.
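For an exact rational series over an arbitrary start and stop, one possible sketch builds everything from Fraction values. The name fraction_range is hypothetical. Passing the arguments as strings such as "0.1" is an assumption of this sketch, made so that no binary rounding enters at all.

```python
from fractions import Fraction

def fraction_range(start, stop, step):
    """Yield exact Fractions from start up to (not including) stop."""
    start, stop, step = Fraction(start), Fraction(stop), Fraction(step)
    if step == 0:
        raise ValueError("step must not be zero")
    i = 0
    value = start
    # Compare in the direction of travel so negative steps also terminate.
    while (value < stop) if step > 0 else (value > stop):
        yield value
        i += 1
        value = start + i * step

tenths = list(fraction_range("0", "1", "0.1"))
down = list(fraction_range("1", "0", "-0.25"))
print([float(v) for v in tenths])
```

Converting to float only at the end gives the closest float to each exact value.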
Usage
# Counting up
drange(0, 0.4, 0.1)
[0, 0.1, 0.2, 0.30000000000000004, 0.4]
# Counting down
drange(0, -0.4, -0.1)
[0, -0.1, -0.2, -0.30000000000000004, -0.4]
To round each step to N decimal places
drange(0, 0.4, 0.1, round_decimal_places=4)
[0, 0.1, 0.2, 0.3, 0.4]
drange(0, -0.4, -0.1, round_decimal_places=4)
[0, -0.1, -0.2, -0.3, -0.4]
Code
def drange(start, end, increment, round_decimal_places=None):
    result = []
    if start < end:
        # Counting up, e.g. 0 to 0.4 in 0.1 increments.
        if increment < 0:
            raise Exception("Error: When counting up, increment must be positive.")
        while start <= end:
            result.append(start)
            start += increment
            if round_decimal_places is not None:
                start = round(start, round_decimal_places)
    else:
        # Counting down, e.g. 0 to -0.4 in -0.1 increments.
        if increment > 0:
            raise Exception("Error: When counting down, increment must be negative.")
        while start >= end:
            result.append(start)
            start += increment
            if round_decimal_places is not None:
                start = round(start, round_decimal_places)
    return result
Why choose this answer?
Many other answers will hang when asked to count down.
Many other answers will give incorrectly rounded results.
Other answers based on np.linspace are hit-and-miss, they may or may not work due to difficulty in choosing the correct number of divisions. np.linspace really struggles with decimal increments of 0.1, and the order of divisions in the formula to convert the increment into a number of splits can result in either correct or broken code.
Other answers rely on np.arange, which the NumPy documentation discourages for non-integer steps.
If in doubt, try the four test cases above.
I do not know if the question is old, but there is an arange function in the NumPy library that could work as a range.
np.arange(0,1,0.1)
#out:
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
I wrote a function that returns a tuple of a range of double-precision floating-point numbers without any decimal places beyond the hundredths. It was simply a matter of parsing the range values as strings and splitting off the excess. I use it for displaying ranges to select from within a UI. I hope someone else finds it useful.
def drange(start,stop,step):
    double_value_range = []
    while start<stop:
        a = str(start)
        a.split('.')[1].split('0')[0]
        start = float(str(a))
        double_value_range.append(start)
        start = start+step
    double_value_range_tuple = tuple(double_value_range)
    #print double_value_range_tuple
    return double_value_range_tuple
Whereas integer-based ranges are well defined in that "what you see is what you get", there are things that are not readily seen in floats that cause troubles in getting what appears to be a well defined behavior in a desired range.
There are two approaches that one can take:
split a given range into a certain number of segments: the linspace approach, in which you accept the large number of decimal digits when you select a number of points that does not divide the span well (e.g. 0 to 1 in 7 steps will give a first step value of 0.14285714285714285)
give the desired WYSIWYG step size that you already know should work and wish that it would work. Your hopes will often be dashed by getting values that miss the end point that you wanted to hit.
Multiples can be higher or lower than you expect:
>>> 3*.1 > .3 # 0.30000000000000004
True
>>> 3*.3 < 0.9 # 0.8999999999999999
True
You will try to avoid accumulating errors by adding multiples of your step and not incrementing, but the problem will always present itself and you just won't get what you expect if you did it by hand on paper -- with exact decimals. But you know it should be possible since Python shows you 0.1 instead of the underlying integer ratio having a close approximation to 0.1:
>>> (3*.1).as_integer_ratio()
(1351079888211149, 4503599627370496)
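The exact values behind these surprises can be inspected with the standard library. The snippet below is a small illustration; the variable names are arbitrary.

```python
from decimal import Decimal
from fractions import Fraction

# Decimal(float) shows the exact binary value that the literal 0.1 is stored as.
stored_tenth = Decimal(0.1)
print(stored_tenth)

# Fraction(float) is exact too, so the comparisons below involve no rounding.
too_high = Fraction(3 * 0.1) > Fraction(3, 10)
too_low = Fraction(3 * 0.3) < Fraction(9, 10)
print(too_high, too_low)
```

Both comparisons come out True, matching the two cases shown above: one multiple lands slightly above the decimal value and the other slightly below.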
In the methods offered as answers, the use of Fraction here with the option to handle input as strings is best. I have a few suggestions to make it better:
make it handle range-like defaults so you can start from 0 automatically
make it handle decreasing ranges
make the output look like you would expect if you were using exact arithmetic
I offer a routine that does the same sort of thing but does not use the Fraction object. Instead, it uses round to create numbers having the same apparent digits as the numbers would have if you printed them with Python, e.g. 1 decimal for something like 0.1 and 3 decimals for something like 0.004:
def frange(start, stop, step, n=None):
    """return a WYSIWYG series of float values that mimic range behavior
    by excluding the end point and not printing extraneous digits beyond
    the precision of the input numbers (controlled by n and automatically
    detected based on the string representation of the numbers passed).
    EXAMPLES
    ========
    non-WYSIWYG simple list-comprehension
    >>> [.11 + i*.1 for i in range(3)]
    [0.11, 0.21000000000000002, 0.31]
    WYSIWYG result for increasing sequence
    >>> list(frange(0.11, .33, .1))
    [0.11, 0.21, 0.31]
    and decreasing sequences
    >>> list(frange(.345, .1, -.1))
    [0.345, 0.245, 0.145]
    To hit the end point for a sequence that is divisible by
    the step size, make the end point a little bigger by
    adding half the step size:
    >>> dx = .2
    >>> list(frange(0, 1 + dx/2, dx))
    [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
    """
    if step == 0:
        raise ValueError('step must not be 0')
    # how many decimal places are showing?
    if n is None:
        n = max([0 if '.' not in str(i) else len(str(i).split('.')[1])
                 for i in (start, stop, step)])
    if step*(stop - start) > 0:  # a non-null incr/decr range
        if step < 0:
            for i in frange(-start, -stop, -step, n):
                yield -i
        else:
            steps = round((stop - start)/step)
            while round(step*steps + start, n) < stop:
                steps += 1
            for i in range(steps):
                yield round(start + i*step, n)
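The digit detection used above can be looked at in isolation. The helper name decimal_places is hypothetical; it simply repeats the string-based counting from the routine and shows one caveat, namely that floats printed in scientific notation report zero decimals.

```python
def decimal_places(value):
    """Count the digits shown after the decimal point in str(value)."""
    text = str(value)
    return len(text.split('.')[1]) if '.' in text else 0

print(decimal_places(0.1))    # one decimal showing
print(decimal_places(0.004))  # three decimals showing
print(decimal_places(5))      # integers show none
print(decimal_places(1e-07))  # str() gives '1e-07', so no '.' is found
```

For very small steps it is therefore safer to pass n explicitly rather than rely on the automatic detection.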
def Range(*argSequence):
    if len(argSequence) == 3:
        imin = argSequence[0]; imax = argSequence[1]; di = argSequence[2]
        i = imin; iList = []
        while i <= imax:
            iList.append(i)
            i += di
        return iList
    if len(argSequence) == 2:
        return Range(argSequence[0], argSequence[1], 1)
    if len(argSequence) == 1:
        return Range(1, argSequence[0], 1)
Please note that the first letter of Range is capitalized. This naming style is not encouraged for functions in Python. You can change Range to something like drange or frange if you want. The "Range" function behaves just as you want it to. You can check its manual here [ http://reference.wolfram.com/language/ref/Range.html ].
I think that there is a very simple answer that really emulates all the features of range but for both float and integer. In this solution, you just suppose that your approximation by default is 1e-7 (or the one you choose) and you can change it when you call the function.
def drange(start,stop=None,jump=1,approx=7): # Approx to 1e-7 by default
    '''
    This function is equivalent to range but for both float and integer
    '''
    if stop is None: # If there is no stop value: range(x). Testing for None keeps stop=0 usable.
        stop= start
        start= 0
    valor= round(start,approx)
    while valor < stop:
        if valor==int(valor):
            yield int(round(valor,approx))
        else:
            yield float(round(valor,approx))
        valor += jump
for i in drange(12):
    print(i)
Talk about making a mountain out of a mole hill.
If you relax the requirement to make a float analog of the range function, and just create a list of floats that is easy to use in a for loop, the coding is simple and robust.
def super_range(first_value, last_value, number_steps):
    if not isinstance(number_steps, int):
        raise TypeError("The value of 'number_steps' is not an integer.")
    if number_steps < 2:
        # At least two points are needed; a single step would divide by zero below.
        raise ValueError("Your 'number_steps' is less than 2.")
    step_size = (last_value-first_value)/(number_steps-1)
    output_list = []
    for i in range(number_steps):
        output_list.append(first_value + step_size*i)
    return output_list
first = 20.0
last = -50.0
steps = 5
print(super_range(first, last, steps))
The output will be
[20.0, 2.5, -15.0, -32.5, -50.0]
Note that the function super_range is not limited to floats. It can handle any data type for which the operators +, -, *, and / are defined, such as complex, Decimal, and numpy.array:
import cmath
first = complex(1,2)
last = complex(5,6)
steps = 5
print(super_range(first, last, steps))
from decimal import *
first = Decimal(20)
last = Decimal(-50)
steps = 5
print(super_range(first, last, steps))
import numpy as np
first = np.array([[1, 2],[3, 4]])
last = np.array([[5, 6],[7, 8]])
steps = 5
print(super_range(first, last, steps))
The output will be:
[(1+2j), (2+3j), (3+4j), (4+5j), (5+6j)]
[Decimal('20.0'), Decimal('2.5'), Decimal('-15.0'), Decimal('-32.5'), Decimal('-50.0')]
[array([[1., 2.],[3., 4.]]),
array([[2., 3.],[4., 5.]]),
array([[3., 4.],[5., 6.]]),
array([[4., 5.],[6., 7.]]),
array([[5., 6.],[7., 8.]])]
There will of course be some rounding errors, so this is not perfect, but this is what I generally use for applications that don't require high precision. If you wanted to make this more accurate, you could add an extra argument to specify how to handle rounding errors. Passing a rounding function might make this extensible and let the programmer decide how rounding errors are handled.
arange = lambda start, stop, step: [start + step * i for i in range(int((stop - start) / step))]
If I write:
arange(0, 1, 0.1)
It will output:
[0.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6000000000000001, 0.7000000000000001, 0.8, 0.9]
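The idea of passing a rounding function, mentioned above, might look like the following sketch. The name arange_rounded and the rounder parameter are hypothetical additions for illustration, not part of the original one-liner.

```python
def arange_rounded(start, stop, step, rounder=None):
    """List start, start + step, ... below stop, optionally post-processing each value."""
    values = [start + step * i for i in range(int((stop - start) / step))]
    if rounder is None:
        return values
    # The caller decides how rounding errors are cleaned up.
    return [rounder(v) for v in values]

raw = arange_rounded(0, 1, 0.1)
clean = arange_rounded(0, 1, 0.1, rounder=lambda v: round(v, 10))
print(clean)
```

Without a rounder the raw floats are returned unchanged; with one, values such as 0.30000000000000004 come back as 0.3.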
Is there a range() equivalent for floats in Python?
NO
Use this:
def f_range(start, end, step, coef=0.01):
    a = range(int(start/coef), int(end/coef), int(step/coef))
    var = []
    for item in a:
        var.append(item*coef)
    return var
There are several answers here that don't handle simple edge cases like a negative step, wrong start, stop etc. Here's a version that handles many of these cases correctly, giving the same behaviour as the native range():
def frange(start, stop=None, step=1):
    if stop is None:
        start, stop = 0, start
    steps = int((stop-start)/step)
    for i in range(steps):
        yield start
        start += step
Note that this would error out on step=0, just like the native range. One difference is that the native range returns an object that is indexable and reversible, while the above doesn't.
You can play with this code and test cases here.
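If an indexable and reversible object is wanted, one possible sketch is a small class that implements only __len__ and __getitem__ and lets Python's sequence protocol supply iteration and reversed(). The class name FloatRange is hypothetical, only integer indices are handled, and the length computed with math.ceil can be off by one when the division is affected by floating-point rounding.

```python
import math

class FloatRange:
    """A range-like sequence of floats; elements are computed as start + i * step."""

    def __init__(self, start, stop=None, step=1.0):
        if stop is None:
            start, stop = 0.0, start
        if step == 0:
            raise ValueError("step must not be zero")
        self.start, self.step = start, step
        # Element count; negative counts (wrong direction) become an empty range.
        self._length = max(0, math.ceil((stop - start) / step))

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError("FloatRange index out of range")
        return self.start + index * self.step

quarters = FloatRange(0, 1, 0.25)
print(list(quarters), quarters[-1], list(reversed(quarters)))
```

Computing each element from its index also avoids the accumulated additions of the generator version.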

Map each list value to its corresponding percentile

I'd like to create a function that takes a (sorted) list as its argument and outputs a list containing each element's corresponding percentile.
For example, fn([1,2,3,4,17]) returns [0.0, 0.25, 0.50, 0.75, 1.00].
Can anyone please either:
Help me correct my code below? OR
Offer a better alternative than my code for mapping values in a list to their corresponding percentiles?
My current code:
def median(mylist):
length = len(mylist)
if not length % 2:
return (mylist[length / 2] + mylist[length / 2 - 1]) / 2.0
return mylist[length / 2]
###############################################################################
# PERCENTILE FUNCTION
###############################################################################
def percentile(x):
"""
Find the correspoding percentile of each value relative to a list of values.
where x is the list of values
Input list should already be sorted!
"""
# sort the input list
# list_sorted = x.sort()
# count the number of elements in the list
list_elementCount = len(x)
#obtain set of values from list
listFromSetFromList = list(set(x))
# count the number of unique elements in the list
list_uniqueElementCount = len(set(x))
# define extreme quantiles
percentileZero = min(x)
percentileHundred = max(x)
# define median quantile
mdn = median(x)
# create empty list to hold percentiles
x_percentile = [0.00] * list_elementCount
# initialize unique count
uCount = 0
for i in range(list_elementCount):
if x[i] == percentileZero:
x_percentile[i] = 0.00
elif x[i] == percentileHundred:
x_percentile[i] = 1.00
elif x[i] == mdn:
x_percentile[i] = 0.50
else:
subList_elementCount = 0
for j in range(i):
if x[j] < x[i]:
subList_elementCount = subList_elementCount + 1
x_percentile[i] = float(subList_elementCount / list_elementCount)
#x_percentile[i] = float(len(x[x > listFromSetFromList[uCount]]) / list_elementCount)
if i == 0:
continue
else:
if x[i] == x[i-1]:
continue
else:
uCount = uCount + 1
return x_percentile
Currently, if I submit percentile([1,2,3,4,17]), the list [0.0, 0.0, 0.5, 0.0, 1.0] is returned.
I think your example input/output does not correspond to typical ways of calculating percentile. If you calculate the percentile as "proportion of data points strictly less than this value", then the top value should be 0.8 (since 4 of 5 values are less than the largest one). If you calculate it as "percent of data points less than or equal to this value", then the bottom value should be 0.2 (since 1 of 5 values equals the smallest one). Thus the percentiles would be [0, 0.2, 0.4, 0.6, 0.8] or [0.2, 0.4, 0.6, 0.8, 1]. Your definition seems to be "the number of data points strictly less than this value, considered as a proportion of the number of data points not equal to this value", but in my experience this is not a common definition (see for instance wikipedia).
With the typical percentile definitions, the percentile of a data point is equal to its rank divided by the number of data points. (See for instance this question on Stats SE asking how to do the same thing in R.) Differences in how to compute the percentile amount to differences in how to compute the rank (for instance, how to rank tied values). The scipy.stats.percentileofscore function provides four ways of computing percentiles:
>>> from scipy import stats
>>> x = [1, 1, 2, 2, 17]
>>> [stats.percentileofscore(x, a, 'rank') for a in x]
[30.0, 30.0, 70.0, 70.0, 100.0]
>>> [stats.percentileofscore(x, a, 'weak') for a in x]
[40.0, 40.0, 80.0, 80.0, 100.0]
>>> [stats.percentileofscore(x, a, 'strict') for a in x]
[0.0, 0.0, 40.0, 40.0, 80.0]
>>> [stats.percentileofscore(x, a, 'mean') for a in x]
[20.0, 20.0, 60.0, 60.0, 90.0]
(I used a dataset containing ties to illustrate what happens in such cases.)
The "rank" method assigns tied groups a rank equal to the average of the ranks they would cover (i.e., a three-way tie for 2nd place gets a rank of 3 because it "takes up" ranks 2, 3 and 4). The "weak" method assigns a percentile based on the proportion of data points less than or equal to a given point; "strict" is the same but counts proportion of points strictly less than the given point. The "mean" method is the average of the latter two.
As Kevin H. Lin noted, calling percentileofscore in a loop is inefficient since it has to recompute the ranks on every pass. However, these percentile calculations can be easily replicated using different ranking methods provided by scipy.stats.rankdata, letting you calculate all the percentiles at once:
>>> from scipy import stats
>>> stats.rankdata(x, "average")/len(x)
array([ 0.3, 0.3, 0.7, 0.7, 1. ])
>>> stats.rankdata(x, 'max')/len(x)
array([ 0.4, 0.4, 0.8, 0.8, 1. ])
>>> (stats.rankdata(x, 'min')-1)/len(x)
array([ 0. , 0. , 0.4, 0.4, 0.8])
In the last case the ranks are adjusted down by one to make them start from 0 instead of 1. (I've omitted "mean", but it could easily be obtained by averaging the results of the latter two methods.)
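For readers without SciPy, the "weak", "strict" and "mean" definitions can be sketched in pure Python by sorting once and using binary search. The function name percentiles is hypothetical and this is not SciPy's implementation; it only follows the definitions given above.

```python
from bisect import bisect_left, bisect_right

def percentiles(values, kind="weak"):
    """Percentile (0-100) of every element, computed in O(n log n)."""
    ordered = sorted(values)
    n = len(ordered)
    result = []
    for v in values:
        left = bisect_left(ordered, v)    # number of values strictly less than v
        right = bisect_right(ordered, v)  # number of values less than or equal to v
        if kind == "strict":
            result.append(100.0 * left / n)
        elif kind == "weak":
            result.append(100.0 * right / n)
        elif kind == "mean":
            result.append(100.0 * (left + right) / (2 * n))
        else:
            raise ValueError("kind must be 'strict', 'weak' or 'mean'")
    return result

x = [1, 1, 2, 2, 17]
print(percentiles(x, "weak"))
print(percentiles(x, "strict"))
print(percentiles(x, "mean"))
```

On the tied dataset used above this reproduces the percentileofscore results for the same three kinds.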
I did some timings. With small data such as that in your example, using rankdata is somewhat slower than Kevin H. Lin's solution (presumably due to the overhead scipy incurs in converting things to numpy arrays under the hood) but faster than calling percentileofscore in a loop as in reptilicus's answer:
In [11]: %timeit [stats.percentileofscore(x, i) for i in x]
1000 loops, best of 3: 414 µs per loop
In [12]: %timeit list_to_percentiles(x)
100000 loops, best of 3: 11.1 µs per loop
In [13]: %timeit stats.rankdata(x, "average")/len(x)
10000 loops, best of 3: 39.3 µs per loop
With a large dataset, however, the performance advantage of numpy takes effect and using rankdata is 10 times faster than Kevin's list_to_percentiles:
In [18]: x = np.random.randint(0, 10000, 1000)
In [19]: %timeit [stats.percentileofscore(x, i) for i in x]
1 loops, best of 3: 437 ms per loop
In [20]: %timeit list_to_percentiles(x)
100 loops, best of 3: 1.08 ms per loop
In [21]: %timeit stats.rankdata(x, "average")/len(x)
10000 loops, best of 3: 102 µs per loop
This advantage will only become more pronounced on larger and larger datasets.
I think you want scipy.stats.percentileofscore
Example:
from scipy.stats import percentileofscore
percentileofscore([1, 2, 3, 4], 3)
75.0
percentiles = [percentileofscore(data, i) for i in data]
In terms of complexity, I think reptilicus's answer is not optimal. It takes O(n^2) time.
Here is a solution that takes O(n log n) time.
def list_to_percentiles(numbers):
    # Pair each value with its original index, then sort by value
    # (sorted() is used because zip() returns an iterator in Python 3).
    pairs = sorted(zip(numbers, range(len(numbers))), key=lambda p: p[0])
    result = [0 for i in range(len(numbers))]
    for rank in range(len(numbers)):
        original_index = pairs[rank][1]
        result[original_index] = rank * 100.0 / (len(numbers)-1)
    return result
I'm not sure, but I think this is the optimal time complexity you can get. The rough reason I think it's optimal is because the information of all of the percentiles is essentially equivalent to the information of the sorted list, and you can't get better than O(n log n) for sorting.
EDIT: Depending on your definition of "percentile" this may not always give the correct result. See BrenBarn's answer for more explanation and for a better solution which makes use of scipy/numpy.
Pure numpy version of Kevin's solution
As Kevin said, the optimal solution works in O(n log(n)) time. Here is a fast version of his code in numpy, which takes almost the same time as stats.rankdata:
percentiles = numpy.argsort(numpy.argsort(array)) * 100. / (len(array) - 1)
PS. This is one of my favourite tricks in numpy.
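Why a double argsort yields ranks can be seen in a pure-Python sketch. The argsort helper below is a stand-in written for this illustration, not the numpy function. The first argsort lists the indices in sorted order, and applying argsort to that list inverts the permutation, which gives each element's rank.

```python
def argsort(seq):
    """Indices that would sort seq (a pure-Python stand-in for numpy.argsort)."""
    return sorted(range(len(seq)), key=seq.__getitem__)

data = [23, 4, 1, 43, 6]
order = argsort(data)   # positions of the values in ascending order
ranks = argsort(order)  # rank of each original element (0 = smallest)
percentiles = [r * 100.0 / (len(data) - 1) for r in ranks]
print(order, ranks, percentiles)
```

Note that tied values receive distinct consecutive ranks with this approach, so it matches the other percentile definitions only when all values are unique.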
This might look oversimplified, but what about this:
def percentile(x):
    pc = float(1)/(len(x)-1)
    return ["%.2f"%(n*pc) for n, i in enumerate(x)]
EDIT:
def percentile(x):
    # Sort the unique values; a bare set has no guaranteed order.
    unique = sorted(set(x))
    mapping = {}
    pc = float(1)/(len(unique)-1)
    for n, i in enumerate(unique):
        mapping[i] = "%.2f"%(n*pc)
    return [mapping.get(el) for el in x]
I tried SciPy's percentile score but it turned out to be very slow for one of my tasks, so I simply implemented it this way. It can be modified if a weak ranking is needed.
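import numpy as np
def assign_pct(X):
    mp = {}
    X_tmp = np.sort(X)
    pct = []
    cnt = 0
    for v in X_tmp:
        if v in mp:
            continue
        else:
            mp[v] = cnt
            cnt+=1
    for v in X:
        # Divide by cnt-1 so the largest value maps to 1.0 (needs at least two distinct values).
        pct.append(mp[v]/(cnt-1))
    return pct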
Calling the function
assign_pct([23,4,1,43,1,6])
Output of function
[0.75, 0.25, 0.0, 1.0, 0.0, 0.5]
If I understand you correctly, all you want to do is determine the percentile each element represents in the array, i.e. how much of the array comes before that element. For example, [1, 2, 3, 4, 5]
should be [0.0, 0.25, 0.5, 0.75, 1.0]
I believe such code will be enough:
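def percentileListEdited(List):
    # Sort the unique values; a bare set has no guaranteed order.
    uniqueList = sorted(set(List))
    increase = 1.0/(len(uniqueList)-1)
    newList = {}
    for index, value in enumerate(uniqueList):
        # Key by value, because the lookup below is by value.
        newList[value] = 0.0 + increase * index
    return [newList[val] for val in List]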
For me the best solution is to use QuantileTransformer in sklearn.preprocessing.
import numpy as np
from sklearn.preprocessing import QuantileTransformer
fn = lambda input_list : QuantileTransformer(n_quantiles=100).fit_transform(np.array(input_list).reshape([-1,1])).ravel().tolist()
input_raw = [1, 2, 3, 4, 17]
output_perc = fn( input_raw )
print("Input=", input_raw)
print("Output=", np.round(output_perc,2))
Here is the output
Input= [1, 2, 3, 4, 17]
Output= [ 0. 0.25 0.5 0.75 1. ]
Note: this function has two salient features:
input raw data is NOT necessarily sorted.
input raw data is NOT necessarily single column.
This version also lets you pass the exact percentile values used for ranking:
def what_pctl_number_of(x, a, pctls=np.arange(1, 101)):
    return np.argmax(np.sign(np.append(np.percentile(x, pctls), np.inf) - a))
So it's possible to find out which of the provided percentile ranges a value falls into:
_x = np.random.randn(100, 1)
what_pctl_number_of(_x, 1.6, [25, 50, 75, 100])
Output:
3
so the value falls in the 75 ~ 100 range
For a pure Python function to calculate a percentile score for a given item, compared to the population distribution (a list of scores), I pulled this from the scipy source code and removed all references to numpy:
def percentileofscore(a, score, kind='rank'):
    n = len(a)
    if n == 0:
        return 100.0
    left = len([item for item in a if item < score])
    right = len([item for item in a if item <= score])
    if kind == 'rank':
        pct = (right + left + (1 if right > left else 0)) * 50.0/n
        return pct
    elif kind == 'strict':
        return left / n * 100
    elif kind == 'weak':
        return right / n * 100
    elif kind == 'mean':
        pct = (left + right) / n * 50
        return pct
    else:
        raise ValueError("kind can only be 'rank', 'strict', 'weak' or 'mean'")
source: https://github.com/scipy/scipy/blob/v1.2.1/scipy/stats/stats.py#L1744-L1835
Calculating percentiles is trickier than one would think, but it does not need the full scipy/numpy/scikit stack, so this is the best option for lightweight deployment. The original code is better at filtering for only nonzero values, but otherwise the math is the same. The optional kind parameter controls how it handles values that fall between two other values.
For this use case, one can call this function for each item in a list using the map() function.
