upper bound on predictability

upper bound on predictability - python

I'm trying to compute the upper bound on the predictability of my occupancy dataset, as in Song's 'Limits of Predictability in Human Mobility' paper. Basically, home (=1) and not at home (=0) then represent the visited locations (towers) in Song's paper.
I tested my code (which I derived from https://github.com/gavin-s-smith/MobilityPredictabilityUpperBounds and https://github.com/gavin-s-smith/EntropyRateEst) on a random binary sequence which should return an entropy of 1 and a predictability of 0.5. Instead, the returned entropy is 0.87 and the predictabiltiy 0.71.
Here's my code:
import numpy as np
from scipy.optimize import fsolve
from cmath import log
import math
def matchfinder(data):
data_len = len(data)
output = np.zeros(len(data))
output[0] = 1
# Using L_{n} definition from
#"Nonparametric Entropy Estimation for Stationary Process and Random Fields, with Applications to English Text"
# by Kontoyiannis et. al.
# $L_{n} = 1 + max \{l :0 \leq l \leq n, X^{l-1}_{0} = X^{-j+l-1}_{-j} \text{ for some } l \leq j \leq n \}$
# for each position, i, in the sub-sequence that occurs before the current position, start_idx
# check to see the maximum continuously equal string we can make by simultaneously extending from i and start_idx
for start_idx in range(1,data_len):
max_subsequence_matched = 0
for i in range(0,start_idx):
# for( int i = 0; i < start_idx; i++ )
# {
j = 0
#increase the length of the substring starting at j and start_idx
#while they are the same keeping track of the length
while( (start_idx+j < data_len) and (i+j < start_idx) and (data[i+j] == data[start_idx+j]) ):
j = j + 1
if j > max_subsequence_matched:
max_subsequence_matched = j;
#L_{n} is obtained by adding 1 to the longest match-length
output[start_idx] = max_subsequence_matched + 1;
return output
if __name__ == '__main__':
#Read dataset
data = np.random.randint(2,size=2000)
#Number of distinct locations
N = len(np.unique(data))
#True entropy
lambdai = matchfinder(data)
Etrue = math.pow(sum( [ lambdai[i] / math.log(i+1,2) for i in range(1,len(data))] ) * (1.0/len(data)),-1)
S = Etrue
#use Fano's inequality to compute the predictability
func = lambda x: (-(x*log(x,2).real+(1-x)*log(1-x,2).real)+(1-x)*log(N-1,2).real ) - S
ub = fsolve(func, 0.9)[0]
print ub
the matchfinder function finds the entropy by looking for the longest match and adds 1 to it (= the shortest substring not previously seen). The predictability is then computed by using Fano's inequality.
What could be the problem?
Thanks!

The entropy function seems to be wrong.
Refering to the paper Song, C., Qu, Z., Blumm, N., & Barabási, A. L. (2010). Limits of predictability in human mobility. Science, 327(5968), 1018–1021. you mentioned, real entropy is estimated by algorithm based on Lempel-Ziv data compression:
In code it would look like this:
Etrue = math.pow((np.sum(lambdai)/ n),-1)*log(n,2).real
Where n is the length of time series.
Notice that we used different base for logarithm than in given formula. However, since the base for logarithm in Fano's inequality is 2, then it seems logical to use the same base for entropy calculation. Also, I'm not sure why you started sum from the first instead of zero index.
So now wrapping that up into function for example:
def solve(locations, size):
data = np.random.randint(locations,size=size)
N = len(np.unique(data))
n = float(len(data))
print "Distinct locations: %i" % N
print "Time series length: %i" % n
#True entropy
lambdai = matchfinder(data)
#S = math.pow(sum([lambdai[i] / math.log(i + 1, 2) for i in range(1, len(data))]) * (1.0 / len(data)), -1)
Etrue = math.pow((np.sum(lambdai)/ n),-1)*log(n,2).real
S = Etrue
print "Maximum entropy: %2.5f" % log(locations,2).real
print "Real entropy: %2.5f" % S
func = lambda x: (-(x * log(x, 2).real + (1 - x) * log(1 - x, 2).real) + (1 - x) * log(N - 1, 2).real) - S
ub = fsolve(func, 0.9)[0]
print "Upper bound of predictability: %2.5f" % ub
return ub
Output for 2 locations
Distinct locations: 2
Time series length: 10000
Maximum entropy: 1.00000
Real entropy: 1.01441
Upper bound of predictability: 0.50013
Output for 3 locations
Distinct locations: 3
Time series length: 10000
Maximum entropy: 1.58496
Real entropy: 1.56567
Upper bound of predictability: 0.41172
Lempel-Ziv compression converge to real entropy when n approaches infinity, that is why for 2 locations case it is slightly higher than maximum limit.
I am not also sure if you interpreted definition for lambda correctly. It is defined as "the length of the shortest substring starting at position i which dosen't previously appear from position 1 to i-1", so when we got to some point where further substrings are not unique anymore, your matching algorithm would give it length always one higher than the length of substring, while it should be rather equal to 0, since unique substring does not exist.
To make it clearer let's just give a simple example. If the array of positions looks like that:
[1 0 0 1 0 0]
Then we can see that after first three positions pattern is repeated once again. That means that from fourth location shorthest unique substring does not exist, thus it equals to 0. So the output (lambda) should look like this:
[1 1 2 0 0 0]
However, your function for that case would return:
[1 1 2 4 3 2]
I rewrote you matching function to treat that problem:
def matchfinder2(data):
data_len = len(data)
output = np.zeros(len(data))
output[0] = 1
for start_idx in range(1,data_len):
max_subsequence_matched = 0
for i in range(0,start_idx):
j = 0
end_distance = data_len - start_idx #length left to the end of sequence (including current index)
while( (start_idx+j < data_len) and (i+j < start_idx) and (data[i+j] == data[start_idx+j]) ):
j = j + 1
if j == end_distance: #check if j has reached the end of sequence
output[start_idx::] = np.zeros(end_distance) #if yes fill the rest of output with zeros
return output #end function
elif j > max_subsequence_matched:
max_subsequence_matched = j;
output[start_idx] = max_subsequence_matched + 1;
return output
Differences are small of course, because result change just for the small part of sequences.

Related

Put constraits on six random numbers making three negative and three positive and equating to 1?

How do i make three random negative numbers and three positive random numbers equal 1 constrained between -1 and 1. For example,
random_nums = np.array([-.2,-.3,-.5,.5,.8,.7])) = 1
I dont want np.uniform answers I need 6 random numbers in arr[0],arr[1],arr[2],arr[3],arr[4],arr[5],arr[6] =1. Then want to shuffle them in different order shuffle(random_nums).

Obviously, numbers generated by this scheme will never be truly "random", since to satisfy this constraint, you will have to, well, constrain some of them.
But, that warning aside, here's one way to generate such arrays:
Generate a pair (a,b) where a = rand(), and b=-rand()
If a+b < 0, reverse keep (a,b), otherwise a,b = -a, -b
Repeat with another pair (c,d)
If a+b+c+d < 1 then keep (c,d), otherwise c,d = -c, -d.
If a+b+c+d < 0 then a,b,c,d = -a, -b, -c, -d
You should now have a positive a+b+c+d. Obtain a random number e in the range [-(a+b+c+d), 0]
Your final number is f = 1 - (a+b+c+d+e)
rough (untested) example of python implementation:
def getSix():
a = numpy.random.rand()
b = -numpy.random.rand()
a,b = (a,b) if a+b > 0 else (-a,-b)
c = numpy.random.rand()
d = -numpy.random.rand()
c,d = (c,d) if a+b+c+d < 1 else (-c, -d)
a,b,c,d = (a,b,c,d) if a+b+c+d > 0 else (-a, -b, -c, -d)
e = -numpy.random.rand()*(a+b+c+d)
f = 1 - (a+b+c+d+e)
return numpy.array([a,b,c,d,e,f])
Obviously this is very specific to your case of 6 elements, but you could figure out a way to generalise it to more elements if necessary.

It's may be counter-intuitive, but actually, you can't :
1) the last one of them must be non-random so that the sum equals 1.
2) the penultimate must be in an interval such that when summed with
the first four, the result is between 0 and 2.
3) the antepenultimate must be in an interval such that when summed with
the first three, the result is between -1 and 3. and so on...
The numbers are constrained, so they can't be random.
What you can do to approach the solution is to use a backtracking algorithm.

There is no solution to your problem. Your problem can be formulated with the following equation:
x + y + z + p + q + r = 1, where x, y, z > 0 and p, q, r < 0.
As you can see, this is an indeterminate equation. This means that it has infinite solutions, thus your requirements cannot be met.

Not truly random due to the sum constraint and a floating point sum to an exact value can be problematic due to floating point rounding as well, but with a little brute force, the Decimal module and computing the sixth value of five random numbers, the following example works.
The code below generates five random numbers and computes sixth for the sum requirement and loops until 3 of 6 numbers are negative:
from decimal import Decimal
import random
def get_nums():
while True:
# Use integers representing 100ths to prevent float rounding error.
nums = [random.randint(-100,100) for _ in range(5)]
nums.append(100-sum(nums)) # compute 6th number so sum is 100.
# Check that 6th num is in right range and three are negative.
if -100 <= nums[5] <= 100 and sum([1 if num < 0 else 0 for num in nums]) == 3:
break
return [Decimal(n) / 100 for n in nums]
for _ in range(10):
print(get_nums())
Output:
[Decimal('-0.36'), Decimal('0.89'), Decimal('0.39'), Decimal('1'), Decimal('-0.05'), Decimal('-0.87')]
[Decimal('-0.01'), Decimal('0.12'), Decimal('0.98'), Decimal('-0.48'), Decimal('-0.45'), Decimal('0.84')]
[Decimal('0.99'), Decimal('0.5'), Decimal('-0.29'), Decimal('0.49'), Decimal('-0.65'), Decimal('-0.04')]
[Decimal('0.51'), Decimal('-0.03'), Decimal('0.64'), Decimal('-0.96'), Decimal('0.99'), Decimal('-0.15')]
[Decimal('0.51'), Decimal('-0.27'), Decimal('-0.62'), Decimal('0.67'), Decimal('-0.22'), Decimal('0.93')]
[Decimal('-0.25'), Decimal('0.84'), Decimal('-0.23'), Decimal('-0.59'), Decimal('0.94'), Decimal('0.29')]
[Decimal('0.75'), Decimal('-0.39'), Decimal('0.86'), Decimal('-0.81'), Decimal('0.6'), Decimal('-0.01')]
[Decimal('-0.4'), Decimal('-0.46'), Decimal('0.89'), Decimal('0.94'), Decimal('0.27'), Decimal('-0.24')]
[Decimal('-0.87'), Decimal('0.6'), Decimal('0.95'), Decimal('-0.12'), Decimal('0.9'), Decimal('-0.46')]
[Decimal('0.5'), Decimal('-0.58'), Decimal('-0.04'), Decimal('-0.41'), Decimal('0.68'), Decimal('0.85')]

Here is a simple approach that uses sum-normalization:
import numpy as np
def gen_rand(n):
result = np.array((np.random.random(n // 2) + 1).tolist() + (np.random.random(n - n // 2) - 1).tolist())
result /= np.sum(result)
return result
rand_arr = gen_rand(6)
print(rand_arr, np.sum(rand_arr))
# [ 0.70946589 0.62584558 0.77423647 -0.51977241 -0.28432949 -0.30544603] 1.0
Basically, we generate N / 2 numbers in the range (1, 2) and N / 2 numbers in the range (0, -1) and then these gets sum-normalized.
The algorithm does not guarantee exact results, but as N increases the probability of invalid results goes to zero.
N = 100000
failures = 0
for i in range(N):
rand_arr = gen_rand(6)
if np.any(rand_arr > 1) or np.any(rand_arr < -1):
failures += 1
print(f'Drawings: {N}, Failures: {failures}, Rate: {failures / N:.2%}')
# Drawings: 100000, Failures: 1931, Rate: 1.93%
Therefore, you could just combine it with some other logic to just discard invalid generations, if you are strict on the requirements, e.g.:
import numpy as np
def gen_rand_strict(n):
result = np.full(n, 2.0, dtype=float)
while np.any(result > 1) or np.any(result < -1):
result = np.array((np.random.random(n // 2) + 1).tolist() + (np.random.random(n - n // 2) - 1).tolist())
result /= np.sum(result)
return result
rand_arr = gen_rand_strict(6)
print(rand_arr, np.sum(rand_arr))
# [ 0.55928446 0.5434739 0.38103321 -0.24799626 -0.09556285 -0.14023246] 1.0
N = 100000
failures = 0
for i in range(N):
rand_arr = gen_rand_strict(6)
if np.any(rand_arr > 1) or np.any(rand_arr < -1):
failures += 1
print(f'Drawings: {N}, Failures: {failures}, Rate: {failures / N:.2%}')
# Drawings: 100000, Failures: 0, Rate: 0.00%
Actually, with the logic to ensure values within the exact range, you do not need to put much thinking into the ranges and this would also work well:
import numpy as np
def gen_rand_strict(n):
result = np.full(n, 2.0, dtype=float)
while np.any(result > 1) or np.any(result < -1):
result = np.array((np.random.random(n // 2) * n).tolist() + (np.random.random(n - n // 2) * -n).tolist())
result /= np.sum(result)
return result
rand_arr = gen_rand_strict(6)
print(rand_arr, np.sum(rand_arr))
# [ 0.32784034 0.75649476 0.8097567 -0.2923395 -0.41629451 -0.1854578 ] 1.0
N = 100000
failures = 0
for i in range(N):
rand_arr = gen_rand_strict(6)
if np.any(rand_arr > 1) or np.any(rand_arr < -1):
failures += 1
print(f'Drawings: {N}, Failures: {failures}, Rate: {failures / N:.2%}')
# Drawings: 100000, Failures: 0, Rate: 0.00%

How to generate random values in range (-1, 1) such that the total sum is 0?

If the sum is 1, I could just divide the values by their sum. However, this approach is not applicable when the sum is 0.
Maybe I could compute the opposite of each value I sample, so I would always have a pair of numbers, such that their sum is 0. However this approach reduces the "randomness" I would like to have in my random array.
Are there better approaches?
Edit: the array length can vary (from 3 to few hundreds), but it has to be fixed before sampling.

There is a Dirichlet-Rescale (DRS) algorithm that generates random numbers summing up to a given number. As it says, it has the feature that
the vectors are uniformly distributed over the valid region of the
domain of all possible vectors, bounded by the constraints.
There is also a Python library for it.

You could use sklearns Standardscaler. It scales your data to have a variance of 1 and a mean of 0. The mean of 0 is equivalent to a sum of 0.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np
rand_numbers = StandardScaler().fit_transform(np.random.rand(100,1, ))
If you don't want to use sklearn you can standardize by hand, the formula is pretty simple:
rand_numbers = np.random.rand(1000,1, )
rand_numbers = (rand_numbers - np.mean(rand_numbers)) / np.std(rand_numbers)
The problem here is the variance of 1, that causes numbers greater than 1 or smaller than -1. Therefor you devide the array by its max abs value.
rand_numbers = rand_numbers*(1/max(abs(rand_numbers)))
Now you have an array with values between -1 and 1 with a sum really close to zero.
print(sum(rand_numbers))
print(min(rand_numbers))
print(max(rand_numbers))
Output:
[-1.51822999e-14]
[-0.99356294]
[1.]
What you will have with this solution is either one 1 or one -1 in your data allways. If you would want to avoid this you could add a positive random factor to the division through the max abs. rand_numbers*(1/(max(abs(rand_numbers))+randomfactor))
Edit
As #KarlKnechtel mentioned the division by the standard deviation is redundant with the division by max absolute value.
The above can be simply done by:
rand_numbers = np.random.rand(100000,1, )
rand_numbers = rand_numbers - np.mean(rand_numbers)
rand_numbers = rand_numbers / max(abs(rand_numbers))

I would try the following solution:
def draw_randoms_while_sum_not_zero(eps):
r = random.uniform(-1, 1)
sum = r
yield r
while abs(sum) > eps:
if sum > 0:
r = random.uniform(-1, 0)
else:
r = random.uniform(0,1)
sum += r
yield r
As the floating point numbers are not perfectly accurate, you can never be sure, that the numbers you'd draw might sum up to 0. You need to decide, what margin is acceptable and call the above generator.
It'll yield (lazily return) random numbers as you need them as long as they don't sum up to 0 ± eps
epss = [0.1, 0.01, 0.001, 0.0001, 0.00001]
for eps in epss:
lengths = []
for _ in range(100):
lengths.append(len(list(draw_randoms_while_sum_not_zero(eps))))
print(f'{eps}: min={min(lengths)}, max={max(lengths)}, avg={sum(lengths)/len(lengths)}')
Results:
0.1: min=1, max=24, avg=6.1
0.01: min=1, max=174, avg=49.27
0.001: min=4, max=2837, avg=421.41
0.0001: min=5, max=21830, avg=4486.51
1e-05: min=183, max=226286, avg=48754.42

Since you are fine with the approach of generating lots of numbers and dividing by the sum, why not generate n/2 positive numbers divide by sum. Then generate n/2 negative numbers and divide by sum?
Want a random positive to negative mix? Randomly generate that mix randomly first then continue.

One way to generate such list is by having the opposite number.
If that is not a desirable property, you can introduce some extra randomness by adding / subtracting the same random value to different opposite couples, e.g.:
def exact_sum_uniform_random(num, min_val=-1.0, max_val=1.0, epsilon=0.1):
items = [random.uniform(min_val, max_val) for _ in range(num // 2)]
opposites = [-x for x in items]
if num % 2 != 0:
items.append(0.0)
for i in range(len(items)):
diff = random.random() * epsilon
if items[i] + diff <= max_val \
and any(opposite - diff >= min_val for opposite in opposites):
items[i] += diff
modified = False
while not modified:
j = random.randint(0, num // 2 - 1)
if opposites[j] - diff >= min_val:
opposites[j] -= diff
modified = True
result = items + opposites
random.shuffle(result)
return result
random.seed(0)
x = exact_sum_uniform_random(3)
print(x, sum(x))
# [0.7646391433441265, -0.7686875811622043, 0.004048437818077755] 2.2551405187698492e-17
EDIT
If the upper and lower limits are not strict, a simple way to construct a zero sum sequence is to sum-normalize two separate sequences to 1 and -1 and join them together:
def norm(items, scale):
return [item / scale for item in items]
def zero_sum_uniform_random(num, min_val=-1.0, max_val=1.0):
a = [random.uniform(min_val, max_val) for _ in range(num // 2)]
a = norm(a, sum(a))
b = [random.uniform(min_val, max_val) for _ in range(num - len(a))]
b = norm(b, -sum(b))
result = a + b
random.shuffle(result)
return result
random.seed(0)
n = 3
x = exact_mean_uniform_random(n)
print(exact_mean_uniform_random(n), sum(x))
# [1.0, 2.2578843364303585, -3.2578843364303585] 0.0
Note that both approaches will not have, in general, a uniform distribution.

How to find sum of cubes of the divisors for every number from 1 to input number x in python where x can be very large

Examples,
1.Input=4
Output=111
Explanation,
1 = 1³(divisors of 1)
2 = 1³ + 2³(divisors of 2)
3 = 1³ + 3³(divisors of 3)
4 = 1³ + 2³ + 4³(divisors of 4)
------------------------
sum = 111(output)
1.Input=5
Output=237
Explanation,
1 = 1³(divisors of 1)
2 = 1³ + 2³(divisors of 2)
3 = 1³ + 3³(divisors of 3)
4 = 1³ + 2³ + 4³(divisors of 4)
5 = 1³ + 5³(divisors of 5)
-----------------------------
sum = 237 (output)
x=int(raw_input().strip())
tot=0
for i in range(1,x+1):
for j in range(1,i+1):
if(i%j==0):
tot+=j**3
print tot
Using this code I can find the answer for small number less than one million.
But I want to find the answer for very large numbers. Is there any algorithm
for how to solve it easily for large numbers?

Offhand I don't see a slick way to make this truly efficient, but it's easy to make it a whole lot faster. If you view your examples as matrices, you're summing them a row at a time. This requires, for each i, finding all the divisors of i and summing their cubes. In all, this requires a number of operations proportional to x**2.
You can easily cut that to a number of operations proportional to x, by summing the matrix by columns instead. Given an integer j, how many integers in 1..x are divisible by j? That's easy: there are x//j multiples of j in the range, so divisor j contributes j**3 * (x // j) to the grand total.
def better(x):
return sum(j**3 * (x // j) for j in range(1, x+1))
That runs much faster, but still takes time proportional to x.
There are lower-level tricks you can play to speed that in turn by constant factors, but they still take O(x) time overall. For example, note that x // j == 1 for all j such that x // 2 < j <= x. So about half the terms in the sum can be skipped, replaced by closed-form expressions for a sum of consecutive cubes:
def sum3(x):
"""Return sum(i**3 for i in range(1, x+1))"""
return (x * (x+1) // 2)**2
def better2(x):
result = sum(j**3 * (x // j) for j in range(1, x//2 + 1))
result += sum3(x) - sum3(x//2)
return result
better2() is about twice as fast as better(), but to get faster than O(x) would require deeper insight.
Quicker
Thinking about this in spare moments, I still don't have a truly clever idea. But the last idea I gave can be carried to a logical conclusion: don't just group together divisors with only one multiple in range, but also those with two multiples in range, and three, and four, and ... That leads to better3() below, which does a number of operations roughly proportional to the square root of x:
def better3(x):
result = 0
for i in range(1, x+1):
q1 = x // i
# value i has q1 multiples in range
result += i**3 * q1
# which values have i multiples?
q2 = x // (i+1) + 1
assert x // q1 == i == x // q2
if i < q2:
result += i * (sum3(q1) - sum3(q2 - 1))
if i+1 >= q2: # this becomes true when i reaches roughly sqrt(x)
break
return result
Of course O(sqrt(x)) is an enormous improvement over the original O(x**2), but for very large arguments it's still impractical. For example better3(10**6) appears to complete instantly, but better3(10**12) takes a few seconds, and better3(10**16) is time for a coffee break ;-)
Note: I'm using Python 3. If you're using Python 2, use xrange() instead of range().
One more
better4() has the same O(sqrt(x)) time behavior as better3(), but does the summations in a different order that allows for simpler code and fewer calls to sum3(). For "large" arguments, it's about 50% faster than better3() on my box.
def better4(x):
result = 0
for i in range(1, x+1):
d = x // i
if d >= i:
# d is the largest divisor that appears `i` times, and
# all divisors less than `d` also appear at least that
# often. Account for one occurence of each.
result += sum3(d)
else:
i -= 1
lastd = x // i
# We already accounted for i occurrences of all divisors
# < lastd, and all occurrences of divisors >= lastd.
# Account for the rest.
result += sum(j**3 * (x // j - i)
for j in range(1, lastd))
break
return result
It may be possible to do better by extending the algorithm in "A Successive Approximation Algorithm for Computing the Divisor Summatory Function". That takes O(cube_root(x)) time for the possibly simpler problem of summing the number of divisors. But it's much more involved, and I don't care enough about this problem to pursue it myself ;-)
Subtlety
There's a subtlety in the math that's easy to miss, so I'll spell it out, but only as it pertains to better4().
After d = x // i, the comment claims that d is the largest divisor that appears i times. But is that true? The actual number of times d appears is x // d, which we did not compute. How do we know that x // d in fact equals i?
That's the purpose of the if d >= i: guarding that comment. After d = x // i we know that
x == d*i + r
for some integer r satisfying 0 <= r < i. That's essentially what floor division means. But since d >= i is also known (that's what the if test ensures), it must also be the case that 0 <= r < d. And that's how we know x // d is i.
This can break down when d >= i is not true, which is why a different method needs to be used then. For example, if x == 500 and i == 51, d (x // i) is 9, but it's certainly not the case that 9 is the largest divisor that appears 51 times. In fact, 9 appears 500 // 9 == 55 times. While for positive real numbers
d == x/i
if and only if
i == x/d
that's not always so for floor division. But, as above, the first does imply the second if we also know that d >= i.
Just for Fun
better5() rewrites better4() for about another 10% speed gain. The real pedagogical point is to show that it's easy to compute all the loop limits in advance. Part of the point of the odd code structure above is that it magically returns 0 for a 0 input without needing to test for that. better5() gives up on that:
def isqrt(n):
"Return floor(sqrt(n)) for int n > 0."
g = 1 << ((n.bit_length() + 1) >> 1)
d = n // g
while d < g:
g = (d + g) >> 1
d = n // g
return g
def better5(x):
assert x > 0
u = isqrt(x)
v = x // u
return (sum(map(sum3, (x // d for d in range(1, u+1)))) +
sum(x // i * i**3 for i in range(1, v)) -
u * sum3(v-1))

def sum_divisors(n):
sum = 0
i = 0
for i in range (1, n) :
if n % i == 0 and n != 0 :
sum = sum + i
# Return the sum of all divisors of n, not including n
return sum
print(sum_divisors(0))
# 0
print(sum_divisors(3)) # Should sum of 1
# 1
print(sum_divisors(36)) # Should sum of 1+2+3+4+6+9+12+18
# 55
print(sum_divisors(102)) # Should be sum of 2+3+6+17+34+51
# 114

Get permutation with specified degree by index number

I've been working on this for hours but couldn't figure it out.
Define a permutation's degree to be the minimum number of transpositions that need to be composed to create it. So a the degree of (0, 1, 2, 3) is 0, the degree of (0, 1, 3, 2) is 1, the degree of (1, 0, 3, 2) is 2, etc.
Look at the space Snd as the space of all permutations of a sequence of length n that have degree d.
I want two algorithms. One that takes a permutation in that space and assigns it an index number, and another that takes an index number of an item in Snd and retrieves its permutation. The index numbers should obviously be successive (i.e. in the range 0 to len(Snd)-1, with each permutation having a distinct index number.)
I'd like this implemented in O(sane); which means that if you're asking for permutation number 17, the algorithm shouldn't go over all the permutations between 0 and 16 to retrieve your permutation.
Any idea how to solve this?
(If you're going to include code, I prefer Python, thank you.)
Update:
I want a solution in which
The permutations are ordered according to their lexicographic order (and not by manually ordering them, but by an efficient algorithm that gives them with lexicographic order to begin with) and
I want the algorithm to accept a sequence of different degrees as well, so I could say "I want permutation number 78 out of all permutations of degrees 1, 3 or 4 out of the permutation space of range(5)". (Basically the function would take a tuple of degrees.) This'll also affect the reverse function that calculates index from permutation; based on the set of degrees, the index would be different.
I've tried solving this for the last two days and I was not successful. If you could provide Python code, that'd be best.

The permutations of length n and degree d are exactly those that can be written as a composition of k = n - d cycles that partition the n elements. The number of such permutations is given by the Stirling numbers of the first kind, written n atop k in square brackets.
Stirling numbers of the first kind satisfy a recurrence relation
[n] [n - 1] [n - 1]
[ ] = (n - 1) [ ] + [ ]
[k] [ k ] [k - 1],
which means, intuitively, the number of ways to partition n elements into k cycles is to partition n - 1 non-maximum elements into k cycles and splice in the maximum element in one of n - 1 ways, or put the maximum element in its own cycle and partition the n - 1 non-maximum elements into k - 1 cycles. Working from a table of recurrence values, it's possible to trace the decisions down the line.
memostirling1 = {(0, 0): 1}
def stirling1(n, k):
if (n, k) not in memostirling1:
if not (1 <= k <= n): return 0
memostirling1[(n, k)] = (n - 1) * stirling1(n - 1, k) + stirling1(n - 1, k - 1)
return memostirling1[(n, k)]
def unrank(n, d, i):
k = n - d
assert 0 <= i <= stirling1(n, k)
if d == 0:
return list(range(n))
threshold = stirling1(n - 1, k - 1)
if i < threshold:
perm = unrank(n - 1, d, i)
perm.append(n - 1)
else:
(q, r) = divmod(i - threshold, stirling1(n - 1, k))
perm = unrank(n - 1, d - 1, r)
perm.append(perm[q])
perm[q] = n - 1
return perm

This answer is less elegant/efficient than my other one, but it describes a polynomial-time algorithm that copes with the additional constraints on the ordering of permutations. I'm going to describe a subroutine that, given a prefix of an n-element permutation and a set of degrees, counts how many permutations have that prefix and a degree belonging to the set. Given this subroutine, we can do an n-ary search for the permutation of a specified rank in the specified subset, extending the known prefix one element at a time.
We can visualize an n-element permutation p as an n-vertex, n-arc directed graph where, for each vertex v, there is an arc from v to p(v). This digraph consists of a collection of vertex-disjoint cycles. For example, the permutation 31024 looks like
_______
/ \
\->2->0->3
__ __
/ | / |
1<-/ 4<-/ .
Given a prefix of a permutation, we can visualize the subgraph corresponding to that prefix, which will be a collection of vertex-disjoint paths and cycles. For example, the prefix 310 looks like
2->0->3
__
/ |
1<-/ .
I'm going to describe a bijection between (1) extensions of this prefix that are permutations and (2) complete permutations on a related set of elements. This bijection preserves up to a constant term the number of cycles (which is the number of elements minus the degree). The constant term is the number of cycles in the prefix.
The permutations mentioned in (2) are on the following set of elements. Start with the original set, delete all elements involved in cycles that are complete in the prefix, and introduce a new element for each path. For example, if the prefix is 310, then we delete the complete cycle 1 and introduce a new element A for the path 2->0->3, resulting in the set {4, A}. Now, given a permutation in set (1), we obtain a permutation in set (2) by deleting the known cycles and replacing each path by its new element. For example, the permutation 31024 corresponds to the permutation 4->4, A->A, and the permutation 31042 corresponds to the permutation 4->A, A->4. I claim (1) that this map is a bijection and (2) that it preserves degrees as described before.
The definition, more or less, of the (n,k)-th Stirling number of the first kind, written
[n]
[ ]
[k]
(ASCII art square brackets), is the number of n-element permutations of degree n - k. To compute the number of extensions of an r-element prefix of an n-element permutation, count c, the number of complete cycles in the prefix. Sum, for each degree d in the specified set, the Stirling number
[ n - r ]
[ ]
[n - d - c]
of the first kind, taking the terms with "impossible" indices to be zero (some analytically motivated definitions of the Stirling numbers are nonzero in unexpected places).
To get a rank from a permutation, we do n-ary search again, except this time, we use the permutation rather than the rank to guide the search.
Here's some Python code for both (including a test function).
import itertools
memostirling1 = {(0, 0): 1}
def stirling1(n, k):
ans = memostirling1.get((n, k))
if ans is None:
if not 1 <= k <= n: return 0
ans = (n - 1) * stirling1(n - 1, k) + stirling1(n - 1, k - 1)
memostirling1[(n, k)] = ans
return ans
def cyclecount(prefix):
c = 0
visited = [False] * len(prefix)
for (i, j) in enumerate(prefix):
while j < len(prefix) and not visited[j]:
visited[j] = True
if j == i:
c += 1
break
j = prefix[j]
return c
def extcount(n, dset, prefix):
c = cyclecount(prefix)
return sum(stirling1(n - len(prefix), n - d - c) for d in dset)
def unrank(n, dset, rnk):
assert rnk >= 0
choices = set(range(n))
prefix = []
while choices:
for i in sorted(choices):
prefix.append(i)
count = extcount(n, dset, prefix)
if rnk < count:
choices.remove(i)
break
del prefix[-1]
rnk -= count
else:
assert False
return tuple(prefix)
def rank(n, dset, perm):
assert n == len(perm)
rnk = 0
prefix = []
choices = set(range(n))
for j in perm:
choices.remove(j)
for i in sorted(choices):
if i < j:
prefix.append(i)
rnk += extcount(n, dset, prefix)
del prefix[-1]
prefix.append(j)
return rnk
def degree(perm):
return len(perm) - cyclecount(perm)
def test(n, dset):
for (rnk, perm) in enumerate(perm for perm in itertools.permutations(range(n)) if degree(perm) in dset):
assert unrank(n, dset, rnk) == perm
assert rank(n, dset, perm) == rnk
test(7, {2, 3, 5})

I think you're looking for a variant of the Levenshtein distance which is used to measure the number of edits between two strings. The efficient way to compute this is by employing a technique called dynamic programming - a pseudo-algorithm for the "normal" Levenshtein distance is provided in the linked article. You would need to adapt this to account for the fact that instead of adding, deleting, or substituting a character, the only allowed operation is exchanging elements at two positions.
Concerning your second algorithm: It's not a 1:1 relationship between degrees of permutation and "a" resulting permutation, instead the number of possible results grows exponentially with the number of swaps: For a sequence of k elements, there's k*(k-1)/2 possible pairs of indices between which to swap. If we call that number l, after d swaps you have l^d possible results (even though some of them might be identical, as in first swapping 0<>1 then 2<>3, or first 2<>3 then 0<>1).

I wrote this stackoverflow answer to a similar problem: https://stackoverflow.com/a/13056801/10562 . Could it help?
The difference might be in the swapping bit for generating the perms, but an index-to-perm and perm-to-index function is given in Python.
I later went on to create this Rosetta Code task that is fleshed out with references and more code: http://rosettacode.org/wiki/Permutations/Rank_of_a_permutation.
Hope it helps :-)

The first part is straight forward if you work wholly in the lexiographic side of things. Given my answer on the other thread, you can go from a permutation to the factorial representation instantly. Basically, you imagine a list {0,1,2,3} and the number that I need to go along is the factorial representation, so for 1,2,3,4, i keep taking the zeroth element and get 000 (0*3+0*!2+0*!1!).
0,1,2,3, => 000
1032 = 3!+1! = 8th permuation (as 000 is the first permutation) => 101
And you can work out the degree trivially, as each transposition which swaps a pair of numbers (a,b) a
So 0123 -> 1023 is 000 -> 100.
if a>b you swap the numbers and then subtract one from the right hand number.
Given two permuations/lexiographic numbers, I just permute the digits from right to left like a bubble sort, counting the degree that I need, and building the new lexiographic number as I go. So to go from 0123 to the 1032 i first move the 1 to the left, then the zero is in the right position, and then I move the 2 into position, and both of those had pairs with the rh number greater than the left hand number, so both add a 1, so 101.
This deals with your first problem. The second is much more difficult, as the numbers of degree two are not evenly distributed. I don't see anything better than getting the global lexiographic number (global meaning here the number without any exclusions) of the permutation you want, e.g. 78 in your example, and then go through all the lexiographic numbers and each time that you get to one which is degree 2, then add one to your global lexiographic number, e.g. 78 -> 79 when you find the first number of degree 2. Obvioulsy, this will not be fast. Alternatively you could try generating all the numbers of degree to. Given a set of n elements, there are (n-1)(n-2) numbers of degree 2, but its not clear that this holds going forward, at least to me, which might easily be a lot less work than computing all the numbers up to your target. and you could just see which ones have lexiographic number less than your target number, and again add one to its global lexiographic number.
Ill see if i can come up with something better.

This seemed like fun so I thought about it some more.
Let's take David's example of 31042 and find its index. First we determine the degree, which equals the sum of the cardinalities of the permutation cycles, each subtracted by 1.
01234
31042
permutation cycles (0342)(1)
degree = (4-1) + (1-1) = 3
def cycles(prefix):
_cycles = []
i = j = 0
visited = set()
while j < len(prefix):
if prefix[i] == i:
_cycles.append({"is":[i],"incomplete": False})
j = j + 1
i = i + 1
elif not i in visited:
cycle = {"is":[],"incomplete": False}
cycleStart = -1
while True:
if i >= len(prefix):
for k in range(len(_cycles) - 1,-1,-1):
if any(i in cycle["is"] for i in _cycles[k]["is"]):
cycle["is"] = list(set(cycle["is"] + _cycles[k]["is"]))
del _cycles[k]
cycle["incomplete"] = True
_cycles.append(cycle)
break
elif cycleStart == i:
_cycles.append(cycle)
break
else:
if prefix[i] == j + 1:
j = j + 1
visited.add(i)
if cycleStart == -1:
cycleStart = i
cycle["is"].append(i)
i = prefix[i]
while j in visited:
j = j + 1
i = j
return _cycles
def degree(cycles):
d = 0
for i in cycles:
if i["incomplete"]:
d = d + len(i["is"])
else:
d = d + len(i["is"]) - 1
return d
Next we determine how many permutations of degree 3 start with either zero, one, or two; using David's formula:
number of permutations of n=5,d=3 that start with "0" = S(4,4-3) = 6
number of permutations of n=5,d=3 that start with "1" = S(4,4-2) = 11
[just in case you're wondering, I believe the ones starting with "1" are:
(01)(234)
(01)(243)
(201)(34)
(301)(24)
(401)(23)
(2301)(4)
(2401)(3)
(3401)(2)
(3201)(4)
(4201)(3)
(4301)(2) notice what's common to all of them?]
number of permutations of n=5,d=3 that start with "2" = S(4,4-2) = 11
We wonder whether there might be a lexicographically-lower permutation of degree 3 that also starts with "310". The only possibility seems to be 31024:
01234
31024 ?
permutaiton cycles (032)(4)(1)
degree = (3-1) + (1-1) + (1-1) = 2
since its degree is different, we will not apply 31024 to our calculation
The permutations of degree 3 that start with "3" and are lexicographically lower than 31042 must start with the prefix "30". Their count is equal to the number of ways we can maintain "three" before "zero" and "zero" before "one" in our permutation cycles while keeping the sum of the cardinalities of the cycles, each subtracted by 1 (i.e., the degree), at 3.
(031)(24)
(0321)(4)
(0341)(2)
count = 3
It seems that there are 6 + 11 + 11 + 3 = 31 permutations of n=5, d=3 before 31042.
def next(prefix,target):
i = len(prefix) - 1
if prefix[i] < target[i]:
prefix[i] = prefix[i] + 1
elif prefix[i] == target[i]:
prefix.append(0)
i = i + 1
while prefix[i] in prefix[0:i]:
prefix[i] = prefix[i] + 1
return prefix
def index(perm,prefix,ix):
if prefix == perm:
print ix
else:
permD = degree(cycles(perm))
prefixD = degree(cycles(prefix))
n = len(perm) - len(prefix)
k = n - (permD - prefixD)
if prefix != perm[0:len(prefix)] and permD >= prefixD:
ix = ix + S[n][k]
index(perm,next(prefix,perm),ix)
S = [[1]
,[0,1]
,[0,1,1]
,[0,2,3,1]
,[0,6,11,6,1]
,[0,24,50,35,10,1]]
(Let's try to confirm with David' program (I'm using a PC with windows):
C:\pypy>pypy test.py REM print(index([3,1,0,4,2],[0],0))
31
C:\pypy>pypy davids_rank.py REM print(rank(5,{3},[3,1,0,2,4]))
31

A bit late and not in Python but in C#...
I think the following code should work for you. It works for permutation possibilities where for x items, the number of permutations are x!
The algo calculate the index of a permutation and the reverse of it.
using System;
using System.Collections.Generic;
namespace WpfPermutations
{
public class PermutationOuelletLexico3<T>
{
// ************************************************************************
private T[] _sortedValues;
private bool[] _valueUsed;
public readonly long MaxIndex; // long to support 20! or less
// ************************************************************************
public PermutationOuelletLexico3(T[] sortedValues)
{
if (sortedValues.Length <= 0)
{
throw new ArgumentException("sortedValues.Lenght should be greater than 0");
}
_sortedValues = sortedValues;
Result = new T[_sortedValues.Length];
_valueUsed = new bool[_sortedValues.Length];
MaxIndex = Factorial.GetFactorial(_sortedValues.Length);
}
// ************************************************************************
public T[] Result { get; private set; }
// ************************************************************************
/// <summary>
/// Return the permutation relative to the index received, according to
/// _sortedValues.
/// Sort Index is 0 based and should be less than MaxIndex. Otherwise you get an exception.
/// </summary>
/// <param name="sortIndex"></param>
/// <param name="result">Value is not used as inpu, only as output. Re-use buffer in order to save memory</param>
/// <returns></returns>
public void GetValuesForIndex(long sortIndex)
{
int size = _sortedValues.Length;
if (sortIndex < 0)
{
throw new ArgumentException("sortIndex should be greater or equal to 0.");
}
if (sortIndex >= MaxIndex)
{
throw new ArgumentException("sortIndex should be less than factorial(the lenght of items)");
}
for (int n = 0; n < _valueUsed.Length; n++)
{
_valueUsed[n] = false;
}
long factorielLower = MaxIndex;
for (int index = 0; index < size; index++)
{
long factorielBigger = factorielLower;
factorielLower = Factorial.GetFactorial(size - index - 1); // factorielBigger / inverseIndex;
int resultItemIndex = (int)(sortIndex % factorielBigger / factorielLower);
int correctedResultItemIndex = 0;
for(;;)
{
if (! _valueUsed[correctedResultItemIndex])
{
resultItemIndex--;
if (resultItemIndex < 0)
{
break;
}
}
correctedResultItemIndex++;
}
Result[index] = _sortedValues[correctedResultItemIndex];
_valueUsed[correctedResultItemIndex] = true;
}
}
// ************************************************************************
/// <summary>
/// Calc the index, relative to _sortedValues, of the permutation received
/// as argument. Returned index is 0 based.
/// </summary>
/// <param name="values"></param>
/// <returns></returns>
public long GetIndexOfValues(T[] values)
{
int size = _sortedValues.Length;
long valuesIndex = 0;
List<T> valuesLeft = new List<T>(_sortedValues);
for (int index = 0; index < size; index++)
{
long indexFactorial = Factorial.GetFactorial(size - 1 - index);
T value = values[index];
int indexCorrected = valuesLeft.IndexOf(value);
valuesIndex = valuesIndex + (indexCorrected * indexFactorial);
valuesLeft.Remove(value);
}
return valuesIndex;
}
// ************************************************************************
}
}

Not sure how to integrate negative number function in data generating algorithm?

I’m having a bit of trouble controlling the results from a data generating algorithm I am working on. Basically it takes values from a list and then lists all the different combinations to get to a specific sum. So far the code works fine(haven’t tested scaling it with many variables yet), but I need to allow for negative numbers to be include in the list.
The way I think I can solve this problem is to put a collar on the possible results as to prevent infinity results(if apples is 2 and oranges are -1 then for any sum, there will be an infinite solutions but if I say there is a limit of either then it cannot go on forever.)
So Here's super basic code that detects weights:
import math
data = [-2, 10,5,50,20,25,40]
target_sum = 100
max_percent = .8 #no value can exceed 80% of total(this is to prevent infinite solutions
for node in data:
max_value = abs(math.floor((target_sum * max_percent)/node))
print node, "'s max value is ", max_value
Here's the code that generates the results(first function generates a table if its possible and the second function composes the actual results. Details/pseudo code of the algo is here: Can brute force algorithms scale? ):
from collections import defaultdict
data = [-2, 10,5,50,20,25,40]
target_sum = 100
# T[x, i] is True if 'x' can be solved
# by a linear combination of data[:i+1]
T = defaultdict(bool) # all values are False by default
T[0, 0] = True # base case
for i, x in enumerate(data): # i is index, x is data[i]
for s in range(target_sum + 1): #set the range of one higher than sum to include sum itself
for c in range(s / x + 1):
if T[s - c * x, i]:
T[s, i+1] = True
coeff = [0]*len(data)
def RecursivelyListAllThatWork(k, sum): # Using last k variables, make sum
# /* Base case: If we've assigned all the variables correctly, list this
# * solution.
# */
if k == 0:
# print what we have so far
print(' + '.join("%2s*%s" % t for t in zip(coeff, data)))
return
x_k = data[k-1]
# /* Recursive step: Try all coefficients, but only if they work. */
for c in range(sum // x_k + 1):
if T[sum - c * x_k, k - 1]:
# mark the coefficient of x_k to be c
coeff[k-1] = c
RecursivelyListAllThatWork(k - 1, sum - c * x_k)
# unmark the coefficient of x_k
coeff[k-1] = 0
RecursivelyListAllThatWork(len(data), target_sum)
My problem is, I don't know where/how to integrate my limiting code to the main code inorder to restrict results and allow for negative numbers. When I add a negative number to the list, it displays it but does not include it in the output. I think this is due to it not being added to the table(first function) and I'm not sure how to have it added(and still keep the programs structure so I can scale it with more variables).
Thanks in advance and if anything is unclear please let me know.
edit: a bit unrelated(and if detracts from the question just ignore, but since your looking at the code already, is there a way I can utilize both cpus on my machine with this code? Right now when I run it, it only uses one cpu. I know the technical method of parallel computing in python but not sure how to logically parallelize this algo)

You can restrict results by changing both loops over c from
for c in range(s / x + 1):
to
max_value = int(abs((target_sum * max_percent)/x))
for c in range(max_value + 1):
This will ensure that any coefficient in the final answer will be an integer in the range 0 to max_value inclusive.
A simple way of adding negative values is to change the loop over s from
for s in range(target_sum + 1):
to
R=200 # Maximum size of any partial sum
for s in range(-R,R+1):
Note that if you do it this way then your solution will have an additional constraint.
The new constraint is that the absolute value of every partial weighted sum must be <=R.
(You can make R large to avoid this constraint reducing the number of solutions, but this will slow down execution.)
The complete code looks like:
from collections import defaultdict
data = [-2,10,5,50,20,25,40]
target_sum = 100
# T[x, i] is True if 'x' can be solved
# by a linear combination of data[:i+1]
T = defaultdict(bool) # all values are False by default
T[0, 0] = True # base case
R=200 # Maximum size of any partial sum
max_percent=0.8 # Maximum weight of any term
for i, x in enumerate(data): # i is index, x is data[i]
for s in range(-R,R+1): #set the range of one higher than sum to include sum itself
max_value = int(abs((target_sum * max_percent)/x))
for c in range(max_value + 1):
if T[s - c * x, i]:
T[s, i+1] = True
coeff = [0]*len(data)
def RecursivelyListAllThatWork(k, sum): # Using last k variables, make sum
# /* Base case: If we've assigned all the variables correctly, list this
# * solution.
# */
if k == 0:
# print what we have so far
print(' + '.join("%2s*%s" % t for t in zip(coeff, data)))
return
x_k = data[k-1]
# /* Recursive step: Try all coefficients, but only if they work. */
max_value = int(abs((target_sum * max_percent)/x_k))
for c in range(max_value + 1):
if T[sum - c * x_k, k - 1]:
# mark the coefficient of x_k to be c
coeff[k-1] = c
RecursivelyListAllThatWork(k - 1, sum - c * x_k)
# unmark the coefficient of x_k
coeff[k-1] = 0
RecursivelyListAllThatWork(len(data), target_sum)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.