Get permutation with specified degree by index number


I've been working on this for hours but couldn't figure it out.
Define a permutation's degree to be the minimum number of transpositions that need to be composed to create it. So the degree of (0, 1, 2, 3) is 0, the degree of (0, 1, 3, 2) is 1, the degree of (1, 0, 3, 2) is 2, etc.
Look at the space Snd as the space of all permutations of a sequence of length n that have degree d.
I want two algorithms. One that takes a permutation in that space and assigns it an index number, and another that takes an index number of an item in Snd and retrieves its permutation. The index numbers should obviously be successive (i.e. in the range 0 to len(Snd)-1, with each permutation having a distinct index number.)
I'd like this implemented in O(sane); which means that if you're asking for permutation number 17, the algorithm shouldn't go over all the permutations between 0 and 16 to retrieve your permutation.
Any idea how to solve this?
(If you're going to include code, I prefer Python, thank you.)
Update:
I want a solution in which:
1. The permutations are ordered according to their lexicographic order (and not by manually ordering them, but by an efficient algorithm that gives them in lexicographic order to begin with), and
2. The algorithm accepts a sequence of different degrees as well, so I could say "I want permutation number 78 out of all permutations of degrees 1, 3 or 4 out of the permutation space of range(5)". (Basically the function would take a tuple of degrees.) This will also affect the reverse function that calculates the index from a permutation; based on the set of degrees, the index would be different.
I've tried solving this for the last two days and I was not successful. If you could provide Python code, that'd be best.

The permutations of length n and degree d are exactly those that can be written as a composition of k = n - d cycles that partition the n elements. The number of such permutations is given by the Stirling numbers of the first kind, written n atop k in square brackets.
Stirling numbers of the first kind satisfy a recurrence relation
$$\left[{n \atop k}\right] = (n - 1)\left[{n - 1 \atop k}\right] + \left[{n - 1 \atop k - 1}\right],$$
which means, intuitively, the number of ways to partition n elements into k cycles is to partition n - 1 non-maximum elements into k cycles and splice in the maximum element in one of n - 1 ways, or put the maximum element in its own cycle and partition the n - 1 non-maximum elements into k - 1 cycles. Working from a table of recurrence values, it's possible to trace the decisions down the line.
memostirling1 = {(0, 0): 1}

def stirling1(n, k):
    if (n, k) not in memostirling1:
        if not (1 <= k <= n): return 0
        memostirling1[(n, k)] = (n - 1) * stirling1(n - 1, k) + stirling1(n - 1, k - 1)
    return memostirling1[(n, k)]

def unrank(n, d, i):
    k = n - d
    # valid indices run from 0 to stirling1(n, k) - 1
    assert 0 <= i < stirling1(n, k)
    if d == 0:
        return list(range(n))
    threshold = stirling1(n - 1, k - 1)
    if i < threshold:
        # the maximum element n - 1 forms a cycle of its own
        perm = unrank(n - 1, d, i)
        perm.append(n - 1)
    else:
        # splice the maximum element into an existing cycle at position q
        (q, r) = divmod(i - threshold, stirling1(n - 1, k))
        perm = unrank(n - 1, d - 1, r)
        perm.append(perm[q])
        perm[q] = n - 1
    return perm
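The reverse direction (permutation to index) is not shown above. A minimal sketch that undoes the same recurrence could look like the following. The name `rank_single` is hypothetical, `stirling1` is memoized with `functools.lru_cache` instead of a dictionary, and the ordering is the one induced by `unrank`, not lexicographic order.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling1(n, k):
    """Unsigned Stirling number of the first kind: n elements in k cycles."""
    if n == 0 and k == 0:
        return 1
    if not (1 <= k <= n):
        return 0
    return (n - 1) * stirling1(n - 1, k) + stirling1(n - 1, k - 1)

def unrank(n, d, i):
    """Same recurrence-tracing idea as the answer's unrank."""
    k = n - d
    assert 0 <= i < stirling1(n, k)
    if d == 0:
        return list(range(n))
    threshold = stirling1(n - 1, k - 1)
    if i < threshold:
        perm = unrank(n - 1, d, i)
        perm.append(n - 1)
    else:
        q, r = divmod(i - threshold, stirling1(n - 1, k))
        perm = unrank(n - 1, d - 1, r)
        perm.append(perm[q])
        perm[q] = n - 1
    return perm

def rank_single(perm, d):
    """Hypothetical inverse of unrank for a permutation of degree d."""
    n = len(perm)
    k = n - d
    if d == 0:
        return 0
    if perm[-1] == n - 1:
        # the maximum element is a fixed point: first branch of unrank
        return rank_single(perm[:-1], d)
    # otherwise the maximum element was spliced in at position q
    q = perm.index(n - 1)
    smaller = list(perm[:-1])
    smaller[q] = perm[-1]
    return (stirling1(n - 1, k - 1)
            + q * stirling1(n - 1, k)
            + rank_single(smaller, d - 1))
```

Round-tripping every index for a small n is a cheap way to convince yourself that the two functions agree.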

This answer is less elegant/efficient than my other one, but it describes a polynomial-time algorithm that copes with the additional constraints on the ordering of permutations. I'm going to describe a subroutine that, given a prefix of an n-element permutation and a set of degrees, counts how many permutations have that prefix and a degree belonging to the set. Given this subroutine, we can do an n-ary search for the permutation of a specified rank in the specified subset, extending the known prefix one element at a time.
We can visualize an n-element permutation p as an n-vertex, n-arc directed graph where, for each vertex v, there is an arc from v to p(v). This digraph consists of a collection of vertex-disjoint cycles. For example, the permutation 31024 looks like
 _______
/       \
\->2->0->3
 __     __
/  |   /  |
1<-/   4<-/ .
Given a prefix of a permutation, we can visualize the subgraph corresponding to that prefix, which will be a collection of vertex-disjoint paths and cycles. For example, the prefix 310 looks like
2->0->3
 __
/  |
1<-/ .
I'm going to describe a bijection between (1) extensions of this prefix that are permutations and (2) complete permutations on a related set of elements. This bijection preserves up to a constant term the number of cycles (which is the number of elements minus the degree). The constant term is the number of cycles in the prefix.
The permutations mentioned in (2) are on the following set of elements. Start with the original set, delete all elements involved in cycles that are complete in the prefix, and introduce a new element for each path. For example, if the prefix is 310, then we delete the complete cycle 1 and introduce a new element A for the path 2->0->3, resulting in the set {4, A}. Now, given a permutation in set (1), we obtain a permutation in set (2) by deleting the known cycles and replacing each path by its new element. For example, the permutation 31024 corresponds to the permutation 4->4, A->A, and the permutation 31042 corresponds to the permutation 4->A, A->4. I claim (1) that this map is a bijection and (2) that it preserves degrees as described before.
The definition, more or less, of the (n,k)-th Stirling number of the first kind, written $\left[{n \atop k}\right]$, is the number of n-element permutations of degree n - k. To compute the number of extensions of an r-element prefix of an n-element permutation, count c, the number of complete cycles in the prefix. Sum, for each degree d in the specified set, the Stirling number $\left[{n - r \atop n - d - c}\right]$ of the first kind, taking the terms with "impossible" indices to be zero (some analytically motivated definitions of the Stirling numbers are nonzero in unexpected places).
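The counting rule can be checked against brute force on a small case. The sketch below is only an illustration: `complete_cycles`, `count_by_formula` and `count_by_brute_force` are hypothetical names, and the brute-force count simply enumerates all permutations.

```python
from functools import lru_cache
from itertools import permutations

@lru_cache(maxsize=None)
def stirling1(n, k):
    """Unsigned Stirling number of the first kind, zero for impossible indices."""
    if n == 0 and k == 0:
        return 1
    if not (1 <= k <= n):
        return 0
    return (n - 1) * stirling1(n - 1, k) + stirling1(n - 1, k - 1)

def complete_cycles(prefix):
    """Count the cycles that are fully determined by the prefix."""
    seen = [False] * len(prefix)
    count = 0
    for start in range(len(prefix)):
        j = start
        # follow arcs while they stay inside the prefix and are unvisited
        while j < len(prefix) and not seen[j]:
            seen[j] = True
            j = prefix[j]
            if j == start:
                count += 1
    return count

def count_by_formula(n, degrees, prefix):
    c = complete_cycles(prefix)
    r = len(prefix)
    return sum(stirling1(n - r, n - d - c) for d in degrees)

def count_by_brute_force(n, degrees, prefix):
    r = len(prefix)
    return sum(
        1
        for p in permutations(range(n))
        if p[:r] == tuple(prefix) and n - complete_cycles(p) in degrees
    )

# Prefix 310 in a 5-element permutation: the extensions are
# 31024 (degree 2) and 31042 (degree 3).
print(count_by_formula(5, {2, 3}, (3, 1, 0)))  # 2
```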
To get a rank from a permutation, we do n-ary search again, except this time, we use the permutation rather than the rank to guide the search.
Here's some Python code for both (including a test function).
import itertools

memostirling1 = {(0, 0): 1}

def stirling1(n, k):
    ans = memostirling1.get((n, k))
    if ans is None:
        if not 1 <= k <= n: return 0
        ans = (n - 1) * stirling1(n - 1, k) + stirling1(n - 1, k - 1)
        memostirling1[(n, k)] = ans
    return ans

def cyclecount(prefix):
    # number of cycles that are complete within the prefix
    c = 0
    visited = [False] * len(prefix)
    for (i, j) in enumerate(prefix):
        while j < len(prefix) and not visited[j]:
            visited[j] = True
            if j == i:
                c += 1
                break
            j = prefix[j]
    return c

def extcount(n, dset, prefix):
    # number of permutations with this prefix whose degree is in dset
    c = cyclecount(prefix)
    return sum(stirling1(n - len(prefix), n - d - c) for d in dset)

def unrank(n, dset, rnk):
    assert rnk >= 0
    choices = set(range(n))
    prefix = []
    while choices:
        for i in sorted(choices):
            prefix.append(i)
            count = extcount(n, dset, prefix)
            if rnk < count:
                choices.remove(i)
                break
            del prefix[-1]
            rnk -= count
        else:
            assert False
    return tuple(prefix)

def rank(n, dset, perm):
    assert n == len(perm)
    rnk = 0
    prefix = []
    choices = set(range(n))
    for j in perm:
        choices.remove(j)
        for i in sorted(choices):
            if i < j:
                prefix.append(i)
                rnk += extcount(n, dset, prefix)
                del prefix[-1]
        prefix.append(j)
    return rnk

def degree(perm):
    return len(perm) - cyclecount(perm)

def test(n, dset):
    for (rnk, perm) in enumerate(perm for perm in itertools.permutations(range(n)) if degree(perm) in dset):
        assert unrank(n, dset, rnk) == perm
        assert rank(n, dset, perm) == rnk

test(7, {2, 3, 5})

I think you're looking for a variant of the Levenshtein distance which is used to measure the number of edits between two strings. The efficient way to compute this is by employing a technique called dynamic programming - a pseudo-algorithm for the "normal" Levenshtein distance is provided in the linked article. You would need to adapt this to account for the fact that instead of adding, deleting, or substituting a character, the only allowed operation is exchanging elements at two positions.
Concerning your second algorithm: It's not a 1:1 relationship between degrees of permutation and "a" resulting permutation, instead the number of possible results grows exponentially with the number of swaps: For a sequence of k elements, there's k*(k-1)/2 possible pairs of indices between which to swap. If we call that number l, after d swaps you have l^d possible results (even though some of them might be identical, as in first swapping 0<>1 then 2<>3, or first 2<>3 then 0<>1).

I wrote this stackoverflow answer to a similar problem: https://stackoverflow.com/a/13056801/10562 . Could it help?
The difference might be in the swapping bit for generating the perms, but an index-to-perm and perm-to-index function is given in Python.
I later went on to create this Rosetta Code task that is fleshed out with references and more code: http://rosettacode.org/wiki/Permutations/Rank_of_a_permutation.
Hope it helps :-)

The first part is straightforward if you work wholly on the lexicographic side of things. Given my answer on the other thread, you can go from a permutation to the factorial representation instantly. Basically, you imagine a list {0,1,2,3}, and the number that I need to go along with it is the factorial representation, so for 0,1,2,3, I keep taking the zeroth element and get 000 (0*3! + 0*2! + 0*1!).
0,1,2,3, => 000
1032 = 3!+1! = 8th permutation (as 000 is the first permutation) => 101
And you can work out the degree trivially, as each transposition which swaps a pair of numbers (a,b) a
So 0123 -> 1023 is 000 -> 100.
if a>b you swap the numbers and then subtract one from the right hand number.
Given two permutations/lexicographic numbers, I just permute the digits from right to left like a bubble sort, counting the degree that I need, and building the new lexicographic number as I go. So to go from 0123 to 1032, I first move the 1 to the left, then the zero is in the right position, and then I move the 2 into position. Both of those moves had pairs with the right-hand number greater than the left-hand number, so both add a 1, giving 101.
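The factorial representation used in these examples can be computed directly. A small sketch (the names `factorial_digits` and `lex_index` are hypothetical, and this covers only the plain lexicographic index, not the degree restriction):

```python
from itertools import permutations
from math import factorial

def factorial_digits(perm):
    """Digits of the factorial representation, most significant first."""
    remaining = sorted(perm)
    digits = []
    for value in perm[:-1]:  # the last digit is always 0, so it is omitted
        position = remaining.index(value)
        digits.append(position)
        remaining.pop(position)
    return digits

def lex_index(perm):
    """0-based lexicographic index obtained from the factorial digits."""
    digits = factorial_digits(perm)
    m = len(digits)
    return sum(digit * factorial(m - pos) for pos, digit in enumerate(digits))

print(factorial_digits((0, 1, 2, 3)))  # [0, 0, 0]
print(factorial_digits((1, 0, 3, 2)))  # [1, 0, 1]
print(lex_index((1, 0, 3, 2)))         # 7, i.e. the 8th permutation
```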
This deals with your first problem. The second is much more difficult, as the numbers of degree two are not evenly distributed. I don't see anything better than taking the global lexicographic number (global meaning here the number without any exclusions) of the permutation you want, e.g. 78 in your example, and then going through all the lexicographic numbers, adding one to your global lexicographic number each time you reach one of degree 2, e.g. 78 -> 79 when you find the first number of degree 2. Obviously, this will not be fast. Alternatively, you could try generating all the numbers of degree two, which might easily be a lot less work than computing all the numbers up to your target. (Given a set of n elements, there are (n-1)(n-2) numbers of degree 2, but it's not clear to me that this holds going forward.) You could then see which ones have a lexicographic number less than your target number, and again add one to its global lexicographic number.
I'll see if I can come up with something better.

This seemed like fun so I thought about it some more.
Let's take David's example of 31042 and find its index. First we determine the degree, which equals the sum of the cardinalities of the permutation cycles, each subtracted by 1.
01234
31042
permutation cycles (0342)(1)
degree = (4-1) + (1-1) = 3
def cycles(prefix):
    _cycles = []
    i = j = 0
    visited = set()

    while j < len(prefix):
        if prefix[i] == i:
            _cycles.append({"is":[i],"incomplete": False})
            j = j + 1
            i = i + 1
        elif not i in visited:
            cycle = {"is":[],"incomplete": False}
            cycleStart = -1
            while True:
                if i >= len(prefix):
                    for k in range(len(_cycles) - 1,-1,-1):
                        if any(i in cycle["is"] for i in _cycles[k]["is"]):
                            cycle["is"] = list(set(cycle["is"] + _cycles[k]["is"]))
                            del _cycles[k]
                    cycle["incomplete"] = True
                    _cycles.append(cycle)
                    break
                elif cycleStart == i:
                    _cycles.append(cycle)
                    break
                else:
                    if prefix[i] == j + 1:
                        j = j + 1
                    visited.add(i)
                    if cycleStart == -1:
                        cycleStart = i
                    cycle["is"].append(i)
                    i = prefix[i]
        while j in visited:
            j = j + 1
        i = j
    return _cycles

def degree(cycles):
    d = 0
    for i in cycles:
        if i["incomplete"]:
            d = d + len(i["is"])
        else:
            d = d + len(i["is"]) - 1
    return d
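For a complete permutation (no incomplete cycles), the cycle decomposition and the degree can also be sketched much more compactly. This is not the prefix-aware function above, just an illustration with hypothetical names `cycle_decomposition` and `degree_of`:

```python
def cycle_decomposition(perm):
    """Return the cycles of a complete permutation as tuples."""
    seen = [False] * len(perm)
    cycles = []
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycle = []
        j = start
        while not seen[j]:
            seen[j] = True
            cycle.append(j)
            j = perm[j]
        cycles.append(tuple(cycle))
    return cycles

def degree_of(perm):
    """Sum over cycles of (cycle length - 1)."""
    return sum(len(c) - 1 for c in cycle_decomposition(perm))

print(cycle_decomposition([3, 1, 0, 4, 2]))  # [(0, 3, 4, 2), (1,)]
print(degree_of([3, 1, 0, 4, 2]))            # 3
print(degree_of([3, 1, 0, 2, 4]))            # 2
```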
Next we determine how many permutations of degree 3 start with either zero, one, or two; using David's formula:
number of permutations of n=5,d=3 that start with "0" = S(4,4-3) = 6
number of permutations of n=5,d=3 that start with "1" = S(4,4-2) = 11
[just in case you're wondering, I believe the ones starting with "1" are:
(01)(234)
(01)(243)
(201)(34)
(301)(24)
(401)(23)
(2301)(4)
(2401)(3)
(3401)(2)
(3201)(4)
(4201)(3)
(4301)(2) notice what's common to all of them?]
number of permutations of n=5,d=3 that start with "2" = S(4,4-2) = 11
We wonder whether there might be a lexicographically-lower permutation of degree 3 that also starts with "310". The only possibility seems to be 31024:
01234
31024 ?
permutation cycles (032)(4)(1)
degree = (3-1) + (1-1) + (1-1) = 2
since its degree is different, we will not apply 31024 to our calculation
The permutations of degree 3 that start with "3" and are lexicographically lower than 31042 must start with the prefix "30". Their count is equal to the number of ways we can maintain "three" before "zero" and "zero" before "one" in our permutation cycles while keeping the sum of the cardinalities of the cycles, each subtracted by 1 (i.e., the degree), at 3.
(031)(24)
(0321)(4)
(0341)(2)
count = 3
It seems that there are 6 + 11 + 11 + 3 = 31 permutations of n=5, d=3 before 31042.
def next(prefix,target):
    i = len(prefix) - 1
    if prefix[i] < target[i]:
        prefix[i] = prefix[i] + 1
    elif prefix[i] == target[i]:
        prefix.append(0)
        i = i + 1
    while prefix[i] in prefix[0:i]:
        prefix[i] = prefix[i] + 1
    return prefix

def index(perm,prefix,ix):
    if prefix == perm:
        print(ix)
    else:
        permD = degree(cycles(perm))
        prefixD = degree(cycles(prefix))
        n = len(perm) - len(prefix)
        k = n - (permD - prefixD)
        if prefix != perm[0:len(prefix)] and permD >= prefixD:
            ix = ix + S[n][k]
        index(perm,next(prefix,perm),ix)

# unsigned Stirling numbers of the first kind, S[n][k]
S = [[1]
    ,[0,1]
    ,[0,1,1]
    ,[0,2,3,1]
    ,[0,6,11,6,1]
    ,[0,24,50,35,10,1]]
Let's try to confirm with David's program (I'm using a PC with Windows):
C:\pypy>pypy test.py REM print(index([3,1,0,4,2],[0],0))
31
C:\pypy>pypy davids_rank.py REM print(rank(5,{3},[3,1,0,2,4]))
31
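The same number can be cross-checked by plain enumeration, since `itertools.permutations` yields tuples in lexicographic order when its input is sorted. The helper name `degree_of` below is hypothetical:

```python
from itertools import permutations

def degree_of(perm):
    """Degree = number of elements minus number of cycles."""
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if not seen[start]:
            cycles += 1
            j = start
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return len(perm) - cycles

# all permutations of range(5) with degree 3, in lexicographic order
degree_three = [p for p in permutations(range(5)) if degree_of(p) == 3]

print(len(degree_three))                    # 50
print(degree_three.index((3, 1, 0, 4, 2)))  # 31
```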

A bit late and not in Python but in C#...
I think the following code should work for you. It covers the plain case where, for x items, the number of permutations is x! (it does not restrict permutations by degree).
The algorithm calculates the index of a permutation, and the reverse (the permutation for a given index).
using System;
using System.Collections.Generic;

namespace WpfPermutations
{
    public class PermutationOuelletLexico3<T>
    {
        // ************************************************************************
        private T[] _sortedValues;

        private bool[] _valueUsed;

        public readonly long MaxIndex; // long to support 20! or less

        // ************************************************************************
        // Note: Factorial.GetFactorial is a helper that is not shown in this answer.
        public PermutationOuelletLexico3(T[] sortedValues)
        {
            if (sortedValues.Length <= 0)
            {
                throw new ArgumentException("sortedValues.Length should be greater than 0");
            }

            _sortedValues = sortedValues;
            Result = new T[_sortedValues.Length];
            _valueUsed = new bool[_sortedValues.Length];

            MaxIndex = Factorial.GetFactorial(_sortedValues.Length);
        }

        // ************************************************************************
        public T[] Result { get; private set; }

        // ************************************************************************
        /// <summary>
        /// Compute the permutation relative to the index received, according to
        /// _sortedValues, and store it in Result (the buffer is re-used between
        /// calls in order to save memory).
        /// Sort Index is 0 based and should be less than MaxIndex. Otherwise you get an exception.
        /// </summary>
        /// <param name="sortIndex"></param>
        public void GetValuesForIndex(long sortIndex)
        {
            int size = _sortedValues.Length;

            if (sortIndex < 0)
            {
                throw new ArgumentException("sortIndex should be greater or equal to 0.");
            }

            if (sortIndex >= MaxIndex)
            {
                throw new ArgumentException("sortIndex should be less than factorial(the length of items)");
            }

            for (int n = 0; n < _valueUsed.Length; n++)
            {
                _valueUsed[n] = false;
            }

            long factorielLower = MaxIndex;

            for (int index = 0; index < size; index++)
            {
                long factorielBigger = factorielLower;
                factorielLower = Factorial.GetFactorial(size - index - 1); // factorielBigger / inverseIndex;

                int resultItemIndex = (int)(sortIndex % factorielBigger / factorielLower);

                // skip over values that were already used
                int correctedResultItemIndex = 0;
                for(;;)
                {
                    if (! _valueUsed[correctedResultItemIndex])
                    {
                        resultItemIndex--;
                        if (resultItemIndex < 0)
                        {
                            break;
                        }
                    }
                    correctedResultItemIndex++;
                }

                Result[index] = _sortedValues[correctedResultItemIndex];
                _valueUsed[correctedResultItemIndex] = true;
            }
        }

        // ************************************************************************
        /// <summary>
        /// Calc the index, relative to _sortedValues, of the permutation received
        /// as argument. Returned index is 0 based.
        /// </summary>
        /// <param name="values"></param>
        /// <returns></returns>
        public long GetIndexOfValues(T[] values)
        {
            int size = _sortedValues.Length;
            long valuesIndex = 0;

            List<T> valuesLeft = new List<T>(_sortedValues);

            for (int index = 0; index < size; index++)
            {
                long indexFactorial = Factorial.GetFactorial(size - 1 - index);

                T value = values[index];
                int indexCorrected = valuesLeft.IndexOf(value);
                valuesIndex = valuesIndex + (indexCorrected * indexFactorial);
                valuesLeft.Remove(value);
            }

            return valuesIndex;
        }

        // ************************************************************************
    }
}

Related

How to find sum of cubes of the divisors for every number from 1 to input number x in python where x can be very large

Examples,
1. Input = 4
Output=111
Explanation,
1 = 1³(divisors of 1)
2 = 1³ + 2³(divisors of 2)
3 = 1³ + 3³(divisors of 3)
4 = 1³ + 2³ + 4³(divisors of 4)
------------------------
sum = 111(output)
2. Input = 5
Output=237
Explanation,
1 = 1³(divisors of 1)
2 = 1³ + 2³(divisors of 2)
3 = 1³ + 3³(divisors of 3)
4 = 1³ + 2³ + 4³(divisors of 4)
5 = 1³ + 5³(divisors of 5)
-----------------------------
sum = 237 (output)
x=int(raw_input().strip())
tot=0
for i in range(1,x+1):
    for j in range(1,i+1):
        if(i%j==0):
            tot+=j**3
print tot
Using this code I can find the answer for small inputs (less than one million).
But I want to find the answer for very large numbers. Is there an algorithm
that solves this efficiently for large numbers?

Offhand I don't see a slick way to make this truly efficient, but it's easy to make it a whole lot faster. If you view your examples as matrices, you're summing them a row at a time. This requires, for each i, finding all the divisors of i and summing their cubes. In all, this requires a number of operations proportional to x**2.
You can easily cut that to a number of operations proportional to x, by summing the matrix by columns instead. Given an integer j, how many integers in 1..x are divisible by j? That's easy: there are x//j multiples of j in the range, so divisor j contributes j**3 * (x // j) to the grand total.
def better(x):
    return sum(j**3 * (x // j) for j in range(1, x+1))
That runs much faster, but still takes time proportional to x.
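As a quick sanity check of the row-versus-column argument, the two ways of summing can be compared on the examples from the question. The function names here are illustrative only:

```python
def row_sum(x):
    """Original approach: for each i, add the cubes of its divisors."""
    return sum(j**3
               for i in range(1, x + 1)
               for j in range(1, i + 1)
               if i % j == 0)

def column_sum(x):
    """Column approach: divisor j occurs x // j times in 1..x."""
    return sum(j**3 * (x // j) for j in range(1, x + 1))

for x in (4, 5, 100):
    assert row_sum(x) == column_sum(x)

print(column_sum(4), column_sum(5))  # 111 237
```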
There are lower-level tricks you can play to speed that in turn by constant factors, but they still take O(x) time overall. For example, note that x // j == 1 for all j such that x // 2 < j <= x. So about half the terms in the sum can be skipped, replaced by closed-form expressions for a sum of consecutive cubes:
def sum3(x):
    """Return sum(i**3 for i in range(1, x+1))"""
    return (x * (x+1) // 2)**2

def better2(x):
    result = sum(j**3 * (x // j) for j in range(1, x//2 + 1))
    result += sum3(x) - sum3(x//2)
    return result
better2() is about twice as fast as better(), but to get faster than O(x) would require deeper insight.
Quicker
Thinking about this in spare moments, I still don't have a truly clever idea. But the last idea I gave can be carried to a logical conclusion: don't just group together divisors with only one multiple in range, but also those with two multiples in range, and three, and four, and ... That leads to better3() below, which does a number of operations roughly proportional to the square root of x:
def better3(x):
    result = 0
    for i in range(1, x+1):
        q1 = x // i
        # value i has q1 multiples in range
        result += i**3 * q1
        # which values have i multiples?
        q2 = x // (i+1) + 1
        assert x // q1 == i == x // q2
        if i < q2:
            result += i * (sum3(q1) - sum3(q2 - 1))
        if i+1 >= q2: # this becomes true when i reaches roughly sqrt(x)
            break
    return result
Of course O(sqrt(x)) is an enormous improvement over the original O(x**2), but for very large arguments it's still impractical. For example better3(10**6) appears to complete instantly, but better3(10**12) takes a few seconds, and better3(10**16) is time for a coffee break ;-)
Note: I'm using Python 3. If you're using Python 2, use xrange() instead of range().
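The same grouping idea as better3() can also be written as a single loop over blocks of consecutive divisors that share the same quotient x // j. This is a sketch of that common formulation, not the code from this answer; `cube_divisor_total` is a hypothetical name.

```python
def sum3(x):
    """Return sum(i**3 for i in range(1, x+1))."""
    return (x * (x + 1) // 2) ** 2

def cube_divisor_total(x):
    total = 0
    j = 1
    while j <= x:
        q = x // j     # every divisor in the block j..last has q multiples <= x
        last = x // q  # largest divisor with the same quotient q
        total += q * (sum3(last) - sum3(j - 1))
        j = last + 1
    return total

print(cube_divisor_total(4), cube_divisor_total(5))  # 111 237
```

The loop runs once per distinct value of x // j, which is on the order of sqrt(x) iterations.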
One more
better4() has the same O(sqrt(x)) time behavior as better3(), but does the summations in a different order that allows for simpler code and fewer calls to sum3(). For "large" arguments, it's about 50% faster than better3() on my box.
def better4(x):
    result = 0
    for i in range(1, x+1):
        d = x // i
        if d >= i:
            # d is the largest divisor that appears `i` times, and
            # all divisors less than `d` also appear at least that
            # often. Account for one occurrence of each.
            result += sum3(d)
        else:
            i -= 1
            lastd = x // i
            # We already accounted for i occurrences of all divisors
            # < lastd, and all occurrences of divisors >= lastd.
            # Account for the rest.
            result += sum(j**3 * (x // j - i)
                          for j in range(1, lastd))
            break
    return result
It may be possible to do better by extending the algorithm in "A Successive Approximation Algorithm for Computing the Divisor Summatory Function". That takes O(cube_root(x)) time for the possibly simpler problem of summing the number of divisors. But it's much more involved, and I don't care enough about this problem to pursue it myself ;-)
Subtlety
There's a subtlety in the math that's easy to miss, so I'll spell it out, but only as it pertains to better4().
After d = x // i, the comment claims that d is the largest divisor that appears i times. But is that true? The actual number of times d appears is x // d, which we did not compute. How do we know that x // d in fact equals i?
That's the purpose of the if d >= i: guarding that comment. After d = x // i we know that
x == d*i + r
for some integer r satisfying 0 <= r < i. That's essentially what floor division means. But since d >= i is also known (that's what the if test ensures), it must also be the case that 0 <= r < d. And that's how we know x // d is i.
This can break down when d >= i is not true, which is why a different method needs to be used then. For example, if x == 500 and i == 51, d (x // i) is 9, but it's certainly not the case that 9 is the largest divisor that appears 51 times. In fact, 9 appears 500 // 9 == 55 times. While for positive real numbers
d == x/i
if and only if
i == x/d
that's not always so for floor division. But, as above, the first does imply the second if we also know that d >= i.
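The subtlety is easy to check numerically. A small sketch (the helper name `quotient_roundtrips` is hypothetical):

```python
def quotient_roundtrips(x, i):
    """True when d = x // i satisfies x // d == i."""
    d = x // i
    return x // d == i

# The counterexample from the text: d < i, and the round trip fails.
print(500 // 51, 500 // 9)           # 9 55
print(quotient_roundtrips(500, 51))  # False

# Whenever d >= i, the round trip holds.
assert all(quotient_roundtrips(x, i)
           for x in range(1, 300)
           for i in range(1, x + 1)
           if x // i >= i)
```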
Just for Fun
better5() rewrites better4() for about another 10% speed gain. The real pedagogical point is to show that it's easy to compute all the loop limits in advance. Part of the point of the odd code structure above is that it magically returns 0 for a 0 input without needing to test for that. better5() gives up on that:
def isqrt(n):
    "Return floor(sqrt(n)) for int n > 0."
    g = 1 << ((n.bit_length() + 1) >> 1)
    d = n // g
    while d < g:
        g = (d + g) >> 1
        d = n // g
    return g

def better5(x):
    assert x > 0
    u = isqrt(x)
    v = x // u
    return (sum(map(sum3, (x // d for d in range(1, u+1)))) +
            sum(x // i * i**3 for i in range(1, v)) -
            u * sum3(v-1))

def sum_divisors(n):
    total = 0
    for i in range(1, n):
        if n % i == 0:
            total += i
    # Return the sum of all divisors of n, not including n
    return total

print(sum_divisors(0))
# 0
print(sum_divisors(3)) # Should be the sum of 1
# 1
print(sum_divisors(36)) # Should be the sum of 1+2+3+4+6+9+12+18
# 55
print(sum_divisors(102)) # Should be the sum of 1+2+3+6+17+34+51
# 114

upper bound on predictability

I'm trying to compute the upper bound on the predictability of my occupancy dataset, as in Song's 'Limits of Predictability in Human Mobility' paper. Basically, home (=1) and not at home (=0) then represent the visited locations (towers) in Song's paper.
I tested my code (which I derived from https://github.com/gavin-s-smith/MobilityPredictabilityUpperBounds and https://github.com/gavin-s-smith/EntropyRateEst) on a random binary sequence, which should return an entropy of 1 and a predictability of 0.5. Instead, the returned entropy is 0.87 and the predictability is 0.71.
Here's my code:
import numpy as np
from scipy.optimize import fsolve
from cmath import log
import math

def matchfinder(data):
    data_len = len(data)
    output = np.zeros(len(data))
    output[0] = 1
    # Using L_{n} definition from
    #"Nonparametric Entropy Estimation for Stationary Process and Random Fields, with Applications to English Text"
    # by Kontoyiannis et. al.
    # $L_{n} = 1 + max \{l :0 \leq l \leq n, X^{l-1}_{0} = X^{-j+l-1}_{-j} \text{ for some } l \leq j \leq n \}$
    # for each position, i, in the sub-sequence that occurs before the current position, start_idx
    # check to see the maximum continuously equal string we can make by simultaneously extending from i and start_idx
    for start_idx in range(1,data_len):
        max_subsequence_matched = 0
        for i in range(0,start_idx):
            j = 0
            #increase the length of the substring starting at i and start_idx
            #while they are the same keeping track of the length
            while( (start_idx+j < data_len) and (i+j < start_idx) and (data[i+j] == data[start_idx+j]) ):
                j = j + 1
            if j > max_subsequence_matched:
                max_subsequence_matched = j
        #L_{n} is obtained by adding 1 to the longest match-length
        output[start_idx] = max_subsequence_matched + 1
    return output

if __name__ == '__main__':
    #Read dataset
    data = np.random.randint(2,size=2000)
    #Number of distinct locations
    N = len(np.unique(data))
    #True entropy
    lambdai = matchfinder(data)
    Etrue = math.pow(sum( [ lambdai[i] / math.log(i+1,2) for i in range(1,len(data))] ) * (1.0/len(data)),-1)
    S = Etrue
    #use Fano's inequality to compute the predictability
    func = lambda x: (-(x*log(x,2).real+(1-x)*log(1-x,2).real)+(1-x)*log(N-1,2).real ) - S
    ub = fsolve(func, 0.9)[0]
    print(ub)
The matchfinder function estimates the entropy by looking for the longest match and adding 1 to it (= the length of the shortest substring not previously seen). The predictability is then computed using Fano's inequality.
What could be the problem?
Thanks!

The entropy function seems to be wrong.
Referring to the paper you mentioned (Song, C., Qu, Z., Blumm, N., & Barabási, A. L. (2010). Limits of predictability in human mobility. Science, 327(5968), 1018–1021), the real entropy is estimated by an algorithm based on Lempel-Ziv data compression:
$$S^{est} = \left(\frac{1}{n}\sum_{i} \Lambda_i\right)^{-1} \ln n$$
In code it would look like this:
Etrue = math.pow((np.sum(lambdai)/ n),-1)*log(n,2).real
where n is the length of the time series.
Notice that we used a different base for the logarithm than in the given formula. However, since the base of the logarithm in Fano's inequality is 2, it seems logical to use the same base for the entropy calculation. Also, I'm not sure why you started the sum from the first index instead of the zeroth.
Wrapping that up into a function, for example:
def solve(locations, size):
    data = np.random.randint(locations,size=size)
    N = len(np.unique(data))
    n = float(len(data))
    print("Distinct locations: %i" % N)
    print("Time series length: %i" % n)

    #True entropy
    lambdai = matchfinder(data)
    #S = math.pow(sum([lambdai[i] / math.log(i + 1, 2) for i in range(1, len(data))]) * (1.0 / len(data)), -1)
    Etrue = math.pow((np.sum(lambdai)/ n),-1)*log(n,2).real
    S = Etrue
    print("Maximum entropy: %2.5f" % log(locations,2).real)
    print("Real entropy: %2.5f" % S)

    func = lambda x: (-(x * log(x, 2).real + (1 - x) * log(1 - x, 2).real) + (1 - x) * log(N - 1, 2).real) - S
    ub = fsolve(func, 0.9)[0]
    print("Upper bound of predictability: %2.5f" % ub)
    return ub
Output for 2 locations
Distinct locations: 2
Time series length: 10000
Maximum entropy: 1.00000
Real entropy: 1.01441
Upper bound of predictability: 0.50013
Output for 3 locations
Distinct locations: 3
Time series length: 10000
Maximum entropy: 1.58496
Real entropy: 1.56567
Upper bound of predictability: 0.41172
The Lempel-Ziv estimate converges to the real entropy only as n approaches infinity, which is why, for the 2-location case, it comes out slightly higher than the maximum limit.
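For reference, the Fano step can also be sketched without SciPy. The version below is an assumption-laden illustration, not the code used for the outputs above: it uses bisection instead of `fsolve`, relies on the Fano expression being decreasing on [1/N, 1), requires at least 2 locations, and the names `fano_value` and `max_predictability` are hypothetical.

```python
from math import log2

def fano_value(p, n_locations):
    """Binary entropy of p plus (1 - p) * log2(N - 1)."""
    entropy = -(p * log2(p) + (1 - p) * log2(1 - p))
    return entropy + (1 - p) * log2(n_locations - 1)

def max_predictability(entropy, n_locations, iterations=100):
    """Solve fano_value(p, N) == entropy for p in [1/N, 1) by bisection."""
    lo, hi = 1.0 / n_locations, 1.0 - 1e-12
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if fano_value(mid, n_locations) > entropy:
            lo = mid  # the bound is still too loose, move right
        else:
            hi = mid
    return (lo + hi) / 2

# An entropy of 1 bit with 2 locations gives a predictability of about 0.5.
print(round(max_predictability(1.0, 2), 4))
```

If the estimated entropy exceeds log2(N), as in the 2-location output above, this sketch simply returns the lower end of the interval, 1/N.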
I am also not sure whether you interpreted the definition of lambda correctly. It is defined as "the length of the shortest substring starting at position i which doesn't previously appear from position 1 to i-1". So once we reach a point where the remaining substrings are no longer unique, your matching algorithm always returns a length one higher than the length of the remaining substring, while it should rather be 0, since no unique substring exists.
To make it clearer let's just give a simple example. If the array of positions looks like that:
[1 0 0 1 0 0]
Then we can see that after the first three positions the pattern is repeated once again. That means that from the fourth location on, the shortest unique substring does not exist, so it equals 0. So the output (lambda) should look like this:
[1 1 2 0 0 0]
However, your function for that case would return:
[1 1 2 4 3 2]
I rewrote your matching function to handle that problem:
def matchfinder2(data):
    data_len = len(data)
    output = np.zeros(len(data))
    output[0] = 1
    for start_idx in range(1,data_len):
        max_subsequence_matched = 0
        for i in range(0,start_idx):
            j = 0
            end_distance = data_len - start_idx #length left to the end of sequence (including current index)
            while( (start_idx+j < data_len) and (i+j < start_idx) and (data[i+j] == data[start_idx+j]) ):
                j = j + 1
            if j == end_distance: #check if j has reached the end of sequence
                output[start_idx::] = np.zeros(end_distance) #if yes fill the rest of output with zeros
                return output #end function
            elif j > max_subsequence_matched:
                max_subsequence_matched = j
        output[start_idx] = max_subsequence_matched + 1
    return output
The differences are small, of course, because the result changes for only a small part of the sequence.
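The [1 0 0 1 0 0] example can be reproduced with a direct, unoptimized sketch of the definition (cubic time, so only for short sequences). The names `shortest_unseen_lengths` and `lz_entropy_estimate` are hypothetical, and the estimate uses the base-2 form from the code above.

```python
from math import log2

def shortest_unseen_lengths(data):
    """For each position i, the length of the shortest substring starting at i
    that does not occur entirely inside data[:i]; 0 if no such substring exists."""
    data = list(data)
    n = len(data)
    lengths = []
    for i in range(n):
        history = data[:i]
        length = 0
        for k in range(1, n - i + 1):
            piece = data[i:i + k]
            seen = any(history[s:s + k] == piece for s in range(i - k + 1))
            if not seen:
                length = k
                break
        lengths.append(length)
    return lengths

def lz_entropy_estimate(data):
    """Entropy estimate in bits: n * log2(n) / sum of the lengths."""
    n = len(data)
    return n * log2(n) / sum(shortest_unseen_lengths(data))

print(shortest_unseen_lengths([1, 0, 0, 1, 0, 0]))  # [1, 1, 2, 0, 0, 0]
```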

Is it possible to determine if two lists are identical (rotatable) without going through every rotation? [duplicate]

For instance, I have lists:
a[0] = [1, 1, 1, 0, 0]
a[1] = [1, 1, 0, 0, 1]
a[2] = [0, 1, 1, 1, 0]
# and so on
They seem to be different, but if it is supposed that the start and the end are connected, then they are circularly identical.
The problem is that each list I have has a length of 55 and contains only three ones and 52 zeros. Without the circular condition, there are 26,235 (55 choose 3) lists. However, once the circular condition applies, a huge number of these lists are circularly identical.
Currently I check for circular identity as follows:
import numpy

def is_dup(a, b):
    for i in range(len(a)):
        if a == list(numpy.roll(b, i)): # shift b circularly by i
            return True
    return False
This function requires 55 cyclic shift operations in the worst case, and there are 26,235 lists to be compared with each other. In short, I need 55 * 26,235 * (26,235 - 1) / 2 = 18,926,847,225 computations. That's nearly 20 billion!
Is there a good way to do it with fewer computations? Or is there a data type that supports circular comparison?

First off, this can be done in O(n) in terms of the length of the list.
Notice that if you concatenate a list with itself ([1, 2, 3] becomes [1, 2, 3, 1, 2, 3]), the new list contains every possible rotation of the original as a contiguous run.
So all you need to do is check whether the list you are searching for is inside the doubled starting list. In Python you can achieve this in the following way (assuming that the lengths are the same).
list1 = [1, 1, 1, 0, 0]
list2 = [1, 1, 0, 0, 1]
print ' '.join(map(str, list2)) in ' '.join(map(str, list1 * 2))
Some explanation about my oneliner:
list * 2 will combine a list with itself, map(str, [1, 2]) convert all numbers to string and ' '.join() will convert array ['1', '2', '111'] into a string '1 2 111'.
As some people pointed out in the comments, the one-liner can potentially give false positives, so to cover all the possible edge cases:
def isCircular(arr1, arr2):
    if len(arr1) != len(arr2):
        return False
    str1 = ' '.join(map(str, arr1))
    str2 = ' '.join(map(str, arr2))
    if len(str1) != len(str2):
        return False
    return str1 in str2 + ' ' + str2
P.S.1 When speaking about time complexity, it is worth noting that O(n) is achieved only if the substring can be found in O(n) time. This is not always the case and depends on the implementation in your language (although it can be done in linear time, with KMP for example).
P.S.2 For people who are wary of string operations and therefore think this answer is not good: what matters is complexity and speed. This algorithm potentially runs in O(n) time and O(n) space, which makes it much better than anything in the O(n^2) domain. To see this for yourself, you can run a small benchmark (it creates a random list, then pops the first element and appends it to the end, thus creating a cyclic list; you are free to do your own manipulations).
from random import random
bigList = [int(1000 * random()) for i in xrange(10**6)]
bigList2 = bigList[:]
bigList2.append(bigList2.pop(0))
# then test how much time will it take to come up with an answer
from datetime import datetime
startTime = datetime.now()
print isCircular(bigList, bigList2)
print datetime.now() - startTime # please feel free to use timeit, but it will give similar results
0.3 seconds on my machine. Not really long. Now try to compare this with O(n^2) solutions. While it is comparing it, you can travel from US to Australia (most probably by a cruise ship)

Not knowledgeable enough in Python to answer this in your requested language, but in C/C++, given the parameters of your question, I'd convert the zeros and ones to bits and push them onto the least significant bits of an uint64_t. This will allow you to compare all 55 bits in one fell swoop - 1 clock.
Wickedly fast, and the whole thing will fit in on-chip caches (209,880 bytes). Hardware support for shifting all 55 list members right simultaneously is available only in a CPU's registers. The same goes for comparing all 55 members simultaneously. This allows for a 1-for-1 mapping of the problem to a software solution. (and using the SIMD/SSE 256 bit registers, up to 256 members if needed) As a result the code is immediately obvious to the reader.
You might be able to implement this in Python, I just don't know it well enough to know if that's possible or what the performance might be.
After sleeping on it a few things became obvious, and all for the better.
1.) It's so easy to spin the circularly linked list using bits that Dali's very clever trick isn't necessary. Inside a 64-bit register standard bit shifting will accomplish the rotation very simply, and in an attempt to make this all more Python friendly, by using arithmetic instead of bit ops.
2.) Bit shifting can be accomplished easily using divide by 2.
3.) Checking the end of the list for 0 or 1 can be easily done by modulo 2.
4.) "Moving" a 0 to the head of the list from the tail can be done by dividing by 2. This because if the zero were actually moved it would make the 55th bit false, which it already is by doing absolutely nothing.
5.) "Moving" a 1 to the head of the list from the tail can be done by dividing by 2 and adding 18,014,398,509,481,984 - which is the value created by marking the 55th bit true and all the rest false.
6.) If a comparison of the anchor and composed uint64_t is TRUE after any given rotation, break and return TRUE.
I would convert the entire array of lists into an array of uint64_ts right up front to avoid having to do the conversion repeatedly.
After spending a few hours trying to optimize the code, studying the assembly language I was able to shave 20% off the runtime. I should add that the O/S and MSVC compiler got updated mid-day yesterday as well. For whatever reason/s, the quality of the code the C compiler produced improved dramatically after the update (11/15/2014). Run-time is now ~ 70 clocks, 17 nanoseconds to compose and compare an anchor ring with all 55 turns of a test ring and NxN of all rings against all others is done in 12.5 seconds.
This code is so tight all but 4 registers are sitting around doing nothing 99% of the time. The assembly language matches the C code almost line for line. Very easy to read and understand. A great assembly project if someone were teaching themselves that.
Hardware is a Haswell i7, MSVC 64-bit, full optimizations.
#include "stdafx.h"
#include <stdint.h>
#include <string>
#include <memory>
#include <stdio.h>
#include <time.h>
const uint8_t LIST_LENGTH = 55; // uint8_t supports full width of SIMD and AVX2
// max left shifts is 32, so must use right shifts to create head_bit
const uint64_t head_bit = (0x8000000000000000 >> (64 - LIST_LENGTH));
const uint64_t CPU_FREQ = 3840000000; // turbo-mode clock freq of my i7 chip
const uint64_t LOOP_KNT = 688275225; // 26235^2 // 1000000000;
// ----------------------------------------------------------------------------
__inline uint8_t is_circular_identical(const uint64_t anchor_ring, uint64_t test_ring)
{
// By trial and error, try to synch 2 circular lists by holding one constant
// and turning the other 0 to LIST_LENGTH positions. Return compare count.
// Return the number of tries which aligned the circularly identical rings,
// where any non-zero value is treated as a bool TRUE. Return a zero/FALSE,
// if all tries failed to find a sequence match.
// If anchor_ring and test_ring are equal to start with, return one.
for (uint8_t i = LIST_LENGTH; i; i--)
{
// This function could be made bool, returning TRUE or FALSE, but
// as a debugging tool, knowing the try_knt that got a match is nice.
if (anchor_ring == test_ring) { // test all 55 list members simultaneously
return (LIST_LENGTH +1) - i;
}
if (test_ring % 2) { // ring's tail is 1 ?
test_ring /= 2; // right-shift 1 bit
// if the ring tail was 1, set head to 1 to simulate wrapping
test_ring += head_bit;
} else { // ring's tail must be 0
test_ring /= 2; // right-shift 1 bit
// if the ring tail was 0, doing nothing leaves head a 0
}
}
// if we got here, they can't be circularly identical
return 0;
}
// ----------------------------------------------------------------------------
int main(void) {
time_t start = clock();
uint64_t anchor, test_ring, i, milliseconds;
uint8_t try_knt;
anchor = 31525197391593472; // bits 55,54,53 set true, all others false
// Anchor right-shifted LIST_LENGTH/2 represents the average search turns
test_ring = anchor >> (1 + (LIST_LENGTH / 2)); // 117440512;
printf("\n\nRunning benchmarks for %llu loops.", LOOP_KNT);
start = clock();
for (i = LOOP_KNT; i; i--) {
try_knt = is_circular_identical(anchor, test_ring);
// The shifting of test_ring below is a test fixture to prevent the
// optimizer from optimizing the loop away and returning instantly
if (i % 2) {
test_ring /= 2;
} else {
test_ring *= 2;
}
}
milliseconds = (uint64_t)(clock() - start);
printf("\nET for is_circular_identical was %f milliseconds."
"\n\tLast try_knt was %u for test_ring list %llu",
(double)milliseconds, try_knt, test_ring);
printf("\nConsuming %7.1f clocks per list.\n",
(double)((milliseconds * (CPU_FREQ / 1000)) / (uint64_t)LOOP_KNT));
getchar();
return 0;
}

Reading between the lines, it sounds as though you're trying to enumerate one representative of each circular equivalence class of strings with 3 ones and 52 zeros. Let's switch from a dense representation to a sparse one (set of three numbers in range(55)). In this representation, the circular shift of s by k is given by the comprehension set((i + k) % 55 for i in s). The lexicographic minimum representative in a class always contains the position 0. Given a set of the form {0, i, j} with 0 < i < j, the other candidates for minimum in the class are {0, j - i, 55 - i} and {0, 55 - j, 55 + i - j}. Hence, we need (i, j) <= min((j - i, 55 - i), (55 - j, 55 + i - j)) for the original to be minimum. Here's some enumeration code.
def makereps():
    reps = []
    for i in range(1, 55 - 1):
        for j in range(i + 1, 55):
            if (i, j) <= min((j - i, 55 - i), (55 - j, 55 + i - j)):
                reps.append('1' + '0' * (i - 1) + '1' + '0' * (j - i - 1) + '1' + '0' * (55 - j - 1))
    return reps

Repeat the first array, then use the Z algorithm (O(n) time) to find the second array inside the first.
(Note: you don't have to physically copy the first array. You can just wrap around during matching.)
The nice thing about the Z algorithm is that it's very simple compared to KMP, BM, etc.
However, if you're feeling ambitious, you could do string matching in linear time and constant space -- strstr, for example, does this. Implementing it would be more painful, though.
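For anyone who wants to try the Z-algorithm route, here is a minimal sketch. The names z_array and is_rotation are made up for this illustration, and unlike the note above it does physically build the doubled list (plus a sentinel) to keep the code short, so it uses O(n) extra space.

```python
def z_array(seq):
    """z[i] = length of the longest common prefix of seq and seq[i:]."""
    n = len(seq)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0  # [left, right) is the rightmost window known to match a prefix
    for i in range(1, n):
        if i < right:
            # Reuse what was already learned inside the current window.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and seq[z[i]] == seq[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def is_rotation(a, b):
    """Return True if list b is a cyclic rotation of list a."""
    n = len(a)
    if n != len(b):
        return False
    if n == 0:
        return True
    sentinel = object()  # compares unequal to every list element
    combined = list(b) + [sentinel] + list(a) + list(a)
    z = z_array(combined)
    # b occurs inside a + a exactly when some z-value in that region reaches n.
    return any(z[i] >= n for i in range(n + 1, 2 * n + 1))
```

Each call does O(n) work, so comparing every pair of lists is still quadratic in the number of lists; it only removes the factor for trying each rotation.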

Following up on Salvador Dali's very smart solution, the best way to handle it is to make sure all elements are of the same length, as well as both LISTS are of the same length.
def is_circular_equal(lst1, lst2):
    if len(lst1) != len(lst2):
        return False
    lst1, lst2 = map(str, lst1), map(str, lst2)
    len_longest_element = max(map(len, lst1))
    template = "{{:{}}}".format(len_longest_element)
    circ_lst = " ".join([template.format(el) for el in lst1]) * 2
    return " ".join([template.format(el) for el in lst2]) in circ_lst
No clue if this is faster or slower than AshwiniChaudhary's recommended regex solution in Salvador Dali's answer, which reads:
import re
def is_circular_equal(lst1, lst2):
    if len(lst1) != len(lst2):
        return False
    # double the list before joining so the two copies stay separated by a space
    return bool(re.search(r"\b{}\b".format(' '.join(map(str, lst2))),
                          ' '.join(map(str, lst1 * 2))))

Given that you need to do so many comparisons might it be worth your while taking an initial pass through your lists to convert them into some sort of canonical form that can be easily compared?
Are you trying to get a set of circularly-unique lists? If so you can throw them into a set after converting to tuples.
def normalise(lst):
    # Pick the 'maximum' out of all cyclic options
    return max([lst[i:]+lst[:i] for i in range(len(lst))])
a_normalised = map(normalise,a)
a_tuples = map(tuple,a_normalised)
a_unique = set(a_tuples)
Apologies to David Eisenstat for not spotting his v.similar answer.

You can roll one list like this:
list1, list2 = [0,1,1,1,0,0,1,0], [1,0,0,1,0,0,1,1]
str_list1="".join(map(str,list1))
str_list2="".join(map(str,list2))
def rotate(string_to_rotate):
    # start from a fresh list on every call (a mutable default argument would be shared between calls)
    result = [string_to_rotate]
    for i in xrange(1,len(string_to_rotate)):
        result.append(result[-1][1:]+result[-1][0])
    return result
for x in rotate(str_list1):
    if cmp(x,str_list2)==0:
        print "lists are rotationally identical"
        break

First convert each of your lists (in a copy if necessary) to its lexically greatest rotation.
Then sort the resulting list of lists (retaining an index into the original list position) and unify the sorted list, marking all the duplicates in the original list as needed.
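A rough sketch of those two steps, in case it helps. The helper names canonical_rotation and mark_duplicates are invented here, and the canonical form is found by brute force over all rotations, which is fine for lists of length 55 but is not the fastest possible way.

```python
def canonical_rotation(lst):
    """Lexicographically greatest rotation of lst, as a tuple."""
    if not lst:
        return ()
    return max(tuple(lst[i:] + lst[:i]) for i in range(len(lst)))


def mark_duplicates(lists):
    """Return (first_indices, duplicate_indices) into the original `lists`.

    Lists are grouped by sorting on their canonical rotation; within each
    group the lowest original index is kept and the rest count as duplicates.
    """
    keyed = sorted((canonical_rotation(lst), idx) for idx, lst in enumerate(lists))
    firsts, duplicates = [], []
    have_previous = False
    previous = None
    for key, idx in keyed:
        if have_previous and key == previous:
            duplicates.append(idx)
        else:
            firsts.append(idx)
            previous, have_previous = key, True
    return firsts, duplicates
```

With the lists from the question, mark_duplicates([[1, 1, 1, 0, 0], [1, 1, 0, 0, 1], [0, 1, 1, 1, 0]]) keeps index 0 and reports indices 1 and 2 as duplicates.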

Piggybacking on @SalvadorDali's observation about looking for a match of a in any len(a)-sized slice of b+b, here is a solution using just list operations.
def rollmatch(a,b):
    bb=b*2
    return any(not any(ax^bbx for ax,bbx in zip(a,bb[i:])) for i in range(len(a)))
l1 = [1,0,0,1]
l2 = [1,1,0,0]
l3 = [1,0,1,0]
rollmatch(l1,l2) # True
rollmatch(l1,l3) # False
2nd approach: [deleted]

Not a complete, free-standing answer, but on the topic of optimizing by reducing comparisons, I too was thinking of normalized representations.
Namely, if your input alphabet is {0, 1}, you could reduce the number of allowed permutations significantly. Rotate the first list to a (pseudo-) normalized form (given the distribution in your question, I would pick one where one of the 1 bits is on the extreme left, and one of the 0 bits is on the extreme right). Now before each comparison, successively rotate the other list through the possible positions with the same alignment pattern.
For example, if you have a total of four 1 bits, there can be at most 4 permutations with this alignment, and if you have clusters of adjacent 1 bits, each additional bit in such a cluster reduces the amount of positions.
List 1 1 1 1 0 1 0
List 2 1 0 1 1 1 0 1st permutation
1 1 1 0 1 0 2nd permutation, final permutation, match, done
This generalizes to larger alphabets and different alignment patterns; the main challenge is to find a good normalization with only a few possible representations. Ideally, it would be a proper normalization, with a single unique representation, but given the problem, I don't think that's possible.

Building further on RocketRoy's answer:
Convert all your lists up front to unsigned 64 bit numbers.
For each list, rotate those 55 bits around to find the smallest numerical value.
You are now left with a single unsigned 64 bit value for each list that you can compare straight with the value of the other lists. Function is_circular_identical() is not required anymore.
(In essence, you create an identity value for your lists that is not affected by the rotation of the lists elements)
That would even work if you have an arbitrary number of one's in your lists.
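Here is a small Python sketch of that idea, since Python integers handle 55 bits without any trouble. The function names are my own, and the rotation is a plain right shift with the tail bit wrapped back to the head, which is an assumption about the details rather than RocketRoy's exact code.

```python
LIST_LENGTH = 55  # ring size from the question


def bits_to_int(bits):
    """Pack a list of 0/1 values into an integer, first element most significant."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def canonical_value(value, width=LIST_LENGTH):
    """Smallest integer reachable by rotating the `width`-bit ring."""
    head_bit = 1 << (width - 1)
    best = value
    for _ in range(width - 1):
        tail = value & 1
        value >>= 1
        if tail:
            value |= head_bit  # the tail bit wraps around to the head
        best = min(best, value)
    return best


def unique_rings(lists):
    """Set of canonical values; assumes all lists share the same length."""
    return {canonical_value(bits_to_int(lst), len(lst)) for lst in lists}
```

Two lists are circularly identical exactly when their canonical values are equal, so deduplicating all the lists becomes a single pass plus set membership.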

This is the same idea as Salvador Dali's, but it doesn't need the string conversion. Behind it is the same KMP failure-function idea, which avoids inspecting shifts that cannot match. Then simply call Search(list1, list2 + list2), where list2 + list2 stands for list2 concatenated with itself.
public class KmpModified
{
public int[] CalculatePhi(int[] pattern)
{
var phi = new int[pattern.Length + 1];
phi[0] = -1;
phi[1] = 0;
int pos = 1, cnd = 0;
while (pos < pattern.Length)
if (pattern[pos] == pattern[cnd])
{
cnd++;
phi[pos + 1] = cnd;
pos++;
}
else if (cnd > 0)
cnd = phi[cnd];
else
{
phi[pos + 1] = 0;
pos++;
}
return phi;
}
public IEnumerable<int> Search(int[] pattern, int[] list)
{
var phi = CalculatePhi(pattern);
int m = 0, i = 0;
while (m < list.Length)
if (pattern[i] == list[m])
{
i++;
if (i == pattern.Length)
{
yield return m - i + 1;
i = phi[i];
}
m++;
}
else if (i > 0)
{
i = phi[i];
}
else
{
i = 0;
m++;
}
}
[Fact]
public void BasicTest()
{
var pattern = new[] { 1, 1, 10 };
var list = new[] {2, 4, 1, 1, 1, 10, 1, 5, 1, 1, 10, 9};
var matches = Search(pattern, list).ToList();
Assert.Equal(new[] {3, 8}, matches);
}
[Fact]
public void SolveProblem()
{
var random = new Random();
var list = new int[10];
for (var k = 0; k < list.Length; k++)
list[k]= random.Next();
var rotation = new int[list.Length];
for (var k = 1; k < list.Length; k++)
rotation[k - 1] = list[k];
rotation[rotation.Length - 1] = list[0];
Assert.True(Search(list, rotation.Concat(rotation).ToArray()).Any());
}
}
Hope this helps!

Simplifying The Problem
The problem consist of list of ordered items
The domain of value is binary (0,1)
We can reduce the problem by mapping consecutive 1s into a count
and consecutive 0s into a negative count
Example
A = [ 1, 1, 1, 0, 0, 1, 1, 0 ]
B = [ 1, 1, 0, 1, 1, 1, 0, 0 ]
~
A = [ +3, -2, +2, -1 ]
B = [ +2, -1, +3, -2 ]
This process require that the first item and the last item must be different
This will reduce the amount of comparisons overall
Checking Process
If we assume that they're duplicate, then we can assume what we are looking for
Basically the first item from the first list must exist somewhere in the other list
Followed by what is followed in the first list, and in the same manner
The previous items should be the last items from the first list
Since it's circular, the order is the same
The Grip
The question here is where to start, technically known as lookup and look-ahead
We will just check where the first element of the first list exist through the second list
The probability of frequent element is lower given that we mapped the lists into histograms
Pseudo-Code
FUNCTION IS_DUPLICATE (LIST L1, LIST L2) : BOOLEAN
LIST A = MAP_LIST(L1)
LIST B = MAP_LIST(L2)
LIST ALPHA = LOOKUP_INDEX(B, A[0])
IF A.SIZE != B.SIZE
OR COUNT_CHAR(A, 0) != COUNT_CHAR(B, ALPHA[0]) THEN
RETURN FALSE
END IF
FOR EACH INDEX IN ALPHA
IF ALPHA_NGRAM(A, B, INDEX, 1) THEN
IF IS_DUPLICATE(A, B, INDEX) THEN
RETURN TRUE
END IF
END IF
END FOR
RETURN FALSE
END FUNCTION
FUNCTION IS_DUPLICATE (LIST L1, LIST L2, INTEGER INDEX) : BOOLEAN
INTEGER I = 0
WHILE I < L1.SIZE DO
IF L1[I] != L2[(INDEX+I)%L2.SIZE] THEN
RETURN FALSE
END IF
I = I + 1
END WHILE
RETURN TRUE
END FUNCTION
Functions
MAP_LIST(LIST A):LIST MAP CONSECUTIVE ELEMENTS AS COUNTS IN A NEW LIST
LOOKUP_INDEX(LIST A, INTEGER E):LIST RETURN LIST OF INDICES WHERE THE ELEMENT E EXIST IN THE LIST A
COUNT_CHAR(LIST A , INTEGER E):INTEGER COUNT HOW MANY TIMES AN ELEMENT E OCCUR IN A LIST A
ALPHA_NGRAM(LIST A,LIST B,INTEGER I,INTEGER N):BOOLEAN CHECK IF B[I] IS EQUIVALENT TO A[0] N-GRAM IN BOTH DIRECTIONS
Finally
If the list size is going to be very large, or if the element we start checking the cycle from occurs very frequently, then we can do the following:
Look for the least-frequent item in the first list to start with
Increase the n-gram N parameter to lower the probability of going through the linear check

An efficient, quick-to-compute "canonical form" for the lists in question can be derived as:
Count the number of zeroes between the ones (ignoring wrap-around), to get three numbers.
Rotate the three numbers so that the biggest number is first.
The first number (a) must be between 18 and 52 (inclusive). Re-encode it as between 0 and 34.
The second number (b) must be between 0 and 26, but it doesn't matter much.
Drop the third number, since it's just 52 - (a + b) and adds no information
The canonical form is the integer b * 35 + a, which is between 0 and 936 (inclusive), which is fairly compact (there are 477 circularly-unique lists in total).
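A possible Python sketch of this encoding follows. It is my reading of the steps rather than the author's code: the third gap is taken across the wrap-around so that the three gaps always sum to 52, and ties for the biggest gap are broken by taking the lexicographically greatest rotation of the gap triple. The name canonical_code is made up.

```python
from itertools import combinations

N = 55  # list length from the question


def canonical_code(ones):
    """Encode a list with exactly three ones, given the positions of the ones."""
    i, j, k = sorted(ones)
    # Zeros between consecutive ones; the last gap runs across the wrap-around.
    gaps = (j - i - 1, k - j - 1, (N - 1 - k) + i)
    # Rotate so the biggest gap comes first (greatest rotation breaks ties).
    a, b, _ = max(gaps, gaps[1:] + gaps[:1], gaps[2:] + gaps[:2])
    # a lies in 18..52, so a - 18 fits in 0..34; the third gap is implied.
    return b * 35 + (a - 18)


# Number of distinct codes over all (55 choose 3) placements of the ones;
# the answer above states there are 477 circularly-unique lists.
codes = {canonical_code(c) for c in combinations(range(N), 3)}
print(len(codes))
```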

I wrote a straightforward solution that compares both lists and simply increases (and wraps around) the index of the compared value on each iteration.
I don't know python well so I wrote it in Java, but it's really simple so it should be easy to adapt it to any other language.
By this you could also compare lists of other types.
public class Main {
public static void main(String[] args){
int[] a = {0,1,1,1,0};
int[] b = {1,1,0,0,1};
System.out.println(isCircularIdentical(a, b));
}
public static boolean isCircularIdentical(int[] a, int[]b){
if(a.length != b.length){
return false;
}
//The outer loop is for the increase of the index of the second list
outer:
for(int i = 0; i < a.length; i++){
//Loop trough the list and compare each value to the according value of the second list
for(int k = 0; k < a.length; k++){
// I use modulo length to wrap around the index
if(a[k] != b[(k + i) % a.length]){
//If the values do not match I continue and shift the index one further
continue outer;
}
}
return true;
}
return false;
}
}

As others have mentioned, once you find the normalized rotation of a list, you can compare them.
Here's some working code that does this.
The basic method is to find a normalized rotation for each list and compare:
Calculate a normalized rotation index on each list.
Loop over both lists with their offsets, comparing each item, returning if they mis-match.
Note that this method doesn't depend on numbers; you can pass in lists of strings (or any values that can be compared).
Instead of doing a list-in-list search, we know we want the list to start with the minimum value - so we can loop over the minimum values, searching until we find which one has the lowest successive values, storing this for further comparisons until we have the best.
There are many opportunities to exit early when calculating the index. Details on some optimizations:
Skip searching for the best minimum value when there's only one.
Skip searching minimum values when the previous is also a minimum value (it will never be a better match).
Skip searching when all values are the same.
Fail early when lists have different minimum values.
Use regular comparison when offsets match.
Adjust offsets to avoid wrapping the index values on one of the lists during comparison.
Note that in Python a list-in-list search may well be faster; however, I was interested in finding an efficient algorithm that could be used in other languages too. Also, there is some advantage in avoiding the creation of new lists.
def normalize_rotation_index(ls, v_min_other=None):
    """ Return the index or -1 (when the minimum is above `v_min_other`) """
    if len(ls) <= 1:
        return 0
    def compare_rotations(i_a, i_b):
        """ Return True when i_a is smaller.
        Note: unless there are large duplicate sections of identical values,
        this loop will exit early on.
        """
        for offset in range(1, len(ls)):
            v_a = ls[(i_a + offset) % len(ls)]
            v_b = ls[(i_b + offset) % len(ls)]
            if v_a < v_b:
                return True
            elif v_a > v_b:
                return False
        return False
    v_min = ls[0]
    i_best_first = 0
    i_best_last = 0
    i_best_total = 1
    for i in range(1, len(ls)):
        v = ls[i]
        if v_min > v:
            v_min = v
            i_best_first = i
            i_best_last = i
            i_best_total = 1
        elif v_min == v:
            i_best_last = i
            i_best_total += 1
    # all values match
    if i_best_total == len(ls):
        return 0
    # exit early if we're not matching another lists minimum
    if v_min_other is not None:
        if v_min != v_min_other:
            return -1
    # simple case, only one minimum
    if i_best_first == i_best_last:
        return i_best_first
    # otherwise find the minimum with the lowest values compared to all others.
    # start looking after the first we've found
    i_best = i_best_first
    for i in range(i_best_first + 1, i_best_last + 1):
        if (ls[i] == v_min) and (ls[i - 1] != v_min):
            if compare_rotations(i, i_best):
                i_best = i
    return i_best
def compare_circular_lists(ls_a, ls_b):
    # sanity checks
    if len(ls_a) != len(ls_b):
        return False
    if len(ls_a) <= 1:
        return (ls_a == ls_b)
    index_a = normalize_rotation_index(ls_a)
    index_b = normalize_rotation_index(ls_b, ls_a[index_a])
    if index_b == -1:
        return False
    if index_a == index_b:
        return (ls_a == ls_b)
    # cancel out 'index_a'
    index_b = (index_b - index_a)
    if index_b < 0:
        index_b += len(ls_a)
    index_a = 0 # ignore it
    # compare rotated lists
    for i in range(len(ls_a)):
        if ls_a[i] != ls_b[(index_b + i) % len(ls_b)]:
            return False
    return True
assert(compare_circular_lists([0, 9, -1, 2, -1], [-1, 2, -1, 0, 9]) == True)
assert(compare_circular_lists([2, 9, -1, 0, -1], [-1, 2, -1, 0, 9]) == False)
assert(compare_circular_lists(["Hello", "Circular", "World"], ["World", "Hello", "Circular"]) == True)
assert(compare_circular_lists(["Hello", "Circular", "World"], ["Circular", "Hello", "World"]) == False)
See: this snippet for some more tests/examples.

You can check to see if a list A is equal to a cyclic shift of list B in expected O(N) time pretty easily.
I would use a polynomial hash function to compute the hash of list A, and every cyclic shift of list B. Where a shift of list B has the same hash as list A, I'd compare the actual elements to see if they are equal.
The reason this is fast is that with polynomial hash functions (which are extremely common!), you can calculate the hash of each cyclic shift from the previous one in constant time, so you can calculate hashes for all of the cyclic shifts in O(N) time.
It works like this:
Let's say B has N elements; then the hash of B using prime P is:
Hb=0;
for (i=0; i<N ; i++)
{
Hb = Hb*P + B[i];
}
This is an optimized way to evaluate a polynomial in P, and is equivalent to:
Hb=0;
for (i=0; i<N ; i++)
{
Hb += B[i] * P^(N-1-i); //^ is exponentiation, not XOR
}
Notice how every B[i] is multiplied by P^(N-1-i). If we shift B to the left by 1, then every B[i] will be multiplied by an extra P, except the first one. Since multiplication distributes over addition, we can multiply all the components at once just by multiplying the whole hash, and then fix up the factor for the first element.
The hash of the left shift of B is just
Hb1 = Hb*P + B[0]*(1-(P^N))
The second left shift:
Hb2 = Hb1*P + B[1]*(1-(P^N))
and so on...
NOTE: all math above is performed modulo some machine word size, and you only have to calculate P^N once.
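Put together in Python, the rolling update might look like the sketch below. The function name, the prime 1000003 and the modulus 2^61 - 1 are arbitrary choices for this illustration, not values from the answer above, and Python's built-in hash() is used to turn each element into an integer.

```python
def is_rotation_by_hash(a, b, prime=1000003, modulus=(1 << 61) - 1):
    """Return True if list a equals some cyclic left shift of list b."""
    n = len(a)
    if n != len(b):
        return False
    if n == 0:
        return True

    def poly_hash(seq):
        h = 0
        for value in seq:
            h = (h * prime + hash(value)) % modulus
        return h

    target = poly_hash(a)
    current = poly_hash(b)  # hash of b shifted left by 0
    fixup = (1 - pow(prime, n, modulus)) % modulus  # (1 - P^N), computed once
    for shift in range(n):
        # Hashes can collide, so confirm element by element before answering.
        if current == target and all(a[i] == b[(shift + i) % n] for i in range(n)):
            return True
        # Hb_next = Hb*P + B[shift]*(1 - P^N): hash of b shifted left once more.
        current = (current * prime + hash(b[shift]) * fixup) % modulus
    return False
```

The element-by-element check only runs when the hashes agree, which is why the running time is expected O(N) rather than guaranteed.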

To glue to the most pythonic way to do it, use sets !
from sets import Set
a = Set ([1, 1, 1, 0, 0])
b = Set ([0, 1, 1, 1, 0])
c = Set ([1, 0, 0, 1, 1])
a==b
True
a==b==c
True

Number of entries divisible by a prime number in a row of Pascal's triangle

How can I find how many numbers in a given row of Pascal's triangle are divisible by a prime number, when the row number and the prime are given?
I am using the following code in Python:
def factorial(x):
    result = 1
    for i in xrange(1,x+1):
        result *= i
    return result
def combination(n,r):
    return factorial(n)/(factorial(n-r)*factorial(r))
p = input()
cnt = 0
for i in range(0,n+1):
    if((combination(n,i)%p)==0):
        cnt += 1
print cnt
But this code takes a long time for big numbers.
Can you please suggest a better algorithm?

One corollary of Lucas's theorem states that the number of binomial coefficients C(n,k) in row n which are not divisible by a prime p is
(a₁+1)⋅(a₂+1)⋅...⋅(aₘ+1), where aᵢ is the i-th digit of n in the base-p numeral system.
Example:
p = 3, n = 7 (decimal) = 21 (base 3)
Result = (2+1)⋅(1+1) = 6
7th row of Pascal triangle is 1 7 21 35 35 21 7 1, it contains 6 coefficients not divisible by 3, and the two remaining are divisible by 3.
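To see the corollary in action, here is a short sketch with a slow cross-check. Both function names are invented for this example; the brute-force version builds the row explicitly and is only there to confirm the digit formula on small inputs.

```python
def count_divisible_lucas(n, p):
    """Entries in row n of Pascal's triangle divisible by the prime p."""
    not_divisible = 1
    m = n
    while m:
        not_divisible *= (m % p) + 1  # one factor (a_i + 1) per base-p digit a_i
        m //= p
    return (n + 1) - not_divisible    # row n has n + 1 entries in total


def count_divisible_brute(n, p):
    """Cross-check: build row n explicitly and test every entry."""
    row = [1]
    for k in range(n):
        row.append(row[k] * (n - k) // (k + 1))
    return sum(1 for entry in row if entry % p == 0)


print(count_divisible_lucas(7, 3))  # the example above: 8 entries, 6 not divisible
```

The digit-based version needs only about log_p(n) multiplications, so it stays fast even for very large row numbers.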

You do not need to compute the binomial coefficient (n,r).
Count how often p occurs in n!, r! and (n-r)!, and check whether n! has more factors of p than the other two together.
// sry... no python...
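long count_p_in_fac(long n, long p)
{
    long count = 0;
    long i = 1;
    long temp;
    while(true)
    {
        temp = floor(n/pow(p,i)); // how many multiples of p^i are <= n
        count += temp;
        if(temp == 0)
            break;
        i++; // advance to the next power of p; without this the loop never ends
    }
    return count;
}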
p = input()
cnt = 0
for i in range(0,n+1):
    if(count_p_in_fac(n,p) > count_p_in_fac(i,p) + count_p_in_fac(n-i,p)):
        cnt += 1
print cnt
This avoids big numbers and reduces the operations.
This checks (n,r) = 0 mod p in O(log(n)) without computing factorials. But counting a row takes O(n log n).
You can also speed this up by using the symmetry of (n,r): compute only the first half and multiply the count by two. If n is even, count the first half except the middle r = n/2, multiply by two, and then add the middle entry separately.
And you can precompute count_p_in_fac(i,p) for all i.
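Since the question asked for Python, here is a sketch of the same counting idea with the precomputation folded in. The function name and the table v are my own; v[i] holds the exponent of p in i!, built incrementally instead of calling count_p_in_fac for every i.

```python
def count_divisible_by_valuation(n, p):
    """Entries in row n of Pascal's triangle divisible by the prime p."""
    # v[i] = exponent of p in i!, using v[i] = v[i-1] + (exponent of p in i).
    v = [0] * (n + 1)
    for i in range(1, n + 1):
        exponent, m = 0, i
        while m % p == 0:
            m //= p
            exponent += 1
        v[i] = v[i - 1] + exponent
    # C(n, r) is divisible by p exactly when n! carries more factors of p
    # than r! and (n - r)! together.
    return sum(1 for r in range(n + 1) if v[n] > v[r] + v[n - r])
```

This is still linear in n, so it suits moderate row numbers; it never builds the large binomial coefficients themselves.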

There's no way you're going to do 10^12 in less than a second. There has to be some property of Pascal's triangle that makes this easier, if it's possible at all.
Another interesting property of Pascal's triangle is that in a row p
where p is a prime number, all the terms in that row except the 1s are
multiples of p. This can be proven easily, since if p is prime,
then p has no factors save for 1 and itself. Every entry in the
triangle is an integer, so therefore by definition (p-k)! and k! are
factors of p!. However, there is no possible way p itself can show
up in the denominator, so therefore p (or some multiple of it) must be
left in the numerator, making the entire entry a multiple of p.
It might have something to do with that result (from the wiki page http://en.wikipedia.org/wiki/Pascal%27s_triangle).. if this has an answer (i.e. if it's university homework some professor gave you).
See here https://mathoverflow.net/questions/9181/pascal-triangle-and-prime-numbers
(I love this question - I'm not sure it's a programming question though).

You can rewrite your combination function without needing to calculate factorial. (n, r) can be written recursively as
(n, r) = (n-1, r) + (n-1, r-1)
Now we should find the base cases. These are:
(n, 1) = n
(n, 0) = 1
(n, n) = 1
Here, we are assuming that n and r are non-negative integers and n >= r holds true. Then the function combination can be rewritten as
def combination(n, r):
    if r == 1:
        return n
    if r == 0 or r == n:
        return 1
    return combination(n-1, r) + combination(n-1, r-1)
p = input()
count = 0
for i in range(n + 1):
    if combination(n, i) % p == 0:
        count += 1
print count

Thank you all for responding to the question of a noob like me
Here is a working python code
n,p = map(int,raw_input().split(' '))
if n==p:
    print n-1
elif p>n:
    print 0
else:
    result = 1
    m = n
    while n:
        temp = n%p
        result *= (temp+1)
        n /= p
    print m+1-result

n = input("enter the row for pascal triangle:")
p = input("enter any prime number u want:")
cnt = 0
line = [1]
for k in range(0, n):
    line.append(line[k] * (n-k) / (k+1))
print line
lengths = map(lambda word: line[word]%p ==0, range(len(line))).count(True)
print lengths

Not sure how to integrate negative number function in data generating algorithm?

I’m having a bit of trouble controlling the results from a data generating algorithm I am working on. Basically it takes values from a list and then lists all the different combinations that reach a specific sum. So far the code works fine (I haven’t tested scaling it with many variables yet), but I need to allow negative numbers to be included in the list.
The way I think I can solve this problem is to put a collar on the possible results to prevent infinitely many results (if apples is 2 and oranges is -1, then for any sum there will be infinitely many solutions, but if I put a limit on either, it cannot go on forever).
So here's some super basic code that computes the maximum weights:
import math
data = [-2, 10,5,50,20,25,40]
target_sum = 100
max_percent = .8 #no value can exceed 80% of total(this is to prevent infinite solutions
for node in data:
    max_value = abs(math.floor((target_sum * max_percent)/node))
    print node, "'s max value is ", max_value
Here's the code that generates the results(first function generates a table if its possible and the second function composes the actual results. Details/pseudo code of the algo is here: Can brute force algorithms scale? ):
from collections import defaultdict
data = [-2, 10,5,50,20,25,40]
target_sum = 100
# T[x, i] is True if 'x' can be solved
# by a linear combination of data[:i+1]
T = defaultdict(bool) # all values are False by default
T[0, 0] = True # base case
for i, x in enumerate(data): # i is index, x is data[i]
    for s in range(target_sum + 1): #set the range of one higher than sum to include sum itself
        for c in range(s / x + 1):
            if T[s - c * x, i]:
                T[s, i+1] = True
coeff = [0]*len(data)
def RecursivelyListAllThatWork(k, sum): # Using last k variables, make sum
    # /* Base case: If we've assigned all the variables correctly, list this
    # * solution.
    # */
    if k == 0:
        # print what we have so far
        print(' + '.join("%2s*%s" % t for t in zip(coeff, data)))
        return
    x_k = data[k-1]
    # /* Recursive step: Try all coefficients, but only if they work. */
    for c in range(sum // x_k + 1):
        if T[sum - c * x_k, k - 1]:
            # mark the coefficient of x_k to be c
            coeff[k-1] = c
            RecursivelyListAllThatWork(k - 1, sum - c * x_k)
            # unmark the coefficient of x_k
            coeff[k-1] = 0
RecursivelyListAllThatWork(len(data), target_sum)
My problem is that I don't know where or how to integrate my limiting code into the main code in order to restrict results and allow for negative numbers. When I add a negative number to the list, it displays it but does not include it in the output. I think this is because it is not being added to the table (first function), and I'm not sure how to have it added (while still keeping the program's structure so I can scale it with more variables).
Thanks in advance and if anything is unclear please let me know.
edit: a bit unrelated (if it detracts from the question, just ignore it), but since you're looking at the code already, is there a way I can utilize both CPUs on my machine with this code? Right now when I run it, it only uses one CPU. I know the technical method of parallel computing in Python but am not sure how to logically parallelize this algorithm.

You can restrict results by changing both loops over c from
for c in range(s / x + 1):
to
max_value = int(abs((target_sum * max_percent)/x))
for c in range(max_value + 1):
This will ensure that any coefficient in the final answer will be an integer in the range 0 to max_value inclusive.
A simple way of adding negative values is to change the loop over s from
for s in range(target_sum + 1):
to
R=200 # Maximum size of any partial sum
for s in range(-R,R+1):
Note that if you do it this way then your solution will have an additional constraint.
The new constraint is that the absolute value of every partial weighted sum must be <=R.
(You can make R large to avoid this constraint reducing the number of solutions, but this will slow down execution.)
The complete code looks like:
from collections import defaultdict
data = [-2,10,5,50,20,25,40]
target_sum = 100
# T[x, i] is True if 'x' can be solved
# by a linear combination of data[:i+1]
T = defaultdict(bool) # all values are False by default
T[0, 0] = True # base case
R=200 # Maximum size of any partial sum
max_percent=0.8 # Maximum weight of any term
for i, x in enumerate(data): # i is index, x is data[i]
    for s in range(-R,R+1): # s ranges over every allowed partial sum, from -R to R inclusive
        max_value = int(abs((target_sum * max_percent)/x))
        for c in range(max_value + 1):
            if T[s - c * x, i]:
                T[s, i+1] = True
coeff = [0]*len(data)
def RecursivelyListAllThatWork(k, sum): # Using last k variables, make sum
    # /* Base case: If we've assigned all the variables correctly, list this
    # * solution.
    # */
    if k == 0:
        # print what we have so far
        print(' + '.join("%2s*%s" % t for t in zip(coeff, data)))
        return
    x_k = data[k-1]
    # /* Recursive step: Try all coefficients, but only if they work. */
    max_value = int(abs((target_sum * max_percent)/x_k))
    for c in range(max_value + 1):
        if T[sum - c * x_k, k - 1]:
            # mark the coefficient of x_k to be c
            coeff[k-1] = c
            RecursivelyListAllThatWork(k - 1, sum - c * x_k)
            # unmark the coefficient of x_k
            coeff[k-1] = 0
RecursivelyListAllThatWork(len(data), target_sum)
