I tried to sort an array by permuting it with itself
(the array contains all the numbers in the range from 0 to its length - 1).
To test it I used random.shuffle, but I got some unexpected results:
import random
import numpy as np

a = np.array(range(10))
random.shuffle(a)
a = a[a]
a = a[a]
print(a)
# not a sorted array
# [9 5 2 3 1 7 6 8 0 4]
a = np.array([2,1,4,7,6,5,0,3,8,9])
a = a[a]
a = a[a]
print(a)
# [0 1 2 3 4 5 6 7 8 9]
So for some reason, permuting the second (hand-written) unsorted array returns the sorted array as expected, but the shuffled array doesn't behave the same way.
Does anyone know why? Or, if there is an easier way to sort using permutations or something similar, that would be great.
TL;DR
There is no reason to expect a = a[a] to sort the array. In most cases it won't. In case of a coincidence it might.
What is the operation c = b[a]? or Applying a permutation
When you use an array a obtained by shuffling range(n) as an index array for an array b of the same size n (numpy's integer-array indexing), you are applying a permutation, in the mathematical sense, to the elements of b. For instance:
a = [2,0,1]
b = np.array(['Alice','Bob','Charlie'])
print(b[a])
# ['Charlie' 'Alice' 'Bob']
In this example, array a represents the permutation (2 0 1), which is a cycle of length 3. Since the length of the cycle is 3, if you apply it three times, you will end up where you started:
a = [2,0,1]
b = np.array(['Alice','Bob','Charlie'])
c = b
for i in range(3):
    c = c[a]
    print(c)
# ['Charlie' 'Alice' 'Bob']
# ['Bob' 'Charlie' 'Alice']
# ['Alice' 'Bob' 'Charlie']
Note that I used strings for the elements of b to avoid confusing them with indices. Of course, I could have used numbers from range(n):
a = [2,0,1]
b = np.array([0,1,2])
c = b
for i in range(3):
    c = c[a]
    print(c)
# [2 0 1]
# [1 2 0]
# [0 1 2]
You might see an interesting, but unsurprising fact: the first line is equal to a; in other words, the first result of applying a to b is equal to a itself. This is because b was initialised to [0 1 2], which represents the identity permutation id; thus, the permutations that we find by repeatedly applying a to b are:
id == a^0
a
a^2
a^3 == id
Can we always go back where we started? or The rank of a permutation
It is a well-known result of algebra that if you apply the same permutation again and again, you will eventually end up on the identity permutation. In algebraic notation: for every permutation a, there exists an integer k >= 1 such that a^k == id.
Can we guess the value of k?
The minimum value of k is called the rank of the permutation (more commonly known as its order).
If a is a cycle, then the minimum possible k is the length of the cycle. In our previous example, a was a cycle of length 3, so it took three applications of a before we found the identity permutation again.
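To find k for a concrete permutation, a brute-force sketch is to keep applying it until the identity comes back. The helper name applications_until_identity is made up for this illustration, and it assumes that a really is a permutation of range(n):

```python
import numpy as np

def applications_until_identity(a):
    """Count how many times `a` must be applied to get the identity back.

    Hypothetical helper; assumes `a` is a permutation of range(len(a)).
    """
    a = np.asarray(a)
    identity = np.arange(len(a))
    c = identity[a]              # one application of a
    k = 1
    while not np.array_equal(c, identity):
        c = c[a]                 # apply a once more
        k += 1
    return k

print(applications_until_identity([2, 0, 1]))  # the 3-cycle from above
print(applications_until_identity([1, 0, 2]))  # a single swap
```

This is only meant for small experiments: the loop runs k times, and k can get large for long arrays.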
How about a cycle of length 2? A cycle of length 2 is just "swapping two elements". For instance, swapping elements 0 and 1:
a = [1,0,2]
b = np.array([0,1,2])
c = b
for i in range(2):
    c = c[a]
    print(c)
# [1 0 2]
# [0 1 2]
We swap 0 and 1, then we swap them back.
How about two disjoint cycles? Let's try a cycle of length 3 on the first three elements, simultaneously with swapping the last two elements:
a = [2,0,1,3,4,5,7,6]
b = np.array([0,1,2,3,4,5,6,7])
c = b
for i in range(6):
    c = c[a]
    print(c)
# [2 0 1 3 4 5 7 6]
# [1 2 0 3 4 5 6 7]
# [0 1 2 3 4 5 7 6]
# [2 0 1 3 4 5 6 7]
# [1 2 0 3 4 5 7 6]
# [0 1 2 3 4 5 6 7]
As you can see by carefully examining the intermediate results, there is a period of length 3 on the first three elements, and a period of length 2 on the last two elements. The overall period is the least common multiple of the two periods, which is 6.
What is k in general? A well-known theorem of algebra states that every permutation can be written as a product of disjoint cycles. The rank of a cycle is its length, and the rank of a product of disjoint cycles is the least common multiple of the ranks of those cycles.
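The theorem can be turned into a short computation: split the permutation into cycles by following each element's orbit, then take the least common multiple of the cycle lengths. This is a minimal sketch; the function names nontrivial_cycles and rank_of are invented here, and perm is assumed to be a permutation of range(n) given as a list:

```python
from functools import reduce
from math import gcd

def nontrivial_cycles(perm):
    """Return the cycles of length > 1, following i -> perm[i]."""
    seen = [False] * len(perm)
    cycles = []
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycle = []
        i = start
        while not seen[i]:       # walk the orbit of `start`
            seen[i] = True
            cycle.append(i)
            i = perm[i]
        if len(cycle) > 1:       # fixed points are left out
            cycles.append(cycle)
    return cycles

def rank_of(perm):
    """Least common multiple of the cycle lengths (1 for the identity)."""
    lengths = [len(c) for c in nontrivial_cycles(perm)]
    return reduce(lambda x, y: x * y // gcd(x, y), lengths, 1)

perm = [2, 0, 1, 3, 4, 5, 7, 6]
print(nontrivial_cycles(perm))
print(rank_of(perm))
```

Unlike repeatedly applying the permutation, this visits each element once, so it stays cheap even when the rank is large.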
A coincidence in your code: sorting [2,1,4,7,6,5,0,3,8,9]
Let us go back to your Python code.
a = np.array([2,1,4,7,6,5,0,3,8,9])
a = a[a]
a = a[a]
print(a)
# [0 1 2 3 4 5 6 7 8 9]
How many times did you apply permutation a? Note that because of the assignment a =, array a changed between the first and the second a = a[a] lines. Let us dispel some confusion by using a different variable name for every different value. Your code is equivalent to:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a2 = a[a]
a4 = a2[a2]
print(a4)
Or equivalently:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a4 = (a[a])[a[a]]
This last line looks a little bit complicated. However, a cool result of algebra is that composition of permutations is associative. You already knew that addition and multiplication were associative: x+(y+z) == (x+y)+z and x(yz) == (xy)z. Well, it turns out that composition of permutations is associative as well! In terms of numpy's integer-array indexing, this means that:
a[b[c]] == (a[b])[c]
Thus your Python code is equivalent to:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a4 = ((a[a])[a])[a]
print(a4)
Or without the unneeded parentheses:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a4 = a[a][a][a]
print(a4)
Since a4 is the identity permutation, this tells us that the rank of a divides 4. Thus the rank of a is 1, 2 or 4, which means that a can be written as a product of disjoint swaps and length-4 cycles. The only permutation of rank 1 is the identity itself. Permutations of rank 2 are products of disjoint swaps, and we can see that this is not the case for a. Thus the rank of a must be exactly 4.
You can find the cycles by choosing an element, and following its orbit: what values is that element successively transformed into? Here we see that:
0 is transformed into 2; 2 is transformed into 4; 4 is transformed into 6; 6 is transformed into 0;
1 remains untouched;
3 becomes 7; 7 becomes 3;
5 is untouched; 8 and 9 are untouched.
Conclusion: Your numpy array represents the permutation (0 -> 2 -> 4 -> 6 -> 0)(3 <-> 7), and its rank is the least common multiple of 4 and 2, lcm(4,2) == 4.
It took some time, but I figured out a way to do it.
I couldn't find this feature in numpy, but pandas has it:
by using df.reindex I can sort a data frame by its index.
import pandas as pd
import numpy as np
train_df = pd.DataFrame(range(10))
train_df = train_df.reindex(np.random.permutation(train_df.index))
print(train_df)  # shuffled data frame containing all values from 0 to 9
train_df = train_df.reindex(range(10))
print(train_df)  # sorted data frame
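For comparison, the same un-shuffling can be sketched with numpy alone. For an array that is a permutation of range(n), np.argsort returns the inverse permutation, and indexing with it undoes the shuffle. The names and sample data below are only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
names = np.array(['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'])

perm = rng.permutation(len(names))   # a random permutation of range(5)
shuffled = names[perm]               # apply the permutation

inverse = np.argsort(perm)           # inverse[j] is the position i with perm[i] == j
restored = shuffled[inverse]         # apply the inverse permutation

print(restored)                      # same order as `names`
print(perm[inverse])                 # the permutation composed with its inverse
```

In the language of the answer above, inverse plays the role of a^(k-1), where k is the rank of perm.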
Please excuse my naivete as I don't have much programming experience. While googling something for an unrelated question, I stumbled upon this:
https://www.geeksforgeeks.org/find-number-of-solutions-of-a-linear-equation-of-n-variables/
I completely understand the first (extremely inefficient) bit of code. But the second:
def countSol(coeff, n, rhs):
    # Create and initialize a table
    # to store results of subproblems
    dp = [0 for i in range(rhs + 1)]
    dp[0] = 1
    # Fill table in bottom up manner
    for i in range(n):
        for j in range(coeff[i], rhs + 1):
            dp[j] += dp[j - coeff[i]]
    return dp[rhs]
confuses me. My question is: why does this second program count the number of non-negative integer solutions?
I have written out several examples, including the one given in the article, and I understand that it does indeed do this. And I understand how it is populating the list. But I don't understand exactly why this works.
Please excuse what must be, to some, an ignorant question. But I would quite like to understand the logic, as I think it rather clever that such a little snippet is able to answer a question as general as "How many non-negative integer solutions exist" (for some general equation).
This algorithm is pretty cool and demonstrates the power of looking for a solution from a different perspective.
Let's take an example: 3x + 2y + z = 6, where LHS is the left hand side and RHS is the right hand side.
dp[k] will keep track of the number of unique ways to arrive at a RHS value of k by substituting non-negative integer values for LHS variables.
The i loop iterates over the variables in the LHS. The algorithm begins with setting all the variables to zero. So, the only possible k value is zero, hence
k     = 0 1 2 3 4 5 6
dp[k] = 1 0 0 0 0 0 0
For i = 0, we will update dp to reflect what happens if x is 1 or 2. We don't care about x > 2 because the solutions are all non-negative and 3x would be too big. The j loop is responsible for updating dp and dp[k] gets incremented by dp[k - 3] because we can arrive at RHS value k by adding one copy of the coefficient 3 to k-3. The result is
k     = 0 1 2 3 4 5 6
dp[k] = 1 0 0 1 0 0 1
Now the algorithm continues with i = 1, updating dp to reflect all possible RHS values where x is 0, 1, or 2 and y is 0, 1, 2, or 3. This time the j loop increments dp[k] by dp[k-2] because we can arrive at RHS value k by adding one copy of the coefficient 2 to k-2, resulting in
k     = 0 1 2 3 4 5 6
dp[k] = 1 0 1 1 1 1 2
Finally, the algorithm incorporates z = 1, 2, 3, 4, 5, or 6, resulting in
k     = 0 1 2 3 4 5 6
dp[k] = 1 1 2 3 4 5 7
In addition to computing the answer in pseudo-polynomial time, dp encodes the answer for every RHS <= the input right hand side.
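To watch the table evolve, here is a small sketch that runs the same update as countSol but records dp after each variable, and then cross-checks the final count by brute force. The names dp_snapshots and brute_force are invented for this illustration:

```python
def dp_snapshots(coeff, rhs):
    """Run the countSol update and keep a copy of dp after each variable."""
    dp = [0] * (rhs + 1)
    dp[0] = 1                          # all variables zero: one way to reach 0
    snapshots = [dp.copy()]
    for c in coeff:
        for j in range(c, rhs + 1):
            dp[j] += dp[j - c]         # reach j by adding one more copy of c
        snapshots.append(dp.copy())
    return snapshots

# 3x + 2y + z = 6
for row in dp_snapshots([3, 2, 1], 6):
    print(row)

# Brute-force check: enumerate every (x, y, z) that could possibly fit.
brute_force = sum(
    1
    for x in range(3)
    for y in range(4)
    for z in range(7)
    if 3 * x + 2 * y + z == 6
)
print(brute_force)
```

The printed rows should match the four tables above, and the last entry of the last row should agree with the brute-force count.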
I have a large list of sub-lists (approx. 16000), and I want to find where its repeating pattern starts and ends. I am not 100% sure that there is a repeat; however, I have strong reason to believe so, due to the diagonals that appear within the sub-list sequence. The structure of a list of sub-lists is preferred, as it is used that way for other things in this script. The data looks like this:
data = ['1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011', etc
I do not have any time constraints, although a fast method would not be frowned upon. The code should be able to return the starting/ending sequence and its location within the list, to be called upon in the future. If there is an arrangement of the data that would be more useful, I can try to reformat it if necessary. Python is something that I have been learning for the past few months, so I am not quite able to create my own algorithms from scratch just yet. Thank you!
Here's some fairly simple code that scans a string for adjacent repeating subsequences. Set minrun to the length of the smallest subsequences that you want to check. For each match, the code prints the starting index of the first subsequence, the length of the subsequence, and the subsequence itself.
data = [
    '1100100100000010',
    '1001001000000110',
    '0010010000001100',
    '0100100000011011',
]
data = ''.join(data)
minrun = 3
lendata = len(data)
for runlen in range(minrun, lendata // 2):
    i = 0
    while i < lendata - runlen * 2:
        s1 = data[i:i + runlen]
        s2 = data[i + runlen:i + runlen * 2]
        if s1 == s2:
            print(i, runlen, s1)
            i += runlen
        else:
            i += 1
output
1 3 100
4 3 100
8 3 000
15 3 010
18 3 010
23 3 000
32 3 001
38 3 000
47 3 001
53 3 000
17 15 001001000000110
32 15 001001000000110
Note that we get the same sequence of length 3 at index 15 and 18 = 15 + 3 : 010; that indicates that there are 3 adjacent copies of 010. Similarly, there are 3 adjacent copies of the sequence at index 17 of length 15.
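The "3 adjacent copies" observation can be checked directly. The sketch below counts how many back-to-back copies of a subsequence start at a given index; count_adjacent_copies is a hypothetical helper, not part of the code above, and it assumes runlen >= 1:

```python
def count_adjacent_copies(data, start, runlen):
    """Count back-to-back copies of data[start:start + runlen] from `start`."""
    unit = data[start:start + runlen]
    copies = 1
    pos = start + runlen
    # A short or empty slice at the end of the string never equals `unit`,
    # so the loop stops there on its own.
    while data[pos:pos + runlen] == unit:
        copies += 1
        pos += runlen
    return copies

data = ''.join([
    '1100100100000010',
    '1001001000000110',
    '0010010000001100',
    '0100100000011011',
])
print(count_adjacent_copies(data, 15, 3))
print(count_adjacent_copies(data, 17, 15))
```

Given a (start, runlen) pair reported by the scan, the repeating region then runs from start to start + copies * runlen.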
I am researching how Python implements dictionaries. The dictionary implementation does its pseudo-random probing for an empty slot using the equation
j = ((j*5) + 1) % 2**i
which is explained here.
I have read this question, How are Python's Built In Dictionaries Implemented?, and basically understand how dictionaries are implemented.
What I don't understand is why/how the equation:
j = ((j*5) + 1) % 2**i
cycles through all the remainders of 2**i. For instance, with i = 3, for a total starting size of 8, j goes through the cycle:
0 1 6 7 4 5 2 3 0
If the starting size is 16, it goes through the cycle:
0 1 6 15 12 13 2 11 8 9 14 7 4 5 10 3 0
This is very useful for probing all the slots in the dictionary. But why does it work? Why does j = ((j*5)+1) work, but not j = ((j*6)+1) or j = ((j*3)+1), both of which get stuck in smaller cycles?
I am hoping to get a more intuitive understanding of this than the equation just works and that's why they used it.
This is the same principle that pseudo-random number generators use, as Jasper hinted at, namely linear congruential generators. A linear congruential generator is a sequence that follows the relationship X_(n+1) = (a * X_n + c) mod m. From the wiki page,
The period of a general LCG is at most m, and for some choices of factor a much less than that. The LCG will have a full period for all seed values if and only if:
m and c are relatively prime.
a - 1 is divisible by all prime factors of m.
a - 1 is divisible by 4 if m is divisible by 4.
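It's clear that 5 is the smallest a greater than 1 that satisfies these requirements (a = 1 also satisfies them, but it gives plain linear probing), namely
2^i and 1 are relatively prime.
a - 1 = 4 is divisible by 2, the only prime factor of 2^i.
a - 1 = 4 is divisible by 4.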
Also interestingly, 5 is not the only number that satisfies these conditions. 9 will also work. Taking m to be 16, using j=(9*j+1)%16 yields
0 1 10 11 4 5 14 15 8 9 2 3 12 13 6 7
The proof for these three conditions can be found in the original Hull-Dobell paper on page 5, along with a bunch of other PRNG-related theorems that also may be of interest.
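The difference between the multipliers is easy to see experimentally. This is a minimal sketch that follows j = (a*j + c) % m from a starting slot until a slot repeats; visited_slots is a name made up for this illustration:

```python
def visited_slots(a, m, c=1, start=0):
    """Slots visited by j = (a*j + c) % m before any slot repeats."""
    seen = set()
    order = []
    j = start
    while j not in seen:
        seen.add(j)
        order.append(j)
        j = (a * j + c) % m
    return order

# Compare a few multipliers on a table of 16 slots.
for a in (3, 5, 6, 9):
    slots = visited_slots(a, 16)
    print(a, len(slots), slots)
```

Multipliers that meet the three conditions should report all m slots; the others stop early, either by returning to the start or by getting stuck on a fixed point.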
I have a pandas series of value_counts for a data set. I would like to plot the data with a color band (I'm using bokeh, but calculating the data band is the important part).
I hesitate to use the term standard deviation, since all the references I use calculate that based on the mean value, and I specifically want to use the mode as the center.
So, basically, I'm looking for a way in pandas to start at the mode and return a new series of value counts that includes 68.2% of the sum of the value_counts. If I had this series:
val count
1 0
2 0
3 3
4 1
5 2
6 5 <-- mode
7 4
8 3
9 2
10 1
total = sum(count) # example value 21
band1_count = 21 * 0.682 # example value ~ 14.3
This is the order in which they would be added by an algorithm that walks the value counts on each side of the mode and includes the higher of the two until the sum of the counts is greater than 14.3.
band1_values = [6, 7, 8, 5, 9]
Here are the steps:
val count step
1 0
2 0
3 3
4 1
5 2 <-- 4) add to list -- eq (9,2), closer to (6,5)
6 5 <-- 1) add to list -- mode
7 4 <-- 2) add to list -- gt (5,2)
8 3 <-- 3) add to list -- gt (5,2)
9 2 <-- 5) add to list -- gt (4,1), stop since sum of counts > 14.3
10 1
Is there a native way to do this calculation in pandas or numpy? If there is a formal name for this study, I would appreciate knowing what it's called.
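For reference, the walk described in the steps above can be written as a short loop. This is only a sketch of that procedure, not a built-in pandas or numpy feature. The name mode_band is invented here, the series is assumed to be indexed by value in ascending order (for example after sort_index()), the first maximum is taken as the mode, and a tie between two neighbours at the same distance from the mode goes to the right-hand one, which is an assumption the description does not settle:

```python
import pandas as pd

def mode_band(counts, fraction=0.682):
    """Values added, in order, while growing a band outward from the mode."""
    target = counts.sum() * fraction
    pos = int(counts.to_numpy().argmax())     # position of the (first) mode
    lo = hi = pos
    picked = [counts.index[pos]]
    running = counts.iloc[pos]
    last = len(counts) - 1
    while running <= target and (lo > 0 or hi < last):
        # -1 marks a side that has run out of values (real counts are >= 0).
        left = counts.iloc[lo - 1] if lo > 0 else -1
        right = counts.iloc[hi + 1] if hi < last else -1
        right_is_closer = (hi + 1 - pos) <= (pos - (lo - 1))
        if right > left or (right == left and right_is_closer):
            hi += 1
            nxt = hi
        else:
            lo -= 1
            nxt = lo
        picked.append(counts.index[nxt])
        running += counts.iloc[nxt]
    return picked

counts = pd.Series([0, 0, 3, 1, 2, 5, 4, 3, 2, 1], index=range(1, 11))
print(mode_band(counts))
```

The result can be used to select the band from the original series, e.g. counts.loc[sorted(mode_band(counts))].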