Sorting a random array using permutation - python

I tried to sort an array by permuting it with itself
(the array contains all the numbers in the range 0 to its length - 1).
To test it I used random.shuffle, but it had some unexpected results:
import random
import numpy as np

a = np.array(range(10))
random.shuffle(a)
a = a[a]
a = a[a]
print(a)
# not a sorted array
# [9 5 2 3 1 7 6 8 0 4]
a = np.array([2,1,4,7,6,5,0,3,8,9])
a = a[a]
a = a[a]
print(a)
# [0 1 2 3 4 5 6 7 8 9]
So, for some reason, the permutation in the second example, starting from a hand-picked unsorted array, returns the sorted array as expected, but the shuffled array doesn't work the same way.
Does anyone know why? Or, if there is an easier way to sort using a permutation or something similar, that would be great.

TL;DR
There is no reason to expect a = a[a] to sort the array. In most cases it won't. In case of a coincidence it might.
What is the operation c = b[a]? or Applying a permutation
When you use an array a obtained by shuffling range(n) as a mask for an array b of same size n, you are applying a permutation, in the mathematical sense, to the elements of b. For instance:
a = [2,0,1]
b = np.array(['Alice','Bob','Charlie'])
print(b[a])
# ['Charlie' 'Alice' 'Bob']
In this example, array a represents the permutation (2 0 1), which is a cycle of length 3. Since the length of the cycle is 3, if you apply it three times, you will end up where you started:
a = [2,0,1]
b = np.array(['Alice','Bob','Charlie'])
c = b
for i in range(3):
    c = c[a]
    print(c)
# ['Charlie' 'Alice' 'Bob']
# ['Bob' 'Charlie' 'Alice']
# ['Alice' 'Bob' 'Charlie']
Note that I used strings for the elements of b to avoid confusing them with indices. Of course, I could have used numbers from range(n):
a = [2,0,1]
b = np.array([0,1,2])
c = b
for i in range(3):
    c = c[a]
    print(c)
# [2 0 1]
# [1 2 0]
# [0 1 2]
You might notice an interesting, but unsurprising, fact: the first line is equal to a; in other words, the first result of applying a to b is equal to a itself. This is because b was initialised to [0 1 2], which represents the identity permutation id; thus, the permutations that we find by repeatedly applying a to b are:
id == a^0
a
a^2
a^3 == id
Can we always go back where we started? or The rank of a permutation
It is a well-known result of algebra that if you apply the same permutation again and again, you will eventually end up on the identity permutation. In algebraic notations: for every permutation a, there exists an integer k such that a^k == id.
Can we guess the value of k?
The minimum such value of k is called the rank of the permutation (in standard algebra terminology, its order).
If a is a cycle, then the minimum possible k is the length of the cycle. In our previous example, a was a cycle of length 3, so it took three applications of a before we found the identity permutation again.
How about a cycle of length 2? A cycle of length 2 is just "swapping two elements". For instance, swapping elements 0 and 1:
a = [1,0,2]
b = np.array([0,1,2])
c = b
for i in range(2):
    c = c[a]
    print(c)
# [1 0 2]
# [0 1 2]
We swap 0 and 1, then we swap them back.
How about two disjoint cycles? Let's try a cycle of length 3 on the first three elements, simultaneously with swapping the last two elements:
a = [2,0,1,3,4,5,7,6]
b = np.array([0,1,2,3,4,5,6,7])
c = b
for i in range(6):
    c = c[a]
    print(c)
# [2 0 1 3 4 5 7 6]
# [1 2 0 3 4 5 6 7]
# [0 1 2 3 4 5 7 6]
# [2 0 1 3 4 5 6 7]
# [1 2 0 3 4 5 7 6]
# [0 1 2 3 4 5 6 7]
As you can see by carefully examining the intermediate results, there is a period of length 3 on the first three elements, and a period of length 2 on the last two elements. The overall period is the least common multiple of the two periods, which is 6.
What is k in general? A well-known theorem of algebra states: every permutation can be written as a product of disjoint cycles. The rank of a cycle is the length of the cycle. The rank of a product of disjoint cycles is the least common multiple of the ranks of cycles.
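To make the cycle-decomposition theorem concrete, here is a minimal sketch (my own addition, not part of the original answer) that computes the rank of a permutation array by following each orbit and taking the lcm of the cycle lengths; it assumes Python 3.9+ for math.lcm:
import numpy as np
from math import lcm  # Python 3.9+

def permutation_rank(a):
    """ -> rank (order) of permutation a: lcm of its disjoint cycle lengths """
    a = np.asarray(a)
    seen = np.zeros(len(a), dtype=bool)
    rank = 1
    for start in range(len(a)):
        if seen[start]:
            continue
        # follow the orbit of `start` to measure its cycle length
        length, j = 0, start
        while not seen[j]:
            seen[j] = True
            j = a[j]
            length += 1
        rank = lcm(rank, length)
    return rank

print(permutation_rank([2,0,1]))                # 3 (one 3-cycle)
print(permutation_rank([2,1,4,7,6,5,0,3,8,9]))  # 4 == lcm(4, 2)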
A coincidence in your code: sorting [2,1,4,7,6,5,0,3,8,9]
Let us go back to your python code.
a = np.array([2,1,4,7,6,5,0,3,8,9])
a = a[a]
a = a[a]
print(a)
# [0 1 2 3 4 5 6 7 8 9]
How many times did you apply the permutation a? Note that, because of the assignment a = ..., the array a changed between the first and the second a = a[a] lines. Let us dispel some confusion by using a different variable name for every different value. Your code is equivalent to:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a2 = a[a]
a4 = a2[a2]
print(a4)
Or equivalently:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a4 = (a[a])[a[a]]
This last line looks a little bit complicated. However, a cool result of algebra is that composition of permutations is associative. You already knew that addition and multiplication were associative: x+(y+z) == (x+y)+z and x(yz) == (xy)z. Well, it turns out that composition of permutations is associative as well! Using numpy's masks, this means that:
a[b[c]] == (a[b])[c]
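This identity is easy to spot-check numerically; a quick sketch with random permutations:
import numpy as np

rng = np.random.default_rng(0)
a, b, c = (rng.permutation(10) for _ in range(3))
print(np.array_equal(a[b[c]], (a[b])[c]))  # True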
Thus your python code is equivalent to:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a4 = ((a[a])[a])[a]
print(a4)
Or without the unneeded parentheses:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a4 = a[a][a][a]
print(a4)
Since a4 is the identity permutation, this tells us that the rank of a divides 4. Thus the rank of a is 1, 2 or 4, which tells us that a can be written as a product of swaps and length-4 cycles. The only permutation of rank 1 is the identity itself. Permutations of rank 2 are products of disjoint swaps, and we can see that this is not the case for a. Thus the rank of a must be exactly 4.
You can find the cycles by choosing an element, and following its orbit: what values is that element successively transformed into? Here we see that:
0 is transformed into 2; 2 is transformed into 4; 4 is transformed into 6; 6 is transformed into 0;
1 remains untouched;
3 becomes 7; 7 becomes 3;
5 is untouched; 8 and 9 are untouched.
Conclusion: Your numpy array represents the permutation (0 -> 2 -> 4 -> 6 -> 0)(3 <-> 7), and its rank is the least common multiple of 4 and 2, lcm(4,2) == 4.
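As for the second part of the question (an easier way to sort using a permutation): applying the inverse permutation sorts in a single step, and np.argsort computes the inverse of a permutation array. A small sketch:
import numpy as np

a = np.array([2,1,4,7,6,5,0,3,8,9])
inv = np.argsort(a)   # inverse permutation of a
print(a[inv])         # [0 1 2 3 4 5 6 7 8 9]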

It took some time, but I figured out a way to do it.
numpy doesn't have this feature, but pandas does.
Using df.reindex I can sort a DataFrame by its index:
import pandas as pd
import numpy as np
train_df = pd.DataFrame(range(10))
train_df = train_df.reindex(np.random.permutation(train_df.index))
print(train_df)  # shuffled DataFrame containing all values up to 9
train_df = train_df.reindex(range(10))
print(train_df)  # sorted DataFrame

Related

Creating submatrix in python

Given a matrix S and a binary matrix W, I want to create a submatrix of S corresponding to the non-zero coordinates of W.
For example:
S = [[1,1],[1,2],[1,3],[1,4],[1,5]]
W = [[1,0,0],[1,1,0],[1,1,1],[0,1,1],[0,0,1]]
I want to get matrices
S_1 = [[1,1],[1,2],[1,3]]
S_2 = [[1,2],[1,3],[1,4]]
S_3 = [[1,3],[1,4],[1,5]]
I couldn't figure out a slick way to do this in Python. The best I could do for each S_i is:
S_1 = S[0,:]
for i in range(np.shape(W)[0]):
    if W[i, 0] == 1:
        S_1 = np.vstack((S_1, S[i, :]))
but if I want to change the dimensions of the problem and have, say, 100 S_i's, writing a for loop for each one seems a bit ugly. (Side note: S_1 should be initialized to some empty 2D array, but I couldn't get that to work, so I initialized it to S[0,:] as a placeholder.)
EDIT: To clarify what I mean:
I have a matrix S
1 1
1 2
1 3
1 4
1 5
and I have a binary matrix
1 0 0
1 1 0
1 1 1
0 1 1
0 0 1
Given the first column of the binary matrix W
1
1
1
0
0
The 1's are in the first, second, and third positions. So I want to create a corresponding submatrix of S with just the first, second and third positions of every column, so S_1 (corresponding to the 1st column of W) is
1 1
1 2
1 3
Similarly, if we look at the third column of W
0
0
1
1
1
The 1's are in the last three coordinates and so I want a submatrix of S with just the last three coordinates of every column, called S_3
1 3
1 4
1 5
So given any ith column of the binary matrix, I'm looking to generate a submatrix S_i where the columns of S_i contain the columns of S, but only the entries corresponding to the positions of the 1's in the ith column of the binary matrix.
It probably is more useful to work with the transpose of W rather than W itself, both for human-readability and to facilitate writing the code. This means that the entries that affect each S_i are grouped together in one of the inner parentheses of W, i.e. in a row of W rather than a column as you have it now.
Then, S_i = np.array([S[j,:] for j in range(np.shape(S)[0]) if W_T[i,j] == 1]), where W_T is the transpose of W. If you need/want to stick with W as is, you need to swap the indices i and j.
As for the outer loop, you could try to nest this in another similar comprehension without an if statement; however, this might be awkward since you aren't actually building one output matrix (the S_i can easily have different dimensions, unless you're somehow guaranteed to have the same number of 1s in every column of W). This in fact raises the question of what you want: a list of these arrays S_i? If they are separate variables as you have written, there's no good way to refer to them in a generalizable way, as they don't have indices.
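If a list of the arrays S_i is what you want, here is a short sketch of that using boolean masks (the same idea as the answer below):
import numpy as np

S = np.array([[1,1],[1,2],[1,3],[1,4],[1,5]])
W = np.array([[1,0,0],[1,1,0],[1,1,1],[0,1,1],[0,0,1]])

# one submatrix per column of W, collected in a list
submatrices = [S[W[:, i] == 1] for i in range(W.shape[1])]
print(submatrices[0])
# [[1 1]
#  [1 2]
#  [1 3]]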
Numpy can do this directly.
import numpy as np
S = np.array([[1,1],[1,2],[1,3],[1,4],[1,5]])
W = np.array([[1,0,0],[1,1,0],[1,1,1],[0,1,1],[0,0,1]])
for col in range(W.shape[1]):
    print(S[W[:, col] == 1])
Output:
[[1 1]
[1 2]
[1 3]]
[[1 2]
[1 3]
[1 4]]
[[1 3]
[1 4]
[1 5]]

Why does the last element reflect the number of non-negative solutions?

Please excuse my naivete as I don't have much programming experience. While googling something for an unrelated question, I stumbled upon this:
https://www.geeksforgeeks.org/find-number-of-solutions-of-a-linear-equation-of-n-variables/
I completely understand the first (extremely inefficient) bit of code. But the second:
def countSol(coeff, n, rhs):
    # Create and initialize a table
    # to store results of subproblems
    dp = [0 for i in range(rhs + 1)]
    dp[0] = 1
    # Fill table in bottom up manner
    for i in range(n):
        for j in range(coeff[i], rhs + 1):
            dp[j] += dp[j - coeff[i]]
    return dp[rhs]
confuses me. My question being: why does this second program count the number of non-negative integer solutions?
I have written out several examples, including the one given in the article, and I understand that it does indeed do this. And I understand how it is populating the list. But I don't understand exactly why this works.
Please excuse what must be, to some, an ignorant question. But I would quite like to understand the logic, as I think it rather clever that such a little snippet is able to answer a question as general as "how many non-negative integer solutions exist" (for some general equation).
This algorithm is pretty cool and demonstrates the power of looking for a solution from a different perspective.
Let's take an example: 3x + 2y + z = 6, where LHS is the left-hand side and RHS is the right-hand side.
dp[k] will keep track of the number of unique ways to arrive at a RHS value of k by substituting non-negative integer values for LHS variables.
The i loop iterates over the variables in the LHS. The algorithm begins with setting all the variables to zero. So, the only possible k value is zero, hence
k        0  1  2  3  4  5  6
dp[k] =  1  0  0  0  0  0  0
For i = 0, we will update dp to reflect what happens if x is 1 or 2. We don't care about x > 2 because the solutions are all non-negative and 3x would be too big. The j loop is responsible for updating dp and dp[k] gets incremented by dp[k - 3] because we can arrive at RHS value k by adding one copy of the coefficient 3 to k-3. The result is
k        0  1  2  3  4  5  6
dp[k] =  1  0  0  1  0  0  1
Now the algorithm continues with i = 1, updating dp to reflect all possible RHS values where x is 0, 1, or 2 and y is 0, 1, 2, or 3. This time the j loop increments dp[k] by dp[k-2] because we can arrive at RHS value k by adding one copy of the coefficient 2 to k-2, resulting in
k        0  1  2  3  4  5  6
dp[k] =  1  0  1  1  1  1  2
Finally, the algorithm incorporates z = 1, 2, 3, 4, 5, or 6, resulting in
k        0  1  2  3  4  5  6
dp[k] =  1  1  2  3  4  5  7
In addition to computing the answer in pseudo-polynomial time, dp encodes the answer for every RHS <= the input right hand side.
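As a quick cross-check of the walkthrough above (my own addition), a brute-force enumeration of the non-negative solutions of 3x + 2y + z = 6 agrees with dp[6]:
# enumerate all non-negative (x, y, z) with 3x + 2y + z = 6
count = sum(1 for x in range(3)     # 3x <= 6
              for y in range(4)     # 2y <= 6
              for z in range(7)     # z  <= 6
              if 3*x + 2*y + z == 6)
print(count)  # 7, matching dp[6]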

How to determine row equality across several rows in a fully vectorized way?

I have a numpy.ndarray called grouping of size (S, N). Each row of grouping gives me the group labels of a sample of data. I run my algorithm S times and get new group labels in each iteration.
I want to determine how many times each sample of my data has the same group label as every other sample of my data across the S iterations in a fully vectorized way.
In a not-completely-vectorized way:
sim_matrix = np.zeros((N, N))
for s in range(S):
    sim_matrix += np.equal.outer(grouping[s, :], grouping[s, :])
One vectorized approach would be with broadcasting -
(grouping[:,None,:] == grouping[:,:,None]).sum(0)
For performance, we can use np.count_nonzero -
np.count_nonzero(grouping[:,None,:] == grouping[:,:,None],axis=0)
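A quick sanity check (a sketch with made-up data) that the broadcast version matches the loop from the question:
import numpy as np

rng = np.random.default_rng(0)
S, N = 5, 4
grouping = rng.integers(0, 3, size=(S, N))

# loop version from the question
sim_loop = np.zeros((N, N), dtype=int)
for s in range(S):
    sim_loop += np.equal.outer(grouping[s, :], grouping[s, :])

# broadcast version
sim_bcast = np.count_nonzero(grouping[:, None, :] == grouping[:, :, None], axis=0)
print(np.array_equal(sim_loop, sim_bcast))  # True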
The sum of equal.outer is a cryptic way of calculating all-pairs similarity of columns:
sum_i sum_jk (A[i,j] == A[i,k]) is the same as
sum_jk sum_i (A[i,j] == A[i,k])
where sum_i loops over rows, sum_jk over all pairs of columns.
Comparing two vectors by counting the number of positions where they differ is called the Hamming distance.
If we change == above to != (turning similarity into distance = nrows - similarity, so that most similar ⇔ distance 0), we get the problem: find the Hamming distance between all pairs of a bunch of vectors:
def allpairs_hamming(A, dtype=np.uint32):
    """ -> Hamming distances between all pairs of rows of A """
    nrow, ncol = A.shape
    allpair_dist = np.zeros([nrow, nrow], dtype=dtype)
    for j in range(nrow):
        for k in range(j + 1, nrow):
            allpair_dist[j, k] = allpair_dist[k, j] = (A[j] != A[k]).sum()  # row diff
    return allpair_dist
allpairs_hamming: 30.7 sec, 3 ns per cmp Nvec 2000 Veclen 5000 A 10m pairdist uint32 15m
Almost all the cpu time is in the row diff, not in the outer loop for j ... for k -- 3 ns per scalar compare, on a stock mac, isn't bad.
However memory caching is much faster if each row A[j] is in contiguous memory,
as for numpy C-order arrays.
Apart from that, whether you do "all pairs of rows" or "all pairs of columns"
doesn't matter, as long as you're clear.
(Is it possible to find "nearby" pairs in time and space < O(npairs), here O(20000^2) ? Afaik there are more methods than test cases.)
See also:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html (bug: hamming .mean not .sum)
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
https://stats.stackexchange.com/search?q=[clustering]+pairwise
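On the scipy note above: scipy's hamming metric returns the fraction of differing positions (a mean, not a sum), so multiply by the row length to get mismatch counts. A small sketch:
import numpy as np
from scipy.spatial.distance import pdist, squareform

A = np.array([[0, 1, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 1, 0]])
counts = squareform(pdist(A, metric="hamming")) * A.shape[1]
print(counts)  # [[0. 2. 1.] [2. 0. 3.] [1. 3. 0.]]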
You want to compare identical rows. A way to do that is to view each entire row as one raw block of bytes:
S,N=12,2
a=np.random.randint(0,3,(S,N)) #12 samples of two labels.
# a, printed with its row index:
#     0  1
# 0   2  2
# 1   2  0
# 2   1  2
# 3   0  0
# 4   0  1
# 5   1  1
# 6   0  1
# 7   0  1
# 8   0  1
# 9   0  0
# 10  2  2
# 11  0  0
samples = np.ascontiguousarray(a).view(np.dtype((np.void, a.strides[0])))
samples.shape is then (S, 1).
You can now inventory your samples with np.unique, and use a pandas DataFrame for a pretty report:
_,inds,invs=np.unique(samples,return_index=True, return_inverse=True)
df=pd.DataFrame(invs)
result=df.reset_index().groupby(0).index.apply(list).to_frame()
result['sample']=[list(x) for x in a[inds]]
which gives:
          index  sample
0
0    [3, 9, 11]  [0, 0]
1  [4, 6, 7, 8]  [0, 1]
2           [5]  [1, 1]
3           [2]  [1, 2]
4           [1]  [2, 0]
5       [0, 10]  [2, 2]
This can be O(S ln S) if few samples coincide, whereas yours is O(N²S).

replace element by element different arrays

I have an array :
a = np.array([1,2,3,4,5,6,7,8])
The array may be reshaped to a = np.array([[1,2,3,4],[5,6,7,8]]), whatever is more convenient.
Now, I have an array :
b = np.array([[11,22,33,44], [55,66,77,88]])
I want to replace each of these elements with the corresponding element from a.
The a array will always hold as many elements as b has.
So, array b will be :
[1,2,3,4], [5,6,7,8]
Note that I must keep each b subarray's shape as (4,); I don't want to change it. So idx will take values from 0 to 3; I want to fit a onto every group of four values.
I am struggling with reshape, split, mask, etc. and I can't figure out a way to do it.
import numpy as np
#a = np.array([[1,2,3,4],[5,6,7,8]])
a = np.array([1,2,3,4,5,6,7,8])
b = np.array([[11,22,33,44], [55,66,77,88]])
for arr in b:
    for idx, x in enumerate(arr):
        # replace every arr[idx] with the corresponding a value
For your current case, what you want is probably:
b, c = list(a.reshape(2, -1))
This isn't the cleanest, but it is a one-liner. Turn your 1D array into a 2D array whose first dimension is 2 with reshape(2, -1); then list splits it along the first dimension so you can directly assign the rows to b and c.
You can also do it with the specialty function numpy.split
b, c = np.split(a, 2)
EDIT: Based on accepted solution, vectorized way to do this is
b = a.reshape(b.shape)
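One caveat (my note, not the answerer's): b = a.reshape(b.shape) rebinds the name b to a view of a. If the goal is to overwrite b's existing buffer in place, a small sketch:
import numpy as np

a = np.array([1,2,3,4,5,6,7,8])
b = np.array([[11,22,33,44], [55,66,77,88]])
b[...] = a.reshape(b.shape)   # writes into b's own memory
print(b)
# [[1 2 3 4]
#  [5 6 7 8]]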
The following worked for me:
i = 0
for arr in b:
    for idx, x in enumerate(arr):
        arr[idx] = a[i]
        print(arr[idx])
        i += 1
Output (arr[idx]): 1 2 3 4 5 6 7 8
If you type print(b) it'll output [[1 2 3 4] [5 6 7 8]]
b = a[:len(a)//2]
c = a[len(a)//2:]
Well, I'm quite new to Python but this worked for me:
for i in range(0, len(a)//2):
    b[i] = a[i]
for i in range(len(a)//2, len(a)):
    c[i-4] = a[i]
by printing the 3 arrays I have the following output:
[1 2 3 4 5 6 7 8]
[1 2 3 4]
[5 6 7 8]
But I would go for Daniel's solution (the split one): a one-liner, using the numpy API, ...

Insert a numpy array into another without having to worry about length

When doing:
import numpy
A = numpy.array([1,2,3,4,5,6,7,8,9,10])
B = numpy.array([1,2,3,4,5,6])
A[7:7+len(B)] = B # A[7:7+len(B)] has in fact length 3 !
we get this typical error:
ValueError: could not broadcast input array from shape (6) into shape (3)
This is 100% normal, because A[7:7+len(B)] has length 3, not length len(B) = 6, and thus cannot receive the content of B!
How can I prevent this from happening and easily have the content of B copied into A, starting at A[7]?
A[7:???] = B[???]
# I would like [1 2 3 4 5 6 7 1 2 3]
This could be called "auto-broadcasting", i.e. we don't have to worry about length of arrays.
Edit: another example if len(A) = 20:
A = numpy.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20])
B = numpy.array([1,2,3,4,5,6])
A[7:7+len(B)] = B
A # [ 1 2 3 4 5 6 7 1 2 3 4 5 6 14 15 16 17 18 19 20]
Just tell it when to stop using len(A).
A[7:7+len(B)] = B[:len(A)-7]
Example:
import numpy
B = numpy.array([1,2,3,4,5,6])
A = numpy.array([1,2,3,4,5,6,7,8,9,10])
A[7:7+len(B)] = B[:len(A)-7]
print(A)  # [1 2 3 4 5 6 7 1 2 3]
A = numpy.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20])
A[7:7+len(B)] = B[:len(A)-7]
print(A)  # [ 1 2 3 4 5 6 7 1 2 3 4 5 6 14 15 16 17 18 19 20]
import numpy
A = numpy.array([1,2,3,4,5,6,7,8,9,10])
B = numpy.array([1,2,3,4,5,6])
numpy.hstack((A[0:7],B))[0:len(A)]
On second thought, this fails in the case where B fits entirely inside A.
So...
import numpy
A = numpy.array([1,2,3,4,5,6,7,8,9,10])
B = numpy.array([1,2,3,4,5,6])
if 7 + len(B) > len(A):
    A = numpy.hstack((A[0:7], B))[0:len(A)]
else:
    A[7:7+len(B)] = B
but, this sort of defeats the purpose of the question! I'm sure you prefer a one-liner!
Same question, but in 2d
Numpy - Overlap 2 matrices at a particular position
There I try to make the case that it is better that you take responsibility for determining which part of B should be copied:
A[7:] = B[:3]
A[7:] = B[-3:]
A[7:] = B[3:6]
np.put will do this sort of clipping for you, but you have to give it an index list, not a slice:
np.put(x, range(7,len(x)), B)
which isn't much better than x[7:] = B[:len(x)-7].
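For what it's worth, a tiny demonstration that np.put simply ignores the extra values of B once the indices run out:
import numpy as np

A = np.array([1,2,3,4,5,6,7,8,9,10])
B = np.array([1,2,3,4,5,6])
np.put(A, range(7, len(A)), B)  # only the first 3 values of B are used
print(A)  # [ 1  2  3  4  5  6  7  1  2  3]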
The doc for put tells me there are also putmask, place, and copyto functions. And the counterpart to put is take.
An interesting thing is that while these other functions give more power than indexing, with modes like clip and repeat, I don't see them being used much. I think that's because it is easier to write a function that handles your special case, than it is to remember/lookup general functions with lots of options.
