I'm looking for the most efficient way of finding the intersection of two different-sized matrices. Each matrix has three variables (columns) and a varying number of observations (rows). For example, matrix A:
a = np.matrix('1 5 1003; 2 4 1002; 4 3 1008; 8 1 2005')
b = np.matrix('7 9 1006; 4 4 1007; 7 7 1050; 8 2 2003; 9 9 3000; 7 7 1000')
If I set the tolerance for each column as col1 = 1, col2 = 2, and col3 = 10, I would want a function such that it would output the indices in a and b that are within their respective tolerance, for example:
[x1, x2] = func(a, b, col1, col2, col3)
print(x1)
>> [2 3]
print(x2)
>> [1 3]
You can see from the indices that row 2 of a is within the tolerances of row 1 of b, and row 3 of a is within the tolerances of row 3 of b.
I'm thinking I could loop through each element of matrix a, check if it's within the tolerances of each element in b, and do it that way. But it seems inefficient for very large data sets.
Any suggestions for alternatives to a looping method for accomplishing this?
If you don't mind working with NumPy arrays, you could exploit broadcasting for a vectorized solution. Here's the implementation -
# Set tolerance values for each column
tol = [1, 2, 10]
# Get absolute differences between a and b keeping their columns aligned
diffs = np.abs(np.asarray(a[:,None]) - np.asarray(b))
# Compare each row with the triplet from `tol`.
# Get mask of all matching rows and finally get the matching indices
x1,x2 = np.nonzero((diffs < tol).all(2))
Sample run -
In [46]: # Inputs
...: a=np.matrix('1 5 1003; 2 4 1002; 4 3 1008; 8 1 2005')
...: b=np.matrix('7 9 1006; 4 4 1007; 7 7 1050; 8 2 2003; 9 9 3000; 7 7 1000')
...:
In [47]: # Set tolerance values for each column
...: tol = [1, 2, 10]
...:
...: # Get absolute differences between a and b keeping their columns aligned
...: diffs = np.abs(np.asarray(a[:,None]) - np.asarray(b))
...:
...: # Compare each row with the triplet from `tol`.
...: # Get mask of all matching rows and finally get the matching indices
...: x1,x2 = np.nonzero((diffs < tol).all(2))
...:
In [48]: x1,x2
Out[48]: (array([2, 3]), array([1, 3]))
Large-data case: if you are working with huge datasets that cause memory issues, then, since you already know that the number of columns is a small number (3), you might want to use a minimal loop of 3 iterations and save a huge memory footprint, like so -
na = a.shape[0]
nb = b.shape[0]
accum = np.ones((na,nb),dtype=bool)
for i in range(a.shape[1]):
    accum &= np.abs(a[:,i] - b[:,i].ravel()) < tol[i]
x1,x2 = np.nonzero(accum)
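As a sanity check, here is a minimal sketch (using plain arrays instead of np.matrix) of the same column-wise loop on the sample data; it reproduces the indices from the sample run above:

import numpy as np

a = np.array([[1, 5, 1003], [2, 4, 1002], [4, 3, 1008], [8, 1, 2005]])
b = np.array([[7, 9, 1006], [4, 4, 1007], [7, 7, 1050],
              [8, 2, 2003], [9, 9, 3000], [7, 7, 1000]])
tol = [1, 2, 10]

accum = np.ones((a.shape[0], b.shape[0]), dtype=bool)
for i in range(a.shape[1]):
    # (na, 1) against (nb,) broadcasts to an (na, nb) boolean mask per column
    accum &= np.abs(a[:, i, None] - b[:, i]) < tol[i]
x1, x2 = np.nonzero(accum)
print(x1, x2)   # [2 3] [1 3]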
Related
I have an array full of zeros created as A = np.zeros((m, m)) with m = 6.
I want to fill this array with specific numbers such that each element equals the sum of its row and column number, i.e. A(x,y) = x+y.
How can I do this using a for loop and a while loop?
A method that avoids a loop, with a rather significant performance improvement on a large ndarray:
A = np.zeros((6,6))
m = A.shape[0]
n = A.shape[1]
x = np.transpose(np.array([*range(1,m+1)]*n).reshape(n,m))
y = np.array([*range(1,n+1)]*m).reshape(m,n)
A = x+y
print(A)
[[ 2 3 4 5 6 7]
[ 3 4 5 6 7 8]
[ 4 5 6 7 8 9]
[ 5 6 7 8 9 10]
[ 6 7 8 9 10 11]
[ 7 8 9 10 11 12]]
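For reference, a shorter broadcast-based sketch that builds the same table (the +1 offsets give the 1-based numbering used above):

import numpy as np

m, n = 6, 6
# column vector of row numbers + row vector of column numbers -> (m, n) table
A = np.arange(1, m + 1)[:, None] + np.arange(1, n + 1)
print(A)   # same 6x6 table as above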
A = np.zeros((6,6))
for i in range(0, A.shape[0]):
    for j in range(0, A.shape[1]):
        A[i][j] = i + j + 2
If you want the rows and columns to start from 1, you can use this code directly; if you want them to start from 0, simply remove the "+2" in line 4.
Explanation:
I first traverse the rows in the outer loop, then the columns in the inner loop, access each cell with A[i][j], and assign it i+j+2 (or just i+j). This way the array is filled with the new values.
Have you tried this?
for y in range(len(A)):
    for x in range(len(A[y])):
        A[y][x] = x + y
I tried to sort an array by permuting it with itself
(the array contains all the numbers in the range from 0 to its length - 1).
To test it I used random.shuffle, but it had some unexpected results:
import random
import numpy as np

a = np.array(range(10))
random.shuffle(a)
a = a[a]
a = a[a]
print(a)
# not a sorted array
# [9 5 2 3 1 7 6 8 0 4]
a = np.array([2,1,4,7,6,5,0,3,8,9])
a = a[a]
a = a[a]
print(a)
# [0 1 2 3 4 5 6 7 8 9]
So for some reason the permutation in the second example, starting from an unsorted array, returns the sorted array as expected, but the shuffled array doesn't work the same way.
Does anyone know why? Or if there is an easier way to sort using permutation or something similar it would be great.
TL;DR
There is no reason to expect a = a[a] to sort the array. In most cases it won't. In case of a coincidence it might.
What is the operation c = b[a]? or Applying a permutation
When you use an array a obtained by shuffling range(n) to index an array b of the same size n, you are applying a permutation, in the mathematical sense, to the elements of b. For instance:
a = [2,0,1]
b = np.array(['Alice','Bob','Charlie'])
print(b[a])
# ['Charlie' 'Alice' 'Bob']
In this example, array a represents the permutation (2 0 1), which is a cycle of length 3. Since the length of the cycle is 3, if you apply it three times, you will end up where you started:
a = [2,0,1]
b = np.array(['Alice','Bob','Charlie'])
c = b
for i in range(3):
c = c[a]
print(c)
# ['Charlie' 'Alice' 'Bob']
# ['Bob' 'Charlie' 'Alice']
# ['Alice' 'Bob' 'Charlie']
Note that I used strings for the elements of b to avoid confusing them with indices. Of course, I could have used numbers from range(n):
a = [2,0,1]
b = np.array([0,1,2])
c = b
for i in range(3):
c = c[a]
print(c)
# [2 0 1]
# [1 2 0]
# [0 1 2]
You might notice an interesting, but unsurprising fact: the first line is equal to a; in other words, the first result of applying a to b is equal to a itself. This is because b was initialised to [0 1 2], which represents the identity permutation id; thus, the permutations that we find by repeatedly applying a to b are:
id == a^0
a
a^2
a^3 == id
Can we always go back where we started? or The rank of a permutation
It is a well-known result of algebra that if you apply the same permutation again and again, you will eventually end up on the identity permutation. In algebraic notations: for every permutation a, there exists an integer k such that a^k == id.
Can we guess the value of k?
The minimum value of k is called the rank of the permutation (standard algebra texts call it the order).
If a is a cycle, then the minimum possible k is the length of the cycle. In our previous example, a was a cycle of length 3, so it took three applications of a before we found the identity permutation again.
How about a cycle of length 2? A cycle of length 2 is just "swapping two elements". For instance, swapping elements 0 and 1:
a = [1,0,2]
b = np.array([0,1,2])
c = b
for i in range(2):
c = c[a]
print(c)
# [1 0 2]
# [0 1 2]
We swap 0 and 1, then we swap them back.
How about two disjoint cycles? Let's try a cycle of length 3 on the first three elements, simultaneously with swapping the last two elements:
a = [2,0,1,3,4,5,7,6]
b = np.array([0,1,2,3,4,5,6,7])
c = b
for i in range(6):
c = c[a]
print(c)
# [2 0 1 3 4 5 7 6]
# [1 2 0 3 4 5 6 7]
# [0 1 2 3 4 5 7 6]
# [2 0 1 3 4 5 6 7]
# [1 2 0 3 4 5 7 6]
# [0 1 2 3 4 5 6 7]
As you can see by carefully examining the intermediary results, there is a period of length 3 on the first three elements, and a period of length 2 on the last two elements. The overall period is the least common multiple of the two periods, which is 6.
What is k in general? A well-known theorem of algebra states that every permutation can be written as a product of disjoint cycles. The rank of a cycle is its length, and the rank of a product of disjoint cycles is the least common multiple of the lengths of the cycles.
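As an illustration of this theorem, here is a small sketch (perm_order is a hypothetical helper, not part of NumPy) that finds the rank by decomposing a permutation into cycles and taking the lcm of their lengths:

import numpy as np
from math import gcd

def perm_order(a):
    # Rank of the permutation a = lcm of the lengths of its disjoint cycles
    a = np.asarray(a)
    seen = np.zeros(len(a), dtype=bool)
    order = 1
    for start in range(len(a)):
        length, i = 0, start
        while not seen[i]:                 # follow the cycle containing `start`
            seen[i] = True
            i = a[i]
            length += 1
        if length > 1:
            order = order * length // gcd(order, length)   # lcm(order, length)
    return order

print(perm_order([2, 0, 1]))                        # 3 (a single 3-cycle)
print(perm_order([2, 0, 1, 3, 4, 5, 7, 6]))         # 6 (a 3-cycle and a swap)
print(perm_order([2, 1, 4, 7, 6, 5, 0, 3, 8, 9]))   # 4 (the OP's array)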
A coincidence in your code: sorting [2,1,4,7,6,5,0,3,8,9]
Let us go back to your python code.
a = np.array([2,1,4,7,6,5,0,3,8,9])
a = a[a]
a = a[a]
print(a)
# [0 1 2 3 4 5 6 7 8 9]
How many times did you apply the permutation a? Note that because of the assignment a = ..., the array a changed between the first and the second a = a[a] lines. Let us clear up the confusion by using a different variable name for each distinct value. Your code is equivalent to:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a2 = a[a]
a4 = a2[a2]
print(a4)
Or equivalently:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a4 = (a[a])[a[a]]
This last line looks a little bit complicated. However, a cool result of algebra is that composition of permutations is associative. You already knew that addition and multiplication were associative: x+(y+z) == (x+y)+z and x(yz) == (xy)z. Well, it turns out that composition of permutations is associative as well! Using numpy's fancy indexing, this means that:
a[b[c]] == (a[b])[c]
Thus your python code is equivalent to:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a4 = ((a[a])[a])[a]
print(a4)
Or without the unneeded parentheses:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a4 = a[a][a][a]
print(a4)
Since a4 is the identity permutation, this tells us that the rank of a divides 4. Thus the rank of a is 1, 2 or 4, which means a can be written as a product of swaps and length-4 cycles. The only permutation of rank 1 is the identity itself. Permutations of rank 2 are products of disjoint swaps, and we can see that this is not the case for a. Thus the rank of a must be exactly 4.
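A quick numerical check of this argument:

import numpy as np

a = np.array([2, 1, 4, 7, 6, 5, 0, 3, 8, 9])
print(a[a])         # [4 1 6 3 0 5 2 7 8 9] -> a^2 is not the identity
print(a[a][a][a])   # [0 1 2 3 4 5 6 7 8 9] -> a^4 is, so the rank is exactly 4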
You can find the cycles by choosing an element, and following its orbit: what values is that element successively transformed into? Here we see that:
0 is transformed into 2; 2 is transformed into 4; 4 is transformed into 6; 6 is transformed into 0;
1 remains untouched;
3 becomes 7; 7 becomes 3;
5 is untouched; 8 and 9 are untouched.
Conclusion: Your numpy array represents the permutation (0 -> 2 -> 4 -> 6 -> 0)(3 <-> 7), and its rank is the least common multiple of 4 and 2, lcm(4,2) == 4.
It took some time, but I figured out a way to do it.
NumPy doesn't have this feature, but pandas does:
using df.reindex I can sort a DataFrame by its index.
import pandas as pd
import numpy as np
train_df = pd.DataFrame(range(10))
train_df = train_df.reindex(np.random.permutation(train_df.index))
print(train_df) # shuffled dataframe containing all values up to 9
train_df = train_df.reindex(range(10))
print(train_df) # sort data frame
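For completeness, plain NumPy can do the same sort without pandas: argsort returns precisely the permutation that sorts the array, so indexing with it undoes the shuffle. A minimal sketch:

import numpy as np

a = np.arange(10)
np.random.shuffle(a)
print(a[np.argsort(a)])   # [0 1 2 3 4 5 6 7 8 9]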
I have a numpy.ndarray called grouping of size (S, N). Each row of grouping gives me the group labels of a sample of data. I run my algorithm S times and get new group labels in each iteration.
I want to determine how many times each sample of my data has the same group label as every other sample of my data across the S iterations in a fully vectorized way.
In a not-completely-vectorized way:
sim_matrix = np.zeros((N, N))
for s in range(S):
    sim_matrix += np.equal.outer(grouping[s, :], grouping[s, :])
One vectorized approach would be with broadcasting -
(grouping[:,None,:] == grouping[:,:,None]).sum(0)
For performance, we can use np.count_nonzero -
np.count_nonzero(grouping[:,None,:] == grouping[:,:,None],axis=0)
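As a quick consistency check between the loop in the question and this broadcasting version, a small sketch on random labels (the sizes and seed are arbitrary):

import numpy as np

S, N = 5, 4
rng = np.random.default_rng(0)
grouping = rng.integers(0, 3, (S, N))

sim_loop = np.zeros((N, N))
for s in range(S):
    sim_loop += np.equal.outer(grouping[s, :], grouping[s, :])

sim_vec = np.count_nonzero(grouping[:, None, :] == grouping[:, :, None], axis=0)
print(np.array_equal(sim_loop, sim_vec))   # True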
The sum of equal.outer is a cryptic way of calculating all-pairs similarity of columns:
sum_i sum_jk (A[i,j] == A[i,k]) is the same as
sum_jk sum_i (A[i,j] == A[i,k])
where sum_i loops over rows, sum_jk over all pairs of columns.
Comparing two vectors by counting the number of positions where they differ is called the Hamming distance.
If we change == above to != and turn similarity into distance = nrows - similarity
(most similar ⇔ distance 0), we get the problem:
find the Hamming distance between all pairs of a bunch of vectors.
def allpairs_hamming(A, dtype=np.uint32):
    """ -> Hamming distances between all pairs of rows of A """
    nrow, ncol = A.shape
    allpair_dist = np.zeros([nrow, nrow], dtype=dtype)
    for j in range(nrow):
        for k in range(j + 1, nrow):
            allpair_dist[j, k] = allpair_dist[k, j] = (A[j] != A[k]).sum()  # row diff
    return allpair_dist
allpairs_hamming: 30.7 sec, 3 ns per cmp Nvec 2000 Veclen 5000 A 10m pairdist uint32 15m
Almost all the cpu time is in the row diff, not in the outer loop for j ... for k -- 3 ns per scalar compare, on a stock mac, isn't bad.
However memory caching is much faster if each row A[j] is in contiguous memory,
as for numpy C-order arrays.
Apart from that, whether you do "all pairs of rows" or "all pairs of columns"
doesn't matter, as long as you're clear.
(Is it possible to find "nearby" pairs in time and space < O(npairs), here O(20000^2) ? Afaik there are more methods than test cases.)
See also:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html (bug: hamming .mean not .sum)
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
https://stats.stackexchange.com/search?q=[clustering]+pairwise
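Following the first scipy link above, a minimal sketch of the same all-pairs computation via scipy (note that scipy's 'hamming' metric returns the fraction of differing positions, the .mean-vs-.sum point above, hence the scaling by the number of columns):

import numpy as np
from scipy.spatial.distance import pdist, squareform

A = np.random.randint(0, 3, (100, 50))
# counts of differing positions, not fractions
D = squareform(pdist(A, metric='hamming')) * A.shape[1]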
You want to compare identical rows. One way to do that is to view each entire row as a single raw block of bytes:
S,N=12,2
a=np.random.randint(0,3,(S,N)) #12 samples of two labels.
#a
0 1
0 2 2
1 2 0
2 1 2
3 0 0
4 0 1
5 1 1
6 0 1
7 0 1
8 0 1
9 0 0
10 2 2
11 0 0
samples = np.ascontiguousarray(a).view(np.dtype((np.void, a.strides[0])))
samples.shape is then (S, 1).
You can now inventory your samples with np.unique, and use a pandas DataFrame for a pretty report:
_,inds,invs=np.unique(samples,return_index=True, return_inverse=True)
df=pd.DataFrame(invs)
result=df.reset_index().groupby(0).index.apply(list).to_frame()
result['sample']=[list(x) for x in a[inds]]
which gives:

          index  sample
0
0    [3, 9, 11]  [0, 0]
1  [4, 6, 7, 8]  [0, 1]
2           [5]  [1, 1]
3           [2]  [1, 2]
4           [1]  [2, 0]
5       [0, 10]  [2, 2]
This can be O(S ln S) if there are few coincidences between samples, whereas yours is O(N²S).
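As a side note, on newer NumPy (1.13 and later) I believe the void-view trick can be replaced by the axis argument of np.unique, which compares rows directly:

# same inventory as above, without the raw-byte view (NumPy >= 1.13)
_, inds, invs = np.unique(a, axis=0, return_index=True, return_inverse=True)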
I have an array :
a = np.array([1,2,3,4,5,6,7,8])
The array may be reshaped to a = np.array([[1,2,3,4],[5,6,7,8]]), whatever is more convenient.
Now, I have an array :
b = np.array([[11,22,33,44], [55,66,77,88]])
I want to replace each of these elements with the corresponding element from a.
The a array will always hold exactly as many elements as b.
So, array b will be :
[1,2,3,4], [5,6,7,8]
Note that I must keep each b subarray's shape as (4,); I don't want to change it. So idx will take values from 0 to 3, and the fit should happen for every four values.
I am struggling with reshape, split, mask, etc. and I can't figure out a way to do it.
import numpy as np
#a = np.array([[1,2,3,4],[5,6,7,8]])
a = np.array([1,2,3,4,5,6,7,8])
b = np.array([[11,22,33,44], [55,66,77,88]])
for arr in b:
    for idx, x in enumerate(arr):
        ...  # replace every arr[idx] with the corresponding a value
For your current case, what you want is probably:
b, c = list(a.reshape(2, -1))
This isn't the cleanest, but it is a one-liner. Turn your 1D array into a 2D array whose first dimension is 2 with reshape(2, -1); then list splits it along the first dimension so you can directly assign the rows to b and c.
You can also do it with the specialty function numpy.split
b, c = np.split(a, 2)
EDIT: Based on the accepted solution, a vectorized way to do this is
b = a.reshape(b.shape)
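A quick usage sketch of that one-liner; note that b[:] = ... fills the existing array in place, while b = ... merely rebinds the name:

import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8])
b = np.array([[11, 22, 33, 44], [55, 66, 77, 88]])
b[:] = a.reshape(b.shape)   # reuse b's own shape so this generalizes
print(b)                    # [[1 2 3 4] [5 6 7 8]]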
The following worked for me:
i = 0
for arr in b:
    for idx, x in enumerate(arr):
        arr[idx] = a[i]
        print(arr[idx])
        i += 1
Output (arr[idx]): 1 2 3 4 5 6 7 8
If you type print(b) it'll output [[1 2 3 4] [5 6 7 8]]
b = a[:len(a)//2]
c = a[len(a)//2:]
Well, I'm quite new to Python but this worked for me:
for i in range(0, len(a)//2):
    b[i] = a[i]
for i in range(len(a)//2, len(a)):
    c[i - len(a)//2] = a[i]
By printing the 3 arrays I get the following output:
[1 2 3 4 5 6 7 8]
[1 2 3 4]
[5 6 7 8]
But I would go for Daniel's solution (the split one): a one-liner using the numpy API.
Given the following inputs:
In [18]: input
Out[18]:
1 2 3 4
0 1 5 9 1
1 2 6 10 2
2 1 5 9 1
3 1 5 9 1
In [26]: df = input.drop_duplicates()
Out[26]:
1 2 3 4
0 1 5 9 1
1 2 6 10 2
How would I go about getting an array that has the indices of the rows from the subset that are equivalent, eg:
resultant = [0, 1, 0, 0]
I.e. the '1' here states that (row[1] in input) == (row[1] in df). Since there are fewer unique rows, multiple values in 'resultant' will map to the same row in df; i.e. (row[k] in input == row[k+N] in input) == (row[1] in df) could be a case.
I am looking for the actual row-number mapping from input to df.
While this example is trivial, in my case I have a ton of dropped mappings that might all map to one index.
Why do I want this? I am training an autoencoder type system where the target sequence is non-unique.
One way would be to treat it as a groupby on all columns:
>> df.groupby(list(df.columns)).groups
{(1, 5, 9, 1): [0, 2, 3], (2, 6, 10, 2): [1]}
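If what you want is the flat array of group ids itself, newer pandas (0.20.2+) can produce it directly from the same groupby via GroupBy.ngroup; a quick sketch:

>>> df.groupby(list(df.columns), sort=False).ngroup().values
array([0, 1, 0, 0])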
Another would be to sort and then compare, which is less efficient in theory but could very well be faster in some cases and is definitely easier to make more tolerant of error:
>>> ds = df.sort_values(list(df.columns))
>>> eqs = (ds != ds.shift()).any(axis=1).cumsum()
>>> ds.index.groupby(eqs)
{1: [0, 2, 3], 2: [1]}
This seems like the right data structure to me, but if you really do want an array with the group ids, that's easy too, e.g.
>>> eqs.sort_index() - 1
0 0
1 1
2 0
3 0
dtype: int64
I don't have pandas installed on this computer, but I think you could use df.iterrows() like:
def find_matching_row(row, df_slimmed):
    for index, slimmed_row in df_slimmed.iterrows():
        if slimmed_row.equals(row[slimmed_row.index]):
            return index

def rows_mappings(df, df_slimmed):
    for _, row in df.iterrows():
        yield find_matching_row(row, df_slimmed)

list(rows_mappings(input, df))
This generates the resultant list in your example; I don't quite follow the latter part of your reasoning.