Efficient selection of values in numpy - python

I'm trying to find elements of one DataFrame (df_other) which match a column in another DataFrame (df). In other words, I'd like to know where the values in df['a'] match the values in df_other['a'] for each row in df['a'].
An example might make the expected result easier to explain:
>>> import pandas as pd
>>> import numpy as np
>>>
>>>
>>> df = pd.DataFrame({'a': ['x', 'y', 'z']})
>>> df
a
0 x
1 y
2 z
>>> df_other = pd.DataFrame({'a': ['x', 'x', 'y', 'z', 'z2'], 'c': [1, 2, 3, 4, 5]})
>>> df_other
a c
0 x 1
1 x 2
2 y 3
3 z 4
4 z2 5
>>>
>>>
>>> u = df_other['c'].unique()
>>> u
array([1, 2, 3, 4, 5])
>>> bm = np.ones((len(df), len(u)), dtype=bool)
>>> bm
array([[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]])
This should yield a bitmap of
[
[1, 1, 0, 0, 0], # [1, 2] are df_other['c'] where df_other['a'] == df['a']
[0, 0, 1, 0, 0], # [3] matches
[0, 0, 0, 1, 0], # [4] matches
]
I'm looking for a fast numpy implementation that doesn't iterate through all rows (which is my current solution):
>>> df_other['a'] == df.loc[0, 'a']
0 True
1 True
2 False
3 False
4 False
Name: a, dtype: bool
>>>
>>>
>>> df_other['a'] == df.loc[1, 'a']
0 False
1 False
2 True
3 False
4 False
Name: a, dtype: bool
>>> df_other['a'] == df.loc[2, 'a']
0 False
1 False
2 False
3 True
4 False
Name: a, dtype: bool
Note: in the actual production code, there are many more column conditions ((df['a'] == df_other['a']) & (df['b'] == df_other['b']) & ...), but there are generally fewer of them than rows in df, so I wouldn't mind a solution that loops over the conditions (and subsequently sets values in bm to False).
Also, the bitmap should have the shape (len(df), len(df_other['c'].unique())).

numpy broadcasting is so useful here:
bm = df_other.values[:, 0] == df.values
Output:
>>> bm
array([[ True, True, False, False, False],
[False, False, True, False, False],
[False, False, False, True, False]])
If you need it as ints:
>>> bm.astype(int)
array([[1, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0]])
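The comparison works because df.values has shape (3, 1) and df_other.values[:, 0] has shape (5,), so NumPy broadcasts the two to a (3, 5) result. For the multi-condition case mentioned in the question, one possible sketch is to loop over the conditions and AND each broadcast comparison into the bitmap. The arrays below stand in for the .to_numpy() output of each column; the second column 'b' and its values are made up for illustration, and the sketch assumes each row of df_other maps to one bitmap column (as in the example, where 'c' is unique per row).

```python
import numpy as np

# Stand-ins for df[col].to_numpy() and df_other[col].to_numpy();
# column 'b' and its values are hypothetical.
left = {
    'a': np.array(['x', 'y', 'z']),
    'b': np.array([1, 2, 3]),
}
right = {
    'a': np.array(['x', 'x', 'y', 'z', 'z2']),
    'b': np.array([1, 9, 2, 3, 3]),
}

n_left = len(left['a'])
n_right = len(right['a'])

# Start with everything True, then clear entries that fail any condition.
bm = np.ones((n_left, n_right), dtype=bool)
for col in left:
    # (n_left, 1) == (1, n_right) broadcasts to (n_left, n_right)
    bm &= left[col][:, None] == right[col][None, :]

print(bm.astype(int))
```

The loop runs once per condition rather than once per row, which matches the constraint in the question that there are fewer conditions than rows.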

Another way to do this, using pandas methods, is as follows:
pd.crosstab(df_other['a'], df_other['c']).reindex(df['a']).to_numpy(dtype=int)
Output:
array([[1, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0]])

Related

masked_scatter but rowwise?

Assuming a mask as follows:
mask = torch.tensor([
[True, True, False, True, False],
[True, False, True, True, True ],
])
I would like to number the True values with sequential values in each row separately. I don't care what's in the False spots, so 0 for simplicity. Thus the desired result is
tensor([[0, 1, 0, 2, 0], # 0 1 _ 2 _
[0, 0, 1, 2, 3]]) # 0 _ 1 2 3
I hoped this would work:
replacements = torch.arange(mask.size(1)).expand(mask.size())
target = torch.zeros(mask.size(), dtype=int)
target.masked_scatter(mask, replacements)
Unfortunately, masked_scatter ignores the shape of replacements, so this code results in:
tensor([[0, 1, 0, 2, 0], # 0 1 _ 2 _
[3, 0, 4, 0, 1]]) # 3 _ 4 0 1
What would I need to do instead?
I would try something with torch.cumsum: (torch.cumsum(mask, dim=1) - 1) * mask
The complete example
import torch
mask = torch.tensor([
[True, True, False, True, False],
[True, False, True, True, True ],
])
result = (torch.cumsum(mask, dim=1) - 1) * mask
print(result)
That would print:
tensor([[0, 1, 0, 2, 0],
[0, 0, 1, 2, 3]])
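The reason this works: the running count of True values along each row, minus one, is the zero-based rank of each True entry, and multiplying by the mask zeroes out the False positions. As a rough illustration of the same idea outside of PyTorch, here is a NumPy sketch (NumPy is a substitution for illustration, not part of the answer above):

```python
import numpy as np

mask = np.array([
    [True, True, False, True, False],
    [True, False, True, True, True],
])

# Running count of True values per row (booleans are summed as 0/1).
running = np.cumsum(mask, axis=1)

# Subtract 1 to get zero-based ranks, then zero out the False positions.
result = (running - 1) * mask
print(result)
```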

How to use conditional statements or loops to achieve my requirement?

I am new to python coding. Kindly, help me to achieve my requirement.
Suppose there are two arrays 'a' and 'b' of size 3*4
a = [[1,0,0,1],
[0,0,1,1],
[1,0,0,1]]
b = [[12,-34,-10,4],
[2,11,-12,20],
[-12,16,19,-9]]
Here, if b[i,j] < 10 then I want the corresponding a[i,j] to stay the same (i.e. it can be either 0 or 1); otherwise, change a[i,j] to 1.
Expected outcome for the above example :
c = [[1,0,0,1],
[0,1,1,1],
[1,1,1,1]]
You can use the or | operator:
In [11]: b >= 10
Out[11]:
array([[ True, False, False, False],
[False, True, False, True],
[False, True, True, False]])
In [12]: a | (b >= 10)
Out[12]:
array([[1, 0, 0, 1],
[0, 1, 1, 1],
[1, 1, 1, 1]])
The | is a bitwise or and is equivalent to np.bitwise_or:
In [13]: np.bitwise_or(a, b >= 10)
Out[13]:
array([[1, 0, 0, 1],
[0, 1, 1, 1],
[1, 1, 1, 1]])
This assumes both a and b are numpy arrays. If they are not, you can convert them with the array constructor:
a, b = np.array(a), np.array(b)
If you do not want to use numpy, you could use this nested list comprehension:
c = [[el_a | (el_b >= 10) for el_a, el_b in zip(row_a, row_b)]
     for row_a, row_b in zip(a, b)]
That said, I prefer Andy Hayden's answer; numpy really shines for this kind of operation.
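If the intent reads more naturally as "put a 1 wherever b >= 10, otherwise keep a", np.where expresses that directly. This is an alternative sketch, not what the answers above use:

```python
import numpy as np

a = np.array([[1, 0, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1]])
b = np.array([[12, -34, -10, 4],
              [2, 11, -12, 20],
              [-12, 16, 19, -9]])

# Where the condition holds take 1, elsewhere keep the value from a.
c = np.where(b >= 10, 1, a)
print(c)
```

Unlike the bitwise-or version, this form does not rely on a containing only 0s and 1s.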

How to quickly determine if a matrix is a permutation matrix

How to quickly determine if a square logical matrix is a permutation matrix? For instance, the matrix defined below is not a permutation matrix since its 3rd row has two entries equal to 1.
PS: A permutation matrix is a square binary matrix that has exactly one entry 1 in each row and each column and 0s elsewhere.
I define a logical matrix like
numpy.array([(0,1,0,0), (0,0,1,0), (0,1,1,0), (1,0,0,1)])
Here is my source code:
#!/usr/bin/env python
import numpy as np
### two test cases
M1 = np.array([
(0, 1, 0, 0),
(0, 0, 1, 0),
(0, 1, 1, 0),
(1, 0, 0, 1)]);
M2 = np.array([
(0, 1, 0, 0),
(0, 0, 1, 0),
(1, 0, 0, 0),
(0, 0, 0, 1)]);
### function
def is_perm_matrix(M):
    for sumRow in np.sum(M, axis=1):
        if sumRow != 1:
            return False
    for sumCol in np.sum(M, axis=0):
        if sumCol != 1:
            return False
    return True
### print the result
print(is_perm_matrix(M1))  # False
print(is_perm_matrix(M2))  # True
Is there any better implementation?
What about this:
def is_permuation_matrix(x):
    x = np.asanyarray(x)
    return (x.ndim == 2 and x.shape[0] == x.shape[1] and
            (x.sum(axis=0) == 1).all() and
            (x.sum(axis=1) == 1).all() and
            ((x == 1) | (x == 0)).all())
Quick test:
In [37]: is_permuation_matrix(np.eye(3))
Out[37]: True
In [38]: is_permuation_matrix([[0,1],[2,0]])
Out[38]: False
In [39]: is_permuation_matrix([[0,1],[1,0]])
Out[39]: True
In [41]: is_permuation_matrix([[0,1,0],[0,0,1],[1,0,0]])
Out[41]: True
In [42]: is_permuation_matrix([[0,1,0],[0,0,1],[1,0,1]])
Out[42]: False
In [43]: is_permuation_matrix([[0,1,0],[0,0,1]])
Out[43]: False
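The last condition, ((x == 1) | (x == 0)).all(), matters when the input is not guaranteed to be binary: row and column sums can all equal 1 even if the entries are not 0s and 1s. A small sketch of such a case (the helper names are made up for illustration):

```python
import numpy as np

def sums_are_one(x):
    # Only checks that every row and every column sums to 1.
    x = np.asanyarray(x)
    return bool((x.sum(axis=0) == 1).all() and (x.sum(axis=1) == 1).all())

def only_zeros_and_ones(x):
    # Checks that every entry is exactly 0 or 1.
    x = np.asanyarray(x)
    return bool(((x == 0) | (x == 1)).all())

# Every row and column sums to 1, yet this is not a permutation matrix.
tricky = np.array([[2, -1],
                   [-1, 2]])

print(sums_are_one(tricky))         # True
print(only_zeros_and_ones(tricky))  # False
```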
Here's a simple non-numpy solution that assumes that the matrix is a list of lists and that it only contains integers 0 or 1. It also functions correctly if the matrix contains Booleans.
def is_perm_matrix(m):
    #Check rows
    if all(sum(row) == 1 for row in m):
        #Check columns
        return all(sum(col) == 1 for col in zip(*m))
    return False
m1 = [
[0, 1, 0],
[1, 0, 0],
[0, 0, 1],
]
m2 = [
[0, 1, 0],
[1, 0, 0],
[0, 1, 1],
]
m3 = [
[0, 1, 0],
[1, 0, 0],
[1, 0, 0],
]
m4 = [
[True, False, False],
[False, True, False],
[True, False, False],
]
print(is_perm_matrix(m1))
print(is_perm_matrix(m2))
print(is_perm_matrix(m3))
print(is_perm_matrix(m4))
output
True
False
False
False
One method is to call np.sum with an axis parameter. For a permutation matrix this yields an array of all ones along both axes; if it doesn't, you don't have a permutation matrix:
In [56]:
a = np.array([[0,1,0,0],[0,0,1,0],[0,1,1,0],[1,0,0,1]])
a
Out[56]:
array([[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 1, 1, 0],
[1, 0, 0, 1]])
In [57]:
np.all(np.sum(a,axis=0) == np.ones((1,4)), True)
Out[57]:
array([False], dtype=bool)
In [58]:
np.all(np.sum(a,axis=1) == np.ones((1,4)), True)
Out[58]:
array([False], dtype=bool)
In [60]:
np.sum(a, axis=1) == np.ones([1,4])
Out[60]:
array([[ True, True, False, False]], dtype=bool)
In [59]:
np.sum(a, axis=0) == np.ones([1,4])
Out[59]:
array([[ True, False, False, True]], dtype=bool)
In [61]:
np.sum(a,axis=0)
Out[61]:
array([1, 2, 2, 1])
In [62]:
np.sum(a,axis=1)
Out[62]:
array([1, 1, 2, 2])
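To turn the two sum checks into a single yes/no answer, they can be combined as below. This is only a sketch and assumes the input is already known to be a square matrix of 0s and 1s:

```python
import numpy as np

a = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 1]])

# Each check reduces to a single boolean.
cols_ok = np.all(np.sum(a, axis=0) == 1)
rows_ok = np.all(np.sum(a, axis=1) == 1)
is_perm = bool(cols_ok and rows_ok)
print(is_perm)

# The identity matrix passes both checks.
identity = np.eye(4, dtype=int)
identity_is_perm = bool(np.all(identity.sum(axis=0) == 1) and
                        np.all(identity.sum(axis=1) == 1))
print(identity_is_perm)
```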

How to remove rows while iterating in numpy

How to remove rows while iterating in numpy, as Java does:
Iterator < Message > itMsg = messages.iterator();
while (itMsg.hasNext()) {
Message m = itMsg.next();
if (m != null) {
itMsg.remove();
continue;
}
}
Here is my pseudo code. Remove the rows whose entries are all 0 or all 1 while iterating.
#! /usr/bin/env python
import numpy as np
M = np.array(
[
[0, 1 ,0 ,0],
[0, 0, 1, 0],
[0, 0, 0, 0], #remove this row whose entries are all 0
[1, 1, 1, 1] #remove this row whose entries are all 1
])
it = np.nditer(M, order="K", op_flags=['readwrite'])
while not it.finished :
row = it.next() #how to get a row?
sumRow = np.sum(row)
if sumRow==4 or sumRow==0 : #remove rows whose entries are all 0 and 1 as well
#M = np.delete(M, row, axis =0)
it.remove_axis(i) #how to get i?
Writing good numpy code requires you to think in a vectorized fashion. Not every problem has a good vectorization, but for those that do, you can write clean and fast code pretty easily. In this case, we can decide on what rows we want to remove/keep and then use that to index into your array:
>>> M
array([[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 0],
[1, 1, 1, 1]])
>>> M[~((M == 0).all(1) | (M == 1).all(1))]
array([[0, 1, 0, 0],
[0, 0, 1, 0]])
Step by step, we can compare M to something to make a boolean array:
>>> M == 0
array([[ True, False, True, True],
[ True, True, False, True],
[ True, True, True, True],
[False, False, False, False]], dtype=bool)
We can use all to see if a row or column is all true:
>>> (M == 0).all(1)
array([False, False, True, False], dtype=bool)
We can use | to do an or operation:
>>> (M == 0).all(1) | (M == 1).all(1)
array([False, False, True, True], dtype=bool)
We can use this to select rows:
>>> M[(M == 0).all(1) | (M == 1).all(1)]
array([[0, 0, 0, 0],
[1, 1, 1, 1]])
But since these are the rows we want to throw away, we can use ~ (NOT) to flip False and True:
>>> M[~((M == 0).all(1) | (M == 1).all(1))]
array([[0, 1, 0, 0],
[0, 0, 1, 0]])
If instead we wanted to keep columns which weren't all 1 or all 0, we simply need to change what axis we're working on:
>>> M
array([[1, 1, 0, 1],
[1, 0, 1, 1],
[1, 0, 0, 1],
[1, 1, 1, 1]])
>>> M[:, ~((M == 0).all(axis=0) | (M == 1).all(axis=0))]
array([[1, 0],
[0, 1],
[0, 0],
[1, 1]])
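For a matrix that only contains 0s and 1s, a row is "all 0" or "all 1" exactly when its minimum equals its maximum, so the same filter can be written with min and max. This is an alternative sketch, and the function name is made up for illustration:

```python
import numpy as np

def drop_constant_rows(M):
    # Keep only rows that contain at least two different values.
    M = np.asarray(M)
    keep = M.min(axis=1) != M.max(axis=1)
    return M[keep]

M = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [1, 1, 1, 1]])

filtered = drop_constant_rows(M)
print(filtered)
```

Note that this version drops any constant row (for example, all 2s), which is broader than the (M == 0) | (M == 1) test when the matrix is not strictly binary.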

How can you turn an index array into a mask array in Numpy?

Is it possible to convert an array of indices to an array of ones and zeros, given the range?
i.e. [2,3] -> [0, 0, 1, 1, 0], in range of 5
I'm trying to automate something like this:
>>> index_array = np.arange(200,300)
array([200, 201, ... , 299])
>>> mask_array = ??? # some function of index_array and 500
array([0, 0, 0, ..., 1, 1, 1, ... , 0, 0, 0])
>>> train(data[mask_array]) # trains with 200~299
>>> predict(data[~mask_array]) # predicts with 0~199, 300~499
Here's one way:
In [1]: index_array = np.array([3, 4, 7, 9])
In [2]: n = 15
In [3]: mask_array = np.zeros(n, dtype=int)
In [4]: mask_array[index_array] = 1
In [5]: mask_array
Out[5]: array([0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0])
If the mask is always a range, you can eliminate index_array, and assign 1 to a slice:
In [6]: mask_array = np.zeros(n, dtype=int)
In [7]: mask_array[5:10] = 1
In [8]: mask_array
Out[8]: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
If you want an array of boolean values instead of integers, change the dtype of mask_array when it is created:
In [11]: mask_array = np.zeros(n, dtype=bool)
In [12]: mask_array
Out[12]:
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False], dtype=bool)
In [13]: mask_array[5:10] = True
In [14]: mask_array
Out[14]:
array([False, False, False, False, False, True, True, True, True,
True, False, False, False, False, False], dtype=bool)
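For the train/predict split in the question, the boolean dtype is the one to use: an integer array of 0s and 1s passed as data[mask_array] is treated as a list of indices rather than a selector, and ~ on integers is a bitwise complement, not a logical NOT. A minimal sketch, with data as a placeholder array:

```python
import numpy as np

# Placeholder data; in the question this would be the real dataset.
data = np.arange(500)
index_array = np.arange(200, 300)

# Boolean mask that is True at the given indices.
mask_array = np.zeros(len(data), dtype=bool)
mask_array[index_array] = True

train_part = data[mask_array]      # elements 200..299
predict_part = data[~mask_array]   # elements 0..199 and 300..499

print(len(train_part), len(predict_part))
```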
For a single dimension, try:
import numpy as np
n = (15,)
index_array = [2, 5, 7]
mask_array = np.zeros(n)
mask_array[index_array] = 1
For more than one dimension, convert your n-dimensional indices into one-dimensional ones, then use ravel:
n = (15, 15)
index_array = [[1, 4, 6], [10, 11, 2]] # you may need to transpose your indices!
mask_array = np.zeros(n)
flat_index_array = np.ravel_multi_index(
    index_array,
    mask_array.shape)
np.ravel(mask_array)[flat_index_array] = 1
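When the indices are already split per axis, as in the example above, it may be simpler to index with one sequence per dimension and skip the flattening step. A sketch of that alternative, using a boolean mask:

```python
import numpy as np

shape = (15, 15)
rows = [1, 4, 6]
cols = [10, 11, 2]

mask_array = np.zeros(shape, dtype=bool)

# One index sequence per axis: sets (1, 10), (4, 11) and (6, 2).
mask_array[rows, cols] = True

print(mask_array.sum())
```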
There's a nice trick to do this as a one-liner, too - use the numpy.in1d and numpy.arange functions like this (the final line is the key part):
>>> x = np.linspace(-2, 2, 10)
>>> y = x**2 - 1
>>> idxs = np.where(y<0)
>>> np.in1d(np.arange(len(x)), idxs)
array([False, False, False, True, True, True, True, False, False, False], dtype=bool)
The downside of this approach is that it's ~10-100x slower than the approach Warren Weckesser gave... but it's a one-liner, which may or may not be what you're looking for.
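The NumPy documentation recommends np.isin over np.in1d for new code, so the same one-liner can be sketched as below. np.flatnonzero is used here in place of np.where so that idxs is a plain 1-D index array:

```python
import numpy as np

x = np.linspace(-2, 2, 10)
y = x**2 - 1

# Indices where the condition holds, as a flat integer array.
idxs = np.flatnonzero(y < 0)

# True at every position whose index appears in idxs.
mask = np.isin(np.arange(len(x)), idxs)
print(mask)
```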
As requested, here it is in an answer. The code:
[x in index_array for x in range(500)]
will give you a mask like you asked for, but it will use Bools instead of 0's and 1's.
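Each x in index_array test scans the whole index array, so for larger inputs it may help to convert the indices to a set first. A small sketch, assuming index_array is the NumPy range from the question:

```python
import numpy as np

index_array = np.arange(200, 300)

# Set membership is a hash lookup instead of a scan over the array.
index_set = set(index_array.tolist())
mask = [x in index_set for x in range(500)]

print(sum(mask))
```

The result is still a plain Python list of bools; wrap it in np.array(mask) if it needs to be used for indexing.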
