Related
I try to create mask of numpy.array based on list of tuples. Here is my solution that produces expected result:
import numpy as np
filter_vals = [(1, 1, 0), (0, 0, 1), (0, 1, 0)]
data = np.array([
[[0, 0, 0], [1, 1, 0], [1, 1, 1]],
[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
[[1, 1, 0], [0, 1, 1], [1, 0, 1]],
])
mask = np.array([], dtype=bool)
for f_val in filter_vals:
if mask.size == 0:
mask = (data == f_val).all(-1)
else:
mask = mask | (data == f_val).all(-1)
Output/mask:
array([[False, True, False],
[False, True, True],
[ True, False, False]]
Problem is that with bigger data array and increasing number of tuples in filter_vals, it is getting slower.
It there any better solution? I tried to use np.isin(data, filter_vals), but it does not provide result I need.
A classical approach using broadcasting would be:
*A, B = data.shape
(data.reshape((-1,B)) == np.array(filter_vals)[:,None]).all(-1).any(0).reshape(A)
This will however be memory expensive. So applicability really depends on your use case.
output:
array([[False, True, False],
[False, True, True],
[ True, False, False]])
I'm trying to find elements of one DataFrame (df_other) which match a column in another DataFrame (df). In other words, I'd like to know where the values in df['a'] match the values in df_other['a'] for each row in df['a'].
An example might be easier to explain the expected result:
>>> import pandas as pd
>>> import numpy as np
>>>
>>>
>>> df = pd.DataFrame({'a': ['x', 'y', 'z']})
>>> df
a
0 x
1 y
2 z
>>> df_other = pd.DataFrame({'a': ['x', 'x', 'y', 'z', 'z2'], 'c': [1, 2, 3, 4, 5]})
>>> df_other
a c
0 x 1
1 x 2
2 y 3
3 z 4
4 z2 5
>>>
>>>
>>> u = df_other['c'].unique()
>>> u
array([1, 2, 3, 4, 5])
>>> bm = np.ones((len(df), len(u)), dtype=bool)
>>> bm
array([[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]])
should yield a bitmap of
[
[1, 1, 0, 0, 0], # [1, 2] are df_other['c'] where df_other['a'] == df['a']
[0, 0, 1, 0, 0], # [3] matches
[0, 0, 0, 1, 0], # [4] matches
]
I'm looking for a fast numpy implementation that doesn't iterate through all rows (which is my current solution):
>>> df_other['a'] == df.loc[0, 'a']
0 True
1 True
2 False
3 False
4 False
Name: a, dtype: bool
>>>
>>>
>>> df_other['a'] == df.loc[1, 'a']
0 False
1 False
2 True
3 False
4 False
Name: a, dtype: bool
>>> df_other['a'] == df.loc[2, 'a']
0 False
1 False
2 False
3 True
4 False
Name: a, dtype: bool
Note: in the actual production code, there are many more column conditions ((df['a'] == df_other['a']) & (df['b'] == df_other['b'] & ...), but they are generally less than the number of rows in df, so I wouldn't mind a solution that loops over the conditions (and subsequently sets values in bm to false).
Also, the bitmap should have the shape of (len(df), len(df_other['c'].unique)).
numpy broadcasting is so useful here:
bm = df_other.values[:, 0] == df.values
Output:
>>> bm
array([[ True, True, False, False, False],
[False, False, True, False, False],
[False, False, False, True, False]])
If you need it as ints:
>>> bm.astype(int)
array([[1, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0]])
Another way to do this using pandas methods are as follows:
pd.crosstab(df_other['a'], df_other['c']).reindex(df['a']).to_numpy(dtype=int)
Output:
array([[1, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0]])
How to remove rows while iterating in numpy, as Java does:
Iterator < Message > itMsg = messages.iterator();
while (itMsg.hasNext()) {
Message m = itMsg.next();
if (m != null) {
itMsg.remove();
continue;
}
}
Here is my pseudo code. Remove the rows whose entries are all 0 and 1 while iterating.
#! /usr/bin/env python
import numpy as np
M = np.array(
[
[0, 1 ,0 ,0],
[0, 0, 1, 0],
[0, 0, 0, 0], #remove this row whose entries are all 0
[1, 1, 1, 1] #remove this row whose entries are all 1
])
it = np.nditer(M, order="K", op_flags=['readwrite'])
while not it.finished :
row = it.next() #how to get a row?
sumRow = np.sum(row)
if sumRow==4 or sumRow==0 : #remove rows whose entries are all 0 and 1 as well
#M = np.delete(M, row, axis =0)
it.remove_axis(i) #how to get i?
Writing good numpy code requires you to think in a vectorized fashion. Not every problem has a good vectorization, but for those that do, you can write clean and fast code pretty easily. In this case, we can decide on what rows we want to remove/keep and then use that to index into your array:
>>> M
array([[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 0],
[1, 1, 1, 1]])
>>> M[~((M == 0).all(1) | (M == 1).all(1))]
array([[0, 1, 0, 0],
[0, 0, 1, 0]])
Step by step, we can compare M to something to make a boolean array:
>>> M == 0
array([[ True, False, True, True],
[ True, True, False, True],
[ True, True, True, True],
[False, False, False, False]], dtype=bool)
We can use all to see if a row or column is all true:
>>> (M == 0).all(1)
array([False, False, True, False], dtype=bool)
We can use | to do an or operation:
>>> (M == 0).all(1) | (M == 1).all(1)
array([False, False, True, True], dtype=bool)
We can use this to select rows:
>>> M[(M == 0).all(1) | (M == 1).all(1)]
array([[0, 0, 0, 0],
[1, 1, 1, 1]])
But since these are the rows we want to throw away, we can use ~ (NOT) to flip False and True:
>>> M[~((M == 0).all(1) | (M == 1).all(1))]
array([[0, 1, 0, 0],
[0, 0, 1, 0]])
If instead we wanted to keep columns which weren't all 1 or all 0, we simply need to change what axis we're working on:
>>> M
array([[1, 1, 0, 1],
[1, 0, 1, 1],
[1, 0, 0, 1],
[1, 1, 1, 1]])
>>> M[:, ~((M == 0).all(axis=0) | (M == 1).all(axis=0))]
array([[1, 0],
[0, 1],
[0, 0],
[1, 1]])
Is it possible to convert an array of indices to an array of ones and zeros, given the range?
i.e. [2,3] -> [0, 0, 1, 1, 0], in range of 5
I'm trying to automate something like this:
>>> index_array = np.arange(200,300)
array([200, 201, ... , 299])
>>> mask_array = ??? # some function of index_array and 500
array([0, 0, 0, ..., 1, 1, 1, ... , 0, 0, 0])
>>> train(data[mask_array]) # trains with 200~299
>>> predict(data[~mask_array]) # predicts with 0~199, 300~499
Here's one way:
In [1]: index_array = np.array([3, 4, 7, 9])
In [2]: n = 15
In [3]: mask_array = np.zeros(n, dtype=int)
In [4]: mask_array[index_array] = 1
In [5]: mask_array
Out[5]: array([0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0])
If the mask is always a range, you can eliminate index_array, and assign 1 to a slice:
In [6]: mask_array = np.zeros(n, dtype=int)
In [7]: mask_array[5:10] = 1
In [8]: mask_array
Out[8]: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
If you want an array of boolean values instead of integers, change the dtype of mask_array when it is created:
In [11]: mask_array = np.zeros(n, dtype=bool)
In [12]: mask_array
Out[12]:
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False], dtype=bool)
In [13]: mask_array[5:10] = True
In [14]: mask_array
Out[14]:
array([False, False, False, False, False, True, True, True, True,
True, False, False, False, False, False], dtype=bool)
For a single dimension, try:
n = (15,)
index_array = [2, 5, 7]
mask_array = numpy.zeros(n)
mask_array[index_array] = 1
For more than one dimension, convert your n-dimensional indices into one-dimensional ones, then use ravel:
n = (15, 15)
index_array = [[1, 4, 6], [10, 11, 2]] # you may need to transpose your indices!
mask_array = numpy.zeros(n)
flat_index_array = np.ravel_multi_index(
index_array,
mask_array.shape)
numpy.ravel(mask_array)[flat_index_array] = 1
There's a nice trick to do this as a one-liner, too - use the numpy.in1d and numpy.arange functions like this (the final line is the key part):
>>> x = np.linspace(-2, 2, 10)
>>> y = x**2 - 1
>>> idxs = np.where(y<0)
>>> np.in1d(np.arange(len(x)), idxs)
array([False, False, False, True, True, True, True, False, False, False], dtype=bool)
The downside of this approach is that it's ~10-100x slower than the appropch Warren Weckesser gave... but it's a one-liner, which may or may not be what you're looking for.
As requested, here it is in an answer. The code:
[x in index_array for x in range(500)]
will give you a mask like you asked for, but it will use Bools instead of 0's and 1's.
I have a boolean mask array a of length n:
a = np.array([True, True, True, False, False])
I have a 2d array with n columns:
b = np.array([[1,2,3,4,5], [1,2,3,4,5]])
I want a new array which contains only the "True"-values, for example
c = ([[1,2,3], [1,2,3]])
c = a * b does not work because it contains also "0" for the false columns what I don't want
c = np.delete(b, a, 1) does not work
Any suggestions?
You probably want something like this:
>>> a = np.array([True, True, True, False, False])
>>> b = np.array([[1,2,3,4,5], [1,2,3,4,5]])
>>> b[:,a]
array([[1, 2, 3],
[1, 2, 3]])
Note that for this kind of indexing to work, it needs to be an ndarray, like you were using, not a list, or it'll interpret the False and True as 0 and 1 and give you those columns:
>>> b[:,[True, True, True, False, False]]
array([[2, 2, 2, 1, 1],
[2, 2, 2, 1, 1]])
You can use numpy.ma module and use np.ma.masked_array function to do so.
>>> x = np.array([1, 2, 3, -1, 5])
>>> mx = ma.masked_array(x, mask=[0, 0, 0, 1, 0])
masked_array(data=[1, 2, 3, --, 5], mask=[False, False, False, True, False], fill_value=999999)
Hope I'm not too late! Here's your array:
X = np.array([[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5]])
Let's create an array of zeros of the same shape as X:
mask = np.zeros_like(X)
# array([[0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0]])
Then, specify the columns that you want to mask out or hide with a 1. In this case, we want the last 2 columns to be masked out.
mask[:, -2:] = 1
# array([[0, 0, 0, 1, 1],
# [0, 0, 0, 1, 1]])
Create a masked array:
X_masked = np.ma.masked_array(X, mask)
# masked_array(data=[[1, 2, 3, --, --],
# [1, 2, 3, --, --]],
# mask=[[False, False, False, True, True],
# [False, False, False, True, True]],
# fill_value=999999)
We can then do whatever we want with X_masked, like taking the sum of each column (along axis=0):
np.sum(X_masked, axis=0)
# masked_array(data=[2, 4, 6, --, --],
# mask=[False, False],
# fill_value=1e+20)
Great thing about this is that X_masked is just a view of X, not a copy.
X_masked.base is X
# True