pandas build matrix of row by row comparisons - python

I have two dataframes, a (10,2) and a (3,2), and I am looking for a faster/more pythonic way to compare them row by row.
import numpy as np
import pandas as pd

x = pd.DataFrame([range(10), range(2, 12)])
x = x.transpose()
y = pd.DataFrame([[5, 8], [2, 3], [5, 5]])
I'd like to build a (10,3) comparison matrix that shows which rows of the first dataframe satisfy the following requirements against each row of the second: the x[1] value must be >= the y[0] value and the x[0] value must be <= the y[1] value. In reality, the data are dates, but for simplicity I have just used integers to make this example easier to follow. We're testing for overlap in time periods, so the logic requires that there is some overlap between the periods in the respective tables.
arr = np.zeros((len(x), len(y)), dtype=bool)
for xrow in x.index:
    for yrow in y.index:
        if x.loc[xrow, 1] >= y.loc[yrow, 0] and x.loc[xrow, 0] <= y.loc[yrow, 1]:
            arr[xrow, yrow] = True
arr
The brute force approach above is too slow. Any suggestions for how I could vectorize this or do some sort of transposed matrix comparisons?

You can convert x, y to NumPy arrays and then extend dimensions with np.newaxis/None, which would bring in NumPy's broadcasting when performing the same operations. Thus, all those comparisons and the output boolean array would be created in a vectorized fashion. The implementation would look like this -
X = np.asarray(x)
Y = np.asarray(y)
arr = (X[:,None,1] >= Y[:,0]) & (X[:,None,0] <= Y[:,1])
Sample run -
In [207]: x = pd.DataFrame([range(10),range(2,12)])
...: x = x.transpose()
...: y = pd.DataFrame([[5,8],[2,3],[5,5]])
...:
In [208]: X = np.asarray(x)
...: Y = np.asarray(y)
...: arr = (X[:,None,1] >= Y[:,0]) & (X[:,None,0] <= Y[:,1])
...:
In [209]: arr
Out[209]:
array([[False,  True, False],
       [False,  True, False],
       [False,  True, False],
       [ True,  True,  True],
       [ True, False,  True],
       [ True, False,  True],
       [ True, False, False],
       [ True, False, False],
       [ True, False, False],
       [False, False, False]], dtype=bool)
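To see why this gives a (10, 3) result, it helps to check the shapes involved (a quick sketch using the X and Y from above; None inserts a length-1 axis so each x row gets compared against every y row):

X[:,None,1].shape                # (10, 1)
Y[:,0].shape                     # (3,)
(X[:,None,1] >= Y[:,0]).shape    # (10, 3) after broadcasting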

Related

Apply numpy 'where' along one of the axes

I have an array like this:
import numpy as np

array = np.array([
    [True, False],
    [True, False],
    [True, False],
    [True, True],
])
I would like to find the last occurrence of True for each row of the array.
If it were a 1d array, I would do it this way:
np.where(array)[0][-1]
How do I do something similar in 2D? Kind of like:
np.where(array, axis = 1)[0][:,-1]
but there is no axis argument in np.where.
Since True is greater than False, find the position of the largest element in each row. Unfortunately, argmax finds the first largest element, not the last one. So, reverse the array sideways, find the first True from the end, and recalculate the indexes:
(array.shape[1] - 1) - array[:, ::-1].argmax(axis=1)
# array([0, 0, 0, 1])
The method fails if there are no True values in a row. You can check whether that's the case by dividing by array.max(axis=1). A row with no Trues will have its last True at infinity :)
array[0, 0] = False
((array.shape[1] - 1) - array[:, ::-1].argmax(axis=1)) / array.max(axis=1)
#array([inf, 0., 0., 1.])
I found an older answer but didn't like that it returns 0 both for a True in the first position and for a row with no True at all.
So here's a way to solve that problem, if it's important to you:
import numpy as np
arr = np.array([[False, False, False],  # -1
                [False, False, True],   # 2
                [True, False, False],   # 0
                [True, False, True],    # 2
                [True, True, False],    # 1
                [True, True, True],     # 2
                ])
# Make an adjustment for no Trues at all.
adj = np.sum(arr, axis=1) == 0
# Get the position and adjust.
x = np.argmax(np.cumsum(arr, axis=1), axis=1) - adj
# Compare to expected result:
assert np.all(x == np.array([-1, 2, 0, 2, 1, 2]))
print(x)
Gives [-1 2 0 2 1 2].
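If you prefer the reversed-argmax approach from the first answer but want an integer sentinel instead of inf, np.where can mask out the all-False rows (a sketch, assuming the same arr as above; -1 is an arbitrary sentinel):

import numpy as np

last = (arr.shape[1] - 1) - arr[:, ::-1].argmax(axis=1)
# argmax returns 0 for all-False rows, so mask those out explicitly.
result = np.where(arr.any(axis=1), last, -1)
# array([-1, 2, 0, 2, 1, 2])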

Python/Numpy: Combining boolean masks by row in grouped columns in multidimensional array

I have a 3D boolean array (a 2D numpy array of boolean mask arrays) with r rows and c cols. In the example below the array shape is (3, 6, 2); 3 rows and 6 columns, where each column contains 2 elements.
import numpy as np

maskArr = np.array([
    [[True, False], [True, True], [True, True], [True, True], [True, True], [True, True]],
    [[False, True], [False, True], [True, True], [False, True], [True, True], [True, True]],
    [[True, False], [True, True], [True, True], [True, True], [True, True], [True, True]],
])
# If n=2: |<- AND these 2 cols ->|<- AND these 2 cols ->|<- AND these 2 cols ->|
# If n=3: |<----- AND these 3 cols ----->|<----- AND these 3 cols ----->|
I know I can use np.all(maskArr, axis=1) to AND together all the mask arrays in each row, as in a previous answer, but instead I would like to AND together the boolean arrays in each row in increments of n columns.
So if we start with 6 columns, as above, and n=2, I would like to apply the equivalent of np.all on every 2 columns for an end result of 3 columns, where:
The first column of the result array equals the rows of the first 2 columns of the original array ANDed together - result[:,0] = np.all(maskArr[:,0:2], axis=1)
The second column of the result array equals the rows of the next 2 columns of the original array ANDed together - result[:,1] = np.all(maskArr[:,2:4], axis=1)
And the third column of the result array equals the rows of the last 2 columns of the original array ANDed together - result[:,2] = np.all(maskArr[:,4:6], axis=1)
Is there a way to use np.all (or another vectorized approach) to get this result?
Expected result with n=2:
>>> np.array([
    [[True, False], [True, True], [True, True]],
    [[False, True], [False, True], [True, True]],
    [[True, False], [True, True], [True, True]],
])
Note: The array I'm working with is extremely large so I'm looking for a vectorized approach to minimize performance impact. The actual boolean arrays can be thousands of elements long.
I've tried:
n = 2
c = len(maskArr[0]) ## c = 6 (number of columns)
nResultColumns = int(c / n) ## nResultColumns = 3
combinedMaskArr = [np.all(maskArr[:,i*n:i*n+n], axis=1) for i in range(nResultColumns)]
which gives me:
>>> [
    array([[True, False], [False, True], [True, False]]),
    array([[True, True], [False, True], [True, True]]),
    array([[True, True], [True, True], [True, True]])
]
The output above is not the expected format or values.
Any guidance or suggestions on how to get to the expected result?
Thank you in advance.
The following works, if I understood your problem correctly.
import math

n = 2
cols = mask_arr.shape[1]  # mask_arr is the maskArr from the question
chunks = math.ceil(cols / n)
groups = np.array_split(np.swapaxes(mask_arr, 0, 1), chunks)
combined = np.array([np.all(g, axis=0) for g in groups])
result = np.swapaxes(combined, 0, 1)
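To follow what that does, here are the intermediate shapes for the (3, 6, 2) maskArr from the question with n=2 (a walkthrough, not part of the original answer):

# np.swapaxes(mask_arr, 0, 1)  -> (6, 3, 2): columns moved to the front
# each of the 3 groups         -> (2, 3, 2): n columns per group
# np.all(g, axis=0)            -> (3, 2):    the n columns ANDed together
# combined                     -> (3, 3, 2): indexed (group, row, element)
# result                       -> (3, 3, 2): swapped back to (row, group, element)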
If cols is divisible by n, I think this works:
n = 2
rows, cols = mask_arr.shape[0:2]
result = np.all(mask_arr.reshape(rows, cols // n, n, -1), axis=2)
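A quick sanity check of the reshape approach against the expected n=2 result from the question (using the original maskArr):

rows, cols = maskArr.shape[0:2]
result = np.all(maskArr.reshape(rows, cols // 2, 2, -1), axis=2)
expected = np.array([
    [[True, False], [True, True], [True, True]],
    [[False, True], [False, True], [True, True]],
    [[True, False], [True, True], [True, True]],
])
assert np.array_equal(result, expected)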

Numpy: sum a 1D array indexed by rows of a 2D boolean array

Assume that I have an (m,)-array a and an (n, m)-array b of booleans. For each row b[i] of b, I want to take np.sum(a, where=b[i]), which should return an (n,)-array. I could do the following:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([[True, False, False], [True, True, False], [False, True, True]])
c = np.array([np.sum(a, where=r) for r in b])
# c is [1, 3, 5]
but this seems quite inelegant to me. I would have hoped that broadcasting magic makes something like
c = np.sum(a, where=b)
# c is 9
work, but apparently np.sum then sums over the rows of b, which I do not want. Is there a numpy-inherent way of achieving the desired behaviour with np.sum (or any ufunc.reduce)?
How about:
a = np.array([1,2,3])
b = np.array([[True, False, False], [True, True, False], [False, True, True]])
c = np.sum(a*b, axis = 1)
output:
array([1, 3, 5])
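One caveat worth noting (an aside, not from the original answer): a*b relies on False multiplying to 0, so if a can contain NaN, the masked-out entries stay NaN (NaN * 0 is NaN). np.where sidesteps that, and a matrix product is another equivalent one-liner:

c = np.sum(np.where(b, a, 0), axis=1)  # masked-out entries become 0 even if a has NaN
# array([1, 3, 5])

c = b.astype(int) @ a  # each row of b selects and sums entries of a
# array([1, 3, 5])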

Elegant way to check whether co-ordinates in a 2D NumPy array lie within a certain range

So let us say we have a 2D NumPy array (denoting co-ordinates) and I want to check whether all the co-ordinates lie within a certain range. What is the most Pythonic way to do this? For example:
a = np.array([[-1,2], [1,5], [6,7], [5,2], [3,4], [0, 0], [-1,-1]])
# ALL THE COORDINATES WITHIN x -> 0 to 4 AND y -> 0 to 4 SHOULD
# BE PUT IN b (x and y ranges might not be equal)
b = #DO SOME OPERATION
>>> b
[[3,4],
 [0,0]]
If the range is the same for both directions, x and y, just compare them and use all:
import numpy as np
a = np.array([[-1,2], [1,5], [6,7], [5,2], [3,4], [0, 0], [-1,-1]])
a[(a >= 0).all(axis=1) & (a <= 4).all(axis=1)]
# array([[3, 4],
#        [0, 0]])
If the ranges are not the same, you can also compare to an iterable of the same size as that axis (so two here):
mins = 0, 1 # x_min, y_min
maxs = 4, 10 # x_max, y_max
a[(a >= mins).all(axis=1) & (a <= maxs).all(axis=1)]
# array([[1, 5],
#        [3, 4]])
To see what is happening here, let's have a look at the intermediate steps:
The comparison gives a per-element result of the comparison, with the same shape as the original array:
a >= mins
# array([[False,  True],
#        [ True,  True],
#        [ True,  True],
#        [ True,  True],
#        [ True,  True],
#        [ True, False],
#        [False, False]], dtype=bool)
Using numpy.ndarray.all, you can check whether all values are truthy, similarly to the built-in function all:
(a >= mins).all()
# False
With the axis argument, you can restrict this to only compare values along one (or multiple) axis of the array:
(a >= mins).all(axis=1)
# array([False, True, True, True, True, False, False], dtype=bool)
(a >= mins).all(axis=0)
# array([False, False], dtype=bool)
Note that the output of this has the same shape as the array, except that all dimensions mentioned with axis have been contracted to a single True/False.
When indexing an array with a boolean mask, the mask is applied along the matching axis. Since we index an array of shape (7, 2) with a boolean mask of shape (7,), the mask is matched against the first dimension, so these values are used to select rows of the original array.
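If this check comes up often, it is easy to wrap in a small helper (a sketch; in_box is a hypothetical name, not a NumPy function):

import numpy as np

def in_box(points, mins, maxs):
    # Keep only the rows whose coordinates all lie within [mins, maxs].
    points = np.asarray(points)
    keep = (points >= mins).all(axis=1) & (points <= maxs).all(axis=1)
    return points[keep]

a = np.array([[-1,2], [1,5], [6,7], [5,2], [3,4], [0, 0], [-1,-1]])
in_box(a, (0, 0), (4, 4))
# array([[3, 4],
#        [0, 0]])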

Numpy.where: very slow with conditions from two different arrays

I have three arrays of type numpy.ndarray with dimensions (n by 1), named amplitude, distance and weight. I would like to use selected entries of the amplitude array, based on their respective distance- and weight-values. For example I would like to find the indices of the entries within a certain distance range, so I write:
index = np.where( (distance<10) & (distance>=5) )
and I would then proceed by using the values from amplitude[index].
This works perfectly well as long as I only use one array for specifying the conditions. When I try for example
index = np.where( (distance<10) & (distance>=5) & (weight>0.8) )
the operation becomes super-slow. Why is that, and is there a better way for this task? In fact, I eventually want to use many conditions from something like 6 different arrays.
This is just a guess, but perhaps numpy is broadcasting your arrays? If the arrays are the exact same shape, then numpy won't broadcast them:
>>> import numpy
>>> distance = numpy.arange(5) > 2
>>> weight = numpy.arange(5) < 4
>>> distance.shape, weight.shape
((5,), (5,))
>>> distance & weight
array([False, False, False, True, False], dtype=bool)
But if they have different shapes, and the shapes are broadcastable, then it will. While (n,), (n, 1), and (1, n) are all arguably "n by 1" arrays, they aren't all the same:
>>> distance[None,:].shape, weight[:,None].shape
((1, 5), (5, 1))
>>> distance[None,:]
array([[False, False, False, True, True]], dtype=bool)
>>> weight[:,None]
array([[ True],
       [ True],
       [ True],
       [ True],
       [False]], dtype=bool)
>>> distance[None,:] & weight[:,None]
array([[False, False, False,  True,  True],
       [False, False, False,  True,  True],
       [False, False, False,  True,  True],
       [False, False, False,  True,  True],
       [False, False, False, False, False]], dtype=bool)
In addition to returning undesired results, this could be causing a big slowdown if the arrays are even moderately large:
>>> distance = numpy.arange(5000) > 500
>>> weight = numpy.arange(5000) < 4500
>>> %timeit distance & weight
100000 loops, best of 3: 8.17 us per loop
>>> %timeit distance[:,None] & weight[None,:]
10 loops, best of 3: 48.6 ms per loop
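If accidental broadcasting is indeed the cause, the fix is to bring all the condition arrays to one common shape before combining them, e.g. by flattening with ravel (a sketch with made-up data; the real amplitude, distance and weight come from the question):

import numpy as np

# Hypothetical stand-ins with deliberately mixed "n by 1" shapes.
amplitude = np.random.rand(1000, 1)
distance = np.random.rand(1000) * 20
weight = np.random.rand(1, 1000)

d, w = distance.ravel(), weight.ravel()   # both now shape (1000,)
index = np.where((d < 10) & (d >= 5) & (w > 0.8))
selected = amplitude.ravel()[index]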
