I'm trying to figure out a way to slice non-contiguous and non-equal-length rows of a pandas/numpy matrix so I can set the values to a common value. Has anyone come across an elegant solution for this?
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(12).reshape(3,4))
#x is the matrix we want to index into
"""
x before:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
"""
y = pd.DataFrame([[0,3],[2,2],[1,2],[0,0]])
#y is a matrix where each row holds the start idx and the (inclusive) end idx for the corresponding column of x
"""
   0  1
0  0  3
1  2  2
2  1  2
3  0  0
"""
What I'm looking for is a way to effectively select different-length slices of x based on the rows of y and set them all at once, something like:
x[y] = 0
"""
x afterwards:
array([[ 0,  1,  2,  0],
       [ 0,  5,  0,  7],
       [ 0,  0,  0, 11]])
"""
Masking can still be useful: even if a loop cannot be entirely avoided, the main dataframe x does not need to be involved in the loop, so this should speed things up:
mask = np.zeros_like(x, dtype=bool)
for i in range(len(y)):
    # end indices in y are inclusive, hence the +1
    mask[y.iloc[i, 0]:(y.iloc[i, 1] + 1), i] = True
x[mask] = 0
x
   0  1  2   3
0  0  1  2   0
1  0  5  0   7
2  0  0  0  11
As a further improvement, consider defining y as a NumPy array if possible.
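For instance, a minimal sketch of the same masking loop with y as a plain array (assuming the same inclusive-end convention as above):
y_arr = y.to_numpy()  # or construct y as a NumPy array in the first place
mask = np.zeros(x.shape, dtype=bool)
for i, (start, stop) in enumerate(y_arr):
    mask[start:stop + 1, i] = True  # stop is inclusive
x[mask] = 0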
I customized this answer to your problem:
y_t = y.values.transpose()
r = np.arange(x.shape[0])
# end indices in y are inclusive, hence '>= r'; for exclusive ends use '> r' instead
mask = ((y_t[0,:,None] <= r) & (y_t[1,:,None] >= r)).transpose()
res = x.where(~mask, 0)
res
#    0  1  2   3
# 0  0  1  2   0
# 1  0  5  0   7
# 2  0  0  0  11
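To see what the broadcasting does there, here is the same mask built with named intermediates; the shapes in the comments assume this example (3 rows, 4 columns):
starts = y_t[0, :, None]            # shape (4, 1): start row for each column of x
ends = y_t[1, :, None]              # shape (4, 1): inclusive end row for each column of x
r = np.arange(x.shape[0])           # shape (3,): candidate row indices
mask = (starts <= r) & (ends >= r)  # broadcasts to (4, 3): one test per (column, row) pair
mask = mask.transpose()             # (3, 4), aligned with x's layout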
I have 2 matrices, and I want to perform a 'cell-wise' addition, however the matrices aren't the same size. I want to preserve the cells relative positions during the calculation (i.e. their 'co-ordinates' from the top left), so a simple (if maybe not the best) solution, seems to be to pad the smaller matrix's x and y with zeros.
This thread has a perfectly satisfactory answer for concatenating vertically, and it does work with my data. Following the suggestion in that answer, I also threw in an hstack, but it complains that the dimensions (excluding the concatenation axis) need to match exactly. Perhaps hstack doesn't work as I anticipate, or not exactly equivalently to vstack, but I'm at a bit of a loss now.
This is what hstack throws at me; vstack, meanwhile, has no problem:
ValueError: all the input array dimensions except for the concatenation axis must match exactly
Essentially, the code checks which of a pair of matrices is shorter and/or narrower, and then pads that matrix with zeros to match the other.
Here's the code I have:
import numpy as np
A = np.random.randint(2, size = (3, 7))
B = np.random.randint(2, size = (5, 10))
# If the arrays have different row numbers:
if A.shape[0] < B.shape[0]:   # Is A shorter than B?
    A = np.vstack((A, np.zeros((B.shape[0] - A.shape[0], A.shape[1]))))
elif A.shape[0] > B.shape[0]: # or is A longer than B?
    B = np.vstack((B, np.zeros((A.shape[0] - B.shape[0], B.shape[1]))))
# If they have different column numbers:
if A.shape[1] < B.shape[1]:   # Is A narrower than B?
    A = np.hstack((A, np.zeros((B.shape[1] - A.shape[1], A.shape[0]))))
elif A.shape[1] > B.shape[1]: # or is A wider than B?
    B = np.hstack((B, np.zeros((A.shape[1] - B.shape[1], B.shape[0]))))
It's getting late, so it's possible I've just missed something obvious with hstack, but I can't see my logic error at the moment.
Just use np.pad:
np.pad(A, ((0, 2), (0, 3)), 'constant')  # 2 is 5-3, 3 is 10-7
[[0 1 1 0 1 0 0 0 0 0]
 [1 0 0 1 0 1 0 0 0 0]
 [1 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]]
But those four pad widths must be computed by hand; a simple method that pads the two arrays to a common shape in every case is:
A = np.ones((3, 7), int)
B = np.ones((5, 2), int)
ma, na = A.shape
mb, nb = B.shape
m, n = max(ma, mb), max(na, nb)
newA = np.zeros((m, n), A.dtype)
newA[:ma, :na] = A
newB = np.zeros((m, n), B.dtype)
newB[:mb, :nb] = B
This gives:
[[1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]]

[[1 1 0 0 0 0 0]
 [1 1 0 0 0 0 0]
 [1 1 0 0 0 0 0]
 [1 1 0 0 0 0 0]
 [1 1 0 0 0 0 0]]
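For what it's worth, once the common shape is known, the same padding can also be written with np.pad, reusing the names defined above (a sketch):
m, n = max(ma, mb), max(na, nb)
newA = np.pad(A, ((0, m - ma), (0, n - na)), 'constant')
newB = np.pad(B, ((0, m - mb), (0, n - nb)), 'constant')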
I think your hstack lines should be of the form
np.hstack((A, np.zeros((A.shape[0], B.shape[1] - A.shape[1]))))
You seem to have the rows and columns swapped.
Yes, indeed. You should swap (B.shape[1] - A.shape[1], A.shape[0]) to (A.shape[0], B.shape[1] - A.shape[1]) and so on, because you need the same number of rows to stack arrays horizontally.
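For reference, a corrected sketch of the whole script with that fix applied (the np.zeros shapes now list rows first, then columns):
import numpy as np

A = np.random.randint(2, size=(3, 7))
B = np.random.randint(2, size=(5, 10))

# Pad rows: np.zeros takes (extra_rows, existing_cols)
if A.shape[0] < B.shape[0]:
    A = np.vstack((A, np.zeros((B.shape[0] - A.shape[0], A.shape[1]))))
elif A.shape[0] > B.shape[0]:
    B = np.vstack((B, np.zeros((A.shape[0] - B.shape[0], B.shape[1]))))

# Pad columns: np.zeros takes (existing_rows, extra_cols)
if A.shape[1] < B.shape[1]:
    A = np.hstack((A, np.zeros((A.shape[0], B.shape[1] - A.shape[1]))))
elif A.shape[1] > B.shape[1]:
    B = np.hstack((B, np.zeros((B.shape[0], A.shape[1] - B.shape[1]))))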
Try b[:a.shape[0], :a.shape[1]] = b[:a.shape[0], :a.shape[1]] + a, where b is the larger array.
Example below
import numpy as np
a = np.arange(12).reshape(3, 4)
print("a\n", a)
b = np.arange(16).reshape(4, 4)
print("b original\n", b)
b[:a.shape[0], :a.shape[1]] = b[:a.shape[0], :a.shape[1]]+a
print("b new\n",b)
output
a
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
b original
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
b new
[[ 0  2  4  6]
 [ 8 10 12 14]
 [16 18 20 22]
 [12 13 14 15]]
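Note this assumes b is at least as large as a in both dimensions. When neither array dominates, one way to combine this with the padding idea above is to add both into a common zero matrix (a sketch):
import numpy as np

a = np.arange(12).reshape(3, 4)
b = np.arange(16).reshape(4, 4)

# pad both into a common zero matrix, then add in place
m = max(a.shape[0], b.shape[0])
n = max(a.shape[1], b.shape[1])
total = np.zeros((m, n), dtype=a.dtype)
total[:a.shape[0], :a.shape[1]] += a
total[:b.shape[0], :b.shape[1]] += b
print(total)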
import numpy as np

big_array = np.array((
    [0,1,0,0,1,0,0,1],
    [0,1,0,0,0,0,0,0],
    [0,1,0,0,1,0,0,0],
    [0,0,0,0,1,0,0,0],
    [1,0,0,0,1,0,0,0]))
print(big_array)
[[0 1 0 0 1 0 0 1]
 [0 1 0 0 0 0 0 0]
 [0 1 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 0]
 [1 0 0 0 1 0 0 0]]
Is there a way to iterate over this numpy array and for each 2x2 cluster of 0s, set all values within that cluster = 5? This is what the output would look like.
[[0 1 5 5 1 5 5 1]
 [0 1 5 5 0 5 5 0]
 [0 1 5 5 1 5 5 0]
 [0 0 5 5 1 5 5 0]
 [1 0 5 5 1 5 5 0]]
My thought is to use advanced indexing to set each 2x2 block to 5, but I think it would be really slow to simply iterate like this:
1) check if array[x][y] is 0
2) check if adjacent array elements are 0
3) if all elements are 0, set all those values to 5.
big_array = [1, 7, 0, 0, 3]
i = 0
p = 0
while i <= len(big_array) - 1 and p <= len(big_array) - 2:
    if big_array[i] == big_array[p + 1]:
        big_array[i] = 5
        big_array[p + 1] = 5
        print(big_array)
    i = i + 1
    p = p + 1
Output:
[1, 7, 5, 5, 3]
This is just an illustrative example, not complete, correct code.
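A corrected 1-D sketch that only rewrites adjacent pairs of zeros (rather than any pair of equal values) might look like this:
big_array = [1, 7, 0, 0, 3]
for i in range(len(big_array) - 1):
    if big_array[i] == 0 and big_array[i + 1] == 0:
        big_array[i] = big_array[i + 1] = 5
print(big_array)  # [1, 7, 5, 5, 3]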
Here's a solution by viewing the array as blocks.
First you need to define this function rolling_window from here https://gist.github.com/seberg/3866040/revisions
Then break the array big, your starting array, into 2x2 blocks using this function.
Also generate an array which has indices of every element in big and break it similarly into 2x2 blocks.
Then generate a boolean mask where the 2x2 blocks of big are all zero, and use the index array to get those elements.
blks = rolling_window(big, window=(2,2))  # 2x2 blocks of the original array
inds = np.indices(big.shape).transpose(1,2,0)  # array of indices into big
blkinds = rolling_window(inds, window=(2,2,0)).transpose(0,1,4,3,2)  # 2x2 blocks of indices into big
mask = blks == np.zeros((2,2))  # generate a mask of every 2x2 block which is all zero
mask = mask.reshape(*mask.shape[:-2], -1).all(-1)  # still generating the mask
# now blks[mask] is every block which is all zero...
# but you actually want the original indices in the array 'big' instead
inds = blkinds[mask].reshape(-1,2).T  # indices into big where elements need replacing
big[inds[0], inds[1]] = 5  # reassign
You need to test this (I did not), but the idea is to break the array into blocks, build a matching array of indices into blocks, develop a boolean condition on the blocks, use it to pull out the indices, and then reassign.
An alternative would be to iterate through blkinds as defined here, test the 2x2 block of big at each element, and reassign if necessary.
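On NumPy 1.20+, the same block view is available in the standard library as np.lib.stride_tricks.sliding_window_view, which avoids the external gist. A sketch of the whole idea with it (the mask is materialized before any writes, since the window view aliases big):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

big = np.array([[0,1,0,0,1,0,0,1],
                [0,1,0,0,0,0,0,0],
                [0,1,0,0,1,0,0,0],
                [0,0,0,0,1,0,0,0],
                [1,0,0,0,1,0,0,0]])

blks = sliding_window_view(big, (2, 2))     # shape (4, 7, 2, 2): all 2x2 windows, as a view
zero_blocks = ~blks.any(axis=(2, 3))        # (4, 7): True where a window is all zero
for r, c in zip(*np.nonzero(zero_blocks)):  # top-left corner of each all-zero window
    big[r:r + 2, c:c + 2] = 5
print(big)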
This is my attempt to help you solve your problem. My solution may be subject to fair criticism.
import numpy as np
from itertools import product
m = np.array((
    [0,1,0,0,1,0,0,1],
    [0,1,0,0,0,0,0,0],
    [0,1,0,0,1,0,0,0],
    [0,0,0,0,1,0,0,0],
    [1,0,0,0,1,0,0,0]))
h = 2
w = 2
# number of valid top-left corners along each axis
rr, cc = tuple(d + 1 - q for d, q in zip(m.shape, (h, w)))
# one slice pair for every h-by-w window that is entirely zero
slices = [(slice(r, r + h), slice(c, c + w))
          for r, c in product(range(rr), range(cc))
          if not m[r:r + h, c:c + w].any()]
for s in slices:
    m[s] = 5
print(m)
[[0 1 5 5 1 5 5 1]
 [0 1 5 5 0 5 5 5]
 [0 1 5 5 1 5 5 5]
 [0 5 5 5 1 5 5 5]
 [1 5 5 5 1 5 5 5]]
(This marks every overlapping 2x2 all-zero window, which is why a few extra cells become 5 compared with the sample output in the question.)
I have a set of objects (cars) and their positions over time. I would like to get the distance between each car and its nearest neighbour, and calculate an average of this for each time point. An example dataframe is as follows:
import pandas as pd

time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56]
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
df
        x    y  car
time
0     216   13    1
0     218   12    2
0     217   12    3
1     280  110    1
1     290  109    3
2     130    3    4
2     132   56    5
For each time point, I would like to know the nearest car neighbour for each car. Example:
df2
      car  nearest_neighbour  euclidean_distance
time
0       1                  3                1.41
0       2                  3                1.00
0       3                  1                1.41
1       1                  3               10.05
1       3                  1               10.05
2       4                  5               53.04
2       5                  4               53.04
I know I can calculate the pairwise distances between cars from How to apply euclidean distance function to a groupby object in pandas dataframe?, but how do I get the nearest neighbour for each car?
After that, it seems simple enough to get an average of the distances for each frame using groupby, but it's the second step that really throws me off.
Help appreciated!
It might be a bit of overkill, but you could use NearestNeighbors from scikit-learn.
An example:
import numpy as np
from sklearn.neighbors import NearestNeighbors
import pandas as pd
def nn(x):
    nbrs = NearestNeighbors(n_neighbors=2, algorithm='auto', metric='euclidean').fit(x)
    distances, indices = nbrs.kneighbors(x)
    return distances, indices
time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56]
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
# This has the index of the nearest neighbour in the group, as well as the distance
nns = df.drop(columns='car').groupby('time').apply(lambda x: nn(x.to_numpy()))
groups = df.groupby('time')
nn_rows = []
for i, nn_set in enumerate(nns):
    group = groups.get_group(i)
    for j, tup in enumerate(zip(nn_set[0], nn_set[1])):
        nn_rows.append({'time': i,
                        'car': group.iloc[j]['car'],
                        'nearest_neighbour': group.iloc[tup[1][1]]['car'],
                        'euclidean_distance': tup[0][1]})
nn_df = pd.DataFrame(nn_rows).set_index('time')
Result:
      car  euclidean_distance  nearest_neighbour
time
0       1            1.414214                  3
0       2            1.000000                  3
0       3            1.000000                  2
1       1           10.049876                  3
1       3           10.049876                  1
2       4           53.037722                  5
2       5           53.037722                  4
(Note that at time 0, car 3's nearest neighbour is car 2: its distance to car 1 is sqrt((217-216)**2 + (12-13)**2) ≈ 1.4142, while its distance to car 2 is sqrt((218-217)**2 + (12-12)**2) = 1.)
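The per-time average that the question ultimately asks for is then a one-liner on this result:
# mean nearest-neighbour distance per time point, from nn_df above
avg_dist = nn_df.groupby(level='time')['euclidean_distance'].mean()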
Use cdist from scipy.spatial.distance to get a matrix of distances from each car to every other car. Since each car's distance to itself is 0, the diagonal elements are all 0.
example (for time == 0):
from scipy.spatial.distance import cdist

X = df[df.time==0][['x','y']]
dist = cdist(X, X)
dist
array([[0.        , 2.23606798, 1.41421356],
       [2.23606798, 0.        , 1.        ],
       [1.41421356, 1.        , 0.        ]])
Use np.argsort to get the indexes that would sort the distance matrix. The first column is just the row number, because the diagonal elements are 0; the second column is the nearest neighbour:
v = np.argsort(dist)
v
array([[0, 2, 1],
       [1, 2, 0],
       [2, 1, 0]], dtype=int64)
Then just pick out the closest distances and the corresponding cars using v:
dist[v[:,0], v[:,1]]
array([1.41421356, 1. , 1. ])
df[df.time==0].car.values[v[:,1]]
array([3, 3, 2], dtype=int64)
combine the above logic into a function that returns the required dataframe:
def closest(df):
    X = df[['x', 'y']]
    dist = cdist(X, X)
    v = np.argsort(dist)
    return df.assign(euclidean_distance=dist[v[:, 0], v[:, 1]],
                     nearest_neighbour=df.car.values[v[:, 1]])
& use it with groupby, finally dropping the index because the groupby-apply adds an additional index
df.groupby('time').apply(closest).reset_index(drop=True)
   time    x    y  car  euclidean_distance  nearest_neighbour
0     0  216   13    1            1.414214                  3
1     0  218   12    2            1.000000                  3
2     0  217   12    3            1.000000                  2
3     1  280  110    1           10.049876                  3
4     1  290  109    3           10.049876                  1
5     2  130    3    4           53.037722                  5
6     2  132   56    5           53.037722                  4
By the way, your sample output is wrong for time 0; my answer and Bacon's answer both show the correct result.
I apologize if this has been asked here - I've hunted around here and in the Tentative NumPy Tutorial for an answer.
I have 2 numpy arrays. The first array is similar to:
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
(etc... It's ~700x10 in actuality)
I then have a 2nd array similar to
3 1
4 18
5 2
(again, longer - maybe 400 or so rows)
The first column of the 2nd array is always completely contained within the first column of the first array
What I'd like to do is insert the 2nd column of the 2nd array into the first array as part of an existing column, i.e.:
array a:
1 0 0 0 0
2 0 0 0 0
3 1 0 0 0
4 18 0 0 0
5 2 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
(I'd be filling in each of those columns in turn, but each covers a different range within the original)
My first try was along the lines of a[b[:,0],1] = b[:,1], but that treats the first column of b as positional row indices rather than values to match in a's first column, so everything landed one row off (a's first column starts at 1, not 0). I should have realized that!
Since then, I've tried to make it work pretty inelegantly with where(), and I think I could make it work by finding the difference in the starting values of the first columns.
I'm new to python, so perhaps I'm overly optimistic - but it seems like there should be a more elegant way and I'm just missing it.
Thanks for any insights!
If the numbers in the first column of a are in sorted order, then you could use
a[a[:,0].searchsorted(b[:,0]),1] = b[:,1]
For example:
import numpy as np
a = np.array([(1,0,0,0,0),
              (2,0,0,0,0),
              (3,0,0,0,0),
              (4,0,0,0,0),
              (5,0,0,0,0),
              (6,0,0,0,0),
              (7,0,0,0,0),
              (8,0,0,0,0),
              ])
b = np.array([(3, 1),
              (5, 18),
              (7, 2)])
a[a[:,0].searchsorted(b[:,0]),1] = b[:,1]
print(a)
yields
[[ 1  0  0  0  0]
 [ 2  0  0  0  0]
 [ 3  1  0  0  0]
 [ 4  0  0  0  0]
 [ 5 18  0  0  0]
 [ 6  0  0  0  0]
 [ 7  2  0  0  0]
 [ 8  0  0  0  0]]
(I changed your example a bit to show that the values in b's first column do not have to be contiguous.)
If a[:,0] is not in sorted order, then you could use np.argsort to workaround this:
a = np.array([(1,0,0,0,0),
              (2,0,0,0,0),
              (5,0,0,0,0),
              (3,0,0,0,0),
              (4,0,0,0,0),
              (6,0,0,0,0),
              (7,0,0,0,0),
              (8,0,0,0,0),
              ])
b = np.array([(3, 1),
              (5, 18),
              (7, 2)])
perm = np.argsort(a[:,0])
a[:,1][perm[a[:,0][perm].searchsorted(b[:,0])]] = b[:,1]
print(a)
yields
[[ 1  0  0  0  0]
 [ 2  0  0  0  0]
 [ 5 18  0  0  0]
 [ 3  1  0  0  0]
 [ 4  0  0  0  0]
 [ 6  0  0  0  0]
 [ 7  2  0  0  0]
 [ 8  0  0  0  0]]
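Unpacking that one-liner with named intermediates may make it easier to follow (an equivalent sketch):
perm = np.argsort(a[:, 0])                         # permutation that sorts a's first column
sorted_keys = a[:, 0][perm]                        # the first column, in sorted order
pos_in_sorted = sorted_keys.searchsorted(b[:, 0])  # where b's keys land in the sorted view
rows = perm[pos_in_sorted]                         # map back to row numbers of the unsorted a
a[rows, 1] = b[:, 1]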
The setup:
a = np.arange(20).reshape(2,10).T
b = np.array([[1, 100], [3, 300], [8, 800]])
This will work if you don't know anything about a[:, 0] except that it is sorted.
index = a[:, 0].searchsorted(b[:, 0])
a[index, 1] = b[:, 1]
a
array([[  0,  10],
       [  1, 100],
       [  2,  12],
       [  3, 300],
       [  4,  14],
       [  5,  15],
       [  6,  16],
       [  7,  17],
       [  8, 800],
       [  9,  19]])
But if you know that a[:, 0] is a sequence of contiguous integers like your example, you can do:
index = b[:, 0] - a[0, 0]  # value v sits at row v - a[0, 0]
a[index, 1] = b[:, 1]
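As a quick check with the setup above, reading those rows back gives the inserted values:
print(a[[1, 3, 8], 1])  # [100 300 800]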