Suppose we have a 2D numpy array of zeros:
arr1 = np.zeros((len(Train), L))
where Train is a (dataset) numpy array of arrays of integers of fixed length.
We also have another 1D numpy array, positions, of length len(Train).
Now we wish to copy each row of Train into the corresponding row of arr1, starting at the column given by positions.
One way is to use a for loop on the Train array as:
k=len(Train[0])
for i in range(len(Train)):
    arr1[i, int(positions[i]):int(positions[i] + k)] = Train[i, 0:k]
However, going over the entire Train set using the explicit for loop is slow and I would like to optimize it.
Here is one way by generating all the indexes you want to assign to. Setup:
import numpy as np
n = 12 # Number of training samples
l = 8 # Number of columns in the output array
k = 4 # Number of columns in the training samples
arr = np.zeros((n, l), dtype=int)
train = np.random.randint(10, size=(n, k))
positions = np.random.randint(l - k, size=n)
Random example data:
>>> train
array([[3, 4, 3, 2],
[3, 6, 4, 1],
[0, 7, 9, 6],
[4, 0, 4, 8],
[2, 2, 6, 2],
[4, 5, 1, 7],
[5, 4, 4, 4],
[0, 8, 5, 3],
[2, 9, 3, 3],
[3, 3, 7, 9],
[8, 9, 4, 8],
[8, 7, 6, 4]])
>>> positions
array([3, 2, 3, 2, 0, 1, 2, 2, 3, 2, 1, 1])
Advanced indexing with broadcasting trickery:
rows = np.arange(n)[:, None] # Shape (n, 1)
cols = np.arange(k) + positions[:, None] # Shape (n, k)
arr[rows, cols] = train
output:
>>> arr
array([[0, 0, 0, 3, 4, 3, 2, 0],
[0, 0, 3, 6, 4, 1, 0, 0],
[0, 0, 0, 0, 7, 9, 6, 0],
[0, 0, 4, 0, 4, 8, 0, 0],
[2, 2, 6, 2, 0, 0, 0, 0],
[0, 4, 5, 1, 7, 0, 0, 0],
[0, 0, 5, 4, 4, 4, 0, 0],
[0, 0, 0, 8, 5, 3, 0, 0],
[0, 0, 0, 2, 9, 3, 3, 0],
[0, 0, 3, 3, 7, 9, 0, 0],
[0, 8, 9, 4, 8, 0, 0, 0],
[0, 8, 7, 6, 4, 0, 0, 0]])
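To check that the indexing approach above matches the explicit loop from the question, something like the following sketch can be used. The seeded generator and the `l - k + 1` upper bound (so that the last valid start column is also drawn) are my own assumptions, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only for reproducibility (assumption)
n, l, k = 12, 8, 4
train = rng.integers(10, size=(n, k))
positions = rng.integers(l - k + 1, size=n)  # start columns 0..l-k, so every row fits

# Reference result: the explicit loop from the question
expected = np.zeros((n, l), dtype=int)
for i in range(n):
    expected[i, positions[i]:positions[i] + k] = train[i]

# Vectorized: row and column indices broadcast to shape (n, k)
arr = np.zeros((n, l), dtype=int)
rows = np.arange(n)[:, None]               # shape (n, 1)
cols = positions[:, None] + np.arange(k)   # shape (n, k)
arr[rows, cols] = train

same = np.array_equal(arr, expected)
print(same)
```

Both versions write exactly the same `n * k` cells; the vectorized one just builds all the target indices up front instead of slicing row by row.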
This is the array I have at hand:
[array([[[ 4, 9, 1, -3],
[-2, 0, 8, 6],
[ 1, 3, 7, 9 ],
[ 2, 5, 0, -7],
[-1, -6, -5, -8]]]),
array([[[ 0, 2, -1, 6 ],
[9, 8, 0, 3],
[ -1, 2, 5, -4],
[0, 5, 9, 6],
[ 6, 2, 9, 4]]]),
array([[[ 1, 2, 0, 9],
[3, 4, 8, -1],
[5, 6, 9, 0],
[ 7, 8, -3, -],
[9, 0, 8, -2]]])]
The goal is to obtain array A from the first columns of the nested arrays, B from the second columns, C from the third columns, etc.
Such that:
A = array([4, -2, 1, 2, -1, 0, 9, -1, 0, 6, 1, 3, 5, 7, 9])
B = array([9, 0, 3, 5, -6, 2, 8, 2, 5, 2, 2, 4, 6, 8, 0])
How should I do this?
You can do this with a single hstack() and use squeeze() to remove the extra dimension. With that you can use regular numpy indexing to pull out columns (or anything else you want):
import numpy as np
l = [np.array([[[ 4, 9, 1, -3],
[-2, 0, 8, 6],
[ 1, 3, 7, 9 ],
[ 2, 5, 0, -7],
[-1, -6, -5, -8]]]),
np.array([[[ 0, 2, -1, 6 ],
[9, 8, 0, 3],
[ -1, 2, 5, -4],
[0, 5, 9, 6],
[ 6, 2, 9, 4]]]),
np.array([[[ 1, 2, 0, 9],
[3, 4, 8, -1],
[5, 6, 9, 0],
[ 7, 8, -3, -1],
[9, 0, 8, -2]]])]
arr = np.hstack(l).squeeze()
A = arr[:,0]
print(A)
# [ 4 -2 1 2 -1 0 9 -1 0 6 1 3 5 7 9]
B = arr[:,1]
print(B)
#[ 9 0 3 5 -6 2 8 2 5 2 2 4 6 8 0]
# etc...
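If I understand correctly, you can take a given column from each nested array and concatenate the pieces: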
l = [np.array([[[ 4, 9, 1, -3],
[-2, 0, 8, 6],
[ 1, 3, 7, 9 ],
[ 2, 5, 0, -7],
[-1, -6, -5, -8]]]),
np.array([[[ 0, 2, -1, 6 ],
[9, 8, 0, 3],
[ -1, 2, 5, -4],
[0, 5, 9, 6],
[ 6, 2, 9, 4]]]),
np.array([[[ 1, 2, 0, 9],
[3, 4, 8, -1],
[5, 6, 9, 0],
[ 7, 8, -3, -9],
[9, 0, 8, -2]]])]
a = np.hstack([i[0][:, 0] for i in l])
b = np.hstack([i[0][:, 1] for i in l])
Output:
array([ 4, -2, 1, 2, -1, 0, 9, -1, 0, 6, 1, 3, 5, 7, 9])
array([ 9, 0, 3, 5, -6, 2, 8, 2, 5, 2, 2, 4, 6, 8, 0])
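If every column is wanted at once, one possible variation is to join the nested arrays along their row axis and unpack the transpose. This is only a sketch: `parts` is small hypothetical stand-in data with the same layout as in the question (arrays of shape `(1, rows, cols)`), not the original values.

```python
import numpy as np

# Hypothetical stand-in data: two arrays of shape (1, 3, 2)
parts = [np.array([[[1, 2], [3, 4], [5, 6]]]),
         np.array([[[7, 8], [9, 10], [11, 12]]])]

# Join along the row axis, then drop the leading length-1 axis
stacked = np.concatenate(parts, axis=1)[0]   # shape (6, 2)

# Transposing turns each column into a row, which can be unpacked
A, B = stacked.T
print(A.tolist())  # [1, 3, 5, 7, 9, 11]
print(B.tolist())  # [2, 4, 6, 8, 10, 12]
```

Unpacking requires knowing the number of columns in advance; otherwise `stacked.T[j]` gives column `j` directly.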
This is likely a repost but I'm not sure what wording to use for the title.
I'm trying to subtract the values of arrays inside arrays by reshaping them to create a larger array.
xn = np.array([[1,2,3],[4,5,6]])
yn = np.array([[1,2,3,4,5], [6,7,8,9,10]])
xn.shape
Out[42]: (2, 3)
yn.shape
Out[43]: (2, 5)
The functionality I want is:
yn.reshape(2,-1,1) - xn
This raises a ValueError, but the expression below works fine when I index away the first dimension:
yn.reshape(2,-1,1)[0] - xn[0]
Out[44]:
array([[ 0, -1, -2],
[ 1, 0, -1],
[ 2, 1, 0],
[ 3, 2, 1],
[ 4, 3, 2]])
This is the first block of the output I would expect, since xn and yn both have a first dimension of 2.
Is there a proper way to do this with the desired broadcasting?
Desired output:
array([[[ 0, -1, -2],
[ 1, 0, -1],
[ 2, 1, 0],
[ 3, 2, 1],
[ 4, 3, 2]],
[[2, 1, 0],
[3, 2, 1],
[4, 3, 2],
[5, 4, 3],
[6, 5, 4]]])
>>> x
array([[1, 2, 3],
[4, 5, 6]])
>>> y
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10]])
>>> z = y.reshape(2,-1,1)
Add another axis to x:
>>> z-x[:,None,:]
array([[[ 0, -1, -2],
[ 1, 0, -1],
[ 2, 1, 0],
[ 3, 2, 1],
[ 4, 3, 2]],
[[ 2, 1, 0],
[ 3, 2, 1],
[ 4, 3, 2],
[ 5, 4, 3],
[ 6, 5, 4]]])
Or just:
>>> y[...,None] - x[:,None,:]
array([[[ 0, -1, -2],
[ 1, 0, -1],
[ 2, 1, 0],
[ 3, 2, 1],
[ 4, 3, 2]],
[[ 2, 1, 0],
[ 3, 2, 1],
[ 4, 3, 2],
[ 5, 4, 3],
[ 6, 5, 4]]])
By the broadcasting rules, two shapes are compatible when, compared element-wise starting from the trailing dimension and moving backwards, each pair of dimensions is either equal or one of them is 1. Reshaping xn to (2, 3, 1) and then swapping its last two axes gives shape (2, 1, 3), which broadcasts against (2, 5, 1):
yn.reshape(2, -1, 1) - xn.reshape(2, -1, 1).swapaxes(-1, -2)
array([[[ 0, -1, -2],
[ 1, 0, -1],
[ 2, 1, 0],
[ 3, 2, 1],
[ 4, 3, 2]],
[[ 2, 1, 0],
[ 3, 2, 1],
[ 4, 3, 2],
[ 5, 4, 3],
[ 6, 5, 4]]])
The shape of yn.reshape(2, -1, 1) is (2, 5, 1) and the shape of xn.reshape(2, -1, 1).swapaxes(-1, -2) is (2, 1, 3). These broadcast because, comparing from the trailing dimension, each pair of dimensions is either equal or contains a 1: (1, 3), (5, 1), (2, 2). The result has shape (2, 5, 3).
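The shape reasoning can be checked directly. The sketch below reuses xn and yn from the question; the `failed` flag and the `out` name are just illustrative.

```python
import numpy as np

xn = np.array([[1, 2, 3], [4, 5, 6]])                # shape (2, 3)
yn = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])   # shape (2, 5)

# (2, 5, 1) against (2, 3): aligned from the right, 5 is paired with 2,
# which is neither equal nor 1, so broadcasting fails.
try:
    yn.reshape(2, -1, 1) - xn
    failed = False
except ValueError:
    failed = True

# Give each array a length-1 axis where the other one varies:
# (2, 5, 1) - (2, 1, 3) -> (2, 5, 3)
out = yn[:, :, None] - xn[:, None, :]
print(failed, out.shape)
```

Here `out[i, j, m]` equals `yn[i, j] - xn[i, m]`, which is the per-row outer difference the question asks for.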
So I have a very large two-dimensional numpy array such as:
array([[ 2, 4, 0, 0, 0, 5, 9, 0],
[ 2, 3, 0, 1, 0, 3, 1, 1],
[ 1, 5, 4, 3, 2, 7, 8, 3],
[ 0, 7, 0, 0, 0, 6, 4, 4],
...,
[ 6, 5, 6, 0, 0, 1, 9, 5]])
I would like to quickly remove each row of the array where np.sum(row[2:5]) == 0
The only way I can think to do this is with for loops, but that takes very long when there are millions of rows. Additionally, this needs to be constrained to Python 2.7
Boolean expressions can be used as an index. You can use them to mask the array.
inputarray = array([[ 2, 4, 0, 0, 0, 5, 9, 0],
[ 2, 3, 0, 1, 0, 3, 1, 1],
[ 1, 5, 4, 3, 2, 7, 8, 3],
[ 0, 7, 0, 0, 0, 6, 4, 4],
...,
[ 6, 5, 6, 0, 0, 1, 9, 5]])
mask = numpy.sum(inputarray[:,2:5], axis=1) != 0
result = inputarray[mask,:]
What this is doing:
inputarray[:, 2:5] selects all the columns you want to sum over
axis=1 sums across the columns, producing one sum per row
We want to keep the rows where the sum is not zero
The mask is used as a row index and selects the rows where the boolean expression is True
Another solution would be to use numpy.apply_along_axis to calculate the sums and cast it as a bool, and use that for your index:
my_arr = np.array([[ 2, 4, 0, 0, 0, 5, 9, 0],
[ 2, 3, 0, 1, 0, 3, 1, 1],
[ 1, 5, 4, 3, 2, 7, 8, 3],
[ 0, 7, 0, 0, 0, 6, 4, 4],])
my_arr[np.apply_along_axis(lambda x: bool(sum(x[2:5])), 1, my_arr)]
array([[2, 3, 0, 1, 0, 3, 1, 1],
[1, 5, 4, 3, 2, 7, 8, 3]])
We just cast the sum to a bool, since any number that's not 0 is going to be True.
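Note that np.apply_along_axis calls the Python function once per row, so on millions of rows it behaves much like an explicit loop. A quick way to convince yourself that it selects the same rows as a vectorized sum is a comparison like the one below; the random test data and the seed are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)            # seeded for reproducibility (assumption)
arr = rng.randint(0, 3, size=(1000, 8))   # hypothetical non-negative test data

# Row-by-row version: the lambda runs once for each of the 1000 rows
mask_loop = np.apply_along_axis(lambda x: bool(sum(x[2:5])), 1, arr)

# Vectorized version: a single sum over the sliced columns
mask_vec = arr[:, 2:5].sum(axis=1) != 0

agree = np.array_equal(mask_loop, mask_vec)
print(agree)
```

Both masks can be used as `arr[mask]` to keep the rows whose sum over columns 2 to 4 is non-zero.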
>>> a
array([[2, 4, 0, 0, 0, 5, 9, 0],
[2, 3, 0, 1, 0, 3, 1, 1],
[1, 5, 4, 3, 2, 7, 8, 3],
[0, 7, 0, 0, 0, 6, 4, 4],
[6, 5, 6, 0, 0, 1, 9, 5]])
You are interested in columns 2 through 4 (the slice 2:5 excludes index 5)
>>> a[:,2:5]
array([[0, 0, 0],
[0, 1, 0],
[4, 3, 2],
[0, 0, 0],
[6, 0, 0]])
>>> b = a[:,2:5]
You want to find the sum of those columns in each row
>>> sum_ = b.sum(1)
>>> sum_
array([0, 1, 9, 0, 6])
These are the rows that meet your criteria
>>> sum_ != 0
array([False, True, True, False, True], dtype=bool)
>>> keep = sum_ != 0
Use boolean indexing to select those rows
>>> a[keep, :]
array([[2, 3, 0, 1, 0, 3, 1, 1],
[1, 5, 4, 3, 2, 7, 8, 3],
[6, 5, 6, 0, 0, 1, 9, 5]])
I have an interesting puzzle. Suppose you have a numpy 2D array in which each line corresponds to a measurement event and each column corresponds to a different measured variable. One additional column in this array specifies the date on which the measurement was taken. The lines are sorted according to the time stamp. There are several (or many) measurements on each day. The goal is to identify the lines that correspond to the start of a new day and subtract their values from the subsequent lines of that day.
My current approach loops over the days, creates a boolean vector that selects the lines of each day, and then subtracts the first selected line. This works, but feels inelegant. Are there better ways to do this?
Just a small example. The lines below define a matrix in which the first column is the day and the remaining two are the measured values:
before = array([[ 1, 1, 2],
[ 1, 3, 4],
[ 1, 5, 6],
[ 2, 7, 8],
[ 3, 9, 10],
[ 3, 11, 12],
[ 3, 13, 14]])
At the end of the process I expect to see the following array:
array([[1, 0, 0],
[1, 2, 2],
[1, 4, 4],
[2, 0, 0],
[3, 0, 0],
[3, 2, 2],
[3, 4, 4]])
PS: Please help me find a better and more informative title for this post. I'm out of ideas.
numpy.searchsorted is a convenient function for this:
In : before
Out:
array([[ 1, 1, 2],
[ 1, 3, 4],
[ 1, 5, 6],
[ 2, 7, 8],
[ 3, 9, 10],
[ 3, 11, 12],
[ 3, 13, 14]])
In : diff = before[before[:,0].searchsorted(before[:,0])]
In : diff[:,0] = 0
In : before - diff
Out:
array([[1, 0, 0],
[1, 2, 2],
[1, 4, 4],
[2, 0, 0],
[3, 0, 0],
[3, 2, 2],
[3, 4, 4]])
Longer explanation
If you take the first column and search for it within itself, searchsorted (which defaults to side='left') returns, for each value, the index of its first occurrence. This relies on the column being sorted:
In : before
Out:
array([[ 1, 1, 2],
[ 1, 3, 4],
[ 1, 5, 6],
[ 2, 7, 8],
[ 3, 9, 10],
[ 3, 11, 12],
[ 3, 13, 14]])
In : before[:,0].searchsorted(before[:,0])
Out: array([0, 0, 0, 3, 4, 4, 4])
You can then use this to construct the matrix that you will subtract by indexing:
In : diff = before[before[:,0].searchsorted(before[:,0])]
In : diff
Out:
array([[ 1, 1, 2],
[ 1, 1, 2],
[ 1, 1, 2],
[ 2, 7, 8],
[ 3, 9, 10],
[ 3, 9, 10],
[ 3, 9, 10]])
You need to set the first column to 0 so that the day values themselves are not subtracted.
In : diff[:,0] = 0
In : diff
Out:
array([[ 0, 1, 2],
[ 0, 1, 2],
[ 0, 1, 2],
[ 0, 7, 8],
[ 0, 9, 10],
[ 0, 9, 10],
[ 0, 9, 10]])
Finally, subtract two matrices to get the desired output:
In : before - diff
Out:
array([[1, 0, 0],
[1, 2, 2],
[1, 4, 4],
[2, 0, 0],
[3, 0, 0],
[3, 2, 2],
[3, 4, 4]])
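The steps above can also be collected into one short script. This is a sketch under the same assumption as the question (the day column is sorted); the names `days`, `first` and `after` are mine, and only the value columns are subtracted, so no zeroing of the day column is needed.

```python
import numpy as np

before = np.array([[1, 1, 2],
                   [1, 3, 4],
                   [1, 5, 6],
                   [2, 7, 8],
                   [3, 9, 10],
                   [3, 11, 12],
                   [3, 13, 14]])

days = before[:, 0]
# For a sorted column, searching it in itself gives, for every row,
# the index of the first row that has the same day.
first = days.searchsorted(days)

after = before.copy()
# Subtract the first measurement of the day from the value columns only
after[:, 1:] -= before[first, 1:]
print(after.tolist())
```

Working on a copy leaves `before` untouched, which makes it easy to compare input and output.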