Let's say I have a 2D array with positive integers:
a = numpy.array([[1, 1, 2],
                 [1, 2, 5],
                 [1, 3, 6],
                 [3, 3, 3],
                 [3, 4, 6],
                 [4, 5, 6],
                 ])
and a threshold (a positive integer). I want to count, for each row, how many occurrences are < threshold, how many are >= threshold and < threshold + 2, and how many are >= threshold + 2. The results are to be stored in an array of size n x 3, where n = a.shape[0] and each of the 3 columns corresponds to one threshold partition.
For the example above and threshold = 3, it would be:
b = numpy.array([[3, 0, 0],
                 [2, 0, 1],
                 [1, 1, 1],
                 [0, 3, 0],
                 [0, 2, 1],
                 [0, 1, 2],
                 ])
My solution was to use a for loop combined with masks, so that I could apply the masks individually for each row. But using for loops on arrays feels wrong. Is there a more optimized way to accomplish that?
My solution so far:
b = []
for row in a:
    b.append((numpy.sum(row < threshold),
              numpy.sum((row >= threshold) * (row < threshold + 2)),
              numpy.sum(row >= threshold + 2)))
b = numpy.array(b)
Approach #1
Making use of elementwise comparison against the thresholds and summing each row -
t = 3 # threshold
mask0 = (a<t)
mask2 = a>=t+2
mask1 = (a>=t) & ~mask2
out = np.c_[mask0.sum(1), mask1.sum(1), mask2.sum(1)]
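For the sample a and t = 3 this reproduces the expected b from the question:
out
array([[3, 0, 0],
       [2, 0, 1],
       [1, 1, 1],
       [0, 3, 0],
       [0, 2, 1],
       [0, 1, 2]])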
Approach #2
If you think about it closely, we are creating three bins there. So, we could get the bin ID for each element and, finally, get the count for each row based on those IDs. We would use np.searchsorted to get the bin IDs and then elementwise equate and sum along each row.
Thus, we would have a solution, like so -
t = 3 # threshold
bins = [t, t+2] # Create intervals
N = len(bins)+1 # Number of cols in output
idx = np.searchsorted(bins,a,'right') # Get bin IDs
out = np.column_stack([(idx==i).sum(1) for i in range(N)])
We can vectorize the last step with broadcasting -
out = (idx == np.arange(N)[:,None,None]).sum(2).T
And one more vectorized alternative, which would also be memory efficient with np.bincount -
M = a.shape[0]
r = N*np.arange(M)[:,None]
out = np.bincount((idx + r).ravel(),minlength=M*N).reshape(M,N)
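To see why the row offsets make a single bincount sufficient (a small illustration of the step above, not part of the original code): row m is shifted by m*N, so its bin IDs occupy the slots m*N to m*N + N - 1 and can never collide with another row's.
idx_row1 = np.searchsorted(bins, a[1], 'right')  # row 1 of a is [1, 2, 5] -> bin IDs [0, 0, 2]
idx_row1 + 1*N                                   # -> [3, 3, 5], row 1's private slots
# bincount over all shifted IDs therefore counts every (row, bin) pair in one pass,
# and reshape(M, N) maps slot m*N + k back to "count of bin k in row m".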
You have two break points: 3 and 5. We can use np.searchsorted to find where each element of a falls with respect to those break points.
np.searchsorted([3, 5], 1, side='right') will return 0, because 1 would have to be inserted at position 0 to maintain sorted order.
np.searchsorted([3, 5], 3, side='right') will return 1. The default behavior (side='left') is to insert to the left of equal elements; side='right' inserts to the right of all equal elements. That is what makes a value equal to a break point land in the higher bin, so the three bins are exactly < threshold, >= threshold and < threshold + 2, and >= threshold + 2.
np.searchsorted([3, 5], 5, side='right') will return 2.
np.searchsorted([3, 5], 7, side='right') will return 2.
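Applied to the whole sample a from the question at once (shown here just for illustration), those bin IDs come out as:
np.searchsorted([3, 5], a, side='right')
array([[0, 0, 0],
       [0, 0, 2],
       [0, 1, 2],
       [1, 1, 1],
       [1, 1, 2],
       [1, 2, 2]])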
I use np.eye to build sub arrays to sum over in order to count how many fall within each bin.
np.eye(3, dtype=int)[np.searchsorted([3, 5], a, side='right')].sum(1)
array([[3, 0, 0],
       [2, 0, 1],
       [1, 1, 1],
       [0, 3, 0],
       [0, 2, 1],
       [0, 1, 2]])
We can generalize this with a function
def count_bins(a, threshold, interval_sizes):
    edges = np.append(threshold, interval_sizes).cumsum()
    eye = np.eye(edges.size + 1, dtype=int)
    return eye[edges.searchsorted(a, side='right')].sum(1)
count_bins(a, 3, [2])
array([[3, 0, 0],
       [2, 0, 1],
       [1, 1, 1],
       [0, 3, 0],
       [0, 2, 1],
       [0, 1, 2]])
Or
count_bins(a, 3, [1, 1])
array([[3, 0, 0, 0],
       [2, 0, 0, 1],
       [1, 1, 0, 1],
       [0, 3, 0, 0],
       [0, 1, 1, 1],
       [0, 0, 1, 2]])
But I'd rather return a pandas dataframe to see things more clearly
import pandas as pd

def count_bins(a, threshold, interval_sizes):
    edges = np.append(threshold, interval_sizes).cumsum()
    eye = np.eye(edges.size + 1, dtype=int)
    labels = ['{:0.0f} to {:0.0f}'.format(i, j)
              for i, j in zip(np.append(-np.inf, edges), np.append(edges, np.inf))]
    return pd.DataFrame(
        eye[edges.searchsorted(a, side='right')].sum(1),
        columns=labels
    )
count_bins(a, 3, [2])
   -inf to 3  3 to 5  5 to inf
0          3       0         0
1          2       0         1
2          1       1         1
3          0       3         0
4          0       2         1
5          0       1         2
I have a matrix 'A' whose values are shown below. After creating a matrix 'B' of ones using numpy.ones and assigning the values from 'A' to 'B' by indexing 'i' rows and 'j' columns, the resulting 'B' matrix is retaining the first row of ones from the original 'B' matrix. I'm not sure why this is happening with the code provided below.
The resulting 'B' matrix from command line is shown below:
import numpy as np

A = np.matrix([[8,8,8,7,7,6,8,2],
               [8,8,7,7,7,6,6,7],
               [1,8,8,7,7,6,6,6],
               [1,1,8,7,7,6,7,7],
               [1,1,1,1,8,7,7,6],
               [1,1,2,1,8,7,7,6],
               [2,2,2,1,1,8,7,7],
               [2,1,2,1,1,8,8,7]])
B = np.ones((8,8), dtype=int)
for i in np.arange(1,9):
    for j in np.arange(1,9):
        B[i:j] = A[i:j]

C = np.zeros((6,6), dtype=int)
print(C)
D = np.matrix([[1,1,2,3,3,2,2,1],
               [1,2,1,2,3,3,3,2],
               [1,1,2,1,1,2,2,3],
               [2,2,3,2,2,2,1,3],
               [1,2,2,3,2,3,1,3],
               [1,2,3,3,2,3,2,3],
               [1,2,2,3,2,3,1,2],
               [2,2,3,2,2,3,2,2]])
print(D)
for k in np.arange(2,8):
    for l in np.arange(2,8):
        B[k,l]  # point in middle
        b = B[(k-1),(l-1)]
        if b == 8:
            # Matrix C is smaller than Matrix B
            C[(k-1),(l-1)] = C[(k-1),(l-1)] + 1*D[(k-1),(l-1)]
#Output for Matrix B
B =
[[1, 1, 1, 1, 1, 1, 1, 1],
 [8, 8, 7, 7, 7, 6, 6, 7],
 [1, 8, 8, 7, 7, 6, 6, 6],
 [1, 1, 8, 7, 7, 6, 7, 7],
 [1, 1, 1, 1, 8, 7, 7, 6],
 [1, 1, 2, 1, 8, 7, 7, 6],
 [2, 2, 2, 1, 1, 8, 7, 7],
 [2, 1, 2, 1, 1, 8, 8, 7]]
Python starts counting at 0, so your code should work fine if you replace np.arange(1,9) with np.arange(9)
In [11]: np.arange(1,9)
Out[11]: array([1, 2, 3, 4, 5, 6, 7, 8])
In [12]: np.arange(9)
Out[12]: array([0, 1, 2, 3, 4, 5, 6, 7, 8])
As stated above: Python indices start at 0.
In order to iterate over some (say, matrix) indices, you should use the built-in function 'range' and not 'numpy.arange'. arange returns an ndarray, while range returns a lazy range object in Python 3.
The syntax 'B[i:j]' does not refer to the element at row i and column j of an array B. It rather means: all rows of B starting at row i and going up to (but not including) row j (if B has that many rows; otherwise it stops at the last row). The element at position i, j is in fact 'B[i,j]'.
The indexing syntax of python / numpy is quite powerful and performant.
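A minimal illustration of the difference, on a small standalone array rather than the data from the question:
import numpy as np

B = np.arange(16).reshape(4, 4)
B[1, 2]     # the single element at row 1, column 2  -> 6
B[1:3]      # rows 1 and 2, all columns (a 2x4 slice of B)
B[1:3, 2]   # column 2 of rows 1 and 2               -> array([ 6, 10])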
For one thing, as others have mentioned, NumPy uses 0-based indexing. But even once you fix that, this is not what you want to use:
for i in np.arange(9):
    for j in np.arange(9):
        B[i:j] = A[i:j]
The : indicates slicing, so i:j means "all rows from the i-th up to, but not including, the j-th". So your code copies each row over several times, which is not a very efficient way of doing things.
You probably wanted to use a comma:
for i in np.arange(8):      # Notice the range only goes up to 8
    for j in np.arange(8):  # ditto
        B[i, j] = A[i, j]
This will work, but it is also pretty wasteful performance-wise when using NumPy. A much faster approach is to simply ask for:
B[:] = A
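As a side note: B[:] = A writes A's values into the memory B already owns, whereas B = A would merely rebind the name B to the same object. A minimal sketch, assuming the A from the question is defined:
B = np.ones((8, 8), dtype=int)
B[:] = A        # copies A's values element-wise into the existing B
B[0, 0] = 99    # changes B only; A is left untouched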
Here, first, is what I think you are trying to do, with minimal corrections and comments added to your code:
import numpy as np

A = np.matrix([[8,8,8,7,7,6,8,2],
               [8,8,7,7,7,6,6,7],
               [1,8,8,7,7,6,6,6],
               [1,1,8,7,7,6,7,7],
               [1,1,1,1,8,7,7,6],
               [1,1,2,1,8,7,7,6],
               [2,2,2,1,1,8,7,7],
               [2,1,2,1,1,8,8,7]])

B = np.ones((8,8), dtype=int)
for i in np.arange(1,9):      # i = 1..8
    for j in np.arange(1,9):  # j = 1..8, but A[8,j] and A[j,8] do not exist;
        # if you insist on 1-based indices, numpy still expects 0..n-1,
        # so you'll have to subtract 1 from each index to use them
        B[i-1,j-1] = A[i-1,j-1]

C = np.zeros((6,6), dtype=int)
D = np.matrix([[1,1,2,3,3,2,2,1],
               [1,2,1,2,3,3,3,2],
               [1,1,2,1,1,2,2,3],
               [2,2,3,2,2,2,1,3],
               [1,2,2,3,2,3,1,3],
               [1,2,3,3,2,3,2,3],
               [1,2,2,3,2,3,1,2],
               [2,2,3,2,2,3,2,2]])

for k in np.arange(2,8):      # k = 2..7
    for l in np.arange(2,8):  # l = 2..7; matrix B has indices 0..7, so for the inner points you need 1..6
        b = B[k-1,l-1]        # so this is correct, it gives you the inner part of the matrix
        if b == 8:            # here b is a value in the matrix, not an index -- careful not to mix those up
            # Matrix C is smaller than Matrix B; its indices run 0..5 for k and l,
            # so to address C you'll need to subtract 2 from the k, l defined in the for loop
            C[k-2,l-2] = C[k-2,l-2] + 1*D[k-1,l-1]
print(C)
output:
[[2 0 0 0 0 0]
 [1 2 0 0 0 0]
 [0 3 0 0 0 0]
 [0 0 0 2 0 0]
 [0 0 0 2 0 0]
 [0 0 0 0 3 0]]
But there are more elegant ways to do it. In particular, look up slicing, numpy conditional array arithmetic, and possibly scipy's threshold. All of the below should be much faster than Python loops, too (numpy's loops are written in C).
B = np.copy(A)  # if you need a copy of A, this is the way

# one quick way to make a matrix that's 1 wherever A==8, and is smaller
from scipy import stats
B1 = stats.threshold(A, threshmin=8, threshmax=8, newval=0)/8  # ones where there is an 8
B1 = B1[1:-1,1:-1]
print(B1)

# another quick way to make a matrix that's 1 wherever A==8
B2 = np.zeros((8,8), dtype=int)
B2[A==8] = 1
B2 = B2[1:-1,1:-1]
print(B2)

# the following would obviously work with either B1 or B2 (which are the same)
print(np.multiply(B2, D[1:-1,1:-1]))
Output:
[[1 0 0 0 0 0]
 [1 1 0 0 0 0]
 [0 1 0 0 0 0]
 [0 0 0 1 0 0]
 [0 0 0 1 0 0]
 [0 0 0 0 1 0]]
[[1 0 0 0 0 0]
 [1 1 0 0 0 0]
 [0 1 0 0 0 0]
 [0 0 0 1 0 0]
 [0 0 0 1 0 0]
 [0 0 0 0 1 0]]
[[2 0 0 0 0 0]
 [1 2 0 0 0 0]
 [0 3 0 0 0 0]
 [0 0 0 2 0 0]
 [0 0 0 2 0 0]
 [0 0 0 0 3 0]]
A cleaner way, in my opinion, of writing the C loop is:
for k in range(1,7):
    for l in range(1,7):
        if B[k,l]==8:
            C[k-1, l-1] += D[k,l]
That inner block of B (and D) can be selected with slices, B[1:7, 1:7] or B[1:-1, 1:-1].
A and D are defined as np.matrix. Since we aren't doing matrix multiplication here (no dot products), that can create problems. For example, I was at first puzzled by:
In [27]: (B[1:-1,1:-1]==8)*D[1:-1,1:-1]
Out[27]:
matrix([[2, 1, 2, 3, 3, 3],
        [3, 3, 3, 4, 5, 5],
        [1, 2, 1, 1, 2, 2],
        [2, 2, 3, 2, 3, 1],
        [2, 2, 3, 2, 3, 1],
        [2, 3, 3, 2, 3, 2]])
What I expected (and matches the loop C) is:
In [28]: (B[1:-1,1:-1]==8)*D.A[1:-1,1:-1]
Out[28]:
array([[2, 0, 0, 0, 0, 0],
       [1, 2, 0, 0, 0, 0],
       [0, 3, 0, 0, 0, 0],
       [0, 0, 0, 2, 0, 0],
       [0, 0, 0, 2, 0, 0],
       [0, 0, 0, 0, 3, 0]])
B = A.copy() still leaves B as a matrix; B = A.A returns an np.ndarray (as does np.copy(A)).
D.A is the array equivalent of D. B[1:-1,1:-1]==8 is boolean, but when used in this multiplication context it effectively acts as 0s and 1s.
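A quick way to check the types involved (a small illustration, assuming the A from the question is defined):
type(A.copy())    # <class 'numpy.matrix'>
type(A.A)         # <class 'numpy.ndarray'>
type(np.copy(A))  # <class 'numpy.ndarray'>
(A[1:-1, 1:-1] == 8).astype(int)   # the boolean mask written out as 0s and 1s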
But if we want to stick with np.matrix, then I'd suggest using the element-by-element multiply function:
In [46]: np.multiply((A[1:-1,1:-1]==8), D[1:-1,1:-1])
Out[46]:
matrix([[2, 0, 0, 0, 0, 0],
        [1, 2, 0, 0, 0, 0],
        [0, 3, 0, 0, 0, 0],
        [0, 0, 0, 2, 0, 0],
        [0, 0, 0, 2, 0, 0],
        [0, 0, 0, 0, 3, 0]])
or just multiply the full matrices, and select the inner block afterwards:
In [47]: np.multiply((A==8), D)[1:-1, 1:-1]
Out[47]:
matrix([[2, 0, 0, 0, 0, 0],
        [1, 2, 0, 0, 0, 0],
        [0, 3, 0, 0, 0, 0],
        [0, 0, 0, 2, 0, 0],
        [0, 0, 0, 2, 0, 0],
        [0, 0, 0, 0, 3, 0]])
I am trying to optimise some code by removing for loops and using numpy arrays only as I am working with large data sets.
I would like to take a 1D numpy array, for example:
a = [1, 2, 3, 4, 5]
and produce a 2D numpy array in which each row is the previous one shifted along one place (with zeros filling in), so for the a above I wish to have a function which returns:
[[1 2 3 4 5]
 [0 1 2 3 4]
 [0 0 1 2 3]
 [0 0 0 1 2]
 [0 0 0 0 1]]
I have found examples which use stride tricks to do something similar, producing, for example:
[[1 2 3]
 [2 3 4]
 [3 4 5]]
However, I am trying to shift each of my columns in the other direction. Alternatively, one can view the problem as putting the first element of a on the first diagonal, the second element on the second diagonal, and so on. I would like to stress again that I want to avoid using a for, while or if loop entirely. Any help would be greatly appreciated.
Such a matrix is an example of a Toeplitz matrix. You could use scipy.linalg.toeplitz to create it:
In [32]: from scipy.linalg import toeplitz
In [33]: a = range(1,6)
In [34]: toeplitz(a, np.zeros_like(a)).T
Out[34]:
array([[1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4],
       [0, 0, 1, 2, 3],
       [0, 0, 0, 1, 2],
       [0, 0, 0, 0, 1]])
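For reference, scipy.linalg.toeplitz(c, r) builds the matrix whose first column is c and whose first row is r (the corner element is taken from c), so before the transpose the call above produces the lower-triangular version:
toeplitz(a, np.zeros_like(a))
array([[1, 0, 0, 0, 0],
       [2, 1, 0, 0, 0],
       [3, 2, 1, 0, 0],
       [4, 3, 2, 1, 0],
       [5, 4, 3, 2, 1]])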
Inspired by @EelcoHoogendoorn's answer, here's a variation that doesn't use as much memory as scipy.linalg.toeplitz:
In [47]: from numpy.lib.stride_tricks import as_strided
In [48]: a
Out[48]: array([1, 2, 3, 4, 5])
In [49]: t = as_strided(np.r_[a[::-1], np.zeros_like(a)], shape=(a.size,a.size), strides=(a.itemsize, a.itemsize))[:,::-1]
In [50]: t
Out[50]:
array([[1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4],
       [0, 0, 1, 2, 3],
       [0, 0, 0, 1, 2],
       [0, 0, 0, 0, 1]])
The result should be treated as a "read only" array. Otherwise, you'll be in for some surprises when you change an element. For example:
In [51]: t[0,2] = 99
In [52]: t
Out[52]:
array([[ 1,  2, 99,  4,  5],
       [ 0,  1,  2, 99,  4],
       [ 0,  0,  1,  2, 99],
       [ 0,  0,  0,  1,  2],
       [ 0,  0,  0,  0,  1]])
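The reason for the surprise is that every row of the strided view is just a window into the same small buffer, so the windows overlap. An explanatory sketch (not part of the original code):
buf = np.r_[a[::-1], np.zeros_like(a)]   # [5 4 3 2 1 0 0 0 0 0] -- the only real storage
# before the trailing [:, ::-1] flip, row i of the view is simply buf[i:i+5];
# assigning t[0, 2] = 99 stores 99 into buf[2], and that same memory cell is read
# back by rows 1 and 2, which is why the 99 shows up along a whole diagonal.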
Here is the stride-tricks based solution. Not nearly as elegant as the toeplitz solution already posted, but should memory consumption or performance be a concern, it is to be preferred. As demonstrated below, this also makes it easy to subsequently manipulate the entries of the matrix in a consistent manner.
import numpy as np

a = np.arange(5) + 1

def toeplitz_view(a):
    b = np.concatenate((np.zeros_like(a), a))
    i = a.itemsize
    v = np.lib.stride_tricks.as_strided(b,
                                        shape=(len(b),)*2,
                                        strides=(-i, i))
    # return a view on the 'original' data as well, for manipulation
    return v[:len(a), len(a):], b[len(a):]
v, a = toeplitz_view(a)
print(v)

a[0] = 10
v[2,1] = -1
print(v)
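For reference, with the example above the two prints should show the clean Toeplitz view first and then the view after the two assignments:
[[1 2 3 4 5]
 [0 1 2 3 4]
 [0 0 1 2 3]
 [0 0 0 1 2]
 [0 0 0 0 1]]
[[10  2  3  4  5]
 [-1 10  2  3  4]
 [ 0 -1 10  2  3]
 [ 0  0 -1 10  2]
 [ 0  0  0 -1 10]]
Setting a[0] = 10 changes the whole main diagonal and v[2,1] = -1 fills the whole first subdiagonal, because each of those positions is backed by a single element of the shared underlying buffer.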