Vectorize "is in" - python

I am trying to build a sample of m vectors (with integer entries) together with m evaluations. A vector x of shape (n,1) evaluates to y=1 if one of its entries is the number 2; otherwise it evaluates to y=0.
In order to deal with many such vectors and evaluations, the sample vectors are stored in an (n,m)-shaped ndarray and the evaluations are stored in a (1,m)-shaped ndarray. See the code:
import numpy as np

n = 10    # number of entries in each sample vector
m = 1000  # number of samples

X = np.random.randint(-10, 10, (n, m))
Y = []
for i in range(m):
    if 2 in X[:, i]:
        Y.append(1)
    else:
        Y.append(0)
Y = np.array(Y).reshape((1, -1))
assert Y.shape == (1, m)
How can I vectorize the computation of Y? I tried to replace the initialization/computation of X and Y by the following:
X = np.random.randint(-10,10,(n,m))
Y = np.apply_along_axis(func1d=lambda x: 1 if 2 in x else 0, axis=0, arr=X)
A few test runs suggested that this is usually even a bit slower than my first approach. (Actually, this answer starts by saying that numpy.apply_along_axis is not for speed. I am also not sure how well a lambda performs in this context.)
Is there a way to vectorize the computation of Y, i.e. a way to assign a value 1 or 0 to each column, depending on whether that column contains the element 2?

When you use NumPy arrays together with logical operations, NumPy performs a lot of optimisation under the hood without the user having to vectorise tasks manually. The following code reaches the same solution:
# mark True wherever an element == 2 in the array X,
# then, for each column (axis=0), flag the column True if any element matched
Y = (X == 2).any(axis = 0).reshape(1, -1)
print(Y.shape)
Using timeit to assess execution times:
loop method: 3240 microseconds per run
numpy method: 6.57 microseconds per run
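The exact numbers will vary by machine; here is a minimal, self-contained sketch of how such timings could be reproduced with timeit:

import timeit
import numpy as np

n, m = 10, 1000
X = np.random.randint(-10, 10, (n, m))

def loop_method():
    Y = []
    for i in range(m):
        Y.append(1 if 2 in X[:, i] else 0)
    return np.array(Y).reshape((1, -1))

def numpy_method():
    return (X == 2).any(axis=0).reshape(1, -1)

# average seconds per run for each method
print(timeit.timeit(loop_method, number=100) / 100)
print(timeit.timeit(numpy_method, number=1000) / 1000)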
If you're interested, you could see whether other vectorisation helpers, such as np.vectorize, improve the time further, though I'm quite sure the underlying NumPy optimisations already perform their own vectorisation at CPU instruction level (SIMD) by default.
The bottom line is: when using NumPy, always try to find a solution using logical arrays and NumPy functions/methods, as they are already heavily optimised within the compiled binaries; any Python-level functions used to manipulate, access, or iterate over the data slow execution dramatically.
By the way, the most common way to speed up a Python for loop that builds a list of outputs, as you've done, is a list comprehension:
Y = np.array([2 in X[:, i] for i in range(m)]).reshape((1, -1))
which executes in 3070 microseconds per loop.

Coefficient matrix and loops challenge

I'm trying to solve the equation for T(p,q) as shown more clearly in the attached image.
Where:
p = 0.60
q = 0.45
M - coefficient matrix with 3 rows and 6 columns
I created five functions, each returning an entry of the coefficient matrix, so that I could call them later inside the while loop. However, the loop doesn't cycle through the various values of i.
How can I get the loop to work, or is there another/better way to solve this equation?
(FYI this is approx my third day ever working with Python and coding)
import numpy as np

def M1(i):
    M = np.array([[1,3,0,4,0,0],[3,3,1,4,0,0],[4,4,0,3,1,0]])
    return M[i-1,0]

def M2(i):
    M = np.array([[1,3,0,4,0,0],[3,3,1,4,0,0],[4,4,0,3,1,0]])
    return M[i-1,1]

def M3(i):
    M = np.array([[1,3,0,4,0,0],[3,3,1,4,0,0],[4,4,0,3,1,0]])
    return M[i-1,2]

def M4(i):
    M = np.array([[1,3,0,4,0,0],[3,3,1,4,0,0],[4,4,0,3,1,0]])
    return M[i-1,3]

def M5(i):
    M = np.array([[1,3,0,4,0,0],[3,3,1,4,0,0],[4,4,0,3,1,0]])
    return M[i-1,4]

def T(p,q):
    sum_i = 0
    i = 1
    while i <= 5:
        sum_i = sum_i + ((M1(i)*p**M2(i))*((1-p)**M3(i))*(q**M4(i))*((1-q)**M5(i)))
        i = i + 1
        return sum_i

print(T(0.6,0.45))
"""I printed the below equation (using a single value for i) to test if the above loop is working and since I get the same answer as the loop, I can see that the loop is not cycling through the various values of i as expected"""
i=1
p=0.6
q=0.45
print(((M1(i)*p**M2(i))*((1-p)**M3(i))*(q**M4(i))*((1-q)**M5(i))))
Your return is placed inside the while loop, so the function returns after the first iteration. You need to move it outside the loop:
    while i <= 5:
        sum_i = sum_i + ((M1(i)*p**M2(i))*((1-p)**M3(i))*(q**M4(i))*((1-q)**M5(i)))
        i = i + 1
    return sum_i
The real power of numpy comes from looking at computations like these and trying to understand what is repeated and what structure they have. Once you find operations that are similar or parallel, try to set up your numpy calls so that they are performed element-wise in one call.
For instance, inside the typical term of the sum, there are four things being explicitly raised to a power (a fifth if you count M(i,1)^1). You can perform all four of these exponentiations in parallel with one function call if you arrange your arrays smartly:
M = np.array([[1,3,0,4,0,0],[3,3,1,4,0,0],[4,4,0,3,1,0]])
ps_and_qs = np.array([[p, (1-p), q, (1-q)]])
a = np.power(ps_and_qs, M[:,1:5])
Now a will be populated with a 3 x 4 matrix with all of your exponentiations.
Now the next step is how to reduce these. There are several reduction functions that are built into numpy that are efficiently implemented with vectorized loops where possible. They can speed up your code quite a bit. In particular, there is both a product reduction as well as a sum reduction. With your equation, we need to first multiply across the rows to get one number per row and then sum across the remaining column like this:
b = M[:,0] * np.prod(a, axis=1)
c = np.sum(b, axis=0)
c should now be a scalar equal to T evaluated at (p,q). That is a lot to take in for a third day, but something to consider if you continue to use numpy for numerical analysis on bigger projects that need better performance.
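Putting the pieces together, a minimal sketch of a fully vectorised T(p, q) (assuming the sum runs over the three rows of M, as in the reduction above) might look like this:

import numpy as np

M = np.array([[1, 3, 0, 4, 0, 0],
              [3, 3, 1, 4, 0, 0],
              [4, 4, 0, 3, 1, 0]])

def T(p, q):
    # columns 1..4 hold the four exponents; column 0 holds the coefficient
    ps_and_qs = np.array([[p, 1 - p, q, 1 - q]])
    a = np.power(ps_and_qs, M[:, 1:5])      # shape (3, 4): all exponentiations at once
    terms = M[:, 0] * np.prod(a, axis=1)    # one term per row of M
    return terms.sum()

print(T(0.6, 0.45))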

Making the nested loop run faster, e.g. by vectorization in Python

I have a 2 * N integer array ids representing intervals, where N is about a million. It looks like this
0 2 1 ...
3 4 3 ...
The ints in the arrays can be 0, 1, ... , M-1, where M <= 2N - 1. (Detail: if M = 2N, then the ints span all the 2N integers; if M < 2N, then there are some integers that have the same values.)
I need to calculate a kind of inverse map from ids. What I call the "inverse map" treats the columns of ids as intervals and records, for each inner point, the indices of the intervals that contain it.
Intuition
Intuitively,
0 2 1
3 4 3
can be seen as
0 -> 0, 1, 2
1 -> 2, 3
2 -> 1, 2
where the right-hand-side endpoints are excluded for my problem. The "inverse" map would be
0 -> 0
1 -> 0, 2
2 -> 0, 1, 2
3 -> 1
Code
I have a piece of Python code that attempts to calculate the inverse map into a dictionary inv:
for i in range(ids.shape[1]):
    for j in range(ids[0][i], ids[1][i]):
        inv[j].append(i)
where each inv[j] is an array-like container initialized as empty before the nested loop. Currently I use Python's built-in array module to initialize it:
for i in range(M): inv[i] = array.array('I')
Question
The nested loop above performs terribly. In my problem setting (image processing), the first loop has about a million iterations and the second about 3000. Not only does it take a lot of memory (because inv is huge), it is also slow. I would like to focus on speed in this question. How can I accelerate the nested loop above, e.g. with vectorization?
You could try the option below, in which your outer loop is hidden away inside numpy's C-language implementation of apply_along_axis(). I'm not sure about the performance benefit; only a test at a decent scale can tell (especially as there is some initial overhead involved in converting lists to numpy arrays):
import numpy as np
import array

ids = [[0,2,1],[3,4,3]]
ids_arr = np.array(ids)  # Convert to numpy array. Expensive operation?

range_index = 0  # Initialize. To be bumped up by each invocation of my_func()
inv = {}
for i in range(np.max(ids_arr)):
    inv[i] = array.array('I')

def my_func(my_slice):
    global range_index
    for i in range(my_slice[0], my_slice[1]):
        inv[i].append(range_index)
    range_index += 1

np.apply_along_axis(my_func, 0, ids_arr)
print(inv)
Output:
{0: array('I', [0]), 1: array('I', [0, 2]), 2: array('I', [0, 1, 2]),
3: array('I', [1])}
Edit:
I feel that using a dictionary might not be a good idea here. I suspect that, in this particular context, dictionary indexing might actually be slower than numpy array indexing. Use the lines below to create and initialize inv as a numpy array of Python arrays; the rest of the code can remain as is:
inv_len = np.max(ids_arr)
inv = np.empty(shape=(inv_len,), dtype=array.array)
for i in range(inv_len):
    inv[i] = array.array('I')
(Note: This assumes that your application isn't doing dict-specific stuff on inv, such as inv.items() or inv.keys(). If that's the case, however, you might need an extra step to convert the numpy array into a dict)
To avoid the for loop entirely, here is a pandas sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": np.random.randint(0, 100, 100000),
    "B": np.random.randint(0, 100, 100000)
})
df.groupby("B")["A"].agg(list)
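This doesn't expand the intervals for you, but it illustrates the groupby idiom: if you first build a flattened table of (point, interval index) pairs (hypothetical column names below), the same one-liner produces the inverse map:

import pandas as pd

# hypothetical flattened table: one row per (point, interval index) pair,
# built from the example intervals 0 -> [0,3), 1 -> [2,4), 2 -> [1,3)
flat = pd.DataFrame({"point":    [0, 1, 2, 2, 3, 1, 2],
                     "interval": [0, 0, 0, 1, 1, 2, 2]})
inv = flat.groupby("point")["interval"].agg(list)
print(inv)  # 0 -> [0], 1 -> [0, 2], 2 -> [0, 1, 2], 3 -> [1]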
Since N is large, I've come up with what seems like a practical approach; let me know if there are any flaws.
For the ith interval [x,y], store it as [x,y,i]. Sort the intervals based on their start and end points. This should take O(NlogN) time.
Create a frequency array freq[2*N+1]. For each interval, update the frequencies using a range update in O(1) per update (see the sketch below). Accumulating the frequencies then takes O(N).
Determine a threshold based on your data. According to that value, classify each element as either sparse or frequent. For sparse elements, do nothing. For frequent elements only, store the intervals in which they occur.
During lookup, if the element is frequent, you can directly access the pre-computed lists. If the element is sparse, you can search the intervals in O(logN) time, since the intervals are sorted and their indices were appended in step 1.
This seems like a practical approach to me; the rest depends on your usage, e.g. the amortized time complexity you need per query.
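A minimal sketch of the range-update idea from step 2, using a difference array and a cumulative sum (array names are my own):

import numpy as np

ids = np.array([[0, 2, 1],
                [3, 4, 3]])      # row 0: starts, row 1: ends (exclusive)

M = ids[1].max()                 # points are 0 .. M-1
diff = np.zeros(M + 1, dtype=np.int64)

# O(1) range update per interval: +1 at the start, -1 just past the end
np.add.at(diff, ids[0], 1)
np.add.at(diff, ids[1], -1)

freq = np.cumsum(diff)[:M]       # freq[j] = number of intervals covering point j
print(freq)                      # [1 2 3 1] for the example above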

numpy's linalg norm axis does not output the same result

I am trying to maximize computation performance using numpy (i.e. remove the Python for loop). Here is my initial implementation:
import numpy as np

np.random.seed(128)
l = []
for i in range(1000):
    v = np.random.randn(7)
    l.append(np.linalg.norm(v))
l = np.array(l)
l
The above code simply takes the Euclidean norm of a vector of size 7 and appends it to a list. This is repeated 1000 times. To remove the for loop, I construct a matrix of size (1000, 7) and then take the norm of the matrix with axis=1, as shown below.
np.random.seed(128)
v = np.random.randn(1000, 7)
v = np.linalg.norm(v, axis=1)
However, when I check l against v for equality with np.all(l == v), it outputs False. I don't understand why numpy behaves this way. I checked the dtypes of v and l and both are np.float64.
You can read the following issue, which says:
numpy in general does not guarantee that semantically equivalent
operations like this will produce identical results. Even operations
like sum can produce different results depending on memory layout (and
this is on purpose -- making them identical all the time would require
either big slowdowns or intentionally reducing precision).
So this is where the difference lies: you should not expect identical results, only results that agree up to a tolerance. The simplest way to compare them is therefore the one suggested by Divakar:
np.allclose(l,v)
another possible option is:
np.array_equal(np.round(l,12),np.round(v,12))
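If you want to see how small the discrepancy actually is (assuming l and v are built as in the question):

print(np.abs(l - v).max())   # largest absolute difference, on the order of machine epsilon
print(np.allclose(l, v))     # True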

Quickly find indices that have values larger than a threshold in Numpy/PyTorch

Task
Given a numpy or pytorch matrix, find the indices of cells that have values that are larger than a given threshold.
My implementation
# abs_cosine is the matrix
# sim_vec is the wanted output
sim_vec = []
for m in range(abs_cosine.shape[0]):
    for n in range(abs_cosine.shape[1]):
        # exclude diagonal cells
        if m != n and abs_cosine[m][n] >= threshold:
            sim_vec.append((m, n))
Concerns
Speed. All other computations are built on PyTorch; using numpy is already a compromise because it moves computation from the GPU to the CPU. Pure Python for loops make the whole process even worse (already 5 times slower for a small data set). Can the whole computation be moved to numpy (or PyTorch) without invoking any for loops?
An improvement I can think of (but got stuck...)
bool_cosine = abs_cosine > threshold
which returns a boolean matrix of True and False. But I cannot find a way to quickly retrieve the indices of the True cells.
The following is for PyTorch (fully on GPU)
# abs_cosine should be a Tensor of shape (m, m)
mask = torch.ones(abs_cosine.size()[0])
mask = 1 - mask.diag()
sim_vec = torch.nonzero((abs_cosine >= threshold)*mask)
# sim_vec is a tensor of shape (?, 2) where the first column is the row index and second is the column index
The following works in numpy
mask = 1 - np.diag(np.ones(abs_cosine.shape[0]))
sim_vec = np.nonzero((abs_cosine >= 0.2)*mask)
# sim_vec is a 2-array tuple where the first array is the row index and the second array is column index
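If you prefer the index pairs in a single (?, 2) array, as the PyTorch version returns, np.argwhere can be used instead (a small sketch, assuming abs_cosine is square and threshold is defined):

keep = (abs_cosine >= threshold) & ~np.eye(abs_cosine.shape[0], dtype=bool)
sim_vec = np.argwhere(keep)  # shape (?, 2): first column is the row index, second the column index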
This is about twice as fast as np.where:
import numba as nb
import numpy as np

@nb.njit(fastmath=True)
def get_threshold(abs_cosine, threshold):
    idx = 0
    sim_vec = np.empty((abs_cosine.shape[0]*abs_cosine.shape[1], 2), dtype=np.uint32)
    for m in range(abs_cosine.shape[0]):
        for n in range(abs_cosine.shape[1]):
            # exclude diagonal cells
            if m != n and abs_cosine[m, n] >= threshold:
                sim_vec[idx, 0] = m
                sim_vec[idx, 1] = n
                idx += 1
    return sim_vec[0:idx, :]
The first call takes about 0.2 s longer (compilation overhead). If the array is on the GPU, there may also be a way to do the whole computation on the GPU.
Nevertheless, I am not really satisfied with the performance, since a simple boolean operation is about 5 times faster than the solution shown above and 10 times faster than np.where. If the order of the indices doesn't matter, this problem can also be parallelized.

Speeding up Numpy Masking

I'm still an amateur when it comes to thinking about how to optimize. I have this section of code that takes a list of found peaks and finds where these peaks, +/- some value, are located in a multidimensional array. It then adds 1 at those indices of a zeros array. The code works well, but it takes a long time to execute. For instance, it takes close to 45 min to run when ind has 270 values and refVals has a shape of (3050, 3130, 80). I understand that it's a lot of data to churn through, but is there a more efficient way of going about this?
maskData = np.zeros_like(refVals).astype(np.int16)
for peak in ind:
    tmpArr = np.ma.masked_outside(refVals, x[peak]-2, x[peak]+2).astype(np.int16)
    maskData[tmpArr.mask == False] += 1
    tmpArr = None
maskData = np.sum(maskData, axis=2)
Approach #1 : Memory permitting, here's a vectorized approach using broadcasting -
# Create -2/+2 limits using ind
r = x[ind[:,None]] + [-2,2]
# Use limits to get in-range matches and sum over the peak and last dims
mask = (refVals >= r[:,None,None,None,0]) & (refVals <= r[:,None,None,None,1])
out = mask.sum(axis=(0,3))
Approach #2 : If we run out of memory with the previous one, we could use a loop with NumPy boolean arrays, which could be more efficient than masked arrays. Also, we perform one more level of sum-reduction inside the loop, so that we drag less data with us across iterations. Thus, the alternative implementation would look something like this -
out = np.zeros(refVals.shape[:2]).astype(np.int16)
x_ind = x[ind]
for i in x_ind:
    out += ((refVals >= i-2) & (refVals <= i+2)).sum(-1)
Approach #3 : Alternatively, we could replace that limit based comparison with np.isclose in approach #2. Thus, the only step inside the loop would become -
out += np.isclose(refVals,i,atol=2).sum(-1)
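For completeness, a minimal sketch of Approach #3 written out as a full loop (same setup as Approach #2, with the limit comparison replaced by np.isclose):

out = np.zeros(refVals.shape[:2], dtype=np.int16)
for i in x[ind]:
    # np.isclose with atol=2 flags entries along the last axis that lie
    # (approximately) within +/-2 of the peak value i
    out += np.isclose(refVals, i, atol=2).sum(-1)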
