I have two large matrices (40000×4096) and I would like to compare and match each row of the first matrix against all of the rows of the second matrix, so the output has size (40000×40000). However, since I need to do this several thousand times, it is wildly time consuming: about 26,000 seconds per iteration, and I need roughly 5,000 iterations...
I would be glad if you could give me some smart suggestions. Thank you.
P.S. This is what I have done so far, for just one iteration (1 of 5,000):
def matcher(Antigens, Antibodies, ind):
    temp = np.zeros((Antibodies.shape[0], Antibodies.shape[1]))
    output = np.zeros((Antibodies.shape[0], 1))
    for i in range(len(Antibodies)):
        temp[i] = np.int32(np.equal(Antigens[ind], Antibodies[i]))
        output[i] = np.sum(temp[i])
    return output

output = [matcher(Antigens, Antibodies, ind) for ind in range(len(Antigens))]
Okay, I think I understand what your goal is:
Count the number of row matches (antigen vs. antibody matrix). Each entry of the resulting vector (40,000 x 1) represents the count of exact matches between one antigen row and all of the antibody rows (so values range from 0 to 40,000).
I made some fake data:
import numpy as np
import numba as nb
num_mat = 5 # number of matrices
num_row = 10_000 # number of rows per matrix
num_elm = 4_096 # number of elements per row
dim = (num_mat,num_row,num_elm)
Antigens = np.random.randint(0,256,dim,dtype=np.uint8)
Antibodies = np.random.randint(0,256,dim,dtype=np.uint8)
There's one important point here: I reduced the matrices to the smallest datatype that can represent the data in order to reduce their memory footprint. I'm not sure what your data looks like, but hopefully you can do this as well.
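If your real matrices arrive in a wider dtype, a cast along these lines (assuming the values genuinely fit in 8 bits; pick a wider integer type otherwise) gets the same footprint reduction:

# Only safe if every value fits in 0..255 -- check before downcasting.
assert Antigens.min() >= 0 and Antigens.max() <= 255
Antigens = Antigens.astype(np.uint8)
Antibodies = Antibodies.astype(np.uint8)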
Also, the following code assumes your dimensions look like the fake data:
(number of matrices, rows, elements)
@nb.njit
def match_arr(arr1, arr2):
    # compare two rows element by element, bail out at the first mismatch
    for i in range(arr1.shape[0]):  # 4096 vs 4096
        if arr1[i] != arr2[i]:
            return False
    return True

@nb.njit
def match_mat_sum(ag, ab):
    out = np.zeros((ag.shape[0]))  # 40000
    for i in range(ag.shape[0]):
        tmp = 0
        for j in range(ab.shape[0]):
            tmp += match_arr(ag[i], ab[j])
        out[i] = tmp
    return out

@nb.njit(parallel=True)
def match_sets(Antigens, Antibodies):
    # one output row per (antigen matrix, antibody matrix) pair
    out = np.empty((Antigens.shape[0] * Antibodies.shape[0], Antigens.shape[1]))  # (num antigen mats * num antibody mats) x num rows
    # parallelize per antigen matrix; you may want to move this as suits your data
    for i in nb.prange(Antigens.shape[0]):
        for j in range(Antibodies.shape[0]):
            # each (i, j) pair writes to its own row, avoiding race conditions
            out[i * Antibodies.shape[0] + j] = match_mat_sum(Antigens[i], Antibodies[j])
    return out
I lean on Numba heavily. One of the key optimizations is not to check the equivalence of entire rows with np.equal(), but to write a custom function, match_arr(), that breaks as soon as it finds a mismatched element. Hopefully, this lets us skip a ton of comparisons.
Time comparison:
%timeit match_arr(arr1, arr2)
314 ns ± 0.361 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.equal(arr1, arr2)
1.07 µs ± 5.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
match_mat_sum
This function simply calculates the middle step (the 40,000 x 1 vector) that represents the sum of exact matches between two matrices. This step reduces two matrices like: (m x n), (o x n) -> (m)
match_sets()
The last function parallelizes this operation with explicit parallel loops through nb.prange. You might want to move the prange to a different loop depending on what your data looks like (for example, if you have one antigen matrix but 5,000 antibody matrices, you should move prange to the inner loop or you won't be leveraging parallelization). The fake data assumes several antigen and several antibody matrices.
Another important thing to note here is the indexing into the out array. In order to avoid race conditions, each parallel iteration needs to write to its own slice. Again, depending on your data, you'll need to compute the proper place to put each result.
On a Ryzen 1600 (6-core) with 16 gigs of RAM, using this fake data, I generated a result in 10.2 seconds.
Your data is about 3,200 times larger, so assuming linear scaling the full set would take approximately 9 hours, provided you have enough memory.
You could write some kind of batch loader as well, rather than loading 5000 giant matrices directly into memory.
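For instance, a generator along these lines would let you stream the antibody matrices from disk a few at a time instead of holding all of them in memory. The .npy file layout and the np.load calls are assumptions for illustration, not something from your setup:

def antibody_batches(paths, batch_size=4):
    # Yield stacks of shape (batch_size, 40000, 4096), loading lazily from disk.
    batch = []
    for path in paths:
        batch.append(np.load(path))
        if len(batch) == batch_size:
            yield np.stack(batch)
            batch = []
    if batch:
        yield np.stack(batch)

# for antibody_stack in antibody_batches(file_list):
#     partial = match_sets(Antigens, antibody_stack)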
This problem can be tackled with a mixture of numpy broadcasting and the numexpr module, which performs operations quickly while minimizing the storage of intermediate values:
import numexpr as ne
# expand arrays dimensions to support broadcasting when doing comparison
Antigens, Antibodies = Antigens[None, :, :], Antibodies[:, None, :]
output = ne.evaluate('sum((Antigens==Antibodies)*1, axis=2)')
# *1 is a hack because numexpr does not currently support sum on bool
This may be faster than your current solution, but for such large arrays it will take a while.
The performance of numexpr for this operation is a bit lackluster, but you can at least use broadcasting inside the loop:
output = np.zeros((Antibodies.shape[0],)*2, dtype=np.int32)
for row, out_row in zip(Antibodies, output):
    (row[None,:]==Antigens).sum(1, out=out_row)
I have a simple piece of code that:
Reads a trajectory file that can be seen as a list of 2D arrays (lists of positions in space), stored in Y.
It then computes, for each pair (scipy.pdist style), the RMSD.
My code works fine:
trajectory = read("test.lammpstrj", index="::")
m = len(trajectory)
# .get_positions() returns a 2D numpy array
Y = np.array([snapshot.get_positions() for snapshot in trajectory])
b = [np.sqrt(((((Y[i]- Y[j])**2))*3).mean()) for i in range(m) for j in range(i + 1, m)]
This code executes in 0.86 seconds using Python 3.10; the same kind of code in Julia 1.8 executes in 0.46 seconds.
I plan to have a much larger trajectory (~200,000 elements). Would it be possible to get a speed-up using Python, or should I stick to Julia?
You've mentioned that snapshot.get_positions() returns some 2D array, say of shape (p, q). So I expect that Y is a 3D array with some shape (m, p, q), where m is the number of snapshots in the trajectory. You also expect m to scale rather high.
Let's see a basic way to speed up the distance calculation, on the setting m=1000:
import numpy as np
# dummy inputs
m = 1000
p, q = 4, 5
Y = np.random.randn(m, p, q)
# your current method
def foo():
    return [np.sqrt(((((Y[i]- Y[j])**2))*3).mean()) for i in range(m) for j in range(i + 1, m)]

# vectorized approach -> compute the upper triangle of the pairwise distance matrix
def bar():
    u, v = np.triu_indices(Y.shape[0], 1)
    return np.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))
# Check for correctness
out_1 = foo()
out_2 = bar()
print(np.allclose(out_1, out_2))
# True
If we test the time required:
%timeit -n 10 -r 3 foo()
# 3.16 s ± 50.3 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
The first method is really slow; it takes over 3 seconds for this calculation. Let's check the second method:
%timeit -n 10 -r 3 bar()
# 97.5 ms ± 405 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
So we have a ~30x speedup here, which would make your large calculation in python much more feasible than using the original code. Feel free to test out with other sizes of Y to see how it scales compared to the original.
JIT
In addition, you can also try out a JIT compiler, mainly jax or numba. It is fairly simple to port the function bar to jax.numpy, for example:
import jax
import jax.numpy as jnp

@jax.jit
def jit_bar(Y):
    u, v = jnp.triu_indices(Y.shape[0], 1)
    return jnp.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))
# check for correctness
print(np.allclose(bar(), jit_bar(Y)))
# True
If we test the time of the jitted jnp op:
%timeit -n 10 -r 3 jit_bar(Y)
# 10.6 ms ± 678 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
So compared to the original, we can reach up to a ~300x speedup.
Note that not every operation can be converted to jax/jit so easily (this particular problem is conveniently suitable), so the general advice is to simply avoid python loops and use numpy's broadcasting/vectorization capabilities, like in bar().
Stick to Julia.
If you already made it in a language which runs faster, why are you trying to use python in the first place?
Your question is about speeding up Python, relative to Julia, so I'd like to offer some Julia code for comparison.
Since your data is most naturally expressed as a list of 4x5 arrays, I suggest expressing it as a vector of SMatrixes:
sumdiff2(A, B) = sum((A[i] - B[i])^2 for i in eachindex(A, B))

function dists(Y)
    M = length(Y)
    V = Vector{float(eltype(eltype(Y)))}(undef, sum(1:M-1))
    Threads.@threads for i in eachindex(Y)
        ii = sum(M-i+1:M-1)  # don't worry about this sum
        for j in i+1:lastindex(Y)
            ind = ii + (j-i)
            V[ind] = sqrt(3 * sumdiff2(Y[i], Y[j])/length(Y[i]))
        end
    end
    return V
end

using Random: randn
using StaticArrays: SMatrix
Ys = [randn(SMatrix{4,5,Float64}) for _ in 1:1000];
Benchmarks:
# single-threaded
julia> using BenchmarkTools
julia> @btime dists($Ys);
  6.561 ms (2 allocations: 3.81 MiB)

# multi-threaded with 6 cores
julia> @btime dists($Ys);
  1.606 ms (75 allocations: 3.82 MiB)
I was not able to install jax on my computer, but when comparing with @Mercury's numpy code I got
foo: 5.5 s
bar: 179 ms
i.e. approximately a 3400x speedup over foo.
It is possible to write this as a one-liner at a ~2-3x performance cost.
While Python tends to be slower than Julia for many tasks, it is possible to write numerical code in Python that is as fast as Julia, using Numba and plain loops. Indeed, Numba is based on llvmlite, which is basically a JIT compiler built on the LLVM toolchain. The standard implementation of Julia also uses a JIT and the LLVM toolchain, so the two should behave pretty closely, apart from language overheads that become negligible once the computation is performed in parallel (because the resulting computation will be memory-bound on nearly all modern platforms).
This computation can be parallelized in both Julia and Python (still using Numba). While writing the sequential computation is quite straightforward, writing the parallel one is a bit more complex. Indeed, computing only the upper-triangular values can result in an imbalanced workload and thus a sub-optimal execution time. An efficient strategy is to compute, in each iteration, a pair of lines: one from the top of the upper-triangular part and one from the bottom. The top line (index i) contains m-i-1 items while the bottom one (index m-i-1) contains i items, so every iteration computes m-1 items in total, independent of the iteration number. This results in much better load balancing. When m is odd, the middle line needs to be computed separately.
Here is the final implementation:
import numba as nb
import numpy as np

@nb.njit(inline='always', fastmath=True)
def compute_line(tmp, res, i, m, n):
    # line i of the upper triangle starts at this offset in the flat output
    offset = (i * (2 * m - i - 1)) // 2
    factor = 3.0 / n
    for j in range(i + 1, m):
        s = 0.0
        for k in range(n):
            s += (tmp[i, k] - tmp[j, k]) ** 2
        res[offset] = np.sqrt(s * factor)
        offset += 1
    return res

@nb.njit('()', parallel=True, fastmath=True)
def fastest():
    m, n = Y.shape[0], Y.shape[1] * Y.shape[2]
    res = np.empty(m*(m-1)//2)
    tmp = Y.reshape(m, n)
    # each iteration computes one line from the top and one from the bottom
    for i in nb.prange(m//2):
        compute_line(tmp, res, i, m, n)
        compute_line(tmp, res, m-i-1, m, n)
    if m % 2 == 1:
        compute_line(tmp, res, m//2, m, n)  # the remaining middle line
    return res
# [...] same as others
%timeit -n 100 fastest()
Results
Here are performance results on my machine (with a i5-9600KF having 6 cores):
foo (seq, Python, Mercury): 4910.7 ms
bar (seq, Python, Mercury): 134.2 ms
jit_bar (seq, Python, Mercury): ???
dists (seq, Julia, DNF) 6.9 ms
dists (par, Julia, DNF) 2.2 ms
fastest (par, Python, me): 1.5 ms <-----
(Jax does not work on my machine so I cannot test it yet)
This implementation is the fastest one so far and succeeds in beating the best Julia code.
Optimal implementation
Note that for large arrays like (200_000, 4, 5), all the implementations provided so far are inefficient since they are not cache friendly. Indeed, the input array will take 32 MiB and will not fit in the cache of most modern processors (and even if it could, one needs to account for the space needed for the output and the fact that caches are not perfect). This can be fixed using tiling, at the expense of even more complex code. I think such an implementation should be optimal if you use Z-order curves.
I'm working on a machine learning problem, and I need to construct an array of dimensions m x n x p.
For the sake of the question, let's say we have m locations, n time windows, and p features for each observation at each time window.
Our data store can only return one location's worth of data at a time; in other words, we get back an array of size 1 x n x p. But for the prediction step, we want everything consolidated into a single array of size m x n x p.
With relatively small p, the naïve approach is fast enough that we don't care. In some cases, however, we have a fairly large p, and that has made this fairly slow.
For example:
In [44]: A = np.arange(4800000000, dtype=np.float32).reshape(20,30,8000000)
In [45]: x = np.random.randn(30, 8000000)
In [46]: %timeit A[0,:] = x
39.5 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
While 40 ms is not a lot on its face, doing that for an m of 100 results in 4 seconds for this one fairly straightforward operation.
Is there a faster way to do this array construction?
How can I efficiently create a numpy tensor that collects many matrices and stacks them, always adding new matrices? This is useful for managing batches of images, for example, where each image is a 3D matrix. Stacking N 3D images (with one dimension for the RGB planes) together creates a 4D matrix.
Here is the base form of simply appending two matrices along a new dimension, creating a final matrix of higher dimension than the original two. And here is some information on the np.newaxis functionality.
There are two good ways I know of, which use some of the approaches from the links in the question as building blocks.
If you create an array, e.g. with 3 dimensions, and would like to keep appending such single results to one another on the fly, thus creating a 4D tensor, then you need a small amount of initial setup for the first array before you can apply the answers posted in the question.
One approach is to store all the single matrices in a list (appending as you go) and then simply combine them using np.array:
def list_version(N):
    outputs = []
    for i in range(N):
        one_matrix = function_creating_single_matrix()
        outputs.append(one_matrix)
    return np.array(outputs)
A second approach extends the very first matrix (e.g. one that is 3D) to become 4D using np.newaxis. Subsequent 3D matrices can then be np.concatenated one by one. They must also be extended along the same dimension as the very first matrix; the final result grows along this dimension. Here is an example:
def concat_version(N):
    for i in range(N):
        if i == 0:
            results = function_creating_single_matrix()
            results = results[np.newaxis, ...]  # add the new dimension
        else:
            output = function_creating_single_matrix()
            results = np.concatenate((results, output[np.newaxis, ...]), axis=0)
            # the results grow in the first dimension (axis=0)
    return results
I also compared the variants in terms of performance, using Jupyter %%timeit cells. In order to make the pseudocode above runnable, I just created simple matrices filled with ones to be appended to one another:

def function_creating_single_matrix():
    return np.ones(shape=(10, 50, 50))
I can then also compare the results to ensure it is the same.
%%timeit -n 100
resized = list_version(N=100)
# 4.97 ms ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit -n 100
resized = concat_version(N=100)
# 96.6 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So it seems that the list method is ~20 times faster, at least at these matrix sizes!
Here we see that the functions return identical results:
list_output = list_version(N=100)
concat_output = concat_version(N=100)
np.array_equal(list_output, concat_output)
# True
I also ran cProfile on the functions, and it seems the reason is that np.concatenate spends a lot of its time copying the matrices. I then traced this back to the underlying C code.
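If the number of matrices is known up front, a third option (just a sketch, not benchmarked above) is to preallocate the 4D array once and fill it in place, which avoids the repeated copying that np.concatenate does:

def prealloc_version(N):
    # Assumes the per-matrix shape (10, 50, 50) is known in advance.
    results = np.empty((N, 10, 50, 50))
    for i in range(N):
        results[i] = function_creating_single_matrix()
    return results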
Links:
Here is a similar question that stacks several arrays, but assuming they are already in existence, not being generated and appended on the fly.
Here is some more discussion on the memory management and speed of the above mentioned methods.
I am trying to create a sparse matrix representation of a random hash map h: [n] -> [t], which maps each i to exactly s random locations out of the d available locations, with the values at those locations drawn from some discrete distribution.
:param d: number of bins
:param n: number of items hashed
:param s: sparsity of each column
:param distribution: distribution object.
Here is my attempt:
start_time = time.time()

distribution = scipy.stats.rv_discrete(values=([-1.0, +1.0], [0.5, 0.5]), name='dist')
data = (1.0/sqrt(self._s)) * distribution.rvs(size=self._n*self._s)

col = numpy.empty(self._s*self._n)
for i in range(self._n):
    col[i*self._s:(i+1)*self._s] = i

row = numpy.empty(self._s*self._n)
print time.time() - start_time
for i in range(self._n):
    row[i*self._s:(i+1)*self._s] = numpy.random.choice(self._d, self._s, replace=False)

S = scipy.sparse.csr_matrix((data, (row, col)), shape=(self._d, self._n))
print time.time() - start_time
return S
Now, creating this map for n=500000, s=10, d=1000 takes me around 20 s on my decent workstation, of which 90% of the time is spent generating the row indices. Is there anything I can do to speed this up? Any alternatives? Thanks.
col = numpy.empty(self._s*self._n)
for i in range(self._n):
    col[i*self._s:(i+1)*self._s] = i
looks like something that could be written as one non-looping expression; though it probably isn't a big time consumer
My first guess (but I need to play with this to be sure) is that it is just assigning all of the column index numbers at once, something like:

col = np.empty((self._s, self._n))
col[:,:] = np.arange(self._n)
col = col.ravel()
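For what it's worth, the exact equivalent of the original col loop (same values, same order) is a one-liner built with np.repeat:

col = np.repeat(np.arange(self._n), self._s)  # each column index repeated s times, in order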
Something similar for:

for i in range(self._n):
    row[i*self._s:(i+1)*self._s] = numpy.random.choice(self._d, self._s, replace=False)

which is, I think, picking _s values from _d, _n times over. Doing the no-replacement sampling along _s while allowing replacement across _n could be tricky.
Without running the code myself (with a smaller n) I'm stumbling around a bit. Which is the slow part: generating col, generating row, or building the final csr? Iterating n=500000 times is going to be slow.
The matrix will be (1000, 500000), but with 10*500000 nonzero items, so a density of 0.01. Just for comparison, it would be interesting to generate a sparse random matrix of similar size and density:
In [5]: %timeit sparse.random(1000, 500000, .01)
1 loop, best of 3: 24.6 s per loop
and the dense random choices:
In [8]: timeit np.random.choice(1000,(10,500000)).shape
10 loops, best of 3: 53 ms per loop
In [9]: np.array([np.random.choice(1000,(10,)) for i in range(500000)]).shape
Out[9]: (500000, 10)
In [10]: timeit np.array([np.random.choice(1000,(10,)) for i in range(500000)]).shape
1 loop, best of 3: 12.7 s per loop
So, yes, the large iteration loop is expensive. But given the replacement policy there might not be a way around that. Or is there?
So, as a first guess, creating row takes half the time and creating the sparse matrix the other half. I'm not surprised. You are using the coo style of input, which requires lexsorting and summing of duplicates when converting to csr. We might be able to gain speed by using the indptr type of input. There won't be duplicates to sum, and since there are consistently 10 nonzero terms per row, generating the indptr values won't be hard. But I can't do that off the top of my head. (Oops, that's the transpose.)
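For completeness, here is roughly what that indptr-style construction could look like. Since every item contributes exactly s nonzeros laid out consecutively in data and row, the column pointers of a (d, n) CSC matrix are just multiples of s (a sketch, using plain n, s, d in place of self._n, self._s, self._d and the same data/row arrays as in the question):

indptr = np.arange(0, n*s + 1, s)   # column j owns entries j*s:(j+1)*s of data/row
S = sparse.csc_matrix((data, row.astype(np.intp), indptr), shape=(d, n))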
random sparse to csr is just a bit slower:
In [11]: %timeit sparse.random(1000, 500000, .01, 'csr')
1 loop, best of 3: 28.3 s per loop
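Coming back to the "Or is there?" question about the row generation: one known trick that does avoid the per-item Python loop (just a sketch, not benchmarked here) is to draw a random key for every (item, bin) pair and keep the s smallest keys per item. Each row of the result is then a uniform without-replacement sample of s bins, produced by a single vectorized call:

n, d, s = 500000, 1000, 10
# One random key per (item, bin). argpartition keeps the s smallest keys per row,
# which gives a uniform draw of s distinct bins for every item at once.
# Note: this allocates an (n, d) float array (~4 GB here), so chunk over n if memory is tight.
keys = np.random.rand(n, d)
row = np.argpartition(keys, s, axis=1)[:, :s].ravel()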
I am writing code for proposing typo corrections using an HMM and the Viterbi algorithm. At some point, for each word in the text, I have to do the following (let's assume I have 10,000 words):
# FYI Windows 10, 64-bit, Intel i7, 4 GB RAM, Python 2.7.3
import numpy as np
import pandas as pd

for k in range(10000):
    tempWord = corruptList20[k]  # temp word read from the list which has all of the words
    delta = np.zeros((26, len(tempWord)))
    sai = np.chararray((26, len(tempWord)))
    sai[:] = '#'

    # INITIALIZATION DELTA
    for i in range(26):
        delta[i][0] = # CALCULATION: matrix read and multiplication, each cell is different
    # INITIALIZATION END

    # 6. DELTA CALCULATION
    for deltaIndex in range(1, len(tempWord)):
        for j in range(26):
            tempDelta = 0.0
            maxDelta = 0.0
            maxState = ''
            for i in range(26):
                # CALCULATION to fill each cell involves:
                # 1. matrix read and multiplication
                # 2. finding the column max
                # logical operations and if-then-else operations

    # 7. SAI BACKWARD TRACKING
    delta2 = pd.DataFrame(delta)
    sai2 = pd.DataFrame(sai)
    proposedWord = np.zeros(len(tempWord), str)
    editId = 0
    for col in delta2.columns:
        # CALCULATION to fill each cell involves:
        # 1. matrix read and multiplication
        # 2. finding the column max
        # logical operations and if-then-else operations

    editList20.append(''.join(editWord))
# END OF LOOP
As you can see, it is computationally involved, and when I run it, it takes too much time.
Currently my laptop is stolen and I run this on Windows 10, 64-bit, 4 GB RAM, Python 2.7.3.
My question: can anybody see any point that I can use to optimize? Do I have to delete the matrices I created in the loop before the loop goes to the next round to free memory, or is this done automatically?
After the comments below and using xrange instead of range, the performance increased by almost 30%.
I don't think that range discussion makes much difference. With Python 3, where range is a lazy object, expanding it into a list before iteration doesn't change the time much:
In [107]: timeit for k in range(10000):x=k+1
1000 loops, best of 3: 1.43 ms per loop
In [108]: timeit for k in list(range(10000)):x=k+1
1000 loops, best of 3: 1.58 ms per loop
With numpy and pandas the real key to speeding up loops is to replace them with compiled operations that work on the whole array or dataframe. But even in pure Python, focus on streamlining the contents of the iteration, not the iteration mechanism.
======================
for i in range(26):
    delta[i][0] = # CALCULATION: matrix read and multiplication
A minor change: delta[i, 0] = ...; this is the array way of addressing a single element. Functionally it is often the same, but the intent is clearer. But think: can't you set all of that column at once?
delta[:,0] = ...
====================
N = len(tempWord)
delta = np.zeros((26, N))
etc
In tight loops, temporary variables like this can save time. This isn't tight, so here it just adds clarity.
===========================
This is one ugly nested triple loop; admittedly 26 steps isn't large, but 26*26*N is:
for deltaIndex in range(1, N):
    for j in range(26):
        tempDelta = 0.0
        maxDelta = 0.0
        maxState = ''
        for i in range(26):
            # CALCULATION
            # 1. matrix read and multiplication
            # 2. finding the column max
            # logical operations and if-then-else operations
Focus on replacing this with array operations. It's those 3 commented lines that need to be changed, not the iteration mechanism.
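Since the actual calculation is elided, here is only a sketch of what the vectorized recursion typically looks like for a 26-state HMM. The names trans (a 26x26 transition matrix), emit (26xN emission scores for the current word) and the uniform initialization are illustrative assumptions, not variables from the original code:

# Hypothetical setup standing in for the "matrix read and multiplication" steps.
N = 8                                       # length of the current word
trans = np.random.rand(26, 26)              # transition scores between the 26 states
emit = np.random.rand(26, N)                # emission scores for this word
delta = np.zeros((26, N))
back = np.zeros((26, N), dtype=np.int64)    # integer backpointers instead of the '#' chararray

delta[:, 0] = emit[:, 0] / 26.0             # placeholder initialization

for t in range(1, N):
    # scores[i, j] = delta[i, t-1] * trans[i, j]; the two inner 26-loops collapse into this
    scores = delta[:, t-1][:, None] * trans
    back[:, t] = scores.argmax(axis=0)      # best previous state for each current state
    delta[:, t] = scores.max(axis=0) * emit[:, t]

Only the loop over character positions remains; the 26x26 work per position is handled by whole-array operations.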
================
Making proposedWord a list rather than an array might be faster. Small list operations are often faster than array ones, since numpy arrays have a creation overhead:
In [136]: timeit np.zeros(20,str)
100000 loops, best of 3: 2.36 µs per loop
In [137]: timeit x=[' ']*20
1000000 loops, best of 3: 614 ns per loop
You have to be careful when creating 'empty' lists that the elements are truly independent, not just copies of the same thing.
In [159]: %%timeit
x = np.zeros(20,str)
for i in range(20):
x[i] = chr(65+i)
.....:
100000 loops, best of 3: 14.1 µs per loop
In [160]: timeit [chr(65+i) for i in range(20)]
100000 loops, best of 3: 7.7 µs per loop
As noted in the comments, the behavior of range changed between Python 2 and 3.
In 2, range constructs an entire list populated with the numbers to iterate over, then iterates over the list. Doing this in a tight loop is very expensive.
In 3, range instead constructs a simple object that (as far as I know) consists of just 3 numbers: the starting number, the step (distance between numbers), and the end number. Using simple math, you can calculate any point along the range instead of having to iterate to it. This makes "random access" on it O(1) instead of the O(n) cost of walking an entire list, and prevents the creation of a costly list.
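For example, in Python 3 a range never materializes its elements, so indexing and len() are constant time:

r = range(0, 10**12, 7)    # nothing is materialized
r[10**9]                   # 7000000000, computed arithmetically in O(1)
len(r)                     # 142857142858, also O(1)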
In 2, use xrange to iterate over a range object instead of a list.
(@Tom: I'll delete this if you post an answer.)
It's hard to see exactly what you need to do because of the missing code, but it's clear that you need to learn how to vectorize your numpy code. This can lead to a 100x speedup.
You can probably get rid of all the inner for-loops and replace them with vectorized operations.
e.g. instead of

for i in range(26):
    delta[i][0] = # CALCULATION: matrix read and multiplication, each cell is different
do
delta[:, 0] = # Vectorized form of whatever operation you were going to do.