How can I efficiently create a numpy tensor that collects many matrices and stacks them, always adding new matrices? This is useful for managing batches of images, for example, where each image is a 3D matrix. Stacking N 3D images (one dimension for each of the RGB planes) together creates a 4D matrix.
Here is the basic form of simply appending two matrices along a new dimension, creating a final matrix of higher dimension than the original two. And here is some information on the np.newaxis functionality.
There are two good ways I know of, which use the approaches linked in the question as building blocks.
If you create arrays with, say, 3 dimensions on the fly and want to keep appending these individual results to one another, thus creating a 4D tensor, then you need a small amount of initial setup for the first array before you can apply the answers posted in the question.
One approach is to store all the single matrices in a list (appending as you go) and then simply combine them using np.array:
import numpy as np

def list_version(N):
    # collect each 3D matrix in a plain Python list, then stack once at the end
    outputs = []
    for i in range(N):
        one_matrix = function_creating_single_matrix()  # placeholder generator
        outputs.append(one_matrix)
    return np.array(outputs)  # stacks along a new leading axis, giving a 4D array
A second approach extends the very first matrix (e.g. that is 3d) to become 4d, using np.newaxis. Subsequent 3d matrices can then be np.concatenated one-by-one. They must also be extended in the same dimension as the very first matrix - the final result grows in this dimension. Here is an example:
def concat_version(N):
    for i in range(N):
        if i == 0:
            results = function_creating_single_matrix()  # placeholder generator
            results = results[np.newaxis, ...]  # add the new leading dimension
        else:
            output = function_creating_single_matrix()
            results = np.concatenate((results, output[np.newaxis, ...]), axis=0)
            # the results grow in the first dimension (axis=0)
    return results
I also compared the variants in terms of performance using Jupyter %%timeit cells. To make the pseudocode above runnable, I just created simple matrices filled with ones, to be appended to one another:

def function_creating_single_matrix():
    return np.ones(shape=(10, 50, 50))

I can then also compare the results to ensure they are the same.
%%timeit -n 100
resized = list_version(N=100)
# 4.97 ms ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit -n 100
resized = concat_version(N=100)
# 96.6 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So it seems that the list method is roughly 20 times faster, at least at these matrix sizes.
Here we see that the functions return identical results:
list_output = list_version(N=100)
concat_output = concat_version(N=100)
np.array_equal(list_output, concat_output)
# True
I also ran cProfile on the functions, and it seems the reason is that np.concatenate spends a lot of its time copying the matrices. I then traced this back to the underlying C code.
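For reference, a minimal way to reproduce that kind of profile (assuming the stub function above is defined) is to run cProfile directly on the two variants and sort by cumulative time:

import cProfile

# the concat version should show most of its cumulative time inside
# np.concatenate, which repeatedly copies the growing result array
cProfile.run('list_version(N=100)', sort='cumulative')
cProfile.run('concat_version(N=100)', sort='cumulative')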
Links:
Here is a similar question that stacks several arrays, but it assumes they already exist rather than being generated and appended on the fly.
Here is some more discussion on the memory management and speed of the above-mentioned methods.
Related
I'm working on a machine learning problem, and I need to construct an array of dimensions m x n x p.
For the sake of the question, let's say we have m locations, n time windows, and p features for each observation at each time window.
Our data store can only return one location worth of data at a time, in other words we get an array of size 1 x n x p back. But for the prediction step, we want everything consolidated into a single array of size m x n x p.
With relatively small p, this is fast enough using the naïve approach that we don't care. In some cases, however, p is fairly large, and that makes this fairly slow.
For example:
In [44]: A = np.arange(4800000000, dtype=np.float32).reshape(20,30,8000000)
In [45]: x = np.random.randn(30, 8000000)
In [46]: %timeit A[0,:] = x
39.5 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
While 40 ms is not a lot on its face, doing that for an m of 100 results in 4 seconds for this one fairly straightforward operation.
Is there a faster way to do this array construction?
I'm following the exercises from "Doing Bayesian Data Analysis" in both R and Python.
I would like to find a fast method of doing Monte-Carlo simulation that uses constant space.
The problem below is trivial, but serves as a good test for different methods:
ex 4.3
Determine the exact probability of drawing a 10 from a shuffled pinochle deck. (In a pinochle deck, there are 48 cards. There are six values: 9, 10, Jack, Queen, King, Ace. There are two copies of each value in each of the standard four suits: hearts, diamonds, clubs, spades.)
(A) What is the probability of getting a 10?
Of course, the answer is 1/6.
The fastest solution I could find (comparable to the speed of R) is generating a large array of card draws using np.random.choice, then applying a Counter. I don't like the idea of creating arrays unnecessarily, so I tried using a dictionary and a for loop, drawing one card at a time and incrementing the count for that type of card. To my surprise, it was much slower!
The full code is below for the 3 methods I tested. Is there a way of doing this that is as performant as method1(), but uses constant space?
Python code: (Google Colab link)
import random
from collections import Counter, defaultdict

import numpy as np
import pandas as pd

deck = [c for c in ['9','10','Jack','Queen','King','Ace'] for _ in range(8)]
num_draws = 1000000

def method1():
    # draw all samples at once, then count
    draws = np.random.choice(deck, size=num_draws, replace=True)
    df = pd.DataFrame([Counter(draws)])/num_draws
    print(df)

def method2():
    # draw one card at a time with np.random.choice
    card_counts = defaultdict(int)
    for _ in range(num_draws):
        card_counts[np.random.choice(deck, replace=True)] += 1
    df = pd.DataFrame([card_counts])/num_draws
    print(df)

def method3():
    # draw one card at a time with random.randint
    card_counts = defaultdict(int)
    for _ in range(num_draws):
        card_counts[deck[random.randint(0, len(deck)-1)]] += 1
    df = pd.DataFrame([card_counts])/num_draws
    print(df)
Python timeit() results:
method1: 1.2997
method2: 23.0626
method3: 5.5859
R code:
card = sample(deck, numDraws, replace=TRUE)
print(as.data.frame(table(card)/numDraws))
Here's one with np.unique+np.bincount -
def unique():
    unq, ids = np.unique(deck, return_inverse=True)
    all_ids = np.random.choice(ids, size=num_draws, replace=True)
    ar = np.bincount(all_ids)/num_draws
    return pd.DataFrame(ar[None], columns=unq)
How does NumPy help here?
There are two major improvements helping us here:
We convert the string data to numeric. NumPy works well with such data. To achieve this, we are using np.unique.
We use np.bincount to replace the counting step. Again, it works well with numeric data and we do have that from the numeric conversion done at the start of this method.
NumPy in general works well with large data, which is the case here.
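To make those two steps concrete, here is a small sketch on a toy input (toy_deck is just illustrative, not the benchmark data):

# step 1: map strings to integer codes
toy_deck = ['10', '9', '10', 'Ace']
unq, ids = np.unique(toy_deck, return_inverse=True)
# unq -> ['10' '9' 'Ace']  (sorted unique values)
# ids -> [0 1 0 2]         (each card replaced by its index into unq)

# step 2: count occurrences of each code in one vectorized call
counts = np.bincount(ids)
# counts -> [2 1 1], aligned with unq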
Timings with the given sample dataset, comparing against the fastest of the question's methods, method1 -
In [177]: %timeit method1()
328 ms ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [178]: %timeit unique()
12.4 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy achieves efficiency by running C code in its numerical engine. Python is convenient, but it is orders of magnitude slower than C.
In Numpy and other high-performance Python libraries, the Python code consists mostly of glue code, preparing the task to be dispatched. Since there is overhead, it is much faster to draw a lot of samples at once.
Remember that providing a buffer of 1 million elements for NumPy to work with is still constant space. You can then sample 1 billion times by looping over it.
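As a rough sketch of that idea (the function name and chunk size are just illustrative), you can keep a fixed-size sample buffer and accumulate counts across chunks:

import numpy as np

def chunked_counts(deck, total_draws, chunk_size=1_000_000):
    # map cards to integer codes once, then sample in fixed-size chunks,
    # so memory use depends on chunk_size rather than total_draws
    unq, ids = np.unique(deck, return_inverse=True)
    counts = np.zeros(len(unq), dtype=np.int64)
    remaining = total_draws
    while remaining > 0:
        n = min(chunk_size, remaining)
        sample = np.random.choice(ids, size=n, replace=True)
        counts += np.bincount(sample, minlength=len(unq))
        remaining -= n
    return unq, counts / total_draws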
This extra memory allocation is usually not a problem. If you must avoid using memory at all costs while still getting performance benefits from Numpy, you can try using Numba or Cython to accelerate it.
import numpy as np
from numba import jit

@jit(nopython=True)
def method4():
    # count draws in a fixed-size array; the loop itself is compiled by Numba
    card_counts = np.zeros(6)
    for _ in range(num_draws):
        card_counts[np.random.randint(0, 6)] += 1
    return card_counts/num_draws
I have two large matrices (40000*4096) and I would like to compare and match each row of the first matrix to all of the rows of the second matrix; as a result, the output will have size (40000*40000). However, since I need to do this several thousand times, it is wildly time consuming: 26k seconds for each iteration, so for 5000 iterations ...
I would be glad if you could give me some smart suggestion. Thank you.
P.S. this is what I did so far for just one iteration (1 of 5000)
def matcher(Antigens, Antibodies, ind):
    temp = np.zeros((Antibodies.shape[0], Antibodies.shape[1]))
    output = np.zeros((Antibodies.shape[0], 1))
    for i in range(len(Antibodies)):
        temp[i] = np.int32(np.equal(Antigens[ind], Antibodies[i]))
        output[i] = np.sum(temp[i])
    return output

output = [matcher(Antigens, Antibodies, ind) for ind in range(len(Antigens))]
Okay, I think I understand what your goal is:
Count the number of row matches (antigen vs antibody matrix). Each entry of the resulting vector (40,000 x 1) represents a count of exact matches between 1 antigen row and all of the antibody rows (so values from 0 to 40,000).
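As a plain-NumPy reference for that goal (just to pin down the semantics for one antigen matrix and one antibody matrix, not as the fast path), the count for a single antigen row i could be written as:

# number of antibody rows that exactly match antigen row i (illustrative only)
count_i = (Antibodies == Antigens[i]).all(axis=1).sum()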
I made some fake data:
import numpy as np
import numba as nb
num_mat = 5 # number of matrices
num_row = 10_000 # number of rows per matrix
num_elm = 4_096 # number of elements per row
dim = (num_mat,num_row,num_elm)
Antigens = np.random.randint(0,256,dim,dtype=np.uint8)
Antibodies = np.random.randint(0,256,dim,dtype=np.uint8)
There's one important point here: I reduced the matrices to the smallest datatype that can represent the data in order to reduce their memory footprint. I'm not sure what your data looks like, but hopefully you can do this as well.
Also, the following code assumes your dimensions look like the fake data:
(number of matrices, rows, elements)
@nb.njit
def match_arr(arr1, arr2):
    for i in range(arr1.shape[0]):  # 4096 vs 4096
        if arr1[i] != arr2[i]:
            return False
    return True
@nb.njit
def match_mat_sum(ag, ab):
    out = np.zeros((ag.shape[0]))  # 40000
    for i in range(ag.shape[0]):
        tmp = 0
        for j in range(ab.shape[0]):
            tmp += match_arr(ag[i], ab[j])
        out[i] = tmp
    return out
@nb.njit(parallel=True)
def match_sets(Antigens, Antibodies):
    out = np.empty((Antigens.shape[0] * Antibodies.shape[0], Antigens.shape[1]))  # 5000 x 40000
    # multiprocessing per antigen matrix, may want to move this as suits your data
    for i in nb.prange(Antigens.shape[0]):
        for j in range(Antibodies.shape[0]):
            out[j + (5 * i)] = match_mat_sum(Antigens[i], Antibodies[j])  # need to figure out the index to avoid race conditions
    return out
I lean on Numba heavily. One of the key optimizations is not to check the equivalence of entire rows with np.equal() but to write a custom function match_arr() that breaks as soon as it finds a mis-matched element. Hopefully, this lets us skip a ton of comparisons.
Time comparison:
%timeit match_arr(arr1, arr2)
314 ns ± 0.361 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.equal(arr1, arr2)
1.07 µs ± 5.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
match_mat_sum
This function simply calculates the middle step (the 40,000 x 1 vector) that represents the sum of exact matches between two matrices. This step reduces two matrices like: (m x n), (o x n) -> (m)
match_sets()
The last function parallelizes this operation with explicit parallel loops through nb.prange. You might want to move the prange to a different loop depending on what your data looks like (for example, if you have one antigen matrix but 5000 antibody matrices, you should move prange to the inner loop or you won't be leveraging parallelization). The fake data assumes some antigen and some antibody matrices.
Another important thing to note here is the indexing on the out array. In order to avoid race conditions, each parallel loop iteration needs to write to a unique space. Again, depending on your data, you'll need to index the proper "place" to put the result.
On a Ryzen 1600 (6-core) with 16 gigs of RAM, using this fake data, I generated a result in 10.2 seconds.
Your data is about 3200 times larger. Assuming linear scaling, the full set would take approximately 9 hours, provided you have enough memory.
You could write some kind of batch loader as well, rather than loading 5000 giant matrices directly into memory.
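A minimal sketch of such a loader (the file paths and np.load format are assumptions, not part of the original code) could bring in one antibody matrix at a time and reuse match_mat_sum from above:

def batched_match(antigen_matrix, antibody_paths):
    # hypothetical on-disk layout: one .npy file per 40000 x 4096 antibody matrix
    results = []
    for path in antibody_paths:
        antibodies = np.load(path)                     # load a single matrix
        results.append(match_mat_sum(antigen_matrix, antibodies))
        del antibodies                                 # release before the next batch
    return np.stack(results)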
This problem can be tackled with a mixture of numpy broadcasting and the module numexpr, which performs operations quickly while minimizing the storage of intermediate values.
import numexpr as ne
# expand arrays dimensions to support broadcasting when doing comparison
Antigens, Antibodies = Antigens[None, :, :], Antibodies[:, None, :]
output = ne.evaluate('sum((Antigens==Antibodies)*1, axis=2)')
# *1 is a hack because numexpr does not currently support sum on bool
This may be faster than your current solution, but for such large arrays it will take a while.
The performance of numexpr for this operation is a bit lackluster, but you can at least use broadcasting inside the loop:
output = np.zeros((Antibodies.shape[0],)*2, dtype=np.int32)
for row, out_row in zip(Antibodies, output):
    (row[None, :] == Antigens).sum(1, out=out_row)
I'm computing huge outer products between vectors of size (50500,) and found out that NumPy is (much?) faster than PyTorch while doing so.
Here are the tests:
# NumPy
In [64]: a = np.arange(50500)
In [65]: b = a.copy()
In [67]: %timeit np.outer(a, b)
5.81 s ± 56.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
-------------
# PyTorch
In [73]: t1 = torch.arange(50500)
In [76]: t2 = t1.clone()
In [79]: %timeit torch.ger(t1, t2)
7.73 s ± 143 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'd ideally like to have the computation done in PyTorch. So, how can I speed things up for computing outer product in PyTorch for such huge vectors?
Note: I tried to move the tensors to GPU but I was treated with MemoryError because it needs around 19 GiB of space. So, I eventually have to do it on the CPU.
Unfortunately there's really no way to specifically speed up torch's method of computing the outer product torch.ger() without a vast amount of effort.
Explanation and Options
The reason the numpy function np.outer() is so fast is that it's written in C, which you can see here: https://github.com/numpy/numpy/blob/7e3d558aeee5a8a5eae5ebb6aef03de892a92ebd/numpy/core/numeric.py#L1123
where the function uses operations from the umath C source code.
Pytorch's torch.ger() function is written in C++ here: https://github.com/pytorch/pytorch/blob/7ce634ebc2943ff11d2ec727b7db83ab9758a6e0/aten/src/ATen/native/LinearAlgebra.cpp#L142 which makes it ever so slightly slower as you can see in your example.
Your options to "speed up computing outer product in PyTorch" would be to add a C implementation for outer product in pytorch's native code, or make your own outer product function while interfacing with C using something like Cython if you really don't want to use numpy (which wouldn't make much sense).
P.S.
Also just as an aside, using GPUs would only improve your parallel computation speed on the GPU which may not outweigh the cost of time required to transfer data between RAM and GPU memory.
A very nice solution is to combine both.
class LazyFrames(object):
    def __init__(self, frames):
        self._frames = frames

    def __array__(self, dtype=None):
        out = np.concatenate(self._frames, axis=0)
        if dtype is not None:
            out = out.astype(dtype)
        return out
frames might be just your pytorch tensors for instance.
This object ensures that common frames between the observations are only stored once. It exists purely to optimize memory usage, which can be huge (e.g. DQN's 1M-frame replay buffers). This object should only be converted to a numpy array before being passed to the model.
Reference : https://github.com/Shmuma/ptan/blob/master/ptan/common/wrappers.py
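A minimal usage sketch (the frame shapes here are just illustrative):

# frames can be numpy arrays or pytorch tensors; frames shared across
# several LazyFrames objects are stored only once (by reference)
frame_a = np.ones((1, 84, 84), dtype=np.uint8)
frame_b = np.zeros((1, 84, 84), dtype=np.uint8)
obs = LazyFrames([frame_a, frame_b])

# conversion to a real ndarray is deferred until the model needs it
batch = np.asarray(obs, dtype=np.float32)  # triggers LazyFrames.__array__
print(batch.shape)                         # (2, 84, 84)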
When summing over a dimension in a numpy array, is there a performance difference between the first and the last axis?
Specifically, considering the following code, which of sum1 and sum2 will be performed faster?
import numpy as np
a = np.ones((1000,200))
b = np.ones((200,1000))
sum1 = np.sum(a, axis=0)
sum2 = np.sum(b, axis=-1)
I believe this question actually boils down to how numpy internally stores dimensions, and that this can be overridden to use a row-wise or column-wise format. However, when using the default setting, which of these will be faster? Also, what about N-dimensional arrays?
It is quite easy to check whether or not there is a performance difference (IPython; I increased the numbers a bit to get a more noticeable difference):
import numpy as np
a = np.ones((10000, 2000))
b = np.ones((2000, 10000))
%timeit np.sum(a, axis=0)
# 27.6 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(b, axis=-1)
# 34.6 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now, by the time you have an actual performance issue with np.sum you will probably have run out of memory anyway, but yes, there is a difference. By default, NumPy arrays are stored in row-major order, so the first row goes first, then the second, and so on. It makes sense, then, that summing (or operating) over the outer dimension is faster, because the cache will be far more effective.

Simply put, in the first case, when you fetch the first element of the array, a chunk of contiguous data comes into the cache with it, so when you want to sum the next elements they are already there. In the second case, on the other hand, the elements to sum are quite far away from each other (2000 elements of distance, actually), so the cache won't help much. That is not to say the cache won't help at all, since you are summing all the columns, so cached data will still be reused to a degree, just not as effectively.

This is a rather gross approximation; in general there are several cache levels, some shared among cores and some not, and understanding the exact effect that one piece of code or another has on them is a complicated topic, but the general idea holds.
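As a small sketch of how to inspect and override the memory layout discussed above (array sizes match the timing example):

import numpy as np

a = np.ones((10000, 2000))          # default layout: row-major (C order)
print(a.flags['C_CONTIGUOUS'])      # True

a_f = np.asfortranarray(a)          # same values, column-major layout
print(a_f.flags['F_CONTIGUOUS'])    # True

# re-running the %timeit comparison on a_f walks memory in the opposite
# pattern, which is one way to see the layout effect for yourself
s = np.sum(a_f, axis=0)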