Fast Monte-Carlo simulation with numpy? - python

I'm following the exercises from "Doing Bayesian Data Analysis" in both R and Python.
I would like to find a fast method of doing Monte-Carlo simulation that uses constant space.
The problem below is trivial, but serves as a good test for different methods:
ex 4.3
Determine the exact probability of drawing a 10 from a shuffled pinochle deck. (In a pinochle deck, there are 48 cards. There are six values: 9, 10, Jack, Queen, King, Ace. There are two copies of each value in each of the standard four suits: hearts, diamonds, clubs, spades.)
(A) What is the probability of getting a 10?
Of course, the answer is 1/6.
The fastest solution I could find (comparable to the speed of R) is generating a large array of card draws using np.random.choice, then applying a Counter. I don't like the idea of creating arrays unnecessarily, so I tried using a dictionary and a for loop, drawing one card at a time and incrementing the count for that type of card. To my surprise, it was much slower!
The full code is below for the 3 methods I tested. _Is there a way of doing this that will be as performant as method1(), but using constant space?
Python code: (Google Colab link)
deck = [c for c in ['9','10','Jack','Queen','King','Ace'] for _ in range(8)]
num_draws = 1000000
def method1():
draws = np.random.choice(deck, size=num_draws, replace=True)
df = pd.DataFrame([Counter(draws)])/num_draws
print(df)
def method2():
card_counts = defaultdict(int)
for _ in range(num_draws):
card_counts[np.random.choice(deck, replace=True)] += 1
df = pd.DataFrame([card_counts])/num_draws
print(df)
def method3():
card_counts = defaultdict(int)
for _ in range(num_draws):
card_counts[deck[random.randint(0, len(deck)-1)]] += 1
df = pd.DataFrame([card_counts])/num_draws
print(df)
Python timeit() results:
method1: 1.2997
method2: 23.0626
method3: 5.5859
R code:
card = sample(deck, numDraws, replace=TRUE)
print(as.data.frame(table(card)/numDraws))

Here's one with np.unique+np.bincount -
def unique():
unq,ids = np.unique(deck, return_inverse=True)
all_ids = np.random.choice(ids, size=num_draws, replace=True)
ar = np.bincount(all_ids)/num_draws
return pd.DataFrame(ar[None], columns=unq)
How does NumPy help here?
There are two major improvements that's helping us here :
We convert the string data to numeric. NumPy works well with such data. To achieve this, we are using np.unique.
We use np.bincount to replace the counting step. Again, it works well with numeric data and we do have that from the numeric conversion done at the start of this method.
NumPy in general works well with large data, which is the case here.
Timings with given sample dataset comparing against fastest method1 -
In [177]: %timeit method1()
328 ms ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [178]: %timeit unique()
12.4 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numpy achieves efficiency by running C code in its numerical engine. Python is convenient, but it is orders of magnitude slower than C.
In Numpy and other high-performance Python libraries, the Python code consists mostly of glue code, preparing the task to be dispatched. Since there is overhead, it is much faster to draw a lot of samples at once.
Remember that providing a buffer of 1 million elements for Numpy to work is still constant space. Then you can sample 1 billion times by looping it.
This extra memory allocation is usually not a problem. If you must avoid using memory at all costs while still getting performance benefits from Numpy, you can try using Numba or Cython to accelerate it.
from numba import jit
#jit(nopython=True)
def method4():
card_counts = np.zeros(6)
for _ in range(num_draws):
card_counts[np.random.randint(0, 6)] += 1
return card_counts/num_draws

Related

Is it possible to improve python performance for this code?

I have a simple code that:
Read a trajectory file that can be seen as a list of 2D arrays (list of positions in space) stored in Y
I then want to compute for each pair (scipy.pdist style) the RMSD
My code works fine:
trajectory = read("test.lammpstrj", index="::")
m = len(trajectory)
#.get_positions() return a 2d numpy array
Y = np.array([snapshot.get_positions() for snapshot in trajectory])
b = [np.sqrt(((((Y[i]- Y[j])**2))*3).mean()) for i in range(m) for j in range(i + 1, m)]
This code execute in 0.86 seconds using python3.10, using Julia1.8 the same kind of code execute in 0.46 seconds
I plan to have trajectory much larger (~ 200,000 elements), would it be possible to get a speed-up using python or should I stick to Julia?
You've mentioned that snapshot.get_positions() returns some 2D array, suppose of shape (p, q). So I expect that Y is a 3D array with some shape (m, p, q), where m is the number of snapshots in the trajectory. You also expect m to scale rather high.
Let's see a basic way to speed up the distance calculation, on the setting m=1000:
import numpy as np
# dummy inputs
m = 1000
p, q = 4, 5
Y = np.random.randn(m, p, q)
# your current method
def foo():
return [np.sqrt(((((Y[i]- Y[j])**2))*3).mean()) for i in range(m) for j in range(i + 1, m)]
# vectorized approach -> compute the upper triangle of the pairwise distance matrix
def bar():
u, v = np.triu_indices(Y.shape[0], 1)
return np.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))
# Check for correctness
out_1 = foo()
out_2 = bar()
print(np.allclose(out_1, out_2))
# True
If we test the time required:
%timeit -n 10 -r 3 foo()
# 3.16 s ± 50.3 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
The first method is really slow, it takes over 3 seconds for this calculation. Let's check the second method:
%timeit -n 10 -r 3 bar()
# 97.5 ms ± 405 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
So we have a ~30x speedup here, which would make your large calculation in python much more feasible than using the original code. Feel free to test out with other sizes of Y to see how it scales compared to the original.
JIT
In addition, you can also try out JIT, mainly jax or numba. It is fairly simple to port the function bar with jax.numpy, for example:
import jax
import jax.numpy as jnp
#jax.jit
def jit_bar(Y):
u, v = jnp.triu_indices(Y.shape[0], 1)
return jnp.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))
# check for correctness
print(np.allclose(bar(), jit_bar(Y)))
# True
If we test the time of the jitted jnp op:
%timeit -n 10 -r 3 jit_bar(Y)
# 10.6 ms ± 678 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
So compared to the original, we could reach even up to ~300x speed.
Note that not every operation can be converted to jax/jit so easily (this particular problem is conveniently suitable), so the general advice is to simply avoid python loops and use numpy's broadcasting/vectorization capabilities, like in bar().
Stick to Julia.
If you already made it in a language which runs faster, why are you trying to use python in the first place?
Your question is about speeding up Python, relative to Julia, so I'd like to offer some Julia code for comparison.
Since your data is most naturally expressed as a list of 4x5 arrays, I suggest expressing it as a vector of SMatrixes:
sumdiff2(A, B) = sum((A[i] - B[i])^2 for i in eachindex(A, B))
function dists(Y)
M = length(Y)
V = Vector{float(eltype(eltype(Y)))}(undef, sum(1:M-1))
Threads.#threads for i in eachindex(Y)
ii = sum(M-i+1:M-1) # don't worry about this sum
for j in i+1:lastindex(Y)
ind = ii + (j-i)
V[ind] = sqrt(3 * sumdiff2(Y[i], Y[j])/length(Y[i]))
end
end
return V
end
using Random: randn
using StaticArrays: SMatrix
Ys = [randn(SMatrix{4,5,Float64}) for _ in 1:1000];
Benchmarks:
# single-threaded
julia> using BenchmarkTools
julia> #btime dists($Ys);
6.561 ms (2 allocations: 3.81 MiB)
# multi-threaded with 6 cores
julia> #btime dists($Ys);
1.606 ms (75 allocations: 3.82 MiB)
I was not able to install jax on my computer, but when comparing with #Mercury's numpy code I got
foo: 5.5seconds
bar: 179ms
i.e. approximately 3400x speedup over foo.
It is possible to write this as a one-liner at a ~2-3x performance cost.
While Python tends to be slower than Julia for many tasks, it is possible to write numerical codes as fast as Julia in Python using Numba and plain loops. Indeed, Numba is based on LLVM-Lite which is basically a JIT-compiler based on the LLVM toolchain. The standard implementation of Julia also use a JIT and the LLVM toolchain. This means the two should behave pretty closely besides the overhead introduced by the languages that are negligible once the computation is performed in parallel (because the resulting computation will be memory-bound on nearly all modern platforms).
This computation can be parallelized in both Julia and Python (still using Numba). While writing a sequential computation is quite straightforward, writing a parallel computation is if bit more complex. Indeed, computing the upper triangular values can result in an imbalanced workload and so to a sub-optimal execution time. An efficient strategy is to compute, for each iteration, a pair of lines: one comes from the top of the upper triangular part and one comes from the bottom. The top line contains m-i items while the bottom one contains i+1 items. In the end, there is m+1 items to compute per iteration so the number of item is independent of the iteration number. This results in a much better load-balancing. The line of the middle needs to be computed separately regarding the size of the input array.
Here is the final implementation:
import numba as nb
import numpy as np
#nb.njit(inline='always', fastmath=True)
def compute_line(tmp, res, i, m):
offset = (i * (2 * m - i - 1)) // 2
factor = 3.0 / n
for j in range(i + 1, m):
s = 0.0
for k in range(n):
s += (tmp[i, k] - tmp[j, k]) ** 2
res[offset] = np.sqrt(s * factor)
offset += 1
return res
#nb.njit('()', parallel=True, fastmath=True)
def fastest():
m, n = Y.shape[0], Y.shape[1] * Y.shape[2]
res = np.empty(m*(m-1)//2)
tmp = Y.reshape(m, n)
for i in nb.prange(m//2):
compute_line(tmp, res, i, m)
compute_line(tmp, res, m-i-1, m)
if m % 2 == 1:
compute_line(tmp, res, (m+1)//2, m)
return res
# [...] same as others
%timeit -n 100 fastest()
Results
Here are performance results on my machine (with a i5-9600KF having 6 cores):
foo (seq, Python, Mercury): 4910.7 ms
bar (seq, Python, Mercury): 134.2 ms
jit_bar (seq, Python, Mercury): ???
dists (seq, Julia, DNF) 6.9 ms
dists (par, Julia, DNF) 2.2 ms
fastest (par, Python, me): 1.5 ms <-----
(Jax does not work on my machine so I cannot test it yet)
This implementation is the fastest one and succeed to beat the best Julia code so far.
Optimal implementation
Note that for large arrays like (200_000,4,5), all implementations provided so far are inefficient since they are not cache friendly. Indeed, the input array will take 32 MiB and will not for on the cache of most modern processors (and even if it could, one need to consider the space needed for the output and the fact that caches are not perfect). This can be fixed using tiling, at the expense of an even more complex code. I think such an implementation should be optimal if you use Z-order curves.

Sum a numpy array in chunks

Let's say I have a numpy array:
x = np.array([3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
And I want to sum it in groups of, say, 3, so that the results is as follows:
np.array([12, 21, 30, 39])
Here is one way to do it:
n = x.size
out = x.reshape(n//3, 3) # np.ones(3)
Is there a quicker way? I feel like this could be improved.
EDIT: just wanted to give an update for some of the methods described here
n = int(1e6)
arr = np.random.random(4*n)
def method1(arr):
return arr.reshape(n, 4) # np.ones(4)
def method2(arr):
return arr.reshape(n, 4).sum(-1)
def method3(arr):
return np.add.reduceat(arr, np.arange(0, 4*n, 4))
%timeit method1(arr)
1.53 ms ± 85.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit method2(arr)
14.6 ms ± 867 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit method3(arr)
14.2 ms ± 369 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
method2 is the basic way to do that in Numpy. That being said it is not very well optimized yet internally for such a case. Indeed, the reduction is done along a very small number of items and the internal reduction is optimized for a relatively large number of items. AFAIR, compilers like GCC tends to auto-vectorize the code using SIMD instructions resulting in a much slower execution for small reductions. It might be optimized in the future but this is tricky to do since the problem is mainly related to the way compilers optimize the code and the assumption they make during the optimization steps. Thus, it is not really a problem of Numpy though there are ways to specifically optimize this use-case at the expensive of a less-maintainable code.
method3 is not very efficient since np.add.reduceat is currently not yet very-optimized internally in Numpy. We plan to do that but one should not expect a drastic improvement since the method is fundamentally not very efficient on modern CPUs anyway.
method1 is clever because it makes use of BLAS that are very optimized internally. The default implementation on most platform, OpenBLAS, carefully optimize many use-case, including small matrices/vectors multiplications, resulting in a much faster execution. That being said, it is not optimal due to the unneeded multiplications by ones (BLAS does not optimize the computations based on the content of the values).
AFAIK, there is no way to write a faster implementation than method1 in pure Numpy. As a result, the only option left to speed up the code is to execute a natively-compiled code specifically design to solve your use-case. This is possible using Numba or Cython. Here is a naive implementation:
import numba as nb
#nb.njit('(float64[::1],)')
def method4(arr):
res = np.empty(n)
for i in range(n):
res[i] = arr[i*4] + arr[i*4+1] + arr[i*4+2] + arr[i*4+3]
return res
If you run this code, you will certainly get similar performance results than BLAS demonstrating how good BLAS implementations are (in fact, OpenBLAS is a bit faster on my machine). This code is not optimal because it is mainly memory-bound and page faults slow things down on most systems (see this related post). You can mitigate their overheads using multiple threads. This is still not optimal as page faults does not scale well on all platforms (quite fine on Linux but poor on Windows). Alternatively, you can preallocate the output array once so to pay this overhead only once. You can even mix both approaches regarding your needs (suing multiple threads can be useful to ensure the memory is saturated whatever the target platform though creating threads can be expensive). Here is the naive parallel implementation and an optimized parallel implementation:
# Naive parallel implementation mitigating a bit the page-faults overhead
#nb.njit('(float64[::1],)', parallel=True)
def method5(arr):
res = np.empty(n)
for i in nb.prange(n):
res[i] = arr[i*4] + arr[i*4+1] + arr[i*4+2] + arr[i*4+3]
return res
# Parallel implementation avoiding completely page-faults
# (assuming `res` is preallocated and filled)
#nb.njit('(float64[::1],float64[::1])', parallel=True)
def method6(arr, res):
for i in nb.prange(n):
res[i] = arr[i*4] + arr[i*4+1] + arr[i*4+2] + arr[i*4+3]
Benchmark
method1: 3.64 ms
method2: 11.7 ms
method3: 16.0 ms
method4: 3.88 ms
method5: 2.05 ms
method6: 0.84 ms <----
This last method is nearly optimal and 4.3 times faster than the previously fastest BLAS one.

NumPy is faster than PyTorch for larger cross or outer products

I'm computing huge outer products between vectors of size (50500,) and found out that NumPy is (much?) faster than PyTorch while doing so.
Here are the tests:
# NumPy
In [64]: a = np.arange(50500)
In [65]: b = a.copy()
In [67]: %timeit np.outer(a, b)
5.81 s ± 56.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
-------------
# PyTorch
In [73]: t1 = torch.arange(50500)
In [76]: t2 = t1.clone()
In [79]: %timeit torch.ger(t1, t2)
7.73 s ± 143 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'd ideally like to have the computation done in PyTorch. So, how can I speed things up for computing outer product in PyTorch for such huge vectors?
Note: I tried to move the tensors to GPU but I was treated with MemoryError because it needs around 19 GiB of space. So, I eventually have to do it on the CPU.
Unfortunately there's really no way to specifically speed up torch's method of computing the outer product torch.ger() without a vast amount of effort.
Explanation and Options
The reason numpy function np.outer() is so fast is because it's written in C, which you can see here: https://github.com/numpy/numpy/blob/7e3d558aeee5a8a5eae5ebb6aef03de892a92ebd/numpy/core/numeric.py#L1123
where the function uses operations from the umath C source code.
Pytorch's torch.ger() function is written in C++ here: https://github.com/pytorch/pytorch/blob/7ce634ebc2943ff11d2ec727b7db83ab9758a6e0/aten/src/ATen/native/LinearAlgebra.cpp#L142 which makes it ever so slightly slower as you can see in your example.
Your options to "speed up computing outer product in PyTorch" would be to add a C implementation for outer product in pytorch's native code, or make your own outer product function while interfacing with C using something like Cython if you really don't want to use numpy (which wouldn't make much sense).
P.S.
Also just as an aside, using GPUs would only improve your parallel computation speed on the GPU which may not outweigh the cost of time required to transfer data between RAM and GPU memory.
A very nice solution is to combine both.
class LazyFrames(object):
def __init__(self, frames):
self._frames = frames
def __array__(self, dtype=None):
out = np.concatenate(self._frames, axis=0)
if dtype is not None:
out = out.astype(dtype)
return out
frames might be just your pytorch tensors for instance.
This object ensures that common frames between the observations are only stored once. It exists purely to optimize memory usage which can be huge (e.g. DQN's 1M frames replay buffers). This object should only be converted to numpy array before being passed to the model.
Reference : https://github.com/Shmuma/ptan/blob/master/ptan/common/wrappers.py

numpy.sum performance depending on axis

When summing over a dimension in a numpy array, is there a performance difference between the first and the last axis?
Specifically, considering the following code, which of sum1 and sum2 will be performed faster?
import numpy as np
a = np.ones((1000,200))
b = np.ones((200,1000))
sum1 = np.sum(a, axis=0)
sum2 = np.sum(b, axis=-1)
I believe this question actually boils down to how does numpy internally store dimensions and that this can be overriden to use row-wise or column-wise format. However, when using the default setting, which of these will be faster? Also, what about N-dimensional arrays?
It is quite easy to check whether or not there is a performance difference (IPython, I increased a bit the numbers to have a more noticeable difference):
import numpy as np
a = np.ones((10000, 2000))
b = np.ones((2000, 10000))
%timeit np.sum(a, axis=0)
# 27.6 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(b, axis=-1)
# 34.6 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now, by the time you are having an actual performance issue with np.sum you will probably have run out of memory anyway, but yes, there is a difference. By default, NumPy arrays are stored in row-major order, so first goes the first row, then the second, etc. It does make sense, then, that summing (or operating) in outer dimensions is faster, because the cache will be way more effective. Simply puy, in the first case, when you get the first element of the array a bunch of contiguous data will come to the cache with it, so when you want to sum the next elements they will be already there. In the second case, on the other hand, elements to sum are quite far away from each other (2000 elements of distance, actually), so the cache won't be helping much, column-wise. That is not to say the cache won't help at all, since you are summing all the columns, so cached data will still be reused to a degree, but not as effectively. This is a rather gross approximation, in general there are several cache levels, some shared among cores and some not, and understanding the exact effect that one or another code has on it is a complicated topic, but the general idea holds.

Why python broadcasting in the example below is slower than a simple loop?

I have an array of vectors and compute the norm of their diffs vs the first one.
When using python broadcasting, the calculation is significantly slower than doing it via a simple loop. Why?
import numpy as np
def norm_loop(M, v):
n = M.shape[0]
d = np.zeros(n)
for i in range(n):
d[i] = np.sum((M[i] - v)**2)
return d
def norm_bcast(M, v):
n = M.shape[0]
d = np.zeros(n)
d = np.sum((M - v)**2, axis=1)
return d
M = np.random.random_sample((1000, 10000))
v = M[0]
%timeit norm_loop(M, v)
25.9 ms
%timeit norm_bcast(M, v)
38.5 ms
I have Python 3.6.3 and Numpy 1.14.2
To run the example in google colab:
https://drive.google.com/file/d/1GKzpLGSqz9eScHYFAuT8wJt4UIZ3ZTru/view?usp=sharing
Memory access.
First off, the broadcast version can be simplified to
def norm_bcast(M, v):
return np.sum((M - v)**2, axis=1)
This still runs slightly slower than the looped version.
Now, conventional wisdom says that vectorized code using broadcasting should always be faster, which in many cases isn't true (I'll shamelessly plug another of my answers here). So what's happening?
As I said, it comes down to memory access.
In the broadcast version every element of M is subtracted from v. By the time the last row of M is processed the results of processing the first row have been evicted from cache, so for the second step these differences are again loaded into cache memory and squared. Finally, they are loaded and processed a third time for the summation. Since M is quite large, parts of the cache are cleared on each step to acomodate all of the data.
In the looped version each row is processed completely in one smaller step, leading to fewer cache misses and overall faster code.
Lastly, it is possible to avoid this with some array operations by using einsum.
This function allows mixing matrix multiplications and summations.
First, I'll point out it's a function that has rather unintuitive syntax compared to the rest of numpy, and potential improvements often aren't worth the extra effort to understand it.
The answer may also be slightly different due to rounding errors.
In this case it can be written as
def norm_einsum(M, v):
tmp = M-v
return np.einsum('ij,ij->i', tmp, tmp)
This reduces it to two operations over the entire array - a subtraction, and calling einsum, which performs the squaring and summation.
This gives a slight improvement:
%timeit norm_bcast(M, v)
30.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit norm_loop(M, v)
25.1 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit norm_einsum(M, v)
21.7 ms ± 65.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Squeezing out maximum performance
On vectorized operations you clearly have a bad cache behaviour. But the calculation itsef is also slow due to not exploiting modern SIMD instructions (AVX2,FMA). Fortunately it isn't really complicated to overcome this issues.
Example
import numpy as np
import numba as nb
#nb.njit(fastmath=True,parallel=True)
def norm_loop_improved(M, v):
n = M.shape[0]
d = np.empty(n,dtype=M.dtype)
#enables SIMD-vectorization
#if the arrays are not aligned
M=np.ascontiguousarray(M)
v=np.ascontiguousarray(v)
for i in nb.prange(n):
dT=0.
for j in range(v.shape[0]):
dT+=(M[i,j]-v[j])*(M[i,j]-v[j])
d[i]=dT
return d
Performance
M = np.random.random_sample((1000, 1000))
norm_loop_improved: 0.11 ms**, 0.28ms
norm_loop: 6.56 ms
norm_einsum: 3.84 ms
M = np.random.random_sample((10000, 10000))
norm_loop_improved:34 ms
norm_loop: 223 ms
norm_einsum: 379 ms
** Be careful when measuring performance
The first result (0.11ms) comes from calling the function repeadedly with the same data. This would need 77 GB/s reading-throuput from RAM, which is far more than my DDR3 Dualchannel-RAM is capable of. Due to the fact that calling a function with the same input parameters successively isn't realistic at all, we have to modify the measurement.
To avoid this issue we have to call the same function with different data at least twice (8MB L3-cache, 8MB data) and than divide the result by two to clear all the caches.
The relative performance of this methods also differ on array sizes (have a look at the einsum results).

Categories