Efficiency of random slicing on a numpy memory map

Efficiency of random slicing on a numpy memory map - python

I have a 20GB, 100k x 100k 'float16' 2D array as a datafile. I load it to memory as follows:
fp_read = np.memmap(filename, dtype='float16', mode='r', shape=(100000, 100000))
I then attempt to read slices from it. The vertical slices I need to take are effectively random but the performance is very poor for this, or am I doing something wrong?
Analysis:
I have compared with other forms of cross-sectional slicing, which is much better although I don't know why it should be:
%timeit fp_read[:,17000:17005] # slice 5 consecutive cols
1.64 µs ± 16.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit fp_read[:,11000:11050:10]
1.67 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit fp_read[:,5000:6000:200]
1.66 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit fp_read[:,0:100000:20000] # slice 5 disperse cols
1.69 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit fp_read[:,[1,1001,27009,81008,99100]] # slice 5 rand cols
32.4 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = np.arange(100000); b = np.array([1,1001,27009,81008,99100])
%timeit fp_read[np.ix_(a,b)]
18 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Even these timeit functions don't accurately capture the performance degradation, since:
import time
a = np.arange(100000)
cols = np.arange(100000)
np.random.shuffle(cols)
cols = np.sort(cols[:5])
t = time.time()
arr = fp_read[np.ix_(a,cols)]
print('Actually took: {} seconds'.format(time.time() - t))
Actually took: 24.5 seconds
Compared with:
t = time.time()
arr = fp_read[:,0:100000:20000]
print('Actually took: {} seconds'.format(time.time() - t))
Actually took 0.00024 seconds

The performance difference is explained by one key difference in "basic slicing and indexing" vs. "advanced indexing", see these docs. The key line herein is
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
How much the copy hurts can be seen from comparing fp_read[:,5000:6000:200] against fp_read[:,5000:6000:200].copy().
Although making an array copy is always going to be slower than making a new view, it's especially bad for a memmap:
Reading from disk is relatively slow. The data needs to be read from disk to make the (in-memory) copy, while a view doesn't have to read any data at all! There is simply a new ndarray object created with new offset and stepsize (strides) parameters for the memory buffer.
The memory layout of your data is row-major order (vs. columns-major, see wikipedia). For accessing random columns this means that a sector has to be read from disk for every single value of data. Compare that to contiguous access, where you only read one sector for every 256 values (assuming float16 and 512 byte sectors). With memory-mapped io this effect is even worse, because then data is read in blocks (memory pages) of 4kB, so 8 x 512 byte sectors.
Now we can also understand why the timeit results are not really representative: That particular part of the file is cached by the OS in memory.

Related

Fastest way to reset a numpy array to zero? [duplicate]

I have an array which is used to track various values. The array is 2500x1700 in size, so it is not very large. At the end of a session I need to reset all of the values within that array back to zero. I tried both creating a new array of zeros and replacing all values in the array with zeros and creating a brand new array is much faster.
Code Example:
for _ in sessions:
# Reset our array
tracking_array[:,:] = 0
1.44 s ± 19.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Versus
for _ in sessions:
# Reset our array
tracking_array = np.zeros(shape=(2500, 1700))
7.26 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Why is creating an entirely new array so much faster when compared to just replacing the values in the array?

The reason is that the array is not filled in memory on mainstream operating systems (Windows, Linux and MaxOS). Numpy allocates a zero-filled array by requesting to the operating systems (OS) a zero-filled area in virtual memory. This area is not directly mapping in physical RAM. The mapping and zero-initialization is generally done lazily by the OS when you read/write the pages in virtual memory. This cost is paid when you set later the array to 1 for example. Here is a proof:
In [19]: %timeit res = np.zeros(shape=(2500, 1700))
10.8 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [20]: %timeit res = np.ones(shape=(2500, 1700))
7.54 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The former would imply a RAM throughput of at least 4.2 GiB/s which is not high but fair. The latter would imply a RAM throughput of at least roughly 2930 GiB/s which is stupidly high since my machine (as well as any standard desktop/server machine) is barely able to reach 36 GiB/s (using a carefully-optimized benchmark).

Efficiently aggregate results into a Python Data Structure

I often find myself looping over some long INPUT list (or dataframe, or dictionary). Per iteration I do some calculations on the input data, I then push the results into some OUTPUT data structure. Often the final output is a dataframe (since it is convenient to deal with).
Below are two methods that loop over a long list, and aggregate some dummy results into a dataframe. Approach 1 is very slow (~3 seconds per run), whereas Approach 2 is very fast (~18 ms per run). Approach 1 is not good, because it is slow. Approach 2 is faster, but it is not ideal either, because it effectively "caches" data in a local file (and then relies on pandas to read that file back in very quickly). Ideally, we do everything in memory.
What approaches can people suggest to efficiently aggregate results? Bonus: And what if we don't know the exact size/length of our output structure (e.g. the actual output size may exceed the initial size estimate)? Any ideas appreciated.
import time
import pandas as pd
def run1(long_list):
my_df = pd.DataFrame(columns=['A','B','C'])
for el in long_list:
my_df.loc[(len)] = [el, el+1, 1/el] # Dummy calculations
return my_df
def run2(long_list):
with open('my_file.csv', 'w') as f:
f.write('A,B,C\n')
for el in long_list:
f.write(f'{el},{el+1},{1/el}\n') # Dummy calculations
return pd.read_csv('my_file.csv')
long_list = range(1, 2000)
%timeit df1 = run1(long_list) # 3 s ± 349 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2 = run2(long_list) # 18 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

You can do this by creating and then dropping a dummy input column and doing all of the calculations directly in pandas:
def func(long_list):
my_df = pd.DataFrame(long_list, columns=['input'])
my_df = my_df.assign(
A=my_df.input,
B=my_df.input+1,
C=1/my_df.input)
return my_df.drop('input', axis=1)
Comparing the times:
%timeit df1 = run1(long_list)
%timeit df2 = run2(long_list)
%timeit df3 = func(long_list)
3.81 s ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.54 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.19 ms ± 3.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pros:
All in memory
Really fast
Easy to read
Cons:
Probably not as fast as vectorized Numpy operations

You can directly build a DataFrame from a list of lists:
def run3(long_list):
return pd.DataFrame([[el, el+1, 1/el] for el in long_list],
columns=['A','B','C'])
It should be much faster than first one, and still faster that second one, because it does not use disk io.

Why is NumPy sometimes slower than NumPy + plain Python loop?

This is based on this question asked 2018-10.
Consider the following code. Three simple functions to count non-zero elements in a NumPy 3D array (1000 × 1000 × 1000).
import numpy as np
def f_1(arr):
return np.sum(arr > 0)
def f_2(arr):
ans = 0
for val in range(arr.shape[0]):
ans += np.sum(arr[val, :, :] > 0)
return ans
def f_3(arr):
return np.count_nonzero(arr)
if __name__ == '__main__':
data = np.random.randint(0, 10, (1_000, 1_000, 1_000))
print(f_1(data))
print(f_2(data))
print(f_3(data))
Runtimes on my machine (Python 3.7.?, Windows 10, NumPy 1.16.?):
%timeit f_1(data)
1.73 s ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2(data)
1.4 s ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_3(data)
2.38 s ± 956 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, f_2() works faster than f_1() and f_3(). However, it's not the case with data of smaller size. The question is - why so? Is it NumPy, Python, or something else?

This is due to memory access and caching. Each of these functions is doing two things, taking the first code as an example:
np.sum(arr > 0)
It first does a comparison to find where arr is greater than zero (or non-zero, since arr contains non-negative integers). This creates an intermediate array the same shape as arr. Then, it sums this array.
Straightforward, right? Well, when using np.sum(arr > 0) this is a large array. When it's large enough to not fit in cache, performance will decrease since when the processor starts to execute the sum most of the array elements will have been evicted from memory and need to be reloaded.
Since f_2 iterates over the first dimension, it is dealing with smaller sub-arrays. The same copy and sum is done, but this time the intermediate array fits in memory. It's created, used, and destroyed without ever leaving memory. This is much faster.
Now, you would think that f_3 would be fastest (using an in-built method and all), but looking at the source code shows that it uses the following operations:
a_bool = a.astype(np.bool_, copy=False)
return a_bool.sum(axis=axis, dtype=np.intp
a_bool is just another way of finding the non-zero entries, and creates a large intermediate array.
Conclusions
Rules of thumb are just that, and are frequently wrong. If you want faster code, profile it and see what the problems are (good work on that here).
Python does some things very well. In cases where it's optimized, it can be faster than numpy. Don't be afraid to use plain old python code or datatypes in combination with numpy.
If you find frequently yourself manually writing for loops for better performance you may want to take a look at numexpr - it automatically does some of this. I haven't used it much myself, but it should provide a good speedup if intermediate arrays are what's slowing down your program.

It's all a matter of how the data is laid out in memory and how the code accesses it. Essentially, data is fetched from the memory in blocks which are then cached; if an algorithm manages to use data from a block that is in the cache, there is no need to read from memory again. This can result in huge time savings, especially when the cache is much smaller than the data you are dealing with.
Consider these variations, which only differ in which axis we are iterating on:
def f_2_0(arr):
ans = 0
for val in range(arr.shape[0]):
ans += np.sum(arr[val, :, :] > 0)
return ans
def f_2_1(arr):
ans = 0
for val in range(arr.shape[1]):
ans += np.sum(arr[:, val, :] > 0)
return ans
def f_2_2(arr):
ans = 0
for val in range(arr.shape[2]):
ans += np.sum(arr[:, :, val] > 0)
return ans
And the results on my laptop:
%timeit f_1(data)
2.31 s ± 47.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_0(data)
1.88 s ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_1(data)
2.65 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_2(data)
12.8 s ± 650 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can see that f_2_1 almost as fast as f_1, which makes me think that numpy is not using the optimal access pattern (the one used by f_2_0). The explanation for how exactly caching affects the timing is in the other answer.

Let's remove the temporary array completely
As #user2699 already mentioned in his answer, allocating and writing to a large array that doesn't fit in cache can slow down the process quite a lot. To show this behavior I have written two small functions using Numba (JIT-Compiler).
In compiled languages (C,Fortran,..) you normally avoid temporary arrays. In interpreted Python (without using Cython or Numba) you often want to call a compiled function on a larger chunk of data (vectorization) because loops in interpreted code are extremely slow. But this can also have a view downsides (like temporary arrays, bad cache usage)
Function without temporary array allocation
#nb.njit(fastmath=True,parallel=False)
def f_4(arr):
sum=0
for i in nb.prange(arr.shape[0]):
for j in range(arr.shape[1]):
for k in range(arr.shape[2]):
if arr[i,j,k]>0:
sum+=1
return sum
With temporary array
Please note that if you turn on parallelization parallel=True, the compiler does not only try to parallelize the code, but also other optimizations like loop fusing are turned on.
#nb.njit(fastmath=True,parallel=False)
def f_5(arr):
return np.sum(arr>0)
Timings
%timeit f_1(data)
1.65 s ± 48.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2(data)
1.27 s ± 5.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_3(data)
1.99 s ± 6.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_4(data) #parallel=false
216 ms ± 5.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_4(data) #parallel=true
121 ms ± 4.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_5(data) #parallel=False
1.12 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_5(data) #parallel=true Temp-Array is automatically optimized away
146 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

How does Numpy move data when transpose a matrix?

It seems numpy.transpose only save strides, and do actually transpose lazily according to this
So, when data movement actually happened and how to move? use many many memcpy? or some other trick?
I follow the path:
array_reshape,
PyArray_Newshape,
PyArray_NewCopy,
PyArray_NewLikeArray,
PyArray_NewFromDescr,
PyArray_NewFromDescrAndBase,
PyArray_NewFromDescr_int
but see nothing about axis permute. When did it happen indeed?
Update 2021/1/19
Thanks for answers, numpy array copy with transpose is here, which use a common macro to implement it, this algorithm is very native, and it does not consider any of simd acceleration or cache friendliness

The answer to your question is: Numpy doesn't move data.
Did you see PyArray_Transpose on line 688 of your above links? There is a permute in this function,
n = permute->len;
axes = permute->ptr;
...
for (i = 0; i < n; i++) {
int axis = axes[i];
...
permutation[i] = axis;
}
Any array shape is purely metadata, used by Numpy to understand how to handle the data, as memory is always stored linearly and contiguously. There is therefore no reason to move or reorder any data, from the docs here,
Other operations, such as transpose, don't move data elements
around in the array, but rather change the information about the shape and strides so that the indexing of the array changes, but the data in the doesn't move.
Typically these new versions of the array metadata but the same data buffer are
new 'views' into the data buffer. There is a different ndarray object, but it
uses the same data buffer. This is why it is necessary to force copies through
use of the .copy() method if one really wants to make a new and independent
copy of the data buffer.
The only reason to copy may be to maximize cache efficiency, although Numpy already considers this,
As it turns out, numpy is smart enough when dealing with ufuncs to determine which index is the most rapidly varying one in memory and uses that for the innermost loop.

Tracing through the numpy C code is a slow and tedious process. I prefer to deduce patterns of behavior from timings.
Make a sample array and its transpose:
In [168]: A = np.random.rand(1000,1000)
In [169]: At = A.T
First a fast view - no coping of the databuffer:
In [171]: timeit B = A.ravel()
262 ns ± 4.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
A fast copy (presumably uses some fast block memory coping):
In [172]: timeit B = A.copy()
2.2 ms ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A slow copy (presumably requires traversing the source in its strided order, and the target in its own order):
In [173]: timeit B = A.copy(order='F')
6.29 ms ± 2.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Copying At without having to change the order - fast:
In [174]: timeit B = At.copy(order='F')
2.23 ms ± 51.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Like [173] but going from 'F' to 'C':
In [175]: timeit B = At.copy(order='C')
6.29 ms ± 4.16 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [176]: timeit B = At.ravel()
6.54 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Copies with simpler strided reordering fall somewhere in between:
In [177]: timeit B = A[::-1,::-1].copy()
3.75 ms ± 4.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [178]: timeit B = A[::-1].copy()
3.73 ms ± 6.48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [179]: timeit B = At[::-1].copy(order='K')
3.98 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This astype also requires the slower copy:
In [182]: timeit B = A.astype('float128')
6.7 ms ± 8.12 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
PyArray_NewFromDescr_int is described as Generic new array creation routine. While I can't figure out where it copies data from the source to the target, it clearly is checking order and strides and dtype. Presumably it handles all cases where the generic copy is required. The axis permutation isn't a special case.

Speed of copying numpy array

I am wondering if there is any downside of using b = np.array(a) rather than b = np.copy(a) to copy a Numpy array a into b. When I %timeit, the former can be upto 100% faster.
In both cases b is a is False, and I can manipulate b leaving a intact, so I suppose this does what is expected from .copy().
Am I missing anything? What is improper about using np.array to do copy an array?
with python 3.6.5, numpy 1.14.2, while the speed difference closes rapidly for larger sizes:
a = np.arange(1000)
%timeit np.array(a)
501 ns ± 30.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.copy(a)
1.1 µs ± 35.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

From documentation of numpy.copy:
This is equivalent to:
>>> np.array(a, copy=True)
Also, if you look at the source code:
def copy(a, order='K'):
return array(a, order=order, copy=True)
Some timings:
In [1]: import numpy as np
In [2]: a = np.ascontiguousarray(np.random.randint(0, 20000, 1000))
In [3]: %timeit b = np.array(a)
562 ns ± 10.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit b = np.array(a, order='K', copy=True)
1.1 µs ± 10.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit b = np.copy(a)
1.21 µs ± 9.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [6]: a = np.ascontiguousarray(np.random.randint(0, 20000, 1000000))
In [7]: %timeit b = np.array(a)
310 µs ± 6.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit b = np.array(a, order='K', copy=True)
311 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit b = np.copy(a)
313 µs ± 4.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [10]: print(np.__version__)
1.13.3
It is unexpected that simply explicitly setting parameters to their default values changes the speed of execution of np.array(). On the other hand, maybe just processing these explicit arguments adds enough execution time to make a difference for small arrays. Indeed, from the source code for the numpy.array(), one can see that there are many more checks and more processing being performed when keyword arguments are provided, for example, see goto full_path. When keyword parameters are not set, the execution skips all the way down to goto finish. This overhead (of additional processing of keyword arguments) is what you detect in timings for small arrays. For larger arrays this overhead is insignificant in comparison to the actual time of copying the arrays.

"What is improper about using np.array to do copy an array?"
I'd argue it is harder to read. Because it is not obvious that array makes a copy, for example, the similar asarray does not make a copy if it doesn't have to. The reader basically has to know the default value of the copy keyword argument to be sure.

As AGN pointed out, np.array is faster than np.copy because essentially the latter is a wrapper of the former. This means python "loses" some extra time searching for both functions. A similar thing happens with decorators.
This extra time is insignificant for pratical purposes, and you gain better code readability.
You can test it by using a big array (where the array creation takes the main time), and you'll see very little differences in %timeit for both.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.