I am interested in finding the fastest way of carrying out a simple operation in Python 3.6 using Numpy. I wish to apply a function to a given array and obtain an array of the function's values. Here is a simplified version that does this using map:
import numpy as np

def func(x):
    return x**2

xRange = np.arange(0, 1, 0.01)
arr_func = np.array(list(map(func, xRange)))
However, as I will be running this with a complicated function on large arrays, runtime speed is very important to me. Is there a known faster way?
EDIT: My question is not the same as this one, because I am asking about assigning from a function, as opposed to a generator.
Check the related How do I build a numpy array from a generator?, where the most compelling option seems to be preallocating the numpy array and setting values, instead of creating a throwaway intermediate list.
arr_func = np.empty(len(xRange))
for i in range(len(xRange)):
    arr_func[i] = func(xRange[i])
With a complex function that can't be rewritten with compiled numpy functions, we can't make big improvements in speed.
Define a function using math module functions, which require scalars, for example:
import math

def func(x):
    return math.sin(x)**2 + math.cos(x)**2
In [868]: x = np.linspace(0,np.pi,10000)
For reference, a straightforward list comprehension:
In [869]: np.array([func(i) for i in x])
Out[869]: array([ 1., 1., 1., ..., 1., 1., 1.])
In [870]: timeit np.array([func(i) for i in x])
13.4 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Your list map is slightly faster:
In [871]: timeit np.array(list(map(func, x)))
12.6 ms ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For a 1d array like this, np.array can be replaced with np.fromiter. It works with a generator as well, including the Py3 map.
In [875]: timeit np.fromiter(map(func, x),float)
13.1 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So that could get around the possible time penalty of creating a whole list first. But in this case it doesn't help.
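If the output length is known in advance, np.fromiter also accepts a count argument, which lets it preallocate the result instead of growing it. A minor variation on the same idea (it may not change the timing much in this case):

arr = np.fromiter(map(func, x), dtype=float, count=len(x))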
Another option is np.frompyfunc. It is used by np.vectorize, but is usually faster, with less overhead. It returns an object-dtype array:
In [876]: f = np.frompyfunc(func, 1, 1)
In [877]: f(x)
Out[877]: array([1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0], dtype=object)
In [878]: timeit f(x)
11.1 ms ± 298 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [879]: timeit f(x).astype(float)
11.2 ms ± 85.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A slight speed improvement. I noticed more of an improvement with a 1000-item x. This is even better if your problem requires several arrays that can be broadcast against each other.
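For example, here is a sketch of what I mean (my own illustration, using a hypothetical two-argument func2; it relies on the math and numpy imports above). frompyfunc produces a real ufunc, so its inputs broadcast against each other like any other ufunc's:

def func2(a, b):
    return math.sin(a)**2 + math.cos(b)**2

f2 = np.frompyfunc(func2, 2, 1)              # 2 inputs, 1 output
col = np.linspace(0, np.pi, 100)[:, None]    # shape (100, 1)
row = np.linspace(0, np.pi, 200)             # shape (200,)
res = f2(col, row).astype(float)             # broadcast to shape (100, 200)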
Assigning to a preallocated out array may save memory, and is often recommended as an alternative to building a list by appending in a loop. But here it doesn't give a speed improvement:
In [882]: %%timeit
...: out = np.empty_like(x)
...: for i,j in enumerate(x): out[i]=func(j)
16.1 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(the use of enumerate is slightly faster than range iteration).
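To tie this back to the question's own names, the frompyfunc route (the fastest option above) would look something like this sketch; whether it actually wins depends on how expensive func is:

f = np.frompyfunc(func, 1, 1)
arr_func = f(xRange).astype(float)   # object-dtype result cast back to float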
Related
I am trying to vectorize creation of an array with variable indices that change with the loop variable. In the code snippet below, I want to remove the for loop and vectorize the array creation. Can someone kindly help?
# Vectorize 1
def abc(x):
    return str(x) + '_variable'

ar = []
for i in range(0, 100):
    ar += [str('vectorize_') + abc(i)]
You're not going to get much improvement from "vectorization" here since you're working with strings, unfortunately. A pure Python comprehension is about as good as you'll be able to get, because of this constraint. "Vectorized" operations are only able to take advantage of optimized numerical C code when the data are numeric.
Here's an example of one way you might do what you want here:
In [4]: %timeit np.char.add(np.repeat("vectorize_variable_", 100), np.arange(100).astype(str))
108 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
versus a pure Python comprehension:
In [5]: %timeit [f"vectorize_variable_{i}" for i in range(100)]
11.1 µs ± 175 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As far as I know, using numpy really doesn't net you any performance benefits when working with strings. Of course, I may be mistaken, and would love to be corrected if I am.
If you're still not convinced, here's the same test with n=10000:
In [6]: %timeit [f"vectorize_variable_{i}" for i in range(n)]
1.21 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit np.char.add(np.repeat("vectorize_variable_", n), np.arange(n).astype(str))
9.97 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pure Python is about 10x faster than the "vectorized" version.
Why is Numpy slower than list comprehensions in this case?
What is the best way to vectorize this grid-construction?
In [1]: import numpy as np
In [2]: mesh = np.linspace(-1, 1, 3000)
In [3]: rowwise, colwise = np.meshgrid(mesh, mesh)
In [4]: f = lambda x, y: np.where(x > y, x**2, x**3)
# Using 2D arrays:
In [5]: %timeit f(colwise, rowwise)
285 ms ± 2.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Using 1D array and list-comprehension:
In [6]: %timeit np.array([f(x, mesh) for x in mesh])
58 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Equivalent result
In [7]: np.allclose(f(colwise, rowwise), np.array([f(x, mesh) for x in mesh]))
True
In [1]: mesh = np.linspace(-1, 1, 3000)
   ...: rowwise, colwise = np.meshgrid(mesh, mesh)
   ...: f = lambda x, y: np.where(x > y, x**2, x**3)
In addition, let's make a sparse grid:
In [2]: r1,c1 = np.meshgrid(mesh,mesh,sparse=True)
In [3]: rowwise.shape
Out[3]: (3000, 3000)
In [4]: r1.shape
Out[4]: (1, 3000)
With the sparse grid, times are even better than your iteration:
In [5]: timeit f(colwise, rowwise)
645 ms ± 57.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: timeit f(c1,r1)
108 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: timeit np.array([f(x, mesh) for x in mesh])
166 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The other answer stresses caching. Other posts have shown that a modest amount of iteration can be faster than working with very large arrays, such as when using matmul. I don't know whether it's caching or some other memory-management complication that slows this down.
But at 3000*3000*8 bytes I'm not sure that's the issue here. Instead I think it's the time the x**2 and x**3 expressions require.
The arguments of np.where are evaluated before being passed in.
The condition expression takes a modest amount of time:
In [8]: timeit colwise>rowwise
24.2 ms ± 71.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
But the power expression for the (3000,3000) array takes a majority of the total time:
In [9]: timeit rowwise**3
467 ms ± 8.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Contrast that with the time required for the sparse equivalent:
In [10]: timeit r1**3
142 µs ± 150 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This time is 3288x faster; that's a bit worse than O(n) scaling.
Repeated multiplication is faster:
In [11]: timeit rowwise*rowwise*rowwise
116 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [f(x, mesh) for x in mesh], x**3 operates on a scalar, so it is fast, even though it's repeated 3000 times.
In fact, if we take the power calculations out of the timing, the whole-array where is relatively fast:
In [15]: %%timeit x2,x3 = rowwise**2, rowwise**3
...: np.where(rowwise>colwise, x2,x3)
89.8 ms ± 3.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why is Numpy slower than list comprehensions in this case?
You are essentially suffering from two problems.
The first is cache utilization.
The second version uses a smaller subset of the space, (3000,1) and (1,3000), for the calculation, which can fit nicely in your cache, so x>y, x**2, and x**3 can all sit inside the cache, which speeds things up somewhat.
The first version calculates each of those three for a 3000x3000 array (9 million entries), which can never sit inside your cache (usually ~2-5 MB). np.where is then called and has to fetch parts of the data from your RAM (not the cache) to do its memory copying, and the result is returned piece by piece to your RAM, which is very expensive.
Also, numpy's implementation of np.where is somewhat alignment-unaware and accesses your arrays column-wise, not row-wise, so it's essentially grabbing each and every entry from your RAM and not utilizing the cache at all.
Your list comprehension actually solves this issue, as it only deals with a small subset of the data at a given time, so all of it can sit in your cache. It still uses np.where, but forces row-wise access and therefore utilizes your cache.
The second problem is the calculation of x**2 and x**3, which is floating-point exponentiation and very expensive; consider replacing it with x*x and x*x*x.
What is the best way to vectorize this grid-construction?
Apparently you have already written it in your second method.
An even faster, but unnecessary, cache-based optimization is to write your own code in C and call it from within Python, so that x*x or x*x*x is only evaluated where it is needed and you never have to store x>y, x*x, and x*x*x; but the speedup won't be worth the trouble. A rough pure-numpy approximation of that idea is sketched below.
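Short of dropping to C, one way to avoid evaluating both branches over the full grid (my own sketch, not part of the original answer) is boolean indexing, combined with the repeated-multiply trick:

def f_masked(x, y):
    # evaluate each branch only where np.where would actually select it
    x, y = np.broadcast_arrays(x, y)
    mask = x > y
    out = np.empty(x.shape, dtype=x.dtype)
    xm = x[mask]
    out[mask] = xm * xm            # x**2 only where x > y
    xn = x[~mask]
    out[~mask] = xn * xn * xn      # x**3 everywhere else
    return out

This still allocates the mask and the output array, but the expensive power/multiply work is done once per element instead of over both full branches.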
I use the code below to create an empty matrix:
import numpy as np
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
print(x)
y =np.empty_like(x)
print(y)
# For y I get the data below:
[[2097184 2097184 2097184]
[2097184 2097184 2097184]
[2097184 2097184 2097184]
[2097184 2097184 2097184]]
Why does 2097184 stand for empty?
It doesn't stand for anything. From the documentation:
This function does not initialize the returned array; to do that use zeros_like or ones_like instead. It may be marginally faster than the functions that do set the array values.
So the contents of the array are whatever happens to be in the memory that it used for it. In this case, it was a bunch of 2097184 values. The next time you try it you'll probably get something different.
You use this when you don't care what's in the array, because you're going to overwrite it.
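A typical usage pattern (a minimal sketch) is to allocate with empty_like and then overwrite every element, so the garbage values never survive:

y = np.empty_like(x)   # uninitialized, same shape and dtype as x
y[:] = x * 2           # every element gets overwritten before use
print(y)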
The empty_like method does not initialize the array (that's why it's much faster than zeros_like and ones_like), so the shape of the array is exactly the same as x, but the values are uninitialized and are essentially whatever happened to be in the memory allocated for the array.
In addition, it's just a more efficient alternative to zeros_like or ones_like:
%%timeit
np.zeros_like(x)
>>> 18.4 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
np.ones_like(x)
>>> 14.1 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
np.empty_like(x)
>>> 2.09 µs ± 62.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Excuse me for my ignorance.
If numpy provides vectorized operations that make computation faster, how is it that, for data type conversion, pure Python is almost 8 times faster?
e.g.:
a = np.random.randint(0, 500, 100).astype(str)
b = np.random.randint(0, 500, 100).astype(str)
c = np.random.randint(0, 500, 100).astype(str)

def A(a, b, c):
    for i, j, k in zip(a, b, c):
        d, e, f = int(i), int(j), int(k)
        r = d + e - f
    return

def B(a, b, c):
    for i, j, k in zip(a, b, c):
        d, e, f = np.array([i, j, k]).astype(int)
        r = d + e - f
    return
Then,
%%timeit
A(a,b,c)
249 µs ± 3.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
B(a,b,c)
1.87 ms ± 4.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Thank you,
Ariel
Yes, NumPy does provide vectorized operations that make computations faster than vanilla Python code. However, you aren't using them.
NumPy is intended to perform operations across entire datasets, not many repeated operations across small chunks of a dataset. The latter causes the iteration to be done at the Python level, which increases runtime.
Your primary issue is that the only "vectorized" operation you are using is astype, but you're applying it to three elements at a time, still looping just as much as the naive Python solution. Combine that with the additional overhead of creating numpy arrays at each iteration of your loop, and it's no wonder your attempt with numpy is slower.
On tiny datasets, Python can be faster, since NumPy has overhead from creating arrays, passing objects to and from lower-level libraries, and so on. Let's take a look at the casting operation you are using on three elements at a time:
%timeit np.array(['1', '2', '3']).astype(int)
5.25 µs ± 89.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.array(['1', '2', '3'])
1.62 µs ± 42.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Over a quarter of the runtime is just from allocating the array! Compare this to your pure Python version:
%timeit a, b, c = int('1'), int('2'), int('3')
659 ns ± 50.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So if you operate only on chunks of this size, Python will beat NumPy.
But you have many more elements than just three, so NumPy can speed up your code substantially; you just need to change how you approach the problem. Instead of focusing on how the operation gets applied to individual scalars, think about how it gets applied to whole arrays.
To vectorize this problem, the general idea is:
Create a single array containing all your values
Convert the entire array to int with a single astype call.
Take advantage of elementwise operations to apply your desired arithmetic to the array.
It ends up looking like this:
def vectorized(a, b, c):
    u = np.array([a, b, c]).astype(int)
    return u[0] + u[1] - u[2]
Once you compare two approaches where NumPy is being used correctly, you will start to see large performance increases.
def python_loop(a, b, c):
    out = []
    for i, j, k in zip(a, b, c):
        d, e, f = int(i), int(j), int(k)
        out.append(d + e - f)
    return out
a, b, c = np.random.randint(0, 500, (3, 100_000)).astype(str)
In [255]: %timeit vectorized(a, b, c)
181 ms ± 6.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [256]: %timeit python_loop(a, b, c)
206 ms ± 7.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> np.array_equal(python_loop(a, b, c), vectorized(a, b, c))
True
Converting from strings to integers is not something that NumPy does much faster than pure Python; as you can see from the timings, the two are fairly close. However, by applying a vectorized approach, the comparison is at least much fairer.
It seems numpy.transpose only saves the strides and does the actual transpose lazily, according to this.
So when does the data movement actually happen, and how is it done? With many, many memcpy calls, or some other trick?
I follow the path:
array_reshape,
PyArray_Newshape,
PyArray_NewCopy,
PyArray_NewLikeArray,
PyArray_NewFromDescr,
PyArray_NewFromDescrAndBase,
PyArray_NewFromDescr_int
but I see nothing about the axis permutation. When does it actually happen?
Update 2021/1/19
Thanks for the answers. The numpy array copy with transpose is here, implemented with a common macro. The algorithm is quite naive and does not take SIMD acceleration or cache friendliness into account.
The answer to your question is: Numpy doesn't move data.
Did you see PyArray_Transpose on line 688 of the links above? There is a permute in this function:
n = permute->len;
axes = permute->ptr;
...
for (i = 0; i < n; i++) {
int axis = axes[i];
...
permutation[i] = axis;
}
Any array shape is purely metadata, used by Numpy to understand how to handle the data, since memory is always stored linearly and contiguously. There is therefore no reason to move or reorder any data. From the docs here:
Other operations, such as transpose, don't move data elements
around in the array, but rather change the information about the shape and strides so that the indexing of the array changes, but the data in the array doesn't move.
Typically these new versions of the array metadata but the same data buffer are
new 'views' into the data buffer. There is a different ndarray object, but it
uses the same data buffer. This is why it is necessary to force copies through
use of the .copy() method if one really wants to make a new and independent
copy of the data buffer.
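A quick way to see this in practice (a small illustration of my own, not from the docs) is to check that the transpose shares its parent's data buffer and only swaps the strides, while .copy() allocates a new buffer:

import numpy as np

a = np.arange(6).reshape(2, 3)
t = a.T                           # a view: new shape/strides, same buffer
print(np.shares_memory(a, t))     # True
print(a.strides, t.strides)       # e.g. (24, 8) vs (8, 24) -- simply swapped
t2 = a.T.copy()                   # forcing a copy gives an independent buffer
print(np.shares_memory(a, t2))    # False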
The only reason to copy may be to maximize cache efficiency, although Numpy already considers this,
As it turns out, numpy is smart enough when dealing with ufuncs to determine which index is the most rapidly varying one in memory and uses that for the innermost loop.
Tracing through the numpy C code is a slow and tedious process. I prefer to deduce patterns of behavior from timings.
Make a sample array and its transpose:
In [168]: A = np.random.rand(1000,1000)
In [169]: At = A.T
First, a fast view - no copying of the data buffer:
In [171]: timeit B = A.ravel()
262 ns ± 4.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
A fast copy (presumably using some fast block memory copying):
In [172]: timeit B = A.copy()
2.2 ms ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A slow copy (presumably requires traversing the source in its strided order, and the target in its own order):
In [173]: timeit B = A.copy(order='F')
6.29 ms ± 2.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Copying At without having to change the order - fast:
In [174]: timeit B = At.copy(order='F')
2.23 ms ± 51.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Like [173] but going from 'F' to 'C':
In [175]: timeit B = At.copy(order='C')
6.29 ms ± 4.16 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [176]: timeit B = At.ravel()
6.54 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Copies with simpler strided reordering fall somewhere in between:
In [177]: timeit B = A[::-1,::-1].copy()
3.75 ms ± 4.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [178]: timeit B = A[::-1].copy()
3.73 ms ± 6.48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [179]: timeit B = At[::-1].copy(order='K')
3.98 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This astype also requires the slower copy:
In [182]: timeit B = A.astype('float128')
6.7 ms ± 8.12 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
PyArray_NewFromDescr_int is described as a "Generic new array creation routine". While I can't figure out where it copies data from the source to the target, it is clearly checking order, strides, and dtype. Presumably it handles all cases where a generic copy is required. The axis permutation isn't a special case.