I get a big array (an image with 12 Mpix) in the array format from the Python standard library.
Since I want to perform operations on that array, I wish to convert it to a numpy array.
I tried the following:
import numpy
import array
from datetime import datetime
test = array.array('d', [0]*12000000)
t = datetime.now()
numpy.array(test)
print(datetime.now() - t)
I get a result of between one and two seconds: about as slow as looping in pure Python.
Is there a more efficient way of doing this conversion?
np.frombuffer is dramatically faster, because it wraps the existing buffer instead of copying the data:
np.array(test)                # 1.19 s
np.fromiter(test, dtype=int)  # 1.08 s
np.frombuffer(test)           # 459 ns !!!
asarray(x) is almost always the best choice for any array-like object.
array and fromiter are slow because they perform a copy. Using asarray allows this copy to be elided:
>>> import array
>>> import numpy as np
>>> test = array.array('d', [0]*12000000)
# very slow - this makes multiple copies that grow each time
>>> %timeit np.fromiter(test, dtype=test.typecode)
626 ms ± 3.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# fast memory copy
>>> %timeit np.array(test)
63.5 ms ± 639 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# which is equivalent to doing the fast construction followed by a copy
>>> %timeit np.asarray(test).copy()
63.4 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# so doing just the construction is way faster
>>> %timeit np.asarray(test)
1.73 µs ± 70.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# marginally faster, but at the expense of verbosity and type safety if you
# get the wrong type
>>> %timeit np.frombuffer(test, dtype=test.typecode)
1.07 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
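Note that the reason asarray (like frombuffer) can be this fast is that nothing is copied at all: the result shares memory with the original array.array, so writes through one are visible through the other. A minimal sketch:
import array
import numpy as np

test = array.array('d', [0.0] * 5)
view = np.asarray(test)   # wraps the existing buffer; no copy
view[0] = 42.0
print(test[0])            # 42.0 -- both objects see the same memory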
Related
I've been exploring the performance differences between numpy functions and Python's built-in functions, and I want to know how numpy functions are optimized such that there's almost a 100x speed-up.
Below is some code that I wrote to highlight the execution-time difference between numpy's mean() and a manual calculation of the mean using sum() and len():
import numpy as np
import time
n = 10**7
a = np.random.randn(n)
start = time.perf_counter()
mean = sum(a)/len(a)
seconds1 = time.perf_counter()-start
start = time.perf_counter()
mean = np.mean(a)
seconds2 = time.perf_counter()-start
print("First method takes time {:.3f}s".format(seconds1))
print("Second method takes time {:.3f}s".format(seconds2))
Output:
First method takes time 1.687s
Second method takes time 0.013s
Make a numpy array:
In [130]: a=np.arange(10000)
Apply the numpy sum function:
In [131]: timeit np.sum(a)
16.2 µs ± 22.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
mean is a bit slower, since it has to divide by the number of elements (and may perform a few other checks):
In [132]: timeit np.mean(a)
34.9 µs ± 198 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.sum actually delegates the action to the sum method of the array, so using that directly is a bit faster:
In [133]: timeit a.sum()
13.3 µs ± 25.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Python sum isn't a bad function, but it iterates over its argument. Iterating (in Python code) on an array is slow:
In [134]: timeit sum(a)
1.16 ms ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Converting the array to a list first saves time:
In [135]: timeit sum(a.tolist())
369 µs ± 7.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Better yet, if we time just the list operation on its own:
In [136]: %%timeit alist=a.tolist()
...: sum(alist)
57.2 µs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
When working with numpy arrays, it is best to use numpy's own methods (or numpy functions). Conversely, when using Python functions, it is generally better to work with lists.
Using a numpy function on a list is slow, because it has to first convert the list to an array:
In [137]: %%timeit alist=a.tolist()
...: np.sum(alist)
795 µs ± 28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I have the following issue: I have a matrix yj of size (m, 200) (m = 3683), and I have a dictionary that, for each key, returns a numpy array of row indices for yj (the size of the array differs for each key, in case anyone is wondering).
Now, I have to access this matrix lots of times (around 1M times) and my code is slowing down because of the indexing (I've profiled the code and it takes 65% of time on this step).
Here is what I've tried out:
First of all, use the indices for slicing:
>> %timeit yj[R_u_idx_train[1]]
10.5 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The variable R_u_idx_train is the dictionary that has the row indices.
I thought that maybe boolean indexing might be faster:
>> %timeit yj[R_u_idx_train_mask[1]]
10.5 µs ± 159 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
R_u_idx_train_mask is a dictionary that returns a boolean array of size m where the indices given by R_u_idx_train are set to True.
I also tried np.ix_:
>> cols = np.arange(0,200)
>> %timeit ix_ = np.ix_(R_u_idx_train[1], cols); yj[ix_]
42.1 µs ± 353 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I also tried np.take:
>> %timeit np.take(yj, R_u_idx_train[1], axis=0)
2.35 ms ± 88.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And while this seemed promising, it is not: it is slower, and it gives an array of shape (R_u_idx_train[1].shape[0], R_u_idx_train[1].shape[0]) when it should be (R_u_idx_train[1].shape[0], 200). I guess I'm not using the method correctly.
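For reference, a self-contained np.take call with axis=0 on made-up data does give the expected shape:
import numpy as np

yj_demo = np.zeros((3683, 200))              # hypothetical stand-in for yj
idx = np.array([0, 5, 17])                   # hypothetical row indices
print(np.take(yj_demo, idx, axis=0).shape)   # (3, 200), same as yj_demo[idx]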
I also tried np.compress:
>> %timeit np.compress(R_u_idx_train_mask[1], yj, axis=0)
14.1 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Finally, I tried indexing with a boolean matrix:
>> %timeit yj[R_u_idx_train_mask2[1]]
244 µs ± 786 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, is 10.5 µs ± 79.7 ns per loop the best I can do? I could try to use cython but that seems like a lot of work for just indexing...
Thanks a lot.
A very smart solution was given by V.Ayrat in the comments.
>> newdict = {k: yj[R_u_idx_train[k]] for k in R_u_idx_train.keys()}
>> %timeit newdict[1]
202 ns ± 6.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Anyway maybe it would still be cool to know if there is a way to speed it up using numpy!
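Part of why precomputing wins is that fancy indexing always copies the selected rows into a fresh array, so every lookup pays a cost proportional to the output size; caching the results pays it only once. A quick check with made-up data:
import numpy as np

yj = np.zeros((3683, 200))
idx = np.array([0, 5, 17])         # hypothetical row indices
sub = yj[idx]
print(np.shares_memory(yj, sub))   # False: the rows were copied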
I have a long unicode string:
import random
import re
import numpy as np

alphabet = range(0x0FFF)
mystr = ''.join(chr(random.choice(alphabet)) for _ in range(100))
mystr = re.sub(r'\W', '', mystr)  # strip non-word characters
I would like to view it as a series of code points, so at the moment, I am doing the following:
arr = np.array(list(mystr), dtype='U1')
I would like to be able to manipulate the string as numbers, and eventually get some different code points back. Now I'd like to invert the transformation:
mystr = ''.join(arr.tolist())
These transformations are reasonably fast and invertible, but take up an unnecessary amount of space with the list intermediary.
Is there a way to convert a numpy array of unicode characters to and from a Python string without converting to a list first?
Afterthoughts
I can get arr to appear as a single string with something like
buf = arr.view(dtype='U' + str(arr.size))
This results in a 1-element array containing the entire original. The inverse is possible as well:
buf.view(dtype='U1')
The only issue is that the type of the result is np.str_, not str.
fromiter works, but is really slow, since it goes through the iterator protocol. It's much faster to encode your data to UTF-32 (in system byte order) and use numpy.frombuffer:
In [56]: x = ''.join(chr(random.randrange(0x0fff)) for i in range(1000))
In [57]: codec = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'
In [58]: %timeit numpy.frombuffer(bytearray(x, codec), dtype='U1')
2.79 µs ± 47 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [59]: %timeit numpy.fromiter(x, dtype='U1', count=len(x))
122 µs ± 3.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [60]: numpy.array_equal(numpy.fromiter(x, dtype='U1', count=len(x)), numpy.frombuffer(bytearray(x, codec), dtype='U1'))
Out[60]: True
I've used sys.byteorder to determine whether to encode in utf-32-le or utf-32-be. Also, using bytearray instead of encode gets a mutable bytearray instead of an immutable bytes object, so the resulting array is writable.
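To see the difference writability makes, compare wrapping a mutable bytearray with wrapping an immutable bytes object (a small sketch, assuming a little-endian system as above):
import numpy as np

x = 'abc'
arr = np.frombuffer(bytearray(x, 'utf-32-le'), dtype='U1')
arr[0] = 'z'         # fine: the backing bytearray is mutable
frozen = np.frombuffer(x.encode('utf-32-le'), dtype='U1')
# frozen[0] = 'z'    # would raise ValueError: assignment destination is read-only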
As for the reverse conversion, arr.view(dtype=f'U{arr.size}')[0] works, but using item() is a bit faster and produces an ordinary string object, avoiding possible weird edge cases where numpy.str_ doesn't quite behave like str:
In [72]: a = numpy.frombuffer(bytearray(x, codec), dtype='U1')
In [73]: type(a.view(dtype=f'U{a.size}')[0])
Out[73]: numpy.str_
In [74]: type(a.view(dtype=f'U{a.size}').item())
Out[74]: str
In [75]: %timeit a.view(dtype=f'U{a.size}')[0]
3.63 µs ± 34 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [76]: %timeit a.view(dtype=f'U{a.size}').item()
2.14 µs ± 23.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Finally, be aware that NumPy doesn't handle nulls like normal Python string objects do. NumPy can't distinguish between 'asdf\x00\x00\x00' and 'asdf', so using NumPy arrays for string operations is not safe if your data may contain null code points.
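A small illustration of the null issue:
import numpy as np

a = np.array(['asdf\x00\x00'])
print(len(a[0]))        # 4 -- the trailing nulls are silently dropped
print(a[0] == 'asdf')   # True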
The fastest way I have found to convert a string to an array is
arr = np.array([mystr]).view(dtype='U1')
Another (slower) way to convert a string to an array of unicode code points, based on @Daniel Mesejo's comment:
arr = np.fromiter(mystr, dtype='U1', count=len(mystr))
Looking at the source code for fromiter shows that setting the count parameter to the length of the string will cause the entire array to be allocated at once, instead of performing multiple reallocations.
To convert back to a string:
str(arr.view(dtype=f'U{arr.size}')[0])
For most purposes, the final conversion to Python str is not necessary since np.str_ is a subclass of str.
arr.view(dtype=f'U{arr.size}')[0]
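A quick check of that subclass relationship:
import numpy as np

s = np.str_('abc')
print(isinstance(s, str))   # True: np.str_ subclasses str, so most string APIs accept it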
Appendix: Timing of frombuffer vs array
String length 100:
mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(100))
%timeit np.array([mystr]).view(dtype='U1')
1.43 µs ± 27.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
1.2 µs ± 9.06 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
String length 10000:
mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(10000))
%timeit np.array([mystr]).view(dtype='U1')
4.33 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
10.9 µs ± 29.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
String length 1000000:
mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(1000000))
%timeit np.array([mystr]).view(dtype='U1')
672 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
732 µs ± 5.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
It seems numpy.transpose only saves new strides and performs the transpose lazily, according to this.
So when does the data movement actually happen, and how is it done? With many calls to memcpy, or some other trick?
I followed the call path:
array_reshape,
PyArray_Newshape,
PyArray_NewCopy,
PyArray_NewLikeArray,
PyArray_NewFromDescr,
PyArray_NewFromDescrAndBase,
PyArray_NewFromDescr_int
but saw nothing about the axis permutation. Where does it actually happen?
Update 2021/1/19
Thanks for the answers. The numpy array copy with transpose is here, implemented with a common macro. The algorithm is quite naive and does not take SIMD acceleration or cache friendliness into account.
The answer to your question is: Numpy doesn't move data.
Did you see PyArray_Transpose on line 688 of the link above? There is a permutation in this function:
n = permute->len;
axes = permute->ptr;
...
for (i = 0; i < n; i++) {
    int axis = axes[i];
    ...
    permutation[i] = axis;
}
Any array shape is purely metadata, used by Numpy to understand how to handle the data, since memory is always stored linearly and contiguously. There is therefore no reason to move or reorder any data. From the docs here:
Other operations, such as transpose, don't move data elements
around in the array, but rather change the information about the shape and strides so that the indexing of the array changes, but the data in the data buffer doesn't move.
Typically these new versions of the array metadata but the same data buffer are
new 'views' into the data buffer. There is a different ndarray object, but it
uses the same data buffer. This is why it is necessary to force copies through
use of the .copy() method if one really wants to make a new and independent
copy of the data buffer.
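This is easy to verify directly; a small sketch:
import numpy as np

A = np.arange(6, dtype=np.int64).reshape(2, 3)
At = A.T
print(np.shares_memory(A, At))   # True: the transpose is a view, no data moved
print(A.strides, At.strides)     # (24, 8) vs (8, 24): only the metadata changed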
The only reason to copy may be to maximize cache efficiency, although Numpy already considers this,
As it turns out, numpy is smart enough when dealing with ufuncs to determine which index is the most rapidly varying one in memory and uses that for the innermost loop.
Tracing through the numpy C code is a slow and tedious process. I prefer to deduce patterns of behavior from timings.
Make a sample array and its transpose:
In [168]: A = np.random.rand(1000,1000)
In [169]: At = A.T
First a fast view - no copying of the data buffer:
In [171]: timeit B = A.ravel()
262 ns ± 4.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
A fast copy (presumably using some fast block memory copying):
In [172]: timeit B = A.copy()
2.2 ms ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A slow copy (presumably requires traversing the source in its strided order, and the target in its own order):
In [173]: timeit B = A.copy(order='F')
6.29 ms ± 2.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Copying At without having to change the order - fast:
In [174]: timeit B = At.copy(order='F')
2.23 ms ± 51.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Like [173] but going from 'F' to 'C':
In [175]: timeit B = At.copy(order='C')
6.29 ms ± 4.16 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [176]: timeit B = At.ravel()
6.54 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
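np.shares_memory confirms which of these operations produce views and which are forced to copy:
import numpy as np

A = np.random.rand(1000, 1000)
print(np.shares_memory(A, A.ravel()))    # True: A is C-contiguous, so ravel is a free view
print(np.shares_memory(A, A.T.ravel()))  # False: raveling the transpose forces a copy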
Copies with simpler strided reordering fall somewhere in between:
In [177]: timeit B = A[::-1,::-1].copy()
3.75 ms ± 4.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [178]: timeit B = A[::-1].copy()
3.73 ms ± 6.48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [179]: timeit B = At[::-1].copy(order='K')
3.98 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This astype also requires the slower copy:
In [182]: timeit B = A.astype('float128')
6.7 ms ± 8.12 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
PyArray_NewFromDescr_int is described as a "generic new array creation routine". While I can't figure out where it copies data from the source to the target, it is clearly checking order, strides, and dtype. Presumably it handles all cases where the generic copy is required. The axis permutation isn't a special case.
I am wondering if there is any downside to using b = np.array(a) rather than b = np.copy(a) to copy a Numpy array a into b. When I %timeit, the former can be up to 100% faster.
In both cases b is a is False, and I can manipulate b leaving a intact, so I suppose this does what is expected from .copy().
Am I missing anything? What is improper about using np.array to copy an array?
With Python 3.6.5 and numpy 1.14.2 (the speed difference closes rapidly for larger sizes):
a = np.arange(1000)
%timeit np.array(a)
501 ns ± 30.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.copy(a)
1.1 µs ± 35.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
From documentation of numpy.copy:
This is equivalent to:
>>> np.array(a, copy=True)
Also, if you look at the source code:
def copy(a, order='K'):
    return array(a, order=order, copy=True)
Some timings:
In [1]: import numpy as np
In [2]: a = np.ascontiguousarray(np.random.randint(0, 20000, 1000))
In [3]: %timeit b = np.array(a)
562 ns ± 10.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit b = np.array(a, order='K', copy=True)
1.1 µs ± 10.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit b = np.copy(a)
1.21 µs ± 9.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [6]: a = np.ascontiguousarray(np.random.randint(0, 20000, 1000000))
In [7]: %timeit b = np.array(a)
310 µs ± 6.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit b = np.array(a, order='K', copy=True)
311 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit b = np.copy(a)
313 µs ± 4.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [10]: print(np.__version__)
1.13.3
It is unexpected that simply explicitly setting parameters to their default values changes the speed of execution of np.array(). On the other hand, maybe just processing these explicit arguments adds enough execution time to make a difference for small arrays. Indeed, from the source code for the numpy.array(), one can see that there are many more checks and more processing being performed when keyword arguments are provided, for example, see goto full_path. When keyword parameters are not set, the execution skips all the way down to goto finish. This overhead (of additional processing of keyword arguments) is what you detect in timings for small arrays. For larger arrays this overhead is insignificant in comparison to the actual time of copying the arrays.
"What is improper about using np.array to do copy an array?"
I'd argue it is harder to read. Because it is not obvious that array makes a copy, for example, the similar asarray does not make a copy if it doesn't have to. The reader basically has to know the default value of the copy keyword argument to be sure.
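A quick demonstration of that difference:
import numpy as np

a = np.arange(5)
print(np.array(a) is a)    # False: np.array copies by default
print(np.asarray(a) is a)  # True: asarray returns the input itself when it can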
As AGN pointed out, np.array is faster than np.copy because the latter is essentially a wrapper around the former, so Python spends a little extra time dispatching through the additional function call. A similar thing happens with decorators.
This extra time is insignificant for practical purposes, and you gain better code readability.
You can test it by using a big array (where the array creation takes the main time), and you'll see very little differences in %timeit for both.