Setting structured array field in Numba - python

I would like to set an entire field of a NumPy structured scalar from within a Numba compiled nopython function. The desired_fn in the code below is a simple example of what I would like to do, and working_fn is an example of how I can currently accomplish this task.
import numpy as np
import numba as nb

test_numpy_dtype = np.dtype([("blah", np.int64)])
test_numba_dtype = nb.from_dtype(test_numpy_dtype)

@nb.njit
def working_fn(thing):
    for j in range(len(thing)):
        thing[j]['blah'] += j

@nb.njit
def desired_fn(thing):
    thing['blah'] += np.arange(len(thing))

a = np.zeros(3, test_numpy_dtype)
print(a)
working_fn(a)
print(a)
desired_fn(a)
The error generated from running desired_fn(a) is:
numba.errors.InternalError: unsupported array index type const('blah') in [const('blah')]
[1] During: typing of staticsetitem at /home/sam/PycharmProjects/ChessAI/playground.py (938)
This is needed for extremely performance-critical code that will be run billions of times, so eliminating the need for these types of loops seems crucial.

The following works (numba 0.37):
@nb.njit
def desired_fn(thing):
    thing.blah[:] += np.arange(len(thing))
    # or
    # thing['blah'][:] += np.arange(len(thing))
If you are operating primarily on columns of your data instead of rows, you might consider using a different data container. A numpy structured array is laid out like a vector of structs rather than a struct of arrays. This means that when you want to update blah, you are moving through non-contiguous memory space as you traverse the array.
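For illustration, here is a minimal struct-of-arrays sketch (the columnar layout and the names are mine, not from the question): each field lives in its own contiguous 1-D array, which sidesteps the field-indexing issue above entirely.

import numpy as np
import numba as nb

# One plain contiguous array per field instead of one structured array of records.
blah_col = np.zeros(1_000_000, dtype=np.int64)

@nb.njit
def desired_fn_columnar(blah_col):
    # Operates on a plain 1-D array, so the vectorized form compiles fine.
    blah_col += np.arange(len(blah_col))

desired_fn_columnar(blah_col)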
Also, with any code optimization, it's always worth using timeit or some other timing harness (one that excludes the time required to jit the code) to see what the actual performance is. You might find with numba that explicit looping, while more verbose, is actually faster than your vectorized code.
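As a minimal sketch of such a harness (reusing a and working_fn from the question; the repetition count is arbitrary): call the jitted function once so compilation happens up front, then time only the compiled calls.

import timeit

working_fn(a)   # first call triggers jit compilation; excluded from the measurement
print(timeit.timeit(lambda: working_fn(a), number=10_000))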

Without numba, accessing field values is no slower than accessing columns of a 2d array:
In [1]: arr2 = np.zeros((10000), dtype='i,i')
In [2]: arr2.dtype
Out[2]: dtype([('f0', '<i4'), ('f1', '<i4')])
Modifying a field:
In [4]: %%timeit x = arr2.copy()
...: x['f0'] += 1
...:
16.2 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Similar time if I assign the field to a new variable:
In [5]: %%timeit x = arr2.copy()['f0']
...: x += 1
...:
15.2 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Much faster if I construct a 1d array of the same size:
In [6]: %%timeit x = np.zeros(arr2.shape, int)
...: x += 1
...:
8.01 µs ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But similar time when accessing the column of a 2d array:
In [7]: %%timeit x = np.zeros((arr2.shape[0],2), int)
...: x[:,0] += 1
...:
17.3 µs ± 23.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Related

Slow `statistics` functions

Why is statistics.mean() so slow compared to the NumPy version, or even to a naive implementation such as:
def mean(items):
    return sum(items) / len(items)
On my system, I get the following timings:
import numpy as np
import statistics
ll_int = [x for x in range(100_000)]
%timeit statistics.mean(ll_int)
# 42 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sum(ll_int) / len(ll_int)
# 460 µs ± 5.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.mean(ll_int)
# 4.62 ms ± 38.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
ll_float = [x / 10 for x in range(100_000)]
%timeit statistics.mean(ll_float)
# 56.7 ms ± 879 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sum(ll_float) / len(ll_float)
# 459 µs ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.mean(ll_float)
# 2.7 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I get similar timings for other functions like variance or stdev.
EDIT:
Even an iterative implementation like this:
def next_mean(value, mean_, num):
    return (num * mean_ + value) / (num + 1)

def imean(items, mean_=0.0):
    for i, item in enumerate(items):
        mean_ = next_mean(item, mean_, i)
    return mean_
seems to be faster:
%timeit imean(ll_int)
# 16.6 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit imean(ll_float)
# 16.2 ms ± 429 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The statistics module uses interpreted Python code, but numpy is using optimized compiled code for all of its heavy lifting, so it would be surprising if numpy didn't blow statistics out of the water.
Furthermore, statistics is designed to play nice with modules like decimal and fractions and uses code which values numerical accuracy and type safety over speed. Your naive implementation uses sum. The statistics module uses its own function called _sum internally. Looking at its source shows that it does an awful lot more than just add things together:
def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    If optional argument ``start`` is given, it is added to the total.
    If ``data`` is empty, ``start`` (defaulting to 0) is returned.

    Examples
    --------
    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    (<class 'float'>, Fraction(11, 1), 5)

    Some sources of round-off error will be avoided:

    # Built-in sum returns zero.
    >>> _sum([1e50, 1, -1e50] * 1000)
    (<class 'float'>, Fraction(1000, 1), 3000)

    Fractions and Decimals are also supported:

    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    (<class 'fractions.Fraction'>, Fraction(63, 20), 4)

    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)

    Mixed types are currently treated as an error, except that int is
    allowed.
    """
    count = 0
    n, d = _exact_ratio(start)
    partials = {d: n}
    partials_get = partials.get
    T = _coerce(int, type(start))
    for typ, values in groupby(data, type):
        T = _coerce(T, typ)  # or raise TypeError
        for n, d in map(_exact_ratio, values):
            count += 1
            partials[d] = partials_get(d, 0) + n
    if None in partials:
        # The sum will be a NAN or INF. We can ignore all the finite
        # partials, and just look at this special one.
        total = partials[None]
        assert not _isfinite(total)
    else:
        # Sum all the partial sums using builtin sum.
        # FIXME is this faster if we sum them in order of the denominator?
        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)
The most surprising thing about this code is that it converts the data to fractions so as to minimize round-off error. There is no reason to expect that code like this would be as quick as a simple sum(nums)/len(nums) approach.
The developer of the statistics module made an explicit decision to value correctness over speed:
Correctness over speed. It is easier to speed up a correct but slow function than to correct a fast but buggy one.
and moreover stated that there was no intention
to replace, or even compete directly with, numpy
However, an enhancement request was raised to add an additional, faster, simpler implementation, statistics.fmean, and this function will be released in Python 3.8. According to the enhancement's developer, this function is up to 500 times faster than the existing statistics.mean.
The fmean implementation is pretty much sum/len.
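As a rough sketch of that idea (this is not the actual CPython implementation of statistics.fmean, which also handles iterators without len() and raises on empty data):

from math import fsum

def fmean_sketch(data):
    data = list(data)              # materialize so len() is available
    return fsum(data) / len(data)  # fsum keeps float round-off low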

Python `expm` of an `(N,M,M)` matrix

Let A be an (N, M, M) array (with N very large); I would like to compute scipy.linalg.expm(A[n,:,:]) for each n in range(N). I can of course just use a for loop, but I was wondering if there is some trick to do this in a better way (something like np.einsum).
I have the same question for other operations like inverting matrices (inverting solved in comments).
Depending on the size and structure of your matrices you can do better than loop.
Assuming your matrices can be diagonalized as A = V D V^(-1) (where D has the eigenvalues in its diagonal and V contains the corresponding eigenvectors as columns), you can compute the matrix exponential as
exp(A) = V exp(D) V^(-1)
where exp(D) simply contains exp(lambda) for each eigenvalue lambda in its diagonal. This is really easy to prove if we use the power series definition of the exponential function. If the matrix A is furthermore normal, the matrix V is unitary and thus its inverse can be computed by simply taking its adjoint.
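For completeness, the one-line power-series argument: since (V D V^(-1))^k = V D^k V^(-1) (the inner V^(-1) V pairs cancel),

exp(A) = sum_k A^k / k! = sum_k V D^k V^(-1) / k! = V (sum_k D^k / k!) V^(-1) = V exp(D) V^(-1).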
The good news is that numpy.linalg.eig and numpy.linalg.inv both work with stacked matrices just fine:
import numpy as np
import scipy.linalg

A = np.random.rand(1000, 10, 10)

def loopy_expm(A):
    expmA = np.zeros_like(A)
    for n in range(A.shape[0]):
        expmA[n, ...] = scipy.linalg.expm(A[n, ...])
    return expmA

def eigy_expm(A):
    vals, vects = np.linalg.eig(A)
    return np.einsum('...ik, ...k, ...kj -> ...ij',
                     vects, np.exp(vals), np.linalg.inv(vects))
Note that there's probably some room for optimization in specifying the order of operations in the call to einsum, but I didn't investigate that.
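One cheap thing to try (an assumption on my part, not benchmarked here) is to let einsum choose a contraction order itself by passing optimize=True, or to inspect the planned order with np.einsum_path, using the vals and vects from eigy_expm above:

path, info = np.einsum_path('...ik, ...k, ...kj -> ...ij',
                            vects, np.exp(vals), np.linalg.inv(vects),
                            optimize='optimal')
print(info)   # shows the chosen contraction order and the estimated FLOP savings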
Testing the above for the random array:
In [59]: np.allclose(loopy_expm(A),eigy_expm(A))
Out[59]: True
In [60]: %timeit loopy_expm(A)
824 ms ± 55.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [61]: %timeit eigy_expm(A)
138 ms ± 992 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That's already nice. If you're lucky enough that your matrices are all normal (say, because they are real symmetric):
A = np.random.rand(1000, 10, 10)
A = (A + A.transpose(0, 2, 1)) / 2

def eigy_expm_normal(A):
    vals, vects = np.linalg.eig(A)
    return np.einsum('...ik, ...k, ...jk -> ...ij',
                     vects, np.exp(vals), vects.conj())
Note the symmetric definition of the input matrix and the transpose inside the pattern of einsum. Results:
In [80]: np.allclose(loopy_expm(A),eigy_expm_normal(A))
Out[80]: True
In [79]: %timeit loopy_expm(A)
878 ms ± 89.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [80]: %timeit eigy_expm_normal(A)
55.8 ms ± 868 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That is a 15-fold speedup for the above example shapes.
It should be noted, though, that scipy.linalg.expm uses Padé approximation according to the documentation. This might imply that if your matrices are ill-conditioned, the eigenvalue decomposition may yield different results than scipy.linalg.expm. I'm not familiar with how that function works, but I expect it to be safer for pathological inputs.
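As a quick illustration of that caveat (the matrix below is a made-up, nearly defective example, not from the question), the two routes can start to disagree when the eigenvector matrix becomes ill-conditioned:

M = np.array([[[1.0, 1.0],
               [0.0, 1.0 + 1e-12]]])   # nearly a Jordan block
print(np.abs(loopy_expm(M) - eigy_expm(M)).max())   # difference between the two routes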

Is it efficient to build a list with a generator function

When reading the book 'Effective Python' by Brett Slatkin I noticed that the author suggested that sometimes building a list using a generator function and calling list on the resulting iterator could lead to cleaner, more readable code.
So an example:
num_list = range(100)

def num_squared_iterator(nums):
    for i in nums:
        yield i**2

def get_num_squared_list(nums):
    l = []
    for i in nums:
        l.append(i**2)
    return l
Where a user could call
l = list(num_squared_iterator(num_list))
or
l = get_num_squared_list(num_list)
and get the same result.
The suggestion was that the generator function has less noise because it is shorter and does not have the extra code for creating the list and appending values to it.
(Note: clearly, for these simple examples a list comprehension or generator expression would be better, but let us take it as given that this is a simplification of a pattern used for more complex code that would not be clear in a list comprehension.)
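For reference, the comprehension forms that note alludes to look like this (names are mine):

squares_list = [i**2 for i in num_list]   # list comprehension
squares_gen = (i**2 for i in num_list)    # generator expression (lazy)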
My question is this: is there a cost to wrapping the generator in a list? Is it equivalent in performance to the list-building function?
Seeing this I decided to do a quick test and wrote and ran the following code:
from functools import wraps
from time import time

TEST_DATA = range(100)

def timeit(func):
    @wraps(func)
    def wrapped(*args, **kwargs):
        start = time()
        func(*args, **kwargs)
        end = time()
        print(f'running time for {func.__name__} = {end-start}')
    return wrapped

def num_squared_iterator(nums):
    for i in nums:
        yield i**2

@timeit
def get_num_squared_list(nums):
    l = []
    for i in nums:
        l.append(i**2)
    return l

@timeit
def get_num_squared_list_from_iterator(nums):
    return list(num_squared_iterator(nums))

if __name__ == '__main__':
    get_num_squared_list(TEST_DATA)
    get_num_squared_list_from_iterator(TEST_DATA)
I ran the test code many times and each time (much to my surprise) the get_num_squared_list_from_iterator function actually ran (fractionally) faster than the get_num_squared_list function.
Here are results for my first few runs:
1.
running time for get_num_squared_list = 5.2928924560546875e-05
running time for get_num_squared_list_from_iterator = 5.0067901611328125e-05
2.
running time for get_num_squared_list = 5.3882598876953125e-05
running time for get_num_squared_list_from_iterator = 4.982948303222656e-05
3.
running time for get_num_squared_list = 5.1975250244140625e-05
running time for get_num_squared_list_from_iterator = 4.76837158203125e-05
I am guessing that this is due to the expense of doing a list.append in each iteration of the loop in the get_num_squared_list function.
I find this interesting because not only is the code clear and elegant it seems more performant.
I can confirm that your generator with list example is faster:
In [4]: def num_squared_iterator(nums):
   ...:     for i in nums:
   ...:         yield i**2
   ...:
   ...: def get_num_squared_list(nums):
   ...:     l = []
   ...:     for i in nums:
   ...:         l.append(i**2)
   ...:     return l
   ...:
In [5]: %timeit list(num_squared_iterator(nums))
320 µs ± 4.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: %timeit get_num_squared_list(nums)
370 µs ± 25.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: nums = range(100000)
In [8]: %timeit list(num_squared_iterator(nums))
33.2 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit get_num_squared_list(nums)
36.3 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, there is more to the story. Conventional wisdom is that generators are slower than iterating over other types of iterables; there is a lot of overhead to generators. But using list() pushes the list-building code down to the C level, so you are sort of seeing a middle ground. Note that the for-loop version can be micro-optimized thusly:
In [10]: def get_num_squared_list_microoptimized(nums):
    ...:     l = []
    ...:     append = l.append
    ...:     for i in nums:
    ...:         append(i**2)
    ...:     return l
    ...:
In [11]: %timeit list(num_squared_iterator(nums))
33.4 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [12]: %timeit get_num_squared_list(nums)
36.5 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [13]: %timeit get_num_squared_list_microoptimized(nums)
33.3 ms ± 487 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
And now you see that a lot of the difference in the approaches can be ameliorated if you "inline" l.append (which is what the list constructor avoids). In general, method resolution is slow in Python. In tight loops, the above micro-optimization is well known and is sort of the first step one would take to make your for-loops more performant.

Does Numpy fancy indexing copy values directly to another array?

According to the documentation that I could find, when using fancy indexing a copy rather than a view is returned. However, I couldn't figure out what its behavior is during assignment to another array, for instance:
A = np.arange(0,10)
B = np.arange(-10,0)
fancy_slice = np.array([0,3,5])
A[fancy_slice] = B[fancy_slice]
I understand that A will just receive a call to __setitem__ while B will get a call to __getitem__. What I am concerned about is whether an intermediate array is created before copying the values over to A.
The interpreter will parse the code and issue the method calls as:
A[idx] = B[idx]
A.__setitem__(idx, B.__getitem__(idx))
The B method is evaluated fully before being passed to the A method. numpy doesn't alter the Python interpreter or its syntax. Rather it just adds functions, objects, and methods.
Functionally, it should be the equivalent to
temp = B[idx]
A[idx] = temp
del temp
We could do some timeit comparisons just to be sure.
In [712]: A = np.zeros(10000,int)
In [713]: B = np.arange(10000)
In [714]: idx = np.arange(0,10000,100)
In [715]: timeit A[idx] = B[idx]
1.2 µs ± 3.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [716]: %%timeit
...: temp = B[idx]
...: A[idx] = temp
...:
1.11 µs ± 0.669 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
There are some alternative functions/methods, like add.at, copyto, place, put, that may do some copies without an intermediate, but I haven't used them much. This indexed assignment is good enough - most of the time.
Example with copyto
In [718]: wh = np.zeros(A.shape, bool)
In [719]: wh[idx] = True
In [721]: np.copyto(A, B, where=wh)
In [722]: timeit np.copyto(A, B, where=wh)
7.47 µs ± 9.92 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So even without timing the construction of the boolean mask, copyto is slower.
put and take are no better:
In [727]: timeit np.put(A,idx, np.take(B,idx))
7.98 µs ± 8.34 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
An intermediate array is created. It has to be created. NumPy doesn't see
A[fancy_slice] = B[fancy_slice]
It sees
B[fancy_slice]
on its own, with no idea what the context is. This operation is defined to make a new array, and NumPy makes a new array.
Then, NumPy sees
A[fancy_slice] = <the array created by the previous operation>
and copies the data into A.
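One way to convince yourself that B[fancy_slice] really is a fresh array rather than a view is np.shares_memory (the check below is just illustrative):

import numpy as np

A = np.arange(0, 10)
B = np.arange(-10, 0)
fancy_slice = np.array([0, 3, 5])

temp = B[fancy_slice]
print(np.shares_memory(B, temp))   # False: advanced indexing makes a copy
temp[0] = 999
print(B[0])                        # still -10; B is untouched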

How to get random.sample() from deque in Python 3?

I have a collections.deque() of tuples from which I want to draw random samples.
In Python 2.7, I can use batch = random.sample(my_deque, batch_size).
But in Python 3.4 this raises TypeError: Population must be a sequence or set. For dicts, use list(d).
What's the best workaround, or recommended way to sample efficiently from a deque in Python 3?
The obvious way – convert to a list.
batch = random.sample(list(my_deque), batch_size)
But you can avoid creating an entire list.
idx_batch = set(sample(range(len(my_deque)), batch_size))
batch = [val for i, val in enumerate(my_deque) if i in idx_batch]
P.S. (Edited)
Actually, random.sample works fine with deques in Python >= 3.5, because the class has been updated to match the Sequence interface.
In [3]: deq = collections.deque(range(100))
In [4]: random.sample(deq, 10)
Out[4]: [12, 64, 84, 77, 99, 69, 1, 93, 82, 35]
Note: as Geoffrey Irving has correctly stated in the comment below, you'd better convert the deque into a list, because deques are implemented as linked lists, making each index access O(n) in the size of the deque, so sampling m random values takes O(m*n) time.
sample() on a deque works fine in Python ≥3.5, and it's pretty fast.
In Python 3.4, you could use this instead, which runs about as fast:
sample_indices = sample(range(len(deq)), 50)
[deq[index] for index in sample_indices]
On my MacBook using Python 3.6.8, this solution is over 44 times faster than Eli Korvigo's solution. :)
I used a deque with 1 million items, and I sampled 50 items:
from random import sample
from collections import deque
deq = deque(maxlen=1000000)
for i in range(1000000):
    deq.append(i)
sample_indices = set(sample(range(len(deq)), 50))
%timeit [deq[i] for i in sample_indices]
1.68 ms ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sample(deq, 50)
1.94 ms ± 60.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit sample(range(len(deq)), 50)
44.9 µs ± 549 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit [val for index, val in enumerate(deq) if index in sample_indices]
75.1 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That said, as others have pointed out, a deque is not well suited for random access. If you want to implement a replay memory, you could instead use a rotating list like this:
class ReplayMemory:
    def __init__(self, max_size):
        self.buffer = [None] * max_size
        self.max_size = max_size
        self.index = 0
        self.size = 0

    def append(self, obj):
        self.buffer[self.index] = obj
        self.size = min(self.size + 1, self.max_size)
        self.index = (self.index + 1) % self.max_size

    def sample(self, batch_size):
        indices = sample(range(self.size), batch_size)
        return [self.buffer[index] for index in indices]
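The timing below presumably uses a buffer filled roughly like this (the construction of mem is not shown in the answer, so this is an assumption):

mem = ReplayMemory(max_size=1000000)
for i in range(1000000):
    mem.append(i)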
With a million items, sampling 50 items is blazingly fast:
%timeit mem.sample(50)
# 58 µs ± 691 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
