Python `expm` of an `(N,M,M)` matrix

Let A be an (N, M, M) array (with N very large), and I would like to compute scipy.linalg.expm(A[n,:,:]) for each n in range(N). I can of course just use a for loop, but I was wondering whether there is some trick to do this in a better way (something like np.einsum).
I have the same question for other operations like inverting matrices (inversion was solved in the comments).

Depending on the size and structure of your matrices you can do better than loop.
Assuming your matrices can be diagonalized as A = V D V^(-1) (where D has the eigenvalues on its diagonal and V contains the corresponding eigenvectors as columns), you can compute the matrix exponential as
exp(A) = V exp(D) V^(-1)
where exp(D) simply contains exp(lambda) on its diagonal for each eigenvalue lambda. This is really easy to prove using the power series definition of the exponential function. If the matrix A is furthermore normal, the matrix V is unitary and thus its inverse can be computed by simply taking its adjoint.
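For completeness, the proof is one line with the power series: exp(A) = sum_k A^k/k! = sum_k (V D V^(-1))^k/k! = V (sum_k D^k/k!) V^(-1) = V exp(D) V^(-1), because in each power (V D V^(-1))^k the inner V^(-1) V factors cancel, leaving V D^k V^(-1).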
The good news is that numpy.linalg.eig and numpy.linalg.inv both work with stacked matrices just fine:
import numpy as np
import scipy.linalg

A = np.random.rand(1000,10,10)

def loopy_expm(A):
    expmA = np.zeros_like(A)
    for n in range(A.shape[0]):
        expmA[n,...] = scipy.linalg.expm(A[n,...])
    return expmA

def eigy_expm(A):
    vals,vects = np.linalg.eig(A)
    return np.einsum('...ik, ...k, ...kj -> ...ij',
                     vects,np.exp(vals),np.linalg.inv(vects))
Note that there's probably some room for optimization in specifying the order of operations in the call to einsum, but I didn't investigate that.
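If you want to experiment with that, one low-effort option is einsum's optimize keyword, which lets NumPy pick a contraction order itself; a minimal sketch (only the keyword is new):

def eigy_expm_opt(A):
    vals, vects = np.linalg.eig(A)
    # let einsum choose the contraction order instead of evaluating left to right
    return np.einsum('...ik, ...k, ...kj -> ...ij',
                     vects, np.exp(vals), np.linalg.inv(vects),
                     optimize=True)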
Testing the above for the random array:
In [59]: np.allclose(loopy_expm(A),eigy_expm(A))
Out[59]: True
In [60]: %timeit loopy_expm(A)
824 ms ± 55.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [61]: %timeit eigy_expm(A)
138 ms ± 992 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That's already nice. If you're lucky enough that your matrices are all normal (say, because they are real symmetric):
A = np.random.rand(1000,10,10)
A = (A + A.transpose(0,2,1))/2

def eigy_expm_normal(A):
    vals,vects = np.linalg.eig(A)
    return np.einsum('...ik, ...k, ...jk -> ...ij',
                     vects,np.exp(vals),vects.conj())
Note the symmetrized definition of the input array and the transposed index pattern (...jk) in the einsum call, which uses the conjugate transpose of vects in place of its inverse. Results:
In [80]: np.allclose(loopy_expm(A),eigy_expm_normal(A))
Out[80]: True
In [79]: %timeit loopy_expm(A)
878 ms ± 89.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [80]: %timeit eigy_expm_normal(A)
55.8 ms ± 868 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That is a 15-fold speedup for the above example shapes.
It should be noted though that scipy.linalg.expm uses Padé approximation according to the documentation. This might imply that if your matrices are ill-conditioned, the eigenvalue decomposition may yield different results than scipy.linalg.expm. I'm not familiar with how this function works, but I expect it to be safer for pathological inputs.
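If that worries you, a cheap safeguard is to spot-check a few slices of your actual data against scipy.linalg.expm before trusting the fast path; a minimal sketch (the sample size is an arbitrary choice):

def check_eigy_expm(A, sample_size=10, rtol=1e-6):
    # compare the eig-based result with scipy's expm on a few random slices
    idx = np.random.choice(A.shape[0], size=min(sample_size, A.shape[0]), replace=False)
    return np.allclose(eigy_expm(A[idx]), loopy_expm(A[idx]), rtol=rtol)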

Related

np.dot takes a very long time to multiply GF(4) field matrices

Multiplying large matrices takes a very long time. How can this problem be solved? I use the galois library and numpy, and I think it should still work stably. I tried implementing my own GF(4) arithmetic and multiplying the matrices with numpy, but that takes even longer. Thank you for your reply.
For r = 2, 3, 4, 5, 6 the multiplication is quick, but after that it takes a long time. To me these are not very large matrix sizes. This is just a code snippet: given r, I compute the sizes n, k of matrices of a certain family, and I need to multiply matrices with those parameters.
import numpy as np
import galois

def family_Hamming(q,r):
    n = int((q**r-1)/(q-1))
    k = int((q**r-1)/(q-1)-r)
    res = (n,k)
    return res

q = 4
r = 7
n,k = family_Hamming(q,r)

GF = galois.GF(2**2)

# a has shape (5454, 5454)
a = GF(np.random.randint(4, size=(k, k)))
# b has shape (5454, 5461)
b = GF(np.random.randint(4, size=(k, n)))

c = np.dot(a,b)
print(c)
I'm not sure if it is actually faster, but np.dot should be used for the dot product of two vectors; for matrix multiplication use A @ B. That's as efficient as you can get with Python as far as I know.
I'm the author of galois. I added performance improvements to matrix multiplication in v0.3.0 by parallelizing the arithmetic over multiple cores. The next performance improvement will come once GPU support is added.
I'm open to other performance improvement suggestions, but as far as I know the algorithm is running as fast as possible on a CPU.
In [1]: import galois
In [2]: GF = galois.GF(2**2)
In [3]: A = GF.Random((300, 400), seed=1)
In [4]: B = GF.Random((400, 500), seed=2)
# v0.2.0
In [5]: %timeit A @ B
1.02 s ± 7.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# v0.3.0
In [5]: %timeit A @ B
99 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Try using jax on a CUDA runtime. For example, you can try it out on Google Colab's free GPU. (Open a notebook -> Runtime -> Change runtime type -> GPU).
import numpy as np
import jax.numpy as jnp
from jax import device_put

# GF, k and n are the same as in the question's setup above
a = GF(np.random.randint(4, size=(k, k)))
b = GF(np.random.randint(4, size=(k, n)))

a, b = device_put(a), device_put(b)

c = jnp.dot(a, b)
c = np.asarray(c)
Timing test:
%timeit jnp.dot(a, b).block_until_ready()
# 765 ms ± 96.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Slow `statistics` functions

Why is statistics.mean() so slow compared to the NumPy version, or even to a naive implementation such as:
def mean(items):
    return sum(items) / len(items)
On my system, I get the following timings:
import numpy as np
import statistics
ll_int = [x for x in range(100_000)]
%timeit statistics.mean(ll_int)
# 42 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sum(ll_int) / len(ll_int)
# 460 µs ± 5.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.mean(ll_int)
# 4.62 ms ± 38.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
ll_float = [x / 10 for x in range(100_000)]
%timeit statistics.mean(ll_float)
# 56.7 ms ± 879 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sum(ll_float) / len(ll_float)
# 459 µs ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.mean(ll_float)
# 2.7 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I get similar timings for other functions like variance or stdev.
EDIT:
Even an iterative implementation like this:
def next_mean(value, mean_, num):
    return (num * mean_ + value) / (num + 1)

def imean(items, mean_=0.0):
    for i, item in enumerate(items):
        mean_ = next_mean(item, mean_, i)
    return mean_
seems to be faster:
%timeit imean(ll_int)
# 16.6 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit imean(ll_float)
# 16.2 ms ± 429 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The statistics module uses interpreted Python code, but numpy is using optimized compiled code for all of its heavy lifting, so it would be surprising if numpy didn't blow statistics out of the water.
Furthermore, statistics is designed to play nice with modules like decimal and fractions and uses code which values numerical accuracy and type safety over speed. Your naive implementation uses sum. The statistics module uses its own function called _sum internally. Looking at its source shows that it does an awful lot more than just add things together:
def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    If optional argument ``start`` is given, it is added to the total.
    If ``data`` is empty, ``start`` (defaulting to 0) is returned.

    Examples
    --------
    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    (<class 'float'>, Fraction(11, 1), 5)

    Some sources of round-off error will be avoided:

    # Built-in sum returns zero.
    >>> _sum([1e50, 1, -1e50] * 1000)
    (<class 'float'>, Fraction(1000, 1), 3000)

    Fractions and Decimals are also supported:

    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    (<class 'fractions.Fraction'>, Fraction(63, 20), 4)

    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)

    Mixed types are currently treated as an error, except that int is
    allowed.
    """
    count = 0
    n, d = _exact_ratio(start)
    partials = {d: n}
    partials_get = partials.get
    T = _coerce(int, type(start))
    for typ, values in groupby(data, type):
        T = _coerce(T, typ)  # or raise TypeError
        for n, d in map(_exact_ratio, values):
            count += 1
            partials[d] = partials_get(d, 0) + n
    if None in partials:
        # The sum will be a NAN or INF. We can ignore all the finite
        # partials, and just look at this special one.
        total = partials[None]
        assert not _isfinite(total)
    else:
        # Sum all the partial sums using builtin sum.
        # FIXME is this faster if we sum them in order of the denominator?
        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)
The most surprising thing about this code is that it converts the data to fractions so as to minimize round-off error. There is no reason to expect that code like this would be as quick as a simple sum(nums)/len(nums) approach.
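A small demonstration of the kind of error this avoids (the data follows the docstring example above):

import statistics

data = [1e50, 1, -1e50] * 1000
print(sum(data) / len(data))   # 0.0 -- the 1s are lost to float rounding
print(statistics.mean(data))   # ~0.3333 -- the Fraction-based sum keeps them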
The developer of the statistics module made an explicit decision to value correctness over speed:
Correctness over speed. It is easier to speed up a correct but slow function than to correct a fast but buggy one.
and moreover stated that there was no intention
to replace, or even compete directly with, numpy
However, an enhancement request was raised to add an additional, faster, simpler implementation, statistics.fmean, and this function was released in Python 3.8. According to the enhancement developer this function is up to 500 times faster than the existing statistics.mean.
The fmean implementation is pretty much sum/len.
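For reference, a minimal usage sketch on Python 3.8+ (same list as in the question):

import statistics

ll_float = [x / 10 for x in range(100_000)]
statistics.fmean(ll_float)  # float-only mean; skips the exact-Fraction machinery of statistics.mean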

Setting structured array field in Numba

I would like to set an entire field of a NumPy structured scalar from within a Numba compiled nopython function. The desired_fn in the code below is a simple example of what I would like to do, and working_fn is an example of how I can currently accomplish this task.
import numpy as np
import numba as nb

test_numpy_dtype = np.dtype([("blah", np.int64)])
test_numba_dtype = nb.from_dtype(test_numpy_dtype)

@nb.njit
def working_fn(thing):
    for j in range(len(thing)):
        thing[j]['blah'] += j

@nb.njit
def desired_fn(thing):
    thing['blah'] += np.arange(len(thing))

a = np.zeros(3,test_numpy_dtype)
print(a)
working_fn(a)
print(a)
desired_fn(a)
The error generated from running desired_fn(a) is:
numba.errors.InternalError: unsupported array index type const('blah') in [const('blah')]
[1] During: typing of staticsetitem at /home/sam/PycharmProjects/ChessAI/playground.py (938)
This is needed for extremely performance critical code, and will be run billions of times, so eliminating the need for these types of loops seems to be crucial.
The following works (numba 0.37):
@nb.njit
def desired_fn(thing):
    thing.blah[:] += np.arange(len(thing))
    # or
    # thing['blah'][:] += np.arange(len(thing))
If you are operating primarily on columns of your data instead of rows, you might consider using a different data container. A numpy structured array is laid out like a vector of structs rather than a struct of arrays. This means that when you want to update blah, you are moving through non-contiguous memory space as you traverse the array.
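For illustration, here is a minimal sketch of that struct-of-arrays idea: keep each field as its own contiguous 1-D array and pass that array to the jitted function (the names are made up for the example):

import numpy as np
import numba as nb

# one contiguous array per field instead of one structured array
blah = np.zeros(3, dtype=np.int64)

@nb.njit
def desired_fn_soa(field):
    # broadcasting over a plain 1-D array is supported in nopython mode
    field += np.arange(len(field))

desired_fn_soa(blah)
print(blah)  # [0 1 2]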
Also, with any code optimizations, it's always worth using timeit or some other timing harness (one that excludes the time required to jit the code) to see the actual performance. You might find with numba that explicit looping, while more verbose, can actually be faster than your vectorized code.
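For example, something along these lines (just a sketch; it reuses working_fn and the fixed desired_fn from above, and calls each function once first so JIT compilation is excluded from the measurement):

import timeit

big = np.zeros(1_000_000, test_numpy_dtype)

working_fn(big)   # first calls trigger compilation
desired_fn(big)

print(timeit.timeit(lambda: working_fn(big), number=100))
print(timeit.timeit(lambda: desired_fn(big), number=100))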
Without numba, accessing field values is no slower than accessing columns of a 2d array:
In [1]: arr2 = np.zeros((10000), dtype='i,i')
In [2]: arr2.dtype
Out[2]: dtype([('f0', '<i4'), ('f1', '<i4')])
Modifying a field:
In [4]: %%timeit x = arr2.copy()
...: x['f0'] += 1
...:
16.2 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Similar time if I assign the field to a new variable:
In [5]: %%timeit x = arr2.copy()['f0']
...: x += 1
...:
15.2 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Much faster if I construct a 1d array of the same size:
In [6]: %%timeit x = np.zeros(arr2.shape, int)
...: x += 1
...:
8.01 µs ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But similar time when accessing the column of a 2d array:
In [7]: %%timeit x = np.zeros((arr2.shape[0],2), int)
...: x[:,0] += 1
...:
17.3 µs ± 23.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Does Numpy fancy indexing copy values directly to another array?

According to the documentation that I could find, when using fancy indexing a copy rather than a view is returned. However, I couldn't figure out what its behavior is during assignment to another array, for instance:
A = np.arange(0,10)
B = np.arange(-10,0)
fancy_slice = np.array([0,3,5])
A[fancy_slice] = B[fancy_slice]
I understand that A will just receive a call to __setitem__ while B will get a call to __getitem__. What I am concerned about is whether an intermediate array is created before copying the values over to A.
The interpreter will parse the code and issue the method calls as:
A[idx] = B[idx]
A.__setitem__(idx, B.__getitem__(idx))
The B method is evaluated fully before being passed to the A method. numpy doesn't alter the Python interpreter or its syntax. Rather it just adds functions, objects, and methods.
Functionally, it should be equivalent to
temp = B[idx]
A[idx] = temp
del temp
We could do some timeit tests just to be sure.
In [712]: A = np.zeros(10000,int)
In [713]: B = np.arange(10000)
In [714]: idx = np.arange(0,10000,100)
In [715]: timeit A[idx] = B[idx]
1.2 µs ± 3.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [716]: %%timeit
...: temp = B[idx]
...: A[idx] = temp
...:
1.11 µs ± 0.669 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
There are some alternative functions/methods, like add.at, copyto, place, put, that may do some copies without an intermediate, but I haven't used them much. This indexed assignment is good enough - most of the time.
Example with copyto
In [718]: wh = np.zeros(A.shape, bool)
In [719]: wh[idx] = True
In [721]: np.copyto(A, B, where=wh)
In [722]: timeit np.copyto(A, B, where=wh)
7.47 µs ± 9.92 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So even without timing the construction of the boolean mask, copyto is slower.
put and take are no better:
In [727]: timeit np.put(A,idx, np.take(B,idx))
7.98 µs ± 8.34 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
An intermediate array is created. It has to be created. NumPy doesn't see
A[fancy_slice] = B[fancy_slice]
It sees
B[fancy_slice]
on its own, with no idea what the context is. This operation is defined to make a new array, and NumPy makes a new array.
Then, NumPy sees
A[fancy_slice] = <the array created by the previous operation>
and copies the data into A.
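A quick way to convince yourself of this (just a sketch, reusing the arrays from the question):

import numpy as np

B = np.arange(-10,0)
fancy_slice = np.array([0,3,5])

sub = B[fancy_slice]
print(np.shares_memory(sub, B))  # False: the fancy-indexed read is a fresh array, not a view
sub[0] = 999
print(B[0])                      # -10, unchanged: modifying the copy does not touch B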

difference between np.inf and float('Inf')

Is there some difference between NumPy np.inf and float('Inf')?
float('Inf') == np.inf returns True, so it seems they are interchangeable, thus I was wondering why NumPy has defined its own "inf" constant, and when should I use one constant instead of the other (considering style concerns too)?
TL;DR: There is no difference and they can be used interchangeably.
Besides having the same value as math.inf and float('inf'):
>>> import math
>>> import numpy as np
>>> np.inf == float('inf')
True
>>> np.inf == math.inf
True
It also has the same type:
>>> import numpy as np
>>> type(np.inf)
float
>>> type(np.inf) is type(float('inf'))
True
That's interesting because NumPy also has its own floating point types:
>>> np.float32(np.inf)
inf
>>> type(np.float32(np.inf))
numpy.float32
>>> np.float32('inf') == np.inf # nevertheless equal
True
So it has the same value and the same type as math.inf and float('inf') which means it's interchangeable.
Reasons for using np.inf
It's less to type:
np.inf (6 chars)
math.inf (8 chars; new in python 3.5)
float('inf') (12 chars)
That means if you already have NumPy imported you can save yourself 6 (or 2) chars per occurrence compared to float('inf') (or math.inf).
Because it's easier to remember.
At least for me, it's far easier to remember np.inf than that I need to call float with a string.
Also, NumPy defines some additional aliases for infinity:
np.Inf
np.inf
np.infty
np.Infinity
np.PINF
It also defines an alias for negative infinity:
np.NINF
Similarly for nan:
np.nan
np.NaN
np.NAN
Constants are constants
This point is based on CPython and could be completely different in another Python implementation.
A CPython float instance requires 24 bytes:
>>> import sys
>>> sys.getsizeof(np.inf)
24
If you can re-use the same instance you might save a lot of memory compared to creating lots of new instances. Of course, this point is mute if you create your own inf constant but if you don't then:
a = [np.inf for _ in range(1000000)]
b = [float('inf') for _ in range(1000000)]
b would use 24 * 1000000 bytes (~23 MB) more memory than a.
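A quick sketch to see the object reuse (CPython-specific, as noted above):

import sys
import numpy as np

a = [np.inf for _ in range(1000000)]        # one float object, a million references
b = [float('inf') for _ in range(1000000)]  # a million distinct float objects

print(all(x is np.inf for x in a))  # True: every element is the very same object
print(sys.getsizeof(b[0]))          # 24 on CPython; b pays roughly this per element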
Accessing the existing constant is also faster than creating a new float from a string:
%timeit np.inf
37.9 ns ± 0.692 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit float('inf')
232 ns ± 13.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [np.inf for _ in range(10000)]
552 µs ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [float('inf') for _ in range(10000)]
2.59 ms ± 78.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Of course, you can create your own constant to counter that point. But why bother if NumPy has already done that for you?
