Multiplying array in Python

From this question I see how to multiply a whole numpy array by the same number (second answer, by JoshAdel). But when I change P into the maximum of a (long) array, is it better to store the maximum beforehand, or does the second example calculate the maximum of H just once?
import numpy as np
H = np.array([12, 12, 5, 32, 6, 0.5])
P = H.max()
S = [22, 33, 45.6, 21.6, 51.8]
SP = P * np.array(S)
or
import numpy as np
H = np.array([12, 12, 5, 32, 6, 0.5])
S = [22, 33, 45.6, 21.6, 51.8]
SP = H.max() * np.array(S)
So does it calculate H.max() for every item it has to multiply, or is it smart enough to do it just once? In my code S and H are longer arrays than in the example.

There is little difference between the 2 methods:
In [74]:
import numpy as np
H = np.random.random(100000)
%timeit P=H.max()
S=np.random.random(100000)
%timeit SP = P*np.array(S)
%timeit SP = H.max()*np.array(S)
10000 loops, best of 3: 51.2 µs per loop
10000 loops, best of 3: 165 µs per loop
1000 loops, best of 3: 217 µs per loop
Here you can see that pre-calculating H.max() as a separate step is no different from calculating it inline: the max step plus the multiplication step add up to roughly the time of the single-line version.
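To spell out why: Python evaluates H.max() once when the statement runs, producing a scalar that NumPy then broadcasts over the whole array; it is never re-evaluated per element. A minimal sketch (counted_max is a made-up wrapper, only there to count calls):
import numpy as np

H = np.random.random(100000)
S = np.random.random(100000)

calls = {"count": 0}

def counted_max(a):
    # hypothetical helper: delegates to ndarray.max() and counts invocations
    calls["count"] += 1
    return a.max()

SP = counted_max(H) * S   # the scalar is computed once, then broadcast over S
print(calls["count"])     # -> 1, not len(S)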

Related

Create a vector of random integers that only occur once with numpy / Python [duplicate]

How can I generate non-repetitive random numbers in numpy?
list = np.random.random_integers(20,size=(10))
numpy.random.Generator.choice offers a replace argument to sample without replacement:
from numpy.random import default_rng
rng = default_rng()
numbers = rng.choice(20, size=10, replace=False)
If you're on a pre-1.17 NumPy, without the Generator API, you can use random.sample() from the standard library:
import random
print(random.sample(range(20), 10))
You can also use numpy.random.shuffle() and slicing, but this will be less efficient:
import numpy
a = numpy.arange(20)
numpy.random.shuffle(a)
print(a[:10])
There's also a replace argument in the legacy numpy.random.choice function, but this argument was implemented inefficiently and then left inefficient due to random number stream stability guarantees, so its use isn't recommended. (It basically does the shuffle-and-slice thing internally.)
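For reference, here is roughly what that shuffle-and-slice idea looks like when written out yourself with the Generator API (a sketch, not NumPy's actual implementation):
import numpy as np

rng = np.random.default_rng()
# permute 0..19 and keep the first 10: distinct values, uniform over all subsets
numbers = rng.permutation(20)[:10]
print(numbers)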
Some timings:
import timeit
print("when output size/k is large, np.random.default_rng().choice() is far far quicker, even when including time taken to create np.random.default_rng()")
print(1, timeit.timeit("rng.choice(a=10**5, size=10**4, replace=False, shuffle=False)", setup="import numpy as np; rng=np.random.default_rng()", number=10**3)) #0.16003450006246567
print(2, timeit.timeit("np.random.default_rng().choice(a=10**5, size=10**4, replace=False, shuffle=False)", setup="import numpy as np", number=10**3)) #0.19915290002245456
print(3, timeit.timeit("random.sample( population=range(10**5), k=10**4)", setup="import random", number=10**3)) #5.115292700007558
print("when output size/k is very small, random.sample() is quicker")
print(4, timeit.timeit("rng.choice(a=10**5, size=10**1, replace=False, shuffle=False)", setup="import numpy as np; rng=np.random.default_rng()", number=10**3)) #0.01609779999125749
print(5, timeit.timeit("random.sample( population=range(10**5), k=10**1)", setup="import random", number=10**3)) #0.008387799956835806
So numpy.random.Generator.choice is what you usually want to go for, except for very small output size/k.
I think numpy.random.sample doesn't do the right thing here. This is my way:
import numpy as np
np.random.choice(range(20), 10, replace=False)  # equivalently: np.random.choice(20, 10, replace=False)
Years later, some timeits for choosing 40000 out of 10000^2
(Numpy 1.8.1, imac 2.7 GHz):
import random
import numpy as np
n = 10000
k = 4
np.random.seed( 0 )
%timeit np.random.choice( n**2, k * n, replace=True ) # 536 µs ± 1.58 µs
%timeit np.random.choice( n**2, k * n, replace=False ) # 6.1 s ± 9.91 ms
# https://docs.scipy.org/doc/numpy/reference/random/index.html
randomstate = np.random.default_rng( 0 )
%timeit randomstate.choice( n**2, k * n, replace=False, shuffle=False ) # 766 µs ± 2.18 µs
%timeit randomstate.choice( n**2, k * n, replace=False, shuffle=True ) # 1.05 ms ± 1.41 µs
%timeit random.sample( range( n**2 ), k * n ) # 47.3 ms ± 134 µs
(Why choose 40000 out of 10000^2? To generate large scipy.sparse.random matrices -- scipy 1.4.1 uses np.random.choice(replace=False), which is slooooow.)
Tip of the hat to numpy.random people.
You can get this by sorting as well:
random_numbers = np.random.random([num_samples, max_int])
samples = np.argsort(random_numbers, axis=1)
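If you only need k non-repeating values per row, slice the first k columns of the argsort; a small sketch with made-up sizes (num_samples, max_int and k are illustrative, not from the answer above):
import numpy as np

num_samples, max_int, k = 5, 20, 10
random_numbers = np.random.random((num_samples, max_int))
# each row of the argsort is a uniform random permutation of range(max_int),
# so its first k entries are k distinct integers per row
samples = np.argsort(random_numbers, axis=1)[:, :k]
print(samples)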
Python set-list conversion can be used. 10 random non-repeating numbers between 0 and 20 can be obtained as:
import random
numbers = set()
while len(numbers) < 10:
    numbers.add(random.randint(0, 20))
numbers = list(numbers)
random.shuffle(numbers)
print(numbers)
Simply generate an array that contains the required range of numbers, then shuffle them by repeatedly swapping a random one with the 0th element in the array (see the sketch below for a standard Fisher-Yates version of this idea). This produces a random sequence that doesn't contain duplicate values.
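A minimal sketch of that shuffle-then-slice idea, written as a standard Fisher-Yates partial shuffle (each position is swapped in turn with a random element from the remaining tail, which keeps the selection uniform):
import random

def sample_without_replacement(n, k):
    a = list(range(n))
    for i in range(k):
        # swap slot i with a uniformly chosen element from the not-yet-fixed tail
        j = random.randint(i, n - 1)
        a[i], a[j] = a[j], a[i]
    return a[:k]   # the first k slots form a sample without replacement

print(sample_without_replacement(20, 10))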

A faster discrete Laplacian than scipy.ndimage.filters.laplace for small arrays

My code spends the vast bulk of its computational time in scipy.ndimage.filters.laplace().
The main advantage of scipy and numpy is vectorised calculation in C/C++ wrapped in python.
scipy.ndimage.filters.laplace() is built on _nd_image.correlate1d, which is part of the optimised nd_image C extension.
Is there any faster method of doing this across an array of size 10-100?
Definition of the Laplace filter (ignoring the division):
a[i-1] - 2*a[i] + a[i+1]
Optional: ideally it can wrap around the boundary, i.e. a[n-2] - 2*a[n-1] + a[0] for n = a.shape[0]
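For reference, that wrap-around stencil can be written directly with np.roll; a plain-NumPy sketch (not the optimised version built below), which should agree with scipy.ndimage.filters.laplace(a, mode='wrap') for 1-D input:
import numpy as np

def laplace_wrap(a):
    # periodic 1-D Laplacian: a[i-1] - 2*a[i] + a[i+1], indices taken modulo len(a)
    return np.roll(a, 1) - 2.0 * a + np.roll(a, -1)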
The problem was rooted in scipy's excellent error handling and debugging. However, when the user knows what they're doing, it just adds excess overhead. The code below strips all the Python clutter in the back end of scipy and directly accesses the compiled function to get a ~6x speed-up!
laplace == Mine ? True
testing timings...
array size 10
100000 loops, best of 3: 12.7 µs per loop
100000 loops, best of 3: 2.3 µs per loop
array size 100
100000 loops, best of 3: 12.7 µs per loop
100000 loops, best of 3: 2.5 µs per loop
array size 100000
1000 loops, best of 3: 413 µs per loop
1000 loops, best of 3: 404 µs per loop
Code
from scipy import ndimage
from scipy.ndimage import _nd_image
import numpy as np

laplace_filter = np.asarray([1, -2, 1], dtype=np.float64)

def fastLaplaceNd(arr):
    output = np.zeros(arr.shape, 'float64')
    if arr.ndim > 0:
        # arguments: input, weights, axis, output, mode code, cval, origin
        _nd_image.correlate1d(arr, laplace_filter, 0, output, 1, 0.0, 0)
    if arr.ndim == 1:
        return output
    for ax in range(1, arr.ndim):
        # the C routine writes into its output argument (and returns None),
        # so accumulate the remaining axes through a temporary
        tmp = np.zeros(arr.shape, 'float64')
        _nd_image.correlate1d(arr, laplace_filter, ax, tmp, 1, 0.0, 0)
        output += tmp
    return output

if __name__ == '__main__':
    arr = np.random.random(10)
    test = (ndimage.filters.laplace(arr, mode='wrap') == fastLaplaceNd(arr)).all()
    assert test
    print("laplace == Mine ?", test)
    print('testing timings...')
    print("array size 10")
    %timeit ndimage.filters.laplace(arr, mode='wrap')
    %timeit fastLaplaceNd(arr)
    print('array size 100')
    arr = np.random.random(100)
    %timeit ndimage.filters.laplace(arr, mode='wrap')
    %timeit fastLaplaceNd(arr)
    print("array size 100000")
    arr = np.random.random(100000)
    %timeit ndimage.filters.laplace(arr, mode='wrap')
    %timeit fastLaplaceNd(arr)

Speedup for the reduce operation in Theano

Edit:
So sorry, it turns out that I had other processes running on my GPU while I did the test. I've updated the timing results on an idle GPU, and the speedup becomes noticeable for larger matrices.
Original Post:
As posted in this question, L is a list of matrices, where each item M is an x*n matrix (x is a variable, n is fixed).
I want to compute the sum of M'*M for all items in L (M' is the transpose of M) as the following Python code does.
for M in L:
    res += np.dot(M.T, M)
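As an aside, the same quantity can be computed with a single BLAS call, since the sum of M'*M over the list equals A'*A where A is the row-wise concatenation of all the matrices; a sketch of that equivalence (at the cost of materialising the concatenated array):
import numpy as np

# sum_i M_i' * M_i == A' * A, where A stacks every M_i along axis 0
A = np.concatenate(L, axis=0)   # assumes every M in L has the same number of columns n
res = np.dot(A.T, A)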
The following are some example Numpy and Theano implementations (for an executable script, please refer to @DanielRenshaw's answer to the previous question).
def numpy_version1(*L):
    n = L[0].shape[1]
    res = np.zeros((n, n), dtype=L[0].dtype)
    for M in L:
        res += np.dot(M.T, M)
    return res

def compile_theano_version1(number_of_matrices, n, dtype):
    L = [tt.matrix() for _ in range(number_of_matrices)]
    res = tt.zeros(n, dtype=dtype)
    for M in L:
        res += tt.dot(M.T, M)
    return theano.function(L, res)

def compile_theano_version2(number_of_matrices, n):
    L = theano.typed_list.TypedListType(
        tt.TensorType(theano.config.floatX, broadcastable=(None, None)))()
    res, _ = theano.reduce(fn=lambda i, tmp: tmp + tt.dot(L[i].T, L[i]),
                           outputs_info=tt.zeros((n, n), dtype=theano.config.floatX),
                           sequences=[theano.tensor.arange(number_of_matrices,
                                                           dtype='int64')])
    return theano.function([L], res)
I ran the Numpy version on the CPU and the Theano versions on the GPU with different settings; it seems that the Theano versions are always proportionally slower than the Numpy version (regardless of the number and size of the matrices).
But I was expecting some optimization w.r.t. the GPU, as it is a simple reduce operation.
Could someone help me understand what's going on under the hood?
Edit:
The following is the script (from @DanielRenshaw) for generating data, the settings I've tried, and the results.
L = [np.random.standard_normal(size=(x, n)).astype(dtype)
     for x in range(min_x, number_of_matrices + min_x)]
dtype = 'float32'
theano.config.floatX = dtype
iteration_count = 10
min_x = 20
# base case:
# numpy_version1 0.100589990616
# theano_version1 0.243968963623
# theano_version2 0.198153018951
number_of_matrices = 200
n = 100
# increase matrix size:
# numpy_version1 4.90120816231
# theano_version1 0.984472036362
# theano_version2 3.56008815765
number_of_matrices = 200
n = 1000
# increase number of matrices:
# numpy_version1 5.11445093155
# theano_version1 compilation error
# theano_version2 6.54448604584
number_of_matrices = 2000
n = 100
The problem you are having is not the number of matrices but their size.
Your test example creates matrices whose size depends on the number of matrices you have; thus, the more matrices you have, the larger the matrices are, but also the larger the Python loop overhead is (in reduce operations), which makes it harder to detect speed improvements.
I've slightly modified your matrix generation in order to make some new tests:
S = 1000 # Size of the matrices
N = 10 # Number of matrices
L = [np.random.standard_normal(size=(np.random.randint(S//2, S*2), S)).astype(np.float32) for _ in range(N)]
This generates only 10 matrices of size (x, 1000) where x is some value in the range of [S//2, S*2] == [500, 2000].
f1 = compile_theano_version1(N, S, np.float32)
f2 = compile_theano_version2(N, S)
Now some tests with N = 10 big matrices:
For S = 1000, N = 10:
%timeit numpy_version1(*L) # 10 loops, best of 3: 131 ms per loop
%timeit f1(*L) # 10 loops, best of 3: 37.3 ms per loop
%timeit f2(L) # 10 loops, best of 3: 68.7 ms per loop
Here the Theano functions get roughly 4x and 2x speedups on a laptop with a pretty nice i7 and a decent NVIDIA 860M (which means that you should get even nicer speedups on a better GPU).
For S = 5000, N = 10:
%timeit numpy_version1(*L) # 1 loops, best of 3: 4 s per loop
%timeit f1(*L) # 1 loops, best of 3: 907 ms per loop
%timeit f2(L) # 1 loops, best of 3: 1.77 s per loop
So, overall, with this setup, the larger S is, the larger the speedup Theano gets over the CPU version.
Some tests with N = 100 big matrices: theano seems faster
For S = 1000, N = 100:
%timeit numpy_version1(*L) # 1 loops, best of 3: 1.46 s per loop
%timeit f1(*L) # 1 loops, best of 3: 408 ms per loop
%timeit f2(L) # 1 loops, best of 3: 724 ms per loop
For S = 2000, N = 100:
%timeit numpy_version1(*L) # 1 loops, best of 3: 11.3 s per loop
%timeit f1(*L) # 1 loops, best of 3: 2.72 s per loop
%timeit f2(L) # 1 loops, best of 3: 4.01 s per loop
Tests with N = 100 small matrices: numpy seems faster
For S = 50, N = 100:
%timeit numpy_version1(*L) # 100 loops, best of 3: 1.17 ms per loop
%timeit f1(*L) # 100 loops, best of 3: 4.21 ms per loop
%timeit f2(L) # 100 loops, best of 3: 7.42 ms per loop
Specifications for the tests:
Processor: i7 4710HQ
GPU: NVIDIA GeForce GTX 860M
Numpy: version 1.10.2 built with Intel MKL
Theano: version 0.7.0; floatX = float32; using GPU

Pairwise operations (distance) on two lists in numpy

I have two lists of coordinates:
l1 = [[x,y,z],[x,y,z],[x,y,z],[x,y,z],[x,y,z]]
l2 = [[x,y,z],[x,y,z],[x,y,z]]
I want to find the shortest pairwise distance between l1 and l2. Distance between two coordinates is simply:
numpy.linalg.norm(l1_element - l2_element)
So how do I use numpy to efficiently apply this operation to each pair of elements?
Here is a quick performance analysis of the four methods presented so far:
import numpy
import scipy
from itertools import product
from scipy.spatial.distance import cdist
from scipy.spatial import cKDTree as KDTree
n = 100
l1 = numpy.random.randint(0, 100, size=(n,3))
l2 = numpy.random.randint(0, 100, size=(n,3))
# by @Phillip
def a(l1, l2):
    return min(numpy.linalg.norm(l1_element - l2_element)
               for l1_element, l2_element in product(l1, l2))

# by @Kasra
def b(l1, l2):
    return numpy.min(numpy.apply_along_axis(
        numpy.linalg.norm,
        2,
        l1[:, None, :] - l2[None, :, :]
    ))

# mine
def c(l1, l2):
    return numpy.min(scipy.spatial.distance.cdist(l1, l2))

# just checking that numpy.min is indeed faster.
def c2(l1, l2):
    return min(scipy.spatial.distance.cdist(l1, l2).reshape(-1))

# by @BrianLarsen
def d(l1, l2):
    # make KDTrees for both sets of points
    t1 = KDTree(l1)
    t2 = KDTree(l2)
    # we need a distance not to search beyond; if you have real knowledge use it, otherwise guess
    maxD = numpy.linalg.norm(l1[0] - l2[0])  # this could be the closest pair, but anything further certainly is not
    # get a sparse matrix of all the distances within maxD
    ans = t1.sparse_distance_matrix(t2, maxD)
    # get the minimum distance
    minD = min(ans.values())
    return minD

for x in (a, b, c, c2, d):
    print("Timing variant", x.__name__, ':', flush=True)
    print(x(l1, l2), flush=True)
    %timeit x(l1, l2)
    print(flush=True)
For n=100
Timing variant a :
2.2360679775
10 loops, best of 3: 90.3 ms per loop
Timing variant b :
2.2360679775
10 loops, best of 3: 151 ms per loop
Timing variant c :
2.2360679775
10000 loops, best of 3: 136 µs per loop
Timing variant c2 :
2.2360679775
1000 loops, best of 3: 844 µs per loop
Timing variant d :
2.2360679775
100 loops, best of 3: 3.62 ms per loop
For n=1000
Timing variant a :
0.0
1 loops, best of 3: 9.16 s per loop
Timing variant b :
0.0
1 loops, best of 3: 14.9 s per loop
Timing variant c :
0.0
100 loops, best of 3: 11 ms per loop
Timing variant c2 :
0.0
10 loops, best of 3: 80.3 ms per loop
Timing variant d :
0.0
1 loops, best of 3: 933 ms per loop
Using newaxis and broadcasting, l1[:, None, :] - l2[None, :, :] is an array of the pairwise difference vectors. You can reduce this array to an array of norms using apply_along_axis and then take the min:
numpy.min(numpy.apply_along_axis(
    numpy.linalg.norm,
    2,
    l1[:, None, :] - l2[None, :, :]
))
Of course, this only works if l1 and l2 are numpy arrays, so if your lists in the question weren't pseudo-code, you'll have to add l1 = numpy.array(l1); l2 = numpy.array(l2).
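A variant of the same broadcasting idea: norm can reduce over an axis directly (NumPy 1.8+), which avoids the per-row Python callback that apply_along_axis incurs; a sketch, again assuming l1 and l2 are numpy arrays:
import numpy

# full pairwise distance matrix via broadcasting, reduced along the coordinate axis
dists = numpy.linalg.norm(l1[:, None, :] - l2[None, :, :], axis=2)
print(dists.min())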
You can use itertools.product to get all the combinations, then use min:
from itertools import product
# l1 and l2 as in the question, converted to numpy arrays so the element-wise subtraction works
l1 = numpy.array([[x,y,z],[x,y,z],[x,y,z],[x,y,z],[x,y,z]])
l2 = numpy.array([[x,y,z],[x,y,z],[x,y,z]])
min(numpy.linalg.norm(l1_element - l2_element) for l1_element, l2_element in product(l1, l2))
If you have many, many, many points this is a great use for a KDTree. Totally overkill for this example but a good learning experience and really fast for a certain class of problems, and can give a bit more information on number of points within a certain distance.
import numpy as np
from scipy.spatial import cKDTree as KDTree
#sample data
l1 = [[0,0,0], [4,5,6], [7,6,7], [4,5,6]]
l2 = [[100,3,4], [1,0,0], [10,15,16], [17,16,17], [14,15,16], [-34, 5, 6]]
# make them arrays
l1 = np.asarray(l1)
l2 = np.asarray(l2)
# make KDTrees for both sets of points
t1 = KDTree(l1)
t2 = KDTree(l2)
# we need a distance not to search beyond; if you have real knowledge use it, otherwise guess
maxD = np.linalg.norm(l1[-1] - l2[-1])  # this could be the closest pair, but anything further certainly is not
# get a sparse matrix of all the distances within maxD
ans = t1.sparse_distance_matrix(t2, maxD)
# get the minimum distance and the points involved
minA = min((d, k) for k, d in ans.items())
print("Minimum distance is {0} between l1={1} and l2={2}".format(minA[0], l1[minA[1][0]], l2[minA[1][1]]))
What this does is make a KDTree for each set of points, find all the distances between points within the guess distance, and give back the minimum distance and the closest pair. This post has a writeup of how a KDTree works.

How to get a fast lambda function from a sympy expression in 3 dimensions?

I am using sympy to generate different expressions for CFD simulations.
Mostly these expressions are of the kind exp = f(x,y,z), for example
f(x,y,z) = sin(x)*cos(y)*sin(z). To get values on a grid I use sympy.lambdify.
For example:
import numpy as np
import sympy as sp
from sympy.abc import x,y,z
xg, yg, zg = np.mgrid[0:1:50*1j, 0:1:50*1j, 0:1:50*1j]
f = sp.sin(x)*sp.cos(y)*sp.sin(z)
lambda_f = sp.lambdify([x,y,z], f, "numpy")
fn = lambda_f(xg, yg, zg)
print(fn)
This seems to work pretty well, but unfortunately my expressions are getting more and more complex and the grid computation takes a lot of time.
My idea was that it might be possible to use the ufuncify method
(see http://docs.sympy.org/latest/modules/numeric-computation.html)
to speed the computations up, but I'm not sure if this is the right way,
and I'm also not sure how to get ufuncify to work for 3D grids.
Thanks for any suggestions.
In previous releases (0.7.5 and prior), ufuncify only worked on single-dimension arrays for the first argument (not very exciting). As of 0.7.6 (not released yet, but should be in a week!) ufuncify creates actual instances of numpy.ufunc by default (it wraps C code in the numpy API). Your code above only needs a small change to make it work.
In [1]: import numpy as np
In [2]: from sympy import sin, cos, lambdify
In [3]: from sympy.abc import x,y,z
In [4]: from sympy.utilities.autowrap import ufuncify
In [5]: from sympy.printing.theanocode import theano_function
In [6]: xg, yg, zg = np.mgrid[0:1:50*1j, 0:1:50*1j, 0:1:50*1j]
In [7]: f = sin(x)*cos(y)*sin(z)
In [8]: ufunc_f = ufuncify([x,y,z], f)
In [9]: theano_f = theano_function([x, y, z], f, dims={x: 3, y: 3, z: 3})
In [10]: lambda_f = lambdify([x, y, z], f)
In [11]: type(ufunc_f)
Out[11]: numpy.ufunc
In [12]: type(theano_f)
Out[12]: theano.compile.function_module.Function
In [13]: type(lambda_f)
Out[13]: function
In [14]: %timeit ufunc_f(xg, yg, zg)
10 loops, best of 3: 21 ms per loop
In [15]: %timeit theano_f(xg, yg, zg)
10 loops, best of 3: 20.7 ms per loop
In [16]: %timeit lambda_f(xg, yg, zg)
10 loops, best of 3: 22.3 ms per loop
ufuncify and theano_function are comparable, and slightly faster than lambdify for this simple expression. The difference is greater using the more complicated expression given below:
In [17]: f = sin(x)*cos(y)*sin(z) + sin(4*(x - y**2*sin(z)))
In [18]: ufunc_f = ufuncify([x,y,z], f)
In [19]: theano_f = theano_function([x, y, z], f, dims={x: 3, y: 3, z: 3})
In [20]: lambda_f = lambdify([x, y, z], f)
In [21]: %timeit ufunc_f(xg, yg, zg)
10 loops, best of 3: 29.2 ms per loop
In [22]: %timeit theano_f(xg, yg, zg)
10 loops, best of 3: 29.2 ms per loop
In [23]: %timeit lambda_f(xg, yg, zg)
10 loops, best of 3: 42.1 ms per loop
This is much faster than using the Python version, as no intermediate arrays are created; the loop is traversed and the calculation run in C. Theano produces equivalent speeds, as it also compiles to native code. For the large expressions that I see when doing multibody dynamics, ufuncify (and the related autowrap) perform significantly faster than lambdify. I don't have much experience with theano, so I can't say how well its approach scales, but I'd assume it would be similar.
As I said above, this is only available in sympy 0.7.6 and up. It should be released soon, but until then you can grab the source from github. Docs on ufuncify's new behavior can be found here.
Perhaps you could work with sympy's theano_function. According to the documentation link you provided, it has similar speed to ufuncify, and it can be used with an mgrid:
import numpy as np
import sympy as sp
from sympy.printing.theanocode import theano_function
x,y,z = sp.symbols('x y z')
xg, yg, zg = np.mgrid[0:1:50*1j, 0:1:50*1j, 0:1:50*1j]
f = sp.sin(x)*sp.cos(y)*sp.sin(z)
ft = theano_function([x,y,z], [f], dims={x: 3, y: 3, z: 3})
ft(xg,yg,zg) # result is similar to your fn
For this particular function f, however, the execution speed on my system is similar for the lambdified version and the theano-ized version:
In [24]: %timeit fn = lambda_f(xg, yg, zg)
10 loops, best of 3: 53.2 ms per loop
In [25]: %timeit fn = ft(xg,yg,zg)
10 loops, best of 3: 52.7 ms per loop
Making the function slightly more difficult,
In [27]: f = sp.sin(x)*sp.cos(y)*sp.sin(z) + sp.sin(4*(x-y**2*sp.sin(z)))
In [30]: %timeit fl(xg,yg,zg) # lambdified version
10 loops, best of 3: 89.4 ms per loop
In [31]: %timeit ft(xg,yg,zg) # Theano version
10 loops, best of 3: 67.6 ms per loop
makes the timing differences slightly bigger for me (and in favor of theano), but perhaps on your functions you'd experience much bigger timing differences?
