How to speed up looping over arrays as inputs for a pandas calculation?

I have two arrays, x and y, and I want to iterate over them as inputs to a pandas calculation.
Here is an example.
Iterating over each value of x and y and appending the result to the res list is slow.
The calculation divides the column by a, takes the exponential of the negated result, sums it, and multiplies the sum by b. It is only an illustration; any other calculation could take its place.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,5,size=(5, 1)),columns=['data'])
x = np.linspace(1, 24, 4)
y = np.linspace(10, 1500, 5)
res = []
for a in x:
    for b in y:
        res.append(np.exp(-df/a).sum().values[0]*b)
res = np.array(res).reshape(4, 5)
expected output:
array([[ 11.67676844, 446.63639283, 881.59601721, 1316.5556416 ,
1751.51526599],
[ 37.52524129, 1435.34047927, 2833.15571725, 4230.97095523,
5628.78619321],
[ 42.79406912, 1636.87314392, 3230.95221871, 4825.0312935 ,
6419.1103683 ],
[ 44.93972433, 1718.94445549, 3392.94918665, 5066.95391781,
6740.95864897]])

You can use numpy broadcasting:
res = np.array(res).reshape(4, 5)
print (res)
[[ 11.67676844 446.63639283 881.59601721 1316.5556416 1751.51526599]
[ 37.52524129 1435.34047927 2833.15571725 4230.97095523 5628.78619321]
[ 42.79406912 1636.87314392 3230.95221871 4825.0312935 6419.1103683 ]
[ 44.93972433 1718.94445549 3392.94918665 5066.95391781 6740.95864897]]
res = np.exp(-df.to_numpy()/x).sum(axis=0)[:, None] * y
print (res)
[[ 11.67676844 446.63639283 881.59601721 1316.5556416 1751.51526599]
[ 37.52524129 1435.34047927 2833.15571725 4230.97095523 5628.78619321]
[ 42.79406912 1636.87314392 3230.95221871 4825.0312935 6419.1103683 ]
[ 44.93972433 1718.94445549 3392.94918665 5066.95391781 6740.95864897]]

I think what you want is:
z = -df['data'].to_numpy()
res = np.exp(z/x[:, None]).sum(axis=1)[:, None]*y
output:
array([[ 11.67676844, 446.63639283, 881.59601721, 1316.5556416 ,
1751.51526599],
[ 37.52524129, 1435.34047927, 2833.15571725, 4230.97095523,
5628.78619321],
[ 42.79406912, 1636.87314392, 3230.95221871, 4825.0312935 ,
6419.1103683 ],
[ 44.93972433, 1718.94445549, 3392.94918665, 5066.95391781,
6740.95864897]])
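To check that the vectorized form really matches the loop, a minimal sketch along these lines may help. It reuses the data from the question; the names loop_res, col_sums and vec_res are only illustrative, and np.outer is one possible way to write the final multiplication, not the only one.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 5, size=(5, 1)), columns=['data'])
x = np.linspace(1, 24, 4)
y = np.linspace(10, 1500, 5)

# Reference result: the nested loop from the question, one scalar per (a, b) pair.
loop_res = np.array([[np.exp(-df['data'] / a).sum() * b for b in y] for a in x])

# Vectorized: broadcast the column (5, 1) against x (4,) to get a (5, 4) array,
# sum over the rows of the frame, then take the outer product with y.
data = df['data'].to_numpy()
col_sums = np.exp(-data[:, None] / x).sum(axis=0)   # shape (4,)
vec_res = np.outer(col_sums, y)                     # shape (4, 5)

print(np.allclose(loop_res, vec_res))
```

The exponential is evaluated once per value of x rather than once per (x, y) pair, which is where most of the saving comes from.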

Related

Numpyic way to sort a matrix based on another similar matrix

Say I have a matrix Y of random float numbers from 0 to 10 with shape (10, 3):
import numpy as np
np.random.seed(99)
Y = np.random.uniform(0, 10, (10, 3))
print(Y)
Output:
[[6.72278559 4.88078399 8.25495174]
[0.31446388 8.08049963 5.6561742 ]
[2.97622499 0.46695721 9.90627399]
[0.06825733 7.69793028 7.46767101]
[3.77438936 4.94147452 9.28948392]
[3.95454044 9.73956297 5.24414715]
[0.93613093 8.13308413 2.11686786]
[5.54345785 2.92269116 8.1614236 ]
[8.28042566 2.21577372 6.44834702]
[0.95181622 4.11663239 0.96865261]]
I am now given a matrix X with same shape that can be seen as obtained by adding small noises to Y and then shuffling the rows:
X = np.random.normal(Y, scale=0.1)
np.random.shuffle(X)
print(X)
Output:
[[ 4.04067271 9.90959141 5.19126867]
[ 5.59873104 2.84109306 8.11175891]
[ 0.10743952 7.74620162 7.51100441]
[ 3.60396019 4.91708372 9.07551354]
[ 0.9400948 4.15448712 1.04187208]
[ 2.91884302 0.47222752 10.12700505]
[ 0.30995155 8.09263241 5.74876947]
[ 1.11247872 8.02092335 1.99767444]
[ 6.68543696 4.8345869 8.17330513]
[ 8.38904822 2.11830619 6.42013343]]
Now I want to sort the rows of X so that they follow the row order of Y. I already know that, for each matching pair of rows, the values in each column differ by no more than a tolerance of 0.5. I wrote the following code, and it works:
def sort_X_by_Y(X, Y, tol):
    idxs = [next(i for i in range(len(X)) if all(abs(X[i] - row) <= tol)) for row in Y]
    return X[idxs]
print(sort_X_by_Y(X, Y, tol=0.5))
Output:
[[ 6.68543696 4.8345869 8.17330513]
[ 0.30995155 8.09263241 5.74876947]
[ 2.91884302 0.47222752 10.12700505]
[ 0.10743952 7.74620162 7.51100441]
[ 3.60396019 4.91708372 9.07551354]
[ 4.04067271 9.90959141 5.19126867]
[ 1.11247872 8.02092335 1.99767444]
[ 5.59873104 2.84109306 8.11175891]
[ 8.38904822 2.11830619 6.42013343]
[ 0.9400948 4.15448712 1.04187208]]
However, in reality I am sorting (1000, 3) matrices, and my code is far too slow. I feel there should be a more numpyic way to do this. Any suggestions?

This is a vectorized version of your algorithm. It runs about 26.5x faster than your implementation for 1000 samples, at the cost of creating an intermediate boolean array of shape (1000, 1000, 3). If several rows of X fall within the tolerance of the same row of Y, argmax picks the first one, so a wrong row may be selected.
tol = .5
X[(np.abs(Y[:, np.newaxis] - X) <= tol).all(2).argmax(1)]
Output
array([[ 6.68543696, 4.8345869 , 8.17330513],
[ 0.30995155, 8.09263241, 5.74876947],
[ 2.91884302, 0.47222752, 10.12700505],
[ 0.10743952, 7.74620162, 7.51100441],
[ 3.60396019, 4.91708372, 9.07551354],
[ 4.04067271, 9.90959141, 5.19126867],
[ 1.11247872, 8.02092335, 1.99767444],
[ 5.59873104, 2.84109306, 8.11175891],
[ 8.38904822, 2.11830619, 6.42013343],
[ 0.9400948 , 4.15448712, 1.04187208]])
A more robust solution picks the nearest row under the L1 norm:
X[np.abs(Y[:, np.newaxis] - X).sum(2).argmin(1)]
Or under the L2 norm:
X[((Y[:, np.newaxis] - X)**2).sum(2).argmin(1)]
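As a rough sketch of how the nearest-row matching could be wrapped up and sanity-checked, the following rebuilds the data from the question and verifies that every row of X is used exactly once. The helper name match_rows and the permutation check are my own additions, not part of the answer above.

```python
import numpy as np

np.random.seed(99)
Y = np.random.uniform(0, 10, (10, 3))
X = np.random.normal(Y, scale=0.1)
np.random.shuffle(X)

def match_rows(X, Y):
    """For each row of Y, return the index of the closest row of X (squared L2)."""
    # Pairwise squared distances, shape (len(Y), len(X)).
    d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

idx = match_rows(X, Y)
X_sorted = X[idx]

# If the matching is sound, idx is a permutation of 0..len(X)-1.
is_permutation = np.array_equal(np.sort(idx), np.arange(len(X)))
print(is_permutation, np.abs(X_sorted - Y).max())
```

If is_permutation comes out False, two rows of Y were assigned the same row of X, which is a sign that the noise is too large for simple nearest-row matching.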

Scipy Multivariate Normal: How to draw deterministic samples?

I am using scipy.stats.multivariate_normal to draw samples from a multivariate normal distribution, like this:
from scipy.stats import multivariate_normal
# Assume we have means and covs
mn = multivariate_normal(mean = means, cov = covs)
# Generate some samples
samples = mn.rvs()
The samples are different on every run. How do I always get the same sample?
I was expecting something like:
mn = multivariate_normal(mean = means, cov = covs, seed = aNumber)
or
samples = mn.rvs(seed = aNumber)

There are two ways:
The rvs() method accepts a random_state argument. Its value can be an integer seed, or an instance of numpy.random.Generator or numpy.random.RandomState. In this example, I use an integer seed:
In [46]: mn = multivariate_normal(mean=[0,0,0], cov=[1, 5, 25])
In [47]: mn.rvs(size=5, random_state=12345)
Out[47]:
array([[-0.51943872, 1.07094986, -1.0235383 ],
[ 1.39340583, 4.39561899, -2.77865152],
[ 0.76902257, 0.63000355, 0.46453938],
[-1.29622111, 2.25214387, 6.23217368],
[ 1.35291684, 0.51186476, 1.37495817]])
In [48]: mn.rvs(size=5, random_state=12345)
Out[48]:
array([[-0.51943872, 1.07094986, -1.0235383 ],
[ 1.39340583, 4.39561899, -2.77865152],
[ 0.76902257, 0.63000355, 0.46453938],
[-1.29622111, 2.25214387, 6.23217368],
[ 1.35291684, 0.51186476, 1.37495817]])
This version uses an instance of numpy.random.Generator:
In [34]: rng = np.random.default_rng(438753948759384)
In [35]: mn = multivariate_normal(mean=[0,0,0], cov=[1, 5, 25])
In [36]: mn.rvs(size=5, random_state=rng)
Out[36]:
array([[ 0.30626179, 0.60742839, 2.86919105],
[ 1.61859885, 2.63409111, 1.19018398],
[ 0.35469027, 0.85685011, 6.76892829],
[-0.88659459, -0.59922575, -5.43926698],
[ 0.94777687, -5.80057427, -2.16887719]])
You can set the seed for numpy's global random number generator. This is the generator that multivariate_normal.rvs() uses if random_state is not given:
In [54]: mn = multivariate_normal(mean=[0,0,0], cov=[1, 5, 25])
In [55]: np.random.seed(123)
In [56]: mn.rvs(size=5)
Out[56]:
array([[ 0.2829785 , 2.23013222, -5.42815302],
[ 1.65143654, -1.2937895 , -7.53147357],
[ 1.26593626, -0.95907779, -12.13339622],
[ -0.09470897, -1.51803558, -4.33370201],
[ -0.44398196, -1.4286283 , 7.45694813]])
In [57]: np.random.seed(123)
In [58]: mn.rvs(size=5)
Out[58]:
array([[ 0.2829785 , 2.23013222, -5.42815302],
[ 1.65143654, -1.2937895 , -7.53147357],
[ 1.26593626, -0.95907779, -12.13339622],
[ -0.09470897, -1.51803558, -4.33370201],
[ -0.44398196, -1.4286283 , 7.45694813]])
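Outside an interactive session, the random_state approach can be checked with a short script like the sketch below. The seed values are arbitrary, and the variable names are only illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

mn = multivariate_normal(mean=[0, 0, 0], cov=[1, 5, 25])

# Same integer seed on each call gives the same draws.
first = mn.rvs(size=5, random_state=12345)
second = mn.rvs(size=5, random_state=12345)
same_int_seed = np.array_equal(first, second)

# Two freshly created Generators with the same seed also agree.
a = mn.rvs(size=5, random_state=np.random.default_rng(2024))
b = mn.rvs(size=5, random_state=np.random.default_rng(2024))
same_generator_seed = np.array_equal(a, b)

print(same_int_seed, same_generator_seed)
```

Note that reusing a single Generator object across calls advances its state, so consecutive calls then return different (but still reproducible) samples.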

Using array indexing to apply 2D array function on 3D array

I wrote a function that takes in one set of randomized cartesian coordinates and returns the subset that remains within some spatial domain. To illustrate:
grid = np.ones((5,5))
grid = np.pad(grid, ((10,10), (10,10)), 'constant')
>> np.shape(grid)
(25, 25)
random_pts = np.random.random(size=(100, 2)) * len(grid)
def inside(input):
    # np.int was removed from recent NumPy releases; the builtin int is equivalent here
    idx = np.floor(input).astype(int)
    mask = grid[idx[:,0], idx[:,1]] == 1
    return input[mask]
>> inside(random_pts)
array([[ 10.59441506, 11.37998288],
[ 10.39124766, 13.27615815],
[ 12.28225713, 10.6970708 ],
[ 13.78351949, 12.9933591 ]])
But now I want the ability to simultaneously generate n sets of random_pts and keep n corresponding subsets that satisfy the same functional condition. So, if n=3,
random_pts = np.random.random(size=(3, 100, 2)) * len(grid)
Without resorting to a for loop, how could I index my variables such that inside(random_pts) returns something like
array([[[ 17.73323523, 9.81956681],
[ 10.97074592, 2.19671642],
[ 21.12081044, 12.80412997]],
[[ 11.41995519, 2.60974757]],
[[ 9.89827156, 9.74580059],
[ 17.35840479, 7.76972241]]])

One approach -
def inside3d(input):
    # Get idx in 3D (np.int was removed from recent NumPy releases, so use int)
    idx3d = np.floor(input).astype(int)
    # Create a similar mask as with the 2D case, but in 3D now
    mask3d = grid[idx3d[:,:,0], idx3d[:,:,1]]==1
    # Count of mask matches for each index in 0th dim
    counts = np.sum(mask3d,axis=1)
    # Index into input to get masked matches across all elements in 0th dim
    out_cat_array = input.reshape(-1,2)[mask3d.ravel()]
    # Split the rows based on the counts, as the final output
    return np.split(out_cat_array,counts.cumsum()[:-1])
Verify results -
Create 3D random input:
In [91]: random_pts3d = np.random.random(size=(3, 100, 2)) * len(grid)
With inside3d:
In [92]: inside3d(random_pts3d)
Out[92]:
[array([[ 10.71196268, 12.9875877 ],
[ 10.29700184, 10.00506662],
[ 13.80111411, 14.80514828],
[ 12.55070282, 14.63155383]]), array([[ 10.42636137, 12.45736944],
[ 11.26682474, 13.01632751],
[ 13.23550598, 10.99431284],
[ 14.86871413, 14.19079225],
[ 10.61103434, 14.95970597]]), array([[ 13.67395756, 10.17229061],
[ 10.01518846, 14.95480515],
[ 12.18167251, 12.62880968],
[ 11.27861513, 14.45609646],
[ 10.895685 , 13.35214678],
[ 13.42690335, 13.67224414]])]
With inside:
In [93]: inside(random_pts3d[0])
Out[93]:
array([[ 10.71196268, 12.9875877 ],
[ 10.29700184, 10.00506662],
[ 13.80111411, 14.80514828],
[ 12.55070282, 14.63155383]])
In [94]: inside(random_pts3d[1])
Out[94]:
array([[ 10.42636137, 12.45736944],
[ 11.26682474, 13.01632751],
[ 13.23550598, 10.99431284],
[ 14.86871413, 14.19079225],
[ 10.61103434, 14.95970597]])
In [95]: inside(random_pts3d[2])
Out[95]:
array([[ 13.67395756, 10.17229061],
[ 10.01518846, 14.95480515],
[ 12.18167251, 12.62880968],
[ 11.27861513, 14.45609646],
[ 10.895685 , 13.35214678],
[ 13.42690335, 13.67224414]])
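The comparison above can also be automated. This is a self-contained sketch that rebuilds the grid, restates both functions in condensed form, and checks that the batched result agrees with a per-set loop. The seed and the names pts3d, batched and looped are assumptions made for the example.

```python
import numpy as np

grid = np.pad(np.ones((5, 5)), ((10, 10), (10, 10)), mode='constant')

def inside(pts):
    # 2D case: keep the points whose grid cell holds a 1.
    idx = np.floor(pts).astype(int)
    return pts[grid[idx[:, 0], idx[:, 1]] == 1]

def inside3d(pts):
    # Batched case: one mask for all sets, then split by per-set counts.
    idx = np.floor(pts).astype(int)
    mask = grid[idx[..., 0], idx[..., 1]] == 1
    counts = mask.sum(axis=1)
    flat = pts.reshape(-1, 2)[mask.ravel()]
    return np.split(flat, counts.cumsum()[:-1])

rng = np.random.default_rng(0)
pts3d = rng.random(size=(3, 100, 2)) * len(grid)

batched = inside3d(pts3d)
looped = [inside(p) for p in pts3d]
all_match = all(np.array_equal(b, l) for b, l in zip(batched, looped))
print(all_match, [len(b) for b in batched])
```

The result is a list of arrays rather than a single array, because each set can keep a different number of points.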

Sorting a 2D numpy array on to the proximity of each element to a certain point

I have an (n,2) numpy array containing the coordinates of n points. I want to sort them by the proximity of each point to a specific point (x,y) and pick the closest one. How can I achieve this?
Right now I have:
def find_nearest(array,value):
    xlist = (np.abs(array[:, 0]-value[:, 0]))
    ylist = (np.abs(array[:, 1]-value[:, 1]))
    newList = np.vstack((xlist,ylist))
    # SORT NEW LIST and return the 0th element
In my solution I need to sort newList by proximity to (0,0), and I don't know how. Is there a way to do this, or any other solution?
My array of points looks like:
array([[ 0.1648, 0.227 ],
[ 0.2116, 0.2472],
[ 0.78 , 0.546 ],
[ 0.9752, 1. ],
[ 0.384 , 0.4862],
[ 0.4428, 0.2204],
[ 0.4448, 0.4146],
[ 0.1046, 0.2658],
[ 0.5668, 0.7792],
[ 0.1664, 0.0746],
[ 0.5636, 0.6372],
[ 0.7822, 0.5536],
[ 0.7718, 0.8276],
[ 0.9916, 1. ],
[ 0. , 0. ],
[ 0.8206, 0.817 ],
[ 0.4858, 0.4652],
[ 0. , 0. ],
[ 0.1574, 0.3114],
[ 0. , 0.0022],
[ 0.874 , 0.714 ],
[ 0.148 , 0.6624],
[ 0.0656, 0.5912],
[ 0.1148, 0.607 ],
[ 0.069 , 0.6296]])

Sorting to find the nearest point is not a good idea. If you want the closest point, just find the closest point; sorting for that is overkill.
def closest_point(arr, x, y):
    dist = (arr[:, 0] - x)**2 + (arr[:, 1] - y)**2
    return arr[dist.argmin()]
Moreover if you need to repeat the search many times with a fixed or quasi fixed set of points there are specific data structures that can speed up this kind of query a lot (the search time becomes sub-linear).
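One such structure is a k-d tree. As a sketch, assuming scipy is available, scipy.spatial.cKDTree can be built once and then queried many times. The random points, the seed and the variable names below are made up for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 2))    # fixed set of points, built into the tree once
queries = rng.random((5, 2))      # points we want nearest neighbours for

tree = cKDTree(points)
dist, idx = tree.query(queries)   # nearest neighbour of each query point
nearest = points[idx]

# Brute-force check: argmin over all squared distances.
brute = ((points[None, :, :] - queries[:, None, :]) ** 2).sum(axis=2).argmin(axis=1)
print(np.array_equal(idx, brute))
```

Building the tree has an up-front cost, so this only pays off when the same set of points is queried repeatedly.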

If you just want the cartesian distance you can do something like the following:
def find_nearest(arr,value):
    newList = arr - value
    dist = np.sum(np.power(newList, 2), axis=1)
    # index back into arr so the point itself is returned, not its offset from value
    return arr[dist.argmin()]
I am assuming newList has a shape of (n,2). As a note, I changed the input variable array to arr to avoid issues if numpy is imported like: from numpy import *.

If you have scipy the following works:
import scipy.spatial.distance as ds
import numpy as np
pointOfInterest = np.array([[0, 0]])
Then:
arr[ds.cdist(pointOfInterest, arr)[0].argsort()[0]]
arr is your array above.

How about just using the key parameter in sorted?
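sorted(p, key=lambda pt: (pt[0]-m)**2 + (pt[1]-n)**2)
Here p is of the form array([[1,2], [3,4], ...]) and (m,n) is the tuple for the point to measure proximity to. (Tuple parameters in a lambda, as in lambda (a,b): ..., are only valid in Python 2.)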
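If the full ordering by proximity is wanted rather than only the closest point, np.argsort on the squared distances does it in one step. A small sketch, using four of the points from the question and (0, 0) as the target; the names target, order and sorted_pts are illustrative.

```python
import numpy as np

arr = np.array([[0.1648, 0.227],
                [0.78, 0.546],
                [0.0, 0.0022],
                [0.4448, 0.4146]])
target = np.array([0.0, 0.0])

# Squared Euclidean distance of every point to the target.
d2 = ((arr - target) ** 2).sum(axis=1)

order = np.argsort(d2)      # indices from nearest to farthest
sorted_pts = arr[order]
closest = sorted_pts[0]

print(order)    # [2 0 3 1]
print(closest)  # [0.     0.0022]
```

Squared distances are enough for ranking, so the square root can be skipped.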

Python Newaxis vs for loop

I am trying to make my program faster.
I have a matrix and a vector:
import numpy as N
GDES = N.array([[1,2,3,4,5],
                [6,7,8,9,10],
                [11,12,13,14,15],
                [16,17,18,19,20],
                [21,22,23,24,25]])
Ene = N.array([1,2,3,4,5])
NN = len(GDES)
I have defined a function for matrix multiplication:
def Gl(n,np,k,q):
    matrix = GDES[k,np]*GDES[k,n]*GDES[q,np]*GDES[q,n]
    return matrix
and I have made a for loop in my calculation:
SIl = N.zeros((NN,NN),float)
for n in range(NN):
    for np in range(NN):
        # builtin sum over the generators: accumulate over q, then over k
        SumJ = sum(sum(Gl(n,np,k,q) for q in range(NN)) for k in range(NN))
        SIl[n,np]=SumJ
print('SIl:',SIl)
output:
SIl: [[ 731025. 828100. 931225. 1040400. 1155625.]
[ 828100. 940900. 1060900. 1188100. 1322500.]
[ 931225. 1060900. 1199025. 1345600. 1500625.]
[ 1040400. 1188100. 1345600. 1512900. 1690000.]
[ 1155625. 1322500. 1500625. 1690000. 1890625.]]
I want to use newaxis to make it faster:
def G():
    Mknp = GDES[:, :, N.newaxis, N.newaxis]
    Mkn = GDES[:, N.newaxis, :, N.newaxis]
    Mqnp = GDES[:, N.newaxis, N.newaxis, :]
    Mqn = GDES[N.newaxis, :, :, N.newaxis]
    matrix=Mknp*Mkn*Mqnp*Mqn
    return matrix
tmp = G()
MGI = N.sum(N.sum(tmp,axis=3), axis=2)
MGI = N.reshape(MGI,(NN,NN))
print('MGI:', MGI)
output:
MGI: [[ 825 3900 9225 16800 26625]
[ 31200 92400 169600 262800 372000]
[ 146575 413400 722475 1073800 1467375]
[ 403200 1116900 1911600 2787300 3744000]
[ 857325 2352900 3980725 5740800 7633125]]
Any idea how I can get the right answer?

Your problem is a perfect fit for np.einsum:
>>> GDES = np.arange(1, 26).reshape(5, 5)
>>> np.einsum('kj,ki,lj,li->ij', GDES, GDES, GDES, GDES)
array([[ 731025, 828100, 931225, 1040400, 1155625],
[ 828100, 940900, 1060900, 1188100, 1322500],
[ 931225, 1060900, 1199025, 1345600, 1500625],
[1040400, 1188100, 1345600, 1512900, 1690000],
[1155625, 1322500, 1500625, 1690000, 1890625]])
For your particular case, this other syntax may be easier to figure out:
>>> np.einsum(GDES, [2,1], GDES, [2,0], GDES, [3,1], GDES, [3,0], [0,1])
array([[ 731025, 828100, 931225, 1040400, 1155625],
[ 828100, 940900, 1060900, 1188100, 1322500],
[ 931225, 1060900, 1199025, 1345600, 1500625],
[1040400, 1188100, 1345600, 1512900, 1690000],
[1155625, 1322500, 1500625, 1690000, 1890625]])
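For comparison, here is a sketch of how the newaxis approach could be laid out so that each of the four indices gets its own axis. The axis order (k, q, n, m) is my own choice, with m standing in for the index called np in the question. Because the sums over k and q factor apart, the same result can also be written as the elementwise square of GDES.T @ GDES.

```python
import numpy as np

G = np.arange(1, 26).reshape(5, 5)

# Axes of the 4-D product are ordered (k, q, n, m).
kn = G[:, None, :, None]   # G[k, n]
km = G[:, None, None, :]   # G[k, m]
qn = G[None, :, :, None]   # G[q, n]
qm = G[None, :, None, :]   # G[q, m]

# Sum over k and q (axes 0 and 1), leaving an (n, m) result.
by_broadcast = (kn * km * qn * qm).sum(axis=(0, 1))

by_einsum = np.einsum('kj,ki,lj,li->ij', G, G, G, G)

# sum_k G[k,n] G[k,m] is (G.T @ G)[n, m]; the q sum gives the same factor again.
by_matmul = (G.T @ G) ** 2

print(np.array_equal(by_broadcast, by_einsum), np.array_equal(by_einsum, by_matmul))
```

The broadcasting version materializes an (NN, NN, NN, NN) intermediate, so for larger matrices the einsum or matrix-product forms are likely to be the better fit.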
