Fast way to calculate customized function on a multi-dimensional array? - python

I was trying to evaluate a customized function over every point on an n-dimensional grid, after which I can marginalize and do corner plots.
This is conceptually simple but I'm struggling to find a way to do it efficiently. I tried a loop regardless, and it is indeed too slow, especially considering that I will be needing this for more parameters (a1, a2, a3...) as well. I was wondering if there is a faster way or any reliable package that could help?
EDIT: Sorry that my description of myfunction hasn't been very clear; the function depends on some specific external data. Nevertheless, here's a sample function that demonstrates it:
import numpy as np
from scipy.ndimage import gaussian_filter
#This gaussian filter is needed to process my data
data = gaussian_filter(np.array([[1, 2], [3, 4]]), sigma = 1)
model1 = np.array([[0, 1], [2, 3]])
model2 = np.array([[2, 3], [4, 5]])
models = np.array([model1, model2])
(This is just a demonstration. The actual data and models are some 500x500-ish 2D arrays.)
and then
from scipy.special import gammaln
def myfunction(params):
    """
    Calculates the log of the Poisson likelihood of generating data
    given model params and fits.
    params: array-like, with one entry per model. For example,
    if params = np.array([a1, a2]), we generate a combined model of
    a1*model1 + a2*model2.
    """
    model_combined = np.sum(models * params[:, None, None], axis=0)
    # Unfortunately I need to process my combined model as well
    model_smeared = gaussian_filter(model_combined, sigma=1)
    # The following line calculates the log of the Poisson likelihood of each
    # pixel taking its value, given the combined model as the expectation
    # value, taking advantage of numpy doing element-wise calculations
    # automatically in this case
    loglikelihood_array = data * np.log(model_combined) - model_combined - gammaln(data + 1)
    # Sum up the log-likelihoods
    loglikelihood_sum = np.sum(loglikelihood_array)
    return loglikelihood_sum
The function itself returns results immediately for a single parameter set, but not if I simply write a for-loop to evaluate it over, say, 100x100 pairs of parameters.
EDIT #2: I understand that the for-loop in my originally shown code (sorry for my sloppiness) was confusing (and thanks for the comments about the broadcasting simplification!), and I've just edited that.
My real problem isn't with combining the models[i], but with evaluating the entire function over the parameter grid (again described by a very sloppy for-loop here); loglikes is what I ultimately want:
a1_array = np.linspace(0, 2, 100)
a2_array = np.linspace(2, 4, 100)
loglikes = np.empty((100, 100))
for i in range(len(a1_array)):
    for j in range(len(a2_array)):
        loglikes[i, j] = myfunction(np.array([a1_array[i], a2_array[j]]))
I think there should be a better way of doing this than a for-loop, but I'm unfortunately not aware of one. When I say more parameters I mean, say, adding an a3_array = np.linspace(3, 5, 100), in which case loglikes becomes a 3-dimensional array, and so on; a sketch of that three-parameter version follows below.
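Just to illustrate the scaling, the three-parameter brute-force version I have in mind would look roughly like this (a3_array is hypothetical, and it assumes a matching third model has been added to models):

a3_array = np.linspace(3, 5, 100)   # hypothetical third parameter axis
loglikes_3d = np.empty((100, 100, 100))
for i in range(len(a1_array)):
    for j in range(len(a2_array)):
        for k in range(len(a3_array)):
            loglikes_3d[i, j, k] = myfunction(
                np.array([a1_array[i], a2_array[j], a3_array[k]]))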
Thanks again so much for any feedback!

Vectorising that loop won't save you any time and in fact may make things worse.
Instead of looping through a1_array and a2_array to create pairs, you can generate all pairs from the get-go by putting them in a 100x100x2 array. This operation takes an insignificant amount of time compared to python loops. But once you're inside the function and broadcasting your arrays so that you can do your calculations on data, you're suddenly dealing with 100x100x2x500x500 arrays. You won't have the memory for this, and if you rely on swapping to disk the operations become drastically slower.
Not only are you not saving any time (well, you do for very small arrays, but it's the difference between 0.03 s and 0.005 s), but with python loops you're only using a few tens of MB of RAM, while with the vectorised approach it skyrockets into the tens of GB.
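To put a rough number on that, using the sizes from the question (a 100x100 parameter grid and 500x500 models):

n_values = 100 * 100 * 2 * 500 * 500   # elements in the broadcast intermediate
print(n_values)                        # 5,000,000,000
print(n_values * 8 / 1e9)              # ~40 GB as float64 for that single intermediate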
But out of curiosity, this is how it could be vectorised.
import itertools

def vectorised(params):
    # params has shape (N, N, 2); broadcasting against models (shape (2, M, M))
    # creates an (N, N, 2, M, M) intermediate, which is then summed over the
    # parameter axis
    model_combined = np.sum(params[..., None, None] * models, axis=2)
    model_smeared = gaussian_filter(model_combined, sigma=1)
    log_array = data * np.log(model_combined) - model_combined - gammaln(data + 1)
    # sum the per-pixel log-likelihoods over the two image axes
    return np.sum(log_array, axis=(-2, -1))
np.random.seed(0)
M = 500   # image size
N = 100   # grid points per parameter
data = gaussian_filter(np.random.randint(0, 1000, (M, M)), sigma=1)
models = np.random.randint(1, 1000, (2, M, M))
a1 = np.linspace(0, 2, N)
a2 = np.linspace(2, 4, N)
# all (a1, a2) pairs, arranged as an (N, N, 2) array
a = np.array(list(itertools.product(a1, a2))).reshape((N, N, 2))
log_sum = vectorised(a)
And some benchmarks.
Bottom line: a python loop run 10,000 times just to fetch some array elements takes about 0.001 s in total. That is insignificant compared to your function, which takes about 0.01 s per call.
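If you want to reproduce that kind of comparison on your own machine, a minimal timing sketch (reusing myfunction, vectorised and the arrays defined above; the looped helper is my own wrapper) could look like this. Beware that the vectorised call needs tens of GB of RAM at these sizes:

import time

def looped(a1_vals, a2_vals):
    # plain double loop over the parameter grid
    out = np.empty((len(a1_vals), len(a2_vals)))
    for i, a1_i in enumerate(a1_vals):
        for j, a2_j in enumerate(a2_vals):
            out[i, j] = myfunction(np.array([a1_i, a2_j]))
    return out

t0 = time.perf_counter()
loop_result = looped(a1, a2)
t1 = time.perf_counter()
vec_result = vectorised(a)          # very memory-hungry at these sizes
t2 = time.perf_counter()
print(t1 - t0, t2 - t1)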

Related

Efficiently apply SciPy optimization methods to arrays

The equation f(x,a,b) below requires an iterative solution, for which I am using one of the scipy optimisation methods ('brentq'), which essentially calculates the value of x for which f(x,a,b)=0.
However, I need to use array inputs for 'a' and 'b', and the arrays are very large, e.g. 1-100 million elements.
What is the most efficient/fastest way to do this with scipy/numpy? At present I am resorting to for loops as per below, but this becomes slow with my actual underlying equations (not shown). Note that each row in array is independent of others.
import numpy as np
from scipy import optimize
# function to solve (simplified)
def f(x,a,b): return (a/x)**0.25 * (x**0.5) - b*x
# array size
N = 10000000
# example input arrays from which 'a' and 'b' are taken (in reality values come from other complex functions)
A = np.linspace(1,500,N)
B = np.linspace(0.1,1,N)
# solution using brentq (bracketing the root between a small positive number and 1000)
results = [optimize.brentq(f, 1e-10, 1000, args=(a, b)) for a, b in zip(A, B)]
results
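One possibility worth checking (my own sketch, not part of the original post): scipy.optimize.newton accepts an array of starting points and evaluates the function on whole arrays, so the secant iteration runs vectorised over every (a, b) pair at once. Whether it converges for your real equations depends on the function and the starting guess, so treat it as an assumption to verify against a few brentq solutions:

# vectorised secant iteration: x0, A and B are all arrays of length N
x0 = np.full(N, 1.0)                 # starting guess; assumed reasonable for f
results_vec = optimize.newton(f, x0, args=(A, B), maxiter=100)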

Using a function to populate a numpy array

I have made a function that performs a random walk simulation (random_path) and returns a 1D array (of length num_steps +1). I would like to perform a large number of simulations (n_sims) using this function and then examine my results. I can do this using lists and for loops as:
simulations = []
for i in range(0, n_sims):
    current_sim = random_path(x, y, sigma, T, num_steps)
    simulations.append(current_sim)
This works fine. I am wondering if there is a more pythonic way of doing this though? Is it possible to do this using only numpy arrays? That is, instead of setting up simulations as an empty list and then creating a list of arrays with a for loop, can I directly initialise simulations using the function random_path to create an array that I guess ultimately would be of shape (n_sims, num_steps + 1)?
Let's assume that you generate your random walk something like this (and if you don't, you probably should):
walk = np.r_[0, np.random.normal(scale=sigma, size=N).cumsum()]
To make M simulations, just generate the appropriate number of data points and sum over the correct axis:
walks = np.concatenate((np.zeros((M, 1)), np.random.normal(scale=sigma, size=(M, N)).cumsum(axis=-1)), axis=-1)
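For reference, each row of walks is one simulation; a quick shape check with illustrative values for M, N and sigma (placeholders for n_sims, num_steps and the step scale):

M, N, sigma = 1000, 250, 1.0
walks = np.concatenate((np.zeros((M, 1)),
                        np.random.normal(scale=sigma, size=(M, N)).cumsum(axis=-1)),
                       axis=-1)
print(walks.shape)   # (M, N + 1): each row starts at 0, matching the single-walk version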
You can use list comprehension:
simulations = [random_path(x, y, sigma, T, num_steps) for i in range(n_sims)]
If you don't want to explicitly use lists you can try with numpy.vectorize:
import numpy as np
vect_func = np.vectorize(lambda ignored: random_path(x, y, sigma, T, num_steps))
simulations = vect_func(range(n_sims))
In this particular case, ignored is, in fact, ignored.
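Depending on your numpy version and on random_path returning a 1-D array, np.vectorize may need to be told explicitly that each call produces an array rather than a scalar (this is my own caveat, not part of the original answer); the signature argument handles that:

vect_func = np.vectorize(lambda ignored: random_path(x, y, sigma, T, num_steps),
                         signature='()->(n)')   # each call maps a scalar to a length-n array
simulations = vect_func(np.arange(n_sims))      # shape (n_sims, num_steps + 1)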

How to cluster very big sparse data set using low memory in Python?

I have data which forms a sparse matrix of shape 1000 x 1e9. I want to cluster the 1000 examples into 10 clusters using K-means.
The matrix is very sparse: fewer than 1 in 1e6 entries are non-zero.
My laptop has 16 GB of RAM. I tried a sparse matrix in scipy. Unfortunately, the matrix makes the clustering process need much more memory than I have. Could anyone suggest a way to do this?
My system crashed when running the following test snippet:
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
row = np.array([0, 0, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8])
col = np.array([0, 2, 2, 0, 1, 2] * 3)
data = np.array([1, 2, 3, 4, 5, 6] * 3)
X = csr_matrix((data, (row, col)), shape=(9, int(1e9)))  # shape entries must be integers
resC = KMeans(n_clusters=3).fit(X)
resC.labels_
Any helpful suggestion is appreciated.
KMeans centers will not be sparse anymore, so this would need careful optimization for the sparse case (which may be costly for the usual case, so it probably isn't optimized this way).
You can try ELKI (not python but Java), which often is much faster and also has sparse data types. Using single-precision floats will also help.
But in the end, the results will be questionable: k-means is statistically rooted in least-squares. It assumes your data comes from k signals plus some Gaussian error. Because your data is sparse, it obviously does not have this kind of Gaussian shape. When the majority of values is 0, it cannot be a Gaussian.
With just 1000 data points, I'd rather use HAC (hierarchical agglomerative clustering).
Whatever you do (for your data, given your memory constraints): k-means is not ready for that!
This includes:
Online KMeans / MiniBatchKMeans, as proposed in another answer:
it only helps with handling many samples (and is hurt by the same effect mentioned below)!
Various KMeans implementations in different languages (it's an algorithmic problem, not something bound to a particular implementation).
Ignoring potential theoretical reasons (high dimensionality and non-convex heuristic optimization), I'm just mentioning the practical problem here:
your centroids might become non-sparse! (mentioned in a side note by SO's clustering expert; that link also mentions alternatives)
This means the sparse data structures used will become very non-sparse and eventually blow up your memory!
(I modified sklearn's code to observe what the above link already mentions.)
The relevant sklearn code is: center_shift_total = squared_norm(centers_old - centers)
Even if you remove / turn off all the memory-heavy components, like:
init=some_sparse_ndarray (instead of k-means++)
n_init=1 instead of 10
precompute_distances=False instead of True (unclear if it helps)
n_jobs=1 instead of -1
the centroid densification above will still be your problem!
Although KMeans accepts sparse matrices as input, the centroids used within the algorithm have a dense representation, and your feature space is so big that even 10 centroids will not fit into 16GB of RAM.
I have 2 ideas (a rough sketch of both follows below):
Can you fit the clustering into RAM if you discard all empty columns? If you have 1000 samples and only about 1 in 1e6 values is occupied, then probably fewer than 1 in 1000 columns will contain any non-zero entries.
Several clustering algorithms in scikit-learn will accept a matrix of distances between samples instead of the full data, e.g. sklearn.cluster.SpectralClustering. You could precompute the pairwise distances in a 1000x1000 matrix and pass that to your clustering algorithm instead. (I can't make a specific recommendation of a clustering method, or of a suitable function to calculate the distances, as it will depend on your application.)
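A rough sketch of both ideas, using a small random sparse matrix as a stand-in for the real data (the variable names and the choice of Euclidean distance here are my own assumptions):

import numpy as np
from scipy import sparse
from sklearn.metrics import pairwise_distances

# small random sparse matrix as a stand-in for the real 1000 x 1e9 data
X = sparse.random(1000, 100000, density=1e-6, format='csc', random_state=0)

# idea 1: drop columns that contain no non-zero entries at all
nonzero_per_col = np.diff(X.indptr)              # CSC stores one count per column
X_reduced = X[:, np.flatnonzero(nonzero_per_col)]
print(X_reduced.shape)

# idea 2: precompute a 1000 x 1000 pairwise distance matrix (works directly on
# sparse input) and feed it to a clusterer that accepts precomputed distances
D = pairwise_distances(X_reduced, metric='euclidean')
print(D.shape)                                   # (1000, 1000)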
Consider using dict, since it will only store the values which were assigned. I guess a nice way to do this is by creating a SparseMatrix object like this:
class SparseMatrix(dict):
    def __init__(self, mapping=[]):
        dict.__init__(self, {i: mapping[i] for i in range(len(mapping))})

    # overriding this method makes never-assigned indexes return 0.0
    def __getitem__(self, i):
        try:
            return dict.__getitem__(self, i)
        except KeyError:
            return 0.0
>>> my_matrix = SparseMatrix([1,2,3])
>>> my_matrix[0]
1
>>> my_matrix[5]
0.0
Edit:
For the multi-dimensional case you may need to override the two item-management methods as follows:
def __setitem__(self, ij, value):
    i, j = ij
    dict.__setitem__(self, i * self.n + j, value)

def __getitem__(self, ij):
    try:
        i, j = ij
        return dict.__getitem__(self, i * self.n + j)
    except KeyError:
        return 0.0
>>> my_matrix[0,0] = 10
>>> my_matrix[1,2]
0.0
>>> my_matrix[0,0]
10
Also, this assumes you have defined self.n as the row length (i.e. the number of columns).

Efficient way to perform a 2D x 1D Matrix Multiply

I am trying to perform a 2D by 1D matrix multiply. Specifically:
import numpy as np
s = np.ones(268)
one_d = np.ones(9422700)
s_newaxis = s[:, np.newaxis]
goal = s_newaxis * one_d
While the dimensions above are the same as in my problem ((268, 1) and (9422700,)), the actual values in my arrays are a mix of very large and very small numbers. I can run goal = s_newaxis * one_d in the example above because it only contains 1s; however, I run out of RAM using my actual data.
I recognize that, at the end of the day, this amounts to a matrix with ~2.5 billion values and so a heavy memory footprint is to be expected. However, any improvement in terms of efficiency would be welcome.
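For a rough sense of scale (standard float64 element size; this is just arithmetic, not a measurement):

rows, cols = 268, 9422700
n_values = rows * cols            # ~2.53 billion elements
print(n_values * 8 / 1e9)         # ~20 GB as float64 (roughly half that as float32)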
For completeness, I've included a rough attempt. It is not very elegant, but it is just enough of an improvement that it won't crash my computer (admittedly a low bar).
import gc

def arange(start, stop, step):
    # `arange` which includes the endpoint (`stop`).
    arr = np.arange(start=start, stop=stop, step=step)
    if arr[-1] < stop:
        return np.append(arr, [stop])
    else:
        return arr

left, all_arrays = 0, list()
# start at `step` so the first chunk is s_newaxis[0:10] rather than an empty slice
for right in arange(10, stop=s_newaxis.shape[0], step=10):
    chunk = s_newaxis[left:right, :] * one_d
    all_arrays.append(chunk)
    left = right
    gc.collect()  # unclear if this makes any difference...I suspect not.
goal = np.vstack(all_arrays)

find 2d elements in a 3d array which are similar to 2d elements in another 3d array

I have two 3D arrays and want to identify 2D elements in one array, which have one or more similar counterparts in the other array.
This works in Python 3:
import numpy as np
import random
np.random.seed(123)
A = np.round(np.random.rand(25000,2,2),2)
B = np.round(np.random.rand(25000,2,2),2)
a_index = np.zeros(A.shape[0])
for a in range(A.shape[0]):
    for b in range(B.shape[0]):
        if np.allclose(A[a, :, :].reshape(-1, A.shape[1]),
                       B[b, :, :].reshape(-1, B.shape[1]),
                       rtol=1e-04, atol=1e-06):
            a_index[a] = 1
            break
np.nonzero(a_index)[0]
But of course this approach is awfully slow. Please tell me that there is a more efficient way (and what it is). THX.
You are trying to do an all-nearest-neighbor type query. This is something that has special O(n log n) algorithms, but I'm not aware of a python implementation of them. However, you can use a regular nearest-neighbor query, which is also O(n log n), just a bit slower. For example, scipy.spatial.KDTree or cKDTree.
import numpy as np
import random
np.random.seed(123)
A = np.round(np.random.rand(25000,2,2),2)
B = np.round(np.random.rand(25000,2,2),2)
import scipy.spatial
tree = scipy.spatial.cKDTree(A.reshape(25000, 4))
results = tree.query_ball_point(B.reshape(25000, 4), r=1e-04, p=1)
print([r for r in results if r != []])
# [[14252], [1972], [7108], [13369], [23171]]
query_ball_point() is not an exact equivalent to allclose() but it is close enough, especially if you don't care about the rtol parameter to allclose(). You also get a choice of metric (p=1 for city block, or p=2 for Euclidean).
P.S. Consider using query_ball_tree() for very large data sets. Both A and B have to be indexed in that case.
P.S. I'm not sure what effect the 2d-ness of the elements should have; the sample code I gave treats them as 1d, which is equivalent at least when using the city block metric.
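A minimal sketch of the query_ball_tree() variant mentioned above (both arrays indexed, same radius and metric as before):

tree_a = scipy.spatial.cKDTree(A.reshape(25000, 4))
tree_b = scipy.spatial.cKDTree(B.reshape(25000, 4))
# for each point of tree_a, the indices of tree_b points within radius r
matches = tree_a.query_ball_tree(tree_b, r=1e-04, p=1)
a_with_match = [i for i, m in enumerate(matches) if m]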
From the docs of np.allclose, we have:
If the following equation is element-wise True, then allclose returns True.
absolute(a - b) <= (atol + rtol * absolute(b))
Using that criterion, we can have a vectorized implementation using broadcasting, customized for the stated problem, like so:
# Setup parameters
rtol, atol = 1e-04, 1e-06

# Use the np.allclose criterion to detect True/False across all pairwise elements
mask = np.abs(A[:, None] - B) <= (atol + rtol * np.abs(B))

# Use the problem context to get the final output
out = np.nonzero(mask.all(axis=(2, 3)).any(1))[0]
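One caveat about this fully broadcast approach (my own note, not part of the original answer): the intermediates cover every pairwise combination of 2x2 blocks, so memory grows quadratically with the number of elements:

n_a = n_b = 25000
n_pairs = n_a * n_b * 2 * 2        # 2.5e9 pairwise scalar comparisons
print(n_pairs * 8 / 1e9)           # ~20 GB for the float64 difference array
print(n_pairs * 1 / 1e9)           # ~2.5 GB for the boolean mask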
