Using a function to populate a numpy array - python

I have written a function that performs a random walk simulation (random_path) and returns a 1D array (of length num_steps + 1). I would like to run a large number of simulations (n_sims) using this function and then examine the results. I can do this with lists and a for loop:
simulations = []
for i in range(0, n_sims):
    current_sim = random_path(x, y, sigma, T, num_steps)
    simulations.append(current_sim)
This works fine. I am wondering if there is a more pythonic way of doing this though? Is it possible to do this using only numpy arrays? That is, instead of setting up simulations as an empty list and then creating a list of arrays with a for loop, can I directly initialise simulations using the function random_path to create an array that I guess ultimately would be of shape (n_sims, num_steps + 1)?

Let's assume that you generate your random walk something like this (and if you don't, you probably should):
walk = np.r_[0, np.random.normal(scale=sigma, size=N).cumsum()]
To make M simulations, just generate the appropriate number of data points and sum over the correct axis:
walks = np.concatenate((np.zeros((M, 1)), np.random.normal(scale=sigma, size=(M, N)).cumsum(axis=-1)), axis=-1)
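As a quick sanity check of the shapes involved, here is a minimal sketch with made-up values for sigma, M and N; each row of walks is one simulation of num_steps + 1 points:
import numpy as np

sigma, M, N = 0.1, 5, 100   # hypothetical values for illustration only
walks = np.concatenate(
    (np.zeros((M, 1)),
     np.random.normal(scale=sigma, size=(M, N)).cumsum(axis=-1)),
    axis=-1)
print(walks.shape)  # (5, 101), i.e. (n_sims, num_steps + 1)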

You can use list comprehension:
simulations = [random_path(x, y, sigma, T, num_steps) for i in range(n_sims)]
If you don't want to explicitly use lists you can try with numpy.vectorize:
import numpy as np
vect_func = np.vectorize(lambda ignored: random_path(x, y, sigma, T, num_steps))
simulations = vect_func(range(n_sims))
In this particular case, ignored is, in fact, ignored.
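One caveat worth flagging: np.vectorize assumes scalar outputs by default, so if random_path returns a 1-D array you may need to spell out a generalized-ufunc signature. A hedged sketch, reusing the names random_path, x, y, sigma, T, num_steps and n_sims assumed from the question:
import numpy as np

# signature '()->(m)': scalar in, 1-D array out; m is inferred from the first call
vect_func = np.vectorize(
    lambda ignored: random_path(x, y, sigma, T, num_steps),
    signature='()->(m)')
simulations = vect_func(np.arange(n_sims))  # shape (n_sims, num_steps + 1)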

Related

How to vectorize a 2D scalar function over a mesh

I have a function foo(x,y) that takes two scalars (or lists of scalars) and returns a scalar output (or list of scalars computed pairwise from the input). I want to be able to evaluate this function over 2 orthogonal arrays such that the output is a matrix ij of foo(x[i], y[j]).
I have a for-loop version that solves this problem as below:
import numpy as np

x = np.arange(50)  # Could be linspaces, whatever the axis in the vector space is
y = np.arange(50)
mat = np.zeros((len(x), len(y)))  # To hold the result for plotting

for i in range(len(x)):
    for j in range(len(y)):
        mat[i][j] = foo(x[i], y[j])
where my result is stored in mat. However, this is dreadfully slow, and looks to me as if it could easily be vectorized. I'm not aware of how Python solves this problem however, as this doesn't appear to be something like zip or map. Is there another such function or concept (beyond trivially making extremely long arrays of the same array rotated by a value and passing them that way) that could vectorize this successfully? Or is the nature of the foo function limiting the ability to vectorize this?
In this case, itertools.product is the tool you want. It generates an iterable sequence of elements from the Cartesian product of N inputs, which you can use to discretely map a vector space. You can then evaluate foo on these. This isn't vectorization per se, but it does collapse the nested for loops into a single loop.
See docs at https://docs.python.org/3/library/itertools.html#itertools.product
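A minimal sketch of that approach, assuming foo, x and y from the question (the reshape relies on product iterating over y fastest):
import itertools
import numpy as np

pairs = itertools.product(x, y)  # Cartesian product of the two axes, y varying fastest
mat = np.array([foo(xi, yj) for xi, yj in pairs]).reshape(len(x), len(y))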

Increase performance of np.where() loop

I am trying to speed up the code for the following script (ideally by more than 4x) without multiprocessing. In a future step, I will implement multiprocessing, but the current speed is too slow even if I split it across 40 cores. Therefore I'm trying to optimize the code first.
import numpy as np
def loop(x, y, q, z):
    matchlist = []
    for ind in range(len(x)):
        matchlist.append(find_match(x[ind], y[ind], q, z))
    return matchlist

def find_match(x, y, q, z):
    A = np.where(q == x)
    B = np.where(z == y)
    return np.intersect1d(A, B)
# N will finally scale up to 10^9
N = 1000
M = 300
X = np.random.randint(M, size=N)
Y = np.random.randint(M, size=N)
# Q and Z size is fixed at 120000
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)
# convert int32 arrays to str64 arrays, to represent original data (which are strings and not numbers)
X = np.char.mod('%d', X)
Y = np.char.mod('%d', Y)
Q = np.char.mod('%d', Q)
Z = np.char.mod('%d', Z)
matchlist = loop(X,Y,Q,Z)
I have two arrays (X and Y) which are identical in length. Each row of these arrays corresponds to one DNA sequencing read (basically strings of the letters 'A','C','G','T'; details not relevant for the example code here).
I also have two 'reference arrays' (Q and Z) which are identical in length, and I want to find the occurrence (with np.where()) of every element of X within Q, as well as of every element of Y within Z (basically the find_match() function). Afterwards I want to know whether there is an overlap/intersection between the indices found for X and Y.
Example output (matchlist): some rows of X/Y have matching indices in Q/Z, some don't (e.g. index 11).
The code works fine so far, but it would take quite long to execute with my final dataset, where N=10^9 (in this code example N=1000 to make the tests quicker). 1000 rows of X/Y need about 2.29 s to execute on my laptop.
Every find_match() takes about 2.48 ms to execute which is roughly 1/1000 of the final loop.
A first approach would be to combine x with y as well as q with z, so that I only need to run np.where() once, but I couldn't get that to work yet.
I've tried looping and looking up within Pandas (.loc), but this was about 4x slower than np.where().
The question is closely related to a recent question from philshem (Combine several NumPy "where" statements to one to improve performance), however, the solutions provided on this question do not work for my approach here.
Numpy isn't too helpful here, since what you need is a lookup into a jagged array, with strings as the indexes.
lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
    lookup.setdefault((q, z), []).append(i)

matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
If you don't need the output as a jagged array, but are OK with just a boolean denoting presence, and can preprocess each string to a number, there is a much faster method.
lookup = np.zeros((300, 300), dtype=bool)
lookup[Q, Z] = True
matchlist = lookup[X, Y]
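For the string data in the example, one way to do that preprocessing is to map each distinct string to a small integer code first. A hedged sketch using np.unique (Q, Z, X, Y are the arrays from the question; the *_int and *_vocab names are made up here):
import numpy as np

# encode Q and X against a shared vocabulary, and likewise Z and Y
qx_vocab, qx_codes = np.unique(np.concatenate([Q, X]), return_inverse=True)
Q_int, X_int = qx_codes[:len(Q)], qx_codes[len(Q):]

zy_vocab, zy_codes = np.unique(np.concatenate([Z, Y]), return_inverse=True)
Z_int, Y_int = zy_codes[:len(Z)], zy_codes[len(Z):]

lookup = np.zeros((len(qx_vocab), len(zy_vocab)), dtype=bool)
lookup[Q_int, Z_int] = True
matchlist = lookup[X_int, Y_int]  # boolean: does the pair (X[i], Y[i]) occur anywhere in (Q, Z)?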
You typically won't want to use this method to replace the former jagged case, as dense variants (e.g. Daniel F's solution) will be memory-inefficient and numpy does not support sparse arrays well. However, if more speed is needed then a sparse solution is certainly possible.
You only have 300*300 = 90000 unique answers. Pre-compute.
Q_ = np.arange(300)[:, None] == Q
Z_ = np.arange(300)[:, None] == Z
lookup = np.logical_and(Q_[:, None, :], Z_)
lookup.shape
Out[]: (300, 300, 120000)
Then the result is just:
out = lookup[X, Y]
If you really want the indices you can do:
i = np.where(out)
out2 = np.split(i[1], np.flatnonzero(np.diff(i[0]))+1)
You'll want to parallelize this method by chunking, since a boolean array of shape (120000, 1000000000) would throw a MemoryError.
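A hedged sketch of that chunking, reusing the precomputed (300, 300, 120000) lookup from above and the np.split idiom from this answer (it assumes integer-coded X and Y, inherits the caveat that rows with no match produce no entry, and the chunk size is an arbitrary illustration chosen so each (chunk, 120000) boolean slice stays modest in memory):
import numpy as np

chunk = 5000  # illustrative; tune to available memory
matchlist = []
for start in range(0, len(X), chunk):
    out = lookup[X[start:start + chunk], Y[start:start + chunk]]
    i = np.where(out)
    if i[0].size:  # one array of Q/Z indices per chunk row that has at least one match
        matchlist.extend(np.split(i[1], np.flatnonzero(np.diff(i[0])) + 1))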

Is there a more efficient (i.e. faster) way to compute correlated random walks?

Is there any way to vectorize or otherwise speed this up? I'm already jitting it with numba, but it is still a major bottleneck. Using numba to jit my functions on 1-D numpy arrays leads to code that is orders of magnitude faster, but there is essentially negligible improvement when using numba on the 2-D arrays below. decomposition is a numpy matrix representing the Cholesky decomposition of the correlation matrix, x and n are constants, and nrand is numpy.random.
@jit
def generate_random_correlated_walks(decomposition, x, n):
    uncorrelated_walks = np.empty((2*x, n))
    for i in range(x):
        # Generate the random uncorrelated walks
        wv = nrand.normal(loc=0, scale=1, size=n)
        ws = nrand.normal(loc=0, scale=1, size=n)
        uncorrelated_walks[2*i] = wv
        uncorrelated_walks[(2*i) + 1] = ws

    # Create a matrix out of these walks
    uncorrelated_walks = np.matrix(uncorrelated_walks)
    m, n = uncorrelated_walks.shape
    correlated_walks = np.empty((m, n))

    # Go through each column and correlate the walk values
    for i in range(n):
        correlated_timestep = np.transpose(uncorrelated_walks[:, i]) * decomposition
        correlated_walks[:, i] = correlated_timestep

    return correlated_walks
EDIT: I have made the suggested changes, and now my code is as below, but unfortunately it is still a major bottleneck. Any ideas?
@jit
def generate_random_correlated_walks(self, decomposition, x, n):
    rows = 2*x
    uncorrelated_walks = np.random.normal(loc=0, scale=1, size=(rows, n))
    correlated_walks = np.empty((rows, n))
    for i in range(n):
        correlated_timestep = np.dot(np.transpose(uncorrelated_walks[:, i]), decomposition)
        correlated_walks[:, i] = correlated_timestep
    return correlated_walks
The first thing you can improve is to remove your for loop. There is no advantage to calling np.random.normal a bunch of times with the same input parameters if you believe that it really generates random numbers.
Instead of using np.matrix, use np.array. This will make your life easier when you consider that the previous item can be used to shorten the entire first portion of your function into one step.
You can of course completely remove the final loop with a simple matrix multiplication: uncorrelated_walks.T @ decomposition will give you the transpose of your current correlated_walks. You can avoid one of the transposes by changing the order of the arguments.
You end up with something like:
def generate_random_correlated_walks(decomposition, x, n):
    uncorrelated_walks = nrand.normal(loc=0, scale=1, size=(2*x, n))
    correlated_walks = np.dot(decomposition.T, uncorrelated_walks)
    return correlated_walks
Not sure how much this will help you, but removing the Python-level loops should be some kind of boost since it will reduce the overhead of multiple numpy calls.
You could sacrifice legibility to turn the whole thing into a one-liner:
def generate_random_correlated_walks(decomposition, x, n):
    return np.dot(decomposition.T, nrand.normal(loc=0, scale=1, size=(2*x, n)))
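A minimal usage sketch with made-up sizes (the correlation matrix and its Cholesky factor here are illustrative, not from the original post):
import numpy as np
import numpy.random as nrand

x, n = 3, 1000
corr = np.full((2*x, 2*x), 0.5) + 0.5 * np.eye(2*x)  # hypothetical correlation matrix
decomposition = np.linalg.cholesky(corr)

walks = generate_random_correlated_walks(decomposition, x, n)
print(walks.shape)  # (6, 1000): 2*x correlated walks of n steps each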

scipy parallel interpolation of multiple arrays

I have multiple arrays of the same dimension, or rather a matrix, say
data.shape
# (n, m)
I want to interpolate the m-axis and leave the n-axis. Ideally I would get a function which I can call with an x-array of length n.
interpolated(x)
x.shape
# (n,)
I tried
from scipy import interpolate
interpolated = interpolate.interp1d(x=x_points, y=data)
interpolated(x).shape
# (n, n)
but this evaluates every array at the given point. Is there a better way to do it than ugly loops like
interpolated = [interpolate.interp1d(x=x_points, y=array_)
                for array_ in data]
array([func_(xi) for func_, xi in zip(interpolated, x)])
Your (n,m)-shaped data is, as you said, a collection of n datasets, each of length m. You're trying to pass it an n-length x array, and expect to obtain an n-length result. That is, you're querying the n independent datasets at n unrelated points.
This makes me believe that you need to use n independent interpolators. There is no real benefit in trying to get away with a single call to an interpolation routine. Interpolation routines as far as I know assume that the target of the interpolation is a single object. Either a multivariate function, or a function that has an array-shaped value; in either case you can query the function one (optionally higher-dimensional) point at a time. For instance, multilinear interpolation works across rows of the input, so there's (again, as far as I know) no way to "interpolate linearly along an axis". In your case, there is absolutely no relationship between the rows of your data, and there's no relationship between query points, so it's also semantically motivated to use n independent interpolators for your problem.
As for convenience, you can shove all those interpolating functions into a single function for ease of use:
interpolated = [interpolate.interp1d(x=x_points, y=array_)
                for array_ in data]

def common_interpolator(x):
    '''interpolate n separate datasets at n separate input points'''
    return array([fun(xx) for fun, xx in zip(interpolated, x)])
This will allow you to use a single call to common_interpolator with an input array_like of length n.
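A minimal usage sketch with dummy data (the shapes mirror the question; x_points, data and the sine-based rows are made up here purely for illustration):
import numpy as np
from scipy import interpolate

x_points = np.linspace(0.0, 1.0, 20)             # the shared m sample locations
data = np.sin(x_points + np.arange(5)[:, None])  # (n, m) = (5, 20) independent datasets

interpolated = [interpolate.interp1d(x=x_points, y=row) for row in data]

def common_interpolator(x):
    return np.array([fun(xx) for fun, xx in zip(interpolated, x)])

x_query = np.linspace(0.1, 0.9, 5)               # one query point per dataset
print(common_interpolator(x_query).shape)        # (5,)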
But since you mentioned it in comments, you can actually make use of np.vectorize if you want to pass multiple sets of query points to this function. Here's a complete example with three trivial dummy functions:
import numpy as np

# three scalar (well, or vectorized) functions:
funs = [lambda x, i=i: x + i for i in range(3)]

# define a wrapper for calling them together
def allfuns(xs):
    '''bundled call to functions: n-length input to n-length output'''
    return np.array([fun(x) for fun, x in zip(funs, xs)])

# define a vectorized version of the wrapper, (...,n) to (...,n)-shape
allfuns_vector = np.vectorize(allfuns, signature='(n)->(n)')

# print some examples
x = np.arange(3)
print([fun(xx) for fun, xx in zip(funs, x)])
# [0, 2, 4]
print(allfuns(x))
# [0 2 4]
print(allfuns_vector(x))
# [0 2 4]
print(allfuns_vector([x, x + 10]))
# [[ 0  2  4]
#  [10 12 14]]
As you can see, all of the above work the same way for a 1d input array. But we can pass a (k,n)-shaped array to the vectorized version and it will perform the work row-wise, that is, each length-n row will be fed to the original function bundle (the interpolator bundle in your case). As far as I know np.vectorize is essentially a wrapper for a for loop, but at least it makes calling your functions more convenient.

best method of making an array

I'm new to programming and am a bit unsure about how to write my own for loop. This is what I would like to do:
Let us subdivide the interval [0,1] into n points x_0 = 0, ..., x_{n-1} = 1.
Write a function compute_discrete_u(epsilon, n) that returns two numpy arrays:
x_array contains the coordinates of the n points
u_array contains the discrete values of u at these points.
u(x) = sin(1/(x + ε))
Thank you!
First of all, you do not need a for loop at all. You want to use numpy, so you can use the vectorized operations that numpy is built upon.
Here's the function you are literally asking for (and most likely not how you should solve your problem):
# Do NOT use this.
import numpy as np

def compute_discrete_u(epsilon, n):
    x = np.linspace(0, 1, n)
    return x, np.sin(1 / (x + epsilon))
That's quite an awkward API. From a design point-of-view, you are mixing two responsibilities in the function:
Generating a certain x vector
Calculating a u vector based on a mathematical function.
You should not do this for complexity and reusability reasons. What if you want a non-uniform x later on?
So here's what you should do:
import numpy as np

def compute_u(x, epsilon):
    return np.sin(1 / (x + epsilon))

x = np.linspace(0, 1, num=101)
u = compute_u(x, epsilon=1e-3)
This is easier to understand because the function is just the mathematical function. Additionally, you can compute u for any x array (or single float) you like. If you do not need compute_u elsewhere, you may even drop it completely and write u = np.sin(1 / (x + epsilon)).
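To illustrate the reusability point above, a hedged sketch with a non-uniform grid (the squared spacing is just one arbitrary way to cluster points near x = 0):
import numpy as np

x_nonuniform = np.linspace(0, 1, num=101) ** 2        # denser points near x = 0
u_nonuniform = compute_u(x_nonuniform, epsilon=1e-3)  # same function, different grid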
