Eliminating Redundancy with (Multiple) Nested For-Loops - python

for x in range(10):
    for y in range(10):
        for z in range(10):
            if (1111*x + 1111*y + 1111*z) == (10000*y + 1110*x + z):
                print(z)
Is there a way to shorten this code, specifically the first 3 lines where I've used three similar looking for loops? I'm quite new to python so please explain any modules used, if possible.

Well, you're essentially evaluating a function in a 3D coordinate system, with coordinates given by x, y, and z. So let's look at NumPy, which implements arrays in Python. If you're familiar with MATLAB or IDL, these arrays have similar functionality.
import numpy
x = numpy.arange(10) #Like range, but returns a NumPy array instead of a range object
y = numpy.arange(10)
z = numpy.arange(10)
#Now build a 3d array with every point
#defined by the coordinate arrays
xg, yg, zg = numpy.meshgrid(x,y,z)
#Evaluate your functions
#and store the Boolean result in an array.
mask = (1111*xg + 1111*yg + 1111*zg) == (10000*yg + 1110*xg + zg)
#Print out the z values where the mask is True
print(zg[mask])
Is this more readable? Debatable. Is it shorter? No. But it does leverage array operations which may be faster in certain circumstances.
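If the goal is only to collapse the three loop headers, the standard library's itertools.product is another option. It is not part of the answer above; the sketch below assumes the same equation and the same 0-9 ranges as the question, and the name solutions is introduced here for illustration.

```python
from itertools import product

# product(range(10), repeat=3) yields every (x, y, z) triple in the same
# order as three nested loops with x outermost and z innermost.
solutions = [
    (x, y, z)
    for x, y, z in product(range(10), repeat=3)
    if (1111*x + 1111*y + 1111*z) == (10000*y + 1110*x + z)
]

for x, y, z in solutions:
    print(z)
```

This keeps plain Python integers and needs no third-party module, at the cost of running the loop in the interpreter rather than in compiled array code.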

Related

How to fill np array with values of function?

Imagine I have some function, for example
def f(x, y):
    return x * y
And I want to fill some matrix with its values. The easiest way to do is, for example
N = 10
X = np.arange(N)
Y = np.arange(N)
matrix = np.zeros((N, N))
for i, x in enumerate(X):
    for j, y in enumerate(Y):
        matrix[i][j] = f(x,y)
How can I do it in a Pythonic way? For example, using np.vectorize?
Using something like np.fromfunction or np.vectorize will give you little, if any, advantage over a normal for loop. In numpy, you can take advantage of the fact that vectorized operations use loops implemented in C. The problem is that there is no general solution to vectorize your function. For the example you give, it's possible though:
x = np.arange(N)
y = np.arange(N)
matrix = x[:, None] * y
For more complex operations that cannot be reduced to NumPy function calls, you may want to consider using Cython or Numba.
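As a quick check of the broadcasting approach, the sketch below passes the broadcast arrays straight into f and compares the result with the explicit double loop. It assumes f is built only from operations that broadcast (as x * y is); the name expected is introduced here for illustration.

```python
import numpy as np

def f(x, y):
    return x * y

N = 10
x = np.arange(N)
y = np.arange(N)

# x[:, None] has shape (N, 1) and y has shape (N,), so the two broadcast
# to an (N, N) grid and f is evaluated on every (x, y) pair at once.
matrix = f(x[:, None], y)

# Reference result from the explicit double loop in the question.
expected = np.zeros((N, N))
for i, xi in enumerate(x):
    for j, yj in enumerate(y):
        expected[i, j] = f(xi, yj)

print(np.array_equal(matrix, expected))
```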

Increase performance of np.where() loop

I am trying to speed up the code for the following script (ideally >4x) without multiprocessing. In a future step, I will implement multiprocessing, but the current speed is too slow even if I split it up across 40 cores. Therefore I'm trying to optimize the code first.
import numpy as np
def loop(x,y,q,z):
    matchlist = []
    for ind in range(len(x)):
        matchlist.append(find_match(x[ind],y[ind],q,z))
    return matchlist
def find_match(x,y,q,z):
    A = np.where(q == x)
    B = np.where(z == y)
    return np.intersect1d(A,B)
# N will finally scale up to 10^9
N = 1000
M = 300
X = np.random.randint(M, size=N)
Y = np.random.randint(M, size=N)
# Q and Z size is fixed at 120000
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)
# convert int32 arrays to str64 arrays, to represent original data (which are strings and not numbers)
X = np.char.mod('%d', X)
Y = np.char.mod('%d', Y)
Q = np.char.mod('%d', Q)
Z = np.char.mod('%d', Z)
matchlist = loop(X,Y,Q,Z)
I have two arrays (X and Y) which are identical in length. Each row of these arrays corresponds to one DNA sequencing read (basically strings of the letters 'A','C','G','T'; details not relevant for the example code here).
I also have two 'reference arrays' (Q and Z) which are identical in length and I want to find the occurrence (with np.where()) of every element of X within Q, as well as of every element of Y within Z (basically the find_match() function). Afterwards I want to know whether there is an overlap/intersect between the indexes found for X and Y.
Example output (matchlist; some rows of X/Y have matching indexes in Q/Z, some don't, e.g. index 11):
The code works fine so far, but it would take quite long to execute with my final dataset where N=10^9 (in this code example N=1000 to make the tests quicker). 1000 rows of X/Y need about 2.29s to execute on my laptop:
Every find_match() takes about 2.48 ms to execute which is roughly 1/1000 of the final loop.
One first approach would be to combine (x with y) as well as (q with z) and then I only need to run np.where() once, but I couldn't get that to work yet.
I've tried to loop and lookup within Pandas (.loc()) but this was about 4x slower than np.where().
The question is closely related to a recent question from philshem (Combine several NumPy "where" statements to one to improve performance), however, the solutions provided on this question do not work for my approach here.
Numpy isn't too helpful here, since what you need is a lookup into a jagged array, with strings as the indexes.
lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
    lookup.setdefault((q, z), []).append(i)
matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
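To make the dictionary approach concrete, here is a small self-contained run. The four tiny string arrays are made-up illustration data, not the question's dataset.

```python
import numpy as np

# Hypothetical toy data standing in for the reference arrays (Q, Z)
# and the query arrays (X, Y).
Q = np.array(['a', 'b', 'a', 'c'])
Z = np.array(['x', 'y', 'x', 'z'])
X = np.array(['a', 'c', 'b'])
Y = np.array(['x', 'x', 'y'])

# One pass over the reference arrays: map each (q, z) pair to the list
# of positions where it occurs.
lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
    lookup.setdefault((q, z), []).append(i)

# Each query is then a single dictionary lookup instead of two full scans.
matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
print(matchlist)
```

The pair ('a', 'x') occurs at positions 0 and 2, ('c', 'x') never occurs, and ('b', 'y') occurs at position 1, so the three queries return two indices, none, and one index respectively.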
If you don't need the output as a jagged array, but are OK with just a boolean denoting presence, and can preprocess each string to a number, there is a much faster method.
lookup = np.zeros((300, 300), dtype=bool)
lookup[Q, Z] = True
matchlist = lookup[X, Y]
You typically won't want to use this method to replace the former jagged case, as dense variants (e.g. Daniel F's solution) will be memory inefficient and NumPy does not support sparse arrays well. However, if more speed is needed then a sparse solution is certainly possible.
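The boolean-lookup variant needs integer codes rather than strings. One possible way to do that preprocessing, which is an assumption here and not something the answer specifies, is np.unique with return_inverse=True over all four arrays. The toy arrays and the names codes, q_c, z_c, x_c, y_c and present are made up for illustration.

```python
import numpy as np

# Hypothetical toy data; the real arrays hold DNA read strings.
Q = np.array(['a', 'b', 'a', 'c'])
Z = np.array(['x', 'y', 'x', 'z'])
X = np.array(['a', 'c', 'b'])
Y = np.array(['x', 'x', 'y'])

# Map every distinct string to an integer code shared by all four arrays.
labels, codes = np.unique(np.concatenate([Q, Z, X, Y]), return_inverse=True)
q_c, z_c, x_c, y_c = np.split(codes, np.cumsum([len(Q), len(Z), len(X)]))

# Dense table: lookup[i, j] is True when the pair (i, j) occurs in (Q, Z).
k = len(labels)
lookup = np.zeros((k, k), dtype=bool)
lookup[q_c, z_c] = True

# One fancy-indexing call answers "is this (x, y) pair present?" for all rows.
present = lookup[x_c, y_c]
print(present)
```

The table has k*k entries, so this only pays off when the number of distinct strings is small, as with the 300 values in the question's example.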
You only have 300*300 = 90000 unique answers. Pre-compute.
Q_ = np.arange(300)[:, None] == Q
Z_ = np.arange(300)[:, None] == Z
lookup = np.logical_and(Q_[:, None, :], Z_)
lookup.shape
Out[]: (300, 300, 120000)
Then the result is just:
out = lookup[X, Y]
If you really want the indices you can do:
i = np.where(out)
out2 = np.split(i[1], np.flatnonzero(np.diff(i[0]))+1)
You'll need to process the input in chunks (which also parallelizes naturally) with this method, since a boolean array of shape (120000, 1000000000) will throw a MemoryError.

How to generate a 3D grid of vectors ? (each position in the 3D grid is a vector)

I want to generate a four-dimensional array with dimensions (ndim,N,N,N). The first dimension is ndim = 3, and N corresponds to the grid length. How can one elegantly generate such an array using Python?
Here is my 'ugly' implementation:
qvec=np.zeros([ndim,N,N,N])
freq = np.arange(-(N-1)/2.,+(N+1)/2.)
x, y, z = np.meshgrid(freq[range(N)], freq[range(N)], freq[range(N)],indexing='ij')
qvec[0,:,:,:]=x
qvec[1,:,:,:]=y
qvec[2,:,:,:]=z
Your implementation looks good enough to me. However, here are some improvements to make it prettier:
qvec=np.empty([ndim,N,N,N])
freq = np.arange(-(N-1)/2.,+(N+1)/2.)
x, y, z = np.meshgrid(*[freq]*ndim, indexing='ij')
qvec[0,...]=x # qvec[0] = x
qvec[1,...]=y # qvec[1] = y
qvec[2,...]=z # qvec[2] = z
The improvements are:
Using numpy.empty() instead of numpy.zeros()
Getting rid of the range(N) indexing since that would give the same freq array
Using iterable unpacking and utilizing ndim
Using the ellipsis notation for dimensions (this is also not needed)
So, after incorporating all of the above points, the below piece of code would suffice:
qvec=np.empty([ndim,N,N,N])
freq = np.arange(-(N-1)/2.,+(N+1)/2.)
x, y, z = np.meshgrid(*[freq]*ndim, indexing='ij')
qvec[0:ndim] = x, y, z
Note: I'm assuming N is the same along every axis, since you used the same variable name.
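A further possibility, not mentioned in the answer above, is to let np.stack build the leading axis so that no preallocation is needed. The sketch assumes a small N = 4 purely for illustration.

```python
import numpy as np

ndim, N = 3, 4  # N = 4 is an arbitrary small grid length for this sketch
freq = np.arange(-(N - 1) / 2., (N + 1) / 2.)

# meshgrid returns ndim arrays of shape (N, N, N); np.stack joins them
# along a new leading axis, giving shape (ndim, N, N, N).
qvec = np.stack(np.meshgrid(*[freq] * ndim, indexing='ij'))

print(qvec.shape)
```

With indexing='ij', qvec[0, i, j, k] equals freq[i], qvec[1, i, j, k] equals freq[j], and qvec[2, i, j, k] equals freq[k].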

best method of making an array

I'm new to programming and am a bit unsure about how to write my own for loop. This is what I would like to do:
Let us subdivide the interval [0,1] into n points x_0 = 0, ..., x_{n-1} = 1.
Write a function compute_discrete_u(epsilon, n) that returns two numpy arrays:
x_array contains the coordinates of the n points
u_array contains the discrete values of u at these points.
u(x)=sin(1x+ϵ)
Thank you!
First of all, you do not need a for loop at all. You want to use numpy, so you can use the vectorized operations that numpy is built upon.
Here's the function you are literally asking for (and most likely not how you should solve your problem):
# Do NOT use this.
import numpy as np
def compute_discrete_u(epsilon, n):
    x = np.linspace(0, 1, n)
    return x, np.sin(x + epsilon)
That's quite an awkward API. From a design point-of-view, you are mixing two responsibilities in the function:
Generating a certain x vector
Calculating a u vector based on a mathematical function.
You should not do this for complexity and reusability reasons. What if you want a non-uniform x later on?
So here's what you should do:
import numpy as np
def compute_u(x, epsilon):
    return np.sin(x + epsilon)
x = np.linspace(0, 1, num=101)
u = compute_u(x, epsilon=1e-3)
This is easier to understand because the function is just the mathematical function. Additionally, you can compute u for any x array (or single float) you like. If you do not need compute_u elsewhere, you may even drop it completely and write u = np.sin(x + epsilon)
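To illustrate the reusability point, the same compute_u can be called with a non-uniform grid or a single float. The geometric spacing and the names x_nonuniform, u_nonuniform and u_scalar are arbitrary choices for this sketch.

```python
import numpy as np

def compute_u(x, epsilon):
    return np.sin(x + epsilon)

# A non-uniform grid: points packed more densely near 0.
x_nonuniform = np.geomspace(1e-3, 1.0, num=5)
u_nonuniform = compute_u(x_nonuniform, epsilon=1e-3)

# The same function also accepts a plain float.
u_scalar = compute_u(0.5, epsilon=1e-3)

print(u_nonuniform.shape, u_scalar)
```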

Python numpy grid transformation using universal functions

Here is my problem: I work with 432*46*136*136 grids, representing time*(space), stored in NumPy arrays. I have one array, alt, which holds the altitudes of the grid points, and another array, temp, which stores the temperatures of the grid points.
This is a problem for comparisons: if T1 and T2 are two results, T1[t0,z0,x0,y0] and T2[t0,z0,x0,y0] represent the temperature at H1[t0,z0,x0,y0] and H2[t0,z0,x0,y0] meters, respectively. But I want to compare the temperature of points at the same altitude, not at the same grid point.
Hence I want to modify the z-axis of my matrices to represent the altitude and not the grid point. I created a function conv(alt[t,z,x,y]) which assigns a number between -20 and 200 to each altitude. Here is my code:
def interpolation_extended(self,temp,alt):
    [t,z,x,y]=temp.shape
    new=np.zeros([t,220,x,y])
    for l in range(0,t):
        for j in range(0,z):
            for lat in range(0,x):
                for lon in range(0,y):
                    new[l,conv(alt[l,j,lat,lon]),lat,lon]=temp[l,j,lat,lon]
    return new
But this takes far too much time; I can't work with it. I tried to write it using universal functions with NumPy:
def interpolation_extended(self,temp,alt):
    [t,z,x,y]=temp.shape
    new=np.zeros([t,220,x,y])
    for j in range(0,z):
        new[:,conv(alt[:,j,:,:]),:,:]=temp[:,j,:,:]
    return new
But that does not work. Do you have any idea how to do this in Python/NumPy without using 4 nested loops?
Thank you
I can't really try the code since I don't have your matrices, but something like this should do the job.
First, instead of declaring conv as a function, get the whole altitude projection for all your data:
conv = np.round(alt / 500.).astype(int)
np.round, NumPy's version of round, rounds all the elements of the matrix in a vectorized operation implemented in C, so you get a new array very quickly (at C speed). The following line aligns the altitudes to start at 0, by shifting the whole array by its minimum value (in your case, -20):
conv -= conv.min()
The line above transforms your altitude matrix from [-20, 200] to [0, 220] (better for indexing).
With that, interpolation can be done easily by getting multidimensional indices:
t, z, y, x = np.indices(temp.shape)
The index arrays above contain all the indices needed to index your original matrix. You can then fill the new matrix by doing:
new_matrix[t, conv[t, z, y, x], y, x] = temp[t, z, y, x]
without any loop at all.
Let me know if it works. It might give you some errors, since it is hard for me to test it without data, but it should do the job.
The following toy example works fine:
A = np.random.randn(3,4,5) # Random 3x4x5 matrix -- your temp matrix
B = np.random.randint(0, 10, 3*4*5).reshape(3,4,5) # your conv matrix with altitudes from 0 to 9
C = np.zeros((3,10,5)) # your new matrix
z, y, x = np.indices(A.shape)
C[z, B[z, y, x], x] = A[z, y, x]
C contains your results by altitude.
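The same idea extends to the 4D case from the question. The sketch below uses small made-up shapes and a made-up bin index, and checks the loop-free assignment against the four nested loops. The bin index conv is constructed so that no two z levels of the same column land in the same bin, because NumPy does not guarantee which value wins when an index is assigned more than once. The names n_bins, new and expected are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
temp = rng.normal(size=(2, 4, 3, 3))    # toy stand-in for the (t, z, x, y) grid

# Index arrays with the same shape as temp, one per axis.
t, z, y, x = np.indices(temp.shape)

# Hypothetical altitude bin for every grid point; unique per (t, y, x) column.
conv = 2 * z + (t + y + x) % 2
n_bins = 8

# Loop-free scatter of each temperature into its altitude bin.
new = np.zeros((2, n_bins, 3, 3))
new[t, conv, y, x] = temp

# Reference: the four nested loops from the question.
expected = np.zeros_like(new)
for l in range(2):
    for j in range(4):
        for a in range(3):
            for b in range(3):
                expected[l, conv[l, j, a, b], a, b] = temp[l, j, a, b]

print(np.array_equal(new, expected))
```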
