Hi, I have a question that I'm not sure how to implement in Python.
I have three different arrays. I have values in X and values in Y, such that to each X a specific Y belongs, (X, Y). From these I built a histogram:
U, V = np.histogram(X, bins=np.arange(min(X), max(X), 50))
The third array (V) has the number of points for each bin. Knowing this, I want to print the different Y values for the points in each bin. That is:
for i, j in zip(X, Y):
    if a < i < b:
        print(j)
where a is the first value in V and b the second one. For example, in my case the first value is 500 and the second one is 600, so it would be:
for i, j in zip(X, Y):
    if 500 < i < 600:
        print(j)
and here it prints the Y values for the points whose X lies in the range 500-600. What I would like to do now is implement a loop so I don't have to write the different entries of V manually. I was thinking about something like:
for i, j, k in zip(X, Y, range(len(V))):
    if V[k] < i < V[k+1]:
        print(j)
But it doesn't work. Any ideas?
From the code in your question, it looks like you're using numpy. There are better ways to approach this problem in numpy, and I'll go over those at the end of the answer. For the moment, though, let's look at why what you tried didn't work.
The reason that it's not working is that your V array is the bin edges. It's not the same size as your X or Y arrays.
When you zip sequences together, zip stops when the shortest sequence has been iterated through. For example:
for i, j in zip([1, 2], [5, 6, 7, 8, 9]):
    print(j)
Will yield:
5
6
In your case, you actually want to iterate over the bins, and then have an inner loop over X and Y. For example:
for k in range(len(V) - 1):
    for i, j in zip(X, Y):
        if V[k] < i < V[k+1]:
            print(j)
We could also make this a bit more readable by doing something like:
bin_edges = V
for left, right in zip(bin_edges, bin_edges[1:]):
    for i, j in zip(X, Y):
        if left < i < right:
            print(j)
However, both of these are horribly inefficient in numpy. (Iterating through numpy arrays is slower than iterating through lists, but this would be slow even with lists.)
Fortunately, you're using numpy, and there are much more efficient ways.
First, let's reproduce the example above, but let's use boolean indexing to remove the inner loop:
import numpy as np

# Generate some random data
x, y = np.random.random((2, 100))

# Your "U" and "V" arrays, but I'm changing the names for clarity
counts, bins = np.histogram(x, bins=20)

# Rather than iterate over an index, let's use a slightly different trick
for left, right in zip(bins[:-1], bins[1:]):
    # Use boolean indexing to replace the inner loop
    print(y[(x > left) & (x < right)])
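Note that the strict inequalities above will skip any point that falls exactly on a bin edge. np.histogram uses half-open bins (with the last bin closed on both ends), so a closer match would be something like replacing the inner line with:

print(y[(x >= left) & (x < right)])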
Another way to do this is through numpy.digitize:
import numpy as np

# Generate some x, y data
x, y = np.random.random((2, 100))

# Create a histogram of x
counts, bins = np.histogram(x, bins=30)

# Return an array of indices telling which bin each x-value falls into.
# It is the same size as x, with values between 1 and len(bins) for
# data within the histogram's range.
idx = np.digitize(x, bins)

# Print the y-values for each bin (points exactly equal to the last
# edge land in bin len(bins), so include it)
for i in range(1, len(bins) + 1):
    print(y[idx == i])
Either way, using boolean indexing for this instead of the inner loop will yield significant speedups.
What is the error you're getting? You're probably indexing outside of the array in
for i, j, k in zip(X, Y, range(len(V))):
    if V[k] < i < V[k+1]:
        print(j)
When k is len(V) - 1, there is no V[k+1]. You can do something like:
for i, j, k in zip(X, Y, range(len(V) - 1)):
    if V[k] < i < V[k+1]:
        print(j)

# handle the last bin separately
if X[-1] > V[-1]:
    print(Y[-1])
for x in range(10):
    for y in range(10):
        for z in range(10):
            if (1111*x + 1111*y + 1111*z) == (10000*y + 1110*x + z):
                print(z)
Is there a way to shorten this code, specifically the first three lines where I've used three similar-looking for loops? I'm quite new to Python, so please explain any modules used, if possible.
Well, you're essentially evaluating a function in a 3d coordinate system, with coordinates given by x, y, and z. So let's look at numpy, which implements N-dimensional arrays in Python. If you're familiar with MATLAB or IDL, these arrays have similar functionality.
import numpy

x = numpy.arange(10)  # like range, but creates an array instead of a lazy sequence
y = numpy.arange(10)
z = numpy.arange(10)

# Now build a 3d array with every point
# defined by the coordinate arrays
xg, yg, zg = numpy.meshgrid(x, y, z)

# Evaluate your function
# and store the Boolean result in an array.
mask = (1111*xg + 1111*yg + 1111*zg) == (10000*yg + 1110*xg + zg)

# Print out the z values where the mask is True
print(zg[mask])
Is this more readable? Debatable. Is it shorter? No. But it does leverage array operations which may be faster in certain circumstances.
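If the goal is simply fewer lines rather than array operations, the standard library's itertools.product collapses the three loops into one; a small sketch, not from the original answer:

from itertools import product

# product(range(10), repeat=3) yields every (x, y, z) combination
for x, y, z in product(range(10), repeat=3):
    if (1111*x + 1111*y + 1111*z) == (10000*y + 1110*x + z):
        print(z)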
I am trying to speed up the code for the following script (ideally >4x) without multiprocessing. In a future step, I will implement multiprocessing, but the current speed is too slow even if I split it up to 40 cores. Therefore I'm trying to optimize the code first.
import numpy as np

def loop(x, y, q, z):
    matchlist = []
    for ind in range(len(x)):
        matchlist.append(find_match(x[ind], y[ind], q, z))
    return matchlist

def find_match(x, y, q, z):
    A = np.where(q == x)
    B = np.where(z == y)
    return np.intersect1d(A, B)

# N will finally scale up to 10^9
N = 1000
M = 300

X = np.random.randint(M, size=N)
Y = np.random.randint(M, size=N)

# Q and Z size is fixed at 120000
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)

# convert the int arrays to string arrays, to represent the original data
# (which are strings, not numbers)
X = np.char.mod('%d', X)
Y = np.char.mod('%d', Y)
Q = np.char.mod('%d', Q)
Z = np.char.mod('%d', Z)

matchlist = loop(X, Y, Q, Z)
I have two arrays (X and Y) which are identical in length. Each row of these arrays corresponds to one DNA sequencing read (basically strings of the letters 'A','C','G','T'; details not relevant for the example code here).
I also have two 'reference arrays' (Q and Z) which are identical in length and I want to find the occurrence (with np.where()) of every element of X within Q, as well as of every element of Y within Z (basically the find_match() function). Afterwards I want to know whether there is an overlap/intersect between the indexes found for X and Y.
Example output (matchlist; some rows of X/Y have matching indexes in Q/Z, some don't, e.g. index 11):
The code works fine so far, but it would take quite long to execute with my final dataset, where N = 10^9 (in this code example N = 1000 to make the tests quicker). 1000 rows of X/Y need about 2.29 s to execute on my laptop.
Every find_match() takes about 2.48 ms to execute, which is roughly 1/1000 of the final loop.
A first approach would be to combine (x with y) as well as (q with z), so that I would only need to run np.where() once, but I couldn't get that to work yet.
I've tried to loop and lookup within Pandas (.loc()) but this was about 4x slower than np.where().
The question is closely related to a recent question from philshem (Combine several NumPy "where" statements to one to improve performance), however, the solutions provided on this question do not work for my approach here.
Numpy isn't too helpful here, since what you need is a lookup into a jagged array, with strings as the indexes.
lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
    # group the reference indices by their (q, z) pair
    lookup.setdefault((q, z), []).append(i)

matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
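For instance, with tiny hypothetical arrays (just to illustrate the shape of the output):

Q = np.array(['a', 'b', 'a'])
Z = np.array(['x', 'y', 'x'])
X = np.array(['a', 'b', 'c'])
Y = np.array(['x', 'y', 'x'])
# lookup becomes {('a', 'x'): [0, 2], ('b', 'y'): [1]}
# matchlist becomes [[0, 2], [1], []]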
If you don't need the output as a jagged array, but are OK with just a boolean denoting presence, and can preprocess each string to a number, there is a much faster method.
lookup = np.zeros((300, 300), dtype=bool)
lookup[Q, Z] = True   # requires Q and Z as integer codes

matchlist = lookup[X, Y]
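A minimal sketch of that preprocessing step, assuming np.unique/np.searchsorted factorization (the vocabulary names here are illustrative, not from the answer):

# Map each string to a small integer code; X must share Q's vocabulary
# (and Y must share Z's) so that equal strings get equal codes.
q_vocab = np.unique(np.concatenate([Q, X]))
z_vocab = np.unique(np.concatenate([Z, Y]))
Q_int, X_int = np.searchsorted(q_vocab, Q), np.searchsorted(q_vocab, X)
Z_int, Y_int = np.searchsorted(z_vocab, Z), np.searchsorted(z_vocab, Y)

lookup = np.zeros((q_vocab.size, z_vocab.size), dtype=bool)
lookup[Q_int, Z_int] = True
matchlist = lookup[X_int, Y_int]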
You typically won't want to use this method to replace the former jagged case, as dense variants (e.g. Daniel F's solution) will be memory inefficient and numpy does not support sparse arrays well. However, if more speed is needed, then a sparse solution is certainly possible.
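One way to get the boolean result without the dense table, sketched here as an assumption rather than taken from the answer: pack each (q, z) code pair into a single integer key and test membership with np.isin:

# Uses the integer codes from the preprocessing sketch above.
keys_ref = Q_int.astype(np.int64) * z_vocab.size + Z_int
keys_query = X_int.astype(np.int64) * z_vocab.size + Y_int
present = np.isin(keys_query, keys_ref)  # boolean, same length as X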
You only have 300*300 = 90000 unique answers. Pre-compute.
# (This assumes Q and Z hold integer values, not their string form.)
Q_ = np.arange(300)[:, None] == Q   # shape (300, 120000)
Z_ = np.arange(300)[:, None] == Z

lookup = np.logical_and(Q_[:, None, :], Z_)

lookup.shape
Out[]: (300, 300, 120000)
Then the result is just:
out = lookup[X, Y]
If you really want the indices you can do:
i = np.where(out)
out2 = np.split(i[1], np.flatnonzero(np.diff(i[0])) + 1)  # group the column indices by row
You'll need to parallelize by chunking with this method, since a boolean array of shape (1000000000, 120000) will throw a MemoryError.
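A minimal chunking sketch under that constraint (the chunk size of 10000 is an arbitrary choice, not from the answer):

chunk = 10000
for start in range(0, len(X), chunk):
    out = lookup[X[start:start + chunk], Y[start:start + chunk]]
    # ... process or reduce each (chunk, 120000) boolean block before the next one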
How can we do the following operation in one line only in numpy?
medians = np.median(x, axis=0)
for i in range(0, len(x)):  # transforming input data to binary values
    for j in range(0, len(x[i])):
        x[i][j] = 1 if x[i][j] <= medians[j] else 2
What it does is transform this feature vector into binary values, based on the value of the median for each dimension of the data.
Use broadcasting:
x = (x <= np.median(x, axis=0))
The result will be a boolean array of zeros and ones. This wouldn't work, by the way, if you tried it with axis=1, because broadcasting matches axes from the right. Instead, you would have to insert a placeholder for the reduced axis, e.g. like this:
x = (x <= np.median(x, axis=1)[..., np.newaxis])
An even more general approach would be
x = (x <= np.median(x, axis=<whatever>, keepdims=True))
Since booleans are technically a subclass of integers in Python, and numpy honors that convention, you can get a mask with ones and twos instead of zeros and ones by adding one to the result, however you choose to compute it:
x = ... + 1
Keep in mind that True + 1 == 2, so to reproduce the original mapping (1 where the value is at or below the median, 2 where it is above) you have to flip the comparison first:
x = (x > np.median(x, axis=0)) + 1
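A tiny worked example of that last form (the array values are arbitrary):

import numpy as np

x = np.array([[1., 2., 3.],
              [4., 5., 6.]])
# column medians are [2.5, 3.5, 4.5]
print((x > np.median(x, axis=0)) + 1)
# [[1 1 1]
#  [2 2 2]]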
Assume that I have two arrays V and Q, where V is (i, j, j) and Q is (j, j). I now wish to compute the dot product of Q with each "row" of V and save the result as an (i, j, j) sized matrix. This is easily done using for-loops by simply iterating over i like
import numpy as np

v = np.random.normal(size=(100, 5, 5))
q = np.random.normal(size=(5, 5))

output = np.zeros_like(v)
for i in range(v.shape[0]):
    output[i] = q.dot(v[i])
However, this is way too slow for my needs, and I'm guessing there is a way to vectorize this operation using either einsum or tensordot, but I haven't managed to figure it out. Could someone please point me in the right direction? Thanks
You can certainly use np.tensordot, but you need to swap axes afterwards, like so -
out = np.tensordot(v, q, axes=(1, 1)).swapaxes(1, 2)
With np.einsum, it's a bit more straightforward, like so -
out = np.einsum('ijk,lj->ilk', v, q)
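Not part of the original answer, but the same product can also be written with numpy's broadcasting matmul, since q is broadcast over the leading axis of v:

out = q @ v  # out[i] == q.dot(v[i]); equivalent to np.matmul(q, v)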
I have a matrix of counts,
import numpy as np

x = np.array([[1, 2, 3], [1, 4, 6], [2, 3, 7]])
And I need the percentages of the total along axis = 1:
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        x[i, j] = x[i, j] / np.sum(x[i, :])
I want the equivalent in numpy broadcast form.
Currently, I have:
x_sums = np.sum(x, axis=1)
for j in range(x.shape[1]):
    x[:, j] = x[:, j] / x_sums[:]
This puts most of the complexity in numpy code... but a numpy one-liner would be best.
Also,
def percentages(a):
    return a / np.sum(a)

x_percentages = np.apply_along_axis(percentages, 1, x)
But that still involves a Python-level function call for every row.
np.linalg.norm is very close, in terms of what is going on, but it only offers a handful of hardcoded norms, which do not include percentage of total.
Then there is np.percentile, which is again close... but it computes percentiles of the sorted data, not percentages of the total.
x /= x.sum(axis=1, keepdims=True)
Although x should have a floating point dtype for this to work correctly.
Better may be:
x = np.true_divide(x, x.sum(axis=1, keepdims=True))
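Applied to the example matrix from the question, this gives (output rounded here for display):

x = np.array([[1, 2, 3], [1, 4, 6], [2, 3, 7]])
print(np.true_divide(x, x.sum(axis=1, keepdims=True)))
# [[0.1667 0.3333 0.5   ]
#  [0.0909 0.3636 0.5455]
#  [0.1667 0.25   0.5833]]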
Could this be what you are after?
print((x.T / np.sum(x, axis=1)).T)