Python random sample of two arrays, but matching indices - python

I have two numpy arrays x and y, which have length 10,000.
I would like to plot a random subset of 1,000 entries of both x and y.
Is there an easy way to use the lovely, compact random.sample(population, k) on both x and y to select the same corresponding indices? (The y and x vectors are linked by a function y(x) say.)
Thanks.

You can use np.random.choice on an index array and apply it to both arrays:
idx = np.random.choice(np.arange(len(x)), 1000, replace=False)
x_sample = x[idx]
y_sample = y[idx]
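With newer NumPy versions the same idea can be written with the Generator API. A minimal sketch, assuming x and y are same-length 1-D arrays (the sine data here is only a stand-in for the real y(x)):

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only to make the example repeatable

# Stand-in data: y is a function of x, as in the question.
x = np.linspace(0.0, 10.0, 10_000)
y = np.sin(x)

# Draw 1,000 distinct positions once, then reuse them for both arrays.
idx = rng.choice(len(x), size=1_000, replace=False)
x_sample = x[idx]
y_sample = y[idx]
```

Because the same idx is applied to both arrays, each x_sample[i] still belongs to y_sample[i].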

Just zip the two together and use that as the population:
import random
random.sample(zip(xs,ys), 1000)
The result will be 1000 pairs (2-tuples) of corresponding entries from xs and ys.
Update: For Python 3, you need to convert the zipped sequences into a list:
random.sample(list(zip(xs,ys)), 1000)
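If the two samples are needed as separate sequences again (for plotting, say), the pairs can be unzipped. A small sketch with made-up xs and ys:

```python
import random

xs = list(range(100))
ys = [v * v for v in xs]  # stand-in for y(x)

pairs = random.sample(list(zip(xs, ys)), 10)  # 10 (x, y) tuples, no position picked twice
xs_sample, ys_sample = zip(*pairs)            # back to two parallel tuples
```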

After testing the numpy.random.choice solution,
I found it was very slow for larger arrays.
numpy.random.randint should be much faster.
Example:
x = np.arange(1e8)
y = np.arange(1e8)
idx = np.random.randint(0, x.shape[0], 10000)
x_sample, y_sample = x[idx], y[idx]

numpy.random.randint draws each index independently, so it samples with replacement: the same data point can be selected more than once.
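The difference is easy to check by counting distinct indices. A sketch using the Generator API (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n, k = 10_000, 1_000

with_repl = rng.integers(0, n, size=k)               # like randint: repeats are possible
without_repl = rng.choice(n, size=k, replace=False)  # every index is distinct

n_distinct_with = np.unique(with_repl).size          # at most k
n_distinct_without = np.unique(without_repl).size    # always exactly k
```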

Related

Evenly sampled 3D meshgrid

I have a 3-dimensional meshgrid generated using the following code:
x = np.linspace(-1,1,100)
xx, yy, zz = np.meshgrid(x, x, x)
This generates a 100 x 100 x 100 3-d grid of points. I would like to plot an evenly spaced sub-sampling of this same grid, without having to generate a new grid. My approach was to use np.linspace() to get an array of 10000 evenly spaced indices into the original array, and then plot xx[subsample], yy[subsample], and zz[subsample]. I used
subsample = np.linspace(0,len(xx.flatten())-1,10000,dtype=int)
However, when I pass this array to my plotting function, I get uneven structure (diagonal lines) in 3 dimensions:
My guess is that this is happening because I flattened the array, and then used np.linspace(), but I can't figure out how to sample the grid in 3-dimensions and have it come out evenly distributed. I would like to avoid generating a new meshgrid if at all possible.
My question is how would I evenly subsample my original 3-dimensional meshgrid, without having to generate a new meshgrid?
In [117]: x = np.linspace(-1,1,100)
...: xx, yy, zz = np.meshgrid(x, x, x)
In [118]: xx.shape
Out[118]: (100, 100, 100)
1000 equally spaced points in xx, similarly for all other grids:
In [119]: xx[::10,::10,::10].shape
Out[119]: (10, 10, 10)
Or with advanced indexing (making a copy)
In [123]: i=np.arange(0,100,10)
In [124]: xx[np.ix_(i,i,i)].shape
Out[124]: (10, 10, 10)
I think we could use np.ravel_multi_index to get an array of flattened indices. We'd have to generate 1000 tuples of indices to do that!
I don't see how we could get exactly 10,000 points this way, since 10,000 is not a perfect cube. ::5 would give 8000 points.
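For what it's worth, the flattened indices mentioned above can be built without writing out the index tuples by hand. A sketch, assuming a step of 10 along each axis:

```python
import numpy as np

x = np.linspace(-1, 1, 100)
xx, yy, zz = np.meshgrid(x, x, x)

i = np.arange(0, 100, 10)
# Full 3-D index grids for the 10 x 10 x 10 subsample.
ii, jj, kk = np.meshgrid(i, i, i, indexing="ij")
flat = np.ravel_multi_index((ii, jj, kk), xx.shape).ravel()

# The same 1000 flat indices work for all three coordinate arrays.
xs, ys, zs = xx.ravel()[flat], yy.ravel()[flat], zz.ravel()[flat]
```

The result holds the same points as xx[::10,::10,::10].ravel(), so plain slicing is usually the simpler choice.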
Have you tried using arange? Using linspace for integers may have some rounding issues.
Could you try the following?
subsample = np.arange(0, xx.size, xx.size // 10000) # the last parameter is the step size
Also, be sure that xx.size is divisible by 10000, which is the case for your 100x100x100.
Tip: use .size to get the number of elements in an array. Use .ravel instead of .flatten: flatten always creates a copy, whereas ravel returns a view whenever possible.
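The copy-versus-view difference can be checked with np.shares_memory. A small sketch on a contiguous array (the array itself is just a placeholder):

```python
import numpy as np

a = np.zeros((100, 100, 100))

n_elements = a.size                                 # number of elements, no flattening needed
ravel_is_view = np.shares_memory(a, a.ravel())      # True for a contiguous array
flatten_is_view = np.shares_memory(a, a.flatten())  # False: flatten always copies
```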
Edit: That subsample did not generate those diagonals but it just got a plane.
subsample_axis = [np.arange(0, xx.shape[i], 10) for i in range(len(xx.shape))]
subsample = np.zeros([len(axis) for axis in subsample_axis])
for i, axis in enumerate(subsample_axis):
    shape = [len(axis) if j == i else 1 for j in range(len(xx.shape))]
    subsample += axis.reshape(shape)*np.prod(xx.shape[i+1:])
subsample = subsample.ravel().astype('int')

Increase performance of np.where() loop

I am trying to speed up the code for the following script (ideally >4x) without multiprocessing. In a future step, I will implement multiprocessing, but the current speed is too slow even if I split it up across 40 cores. Therefore I'm trying to optimize the code first.
import numpy as np
def loop(x,y,q,z):
    matchlist = []
    for ind in range(len(x)):
        matchlist.append(find_match(x[ind],y[ind],q,z))
    return matchlist
def find_match(x,y,q,z):
    A = np.where(q == x)
    B = np.where(z == y)
    return np.intersect1d(A,B)
# N will finally scale up to 10^9
N = 1000
M = 300
X = np.random.randint(M, size=N)
Y = np.random.randint(M, size=N)
# Q and Z size is fixed at 120000
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)
# convert int32 arrays to str64 arrays, to represent original data (which are strings and not numbers)
X = np.char.mod('%d', X)
Y = np.char.mod('%d', Y)
Q = np.char.mod('%d', Q)
Z = np.char.mod('%d', Z)
matchlist = loop(X,Y,Q,Z)
I have two arrays (X and Y) which are identical in length. Each row of these arrays corresponds to one DNA sequencing read (basically strings of the letters 'A','C','G','T'; details not relevant for the example code here).
I also have two 'reference arrays' (Q and Z) which are identical in length and I want to find the occurrence (with np.where()) of every element of X within Q, as well as of every element of Y within Z (basically the find_match() function). Afterwards I want to know whether there is an overlap/intersect between the indexes found for X and Y.
Example output (matchlist; some rows of X/Y have matching indexes in Q/Z, some don't e.g. index 11):
The code works fine so far, but it would take quite long to execute with my final dataset where N=10^9 (in this code example N=1000 to make the tests quicker). 1000 rows of X/Y need about 2.29s to execute on my laptop:
Every find_match() takes about 2.48 ms to execute which is roughly 1/1000 of the final loop.
One first approach would be to combine (x with y) as well as (q with z) and then I only need to run np.where() once, but I couldn't get that to work yet.
I've tried to loop and lookup within Pandas (.loc()) but this was about 4x slower than np.where().
The question is closely related to a recent question from philshem (Combine several NumPy "where" statements to one to improve performance), however, the solutions provided on this question do not work for my approach here.
Numpy isn't too helpful here, since what you need is a lookup into a jagged array, with strings as the indexes.
lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
    lookup.setdefault((q, z), []).append(i)
matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
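To see what the dictionary approach returns, here is a toy run with short made-up arrays (plain Python lists are used here for brevity):

```python
X = ["a", "b", "c"]
Y = ["1", "2", "3"]
Q = ["a", "b", "a", "x"]
Z = ["1", "2", "1", "3"]

# Map each (q, z) pair to the list of positions where it occurs.
lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
    lookup.setdefault((q, z), []).append(i)

# One dictionary lookup per (x, y) pair; pairs that never occur give [].
matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
print(matchlist)  # [[0, 2], [1], []]
```

Building the dictionary is a single pass over Q and Z, after which each row of X/Y costs one hash lookup instead of two full scans.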
If you don't need the output as a jagged array, but are OK with just a boolean denoting presence, and can preprocess each string to a number, there is a much faster method.
lookup = np.zeros((300, 300), dtype=bool)
lookup[Q, Z] = True
matchlist = lookup[X, Y]
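The preprocessing step is not spelled out above. One possible way to turn the strings into integer codes is np.unique with return_inverse=True, applied to the query and reference arrays together so that they share a coding. A sketch with made-up data (the names vocab_xq, Xc, and so on are illustrative only):

```python
import numpy as np

X = np.array(["a", "b", "c"])
Y = np.array(["1", "2", "3"])
Q = np.array(["a", "b", "a", "x"])
Z = np.array(["1", "2", "1", "3"])

# Encode X and Q with one shared vocabulary, and likewise Y and Z.
vocab_xq, codes_xq = np.unique(np.concatenate([X, Q]), return_inverse=True)
vocab_yz, codes_yz = np.unique(np.concatenate([Y, Z]), return_inverse=True)
Xc, Qc = codes_xq[:len(X)], codes_xq[len(X):]
Yc, Zc = codes_yz[:len(Y)], codes_yz[len(Y):]

# Mark every (q, z) combination that occurs, then test each (x, y) pair.
lookup = np.zeros((len(vocab_xq), len(vocab_yz)), dtype=bool)
lookup[Qc, Zc] = True
present = lookup[Xc, Yc]
print(present.tolist())  # [True, True, False]
```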
You typically won't want to use this method to replace the former jagged case, as dense variants (eg. Daniel F's solution) will be memory inefficient and numpy does not support sparse arrays well. However, if more speed is needed then a sparse solution is certainly possible.
You only have 300*300 = 90000 unique answers. Pre-compute.
Q_ = np.arange(300)[:, None] == Q
Z_ = np.arange(300)[:, None] == Z
lookup = np.logical_and(Q_[:, None, :], Z_)
lookup.shape
Out[]: (300, 300, 120000)
Then the result is just:
out = lookup[X, Y]
If you really want the indices you can do:
i = np.where(out)
out2 = np.split(i[1], np.flatnonzero(np.diff(i[0]))+1)
You'll need to process X and Y in chunks with this method (which also makes it easy to parallelize), since a boolean array of shape (120000, 1000000000) will throw a MemoryError.
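A sketch of what the chunking could look like, with deliberately tiny sizes and integer-coded inputs (the chunk size and all array sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
M, n_ref, N, chunk = 5, 50, 20, 8
Q, Z = rng.integers(0, M, n_ref), rng.integers(0, M, n_ref)
X, Y = rng.integers(0, M, N), rng.integers(0, M, N)

# lookup[a, b] is a boolean mask over the reference positions where Q == a and Z == b.
Q_ = np.arange(M)[:, None] == Q
Z_ = np.arange(M)[:, None] == Z
lookup = np.logical_and(Q_[:, None, :], Z_)   # shape (M, M, n_ref)

matchlist = []
for start in range(0, N, chunk):
    # Only a (chunk, n_ref) slice of the full result is in memory at once.
    out = lookup[X[start:start + chunk], Y[start:start + chunk]]
    matchlist.extend(np.flatnonzero(row) for row in out)
```

Each chunk is independent of the others, so the loop body is also the natural unit of work to hand to separate processes later.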

Squaring every single item in a 2D list

So I have a large list of points.
I have split those points up into the x coordinates and the y coordinates and then further split them into groups of 1000.
x = [points_Cartesian[x: x + 1000, 0] for x in range(0, len(points_Cartesian), 1000)]
(The y coordinates looks the same but with y instead of x.)
I am trying to turn the cartesian points into polar and to do so I must square every item in x and every item in y.
for sublist1 in x:
    temp1 = []
    for inte1 in sublist1:
        temp1.append(inte1**2)
    xSqua.append(temp1)
After that I add both of the Squared values together and square root them to get rad.
rad = np.sqrt(xSqua + ySqua)
The problem is, I started with 10,000 points and somewhere in this code it gets trimmed down to 1,000.
Does anyone know what the error is and how I fix it?
You're already using numpy. You can reshape matrices using numpy.reshape() and square the entire array elementwise using the ** operator on the entire array and your code will be much faster than iterating.
For example, let's say we have a 10000x2 points_Cartesian
points_Cartesian = np.random.random((10000,2))
# reshape to 1000 columns, as many rows as required
xpts = points_Cartesian[:, 0].reshape((-1, 1000))
ypts = points_Cartesian[:, 1].reshape((-1, 1000))
# elementwise square using **
rad = np.sqrt(xpts**2 + ypts**2)
ang = np.arctan2(ypts, xpts)
Now rad and ang are 10x1000 arrays.
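As a sanity check on the conversion, np.hypot computes the same radius, and converting back should recover the original coordinates. A short sketch on random points (flat arrays are used here for simplicity):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
points = rng.random((10_000, 2))
xpts, ypts = points[:, 0], points[:, 1]

rad = np.hypot(xpts, ypts)      # same value as np.sqrt(xpts**2 + ypts**2)
ang = np.arctan2(ypts, xpts)

# Polar -> Cartesian round trip; all 10,000 points are still there.
x_back = rad * np.cos(ang)
y_back = rad * np.sin(ang)
```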

how to randomly sample in 2D matrix in numpy

I have a 2d array/matrix like this. How would I randomly pick a value from this 2D matrix, for example getting a value like [-62, 29.23]? I looked at numpy.random.choice, but it is built for 1-D arrays.
The following is my example with 4 rows and 8 columns
Space_Position=[
[[-62,29.23],[-49.73,29.23],[-31.82,29.23],[-14.2,29.23],[3.51,29.23],[21.21,29.23],[39.04,29.23],[57.1,29.23]],
[[-62,11.28],[-49.73,11.28],[-31.82,11.28],[-14.2,11.28],[3.51,11.28],[21.21,11.28] ,[39.04,11.28],[57.1,11.8]],
[[-62,-5.54],[-49.73,-5.54],[-31.82,-5.54] ,[-14.2,-5.54],[3.51,-5.54],[21.21,-5.54],[39.04,-5.54],[57.1,-5.54]],
[[-62,-23.1],[-49.73,-23.1],[-31.82,-23.1],[-14.2,-23.1],[3.51,-23.1],[21.21,-23.1],[39.04,-23.1] ,[57.1,-23.1]]
]
In the answers the following solution was given:
random_index1 = np.random.randint(0, Space_Position.shape[0])
random_index2 = np.random.randint(0, Space_Position.shape[1])
Space_Position[random_index1][random_index2]
This indeed works to give me one sample, but how can I get more than one sample, like np.random.choice() does?
Another way I am thinking of is to transform the matrix into an array of pairs instead, like,
Space_Position=[
[-62,29.23],[-49.73,29.23],[-31.82,29.23],[-14.2,29.23],[3.51,29.23],[21.21,29.23],[39.04,29.23],[57.1,29.23], ..... ]
and finally use np.random.choice(). However, I could not find a way to do the transformation; flatten() makes the array look like
Space_Position=[-62,29.23,-49.73,29.2, ....]
Just use a random index (in your case 2 because you have 3 dimensions):
import numpy as np
Space_Position = np.array(Space_Position)
random_index1 = np.random.randint(0, Space_Position.shape[0])
random_index2 = np.random.randint(0, Space_Position.shape[1])
Space_Position[random_index1, random_index2] # get the random element.
The alternative is to actually make it 2D:
Space_Position = np.array(Space_Position).reshape(-1, 2)
and then use one random index:
Space_Position = np.array(Space_Position).reshape(-1, 2) # make it 2D
random_index = np.random.randint(0, Space_Position.shape[0]) # generate a random index
Space_Position[random_index] # get the random element.
If you want N samples with replacement:
N = 5
Space_Position = np.array(Space_Position).reshape(-1, 2) # make it 2D
random_indices = np.random.randint(0, Space_Position.shape[0], size=N) # generate N random indices
Space_Position[random_indices] # get N samples with replacement
or without replacement:
Space_Position = np.array(Space_Position).reshape(-1, 2) # make it 2D
random_indices = np.arange(0, Space_Position.shape[0]) # array of all indices
np.random.shuffle(random_indices) # shuffle the array
Space_Position[random_indices[:N]] # get N samples without replacement
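If the original (rows, columns, 2) layout should be kept, N samples can also be drawn by indexing with two index arrays. A sketch on a small made-up grid with the same shape as the example (4 rows, 8 columns); the name grid is a placeholder for np.array(Space_Position):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Made-up stand-in with shape (4, 8, 2): entry [r, c] holds the pair [c, r].
grid = np.stack(np.meshgrid(np.arange(8.0), np.arange(4.0)), axis=-1)
N = 5

# With replacement: independent row and column indices.
rows = rng.integers(0, grid.shape[0], size=N)
cols = rng.integers(0, grid.shape[1], size=N)
with_repl = grid[rows, cols]                      # shape (N, 2)

# Without replacement: draw distinct flat cell numbers, then split them.
flat = rng.choice(grid.shape[0] * grid.shape[1], size=N, replace=False)
rows_u, cols_u = np.unravel_index(flat, grid.shape[:2])
without_repl = grid[rows_u, cols_u]               # shape (N, 2), no cell picked twice
```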
Referring to the numpy.random.choice documentation:
Sampling random rows from a 2-D array is not possible with this function, but is possible with Generator.choice through its axis keyword.
The generator documentation is linked here: numpy.random.Generator.choice.
Using this knowledge, you can create a generator and then "choice" from your array:
rng = np.random.default_rng() #creates the generator ==> Generator(PCG64) at 0x2AA703BCE50
N = 3 #Number of Choices
a = np.array(Space_Position) #makes sure, a is an ndarray and numpy-supported
s = a.shape #(4,8,2)
a = a.reshape((s[0] * s[1], s[2])) #makes your array 2 dimensional keeping the last dimension separated
a.shape #(32, 2)
b = rng.choice(a, N, axis=0, replace=False) #returns N choices of a in array b, e.g. array([[ 57.1 , 11.8 ], [ 21.21, -5.54], [ 39.04, 11.28]])
#Note: replace=False prevents having the same entry several times in the result
Space_Position[np.random.randint(0, len(Space_Position))][np.random.randint(0, len(Space_Position[0]))]
gives you what you want.

NxN python arrays subsets

I need to carry out some operation on a subset of an NxN array. I have the center of the sub-array, x and y, and its size.
So I can easily do:
subset = data[y-size:y+size,x-size:x+size]
And this is fine.
What I ask is if there is the possibility to do the same without writing an explicit loop if x and y are both 1D arrays of positions.
Thanks!
Using a simple example of a 5x5 array and setting size=1 we can get:
import numpy as np
data = np.arange(25).reshape((5,5))
size = 1
x = np.array([1,4])
y = np.array([1,4])
subsets = [data[j-size:j+size,i-size:i+size] for i in x for j in y]
print(subsets)
Which returns a list of numpy arrays:
[array([[0, 1],[5, 6]]),
array([[15, 16],[20, 21]]),
array([[3, 4],[8, 9]]),
array([[18, 19],[23, 24]])]
Which I hope is what you are looking for.
To get the list of subsets, assuming you have the lists of positions xList and yList, this will do the trick:
subsetList = [ data[y-size:y+size,x-size:x+size] for x,y in zip(xList,yList) ]
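When every window lies fully inside the array, the loop can also be replaced by broadcasting index offsets. This is only a sketch of one option, reusing the 5x5 example from above with paired positions (x[k], y[k]):

```python
import numpy as np

data = np.arange(25).reshape((5, 5))
size = 1
x = np.array([1, 4])
y = np.array([1, 4])

offsets = np.arange(-size, size)                  # matches the slice [c-size:c+size]
rows = y[:, None, None] + offsets[None, :, None]  # shape (n, 2*size, 1)
cols = x[:, None, None] + offsets[None, None, :]  # shape (n, 1, 2*size)

subsets = data[rows, cols]                        # shape (n, 2*size, 2*size)
```

Unlike the list of slices, the result is a single 3-D array holding a copy of the data, with subsets[k] corresponding to the k-th (x, y) pair.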
