I'm trying to create a grid of coordinates for an algorithm that requires an understanding of distance. I know how to do this for a known number of dimensions, like so for 2D:
x = [0,1,2]
y = [10,11,12]
z = np.zeros((3,3,2))
for i, X in enumerate(x):
    for j, Y in enumerate(y):
        z[i][j][0] = X
        z[i][j][1] = Y
print(z)
--------------------------
array([[[ 0., 10.],
        [ 0., 11.],
        [ 0., 12.]],

       [[ 1., 10.],
        [ 1., 11.],
        [ 1., 12.]],

       [[ 2., 10.],
        [ 2., 11.],
        [ 2., 12.]]])
This works well enough. I end up with a shape of (3,3,2), where the 2 holds the coordinate values at that point. I'm trying to use this to create a probability surface, so I need each point to be its own "location" value. Is there a way to easily extend this into N dimensions? Doing it this way would require an unknown number of nested for loops. Due to project constraints I have access to Python built-ins and numpy, but that's more or less it.
I've tried np.meshgrid(), but it results in an output shape of (2,3,3), and my attempts to reshape it never give me the coordinates in the correct order. Any ideas on how I could do this cleanly?
I can replicate your z with
In [223]: np.stack([np.tile([x],(1,3)).reshape(3,3).T,np.tile([y],(3,1))],2)
Out[223]:
array([[[ 0, 10],
        [ 0, 11],
        [ 0, 12]],

       [[ 1, 10],
        [ 1, 11],
        [ 1, 12]],

       [[ 2, 10],
        [ 2, 11],
        [ 2, 12]]])
The tile pieces look like
In [224]: np.tile([y],(3,1))
Out[224]:
array([[10, 11, 12],
[10, 11, 12],
[10, 11, 12]])
In [225]: np.tile([x],(1,3)).reshape(3,3).T
Out[225]:
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
I might be able to clean up the second one. But the basic idea is to replicate the inputs in such a way that stack can combine them into the desired (n,n,2) array.
Once this is understood, it shouldn't be hard to extend things to 3d and up. But I haven't fully processed your intentions.
Possibly simpler (and repeat is faster than tile):
np.stack([np.repeat(x,3).reshape(3,3), np.repeat(y,3).reshape(3,3).T], 2)
With more dimensions the transpose might require refinement.
Same thing with meshgrid (it probably uses repeat or tile internally):
In [232]: np.stack(np.meshgrid(x,y, indexing='ij'),2)
Out[232]:
array([[[ 0, 10],
        [ 0, 11],
        [ 0, 12]],

       [[ 1, 10],
        [ 1, 11],
        [ 1, 12]],

       [[ 2, 10],
        [ 2, 11],
        [ 2, 12]]])
In higher dimensions:
In [237]: np.stack(np.meshgrid([1,2], [10,20,30], [100,200,300,400], indexing='ij'), 3).sum(axis=-1)
Out[237]:
array([[[111, 211, 311, 411],
        [121, 221, 321, 421],
        [131, 231, 331, 431]],

       [[112, 212, 312, 412],
        [122, 222, 322, 422],
        [132, 232, 332, 432]]])
After a bunch of distance-wise computation for specifying neighbors of every single atom, I end up with the following neighbor table (First column for the atom itself, second for its neighbor):
array([[ 0, 1],
[ 1, 0],
[ 1, 2],
[ 2, 1],
[ 2, 3],
[ 3, 2],
[ 3, 4],
[ 4, 3],
[ 4, 5],
...
[48, 47],
[48, 49],
[49, 48]])
For instance, the 0th atom has only one neighbor, which is indexed by 1 (that is the meaning of the 0th row). The second atom, which is indexed by 1, has two neighbors indexed by 0 and 2, since the number 1 is in between them. It goes on like that, and at the end, as there is no atom indexed by a number greater than 49, the last atom has only one neighbor, just like the 0th atom, and that neighbor is the atom indexed by 48.
What I want is to alter this array in a way that every row refers to only one atom and its neighbors, such that:
array([[ 0, 1],
[ 1, 0, 2],
[ 2, 1, 3],
[ 3, 2, 4],
[ 4, 3, 5],
...
[48, 47, 49],
[49, 48]])
where the first column refers to atoms themselves, and the rest of the columns refer to their whole neighbors.
Because the array will contain hundreds of thousands of items, and it will be processed thousands of times, I don't want to use a Python loop. I'm searching for a very efficient way of doing this. Moreover, the first and last atoms don't necessarily have one neighbor and the rest two; the number of neighbors for an atom can change. Hence, some indexing methods that seem to work at first probably won't work for this problem.
I thought about array manipulation methods, but I didn't manage to solve my problem. I'd appreciate it if you could guide me through this. Thank you.
This looks like a groupby-type operation. NumPy doesn't have much built-in functionality for group-by operations; pandas, however, does.
Here's an example of doing this efficiently using a pandas groupby:
import numpy as np
import pandas as pd
neighbors = np.array([[ 0, 1],
[ 1, 0],
[ 1, 2],
[ 2, 1],
[ 2, 3],
[ 3, 2],
[ 3, 4],
[ 4, 3],
[ 4, 5],
[48, 47],
[48, 49],
[49, 48]])
g = pd.Series(neighbors[:, 1]).groupby(neighbors[:, 0]).apply(list)
grouped = pd.DataFrame(g.to_list(), index=g.index).reset_index().to_numpy()
print(grouped)
# array([[ 0., 1., nan],
# [ 1., 0., 2.],
# [ 2., 1., 3.],
# [ 3., 2., 4.],
# [ 4., 3., 5.],
# [48., 47., 49.],
# [49., 48., nan]])
Note that numpy cannot have heterogeneous row lengths in a single array; here pandas uses np.nan as a fill value for missing entries.
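If pandas isn't an option, a similar grouping can be sketched in plain numpy with np.unique and np.split, assuming the rows are already sorted by the first column as in the example (the final list comprehension is still a small Python loop, but only over groups):
import numpy as np

neighbors = np.array([[0, 1], [1, 0], [1, 2], [2, 1], [2, 3],
                      [3, 2], [3, 4], [4, 3], [4, 5],
                      [48, 47], [48, 49], [49, 48]])

# return_index gives the start of each group because the first column is sorted
atoms, starts = np.unique(neighbors[:, 0], return_index=True)
groups = np.split(neighbors[:, 1], starts[1:])
ragged = [[a] + g.tolist() for a, g in zip(atoms.tolist(), groups)]
# [[0, 1], [1, 0, 2], [2, 1, 3], [3, 2, 4], [4, 3, 5], [48, 47, 49], [49, 48]]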
I have a requirement where I have 2 2D numpy arrays, and I would like to combine them in a specific manner:
x = [[0, 1, 2],
[3, 4, 5],
[6, 7, 8]]
| | |
0 1 2
y = [[10, 11, 12],
[13, 14, 15],
[16, 17, 18]]
| | |
3 4 5
x op y = [ 0 3 1 4 2 5 ] (in terms of the columns)
In other words,
The combination of x and y should look something like this:
[[ 0., 10., 1., 11., 2., 12.],
[ 3., 13., 4., 14., 5., 15.],
[ 6., 16., 7., 17., 8., 18.]]
Where I alternately combine the columns of each individual array to form the final 2D array. I have come up with one way of doing so, but it is rather ugly. Here's my code:
x = np.arange(9).reshape(3, 3)
y = np.arange(start=10, stop=19).reshape(3, 3)
>>> a = np.zeros((6, 3)) # create a 2D array where num_rows(a) = num_cols(x) + num_cols(y)
>>> a[: : 2] = x.T
>>> a[1: : 2] = y.T
>>> a.T
array([[ 0., 10., 1., 11., 2., 12.],
[ 3., 13., 4., 14., 5., 15.],
[ 6., 16., 7., 17., 8., 18.]])
As you can see, this is a very ugly sequence of operations. Furthermore, things become even more cumbersome in higher dimensions. For example, if x and y are [3 x 3 x 3], then this operation has to be repeated in each dimension, so I'd probably have to tackle this with a loop.
Is there a simpler way around this?
Thanks.
In [524]: x=np.arange(9).reshape(3,3)
In [525]: y=np.arange(10,19).reshape(3,3)
This doesn't look at all ugly to me (one-liners are overrated):
In [526]: a = np.zeros((3,6),int)
....
In [528]: a[:,::2]=x
In [529]: a[:,1::2]=y
In [530]: a
Out[530]:
array([[ 0, 10, 1, 11, 2, 12],
[ 3, 13, 4, 14, 5, 15],
[ 6, 16, 7, 17, 8, 18]])
Still, if you want a one-liner, this might do:
In [535]: np.stack((x.T,y.T),axis=1).reshape(6,3).T
Out[535]:
array([[ 0, 10, 1, 11, 2, 12],
[ 3, 13, 4, 14, 5, 15],
[ 6, 16, 7, 17, 8, 18]])
The idea in this last one was to combine the arrays on a new dimension and then reshape in some way or other. I found it by trial and error.
and with another trial:
In [539]: np.stack((x,y),2).reshape(3,6)
Out[539]:
array([[ 0, 10, 1, 11, 2, 12],
[ 3, 13, 4, 14, 5, 15],
[ 6, 16, 7, 17, 8, 18]])
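The same trick seems to generalize to higher dimensions by stacking along a new last axis and then folding that axis into the previous one. A rough sketch (interleave_last_axis is just an illustrative name):
import numpy as np

def interleave_last_axis(x, y):
    # Stack on a new trailing axis, then merge it into the last axis,
    # which alternates the entries of x and y along that axis.
    assert x.shape == y.shape
    return np.stack((x, y), axis=-1).reshape(*x.shape[:-1], -1)

x = np.arange(9).reshape(3, 3)
y = np.arange(10, 19).reshape(3, 3)
print(interleave_last_axis(x, y))
# [[ 0 10  1 11  2 12]
#  [ 3 13  4 14  5 15]
#  [ 6 16  7 17  8 18]]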
Here is a compact way to write it with a loop; it might be generalizable to higher-dimensional arrays with a little work:
x = np.array([[0,1,2], [3,4,5], [6,7,8]])
y = np.array([[10,11,12], [13,14,15], [16,17,18]])
z = np.zeros((3,6))
for i in range(3):
    z[i] = np.vstack((x[i], y[i])).reshape((-1,), order='F')
I have a matrix M that is rather large. I am trying to find the top 5 closest distances along with their indices.
M = csr_matrix(M)
dst = pairwise_distances(M,Y=None,metric='euclidean')
dst becomes a huge matrix and I am trying to sort it efficiently or use scipy or sklearn to find the closest 5 distances.
Here is an example of what I am trying to do:
X = np.array([[2, 3, 5], [2, 3, 6], [2, 3, 8], [2, 3, 3], [2, 3, 4]])
I then calculate dst as:
[[ 0. 1. 3. 2. 1.]
[ 1. 0. 2. 3. 2.]
[ 3. 2. 0. 5. 4.]
[ 2. 3. 5. 0. 1.]
[ 1. 2. 4. 1. 0.]]
So, row 0 to itself has a distance of 0., row 0 to 1 has a distance of 1.,... row 2 to row 3 has a distance of 5., and so on. I want to find these closest 5 distances and put them in a list with the corresponding rows, maybe like [distance, row, row]. I don't want any diagonal elements or duplicate elements so I take the upper triangular matrix as follows:
[[ inf 1. 3. 2. 1.]
[ nan inf 2. 3. 2.]
[ nan nan inf 5. 4.]
[ nan nan nan inf 1.]
[ nan nan nan nan inf]]
Now, the candidate smallest distances, from least to greatest, are:
[1, 0, 1], [1, 0, 4], [1, 3, 4], [2, 1, 2], [2, 0, 3], [2, 1, 4]
As you can see, there are three elements that have distance 1 and three elements that have distance 2. From these, one of the distance-2 elements has to be dropped at random, as I only want the top f elements, where f=5 in this case.
This is just a sample as this matrix could be very large. Is there an efficient way to do the above besides using a basic sorted function? I couldn't find any sklearn or scipy to help me with this.
Here's a fully vectorized solution to your problem:
import numpy as np
from scipy.spatial.distance import pdist
def smallest(M, f):
    # compute the condensed distance matrix
    dst = pdist(M, 'euclidean')
    # indices of the upper triangular matrix
    rows, cols = np.triu_indices(M.shape[0], k=1)
    # indices of the f smallest distances
    idx = np.argsort(dst)[:f]
    # gather results in the specified format: distance, row, column
    return np.vstack((dst[idx], rows[idx], cols[idx])).T
Notice that np.argsort(dst)[:f] yields the indices of the smallest f elements of the condensed distance matrix dst sorted in ascending order.
The following demo reproduces the result of your toy example and shows how the function smallest deals with a fairly large matrix of integers:
In [59]: X = np.array([[2, 3, 5], [2, 3, 6], [2, 3, 8], [2, 3, 3], [2, 3, 4]])
In [60]: smallest(X, 5)
Out[60]:
array([[ 1., 0., 1.],
[ 1., 0., 4.],
[ 1., 3., 4.],
[ 2., 0., 3.],
[ 2., 1., 2.]])
In [61]: large_X = np.random.randint(100, size=(10000, 2000))
In [62]: large_X
Out[62]:
array([[ 8, 78, 97, ..., 23, 93, 90],
[42, 2, 21, ..., 68, 45, 62],
[28, 45, 30, ..., 0, 75, 48],
...,
[26, 88, 78, ..., 0, 88, 43],
[91, 53, 94, ..., 85, 44, 37],
[39, 8, 10, ..., 46, 15, 67]])
In [63]: %time smallest(large_X, 5)
Wall time: 1min 32s
Out[63]:
array([[ 1676.12529365, 4815. , 5863. ],
[ 1692.97253374, 1628. , 2950. ],
[ 1693.558384 , 5742. , 8240. ],
[ 1695.86408654, 2140. , 6969. ],
[ 1696.68853948, 5477. , 6641. ]])
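If the full argsort ever becomes the bottleneck, np.argpartition could be swapped in, since only the f smallest entries need ordering. A sketch of that variant (not timed here):
import numpy as np
from scipy.spatial.distance import pdist

def smallest_partial(M, f):
    # Same idea as smallest(), but argpartition finds the f smallest
    # distances without fully sorting; only those f entries are sorted.
    dst = pdist(M, 'euclidean')
    rows, cols = np.triu_indices(M.shape[0], k=1)
    part = np.argpartition(dst, f)[:f]
    idx = part[np.argsort(dst[part])]
    return np.vstack((dst[idx], rows[idx], cols[idx])).T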
I'm trying to resize a numpy array, but it seems that resize works by first flattening the array, then taking the first X*Y elements and putting them into the new shape. What I want to do instead is to cut the array at coordinates (3,3), not rearrange it. A similar thing happens when I try to upsize it, say to (7,7) ... instead of "rearranging", I want to fill the new columns and rows with zeros and keep the data as it is.
Is there a way to do that ?
> a = np.zeros((5,5))
> a.flat = range(25)
> a
array(
[[ 0., 1., 2., 3., 4.],
[ 5., 6., 7., 8., 9.],
[ 10., 11., 12., 13., 14.],
[ 15., 16., 17., 18., 19.],
[ 20., 21., 22., 23., 24.]])
> a.resize((3,3),refcheck=False)
> a
array(
[[ 0., 1., 2.],
[ 3., 4., 5.],
[ 6., 7., 8.]])
thank you ...
Upsizing to 7x7 goes like this
upsized = np.zeros([7, 7])
upsized[:5, :5] = a
I believe you want to use numpy's slicing syntax instead of resize. resize works by first raveling the array and then refilling the new shape from that flattened order.
>>> a = np.arange(25).reshape(5,5)
>>> a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
>>> a[:3,:3]
array([[ 0, 1, 2],
[ 5, 6, 7],
[10, 11, 12]])
What you are doing here is taking a view of the numpy array. For example, to update the original array through the slice:
>>> a[:3,:3] = 0
>>> a
array([[ 0, 0, 0, 3, 4],
[ 0, 0, 0, 8, 9],
[ 0, 0, 0, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
An excellent guide to numpy's slicing syntax can be found in the NumPy indexing documentation.
Upsizing (or padding) only works by making a copy of the data. You start with an array of zeros and fill it in appropriately:
upsized = np.zeros([7, 7])
upsized[:5, :5] = a
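Equivalently, np.pad can do the zero-filling in one step; a small sketch of the same upsizing:
import numpy as np

a = np.arange(25).reshape(5, 5)
# pad 0 rows/columns before and 2 after along each axis, filled with zeros
upsized = np.pad(a, ((0, 2), (0, 2)), mode='constant', constant_values=0)
print(upsized.shape)  # (7, 7)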
I have a numpy array, for example:
points = np.array([[-468.927, -11.299, 76.271, -536.723],
[-429.379, -694.915, -214.689, 745.763],
[ 0., 0., 0., 0. ]])
If I print it or turn it into a string with str(), I get:
print(points)
[[-468.927 -11.299 76.271 -536.723]
[-429.379 -694.915 -214.689 745.763]
[ 0. 0. 0. 0. ]]
I need to turn it into a string that prints with separating commas while keeping the 2D array structure, that is:
[[-468.927, -11.299, 76.271, -536.723],
[-429.379, -694.915, -214.689, 745.763],
[ 0., 0., 0., 0. ]]
Does anybody know an easy way of turning a numpy array to that form of string?
I know that .tolist() adds the commas but the result loses the 2D structure.
Try using repr
>>> import numpy as np
>>> points = np.array([[-468.927, -11.299, 76.271, -536.723],
... [-429.379, -694.915, -214.689, 745.763],
... [ 0., 0., 0., 0. ]])
>>> print(repr(points))
array([[-468.927, -11.299, 76.271, -536.723],
[-429.379, -694.915, -214.689, 745.763],
[ 0. , 0. , 0. , 0. ]])
If you plan on using large numpy arrays, set np.set_printoptions(threshold=sys.maxsize) first (importing sys). Without it, the array representation will be truncated after about 1000 entries (by default).
>>> arr = np.arange(1001)
>>> print(repr(arr))
array([ 0, 1, 2, ..., 998, 999, 1000])
Of course, if you have arrays that large, this starts to become less useful, and you should probably analyze the data in some way other than just looking at it; there are also better ways of persisting a numpy array than saving its repr to a file...
Now, in numpy 1.11, there is numpy.array2string:
In [279]: a = np.reshape(np.arange(25, dtype='int8'), (5, 5))
In [280]: print(np.array2string(a, separator=', '))
[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]]
Comparing with repr from #mgilson (shows "array()" and dtype):
In [281]: print(repr(a))
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]], dtype=int8)
P.S. You still need np.set_printoptions(threshold=sys.maxsize) for large arrays.
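If changing the global print options is undesirable, np.array2string also takes a per-call threshold argument; a small sketch:
import numpy as np

arr = np.arange(1001)
# a threshold larger than arr.size disables summarization for this call only
print(np.array2string(arr, separator=', ', threshold=arr.size + 1))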
The function you are looking for is np.set_string_function.
What this function does is let you override the default __str__ or __repr__ functions for numpy objects. If you set the repr flag to True, the __repr__ function will be overridden with your custom function. Likewise, if you set repr=False, the __str__ function will be overridden. Since print calls the __str__ function of the object, we need to set repr=False.
For example:
np.set_string_function(lambda x: repr(x), repr=False)
x = np.arange(5)
print(x)
will print the output
array([0, 1, 2, 3, 4])
A more aesthetically pleasing version is
np.set_string_function(lambda x: repr(x).replace('(', '').replace(')', '').replace('array', '').replace('  ', ' '), repr=False)
print(np.eye(3))
which gives
[[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]]
Hope this answers your question.
Another way to do it, which is particularly helpful when an object doesn't have a __repr__() method, is to employ Python's pprint module (which has various formatting options). Here is what that looks like, by example:
>>> import numpy as np
>>> import pprint
>>>
>>> A = np.zeros(10, dtype=np.int64)
>>>
>>> print(A)
[0 0 0 0 0 0 0 0 0 0]
>>>
>>> pprint.pprint(A)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])