How to merge related array items into one along the rows? - python

After a bunch of distance-wise computation for specifying neighbors of every single atom, I end up with the following neighbor table (First column for the atom itself, second for its neighbor):
array([[ 0, 1],
[ 1, 0],
[ 1, 2],
[ 2, 1],
[ 2, 3],
[ 3, 2],
[ 3, 4],
[ 4, 3],
[ 4, 5],
...
[48, 47],
[48, 49],
[49, 48]])
For instance, the 0th atom has only one neighbor, which is indexed by 1 (it's the meaning of the 0th row). The second atom, which is indexed by 1, has two neighbors indexed by 0 and 2 since the number 1 is in between them. It goes like that, and at the end, as there is no atom indexed by a number greater than 49, the last atom has only one neighbor just like the 0th atom, and that neighbor is the atom indexed by the number 48.
What I want is to alter this array in a way that every row refers to only one atom and its neighbors, such that:
array([[ 0, 1],
[ 1, 0, 2],
[ 2, 1, 3],
[ 3, 2, 4],
[ 4, 3, 5],
...
[48, 47, 49],
[49, 48]])
where the first column refers to atoms themselves, and the rest of the columns refer to their whole neighbors.
Because the array will contain hundreds of thousands of items, and the routine will be called thousands of times, I don't want to use a Python loop. I'm looking for a very efficient way of doing this. Moreover, the number of neighbors is not necessarily one for the first and last atoms and two for the rest; it can vary from atom to atom. Hence, some indexing tricks probably won't work for this problem, although they may appear to work at first.
I thought about array manipulation methods, but I didn't manage to solve my problem. I'd appreciate it if you could guide me toward a solution. Thank you.

This looks like a group-by operation. NumPy doesn't have much built-in functionality for group-by operations, but pandas does.
Here's an example of doing this efficiently using a pandas groupby:
import numpy as np
import pandas as pd
neighbors = np.array([[ 0, 1],
[ 1, 0],
[ 1, 2],
[ 2, 1],
[ 2, 3],
[ 3, 2],
[ 3, 4],
[ 4, 3],
[ 4, 5],
[48, 47],
[48, 49],
[49, 48]])
g = pd.Series(neighbors[:, 1]).groupby(neighbors[:, 0]).apply(list)
grouped = pd.DataFrame(g.to_list(), index=g.index).reset_index().to_numpy()
print(grouped)
# array([[ 0., 1., nan],
# [ 1., 0., 2.],
# [ 2., 1., 3.],
# [ 3., 2., 4.],
# [ 4., 3., 5.],
# [48., 47., 49.],
# [49., 48., nan]])
Note that numpy cannot have heterogeneous row lengths in a single array; here pandas uses np.nan as a fill value for missing entries.
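If the NaN padding is unwanted, one NumPy-only possibility is to keep one neighbor array per atom instead of a rectangular table. The sketch below is a swapped-in technique (sort, np.unique, np.split), not the pandas approach above. The variable names order, atoms, starts and groups are illustrative, and the short neighbors array is made-up sample data.

```python
import numpy as np

# Made-up sample: column 0 is the atom, column 1 is one of its neighbors.
neighbors = np.array([[0, 1], [1, 0], [1, 2], [2, 1], [2, 3], [3, 2]])

# Sort by atom index so rows of the same atom are contiguous.
# A stable sort keeps the original neighbor order within each atom.
order = np.argsort(neighbors[:, 0], kind="stable")
sorted_pairs = neighbors[order]

# Distinct atoms and the position where each one first appears.
atoms, starts = np.unique(sorted_pairs[:, 0], return_index=True)

# Cut the neighbor column at those positions: one array per atom.
groups = np.split(sorted_pairs[:, 1], starts[1:])

for atom, group in zip(atoms, groups):
    print(atom, group)
```

The result is a Python list of arrays with different lengths, so no fill value is needed; the price is that it is no longer a single 2D array.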

Related

np.nan assigning massive negative integer?

When I run this code
import numpy as np
X = np.column_stack([np.arange(1,6), np.arange(1,6)])
X = np.roll(X,1,axis=0)
X[0,:] = np.nan
print(X)
My output is
array([[-9223372036854775808, -9223372036854775808],
[ 1, 1],
[ 2, 2],
[ 3, 3],
[ 4, 4]])
Which I didn't think is what it's supposed to do... I also ran the code in colab to make sure something crazy wasn't going on with my machine. Am I doing something that doesn't make sense?
Expected output is something along the lines of
array([[NaN, NaN],
[ 1, 1],
[ 2, 2],
[ 3, 3],
[ 4, 4]])
You are possibly getting this because you are using an old version of numpy.
You also can't assign a float to an integer array, and nan is of type float. In newer versions, X[0,:] = np.nan throws an error: ValueError: cannot convert float NaN to integer
So first use the astype() method to convert your int array to float:
X = X.astype(float)
Finally:
X[0,:] = np.nan
Now if you print X you will get:
array([[nan, nan],
[ 1., 1.],
[ 2., 2.],
[ 3., 3.],
[ 4., 4.]])
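The steps above can be wrapped so that the integer input is never modified. This is only a sketch: with_nan_first_row is a hypothetical helper name, not a NumPy function.

```python
import numpy as np

def with_nan_first_row(arr):
    """Return a float copy of arr whose first row is NaN (hypothetical helper)."""
    out = arr.astype(float)  # astype copies by default and upcasts ints to float64
    out[0, :] = np.nan       # valid now, because float arrays can store NaN
    return out

X = np.column_stack([np.arange(1, 6), np.arange(1, 6)])
X = np.roll(X, 1, axis=0)    # still an integer array at this point

Y = with_nan_first_row(X)
print(X.dtype.kind, Y.dtype.kind)
print(Y)
```

Keeping the integer array untouched makes it easier to see which arrays in a program may contain NaN.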

Numpy indexes order

I'm confused about how np.zeros() dimensions are handled.
I have a pandas dataframe with some toy data
# A toy 3col x 4row Dataframe
a = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],columns=['colA','colB','colC'])
b = pd.DataFrame([[40,41,42],[43,44,45],[46,47,48],[49,50,51]],columns=['colA','colB','colC'])
colA colB colC
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
I want to get these two dataframes into a 3D numpy array of dimensions (4 rows, 3 cols, 2 channels), so I can calculate statistics between the two dataframes (e.g. average, max values, etc.).
So I basically create a 3D array of zeros and populate each channel with the values from the dataframes.
But it looks like the dimensions are not correctly arranged.
c = np.zeros((4, 3, 2))
c[:,:,0] = a.values
c[:,:,1] = b.values
array([[[ 1., 40.],
[ 2., 41.],
[ 3., 42.]],
[[ 4., 43.],
[ 5., 44.],
[ 6., 45.]],
[[ 7., 46.],
[ 8., 47.],
[ 9., 48.]],
[[10., 49.],
[11., 50.],
[12., 51.]]])
If I put the number of channels as the first index then it is correctly arranged.
However this is very counterintuitive, usually in 3-dimensional data the channel is the third index, not the first.
c = np.zeros((2,4,3))
c[0,:,:] = a.values
c[1,:,:] = b.values
array([[[ 1., 2., 3.],
[ 4., 5., 6.],
[ 7., 8., 9.],
[10., 11., 12.]],
[[40., 41., 42.],
[43., 44., 45.],
[46., 47., 48.],
[49., 50., 51.]]])
I don't understand this logic. Why is the third dimension (channel) the first index instead of the last one?
When I calculate the average over the two channels I have to do it using axis=0 which is very confusing. Anybody looking at the code will think it's a columnwise average instead of an average between channels.
Am I doing anything wrong?
Regarding intuition, this is just the typical way of accessing arrays in most (all?) programming languages. Typically, when you do something like:
my_array[a][b][c][d]
which, considering all languages, is more common than NumPy-style indexing, what you typically mean is:
From my_array, get block a. This is an inner block.
From the previous block, get block b. This is an inner block.
From the previous block, get block c. This is an inner block.
From the previous block, get item d (which happens to not be a block because it's the last dimension).
The order is always outermost dimension to innermost dimension. This has nothing to do with images or channels. So in your example, if you expect c[0] to return the channel rather than what you call the row, then that is the intuition. You always put your outer dimension first, just like when you have an image as an array the first dimension is rows (height) and then columns (width).
This entire conversation ignores Fortran-based array orderings (MATLAB uses that, for example), where columns are "outer" to rows by definition. If you came from those languages to Python and C-based ordering (row -> column), that is a common source of misunderstanding. In this case, intuition just equals what you are used to working with, which is subjective and somewhat arbitrary.
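The outer-to-inner reading can be checked directly. The snippet below is a small illustration on a made-up array of shape (4, 3, 2); the array contents are arbitrary.

```python
import numpy as np

c = np.arange(24).reshape(4, 3, 2)  # 4 outer blocks, each of shape (3, 2)

# Chained indexing and NumPy-style indexing select the same element.
assert c[1][2][0] == c[1, 2, 0]

outer_block = c[0]        # first outer block: 3 rows of 2 values each
inner_block = c[0][0]     # innermost block: the 2 values at position (0, 0)
channel_0 = c[:, :, 0]    # fixing the last index gives a (4, 3) array

print(outer_block.shape, inner_block.shape, channel_0.shape)
```

With shape (4, 3, 2), the printed layout groups values by the first index, which is why the channels appear interleaved when they are stored on the last axis.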
I think that in 3-dimensional data the channel is usually the first index, as shown in your second snippet. That is how it is arranged, so just use it that way.
This would be my approach
>>> x = a.values.reshape((a.shape[0], a.shape[1], 1)) # Convert 2D to 3D - One layer
>>> y = b.values.reshape((b.shape[0], b.shape[1], 1)) # Convert 2D to 3D - Second layer
>>> z = np.concatenate((x, y), axis=2) # Concatenate along the third axis (index 2, counting from zero)
Which would yield something similar to your array, which is correct.
array([[[ 1, 40],
[ 2, 41],
[ 3, 42]],
[[ 4, 43],
[ 5, 44],
[ 6, 45]],
[[ 7, 46],
[ 8, 47],
[ 9, 48]],
[[10, 49],
[11, 50],
[12, 51]]])
Also, if you want to see the array as a dataframe (just for checking):
>>> pd.DataFrame(z.tolist())
0 1 2
0 [1, 40] [2, 41] [3, 42]
1 [4, 43] [5, 44] [6, 45]
2 [7, 46] [8, 47] [9, 48]
3 [10, 49] [11, 50] [12, 51]
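A shorter way to get the same (4, 3, 2) layout, offered here as an alternative sketch rather than the answerer's method, is np.stack with axis=-1. Plain arrays a_vals and b_vals stand in for a.values and b.values so the snippet runs without pandas; those two names are assumptions.

```python
import numpy as np

# Stand-ins for a.values and b.values from the question.
a_vals = np.arange(1, 13).reshape(4, 3)
b_vals = np.arange(40, 52).reshape(4, 3)

# Stack along a new last axis: shape (rows, cols, channels) = (4, 3, 2).
z = np.stack([a_vals, b_vals], axis=-1)

# Statistics "between the two dataframes" reduce over the channel axis.
channel_mean = z.mean(axis=-1)
channel_max = z.max(axis=-1)

print(z.shape)
print(channel_mean)
```

Writing axis=-1 for the reduction makes it explicit in the code that the average is taken across channels, not across rows or columns.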

Creating an N-dimensional grid with Python

I'm trying to create a grid of coordinates for an algorithm that requires an understanding of distance. I know how to do this for a known number of dimensions - like so for 2D:
import numpy as np
x = [0,1,2]
y = [10,11,12]
z = np.zeros((3,3,2))
for i,X in enumerate(x):
    for j,Y in enumerate(y):
        z[i][j][0] = X
        z[i][j][1] = Y
print(z)
--------------------------
array([[[ 0., 10.],
[ 0., 11.],
[ 0., 12.]],
[[ 1., 10.],
[ 1., 11.],
[ 1., 12.]],
[[ 2., 10.],
[ 2., 11.],
[ 2., 12.]]])
This works well enough. I end up with a shape of (3,3,2) where the 2 is the values of the coordinates at that point. I'm trying to use this to create a probability surface, so I need each point to have its own "location" value. Is there a way to easily extend this into N dimensions? There I would have an unknown number of for loops. Due to project constraints I have access to Python built-ins and numpy, but that's more or less it.
I've tried np.meshgrid() but it results in an output shape of (2,3,3) and my attempts to reshape it never give me the coordinates in the correct order. Any ideas on how I could do this cleanly?
I can replicate your z with
In [223]: np.stack([np.tile([x],(1,3)).reshape(3,3).T,np.tile([y],(3,1))],2)
Out[223]:
array([[[ 0, 10],
[ 0, 11],
[ 0, 12]],
[[ 1, 10],
[ 1, 11],
[ 1, 12]],
[[ 2, 10],
[ 2, 11],
[ 2, 12]]])
The tile pieces look like
In [224]: np.tile([y],(3,1))
Out[224]:
array([[10, 11, 12],
[10, 11, 12],
[10, 11, 12]])
In [225]: np.tile([x],(1,3)).reshape(3,3).T
Out[225]:
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
I might be able to clean up the 2nd one. But the basic idea is to replicate the inputs in such a way that stack can combine them into the desired (n,n,2) array.
Once this is understood, it shouldn't be hard to extend things to 3d and up. But I haven't fully processed your intentions.
Possibly simpler (and repeat is faster than tile):
np.stack([np.repeat(x,3).reshape(3,3), np.repeat(y,3).reshape(3,3).T], 2)
With more dimensions the transpose might require refinement.
Same thing with meshgrid (it probably uses repeat or tile internally):
In [232]: np.stack(np.meshgrid(x,y, indexing='ij'),2)
Out[232]:
array([[[ 0, 10],
[ 0, 11],
[ 0, 12]],
[[ 1, 10],
[ 1, 11],
[ 1, 12]],
[[ 2, 10],
[ 2, 11],
[ 2, 12]]])
In higher dimensions:
In [237]: np.stack(np.meshgrid([1,2], [10,20,30], [100,200,300,400], indexing='ij'), 3).sum(axis=-1)
Out[237]:
array([[[111, 211, 311, 411],
[121, 221, 321, 421],
[131, 231, 331, 431]],
[[112, 212, 312, 412],
[122, 222, 322, 422],
[132, 232, 332, 432]]])
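The meshgrid approach can be wrapped for an arbitrary number of axes. The following is a minimal sketch; coordinate_grid is a hypothetical helper name, and it assumes every input is a 1D sequence.

```python
import numpy as np

def coordinate_grid(*axes):
    """Stack 'ij'-indexed meshgrids so the last axis holds the coordinates.

    Hypothetical helper: for N input axes of lengths n1..nN the result
    has shape (n1, ..., nN, N).
    """
    return np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)

grid2 = coordinate_grid([0, 1, 2], [10, 11, 12])
grid3 = coordinate_grid([1, 2], [10, 20, 30], [100, 200, 300, 400])

print(grid2.shape, grid3.shape)
print(grid3[1, 2, 3])  # coordinates of the point at index (1, 2, 3)
```

Using axis=-1 instead of a hard-coded axis number is what lets the same call work for any number of dimensions.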

huge matrix sorted and then find smallest elements with their indices into a list

I have a matrix M that is rather large. I am trying to find the top 5 closest distances along with their indices.
M = csr_matrix(M)
dst = pairwise_distances(M,Y=None,metric='euclidean')
dst becomes a huge matrix and I am trying to sort it efficiently or use scipy or sklearn to find the closest 5 distances.
Here is an example of what I am trying to do:
X = np.array([[2, 3, 5], [2, 3, 6], [2, 3, 8], [2, 3, 3], [2, 3, 4]])
I then calculate dst as:
[[ 0. 1. 3. 2. 1.]
[ 1. 0. 2. 3. 2.]
[ 3. 2. 0. 5. 4.]
[ 2. 3. 5. 0. 1.]
[ 1. 2. 4. 1. 0.]]
So, row 0 to itself has a distance of 0., row 0 to 1 has a distance of 1.,... row 2 to row 3 has a distance of 5., and so on. I want to find these closest 5 distances and put them in a list with the corresponding rows, maybe like [distance, row, row]. I don't want any diagonal elements or duplicate elements so I take the upper triangular matrix as follows:
[[ inf 1. 3. 2. 1.]
[ nan inf 2. 3. 2.]
[ nan nan inf 5. 4.]
[ nan nan nan inf 1.]
[ nan nan nan nan inf]]
Now, the top 5 distances least to greatest are:
[1, 0, 1], [1, 0, 4], [1, 3, 4], [2, 1, 2], [2, 0, 3], [2, 1, 4]
As you can see there are three elements that have distance 2 and three elements that have distance 1. From these I want to randomly choose one of the elements with distance 2 to keep as I only want the top f elements where f=5 in this case.
This is just a sample, as this matrix could be very large. Is there an efficient way to do the above besides using a basic sorted function? I couldn't find any sklearn or scipy function to help me with this.
Here's a fully vectorized solution to your problem:
import numpy as np
from scipy.spatial.distance import pdist
def smallest(M, f):
    # compute the condensed distance matrix
    dst = pdist(M, 'euclidean')
    # indices of the upper triangular matrix
    rows, cols = np.triu_indices(M.shape[0], k=1)
    # indices of the f smallest distances
    idx = np.argsort(dst)[:f]
    # gather results in the specified format: distance, row, column
    return np.vstack((dst[idx], rows[idx], cols[idx])).T
Notice that np.argsort(dst)[:f] yields the indices of the smallest f elements of the condensed distance matrix dst sorted in ascending order.
The following demo reproduces the result of your toy example and shows how the function smallest deals with a fairly large matrix of integers:
In [59]: X = np.array([[2, 3, 5], [2, 3, 6], [2, 3, 8], [2, 3, 3], [2, 3, 4]])
In [60]: smallest(X, 5)
Out[60]:
array([[ 1., 0., 1.],
[ 1., 0., 4.],
[ 1., 3., 4.],
[ 2., 0., 3.],
[ 2., 1., 2.]])
In [61]: large_X = np.random.randint(100, size=(10000, 2000))
In [62]: large_X
Out[62]:
array([[ 8, 78, 97, ..., 23, 93, 90],
[42, 2, 21, ..., 68, 45, 62],
[28, 45, 30, ..., 0, 75, 48],
...,
[26, 88, 78, ..., 0, 88, 43],
[91, 53, 94, ..., 85, 44, 37],
[39, 8, 10, ..., 46, 15, 67]])
In [63]: %time smallest(large_X, 5)
Wall time: 1min 32s
Out[63]:
array([[ 1676.12529365, 4815. , 5863. ],
[ 1692.97253374, 1628. , 2950. ],
[ 1693.558384 , 5742. , 8240. ],
[ 1695.86408654, 2140. , 6969. ],
[ 1696.68853948, 5477. , 6641. ]])
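Since only f entries are needed, the full argsort in smallest could in principle be replaced by np.argpartition followed by a sort of just the f selected entries. The sketch below is a variation under that assumption, not part of the answer above: smallest_from_condensed is a hypothetical name, it takes the condensed distance vector as input, and the toy distances are computed with plain NumPy instead of pdist so the snippet has no SciPy dependency.

```python
import numpy as np

def smallest_from_condensed(dst, n, f):
    """Pick the f smallest entries of a condensed distance vector.

    Hypothetical variant of smallest(): dst has length n*(n-1)/2 and
    is ordered like the upper triangle (k=1); requires 1 <= f <= len(dst).
    """
    rows, cols = np.triu_indices(n, k=1)
    # argpartition only guarantees that the first f slots hold the
    # f smallest distances, in arbitrary order.
    idx = np.argpartition(dst, f - 1)[:f]
    # Sort just those f candidates into ascending order.
    idx = idx[np.argsort(dst[idx])]
    return np.column_stack((dst[idx], rows[idx], cols[idx]))

# Toy data from the question.
X = np.array([[2, 3, 5], [2, 3, 6], [2, 3, 8], [2, 3, 3], [2, 3, 4]])
r, c = np.triu_indices(len(X), k=1)
dst = np.sqrt(((X[r] - X[c]) ** 2).sum(axis=1))

result = smallest_from_condensed(dst, len(X), 5)
print(result)
```

Which of the tied distance-2 pairs end up in the result is not specified by argpartition, so the selection among ties is arbitrary rather than uniformly random.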

string representation of a numpy array with commas separating its elements

I have a numpy array, for example:
points = np.array([[-468.927, -11.299, 76.271, -536.723],
[-429.379, -694.915, -214.689, 745.763],
[ 0., 0., 0., 0. ]])
If I print it or turn it into a string with str() I get:
print(points)
[[-468.927 -11.299 76.271 -536.723]
[-429.379 -694.915 -214.689 745.763]
[ 0. 0. 0. 0. ]]
I need to turn it into a string that prints with separating commas while keeping the 2D array structure, that is:
[[-468.927, -11.299, 76.271, -536.723],
[-429.379, -694.915, -214.689, 745.763],
[ 0., 0., 0., 0. ]]
Does anybody know an easy way of turning a numpy array to that form of string?
I know that .tolist() adds the commas but the result loses the 2D structure.
Try using repr
>>> import numpy as np
>>> points = np.array([[-468.927, -11.299, 76.271, -536.723],
... [-429.379, -694.915, -214.689, 745.763],
... [ 0., 0., 0., 0. ]])
>>> print(repr(points))
array([[-468.927, -11.299, 76.271, -536.723],
[-429.379, -694.915, -214.689, 745.763],
[ 0. , 0. , 0. , 0. ]])
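If you plan on using large numpy arrays, call np.set_printoptions(threshold=sys.maxsize) first (after import sys). Older NumPy releases accepted threshold=np.nan, but current ones reject it with a ValueError. Without it, the array representation will be truncated after about 1000 entries (by default).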
>>> arr = np.arange(1001)
>>> print(repr(arr))
array([ 0, 1, 2, ..., 998, 999, 1000])
Of course, if you have arrays that large, this starts to become less useful and you should probably analyze the data some way other than just looking at it, and there are better ways of persisting a numpy array than saving its repr to a file...
Now, in numpy 1.11, there is numpy.array2string:
In [279]: a = np.reshape(np.arange(25, dtype='int8'), (5, 5))
In [280]: print(np.array2string(a, separator=', '))
[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]]
Comparing with repr from @mgilson (shows "array()" and dtype):
In [281]: print(repr(a))
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]], dtype=int8)
P.S. Still need np.set_printoptions(threshold=sys.maxsize) for large arrays (threshold=np.nan only worked on older NumPy versions).
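To avoid changing the global print settings, the threshold can also be passed per call or set temporarily. This is a sketch assuming a reasonably recent NumPy, where array2string accepts a threshold argument and np.printoptions is available as a context manager.

```python
import sys
import numpy as np

arr = np.arange(1001)

# Per-call: threshold=sys.maxsize disables the "..." summarization.
full = np.array2string(arr, separator=", ", threshold=sys.maxsize)

# Temporary: the previous print options are restored on exit.
with np.printoptions(threshold=sys.maxsize):
    full_repr = repr(arr)

# Outside the block the default summarization applies again.
short_repr = repr(arr)

print("..." in full, "..." in full_repr, "..." in short_repr)
```

Both variants leave the process-wide defaults untouched, which matters when other code relies on the truncated output.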
The function you are looking for is np.set_string_function. Note that it was removed from the main namespace in NumPy 2.0, so the following applies to older releases.
What this function does is let you override the default __str__ or __repr__ functions for numpy objects. If you set the repr flag to True, the __repr__ function will be overridden with your custom function. Likewise, if you set repr=False, the __str__ function will be overridden. Since print calls the __str__ function of the object, we need to set repr=False.
For example:
np.set_string_function(lambda x: repr(x), repr=False)
x = np.arange(5)
print(x)
will print the output
array([0, 1, 2, 3, 4])
A more aesthetically pleasing version is
np.set_string_function(lambda x: repr(x).replace('(', '').replace(')', '').replace('array', '').replace(" ", ' ') , repr=False)
print(np.eye(3))
which gives
[[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]]
Hope this answers your question.
Another way to do it, which is particularly helpful when an object doesn't have a __repr__() method, is to employ Python's pprint module (which has various formatting options). Here is what that looks like, by example:
>>> import numpy as np
>>> import pprint
>>>
>>> A = np.zeros(10, dtype=np.int64)
>>>
>>> print(A)
[0 0 0 0 0 0 0 0 0 0]
>>>
>>> pprint.pprint(A)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
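If the only problem with .tolist() is that the rows end up on a single line, the rows can be formatted one at a time and joined with line breaks. This is a plain-Python sketch, not a NumPy feature; the variable name text is illustrative, and the columns are not padded to equal width as they are with repr or array2string.

```python
import ast
import numpy as np

points = np.array([[-468.927, -11.299, 76.271, -536.723],
                   [-429.379, -694.915, -214.689, 745.763],
                   [0., 0., 0., 0.]])

# Format each row with Python's list formatting, one row per line.
text = "[" + ",\n ".join(str(row) for row in points.tolist()) + "]"
print(text)

# The string is a valid Python literal, so it can be parsed back.
assert ast.literal_eval(text) == points.tolist()
```

Because the output is a valid nested-list literal, it can be pasted back into Python source or parsed with ast.literal_eval.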
