I have a list of N=3 points like this as input:
points = [[1, 1], [2, 2], [4, 4]]
I wrote this code to compute all possible distances between all elements of my list points, as dist = min(|x1-x2|, |y1-y2|):
N = len(points)
distances = []
for i in range(N-1):
    for j in range(i+1, N):
        dist = min(abs(points[i][0]-points[j][0]), abs(points[i][1]-points[j][1]))
        distances.append(dist)
print(distances)
My output is the list distances with all the distances saved in it: [1, 3, 2]
It works fine with N=3, but I would like to compute it in a more efficient way and be free to set N=10^5.
I am also trying to use numpy and scipy, but I am having a little trouble replacing the loops with the correct methods.
Can anybody help me, please? Thanks in advance
The numpythonic solution
To compute your distances using the full power of Numpy, and to do it
substantially faster:
Convert your points to a Numpy array:
import numpy as np

pts = np.array(points)
Then run:
dist = np.abs(pts[np.newaxis, :, :] - pts[:, np.newaxis, :]).min(axis=2)
Here the subtraction broadcasts pts against itself into an N x N x 2 array of coordinate differences, and .min(axis=2) takes the smaller of the x- and y-differences, so the result is a square N x N array.
But if you want to get a list of just the elements above the diagonal,
like your code generates, you can run:
dist2 = dist[np.triu_indices(pts.shape[0], 1)].tolist()
I ran this code for the following 9 points:
points = [[1, 1], [2, 2], [4, 4], [3, 5], [2, 8], [4, 10], [3, 7], [2, 9], [4, 7]]
For the above data, the result saved in dist (a full array) is:
array([[0, 1, 3, 2, 1, 3, 2, 1, 3],
[1, 0, 2, 1, 0, 2, 1, 0, 2],
[3, 2, 0, 1, 2, 0, 1, 2, 0],
[2, 1, 1, 0, 1, 1, 0, 1, 1],
[1, 0, 2, 1, 0, 2, 1, 0, 1],
[3, 2, 0, 1, 2, 0, 1, 1, 0],
[2, 1, 1, 0, 1, 1, 0, 1, 0],
[1, 0, 2, 1, 0, 1, 1, 0, 2],
[3, 2, 0, 1, 1, 0, 0, 2, 0]])
and the list of elements from the upper-diagonal part is:
[1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 0, 2, 1, 0, 2, 1, 2, 0, 1, 2, 0, 1, 1, 0, 1, 1,
2, 1, 0, 1, 1, 1, 0, 1, 0, 2]
How much faster is my code
It turns out that even for such a small sample as I used (9
points), my code runs 2 times faster. For a sample of 18 points
(not presented here) it is 6 times faster.
This difference in speed holds even though my function
computes "2 times more than needed", i.e. it generates a full
array, whereas the lower-diagonal part of the result is a "mirror
view" of the upper-diagonal part (which is what your code computes).
For a bigger number of points the difference should be much bigger.
Run your test on a bigger sample of points (say 100 points) and write how
many times faster my code was.
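A minimal benchmark sketch for that comparison (the random sample and repetition count here are illustrative choices, not from the original posts):
import timeit
import numpy as np

# illustrative sample: 100 random 2-D points
points = np.random.randint(0, 100, size=(100, 2)).tolist()
N = len(points)

def loop_version():
    distances = []
    for i in range(N - 1):
        for j in range(i + 1, N):
            distances.append(min(abs(points[i][0] - points[j][0]),
                                 abs(points[i][1] - points[j][1])))
    return distances

def numpy_version():
    pts = np.array(points)
    dist = np.abs(pts[np.newaxis, :, :] - pts[:, np.newaxis, :]).min(axis=2)
    return dist[np.triu_indices(N, 1)].tolist()

print("loops:", timeit.timeit(loop_version, number=100))
print("numpy:", timeit.timeit(numpy_version, number=100))
One caveat for the N=10^5 goal: the full N x N array then has 10^10 entries (about 80 GB as int64), so at that scale the points would have to be processed in chunks rather than broadcast all at once.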
Related
I would like to generate a 2d Array like this using Python and Numpy:
[
[0, 1, 2, 3, 4, 4, 3, 4],
[1, 2, 3, 4, 4, 3, 2, 3],
[2, 3, 4, 4, 3, 2, 1, 2],
[3, 4, 4, 3, 2, 1, 0, 1],
[4, 5, 5, 4, 3, 2, 1, 2]
]
Pretty much, the numbers spread left and right starting from the zeros. This matrix lets you see the distance of any point to the closest zero. I thought this matrix was common, but I couldn't find anything on the web, not even its name. If you have code to efficiently generate such a matrix, or at least know what it's called, please let me know.
Thank you
Here's one with Scipy cdist -
import numpy as np
from scipy.spatial.distance import cdist

def bwdist_manhattan(a, seedval=1):
    seed_mask = a==seedval
    z = np.argwhere(seed_mask)    # coordinates of the seed cells
    nz = np.argwhere(~seed_mask)  # coordinates of all other cells
    out = np.zeros(a.shape, dtype=int)
    # for each non-seed cell, the cityblock distance to its nearest seed
    out[tuple(nz.T)] = cdist(z, nz, 'cityblock').min(0).astype(int)
    return out
In MATLAB, this is called the distance transform of a binary image (function bwdist), hence the derived name given here.
Sample run -
In [60]: a # input binary image with 1s at "seed" positions
Out[60]:
array([[1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0]])
In [61]: bwdist_manhattan(a)
Out[61]:
array([[0, 1, 2, 3, 4, 4, 3, 4],
[1, 2, 3, 4, 4, 3, 2, 3],
[2, 3, 4, 4, 3, 2, 1, 2],
[3, 4, 4, 3, 2, 1, 0, 1],
[4, 5, 5, 4, 3, 2, 1, 2]])
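For reference, SciPy also ships a ready-made chamfer distance transform that should reproduce this result; a usage sketch, assuming scipy.ndimage is available:
from scipy.ndimage import distance_transform_cdt

# distance_transform_cdt measures the taxicab distance to the nearest zero,
# so invert the image first to turn the seed positions into the zeros
out = distance_transform_cdt(a == 0, metric='taxicab')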
I have a 20x20 2D array, from which I want to get, for every column, the value with the highest count of occurrences (excluding zeros), i.e. the value that receives the majority vote.
I can do that for a single column like this:
>>> np.unique(p[:,0][p[:,0] != 0], return_counts=True)
(array([ 3, 21], dtype=int16), array([1, 3]))
>>> nums, cnts = np.unique(p[:,0][p[:,0] != 0], return_counts=True)
>>> nums[cnts.argmax()]
21
Just for completeness, we can extend the earlier proposed method to a loop-based solution for 2D arrays -
# p is the 2D input array
output_per_col = []
for i in range(p.shape[1]):
    nums, cnts = np.unique(p[:,i][p[:,i] != 0], return_counts=True)
    output_per_col.append(nums[cnts.argmax()])
How do I do that for all columns without using a for loop?
We can use bincount2D_vectorized from the linked Q&A to get binned counts per column, where the bins are the integer values themselves. Then simply slice out from the second count onwards (as the first count is for 0), take the argmax, and add 1 (to compensate for the slicing). That's our desired output.
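Since that helper isn't defined in this excerpt, here is a minimal sketch of a per-row bincount, assuming small nonnegative integer input (based on the linked Q&A):
import numpy as np

def bincount2D_vectorized(a):
    # one bin per integer value, 0..a.max(), shared by all rows
    N = a.max() + 1
    # offset each row's values into that row's own block of bins
    a_offs = a + np.arange(a.shape[0])[:, None] * N
    return np.bincount(a_offs.ravel(), minlength=a.shape[0] * N).reshape(-1, N)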
Hence, the solution shown as a sample case run -
In [116]: p # input array
Out[116]:
array([[4, 3, 4, 1, 1, 0, 2, 0],
[4, 0, 0, 0, 0, 0, 4, 0],
[3, 1, 3, 4, 3, 1, 4, 3],
[4, 4, 3, 3, 1, 1, 3, 2],
[3, 0, 3, 0, 4, 4, 4, 0],
[3, 0, 0, 3, 2, 0, 1, 4],
[4, 0, 3, 1, 3, 3, 2, 0],
[3, 3, 0, 0, 2, 1, 3, 1],
[2, 4, 0, 0, 2, 3, 4, 2],
[0, 2, 4, 2, 0, 2, 2, 4]])
In [117]: bincount2D_vectorized(p.T)[:,1:].argmax(1)+1
Out[117]: array([3, 3, 3, 1, 2, 1, 4, 2])
That transpose is needed because bincount2D_vectorized gives us 2D bincounts per row. Thus, for the alternative problem of getting the most frequent value per row, simply skip that transpose.
Also, feel free to explore other options in that linked Q&A to get 2D-bincounts.
How can I get the sorted indices of a numpy array (distance), only considering certain indices from another numpy array (val)?
For example, consider the two numpy arrays val and distance below:
val = np.array([[10, 0, 0, 0, 0],
[0, 0, 10, 0, 10],
[0, 10, 10, 0, 0],
[0, 0, 0, 10, 0],
[0, 0, 0, 0, 0]])
distance = np.array([[4, 3, 2, 3, 4],
[3, 2, 1, 2, 3],
[2, 1, 0, 1, 2],
[3, 2, 1, 2, 3],
[4, 3, 2, 3, 4]])
The distances where val == 10 are 4, 1, 3, 1, 0, 2. I would like to get these sorted as 0, 1, 1, 2, 3, 4 and return the corresponding indices from the distance array.
Returning something like:
(array([2, 1, 2, 3, 1, 0], dtype=int64), array([2, 2, 1, 3, 4, 0], dtype=int64))
or:
(array([2, 2, 1, 3, 1, 0], dtype=int64), array([2, 1, 2, 3, 4, 0], dtype=int64))
since the second and third elements both have distance 1, I guess those indices are interchangeable.
I tried using combinations of np.where, np.argsort, np.argpartition, and np.unravel_index, but I can't seem to get it working right.
Here's one way with masking -
In [20]: mask = val==10
In [21]: np.argwhere(mask)[distance[mask].argsort()]
Out[21]:
array([[2, 2],
[1, 2],
[2, 1],
[3, 3],
[1, 4],
[0, 0]])
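If you prefer the tuple-of-arrays form shown in the question, transposing that result gets you there (the values below are just the Out[21] rows read column-wise):
# transpose to get the (row_indices, col_indices) tuple form
rows, cols = np.argwhere(mask)[distance[mask].argsort()].T
# rows -> array([2, 1, 2, 3, 1, 0]), cols -> array([2, 2, 1, 3, 4, 0])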
Take the array: arr = [0, 1, 2]
>>> np.tile(arr,[10,1])
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
>>> np.tile(arr,[10,2])
array([[0, 1, 2, 0, 1, 2],
[0, 1, 2, 0, 1, 2],
[0, 1, 2, 0, 1, 2],
[0, 1, 2, 0, 1, 2],
[0, 1, 2, 0, 1, 2],
[0, 1, 2, 0, 1, 2],
[0, 1, 2, 0, 1, 2],
[0, 1, 2, 0, 1, 2],
[0, 1, 2, 0, 1, 2],
[0, 1, 2, 0, 1, 2]])
Similar to this, I want to use the tile function to create 10 copies of an image batch of size 10x227x227x3 (the batch already has 10 images). For each image I want to create a tile, so I should get 100x227x227x3.
However, when I do this (with M=10):
images = np.tile(img_batch, [M, 1])
I get 10x227x2270x3 instead. images = np.tile(img_batch, [M]) doesn't work either and produces an array of size 10x227x227x30.
I can't get my head around how to get the tiles I need. Any recommendations are welcome.
Your img_batch has 4 dimensions. Make the reps of size 4:
np.tile(img_batch, [M, 1, 1, 1])
Otherwise, it will be equivalent to np.tile(img_batch, [1, 1, M, 1]) in your first case, according to the docs:
If A.ndim > d, reps is promoted to A.ndim by pre-pending 1’s to it.
Thus for an A of shape (2, 3, 4, 5), a reps of (2, 2) is treated as
(1, 1, 2, 2).
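Note that np.tile repeats the whole batch in sequence (img0..img9, img0..img9, ...). If each image should instead appear M times in a row, np.repeat along axis 0 gives the same 100x227x227x3 shape with that per-image ordering:
# repeat each image M times consecutively instead of tiling the whole batch
images = np.repeat(img_batch, M, axis=0)  # shape: (100, 227, 227, 3)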
I asked this question here: How to convert occurence matrix to co-occurence matrix
I realized that my data is so big that it is not possible to do this using R; my computer hangs. The actual data is a text file with ~5 million rows and 600 columns. I think Python may be an alternative for doing this.
This would be the way you translate the R code to Python code.
>>> import numpy as np
>>> a=np.array([[0, 1, 0, 0, 1, 1],
[0, 0, 1, 1, 0, 1],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 1, 1]])
>>> acov=np.dot(a.T, a)
>>> acov[np.diag_indices_from(acov)]=0
>>> acov
array([[0, 2, 2, 1, 1, 1],
[2, 0, 2, 1, 2, 2],
[2, 2, 0, 2, 1, 2],
[1, 1, 2, 0, 0, 1],
[1, 2, 1, 0, 0, 2],
[1, 2, 2, 1, 2, 0]])
However, you have a very big dataset. If you don't want to assemble the co-occurrence matrix piece by piece and you store your values as int64, then with 3e+9 numbers it will take 24 GB of RAM just to hold the data (http://www.wolframalpha.com/input/?i=3e9+*+8+bytes). So you probably want to think over which dtype to store your data in: http://docs.scipy.org/doc/numpy/user/basics.types.html. Using int16 will probably make the dot-product operation possible on a decent desktop PC nowadays.
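If the data still doesn't fit, the piece-by-piece route mentioned above could be sketched roughly like this (the chunking scheme and names are illustrative, not from the original answer):
import numpy as np

def cooccurrence_chunked(chunks):
    # chunks: an iterable of 2D arrays, each (n_rows_in_chunk, 600)-ish,
    # e.g. produced by reading the text file a few hundred thousand rows
    # at a time
    acov = None
    for chunk in chunks:
        c = chunk.astype(np.int64)   # upcast per chunk to avoid overflow
        part = np.dot(c.T, c)        # partial co-occurrence from this chunk
        acov = part if acov is None else acov + part
    acov[np.diag_indices_from(acov)] = 0
    return acov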