How to vectorize a function with array lookups - python

I'm trying to vectorize my fitness function for a Minimum Vertex Cover genetic algorithm, but I'm at a loss about how to do it.
As it stands now:
vert_cover_fitness = [1 if self.dna[edge[0]] or self.dna[edge[1]] else -num_edges for edge in edges]
dna is a one-dimensional binary array indexed 0..n, where each index corresponds to a vertex and its value indicates whether we have chosen that vertex or not. edges is a two-dimensional positive integer array, where each value corresponds to a vertex (index) in dna. Both are ndarrays.
Simply explained - if one of the vertices connected by an edge is "selected", then we get a score of one. If not, the function is penalized by -num_edges.
I have tried np.vectorize as an attempt to get away cheap with a lambda function:
fit_func = np.vectorize(lambda edge: 1 if self.dna[edge[0]] or self.dna[edge[1]] else -num_edges)
vert_cover_fitness = fit_func(edges)
This returns IndexError: invalid index to scalar variable., as this function is applied to each value, and not each row.
To fix this I tried np.apply_along_axis. This works but it's just a wrapper for a loop so I'm not getting any speedups.
If any Numpy wizards can see some obvious way to do this, I would much appreciate your help. I'm guessing a problem lies with the representation of the problem, and that changing either the dna or edges shapes could help. I'm just not skilled enough to see what I should do.

I came up with this bit of numpy code; it runs 30x faster than your loop on my randomly generated data.
import numpy as np
num_vertices = 1000
num_edges = 500
dna = np.random.choice([0, 1], num_vertices)
edges = np.random.randint(0, num_vertices, num_edges * 2).reshape(-1, 2)
vert_cover_fitness1 = [1 if dna[edge[0]] or dna[edge[1]] else -num_edges for edge in edges]
vert_cover_fitness2 = np.full([num_edges], -num_edges)  # start with the penalty everywhere
mask = (dna[edges[:, 0]] | dna[edges[:, 1]]).astype(bool)  # edges with at least one selected endpoint
vert_cover_fitness2[mask] = 1.0  # those edges score 1
print((vert_cover_fitness1 == vert_cover_fitness2).all()) # this shows it's correct
Here is the timeit code used to measure the speedup.
import timeit
setup = """
import numpy as np
num_vertices = 1000
num_edges = 500
dna = np.random.choice([0, 1], num_vertices)
edges = np.random.randint(0, num_vertices, num_edges*2).reshape(-1, 2)
"""
python_loop = "[1 if dna[edge[0]] or dna[edge[1]] else -num_edges for edge in edges]"
print(timeit.timeit(python_loop, setup, number=1000))
vectorised="""
vert_cover_fitness2 = np.full([num_edges], -num_edges)
mask = (dna[edges[:, 0]] | dna[edges[:, 1]]).astype(bool)
vert_cover_fitness2[mask] = 1.0
"""
print(timeit.timeit(vectorised, setup, number=1000))
# prints:
# 0.375906624016352
# 0.012783741112798452
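As a side note, the masked assignment can also be written as a single np.where expression; a minimal sketch, assuming the same dna, edges and num_edges as above:
import numpy as np
# 1 where at least one endpoint of the edge is selected, -num_edges otherwise
vert_cover_fitness3 = np.where(dna[edges[:, 0]] | dna[edges[:, 1]], 1, -num_edges)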

Related

Fast way to iterate over sparse numpy array but only where elements are 1

I have an array representing a light source in space made as such:
source2D = np.zeros((256, 256))
with some number of the pixels set to 1. An example I have is a point source, which is generated by:
source2D[126:128, 126:128] = 1
I am running a Monte Carlo simulation which shoots a ray from each part of the array where the value is 1. Currently I am iterating over the entire array, but I would save a lot of time by only picking out the elements where the array is 1 and iterating over them. I should add that this function should accept a generic 256x256 array where any elements could be set to 1, so cropping the array is not an option. What is the fastest way to do this? I am also using TensorFlow, so an implementation using that would also be an option.
Right now my code looks something like this:
while pc < 1000000:
    pc += 1
    # Randomize x and y as coordinates on source
    x = np.random.randint(0, source2D.shape[0])  # 0 to 255 for this example
    y = np.random.randint(0, source2D.shape[1])  # 0 to 255 for this example
    # Shoot raycast from x,y to point on detector
Solved with hpaulj's comment:
source2D = np.zeros((256, 256))  # Testing with point source
source2D[126:128, 126:128] = 1
nonzero_entries = np.nonzero(source2D)
# Pick one random (i, j) pair from the nonzero locations; sampling the row and
# column indices independently would only be valid for a rectangular source.
idx = np.random.randint(len(nonzero_entries[0]))
i = nonzero_entries[0][idx]
j = nonzero_entries[1][idx]
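If the simulation needs many rays per frame, you can also draw all the source positions at once instead of one per loop iteration; a minimal sketch, assuming the same source2D and a hypothetical n_rays count:
import numpy as np

n_rays = 1000000                                    # hypothetical number of rays
src_i, src_j = np.nonzero(source2D)                 # coordinates of all pixels equal to 1
picks = np.random.randint(len(src_i), size=n_rays)  # one random source pixel per ray
xs, ys = src_i[picks], src_j[picks]                 # starting coordinates for every ray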

How to remove overlapping blocks from numpy array?

I'm using cv2.goodFeaturesToTrack function to find feature points in an image. The end goal is to extract square blocks of certain size, with feature points being the centers of those blocks.
However, lots of the feature points are close to each other, so the blocks are overlapping, which is not what I want.
This is an example of all feature points (centers):
array([[3536., 1419.],
[2976., 1024.],
[3504., 1400.],
[3574., 1505.],
[3672., 1453.],
[3671., 1442.],
[3489., 1429.],
[3108., 737.]])
Let's say I want to find the first n blocks with a blockRadius = 400 which are not overlapping. Any ideas on how to achieve this?
You could get close with scipy.spatial.KDTree, though it doesn't directly support querying blocks that consist of varying numbers of points. It can, however, be used in conjunction with another library, python-igraph, which allows finding connected components of close points quickly:
from scipy.spatial import KDTree
import igraph as ig
data = np.array([[3536., 1419.],
[2976., 1024.],
[3504., 1400.],
[3574., 1505.],
[3672., 1453.],
[3671., 1442.],
[3489., 1429.],
[3108., 737.]])
edges1 = KDTree(data[:,:1]).query_pairs(r=400)
edges2 = KDTree(data[:,1:]).query_pairs(r=400)
g = ig.Graph(n = len(data), edges=edges1 & edges2)
i = g.clusters()
The result i is a clustering: sequences of point indices stored in igraph's internal clustering type. Here's a quick preview:
>>> print(i)
Clustering with 8 elements and 2 clusters
[0] 0, 2, 3, 4, 5, 6
[1] 1, 7
>>> pal = ig.drawing.colors.ClusterColoringPalette(len(i)) #number of colors used
color = pal.get_many(i.membership) #list of color tags
ig.plot(g, bbox = (200, 100), layout=g.layout('circle'), vertex_label=g.vs.indices,
vertex_color = color, vertex_size = 12, vertex_label_size = 8)
Example of usage:
>>> [data[n] for n in i] #or list(i)
[array([[3536., 1419.],
[3504., 1400.],
[3574., 1505.],
[3672., 1453.],
[3671., 1442.],
[3489., 1429.]]),
array([[2976., 1024.],
[3108., 737.]])]
Remark: this method works with pairs of close points instead of an n*n distance matrix, which is more memory efficient in some cases.
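To actually pick non-overlapping blocks from this, one option is to keep a single representative point per cluster; a minimal sketch, assuming the data array and the clustering i from above:
import numpy as np

# keep the first point of each connected cluster as a block centre;
# points in different clusters are separated by more than 400 in x or in y
representatives = np.array([data[cluster[0]] for cluster in i])
print(representatives)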
You'll need something iterative to do that, as recurrent dropouts like this aren't vectorizable. Something like this will work, I think
from scipy.spatial.distance import pdist, squareform
c = np.array([[3536., 1419.],
[2976., 1024.],
[3504., 1400.],
[3574., 1505.],
[3672., 1453.],
[3671., 1442.],
[3489., 1429.],
[3108., 737.]])
dists = squareform(pdist(c, metric = 'chebyshev')) # distance matrix, chebyshev here since you seem to want blocks
indices = np.arange(c.shape[0]) # indices that haven't been dropped (all to start)
out = [0] # always want the first index
while True:
    try:
        indices = indices[dists[indices[0], indices] > 400]  # drop indices that are inside the threshold
        out.append(indices[0])  # add the next index that hasn't been dropped to the output
    except IndexError:
        break  # once you run out of indices, you'll get an IndexError and you're done
print(out)
[0, 1]
Let's try with a whole bunch of points:
np.random.seed(42)
c = np.random.rand(10000, 2) * 800
dists = squareform(pdist(c, metric = 'chebyshev')) # distance matrix, chebyshev here since you seem to want squares
indices = np.arange(c.shape[0]) # indices that haven't been dropped (all to start)
out = [0] # always want the first index
while True:
    try:
        indices = indices[dists[indices[0], indices] > 400]  # drop indices that are inside the threshold
        out.append(indices[0])  # add the next index that hasn't been dropped to the output
    except IndexError:
        break  # once you run out of indices, you'll get an IndexError and you're done
print(out, pdist(c[out], metric = 'chebyshev'))
[0, 2, 6, 17] [635.77582886 590.70015659 472.87353138 541.13920029 647.69071411
476.84658995]
So, 4 points (which makes sense, since four 400x400 blocks tile an 800x800 space), mostly low indices (17 << 10000), and the distance between kept points is always > 400.
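If the full n*n distance matrix is too big to hold in memory, the same greedy selection can be driven by a k-d tree instead; a minimal sketch, assuming the same random c and a 400-pixel Chebyshev radius:
from scipy.spatial import cKDTree
import numpy as np

np.random.seed(42)
c = np.random.rand(10000, 2) * 800       # same random test points as above

tree = cKDTree(c)
alive = np.ones(len(c), dtype=bool)      # points that haven't been dropped yet
out = []
for idx in range(len(c)):
    if not alive[idx]:
        continue
    out.append(idx)                      # keep this point
    # drop everything within Chebyshev distance 400 of it (including itself)
    for j in tree.query_ball_point(c[idx], r=400, p=np.inf):
        alive[j] = False
print(out)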

Is there a way to find polygons in given points?

I'm currently working on a way to find rectangles/polygons in up to 15 given points (Image below).
Given Points
My goal is to find polygons in that point array, like the ones I marked in the image below. The polygons are rectangles in the real world, but they are distorted a bit, which is why they can look like general polygons or other shapes. I need to find the best rectangle/polygon.
My idea was to check all connections between the points, but the number of combinations is too big for that to run in a reasonable time.
Does anyone have an idea how to solve this? I researched on the web and found the k-nearest neighbors algorithm in sklearn for Python, but I don't have enough experience to know whether this is the right way to solve it, or how to do so. Maybe I'll also need a method to filter out some of the outliers to make it even easier for the algorithm to find the right corner points of the polygon.
The code snippet below splits the given point string into separate arrays, the array coordinatesOnly contains just the x and y values of the points.
Many thanks for your help.
Polygon in Given Points
import math
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.neighbors import NearestNeighbors
millis = round(int(time.time())) / 1000
####input String
print("2D to 3D convert")
resultString = "0,487.50,399.46,176.84,99.99;1,485.93,423.43,-4.01,95.43;2,380.53,433.28,1.52,94.90;3,454.47,397.68,177.07,90.63;4,490.20,404.10,-6.17,89.90;5,623.56,430.52,-176.09,89.00;6,394.66,385.44,90.22,87.74;7,625.61,416.77,-177.95,87.02;8,597.21,591.66,-91.04,86.49;9,374.03,540.89,-11.20,85.77;10,600.51,552.91,178.29,85.52;11,605.29,530.78,-179.89,85.34;12,583.73,653.92,-82.39,84.42;13,483.56,449.58,-91.12,83.37;14,379.01,451.62,-6.21,81.51"
resultString = resultString.split(";")
resultStringSplitted = list()
coordinatesOnly = list()
for i in range(len(resultString)):
    resultStringSplitted.append(resultString[i].split(","))
    newList = ((float(resultString[i].split(",")[1]), float(resultString[i].split(",")[2])))
    coordinatesOnly.append(newList)
    for j in range(len(resultStringSplitted[i])):
        resultStringSplitted[i][j] = float(resultStringSplitted[i][j])
#Check if score is valid
validScoreList = list()
for i in range(len(resultStringSplitted)):
    if resultStringSplitted[i][len(resultStringSplitted[i])-1] != 0:
        validScoreList.append(resultStringSplitted[i])
resultStringSplitted = validScoreList
#Result String array contains all 2D results
# [Point Number, X Coordinate, Y Coordinate, Angle, Point Score]
for i in range(len(resultStringSplitted)):
    plt.scatter(resultStringSplitted[i][1], resultStringSplitted[i][2])
plt.show(block=True)
Since you mentioned that you can have a maximum of 15 points, I suggest to check all possible combinations of 4 points and keep all rectangles that are close enough to perfect rectangles. For 15 points, it is "only" 15*14*13*12=32760 potential rectangles.
import math
import itertools
import numpy as np
coordinatesOnly = ((0,0),(0,1),(1,0),(1,1),(2,0),(2,1),(1,3)) # Test data
rectangles = []
# Returns True if l0 and l1 are within 10% deviation
def isValid(l0, l1):
    if l0 == 0 or l1 == 0:
        return False
    return abs(max(l0, l1) / min(l0, l1) - 1) < 0.1

for p in itertools.combinations(np.array(coordinatesOnly), 4):
    for r in itertools.permutations(p, 4):
        l01 = np.linalg.norm(r[1] - r[0])  # Side
        l12 = np.linalg.norm(r[2] - r[1])  # Side
        l23 = np.linalg.norm(r[3] - r[2])  # Side
        l30 = np.linalg.norm(r[0] - r[3])  # Side
        l02 = np.linalg.norm(r[2] - r[0])  # Diagonal
        l13 = np.linalg.norm(r[3] - r[1])  # Diagonal
        areSidesEqual = isValid(l01, l23) and isValid(l12, l30)
        isDiag1Valid = isValid(math.sqrt(l01 * l01 + l30 * l30), l13)  # Pythagoras
        isDiag2Valid = isValid(math.sqrt(l01 * l01 + l12 * l12), l02)  # Pythagoras
        if areSidesEqual and isDiag1Valid and isDiag2Valid:
            rectangles.append(r)
            break
print(rectangles)
It takes about 1 second to run on 15 points on my computer. It really depends on what your requirements for computation time are, i.e., real time, interactive time, or "I just don't want to spend days waiting for the answer" time.
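If you want the single best rectangle rather than every acceptable one, you could rank the candidates by how far they deviate from a perfect rectangle; a minimal sketch, assuming the rectangles list built above (rectangle_error is a hypothetical helper, not part of the original answer):
import math
import numpy as np

# Hypothetical scoring helper: 0 means a perfect rectangle, larger means more distorted
def rectangle_error(r):
    l01 = np.linalg.norm(r[1] - r[0])
    l12 = np.linalg.norm(r[2] - r[1])
    l23 = np.linalg.norm(r[3] - r[2])
    l30 = np.linalg.norm(r[0] - r[3])
    l02 = np.linalg.norm(r[2] - r[0])
    l13 = np.linalg.norm(r[3] - r[1])
    # penalize unequal opposite sides and diagonals that don't match Pythagoras
    return (abs(l01 - l23) + abs(l12 - l30)
            + abs(math.hypot(l01, l30) - l13)
            + abs(math.hypot(l01, l12) - l02))

best = min(rectangles, key=rectangle_error) if rectangles else None
print(best)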

Sum of difference of squares between each combination of rows of 17,000 by 300 matrix

OK, so I have a matrix with 17000 rows (examples) and 300 columns (features). I want to compute basically the Euclidean distance between each possible pair of rows, i.e. the sum of the squared differences for each possible pair of rows.
Obviously it's a lot, and IPython, while not completely crashing my laptop, says "(busy)" for a while; then I can't run anything anymore and it certainly seems to have given up, even though I can move my mouse and everything.
Is there any way to make this work? Here's the function I wrote. I used numpy everywhere I could.
What I'm doing is storing the differences in a difference matrix for each possible pair. I'm aware that the lower triangle of the matrix equals the upper triangle, but exploiting that would only save half the computation time (better than nothing, but not a game changer, I think).
EDIT: I just tried using scipy.spatial.distance.pdist, but it's been running for a good minute now with no end in sight. Is there a better way? I should also mention that I have NaN values in there... but that's not a problem for numpy apparently.
features = np.array(dataframe)
distances = np.zeros((17000, 17000))
def sum_diff():
    for i in range(17000):
        for j in range(17000):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares
You could always divide your computation time by 2, noticing that d(i, i) = 0 and d(i, j) = d(j, i).
But have you had a look at sklearn.metrics.pairwise.pairwise_distances() (in v 0.18, see the doc here) ?
You would use it as:
from sklearn.metrics import pairwise
import numpy as np
a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
pairwise.pairwise_distances(a)
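If you specifically want the sum of squared differences (squared Euclidean distance) rather than the Euclidean distance itself, the metric can be changed; a minimal sketch using scipy's pdist with the 'sqeuclidean' metric (stand-in array sizes, not your real data):
import numpy as np
from scipy.spatial.distance import pdist, squareform

features = np.random.rand(1000, 300)                    # stand-in for the 17000 x 300 data
distances = squareform(pdist(features, 'sqeuclidean'))  # full matrix of pairwise squared distances
Any NaN in a row will propagate into that row's distances, so you may want to impute or mask missing values first.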
The big thing with numpy is to avoid using loops and to let it do its magic with the vectorised operations, so there are a few basic improvements that will save you some computation time:
import numpy as np
import timeit
#I reduced the problem size to 1000*300 to keep the timing in reasonable range
n=1000
features = np.random.rand(n,300)
distances = np.zeros((n,n))
def sum_diff():
    for i in range(n):
        for j in range(n):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares

# Here I removed the unnecessary copy induced by calling np.array
# -> some improvement
def sum_diff_v0():
    for i in range(n):
        for j in range(n):
            diff = features[i] - features[j]
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares

# Collapsing of the statements -> no improvement
def sum_diff_v1():
    for i in range(n):
        for j in range(n):
            distances[i][j] = np.sum(np.square(features[i] - features[j]))

# Using broadcasting and vectorized operations -> big improvement
def sum_diff_v2():
    for i in range(n):
        distances[i] = np.sum(np.square(features[i] - features), axis=1)

# Computing only half the distances -> 1/2 computation time
def sum_diff_v3():
    for i in range(n):
        distances[i][i+1:] = np.sum(np.square(features[i] - features[i+1:]), axis=1)
    distances[:] = distances + distances.T
print("original :",timeit.timeit(sum_diff, number=10))
print("v0 :",timeit.timeit(sum_diff_v0, number=10))
print("v1 :",timeit.timeit(sum_diff_v1, number=10))
print("v2 :",timeit.timeit(sum_diff_v2, number=10))
print("v3 :",timeit.timeit(sum_diff_v3, number=10))
Edit : For completeness I also timed Camilleri's solution that is much faster:
from sklearn.metrics import pairwise
def Camilleri_solution():
    distances = pairwise.pairwise_distances(features)
Timing results (in seconds, function run 10 times with 1000*300 input):
original : 138.36921879299916
v0 : 111.39915344800102
v1 : 117.7582511530054
v2 : 23.702392491002684
v3 : 9.712442981006461
Camilleri's : 0.6131987979897531
So as you can see, we can easily gain an order of magnitude by using the proper numpy syntax. Note that with only a fraction of the data (1000 rows instead of 17000) the function runs in about one second, so I would expect the whole thing to take tens of minutes, as the script scales as N^2.
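For completeness, the remaining loop over i in sum_diff_v2/v3 can also be removed entirely using the expansion (a-b)^2 = a^2 + b^2 - 2ab; a minimal sketch (not from the answer above), keeping in mind that a 17000 x 17000 float64 result needs roughly 2.3 GB of memory:
import numpy as np

features = np.random.rand(1000, 300)             # stand-in data
sq = np.einsum('ij,ij->i', features, features)   # squared norm of every row
# squared Euclidean distances: ||a||^2 + ||b||^2 - 2 a.b
distances = sq[:, None] + sq[None, :] - 2.0 * (features @ features.T)
np.maximum(distances, 0, out=distances)          # clip tiny negatives from round-off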

Finding index of nearest point in numpy arrays of x and y coordinates

I have two 2d numpy arrays: x_array contains positional information in the x-direction, y_array contains positions in the y-direction.
I then have a long list of x,y points.
For each point in the list, I need to find the array index of the location (specified in the arrays) which is closest to that point.
I have naively produced some code which works, based on this question:
Find nearest value in numpy array
i.e.
import time
import numpy
def find_index_of_nearest_xy(y_array, x_array, y_point, x_point):
    distance = (y_array-y_point)**2 + (x_array-x_point)**2
    idy, idx = numpy.where(distance==distance.min())
    return idy[0], idx[0]

def do_all(y_array, x_array, points):
    store = []
    for i in xrange(points.shape[1]):
        store.append(find_index_of_nearest_xy(y_array, x_array, points[0,i], points[1,i]))
    return store
# Create some dummy data
y_array = numpy.random.random(10000).reshape(100,100)
x_array = numpy.random.random(10000).reshape(100,100)
points = numpy.random.random(10000).reshape(2,5000)
# Time how long it takes to run
start = time.time()
results = do_all(y_array, x_array, points)
end = time.time()
print 'Completed in: ',end-start
I'm doing this over a large dataset and would really like to speed it up a bit.
Can anyone optimize this?
Thanks.
UPDATE: SOLUTION following suggestions by #silvado and #justin (below)
# Shoe-horn existing data for entry into KDTree routines
import scipy.spatial
combined_x_y_arrays = numpy.dstack([y_array.ravel(), x_array.ravel()])[0]
points_list = list(points.transpose())

def do_kdtree(combined_x_y_arrays, points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes
start = time.time()
results2 = do_kdtree(combined_x_y_arrays,points_list)
end = time.time()
print 'Completed in: ',end-start
This code above sped up my code (searching for 5000 points in 100x100 matrices) by 100 times. Interestingly, using scipy.spatial.KDTree (instead of scipy.spatial.cKDTree) gave comparable timings to my naive solution, so it is definitely worth using the cKDTree version...
Here is a scipy.spatial.KDTree example
In [1]: from scipy import spatial
In [2]: import numpy as np
In [3]: A = np.random.random((10,2))*100
In [4]: A
Out[4]:
array([[ 68.83402637, 38.07632221],
[ 76.84704074, 24.9395109 ],
[ 16.26715795, 98.52763827],
[ 70.99411985, 67.31740151],
[ 71.72452181, 24.13516764],
[ 17.22707611, 20.65425362],
[ 43.85122458, 21.50624882],
[ 76.71987125, 44.95031274],
[ 63.77341073, 78.87417774],
[ 8.45828909, 30.18426696]])
In [5]: pt = [6, 30] # <-- the point to find
In [6]: A[spatial.KDTree(A).query(pt)[1]] # <-- the nearest point
Out[6]: array([ 8.45828909, 30.18426696])
#how it works!
In [7]: distance,index = spatial.KDTree(A).query(pt)
In [8]: distance # <-- The distances to the nearest neighbors
Out[8]: 2.4651855048258393
In [9]: index # <-- The locations of the neighbors
Out[9]: 9
#then
In [10]: A[index]
Out[10]: array([ 8.45828909, 30.18426696])
scipy.spatial also has a k-d tree implementation: scipy.spatial.KDTree.
The approach is generally to first use the point data to build up a k-d tree. The computational complexity of that is on the order of N log N, where N is the number of data points. Range queries and nearest neighbour searches can then be done with log N complexity. This is much more efficient than simply cycling through all points (complexity N).
Thus, if you have repeated range or nearest neighbor queries, a k-d tree is highly recommended.
If you can massage your data into the right format, a fast way to go is to use the methods in scipy.spatial.distance:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
In particular pdist and cdist provide fast ways to calculate pairwise distances.
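For example, a minimal cdist sketch, assuming the grid coordinates from the question are flattened into an (N, 2) array and the queries into an (M, 2) array (the names here are illustrative):
import numpy as np
from scipy.spatial.distance import cdist

grid = np.column_stack([y_array.ravel(), x_array.ravel()])   # (N, 2) grid coordinates
queries = points.T                                           # (M, 2) query points
dists = cdist(queries, grid)        # (M, N) pairwise distances
nearest = dists.argmin(axis=1)      # index of the closest grid point for each query
Note this builds the full M x N distance matrix in memory, so it only pays off when that fits comfortably; otherwise the KDTree route above scales better.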
Search methods have two phases:
build a search structure, e.g. a KDTree, from the npt data points (your x y)
lookup nq query points.
Different methods have different build times, and different query times.
Your choice will depend a lot on npt and nq:
scipy cdist
has build time 0, but query time ~ npt * nq.
KDTree build times are complicated,
lookups are very fast, ~ ln npt * nq.
On a regular (Manhattan) grid, you can do much better: see (ahem)
find-nearest-value-in-numpy-array.
A little testbench: building a KDTree of 5000 × 5000 2d points takes about 30 seconds, then queries take microseconds; scipy cdist on 25 million × 20 points (all pairs, 4G) takes about 5 seconds, on my old iMac.
I have been trying to follow along with this. I'm new to Jupyter Notebooks, Python and the various tools being discussed here, but I have managed to get some way down the road I'm travelling.
import pandas as pd
import scipy.spatial

BURoute = pd.read_csv('C:/Users/andre/BUKP_1m.csv', header=None)
NGEPRoute = pd.read_csv('c:/Users/andre/N1-06.csv', header=None)
I create a combined XY array from my BURoute dataframe
combined_x_y_arrays = BURoute.iloc[:,[0,1]]
And I create the points with the following command
points = NGEPRoute.iloc[:,[0,1]]
I then do the KDTree magic
def do_kdtree(combined_x_y_arrays, points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes

results2 = do_kdtree(combined_x_y_arrays, points)
This gives me an array of the indexes. I'm now trying to figure out how to calculate the distance between the points and the indexed points in the results array.
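For what it's worth, cKDTree.query already computes those distances; a minimal tweak of the function above returns them alongside the indexes:
def do_kdtree(combined_x_y_arrays, points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return dist, indexes  # dist[i] is the distance from points[i] to its nearest neighbour

distances, results2 = do_kdtree(combined_x_y_arrays, points)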
def find_nearest_vector(self, arrList, value):
    y, x = value
    offset = 10
    x_Array = []
    y_Array = []
    # split the list of (y, x) points into separate coordinate arrays
    for p in arrList:
        x_Array.append(p[1])
        y_Array.append(p[0])
    x_Array = np.array(x_Array)
    y_Array = np.array(y_Array)
    difference_array_x = np.absolute(x_Array - x)
    difference_array_y = np.absolute(y_Array - y)
    # indices whose x and y are both within `offset` of the target point
    index_x = np.where(difference_array_x < offset)[0]
    index_y = np.where(difference_array_y < offset)[0]
    index = np.intersect1d(index_x, index_y, assume_unique=True)
    nearestCootdinate = (arrList[index][0][0], arrList[index][0][1])
    return nearestCootdinate
