I have a pd DataFrame that has a lot of points in the XY plane. The dataframe consists of the points' x and y coordinates. I want to check every point's distance to all other points using the Pythagorean theorem and count the number of points within a certain distance of that point.
import math
import random
import pandas as pd

def distance(x1, y1, x2, y2):
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)

df = pd.DataFrame({'X': [random.randint(1, 100) for i in range(100)], 'Y': [random.randint(1, 100) for i in range(100)]})
I realise that I can loop over the dataframe, but that is not best practice and it takes too long. Is there a way I can optimize this process?
Ultimately I'd want another column in the dataframe that stores the number of points in the dataframe that are within a certain distance of each point.
EDIT:
Another thing I am trying to do is look for arbitrary points (or zones) in the XY plane with the most number of points within a given radius. What I basically mean is I want to also look at positions in the plane that are not necessarily points in the dataframe but are still within the limits of the plane.
If you want your code to run fast with pandas and numpy, you should get used to writing functions that look like they only work with numbers but actually accept numpy arrays/pandas Series as well. E.g. if you want to find all points in your df at distance r or less from the point (cx, cy), you could do that like so:
def close_to_my_point(x, y):
    return (x - cx)**2 + (y - cy)**2 <= r**2
close_to_my_point(df["X"],df["Y"])
This gives you a Series of booleans indicating whether the point at each position in the dataframe is close to (cx, cy) or not. Note that when summing over True/False values, True behaves like 1 and False like 0, so sum(close_to_my_point(df["X"],df["Y"])) does what you want for one point.
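For instance, a minimal check (pts, and the values of cx, cy and r, are made up purely for illustration):
import pandas as pd

cx, cy, r = 50, 50, 20  # arbitrary example values
pts = pd.DataFrame({"X": [10, 45, 60], "Y": [10, 55, 40]})

mask = close_to_my_point(pts["X"], pts["Y"])
print(mask)        # boolean Series, one entry per row
print(mask.sum())  # number of points within r of (cx, cy) -- here 2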
For functions that can't be applied to Series by default, there is np.vectorize to change that. Putting all that together, you get something that can calculate the number of points within some distance quite quickly:
def disk_equation(cx, cy, r):
    return lambda x, y: (x - cx)**2 + (y - cy)**2 <= r**2
points_in_distance = lambda x,y: sum(disk_equation(x,y,20)(df["X"],df["Y"]))
df["points_closer_than_20"] = np.vectorize(points_in_distance)(df["X"],df["Y"])
There is a whole set of tools for pairwise distance calculations included in SciPy's scipy.spatial module.
The simplest one to use would be a distance_matrix that calculates pairwise distances and returns those as a matrix. First you need to convert your dataframe into a properly formatted numpy array:
import random
from scipy.spatial import distance_matrix
import pandas as pd
import numpy as np
df = pd.DataFrame({'X': [random.randint(1, 100) for i in range(100)], 'Y': [random.randint(1, 100) for i in range(100)]})
foo = np.array([(x,y) for x, y in zip(df.X, df.Y)])
baz = distance_matrix(foo, foo)
Here we're using foo twice since we want all pairwise distances to all points in the array.
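From here, producing the neighbour-count column asked for in the question is just a matter of thresholding the matrix (a possible final step, assuming the same cutoff of 20 as above; note the zero diagonal means every point counts itself):
df["points_closer_than_20"] = (baz <= 20).sum(axis=1)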
I have generated a graph using a basic function:
plt.plot(tm, o1)
tm is a list of all x coordinates and o1 is a list of all y coordinates.
NOTE: there is no specific function such as y = f(x); rather, a certain y value remains constant for a given range of x (it is a step function).
My question is how to integrate this function, either using the matplotlib figure or using the lists (tm and o1)
The integral corresponds to computing the area under the curve.
The easiest way to compute (or approximate) the integral numerically is the rectangle rule, which approximates the area under the curve by summing the areas of rectangles (see https://en.wikipedia.org/wiki/Numerical_integration#Quadrature_rules_based_on_interpolating_functions).
Practically, in your case it is quite straightforward since it is a step function.
First, I recommend using numpy arrays instead of lists (they are more handy for numerical computing):
import matplotlib.pyplot as plt
import numpy as np
x = np.array([0,1,3,4,6,7,8,11,13,15])
y = np.array([8,5,2,2,2,5,6,5,9,9])
plt.plot(x,y)
Then, we compute the width of rectangles using np.diff():
w = np.diff(x)
Then, the height of the same rectangles (multiple possibilities exist):
h = y[:-1]
Here I chose the first of each two successive y values, so the top-left corner of each rectangle lies on the curve. You could instead choose the mean of each two successive y values, h = (y[1:]+y[:-1])/2, in which case the middle of the top of each rectangle coincides with the curve.
Then you just need to multiply and sum:
area = (w*h).sum()
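As a cross-check (my addition), numpy's built-in trapezoidal rule is exactly equivalent to the mean-height rectangle choice mentioned above:
area_trapz = np.trapz(y, x)  # same as (w * (y[1:] + y[:-1]) / 2).sum()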
I have carried out some clustering analysis on some data X and have arrived at both the labels y and the centroids c. Now, I'm trying to calculate the distance between X and their assigned cluster's centroid c. This is easy when we have a small number of points:
import numpy as np
# 10 random points in 3D space
X = np.random.rand(10,3)
# define the number of clusters, say 3
clusters = 3
# give each point a random label
# (in the real code this is found using KMeans, for example)
y = np.asarray([np.random.randint(0,clusters) for i in range(10)]).reshape(-1,1)
# randomly assign location of centroids
# (in the real code this is found using KMeans, for example)
c = np.random.rand(clusters,3)
# calculate distances
distances = []
for i in range(len(X)):
    distances.append(np.linalg.norm(X[i] - c[y[i][0]]))
Unfortunately, the actual data has many more rows. Is there a way to vectorise this somehow (instead of using a for loop)? I can't seem to get my head around the mapping.
Thanks to numpy's array indexing, you can actually turn your for loop into a one-liner and avoid explicit looping altogether:
distances = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)
will do the same thing as your original for loop.
EDIT: Thanks @Kris, I forgot the axis keyword, and since I didn't specify it, numpy automatically computed the norm of the entire flattened matrix, not just along the rows (axis 1). I've updated it now, and it should return an array of distances for each point. Also, einsum was suggested by @Kris for their specific application.
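For what it's worth, since y here is a column vector of integer labels, plain fancy indexing gives an equivalent one-liner without einsum (an alternative sketch, not the original answer):
distances = np.linalg.norm(X - c[y.ravel()], axis=1)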
I have a large set of data points in a pandas dataframe, with columns containing x/y coordinates for these points. I would like to identify all points that are within a certain distance "d" of any other point in the dataframe.
I first tried to do this using 'for' loops, checking the distance between the first point and all other points, then the distance between the second point and all others, etc. Clearly this is not very efficient for a large data set.
Recent searching online suggests that the best way might be to use scipy.spatial.ckdtree, but I can't figure out how to implement this. Most examples I see check against a single x/y location, whereas I want to check all vs all. Is anyone able to provide suggestions or examples, starting from an array of x/y coordinates taken from my dataframe as follows:
points = df_sub.loc[:,['FRONT_X','FRONT_Y']].values
That looks something like this:
[[19091199.587 -544406.722]
[19091161.475 -544452.426]
[19091163.893 -544464.899]
...
[19089150.04 -544747.196]
[19089774.213 -544729.005]
[19089690.516 -545165.489]]
The ideal output would be the IDs of all pairs of points that are within a cutoff distance "d" of each other.
scipy.spatial has many good functions for handling distance computations.
Let's create an array pos of 1000 (x, y) points, similar to what you have in your dataframe.
import numpy as np
from scipy.spatial import distance_matrix
num = 1000
pos = np.random.uniform(size=(num, 2))
# Distance threshold
d = 0.25
From here we shall use the distance_matrix function to calculate pairwise distances. Then we use np.argwhere to find the indices of all the pairwise distances less than some threshold d.
pair_dist = distance_matrix(pos, pos)
ids = np.argwhere(pair_dist < d)
ids now contains the "IDs of all pairs of points that are within a cutoff distance d of each other", as you desired.
Shortcomings
Of course, this method has the shortcoming that we always compute the distance between each point and itself (returning a distance of 0), which will always be less than our threshold d. However, we can exclude self-comparisons from our ids with the following fudge:
pair_dist[np.r_[:num], np.r_[:num]] = np.inf
ids = np.argwhere(pair_dist < d)
Another shortcoming is that we compute the full symmetric pairwise distance matrix when we only really need the upper or lower triangular pairwise distance matrix. However, unless this computation really is a bottleneck in your code, I wouldn't worry too much about this.
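Since the question mentions scipy.spatial.cKDTree: a tree-based query addresses both shortcomings at once, returning each unordered pair only once and never materialising the full n x n matrix. A sketch using the same pos and d as above:
from scipy.spatial import cKDTree

tree = cKDTree(pos)
pairs = tree.query_pairs(r=d)   # set of (i, j) index pairs with i < j
ids = np.array(sorted(pairs))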
I currently have a Python script that reads in a 3-column text file containing x and y coordinates for a walker and the time they have been walking.
I have read in this data and allocated it in numpy arrays as shown in the code below:
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt("info.txt")
x = data[:,0]
y = data[:,1]
t = data[:,2]
The file is in the following format (x, y, t):
5907364.2371 -447070.881709 2193094
5907338.306978 -447058.019176 2193116
5907317.260891 -447042.192668 2193130
I now want to find the distance traveled as a function of time by the walker. One way I can think of doing that is by summing the differences in x coordinates and the differences in y coordinates in a loop. However, this seems a very long-winded method, and I think it could be solved with some kind of numerical integration. Does anyone have any ideas of what I could do?
To compute the distance "along the way", you must first obtain the distance of each step.
This can be obtained, component-wise, by the indexing dx = x[1:] - x[:-1]. The distance per step is then the square root of dx**2 + dy**2. Note that the length of this array is one less than the number of positions, as there is one less interval than there are points. This can be completed by assigning the distance 0 to the first time datum; that is the role of the "concatenate" line below.
There is no numerical integration here, but a cumulative sum. To perform numerical integration, you would need equations of motion (for instance).
Extra change: I use np.loadtxt with the unpack=True argument to save a few lines.
import numpy as np
import matplotlib.pyplot as plt
x, y, t = np.loadtxt("info.txt", unpack=True)
dx = x[1:]-x[:-1]
dy = y[1:]-y[:-1]
step_size = np.sqrt(dx**2+dy**2)
cumulative_distance = np.concatenate(([0], np.cumsum(step_size)))
plt.plot(t, cumulative_distance)
plt.show()
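As a purely stylistic variant, np.diff and np.hypot compress the step computation into a single line:
step_size = np.hypot(np.diff(x), np.diff(y))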
There are several ways to obtain the Euclidean distance between two points:
Numpy:
import numpy as np
dist = np.linalg.norm(x - y)
dist1 = np.sqrt(np.sum((x - y)**2))
Scipy:
from scipy.spatial import distance
dist = distance.euclidean(x,y)
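Both approaches give the same result; a quick sanity check with two made-up points:
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
print(np.linalg.norm(x - y))      # 5.0
print(distance.euclidean(x, y))   # 5.0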
Typically, to get distance walked, you sum up smaller distances. Your walker probably isn't walking on a grid (that is, a step in x and then a step in y) but rather along diagonals (think Pythagorean theorem).
So, in Python, it might look like this...
distanceWalked = 0
# Sum the length of each step between successive points
for prev, curr in zip(listOfPoints, listOfPoints[1:]):
    dx, dy = curr[0] - prev[0], curr[1] - prev[1]
    distanceWalked += (dx**2 + dy**2)**.5
Where listOfPoints is something like [[0,0],[0,1],[0,2],[1,2],[2,2]]
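With that example list the walker takes four unit steps, so distanceWalked comes out to 4.0.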
Alternatively, you can use pandas.
import pandas as pd

# The sample file is whitespace-separated with no header row
df = pd.read_csv('info.txt', sep=r'\s+', names=['x', 'y', 't'])
# Length of each step between successive rows (the first step is NaN -> 0)
df['step'] = (df['x'].diff()**2 + df['y'].diff()**2)**.5
df['cumDist'] = df['step'].fillna(0).cumsum()
Now you'll have the cumulative distance over time in your dataframe.
I have a list of n polar coordinates, and a distance function which takes in two coordinates.
I want to create an n x n matrix which contains the pairwise distances under my function. I realize I probably need to use some form of vectorization with numpy but am not sure exactly how to do so.
A simple code segment is below for your reference:
import numpy as np
length = 10
coord_r = np.random.rand(length)*10
coord_alpha = np.random.rand(length)*np.pi
# Repeat vector to matrix form
coord_r_X = np.tile(coord_r, [length,1])
coord_r_Y = coord_r_X.T
coord_alpha_X = np.tile(coord_alpha, [length,1])
coord_alpha_Y = coord_alpha_X.T
matDistance = np.sqrt(coord_r_X**2 + coord_r_Y**2 - 2*coord_r_X*coord_r_Y*np.cos(coord_alpha_X - coord_alpha_Y))
print(matDistance)
You can use scipy.spatial.distance.pdist. However, if the distance you want to calculate is the Euclidean distance, you may be better off just converting your points to rectangular coordinates, since then pdist will do the calculations quite quickly using its builtin Euclidean distance.
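A sketch of that suggestion, reusing coord_r and coord_alpha from the question (the formula above is just the Euclidean distance written in polar form, so this should reproduce matDistance):
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Convert polar (r, alpha) to rectangular (x, y) coordinates
xy = np.column_stack((coord_r * np.cos(coord_alpha),
                      coord_r * np.sin(coord_alpha)))

# Condensed pairwise distances, expanded to a full n x n matrix
matDistance = squareform(pdist(xy))
print(matDistance)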