I currently have a Python script that reads in a three-column text file containing the x and y coordinates of a walker and the time they have been walking.
I have read in this data and stored it in NumPy arrays, as shown in the code below:
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt("info.txt", delimiter = ',')
x = data[:,0]
y = data[:,1]
t = data[:,2]
The file has the following format (x, y, t):
5907364.2371 -447070.881709 2193094
5907338.306978 -447058.019176 2193116
5907317.260891 -447042.192668 2193130
I now want to find the distance traveled as a function of time by the walker. One way I can think of doing that is by summing the differences in x coordinates and all the differences in y coordinates in a loop. However, this seems a very long-winded method, and I think it could be solved with a type of numerical integration. Does anyone have any ideas of what I could do?
To compute the distance "along the way", you must first obtain the distance of each step.
This can be obtained, component-wise, by the indexing dx = x[1:] - x[:-1]. The distance per step is then the square root of dx**2 + dy**2. Note that this array is one element shorter than the number of samples, as there is one fewer interval than there are points. This can be completed by assigning the distance 0 to the first time value; that is the role of the "concatenate" line below.
There is no numerical integration here, but a cumulative sum. To perform numerical integration, you would need equations of motion (for instance).
Extra change: I use np.loadtxt with the unpack=True argument to save a few lines.
import numpy as np
import matplotlib.pyplot as plt
x, y, t = np.loadtxt("info.txt", unpack=True)
dx = x[1:]-x[:-1]
dy = y[1:]-y[:-1]
step_size = np.sqrt(dx**2+dy**2)
cumulative_distance = np.concatenate(([0], np.cumsum(step_size)))
plt.plot(t, cumulative_distance)
plt.show()
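An equivalent, slightly more compact formulation of the same steps (a sketch using np.diff and np.hypot rather than the explicit slicing above):
step_size = np.hypot(np.diff(x), np.diff(y))                       # per-step Euclidean length
cumulative_distance = np.concatenate(([0], np.cumsum(step_size)))  # running total, one value per sample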
There are several ways to obtain the Euclidean distance between two points x and y (here x and y denote the position vectors of two points, not the coordinate columns from the question):
Numpy:
import numpy as np
dist = np.linalg.norm(x-y)
dist1 = np.sqrt(np.sum((x - y)**2))
Scipy:
from scipy.spatial import distance
dist = distance.euclidean(x,y)
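Applied to the walker data from the question, these helpers would be used on consecutive points rather than on the x and y columns themselves; a minimal sketch, assuming x and y have been loaded as shown earlier:
import numpy as np
from scipy.spatial import distance
# Pair up consecutive (x, y) samples and accumulate the step lengths.
points = np.column_stack((x, y))
steps = [distance.euclidean(p, q) for p, q in zip(points[:-1], points[1:])]
cumulative = np.concatenate(([0.0], np.cumsum(steps)))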
Typically, to get the distance walked, you sum up smaller distances. Your walker probably isn't walking on a grid (that is, a step in x and then a step in y) but rather along diagonals (think Pythagorean theorem).
So, in Python it might look like this...
distanceWalked = 0
for previous, current in zip(listOfPoints, listOfPoints[1:]):
    dx = current[0] - previous[0]
    dy = current[1] - previous[1]
    distanceWalked = distanceWalked + (dx**2 + dy**2)**.5
Where listOfPoints is something like [[0,0],[0,1],[0,2],[1,2],[2,2]]
Alternatively, you can use pandas.
import pandas as pd
df = pd.read_csv('info.txt', sep=r'\s+', names=['x', 'y', 't'])   # whitespace-separated, no header row
df['helpercol'] = (df['x'].diff()**2 + df['y'].diff()**2)**.5     # length of each step
df['cumDist'] = df['helpercol'].fillna(0).cumsum()
Now you'll have the cumulative distance per time in your dataframe.
Related
I have a pandas DataFrame that holds a lot of points in the XY plane. The dataframe consists of the points' x and y coordinates. I want to check every point's distance to all other points using the Pythagorean theorem and count the number of points within a certain distance of that point.
def distance(x1, y1, x2, y2):
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)

df = pd.DataFrame({'X': [random.randint(1,100) for i in range(100)], 'Y': [random.randint(1,100) for i in range(100)]})
I realise that I can loop over the dataframe, but that is not best practice and it takes too long. Is there a way I can optimize this process?
Ultimately I'd want another column in the dataframe that stores the number of points in the dataframe that are within a certain distance of each point.
EDIT:
Another thing I am trying to do is look for arbitrary points (or zones) in the XY plane with the most number of points within a given radius. What I basically mean is I want to also look at positions in the plane that are not necessarily points in the dataframe but are still within the limits of the plane.
If you want your code to run fast using pandas and numpy, you should get used to writing functions that look like they work on single numbers but that also accept numpy arrays or pandas Series. For example, if you want to find all points in your df within distance r of the point (cx, cy), you could do that like so:
def close_to_my_point(x, y):
    return (x - cx)**2 + (y - cy)**2 <= r**2

close_to_my_point(df["X"], df["Y"])
This gives you a Series of booleans indicating whether the point at each position in the dataframe is close to (cx, cy) or not. Note that when summing over True/False values, True behaves like 1 and False like 0, so sum(close_to_my_point(df["X"], df["Y"])) does what you want for one point.
For functions that can't be applied to Series by default, there is np.vectorize to change that. Putting all that together, you get something that can calculate the number of points within some distance quite quickly:
def disk_equation(cx, cy, r):
    return lambda x, y: (x - cx)**2 + (y - cy)**2 <= r**2

points_in_distance = lambda x, y: sum(disk_equation(x, y, 20)(df["X"], df["Y"]))
df["points_closer_than_20"] = np.vectorize(points_in_distance)(df["X"], df["Y"])
There is a whole set of tools for pairwise distance calculations included in scipy.spatial.
The simplest one to use is distance_matrix, which calculates pairwise distances and returns them as a matrix. First you need to convert your dataframe into a properly formatted numpy array:
import random
from scipy.spatial import distance_matrix
import pandas as pd
import numpy as np
df = pd.DataFrame({'X': [random.randint(1,100) for i in range(100)], 'Y': [random.randint(1,100) for i in range(100)]})
foo = np.array([(x,y) for x, y in zip(df.X, df.Y)])
baz = distance_matrix(foo, foo)
Here we're using foo twice since we want all pairwise distances to all points in the array.
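To get the count the original question asks for, you can then threshold the matrix and sum each row; a small sketch, assuming baz and df from the snippet above and a radius of 20 (subtracting 1 so a point does not count itself):
radius = 20
df["points_closer_than_20"] = (baz <= radius).sum(axis=1) - 1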
Given two sets of points in n-dimensional space, how can one map points from one set to the other, such that each point is only used once and the total euclidean distance between the pairs of points is minimized?
For example,
import matplotlib.pyplot as plt
import numpy as np
# create six points in 2d space; the first three belong to set "A" and the
# second three belong to set "B"
x = [1, 2, 3, 1.8, 1.9, 3.4]
y = [2, 3, 1, 2.6, 3.4, 0.4]
colors = ['red'] * 3 + ['blue'] * 3
plt.scatter(x, y, c=colors)
plt.show()
So in the example above, the goal would be to map each red point to a blue point such that each blue point is only used once and the sum of the distances between points is minimized.
I came across this question which helps to solve the first part of the problem -- computing the distances between all pairs of points across sets using the scipy.spatial.distance.cdist() function.
From there, I could probably test every permutation of single elements from each row, and find the minimum.
The application I have in mind involves a fairly small number of datapoints in 3-dimensional space, so the brute force approach might be fine, but I thought I would check to see if anyone knows of a more efficient or elegant solution first.
An example of assigning (mapping) the elements of one set of points to the elements of another set of points, such that the total Euclidean distance is minimized.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment
np.random.seed(100)
points1 = np.array([(x, y) for x in np.linspace(-1,1,7) for y in np.linspace(-1,1,7)])
N = points1.shape[0]
points2 = 2*np.random.rand(N,2)-1
C = cdist(points1, points2)
_, assignment = linear_sum_assignment(C)
plt.plot(points1[:,0], points1[:,1], 'bo', markersize=10)
plt.plot(points2[:,0], points2[:,1], 'rs', markersize=7)
for p in range(N):
    plt.plot([points1[p,0], points2[assignment[p],0]],
             [points1[p,1], points2[assignment[p],1]], 'k')
plt.xlim(-1.1, 1.1)
plt.ylim(-1.1, 1.1)
plt.gca().set_aspect('equal')
plt.show()
There's a known algorithm for this, the Hungarian Method for Assignment, which works in time O(n³).
In SciPy, you can find an implementation in scipy.optimize.linear_sum_assignment
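A minimal usage sketch of that function on a cost matrix built with cdist, using the red and blue points from the example above:
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

red = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]])
blue = np.array([[1.8, 2.6], [1.9, 3.4], [3.4, 0.4]])

cost = cdist(red, blue)                        # pairwise Euclidean distances
row_ind, col_ind = linear_sum_assignment(cost)
total = cost[row_ind, col_ind].sum()           # minimal total distance
print(list(zip(row_ind, col_ind)), total)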
Suppose I have a process where I push a button, and after a certain amount of time (from 1 to 30 minutes), an event occurs. I then run a very large number of trials, and record how long it takes the event to occur for each trial. This raw data is then reduced to a set of 30 data points where the x value is the number of minutes it took for the event to occur, and the y value is the percentage of trials which fell into that bucket. I do not have access to the original data.
How can I use this set of 30 points to identify an appropriate probability distribution which I can then use to generate representative random samples?
I feel like scipy.stats has all the tools I need built in, but for the life of me I can't figure out how to go about it. Any tips?
If you don't have any prior information about the underlying function that produced the data, I suggest using numpy.polyfit, which fits a polynomial of a given degree.
import matplotlib.pyplot as plt
import numpy as np
y = np.array([ 0.005995184, ...]) # your array
x = np.arange(len(y))
f = np.poly1d(np.polyfit(x, y, 10))
x_new = np.linspace(x[0], x[-1], 30)
y_new = f(x_new)
plt.plot(x,y,'o', x_new, y_new)
plt.xlim([x[0]-1, x[-1] + 1 ])
plt.show()
Here is an example for degree = 10.
In order to get a value at an arbitrary point from the fitted polynomial, you simply call:
f(13.5)
which in this case gives:
0.0206996531272
You can also use the histogram, i.e. the piecewise-uniform distribution, directly; then you get exactly the corresponding random numbers instead of an approximation.
The inverse CDF (ppf) is piecewise linear, and linear interpolation can be used to transform uniform random numbers appropriately.
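A minimal sketch of that approach using scipy.stats.rv_histogram, with the 30 bucket percentages listed in the answer below (bucket i covers minute i to i+1; rv_histogram normalizes the heights internally):
import numpy as np
from scipy import stats

# The 30 bucket percentages from the answer below, for minutes 1 through 30.
probs = np.array([
    0.005995184, 0.012209876, 0.028232119, 0.04711878, 0.087894128,
    0.116652421, 0.115370764, 0.12774159, 0.109731418, 0.079767439,
    0.068016186, 0.045287033, 0.033403796, 0.029145134, 0.018925806,
    0.013340493, 0.010087069, 0.007998098, 0.00984276, 0.004906083,
    0.004720561, 0.003186032, 0.003028522, 0.002942859, 0.002780096,
    0.002450613, 0.002733441, 0.002217294, 0.002072314, 0.002063246])
edges = np.arange(1, len(probs) + 2)           # bin edges in minutes: 1, 2, ..., 31

hist_dist = stats.rv_histogram((probs, edges))
samples = hist_dist.rvs(size=1000)             # representative random draws
median = hist_dist.ppf(0.5)                    # piecewise-linear inverse CDF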
I was able to come up with a solution, but it doesn't feel like a very elegant one. Basically, take the percentage value (y value) for each x value, multiply by some large number (say, 10,000), then add that many values of x to an array. Continue through all values of x, ending up with a single giant array. This array can then be fed into .fit() methods of the scipy.stats.rv_discrete subclasses. I'll leave the question open for now as I feel like there must be a better way.
import matplotlib.pyplot as plt
import scipy
import scipy.stats
import numpy as np
xRange = 30
x = np.arange(0, xRange+1)
data = [
0.005995184,0.012209876,0.028232119,0.04711878,0.087894128,
0.116652421,0.115370764,0.12774159,0.109731418,0.079767439,
0.068016186,0.045287033,0.033403796,0.029145134,0.018925806,
0.013340493,0.010087069,0.007998098,0.00984276,0.004906083,
0.004720561,0.003186032,0.003028522,0.002942859,0.002780096,
0.002450613,0.002733441,0.002217294,0.002072314,0.002063246]
y = []
for i in range(len(data)):
    for j in range(int(data[i]*10000)):
        y = np.append(y, i+1)
# creating the histogram
plt.figure(num=1,figsize=(22,12))
h = plt.hist(y, bins=x, density=True)
dist_names = ['burr','f','rayleigh']
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1])
    plt.plot(pdf_fitted, label=dist_name, lw=4)
plt.xlim(0,xRange)
plt.legend(loc='upper right')
plt.show()
I need to interpolate temperature data linearly in 4 dimensions (latitude, longitude, altitude and time).
The number of points is fairly high (360x720x50x8) and I need a fast method of computing the temperature at any point in space and time within the data bounds.
I have tried using scipy.interpolate.LinearNDInterpolator but using Qhull for triangulation is inefficient on a rectangular grid and takes hours to complete.
From reading this SciPy ticket, the solution seemed to be to implement a new nd interpolator using the standard interp1d to compute a higher density of data points, and then use a "nearest neighbour" approach with the new dataset.
This, however, takes a long time again (minutes).
Is there a quick way of interpolating data on a rectangular grid in 4 dimensions without it taking minutes to accomplish?
I thought of using interp1d 4 times without calculating a higher density of points, but leaving it for the user to call with the coordinates, but I can't get my head around how to do this.
Otherwise would writing my own 4D interpolator specific to my needs be an option here?
Here's the code I've been using to test this:
Using scipy.interpolate.LinearNDInterpolator:
import numpy as np
from scipy.interpolate import LinearNDInterpolator
lats = np.arange(-90,90.5,0.5)
lons = np.arange(-180,180,0.5)
alts = np.arange(1,1000,21.717)
time = np.arange(8)
data = np.random.rand(len(lats)*len(lons)*len(alts)*len(time)).reshape((len(lats),len(lons),len(alts),len(time)))
coords = np.zeros((len(lats),len(lons),len(alts),len(time),4))
coords[...,0] = lats.reshape((len(lats),1,1,1))
coords[...,1] = lons.reshape((1,len(lons),1,1))
coords[...,2] = alts.reshape((1,1,len(alts),1))
coords[...,3] = time.reshape((1,1,1,len(time)))
coords = coords.reshape((data.size,4))
interpolatedData = LinearNDInterpolator(coords, data.ravel())
Using scipy.interpolate.interp1d:
import numpy as np
from scipy.interpolate import interp1d
lats = np.arange(-90,90.5,0.5)
lons = np.arange(-180,180,0.5)
alts = np.arange(1,1000,21.717)
time = np.arange(8)
data = np.random.rand(len(lats)*len(lons)*len(alts)*len(time)).reshape((len(lats),len(lons),len(alts),len(time)))
interpolatedData = np.array([None, None, None, None])
interpolatedData[0] = interp1d(lats,data,axis=0)
interpolatedData[1] = interp1d(lons,data,axis=1)
interpolatedData[2] = interp1d(alts,data,axis=2)
interpolatedData[3] = interp1d(time,data,axis=3)
Thank you very much for your help!
In the same ticket you have linked, there is an example implementation of what they call tensor product interpolation, showing the proper way to nest recursive calls to interp1d. This is equivalent to quadrilinear interpolation if you choose the default kind='linear' parameter for your interp1d's.
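As a rough sketch of that nesting (a hand-rolled tensor-product interpolation, not the ticket's exact code), assuming data and the axis arrays from the question, each recursive call removes one dimension with a 1-D linear interpolation:
import numpy as np
from scipy.interpolate import interp1d

def tensor_product_interp(axes, values, point):
    # Interpolate along the last axis at point[-1], then recurse on the remaining axes.
    if len(axes) == 1:
        return float(interp1d(axes[0], values, kind='linear')(point[0]))
    reduced = interp1d(axes[-1], values, axis=-1, kind='linear')(point[-1])
    return tensor_product_interp(axes[:-1], reduced, point[:-1])

# Example call with the grids from the question:
# tensor_product_interp((lats, lons, alts, time), data, [12.3, -4.2, 500.5, 2.5])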
While this may be good enough, it is not truly linear interpolation, and there will be higher-order terms in the interpolation function, as the figure on the Wikipedia entry for bilinear interpolation shows.
This may very well be good enough for what you are after, but there are applications where a triangulated, truly piecewise-linear, interpolation is preferred. If you really need this, there is an easy way of working around the slowness of qhull.
Once LinearNDInterpolator has been setup, there are two steps to coming up with an interpolated value for a given point:
figure out inside which triangle (4D hypertetrahedron in your case) the point is, and
interpolate using the barycentric coordinates of the point relative to the vertices as weights.
You probably do not want to mess with barycentric coordinates, so better leave that to LinearNDInterpolator. But you do know some things about the triangulation. Mostly that, because you have a regular grid, within each hypercube the triangulation is going to be the same. So to interpolate a single value, you could first determine in which subcube your point is, build a LinearNDInterpolator with the 16 vertices of that cube, and use it to interpolate your value:
from itertools import product

def interpolator(coords, data, point):
    dims = len(point)
    indices = []
    sub_coords = []
    for j in range(dims):
        idx = np.digitize([point[j]], coords[j])[0]
        indices += [[idx - 1, idx]]
        sub_coords += [coords[j][indices[-1]]]
    indices = np.array([j for j in product(*indices)])
    sub_coords = np.array([j for j in product(*sub_coords)])
    sub_data = data[tuple(np.swapaxes(indices, 0, 1))]
    li = LinearNDInterpolator(sub_coords, sub_data)
    return li([point])[0]
>>> point = np.array([12.3,-4.2, 500.5, 2.5])
>>> interpolator((lats, lons, alts, time), data, point)
0.386082399091
This cannot work on vectorized data, since that would require storing a LinearNDInterpolator for every possible subcube, and even though it probably would be faster than triangulating the whole thing, it would still be very slow.
scipy.ndimage.map_coordinates is a nice fast interpolator for uniform grids (all boxes the same size). See multivariate-spline-interpolation-in-python-scipy on SO for a clear description.
For non-uniform rectangular grids, a simple wrapper, Intergrid, maps / scales non-uniform grids to uniform ones, then does map_coordinates.
On a 4d test case like yours it takes about 1 μsec per query:
Intergrid: 1000000 points in a (361, 720, 47, 8) grid took 652 msec
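A minimal sketch of the plain map_coordinates route on a uniform grid (all the arange-generated axes in the question are uniformly spaced): physical coordinates are first rescaled to fractional index coordinates, and order=1 gives multilinear interpolation.
import numpy as np
from scipy.ndimage import map_coordinates

# Axes and data as in the question (all uniformly spaced).
lats = np.arange(-90, 90.5, 0.5)
lons = np.arange(-180, 180, 0.5)
alts = np.arange(1, 1000, 21.717)
time = np.arange(8)
data = np.random.rand(len(lats), len(lons), len(alts), len(time))

def to_fractional_index(value, axis):
    # Map a physical coordinate to a fractional grid index (uniform spacing assumed).
    return (value - axis[0]) / (axis[1] - axis[0])

point = [12.3, -4.2, 500.5, 2.5]
idx = np.array([[to_fractional_index(p, ax)]
                for p, ax in zip(point, (lats, lons, alts, time))])
value = map_coordinates(data, idx, order=1)[0]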
For very similar things I use Scientific.Functions.Interpolation.InterpolatingFunction.
import numpy as np
from Scientific.Functions.Interpolation import InterpolatingFunction
lats = np.arange(-90,90.5,0.5)
lons = np.arange(-180,180,0.5)
alts = np.arange(1,1000,21.717)
time = np.arange(8)
data = np.random.rand(len(lats)*len(lons)*len(alts)*len(time)).reshape((len(lats),len(lons),len(alts),len(time)))
axes = (lats, lons, alts, time)
f = InterpolatingFunction(axes, data)
You can now leave it to the user to call the InterpolatingFunction with coordinates:
>>> f(0,0,10,3)
0.7085675631375401
InterpolatingFunction has nice additional features, such as integration and slicing.
However, I do not know for sure whether the interpolation is linear. You would have to look in the module source to find out.
I cannot open this address, and I can't find enough information about this package.
Are there any algorithms that will return the equation of a straight line from a set of 3D data points? I can find plenty of sources which will give the equation of a line from 2D data sets, but none in 3D.
Thanks.
If you are trying to predict one value from the other two, then you should use lstsq with the a argument as your independent variables (plus a column of 1's to estimate an intercept) and b as your dependent variable.
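A minimal sketch of that setup, predicting z from x and y with a column of ones for the intercept (the 10-point data here is made up for illustration):
import numpy as np

rng = np.random.default_rng(0)
pts = np.add.accumulate(rng.random((10, 3)))   # hypothetical 3-D points
x, y, z = pts.T

# Independent variables plus a column of ones to estimate an intercept.
A = np.column_stack((x, y, np.ones(len(x))))
coeffs, residuals, rank, sv = np.linalg.lstsq(A, z, rcond=None)
# z is approximated by coeffs[0]*x + coeffs[1]*y + coeffs[2]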
If, on the other hand, you just want to get the best fitting line to the data, i.e. the line which, if you projected the data onto it, would minimize the squared distance between the real point and its projection, then what you want is the first principal component.
One way to define it is the line whose direction vector is the eigenvector of the covariance matrix corresponding to the largest eigenvalue, that passes through the mean of your data. That said, eig(cov(data)) is a really bad way to calculate it, since it does a lot of needless computation and copying and is potentially less accurate than using svd. See below:
import numpy as np
# Generate some data that lies along a line
x = np.mgrid[-2:5:120j]
y = np.mgrid[1:9:120j]
z = np.mgrid[-5:3:120j]
data = np.concatenate((x[:, np.newaxis],
y[:, np.newaxis],
z[:, np.newaxis]),
axis=1)
# Perturb with some Gaussian noise
data += np.random.normal(size=data.shape) * 0.4
# Calculate the mean of the points, i.e. the 'center' of the cloud
datamean = data.mean(axis=0)
# Do an SVD on the mean-centered data.
uu, dd, vv = np.linalg.svd(data - datamean)
# Now vv[0] contains the first principal component, i.e. the direction
# vector of the 'best fit' line in the least squares sense.
# Now generate some points along this best fit line, for plotting.
# I use -7, 7 since the spread of the data is roughly 14
# and we want it to have mean 0 (like the points we did
# the svd on). Also, it's a straight line, so we only need 2 points.
linepts = vv[0] * np.mgrid[-7:7:2j][:, np.newaxis]
# shift by the mean to get the line in the right place
linepts += datamean
# Verify that everything looks right.
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d as m3d
ax = plt.figure().add_subplot(projection='3d')
ax.scatter3D(*data.T)
ax.plot3D(*linepts.T)
plt.show()
Here's what it looks like:
If your data is fairly well behaved, then it should be sufficient to find the least-squares sum of the component distances. Then you can fit a linear regression of z on x and, separately, of z on y.
Following the documentation example:
import numpy as np
pts = np.add.accumulate(np.random.random((10,3)))
x,y,z = pts.T
# this will find the slope and x-intercept of a plane
# parallel to the y-axis that best fits the data
A_xz = np.vstack((x, np.ones(len(x)))).T
m_xz, c_xz = np.linalg.lstsq(A_xz, z, rcond=None)[0]
# again for a plane parallel to the x-axis
A_yz = np.vstack((y, np.ones(len(y)))).T
m_yz, c_yz = np.linalg.lstsq(A_yz, z, rcond=None)[0]
# the intersection of those two planes and
# the function for the line would be:
# z = m_yz * y + c_yz
# z = m_xz * x + c_xz
# or:
def lin(z):
    x = (z - c_xz)/m_xz
    y = (z - c_yz)/m_yz
    return x, y
#verifying:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
zz = np.linspace(0,5)
xx,yy = lin(zz)
ax.scatter(x, y, z)
ax.plot(xx,yy,zz)
plt.savefig('test.png')
plt.show()
If you want to minimize the actual orthogonal distances from the line to the points in 3-space (which I'm not sure is even referred to as linear regression), then I would build a function that computes the RSS and use a scipy.optimize minimization function to solve it.
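A minimal sketch of that idea, minimizing the sum of squared orthogonal distances to a line parameterized by a point and a direction (the parameterization is redundant, but adequate for an illustration; the data here are made up):
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
pts = np.add.accumulate(rng.random((10, 3)))    # hypothetical noisy 3-D points

def rss(params):
    # params = [px, py, pz, dx, dy, dz]: a point on the line and a direction vector.
    p, d = params[:3], params[3:]
    d = d / np.linalg.norm(d)                   # keep the direction unit-length
    diff = pts - p
    proj = diff @ d                             # signed distance along the line
    perp = diff - np.outer(proj, d)             # component orthogonal to the line
    return np.sum(perp**2)

x0 = np.concatenate([pts.mean(axis=0), [1.0, 1.0, 1.0]])
res = minimize(rss, x0)
point_on_line = res.x[:3]
direction = res.x[3:] / np.linalg.norm(res.x[3:])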