Calculate track from location points without time - python

I've got a bunch of points [ID, lat, lon, time] but the time is unreliable. The time for a couple points is often mixed up and there are some large gaps. I want to be able to calculate a track (basically just a linear-fit or polyfit) from the points but I'm struggling to get them into some kind of order.
First I tried ordering by lat/lon, and this works for the cases where the track is moving constantly in one direction. There are all kinds of mismatches and problems when the track turns back on itself.
Maybe it's a travelling salesman problem but in this case I don't know where the object's track starts/ends.
I've thought about picking a point at random, travelling to the next closest point, and repeating; but how would I complete the track if my random starting point is in the middle, and there are often large gaps between points?
GPS points, incorrectly placed into tracks
Here's a picture of some of the GPS points, colour coded by ID. I've sorted the points by [lat,lon] and you can see the blue track has problems.
This is so simple to do manually, just join the dots, but I can't figure it out computationally. I'm using python/numpy/pandas for this and there are millions of these points, so it would be helpful to avoid computationally intensive methods, but at this point I'm just plain stuck.
EDIT:
Okay, so this is not so simple. It's probably going to involve writing particle/Kalman filters or maybe some kind of Hamiltonian cost equation and then iterating the whole damn track to get an optimal solution. The best (least work for me) solution would be to try to correct the junk time field and possibly build a statistical guesser from the average bearing of point segments.
EDIT + Solution:
Okay, so it's not that complex. In the data I'm looking at, the objects generally travel N-S or E-W with little deviation. Where there is complex manoeuvring I usually have more reliable time data. The non-general solution for my dataset is to check whether the track can be defined as a function of latitude (N-S travel with no S-N movement component), else whether it can be a function of longitude. Then order by lat/lon and bam. This won't work in the case of spirals or other complex tracks, but those are minimal in my data.
Not the perfect solution but good enough for me.
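A minimal sketch of that idea, assuming each ID's points sit in a pandas DataFrame with lat and lon columns (the column names, and the shorter-total-path rule for choosing between the two orderings, are assumptions rather than part of the original solution):
import numpy as np
import pandas as pd

def path_length(track):
    # total length of the polyline through the points in their current order
    return np.hypot(np.diff(track["lat"].values), np.diff(track["lon"].values)).sum()

def order_track(points):
    # points: DataFrame with "lat" and "lon" columns for a single ID (assumed names).
    # Try ordering by latitude and by longitude, then keep whichever gives the
    # shorter total path, assuming the track is a function of one of the two.
    by_lat = points.sort_values("lat").reset_index(drop=True)
    by_lon = points.sort_values("lon").reset_index(drop=True)
    return by_lat if path_length(by_lat) <= path_length(by_lon) else by_lon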

Hm, seems like simple clustering - or even sorting with the right metric would do the trick.
Your data
from IPython.display import Image
Image('http://i.stack.imgur.com/76pNx.png')
Generate similar data
import numpy as np
np.random.seed(42)
data_lat = np.arange(300, dtype=np.int32) * (1 + (np.random.random(300) - 0.5) * 0.1)
data_lon = np.arange(5, 305, dtype=np.int32) * (1 + (np.random.random(300) - 0.5) * 0.1)
%pylab inline
import seaborn as sns
plt.scatter(data_lat, data_lon)
import itertools
seq_data = [(la, lo) for i, (la, lo) in enumerate(zip(data_lat, data_lon)) if
            i in itertools.chain(range(20), range(55, 70), range(120, 165),
                                 range(200, 250), range(280, 300))]
plt.plot(*zip(*seq_data))
plt.scatter(*zip(*seq_data))
Scramble
import random
random.seed(42)
data = seq_data.copy()
random.shuffle(data)
plt.plot(*zip(*data))
plt.scatter(*zip(*data))
Sort
data.sort(key=lambda t: (t[0]**2 + t[1]**2)**(1/2))
plt.plot(*zip(*data))
plt.scatter(*zip(*data))

Related

Concatenating Waves in Python using Numpy

I'm trying to produce a tone that transitions linearly between a list of arbitrary frequencies with respect to time. I'm using scipy.signal to create waves that transition between pairs of frequencies, then concatenating them. When the frequencies are round numbers and relatively far from each other this works. When the numbers are close together or aren't as nice, I get pops between each transition.
What is causing these pops, why only in the first case and not the other two, and what can I do to fix it? (Or if there's a better/easier way to do what I'm trying to do, what is it?)
Any thoughts would be very much appreciated.
from scipy.signal import sweep_poly
import numpy
import sounddevice
sample_rate = 44100.0
time = 1.0
amplitude = 10000
sounddevice.default.samplerate = sample_rate
freq_list = [100, 200, 100, 200]
#freq_list = [100.05,200.21,100.02,200.65,100.16]
#freq_list = [100,101,102,103]
#get samples for each segment
samples = numpy.arange(sample_rate * time) / sample_rate
#make all the different segments
wave_list = []
for i in range(len(freq_list)-1):
    wave_list.append(amplitude * sweep_poly(samples, [float(freq_list[i+1]) - float(freq_list[i]), float(freq_list[i])]))
#join them together
wave = numpy.concatenate(wave_list)
#convert it to wav format (16 bits)
wav_wave = numpy.array(wave, dtype=numpy.int16)
sounddevice.play(wav_wave, blocking=True)
My comment was correct. The problem was that when I made a new wave it always started at its peak, which didn't necessarily line up with the old wave, resulting in discontinuities:
I fixed this by setting the phase offset parameter of sweep_poly to (180/math.pi)*math.acos(prev_point/amplitude) where prev_point was the last point in the previous sine wave.
Unfortunately since sine isn't one-to-one sometimes I got waves where the values matched, but the slopes didn't:
My fix for this was to check if the signs of the slopes matched, and if they didn't, slowly increase the offset until they did, then continue slowly increasing the offset until the discontinuity was small (<10). I'm sure this isn't the nicest or most mathematically satisfying way to solve this, but it works well enough for me. Now I have beautiful (pretty close to) continuously differentiable waves.
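Here is a rough sketch of the value-matching part of that fix, reusing the setup from the question; it only matches the end value of each segment (via sweep_poly's phi parameter, in degrees) and leaves out the slope-sign adjustment described above:
import math
import numpy
from scipy.signal import sweep_poly

sample_rate = 44100.0
time = 1.0
amplitude = 10000
freq_list = [100.05, 200.21, 100.02, 200.65, 100.16]
samples = numpy.arange(sample_rate * time) / sample_rate

wave_list = []
phase = 0.0  # phase offset (degrees) handed to sweep_poly for each segment
for i in range(len(freq_list) - 1):
    poly = [float(freq_list[i + 1]) - float(freq_list[i]), float(freq_list[i])]
    segment = amplitude * sweep_poly(samples, poly, phi=phase)
    wave_list.append(segment)
    # start the next segment at the value where this one ended
    last = min(1.0, max(-1.0, segment[-1] / amplitude))
    phase = (180 / math.pi) * math.acos(last)
wave = numpy.concatenate(wave_list)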

Path between two Topos() locations: determine latitude and longitude where a given altitude is reached

I am quite new to the field of orbital mechanics and currently struggling a bit with the following problem, which should be quite easy to solve with Skyfield, yet I am a bit overwhelmed by all the different coordinate systems and the translation between them.
I have a Topos location on Earth and a Topos location of a LEO satellite. I am considering the line-of-sight between them. I want to determine the latitude and longitude of the position along this path, where it intersects a specific layer of the atmosphere.
An example would be the mesosphere and an existing dataset on its properties at around 100km that is given based on latitude and longitude. The intersection would allow me to better understand the interaction these properties have on the communication with the satellite.
I tried doing it with Skyfield directly, but only ended up with an Apparent object that I cannot convert back to latitude, longitude on Earth. First, I trigonometrically determined the distance from Earth to the point where the height of 100 km is reached.
Then, I took the position on Earth and used the unchanged elevation, azimuth to keep the direction of the path and finally added the calculated distance to arrive at this position. I think I need to get a Geocentric object to use subpoint() in order to get the desired latitude, longitude of this location.
This is what I have so far:
from skyfield.api import load, Distance
from skyfield.toposlib import Topos
import numpy as np
ts = load.timescale()
earth_position = Topos('52.230039 N', '4.842402 E', elevation_m=10)
space_position = Topos('51.526200 N', '5.347795 E', elevation_m=625 * 1000)
difference = (space_position - earth_position).at(ts.now()).altaz()
distance_to_height = 100 / np.sin(difference[0].radians)
position = earth_position.at(ts.now()).from_altaz(alt_degrees=difference[0].degrees, az_degrees=difference[1].degrees, distance=Distance(km=distance_to_height))
I have gone through the documentation multiple times, and stumbled upon frame_latlon(frame) for Generic ICRF objects, but am not sure how to further proceed.
Trying it completely trigonometrically with the latitudes and longitudes didn't yield the desired results either.
Unfortunately I do not really have any validated results that could be used to solve this problem more easily. Imagining it again trigonometrically, it is obvious that an increase in altitude of the satellite position would move the lat, lon of the intersection closer to the position on Earth. Decreasing the altitude would then move this intersection closer to the satellite.
That is an interesting problem, which Skyfield’s API provides no easy way to ask about; if you could outline the larger problem that will be solved by knowing the intersection of the line-of-sight with a particular altitude, then it is possible that a routine addressing that problem could be written for future users tackling the same question.
In the meantime:
To get your script running I had to import Distance from api.
The name dis was not recognized, so I replaced it with distance_to_height, hoping that it was the name intended.
Calling ts.now() is giving you a slightly different date and time on each call. While the script runs so fast that it probably does not matter, I have for clarity pivoted to calling now() only once at the beginning of the script, which is also slightly faster than calling it repeatedly. (Actually in this case it’s much faster, because the rotation matrices only get computed once rather than having to be computed over again for each separate time object, but that’s a hidden detail that’s not easy to see.)
I suspect a problem with your geometry: the 100 / sin() maneuver would only work if the Earth were flat, I think? But maybe you are always dealing with nearly-overhead satellites and so the error is manageable? (Or maybe I am mis-imagining the geometry; feel free to provide a diagram if the math is in fact correct.)
For readability I’ll give the components of altaz() names rather than numbers.
With those tweaks in place, I think the answer is that you need to manually construct a Geocentric position by adding together the position of the observer and the relative vector you have created between the observer and the kind-of-100km point along the line of sight. Having to take a manual step like this suggests a possible area where Skyfield can improve.
Here is how it looks in code:
from skyfield.api import load, Distance
from skyfield.positionlib import Geocentric
from skyfield.toposlib import Topos
import numpy as np
ts = load.timescale()
t = ts.now()
earth_position = Topos('52.230039 N', '4.842402 E', elevation_m=10)
space_position = Topos('51.526200 N', '5.347795 E', elevation_m=625 * 1000)
alt, az, distance = (space_position - earth_position).at(t).altaz()
distance_to_height = 100 / np.sin(alt.radians)
e = earth_position.at(t)
p = e.from_altaz(alt_degrees=alt.degrees, az_degrees=az.degrees, distance=Distance(km=distance_to_height))
g = Geocentric(e.position.au + p.position.au, t=t)
s = g.subpoint()
print(s)
print(s.elevation.km, '<- warning: 100/sin() did not produce exactly 100')
The result I see is:
Topos 52deg 06' 30.0" N 04deg 55' 51.7" E
100.02752954478532 <- warning: 100/sin() did not produce exactly 100
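If the small discrepancy from the flat-Earth 100/sin() approximation matters, a sketch of the exact spherical-Earth slant distance, assuming a mean Earth radius of 6371 km and neglecting the observer's 10 m elevation (both my assumptions), could replace distance_to_height:
import numpy as np

R = 6371.0              # assumed mean Earth radius in km
h = 100.0               # target altitude in km
alt_rad = alt.radians   # elevation angle from the altaz() call above

# Triangle (Earth centre, observer, intersection point), law of cosines:
# (R + h)**2 = R**2 + d**2 + 2*R*d*sin(alt)  ->  solve the quadratic for d
d = -R * np.sin(alt_rad) + np.sqrt((R * np.sin(alt_rad))**2 + 2 * R * h + h**2)
print(d, '<- slant distance in km to the 100 km shell')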
And for the future, I have added some thoughts to the Skyfield TODO.rst file that together might move towards unlocking a more idiomatic way to perform this kind of calculation, though I suspect that a few more steps even beyond these will be necessary:
https://github.com/skyfielders/python-skyfield/commit/ba1172a0ccfef84473436d9d7b8a7d7011344cbd

How to get the K most distant points, given their coordinates?

We have a boring CSV with 10000 rows of ages (float), titles (enum/int), scores (float), ....
We have N columns each with int/float values in a table.
You can imagine this as points in ND space
We want to pick K points that would have maximised distance between each other.
So if we have 100 points in a tightly packed cluster and one point in the distance we would get something like this for three points:
or this
For 4 points it will become more interesting and pick some point in the middle.
So how to select K most distant rows (points) from N (with any complexity)? It looks like an ND point cloud "triangulation" with a given resolution yet not for 3d points.
I'm searching for a reasonably fast approach (approximate - no precise solution needed) for K=200, N=100000 and ND=6 (probably multigrid- or ANN-on-KDTree-based, SOM or triangulation based...). Does anyone know one?
From past experience with a pretty similar problem, a simple solution of computing the mean Euclidean distance of all pairs within each group of K points and then taking the largest mean works very well. As someone noted above, it's probably hard to avoid a loop on all combinations (not on all pairs). So a possible implementation of all this can be as follows:
import itertools
import numpy as np
from scipy.spatial.distance import pdist
Npoints = 3 # or 4 or 5...
# making up some data:
data = np.matrix([[3,2,4,3,4],[23,25,30,21,27],[6,7,8,7,9],[5,5,6,6,7],[0,1,2,0,2],[3,9,1,6,5],[0,0,12,2,7]])
# finding row indices of all combinations:
c = [list(x) for x in itertools.combinations(range(len(data)), Npoints )]
distances = []
for i in c:
    distances.append(np.mean(pdist(data[i,:]))) # pdist: a method of computing all pairwise Euclidean distances in a condensed way.
ind = distances.index(max(distances)) # finding the index of the max mean distance
rows = c[ind] # these are the points in question
I propose an approximate solution. The idea is to start from a set of K points chosen in a way I'll explain below, and repeatedly loop through these points replacing the current one with the point, among the N-K+1 points not belonging to the set but including the current one, that maximizes the sum of the distances from the points of the set. This procedure leads to a set of K points where the replacement of any single point would cause the sum of the distances among the points of the set to decrease.
To start the process we take the K points that are closest to the mean of all points. This way we have good chances that on the first loop the set of K points will be spread out close to its optimum. Subsequent iterations will make adjustments to the set of K points towards a maximum of the sum of distances, which for the current values of N, K and ND appears to be reachable in just a few seconds. In order to prevent excessive looping in edge cases, we limit the number of loops nonetheless.
We stop iterating when an iteration does not improve the total distance among the K points. Of course, this is a local maximum. Other local maxima will be reached for different initial conditions, or by allowing more than one replacement at a time, but I don't think it would be worthwhile.
The data must be adjusted in order for unit displacements in each dimension to have the same significance, i.e., in order for Euclidean distances to be meaningful. E.g., if your dimensions are salary and number of children, unadjusted, the algorithm will probably yield results concentrated in the extreme salary regions, ignoring that person with 10 kids. To get a more realistic output you could divide salary and number of children by their standard deviation, or by some other estimate that makes differences in salary comparable to differences in number of children.
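For example, a simple way to do that adjustment (assuming the data is a NumPy array with one column per dimension) is to standardize each column:
import numpy as np
# divide out the per-dimension scale so a unit step means "one standard deviation"
data = np.asarray(data, dtype=float)
data = (data - data.mean(axis=0)) / data.std(axis=0)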
To be able to plot the output for a random Gaussian distribution, I have set ND = 2 in the code, but setting ND = 6, as per your request, is no problem (except you cannot plot it).
import matplotlib.pyplot as plt
import numpy as np
import scipy.spatial as spatial
N, K, ND = 100000, 200, 2
MAX_LOOPS = 20
SIGMA, SEED = 40, 1234
rng = np.random.default_rng(seed=SEED)
means, variances = [0] * ND, [SIGMA**2] * ND
data = rng.multivariate_normal(means, np.diag(variances), N)
def distances(ndarray_0, ndarray_1):
    if (ndarray_0.ndim, ndarray_1.ndim) not in ((1, 2), (2, 1)):
        raise ValueError("bad ndarray dimensions combination")
    return np.linalg.norm(ndarray_0 - ndarray_1, axis=1)

# start with the K points closest to the mean
# (the copy() is only to avoid a view into an otherwise unused array)
indices = np.argsort(distances(data, data.mean(0)))[:K].copy()
# distsums is, for all N points, the sum of the distances from the K points
distsums = spatial.distance.cdist(data, data[indices]).sum(1)
# but the K points themselves should not be considered
# (the trick is that -np.inf ± a finite quantity always yields -np.inf)
distsums[indices] = -np.inf
prev_sum = 0.0
for loop in range(MAX_LOOPS):
    for i in range(K):
        # remove this point from the K points
        old_index = indices[i]
        # calculate its sum of distances from the K points
        distsums[old_index] = distances(data[indices], data[old_index]).sum()
        # update the sums of distances of all points from the K-1 points
        distsums -= distances(data, data[old_index])
        # choose the point with the greatest sum of distances from the K-1 points
        new_index = np.argmax(distsums)
        # add it to the K points replacing the old_index
        indices[i] = new_index
        # don't consider it any more in distsums
        distsums[new_index] = -np.inf
        # update the sums of distances of all points from the K points
        distsums += distances(data, data[new_index])
    # sum all mutual distances of the K points
    curr_sum = spatial.distance.pdist(data[indices]).sum()
    # break if the sum hasn't changed
    if curr_sum == prev_sum:
        break
    prev_sum = curr_sum

if ND == 2:
    X, Y = data.T
    marker_size = 4
    plt.scatter(X, Y, s=marker_size)
    plt.scatter(X[indices], Y[indices], s=marker_size)
    plt.grid(True)
    plt.gca().set_aspect('equal', adjustable='box')
    plt.show()
Output:
Splitting the data into 3 equidistant Gaussian distributions the output is this:
Assuming that you read your CSV file with N (10000) rows and D dimensions (or features) into an N*D matrix X, you can calculate the distance between each pair of points and store it in a distance matrix as follows:
import numpy as np
X = np.asarray(X) ### convert to numpy array
distance_matrix = np.zeros((X.shape[0],X.shape[0]))
for i in range(X.shape[0]):
    for j in range(i+1, X.shape[0]):
        ## We compute only the upper triangle; the distance from point A to point B and from B to A are the same.
        distance_matrix[i][j] = np.linalg.norm(X[i]-X[j]) ## Here I am calculating Euclidean distance. Other distance measures can also be used.
#distance_matrix = distance_matrix + distance_matrix.T - np.diag(np.diag(distance_matrix)) ## This syntax can be used to fill in the lower triangle and make the matrix symmetric, which is not really required in your case.
K = 5 ## Number of points that you want to pick
indexes = np.unravel_index(np.argsort(distance_matrix.ravel())[-1*K:], distance_matrix.shape)
print(indexes)
Bottom Line Up Front: Dealing with multiple equally distant points and the Curse of Dimensionality are going to be larger problems than just finding the points. Spoiler alert: There's a surprise ending.
I think this an interesting question but I'm bewildered by some of the answers. I think this is, in part, due to the sketches provided. You've no doubt noticed the answers look similar -- 2d, with clusters -- even though you indicated a wider scope was needed. Because others will eventually see this, I'm going to step through my thinking a bit slowly so bear with me for the early part.
It makes sense to start with a simplified example to see if we can generalize a solution with data that's easy to grasp and a linear 2D model is easiest of the easy.
We don't need to calculate all the distances though. We just need the ones at the extremes. So we can then take the top and bottom few values:
right = lin_2_D.nlargest(8, ['x'])
left = lin_2_D.nsmallest(8, ['x'])
graph = sns.scatterplot(x="x", y="y", data=lin_2_D, color = 'gray', marker = '+', alpha = .4)
sns.scatterplot(x = right['x'], y = right['y'], color = 'red')
sns.scatterplot(x = left['x'], y = left['y'], color = 'green')
fig = graph.figure
fig.set_size_inches(8,3)
What we have so far: of 100 points, we've eliminated the need to calculate the distance between 84 of them. Of what's left, we can reduce the work further by ordering the results on one side and checking distances against the others.
You can imagine a case where you have a couple of data points way off the trend line that could be captured by taking the greatest or least y values, and all that starts to look like Walter Tross's top diagram. Add in a couple of extra clusters and you get what looks like his bottom diagram, and it appears that we're sort of making the same point.
The problem with stopping here is the requirement you mentioned is that you need a solution that works for any number of dimensions.
The unfortunate part is that we run into four challenges:
Challenge 1: As you increase the dimensions you can run into a large number of cases where you have multiple solutions when seeking midpoints. So you're looking for the k furthest points but have a large number of equally valid possible solutions and no way of prioritizing them. Here are two super easy examples to illustrate this:
A) Here we have just four points and in only two dimensions. You really can't get any easier than this, right? The distance from red to green is trivial. But try to find the next furthest point and you'll see both of the black points are equidistant from both the red and green points. Imagine you wanted the furthest six points using the first graphs, you might have 20 or more points that are all equidistant.
edit: I just noticed the red and green dots are at the edges of their circles rather than at the center, I'll update later but the point is the same.
B) This is super easy to imagine: Think of a D&D 4 sided die. Four points of data in a three-dimensional space, all equidistant so it's known as a triangle-based pyramid. If you're looking for the closest two points, which two? You have 4 choose 2 (aka, 6) combinations possible. Getting rid of valid solutions can be a bit of a problem because invariably you face questions such as "why did we get rid of these and not this one?"
Challenge 2: The Curse of Dimensionality. Nuff Said.
Challenge 3: Revenge of The Curse of Dimensionality. Because you're looking for the most distant points, you have to have x, y, z ... n coordinates for each point, or you have to impute them. Now, your data set is much larger and slower.
Challenge 4: Because you're looking for the most distant points, dimension reduction techniques such as ridge and lasso are not going to be useful.
So, what to do about this?
Nothing.
Wait. What?!?
Not truly, exactly, and literally nothing. But nothing crazy. Instead, rely on a simple heuristic that is understandable and computationally easy. Paul C. Kainen puts it well:
Intuitively, when a situation is sufficiently complex or uncertain,
only the simplest methods are valid. Surprisingly, however,
common-sense heuristics based on these robustly applicable techniques
can yield results which are almost surely optimal.
In this case, you have not the Curse of Dimensionality but rather the Blessing of Dimensionality. It's true you have a lot of points, and they'll scale linearly as you seek other equidistant points (k), but the total dimensional volume of space will increase to the power of the dimensions. The k number of furthest points you're after is insignificant compared to the total number of points. Hell, even k^2 becomes insignificant as the number of dimensions increases.
Now, if you had a low dimensionality, I would go with them as a solution (except the ones that use nested for loops ... in NumPy or Pandas).
If I were in your position, I'd be thinking how I've got code in these other answers that I could use as a basis, and maybe wonder why I should trust this other than that it lays out a framework on how to think through the topic. Certainly, there should be some math and maybe somebody important saying the same thing.
Let me refer to chapter 18 of Computer Intensive Methods in Control and Signal Processing and an expanded argument by analogy with some heavy(-ish) math. You can see from the above (the graph with the colored dots at the edges) that the center is removed, particularly if you followed the idea of removing the extreme y values. It's as though you put a balloon in a box. You could do this with a sphere in a cube too. Raise that into multiple dimensions and you have a hypersphere in a hypercube. You can read more about that relationship here.
Finally, let's get to a heuristic:
Select the points that have the most max or min values per dimension. When/if you run out of them, pick ones that are close to those values if there isn't one at the min/max. Essentially, you're choosing the corners of a box. For a 2D graph you have four points; for 3D you have the 8 corners of the box (2^3).
More accurately this would be a 4d or 5d (depending on how you might assign the marker shape and color) projected down to 3d. But you can easily see how this data cloud gives you the full range of dimensions.
Here is a quick check on learning; for purposes of ease, ignore the color/shape aspect: It's easy to graphically intuit that you have no problem with up to k points short of deciding what might be slightly closer. And you can see how you might need to randomize your selection if you have a k < 2D. And if you added another point you can see it (k +1) would be in a centroid. So here is the check: If you had more points, where would they be? I guess I have to put this at the bottom -- limitation of markdown.
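A rough sketch of that corner heuristic (standardizing first so the dimensions are comparable, and taking the single most extreme point toward each corner; both choices are my assumptions):
import itertools
import numpy as np

def corner_heuristic(X, k):
    # Pick up to k rows of X that are most extreme toward the 2**D "corners"
    # of the standardized data cloud (a rough version of the heuristic above).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # put the dimensions on equal footing
    corners = itertools.product([-1.0, 1.0], repeat=X.shape[1])
    chosen = []
    for c in corners:
        idx = int(np.argmax(Z @ np.array(c)))  # most extreme point toward this corner
        if idx not in chosen:
            chosen.append(idx)
        if len(chosen) == k:
            break
    return chosen

# e.g. rows = corner_heuristic(np.asarray(data, dtype=float), 8)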
So for a 6D data cloud, the values of k less than 64 (really 65 as we'll see in just a moment) points are pretty easy. But...
If you don't have a data cloud but instead have data that has a linear relationship, you'll have 2^(D-1) points. So, for that linear 2D space you have a line, for linear 3D space you'd have a plane, then a rhomboid, etc. This is true even if your shape is curved. Rather than do this graph myself, I'm using the one from an excellent post by Inversion Labs on Best-fit Surfaces for 3D Data.
If the number of points, k, is less than 2^D you need a process to decide what you don't use. Linear discriminant analysis should be on your shortlist. That said, you can probably satisfice the solution by randomly picking one.
For a single additional point (k = 1 + 2^D), you're looking for one that is as close as possible to the center of the bounding space.
When k > 2^D, the possible solutions will scale not geometrically but factorially. That may not seem intuitive, so let's go back to the two circles. For 2D you have just two points that could be candidates for being equidistant. But if that were 3D space and you rotate the points about the line, any point in what is now a ring would suffice as a solution for k. For a 3D example, they would be a sphere. Hyperspheres (n-spheres) from thereon. Again, 2^D scaling.
One last thing: You should seriously look at xarray if you're not already familiar with it.
Hope all this helps and I also hope you'll read through the links. It'll be worth the time.
*It would be the same shape, centrally located, with the vertices at the 1/3 mark. So like having 27 six-sided dice shaped like a giant cube. Each vertex (or the point nearest it) would fix the solution. Your original k+1 would have to be relocated too. So you would select 2 of the 8 vertices. Final question: would it be worth calculating the distances of those points against each other (remember the diagonal is slightly longer than the edge) and then comparing them to the original 2^D points? Bluntly, no. Satisfice the solution.
If you're interested in getting the most distant points you can take advantage of all of the methods that were developed for nearest neighbors, you just have to give a different "metric".
For example, using scikit-learn's nearest neighbors and distance metrics tools you can do something like this
import numpy as np
from sklearn.neighbors import BallTree
from sklearn.neighbors.dist_metrics import PyFuncDistance
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
def inverted_euclidean(x1, x2):
    # You can speed this up using cython like scikit-learn does, or numba
    dist = np.sum((x1 - x2) ** 2)
    # We invert the euclidean distance and set nearby points to the biggest possible
    # positive float that isn't inf
    inverted_dist = np.where(dist == 0, np.nextafter(np.inf, 0), 1 / dist)
    return inverted_dist
# Make up some fake data
n_samples = 100000
n_features = 200
X, _ = make_blobs(n_samples=n_samples, centers=3, n_features=n_features, random_state=0)
# We exploit the BallTree algorithm to get the most distant points
ball_tree = BallTree(X, leaf_size=50, metric=PyFuncDistance(inverted_euclidean))
# Some made up query, you can also provide a stack of points to query against
test_point = np.zeros((1, n_features))
distance, distant_points_inds = ball_tree.query(X=test_point, k=10, return_distance=True)
distant_points = X[distant_points_inds[0]]
# We can try to visualize the query results
plt.plot(X[:, 0], X[:, 1], ".b", alpha=0.1)
plt.plot(test_point[:, 0], test_point[:, 1], "*r", markersize=9)
plt.plot(distant_points[:, 0], distant_points[:, 1], "sg", markersize=5, alpha=0.8)
plt.show()
Which will plot something like:
There are many points that you can improve on:
I implemented the inverted_euclidean distance function with numpy, but you can try to do what the folks of scikit-learn do with their distance functions and implement them in cython. You could also try to jit compile them with numba.
Maybe the euclidean distance isn't the metric you would like to use to find the furthest points, so you're free to implement your own or simply roll with what scikit-learn provides.
The nice thing about using the Ball Tree algorithm (or the KdTree algorithm) is that for each queried point you have to do log(N) comparisons to find the furthest point in the training set. Building the Ball Tree itself, I think also requires log(N) comparison, so in the end if you want to find the k furthest points for every point in the ball tree training set (X), it will have almost O(D N log(N)) complexity (where D is the number of features), which will increase up to O(D N^2) with the increasing k.

How to generate noisy mock time series or signal (in Python)

Quite often I have to work with a bunch of noisy, somewhat correlated time series. Sometimes I need some mock data to test my code, or to provide some sample data for a question on Stack Overflow. I usually end up either loading some similar dataset from a different project, or just adding a few sine functions and noise and spending some time to tweak it.
What's your approach? How do you generate noisy signals with certain specs? Have I just overlooked some blatantly obvious standard package that does exactly this?
The features I would generally like to get in my mock data:
Varying noise levels over time
Some history in the signal (like a random walk?)
Periodicity in the signal
Being able to produce another time series with similar (but not exactly the same) features
Maybe a bunch of weird dips/peaks/plateaus
Being able to reproduce it (some seed and a few parameters?)
I would like to get a time series similar to the two below [A]:
I usually end up creating a time series with a bit of code like this:
import numpy as np
n = 1000
limit_low = 0
limit_high = 0.48
my_data = np.random.normal(0, 0.5, n) \
+ np.abs(np.random.normal(0, 2, n) \
* np.sin(np.linspace(0, 3*np.pi, n)) ) \
+ np.sin(np.linspace(0, 5*np.pi, n))**2 \
+ np.sin(np.linspace(1, 6*np.pi, n))**2
scaling = (limit_high - limit_low) / (max(my_data) - min(my_data))
my_data = my_data * scaling
my_data = my_data + (limit_low - min(my_data))
Which results in a time series like this:
Which is something I can work with, but still not quite what I want. The problem here is mainly that:
it doesn't have the history/random walk aspect
it's quite a bit of code and tweaking (this is especially a problem if I want to share a sample time series)
I need to retweak the values (freq. of sines etc.) to produce another similar but not exactly the same time series.
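For the missing history/random-walk aspect, a minimal sketch of one way to mix a cumulative-sum term into the recipe above (the step size, frequencies and seed are arbitrary choices):
import numpy as np

rng = np.random.default_rng(42)              # seed for reproducibility
n = 1000
limit_low, limit_high = 0, 0.48
walk = np.cumsum(rng.normal(0, 0.02, n))     # random-walk "history" term
periodic = np.sin(np.linspace(0, 5 * np.pi, n))**2
noise = np.abs(rng.normal(0, 2, n)) * np.sin(np.linspace(0, 3 * np.pi, n))
my_data = walk + periodic + noise
# rescale into [limit_low, limit_high] as in the snippet above
my_data = (my_data - my_data.min()) / (my_data.max() - my_data.min())
my_data = limit_low + my_data * (limit_high - limit_low)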
[A]: For those wondering, the time series depicted in the first two images is the traffic intensity at two points along one road over three days (midnight to 6 am is clipped) in cars per second (moving hanning window average over 2 min). Resampled to 1000 points.
Have you looked into TSimulus? By using Generators, you should be able to generate data with specific patterns, periodicity, and cycles.
The TSimulus project provides tools for specifying the shape of a time series (general patterns, cycles, importance of the added noise, etc.) and for converting this specification into time series values.
Otherwise, you can try "drawing" the data yourself and exporting those data points using Time Series Maker.

These spectrum bands used to be judged by eye, how to do it programmatically?

Operators used to examine the spectrum, knowing the location and width of each peak, and judge which piece the spectrum belongs to. In the new way, the image is captured by a camera to a screen, and the width of each band must be computed programmatically.
Old system: spectroscope -> human eye
New system: spectroscope -> camera -> program
What is a good method to compute the width of each band, given their approximate X-axis positions; given that this task used to be performed perfectly by eye, and must now be performed by program?
Sorry if I am short of details, but they are scarce.
Program listing that generated the previous graph; I hope it is relevant:
import Image
from scipy import *
from scipy.optimize import leastsq
# Load the picture with PIL, process if needed
pic = asarray(Image.open("spectrum.jpg"))
# Average the pixel values along vertical axis
pic_avg = pic.mean(axis=2)
projection = pic_avg.sum(axis=0)
# Set the min value to zero for a nice fit
projection /= projection.mean()
projection -= projection.min()
#print projection
# Fit function, two gaussians, adjust as needed
def fitfunc(p,x):
    return p[0]*exp(-(x-p[1])**2/(2.0*p[2]**2)) + \
           p[3]*exp(-(x-p[4])**2/(2.0*p[5]**2))
errfunc = lambda p, x, y: fitfunc(p,x)-y
# Use scipy to fit, p0 is initial guess
p0 = array([0,20,1,0,75,10])
X = arange(len(projection))  # arange (not xrange) so the fit function can do array arithmetic
p1, success = leastsq(errfunc, p0, args=(X,projection))
Y = fitfunc(p1,X)
# Output the result
print "Mean values at: ", p1[1], p1[4]
# Plot the result
from pylab import *
#subplot(211)
#imshow(pic)
#subplot(223)
#plot(projection)
#subplot(224)
#plot(X,Y,'r',lw=5)
#show()
subplot(311)
imshow(pic)
subplot(312)
plot(projection)
subplot(313)
plot(X,Y,'r',lw=5)
show()
Given an approximate starting point, you could use a simple algorithm that finds the local maximum closest to this point. Your fitting code may be doing that already (I wasn't sure whether you were using it successfully or not).
Here's some code that demonstrates simple peak finding from a user-given starting point:
#!/usr/bin/env python
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
# Sample data with two peaks: small one at t=0.4, large one at t=0.8
ts = np.arange(0, 1, 0.01)
xs = np.exp(-((ts-0.4)/0.1)**2) + 2*np.exp(-((ts-0.8)/0.1)**2)
# Say we have an approximate starting point of 0.35
start_point = 0.35
# Nearest index in "ts" to this starting point is...
start_index = np.argmin(np.abs(ts - start_point))
# Find the local maxima in our data by looking for a sign change in
# the first difference
# From http://stackoverflow.com/a/9667121/188535
maxes = (np.diff(np.sign(np.diff(xs))) < 0).nonzero()[0] + 1
# Find which of these peaks is closest to our starting point
index_of_peak = maxes[np.argmin(np.abs(maxes - start_index))]
print "Peak centre at: %.3f" % ts[index_of_peak]
# Quick plot showing the results: blue line is data, green dot is
# starting point, red dot is peak location
plt.plot(ts, xs, '-b')
plt.plot(ts[start_index], xs[start_index], 'og')
plt.plot(ts[index_of_peak], xs[index_of_peak], 'or')
plt.show()
This method will only work if the ascent up the peak is perfectly smooth from your starting point. If this needs to be more resilient to noise, I have not used it, but PyDSTool seems like it might help. This SciPy post details how to use it for detecting 1D peaks in a noisy data set.
So assume at this point you've found the centre of the peak. Now for the width: there are several methods you could use, but the easiest is probably the "full width at half maximum" (FWHM). Again, this is simple and therefore fragile. It will break for close double-peaks, or for noisy data.
The FWHM is exactly what its name suggests: you find the width of the peak where it's halfway to the maximum. Here's some code that does that (it just continues on from above):
# FWHM...
half_max = xs[index_of_peak]/2
# This finds where in the data we cross over the halfway point to our peak. Note
# that this is global, so we need an extra step to refine these results to find
# the closest crossovers to our peak.
# Same sign-change-in-first-diff technique as above
hm_left_indices = (np.diff(np.sign(np.diff(np.abs(xs[:index_of_peak] - half_max)))) > 0).nonzero()[0] + 1
# Add "index_of_peak" to result because we cut off the left side of the data!
hm_right_indices = (np.diff(np.sign(np.diff(np.abs(xs[index_of_peak:] - half_max)))) > 0).nonzero()[0] + 1 + index_of_peak
# Find closest half-max index to peak
hm_left_index = hm_left_indices[np.argmin(np.abs(hm_left_indices - index_of_peak))]
hm_right_index = hm_right_indices[np.argmin(np.abs(hm_right_indices - index_of_peak))]
# And the width is...
fwhm = ts[hm_right_index] - ts[hm_left_index]
print "Width: %.3f" % fwhm
# Plot to illustrate FWHM: blue line is data, red circle is peak, red line
# shows FWHM
plt.plot(ts, xs, '-b')
plt.plot(ts[index_of_peak], xs[index_of_peak], 'or')
plt.plot(
    [ts[hm_left_index], ts[hm_right_index]],
    [xs[hm_left_index], xs[hm_right_index]], '-r')
plt.show()
It doesn't have to be the full width at half maximum — as one commenter points out, you can try to figure out where your operators' normal threshold for peak detection is, and turn that into an algorithm for this step of the process.
A more robust way might be to fit a Gaussian curve (or your own model) to a subset of the data centred around the peak (say, from a local minimum on one side to a local minimum on the other) and use one of the parameters of that curve (eg. sigma) to calculate the width.
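A rough sketch of that Gaussian-fit idea, continuing from the peak-finding variables above (the window half-width and the initial guesses are arbitrary choices):
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, mu, sigma):
    return a * np.exp(-(x - mu)**2 / (2 * sigma**2))

# Fit only a window of samples around the detected peak
half_width = 15
lo = max(index_of_peak - half_width, 0)
hi = index_of_peak + half_width
p, _ = curve_fit(gaussian, ts[lo:hi], xs[lo:hi],
                 p0=[xs[index_of_peak], ts[index_of_peak], 0.05])
fwhm_gauss = 2 * np.sqrt(2 * np.log(2)) * abs(p[2])  # FWHM from the fitted sigma
print("Gaussian-fit width: %.3f" % fwhm_gauss)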
I realise this is a lot of code, but I've deliberately avoided factoring out the index-finding functions to "show my working" a bit more, and of course the plotting functions are there just to demonstrate.
Hopefully this gives you at least a good starting point to come up with something more suitable to your particular set.
Late to the party, but for anyone coming across this question in the future...
Eye movement data looks very similar to this; I'd base an approach off that used by Nystrom + Holmqvist, 2010. Smooth the data using a Savitzky-Golay filter (scipy.signal.savgol_filter in scipy v0.14+) to get rid of some of the low-level noise while keeping the large peaks intact; the authors recommend using an order of 2 and a window size of about twice the width of the smallest peak you want to be able to detect.
You can find where the bands are by arbitrarily removing all values above a certain y value (set them to numpy.nan). Then take the (nan)mean and (nan)standard deviation of the remainder, and remove all values greater than the mean + [parameter]*std (I think they use 6 in the paper). Iterate until you're not removing any data points; depending on your data, certain values of [parameter] may not stabilise.
Then use numpy.isnan() to find events vs non-events, and numpy.diff() to find the start and end of each event (values of -1 and 1 respectively). To get even more accurate start and end points, you can scan along the data backward from each start and forward from each end to find the nearest local minimum which has a value smaller than mean + [another parameter]*std (I think they use 3 in the paper). Then you just need to count the data points between each start and end.
This won't work for that double peak; you'd have to do some extrapolation for that.
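A rough sketch of that smoothing-plus-iterative-threshold idea (the window length, the single threshold of 6, and skipping the initial arbitrary cut and the refinement pass are all simplifications on my part):
import numpy as np
from scipy.signal import savgol_filter

def find_bands(projection, window=31, threshold=6.0):
    # window must be odd; it and the threshold would need tuning to the data
    work = savgol_filter(projection, window_length=window, polyorder=2).astype(float)
    while True:
        cutoff = np.nanmean(work) + threshold * np.nanstd(work)
        with np.errstate(invalid="ignore"):   # nan comparisons are expected here
            above = work > cutoff
        if not above.any():
            break
        work[above] = np.nan                  # peel off the peaks, re-estimate the baseline
    events = np.isnan(work).astype(int)
    starts = np.where(np.diff(events) == 1)[0] + 1
    ends = np.where(np.diff(events) == -1)[0] + 1
    return list(zip(starts, ends))            # each pair spans one band; width = end - start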
The best method might be to statistically compare a bunch of methods with human results.
You would take a large variety of data and a large variety of measurement estimates (widths at various thresholds, area above various thresholds, different threshold selection methods, 2nd moments, polynomial curve fits of various degrees, pattern matching, etc.) and compare these estimates to human measurements of the same data set. Pick the estimate method that correlates best with expert human results. Or maybe pick several methods, the best one for each of various heights, for various separations from other peaks, etc.
