Optimization on piecewise linear regression

I am trying to build a piecewise linear regression that achieves a lower MSE (mean squared error) than a single linear regression fitted directly. The method should use dynamic programming to evaluate the different segment sizes and combinations of groups and find the lowest overall MSE. I think the algorithm's runtime is O(n²), and I wonder whether there are ways to optimize it to O(n log n)?
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
import pandas as pd
import matplotlib.pyplot as plt
x = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8,3.8]
y = [26.2, 17.8, 31.3, 23.1, 27.5, 36.5, 14.1, 22.3, 19.6, 31.3, 24.0, 17.3, 43.2, 36.4, 26.1]
dataset = np.dstack((x,y))
dataset = dataset[0]
d_arg = np.argsort(dataset[:,0])
dataset = dataset[d_arg]
def calc_error(dataset):
    lr_model = linear_model.LinearRegression()
    x = pd.DataFrame(dataset[:,0])
    y = pd.DataFrame(dataset[:,1])
    lr_model.fit(x,y)
    predictions = lr_model.predict(x)
    mse = mean_squared_error(y, predictions)
    return mse
# n is the number of points, m is the number of groups, k is the minimum number of points in a group
# (15, 5, 3) returns [[3, 3, 3, 3, 3]]
# (15, 5, 2) returns [[2,4,3,3,3],[3,2,4,2,4],[4,2,3,3,3]....]
def all_combination(n,m,k):
    result = []
    if n < k*m:
        print('There are not enough elements to split.')
        return
    combination_bottom = [k for q in range(m)]
    #add greedy algorithm here?
    if n == k*m:
        result.append(combination_bottom.copy())
    else:
        combination_now = [combination_bottom.copy()]
        j = k*m+1
        while j < n+1:
            combination_last = combination_now.copy()
            combination_now = []
            for x in combination_last:
                for i in range (0, m):
                    combination_new = x.copy()
                    combination_new[i] = combination_new[i]+1
                    combination_now.append(combination_new.copy())
            j += 1
        else:
            for x in combination_last:
                for i in range (0, m):
                    combination_new = x.copy()
                    combination_new[i] = combination_new[i]+1
                    if combination_new not in result:
                        result.append(combination_new.copy())
    return result #2-d list
def calc_sum_error(dataset,cb):#cb = combination
    mse_sum = 0
    for n in range(0,len(cb)):
        if n == 0:
            low = 0
            high = cb[0]
        else:
            low = 0
            for i in range(0,n):
                low += cb[i]
            high = low + cb[n]
        mse_sum += calc_error(dataset[low:high])
    return mse_sum
#k is the number of points as a group
def best_piecewise(dataset,k):
    lenth = len(dataset)
    max_split = lenth // k
    min_mse = calc_error(dataset)
    split_cb = []
    all_cb = []
    for i in range(2, max_split+1):
        split_result = all_combination(lenth, i, k)
        all_cb += split_result
        for cb in split_result:
            tmp_mse = calc_sum_error(dataset,cb)
            if tmp_mse < min_mse:
                min_mse = tmp_mse
                split_cb = cb
    return min_mse, split_cb, all_cb
min_mse, split_cb, all_cb = best_piecewise(dataset, 2)
print('The best split of the data is '+str(split_cb))
print('The minimum MSE value is '+str(min_mse))
x = np.array(dataset[:,0])
y = np.array(dataset[:,1])
plt.plot(x,y,"o")
for n in range(0,len(split_cb)):
    if n == 0:
        low = 0
        high = split_cb[n]
    else:
        low = 0
        for i in range(0,n):
            low += split_cb[i]
        high = low + split_cb[n]
    x_tmp = pd.DataFrame(dataset[low:high,0])
    y_tmp = pd.DataFrame(dataset[low:high,1])
    lr_model = linear_model.LinearRegression()
    lr_model.fit(x_tmp,y_tmp)
    y_predict = lr_model.predict(x_tmp)
    plt.plot(x_tmp, y_predict, 'g-')
plt.show()
Please let me know if I didn't make it clear in any part.

It took me some time to realize that the problem you're describing is exactly what a decision tree regressor tries to solve.
Unfortunately, construction of an optimal decision tree is NP-hard [1], meaning that even with dynamic programming you can't bring the runtime down to anything like O(N log N).
The good news is that you can directly use any well-maintained decision tree implementation, for example DecisionTreeRegressor from the sklearn.tree module, and be confident of getting about the best runtime available, on the order of O(N log N). To enforce a minimum number of points per group, use the min_samples_leaf parameter. You can also control several other properties, such as the maximum number of groups with max_leaf_nodes, or optimization with respect to different loss functions using criterion.
If you're curious how scikit-learn's decision tree compares with the one learnt by your algorithm (i.e. split_cb in your code):
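from sklearn.tree import DecisionTreeRegressor
MIN_SIZE = 2  # minimum points per group, i.e. k in your best_piecewise(dataset, 2) call
X = np.array(x).reshape(-1,1)
dt = DecisionTreeRegressor(min_samples_leaf=MIN_SIZE).fit(X,y)
split_cb = np.unique(dt.apply(X),return_counts=True)[1]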
And then use the same plotting code you already have. Do note that since your time complexity is considerably higher than O(N log N)*, your implementation will often find better splits than scikit-learn's greedy algorithm.
[1] Hyafil, L., & Rivest, R. L. (1976). Constructing optimal binary decision trees is np-complete. Information Processing Letters, 5(1), 15–17
*Although I'm not sure about the exact time complexity of your implementation, it's almost certainly worse than O(N^2): all_combination(21,4,2) took more than 5 minutes.

Just so you know, this is a massive topic and there is no way to discuss all of it here. But I think we can make some good inroads and answer what you're looking for along the way.
Also, I think theory first works best because others may not be at the same point. Your code, which worked out of the box (hell yes!), suggests you already know what I'm about to say, but it leaves me with a couple of questions:
Why write in vanilla Python when it isn't needed and is much slower than NumPy, which you are already importing and using to some extent?
Does your example indicate that you don't fully understand the application of piecewise regression? Since we're starting with the theory first, this may be a bit of a non-issue.
Here's the thing about regression: It rarely models the data exactly and the closer it gets to perfectly accurate, the closer it gets to being overfit.
Your piecewise regressions are, with the exception of the first one, absolutely perfect. And they should be: two points make a line. So, in the example you provided, you've also given an example of overfitting the data and of what a brittle model looks like. Not sure if that's right? Consider what x values from 4.85 to 5.99 would return. How about 3.11 to 3.39?
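To make the "two points make a line" remark concrete, here is a minimal sketch that fits a line to the last two points of the sorted data from the question, using np.polyfit rather than the LinearRegression model from the question's code. It only illustrates why a two-point group always reports a training error of zero.

```python
import numpy as np

# The two largest-x points from the question's data.
x_pair = np.array([5.5, 6.1])
y_pair = np.array([36.5, 43.2])

# A degree-1 fit through two distinct points passes through both of them.
slope, intercept = np.polyfit(x_pair, y_pair, 1)
residuals = y_pair - (slope * x_pair + intercept)

print(np.allclose(residuals, 0.0))  # True
```

A zero error here says nothing about how well the line describes x values between or beyond the two points, which is the brittleness described above.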
Your example is on the left (or top), standard linear regression is on the right (or bottom):
The linear regression on the right gives us y values for the full range of x values, and ones that (presumably) continue on. The continuity of a function is exactly what we're seeking. With the other example, you can throw any number of tools at it, including a decision tree regressor, but you'll either get something similarly brittle or something that violates expectations. And then what? Toss it out because it's 'wrong'? Go with it because 'that's what the computer says'? Both are equally bad.
We could stop there. But this is a really good question and it would be a disservice to not go on. So let's start with two different datasets that we know are good candidates for piecewise regression.
iterations = 500
x = np.random.normal(0, 1, iterations) * 10
y = np.where(x < 0, -4 * x + 3 , np.where(x < 10, x + 48, x + 98)) + np.random.normal(0, 3, iterations)
plt.scatter(x, y, s = 3, color = 'k')
plt.show()
... which gives us the left image. I picked this for two reasons: it's continuous over the x-values but not over the y-values. The image on the right is from a really well-done R package; it is also continuous on the x-axis and has one clear break, but would still be best fit by three piecewise regressions. I'll mention a bit more about it later.
A couple of things to note:
One fairly obvious way to detect breakpoints is to look for places where the curve would pass a one-sided limit test but fail a two-sided one. That is, places where it's not differentiable.
Breakpoints in dummy data like this are going to be so much easier to identify because we're using code to develop them. But in real life, we will probably need a different solution. Let's set that aside for now.
One of the concerns I highlighted before was why you would write vanilla Python when other libraries that are specifically geared towards your question are so much faster. So, let's find out how much faster and what sort of answer you might find. And let's use the discontinuous torture test for good measure:
from scipy import optimize
def piecewise_linear(x, x0, x1, b, k1, k2, k3):
    condlist = [x < x0, (x >= x0) & (x < x1), x >= x1]
    funclist = [lambda x: k1*x + b, lambda x: k1*x + b + k2*(x-x0), lambda x: k1*x + b + k2*(x-x0) + k3*(x - x1)]
    return np.piecewise(x, condlist, funclist)
p, e = optimize.curve_fit(piecewise_linear, x, y)
xd = np.linspace(-30, 30, iterations)
plt.plot(x, y, "ko" )
plt.plot(xd, piecewise_linear(xd, *p))
Even in a fairly extreme case like this, we get a quick, robust answer which is probably not as pretty as we would like and takes some thought about if it's optimal or not. So, take a sec and consider the graph. Is it optimal (and why) or not (and why not)?
While we're at it, let's talk about time. Running %%timeit on the roll-your-own version (imports, data, plotting -- the whole thing) took:
10.8 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That was about 650 times longer than doing something similar (while additionally generating 500 random data points) with built-in NumPy and SciPy functions:
16.5 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
If that doesn't quite do it for you, that's very reasonable, because (and I'm sort of tipping my hand here) we would expect a piecewise linear regression to catch any and all discontinuous breaks. For that, let me refer you to this GitHub gist by datadog, since (a) there is no need to reinvent the wheel and (b) they have an interesting implementation. Along with the code is an accompanying blog post that addresses the key shortcoming of dynamic programming as well as their methodology and thinking:
While dynamic programming can be used to traverse this search space
much more efficiently than a naive brute-force implementation, it’s
still too slow in practice.
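To make the dynamic-programming idea concrete, here is a minimal sketch of it for the objective used in the question (the sum of per-group MSEs, with at least k points per group). It is not the datadog implementation and not the asker's code; the names make_segment_mse and best_split_dp are made up for illustration. Prefix sums give each group's least-squares error in O(1), so the whole search is O(n²) rather than an enumeration of all group-size combinations.

```python
import numpy as np

def make_segment_mse(x, y):
    """Return mse(i, j): MSE of the least-squares line through points i..j-1."""
    # Prefix sums let each segment's fit be evaluated in O(1).
    pad = lambda v: np.concatenate(([0.0], np.cumsum(v)))
    sx, sy, sxx, sxy, syy = pad(x), pad(y), pad(x * x), pad(x * y), pad(y * y)

    def mse(i, j):
        n = j - i
        mx, my = sx[j] - sx[i], sy[j] - sy[i]
        var_x = (sxx[j] - sxx[i]) - mx * mx / n
        var_y = (syy[j] - syy[i]) - my * my / n
        cov = (sxy[j] - sxy[i]) - mx * my / n
        # Residual sum of squares of the best line; a flat fit if all x are equal.
        sse = var_y - cov * cov / var_x if var_x > 1e-12 else var_y
        return max(sse, 0.0) / n

    return mse

def best_split_dp(x, y, k):
    """Minimise the sum of per-segment MSEs with every segment holding >= k points."""
    order = np.argsort(x)
    x = np.asarray(x, dtype=float)[order]
    y = np.asarray(y, dtype=float)[order]
    n = len(x)
    mse = make_segment_mse(x, y)
    best = np.full(n + 1, np.inf)      # best[j]: lowest cost for the first j points
    best[0] = 0.0
    prev = np.zeros(n + 1, dtype=int)  # prev[j]: start index of the last segment
    for j in range(k, n + 1):
        for i in range(0, j - k + 1):
            cand = best[i] + mse(i, j)
            if cand < best[j]:
                best[j], prev[j] = cand, i
    # Walk back through prev to recover the group sizes.
    sizes, j = [], n
    while j > 0:
        sizes.append(int(j - prev[j]))
        j = prev[j]
    return best[n], sizes[::-1]

cost, sizes = best_split_dp([0, 1, 2, 3, 4, 5], [0, 1, 2, 10, 8, 6], k=3)
print(sizes)  # [3, 3]
```

Because the objective rewards tiny groups, this sketch inherits the overfitting problem discussed earlier; it only shows that the search itself need not be brute force.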
Three last points.
If you tweak the random points to get a contiguous line, the results look even better but it's not perfect.
You can see why this can't all be addressed in one question. I haven't addressed things like curve fitting, Anscombe's quartet, splining, or using ML.
For now, there is no substitute for understanding what is going on. That said the R package MCP is really impressive in how it identifies inflection points using a Bayesian approach.

Related

Simple k-means algorithm in Python

The following is a very simple implementation of the k-means algorithm.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
DIM = 2
N = 2000
num_cluster = 4
iterations = 3
x = np.random.randn(N, DIM)
y = np.random.randint(0, num_cluster, N)
mean = np.zeros((num_cluster, DIM))
for t in range(iterations):
    for k in range(num_cluster):
        mean[k] = np.mean(x[y==k], axis=0)
    for i in range(N):
        dist = np.sum((mean - x[i])**2, axis=1)
        pred = np.argmin(dist)
        y[i] = pred
for k in range(num_cluster):
    plt.scatter(x[y==k,0], x[y==k,1])
plt.show()
Here are two example outputs the code produces:
The first example (num_cluster = 4) looks as expected. The second example (num_cluster = 11), however, shows only one cluster, which is clearly not what I wanted. Whether the code works depends on the number of classes I define and the number of iterations.
So far, I couldn't find the bug in the code. Somehow the clusters disappear but I don't know why.
Does anyone see my mistake?

You're getting one cluster because there really is only one cluster.
There's nothing in your code to avoid clusters disappearing, and the truth is that this will happen also for 4 clusters but after more iterations.
I ran your code with 4 clusters and 1000 iterations and they all got swallowed up in the one big and dominant cluster.
Think about it: your large cluster passes a critical point and just keeps growing, because other points are gradually becoming closer to it than to their previous mean.
This will not happen in the case that you reach an equilibrium (or stationary) point, in which nothing moves between clusters. But it's obviously a bit rare, and more rare the more clusters you're trying to estimate.
A clarification: The same thing can happen also when there are 4 "real" clusters and you're trying to estimate 4 clusters. But that would mean a rather nasty initialization and can be avoided by intelligently aggregating multiple randomly seeded runs.
There are also common "tricks" like taking the initial means to be far apart, or at the centers of different pre-estimated high density locations, etc. But that's starting to get involved, and you should read more deeply about k-means for that purpose.
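As one illustration of "taking the initial means to be far apart", here is a minimal sketch of farthest-first seeding: start from a random data point, then repeatedly add the point that is farthest from all means chosen so far. This is only one of several possible seeding schemes (it is not k-means++ and not what the code in the question does), and the function name farthest_first_init is made up for this example.

```python
import numpy as np

def farthest_first_init(x, num_cluster, rng):
    """Pick num_cluster rows of x as initial means, each far from those already chosen."""
    first = int(rng.integers(len(x)))
    chosen = [first]
    # Squared distance from every point to its nearest chosen mean so far.
    d2 = np.sum((x - x[first]) ** 2, axis=1)
    for _ in range(num_cluster - 1):
        nxt = int(np.argmax(d2))
        chosen.append(nxt)
        d2 = np.minimum(d2, np.sum((x - x[nxt]) ** 2, axis=1))
    return x[chosen].copy()

rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 2))
mean = farthest_first_init(x, 11, rng)  # shape (11, 2), usable as the starting means
```

Seeding like this tends to pick outliers, so it trades one sensitivity for another; it is meant to show the idea, not to be the best choice.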

K-means is also pretty sensitive to initial conditions. That said, k-means can and will drop clusters (but dropping to one is weird). In your code, you assign random clusters to the points.
Here's the problem: if I take several random subsamples of your data, they're going to have about the same mean point. Each iteration, the very similar centroids will be close to each other and more likely to drop.
Instead, I changed your code to pick num_cluster number of points in your data set to use as the initial centroids (higher variance). This seems to produce more stable results (didn't observe the dropping to one cluster behavior over several dozen runs):
import numpy as np
import matplotlib.pyplot as plt
DIM = 2
N = 2000
num_cluster = 11
iterations = 3
x = np.random.randn(N, DIM)
y = np.zeros(N)
# initialize clusters by picking num_cluster random points
# could improve on this by deliberately choosing most different points
for t in range(iterations):
    if t == 0:
        index_ = np.random.choice(range(N),num_cluster,replace=False)
        mean = x[index_]
    else:
        for k in range(num_cluster):
            mean[k] = np.mean(x[y==k], axis=0)
    for i in range(N):
        dist = np.sum((mean - x[i])**2, axis=1)
        pred = np.argmin(dist)
        y[i] = pred
for k in range(num_cluster):
    fig = plt.scatter(x[y==k,0], x[y==k,1])
plt.show()

It does seem that there are NaN's entering the picture.
Using seed=1 and iterations=2, the number of clusters reduces from the initial 4 to effectively 3. In the next iteration this technically plummets to 1.
The NaN mean coordinates of the problematic centroid then result in weird things. To rule out those problematic clusters which became empty, one (possibly a bit too lazy) option is to set the related coordinates to Inf, thereby making it a "more distant than any other" point than those still in the game (as long as the 'input' coordinates cannot be Inf).
The below snippet is a quick illustration of that and a few debug messages that I used to peek into what was going on:
[...]
for k in range(num_cluster):
    mean[k] = np.mean(x[y==k], axis=0)
    # print(mean[k])
    if any(np.isnan(mean[k])):
        # print("oh no!")
        mean[k] = [np.inf] * DIM
[...]
With this modification the posted algorithm seems to work in a more stable fashion (i.e. I couldn't break it so far).
Please also see the Quora link mentioned in the comments about the split opinions on this, and the book "The Elements of Statistical Learning"; the algorithm is not defined very explicitly there either in this respect.

scikit kmeans not accurate cost \ inertia

I want to get the k-means cost (inertia in scikit kmeans).
Just to remind:
The cost is the sum of squared distances from each point to the nearest cluster center.
I get a strange difference between the cost computed by scikit ('inertia')
and my own trivial way of computing the cost.
Please see the following example:
import numpy as np
from sklearn.cluster import KMeans
p = np.random.rand(1000000,2)
a = KMeans(n_clusters=3).fit(p)
print(a.inertia_, "****")
means = a.cluster_centers_
s = 0
for x in p:
    best = float("inf")
    for y in means:
        if np.linalg.norm(x-y)**2 < best:
            best = np.linalg.norm(x-y)**2
    s += best
print(s, "*****")
Where for my run the output is:
66178.4232156 ****
66173.7928716 *****
On my own dataset, the difference is more significant (20%).
Is this a bug in scikit's implementation?

First: it does not seem to be a bug (though it is certainly an ugly inconsistency). Why is that? You need to take a closer look at what the code is actually doing. For this general purpose it calls the Cython code from _k_means.pyx
(lines 577-578)
inertia = _k_means._assign_labels_array(
    X, x_squared_norms, centers, labels, distances=distances)
and what it does is essentially exactly your code, but using doubles in C. So maybe it is just a numerical issue? Let us test your code, but now with a clear cluster structure (so that there are no points which might be assigned to several centers depending on numerical accuracy).
import numpy as np
from sklearn.metrics import euclidean_distances
p = np.random.rand(1000000,2)
p[:p.shape[0]//2, :] += 100 #I move half of points far away
from sklearn.cluster import KMeans
a = KMeans(n_clusters=2).fit(p) #changed to two clusters
print(a.inertia_, "****")
means = a.cluster_centers_
s = 0
for x in p:
    best = float("inf")
    for y in means:
        d = (x-y).T.dot(x-y)
        if d < best:
            best = d
    s += best
print(s, "*****")
Results:
166805.190832 ****
166805.190946 *****
This makes sense. Thus the problem lies with the existence of samples "near the boundary", which might be assigned to different clusters depending on arithmetic accuracy. Unfortunately I was not able to trace exactly where the difference comes from.
The funny thing is that there is actually an inconsistency coming from the fact that the inertia_ field is filled by the Cython code, while .score calls the NumPy one. Thus if you call
print(-a.score(p))
you will get exactly your inertia.
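For reference, the cost as defined in the question (sum over points of the squared distance to the nearest center) can also be computed without Python loops. The sketch below uses plain NumPy broadcasting; kmeans_cost is a made-up helper name and is independent of scikit-learn's own inertia_ computation.

```python
import numpy as np

def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    # Squared distances between every point and every center, shape (n_points, n_centers).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
centers = np.array([[0.0, 1.0], [10.0, 0.0]])
print(kmeans_cost(points, centers))  # 2.0
```

With the fitted model from the question this would be called as kmeans_cost(p, a.cluster_centers_), which gives a third number to compare against inertia_ and the loop result.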

Fitting Parametric Curves in Python

I have experimental data of the form (X,Y) and a theoretical model of the form (x(t;*params),y(t;*params)) where t is a physical (but unobservable) variable, and *params are the parameters that I want to determine. t is a continuous variable, and there is a 1:1 relationship between x and t and between y and t in the model.
In a perfect world, I would know the value of T (the real-world value of the parameter) and would be able to do an extremely basic least-squares fit to find the values of *params. (Note that I am not trying to "connect" the values of x and y in my plot, like in 31243002 or 31464345.) I cannot guarantee that in my real data, the latent value T is monotonic, as my data is collected across multiple cycles.
I'm not very experienced doing curve fitting manually, and have to use extremely crude methods without easy access to a basic scipy function. My basic approach involves:
Choose some value of *params and apply it to the model
Take an array of t values and put it into the model to create an array of model(*params) = (x(*params),y(*params))
Interpolate X (the data values) into model to get Y_predicted
Run a least-squares (or other) comparison between Y and Y_predicted
Do it again for a new set of *params
Eventually, choose the best values for *params
There are several obvious problems with this approach.
1) I'm not experienced enough with coding to develop a very good "do it again" other than "try everything in the solution space," or maybe "try everything in a coarse grid" and then "try everything again in a slightly finer grid in the hotspots of the coarse grid." I tried MCMC methods, but I never found any optimum values, largely because of problem 2.
2) Steps 2-4 are super inefficient in their own right.
I've tried something like the code below (it resembles pseudo-code; the actual functions are made up). There are many minor quibbles that could be made about using broadcasting on A,B, but those are less significant than the problem of needing to interpolate for every single step.
People I know have recommended using some sort of Expectation Maximization algorithm, but I don't know enough about that to code one up from scratch. I'm really hoping there's some awesome scipy (or otherwise open-source) algorithm I haven't been able to find that covers my whole problem, but at this point I am not hopeful.
import numpy as np
import scipy as sci
from scipy import interpolate
# X_data and Y_data hold the measured values (definitions not shown)
def x(t,A,B):
    return A**t + B**t
def y(t,A,B):
    return A*t + B
def interp(A,B):
    ts = np.arange(-10,10,0.1)
    xs = x(ts,A,B)
    ys = y(ts,A,B)
    f = interpolate.interp1d(xs,ys)
    return f
N = 101
lsqs = np.zeros((N**2, 3)) #one row of (A, B, squares) per grid point
count = 0
for i in range(0,N):
    A = 0.1*i #checks A between 0 and 10
    for j in range(0,N):
        B = 10 + 0.1*j #checks B between 10 and 20
        f = interp(A,B)
        y_fit = f(X_data)
        squares = np.sum((y_fit - Y_data)**2)
        lsqs[count] = (A,B,squares) #puts the values in place for comparison later
        count += 1 #allows us to move to the next cell
i = np.argmin(lsqs[:,2])
A_optimal = lsqs[i][0]
B_optimal = lsqs[i][1]

If I understand the question correctly, the params are constants which are the same in every sample, but t varies from sample to sample. So, for example, maybe you have a whole bunch of points which you believe have been sampled from a circle
x = a+r cos(t)
y = b+r sin(t)
at different values of t.
In this case, what I would do is eliminate the variable t to get a relation between x and y -- in this case, (x-a)^2+(y-b)^2 = r^2. If your data fit the model perfectly, you would have (x-a)^2+(y-b)^2 = r^2 at each of your data points. With some error, you could still find (a,b,r) to minimize
sum_i ((x_i-a)^2 + (y_i-b)^2 - r^2)^2.
Mathematica's Eliminate command can automate the procedure of eliminating t in some cases.
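For the circle example, the sum above can be minimized without any iterative optimizer, because the residual (x-a)^2 + (y-b)^2 - r^2 is linear in (a, b, c) once you substitute c = r^2 - a^2 - b^2. The sketch below is one way to do that with NumPy's linear least squares; fit_circle is a made-up name, and the approach assumes the fitted c + a^2 + b^2 comes out non-negative, which it should for data that really lie near a circle.

```python
import numpy as np

def fit_circle(x, y):
    """Least-squares fit of (x-a)^2 + (y-b)^2 = r^2 via the substitution c = r^2 - a^2 - b^2."""
    # Residual: (x^2 + y^2) - 2*a*x - 2*b*y - c, which is linear in (a, b, c).
    M = np.column_stack((2 * x, 2 * y, np.ones_like(x)))
    rhs = x ** 2 + y ** 2
    (a, b, c), *_ = np.linalg.lstsq(M, rhs, rcond=None)
    r = np.sqrt(c + a ** 2 + b ** 2)
    return a, b, r

# Synthetic check: points sampled from a circle with centre (1, -2) and radius 3.
t = np.linspace(0, 2 * np.pi, 50, endpoint=False)
x = 1.0 + 3.0 * np.cos(t)
y = -2.0 + 3.0 * np.sin(t)
a, b, r = fit_circle(x, y)
```

Note that t never appears in the fit: only the (x, y) pairs are used, which is the point of eliminating it. For models where t cannot be eliminated in closed form, this shortcut does not apply.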
PS You might do better at stats.stackexchange, math.stackexchange or mathoverflow.net . I know the last one has a scary reputation, but we don't bite, really!

Python, Pairwise 'distance', need a fast way to do it

For a side project in my PhD, I engaged in the task of modelling some system in Python. Efficiency wise, my program hits a bottleneck in the following problem, which I'll expose in a Minimal Working Example.
I deal with a large number of segments encoded by their 3D beginning and endpoints, so each segment is represented by 6 scalars.
I need to calculate a pairwise minimal intersegment distance. The analytical expression of the minimal distance between two segments is found in this source. To the MWE:
import numpy as np
N_segments = 1000
List_of_segments = np.random.rand(N_segments, 6)
Pairwise_minimal_distance_matrix = np.zeros( (N_segments,N_segments) )
for i in range(N_segments):
    for j in range(i+1,N_segments):
        p0 = List_of_segments[i,0:3] #beginning point of segment i
        p1 = List_of_segments[i,3:6] #end point of segment i
        q0 = List_of_segments[j,0:3] #beginning point of segment j
        q1 = List_of_segments[j,3:6] #end point of segment j
        #for readability, some definitions
        a = np.dot( p1-p0, p1-p0)
        b = np.dot( p1-p0, q1-q0)
        c = np.dot( q1-q0, q1-q0)
        d = np.dot( p1-p0, p0-q0)
        e = np.dot( q1-q0, p0-q0)
        s = (b*e-c*d)/(a*c-b*b)
        t = (a*e-b*d)/(a*c-b*b)
        #the minimal distance between segment i and j
        Pairwise_minimal_distance_matrix[i,j] = np.sqrt(np.sum( (p0+(p1-p0)*s-(q0+(q1-q0)*t))**2)) #minimal distance
Now, I realize this is extremely inefficient, and this is why I am here. I have looked extensively into how to avoid the loop, but I run into a bit of a problem. Apparently, this sort of calculation is best done with SciPy's cdist. However, the custom distance functions it can handle have to be binary functions. This is a problem in my case, because my vectors specifically have a length of 6 and have to be split into their first and last 3 components. I don't think I can translate the distance calculation into a binary function.
Any input is appreciated.

You can use numpy's vectorization capabilities to speed up the calculation. My version computes all elements of the distance matrix at once and then sets the diagonal and the lower triangle to zero.
def pairwise_distance2(s):
    # we need this because we're gonna divide by zero
    old_settings = np.seterr(all="ignore")
    N = N_segments # just shorter, could also use len(s)
    # we repeat p0 and p1 along all columns
    p0 = np.repeat(s[:,0:3].reshape((N, 1, 3)), N, axis=1)
    p1 = np.repeat(s[:,3:6].reshape((N, 1, 3)), N, axis=1)
    # and q0, q1 along all rows
    q0 = np.repeat(s[:,0:3].reshape((1, N, 3)), N, axis=0)
    q1 = np.repeat(s[:,3:6].reshape((1, N, 3)), N, axis=0)
    # element-wise dot product over the last dimension,
    # while keeping the number of dimensions at 3
    # (so we can use them together with the p* and q*)
    a = np.sum((p1 - p0) * (p1 - p0), axis=-1).reshape((N, N, 1))
    b = np.sum((p1 - p0) * (q1 - q0), axis=-1).reshape((N, N, 1))
    c = np.sum((q1 - q0) * (q1 - q0), axis=-1).reshape((N, N, 1))
    d = np.sum((p1 - p0) * (p0 - q0), axis=-1).reshape((N, N, 1))
    e = np.sum((q1 - q0) * (p0 - q0), axis=-1).reshape((N, N, 1))
    # same as above
    s = (b*e-c*d)/(a*c-b*b)
    t = (a*e-b*d)/(a*c-b*b)
    # almost same as above
    pairwise = np.sqrt(np.sum( (p0 + (p1 - p0) * s - ( q0 + (q1 - q0) * t))**2, axis=-1))
    # turn the error reporting back on
    np.seterr(**old_settings)
    # set everything at or below the diagonal to 0
    pairwise[np.tril_indices(N)] = 0.0
    return pairwise
Now let's take it for a spin. With your example, N = 1000, I get a timing of
%timeit pairwise_distance(List_of_segments)
1 loops, best of 3: 10.5 s per loop
%timeit pairwise_distance2(List_of_segments)
1 loops, best of 3: 398 ms per loop
And of course, the results are the same:
(pairwise_distance2(List_of_segments) == pairwise_distance(List_of_segments)).all()
returns True. I'm also pretty sure there's a matrix multiplication hidden somewhere in the algorithm, so there should be some potential for further speedup (and also cleanup).
By the way: I've tried simply using numba first without success. Not sure why, though.
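As a follow-up on the "hidden matrix multiplication" remark, here is a hedged sketch of the same formula written with broadcasting and one matrix product instead of np.repeat. The function name pairwise_line_distance is made up; the only intent is to reproduce the a, b, c, d, e, s, t quantities from the question. Like the original loop, it does not clamp s and t to [0, 1], so it behaves exactly as the question's formula does.

```python
import numpy as np

def pairwise_line_distance(segments):
    """Upper-triangular matrix of distances, using the question's unclamped s, t formula."""
    n = len(segments)
    p0 = segments[:, :3]
    u = segments[:, 3:] - p0               # direction vectors p1 - p0, shape (n, 3)
    a = np.einsum('ij,ij->i', u, u)        # u_i . u_i
    b = u @ u.T                            # u_i . u_j  (the matrix multiplication)
    w = p0[:, None, :] - p0[None, :, :]    # p0_i - p0_j, shape (n, n, 3)
    d = np.einsum('ik,ijk->ij', u, w)      # u_i . (p0_i - p0_j)
    e = np.einsum('jk,ijk->ij', u, w)      # u_j . (p0_i - p0_j)
    A, C = a[:, None], a[None, :]
    denom = A * C - b * b
    # The diagonal (and parallel pairs) divide by zero; silence the warnings.
    with np.errstate(divide='ignore', invalid='ignore'):
        s = (b * e - C * d) / denom
        t = (A * e - b * d) / denom
    diff = w + u[:, None, :] * s[..., None] - u[None, :, :] * t[..., None]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    dist[np.tril_indices(n)] = 0.0         # keep only entries above the diagonal
    return dist

segments = np.random.default_rng(0).random((5, 6))
print(pairwise_line_distance(segments).shape)  # (5, 5)
```

The intermediate array w still has n*n*3 entries, so memory use grows quadratically with the number of segments, as it does in the np.repeat version.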

This is more of a meta answer, at least for starters. Your problem might already be in "my program hits a bottleneck" and "I realize this is extremely inefficient".
Extremely inefficient? By what measure? Do you have comparison? Is your code too slow to finish in a reasonable amount of time? What is a reasonable amount of time for you? Can you throw more computing power at the problem? Equally important -- do you use a proper infrastructure to run your code on (numpy/scipy compiled with vendor compilers, possibly with OpenMP support)?
Then, if you have answers for all of the questions above and need to further optimize your code -- where exactly is the bottleneck in your current code? Did you profile it? Is the body of the loop possibly much more heavy-weight than the evaluation of the loop itself? If so, then "the loop" is not your bottleneck and you do not need to worry about the nested loop in the first place. Optimize the body first, possibly by coming up with unorthodox matrix representations of your data so that you can perform all these single calculations in one step -- by matrix multiplication, for instance. If your problem is not solvable by efficient linear algebra operations, you can start writing a C extension or use Cython or use PyPy (which just very recently got some basic numpy support!). There are endless possibilities for optimizing -- the questions really are: how close to a practical solution are you already, how much do you need to optimize, and how much effort are you willing to invest?
Disclaimer: I have done non-canonical pairwise-distance stuff with scipy/numpy for my PhD, too ;-). For one particular distance metric, I ended up coding the "pairwise" part in simple Python (i.e. I also used the doubly-nested loop), but spent some effort on getting the body as efficient as possible (with a combination of i) a cryptic matrix multiplication representation of my problem and ii) using bottleneck).

You can use cdist with a custom function, something like this:
from scipy.spatial import distance
def distance3d (p, q):
    if (p == q).all ():
        return 0
    p0 = p[0:3]
    p1 = p[3:6]
    q0 = q[0:3]
    q1 = q[3:6]
    ... # Distance computation using the formula above.
print (distance.cdist (List_of_segments, List_of_segments, distance3d))
It doesn't seem to be any faster, though, since it executes the same loop internally.

Averaging unevenly sampled data

I have data which consist of the radial distance to the ground, sampled evenly every d_theta. I would like to do gaussian smoothing on it, but make the size of the smoothing window a constant in x, rather than be a constant number of points. What is a good way to do this?
I made a function to do it, but it is slow and I haven't even put in the parts that will calculate the edges yet.
If it helps to do it faster, I guess you can assume the floor is flat and use that to calculate how many points to sample, rather than using the actual x-values.
Here is what I have attempted so far:
import numpy as np
from scipy.signal.windows import gaussian # assumed source of gaussian(M, std)
bs = [gaussian(2*n-1,n/2) for n in range (1,500)] #bring the computation of the
bs = [b/b.sum() for b in bs] #gaussian outside to speed it up
def uneven_gauss_smoothing(xvals,yvals,sigma):
    newy = []
    for i, xval in enumerate (xvals):
        #find how big the window should be to have the chosen sigma
        #(or .5*sigma, whatever):
        wheres = np.where(xvals> xval + sigma )[0]
        iright = wheres[0] -i if len(wheres) else 100
        if i - iright < 0 :
            newy.append(0) #not implemented yet
            continue
        if i + iright >= len(xvals):
            newy.append(0) #not implemented
            continue
        else:
            #weighted average with gaussian curve:
            newy.append((yvals[i-iright:i+iright+1]*bs[iright]).sum())
    return np.array(newy)
Sorry it's a bit of a mess -- it was so incredibly frustrating to debug that I just ended up using the first solution (usually one which was difficult to read) that came to mind for some of the problems that popped up. But it does work in its limited way.
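One possible way to get a window that is constant in x, sketched here under the assumption that an O(n²) weight matrix fits in memory: weight every sample by a Gaussian of its actual x-distance to the target point and normalize each row. The name gauss_smooth_in_x is made up for this example and is not taken from any library. Edges need no special case, because each row is normalized over whatever neighbours exist.

```python
import numpy as np

def gauss_smooth_in_x(xvals, yvals, sigma):
    """Gaussian-weighted average of yvals, with the kernel width sigma measured in x."""
    xvals = np.asarray(xvals, dtype=float)
    yvals = np.asarray(yvals, dtype=float)
    # Pairwise x-distances between target points (rows) and samples (columns).
    dx = xvals[:, None] - xvals[None, :]
    w = np.exp(-0.5 * (dx / sigma) ** 2)
    # Normalise each row so the weights for one target point sum to 1.
    w /= w.sum(axis=1, keepdims=True)
    return w @ yvals

xvals = np.array([0.0, 0.1, 0.5, 2.0, 2.1, 5.0])
yvals = np.array([1.0, 1.2, 0.9, 3.0, 3.1, 7.0])
smoothed = gauss_smooth_in_x(xvals, yvals, sigma=0.5)
```

This weights samples rather than intervals, so regions where the points are dense in x pull the average more strongly; whether that matters depends on how uneven the sampling is.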
