I'm able to calculate a rolling correlation coefficient for a 1D-array (data against [0, 1, 2, 3, 4]) using a loop.
I'm looking for a smarter solution using numpy (not pandas).
Here is my current code:
import numpy as np
data = np.array([10,5,8,9,15,22,26,11,15,16,18,7,4,8,-2,-3,-4,-6,-2,0,10,0,5,8])
x = np.zeros_like(data).astype('float32')
length = 5
for i in range(length, data.shape[0]):
    x[i] = np.corrcoef(data[i - length:i], np.arange(length))[0, 1]
print(x)
x gives:
[ 0. 0. 0. 0. 0. 0.607 0.959 0.98 0.328 -0.287
-0.61 -0.314 -0.18 -0.8 -0.782 -0.847 -0.811 -0.825 -0.869 -0.283
0.566 0.863 0.643 0.454]
Any solution without the loop please?
Use numpy.lib.stride_tricks.sliding_window_view (available in numpy v1.20.0+):
swindow = np.lib.stride_tricks.sliding_window_view(data, (length,))
which gives a view on the data array that looks like so:
array([[10, 5, 8, 9, 15],
[ 5, 8, 9, 15, 22],
[ 8, 9, 15, 22, 26],
[ 9, 15, 22, 26, 11],
[15, 22, 26, 11, 15],
[22, 26, 11, 15, 16],
[26, 11, 15, 16, 18],
[11, 15, 16, 18, 7],
[15, 16, 18, 7, 4],
[16, 18, 7, 4, 8],
[18, 7, 4, 8, -2],
[ 7, 4, 8, -2, -3],
[ 4, 8, -2, -3, -4],
[ 8, -2, -3, -4, -6],
[-2, -3, -4, -6, -2],
[-3, -4, -6, -2, 0],
[-4, -6, -2, 0, 10],
[-6, -2, 0, 10, 0],
[-2, 0, 10, 0, 5],
[ 0, 10, 0, 5, 8]])
Now, we want to apply the correlation coefficient calculation to each row of this array. Unfortunately, np.corrcoef doesn't take an axis argument: it treats each row of its input as a separate variable and computes the full pairwise correlation matrix, with no way to do the calculation row-wise against a single target vector.
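A quick illustration of that behavior, using a small made-up array:

import numpy as np

A = np.arange(12).reshape(3, 4)  # three rows, each treated as a separate variable
b = np.arange(4)

# b is stacked as one more variable, so the result is the full
# (3 + 1) x (3 + 1) pairwise correlation matrix, not one value per row
print(np.corrcoef(A, b).shape)  # (4, 4)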
However, the correlation coefficient of two vectors x and y is quite simple to compute directly:

r = sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))**2) * sum((y - mean(y))**2))

Applying that here:
def vec_corrcoef(X, y, axis=1):
    Xm = np.mean(X, axis=axis, keepdims=True)
    ym = np.mean(y)
    # Row-wise numerator and denominator of the Pearson correlation
    n = np.sum((X - Xm) * (y - ym), axis=axis)
    d = np.sqrt(np.sum((X - Xm)**2, axis=axis) * np.sum((y - ym)**2))
    return n / d
Now, call this function with our array and arange:
cc = vec_corrcoef(swindow, np.arange(length))
which gives the desired result:
array([ 0.60697698, 0.95894955, 0.98 , 0.3279521 , -0.28709766,
-0.61035663, -0.31390158, -0.17995394, -0.80041656, -0.78192905,
-0.84702587, -0.81091772, -0.82464375, -0.86892667, -0.28347335,
0.56568542, 0.86304424, 0.64326752, 0.45374261, 0.38135638])
To get your x, just set the appropriate indices of a zeros array of the correct size.
Note: I think your x should contain nonzero values starting at index 4 (because that's where the sliding window is first full) instead of starting at index 5.
x = np.zeros(data.shape)
x[-len(cc):] = cc
If you are sure that your values should start at index 5, then you can do:
x = np.zeros(data.shape)
x[length:] = cc[:-1] # Ignore the last value in cc
Comparing the runtimes of your original approach with those suggested in the answers here:
f_OP_loopy is your approach, which implements a sliding window using a loop
f_PH_numpy is my approach, which uses the sliding_window_view and the vectorized function for row-wise calculation of the vector correlation coefficient
f_RA_numpy is Rontogiannis's approach, which tiles the arange, calculates the correlation coefficient for the entire matrices, and only selects the first len(data) - length + 1 rows of the last column
f_RA_recur is Rontogiannis's recursive approach; I didn't time it because it misses the last correlation coefficient.
Unsurprisingly, the numpy-only solution is faster than the loopy approach.
My numpy solution, which computes the row-wise correlation coefficient, is faster than the one shown by Rontogiannis below. My approach avoids the extra work of tiling the vector input and calculating the correlation of the entire matrix only to discard the unwanted elements.
As the input data size increases, this "extra work" in Rontogiannis's approach grows so much that its runtime becomes worse even than the loopy approach! I am unsure whether this extra time is spent in the np.corrcoef calculation or in the np.tile operation.
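One way to check (a rough sketch, with an arbitrary input size picked for illustration) is to time the two steps of f_RA_numpy separately:

import timeit
import numpy as np

data = np.random.randint(-100, 100, 2000)  # arbitrary size for illustration
length = 5
n = len(data)

win = np.lib.stride_tricks.sliding_window_view(data, length)
tiled = np.tile(np.arange(length), (n - length + 1, 1))

# Time the tiling step and the full-matrix corrcoef step in isolation
t_tile = timeit.timeit(lambda: np.tile(np.arange(length), (n - length + 1, 1)), number=5) / 5
t_corr = timeit.timeit(lambda: np.corrcoef(win, tiled), number=5) / 5
print(f"tile: {t_tile:.6f} s, corrcoef: {t_corr:.6f} s")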
Note: This plot was obtained on my 2.2 GHz i7 MacBook Air with 8 GB RAM, Python 3.10.7, and numpy 1.23.3. Similar results were obtained on Google Colab.
If you're interested in the timing code, here it is:
import timeit
import numpy as np
from matplotlib import pyplot as plt
def time_funcs(funcs, sizes, arg_gen, N=20):
    times = np.zeros((len(sizes), len(funcs)))
    gdict = globals().copy()
    for i, s in enumerate(sizes):
        args = arg_gen(s)
        print(args)
        for j, f in enumerate(funcs):
            gdict.update(locals())
            try:
                times[i, j] = timeit.timeit("f(*args)", globals=gdict, number=N) / N
                print(f"{i}/{len(sizes)}, {j}/{len(funcs)}, {times[i, j]}")
            except ValueError:
                print(f"ERROR in {f}, with args=", *args)
    return times

def plot_times(times, funcs):
    fig, ax = plt.subplots()
    for j, f in enumerate(funcs):
        ax.plot(sizes, times[:, j], label=f.__name__)
    ax.set_xlabel("Array size")
    ax.set_ylabel("Time per function call (s)")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.legend()
    ax.grid()
    fig.tight_layout()
    return fig, ax

#%%
def arg_gen(n):
    return [np.random.randint(-100, 100, (n,)), 5]

#%%
def f_OP_loopy(data, length):
    x = np.zeros_like(data).astype('float32')
    for i in range(length - 1, data.shape[0]):
        x[i] = np.corrcoef(data[i - length + 1:i + 1], np.arange(length))[0, 1]
    return x

def f_PH_numpy(data, length):
    swindow = np.lib.stride_tricks.sliding_window_view(data, (length,))
    cc = vec_corrcoef(swindow, np.arange(length))
    x = np.zeros(data.shape)
    x[-len(cc):] = cc
    return x

def f_RA_recur(data, length):
    return np.concatenate((
        np.zeros([length,]),
        rolling_correlation_recurse(data, 0, length)
    ))

def f_RA_numpy(data, length):
    n = len(data)
    cc = np.corrcoef(np.lib.stride_tricks.sliding_window_view(data, length),
                     np.tile(np.arange(length), (n - length + 1, 1)))[:n - length + 1, -1]
    x = np.zeros(data.shape)
    x[-len(cc):] = cc
    return x

#%%
def rolling_correlation_recurse(data, i, length):
    assert i + length < data.size
    left = np.array([np.corrcoef(data[i:i + length], np.arange(length))[0, 1]])
    if i + length + 1 == data.size:
        return left
    right = rolling_correlation_recurse(data, i + 1, length)
    return np.concatenate((left, right))

def vec_corrcoef(X, y, axis=1):
    Xm = np.mean(X, axis=axis, keepdims=True)
    ym = np.mean(y)
    n = np.sum((X - Xm) * (y - ym), axis=axis)
    d = np.sqrt(np.sum((X - Xm)**2, axis=axis) * np.sum((y - ym)**2))
    return n / d

#%%
if __name__ == "__main__":
    #%% Set up sim
    sizes = [5, 10, 50, 100, 500, 1000, 5000, 10_000]  # , 50_000, 100_000]
    funcs = [f_OP_loopy,  # f_RA_recur,
             f_PH_numpy, f_RA_numpy]

    #%% Run timing
    time_fcalls = np.zeros((len(sizes), len(funcs))) * np.nan
    time_fcalls = time_funcs(funcs, sizes, arg_gen)
    fig, ax = plot_times(time_fcalls, funcs)
    ax.set_xlabel("Input size")
    plt.show()
    input("Enter x to exit")
Ask and you shall receive. Here is a solution that uses recursion:
import numpy as np
data = np.array([10,5,8,9,15,22,26,11,15,16,18,7,4,8,-2,-3,-4,-6,-2,0,10,0,5,8])
length = 5
def rolling_correlation_recurse(data, i, length):
    assert i + length < data.size
    left = np.array([np.corrcoef(data[i:i + length], np.arange(length))[0, 1]])
    if i + length + 1 == data.size:
        return left
    right = rolling_correlation_recurse(data, i + 1, length)
    return np.concatenate((left, right))

def rolling_correlation(data, length):
    return np.concatenate((
        np.zeros([length,]),
        rolling_correlation_recurse(data, 0, length)
    ))

print(rolling_correlation(data, length))
Edit: here is a numpy solution too:
n = len(data)
print(np.corrcoef(
    np.lib.stride_tricks.sliding_window_view(data, length),
    np.tile(np.arange(length), (n - length + 1, 1))
)[:n - length + 1, -1])
I'm trying to make a Python app that shows a graph after the user inputs the data, but the problem is that the y_array and the x_array do not have the same dimensions. When I run the program, this error is raised:
ValueError: x and y must have same first dimension, but have shapes (7,) and (5,)
How can I draw a graph with the X and Y axis of different length?
Here is a minimal example code that will lead to the same error I got:
import matplotlib.pyplot as plt
y = [0, 8, 9, 3, 0]
x = [1, 2, 3, 4, 5, 6, 7]
plt.plot(x, y)
plt.show()
This is virtually a copy/paste of the answer found here, but I'll show what I did to get these to match.
First, we need to decide which array to use: the x_array of length 7 or the y_array of length 5. I'll show both, starting with the former. Note that I am using numpy arrays, not lists.
Let's load the modules
import numpy as np
import matplotlib.pyplot as plt
import scipy.interpolate as interp
and the arrays
y = np.array([0, 8, 9, 3, 0])
x = np.array([1, 2, 3, 4, 5, 6, 7])
In both cases, we use interp.interp1d which is described in detail in the documentation.
For the x_array to be reduced to the length of the y_array:
x_inter = interp.interp1d(np.arange(x.size), x)
x_ = x_inter(np.linspace(0,x.size-1,y.size))
print(len(x_), len(y))
# Prints: 5 5
plt.plot(x_,y)
plt.show()
This gives the plot for the shortened x_array. For the y_array to be increased to the length of the x_array:
y_inter = interp.interp1d(np.arange(y.size), y)
y_ = y_inter(np.linspace(0,y.size-1,x.size))
print(len(x), len(y_))
# Prints: 7 7
plt.plot(x,y_)
plt.show()
This gives the plot for the lengthened y_array.
I am running a binomial test and I cannot understand why these two methods give different results.
The probabilities from the two methods are different. When we calculate the two-tailed p-value, should we just double one of the tails?
from scipy.stats import binom
n, p = 50, 0.4
prob = binom.cdf(2, n, p)
first = 2*prob
from scipy import stats
second = stats.binom_test(2, n, p, alternative='two-sided')
When we calculate the two-tailed p-value, should we just double one of the tails?
No, because the binomial distribution is not, in general, symmetric. The one case where your calculation would work is p = 0.5.
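For instance, doubling one tail agrees with the two-sided test only in the symmetric case (a quick check, reusing the n and k from your code):

from scipy.stats import binom, binomtest

n, k = 50, 2

# Symmetric case p = 0.5: doubling the left tail matches the two-sided test
print(2 * binom.cdf(k, n, 0.5), binomtest(k, n, 0.5).pvalue)  # agree (up to rounding)

# Asymmetric case p = 0.4: the two quantities differ
print(2 * binom.cdf(k, n, 0.4), binomtest(k, n, 0.4).pvalue)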
Here's a visualization for the two-sided binomial test. For this demonstration, I'll use n=14 instead of n=50 to make the plot clearer.
The dashed line is drawn at the height of binom.pmf(2, n, p). The probabilities that contribute to the two-sided binomial test binom_test(2, n, p, alternative='two-sided') are those that are less than or equal to this value. In this example, we can see that the values of k where this is true are [0, 1, 2] (which is the left tail) and [10, 11, 12, 13, 14] (which is the right tail). The p-value of the two-sided binomial test should be the sum of these probabilities. And that is, in fact, what we find:
In [20]: binom.pmf([0, 1, 2, 10, 11, 12, 13, 14], n, p).sum()
Out[20]: 0.05730112258048004
In [21]: binom_test(2, n, p)
Out[21]: 0.05730112258047999
Note that scipy.stats.binom_test is deprecated. Users of SciPy 1.7.0 or later should use scipy.stats.binomtest instead:
In [36]: from scipy.stats import binomtest
In [37]: result = binomtest(2, n, p)
In [38]: result.pvalue
Out[38]: 0.05730112258047999
Here's the script to generate the plot:
import numpy as np
from scipy.stats import binom
import matplotlib.pyplot as plt
n = 14
p = 0.4
k = np.arange(0, n+1)
plt.plot(k, binom.pmf(k, n, p), 'o')
plt.xlabel('k')
plt.ylabel('pmf(k, n, p)')
plt.title(f'Binomial distribution with {n=}, {p=}')
ax = plt.gca()
ax.set_xticks(k[::2])
pmf2 = binom.pmf(2, n, p)
plt.axhline(pmf2, linestyle='--', color='k', alpha=0.5)
plt.grid(True)
plt.show()
I have two measurements consisting of x and y value pairs. I want to calculate the difference between these two series. The problem is that I cannot simply subtract them, because they are sampled at different x values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x1 = np.array([1, 2, 3, 4, 5])
y1 = np.array([1, 4, 9, 16, 25])
x2 = np.array([1.5, 2.5, 3.3, 4.2, 5.1])
y2 = np.array([1.3, 2.5, 3.3, 4.2, 5.1])
df = np.array([x1, y1, x2, y2])
df = pd.DataFrame(df.T, columns=['x1', 'y1', 'x2', 'y2'])
df.head()
plt.plot(df.x1.values, df.y1.values, df.x2.values, df.y2.values)
I would like to define a new variable x = np.linspace(0, 5, 100, endpoint=True) and then determine y1_new and y2_new by interpolating the y1 and y2 values onto the values of x.
I have looked at pandas.resample(), but that seems to work with timestamps. Maybe scipy.interpolate could help, but I am not sure about its capabilities. In principle, I know how to program this by hand in Python, but I am sure that there is already a solution to my problem.
An example using scipy.interpolate would be:
import scipy.interpolate as interp
import numpy as np
x1 = np.array([1, 2, 3, 4, 5])
y1 = np.array([1, 4, 9, 16, 25])
new_x1 = np.linspace(0, 5, 100, endpoint=True)
interpolated_1 = interp.interp1d(x1, y1, fill_value="extrapolate")
new_y1 = interpolated_1(new_x1)
new_y1
All the other methods follow more or less the same signature, as you can see in the docs. Which one to use depends on the underlying data; for example, the first series looks like a quadratic and the second like the identity.
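Applied to your example, a minimal sketch that interpolates both series onto your common grid and takes their difference (fill_value="extrapolate" is needed because the grid starts at 0, outside both measured ranges):

import numpy as np
import scipy.interpolate as interp

x1 = np.array([1, 2, 3, 4, 5])
y1 = np.array([1, 4, 9, 16, 25])
x2 = np.array([1.5, 2.5, 3.3, 4.2, 5.1])
y2 = np.array([1.3, 2.5, 3.3, 4.2, 5.1])

x = np.linspace(0, 5, 100, endpoint=True)

# Interpolate both measurements onto the shared grid, then subtract
y1_new = interp.interp1d(x1, y1, fill_value="extrapolate")(x)
y2_new = interp.interp1d(x2, y2, fill_value="extrapolate")(x)
diff = y1_new - y2_new  # pointwise difference on the common grid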
I am trying to do a piecewise linear regression in Python; the data is shown in the plot below.
I need to fit 3 lines, one for each section. Any idea how? I have the following code, but the resulting fit is shown below. Any help would be appreciated.
import numpy as np
import matplotlib
import matplotlib.cm as cm
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from scipy import optimize
def piecewise(x, x0, x1, y0, y1, k0, k1, k2):
    return np.piecewise(x, [x <= x0, np.logical_and(x0 < x, x < x1), x > x1],
                        [lambda x: k0*x + y0,
                         lambda x: k1*(x - x0) + y1 + k0*x0,
                         lambda x: k2*(x - x1) + y0 + y1 + k0*x0 + k1*(x1 - x0)])
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21], dtype=float)
y1 = np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47, 98.36, 112.25, 126.14, 140.03, 145, 147, 149, 151, 153, 155])
y1 = np.flip(y1, 0)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21], dtype=float)
y = np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47, 98.36, 112.25, 126.14, 140.03, 145, 147, 149, 151, 153, 155])
y = np.flip(y, 0)
perr_min = np.inf
p_best = None
for n in range(100):
    k = np.random.rand(7) * 20
    p, e = optimize.curve_fit(piecewise, x1, y1, p0=k)
    perr = np.sum(np.abs(y1 - piecewise(x1, *p)))
    if perr < perr_min:
        perr_min = perr
        p_best = p
xd = np.linspace(0, 21, 100)
plt.figure()
plt.plot(x1, y1, "o")
y_out = piecewise(xd, *p_best)
plt.plot(xd, y_out)
plt.show()
data with fit
Thanks.
A very simple method (without iteration and without an initial guess) can solve this problem.
The method of calculation comes from page 30 of this paper: https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf
The next figure (omitted here) shows the result.
The fitted function is a continuous three-segment linear function. Written with the Heaviside step function H, it has the form
y(x) = a1*x + b1 + (a2 - a1)*(x - x0)*H(x - x0) + (a3 - a2)*(x - x1)*H(x - x1)
where x0 and x1 are the breakpoints and a1, a2, a3 are the slopes of the three segments. The details of the numerical calculation are given in the referenced paper.
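For readers who want something runnable in the same spirit (an illustrative reconstruction, not the paper's exact algorithm): once the Heaviside form above is adopted, the model is linear in the parameters (b1, a1, c2, c3), with c2 = a2 - a1 and c3 = a3 - a2, for any fixed pair of breakpoints (x0, x1). A small grid search over candidate breakpoints combined with ordinary least squares therefore needs no nonlinear iteration and no initial guess. A minimal sketch:

import itertools
import numpy as np

# Data from the question
x = np.arange(1, 22, dtype=float)
y = np.flip(np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47,
                      98.36, 112.25, 126.14, 140.03, 145, 147, 149, 151, 153, 155]), 0)

def design(x, x0, x1):
    # Columns for y = b1 + a1*x + c2*(x - x0)*H(x - x0) + c3*(x - x1)*H(x - x1)
    return np.column_stack([np.ones_like(x), x,
                            np.maximum(x - x0, 0),   # (x - x0) * H(x - x0)
                            np.maximum(x - x1, 0)])  # (x - x1) * H(x - x1)

best_err, best_bp, best_coef = np.inf, None, None
# Grid-search the breakpoints; solve ordinary least squares for the rest
for x0, x1 in itertools.combinations(np.linspace(2, 20, 73), 2):
    A = design(x, x0, x1)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    err = np.sum((A @ coef - y) ** 2)
    if err < best_err:
        best_err, best_bp, best_coef = err, (x0, x1), coef

print("breakpoints:", best_bp, "coefficients (b1, a1, c2, c3):", best_coef)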