I'm able to calculate a rolling correlation coefficient for a 1D-array (data against [0, 1, 2, 3, 4]) using a loop.
I'm looking for a smarter solution using numpy (not pandas).
Here is my current code:
import numpy as np
data = np.array([10,5,8,9,15,22,26,11,15,16,18,7,4,8,-2,-3,-4,-6,-2,0,10,0,5,8])
x = np.zeros_like(data).astype('float32')
length = 5
for i in range(length, data.shape[0]):
    x[i] = np.corrcoef(data[i - length:i], np.arange(length))[0, 1]
print(x)
x gives:
[ 0. 0. 0. 0. 0. 0.607 0.959 0.98 0.328 -0.287
-0.61 -0.314 -0.18 -0.8 -0.782 -0.847 -0.811 -0.825 -0.869 -0.283
0.566 0.863 0.643 0.454]
Any solution without the loop please?
Use numpy.lib.stride_tricks.sliding_window_view (available in numpy v1.20.0+):
swindow = np.lib.stride_tricks.sliding_window_view(data, (length,))
which gives a view on the data array that looks like so:
array([[10, 5, 8, 9, 15],
[ 5, 8, 9, 15, 22],
[ 8, 9, 15, 22, 26],
[ 9, 15, 22, 26, 11],
[15, 22, 26, 11, 15],
[22, 26, 11, 15, 16],
[26, 11, 15, 16, 18],
[11, 15, 16, 18, 7],
[15, 16, 18, 7, 4],
[16, 18, 7, 4, 8],
[18, 7, 4, 8, -2],
[ 7, 4, 8, -2, -3],
[ 4, 8, -2, -3, -4],
[ 8, -2, -3, -4, -6],
[-2, -3, -4, -6, -2],
[-3, -4, -6, -2, 0],
[-4, -6, -2, 0, 10],
[-6, -2, 0, 10, 0],
[-2, 0, 10, 0, 5],
[ 0, 10, 0, 5, 8]])
Now, we want to apply the correlation coefficient calculation to each row of this array. Unfortunately, np.corrcoef doesn't take an axis argument: it treats each row of its input as a separate variable and computes the full pairwise correlation matrix, with no way to do the calculation row-wise against a single target vector.
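A quick illustration of that behavior, using a small made-up array:

import numpy as np

A = np.arange(12).reshape(3, 4)  # three rows, each treated as a separate variable
b = np.arange(4)

# b is stacked as one more variable, so the result is the full
# (3 + 1) x (3 + 1) pairwise correlation matrix, not one value per row
print(np.corrcoef(A, b).shape)  # (4, 4)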
However, the correlation coefficient of two vectors x and y is quite simple to compute directly:

r = sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))**2) * sum((y - mean(y))**2))

Applying that here:
def vec_corrcoef(X, y, axis=1):
    Xm = np.mean(X, axis=axis, keepdims=True)
    ym = np.mean(y)
    # Row-wise numerator and denominator of the Pearson correlation
    n = np.sum((X - Xm) * (y - ym), axis=axis)
    d = np.sqrt(np.sum((X - Xm)**2, axis=axis) * np.sum((y - ym)**2))
    return n / d
Now, call this function with our array and arange:
cc = vec_corrcoef(swindow, np.arange(length))
which gives the desired result:
array([ 0.60697698, 0.95894955, 0.98 , 0.3279521 , -0.28709766,
-0.61035663, -0.31390158, -0.17995394, -0.80041656, -0.78192905,
-0.84702587, -0.81091772, -0.82464375, -0.86892667, -0.28347335,
0.56568542, 0.86304424, 0.64326752, 0.45374261, 0.38135638])
To get your x, just set the appropriate indices of a zeros array of the correct size.
Note: I think your x should contain nonzero values starting at index 4 (because that's where the sliding window is first full) instead of starting at index 5.
x = np.zeros(data.shape)
x[-len(cc):] = cc
If you are sure that your values should start at index 5, then you can do:
x = np.zeros(data.shape)
x[length:] = cc[:-1] # Ignore the last value in cc
Comparing the runtimes of your original approach with those suggested in the answers here:
f_OP_loopy is your approach, which implements a sliding window using a loop
f_PH_numpy is my approach, which uses the sliding_window_view and the vectorized function for row-wise calculation of the vector correlation coefficient
f_RA_numpy is Rontogiannis's approach, which tiles the arange, calculates the correlation coefficient for the entire matrices, and only selects the first len(data) - length + 1 rows of the last column
f_RA_recur is Rontogiannis's recursive approach; I didn't time it because it misses the last correlation coefficient.
Unsurprisingly, the numpy-only solution is faster than the loopy approach.
My numpy solution, which computes the row-wise correlation coefficient, is faster than the one shown by Rontogiannis below. My approach avoids the extra work of tiling the vector input and calculating the correlation of the entire matrix only to discard the unwanted elements.
As the input data size increases, this "extra work" in Rontogiannis's approach grows so much that its runtime becomes worse even than the loopy approach! I am unsure whether this extra time is spent in the np.corrcoef calculation or in the np.tile operation.
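One way to check (a rough sketch, with an arbitrary input size picked for illustration) is to time the two steps of f_RA_numpy separately:

import timeit
import numpy as np

data = np.random.randint(-100, 100, 2000)  # arbitrary size for illustration
length = 5
n = len(data)

win = np.lib.stride_tricks.sliding_window_view(data, length)
tiled = np.tile(np.arange(length), (n - length + 1, 1))

# Time the tiling step and the full-matrix corrcoef step in isolation
t_tile = timeit.timeit(lambda: np.tile(np.arange(length), (n - length + 1, 1)), number=5) / 5
t_corr = timeit.timeit(lambda: np.corrcoef(win, tiled), number=5) / 5
print(f"tile: {t_tile:.6f} s, corrcoef: {t_corr:.6f} s")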
Note: This plot was obtained on my 2.2 GHz i7 MacBook Air with 8 GB RAM, Python 3.10.7, and numpy 1.23.3. Similar results were obtained on Google Colab.
If you're interested in the timing code, here it is:
import timeit
import numpy as np
from matplotlib import pyplot as plt
def time_funcs(funcs, sizes, arg_gen, N=20):
    times = np.zeros((len(sizes), len(funcs)))
    gdict = globals().copy()
    for i, s in enumerate(sizes):
        args = arg_gen(s)
        print(args)
        for j, f in enumerate(funcs):
            gdict.update(locals())
            try:
                times[i, j] = timeit.timeit("f(*args)", globals=gdict, number=N) / N
                print(f"{i}/{len(sizes)}, {j}/{len(funcs)}, {times[i, j]}")
            except ValueError:
                print(f"ERROR in {f}, with args=", *args)
    return times

def plot_times(times, funcs):
    fig, ax = plt.subplots()
    for j, f in enumerate(funcs):
        ax.plot(sizes, times[:, j], label=f.__name__)
    ax.set_xlabel("Array size")
    ax.set_ylabel("Time per function call (s)")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.legend()
    ax.grid()
    fig.tight_layout()
    return fig, ax

#%%
def arg_gen(n):
    return [np.random.randint(-100, 100, (n,)), 5]

#%%
def f_OP_loopy(data, length):
    x = np.zeros_like(data).astype('float32')
    for i in range(length - 1, data.shape[0]):
        x[i] = np.corrcoef(data[i - length + 1:i + 1], np.arange(length))[0, 1]
    return x

def f_PH_numpy(data, length):
    swindow = np.lib.stride_tricks.sliding_window_view(data, (length,))
    cc = vec_corrcoef(swindow, np.arange(length))
    x = np.zeros(data.shape)
    x[-len(cc):] = cc
    return x

def f_RA_recur(data, length):
    return np.concatenate((
        np.zeros([length,]),
        rolling_correlation_recurse(data, 0, length)
    ))

def f_RA_numpy(data, length):
    n = len(data)
    cc = np.corrcoef(np.lib.stride_tricks.sliding_window_view(data, length),
                     np.tile(np.arange(length), (n - length + 1, 1)))[:n - length + 1, -1]
    x = np.zeros(data.shape)
    x[-len(cc):] = cc
    return x

#%%
def rolling_correlation_recurse(data, i, length):
    assert i + length < data.size
    left = np.array([np.corrcoef(data[i:i + length], np.arange(length))[0, 1]])
    if i + length + 1 == data.size:
        return left
    right = rolling_correlation_recurse(data, i + 1, length)
    return np.concatenate((left, right))

def vec_corrcoef(X, y, axis=1):
    Xm = np.mean(X, axis=axis, keepdims=True)
    ym = np.mean(y)
    n = np.sum((X - Xm) * (y - ym), axis=axis)
    d = np.sqrt(np.sum((X - Xm)**2, axis=axis) * np.sum((y - ym)**2))
    return n / d

#%%
if __name__ == "__main__":
    #%% Set up sim
    sizes = [5, 10, 50, 100, 500, 1000, 5000, 10_000]  # , 50_000, 100_000]
    funcs = [f_OP_loopy,  # f_RA_recur,
             f_PH_numpy, f_RA_numpy]

    #%% Run timing
    time_fcalls = np.zeros((len(sizes), len(funcs))) * np.nan
    time_fcalls = time_funcs(funcs, sizes, arg_gen)
    fig, ax = plot_times(time_fcalls, funcs)
    ax.set_xlabel("Input size")
    plt.show()
    input("Enter x to exit")
Ask and you shall receive. Here is a solution that uses recursion:
import numpy as np
data = np.array([10,5,8,9,15,22,26,11,15,16,18,7,4,8,-2,-3,-4,-6,-2,0,10,0,5,8])
length = 5
def rolling_correlation_recurse(data, i, length):
    assert i + length < data.size
    left = np.array([np.corrcoef(data[i:i + length], np.arange(length))[0, 1]])
    if i + length + 1 == data.size:
        return left
    right = rolling_correlation_recurse(data, i + 1, length)
    return np.concatenate((left, right))

def rolling_correlation(data, length):
    return np.concatenate((
        np.zeros([length,]),
        rolling_correlation_recurse(data, 0, length)
    ))

print(rolling_correlation(data, length))
Edit: here is a numpy solution too:
n = len(data)
print(np.corrcoef(
    np.lib.stride_tricks.sliding_window_view(data, length),
    np.tile(np.arange(length), (n - length + 1, 1))
)[:n - length + 1, -1])
I'm trying to make a Python app that shows a graph after the user inputs the data, but the problem is that the y_array and the x_array do not have the same dimensions. When I run the program, this error is raised:
ValueError: x and y must have same first dimension, but have shapes (7,) and (5,)
How can I draw a graph with the X and Y axis of different length?
Here is a minimal example code that will lead to the same error I got:
import matplotlib.pyplot as plt
y = [0, 8, 9, 3, 0]
x = [1, 2, 3, 4, 5, 6, 7]
plt.plot(x, y)
plt.show()
This is virtually a copy/paste of the answer found here, but I'll show what I did to get these to match.
First, we need to decide which array to use: the x_array of length 7 or the y_array of length 5. I'll show both, starting with the former. Note that I am using numpy arrays, not lists.
Let's load the modules
import numpy as np
import matplotlib.pyplot as plt
import scipy.interpolate as interp
and the arrays
y = np.array([0, 8, 9, 3, 0])
x = np.array([1, 2, 3, 4, 5, 6, 7])
In both cases, we use interp.interp1d which is described in detail in the documentation.
For the x_array to be reduced to the length of the y_array:
x_inter = interp.interp1d(np.arange(x.size), x)
x_ = x_inter(np.linspace(0,x.size-1,y.size))
print(len(x_), len(y))
# Prints: 5 5
plt.plot(x_,y)
plt.show()
This gives the plot for the shortened x_array. For the y_array to be increased to the length of the x_array:
y_inter = interp.interp1d(np.arange(y.size), y)
y_ = y_inter(np.linspace(0,y.size-1,x.size))
print(len(x), len(y_))
# Prints: 7 7
plt.plot(x,y_)
plt.show()
This gives the plot for the lengthened y_array.
I am running a binomial test and I cannot understand why these two methods give different results.
The probabilities from the two methods are different. When we calculate the two-tailed p-value, should we just double one of the tails?
from scipy.stats import binom
n, p = 50, 0.4
prob = binom.cdf(2, n, p)
first = 2*prob
from scipy import stats
second = stats.binom_test(2, n, p, alternative='two-sided')
When we calculate the two-tailed p-value, should we just double one of the tails?
No, because the binomial distribution is not, in general, symmetric. The one case where your calculation would work is p = 0.5.
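For instance, doubling one tail agrees with the two-sided test only in the symmetric case (a quick check, reusing the n and k from your code):

from scipy.stats import binom, binomtest

n, k = 50, 2

# Symmetric case p = 0.5: doubling the left tail matches the two-sided test
print(2 * binom.cdf(k, n, 0.5), binomtest(k, n, 0.5).pvalue)  # agree (up to rounding)

# Asymmetric case p = 0.4: the two quantities differ
print(2 * binom.cdf(k, n, 0.4), binomtest(k, n, 0.4).pvalue)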
Here's a visualization for the two-sided binomial test. For this demonstration, I'll use n=14 instead of n=50 to make the plot clearer.
The dashed line is drawn at the height of binom.pmf(2, n, p). The probabilities that contribute to the two-sided binomial test binom_test(2, n, p, alternative='two-sided') are those that are less than or equal to this value. In this example, we can see that the values of k where this is true are [0, 1, 2] (which is the left tail) and [10, 11, 12, 13, 14] (which is the right tail). The p-value of the two-sided binomial test should be the sum of these probabilities. And that is, in fact, what we find:
In [20]: binom.pmf([0, 1, 2, 10, 11, 12, 13, 14], n, p).sum()
Out[20]: 0.05730112258048004
In [21]: binom_test(2, n, p)
Out[21]: 0.05730112258047999
Note that scipy.stats.binom_test is deprecated. Users of SciPy 1.7.0 or later should use scipy.stats.binomtest instead:
In [36]: from scipy.stats import binomtest
In [37]: result = binomtest(2, n, p)
In [38]: result.pvalue
Out[38]: 0.05730112258047999
Here's the script to generate the plot:
import numpy as np
from scipy.stats import binom
import matplotlib.pyplot as plt
n = 14
p = 0.4
k = np.arange(0, n+1)
plt.plot(k, binom.pmf(k, n, p), 'o')
plt.xlabel('k')
plt.ylabel('pmf(k, n, p)')
plt.title(f'Binomial distribution with {n=}, {p=}')
ax = plt.gca()
ax.set_xticks(k[::2])
pmf2 = binom.pmf(2, n, p)
plt.axhline(pmf2, linestyle='--', color='k', alpha=0.5)
plt.grid(True)
plt.show()
I have two measurements consisting of x and y value pairs. I want to calculate the difference between these two series. The problem is that I cannot simply subtract them, because they are sampled at different x values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x1 = np.array([1, 2, 3, 4, 5])
y1 = np.array([1, 4, 9, 16, 25])
x2 = np.array([1.5, 2.5, 3.3, 4.2, 5.1])
y2 = np.array([1.3, 2.5, 3.3, 4.2, 5.1])
df = np.array([x1, y1, x2, y2])
df = pd.DataFrame(df.T, columns=['x1', 'y1', 'x2', 'y2'])
df.head()
plt.plot(df.x1.values, df.y1.values, df.x2.values, df.y2.values)
I would like to define a new variable x = np.linspace(0, 5, 100, endpoint=True) and then determine y1_new and y2_new by interpolating the y1 and y2 values onto the values of x.
I have looked at pandas.resample(), but that seems to work with timestamps. Maybe scipy.interpolate could help, but I am not sure about its capabilities. In principle, I know how to program this by hand in Python, but I am sure that there is already a solution to my problem.
An example using scipy.interpolate would be:
import scipy.interpolate as interp
import numpy as np
x1 = np.array([1, 2, 3, 4, 5])
y1 = np.array([1, 4, 9, 16, 25])
new_x1 = np.linspace(0, 5, 100, endpoint=True)
interpolated_1 = interp.interp1d(x1, y1, fill_value="extrapolate")
new_y1 = interpolated_1(new_x1)
new_y1
All the other methods follow more or less the same signature, as you can see in the docs. Which one to use depends on the underlying data; for example, the first series looks like a quadratic and the second like the identity.
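Applied to your example, a minimal sketch that interpolates both series onto your common grid and takes their difference (fill_value="extrapolate" is needed because the grid starts at 0, outside both measured ranges):

import numpy as np
import scipy.interpolate as interp

x1 = np.array([1, 2, 3, 4, 5])
y1 = np.array([1, 4, 9, 16, 25])
x2 = np.array([1.5, 2.5, 3.3, 4.2, 5.1])
y2 = np.array([1.3, 2.5, 3.3, 4.2, 5.1])

x = np.linspace(0, 5, 100, endpoint=True)

# Interpolate both measurements onto the shared grid, then subtract
y1_new = interp.interp1d(x1, y1, fill_value="extrapolate")(x)
y2_new = interp.interp1d(x2, y2, fill_value="extrapolate")(x)
diff = y1_new - y2_new  # pointwise difference on the common grid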
I am trying to do a piecewise linear regression in Python; the data is shown in the plot below.
I need to fit 3 lines, one for each section. Any idea how? I have the following code, but the resulting fit is shown below. Any help would be appreciated.
import numpy as np
import matplotlib
import matplotlib.cm as cm
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from scipy import optimize
def piecewise(x, x0, x1, y0, y1, k0, k1, k2):
    return np.piecewise(x, [x <= x0, np.logical_and(x0 < x, x < x1), x > x1],
                        [lambda x: k0*x + y0,
                         lambda x: k1*(x - x0) + y1 + k0*x0,
                         lambda x: k2*(x - x1) + y0 + y1 + k0*x0 + k1*(x1 - x0)])
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21], dtype=float)
y1 = np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47, 98.36, 112.25, 126.14, 140.03, 145, 147, 149, 151, 153, 155])
y1 = np.flip(y1, 0)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21], dtype=float)
y = np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47, 98.36, 112.25, 126.14, 140.03, 145, 147, 149, 151, 153, 155])
y = np.flip(y, 0)
perr_min = np.inf
p_best = None
for n in range(100):
    k = np.random.rand(7) * 20
    p, e = optimize.curve_fit(piecewise, x1, y1, p0=k)
    perr = np.sum(np.abs(y1 - piecewise(x1, *p)))
    if perr < perr_min:
        perr_min = perr
        p_best = p
xd = np.linspace(0, 21, 100)
plt.figure()
plt.plot(x1, y1, "o")
y_out = piecewise(xd, *p_best)
plt.plot(xd, y_out)
plt.show()
data with fit
Thanks.
A very simple method (without iteration and without an initial guess) can solve this problem.
The method of calculation comes from page 30 of this paper: https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf
The next figure (omitted here) shows the result.
The fitted function is a continuous three-segment linear function. Written with the Heaviside step function H, it has the form
y(x) = a1*x + b1 + (a2 - a1)*(x - x0)*H(x - x0) + (a3 - a2)*(x - x1)*H(x - x1)
where x0 and x1 are the breakpoints and a1, a2, a3 are the slopes of the three segments. The details of the numerical calculation are given in the referenced paper.
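For readers who want something runnable in the same spirit (an illustrative reconstruction, not the paper's exact algorithm): once the Heaviside form above is adopted, the model is linear in the parameters (b1, a1, c2, c3), with c2 = a2 - a1 and c3 = a3 - a2, for any fixed pair of breakpoints (x0, x1). A small grid search over candidate breakpoints combined with ordinary least squares therefore needs no nonlinear iteration and no initial guess. A minimal sketch:

import itertools
import numpy as np

# Data from the question
x = np.arange(1, 22, dtype=float)
y = np.flip(np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47,
                      98.36, 112.25, 126.14, 140.03, 145, 147, 149, 151, 153, 155]), 0)

def design(x, x0, x1):
    # Columns for y = b1 + a1*x + c2*(x - x0)*H(x - x0) + c3*(x - x1)*H(x - x1)
    return np.column_stack([np.ones_like(x), x,
                            np.maximum(x - x0, 0),   # (x - x0) * H(x - x0)
                            np.maximum(x - x1, 0)])  # (x - x1) * H(x - x1)

best_err, best_bp, best_coef = np.inf, None, None
# Grid-search the breakpoints; solve ordinary least squares for the rest
for x0, x1 in itertools.combinations(np.linspace(2, 20, 73), 2):
    A = design(x, x0, x1)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    err = np.sum((A @ coef - y) ** 2)
    if err < best_err:
        best_err, best_bp, best_coef = err, (x0, x1), coef

print("breakpoints:", best_bp, "coefficients (b1, a1, c2, c3):", best_coef)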