Overlapping coefficient using scipy quad not working as expected - python

I am trying to find the overlapping area of two scipy.stats skewnorm distributions, which I have generated using the code below:
import scipy.integrate
from scipy.stats import skewnorm
a1 =1
loc1 = 0
scale1 = 1
a2=2
loc2=0
scale2=1
print(scipy.integrate.quad(
    lambda x: min(skewnorm.pdf(x, a1, loc=loc1, scale=scale1),
                  skewnorm.pdf(x, a2, loc=loc2, scale=scale2)),
    -10, 10))
Output: (0.8975836176504333, 8.065277615563445e-10)
However, changing the limits significantly affects my results:
print(scipy.integrate.quad(
    lambda x: min(skewnorm.pdf(x, a1, loc=loc1, scale=scale1),
                  skewnorm.pdf(x, a2, loc=loc2, scale=scale2)),
    -1, 1))
Output: (0.341344746068543, 3.789687964201238e-15)
How can I determine the limits to be used?
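For what it's worth, one way to sidestep choosing limits by hand is to let quad integrate over the whole real line, or to derive finite limits from extreme quantiles of both distributions. A minimal sketch (the 1e-9 tail cutoff is an arbitrary choice):

import numpy as np
from scipy import integrate
from scipy.stats import skewnorm

overlap = lambda x: min(skewnorm.pdf(x, 1), skewnorm.pdf(x, 2))

# Integrate over the whole real line; quad accepts infinite limits
print(integrate.quad(overlap, -np.inf, np.inf))

# Or clip to quantiles that cover essentially all of both supports
lo = min(skewnorm.ppf(1e-9, 1), skewnorm.ppf(1e-9, 2))
hi = max(skewnorm.ppf(1 - 1e-9, 1), skewnorm.ppf(1 - 1e-9, 2))
print(integrate.quad(overlap, lo, hi))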

Related

How to use linear regression to create a "calibration curve" class in python for experimental activities?

I'm a newbie in python and would like to create a module with a class named "calibration curve" to help save time during my lab experimental activities.
My goal is to just load my measurements into python with pd.read_excel and obtain a "calibration curve" resulting from the linear regression of my data points.
My measurements are typically in this form: usually I have two different measurements of different things (A and B), in duplicate (A1, A2; B1, B2) or in triplicate (A1, A2, A3; B1, B2, B3).
Time    Measurement A1    Measurement A2    Measurement B1    Measurement B2
0       2.451             2.480             3.01              3.01
1       2.102             2.09              3.31              3.02
2       1.850             1.844             3.2               2.9
3       1.200             NaN               3.4               3.2
4       0.999             1.001             2.9               3.01
I typically have to make some calculations on these data, like the ratio between measurement B1 and measurement A1, the ratio between measurement B2 and measurement A2, and then the mean of these ratios, etc. I usually obtain something like this:
(image of the final result)
Then I need to calculate the linear regression of mean vs. time and find slope, intercept, rsquared, etc.
I would like to store all of these values as attributes of an object so I can recall them when I need them. This is how it should work:
calibration_curve_instrument1 = calibration_curve("datapoints.xlsx", intercept=0)
print(calibration_curve_instrument1.intercept) --> 0
print(calibration_curve_instrument1.slope) --> 0.44
print(calibration_curve_instrument1.rsquared) --> 0.98
I wrote some code, but I don't know how to overcome these issues:
- I don't know how to set the intercept equal to 0 (sometimes I need to make this assumption);
- how to make it "smart" (for instance, not raising an error if there are 3 measurement columns instead of 2, namely A1, A2, A3; B1, B2, B3);
- how to avoid referring to the column names in the code, in order to adapt to different situations (in case, for instance, the first column is not named "time" but "velocity").
I've seen that there are many libraries to solve these problems but, being new to python, I'm wondering which is best in this situation: sklearn.linear_model, numpy.polyfit, or scipy.stats.linregress?
I tried to define a class called calibration_curve that reads .xlsx data points and sets intercept, slope, and rsquared values as attributes.
import numpy as np
import pandas as pd
import scipy.stats

def ratio_A(row):
    return row[3] / row[1]

def ratio_B(row):
    return row[4] / row[2]

def ratio_mean(row):
    return np.nanmean([row[5], row[6]])

class calibration_curve():
    def __init__(self, xlsx):
        self.raw = pd.read_excel(xlsx)
        self.input = self.raw.copy()
        self.raw["ratio_A"] = self.raw.apply(ratio_A, axis="columns")
        self.raw["ratio_B"] = self.raw.apply(ratio_B, axis="columns")
        self.raw["mean"] = self.raw.apply(ratio_mean, axis="columns")
        # linregress takes x first, then y; note its third return value is r, not r**2
        self.slope, self.intercept, self.rsquared, self.p, self.std_err = \
            scipy.stats.linregress(self.raw["Time"], self.raw["mean"])
It worked but, as I said, I can't force intercept = 0 with scipy.stats.linregress, and I'm pretty sure there are better ways to solve this problem than this one. For instance, if the dataset is in triplicate (A1, A2, A3; B1, B2, B3), this wouldn't work, because I defined the functions ratio_A and ratio_B by column index. I also had to refer to the column names (self.raw["mean"], self.raw["Time"]), so if the first column has a different name (like "velocity") this wouldn't work either.
I hope I was clear. Any kind of suggestion for facing these kinds of problems is appreciated! Thanks a lot.
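As a starting point for the intercept question: a zero-intercept least-squares fit has a closed form, so it can be done without any extra library. This is only a minimal sketch under my own naming (fit_through_origin is not a standard function, and the r-squared shown is just one of several conventions for no-intercept fits):

import numpy as np

def fit_through_origin(x, y):
    # Least-squares slope of y = m*x with the intercept forced to 0
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))   # skip NaNs, mirroring np.nanmean
    x, y = x[keep], y[keep]
    slope = np.sum(x * y) / np.sum(x * x)
    residuals = y - slope * x
    rsquared = 1 - np.sum(residuals**2) / np.sum((y - np.mean(y))**2)
    return slope, rsquared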

How to interpolate a line between two other lines in python

Note: I asked this question before, but it was closed as a duplicate. However, I, along with several others, believe it was unduly closed; I explain why in an edit in my original post. So I would like to re-ask this question here again.
Does anyone know of a python library that can interpolate between two lines? For example, given the two solid lines below, I would like to produce the dashed line in the middle; in other words, I'd like to get the centreline. The input is just two numpy arrays of coordinates, of size N x 2 and M x 2 respectively.
Furthermore, I'd like to know if someone has written a function for this in some optimized python library, although optimization isn't strictly necessary.
Here is an example of two lines that I might have. You can assume they do not overlap with each other, and that a given x can map to multiple y values (and vice versa).
array([[ 1233.87375018, 1230.07095987],
[ 1237.63559365, 1253.90749041],
[ 1240.87500801, 1264.43925132],
[ 1245.30875975, 1274.63795396],
[ 1256.1449357 , 1294.48254424],
[ 1264.33600095, 1304.47893299],
[ 1273.38192911, 1313.71468591],
[ 1283.12411536, 1322.35942538],
[ 1293.2559388 , 1330.55873344],
[ 1309.4817002 , 1342.53074698],
[ 1325.7074616 , 1354.50276051],
[ 1341.93322301, 1366.47477405],
[ 1358.15898441, 1378.44678759],
[ 1394.38474581, 1390.41880113]])
array([[ 1152.27115094, 1281.52899302],
[ 1155.53345506, 1295.30515742],
[ 1163.56506781, 1318.41642169],
[ 1168.03497425, 1330.03181319],
[ 1173.26135672, 1341.30559949],
[ 1184.07110925, 1356.54121651],
[ 1194.88086178, 1371.77683353],
[ 1202.58908737, 1381.41765447],
[ 1210.72465255, 1390.65097106],
[ 1227.81309742, 1403.2904646 ],
[ 1244.90154229, 1415.92995815],
[ 1261.98998716, 1428.56945169],
[ 1275.89219696, 1438.21626352],
[ 1289.79440676, 1447.86307535],
[ 1303.69661656, 1457.50988719],
[ 1323.80994319, 1470.41028655],
[ 1343.92326983, 1488.31068591],
[ 1354.31738934, 1499.33260989],
[ 1374.48879779, 1516.93734053],
[ 1394.66020624, 1534.54207116]])
Visualizing this, we have: (plot of the two input lines)
So my attempt at this has been to use the skeletonize function from the skimage.morphology library, after first rasterizing the coordinates into a filled-in polygon. However, I get branching at the ends, like this: (plot of the skeleton with branched ends)
First of all, pardon the overkill; I had fun with your question. If the description is too long, feel free to skip to the bottom, I defined a function that does everything I describe.
Your problem would be relatively straightforward if your arrays were the same length. In that case, all you would have to do is find the average between the corresponding x values in each array, and the corresponding y values in each array.
So what we can do is create arrays of the same length that are more or less good estimates of your original arrays. We can do this by fitting a polynomial to the arrays you have. As noted in comments and other answers, the midline of your original arrays is not specifically defined, so a good estimate should fulfill your needs.
Note: In all of these examples, I've gone ahead and named the two arrays that you posted a1 and a2.
Step one: Create new arrays that estimate your old lines
Looking at the data you posted: (plot of the two lines)
These aren't particularly complicated curves; it looks like a 3rd degree polynomial would fit them pretty well. We can create those fits using numpy:
import numpy as np
# Find the range of x values in a1
min_a1_x, max_a1_x = min(a1[:,0]), max(a1[:,0])
# Create an evenly spaced array that ranges from the minimum to the maximum
# I used 100 elements, but you can use more or fewer.
# This will be used as your new x coordinates
new_a1_x = np.linspace(min_a1_x, max_a1_x, 100)
# Fit a 3rd degree polynomial to your data
a1_coefs = np.polyfit(a1[:,0],a1[:,1], 3)
# Get your new y coordinates from the coefficients of the above polynomial
new_a1_y = np.polyval(a1_coefs, new_a1_x)
# Repeat for array 2:
min_a2_x, max_a2_x = min(a2[:,0]), max(a2[:,0])
new_a2_x = np.linspace(min_a2_x, max_a2_x, 100)
a2_coefs = np.polyfit(a2[:,0],a2[:,1], 3)
new_a2_y = np.polyval(a2_coefs, new_a2_x)
The result: (plot of the polynomial estimates overlaid on the originals)
That's not so bad! If you have more complicated functions, you'll have to fit a higher degree polynomial, or find some other adequate function to fit to your data.
Now, you've got two sets of arrays of the same length (I chose a length of 100, you can do more or less depending on how smooth you want your midpoint line to be). These sets represent the x and y coordinates of the estimates of your original arrays. In the example above, I named these new_a1_x, new_a1_y, new_a2_x and new_a2_y.
Step two: calculate the average between each x and each y in your new arrays
Then, we want to find the average x and average y value for each of our estimate arrays. Just use np.mean:
midx = [np.mean([new_a1_x[i], new_a2_x[i]]) for i in range(100)]
midy = [np.mean([new_a1_y[i], new_a2_y[i]]) for i in range(100)]
midx and midy now represent the midpoint between our 2 estimate arrays. Now, just plot your original (not estimate) arrays, alongside your midpoint array:
plt.plot(a1[:,0], a1[:,1],c='black')
plt.plot(a2[:,0], a2[:,1],c='black')
plt.plot(midx, midy, '--', c='black')
plt.show()
And voilà: (plot of the original arrays with the dashed midpoint line)
This method still works with more complex, noisy data (but you have to fit the function thoughtfully): (plot of a noisier example)
As a function:
I've put the above code in a function, so you can use it easily. It returns an array of your estimated midpoints, in the format you had your original arrays in.
The arguments: a1 and a2 are your 2 input arrays, poly_deg is the degree polynomial you want to fit, n_points is the number of points you want in your midpoint array, and plot is a boolean, whether you want to plot it or not.
import matplotlib.pyplot as plt
import numpy as np

def interpolate(a1, a2, poly_deg=3, n_points=100, plot=True):
    min_a1_x, max_a1_x = min(a1[:,0]), max(a1[:,0])
    new_a1_x = np.linspace(min_a1_x, max_a1_x, n_points)
    a1_coefs = np.polyfit(a1[:,0], a1[:,1], poly_deg)
    new_a1_y = np.polyval(a1_coefs, new_a1_x)

    min_a2_x, max_a2_x = min(a2[:,0]), max(a2[:,0])
    new_a2_x = np.linspace(min_a2_x, max_a2_x, n_points)
    a2_coefs = np.polyfit(a2[:,0], a2[:,1], poly_deg)
    new_a2_y = np.polyval(a2_coefs, new_a2_x)

    midx = [np.mean([new_a1_x[i], new_a2_x[i]]) for i in range(n_points)]
    midy = [np.mean([new_a1_y[i], new_a2_y[i]]) for i in range(n_points)]

    if plot:
        plt.plot(a1[:,0], a1[:,1], c='black')
        plt.plot(a2[:,0], a2[:,1], c='black')
        plt.plot(midx, midy, '--', c='black')
        plt.show()

    return np.array([[x, y] for x, y in zip(midx, midy)])
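A quick usage sketch on the arrays from the question (assuming they are bound to a1 and a2 as above):

midline = interpolate(a1, a2, poly_deg=3, n_points=100, plot=True)
print(midline.shape)  # (100, 2), the same N x 2 layout as the inputs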
[EDIT]:
I was thinking back on this question and realized I had overlooked a simpler way to do this: "densifying" both arrays to the same number of points using np.interp. This method follows the same basic idea as the line-fitting method above, but instead of approximating lines with polyfit / polyval, it just densifies:
min_a1_x, max_a1_x = min(a1[:,0]), max(a1[:,0])
min_a2_x, max_a2_x = min(a2[:,0]), max(a2[:,0])
new_a1_x = np.linspace(min_a1_x, max_a1_x, 100)
new_a2_x = np.linspace(min_a2_x, max_a2_x, 100)
new_a1_y = np.interp(new_a1_x, a1[:,0], a1[:,1])
new_a2_y = np.interp(new_a2_x, a2[:,0], a2[:,1])
midx = [np.mean([new_a1_x[i], new_a2_x[i]]) for i in range(100)]
midy = [np.mean([new_a1_y[i], new_a2_y[i]]) for i in range(100)]
plt.plot(a1[:,0], a1[:,1],c='black')
plt.plot(a2[:,0], a2[:,1],c='black')
plt.plot(midx, midy, '--', c='black')
plt.show()
The "line between two lines" is not so well defined. You can obtain a decent though simple solution by triangulating between the two curves (you can triangulate by progressing from vertex to vertex, choosing the diagonals that produce the less skewed triangle).
Then the interpolated curve joins the middles of the sides.
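This is only a rough sketch of that idea, using the shorter diagonal as a stand-in for "less skewed" (the function name and the tie-breaking rule are my own assumptions):

import numpy as np

def midline_by_triangulation(a1, a2):
    # Greedy triangulation between two polylines: advance along whichever
    # curve yields the shorter connecting diagonal, collecting midpoints.
    i, j = 0, 0
    mids = [(a1[0] + a2[0]) / 2]
    while i < len(a1) - 1 or j < len(a2) - 1:
        if i == len(a1) - 1:        # a1 exhausted, advance along a2
            j += 1
        elif j == len(a2) - 1:      # a2 exhausted, advance along a1
            i += 1
        elif np.linalg.norm(a1[i + 1] - a2[j]) < np.linalg.norm(a2[j + 1] - a1[i]):
            i += 1                  # shorter diagonal: step along a1
        else:
            j += 1                  # shorter diagonal: step along a2
        mids.append((a1[i] + a2[j]) / 2)
    return np.array(mids)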
I work with rivers, so this is a common problem. One of my solutions is exactly like the one you showed in your question--i.e. skeletonize the blob. You see that the boundaries have problems, so what I've done that seems to work well is to simply mirror the boundaries. For this approach to work, the blob must not intersect the corners of the image.
You can find my implementation in RivGraph; this particular algorithm is in rivers/river_utils.py called "mask_to_centerline".
Here's an example output showing how the ends of the centerline extend to the desired edge of the object: (example output image)
sacuL's solution almost worked for me, but I needed to aggregate more than just two curves.
Here is my generalization for sacuL's solution:
def interp(*axis_list):
    min_max_xs = [(min(axis[:,0]), max(axis[:,0])) for axis in axis_list]
    new_axis_xs = [np.linspace(min_x, max_x, 100) for min_x, max_x in min_max_xs]
    new_axis_ys = [np.interp(new_x_axis, axis[:,0], axis[:,1])
                   for axis, new_x_axis in zip(axis_list, new_axis_xs)]
    midx = [np.mean([new_axis_xs[axis_idx][i] for axis_idx in range(len(axis_list))])
            for i in range(100)]
    midy = [np.mean([new_axis_ys[axis_idx][i] for axis_idx in range(len(axis_list))])
            for i in range(100)]
    for axis in axis_list:
        plt.plot(axis[:,0], axis[:,1], c='black')
    plt.plot(midx, midy, '--', c='black')
    plt.show()
If we now run an example:
a1 = np.array([[x, x**2+5*(x%4)] for x in range(10)])
a2 = np.array([[x-0.5, x**2+6*(x%3)] for x in range(10)])
a3 = np.array([[x+0.2, x**2+7*(x%2)] for x in range(10)])
interp(a1, a2, a3)
we get the plot: (plot of the three input curves and their aggregated midline)

Visualization of large combination of groups using pandas

I have a data frame with a structure similar to
dt, ing_net, egs_net, ing_ip, egs_ip, avg_pkt, sum_time
2017-01-01, A2, A1, 10.100.0.0, 22.54.23.0, 12.1, 123
2017-01-01, B2, A1, 10.100.1.0, 22.54.23.0, 12.1, 982
2017-01-01, B2, A2, 10.0.1.0, 22.54.13.0, 92.1, 692
...
2017-06-31, A2, B8, 65.200.0.0, 33.0.23.0, 12.7, 99887
and the possible number of combinations between ing_net and egs_net is 250. How can I visualize one of the value variables, say avg_pkt, for all possible combinations? I'm looking for a visual aid or approach to search for outliers.
Seaborn FacetGrid cannot plot all the graphs:
g = sns.FacetGrid(df, row='egs_net', col='ing_net')
g.map(sns.distplot, 'avg_pkt')
# all plots are extremely small and have the wrong x datetime axis
Doing a groupby on the pandas dataframe generates all graphs, but not in a grid:
for name, group in df_avg.groupby(['egs_net', 'ing_net']):
    group.plot(x='dt', y='avg_pkt', title='{} - {}'.format(name[0], name[1]),
               figsize=(7, 5), subplots=True)
and a Holoviews HoloMap bails out because of the number of intermediate graphs:
hv_df = hv.Dataset(df.reset_index(), kdims=['dt', 'egs_net', 'ing_net'])
hv_df.to(hv.Curve, vdims=['avg_pkt'])
How can I explore this phase space?
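One way to get an overview before drilling into per-pair time series might be to collapse the time axis and render the whole 250-combination grid as a single heatmap, where outlying pairs show up as extreme cells. A sketch, assuming df is the frame described above:

import matplotlib.pyplot as plt
import seaborn as sns

# Mean avg_pkt per (egress, ingress) pair, shown as one heatmap
pivot = df.pivot_table(index='egs_net', columns='ing_net',
                       values='avg_pkt', aggfunc='mean')
sns.heatmap(pivot, cmap='viridis')
plt.show()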

Scipy convolve2d with subsampling like Theano's conv2d?

I wish to perform 2D convolution on images of size 600 x 400 using a 10 x 10 filter. The filter is not separable. scipy.signal.convolve2d works well for me currently, but I am expecting much bigger images soon.
To counter that, I have two ideas:
- resizing the images
- subsampling (or striding)?
Focusing on the subsampling part: theano has a function which does convolution the same way as scipy's convolve2d (see theano conv2d), and it also has a subsampling option. But installing theano on windows has been painful for me. How do I get subsampling to work with scipy.signal.convolve2d? Are there any other alternatives (which don't require installing some heavyweight library)?
You could implement subsampling by hand; I'll only sketch the 1d case for simplicity. Say you want to sample s = d * f on a regular subgrid with spacing k. Then your nth sample is s_{nk} = sum_{i=0}^{10} f_i d_{nk-i}. The thing to observe here is that the indices of f and d always sum to a multiple of k. This suggests splitting it up into sub-sums: s_{nk} = sum_{j=0}^{k-1} sum_{i=0}^{10/k} f_{j+ik} d_{(n-i)k-j}. So what you need to do is: subsample d and f on grids with spacing k at all offsets 0, ..., k-1; convolve all pairs of subsampled d and f whose offsets sum to 0 or k; and add the results.
Here's some code for 1d. It roughly implements the above, only the grids are placed slightly differently to make index management easier. The second function does it the stupid way, i.e. computes the full convolution and then decimates. It is for testing the first function against.
import numpy as np
from scipy import signal

def ss_conv(d1, d2, decimate):
    n = (len(d1) + len(d2) - 1) // decimate
    out = np.zeros((n,))
    for i in range(decimate):
        d1d = d1[i::decimate]
        d2d = d2[decimate-i-1::decimate]
        cv = signal.convolve(d1d, d2d, 'full')
        out[:len(cv)] += cv
    return out

def conv_ss(d1, d2, decimate):
    return signal.convolve(d1, d2, 'full')[decimate-1::decimate]
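A quick consistency check between the two (a sketch; the random data and the decimation factor are arbitrary):

rng = np.random.default_rng(0)
d = rng.standard_normal(600)
f = rng.standard_normal(10)

fast = ss_conv(d, f, 3)
slow = conv_ss(d, f, 3)
n = min(len(fast), len(slow))
print(np.allclose(fast[:n], slow[:n]))  # True, up to edge effects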
Edit: 2d version:
import numpy as np
from scipy import signal

def ss_conv_2d(d1, d2, decy, decx):
    ny = (d1.shape[0] + d2.shape[0] - 1) // decy
    nx = (d1.shape[1] + d2.shape[1] - 1) // decx
    out = np.zeros((ny, nx))
    for i in range(decy):
        for j in range(decx):
            d1d = d1[i::decy, j::decx]
            d2d = d2[decy-i-1::decy, decx-j-1::decx]
            cv = signal.convolve2d(d1d, d2d, 'full')
            out[:cv.shape[0], :cv.shape[1]] += cv
    return out

def conv_ss_2d(d1, d2, decy, decx):
    return signal.convolve2d(d1, d2, 'full')[decy-1::decy, decx-1::decx]
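The same kind of check carries over to 2d (again just a sketch):

img = np.random.default_rng(1).standard_normal((60, 40))
ker = np.ones((10, 10))
a = ss_conv_2d(img, ker, 2, 2)
b = conv_ss_2d(img, ker, 2, 2)
print(np.allclose(a[:b.shape[0], :b.shape[1]], b))  # True, up to edge effects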

Scipy odeint giving index out of bounds errors

I am trying to solve a differential equation in python using Scipy's odeint function. The equation is of the form dy/dt = w(t), where w(t) = w1*(1 + A*sin(w2*t)) for some parameters w1, w2, and A. The code I've written works for some parameters, but for others I get index out of bounds errors.
Here's some example code that works
import numpy as np
import scipy.integrate as integrate
t = np.arange(1000)
w1 = 2*np.pi
w2 = 0.016*np.pi
A = 1.0
w = w1*(1+A*np.sin(w2*t))
def f(y, t0):
    return w[t0]
y = integrate.odeint(f,0,t)
Here's some example code that doesn't work
import numpy as np
import scipy.integrate as integrate
t = np.arange(1000)
w1 = 0.3*np.pi
w2 = 0.005*np.pi
A = 0.15
w = w1*(1+A*np.sin(w2*t))
def f(y, t0):
    return w[t0]
y = integrate.odeint(f,0,t)
The only thing that changes between these is that the three parameters w1, w2, and A are smaller in the second, but the second one always gives me the following error:
line 13, in f
return w[t0]
IndexError: index 1001 is out of bounds for axis 0 with size 1000
This error persists even after restarting python and running the second code first. I've tried other parameters: some seem to work, but others give me different index out of bounds errors. Some say 1001 is out of bounds, some say 1000, some say 1008, etc.
Changing the initial condition on y (the second input to odeint, which I set to 0 in the code above) also changes the number in the index error, so it might be that I'm misunderstanding what to put here. I wasn't told what the initial condition should be, other than that y is used as the phase of a signal, so I presumed it to be initially 0.
What you want to do is:
def w(t):
    return w1*(1+A*np.sin(w2*t))

def f(y, t0):
    return w(t0)
Array indices are typically integers, while the time arguments and solution values of differential equations are typically real numbers, so there is some conceptual difficulty in invoking w[t0]. Moreover, odeint evaluates f at times of its own choosing, including points slightly beyond the requested interval, which is why the out-of-bounds index varies.
You might also try to integrate the function w directly; there is no inherent difficulty in this example.
As for coupled systems, you solve them as coupled systems.
def w(t):
    return w1*(1+A*np.sin(w2*t))

def f(y, t):
    wt = w(t)
    return np.array([wt, wt*np.sin(y[1]-y[0])])
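Putting this together with the failing parameters from the question, a minimal sketch of the fix:

import numpy as np
import scipy.integrate as integrate

w1 = 0.3*np.pi
w2 = 0.005*np.pi
A = 0.15

def w(t):
    return w1*(1 + A*np.sin(w2*t))

def f(y, t0):
    return w(t0)   # evaluate at the real-valued time instead of indexing

t = np.arange(1000)
y = integrate.odeint(f, 0, t)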
