Python polynomial regression values ceiling <0,1>

I am trying to find the curve that best describes my data. The data are stored in the numpy arrays t and dur, and both contain values only in the range 0-1. However, the best fit according to the R**2 score is this yellow line, with a score of 0.979388, which does not describe my data at all: it goes well above 1 on the Y axis, far from the expected values:
t = [1.0, 1.0, 1.0, 1.0, 1.0, 0.33695652173913043, 0.010869565217391304, 1.0, 0.018518518518518517, 1.0, 1.0, 1.0, 1.0, 1.0, 0.005076142131979695, 1.0, 1.0, 1.0, 1.0, 0.03225806451612903, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.25, 1.0]
dur = [1.0, 1.0, 1.0, 1.0, 0.9999999999999998, 0.2688679245283018, 0.2688679245283018, 1.0, 0.46692607003891046, 1.0, 1.0, 1.0, 1.0, 1.0, 0.4444444444444444, 1.0, 1.0, 1.0, 1.0, 0.34210526315789475, 1.0, 1.0, 1.0, 1.0, 1.0, 0.4714285714285715, 0.4714285714285715, 1.0]
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

# polynomial curve fitting
mymodel1 = np.poly1d(np.polyfit(t, dur, 1))
mymodel2 = np.poly1d(np.polyfit(t, dur, 2))
mymodel3 = np.poly1d(np.polyfit(t, dur, 3))
mymodel4 = np.poly1d(np.polyfit(t, dur, 4))

# polynomial R**2 scores
p1 = r2_score(dur, mymodel1(t))
p2 = r2_score(dur, mymodel2(t))
p3 = r2_score(dur, mymodel3(t))
p4 = r2_score(dur, mymodel4(t))

# collect R**2 results as a list of tuples from which I extract the best score
fit = [p1, p2, p3, p4]
fitname = ['p1', 'p2', 'p3', 'p4']
fitTuple = list(zip(fit, fitname))

resultValue = []
resultName = []
# append best result value
resultValue.append(max(fitTuple, key=lambda item: item[0])[0])
# append best result name
resultName.append(max(fitTuple, key=lambda item: item[0])[1])

# plot values from the regression models
myline = np.linspace(0, 1, 100)
plt.plot(myline, mymodel1(myline), color="black")
plt.plot(myline, mymodel2(myline), color="black")
plt.plot(myline, mymodel3(myline), color="black")
plt.plot(myline, mymodel4(myline), color="yellow")

This is what is called "overfitting". An overly complex model will usually have a very high R^2 and pass close to the training points, but it is clearly not the appropriate choice, as you can see when it is applied to new data, e.g. it doesn't interpolate well. Fitting high-degree polynomials is the standard textbook example of overfitting.
If you want to stick with polynomial models, think about the least complex model, i.e. in this case the lowest-degree polynomial, that you would still consider appropriate for your data. In your case, a quadratic seems OK.
For regression one usually employs more sophisticated methods, like those provided in scikit-learn, which can help you find the right model (e.g. via cross-validation) and also offer regularization techniques. For model selection, see here.
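For illustration, a minimal sketch of cross-validated degree selection with scikit-learn (the Pipeline/PolynomialFeatures setup and the choice of 5 folds are my additions, not from the question):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array(t).reshape(-1, 1)   # scikit-learn expects a 2-D feature array
y = np.array(dur)

for degree in (1, 2, 3, 4):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # mean cross-validated R^2 instead of the training-set R^2
    print(degree, cross_val_score(model, X, y, cv=5, scoring="r2").mean())
The degree with the best cross-validated score, rather than the best training-set R^2, is the one to prefer.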

Related

Fit data to integral using quad - magnetic hysteresis loop

I'm having trouble with a fit: depending on my start parameters it either fails to converge or gives a NaN error. I'm integrating with quad and fitting with lmfit. Any help is appreciated.
I'm fitting my data to a Langevin function, weighted by a log-normal distribution. Stackoverflow won't let me post an image of the function because of my reputation score, but it's in the code below.
I'm plugging in H (field) and fitting for Ms, Dm, and sigma, while mu_0, Msb, kb, and T are all constants.
Here's what I'm working with, using some example data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
from numpy import vectorize, sqrt, log, inf, exp, pi, tanh
from scipy.constants import k, mu_0
from lmfit import Parameters
from scipy.integrate import quad
x_data = [-7.0, -6.5, -6.0, -5.5, -5.0, -4.5, -4.0, -3.5, -3.0, -2.5, -2.0, -1.5, -1.0,
-0.95, -0.9, -0.85, -0.8, -0.75, -0.7, -0.65, -0.6, -0.55, -0.5, -0.45, -0.4,
-0.35, -0.3, -0.25, -0.2, -0.1,-0.05, 3e-6, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3,
0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0,
1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0]
y_data = [-61.6, -61.6, -61.6, -61.5, -61.5, -61.4, -61.3, -61.2, -61.1, -61.0, -60.8,
-60.4, -59.8, -59.8, -59.7, -59.5, -59.4, -59.3, -59.1, -58.9, -58.7, -58.4,
-58.1, -57.7, -57.2, -56.5, -55.6, -54.3, -52.2, -48.7, -41.8, -27.3, 2.6,
30.1, 43.1, 49.3, 52.6, 54.5, 55.8, 56.6, 57.3, 57.8, 58.2, 58.5, 58.7, 59.0,
59.1, 59.3, 59.5, 59.6, 59.7, 59.8, 59.9, 60.5, 60.8, 61.0, 61.2, 61.3, 61.4,
61.4, 61.5, 61.6, 61.6, 61.7, 61.7]
params = Parameters()
params.add('Dm' , value = 8e-9 , vary = True, min = 0, max = 1) # magnetic diameter (m)
params.add('s' , value = 0.4 , vary = True, min = 0.0, max = 10.0) # sigma, unitless
params.add('Ms' , value = 61.0 , vary = True) #, min = 30.0 , max = 100.0) # saturation magnetization (emu/g)
params.add('Msb', value = 446000 * 1e-16, vary = False) # Bulk magnetite saturation magnetization (A/m)
params.add('T' , value = 300 , vary = False) # Temperature (K)
def Mag(x_data, params):
    v = params.valuesdict()  # put parameters into a dictionary
    def numerator(D, x_data, params):
        # langevin
        a_numerator = pi * v['Msb'] * x_data * D**3
        a_denominator = 6*k*v['T']
        a = a_numerator / a_denominator
        langevin = (1/tanh(a)) - (1/a)
        # PDF
        exp_num = (log(D/v['Dm']))**2
        exp_denom = 2 * v['s']
        exponential = exp(-exp_num/exp_denom)
        pdf = exponential/(sqrt(2*pi) * v['s'] * D)
        return D**3 * langevin * pdf
    def denominator(D, params):
        # PDF
        exp_num = (log(D/v['Dm']))**2
        exp_denom = 2 * v['s']
        exponential = exp(-exp_num/exp_denom)
        pdf = exponential/(sqrt(2*pi) * v['s'] * D)
        return D**3 * pdf
    # return the ratio of the two integrals
    return v['Ms'] * quad(numerator, 0, inf, args=(x_data, params))[0] / quad(denominator, 0, inf, args=(params))[0]

# vectorize over x_data, excluding the params argument
vcurve = np.vectorize(Mag, excluded=set([1]))

plt.plot(x_data, vcurve(x_data, params))
plt.scatter(x_data, y_data)
This plots the data and the fit equation with start parameters. I have an issue somewhere with units in the Langevin and have to multiply the numerator by 1e-16 to get the curve looking correct...
from lmfit import minimize, Minimizer, Parameters, Parameter, report_fit
def fit_function(params, x_data, y_data):
    model1 = vcurve(x_data, params)
    resid1 = y_data - model1
    return resid1
minner = Minimizer(fit_function, params, fcn_args=(x_data, y_data))
result = minner.minimize()
report_fit(result)
result.params.pretty_print()
Depending on the sigma (s) value I choose, which should be able to range from 0 to infinity, the integral won't converge, giving the following error:
/var/folders/pz/tbd_dths0_512bm6l43vpg680000gp/T/ipykernel_68003/1413445460.py:39: IntegrationWarning: The algorithm does not converge. Roundoff error is detected
in the extrapolation table. It is assumed that the requested tolerance
cannot be achieved, and that the returned result (if full_output = 1) is
the best which can be obtained.
return v['Ms'] * quad(numerator, 0, inf, args=(x_data, params))[0] / quad(denominator, 0, inf,args=(params))[0]
I'm stuck on why the fit isn't converging. Is this an issue because I'm using very small numbers or is this an issue with quad/lmfit? Thank you!
Having parameters that are close to order 1 (say, between 1e-7 and 1e7) is a good idea. If you expect a parameter to be in the 1e-9 (or 1e-16!) range, you could definitely scale it (in the fitting function) so that the value passed back and forth by the fitting algorithm is closer to order 1. But I somewhat doubt that is the main problem you are having.
It looks to me like your Mag function is not very sensitive to the values of your variable parameters Dm and s. I am not 100% sure why that is. Have you verified that calculations using your Mag or vcurve do what you expect them to do?
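As a sketch of the scaling idea (the parameter name Dm_nm, the factor 1e-9, and the placeholder model body are my illustrative assumptions, not from the original code): let the optimizer vary a value of order 1 and rescale it to physical units only inside the model.
from lmfit import Parameters

params = Parameters()
# hypothetical: vary the magnetic diameter in nanometres (order 1) instead of metres
params.add('Dm_nm', value=8.0, vary=True, min=0.1, max=100.0)

def model(x, params):
    v = params.valuesdict()
    Dm = v['Dm_nm'] * 1e-9   # rescale to metres only inside the model
    # ... use Dm in the physical formula as before
    return Dm * x            # placeholder body, just to show where the rescaled value goes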

Difference between SimpleITK.Euler3DTransform and scipy.spatial.transform.Rotation.from_euler?

Using these two library functions:
SimpleITK.Euler3DTransform
scipy.spatial.transform.Rotation.from_euler
to create a simple rotation matrix from Euler Angles:
import numpy as np
import SimpleITK as sitk
from scipy.spatial.transform import Rotation
from math import pi
euler_angles = [pi / 10, pi / 18, pi / 36]
sitk_matrix = sitk.Euler3DTransform((0, 0, 0), *euler_angles).GetMatrix()
sitk_matrix = np.array(sitk_matrix).reshape((3,3))
print(np.array_str(sitk_matrix, precision=3, suppress_small=True))
order = 'XYZ' # Different results for any order in ['XYZ','XZY','YZX','YXZ','ZXY','ZYX','xyz','xzy','yzx','yxz','zxy','zyx']
scipy_matrix = Rotation.from_euler(order, euler_angles).as_matrix()
print(np.array_str(scipy_matrix, precision=3, suppress_small=True))
I get two different results:
[[ 0.976 -0.083 0.2 ]
[ 0.139 0.947 -0.288]
[-0.165 0.309 0.937]]
[[ 0.981 -0.086 0.174]
[ 0.136 0.943 -0.304]
[-0.138 0.322 0.937]]
Why? How can I compute the same matrix as SimpleITK using scipy?
The issue is that the itk.Euler3DTransform class by default applies the rotation matrix multiplications in Z @ X @ Y order, while Rotation.from_euler uses Z @ Y @ X order.
Note that this is independent of the order you specified. The order you specify refers to the order of the angles, not the order of the matrix multiplications.
If you are using itk.Euler3DTransform directly, as you showed in your example, you can actually change the default behavior of itk so that it performs the matrix multiplication in Z @ Y @ X order.
I have never worked with sitk, but in theory, based on the documentation, something like this should work:
euler_transform = sitk.Euler3DTransform((0, 0, 0), *euler_angles)
euler_transform.SetComputeZYX(True)
sitk_matrix = euler_transform.GetMatrix()
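As a quick check of the equivalence (my addition, assuming the multiplication orders described above are right): with ComputeZYX enabled, the SimpleITK matrix should match scipy's intrinsic 'ZYX' convention with the angles passed in reverse order.
import numpy as np
import SimpleITK as sitk
from scipy.spatial.transform import Rotation
from math import pi

euler_angles = [pi / 10, pi / 18, pi / 36]   # (rx, ry, rz), as in the question

euler_transform = sitk.Euler3DTransform((0, 0, 0), *euler_angles)
euler_transform.SetComputeZYX(True)
sitk_matrix = np.array(euler_transform.GetMatrix()).reshape((3, 3))

# intrinsic Z-Y-X with the angles given as (rz, ry, rx)
scipy_matrix = Rotation.from_euler('ZYX', euler_angles[::-1]).as_matrix()

print(np.allclose(sitk_matrix, scipy_matrix))   # expected: True if the orders match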
Alternatively, I wrote a function which is similar to Rotation.from_euler but has the option to specify the rotation order as well:
from typing import List
from numpy.typing import NDArray

def build_rotation_3d(radians: NDArray,
                      radians_order: str = 'XYZ',
                      rotation_order: str = 'ZYX',
                      dims: List[str] = ['X', 'Y', 'Z']) -> NDArray:
    # reorder the incoming angles so that x_rad, y_rad, z_rad are picked out correctly
    x_rad, y_rad, z_rad = radians[(np.searchsorted(dims, list(radians_order)))]
    x_cos, y_cos, z_cos = np.cos([x_rad, y_rad, z_rad], dtype=np.float64)
    x_sin, y_sin, z_sin = np.sin([x_rad, y_rad, z_rad], dtype=np.float64)
    x_rot = np.asarray([
        [1.0, 0.0, 0.0, 0.0],
        [0.0, x_cos, -x_sin, 0.0],
        [0.0, x_sin, x_cos, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ])
    y_rot = np.asarray([
        [y_cos, 0.0, y_sin, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [-y_sin, 0.0, y_cos, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ])
    z_rot = np.asarray([
        [z_cos, -z_sin, 0.0, 0.0],
        [z_sin, z_cos, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ])
    # pick the elementary rotations in the requested multiplication order
    rotations = np.asarray([x_rot, y_rot, z_rot])[(np.searchsorted(dims, list(rotation_order)))]
    return rotations[0] @ rotations[1] @ rotations[2]
What is your 'order' string? When I ran your code with order='xyz', I got the same results for SimpleITK and scipy's Rotation.

python (reverse) Interpolate assign a tenor-point value to two closest tenor point

I am looking to do a reverse type of (numpy) interpolation.
Consider the case where I have a 'risk' value of 2.2 that is mapped to a tenor point of 1.50.
Consider a tenor list (or array) = [0.5, 1.0, 2.0, 3.0, 5.0].
Now, I would like to attribute this risk value of 2.2 to the two closest tenor points (in this case 1.0 and 2.0), in the form of a linear interpolation.
In this example, the function should split the risk value of 2.2 (which is mapped to the expiry value of 1.50) into:
for the 1.0 tenor point: 2.2 * (1.5 - 1.0)/(2.0 - 1.0)
for the 2.0 tenor point: 2.2 * (2.0 - 1.5)/(2.0 - 1.0)
Is there numpy/scipy/pandas or Python code that would do this?
Thanks!
Well, I have attempted a slightly different approach, but maybe it helps you. I interpolate the values onto the new grid points using interpolate.interp1d (with fill_value="extrapolate" so the range can extend beyond the given interval). In your first example the new points were always internal, in the comment example also external, so I handled the more general case. This could still be polished, but it should give you an idea:
import numpy as np
from scipy import interpolate
def dist_val(vpt, arr):
    # indices of the two array entries closest to vpt
    dist = np.abs(arr - np.full_like(arr, vpt))
    i0 = np.argmin(dist)
    dist[i0] = np.max(dist) + 1
    i1 = np.argmin(dist)
    return (i0, i1)

def dstr_lin(ra, tnl, tnh):
    '''returns a risk-array like ra for tnh based on tnl'''
    if len(tnh) < len(tnl) or len(ra) != len(tnl):
        return -1
    rah = []
    for vh in tnh:
        try:
            # keep the value if the tenor already exists in the coarse grid
            rah.append((vh, ra[tnl.index(vh)]))
        except ValueError:
            # otherwise interpolate/extrapolate it
            rah.append((vh, float(interpolate.interp1d(tnl, ra, fill_value="extrapolate")(vh))))
    return rah
ra = [0.422, 1.053, 100.423, -99.53]
tn_low = [1.0, 2.0, 5.0, 10.0]
tn_high = [1.0, 2.0, 3.0, 5.0, 7.0, 10.0, 12.0, 15.0]
print(dstr_lin(ra, tn_low, tn_high))
this results in
[(1.0, 0.422), (2.0, 1.053), (3.0, 34.17633333333333), (5.0, 100.423), (7.0, 20.4418), (10.0, -99.53), (12.0, -179.51120000000003), (15.0, -299.483)]
Careful though: I am not sure how "well behaved" your data is; interpolation or extrapolation might swing out of range, so use with care.
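For the allocation described in the question itself (splitting one risk value between the two bracketing tenor points), a minimal sketch using the weights exactly as written in the question; the function name and the np.searchsorted lookup are my own additions:
import numpy as np

def split_risk(risk, point, tenors):
    """Distribute `risk` at `point` onto the two bracketing tenor points."""
    tenors = np.asarray(tenors, dtype=float)
    i = np.searchsorted(tenors, point)   # index of the first tenor >= point
    lo, hi = tenors[i - 1], tenors[i]
    w_lo = (point - lo) / (hi - lo)      # weights as written in the question
    w_hi = (hi - point) / (hi - lo)
    return {lo: risk * w_lo, hi: risk * w_hi}

print(split_risk(2.2, 1.50, [0.5, 1.0, 2.0, 3.0, 5.0]))
# {1.0: 1.1, 2.0: 1.1}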

Tensorflow Contrib Metrics always return 0.0

I tried to use the contrib metrics for the first time and didn't manage to make them work.
Here are the metrics I tried to use and how they were implemented:
y_pred_labels = y[:, 1]
y_true_labels = tf.cast(y_[:, 1], tf.int32)

with tf.name_scope('auc'):
    auc_score, update_op_auc = tf.contrib.metrics.streaming_auc(
        predictions=y_pred_labels,
        labels=y_true_labels
    )
    tf.summary.scalar('auc', auc_score)

with tf.name_scope('accuracy_contrib'):
    accuracy_contrib, update_op_acc = tf.contrib.metrics.streaming_accuracy(
        predictions=y_pred_labels,
        labels=y_true_labels
    )
    tf.summary.scalar('accuracy_contrib', accuracy_contrib)

with tf.name_scope('error_contrib'):
    error_contrib, update_op_error = tf.contrib.metrics.streaming_mean_absolute_error(
        predictions=y_pred_labels,
        labels=y_[:, 1]  ## Needs to use float32 and not int32
    )
    tf.summary.scalar('error_contrib', error_contrib)
This code executes without errors, and during execution I obtain the following:
########################################
Accuracy at step 1000: 0.633333 # This is computed by another metric, not displayed above
Accuracy Contrib at step 1000: (0.0, 0.0)
AUC Score at step 1000: (0.0, 0.0)
Error Contrib at step 1000: (0.0, 0.0)
########################################
Here is the format of the input data:
y_pred_labels = [0.1, 0.5, 0.6, 0.8, 0.9, 0.1, ...] #Represent a binary probability
y_true_labels = [1, 0, 1, 1, 1, 0, 0, ...] # Represent the true class {0 or 1}
y_[:, 1] = [1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, ...] # Same as y_true_labels formated as float32
From the official documentation, I think I've understood that this can be normal behavior under certain conditions ... However, I don't manage to obtain the values of my metrics.
Secondly, I have noticed two of the metrics are called streaming_accuracy and streaming_auc; how do these behave differently from a "non-streaming" accuracy or AUC metric? And is there any way to make them "non-streaming" if necessary?
I encountered the same problem just now and found out:
You need to run update_ops such as sess.run(update_op_auc), while running metric operations such as sess.run(auc_score).
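A minimal sketch of that pattern (the batch loop, feed_dict placeholders x/y_, and the iterator name are my illustrative assumptions; the metric and update op names follow the question). Streaming metrics keep their counters in local variables, so those have to be initialized and the update ops run on every batch before the metric values are read:
sess.run(tf.local_variables_initializer())   # streaming metrics use local variables

for batch_x, batch_y in batches:             # hypothetical batch iterator
    # running the update ops accumulates the streaming statistics
    sess.run([update_op_auc, update_op_acc, update_op_error],
             feed_dict={x: batch_x, y_: batch_y})

# only after the update ops have run do the metric tensors hold meaningful values
auc_val, acc_val, err_val = sess.run([auc_score, accuracy_contrib, error_contrib])
print(auc_val, acc_val, err_val)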

How do I stretch a list of floats in Python?

I'm working in Python and have a list of hourly values for a day. For simplicity let's say there are only 10 hours in a day.
[0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
I want to stretch this around the centre-point to 150% to end up with:
[0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
Note this is just an example and I will also need to stretch things by amounts that leave fractional amounts in a given hour. For example stretching to 125% would give:
[0.0, 0.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.5, 0.0, 0.0]
My first thought for handling the fractional amounts is to multiply the list up by a factor of 10 using np.repeat, apply some method for stretching out the values around the midpoint, then finally split the list into chunks of 10 and take the mean for each hour.
My main issue is the "stretching" part, but if the answer also solves the second part, so much the better.
I guess you need something like this:
def stretch(xs, coef):
    # compute the new "distribution" (working hours in the first half)
    oldDist = sum(xs[:len(xs) // 2])
    newDist = oldDist * coef
    # generate the new half-list
    def f(x):
        if newDist - x < 0:
            return 0.0
        return min(1.0, newDist - x)
    t = [f(x) for x in range(len(xs) // 2)]
    res = list(reversed(t))
    res.extend(t)
    return res
But be careful with an odd number of hours.
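As a quick check (not in the original answer), the two expected outputs from the question are reproduced:
hours = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print(stretch(hours, 1.5))    # [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
print(stretch(hours, 1.25))   # [0.0, 0.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.5, 0.0, 0.0]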
If I look at the expected output, the algorithm goes something like this:
- start with a list of numbers; values >0.0 indicate working hours
- sum those hours
- compute how many extra hours are requested
- divide those extra hours over both ends of the sequence by prepending or appending half of them at each 'end'
So:
hours = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
expansion = 130
extra_hrs = float(sum(hours)) * float(expansion - 100)/100
# find the indices of the first and last non-zero hours
# (because of floating point, don't use "==" for the comparison)
hr_idx = [idx for (idx, value) in enumerate(hours) if value > 0.001]
# replace the entries before the first and after the last
# with half the extra hours
print("Before expansion:", hours)
hours[hr_idx[0] - 1] = hours[hr_idx[-1] + 1] = extra_hrs / 2.0
print("After expansion:", hours)
Gives as output:
Before expansion: [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
After expansion: [0.0, 0.0, 0.6, 1.0, 1.0, 1.0, 1.0, 0.6, 0.0, 0.0]
This is what I've ended up doing. It's a little ugly as it needs to handle stretch coefficients less than 100%.
def stretch(xs, coef, centre):
    """Scale a list by a coefficient around a point in the list.

    Parameters
    ----------
    xs : list
        Input values.
    coef : float
        Coefficient to scale by.
    centre : int
        Position in the list to use as a centre point.

    Returns
    -------
    list
    """
    grain = 100
    stretched_array = np.repeat(xs, int(grain * coef))
    if coef < 1:
        # pad start and end so the result keeps the original length
        total_pad_len = grain * len(xs) - len(stretched_array)
        centre_pos = float(centre) / len(xs)
        start_pad_len = centre_pos * total_pad_len
        end_pad_len = (1 - centre_pos) * total_pad_len
        start_pad = [stretched_array[0]] * int(start_pad_len)
        end_pad = [stretched_array[-1]] * int(end_pad_len)
        stretched_array = np.array(start_pad + list(stretched_array) + end_pad)
    else:
        # cut a window of the original length around the centre point
        pivot_point = (len(xs) - centre) * grain * coef
        first = int(pivot_point - (len(xs) * grain) / 2)
        last = first + len(xs) * grain
        stretched_array = stretched_array[first:last]
    return [round(chunk.mean(), 2) for chunk in chunks(stretched_array, grain)]

def chunks(iterable, n):
    """Yield successive n-sized chunks from iterable.

    Source: http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python#answer-312464
    """
    for i in range(0, len(iterable), n):
        yield iterable[i:i + n]
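As a quick sanity check (not part of the original post), the 125% example from the question with the centre at index 5:
import numpy as np

hours = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print(stretch(hours, 1.25, 5))
# [0.0, 0.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.5, 0.0, 0.0]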
