Python - Find coefficients minimizing error in CSV data

I've recently run into a problem. I have data looking like this:
Value 1    Value 2    Target
1345       4590       2.45
1278       3567       2.48
1378       4890       2.46
1589       4987       2.50
...        ...        ...
The data goes on for a few thousand lines.
I need to find two values (A & B) that minimize the error when the data is plugged in like so:
Value 1 * A + Value 2 * B = Target
I've looked into scipy.optimize.curve_fit, but I can't seem to understand how it would work, because the function changes at every iteration of the data (since Value 1 and Value 2 are not the same over every row).
Any help is greatly appreciated, thanks in advance!

The function curve_fit takes 3 arguments:
a function f that takes an input argument, let's call it X, and parameters params (as many as you want)
the input X_data you have from your dataset
the output Y_data you have from your dataset
The point of this function is to give you the best params to put into f(X_data, params) to reproduce Y_data.
Intuitively, the X in your function f is a simple 1D numpy array, but it can actually have whatever form you want. Here your input is a tuple of two 1D arrays (or a 2D array if you prefer to implement it that way).
Here's a code example :
import numpy as np
from scipy.optimize import curve_fit

X_data = (np.array([1345, 1278, 1378, 1589]),
          np.array([4590, 3567, 4890, 4987]))
Y_data = np.array([2.45, 2.48, 2.46, 2.50])

def my_func(X, A, B):
    x1, x2 = X
    return A*x1 + B*x2

(A, B), _ = curve_fit(my_func, X_data, Y_data)

interpolated_results = my_func(X_data, A, B)
relative_error_in_percent = abs((Y_data - interpolated_results)/Y_data)*100
print(relative_error_in_percent)

Unfortunately you have not provided any test data, so I have come up with my own:
import pandas as pd
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt

def f(V1, V2, A, B):  # target function
    return V1*A + V2*B

# Generate test data
def generateData(A, B):
    np.random.seed(0)
    V1 = np.random.uniform(low=1000, high=1500, size=(100,))
    V2 = np.random.uniform(low=3500, high=5000, size=(100,))
    Target = f(V1, V2, A, B) + np.random.normal(0, 1, 100)
    return V1, V2, Target

data = generateData(2, 3)  # important: the true parameters are A=2, B=3
data = {"Value 1": data[0], "Value 2": data[1], "Target": data[2]}
df = pd.DataFrame(data)  # similar structure as given in the table
df.head() looks like this:
Value 1 Value 2 Target
0 1292.0525763109854 3662.162080896163 13570.276523473405
1 1155.0421489258965 4907.133274663096 17033.392287295104
2 1430.7172112685223 4844.422515098364 17395.412651006143
3 1396.0480757043242 4076.5845114488666 15022.720636830541
4 1346.2120476329646 3570.9567326419674 13406.565815022896
Your question is answered in the following:
## Plot data to check whether a linear function is useful
df.head()
fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
ax1.scatter(df["Value 1"], df["Target"])
ax2.scatter(df["Value 2"], df["Target"])

def fmin(x, df):  # returns the error for a given parameter set
    def RMSE(y, y_target):  # definition of the error term
        return np.sqrt(np.mean((y - y_target)**2))
    A, B = x
    V1, V2, y_target = df["Value 1"], df["Value 2"], df["Target"]
    y = f(V1, V2, A, B)  # calculate target values with the given parameter set
    return RMSE(y, y_target)

res = minimize(fmin, x0=[1, 1], args=(df,), options={"disp": True})
print(res.x)
I prefer scipy.optimize.minimize() over curve_fit since you can define the error function yourself. The documentation can be found here.
You need:
a function fun that returns the error for a given set of parameters x (here fmin with RMSE)
an initial guess x0 (here [1, 1]); if your guess is totally off you will probably not find a solution, or (with more complex problems) only a local one
additional arguments args that are passed on to fun (here the data df); also helpful for fixed parameters
options={"disp": True} for printing additional information
the fitted parameters, along with further information, can be found in the returned variable res
For this case the result is:
[1.9987209 3.0004212]
This is close to the parameters A=2, B=3 used when generating the data.
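If you then want to reuse the fitted parameters, here is a short sketch (reusing f, df and res from the code above):

A_fit, B_fit = res.x
predicted = f(df["Value 1"], df["Value 2"], A_fit, B_fit)   # model output per row
print(np.sqrt(np.mean((predicted - df["Target"])**2)))      # final RMSE of the fit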

Related

Theory behind `curvefit()` module in python

Suppose I have the following data:
0.000000000000000000e+00 4.698409927534825670e-01
1.052631578947368363e+00 8.864688755521996200e+00
2.105263157894736725e+00 1.554529316011567630e+01
3.157894736842105310e+00 9.767558170900922931e+00
4.210526315789473450e+00 2.670221074763470881e+01
Now, I would like to use this data to do some statistical analysis.
%pylab inline
# Loads numpy
my_data = loadtxt("numbers.dat")
dataxaxis = my_data[:,0]
datayaxis = my_data[:,1]
I know that I am storing the data in a variable and taking the first column as my x data for the x-axis and the second column as my y data for the y-axis.
I was learning about the curve_fit() function, which works similarly to polyfit() in finding a line of best fit (LOBF) gradient and intercept.
I understood that first I had to define the function of a straight line, y = mx + c.
Here is where I become confused.
According to the lecturer, I needed to have the xdata as an argument, but also define the gradient and intercept as parameters:
def straightline(dataxaxis, m, c):
    "Returns values of y according to y = mx + c"
    return m*dataxaxis + c
And later I could call the curve_fit() function like so:
lineinfo = curve_fit(straightline, dataxaxis, datayaxis)
lineparams = lineinfo[0]
m = lineparams[0]
c = lineparams[1]
which gave the corresponding values.
But when I passed the function straightline as the first parameter to curve_fit, I didn't supply dataxaxis or any information about m or c, yet it still calculated the gradient and intercept and returned them regardless.
How is this possible?
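What makes this work is that straightline itself (the function object) is handed to curve_fit, which then calls it internally with dataxaxis and trial values of m and c while searching for the best fit. A minimal toy sketch of the idea (a hypothetical brute-force grid search, not SciPy's actual least-squares implementation) might look like this:

import numpy as np

def straightline(dataxaxis, m, c):
    "Returns values of y according to y = mx + c"
    return m*dataxaxis + c

def toy_curve_fit(f, xdata, ydata):
    # Try a grid of (m, c) pairs and keep the pair with the smallest
    # sum of squared residuals. curve_fit does the same job far more
    # efficiently with a least-squares algorithm.
    best, best_err = None, np.inf
    for m in np.linspace(-10, 10, 201):
        for c in np.linspace(-10, 10, 201):
            err = np.sum((ydata - f(xdata, m, c))**2)
            if err < best_err:
                best, best_err = (m, c), err
    return np.array(best)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.5*x + 1.0
print(toy_curve_fit(straightline, x, y))  # approximately [2.5, 1.0]

In other words, you never call straightline yourself; you pass it along, and the fitting routine calls it as many times as it needs.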

How can I create a function from this data?

I have a dataset in the form of a table:
Score Percentile
381 1
382 2
383 2
...
569 98
570 99
The complete table is here as a Google spreadsheet.
Currently, I am computing a score and then doing a lookup on this dataset (table) to find the corresponding percentile rank.
Is it possible to create a function to calculate the corresponding percentile rank for a given score using a formula instead of looking it up in the table?
It's impossible to recreate the function that generated a given table of data, if no information is provided about the process behind that data.
That being said, we can make some speculation.
Since it's a "percentile" function, it probably represents the cumulative value of a probability distribution of some sort. A very common probability distribution is the normal distribution, whose "cumulative" counterpart (i.e. its integral) is the so-called "error function" ("erf").
In fact, your tabulated data looks a lot like an error function for a variable whose average value is 473.09:
[Plot: the dataset in orange; the fitted error function (erf) in blue]
However, the agreement is not perfect and that could be because of three reasons:
the fitting procedure I've used to generate the parameters for the error function didn't use the right constraints (because I have no idea what I'm modelling!)
your dataset doesn't represent an exact normal distribution, but rather real world data whose underlying distribution is the normal distribution. The features of your sample data that deviate from the model are being ignored altogether.
the underlying distribution is not a normal distribution at all, its integral just happens to look like the error function by chance.
There is literally no way for me to tell!
If you want to use this function, this is its definition:
import numpy as np
from scipy.special import erf

def fitted_erf(x):
    c = 473.09090474
    w = 37.04826334
    return 50 + 50*erf((x - c)/(w*np.sqrt(2)))
Tests:
In [2]: fitted_erf(439) # 17 from the table
Out[2]: 17.874052406601457
In [3]: fitted_erf(457) # 34 from the table
Out[3]: 33.20270318344252
In [4]: fitted_erf(474) # 51 from the table
Out[4]: 50.97883169390196
In [5]: fitted_erf(502) # 79 from the table
Out[5]: 78.23955071273468
However, I'd strongly advise you to check whether a fitted function, made without knowledge of your data source, is the right tool for your task.
P.S.
In case you're interested, this is the code used to obtain the parameters:
import numpy as np
from scipy.special import erf
from scipy.optimize import curve_fit

# using a 'table.csv' file generated by Google Spreadsheets
tab = np.genfromtxt('table.csv', delimiter=',', skip_header=1)
x = tab[:, 0]
y = tab[:, 1]

def parametric_erf(x, c, w):
    return 50 + 50*erf((x - c)/(w*np.sqrt(2)))

pars, j = curve_fit(parametric_erf, x, y, p0=[475, 10])
print(pars)  # outputs [ 473.09090474, 37.04826334]
and to generate the plot
import matplotlib.pyplot as plt
plt.plot(x,parametric_erf(x,*pars))
plt.plot(x,y)
plt.show()
Your question is quite vague, but it seems whatever calculation you do ends up with a number in the range 381-570; is that correct? You have a multiline calculation which gives this number? I'm guessing you are repeating it in many places in your code, which is why you want to turn it into a procedure?
For any calculation you can wrap it in a function. For instance:
answer = variable_1 * variable_2 + variable_3
can be written as:
def calculate(v1, v2, v3):
    '''Calculate the result from the inputs.'''
    return v1 * v2 + v3
answer = calculate(variable_1, variable_2, variable_3)
If you would like a definitive answer, then simply post your calculation and I can make it into a function for you.

Optimising two arrays simultaneously with scipy optimize

I have a function that takes two m-dimensional arrays, does some calculation with them (here it is very simplified), and returns a one-dimensional array. I also have m-dimensional measurement data and would like to optimise those two arrays to fit the measurements. This worked fine with one array. I simply cannot get it to work with two arrays (or more); it always throws:
TypeError: Improper input: N=40 must not exceed M=20
Here is my code. Thank you very much if anyone can help!
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize

data = [np.arange(0, 20.0, 1),
        np.array([-52.368, 32.221, 40.102, 48.088, 73.106, 50.807, 52.235,
                  76.933, 65.737, 34.772, 94.376, 123.366, 92.71, 72.25,
                  165.051, 91.501, 118.92, 100.936, 56.747, 159.034])]

def line(m, b):
    return m*b

guessm = np.ones(20)  # initial guessed values for m
guessb = np.ones(20)  # initial guessed values for b
guess = np.append(guessm, guessb)

errfunc = lambda p, y: y - line(p[:20], p[20:])
parameter, success = optimize.leastsq(errfunc, guess, args=(data[1],))
print(parameter)

plt.plot(data[0], data[1], 'o')
plt.plot(data[0], line(parameter[0], parameter[1]))
plt.show()
If you want to fit a line, you should give the slope and intercept - two parameters, not 40. I suspect this is what you are trying to do:
import matplotlib.pyplot as plt
import numpy as np
from scipy import optimize

data = [np.arange(0, 20.0, 1),
        np.array([-52.368, 32.221, 40.102, 48.088, 73.106, 50.807, 52.235,
                  76.933, 65.737, 34.772, 94.376, 123.366, 92.71, 72.25,
                  165.051, 91.501, 118.92, 100.936, 56.747, 159.034])]

def line(m, b):
    return np.arange(0, 20, 1)*m + b

guess = np.ones(2)
errfunc = lambda p, y: y - line(p[0], p[1])
parameter, success = optimize.leastsq(errfunc, guess, args=(data[1],))
print(parameter)

plt.plot(data[0], data[1], 'o')
plt.plot(data[0], line(parameter[0], parameter[1]))
plt.show()
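As an aside: for a plain straight line you don't even need to write the residual function yourself. Here is a sketch with np.polyfit, using the same measurement data as above:

import numpy as np

x = np.arange(0, 20.0, 1)
y = np.array([-52.368, 32.221, 40.102, 48.088, 73.106, 50.807, 52.235,
              76.933, 65.737, 34.772, 94.376, 123.366, 92.71, 72.25,
              165.051, 91.501, 118.92, 100.936, 56.747, 159.034])

m, b = np.polyfit(x, y, 1)  # degree-1 polynomial fit: returns slope, intercept
print(m, b)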

"only length-1 arrays can be converted to Python scalars" using scipy.optimize in Sage

I want to adjust the parameters of a model to a given set of data.
I'm trying to use scipy's function curve_fit in Sage, but I keep getting
TypeError: only length-1 arrays can be converted to Python scalars
Here's my code:
from numpy import cos, exp, pi
f = lambda x: exp( - 1 / cos(x) )

import numpy as np

def ang(time): return (time-12)*pi/12

def temp(x, maxtemp):
    cte = (273+maxtemp)/f(0)**(1/4)
    if 6 < x and x < 18:
        return float(cte*f(ang(x))**(1/4) - 273)
    else:
        return -273

lT = list(np.linspace(15, 40, 1+24*2))
lT = [float(num) for num in lT]          # list of y data
ltimes = np.linspace(0, 24, 6*24+1)[1:]
ltimes = list(ltimes)                    # list of x data
u0 = lT[0]

def u(time, maxtemp, k):  # the function I want to fit to the data
    def integ(t): return k*exp(k*t)*temp(t, maxtemp)
    return exp(-k*time)*( numerical_integral(integ, 0, time)[0] + u0 )

import scipy.optimize as optimization
print optimization.curve_fit(u, ltimes, lT, [1000, 0.0003])
scipy.optimize.curve_fit expects the model function to be vectorized: that is, it must be able to receive an array (ndarray, to be precise), and return an array of values. You can see the problem right away by adding a debug printout
def u(time, maxtemp, k):
    print time  # for debugging
    def integ(t): return k*exp(k*t)*temp(t, maxtemp)
    return exp(-k*time)*( numerical_integral(integ, 0, time)[0] + u0 )
The output of print will be the entire array ltimes you are passing to curve_fit. This is something numerical_integral is not designed to handle. You need to give it values one by one.
Like this:
def u(time, maxtemp, k):
    def integ(t): return k*exp(k*t)*temp(t, maxtemp)
    return [exp(-k*time_i)*( numerical_integral(integ, 0, time_i)[0] + u0 ) for time_i in time]
This will take care of the "only length-1 arrays can be converted" error. You will then hit another one, because your lists ltimes and lT have different lengths, which doesn't make sense, since lT is supposed to hold the target outputs for the inputs ltimes. You should revise the definitions of these arrays to figure out what size you want.
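For example (an assumption about your intended sampling, which only you can confirm), you could put the times on the same grid as the temperatures so both lists have 49 entries:

import numpy as np

lT = [float(num) for num in np.linspace(15, 40, 1 + 24*2)]  # 49 target values
ltimes = list(np.linspace(0, 24, 1 + 24*2))                 # 49 matching times
assert len(ltimes) == len(lT)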

Scipy odeint giving index out of bounds errors

I am trying to solve a differential equation in Python using SciPy's odeint function. The equation is of the form dy/dt = w(t), where w(t) = w1*(1 + A*sin(w2*t)) for some parameters w1, w2, and A. The code I've written works for some parameters, but for others I get index out of bounds errors.
Here's some example code that works
import numpy as np
import scipy.integrate as integrate

t = np.arange(1000)
w1 = 2*np.pi
w2 = 0.016*np.pi
A = 1.0
w = w1*(1 + A*np.sin(w2*t))

def f(y, t0):
    return w[t0]

y = integrate.odeint(f, 0, t)
Here's some example code that doesn't work
import numpy as np
import scipy.integrate as integrate
t = np.arange(1000)
w1 = 0.3*np.pi
w2 = 0.005*np.pi
A = 0.15
w = w1*(1+A*np.sin(w2*t))
def f(y,t0):
return w[t0]
y = integrate.odeint(f,0,t)
The only thing that changes between these is that the three parameters w1, w2, and A are smaller in the second, but the second one always gives me the following error
line 13, in f
return w[t0]
IndexError: index 1001 is out of bounds for axis 0 with size 1000
This error continues even after restarting Python and running the second code first. I've tried other parameters; some seem to work, but others give me different index out of bounds errors. Some say 1001 is out of bounds, some say 1000, some say 1008, etc.
Changing the initial condition on y (the second input to odeint, which I set to 0 in the code above) also changes the number in the index error, so it might be that I'm misunderstanding what to put there. I wasn't told what the initial conditions should be, other than that y is used as the phase of a signal, so I presumed it to be initially 0.
What you want to do is
def w(t):
    return w1*(1 + A*np.sin(w2*t))

def f(y, t0):
    return w(t0)
Array indices are integers, while time arguments and values of solutions of differential equations are real numbers; internally, odeint evaluates f at intermediate (and sometimes slightly extrapolated) times that need not lie in your grid t, which is why w[t0] eventually runs off the end of the array. There is therefore a conceptual problem in invoking w[t0].
You might also integrate the function w directly; there is no inherent difficulty in this example.
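For this particular right-hand side the antiderivative can even be written down by hand, which gives a way to check the numerical solution (a sketch; the constant is chosen so that y(0) = 0):

import numpy as np
import scipy.integrate as integrate

w1, w2, A = 0.3*np.pi, 0.005*np.pi, 0.15
t = np.arange(1000)

def w(t):
    return w1*(1 + A*np.sin(w2*t))

def f(y, t0):
    return w(t0)

y_num = integrate.odeint(f, 0, t)[:, 0]
# antiderivative of w1*(1 + A*sin(w2*t)), shifted so that y(0) = 0
y_exact = w1*t - (w1*A/w2)*np.cos(w2*t) + w1*A/w2
print(np.max(np.abs(y_num - y_exact)))  # should be small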
As for coupled systems, you solve them as coupled systems.
def w(t):
    return w1*(1 + A*np.sin(w2*t))

def f(y, t):
    wt = w(t)
    return np.array([ wt, wt*np.sin(y[1] - y[0]) ])
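A hypothetical call for this two-component system then just passes a two-element initial condition (assuming w1, w2, A and the time grid t are defined as above):

y = integrate.odeint(f, [0.0, 0.0], t)  # y[:, 0] and y[:, 1] are the two solution components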
