Optimising two arrays simultaneously with scipy optimize - python

I have a function that takes two m-dimensional arrays, does some calculation with them (very simplified here), and returns a one-dimensional array. I also have m-dimensional measurement data and would like to optimise those two arrays to fit the measurements. This worked fine with one array, but I simply cannot get it to work with two (or more) arrays. It always throws:
TypeError: Improper input: N=40 must not exceed M=20
Here is my Code. Thank you very much if anyone can help!
import matplotlib.pyplot as plt
import numpy as np
from scipy import optimize
data=[np.arange(0,20.0,1),np.array([-52.368, 32.221, 40.102, 48.088, 73.106, 50.807, 52.235, 76.933, 65.737, 34.772, 94.376, 123.366, 92.71, 72.25, 165.051, 91.501, 118.92, 100.936, 56.747, 159.034])]
def line(m,b):
    return m*b
guessm = np.ones(20) #initial guessed values for m
guessb = np.ones(20) #initial guesses values for b
guess = np.append(guessm,guessb)
errfunc= lambda p,y: (y-line(p[:20],p[20:]))
parameter, sucess = optimize.leastsq(errfunc, guess, args=(data[1]))
print(parameter)
plt.plot(data[0],data[1],'o')
plt.plot(data[0],line(parameter[0],parameter[1]))
plt.show()

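If you want to fit a line, you should fit the slope and intercept: two parameters, not 40. leastsq raises this error because the number of parameters (N=40) exceeds the number of residuals (M=20). I suspect this is what you are trying to do: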
import matplotlib.pyplot as plt
import numpy as np
from scipy import optimize
data=[np.arange(0,20.0,1),np.array([-52.368, 32.221, 40.102, 48.088, 73.106, 50.807, 52.235, 76.933, 65.737, 34.772, 94.376, 123.366, 92.71, 72.25, 165.051, 91.501, 118.92, 100.936, 56.747, 159.034])]
def line(m,b):
    return np.arange(0, 20, 1)*m + b
guess = np.ones(2)
errfunc= lambda p,y: (y-line(p[0],p[1]))
parameter, sucess = optimize.leastsq(errfunc, guess, args=(data[1]))
print(parameter)
plt.plot(data[0],data[1],'o')
plt.plot(data[0],line(parameter[0],parameter[1]))
plt.show()
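As a sanity check on the two-parameter fit above, the same straight line can be obtained with np.polyfit, which solves the linear least-squares problem directly. This is only a minimal sketch that reuses the data from the question; the names x, y, slope and intercept are introduced here for illustration and do not appear in the original code.

```python
import numpy as np
from scipy import optimize

# Data copied from the question.
x = np.arange(0, 20.0, 1)
y = np.array([-52.368, 32.221, 40.102, 48.088, 73.106, 50.807, 52.235,
              76.933, 65.737, 34.772, 94.376, 123.366, 92.71, 72.25,
              165.051, 91.501, 118.92, 100.936, 56.747, 159.034])

def line(m, b):
    # Straight line evaluated at the measurement positions.
    return x * m + b

def errfunc(p, y_obs):
    # Residuals: 20 values for 2 parameters, so N <= M holds.
    return y_obs - line(p[0], p[1])

parameter, ier = optimize.leastsq(errfunc, np.ones(2), args=(y,))

# Independent cross-check: a degree-1 polynomial fit.
slope, intercept = np.polyfit(x, y, 1)
print(parameter, (slope, intercept))
```

Both approaches should agree to within the optimiser's tolerance, since the problem is linear in the two parameters.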

Related

Python - Find coefficients minimizing error in csv data

I've recently run into a problem. I have data looking like this :
Value 1    Value 2    Target
1345       4590       2.45
1278       3567       2.48
1378       4890       2.46
1589       4987       2.50
...        ...        ...
The data goes on for a few thousand lines.
I need to find two values (A & B), that minimize the error when the data is inputted like so :
Value 1 * A + Value 2 * B = Target
I've looked into scipy.optimize.curve_fit, but I can't seem to understand how it would work, because the function changes at every iteration of the data (since Value 1 and Value 2 are not the same over every row).
Any help is greatly appreciated, thanks in advance !
The function curve_fit takes 3 arguments:
a function f that takes an input argument, let's call it X, and parameters params (as many as you want)
the input X_data you have from your dataset
the output Y_data you have from your dataset
The point of this function is to give you the best params to pass to f(X_data, params) so that it reproduces Y_data.
Intuitively, the input X of your function f is a simple 1D numpy array, but it can actually have any form you want. Here your input is a tuple of two 1D arrays (or a 2D array if you want to implement it that way).
Here's a code example :
import numpy as np
from scipy.optimize import curve_fit
X_data = (np.array([1345,1278,1378,1589]),
          np.array([4590,3567,4890,4987]))
Y_data = np.array([2.45,2.48,2.46,2.50])
def my_func(X, A, B):
    x1, x2 = X
    return A*x1 + B*x2
(A, B), _ = curve_fit(my_func, X_data, Y_data)
interpolated_results = my_func(X_data, A, B)
relative_error_in_percent = abs((Y_data - interpolated_results)/Y_data)*100
print(relative_error_in_percent)
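Because the model Value 1 * A + Value 2 * B is linear in A and B, the same coefficients can also be found without an iterative optimiser. The following is a hedged sketch using np.linalg.lstsq on the four sample rows from the question; the names design, coeffs and predicted are illustrative and not part of the original answer.

```python
import numpy as np

# The four sample rows shown in the question.
x1 = np.array([1345.0, 1278.0, 1378.0, 1589.0])
x2 = np.array([4590.0, 3567.0, 4890.0, 4987.0])
target = np.array([2.45, 2.48, 2.46, 2.50])

# Each row of the design matrix is (Value 1, Value 2).
design = np.column_stack([x1, x2])

# Solve min ||design @ [A, B] - target||^2 directly.
coeffs, residuals, rank, singular_values = np.linalg.lstsq(
    design, target, rcond=None)
A, B = coeffs

predicted = design @ coeffs
print(A, B)
```

For a few thousand rows this is a single matrix factorisation, so it needs no initial guess.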
Unfortunately you have not provided any test data, so I have come up with my own:
import pandas as pd
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
def f(V1,V2,A,B): #Target function
    return V1*A+V2*B
# Generate Test-Data
def generateData(A,B):
    np.random.seed(0)
    V1=np.random.uniform(low=1000, high=1500, size=(100,))
    V2=np.random.uniform(low=3500, high=5000, size=(100,))
    Target=f(V1,V2,A,B) +np.random.normal(0,1,100)
    return V1,V2,Target
data=generateData(2,3) #Important:
data={"Value 1":data[0], "Value 2":data[1], "Target":data[2]}
df=pd.DataFrame(data) #Similar structure as given in Table
df.head() looks like this:
Value 1 Value 2 Target
0 1292.0525763109854 3662.162080896163 13570.276523473405
1 1155.0421489258965 4907.133274663096 17033.392287295104
2 1430.7172112685223 4844.422515098364 17395.412651006143
3 1396.0480757043242 4076.5845114488666 15022.720636830541
4 1346.2120476329646 3570.9567326419674 13406.565815022896
Your question is answered in the following:
## Plot Data to check whether linear function is useful
df.head()
fig=plt.figure()
ax1=fig.add_subplot(211)
ax2=fig.add_subplot(212)
ax1.scatter(df["Value 1"], df["Target"])
ax2.scatter(df["Value 2"], df["Target"])
def fmin(x, df): #Returns Error at given parameters
    def RMSE(y,y_target): #Definition for error term
        return np.sqrt(np.mean((y-y_target)**2))
    A,B=x
    V1,V2,y_target=df["Value 1"], df["Value 2"], df["Target"]
    y=f(V1,V2,A,B) #Calculate target value with given parameter set
    return RMSE(y,y_target)
res=minimize(fmin,x0=[1,1],args=df, options={"disp":True})
print(res.x)
I prefer scipy.optimize.minimize() over curve_fit since you can define the error function yourself. See the scipy.optimize.minimize documentation for details.
You need:
a function fun that returns the error for a given set of parameters x (here fmin with RMSE)
an initial guess x0 (here [1,1]); if your guess is totally off you will probably not find a solution or (with more complex problems) only a local one
additional arguments args passed to fun, here the data df, but also helpful for fixed parameters
options={"disp":True} prints additional information
your parameters, along with further information, can be found in the returned variable res
For this case the result is:
[1.9987209 3.0004212]
Similar to the given parameters when generating the data.

Errors using curve_fit for Gaussian fit of data

I'm trying to do a Gaussian fit for some experimental data but I keep running into error after error. I've followed a few different threads online but either the fit isn't good (it's just a horizontal line) or the code just won't run. I'm following code from another thread. Below is my code.
I apologize if my code seems a bit messy. There are some bits from other attempts when I tried making it work. Hence the "astropy" import.
import math as m
import matplotlib.pyplot as plt
import numpy as np
from scipy import optimize as opt
import pandas as pd
import statistics as stats
from astropy import modeling
def gaus(x,a,x0,sigma, offset):
    return a*m.exp(-(x-x0)**2/(2*sigma**2)) + offset
# Python program to get average of a list
def Average(lst):
    return sum(lst) / len(lst)
wavelengths = [391.719, 391.984, 392.248, 392.512, 392.777, 393.041, 393.306, 393.57, 393.835, 394.099, 391.719, 391.455, 391.19, 390.926, 390.661, 390.396]
intensities = [511.85, 1105.85, 1631.85, 1119.85, 213.85, 36.85, 10.85, 6.85, 13.85, 7.85, 511.85, 200.85, 80.85, 53.85, 14.85, 24.85]
n=sum(intensities)
mean = sum(wavelengths*intensities)/n
sigma = m.sqrt(sum(intensities*(wavelengths-mean)**2)/n)
def gaus(x,a,x0,sigma):
    return a*m.exp(-(x-x0)**2/(2*sigma**2))
popt,pcov = opt.curve_fit(gaus,wavelengths,intensities,p0=[1,mean,sigma])
print(popt)
plt.scatter(wavelengths, intensities)
plt.title("Helium Spectral Line Peak 1")
plt.xlabel("Wavelength (nm)")
plt.ylabel("Intensity (a.u.)")
plt.show()
Thanks to the kind user, my fit now looks much more reasonable. However, one of the points seems to connect back to an earlier point (screenshot not shown).
There are two problems with your code. The first is that you perform vector operations on Python lists, which causes the first error in the line mean = sum(wavelengths*intensities)/n; you should use np.array instead. The second is that you call math.exp on an array of values, which again throws an error because math.exp only accepts a single real number; you should use np.exp here instead.
The following code solves your problem:
import matplotlib.pyplot as plt
import numpy as np
from scipy import optimize as opt
wavelengths = [391.719, 391.984, 392.248, 392.512, 392.777, 393.041,
               393.306, 393.57, 393.835, 394.099, 391.719, 391.455,
               391.19, 390.926, 390.661, 390.396]
intensities = [511.85, 1105.85, 1631.85, 1119.85, 213.85, 36.85, 10.85, 6.85,
               13.85, 7.85, 511.85, 200.85, 80.85, 53.85, 14.85, 24.85]
wavelengths_new = np.array(wavelengths)
intensities_new = np.array(intensities)
n=sum(intensities)
mean = sum(wavelengths_new*intensities_new)/n
sigma = np.sqrt(sum(intensities_new*(wavelengths_new-mean)**2)/n)
def gaus(x,a,x0,sigma):
    return a*np.exp(-(x-x0)**2/(2*sigma**2))
popt,pcov = opt.curve_fit(gaus,wavelengths_new,intensities_new,p0=[1,mean,sigma])
print(popt)
plt.scatter(wavelengths_new, intensities_new, label="data")
plt.plot(wavelengths_new, gaus(wavelengths_new, *popt), label="fit")
plt.title("Helium Spectral Line Peak 1")
plt.xlabel("Wavelength (nm)")
plt.ylabel("Intensity (a.u.)")
plt.legend() # show the "data" and "fit" labels
plt.show()
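The follow-up in the question (a line that connects back to an earlier point) is probably caused by the wavelengths not being in ascending order: plt.plot joins points in the order they are given, and the list jumps from 394.099 back to 391.719. A minimal sketch of one way around this, assuming the fit itself is unchanged, is to sort the data before drawing a line, or to evaluate the fitted curve on a dense sorted grid. The names order, wavelengths_sorted and x_dense are introduced here for illustration.

```python
import numpy as np

wavelengths = np.array([391.719, 391.984, 392.248, 392.512, 392.777, 393.041,
                        393.306, 393.57, 393.835, 394.099, 391.719, 391.455,
                        391.19, 390.926, 390.661, 390.396])
intensities = np.array([511.85, 1105.85, 1631.85, 1119.85, 213.85, 36.85,
                        10.85, 6.85, 13.85, 7.85, 511.85, 200.85, 80.85,
                        53.85, 14.85, 24.85])

# Indices that put the wavelengths in ascending order.
order = np.argsort(wavelengths)
wavelengths_sorted = wavelengths[order]
intensities_sorted = intensities[order]

# A dense, sorted grid gives a smooth curve for the fitted model.
x_dense = np.linspace(wavelengths.min(), wavelengths.max(), 200)

# With matplotlib, the fitted curve could then be drawn as, for example:
# plt.plot(x_dense, gaus(x_dense, *popt), label="fit")
```

Sorting only affects how the line is drawn; curve_fit itself does not depend on the order of the points.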

ValueError: x and y must have same first dimension, but have shapes

I wonder how to best solve the following problem in my script: "ValueError: x and y must have same first dimension, but have shapes (1531,) and (1532,)".
What is the problem here? The problem is that the x and y axis of the plot don't share the exact same number of values (input) to plot. The result is the error message above.
Let us look at the code first:
# Initialize
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from matplotlib.pyplot import cm
# Numpy.loadtxt – Loads data from a textfile.
# Scipy.signal.welch – Creation of the power-spectrum via welch method. f, Welch creates the ideal frequencies (f, Welch = Power Spectrum or Power Spectral Density)
Subjects = ["Subject1", "Subject2"]
for Subject in Subjects:
    Txt = np.loadtxt("/datadir.../{0}/filename...{0}.txt".format(Subject), comments="#", delimiter=None,
                     converters=None, skiprows=0, usecols=0, unpack=False, ndmin=0, encoding=None, max_rows=None, like=None)
    f, Welch = signal.welch(Txt, fs=1.0, window="hann", nperseg=None, noverlap=None, nfft=3062, detrend="constant", return_onesided=True, scaling="density", axis=-1, average="mean")
    BypassZero1 = f[f > 0.00000000000001] # Avoids "RuntimeWarning: divide by zero encountered in log"
    BypassZero2 = Welch[Welch > 0.00000000000001]
    Log_f = np.log(BypassZero1, out=BypassZero1, where=BypassZero1 > 0)
    Log_Welch = np.log(BypassZero2, out=BypassZero2, where=BypassZero2 > 0)
    plt.plot(Log_f, Log_Welch)
The code lines "BypassZero1" and "BypassZero2" tell Python to only use values above 0.00000000000001 for both "f" and "Welch". Otherwise the problem "RuntimeWarning: divide by zero encountered in log" would occur in the following step where I apply the logarithm for both axes (Log_f and Log_Welch).
This is where the problem occurs for the last plt.plot line of the code. It seems that a different number of numeric values are "left over" for "f" and "Welch" after the previous step of using the Welch method and applying the logarithm for both axes.
I wonder if there is a possibility to deal with the 0.xxx values provided in the .txt file. Currently, only values above 0.00000000000001 for both f and Welch are used. This will lead to the different number of values for x and y, hence resulting in the impossibility of plotting the data.
What could be a solution for this problem?
As you pointed out, the error message indicates that your two arrays are of different length. This is because the mask of the second array should be the same as the mask of the first. Therefore, replacing BypassZero2 = Welch[Welch > 0.00000000000001] with BypassZero2 = Welch[f > 0.00000000000001] should fix the issue.
In short, the x and y coordinates being plotted must have the same length so that every x value is paired with exactly one y value. Applying the same mask to both arrays guarantees this.
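The shared-mask idea can be sketched as follows. Since the original text files are not available, this example assumes synthetic random samples as a stand-in; only nfft=3062 and fs=1.0 are taken from the question, and the names samples, mask, log_f and log_welch are illustrative.

```python
import numpy as np
from scipy import signal

# Synthetic stand-in for the data loaded from the text file (assumption).
rng = np.random.default_rng(0)
samples = rng.normal(size=4096)

# Same fs and nfft as in the question; other arguments left at defaults.
f, welch = signal.welch(samples, fs=1.0, nfft=3062)

# One mask, built from the frequencies, applied to BOTH arrays.
mask = f > 0
log_f = np.log(f[mask])
log_welch = np.log(welch[mask])

print(f.shape, log_f.shape, log_welch.shape)
```

Because both arrays are indexed with the same boolean mask, they always keep the same length, whatever the power values are.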

Using `dask.array.map_blocks()` to parallelize line fitting on a 3-D `dask.array`

I have a series of N images that are recorded at different times. I have stacked the images into a 3-D dask array and rechunked them along the time axis. I would now like to perform a linear fit at each pixel position across the image, but I am running into the following error when using da.map_blocks as I try to scale up: TypeError: expected 1D or 2D array for y
I found one other post, applying-a-function-along-an-axis-of-a-dask-array, related to this but it didn't address an issue with specifically setting the chunk size. When using da.apply_along_axis I found an issue similar to the one reported in dask-performance-apply-along-axis wherein only one CPU seems to be utilized during the computation (even for chunked data).
MWE: Works properly
import dask.array as da
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
def f(y, args, axis=None):
    return np.polyfit(args[0], y.squeeze(), args[1])[:, None, None]
deg = 1
nsamp=20*10*10
shape=(20,10,10)
chunk_size=(20,1,1)
a = da.linspace(1, nsamp, nsamp).reshape(shape)
chunked = a.rechunk(chunk_size)
times = da.linspace(1, shape[0], shape[0])
results = chunked.map_blocks(f, chunks=(20,1,1), args=[times, deg], dtype='float').compute()
m_fit = results[0]
b_fit = results[1]
# Plot a few fits to visually examine them
fig, ax = plt.subplots(nrows=1, ncols=1)
for (x,y) in zip([1,9], [1,9]):
    ax.scatter(times, chunked[:,x,y])
    ax.plot(times, np.polyval([m_fit[x, y], b_fit[x,y]], times))
The array, chunked, and the resulting plot (images not shown) look exactly as I would expect, so all is well! However, the issue arises whenever I try to use a chunk size larger than one.
MWE: Raises TypeError
nsamp=20*10*10
shape=(20,10,10)
chunk_size=(20,5,5) # Chunking the data now
a = da.linspace(1,nsamp, nsamp).reshape(shape)
chunked = a.rechunk(chunk_size)
times = da.linspace(1, shape[0], shape[0])
results = chunked.map_blocks(f, chunks=(20,1,1), args=[times, 1], dtype='float') # error
Does anyone have any ideas as to what is happening here?
It looks like your function expects one-dimensional inputs. With chunks of shape (20, 5, 5), y.squeeze() is still 3-D, and np.polyfit only accepts a 1-D or 2-D y, hence the TypeError. I wonder if there is a way that you can write a Python function that wraps your function and handles the unpacking and then repacking of one-dimensional inputs. If you can get that function to work on a single numpy array of shape (20, 2, 2), for example, then you can probably use Dask to apply that function across many similarly sized chunks.
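One way to write such a wrapper, sketched here in plain NumPy under the assumption that time is the first axis of every block, is to flatten the two pixel axes into one so that np.polyfit sees a 2-D y, and then restore the pixel shape afterwards. The function name fit_block is hypothetical and not part of the original post.

```python
import numpy as np

def fit_block(block, times, deg):
    """Fit a polynomial along axis 0 for every pixel of a (T, ny, nx) block."""
    n_times = block.shape[0]
    # Flatten the pixel axes: (T, ny, nx) -> (T, ny * nx).
    flat = block.reshape(n_times, -1)
    # np.polyfit fits each column of a 2-D y independently.
    coeffs = np.polyfit(times, flat, deg)  # shape (deg + 1, ny * nx)
    # Restore the pixel axes: (deg + 1, ny, nx).
    return coeffs.reshape((deg + 1,) + block.shape[1:])

# Test on a single NumPy block of shape (20, 2, 2), as suggested above.
times = np.arange(1, 21, dtype=float)
block = np.arange(1, 20 * 2 * 2 + 1, dtype=float).reshape(20, 2, 2)
result = fit_block(block, times, 1)
print(result.shape)
```

If this is later passed to map_blocks, the output block shape is (deg + 1, ny, nx) rather than (20, 1, 1), so the chunks argument would presumably need to reflect that; this has not been verified against Dask here.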

Error: [only length-1 arrays can be converted to Python scalars] when changing variable order

Dear Stackoverflow Community,
I am very new to Python and to programming in general, so please don't get mad when I don't get your answers and ask again.
I am trying to fit a curve to experimental data with scipy.optimize.curve_fit. This is my code:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as nm
from __future__ import division
import cantera as ct
from matplotlib.backends.backend_pdf import PdfPages
import math as ma
import scipy.optimize as so
R = 8.314
T = nm.array([700, 900, 1100, 1300, 1400, 1500, 1600, 1700])
k = nm.array([289, 25695, 763059, 6358040, 14623536, 30098925, 56605969, 98832907])
def func(A, E, T):
    return A*ma.exp(-E/(R*T))
popt, pcov = so.curve_fit(func, T, k)
Now this code works for me, but if I change the function to:
def func(T, A, E)
and keep the rest I get:
TypeError: only length-1 arrays can be converted to Python scalars
Also I am not really convinced by the Parameter solution of the first one.
Can anyone tell me what happens when you change the variable order?
I had the same problem and found the cause and a solution:
The problem lies in how Scipy calls your function. curve_fit calls it with the whole input array xdata as the first argument, that is, func(xdata, *params), and functions from the math module complain with a TypeError because xdata is not a scalar. For example:
from math import erf
import numpy as np
erf([1, 2]) # TypeError
erf(np.array([1, 2])) # TypeError
To avoid the error, you can add custom code to support arrays or, better, as suggested in the answer of Joris, use numpy functions, because they support both scalars and arrays.
If the math function is not in numpy, like erf or any custom function you coded, then instead of doing from math import erf, I recommend the following:
from math import erf as math_erf # only supports scalars
import numpy as np
from scipy.optimize import curve_fit
erf = np.vectorize(math_erf) # adds array support
def fit_func(t,s):
    return 0.5*(1.0-erf(t/(np.sqrt(2)*s)))
X = np.linspace(-5,5,1000)
Y = np.array([fit_func(x,1) for x in X])
curve_fit(fit_func, X, Y)
The curve_fit function from scipy passes arrays to your model, which the scalar functions from the math module cannot handle. When you change the exponential function to the numpy exponential function you don't get the error:
def func(A, E, T):
    return A*np.exp(-E/(R*T))
I wonder whether your data shows an exponential decay of rate. The mathematical model may not be the most suitable one.
See the doc string of curve_fit
f : callable
The model function, f(x, ...). It must take the independent variable as the first argument and the parameters to fit as separate remaining arguments.
since your formula is essentially: k=A*ma.exp(-E/(R*T)), the right order of parameters in func should be (T, A, E) or (T, E, A).
Regarding the order of A and E, they don't really matter. If you flip them, the result will get flipped as well:
>>> def func(T, A, E):
...     return A*ma.exp(-E/(R*T))
>>> so.curve_fit(func, T, k)
(array([ 8.21449078e+00, -5.86499656e+04]), array([[ 6.07720215e+09, 4.31864058e+12],
[ 4.31864058e+12, 3.07102992e+15]]))
>>> def func(T, E, A):
...     return A*ma.exp(-E/(R*T))
>>> so.curve_fit(func, T, k)
(array([ -5.86499656e+04, 8.21449078e+00]), array([[ 3.07102992e+15, 4.31864058e+12],
[ 4.31864058e+12, 6.07720215e+09]]))
I didn't get your TypeError at all.
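Putting the answers together, here is a small sketch of the model with the independent variable T first and np.exp in place of math.exp. It only demonstrates that the NumPy version evaluates on the whole temperature array while math.exp does not; the parameter values A=1.0 and E=1.0e5 are arbitrary placeholders chosen for illustration, not fitted results.

```python
import math
import numpy as np

R = 8.314
T = np.array([700, 900, 1100, 1300, 1400, 1500, 1600, 1700])

def func(T, A, E):
    # Independent variable first, then the parameters to fit.
    return A * np.exp(-E / (R * T))

# math.exp only accepts scalars, so an array argument raises TypeError.
try:
    math.exp(-1.0 / (R * T))
    math_accepts_arrays = True
except TypeError:
    math_accepts_arrays = False

# The NumPy version works element-wise on the whole array.
values = func(T, 1.0, 1.0e5)  # placeholder parameters, not a fit
print(math_accepts_arrays, values.shape)
```

A function written this way can be handed to curve_fit as curve_fit(func, T, k, p0=...); a sensible initial guess p0 is likely to matter for this model because A and E differ by many orders of magnitude.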
