Python - curve fit producing incorrect fit

I'm trying to fit a sine wave curve to this data distribution, but for some reason the fit is incorrect:
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
from scipy.optimize import curve_fit
#=======================
#====== Analysis =======
#=======================
# sine curve fit
def fit_Sin(t, A, b, C):
    return A*np.sin(t*b) + C
## The data extraction
t,y,y1 = np.loadtxt("new10_CoCore_5to20_BL.txt", unpack=True)
xdata = t
popt, pcov = curve_fit(fit_Sin, t, y)
print "A = %s , b = %s, C = %s" % (popt[0], popt[1], popt[2])
#=======================
#====== Plotting =======
#=======================
fig1 = plt.figure()
ax1 = fig1.add_subplot(111)
ax1.plot(t, y, ".")
ax1.plot(t, fit_Sin(t, *popt))
plt.show()
As you can see, the fit drastically underestimates the data.
The data is provided here: https://www.dropbox.com/sh/72jnpkkk0jf3sjg/AAAb17JSPbqhQOWnI68xK7sMa?dl=0
Any idea why this is happening?

Sine waves are extremely difficult to fit if your frequency guess is off. That is because with a sufficient number of cycles in the data, the guess will be out of phase with half the data and in phase with half of it for even a small error in the frequency. At that point, a straight line offers a better fit than a sine wave of different frequency. That is how Fourier transforms work by the way.
I can think of three ways to estimate the frequency well enough to let a nonlinear least squares algorithm take over:
1. Eyeball it. Subtract the x-values of two peaks in the GUI or even at the command line. If you have very low-noise data, you can automate this process quite easily.
2. Use a discrete Fourier transform. If your data is a sine wave with a single component, the first non-constant peak will give you the frequency. I have found this to require some additional tweaking, since the sampling frequency is often not a multiple of the sine wave's frequency. A parabolic fit to the three points around the peak (including the peak itself) can help in this situation; see the sketch after this list.
3. Find where your data crosses the vertical offset. This is similar to #1 but is easier to automate for relatively non-noisy data. The wavelength is twice the distance between consecutive crossings.
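A hedged sketch of option #2, assuming t is uniformly sampled and y is the data loaded in the question; the refinement uses the standard three-point parabolic vertex formula:
import numpy as np
# locate the dominant DFT peak and refine it with a parabolic fit
dt = np.mean(np.diff(t))
spectrum = np.abs(np.fft.rfft(y - y.mean()))
freqs = np.fft.rfftfreq(len(y), d=dt)
k = spectrum[1:-1].argmax() + 1                     # peak index, away from the edges
s_m, s_0, s_p = spectrum[k-1], spectrum[k], spectrum[k+1]
delta = 0.5*(s_m - s_p)/(s_m - 2*s_0 + s_p)         # parabolic vertex offset in bins
f_est = (k + delta)*freqs[1]                        # refined frequency (cycles per unit t)
b_guess = 2*np.pi*f_est                             # angular frequency guess for the sine fit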
Using #1, I can clearly see that your wavelength is 50. The initial guess for b should therefore be 2*np.pi/50. Also, don't forget to add a phase shift parameter to allow the fit to slide horizontally: A*sin(b*t + d) + C.
You will need to pass in an initial guess via the p0 parameter to curve_fit. A good eyeball estimate is p0=(0.55, np.pi/25, 0.0, -np.pi/25*12.5). The phase shift in your data appears to be a quarter period to the right, hence the 12.5.
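Putting that together, a minimal sketch (assuming the p0 values above are ordered as (A, b, C, d), with t and y being the arrays loaded in the question):
import numpy as np
from scipy.optimize import curve_fit
def fit_Sin(t, A, b, C, d):
    # amplitude, angular frequency, vertical offset, phase shift
    return A*np.sin(b*t + d) + C
p0 = (0.55, np.pi/25, 0.0, -np.pi/25*12.5)  # A, b, C, d eyeball estimates
popt, pcov = curve_fit(fit_Sin, t, y, p0=p0)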
I am currently in the process of writing an algorithm for fitting noisy sine waves with a single frequency component that I will submit to SciPy. Will update when I finish.


Exponential decay curve fitting with scipy.optimize

I am trying to fit a curve with the curve_fit function in SciPy. Changing the initial values of the model changes the quality of the fit, but I am not able to find the best fit through my data. Here is how my fit looks:
My question is: how can I improve this fit, and what is the best way of selecting the initial values of the model?
I have attached the raw data to which I want to fit an exponential curve.
This is the data I am using:
y = [ 338.52656636 337.43934446 348.25434126 308.42768639 279.24436171
269.85992004 279.24436171 249.25992615 239.53215125 219.96215705
220.41993469 220.30549028 220.30549028 195.07049776 180.364391
171.20883816 180.24994659 180.13550218 180.47883541 209.89104892
220.19104587 180.02105777 595.45426801 324.50712607 150.60884426
170.97994934 171.20883816 170.75106052 170.75106052 159.76439711
140.88106937 150.37995544 140.88106937 1620.70451979 140.42329173
150.37995544 140.53773614 284.68047121 1146.84743797 170.97994934
150.60884426 145.74495682 141.10995819 121.53996399 121.19663076
131.38218329 170.40772729 140.42329173 140.82384716 145.5732902
140.30884732 121.53996399 700.39979247 2783.74584185 131.26773888
140.76662496 140.53773614 121.76885281 126.23218482 130.69551683]
and here is my code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def expDecay(t, Amax, tau):
    return Amax/tau*np.exp(-t/tau)
Amax = []
Tau = []
ydata = y = np.asarray(y)  # y holds the values listed above
xdata = x = np.arange(len(y))
popt, pcov = curve_fit(expDecay, x, y,
                       p0=(10000, 5),
                       bounds=([0., 2.], [10000., 30]))
Amax.append(popt[0])
Tau.append(popt[1])
plt.plot(xdata, expDecay(xdata, *popt), 'k-', label='Pred.')
plt.plot(ydata)
plt.ylim([0, 500])
plt.show()
The deviation is due to the outliers. After eliminating them:
A note about eliminating the outliers.
Since the definition of an outlier is subjective, software that does this will probably be more or less interactive. I built my own very rudimentary software. The principle is:
A first nonlinear regression is done with all the points. With the function and parameters obtained, the value of y is computed for each point. The absolute differences between the "computed y" and the "y values" from the given data file are compared. This allows eliminating the point that is furthest away.
Another nonlinear regression is done with the remaining points. The same procedure eliminates a second point.
And so on, until a specified stopping criterion is reached. That is the subjective part.
With your data (60 points) point no. 54 was eliminated first, then point no. 34, then no. 39, and so on. The process stops after eliminating 6 points; eliminating more points doesn't improve the LMSE much.
The curve above is the result of the last nonlinear regression with the 54 remaining points.
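A rough Python sketch of that elimination loop (hedged: a rudimentary re-implementation of the described procedure, not the author's actual software; expDecay and the starting guess follow the question's code):
import numpy as np
from scipy.optimize import curve_fit
def expDecay(t, Amax, tau):
    return Amax/tau*np.exp(-t/tau)
def drop_outliers(x, y, n_remove, p0=(10000, 5)):
    # repeatedly fit, then discard the point with the largest absolute residual
    x, y = np.asarray(x, float), np.asarray(y, float)
    for _ in range(n_remove):
        popt, _ = curve_fit(expDecay, x, y, p0=p0)
        residuals = np.abs(y - expDecay(x, *popt))
        keep = residuals < residuals.max()          # drop the single worst point
        x, y = x[keep], y[keep]
    popt, pcov = curve_fit(expDecay, x, y, p0=p0)
    return x, y, popt
x_clean, y_clean, popt = drop_outliers(np.arange(len(y)), y, n_remove=6)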

How to estimate confidence intervals beyond the current simulated step, based on existing data from 1,000,000 Monte Carlo simulations?

Situation:
I have a program which generates 600 random numbers per "step".
These 600 numbers are fed into a complicated algorithm, which then outputs a single value (which can be positive or negative) for that "step"; let's call this Value-X for that step.
This Value-X is then added to a Global-Value-Y, making the latter a running sum of each step in the series.
I have essentially run this simulation 1,000,000 times, recording the values of Global-Value-Y at each step in those simulations.
I have "calculated confidence intervals" from those one-million simulations, by sorting the simulations by (the absolute value of) their Global-Value-Y at each column, and finding the 90th percentile, the 99th percentile, etc.
What I want to do:
Using the pool of simulation results, "extrapolate" from that to find some equation that will estimate the confidence intervals for results from the used algorithm, many "steps" into the future, without having to extend the runs of those one-million simulations further. (it would take too long to keep running those one-million simulations indefinitely)
Note that the results do not have to be terribly precise at this point; the results are mainly used atm as a visual indicator on the graph, for the user to get an idea of how "normal" the current simulation's results are relative to the confidence-intervals extrapolated from the historical data of the one-million simulations.
Anyway, I've already made some attempts at finding an "estimated curve-fit" of the confidence-intervals from the historical data (ie. those based on the one-million simulations), but the results are not quite precise enough.
Here are the key parts from the curve-fitting Python code I've tried: (link to full code here)
# curve fit functions
def func_linear(t, a, b):
    return a*t + b
def func_quadratic(t, a, b, c):
    return a*pow(t,2) + b*t + c
def func_cubic(t, a, b, c, d):
    return a*pow(t,3) + b*pow(t,2) + c*t + d
def func_biquadratic(t, a, b, c, d, e):
    return a*pow(t,4) + b*pow(t,3) + c*pow(t,2) + d*t + e
[...]
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# calling the read function on the recorded percentile/confidence-intervals data
xVals,yVals = read_file()
# using inbuilt function for linear fit
popt, pcov = curve_fit(func_linear, xVals, yVals)
fit_linear = func_linear(np.array(xVals), *popt)
[same here for the other curve-fit functions]
[...]
plt.rcParams["figure.figsize"] = (40,20)
# plotting the respective curve fits
plt.plot(xVals, yVals, color="blue", linewidth=3)
plt.plot(xVals, fit_linear, color="red", linewidth=2)
plt.plot(xVals, fit_quadratic, color="green", linewidth=2)
plt.plot(xVals, fit_cubic, color="orange", linewidth=2)
#plt.plot(xVals, fit_biquadratic, color="black", linewidth=2) # extremely off
plt.legend(['Actual data','linear','quadratic','cubic','biquadratic'])
plt.xlabel('Session Column')
plt.ylabel('y-value for CI')
plt.title('Curve fitting')
plt.show()
And here are the results: (with the purple "actual data" being the 99.9th percentile of the Global-Value-Ys at each step, from the one-million recorded simulations)
While it seems to be attempting to estimate a curve-fit (over the graphed ~630 steps), the results are not quite accurate enough for my purposes. The imprecision is particularly noticeable at the first ~5% of the graph, where the estimate-curve is far too high. (though practically, my issues are more on the other end, as the CI curve-fit keeps getting less accurate the farther out from the historical data it goes)
EDIT: As requested by a commenter, here is a GitHub gist of the Python code I'm using to attempt the curve-fit (and it includes the data used): https://gist.github.com/Venryx/74a44ed25d5c4dc7768d13e22b36c8a4
So my questions:
Is there something wrong with my usage of scipy's curve_fit function?
Or is the curve_fit function too basic of a tool to get meaningful estimates/extrapolations of confidence-intervals for random-walk data like this?
If so, is there some alternative that works better for estimating/extrapolating confidence-intervals from random-walk data of this sort?
If I understand your question correctly, you have a lot of Monte Carlo simulation results gathered as (x, y) points and you want to find a model that reasonably fits them.
Importing your data to create an MCVE:
import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt
data = np.array([
[0,77],[1,101],[2,121],[3,138],[4,151],[5,165],[6,178],[7,189],[8,200],[9,210],[10,221],[11,229],[12,238],[13,247],[14,254],[15,264],[16,271],[17,278],[18,285],[19,291],[20,299],[21,305],[22,312],[23,318],[24,326],[25,331],[26,338],[27,344],[28,350],[29,356],[30,362],[31,365],[32,371],[33,376],[34,383],[35,387],[36,393],[37,399],[38,404],[39,409],[40,414],[41,419],[42,425],[43,430],[44,435],[45,439],[46,444],[47,447],[48,453],[49,457],[50,461],[51,467],[52,472],[53,476],[54,480],[55,483],[56,488],[57,491],[58,495],[59,499],[60,504],[61,508],[62,512],[63,516],[64,521],[65,525],[66,528],[67,532],[68,536],[69,540],[70,544],[71,547],[72,551],[73,554],[74,560],[75,563],[76,567],[77,571],[78,574],[79,577],[80,582],[81,585],[82,588],[83,591],[84,595],[85,600],[86,603],[87,605],[88,610],[89,613],[90,617],[91,621],[92,624],[93,627],[94,630],[95,632],[96,636],[97,638],[98,642],[99,645],[100,649],[101,653],[102,656],[103,660],[104,664],[105,667],[106,670],[107,673],[108,674],[109,679],[110,681],[111,684],[112,687],[113,689],[114,692],[115,697],[116,698],[117,701],[118,705],[119,708],[120,710],[121,712],[122,716],[123,718],[124,722],[125,725],[126,728],[127,730],[128,732],[129,735],[130,739],[131,742],[132,743],[133,747],[134,751],[135,753],[136,754],[137,757],[138,760],[139,762],[140,765],[141,768],[142,769],[143,774],[144,775],[145,778],[146,782],[147,784],[148,788],[149,790],[150,793],[151,795],[152,799],[153,801],[154,804],[155,808],[156,811],[157,812],[158,814],[159,816],[160,819],[161,820],[162,824],[163,825],[164,828],[165,830],[166,832],[167,834],[168,836],[169,839],[170,843],[171,845],[172,847],[173,850],[174,853],[175,856],[176,858],[177,859],[178,863],[179,865],[180,869],[181,871],[182,873],[183,875],[184,878],[185,880],[186,883],[187,884],[188,886],[189,887],[190,892],[191,894],[192,895],[193,898],[194,900],[195,903],[196,903],[197,905],[198,907],[199,910],[200,911],[201,914],[202,919],[203,921],[204,922],[205,926],[206,927],[207,928],[208,931],[209,933],[210,935],[211,940],[212,942],[213,944],[214,943],[215,948],[216,950],[217,954],[218,955],[219,957],[220,959],[221,963],[222,965],[223,967],[224,969],[225,970],[226,971],[227,973],[228,975],[229,979],[230,980],[231,982],[232,983],[233,986],[234,988],[235,990],[236,992],[237,993],[238,996],[239,998],[240,1001],[241,1003],[242,1007],[243,1007],[244,1011],[245,1012],[246,1013],[247,1016],[248,1019],[249,1019],[250,1020],[251,1024],[252,1027],[253,1029],[254,1028],[255,1031],[256,1033],[257,1035],[258,1038],[259,1040],[260,1041],[261,1046],[262,1046],[263,1046],[264,1048],[265,1052],[266,1053],[267,1055],[268,1056],[269,1057],[270,1059],[271,1061],[272,1064],[273,1067],[274,1065],[275,1068],[276,1071],[277,1073],[278,1074],[279,1075],[280,1080],[281,1081],[282,1083],[283,1084],[284,1085],[285,1086],[286,1088],[287,1090],[288,1092],[289,1095],[290,1097],[291,1100],[292,1100],[293,1102],[294,1104],[295,1107],[296,1109],[297,1110],[298,1113],[299,1114],[300,1112],[301,1116],[302,1118],[303,1120],[304,1121],[305,1124],[306,1124],[307,1126],[308,1126],[309,1130],[310,1131],[311,1131],[312,1135],[313,1137],[314,1138],[315,1141],[316,1145],[317,1147],[318,1147],[319,1148],[320,1152],[321,1151],[322,1152],[323,1155],[324,1157],[325,1158],[326,1161],[327,1161],[328,1163],[329,1164],[330,1167],[331,1169],[332,1172],[333,1175],[334,1177],[335,1177],[336,1179],[337,1181],[338,1180],[339,1184],[340,1186],[341,1186],[342,1188],[343,1190],[344,1193],[345,1195],[346,1197],[347,1198],[348,1198],[349,1200],[350,1203],[351,1204],[352,1206],[353,1207],[354,1209],[
355,1210],[356,1210],[357,1214],[358,1215],[359,1215],[360,1219],[361,1221],[362,1222],[363,1223],[364,1224],[365,1225],[366,1228],[367,1232],[368,1233],[369,1236],[370,1237],[371,1239],[372,1239],[373,1243],[374,1244],[375,1244],[376,1244],[377,1247],[378,1249],[379,1251],[380,1251],[381,1254],[382,1256],[383,1260],[384,1259],[385,1260],[386,1263],[387,1264],[388,1265],[389,1267],[390,1271],[391,1271],[392,1273],[393,1274],[394,1277],[395,1278],[396,1279],[397,1281],[398,1285],[399,1286],[400,1289],[401,1288],[402,1290],[403,1290],[404,1291],[405,1292],[406,1295],[407,1297],[408,1298],[409,1301],[410,1300],[411,1301],[412,1303],[413,1305],[414,1307],[415,1311],[416,1312],[417,1313],[418,1314],[419,1316],[420,1317],[421,1316],[422,1319],[423,1319],[424,1321],[425,1322],[426,1323],[427,1325],[428,1326],[429,1329],[430,1328],[431,1330],[432,1334],[433,1335],[434,1338],[435,1340],[436,1342],[437,1342],[438,1344],[439,1346],[440,1347],[441,1347],[442,1349],[443,1351],[444,1352],[445,1355],[446,1358],[447,1356],[448,1359],[449,1362],[450,1362],[451,1366],[452,1365],[453,1367],[454,1368],[455,1368],[456,1368],[457,1371],[458,1371],[459,1374],[460,1374],[461,1377],[462,1379],[463,1382],[464,1384],[465,1387],[466,1388],[467,1386],[468,1390],[469,1391],[470,1396],[471,1395],[472,1396],[473,1399],[474,1400],[475,1403],[476,1403],[477,1406],[478,1406],[479,1412],[480,1409],[481,1410],[482,1413],[483,1413],[484,1418],[485,1418],[486,1422],[487,1422],[488,1423],[489,1424],[490,1426],[491,1426],[492,1430],[493,1430],[494,1431],[495,1433],[496,1435],[497,1437],[498,1439],[499,1440],[500,1442],[501,1443],[502,1442],[503,1444],[504,1447],[505,1448],[506,1448],[507,1450],[508,1454],[509,1455],[510,1456],[511,1460],[512,1459],[513,1460],[514,1464],[515,1464],[516,1466],[517,1467],[518,1469],[519,1470],[520,1471],[521,1475],[522,1477],[523,1476],[524,1478],[525,1480],[526,1480],[527,1479],[528,1480],[529,1483],[530,1484],[531,1485],[532,1486],[533,1487],[534,1489],[535,1489],[536,1489],[537,1492],[538,1492],[539,1494],[540,1493],[541,1494],[542,1496],[543,1497],[544,1499],[545,1500],[546,1501],[547,1504],[548,1506],[549,1508],[550,1507],[551,1509],[552,1510],[553,1510],[554,1512],[555,1513],[556,1516],[557,1519],[558,1520],[559,1520],[560,1522],[561,1525],[562,1527],[563,1530],[564,1531],[565,1533],[566,1533],[567,1534],[568,1538],[569,1539],[570,1538],[571,1540],[572,1541],[573,1543],[574,1545],[575,1545],[576,1547],[577,1549],[578,1550],[579,1554],[580,1554],[581,1557],[582,1559],[583,1564],[584,1565],[585,1567],[586,1567],[587,1568],[588,1570],[589,1571],[590,1569],[591,1572],[592,1572],[593,1574],[594,1574],[595,1574],[596,1579],[597,1578],[598,1579],[599,1582],[600,1583],[601,1583],[602,1586],[603,1585],[604,1588],[605,1590],[606,1592],[607,1596],[608,1595],[609,1596],[610,1598],[611,1601],[612,1603],[613,1604],[614,1605],[615,1605],[616,1608],[617,1611],[618,1613],[619,1615],[620,1613],[621,1616],[622,1617],[623,1617],[624,1619],[625,1617],[626,1618],[627,1619],[628,1618],[629,1621],[630,1624],[631,1626],[632,1626],[633,1631]
])
Plotting them shows a curve that seems to have square-root and linear behavior at the beginning and the end of the dataset, respectively. So let's try this simple first model:
def model(x, a, b, c):
    return a*np.sqrt(x) + b*x + c
Notice that, formulated this way, it is a linear least squares (LLS) problem, which is a good sign for solving it. The optimization with curve_fit works as expected:
popt, pcov = optimize.curve_fit(model, data[:,0], data[:,1])
# [61.20233162 0.08897784 27.76102519]
# [[ 1.51146696e-02 -4.81216428e-04 -1.01108383e-01]
# [-4.81216428e-04 1.59734405e-05 3.01250722e-03]
# [-1.01108383e-01 3.01250722e-03 7.63590271e-01]]
It returns a pretty decent fit. Graphically it looks like this:
Of course, this is just a fit to an arbitrary model chosen from personal experience (to me it looks like the kinetics of a specific heterogeneous reaction).
If you have a theoretical reason to accept or reject this model, you should use it to discriminate. I would also investigate the units of the parameters to check whether they have a meaningful interpretation.
In any case, this is out of scope for the Stack Overflow community, which is oriented toward solving programming issues rather than the scientific validation of models (see Cross Validated or Math Overflow if you want to dig deeper). If you do, draw my attention to it; I would be glad to follow the question in detail.
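For the extrapolation itself, a minimal sketch (the 2000-step horizon and the reuse of popt are illustrative assumptions, not part of the answer) would simply evaluate the fitted model beyond the recorded steps:
x_future = np.arange(0, 2000)                       # steps beyond the recorded ~630
y_future = model(x_future, *popt)                   # a*sqrt(x) + b*x + c extrapolation
plt.plot(data[:,0], data[:,1], label="recorded CI data")
plt.plot(x_future, y_future, "--", label="extrapolated model")
plt.legend()
plt.show()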

How good is this interpolation method?

I came up with a custom interpolation method for my problem and I'd like to ask if there are any risks in using it. I am not a math or programming expert, which is why I'd like feedback :)
Story:
I was searching for a good curve-fit method for my data when I came up with an idea to interpolate the data.
I am mixing paints together and making reflectance measurements with a spectrophotometer when the film is dry. I would like to calculate the required proportions of white and colored paints to reach a certain lightness, regardless of any hue shift (e.g. black+white paints gives a bluish grey) or chroma loss (e.g. orange+white gives "pastel" yellowish orange, etc.)
I checked whether the Beer-Lambert law applies, but it does not. Pigment mixing behaves in a more complicated fashion than dye dilution. So I wanted to fit a curve to my data points (the process is explained here: Interpolation for color-mixing).
The first step was making a calibration curve. I tested the following ratios of colored vs. white paint mixed together:
ratios = 1, 1/2., 1/4., 1/8., 1/16., 1/32., 1/64., 0
This is the plot of my carefully prepared samples, measured with a spectrophotometer: the blue curve represents the full color (ratio = 1), the red curve the white paint (ratio = 0), and the black curves the mixed samples:
For the second step, I wanted to derive from this data a function that would compute a spectral curve for any ratio between 0 and 1. I tested several curve-fitting (fitting an exponential function) and interpolation (quadratic, cubic) methods, but the results were of poor quality.
For example, this is my reflectance data at 380nm for all the color samples:
This is the result of scipy.optimize.curve_fit using the function:
def func(x, a, b, c):
    return a * np.exp(-b * x) + c
popt, pcov = curve_fit(func, x, y)
Then I came up with this idea: the logarithm of the spectral data gives a closer match to a straight line, and the logarithm of the logarithm of the data is almost a straight line, as demonstrated by this code and graph:
import numpy as np
import matplotlib.pyplot as plt
reflectance_at_380nm = 5.319, 13.3875, 24.866, 35.958, 47.1105, 56.2255, 65.232, 83.9295
ratios = 1, 1/2., 1/4., 1/8., 1/16., 1/32., 1/64., 0
linear_approx = np.log(np.log(reflectance_at_380nm))
plt.plot(ratios, linear_approx)
plt.show()
What I did then was interpolate the linear approximation and then convert the data back; this gave a very nice interpolation of my data, much better than what I got before:
import numpy as np
import matplotlib.pyplot as plt
import scipy.interpolate
reflectance_at_380nm = 5.319, 13.3875, 24.866, 35.958, 47.1105, 56.2255, 65.232, 83.9295
ratios = 1, 1/2., 1/4., 1/8., 1/16., 1/32., 1/64., 0
linear_approx = np.log(np.log(reflectance_at_380nm))
# the spline routines expect increasing x, so sort by ratio first
order = np.argsort(ratios)
xs = np.asarray(ratios)[order]
ys = linear_approx[order]
xnew = np.arange(100)/100.
# scipy.interpolate.spline was removed from SciPy; make_interp_spline(k=1) is the equivalent order-1 (linear) spline
cs = scipy.interpolate.make_interp_spline(xs, ys, k=1)(xnew)
cs = np.exp(np.exp(cs))
plt.plot(xnew, cs)
plt.plot(ratios, reflectance_at_380nm, 'ro')
plt.show()
So my question is for experts: how good is this interpolation method and what are the risks of using it? Can it lead to wrong results?
Also: can this method be improved, or does it already exist, and if so, what is it called?
Thank you for reading
This looks similar to the Kernel Method that is used for fitting regression lines or finding decision boundaries for classification problems.
The idea behind the kernel trick is that the data is transformed into another (often higher-dimensional) space where it is linearly separable (for classification) or admits a linear curve-fit (for regression). After the curve-fitting is done, the inverse transformation is applied. In your case, the successive logarithms (log(log(x))) are the transformation, and the successive exponentiations (exp(exp(x))) are the inverse transformation.
I am not sure if there is a kernel that does exactly this, but the intuition is similar. Here is a medium article explaining this for classification using SVM:
https://medium.com/@zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d
Since this kind of approach is quite popular in machine learning, I doubt it will lead to wrong results if the fit is done properly (neither under-fit nor over-fit), and that needs to be judged by statistical testing.
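As a concrete illustration of that transform / fit / inverse-transform idea, here is a sketch using a plain least-squares line via np.polyfit rather than a true kernel method (the data values are copied from the question):
import numpy as np
reflectance_at_380nm = np.array([5.319, 13.3875, 24.866, 35.958, 47.1105, 56.2255, 65.232, 83.9295])
ratios = np.array([1, 1/2., 1/4., 1/8., 1/16., 1/32., 1/64., 0])
transformed = np.log(np.log(reflectance_at_380nm))     # transformation: roughly linear in the ratio
slope, intercept = np.polyfit(ratios, transformed, 1)  # linear fit in the transformed space
xnew = np.linspace(0, 1, 100)
predicted = np.exp(np.exp(slope*xnew + intercept))     # inverse transformation back to reflectance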

Parameters of cosine squared scipy optimize curvefit are incorrect in python

I am trying to fit a cosine squared to a data array from an optics interferometry intensity measurement. Unfortunately, the fit returns amplitudes and periods that are way off. Only once did I get a more reasonable fit, by selecting the first 200 data points from the array (and with some other selections). Those fit parameters were used as initial guesses to extend the fit to the entire array, which gave back a plot similar to the image.
import csv
import numpy as np
import matplotlib.pyplot as plt
import scipy as sy
from numpy import genfromtxt
from scipy.optimize import curve_fit
# reads the data from the csv file
csvfile ="</home/pi/Desktop/molecularpolOutput_No2.csv>"
csv = genfromtxt ('molecularpolOutput_No2.csv', delimiter=",")
# defines the data as variables
pressure = csv[100:200,2]
intensity = csv[100:200,3]
temperature = csv[:,1]
pi = 3.14
P = pressure
# defines the function and initial fit parameters
def func(P, T, a, b, c):
    return a*np.cos((2*pi*P)/T+b)**2+c
p0 = np.array([2200, 45, 4000, 85])
# fits the function
coeffs, pcov = curve_fit(func, pressure, intensity, p0)
I = func(P, coeffs[0], coeffs[1], coeffs[2], coeffs[3])
print('period =', coeffs[0], 'Pa')
# plots the data and the function
fig = plt.figure(figsize=(10, 3), dpi=100)
plt.plot(pressure, intensity, linestyle="none", marker=".")
plt.plot(pressure, I)
plt.xlabel('Pressure (Pa)')
plt.ylabel('Relative intensity')
plt.title('interference intensity plot of Newtons rings ')
plt.show()
I would expect the fit to be correct for both a large and a small data array. However, as the figures show, extending the array messes with both the amplitude and the period. The fit that looks OK also gives values for the period comparable to other experiments. The data generated by the photoresistor is not precisely linear, but I assume this should not be a problem for curve_fit. Is there something I can change in the code to get the fit working? I already tried this code: How do I fit a sine curve to my data with pylab and numpy?
Update:
A least-squares curve fit in Matlab gives the same problem. Should I try another method to fit the curve, or is it the data that causes the problem?
Matlab Code:
%% Opens excel file
filename = 'vpnat_1.xlsx';
Pr = xlsread(filename,'D1:D500');
I = xlsread(filename, 'E1:E500');
P = Pr;
% defines figure size relative to screen
scrsz = get(groot,'ScreenSize');
figure('Position',[1 scrsz(4)/2 scrsz(3)/2 scrsz(4)/4])
%% fit & plots
hold on
scatter(P,I,'.'); % scatter plot
%% defines parameter guesses
Im = mean(I);
Iu = max(I);
Il = min(I);
Ia = Iu-Il;
Ip = 2000;
Id = -4000;
a_0 = [Ia; Ip; Id; Im]; % initial guesses
fun = @(a,P) a(1).*(cos((2*pi*P)./a(2)+a(3)).^2)+a(4); % defines function
fcn = @(a) sum((fun(a,P)-I).^2); % objective: sum of squared residuals
s = fminsearch(fcn, a_0);
plot(P,fun(s,P)) % plots fitted function
hold off
I solved the problem by using Matlab. It appears that the parameters were too poorly defined for curve_fit in Python to find a least-squares solution within its given boundaries (a constraint on the number of iterations?).
Matlab appeared to accept a larger margin of error in the initial parameters and therefore found a fit for all selections of data. Using the fit parameters from Matlab as initial parameters in Python returns a proper fit. The problem in Python could be prevented by computing the guesses for the parameters from the data to get a better start, as sketched below.
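A hedged sketch of that idea, computing the initial guesses from the data itself (pressure and intensity are the arrays from the question; the FFT-based period estimate assumes roughly evenly spaced pressure samples and is a suggestion, not part of the original answer):
import numpy as np
from scipy.optimize import curve_fit
def func(P, T, a, b, c):
    return a*np.cos((2*np.pi*P)/T + b)**2 + c
a0 = intensity.max() - intensity.min()              # a*cos(...)**2 + c spans roughly [c, a + c]
c0 = intensity.min()
freqs = np.fft.rfftfreq(len(pressure), d=np.mean(np.diff(pressure)))
spectrum = np.abs(np.fft.rfft(intensity - intensity.mean()))
f_peak = freqs[spectrum[1:].argmax() + 1]           # dominant oscillation frequency
T0 = 2.0/f_peak                                     # cos**2 oscillates at 2/T, so double the observed period
p0 = (T0, a0, 0.0, c0)
coeffs, pcov = curve_fit(func, pressure, intensity, p0=p0, maxfev=10000)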

How can I find the break frequencies/3dB points from a bandpass filter frequency sweep data in python?

The data that I have is stored in a 2D list, where one column represents frequency and the other column the corresponding value in dB. I would like to programmatically identify the frequencies of the 3 dB points on either end of the passband. I have a few ideas on how to do this, but they each have drawbacks.
1. Find the maximum point, then the average of the points in the passband, then find the points about 3 dB lower.
2. Use the sympy library to perform numerical differentiation and identify the critical points/inflection points.
3. Use a histogram/bin function to find the amplitude of the passband.
Drawbacks:
1. Sensitive to spikes; not quite sure how to do this.
2. I don't understand the math involved, and the data is noisy, which could lead to a lot of false positives.
3. Correlating the amplitude values with list index values could be tricky.
Can you think of better ideas and/or ways to implement what I have described?
Assuming that you've loaded multiple readings of the PSD from the signal analyzer, try averaging them before attempting to find the bandedges. If the signal isn't changing too dramatically, the averaging process might smooth away any peaks and valleys and noise within the passband, making it easier to find the edges. This is what many spectrum analyzers can do to make for a smoother PSD.
In case that wasn't clear: assume that each reading gives you 128 tuples of frequency and power, and that you capture 100 of these buffers of data. Now average the 100 samples from bin 0, then from bins 1, 2, ..., 127. Then try to locate the passband in this averaged data; it should be easier than on any single buffer. Note I used 100 as an example: if your data is very noisy, it may require more, and if there isn't much noise, fewer.
Be careful when doing the averaging. Your data is in dB. To average the samples, you must first convert the dB data back to linear power, sum, divide to find the average, and then convert the averaged power back into dB, as in the sketch below.
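A minimal sketch of that averaging step (the readings_db array and its shape are assumptions for illustration):
import numpy as np
# readings_db: shape (n_captures, n_bins), the PSD in dB for each captured buffer
readings_linear = 10.0**(readings_db/10.0)          # dB -> linear power
avg_linear = readings_linear.mean(axis=0)           # average each frequency bin across captures
avg_db = 10.0*np.log10(avg_linear)                  # back to dB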
OK, it seems this has to be solved by data analysis. I would propose these steps:
Preprocess your data if you suspect it to be too noisy. I'd suggest either a moving-average filter (np.convolve(data, np.ones(n)/n, "same")) or, better, a Savitzky-Golay filter (scipy.signal.savgol_filter(data, n, polyorder=3)), because you will be interested in extrema of the data, which would be unnecessarily distorted by the moving-average filter. You might also want to get rid of artifacts like 60 Hz noise at this stage.
If the signal you are interested in lives in a narrow band, the spectrum will be a single pronounced peak. In that case you could just fit a curve to your data; a Gaussian would be appropriate here.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
freq, power = read_in_your_data_here()
freq, power = np.asarray(freq), np.asarray(power)
def gauss(x, a, mu, sig):
    return a*np.exp(-(x-mu)**2/(2.*sig**2))
(a, mu, sig), _ = curve_fit(gauss, freq, power)
fitted_curve = gauss(freq, a, mu, sig)
plt.plot(freq, power)
plt.plot(freq, fitted_curve)
plt.vlines(mu, min(power)-2, max(power)+2)
plt.show()
center_idx = np.absolute(freq-mu).argmin()
power_center = power[center_idx]
power_3db = power_center - 3.
def interv_from_binvec(data):
    # first/last indices where the boolean mask switches on and off
    indicator = np.convolve(data, [-1, 1], "same")
    return indicator.argmin(), indicator.argmax()
passband_idx = interv_from_binvec(power > power_3db)
passband = freq[passband_idx[0]], freq[passband_idx[1]]
This is more an example than a solution, and it relies heavily on the assumption that you are searching for, and finding, a high-SNR signal in a narrow band. It could be extended to handle more than one signal by using a mixture model.
You can use scipy's UnivariateSpline and leastsq methods:
1. Create a spline of y - (np.max(y) - 3).
2. Find its roots.
3. Calculate the difference between the two roots.
import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.optimize import leastsq
# df is assumed to be a pandas DataFrame holding the sweep data
x = df["Wavelength / nm"]
y = df["Power / dBm"]
# create a spline of the data shifted down to the -3 dB level
spline = UnivariateSpline(x, y-(np.max(y)-3), s=0)
# find the roots (the two -3 dB crossings)
r1, r2 = spline.roots()
# their difference is the 3 dB bandwidth
threedB_bandwidth = abs(r2-r1)
