prediction plots for statsmodels OLS fit, taking out categorical effects

I have some data for about 500 galaxies in a pandas DataFrame (a few hundred measurements per galaxy), and I'm trying to perform OLS regression on a few variables, one of which is categorical (each galaxy is its own category). Basically, once I have finished fitting the model, I want to plot the data, the fit, and some error bounds, taking out the presumed effects of the categories themselves.
In effect, I want to produce a plot much like the one in "OLS non-linear curve but linear in parameters" section of this tutorial (replicated here).
Instead, I have this (I've just picked two galaxies here, for ease of reading, but it gets really ugly with all 500):
Since there seem to be two "clusters" here, I have concluded that each must correspond to a galaxy. What I really want, though, is to collapse them down into a single line that takes out the inter-category effects, and imagines that they were all one galaxy.
For reference, the code that I'm using to fit and plot is:
# assumes statsmodels.formula.api imported as sm
import statsmodels.formula.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import matplotlib.pyplot as plt

m = sm.ols(
    formula='{} ~ Rdeproj + NSAMstar + \
             NSASersicN + C(plateifu)'.format(qty),
    data=dfr)
f = m.fit()
# print(dir(f))
ypred = f.predict()
prstd, iv_l, iv_u = wls_prediction_std(f)

plt.close('all')
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(dfr['Rdeproj'], dfr[qty], '.', label='data')
ax.plot(dfr['Rdeproj'], f.fittedvalues, 'r--.', label='pred')
ax.plot(dfr['Rdeproj'], iv_u, 'r--', label='bound')
ax.plot(dfr['Rdeproj'], iv_l, 'r--')
legend = ax.legend(loc='best')
ax.set_xlabel(r'$R_{deproj}$ [Mpc]')
ax.set_ylabel(qty)
plt.tight_layout()
plt.savefig('fits/' + qty + '_fit.png')
I found one similar question asked here, but it seems to only address predicting observations for specific categories, rather than taking out those effects entirely.
Any further advice would be very much appreciated.
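One possible direction (a sketch based on the formula above, not a tested solution): with treatment coding, the per-galaxy offsets estimated by the model appear in f.params as 'C(plateifu)[T.<id>]' terms, so subtracting each galaxy's offset from both the data and the fitted values should collapse everything onto the reference galaxy's line.
cat_params = {name: val for name, val in f.params.items()
              if name.startswith('C(plateifu)')}

def galaxy_offset(plateifu):
    # the reference galaxy has no dummy term, so its offset is 0
    return cat_params.get('C(plateifu)[T.{}]'.format(plateifu), 0.0)

offsets = dfr['plateifu'].map(galaxy_offset)
y_adj = dfr[qty] - offsets            # data with the per-galaxy offset removed
yhat_adj = f.fittedvalues - offsets   # fitted values collapsed onto one line
Plotting y_adj and yhat_adj against dfr['Rdeproj'] should then show a single relation rather than one cluster per galaxy.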

Related

How to estimate confidence-intervals beyond the current simulated step, based on existing data for 1,000,000 monte-carlo simulations?

Situation:
I have a program which generates 600 random numbers per "step".
These 600 numbers are fed into a complicated algorithm, which then outputs a single value (which can be positive or negative) for that "step"; let's call this Value-X for that step.
This Value-X is then added to a Global-Value-Y, making the latter a running sum of each step in the series.
I have essentially run this simulation 1,000,000 times, recording the values of Global-Value-Y at each step in those simulations.
I have "calculated confidence intervals" from those one-million simulations, by sorting the simulations by (the absolute value of) their Global-Value-Y at each column, and finding the 90th percentile, the 99th percentile, etc.
What I want to do:
Using the pool of simulation results, I want to extrapolate an equation that estimates the confidence intervals for results from the algorithm many "steps" into the future, without having to extend the runs of those one-million simulations further (it would take too long to keep running them indefinitely).
Note that the results do not have to be terribly precise at this point; they are mainly used as a visual indicator on the graph, so the user can get an idea of how "normal" the current simulation's results are relative to the confidence intervals extrapolated from the historical data of the one-million simulations.
Anyway, I've already made some attempts at finding an estimated curve fit of the confidence intervals from the historical data (i.e. those based on the one-million simulations), but the results are not quite precise enough.
Here are the key parts from the curve-fitting Python code I've tried: (link to full code here)
# curve fit functions
def func_linear(t, a, b):
    return a*t + b

def func_quadratic(t, a, b, c):
    return a*pow(t,2) + b*t + c

def func_cubic(t, a, b, c, d):
    return a*pow(t,3) + b*pow(t,2) + c*t + d

def func_biquadratic(t, a, b, c, d, e):
    return a*pow(t,4) + b*pow(t,3) + c*pow(t,2) + d*t + e
[...]
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# calling the read function on the recorded percentile/confidence-intervals data
xVals,yVals = read_file()
# using inbuilt function for linear fit
popt, pcov = curve_fit(func_linear, xVals, yVals)
fit_linear = func_linear(np.array(xVals), *popt)
[same here for the other curve-fit functions]
[...]
plt.rcParams["figure.figsize"] = (40,20)
# plotting the respective curve fits
plt.plot(xVals, yVals, color="blue", linewidth=3)
plt.plot(xVals, fit_linear, color="red", linewidth=2)
plt.plot(xVals, fit_quadratic, color="green", linewidth=2)
plt.plot(xVals, fit_cubic, color="orange", linewidth=2)
#plt.plot(xVals, fit_biquadratic, color="black", linewidth=2) # extremely off
plt.legend(['Actual data','linear','quadratic','cubic','biquadratic'])
plt.xlabel('Session Column')
plt.ylabel('y-value for CI')
plt.title('Curve fitting')
plt.show()
And here are the results: (with the purple "actual data" being the 99.9th percentile of the Global-Value-Ys at each step, from the one-million recorded simulations)
While it seems to be attempting to estimate a curve-fit (over the graphed ~630 steps), the results are not quite accurate enough for my purposes. The imprecision is particularly noticeable at the first ~5% of the graph, where the estimate-curve is far too high. (though practically, my issues are more on the other end, as the CI curve-fit keeps getting less accurate the farther out from the historical data it goes)
EDIT: As requested by a commenter, here is a GitHub gist of the Python code I'm using to attempt the curve-fit (and it includes the data used): https://gist.github.com/Venryx/74a44ed25d5c4dc7768d13e22b36c8a4
So my questions:
Is there something wrong with my usage of scipy's curve_fit function?
Or is the curve_fit function too basic of a tool to get meaningful estimates/extrapolations of confidence-intervals for random-walk data like this?
If so, is there some alternative that works better for estimating/extrapolating confidence-intervals from random-walk data of this sort?
If I understand your question correctly, you have a lot of Monte Carlo simulation results gathered as (x, y) points and you want to find a model that reasonably fits them.
Importing your data to create a MCVE:
import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt
data = np.array([
[0,77],[1,101],[2,121],[3,138],[4,151],[5,165],[6,178],[7,189],[8,200],[9,210],[10,221],[11,229],[12,238],[13,247],[14,254],[15,264],[16,271],[17,278],[18,285],[19,291],[20,299],[21,305],[22,312],[23,318],[24,326],[25,331],[26,338],[27,344],[28,350],[29,356],[30,362],[31,365],[32,371],[33,376],[34,383],[35,387],[36,393],[37,399],[38,404],[39,409],[40,414],[41,419],[42,425],[43,430],[44,435],[45,439],[46,444],[47,447],[48,453],[49,457],[50,461],[51,467],[52,472],[53,476],[54,480],[55,483],[56,488],[57,491],[58,495],[59,499],[60,504],[61,508],[62,512],[63,516],[64,521],[65,525],[66,528],[67,532],[68,536],[69,540],[70,544],[71,547],[72,551],[73,554],[74,560],[75,563],[76,567],[77,571],[78,574],[79,577],[80,582],[81,585],[82,588],[83,591],[84,595],[85,600],[86,603],[87,605],[88,610],[89,613],[90,617],[91,621],[92,624],[93,627],[94,630],[95,632],[96,636],[97,638],[98,642],[99,645],[100,649],[101,653],[102,656],[103,660],[104,664],[105,667],[106,670],[107,673],[108,674],[109,679],[110,681],[111,684],[112,687],[113,689],[114,692],[115,697],[116,698],[117,701],[118,705],[119,708],[120,710],[121,712],[122,716],[123,718],[124,722],[125,725],[126,728],[127,730],[128,732],[129,735],[130,739],[131,742],[132,743],[133,747],[134,751],[135,753],[136,754],[137,757],[138,760],[139,762],[140,765],[141,768],[142,769],[143,774],[144,775],[145,778],[146,782],[147,784],[148,788],[149,790],[150,793],[151,795],[152,799],[153,801],[154,804],[155,808],[156,811],[157,812],[158,814],[159,816],[160,819],[161,820],[162,824],[163,825],[164,828],[165,830],[166,832],[167,834],[168,836],[169,839],[170,843],[171,845],[172,847],[173,850],[174,853],[175,856],[176,858],[177,859],[178,863],[179,865],[180,869],[181,871],[182,873],[183,875],[184,878],[185,880],[186,883],[187,884],[188,886],[189,887],[190,892],[191,894],[192,895],[193,898],[194,900],[195,903],[196,903],[197,905],[198,907],[199,910],[200,911],[201,914],[202,919],[203,921],[204,922],[205,926],[206,927],[207,928],[208,931],[209,933],[210,935],[211,940],[212,942],[213,944],[214,943],[215,948],[216,950],[217,954],[218,955],[219,957],[220,959],[221,963],[222,965],[223,967],[224,969],[225,970],[226,971],[227,973],[228,975],[229,979],[230,980],[231,982],[232,983],[233,986],[234,988],[235,990],[236,992],[237,993],[238,996],[239,998],[240,1001],[241,1003],[242,1007],[243,1007],[244,1011],[245,1012],[246,1013],[247,1016],[248,1019],[249,1019],[250,1020],[251,1024],[252,1027],[253,1029],[254,1028],[255,1031],[256,1033],[257,1035],[258,1038],[259,1040],[260,1041],[261,1046],[262,1046],[263,1046],[264,1048],[265,1052],[266,1053],[267,1055],[268,1056],[269,1057],[270,1059],[271,1061],[272,1064],[273,1067],[274,1065],[275,1068],[276,1071],[277,1073],[278,1074],[279,1075],[280,1080],[281,1081],[282,1083],[283,1084],[284,1085],[285,1086],[286,1088],[287,1090],[288,1092],[289,1095],[290,1097],[291,1100],[292,1100],[293,1102],[294,1104],[295,1107],[296,1109],[297,1110],[298,1113],[299,1114],[300,1112],[301,1116],[302,1118],[303,1120],[304,1121],[305,1124],[306,1124],[307,1126],[308,1126],[309,1130],[310,1131],[311,1131],[312,1135],[313,1137],[314,1138],[315,1141],[316,1145],[317,1147],[318,1147],[319,1148],[320,1152],[321,1151],[322,1152],[323,1155],[324,1157],[325,1158],[326,1161],[327,1161],[328,1163],[329,1164],[330,1167],[331,1169],[332,1172],[333,1175],[334,1177],[335,1177],[336,1179],[337,1181],[338,1180],[339,1184],[340,1186],[341,1186],[342,1188],[343,1190],[344,1193],[345,1195],[346,1197],[347,1198],[348,1198],[349,1200],[350,1203],[351,1204],[352,1206],[353,1207],[354,1209],[
355,1210],[356,1210],[357,1214],[358,1215],[359,1215],[360,1219],[361,1221],[362,1222],[363,1223],[364,1224],[365,1225],[366,1228],[367,1232],[368,1233],[369,1236],[370,1237],[371,1239],[372,1239],[373,1243],[374,1244],[375,1244],[376,1244],[377,1247],[378,1249],[379,1251],[380,1251],[381,1254],[382,1256],[383,1260],[384,1259],[385,1260],[386,1263],[387,1264],[388,1265],[389,1267],[390,1271],[391,1271],[392,1273],[393,1274],[394,1277],[395,1278],[396,1279],[397,1281],[398,1285],[399,1286],[400,1289],[401,1288],[402,1290],[403,1290],[404,1291],[405,1292],[406,1295],[407,1297],[408,1298],[409,1301],[410,1300],[411,1301],[412,1303],[413,1305],[414,1307],[415,1311],[416,1312],[417,1313],[418,1314],[419,1316],[420,1317],[421,1316],[422,1319],[423,1319],[424,1321],[425,1322],[426,1323],[427,1325],[428,1326],[429,1329],[430,1328],[431,1330],[432,1334],[433,1335],[434,1338],[435,1340],[436,1342],[437,1342],[438,1344],[439,1346],[440,1347],[441,1347],[442,1349],[443,1351],[444,1352],[445,1355],[446,1358],[447,1356],[448,1359],[449,1362],[450,1362],[451,1366],[452,1365],[453,1367],[454,1368],[455,1368],[456,1368],[457,1371],[458,1371],[459,1374],[460,1374],[461,1377],[462,1379],[463,1382],[464,1384],[465,1387],[466,1388],[467,1386],[468,1390],[469,1391],[470,1396],[471,1395],[472,1396],[473,1399],[474,1400],[475,1403],[476,1403],[477,1406],[478,1406],[479,1412],[480,1409],[481,1410],[482,1413],[483,1413],[484,1418],[485,1418],[486,1422],[487,1422],[488,1423],[489,1424],[490,1426],[491,1426],[492,1430],[493,1430],[494,1431],[495,1433],[496,1435],[497,1437],[498,1439],[499,1440],[500,1442],[501,1443],[502,1442],[503,1444],[504,1447],[505,1448],[506,1448],[507,1450],[508,1454],[509,1455],[510,1456],[511,1460],[512,1459],[513,1460],[514,1464],[515,1464],[516,1466],[517,1467],[518,1469],[519,1470],[520,1471],[521,1475],[522,1477],[523,1476],[524,1478],[525,1480],[526,1480],[527,1479],[528,1480],[529,1483],[530,1484],[531,1485],[532,1486],[533,1487],[534,1489],[535,1489],[536,1489],[537,1492],[538,1492],[539,1494],[540,1493],[541,1494],[542,1496],[543,1497],[544,1499],[545,1500],[546,1501],[547,1504],[548,1506],[549,1508],[550,1507],[551,1509],[552,1510],[553,1510],[554,1512],[555,1513],[556,1516],[557,1519],[558,1520],[559,1520],[560,1522],[561,1525],[562,1527],[563,1530],[564,1531],[565,1533],[566,1533],[567,1534],[568,1538],[569,1539],[570,1538],[571,1540],[572,1541],[573,1543],[574,1545],[575,1545],[576,1547],[577,1549],[578,1550],[579,1554],[580,1554],[581,1557],[582,1559],[583,1564],[584,1565],[585,1567],[586,1567],[587,1568],[588,1570],[589,1571],[590,1569],[591,1572],[592,1572],[593,1574],[594,1574],[595,1574],[596,1579],[597,1578],[598,1579],[599,1582],[600,1583],[601,1583],[602,1586],[603,1585],[604,1588],[605,1590],[606,1592],[607,1596],[608,1595],[609,1596],[610,1598],[611,1601],[612,1603],[613,1604],[614,1605],[615,1605],[616,1608],[617,1611],[618,1613],[619,1615],[620,1613],[621,1616],[622,1617],[623,1617],[624,1619],[625,1617],[626,1618],[627,1619],[628,1618],[629,1621],[630,1624],[631,1626],[632,1626],[633,1631]
])
Plotting them shows a curve with what looks like square-root behavior at the beginning of the dataset and linear behavior at the end. So let's try this simple model first:
def model(x, a, b, c):
    return a*np.sqrt(x) + b*x + c
Notice that, formulated like this, it is a linear least-squares (LLS) problem, which is a good starting point for solving it. The optimization with curve_fit works as expected:
popt, pcov = optimize.curve_fit(model, data[:,0], data[:,1])
# [61.20233162 0.08897784 27.76102519]
# [[ 1.51146696e-02 -4.81216428e-04 -1.01108383e-01]
# [-4.81216428e-04 1.59734405e-05 3.01250722e-03]
# [-1.01108383e-01 3.01250722e-03 7.63590271e-01]]
It returns a pretty decent fit. Graphically it looks like:
Of course, this is just a fit to an arbitrary model chosen from personal experience (to me it looks like a specific heterogeneous reaction kinetics).
If you have a theoretical reason to accept or reject this model, then you should use it to discriminate. I would also investigate the units of the parameters to check whether they have meaningful interpretations.
But in any case, this is out of scope for the Stack Overflow community, which is oriented toward solving programming issues, not the scientific validation of models (see Cross Validated or Math Overflow if you want to dig deeper). If you do, point me to it; I would be glad to follow this question in detail.
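As a footnote to the remark above that this is an LLS problem: because the model is linear in (a, b, c), the same coefficients can also be recovered with a direct linear least-squares solve, with no iterative optimization involved. A minimal sketch using the data array above:
x, y = data[:, 0].astype(float), data[:, 1].astype(float)
A = np.column_stack([np.sqrt(x), x, np.ones_like(x)])   # design matrix for a*sqrt(x) + b*x + c
coeffs, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)   # should closely match the curve_fit result above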

how to optimise fitting of gauss-hermite function in python?

What I have done so far:
I am trying to fit noisy data (which I generated myself by adding random noise to my function) to a Gauss-Hermite function that I have defined. It works well in some cases for lower values of h3 and h4, but every once in a while it produces a really bad fit even for low h3, h4 values, and for higher h3, h4 values it always gives a bad fit.
My code:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as mpl
import matplotlib.pyplot as plt
# Let's define the Gauss-Hermite function
#a=amplitude, x0=location of peak, sig=std dev, h3, h4
def gh_func(x, a, x0, sig, h3, h4):
    return a*np.exp(-.5*((x-x0)/sig)**2) * (
        1
        + h3*(-np.sqrt(3)*((x-x0)/sig) + (2/np.sqrt(3))*((x-x0)/sig)**3)
        + h4*(np.sqrt(6)/4 - np.sqrt(6)*((x-x0)/sig)**2 + (np.sqrt(6)/3)*((x-x0)/sig)**4))
#generate clean data
x = np.linspace(-10, 20, 100)
y = gh_func(x, 10, 5, np.sqrt(3), -0.10,-0.03) #it gives okay fit for h3=-0.10, h4=-0.03 but bad fits for higher values like h3=-0.4 and h4=-0.3.
#add noise to data
noise=np.random.normal(0,np.sqrt(0.5),size=len(x))
yn = y + noise
fig = mpl.figure(1)
ax = fig.add_subplot(111)
ax.plot(x, y, c='k', label='analytic function')
ax.scatter(x, yn, s=5, label='fake noisy data')
fig.savefig('model_and_noise_h3h4.png')
# Executing curve_fit on noisy data
popt, pcov = curve_fit(gh_func, x, yn)
#popt returns the best fit values for parameters of the given model (func)
print('Fitted Parameters (Gauss-Hermite):\na = %.10f , x0 = %.10f , sig = %.10f\nh3 = %.10f , h4 = %.10f'
      % (popt[0], popt[1], popt[2], popt[3], popt[4]))
ym = gh_func(x, popt[0], popt[1], popt[2], popt[3], popt[4])
ax.plot(x, ym, c='r', label='Best fit')
ax.legend()
fig.savefig('model_fit_h3h4.png')
plt.legend(loc='upper left')
plt.xlabel("v")
plt.ylabel("f(v)")
What I want to do:
I want to find a better fitting method than just curve_fit from scipy.optimize, but I am not sure what I can use. Even if we end up using curve_fit, I need a way to produce better fits by providing initial guesses for the parameters that are generated automatically; e.g. one approach for a single-peak Gaussian only is described in the accepted answer of this post (Jean Jacquelin's method): gaussian fitting inaccurate for lower peak width using Python. But that covers just mu, sigma and amplitude, not h3, h4.
Besides curve_fit from scipy.optimize, I think there's a package called lmfit: https://lmfit.github.io/lmfit-py/ but I am not sure how I would implement it in my code. I do not want to use manual initial guesses for the parameters; I want the fit to find them automatically.
Thanks a lot!
Using lmfit for this fitting problem would be straightforward, by creating a lmfit.Model from your gh_func, with something like
from lmfit import Model
gh_model = Model(gh_func)
params = gh_model.make_params(x0=?, a=?, sig=?, h3=?, h4=?)
where those question marks would have to be initial values for the parameters in question.
Your use of scipy.optimize.curve_fit does not provide initial values for the variables. Unfortunately, this does not raise an error or give any indication of a problem, because the authors of scipy.optimize.curve_fit have lied to you by making initial values appear optional. Initial values for all parameters are always required for any non-linear least-squares analysis; this is not a choice of the implementation, it is a feature of the mathematical algorithm. What curve_fit hides from you is that leaving p0=None sets all initial values to 1.0. Whether that is remotely appropriate depends on the model and data being fit; it cannot always be reasonable. For example, if the x values of your data extended from 9000 to 9500 and the peak function was centered around 9200, starting with x0 of 1 would almost certainly not find a suitable fit. It is deceptive to imply in any way that initial values are optional: they never are.
Initial values don't have to be perfect, but they need to be of the right scale. Many people will warn you about "false minima" - getting stuck at a solution that is not bad but not "best". My experience is that the more common problem people run into is initial values that are so far off that the model is just not sensitive to small changes in their values, and so can never be optimized (with x0=1,sig=1 for data with x on [9000, 9500] and centered at 9200 being exactly in that category).
Your essential question of "providing initial guesses for the parameters which are generated automatically" is hard to answer without knowing the properties of your model function. You might be able to use some heuristics to guess values from a data set; lmfit has such heuristic "guess parameter values from data" functions for many peak-like functions. You probably have some sense of the physical (or other domain-specific) meaning of h3 and h4, know what kinds of values are reasonable, and can probably give better initial values than h3=h4=1. It might be that you want to start by guessing parameters as if the data were a simple Gaussian (say, using lmfit.Model.guess_from_peak()) and then use the difference between that and your data to get the scale for h3 and h4.
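For the concrete function in the question, such a heuristic might look like the sketch below (it assumes the x, yn and gh_func defined earlier; the estimates only need to have the right scale, not be exact):
import numpy as np
from scipy.optimize import curve_fit

a0 = yn.max() - yn.min()                       # rough amplitude
x00 = x[np.argmax(yn)]                         # location of the peak
half_max = yn.min() + 0.5 * a0
fwhm = (yn > half_max).sum() * (x[1] - x[0])   # crude full width at half maximum
sig0 = max(fwhm / 2.355, x[1] - x[0])          # convert FWHM to sigma, keep it positive
p0 = [a0, x00, sig0, 0.0, 0.0]                 # start h3 and h4 near zero

popt, pcov = curve_fit(gh_func, x, yn, p0=p0)
The same numbers can be used as the initial values passed to gh_model.make_params() if you go the lmfit route.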

Python Linear regression : plt.plot() not showing straight line. Instead it connects every point on scatter plot

I am relatively new to python. I am trying to do a multivariate linear regression and plot scatter plots and the line of best fit using one feature at a time.
This is my code:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)
y_pred=regr.predict(x_test)
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
And this is the graph that I'm getting-
I have tried searching a lot but to no avail. I wanted to understand why this is not showing a line of best-fit and why instead it is connecting all the points on the scatter plot.
Thank you!
Linear regression means that you are predicting the value linearly, which will always give you a best-fit line; anything else is not possible. In your code:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)
y_pred=regr.predict(x_test)
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
Use the right variables to plot the line, i.e.:
plt.plot(x_test, y_pred)
Plot the graph between the values that you used for testing and the predictions that you get from them, i.e.:
y_pred = regr.predict(x_test)
Also, your model must be trained on those same variables; otherwise you will get the straight line but the results will be unexpected.
This is multivariate data, so you need to plot the pairwise lines:
http://www.sthda.com/english/articles/32-r-graphics-essentials/130-plot-multivariate-continuous-data/#:~:text=wiki%2F3d%2Dgraphics-,Create%20a%20scatter%20plot%20matrix,pairwise%20comparison%20of%20multivariate%20data.&text=Create%20a%20simple%20scatter%20plot%20matrix.
or restrict the model to the single linearly dependent feature, which changes the model completely:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train[['lag_7']], y_train)   # double brackets keep X two-dimensional, as scikit-learn expects
y_pred = regr.predict(x_test[['lag_7']])
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
Assuming your graphical library is matplotlib, imported with import matplotlib.pyplot as plt, the problem is that you passed the same data to both plt.scatter and plt.plot. The former draws the scatter plot, while the latter passes a line through all points in the order given (it first draws a straight line between (x_test['lag_7'][0], y_pred[0]) and (x_test['lag_7'][1], y_pred[1]), then one between (x_test['lag_7'][1], y_pred[1]) and (x_test['lag_7'][2], y_pred[2]), etc.)
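For the immediate plotting symptom, a common fix (a sketch reusing the question's variable names) is to sort the points by the x value before drawing the line, so the line is traced left to right instead of hopping between points in DataFrame order:
import numpy as np

order = np.argsort(x_test['lag_7'].values)
plt.scatter(x_test['lag_7'], y_pred, color='black')
plt.plot(x_test['lag_7'].values[order], y_pred[order], color='blue', linewidth=3)
plt.show()
Even sorted, the line will not be perfectly straight, because y_pred comes from a multivariate model and also depends on the other features, which brings us to the points below.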
Concerning the more general question about how to do multivariate regression and plot the results, I have two remarks:
Finding the line of best fit one feature at a time amounts to performing 1D regression on that feature: it is an altogether different model from the multivariate linear regression you want to perform.
I don't think it makes much sense to split your data into train and test samples, because linear regression is a very simple model with little risk of overfitting. In the following, I consider the whole data set df.
I like to use OpenTURNS because it has built-in linear regression viewing facilities. The downside is that to use it, we need to convert your pandas tables (DataFrame or Series) to OpenTURNS objects of the class Sample.
import pandas as pd
import numpy as np
import openturns as ot
from openturns.viewer import View
# convert pandas DataFrames to numpy arrays and then to OpenTURNS Samples
X = ot.Sample(np.array(df[['lag_7','rolling_mean', 'expanding_mean']]))
X.setDescription(['lag_7','rolling_mean', 'expanding_mean']) # keep labels
Y = ot.Sample(np.array(df[['sales']]))
Y.setDescription(['sales'])
You did not provide your data, so I need to generate some:
func = ot.SymbolicFunction(['x1', 'x2', 'x3'], ['4*x1 + 0.05*x2 - 2*x3'])
inputs_distribution = ot.ComposedDistribution([ot.Uniform(0, 3.0e6)]*3)
residuals_distribution = ot.Normal(0.0, 2.0e6)
ot.RandomGenerator.SetSeed(0)
X = inputs_distribution.getSample(30)
X.setDescription(['lag_7','rolling_mean', 'expanding_mean'])
Y = func(X) + residuals_distribution.getSample(30)
Y.setDescription(['sales'])
Now, let us find the best-fitting line one feature at a time (1D linear regression):
linear_regression_1 = ot.LinearModelAlgorithm(X[:, 0], Y)
linear_regression_1.run()
linear_regression_1_result = linear_regression_1.getResult()
ot.VisualTest_DrawLinearModel(X[:, 0], Y, linear_regression_1_result)
linear_regression_2 = ot.LinearModelAlgorithm(X[:, 1], Y)
linear_regression_2.run()
linear_regression_2_result = linear_regression_2.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 1], Y, linear_regression_2_result))
linear_regression_3 = ot.LinearModelAlgorithm(X[:, 2], Y)
linear_regression_3.run()
linear_regression_3_result = linear_regression_3.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 2], Y, linear_regression_3_result))
As you can see, in this example, none of the one-feature linear regressions are able to very accurately predict the output.
Now let us do multivariate linear regression. To plot the result, it is best to view the actual vs. predicted values.
full_linear_regression = ot.LinearModelAlgorithm(X, Y)
full_linear_regression.run()
full_linear_regression_result = full_linear_regression.getResult()
full_linear_regression_analysis = ot.LinearModelAnalysis(full_linear_regression_result)
View(full_linear_regression_analysis.drawModelVsFitted())
As you can see, in this example, the fit is much better with multivariate linear regression than with 1D regressions one feature at a time.

How can I check if a network is scale free?

Given an undirected NetworkX Graph graph, I want to check if it is scale free.
To do this, as I understand, I need to find the degree k of each node, and the frequency of that degree P(k) within the entire network. This should represent a power law curve due to the relationship between the frequency of degrees and the degrees themselves.
Plotting my calculations for P(k) and k displays a power curve as expected, but when I double log it, a straight line is not plotted.
The following plots were obtained with 1000 nodes.
Code as follows:
import math
import numpy as np
import matplotlib.pyplot as plt

k = []
Pk = []
for node in list(graph.nodes()):
    degree = graph.degree(nbunch=node)
    try:
        pos = k.index(degree)
    except ValueError as e:
        k.append(degree)
        Pk.append(1)
    else:
        Pk[pos] += 1

# get a double log representation
logk = []
logPk = []
for i in range(len(k)):
    logk.append(math.log10(k[i]))
    logPk.append(math.log10(Pk[i]))

order = np.argsort(logk)
logk_array = np.array(logk)[order]
logPk_array = np.array(logPk)[order]
plt.plot(logk_array, logPk_array, ".")

m, c = np.polyfit(logk_array, logPk_array, 1)
plt.plot(logk_array, m*logk_array + c, "-")
The m is supposed to represent the scaling coefficient, and if it's between 2 and 3 then the network ought to be scale free.
The graphs are obtained by calling the NetworkX's scale_free_graph method, and then using that as input for the Graph constructor.
Update
As requested by @Joel, below are the plots for 10000 nodes.
Additionally, the exact code that generates the graph is as follows:
graph = networkx.Graph(networkx.scale_free_graph(num_of_nodes))
As we can see, a significant number of the values do seem to form a straight line, but the network seems to have a strange tail in its double-log form.
Have you tried the powerlaw module in Python?
It's pretty straightforward.
First, create a degree distribution variable from your network:
degree_sequence = sorted([d for n, d in G.degree()], reverse=True) # used for degree distribution and powerlaw test
Then fit the data to powerlaw and other distributions:
import powerlaw # Power laws are probability distributions with the form:p(x)∝x−α
fit = powerlaw.Fit(degree_sequence)
Take into account that powerlaw automatically finds the optimal value of xmin by creating a power-law fit starting from each unique value in the dataset, then selecting the one that results in the minimal Kolmogorov-Smirnov distance, D, between the data and the fit. If you want to include all your data, you can define the xmin value as follows:
fit = powerlaw.Fit(degree_sequence, xmin=1)
Then you can plot:
fig2 = fit.plot_pdf(color='b', linewidth=2)
fit.power_law.plot_pdf(color='g', linestyle='--', ax=fig2)
which will produce an output like this:
powerlaw fit
On the other hand, the data may not follow a power law but some other distribution (log-normal, exponential, etc.); you can check this with fit.distribution_compare:
R, p = fit.distribution_compare('power_law', 'exponential', normalized_ratio=True)
print (R, p)
where R is the log-likelihood ratio between the two candidate distributions. This number will be positive if the data are more likely under the first distribution, but you should also check that p < 0.05.
Finally, once you have chosen an xmin for your distribution, you can plot a comparison between some common degree distributions for social networks:
plt.figure(figsize=(10, 6))
fit.distribution_compare('power_law', 'lognormal')
fig4 = fit.plot_ccdf(linewidth=3, color='black')
fit.power_law.plot_ccdf(ax=fig4, color='r', linestyle='--') #powerlaw
fit.lognormal.plot_ccdf(ax=fig4, color='g', linestyle='--') #lognormal
fit.stretched_exponential.plot_ccdf(ax=fig4, color='b', linestyle='--') #stretched_exponential
lognormal vs powerlaw vs stretched exponential
Finally, take into account that power-law distributions in networks are currently under discussion; strongly scale-free networks seem to be empirically rare:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6399239/
Part of your problem is that you aren't including the missing degrees when fitting your line. There are a small number of large-degree nodes, which you're including in your fit, but you're ignoring the fact that many of the large degrees don't exist. Your largest degrees are somewhere in the 1000-2000 range, but there are only 2 observations there. So really, for such large values, I'm expecting the probability that a random node has such a large degree to be about 2/(1000*N) (or really, probably even less than that). But in your fit, you're treating them as if the probability of those two specific degrees is 2/N, and you're ignoring the other degrees.
The simple fix is to only use the smaller degrees in your fit.
The more robust way is to fit the complementary cumulative distribution. Instead of plotting P(K=k), plot P(K>=k) and try to fit that (noting that if P(K=k) is a power law, then P(K>=k) is also, but with a different exponent - check it).
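A sketch of that CCDF approach (not the original code; it assumes the same graph object and works from the raw degree list rather than the Pk counts above):
import numpy as np
import matplotlib.pyplot as plt

degrees = np.array([d for _, d in graph.degree()])
kvals = np.sort(np.unique(degrees))
kvals = kvals[kvals > 0]                                    # log10 needs positive degrees
ccdf = np.array([(degrees >= kk).mean() for kk in kvals])   # empirical P(K >= k)

logk = np.log10(kvals)
logccdf = np.log10(ccdf)
m, c = np.polyfit(logk, logccdf, 1)   # for P(K=k) ~ k**(-gamma), this slope is roughly -(gamma - 1)
plt.plot(logk, logccdf, ".")
plt.plot(logk, m * logk + c, "-")
plt.show()
As noted above, it is still worth restricting the fit to the range of degrees that is actually well sampled.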
Trying to fit a line to these points is wrong, because the points are not evenly distributed over the x-axis: the line-fitting routine will give more weight to the portion of the domain that contains more points.
You should redistribute the observations over the x-axis using np.interp, like this:
logk_interp = np.linspace(np.min(logk_array),np.max(logk_array),1000)
logPk_interp = np.interp(logk_interp, logk_array, logPk_array)
plt.plot(logk_array, logPk_array,".")
m, c = np.polyfit(logk_interp, logPk_interp, 1)
plt.plot(logk_interp, m*logk_interp + c, "-")

Fitting a log-log data using scipy.optmize.curve_fit

I have two variables x and y which I am trying to fit using curve_fit from scipy.optimize.
The equation that fits the data is a simple power law of the form y = a(x^b). The fit looks good when I set the x and y axes to log scale, i.e. ax.set_xscale('log') and ax.set_yscale('log').
Here is the code:
def fitfunc(x,p1,p2):
    y = p1*(x**p2)
    return y
popt_1,pcov_1 = curve_fit(fitfunc,x,y,p0=(1.0,1.0))
p1_1 = popt_1[0]
p1_2 = popt_1[1]
residuals1 = (ngal_mstar_1) - fitfunc(x,p1_1,p1_2)
xi_sq_1 = sum(residuals1**2) #The chi-square value
curve_y_1 = fitfunc(x,p1_1,p1_2) #This is the fit line seen in the graph
fig = plt.figure(figsize=(14,12))
ax1 = fig.add_subplot(111)
ax1.scatter(x,y,c='r')
ax1.plot(y,curve_y_1,'y.',linewidth=1)
ax1.legend(loc='best',shadow=True,scatterpoints=1)
ax1.set_xscale('log') #Scale is set to log
ax1.set_yscale('log') #SCale is set to log
plt.show()
When I use true log-log values for x and y, the power-law fit becomes y = 10^(a + b*log(x)), i.e. 10 raised to the power of the right-hand side since it is log base 10. Now both my x and y values are log(x) and log(y).
The fit for the above does not seem to be good. Here is the code I have used.
def fitfunc(x,p1,p2):
    y = 10**(p1+(p2*x))
    return y
popt_1,pcov_1 = curve_fit(fitfunc,np.log10(x),np.log10(y),p0=(1.0,1.0))
p1_1 = popt_1[0]
p1_2 = popt_1[1]
residuals1 = (y) - fitfunc((x),p1_1,p1_2)
xi_sq_1 = sum(residuals1**2)
curve_y_1 = fitfunc(np.log10(x),p1_1,p1_2) #The fit line uses log(x) here itself
fig = plt.figure(figsize=(14,12))
ax1 = fig.add_subplot(111)
ax1.scatter(np.log10(x),np.log10(y),c='r')
ax1.plot(np.log10(y),curve_y_1,'y.',linewidth=1)
plt.show()
The only difference between the two plots is the fitting equation, and for the second plot the values have been logged independently. Am I doing something wrong here? I want a log(x) vs log(y) plot and the corresponding fit parameters (slope and intercept).
Your transformation of the power-law model to log-log is wrong, i.e. your second fit actually fits a different model. Take your original model y = a*(x^b) and apply the logarithm to both sides: you get log(y) = log(a) + b*log(x). Thus, your model in log scale should simply read y' = a' + b*x', where the primes indicate variables in log scale. The model is now a linear function, a well-known result that all power laws become linear functions in log-log.
That said, you can still expect some small differences in the two versions of your fit, since curve_fit will optimise the least-squares problem. Therefore, in log scale, the fit will minimise the relative error between the fit and the data, while in linear scale, the fit will minimise the absolute error. Thus, in order to decide which way is actually the better domain for your fit, you will have to estimate the error in your data. The data you show certainly does not have a constant uncertainty in log-scale, so on linear scale your fit might be more faithful. If details about the error in each data-point are known, then you could consider using the sigma parameter. If that one is used properly, there should not be much difference in the two approaches. In that case, I would prefer the log-scale fitting, as the model is simpler and therefore likely to be more numerically stable.
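To make the corrected log-log version concrete, here is a minimal sketch (assuming the x and y arrays from the question) that fits the linear model directly on the logged data:
import numpy as np
from scipy.optimize import curve_fit

def loglog_model(logx, log_a, b):
    return log_a + b * logx          # log10(y) = log10(a) + b*log10(x)

popt, pcov = curve_fit(loglog_model, np.log10(x), np.log10(y), p0=(0.0, 1.0))
log_a, b = popt
print('slope b = %.4f, intercept log10(a) = %.4f, a = %.4g' % (b, log_a, 10**log_a))
An equivalent shortcut is np.polyfit(np.log10(x), np.log10(y), 1), which returns the slope and intercept of the same line.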
