generate random lognormal distributions using shape of observed data - python
I'm trying to fit some data to a lognormal distribution and, from the fitted parameters, generate random lognormal samples.
After some searching I found a few solutions, but none of them is convincing:
Solution 1, using the fit function:
import numpy as np
from scipy.stats import lognorm
mydata = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354]
shape, loc, scale = lognorm.fit(mydata)
rnd_log = lognorm.rvs(shape, loc=loc, scale=scale, size=100)
or Solution 2, using mu and sigma from the original data:
import numpy as np
from scipy.stats import lognorm
mydata = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354]
mu = np.mean([np.log(i) for i in mydata])
sigma = np.std([np.log(i) for i in mydata])
distr = lognorm(mu, sigma)
rnd_log = distr.rvs(size=100)
Neither of these solutions fits the data well:
import pylab
pylab.plot(sorted(mydata, reverse=True), 'ro')
pylab.plot(sorted(rnd_log, reverse=True), 'bx')
I am not sure whether I am using the distributions correctly, or if I am missing something else...
I thought I would find the solution here: Does anyone have example code of using scipy.stats.distributions?
but I am not able to get the shape from my data... Am I missing something in the use of the fit function?
Thanks
EDIT:
Here is an example to illustrate my problem better:
print 'solution 1:'
means = []
stdes = []
distr = lognorm(mu, sigma)
for _ in xrange(1000):
    rnd_log = distr.rvs(size=100)
    means.append(np.mean([np.log(i) for i in rnd_log]))
    stdes.append(np.std([np.log(i) for i in rnd_log]))
print 'observed mean:', mu, 'mean simulated mean:', np.mean(means)
print 'observed std :', sigma, 'mean simulated std :', np.mean(stdes)

print '\nsolution 2:'
means = []
stdes = []
shape, loc, scale = lognorm.fit(mydata)
for _ in xrange(1000):
    rnd_log = lognorm.rvs(shape, loc=loc, scale=scale, size=100)
    means.append(np.mean([np.log(i) for i in rnd_log]))
    stdes.append(np.std([np.log(i) for i in rnd_log]))
print 'observed mean:', mu, 'mean simulated mean:', np.mean(means)
print 'observed std :', sigma, 'mean simulated std :', np.mean(stdes)
The result is:
solution 1:
observed mean: 1.82562655734 mean simulated mean: 1.18929982267
observed std : 1.39003773799 mean simulated std : 0.88985924363
solution 2:
observed mean: 1.82562655734 mean simulated mean: 4.50608084668
observed std : 1.39003773799 mean simulated std : 5.44206119499
while, if I do the same in R:
mydata <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354)
meanlog <- mean(log(mydata))
sdlog <- sd(log(mydata))
means <- c()
stdes <- c()
for (i in 1:1000){
    rnd.log <- rlnorm(length(mydata), meanlog, sdlog)
    means <- c(means, mean(log(rnd.log)))
    stdes <- c(stdes, sd(log(rnd.log)))
}
print (paste('observed mean:',meanlog,'mean simulated mean:',mean(means),sep=' '))
print (paste('observed std :',sdlog ,'mean simulated std :',mean(stdes),sep=' '))
I get:
[1] "observed mean: 1.82562655733507 mean simulated mean: 1.82307191072317"
[1] "observed std : 1.39704049131865 mean simulated std : 1.39736545866904"
That is much closer, so I guess I am doing something wrong when using scipy...
The lognormal distribution in scipy is parametrized a little differently from the usual way. See the scipy.stats.lognorm docs, particularly the "Notes" section.
Here's how to get the results you're expecting (note that we hold location to 0 when fitting):
In [315]: from scipy import stats
In [316]: x = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354])
In [317]: mu, sigma = stats.norm.fit(np.log(x))
In [318]: mu, sigma
Out[318]: (1.8256265573350701, 1.3900377379913127)
In [319]: shape, loc, scale = stats.lognorm.fit(x, floc=0)
In [320]: np.log(scale), shape
Out[320]: (1.8256267737298788, 1.3900309739954713)
Now you can generate samples and confirm your expectations:
In [321]: dist = stats.lognorm(shape, loc, scale)
In [322]: means, sds = [], []
In [323]: for i in xrange(1000):
.....: sample = dist.rvs(size=100)
.....: logsample = np.log(sample)
.....: means.append(logsample.mean())
.....: sds.append(logsample.std())
.....:
In [324]: np.mean(means), np.mean(sds)
Out[324]: (1.8231068508345041, 1.3816361818739145)
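To make the mapping explicit: in scipy's parametrization the shape parameter s plays the role of sigma and scale plays the role of exp(mu), with loc pinned at 0. Below is a small self-contained sketch of my own (not part of the answer above) that builds the frozen distribution directly from the log-space mean and standard deviation; the synthetic data merely stands in for mydata:

import numpy as np
from scipy import stats

# synthetic stand-in for the observed sample (any positive data works,
# e.g. the mydata list from the question)
data = stats.lognorm(s=0.9, scale=np.exp(1.5)).rvs(size=200, random_state=0)

mu = np.log(data).mean()      # log-space mean
sigma = np.log(data).std()    # log-space standard deviation

# shape s = sigma, scale = exp(mu), loc = 0  <=>  "textbook" lognormal(mu, sigma)
dist = stats.lognorm(s=sigma, loc=0, scale=np.exp(mu))
sample = dist.rvs(size=100, random_state=1)

# log of the sample should have mean close to mu and std close to sigma
print("{:.3f} {:.3f}".format(np.log(sample).mean(), np.log(sample).std()))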
Related
Confine a gaussian fit with curve_fit
in the framework of my bachelor's thesis, I need to evaluate my data with python. Unfortunately there's no suiting script of my fellow students yet and I'm quite new to programming. I have this data set and I'm trying to fit it with a gaussian by using scipy.optimize.curve_fit. Since there are a lot of unusable counts especially at the end of the axis, I'd like to confine the part that is to be fitted. Picture raw data This is what I have so far: import numpy as np import matplotlib.pyplot as plt from scipy.optimize import curve_fit x=np.arange(5120) y=array([ 0.81434599, 1.17054264, 0.85279188, ..., 1. , 1. , 13.56291391]) #most of the data isn't interesting #to me, part of interest see below def Gauss(x, a, x0, sigma): return a * np.exp(-(x - x0)**2 / (2 * sigma**2)) mean = sum(x * y) / sum(y) sigma = np.sqrt(sum(y * (x - mean)**2) / sum(y)) popt,pcov = curve_fit(Gauss, x, y, p0=[max(y), mean, sigma], maxfev=360000) plt.plot(x,y,label='data') plt.plot(x,Gauss(x, *popt), 'r-',label='fit') On docs.scipy.org I've found a general description about curve_fit If I try using bounds=([2400,-np.inf, -np.inf],[2600, np.inf, np.inf]), I'm getting the ValueError: x0 is infeasible. What is the problem here? I also tried to confine it with popt,pcov = curve_fit(Gauss, x[2400:2600], y[2400:2600], p0=[max(y), mean, sigma], maxfev=360000) as suggested in a comment on this question: "Error when obtaining gaussian fit for graph" at stackoverflow In this case I only get a straight line though. Picture: Confinement with x[2400:2600],y[2400:2600] as arguments of curve_fit I really hope you can help me out here. I only need a way to fit a small part of my data. Thanks in advance! interesting data: y=array([ 0.93396226, 1.00884956, 1.15457413, 1.07590759, 0.88915094, 1.07142857, 1.10714286, 1.14171123, 1.06666667, 0.84975369, 0.95480226, 0.99388379, 1.01675978, 0.83967391, 0.9771987 , 1.02402402, 1.04531722, 1.07492795, 0.97135417, 0.99714286, 1.0248139 , 1.26223776, 1.1533101 , 0.99099099, 1.18867925, 1.15772871, 0.95076923, 1.03313253, 1.02278481, 0.93265993, 1.06705539, 1.00265252, 1.02023121, 0.92076503, 0.99728997, 1.03353659, 1.15116279, 1.04336043, 0.95076923, 1.05515588, 0.92571429, 0.93448276, 1.02702703, 0.90056818, 0.96068796, 1.08493151, 1.13584906, 1.1212938 , 1.0739645 , 0.98972603, 0.94594595, 1.07913669, 0.98425197, 0.87762238, 0.96811594, 1.02710843, 0.99392097, 0.91384615, 1.09809264, 1.00630915, 0.93175074, 0.87572254, 1.00651466, 0.78772379, 1.12244898, 1.2248062 , 0.97109827, 0.94607843, 0.97900262, 0.97527473, 1.01212121, 1.16422287, 1.20634921, 0.97275204, 1.01090909, 0.99404762, 1.00561798, 1.01146132, 1.08695652, 0.97214485, 1.03525641, 0.99096386, 1.05135952, 1.16451613, 0.90462428, 0.76876877, 0.47701149, 0.27607362, 0.21580547, 0.20598007, 0.16766467, 0.15533981, 0.19745223, 0.15407855, 0.18925831, 0.26997245, 0.47603834, 0.596875 , 0.85126582, 0.96 , 1.06578947, 1.08761329, 0.89548023, 0.99705882, 1.07142857, 0.95677233, 0.86119874, 1.02857143, 0.98250729, 0.94214876, 1.04166667, 0.96024465, 1.07022472, 1.10344828, 1.04859335, 0.96655518, 1.06424581, 1.01754386, 1.03492063, 1.18627451, 0.91036415, 1.03355705, 1.09116809, 0.96083551, 1.01298701, 1.03691275, 1.02923977, 1.11612903, 1.01457726, 1.06285714, 0.98186528, 1.16470588, 0.86645963, 1.07317073, 1.09615385, 1.21192053, 0.94385027, 0.94244604, 0.88390501, 0.95718654, 0.9691358 , 1.01729107, 1.01119403, 1.20350877, 1.12890625, 1.06940063, 0.90410959, 1.14662757, 0.97093023, 1.03021148, 1.10629921, 0.97118156, 1.10693642, 
1.07917889, 0.9484127 , 1.07581227, 0.98006645, 0.98986486, 0.90066225, 0.90066225, 0.86779661, 0.86779661, 0.96996997, 1.01438849, 0.91186441, 0.91290323, 1.03745318, 1.0615942 , 0.97202797, 1.16608997, 0.94182825, 1.08333333, 0.9076087 , 1.18181818, 1.20618557, 1.01273885, 0.93606138, 0.87457627, 0.90575916, 1.09756098, 0.99115044, 1.13380282, 1.04333333, 1.04026846, 1.0297619 , 1.04334365, 1.03395062, 0.92553191, 0.98198198, 1. , 0.9439528 , 1.02684564, 1.1372549 , 0.96676737, 0.99649123, 1.07051282, 1.10367893, 1.0866426 , 1.15384615, 0.99667774])
You might find the lmfit module (https://lmfit.github.io/lmfit-py/) useful for this. It is designed to make curve fitting very easy, has built-in models for common peaks like Gaussian, and has many useful features such as allowing you to set bounds on parameters. A fit to your data with lmfit might look like this:

import numpy as np
import matplotlib.pyplot as plt
from lmfit.models import GaussianModel, ConstantModel

y = np.array([.....])   # uses your shorter data range
x = np.arange(len(y))

# make a model that is a Gaussian + a constant:
model = GaussianModel(prefix='peak_') + ConstantModel()

# make parameters with starting values:
params = model.make_params(c=1.0, peak_center=90,
                           peak_sigma=5, peak_amplitude=-5)

# it's not really needed for this data, but you can put bounds on
# parameters like this (or set .vary=False to fix a parameter)
params['peak_sigma'].min = 0         # sigma > 0
params['peak_amplitude'].max = 0     # amplitude < 0
params['peak_center'].min = 80
params['peak_center'].max = 100

# run fit
result = model.fit(y, params, x=x)

# print, plot results
print(result.fit_report())
plt.plot(x, y)
plt.plot(x, result.best_fit)
plt.show()

This will print out

[[Model]]
    (Model(gaussian, prefix='peak_') + Model(constant))
[[Fit Statistics]]
    # function evals   = 54
    # data points      = 200
    # variables        = 4
    chi-square         = 1.616
    reduced chi-square = 0.008
    Akaike info crit   = -955.625
    Bayesian info crit = -942.432
[[Variables]]
    peak_sigma:       4.03660814 +/- 0.204240 (5.06%) (init= 5)
    peak_center:      91.2246614 +/- 0.200267 (0.22%) (init= 90)
    peak_amplitude:  -9.79111362 +/- 0.445273 (4.55%) (init=-5)
    c:                1.02138228 +/- 0.006796 (0.67%) (init= 1)
    peak_fwhm:        9.50548558 +/- 0.480950 (5.06%)  == '2.3548200*peak_sigma'
    peak_height:     -0.96766623 +/- 0.041854 (4.33%)  == '0.3989423*peak_amplitude/max(1.e-15, peak_sigma)'
[[Correlations]] (unreported correlations are < 0.100)
    C(peak_sigma, peak_amplitude)  = -0.599
    C(peak_amplitude, c)           = -0.328
    C(peak_sigma, c)               =  0.196

and make a plot like this:
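If you would rather stay with plain scipy.optimize.curve_fit: the bounds argument constrains the fit parameters, not the x-range, so bounds=([2400, -np.inf, -np.inf], [2600, np.inf, np.inf]) forces the amplitude to lie between 2400 and 2600, and the starting guess max(y) falls outside that interval, hence the "x0 is infeasible" error. The usual way to confine the fit is to mask the region of interest and give starting values that describe the dip. A rough sketch of mine on synthetic data (the dip location, width, and the gauss_plus_c model below are illustrative assumptions, not taken from the post):

import numpy as np
from scipy.optimize import curve_fit

def gauss_plus_c(x, a, x0, sigma, c):
    # Gaussian dip (negative a) on top of a constant background c
    return a * np.exp(-(x - x0)**2 / (2 * sigma**2)) + c

# synthetic stand-in for the spectrum: background ~1 with a dip near channel 2490
rng = np.random.default_rng(0)
x = np.arange(5120.0)
y = 1.0 - 0.8 * np.exp(-(x - 2490)**2 / (2 * 4.0**2)) + 0.05 * rng.standard_normal(x.size)

# confine the fit by selecting the x-window, not by bounding the parameters
mask = (x >= 2400) & (x <= 2600)
xw, yw = x[mask], y[mask]

# starting values: negative amplitude for a dip, center at the minimum of the window
p0 = [yw.min() - yw.max(), xw[np.argmin(yw)], 5.0, np.median(yw)]
popt, pcov = curve_fit(gauss_plus_c, xw, yw, p0=p0)
print(popt)   # roughly [-0.8, 2490, 4, 1] for this synthetic example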
how to replicate scipy.stats.fit using optimization function?
I am trying to fit a distribution to some values. This is my code

from __future__ import print_function
import pandas as pd
import numpy as np
import scipy as sp
import scipy.optimize as opt
import scipy.stats
import matplotlib.pyplot as plt

values = np.random.pareto(1.5, 10000)
loc = values.min()
scale = 1

def cost_function(alpha):
    cost = -sp.stats.pareto(alpha, loc=loc, scale=scale).pdf(values)
    return cost.sum()

opt_res = opt.fmin(cost_function, 1.5)
alpha_fit_v = sp.stats.pareto.fit(values, floc=loc, fscale=scale)
print('opt_res = ', opt_res, ' alpha_fit_v = ', alpha_fit_v)

I was expecting alpha_fit_v to be equivalent to opt_res, but it is not. What's wrong?
What's wrong?

The cost function is wrong.
np.random.pareto has a different distribution than sp.stats.pareto.

1. The cost function is wrong

It does not make sense to sum inverse probabilities. You need to use the logarithm:

def cost_function(alpha):
    cost = -sp.stats.pareto(alpha, loc=loc, scale=scale).logpdf(values)
    return cost.sum()

2. np.random.pareto has a different distribution than sp.stats.pareto

This one is tricky, but you may have noticed that not even sp.stats.pareto.fit returns the correct result. This is because scipy's Pareto distribution cannot fit the data generated by numpy.

import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.stats
import scipy.optimize as opt

plt.subplot(2, 1, 1)
plt.hist(np.random.pareto(1.5, 10000), 1000)          # this is a Lomax or Pareto II distribution
plt.xlim(0, 10)

plt.subplot(2, 1, 2)
plt.hist(sp.stats.pareto.rvs(1.5, size=1000), 1000)   # this is a Pareto distribution
plt.xlim(0, 10)

That said, this will work as expected:

values = sp.stats.pareto.rvs(1.5, size=1000)
loc = 0
scale = 1

def cost_function(alpha):
    cost = -sp.stats.pareto(alpha, loc=loc, scale=scale).logpdf(values)
    return cost.sum()

opt_res = opt.fmin(cost_function, 1.5)
alpha_fit_v = sp.stats.pareto.fit(values, floc=loc, fscale=scale)
print('opt_res = ', opt_res, ' alpha_fit_v = ', alpha_fit_v)
# opt_res =  [ 1.49611816]  alpha_fit_v =  (1.4960937500000013, 0, 1)

According to the documentation, numpy.random.pareto does not quite draw from the Pareto distribution:

Draw samples from a Pareto II or Lomax distribution with specified shape. The Lomax or Pareto II distribution is a shifted Pareto distribution. The classical Pareto distribution can be obtained from the Lomax distribution by adding 1 and multiplying by the scale parameter m (see Notes).

So you have two alternatives if you use numpy to generate the data:

You can set loc=-1 for the scipy distribution.
You can do values = np.random.pareto(1.5, 10000) + 1 and set loc=0.
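To make the second alternative concrete, here is a small sketch of mine (not part of the original answer) that shifts the numpy samples by +1 so they follow scipy's classical Pareto with loc=0, after which the hand-rolled maximum-likelihood fit and pareto.fit agree:

import numpy as np
import scipy as sp
import scipy.stats
import scipy.optimize as opt

values = np.random.pareto(1.5, 10000) + 1   # Lomax samples shifted to a classical Pareto
loc, scale = 0, 1

def cost_function(alpha):
    # negative log-likelihood, as in the corrected cost function above
    return -sp.stats.pareto(alpha, loc=loc, scale=scale).logpdf(values).sum()

opt_res = opt.fmin(cost_function, 1.5)
alpha_fit_v = sp.stats.pareto.fit(values, floc=loc, fscale=scale)
print('opt_res =', opt_res, 'alpha_fit_v =', alpha_fit_v)   # both close to 1.5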
power-law curve fitting scipy, numpy not working
I came up with a problem in fitting a power-law curve on my data. I have two data sets: bins1 and bins2 bins1 acting fine in curve-fitting by using numpy.linalg.lstsq (I then use np.exp(coefs[0])*x**coefs[1] to get power-law equation) On the other hand, bins2 is acting weird and shows a bad R-squared Both data have different equations than what excel shows me (and worse R-squared). here is the code (and data): import numpy as np import matplotlib.pyplot as plt bins1 = np.array([[6.769318871738219667e-03, 1.306418618130891773e-02, 1.912138120913448383e-02, 2.545189874466026111e-02, 3.214689891729670401e-02, 4.101898933375244805e-02, 5.129862592803200588e-02, 6.636505322669797313e-02, 8.409809827572585494e-02, 1.058164348650862258e-01, 1.375849753230810046e-01, 1.830664031837437311e-01, 2.682454535427478137e-01, 3.912508246490400410e-01, 5.893271848997768680e-01, 8.480213305038615257e-01, 2.408136266017391058e+00, 3.629192766488219313e+00, 4.639246557509275171e+00, 9.901792214343277720e+00], [8.501658465758301112e-04, 1.562697718429977012e-03, 1.902062808421856087e-04, 4.411817741488644959e-03, 3.409236963162485048e-03, 1.686099657013027898e-03, 3.643231240239608402e-03, 2.544120616413291154e-04, 2.549036204611017029e-02, 3.527340723977697573e-02, 5.038482027310990652e-02, 5.617932487522721979e-02, 1.620407270423956103e-01, 1.906538999080910068e-01, 3.180688368126549093e-01, 2.364903188268162038e-01, 3.267322385964683273e-01, 9.384571074801122403e-01, 4.419747716107813029e-01, 9.254710022316929852e+00]]).T bins2 = np.array([[6.522512685133712192e-03, 1.300415548684437199e-02, 1.888928895701269539e-02, 2.509905819337970856e-02, 3.239654633369139919e-02, 4.130706234846069635e-02, 5.123820846515786398e-02, 6.444380072984744190e-02, 8.235238352205621892e-02, 1.070907072127811749e-01, 1.403438221033725120e-01, 1.863115065963684147e-01, 2.670209758710758163e-01, 4.003337413814173074e-01, 6.549054078382223754e-01, 1.116611087124244062e+00, 2.438604844718367914e+00, 3.480674117919704269e+00, 4.410201659398489404e+00, 6.401903059926267403e+00], [1.793454543936148608e-03, 2.441092334386309615e-03, 2.754373929745804715e-03, 1.182752729942167062e-03, 1.357797177773524414e-03, 6.711673916715021199e-03, 1.392761674092503343e-02, 1.127957613093066511e-02, 7.928803089359596004e-03, 2.524609593305639915e-02, 5.698702885370290905e-02, 8.607729156137132465e-02, 2.453761830112021203e-01, 9.734443815196883176e-02, 1.487480479168299119e-01, 9.918002699934079791e-01, 1.121298151253063535e+00, 1.389239135742518227e+00, 4.254082922056571237e-01, 2.643453492951096440e+00]]).T bins = bins1 #change to bins2 to see results for bins2 def fit(x,a,m): # power-law fit (based on previous studies) return a*(x**m) coefs= np.linalg.lstsq(np.vstack([np.ones(len(bins[:,0])), np.log(bins[:,0]), bins[:,0]]).T, np.log(bins[:,1]))[0] # calculating fitting coefficients (a,m) y_predict = fit(bins[:,0],np.exp(coefs[0]),coefs[1]) # prediction based of fitted model model_plot = plt.loglog(bins[:,0],bins[:,1],'o',label="error") fit_line = plt.plot(bins[:,0],y_predict,'r', label="fit") plt.ylabel('Y (bins[:,1])') plt.xlabel('X (bins[:,0])') plt.title('model') plt.legend(loc='best') plt.show(model_plot,fit_line) def R_sqr (y,y_predict): # calculating R squared value to measure fitting accuracy rsdl = y - y_predict ss_res = np.sum(rsdl**2) ss_tot = np.sum((y-np.mean(y))**2) R2 = 1-(ss_res/ss_tot) R2 = np.around(R2,decimals=4) return R2 R2= R_sqr(bins[:,1],y_predict) print ('(R^2 = %s)' % (R2)) The fit formula for bins1[[x],[y]]: python: y = 
0.337*(x)^1.223 (R^2 = 0.7773), excel: y = 0.289*(x)^1.174 (R^2 = 0.8548)
The fit formula for bins2[[x],[y]]: python: y = 0.509*(x)^1.332 (R^2 = -1.753), excel: y = 0.311*(x)^1.174 (R^2 = 0.9116)
These are two sample data sets out of 30; I randomly see this fitting problem in my data, and some sets have an R-squared around "-150"!! I tried scipy's curve_fit but I didn't get better results, in fact worse! Does anyone know how to get an Excel-like fit in Python?
You are trying to calculate an R-squared using Y's that have not been converted to log-space. The following change gives reasonable R-squared values:

R2 = R_sqr(np.log(bins[:,1]), np.log(y_predict))
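A self-contained sketch of my own (not taken from the question's code) showing the difference this makes: the same power-law fit scored once on the raw values and once on the logs; the log-space number is the one the answer above recommends comparing against Excel's trendline R².

import numpy as np

def r_squared(y, y_pred):
    # 1 - SS_res / SS_tot, same formula as R_sqr in the question
    resid = y - y_pred
    return 1.0 - np.sum(resid**2) / np.sum((y - np.mean(y))**2)

# noisy power-law data (illustrative, not the bins arrays from the question)
rng = np.random.default_rng(1)
x = np.logspace(-2, 1, 20)
y = 0.3 * x**1.2 * np.exp(0.2 * rng.standard_normal(x.size))

# fit log(y) = log(a) + m*log(x) by least squares
coefs = np.linalg.lstsq(np.vstack([np.ones_like(x), np.log(x)]).T,
                        np.log(y), rcond=None)[0]
a, m = np.exp(coefs[0]), coefs[1]
y_pred = a * x**m

print(r_squared(y, y_pred))                  # raw-space R^2 (what the question computed)
print(r_squared(np.log(y), np.log(y_pred)))  # log-space R^2 (what the answer recommends)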
Detrending a time-series of a multi-dimensional array without the for loops
I have a 3D array which holds a time series of air-sea carbon flux for each grid point on the earth's surface (model output). I want to remove the (linear) trend from the time series. I came across this code:

from matplotlib import mlab

for x in xrange(40):
    for y in xrange(182):
        cflux_detrended[:, x, y] = mlab.detrend_linear(cflux[:, x, y])

Can I speed this up by not using the for loops?
Scipy has a lot of signal processing tools. Using scipy.signal.detrend() will remove the linear trend along an axis of the data. From the documentation it looks like the linear trend of the complete data set will be subtracted from the time series at each grid point.

import scipy.signal

cflux_detrended = scipy.signal.detrend(cflux, axis=0)

Using scipy.signal will get the same result as using the method in the original post. Using Josef's detrend_separate() function will also return the same result.
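As a quick sanity check (my own sketch, not from the answer), the axis=0 call matches detrending each grid point's series separately with a per-column linear fit:

import numpy as np
import scipy.signal

# synthetic flux field: (time, lat, lon) with a linear trend added
rng = np.random.default_rng(0)
cflux = rng.standard_normal((120, 40, 182)) + 0.01 * np.arange(120)[:, None, None]

detrended = scipy.signal.detrend(cflux, axis=0)

# loop-free reference: one straight-line fit per grid point
t = np.arange(cflux.shape[0])
flat = cflux.reshape(cflux.shape[0], -1)
coef = np.polyfit(t, flat, 1)              # slope (coef[0]) and intercept (coef[1]) per column
trend = np.outer(t, coef[0]) + coef[1]
reference = (flat - trend).reshape(cflux.shape)

print(np.allclose(detrended, reference))   # expected: True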
Here are two versions using numpy.linalg.lstsq. This version uses np.vander to create any polynomial trend.

Warning: not tested except on the example.

I think something like this will be added to scikits.statsmodels, which doesn't yet have a multivariate version for detrending either. For the common-trend case, we could use scikits.statsmodels OLS and we would also get all the result statistics for the estimation.

# -*- coding: utf-8 -*-
"""Detrending multivariate array

Created on Fri Dec 02 15:08:42 2011

Author: Josef Perktold
http://stackoverflow.com/questions/8355197/detrending-a-time-series-of-a-multi-dimensional-array-without-the-for-loops

I should also add the multivariate version to statsmodels

"""

import numpy as np
import matplotlib.pyplot as plt


def detrend_common(y, order=1):
    '''detrend multivariate series by common trend

    Parameters
    ----------
    y : ndarray
        data, can be 1d or nd. if ndim is greater than 1, then observations
        are along zero axis
    order : int
        degree of polynomial trend, 1 is linear, 0 is constant

    Returns
    -------
    y_detrended : ndarray
        detrended data in same shape as original

    '''
    nobs = y.shape[0]
    shape = y.shape
    y_ = y.ravel()
    nobs_ = len(y_)
    t = np.repeat(np.arange(nobs), nobs_ / float(nobs))
    exog = np.vander(t, order + 1)
    params = np.linalg.lstsq(exog, y_)[0]
    fittedvalues = np.dot(exog, params)
    resid = (y_ - fittedvalues).reshape(*shape)
    return resid, params


def detrend_separate(y, order=1):
    '''detrend multivariate series by series-specific trends

    Parameters
    ----------
    y : ndarray
        data, can be 1d or nd. if ndim is greater than 1, then observations
        are along zero axis
    order : int
        degree of polynomial trend, 1 is linear, 0 is constant

    Returns
    -------
    y_detrended : ndarray
        detrended data in same shape as original

    '''
    nobs = y.shape[0]
    shape = y.shape
    y_ = y.reshape(nobs, -1)
    kvars_ = len(y_)
    t = np.arange(nobs)
    exog = np.vander(t, order + 1)
    params = np.linalg.lstsq(exog, y_)[0]
    fittedvalues = np.dot(exog, params)
    resid = (y_ - fittedvalues).reshape(*shape)
    return resid, params


nobs = 30
sige = 0.1
y0 = 0.5 * np.random.randn(nobs, 4, 3)
t = np.arange(nobs)
y_observed = y0 + t[:, None, None]

for detrend_func, name in zip([detrend_common, detrend_separate],
                              ['common', 'separate']):
    y_detrended, params = detrend_func(y_observed, order=1)
    print '\n\n', name
    print 'params for detrending'
    print params
    print 'std of detrended', y_detrended.std()  # should be roughly sig=0.5 (var of y0)
    print 'maxabs', np.max(np.abs(y_detrended - y0))
    print 'observed'
    print y_observed[-1]
    print 'detrended'
    print y_detrended[-1]
    print 'original "true"'
    print y0[-1]

    plt.figure()
    for i in range(4):
        for j in range(3):
            plt.plot(y0[:, i, j], 'bo', alpha=0.75)
            plt.plot(y_detrended[:, i, j], 'ro', alpha=0.75)
    plt.title(name + ' detrending: blue - original, red - detrended')

plt.show()

Since Nicholas pointed out scipy.signal.detrend: my detrend_separate is basically the same as scipy.signal.detrend, with fewer (no axis or breaks) or different (polynomial order) options.

>>> res = signal.detrend(y_observed, axis=0)
>>> (res - y0).var()
0.016931858083279336
>>> (y_detrended - y0).var()
0.01693185808327945
>>> (res - y_detrended).var()
8.402584948582852e-30
I think a plain old list comprehension is easiest:

cflux_detrended = np.array([[mlab.detrend_linear(t) for t in kk] for kk in cflux.T])
scipy linregress function erroneous standard error return?
I have a weird situation: scipy.stats.linregress seems to be returning an incorrect standard error.

from scipy import stats

x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
gradient, intercept, r_value, p_value, std_err = stats.linregress(x, y)

>>> gradient
5.3935773611970186
>>> intercept
-16.281127993087829
>>> r_value
0.72443514211849758
>>> r_value**2
0.52480627513624778
>>> std_err
3.6290901222878866

Whereas Excel returns the following:

slope: 5.394
intercept: -16.281
rsq: 0.525
steyX: 11.696

STEYX is Excel's standard error function, returning 11.696 versus scipy's 3.63. Anybody know what's going on here? Is there an alternative way of getting the standard error of a regression in Python, without going to Rpy?
I've just been informed by the SciPy user group that std_err here represents the standard error of the gradient, not the standard error of the predicted y's as in Excel. Nevertheless, users of this function should be careful, because this was not always the behaviour of this library - it used to output exactly what Excel does, and the changeover appears to have occurred in the past few months. Anyway, I am still looking for an equivalent to STEYX in Python.
You could try the statsmodels package:

In [37]: import statsmodels.api as sm

In [38]: x = [5.05, 6.75, 3.21, 2.66]

In [39]: y = [1.65, 26.5, -5.93, 7.96]

In [40]: X = sm.add_constant(x)  # intercept

In [41]: model = sm.OLS(y, X)

In [42]: fit = model.fit()

In [43]: fit.params
Out[43]: array([  5.39357736, -16.28112799])

In [44]: fit.rsquared
Out[44]: 0.52480627513624789

In [45]: np.sqrt(fit.mse_resid)
Out[45]: 11.696414461570097
Yes, this is true - the standard error of the gradient is what linregress returns; the standard error of the estimate (Y) is related, though, and you can back into the SEE by multiplying the standard error of the gradient (SEG) that linregress gives you:

SEG = SEE / sqrt( sum of (X - average X)**2 )

Stack Exchange doesn't handle LaTeX, but the math is here if you are interested, under the "Analyze Sample Data" heading.
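Putting numbers to that relation with the data from the question (a small sketch of mine, not from the original answer): rearranging gives SEE = SEG * sqrt(sum((X - mean X)**2)), which recovers Excel's STEYX from linregress's std_err:

import numpy as np
from scipy import stats

x = np.array([5.05, 6.75, 3.21, 2.66])
y = np.array([1.65, 26.5, -5.93, 7.96])

gradient, intercept, r_value, p_value, std_err = stats.linregress(x, y)
see = std_err * np.sqrt(np.sum((x - x.mean())**2))
print(see)   # ~11.696, matching Excel's STEYX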
This will give you an equivalent to STEYX using Python (with x and y as numpy arrays):

import numpy as np

x = np.asarray(x)
y = np.asarray(y)

fit = np.polyfit(x, y, deg=1)
n = len(x)
m = fit[0]
c = fit[1]
y_pred = m * x + c
STEYX = (((y - y_pred)**2).sum() / (n - 2))**0.5
print(STEYX)
The calculation of "std err on y" in excel is actually standard deviation of values of y. That's the same for std err on x. The number '2' in the final step is the degree of freedom of example you given. >>> x = [5.05, 6.75, 3.21, 2.66] >>> y = [1.65, 26.5, -5.93, 7.96] >>> def power(a): return a*5.3936-16.2811 >>> y_fit = list(map(power,x)) >>> y_fit [10.956580000000002, 20.125700000000005, 1.032356, -1.934123999999997] >>> var = [y[i]-y_fit[i] for i in range(len(y))] >>> def pow2(a): return a**2 >>> summa = list(map(pow2,var)) >>> summa [86.61243129640003, 40.63170048999993, 48.47440107073599, 97.89368972737596] >>> total = 0 >>> for i in summa: total += i >>> total 273.6122225845119 >>> import math >>> math.sqrt(total/2) 11.696414463084658