generate random lognormal distributions using shape of observed data - python

I'm trying to fit some data to a lognormal distribution and, from the fitted parameters, generate random lognormal samples.
After some searching I found a couple of solutions, but neither is convincing:
Solution 1, using the fit function:
import numpy as np
from scipy.stats import lognorm
mydata = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354]
shape, loc, scale = lognorm.fit(mydata)
rnd_log = lognorm.rvs(shape, loc=loc, scale=scale, size=100)
Or Solution 2, using mu and sigma from the original data:
import numpy as np
from scipy.stats import lognorm
mydata = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354]
mu = np.mean([np.log(i) for i in mydata])
sigma = np.std([np.log(i) for i in mydata])
distr = lognorm(mu, sigma)
rnd_log = distr.rvs(size=100)
Neither solution fits well:
import pylab
pylab.plot(sorted(mydata, reverse=True), 'ro')
pylab.plot(sorted(rnd_log, reverse=True), 'bx')
I am not sure I understand how to use the distributions correctly, or if I am missing something else...
I thought I would find the solution here: Does anyone have example code of using scipy.stats.distributions?
but I am not able to get the shape from my data... Am I missing something in the use of the fit function?
Thanks
EDIT:
Here is an example to illustrate my problem better:
print 'solution 1:'
means = []
stdes = []
distr = lognorm(mu, sigma)
for _ in xrange(1000):
    rnd_log = distr.rvs(size=100)
    means.append(np.mean([np.log(i) for i in rnd_log]))
    stdes.append(np.std([np.log(i) for i in rnd_log]))
print 'observed mean:', mu, 'mean simulated mean:', np.mean(means)
print 'observed std :', sigma, 'mean simulated std :', np.mean(stdes)

print '\nsolution 2:'
means = []
stdes = []
shape, loc, scale = lognorm.fit(mydata)
for _ in xrange(1000):
    rnd_log = lognorm.rvs(shape, loc=loc, scale=scale, size=100)
    means.append(np.mean([np.log(i) for i in rnd_log]))
    stdes.append(np.std([np.log(i) for i in rnd_log]))
print 'observed mean:', mu, 'mean simulated mean:', np.mean(means)
print 'observed std :', sigma, 'mean simulated std :', np.mean(stdes)
The result is:
solution 1:
observed mean: 1.82562655734 mean simulated mean: 1.18929982267
observed std : 1.39003773799 mean simulated std : 0.88985924363
solution 2:
observed mean: 1.82562655734 mean simulated mean: 4.50608084668
observed std : 1.39003773799 mean simulated std : 5.44206119499
While, if I do the same in R:
mydata <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354)
meanlog <- mean(log(mydata))
sdlog <- sd(log(mydata))
means <- c()
stdes <- c()
for (i in 1:1000){
    rnd.log <- rlnorm(length(mydata), meanlog, sdlog)
    means <- c(means, mean(log(rnd.log)))
    stdes <- c(stdes, sd(log(rnd.log)))
}
print(paste('observed mean:', meanlog, 'mean simulated mean:', mean(means), sep=' '))
print(paste('observed std :', sdlog, 'mean simulated std :', mean(stdes), sep=' '))
I get:
[1] "observed mean: 1.82562655733507 mean simulated mean: 1.82307191072317"
[1] "observed std : 1.39704049131865 mean simulated std : 1.39736545866904"
which is much closer, so I guess I am doing something wrong when using scipy...

The lognormal distribution in scipy is parametrized a little differently from the usual way. See the scipy.stats.lognorm docs, particularly the "Notes" section.
Here's how to get the results you're expecting (note that we hold the location fixed at 0 when fitting):
In [315]: from scipy import stats
In [316]: x = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354])
In [317]: mu, sigma = stats.norm.fit(np.log(x))
In [318]: mu, sigma
Out[318]: (1.8256265573350701, 1.3900377379913127)
In [319]: shape, loc, scale = stats.lognorm.fit(x, floc=0)
In [320]: np.log(scale), shape
Out[320]: (1.8256267737298788, 1.3900309739954713)
Now you can generate samples and confirm your expectations:
In [321]: dist = stats.lognorm(shape, loc, scale)
In [322]: means, sds = [], []
In [323]: for i in xrange(1000):
.....: sample = dist.rvs(size=100)
.....: logsample = np.log(sample)
.....: means.append(logsample.mean())
.....: sds.append(logsample.std())
.....:
In [324]: np.mean(means), np.mean(sds)
Out[324]: (1.8231068508345041, 1.3816361818739145)
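In other words, with loc held at 0, scipy's shape parameter plays the role of sigma and log(scale) plays the role of mu. Here is a minimal sketch of how this maps onto R's (meanlog, sdlog) parametrization; the helper name make_lognorm and the example data are mine:
import numpy as np
from scipy.stats import lognorm

def make_lognorm(meanlog, sdlog):
    # R's rlnorm(meanlog, sdlog) corresponds to scipy's
    # lognorm(s=sdlog, loc=0, scale=exp(meanlog))
    return lognorm(s=sdlog, loc=0, scale=np.exp(meanlog))

mydata = np.array([1, 2, 3, 5, 8, 13, 21, 34])   # any positive data
mu = np.log(mydata).mean()
sigma = np.log(mydata).std()

sample = make_lognorm(mu, sigma).rvs(size=100000)
print(np.log(sample).mean(), np.log(sample).std())   # close to mu and sigma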

Related

Confine a gaussian fit with curve_fit

As part of my bachelor's thesis, I need to evaluate my data with Python. Unfortunately there is no suitable script from my fellow students yet, and I'm quite new to programming.
I have this data set and I'm trying to fit it with a Gaussian using scipy.optimize.curve_fit. Since there are a lot of unusable counts, especially at the end of the axis, I'd like to confine the part that is to be fitted.
Picture: raw data
This is what I have so far:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

x = np.arange(5120)
y = np.array([ 0.81434599, 1.17054264, 0.85279188, ..., 1. ,
               1. , 13.56291391])  # most of the data isn't interesting
                                   # to me, part of interest see below

def Gauss(x, a, x0, sigma):
    return a * np.exp(-(x - x0)**2 / (2 * sigma**2))

mean = sum(x * y) / sum(y)
sigma = np.sqrt(sum(y * (x - mean)**2) / sum(y))

popt, pcov = curve_fit(Gauss, x, y, p0=[max(y), mean, sigma],
                       maxfev=360000)

plt.plot(x, y, label='data')
plt.plot(x, Gauss(x, *popt), 'r-', label='fit')
On docs.scipy.org I found a general description of curve_fit.
If I try using
bounds=([2400, -np.inf, -np.inf], [2600, np.inf, np.inf]),
I get ValueError: x0 is infeasible. What is the problem here?
I also tried to confine it with
popt,pcov = curve_fit(Gauss, x[2400:2600], y[2400:2600], p0=[max(y), mean, sigma], maxfev=360000)
as suggested in a comment on this question: "Error when obtaining gaussian fit for graph" at stackoverflow
In this case I only get a straight line though.
Picture: Confinement with x[2400:2600],y[2400:2600] as arguments of curve_fit
I really hope you can help me out here. I only need a way to fit a small part of my data. Thanks in advance!
Interesting data:
y = np.array([ 0.93396226, 1.00884956, 1.15457413, 1.07590759,
0.88915094, 1.07142857, 1.10714286, 1.14171123, 1.06666667,
0.84975369, 0.95480226, 0.99388379, 1.01675978, 0.83967391,
0.9771987 , 1.02402402, 1.04531722, 1.07492795, 0.97135417,
0.99714286, 1.0248139 , 1.26223776, 1.1533101 , 0.99099099,
1.18867925, 1.15772871, 0.95076923, 1.03313253, 1.02278481,
0.93265993, 1.06705539, 1.00265252, 1.02023121, 0.92076503,
0.99728997, 1.03353659, 1.15116279, 1.04336043, 0.95076923,
1.05515588, 0.92571429, 0.93448276, 1.02702703, 0.90056818,
0.96068796, 1.08493151, 1.13584906, 1.1212938 , 1.0739645 ,
0.98972603, 0.94594595, 1.07913669, 0.98425197, 0.87762238,
0.96811594, 1.02710843, 0.99392097, 0.91384615, 1.09809264,
1.00630915, 0.93175074, 0.87572254, 1.00651466, 0.78772379,
1.12244898, 1.2248062 , 0.97109827, 0.94607843, 0.97900262,
0.97527473, 1.01212121, 1.16422287, 1.20634921, 0.97275204,
1.01090909, 0.99404762, 1.00561798, 1.01146132, 1.08695652,
0.97214485, 1.03525641, 0.99096386, 1.05135952, 1.16451613,
0.90462428, 0.76876877, 0.47701149, 0.27607362, 0.21580547,
0.20598007, 0.16766467, 0.15533981, 0.19745223, 0.15407855,
0.18925831, 0.26997245, 0.47603834, 0.596875 , 0.85126582, 0.96
, 1.06578947, 1.08761329, 0.89548023, 0.99705882, 1.07142857,
0.95677233, 0.86119874, 1.02857143, 0.98250729, 0.94214876,
1.04166667, 0.96024465, 1.07022472, 1.10344828, 1.04859335,
0.96655518, 1.06424581, 1.01754386, 1.03492063, 1.18627451,
0.91036415, 1.03355705, 1.09116809, 0.96083551, 1.01298701,
1.03691275, 1.02923977, 1.11612903, 1.01457726, 1.06285714,
0.98186528, 1.16470588, 0.86645963, 1.07317073, 1.09615385,
1.21192053, 0.94385027, 0.94244604, 0.88390501, 0.95718654,
0.9691358 , 1.01729107, 1.01119403, 1.20350877, 1.12890625,
1.06940063, 0.90410959, 1.14662757, 0.97093023, 1.03021148,
1.10629921, 0.97118156, 1.10693642, 1.07917889, 0.9484127 ,
1.07581227, 0.98006645, 0.98986486, 0.90066225, 0.90066225,
0.86779661, 0.86779661, 0.96996997, 1.01438849, 0.91186441,
0.91290323, 1.03745318, 1.0615942 , 0.97202797, 1.16608997,
0.94182825, 1.08333333, 0.9076087 , 1.18181818, 1.20618557,
1.01273885, 0.93606138, 0.87457627, 0.90575916, 1.09756098,
0.99115044, 1.13380282, 1.04333333, 1.04026846, 1.0297619 ,
1.04334365, 1.03395062, 0.92553191, 0.98198198, 1. ,
0.9439528 , 1.02684564, 1.1372549 , 0.96676737, 0.99649123,
1.07051282, 1.10367893, 1.0866426 , 1.15384615, 0.99667774])
You might find the lmfit module (https://lmfit.github.io/lmfit-py/) useful for this. It is designed to make curve fitting very easy, has built-in models for common peaks like Gaussian, and has many useful features such as allowing you to set bounds on parameters. A fit to your data with lmfit might look like this:
import numpy as np
import matplotlib.pyplot as plt
from lmfit.models import GaussianModel, ConstantModel
y = np.array([.....]) # uses your shorter data range
x = np.arange(len(y))
# make a model that is a Gaussian + a constant:
model = GaussianModel(prefix='peak_') + ConstantModel()
# make parameters with starting values:
params = model.make_params(c=1.0, peak_center=90,
                           peak_sigma=5, peak_amplitude=-5)
# it's not really needed for this data, but you can put bounds on
# parameters like this (or set .vary=False to fix a parameter)
params['peak_sigma'].min = 0 # sigma > 0
params['peak_amplitude'].max = 0 # amplitude < 0
params['peak_center'].min = 80
params['peak_center'].max = 100
# run fit
result = model.fit(y, params, x=x)
# print, plot results
print(result.fit_report())
plt.plot(x, y)
plt.plot(x, result.best_fit)
plt.show()
This will print out
[[Model]]
(Model(gaussian, prefix='peak_') + Model(constant))
[[Fit Statistics]]
# function evals = 54
# data points = 200
# variables = 4
chi-square = 1.616
reduced chi-square = 0.008
Akaike info crit = -955.625
Bayesian info crit = -942.432
[[Variables]]
peak_sigma: 4.03660814 +/- 0.204240 (5.06%) (init= 5)
peak_center: 91.2246614 +/- 0.200267 (0.22%) (init= 90)
peak_amplitude: -9.79111362 +/- 0.445273 (4.55%) (init=-5)
c: 1.02138228 +/- 0.006796 (0.67%) (init= 1)
peak_fwhm: 9.50548558 +/- 0.480950 (5.06%) == '2.3548200*peak_sigma'
peak_height: -0.96766623 +/- 0.041854 (4.33%) == '0.3989423*peak_amplitude/max(1.e-15, peak_sigma)'
[[Correlations]] (unreported correlations are < 0.100)
C(peak_sigma, peak_amplitude) = -0.599
C(peak_amplitude, c) = -0.328
C(peak_sigma, c) = 0.196
and make a plot of the data together with the fitted curve (figure not shown here).
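For what it's worth, the plain curve_fit approach from the question can also be made to work. The ValueError: x0 is infeasible most likely arises because bounds constrain the fit parameters in the order (a, x0, sigma), so the 2400-2600 limits are applied to the amplitude a while its starting value max(y) lies outside them; bounds do not select an x range. To restrict the fit to a window you slice the data and estimate starting values from that slice, and because the baseline sits near 1 the model also needs a constant offset. A minimal sketch under those assumptions (gauss_plus_const and the starting-value names are mine):
import numpy as np
from scipy.optimize import curve_fit

def gauss_plus_const(x, a, x0, sigma, c):
    # Gaussian dip/peak on top of a flat baseline
    return a * np.exp(-(x - x0)**2 / (2 * sigma**2)) + c

# restrict the fit to the window of interest
xw, yw = x[2400:2600], y[2400:2600]

# starting values taken from the window itself, not from the full data
c0 = np.median(yw)               # baseline level
a0 = yw.min() - c0               # dip amplitude (negative)
x00 = xw[np.argmin(yw)]          # centre of the dip
sigma0 = 5.0                     # rough width guess in channels

popt, pcov = curve_fit(gauss_plus_const, xw, yw, p0=[a0, x00, sigma0, c0])
print(popt)
The straight line you got from slicing alone is probably because mean and sigma were still estimated from the full 5120-point range, so the optimizer started far from the window and settled on a flat curve.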

how to replicate scipy.stats.fit using optimization function?

I am trying to fit a distribution to some values. This is my code:
from __future__ import print_function
import pandas as pd
import numpy as np
import scipy as sp
import scipy.optimize as opt
import scipy.stats
import matplotlib.pyplot as plt
values = np.random.pareto(1.5, 10000)
loc = values.min()
scale = 1
def cost_function(alpha):
    cost = -sp.stats.pareto(alpha, loc=loc, scale=scale).pdf(values)
    return cost.sum()

opt_res = opt.fmin(cost_function, 1.5)
alpha_fit_v = sp.stats.pareto.fit(values, floc=loc, fscale=scale)
print('opt_res = ', opt_res,
      ' alpha_fit_v = ', alpha_fit_v)
I was expecting alpha_fit_v to be equivalent to opt_res, but it is not. What's wrong?
What's wrong? Two things:
1. The cost function is wrong.
2. np.random.pareto draws from a different distribution than sp.stats.pareto.
1. The cost function is wrong
It does not make sense to sum negated probability densities; maximum-likelihood fitting minimizes the negative log-likelihood, so you need the logarithm:
def cost_function(alpha):
    cost = -sp.stats.pareto(alpha, loc=loc, scale=scale).logpdf(values)
    return cost.sum()
2. np.random.pareto has a different distribution than sp.stats.pareto
This one is tricky, but you may have noticed that not even sp.stats.pareto.fit returns the correct result. That is because numpy draws from a Lomax (Pareto II) distribution, which scipy's classical Pareto distribution cannot match without adjusting loc.
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.stats

plt.subplot(2, 1, 1)
plt.hist(np.random.pareto(1.5, 10000), 1000)         # this is a Lomax or Pareto II distribution
plt.xlim(0, 10)

plt.subplot(2, 1, 2)
plt.hist(sp.stats.pareto.rvs(1.5, size=1000), 1000)  # this is a Pareto distribution
plt.xlim(0, 10)
plt.show()
That said, this will work as expected:
values = sp.stats.pareto.rvs(1.5, size=1000)
loc = 0
scale = 1
def cost_function(alpha):
    cost = -sp.stats.pareto(alpha, loc=loc, scale=scale).logpdf(values)
    return cost.sum()

opt_res = opt.fmin(cost_function, 1.5)
alpha_fit_v = sp.stats.pareto.fit(values, floc=loc, fscale=scale)
print('opt_res = ', opt_res,
      ' alpha_fit_v = ', alpha_fit_v)
# opt_res = [ 1.49611816]  alpha_fit_v = (1.4960937500000013, 0, 1)
According to the documentation numpy.random.pareto does not quite draw from the Pareto distribution:
Draw samples from a Pareto II or Lomax distribution with specified shape.
The Lomax or Pareto II distribution is a shifted Pareto distribution. The classical Pareto distribution can be obtained from the Lomax distribution by adding 1 and multiplying by the scale parameter m (see Notes).
So you have two alternatives if using numpy to generate the data:
You can set loc=-1 for the scipy distribution.
You can do values = np.random.pareto(1.5, 10000) + 1 and set loc=0.
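A minimal sketch of the second alternative, checking that the shifted numpy samples and the scipy fit agree (the seed is only for reproducibility):
import numpy as np
import scipy.stats
import scipy.optimize as opt

np.random.seed(0)
values = np.random.pareto(1.5, 10000) + 1    # shift Lomax samples to a classical Pareto

def cost_function(alpha):
    # negative log-likelihood of a Pareto(alpha) with loc=0, scale=1 held fixed
    return -scipy.stats.pareto.logpdf(values, alpha, loc=0, scale=1).sum()

opt_res = opt.fmin(cost_function, 1.5)
alpha_fit_v = scipy.stats.pareto.fit(values, floc=0, fscale=1)
print('opt_res = ', opt_res, ' alpha_fit_v = ', alpha_fit_v)
# both alpha estimates should be close to the true 1.5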

power-law curve fitting scipy, numpy not working

I ran into a problem fitting a power-law curve to my data. I have two data sets: bins1 and bins2.
bins1 behaves fine in curve fitting using numpy.linalg.lstsq (I then use np.exp(coefs[0])*x**coefs[1] to get the power-law equation).
On the other hand, bins2 acts weird and shows a bad R-squared.
Both data sets give different equations than what Excel shows me (and a worse R-squared).
Here is the code (and data):
import numpy as np
import matplotlib.pyplot as plt
bins1 = np.array([[6.769318871738219667e-03,
1.306418618130891773e-02,
1.912138120913448383e-02,
2.545189874466026111e-02,
3.214689891729670401e-02,
4.101898933375244805e-02,
5.129862592803200588e-02,
6.636505322669797313e-02,
8.409809827572585494e-02,
1.058164348650862258e-01,
1.375849753230810046e-01,
1.830664031837437311e-01,
2.682454535427478137e-01,
3.912508246490400410e-01,
5.893271848997768680e-01,
8.480213305038615257e-01,
2.408136266017391058e+00,
3.629192766488219313e+00,
4.639246557509275171e+00,
9.901792214343277720e+00],
[8.501658465758301112e-04,
1.562697718429977012e-03,
1.902062808421856087e-04,
4.411817741488644959e-03,
3.409236963162485048e-03,
1.686099657013027898e-03,
3.643231240239608402e-03,
2.544120616413291154e-04,
2.549036204611017029e-02,
3.527340723977697573e-02,
5.038482027310990652e-02,
5.617932487522721979e-02,
1.620407270423956103e-01,
1.906538999080910068e-01,
3.180688368126549093e-01,
2.364903188268162038e-01,
3.267322385964683273e-01,
9.384571074801122403e-01,
4.419747716107813029e-01,
9.254710022316929852e+00]]).T
bins2 = np.array([[6.522512685133712192e-03,
1.300415548684437199e-02,
1.888928895701269539e-02,
2.509905819337970856e-02,
3.239654633369139919e-02,
4.130706234846069635e-02,
5.123820846515786398e-02,
6.444380072984744190e-02,
8.235238352205621892e-02,
1.070907072127811749e-01,
1.403438221033725120e-01,
1.863115065963684147e-01,
2.670209758710758163e-01,
4.003337413814173074e-01,
6.549054078382223754e-01,
1.116611087124244062e+00,
2.438604844718367914e+00,
3.480674117919704269e+00,
4.410201659398489404e+00,
6.401903059926267403e+00],
[1.793454543936148608e-03,
2.441092334386309615e-03,
2.754373929745804715e-03,
1.182752729942167062e-03,
1.357797177773524414e-03,
6.711673916715021199e-03,
1.392761674092503343e-02,
1.127957613093066511e-02,
7.928803089359596004e-03,
2.524609593305639915e-02,
5.698702885370290905e-02,
8.607729156137132465e-02,
2.453761830112021203e-01,
9.734443815196883176e-02,
1.487480479168299119e-01,
9.918002699934079791e-01,
1.121298151253063535e+00,
1.389239135742518227e+00,
4.254082922056571237e-01,
2.643453492951096440e+00]]).T
bins = bins1 #change to bins2 to see results for bins2
def fit(x, a, m):  # power-law fit (based on previous studies)
    return a*(x**m)
coefs= np.linalg.lstsq(np.vstack([np.ones(len(bins[:,0])), np.log(bins[:,0]), bins[:,0]]).T, np.log(bins[:,1]))[0] # calculating fitting coefficients (a,m)
y_predict = fit(bins[:,0],np.exp(coefs[0]),coefs[1]) # prediction based of fitted model
model_plot = plt.loglog(bins[:,0],bins[:,1],'o',label="error")
fit_line = plt.plot(bins[:,0],y_predict,'r', label="fit")
plt.ylabel('Y (bins[:,1])')
plt.xlabel('X (bins[:,0])')
plt.title('model')
plt.legend(loc='best')
plt.show(model_plot,fit_line)
def R_sqr(y, y_predict):  # calculating R squared value to measure fitting accuracy
    rsdl = y - y_predict
    ss_res = np.sum(rsdl**2)
    ss_tot = np.sum((y - np.mean(y))**2)
    R2 = 1 - (ss_res/ss_tot)
    R2 = np.around(R2, decimals=4)
    return R2
R2= R_sqr(bins[:,1],y_predict)
print ('(R^2 = %s)' % (R2))
The fit formula for bins1 [[x],[y]] in Python: y = 0.337*(x)^1.223 (R^2 = 0.7773); in Excel: y = 0.289*(x)^1.174 (R^2 = 0.8548).
The fit formula for bins2 [[x],[y]] in Python: y = 0.509*(x)^1.332 (R^2 = -1.753); in Excel: y = 0.311*(x)^1.174 (R^2 = 0.9116).
These are two sample data sets out of 30; I see this fitting problem at random in my data, and some sets have an R-squared around -150!
I tried scipy's curve_fit but didn't get better results; in fact they were worse!
Does anyone know how to get an Excel-like fit in Python?
You are trying to calculate an R-squared using Y's that have not been converted to log-space. The following change gives reasonable R-squared values:
R2 = R_sqr(np.log(bins[:,1]), np.log(y_predict))
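For completeness, here is a minimal sketch of a pure power-law fit done entirely in log-log space, with the R-squared also evaluated on the log values, which is what Excel's power trendline reports. Note that the design matrix in the question has a third bins[:,0] column, so it actually fits log y = c0 + c1*log x + c2*x rather than a plain power law; dropping that column is an assumption on my part about what was intended:
import numpy as np

x, y = bins[:, 0], bins[:, 1]

# fit log(y) = log(a) + m*log(x) by ordinary least squares
A = np.vstack([np.ones_like(x), np.log(x)]).T
coefs = np.linalg.lstsq(A, np.log(y))[0]
a, m = np.exp(coefs[0]), coefs[1]
y_predict = a * x**m

# R-squared in log space, comparable to Excel's power-trendline R^2
log_resid = np.log(y) - np.log(y_predict)
ss_res = np.sum(log_resid**2)
ss_tot = np.sum((np.log(y) - np.log(y).mean())**2)
print('y = %.3f * x^%.3f  (R^2 = %.4f)' % (a, m, 1 - ss_res/ss_tot))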

Detrending a time-series of a multi-dimensional array without the for loops

I have a 3D array which holds a time series of air-sea carbon flux for each grid point on the earth's surface (model output). I want to remove the (linear) trend from the time series. I came across this code:
import numpy as np
from matplotlib import mlab

cflux_detrended = np.empty_like(cflux)   # assumes cflux has shape (ntime, 40, 182)
for x in xrange(40):
    for y in xrange(182):
        cflux_detrended[:, x, y] = mlab.detrend_linear(cflux[:, x, y])
Can I speed this up by not using for loops?
Scipy has a lot of signal processing tools.
Using scipy.signal.detrend() will remove the linear trend along an axis of the data. From the documentation, a separate linear trend is fitted and subtracted from the time series at each grid point along the chosen axis.
import scipy.signal
cflux_detrended = scipy.signal.detrend(cflux, axis=0)
Using scipy.signal will get the same result as using the method in the original post. Using Josef's detrend_separate() function will also return the same result.
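A quick check of that equivalence, using random stand-in data since cflux is not given (the shape (120, 40, 182) is an assumption matching the loops in the question):
import numpy as np
from matplotlib import mlab
import scipy.signal

cflux = np.random.randn(120, 40, 182) + 0.01 * np.arange(120)[:, None, None]

looped = np.empty_like(cflux)
for x in range(40):
    for y in range(182):
        looped[:, x, y] = mlab.detrend_linear(cflux[:, x, y])

vectorised = scipy.signal.detrend(cflux, axis=0)
print(np.allclose(looped, vectorised))   # True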
Here are two versions using numpy.linalg.lstsq. They use np.vander to create a polynomial trend of any order.
Warning: not tested except on the example.
I think something like this will be added to scikits.statsmodels, which doesn't yet have a multivariate version for detrending either. For the common-trend case, we could use scikits.statsmodels OLS and we would also get all the result statistics for the estimation.
# -*- coding: utf-8 -*-
"""Detrending multivariate array

Created on Fri Dec 02 15:08:42 2011
Author: Josef Perktold
http://stackoverflow.com/questions/8355197/detrending-a-time-series-of-a-multi-dimensional-array-without-the-for-loops

I should also add the multivariate version to statsmodels
"""

import numpy as np
import matplotlib.pyplot as plt


def detrend_common(y, order=1):
    '''detrend multivariate series by common trend

    Parameters
    ----------
    y : ndarray
        data, can be 1d or nd. if ndim is greater than 1, then observations
        are along zero axis
    order : int
        degree of polynomial trend, 1 is linear, 0 is constant

    Returns
    -------
    y_detrended : ndarray
        detrended data in same shape as original

    '''
    nobs = y.shape[0]
    shape = y.shape
    y_ = y.ravel()
    nobs_ = len(y_)
    t = np.repeat(np.arange(nobs), nobs_ / float(nobs))
    exog = np.vander(t, order + 1)
    params = np.linalg.lstsq(exog, y_)[0]
    fittedvalues = np.dot(exog, params)
    resid = (y_ - fittedvalues).reshape(*shape)
    return resid, params


def detrend_separate(y, order=1):
    '''detrend multivariate series by series specific trends

    Parameters
    ----------
    y : ndarray
        data, can be 1d or nd. if ndim is greater than 1, then observations
        are along zero axis
    order : int
        degree of polynomial trend, 1 is linear, 0 is constant

    Returns
    -------
    y_detrended : ndarray
        detrended data in same shape as original

    '''
    nobs = y.shape[0]
    shape = y.shape
    y_ = y.reshape(nobs, -1)
    kvars_ = len(y_)
    t = np.arange(nobs)
    exog = np.vander(t, order + 1)
    params = np.linalg.lstsq(exog, y_)[0]
    fittedvalues = np.dot(exog, params)
    resid = (y_ - fittedvalues).reshape(*shape)
    return resid, params


nobs = 30
sige = 0.1
y0 = 0.5 * np.random.randn(nobs, 4, 3)
t = np.arange(nobs)
y_observed = y0 + t[:, None, None]

for detrend_func, name in zip([detrend_common, detrend_separate],
                              ['common', 'separate']):
    y_detrended, params = detrend_func(y_observed, order=1)
    print '\n\n', name
    print 'params for detrending'
    print params
    print 'std of detrended', y_detrended.std()  # should be roughly 0.5 (std of y0)
    print 'maxabs', np.max(np.abs(y_detrended - y0))
    print 'observed'
    print y_observed[-1]
    print 'detrended'
    print y_detrended[-1]
    print 'original "true"'
    print y0[-1]

    plt.figure()
    for i in range(4):
        for j in range(3):
            plt.plot(y0[:, i, j], 'bo', alpha=0.75)
            plt.plot(y_detrended[:, i, j], 'ro', alpha=0.75)
    plt.title(name + ' detrending: blue - original, red - detrended')

plt.show()
Since Nicholas pointed out scipy.signal.detrend: my detrend_separate is basically the same as scipy.signal.detrend, with fewer options (no axis or breakpoints) or different ones (polynomial order).
>>> res = signal.detrend(y_observed, axis=0)
>>> (res - y0).var()
0.016931858083279336
>>> (y_detrended - y0).var()
0.01693185808327945
>>> (res - y_detrended).var()
8.402584948582852e-30
I think a plain old list comprehension is easiest:
cflux_detrended = np.array([[mlab.detrend_linear(t) for t in kk] for kk in cflux.T])
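One note on the list comprehension, assuming cflux has shape (ntime, 40, 182) as in the loops above: iterating over cflux.T produces an array with the axes reversed, so a final transpose puts the result back into the original layout:
# same comprehension, transposed back to (ntime, 40, 182)
cflux_detrended = np.array([[mlab.detrend_linear(t) for t in kk]
                            for kk in cflux.T]).T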

scipy linregress function erroneous standard error return?

I have a weird situation: scipy.stats.linregress seems to be returning an incorrect standard error:
from scipy import stats
x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y)
>>> gradient
5.3935773611970186
>>> intercept
-16.281127993087829
>>> r_value
0.72443514211849758
>>> r_value**2
0.52480627513624778
>>> std_err
3.6290901222878866
Whereas Excel returns the following:
slope: 5.394
intercept: -16.281
rsq: 0.525
steyX: 11.696
STEYX is Excel's standard error function, returning 11.696 versus scipy's 3.63. Does anybody know what's going on here? Is there an alternative way of getting the standard error of a regression in Python, without going to Rpy?
I've just been informed by the SciPy user group that the std_err here represents the standard error of the gradient (slope), not the standard error of the predicted y's as in Excel. Nevertheless, users of this function should be careful, because this was not always the behaviour of this library: it used to output exactly what Excel does, and the changeover appears to have occurred in the past few months.
Anyway still looking for an equivalent to STEYX in Python.
You could try the statsmodels package:
In [37]: import statsmodels.api as sm
In [38]: x = [5.05, 6.75, 3.21, 2.66]
In [39]: y = [1.65, 26.5, -5.93, 7.96]
In [40]: X = sm.add_constant(x) # intercept
In [41]: model = sm.OLS(y, X)
In [42]: fit = model.fit()
In [43]: fit.params
Out[43]: array([ 5.39357736, -16.28112799])
In [44]: fit.rsquared
Out[44]: 0.52480627513624789
In [45]: np.sqrt(fit.mse_resid)
Out[45]: 11.696414461570097
Yes, this is true: the standard error of the gradient is what linregress returns. The standard error of the estimate (of Y) is related, though, and you can back into the SEE from the standard error of the gradient (SEG) that linregress gives you, since SEG = SEE / sqrt(sum of (X - average X)**2).
Stack Exchange doesn't handle LaTeX, but the math is here if you are interested, under the "Analyze Sample Data" heading.
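In code, that back-calculation looks like this (a minimal sketch; for the example above it reproduces Excel's STEYX):
import numpy as np
from scipy import stats

x = np.array([5.05, 6.75, 3.21, 2.66])
y = np.array([1.65, 26.5, -5.93, 7.96])

gradient, intercept, r_value, p_value, std_err = stats.linregress(x, y)

# SEE = SEG * sqrt(sum of squared deviations of X)
see = std_err * np.sqrt(((x - x.mean())**2).sum())
print(see)   # about 11.696, matching Excel's STEYX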
This will give you an equivalent to STEYX using Python (with x and y as numpy arrays):
fit = np.polyfit(x, y, deg=1)
n = len(x)
m = fit[0]
c = fit[1]
y_pred = m*x + c
STEYX = (((y - y_pred)**2).sum() / (n - 2))**0.5
print(STEYX)
The calculation of "std err on y" (STEYX) in Excel is actually the standard deviation of the residuals, i.e. of the differences between the observed y values and the fitted y values; the same idea applies to the standard error on x. The divisor 2 in the final step is the residual degrees of freedom, n - 2, for the four-point example you gave.
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> def power(a):
...     return a*5.3936 - 16.2811
...
>>> y_fit = list(map(power, x))
>>> y_fit
[10.956580000000002, 20.125700000000005, 1.032356, -1.934123999999997]
>>> var = [y[i] - y_fit[i] for i in range(len(y))]
>>> def pow2(a):
...     return a**2
...
>>> summa = list(map(pow2, var))
>>> summa
[86.61243129640003, 40.63170048999993, 48.47440107073599, 97.89368972737596]
>>> total = 0
>>> for i in summa:
...     total += i
...
>>> total
273.6122225845119
>>> import math
>>> math.sqrt(total/2)
11.696414463084658
