How to fix the poor fitting of 1-D data? - python

I have a 1-D data set with a single column, and I would like to fit a model to it so that I can sample from that model. The raw data:
[Image: data set]
I tried various theoretical distributions from the Fitter package (https://pypi.org/project/fitter/), but none of them fit well. Then I tried kernel density estimation with sklearn. The fit is good, but because of the way KDE works I could not prevent it from producing negative values. Finally, I tried a log-normal distribution, but that fit is not very good either.
Code for the log-normal fit:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
import math
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

NN = 3915  # sample the same number of points as the original data set
df = pd.read_excel(r'Data_sets2.xlsx', sheet_name="Set1")
eps = 0.1  # additional term for c

# Estimate the parameters of log(c), treated as a normal distribution
df["c"] = df["c"] + eps
mu = np.mean(np.log(df["c"]))
s = np.std(np.log(df["c"]))
print("Mean:", mu, "std:", s)

def simulate(N):
    c = []
    for i in range(N):
        c_s = np.exp(np.random.normal(loc=mu, scale=s, size=1)[0])
        c.append(round(c_s))
    return c

predicted_c = simulate(NN)
XX = np.arange(NN)  # np.arange replaces scipy.arange, which current SciPy no longer provides

### plot C relation ###
plt.scatter(XX, df["c"], color='g', label="Original data")
plt.scatter(XX, predicted_c, color='r', label="Sample data")
plt.xlabel('Index')
plt.ylabel('c')
plt.legend()
plt.show()
[Image: original vs. samples]
What I am looking for is a way to improve the fit. Any suggestions, or pointers to models that may fit my data more accurately, are appreciated. Thanks.

Here is a graphical Python fitter for the scipy statistical distribution Double Gamma (dgamma) using your spreadsheet data. I hope it is of some use, since a normal distribution seems to be a poor fit to this data set. The scipy documentation for dgamma is at https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.dgamma.html. Incidentally, the double Weibull distribution fit almost as well.
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_excel (r'Data_sets2.xlsx',sheet_name="Set1")
eps = 0.1 # Additional term for c
data = df["c"] + eps
P = ss.dgamma.fit(data)
rX = np.linspace(min(data), max(data), 50)
rP = ss.dgamma.pdf(rX, *P)
plt.hist(data, bins=25, density=True, color='slategrey')  # density= replaces the removed normed= argument
plt.plot(rX, rP, color='darkturquoise')
plt.show()
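On the negative values mentioned in the question: one possible workaround, sketched here, is to fit the KDE to the logarithm of the data and exponentiate the resampled values, which keeps every sample positive. This sketch uses scipy.stats.gaussian_kde rather than sklearn, and the gamma-distributed `data` array is a synthetic stand-in (an assumption) for the spreadsheet column, which is not available here.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic stand-in for the positive, right-skewed column "c" (an assumption;
# the original spreadsheet is not available here). The 0.1 mirrors eps above.
data = rng.gamma(shape=2.0, scale=50.0, size=3915) + 0.1

# Fit the KDE to log(data). Every resampled value is mapped back through
# exp(), so the samples are strictly positive by construction.
log_kde = gaussian_kde(np.log(data))
samples = np.exp(log_kde.resample(size=len(data), seed=1)[0])

print("min of samples:", samples.min())
print("median (data, samples):", np.median(data), np.median(samples))
```

The trade-off is that the smoothing now happens on a log scale, so the right tail of the sampled values can be somewhat heavier than with a KDE fitted on the raw scale.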

Related

How to implement a butterworth filter

I am trying to implement a Butterworth filter with Python in a Jupyter Notebook. I wrote this code following a tutorial.
The data come from a CSV file called Samples.csv.
The data in Samples.csv look like this:
998,4778415
1009,209592
1006,619094
1001,785406
993,9426543
990,1408991
992,736118
995,8127334
1002,381664
1006,094429
1000,634799
999,3287747
1002,318812
999,3287747
1004,427698
1008,516733
1007,964781
1002,680906
1000,14449
994,257009
The column is called Euclidian Norm. The data range from 0 to 1679.286158 and there are 1838 rows.
This is the code in Jupyter:
from scipy.signal import filtfilt
from scipy import stats
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy

def plot():
    data=pd.read_csv('Samples.csv',sep=";", decimal=",")
    sensor_data=data[['Euclidian Norm']]
    sensor_data=np.array(sensor_data)
    time=np.linspace(0,1679.286158,1838)
    plt.plot(time,sensor_data)
    plt.show()
    filtered_signal=bandPassFilter(sensor_data)
    plt.plot(time,sensor_data)
    plt.show()

def bandPassFilter(signal):
    fs = 4000.0
    lowcut=20.0
    highcut=50.0
    nyq=0.5*fs
    low=lowcut/nyq
    high=highcut/nyq
    order =2
    b,a=scipy.signal.butter(order,[low,high],'bandpass',analog=False)
    y=scipy.signal.filtfilt(b,a,signal,axis=0)
    return(y)

plot()
My problem is that nothing changes in my data: the filter does not seem to be applied. The graph of the filtered data looks the same as the source data. Does anyone know what could be wrong?
The first graph is the source data and the second graph is the filtered data. They look identical.
I can't comment yet.
You never use filtered_signal: you call plt.plot with the same arguments (time, sensor_data) twice.
Here's one of my implementations with added interpolation, very similar to yours:
import numpy as np
import matplotlib.pyplot as plt
import scipy.signal
import scipy.interpolate

def butterFit(data, freq, order=2):
    ar = scipy.signal.butter(order, freq)  # Gets the (b, a) coefficients for filtfilt
    return scipy.signal.filtfilt(ar[0], ar[1], data)

def plotFilteredSplines(timeframe, data, amount_points):
    # Generate evenly spread indices for the data points.
    indices = np.arange(0, len(data), amount_points)
    cutoff_freq = 2 / (2/10 * len(timeframe))
    # Reshape the data with butter :)
    data = butterFit(data, cutoff_freq)
    # Plot filtered data
    plt.plot(timeframe, data, '-.')
    interpol_x = np.linspace(timeframe[0], timeframe[-1], 100)
    # Get the cubic spline approximation function
    interpolation = scipy.interpolate.interp1d(timeframe, data, kind='cubic')
    # Plot the interpolation over the extended time frame.
    plt.plot(interpol_x, interpolation(interpol_x), '-r')
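To check that a band-pass filter like the one in the question really changes a signal, a minimal sketch on synthetic data follows. The test signal (5 Hz, 35 Hz and 300 Hz tones), the one-second duration and the helper names `band_pass` and `amplitude_at` are assumptions made for illustration; only fs, the cutoffs and the order are taken from the question.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 4000.0                      # sampling rate in Hz (value from the question)
t = np.arange(0, 1.0, 1.0 / fs)  # one second of synthetic samples (assumption)
# Synthetic test signal: 5 Hz drift + 35 Hz in-band tone + 300 Hz interference
raw = (np.sin(2 * np.pi * 5 * t)
       + np.sin(2 * np.pi * 35 * t)
       + np.sin(2 * np.pi * 300 * t))

def band_pass(signal, lowcut=20.0, highcut=50.0, fs=fs, order=2):
    """Zero-phase Butterworth band-pass; cutoffs are normalised by Nyquist."""
    nyq = 0.5 * fs
    b, a = butter(order, [lowcut / nyq, highcut / nyq], btype="bandpass")
    return filtfilt(b, a, signal)

filtered = band_pass(raw)

def amplitude_at(x, freq_hz):
    """Amplitude of the FFT bin closest to freq_hz (bins are 1 Hz apart here)."""
    spectrum = np.abs(np.fft.rfft(x)) / (len(x) / 2)
    return spectrum[int(round(freq_hz * len(x) / fs))]

# Compare the amplitude of each tone before and after filtering.
for f in (5, 35, 300):
    print(f, "Hz:", amplitude_at(raw, f), "->", amplitude_at(filtered, f))
```

In the question's plot() function, the equivalent fix is to pass filtered_signal to the second plt.plot call. Separately, 1838 samples spread over a time axis of about 1679 imply roughly 1.1 samples per time unit; if that axis is in seconds, fs = 4000 would not match the data, so the sampling rate is worth checking.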

Weak PYMC3 Estimates for Large Datasets

I generated a dataset from a known Weibull distribution:
Weibull(alpha = A·SI^-n, beta), where A = 1800, n = 0.5, beta = 1.5, and SI = 1000. Here is the link to the dataset (DF1).
I tried to estimate the parameters of the distribution in a Bayesian analysis using PyMC3; my code is below. The Bayesian estimates are very good when the dataset is small (100 data points), but they drift away from the true values when the dataset is larger (500 data points).
For the larger dataset I tried to get better estimates by increasing the number of samples to 10000, tune to 10000, and target_accept to 0.99, but the estimates did not change significantly and were still far from the true values. Does anyone know how to set the parameters of pm.sample() to get better estimates for the larger dataset?
import warnings
import pandas as pd
import arviz as az
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
import theano.tensor as tt
from pymc3 import Model, Normal, Slice, sample
from pymc3.distributions import Interpolated
from scipy import stats

SI=1000
ns1 =round(DF1['biased drops'],2)
SIs=round(DF1['SIs'],2)

def logp(SIs,ns1,SI):
    summ1 = 0
    for i in range(0,len(DF1)):
        print(i)
        F=DF1['failure'][i]
        nu=(ns1[i])*(SIs[i]/SI)**n
        PDF = (B*nu**(B-1))/(A*SI**-n)**B
        R = np.exp(-(nu/(A*SI**-n))**B)
        logLik = (np.log ((PDF**F)*R))
        summ1 += logLik
    return(summ1)

with pm.Model() as model_ss1:
    MuB = pm.Uniform('MuB', lower=1, upper=3)
    SigmaB= pm.HalfNormal("SigmaB", 2/3)
    B = pm.Normal('B', mu=MuB, sigma=SigmaB)
    MuA = pm.Uniform('MuA', lower=400, upper=2000)
    SigmaA= pm.HalfNormal("SigmaA", 400)
    A = pm.Normal('A', mu=MuA, sigma=SigmaA)
    Mun = pm.Uniform('Mun', lower=0.2, upper=0.8)
    Sigman= pm.HalfNormal("Sigman", 0.16)
    n = pm.Normal('n', mu=Mun, sigma=Sigman)
    y = pm.DensityDist('y', logp, observed={ 'SI': SI,'SIs': SIs.values.astype(int), 'ns1': ns1.values.astype(int)})
    trace_ss1 = pm.sample(1000, tune=1000, chains = 2)

Bi = pm.summary(trace_ss1, var_names=['B'])['mean'][0]
Ai = pm.summary(trace_ss1, var_names=['A'])['mean'][0]
ni = pm.summary(trace_ss1, var_names=['n'])['mean'][0]
az.plot_trace(trace_ss1, var_names=['B','A','n'])
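Read term by term, the logp function above appears to encode the following log-likelihood. This is a transcription of the code, not a formula stated in the question: the symbols \nu_i, \alpha and F_i are shorthand introduced here for nu, A*SI**-n and DF1['failure'][i].

```latex
\nu_i = \mathrm{ns1}_i \left( \frac{\mathrm{SIs}_i}{\mathrm{SI}} \right)^{n},
\qquad
\alpha = A \,\mathrm{SI}^{-n},
\qquad
\log L(A, B, n) = \sum_{i}
  \left[ F_i \,\log\!\left( \frac{B\,\nu_i^{\,B-1}}{\alpha^{B}} \right)
         - \left( \frac{\nu_i}{\alpha} \right)^{B} \right]
```

The second term follows from \log R_i = -(\nu_i/\alpha)^B, so each term of the sum can be evaluated directly in log space instead of forming PDF**F * R and then taking the logarithm.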

Kernel Density Estimation using scipy's gaussian_kde and sklearn's KernelDensity leads to different results

I created some data from two superposed normal distributions and then applied sklearn.neighbors.KernelDensity and scipy.stats.gaussian_kde to estimate the density function. However, with the same bandwidth (1.0) and the same kernel, the two methods produce different results. Can someone explain the reason for this? Thanks for your help.
The code below reproduces the issue:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
import seaborn as sns
from sklearn.neighbors import KernelDensity
n = 10000
dist_frac = 0.1
x1 = np.random.normal(-5,2,int(n*dist_frac))
x2 = np.random.normal(5,3,int(n*(1-dist_frac)))
x = np.concatenate((x1,x2))
np.random.shuffle(x)
eval_points = np.linspace(np.min(x), np.max(x))
kde_sk = KernelDensity(bandwidth=1.0, kernel='gaussian')
kde_sk.fit(x.reshape([-1,1]))
y_sk = np.exp(kde_sk.score_samples(eval_points.reshape(-1,1)))
kde_sp = gaussian_kde(x, bw_method=1.0)
y_sp = kde_sp.pdf(eval_points)
sns.kdeplot(x)
plt.plot(eval_points, y_sk)
plt.plot(eval_points, y_sp)
plt.legend(['seaborn','scikit','scipy'])
If I change the scipy bandwidth to 0.25, the results of both methods look approximately the same.
The bandwidth in scipy.stats.gaussian_kde and the bandwidth in sklearn.neighbors.KernelDensity do not mean the same thing. scipy.stats.gaussian_kde uses a bandwidth factor (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html). For a 1-D kernel density estimate the relation is:
bandwidth of sklearn.neighbors.KernelDensity = bandwidth factor of scipy.stats.gaussian_kde * standard deviation of the sample
For your data this probably means that the sample standard deviation is about 4, since 1.0 = 0.25 * 4.
For more information, see the question "Getting bandwidth used by SciPy's gaussian_kde function".
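The relation can be checked numerically with a short sketch. To keep it dependent on NumPy and SciPy only, the hypothetical helper `fixed_bandwidth_kde` below stands in for sklearn's KernelDensity: it is a plain Gaussian KDE whose bandwidth is the absolute kernel standard deviation, which is the convention sklearn uses for the Gaussian kernel. The sample sizes and the seed are arbitrary choices.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = np.concatenate((rng.normal(-5, 2, 1000), rng.normal(5, 3, 9000)))
grid = np.linspace(x.min(), x.max(), 50)

h = 1.0  # absolute bandwidth: the standard deviation of each Gaussian kernel

def fixed_bandwidth_kde(points, samples, bandwidth):
    """Plain Gaussian KDE with an absolute bandwidth (hypothetical stand-in
    for KernelDensity(bandwidth=..., kernel='gaussian'))."""
    z = (points[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm

y_abs = fixed_bandwidth_kde(grid, x, h)

# scipy multiplies a scalar bw_method by the sample standard deviation
# (computed with ddof=1), so divide h by it to get the same kernel width.
factor = h / x.std(ddof=1)
y_sp = gaussian_kde(x, bw_method=factor).pdf(grid)

print("scipy factor:", factor)
print("max abs difference:", np.max(np.abs(y_abs - y_sp)))
```

With data like that in the question, the factor comes out close to 0.25, which matches the observation that a scipy bandwidth of 0.25 reproduces the sklearn result.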
To be honest, I don't know why, but setting the scipy hyperparameter bw_method='scott' makes it give exactly the same result as seaborn.
So it seems to be all about the hyperparameters. We could find out why by understanding them in depth, but in the meantime just use 'scott' or 'silverman' instead of an arbitrary scalar.
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
import seaborn as sns
from sklearn.neighbors import KernelDensity
n = 10000
dist_frac = 0.1
x1 = np.random.normal(-5,2,int(n*dist_frac))
x2 = np.random.normal(5,3,int(n*(1-dist_frac)))
x = np.concatenate((x1,x2))
np.random.shuffle(x)
eval_points = np.linspace(np.min(x), np.max(x))
kde_sk = KernelDensity(bandwidth=1, kernel='gaussian')
kde_sk.fit(x.reshape([-1,1]))
y_sk = np.exp(kde_sk.score_samples(eval_points.reshape(-1,1)))
kde_sp = gaussian_kde(x, bw_method='scott') ### I MEAN HERE! ###
y_sp = kde_sp.pdf(eval_points)
sns.kdeplot(x)
plt.plot(eval_points, y_sk)
plt.plot(eval_points, y_sp)
plt.legend(['seaborn','scikit','scipy'])
Increase the size of the random normal samples; your data points are too few.
Try n=500000 and check the results.

Fitting multiple datasets with shared paramaters

I am trying to fit different data sets with different non-linear functions that share some parameters. It looks something like this:
import matplotlib
from matplotlib import pyplot as plt
from scipy import optimize
import numpy as np

# Some non-linear functions
def Sigma1x(x,C11,C111,C1111,C11111):
    return C11*x+1/2*C111*pow(x,2)+1/6*C1111*pow(x,3)+1/24*C11111*pow(x,4)

def Sigma2x(x,C12,C112,C1112,C11112):
    return C12*x+1/2*C112*pow(x,2)+1/6*C1112*pow(x,3)+1/24*C11112*pow(x,4)

def Sigma1y(y,C12,C111,C222,C112,C1111,C1112,C2222,C12222):
    return C12*y+1/2*(C111-C222+C112)*pow(y,2)+1/12*(C111+2*C1112-C2222)*pow(y,3)+1/24*C12222*pow(y,4)

# The original signature listed C222 twice and omitted C22222, which the body uses
def Sigma2y(y,C11,C222,C2222,C22222):
    return C11*y+1/2*C222*pow(y,2)+1/6*C2222*pow(y,3)+1/24*C22222*pow(y,4)

# The original signature spelled C12222 as C122222, which did not match the body
def Sigmaz(z,C11,C12,C111,C222,C112,C1111,C1112,C2222,C1122,C11111,C11112,C12222,C11122,C22222):
    return (C11+C12)*z+1/2*(2*C111-C222+3*C112)*pow(z,2)+1/6*(3/2*C1111+4*C1112-1/2*C222+3*C1122)*pow(z,3)+\
        1/24*(3*C11111+10*C11112-5*C12222+10*C11122-2*C22222)*pow(z,4)

# Experimental datasets
Xdata=np.loadtxt('x-direction.txt') # contains the x axis and two other datasets, to be fitted with Sigma1x and Sigma2x
Ydata=np.loadtxt('y-direction.txt') # contains the y axis and two other datasets, to be fitted with Sigma1y and Sigma2y
Zdata=np.loadtxt('z-direction.txt') # contains the z axis and one dataset, to be fitted with Sigmaz
The question is how to use optimize.leastsq or another package to fit the data with the appropriate functions, given that they share multiple parameters.
I was able to solve the initial question (partially). I found symfit to be very comprehensive and easy to use, so I wrote the following code:
import matplotlib.pyplot as plt
from symfit import *
import numpy as np
from symfit.core.minimizers import DifferentialEvolution, BFGS
Y_strain = np.genfromtxt('Y_strain.csv', delimiter=',')
X_strain=np.genfromtxt('X_strain.csv', delimiter=',')
xmax=max(X_strain[:,0])
xmin=min(X_strain[:,0])
xdata = np.linspace(xmin, xmax, 50)
ymax=max(Y_strain[:,0])
ymin=min(Y_strain[:,0])
ydata=np.linspace(ymin, ymax, 50)
x,y,Sigma1x,Sigma2x,Sigma1y,Sigma2y= variables('x,y,Sigma1x,Sigma2x,Sigma1y,Sigma2y')
C11,C111,C1111,C11111,C12,C112,C1112,C11112,C222,C2222,C12222,C22222 = parameters('C11,C111,C1111,C11111,C12,C112,C1112,C11112,C222,C2222,C12222,C22222')
model =Model({
Sigma1x:C11*x+1/2*C111*pow(x,2)+1/6*C1111*pow(x,3)+1/24*C11111*pow(x,4),
Sigma2x:C12*x+1/2*C112*pow(x,2)+1/6*C1112*pow(x,3)+1/24*C11112*pow(x,4),
#Sigma1y:C12*y+1/2*(C111-C222+C112)*pow(y,2)+1/12*(C111+2*C1112-C2222)*pow(y,3)+1/24*C12222*pow(y,4),
#Sigma2y:C11*y+1/2*C222*pow(y,2)+1/6*C2222*pow(y,3)+1/24*C22222*pow(y,4),
})
fit = Fit(model, x=X_strain[:,0], Sigma1x=X_strain[:,1],Sigma2x=X_strain[:,2])
fit_result = fit.execute()
print(fit_result)
plt.scatter(Y_strain[:,0],Y_strain[:,2])
plt.scatter(Y_strain[:,0],Y_strain[:,1])
plt.plot(xdata, model(x=xdata, **fit_result.params).Sigma1x)
plt.plot(xdata, model(x=xdata, **fit_result.params).Sigma2x)
However, the resulting fit is very bad:
Parameter   Value           Standard Deviation
C11         1.203919e+02    3.988977e+00
C111        -6.541505e+02   5.643111e+01
C1111       1.520749e+03    3.713742e+02
C11111      -7.824107e+02   1.015887e+03
C11112      4.451211e+03    1.015887e+03
C1112       -1.435071e+03   3.713742e+02
C112        9.207923e+01    5.643111e+01
C12         3.272248e+01    3.988977e+00
Status message         Desired error not necessarily achieved due to precision loss.
Number of iterations   59
Objective              <symfit.core.objectives.LeastSquares object at 0x000001CC00C0A508>
Minimizer              <symfit.core.minimizers.BFGS object at 0x000001CC7F84A548>
Goodness of fit qualifiers:
chi_squared            6.230510793023184
objective_value        3.115255396511592
r_squared              0.991979767376565
Any ideas on how to improve the fit?
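For the scipy route asked about in the original question, a common pattern is to put all parameters in one vector and concatenate the residuals of every data set, so that a shared parameter is constrained by all of them at once. The sketch below uses scipy.optimize.least_squares on synthetic data; the two truncated models `sigma_a` and `sigma_b`, the "true" values and the noise level are made up for illustration and are not the strain data from the question.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
x = np.linspace(0.0, 0.2, 40)
y = np.linspace(0.0, 0.2, 40)

# Two toy models that share C11 (hypothetical, truncated to second order).
def sigma_a(x, C11, C111):
    return C11 * x + 0.5 * C111 * x**2

def sigma_b(y, C11, C222):
    return C11 * y + 0.5 * C222 * y**2

# Made-up "true" parameters used only to generate synthetic data.
true = dict(C11=120.0, C111=-650.0, C222=-400.0)
data_a = sigma_a(x, true["C11"], true["C111"]) + rng.normal(0, 0.02, x.size)
data_b = sigma_b(y, true["C11"], true["C222"]) + rng.normal(0, 0.02, y.size)

def residuals(p):
    """Stack the residuals of both data sets; C11 appears in both halves."""
    C11, C111, C222 = p
    return np.concatenate((sigma_a(x, C11, C111) - data_a,
                           sigma_b(y, C11, C222) - data_b))

fit = least_squares(residuals, x0=[1.0, 1.0, 1.0])
print(dict(zip(("C11", "C111", "C222"), fit.x)))
```

The same idea extends to more data sets: each one contributes its own block of residuals, and the parameter vector lists every distinct constant exactly once.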

Is there a better way to fit a beta prime distribution to data than using SciPy?

I am trying to fit a beta prime distribution to my data using Python. Since scipy.stats.betaprime.fit exists, I tried this:
import numpy as np
import math
import scipy.stats as sts
import matplotlib.pyplot as plt
N = 5000
nb_bin = 100
a = 12; b = 106; scale = 36; loc = -a/(b-1)*scale
y = sts.betaprime.rvs(a,b,loc,scale,N)
a_hat,b_hat,loc_hat,scale_hat = sts.betaprime.fit(y)
print('Estimated parameters: \n a=%.2f, b=%.2f, loc=%.2f, scale=%.2f'%(a_hat,b_hat,loc_hat,scale_hat))
plt.figure()
count, bins, ignored = plt.hist(y, nb_bin, density=True)  # density= replaces the removed normed= argument
pdf_ini = sts.betaprime.pdf(bins,a,b,loc,scale)
pdf_est = sts.betaprime.pdf(bins,a_hat,b_hat,loc_hat,scale_hat)
plt.plot(bins,pdf_ini,'g',linewidth=2.0,label='ini');plt.grid()
plt.plot(bins,pdf_est,'y',linewidth=2.0,label='est');plt.legend();plt.show()
It gives me this result:
Estimated parameters:
a=9935.34, b=10846.64, loc=-90.63, scale=98.93
which is quite different from the original values, and so is the fitted PDF in the figure.
If I pass the true values of loc and scale to the fit function, the estimate is better. Has anyone worked on this already or found a better solution?
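Passing known values of loc and scale to the fit can be done with the floc and fscale keyword arguments that scipy's fit method accepts. A minimal sketch, reusing the parameters from the question (the random seed is an arbitrary choice):

```python
import numpy as np
import scipy.stats as sts

a, b, scale = 12, 106, 36
loc = -a / (b - 1) * scale
y = sts.betaprime.rvs(a, b, loc=loc, scale=scale, size=5000, random_state=0)

# floc / fscale hold loc and scale fixed at the given values, so only the
# shape parameters a and b are optimised.
a_hat, b_hat, loc_hat, scale_hat = sts.betaprime.fit(y, floc=loc, fscale=scale)
print("a=%.2f, b=%.2f, loc=%.2f, scale=%.2f" % (a_hat, b_hat, loc_hat, scale_hat))
```

Fixing loc and scale removes the trade-off between the shape parameters and the location/scale pair, which is one plausible reason the unconstrained fit in the question ends up with very large a and b. If only one of the two is known, it can be fixed on its own.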
