I generated a dataset from a known Weibull distribution:
Weibull(alpha = A·SI^-n, beta), where A = 1800, n = 0.5, beta = 1.5, and SI = 1000. Here is the link to the dataset (DF1).
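For reference, a minimal sketch of how a dataset like this could be generated, assuming alpha is the Weibull scale and beta the shape (scipy's weibull_min parameterization); the sample size here is arbitrary:

import numpy as np
from scipy import stats

A, n, beta, SI = 1800, 0.5, 1.5, 1000
alpha = A * SI**-n  # scale = 1800 / sqrt(1000) ≈ 56.9
data = stats.weibull_min.rvs(beta, scale=alpha, size=500)  # hypothetical size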
I tried to estimate the parameters of the distribution in a Bayesian analysis using PyMC3; my code is below. The Bayesian estimates are very good when the dataset is small (100 data points) but drift away from the true values when the dataset is larger (500 data points).
For the larger dataset I tried to get better estimates by increasing the number of samples to 10000, tune to 10000, and target_accept to 0.99, but the estimates did not change significantly and were still far from the true values. Does anyone know how to set the parameters of pm.sample() to get better estimates for the larger dataset?
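For the larger dataset, the sampler call I tried looked roughly like this (draws, tune, and target_accept are standard pm.sample() arguments in PyMC3; the values are the ones mentioned above):

trace_ss1 = pm.sample(draws=10000, tune=10000, target_accept=0.99, chains=2)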
import pandas as pd
import arviz as az
import numpy as np
import pymc3 as pm
SI = 1000
# DF1 is the dataset loaded from the linked file (loading code not shown)
ns1 = round(DF1['biased drops'], 2)
SIs = round(DF1['SIs'], 2)

def logp(SIs, ns1, SI):
    # Sum the Weibull log-likelihood contributions over all observations
    summ1 = 0
    for i in range(0, len(DF1)):
        F = DF1['failure'][i]
        nu = (ns1[i]) * (SIs[i] / SI)**n
        PDF = (B * nu**(B - 1)) / (A * SI**-n)**B
        R = np.exp(-(nu / (A * SI**-n))**B)
        logLik = np.log((PDF**F) * R)
        summ1 += logLik
    return summ1
with pm.Model() as model_ss1:
    # Hierarchical priors for the shape (B), scale factor (A) and exponent (n)
    MuB = pm.Uniform('MuB', lower=1, upper=3)
    SigmaB = pm.HalfNormal("SigmaB", 2/3)
    B = pm.Normal('B', mu=MuB, sigma=SigmaB)
    MuA = pm.Uniform('MuA', lower=400, upper=2000)
    SigmaA = pm.HalfNormal("SigmaA", 400)
    A = pm.Normal('A', mu=MuA, sigma=SigmaA)
    Mun = pm.Uniform('Mun', lower=0.2, upper=0.8)
    Sigman = pm.HalfNormal("Sigman", 0.16)
    n = pm.Normal('n', mu=Mun, sigma=Sigman)
    # Custom likelihood via DensityDist, with the data passed as observed
    y = pm.DensityDist('y', logp, observed={'SI': SI, 'SIs': SIs.values.astype(int), 'ns1': ns1.values.astype(int)})
    trace_ss1 = pm.sample(1000, tune=1000, chains=2)

Bi = pm.summary(trace_ss1, var_names=['B'])['mean'][0]
Ai = pm.summary(trace_ss1, var_names=['A'])['mean'][0]
ni = pm.summary(trace_ss1, var_names=['n'])['mean'][0]
az.plot_trace(trace_ss1, var_names=['B', 'A', 'n'])
I created some data from two superposed normal distributions and then applied sklearn.neighbors.KernelDensity and scipy.stats.gaussian_kde to estimate the density function. However, using the same bandwidth (1.0) and the same kernel, the two methods produce different outcomes. Can someone explain the reason for this? Thanks for the help.
Below you can find the code to reproduce the issue:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
import seaborn as sns
from sklearn.neighbors import KernelDensity
n = 10000
dist_frac = 0.1
x1 = np.random.normal(-5,2,int(n*dist_frac))
x2 = np.random.normal(5,3,int(n*(1-dist_frac)))
x = np.concatenate((x1,x2))
np.random.shuffle(x)
eval_points = np.linspace(np.min(x), np.max(x))
kde_sk = KernelDensity(bandwidth=1.0, kernel='gaussian')
kde_sk.fit(x.reshape([-1,1]))
y_sk = np.exp(kde_sk.score_samples(eval_points.reshape(-1,1)))
kde_sp = gaussian_kde(x, bw_method=1.0)
y_sp = kde_sp.pdf(eval_points)
sns.kdeplot(x)
plt.plot(eval_points, y_sk)
plt.plot(eval_points, y_sp)
plt.legend(['seaborn','scikit','scipy'])
If I change the scipy bandwidth to 0.25, the results of the two methods look approximately the same.
What is meant by "bandwidth" in scipy.stats.gaussian_kde and sklearn.neighbors.KernelDensity is not the same. scipy.stats.gaussian_kde uses a bandwidth factor (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html). For a 1-D kernel density estimation the following relation applies:

bandwidth of sklearn.neighbors.KernelDensity = bandwidth factor of scipy.stats.gaussian_kde × standard deviation of the sample

For your estimation this probably means that the standard deviation of your sample is about 4 (0.25 × 4 ≈ 1.0).
I would refer you to Getting bandwidth used by SciPy's gaussian_kde function for more information.
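A quick way to verify this relation on the data above (a sketch; gaussian_kde exposes the factor as its .factor attribute):

kde_sp = gaussian_kde(x, bw_method=0.25)
# scipy kernel width = factor * sample standard deviation;
# this should come out near sklearn's bandwidth of 1.0 for this data
print(kde_sp.factor * x.std(ddof=1))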
To be honest, I don't know why, but using the scipy hyperparameter bw_method='scott' makes it work exactly the same as seaborn.
So, it seems to be all about the hyperparameters. We could find out why by understanding them in depth, but in the meantime just use 'scott' or 'silverman' instead of a raw scalar. (For scipy, Scott's rule sets the bandwidth factor to n**(-1./(d+4)).)
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
import seaborn as sns
from sklearn.neighbors import KernelDensity
n = 10000
dist_frac = 0.1
x1 = np.random.normal(-5,2,int(n*dist_frac))
x2 = np.random.normal(5,3,int(n*(1-dist_frac)))
x = np.concatenate((x1,x2))
np.random.shuffle(x)
eval_points = np.linspace(np.min(x), np.max(x))
kde_sk = KernelDensity(bandwidth=1, kernel='gaussian')
kde_sk.fit(x.reshape([-1,1]))
y_sk = np.exp(kde_sk.score_samples(eval_points.reshape(-1,1)))
kde_sp = gaussian_kde(x, bw_method='scott') ### I MEAN HERE! ###
y_sp = kde_sp.pdf(eval_points)
sns.kdeplot(x)
plt.plot(eval_points, y_sk)
plt.plot(eval_points, y_sp)
plt.legend(['seaborn','scikit','scipy'])
Increase the size of the 'random normal' sample; your data points are too few.
Try with n = 500000 and check the results.
In this code I have produced a dataset using a Gaussian distribution and then tried to apply stochastic gradient descent.
In each iteration I update the theta array, but it is not getting updated: it remains zero after every iteration, even though the gradient is non-zero.
Please help me.
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import math
import random
import matplotlib.pyplot as plt

# Generating random samples
theta = np.array([3, 1, 2])
X = []
E = []
Y = []
a1 = 3
v1 = 4
a2 = -1
v2 = 4
v3 = 2
for i in range(0, 1000000):
    x1 = (1 / math.sqrt(2 * 3.14 * v1)) * math.exp(-(random.random() - a1)**2 / (2 * v1))
    x2 = (1 / math.sqrt(2 * 3.14 * v2)) * math.exp(-(random.random() - a2)**2 / (2 * v2))
    X.append([x1, x2])
    e = (1 / math.sqrt(2 * 3.14 * v3)) * math.exp(-(random.random())**2 / (2 * v3))
    y = theta[0] + theta[1] * x1 + theta[2] * x2 + e
    Y.append(y)
    E.append(e)

# Now applying stochastic gradient descent (batch size = 1)
r = 1
learning_rate = 0.001
theta = np.array([0, 0, 0])
theta = theta.reshape(3, 1)
X = pd.DataFrame(X, columns=['X1', 'X2'])
Y = pd.DataFrame(Y, columns=['Y'])
Y.head()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=40, random_state=50)
X_train['X0'] = np.ones(len(X_train))
y_train.head()

def gradient_descent(x, y, theta, lr):
    m = len(y)
    prediction = (x.dot(theta)).to_numpy()
    gradient = prediction - y.to_numpy()
    current_cost = (1 / (2 * m)) * np.sum(np.square(prediction - y.to_numpy()))
    return gradient, current_cost

n_iterations = 1000
theta_history = []
cost_history = []
for i in range(0, n_iterations):
    # Note: xi and yi are sampled independently, so their rows do not correspond
    xi = X_train.sample(r)
    yi = y_train.sample(r)
    m = len(xi)
    gradient, current_cost = gradient_descent(xi, yi, theta, learning_rate)
    theta[0] = theta[0] - learning_rate * ((1 / m) * np.sum(np.multiply(gradient, xi['X1'].to_numpy().reshape(m, 1))))
    theta[1] = theta[1] - learning_rate * ((1 / m) * np.sum(np.multiply(gradient, xi['X2'].to_numpy().reshape(m, 1))))
    theta[2] = theta[2] - learning_rate * ((1 / m) * np.sum(np.multiply(gradient, xi['X0'].to_numpy().reshape(m, 1))))
    print("theta=", theta)
    theta_history.append(theta)
    cost_history.append(current_cost)
    if prev_index >= len(X_train):  # prev_index is not defined in this snippet
        break
You reassigned theta before the gradient-descent loop with theta = np.array([0, 0, 0]). That creates an integer array, so each update theta[i] = theta[i] - learning_rate * (...) is cast back to int and truncated to 0 whenever the step is smaller than 1. Make it a float array, e.g. theta = np.zeros((3, 1)), and theta will update.
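A minimal standalone demonstration of the integer-dtype truncation (not the question's exact variables):

import numpy as np

theta = np.array([0, 0, 0])        # dtype is int64
theta[0] = theta[0] - 0.001 * 7.3  # float update is cast back to int
print(theta)                       # [0 0 0] -- the update vanished

theta = np.zeros(3)                # dtype is float64
theta[0] = theta[0] - 0.001 * 7.3
print(theta)                       # [-0.0073  0.  0.] -- the update sticks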
Today I'm working on a dataset from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. I would like to segment my dataset by beds, baths, and neighborhood, and use DBSCAN to cluster by price within each segment. The problem is that because each segment is different, I don't want to use the same epsilon for the whole dataset; instead I want the best epsilon for each segment. Do you know an efficient way to do this?
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
sklearn.utils.check_random_state(1000)
Clus_dataSet = pdf[['beds','baths','neighborhood','price']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=6).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
pdf["Clus_Db"]=labels
realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))
Thank you.
A heuristic for setting the Epsilon and MinPts parameters was proposed in the original DBSCAN paper.
Once the MinPts value is set (e.g. 2 * number of features), the partitioning result strongly depends on Epsilon. The heuristic suggests inferring Epsilon through a visual analysis of the k-dist plot.
A toy example of the procedure with two Gaussian distributions follows.
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
from sklearn.datasets import make_biclusters
data,lab,_ = make_biclusters((200,2), 2, noise=0.1, minval=0, maxval=1)
minpts = 4
nbrs = NearestNeighbors(n_neighbors=minpts, algorithm='ball_tree').fit(data)
distances, indices = nbrs.kneighbors(data)
k_dist = [x[-1] for x in distances]
f,ax = plt.subplots(1,2,figsize = (10,5))
ax[0].set_title('k-dist plot for k = minpts = 4')
ax[0].plot(sorted(k_dist))
ax[0].set_xlabel('object index after sorting by k-distance')
ax[0].set_ylabel('k-distance')
ax[1].set_title('original data')
ax[1].scatter(data[:,0],data[:,1],c = lab[0])
In the resulting k-dist plot, the "elbow" theoretically separates noise objects from cluster objects, and indeed gives an indication of a plausible range of values for Epsilon (tailored to the dataset in combination with the selected value of MinPts). In this toy example, I would say between 0.05 and 0.075.
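To address the per-segment question directly: one could automate a crude version of this heuristic per segment, for instance by taking the largest jump in the sorted k-distance curve as the elbow. A sketch under those assumptions (pdf and the beds/baths/neighborhood/price columns come from the question; estimate_eps and the elbow rule are hypothetical, and a visual check per segment is still advisable):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def estimate_eps(points, minpts):
    # k-dist curve: distance to the minpts-th nearest neighbour, sorted
    nbrs = NearestNeighbors(n_neighbors=minpts).fit(points)
    distances, _ = nbrs.kneighbors(points)
    k_dist = np.sort(distances[:, -1])
    # crude elbow: the k-distance just before the largest jump
    return k_dist[np.argmax(np.diff(k_dist))]

minpts = 4
for key, seg in pdf.groupby(['beds', 'baths', 'neighborhood']):
    if len(seg) <= minpts:
        continue  # too few points for a meaningful k-dist plot
    prices = seg[['price']].to_numpy()
    eps = estimate_eps(prices, minpts)
    if eps <= 0:
        continue  # degenerate segment (e.g. identical prices)
    pdf.loc[seg.index, 'Clus_Db'] = DBSCAN(eps=eps, min_samples=minpts).fit_predict(prices)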
I have a 1-D data set, with only one independent column. I would like to fit a model to it in order to sample from that model. The raw data:
Data set
I tried various theoretical distributions from the Fitter package (https://pypi.org/project/fitter/), but none of them fit well. Then I tried kernel density estimation using sklearn. It is good, but I could not prevent negative values due to the way it works. Finally, I tried a log-normal, but it is not really perfect either.
Code for the log-normal is here:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

NN = 3915  # sample the same number of points as the original data set
df = pd.read_excel(r'Data_sets2.xlsx', sheet_name="Set1")
eps = 0.1  # additional term for c

"""
Estimate parameters of log(c) as a normal distribution
"""
df["c"] = df["c"] + eps
mu = np.mean(np.log(df["c"]))
s = np.std(np.log(df["c"]))
print("Mean:", mu, "std:", s)

def simulate(N):
    # Draw N log-normal samples by exponentiating normal draws
    c = []
    for i in range(N):
        c_s = np.exp(np.random.normal(loc=mu, scale=s, size=1)[0])
        c.append(round(c_s))
    return c

predicted_c = simulate(NN)
XX = np.arange(3915)

### plot c relation ###
plt.scatter(XX, df["c"], color='g', label="Original data")
plt.scatter(XX, predicted_c, color='r', label="Sample data")
plt.xlabel('Index')
plt.ylabel('c')
plt.legend()
plt.show()
(figure: original data vs. sampled data)
What I am looking for is how to improve the fit; any suggestions or pointers to models that may fit my data with better accuracy are appreciated. Thanks.
Here is a graphical Python fitter for the scipy statistical distribution Double Gamma using your spreadsheet data; I hope this might be of some use, as a normal distribution seems to be a poor fit to this data set. The scipy documentation for dgamma is at https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.dgamma.html; incidentally, the double Weibull distribution fit almost as well.
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_excel(r'Data_sets2.xlsx', sheet_name="Set1")
eps = 0.1  # additional term for c
data = df["c"] + eps

# Fit a double gamma distribution and overlay its PDF on a histogram
P = ss.dgamma.fit(data)
rX = np.linspace(min(data), max(data), 50)
rP = ss.dgamma.pdf(rX, *P)

plt.hist(data, bins=25, density=True, color='slategrey')  # density=True replaces the removed normed=True
plt.plot(rX, rP, color='darkturquoise')
plt.show()
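Since the original goal was to sample from the fitted model, a short follow-up sketch (P is the fitted parameter tuple from the code above; rvs is the standard scipy.stats sampling call):

samples = ss.dgamma.rvs(*P, size=len(data))
# Optional: clip at zero if negative values are not meaningful for your data
samples = np.clip(samples, 0, None)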
I was trying to fit a beta prime distribution to my data using Python. Since there is scipy.stats.betaprime.fit, I tried this:
import numpy as np
import scipy.stats as sts
import matplotlib.pyplot as plt

N = 5000
nb_bin = 100
a = 12; b = 106; scale = 36; loc = -a/(b-1)*scale

# Draw samples from a known beta prime distribution, then fit it back
y = sts.betaprime.rvs(a, b, loc, scale, N)
a_hat, b_hat, loc_hat, scale_hat = sts.betaprime.fit(y)
print('Estimated parameters: \n a=%.2f, b=%.2f, loc=%.2f, scale=%.2f' % (a_hat, b_hat, loc_hat, scale_hat))

plt.figure()
count, bins, ignored = plt.hist(y, nb_bin, density=True)  # density=True replaces the removed normed=True
pdf_ini = sts.betaprime.pdf(bins, a, b, loc, scale)
pdf_est = sts.betaprime.pdf(bins, a_hat, b_hat, loc_hat, scale_hat)
plt.plot(bins, pdf_ini, 'g', linewidth=2.0, label='ini')
plt.grid()
plt.plot(bins, pdf_est, 'y', linewidth=2.0, label='est')
plt.legend()
plt.show()
It gives this result:
Estimated parameters:
a=9935.34, b=10846.64, loc=-90.63, scale=98.93
which is quite different from the original parameters, as the plot of the PDFs also shows.
If I fix loc and scale to their true values when calling the fit function, the estimates are much better. Has anyone already worked on this, or found a better solution?
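For what it's worth, scipy's fit accepts keyword arguments to fix parameters during estimation (floc and fscale are part of the standard rv_continuous.fit API); a sketch of fixing loc and scale to their true values:

a_hat, b_hat, loc_hat, scale_hat = sts.betaprime.fit(y, floc=loc, fscale=scale)
# Only a and b are estimated now; loc_hat == loc and scale_hat == scale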