In this code I have produced a dataset using a Gaussian distribution and then tried to apply stochastic gradient descent. In each iteration I update the theta array, but it is not getting updated: it remains zero after every iteration, even though the gradient is non-zero. Help me please.
import math
import random

import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# generating random samples
theta = np.array([3, 1, 2])
X = []
E = []
Y = []
a1 = 3
v1 = 4
a2 = -1
v2 = 4
v3 = 2
for i in range(0, 1000000):
    x1 = (1 / math.sqrt(2 * 3.14 * v1)) * math.exp(-(random.random() - a1) ** 2 / (2 * v1))
    x2 = (1 / math.sqrt(2 * 3.14 * v2)) * math.exp(-(random.random() - a2) ** 2 / (2 * v2))
    X.append([x1, x2])
    e = (1 / math.sqrt(2 * 3.14 * v3)) * math.exp(-(random.random()) ** 2 / (2 * v3))
    y = theta[0] + theta[1] * x1 + theta[2] * x2 + e
    Y.append(y)
    E.append(e)
# Now applying stochastic gradient descent
# Batch_Size = 1
r = 1
learning_rate = 0.001
theta = np.array([0, 0, 0])
theta = theta.reshape(3, 1)
X = pd.DataFrame(X, columns=['X1', 'X2'])
Y = pd.DataFrame(Y, columns=['Y'])
Y.head()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=40, random_state=50)
X_train['X0'] = np.ones(len(X_train))
y_train.head()
def gradient_descent(x, y, theta, lr):
    m = len(y)
    prediction = (x.dot(theta)).to_numpy()
    gradient = prediction - y.to_numpy()  # the residual; per-parameter gradient is residual * feature
    current_cost = (1 / (2 * m)) * np.sum(np.square(prediction - y.to_numpy()))
    return gradient, current_cost
n_iterations = 1000
theta_history = []
cost_history = []
for i in range(0, n_iterations):
    # sample the same rows from X and y so the (x, y) pairs stay matched
    idx = X_train.sample(r).index
    xi = X_train.loc[idx]
    yi = y_train.loc[idx]
    m = len(xi)
    gradient, current_cost = gradient_descent(xi, yi, theta, learning_rate)
    # theta is aligned with the column order [X1, X2, X0], so theta[2] is the intercept
    theta[0] = theta[0] - learning_rate * ((1 / m) * np.sum(np.multiply(gradient, xi['X1'].to_numpy().reshape(m, 1))))
    theta[1] = theta[1] - learning_rate * ((1 / m) * np.sum(np.multiply(gradient, xi['X2'].to_numpy().reshape(m, 1))))
    theta[2] = theta[2] - learning_rate * ((1 / m) * np.sum(np.multiply(gradient, xi['X0'].to_numpy().reshape(m, 1))))
    print("theta=", theta)
    theta_history.append(theta)
    cost_history.append(current_cost)
You reassigned theta to an integer array of zeros: theta = np.array([0, 0, 0]). Because that array has an integer dtype, every in-place update theta[i] = theta[i] - learning_rate * ... is cast back to an integer, so the small steps are truncated to 0 and theta never moves.
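The root cause in miniature (a sketch; this is standard NumPy casting behavior):

import numpy as np

theta = np.array([0, 0, 0])
print(theta.dtype)                 # int64: in-place updates are cast back to int
theta[0] = theta[0] - 0.001 * 5.0
print(theta)                       # still [0 0 0]: -0.005 was truncated to 0

theta = np.zeros((3, 1))           # float64 by default, so small steps accumulate
theta[0] = theta[0] - 0.001 * 5.0
print(theta[0])                    # [-0.005]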
I'm trying to do a Gaussian fit for some experimental data, but I keep running into error after error. I've followed a few different threads online, but either the fit isn't good (it's just a horizontal line) or the code just won't run. I'm following this code from another thread. Below is my code.
I apologize if my code seems a bit messy. There are some bits from other attempts when I tried making it work, hence the "astropy" import.
import math as m
import matplotlib.pyplot as plt
import numpy as np
from scipy import optimize as opt
import pandas as pd
import statistics as stats
from astropy import modeling

def gaus(x, a, x0, sigma, offset):
    return a * m.exp(-(x - x0) ** 2 / (2 * sigma ** 2)) + offset

# Python program to get average of a list
def Average(lst):
    return sum(lst) / len(lst)

wavelengths = [391.719, 391.984, 392.248, 392.512, 392.777, 393.041, 393.306, 393.57, 393.835, 394.099, 391.719, 391.455, 391.19, 390.926, 390.661, 390.396]
intensities = [511.85, 1105.85, 1631.85, 1119.85, 213.85, 36.85, 10.85, 6.85, 13.85, 7.85, 511.85, 200.85, 80.85, 53.85, 14.85, 24.85]

n = sum(intensities)
mean = sum(wavelengths * intensities) / n
sigma = m.sqrt(sum(intensities * (wavelengths - mean) ** 2) / n)

def gaus(x, a, x0, sigma):
    return a * m.exp(-(x - x0) ** 2 / (2 * sigma ** 2))

popt, pcov = opt.curve_fit(gaus, wavelengths, intensities, p0=[1, mean, sigma])
print(popt)
plt.scatter(wavelengths, intensities)
plt.title("Helium Spectral Line Peak 1")
plt.xlabel("Wavelength (nm)")
plt.ylabel("Intensity (a.u.)")
plt.show()
Thanks to the kind user, my curve now seems to work reasonably well. However, one of the points seems to connect back to an earlier point (screenshot omitted).
There are two problems with your code. The first is that you are performing vector operations on plain Python lists, which gives you the first error, in the line mean = sum(wavelengths*intensities)/n; you should use np.array instead. The second is that you call math.exp on a Python list, which again throws an error because it expects a real number; use np.exp here instead.
The following code solves your problem:
import matplotlib.pyplot as plt
import numpy as np
from scipy import optimize as opt
wavelengths = [391.719, 391.984, 392.248, 392.512, 392.777, 393.041,
393.306, 393.57, 393.835, 394.099, 391.719, 391.455,
391.19, 390.926, 390.661, 390.396]
intensities = [511.85, 1105.85, 1631.85, 1119.85, 213.85, 36.85, 10.85, 6.85,
13.85, 7.85, 511.85, 200.85, 80.85, 53.85, 14.85, 24.85]
wavelengths_new = np.array(wavelengths)
intensities_new = np.array(intensities)
n=sum(intensities)
mean = sum(wavelengths_new*intensities_new)/n
sigma = np.sqrt(sum(intensities_new*(wavelengths_new-mean)**2)/n)
def gaus(x, a, x0, sigma):
    return a * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
popt,pcov = opt.curve_fit(gaus,wavelengths_new,intensities_new,p0=[1,mean,sigma])
print(popt)
plt.scatter(wavelengths_new, intensities_new, label="data")
plt.plot(wavelengths_new, gaus(wavelengths_new, *popt), label="fit")
plt.title("Helium Spectral Line Peak 1")
plt.xlabel("Wavelength (nm)")
plt.ylabel("Intensity (a.u.)")
plt.show()
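As for the follow-up about the curve connecting back on itself: plt.plot joins points in input order, and the wavelengths above are not sorted. A minimal fix (a sketch; x_fit is just an illustrative name) is to evaluate the fitted function on a sorted grid instead:

x_fit = np.linspace(wavelengths_new.min(), wavelengths_new.max(), 200)  # monotone grid
plt.plot(x_fit, gaus(x_fit, *popt), label="fit")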
I generated a dataset from a known Weibull distribution:
Weibull(alpha = A·SI^(-n), beta), where A = 1800, n = 0.5, beta = 1.5, and SI = 1000. Here is the link to the dataset (DF1).
I tried to estimate the parameters of the distribution in a Bayesian analysis using PyMC3; my code is below. The Bayesian estimates are very good when the dataset is small (100 data points), but they drift away from the true values when the dataset is larger (500 data points).
For the larger dataset I tried to get better estimates by increasing the number of samples to 10000, tune to 10000, and target_accept to 0.99, but the estimates did not change significantly and were still far from the true values. Does anyone know how to set the parameters of pm.sample() to get better estimates for the larger dataset?
import warnings
import pandas as pd
import arviz as az
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
import theano.tensor as tt
from pymc3 import Model, Normal, Slice, sample
from pymc3.distributions import Interpolated
from scipy import stats
SI=1000
ns1 = round(DF1['biased drops'], 2)  # DF1 is the linked dataset, loaded beforehand
SIs = round(DF1['SIs'], 2)

def logp(SIs, ns1, SI):
    # A, B, n below refer to the model's random variables, defined in the model block
    summ1 = 0
    for i in range(0, len(DF1)):
        F = DF1['failure'][i]
        nu = (ns1[i]) * (SIs[i] / SI) ** n
        PDF = (B * nu ** (B - 1)) / (A * SI ** -n) ** B
        R = np.exp(-(nu / (A * SI ** -n)) ** B)
        logLik = np.log((PDF ** F) * R)
        summ1 += logLik
    return summ1
with pm.Model() as model_ss1:
    MuB = pm.Uniform('MuB', lower=1, upper=3)
    SigmaB = pm.HalfNormal("SigmaB", 2/3)
    B = pm.Normal('B', mu=MuB, sigma=SigmaB)
    MuA = pm.Uniform('MuA', lower=400, upper=2000)
    SigmaA = pm.HalfNormal("SigmaA", 400)
    A = pm.Normal('A', mu=MuA, sigma=SigmaA)
    Mun = pm.Uniform('Mun', lower=0.2, upper=0.8)
    Sigman = pm.HalfNormal("Sigman", 0.16)
    n = pm.Normal('n', mu=Mun, sigma=Sigman)
    y = pm.DensityDist('y', logp, observed={'SI': SI, 'SIs': SIs.values.astype(int), 'ns1': ns1.values.astype(int)})
    trace_ss1 = pm.sample(1000, tune=1000, chains=2)

Bi = pm.summary(trace_ss1, var_names=['B'])['mean'][0]
Ai = pm.summary(trace_ss1, var_names=['A'])['mean'][0]
ni = pm.summary(trace_ss1, var_names=['n'])['mean'][0]
az.plot_trace(trace_ss1, var_names=['B', 'A', 'n'])
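For reference, the heavier sampler settings described in the question would be passed like this (a sketch; keyword names as in PyMC3 3.x, reusing the model context from above):

with model_ss1:
    # longer warm-up and a stricter acceptance target, as tried in the question
    trace_ss1 = pm.sample(draws=10000, tune=10000, chains=2, target_accept=0.99)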
I created some data from two superposed normal distributions and then applied sklearn.neighbors.KernelDensity and scipy.stats.gaussian_kde to estimate the density function. However, using the same bandwidth (1.0) and the same kernel, the two methods produce different outcomes. Can someone explain the reason for this? Thanks for the help.
Below you can find the code to reproduce the issue:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
import seaborn as sns
from sklearn.neighbors import KernelDensity
n = 10000
dist_frac = 0.1
x1 = np.random.normal(-5,2,int(n*dist_frac))
x2 = np.random.normal(5,3,int(n*(1-dist_frac)))
x = np.concatenate((x1,x2))
np.random.shuffle(x)
eval_points = np.linspace(np.min(x), np.max(x))
kde_sk = KernelDensity(bandwidth=1.0, kernel='gaussian')
kde_sk.fit(x.reshape([-1,1]))
y_sk = np.exp(kde_sk.score_samples(eval_points.reshape(-1,1)))
kde_sp = gaussian_kde(x, bw_method=1.0)
y_sp = kde_sp.pdf(eval_points)
sns.kdeplot(x)
plt.plot(eval_points, y_sk)
plt.plot(eval_points, y_sp)
plt.legend(['seaborn','scikit','scipy'])
If I change the scipy bandwidth to 0.25, the results of both methods look approximately the same.
What is meant by "bandwidth" in scipy.stats.gaussian_kde and sklearn.neighbors.KernelDensity is not the same. scipy.stats.gaussian_kde uses a bandwidth factor (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html). For a 1-D kernel density estimate the following relation holds:

bandwidth of sklearn.neighbors.KernelDensity = bandwidth factor of scipy.stats.gaussian_kde * standard deviation of the sample

For your estimate this means that the sample standard deviation is about 4 (1.0 = 0.25 * 4).
I would like to refer to Getting bandwidth used by SciPy's gaussian_kde function for more information.
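A minimal sketch of the conversion, reusing x and eval_points from the question's code: to reproduce sklearn's bandwidth=1.0 with scipy, divide by the sample standard deviation.

import numpy as np
from scipy.stats import gaussian_kde

# scipy's kernel std = bw_method factor * std(x), so rescale to match sklearn
kde_sp = gaussian_kde(x, bw_method=1.0 / np.std(x))
y_sp = kde_sp.pdf(eval_points)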
To be honest, I don't know why, but using the scipy hyperparameter bw_method='scott' makes it work exactly like seaborn.
So it seems to be all about the hyperparameters. We could find out why by understanding them in depth, but in the meantime just use 'scott' or 'silverman' instead of a random scalar.
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
import seaborn as sns
from sklearn.neighbors import KernelDensity
n = 10000
dist_frac = 0.1
x1 = np.random.normal(-5,2,int(n*dist_frac))
x2 = np.random.normal(5,3,int(n*(1-dist_frac)))
x = np.concatenate((x1,x2))
np.random.shuffle(x)
eval_points = np.linspace(np.min(x), np.max(x))
kde_sk = KernelDensity(bandwidth=1, kernel='gaussian')
kde_sk.fit(x.reshape([-1,1]))
y_sk = np.exp(kde_sk.score_samples(eval_points.reshape(-1,1)))
kde_sp = gaussian_kde(x, bw_method='scott') ### I MEAN HERE! ###
y_sp = kde_sp.pdf(eval_points)
sns.kdeplot(x)
plt.plot(eval_points, y_sk)
plt.plot(eval_points, y_sp)
plt.legend(['seaborn','scikit','scipy'])
Increase the size of the 'random normal' samples; your data points are too few. Try with n=500000 and check the results.
I have a 1-D data set with only one independent column. I would like to fit a model to it in order to sample from that model. The raw data:
Data set
I tried various theoretical distributions from the Fitter package (https://pypi.org/project/fitter/), but none of them works well. Then I tried kernel density estimation using sklearn. It is good, but I could not prevent negative values due to the way it works. Finally, I tried a log-normal fit, but it is not really perfect.
Code for the log-normal fit:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
import math
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

NN = 3915  # sample the same number of points as the original data set
df = pd.read_excel(r'Data_sets2.xlsx', sheet_name="Set1")
eps = 0.1  # additional term for c

"""
Estimate parameters of log(c) as a normal distribution
"""
df["c"] = df["c"] + eps
mu = np.mean(np.log(df["c"]))
s = np.std(np.log(df["c"]))
print("Mean:", mu, "std:", s)

def simulate(N):
    c = []
    for i in range(N):
        c_s = np.exp(np.random.normal(loc=mu, scale=s, size=1)[0])
        c.append(round(c_s))
    return c

predicted_c = simulate(NN)
XX = np.arange(NN)  # scipy.arange is deprecated; use numpy

### plot c relation ###
plt.scatter(XX, df["c"], color='g', label="Original data")
plt.scatter(XX, predicted_c, color='r', label="Sample data")
plt.xlabel('Index')
plt.ylabel('c')
plt.legend()
plt.show()
Original vs. samples
What I am looking for is how to improve the fit. Any suggestions or directions to models that may fit my data with better accuracy are appreciated. Thanks.
Here is a graphical Python fitter for the scipy statistical distribution Double Gamma using your spreadsheet data. I hope this might be of some use, as a normal distribution seems to be a poor fit to this data set. The scipy documentation for dgamma is at https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.dgamma.html; incidentally, the double Weibull distribution fit almost as well.
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_excel(r'Data_sets2.xlsx', sheet_name="Set1")
eps = 0.1  # additional term for c
data = df["c"] + eps

P = ss.dgamma.fit(data)  # returns (shape a, loc, scale)
rX = np.linspace(min(data), max(data), 50)
rP = ss.dgamma.pdf(rX, *P)

plt.hist(data, bins=25, density=True, color='slategrey')  # normed= was removed in newer Matplotlib
plt.plot(rX, rP, color='darkturquoise')
plt.show()
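Since the original goal was to sample from the fitted model, new values can be drawn from the fitted distribution directly (a sketch reusing P from above; 3915 matches the original sample size):

samples = ss.dgamma.rvs(*P, size=3915)  # draw new samples from the fitted double gamma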
In sklearn and numpy there are different ways to compute the first principal component. I obtain a different result with each method. Why?
import matplotlib.pyplot as pl
from sklearn import decomposition
import scipy as sp
import scipy.sparse.linalg  # make sp.sparse.linalg available
import sklearn.preprocessing
import numpy as np
import sklearn as sk

def gen_data_3_1():
    # generate the data 3.1
    m = 1000  # number of samples
    n = 10    # number of variables
    d1 = np.random.normal(loc=0, scale=100, size=(m, 1))
    d2 = np.random.normal(loc=0, scale=121, size=(m, 1))
    d3 = -0.2 * d1 + 0.9 * d2
    z = np.zeros(shape=(m, 1))
    for i in range(4):
        z = np.hstack([z, d1 + np.random.normal(size=(m, 1))])
    for i in range(4):
        z = np.hstack([z, d2 + np.random.normal(size=(m, 1))])
    for i in range(2):
        z = np.hstack([z, d3 + np.random.normal(size=(m, 1))])
    z = z[:, 1:11]
    z = sk.preprocessing.scale(z, axis=0)
    return z

x = gen_data_3_1()             # generate the sample dataset
x = sk.preprocessing.scale(x)  # normalize the data

pca = sk.decomposition.PCA().fit(x)  # compute the PCA of x and print the first princ. comp.
print("first pca components=", pca.components_[:, 0])

u, s, v = sp.sparse.linalg.svds(x)   # the first column of v.T is the first princ. comp.
print("first svd components=", v.T[:, 0])

trsvd = sk.decomposition.TruncatedSVD(n_components=3).fit(x)  # the first component is the first princ. comp.
print("first component TruncatedSVD=", trsvd.components_[0, ])
Output:
first pca components= [-0.04201262 0.49555992 0.53885401 -0.67007959 0.0217131 -0.02535204
0.03105254 -0.07313795 -0.07640555 -0.00442718]
first svd components= [ 0.02535204 -0.1317925 0.12071112 -0.0323422 0.20165568 -0.25104996
-0.0278177 0.17856688 -0.69344318 0.59089451]
first component TruncatedSVD= [-0.04201262 -0.04230353 -0.04213402 -0.04221069 0.4058159 0.40584108
0.40581564 0.40584842 0.40872029 0.40870925]
Because PCA, SVD, and truncated SVD are not the same. PCA calls SVD, but it also centers the data first; TruncatedSVD does not center. scipy.sparse.linalg.svds computes a partial (sparse-friendly) SVD and returns the singular values in ascending order, so the first column of v.T corresponds to the smallest computed singular value, not the largest.
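A minimal sketch of how the three can be reconciled (note also that sklearn stores components as rows, so the first principal component is pca.components_[0], not a column; the synthetic data here is just for illustration):

import numpy as np
from scipy.sparse.linalg import svds
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 10))
xc = x - x.mean(axis=0)                            # center, as PCA does internally

pca_first = PCA().fit(x).components_[0]            # rows are components, largest variance first

u, s, vt = np.linalg.svd(xc, full_matrices=False)  # singular values in descending order
svd_first = vt[0]

u2, s2, vt2 = svds(xc, k=3)                        # partial SVD, singular values ascending
svds_first = vt2[np.argmax(s2)]                    # pick the row for the largest singular value

# all three agree up to sign
for v in (svd_first, svds_first):
    assert np.allclose(abs(v @ pca_first), 1.0, atol=1e-6)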