Bayesian Statistics - python

I need to know how to find the Bayesian probability of two discrete distributions. For example, the distributions are given as follows:
hypo_A=[ 0.1,0.4,0.5,0.0,0.0,0.0]
hypo_B=[ 0.1,0.1,0.1,0.3,0.3,0.1]
The prior is that both hypotheses are equally likely.
Bayes' formula is p(H|x) = p(x|H)*p(H) / sum over H' of p(x|H')*p(H').
Basically, I need to know how to multiply these two distributions in Python.
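For reference, a minimal sketch of that computation in plain numpy (assuming the observation x is simply an index into the two lists and the prior is uniform) could look like this:
import numpy as np

hypo_A = np.array([0.1, 0.4, 0.5, 0.0, 0.0, 0.0])
hypo_B = np.array([0.1, 0.1, 0.1, 0.3, 0.3, 0.1])
prior = np.array([0.5, 0.5])               # both hypotheses equally likely

x = 1                                      # observed outcome (index into the lists)
likelihood = np.array([hypo_A[x], hypo_B[x]])
unnormalized = prior * likelihood          # p(x|H) * p(H) for each hypothesis
posterior = unnormalized / unnormalized.sum()
print(posterior)                           # [0.8 0.2] for x = 1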

I highly recommend reading the book Think Bayes.
Here is a simple implementation of Bayesian statistics in Python that I wrote:
from collections import namedtuple

hypothesis = namedtuple('hypothesis', ['likelihood', 'belief'])

class DiscreteBayes:
    def __init__(self):
        """initiates the hypothesis dictionary"""
        self.hypo = dict()

    def normalize(self):
        """normalizes the sum of all beliefs to 1"""
        s = sum([float(h.belief) for h in self.hypo.values()])
        self.hypo = dict([(k, hypothesis(likelihood=h.likelihood, belief=h.belief / s))
                          for k, h in self.hypo.items()])

    def update(self, data):
        """updates beliefs based on new data"""
        if type(data) != list:
            data = [data]
        for datum in data:
            self.hypo = dict([(k, hypothesis(likelihood=h.likelihood, belief=h.belief * h.likelihood(datum)))
                              for k, h in self.hypo.items()])
        self.normalize()

    def predict(self, x):
        """predicts the probability of new data based on previously seen data"""
        return sum([float(h.belief) * float(h.likelihood(x)) for h in self.hypo.values()])
In your case:
hypo_A = [0.1, 0.4, 0.5, 0.0, 0.0, 0.0]
hypo_B = [0.1, 0.1, 0.1, 0.3, 0.3, 0.1]

d = DiscreteBayes()
# the likelihood must be callable; plain lists have no .get method,
# so wrap them in dicts keyed by outcome index
d.hypo['hypo_A'] = hypothesis(likelihood=dict(enumerate(hypo_A)).get, belief=1)
d.hypo['hypo_B'] = hypothesis(likelihood=dict(enumerate(hypo_B)).get, belief=1)
d.normalize()

x = 1
d.update(x)            # updating beliefs after seeing x
print(d.predict(x))    # the probability of seeing x in the future
print(d.hypo)
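With a uniform prior and x = 1, the posterior beliefs should work out to 0.8 for hypo_A (0.5*0.4 / (0.5*0.4 + 0.5*0.1)) and 0.2 for hypo_B, and d.predict(1) should return 0.8*0.4 + 0.2*0.1 = 0.34.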

Related

Python statsmodels.tsa: Cannot fit (p,q) found using arma_order_select_ic function into ARIMA function

I have a list of historical ETF data, and I was trying to write a function that automatically finds (p,q) for every prediction using AIC or BIC. However, I encountered two different LinAlg errors when fitting the resulting (p,q) with the ARIMA method, also from tsa.
The two errors are:
Non-positive-definite forecast error covariance matrix encountered at period 1
LU decomposition error
Here's my code:
def BICpick_pq(data_series, window):
    train_results = sm.tsa.arma_order_select_ic(data_series, ic=['bic'],
                                                trend='nc', max_ar=window, max_ma=window)
    return train_results.bic_min_order

def arimaV3_BIC(all_data_series, p_max, q_max, window):
    predicts = []
    for i in range(len(all_data_series) - window):
        data = all_data_series[i:i + window]
        d = find_d(data, window)
        p, q = BICpick_pq(data, window)
        pred = ARIMAresult(data, p, d, q)
        predicts.append(pred)
    win_rate, win_rate_aim = winRates(all_data_series, predicts, window)
    return predicts, win_rate, win_rate_aim
I have no idea what's going on: doesn't arma_order_select_ic have to fit each (p,q) first to get its score and then compare the scores? So why can't I fit the selected order in my own method?
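No answer is recorded for this question, but a common workaround is to guard the ARIMA fit and fall back to a simpler order when the BIC-selected one fails numerically. A rough sketch (the fallback order (1, d, 0) is an arbitrary choice, and find_d/ARIMAresult above are the asker's own helpers):
import numpy as np
import statsmodels.api as sm

def safe_arima_fit(data, p, d, q):
    """Fit ARIMA with the selected order; fall back to (1, d, 0) if the fit blows up."""
    for order in [(p, d, q), (1, d, 0)]:
        try:
            return sm.tsa.ARIMA(np.asarray(data, dtype=float), order=order).fit()
        except (np.linalg.LinAlgError, ValueError) as err:
            print("order %s failed: %s" % (order, err))
    return None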

Should we scale the data before using the KElbowVisualizer method for clustering in Python?

I know that before any clustering we need to scale the data.
But I want to ask whether the KElbowVisualizer method does the scaling by itself, or whether I should scale the data before giving it to the method.
I already searched the documentation of this method but did not find an answer; please share it with me if you find it. Thank you.
I looked at the implementation of KElbowVisualizer in yellowbrick/cluster/elbow.py on GitHub, and I haven't found any code under the fit function (line 306) that scales the X variables.
# https://github.com/DistrictDataLabs/yellowbrick/blob/main/yellowbrick/cluster/elbow.py
#...
def fit(self, X, y=None, **kwargs):
    """
    Fits n KMeans models where n is the length of ``self.k_values_``,
    storing the silhouette scores in the ``self.k_scores_`` attribute.
    The "elbow" and silhouette score corresponding to it are stored in
    ``self.elbow_value`` and ``self.elbow_score`` respectively.
    This method finishes up by calling draw to create the plot.
    """
    self.k_scores_ = []
    self.k_timers_ = []
    self.kneedle = None
    self.knee_value = None

    if self.locate_elbow:
        self.elbow_value_ = None
        self.elbow_score_ = None

    for k in self.k_values_:
        # Compute the start time for each model
        start = time.time()
        # Set the k value and fit the model
        self.estimator.set_params(n_clusters=k)
        self.estimator.fit(X, **kwargs)
        # Append the time and score to our plottable metrics
        self.k_timers_.append(time.time() - start)
        self.k_scores_.append(self.scoring_metric(X, self.estimator.labels_))
    #...
So, you may need to scale your data (the X matrix) yourself before passing it to KElbowVisualizer().fit().
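A minimal sketch of doing that with scikit-learn's StandardScaler (assuming a numeric feature matrix X):
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from yellowbrick.cluster import KElbowVisualizer

# scale the features first, since the visualizer does not do it for you
X_scaled = StandardScaler().fit_transform(X)

visualizer = KElbowVisualizer(KMeans(), k=(2, 10))
visualizer.fit(X_scaled)   # fits one KMeans per k on the scaled data
visualizer.show()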

Bayesian modeling of repeated binary measurements in PyMC3 (Python)

I am going to run a study in which multiple raters have to evaluate whether each of a number of papers is '1' or '0'. The reason I use multiple raters is that I suspect that each individual rater is likely to make mistakes, and I hope that by using multiple raters I can control for that.
My aim is to estimate the true proportion of '1' in the population of papers, and I want to do this using a Bayesian model in PyMC3. More general answers about model specification without a concrete implementation in PyMC3 are of course also welcome.
This is how I've simulated some data:
import numpy as np
import pandas as pd
from scipy.stats import binom

n = 250  # number of papers we sample
p = 0.3  # true rate
true_sample = binom.rvs(1, p, size=n)

# add error
def rating(array, error_rate):
    scores = []
    for i in array:
        # a paper whose true label is 1 is rated 1 with probability error_rate;
        # a paper whose true label is 0 is always rated 0 (no false positives)
        scores.append(np.random.binomial(i, error_rate))
    return np.array(scores)

r = 10  # number of raters
r_error = np.random.uniform(0.7, 0.99, r)  # how often each rater rates a paper correctly

# get the data
rated_data = {}
for i in range(r):
    rated_data[f'rater_{i}'] = rating(true_sample, r_error[i])
df = pd.DataFrame(rated_data, index=[f'abstract_{i}' for i in range(n)])
This is the model I have tried:
with pm.Model() as binom_model2:
    p = pm.Beta('p', 0.5, 0.5)  # this is the proportion of '1' in the population
    for i in range(10):  # error rate and p for each rater separately
        er = pm.Beta(f'er{i}', 10, 3)
        prob = pm.Binomial(f'prob{i}', p=p * er, n=n, observed=df.iloc[:, i].sum())
This seems to work fine, in that it gives good estimates of p and the error rates (but do tell me if you think there are problems with the model!). However, it doesn't use all the information that is available, namely the fact that the ratings on each row of the dataframe are ratings of the same paper. I presume that a model that could incorporate this would give even more accurate estimates of p and of the error rates. I'm not sure how to do this, and any help would be appreciated.
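No answer is recorded for this question. One way to use the per-paper structure is to give every paper a latent true label and let all ratings of that paper depend on it. A rough sketch (it assumes, as in the simulation above, that raters never rate a true '0' as '1', and it reuses n, r, and df from that code):
import pymc3 as pm

with pm.Model() as paper_model:
    p = pm.Beta('p', 0.5, 0.5)            # proportion of '1' papers in the population
    er = pm.Beta('er', 10, 3, shape=r)    # per-rater accuracy
    z = pm.Bernoulli('z', p=p, shape=n)   # latent true label of each paper
    # a rating is 1 only if the paper is truly 1 and the rater detects it
    pm.Bernoulli('ratings', p=z[:, None] * er[None, :], observed=df.values)
    trace = pm.sample(2000, tune=1000)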

lmfit for exponential data returns linear function

I'm working on fitting muon lifetime data to a curve to extract the mean lifetime using the lmfit module. The general process I'm using is to bin the 13,000 data points into 10 bins using the histogram function, calculate the uncertainty as the square root of the counts in each bin (it's an exponential model), and then use lmfit to determine the best fit along with the means and uncertainties. However, graphing the output of the model.fit() method gives a fit (the red line) that is obviously not correct. [Fit result output graph]
I've looked online and can't find a solution to this; I'd really appreciate some help figuring out what's going on. Here's the code.
import os
import numpy as np
import matplotlib.pyplot as plt
from numpy import sqrt, pi, exp, linspace
from lmfit import Model

class data():
    def __init__(self, file_name):
        times_dirty = sorted(np.genfromtxt(file_name, delimiter=' ', unpack=False)[:, 0])
        self.times = []
        for i in range(len(times_dirty)):
            if times_dirty[i] < 40000:
                self.times.append(times_dirty[i])
        self.counts = []
        self.binBounds = []
        self.uncertainties = []
        self.means = []

    def binData(self, k):
        self.counts, self.binBounds = np.histogram(self.times, bins=k)
        self.binBounds = self.binBounds[:-1]

    def calcStats(self):
        if len(self.counts) == 0:
            print('Run binData function first')
        else:
            self.uncertainties = sqrt(self.counts)

    def plotData(self, fit):
        plt.errorbar(self.binBounds, self.counts, yerr=self.uncertainties, fmt='bo')
        plt.plot(self.binBounds, fit.init_fit, 'k--')
        plt.plot(self.binBounds, fit.best_fit, 'r')
        plt.show()

def decay(t, N, lamb, B):
    return N * lamb * exp(-lamb * t) + B

def main():
    # use a raw string so the backslashes in the Windows path are not treated as escapes
    muonEvents = data(r'C:\Users\Colt\Downloads\muon.data')
    muonEvents.binData(10)
    muonEvents.calcStats()
    mod = Model(decay)
    result = mod.fit(muonEvents.counts, t=muonEvents.binBounds, N=1, lamb=1, B=1)
    muonEvents.plotData(result)
    print(result.fit_report())
    print(len(muonEvents.times))

if __name__ == "__main__":
    main()
This might be a simple scaling problem. As a quick test, try dividing all raw data by a factor of 1000 (both X and Y) to see if changing the magnitude of the data has any effect.
Just to build on James Phillips' answer, I think the data you show in your graph imply values for N, lamb, and B that are very different from 1, 1, 1. Keep in mind that exp(-lamb*t) is essentially 0 for lamb = 1 and t > 100. So if the algorithm starts at lamb=1 and varies it by a little bit to find a better value, it won't actually be able to see any difference in how well the model matches the data.
I would suggest trying to start with values that are more reasonable for the data you have, perhaps N=1.e6, lamb=1.e-4, and B=100.
As James suggested, keeping the variables' values on the order of 1 and putting in scale factors as necessary is often helpful in getting numerically stable solutions.
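Concretely, following that suggestion would mean calling fit with starting values roughly matched to the data's scale; the exact numbers below are illustrative guesses to be tuned to your histogram:
# start the fit from values on the same scale as the data (illustrative guesses)
result = mod.fit(muonEvents.counts, t=muonEvents.binBounds,
                 N=1.0e6, lamb=1.0e-4, B=100)
print(result.fit_report())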

Semi-supervised Naive Bayes with NLTK [closed]

I have built a semi-supervised version of NLTK's Naive Bayes in Python based on the EM (expectation-maximization) algorithm. However, in some iterations of EM I am getting negative log-likelihoods (the log-likelihoods of EM must be positive in every iteration), so I believe there must be some mistake in my code. After carefully reviewing my code, I have no idea why this is happening. It would be really appreciated if someone could spot any mistakes in my code below:
(Reference material of semi-supervised Naive Bayes)
EM-algorithm main loop
#initial assumptions:
#Bernoulli NB: only feature presence (value 1) or absence (value None) is computed
#initial data:
#C: classifier trained with labeled data
#labeled_data: an array of tuples (feature dic, label)
#features: dictionary that outputs feature dictionary for a given document id

for iteration in range(1, self.maxiter):
    #Expectation: compute probabilities for each class for each unlabeled document
    #An array of tuples (feature dictionary, probability dist) is built
    unlabeled_data = [(features[id], C.prob_classify(features[id])) for id in U]

    #Maximization: given the probability distributions of previous step,
    #update label, feature-label counts and update classifier C
    #gen_freqdists is a custom function, see below
    #gen_probdists is the original NLTK function
    l_freqdist_act, ft_freqdist_act, ft_values_act = self.gen_freqdists(labeled_data, unlabeled_data)
    l_probdist_act, ft_probdist_act = self.gen_probdists(l_freqdist_act, ft_freqdist_act, ft_values_act, ELEProbDist)
    C = nltk.NaiveBayesClassifier(l_probdist_act, ft_probdist_act)

    #Compute log-likelihood
    #NLTK Naive bayes classifier prob_classify func gives logprob(class) + logprob(doc|class)
    #for labeled data, sum logprobs output by the classifier for the label
    #for unlabeled data, sum logprobs output by the classifier for each label
    log_lh = sum([C.prob_classify(ftdic).prob(label) for (ftdic, label) in labeled_data])
    log_lh += sum([C.prob_classify(ftdic).prob(label) for (ftdic, ignore) in unlabeled_data for label in l_freqdist_act.samples()])

    #Continue until convergence
    if log_lh_old == "first":
        if self.debug: print "\tM: #iteration 1", log_lh, "(FIRST)"
        log_lh_old = log_lh
    else:
        log_lh_diff = log_lh - log_lh_old
        if self.debug: print "\tM: #iteration", iteration, log_lh_old, "->", log_lh, "(", log_lh_diff, ")"
        if log_lh_diff < self.log_lh_diff_min: break
        log_lh_old = log_lh
Custom function gen_freqdists, used to create the needed frequency distributions:
def gen_freqdists(self, instances_l, instances_ul):
    l_freqdist = FreqDist()              #frequency distrib. of labels
    ft_freqdist = defaultdict(FreqDist)  #dictionary of freq. distrib. for ft-label pairs
    ft_values = defaultdict(set)         #dictionary of possible values for each ft (only 1/None)
    fts = set()                          #set of all fts

    #counts for labeled data
    for (ftdic, label) in instances_l:
        l_freqdist.inc(label, 1)
        for f in ftdic.keys():
            fts.add(f)
            ft_freqdist[label, f].inc(1, 1)
            ft_values[f].add(1)

    #counts for unlabeled data
    #we must compute maximum a posteriori label estimate
    #and update label/ft occurrences accordingly
    for (ftdic, probs) in instances_ul:
        map_l = probs.max()        #label with highest probability
        map_p = probs.prob(map_l)  #probability of map_l
        l_freqdist.inc(map_l, count=map_p)
        for f in ftdic.keys():
            fts.add(f)
            ft_freqdist[map_l, f].inc(1, count=map_p)
            ft_values[f].add(1)

    #features not appearing in documents get implicit None values
    for l in l_freqdist.samples():
        num_samples = l_freqdist[l]
        for f in fts:
            count = ft_freqdist[l, f].N()
            ft_freqdist[l, f].inc(None, num_samples - count)
            ft_values[f].add(None)

    #return computed frequency distributions
    return l_freqdist, ft_freqdist, ft_values
I think you're summing the wrong values.
This is your code that is supposed to compute the sum of the log probs:
#Compute log-likelihood
#NLTK Naive bayes classifier prob_classify func gives logprob(class) + logprob(doc|class))
#for labeled data, sum logprobs output by the classifier for the label
#for unlabeled data, sum logprobs output by the classifier for each label
log_lh = sum([C.prob_classify(ftdic).prob(label) for (ftdic,label) in labeled_data])
log_lh += sum([C.prob_classify(ftdic).prob(label) for (ftdic,ignore) in unlabeled_data for label in l_freqdist_act.samples()])
According to the NLTK documentation for prob_classify (on NaiveBayesClassifier), a ProbDistI object is returned (not logprob(class) + logprob(doc|class)). On that object you're calling the prob method for a given label, which gives you a plain probability. You probably want to call logprob instead, and negate that return value as well.
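Concretely, that change would make the two summation lines look something like this (a sketch; the surrounding loop stays as it is):
#sum log-probabilities instead of raw probabilities
log_lh = sum([C.prob_classify(ftdic).logprob(label) for (ftdic, label) in labeled_data])
log_lh += sum([C.prob_classify(ftdic).logprob(label) for (ftdic, ignore) in unlabeled_data for label in l_freqdist_act.samples()])
#per the note above, negate the sums if you expect a positive quantity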
