I am going to run a study in which multiple raters have to evaluate whether each of a number of papers is '1' or '0'. The reason I use multiple raters is that I suspect that each individual rater is likely to make mistakes, and I hope that by using multiple raters I can control for that.
My aim is to estimate the true proportion of '1' in the population of papers, and I want to do this using a Bayesian model in PyMC3. More general answers about model specification without the concrete implementation in PyMC3 are of course also welcome.
This is how I've simulated some data:
import numpy as np
import pandas as pd
from scipy.stats import binom

n = 250  # number of papers we sample
p = 0.3  # true rate
true_sample = binom.rvs(1, p, size=n)

# add error: a true 1 is kept with probability error_rate; a true 0 always stays 0
def rating(array, error_rate):
    scores = []
    for i in array:
        scores.append(np.random.binomial(i, error_rate))
    return np.array(scores)

r = 10  # number of raters
r_error = np.random.uniform(0.7, 0.99, r)  # how often does each rater rate a paper correctly

# get the data
rated_data = {}
for i in range(r):
    rated_data[f'rater_{i}'] = rating(true_sample, r_error[i])
df = pd.DataFrame(rated_data, index=[f'abstract_{i}' for i in range(n)])
This is the model I have tried:
import pymc3 as pm

with pm.Model() as binom_model2:
    p = pm.Beta('p', 0.5, 0.5)  # this is the proportion of '1' in the population
    for i in range(10):  # error_r and p for each rater separately
        er = pm.Beta(f'er{i}', 10, 3)
        prob = pm.Binomial(f'prob{i}', p=(p * er), n=n, observed=df.iloc[:, i].sum())
This seems to work fine, in that it gives good estimates of p and error_r (but do tell me if you think there are problems with the model!). However, it doesn't use all the information that is available, namely the fact that the ratings on each row of the dataframe are ratings of the same paper. I presume that a model that incorporates this would give even more accurate estimates of p and of the error rates. I'm not sure how to do this, and any help would be appreciated.
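One way to use the row structure is to write down the likelihood of a whole row of ratings with the paper's unknown true label summed out. The sketch below does this in plain Python under the same assumption as the simulation above (a true '1' is rated 1 by rater j with probability accuracies[j], and a true '0' is always rated 0). The function names are hypothetical, and this is only the likelihood, not a full model. In PyMC3, a term of this form could probably be added to the model with something like pm.Potential, but that wiring is an untested assumption here.

```python
import math

def paper_log_likelihood(ratings, p, accuracies):
    """Log-probability of one paper's row of 0/1 ratings, with the
    paper's unknown true label summed out."""
    # Likelihood of the row if the paper is truly '1'
    like_if_one = 1.0
    for y, acc in zip(ratings, accuracies):
        like_if_one *= acc if y == 1 else (1.0 - acc)
    # If the paper is truly '0', only an all-zero row is possible
    like_if_zero = 1.0 if all(y == 0 for y in ratings) else 0.0
    return math.log(p * like_if_one + (1.0 - p) * like_if_zero)

def total_log_likelihood(rows, p, accuracies):
    """Sum of the per-paper log-likelihoods over all rows."""
    return sum(paper_log_likelihood(row, p, accuracies) for row in rows)

# Two raters with accuracies 0.9 and 0.8, true rate 0.3
print(math.exp(paper_log_likelihood([0, 0], 0.3, [0.9, 0.8])))
print(math.exp(paper_log_likelihood([1, 0], 0.3, [0.9, 0.8])))
```

A single '1' anywhere in a row rules out a true '0' under this assumption, which is exactly the per-paper information that the column sums discard.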
I am working on a dataset of 5 columns (named 'Healthy', 'Growth', 'Refined', 'Reasoned', 'Accepted') and 50k rows. I divided it into a train dataset (10k) and a validation set (the rest of the dataset).
I built a Bayesian Belief Network with the following edges ('Healthy', 'Refined'), ('Healthy', 'Reasoned'),
('Refined', 'Accepted'), ('Reasoned', 'Accepted'), ('Growth', 'Accepted').
I would like, in order to evaluate the quality of my network, to insert evidence in the nodes 'Healthy', 'Growth', 'Refined' and 'Reasoned', predict the value of 'Accepted' and finally compare it with the actual value in the validation set.
The for loop I wrote always stops after 584 iterations without raising any error, and the kernel still appears busy.
Here is a simpler version of my code. I include only the version of the network that uses maximum likelihood to compute the parameters. The issue is the same with the other methods of computing the parameters.
import pandas as pd
from pgmpy.base import DAG
from pgmpy.models import BayesianNetwork
from pgmpy.sampling import BayesianModelSampling
from pgmpy.factors.discrete import State
#import dataset
df = pd.read_csv("C:\\Users\\puddu\\Desktop\\Tools\\Dummy.BBN\\Dummy_data_set.csv")
#preliminary operation on dataset
df.rename(columns = {'Q1.Healthy':'Healthy', 'Q2.Growth':'Growth',
'Q3.Refined':'Refined', 'Q9.Accepted':'Accepted',
'Q8.Reasoned':'Reasoned'}, inplace = True)
nodes = ('Healthy', 'Growth', 'Refined', 'Reasoned', 'Accepted')
replies = ['E','D', 'C', 'B', 'A']
edges = [('Healthy', 'Refined'),
('Healthy', 'Reasoned'),
('Refined', 'Accepted'),
('Reasoned', 'Accepted'),
('Growth', 'Accepted')]
for nod in nodes:
    df[nod] = df[nod].astype('category')
    df[nod] = df[nod].cat.set_categories(replies, ordered=True)
#training set definition
df_train = df.head(10000).copy().reset_index(drop= True)
#directed acyclic graph building
dag = DAG()
dag.add_edges_from(ebunch= edges)
#BBN building + estimating MLE parameters
model_mle = BayesianNetwork(dag)
model_mle.fit(df_train)
df_validation = df.iloc[(10000):(11000),].copy().reset_index(drop= True)
inference_mle = BayesianModelSampling(model_mle)
mle_guesses = 0
for i in range(1000):
    evidence = [State(var='Growth', state=df_validation['Growth'][i]),
                State(var='Healthy', state=df_validation['Healthy'][i]),
                State(var='Reasoned', state=df_validation['Reasoned'][i]),
                State(var='Refined', state=df_validation['Refined'][i])]
    mle_prediction = inference_mle.rejection_sample(size=1,
        evidence=evidence, show_progress=False)['Accepted'][0]
    result = df_validation['Accepted'][i]
    if mle_prediction == result:
        mle_guesses += 1
    print(f"Step {i}")
Thanks to everyone who spends time helping me.
The way rejection sampling works is that it simulates data from the model and keeps only the data that matches the given evidence. My guess is that the probability of the evidence at iteration 585 is extremely low, so the algorithm keeps looping while it tries to generate a sample that matches the evidence.
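To illustrate why rare evidence can make rejection sampling appear to hang, here is a toy sampler in plain Python. It is not pgmpy's implementation: each try simply matches the evidence with a given probability, and the max_tries cap is a hypothetical safeguard added for the illustration.

```python
import random

def rejection_sample(evidence_prob, max_tries, rng):
    """Toy rejection sampler: each try matches the evidence with
    probability evidence_prob. Returns the number of tries used,
    or None if max_tries is exhausted without a match."""
    for attempt in range(1, max_tries + 1):
        if rng.random() < evidence_prob:
            return attempt
    return None

rng = random.Random(0)
for prob in (0.5, 0.01, 0.0):
    print(prob, rejection_sample(prob, max_tries=100_000, rng=rng))
```

The expected number of tries is 1 / evidence_prob, so evidence that the fitted model gives probability zero (for example a combination that never occurs in the training data) would never be matched without a cap.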
Some possible solutions:
If you want to stay with sampling-based inference, simulate a large dataset once and estimate the probabilities from the rows that match the evidence. For evidence as rare as in the case described above, the estimate will simply be approximately 0. This is also much faster, because the data only needs to be simulated once:
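import numpy as np

n_samples = int(1e5)
df_simulated = model_mle.simulate(n_samples)
evidence_cols = ['Healthy', 'Growth', 'Refined', 'Reasoned']
for i in range(1000):
    e = df_validation.loc[i, evidence_cols].to_dict()
    # rows of the simulated data that match the evidence
    matches = df_simulated.loc[np.all(df_simulated[list(e)] == pd.Series(e), axis=1)]
    # approximate joint probability of each 'Accepted' state together with this evidence
    result = matches['Accepted'].value_counts() / n_samples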
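Alternatively, you can do exact inference:
from pgmpy.inference import VariableElimination

infer = VariableElimination(model_mle)
for i in range(1000):
    # the evidence must not contain the queried variable 'Accepted'
    evidence = df_validation.loc[i, ['Healthy', 'Growth', 'Refined', 'Reasoned']].to_dict()
    result = infer.query(['Accepted'], evidence=evidence)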
I'm unit acceptance testing some code I wrote. It's conceivable that at some point in the real world we will have input data where the dependent variable is constant. Not the norm, but possible. A linear model should yield coefficients of 0 in this case (right?), which is fine and what we would want -- but for some reason I'm getting some wild results when I try to fit the model on this use case.
I have tried 3 models and get different weird results every time, or no results in some cases.
For this use case all of the dependent observations are set at 100, all the freq_weights are set at 1, and the independent variables are a binary coded dummy set of 20 features.
In total there are 150 observations.
Again, this data is unlikely in the real world but I need my code to be able to work on this ugly data. IDK why I'm getting such erroneous and different results.
As I understand it, with no variance in the dependent variable I should be getting 0 for all my coefficients.
freq = freq['Freq']
Indies = sm.add_constant(df)
model = sm.OLS(df1, Indies)
res = model.fit()
res.params
yields:
const 65.990203
x1 17.214836
reg = sm.GLM(df1, Indies, freq_weights = freq)
results = reg.fit(method = 'lbfgs', max_start_irls=0)
results.params
yields:
const 83.205034
x1 82.575228
reg = sm.GLM(df1, Indies, freq_weights = freq)
result2 = reg.fit()
result2.params
yields
PerfectSeparationError: Perfect separation detected, results not available
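A small worked example may help with the expectation about the coefficients. With a constant column in the design, a constant dependent variable of 100 is fitted by an intercept of 100 and slopes of 0, not by all coefficients being 0. Moreover, if the dummies form a complete set that sums to 1 on every row (an assumption, since the post does not show the design matrix), the constant column is an exact linear combination of them and infinitely many coefficient vectors give the same perfect fit. The sketch below uses a hypothetical 4-row design with two dummies to show this. It is not a diagnosis of the exact numbers reported above.

```python
# Rows are [const, dummy_a, dummy_b]; the two dummies sum to 1 on every row,
# so the constant column is an exact linear combination of them.
X = [[1, 1, 0], [1, 0, 1], [1, 1, 0], [1, 0, 1]]
y = [100, 100, 100, 100]

def fitted(X, coefs):
    """Fitted values X @ coefs, written out without NumPy."""
    return [sum(x * b for x, b in zip(row, coefs)) for row in X]

# Three different coefficient vectors, all reproducing y exactly
coefs_a = [100, 0, 0]
coefs_b = [60, 40, 40]
coefs_c = [0, 100, 100]
for coefs in (coefs_a, coefs_b, coefs_c):
    print(coefs, fitted(X, coefs))
```

When the design is rank-deficient like this, which of the equally good solutions a solver returns depends on its numerical method, so different fitting routines can report different coefficients.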
I have been attempting to use the hmmlearn package in python to build a model predicting values of a time series. I have based my code on this article, detailing how to use the package for a stock price time series.
After fitting the model on a large segment of the time series data and attempting to build a predictive model for the remainder, I run into an issue. The model always predicts the same outcome as being most probable - hmm.score returns the highest log-likelihood for the same outcome for every instance in the test series. Moreover, the outcome it predicts is the one closest to the mean value of the time series it was fitted on. It never deviates. I'm really not sure what to do. Is the model deficient, or am I doing something wrong?
The code that does the prediction is below. It appends all of the possible_outcomes (defined immediately below) to a sequence of test points in the time series (the last 100 in the test dataset) and evaluates the likelihood (using hmm.score):
possible_outcomes = np.linspace(-0.1, 0.1, 10)
latency_days = 10

def predict_close_price(time_index):
    open_price = actuals_test[time_index]
    predicted_frac_change = get_most_probable_outcome(time_index)
    return open_price * (1 + predicted_frac_change)

def get_most_probable_outcome(time_index):
    previous_data_start_index = max(0, time_index - latency_days)
    previous_data_end_index = max(0, time_index - 1)
    prev_start = int(previous_data_start_index)
    prev_end = int(previous_data_end_index)
    previous_data = test_data[prev_start: prev_end]
    outcome_score = []
    for possible_outcome in possible_outcomes:
        # np.vstack is the same function as np.row_stack, which newer NumPy deprecates
        total_data = np.vstack((previous_data, possible_outcome))
        outcome_score.append(hmm.score(total_data))
    most_probable_outcome = possible_outcomes[np.argmax(outcome_score)]
    print(most_probable_outcome)
    return most_probable_outcome

predicted_close_prices = []
actuals_vector = []
for time_index in range(len(actuals_test) - 100, len(actuals_test) - 1):
    predicted_close_prices.append(predict_close_price(time_index))
    actuals_vector.append(actuals_test[time_index])
I don't know if the issue is with the above, or with the actual creation of data and fitting of the model itself. That is done simplistically as follows:
from hmmlearn.hmm import GaussianHMM

timeSeries.reverse()
difference_fracs = []
for i in range(0, len(timeSeries) - 1):
    difference_frac = (timeSeries[i+1] - timeSeries[i]) / timeSeries[i]
    difference_fracs.append(difference_frac)
differences_array = np.array(difference_fracs)
differences_array = np.reshape(differences_array, (-1, 1))
train_data_length = 2000
train_data = differences_array[:train_data_length, :]
test_data = differences_array[train_data_length:len(timeSeries), :]
actuals_test = timeSeries[train_data_length:]
n_hidden_states = 4
hmm = GaussianHMM(n_components=n_hidden_states)
hmm.fit(train_data)  # was trainData, which is not defined above
I realize most of this is meaningless without the actual time series, which I am not allowed to share - though if someone has had similar issues in the past, I would love to hear your thoughts.
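As a point of comparison only, the sketch below replaces hmm.score with a hypothetical stand-in: the log-density of a single fixed Gaussian. It shows that when the score of the appended candidate depends mainly on one stationary distribution, the argmax over a fixed grid is always the grid point nearest that distribution's mean, regardless of the preceding data. Whether the fitted model actually behaves this way cannot be determined from the post, and the means and standard deviations used are made up.

```python
import math

def gaussian_log_density(x, mean, std):
    """Log-density of a normal distribution; a stand-in for the model score."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

# Same grid as np.linspace(-0.1, 0.1, 10), built without NumPy
candidates = [-0.1 + 0.2 * i / 9 for i in range(10)]

def most_probable_outcome(mean, std):
    """Candidate with the highest stand-in score."""
    scores = [gaussian_log_density(c, mean, std) for c in candidates]
    return candidates[scores.index(max(scores))]

# The winner depends only on the mean, not on the spread
print(most_probable_outcome(0.001, 0.02))
print(most_probable_outcome(0.001, 0.5))
```

A quick check along these lines is to compare the differences between the ten scores for a few time indices. If they barely change from one index to the next, the preceding window is contributing little to the ranking.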
I have implemented K-means using scikit-learn, and I have tried the elbow method and the silhouette coefficient to find the optimal K. I am planning to use the gap statistic to further verify my results.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def optimalK(data, nrefs=3, maxClusters=15):
    gaps = np.zeros((len(range(1, maxClusters)),))
    rows = []  # one record per k, turned into a DataFrame at the end
    for gap_index, k in enumerate(range(1, maxClusters)):
        # Holder for reference dispersion results
        refDisps = np.zeros(nrefs)
        for i in range(nrefs):
            # Create new random reference set
            randomReference = np.random.random_sample(size=data.shape)
            # Fit to it
            km = KMeans(k)
            km.fit(randomReference)
            refDisp = km.inertia_
            refDisps[i] = refDisp
        # Fit to the original data
        km = KMeans(k)
        km.fit(data)
        origDisp = km.inertia_
        # Calculate gap statistic
        gap = np.log(np.mean(refDisps)) - np.log(origDisp)
        # Assign this loop's gap statistic to gaps
        gaps[gap_index] = gap
        # DataFrame.append was removed in pandas 2.0, so collect rows in a list
        rows.append({'clusterCount': k, 'gap': gap})
    resultsdf = pd.DataFrame(rows, columns=['clusterCount', 'gap'])
    return (gaps.argmax() + 1, resultsdf)
However, my plot of the gap statistic keeps increasing, so the optimal number of clusters is always the end point of my range of clusters. If I define the cluster range to be from 1 to 10, the optimum will be 10.
According to several websites and the original paper, the workaround is to apply the one-standard-error rule, under which the optimal K is the smallest K satisfying
GAP(K)> GAP(K+1)- S(K+1)
Can anyone explain to me how to implement this in the above code? I do not know how to calculate the S(k+1) since it involves finding the standard deviation of the reference distribution.
s(k+1) = sd(k+1)*square_root(1+(1/B))
B is the number of Monte Carlo reference samples. I have looked at different websites, but it seems none of them implemented the gap statistic with the one-standard-error rule.
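The rule above can be sketched in plain Python as follows. The function names are hypothetical. The sketch assumes that, for each k, the logs of the B reference dispersions have been kept (in optimalK that would be np.log(refDisps) with B = nrefs), and it takes the gap as the mean of the logs minus the log of the observed dispersion, which is how I read the original paper. Note that optimalK above takes the log of the mean instead.

```python
import math
import statistics

def gap_and_s(log_ref_disps, log_orig_disp):
    """Gap(k) and s(k) from the B log-dispersions of the reference sets."""
    B = len(log_ref_disps)
    gap = statistics.mean(log_ref_disps) - log_orig_disp
    sd = statistics.pstdev(log_ref_disps)  # population sd, divides by B
    return gap, sd * math.sqrt(1 + 1 / B)

def choose_k(gaps, s):
    """gaps[i] and s[i] belong to k = i + 1. Returns the smallest k with
    Gap(k) >= Gap(k+1) - s(k+1), or None if no k in the range qualifies."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - s[i + 1]:
            return i + 1
    return None

# Made-up gap values that keep increasing: argmax would pick k = 4
gaps = [0.1, 0.5, 0.55, 0.56]
s = [0.02, 0.03, 0.1, 0.02]
print(choose_k(gaps, s))
```

With only nrefs=3 reference sets the standard deviation is estimated very roughly, so a larger B is probably worth the extra run time when applying this rule.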
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def gap_stat(data, label):
    k = len(np.unique(label))
    n = data.shape[0]
    p = data.shape[1]
    D_r = []
    C_r = []
    for label_number in range(0, k):
        this_label_index = np.where(label == label_number)[0]
        # sum of squared pairwise distances within this cluster
        pairwise_distance_matrix = euclidean_distances(data[this_label_index], squared=True)
        D_r.append(np.sum(pairwise_distance_matrix))
        C_r.append(float(len(this_label_index)))
    W_r = np.sum(np.asarray(D_r) / (2 * np.asarray(C_r)))
    gap_stats = np.log(float(p * n) / 12) - (2 / float(p)) * np.log(k) - np.log(W_r)
    return gap_stats
I have run this simulation (given below) and got the simulated transition probabilities for dry-to-dry and wet-to-wet conditions. The simulated results for dry-to-dry are almost equal to the estimated dry-to-dry values (d2d_tran). But the simulated wet-to-wet values are substantially lower than the estimated ones. It seems there is something wrong in the program. I tried several other ways but haven't got the expected results. Can you please run the program and suggest how I can get improved results for the wet-to-wet probabilities? Thanks in advance.
My code:
import numpy as np
import random, datetime
d2d = np.zeros(12)
d2w = np.zeros(12)
w2w = np.zeros(12)
w2d = np.zeros(12)
pd2d = np.zeros(12)
pw2w = np.zeros(12)
dry = [0.333] ##unconditional probability of dry for January
d2d_tran = [0.564,0.503,0.582,0.621,0.634,0.679,0.738,0.667,0.604,0.564,0.577,0.621]
w2w_tran = [0.784,0.807,0.8,0.732,0.727,0.728,0.64,0.64,0.665,0.717,0.741,0.769]
mu = [3.71,4.46,4.11,2.94,3.01,2.87,2.31,2.44,2.56,3.45,4.32,4.12]
sigma = [6.72,7.92,7.49,6.57,6.09,5.53,4.38,4.69,4.31,5.71,7.64,7.54]
days = np.array([31,28,31,30,31,30,31,31,30,31,30,31])
rain = np.array([])
for y in range(0, 10000):
    for m in range(0, 12):
        # Include leap years in the calculation and create random variables for each month
        if ((y % 4 == 0 and y % 100 != 0) or y % 400 == 0) and m == 1:
            random_num = np.random.rand(29)
        else:
            random_num = np.random.rand(days[m])
        # let's generate a rainfall amount for the first day of the random series
        if random_num[0] <= dry[0]:
            random_num[0] = 0
        else:
            random_num[0] = abs(random.gauss(mu[0], sigma[0]))
        # generate the whole series in sequence of month and year
        # note: for i == 0, random_num[i-1] is random_num[-1], the month's last (not yet simulated) entry
        for i in range(0, days[m]):
            if random_num[i-1] == 0:  # if yesterday was dry
                if random_num[i] <= d2d_tran[m]:  # check today against the dry2dry transition probabilities
                    random_num[i] = 0
                    d2d[m] += 1.0
                else:
                    random_num[i] = abs(random.gauss(mu[m], sigma[m]))
                    d2w[m] += 1.0
            else:
                if random_num[i] <= w2w_tran[m]:
                    random_num[i] = abs(random.gauss(mu[m], sigma[m]))
                    w2w[m] += 1.0
                else:
                    random_num[i] = 0
                    w2d[m] += 1.0
        pd2d[m] = d2d[m] / (d2d[m] + d2w[m])
        pw2w[m] = w2w[m] / (w2d[m] + w2w[m])
print('Simulated transition probability of dry2dry:\n', np.around(pd2d, decimals=3))
print('Simulated transition probability of wet2wet:\n', np.around(pw2w, decimals=3))
### pd2d and pw2w of generated data should be identical to d2d_tran and w2w_tran respectively
The simulation looks correct as far as it goes, and after running it for 8000 years, I get transition probabilities within .001 most of the time, and there is convergence as the number of days increases.
Nothing guarantees that you will get the exact transition probabilities - on any single run you may get anything. What you've done is generate an estimator for each single transition probability that has mean equal to the actual value (0.345), and some positive variance. The variance of your estimator decreases with n = sample size, but it will always be positive.
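As a rough guide to how fast the variance shrinks, the estimator of one transition probability can be treated as a binomial proportion, whose standard error is sqrt(p(1-p)/n). This is a sketch under the simplifying assumption that n (the number of days that started in the given state) is fixed and the transitions are independent. The function name is hypothetical.

```python
import math

def transition_standard_error(p, n):
    """Standard error of an estimated proportion p based on n trials."""
    return math.sqrt(p * (1.0 - p) / n)

# Standard error for the January wet-to-wet probability at several sample sizes
for n in (1_000, 100_000, 10_000_000):
    print(n, transition_standard_error(0.784, n))
```

The error falls with the square root of n, so each extra decimal place of accuracy costs about 100 times more simulated days.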
If you'd like values closer to the actual transition probabilities (faster convergence), apply a well-known variance reduction technique: stratified sampling, importance sampling, and so on (too many to list). Here's a quick one, known as antithetic variates: take the uniform random deviates generated by np.random.rand() and estimate as usual. Then compute a second estimator from the transformed deviates: [(1-x) for x in stored_deviates]. The average of the two estimators has reduced variance (half that of a single estimator if the two were uncorrelated, and less than half when they are negatively correlated).
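The trick with the transformed deviates (usually called antithetic variates) can be sketched in isolation. The example below estimates a single probability P(U <= q), using the January dry-to-dry value 0.564 as q. It is a stand-alone illustration with hypothetical function names, not a patch for the simulation above, where each day's outcome also depends on the previous day.

```python
import random

def plain_estimate(q, n_pairs, rng):
    """Estimate P(U <= q) from 2 * n_pairs independent uniforms."""
    hits = sum(rng.random() <= q for _ in range(2 * n_pairs))
    return hits / (2 * n_pairs)

def antithetic_estimate(q, n_pairs, rng):
    """Estimate P(U <= q) from n_pairs uniforms u, each paired with 1 - u."""
    total = 0.0
    for _ in range(n_pairs):
        u = rng.random()
        # average the indicator over the deviate and its mirror image
        total += 0.5 * ((u <= q) + ((1.0 - u) <= q))
    return total / n_pairs

q = 0.564
print(plain_estimate(q, 20_000, random.Random(0)))
print(antithetic_estimate(q, 20_000, random.Random(0)))
```

Both estimators use the same number of function evaluations, but each antithetic pair is negatively correlated, so its average varies less from run to run than the average of two independent draws.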