I have implemented k-means using scikit-learn and have tried the elbow method and the silhouette coefficient to find the optimal K. I am planning to use the gap statistic to further verify my results.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def optimalK(data, nrefs=3, maxClusters=15):
    gaps = np.zeros((len(range(1, maxClusters)),))
    resultsdf = pd.DataFrame({'clusterCount': [], 'gap': []})
    for gap_index, k in enumerate(range(1, maxClusters)):
        # Holder for reference dispersion results
        refDisps = np.zeros(nrefs)
        for i in range(nrefs):
            # Create new random reference set
            randomReference = np.random.random_sample(size=data.shape)
            # Fit to it
            km = KMeans(k)
            km.fit(randomReference)
            refDisp = km.inertia_
            refDisps[i] = refDisp
        # Fit to the original data
        km = KMeans(k)
        km.fit(data)
        origDisp = km.inertia_
        # Calculate gap statistic
        gap = np.log(np.mean(refDisps)) - np.log(origDisp)
        # Assign this loop's gap statistic to gaps
        gaps[gap_index] = gap
        # DataFrame.append was removed in pandas 2.x; concat does the same thing here
        resultsdf = pd.concat([resultsdf, pd.DataFrame({'clusterCount': [k], 'gap': [gap]})],
                              ignore_index=True)
    return (gaps.argmax() + 1, resultsdf)
However, my gap statistic plot keeps increasing, so the optimal number of clusters is always the end point of my cluster range. For example, if I define the cluster range to be from 1 to 10, the optimum will always be 10.
According to various websites and the original paper, the remedy is to apply the one-standard-error rule, choosing the smallest K for which
Gap(k) >= Gap(k+1) - s(k+1)
Can anyone explain how to implement this in the above code? I do not know how to calculate s(k+1), since it involves the standard deviation of the reference dispersions:
s(k+1) = sd(k+1) * sqrt(1 + 1/B)
where B is the number of Monte Carlo reference samples. I looked at different websites, but it seems they do not implement the gap statistic with the one-standard-error rule.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def gap_stat(data, label):
    k = len(np.unique(label))
    n = data.shape[0]
    p = data.shape[1]
    D_r = []
    C_r = []
    for label_number in range(0, k):
        this_label_index = np.where(label == label_number)[0]
        # sum of pairwise squared distances within this cluster
        pairwise_distance_matrix = euclidean_distances(data[this_label_index], squared=True)
        D_r.append(np.sum(pairwise_distance_matrix))
        C_r.append(float(len(this_label_index)))
    W_r = np.sum(np.asarray(D_r) / (2 * np.asarray(C_r)))
    gap_stats = np.log(float(p * n) / 12) - (2 / float(p)) * np.log(k) - np.log(W_r)
    return gap_stats
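For what it's worth, here is a minimal sketch of how s(k) could be computed from the B reference dispersions. It is not a drop-in replacement for optimalK above, and the helper names gap_with_sk and chooseK are mine; it also follows the original paper in averaging the log dispersions rather than taking the log of their mean:

import numpy as np
from sklearn.cluster import KMeans

def gap_with_sk(data, nrefs=3, maxClusters=15):
    ks = range(1, maxClusters)
    gaps, sks = np.zeros(len(ks)), np.zeros(len(ks))
    for idx, k in enumerate(ks):
        # log dispersion of each uniform reference data set
        log_ref = np.zeros(nrefs)
        for b in range(nrefs):
            ref = np.random.random_sample(size=data.shape)
            log_ref[b] = np.log(KMeans(n_clusters=k).fit(ref).inertia_)
        log_orig = np.log(KMeans(n_clusters=k).fit(data).inertia_)
        gaps[idx] = log_ref.mean() - log_orig
        # s(k) = sd(k) * sqrt(1 + 1/B), sd(k) = std of the log reference dispersions
        sks[idx] = log_ref.std() * np.sqrt(1 + 1.0 / nrefs)
    return gaps, sks

def chooseK(gaps, sks):
    # smallest k with Gap(k) >= Gap(k+1) - s(k+1)
    for idx in range(len(gaps) - 1):
        if gaps[idx] >= gaps[idx + 1] - sks[idx + 1]:
            return idx + 1          # k values start at 1
    return len(gaps)                # fall back to the largest k tested

With this rule the selected K is no longer forced to the end of the range, because the comparison stops at the first K where the gap curve flattens within one standard error.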
I am going to run a study in which multiple raters have to evaluate whether each of a number of papers is '1' or '0'. The reason I use multiple raters is that I suspect that each individual rater is likely to make mistakes, and I hope that by using multiple raters I can control for that.
My aim is to estimate the true proportion of '1' in the population of papers, and I want to do this using a Bayesian model in PyMC3. More general answers about model specification without the concrete implementation in PyMC3 are of course also welcome.
This is how I've simulated some data:
import numpy as np
import pandas as pd
from scipy.stats import binom

n = 250  # number of papers we sample
p = 0.3  # true rate
true_sample = binom.rvs(1, p, size=n)

# add error
def rating(array, error_rate):
    scores = []
    for i in array:
        scores.append(np.random.binomial(i, error_rate))
    return np.array(scores)

r = 10  # number of raters
r_error = np.random.uniform(0.7, 0.99, r)  # how often each rater rates a paper correctly

# get the data
rated_data = {}
for i in range(r):
    rated_data[f'rater_{i}'] = rating(true_sample, r_error[i])
df = pd.DataFrame(rated_data, index=[f'abstract_{i}' for i in range(n)])
This is the model I have tried:
import pymc3 as pm

with pm.Model() as binom_model2:
    p = pm.Beta('p', 0.5, 0.5)  # this is the proportion of '1' in the population
    for i in range(10):  # error rate and likelihood for each rater separately
        er = pm.Beta(f'er{i}', 10, 3)
        prob = pm.Binomial(f'prob{i}', p=(p * er), n=n, observed=df.iloc[:, i].sum())
This seems to work fine, in that it gives good estimates of p and the error rates (but do tell me if you think there are problems with the model!). However, it doesn't use all the information that is available, namely that the ratings in each row of the dataframe are ratings of the same paper. I presume that a model that could incorporate this would give even more accurate estimates of p and of the error rates. I'm not sure how to do this, and any help would be appreciated.
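One way to use the row structure, offered only as a sketch, is to give each paper a latent true label and model every individual rating as a Bernoulli draw whose success probability depends on that label and on the rater's accuracy. Note that the simulation above never flips a true '0', whereas this sketch assumes raters can err in both directions, so it is a slightly different generative model; the names paper_model, true_label, accuracy, and rate_p are illustrative, while df, n, and r come from the code above:

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as paper_model:
    p = pm.Beta('p', 0.5, 0.5)                              # population proportion of '1'
    true_label = pm.Bernoulli('true_label', p=p, shape=n)   # latent true label per paper
    accuracy = pm.Beta('accuracy', 10, 3, shape=r)          # per-rater accuracy

    lab = tt.shape_padright(true_label)                     # shape (n, 1) for broadcasting
    # P(rating == 1) is accuracy if the paper is truly '1', 1 - accuracy otherwise
    rate_p = lab * accuracy + (1 - lab) * (1 - accuracy)    # shape (n, r)
    obs = pm.Bernoulli('obs', p=rate_p, observed=df.values)

    trace = pm.sample(1000, tune=1000)

Sampling 250 discrete latent labels can be slow; marginalizing the latent label out analytically (a two-component mixture per row) is another option, but the sketch above is the most direct translation of the idea.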
I am implementing the Gaussian Naive Bayes algorithm:
# importing modules
import pandas as pd
import numpy as np
# create an empty dataframe
data = pd.DataFrame()
# create our target variable
data["gender"] = ["male","male","male","male",
"female","female","female","female"]
# create our feature variables
data["height"] = [6,5.92,5.58,5.92,5,5.5,5.42,5.75]
data["weight"] = [180,190,170,165,100,150,130,150]
data["foot_size"] = [12,11,12,10,6,8,7,9]
# view the data
print(data)
# create an empty dataframe
person = pd.DataFrame()
# create some feature values for this single row
person["height"] = [6]
person["weight"] = [130]
person["foot_size"] = [8]
# view the data
print(person)
# Priors can be calculated either constants or probability distributions.
# In our example, this is simply the probability of being a gender.
# calculating prior now
# number of males
n_male = data["gender"][data["gender"] == "male"].count()
# number of females
n_female = data["gender"][data["gender"] == "female"].count()
# total people
total_ppl = data["gender"].count()
print ("Male count =",n_male,"and Female count =",n_female)
print ("Total number of persons =",total_ppl)
# number of males divided by the total rows
p_male = n_male / total_ppl
# number of females divided by the total rows
p_female = n_female / total_ppl
print ("Probability of MALE =",p_male,"and FEMALE =",p_female)
# group the data by gender and calculate the means of each feature
data_means = data.groupby("gender").mean()
# view the values
data_means
# group the data by gender and calculate the variance of each feature
data_variance = data.groupby("gender").var()
# view the values
data_variance
data_variance["foot_size"][data_variance.index == "male"].values[0]
# means for male
male_height_mean=data_means["height"][data_means.index=="male"].values[0]
male_weight_mean=data_means["weight"][data_means.index=="male"].values[0]
male_footsize_mean=data_means["foot_size"][data_means.index=="male"].values[0]
print (male_height_mean,male_weight_mean,male_footsize_mean)
# means for female
female_height_mean=data_means["height"][data_means.index=="female"].values[0]
female_weight_mean=data_means["weight"][data_means.index=="female"].values[0]
female_footsize_mean=data_means["foot_size"][data_means.index=="female"].values[0]
print (female_height_mean,female_weight_mean,female_footsize_mean)
# variance for male
male_height_var=data_variance["height"][data_variance.index=="male"].values[0]
male_weight_var=data_variance["weight"][data_variance.index=="male"].values[0]
male_footsize_var=data_variance["foot_size"][data_variance.index=="male"].values[0]
print (male_height_var,male_weight_var,male_footsize_var)
# variance for female
female_height_var=data_variance["height"][data_variance.index=="female"].values[0]
female_weight_var=data_variance["weight"][data_variance.index=="female"].values[0]
female_footsize_var=data_variance["foot_size"][data_variance.index=="female"].values[0]
print (female_height_var,female_weight_var,female_footsize_var)
# create a function that calculates p(x | y):
def p_x_given_y(x, mean_y, variance_y):
    # input the arguments into a probability density function
    p = 1 / (np.sqrt(2 * np.pi * variance_y)) * \
        np.exp((-(x - mean_y) ** 2) / (2 * variance_y))
    # return p
    return p
# numerator of the posterior if the unclassified observation is a male
posterior_numerator_male = p_male * \
p_x_given_y(person["height"][0],male_height_mean,male_height_var) * \
p_x_given_y(person["weight"][0],male_weight_mean,male_weight_var) * \
p_x_given_y(person["foot_size"][0],male_footsize_mean,male_footsize_var)
# numerator of the posterior if the unclassified observation is a female
posterior_numerator_female = p_female * \
p_x_given_y(person["height"][0],female_height_mean,female_height_var) * \
p_x_given_y(person["weight"][0],female_weight_mean,female_weight_var) * \
p_x_given_y(person["foot_size"][0],female_footsize_mean,female_footsize_var)
print ("Numerator of Posterior MALE =",posterior_numerator_male)
print ("Numerator of Posterior FEMALE =",posterior_numerator_female)
if (posterior_numerator_male >= posterior_numerator_female):
    print ("Predicted gender is MALE")
else:
    print ("Predicted gender is FEMALE")
When we are calculating the probability, we are calculating it using the Gaussian PDF:
$$ P(x) = \frac{1}{\sqrt {2 \pi {\sigma}^2}} e^{\frac{-(x- \mu)^2}{2 {\sigma}^2}} $$
My question is that the above equation is that of a PDF. To calculate a probability, we have to integrate it over an interval:
$$ \int_{x_0}^{x_1} P(x)\,dx $$
But in the above program, we are plugging in the value of x and calculating the probability. Is that correct? Why? I have seen most articles calculate the probability in the same manner.
If this is the wrong way to calculate the probability in the Naive Bayes Classifier, then what is the correct method?
The method is correct. The pdf function is a probability density, i.e., a function that measures the probability of being in a neighborhood of a value divided by the "size" of such a neighborhood, where the "size" is the length in dimension 1, the area in 2, the volume in 3, etc.
In continuous probabilities the probability of getting precisely any given outcome is 0, and this is why densities are used instead. Therefore, we don't deal with expressions such as P(X=x) but with P(|X-x| < Δ(x)), which stands for the probability of X being close to x.
Let me simplify the notation and write P(X~x) for P(|X-x| < Δ(x)).
If you apply the Bayes rule here, you will get
P(X~x|W~w) = P(W~w|X~x)*P(X~x)/P(W~w)
because we are dealing with probabilities. If we now introduce densities:
pdf(x|w)*Δ(x) = pdf(w|x)*Δ(w) * pdf(x)*Δ(x) / (pdf(w)*Δ(w))
because probability = density*neighborhood_size. And since all Δ(·) cancel out in the expression above, we get
pdf(x|w) = pdf(w|x)*pdf(x)/pdf(w)
which is the Bayes rule for densities.
The conclusion is that, given that the Bayes rule also holds for densities, it is legitimate to use the same methods replacing probabilities with densities when dealing with continuous random variables.
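To see this numerically, here is a small check of my own (a sketch, not part of the original code) comparing the density-based posterior numerators above with numerators built from actual probabilities over a small interval delta around the observed values, using scipy.stats.norm and the variables defined in the question's code. The delta factors multiply out to a common constant, so both comparisons pick the same class:

from scipy.stats import norm

delta = 1e-3  # half-width of the small neighborhood around each observed value

def prob_near(x, mean_y, variance_y, delta=delta):
    # P(|X - x| < delta) for X ~ N(mean_y, variance_y)
    sd = variance_y ** 0.5
    return norm.cdf(x + delta, mean_y, sd) - norm.cdf(x - delta, mean_y, sd)

num_male = p_male * \
    prob_near(person["height"][0], male_height_mean, male_height_var) * \
    prob_near(person["weight"][0], male_weight_mean, male_weight_var) * \
    prob_near(person["foot_size"][0], male_footsize_mean, male_footsize_var)
num_female = p_female * \
    prob_near(person["height"][0], female_height_mean, female_height_var) * \
    prob_near(person["weight"][0], female_weight_mean, female_weight_var) * \
    prob_near(person["foot_size"][0], female_footsize_mean, female_footsize_var)

# Each numerator is approximately the density-based numerator times (2*delta)**3,
# so the winner is the same in both comparisons.
print(num_male >= num_female,
      posterior_numerator_male >= posterior_numerator_female)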
I would like to remove outliers from a Pandas dataframe using a user-defined function. There are answers on Stack Overflow to the same question I am asking, but the difference is that my data are circular. Therefore, using the Pandas built-in functions mean() and std() would not be appropriate. For example, in circular data the values 355 and 5 differ by only 10, but the linear difference is 350.
I have thousands of dataframes like the one below. We clearly see that Geophone 6 is an outlier.
Geophone azimuth incidence
0 1 194.765326 29.703151
1 2 193.143982 23.380681
2 3 199.327911 34.752212
3 4 195.641010 49.186893
4 5 193.479015 21.192982
5 6 0.745142 3.410046
6 7 192.380435 29.778807
7 8 196.700814 19.750237
It can also be confirmed when plotting the data in a polar diagram.
I have written two functions, mean_angle and variance_angle, which calculate the circular mean and variance of the data. The variance is a value between 0 and 1: when the data are close to each other the variance gets closer to 0, and vice versa.
import numpy as np

def mean_angle(deg):
    deg = np.deg2rad(deg)
    S = np.array(deg)
    C = np.array(deg)
    S = S[np.isfinite(S)]  # remove np.nan
    C = C[np.isfinite(C)]
    S = np.sum(np.sin(S))
    C = np.sum(np.cos(C))
    mu = np.arctan(S / C)
    mu = np.rad2deg(mu)
    # place the mean in the correct quadrant (0-360 degrees)
    if S > 0 and C > 0:
        mu = mu
    elif S > 0 and C < 0:
        mu = mu + 180
    elif S < 0 and C < 0:
        mu = mu + 180
    elif S < 0 and C > 0:
        mu = mu + 360
    return mu

def variance_angle(deg):
    """
    deg: angles in degrees
    """
    deg = np.deg2rad(deg)
    S = np.array(deg)
    C = np.array(deg)
    S = S[np.isfinite(S)]  # remove np.nan
    C = C[np.isfinite(C)]
    length = C.size
    S = np.sum(np.sin(S))
    C = np.sum(np.cos(C))
    R = np.sqrt(S**2 + C**2)
    R_avg = R / length
    V = 1 - R_avg  # circular variance: 0 when angles coincide, 1 when fully dispersed
    return V
mean_azimuth = mean_angle(df.azimuth)
variance = variance_angle(df.azimuth)
print(mean_azimuth)
197.4122778774279
print(variance)
0.24614383460498535
However, when excluding row 5 from calculation, mean and variance become 195.06226604362286 , 0.0007544067627361928 respectively. The Variance is changed from 0.25 to almost 0.
Therefore, I would like to find a way to remove any circular outlier values (azimuth) that make the circular variance high, using the functions defined above.
In this example, incidence is also an outlier for the same geophone, but it actually has no relation to azimuth. There are other data where incidence is within range but azimuth is an outlier.
Any help is really appreciated.
One way to do outlier detection is to compute the mean and standard deviation of the data, then remove points that lie outside A*std of the mean (where you tune A to whatever is reasonable for your data).
So you could use your functions to compute the circular mean and variance of your dataframe, then pass over the dataframe again and remove data points outside A*std of the mean.
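As a concrete sketch using the mean_angle and variance_angle functions from the question (the threshold A and the helper name circular_outlier_mask are only illustrative, and the angular deviation is derived from the circular variance as sqrt(2V)):

import numpy as np

def circular_outlier_mask(deg, A=2.0):
    """True where the angle lies within A angular deviations of the circular mean."""
    deg = np.asarray(deg, dtype=float)
    mu = mean_angle(deg)
    V = variance_angle(deg)
    ang_dev = np.rad2deg(np.sqrt(2 * V))      # angular deviation in degrees
    diff = (deg - mu + 180) % 360 - 180       # smallest signed difference to the mean
    return np.abs(diff) <= A * ang_dev

mask = circular_outlier_mask(df.azimuth)
df_clean = df[mask]                           # Geophone 6 is dropped in the example above

Because the mean and deviation are computed with the outlier still in the data, you might iterate (recompute after dropping flagged points) if a single pass is not enough.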
I have run this simulation (given below) and got the simulated transition probabilities for dry-to-dry and wet-to-wet conditions. The simulated results for dry-to-dry are almost equal to the estimated dry-to-dry (d2d_tran). However, the simulated wet-to-wet values are substantially lower than the estimated ones. It seems there is something wrong in the program. I tried several other ways but haven't got the expected results. Can you please run the program and suggest how I may get improved results for the wet-to-wet probabilities? Thanks in advance.
My codes:
import numpy as np
import random

d2d = np.zeros(12)
d2w = np.zeros(12)
w2w = np.zeros(12)
w2d = np.zeros(12)
pd2d = np.zeros(12)
pw2w = np.zeros(12)
dry = [0.333]  # unconditional probability of dry for January
d2d_tran = [0.564,0.503,0.582,0.621,0.634,0.679,0.738,0.667,0.604,0.564,0.577,0.621]
w2w_tran = [0.784,0.807,0.8,0.732,0.727,0.728,0.64,0.64,0.665,0.717,0.741,0.769]
mu = [3.71,4.46,4.11,2.94,3.01,2.87,2.31,2.44,2.56,3.45,4.32,4.12]
sigma = [6.72,7.92,7.49,6.57,6.09,5.53,4.38,4.69,4.31,5.71,7.64,7.54]
days = np.array([31,28,31,30,31,30,31,31,30,31,30,31])
rain = np.array([])
for y in range(0, 10000):
    for m in range(0, 12):
        # Include leap years in the calculation and create random variables for each month
        if ((y % 4 == 0 and y % 100 != 0) or y % 400 == 0) and m == 1:
            random_num = np.random.rand(29)
        else:
            random_num = np.random.rand(days[m])
        # generate a rainfall amount for the first day of the random series
        if random_num[0] <= dry[0]:
            random_num[0] = 0
        else:
            random_num[0] = abs(random.gauss(mu[0], sigma[0]))
        # generate the whole series in sequence of month and year
        for i in range(0, days[m]):
            if random_num[i-1] == 0:  # if yesterday was dry
                if random_num[i] <= d2d_tran[m]:  # check today against the dry-to-dry transition probability
                    random_num[i] = 0
                    d2d[m] += 1.0
                else:
                    random_num[i] = abs(random.gauss(mu[m], sigma[m]))
                    d2w[m] += 1.0
            else:
                if random_num[i] <= w2w_tran[m]:
                    random_num[i] = abs(random.gauss(mu[m], sigma[m]))
                    w2w[m] += 1.0
                else:
                    random_num[i] = 0
                    w2d[m] += 1.0
        pd2d[m] = d2d[m] / (d2d[m] + d2w[m])
        pw2w[m] = w2w[m] / (w2d[m] + w2w[m])
print('Simulated transition probability of dry2dry:\n', np.around(pd2d, decimals=3))
print('Simulated transition probability of wet2wet:\n', np.around(pw2w, decimals=3))
### pd2d and pw2w of generated data should be identical to d2d_tran and w2w_tran respectively
The simulation looks correct as far as it goes, and after running it for 8000 years, I get transition probabilities within .001 most of the time, and there is convergence as the number of days increases.
Nothing guarantees that you will get the exact transition probabilities - on any single run you may get anything. What you've done is generate an estimator for each single transition probability that has mean equal to the actual value (0.345), and some positive variance. The variance of your estimator decreases with n = sample size, but it will always be positive.
If you'd like values closer to the actual transition probabilities (faster convergence), apply some well-known variance reduction techniques: Stratified Sampling, Importance Sampling, etc. - too many to mention. Here's a quick technique - take the uniform random deviates generated by np.random.rand(), and estimate as usual. Then generate another estimator using the transformed deviates: [(1-x) for x in stored_deviates]. The average of the two estimators has reduced variance (by .5).
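For illustration, here is a minimal sketch of the antithetic-deviates idea described above, estimating a single transition probability from stored uniform deviates (the names and the probability value are only illustrative):

import numpy as np

rng = np.random.default_rng(42)
p_true = 0.64                       # e.g. one wet-to-wet transition probability
u = rng.random(100_000)             # stored uniform deviates

est_plain = np.mean(u <= p_true)             # ordinary Monte Carlo estimator
est_anti = np.mean((1 - u) <= p_true)        # same estimator on the antithetic deviates 1 - u
est_combined = 0.5 * (est_plain + est_anti)  # averaged estimator with reduced variance

print(est_plain, est_anti, est_combined)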
I have a numpy array (actually imported from a GIS raster map) which contains probability values for the occurrence of a species, like the following example:
import numpy as np

a = np.random.randint(1, 20, 1200).reshape(40, 30)
b = (a * 1.0) / a.sum()   # normalize so the probabilities sum to 1
Now I want to get a discrete version of that array again. For example, if I have 100 individuals located on the area of that array (1200 cells), how are they distributed? Of course they should be distributed according to the probabilities, meaning lower values indicate a lower probability of occurrence. However, since this is statistics, there is still a chance that an individual lands in a low-probability cell. It should also be possible for multiple individuals to occupy one cell...
It is like transforming a continuous distribution curve back into a histogram. Just as many different histograms can produce a given distribution curve, the reverse should also hold. Accordingly, the algorithm I am looking for will produce different discrete values each time it is applied.
Is there any algorithm in Python which can do that? As I am not that familiar with discretization, maybe someone can help.
Use random.choice with bincount:
np.bincount(np.random.choice(b.size, 100, p=b.flat),
minlength=b.size).reshape(b.shape)
If you don't have NumPy 1.7, you can replace random.choice with:
np.searchsorted(np.cumsum(b), np.random.random(100))
giving:
np.bincount(np.searchsorted(np.cumsum(b), np.random.random(100)),
minlength=b.size).reshape(b.shape)
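For instance, a quick usage sketch on the 40x30 probability array from the question (just a check that the 100 individuals come out distributed over the grid; variable names follow the question):

import numpy as np

a = np.random.randint(1, 20, 1200).reshape(40, 30)
b = (a * 1.0) / a.sum()

counts = np.bincount(np.random.choice(b.size, 100, p=b.ravel()),
                     minlength=b.size).reshape(b.shape)
print(counts.sum())   # 100 individuals spread over the 1200 cells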
So far I think ecatmur's answer seems quite reasonable and simple.
I just want to add maybe a more "applied" example. Consider a die with 6 faces (6 numbers); each number/result has a probability of 1/6. Displaying the die in the form of an array could look like:
b = np.array([[1,1,1],[1,1,1]])/6.0
Thus rolling the die 100 times (n = 100) results in the following simulation:
np.bincount(np.searchsorted(np.cumsum(b), np.random.random(n)),minlength=b.size).reshape(b.shape)
I think that can be an appropriate approach for such an application.
Thus thank you ecatmur for your help!
/Johannes
This is similar to a question I had earlier this month.
import random

def RandFloats(Size):
    Scalar = 1.0
    VectorSize = Size
    RandomVector = [random.random() for i in range(VectorSize)]
    RandomVectorSum = sum(RandomVector)
    RandomVector = [Scalar * i / RandomVectorSum for i in RandomVector]
    return RandomVector
from numpy.random import multinomial
import math

def RandIntVec(ListSize, ListSumValue, Distribution='Normal'):
    """
    Inputs:
        ListSize = the size of the list to return
        ListSumValue = the sum of the list values
        Distribution = 'uniform' for a uniform distribution, 'normal' for a normal
            distribution ~ N(0,1) truncated at +/- 3 sigma (default), or a list of
            size 'ListSize' or 'ListSize - 1' for an empirical (arbitrary)
            distribution giving the probabilities of each of the different outcomes.
            These should sum to 1 (the last element is always assumed to account for
            the remaining probability, as long as sum(pvals[:-1]) <= 1).
    Output:
        A list of random integers of length 'ListSize' whose sum is 'ListSumValue'.
    """
    if type(Distribution) == list:
        DistributionSize = len(Distribution)
        if ListSize == DistributionSize or (ListSize - 1) == DistributionSize:
            Values = multinomial(ListSumValue, Distribution, size=1)
            OutputValue = Values[0]
        else:
            raise ValueError('Cannot create desired vector')
    elif Distribution.lower() == 'uniform':
        # I do not recommend this!!!! I see that it is not as random
        # (at least on my computer) as I had hoped
        UniformDistro = [1.0 / ListSize for i in range(ListSize)]
        Values = multinomial(ListSumValue, UniformDistro, size=1)
        OutputValue = Values[0]
    elif Distribution.lower() == 'normal':
        # Normal distribution construction... it's very flexible and hideous.
        # Assume a +/-3 sigma range. Warning: this may or may not be a suitable
        # range for your implementation! To explore a different range, change
        # the LowSigma and HighSigma values.
        LowSigma = -3   # -3 sigma
        HighSigma = 3   # +3 sigma
        StepSize = 1 / (float(ListSize) - 1)
        ZValues = [(LowSigma * (1 - i * StepSize) + (i * StepSize) * HighSigma)
                   for i in range(int(ListSize))]
        # Construction parameters for N(Mean, Variance) - default is N(0,1)
        Mean = 0
        Var = 1
        # NormalDistro = [self.NormalDistributionFunction(Mean, Var, x) for x in ZValues]
        NormalDistro = list()
        for i in range(len(ZValues)):
            if i == 0:
                ERFCVAL = 0.5 * math.erfc(-ZValues[i] / math.sqrt(2))
                NormalDistro.append(ERFCVAL)
            elif i == len(ZValues) - 1:
                ERFCVAL = NormalDistro[0]
                NormalDistro.append(ERFCVAL)
            else:
                ERFCVAL1 = 0.5 * math.erfc(-ZValues[i] / math.sqrt(2))
                ERFCVAL2 = 0.5 * math.erfc(-ZValues[i - 1] / math.sqrt(2))
                ERFCVAL = ERFCVAL1 - ERFCVAL2
                NormalDistro.append(ERFCVAL)
        # print("Normal Distribution sum = %f" % sum(NormalDistro))
        Values = multinomial(ListSumValue, NormalDistro, size=1)
        OutputValue = Values[0]
    else:
        raise ValueError('Cannot create desired vector')
    return OutputValue
ProbabilityDistribution = RandFloats(1200)  # probability distribution for your 1200-cell array
SizeDistribution = RandIntVec(1200, 100, Distribution=ProbabilityDistribution)  # a 1200-cell array whose sum is 100, drawn with the given probability distribution
The two important lines are the last two in the code above.