import sys
import numpy as np
import scipy.io as sio
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.svm import SVC
filename = sys.argv[1]
datafile = sio.loadmat(filename)
data = datafile['bow']
sizedata=[len(data), len(data[0])]
gap=[]
SD=[]
for knum in xrange(10,20):
    print knum
    #Clustering original Data
    kmeanspp = KMeans(n_clusters=knum,init = 'k-means++',max_iter = 100,n_jobs = 1)
    kmeanspp.fit(data)
    dispersion = kmeanspp.inertia_
    #Clustering Reference Data
    nrefs = 10
    refDisp = np.zeros(nrefs)
    for nref in xrange(nrefs):
        refdata = np.random.random_sample((sizedata[0],sizedata[1]))
        refkmeans = KMeans(n_clusters=knum,init='k-means++',max_iter=100,n_jobs=1)
        refkmeans.fit(refdata)
        refdisp = refkmeans.inertia_
        refDisp[nref]=np.log(refdisp)
    mean_log_refdisp = np.mean(refDisp)
    gap.append(mean_log_refdisp-np.log(dispersion))
    #Calculating standard deviation
    sd = (sum([(r-m)**2 for r,m in zip(refDisp,[mean_log_refdisp]*nrefs)])/nrefs)**0.5
    SD.append(sd)
SD = [sd*((1+(1/nrefs))**0.5) for sd in SD]
#determining optimal k
opt_k = None
diff = []
for i in xrange(len(gap)-1):
    diff = (SD[i+1]-(gap[i+1]-gap[i]))
    if diff>0:
        opt_k = i+10
        break
print diff
plt.plot(np.linspace(10,19,10,True),gap)
plt.show()
Here I am trying to implement the Gap Statistic method for determining the optimal number of clusters, but every time I run the code I get a different value for k.
What is the solution to this problem?
How can the optimal k differ between runs on the same data?
I have stored the data in a .mat file beforehand and I pass its filename as a command-line argument.
I am looking for the smallest value of k for which Gap(k) >= Gap(k+1) - s(k+1), where s(k+1) = sd(k+1) * sqrt(1 + 1/B), sd is the standard deviation of the reference distribution, and B is the number of Monte Carlo copies of the reference sample.
Otherwise stated, I am searching for the smallest k for which
s(k+1) - Gap(k+1) + Gap(k) >= 0
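For concreteness, this is the selection rule I am trying to apply, written as a small sketch (assuming gap and s are lists where index i corresponds to k = 10 + i, as in the code above):

def smallest_k(gap, s, k_start=10):
    # smallest k with Gap(k) >= Gap(k+1) - s(k+1),
    # i.e. s(k+1) - Gap(k+1) + Gap(k) >= 0
    for i in range(len(gap) - 1):
        if s[i+1] - gap[i+1] + gap[i] >= 0:
            return k_start + i
    return None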
A couple of problems with your simulation:
1.
sd = (sum([(r-m)**2 for r,m in zip(refDisp,[mean_log_refdisp]*nrefs)])/nrefs)**0.5
Why did you multiply the second component of zip by nrefs? That is not needed according to the original paper.
2.
if diff>0:
    opt_k = i+10
    break
You want diff>=0 rather than diff>0, since equality can happen as well.
As for why you get a different number of clusters each time: as others have said, it is a Monte Carlo simulation, so there is inherent randomness, and the result also depends on what you are clustering and on your dataset. I suggest you test your algorithm against the Silhouette and Elbow methods to get a better idea of the number of clusters.
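For example, a quick silhouette check with scikit-learn, reusing the data array and the same range of k as in your code (just a sketch, not a tuned comparison):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# silhouette score for each candidate k; higher is better
for knum in range(10, 20):
    labels = KMeans(n_clusters=knum, init='k-means++', max_iter=100).fit_predict(data)
    print(knum, silhouette_score(data, labels))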
One option is to run your function several times, average the gap statistics and the s values, and find the smallest k where the averaged s(k+1)-Gap(k+1)+Gap(k) is greater than or equal to zero.
This will take longer but give a more reliable result.
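A rough sketch of that idea, assuming your gap-statistic loop is wrapped in a hypothetical helper gap_statistic(data) that returns the gap and s lists for k = 10..19:

import numpy as np

n_runs = 20  # arbitrary number of repetitions
all_gap, all_s = [], []
for _ in range(n_runs):
    g, s = gap_statistic(data)   # hypothetical helper returning gap and s per k
    all_gap.append(g)
    all_s.append(s)
gap_avg = np.mean(all_gap, axis=0)
s_avg = np.mean(all_s, axis=0)

# smallest k (starting at 10) where the averaged criterion holds
opt_k = next((10 + i for i in range(len(gap_avg) - 1)
              if s_avg[i+1] - gap_avg[i+1] + gap_avg[i] >= 0), None)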
Related
I have 100 clusters, each with a mean and standard deviation value. These clusters are predefined using the SPSS software package, by using the 2-step cluster method. Therefore, the optimisation of these cluster distributions to fit the data has already been done.
For new (unseen) data, we want to assign cluster membership by selecting the maximum log-likelihood cluster, for any given set of coordinates X. To do this, I have written my own code for comparison with what was output by SPSS using the same method: https://www.norusis.com/pdf/SPC_v19.pdf
Using data that has been correctly labelled by SPSS, about 42% of the clusters are labelled correctly by my code when minimising the RMSE to the cluster mean (which is not what SPSS does), and less than 20% are labelled correctly when assigning the maximum log-likelihood cluster (which is what SPSS reports doing).
I know that the maximum log-likelihood cluster should be the correct cluster ( https://www.norusis.com/pdf/SPC_v19.pdf ), but there is only a 20% success rate from this code when compared to the correct cluster labels from SPSS. What am I doing wrong?
Here is the code below.
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
import math
from scipy import stats
# import raw files
clusters_df = pd.read_csv('ClusterCoordinates.csv') # clusters are in order of cluster numbers enabling us to use index for identification
clusters_df = clusters_df.drop(columns=['Cluster'])
print(clusters_df.shape)
clusters = clusters_df.to_numpy()
frames_df_raw = pd.read_csv('FrameCoordinates.csv')
frames_df = frames_df_raw.drop(columns=['frame','replica','voltage','system','ff','cluster'])
print(frames_df.shape)
frames = frames_df.to_numpy()
clusters_sd_df = pd.read_csv('ClusterCoordinates_SD.csv')
clusters_sd_df = clusters_sd_df.drop(columns=['Cluster'])
print(clusters_sd_df.shape)
clusters_sd = clusters_sd_df.to_numpy()
rmseCalc = []
llCalc = []
assignedCluster_RMSE = []
assignedCluster_LL = []
# create tables with RMSE and LL values
for frame in frames:
    for cluster, cluster_sd in zip(clusters, clusters_sd):
        # we compare cluster assignment using minimum RMSE vs maximum log likelihood methods.
        rmseCalc.append(math.sqrt(mean_squared_error(np.array(cluster),np.array(frame))))
        llCalc.append(-np.sum(stats.norm.logpdf(frame, loc=cluster, scale=cluster_sd)))
    rmseCalc=np.array(rmseCalc)
    llCalc=np.array(llCalc)
    llCalc=np.nan_to_num(llCalc)
    minRMSE = np.where(rmseCalc==rmseCalc.min())
    maxLL = np.where(llCalc==llCalc.min())
    print(maxLL[0][0]+1)
    assignedCluster_RMSE.append(minRMSE[0][0]+1)
    assignedCluster_LL.append(maxLL[0][0]+1)
    rmseCalc=[]
    llCalc=[]
frames_df_raw['predCluster_RMSE'] = np.array(assignedCluster_RMSE)
frames_df_raw['predCluster_LL'] = np.array(assignedCluster_LL)
frames_df_raw.to_csv('frames_clustered.csv')
I was expecting the cluster labels assigned by the code to match those already assigned by SPSS, since the methods used are intended to be the same.
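For reference, here is a minimal vectorized sketch of the maximum log-likelihood assignment described above (it reuses the numpy and scipy.stats imports and the clusters, clusters_sd and frames arrays from the code, and assumes independent Gaussian dimensions with nonzero standard deviations):

# log-likelihood of every frame under every cluster in one pass
# shapes: frames (n_frames, n_dims), clusters (n_clusters, n_dims), clusters_sd (n_clusters, n_dims)
log_lik = np.stack([
    stats.norm.logpdf(frames, loc=mu, scale=sd).sum(axis=1)
    for mu, sd in zip(clusters, clusters_sd)
], axis=1)                                      # (n_frames, n_clusters)
best_cluster = np.argmax(log_lik, axis=1) + 1   # +1 to match 1-based cluster IDs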
I have a dataset containing around 1000 different time series. Some of these are showing clear periodicity, and some are not.
I want to be able to automatically determine if a time series has clear periodicity in it, so I know if I need to do seasonal decomposition of it before applying some outlier methods.
Here is a signal with daily periodicity; each sample is taken at a 15-minute interval.
To try to automatically determine whether there is daily periodicity, I have tried different methods. The first approach uses the seasonality detectors from the kats library.
from kats.consts import TimeSeriesData
from kats.detectors.seasonality import FFTDetector, ACFDetector
def detect_seasonality(df, feature, time_col, detector_type):
    df_kpi = df[[feature]].reset_index().rename(columns={feature: 'value'})
    ts = TimeSeriesData(df_kpi, time_col_name=time_col)
    if detector_type == 'fft':
        detector = FFTDetector(ts)
    elif detector_type == 'acf':
        detector = ACFDetector(ts)
    else:
        raise Exception("Detector types are fft or acf")
    detection = detector.detector()
    seasonality_presence = detection['seasonality_presence']
    return seasonality_presence
This approach returned "False" for seasonality presence with both the fft and the acf detector.
Another approach is using the FFT directly:
import numpy as np
import scipy.signal
from matplotlib import pyplot as plt
L = np.array(df[kpi_of_interest].values)
L -= np.mean(L)
# Window signal
L *= scipy.signal.windows.hann(len(L))
fft = np.fft.rfft(L, norm="ortho")
plt.figure()
plt.plot(abs(fft))
But here I don't see any clear way to determine the daily periodicity I expected.
So, in order to automatically detect the daily periodicity, are there any other, better methods to apply here? Are there any preprocessing steps I should apply beforehand? Or could it simply be a lack of data? I only have around 10 days of data for each time series.
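For example, one more possibility would be to check the autocorrelation at the lag corresponding to one day; a rough sketch with a hypothetical helper (96 samples per day follows from the 15-minute sampling interval stated above, and the threshold is arbitrary):

import numpy as np

def has_daily_periodicity(values, samples_per_day=96, threshold=0.3):
    # flag daily periodicity from the autocorrelation at the daily lag
    x = np.asarray(values, dtype=float)
    x = x - x.mean()
    if len(x) <= samples_per_day or x.std() == 0:
        return False
    r = np.corrcoef(x[:-samples_per_day], x[samples_per_day:])[0, 1]
    return r > threshold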
Let's say I have two datasets/tuples, "Left" and "Right". Some values are present in both datasets due to the overlap. How can I find the best transformation of "Right" to combine it with "Left"?
I have only found geometric transformations for matrices and images, not for datasets.
In this example the data are identical apart from the transformation. What if the data differ slightly with noise? Would the best transformation then be the result of a RANSAC fit, i.e. a homography? This should be possible with sklearn (scikit-learn), shouldn't it? Again, most of the results I looked at were based on matrices and images, not datasets.
I have the feeling this problem has occurred many times and a nice solution is definitely out there somewhere, but unfortunately I couldn't find any.
Thank you very much for your help!
import numpy as np
import matplotlib.pyplot as plt
#Load data from txt file
dataleft = np.genfromtxt(r'C:\Data2MergeLeft.txt',delimiter="\t")
dataright = np.genfromtxt(r'C:\Data2MergeRight.txt',delimiter="\t")
#Find overlap
overlapleft = np.where(dataleft[:,0]>=dataright[1,0])[0]
overlapright = np.where(dataright[:,0]<dataleft[-1,0])[0]
#Trim data (overlap only)
olleft = dataleft[overlapleft]
olright = dataright[overlapright]
#Transformed data -> new array (copy so the original overlap data is not modified)
olrightnew = olright.copy()
#Initial values of transformation
offsetx = 10
linyconstant = 5
linyfactor = 1.2
#Loop for optimization
#Transformation
olrightnew[:,0] = olright[:,0]-offsetx
olrightnew[:,1] = (olright[:,1]-linyconstant)/linyfactor
#Residual
residuals = olright[:,1]-olrightnew[:,1]
sqresidual = residuals*residuals
residual = np.sum(sqresidual)
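For illustration, here is a rough sketch of one way the "#Loop for optimization" above could be filled in with scipy.optimize.least_squares, continuing from the script (it assumes column 0 is x and column 1 is y, and that the left data are sorted by x so np.interp can be used; this is a sketch, not a finished solution):

from scipy.optimize import least_squares

def transform_residuals(params, left, right):
    # params = [offsetx, linyconstant, linyfactor]
    offsetx, linyconstant, linyfactor = params
    x_new = right[:, 0] - offsetx
    y_new = (right[:, 1] - linyconstant) / linyfactor
    # compare the transformed right data with the left data at the same x positions
    y_left = np.interp(x_new, left[:, 0], left[:, 1])
    return y_new - y_left

fit = least_squares(transform_residuals, x0=[offsetx, linyconstant, linyfactor],
                    args=(olleft, olright))
offsetx, linyconstant, linyfactor = fit.x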
I'm working on fitting muon lifetime data to a curve to extract the mean lifetime using the lmfit module. The general process I'm using is to bin the 13,000 data points into 10 bins using the histogram function, calculate the uncertainty in each bin as the square root of its counts (it's an exponential model), then use lmfit to determine the best fit along with the parameter values and uncertainties. However, graphing the output of the model.fit() method gives a plot where the red line is the fit, and it is obviously not the correct fit (see the fit result output graph).
I've looked online and can't find a solution to this, I'd really appreciate some help figuring out what's going on. Here's the code.
import os
import numpy as np
import matplotlib.pyplot as plt
from numpy import sqrt, pi, exp, linspace
from lmfit import Model
class data():
    def __init__(self, file_name):
        times_dirty = sorted(np.genfromtxt(file_name, delimiter=' ', unpack=False)[:, 0])
        self.times = []
        for i in range(len(times_dirty)):
            if times_dirty[i] < 40000:
                self.times.append(times_dirty[i])
        self.counts = []
        self.binBounds = []
        self.uncertainties = []
        self.means = []

    def binData(self, k):
        self.counts, self.binBounds = np.histogram(self.times, bins=k)
        self.binBounds = self.binBounds[:-1]

    def calcStats(self):
        if len(self.counts) == 0:
            print('Run binData function first')
        else:
            self.uncertainties = sqrt(self.counts)

    def plotData(self, fit):
        plt.errorbar(self.binBounds, self.counts, yerr=self.uncertainties, fmt='bo')
        plt.plot(self.binBounds, fit.init_fit, 'k--')
        plt.plot(self.binBounds, fit.best_fit, 'r')
        plt.show()

def decay(t, N, lamb, B):
    return N * lamb * exp(-lamb * t) + B

def main():
    # raw string so the backslashes in the Windows path are not treated as escapes
    muonEvents = data(r'C:\Users\Colt\Downloads\muon.data')
    muonEvents.binData(10)
    muonEvents.calcStats()

    mod = Model(decay)
    result = mod.fit(muonEvents.counts, t=muonEvents.binBounds, N=1, lamb=1, B=1)

    muonEvents.plotData(result)
    print(result.fit_report())
    print(len(muonEvents.times))

if __name__ == "__main__":
    main()
This might be a simple scaling problem. As a quick test, try dividing all raw data by a factor of 1000 (both X and Y) to see if changing the magnitude of the data has any effect.
Just to build on James Phillips' answer, I think the data you show in your graph imply values for N, lamb, and B that are very different from 1, 1, 1. Keep in mind that exp(-lamb*t) is essentially 0 for lamb = 1 and t > 100. So, if the algorithm starts at lamb=1 and varies it by a little bit to find a better value, it won't actually be able to see any difference in how well the model matches the data.
I would suggest trying to start with values that are more reasonable for the data you have, perhaps N=1.e6, lamb=1.e-4, and B=100.
As James suggested, having the variables have values on the order of 1 and putting in scale factors as necessary is often helpful in getting numerically stable solutions.
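A minimal sketch of what that looks like with lmfit, inside main() from the question's code (the starting values below are the rough guesses suggested above, not fitted results):

# rough starting values as suggested above; a lower bound on lamb keeps the
# exponential from running away during the fit
params = mod.make_params(N=1.e6, lamb=1.e-4, B=100)
params['lamb'].set(min=0)
result = mod.fit(muonEvents.counts, params, t=muonEvents.binBounds)
print(result.fit_report())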
So I have successfully found the optimal number of clusters for the k-means algorithm in Python, but now how can I find out the exact size of each cluster after applying k-means?
Here's a code snippet
import math
import numpy as np
import scipy.cluster as cluster
from scipy.cluster.vq import kmeans, vq

data = np.vstack(zip(simpleassetid_arr, simpleuidarr))
centroids,_ = kmeans(data, round(math.sqrt(len(uidarr)/2)))  # rule-of-thumb initial k
idx,_ = vq(data, centroids)

initial = [cluster.vq.kmeans(data, i) for i in range(1, 10)]
var = [var for (cent, var) in initial]  # to determine the optimal number of k using the elbow test
num_k = int(raw_input("Enter the number of clusters: "))
cent, var = initial[num_k-1]
assignment, cdist = cluster.vq.vq(data, cent)
You can get the cluster size using this:
print np.bincount(idx)
For the example below, np.bincount(idx) outputs an array of two elements, e.g. [156 144].
from numpy import vstack,array
import numpy as np
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq
# data generation
data = vstack((rand(150,2) + array([.5,.5]),rand(150,2)))
# computing K-Means with K = 2 (2 clusters)
centroids,_ = kmeans(data,2)
# assign each sample to a cluster
idx,_ = vq(data,centroids)
#Print number of elements per cluster
print np.bincount(idx)
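If you are using scikit-learn's KMeans instead of scipy's (as in other examples here), the same idea applies to the fitted labels_ attribute, for example:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2).fit(data)
print np.bincount(km.labels_)  # number of samples in each cluster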