How to vectorize bigrams with the hashing-trick in scikit-learn? - python
I have some bigrams, let's say: [('word','word'),('word','word'),...,('word','word')]. How can I use scikit-learn's HashingVectorizer to create a feature vector that will subsequently be fed to a classification algorithm, e.g. SVC or Naive Bayes or any other classifier?
Firstly, you must understand what the different vectorizers do. Most vectorizers are based on the bag-of-words approach, where a document's tokens are mapped onto a matrix.
From the sklearn documentation, CountVectorizer and HashingVectorizer:
Convert a collection of text documents to a matrix of token counts
For instance, these sentences
The Fulton County Grand Jury said Friday an investigation of Atlanta's
recent primary election produced no evidence that any
irregularities took place .
The jury further said in term-end presentments that the City Executive
Committee , which had over-all charge of the election , `` deserves
the praise and thanks of the City of Atlanta '' for the manner in
which the election was conducted .
with this rough vectorizer:
from collections import Counter
from itertools import chain
from string import punctuation
from nltk.corpus import brown, stopwords
# Let's use the first two Brown corpus sentences as the data.
sentences = brown.sents()[:2]
# Extract the content words as features, i.e. columns.
vocabulary = list(chain(*sentences))
stops = stopwords.words('english') + list(punctuation)
vocab_nostop = [i.lower() for i in vocabulary if i not in stops]
# Create a matrix from the sentences: one Counter of content words per sentence.
matrix = [Counter([w for w in words if w in vocab_nostop]) for words in sentences]
print(matrix)
would become:
[Counter({u"''": 1, u'``': 1, u'said': 1, u'took': 1, u'primary': 1, u'evidence': 1, u'produced': 1, u'investigation': 1, u'place': 1, u'election': 1, u'irregularities': 1, u'recent': 1}), Counter({u'the': 6, u'election': 2, u'presentments': 1, u'``': 1, u'said': 1, u'jury': 1, u'conducted': 1, u"''": 1, u'deserves': 1, u'charge': 1, u'over-all': 1, u'praise': 1, u'manner': 1, u'term-end': 1, u'thanks': 1})]
This would be rather inefficient on a very large dataset, so the sklearn developers built more efficient code. One of the most important features of sklearn's vectorizers is that you don't even need to load the whole dataset into memory before vectorizing it.
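As a rough sketch of that property (this is not part of the original example, and the tiny in-memory generator just stands in for a large file), HashingVectorizer is stateless, so it can transform a lazy stream of documents without fitting a vocabulary first:
from sklearn.feature_extraction.text import HashingVectorizer

def stream_docs():
    # Stand-in for lazily reading lines from a huge file.
    for line in ["one document here", "and another document"]:
        yield line

hv = HashingVectorizer(analyzer='word', n_features=2**10)
X = hv.transform(stream_docs())   # stateless: no fit() and no vocabulary kept around
print(X.shape)                    # (2, 1024) sparse matrix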
Since it's unclear what your task is, I'll assume you're looking for general usage. Let's say you're using it for language identification.
Say your input file with the training data is train.txt:
Pošto je EULEX obećao da će obaviti istragu o prošlosedmičnom izbijanju nasilja na sjeveru Kosova, taj incident predstavlja još jedan ispit kapaciteta misije da doprinese jačanju vladavine prava.
De todas as provações que teve de suplantar ao longo da vida, qual foi a mais difícil? O início. Qualquer começo apresenta dificuldades que parecem intransponíveis. Mas tive sempre a minha mãe do meu lado. Foi ela quem me ajudou a encontrar forças para enfrentar as situações mais decepcionantes, negativas, as que me punham mesmo furiosa.
Al parecer, Andrea Guasch pone que una relación a distancia es muy difícil de llevar como excusa. Algo con lo que, por lo visto, Alex Lequio no está nada de acuerdo. ¿O es que más bien ya ha conseguido la fama que andaba buscando?
Vo väčšine golfových rezortov ide o veľký komplex niekoľkých ihrísk blízko pri sebe spojených s hotelmi a ďalšími možnosťami trávenia voľného času – nie vždy sú manželky či deti nadšenými golfistami, a tak potrebujú iný druh vyžitia. Zaujímavé kombinácie ponúkajú aj rakúske, švajčiarske či talianske Alpy, kde sa dá v zime lyžovať a v lete hrať golf pod vysokými alpskými končiarmi.
And your corresponding labels are Bosnian, Portuguese, Spanish and Slovak, i.e.
[bs, pt, es, sk]
Here's one way to use CountVectorizer with a Naive Bayes classifier. The following example is from https://github.com/alvations/bayesline, written for the DSL shared task.
Let's start with the vectorizer. It reads the training file, converts the training set into a vectorized matrix and initializes the vectorizer's features:
import codecs
from sklearn.feature_extraction.text import CountVectorizer

trainfile = 'train.txt'

# Vectorizing data: one line of train.txt = one training document.
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
tags = ['bs', 'pt', 'es', 'sk']

# In scikit-learn >= 1.0, use get_feature_names_out() instead.
print(word_vectorizer.get_feature_names())
[out]:
[u'acuerdo', u'aj', u'ajudou', u'al', u'alex', u'algo', u'alpsk\xfdmi', u'alpy', u'andaba', u'andrea', u'ao', u'apresenta', u'as', u'bien', u'bl\xedzko', u'buscando', u'come\xe7o', u'como', u'con', u'conseguido', u'da', u'de', u'decepcionantes', u'deti', u'dificuldades', u'dif\xedcil', u'distancia', u'do', u'doprinese', u'druh', u'd\xe1', u'ela', u'encontrar', u'enfrentar', u'es', u'est\xe1', u'eulex', u'excusa', u'fama', u'foi', u'for\xe7as', u'furiosa', u'golf', u'golfistami', u'golfov\xfdch', u'guasch', u'ha', u'hotelmi', u'hra\u0165', u'ide', u'ihr\xedsk', u'incident', u'intranspon\xedveis', u'in\xedcio', u'in\xfd', u'ispit', u'istragu', u'izbijanju', u'ja\u010danju', u'je', u'jedan', u'jo\u0161', u'kapaciteta', u'kde', u'kombin\xe1cie', u'komplex', u'kon\u010diarmi', u'kosova', u'la', u'lado', u'lequio', u'lete', u'llevar', u'lo', u'longo', u'ly\u017eova\u0165', u'mais', u'man\u017eelky', u'mas', u'me', u'mesmo', u'meu', u'minha', u'misije', u'mo\u017enos\u0165ami', u'muy', u'm\xe1s', u'm\xe3e', u'na', u'nada', u'nad\u0161en\xfdmi', u'nasilja', u'negativas', u'nie', u'nieko\u013ek\xfdch', u'no', u'obaviti', u'obe\u0107ao', u'para', u'parecem', u'parecer', u'pod', u'pone', u'pon\xfakaj\xfa', u'por', u'potrebuj\xfa', u'po\u0161to', u'prava', u'predstavlja', u'pri', u'prova\xe7\xf5es', u'pro\u0161losedmi\u010dnom', u'punham', u'qual', u'qualquer', u'que', u'quem', u'rak\xfaske', u'relaci\xf3n', u'rezortov', u'sa', u'sebe', u'sempre', u'situa\xe7\xf5es', u'sjeveru', u'spojen\xfdch', u'suplantar', u's\xfa', u'taj', u'tak', u'talianske', u'teve', u'tive', u'todas', u'tr\xe1venia', u'una', u've\u013ek\xfd', u'vida', u'visto', u'vladavine', u'vo', u'vo\u013en\xe9ho', u'vysok\xfdmi', u'vy\u017eitia', u'v\xe4\u010d\u0161ine', u'v\u017edy', u'ya', u'zauj\xedmav\xe9', u'zime', u'\u0107e', u'\u010dasu', u'\u010di', u'\u010fal\u0161\xedmi', u'\u0161vaj\u010diarske']
Let's say your test documents are in test.txt, whose labels are Spanish (es) and Portuguese (pt):
Por ello, ha insistido en que Europa tiene que darle un toque de atención porque Portugal esta incumpliendo la directiva del establecimiento del peaje
Estima-se que o mercado homossexual só na Cidade do México movimente cerca de oito mil milhões de dólares, aproximadamente seis mil milhões de euros
Now you can label the test documents with the trained classifier:
import codecs
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

trainfile = 'train.txt'
testfile = 'test.txt'

# Vectorizing data.
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
tags = ['bs', 'pt', 'es', 'sk']

# Training NB.
mnb = MultinomialNB()
mnb.fit(trainset, tags)

# Tagging the test documents.
testset = word_vectorizer.transform(codecs.open(testfile, 'r', 'utf8'))
results = mnb.predict(testset)
print(results)
[out]:
['es' 'pt']
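If you also have gold labels for the test documents, scoring the predictions is a one-liner; the gold_tags list below is hypothetical, just to show the call:
from sklearn.metrics import accuracy_score

gold_tags = ['es', 'pt']                    # hypothetical gold labels for test.txt
print(accuracy_score(gold_tags, results))   # 1.0 if the predictions above are correct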
For more on text classification, you might also find this NLTK-related question/answer useful: nltk NaiveBayesClassifier training for sentiment analysis.
To use the HashingVectorizer, note that by default it produces feature values that can be negative, and the MultinomialNB classifier cannot handle negative values, so you would have to use another classifier, for example:
import codecs
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Perceptron

trainfile = 'train.txt'
testfile = 'test.txt'

# Vectorizing data.
word_vectorizer = HashingVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
tags = ['bs', 'pt', 'es', 'sk']

# Training the Perceptron (older scikit-learn versions call this parameter n_iter).
pct = Perceptron(max_iter=100)
pct.fit(trainset, tags)

# Tagging the test documents.
testset = word_vectorizer.transform(codecs.open(testfile, 'r', 'utf8'))
results = pct.predict(testset)
print(results)
[out]:
['es' 'es']
But do note that the Perceptron's results are worse in this small example. Different classifiers fit different tasks, different features fit different vectorizers, and different classifiers accept different kinds of vectors.
There is no perfect model, just better or worse ones.
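As a side note, if you would rather keep MultinomialNB with hashed features, recent scikit-learn versions let you switch off the sign alternation so the hashed values stay non-negative. A minimal sketch, reusing train.txt and test.txt from above:
import codecs
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# alternate_sign=False (scikit-learn >= 0.19) keeps every hashed feature value >= 0,
# which is what MultinomialNB requires.
word_vectorizer = HashingVectorizer(analyzer='word', alternate_sign=False)
trainset = word_vectorizer.transform(codecs.open('train.txt', 'r', 'utf8'))
mnb = MultinomialNB()
mnb.fit(trainset, ['bs', 'pt', 'es', 'sk'])
testset = word_vectorizer.transform(codecs.open('test.txt', 'r', 'utf8'))
print(mnb.predict(testset))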
Since you've already extracted the bigrams yourself, you can vectorize them with a FeatureHasher. The main thing you need to do is squash each bigram into a single string, e.g.:
>>> data = [[('this', 'is'), ('is', 'a'), ('a', 'text')],
... [('and', 'one'), ('one', 'more')]]
>>> from sklearn.feature_extraction import FeatureHasher
>>> fh = FeatureHasher(input_type='string')
>>> X = fh.transform(((' '.join(x) for x in sample) for sample in data))
>>> X
<2x1048576 sparse matrix of type '<type 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
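If, on the other hand, you still have the raw text and just want word-bigram features, HashingVectorizer can extract and hash the bigrams itself via ngram_range; a small sketch with made-up documents:
from sklearn.feature_extraction.text import HashingVectorizer

raw_docs = ["this is a text", "and one more"]   # hypothetical raw documents
hv = HashingVectorizer(analyzer='word', ngram_range=(2, 2))
X = hv.transform(raw_docs)
print(X.shape)   # (2, 1048576), the same width as the FeatureHasher output above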
Related
Regarding a basic image-classification machine-learning exercise in Python
Apologies in advance for the rough code. I am trying to train three models with three different machine-learning algorithms that classify images based on labels. I got to the point where I have created a pandas dataframe that contains a column for the images and a column for each image's label. At the beginning I show some code for how I create the images: I get them from a binary-encoded data file, use unpickle to decode it into a dictionary with all the images, and build each image with np.dstack() from its three colour channels (this is the input we were given). I think the problem is the dimensions of the images: each image in the dataframe column has shape (32, 32, 3), because each image is 32x32 pixels with three layers of colour (red, green, blue). I keep getting errors like "ValueError: setting an array element with a sequence". I feel nowhere near getting this right; I have tried the flatten function but it does not seem to work either. What should I do to adapt the input? I am very lost right now and have tried everything.

# The data of each entity contains the values of the image. The image is obtained by
# combining three channels/layers (red, green, blue) as follows:
ch0 = d0[0:1024]
ch1 = d0[1024:2048]
ch2 = d0[2048:]
# Each channel is a layer of the corresponding colour.
ch0 = np.reshape(ch0, (32,32)) # red
ch1 = np.reshape(ch1, (32,32)) # green
ch2 = np.reshape(ch2, (32,32)) # blue
# Combining them gives an image with the three colours:
image = np.dstack((ch0, ch1, ch2))
fig, ax = plt.subplots(figsize=(2, 2))
ax.imshow(image)
plt.show()

import random as r
import cv2

categoriasAleatorias = []
for i in range(3):
    num = r.randint(0,19)
    categoriasAleatorias.append(dataMeta[b'coarse_label_names'][num])
print(categoriasAleatorias)

# For each image.
columnaImagenes = []
columnaEtiquetas = []
for i in range(len(data[b'data'])):
    if dataMeta[b'coarse_label_names'][data[b'coarse_labels'][i]] in categoriasAleatorias:
        columnaEtiquetas.append(dataMeta[b'coarse_label_names'][data[b'coarse_labels'][i]])
        # Each entity sits at one position of each of the attributes above.
        d0 = data[b'data'][i]
        # The data of each entity contains the values of the image. The image is obtained
        # by combining three channels/layers (red, green, blue) as follows:
        ch0 = d0[0:1024]
        ch1 = d0[1024:2048]
        ch2 = d0[2048:]
        # Each channel is a layer of the corresponding colour.
        ch0 = np.reshape(ch0, (32,32)) # red
        ch1 = np.reshape(ch1, (32,32)) # green
        ch2 = np.reshape(ch2, (32,32)) # blue
        # Combining them gives an image with the three colours:
        image = np.dstack((ch0, ch1, ch2))
        columnaImagenes.append(image)

print(len(columnaEtiquetas))
print(len(columnaImagenes))
df = pd.DataFrame({'etiqueta': columnaEtiquetas, 'imagen': columnaImagenes})

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MultiLabelBinarizer

X_train = df['imagen']
y_train = df['etiqueta']

# Create and train the SVM model.
svm = SVC()
svm.fit(X_train, y_train)

# Create and train the Random Forest model.
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

# Create and train the KNN model.
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

flatten(), reshape(): nothing seems to work. When I do this:

print(len(columnaEtiquetas))
print(len(columnaImagenes))
print(columnaImagenes[0].shape)
df = pd.DataFrame({'etiqueta': columnaEtiquetas, 'imagen': columnaImagenes})
print(df.head())

I obtain this:

7500
7500
(32, 32, 3)
             etiqueta                                             imagen
0  b'food_containers'  [[[178, 168, 176], [175, 165, 173], [175, 165,...
1            b'trees'  [[[254, 254, 254], [255, 255, 255], [255, 255,...
2            b'trees'  [[[152, 71, 119], [154, 77, 124], [152, 84, 13...
3            b'trees'  [[[153, 157, 168], [155, 160, 167], [162, 168,...
4            b'trees'  [[[48, 78, 134], [51, 88, 148], [53, 86, 150],...

This is how my images are made: 32x32 pixels, and 32x32x3 because there are three layers of colour. I do not know how to fit this data or transform it in a way that the model accepts. I tried your suggested approach on the image but it kept giving me the same exception.
Indeed, none of the methods you use will work on 32x32x3 arrays. If you want to fully utilise the spatial information in the images you should use convolutions, but that is a more advanced technique. What you need here is to flatten each image into an array of 3072 features. I experimented in a shell, and if you do columnaImagenes.append(image.flatten().tolist()) you should get 3072-element feature arrays. If that is not working, I'd suggest taking an element from the column and checking its shape with np.shape. Also, there is a part of your code where you do a colour-to-grayscale conversion (cvtColor). If I understand correctly, you are converting the image as BGR while you actually stacked it as RGB, which is a different channel order, so the grayscale values would differ.
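To make that concrete, here is a minimal sketch of the flattening step with random stand-in arrays instead of the real images (the variable names only mirror the ones in the question):
import numpy as np
from sklearn.svm import SVC

# Stand-ins for columnaImagenes / columnaEtiquetas: ten fake 32x32x3 images.
images = [np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8) for _ in range(10)]
labels = ['trees'] * 5 + ['food_containers'] * 5

# Flatten every image to a 3072-long vector and stack them into an (n_samples, 3072) matrix.
X = np.stack([img.flatten() for img in images]).astype(np.float64)
print(X.shape)      # (10, 3072) -- a shape scikit-learn estimators accept

svm = SVC()
svm.fit(X, labels)  # no "setting an array element with a sequence" error any more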
SpaCy - TextCategorizer - Bag Of Words: Is there a way to show the vectorized document?
I just trained and implemented a text categorizer using spaCy 3.0. Everything went smoothly, but I'd like to visualize the vectorized document ([13, 0, 0, 120, ...etc]) in order to better understand which features (words) drove the bag-of-words (BoW) model to classify the document into a specific class.

nlp = spacy.load('./nlp_single_label_cli/output/model-best')
documents = pd.read_csv(target_directory+'_ocr.csv')
...
test_texts = documents['text'].values
test_docs = [nlp.tokenizer(text) for text in test_texts]
text_categorizer = nlp.get_pipe('textcat')
scores = text_categorizer.predict(test_docs)
predicted_labels = scores.argmax(axis=1)

I created the model from scratch with just the TextCategorizer layer (following the spaCy 3.0 guidelines). Because of this, the documents have no .vector attribute. This is my config.cfg:

[nlp]
lang = "it"
pipeline = ["textcat"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.textcat]
factory = "textcat"
scorer = {"@scorers":"spacy.textcat_scorer.v1"}
threshold = 0.5

[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"
...
In spaCy you can use .vector to get the vector representation of a document.

import spacy

nlp = spacy.load("en_core_web_sm")
doc1 = nlp("I like apples.")
doc2 = nlp("I like bananas.")
doc3 = nlp("The book is about quantum physics.")

print(doc1.vector)
> [-0.28571516 ... 0.77876693]
print(doc2.vector)
> [ 3.14834714e-02 ... 5.22061586e-01]
print(doc3.vector)
> [ 0.08713525 ... 0.2642293 ]

You can visualize these vectors, for example, by using PCA in sklearn to reduce the dimensions to two and then creating a basic plot:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy

pca_ = PCA(n_components=2).fit_transform(numpy.array([doc.vector for doc in [doc1, doc2, doc3]]))
pca_x = [x_[0] for x_ in pca_]
pca_y = [x_[1] for x_ in pca_]

ax = plt.gca()
plt.scatter(pca_x, pca_y, alpha=0.25)
ax.set_xlim([-3, 3])
ax.set_ylim([-3, 3])
for i, name in enumerate(["doc1", "doc2", "doc3"]):
    text = ax.annotate(name, (pca_x[i], pca_y[i]))
plt.show()

This gives you a scatter plot with the three documents annotated by name.

I am not sure if this is exactly what you meant, though. If you mean not the generic vector representation of each document but the feature vectors or something similar, you can do the same thing and use those instead of .vector. Of course, by reducing the dimensions you will not be able to see the individual vector values in the plot afterwards.
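If what you are after is not doc.vector but the explicit count vector of a bag-of-words model (the [13, 0, 0, 120, ...] style array from the question), one rough workaround is to rebuild the counts outside spaCy with scikit-learn's CountVectorizer. This is only a stand-in for the features the trained TextCatBOW component uses internally, not a way to read them out of the model:
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I like apples.", "I like bananas."]   # stand-ins for your test_texts
cv = CountVectorizer(ngram_range=(1, 1))        # unigrams, like ngram_size = 1 in the config
bow = cv.fit_transform(texts)
print(cv.get_feature_names_out())               # the feature columns (words)
print(bow.toarray())                            # one count vector per document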
Where is the real fault in my Numerov algorithm + Schrödinger equation? [Physics]
I am solving the Schrödinger equation with the help of the Numerov method. My initial project question is: solve the s-wave Schrödinger equation for the ground state and first excited state of the hydrogen atom,

D^2 y + 2(E - V(r)) y = 0,  D = d/dr,  with potential V(r) = -1/r.

But I want to make the program calculate its eigenenergies automatically, working in Hartree units. Here is my code:

import numpy as np
import matplotlib.pyplot as plt

dom=[0.001,5]
n=10
r,dr=np.linspace(dom[0],dom[1],n,retstep=True)

def Potential(x,E):
    energy=2*(E+(1/x))
    return energy

i=0
E=-2
s0i=0.0
s1i=0.0001
si=np.zeros(n)
indi=0
while i==0:
    si[0]=s0i
    si[1]=s1i
    for j in range(0,n-2,1):
        f0c=((dr**2)*Potential(x=r[j],E=E))/12.0
        f1c=((dr**2)*Potential(x=r[j+1],E=E))/12.0
        f2c=((dr**2)*Potential(x=r[j+2],E=E))/12.0
        s2i=((2*(1-5*f1c)*s1i)-((1+f0c)*s0i))/(1+f2c)
        s1i=s2i
        s0i=s1i
        si[j+2]=s2i
    if abs(s2i)<=1e-18:
        i=1
    elif s2i>1e-18:
        E=E-0.1
    elif s2i<-1e-18:
        E=E+0.1
    elif indi==500:
        i=1
    print('loop Count : ',indi,'Energy : ',E,'last Pshi',si[-1])
    indi+=1

print(len(r),len(si))
plt.plot(r,si)
plt.show()

I don't understand where the issue is and I get some absurd results, so please help me.
I see you speak Dutch, so I hope it is all right that I answer in my native language. You know there is a sympy module with which you can also solve this? It is already-written code that solves it for you.

from sympy.physics.hydrogen import E_nl
from sympy.abc import n, Z

E_nl(n, Z)
# output: -Z**2/(2*n**2)
# Z = atomic number, n = the quantum number
# Z=1 (hydrogen) by default, which you can change, by the way
E_nl(1)
# output: -1/2
E_nl(2)
# output: -1/8
E_nl(3)
# output: -1/18
E_nl(3, 47)
# output: -2209/18

But of course it is also interesting to write your own code. In your code an overflow occurs, since it gives this error:

RuntimeWarning: overflow encountered in double_scalars

and if you look at the output that is not very strange:

loop Count :  78 Energy :  -2.1 last Pshi 1.9967193884115876e+306
loop Count :  79 Energy :  -2.1 last Pshi NaN

That means a 1 with 306 zeros after it... a very large number, and it drifts further and further away from the expected -1/2. So somewhere the calculation goes wrong. Could you write out the calculation more clearly?
Build a correlation circle with Python - ValueError: could not broadcast input array from shape (3) into shape (28)
I'm trying to build a correlation circle; basically, it measures to what extent each variable is correlated with the principal components (dimensions) of a dataset (the usual correlation-circle plot). Here is my code:

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# load the data
X = pd.read_excel("mortalitePaysUE.xlsx",sheet_name=0,header=0,index_col=0)
# number of observations
n = X.shape[0]
# number of variables
p = X.shape[0]
print(p)

# transformation - centering/scaling
sc = StandardScaler()
Z = sc.fit_transform(X)
print(Z)
print("-------------")

# mean
print("Moyenne : ")
print(np.mean(X,axis=0))
print("-------------")

# standard deviation
print("Ecart type : ")
print(np.std(X,axis=1,ddof=0))
print("-------------")

# PCA
acp = PCA(svd_solver='full')
coord = acp.fit_transform(Z)
eigval = (n-1)/n*acp.explained_variance_
print(eigval)

# scree plot
#plt.plot(np.arange(1,p+1),eigval)
#plt.title("Décès en 1990 selon le genre")
#plt.xlabel("Numéro de facteur")
#plt.ylabel("Valeur propre")
#plt.show()

# position of the individuals in the first factorial plane
fig, axes = plt.subplots(figsize=(12,12))
axes.set_xlim(-6,6) # same limits on the x axis
axes.set_ylim(-6,6) # and on the y axis
# place the observation labels
for i in range(n):
    plt.annotate(X.index[i],(coord[i,0],coord[i,1]))
# add the axes
plt.plot([-6,6],[0,0],color='silver',linestyle='-',linewidth=1)
plt.plot([0,0],[-6,6],color='silver',linestyle='-',linewidth=1)
# display
plt.show()

# square roots of the eigenvalues
sqrt_eigval = np.sqrt(eigval)
# correlation of the variables with the axes
corvar = np.zeros((p,p))
for k in range(p):
    corvar[:,k] = acp.components_[k,:] * sqrt_eigval[k]
# print the variables x factors correlation matrix
#print(corvar)

# correlation circle
fig, axes = plt.subplots(figsize=(8,8))
axes.set_xlim(-1,1)
axes.set_ylim(-1,1)
# display the labels (variable names)
for j in range(p):
    plt.annotate(X.columns[j],(corvar[j,0],corvar[j,1]))
# add the axes
plt.plot([-1,1],[0,0],color='silver',linestyle='-',linewidth=1)
plt.plot([0,0],[-1,1],color='silver',linestyle='-',linewidth=1)
# add a circle
cercle = plt.Circle((0,0),1,color='blue',fill=False)
axes.add_artist(cercle)

The problem is that I get an error and can't display the circle, and I can't resolve it:

corvar[:,k] = acp.components_[k,:] * sqrt_eigval[k]
ValueError: could not broadcast input array from shape (3) into shape (28)

Can anyone help me fix this please? :) Thanks in advance!
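For what it's worth, the broadcast error is consistent with p being set from the number of rows rather than the number of columns: acp.components_[k, :] has length n_features (3 here), while corvar was allocated with 28 rows. A minimal sketch of the same loadings computation on toy data, assuming p is meant to be X.shape[1]:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(28, 3))        # 28 observations, 3 variables, mimicking the error's shapes
Z = StandardScaler().fit_transform(X)

n, p = X.shape                      # p = number of variables = 3
acp = PCA(svd_solver='full').fit(Z)
eigval = (n - 1) / n * acp.explained_variance_
sqrt_eigval = np.sqrt(eigval)

corvar = np.zeros((p, p))           # variables x components
for k in range(p):
    corvar[:, k] = acp.components_[k, :] * sqrt_eigval[k]
print(corvar.shape)                 # (3, 3) -- no broadcasting error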
Estimation of number of Clusters via gap statistics and prediction strength
I am trying to translate the R implementations of the gap statistic and prediction strength from http://edchedch.wordpress.com/2011/03/19/counting-clusters/ into Python scripts to estimate the number of clusters in the iris data, which has 3 clusters. Instead of getting 3 clusters, I get different results on different runs, and 3 (the actual number of clusters) is hardly ever estimated. The graph shows the estimated number to be 10 instead of 3. Am I missing something? Can anyone help me locate the problem?

import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def dispersion (data, k):
    if k == 1:
        cluster_mean = np.mean(data, axis=0)
        distances_from_mean = np.sum((data - cluster_mean)**2,axis=1)
        dispersion_val = np.log(sum(distances_from_mean))
    else:
        k_means_model_ = KMeans(n_clusters=k, max_iter=50, n_init=5).fit(data)
        distances_from_mean = range(k)
        for i in range(k):
            distances_from_mean[i] = int()
            for idx, label in enumerate(k_means_model_.labels_):
                if i == label:
                    distances_from_mean[i] += sum((data[idx] - k_means_model_.cluster_centers_[i])**2)
        dispersion_val = np.log(sum(distances_from_mean))
    return dispersion_val

def reference_dispersion(data, num_clusters, num_reference_bootstraps):
    dispersions = [dispersion(generate_uniform_points(data), num_clusters) for i in range(num_reference_bootstraps)]
    mean_dispersion = np.mean(dispersions)
    stddev_dispersion = float(np.std(dispersions)) / np.sqrt(1. + 1. / num_reference_bootstraps)
    return mean_dispersion

def generate_uniform_points(data):
    mins = np.argmin(data, axis=0)
    maxs = np.argmax(data, axis=0)
    num_dimensions = data.shape[1]
    num_datapoints = data.shape[0]
    reference_data_set = np.zeros((num_datapoints,num_dimensions))
    for i in range(num_datapoints):
        for j in range(num_dimensions):
            reference_data_set[i][j] = random.uniform(data[mins[j]][j],data[maxs[j]][j])
    return reference_data_set

def gap_statistic (data, nthCluster, referenceDatasets):
    actual_dispersion = dispersion(data, nthCluster)
    ref_dispersion = reference_dispersion(data, nthCluster, num_reference_bootstraps)
    return actual_dispersion, ref_dispersion

if __name__ == "__main__":
    data = np.loadtxt('iris.mat', delimiter=',', dtype=float)
    maxClusters = 10
    num_reference_bootstraps = 10
    dispersion_values = np.zeros((maxClusters,2))
    for cluster in range(1, maxClusters+1):
        dispersion_values_actual,dispersion_values_reference = gap_statistic(data, cluster, num_reference_bootstraps)
        dispersion_values[cluster-1][0] = dispersion_values_actual
        dispersion_values[cluster-1][1] = dispersion_values_reference
    gaps = dispersion_values[:,1] - dispersion_values[:,0]
    print gaps
    print "The estimated number of clusters is ", range(maxClusters)[np.argmax(gaps)]+1
    plt.plot(range(len(gaps)), gaps)
    plt.show()
Your graph is actually showing the correct value of 3. Let me explain a bit. As you increase the number of clusters, your distance metric will certainly decrease, so if you picked whatever minimises it you would choose 10, and if you went beyond 10 the metric would decrease further. That should not be the decision criterion. Instead, we need to find the inflection point, i.e. the point where the slope flattens out; you might want to take a look at elbow curves. Based on these two points, the inflection point here is 3, which is also the correct solution. Hope this helps.
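The same intuition shows up in a plain elbow plot; here is a quick sketch on the iris data, using scikit-learn's bundled copy rather than the iris.mat file from the question:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

data = load_iris().data
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=5).fit(data).inertia_ for k in ks]

plt.plot(ks, inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('within-cluster sum of squares (inertia)')
plt.show()   # the bend ("elbow") sits around k = 3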
You could take a look at this code; you could also change the format of your output plot.

# coding: utf-8
# K-means clustering implementation in Python

# Load the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()

# Import the Iris dataset into pandas
x = pd.DataFrame(iris.data)
x.columns = ['Sepal_Length','Sepal_width','Petal_Length','Petal_width']
y = pd.DataFrame(iris.target)
y.columns = ['Targets']

# Create a K-Means object grouping into 3 clusters
model = KMeans(n_clusters=3)
# Fit the model on the Iris data
model.fit(x)

# Visualize the clusters
plt.scatter(x.Petal_Length, x.Petal_width)
plt.show()

colormap = np.array(['Red','green','blue'])

# Visualize the dataset as-is (plot the flowers according to their labels)
plt.scatter(x.Petal_Length, x.Petal_width, c=colormap[y.Targets], s=40)
plt.title('Classification réelle')
plt.show()

# Visualize the clusters formed by K-Means
plt.scatter(x.Petal_Length, x.Petal_width, c=colormap[model.labels_], s=40)
plt.title('Classification K-means')
plt.show()