How to vectorize bigrams with the hashing-trick in scikit-learn? - python
I have some bigrams, let's say: [('word','word'),('word','word'),...,('word','word')]. How can I use scikit-learn's HashingVectorizer to create a feature vector that will subsequently be fed to a classification algorithm, e.g. SVC or Naive Bayes or any other classifier?
Firstly, you must understand what the different vectorizers do. Most vectorizers are based on the bag-of-words approach, where a document's tokens are mapped onto a matrix.
From the sklearn documentation, CountVectorizer and HashingVectorizer:
Convert a collection of text documents to a matrix of token counts
For instance, these sentences
The Fulton County Grand Jury said Friday an investigation of Atlanta's
recent primary election produced no evidence that any
irregularities took place .
The jury further said in term-end presentments that the City Executive
Committee , which had over-all charge of the election , `` deserves
the praise and thanks of the City of Atlanta '' for the manner in
which the election was conducted .
with this rough vectorizer:
from collections import Counter
from itertools import chain
from string import punctuation
from nltk.corpus import brown, stopwords
# Let's use the first two Brown corpus sentences as the data.
sentences = brown.sents()[:2]
# Extract the content words as features, i.e. columns.
vocabulary = list(chain(*sentences))
stops = stopwords.words('english') + list(punctuation)
vocab_nostop = [i.lower() for i in vocabulary if i not in stops]
# Create a matrix from the sentences: one Counter of content words per sentence.
matrix = [Counter([w for w in words if w in vocab_nostop]) for words in sentences]
print(matrix)
would become:
[Counter({u"''": 1, u'``': 1, u'said': 1, u'took': 1, u'primary': 1, u'evidence': 1, u'produced': 1, u'investigation': 1, u'place': 1, u'election': 1, u'irregularities': 1, u'recent': 1}), Counter({u'the': 6, u'election': 2, u'presentments': 1, u'``': 1, u'said': 1, u'jury': 1, u'conducted': 1, u"''": 1, u'deserves': 1, u'charge': 1, u'over-all': 1, u'praise': 1, u'manner': 1, u'term-end': 1, u'thanks': 1})]
This would be rather inefficient on a very large dataset, so the sklearn developers built more efficient code. One of the most important features of sklearn's vectorizers is that you don't even need to load the whole dataset into memory before vectorizing it.
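As a rough sketch of that property (this is not part of the original example, and the tiny in-memory generator just stands in for a large file), HashingVectorizer is stateless, so it can transform a lazy stream of documents without fitting a vocabulary first:
from sklearn.feature_extraction.text import HashingVectorizer

def stream_docs():
    # Stand-in for lazily reading lines from a huge file.
    for line in ["one document here", "and another document"]:
        yield line

hv = HashingVectorizer(analyzer='word', n_features=2**10)
X = hv.transform(stream_docs())   # stateless: no fit() and no vocabulary kept around
print(X.shape)                    # (2, 1024) sparse matrix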
Since it's unclear what your task is, I'll assume you're looking for general usage. Let's say you're using it for language identification.
Say your input file with the training data is train.txt:
Pošto je EULEX obećao da će obaviti istragu o prošlosedmičnom izbijanju nasilja na sjeveru Kosova, taj incident predstavlja još jedan ispit kapaciteta misije da doprinese jačanju vladavine prava.
De todas as provações que teve de suplantar ao longo da vida, qual foi a mais difícil? O início. Qualquer começo apresenta dificuldades que parecem intransponíveis. Mas tive sempre a minha mãe do meu lado. Foi ela quem me ajudou a encontrar forças para enfrentar as situações mais decepcionantes, negativas, as que me punham mesmo furiosa.
Al parecer, Andrea Guasch pone que una relación a distancia es muy difícil de llevar como excusa. Algo con lo que, por lo visto, Alex Lequio no está nada de acuerdo. ¿O es que más bien ya ha conseguido la fama que andaba buscando?
Vo väčšine golfových rezortov ide o veľký komplex niekoľkých ihrísk blízko pri sebe spojených s hotelmi a ďalšími možnosťami trávenia voľného času – nie vždy sú manželky či deti nadšenými golfistami, a tak potrebujú iný druh vyžitia. Zaujímavé kombinácie ponúkajú aj rakúske, švajčiarske či talianske Alpy, kde sa dá v zime lyžovať a v lete hrať golf pod vysokými alpskými končiarmi.
And your corresponding labels are Bosnian, Portuguese, Spanish and Slovak, i.e.
[bs, pt, es, sk]
Here's one way to use CountVectorizer with a Naive Bayes classifier. The following example is from https://github.com/alvations/bayesline, written for the DSL shared task.
Let's start with the vectorizer. It reads the training file, converts the training set into a vectorized matrix and initializes the vectorizer's features:
import codecs
from sklearn.feature_extraction.text import CountVectorizer

trainfile = 'train.txt'

# Vectorizing data: one line of train.txt = one training document.
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
tags = ['bs', 'pt', 'es', 'sk']

# In scikit-learn >= 1.0, use get_feature_names_out() instead.
print(word_vectorizer.get_feature_names())
[out]:
[u'acuerdo', u'aj', u'ajudou', u'al', u'alex', u'algo', u'alpsk\xfdmi', u'alpy', u'andaba', u'andrea', u'ao', u'apresenta', u'as', u'bien', u'bl\xedzko', u'buscando', u'come\xe7o', u'como', u'con', u'conseguido', u'da', u'de', u'decepcionantes', u'deti', u'dificuldades', u'dif\xedcil', u'distancia', u'do', u'doprinese', u'druh', u'd\xe1', u'ela', u'encontrar', u'enfrentar', u'es', u'est\xe1', u'eulex', u'excusa', u'fama', u'foi', u'for\xe7as', u'furiosa', u'golf', u'golfistami', u'golfov\xfdch', u'guasch', u'ha', u'hotelmi', u'hra\u0165', u'ide', u'ihr\xedsk', u'incident', u'intranspon\xedveis', u'in\xedcio', u'in\xfd', u'ispit', u'istragu', u'izbijanju', u'ja\u010danju', u'je', u'jedan', u'jo\u0161', u'kapaciteta', u'kde', u'kombin\xe1cie', u'komplex', u'kon\u010diarmi', u'kosova', u'la', u'lado', u'lequio', u'lete', u'llevar', u'lo', u'longo', u'ly\u017eova\u0165', u'mais', u'man\u017eelky', u'mas', u'me', u'mesmo', u'meu', u'minha', u'misije', u'mo\u017enos\u0165ami', u'muy', u'm\xe1s', u'm\xe3e', u'na', u'nada', u'nad\u0161en\xfdmi', u'nasilja', u'negativas', u'nie', u'nieko\u013ek\xfdch', u'no', u'obaviti', u'obe\u0107ao', u'para', u'parecem', u'parecer', u'pod', u'pone', u'pon\xfakaj\xfa', u'por', u'potrebuj\xfa', u'po\u0161to', u'prava', u'predstavlja', u'pri', u'prova\xe7\xf5es', u'pro\u0161losedmi\u010dnom', u'punham', u'qual', u'qualquer', u'que', u'quem', u'rak\xfaske', u'relaci\xf3n', u'rezortov', u'sa', u'sebe', u'sempre', u'situa\xe7\xf5es', u'sjeveru', u'spojen\xfdch', u'suplantar', u's\xfa', u'taj', u'tak', u'talianske', u'teve', u'tive', u'todas', u'tr\xe1venia', u'una', u've\u013ek\xfd', u'vida', u'visto', u'vladavine', u'vo', u'vo\u013en\xe9ho', u'vysok\xfdmi', u'vy\u017eitia', u'v\xe4\u010d\u0161ine', u'v\u017edy', u'ya', u'zauj\xedmav\xe9', u'zime', u'\u0107e', u'\u010dasu', u'\u010di', u'\u010fal\u0161\xedmi', u'\u0161vaj\u010diarske']
Let's say your test documents are in test.txt, whose labels are Spanish (es) and Portuguese (pt):
Por ello, ha insistido en que Europa tiene que darle un toque de atención porque Portugal esta incumpliendo la directiva del establecimiento del peaje
Estima-se que o mercado homossexual só na Cidade do México movimente cerca de oito mil milhões de dólares, aproximadamente seis mil milhões de euros
Now you can label the test documents with the trained classifier:
import codecs
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

trainfile = 'train.txt'
testfile = 'test.txt'

# Vectorizing data.
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
tags = ['bs', 'pt', 'es', 'sk']

# Training NB.
mnb = MultinomialNB()
mnb.fit(trainset, tags)

# Tagging the test documents.
testset = word_vectorizer.transform(codecs.open(testfile, 'r', 'utf8'))
results = mnb.predict(testset)
print(results)
[out]:
['es' 'pt']
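If you also have gold labels for the test documents, scoring the predictions is a one-liner; the gold_tags list below is hypothetical, just to show the call:
from sklearn.metrics import accuracy_score

gold_tags = ['es', 'pt']                    # hypothetical gold labels for test.txt
print(accuracy_score(gold_tags, results))   # 1.0 if the predictions above are correct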
For more on text classification, you might also find this NLTK-related question/answer useful: nltk NaiveBayesClassifier training for sentiment analysis.
To use the HashingVectorizer, note that by default it produces feature values that can be negative, and the MultinomialNB classifier cannot handle negative values, so you would have to use another classifier, for example:
import codecs
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Perceptron

trainfile = 'train.txt'
testfile = 'test.txt'

# Vectorizing data.
word_vectorizer = HashingVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
tags = ['bs', 'pt', 'es', 'sk']

# Training the Perceptron (older scikit-learn versions call this parameter n_iter).
pct = Perceptron(max_iter=100)
pct.fit(trainset, tags)

# Tagging the test documents.
testset = word_vectorizer.transform(codecs.open(testfile, 'r', 'utf8'))
results = pct.predict(testset)
print(results)
[out]:
['es' 'es']
But do note that the Perceptron's results are worse in this small example. Different classifiers fit different tasks, different features fit different vectorizers, and different classifiers accept different kinds of vectors.
There is no perfect model, just better or worse ones.
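As a side note, if you would rather keep MultinomialNB with hashed features, recent scikit-learn versions let you switch off the sign alternation so the hashed values stay non-negative. A minimal sketch, reusing train.txt and test.txt from above:
import codecs
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# alternate_sign=False (scikit-learn >= 0.19) keeps every hashed feature value >= 0,
# which is what MultinomialNB requires.
word_vectorizer = HashingVectorizer(analyzer='word', alternate_sign=False)
trainset = word_vectorizer.transform(codecs.open('train.txt', 'r', 'utf8'))
mnb = MultinomialNB()
mnb.fit(trainset, ['bs', 'pt', 'es', 'sk'])
testset = word_vectorizer.transform(codecs.open('test.txt', 'r', 'utf8'))
print(mnb.predict(testset))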
Since you've already extracted the bigrams yourself, you can vectorize them with a FeatureHasher. The main thing you need to do is squash each bigram into a single string, e.g.:
>>> data = [[('this', 'is'), ('is', 'a'), ('a', 'text')],
... [('and', 'one'), ('one', 'more')]]
>>> from sklearn.feature_extraction import FeatureHasher
>>> fh = FeatureHasher(input_type='string')
>>> X = fh.transform(((' '.join(x) for x in sample) for sample in data))
>>> X
<2x1048576 sparse matrix of type '<type 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
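If, on the other hand, you still have the raw text and just want word-bigram features, HashingVectorizer can extract and hash the bigrams itself via ngram_range; a small sketch with made-up documents:
from sklearn.feature_extraction.text import HashingVectorizer

raw_docs = ["this is a text", "and one more"]   # hypothetical raw documents
hv = HashingVectorizer(analyzer='word', ngram_range=(2, 2))
X = hv.transform(raw_docs)
print(X.shape)   # (2, 1048576), the same width as the FeatureHasher output above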
Related
Regarding a basic image-classification machine-learning exercise in Python
Apologies in advance for the rough code. I am trying to train three models with three different machine-learning algorithms that classify images based on labels. I got to the point where I have created a pandas dataframe that contains a column for the images and a column for each image's label. At the beginning I show some code for how I create the images: I get them from a binary-encoded data file, use unpickle to decode it into a dictionary with all the images, and build each image with np.dstack() from its three colour channels (this is the input we were given). I think the problem is the dimensions of the images: each image in the dataframe column has shape (32, 32, 3), because each image is 32x32 pixels with three layers of colour (red, green, blue). I keep getting errors like "ValueError: setting an array element with a sequence". I feel nowhere near getting this right; I have tried the flatten function but it does not seem to work either. What should I do to adapt the input? I am very lost right now and have tried everything.

# The data of each entity contains the values of the image. The image is obtained by
# combining three channels/layers (red, green, blue) as follows:
ch0 = d0[0:1024]
ch1 = d0[1024:2048]
ch2 = d0[2048:]
# Each channel is a layer of the corresponding colour.
ch0 = np.reshape(ch0, (32,32)) # red
ch1 = np.reshape(ch1, (32,32)) # green
ch2 = np.reshape(ch2, (32,32)) # blue
# Combining them gives an image with the three colours:
image = np.dstack((ch0, ch1, ch2))
fig, ax = plt.subplots(figsize=(2, 2))
ax.imshow(image)
plt.show()

import random as r
import cv2

categoriasAleatorias = []
for i in range(3):
    num = r.randint(0,19)
    categoriasAleatorias.append(dataMeta[b'coarse_label_names'][num])
print(categoriasAleatorias)

# For each image.
columnaImagenes = []
columnaEtiquetas = []
for i in range(len(data[b'data'])):
    if dataMeta[b'coarse_label_names'][data[b'coarse_labels'][i]] in categoriasAleatorias:
        columnaEtiquetas.append(dataMeta[b'coarse_label_names'][data[b'coarse_labels'][i]])
        # Each entity sits at one position of each of the attributes above.
        d0 = data[b'data'][i]
        # The data of each entity contains the values of the image. The image is obtained
        # by combining three channels/layers (red, green, blue) as follows:
        ch0 = d0[0:1024]
        ch1 = d0[1024:2048]
        ch2 = d0[2048:]
        # Each channel is a layer of the corresponding colour.
        ch0 = np.reshape(ch0, (32,32)) # red
        ch1 = np.reshape(ch1, (32,32)) # green
        ch2 = np.reshape(ch2, (32,32)) # blue
        # Combining them gives an image with the three colours:
        image = np.dstack((ch0, ch1, ch2))
        columnaImagenes.append(image)

print(len(columnaEtiquetas))
print(len(columnaImagenes))
df = pd.DataFrame({'etiqueta': columnaEtiquetas, 'imagen': columnaImagenes})

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MultiLabelBinarizer

X_train = df['imagen']
y_train = df['etiqueta']

# Create and train the SVM model.
svm = SVC()
svm.fit(X_train, y_train)

# Create and train the Random Forest model.
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

# Create and train the KNN model.
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

flatten(), reshape(): nothing seems to work. When I do this:

print(len(columnaEtiquetas))
print(len(columnaImagenes))
print(columnaImagenes[0].shape)
df = pd.DataFrame({'etiqueta': columnaEtiquetas, 'imagen': columnaImagenes})
print(df.head())

I obtain this:

7500
7500
(32, 32, 3)
             etiqueta                                             imagen
0  b'food_containers'  [[[178, 168, 176], [175, 165, 173], [175, 165,...
1            b'trees'  [[[254, 254, 254], [255, 255, 255], [255, 255,...
2            b'trees'  [[[152, 71, 119], [154, 77, 124], [152, 84, 13...
3            b'trees'  [[[153, 157, 168], [155, 160, 167], [162, 168,...
4            b'trees'  [[[48, 78, 134], [51, 88, 148], [53, 86, 150],...

This is how my images are made: 32x32 pixels, and 32x32x3 because there are three layers of colour. I do not know how to fit this data or transform it in a way that the model accepts. I tried your suggested approach on the image but it kept giving me the same exception.
Indeed, none of the methods you use will work on 32x32x3 arrays. If you want to fully utilise the spatial information in the images you should use convolutions, but that is a more advanced technique. What you need here is to flatten each image into an array of 3072 features. I experimented in a shell, and if you do columnaImagenes.append(image.flatten().tolist()) you should get 3072-element feature arrays. If that is not working, I'd suggest taking an element from the column and checking its shape with np.shape. Also, there is a part of your code where you do a colour-to-grayscale conversion (cvtColor). If I understand correctly, you are converting the image as BGR while you actually stacked it as RGB, which is a different channel order, so the grayscale values would differ.
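To make that concrete, here is a minimal sketch of the flattening step with random stand-in arrays instead of the real images (the variable names only mirror the ones in the question):
import numpy as np
from sklearn.svm import SVC

# Stand-ins for columnaImagenes / columnaEtiquetas: ten fake 32x32x3 images.
images = [np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8) for _ in range(10)]
labels = ['trees'] * 5 + ['food_containers'] * 5

# Flatten every image to a 3072-long vector and stack them into an (n_samples, 3072) matrix.
X = np.stack([img.flatten() for img in images]).astype(np.float64)
print(X.shape)      # (10, 3072) -- a shape scikit-learn estimators accept

svm = SVC()
svm.fit(X, labels)  # no "setting an array element with a sequence" error any more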
SpaCy - TextCategorizer - Bag Of Words: Is there a way to show the vectorized document?
I just trained and implemented a text categorizer using spaCy 3.0. Everything went smoothly, but I'd like to visualize the vectorized document ([13, 0, 0, 120, ...etc]) in order to better understand which features (words) drove the bag-of-words (BoW) model to classify the document into a specific class.

nlp = spacy.load('./nlp_single_label_cli/output/model-best')
documents = pd.read_csv(target_directory+'_ocr.csv')
...
test_texts = documents['text'].values
test_docs = [nlp.tokenizer(text) for text in test_texts]
text_categorizer = nlp.get_pipe('textcat')
scores = text_categorizer.predict(test_docs)
predicted_labels = scores.argmax(axis=1)

I created the model from scratch with just the TextCategorizer layer (following the spaCy 3.0 guidelines). Because of this, the documents have no .vector attribute. This is my config.cfg:

[nlp]
lang = "it"
pipeline = ["textcat"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.textcat]
factory = "textcat"
scorer = {"@scorers":"spacy.textcat_scorer.v1"}
threshold = 0.5

[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"
...
In spaCy you can use .vector to get the vector representation of a document.

import spacy

nlp = spacy.load("en_core_web_sm")
doc1 = nlp("I like apples.")
doc2 = nlp("I like bananas.")
doc3 = nlp("The book is about quantum physics.")

print(doc1.vector)
> [-0.28571516 ... 0.77876693]
print(doc2.vector)
> [ 3.14834714e-02 ... 5.22061586e-01]
print(doc3.vector)
> [ 0.08713525 ... 0.2642293 ]

You can visualize these vectors, for example, by using PCA in sklearn to reduce the dimensions to two and then creating a basic plot:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy

pca_ = PCA(n_components=2).fit_transform(numpy.array([doc.vector for doc in [doc1, doc2, doc3]]))
pca_x = [x_[0] for x_ in pca_]
pca_y = [x_[1] for x_ in pca_]

ax = plt.gca()
plt.scatter(pca_x, pca_y, alpha=0.25)
ax.set_xlim([-3, 3])
ax.set_ylim([-3, 3])
for i, name in enumerate(["doc1", "doc2", "doc3"]):
    text = ax.annotate(name, (pca_x[i], pca_y[i]))
plt.show()

This gives you a scatter plot with the three documents annotated by name.

I am not sure if this is exactly what you meant, though. If you mean not the generic vector representation of each document but the feature vectors or something similar, you can do the same thing and use those instead of .vector. Of course, by reducing the dimensions you will not be able to see the individual vector values in the plot afterwards.
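If what you are after is not doc.vector but the explicit count vector of a bag-of-words model (the [13, 0, 0, 120, ...] style array from the question), one rough workaround is to rebuild the counts outside spaCy with scikit-learn's CountVectorizer. This is only a stand-in for the features the trained TextCatBOW component uses internally, not a way to read them out of the model:
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I like apples.", "I like bananas."]   # stand-ins for your test_texts
cv = CountVectorizer(ngram_range=(1, 1))        # unigrams, like ngram_size = 1 in the config
bow = cv.fit_transform(texts)
print(cv.get_feature_names_out())               # the feature columns (words)
print(bow.toarray())                            # one count vector per document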
Where is the real fault in my Numerov algorithm + Schrödinger equation? [Physics]
I am solving the Schrödinger equation with the help of the Numerov method. My initial project question is: solve the s-wave Schrödinger equation for the ground state and first excited state of the hydrogen atom,

D^2 y + 2(E - V(r)) y = 0,  D = d/dr,  with potential V(r) = -1/r.

But I want to make the program calculate its eigenenergies automatically, working in Hartree units. Here is my code:

import numpy as np
import matplotlib.pyplot as plt

dom=[0.001,5]
n=10
r,dr=np.linspace(dom[0],dom[1],n,retstep=True)

def Potential(x,E):
    energy=2*(E+(1/x))
    return energy

i=0
E=-2
s0i=0.0
s1i=0.0001
si=np.zeros(n)
indi=0
while i==0:
    si[0]=s0i
    si[1]=s1i
    for j in range(0,n-2,1):
        f0c=((dr**2)*Potential(x=r[j],E=E))/12.0
        f1c=((dr**2)*Potential(x=r[j+1],E=E))/12.0
        f2c=((dr**2)*Potential(x=r[j+2],E=E))/12.0
        s2i=((2*(1-5*f1c)*s1i)-((1+f0c)*s0i))/(1+f2c)
        s1i=s2i
        s0i=s1i
        si[j+2]=s2i
    if abs(s2i)<=1e-18:
        i=1
    elif s2i>1e-18:
        E=E-0.1
    elif s2i<-1e-18:
        E=E+0.1
    elif indi==500:
        i=1
    print('loop Count : ',indi,'Energy : ',E,'last Pshi',si[-1])
    indi+=1

print(len(r),len(si))
plt.plot(r,si)
plt.show()

I don't understand where the issue is and I get some absurd results, so please help me.
I see you speak Dutch, so I hope it is all right that I answer in my native language. You know there is a sympy module with which you can also solve this? It is already-written code that solves it for you.

from sympy.physics.hydrogen import E_nl
from sympy.abc import n, Z

E_nl(n, Z)
# output: -Z**2/(2*n**2)
# Z = atomic number, n = the quantum number
# Z=1 (hydrogen) by default, which you can change, by the way
E_nl(1)
# output: -1/2
E_nl(2)
# output: -1/8
E_nl(3)
# output: -1/18
E_nl(3, 47)
# output: -2209/18

But of course it is also interesting to write your own code. In your code an overflow occurs, since it gives this error:

RuntimeWarning: overflow encountered in double_scalars

and if you look at the output that is not very strange:

loop Count :  78 Energy :  -2.1 last Pshi 1.9967193884115876e+306
loop Count :  79 Energy :  -2.1 last Pshi NaN

That means a 1 with 306 zeros after it... a very large number, and it drifts further and further away from the expected -1/2. So somewhere the calculation goes wrong. Could you write out the calculation more clearly?
Build a correlation circle with Python - ValueError: could not broadcast input array from shape (3) into shape (28)
I'm trying to build a correlation circle; basically, it measures to what extent each variable is correlated with the principal components (dimensions) of a dataset (the usual correlation-circle plot). Here is my code:

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# load the data
X = pd.read_excel("mortalitePaysUE.xlsx",sheet_name=0,header=0,index_col=0)
# number of observations
n = X.shape[0]
# number of variables
p = X.shape[0]
print(p)

# transformation - centering/scaling
sc = StandardScaler()
Z = sc.fit_transform(X)
print(Z)
print("-------------")

# mean
print("Moyenne : ")
print(np.mean(X,axis=0))
print("-------------")

# standard deviation
print("Ecart type : ")
print(np.std(X,axis=1,ddof=0))
print("-------------")

# PCA
acp = PCA(svd_solver='full')
coord = acp.fit_transform(Z)
eigval = (n-1)/n*acp.explained_variance_
print(eigval)

# scree plot
#plt.plot(np.arange(1,p+1),eigval)
#plt.title("Décès en 1990 selon le genre")
#plt.xlabel("Numéro de facteur")
#plt.ylabel("Valeur propre")
#plt.show()

# position of the individuals in the first factorial plane
fig, axes = plt.subplots(figsize=(12,12))
axes.set_xlim(-6,6) # same limits on the x axis
axes.set_ylim(-6,6) # and on the y axis
# place the observation labels
for i in range(n):
    plt.annotate(X.index[i],(coord[i,0],coord[i,1]))
# add the axes
plt.plot([-6,6],[0,0],color='silver',linestyle='-',linewidth=1)
plt.plot([0,0],[-6,6],color='silver',linestyle='-',linewidth=1)
# display
plt.show()

# square roots of the eigenvalues
sqrt_eigval = np.sqrt(eigval)
# correlation of the variables with the axes
corvar = np.zeros((p,p))
for k in range(p):
    corvar[:,k] = acp.components_[k,:] * sqrt_eigval[k]
# print the variables x factors correlation matrix
#print(corvar)

# correlation circle
fig, axes = plt.subplots(figsize=(8,8))
axes.set_xlim(-1,1)
axes.set_ylim(-1,1)
# display the labels (variable names)
for j in range(p):
    plt.annotate(X.columns[j],(corvar[j,0],corvar[j,1]))
# add the axes
plt.plot([-1,1],[0,0],color='silver',linestyle='-',linewidth=1)
plt.plot([0,0],[-1,1],color='silver',linestyle='-',linewidth=1)
# add a circle
cercle = plt.Circle((0,0),1,color='blue',fill=False)
axes.add_artist(cercle)

The problem is that I get an error and can't display the circle, and I can't resolve it:

corvar[:,k] = acp.components_[k,:] * sqrt_eigval[k]
ValueError: could not broadcast input array from shape (3) into shape (28)

Can anyone help me fix this please? :) Thanks in advance!
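For what it's worth, the broadcast error is consistent with p being set from the number of rows rather than the number of columns: acp.components_[k, :] has length n_features (3 here), while corvar was allocated with 28 rows. A minimal sketch of the same loadings computation on toy data, assuming p is meant to be X.shape[1]:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(28, 3))        # 28 observations, 3 variables, mimicking the error's shapes
Z = StandardScaler().fit_transform(X)

n, p = X.shape                      # p = number of variables = 3
acp = PCA(svd_solver='full').fit(Z)
eigval = (n - 1) / n * acp.explained_variance_
sqrt_eigval = np.sqrt(eigval)

corvar = np.zeros((p, p))           # variables x components
for k in range(p):
    corvar[:, k] = acp.components_[k, :] * sqrt_eigval[k]
print(corvar.shape)                 # (3, 3) -- no broadcasting error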
Estimation of number of Clusters via gap statistics and prediction strength
I am trying to translate the R implementations of the gap statistic and prediction strength from http://edchedch.wordpress.com/2011/03/19/counting-clusters/ into Python scripts to estimate the number of clusters in the iris data, which has 3 clusters. Instead of getting 3 clusters, I get different results on different runs, and 3 (the actual number of clusters) is hardly ever estimated. The graph shows the estimated number to be 10 instead of 3. Am I missing something? Can anyone help me locate the problem?

import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def dispersion (data, k):
    if k == 1:
        cluster_mean = np.mean(data, axis=0)
        distances_from_mean = np.sum((data - cluster_mean)**2,axis=1)
        dispersion_val = np.log(sum(distances_from_mean))
    else:
        k_means_model_ = KMeans(n_clusters=k, max_iter=50, n_init=5).fit(data)
        distances_from_mean = range(k)
        for i in range(k):
            distances_from_mean[i] = int()
            for idx, label in enumerate(k_means_model_.labels_):
                if i == label:
                    distances_from_mean[i] += sum((data[idx] - k_means_model_.cluster_centers_[i])**2)
        dispersion_val = np.log(sum(distances_from_mean))
    return dispersion_val

def reference_dispersion(data, num_clusters, num_reference_bootstraps):
    dispersions = [dispersion(generate_uniform_points(data), num_clusters) for i in range(num_reference_bootstraps)]
    mean_dispersion = np.mean(dispersions)
    stddev_dispersion = float(np.std(dispersions)) / np.sqrt(1. + 1. / num_reference_bootstraps)
    return mean_dispersion

def generate_uniform_points(data):
    mins = np.argmin(data, axis=0)
    maxs = np.argmax(data, axis=0)
    num_dimensions = data.shape[1]
    num_datapoints = data.shape[0]
    reference_data_set = np.zeros((num_datapoints,num_dimensions))
    for i in range(num_datapoints):
        for j in range(num_dimensions):
            reference_data_set[i][j] = random.uniform(data[mins[j]][j],data[maxs[j]][j])
    return reference_data_set

def gap_statistic (data, nthCluster, referenceDatasets):
    actual_dispersion = dispersion(data, nthCluster)
    ref_dispersion = reference_dispersion(data, nthCluster, num_reference_bootstraps)
    return actual_dispersion, ref_dispersion

if __name__ == "__main__":
    data = np.loadtxt('iris.mat', delimiter=',', dtype=float)
    maxClusters = 10
    num_reference_bootstraps = 10
    dispersion_values = np.zeros((maxClusters,2))
    for cluster in range(1, maxClusters+1):
        dispersion_values_actual,dispersion_values_reference = gap_statistic(data, cluster, num_reference_bootstraps)
        dispersion_values[cluster-1][0] = dispersion_values_actual
        dispersion_values[cluster-1][1] = dispersion_values_reference
    gaps = dispersion_values[:,1] - dispersion_values[:,0]
    print gaps
    print "The estimated number of clusters is ", range(maxClusters)[np.argmax(gaps)]+1
    plt.plot(range(len(gaps)), gaps)
    plt.show()
Your graph is actually showing the correct value of 3. Let me explain a bit. As you increase the number of clusters, your distance metric will certainly decrease, so if you picked whatever minimises it you would choose 10, and if you went beyond 10 the metric would decrease further. That should not be the decision criterion. Instead, we need to find the inflection point, i.e. the point where the slope flattens out; you might want to take a look at elbow curves. Based on these two points, the inflection point here is 3, which is also the correct solution. Hope this helps.
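The same intuition shows up in a plain elbow plot; here is a quick sketch on the iris data, using scikit-learn's bundled copy rather than the iris.mat file from the question:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

data = load_iris().data
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=5).fit(data).inertia_ for k in ks]

plt.plot(ks, inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('within-cluster sum of squares (inertia)')
plt.show()   # the bend ("elbow") sits around k = 3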
You could take a look at this code; you could also change the format of your output plot.

# coding: utf-8
# K-means clustering implementation in Python

# Load the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()

# Import the Iris dataset into pandas
x = pd.DataFrame(iris.data)
x.columns = ['Sepal_Length','Sepal_width','Petal_Length','Petal_width']
y = pd.DataFrame(iris.target)
y.columns = ['Targets']

# Create a K-Means object grouping into 3 clusters
model = KMeans(n_clusters=3)
# Fit the model on the Iris data
model.fit(x)

# Visualize the clusters
plt.scatter(x.Petal_Length, x.Petal_width)
plt.show()

colormap = np.array(['Red','green','blue'])

# Visualize the dataset as-is (plot the flowers according to their labels)
plt.scatter(x.Petal_Length, x.Petal_width, c=colormap[y.Targets], s=40)
plt.title('Classification réelle')
plt.show()

# Visualize the clusters formed by K-Means
plt.scatter(x.Petal_Length, x.Petal_width, c=colormap[model.labels_], s=40)
plt.title('Classification K-means')
plt.show()