Apologies in advance for the exercise-specific comments in the code. I'm trying to train three models with three different machine learning algorithms to classify images by their labels. I have built a pandas DataFrame that contains one column for the images and one column for each image's label. At the beginning I show how I build the images: they come from a binary-encoded data file, which I decode with unpickle to obtain a dictionary containing all the images. I assemble each image with np.dstack() from its three colour channels (this is the input we are given). I think the problem is the dimensions of the images: each image stored in the DataFrame column has shape (32, 32, 3), because each image is 32x32 pixels with three colour layers (red, green and blue). I keep getting "ValueError: setting an array element with a sequence". I feel I'm nowhere near getting this right. I have tried the flatten function, but it doesn't seem to work either. What should I do to adapt the input? I'm very lost right now and have tried everything.
# The data for each entity contains the image values. The image is obtained by combining
# three channels/layers (red, green, blue) as follows:
ch0 = d0[0:1024]
ch1 = d0[1024:2048]
ch2 = d0[2048:]
# Each channel is a layer of the corresponding colour
ch0 = np.reshape(ch0, (32,32)) # red
ch1 = np.reshape(ch1, (32,32)) # green
ch2 = np.reshape(ch2, (32,32)) # blue
# Combining them gives an image with the three colours:
image = np.dstack((ch0, ch1, ch2))
fig, ax = plt.subplots(figsize=(2, 2))
ax.imshow(image)
plt.show()
import random as r
import cv2

categoriasAleatorias = []
for i in range(3):
    num = r.randint(0, 19)
    categoriasAleatorias.append(dataMeta[b'coarse_label_names'][num])
print(categoriasAleatorias)

# For each image.
columnaImagenes = []
columnaEtiquetas = []
for i in range(len(data[b'data'])):
    if dataMeta[b'coarse_label_names'][data[b'coarse_labels'][i]] in categoriasAleatorias:
        columnaEtiquetas.append(dataMeta[b'coarse_label_names'][data[b'coarse_labels'][i]])
        # Each entity is stored at one position of each of the previous attributes
        # Let's look at entity i
        d0 = data[b'data'][i]
        # The data for each entity contains the image values. The image is obtained by
        # combining three channels/layers (red, green, blue) as follows:
        ch0 = d0[0:1024]
        ch1 = d0[1024:2048]
        ch2 = d0[2048:]
        # Each channel is a layer of the corresponding colour
        ch0 = np.reshape(ch0, (32, 32))  # red
        ch1 = np.reshape(ch1, (32, 32))  # green
        ch2 = np.reshape(ch2, (32, 32))  # blue
        # Combining them gives an image with the three colours:
        image = np.dstack((ch0, ch1, ch2))
        columnaImagenes.append(image)

print(len(columnaEtiquetas))
print(len(columnaImagenes))
df = pd.DataFrame({'etiqueta': columnaEtiquetas, 'imagen': columnaImagenes})
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MultiLabelBinarizer

X_train = df['imagen']
y_train = df['etiqueta']

# Create the SVM model
svm = SVC()
# Train the SVM model
svm.fit(X_train, y_train)

# Create the Random Forest model
rfc = RandomForestClassifier()
# Train the Random Forest model
rfc.fit(X_train, y_train)

# Create the KNN model
knn = KNeighborsClassifier()
# Train the KNN model
knn.fit(X_train, y_train)
Neither flatten() nor reshape() seems to work.
When I do this:
print(len(columnaEtiquetas))
print(len(columnaImagenes))
print(columnaImagenes[0].shape)
df = pd.DataFrame({'etiqueta': columnaEtiquetas, 'imagen': columnaImagenes})
print(df.head())
I obtain this:
7500
7500
(32, 32, 3)
etiqueta imagen
0 b'food_containers' [[[178, 168, 176], [175, 165, 173], [175, 165,...
1 b'trees' [[[254, 254, 254], [255, 255, 255], [255, 255,...
2 b'trees' [[[152, 71, 119], [154, 77, 124], [152, 84, 13...
3 b'trees' [[[153, 157, 168], [155, 160, 167], [162, 168,...
4 b'trees' [[[48, 78, 134], [51, 88, 148], [53, 86, 150],...
This is how my images are built: 32x32 pixels, and 32x32x3 because there are 3 colour layers. I don't know how to fit this data or transform it in a way the model accepts; I tried your approach on the image but it kept giving me the same exception.
Indeed, none of the methods you use would work for 32x32x3 input. If you want to fully utilise the spatial information in the image you should use convolutions, but that is a more advanced technique.
What you need is to flatten each image array into an array of 3072 features. I experimented in a shell, and if you do columnaImagenes.append(image.flatten().tolist()) you should get 3072-element feature arrays.
If it's not working, I'd suggest taking an element from the column and checking its shape with np.shape.
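For example, here is a minimal sketch of what the training step could look like once the images are flattened; it reuses the df, 'imagen' and 'etiqueta' names from your code, and the reshape to (n_samples, 3072) is the key part:

import numpy as np
from sklearn.svm import SVC

# Stack the (32, 32, 3) images into a single 2-D array of shape (n_samples, 3072)
X_train = np.stack(df['imagen'].tolist()).reshape(len(df), -1)
y_train = df['etiqueta']

svm = SVC()
svm.fit(X_train, y_train)  # no more "setting an array element with a sequence"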
There is also a part of your code where you do the colour-to-grayscale conversion (cvtColor). If I understand correctly, you are converting the image as if it were BGR while you are actually stacking it as RGB, which is a different channel order, so the grayscale values would differ.
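If you keep that conversion, here is a minimal sketch of the fix under that assumption (the cvtColor part of your code isn't shown, so the exact call is guessed):

import cv2

# image was built with np.dstack((red, green, blue)), i.e. RGB order,
# so the matching conversion flag is COLOR_RGB2GRAY, not COLOR_BGR2GRAY.
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)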
I'm trying to build a correlation circle; basically, it measures the extent to which the eigenvalue/eigenvector of a variable is correlated with the principal components (dimensions) of a dataset.
Here is my code:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# load the data
X = pd.read_excel("mortalitePaysUE.xlsx",sheet_name=0,header=0,index_col=0)
# number of observations
n = X.shape[0]
# number of variables
p = X.shape[0]
print(p)
# transformation - centring/scaling
sc = StandardScaler()
Z = sc.fit_transform(X)
print(Z)
print("-------------")
# mean
print("Moyenne : ")
print(np.mean(X,axis=0))
print("-------------")
# standard deviation
print("Ecart type : ")
print(np.std(X,axis=1,ddof=0))
print("-------------")
# PCA
acp = PCA(svd_solver='full')
coord = acp.fit_transform(Z)
eigval = (n-1)/n*acp.explained_variance_
print(eigval)
# scree plot
#plt.plot(np.arange(1,p+1),eigval)
#plt.title("Décès en 1990 selon le genre")
#plt.xlabel("Numéro de facteur")
#plt.ylabel("Valeur propre")
#plt.show()
# position of the individuals in the first factorial plane
fig, axes = plt.subplots(figsize=(12,12))
axes.set_xlim(-6,6) # same limits on the x-axis
axes.set_ylim(-6,6) # and on the y-axis
# place the observation labels
for i in range(n):
    plt.annotate(X.index[i],(coord[i,0],coord[i,1]))
# add the axes
plt.plot([-6,6],[0,0],color='silver',linestyle='-',linewidth=1)
plt.plot([0,0],[-6,6],color='silver',linestyle='-',linewidth=1)
# display
plt.show()
# square root of the eigenvalues
sqrt_eigval = np.sqrt(eigval)
# correlation of the variables with the axes
corvar = np.zeros((p,p))
for k in range(p):
    corvar[:,k] = acp.components_[k,:] * sqrt_eigval[k]
# display the variables x factors correlation matrix
#print(corvar)
# correlation circle
fig, axes = plt.subplots(figsize=(8,8))
axes.set_xlim(-1,1)
axes.set_ylim(-1,1)
# display the labels (variable names)
for j in range(p):
    plt.annotate(X.columns[j],(corvar[j,0],corvar[j,1]))
# add the axes
plt.plot([-1,1],[0,0],color='silver',linestyle='-',linewidth=1)
plt.plot([0,0],[-1,1],color='silver',linestyle='-',linewidth=1)
# add a circle
cercle = plt.Circle((0,0),1,color='blue',fill=False)
axes.add_artist(cercle)
The problem is that I get an error, so I can't display the circle, and I can't resolve it:
corvar[:,k] = acp.components_[k,:] * sqrt_eigval[k]
ValueError: could not broadcast input array from shape (3) into shape (28)
Can anyone help me fix this, please? :) Thanks in advance!
I'm using the subplot method to make my work more orderly, but I don't know how to fix the overlapping of the plots. Here is my code:
import matplotlib.pyplot as plt
import pandas as pd
data=pd.read_csv('churn.csv')
fugados=data[data.Churn=='Yes']
noFugados=data[data.Churn=='No']
plt.subplot(3,2,1)
plt.title('Cantidad de clientes y fugados [1]')
plt.scatter('Fugados',fugados.Churn.size)
plt.scatter('Clientes',noFugados.Churn.size)
plt.ylabel('Cantidad de personas')
plt.subplot(3,2,2)
plt.title('Cantidad de hombres y mujeres [2]')
m=data[data.gender=='Female']
h=data[data.gender=='Male']
plt.scatter('Hombres',h.gender.size)
plt.scatter('Mujeres',m.gender.size)
plt.ylabel('Cantidad de personas')
plt.subplot(3,2,3)
plt.title('Cantidad de hombres y mujeres fugados [3]')
m=data[(data.gender=='Female') & (data.Churn=='Yes')]
h=data[(data.gender=='Male') & (data.Churn=='Yes')]
plt.scatter('Hombres',h.gender.size)
plt.scatter('Mujeres',m.gender.size)
plt.ylabel('Cantidad de personas')
plt.subplot(3,2,4)
plt.title('Cantidad de hombres y mujeres que son clientes [4]')
m=data[(data.gender=='Female') & (data.Churn=='No')]
h=data[(data.gender=='Male') & (data.Churn=='No')]
plt.scatter('Hombres',h.gender.size)
plt.scatter('Mujeres',m.gender.size)
plt.ylabel('Cantidad de personas')
plt.subplot(3,2,5)
plt.title('Cantidad de fugados que tenían fibra óptica u otro servicio [5]')
conFibra=data[(data.InternetService=='Fiber optic') & (data.Churn=='Yes')]
sinFibra=data[(data.InternetService!='Fiber optic') & (data.Churn=='Yes')]
plt.scatter('Fibra óptica',conFibra.gender.size)
plt.scatter('Otro servicio',sinFibra.gender.size)
plt.ylabel('Cantidad de personas')
This is the output I get: the subplots overlap each other.
If someone could help me display something clearer and more organized I would appreciate it. I want to separate the plots.
Try adding plt.tight_layout() before plt.show().
Ref: https://matplotlib.org/tutorials/intermediate/tight_layout_guide.html
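For example, a minimal sketch of the pattern with made-up placeholder data (not your churn dataset), showing where the call goes:

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 12))   # a larger figure also gives long titles more room
for i in range(1, 6):
    plt.subplot(3, 2, i)
    plt.title('Plot %d' % i)
    plt.scatter(['A', 'B'], [1, 2])
    plt.ylabel('Count')

plt.tight_layout()                   # re-spaces the subplots so titles and labels don't overlap
plt.show()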
I'm trying to build an autoencoder neural network for finding outliers in a single-column list of text. My input has 138 lines, and they look like this:
amaze_header_2.png
amaze_header.png
circle_shape.xml
disableable_ic_edit_24dp.xml
fab_label_background.xml
fab_shadow_black.9.png
fab_shadow_dark.9.png
I've built an autoencoder network using Keras, and I use a Python function to convert my text input into an array with the ASCII representation of each character, padded with zeroes so they all have the same size.
My full code looks like this:
import sys
from keras import Input, Model
import matplotlib.pyplot as plt
from keras.layers import Dense
import numpy as np
from pprint import pprint
from google.colab import drive
# Mount Google Drive
drive.mount('/content/drive')

with open('/content/drive/My Drive/Colab Notebooks/drawables.txt', 'r') as arquivo:
    dados = arquivo.read().splitlines()

# Define a function that takes a list and returns an integer with the length of the
# longest element
def tamanho_maior_elemento(lista):
    maior = 0
    for elemento in lista:
        tamanho_elemento = len(elemento)
        if tamanho_elemento > maior:
            maior = tamanho_elemento
    return maior

# Define a function that takes a list and the length of the longest element and
# returns a list containing, for each element, another list with each character converted
# to ASCII; before converting, zeroes are added on the right so that all elements have
# the same length as the longest element.
def texto_para_ascii(lista, tamanho_maior_elemento):
    # for each line
    lista_ascii = list()
    for elemento in lista:
        elemento_ascii_lista = list()
        # pad the string with zeroes
        elemento_com_zeros = elemento.ljust(tamanho_maior_elemento, "0")
        for caractere in elemento_com_zeros:
            elemento_ascii_lista.append(ord(caractere))
        lista_ascii.append(elemento_ascii_lista)
    return lista_ascii

def ascii_para_texto(lista):
    # for each line
    lista_ascii = list()
    for elemento in lista:
        elemento_ascii_lista = list()
        for caractere in elemento:
            elemento_ascii_lista.append(chr(caractere))
        elemento_ascii_string = "".join(elemento_ascii_lista)
        lista_ascii.append(elemento_ascii_string)
    return lista_ascii

# Get the length of the longest element
tamanho_maior_elemento = tamanho_maior_elemento(dados)
# Get the length of the list
tamanho_lista = len(dados)
# Convert the data to ASCII
dados_ascii = texto_para_ascii(dados, tamanho_maior_elemento)
# Convert the ASCII data rows to a numpy array
np_dados_ascii = np.array(dados_ascii)
# Define the size of the compressed layer
tamanho_comprimido = int(tamanho_maior_elemento/5)
# Create the Input layer with the size of the longest element
dados_input = Input(shape=(tamanho_maior_elemento,))
# Create a hidden layer with the size of the compressed layer
hidden = Dense(tamanho_comprimido, activation='relu')(dados_input)
# Create the output layer with the size of the longest element
output = Dense(tamanho_maior_elemento, activation='relu')(hidden)
#resultado = Dense(tamanho_maior_elemento, activation='sigmoid')(output)
resultado = Dense(tamanho_maior_elemento)(output)
# Create the model
autoencoder = Model(input=dados_input, output=resultado)
# Compile the model
autoencoder.compile(optimizer='adam', loss='mse')
# Fit with the data
history = autoencoder.fit(np_dados_ascii, np_dados_ascii, epochs=10)
# Plot the loss per epoch
plt.plot(history.history["loss"])
plt.ylabel("Loss")
plt.xlabel("Epoch")
plt.show()
# Get the predict output
predict = autoencoder.predict(np_dados_ascii)
# Get the indices of the array that were classified
indices = np.argmax(predict, axis=0)
# Convert the predict output from a numpy array to a plain list
indices_list = indices.tolist()
identificados = list()
for indice in indices_list:
    identificados.append(dados[indice])
pprint(identificados)
My np.argmax(predict, axis=0) call returns a list of numbers, none of which is higher than my array size, so I presumed they are the positions in my input array that are outliers.
But I'm very unsure about how to interpret the predict data; my "indices" variable looks like this:
array([116, 116, 74, 74, 97, 115, 34, 116, 39, 39, 116, 116, 115,
116, 34, 74, 74, 34, 115, 116, 115, 74, 116, 39, 84, 116,
39, 34, 34, 84, 115, 115, 34, 39, 34, 116, 116, 10])
Is my interpretation correct? I mean, what are these returned numbers? They look nothing like my input, so I assumed they are positions in my input data array. Am I right?
EDIT: if at the end of the script I do:
print("--------------")
pprint(np_dados_ascii)
print("--------------")
pprint(predict)
I get the following data:
--------------
array([[ 97, 98, 111, ..., 48, 48, 48],
[ 97, 109, 97, ..., 48, 48, 48],
[ 97, 109, 97, ..., 48, 48, 48],
...,
[115, 97, 102, ..., 48, 48, 48],
[115, 100, 95, ..., 48, 48, 48],
[115, 101, 97, ..., 48, 48, 48]])
--------------
array([[86.44533 , 80.48006 , 13.409852, ..., 60.649754, 21.34232 ,
24.23074 ],
[98.18514 , 87.98954 , 14.873579, ..., 65.382866, 22.747816,
23.74556 ],
[85.682945, 79.46511 , 13.117042, ..., 60.182964, 21.096725,
22.625275],
...,
[86.989494, 77.36661 , 14.291222, ..., 53.586407, 18.540628,
26.212025],
[76.0646 , 70.029236, 11.804929, ..., 52.506832, 18.65119 ,
21.961123],
[93.25003 , 82.855354, 15.329873, ..., 56.992035, 19.869513,
28.3672 ]], dtype=float32)
What does the predict output mean? I don't understand why floats are returned when my input is an integer array.
Shouldn't it be an array with a different shape (in my result they are equal) containing just the ASCII text of the outliers?
Autoencoders are a type of neural network used to map a higher-dimensional input to a lower-dimensional representation. The architecture of an autoencoder is quite easy to understand and implement.
This article explains in a simple way what they do and how you should interpret your data.
For your specific case, first of all, I would try a different representation of the input: split each word on any '_' or '.' and encode it as a vector using the Keras Embedding layer (here is a tutorial on how to use Embedding layers).
Then, what you really want is to look at the output of your middle hidden layer, the one that encodes your input into a lower-dimensional space. From this lower-dimensional space you can either train a classifier to detect outliers, if you have ground truth, or use other unsupervised learning techniques to perform anomaly detection, or simply visualization and clustering.
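As a minimal sketch of that last step, reusing the dados_input, hidden and np_dados_ascii names from your code (the KMeans clustering and the distance-based ranking are just one illustrative, unsupervised follow-up, not the only option):

import numpy as np
from keras import Model
from sklearn.cluster import KMeans

# Build an encoder that stops at the bottleneck layer of the autoencoder defined above
encoder = Model(dados_input, hidden)

# Lower-dimensional codes, one row per input file name
codes = encoder.predict(np_dados_ascii)

# Cluster the codes, then rank points by distance to their cluster centre
# to flag candidate outliers.
km = KMeans(n_clusters=5, random_state=0).fit(codes)
dist = np.linalg.norm(codes - km.cluster_centers_[km.labels_], axis=1)
outlier_idx = np.argsort(dist)[-10:]  # the 10 points farthest from their centre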
I have some bigrams, let's say: [('word','word'),('word','word'),...,('word','word')]. How can I use scikit-learn's HashingVectorizer to create a feature vector that will subsequently be presented to some classification algorithm, e.g. SVC or Naive Bayes or any other type of classifier?
Firstly, you MUST understand what the different vectorizers are doing. Most vectorizers are based on bag-of-words approaches, where document tokens are mapped onto a matrix.
From the sklearn documentation, CountVectorizer and HashingVectorizer:
Convert a collection of text documents to a matrix of token counts
For instance, these sentences
The Fulton County Grand Jury said Friday an investigation of Atlanta's
recent primary election produced no evidence that any
irregularities took place .
The jury further said in term-end presentments that the City Executive
Committee , which had over-all charge of the election , `` deserves
the praise and thanks of the City of Atlanta '' for the manner in
which the election was conducted .
with this rough vectorizer:
from collections import Counter
from itertools import chain
from string import punctuation
from nltk.corpus import brown, stopwords
# Let's say the training/testing data is a list of words and POS
sentences = brown.sents()[:2]
# Extract the content words as features, i.e. columns.
vocabulary = list(chain(*sentences))
stops = stopwords.words('english') + list(punctuation)
vocab_nostop = [i.lower() for i in vocabulary if i not in stops]
# Create a matrix from the sentences
matrix = [Counter([w for w in words if w in vocab_nostop]) for words in sentences]
print matrix
would become:
[Counter({u"''": 1, u'``': 1, u'said': 1, u'took': 1, u'primary': 1, u'evidence': 1, u'produced': 1, u'investigation': 1, u'place': 1, u'election': 1, u'irregularities': 1, u'recent': 1}), Counter({u'the': 6, u'election': 2, u'presentments': 1, u'``': 1, u'said': 1, u'jury': 1, u'conducted': 1, u"''": 1, u'deserves': 1, u'charge': 1, u'over-all': 1, u'praise': 1, u'manner': 1, u'term-end': 1, u'thanks': 1})]
So this might be rather inefficient for a very large dataset, which is why the sklearn devs built more efficient code. One of the most important features of sklearn is that you don't even need to load the dataset into memory before vectorizing it.
Since it's unclear what your task is, I think you're looking for general usage. Let's say you're using it for language ID.
Let's say your input file with the training data is train.txt:
Pošto je EULEX obećao da će obaviti istragu o prošlosedmičnom izbijanju nasilja na sjeveru Kosova, taj incident predstavlja još jedan ispit kapaciteta misije da doprinese jačanju vladavine prava.
De todas as provações que teve de suplantar ao longo da vida, qual foi a mais difícil? O início. Qualquer começo apresenta dificuldades que parecem intransponíveis. Mas tive sempre a minha mãe do meu lado. Foi ela quem me ajudou a encontrar forças para enfrentar as situações mais decepcionantes, negativas, as que me punham mesmo furiosa.
Al parecer, Andrea Guasch pone que una relación a distancia es muy difícil de llevar como excusa. Algo con lo que, por lo visto, Alex Lequio no está nada de acuerdo. ¿O es que más bien ya ha conseguido la fama que andaba buscando?
Vo väčšine golfových rezortov ide o veľký komplex niekoľkých ihrísk blízko pri sebe spojených s hotelmi a ďalšími možnosťami trávenia voľného času – nie vždy sú manželky či deti nadšenými golfistami, a tak potrebujú iný druh vyžitia. Zaujímavé kombinácie ponúkajú aj rakúske, švajčiarske či talianske Alpy, kde sa dá v zime lyžovať a v lete hrať golf pod vysokými alpskými končiarmi.
And your corresponding labels are Bosnian, Portuguese, Spanish and Slovak, i.e.
[bs,pt,es,sr]
Here's one way to use the CountVectorizer with a naive Bayes classifier. The following example is from https://github.com/alvations/bayesline for the DSL shared task.
Let's start with the vectorizer. First, the vectorizer takes the input file, converts the training set into a vectorized matrix and initializes the vectorizer (i.e. the features):
import codecs
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
trainfile = 'train.txt'
testfile = 'test.txt'
# Vectorizing data.
train = []
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
tags = ['bs','pt','es','sr']
print word_vectorizer.get_feature_names()
[out]:
[u'acuerdo', u'aj', u'ajudou', u'al', u'alex', u'algo', u'alpsk\xfdmi', u'alpy', u'andaba', u'andrea', u'ao', u'apresenta', u'as', u'bien', u'bl\xedzko', u'buscando', u'come\xe7o', u'como', u'con', u'conseguido', u'da', u'de', u'decepcionantes', u'deti', u'dificuldades', u'dif\xedcil', u'distancia', u'do', u'doprinese', u'druh', u'd\xe1', u'ela', u'encontrar', u'enfrentar', u'es', u'est\xe1', u'eulex', u'excusa', u'fama', u'foi', u'for\xe7as', u'furiosa', u'golf', u'golfistami', u'golfov\xfdch', u'guasch', u'ha', u'hotelmi', u'hra\u0165', u'ide', u'ihr\xedsk', u'incident', u'intranspon\xedveis', u'in\xedcio', u'in\xfd', u'ispit', u'istragu', u'izbijanju', u'ja\u010danju', u'je', u'jedan', u'jo\u0161', u'kapaciteta', u'kde', u'kombin\xe1cie', u'komplex', u'kon\u010diarmi', u'kosova', u'la', u'lado', u'lequio', u'lete', u'llevar', u'lo', u'longo', u'ly\u017eova\u0165', u'mais', u'man\u017eelky', u'mas', u'me', u'mesmo', u'meu', u'minha', u'misije', u'mo\u017enos\u0165ami', u'muy', u'm\xe1s', u'm\xe3e', u'na', u'nada', u'nad\u0161en\xfdmi', u'nasilja', u'negativas', u'nie', u'nieko\u013ek\xfdch', u'no', u'obaviti', u'obe\u0107ao', u'para', u'parecem', u'parecer', u'pod', u'pone', u'pon\xfakaj\xfa', u'por', u'potrebuj\xfa', u'po\u0161to', u'prava', u'predstavlja', u'pri', u'prova\xe7\xf5es', u'pro\u0161losedmi\u010dnom', u'punham', u'qual', u'qualquer', u'que', u'quem', u'rak\xfaske', u'relaci\xf3n', u'rezortov', u'sa', u'sebe', u'sempre', u'situa\xe7\xf5es', u'sjeveru', u'spojen\xfdch', u'suplantar', u's\xfa', u'taj', u'tak', u'talianske', u'teve', u'tive', u'todas', u'tr\xe1venia', u'una', u've\u013ek\xfd', u'vida', u'visto', u'vladavine', u'vo', u'vo\u013en\xe9ho', u'vysok\xfdmi', u'vy\u017eitia', u'v\xe4\u010d\u0161ine', u'v\u017edy', u'ya', u'zauj\xedmav\xe9', u'zime', u'\u0107e', u'\u010dasu', u'\u010di', u'\u010fal\u0161\xedmi', u'\u0161vaj\u010diarske']
Let's say your test documents are in test.txt, whose labels are Spanish (es) and Portuguese (pt):
Por ello, ha insistido en que Europa tiene que darle un toque de atención porque Portugal esta incumpliendo la directiva del establecimiento del peaje
Estima-se que o mercado homossexual só na Cidade do México movimente cerca de oito mil milhões de dólares, aproximadamente seis mil milhões de euros
Now you can label the test documents with the trained classifier, like this:
import codecs, re, time
from itertools import chain
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
trainfile = 'train.txt'
testfile = 'test.txt'
# Vectorizing data.
train = []
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
tags = ['bs','pt','es','sr']
# Training NB
mnb = MultinomialNB()
mnb.fit(trainset, tags)
# Tagging the documents
codecs.open(testfile,'r','utf8')
testset = word_vectorizer.transform(codecs.open(testfile,'r','utf8'))
results = mnb.predict(testset)
print results
[out]:
['es' 'pt']
For more information on text classification, you might find this NLTK-related question/answer useful: nltk NaiveBayesClassifier training for sentiment analysis.
To use the HashingVectorizer, note that it produces feature values that can be negative, and the MultinomialNB classifier doesn't accept negative values, so you have to use another classifier, like this:
import codecs, re, time
from itertools import chain
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Perceptron
trainfile = 'train.txt'
testfile = 'test.txt'
# Vectorizing data.
train = []
word_vectorizer = HashingVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
tags = ['bs','pt','es','sr']
# Training Perceptron
pct = Perceptron(n_iter=100)
pct.fit(trainset, tags)
# Tagging the documents
codecs.open(testfile,'r','utf8')
testset = word_vectorizer.transform(codecs.open(testfile,'r','utf8'))
results = pct.predict(testset)
print results
[out]:
['es' 'es']
But do note that the Perceptron's results are worse in this small example. Different classifiers fit different tasks, different features fit different vectors, and different classifiers accept different vectors.
There is no perfect model, just better or worse ones.
Since you've already extracted the bigrams yourself, you can vectorize using a FeatureHasher. The main thing you need to do is squash the bigrams to strings. E.g.,
>>> data = [[('this', 'is'), ('is', 'a'), ('a', 'text')],
... [('and', 'one'), ('one', 'more')]]
>>> from sklearn.feature_extraction import FeatureHasher
>>> fh = FeatureHasher(input_type='string')
>>> X = fh.transform(((' '.join(x) for x in sample) for sample in data))
>>> X
<2x1048576 sparse matrix of type '<type 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
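From there you can feed X straight into a sparse-friendly classifier. Here is a minimal, self-contained sketch where the labels y are made up purely for illustration (note that MultinomialNB would not work here, since hashed features can be negative, as mentioned above):

from sklearn.feature_extraction import FeatureHasher
from sklearn.svm import LinearSVC

data = [[('this', 'is'), ('is', 'a'), ('a', 'text')],
        [('and', 'one'), ('one', 'more')]]
y = ['class_a', 'class_b']  # hypothetical labels, one per sample

fh = FeatureHasher(input_type='string')
X = fh.transform((' '.join(x) for x in sample) for sample in data)

# LinearSVC handles the signed, sparse output of FeatureHasher directly
clf = LinearSVC().fit(X, y)
print(clf.predict(fh.transform([['one more']])))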
I am trying to translate the R implementations of gap statistics and prediction strength http://edchedch.wordpress.com/2011/03/19/counting-clusters/ into Python scripts for estimating the number of clusters in the iris data, which has 3 clusters. Instead of getting 3 clusters, I get different results on different runs, and 3 (the actual number of clusters) is hardly ever estimated. The graph shows the estimated number to be 10 instead of 3. Am I missing something? Can anyone help me locate the problem?
import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
def dispersion(data, k):
    if k == 1:
        cluster_mean = np.mean(data, axis=0)
        distances_from_mean = np.sum((data - cluster_mean)**2, axis=1)
        dispersion_val = np.log(sum(distances_from_mean))
    else:
        k_means_model_ = KMeans(n_clusters=k, max_iter=50, n_init=5).fit(data)
        distances_from_mean = range(k)
        for i in range(k):
            distances_from_mean[i] = int()
            for idx, label in enumerate(k_means_model_.labels_):
                if i == label:
                    distances_from_mean[i] += sum((data[idx] - k_means_model_.cluster_centers_[i])**2)
        dispersion_val = np.log(sum(distances_from_mean))
    return dispersion_val

def reference_dispersion(data, num_clusters, num_reference_bootstraps):
    dispersions = [dispersion(generate_uniform_points(data), num_clusters) for i in range(num_reference_bootstraps)]
    mean_dispersion = np.mean(dispersions)
    stddev_dispersion = float(np.std(dispersions)) / np.sqrt(1. + 1. / num_reference_bootstraps)
    return mean_dispersion

def generate_uniform_points(data):
    mins = np.argmin(data, axis=0)
    maxs = np.argmax(data, axis=0)
    num_dimensions = data.shape[1]
    num_datapoints = data.shape[0]
    reference_data_set = np.zeros((num_datapoints, num_dimensions))
    for i in range(num_datapoints):
        for j in range(num_dimensions):
            reference_data_set[i][j] = random.uniform(data[mins[j]][j], data[maxs[j]][j])
    return reference_data_set

def gap_statistic(data, nthCluster, referenceDatasets):
    actual_dispersion = dispersion(data, nthCluster)
    ref_dispersion = reference_dispersion(data, nthCluster, num_reference_bootstraps)
    return actual_dispersion, ref_dispersion

if __name__ == "__main__":
    data = np.loadtxt('iris.mat', delimiter=',', dtype=float)
    maxClusters = 10
    num_reference_bootstraps = 10
    dispersion_values = np.zeros((maxClusters, 2))
    for cluster in range(1, maxClusters+1):
        dispersion_values_actual, dispersion_values_reference = gap_statistic(data, cluster, num_reference_bootstraps)
        dispersion_values[cluster-1][0] = dispersion_values_actual
        dispersion_values[cluster-1][1] = dispersion_values_reference
    gaps = dispersion_values[:,1] - dispersion_values[:,0]
    print gaps
    print "The estimated number of clusters is ", range(maxClusters)[np.argmax(gaps)]+1
    plt.plot(range(len(gaps)), gaps)
    plt.show()
Your graph is actually showing the correct value of 3. Let me explain a bit.
As you increase the number of clusters, your distance metric will certainly decrease, so you assume that the correct value is 10. If you increase it beyond 10, the distance metric will decrease further. But this should not be our decision-making criterion.
We need to find the inflection point (marked in red in the plot). It is the point where the slope flattens out. You might want to take a look at elbow curves.
Based on the above two points, the inflection point is at 3 (which is also the correct solution).
Hope this helps.
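If you want to pick that inflection point programmatically rather than by eye, here is a minimal sketch of one simple heuristic (my own addition, not the formal gap-statistic rule): take the second difference of the gaps array printed by your script and pick the k where the curve bends the most.

import numpy as np

def elbow_k(gaps):
    # gaps[i] is assumed to correspond to k = i + 1, as in the script above.
    # The elbow/inflection point is where the curvature is most negative,
    # i.e. where the rising gap curve flattens out the fastest.
    second_diff = np.diff(gaps, n=2)        # defined for k = 2 .. maxClusters - 1
    return int(np.argmin(second_diff)) + 2  # +2 maps the diff index back to k

# e.g. print(elbow_k(gaps)) at the end of the main block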
You could take a look at this code and change your output plot format:
# coding: utf-8
# K-means clustering implementation in Python
# Load the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import datasets
# Load the Iris dataset
iris = datasets.load_iris()
# Import the Iris dataset using the pandas module
x = pd.DataFrame(iris.data)
x.columns = ['Sepal_Length','Sepal_width','Petal_Length','Petal_width']
y = pd.DataFrame(iris.target)
y.columns = ['Targets']
# Create a K-Means object with a grouping into 3 clusters (groups)
model=KMeans(n_clusters=3)
# Apply the model to our Iris dataset
model.fit(x)
# Visualize the clusters
plt.scatter(x.Petal_Length, x.Petal_width)
plt.show()
colormap=np.array(['Red','green','blue'])
# Visualize the dataset without altering it (display the flowers according to their labels)
plt.scatter(x.Petal_Length, x.Petal_width,c=colormap[y.Targets],s=40)
plt.title('Classification réelle')
plt.show()
# Visualize the clusters formed by K-Means
plt.scatter(x.Petal_Length, x.Petal_width,c=colormap[model.labels_],s=40)
plt.title('Classification K-means ')
plt.show()