Mixture of Gaussians using scikit learn mixture - python

I'd like to use sklearn.mixture.GMM to fit a mixture of Gaussians to some data, with results similar to the ones I get using R's "Mclust" package.
The data looks like this:
Here's how I cluster the data using R; it gives me 14 nicely separated clusters and is easy as falling down stairs:
data <- read.table('~/gmtest/foo.csv',sep=",")
library(mclust)
D = Mclust(data,G=1:20)
summary(D)
plot(D, what="classification")
And here's what I say when I try it with python:
from sklearn import mixture
import numpy as np
import os
from matplotlib import pyplot
os.chdir(os.path.expanduser("~/gmtest"))
data = np.loadtxt(open('foo.csv',"rb"),delimiter=",",skiprows=0)
gmm = mixture.GMM( n_components=14,n_iter=5000, covariance_type='full')
gmm.fit(data)
classes = gmm.predict(data)
pyplot.scatter(data[:,0], data[:,1], c=classes)
pyplot.show()
Which assigns all points to the same cluster. I've also noticed that the AIC for the fit is lowest when I tell it to find exactly 1 cluster, and increases linearly with increasing numbers of clusters. What am I doing wrong? Are there additional parameters I need to consider?
Is there a difference in the models used by Mclust and by sklearn.mixture?
But more important: what is the best way in sklearn to cluster my data?

The trick is to set GMM's min_covar. So in this case I get good results from:
mixture.GMM( n_components=14,n_iter=5000, covariance_type='full',min_covar=0.0000001)
The large default value for min_covar assigns all points to one cluster.
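In scikit-learn releases where mixture.GMM is no longer available, the same idea can be sketched with mixture.GaussianMixture, where reg_covar roughly plays the role of min_covar and max_iter replaces n_iter. The data below is a synthetic stand-in for foo.csv (three tight blobs on a small scale), and the value 1e-10 is only an illustrative assumption, not a recommended setting:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for foo.csv: three tight blobs on a very small scale
centers = np.array([[0.0, 0.0], [0.01, 0.0], [0.0, 0.01]])
data = np.vstack([c + 0.0005 * rng.standard_normal((200, 2)) for c in centers])

# reg_covar is added to the diagonal of every covariance matrix, so it
# should be small compared with the variance of the data (assumed value)
gmm = GaussianMixture(
    n_components=3,
    covariance_type="full",
    reg_covar=1e-10,
    max_iter=5000,
    n_init=5,
    random_state=0,
)
classes = gmm.fit_predict(data)
print(np.unique(classes))
```

If all points still land in one cluster, comparing the regularization value against np.var(data, axis=0) is a quick sanity check.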

Related

Can we rank K-Means clusters or assign weights to certain clusters?

I am working on a K-Means Clustering task and I am wondering if there is some way to do some kind of ranking of clusters, or maybe assign specific weights to some specific clusters. Is there a way to do this? Here is my code.
from pylab import plot,show
from numpy import vstack,array
from numpy.random import rand
import numpy as np
from scipy.cluster.vq import kmeans,vq
import pandas as pd
import pandas_datareader as dr
from math import sqrt
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
dataset = pd.read_csv('C:\\my_path\\analytics.csv')
data = np.asarray([np.asarray(dataset['Rating']),np.asarray(dataset['Maturity']),np.asarray(dataset['Score']),np.asarray(dataset['Bin']),np.asarray(dataset['Price1']),np.asarray(dataset['Price2']),np.asarray(dataset['Price3'])]).T
centroids,_ = kmeans(data,1000)
idx,_ = vq(data,centroids)
details = [(name,cluster) for name, cluster in zip(dataset.Cusip,idx)]
So, I get my 'details', I look at it, and everything seems fine at this point. I end up with around 700 clusters. I'm just wondering if there is a way to rank-order these clusters, assuming 'Rating' is the most important feature. Or, perhaps there is a way to assign a higher weight to 'Rating'. I'm not sure this makes 100% sense. I'm just thinking about the concept and wondering if there is some obvious solution or maybe this is just nonsense. I can easily count the records in each cluster, but I don't think that has any significance whatsoever. I Googled this and didn't find anything useful.
One "cheat" trick would be to use the feature 'Rating' two or three times; it then automatically gets more weight:
data = np.asarray([np.asarray(dataset['Rating']), np.asarray(dataset['Rating']), np.asarray(dataset['Maturity']),np.asarray(dataset['Score']),np.asarray(dataset['Bin']),np.asarray(dataset['Price1']),np.asarray(dataset['Price2']),np.asarray(dataset['Price3'])]).T
There are also adapted versions of k-means around, but as far as I know they are not implemented in Python.
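Duplicating a column is the same as scaling it: with Euclidean distance, using a feature twice doubles its contribution to the squared distance, which is what multiplying that column by sqrt(2) does. A minimal sketch on made-up data (column 0 stands in for 'Rating'; the helper sq_dists is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # made-up data; column 0 plays the role of 'Rating'

# Option A: duplicate column 0 so it appears twice
X_dup = np.column_stack([X[:, 0], X])

# Option B: keep one copy, but scale it by sqrt(2)
X_scaled = X.copy()
X_scaled[:, 0] *= np.sqrt(2)

def sq_dists(A):
    """Pairwise squared Euclidean distances between the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return (diff ** 2).sum(axis=-1)

# Both options give identical pairwise distances, hence identical clusterings
same = np.allclose(sq_dists(X_dup), sq_dists(X_scaled))
print(same)
```

Scaling allows any weight w (multiply the column by sqrt(w)), not only whole-number repetitions. Standardizing the features first keeps the weight from being swamped by differences in units.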

How to use k means for a product recommendation dataset

I have a data set with the columns product name, brand, rating (1 to 5), review text and review-helpfulness. I need to propose a recommendation algorithm based on the reviews, and I have to use Python. The data set is in .csv format.
To understand the nature of the data set I want to run k-means on it. How do I use k-means on this data set?
So far I have done the following:
1. data pre-processing,
2. review text data cleaning,
3. sentiment analysis,
4. giving each review a sentiment score from 1 to 5 according to the sentiment value (given by the sentiment analysis) and tagging reviews as very negative, negative, neutral, positive or very positive.
After these steps I have these columns in my data set: product name, brand, rating (1 to 5), review text, review-helpfulness, sentiment-value, sentiment-tag.
This is the link to the data set https://drive.google.com/file/d/1YhCJNvV2BQk0T7PbPoR746DCL6tYmH7l/view?usp=sharing
I tried k-means with the following code. It runs without error, but I don't know whether the result is useful, or whether there are other ways to use k-means on this data set to get more useful output. How should I use k-means to learn more about the data?
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
df.info()
X = np.array(df.drop(['sentiment_value'], axis=1).astype(float))
y = np.array(df['rating'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
plt.show()
You did not plot anything.
So nothing shows up.
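For plt.show() to display something, a plotting call has to come first. A minimal sketch, using made-up two-column data in place of the real data frame (the feature names and values are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Made-up stand-in for two numeric columns, e.g. rating and sentiment value
X = np.vstack([
    rng.normal([2, 2], 0.3, size=(50, 2)),
    rng.normal([4, 4], 0.3, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Colour each point by its cluster label and mark the centroids
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.show()
```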
Unless you are more specific about what you are trying to achieve, we won't be able to help. Figure out what exactly you want to predict. Do you just want to cluster products according to their sentiment score, which isn't especially promising, or do you want to predict actual product preferences on a new dataset?
If you want to build a recommendation system, the only possibility (considering your dataset) would be to identify similar products according to the rating/sentiment. Is that what you want?

Different PCA plots

I was trying to learn PCA (using the iris dataset) with Python and got some results, so I wanted to check them in R to make sure they were right. When I compared them, R gave me a mirror image of the Python plot (flipped along the y axis), and some of the values had the opposite sign (Python: [140,1] = 0.1826089, R: [141,2] = -0.1826089; Python counts from zero).
The python code:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.decomposition as p
data=np.loadtxt("sample_data/iris.txt",delimiter=';',usecols=(0,1,2,3))
pca=p.PCA().fit(data)
pcaData=pca.transform(data)
plt.scatter(pcaData[:,0],pcaData[:,1])
print(pcaData[140,1])
My python diagram
The R code:
data=read.csv("C:\\Users\\George\\Desktop\\iris.csv",sep=";",colClasses=c(NA, NA, NA,NA,"NULL"));data=data[-151,]
pca=prcomp(data)
plot(pca$x[,1],pca$x[,2])
print(pca$x[141,2])
My R diagram
Searching the internet, I found that the same thing happens:
The R diagram on the internet-Source
The Python diagram on the internet-Source.
I was expecting them to be the same.
Is there something that I do not understand well?
Thank you.
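scikit-learn can use a pseudo-randomized method to compute an approximation of the singular value decomposition (by default only for large inputs; a small dataset like iris goes through an exact solver).
See https://scikit-learn.org/stable/modules/generated/sklearn.utils.extmath.randomized_svd.html
More fundamentally, the sign of each principal component is arbitrary: if v is a valid component, so is -v. Therefore, unless you can guarantee that both implementations use the same method, the same sign convention and the same random seed, you will not obtain exactly the same values for the principal components. A mirrored plot with identical absolute values is still the same PCA.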
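One way to check that two PCA results agree up to sign is to compare absolute values, or to flip each component so both results point the same way. A minimal sketch, assuming the iris data bundled with scikit-learn and using an eigendecomposition of the covariance matrix as the second, independent implementation (standing in for R's prcomp):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
scores_sklearn = PCA(n_components=2).fit_transform(X)

# Independent computation: eigenvectors of the covariance matrix
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1][:2]  # two largest eigenvalues
scores_eigh = Xc @ eigvecs[:, order]

# Signs may differ per component, but the absolute values should agree
same_up_to_sign = np.allclose(np.abs(scores_sklearn), np.abs(scores_eigh))

# Flip each component of the second result to match the first
signs = np.sign((scores_sklearn * scores_eigh).sum(axis=0))
aligned = scores_eigh * signs
print(same_up_to_sign, signs)
```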

Finding Hot Spots in Python. KernelDensity

I am facing the following problem:
I have a (large) sample of unevenly distributed points $(X_i,Y_i)$ in a 2D space. I would like to determine the local extrema of the density of the distribution.
Does KernelDensity allow estimating the density at a point outside the sample?
If yes, I cannot find the right syntax.
Here is an example:
import numpy as np
import pandas as pd
mean0=[0,0]
cov0=[[1,0],[0,1]]
mean1=[3,3]
cov1=[[1,0.2],[0.2,1]]
A=pd.DataFrame(np.vstack((np.random.multivariate_normal(mean0, cov0, 5000),np.random.multivariate_normal(mean1, cov1, 5000))))
A.columns=['X','Y']
A.describe()
from sklearn.neighbors import KernelDensity
kde = KernelDensity(bandwidth=0.04, metric='euclidean',
kernel='gaussian', algorithm='ball_tree')
kde.fit(A)
If I make this query
kde.score_samples([(0,0)])
I get a negative number, clearly not a density!
array([-2.88134574])
I don't know if it's the right approach. I would then like to pass that function to an optimizer to find the local extrema (which library/function would you recommend?).
EDIT: yes, score_samples returns a log density, not a density, so it can be a negative number.
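Since score_samples returns log densities, np.exp turns them back into densities, and any point can be queried whether or not it belongs to the sample. A minimal sketch that also looks for the hot spot on a regular grid; the bandwidth of 0.3, the grid limits and the sample size are assumptions made for this example:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
sample = np.vstack([
    rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], 2000),
    rng.multivariate_normal([3, 3], [[1, 0.2], [0.2, 1]], 2000),
])

# bandwidth=0.3 is an assumed value for this sketch
kde = KernelDensity(bandwidth=0.3, kernel="gaussian").fit(sample)

# Log density at an arbitrary query point, converted to a density
log_density = kde.score_samples([[0.0, 0.0]])
density_at_origin = np.exp(log_density)[0]

# Evaluate on a regular grid and take the highest-density grid point
xs = np.linspace(-3, 6, 60)
ys = np.linspace(-3, 6, 60)
grid = np.array([(x, y) for x in xs for y in ys])
density = np.exp(kde.score_samples(grid))
hot_spot = grid[np.argmax(density)]
print(density_at_origin, hot_spot)
```

To refine a grid maximum, one option is to hand the negative log density to scipy.optimize.minimize, starting from the best grid points.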

mlpy SVM converges but classifies wrongly

I am trying to do some classification task with python and SVM.
From the collected data I extracted the feature vectors for each class and created a training set. The feature vectors are n-dimensional (39 or more). So, say for 2 classes, I have a set of 39-d feature vectors and a single array of class labels, one for each entry in the feature vectors. Currently, I am using mlpy and doing something like this:
import numpy as np
import mlpy
svm=mlpy.Svm('gaussian') #tried a linear kernel too, but it does not converge
instance= np.vstack((featurevector1,featurevector1))
label=np.hstack((np.ones((1,len(featurevector1)),dtype=int),-1*np.ones((1,len(featurevector2)),dtype=int)))
#Assigning a label (+1/-1) to each entry in instance (+1 for entries coming from
#featurevector1 and -1 for featurevector2)
svm.compute(instance,label) #it converges and outputs 1
svm.predict(testdata) #predicts class label 1 for everything, although my test data comes from both classes
Am I making a mistake here? Or should I use some other library? Please help.
I don't use mlpy, but np.ones((1,len(featurevector1))) should perhaps be just np.ones(len(featurevector1)).
Print the .shape of each to see the difference.
(If you have a link to public data anything like yours, could you post it please ?)
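The shape difference can be seen with plain NumPy. A minimal sketch with made-up class sizes (n1 and n2 are assumptions standing in for len(featurevector1) and len(featurevector2)):

```python
import numpy as np

n1, n2 = 3, 2  # made-up class sizes

# As in the question: a 2-D row vector of labels
labels_2d = np.hstack((np.ones((1, n1), dtype=int),
                       -1 * np.ones((1, n2), dtype=int)))

# As suggested: a flat 1-D array with one label per sample
labels_1d = np.hstack((np.ones(n1, dtype=int),
                       -1 * np.ones(n2, dtype=int)))

print(labels_2d.shape)  # (1, 5)
print(labels_1d.shape)  # (5,)
```

Whether mlpy accepts the 2-D form is not confirmed here; the flat array is the shape most libraries expect for a label vector.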
