How to input twitter data (csv/txt) into DBSCAN python? - python

Could someone guide me how could i cluster twitter data using DBSCAN in python? I am totally new to DBSCAN. Also, how to determine the eps value and the iloc or loc value.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
def clusterEvaluate(cluster):
count_cluster = np.bincount(cluster)
count_cluster = np.argmax(count_cluster)
same_clusters = np.count_nonzero(cluster == count_cluster)/np.size(cluster)
return same_clusters
dataset = np.loadtxt('tweetdata.csv') # not sure if this works
X = StandardScaler().fit_transform(dataset)
y_valid = dataset.iloc[:6].values()
dbscan = DBSCAN(eps= 0.5,min_samples=5,metric='euclidean')
y = dbscan.fit_predict(X)
cluster_labels = np.unique(y)
same_clusters = []
i = 0
for index in cluster_labels:
cluster = y_valid[y == index]
same_clusters.insert((i, clusterEvaluate(cluster)))

You need to choose and appropriate data representation and distance function for this. Furthermore, scalability will kill you.
I do not think it will work well. I have it seen anything that gives insightful results beyond counting frequent words in a unnecessary complex fashion. Twitter data is a bitch. The messages are just too short. All the good approaches like LDA need much longer documents.

Related

ValueError: could not convert string to float: 'GIAC'

I am trying to perform a K Means Clustering on a set of data that all texts. I have tried these lines of code and I am getting an error saying "ValueError: could not convert string to float: 'GIAC'".
I think the program is still having problems converting my text into vectors to be able to perform a clustering.
I really do not know how to solve this.
Here are the lines of code:
import numpy as np
import matplotlib.pyplot as plot
import pandas as pd
from sklearn.cluster import KMeans
Cert = pd.read_csv('Certification.csv')
X = Cert.iloc[:,:].values
wcss =[]
for i in range(1,5):
kmeans = KMeans(n_clusters = i, init='k-means++', random_state = 0)
kmeans.fit(X)
wcss.append(kmeans.inertia_)
plot.plot(range(1,5),wcss)
plot.title('Elbow Method')
plot.xlabel('Number of Clusters')
plot.ylabel('WCSS')
plot.show()
I also have attached a screenshot of the error message.error message
enter code here
K-means requires your data to be continuous variables.
Clearly, 'GIAC' is not a number, is it?
K-means cannot be used on this data. You'd need to do one-hot encoding or similar, but that comes with it's very own set of problems with k-means... Usually when you have data with values such as 'GIAC' there just is no sound way to cluster the data in a statistically meaningful way. Too many heuristic choice along he way to get a result, that you could get pretty much any other result, too. Try to approach the problem mathematically, not with copy&pasting code.

Dealing with Memory Error (Python sklearn clustering)

I have a dataset each of datum has sparse labels.
So, below is how data looks like.
[["Snow","Winter","Freezing","Fun","Beanie","Footwear","Headgear","Fur","Playing in the snow","Photography"],["Tree","Sky","Daytime","Urban area","Branch","Metropolitan area","Winter","Town","City","Street light"],...]
The total numbers of labels are around 50, and the numbers of data are 200K. And I want to cluster this data, but I'm having trouble dealing with that.
I want to cluster that data with four clustering algorithms(AgglomerativeClustering, SpectralClustering, MiniBatchKMeans, KMeans), but none of these worked because of memory issues.
Below is my code.
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import SpectralClustering
import json
NUM_OF_CLUSTERS = 10
with open('./data/sample.json') as json_file:
json_data = json.load(json_file)
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in json_data:
for term in d:
index = vocabulary.setdefault(term, len(vocabulary))
indices.append(index)
data.append(1)
indptr.append(len(indices))
X = csr_matrix((data, indices, indptr), dtype=int).toarray()
# None of these algorithms work properly. I think it's because of memory issues.
# miniBatchKMeans = MiniBatchKMeans(n_clusters=NUM_OF_CLUSTERS, n_init=5, random_state=0).fit(X)
# agglomerative = AgglomerativeClustering(n_clusters=NUM_OF_CLUSTERS).fit(X)
# spectral = SpectralClustering(n_clusters=NUM_OF_CLUSTERS, assign_labels="discretize", random_state=0).fit(X)
#
# print(miniBatchKMeans.labels_)
# print(agglomerative.labels_)
# print(spectral.labels_)
with open('data.json', 'w') as outfile:
json.dump(miniBatchKMeans.labels_.tolist(), outfile)
Are there any solutions or other recommendations for my problem?
What is the size of X?
With toarray() you are converting the data into a sense format. That significantly increases the memory requirements.
With 200k instances you cannot use spectral clustering not affiniy propagation, because these need O(n²) memory. So either you choose other algorithms or subsample your data. Obviously there is also no use in doing both kmeans and minibatch kmeans (which is an approximation to kmeans). Use only one.
To efficiently work with sparse data, you may need to implement the algorithms yourself. Kmeans is designed for dense data, so it makes sense to time the implementation for dense data by default. In fact, using the mean on sparse data is rather questionable. So I'd not expect the results to be very good on your data with kmeans either.

Gap Statistic Method

import sys
import numpy as np
import scipy.io as sio
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.svm import SVC
filename = sys.argv[1]
datafile = sio.loadmat(filename)
data = datafile['bow']
sizedata=[len(data), len(data[0])]
gap=[]
SD=[]
for knum in xrange(10,20):
print knum
#Clustering original Data
kmeanspp = KMeans(n_clusters=knum,init = 'k-means++',max_iter = 100,n_jobs = 1)
kmeanspp.fit(data)
dispersion = kmeanspp.inertia_
#Clustering Reference Data
nrefs = 10
refDisp = np.zeros(nrefs)
for nref in xrange(nrefs):
refdata = np.random.random_sample((sizedata[0],sizedata[1]))
refkmeans = KMeans(n_clusters=knum,init='k-means++',max_iter=100,n_jobs=1)
refkmeans.fit(refdata)
refdisp = refkmeans.inertia_
refDisp[nref]=np.log(refdisp)
mean_log_refdisp = np.mean(refDisp)
gap.append(mean_log_refdisp-np.log(dispersion))
#Calculating standard deviaiton
sd = (sum([(r-m)**2 for r,m in zip(refDisp,[mean_log_refdisp]*nrefs)])/nrefs)**0.5
SD.append(sd)
SD = [sd*((1+(1/nrefs))**0.5) for sd in SD]
#determining optimal k
opt_k = None
diff = []
for i in xrange(len(gap)-1):
diff = (SD[i+1]-(gap[i+1]-gap[i]))
if diff>0:
opt_k = i+10
break
print diff
plt.plot(np.linspace(10,19,10,True),gap)
plt.show()
Here I am trying to implement the Gap Statistic method for determining the optimal number of clusters. But the problem is that every time I run the code I get a different value for k.
What is the solution to the problem?
How can the value of optimal k differ for the same data?
I have stored the data in a .mat file beforehand and I am passing it as an argument via terminal
I am looking for the smallest value of k for which Gap(k)>= Gap(k+1)-s(k+1) where s(k+1) = sd(k+1)*square_root(1+(1/B)) where sd is the standard deviation of the reference distribution and B is the number of copies of Monte Carlo sample
Otherwise stated, I am searching for the value of k for which
s(k+1)-Gap(k+1)+Gap(k)>=0
Couple of problems with your simulation:
1- sd = (sum([(r-m)**2 for r,m in zip(refDisp,[mean_log_refdisp]*nrefs)])/nrefs)**0.5
Why did you multiply the second component of zip by nrefs that is not needed according to the original paper.
2-
if diff>0:
opt_k = i+10
break
if diff>0 you want diff>=0 since equality can happen a
About why you get different number of clusters each time, as people said it is monte carlo simulation so there can be randomness and also it depends on what you are clustering and your dataset. I suggest you to test your algorithms against Silhouette and Elbow to get a better idea about number of clusters.
One option is to run your function several times and then average the gap statistics and the s values, and find the smallest k where the average s(k+1)-Gap(k+1)+Gap(k) is greater than
This will take longer but give a more reliable result.

python - tsne reduce memory consumption

I have a very large csv file that I am trying to do a tf-idf analysis with. I am using the tfidvectorizer for sklearn and then I want to reduce the dimensionality using tsne so I can plot the results. However, whenever I use tsne, I get a memory error. Is there a way I can prevent this from happening? I read you can use svdTruncate before implementing tsne. How would I go about doing this (or are there any other suggestions)? Here is my code:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,max_features=n_features,stop_words='english', lowercase = False)
tfidf = tfidf_vectorizer.fit_transform(df['paper_text'])
nmf = NMF(n_components=n_topics, random_state=0,alpha=.1, l1_ratio=.5).fit(tfidf)
nmf_embedding = nmf.transform(tfidf)
nmf_embedding = (nmf_embedding - nmf_embedding.mean(axis=0))/nmf_embedding.std(axis=0)
tsne = TSNE(random_state=3211)
tsne_embedding = tsne.fit_transform(nmf_embedding)
tsne_embedding = pd.DataFrame(tsne_embedding,columns=['x','y'])
tsne_embedding['hue'] = nmf_embedding.argmax(axis=1)
Thanks in advance!

ML - Getting feature names after feature selection - SelectPercentile, python

I have been struggling with this one for a while.
My goal is to take a text feature that I have, and find the best 5-10 words in it to help me classify. Hence, I am running a TfIdfVectorizer, and choosing ~90 best for now. however, after I downsize the feature amount, I am unable to see which features were actually chosen.
here is what I have:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
train=pandas.read_csv("train.tsv", sep='\t')
labels_train = train["label"]
documents = []
for i, row in train.iterrows():
documents.append((row['boilerplate'][1:-1].lower()))
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
features_train_transformed = vectorizer.fit_transform(documents)
selector = SelectPercentile(f_classif, percentile=0.1)
selector.fit(features_train_transformed, labels_train)
features_train_transformed = selector.transform(features_train_transformed).toarray()
The result is that features_train_transformed contains a matrix of all the tfidf scores per word per document of the selected words, however I have no idea which words were chosen, and methods like "get_feature_names()" are unavailable for the class SelectPercentile.
This is neccesary because i need to add these features to a bunch of numeric features and only then make my training and predictions.
selector.get_support() to get you a boolean array of columns that were within the percentile range you specified
train.columns.values should get you the complete list of column names for the original dataframe
filtering the latter with the former should give you the names of columns that make up your chosen percentile range.
the code below (cut-pasted from working code) is similar enough to yours, that it's hopefully helpful
import numpy as np
selection = SelectPercentile(f_regression, percentile=2)
train_minus_target = train.drop("y", axis=1)
x_features = selection.fit_transform(train_minus_target, y_train)
columns = np.asarray(train_minus_target.columns.values)
support = np.asarray(selection.get_support())
columns_with_support = columns[support]
Reference:
about get_support

Categories