Dealing with Memory Error (Python sklearn clustering) - python

I have a dataset in which each datum has a sparse set of labels.
Below is what the data looks like.
[["Snow","Winter","Freezing","Fun","Beanie","Footwear","Headgear","Fur","Playing in the snow","Photography"],["Tree","Sky","Daytime","Urban area","Branch","Metropolitan area","Winter","Town","City","Street light"],...]
There are around 50 distinct labels and about 200K data points. I want to cluster this data, but I'm having trouble doing so.
I tried four clustering algorithms (AgglomerativeClustering, SpectralClustering, MiniBatchKMeans, KMeans), but none of them worked because of memory issues.
Below is my code.
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import SpectralClustering
import json
NUM_OF_CLUSTERS = 10
with open('./data/sample.json') as json_file:
    json_data = json.load(json_file)
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in json_data:
    for term in d:
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))
X = csr_matrix((data, indices, indptr), dtype=int).toarray()
# None of these algorithms work properly. I think it's because of memory issues.
# miniBatchKMeans = MiniBatchKMeans(n_clusters=NUM_OF_CLUSTERS, n_init=5, random_state=0).fit(X)
# agglomerative = AgglomerativeClustering(n_clusters=NUM_OF_CLUSTERS).fit(X)
# spectral = SpectralClustering(n_clusters=NUM_OF_CLUSTERS, assign_labels="discretize", random_state=0).fit(X)
#
# print(miniBatchKMeans.labels_)
# print(agglomerative.labels_)
# print(spectral.labels_)
with open('data.json', 'w') as outfile:
    json.dump(miniBatchKMeans.labels_.tolist(), outfile)
Are there any solutions or other recommendations for my problem?

What is the size of X?
With toarray() you are converting the data into a dense format. That significantly increases the memory requirements.
With 200k instances you cannot use spectral clustering or affinity propagation, because these need O(n²) memory. So either choose other algorithms or subsample your data. There is also no point in running both kmeans and minibatch kmeans (which is an approximation to kmeans). Use only one.
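To put the O(n²) requirement in numbers, here is a back-of-the-envelope estimate. It assumes a dense n × n pairwise matrix stored as 8-byte floats; that storage layout is an assumption for illustration, and real implementations may need more or less.

```python
# Rough memory estimate for a dense n x n pairwise (affinity) matrix.
# Assumption: one 8-byte float (float64) per entry.
n_samples = 200_000
bytes_per_entry = 8

total_bytes = n_samples * n_samples * bytes_per_entry
total_gb = total_bytes / 1e9  # decimal gigabytes

print(f"{total_gb:.0f} GB")  # prints "320 GB"
```

Subsampling to 20k instances would cut this by a factor of 100, which is why subsampling is a realistic option here.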
To work efficiently with sparse data, you may need to implement the algorithms yourself. Kmeans is designed for dense data, so it makes sense for implementations to be tuned for dense data by default. In fact, using the mean on sparse data is rather questionable, so I would not expect kmeans to give very good results on your data either.
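As a possible alternative to writing your own implementation, scikit-learn's KMeans and MiniBatchKMeans accept SciPy sparse matrices directly, so the toarray() call can simply be dropped. The sketch below uses a tiny made-up list of label sets in place of the real JSON file, and the cluster count of 2 is chosen only to suit that toy data. The caveat above about the quality of k-means results on this kind of data still applies.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import MiniBatchKMeans

# Toy stand-in for the label lists in the question (hypothetical data).
docs = [
    ["Snow", "Winter", "Fun"],
    ["Tree", "Sky", "Winter"],
    ["Snow", "Freezing", "Winter"],
    ["City", "Street light", "Sky"],
]

# Build the CSR arrays the same way as in the question.
indptr, indices, data, vocabulary = [0], [], [], {}
for labels in docs:
    for term in labels:
        indices.append(vocabulary.setdefault(term, len(vocabulary)))
        data.append(1.0)
    indptr.append(len(indices))

# Keep the matrix sparse: note that there is no .toarray() call.
X = csr_matrix((data, indices, indptr),
               shape=(len(docs), len(vocabulary)), dtype=np.float64)

model = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0).fit(X)
print(model.labels_)
```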

Related

TruncatedSVD n_oversamples seems to have no bearing

I'm looking for a way to improve the quality of the eigenvectors produced by sklearn's TruncatedSVD. The documentation at scikit-learn.org suggests that the n_oversamples parameter is a good place to start. My input is a sparse 2200 x 2200 matrix (provided as three separate files containing row indexes, column indexes, and data values). Here's my code:
from array import array
import sys
import numpy as np
import struct
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
path="c:\\users\\lenwh\\documents\\wikipedia\\weights\\"
file=sys.argv[1]
dims=int(sys.argv[2]) #I use 300
with open(path+ file + ".rows","rb") as f:
    rows=np.fromfile(f,dtype=np.int32)
with open(path+ file + ".cols","rb") as f:
    cols=np.fromfile(f,dtype=np.int32)
with open(path+ file + ".data","rb") as f:
    data = np.fromfile(f, dtype=np.float32)
rowCount=len(np.unique(rows))
csr=csr_matrix((data, (rows, cols)), shape=(rowCount, rowCount))
vectorsfile=path+"eigens.vec"
transfile=path+ file + ".eig"
oversamples=10;
pca=TruncatedSVD(n_components=dims, n_oversamples=oversamples)
pca.fit(csr)
np.savetxt(transfile,pca.transform(csr),fmt='%16f')
The problem is that whether I set oversamples to 10, 100, or 1000, the results are not discernibly different: the explained variance is the same for all of them, as is the performance of the results in my application. At a minimum, I expected the explained variance to change. I would appreciate any explanation of where my expectations are misguided, and whether there are any other settings (or alternatives to TruncatedSVD) that I could look to other than the n_components setting.

ValueError: could not convert string to float: 'GIAC'

I am trying to perform K-means clustering on a data set that consists entirely of text. I have tried the lines of code below and I am getting an error saying "ValueError: could not convert string to float: 'GIAC'".
I think the program is still having problems converting my text into vectors to be able to perform a clustering.
I really do not know how to solve this.
Here are the lines of code:
import numpy as np
import matplotlib.pyplot as plot
import pandas as pd
from sklearn.cluster import KMeans
Cert = pd.read_csv('Certification.csv')
X = Cert.iloc[:,:].values
wcss =[]
for i in range(1,5):
    kmeans = KMeans(n_clusters = i, init='k-means++', random_state = 0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plot.plot(range(1,5),wcss)
plot.title('Elbow Method')
plot.xlabel('Number of Clusters')
plot.ylabel('WCSS')
plot.show()
I have also attached a screenshot of the error message.
K-means requires your data to be continuous variables.
Clearly, 'GIAC' is not a number, is it?
K-means cannot be used on this data. You'd need to do one-hot encoding or similar, but that comes with its very own set of problems with k-means... Usually when you have data with values such as 'GIAC' there just is no sound way to cluster the data in a statistically meaningful way. There are so many heuristic choices along the way to a result that you could get pretty much any other result, too. Try to approach the problem mathematically, not by copying and pasting code.
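For reference, the one-hot encoding mentioned above could look roughly like the sketch below. The DataFrame and its column names are invented, since the contents of Certification.csv are not shown, and the choice of pandas.get_dummies is just one way to encode. It makes the ValueError go away, but it does not address the concern that the resulting clusters may not be meaningful.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for Certification.csv; the real columns are unknown.
cert = pd.DataFrame({
    "vendor": ["GIAC", "CompTIA", "GIAC", "ISC2", "CompTIA", "ISC2"],
    "level": ["advanced", "entry", "advanced", "expert", "entry", "expert"],
})

# One-hot encode every string column so each category becomes a 0/1 column.
X = pd.get_dummies(cert).to_numpy(dtype=float)

kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(X.shape)
print(kmeans.labels_)
```

Every one-hot column contributes equally to the Euclidean distance here, which is one of the heuristic choices the answer warns about.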

Select top n TFIDF features for a given document

I am working with TFIDF sparse matrices for document classification and want to retain only the top n (say 50) terms for each document (ranked by TFIDF score). See EDIT below.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
n = 50
df = pd.read_pickle('my_df.pickle')
df_t = tfidfvectorizer.fit_transform(df['text'])
df_t
Out[15]:
<21175x201380 sparse matrix of type '<class 'numpy.float64'>'
with 6055621 stored elements in Compressed Sparse Row format>
I have tried following the example in this post, although my aim is not to display the features, but just to select the top n for each document before training. But I get a memory error as my data is too large to be converted into a dense matrix.
df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
Traceback (most recent call last):
  File "<ipython-input-16-e0a74c393ca5>", line 1, in <module>
    df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
  File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 943, in toarray
    out = self._process_toarray_args(order, out)
  File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 1130, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Is there any way to do what I want without working with a dense representation (i.e. without the toarray() call) and without reducing the feature space too much more than I already have (with min_df)?
Note: the max_features parameter is not what I want as it only considers "the top max_features ordered by term frequency across the corpus" (docs here) and what I want is a document-level ranking.
EDIT: I wonder if the best way to address this problem is to set the values of all features except the n-best to zero. I say this because the vocabulary has already been calculated, so feature indices must remain the same, as I will want to use them for other purposes (e.g. to visualise the actual words that correspond to the n-best features).
A colleague wrote some code to retrieve the indices of the n highest-ranked features:
n = 2
tops = np.zeros((df_t.shape[0], n), dtype=int) # store the top indices in a new array
for ind in range(df_t.shape[0]):
    # for each row (i.e. document), sort the negated values (argsort is ascending) and slice the top n
    tops[ind,] = np.argsort(-df_t[ind].toarray())[0, 0:n]
But from there, I would need to either:
retrieve the list of remaining (i.e. lowest-ranked) indices and modify the values "in place", or
loop through the original matrix (df_t) and set all values to 0 except for the n best indices in tops.
There is a post here explaining how to work with a csr_matrix, but I'm not sure how to put this into practice to get what I want.
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(tokenizer=word_tokenize,ngram_range=(1,2), binary=True, max_features=50)
TFIDF=vect.fit_transform(df['processed_cv_data'])
The max_features parameter passed in the TfidfVectorizer will pick out the top 50 features ordered by their term frequency but not by their Tf-idf score.
You can view the features by using:
print(vect.get_feature_names_out())  # get_feature_names() in scikit-learn versions before 1.0
As you mention, the max_features parameter of the TfidfVectorizer is one way of selecting features.
If you are looking for an alternative way which takes the relationship to the target variable into account, you can use sklearn's SelectKBest. By setting k=50, this will filter your data for the best features. The metric to use for selection can be specified as the parameter score_func.
Example:
from sklearn.feature_selection import SelectKBest
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
df_t = tfidfvectorizer.fit_transform(df['text'])
df_t_reduced = SelectKBest(k=50).fit_transform(df_t, df['target'])
You can also chain it in a pipeline:
from sklearn.pipeline import Pipeline
# classifier is a placeholder for any scikit-learn estimator
pipeline = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("feature_reduction", SelectKBest(k=50)),
                     ("classifier", classifier)])
You could split the matrix into chunks of rows, convert only one chunk at a time to a dense array to keep memory usage down, and then concatenate the results:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset='train').data
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
df_t = tfidfvectorizer.fit_transform(data)
n = 10
# Densify 500 rows at a time; negate so that argsort (ascending) puts the highest scores first
df_top = [np.argsort(-df_t[i: i+500, :].toarray(), axis=1)[:, :n]
          for i in range(0, df_t.shape[0], 500)]
np.concatenate(df_top, axis=0).shape
>> (11314, 10)
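The idea from the question's EDIT (zero out everything except the n best scores in each document while leaving the feature indices untouched) can also be done directly on the CSR arrays, without any dense conversion. The following is a minimal sketch, not a tested solution for the 21175 x 201380 matrix; keep_top_n_per_row is a hypothetical helper name, and it assumes non-negative scores such as TF-IDF, so that implicit zeros never belong in the top n.

```python
import numpy as np
from scipy.sparse import csr_matrix

def keep_top_n_per_row(matrix, n):
    """Return a CSR copy in which each row keeps only its n largest stored values."""
    matrix = csr_matrix(matrix, copy=True)
    for row in range(matrix.shape[0]):
        start, end = matrix.indptr[row], matrix.indptr[row + 1]
        if end - start > n:
            values = matrix.data[start:end]  # view into matrix.data
            # Positions within this row of everything except the n largest values.
            drop = np.argsort(values)[: end - start - n]
            values[drop] = 0.0
    matrix.eliminate_zeros()  # physically remove the entries set to zero
    return matrix

# Small made-up matrix standing in for df_t.
demo = csr_matrix(np.array([
    [0.1, 0.0, 0.5, 0.3],
    [0.0, 0.2, 0.0, 0.0],
    [0.7, 0.6, 0.1, 0.9],
]))
reduced = keep_top_n_per_row(demo, n=2)
print(reduced.toarray())
```

Because the shape and column indices are unchanged, the vectorizer's vocabulary still maps each remaining column to its word.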

How to input twitter data (csv/txt) into DBSCAN python?

Could someone guide me on how to cluster Twitter data using DBSCAN in Python? I am totally new to DBSCAN. Also, how do I determine the eps value and the iloc or loc value?
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
def clusterEvaluate(cluster):
    count_cluster = np.bincount(cluster)
    count_cluster = np.argmax(count_cluster)
    same_clusters = np.count_nonzero(cluster == count_cluster)/np.size(cluster)
    return same_clusters
dataset = np.loadtxt('tweetdata.csv') # not sure if this works
X = StandardScaler().fit_transform(dataset)
y_valid = dataset.iloc[:6].values()
dbscan = DBSCAN(eps= 0.5,min_samples=5,metric='euclidean')
y = dbscan.fit_predict(X)
cluster_labels = np.unique(y)
same_clusters = []
i = 0
for index in cluster_labels:
    cluster = y_valid[y == index]
    same_clusters.insert((i, clusterEvaluate(cluster)))
You need to choose an appropriate data representation and distance function for this. Furthermore, scalability will kill you.
I do not think it will work well. I have not seen anything that gives insightful results beyond counting frequent words in an unnecessarily complex fashion. Twitter data is a bitch. The messages are just too short. All the good approaches like LDA need much longer documents.
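To make "data representation and distance function" concrete, one possible combination is TF-IDF vectors with cosine distance, sketched below. The tweets, the eps value and min_samples are all made up for illustration, and this is not a recommendation: for the reasons given above, the clusters may well be uninformative on real tweets.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tweets standing in for tweetdata.csv.
tweets = [
    "snow storm closes the airport",
    "airport closed because of snow storm",
    "new phone released today",
    "today the new phone was released",
    "what a great sandwich",
]

# Representation: sparse TF-IDF vectors, one row per tweet.
X = TfidfVectorizer().fit_transform(tweets)

# Distance function: cosine distance; eps is a guess that needs tuning per data set.
labels = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit_predict(X)
print(labels)  # -1 marks tweets that DBSCAN treats as noise
```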

How to use vectors as features in scikit learn

I'm trying to use vector representations of words as features for a scikit-learn classifier. I have tried SVC. Here is the code:
from sklearn.svm import SVC
import csv
import numpy as np
from gensim.models import word2vec
from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('text.model.bin', binary=True)
with open('1000.csv', newline='') as csvfile:
    listwords = csv.reader(csvfile)
    features = []
    labels = []
    n = 0
    for row in listwords:
        if n>=199:
            break
        try:
            line = [int(row[2]),np.array(model[row[0]])]
            features.append([line])
            labels.append([row[1]])
            n+=1
        except KeyError:
            pass
            features.append([])
            n+=1
clf = SVC()
clf = clf.fit(features, labels)
vocab_obj = model.vocab['anne']
print (clf.predict([vocab_obj.count,model['anne']]))
The function model[X] returns a vector.
I get the error : ValueError: setting an array element with a sequence.
How am I supposed to do this?
There appear to be several issues w.r.t. the representation of your labels and features.
As far as I can see from your code, labels appears to be a list of lists (possibly) containing strings (probably looking like [['0'], ['1'], ...]), however the fit() function expects a numpy array of integers. When building your labels list, try using labels.append(int(row[1])) (skip the cast to int if row[1] is already an integer). If your labels are category names (e.g. sports, politics, or whatever), then you'd need to use a LabelEncoder first. Before calling fit() you might also want to convert your labels list to a numpy array: y = np.array(labels).
Your features list appears to have a similar problem, but looks as if it might be a triple nested list. The fit() function expects the data matrix to be an n_samples x n_features matrix. If you are working with word vectors, then n_features is the dimensionality of your word vectors and n_samples is the number of documents in your csv file.
To get a document representation from the word vectors you'd need to compose them in some way. Simply adding or averaging all vectors in a document has commonly been found to be a strong baseline. Note that it's hard to tell from your example what int(row[2]) in line = [int(row[2]),np.array(model[row[0]])] means.
I'd encourage you to post some more information about what a single line in your csv file looks like if you're still struggling to get this to work.
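The averaging baseline and the expected shapes can be sketched as follows. The three-dimensional vectors, the tokenised documents and the labels are invented for illustration. In your code the lookup would go through the gensim model instead of the plain dict used here, and document_vector is a hypothetical helper name.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the word2vec model.
word_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 2.0, 0.0]),
    "bad": np.array([-1.0, 0.0, -2.0]),
}

def document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

documents = [["good", "movie"], ["bad", "movie"], ["unknown"]]
raw_labels = ["1", "0", "0"]  # strings, as they would come out of csv.reader

# X is n_samples x n_features, y is a flat array of integer labels.
X = np.vstack([document_vector(doc, word_vectors, dim=3) for doc in documents])
y = np.array([int(label) for label in raw_labels])
print(X.shape, y.shape)  # (3, 3) (3,)
```

Arrays shaped like X and y are what clf.fit(X, y) expects, and a single document can be classified with clf.predict(X[:1]).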
