Agglomerative Clustering to cluster doc2vec - python

I'm new to Agglomerative Clustering and doc2vec, so I hope somebody can help me with the following issue.
This is my code:
model = AgglomerativeClustering(linkage='average',
                                connectivity=None, n_clusters=2)
X = model_dm.docvecs.doctag_syn0
model.fit(X, y=None)
model.fit_predict(X, y=None)
What I want is to predict the average of the distances of each observation. I got the following error:
MemoryError Traceback (most recent call last)
<ipython-input-22-d8b93bc6abe1> in <module>()
2 model = AgglomerativeClustering(linkage='average',connectivity=None,n_clusters=2)
3 X = model_dm.docvecs.doctag_syn0
----> 4 model.fit(X, y=None)
5
/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in fit(self, X, y)
763 n_components=self.n_components,
764 n_clusters=n_clusters,
--> 765 **kwargs)
766 # Cut the tree
767 if compute_full_tree:
/usr/local/lib64/python2.7/site-packages/sklearn/externals/joblib/memory.pyc in __call__(self, *args, **kwargs)
281
282 def __call__(self, *args, **kwargs):
--> 283 return self.func(*args, **kwargs)
284
285 def call_and_shelve(self, *args, **kwargs):
/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in _average_linkage(*args, **kwargs)
547 def _average_linkage(*args, **kwargs):
548 kwargs['linkage'] = 'average'
--> 549 return linkage_tree(*args, **kwargs)
550
551
/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in linkage_tree(X, connectivity, n_components, n_clusters, linkage, affinity, return_distance)
428 i, j = np.triu_indices(X.shape[0], k=1)
429 X = X[i, j]
--> 430 out = hierarchy.linkage(X, method=linkage, metric=affinity)
431 children_ = out[:, :2].astype(np.int)
432
/usr/local/lib64/python2.7/site-packages/scipy/cluster/hierarchy.pyc in linkage(y, method, metric)
669 'matrix looks suspiciously like an uncondensed '
670 'distance matrix')
--> 671 y = distance.pdist(y, metric)
672 else:
673 raise ValueError("`y` must be 1 or 2 dimensional.")
/usr/local/lib64/python2.7/site-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
1375
1376 m, n = s
-> 1377 dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
1378
1379 # validate input for multi-args metrics
MemoryError:

You are getting a MemoryError. This is a reliable indicator that you are running out of memory, on the line indicated.
The line indicates an attempt to allocate an np.zeros() array of (m * (m - 1)) // 2 values of type double (8 bytes). Looking at the scipy source, m, here, is the number of vectors in X, aka model_dm.docvecs.doctag_syn0.shape[0].
So, how many docvecs are you working with? If it's 200,000, you will need...
((200000 * 199999) // 2) * 8 bytes
...or about 160GB of RAM for that np.zeros() allocation to succeed. (If you have more docvecs, even more RAM.)
(Agglomerative clustering needs to know all the pairwise distances, which the scipy implementation tries to calculate and store at the beginning, which is very space-consuming.)
You may need more RAM, fewer docvecs, a different clustering algorithm, or an implementation that is lazier about calculating distances (but is then much slower, because it will often be recalculating, rather than reusing, distances it needs repeatedly).
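
To check the arithmetic for your own data before calling fit(), here is a minimal sketch that evaluates the same (m * (m - 1)) // 2 expression from the traceback. The function name and the sample sizes are made up for illustration; the 8-byte item size assumes np.double, as in the pdist line shown above.

```python
def condensed_distance_bytes(m, itemsize=8):
    """Bytes needed for the condensed pairwise-distance array over m vectors.

    itemsize=8 assumes np.double, matching the np.zeros() call in pdist.
    """
    n_pairs = (m * (m - 1)) // 2  # one entry per unordered pair of vectors
    return n_pairs * itemsize


# Hypothetical collection sizes; substitute X.shape[0] for your own docvecs.
for m in (10_000, 50_000, 200_000):
    gb = condensed_distance_bytes(m) / 1e9
    print(f"{m:>7} vectors -> {gb:,.1f} GB")
```

Because the requirement grows quadratically, halving the number of docvecs cuts the allocation to roughly a quarter.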

Related

What is 'G' in CVXPY and how to fix it

I'm trying to use a binary integer linear program to assign members of my staff to different shifts. I have a 16x9 matrix of preferences for my staff in a CSV (16 staff members, 9 slots to fill), and I used the following code to try to assign them:
weights = pd.read_csv("holiday_green day.csv", index_col= 0)
weights = weights.to_numpy().astype(float)
selection = cvx.Variable((9,16), boolean = True)
row_sum_vector = np.ones((16,1)).astype(float)
result_constraint = np.ones((9,1)).astype(float) * 2
objective = cvx.Minimize(cvx.trace(weights @ assignments))
prob = cvx.Problem(objective, [assignments @ row_sum_vector == result_constraint])
prob.solve()
When I try running this, I get the error TypeError: G must be a 'd' matrix, and I don't know where to start debugging. I looked at this post, but it wasn't helpful. Can someone help me figure out what G is and what it means by a 'd' matrix? It's my first time actually using CVXPY and I'm very lost.
Full Stack Trace:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-23-d07ad22cbc25> in <module>()
6 objective = cvx.Minimize(cvx.atoms.affine.trace.trace(weights @ assignments))
7 prob = cvx.Problem(objective, [assignments @ row_sum_vector == result_constraint])
----> 8 prob.solve()
3 frames
/usr/local/lib/python3.7/dist-packages/cvxpy/problems/problem.py in solve(self, *args, **kwargs)
288 else:
289 solve_func = Problem._solve
--> 290 return solve_func(self, *args, **kwargs)
291
292 @classmethod
/usr/local/lib/python3.7/dist-packages/cvxpy/problems/problem.py in _solve(self, solver, warm_start, verbose, parallel, gp, qcp, **kwargs)
570 self._intermediate_problem)
571 solution = self._solving_chain.solve_via_data(
--> 572 self, data, warm_start, verbose, kwargs)
573 full_chain = self._solving_chain.prepend(self._intermediate_chain)
574 inverse_data = self._intermediate_inverse_data + solving_inverse_data
/usr/local/lib/python3.7/dist-packages/cvxpy/reductions/solvers/solving_chain.py in solve_via_data(self, problem, data, warm_start, verbose, solver_opts)
194 """
195 return self.solver.solve_via_data(data, warm_start, verbose,
--> 196 solver_opts, problem._solver_cache)
/usr/local/lib/python3.7/dist-packages/cvxpy/reductions/solvers/conic_solvers/glpk_mi_conif.py in solve_via_data(self, data, warm_start, verbose, solver_opts, solver_cache)
73 data[s.B],
74 set(int(i) for i in data[s.INT_IDX]),
---> 75 set(int(i) for i in data[s.BOOL_IDX]))
76 results_dict = {}
77 results_dict["status"] = results_tup[0]
TypeError: G must be a 'd' matrix
Edit: Tried casting all numpy arrays as float like they suggested in a different post. It didn't work.

memory error in dask when using dummy encoder

I am trying to dummy encode a dask dataframe, train_final[categorical_var]. However, when I run the code I get a memory error. How can this happen, given that dask is supposed to load the data chunk by chunk?
The code is below:
from dask_ml.preprocessing import DummyEncoder
de = DummyEncoder()
train_final_cat = de.fit_transform(train_final[categorical_var])
The error:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-84-e21592c13279> in <module>
1 from dask_ml.preprocessing import DummyEncoder
2 de = DummyEncoder()
----> 3 train_final_cat = de.fit_transform(train_final[categorical_var])
~/env/lib/python3.5/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
460 if y is None:
461 # fit method of arity 1 (unsupervised transformation)
--> 462 return self.fit(X, **fit_params).transform(X)
463 else:
464 # fit method of arity 2 (supervised transformation)
~/env/lib/python3.5/site-packages/dask_ml/preprocessing/data.py in fit(self, X, y)
602
603 self.transformed_columns_ = pd.get_dummies(
--> 604 sample, drop_first=self.drop_first
605 ).columns
606 return self
~/env/lib/python3.5/site-packages/pandas/core/reshape/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype)
890 dummy = _get_dummies_1d(col[1], prefix=pre, prefix_sep=sep,
891 dummy_na=dummy_na, sparse=sparse,
--> 892 drop_first=drop_first, dtype=dtype)
893 with_dummies.append(dummy)
894 result = concat(with_dummies, axis=1)
~/env/lib/python3.5/site-packages/pandas/core/reshape/reshape.py in _get_dummies_1d(data, prefix, prefix_sep, dummy_na, sparse, drop_first, dtype)
978
979 else:
--> 980 dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)
981
982 if not dummy_na:
~/env/lib/python3.5/site-packages/numpy/lib/twodim_base.py in eye(N, M, k, dtype, order)
184 if M is None:
185 M = N
--> 186 m = zeros((N, M), dtype=dtype, order=order)
187 if k >= M:
188 return m
MemoryError:
Would anyone be able to give me some direction in this regard?
Thanks
Michael
Encoding dummy variables is a very memory-intensive task, as you're creating a new column for each unique value of your categorical_column. If categorical_column has high cardinality, then even a single chunk can explode in size. Creating dummies is also not "embarrassingly parallel", so workers can't just process each chunk independently: they need to communicate and replicate some data during the computation.
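
To get a feel for how quickly this grows, here is a rough sketch based on the np.eye(number_of_cols, dtype=dtype).take(codes, axis=0) line in the traceback: pandas first builds a square identity matrix with one row and column per category, then picks one row of it per input row. The function name, the row count and the cardinalities below are hypothetical, and the 1-byte item size is an assumption (a uint8 dummy dtype); adjust it for your pandas version.

```python
def dummy_encoding_bytes(n_rows, n_categories, itemsize=1):
    """Rough dense-memory estimate for dummy encoding a single column.

    itemsize=1 is an assumption (uint8 dummies); use 8 for float64.
    Returns (identity_bytes, encoded_bytes).
    """
    identity_bytes = n_categories * n_categories * itemsize  # np.eye(number_of_cols)
    encoded_bytes = n_rows * n_categories * itemsize         # .take(codes, axis=0)
    return identity_bytes, encoded_bytes


# Hypothetical cardinalities for a sample of one million rows.
for n_categories in (10, 10_000, 1_000_000):
    identity, encoded = dummy_encoding_bytes(1_000_000, n_categories)
    print(f"{n_categories:>9} categories: "
          f"identity {identity / 1e9:,.3f} GB, encoded {encoded / 1e9:,.3f} GB")
```

Checking the number of unique values in each categorical column first shows which columns are likely to cause the blow-up.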

pymc3 with custom likelihood function from kernel density estimation

I'm trying to use pymc3 with a likelihood function derived from some observed data. This observed data doesn't fit any nice, standard distribution, so I want to define my own, based on these observations.
One approach is to use kernel density estimation over the observations. This was possible in pymc2, but doesn't play nicely with the Theano variables in pymc3.
In my code below I'm just generating some dummy data that is normally distributed. As my prior, I'm essentially assuming a uniform distribution for my observations.
Here's my code:
from scipy import stats
import numpy as np
import pymc3 as pm
from sklearn.neighbors.kde import KernelDensity
data = np.sort(stats.norm.rvs(loc=0, scale=1, size=1000))
kde = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(data.reshape(-1, 1))
def get_log_likelihood(x):
    return kde.score_samples(x)
with pm.Model() as test_model:
    x = pm.Uniform('prior rv', lower=-10, upper=10)
    obs = pm.DensityDist('observed likelihood', get_log_likelihood, observed={'x': x})
    step = pm.Metropolis()
    trace = pm.sample(200, step=step)
The error I receive seems to be the kde score_samples function blowing up as it expects an array, but x is a Theano variable.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-4efbbe7376dc> in <module>()
1 with pm.Model() as test_model:
2 x = pm.Uniform('prior rv', lower=0.0, upper=1e6)
----> 3 obs = pm.DensityDist('observed likelihood', get_log_likelihood, observed={'x': x})
4
5 step = pm.Metropolis()
~/research_notebooks/venv/lib/python3.6/site-packages/pymc3/distributions/distribution.py in __new__(cls, name, *args, **kwargs)
40 total_size = kwargs.pop('total_size', None)
41 dist = cls.dist(*args, **kwargs)
---> 42 return model.Var(name, dist, data, total_size)
43 else:
44 raise TypeError("Name needs to be a string but got: {}".format(name))
~/research_notebooks/venv/lib/python3.6/site-packages/pymc3/model.py in Var(self, name, dist, data, total_size)
825 with self:
826 var = MultiObservedRV(name=name, data=data, distribution=dist,
--> 827 total_size=total_size, model=self)
828 self.observed_RVs.append(var)
829 if var.missing_values:
~/research_notebooks/venv/lib/python3.6/site-packages/pymc3/model.py in __init__(self, name, data, distribution, total_size, model)
1372 self.missing_values = [datum.missing_values for datum in self.data.values()
1373 if datum.missing_values is not None]
-> 1374 self.logp_elemwiset = distribution.logp(**self.data)
1375 # The logp might need scaling in minibatches.
1376 # This is done in `Factor`.
<ipython-input-48-535f58ce543b> in get_log_likelihood(x)
1 def get_log_likelihood(x):
----> 2 return kde.score_samples(x)
~/research_notebooks/venv/lib/python3.6/site-packages/sklearn/neighbors/kde.py in score_samples(self, X)
150 # For it to be a probability, we must scale it. For this reason
151 # we'll also scale atol.
--> 152 X = check_array(X, order='C', dtype=DTYPE)
153 N = self.tree_.data.shape[0]
154 atol_N = self.atol * N
~/research_notebooks/venv/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
431 force_all_finite)
432 else:
--> 433 array = np.array(array, dtype=dtype, order=order, copy=copy)
434
435 if ensure_2d:
ValueError: setting an array element with a sequence.
Any help would be greatly appreciated. Thanks!

Converting TfidfVectorizer sparse matrix to dataframe or dense array results in memory error

My input is a pandas dataframe ("vector") with one column and 178885 rows holding strings with up to 600 words each.
0 this is an example text...
1 more examples...
...
178885 last example
Name: vectortext, Length: 178886, dtype: object
I'm doing feature extraction (unigrams) using the TfidfVectorizer:
vectorizer_uni = TfidfVectorizer(ngram_range=(1,1), use_idf=True, analyzer="word", stop_words=stop)
X = vectorizer_uni.fit_transform(vector).toarray()
X = pd.DataFrame(X, columns=vectorizer_uni.get_feature_names()) #map grams
k = len(X.columns) #number of features
Unfortunately I'm receiving a MemoryError, as shown below. I'm using the 64-bit version of Python 3.6 with 16GB RAM on my Windows 10 machine. I've read a lot about Python generators etc., but I can't figure out how to solve this problem without limiting the number of features (which is not really an option). Any ideas how to solve this? Could I somehow split my dataframe beforehand?
Traceback:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-88-15b6091ceec7> in <module>()
1 vectorizer_uni = TfidfVectorizer(ngram_range=(1,1), use_idf=True, analyzer="word", stop_words=stop)
----> 2 X = vectorizer_uni.fit_transform(vector).toarray()
3 X = pd.DataFrame(X, columns=vectorizer_uni.get_feature_names()) #map grams
4 k = len(X.columns) # number of features
C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\compressed.py in toarray(self, order, out)
962 def toarray(self, order=None, out=None):
963 """See the docstring for `spmatrix.toarray`."""
--> 964 return self.tocoo(copy=False).toarray(order=order, out=out)
965
966 ##############################################################
C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\coo.py in toarray(self, order, out)
250 def toarray(self, order=None, out=None):
251 """See the docstring for `spmatrix.toarray`."""
--> 252 B = self._process_toarray_args(order, out)
253 fortran = int(B.flags.f_contiguous)
254 if not fortran and not B.flags.c_contiguous:
C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\base.py in _process_toarray_args(self, order, out)
1037 return out
1038 else:
-> 1039 return np.zeros(self.shape, dtype=self.dtype, order=order)
1040
1041 def __numpy_ufunc__(self, func, method, pos, inputs, **kwargs):
MemoryError:

ValueError: negative dimensions are not allowed in scikit linear regression CV model with sparse matrices

I recently competed in a Kaggle competition and ran into problems trying to run linear CV models from scikit-learn. I am aware of a similar question on Stack Overflow, but I can't see how the accepted reply relates to my issue. Any assistance would be greatly appreciated. My code is given below:
train=pd.read_csv(".../train.csv")
test=pd.read_csv(".../test.csv")
data=pd.read_csv(".../sampleSubmission.csv")
from sklearn.feature_extraction.text import TfidfVectorizer
transformer = TfidfVectorizer(max_features=None)
Y=transformer.fit_transform(train.tweet)
Z=transformer.transform(test.tweet)
from sklearn import linear_model
clf = linear_model.RidgeCV()
a=4
b=1
while (a<28):
    clf.fit(Y, train.ix[:,a])
    pred=clf.predict(Z)
    linpred=pd.DataFrame(pred)
    data[data.columns[b]]=linpred
    b=b+1
    a=a+1
    print b
The error that I receive is pasted in total below:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-41c31233c15c> in <module>()
1 blah=train.ix[:,a]
----> 2 clf.fit(Y, blah)
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
815 gcv_mode=self.gcv_mode,
816 store_cv_values=self.store_cv_values)
--> 817 estimator.fit(X, y, sample_weight=sample_weight)
818 self.alpha_ = estimator.alpha_
819 if self.store_cv_values:
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
722 raise ValueError('bad gcv_mode "%s"' % gcv_mode)
723
--> 724 v, Q, QT_y = _pre_compute(X, y)
725 n_y = 1 if len(y.shape) == 1 else y.shape[1]
726 cv_values = np.zeros((n_samples * n_y, len(self.alphas)))
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in _pre_compute(self, X, y)
607 def _pre_compute(self, X, y):
608 # even if X is very sparse, K is usually very dense
--> 609 K = safe_sparse_dot(X, X.T, dense_output=True)
610 v, Q = linalg.eigh(K)
611 QT_y = np.dot(Q.T, y)
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\utils\extmath.pyc in safe_sparse_dot(a, b, dense_output)
76 from scipy import sparse
77 if sparse.issparse(a) or sparse.issparse(b):
---> 78 ret = a * b
79 if dense_output and hasattr(ret, "toarray"):
80 ret = ret.toarray()
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\base.pyc in __mul__(self, other)
301 if self.shape[1] != other.shape[0]:
302 raise ValueError('dimension mismatch')
--> 303 return self._mul_sparse_matrix(other)
304
305 try:
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\compressed.pyc in _mul_sparse_matrix(self, other)
518
519 nnz = indptr[-1]
--> 520 indices = np.empty(nnz, dtype=np.intc)
521 data = np.empty(nnz, dtype=upcast(self.dtype,other.dtype))
522
ValueError: negative dimensions are not allowed
It looks like this problem occurs even without sklearn: it's in scipy.sparse matrix multiplication. There is a thread about this on the scipy-users board, "sparse matrix multiplication problem". The crux of the problem is that scipy uses a 32-bit int for non-zero indices during sparse matrix multiplication; that's the marked line at the bottom of the traceback above. That counter can overflow if there are too many non-zero elements, which makes the variable nnz negative. The code at the last arrow then tries to create an empty array of size nnz, resulting in a ValueError due to a negative dimension.
You can generate the tail end of the traceback above without sklearn as follows:
import scipy.sparse as ss
X = ss.rand(75000, 42000, format='csr', density=0.01)
X * X.T
For this problem, the input is probably quite sparse, but RidgeCV appears to multiply X by X.T in the last sklearn frame of the traceback (safe_sparse_dot, called from _pre_compute), and that product might not be sparse enough.
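
The wraparound itself can be illustrated without scipy. The sketch below only mimics a 32-bit signed counter in NumPy; it is not the scipy code path, and the helper name and the 3,000,000,000 count are made up for illustration. It also checks that the number of cells in X * X.T for the 75000-row example exceeds what a 32-bit signed int can hold.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # largest value a 32-bit signed int can hold


def as_int32_counter(true_count):
    """Value a 32-bit signed counter would end up holding for true_count items."""
    return int(np.array([true_count], dtype=np.int64).astype(np.int32)[0])


# Upper bound on non-zeros in X * X.T for 75000 rows: one per output cell.
max_nnz = 75000 * 75000
print("possible non-zeros exceed int32:", max_nnz > INT32_MAX)

# Hypothetical non-zero count just above the 32-bit limit.
wrapped = as_int32_counter(3_000_000_000)
print("wrapped counter:", wrapped)

# Allocating with the wrapped (negative) size fails like the last traceback frame.
try:
    np.empty(wrapped, dtype=np.intc)
except ValueError as exc:
    print("ValueError:", exc)
```

If the product really is that dense, reducing the number of rows passed to RidgeCV, or using an estimator that does not form X * X.T, avoids the multiplication altogether.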
