I want to calculate the normal distribution given the mean, covariance, and some datapoints.
from scipy.stats import multivariate_normal
def calc_gamma(x, mu, sigma):
    Gamma = multivariate_normal(x, mu, sigma)
    return Gamma
Now when I pass these values to the function, I get an error about the length of the mean.
test_data = np.array([[0, 0],[1,2]])
test_mu = np.array([[1, 1]])
test_sigma = np.array([[1, 0], [0, 1]])
test_gamma = calc_gamma(test_data, test_mu, test_sigma)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_19532\3211928287.py in <module>
2 test_mus = np.array([[1, 1]])
3 test_sigmas = np.array([[[1, 0], [0, 1]]])
----> 4 test_gamma_hat = calc_gamma_hat(test_data, test_pi, test_mus, test_sigmas)
~\AppData\Local\Temp\ipykernel_19532\3403511833.py in calc_gamma_hat(x, pi, mus, sigmas)
1 def calc_gamma_hat(x,pi,mus, sigmas):
2
----> 3 Gamma = multivariate_normal(x,mus,sigmas)
4 return Gamma
5
~\anaconda3\envs\my_env\lib\site-packages\scipy\stats\_multivariate.py in __call__(self, mean, cov, allow_singular, seed)
362 See `multivariate_normal_frozen` for more information.
363 """
--> 364 return multivariate_normal_frozen(mean, cov,
365 allow_singular=allow_singular,
366 seed=seed)
~\anaconda3\envs\my_env\lib\site-packages\scipy\stats\_multivariate.py in __init__(self, mean, cov, allow_singular, seed, maxpts, abseps, releps)
730 """
731 self._dist = multivariate_normal_gen(seed)
--> 732 self.dim, self.mean, self.cov = self._dist._process_parameters(
733 None, mean, cov)
734 self.cov_info = _PSD(self.cov, allow_singular=allow_singular)
~\anaconda3\envs\my_env\lib\site-packages\scipy\stats\_multivariate.py in _process_parameters(self, dim, mean, cov)
405
406 if mean.ndim != 1 or mean.shape[0] != dim:
--> 407 raise ValueError("Array 'mean' must be a vector of length %d." %
408 dim)
409 if cov.ndim == 0:
ValueError: Array 'mean' must be a vector of length 4.
Since I can't change the test values, any suggestions on why it does not work and how to "adjust" the function?
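For what it's worth, here is one possible adjustment (a sketch based on my reading of the error, not an answer from the original thread). multivariate_normal(mean, cov) treats its first positional argument as the mean, so the function above passes the data x where scipy expects the mean vector, which is what trips the length check. Evaluating the density of the data with .pdf, and squeezing the extra leading dimension off the test parameters, would look like:

import numpy as np
from scipy.stats import multivariate_normal

def calc_gamma(x, mu, sigma):
    # mu may arrive as shape (1, 2) and sigma as (1, 2, 2) or (2, 2);
    # squeeze them to a (2,) mean and a (2, 2) covariance for scipy.
    mu = np.asarray(mu).reshape(-1)
    sigma = np.asarray(sigma)
    sigma = sigma.reshape(sigma.shape[-2:])
    # .pdf evaluates the density at each row of x, one value per row.
    return multivariate_normal.pdf(x, mean=mu, cov=sigma)

test_data = np.array([[0, 0], [1, 2]])
test_mu = np.array([[1, 1]])
test_sigma = np.array([[1, 0], [0, 1]])
print(calc_gamma(test_data, test_mu, test_sigma))  # two density values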
I would like to cluster sets of spatial data using my own metric. The data comes as pairs of (x,y) values in a dataframe, where each set of pairs has an id. Like in the following example where I have three sets of points:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1] * 4 + [2] * 5 + [3] * 3,
'x': np.random.random(12),
'y': np.random.random(12)})
df['xy'] = df[['x','y']].apply(lambda row: [row['x'],row['y']], axis = 1)
Here is the distance function I would like to use:
from scipy.spatial.distance import directed_hausdorff
def some_distance(u, v):
    return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
This function computes the Hausdorff distance, i.e. the distance between two subsets u and v of n-dimensional space. In my case, I would like to use this distance function to cluster subsets of the real plane. In the data above there are three such subsets (ids from 1 to 3) so the resulting distance matrix should be 3x3.
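To make the metric concrete, here is a small illustration of my own (not part of the original question) on two tiny point sets:

import numpy as np
from scipy.spatial.distance import directed_hausdorff

u = np.array([[0.0, 0.0], [1.0, 0.0]])
v = np.array([[0.0, 0.5], [1.0, 0.5], [2.0, 0.5]])

# Each directed distance is the worst-case nearest-neighbour distance in one
# direction; the symmetric Hausdorff distance is the larger of the two.
print(directed_hausdorff(u, v)[0])  # 0.5
print(directed_hausdorff(v, u)[0])  # ~1.118, driven by the outlier (2, 0.5)
print(max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0]))  # ~1.118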
My idea for the clustering step was to use sklearn.cluster.AgglomerativeClustering with a precomputed metric, which in turn I want to compute with sklearn.metrics.pairwise.pairwise_distances.
from sklearn.metrics.pairwise import pairwise_distances
def to_np_array(col):
    return np.array(list(col.values))
X = df.groupby('id')['xy'].apply(to_np_array).as_matrix()
m = pairwise_distances(X, X, metric=some_distance)
However, the last line is giving me an error:
ValueError: setting an array element with a sequence.
What does work fine, however, is calling some_distance(X[1], X[2]).
My hunch is that X needs to be a different format for pairwise_distances to work. Any ideas on how to make this work, or how to compute the matrix myself so I can stick it into sklearn.cluster.AgglomerativeClustering?
The error stack is
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-e34155622595> in <module>
12 def some_distance(u, v):
13 return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
---> 14 m = pairwise_distances(X, X, metric=some_distance)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1430 func = partial(distance.cdist, metric=metric, **kwds)
1431
-> 1432 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1433
1434
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1065
1066 if effective_n_jobs(n_jobs) == 1:
-> 1067 return func(X, Y, **kwds)
1068
1069 # TODO: in some cases, backend='threading' may be appropriate
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in _pairwise_callable(X, Y, metric, **kwds)
1079 """Handle the callable case for pairwise_{distances,kernels}
1080 """
-> 1081 X, Y = check_pairwise_arrays(X, Y)
1082
1083 if X is Y:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
106 if Y is X or Y is None:
107 X = Y = check_array(X, accept_sparse='csr', dtype=dtype,
--> 108 warn_on_dtype=warn_on_dtype, estimator=estimator)
109 else:
110 X = check_array(X, accept_sparse='csr', dtype=dtype,
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
525 try:
526 warnings.simplefilter('error', ComplexWarning)
--> 527 array = np.asarray(array, dtype=dtype, order=order)
528 except ComplexWarning:
529 raise ValueError("Complex data not supported\n"
C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: setting an array element with a sequence.
Try this:
import numpy as np
import pandas as pd
from scipy.spatial.distance import directed_hausdorff
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import AgglomerativeClustering
df = pd.DataFrame({'id': [1] * 4 + [2] * 5 + [3] * 3,
                   'x': np.random.random(12),
                   'y': np.random.random(12)})
df['xy'] = df[['x','y']].apply(lambda row: [row['x'], row['y']], axis=1)

def some_distance(u, v):
    return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])

def to_np_array(col):
    return np.array(list(col.values))

X = df.groupby('id')['xy'].apply(to_np_array)

d = np.zeros((len(X), len(X)))
for i, u in enumerate(X):
    for j, v in list(enumerate(X))[i:]:
        d[i, j] = some_distance(u, v)
        d[j, i] = d[i, j]
And now when you print d you get something like this (the exact values depend on the random data):
array([[0. , 0.58928274, 0.40767213],
[0.58928274, 0. , 0.510095 ],
[0.40767213, 0.510095 , 0. ]])
And for clustering:
cluster = AgglomerativeClustering(n_clusters=2, affinity='precomputed', linkage = 'average')
cluster.fit(d)
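If you also want the cluster assignments (my addition, using the standard sklearn attribute), read them off labels_ after fitting:

print(cluster.labels_)  # e.g. [0, 1, 0]; exact labels depend on the random data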
It would help if you showed some of the variables. Fortunately you gave enough code to run it. For example the dataframe:
In [9]: df
Out[9]:
id x y xy
0 1 0.428437 0.267264 [0.42843730501201727, 0.2672637429997736]
1 1 0.944687 0.023323 [0.9446872371859233, 0.023322969159167317]
2 1 0.091055 0.683154 [0.09105472832178496, 0.6831542985617349]
3 1 0.474522 0.313541 [0.4745222021519122, 0.3135405569298565]
4 2 0.835237 0.491541 [0.8352366339973815, 0.4915408434083248]
5 2 0.905918 0.854030 [0.9059178939221513, 0.8540297797160584]
6 2 0.182154 0.909656 [0.18215390836391654, 0.9096555360282939]
7 2 0.225270 0.522193 [0.22527013482912195, 0.5221926076838651]
8 2 0.924208 0.858627 [0.9242076604008371, 0.8586274362498842]
9 3 0.419813 0.634741 [0.41981292371175905, 0.6347409684931891]
10 3 0.954141 0.795452 [0.9541413559045294, 0.7954524369652217]
11 3 0.896593 0.271187 [0.8965932351250882, 0.2711872631673109]
And your X:
In [10]: X
Out[10]:
array([array([[0.42843731, 0.26726374],
[0.94468724, 0.02332297],
[0.09105473, 0.6831543 ],
[0.4745222 , 0.31354056]]),
array([[0.83523663, 0.49154084],
[0.90591789, 0.85402978],
[0.18215391, 0.90965554],
[0.22527013, 0.52219261],
[0.92420766, 0.85862744]]),
array([[0.41981292, 0.63474097],
[0.95414136, 0.79545244],
[0.89659324, 0.27118726]])], dtype=object)
That is a (3,) object array - in effect a list of 3 2d arrays, with different sizes ((4,2),(5,2),(3,2)). That's one array for each group.
How is pairwise_distances supposed to feed that to your distance code? The pairwise_distances docs say X should be an (n, m) array - n samples, m features. Your X doesn't fit that description!
The error is probably produced when trying to make a float array from X:
In [12]: np.asarray(X,dtype=float)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-a6e08bb1590c> in <module>
----> 1 np.asarray(X,dtype=float)
/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: setting an array element with a sequence.
I am not sure why stats.multivariate_normal.pdf is not working.
At the moment I have
from scipy import stats
stats.multivariate_normal.pdf(X, meanX, covX)
where
X.shape = (150, 2)
meanX.shape = () # just a float
covX.shape = (150,)
The error I get is: "total size of new array must be unchanged"
Now I tried to follow the answer:
meanL = np.float(np.mean(xL))
covL = np.cov(xL)
stats.multivariate_normal.pdf(xL.T, np.full((150,), meanL), covL)
I get the following error:
LinAlgError Traceback (most recent call last)
<ipython-input-77-4c0280512087> in <module>()
2 meanL = np.full((150,), meanL)
3 covL = np.cov(xL)
----> 4 stats.multivariate_normal.pdf(xL.T, meanL, covL)
5
/Users/laura/anaconda/lib/python3.5/site-packages/scipy/stats/_multivariate.py in pdf(self, x, mean, cov, allow_singular)
497 dim, mean, cov = self._process_parameters(None, mean, cov)
498 x = self._process_quantiles(x, dim)
--> 499 psd = _PSD(cov, allow_singular=allow_singular)
500 out = np.exp(self._logpdf(x, mean, psd.U, psd.log_pdet, psd.rank))
501 return _squeeze_output(out)
/Users/laura/anaconda/lib/python3.5/site-packages/scipy/stats/_multivariate.py in __init__(self, M, cond, rcond, lower, check_finite, allow_singular)
148 d = s[s > eps]
149 if len(d) < len(s) and not allow_singular:
--> 150 raise np.linalg.LinAlgError('singular matrix')
151 s_pinv = _pinv_1d(s, eps)
152 U = np.multiply(u, np.sqrt(s_pinv))
LinAlgError: singular matrix
I can't reproduce the exact error you're getting, but the dimensions have to match:
mean and covariance need to have shapes (N,) and (N, N), and X must have width N. Some, but not all, of these requirements may be relaxed by broadcasting. Anyway, the following works for me:
>>> X = np.random.random((150,2))
>>> meanX = 0.5
>>> covX = np.identity(150)
>>> print(stats.multivariate_normal.pdf(X.T, np.full((150,), meanX), covX))
[4.43555177e-63 2.84151145e-63]
Update: from the updated question I suspect you want
>>> X = np.random.random((150,2))
>>>
>>> meanX = np.mean(X, axis=0)
>>> covX = np.cov(X.T)
>>> stats.multivariate_normal.pdf(X, meanX, covX)
array([0.83292328, 0.18944144, 0.37425605, 1.22840732, 0.5089164 ,
1.78568641, 0.31210331, 0.64079837, 1.05805662, 0.66416311,
0.77964264, 0.65744803, 0.53025325, 1.22309949, 1.62169299,
0.84558019, 1.23537247, 0.44383979, 1.45601888, 0.85368635,
...
I'm new to Agglomerative Clustering and doc2vec, so I hope somebody can help me with the following issue.
This is my code:
model = AgglomerativeClustering(linkage='average',
connectivity=None, n_clusters=2)
X = model_dm.docvecs.doctag_syn0
model.fit(X, y=None)
model.fit_predict(X, y=None)
What I want is to predict the average of the distances of each observation. I got the following error:
MemoryError Traceback (most recent call last)
<ipython-input-22-d8b93bc6abe1> in <module>()
2 model = AgglomerativeClustering(linkage='average',connectivity=None,n_clusters=2)
3 X = model_dm.docvecs.doctag_syn0
----> 4 model.fit(X, y=None)
5
/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in fit(self, X, y)
763 n_components=self.n_components,
764 n_clusters=n_clusters,
--> 765 **kwargs)
766 # Cut the tree
767 if compute_full_tree:
/usr/local/lib64/python2.7/site-packages/sklearn/externals/joblib/memory.pyc in __call__(self, *args, **kwargs)
281
282 def __call__(self, *args, **kwargs):
--> 283 return self.func(*args, **kwargs)
284
285 def call_and_shelve(self, *args, **kwargs):
/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in _average_linkage(*args, **kwargs)
547 def _average_linkage(*args, **kwargs):
548 kwargs['linkage'] = 'average'
--> 549 return linkage_tree(*args, **kwargs)
550
551
/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in linkage_tree(X, connectivity, n_components, n_clusters, linkage, affinity, return_distance)
428 i, j = np.triu_indices(X.shape[0], k=1)
429 X = X[i, j]
--> 430 out = hierarchy.linkage(X, method=linkage, metric=affinity)
431 children_ = out[:, :2].astype(np.int)
432
/usr/local/lib64/python2.7/site-packages/scipy/cluster/hierarchy.pyc in linkage(y, method, metric)
669 'matrix looks suspiciously like an uncondensed '
670 'distance matrix')
--> 671 y = distance.pdist(y, metric)
672 else:
673 raise ValueError("`y` must be 1 or 2 dimensional.")
/usr/local/lib64/python2.7/site-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
1375
1376 m, n = s
-> 1377 dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
1378
1379 # validate input for multi-args metrics
MemoryError:
You are getting a MemoryError. This is a reliable indicator that you are running out of memory, on the line indicated.
The line indicates an attempt to allocate an np.zeros() array of (m * (m - 1)) // 2 values of type double (8 bytes). Looking at the scipy source, m here is the number of vectors in X, aka model_dm.docvecs.doctag_syn0.shape[0].
So, how many docvecs are you working with? If it's 200,000, you will need...
((200000 * 199999) // 2) * 8 bytes
...or about 160GB of RAM for that np.zeros() allocation to succeed. (If you have more docvecs, even more RAM.)
(Agglomerative clustering needs to know all the pairwise distances, which the scipy implementation tries to calculate and store at the beginning, which is very space-consuming.)
You may need more RAM, or to use fewer docvecs, or to use a different clustering algorithm, or to use an implementation that is lazier about calculating distances (but is then much, much slower, because it will often be recalculating, rather than reusing, distances it needs repeatedly).
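To illustrate the "different clustering algorithm" route (a sketch of my own; MiniBatchKMeans is just one example of a method that never builds the full pairwise-distance matrix):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for model_dm.docvecs.doctag_syn0: many vectors, modest width.
X = np.random.random((200000, 100)).astype(np.float32)

# MiniBatchKMeans processes small batches, so memory use stays roughly
# proportional to the batch size instead of the number of vector pairs.
km = MiniBatchKMeans(n_clusters=2, batch_size=1000, random_state=0)
labels = km.fit_predict(X)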
I am having trouble trying to use scipy.stats.multivariate_normal, hopefully one of you might be able to help.
I have a 2x2 matrix whose inverse numpy.linalg.inv() finds without complaint, yet when I attempt to use it as the covariance matrix in multivariate_normal I receive a LinAlgError stating that it is a singular matrix:
In [89]: cov = np.array([[3.2e5**2, 3.2e5*0.103*-0.459],[3.2e5*0.103*-0.459, 0.103**2]])
In [90]: np.linalg.inv(cov)
Out[90]:
array([[ 1.23722158e-11, 1.76430200e-05],
[ 1.76430200e-05, 1.19418880e+02]])
In [91]: multivariate_normal([0,0], cov)
---------------------------------------------------------------------------
LinAlgError Traceback (most recent call last)
<ipython-input-91-44a6625beda5> in <module>()
----> 1 multivariate_normal([0,0], cov)
/mnt/ssd/Enthought_jli199/Canopy_64bit/User/lib/python2.7/site-packages/scipy/stats/_multivariate.pyc in __call__(self, mean, cov, allow_singular, seed)
421 return multivariate_normal_frozen(mean, cov,
422 allow_singular=allow_singular,
--> 423 seed=seed)
424
425 def _logpdf(self, x, mean, prec_U, log_det_cov, rank):
/mnt/ssd/Enthought_jli199/Canopy_64bit/User/lib/python2.7/site-packages/scipy/stats/_multivariate.pyc in __init__(self, mean, cov, allow_singular, seed)
591 """
592 self.dim, self.mean, self.cov = _process_parameters(None, mean, cov)
--> 593 self.cov_info = _PSD(self.cov, allow_singular=allow_singular)
594 self._dist = multivariate_normal_gen(seed)
595
/mnt/ssd/Enthought_jli199/Canopy_64bit/User/lib/python2.7/site-packages/scipy/stats/_multivariate.pyc in __init__(self, M, cond, rcond, lower, check_finite, allow_singular)
217 d = s[s > eps]
218 if len(d) < len(s) and not allow_singular:
--> 219 raise np.linalg.LinAlgError('singular matrix')
220 s_pinv = _pinv_1d(s, eps)
221 U = np.multiply(u, np.sqrt(s_pinv))
LinAlgError: singular matrix
By default multivariate_normal checks whether any of the eigenvalues of the covariance matrix are less than some tolerance chosen based on its dtype and the magnitude of its largest eigenvalue (take a look at the source code for scipy.stats._multivariate._PSD and scipy.stats._multivariate._eigvalsh_to_eps for the full details).
As @kazemakase mentioned above, whilst your covariance matrix may be invertible according to the criteria used by np.linalg.inv, it is still very ill-conditioned and fails the more stringent test used by multivariate_normal.
You could pass allow_singular=True to multivariate_normal to skip this test, but in general it would be better to rescale your data to avoid passing such an ill-conditioned covariance matrix in the first place.
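To make that concrete, here is a sketch of my own showing the diagnosis and both workarounds (the 3.2e5 scale factor is taken from the covariance in the question):

import numpy as np
from scipy.stats import multivariate_normal

cov = np.array([[3.2e5**2, 3.2e5 * 0.103 * -0.459],
                [3.2e5 * 0.103 * -0.459, 0.103**2]])

# The eigenvalues span roughly 13 orders of magnitude, so the smaller one
# falls below scipy's relative tolerance and the matrix is deemed singular.
print(np.linalg.eigvalsh(cov))  # roughly [8.4e-03, 1.0e+11]

# Workaround 1: skip the PSD tolerance check entirely (use with care).
mvn = multivariate_normal([0, 0], cov, allow_singular=True)

# Workaround 2: rescale the first variable by 3.2e5 so both variances are
# O(1); the rescaled covariance is well-conditioned and passes the check.
scale = np.array([3.2e5, 1.0])
cov_scaled = cov / np.outer(scale, scale)
mvn_scaled = multivariate_normal([0, 0], cov_scaled)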
I recently competed in a Kaggle competition and ran into problems trying to run linear CV models from scikit-learn. I am aware of a similar question on Stack Overflow, but I can't see how the accepted reply relates to my issue. Any assistance would be greatly appreciated. My code is given below:
train=pd.read_csv(".../train.csv")
test=pd.read_csv(".../test.csv")
data=pd.read_csv(".../sampleSubmission.csv")
from sklearn.feature_extraction.text import TfidfVectorizer
transformer = TfidfVectorizer(max_features=None)
Y=transformer.fit_transform(train.tweet)
Z=transformer.transform(test.tweet)
from sklearn import linear_model
clf = linear_model.RidgeCV()
a=4
b=1
while (a < 28):
    clf.fit(Y, train.ix[:, a])
    pred = clf.predict(Z)
    linpred = pd.DataFrame(pred)
    data[data.columns[b]] = linpred
    b = b + 1
    a = a + 1
    print b
The error that I receive is pasted in total below:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-41c31233c15c> in <module>()
1 blah=train.ix[:,a]
----> 2 clf.fit(Y, blah)
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
815 gcv_mode=self.gcv_mode,
816 store_cv_values=self.store_cv_values)
--> 817 estimator.fit(X, y, sample_weight=sample_weight)
818 self.alpha_ = estimator.alpha_
819 if self.store_cv_values:
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
722 raise ValueError('bad gcv_mode "%s"' % gcv_mode)
723
--> 724 v, Q, QT_y = _pre_compute(X, y)
725 n_y = 1 if len(y.shape) == 1 else y.shape[1]
726 cv_values = np.zeros((n_samples * n_y, len(self.alphas)))
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in _pre_compute(self, X, y)
607 def _pre_compute(self, X, y):
608 # even if X is very sparse, K is usually very dense
--> 609 K = safe_sparse_dot(X, X.T, dense_output=True)
610 v, Q = linalg.eigh(K)
611 QT_y = np.dot(Q.T, y)
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\utils\extmath.pyc in safe_sparse_dot(a, b, dense_output)
76 from scipy import sparse
77 if sparse.issparse(a) or sparse.issparse(b):
---> 78 ret = a * b
79 if dense_output and hasattr(ret, "toarray"):
80 ret = ret.toarray()
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\base.pyc in __mul__(self, other)
301 if self.shape[1] != other.shape[0]:
302 raise ValueError('dimension mismatch')
--> 303 return self._mul_sparse_matrix(other)
304
305 try:
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\compressed.pyc in _mul_sparse_matrix(self, other)
518
519 nnz = indptr[-1]
--> 520 indices = np.empty(nnz, dtype=np.intc)
521 data = np.empty(nnz, dtype=upcast(self.dtype,other.dtype))
522
ValueError: negative dimensions are not allowed
It looks like this problem occurs without using sklearn at all. It's in scipy.sparse matrix multiplication. There is an issue about it on the scipy-users board: sparse matrix multiplication problem. The crux of the problem is that scipy uses a 32-bit int for non-zero indices during sparse matrix multiplication. That's the marked line at the bottom of the traceback above. That can overflow if there are too many non-zero elements, and the overflow causes the variable nnz to become negative. The code at the last arrow then creates an empty array of size nnz, resulting in a ValueError due to a negative dimension.
You can generate the tail end of the traceback above without sklearn as follows:
import scipy.sparse as ss
X = ss.rand(75000, 42000, format='csr', density=0.01)
X * X.T
For this problem, the input is probably quite sparse, but RidgeCV looks like it's multiplying X and X.T in the last part of the traceback within sklearn. That product might not be sparse enough.
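As a rough pre-flight check (a sketch of my own; the bound helper is hypothetical, only the numpy/scipy calls shown are real APIs), you can bound the product's non-zero count and compare it against the 32-bit index limit before attempting the multiplication:

import numpy as np
import scipy.sparse as ss

def product_nnz_upper_bound(A, B):
    # Upper bound on nnz(A @ B): every nonzero in column k of A can pair
    # with every nonzero in row k of B.
    col_nnz_A = np.diff(A.tocsc().indptr).astype(np.int64)
    row_nnz_B = np.diff(B.tocsr().indptr).astype(np.int64)
    return int(col_nnz_A @ row_nnz_B)

X = ss.rand(5000, 4200, format='csr', density=0.01)
bound = product_nnz_upper_bound(X, X.T)
if bound > np.iinfo(np.intc).max:
    print("X * X.T may overflow 32-bit sparse indices")
else:
    print("bound =", bound)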