I'm again having trouble using the scikit-learn silhouette coefficient (my first question was here: silhouette coefficient in python with sklearn).
My clustering can be very unbalanced but has a lot of individuals, so I want to use the sampling parameter of the silhouette coefficient. I was wondering whether the subsampling is stratified, i.e. sampled with respect to the clusters. I take the iris dataset as an example, but my dataset is far bigger (which is why I need sampling).
My code is:
import pandas as pd
from sklearn import datasets
from sklearn.metrics import *
iris = datasets.load_iris()
col = iris.feature_names
name = iris.target_names
X = pd.DataFrame(iris.data, columns = col)
y = iris.target
s = silhouette_score(X.values, y, metric='euclidean',sample_size=50)
which works. But now if I bias it with:
y[0:148] =0
y[148] = 1
y[149] = 2
print y
s = silhouette_score(X.values, y, metric='euclidean',sample_size=50)
I get:
ValueError Traceback (most recent call last)
<ipython-input-12-68a7fba49c54> in <module>()
4 y[149] =2
5 print y
----> 6 s = silhouette_score(X.values, y, metric='euclidean',sample_size=50)
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
82 else:
83 X, labels = X[indices], labels[indices]
---> 84 return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
85
86
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_samples(X, labels, metric, **kwds)
146 for i in range(n)])
147 B = np.array([_nearest_cluster_distance(distances[i], labels, i)
--> 148 for i in range(n)])
149 sil_samples = (B - A) / np.maximum(A, B)
150 # nan values are for clusters of size 1, and should be 0
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in _nearest_cluster_distance(distances_row, labels, i)
200 label = labels[i]
201 b = np.min([np.mean(distances_row[labels == cur_label])
--> 202 for cur_label in set(labels) if not cur_label == label])
203 return b
/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.pyc in amin(a, axis, out, keepdims)
1980 except AttributeError:
1981 return _methods._amin(a, axis=axis,
-> 1982 out=out, keepdims=keepdims)
1983 # NOTE: Dropping the keepdims parameter
1984 return amin(axis=axis, out=out)
/usr/lib/python2.7/dist-packages/numpy/core/_methods.pyc in _amin(a, axis, out, keepdims)
12 def _amin(a, axis=None, out=None, keepdims=False):
13 return um.minimum.reduce(a, axis=axis,
---> 14 out=out, keepdims=keepdims)
15
16 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
ValueError: zero-size array to reduction operation minimum which has no identity
an error which I think is due to the fact that the sampling is random, not stratified, so it has not taken the two small clusters into account.
Am I correct?
Yes, you are correct. The sampling is not stratified, since it doesn't take the labels into consideration when doing the sampling.
This is how the sample is taken (version 0.14.1):
indices = random_state.permutation(X.shape[0])[:sample_size]
Where X is the input array of size [n_samples_a, n_samples_a] or [n_samples_a, n_features].
I think you are right, the current implementation does not support balanced resampling.
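If you need balanced sampling, a possible workaround is to subsample each cluster yourself and compute the score on that subsample. A minimal sketch, assuming X and y are plain NumPy arrays as in the question; the frac parameter and the stratified_silhouette name are made up for illustration:
import numpy as np
from sklearn.metrics import silhouette_score
def stratified_silhouette(X, y, frac=0.3, random_state=0):
    # take roughly frac of each cluster, but never fewer than 2 points
    rng = np.random.RandomState(random_state)
    keep = []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        n = min(len(idx), max(2, int(frac * len(idx))))
        keep.append(rng.choice(idx, size=n, replace=False))
    keep = np.concatenate(keep)
    return silhouette_score(X[keep], y[keep], metric='euclidean')
This way the two tiny clusters in the biased example are guaranteed to survive the subsampling.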
Just an update for year 2020:
As of scikit-learn 0.22.1, the sampling remains random (i.e. not stratified).
The source code is still:
indices = random_state.permutation(X.shape[0])[:sample_size]
So I'm working on a problem where I need to simply use SciPy to perform linear regression to get the weights and statistics on the weights, but I'm getting the error
"ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 11 and the array at index 1 has size 100"
The code is simply:
from scipy import stats
x = x_copy
y = y_copy
stats.linregress(x, y)
Where x is a dataframe and y is a numpy array.
When doing x.shape and y.shape I get that x is (100, 11) and y is (100,). Running the exact same matrices in np.linalg.lstsq and sklearn.linear_model.LinearRegression both work fine and output the weights, but as far as I'm aware I need SciPy to get the statistics on the weights themselves.
I've also checked x.dtypes and all variables are float64, and y.dtype also returns float64. I've also tried replacing x in the regression call with x.to_numpy() in case there was something wrong with the headers/index, but received the same issue.
Any suggestions?
Edit:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_28012/197024375.py in <module>
4 y = y_copy
5
----> 6 stats.linregress(x.values, y)
7
8 x.values.shape
~\anaconda3\lib\site-packages\scipy\stats\_stats_mstats_common.py in linregress(x, y, alternative)
153 # ssxm = mean( (x-mean(x))^2 )
154 # ssxym = mean( (x-mean(x)) * (y-mean(y)) )
--> 155 ssxm, ssxym, _, ssym = np.cov(x, y, bias=1).flat
156
157 # R-value
<__array_function__ internals> in cov(*args, **kwargs)
~\anaconda3\lib\site-packages\numpy\lib\function_base.py in cov(m, y, rowvar, bias, ddof, fweights, aweights, dtype)
2426 if not rowvar and y.shape[0] != 1:
2427 y = y.T
-> 2428 X = np.concatenate((X, y), axis=0)
2429
2430 if ddof is None:
<__array_function__ internals> in concatenate(*args, **kwargs)
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 11 and the array at index 1 has size 100
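For what it's worth, scipy.stats.linregress performs simple (single-predictor) regression, so it expects x and y to be 1-D arrays of the same length; with a (100, 11) x it ends up calling np.cov(x, y) on mismatching shapes, which is the concatenation error above. A minimal sketch of running it one column at a time, assuming x and y are the objects described in the question:
from scipy import stats
# linregress handles one predictor at a time, so loop over the columns
for name in x.columns:
    res = stats.linregress(x[name].values, y)
    print(name, res.slope, res.stderr, res.pvalue)
For a full multivariate fit with standard errors on every weight, something like statsmodels' OLS is the usual tool.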
I want to get the covariance from the iris data set, https://www.kaggle.com/jchen2186/machine-learning-with-iris-dataset/data
I am using numpy and the np.cov function:
import csv
import numpy as np

with open("Iris.csv") as iris:
    reader = csv.reader(iris)
    data = []
    next(reader)
    for row in reader:
        data.append(row)

for i in data:
    i.pop(0)
    i.pop(4)

iris = np.array(data)
np.cov(iris)
And I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-bfb836354075> in <module>
----> 1 np.cov(iris)
D:\Anaconda\lib\site-packages\numpy\lib\function_base.py in cov(m, y, rowvar, bias, ddof, fweights, aweights)
2300 w *= aweights
2301
-> 2302 avg, w_sum = average(X, axis=1, weights=w, returned=True)
2303 w_sum = w_sum[0]
2304
D:\Anaconda\lib\site-packages\numpy\lib\function_base.py in average(a, axis, weights, returned)
354
355 if weights is None:
--> 356 avg = a.mean(axis)
357 scl = avg.dtype.type(a.size/avg.size)
358 else:
D:\Anaconda\lib\site-packages\numpy\core\_methods.py in _mean(a, axis, dtype, out, keepdims)
73 is_float16_result = True
74
---> 75 ret = umr_sum(arr, axis, dtype, out, keepdims)
76 if isinstance(ret, mu.ndarray):
77 ret = um.true_divide(
TypeError: cannot perform reduce with flexible type
I don't understand what it means.
The error means that np.cov received an array of strings: csv.reader yields strings, so np.array(data) ends up with a string ("flexible") dtype that NumPy cannot average. If you want to keep your approach, you could read Iris.csv with the pandas.read_csv function and then select the appropriate numeric columns.
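A minimal sketch of that pandas route, assuming the Kaggle file's first column is an Id and the last one the Species label (the exact column names below are assumptions):
import pandas as pd
import numpy as np
# drop the non-numeric columns and keep the four measurements as floats
df = pd.read_csv("Iris.csv")
X = df.drop(columns=["Id", "Species"]).astype(float).values
print(np.cov(X, rowvar=False))   # 4x4 covariance of the features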
BUT, here is a little set of commands to ease this task. They use scikit-learn and numpy to load the iris dataset, obtain X and y, and compute the covariance matrix:
from sklearn.datasets import load_iris
import numpy as np
data = load_iris()
X = data['data']
y = data['target']
# rowvar=False treats each column (feature) as a variable, giving the
# 4x4 covariance matrix of the iris measurements
np.cov(X, rowvar=False)
Hope this has helped.
I am trying to use the lda 1.0.2 package for python.
The documentation says that sparse matrices are acceptable, but when I pass a sparse matrix to the transform() function, it throws the error
The truth value of an array with more than one element is ambiguous.
Use a.any() or a.all().
The transform() function works fine with a normal (dense) matrix.
Has anybody else faced a similar problem?
Any help would be great! Thanks in advance :)
I just got the same error. To reproduce:
from scipy.sparse import csr_matrix
import lda
X = csr_matrix([[1,0],[0,1]])
lda_test = lda.LDA(n_topics=2, n_iter=10)
lda_test.fit(X)
X_trans = lda_test.transform(X)
Which produces the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-192-a1a0875bac02> in <module>()
5 lda_test = lda.LDA(n_topics=2, n_iter=10)
6 lda_test.fit(X)
----> 7 X_trans = lda_test.transform(X)
C:\Users\lidw6lw\PortablePython\App\lib\site-packages\lda\lda.pyc in transform(self, X, max_iter, tol)
173 n_topics = len(self.components_)
174 doc_topic = np.empty((len(X), n_topics))
--> 175 WS, DS = lda.utils.matrix_to_lists(X)
176 # TODO: this loop is parallelizable
177 for d in range(len(X)):
C:\Users\lidw6lw\PortablePython\App\lib\site-packages\lda\utils.pyc in matrix_to_lists(doc_word)
44 if np.count_nonzero(doc_word.sum(axis=1)) != doc_word.shape[0]:
45 logger.warning("all zero row in document-term matrix found")
---> 46 if np.count_nonzero(doc_word.sum(axis=0)) != doc_word.shape[1]:
47 logger.warning("all zero column in document-term matrix found")
48 sparse = True
C:\Users\lidw6lw\PortablePython\App\lib\site-packages\numpy\core\_methods.pyc in _sum(a, axis, dtype, out, keepdims)
23 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
24 return um.add.reduce(a, axis=axis, dtype=dtype,
---> 25 out=out, keepdims=keepdims)
26
27 def _prod(a, axis=None, dtype=None, out=None, keepdims=False):
C:\Users\lidw6lw\PortablePython\App\lib\site-packages\scipy\sparse\base.pyc in __bool__(self)
181 return True if self.nnz == 1 else False
182 else:
--> 183 raise ValueError("The truth value of an array with more than one "
184 "element is ambiguous. Use a.any() or a.all().")
185 __nonzero__ = __bool__
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
Looks like it's due to lda.utils.matrix_to_lists
Both of the below work just fine:
X_trans = lda_test.fit(X.toarray())
X_trans2 = lda_test.fit_transform(X)
EDIT: It's actually the transform function that doesn't account for sparse matrices properly. Make a copy of the package and, in the code for transform, just replace len(X) with X.shape[0] and comment out the np.atleast_2d(X) line. So the section right below the docstring in transform looks like this:
# X = np.atleast_2d(X)
phi = self.components_
alpha = self.alpha
# for debugging, let's not worry about the documents
n_topics = len(self.components_)
doc_topic = np.empty((X.shape[0], n_topics))
WS, DS = lda.utils.matrix_to_lists(X)
# TODO: this loop is parallelizable
for d in range(X.shape[0]):
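If you would rather not patch the installed package, converting the sparse matrix to a dense array just for the transform call should sidestep the same checks. A workaround sketch, not part of the original fix:
# densify only for transform; fit can keep using the sparse matrix
X_trans = lda_test.transform(X.toarray())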
Got a similar error recently:
ValueError: expected sparse matrix with integer values, found float values
This fixed the issue:
model.fit(X.toarray().astype(int))
I'm having trouble computing the silhouette coefficient in python with sklearn.
Here is my code:
import pandas as pd
from sklearn import datasets
from sklearn.metrics import *
iris = datasets.load_iris()
col = iris.feature_names
X = pd.DataFrame(iris.data, columns = col)
y = pd.DataFrame(iris.target, columns = ['cluster'])
s = silhouette_score(X, y, metric='euclidean', sample_size=50)
I get the error:
IndexError: indices are out-of-bounds
I want to use the sample_size parameter because, when working with very large datasets, the silhouette score takes too long to compute. Does anyone know how this parameter is supposed to work?
Complete traceback:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-72-70ff40842503> in <module>()
4 X = pd.DataFrame(iris.data, columns = col)
5 y = pd.DataFrame(iris.target,columns = ['cluster'])
----> 6 s = silhouette_score(X, y, metric='euclidean',sample_size=50)
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
81 X, labels = X[indices].T[indices].T, labels[indices]
82 else:
---> 83 X, labels = X[indices], labels[indices]
84 return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
85
/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
1993 if isinstance(key, (np.ndarray, list)):
1994 # either boolean or fancy integer index
-> 1995 return self._getitem_array(key)
1996 elif isinstance(key, DataFrame):
1997 return self._getitem_frame(key)
/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_array(self, key)
2030 else:
2031 indexer = self.ix._convert_to_indexer(key, axis=1)
-> 2032 return self.take(indexer, axis=1, convert=True)
2033
2034 def _getitem_multilevel(self, key):
/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in take(self, indices, axis, convert)
2981 if convert:
2982 axis = self._get_axis_number(axis)
-> 2983 indices = _maybe_convert_indices(indices, len(self._get_axis(axis)))
2984
2985 if self._is_mixed_type:
/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _maybe_convert_indices(indices, n)
1038 mask = (indices>=n) | (indices<0)
1039 if mask.any():
-> 1040 raise IndexError("indices are out-of-bounds")
1041 return indices
1042
IndexError: indices are out-of-bounds
silhouette_score expects regular numpy arrays as input. Why wrap your arrays in data frames?
>>> silhouette_score(iris.data, iris.target, sample_size=50)
0.52999903616584543
From the traceback, you can observe that the code is doing fancy indexing (subsampling) on the first axis. By default, indexing a DataFrame indexes the columns and not the rows, hence the issue you observe.
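If you want to keep the DataFrames for the rest of your workflow, passing their underlying NumPy arrays also works. A small sketch using the variables from the question:
# .values hands silhouette_score plain numpy arrays, so the fancy indexing
# selects rows instead of columns
s = silhouette_score(X.values, y['cluster'].values, metric='euclidean', sample_size=50)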
I recently competed in a Kaggle competition and ran into problems trying to run linear CV models from scikit-learn. I am aware of a similar question on Stack Overflow, but I can't see how the accepted reply relates to my issue. Any assistance would be greatly appreciated. My code is given below:
import pandas as pd

train = pd.read_csv(".../train.csv")
test = pd.read_csv(".../test.csv")
data = pd.read_csv(".../sampleSubmission.csv")

from sklearn.feature_extraction.text import TfidfVectorizer
transformer = TfidfVectorizer(max_features=None)
Y = transformer.fit_transform(train.tweet)
Z = transformer.transform(test.tweet)

from sklearn import linear_model
clf = linear_model.RidgeCV()

a = 4
b = 1
while a < 28:
    clf.fit(Y, train.ix[:, a])
    pred = clf.predict(Z)
    linpred = pd.DataFrame(pred)
    data[data.columns[b]] = linpred
    b = b + 1
    a = a + 1
print b
The error that I receive is pasted in total below:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-41c31233c15c> in <module>()
1 blah=train.ix[:,a]
----> 2 clf.fit(Y, blah)
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site- packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
815 gcv_mode=self.gcv_mode,
816 store_cv_values=self.store_cv_values)
--> 817 estimator.fit(X, y, sample_weight=sample_weight)
818 self.alpha_ = estimator.alpha_
819 if self.store_cv_values:
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site- packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
722 raise ValueError('bad gcv_mode "%s"' % gcv_mode)
723
--> 724 v, Q, QT_y = _pre_compute(X, y)
725 n_y = 1 if len(y.shape) == 1 else y.shape[1]
726 cv_values = np.zeros((n_samples * n_y, len(self.alphas)))
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site- packages\sklearn\linear_model\ridge.pyc in _pre_compute(self, X, y)
607 def _pre_compute(self, X, y):
608 # even if X is very sparse, K is usually very dense
--> 609 K = safe_sparse_dot(X, X.T, dense_output=True)
610 v, Q = linalg.eigh(K)
611 QT_y = np.dot(Q.T, y)
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site- packages\sklearn\utils\extmath.pyc in safe_sparse_dot(a, b, dense_output)
76 from scipy import sparse
77 if sparse.issparse(a) or sparse.issparse(b):
---> 78 ret = a * b
79 if dense_output and hasattr(ret, "toarray"):
80 ret = ret.toarray()
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\base.pyc in __mul__(self, other)
301 if self.shape[1] != other.shape[0]:
302 raise ValueError('dimension mismatch')
--> 303 return self._mul_sparse_matrix(other)
304
305 try:
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site- packages\scipy\sparse\compressed.pyc in _mul_sparse_matrix(self, other)
518
519 nnz = indptr[-1]
--> 520 indices = np.empty(nnz, dtype=np.intc)
521 data = np.empty(nnz, dtype=upcast(self.dtype,other.dtype))
522
ValueError: negative dimensions are not allowed
It looks like this problem occurs without using sklearn: it's in scipy.sparse matrix multiplication. There is this issue on a scipy-users board: sparse matrix multiplication problem. The crux of the problem is that scipy uses a 32-bit int for the non-zero indices during sparse matrix multiplication. That's the marked line at the bottom of the traceback above. The count can overflow if there are too many non-zero elements, and the overflow makes the variable nnz negative. The code at the last arrow then creates an empty array of size nnz, resulting in a ValueError due to a negative dimension.
You can generate the tail end of the traceback above without sklearn as follows:
import scipy.sparse as ss
X = ss.rand(75000, 42000, format='csr', density=0.01)
X * X.T
For this problem, the input is probably quite sparse, but RidgeCV looks like it's multiplying X and X.T in the last part of the traceback within sklearn. That product might not be sparse enough.
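One possible mitigation, not from the original answer: passing an explicit cv to RidgeCV should make scikit-learn fall back to an ordinary cross-validated grid search over the alphas instead of the GCV path that forms the X * X.T kernel, so the oversized sparse product is never built. A sketch using the question's variables and an assumed alpha grid:
from sklearn import linear_model
# cv != None avoids the generalized cross-validation (GCV) code path,
# which is the one that computes safe_sparse_dot(X, X.T)
clf = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5)
clf.fit(Y, train.ix[:, 4])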