How to use the cov function on the iris dataset in Python - python

I want to get the covariance matrix of the iris dataset, https://www.kaggle.com/jchen2186/machine-learning-with-iris-dataset/data
I am using numpy, and the function np.cov(iris):
import csv
import numpy as np

with open("Iris.csv") as iris:
    reader = csv.reader(iris)
    data = []
    next(reader)  # skip the header row
    for row in reader:
        data.append(row)

for i in data:
    i.pop(0)   # drop the Id column
    i.pop(4)   # drop the Species column

iris = np.array(data)
np.cov(iris)
And I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-bfb836354075> in <module>
----> 1 np.cov(iris)
D:\Anaconda\lib\site-packages\numpy\lib\function_base.py in cov(m, y, rowvar, bias, ddof, fweights, aweights)
2300 w *= aweights
2301
-> 2302 avg, w_sum = average(X, axis=1, weights=w, returned=True)
2303 w_sum = w_sum[0]
2304
D:\Anaconda\lib\site-packages\numpy\lib\function_base.py in average(a, axis, weights, returned)
354
355 if weights is None:
--> 356 avg = a.mean(axis)
357 scl = avg.dtype.type(a.size/avg.size)
358 else:
D:\Anaconda\lib\site-packages\numpy\core\_methods.py in _mean(a, axis, dtype, out, keepdims)
73 is_float16_result = True
74
---> 75 ret = umr_sum(arr, axis, dtype, out, keepdims)
76 if isinstance(ret, mu.ndarray):
77 ret = um.true_divide(
TypeError: cannot perform reduce with flexible type
I don't understand what this error means.

The error happens because csv.reader gives you strings, so your array ends up with a string ("flexible") dtype that np.cov cannot average. So, if you want to keep your own code, you could try reading Iris.csv with the pandas.read_csv function and then selecting the appropriate numeric columns.
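A minimal sketch of that approach (assuming the Id and Species column names used in the Kaggle file) could look like this:
import pandas as pd
import numpy as np

df = pd.read_csv("Iris.csv")
features = df.drop(columns=["Id", "Species"])                 # keep only the four numeric columns
cov = np.cov(features.to_numpy(dtype=float), rowvar=False)    # 4x4 feature covariance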
BUT, here is a little set of commands to ease up this task. They use scikit-learn and numpy to load the iris dataset, obtain X and y, and compute the covariance matrix:
from sklearn.datasets import load_iris
import numpy as np
data = load_iris()
X = data['data']
y = data['target']
np.cov(X)
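Note that np.cov treats each row as a variable by default, so np.cov(X) here returns a 150x150 matrix. If what you want is the covariance of the four measured features, pass rowvar=False (or transpose X):
np.cov(X, rowvar=False)   # 4x4 covariance matrix of the iris features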
Hope this has helped.

Related

Logistic Regression Model (binary) crosstab error = shape of passed values issue

I am currently trying to run logistic regression for a data set. I dummy encoded my categorical variables, normalized my continuous variables, and filled null values with -1 (which works for my dataset). I am going through the steps and I am not getting any errors until I try to run my crosstab, where it complains about the shape of the values passed. I'm getting the same error for both LogR with and without CV. I have included my code below; I did not include the encoding because that does not seem to be the issue, or the code for LogR without CV because it is basically identical except that it excludes the CV.
# read in the df w/ encoded variables
allyrs=pd.read_csv("C:/Users/cyrra/OneDrive/Documents/Pythonread/HDS805/CS1W1/modelready_working.csv")
# Find locations of where I need to trim the data down, selecting only the encoded variables
allyrs.columns.get_loc("BMI_C__-1.0")   # 23
allyrs.columns.get_loc("N_BMIR")        # 152
# Finding the location of the y col
allyrs.columns.get_loc("CM")            # 23
#create new X and y for binary LR
y_bi = allyrs[["CM"]]
X_bi = allyrs.iloc[0:1305720, 23:152]
I then went ahead and checked the lengths of both variables and checked for all the columns in the X set; everything was there. The values are as follows: y_bi = 1305720 rows × 1 column, X_bi = 1305720 rows × 129 columns.
# Create test/train
# Create test/train for bi column
from sklearn.model_selection import train_test_split
Xbi_train, Xbi_test, ybi_train, ybi_test = train_test_split(X_bi, y_bi,
train_size=0.8,test_size = 0.2)
Again I check the size of Xbi_train and ybi_train: Xbi_train = 1044576 rows × 129 columns, ybi_train = 1044576 rows × 1 column.
# LRw/CV for the binary col
from sklearn.linear_model import LogisticRegressionCV
logitbi_cv = LogisticRegressionCV(cv=2, random_state=0).fit(Xbi_train, ybi_train)
# Set predicted (checking to see if its an array)
logitbi_cv.predict(Xbi_train)
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
# Set predicted to its own variable
[IN]: pred_logitbi_cv = logitbi_cv.predict(Xbi_train)
# Cross tab LR w/0ut
from sklearn.metrics import confusion_matrix
ct_bi_cv=pd.crosstab(ybi_train, pred_logitbi_cv)
The error:
[OUT]:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_arrays(arrays, names, axes)
1701 blocks = _form_blocks(arrays, names, axes)
-> 1702 mgr = BlockManager(blocks, axes)
1703 mgr._consolidate_inplace()
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in __init__(self, blocks, axes, do_integrity_check)
142 if do_integrity_check:
--> 143 self._verify_integrity()
144
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in _verify_integrity(self)
322 if block.shape[1:] != mgr_shape[1:]:
--> 323 raise construction_error(tot_items, block.shape[1:], self.axes)
324 if len(self.items) != tot_items:
ValueError: Shape of passed values is (1, 2), indices imply (1044576, 2)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-121-c669b17c171f> in <module>
1 # LR W/ CV
2 # Cross tab LR w/0ut
----> 3 ct_bi_cv=pd.crosstab(ybi_train, pred_logitbi_cv)
~\anaconda3\lib\site-packages\pandas\core\reshape\pivot.py in crosstab(index, columns, values, rownames, colnames, aggfunc, margins, margins_name, dropna, normalize)
596 **dict(zip(unique_colnames, columns)),
597 }
--> 598 df = DataFrame(data, index=common_idx)
599 original_df_cols = df.columns
600
~\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
527
528 elif isinstance(data, dict):
--> 529 mgr = init_dict(data, index, columns, dtype=dtype)
530 elif isinstance(data, ma.MaskedArray):
531 import numpy.ma.mrecords as mrecords
~\anaconda3\lib\site-packages\pandas\core\internals\construction.py in init_dict(data, index, columns, dtype)
285 arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
286 ]
--> 287 return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
288
289
~\anaconda3\lib\site-packages\pandas\core\internals\construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype, verify_integrity)
93 axes = [columns, index]
94
---> 95 return create_block_manager_from_arrays(arrays, arr_names, axes)
96
97
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_arrays(arrays, names, axes)
1704 return mgr
1705 except ValueError as e:
-> 1706 raise construction_error(len(arrays), arrays[0].shape, axes, e)
1707
1708
ValueError: Shape of passed values is (1, 2), indices imply (1044576, 2)
I realize this is saying that the number of rows being passed to the crosstab doesn't match, but can someone tell me why this is happening or where I am going wrong? I am copying the example code with my own data exactly as it was provided in the book I am working from.
Thank you so much!
Your target variable should be of shape (n,), not (n, 1), which is what you get when you call y_bi = allyrs[["CM"]]. See the relevant help page. There is normally a warning about this during the fit, but I guess it was missed somehow.
If you instead call y_bi = allyrs["CM"], it works. For example, setting up some dummy data:
import numpy as np
import pandas as pd
np.random.seed(111)
allyrs = pd.DataFrame(np.random.binomial(1,0.5,(100,4)),columns=['x1','x2','x3','CM'])
X_bi = allyrs.iloc[:, :4]   # note: this toy slice still includes the CM column itself
y_bi = allyrs["CM"]         # a Series of shape (n,), not a one-column DataFrame
Then run the train test split followed by the fit:
from sklearn.model_selection import train_test_split
Xbi_train, Xbi_test, ybi_train, ybi_test = train_test_split(X_bi, y_bi,
train_size=0.8,test_size = 0.2)
from sklearn.linear_model import LogisticRegressionCV
logitbi_cv = LogisticRegressionCV(cv=2, random_state=0).fit(Xbi_train, ybi_train)
pred_logitbi_cv =logitbi_cv.predict(Xbi_train)
pd.crosstab(ybi_train, pred_logitbi_cv)
col_0   0   1
CM
0      39   0
1       0  41
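Applied to the original code, the only change needed would be to select the target with single brackets, or to flatten the existing (n, 1) selection, for example:
y_bi = allyrs["CM"]                    # Series of shape (n,)
# or, keeping the original DataFrame selection:
ybi_train = ybi_train.values.ravel()   # flatten (n, 1) to (n,)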

Python - linear regression TypeError: invalid type promotion

I am trying to run a linear regression and I think I am having issues with the data type. I have tested line by line and everything works until I reach the last line, where I get the error TypeError: invalid type promotion. Based on my research, I think it is due to the date format.
Here is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
data = pd.read_excel('C:\\Users\\Proximo\\PycharmProjects\\Counts\\venv\\Counts.xlsx')
data['DATE'] = pd.to_datetime(data['DATE'])
data.plot(x = 'DATE', y = 'COUNT', style = 'o')
plt.title('Corona Spread Over the Time')
plt.xlabel('Date')
plt.ylabel('Count')
plt.show()
X=data['DATE'].values.reshape(-1,1)
y=data['COUNT'].values.reshape(-1,1)
X_train,X_test,Y_train,Y_test=train_test_split(X,y,test_size=.2,random_state=0)
regressor = LinearRegression()
regressor.fit(X_train,Y_train)
y_pre = regressor.predict(X_test)
When I run it, this is the full error I get:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-21-c9e943251026> in <module>
----> 1 y_pre = regressor.predict(X_test)
2
c:\users\slavi\pycharmprojects\coronavirus\venv\lib\site-packages\sklearn\linear_model\_base.py in predict(self, X)
223 Returns predicted values.
224 """
--> 225 return self._decision_function(X)
226
227 _preprocess_data = staticmethod(_preprocess_data)
c:\users\slavi\pycharmprojects\coronavirus\venv\lib\site-packages\sklearn\linear_model\_base.py in _decision_function(self, X)
207 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
208 return safe_sparse_dot(X, self.coef_.T,
--> 209 dense_output=True) + self.intercept_
210
211 def predict(self, X):
c:\users\Proximo\pycharmprojects\Count\venv\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
149 ret = np.dot(a, b)
150 else:
--> 151 ret = a # b
152
153 if (sparse.issparse(a) and sparse.issparse(b)
TypeError: invalid type promotion
My date values look like this:
array([['2020-01-20T00:00:00.000000000'],
['2020-01-21T00:00:00.000000000'],
['2020-01-22T00:00:00.000000000'],
['2020-01-23T00:00:00.000000000'],
['2020-01-24T00:00:00.000000000'],
['2020-01-25T00:00:00.000000000'],
['2020-01-26T00:00:00.000000000'],
['2020-01-27T00:00:00.000000000'],
['2020-01-28T00:00:00.000000000'],
['2020-01-29T00:00:00.000000000'],
['2020-01-30T00:00:00.000000000'],
['2020-01-31T00:00:00.000000000'],
['2020-02-01T00:00:00.000000000'],
['2020-02-02T00:00:00.000000000']], dtype='datetime64[ns]')
Any suggestion on how to resolve this issue?
I think linear regression does not work directly on date-type data. You need to convert it to numerical data, for example:
import numpy as np
import pandas as pd
import datetime as dt
X_test = pd.DataFrame(np.array([
['2020-01-24T00:00:00.000000000'],
['2020-01-25T00:00:00.000000000'],
['2020-01-26T00:00:00.000000000'],
['2020-01-27T00:00:00.000000000'],
['2020-01-28T00:00:00.000000000'],
['2020-01-29T00:00:00.000000000'],
['2020-01-30T00:00:00.000000000'],
['2020-01-31T00:00:00.000000000'],
['2020-02-01T00:00:00.000000000'],
['2020-02-02T00:00:00.000000000']], dtype='datetime64[ns]'))
X_test.columns = ["Date"]
X_test['Date'] = pd.to_datetime(X_test['Date'])
X_test['Date']=X_test['Date'].map(dt.datetime.toordinal)
Try this approach; it should work.
Note - it is better to convert the training set dates to numeric as well and train on that data.
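Applying the same idea to the code in the question might look like the sketch below (assuming the same DATE and COUNT columns; the ordinal conversion is just one possible numeric encoding):
import datetime as dt

X = data['DATE'].map(dt.datetime.toordinal).values.reshape(-1, 1)   # dates as integers
y = data['COUNT'].values.reshape(-1, 1)

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, Y_train)
y_pre = regressor.predict(X_test)   # no type promotion error now that X is numeric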

sklearn clustering with custom metric: pairwise_distances throwing error

I would like to cluster sets of spatial data using my own metric. The data comes as pairs of (x,y) values in a dataframe, where each set of pairs has an id. Like in the following example where I have three sets of points:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1] * 4 + [2] * 5 + [3] * 3,
                   'x': np.random.random(12),
                   'y': np.random.random(12)})
df['xy'] = df[['x','y']].apply(lambda row: [row['x'], row['y']], axis=1)
Here is the distance function I would like to use:
from scipy.spatial.distance import directed_hausdorff
def some_distance(u, v):
    return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
This function computes the Hausdorff distance, i.e. the distance between two subsets u and v of n-dimensional space. In my case, I would like to use this distance function to cluster subsets of the real plane. In the data above there are three such subsets (ids from 1 to 3) so the resulting distance matrix should be 3x3.
My idea for the clustering step was to use sklearn.cluster.AgglomerativeClustering with a precomputed metric, which in turn I want to compute with sklearn.metrics.pairwise.pairwise_distances.
from sklearn.metrics.pairwise import pairwise_distances
def to_np_array(col):
    return np.array(list(col.values))
X = df.groupby('id')['xy'].apply(to_np_array).as_matrix()
m = pairwise_distances(X, X, metric=some_distance)
However, the last line is giving me an error:
ValueError: setting an array element with a sequence.
What does work fine, however, is calling some_distance(X[1], X[2]).
My hunch is that X needs to be a different format for pairwise_distances to work. Any ideas on how to make this work, or how to compute the matrix myself so I can stick it into sklearn.cluster.AgglomerativeClustering?
The error stack is
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-e34155622595> in <module>
12 def some_distance(u, v):
13 return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
---> 14 m = pairwise_distances(X, X, metric=some_distance)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1430 func = partial(distance.cdist, metric=metric, **kwds)
1431
-> 1432 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1433
1434
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1065
1066 if effective_n_jobs(n_jobs) == 1:
-> 1067 return func(X, Y, **kwds)
1068
1069 # TODO: in some cases, backend='threading' may be appropriate
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in _pairwise_callable(X, Y, metric, **kwds)
1079 """Handle the callable case for pairwise_{distances,kernels}
1080 """
-> 1081 X, Y = check_pairwise_arrays(X, Y)
1082
1083 if X is Y:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
106 if Y is X or Y is None:
107 X = Y = check_array(X, accept_sparse='csr', dtype=dtype,
--> 108 warn_on_dtype=warn_on_dtype, estimator=estimator)
109 else:
110 X = check_array(X, accept_sparse='csr', dtype=dtype,
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
525 try:
526 warnings.simplefilter('error', ComplexWarning)
--> 527 array = np.asarray(array, dtype=dtype, order=order)
528 except ComplexWarning:
529 raise ValueError("Complex data not supported\n"
C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: setting an array element with a sequence.
Try this:
import numpy as np
import pandas as pd
from scipy.spatial.distance import directed_hausdorff
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import AgglomerativeClustering

df = pd.DataFrame({'id': [1] * 4 + [2] * 5 + [3] * 3,
                   'x': np.random.random(12),
                   'y': np.random.random(12)})
df['xy'] = df[['x','y']].apply(lambda row: [row['x'], row['y']], axis=1)

def some_distance(u, v):
    return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])

def to_np_array(col):
    return np.array(list(col.values))

X = df.groupby('id')['xy'].apply(to_np_array)

# build the symmetric distance matrix by hand instead of using pairwise_distances
d = np.zeros((len(X), len(X)))
for i, u in enumerate(X):
    for j, v in list(enumerate(X))[i:]:
        d[i, j] = some_distance(u, v)
        d[j, i] = d[i, j]
And now when you print d you get this:
array([[0. , 0.58928274, 0.40767213],
[0.58928274, 0. , 0.510095 ],
[0.40767213, 0.510095 , 0. ]])
And for clustering:
cluster = AgglomerativeClustering(n_clusters=2, affinity='precomputed', linkage = 'average')
cluster.fit(d)
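The cluster assignments (one per id group) are then available in cluster.labels_; with random input data the exact grouping will vary:
cluster.labels_   # e.g. array([0, 1, 0]) -- one label per id group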
It would help if you showed some of the variables. Fortunately you gave enough code to run it. For example the dataframe:
In [9]: df
Out[9]:
id x y xy
0 1 0.428437 0.267264 [0.42843730501201727, 0.2672637429997736]
1 1 0.944687 0.023323 [0.9446872371859233, 0.023322969159167317]
2 1 0.091055 0.683154 [0.09105472832178496, 0.6831542985617349]
3 1 0.474522 0.313541 [0.4745222021519122, 0.3135405569298565]
4 2 0.835237 0.491541 [0.8352366339973815, 0.4915408434083248]
5 2 0.905918 0.854030 [0.9059178939221513, 0.8540297797160584]
6 2 0.182154 0.909656 [0.18215390836391654, 0.9096555360282939]
7 2 0.225270 0.522193 [0.22527013482912195, 0.5221926076838651]
8 2 0.924208 0.858627 [0.9242076604008371, 0.8586274362498842]
9 3 0.419813 0.634741 [0.41981292371175905, 0.6347409684931891]
10 3 0.954141 0.795452 [0.9541413559045294, 0.7954524369652217]
11 3 0.896593 0.271187 [0.8965932351250882, 0.2711872631673109]
And your X:
In [10]: X
Out[10]:
array([array([[0.42843731, 0.26726374],
[0.94468724, 0.02332297],
[0.09105473, 0.6831543 ],
[0.4745222 , 0.31354056]]),
array([[0.83523663, 0.49154084],
[0.90591789, 0.85402978],
[0.18215391, 0.90965554],
[0.22527013, 0.52219261],
[0.92420766, 0.85862744]]),
array([[0.41981292, 0.63474097],
[0.95414136, 0.79545244],
[0.89659324, 0.27118726]])], dtype=object)
That is a (3,) object array - in effect a list of 3 2d arrays with different sizes ((4,2), (5,2), (3,2)). That's one array for each group.
How is pairwise_distances supposed to feed that to your distance code? The pairwise_distances docs say X should be an (n, m) array - n samples, m features. Your X doesn't fit that description!
The error is probably produced when it tries to make a float array from X:
In [12]: np.asarray(X,dtype=float)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-a6e08bb1590c> in <module>
----> 1 np.asarray(X,dtype=float)
/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: setting an array element with a sequence.

pymc3 with custom likelihood function from kernel density estimation

I'm trying to use pymc3 with a likelihood function derived from some observed data. This observed data doesn't fit any nice, standard distribution, so I want to define my own, based on these observations.
One approach is to use kernel density estimation over the observations. This was possible in pymc2, but doesn't play nicely with the Theano variables in pymc3.
In my code below I'm just generating some dummy data that is normally distributed. As my prior, I'm essentially assuming a uniform distribution for my observations.
Here's my code:
from scipy import stats
import numpy as np
import pymc3 as pm
from sklearn.neighbors.kde import KernelDensity

data = np.sort(stats.norm.rvs(loc=0, scale=1, size=1000))
kde = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(data.reshape(-1, 1))

def get_log_likelihood(x):
    return kde.score_samples(x)

with pm.Model() as test_model:
    x = pm.Uniform('prior rv', lower=-10, upper=10)
    obs = pm.DensityDist('observed likelihood', get_log_likelihood, observed={'x': x})

    step = pm.Metropolis()
    trace = pm.sample(200, step=step)
The error I receive seems to be the kde score_samples function blowing up as it expects an array, but x is a Theano variable.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-4efbbe7376dc> in <module>()
1 with pm.Model() as test_model:
2 x = pm.Uniform('prior rv', lower=0.0, upper=1e6)
----> 3 obs = pm.DensityDist('observed likelihood', get_log_likelihood, observed={'x': x})
4
5 step = pm.Metropolis()
~/research_notebooks/venv/lib/python3.6/site-packages/pymc3/distributions/distribution.py in __new__(cls, name, *args, **kwargs)
40 total_size = kwargs.pop('total_size', None)
41 dist = cls.dist(*args, **kwargs)
---> 42 return model.Var(name, dist, data, total_size)
43 else:
44 raise TypeError("Name needs to be a string but got: {}".format(name))
~/research_notebooks/venv/lib/python3.6/site-packages/pymc3/model.py in Var(self, name, dist, data, total_size)
825 with self:
826 var = MultiObservedRV(name=name, data=data, distribution=dist,
--> 827 total_size=total_size, model=self)
828 self.observed_RVs.append(var)
829 if var.missing_values:
~/research_notebooks/venv/lib/python3.6/site-packages/pymc3/model.py in __init__(self, name, data, distribution, total_size, model)
1372 self.missing_values = [datum.missing_values for datum in self.data.values()
1373 if datum.missing_values is not None]
-> 1374 self.logp_elemwiset = distribution.logp(**self.data)
1375 # The logp might need scaling in minibatches.
1376 # This is done in `Factor`.
<ipython-input-48-535f58ce543b> in get_log_likelihood(x)
1 def get_log_likelihood(x):
----> 2 return kde.score_samples(x)
~/research_notebooks/venv/lib/python3.6/site-packages/sklearn/neighbors/kde.py in score_samples(self, X)
150 # For it to be a probability, we must scale it. For this reason
151 # we'll also scale atol.
--> 152 X = check_array(X, order='C', dtype=DTYPE)
153 N = self.tree_.data.shape[0]
154 atol_N = self.atol * N
~/research_notebooks/venv/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
431 force_all_finite)
432 else:
--> 433 array = np.array(array, dtype=dtype, order=order, copy=copy)
434
435 if ensure_2d:
ValueError: setting an array element with a sequence.
Any help would be greatly appreciated. Thanks!

Is silhouette coefficient subsampling stratified in sklearn?

I'm again having trouble using the scikit-learn silhouette coefficient (my first question was here: silhouette coefficient in python with sklearn).
I compute a clustering that can be very unbalanced but involves a lot of individuals, so I want to use the sample_size parameter of the silhouette coefficient. I was wondering whether the subsampling is stratified, i.e. sampled with respect to clusters. I take the iris dataset as an example, but my dataset is far bigger (which is why I need sampling).
My code is:
from sklearn import datasets
from sklearn.metrics import *
import pandas as pd

iris = datasets.load_iris()
col = iris.feature_names
name = iris.target_names
X = pd.DataFrame(iris.data, columns=col)
y = iris.target
s = silhouette_score(X.values, y, metric='euclidean', sample_size=50)
which works. But now if I bias the labels like this:
y[0:148] = 0
y[148] = 1
y[149] = 2
print y
s = silhouette_score(X.values, y, metric='euclidean', sample_size=50)
I get:
ValueError Traceback (most recent call last)
<ipython-input-12-68a7fba49c54> in <module>()
4 y[149] =2
5 print y
----> 6 s = silhouette_score(X.values, y, metric='euclidean',sample_size=50)
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
82 else:
83 X, labels = X[indices], labels[indices]
---> 84 return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
85
86
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_samples(X, labels, metric, **kwds)
146 for i in range(n)])
147 B = np.array([_nearest_cluster_distance(distances[i], labels, i)
--> 148 for i in range(n)])
149 sil_samples = (B - A) / np.maximum(A, B)
150 # nan values are for clusters of size 1, and should be 0
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in _nearest_cluster_distance(distances_row, labels, i)
200 label = labels[i]
201 b = np.min([np.mean(distances_row[labels == cur_label])
--> 202 for cur_label in set(labels) if not cur_label == label])
203 return b
/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.pyc in amin(a, axis, out, keepdims)
1980 except AttributeError:
1981 return _methods._amin(a, axis=axis,
-> 1982 out=out, keepdims=keepdims)
1983 # NOTE: Dropping the keepdims parameter
1984 return amin(axis=axis, out=out)
/usr/lib/python2.7/dist-packages/numpy/core/_methods.pyc in _amin(a, axis, out, keepdims)
12 def _amin(a, axis=None, out=None, keepdims=False):
13 return um.minimum.reduce(a, axis=axis,
---> 14 out=out, keepdims=keepdims)
15
16 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
ValueError: zero-size array to reduction operation minimum which has no identity
an error which I think is due to the fact that the sampling is random rather than stratified, so it has not taken the two small clusters into account.
Am I correct?
Yes you are correct. The sampling is not stratified since it doesn't take the labels into consideration when doing the sampling.
This is how the sample is taken (version 0.14.1)
indices = random_state.permutation(X.shape[0])[:sample_size]
Where X is the input array of size [n_samples_a, n_samples_a] or [n_samples_a, n_features].
I think you are right, the current implementation does not support balanced resampling.
Just an update for year 2020:
As of scikit-learn 0.22.1, the sampling remains random (i.e. not stratified).
The source code is still:
indices = random_state.permutation(X.shape[0])[:sample_size]
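If a stratified subsample is needed, one workaround is to draw it yourself, per label, and then call silhouette_score on that subsample without the sample_size argument. A rough sketch (the per-label quota logic here is illustrative, not part of scikit-learn):
import numpy as np
from sklearn.metrics import silhouette_score

def stratified_silhouette(X, labels, sample_size, random_state=0):
    rng = np.random.RandomState(random_state)
    labels = np.asarray(labels)
    idx = []
    for lab in np.unique(labels):
        members = np.flatnonzero(labels == lab)
        # proportional quota, but keep at least 2 points so every cluster survives subsampling
        n_take = max(2, int(round(sample_size * len(members) / len(labels))))
        idx.extend(rng.choice(members, size=min(n_take, len(members)), replace=False))
    idx = np.array(idx)
    return silhouette_score(X[idx], labels[idx], metric='euclidean')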
