So I tried to learn about hierarchical clustering, but I always get an error in Spyder:
AttributeError: 'AgglomerativeClustering' object has no attribute 'distances_'
This is the code:
from sklearn.cluster import AgglomerativeClustering
import pandas as pd
df = pd.DataFrame({
    'x': [41, 36, 32, 34, 32, 31, 24, 30, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 30, 30, 31, 32, 29]
})
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=0)
clustering.fit(df)
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
plt.title('Hierarchical Clustering Dendrogram')
# plot the top 3 levels of the dendrogram
plot_dendrogram(clustering, truncate_mode='level', p=3)
plt.xlabel("data index")
plt.show()
#print(clustering.labels_)
I have upgraded scikit-learn to the newest version, but the same error still exists. Is there anything I can do, or is there something wrong in this code?
The official documentation of sklearn.cluster.AgglomerativeClustering() says:
distances_ : array-like of shape (n_nodes-1,)
Distances between nodes in the corresponding place in children_.
Only computed if distance_threshold is used or compute_distances is set to True.
I had the same problem and I fixed it by setting the parameter compute_distances=True.
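For example (a minimal sketch; per the scikit-learn docs, compute_distances is available from version 0.24):

from sklearn.cluster import AgglomerativeClustering

# With a fixed n_clusters, distances_ is only filled in when
# compute_distances=True is passed explicitly
clustering = AgglomerativeClustering(n_clusters=3, compute_distances=True)
clustering.fit(df)
print(clustering.distances_)  # now usable for the linkage matrix above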
Apparently I missed some steps before I posted this question, so here are the steps I took to solve this problem:
Uninstall scikit-learn through the Anaconda prompt
Install scikit-learn back
If somehow your Spyder is gone, install it again with the Anaconda prompt
Install any missing libraries
It works again :)
Update scikit-learn from 0.21.* to 0.22.*:
pip install -U scikit-learn
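After upgrading, it is worth verifying which version your environment actually loads (a quick sanity check; a stale environment is a common reason the error persists after an upgrade):

import sklearn
print(sklearn.__version__)  # per the answer above, should be 0.22 or newer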
For me the solution from https://stackoverflow.com/a/61363342/10270590 did not work, but it worked without the dendrogram.
Example code:
aggmodel = AgglomerativeClustering(distance_threshold=None,
                                   n_clusters=10,
                                   affinity="manhattan",
                                   linkage="complete")
aggmodel = aggmodel.fit(data1)
aggmodel.n_clusters_
#aggmodel.labels_
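If you do want distances_ together with a fixed n_clusters, the documentation excerpt quoted earlier suggests combining this with compute_distances=True (needs scikit-learn 0.24 or newer):

aggmodel = AgglomerativeClustering(distance_threshold=None,
                                   n_clusters=10,
                                   affinity="manhattan",
                                   linkage="complete",
                                   compute_distances=True)  # populates distances_ as well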
I want to fit the following model (a two-equation ODE system, written out in the code below) to some fluorescence measurements ([YFP] over time). Basically I can measure the change of [YFP] over time, but not the change of x. After navigating through different solutions on Stack Overflow (and trying several of the proposed solutions), I got pretty close with symfit.
However, when I try to fit the model to the experimental results, I get the following fit results:
Parameter   Value          Standard Deviation
TauOFF      4.425923e-02   2.173698e+00
TauON       9.687891e+00   1.945774e+02
TauONx      4.539607e-02   2.239210e+00
x_SS        7.968579e+00   2.726591e+02
Status message Maximum number of function evaluations has been exceeded.
Number of iterations 443
Objective <symfit.core.objectives.LeastSquares object at 0x000002640701C898>
Minimizer <symfit.core.minimizers.NelderMead object at 0x000002640701CEF0>
Goodness of fit qualifiers:
chi_squared 480161.4690600715
objective_value 240080.73453003576
r_squared 0.9677940481847731
I don't understand why the prediction for x is so low and almost constant (almost, because when I zoom in it actually changes a little bit). Also, it says that "Maximum number of function evaluations has been exceeded". What am I doing wrong? Am I using the wrong minimizer? The wrong initial parameter estimates?
Below is my code:
# %% Importing modules
import symfit
from symfit import parameters, variables, ODEModel, Fit, Parameter, D
from symfit.core.objectives import LogLikelihood
from symfit.core.minimizers import NelderMead
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sympy.solvers import ode
# %% Experimental data. Inputs is time, outputs is fluorescence measurements ([YFP])
inputs = np.array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66])
outputs = [73.64756293519015, 44.83500717360115, 66.59467242620596, 49.65998568360771, 46.484859217283514, 72.37530519707008, 74.47354904982025, 61.437468439656605, 80.15204098496119, 93.11740890688259, 74.73900664346728, 87.38835848475512, 94.96499329658872, 116.07910576096306, 126.95045168354777, 123.76237623762376, 147.73432650527624, 168.04489072652098, 183.3221551531411, 321.22186495176834, 356.38957816377166, 389.03440885819737, 321.22186495176834, 356.38957816377166, 389.03440885819737, 582.1501961516907, 607.139657798083, 651.6151143860851, 682.4329863103533, 716.422610612502, 749.3927432822223, 777.726234656009, 809.6079246328624, 847.2845376012857, 870.6370831711431, 895.512942218847, 914.3568311720239, 1002.7537605116663, 1019.3525890625908, 1028.7006485379452, 1073.162564875272, 1080.7277331278212, 1106.8392267287595, 1119.0425361584034, 1139.207233729366, 1145.790182270091, 1177.2867420349437, 1185.0114126299773, 1196.1818638533032, 1213.7383689107828, 1208.2922013820337, 1209.8943558642277, 1225.7463589296947, 1232.9657629893582, 1221.7722725107194, 1237.6858956142842, 1240.1111320399323, 1240.6384572177496, 1249.767333643555, 1247.0462864291337, 1259.6783113651027, 1258.188648128636, 1267.006026296567, 1272.2310666363428, 1260.6866757617101, 1266.8857660924748]
# %% Model Definitions
x, y, t = variables('x, y, t')
TauONx = Parameter('TauONx', 0.1)
TauON = Parameter('TauON', 0.180854297)
### For a moment, I thought of fixing TauOFF, obtaining this value from other experiments
TauOFF = Parameter('TauOFF', 10.53547354)
#TauOFF = 10.53547354
x_SS = Parameter('x_SS', 0.1)
#### All of this is using symfit package!
model_dict = {
    D(x, t): TauONx*(x_SS - x),
    D(y, t): TauON*x - TauOFF*y,
}
# %% Execute data
ode_model = ODEModel(model_dict, initial={t: 0.0, x: 54 * 10e-4, y: 54 * 10e-4})
fit = Fit(ode_model, t=inputs, x=None, y=outputs, minimizer=NelderMead)
#fit = Fit(ode_model, outputs, objective=LogLikelihood)
fit_result = fit.execute()
print(fit_result)
# %% Plot the data generated vs the output
tvec = np.linspace(0, 60, 1000)
X, Y = ode_model(t=tvec, **fit_result.params)
plt.plot(tvec, X, label='[x]')
plt.plot(tvec, Y, label='[y]')
plt.scatter(inputs, outputs)
plt.legend()
plt.show()
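A possible direction (a sketch only, not a verified fix): the "maximum number of function evaluations" message and the huge standard deviations suggest NelderMead is starting too far from the optimum. symfit allows chaining minimizers, so a global search can seed the local one; note that DifferentialEvolution requires finite min/max bounds on every parameter, and the bounds below are guesses for illustration only:

from symfit import Parameter, Fit
from symfit.core.minimizers import DifferentialEvolution, NelderMead

# Hypothetical bounds; adjust to physically plausible ranges for your system
TauONx = Parameter('TauONx', value=0.1, min=1e-4, max=10)
TauON = Parameter('TauON', value=0.18, min=1e-4, max=100)
TauOFF = Parameter('TauOFF', value=10.5, min=1e-4, max=100)
x_SS = Parameter('x_SS', value=0.1, min=1e-4, max=100)

# Rebuild model_dict and ode_model with these parameters as above, then:
fit = Fit(ode_model, t=inputs, x=None, y=outputs,
          minimizer=[DifferentialEvolution, NelderMead])
fit_result = fit.execute()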
I am confused. I found a few formulas for finding the SD (standard deviation).
This is the NumPy library std method:
>>> nums = np.array([65, 36, 52, 91, 63, 79])
>>> np.std(nums)
17.716909687891082
But I found another formula here: Standard deviation
By that formula, with the same dataset, my result is 323.1666666666667. Now which one is right? Or are they used for two different things?
EDIT: Seems I forgot about the square root
NumPy is correct, of course. Here is the plain Python version:
from math import sqrt
data = [65, 36, 52, 91, 63, 79]
mean = sum(data) / len(data)
std = sqrt(sum((d - mean) ** 2 for d in data) / len(data))
print(std) # 17.716909687891082
Core Python: see statistics.pstdev.
import statistics
print(statistics.pstdev([65, 36, 52, 91, 63, 79]))
Output:
17.716909687891082
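Note the population-versus-sample distinction behind these numbers: np.std and statistics.pstdev divide by n (population standard deviation), while statistics.stdev and np.std with ddof=1 divide by n - 1 (sample standard deviation):

import numpy as np
import statistics

nums = [65, 36, 52, 91, 63, 79]
print(np.std(nums))            # 17.716909687891082, divides by n
print(np.std(nums, ddof=1))    # about 19.4079, divides by n - 1
print(statistics.stdev(nums))  # same as ddof=1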
Edit: Everything is good :)
This is code which works with small values such as t=20 and TR=([[30,20,12,23..],[...]]), but when I put in higher values it fails with "Expect x to be a 1-D sorted array_like.". Do you know how to solve this problem?
import matplotlib.pylab as plt
from scipy.special import erfc
import numpy as np
from scipy.interpolate import interp1d

# The function to invert (np.sqrt/np.exp replace the long-deprecated scipy.sqrt/scipy.exp):
t = 100
alfa = 1.1*10**(-7)
k = 0.18
T1 = 20
Tpow = 180

def F(h):
    p = erfc(h*np.sqrt(alfa*t)/k)
    return T1 + (Tpow - T1)*(1 - np.exp((h**2*alfa*t)/k**2)*p)

# Interpolation
h_eval = np.linspace(-80, 500, 200)  # critical step: define the discretization grid
F_inverse = interp1d(F(h_eval), h_eval, kind='cubic', bounds_error=True)
# Some random data:
TR = np.array([[130, 100, 130, 130, 130],
[ 90, 101, 100, 120, 90],
[130, 130, 100, 100, 130],
[120, 101, 120, 90, 110],
[110, 130, 130, 110, 130]])
# Compute the array h for a given array TR
h = F_inverse(TR)
print(h)
# Graph to verify the interpolation
plt.plot(h_eval, F(h_eval), '.-', label='discretized F(h)');
plt.plot(h.ravel(), TR.ravel(), 'or', label='interpolated values')
plt.xlabel('h'); plt.ylabel('F(h) or TR'); plt.legend();
Does anyone have an idea how to solve a non-linear, implicit equation in NumPy?
I have the array TR and other values which are included in my equation.
I need to solve it and, as a result, receive a new array with the same shape.
Here is a solution using a 1D interpolation to compute the inverse of the F(h) function. Because no standard root-finding method is used, the error is not controlled, and the discretization grid has to be chosen with care. However, the interpolated inverse function can be computed directly over an array.
Note: the definition of F is modified; the problem is now to solve F(h) = TR for h.
import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pylab as plt
# The function to invert:
t = 10
alfa = 1.1*10**(-7)
k = 0.18
T1 = 20
Tpow = 100
def F(h):
    A = np.exp(h**2*alfa*t/k**2)
    B = h**3*2/(3*np.sqrt(3))*(alfa*t)**(3/2)/k**3
    return -(Tpow - T1)*(1 - A + B)
# Interpolation
h_eval = np.linspace(40, 100, 50) # critical step: define the discretization grid
F_inverse = interp1d( F(h_eval), h_eval, kind='cubic', bounds_error=True )
# Some random data:
TR = np.array([[13, 10, 13, 13, 13],
[ 9, 11, 10, 12, 9],
[13, 13, 10, 10, 13],
[12, 11, 12, 9, 11],
[11, 13, 13, 11, 13]])
# Compute the array h for a given array TR
h = F_inverse(TR)
print(h)
# Graph to verify the interpolation
plt.plot(h_eval, F(h_eval), '.-', label='discretized F(h)');
plt.plot(h.ravel(), TR.ravel(), 'or', label='interpolated values')
plt.xlabel('h'); plt.ylabel('F(h) or TR'); plt.legend();
With the other function, the following lines are changed:
from scipy.special import erf
def F(h):
    return (Tpow - T1)*(1 - np.exp((h**2*alfa*t)/k**2)*(1.0 - erf(h*np.sqrt(alfa*t)/k)))
# Interpolation
h_eval = np.linspace(15, 35, 50) # the range is changed
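If controlled error matters more than speed, a bracketing root finder can be applied element-wise instead of interpolation (a sketch using scipy.optimize.brentq; the bracket [15, 35] is taken from the grid above and must enclose the root for every entry of TR):

from scipy.optimize import brentq
import numpy as np

def invert_F(tr, a=15.0, b=35.0):
    # Solve F(h) = tr with a bracketing method; the error is controlled.
    return brentq(lambda h: F(h) - tr, a, b)

h = np.vectorize(invert_F)(TR)  # same shape as TR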
I have done clustering using KMeans from sklearn. While it has a method to print the centroids, I find it rather bizarre that scikit-learn doesn't have a method to find the cluster length (or I have not seen it so far). Is there a neat way to get the length of each cluster, i.e. how many points are associated with each cluster? I currently have this rather kludgy code to do it, where I find clusters of length one and need to add another point to such a cluster by measuring the Euclidean distance between the points, and then have to update the labels.
import numpy as np
from numpy import genfromtxt
from clustering.clusternew import Kmeans_clu
from evolution.generate import reproduction
from mapping.somnew import mapping, no_of_neurons, neuron_weights_init
from population_creation.population import pop_create
from New_SOL import newsol

data = genfromtxt('iris.csv', delimiter=',', skip_header=0, usecols=range(0, 4))  # read the input data
actual_label = genfromtxt('iris.csv', delimiter=',', dtype=str, skip_header=0, usecols=(4))
chromosome = int(input("Enter the number of chromosomes: "))  # input the population size
max_gen = int(input("Enter the maximum number of generation: "))  # input the maximum number of generations
K, lab, center, population = [], [], [], []  # containers filled in the loop below
for i in range(0, chromosome):
    cluster = 3  # random.randint(2, max_cluster)  # randomly selects a cluster number from 2 to root(population)
    K.insert(i, cluster)  # store the number of clusters in K
    print('value of K is ', K)
    u, label, z1, A1 = Kmeans_clu(cluster, data)
    # print("centers and labels : ", u, label)
    lab.insert(i, label)  # store the labels in lab
    center.insert(i, u)
    new_center = pop_create(max_cluster, features, cluster, u)  # max_cluster and features are defined elsewhere
    population.insert(i, new_center)
    print("Value of population in main\n", population)
newsol(max_gen, population, data)
To the newsol method we pass the new population generated by the code above, and we again run K-Means on the population:
from collections import Counter  # used to count elements per cluster

def ClusterIndicesComp(clustNum, labels_array):  # list comprehension for accessing the features in the iris data set
    return np.array([i for i, x in enumerate(labels_array) if x == clustNum])

def newsol(max_gen, population, data):
    # print('Value of newsol population is', population)
    A1, plab, pcenter = [], [], []  # containers for per-generation centers and labels
    for i in range(max_gen):
        cluster1 = 5
        u, label, t, l = Kmeans_clu(cluster1, population)
        A1.insert(i, t)
        plab.insert(i, label)
        pcenter.insert(i, u)
        k2 = Counter(l.labels_)  # count the number of elements in each cluster
        k1 = [t for (t, v) in k2.items() if v == 1]  # clusters whose length is one
        t1 = np.array(k1)  # iterate through the clusters that have one point associated with them
        for b in range(len(t1)):
            print("Value in newsol of 1-length cluster\n", t1[b])
            plot1 = data[ClusterIndicesComp(t1[b], l.labels_)]
            print("Values in sol of plot1", plot1)
            for q in range(cluster1):
                plot2 = data[ClusterIndicesComp(q, l.labels_)]
                print("Value of plot2 for\n", q, plot2)
                for j in range(len(plot2)):  # get one element at a time from plot2
                    plotk = plot2[j]
                    if [t for (t, v) in k2.items() if v > 2]:  # distance is only calculated if some cluster has more than 2 points
                        S = np.linalg.norm(np.array(plot1) - np.array(plotk))
                        print("Distance between plot1 and plotk is", plot1, plotk, S)  # Euclidean distance
                    else:
                        print("No distance between them\n")
The K-Means routine I use is:
from sklearn.cluster import KMeans
import numpy as np

def Kmeans_clu(K, data):
    # Apply k-means clustering; note that max_iter=1 and n_init=1 give a single, rough pass
    kmeans = KMeans(n_clusters=K, init='random', max_iter=1, n_init=1).fit(data)
    labels = kmeans.labels_
    clu_centres = kmeans.cluster_centers_
    z = {i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)}  # indices of the points in each cluster
    return clu_centres, labels, z, kmeans
For getting the number of instances in each cluster, maybe you can try using Counter:
from collections import Counter, defaultdict
print(Counter(estimator.labels_))
Result:
Counter({0: 62, 1: 50, 2: 38})
where cluster 0 has 62 instances, cluster 1 has 50 instances, and cluster 2 has 38 instances
And to store the indices of the instances of each cluster, you can use defaultdict:
clusters_indices = defaultdict(list)
for index, c in enumerate(estimator.labels_):
    clusters_indices[c].append(index)
Now, to find the indices of the instances in cluster 0, call:
print(clusters_indices[0])
Result:
[50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114, 119, 121, 123, 126, 127, 133, 138, 142, 146, 149]
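As an alternative to Counter, NumPy can give the same per-cluster counts in one call (a small sketch using np.unique):

import numpy as np

labels, counts = np.unique(estimator.labels_, return_counts=True)
print(dict(zip(labels, counts)))  # e.g. {0: 62, 1: 50, 2: 38}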
These are the lines:
from sklearn import tree
X = [[181,80,44], [177, 70, 43], [160, 60, 38], 154, 54, 37],
[166,64,40], [190,90,47], [175,64,39],[177,70,40],[159,55,37],
[171,75,42],[181,85,43]
Y = ['male', 'female', 'female', 'female', 'male', 'male', 'male',
'female', 'male', 'female', 'male']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X,Y)
prediction = clf.predict([[182,78,43]])
print (prediction)
Result:
Traceback (most recent call last):
File "C:\Python\code\test.py", line 14, in <module>
clf = clf.fit(X,Y)
File "C:\Python\lib\site-packages\sklearn\tree\tree.py", line 790, in fit
X_idx_sorted=X_idx_sorted)
File "C:\Python\lib\site-packages\sklearn\tree\tree.py", line 116, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "C:\Python\lib\site-packages\sklearn\utils\validation.py", line 402,
in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
[Finished in 0.5s]
Expected result:
It should display the gender predicted from the body measurements "182,78,43", for example: male or female.
Run with Python 3.6 using sklearn, numpy+mkl and scipy in Sublime.
The code is originally from: https://www.youtube.com/watch?v=T5pRlIbr6gg.
There is no answer to this in the whole YouTube comment section.
I would appreciate it if an answer can be found here; I could not find any.
If that's really your code, the problem starts right at the beginning:
X = [[181,80,44], [177, 70, 43], [160, 60, 38], 154, 54, 37],
[166,64,40], [190,90,47], [175,64,39],[177,70,40],[159,55,37],
[171,75,42],[181,85,43]
is creating something far, far away from being usable:
print(X)
# ([[181, 80, 44], [177, 70, 43], [160, 60, 38], 154, 54, 37],)
So there is one bracket of each orientation to add (if that's not clear to you: read sklearn's docs on the data format, a 2d array of shape (n_samples, n_features); consider also reading some introduction to NumPy, where the word shape comes from; internally everything is numpy-based):
X = [[181,80,44], [177, 70, 43], [160, 60, 38], [154, 54, 37], # before 154
[166,64,40], [190,90,47], [175,64,39],[177,70,40],[159,55,37],
[171,75,42],[181,85,43]] # at end
I have to admit: that's something which should be found immediately, and I can't understand why someone would invest time in creating a post for SO but not in checking the syntax of a simple array creation.
To be fair: at first I thought it would not get through the syntax check (it's really a strange construction, as stated).
Edit: To be fair #2: it's really that bad in the linked video too... I'm not sure what to think of that (well, using a decision tree for this task is probably already nuts; even LinearRegression seems more viable)!
And yes, the code predicts male after the correction has been made as above!
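For completeness, a quick check (a sketch assuming the corrected X from above):

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
print(clf.predict([[182, 78, 43]]))  # ['male']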