Estimation of number of Clusters via gap statistics and prediction strength - python

I am trying to translate the R implementations of the gap statistic and prediction strength (http://edchedch.wordpress.com/2011/03/19/counting-clusters/) into Python scripts to estimate the number of clusters in the iris data, which has 3 clusters. Instead of getting 3, I get different results on different runs, and 3 (the actual number of clusters) is hardly ever estimated. The graph shows the estimated number to be 10 instead of 3. Am I missing something? Can anyone help me locate the problem?
import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
def dispersion (data, k):
    if k == 1:
        cluster_mean = np.mean(data, axis=0)
        distances_from_mean = np.sum((data - cluster_mean)**2,axis=1)
        dispersion_val = np.log(sum(distances_from_mean))
    else:
        k_means_model_ = KMeans(n_clusters=k, max_iter=50, n_init=5).fit(data)
        distances_from_mean = range(k)
        for i in range(k):
            distances_from_mean[i] = int()
            for idx, label in enumerate(k_means_model_.labels_):
                if i == label:
                    distances_from_mean[i] += sum((data[idx] - k_means_model_.cluster_centers_[i])**2)
        dispersion_val = np.log(sum(distances_from_mean))
    return dispersion_val

def reference_dispersion(data, num_clusters, num_reference_bootstraps):
    dispersions = [dispersion(generate_uniform_points(data), num_clusters) for i in range(num_reference_bootstraps)]
    mean_dispersion = np.mean(dispersions)
    stddev_dispersion = float(np.std(dispersions)) / np.sqrt(1. + 1. / num_reference_bootstraps)
    return mean_dispersion

def generate_uniform_points(data):
    mins = np.argmin(data, axis=0)
    maxs = np.argmax(data, axis=0)
    num_dimensions = data.shape[1]
    num_datapoints = data.shape[0]
    reference_data_set = np.zeros((num_datapoints,num_dimensions))
    for i in range(num_datapoints):
        for j in range(num_dimensions):
            reference_data_set[i][j] = random.uniform(data[mins[j]][j],data[maxs[j]][j])
    return reference_data_set

def gap_statistic (data, nthCluster, referenceDatasets):
    actual_dispersion = dispersion(data, nthCluster)
    ref_dispersion = reference_dispersion(data, nthCluster, num_reference_bootstraps)
    return actual_dispersion, ref_dispersion

if __name__ == "__main__":
    data=np.loadtxt('iris.mat', delimiter=',', dtype=float)
    maxClusters = 10
    num_reference_bootstraps = 10
    dispersion_values = np.zeros((maxClusters,2))
    for cluster in range(1, maxClusters+1):
        dispersion_values_actual,dispersion_values_reference = gap_statistic(data, cluster, num_reference_bootstraps)
        dispersion_values[cluster-1][0] = dispersion_values_actual
        dispersion_values[cluster-1][1] = dispersion_values_reference
    gaps = dispersion_values[:,1] - dispersion_values[:,0]
    print gaps
    print "The estimated number of clusters is ", range(maxClusters)[np.argmax(gaps)]+1
    plt.plot(range(len(gaps)), gaps)
    plt.show()

Your graph is showing the correct value of 3. Let me explain.
As you increase the number of clusters, the distance metric will always decrease, which is why you are reading 10 as the correct value. If you went beyond 10, the metric would decrease further. That should not be the decision criterion.
We need to find the inflection point (marked in red here): the point where the slope flattens out. You might want to take a look at elbow curves.
Based on the two points above, the inflection point is 3 (which is also the correct solution).
Hope this helps.
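The idea of looking for the bend rather than the extreme value can be sketched numerically. The snippet below is a minimal illustration and not part of the original answer: `elbow_index` is a hypothetical helper, the curve values are made up, and taking the largest absolute second difference is only one heuristic for locating a bend.

```python
import numpy as np

def elbow_index(values):
    """Return the index where a curve bends most sharply (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    if values.size < 3:
        raise ValueError("need at least three points to locate a bend")
    # Discrete second difference: a large magnitude means the slope changes abruptly.
    second_diff = values[:-2] - 2.0 * values[1:-1] + values[2:]
    # second_diff[0] describes the bend at index 1, hence the +1 offset.
    return int(np.argmax(np.abs(second_diff))) + 1

# Made-up dispersion values for k = 1..7, for illustration only.
ks = list(range(1, 8))
curve = [100.0, 60.0, 20.0, 18.0, 17.0, 16.5, 16.0]
best_k = ks[elbow_index(curve)]
print("bend located at k =", best_k)
```

The heuristic is sensitive to noise, so on real data it is worth averaging the curve over several runs before looking for the bend.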

You could take a look at this code and adapt the format of your output plot:
# coding: utf-8
# K-means clustering implementation in Python
# Load the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import datasets
# Load the Iris dataset
iris = datasets.load_iris()
# Put the Iris data into pandas DataFrames
x = pd.DataFrame(iris.data)
x.columns = ['Sepal_Length','Sepal_width','Petal_Length','Petal_width']
y = pd.DataFrame(iris.target)
y.columns = ['Targets']
# Create a K-Means object that groups the data into 3 clusters
model=KMeans(n_clusters=3)
# Fit the model to the Iris data
model.fit(x)
# Plot the raw data
plt.scatter(x.Petal_Length, x.Petal_width)
plt.show()
colormap=np.array(['Red','green','blue'])
# Plot the unaltered data, with the flowers coloured by their true labels
plt.scatter(x.Petal_Length, x.Petal_width,c=colormap[y.Targets],s=40)
plt.title('True classes')
plt.show()
# Plot the clusters formed by K-Means
plt.scatter(x.Petal_Length, x.Petal_width,c=colormap[model.labels_],s=40)
plt.title('K-means clusters')
plt.show()

Related

Run Different Scikit-learn Clustering Algorithms on Dataset

I have a dataframe like below. The shape is (24,7)
Name    x1    x2    x3    x4    x5    x6
Harry   102   204   0.43  0.21  1.02  0.39
James   242   500   0.31  0.11  0.03  0.73
...
Mike    3555  4002  0.12  0.03  0.52  0.11
Henry   532   643   0.01  0.02  0.33  0.10
I want to run scikit-learn's clustering algorithm comparison script on the above dataframe. However, the input data in that example looks quite confusing, and I am not sure how to feed in my dataframe:
https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py
There are two main differences between your scenario and the scikit-learn example you link to:
1. You only have one dataset, not several different ones to compare.
2. You have six features, not just two.
Point one allows you to simplify the example code by deleting the loops over the different datasets and related calculations. Point two implies that you cannot easily plot your results. Instead, you could just add the predicted class labels found by each algorithm to your dataset.
So you could modify the example code like this:
import time
import warnings
import numpy as np
import pandas as pd
from sklearn import cluster, datasets, mixture
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler
from itertools import cycle, islice
np.random.seed(0)
# ============
# Introduce your dataset
# ============
my_df = ...  # Insert your data here, as a pandas dataframe.
features = [f'x{i}' for i in range(1, 7)]
X = my_df[features].values
# ============
# Set up cluster parameters
# ============
params = {
    "quantile": 0.3,
    "eps": 0.3,
    "damping": 0.9,
    "preference": -200,
    "n_neighbors": 3,
    "n_clusters": 3,
    "min_samples": 7,
    "xi": 0.05,
    "min_cluster_size": 0.1,
}
# normalize dataset for easier parameter selection
X = StandardScaler().fit_transform(X)
# estimate bandwidth for mean shift
bandwidth = max(cluster.estimate_bandwidth(X, quantile=params["quantile"]),
                0.001)  # arbitrary correction to avoid 0
# connectivity matrix for structured Ward
connectivity = kneighbors_graph(
    X, n_neighbors=params["n_neighbors"], include_self=False
)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
# ============
# Create cluster objects
# ============
ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
two_means = cluster.MiniBatchKMeans(n_clusters=params["n_clusters"])
ward = cluster.AgglomerativeClustering(
    n_clusters=params["n_clusters"], linkage="ward", connectivity=connectivity
)
spectral = cluster.SpectralClustering(
    n_clusters=params["n_clusters"],
    eigen_solver="arpack",
    affinity="nearest_neighbors",
)
dbscan = cluster.DBSCAN(eps=params["eps"])
optics = cluster.OPTICS(
    min_samples=params["min_samples"],
    xi=params["xi"],
    min_cluster_size=params["min_cluster_size"],
)
affinity_propagation = cluster.AffinityPropagation(
    damping=params["damping"], preference=params["preference"], random_state=0
)
average_linkage = cluster.AgglomerativeClustering(
    linkage="average",
    affinity="cityblock",
    n_clusters=params["n_clusters"],
    connectivity=connectivity,
)
birch = cluster.Birch(n_clusters=params["n_clusters"])
gmm = mixture.GaussianMixture(
    n_components=params["n_clusters"], covariance_type="full"
)
clustering_algorithms = (
    ("MiniBatch\nKMeans", two_means),
    ("Affinity\nPropagation", affinity_propagation),
    ("MeanShift", ms),
    ("Spectral\nClustering", spectral),
    ("Ward", ward),
    ("Agglomerative\nClustering", average_linkage),
    ("DBSCAN", dbscan),
    ("OPTICS", optics),
    ("BIRCH", birch),
    ("Gaussian\nMixture", gmm),
)
for name, algorithm in clustering_algorithms:
    t0 = time.time()
    # catch warnings related to kneighbors_graph
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="the number of connected components of the "
            + "connectivity matrix is [0-9]{1,2}"
            + " > 1. Completing it to avoid stopping the tree early.",
            category=UserWarning,
        )
        warnings.filterwarnings(
            "ignore",
            message="Graph is not fully connected, spectral embedding"
            + " may not work as expected.",
            category=UserWarning,
        )
        algorithm.fit(X)
    t1 = time.time()
    if hasattr(algorithm, "labels_"):
        y_pred = algorithm.labels_.astype(int)
    else:
        y_pred = algorithm.predict(X)
    # Add cluster labels to the dataset
    my_df[name] = y_pred
PS: in the functions below, replace data = X_data.iloc[:20000] with your own X.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn import cluster, decomposition, metrics, mixture, preprocessing
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler
# One row per fitted model, so the algorithms can be compared afterwards
comp_model = pd.DataFrame(columns=['Model', 'Score_Silhouette',
                                   'num_clusters', 'size_clusters',
                                   'parameters'])
K-Means:
def k_means(X_data, nb_clusters, model_comp):
    ks = nb_clusters
    inertias = []
    data = X_data.iloc[:20000]
    X = data.values
    X_scaled = preprocessing.StandardScaler().fit_transform(X)
    for num_clusters in ks:
        # Create a KMeans instance with k clusters: model
        model = KMeans(n_clusters=num_clusters, n_init=1)
        # Fit model to samples
        model.fit(X_scaled)
        # Append the inertia to the list of inertias
        inertias.append(model.inertia_)
        silh = metrics.silhouette_score(X_scaled, model.labels_)
        # Counting the amount of data in each cluster
        taille_clusters = Counter(model.labels_)
        data = [{'Model': 'kMeans',
                 'Score_Silhouette': silh,
                 'num_clusters': num_clusters,
                 'size_clusters': taille_clusters,
                 'parameters': 'nb_clusters :'+str(num_clusters)}]
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        model_comp = pd.concat([model_comp, pd.DataFrame(data)],
                               ignore_index=True, sort=False)
    # Plot ks vs inertias
    plt.plot(ks, inertias, '-o')
    plt.xlabel('number of clusters, k')
    plt.ylabel('inertia')
    plt.xticks(ks)
    plt.show()
    return model_comp

# np.arange replaces pd.np.arange, which is no longer available in pandas 2.0
comp_model = k_means(X_data=df,
                     nb_clusters=np.arange(2, 11, 1),
                     model_comp=comp_model)
DBSCAN:
def dbscan_grid_search(X_data, model_comp, eps_space=0.5,
                       min_samples_space=5, min_clust=0, max_clust=10):
    data = X_data.iloc[:20000]
    X = data.values
    X_scaled = preprocessing.StandardScaler().fit_transform(X)
    # Starting a tally of total iterations
    n_iterations = 0
    # Looping over each combination of hyperparameters
    for eps_val in eps_space:
        for samples_val in min_samples_space:
            dbscan_grid = DBSCAN(eps=eps_val,
                                 min_samples=samples_val)
            # Fit the model and get the cluster label of every sample
            clusters = dbscan_grid.fit_predict(X=X_scaled)
            # Counting the amount of data in each cluster
            cluster_count = Counter(clusters)
            # The label -1 marks noise points, so it is not counted as a cluster
            n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
            # Increasing the iteration tally with each run of the loop
            n_iterations += 1
            # Recording the result each time the n_clusters criterion is met
            if n_clusters >= min_clust and n_clusters <= max_clust:
                silh = metrics.silhouette_score(X_scaled, clusters)
                data = [{'Model': 'Dbscan',
                         'Score_Silhouette': silh,
                         'num_clusters': n_clusters,
                         'size_clusters': cluster_count,
                         'parameters': 'eps :'+str(eps_val)+'+ samples_val :'+str(samples_val)}]
                # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
                model_comp = pd.concat([model_comp, pd.DataFrame(data)],
                                       ignore_index=True, sort=False)
    return model_comp

comp_model = dbscan_grid_search(X_data=df,
                                model_comp=comp_model,
                                eps_space=np.arange(0.1, 5, 0.6),
                                min_samples_space=np.arange(1, 30, 3),
                                min_clust=2,
                                max_clust=10)
GMM:
def gmm(X_data, nb_clusters, model_comp):
    ks = nb_clusters
    data = X_data.iloc[:20000]
    X = data.values
    X_scaled = preprocessing.StandardScaler().fit_transform(X)
    for num_clusters in ks:
        # Create a Gaussian mixture with num_clusters components
        gmm_model = mixture.GaussianMixture(n_components=num_clusters)
        # Fit model to samples
        gmm_model.fit(X_scaled)
        pred = gmm_model.predict(X_scaled)
        cluster_count = Counter(pred)
        silh = metrics.silhouette_score(X_scaled, pred)
        data = [{'Model': 'GMM',
                 'Score_Silhouette': silh,
                 'num_clusters': num_clusters,
                 'size_clusters': cluster_count,
                 'parameters': 'nb_clusters :'+str(num_clusters)}]
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        model_comp = pd.concat([model_comp, pd.DataFrame(data)],
                               ignore_index=True, sort=False)
    return model_comp

comp_model = gmm(X_data=df,
                 nb_clusters=np.arange(2, 11, 1),
                 model_comp=comp_model)
At the end, comp_model will contain the results of all the algorithms. Here I am using three algorithms; from these you can select the best fit for your data based on the silhouette score and the number of clusters.
You should then check how the points are distributed across the clusters:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py
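The selection step described above can be sketched without any clustering library. This is only an illustration under stated assumptions: the records are made-up stand-ins for rows of comp_model, and `smallest_share` and the `MIN_SHARE` threshold are hypothetical choices, not something the answer prescribes.

```python
from collections import Counter

# Made-up records shaped like the rows stored in comp_model (illustrative values only).
results = [
    {"Model": "kMeans", "Score_Silhouette": 0.41, "num_clusters": 3,
     "size_clusters": Counter({0: 50, 1: 48, 2: 52})},
    {"Model": "Dbscan", "Score_Silhouette": 0.55, "num_clusters": 2,
     "size_clusters": Counter({0: 148, -1: 2})},  # -1 is DBSCAN's noise label
    {"Model": "GMM", "Score_Silhouette": 0.47, "num_clusters": 3,
     "size_clusters": Counter({0: 55, 1: 45, 2: 50})},
]

def smallest_share(record):
    """Fraction of all points that falls in the smallest group (hypothetical helper)."""
    sizes = record["size_clusters"]
    return min(sizes.values()) / sum(sizes.values())

# Assumed threshold: ignore results where one group holds under 5% of the points.
MIN_SHARE = 0.05
balanced = [r for r in results if smallest_share(r) >= MIN_SHARE]

# Among the balanced results, keep the one with the highest silhouette score.
best = max(balanced, key=lambda r: r["Score_Silhouette"])
print(best["Model"], best["Score_Silhouette"], dict(best["size_clusters"]))
```

Filtering on cluster sizes first is a design choice: a high silhouette score can come from a result that puts almost everything in one cluster, which is rarely what is wanted.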

How to implement kmeans clustering as a feature for classification techniques in SVM?

I've already created a clustering and saved the model, but I'm confused about what to do with this model and how to use it as a feature for classification.
The clustering is based on the coordinates of crime locations. Once the data has been clustered, I want to use the cluster model as a feature in an SVM.
import pandas as pd
import matplotlib.pyplot as plt
import random
import numpy as np
import xlrd
import pickle
import tkinter as tk
from tkinter import *
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
#kmeans section
#Creating and labelling latitudes of X and Y and plotting it
data=pd.read_excel("sanfrancisco.xlsx")
x1=data['X']
y1=data['Y']
X = np.array(list(zip(x1,y1)))
# Elbow method
from sklearn.cluster import KMeans
wcss = [] # empty list
# try 1 to 10 clusters
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, init='k-means++') # k-means++ picks the initial centroids
    kmeans.fit(X)
    wcss.append(kmeans.inertia_) # within-cluster sum of squared distances
plot1 = plt.figure(1)
plt.xlabel("Number of Clusters")
plt.ylabel("Euclidean Distance")
plt.plot(range(1,11), wcss)
k = 3
# data visual section.. Eg: how many crimes in diff month, most number of crime in a day in a week
# most number crime in what address, most number of crimes in what city, how many crime occur
# in how much time. , etc..
# X coordinates of random centroids
C_x = np.random.randint(0, np.max(X)-20, size=k)
# Y coordinates of random centroids
C_y = np.random.randint(0, np.max(X)-20, size=k)
C = np.array(list(zip(C_x,C_y)), dtype=np.float32)
print("Initial Centroids")
print(C)
# n_clusters is the number of clusters; init='random' picks random data points as the initial centroids
# by default scikit-learn repeats the initialisation (10 times, per the original comment) and keeps the best run; n_init=1 avoids that
model = KMeans(n_clusters=k, init='random', n_init=1)
model.fit_transform(X)
centroids = model.cluster_centers_ # final centroids
rgb_colors = {0.: 'y',
              1.: 'c',
              2.: 'fuchsia',
              }
if k == 4:
    rgb_colors[3.] = 'lime'
if k == 6:
    rgb_colors[3.] = 'lime'
    rgb_colors[4.] = 'orange'
    rgb_colors[5.] = 'tomato'
new_labels = pd.Series(model.labels_.astype(float)) # label that predicted by kmeans
plot2 = plt.figure(2)
plt.scatter(x1, y1, c=new_labels.map(rgb_colors), s=20)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', c='black', s=200 )
plt.xlabel('Final Cluster Centers\n Iteration Count=' +str(model.n_iter_)+
'\n Objective Function Value: ' +str(model.inertia_))
plt.ylabel('y')
plt.title("k-Means")
plt.show()
# save the model to disk
filename = 'clusteredmatrix.sav'
pickle.dump(model, open(filename,'wb'))
Your problem is not very clear, but if you want to see how the clusters behave, I recommend using a tool like Weka, so that you can cluster the data freely and draw meaningful inferences before getting into complex code.
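Not part of the answer above, but one common way to do what the question describes is to turn each point's cluster assignment into extra input columns for the classifier. The sketch below uses only NumPy and rests on assumptions: `centroids` stands in for model.cluster_centers_, the coordinates are made up, and `cluster_features` is a hypothetical helper. For a fitted KMeans model, model.predict(X) assigns each point to its nearest centroid in the same way.

```python
import numpy as np

def cluster_features(X, centroids):
    """Append one-hot cluster membership columns to X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Squared Euclidean distance from every point to every centroid.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # K-means assigns each point to its nearest centroid.
    labels = d2.argmin(axis=1)
    # One-hot encode the label so the classifier does not read it as an ordered number.
    one_hot = np.eye(len(centroids))[labels]
    return labels, np.hstack([X, one_hot])

# Made-up coordinates and centroids, for illustration only.
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.0, 0.0]])
centroids = np.array([[0.1, 0.05], [5.1, 5.0], [9.0, 0.0]])
labels, features = cluster_features(X, centroids)
print(labels)
print(features.shape)
```

The resulting matrix could then be passed to a classifier such as scikit-learn's SVC, together with whatever target labels the classification uses; those targets are not shown in the question.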

Build a correlation circle with Python - Error ValueError: could not broadcast input array from shape (3) into shape (28)

I'm trying to build a correlation circle. Basically, it measures the extent to which each variable is correlated with the principal components (dimensions) of a dataset.
Something like this :
Here is my code :
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
#load the data
X = pd.read_excel("mortalitePaysUE.xlsx",sheet_name=0,header=0,index_col=0)
#number of observations
n = X.shape[0]
#number of variables
p = X.shape[0]
print(p)
#transformation - centring/scaling
sc = StandardScaler()
Z = sc.fit_transform(X)
print(Z)
print("-------------")
#mean
print("Moyenne : ")
print(np.mean(X,axis=0))
print("-------------")
#standard deviation
print("Ecart type : ")
print(np.std(X,axis=1,ddof=0))
print("-------------")
#PCA
acp = PCA(svd_solver='full')
coord = acp.fit_transform(Z)
eigval = (n-1)/n*acp.explained_variance_
print(eigval)
#scree plot
#plt.plot(np.arange(1,p+1),eigval)
#plt.title("Décès en 1990 selon le genre")
#plt.xlabel("Numéro de facteur")
#plt.ylabel("Valeur propre")
#plt.show()
#position of the observations in the first factorial plane
fig, axes = plt.subplots(figsize=(12,12))
axes.set_xlim(-6,6) #same limits on the x-axis
axes.set_ylim(-6,6) #and on the y-axis
#place the observation labels
for i in range(n):
    plt.annotate(X.index[i],(coord[i,0],coord[i,1]))
#add the axes
plt.plot([-6,6],[0,0],color='silver',linestyle='-',linewidth=1)
plt.plot([0,0],[-6,6],color='silver',linestyle='-',linewidth=1)
#display
plt.show()
#square root of the eigenvalues
sqrt_eigval = np.sqrt(eigval)
#correlation of the variables with the axes
corvar = np.zeros((p,p))
for k in range(p):
    corvar[:,k] = acp.components_[k,:] * sqrt_eigval[k]
#display the variables x factors correlation matrix
#print(corvar)
#correlation circle
fig, axes = plt.subplots(figsize=(8,8))
axes.set_xlim(-1,1)
axes.set_ylim(-1,1)
#display the labels (variable names)
for j in range(p):
    plt.annotate(X.columns[j],(corvar[j,0],corvar[j,1]))
#add the axes
plt.plot([-1,1],[0,0],color='silver',linestyle='-',linewidth=1)
plt.plot([0,0],[-1,1],color='silver',linestyle='-',linewidth=1)
#add a circle
cercle = plt.Circle((0,0),1,color='blue',fill=False)
axes.add_artist(cercle)
The problem is that I get an error, so I can't display the circle, and I can't work out how to resolve it:
corvar[:,k] = acp.components_[k,:] * sqrt_eigval[k]
ValueError: could not broadcast input array from shape (3) into shape (28)
Can anyone help me fix this, please? :) Thanks in advance!
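Judging from the shapes in the error message, a likely cause is that p is taken from X.shape[0], which counts rows (28 observations), whereas the number of variables is X.shape[1] (3 here). corvar is then allocated as 28 x 28 while each row of acp.components_ has only 3 entries. The sketch below illustrates the shapes with NumPy only; it assumes random data in place of the Excel file and uses an eigen-decomposition of the correlation matrix as a stand-in for scikit-learn's PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(28, 3))   # made-up stand-in: 28 observations, 3 variables
n, p = X.shape                 # rows are observations, columns are variables

# Centre and scale each column (population standard deviation, as StandardScaler does).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the correlation matrix, sorted by decreasing eigenvalue.
R = Z.T @ Z / n
eigval, eigvec = np.linalg.eigh(R)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# Column k holds the correlations of every variable with component k:
# eigenvector k scaled by the square root of eigenvalue k.
corvar = eigvec * np.sqrt(eigval)
print(corvar.shape)

# Each variable's squared correlations across all components sum to 1,
# which is why the points fall on or inside the unit circle.
print((corvar ** 2).sum(axis=1))
```

With p equal to the number of columns, the loop in the question would fill a 3 x 3 matrix and the broadcast error should not occur.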

Python Curve fit, gaussian

I am trying to gauss fit my data using scipy and curve fit, here is my code :
import csv
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
A=[]
T=[]
seuil=1000
range_gauss=4
a=0
pos_peaks=[]
amp_peaks=[]
A_gauss=[]
T_gauss=[]
new_A=[]
new_T=[]
def gauss(x,a,x0,sigma):
    return a*np.exp(-(x-x0)**2/(2*sigma**2))

with open("classeur_test.csv",'r') as csvfile:
    reader=csv.reader(csvfile, delimiter=',')
    for row in reader :
        A.append(float(row[0]))
        T.append(float(row[1]))
npA=np.array(A)
npT=np.array(T)
for i in range(1,len(T)):
    #PEAK DETECTION
    if (A[i]>A[i-1] and A[i]>A[i+1]) and A[i]>seuil:
        pos_peaks.append(i)
        amp_peaks.append(A[i])
        #GAUSSIAN RANGE
        for j in range(-range_gauss,range_gauss):
            #MIND THE ARRAY BOUNDS
            if(i+j>0 and i+j<len(T)-1):
                A_gauss.append(A[i+j])
                T_gauss.append(T[i+j])
npA_gauss = np.array(A_gauss)
npT_gauss = np.array(T_gauss)
for i in range (0,7):
    new_A.append(npA_gauss[i])
    new_T.append(npT_gauss[i])
new_npA=np.array(new_A)
new_npT=np.array(new_T)
n = 2*range_gauss
mean = sum(new_npT*new_npA)/n
sigma = sum(new_npA*(new_npT-mean)**2)/n
popt,pcov = curve_fit(gauss,new_npT,new_npA,p0=[1,mean,sigma])
plt.plot(T,A,'b+:',label='data')
plt.plot(new_npT,gauss(new_npT,*popt),'ro:',label='Fit')
print ("new_npA : ",new_npA)
print ("new_npT : ",new_npT)
plt.legend()
plt.title('Fit')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
My arrays new_npT and new_npA are numpy arrays like this :
new_npA : [ 264. 478. 733. 1402. 1337. 698. 320.]
new_npT : [229.609344 231.619385 233.62944 235.639496 237.649536 239.659592
241.669647]
This is the result
I don't understand why I can't successfully plot the gauss curves...
Any explanations?
I can now fit Gaussian curves to my data.
I still don't understand how Jannick chose the p0 for the curve fit, but it works.
I built a 3-dimensional array holding the positions and amplitudes of the peaks and used a while loop over range_gauss. I then called scipy's curve_fit on that 3D array and rescaled the amplitudes with a coefficient f.
import csv
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
seuil=1000 # threshold; should be computed from the noise level, etc.
range_gauss=4
A=[]
T=[]
pos_peaks=[]
amp_peaks=[]
indices_peaks=[]
tab_popt=[]
l=[]
gauss_result=[]
tab_w=[]

def gauss1(x,a,x0,sigma):
    return a*np.exp(-(x-x0)**2/(2*sigma**2))

def gauss2(x,a,x0,sigma):
    return (a/sigma*np.sqrt(2*np.pi))*np.exp(-0.5*((x-x0)/sigma)**2)

#READ THE FILE AND FILL THE ARRAYS HOLDING ALL THE VALUES
with open("classeur_test.csv",'r') as csvfile:
    reader=csv.reader(csvfile, delimiter=',')
    for row in reader :
        A.append(float(row[0]))
        T.append(float(row[1]))
#PEAK DETECTION
for i in range(1,len(T)):
    if (A[i]>A[i-1] and A[i]>A[i+1]) and A[i]>seuil:
        pos_peaks.append(T[i])
        amp_peaks.append(A[i])
        indices_peaks.append(i)
#3D ARRAY WITH THE AMPLITUDES AND TIMES OF EACH PEAK
Tableau=np.zeros((len(pos_peaks),2,2*range_gauss+1))
#FOR EACH PEAK
m=0
j=-range_gauss
for i in range(0,len(pos_peaks)):
    while(j<range_gauss+1):
        #PEAK DETECTION & LIMITS CONSIDERATION
        if(pos_peaks[i]+j>=0 and pos_peaks[i]+j<=T[len(T)-1] and m<=2*range_gauss+1 and indices_peaks[i]+j>=0):
            Tableau[i,0,m]=(A[indices_peaks[i]+j])
            Tableau[i,1,m]=(T[indices_peaks[i]+j])
            m=m+1
            j=j+1
        else :
            j=j+1
            print("else")
            print("1 : ",pos_peaks[i]+j,", m : ",m," , indices_peaks[i]+j : ",indices_peaks[i]+j)
    m=0
    j=-range_gauss
    popt,pcov = curve_fit(gauss2,Tableau[i,1,:],Tableau[i,0,:],p0=[[1400,240,10]])
    tab_popt.append(popt)
    l.append(np.linspace(T[indices_peaks[i]-range_gauss],T[indices_peaks[i]+range_gauss],50))
    gauss_result.append(gauss2(l[i],1,tab_popt[i][1],tab_popt[i][2])*(1))
    # rescale the fitted curve so that its maximum matches the peak amplitude
    f= amp_peaks[i]/max(gauss_result[i])
    gauss_result[i]=gauss_result[i]*f
    #FULL WIDTH AT HALF MAXIMUM
    w=2*np.sqrt(2*np.log(2))*tab_popt[i][2]
    tab_w.append(w)
    ####################################PLOTS
    plt.subplot(2,1,1)
    plt.plot(T,A,label='data')
    plt.axis([T[0]-5,T[len(T)-1]-10,0,max(A)+200])
    #plt.plot(Tableau[i,1,:],gauss2(Tableau[i,1,:],*popt),'ro:',label='fit')
    plt.subplot(2,1,2)
    plt.plot(l[i],gauss_result[i])
    plt.axis([T[0]-5,T[len(T)-1]-10,0,max(A)+200])
'''INFLECTION POINT TEST
for j in range(0,len(A)-1):
    inflex_points.append((np.diff(np.diff(A[j],n=2),n=2)))
    print(inflex_points[j])
    for k in range(0,len(inflex_points[j])-1):
        if (inflex_points[j][k] < 1 and inflex_points[j][k] > -1):
            print("j : ",j)'''
'''GRADIENT TEST FOUND ONLINE ???
plt.plot(np.gradient(gauss_result[0]), '+')
spl = UnivariateSpline(np.arange(len(gauss_result[0])), np.gradient(gauss_result[0]), k=5)
spl.set_smoothing_factor(1000)
plt.plot(spl(np.arange(len(gauss_result[0]))), label='Smooth Fct 1e3')
spl.set_smoothing_factor(10000)
plt.plot(spl(np.arange(len(gauss_result[0]))), label='Smooth Fct 1e4')
plt.legend(loc='lower left')
max_idx = np.argmax(spl(np.arange(len(gauss_result[0]))))
plt.vlines(max_idx, -5, 9, linewidth=5, alpha=0.3)
'''
plt.show()
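On the open point of where starting values such as p0 can come from: one common approach is to estimate them from the data with weighted moments. The sketch below is an assumption-laden illustration rather than the method used in the answer; `initial_guess` is a hypothetical helper, and the arrays are the new_npA and new_npT values printed in the question. The code in the question divides by n instead of the sum of the amplitudes and passes a variance where the model expects a standard deviation, which likely gave curve_fit a starting point too far from the peak.

```python
import numpy as np

# Values printed in the question (new_npA and new_npT).
A = np.array([264., 478., 733., 1402., 1337., 698., 320.])
T = np.array([229.609344, 231.619385, 233.62944, 235.639496,
              237.649536, 239.659592, 241.669647])

def initial_guess(x, y):
    """Moment-based starting values [a, x0, sigma] for a Gaussian fit (hypothetical helper)."""
    a0 = y.max()                                # peak height
    x0 = np.sum(x * y) / np.sum(y)              # amplitude-weighted mean position
    sigma0 = np.sqrt(np.sum(y * (x - x0) ** 2) / np.sum(y))  # weighted standard deviation
    return [a0, x0, sigma0]

p0 = initial_guess(T, A)
print(p0)
```

These values could then be passed as p0 to scipy.optimize.curve_fit with the gauss function from the question.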

ValueError in Random forest (Python)

I am trying to perform a Random Forest analysis in Python. Everything seems OK but, when I try to run the code, I get the following error message:
Did any of you get this ValueError?
Cheers
Dataset: https://www.dropbox.com/s/ehyccl8kubazs8x/test.csv?dl=0&preview=test.csv
Code:
from sklearn.ensemble import RandomForestRegressor as RF
import numpy as np
import pylab as pl
headers = file("test.csv").readline().strip().split('\r')[0].split(',')[1:]
data = np.loadtxt("test.csv", delimiter=',', skiprows=1, usecols = range(1,14))
#yellow==PAR, green==VPD, blue== Tsoil and orange==Tair
PAR = data[:,headers.index("PAR")]
VPD = data[:,headers.index("VPD")]
Tsoil= data[:,headers.index("Tsoil")]
Tair = data[:,headers.index("Tair")]
drivers = np.column_stack([PAR,VPD,Tsoil,Tair])
hour = data[:,-1].astype("int")
#performs a random forest hour-wise to explain each NEE, GPP and Reco fluxes
importances = np.zeros([24,2,3,4])
for ff,flux in enumerate(["NEE_f","GPP_f","Reco"]):
    fid = headers.index(flux)
    obs = data[:,fid]
    #store importances: dim are average/std; obs var; expl var
    for hh in range(24):
        mask = hour == hh
        forest = RF(n_estimators=1000)
        forest.fit(drivers[mask],obs[mask])
        importances[hh,0,ff] = forest.feature_importances_
        importances[hh,1,ff] = np.std([tree.feature_importances_ for tree in forest.estimators_],axis=0)
fig = pl.figure('importances',figsize=(15,5));fig.clf()
xx=range(24)
colors = ["#F0E442","#009E73","#56B4E9","#E69F00"];labels= ['PAR','VPD','Tsoil','Tair']
for ff,flux in enumerate(["NEE_f","GPP_f","Reco"]):
    ax = fig.add_subplot(1,3,ff+1)
    for vv in range(drivers.shape[1]):
        ax.fill_between(xx,importances[:,0,ff,vv]+importances[:,1,ff,vv],importances[:,0,ff,vv]-importances[:,1,ff,vv],color=colors[vv],alpha=.35,edgecolor="none")
        ax.plot(xx,importances[:,0,ff,vv],color=colors[vv],ls='-',lw=2,label = labels[vv])
    ax.set_title(flux);ax.set_xlim(0,23)
    if ff == 0:
        ax.legend(ncol=2,fontsize='medium',loc='upper center')
fig.show()
fig.savefig('importance-hourly.png')
The problem was that I selected the column where years are stored, not where hours are. Therefore the RF was trained on empty arrays.
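A small guard before the fitting loop can catch this kind of column mix-up early. The following is a minimal sketch under assumptions: `check_hour_column` is a hypothetical helper, and the sample arrays are made up to mimic an hour column and a year column.

```python
import numpy as np

def check_hour_column(hour):
    """Fail early if a column does not look like hour-of-day (hypothetical helper)."""
    hour = np.asarray(hour)
    # Anything outside 0..23 suggests the wrong column was selected.
    unexpected = np.setdiff1d(np.unique(hour), np.arange(24))
    if unexpected.size:
        raise ValueError("not an hour-of-day column, found values such as %s"
                         % unexpected[:5].tolist())
    # Every hour needs at least one sample, otherwise the mask for it is empty.
    counts = np.bincount(hour.astype(int), minlength=24)
    empty = np.flatnonzero(counts == 0)
    if empty.size:
        raise ValueError("no samples for hours %s" % empty.tolist())
    return counts

# A well-formed hour column: three samples for each hour of the day.
counts = check_hour_column(np.tile(np.arange(24), 3))
print(counts)

# A year column, as in the mistake described above, is rejected.
try:
    check_hour_column(np.array([2010, 2011, 2012]))
except ValueError as err:
    print("rejected:", err)
```

In the script from the question, such a check could be applied to hour right after it is extracted from data.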
