I have a random forest feature importance procedure. The feature importance values have been generated for each variable, and I have plotted them on a horizontal bar graph.
Now I would like to sort the bars into ascending / descending order. How do I do it?
My code is as follows:
#Feature Selection (shortlisting key variables)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\z003v0ee\Desktop\TP Course\project module\ProjectDataSetrev4.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\z003v0ee\Desktop\TP Course\project module\ProjectDataSetrev4.xlsx',sheet_name=1)
## Convert date time format and set as index
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
## Save each feature to a list (independent variables)
allvarlist = list()
for each_var in df2.columns:
    allvarlist.append(each_var)
countvar = len(allvarlist)
allvar = df[allvarlist]
allvar = allvar.values.reshape(len(allvar),countvar)
## Define dependent variable
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
## Split into training and test data
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.7)
## Choose a model
clf = RandomForestRegressor(n_estimators=10000, random_state=0, n_jobs=-1)
#print(allvar_train)
#print(target_train)
clf.fit(allvar_train,np.ravel(target_train))
## Show feature importance results
for feature in zip(allvarlist, clf.feature_importances_):
    print(feature)
## Plot feature importance results
importances = clf.feature_importances_
#indices = np.argsort(importances)
plt.figure().set_size_inches(14,16)
plt.barh(range(allvar_train.shape[1]), importances, color="r")
plt.yticks(range(allvar_train.shape[1]),allvarlist)
My graph looks like this.
Updated code that plots the horizontal bar graph:
plt.figure(figsize=(14,16))
df3=pd.DataFrame({'allvarlist':range(countvar),'importances':allvarlist})
df3.sort_values('importances',inplace=True)
df3.plot(kind='barh',y='importances',x='allvarlist',color='r')
It still does not work. The error is TypeError: Empty 'DataFrame': no numeric data to plot.
Any other suggestions, please?
You could do something like this!
Fill allvarlist with your feature names.
plt.figure(figsize=(14,16))
df=pd.DataFrame({'allvarlist':range(5),'importances':np.random.randint(50,size=5)})
df.sort_values('importances',inplace=True)
df.plot(kind='barh',y='importances',x='allvarlist',color='r')
EDIT: in your updated code the 'importances' column holds the feature names (strings), so pandas finds no numeric data to plot; put the actual importance values in that column instead:
plt.figure(figsize=(14,16))
df3=pd.DataFrame({'allvarlist':allvarlist,'importances':clf.feature_importances_})
df3.sort_values('importances',inplace=True)
df3.plot(kind='barh',y='importances',x='allvarlist',color='r')
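If you prefer to keep the plt.barh call from your original code, here is a minimal alternative sketch using np.argsort (it just reuses the importances and allvarlist variables you already have):
indices = np.argsort(importances)                      # ascending order; reverse with [::-1] for descending
plt.figure(figsize=(14,16))
plt.barh(range(len(indices)), importances[indices], color="r")
plt.yticks(range(len(indices)), [allvarlist[i] for i in indices])
plt.show()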
I used the ellipticenvelope method to find the anomalies in the iris dataset as below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris = load_iris()
cols = iris.feature_names
X = pd.DataFrame(iris.data, columns=cols)
X.head()
from sklearn.preprocessing import StandardScaler
from sklearn.covariance import EllipticEnvelope
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # keep the scaled array; fit_transform does not modify X in place
cov = EllipticEnvelope(store_precision=True,
                       assume_centered=True,
                       support_fraction=None,
                       contamination=0.01,
                       random_state=0)
cov.fit(X_scaled)
X['Anomaly'] = cov.predict(X_scaled)
Now you can find the anomalies in the last column with the value -1.
X[X['Anomaly'] == -1]
Now I want to do a root cause analysis to find the source of each anomaly, so I want to plot the anomalies on the boxplot, with red dots for example. Is this possible? If yes, how can I add it?
X.boxplot(column=cols, grid=False, rot=45)
# code to plot anomalies on boxplot
plt.show()
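One possible sketch (an assumption on my part, not from the original post): draw the boxplot, then scatter the anomalous rows as red dots at each feature's box position.
anomalies = X[X['Anomaly'] == -1]
ax = X.boxplot(column=cols, grid=False, rot=45)
for i, col in enumerate(cols, start=1):   # boxplot positions are 1-based, one per feature
    ax.scatter([i] * len(anomalies), anomalies[col], color='red', zorder=3)
plt.show()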
I have a subset of gene expression data with 6 feature columns and no target. Using PCA in sklearn, I could separate the 6 features by extracting the principal axes in feature space. Is it possible to plot a similar figure using KernelPCA, considering that the components_ attribute does not exist in KernelPCA? Here is my code, taken from here with small changes.
It is obvious that using KernelPCA(kernel="linear") should lead to the same results as PCA.
from sklearn.decomposition import PCA,KernelPCA
from sklearn.preprocessing import StandardScaler
from bioinfokit.analys import get_data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = get_data('gexp').data
df_st = StandardScaler().fit_transform(df)
pca_out = PCA().fit(df_st)
loadings = pca_out.components_
fig, ax = plt.subplots(1,2)
zz = []
for i in df.columns.values:
    zz.append(i)
ax[0].scatter(loadings[0],loadings[1])
for i, txt in enumerate(zz):
    ax[0].annotate(txt, (loadings[0][i], loadings[1][i]), fontsize=12)
plt.show()
########################## KernelPCA ###################
kpca=KernelPCA(kernel="linear")
kpca_o=kpca.fit(df_st)
#ax[1].scatter(kpca_o[0,:],kpca_o[1,:])
Use: kpca_o.alphas_
Source: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html
alphas_ : array, shape (n_samples, n_components)
Eigenvectors of the centered kernel matrix. If n_components and remove_zero_eig are not set, then all components are stored.
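A minimal sketch of using that attribute with the figure above (note that alphas_, exposed as eigenvectors_ in recent scikit-learn releases, has shape (n_samples, n_components), so each point is a sample rather than one of the 6 features):
ax[1].scatter(kpca_o.alphas_[:, 0], kpca_o.alphas_[:, 1])   # first two eigenvectors of the centered kernel matrix
plt.show()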
I have two questions about correlation between categorical variables in my dataset for predictive models.
I am using both Cramér's V and Theil's U to double-check the correlation.
I got 1.0 from Cramér's V for two of my variables, but only 0.2 when I used the Theil's U method; I am not sure how to interpret the relationship between the two variables.
Also, for those who are experienced: if I got 0.73 for the correlation of two variables, should I remove one of them from the predictive model?
Thank you so much in advance!
Well, you probably want to convert non-numerics to numerics. I don't think I have seen correlations of non-numerics, but maybe there is something out there. Not sure how it would work, though. If you think about it, how would you apply the formula below to non-numeric data?
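For reference, the standard Pearson correlation (what df.corr() computes by default) is defined on numeric values:
r = Σ(x_i − x̄)(y_i − ȳ) / sqrt( Σ(x_i − x̄)² · Σ(y_i − ȳ)² )
The sums of products of numeric deviations have no direct meaning for unordered category labels, which is why some encoding (or a purpose-built statistic such as Cramér's V) is needed.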
Anyway, here is some sample code for you to experiment with.
FYI: look specifically at 'labelencoder' and 'dfDummies'.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
#%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, auc, roc_curve
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz
df = pd.read_csv('C:\\Users\\ryans\\OneDrive\\Desktop\\mushrooms.csv')
df.columns
df.head(5)
# The data is categorical, so I convert it with LabelEncoder to make it ordinal.
labelencoder=LabelEncoder()
for column in df.columns:
    df[column] = labelencoder.fit_transform(df[column])
#df.describe()
#df=df.drop(["veil-type"],axis=1)
#df_div = pd.melt(df, "class", var_name="Characteristics")
#fig, ax = plt.subplots(figsize=(10,5))
#p = sns.violinplot(ax = ax, x="Characteristics", y="value", hue="class", split = True, data=df_div, inner = 'quartile', palette = 'Set1')
#df_no_class = df.drop(["class"],axis = 1)
#p.set_xticklabels(rotation = 90, labels = list(df_no_class.columns));
#plt.figure()
#pd.Series(df['class']).value_counts().sort_index().plot(kind = 'bar')
#plt.ylabel("Count")
#plt.xlabel("class")
#plt.title('Number of poisonous/edible mushrooms (0=edible, 1=poisonous)');
plt.figure(figsize=(14,12))
sns.heatmap(df.corr(),linewidths=.1,cmap="YlGnBu", annot=True)
plt.yticks(rotation=0);
dfDummies = pd.get_dummies(df)
plt.figure(figsize=(14,12))
sns.heatmap(dfDummies.corr(),linewidths=.1,cmap="YlGnBu", annot=True)
plt.yticks(rotation=0);
See the link below for more info.
http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example
Sample data is from the link below, and the bottom of that page.
https://www.kaggle.com/haimfeld87/analysis-and-classification-of-mushrooms/data
If you find something that's actually based on a method of NOT converting categorical data to numeric data, please do share your findings. I'd like to see that!!
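For what it's worth, Cramér's V itself is computed from the contingency table of two categorical columns, so it needs no numeric encoding (and it gives the same value on the label-encoded frame, since it only depends on the cross-tabulation). A minimal sketch, using two mushroom columns purely as examples:
import scipy.stats as ss

def cramers_v(x, y):
    # Cramér's V: chi-squared statistic of the contingency table, normalized to [0, 1]
    confusion = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    r, k = confusion.shape
    return np.sqrt((chi2 / n) / min(r - 1, k - 1))

print(cramers_v(df['cap-shape'], df['cap-color']))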
I run this code for polynomial regression using sklearn, but my plot is not what I was expecting. As you can see here, I'm not getting a smooth line; it jumps from one point to another. From my understanding I have to sort X, but when I do that, all I get is an empty plot with a linear line.
import operator
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.formula.api as smf
df = pd.read_csv(r'D:\Mall_Customers.csv', usecols = ['Age', 'Annual Income (k$)','Spending Score (1-100)'])
x = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=3, max_iter=100)
y_kmeans= kmeans.fit_predict(x)
mydict = {i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)}
dictlist = []
for key, value in mydict.items():
    temp = [key, value]
    dictlist.append(temp)
df0 = df[df.index.isin(mydict[0].tolist())]
X = df0[['Age', 'Annual Income (k$)']]
Y = df0['Spending Score (1-100)']
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, Y)
y_poly_pred = model.predict(X_poly)
r2 = r2_score(Y,y_poly_pred)
print(r2)
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression(fit_intercept = False))
model.fit(X,Y)
plt.scatter(X.iloc[:, 1], Y, color='red')
plt.plot(X, Y, color='blue')
plt.xlabel('Age. Annual income')
plt.ylabel('Spending Score')
plt.show()
TL;DR: the data is not linearly dependent.
The graph got so messy because you plotted X (the training data) against Y (the actual values), and the data itself is messy and not really linearly dependent, which is what produced this messy graph.
I suggest you:
split the data into train and test sets, then after training the model check the error on the test set, and maybe create two plots: one with the model's predictions on the test data and one with the actual test values;
and change the plot code to this:
plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()
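To get the smooth curve the question asks about, here is a minimal sketch under an assumption of mine: fit and plot against a single feature (Age here), and sort that feature before drawing the line. It reuses df0, PolynomialFeatures, and LinearRegression from the code above.
x_age = df0[['Age']].values
y = df0['Spending Score (1-100)'].values

poly = PolynomialFeatures(degree=2)
single_model = LinearRegression().fit(poly.fit_transform(x_age), y)

order = np.argsort(x_age[:, 0])                      # sort so the line is drawn left to right
x_sorted = x_age[order]                              # keep 2-D shape for the transformer
y_curve = single_model.predict(poly.transform(x_sorted))

plt.scatter(x_age[:, 0], y, color='red')             # raw data
plt.plot(x_sorted[:, 0], y_curve, color='blue')      # smooth fitted curve
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.show()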
I am trying to plot multiple ROC curves on one plot by varying a variable in a cell of a pandas DataFrame.
In a particular row, if the total is above a certain threshold, the row is classified as an invoice. I want to plot different curves for different thresholds of total.
This is the code that I have so far; it measures basic metrics and attempts to plot the ROC curve, but I have been unsuccessful so far.
import os
import pandas as pd
from sklearn import datasets, metrics, model_selection, svm
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv("test_results.csv", header = 0)
true_array = list(df["actual"].to_numpy())
predicted_array = list(df["predicted"].to_numpy())
accuracy = accuracy_score(true_array, predicted_array)
precision, recall, fscore, support = score(true_array, predicted_array, average = None, labels = ['invoice', 'non-invoice'])
print("Labels: \t invoice", "non-invoice")
print('Accuracy: \t {}'.format(accuracy))
print('Precision: \t {}'.format(precision))
print('Recall: \t {}'.format(recall))
print('Fscore: \t {}'.format(fscore))
skplt.metrics.plot_roc_curve(true_array, predicted_array)
plt.show()
The error I am getting is
fpr[i], tpr[i], _ = roc_curve(y_true, probas[:, i],
IndexError: too many indices for array
Any help would be appreciated.
The documentation below mentions that skplt.metrics.plot_roc_curve takes the ground truth (correct) target values and the prediction probabilities for each class returned by a classifier. So you should change the 2nd input, predicted_array, to per-class probabilities.
https://scikit-plot.readthedocs.io/en/stable/metrics.html?highlight=roc#scikitplot.metrics.plot_roc
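A minimal sketch of what that call expects; the probability column name here is hypothetical, so substitute whatever score your classifier produces (e.g. from predict_proba):
# Hypothetical: a per-row probability of the 'invoice' class.
p_invoice = df["invoice_probability"].to_numpy()
probas = np.column_stack([p_invoice, 1 - p_invoice])   # columns ordered like the sorted labels: ['invoice', 'non-invoice']
skplt.metrics.plot_roc(true_array, probas)
plt.show()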