Barplot grouped by row index - python

Background
I'm reading Introduction to Machine Learning with Python and tried to reproduce the visualization of In[45] in Chapter 2. First, I fitted three LogisticRegression classifiers to the Wisconsin breast cancer dataset using different C parameters. Then, for each classifier, I plotted the coefficient magnitude of each feature.
%matplotlib inline
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot as plt
cancer = load_breast_cancer()
for C, marker in [(0.01, 'o'), (1., '^'), (100., 'v')]:
    logreg = LogisticRegression(C=C).fit(cancer.data, cancer.target)
    plt.plot(logreg.coef_[0], marker, label=f"C={C}")
plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
plt.hlines(0, 0, cancer.data.shape[1])
plt.legend()
I would prefer a bar plot to markers in this case; that is, a grouped bar chart of the coefficient magnitudes. I achieved this with the following workflow.
Step 1: Create a DataFrame holding coefficient magnitudes as a row
%matplotlib inline
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
import pandas as pd
cancer = load_breast_cancer()
df = pd.DataFrame(columns=cancer.feature_names)
for C in [0.01, 1., 100.]:
    logreg = LogisticRegression(C=C).fit(cancer.data, cancer.target)
    df.loc[f"C={C}"] = logreg.coef_[0]
df
Step 2: Convert the DataFrame into a seaborn.barplot-applicable form
import itertools
df_bar = pd.DataFrame(columns=['C', 'Feature', 'Coefficient magnitude'])
for C, feature in itertools.product(df.index, df.columns):
    magnitude = df.at[C, feature]
    # Note: DataFrame.append was removed in pandas 2.0; pd.concat is its replacement.
    df_bar = df_bar.append({'C': C, 'Feature': feature, 'Coefficient magnitude': magnitude},
                           ignore_index=True)
df_bar.head()
Step 3: Plot by seaborn.barplot
from matplotlib import pyplot as plt
import seaborn as sns
plt.figure(figsize=(12,8))
sns.barplot(x='Feature', y='Coefficient magnitude', hue='C', data=df_bar)
plt.xticks(rotation=90)
This yielded the graph I wanted.
Problem
I think Step 2 is tedious. Can I make the bar plot directly from df in Step 1, or build df_bar with a one-liner? Or is there a more elegant workflow to get the bar plot?

Pandas plots grouped barplots column-wise. Hence it should be possible to do
df = df.transpose()
df.plot(kind="bar")
without using seaborn.
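Cosmetic options like figure size and tick rotation can be passed through as usual; a minimal sketch (the particular sizes are arbitrary choices, not requirements):

import matplotlib.pyplot as plt
ax = df.T.plot(kind="bar", figsize=(12, 8))  # rows of df become the legend entries
ax.set_ylabel("Coefficient magnitude")
plt.xticks(rotation=90)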
If the use of seaborn is for whatever reason required, Step 2 from the question could be simplified via pandas.melt:
df_bar = df.reset_index().melt(id_vars=["index"])
sns.barplot(x="variable", y="value", hue="index", data=df_bar)

Related

Extract principal axes in feature space from Kernel PCA in sklearn

I have a subset of gene expression data with 6 feature columns and no target. Using PCA in sklearn, I could separate the 6 features by extracting the principal axes in feature space. Is it possible to plot a similar figure using KernelPCA, considering that the components_ attribute does not exist in KernelPCA? Here is my code, taken from here with small changes.
Obviously, using KernelPCA(kernel="linear") should lead to the same results as PCA.
from sklearn.decomposition import PCA,KernelPCA
from sklearn.preprocessing import StandardScaler
from bioinfokit.analys import get_data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = get_data('gexp').data
df_st = StandardScaler().fit_transform(df)
pca_out = PCA().fit(df_st)
loadings = pca_out.components_
fig, ax = plt.subplots(1,2)
zz = []
for i in df.columns.values:
    zz.append(i)
ax[0].scatter(loadings[0], loadings[1])
for i, txt in enumerate(zz):
    ax[0].annotate(txt, (loadings[0][i], loadings[1][i]), fontsize=12)
plt.show()
########################## KernelPCA ###################
kpca=KernelPCA(kernel="linear")
kpca_o=kpca.fit(df_st)
#ax[1].scatter(kpca_o[0,:],kpca_o[1,:])
Use: kpca_o.alphas_ (renamed to eigenvectors_ in scikit-learn 1.0)
Source: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html
alphas_ : array of shape (n_samples, n_components)
Eigenvectors of the centered kernel matrix. If n_components and remove_zero_eig are not set, then all components are stored.
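To actually reproduce the loadings plot for the linear kernel, the principal axes can be recovered from those kernel eigenvectors. A minimal sketch, reusing kpca_o from the question's code and assuming scikit-learn >= 1.0 (where alphas_/lambdas_ are named eigenvectors_/eigenvalues_); individual axes may come out sign-flipped relative to PCA:

import numpy as np
# For kernel="linear", K = X X^T = U diag(lam) U^T, while PCA gives X = U S V^T,
# so the principal axes satisfy V = X^T U / sqrt(lam).
U = kpca_o.eigenvectors_                      # (n_samples, n_components)
lam = kpca_o.eigenvalues_                     # (n_components,)
kpca_loadings = (df_st.T @ U) / np.sqrt(lam)  # columns are principal axes
ax[1].scatter(kpca_loadings[:, 0], kpca_loadings[:, 1])
for i, txt in enumerate(zz):
    ax[1].annotate(txt, (kpca_loadings[i, 0], kpca_loadings[i, 1]), fontsize=12)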

Trying to plot outliers using DBSCAN

I have never been great with Python plotting concepts, and now I'm still apparently missing something new.
Here is my code.
import pandas as pd
import matplotlib.pyplot as plt
import sys
from numpy import genfromtxt
from sklearn.cluster import DBSCAN
data = pd.read_csv('C:\\Users\\path_here\\wine.csv')
data
# Fit DBSCAN on the 2D feature space
model = DBSCAN(eps=0.9, min_samples=10).fit(data)
# Slice out the two feature columns
array_flavanoids = data.iloc[:, 2]
array_colorintensity = data.iloc[:, 3]
# Scatter plot colored by cluster label
colors = model.labels_
plt.scatter(array_flavanoids, array_colorintensity, c=colors, marker='o')
plt.xlabel('Concentration of flavanoids', fontsize=16)
plt.ylabel('Color intensity', fontsize=16)
plt.title('Concentration of flavanoids vs Color intensity', fontsize=20)
plt.show()
Here is my result. I am expecting the outliers to be in a different color from the non-outliers: one color for outliers and another for everything else. I am just trying to learn the concept in this exercise, following the example from this link:
https://towardsdatascience.com/outlier-detection-python-cd22e6a12098
I am using this data source:
https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
I am testing different data sets.
I got this to work.
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

def dbscan(X, eps, min_samples):
    # Scale the features, then cluster and plot, coloring by cluster label
    ss = StandardScaler()
    X = ss.fit_transform(X)
    db = DBSCAN(eps=eps, min_samples=min_samples)
    y_pred = db.fit_predict(X)
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='Paired')
    plt.title("DBSCAN")

dbscan(data, eps=.5, min_samples=5)
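If you specifically want the noise points in their own color, DBSCAN labels them -1, so you can split on that. A minimal sketch along the same lines (scaling first, then masking on the label):

X = StandardScaler().fit_transform(data)
y_pred = DBSCAN(eps=.5, min_samples=5).fit_predict(X)
noise = y_pred == -1                     # DBSCAN marks outliers with label -1
plt.scatter(X[~noise, 0], X[~noise, 1], c=y_pred[~noise], cmap='Paired')
plt.scatter(X[noise, 0], X[noise, 1], c='red', marker='x', label='outliers')
plt.legend()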
I found this to be a great resource.
https://medium.com/@plog397/functions-to-plot-kmeans-hierarchical-and-dbscan-clustering-c4146ed69744

How to define the quartile range for multiple variables and plot the box plot

How can I plot the outliers with a box plot for the data below?
no,store_id,revenue,profit,state,country
0,101,779183,281257,WD,India
1,101,144829,838451,WD,India
2,101,766465,757565,AL,Japan
The code below goes as far as standardizing the data with StandardScaler (MinMaxScaler would also work). After that, how do I define the quartile range that identifies outliers?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
df = pd.read_csv(r'anomaly.csv', index_col=False)
df1 = pd.get_dummies(data=df)
df2 = StandardScaler().fit_transform(df1)
Box-and-whisker plots display the 25th and 75th percentiles of the data by convention.
These quartiles are calculated automatically from the data you provide.
For example, for the following data:
no,store_id,revenue,profit,state,country
0,101,779183,281257,WD,India
1,101,144829,838451,WD,India
2,101,766465,757565,AL,Japan
2,101,1000000,757565,AL,Italy
You can display the box plot for the revenue column as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
df = pd.read_csv(r'anomaly.csv',index_col=False)
df1 = pd.get_dummies(data=df)
df2 = StandardScaler().fit_transform(df1)
green_diamond = dict(markerfacecolor='g', marker='D')
fig1, ax1 = plt.subplots()
ax1.set_title('Box plot')
ax1.boxplot(df['revenue'], flierprops=green_diamond)
plt.show()
The outlier is displayed as a green diamond.
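To get the same fences numerically rather than visually, you can compute the quartile range yourself; a short sketch using the conventional 1.5 * IQR rule:

q1, q3 = df['revenue'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df['revenue'] < lower) | (df['revenue'] > upper)]
print(outliers)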

Correlation between Categorical variables within a dataset

I have two questions about correlation between categorical variables in my dataset, in the context of predictive models.
I am using both Cramér's V and Theil's U to double-check the correlation. I got 1.0 from Cramér's V for two of my variables, but only 0.2 when I used Theil's U; I am not sure how to interpret the relationship between these two variables.
Also, for those who are experienced: if I got 0.73 for a correlation of two variables, should I remove one of them from the predictive model?
Thank you so much in advance!
Well, you probably want to convert non-numerics to numerics. I don't think I have seen correlations of non-numerics, but maybe there is something out there. Not sure how it would work, though. If you think about it, how would you apply a standard correlation formula to non-numeric data?
Anyway, here is some sample code for you to experiment with.
FYI: look specifically at 'labelencoder' and 'dfDummies'.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
#%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, auc, roc_curve
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz
df = pd.read_csv('C:\\Users\\ryans\\OneDrive\\Desktop\\mushrooms.csv')
df.columns
df.head(5)
# The data is categorical, so convert it with LabelEncoder to ordinal values.
labelencoder = LabelEncoder()
for column in df.columns:
    df[column] = labelencoder.fit_transform(df[column])
#df.describe()
#df=df.drop(["veil-type"],axis=1)
#df_div = pd.melt(df, "class", var_name="Characteristics")
#fig, ax = plt.subplots(figsize=(10,5))
#p = sns.violinplot(ax = ax, x="Characteristics", y="value", hue="class", split = True, data=df_div, inner = 'quartile', palette = 'Set1')
#df_no_class = df.drop(["class"],axis = 1)
#p.set_xticklabels(rotation = 90, labels = list(df_no_class.columns));
#plt.figure()
#pd.Series(df['class']).value_counts().sort_index().plot(kind = 'bar')
#plt.ylabel("Count")
#plt.xlabel("class")
#plt.title('Number of poisonous/edible mushrooms (0=edible, 1=poisonous)');
plt.figure(figsize=(14,12))
sns.heatmap(df.corr(),linewidths=.1,cmap="YlGnBu", annot=True)
plt.yticks(rotation=0);
dfDummies = pd.get_dummies(df)
plt.figure(figsize=(14,12))
sns.heatmap(dfDummies.corr(),linewidths=.1,cmap="YlGnBu", annot=True)
plt.yticks(rotation=0);
See the link below for more info.
http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example
Sample data is from the link below, and the bottom of that page.
https://www.kaggle.com/haimfeld87/analysis-and-classification-of-mushrooms/data
If you find something that's actually based on a method of NOT converting categorical data to numeric data, please do share your findings. I'd like to see that!!
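For what it's worth, Cramér's V can be computed directly on the raw categorical columns via a chi-squared test, with no numeric encoding at all; a minimal sketch (the column names 'cap-shape' and 'class' are assumed from the mushrooms dataset linked above):

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    # Cramér's V between two categorical series, from the chi-squared statistic.
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    r, k = confusion.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

raw = pd.read_csv('mushrooms.csv')  # the un-encoded data
print(cramers_v(raw['cap-shape'], raw['class']))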

Overlay Linear Regression Line on Scatter Plot (iPython Notebook)

from astropy.io import ascii
import matplotlib.pyplot as plt

gh_data = ascii.read('http://dept.astro.lsa.umich.edu/~ericbell/data/GHOSTS/M81/ngc3031-field15.newphoto_radec')
ra = gh_data['col5'][:]
dec = gh_data['col6'][:]
f606 = gh_data['col3'][:]
f814 = gh_data['col4'][:]

plt.plot(f606 - f814, f814, 'bo', alpha=0.15)
plt.axis([-1, 2.5, 27, 23])
plt.xlabel('F606W-F814W')
plt.ylabel('F814W')
plt.title('Field 14')
The data set is imported and organized into different columns. I am trying to overlay a line of best fit (a linear regression) on the scatter plot, but I cannot figure out how. Thanks in advance.
As @rayryeng pointed out, your code just plots the data but doesn't actually compute any regression results to plot. Here's one way of doing it:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.DataFrame({"y": np.arange(1, 11) + np.random.rand(10),
                     "x": np.arange(1, 11) + np.random.rand(10)})
Use statsmodels' OLS method to fit a regression line, and params to extract the coefficient on the single regressor:
beta_1 = sm.OLS(data.y, data.x).fit().params.iloc[0]  # slope; no intercept term is included
Produce a scatterplot and add a regression line:
fig, ax = plt.subplots()
ax.scatter(data.x, data.y)
ax.plot(range(1,11), [i*beta_1 for i in range(1,11)], label = "best fit")
ax.legend(loc="best")
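Alternatively, numpy.polyfit gives a slope and intercept in one call, which maps directly onto the original colour-magnitude data; a minimal sketch, assuming f606 and f814 are loaded as in the question:

import numpy as np
color = f606 - f814
slope, intercept = np.polyfit(color, f814, 1)   # degree-1 least-squares fit
xs = np.linspace(color.min(), color.max(), 100)
plt.plot(color, f814, 'bo', alpha=0.15)
plt.plot(xs, slope * xs + intercept, 'r-', label='best fit')
plt.legend(loc='best')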
