I have a subset of gene expression data with 6 feature columns and no target. Using PCA in scikit-learn, I could separate the 6 features by extracting the principal axes in feature space. Is it possible to plot a similar figure with KernelPCA, given that the components_ attribute does not exist for KernelPCA? Here is my code, taken from here with small changes.
Obviously, KernelPCA(kernel="linear") should lead to the same results as PCA.
from sklearn.decomposition import PCA,KernelPCA
from sklearn.preprocessing import StandardScaler
from bioinfokit.analys import get_data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = get_data('gexp').data
df_st = StandardScaler().fit_transform(df)
pca_out = PCA().fit(df_st)
loadings = pca_out.components_
fig, ax = plt.subplots(1,2)
zz=[]
for i in df.columns.values:
    zz.append(i)
ax[0].scatter(loadings[0],loadings[1])
for i, txt in enumerate(zz):
    ax[0].annotate(zz[i], (loadings[0][i], loadings[1][i]), fontsize=12)
plt.show()
########################## KernelPCA ###################
kpca=KernelPCA(kernel="linear")
kpca_o=kpca.fit(df_st)
#ax[1].scatter(kpca_o[0,:],kpca_o[1,:])
Use kpca_o.alphas_.
Source: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html
alphas_ : array, (n_samples, n_components)
Eigenvectors of the centered kernel matrix. If n_components and remove_zero_eig are not set, then all components are stored.
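A minimal sketch of how those eigenvectors could be used, assuming an older scikit-learn where the attribute is still called alphas_ (newer releases expose the same arrays as eigenvectors_ and eigenvalues_). Note that alphas_ has shape (n_samples, n_components), so it is not the loadings matrix itself; for a linear kernel the feature-space axes can be recovered by projecting the centered data onto the eigenvectors:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA

# fit kernel PCA with a linear kernel on the standardized data; limiting
# n_components avoids dividing by (near-)zero eigenvalues below
kpca = KernelPCA(n_components=2, kernel="linear")
kpca.fit(df_st)

# eigenvectors/eigenvalues of the centered kernel matrix, shape (n_samples, n_components);
# renamed eigenvectors_ / eigenvalues_ in newer scikit-learn releases
alphas = kpca.alphas_
lambdas = kpca.lambdas_

# for a linear kernel the principal axes in feature space are X^T u_k / sqrt(lambda_k),
# which should match pca_out.components_ up to sign
kpca_loadings = (df_st.T @ alphas) / np.sqrt(lambdas)

fig, ax = plt.subplots()
ax.scatter(kpca_loadings[:, 0], kpca_loadings[:, 1])
for i, txt in enumerate(df.columns.values):
    ax.annotate(txt, (kpca_loadings[i, 0], kpca_loadings[i, 1]), fontsize=12)
plt.show()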
I have never been great with Python plotting concepts, and apparently I'm still missing something new.
Here is my code.
import pandas as pd
import matplotlib.pyplot as plt
import sys
from numpy import genfromtxt
from sklearn.cluster import DBSCAN
data = pd.read_csv('C:\\Users\\path_here\\wine.csv')
data
# Reading in 2D Feature Space
model = DBSCAN(eps=0.9, min_samples=10).fit(data)
array_flavanoids = data.iloc[:, 2]
# Slicing array
array_colorintensity = data.iloc[:, 3]
# Scatter plot function
colors = model.labels_
plt.scatter(array_flavanoids, array_colorintensity, c=colors, marker='o')
plt.xlabel('Concentration of flavanoids', fontsize=16)
plt.ylabel('Color intensity', fontsize=16)
plt.title('Concentration of flavanoids vs Color intensity', fontsize=20)
plt.show()
Here is my result.
I am expecting the outliers to be in a different color than the non-outliers. So, something like this.
Maybe one color for outliers and another for non-outliers. I am just trying to learn the concept in this exercise. I am trying to follow the example from this link.
https://towardsdatascience.com/outlier-detection-python-cd22e6a12098
I am using this data source.
https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
I am testing different data sets.
I got this to work.
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

def dbscan(X, eps, min_samples):
    # standardize the features, then cluster and color points by predicted cluster
    ss = StandardScaler()
    X = ss.fit_transform(X)
    db = DBSCAN(eps=eps, min_samples=min_samples)
    y_pred = db.fit_predict(X)
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='Paired')
    plt.title("DBSCAN")

dbscan(data, eps=.5, min_samples=5)
I found this to be a great resource.
https://medium.com/@plog397/functions-to-plot-kmeans-hierarchical-and-dbscan-clustering-c4146ed69744
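For the original goal of showing outliers in a different color: DBSCAN assigns the label -1 to noise points, so one possible approach (a sketch, reusing model, array_flavanoids and array_colorintensity from the question's first snippet) is to split the points on that label:

import matplotlib.pyplot as plt

# DBSCAN labels noise (outliers) as -1, everything else gets a cluster id
outlier_mask = model.labels_ == -1

plt.scatter(array_flavanoids[~outlier_mask], array_colorintensity[~outlier_mask],
            c='steelblue', marker='o', label='clustered')
plt.scatter(array_flavanoids[outlier_mask], array_colorintensity[outlier_mask],
            c='red', marker='x', label='outlier')
plt.xlabel('Concentration of flavanoids', fontsize=16)
plt.ylabel('Color intensity', fontsize=16)
plt.legend()
plt.show()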
I have two questions about correlation between categorical variables in my dataset for predictive models.
I'm using both Cramér's V and Theil's U to double-check the correlation.
I got 1.0 from Cramér's V for two of my variables, but only 0.2 with Theil's U. I'm not sure how to interpret the relationship between the two variables.
Also, for those who are experienced: if I get 0.73 for the correlation of two variables, should I remove one of them from the predictive model?
Thank you so much in advance!
Well, you probably want to convert non-numerics to numerics. I don't think I have seen correlations of non-numerics, but maybe there is something out there. I'm not sure how it would work, though. If you think about it, how would you apply the standard correlation formula to non-numeric data?
Anyway, here is some sample code for you to experiment with.
FYI: look specifically at 'labelencoder' and 'dfDummies'.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
#%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, auc, roc_curve
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz
df = pd.read_csv('C:\\Users\\ryans\\OneDrive\\Desktop\\mushrooms.csv')
df.columns
df.head(5)
# The data is categorical, so convert it with LabelEncoder to ordinal codes.
labelencoder = LabelEncoder()
for column in df.columns:
    df[column] = labelencoder.fit_transform(df[column])
#df.describe()
#df=df.drop(["veil-type"],axis=1)
#df_div = pd.melt(df, "class", var_name="Characteristics")
#fig, ax = plt.subplots(figsize=(10,5))
#p = sns.violinplot(ax = ax, x="Characteristics", y="value", hue="class", split = True, data=df_div, inner = 'quartile', palette = 'Set1')
#df_no_class = df.drop(["class"],axis = 1)
#p.set_xticklabels(rotation = 90, labels = list(df_no_class.columns));
#plt.figure()
#pd.Series(df['class']).value_counts().sort_index().plot(kind = 'bar')
#plt.ylabel("Count")
#plt.xlabel("class")
#plt.title('Number of poisonous/edible mushrooms (0=edible, 1=poisonous)');
plt.figure(figsize=(14,12))
sns.heatmap(df.corr(),linewidths=.1,cmap="YlGnBu", annot=True)
plt.yticks(rotation=0);
dfDummies = pd.get_dummies(df)
plt.figure(figsize=(14,12))
sns.heatmap(dfDummies.corr(),linewidths=.1,cmap="YlGnBu", annot=True)
plt.yticks(rotation=0);
See the link below for more info.
http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example
Sample data is from the link below, and the bottom of that page.
https://www.kaggle.com/haimfeld87/analysis-and-classification-of-mushrooms/data
If you find something that's actually based on a method of NOT converting categorical data to numeric data, please do share your findings. I'd like to see that!!
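For reference, the Cramér's V mentioned in the question is one such measure: it is computed from a contingency table of two categorical columns and needs no numeric encoding. A minimal sketch of the plain (uncorrected) version using scipy, with 'cap-shape' and 'cap-color' used only as example column names from the mushroom data:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Plain Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    confusion = pd.crosstab(x, y)          # contingency table of the two columns
    chi2 = chi2_contingency(confusion)[0]  # chi-squared statistic
    n = confusion.to_numpy().sum()
    r, k = confusion.shape
    return np.sqrt((chi2 / n) / (min(r, k) - 1))

# e.g. association between two of the mushroom columns (hypothetical usage)
# print(cramers_v(df['cap-shape'], df['cap-color']))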
I've been assigned a data set, asked to apply PCA and retain one component, and then to visualize the distribution in a scatter plot that indicates the class of each data point.
For context: the data we're working with has three columns. X is columns 1 and 2, and y is column 3, which contains the class of each data point.
It was implied that the resulting visualization should be a horizontal line, but I'm not seeing that. What I get is a scatter plot that looks like a positive linear trend.
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
df = pd.read_csv("data.csv", header=None)
X = df.iloc[:, 0:2].values
y = df.iloc[:,-1].values
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3,random_state=np.random)
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
pcaObj1 = PCA(n_components=1)
X_train_PCA = pcaObj1.fit_transform(X_train)
X_test_PCA = pcaObj1.transform(X_test)
X_set, y_set = X_test_PCA, y_test
X3 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01))
X3 = np.array(X3)
plt.xlim(X3.min(), X3.max())
plt.ylim(X3.min(), X3.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 0],
                c = ListedColormap(('purple', 'yellow'))(i), label = j)
I see that you have a test set in addition to a training set; however, this is not the usual setup for PCA. PCA has multiple applications, but one of the main ones is dimensionality reduction. Dimensionality reduction is about removing variables, and PCA serves this purpose by changing the basis of your data and ordering the new variables by the amount (or relative amount) of the total variation that they linearly explain. Since this does not require test data, we can think of it as unsupervised machine learning, although many would prefer to call it feature engineering, as it is often used to preprocess data to improve the performance of models trained on that preprocessed data.
Let me generate a random dataset with 10 variables and 1000 entries for the sake of example. Fitting the PCA transform with 1 component, you're selecting a new variable (feature) that is a linear combination of the original variables and that attempts to linearly explain the most variance in the data. As you say, it is a number line; as a quick-and-easy plot, let's use the x-axis as the index into the new variable array and the y-axis as the value of the variable.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
X_train = np.random.random((1000, 10))
y_labels = np.array([0] * 500 + [1] * 500)
pcaObj1 = PCA(n_components=1)
X_PCA = pcaObj1.fit_transform(X_train)
plt.scatter(range(len(y_labels)), X_PCA, c=['red' if i==0 else 'green' for i in y_labels])
plt.show()
You can see this produces a 1000 x 1 array representing your new variable.
>>> X_PCA.shape
(1000, 1)
If you had selected n_components=2 instead, you'd have a 1000 x 2 array with two such variables. Let's see that as an example. This time I'll plot the two principal components against each other instead of plotting a single principal component against its index.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
X_train = np.random.random((1000, 10))
y_labels = np.array([0] * 500 + [1] * 500)
pcaObj1 = PCA(n_components=2)
X_PCA = pcaObj1.fit_transform(X_train)
plt.scatter(X_PCA[:,0], X_PCA[:,1], c=['red' if i==0 else 'green' for i in y_labels])
plt.show()
Now, my randomly generated data may not have the same properties as your data set. If you really expect the output to be a line, then I'd say it certainly isn't guaranteed, as my example produces a very erratic trace. You'll see even in the 2D case that the data doesn't seem structured by class, but that's what you would expect from random data.
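If what was implied by a "horizontal line" is a strip of points spread along the single principal component, one way to get that (a sketch, reusing X_PCA and y_labels from the one-component example above) is to put the component on the x-axis and a constant on the y-axis:

import numpy as np
import matplotlib.pyplot as plt

# one principal component on x, a constant 0 on y: all points lie on a horizontal line
plt.scatter(X_PCA[:, 0], np.zeros(len(X_PCA)),
            c=['red' if i == 0 else 'green' for i in y_labels], alpha=0.5)
plt.yticks([])
plt.xlabel('PC1')
plt.show()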
This example should give some clarity. Make sure you read all the comments so you can follow what's going on.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
from sklearn import datasets

iris = datasets.load_iris()

# put the data into a DataFrame so we can refer to columns by name later
data = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
data['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

X = data.iloc[:, 0:4]  # we take the first four features.
y = iris.target
print(X.sample(5))
print(y[:5])

# see how many samples we have of each species
data["species"].value_counts()

from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array, columns=X.columns)
X_scaled.sample(5)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)
# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
# use seaborn to make a scatter plot showing species for each sample
sns.FacetGrid(data, hue="species", height=4) \
   .map(plt.scatter, "sepal_length", "sepal_width") \
   .add_legend();

# do the same for petals
sns.FacetGrid(data, hue="species", height=4) \
   .map(plt.scatter, "petal_width", "sepal_width") \
   .add_legend();
# if you have a lot of features it can be helpful to do some feature reduction
# to avoid the curse of dimensionality (i.e. needing exponentially more data
# to do accurate predictions as the number of features grows).
# you can do this with Principal Component Analysis (PCA), which remaps the data
# to a new (smaller) coordinate system which tries to account for the
# most information possible.
# you can *also* use PCA to visualize the data by reducing the
# features to 2 dimensions and making a scatterplot.
# it kind of mashes the data down into 2d, so can lose
# information - but in this case it's just going from 4d to 2d,
# so not losing too much info.
# so let's just use it to visualize the data...
# mash the data down into 2 dimensions
from sklearn.decomposition import PCA
ndimensions = 2
seed = 10
pca = PCA(n_components=ndimensions, random_state=seed)
pca.fit(X_scaled)
X_pca_array = pca.transform(X_scaled)
X_pca = pd.DataFrame(X_pca_array, columns=['PC1','PC2']) # PC=principal component
X_pca.sample(5)
# so that gives us new 2d coordinates for each data point.
# at this point, if you don't have labelled data,
# you can add the k-means cluster ids to this table and make a
# colored scatterplot.
# we do actually have labels for the data points, but let's imagine
# we don't, and use the predicted labels to see what the predictions look like.
# first, convert species to an arbitrary number
y_id_array = pd.Categorical(data['species']).codes
df_plot = X_pca.copy()
df_plot['ClusterKmeans'] = y_cluster_kmeans
df_plot['SpeciesId'] = y_id_array # also add actual labels so we can use it in later plots
df_plot.sample(5)
# so now we can make a 2d scatterplot of the clusters
# first define a plot fn
def plotData(df, groupby):
    "make a scatterplot of the first two principal components of the data, colored by the groupby field"

    # make a figure with just one subplot.
    # you can specify multiple subplots in a figure,
    # in which case ax would be an array of axes,
    # but in this case it'll just be a single axis object.
    fig, ax = plt.subplots(figsize=(7, 7))

    # color map
    cmap = mpl.cm.get_cmap('prism')

    # we can use pandas to plot each cluster on the same graph.
    # see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
    for i, cluster in df.groupby(groupby):
        cluster.plot(ax=ax,  # need to pass this so all scatterplots are on the same graph
                     kind='scatter',
                     x='PC1', y='PC2',
                     color=cmap(i / (nclusters - 1)),  # cmap maps a number to a color
                     label="%s %i" % (groupby, i),
                     s=30)  # dot size

    ax.grid()
    ax.axhline(0, color='black')
    ax.axvline(0, color='black')
    ax.set_title("Principal Components Analysis (PCA) of Iris Dataset");
# plot the clusters each datapoint was assigned to
plotData(df_plot, 'ClusterKmeans')
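Since the actual labels were also added above as the 'SpeciesId' column, the same helper can be reused to plot the true species for comparison with the k-means clusters (a sketch):

# plot the true species labels for comparison with the k-means assignments
plotData(df_plot, 'SpeciesId')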
The question topic is a little complex because I need a lot of help, lol. To explain: I have a CSV of data with labels (names) and numerical data...
name,post_count,follower_count,following_count,anonymous_pic,is_private,...
adam,3,997,435,0,0,1,0,0,0,0
bob,2,723,600,0,0,1,0,0,0,0
jill,11,3193,962,0,0,1,0,0,0,0
sara,0,225,298,0,0,1,0,0,0,0
...
and so on. This data is loaded into a pandas DataFrame from the CSV. Now, I wish to pass only the numerical parts of this data into the sklearn.manifold class TSNE (t-distributed stochastic neighbor embedding), which outputs a list the same size as the input data, where each element of the new list is a list of size k (where k is the number of components specified as an argument to the TSNE class). In my case k = 2.
I'm graphing this data on a 2-D scatter plot with matplotlib, and I'd like to be able to inspect the labels on the data. I know matplotlib has an annotate feature with which points can be labeled, but how do I go about separating these labels from the data for TSNE? And if I just separate the labels prior to the transformation, how can I make sure I'm relabeling the right points?
I'd like to be able to inspect these names because I need to see whether the transformation is useful on my data. That way I can analyze a few really bizarre places and see if something interesting is happening. Here is my code, if you find it useful (although I'll admit it's just scratch work).
# Data structuring
import pandas as pd
import numpy as np
# Plotting
import seaborn as sns
import matplotlib.pyplot as plt
sns.set() # for plot styling
# Load data
df = pd.read_csv('user_data.csv')
print(df.head())
# sklearn
from sklearn.mixture import GMM
from sklearn.manifold import TSNE
tsne = TSNE(n_components = 2, init = 'random', random_state = 0)
lab_proj = tsne.fit_transform(df)
x = [i[0] for i in lab_proj]
y = [i[1] for i in lab_proj]
print(len(lab_proj))
df['PCA1'] = x
df['PCA2'] = y
model = GMM(n_components = 1, covariance_type = 'full')
model.fit(df)
y_gmm = model.predict(df)
df['cluster'] = y_gmm
sns.lmplot('PCA1', 'PCA2', data = df, col='cluster', fit_reg = False)
plt.show()
Thanks!
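For what it's worth, one way to keep the names attached, sketched under the assumption that 'name' is the only non-numeric column: fit_transform preserves row order, so the i-th embedded point still corresponds to the i-th row of the DataFrame, and the names column can be used directly for annotation.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

df = pd.read_csv('user_data.csv')

# keep the labels aside and embed only the numeric columns
names = df['name']
numeric = df.drop(columns=['name'])

tsne = TSNE(n_components=2, init='random', random_state=0)
proj = tsne.fit_transform(numeric)

# row order is preserved, so names[i] belongs to proj[i]
plt.scatter(proj[:, 0], proj[:, 1])
for i, txt in enumerate(names):
    plt.annotate(txt, (proj[i, 0], proj[i, 1]), fontsize=8)
plt.show()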
Background
I'm reading Introduction to Machine Learning with Python and tried the visualization of In[45] in Chapter 2. First, I fitted 3 LogisticRegression classifiers to the Wisconsin breast cancer dataset using different C parameters. Then, for each classifier, I plotted the coefficient magnitude of each feature.
%matplotlib inline
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot as plt
cancer = load_breast_cancer()
for C, marker in [(0.01, 'o'), (1., '^'), (100., 'v')]:
    logreg = LogisticRegression(C=C).fit(cancer.data, cancer.target)
    plt.plot(logreg.coef_[0], marker, label=f"C={C}")
plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
plt.hlines(0, 0, cancer.data.shape[1])
plt.legend()
I prefer a bar plot to markers in this case. I'd like to get a graph such as this:
I achieved this with the following workflow.
Step 1: Create a DataFrame holding coefficient magnitudes as a row
%matplotlib inline
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
import pandas as pd
cancer = load_breast_cancer()
df = pd.DataFrame(columns=cancer.feature_names)
for C in [0.01, 1., 100.]:
    logreg = LogisticRegression(C=C).fit(cancer.data, cancer.target)
    df.loc[f"C={C}"] = logreg.coef_[0]
df
Step 2: Convert the DataFrame into a seaborn.barplot-applicable form
import itertools
df_bar = pd.DataFrame(columns=['C', 'Feature', 'Coefficient magnitude'])
for C, feature in itertools.product(df.index, df.columns):
    magnitude = df.at[C, feature]
    df_bar = df_bar.append({'C': C, 'Feature': feature, 'Coefficient magnitude': magnitude},
                           ignore_index=True)
df_bar.head()
Step 3: Plot by seaborn.barplot
from matplotlib import pyplot as plt
import seaborn as sns
plt.figure(figsize=(12,8))
sns.barplot(x='Feature', y='Coefficient magnitude', hue='C', data=df_bar)
plt.xticks(rotation=90)
This yielded the graph I wanted.
Problem
I think Step 2 is tedious. Can I make the bar plot from df in Step 1 directly, or create df_bar with a one-liner? Or is there a more elegant workflow to get the bar plot?
Pandas plots grouped barplots column-wise. Hence it should be possible to do
df = df.transpose()
df.plot(kind="bar")
without using seaborn.
If the use of seaborn is for whatever reason required, Step 2 from the question could probably be simplified via pandas.melt.
df_bar = df.reset_index().melt(id_vars=["index"])
sns.barplot(x="variable", y="value", hue="index", data=df_bar)
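Putting the pandas-only route together (a sketch, assuming df is the untransposed DataFrame built in Step 1 of the question):

import matplotlib.pyplot as plt

# rows of df are the C values, columns are the features; transposing puts
# features on the x-axis with one bar per C value, mirroring the seaborn figure
ax = df.transpose().plot(kind="bar", figsize=(12, 8))
ax.set_ylabel("Coefficient magnitude")
ax.set_xlabel("Feature")
plt.xticks(rotation=90)
plt.show()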