I am a beginner in Python trying to create a 2-component PCA plot using pandas, sklearn.preprocessing, sklearn.decomposition, and matplotlib.pyplot.
My data frame is very large, describing the characteristics of different plant species, with many variables (>100 columns), and I would like to compare the effect of one of the characteristics/columns (stem length) on the variance of the data. The stem length column consists of floats ranging from 0 to around 75 cm.
I would like to plot a PCA comparing the variance of the characteristics when stem length is >40 cm versus <40 cm. However, I have no idea how to proceed with this.
I have been using the following website as a guide for the PCA plot.
I have already written the following code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
df = pd.read_csv("plant_data.csv")
# assuming the features are the numeric columns of df (the step that builds x is not shown in the question)
x = StandardScaler().fit_transform(df.select_dtypes(include='number'))
plt.style.use("seaborn-darkgrid")
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents,
columns = ['principal component 1', 'principal component 2'])
finalDf = pd.concat([principalDf, df[['stem_length']]], axis = 1)
How do I set the conditions for the parameters to be stem_length >40 and stem_length <40?
The dataset in the question's link is the "Iris Dataset". Given that, and your working example with two principal components, you now have finalDf with three features (or dimensions, or columns, in Excel terms).
Now you need to define a grouping feature, which can be achieved as:
finalDf['stem_length_gt_40'] = finalDf['stem_length'].apply(lambda x: 1 if x > 40 else 0)
The code creates another column named stem_length_gt_40 whose value is 1 if stem_length > 40 else 0.
With this in place, you can plot principal component 1 vs. principal component 2 and colour the points by stem_length_gt_40 using seaborn.scatterplot, as below:
import seaborn as sns
import matplotlib.pyplot as plt
# plt.style.use("seaborn-darkgrid")
sns.scatterplot(x='principal component 1', y='principal component 2', data=finalDf, hue='stem_length_gt_40')
You can learn more about sns.scatterplot here.
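If you just want the two subsets themselves rather than a hue column, plain boolean masks on finalDf do the same job. A minimal sketch using matplotlib only (the colours and the <=/> split at exactly 40 cm are my assumptions):
# select the rows of each stem-length group with boolean masks
long_stems = finalDf[finalDf['stem_length'] > 40]
short_stems = finalDf[finalDf['stem_length'] <= 40]
# plot each group in its own colour on the same axes
plt.scatter(long_stems['principal component 1'], long_stems['principal component 2'], c='tab:blue', label='stem_length > 40 cm')
plt.scatter(short_stems['principal component 1'], short_stems['principal component 2'], c='tab:orange', label='stem_length <= 40 cm')
plt.xlabel('principal component 1')
plt.ylabel('principal component 2')
plt.legend()
plt.show()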
I have a dataset with 569 data points, each associated with two features and labelled either 0 or 1. Based on the label, I want to make a scatter plot such that data points with label 0 get a green dot and those with label 1 get a red dot. I am using Spyder (Python 3.9).
Here is my code:
from sklearn.datasets import load_breast_cancer
breast=load_breast_cancer()
#print(breast)
breast_data=breast.data
#print(breast_data)
print(breast_data.shape)
breast_labels = breast.target
#print(breast_labels)
print(breast_labels.shape)
import numpy as np
labels=np.reshape(breast_labels,(569,1))
final_breast_data=np.concatenate([breast_data,labels],axis=1)
#print(final_breast_data)
print(final_breast_data.shape)
import pandas as pd
breast_dataset=pd.DataFrame(final_breast_data)
#print(breast_dataset)
features=breast.feature_names
print(features)
features_labels=np.append(features,'label')
#print(features_labels)
breast_dataset.columns=features_labels
print(breast_dataset.head())
breast_dataset['label'].replace(0,'Benign',inplace=True)
breast_dataset['label'].replace(1,'Malignant',inplace=True)
print(breast_dataset.tail())
from sklearn.preprocessing import StandardScaler
x=breast_dataset.loc[:,features].values
x=StandardScaler().fit_transform(x)
print(x.shape)
print(np.mean(x),np.std(x))
feat_cols=['feature'+str(i) for i in range(x.shape[1]) ]
print(feat_cols)
#print(x.shape[0])
normalised_breast=pd.DataFrame(x,columns=feat_cols)
print(normalised_breast.tail())
from sklearn.decomposition import PCA
pca_breast=PCA(n_components=2)
#print(pca_breast)
principalComponents_breast=pca_breast.fit_transform(x)
#print(principalComponents_breast)
principal_breast_Df=pd.DataFrame(data=principalComponents_breast,columns=['Principal component 1','Principal component 2'])
print(principal_breast_Df.tail())
print('Explained variation per principal component:{}'.format(pca_breast.explained_variance_ratio_))
import pandas as pd
import matplotlib.pyplot as plt
#x=principal_breast_Df['Principal component 1']
#y=principal_breast_Df['Principal component 2']
#plt.scatter(x,y)
#plt.show()
plt.figure()
plt.figure(figsize=(10,10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component 1',fontsize=20)
plt.ylabel('Principal Component 2',fontsize=20)
plt.title("Principal Component Analysis of Breast Cancer Dataset",fontsize=20)
targets=['Benign','Malignant']
colors=['r','g']
for target, color in zip(targets, colors):
    indicesToKeep = breast_dataset['label'] == target
    plt.scatter(principal_breast_Df.loc[indicesToKeep, 'Principal componenet 1'],
                principal_breast_Df.loc[indicesToKeep, 'Principal component 2'],
                c=color, s=50)
plt.legend(targets, prop={'size': 15});
I tried to plot the graph as described above but got a blank graph with no dots! I think I am missing some packages that I should include. Any help would be appreciated.
The graph is blank because the column name in your plt.scatter call is misspelled ('Principal componenet 1' instead of 'Principal component 1'), so the lookup fails and nothing gets drawn on the figure you created. Change the last part to:
for target, color in zip(targets, colors):
    indicesToKeep = breast_dataset['label'] == target
    plt.scatter(principal_breast_Df[indicesToKeep]['Principal component 1'],
                principal_breast_Df[indicesToKeep]['Principal component 2'],
                c=color, s=50)
plt.legend(targets, prop={'size': 15});
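For comparison, the same plot can also be produced with a single seaborn call, which avoids hand-writing the column names in a loop. This is just a sketch reusing the breast_dataset and principal_breast_Df frames from the question (the green/red palette mapping is an assumption matching the colours you asked for):
import seaborn as sns
# copy the PCA scores and attach the label column so seaborn can colour by it
plot_df = principal_breast_Df.copy()
plot_df['label'] = breast_dataset['label'].values
sns.scatterplot(x='Principal component 1', y='Principal component 2', hue='label', palette={'Benign': 'g', 'Malignant': 'r'}, data=plot_df, s=50)
plt.show()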
I have the scikit-learn script below, which outputs a nice chart (shown below) with each of the clusters.
I have a couple of questions:
- How can I export this to CSV - with a cluster name or ID?
- How can I name the clusters?
- How can I make sure the clusters are always named the same thing? For example, I want to call the top-right segment 'high spenders'; how do I do that so it will always be correct?
Thanks!
#import the required libraries
# - matplotlib is a charting library
# - Seaborn builds on top of Matplotlib and introduces additional plot types. It also makes your traditional Matplotlib plots look a bit prettier.
# - Numpy is numerical Python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.datasets import make_blobs  # the old sklearn.datasets.samples_generator module was removed in newer scikit-learn
from sklearn.cluster import KMeans
#Generate sample data, with distinct clusters for testing
#n_samples = the number of datapoints, equally split across each clusters
#centers = The number of centers to generate (number of clusters) - a center is the arithmetic mean of all the points belonging to the cluster.
#cluster_std = the standard deviation of the clusters - a quantity expressing by how much the members of a group differ from the mean value for the group (how tight is the cluster going to be)
#random_state = controls the random number generator being used. If you don't set random_state, a new random value is used every time you execute the code, so the generated data differs from run to run. If you use a fixed value (random_state = 1 or any other value), the result is the same every time.
#make_blobs generates "isotropic Gaussian blobs" - X is a numpy array with two columns which contain the (x, y) Gaussian coordinates of these points, whereas y contains the list of categories for each.
#X, y = simply means that the output of make_blobs() has two elements, that are assigned to X and y.
X, y = make_blobs(n_samples=300, centers=4,
cluster_std=0.50, random_state=0)
#X now looks like this - column 0 becomes the X axis, column 1 becomes the Y axis
array([[ 1.85219907, 1.10411295],
[-1.27582283, 7.76448722],
[ 1.0060939 , 4.43642592],
[-1.20998253, 7.83203579],
[ 1.92461484, 1.06347673],
[ 2.28565919, 0.79166208],
[-1.57379043, 2.69773813],
[ 1.04917913, 4.31668562],
[-1.07436851, 7.93489945],
[-1.15872975, 7.97295642]
#The below statement, will enable us to visualise matplotlib charts, even in ipython
#Using matplotlib backend: MacOSX
#Populating the interactive namespace from numpy and matplotlib
%pylab
#plot the chart
#s = the size of the points.
#X[:, 0] is the numpy coordinates way of selecting every row entry for column 0 - i.e. a single column from the numpy array.
#X[:, 1] is the numpy coordinates way of selecting every row entry for column 1 - i.e. a single column from the numpy array.
plt.scatter(X[:, 0], X[:, 1], s=50);
#now, I am defining that I want to find 4 clusters within the data. The general rule I follow is to have about 7 times fewer clusters than datapoints.
kmeans = KMeans(n_clusters=4)
#build the model, based on X with the number of clusters defined above
kmeans.fit(X)
#now we're going to find clusters in the randomly generated dataset
predict = kmeans.predict(X)
#now we can plot the prediction
#c = colour, which is based on the predict variable we defined above
#s = the size of the plots
#X[:, 0] is the numpy coordinates way of selecting every row entry for column 0 - i.e. a single column from the numpy array.
#X[:, 1] is the numpy coordinates way of selecting every row entry for column 1 - i.e. a single column from the numpy array.
plt.scatter(X[:, 0], X[:, 1], c=predict, s=50)
Based on your code, the following worked for me. You can certainly stay with numpy for storing the CSV, but I simply prefer pandas. The sorting line should give you the same results every time you run the code. However, since the initialization of the clusters can have an impact, I would also set a seed in your code, e.g. np.random.seed(42), and call KMeans with the random_state parameter, e.g. kmeans = KMeans(n_clusters=4, random_state=42).
# transform to dataframe
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(X)
df.columns = ["var1", "var2"]
df["cluster"] = predict
colors = sns.color_palette()[0:4]
df = df.sort_values("cluster")
# check plot
sns.scatterplot(x="var1", y="var2", data=df, hue="cluster", palette=colors)
plt.show()
# define rename schema
mynames = {"0": "center_left", "1": "top_left", "2": "bot_right", "3": "center"}
df["cluster_name"] = [mynames[str(i)] for i in df.cluster]
# plot again to verify order
sns.scatterplot(x="var1", y="var2", data=df, hue="cluster_name", palette=colors)
sns.despine()
plt.show()
# save dataframe as CSV
df.to_csv("myoutput.csv")
The first plot looks like this:
The second plot looks like this:
The CSV will look like this:
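On the third question (always calling the top-right segment 'high spenders'): the raw KMeans label ids can change between runs, so one option is to derive the names from the centroid positions rather than from the ids. A minimal sketch continuing from the kmeans model and df above; the ordering rule (sum of the two coordinates) and the placeholder names are assumptions you would adapt to your real variables:
import numpy as np
# rank the clusters by centroid position; with a sum over both axes the
# cluster whose centroid lies furthest to the top right comes last
centroids = kmeans.cluster_centers_
order = np.argsort(centroids.sum(axis=1))
stable_names = ["low", "mid_low", "mid_high", "high_spenders"]
name_by_label = {int(label): stable_names[rank] for rank, label in enumerate(order)}
df["cluster_name"] = df["cluster"].map(name_by_label)
# export with the stable names attached
df.to_csv("myoutput_named.csv")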
For my evaluation, I wanted to use the PyKalman filter library. I have created a very small time series dataset with three columns, formatted as follows. The full dataset is linked here for reproducibility, since I can't attach a file on Stack Overflow:
http://www.mediafire.com/file/el1tkrdun0j2dk4/testdata.csv/file
time X Y
0.040662 1.041667 1
0.139757 1.760417 2
0.144357 1.190104 1
0.145341 1.047526 1
0.145401 1.011882 1
0.148465 1.002970 1
.... ..... .
I have read the PyKalman documentation and managed to do simple linear filtering using a Kalman filter; here is my code:
import matplotlib.pyplot as plt
from pykalman import KalmanFilter
import numpy as np
import pandas as pd
df = pd.read_csv('testdata.csv')
print(df)
# treat infinite values as missing so dropna() removes them
# (the old pandas option 'use_inf_as_null' has been removed in recent pandas versions)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)
X = df.drop('Y', axis=1)
y = df['Y']
estimated_value= np.array(X)
real_value = np.array(y)
measurements = np.asarray(estimated_value)
kf = KalmanFilter(n_dim_obs=1, n_dim_state=1,
transition_matrices=[1],
observation_matrices=[1],
initial_state_mean=measurements[0,1],
initial_state_covariance=1,
observation_covariance=5,
transition_covariance=1)
state_means, state_covariances = kf.filter(measurements[:,1])
state_std = np.sqrt(state_covariances[:,0])
print (state_std)
print (state_means)
print (state_covariances)
fig, ax = plt.subplots()
ax.margins(x=0, y=0.05)
plt.plot(measurements[:,0], measurements[:,1], '-r', label='Real Value Input')
plt.plot(measurements[:,0], state_means, '-b', label='Kalman-Filter')
plt.legend(loc='best')
ax.set_xlabel("Time")
ax.set_ylabel("Value")
plt.show()
Which gives the following plot as an output
As we can see from the plot and my dataset, my input is non-linear. Therefore, I wanted to use a Kalman filter to see if I can detect and track the drops in the filtered signal (blue in the plot above). But since I am so new to Kalman filters, I have a hard time understanding the mathematical formulation and getting started with the Unscented Kalman Filter. I found a good example of basic PyKalman UKF use, but it doesn't show how to define the percentage of the drop (from the peaks). I would therefore appreciate any help that at least detects how big the drop from the peaks of the filtered signal is (for example, 50% or 80% of the previous peak of the blue line in the plot).
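This is not an answer on the UKF itself, but as a first step towards quantifying the drops you can work directly on the filtered signal in state_means: find its local maxima and minima and compute the relative fall after each peak. A rough sketch building on the code above; the prominence threshold is a made-up value you would need to tune for your data:
from scipy.signal import find_peaks
# the filtered signal and the corresponding time stamps from the code above
signal = state_means[:, 0]
times = measurements[:, 0]
# local maxima of the filtered signal; prominence filters out tiny wiggles
peaks, _ = find_peaks(signal, prominence=0.05)
# local minima, found by searching the negated signal
troughs, _ = find_peaks(-signal, prominence=0.05)
for p in peaks:
    later = troughs[troughs > p]  # troughs that follow this peak
    if len(later) == 0:
        continue
    t = later[0]
    drop_pct = 100.0 * (signal[p] - signal[t]) / signal[p]
    print(f"peak at t={times[p]:.3f} drops {drop_pct:.1f}% by t={times[t]:.3f}")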
Usually when I do dendrograms and heatmaps, I use a distance matrix and do a bunch of SciPy stuff. I want to try out Seaborn, but Seaborn wants my data in rectangular form (rows=samples, cols=attributes), not a distance matrix.
I essentially want to use Seaborn as the backend to compute my dendrogram and tack it on to my heatmap. Is this possible? If not, can this be a feature in the future?
Maybe there are parameters I can adjust so it can take a distance matrix instead of a rectangular matrix?
Here's the usage:
seaborn.clustermap
seaborn.clustermap(data, pivot_kws=None, method='average', metric='euclidean',
z_score=None, standard_scale=None, figsize=None, cbar_kws=None, row_cluster=True,
col_cluster=True, row_linkage=None, col_linkage=None, row_colors=None,
col_colors=None, mask=None, **kwargs)
My code below:
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
DF = pd.DataFrame(X, index = ["iris_%d" % (i) for i in range(X.shape[0])], columns = iris.feature_names)
I don't think my method below is correct, because I'm giving it a precomputed distance matrix and NOT a rectangular data matrix as it requests. There are no examples of how to use a correlation/distance matrix with clustermap; there is https://stanford.edu/~mwaskom/software/seaborn/examples/network_correlations.html, but the ordering there is not clustered since it uses the plain sns.heatmap function.
DF_corr = DF.T.corr()
DF_dism = 1 - DF_corr
sns.clustermap(DF_dism)
You can compute the linkage from the precomputed distance matrix and pass it to clustermap():
import pandas as pd, seaborn as sns
import scipy.spatial as sp, scipy.cluster.hierarchy as hc
from sklearn.datasets import load_iris
sns.set(font="monospace")
iris = load_iris()
X, y = iris.data, iris.target
DF = pd.DataFrame(X, index = ["iris_%d" % (i) for i in range(X.shape[0])], columns = iris.feature_names)
DF_corr = DF.T.corr()
DF_dism = 1 - DF_corr # distance matrix
linkage = hc.linkage(sp.distance.squareform(DF_dism), method='average')
sns.clustermap(DF_dism, row_linkage=linkage, col_linkage=linkage)
For clustermap(distance_matrix) (i.e., without a linkage passed), the linkage is calculated internally from the pairwise distances of the rows and columns of the distance matrix (see the note below for details), instead of using the elements of the distance matrix directly, which is what the correct solution does. As a result, the output is somewhat different from the one in the question:
Note: if no row_linkage is passed to clustermap(), the row linkage is determined internally by treating each row as a "point" (observation) and calculating the pairwise distances between those points, so the row dendrogram reflects row similarity. The same applies to col_linkage, where each column is treated as a point. This explanation should probably be added to the docs. Here is the docs' first example, modified to make the internal linkage calculation explicit:
import seaborn as sns; sns.set()
import scipy.spatial as sp, scipy.cluster.hierarchy as hc
flights = sns.load_dataset("flights")
flights = flights.pivot(index="month", columns="year", values="passengers")  # keyword arguments are required by newer pandas
row_linkage, col_linkage = (hc.linkage(sp.distance.pdist(x), method='average')
for x in (flights.values, flights.values.T))
g = sns.clustermap(flights, row_linkage=row_linkage, col_linkage=col_linkage)
# note: this produces the same plot as "sns.clustermap(flights)", where
# clustermap() calculates the row and column linkages internally
Scikit-learn KMeans and PCA dimensionality reduction
I have a dataset, 2M rows by 7 columns, containing different measurements of home power consumption, with a date for each measurement.
date,
Global_active_power,
Global_reactive_power,
Voltage,
Global_intensity,
Sub_metering_1,
Sub_metering_2,
Sub_metering_3
I put my dataset into a pandas dataframe, selecting all columns except the date column, then perform a train/test split.
import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
data = pd.read_csv('household_power_consumption.txt', delimiter=';')
power_consumption = data.iloc[0:, 2:9].dropna()
pc_toarray = power_consumption.values
hpc_fit, hpc_fit1 = train_test_split(pc_toarray, train_size=.01)
power_consumption.head()
I use PCA dimensionality reduction followed by K-means clustering, and then display the result.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
hpc = PCA(n_components=2).fit_transform(hpc_fit)
k_means = KMeans()
k_means.fit(hpc)
x_min, x_max = hpc[:, 0].min() - 5, hpc[:, 0].max() - 1
y_min, y_max = hpc[:, 1].min(), hpc[:, 1].max() + 5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
extent=(xx.min(), xx.max(), yy.min(), yy.max()),
cmap=plt.cm.Paired,
aspect='auto', origin='lower')
plt.plot(hpc[:, 0], hpc[:, 1], 'k.', markersize=4)
centroids = k_means.cluster_centers_
inert = k_means.inertia_
plt.scatter(centroids[:, 0], centroids[:, 1],
marker='x', s=169, linewidths=3,
color='w', zorder=8)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
Now I would like to find out which rows fell under a given class, and then which dates fell under a given class.
Is there any way to relate the points on the graph to an index in my dataset, after PCA?
Some method I don't know of?
Or is my approach fundamentally flawed?
Any recommendations?
I am fairly new to this field and am trying to read through lots of code; this is a compilation of several examples I've seen documented.
My goal is to classify the data and then get the dates that fall under a class.
Thank You
KMeans().predict(X), docs here:
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters:
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        New data to predict.
Returns:
    labels : array, shape [n_samples,]
        Index of the cluster each sample belongs to.
The problem I see with the code you submitted is the use of train_test_split(), which returns two arrays of random rows from your dataset. This effectively ruins your dataset's order, making it difficult to correlate the labels returned from the KMeans classification with the sequential dates in your dataset.
Here's an example:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
#read data into pandas dataframe
df = pd.read_csv('household_power_consumption.txt', delimiter=';')
#convert merge date and time colums and convert to datetime objects
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df.set_index(pd.DatetimeIndex(df['Datetime']), inplace=True)
df.drop(['Date','Time'], axis=1, inplace=True)
#put last column first
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
df = df.dropna()
#convert the dataframe to an array, dropping the datetime column, which should not be processed
sliced = df.iloc[0:, 1:8].dropna()
hpc = sliced.values
k_means = KMeans()
k_means.fit(hpc)
# array of indexes corresponding to classes around centroids, in the order of your dataset
classified_data = k_means.labels_
#copy dataframe (may be memory intensive but just for illustration)
df_processed = df.copy()
df_processed['Cluster Class'] = pd.Series(classified_data, index=df_processed.index)
Now you can see your results matched with your dataset on the right side.
Now that it's classified, it's up to you to derive meaning.
This is just a good overall example of how it can be used, from start to finish.
To display your results, look at the PCA projection or make other graphs that depend on the cluster class.
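On the original question of relating the plotted points back to rows in the dataset after PCA: since the shuffling step is gone, PCA and KMeans both keep the row order, so the principal components can be written back onto the same datetime-indexed frame as the cluster labels. A small sketch continuing from hpc and df_processed above (the column names pc1/pc2 are my own):
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# project the same rows used for clustering onto two principal components;
# row i of pcs still corresponds to row i (and date i) of df_processed
pcs = PCA(n_components=2).fit_transform(hpc)
df_processed['pc1'] = pcs[:, 0]
df_processed['pc2'] = pcs[:, 1]
# scatter of the PCA projection coloured by cluster class; any point can now
# be traced back to its date through the dataframe index
plt.scatter(df_processed['pc1'], df_processed['pc2'], c=df_processed['Cluster Class'], s=4)
plt.show()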