K-means clustering on 3 dimensions with sklearn - python

I'm trying to cluster data using lat/lon as X/Y axes and DaysUntilDueDate as my Z axis. I also want to retain the index column ('PM') so that I can create a schedule later using this clustering analysis. The tutorial I found here has been wonderful but I don't know if it's taking the Z-axis into account, and my poking around hasn't resulted in anything but errors. I think the essential point in the code is the parameters of the iloc bit of this line:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])
I tried changing this part to iloc[1:4] (to only work on columns 1-3) but that resulted in the following error:
ValueError: n_samples=3 should be >= n_clusters=4
So my question is: How can I set up my code to run clustering analysis on 3-dimensions while retaining the index ('PM') column?
Here's my python file, thanks for your help:
from sklearn.cluster import KMeans
import csv
import pandas as pd
# Import csv file with data in following columns:
# [PM (index)] [Longitude] [Latitude] [DaysUntilDueDate]
df = pd.read_csv('point_data_test.csv',index_col=['PM'])
numProjects = len(df)
K = numProjects // 3 # Around three projects can be worked per day
print("Number of projects: ", numProjects)
print("K-clusters: ", K)
for k in range(1, K):
    # Create a kmeans model on our data, using k clusters.
    # random_state helps ensure that the algorithm returns the
    # same results each time.
    kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])
    # These are our fitted labels for clusters --
    # the first cluster has label 0, and the second has label 1.
    labels = kmeans_model.labels_
    # Sum of distances of samples to their closest cluster center
    SSE = kmeans_model.inertia_
    print("k:", k, " SSE:", SSE)

# Add labels to df
df['Labels'] = labels
#print(df)
df.to_csv('test_KMeans_out.csv')

It seems the issue is with the syntax of iloc[1:4].
From your question it appears you changed:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])
to:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[1:4])
It seems to me that either you have a typo or you don't understand how iloc works. So I will explain.
You should start by reading Indexing and Selecting Data from the pandas documentation.
But in short, .iloc is an integer-based indexing method for selecting data by position.
Let's say you have the dataframe:
    A   B   C
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
The use of iloc in the example you provided, iloc[:, :], selects all rows and all columns and so produces the entire dataframe. In case you aren't familiar with Python's slice notation, take a look at the question Explain slice notation or the docs for An Informal Introduction to Python. The example you said caused your error, iloc[1:4], selects the rows at positions 1-3. This would result in:
    A   B   C
1   4   5   6
2   7   8   9
3  10  11  12
Now, if you think about what you are trying to do together with the error you received, you will realize that you selected fewer samples from your data than the number of clusters you are asking for: 3 samples (rows 1, 2, 3) while telling KMeans to find 4 clusters, which just isn't possible.
What you really intended to do (as I understand it) was to select all rows and columns 1-3 that correspond to your lat, lng, and z values. To do this just add a colon as the first argument to iloc like so:
df.iloc[:, 1:4]
Now you will have selected all of your samples and the columns at index 1, 2, and 3. Assuming you have enough samples, KMeans should work as you intended.
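For completeness, here is a minimal sketch of how the corrected selection could slot back into the loop from your file (so the imports and df from above are reused). It selects the three feature columns by name, with the names assumed from the question's CSV header; this sidesteps counting positions once 'PM' has become the index, and it is only a sketch, not the original answer's code:
# Sketch: fit only on the three feature columns, leaving the 'PM' index intact.
feature_cols = ['Longitude', 'Latitude', 'DaysUntilDueDate']  # column names assumed from the question

for k in range(2, K):  # k must be at least 2 for a non-trivial clustering
    kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df[feature_cols])
    print("k:", k, " SSE:", kmeans_model.inertia_)

# Labels from the last fitted model line up with the 'PM' index when written out.
df['Labels'] = kmeans_model.labels_
df.to_csv('test_KMeans_out.csv')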

Related

Why does multioutput XGBoost feature importance give different results using plot_importance or estimators_[0].feature_importances_?

I have a multioutput XGBoost model and am trying to plot the important features for each output. There are 23 outputs.
I have tried to do this in two ways:
Important features as a dataframe:
# Get features for the first output as a numpy array. Can change number [0, 22]
features = multioutputregressor.estimators_[0].feature_importances_
# Convert features to a dataframe indexed by the corresponding feature names
wo_interaction_terms = pd.DataFrame(features, index=list(X_train.columns),
                                    columns=['importance']).sort_values('importance', ascending=False)
Important features as bar plots, in a for loop over all 23 outputs:
f = 0
fig, ax = plt.subplots(5, 5, figsize=(12, 18))
for i in range(5):
    for j in range(5):
        if f < len(multioutputregressor.estimators_):  # only 23 estimators fill the 5x5 grid
            plot_importance(multioutputregressor.estimators_[f], height=0.2,
                            ax=ax[i, j], title=output_cols[f])
        f += 1
fig.tight_layout()
The first approach gives the following result for output 0:
The plot from the second approach generates a different set of important features, and the values are also different from what is shown in the first image.
f22 is not "Lead", f0 is not "Gaseous CO2", and so on.
Questions:
1. plot_importance uses the F score, but what does .estimators_[0].feature_importances_ use as its criterion? The numbers are obviously different.
2. How can I add feature names to the plots? I saw other posts like here, but they don't work for multioutput XGBoost. What are the options in this case?
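One hedged way to compare the two views is to map the generic f0..fN names that plot_importance shows back to the original column names. This is only a sketch, assuming the estimators were fit on a plain numpy array (so the boosters only know generic names) and reusing the question's X_train and multioutputregressor objects:
# Sketch: translate the booster's f0..fN names into the training column names.
# plot_importance defaults to the 'weight' importance type (the F score on the plot),
# while the sklearn wrapper's feature_importances_ can be based on a different
# importance_type (e.g. gain), which would explain the differing numbers.
booster = multioutputregressor.estimators_[0].get_booster()
weight_scores = booster.get_score(importance_type='weight')
named_scores = {X_train.columns[int(name[1:])]: score for name, score in weight_scores.items()}
print(sorted(named_scores.items(), key=lambda kv: kv[1], reverse=True))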

Pandas correlation on just one column containing np arrays

I'm working with a dataframe with a column containing a np.array per row (in this case representing the mean waveform of brain recordings through time). I want to calculate the Pearson correlation of this column (array by array).
This is my code
lenght = len(df.Mean)
Mean = []
for i in range(len(df.Mean)):
    Mean.append(df.Mean[i])
Correlation_p = np.zeros((lenght, lenght))
P_Value_p = np.zeros((lenght, lenght))
for i in range(lenght):
    for j in range(lenght):
        Correlation_p[i][j], P_Value_p[i][j] = stats.pearsonr(df.Mean[i], df.Mean[j])
This works, but I want to know if there is a more pythonic way to do it, maybe using df.corr(). I tried but I failed in how to do it.
EDIT: the output of df.Mean.head()
0 [-0.2559348091247745, 0.02743063113723536, 0.3...
1 [-0.37025615099744325, -0.11299328141596175, 0...
2 [-1.0543681894876467, -0.8452798699354909, -0....
3 [-0.23527437766943646, -0.28657810260136585, -...
4 [0.45557980303095674, 0.6055674269814991, 0.74...
Name: Mean, dtype: object
The arrays that you would like to correlate seem to sit in single cells of the DataFrame, if I am not mistaken. The following brings them into a format where each array occupies a single column.
I made a data example that resembles the format of df.Mean.head():
df = pd.DataFrame({'x':[np.random.randint(0,5,10), np.random.randint(0,5,10), np.random.randint(0,5,10)]})
You can turn these arrays into columns using this:
df = pd.DataFrame(np.array(df['x'].tolist()).transpose())
Adapt this conversion according to your own dimensions.
From there, it would be fairly straightforward.
A correlation matrix can be created by:
df.corr()
A visualization of the correlation matrix:
import matplotlib.pyplot as plt
plt.matshow(df.corr())
plt.show()
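Putting it together with the question's df.Mean column, a minimal sketch could look like the following. It assumes every array has the same length and only reproduces the correlation coefficients, not the p-values from stats.pearsonr:
import numpy as np
import pandas as pd

# Stack the per-row arrays so each waveform becomes one column, then correlate.
waveforms = pd.DataFrame(np.vstack(df['Mean'].to_numpy()).T)
Correlation_p = waveforms.corr(method='pearson').to_numpy()  # same coefficients as the double loop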

Python: how to import my own dataset into the "k-means" algorithm

I want to import my own data (sentences which are located in a .txt file) into this example algorithm, which can be found at: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
The problem is that this code uses a make_blobs dataset, and I have a hard time understanding how to replace it with data from a .txt file.
All I predict is that I need to replace this piece of code right here:
X, y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)  # For reproducibility
Also, I do not understand these variables X, y. I assume that X is an array of data, but what is y?
Should I just assign everything to X like this, and would the example code then work? And what about those make_blobs parameters like centers, n_features, etc.? Do I need to specify them somehow differently?
# open and read from the txt file
path = "C:/Users/user/Desktop/sentences.txt"
file = open(path, 'r')
# assign it to the X
X = file.readlines()
Any help is appreciated!
Firstly, you need to create a mapping from each word to a number that your k-means algorithm can use.
For example:
I  ride  a  bike  and  I  like  it.
1  2     3  4     5    1  6     7     # <- number ids
After that you have a new embedding for your dataset and you can apply k-means. If you want a homogeneous representation for your samples, you should convert them to a one-hot representation: for each word you create an array of length N, where N is the total number of unique words you have, with a 1 at the position corresponding to that word's id and 0 everywhere else.
An example of the above for N = 7 would be:
1 -> 1000000
2 -> 0100000
...
So now you can have an X variable containing your data in a proper format. You don't need y, which holds the corresponding labels for your samples (make_blobs generates them, but k-means does not use them).
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X)
silhouette_avg = silhouette_score(X, cluster_labels)
...
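As a rough sketch of the whole pipeline described above, the following reads sentences.txt, builds the word-to-id mapping, and sums the per-word one-hot vectors into one fixed-length bag-of-words vector per sentence. The summing step and the cluster count are assumptions added here, not part of the original example:
import numpy as np
from sklearn.cluster import KMeans

path = "C:/Users/user/Desktop/sentences.txt"  # path taken from the question
with open(path, 'r') as f:
    sentences = [line.split() for line in f if line.strip()]

# Map every unique word to an integer id (the numbering described above).
vocab = {}
for sentence in sentences:
    for word in sentence:
        vocab.setdefault(word, len(vocab))

# One-hot each word and sum per sentence -> a fixed-length vector per sample.
N = len(vocab)
X = np.zeros((len(sentences), N))
for row, sentence in enumerate(sentences):
    for word in sentence:
        X[row, vocab[word]] += 1

clusterer = KMeans(n_clusters=4, random_state=10)  # n_clusters chosen arbitrarily for this sketch
cluster_labels = clusterer.fit_predict(X)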

Compact way of visualizing heat maps of correlated data

I am trying to visualize the correlation of the Result column with every other column.
        A_B       A_C       B_C  Result
0  0.318182  0.925311  0.860465      91
1 -0.384030  0.991803  0.996344      12
2 -0.818182  0.411765  0.920000      53
3  0.444444  0.978261  0.944444      64
A_B = (A-B)/(A+B), and correspondingly for all the other pairwise columns.
This works for a small number of columns, but if I increase the number of columns the number of rows in the heatmap keeps stacking up. Is there any compact way to represent it?
The following code will reproduce the output:
import pandas as pd
import seaborn as sns

data = {'A': [232, 243, 12, 546, 67, 12, 78, 11, 245],
        'B': [120, 546, 120, 210, 56, 120, 56, 89, 12],
        'C': [9, 1, 5, 6, 7, 43, 7, 12, 64],
        'Result': [91, 12, 53, 64, 71, 436, 74, 123, 641],
        }
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'Result'])

# Responsible for (A-B)/(A+B), (A-C)/(A+C) and similarly
colnames = df.columns.tolist()[:-1]
for i, c in enumerate(colnames):
    if i != len(colnames):
        for k in range(i + 1, len(colnames)):
            df[c + '_' + colnames[k]] = (df[c] - df[colnames[k]]) / (df[c] + df[colnames[k]])
newdf = df[['A_B', 'A_C', 'B_C', 'Result']].copy()

# Plotting A_B, A_C, B_C, ignoring the correlation of Result with itself
plot = pd.DataFrame(newdf.corr().iloc[:-1, -1])
sns.heatmap(plot, annot=True)
A technique which I have heard of, but for which I cannot find any source, is representing each correlation value in mini-rectangles.
According to it, considering the given map as a 3*3 matrix with (0,0) starting from the bottom left, A_B would be represented at (1,1), A_C at (2,1), and B_C at (2,2).
But I am not getting how to do it.
You can plot the correlation of each column against the Result column, and against the other columns as well. Below is one way to do so. Providing the x- and y-ticklabels makes it easier to compare the correlations, and you can also annotate the correlation values so they are displayed on the heat map.
cor = newdf.corr()
sns.heatmap(cor, xticklabels=cor.columns.values,
            yticklabels=cor.columns.values, annot=True)
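The compact "mini-rectangle" layout sketched in the question can also be built by hand: put each pairwise ratio's correlation with Result into one cell of a small grid indexed by the base columns. The sketch below assumes the A_B-style naming from the question's code, and the lower-triangle placement is just one possible convention:
import numpy as np

# Rough sketch: one cell per base-column pair, holding that ratio's correlation with Result.
base_cols = ['A', 'B', 'C']  # base column names taken from the question's data
corr_with_result = newdf.corr()['Result'].drop('Result')

grid = pd.DataFrame(np.nan, index=base_cols, columns=base_cols)
for pair_name, r in corr_with_result.items():
    a, b = pair_name.split('_')   # e.g. 'A_B' -> ('A', 'B')
    grid.loc[b, a] = r            # lower-triangle cell for the (a, b) ratio

sns.heatmap(grid, annot=True, mask=grid.isna())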

Why does Pandas qcut give me unequal sized bins?

Pandas docs have this to say about the qcut function:
Discretize variable into equal-sized buckets based on rank or based on sample quantiles.
So I would expect this code to give me 4 bins of 10 values each:
import numpy as np
import pandas as pd
np.random.seed(4242)
y = pd.Series(np.random.randint(low=1, high=10, size=40))
quartiles = pd.qcut(y, 4, labels=['1st', '2nd', '3rd', '4th'])
print('Quartiles:')
print(quartiles.value_counts(sort=False))
y.groupby(quartiles).agg(['count', 'mean']).plot(kind='bar');
But instead I get this:
Quartiles:
1st 14
2nd 6
3rd 11
4th 9
dtype: int64
What am I doing wrong here?
This happens because of 'boundary-line' cases, i.e. records with tied values that could fall into either of two neighbouring quartiles. A simple adjustment to your code will produce the desired result:
import numpy as np
import pandas as pd
np.random.seed(4242)
y = pd.Series(np.random.randint(low=1, high=10, size=40))
quartiles = pd.qcut(y.rank(method = 'first'), 4, labels=['1st', '2nd', '3rd', '4th'])
print('Quartiles:')
print(quartiles.value_counts(sort=False))
y.groupby(quartiles).agg(['count', 'mean']).plot(kind='bar');
By ranking the values first, we give pandas an unambiguous way to handle records that would otherwise straddle two bins. In this case I've used method='first' as the argument to rank(), which gives tied values distinct ranks in order of appearance, so every record falls into exactly one quartile.
The output I get is as follows:
Quartiles:
1st 10
2nd 10
3rd 10
4th 10
dtype: int64
Looking at the boundaries of the bins highlights the underlying problem.
boundaries = [1, 2, 3.5, 6, 9]
These boundaries are correct. Pandas first computes the quantile values (inside qcut) and only afterwards puts the samples into the bins; the run of 2s overlaps the boundary of the first quartile, which is why that bin ends up larger.
The third boundary is 3.5 because the value just below the threshold is a 3 and the value just above it is a 4; pandas' quantile function places the boundary halfway between the two neighbouring values.
Concluding: a concept like quantiles becomes more and more appropriate as the number of samples grows, so that more values are available to fix the boundaries.
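A quick way to inspect those boundaries yourself is qcut's retbins flag; this small sketch just reruns the question's setup (the exact edges depend on the seed, e.g. [1, 2, 3.5, 6, 9] as quoted above):
import numpy as np
import pandas as pd

np.random.seed(4242)
y = pd.Series(np.random.randint(low=1, high=10, size=40))

# retbins=True returns the computed bin edges alongside the binned values.
quartiles, boundaries = pd.qcut(y, 4, retbins=True)
print(boundaries)                     # the quartile edges qcut computed
print(y.value_counts().sort_index())  # shows the tied values sitting on those edges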
