Swap leafs of Python scipy's dendrogram/linkage - python
I generated a dendrogram plot for my dataset and I am not happy with how the splits at some levels have been ordered. I am thus looking for a way to swap the two branches (or leaves) of a single split.
If we look at the code and dendrogram plot at the bottom, there are two labels, 11 and 25, split away from the rest of the big cluster. I am really unhappy about this and would like the branch with 11 and 25 to be the right branch of the split and the rest of the cluster to be the left branch. The shown distances would stay the same, so the data would not change, just the aesthetics.
Can this be done? And how? I am specifically looking for a manual intervention because the optimal leaf ordering algorithm supposedly does not work in this case.
import numpy as np
import matplotlib.pyplot as plt
# random data set with two clusters
np.random.seed(65) # for repeatability of this tutorial
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[20,])
X = np.concatenate((a, b),)
# create linkage and plot dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, 'ward')
plt.figure(figsize=(15, 5))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
Z,
leaf_rotation=90., # rotates the x axis labels
leaf_font_size=12., # font size for the x axis labels
)
plt.show()
I had a similar problem and solved it by using the optimal_ordering option in linkage. I attach the code and result for your case, which might not be exactly what you want but seems much improved to me.
import numpy as np
import matplotlib.pyplot as plt
# random data set with two clusters
np.random.seed(65) # for repeatability of this tutorial
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[20,])
X = np.concatenate((a, b),)
# create linkage and plot dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, 'ward', optimal_ordering = True)
plt.figure(figsize=(15, 5))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
Z,
leaf_rotation=90., # rotates the x axis labels
leaf_font_size=12., # font size for the x axis labels
distance_sort=False,
show_leaf_counts=True,
count_sort=False
)
plt.show()
result of using optimal_ordering in linkage
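For the manual swap the question asks about, one possible approach is to edit the linkage matrix directly. This is a minimal sketch, assuming dendrogram is called with its default count_sort=False and distance_sort=False, in which case the first two columns of each row of Z decide which child is drawn on the left and which on the right. The helper names swap_children and find_split_row are hypothetical, and the small data set is made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage


def swap_children(Z, row):
    """Return a copy of Z with the two children of merge `row` swapped."""
    Z_swapped = Z.copy()
    Z_swapped[row, [0, 1]] = Z_swapped[row, [1, 0]]
    return Z_swapped


def find_split_row(Z, leaf_ids):
    """Return the row of Z that has a child made up of exactly `leaf_ids`."""
    n = Z.shape[0] + 1
    members = {i: {i} for i in range(n)}  # cluster id -> set of leaf ids
    target = set(leaf_ids)
    for row in range(Z.shape[0]):
        a, b = int(Z[row, 0]), int(Z[row, 1])
        if members[a] == target or members[b] == target:
            return row
        members[n + row] = members[a] | members[b]
    raise ValueError("no cluster consists of exactly these leaves")


# Illustrative data: two well-separated groups
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, size=(4, 2)),
                    rng.normal(10, 1, size=(6, 2))])
Z = linkage(X, 'ward')

root = len(Z) - 1  # the last row is the top-level split
leaves_before = dendrogram(Z, no_plot=True)['leaves']
leaves_after = dendrogram(swap_children(Z, root), no_plot=True)['leaves']
print(leaves_before)
print(leaves_after)
```

For the data in the question, something like find_split_row(Z, [11, 25]) should give the row to pass to swap_children, provided 11 and 25 really form one cluster on their own. Only the left/right order changes; the merge distances in the third column are untouched.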
Related
Create a Seaborn style histogram / kernel density plot using the actual density function
I really like the look of Seaborn's KDE plot. I was wondering how I can replicate this for a line plot. In my case I actually have the function to generate the density instead of samples of the data. So assume I have the data in a data frame:
x - The value of x per sample.
y - The value of the density function at x.
μσ - Categorical variable to group data from the same density (in the code, I use the mean and standard deviation of a normal distribution).
I can use Seaborn's lineplot to get what I want without the area below the curve as in the image above. I'm after achieving the look as above for the data I have. Is there a way to replicate this theme, area under the curve included, for lineplot? The code below shows what I got so far:
import numpy as np
import scipy as sp
import pandas as pd
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
num_grid_pts = 1000
val_μ = [0, -1, 1, 0]
val_σ = [1, 2, 3, 4]
num_var = len(val_μ) # variations
x = np.linspace(-10, 10, num_grid_pts)
P = np.zeros((num_grid_pts, num_var)) # PDF
μσ = [f'μ = {μ}, σ = {σ}' for μ, σ in zip(val_μ, val_σ)]
for ii, (μ, σ) in enumerate(zip(val_μ, val_σ)):
    randVar = norm(μ, σ)
    P[:, ii] = randVar.pdf(x)
df_P = pd.DataFrame(data = {'x': np.tile(x, num_var), 'PDF': P.flatten('F'), 'μσ': np.repeat(μσ, len(x))})
f, ax = plt.subplots(figsize=(15, 10))
sns.lineplot(data=df_P, x='x', y='PDF', hue='μσ', ax=ax)
plot_lines = ax.get_lines()
for ii in range(num_var):
    ax.fill_between(x=plot_lines[ii].get_xdata(), y1=plot_lines[ii].get_ydata(), alpha=0.25, color=plot_lines[ii].get_color())
ax.set_title(f'Normal Distribution')
ax.set_xlabel(f'Value')
ax.set_ylabel(f'Probability')
plt.show()
I used the lineplot to create the lines and then created the fills. But this is a hack; I was wondering if I can do it more naturally within Seaborn.
I found a way to do this using the Area mark of the seaborn.objects interface:
(
    so.Plot(healthexp, "Year", "Spending_USD", color="Country")
    .add(so.Area(alpha=.7), so.Stack())
)
Yet for some reason the example code doesn't work. What I did instead was use Seaborn's lineplot() and then manually add a fill_between() polygon for each line:
ax = sns.lineplot(data=data_frame, x='data_x', y='data_y', hue='data_color')
plot_lines = ax.get_lines()
for i in range(num_unique_colors):
    ax.fill_between(x=plot_lines[i].get_xdata(), y1=plot_lines[i].get_ydata(), alpha=0.25, color=plot_lines[i].get_color())
KMeans clustering from all possible combinations of 2 columns not producing correct output
I have a 4 column dataframe which I extracted from the iris dataset. I use kmeans to plot 3 clusters from all possible combinations of 2 columns. However, there seems to be something wrong with the output, especially since the cluster centers are not placed at the center of the clusters. I have provided examples of the output. Only cluster_1 seems OK, but the other 3 look completely wrong. How best can I fix my clustering? This is the sample code I am using:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import itertools
df = pd.read_csv('iris.csv')
df_columns = ['column_a', 'column_b', 'column_c', 'column_d']
n_clusters=3
kmeans = KMeans(n_clusters=n_clusters, init = 'k-means++', max_iter=200)
kmeans = kmeans.fit(df)
centroids = kmeans.cluster_centers_
cluster_labels = kmeans.labels_
for i in itertools.combinations(df_columns, 2):
    fig, ax = plt.subplots(figsize=(12, 8))
    fig=plt.figure()
    ax.scatter(df[i[0]].values, df[i[1]].values, c=cluster_labels , cmap='viridis', edgecolor='k', s=20, alpha = 0.5)
    ax.scatter(centroids[:, 0], centroids[:, 1],s = 20, c = 'black', marker='*')
    plt.show()
Dataset used:
column_a,column_b,column_c,column_d
5.1,3.5,1.4,0.2 4.9,3.0,1.4,0.2 4.7,3.2,1.3,0.2 4.6,3.1,1.5,0.2 5.0,3.6,1.4,0.2 5.4,3.9,1.7,0.4 4.6,3.4,1.4,0.3 5.0,3.4,1.5,0.2 4.4,2.9,1.4,0.2 4.9,3.1,1.5,0.1 5.4,3.7,1.5,0.2 4.8,3.4,1.6,0.2 4.8,3.0,1.4,0.1 4.3,3.0,1.1,0.1 5.8,4.0,1.2,0.2 5.7,4.4,1.5,0.4 5.4,3.9,1.3,0.4 5.1,3.5,1.4,0.3 5.7,3.8,1.7,0.3 5.1,3.8,1.5,0.3 5.4,3.4,1.7,0.2 5.1,3.7,1.5,0.4 4.6,3.6,1.0,0.2 5.1,3.3,1.7,0.5 4.8,3.4,1.9,0.2 5.0,3.0,1.6,0.2 5.0,3.4,1.6,0.4 5.2,3.5,1.5,0.2 5.2,3.4,1.4,0.2 4.7,3.2,1.6,0.2 4.8,3.1,1.6,0.2 5.4,3.4,1.5,0.4 5.2,4.1,1.5,0.1 5.5,4.2,1.4,0.2 4.9,3.1,1.5,0.1 5.0,3.2,1.2,0.2 5.5,3.5,1.3,0.2 4.9,3.1,1.5,0.1 4.4,3.0,1.3,0.2 5.1,3.4,1.5,0.2 5.0,3.5,1.3,0.3 4.5,2.3,1.3,0.3 4.4,3.2,1.3,0.2 5.0,3.5,1.6,0.6 5.1,3.8,1.9,0.4 4.8,3.0,1.4,0.3 5.1,3.8,1.6,0.2 4.6,3.2,1.4,0.2
5.3,3.7,1.5,0.2 5.0,3.3,1.4,0.2 7.0,3.2,4.7,1.4 6.4,3.2,4.5,1.5 6.9,3.1,4.9,1.5 5.5,2.3,4.0,1.3 6.5,2.8,4.6,1.5 5.7,2.8,4.5,1.3 6.3,3.3,4.7,1.6 4.9,2.4,3.3,1.0 6.6,2.9,4.6,1.3 5.2,2.7,3.9,1.4 5.0,2.0,3.5,1.0 5.9,3.0,4.2,1.5 6.0,2.2,4.0,1.0 6.1,2.9,4.7,1.4 5.6,2.9,3.6,1.3 6.7,3.1,4.4,1.4 5.6,3.0,4.5,1.5 5.8,2.7,4.1,1.0 6.2,2.2,4.5,1.5 5.6,2.5,3.9,1.1 5.9,3.2,4.8,1.8 6.1,2.8,4.0,1.3 6.3,2.5,4.9,1.5 6.1,2.8,4.7,1.2 6.4,2.9,4.3,1.3 6.6,3.0,4.4,1.4 6.8,2.8,4.8,1.4 6.7,3.0,5.0,1.7 6.0,2.9,4.5,1.5 5.7,2.6,3.5,1.0 5.5,2.4,3.8,1.1 5.5,2.4,3.7,1.0 5.8,2.7,3.9,1.2 6.0,2.7,5.1,1.6 5.4,3.0,4.5,1.5 6.0,3.4,4.5,1.6 6.7,3.1,4.7,1.5 6.3,2.3,4.4,1.3 5.6,3.0,4.1,1.3 5.5,2.5,4.0,1.3 5.5,2.6,4.4,1.2 6.1,3.0,4.6,1.4 5.8,2.6,4.0,1.2 5.0,2.3,3.3,1.0 5.6,2.7,4.2,1.3 5.7,3.0,4.2,1.2 5.7,2.9,4.2,1.3 6.2,2.9,4.3,1.3 5.1,2.5,3.0,1.1 5.7,2.8,4.1,1.3 6.3,3.3,6.0,2.5 5.8,2.7,5.1,1.9 7.1,3.0,5.9,2.1 6.3,2.9,5.6,1.8 6.5,3.0,5.8,2.2 7.6,3.0,6.6,2.1 4.9,2.5,4.5,1.7 7.3,2.9,6.3,1.8 6.7,2.5,5.8,1.8 7.2,3.6,6.1,2.5 6.5,3.2,5.1,2.0 6.4,2.7,5.3,1.9 6.8,3.0,5.5,2.1 5.7,2.5,5.0,2.0 5.8,2.8,5.1,2.4 6.4,3.2,5.3,2.3 6.5,3.0,5.5,1.8 7.7,3.8,6.7,2.2 7.7,2.6,6.9,2.3 6.0,2.2,5.0,1.5 6.9,3.2,5.7,2.3 5.6,2.8,4.9,2.0 7.7,2.8,6.7,2.0 6.3,2.7,4.9,1.8 6.7,3.3,5.7,2.1 7.2,3.2,6.0,1.8 6.2,2.8,4.8,1.8 6.1,3.0,4.9,1.8 6.4,2.8,5.6,2.1 7.2,3.0,5.8,1.6 7.4,2.8,6.1,1.9 7.9,3.8,6.4,2.0 6.4,2.8,5.6,2.2 6.3,2.8,5.1,1.5 6.1,2.6,5.6,1.4 7.7,3.0,6.1,2.3 6.3,3.4,5.6,2.4 6.4,3.1,5.5,1.8 6.0,3.0,4.8,1.8 6.9,3.1,5.4,2.1 6.7,3.1,5.6,2.4 6.9,3.1,5.1,2.3 5.8,2.7,5.1,1.9 6.8,3.2,5.9,2.3 6.7,3.3,5.7,2.5 6.7,3.0,5.2,2.3 6.3,2.5,5.0,1.9 6.5,3.0,5.2,2.0 6.2,3.4,5.4,2.3 5.9,3.0,5.1,1.8
You compute the clusters in four dimensions. Note this implies the centroids are four-dimensional points too. Then you plot two-dimensional projections of the clusters. So when you plot the centroids, you have to pick out the same two dimensions that you just used for the scatterplot of the individual points.
for i, j in itertools.combinations([0, 1, 2, 3], 2):
    fig, ax = plt.subplots(figsize=(12, 8))
    ax.scatter(df.iloc[:, i], df.iloc[:, j], c=cluster_labels, cmap='viridis', edgecolor='k', s=20, alpha=0.5)
    ax.scatter(centroids[:, i], centroids[:, j], s=20, c='black', marker='*')
    plt.show()
Python: Barplot colored according to a third variable
Currently I am trying to create a bar plot that shows the number of reviews for an app per week. The bars should, however, be colored according to a third variable which contains the average rating of the reviews in each week (range: 1 to 5). I followed the instructions of the following post to create the graph: Python: Barplot with colorbar
The code works fine:
# Import Packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
# Create Dataframe
data = [[1, 10, 3.4], [2, 15, 3.9], [3, 12, 3.6], [4, 30,1.2]]
df = pd.DataFrame(data, columns = ["week", "count", "score"])
# Convert to lists
data_x = list(df["week"])
data_hight = list(df["count"])
data_color = list(df["score"])
#Create Barplot:
data_color = [x / max(data_color) for x in data_color]
fig, ax = plt.subplots(figsize=(15, 4))
my_cmap = plt.cm.get_cmap('RdYlGn')
colors = my_cmap(data_color)
rects = ax.bar(data_x, data_hight, color=colors)
sm = ScalarMappable(cmap=my_cmap, norm=plt.Normalize(1,5))
sm.set_array([])
cbar = plt.colorbar(sm)
cbar.set_label('Color', rotation=270,labelpad=25)
plt.show()
Now to the issue: as you might notice, the value of the average score in week 4 is "1.2". The bar plot, however, indicates that the value lies around "2.5". I understand that this stems from the following line, which standardizes the values by dividing them by the max value:
data_color = [x / max(data_color) for x in data_color]
Unfortunately, I am not able to change this command so that the colors reflect the absolute values of the scores, e.g. with an average score of 1.2 the last bar should be colored deep red, not light orange. I tried to just plug in the regular score values (not standardized) to solve the issue; however, doing so gives all bars the same green color...
Since this is only my second Python project, I have a hard time comprehending the process behind this matter and would be very thankful for any advice or solution.
Cheers
Neil
You identified correctly that the normalization is the problem here. In the linked code by valued SO user @ImportanceOfBeingEarnest, it is defined for the interval [0, 1]. If you want another normalization range [normmin, normmax], you have to take this into account during the normalization:
# Import Packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
# Create Dataframe
data = [[1, 10, 3.4], [2, 15, 3.9], [3, 12, 3.6], [4, 30,1.2]]
df = pd.DataFrame(data, columns = ["week", "mycount", "score"])
# Not necessary to convert to lists, pandas series or numpy array is also fine
data_x = df.week
data_hight = df.mycount
data_color = df.score
#Create Barplot:
normmin=1
normmax=5
data_color = [(x-normmin) / (normmax-normmin) for x in data_color] #see the difference here
fig, ax = plt.subplots(figsize=(15, 4))
my_cmap = plt.get_cmap('RdYlGn') # plt.cm.get_cmap was removed in Matplotlib 3.9
colors = my_cmap(data_color)
rects = ax.bar(data_x, data_hight, color=colors)
sm = ScalarMappable(cmap=my_cmap, norm=plt.Normalize(normmin,normmax))
sm.set_array([])
cbar = plt.colorbar(sm, ax=ax) # newer Matplotlib needs ax to place the colorbar
cbar.set_label('Color', rotation=270,labelpad=25)
plt.show()
Obviously, this does not check that all values are indeed within the range [normmin, normmax], so a better script would make sure that all values adhere to this specification. We could, alternatively, address this problem by clipping the values that are outside the normalization range:
#...
import numpy as np
#.....
#Create Barplot:
normmin=1
normmax=3.5
data_color = [(x-normmin) / (normmax-normmin) for x in np.clip(data_color, normmin, normmax)]
#....
You may also have noticed another change that I introduced. You don't have to provide lists; pandas Series or NumPy arrays are fine, too. And if you avoid naming your columns after pandas methods such as count, you can access them as df.ABC instead of df["ABC"].
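The manual rescaling and clipping above can also be delegated to Matplotlib's Normalize class, which maps [vmin, vmax] onto [0, 1] and, with clip=True, clamps values outside that range. A minimal sketch using the data from the question; the variable names and the colorbar label are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

weeks = [1, 2, 3, 4]
counts = [10, 15, 12, 30]
scores = np.array([3.4, 3.9, 3.6, 1.2])

# One Normalize object serves both the bar colors and the colorbar,
# so the two cannot drift apart
norm = Normalize(vmin=1, vmax=5, clip=True)
cmap = plt.get_cmap('RdYlGn')
colors = cmap(norm(scores))  # RGBA array, one row per bar

fig, ax = plt.subplots(figsize=(15, 4))
ax.bar(weeks, counts, color=colors)
cbar = fig.colorbar(ScalarMappable(norm=norm, cmap=cmap), ax=ax)
cbar.set_label('Average score', rotation=270, labelpad=25)
```

Add plt.show() at the end to display the figure. Sharing one norm between the bars and the ScalarMappable avoids the mismatch from the question, where the bars were scaled by the maximum score while the colorbar was defined on [1, 5].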
Python - legend values duplicate
I'm plotting a matrix, as shown below, and the legend repeats over and over again. I've tried using numpoints = 1 and this didn't seem to have any effect. Any hints?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10, 8) # set default figure size, 10in by 8in
data = pd.read_csv('data/assg-03-data.csv', names=['exam1', 'exam2', 'admitted'])
x = data[['exam1', 'exam2']].as_matrix()
y = data.admitted.as_matrix()
# plot the visualization of the exam scores here
no_admit = np.where(y == 0)
admit = np.where(y == 1)
from pylab import *
# plot the example figure
plt.figure()
# plot the points in our two categories, y=0 and y=1, using markers to indicate
# the category or output
plt.plot(x[no_admit,0], x[no_admit,1],'yo', label = 'Not admitted', markersize=8, markeredgewidth=1)
plt.plot(x[admit,0], x[admit,1], 'r^', label = 'Admitted', markersize=8, markeredgewidth=1)
# add some labels and titles
plt.xlabel('$Exam 1 score$')
plt.ylabel('$Exam 2 score$')
plt.title('Admit/No Admit as a function of Exam Scores')
plt.legend()
It's nearly impossible to understand the problem if you don't include an example of the data format, especially for someone who is not familiar with pandas. However, assuming your input has this format:
x=pd.DataFrame(np.array([np.arange(10),np.arange(10)**2]).T,columns=['exam1','exam2']).as_matrix()
y=pd.DataFrame(np.arange(10)%2).as_matrix()
>>> x
array([[ 0,  0],
       [ 1,  1],
       [ 2,  4],
       [ 3,  9],
       [ 4, 16],
       [ 5, 25],
       [ 6, 36],
       [ 7, 49],
       [ 8, 64],
       [ 9, 81]])
>>> y
array([[0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1]])
the reason is the strange transformation from DataFrame to matrix. I guess it wouldn't happen if you had vectors (1D arrays). For my example this works (not sure if it is the cleanest form; I don't know where the 2D matrix for x and y comes from):
plt.plot(x[no_admit,0][0], x[no_admit,1][0],'yo', label = 'Not admitted', markersize=8, markeredgewidth=1)
plt.plot(x[admit,0][0], x[admit,1][0], 'r^', label = 'Admitted', markersize=8, markeredgewidth=1)
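A likely mechanism behind the repeated legend entries: np.where returns a tuple of index arrays, so x[no_admit, 0] comes out two-dimensional, and plt.plot treats every column of a 2D input as a separate line, each carrying the same label. Indexing with a boolean mask keeps the result one-dimensional. The sketch below uses made-up data in the shape assumed in the answer above, with y as a 1D array.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up stand-in for the exam data
x = np.column_stack([np.arange(10), np.arange(10) ** 2])
y = np.arange(10) % 2

# np.where gives a tuple, which turns the selection into a 2D array
no_admit_idx = np.where(y == 0)
shape_with_where = x[no_admit_idx, 0].shape

# A boolean mask selects the same rows but stays 1D
no_admit = y == 0
admit = y == 1
shape_with_mask = x[no_admit, 0].shape

fig, ax = plt.subplots()
ax.plot(x[no_admit, 0], x[no_admit, 1], 'yo', label='Not admitted')
ax.plot(x[admit, 0], x[admit, 1], 'r^', label='Admitted')
ax.legend()
handles, labels = ax.get_legend_handles_labels()
```

With 1D inputs each plot call creates a single line, so the legend gets one entry per category.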
How to plot specific parts of a matrix in matplotlib?
I have a matrix that represents the temperature distribution in a hollow square plate (hope the attached figure helps). The problem is the hollow part of the plate, which doesn't represent any solid material, so I need to exclude this part from the plot. The simulation returns an np.array() with the temperature results (except, of course, for the hollow part). This is the part where I define the dimensions of the grid:
import numpy as np
plate_height = 0.4 #meters
hollow_square_height = 0.2 #meters
#discretization data
delta_x = delta_y = 0.05 #meters
grid_points_n = int(round(plate_height/delta_x)) + 1 # np.zeros needs an integer shape
grid = np.zeros(shape=(grid_points_n, grid_points_n))
# the simulation assures that the hollow part will remain zero valued.
So, how do I approach this?
Instead of changing the original data, you can mask the values that you don't want to be used in calculations, plots, etc.:
import matplotlib.pyplot as plt
import numpy as np
data = [
    [11, 11, 12, 13],
    [9, 0, 0, 12],
    [8, 0, 0, 11],
    [8, 9, 10, 11]
]
#Here's what you have:
data_array = np.array(data)
#Mask every position where there is a 0:
masked_data = np.ma.masked_equal(data_array, 0)
#Plot the matrix:
fig = plt.figure()
ax = fig.gca()
ax.matshow(masked_data, cmap=plt.cm.autumn_r) #_r => reverse the standard color map
plt.show()
#plt.savefig('heatmap.png')
Replace the zeros with nan; nan values are left blank when the matrix is plotted. For example:
import matplotlib.pyplot as plt
from numpy import nan,matrix
M = matrix([
    [20,30,25,20,50],
    [22,nan,nan,nan,27],
    [30,nan,nan,nan,20],
    [33,nan,nan,nan,31],
    [21,28,29,23,36]])
fig = plt.figure()
ax = fig.add_subplot(111)
ax.matshow(M, cmap=plt.cm.jet) # Show matrix color
plt.show()
You can replace zeros with nan in a matrix as follows:
from numpy import nan
A[A==0.0]=nan # A is your matrix (float dtype, since nan is a float)
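Both answers identify the hollow part by its zero value, which would also hide any solid node whose temperature happens to be exactly zero. An alternative is to build the mask from the plate geometry given in the question. This is a rough sketch under two assumptions that are not in the original: the hollow square is centred in the plate, and the grid nodes lying on the edge of the hollow still belong to the solid. The constant 20.0 is only a placeholder for the simulation output.

```python
import numpy as np
import matplotlib.pyplot as plt

plate_height = 0.4          # meters
hollow_square_height = 0.2  # meters
delta = 0.05                # meters, grid spacing in x and y

n = int(round(plate_height / delta)) + 1                 # nodes per side
hollow_cells = int(round(hollow_square_height / delta))  # cells across the hollow
first = (n - 1 - hollow_cells) // 2                      # node index of the hollow's edge

grid = np.full((n, n), 20.0)  # placeholder for the simulated temperatures

# Mark only the nodes strictly inside the hollow (assumption: edge nodes are solid)
hollow = np.zeros((n, n), dtype=bool)
hollow[first + 1:first + hollow_cells, first + 1:first + hollow_cells] = True
masked_grid = np.ma.masked_array(grid, mask=hollow)

fig, ax = plt.subplots()
ax.matshow(masked_grid, cmap='autumn_r')
```

Call plt.show() to display the result. If the edge nodes of the hollow should be hidden as well, the slice bounds would widen to first:first + hollow_cells + 1.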