I am trying to run hierarchical clustering on Japanese words/terms and use scipy.cluster.hierarchy.dendrogram to plot the results. However, the plot does not show the Japanese words/terms; it shows small rectangles instead. At first I thought this might be because the dictionary keys I created were unicode escapes rather than Japanese text (as in a question I asked here). I was then advised to use Python 3, and I did manage to make the dictionary keys actual Japanese words instead of unicode escapes (as in another question I asked here). However, it turns out that even when I feed the labels parameter of scipy.cluster.hierarchy.dendrogram with Japanese words/terms, the plot still cannot show them. I have checked several similar posts, but there seems to be no clear solution. My code is as follows:
import pandas as pd
import numpy as np
from sklearn import decomposition
from sklearn.cluster import AgglomerativeClustering as hicluster
from scipy.spatial.distance import cdist, pdist
from scipy import sparse as sp ## Sparse Matrix
from scipy.cluster.hierarchy import dendrogram
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')
## Import Data
allWrdMat10 = pd.read_csv("../../data/allWrdMat10.csv.gz",
                          encoding='CP932')
## Set X as CSR Sparse Matrix
X = np.array(allWrdMat10)
X = sp.csr_matrix(X)
def plot_dendrogram(model, **kwargs):
    # Children of hierarchical clustering
    children = model.children_

    # Distances between each pair of children
    # Since we don't have this information, we can use a uniform one for plotting
    distance = np.arange(children.shape[0])

    # The number of observations contained in each cluster level
    no_of_observations = np.arange(2, children.shape[0] + 2)

    # Create linkage matrix and then plot the dendrogram
    linkage_matrix = np.column_stack([children, distance,
                                      no_of_observations]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
dict_index = {t:i for i,t in enumerate(allWrdMat10.columns)}
dictlist = []
temp = []
akey = []
avalue = []
for key, value in dict_index.items():
    akey.append(key)
    avalue.append(value)
    temp = [key, value]
    dictlist.append(temp)
avalue = np.array(avalue)
X_transform = X[:, avalue < 1000].transpose().toarray()
freq1000terms = akey
freq1000terms = np.array(freq1000terms)[avalue < 1000]
hicl_ward = hicluster(n_clusters=40, linkage='ward', compute_full_tree=False)
hiclwres = hicl_ward.fit(X_transform)
plt.rcParams["figure.figsize"] = (15,6)
model1 = hiclwres
plt.title('Hierarchical Clustering Dendrogram (Ward Linkage)')
plot_dendrogram(model1, p=40, truncate_mode='lastp', orientation='top',
                labels=freq1000terms[model1.labels_], color_threshold=991)
plt.ylim(959,1000)
plt.show()
You need to give matplotlib a valid font to display Japanese characters with. You can find the available fonts from your system by using the following code:
import matplotlib.font_manager
matplotlib.font_manager.findSystemFonts(fontpaths=None)
It will give you a list of system fonts that matplotlib can use:
['c:\\windows\\fonts\\seguisli.ttf',
'C:\\WINDOWS\\Fonts\\BOD_R.TTF',
'C:\\WINDOWS\\Fonts\\GILC____.TTF',
'c:\\windows\\fonts\\segoewp-light.ttf',
'c:\\windows\\fonts\\glsnecb.ttf',
...
...
'c:\\windows\\fonts\\elephnti.ttf',
'C:\\WINDOWS\\Fonts\\COPRGTB.TTF']
Pick a font that supports Japanese characters and give it as a parameter to matplotlib at the beginning of your code, as follows:
import matplotlib.pyplot as plt
plt.rcParams["font.family"] = "Yu Gothic" # I.E Yu Gothic, supports shift-jis
This is a global setting: other plots in the same project will also use the same font family. If you want to change the font for a single text element, you can use the font properties of the matplotlib text object.
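For example, a minimal sketch of setting a Japanese-capable font on a single text element only (the font name here is just an illustration; pick one from the list above):
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

# the font family is an assumption; use any installed font that covers Japanese
jp_font = FontProperties(family="Yu Gothic")
plt.title("階層的クラスタリング", fontproperties=jp_font)  # only this text uses the Japanese-capable font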
Also: if you can't find/see an appropriate font, you can download a font like Code2000, install it, and use it the same way. (For the font to show up in the list, you may need to clear matplotlib's cache.)
I am having difficulties accessing (the right) data when using holoviews/bokeh, either for connected plots showing a different aspect of the dataset, or just customising a plot with dynamic access to the data as plotted (say a tooltip).
TL;DR: How do I add a projection plot of my dataset (a different set of dimensions, linked to the main plot, like a marginal distribution but not restricted to a histogram or distribution)? A similar solution would probably also answer a related question I asked here on SO.
Let me exemplify (straight from an ipynb, should be quite reproducible):
import numpy as np
import random, pandas as pd
import bokeh
import datashader as ds
import holoviews as hv
from holoviews import opts
from holoviews.operation.datashader import datashade, shade, dynspread, spread, rasterize
hv.extension('bokeh')
With imports set up, let's create a dataset (N target 10e12 ;) to use with datashader. Besides the key dimensions, I really need some value dimensions (here z and z2).
import numpy as np
import pandas as pd
N = int(10e6)
x_r = (0,100)
y_r = (100,2000)
z_r = (0,10e8)
x = np.random.randint(x_r[0]*1000,x_r[1]*1000,size=(N, 1))
y = np.random.randint(y_r[0]*1000,y_r[1]*1000,size=(N, 1))
z = np.random.randint(z_r[0]*1000,z_r[1]*1000,size=(N, 1))
z2 = np.ones((N,1)).astype(int)
df = pd.DataFrame(np.column_stack([x,y,z,z2]), columns=['x','y','z','z2'])
df[['x','y','z']] = df[['x','y','z']].div(1000, axis=0)
df
Now I plot the data, rasterised, and also activate the tooltip to see the defaults. Sure, x/y is trivial, but as I said, I care about the value dimensions. It shows z2 as x_y z2. I have asked a related question here on SO about accessing value dimensions in the tooltips for this sort of data.
from matplotlib.cm import get_cmap
palette = get_cmap('viridis')
# palette_inv = palette.reversed()
p=hv.Points(df,['x','y'], ['z','z2'])
P=rasterize(p, aggregator=ds.sum("z2"),x_range=(0,100)).opts(cmap=palette)
P.opts(tools=["hover"]).opts(height=500, width=500,xlim=(0,100),ylim=(100,2000))
Now I can add a histogram or a marginal distribution which is pretty close to what I want, but there are issues with this soon past the trivial defaults. (E.g.: P << hv.Distribution(p, kdims=['y']) or P.hist(dimension='y',weight_dimension='x_y z',num_bins = 2000,normed=True))
Both are close approaches, but do not give me the other value dimension I'd like to visualise. If I try to access the other value dimension ('x_y z'), this fails. Also, the 'x_y z2' naming seems very clumsy; is there a better way?
When I do something like this, my browser/notebook-extension blows up, of course.
transformed = p.transform(x=hv.dim('z'))
P << hv.Curve(transformed)
So how do I access all my data in the right way?
I want to use hierarchical cluster analysis to get the optimal number (K) of clusters automatically, then apply this K to K-means clustering in Python.
After studying many articles, I know some methods tell us we can plot a graph to determine K, but is there any method that can output the actual number automatically in Python?
The hierarchical clustering method uses the dendrogram to determine the optimal number of clusters. Plot the dendrogram using code similar to the following:
# General imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Special imports
from scipy.cluster.hierarchy import dendrogram, linkage
# Load data, fill in appropriately
X = []
# How to cluster the data; 'single' uses the minimal distance between clusters
linked = linkage(X, 'single')
# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked,
           orientation='top',
           labels=labelList,
           distance_sort='descending',
           show_leaf_counts=True)
plt.show()
In the dendrogram, locate the largest vertical difference between nodes and draw a horizontal line through the middle of it. The number of vertical lines it intersects is the optimal number of clusters (when affinity is calculated using the method set in linkage).
See example here: https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/
How to automatically read a dendrogram and extract that number is something I would also like to know.
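One rough way to automate the "largest vertical difference" rule is to read it off the linkage matrix instead of the plot. A minimal sketch, assuming X is the same data array used above (a heuristic, not a definitive method):
import numpy as np
from scipy.cluster.hierarchy import linkage

Z = linkage(X, 'single')            # X as in the snippet above
dists = Z[:, 2]                     # merge distances, sorted ascending
gap_idx = np.argmax(np.diff(dists)) # largest jump between successive merges
k = len(dists) - gap_idx            # clusters left if we cut inside that gap
print(k)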
Added in edit:
There is a way to do so using SK Learn package. See the following example:
#==========================================================================
# Hierarchical Clustering - Automatic determination of number of clusters
#==========================================================================
# General imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from os import path
# Special imports
from scipy.cluster.hierarchy import dendrogram, linkage
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering
# %matplotlib inline
print("============================================================")
print(" Hierarchical Clustering demo - num of clusters ")
print("============================================================")
print(" ")
folder = path.dirname(path.realpath(__file__)) # set current folder
# Load data
customer_data = pd.read_csv( path.join(folder, "hierarchical-clustering-with-python-and-scikit-learn-shopping-data.csv"))
# print(customer_data.shape)
print("In this data there should be 5 clusters...")
# Retain only the last two columns
data = customer_data.iloc[:, 3:5].values
# # Plot dendrogram using SciPy
# plt.figure(figsize=(10, 7))
# plt.title("Customer Dendograms")
# dend = shc.dendrogram(shc.linkage(data, method='ward'))
# plt.show()
# Initialize hierarchical clustering method; in order for the algorithm to determine the number of clusters
# put n_clusters=None, compute_full_tree = True,
# best distance threshold value for this dataset is distance_threshold = 200
cluster = AgglomerativeClustering(n_clusters=None, affinity='euclidean', linkage='ward', compute_full_tree=True, distance_threshold=200)
# Cluster the data
cluster.fit_predict(data)
print(f"Number of clusters = {1+np.amax(cluster.labels_)}")
# Display the clustering, assigning cluster label to every datapoint
print("Classifying the points into clusters:")
print(cluster.labels_)
# Display the clustering graphically in a plot
plt.scatter(data[:,0],data[:,1], c=cluster.labels_, cmap='rainbow')
plt.title(f"SK Learn estimated number of clusters = {1+np.amax(cluster.labels_)}")
plt.show()
print(" ")
The data was taken from here: https://stackabuse.s3.amazonaws.com/files/hierarchical-clustering-with-python-and-scikit-learn-shopping-data.csv
I found a solution that I am using in my code. It relies on the color_list from the dendrogram output, which effectively counts the colored "connections"; to extract the number of clusters, just decrease that count by 1:
https://www.youtube.com/watch?v=4DInt3H2UNE
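A minimal sketch of one way to implement that counting idea (the distinct link colors include one extra color for the above-threshold links, hence the minus 1). Here data and the 200 threshold are taken from the example above and may need tuning for other datasets:
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(data, method='ward')
d = dendrogram(Z, color_threshold=200, no_plot=True)
n_clusters = len(set(d['color_list'])) - 1  # subtract 1 as described above
print(n_clusters)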
My data file is shared in the following link.
We can plot this data using the following script.
import matplotlib as mpl
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook
def read_datafile(file_name):
    data = np.loadtxt(file_name, delimiter=',')
    return data
data = read_datafile('mah_data.csv')
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.set_title("Data")
ax1.set_xlabel('t')
ax1.set_ylabel('s')
ax1.plot(data[:, 0], data[:, 1], c='r', label='My data')  # assuming column 0 is t and column 1 is s
leg = ax1.legend()
plt.show()
How can we detect peaks in Python? I can't find a suitable peak detection algorithm.
You can use the argrelextrema function in scipy.signal to return the indices of the local maxima or local minima of an array. This works for multi-dimensional arrays as well by specifying the axis.
import numpy as np
from scipy.signal import argrelextrema

ind_max = argrelextrema(z, np.greater)  # indices of the local maxima
ind_min = argrelextrema(z, np.less)     # indices of the local minima
maxvals = z[ind_max]
minvals = z[ind_min]
More specifically, one can use argrelmax or argrelmin to find the local maxima or minima. These also work for multi-dimensional arrays via the axis argument.
from scipy.signal import argrelmax, argrelmin

ind_max = argrelmax(z)  # indices of the local maxima
ind_min = argrelmin(z)  # indices of the local minima
maxvals = z[ind_max]
minvals = z[ind_min]
For more details, one can refer to this link: https://docs.scipy.org/doc/scipy/reference/signal.html#peak-finding
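SciPy also provides scipy.signal.find_peaks, which returns peak indices subject to simple constraints. A minimal sketch, assuming z is a 1-D signal array; the height and distance values are assumptions to tune for your data:
from scipy.signal import find_peaks

peaks, properties = find_peaks(z, height=0, distance=5)  # tune height/distance
maxvals = z[peaks]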
Try using PeakUtils (http://pythonhosted.org/PeakUtils/). Here is my solution to your question using peakutils.
import pandas as pd
import peakutils
data = pd.read_csv("mah_data.csv", header=None)
ts = data[0:10000][1] # First 10000 rows of the second column in the csv file
print(ts[0:10]) # Print the first 10 rows, for quick testing
# check peakutils for all the parameters.
# indices are the index of the points where peaks appear
indices = peakutils.indexes(ts, thres=0.4, min_dist=1000)
print(indices)
You should also checkout peak finding in scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks_cwt.html)
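For reference, a minimal sketch of the scipy.signal.find_peaks_cwt approach linked above, reusing ts from the snippet before it; the widths range is an assumption you would tune to your expected peak widths:
import numpy as np
from scipy.signal import find_peaks_cwt

indices_cwt = find_peaks_cwt(ts, widths=np.arange(100, 1000, 100))
print(indices_cwt)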
Try the findpeaks library.
pip install findpeaks
I cannot find the attached data, but suppose the data is a vector stored in data:
import pandas as pd
data = pd.read_csv("mah_data.csv", header=None).values
# Import library
from findpeaks import findpeaks
# If the resolution of your data is low, I would recommend the ``lookahead`` parameter, and if your data is "bumpy", also the ``smooth`` parameter.
fp = findpeaks(lookahead=1, interpolate=10)
# Find peaks
results = fp.fit(data)
# Make plot
fp.plot()
# Results with respect to original input data.
results['df']
# Results based on interpolated smoothed data.
results['df_interp']
I want to color my clusters with a color map that I made in the form of a dictionary (i.e. {leaf: color}).
I've tried following https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/ but the colors get messed up for some reason. The default plot looks good, I just want to assign those colors differently. I saw that there was a link_color_func but when I tried using my color map (D_leaf_color dictionary) I got an error b/c it wasn't a function. I've created D_leaf_color to customize the colors of the leaves associated with particular clusters. In my actual dataset, the colors mean something so I'm steering away from arbitrary color assignments.
I don't want to use color_threshold b/c in my actual data, I have way more clusters and SciPy repeats the colors, hence this question. . .
How can I use my leaf-color dictionary to customize the color of my dendrogram clusters?
I made a GitHub issue https://github.com/scipy/scipy/issues/6346 where I further elaborated on the approach to color the leaves from "Interpreting the output of SciPy's hierarchical clustering dendrogram? (maybe found a bug...)", but I still can't figure out how to actually either: (i) use the dendrogram output to reconstruct my dendrogram with my specified color dictionary, or (ii) reformat my D_leaf_color dictionary for the link_color_func parameter.
# Init
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
# Load data
from sklearn.datasets import load_diabetes
# Clustering
from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list
from scipy.spatial import distance
from fastcluster import linkage # You can use SciPy one too
%matplotlib inline
# Dataset
A_data = load_diabetes().data
DF_diabetes = pd.DataFrame(A_data, columns = ["attr_%d" % j for j in range(A_data.shape[1])])
# Absolute value of correlation matrix, then subtract from 1 for dissimilarity
DF_dism = 1 - np.abs(DF_diabetes.corr())
# Compute average linkage
A_dist = distance.squareform(DF_dism.as_matrix())
Z = linkage(A_dist,method="average")
# Color mapping
D_leaf_colors = {"attr_1": "#808080", # Unclustered gray
"attr_4": "#B061FF", # Cluster 1 indigo
"attr_5": "#B061FF",
"attr_2": "#B061FF",
"attr_8": "#B061FF",
"attr_6": "#B061FF",
"attr_7": "#B061FF",
"attr_0": "#61ffff", # Cluster 2 cyan
"attr_3": "#61ffff",
"attr_9": "#61ffff",
}
# Dendrogram
# To get this dendrogram coloring below `color_threshold=0.7`
D = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=None, leaf_font_size=12, leaf_rotation=45, link_color_func=D_leaf_colors)
# TypeError: 'dict' object is not callable
I also tried "How do I get the subtrees of dendrogram made by scipy.cluster.hierarchy".
Here is a solution that uses the return matrix Z of linkage() (described earlier, but a little hidden in the docs) and link_color_func:
# see question for code prior to "color mapping"
# Color mapping
dflt_col = "#808080" # Unclustered gray
D_leaf_colors = {"attr_1": dflt_col,
"attr_4": "#B061FF", # Cluster 1 indigo
"attr_5": "#B061FF",
"attr_2": "#B061FF",
"attr_8": "#B061FF",
"attr_6": "#B061FF",
"attr_7": "#B061FF",
"attr_0": "#61ffff", # Cluster 2 cyan
"attr_3": "#61ffff",
"attr_9": "#61ffff",
}
# notes:
# * rows in Z correspond to "inverted U" links that connect clusters
# * rows are ordered by increasing distance
# * if the colors of the connected clusters match, use that color for link
link_cols = {}
for i, i12 in enumerate(Z[:, :2].astype(int)):
    c1, c2 = (link_cols[x] if x > len(Z) else D_leaf_colors["attr_%d" % x]
              for x in i12)
    link_cols[i + 1 + len(Z)] = c1 if c1 == c2 else dflt_col
# Dendrogram
D = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=None,
               leaf_font_size=12, leaf_rotation=45,
               link_color_func=lambda x: link_cols[x])
Here is the output:
Two-liner for applying a custom colormap to the cluster branches:
import numpy as np
import matplotlib as mpl
from matplotlib.pyplot import cm
from scipy.cluster import hierarchy

cmap = cm.rainbow(np.linspace(0, 1, 10))
hierarchy.set_link_color_palette([mpl.colors.rgb2hex(rgb[:3]) for rgb in cmap])
You can then replace rainbow with any cmap and change 10 to the number of clusters you want.
I found a hackish solution. It does require using the color threshold (I need it in order to obtain the same coloring as presented in the OP, otherwise the colors do not match), but it could lead you to a solution. However, you may not have enough information to know how to set the color palette order.
# Init
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
# Load data
from sklearn.datasets import load_diabetes
# Clustering
from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list, set_link_color_palette
from scipy.spatial import distance
from fastcluster import linkage # You can use SciPy one too
%matplotlib inline
# Dataset
A_data = load_diabetes().data
DF_diabetes = pd.DataFrame(A_data, columns = ["attr_%d" % j for j in range(A_data.shape[1])])
# Absolute value of correlation matrix, then subtract from 1 for dissimilarity
DF_dism = 1 - np.abs(DF_diabetes.corr())
# Compute average linkage
A_dist = distance.squareform(DF_dism.as_matrix())
Z = linkage(A_dist,method="average")
# Color mapping dict not relevant in this case
# Dendrogram
# To get this dendrogram coloring below `color_threshold=0.7`
# Change the color palette; I did not include the grey, which is used above the threshold
set_link_color_palette(["#B061FF", "#61ffff"])
D = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=.7, leaf_font_size=12,
               leaf_rotation=45, above_threshold_color="grey")
The result:
I'm trying to learn how to use dendrograms in Python using SciPy. I want to get clusters and be able to visualize them; I heard hierarchical clustering and dendrograms are the best way.
How can I "cut" the tree at a specific distance?
In this example, I just want to cut it at distance 1.6
I looked up a tutorial at https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/#Inconsistency-Method but the author wrote a rather confusing wrapper function using **kwargs (he calls his threshold max_d).
Here is my code and plot below; I tried annotating it as best as I could for reproducibility:
from __future__ import print_function
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram,linkage,fcluster
from scipy.spatial import distance
np.random.seed(424173239) #43984
#Dims
n,m = 20,7
#DataFrame: rows = Samples, cols = Attributes
attributes = ["a" + str(j) for j in range(m)]
DF_data = pd.DataFrame(np.random.random((n, m)), columns = attributes)
A_dist = distance.cdist(DF_data.as_matrix().T, DF_data.as_matrix().T)
#(i) . Do the labels stay in place from DF_data for me to do this?
DF_dist = pd.DataFrame(A_dist, index = attributes, columns = attributes)
#Create dendrogram
fig, ax = plt.subplots()
Z = linkage(distance.squareform(DF_dist.as_matrix()), method="average")
D_dendro = dendrogram(Z, labels = attributes, ax=ax) #create dendrogram dictionary
threshold = 1.6 #for hline
ax.axhline(y=threshold, c='k')
plt.show()
#(ii) How can I "cut" the tree by giving it a distance threshold?
#i.e. If I cut at 1.6 it would make (a5 : cluster_1 or not in a cluster), (a2,a3 : cluster_2), (a0,a1 : cluster_3), and (a4,a6 : cluster_4)
#link_1 says use fcluster
#This -> fcluster(Z, t=1.5, criterion='inconsistent', depth=2, R=None, monocrit=None)
#gives me -> array([1, 1, 1, 1, 1, 1, 1], dtype=int32)
print(
    len(set(D_dendro["color_list"])), "^ # of colors from dendrogram",
    len(D_dendro["ivl"]), "^ # of labels", sep="\n")
#3
#^ # of colors from dendrogram; it should be 4 since clearly (a6, a4) and a5 are in different clusters
#7
#^ # of labels
link_1 : How to compute cluster assignments from linkage/distance matrices in scipy in Python?
color_threshold is the method I was looking for. It doesn't really help when the color palette is too small for the number of clusters being generated. I migrated the next step to "Bigger color-palette in matplotlib for SciPy's dendrogram (Python)" if anyone can help.
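For the programmatic cut itself, a minimal sketch using fcluster with criterion='distance', reusing Z, threshold and attributes from the question's code:
from scipy.cluster.hierarchy import fcluster

# cut the tree at the same height as the horizontal line in the plot
cluster_ids = fcluster(Z, t=threshold, criterion='distance')
print(dict(zip(attributes, cluster_ids)))  # which attribute ends up in which cluster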
For a bigger color palette this should work:
from scipy.cluster import hierarchy as hc
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as col

# get a color spectrum "gist_ncar" from matplotlib cm.
# When you have a spectrum it begins with 0 and ends with 1.
# make tinier steps if you need more than 10 colors
colors = cm.gist_ncar(np.arange(0, 1, 0.1))

colorlst = []  # empty list where you will put your colors
for i in range(len(colors)):  # get the hex of each color instead of rgb
    colorlst.append(col.to_hex(colors[i]))

hc.set_link_color_palette(colorlst)  # sets the colors to use
Put all of that in front of your code and it should work.
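A minimal usage sketch, assuming Z is an existing linkage matrix like in the earlier snippets (the threshold value is just an illustration):
d = hc.dendrogram(Z, color_threshold=0.7, above_threshold_color='grey')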