How to access the x axis index of a scipy dendrogram? - python

I have data, for which I create a linkage model, like this:
model = sc.linkage(data, 'ward')
Where model is as follows:
Z = np.array([
[ 2. , 9. , 20.12172148, 2. ],
[ 0. , 1. , 26.16772232, 2. ],
[ 11. , 12. , 29.40258214, 2. ],
[ 14. , 16. , 30.89332011, 3. ],
[ 3. , 7. , 33.70695832, 2. ],
[ 5. , 13. , 34.22180543, 2. ],
[ 4. , 15. , 35.52080322, 3. ],
[ 17. , 21. , 45.3919152 , 5. ],
[ 6. , 20. , 45.56339627, 3. ],
[ 8. , 23. , 66.42828305, 4. ],
[ 10. , 22. , 87.52531145, 6. ],
[ 18. , 24. , 93.78070161, 7. ],
[ 19. , 26. , 124.09967826, 9. ],
[ 25. , 27. , 160.11685636, 15. ]])
Z == model # returns true
I can then plot this linkage model using matplotlib:
# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram for signature data')
plt.xlabel('sample index')
plt.ylabel('distance')
sc.dendrogram(
model,
leaf_rotation=90., # rotates the x axis labels
leaf_font_size=8., # font size for the x axis labels
)
plt.show()
Now, this plots the dendrogram and sets the xticks to the sample index. I would like to replace these with the actual labels, which are
labels = ['wood', 'stone', 'flora', 'liquid', 'food', 'metal', 'ceramic',
'sky', 'glass', 'paper', 'animal', 'skin', 'fabrics', 'gem', 'ground']
That is, the first tick on the x-axis reads 10, which should become labels[10]. However, I can't find out how to access this index.

There is no need to access the index. scipy.cluster.hierarchy.dendrogram provides a labels argument which you should use to supply your labels.
scipy.cluster.hierarchy.dendrogram(Z, labels=labels, ....)
Complete code:
import numpy as np
import scipy.cluster.hierarchy as sc
import matplotlib.pyplot as plt
Z = np.array([
[ 2. , 9. , 20.12172148, 2. ],
[ 0. , 1. , 26.16772232, 2. ],
[ 11. , 12. , 29.40258214, 2. ],
[ 14. , 16. , 30.89332011, 3. ],
[ 3. , 7. , 33.70695832, 2. ],
[ 5. , 13. , 34.22180543, 2. ],
[ 4. , 15. , 35.52080322, 3. ],
[ 17. , 21. , 45.3919152 , 5. ],
[ 6. , 20. , 45.56339627, 3. ],
[ 8. , 23. , 66.42828305, 4. ],
[ 10. , 22. , 87.52531145, 6. ],
[ 18. , 24. , 93.78070161, 7. ],
[ 19. , 26. , 124.09967826, 9. ],
[ 25. , 27. , 160.11685636, 15. ]])
labels = ['wood', 'stone', 'flora', 'liquid', 'food', 'metal', 'ceramic',
'sky', 'glass', 'paper', 'animal', 'skin', 'fabrics', 'gem', 'ground']
# calculate full dendrogram
plt.figure()
plt.title('Hierarchical Clustering Dendrogram for signature data')
plt.xlabel('sample index')
plt.ylabel('distance')
sc.dendrogram(
Z,
labels=labels,
leaf_rotation=90., # rotates the x axis labels
leaf_font_size=8., # font size for the x axis labels
)
plt.tight_layout()
plt.show()

I do not have the dendrogram module available, but the following should work for you. The idea is:
Create an axis instance ax and pass it to the dendrogram plot via the ax argument
Get the existing x-ticklabels and convert them to integers. Use these integers as indices into labels, so that the labels appear in the order in which they are displayed on the x-axis
Set these new labels using set_xticklabels
Following is the relevant piece of code you can use
fig, ax = plt.subplots(figsize=(25, 10))
sc.dendrogram(
model,
leaf_rotation=90.,
leaf_font_size=8., ax=ax)
fig.canvas.draw()
new_labels = [labels[int(i.get_text())] for i in ax.get_xticklabels()]
ax.set_xticklabels(new_labels)
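
If the goal is only to find out which observation sits at each x position, another option may be the dictionary that dendrogram returns. The following is a minimal sketch, assuming the Z and labels from the question: the 'leaves' entry lists the observation indices from left to right, 'ivl' holds the tick label texts, and no_plot=True computes both without drawing anything.

```python
import numpy as np
import scipy.cluster.hierarchy as sc

# Linkage matrix and labels copied from the question
Z = np.array([
    [2., 9., 20.12172148, 2.],
    [0., 1., 26.16772232, 2.],
    [11., 12., 29.40258214, 2.],
    [14., 16., 30.89332011, 3.],
    [3., 7., 33.70695832, 2.],
    [5., 13., 34.22180543, 2.],
    [4., 15., 35.52080322, 3.],
    [17., 21., 45.3919152, 5.],
    [6., 20., 45.56339627, 3.],
    [8., 23., 66.42828305, 4.],
    [10., 22., 87.52531145, 6.],
    [18., 24., 93.78070161, 7.],
    [19., 26., 124.09967826, 9.],
    [25., 27., 160.11685636, 15.]])
labels = ['wood', 'stone', 'flora', 'liquid', 'food', 'metal', 'ceramic',
          'sky', 'glass', 'paper', 'animal', 'skin', 'fabrics', 'gem', 'ground']

# Compute the dendrogram layout without plotting it
info = sc.dendrogram(Z, no_plot=True)
leaves = info['leaves']                       # observation indices, left to right
ordered_labels = [labels[i] for i in leaves]  # labels in plotting order

# Passing labels directly yields the same texts in the 'ivl' entry
ivl = sc.dendrogram(Z, labels=labels, no_plot=True)['ivl']

print(leaves)
print(ordered_labels)
```

The same dictionary is returned when the plot is drawn, so the leaf order can also be read from the return value of the plotting call itself.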

Related

How can I sort an array based on the mean of each column in python?

Input:
array([[ 1. , 5. , 1. ],
[ 10. , 7. , 1.5],
[ 6.9, 5. , 1. ],
[ 19. , 9. , 100. ],
[ 11. , 11. , 11. ]])
Expected Output:
array([[ 19. , 9. , 100. ],
[ 11. , 11. , 11. ],
[ 10. , 7. , 1.5],
[ 6.9, 5. , 1. ],
[ 1. , 5. , 1. ]])
I tried doing the below:
for i in M:
    ls = i.mean()
    x = np.append(i, ls)
    print(x)  # found the mean
After this I am unable to arrange the rows based on the mean value of each row. All I can do
is sort each row in descending order, but that is not what I want.
You can do this:
In [405]: row_idxs = np.argsort(np.mean(a * -1, axis=1))
In [406]: a[row_idxs, :]
Out[406]:
array([[ 19. , 9. , 100. ],
[ 11. , 11. , 11. ],
[ 10. , 7. , 1.5],
[ 6.9, 5. , 1. ],
[ 1. , 5. , 1. ]])
argsort returns the indices that would sort the row means in ascending order. Multiplying by -1 beforehand turns this into descending order.
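
As a self-contained sketch of the same idea, using the input array from the question (the names row_means, order and sorted_rows are chosen here for illustration):

```python
import numpy as np

# Input array from the question
a = np.array([[1.0, 5.0, 1.0],
              [10.0, 7.0, 1.5],
              [6.9, 5.0, 1.0],
              [19.0, 9.0, 100.0],
              [11.0, 11.0, 11.0]])

row_means = a.mean(axis=1)      # one mean per row
order = np.argsort(-row_means)  # negate so the largest mean comes first
sorted_rows = a[order]          # reorder whole rows, columns stay intact

print(sorted_rows)
```

Negating the means rather than the array keeps the intent visible: only the sort key is flipped, the data is untouched.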

Assign people which have different locations to a group

I have a pandas dataframe with people who live in different locations (latitude, longitude, floor number). I would like to assign 3 people to a group. The groups should be numbered from 1 to n. Important: the 3 people in a group must have different locations in terms of latitude, longitude and floor. At the end of this process, every person is assigned to exactly one group. The length of my dataframe is a multiple of 9 (e.g. 18 people).
Example:
Here is my dataframe:
array_data=([[ 50.56419 , 8.67667 , 2. , 160. ],
[ 50.5740356, 8.6718179, 1. , 5. ],
[ 50.5746321, 8.6831284, 3. , 202. ],
[ 50.5747453, 8.6765588, 4. , 119. ],
[ 50.5748992, 8.6611471, 2. , 260. ],
[ 50.5748992, 8.6611471, 3. , 102. ],
[ 50.575 , 8.65985 , 2. , 267. ],
[ 50.5751 , 8.66027 , 2. , 7. ],
[ 50.5751 , 8.66027 , 2. , 56. ],
[ 50.57536 , 8.67741 , 1. , 194. ],
[ 50.57536 , 8.67741 , 1. , 282. ],
[ 50.5755255, 8.6884584, 0. , 276. ],
[ 50.5755273, 8.674282 , 3. , 167. ],
[ 50.57553 , 8.6826 , 2. , 273. ],
[ 50.5755973, 8.6847492, 0. , 168. ],
[ 50.5756757, 8.6846139, 4. , 255. ],
[ 50.57572 , 8.65965 , 0. , 66. ],
[ 50.57591 , 8.68175 , 1. , 187. ]])
all_persons = pd.DataFrame(data=array_data) # convert back to dataframe
all_persons.rename(columns={0: 'latitude', 1: 'longitude', 2:'floor', 3:'id'}, inplace=True) # rename columns
How can I create this column? As you can see, my approach doesn't work correctly.
This was my approach: Google Colab Link to my solution
temp = ()
temp += (pd.concat([df.loc[users group 1]], keys=[1], names=['group']),)
temp += (pd.concat([df.loc[users group 2]], keys=[2], names=['group']),)
temp += (pd.concat([df.loc[users group 3]], keys=[3], names=['group']),)
df = pd.concat(temp)
Of course you can do this in a loop and locate the users you need in a more elegant way.
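
A minimal sketch of that loop-style variant, using the first six rows from the question. The group_members mapping is hypothetical and picked arbitrarily; it only illustrates how the group label is attached, not how to choose members with distinct locations.

```python
import pandas as pd

# First six rows of the data from the question
all_persons = pd.DataFrame(
    [[50.56419, 8.67667, 2.0, 160.0],
     [50.5740356, 8.6718179, 1.0, 5.0],
     [50.5746321, 8.6831284, 3.0, 202.0],
     [50.5747453, 8.6765588, 4.0, 119.0],
     [50.5748992, 8.6611471, 2.0, 260.0],
     [50.5748992, 8.6611471, 3.0, 102.0]],
    columns=['latitude', 'longitude', 'floor', 'id'])

# Hypothetical assignment: group number -> row labels of its members
group_members = {1: [0, 1, 2], 2: [3, 4, 5]}

# Concatenate the per-group slices; keys become an index level named 'group'
grouped = pd.concat(
    [all_persons.loc[rows] for rows in group_members.values()],
    keys=list(group_members.keys()),
    names=['group'],
)

# Turn the 'group' index level into an ordinary column
with_group = grouped.reset_index(level='group')
print(with_group)
```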

Shuffle with cross_validation

I have a dataset like this:
[ 5. , 2. , 15. , 0.25535303],
[ 5. , 3. , 15. , 6.72465845],
[ 5. , 4. , 15. , 5.62719504],
[ 5. , 5. , 15. , 5.61760597],
[ 5. , 6. , 15. , 4.9561533 ],
[ 6. , 2. , 15. , 0.2709665 ],
[ 6. , 3. , 15. , 6.07004364],
[ 6. , 4. , 15. , 5.62719504],
[ 6. , 5. , 15. , 5.54684885],
[ 6. , 6. , 15. , 5.32846201],
[ 2. , 2. , 20. , 3.79257349],
[ 2. , 3. , 20. , 4.00440964],
[ 2. , 4. , 20. , 4.37965706],
[ 2. , 5. , 20. , 3.92216922],
[ 2. , 6. , 20. , 3.41378368],
[ 3. , 2. , 20. , 0.13500398],
[ 3. , 3. , 20. , 4.38384781],
[ 3. , 4. , 20. , 5.17229688],
[ 3. , 5. , 20. , 5.00464056],
The third column values go from 15 to 35. I wanted to apply cross-validation, but I suspected that each of the K folds would contain only a single value of the third column, which would negatively affect my model.
Therefore, my solution is:
dataset_shuffle = shuffle(dataset)
X = dataset_shuffle[["A", "B", "C"]]
y = dataset_shuffle["D"]
result = cross_validate(estimator,X,y,scoring=scoretypes,cv=5,return_train_score=False)
r2 = result['test_r2'].mean()
mselist = -result['test_neg_mean_squared_error']
rmse = np.sqrt(mselist).mean()
Do you consider this a proper solution to my problem?
Is my solution the same as doing this?:
X = dataset[["A", "B", "C"]]
y = dataset["D"]
cv = KFold(n_splits=5, shuffle=True)
result = cross_validate(estimator,X,y,scoring=scoretypes,cv=cv,return_train_score=False)
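
To see what the shuffle flag changes, the fold membership can be inspected directly. This is a small sketch on 20 made-up rows (the data and random_state=0 are assumptions for illustration): without shuffling, KFold hands out consecutive blocks of rows, which is exactly the situation feared for data ordered by the third column, while shuffle=True draws the rows of each fold at random.

```python
import numpy as np
from sklearn.model_selection import KFold

# 20 dummy rows standing in for a dataset that is ordered by one column
X = np.arange(20).reshape(-1, 1)

plain = KFold(n_splits=5)                                   # consecutive folds
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)  # random folds

plain_tests = [test.tolist() for _, test in plain.split(X)]
shuffled_tests = [sorted(test.tolist()) for _, test in shuffled.split(X)]

print(plain_tests[0])     # a contiguous block of row indices
print(shuffled_tests[0])  # row indices drawn from across the dataset
```

Shuffling the rows first and then using cv=5 follows the same spirit as passing a shuffled KFold, although the two will not generally produce identical folds, since the random draws differ.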

Sort 2D NumPy array by one of the columns

I thought this would be super easy but I am struggling a little. I have a data structure as follows
array([[ 5. , 3.40166205],
[ 10. , 2.72778882],
[ 15. , 2.31881804],
[ 20. , 2.50643777],
[ 1. , 3.94076063],
[ 2. , 3.80598599],
[ 3. , 3.67121134],
[ 6. , 3.2668874 ],
[ 7. , 3.13211276],
[ 8. , 2.99733811],
[ 9. , 2.86256347],
[ 11. , 2.64599467],
[ 12. , 2.56420051],
[ 13. , 2.48240635],
[ 14. , 2.4006122 ],
[ 16. , 1.8280531 ],
[ 17. , 1.74625894],
[ 18. , 1.66446479],
[ 19. , 1.58267063],
[ 20. , 1.50087647]])
And I want to sort it ONLY on the first column ... so it is ordered as follows:
array([[1. , 3.9],
[2. , 3.8],
... ,
[20. , 1.5]])
np.sort doesn't seem to work as it moves array to a flat structure. I've also used itemgetter
from operator import itemgetter
sorted(data, key=itemgetter(1))
But this doesn't give me the output I'm looking for.
Help appreciated!
This is a common numpy idiom. You can use argsort (on the first column) + numpy indexing here -
x[x[:, 0].argsort()]
array([[ 1. , 3.94076063],
[ 2. , 3.80598599],
[ 3. , 3.67121134],
[ 5. , 3.40166205],
[ 6. , 3.2668874 ],
[ 7. , 3.13211276],
[ 8. , 2.99733811],
[ 9. , 2.86256347],
[ 10. , 2.72778882],
[ 11. , 2.64599467],
[ 12. , 2.56420051],
[ 13. , 2.48240635],
[ 14. , 2.4006122 ],
[ 15. , 2.31881804],
[ 16. , 1.8280531 ],
[ 17. , 1.74625894],
[ 18. , 1.66446479],
[ 19. , 1.58267063],
[ 20. , 2.50643777],
[ 20. , 1.50087647]])
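
A short runnable sketch of the idiom on a few rows from the question. The kind='stable' argument is an addition not present in the answer above; it is an assumption that tied keys (the two rows starting with 20) should keep their original relative order.

```python
import numpy as np

# A subset of the rows from the question, including the duplicated key 20
x = np.array([[5.0, 3.40166205],
              [10.0, 2.72778882],
              [1.0, 3.94076063],
              [20.0, 2.50643777],
              [2.0, 3.80598599],
              [20.0, 1.50087647]])

# Indices that sort the first column; a stable sort keeps ties in input order
order = x[:, 0].argsort(kind='stable')

# Fancy indexing reorders whole rows, so the second column follows along
sorted_x = x[order]
print(sorted_x)
```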

Matplotlib RegularPolygon collection location on the canvas

I am trying to plot a feature map (SOM) using python.
To keep it simple, imagine a 2D plot where each unit is represented as a hexagon.
As shown in this topic: Hexagonal Self-Organizing map in Python, the hexagons are located side by side, formatted as a grid.
I managed to write the following piece of code, and it works perfectly for a set number of polygons and for only a few shapes (6 x 6 or 10 x 4 hexagons, for example). However, one important feature of a method like this is to support any grid shape from 3 x 3 upwards.
def plot_map(grid,
             d_matrix,
             w=10,
             title='SOM Hit map'):
    """
    Plot hexagon map where each neuron is represented by a hexagon. The hexagon
    color is given by the distance between the neurons (D-Matrix). Scaled
    hexagons will appear on top of the background image whether the hits array
    is provided. They are scaled according to the number of hits on each
    neuron.
    Args:
    - grid: Grid dictionary (keys: centers, x, y ),
    - d_matrix: array containing the distances between each neuron
    - w: width of the map in inches
    - title: map title
    Returns the Matplotlib SubAxis instance
    """
    n_centers = grid['centers']
    x, y = grid['x'], grid['y']
    fig = plt.figure(figsize=(1.05 * w, 0.85 * y * w / x), dpi=100)
    ax = fig.add_subplot(111)
    ax.axis('equal')
    # Discover difference between centers
    collection_bg = RegularPolyCollection(
        numsides=6,  # a hexagon
        rotation=0,
        sizes=(y * (1.3 * 2 * math.pi * w) ** 2 / x,),
        edgecolors=(0, 0, 0, 1),
        array=d_matrix,
        cmap=cm.gray,
        offsets=n_centers,
        transOffset=ax.transData,
    )
    ax.add_collection(collection_bg, autolim=True)
    ax.axis('off')
    ax.autoscale_view()
    ax.set_title(title)
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="5%", pad=0.05)
    plt.colorbar(collection_bg, cax=cax)
    return ax
I've tried to make something that automatically understands the grid shape. It didn't work (and I'm not sure why). An undesired gap always appears between the hexagons.
Summarising: I would like to generate a 3x3 or 6x6 or 10x4 (and so on) grid of hexagons with no gaps in between, for given points and a set plot width.
As requested, here is the data for the hexagon locations. As you can see, it always follows the same pattern.
3x3
{'centers': array([[ 1.5 , 0.8660254 ],
[ 2.5 , 0.8660254 ],
[ 3.5 , 0.8660254 ],
[ 1. , 1.73205081],
[ 2. , 1.73205081],
[ 3. , 1.73205081],
[ 1.5 , 2.59807621],
[ 2.5 , 2.59807621],
[ 3.5 , 2.59807621]]),
'x': array([ 3.]),
'y': array([ 3.])}
6x6
{'centers': array([[ 1.5 , 0.8660254 ],
[ 2.5 , 0.8660254 ],
[ 3.5 , 0.8660254 ],
[ 4.5 , 0.8660254 ],
[ 5.5 , 0.8660254 ],
[ 6.5 , 0.8660254 ],
[ 1. , 1.73205081],
[ 2. , 1.73205081],
[ 3. , 1.73205081],
[ 4. , 1.73205081],
[ 5. , 1.73205081],
[ 6. , 1.73205081],
[ 1.5 , 2.59807621],
[ 2.5 , 2.59807621],
[ 3.5 , 2.59807621],
[ 4.5 , 2.59807621],
[ 5.5 , 2.59807621],
[ 6.5 , 2.59807621],
[ 1. , 3.46410162],
[ 2. , 3.46410162],
[ 3. , 3.46410162],
[ 4. , 3.46410162],
[ 5. , 3.46410162],
[ 6. , 3.46410162],
[ 1.5 , 4.33012702],
[ 2.5 , 4.33012702],
[ 3.5 , 4.33012702],
[ 4.5 , 4.33012702],
[ 5.5 , 4.33012702],
[ 6.5 , 4.33012702],
[ 1. , 5.19615242],
[ 2. , 5.19615242],
[ 3. , 5.19615242],
[ 4. , 5.19615242],
[ 5. , 5.19615242],
[ 6. , 5.19615242]]),
'x': array([ 6.]),
'y': array([ 6.])}
11x4
{'centers': array([[ 1.5 , 0.8660254 ],
[ 2.5 , 0.8660254 ],
[ 3.5 , 0.8660254 ],
[ 4.5 , 0.8660254 ],
[ 5.5 , 0.8660254 ],
[ 6.5 , 0.8660254 ],
[ 7.5 , 0.8660254 ],
[ 8.5 , 0.8660254 ],
[ 9.5 , 0.8660254 ],
[ 10.5 , 0.8660254 ],
[ 11.5 , 0.8660254 ],
[ 1. , 1.73205081],
[ 2. , 1.73205081],
[ 3. , 1.73205081],
[ 4. , 1.73205081],
[ 5. , 1.73205081],
[ 6. , 1.73205081],
[ 7. , 1.73205081],
[ 8. , 1.73205081],
[ 9. , 1.73205081],
[ 10. , 1.73205081],
[ 11. , 1.73205081],
[ 1.5 , 2.59807621],
[ 2.5 , 2.59807621],
[ 3.5 , 2.59807621],
[ 4.5 , 2.59807621],
[ 5.5 , 2.59807621],
[ 6.5 , 2.59807621],
[ 7.5 , 2.59807621],
[ 8.5 , 2.59807621],
[ 9.5 , 2.59807621],
[ 10.5 , 2.59807621],
[ 11.5 , 2.59807621],
[ 1. , 3.46410162],
[ 2. , 3.46410162],
[ 3. , 3.46410162],
[ 4. , 3.46410162],
[ 5. , 3.46410162],
[ 6. , 3.46410162],
[ 7. , 3.46410162],
[ 8. , 3.46410162],
[ 9. , 3.46410162],
[ 10. , 3.46410162],
[ 11. , 3.46410162]]),
'x': array([ 11.]),
'y': array([ 4.])}
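
The pattern in the listings above can be reproduced programmatically. The helper below is a sketch inferred from those listings; hex_centers is a hypothetical name, not part of the original code. Rows are spaced sqrt(3)/2 apart, and every other row, starting with the first, is shifted right by half a unit.

```python
import numpy as np

def hex_centers(n_cols, n_rows):
    """Return an (n_cols * n_rows, 2) array of hexagon centers, row by row."""
    centers = []
    for row in range(n_rows):
        # Rows 0, 2, 4, ... start at x = 1.5; rows 1, 3, 5, ... start at x = 1.0
        x_shift = 1.5 if row % 2 == 0 else 1.0
        y = (row + 1) * np.sqrt(3) / 2
        for col in range(n_cols):
            centers.append((col + x_shift, y))
    return np.array(centers)

# Build a grid dictionary in the same layout as the 3x3 listing
grid = {'centers': hex_centers(3, 3),
        'x': np.array([3.0]),
        'y': np.array([3.0])}
print(grid['centers'])
```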
I managed to find a workaround by calculating the figure size in inches according to the given dpi. Afterwards, I compute the pixel distance between two adjacent points (by plotting them using a hidden scatter plot). This way I could calculate the hexagon apothem and correctly estimate the size of the hexagon's inner circle (as matplotlib expects).
No gaps in the end!
import matplotlib.pyplot as plt
from matplotlib import colors, cm
from matplotlib.collections import RegularPolyCollection
from mpl_toolkits.axes_grid1 import make_axes_locatable
import math
import numpy as np
def plot_map(grid,
             d_matrix,
             w=1080,
             dpi=72.,
             title='SOM Hit map'):
    """
    Plot hexagon map where each neuron is represented by a hexagon. The hexagon
    color is given by the distance between the neurons (D-Matrix)
    Args:
    - grid: Grid dictionary (keys: centers, x, y ),
    - d_matrix: array containing the distances between each neuron
    - w: width of the map in pixels (converted to inches using dpi)
    - dpi: figure resolution in dots per inch
    - title: map title
    Returns the Matplotlib SubAxis instance
    """
    n_centers = grid['centers']
    x, y = grid['x'], grid['y']
    # Size of figure in inches
    xinch = (x * w / y) / dpi
    yinch = (y * w / x) / dpi
    fig = plt.figure(figsize=(xinch, yinch), dpi=dpi)
    ax = fig.add_subplot(111, aspect='equal')
    # Get pixel size between two data points
    xpoints = n_centers[:, 0]
    ypoints = n_centers[:, 1]
    ax.scatter(xpoints, ypoints, s=0.0, marker='s')
    ax.axis([min(xpoints)-1., max(xpoints)+1.,
             min(ypoints)-1., max(ypoints)+1.])
    xy_pixels = ax.transData.transform(np.vstack([xpoints, ypoints]).T)
    xpix, ypix = xy_pixels.T
    # In matplotlib, 0,0 is the lower left corner, whereas it's usually the
    # upper left for most image software, so we'll flip the y-coords
    width, height = fig.canvas.get_width_height()
    ypix = height - ypix
    # discover radius and hexagon
    apothem = .9 * (xpix[1] - xpix[0]) / math.sqrt(3)
    area_inner_circle = math.pi * (apothem ** 2)
    collection_bg = RegularPolyCollection(
        numsides=6,  # a hexagon
        rotation=0,
        sizes=(area_inner_circle,),
        edgecolors=(0, 0, 0, 1),
        array=d_matrix,
        cmap=cm.gray,
        offsets=n_centers,
        # newer Matplotlib releases name this argument offset_transform
        transOffset=ax.transData,
    )
    ax.add_collection(collection_bg, autolim=True)
    ax.axis('off')
    ax.autoscale_view()
    ax.set_title(title)
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="10%", pad=0.05)
    plt.colorbar(collection_bg, cax=cax)
    return ax
