This paper has a nice way of visualizing clusters of a dataset with binary features by plotting a 2D matrix and sorting the values according to a cluster.
In this case, there are three clusters, as indicated by the black dividing lines; the rows are sorted, and show which examples are in each cluster, and the columns are the features of each example.
Given a vector of cluster assignments and a pandas DataFrame, how can I replicate this using a Python library (e.g. seaborn)? Plotting a DataFrame using seaborn isn't difficult, nor is sorting the rows of the DataFrame to align with the cluster assignments. What I am most interested in is how to display those black dividing lines which delineate each cluster.
Dummy data:
"""
col1 col2
x1_c0 0 1
x2_c0 0 1
================= I want a line drawn here
x3_c1 1 0
================= and here
x4_c2 1 0
"""
import pandas as pd
import seaborn as sns
df = pd.DataFrame(
    data={'col1': [0, 0, 1, 1], 'col2': [1, 1, 0, 0]},
    index=['x1_c0', 'x2_c0', 'x3_c1', 'x4_c2']
)
clus = [0, 0, 1, 2] # This is the cluster assignment
sns.heatmap(df)
The link that mwaskom posted in a comment is a good starting place. The trick is figuring out the coordinates for the vertical and horizontal lines.
To illustrate what the code is actually doing, it's worthwhile to plot all of the lines individually:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df = pd.DataFrame(data={'col1': [0, 0, 1, 1], 'col2': [1, 1, 0, 0]},
                  index=['x1_c0', 'x2_c0', 'x3_c1', 'x4_c2'])
f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(df, ax=ax)
# One vertical line between the two columns
ax.axvline(1, 0, 2, linewidth=3, c='w')
# One horizontal line between each pair of adjacent rows
ax.axhline(1, 0, 1, linewidth=3, c='w')
ax.axhline(2, 0, 1, linewidth=3, c='w')
ax.axhline(3, 0, 1, linewidth=3, c='w')
f.tight_layout()
The way the axvline method works is that the first argument is the x location of the line, followed by the lower and upper bounds of the line (in this case 1, 0, 2). The bounds are fractions of the axes height, where 0 is the bottom and 1 is the top, so anything above 1 simply runs past the top of the plot. The horizontal line takes the y location and then the x start and x stop of the line, again as fractions of the axes. The defaults draw the line across the entire plot, so you can typically leave those out.
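As a minimal sketch of those arguments (the axis limits and the Agg backend here are assumptions chosen only so the snippet runs on its own, without a heatmap), the following draws one partial vertical line and one full-width horizontal line and prints the coordinates matplotlib stored for them:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xlim(0, 2)
ax.set_ylim(0, 4)

# x is in data coordinates; ymin/ymax are fractions of the axes height
half_line = ax.axvline(1, 0, 0.5, linewidth=3, c="k")  # bottom half only

# Leaving out xmin/xmax uses the defaults, which span the full width
full_line = ax.axhline(2, linewidth=3, c="k")

print(half_line.get_xdata(), half_line.get_ydata())
print(full_line.get_xdata(), full_line.get_ydata())
```

Because the bounds are axes fractions rather than data values, the same call keeps working when the heatmap has a different number of rows or columns.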
The code above draws a line between every pair of adjacent rows and columns in the DataFrame. If you want to mark groups in the heatmap instead, create an index in your DataFrame, or some other list of values to loop through. For instance, here is a more complicated case that borrows code from this example:
df = pd.DataFrame(data={'col1': [0, 0, 1, 1, 1.5], 'col2': [1, 1, 0, 0, 2]},
index=['x1_c0', 'x2_c0', 'x3_c1', 'x4_c2', 'x5_c2'])
df['id_'] = df.index
df['group'] = [1, 2, 2, 3, 3]
df.set_index(['group', 'id_'], inplace=True)
df
col1 col2
group id_
1 x1_c0 0.0 1
2 x2_c0 0.0 1
x3_c1 1.0 0
3 x4_c2 1.0 0
x5_c2 1.5 2
Then plot the heatmap with the groups:
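f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(df, ax=ax)
groups = df.index.get_level_values(0)
for i, group in enumerate(groups):
    # Draw a line above row i whenever the group changes.
    # Recent seaborn versions put row 0 at the top, so the boundary is at y = i
    # (older versions needed len(groups) - i).
    if i and group != groups[i - 1]:
        ax.axhline(i, c="w", linewidth=3)
# Single vertical line between the two columns
ax.axvline(1, c="w", linewidth=3)
f.tight_layout()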
Because your heatmap is not symmetric, you may need a separate for loop for the columns.
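If you only have a vector of cluster assignments, as in the question, the line positions can be computed directly instead of building a MultiIndex. The sketch below is one possible way to do it, not the method from the paper: it sorts the rows by cluster and places a boundary wherever the sorted label changes. The names order, clus_sorted and boundaries are made up for illustration, and pcolormesh with an inverted y axis is used as a stand-in for sns.heatmap(df_sorted, ax=ax) so the snippet depends only on matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data={'col1': [0, 0, 1, 1], 'col2': [1, 1, 0, 0]},
    index=['x1_c0', 'x2_c0', 'x3_c1', 'x4_c2'],
)
clus = np.array([0, 0, 1, 2])  # cluster assignment, one label per row

# Sort rows by cluster so that members of a cluster are contiguous
order = np.argsort(clus, kind="stable")
df_sorted = df.iloc[order]
clus_sorted = clus[order]

# A boundary sits above every row whose label differs from the previous row
boundaries = np.flatnonzero(np.diff(clus_sorted)) + 1

fig, ax = plt.subplots()
# Stand-in for sns.heatmap(df_sorted, ax=ax): unit cells, first row on top
ax.pcolormesh(df_sorted.to_numpy(), cmap="Greys")
ax.invert_yaxis()
for y in boundaries:
    ax.axhline(y, c="k", linewidth=3)

print(boundaries)
```

The same boundary computation could be applied to a vector of column groups, drawing the results with ax.axvline instead.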
Related
I have a dataframe with both positive and negative values. I would like to show a bar chart that shows two bars, one bar shows the percentage of positive values and another percentage of negative values.
dummy = pd.DataFrame({'A' : [-4, -3, -1, 0, 1, 2, 3, 4, 5], 'B' : [-4, -3, -1, 0, 1, 2, 3, 4, 5]})
less_than_0 = dummy['A'][dummy['A'] < 0]
greater_than_0 = dummy['A'][dummy['A'] >= 0]
I am able to split the positive and negative values. I tried this using seaborn.
sns.barplot(dummy['A'])
but both positive and negative are coming in single bar.
I tried this too
sns.barplot(less_than_0)
sns.barplot(greater_than_0)
Is there any way to show 2 bars, 1 for percentage of positive values and other for percentage of negative values?
This isn't the most elegant solution, but you can create a new DataFrame that contains two columns: labels that contain the labels you want to display on the x-axis of the barplot, and percentages that contain the percentages of negative and positive values.
Then you can pass these column names with the relevant information to sns.barplot as the x and y parameters.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dummy = pd.DataFrame({'A' : [-4, -3, -1, 0, 1, 2, 3, 4, 5], 'B' : [-4, -3, -1, 0, 1, 2, 3, 4, 5]})
df_percentages = pd.DataFrame({
    'labels': ['Less than 0', 'Greater than or equal to 0'],
    'percentage': [100*x/len(dummy['A']) for x in [sum(dummy['A'] < 0), sum(dummy['A'] >= 0)]]
})
sns.barplot(x='labels', y='percentage', data=df_percentages)
plt.show()
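The percentage column can also be built without counting by hand. This is a small sketch of an alternative, assuming the same dummy data: the mean of a boolean Series is the fraction of True values, so multiplying by 100 gives the percentage. The names is_negative, pct_negative and pct_non_negative are illustrative only.

```python
import pandas as pd

dummy = pd.DataFrame({'A': [-4, -3, -1, 0, 1, 2, 3, 4, 5]})

is_negative = dummy['A'] < 0
# The mean of a boolean Series is the fraction of True values
pct_negative = 100 * is_negative.mean()
pct_non_negative = 100 - pct_negative

df_percentages = pd.DataFrame({
    'labels': ['Less than 0', 'Greater than or equal to 0'],
    'percentage': [pct_negative, pct_non_negative],
})
print(df_percentages)
```

The resulting df_percentages has the same two columns as above, so it can be passed to sns.barplot in the same way.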
You can use Series.value_counts with normalize=True and then plot the resulting Series:
d = {True:'Less than 0',False:'Greater than or equal to 0'}
s = (dummy['A'] < 0).value_counts(normalize=True).rename(d)
print (s)
Greater than or equal to 0 0.666667
Less than 0 0.333333
Name: A, dtype: float64
sns.barplot(x=s.index, y=s.to_numpy())
I am trying to plot a scatter plot of the following type of pandas dataframe:
df = pd.DataFrame([['RH1', 1, 3], ['RH2', 0, 3], ['RH3', 2, 0], ['RH4', 1, 2]], columns=['name', 'A', 'B'])
The final plot should have the "name" column on the Y axis and "A" and "B" on the X axis, with the different numerical values shown in different colours, something like this.
I tried to plot it by looping over each row of the dataframe, but I got stuck. The main problem I ran into is the size of the two axes. It would be great if anyone could help. Thank you in advance.
You can melt your dataframe and use the values as the column for color:
from matplotlib import pyplot as plt
import pandas as pd
df = pd.DataFrame([['RH1', 1, 3], ['RH2', 0, 3], ['RH3', 2, 0], ['RH4', 1, 2]], columns=['name', 'A', 'B'])
df.melt(["name"]).plot(x="variable", y= "name", kind="scatter", c="value", cmap="plasma")
plt.show()
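To see why melting helps, it may be useful to print the reshaped frame. This short sketch uses the same data and only shows what df.melt(["name"]) returns; long_df is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame([['RH1', 1, 3], ['RH2', 0, 3], ['RH3', 2, 0], ['RH4', 1, 2]],
                  columns=['name', 'A', 'B'])

# Wide to long: one row per (name, column) pair.
# "variable" holds the original column name, "value" holds the cell content.
long_df = df.melt(["name"])
print(long_df)
```

Each row of the long frame is one point of the scatter plot: "variable" becomes the x position, "name" the y position, and "value" the colour.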
If you have a limited number of values, you can change the colormap to a discrete colormap and label each color with its value. Alternatively, use seaborn's stripplot:
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns
df = pd.DataFrame([['RH1', 1, 3], ['RH2', 0, 3], ['RH3', 2, 0], ['RH4', 1, 2]], columns=['name', 'A', 'B'])
sns.stripplot(data=df.melt(["name"]), x="variable", y= "name", hue="value", jitter=False)
plt.show()
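For the discrete-colormap variant mentioned above, one possible sketch is to convert each distinct value to an integer code and give every code its own colour and colorbar tick. The choice of sampling the plasma colormap, and the names levels, codes and long_df, are assumptions made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.colors import ListedColormap

df = pd.DataFrame([['RH1', 1, 3], ['RH2', 0, 3], ['RH3', 2, 0], ['RH4', 1, 2]],
                  columns=['name', 'A', 'B'])
long_df = df.melt(["name"])

# Distinct values, sorted, and the position of each cell's value in that list
levels = np.sort(long_df["value"].unique())
codes = np.searchsorted(levels, long_df["value"])

# One colour per distinct value, sampled from plasma
cmap = ListedColormap(plt.cm.plasma(np.linspace(0, 1, len(levels))))

fig, ax = plt.subplots()
sc = ax.scatter(long_df["variable"], long_df["name"], c=codes, cmap=cmap,
                vmin=-0.5, vmax=len(levels) - 0.5)

# Put one tick in the middle of each colour band and label it with the value
cbar = fig.colorbar(sc, ax=ax)
cbar.set_ticks(np.arange(len(levels)))
cbar.set_ticklabels([str(v) for v in levels])
```

Setting vmin and vmax half a step outside the code range keeps each integer code centred in its colour band.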
I have a dataframe (97 columns x 30 rows). In this dataframe there are only 1s and 0s.
I want to plot it like a scatter plot, in which the x axis shows the names of the columns and the y axis shows the names of the indexes.
[my dataframe is like this][1]
The output I want is similar to the photo, but the red dot must be there only if the value at the intersection of row and column is 1. If the value is 0, nothing is plotted at the intersection. [][2] [the output scatter plot I want][3]
[1]: https://i.stack.imgur.com/hFnQX.png
[2]: https://i.stack.imgur.com/Rsguk.jpg
[3]: https://i.stack.imgur.com/keGC6.png
A straightforward way to do this is to use two nested loops for plotting the points conditionally on each dataframe cell:
import pandas as pd
import matplotlib.pyplot as plt
example = pd.DataFrame({'column 1': [0, 1, 0, 1],
                        'column 2': [1, 0, 1, 0],
                        'column 3': [1, 1, 0, 0]})
for x, col in enumerate(example.columns):
    for y, ind in enumerate(example.index):
        if example.loc[ind, col]:
            plt.plot(x, y, 'o', color='red')
plt.xticks(range(len(example.columns)), labels=example.columns)
plt.yticks(range(len(example)), labels=example.index)
plt.show()
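With 97 columns and 30 rows the nested loops are still fast enough, but the positions of the 1s can also be collected in a single call. The following is a sketch of that alternative, assuming the same example frame; it uses np.nonzero to get the row and column positions of all non-zero cells and plots them in one scatter call.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

example = pd.DataFrame({'column 1': [0, 1, 0, 1],
                        'column 2': [1, 0, 1, 0],
                        'column 3': [1, 1, 0, 0]})

# Row and column positions of every non-zero cell
rows, cols = np.nonzero(example.to_numpy())

fig, ax = plt.subplots()
ax.scatter(cols, rows, color='red')

# Label the integer positions with the column and index names
ax.set_xticks(range(len(example.columns)))
ax.set_xticklabels(example.columns)
ax.set_yticks(range(len(example)))
ax.set_yticklabels([str(i) for i in example.index])
```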
This is an example dataframe:
import pandas as pd
import numpy as np
values = np.array([
    [0, 1, 2, 0, 0, 4],
    [1, 0, 0, 1, 1, 0],
    [0, 4, 0, 0, 2, 1],
    [2, 0, 2, 0, 4, 0],
])
indexes= 'a','b','c','d'
columns='ab','bc','cd','de','ef','fg'
df = pd.DataFrame(index=indexes,columns=columns, data=values)
print(df)
From this dataframe I need to create a series of pie charts, one for every column, shown on the same figure, where the size of each slice is fixed (equal to 100/len(indexes)) and the color of each slice depends on the value in that row: white if 0, green if 1, yellow if 2, red if 4.
What suggestions can you give me?
I found that:
df.plot(kind='pie', subplots=True, figsize=(len(columns)*2, 2))
It creates a series of pie charts, but I can't control the input values.
I've created a pie for a column, but then I wasn't able to link the color to the value of index:
labels = indexes
sizes = np.linspace(100/len(labels),100/len(labels), num=len(labels))
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels)
ax1.axis('equal')
plt.show()
ImportanceOfBeingErnest's answer helped me give the pie chart the look I wanted:
fig1, ax1 = plt.subplots()
labels = indexes
sizes = np.linspace(100/len(labels),100/len(labels), num=len(labels))
coldic = {0 : "w", 1 : "g", 2 : "y", 4 : "r" }
colors = [coldic[v] for v in values[:,0]]
ax1.pie(sizes, labels=labels, colors=colors,counterclock=False, startangle=90)
ax1.axis('equal')
plt.show()
Now the colors are linked to the values, and the sizes of the slices are fixed. I just need the same pie chart for all the columns, in the same image.
What matters in these charts is the colors, not the sizes of the slices, which I want to always be equal.
Thanks for your time!
Without relying on pandas' internal plotting functions (which are of course limited), one can use matplotlib's pie function to plot the diagrams.
The colors can be set as a list, which is generated from the values according to some mapping dictionary.
import numpy as np
import matplotlib.pyplot as plt
coldic = {0 : "w", 1 : "g", 2 : "y", 4 : "r" }
values = np.array([
[0, 1, 2, 0, 0, 4],
[1, 0, 0, 1, 1, 0 ],
[0, 4, 0, 0, 2, 1],
[2, 0, 2, 0, 4, 0],
])
labels= ['a','b','c','d']
fig1, axes = plt.subplots(ncols=values.shape[1], )
for i in range(values.shape[1]):
    colors = [coldic[v] for v in values[:,i]]
    labs = [l if values[j,i] > 0 else "" for j, l in enumerate(labels)]
    axes[i].pie(values[:,i], labels=labs, colors=colors)
    axes[i].set_aspect("equal")
plt.show()
For fixed wedge sizes you just use a fixed array to supply to pie.
import numpy as np
import matplotlib.pyplot as plt
coldic = {0 : "w", 1 : "g", 2 : "y", 4 : "r" }
values = np.array([
[0, 1, 2, 0, 0, 4],
[1, 0, 0, 1, 1, 0 ],
[0, 4, 0, 0, 2, 1],
[2, 0, 2, 0, 4, 0],
])
labels= ['a','b','c','d']
fig1, axes = plt.subplots(ncols=values.shape[1], )
for i in range(values.shape[1]):
    colors = [coldic[v] for v in values[:,i]]
    axes[i].pie(np.ones(values.shape[0]), labels=labels, colors=colors,
                wedgeprops=dict(linewidth=1, edgecolor="k"))
    axes[i].set_aspect("equal")
    axes[i].set_title("".join(list(map(str,values[:,i]))))
plt.show()
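The same idea can be applied to the DataFrame from the question directly, so that each pie is titled with its column name. This is a sketch under that assumption, not part of the original answer; colors_by_column is an illustrative name that is kept only so the mapping can be inspected, and the figsize is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

coldic = {0: "w", 1: "g", 2: "y", 4: "r"}
values = np.array([
    [0, 1, 2, 0, 0, 4],
    [1, 0, 0, 1, 1, 0],
    [0, 4, 0, 0, 2, 1],
    [2, 0, 2, 0, 4, 0],
])
df = pd.DataFrame(values, index=['a', 'b', 'c', 'd'],
                  columns=['ab', 'bc', 'cd', 'de', 'ef', 'fg'])

fig, axes = plt.subplots(ncols=df.shape[1], figsize=(2 * df.shape[1], 2))
colors_by_column = {}
for ax, col in zip(axes, df.columns):
    # Colour of each wedge comes from the cell value, not from the wedge size
    colors = [coldic[v] for v in df[col]]
    colors_by_column[col] = colors
    # An array of ones gives every wedge the same size
    ax.pie(np.ones(len(df)), labels=list(df.index), colors=colors,
           counterclock=False, startangle=90,
           wedgeprops=dict(linewidth=1, edgecolor="k"))
    ax.set_title(col)
```

pie normalises its input, so any constant array gives equal wedges; np.ones is just the simplest choice.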
I have a dataframe with 3 columns. The first two columns are my data. The third column only takes on binary values, 0 or 1. I'd like to plot the first two columns such that the points are color coded (in two colors) depending upon whether the corresponding value in the third column is 0 or 1.
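import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(dict(A=[1, 2, 3, 4],
                       B=[7.5, 7, 5, 4.5],
                       C=[0, 1, 1, 0]))
# Map each binary value in column C to a color name
colors = {0: 'red', 1: 'aqua'}
plt.scatter(df.A, df.B, c=df.C.map(colors))
plt.show()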
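If a legend is wanted as well, one option is to draw each class in its own scatter call. The sketch below assumes the same data and colour dictionary; the point_colors name and the "C = ..." label text are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3, 4],
                       B=[7.5, 7, 5, 4.5],
                       C=[0, 1, 1, 0]))
colors = {0: 'red', 1: 'aqua'}

# Series.map looks up every value of C in the dictionary
point_colors = df['C'].map(colors)
print(point_colors.tolist())

fig, ax = plt.subplots()
# One scatter call per class, so each class gets its own legend entry
for value, group in df.groupby('C'):
    ax.scatter(group['A'], group['B'], color=colors[value], label=f"C = {value}")
ax.legend()
```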