I want to represent a correlation matrix using a heatmap. R has something called a correlogram, but I don't think there is an equivalent in Python.
How can I do this? The values go from -1 to 1, for example:
[[ 1. 0.00279981 0.95173379 0.02486161 -0.00324926 -0.00432099]
[ 0.00279981 1. 0.17728303 0.64425774 0.30735071 0.37379443]
[ 0.95173379 0.17728303 1. 0.27072266 0.02549031 0.03324756]
[ 0.02486161 0.64425774 0.27072266 1. 0.18336236 0.18913512]
[-0.00324926 0.30735071 0.02549031 0.18336236 1. 0.77678274]
[-0.00432099 0.37379443 0.03324756 0.18913512 0.77678274 1. ]]
I was able to produce the following heatmap based on another question, but the problem is that my values get 'cut' at 0. I would like a colormap that goes from blue (-1) to red (1), or something similar; as it stands, values below 0 are not shown adequately.
Here's the code for that:
plt.imshow(correlation_matrix,cmap='hot',interpolation='nearest')
Another alternative is to use the heatmap function in seaborn to plot the correlation matrix. This example uses the Auto data set from the ISLR package in R (the same as in the example you showed).
# note: pandas.rpy was removed in pandas 0.20; newer setups need rpy2 instead
import pandas.rpy.common as com
import seaborn as sns
%matplotlib inline
# load the R package ISLR
infert = com.importr("ISLR")
# load the Auto dataset
auto_df = com.load_data('Auto')
# calculate the correlation matrix
corr = auto_df.corr()
# plot the heatmap
sns.heatmap(corr,
xticklabels=corr.columns,
yticklabels=corr.columns)
If you wanted to be even more fancy, you can use Pandas Style, for example:
cmap = sns.diverging_palette(5, 250, as_cmap=True)
def magnify():
    return [dict(selector="th",
                 props=[("font-size", "7pt")]),
            dict(selector="td",
                 props=[('padding', "0em 0em")]),
            dict(selector="th:hover",
                 props=[("font-size", "12pt")]),
            dict(selector="tr:hover td:hover",
                 props=[('max-width', '200px'),
                        ('font-size', '12pt')])
            ]
corr.style.background_gradient(cmap, axis=1)\
    .set_properties(**{'max-width': '80px', 'font-size': '10pt'})\
    .set_caption("Hover to magnify")\
    .format(precision=2)\
    .set_table_styles(magnify())
How about this one?
import seaborn as sb
corr = df.corr()
sb.heatmap(corr, cmap="Blues", annot=True)
If your data is in a Pandas DataFrame, you can use Seaborn's heatmap function to create your desired plot.
import seaborn as sns
Var_Corr = df.corr()
# plot the heatmap and annotation on it
sns.heatmap(Var_Corr, xticklabels=Var_Corr.columns, yticklabels=Var_Corr.columns, annot=True)
Correlation plot
From the question, it looks like the data is in a NumPy array. If that array has the name numpy_data, before you can use the step above, you would want to put it into a Pandas DataFrame using the following:
import pandas as pd
df = pd.DataFrame(numpy_data)
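As a rough sketch of both steps together: the block below puts the matrix from the question into a DataFrame and plots it with the color scale pinned to the full correlation range, so that values below 0 are not 'cut'. The column labels v1 to v6 and the choice of the "coolwarm" colormap are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Correlation matrix copied from the question
numpy_data = np.array([
    [1.0, 0.00279981, 0.95173379, 0.02486161, -0.00324926, -0.00432099],
    [0.00279981, 1.0, 0.17728303, 0.64425774, 0.30735071, 0.37379443],
    [0.95173379, 0.17728303, 1.0, 0.27072266, 0.02549031, 0.03324756],
    [0.02486161, 0.64425774, 0.27072266, 1.0, 0.18336236, 0.18913512],
    [-0.00324926, 0.30735071, 0.02549031, 0.18336236, 1.0, 0.77678274],
    [-0.00432099, 0.37379443, 0.03324756, 0.18913512, 0.77678274, 1.0],
])

# Hypothetical labels; replace them with the real variable names
names = [f"v{i}" for i in range(1, 7)]
df = pd.DataFrame(numpy_data, index=names, columns=names)

# vmin/vmax fix the scale to [-1, 1]; "coolwarm" runs from blue (low) to red (high)
ax = sns.heatmap(df, cmap="coolwarm", vmin=-1, vmax=1, annot=True, fmt=".2f")
```

With vmin and vmax set symmetrically around 0, a correlation of 0 lands in the middle of the colormap regardless of the values actually present in the data.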
The code below will produce this plot:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# A list with your data slightly edited
l = [1.0,0.00279981,0.95173379,0.02486161,-0.00324926,-0.00432099,
0.00279981,1.0,0.17728303,0.64425774,0.30735071,0.37379443,
0.95173379,0.17728303,1.0,0.27072266,0.02549031,0.03324756,
0.02486161,0.64425774,0.27072266,1.0,0.18336236,0.18913512,
-0.00324926,0.30735071,0.02549031,0.18336236,1.0,0.77678274,
-0.00432099,0.37379443,0.03324756,0.18913512,0.77678274,1.00]
# Split list
n = 6
data = [l[i:i + n] for i in range(0, len(l), n)]
# A dataframe
df = pd.DataFrame(data)
def CorrMtx(df, dropDuplicates = True):
    # Your dataset is already a correlation matrix.
    # If you have a dataset where you need to include the calculation
    # of a correlation matrix, just uncomment the line below:
    # df = df.corr()
    # Exclude duplicate correlations by masking upper right values
    if dropDuplicates:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
    # Set background color / chart style
    sns.set_style(style = 'white')
    # Set up matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))
    # Add diverging colormap from blue (negative) to red (positive)
    cmap = sns.diverging_palette(250, 10, as_cmap=True)
    # Draw correlation plot with or without duplicates
    if dropDuplicates:
        sns.heatmap(df, mask=mask, cmap=cmap,
                    square=True,
                    linewidth=.5, cbar_kws={"shrink": .5}, ax=ax)
    else:
        sns.heatmap(df, cmap=cmap,
                    square=True,
                    linewidth=.5, cbar_kws={"shrink": .5}, ax=ax)
CorrMtx(df, dropDuplicates = False)
I put this together after it was announced that the outstanding seaborn corrplot was to be deprecated. The snippet above makes a similar correlation plot based on seaborn's heatmap. You can also specify the color range and choose whether or not to drop duplicate correlations. Notice that I've used the same numbers as you, but put them in a pandas DataFrame. Regarding the choice of colors, have a look at the documentation for sns.diverging_palette. You asked for blue, but with your sample data that part of the color scale is barely used, because the only negative values are very close to 0. For both observations of 0.95173379, try changing to -0.95173379 and you'll get this:
import numpy as np
import seaborn as sns
# label to make it neater
labels = {
's1':'vibration sensor',
'temp':'outer temperature',
'actPump':'flow rate',
'pressIn':'input pressure',
'pressOut':'output pressure',
'DrvActual':'actual RPM',
'DrvSetPoint':'desired RPM',
'DrvVolt':'input voltage',
'DrvTemp':'inside temperature',
'DrvTorque':'motor torque'}
corr = corr.rename(labels)
# remove the top right triangle - duplicate information
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Colors
cmap = sns.diverging_palette(500, 10, as_cmap=True)
# uncomment this if you want only the lower triangle matrix
# ans=sns.heatmap(corr, mask=mask, linewidths=1, cmap=cmap, center=0)
ans=sns.heatmap(corr, linewidths=1, cmap=cmap, center=0)
#save image
figure = ans.get_figure()
figure.savefig('correlations.png', dpi=800)
These are all reasonable answers, and it seems like the question has mostly been settled, but I thought I'd add one that doesn't use matplotlib/seaborn. In particular this solution uses altair which is based on a grammar of graphics (which might be a little more familiar to someone coming from ggplot).
# import libraries
import pandas as pd
import altair as alt
# download dataset and create correlation
df = pd.read_json("https://raw.githubusercontent.com/vega/vega-datasets/master/data/penguins.json")
# numeric_only=True skips the string columns (required on pandas >= 2.0)
corr_df = df.corr(numeric_only=True)
# data preparation
pivot_cols = list(corr_df.columns)
corr_df['cat'] = corr_df.index
# actual chart
alt.Chart(corr_df).mark_rect(tooltip=True)\
.transform_fold(pivot_cols)\
.encode(
x="cat:N",
y='key:N',
color=alt.Color("value:Q", scale=alt.Scale(scheme="redyellowblue"))
)
This yields
If you should find yourself needing labels in those cells, you can just swap the #actual chart section for something like
base = alt.Chart(corr_df).transform_fold(pivot_cols).encode(x="cat:N", y='key:N').properties(height=300, width=300)
boxes = base.mark_rect().encode(color=alt.Color("value:Q", scale=alt.Scale(scheme="redyellowblue")))
labels = base.mark_text(size=30, color="white").encode(text=alt.Text("value:Q", format="0.1f"))
boxes + labels
Use the 'jet' colormap for a transition between blue and red.
Use pcolor() with the vmin, vmax parameters.
It is detailed in this answer:
https://stackoverflow.com/a/3376734/21974
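A minimal sketch of that suggestion, assuming a small made-up correlation matrix rather than the data from the question. The 'jet' colormap is the one named above; a diverging map such as 'coolwarm' or 'RdBu_r' can be swapped in the same way.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up symmetric correlation matrix, for illustration only
corr = np.array([
    [1.0, -0.8, 0.1],
    [-0.8, 1.0, 0.4],
    [0.1, 0.4, 1.0],
])

fig, ax = plt.subplots()
# vmin/vmax pin the color scale to the full correlation range
mesh = ax.pcolor(corr, cmap="jet", vmin=-1, vmax=1)
ax.invert_yaxis()  # put row 0 at the top, like a printed matrix
fig.colorbar(mesh, ax=ax)
```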
I have the following synthetic dataframe, including numerical and categorical columns as well as the label column.
I want to plot a diagonal correlation matrix and display correlation coefficients in the upper part as the following:
expected output:
Setting aside the fact that the categorical columns in the dataframe df need to be converted to numerical ones, so far I have used this seaborn example with the 'titanic' dataset, which serves as my synthetic data and fits my task, and added a label column as follows:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white")
# Generate a large random dataset with synthetic nature (categorical + numerical)
data = sns.load_dataset("titanic")
df = pd.DataFrame(data=data)
# Generate label column randomly '0' or '1'
df['label'] = np.random.randint(0,2, size=len(df))
# Compute the correlation matrix
corr = df.corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1.0, vmax=1.0, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
I checked a related post but couldn't figure out how to do this. The best I have found so far is this workaround, which uses the package installed below and gives me the following output:
#!pip install heatmapz
# Import the two methods from heatmap library
from heatmap import heatmap, corrplot
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white")
# Generate a large random dataset
data = sns.load_dataset("titanic")
df = pd.DataFrame(data=data)
# Generate label column randomly '0' or '1'
df['label'] = np.random.randint(0,2, size=len(df))
# Compute the correlation matrix
corr = df.corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
mask[np.diag_indices_from(mask)] = False
np.fill_diagonal(mask, True)
# Set up the matplotlib figure
plt.figure(figsize=(8, 8))
# Draw the heatmap using "Heatmapz" package
corrplot(corr[mask], size_scale=300)
Sadly, corr[mask] doesn't mask the upper triangle in this package.
I also noticed that this kind of plot is much easier to produce in R, so I'm open to a more straightforward way of converting a pandas DataFrame to an R data frame. There is a package called rpy2 that lets us use Python and R together, even in a Google Colab notebook: Ref.1
from rpy2.robjects import pandas2ri
pandas2ri.activate()
If that is the way to go, I found post1 and post2, which use R to visualize a correlation matrix.
So, in short, my first priority is Python and its packages (Matplotlib, seaborn, Plotly Express), and only then R and its packages, to reach the expected output.
Note
I provided executable code in a Google Colab notebook with R and the dataset, so that you can build and test your answer if your solution uses rpy2; otherwise I'd be interested in a Pythonic solution.
I'm not an expert in rpy2, so I can't help there, but here is how I would build it out in R. Since I don't have your data, I can't promise that everything will work perfectly for your dataset, but here is a general outline:
library(tidyverse)
#get some data
df <- as_tibble(mtcars) |>
(\(d) select(d, order(colnames(d))))()
#calculate correlation matrix
cor_mat <- cor(df)
#make 2 "blank" matrices
low <- matrix(NA, nrow = nrow(cor_mat), ncol = ncol(cor_mat))
up <- matrix(NA, nrow = nrow(cor_mat), ncol = ncol(cor_mat))
#populate upper and lower matrices
up[upper.tri(up)] <- cor_mat[upper.tri(cor_mat)]
low[lower.tri(low)] <- cor_mat[lower.tri(cor_mat)]
#pivot upper and lower for plotting
lower_dat <- low|>
as.data.frame() |>
`colnames<-`(colnames(df)) |>
mutate(xvar = colnames(df)) |>
pivot_longer(cols = -xvar, names_to = "yvar")
upper_dat <- up|>
as.data.frame() |>
`colnames<-`(colnames(df)) |>
mutate(xvar = colnames(df)) |>
pivot_longer(cols = -xvar, names_to = "yvar")
#plot
lower_dat|> #lower matrix data
ggplot(aes((xvar), yvar))+
geom_tile(fill = NA, color = "grey")+ #background grid
geom_point(aes(fill = value, size = value), pch = 22)+ # different sized points
geom_text(data = upper_dat, aes(color = value, label = round(value, 2)))+ #plot cor in upper right
scale_size_continuous(breaks = seq(-1, 1, by = 0.5))+ # define size breaks
labs(x = "", y = "")+ #remove unnecessary labels
scale_fill_gradient2(low = "darkred",mid = "white", high = "darkblue", midpoint = 0)+ #define square colors
scale_color_gradient2(low = "darkred",mid = "white", high = "darkblue", midpoint = 0)+ #define text colors
scale_x_discrete(limits = rev)+# rev to put the triangle on a certain side
#make it look pretty
theme(panel.background = element_blank(),
panel.border = element_rect(fill = NA, color = "black"),
axis.text = element_text(color = "black", size = 10),
axis.title = element_text(size = 12))
Another option is creating two corrplots from the corrplot package in R. You can specify one plot with add=TRUE to combine both plots. Here is a reproducible example with mtcars dataset:
library(corrplot)
M<-cor(mtcars)
diag(M) <- 0
corrplot(M, method="number", type = "upper", tl.pos = "t")
corrplot(M, method="square", type = "lower", tl.pos = "l", cl.pos = "n", add = TRUE)
Output:
I'd be interested in a Pythonic solution.
Use a seaborn scatter plot with matplotlib text/line annotations:
Plot the lower triangle via sns.scatterplot with square markers
Annotate the upper triangle via plt.text
Draw the heatmap grid via plt.vlines and plt.hlines
Full code using the titanic sample:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="white")
# generate sample correlation matrix
df = sns.load_dataset("titanic")
df["label"] = np.random.randint(0, 2, size=len(df))
# numeric_only=True skips the categorical columns (required on pandas >= 2.0)
corr = df.corr(numeric_only=True)
# mask and melt correlation matrix
mask = np.tril(np.ones_like(corr, dtype=bool)) | corr.abs().le(0.1)
melt = corr.mask(mask).melt(ignore_index=False).reset_index()
melt["size"] = melt["value"].abs()
fig, ax = plt.subplots(figsize=(8, 6))
# normalize colorbar
cmap = plt.cm.RdBu
norm = plt.Normalize(-1, 1)
sm = plt.cm.ScalarMappable(norm=norm, cmap=cmap)
cbar = plt.colorbar(sm, ax=ax)
cbar.ax.tick_params(labelsize="x-small")
# plot lower triangle (scatter plot with normalized hue and square markers)
sns.scatterplot(ax=ax, data=melt, x="index", y="variable", size="size",
hue="value", hue_norm=norm, palette=cmap,
style=0, markers=["s"], legend=False)
# format grid
xmin, xmax = (-0.5, corr.shape[0] - 0.5)
ymin, ymax = (-0.5, corr.shape[1] - 0.5)
ax.vlines(np.arange(xmin, xmax + 1), ymin, ymax, lw=1, color="silver")
ax.hlines(np.arange(ymin, ymax + 1), xmin, xmax, lw=1, color="silver")
ax.set(aspect=1, xlim=(xmin, xmax), ylim=(ymax, ymin), xlabel="", ylabel="")
ax.tick_params(labelbottom=False, labeltop=True)
plt.xticks(rotation=90)
# annotate upper triangle
for y in range(corr.shape[0]):
    for x in range(corr.shape[1]):
        value = corr.mask(mask).to_numpy()[y, x]
        if pd.notna(value):
            plt.text(x, y, f"{value:.2f}", size="x-small",
                     # color=sm.to_rgba(value), weight="bold",
                     ha="center", va="center")
Note that since most of these titanic correlations are low, I disabled the text coloring for readability.
If you want color-coded text, uncomment the color=sm.to_rgba(value) line at the end:
I cannot set up the heatmap package on Windows, but have you tried setting the upper-triangle elements to NaN?
corr_masked = corr.copy()
corr_masked[mask] = np.nan
corrplot(corr_masked, size_scale=300)
plt.plot, for example, does not plot NaN samples, so the same trick may work here. If not, just setting the upper-triangle elements to 0 may suffice (or whatever value corresponds to white on the scale).
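The NaN step on its own can be sketched with pandas and NumPy alone, here on a made-up 3x3 correlation matrix. Whether corrplot from heatmapz then skips the NaN cells is not verified here; the block only shows how the masked matrix is built.

```python
import numpy as np
import pandas as pd

# Made-up symmetric correlation matrix, for illustration only
corr = pd.DataFrame(
    [[1.0, 0.5, -0.3],
     [0.5, 1.0, 0.2],
     [-0.3, 0.2, 1.0]],
    index=list("abc"), columns=list("abc"),
)

# True strictly above the diagonal (k=1 leaves the diagonal unmasked)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# DataFrame.mask replaces the cells where the condition is True with NaN
corr_masked = corr.mask(mask)
print(corr_masked)
```

Using DataFrame.mask returns a new object, so the original corr stays intact for any later plot that needs the full matrix.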
This is the color palette seaborn used by default when I colored the scatter points by a column with categorical variables.
Is there a way to get the name or the colors of the palette being used?
I get this color scheme at first, but as soon as I use a different scheme for a plotly chart, I can't get back to this palette for the same chart.
This is not the scheme returned by sns.color_palette. It could also be a matplotlib color scheme.
Minimum reproducible example
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
# download data
df = pd.read_csv("https://www.statlearning.com/s/Auto.csv")
df.head()
# remove rows with "?"
df.drop(df.index[~df.horsepower.str.isnumeric()], axis=0, inplace=True)
df['horsepower'] = pd.to_numeric(df.horsepower, errors='coerce')
# plot 1 (gives the desired color-palette)
fig = sns.PairGrid(df, vars=df.columns[~df.columns.isin(['cylinders','origin','name'])].tolist(), hue='cylinders')
plt.gcf().set_size_inches(17,15)
fig.map_diag(sns.histplot)
fig.map_upper(sns.scatterplot)
fig.map_lower(sns.kdeplot)
fig.add_legend(ncol=5, loc=1, bbox_to_anchor=(0.5, 1.05), borderaxespad=0, frameon=False);
# plot 2
# Converting column cylinder to factor before using for 'color'
df.cylinders = df.cylinders.astype('category')
# Scatter plot - Cylinders as hue
pal = ['#fdc086','#386cb0','#beaed4','#33a02c','#f0027f']
col_map = dict(zip(sorted(df.cylinders.unique()), pal))
fig = px.scatter(df, y='mpg', x='year', color='cylinders',
color_discrete_map=col_map,
hover_data=['name','origin'])
fig.update_layout(width=800, height=400, plot_bgcolor='#fff')
fig.update_traces(marker=dict(size=8, line=dict(width=0.2,color='DarkSlateGrey')),
selector=dict(mode='markers'))
fig.show()
# plot 1 run again
fig = sns.PairGrid(df, vars=df.columns[~df.columns.isin(['cylinders','origin','name'])].tolist(), hue='cylinders')
plt.gcf().set_size_inches(17,15)
fig.map_diag(sns.histplot)
fig.map_upper(sns.scatterplot)
fig.map_lower(sns.kdeplot)
fig.add_legend(ncol=5, loc=1, bbox_to_anchor=(0.5, 1.05), borderaxespad=0, frameon=False);
The specific palette you have mentioned is the cubehelix and you can get it using:
sns.cubehelix_palette()
You can get the colours using indexing:
sns.cubehelix_palette()[:]
# [[0.9312692223325372, 0.8201921796082118, 0.7971480974663592],
# [0.8559578605899612, 0.6418993116910497, 0.6754191211563135],
# [0.739734329496642, 0.4765280683170713, 0.5959617419736206],
# [0.57916573903086, 0.33934576125314425, 0.5219003947563425],
# [0.37894937987025, 0.2224702044652721, 0.41140014301575434],
# [0.1750865648952205, 0.11840023306916837, 0.24215989137836502]]
In general, check the official docs. When you need to find a seaborn default (the only case where you won't know the palette, since otherwise you defined it yourself), you can always check the code on GitHub (e.g. here or here).
In your first graph the cylinders is a continuous variable of type int64 and seaborn is using a single color, in this case purple, and shading it to indicate the scale of the value, so that 8 cylinders would be darker than 4. This is done on purpose so you can easily tell what is what by the shade of the color.
Once you convert to categorical halfway through, there is no longer such a relationship between the cylinder values, i.e. 8 cylinders is not twice as many as 4 cylinders anymore; they are essentially two totally different categories. To avoid associating the shade of color with the scale of the variable (since the values are no longer continuous and the relationship doesn't exist), a categorical color palette will be used by default, such that every color is distinct from the others.
In order to solve your problem, you will need to cast cylinders back to int64 prior to running your final chart with
df.cylinders = df.cylinders.astype('int64')
This will restore the variable to a continuous one and will allow seaborn to use gradients of the same color to represent the size of the values and your final plot will look just like the first one.
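The dtype round trip described here can be checked in isolation. This is only a sketch on a made-up Series standing in for the cylinders column; it shows the dtype changes, not seaborn's palette selection itself.

```python
import pandas as pd

# Made-up cylinder counts standing in for the Auto data
cylinders = pd.Series([4, 6, 8, 4, 8], name="cylinders")

# What the plotly step does: numeric -> categorical
as_category = cylinders.astype("category")

# The suggested fix: categorical -> numeric again
back_to_int = as_category.astype("int64")

print(cylinders.dtype, as_category.dtype, back_to_int.dtype)
print(pd.api.types.is_numeric_dtype(as_category))
print(pd.api.types.is_numeric_dtype(back_to_int))
```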
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
# download data
df = pd.read_csv("https://www.statlearning.com/s/Auto.csv")
df.head()
# remove rows with "?"
df.drop(df.index[~df.horsepower.str.isnumeric()], axis=0, inplace=True)
df['horsepower'] = pd.to_numeric(df.horsepower, errors='coerce')
# plot 1 (gives the desired color-palette)
fig = sns.PairGrid(df, vars=df.columns[~df.columns.isin(['cylinders','origin','name'])].tolist(), hue='cylinders')
plt.gcf().set_size_inches(17,15)
fig.map_diag(sns.histplot)
fig.map_upper(sns.scatterplot)
fig.map_lower(sns.kdeplot)
fig.add_legend(ncol=5, loc=1, bbox_to_anchor=(0.5, 1.05), borderaxespad=0, frameon=False);
# plot 2
# Converting column cylinder to factor before using for 'color'
df.cylinders = df.cylinders.astype('category')
# Scatter plot - Cylinders as hue
pal = ['#fdc086','#386cb0','#beaed4','#33a02c','#f0027f']
col_map = dict(zip(sorted(df.cylinders.unique()), pal))
fig = px.scatter(df, y='mpg', x='year', color='cylinders',
color_discrete_map=col_map,
hover_data=['name','origin'])
fig.update_layout(width=800, height=400, plot_bgcolor='#fff')
fig.update_traces(marker=dict(size=8, line=dict(width=0.2,color='DarkSlateGrey')),
selector=dict(mode='markers'))
fig.show()
# plot 1 run again
df.cylinders = df.cylinders.astype('int64')
fig = sns.PairGrid(df, vars=df.columns[~df.columns.isin(['cylinders','origin','name'])].tolist(), hue='cylinders')
plt.gcf().set_size_inches(17,15)
fig.map_diag(sns.histplot)
fig.map_upper(sns.scatterplot)
fig.map_lower(sns.kdeplot)
fig.add_legend(ncol=5, loc=1, bbox_to_anchor=(0.5, 1.05), borderaxespad=0, frameon=False);
Output
One way to get it back is with sns.set(). But that doesn't tell us the name of the color scheme.
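To at least see which colors are active after the reset, the current palette can be listed as hex codes. This is a sketch of inspecting the default categorical color cycle only; it gives the colors rather than a palette name, and it does not cover the gradient seaborn picks for a numeric hue.

```python
import seaborn as sns

sns.set()  # restore seaborn's default theme, as suggested above

# With no arguments, color_palette() returns the palette currently in effect
current = sns.color_palette()
hex_codes = current.as_hex()
print(hex_codes)
```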
I am creating bar graphs for data that comes from series. However the names (x-axis values) are extremely long. If they are rotated 90 degrees it is impossible to read the entire name and get a good image of the graph. 45 degrees is not much better. I am looking for a way to label the x-axis by numbers 1-15 and then have a legend listing the names that correspond to each number.
This is the completed function I have so far, including creating the series from a larger dataframe
def graph_average_expressions(TAD_matches, CAGE):
    """graphs the top 15 expression levels of each lncRNA"""
    for i, row in TAD_matches.iterrows():
        mask = (
            CAGE['short_description'].isin(row['peak_ID'])
        )#finds expression level for peaks in an lncRNA
        average = CAGE[mask].iloc[:,8:].mean(axis=0).astype('float32').sort_values().tail(n=15)
        #made a new df of the top 15 highest expression levels for all averaged groups
        #a group is peaks belong to the same lncRNA
        cell_type = list(average.index)
        expression = list(average.values)
        average_df = pd.DataFrame(
            list(zip(cell_type, expression)),
            columns=['cell_type','expression']
        )
        colors = sns.color_palette(
            'husl',
            n_colors=len(cell_type)
        )
        p = sns.barplot(
            x=average_df.index,
            y='expression',
            data=average_df,
            palette=colors
        )
        cmap = dict(zip(average_df.cell_type, colors))
        patches = [Patch(color=v, label=k) for k, v in cmap.items()]
        plt.legend(
            handles=patches,
            bbox_to_anchor=(1.04, 0.5),
            loc='center left',
            borderaxespad=0
        )
        plt.title('expression_levels_of_lncRNA_' + row['lncRNA_name'])
        plt.xlabel('cell_type')
        plt.ylabel('expression')
        plt.show()
Here is an example of the data I am graphing
CD14_monocytes_treated_with_Group_A_streptococci_donor2.CNhs13532 1.583428
Neutrophils_donor3.CNhs11905 1.832527
CD14_monocytes_treated_with_Trehalose_dimycolate_TDM_donor2.CNhs13483 1.858384
CD14_monocytes_treated_with_Candida_donor1.CNhs13473 1.873013
CD14_Monocytes_donor2.CNhs11954 2.041607
CD14_monocytes_treated_with_Candida_donor2.CNhs13488 2.112112
CD14_Monocytes_donor3.CNhs11997 2.195365
CD14_monocytes_treated_with_Group_A_streptococci_donor1.CNhs13469 2.974203
Eosinophils_donor3.CNhs12549 3.566822
CD14_monocytes_treated_with_lipopolysaccharide_donor1.CNhs13470 3.685389
CD14_monocytes_treated_with_Salmonella_donor1.CNhs13471 4.409062
CD14_monocytes_treated_with_Candida_donor3.CNhs13494 5.546789
CD14_monocytes_-_treated_with_Group_A_streptococci_donor3.CNhs13492 5.673991
Neutrophils_donor1.CNhs10862 8.352045
Neutrophils_donor2.CNhs11959 11.595509
With the new code above this is the graph I get, but no legend or title.
A bit of a different route. Made a string mapping x values to the names and added it to the figure.
Made my own DataFrame for illustration.
from matplotlib import pyplot as plt
import pandas as pd
import string,random
df = pd.DataFrame({'name':[''.join(random.sample(string.ascii_letters,15))
for _ in range(10)],
'data':[random.randint(1,20) for _ in range(10)]})
Make the plot.
fig,ax = plt.subplots()
ax.bar(df.index,df.data)
Make the legend.
x_legend = '\n'.join(f'{n} - {name}' for n,name in zip(df.index,df['name']))
Add the legend as a Text artist and adjust the plot to accommodate it.
t = ax.text(.7,.2,x_legend,transform=ax.figure.transFigure)
fig.subplots_adjust(right=.65)
plt.show()
plt.close()
That can be made dynamic by getting and using the Text artist's size and the Figure's size.
# using imports and DataFrame from above
fig,ax = plt.subplots()
r = fig.canvas.get_renderer()
ax.bar(df.index,df.data)
x_legend = '\n'.join(f'{n} - {name}' for n,name in zip(df.index,df['name']))
t = ax.text(0,.1,x_legend,transform=ax.figure.transFigure)
# find the width of the Text and place it on the right side of the Figure
twidth = t.get_window_extent(renderer=r).width
*_,fwidth,fheight = fig.bbox.extents
tx,ty = t.get_position()
tx = .95 - (twidth/fwidth)
t.set_position((tx,ty))
# adjust the right edge of the plot/Axes
ax_right = tx - .05
fig.subplots_adjust(right=ax_right)
Setup the dataframe
Verify that the index of the dataframe to be plotted is reset, so that it consists of integers beginning at 0, and use the index as the x-axis
Plot the values on the y-axis
Option 1A: Seaborn hue
The easiest way is probably to use seaborn.barplot and use the hue parameter with the 'names'
Seaborn: Choosing color palettes
This plot is using husl
Additional options for the husl palette can be found at seaborn.husl_palette
The bars will not be centered for this option, because they are placed according to the number of hue levels, and there are 15 levels in this case.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# plt styling parameters
plt.style.use('seaborn')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8'
plt.rcParams['figure.figsize'] = (16.0, 10.0)
plt.rcParams["patch.force_edgecolor"] = True
# create a color palette the length of the dataframe
colors = sns.color_palette('husl', n_colors=len(df))
# plot
p = sns.barplot(x=df.index, y='values', data=df, hue='names')
# place the legend to the right of the plot
plt.legend(bbox_to_anchor=(1.04, 0.5), loc='center left', borderaxespad=0)
Option 1B: Seaborn palette
Using the palette parameter instead of hue, places the bars directly over the ticks.
This option requires "manually" associating 'names' with the colors and creating the legend.
patches uses Patch to create each item in the legend. (e.g. the rectangle, associated with color and name).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Patch
# create a color palette the length of the dataframe
colors = sns.color_palette('husl', n_colors=len(df))
# plot
p = sns.barplot(x=df.index, y='values', data=df, palette=colors)
# create color map with colors and df.names
cmap = dict(zip(df.names, colors))
# create the rectangles for the legend
patches = [Patch(color=v, label=k) for k, v in cmap.items()]
# add the legend
plt.legend(handles=patches, bbox_to_anchor=(1.04, 0.5), loc='center left', borderaxespad=0)
Option 2: pandas.DataFrame.plot
This option also requires "manually" associating 'names' with the palette and creating the legend using Patch.
Choosing Colormaps in Matplotlib
This plot is using tab20c
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.patches import Patch
# plt styling parameters
plt.style.use('seaborn')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8'
plt.rcParams['figure.figsize'] = (16.0, 10.0)
plt.rcParams["patch.force_edgecolor"] = True
# choose a color map with enough colors for the number of bars
colors = [plt.cm.tab20c(np.arange(len(df)))]
# plot the dataframe
df.plot.bar(color=colors)
# create color map with colors and df.names
cmap = dict(zip(df.names, colors[0]))
# create the rectangles for the legend
patches = [Patch(color=v, label=k) for k, v in cmap.items()]
# add the legend
plt.legend(handles=patches, bbox_to_anchor=(1.04, 0.5), loc='center left', borderaxespad=0)
Reproducible DataFrame
data = {'names': ['CD14_monocytes_treated_with_Group_A_streptococci_donor2.CNhs13532', 'Neutrophils_donor3.CNhs11905', 'CD14_monocytes_treated_with_Trehalose_dimycolate_TDM_donor2.CNhs13483', 'CD14_monocytes_treated_with_Candida_donor1.CNhs13473', 'CD14_Monocytes_donor2.CNhs11954', 'CD14_monocytes_treated_with_Candida_donor2.CNhs13488', 'CD14_Monocytes_donor3.CNhs11997', 'CD14_monocytes_treated_with_Group_A_streptococci_donor1.CNhs13469', 'Eosinophils_donor3.CNhs12549', 'CD14_monocytes_treated_with_lipopolysaccharide_donor1.CNhs13470', 'CD14_monocytes_treated_with_Salmonella_donor1.CNhs13471', 'CD14_monocytes_treated_with_Candida_donor3.CNhs13494', 'CD14_monocytes_-_treated_with_Group_A_streptococci_donor3.CNhs13492', 'Neutrophils_donor1.CNhs10862', 'Neutrophils_donor2.CNhs11959'],
'values': [1.583428, 1.832527, 1.858384, 1.873013, 2.041607, 2.112112, 2.195365, 2.974203, 3.566822, 3.685389, 4.409062, 5.546789, 5.673991, 8.352045, 11.595509]}
df = pd.DataFrame(data)
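The question also asks for the x-axis to show numbers, with the legend mapping each number to its name. A minimal matplotlib-only sketch of that variant follows; the three names and values are hypothetical placeholders, and prefixing each legend label with its bar number is an assumption about the desired layout rather than part of the options above.

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# Hypothetical long names and values, for illustration only
names = ["a_very_long_sample_name_one",
         "a_very_long_sample_name_two",
         "a_very_long_sample_name_three"]
values = [1.5, 3.2, 2.4]

positions = list(range(1, len(names) + 1))  # bars are numbered 1, 2, 3, ...
colors = [plt.cm.tab20c(i) for i in range(len(names))]

fig, ax = plt.subplots()
ax.bar(positions, values, color=colors)
ax.set_xticks(positions)  # the x-axis shows only the short numbers

# One legend entry per bar: "<number>: <name>"
patches = [Patch(color=c, label=f"{n}: {name}")
           for n, name, c in zip(positions, names, colors)]
legend = ax.legend(handles=patches, bbox_to_anchor=(1.04, 0.5),
                   loc="center left", borderaxespad=0)
```

When saving such a figure, passing bbox_inches="tight" to savefig keeps a legend placed outside the axes from being clipped.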
I am translating a set of R visualizations to Python. I have the following target R multiple plot histograms:
Using a combination of Matplotlib and Seaborn, and with the help of a kind Stack Overflow member (see the link: Python Seaborn Distplot Y value corresponding to a given X value), I was able to create the following Python plot:
I am satisfied with its appearance, except that I don't know how to put the header information in the plots. Here is the Python code that creates the charts:
""" Program to draw the sampling histogram distributions """
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import seaborn as sns
def main():
""" Main routine for the sampling histogram program """
sns.set_style('whitegrid')
markers_list = ["s", "o", "*", "^", "+"]
# create the data dataframe as df_orig
df_orig = pd.read_csv('lab_samples.csv')
df_orig = df_orig.loc[df_orig.hra != -9999]
hra_list_unique = df_orig.hra.unique().tolist()
# create and subset df_hra_colors to match the actual hra colors in df_orig
df_hra_colors = pd.read_csv('hra_lookup.csv')
df_hra_colors['hex'] = np.vectorize(rgb_to_hex)(df_hra_colors['red'], df_hra_colors['green'], df_hra_colors['blue'])
df_hra_colors.drop(labels=['red', 'green', 'blue'], axis=1, inplace=True)
df_hra_colors = df_hra_colors.loc[df_hra_colors['hra'].isin(hra_list_unique)]
# hard coding the current_component to pc1 here, we will extend it by looping
# through the list of components
current_component = 'pc1'
num_tests = 5
df_columns = df_orig.columns.tolist()
start_index = 5
for test in range(num_tests):
current_tests_list = df_columns[start_index:(start_index + num_tests)]
# now create the sns distplots for each HRA color and overlay the tests
i = 1
for _, row in df_hra_colors.iterrows():
plt.subplot(3, 3, i)
select_columns = ['hra', current_component] + current_tests_list
df_current_color = df_orig.loc[df_orig['hra'] == row['hra'], select_columns]
y_data = df_current_color.loc[df_current_color[current_component] != -9999, current_component]
axs = sns.distplot(y_data, color=row['hex'],
hist_kws={"ec":"k"},
kde_kws={"color": "k", "lw": 0.5})
data_x, data_y = axs.lines[0].get_data()
axs.text(0.0, 1.0, row['hra'], horizontalalignment="left", fontsize='x-small',
verticalalignment="top", transform=axs.transAxes)
for current_test_index, current_test in enumerate(current_tests_list):
# this_x defines the series of current_component(pc1,pc2,rhob) for this test
# indicated by 1, corresponding R program calls this test_vector
x_series = df_current_color.loc[df_current_color[current_test] == 1, current_component].tolist()
for this_x in x_series:
this_y = np.interp(this_x, data_x, data_y)
axs.plot([this_x], [this_y - current_test_index * 0.05],
markers_list[current_test_index], markersize = 3, color='black')
axs.xaxis.label.set_visible(False)
axs.xaxis.set_tick_params(labelsize=4)
axs.yaxis.set_tick_params(labelsize=4)
i = i + 1
start_index = start_index + num_tests
# plt.show()
pp = PdfPages('plots.pdf')
pp.savefig()
pp.close()
def rgb_to_hex(red, green, blue):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (red, green, blue)
if __name__ == "__main__":
    main()
The Pandas code works fine and does what it is supposed to. The bottleneck is my lack of knowledge and experience with 'PdfPages' in Matplotlib. How can I show the same header information in Python/Matplotlib/Seaborn that I can show in the corresponding R visualization? By header information, I mean what the R visualization has at the top, before the histograms, i.e., 'pc1', MRP, XRD,....
I can get their values easily from my program, e.g., current_component is 'pc1', etc., but I don't know how to format the plots with the header. Can someone provide some guidance?
You may be looking for a figure title or super title, fig.suptitle:
fig.suptitle('this is the figure title', fontsize=12)
In your case you can easily get the figure with plt.gcf(), so try
plt.gcf().suptitle("pc1")
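Since the question mentions looping over several components and saving with PdfPages, here is a minimal sketch of how the two could fit together: one PDF page per component, each with its own figure title. The random histogram data, the function name save_component_pages and the output file name are made up for illustration; only suptitle and PdfPages come from the discussion above.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.backends.backend_pdf import PdfPages


def save_component_pages(pdf_path, components, n_panels=9):
    """Write one PDF page per component, each headed by a figure title."""
    rng = np.random.default_rng(0)  # made-up data standing in for df_orig
    with PdfPages(pdf_path) as pdf:
        for component in components:
            fig, axes = plt.subplots(3, 3, figsize=(8, 8))
            for ax in axes.flat[:n_panels]:
                ax.hist(rng.normal(size=200), bins=20, edgecolor="k")
            fig.suptitle(component, fontsize=12)  # the page header
            pdf.savefig(fig)  # appends this figure as a new page
            plt.close(fig)    # free the figure before building the next one
        return pdf.get_pagecount()


# Hypothetical output location; any writable path would do.
pdf_path = os.path.join(tempfile.mkdtemp(), "plots_by_component.pdf")
n_pages = save_component_pages(pdf_path, ["pc1", "pc2", "rhob"])
```

Passing the figure explicitly to pdf.savefig(fig) avoids relying on whichever figure happens to be current, which matters once there is more than one figure in the loop.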
The rest of the information in the header would be called a legend.
For the following let's suppose all subplots have the same markers. It would then suffice to create a legend for one of the subplots.
To create legend labels, you can pass the label argument to the plot call, i.e.
axs.plot( ... , label="MRP")
When later calling axs.legend() a legend will automatically be generated with the respective labels. Ways to position the legend are detailed e.g. in this answer.
Here, you may want to place the legend in terms of figure coordinates, i.e.
axs.legend(loc="lower center",bbox_to_anchor=(0.5,0.8),bbox_transform=plt.gcf().transFigure)
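Putting the pieces together, a rough sketch could look like the following. It assumes, as above, that every subplot uses the same markers, so the legend is built from the first subplot only. The data is random, the marker choice is hypothetical, and the anchor and top margin values are just one setting that leaves room for the header; you will likely need to tune them.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
test_names = ["MRP", "XRD"]  # labels taken from the header in the question
markers = ["o", "s"]         # hypothetical marker choice, one per test

fig, axes = plt.subplots(2, 2, figsize=(6, 6))
for ax in axes.flat:
    ax.hist(rng.normal(size=100), bins=15, color="lightgray", edgecolor="k")
    for name, marker in zip(test_names, markers):
        # label= is what later feeds the legend
        ax.plot(rng.normal(size=3), [1, 1, 1], marker,
                color="black", markersize=3, label=name)

title = fig.suptitle("pc1")

# One legend for the whole figure, taken from the first subplot and
# positioned in figure coordinates just below the title.
first_ax = axes.flat[0]
legend = first_ax.legend(loc="lower center", bbox_to_anchor=(0.5, 0.88),
                         bbox_transform=fig.transFigure, ncol=len(test_names))
fig.subplots_adjust(top=0.85)  # leave room for title and legend

legend_labels = [text.get_text() for text in legend.get_texts()]
```

The histogram bars carry no label, so only the labelled marker lines end up in the legend.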
I want to create a bar chart of two series (say 'A' and 'B') contained in a Pandas dataframe. If I wanted to just plot them using a different y-axis, I can use secondary_y:
df = pd.DataFrame(np.random.uniform(size=10).reshape(5,2),columns=['A','B'])
df['A'] = df['A'] * 100
df.plot(secondary_y=['A'])
but if I want to create bar graphs, the equivalent command is ignored (it doesn't put different scales on the y-axis), so the bars from 'A' are so big that the bars from 'B' cannot be distinguished:
df.plot(kind='bar',secondary_y=['A'])
How can I do this in pandas directly? Or how would you create such a graph?
I'm using pandas 0.10.1 and matplotlib version 1.2.1.
I don't think pandas plotting supports this. I wrote some manual matplotlib code; you can tweak it further:
import pylab as pl
fig = pl.figure()
ax1 = pl.subplot(111,ylabel='A')
#ax2 = gcf().add_axes(ax1.get_position(), sharex=ax1, frameon=False, ylabel='axes2')
ax2 =ax1.twinx()
ax2.set_ylabel('B')
ax1.bar(df.index,df.A.values, width =0.4, color ='g', align = 'center')
ax2.bar(df.index,df.B.values, width = 0.4, color='r', align = 'edge')
ax1.legend(['A'], loc = 'upper left')
ax2.legend(['B'], loc = 'upper right')
fig.show()
I am sure there are ways to tweak it further, e.g. moving the bars further apart or making one slightly transparent.
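As one possible tweak, here is a sketch that puts the two bars side by side instead of overlapping them, by shifting each series half a bar width away from the tick. It uses plain integer positions rather than df.index, and the random data and the alpha value are illustrative choices only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(size=(5, 2)), columns=["A", "B"])
df["A"] = df["A"] * 100  # same scale mismatch as in the question

x = np.arange(len(df))  # one tick per row
width = 0.4

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis

# Bars are centered on the given x, so shifting by half a width
# places A to the left of the tick and B to the right.
bars_a = ax1.bar(x - width / 2, df["A"], width=width, color="g", label="A")
bars_b = ax2.bar(x + width / 2, df["B"], width=width, color="r",
                 alpha=0.7, label="B")

ax1.set_ylabel("A")
ax2.set_ylabel("B")
ax1.set_xticks(x)
ax1.set_xticklabels(df.index)
ax1.legend(loc="upper left")
ax2.legend(loc="upper right")
```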
OK, I had the same problem recently, and even though it's an old question, I think I can give an answer in case someone else is losing their mind over this. Joop gave the basics of what to do, and it's easy when you only have (for example) two columns in your dataframe, but it becomes really nasty when you have a different number of columns for the two axes, because you need to play with the position argument of the pandas plot() function. In my example I use seaborn, but it's optional:
import pandas as pd
import seaborn as sns
import pylab as plt
import numpy as np
df1 = pd.DataFrame(np.array([[i*99 for i in range(11)]]).transpose(), columns = ["100"], index = [i for i in range(11)])
df2 = pd.DataFrame(np.array([[i for i in range(11)], [i*2 for i in range(11)]]).transpose(), columns = ["1", "2"], index = [i for i in range(11)])
fig, ax = plt.subplots()
ax2 = ax.twinx()
# Count the columns in each dataframe.
df1_len = len(df1.columns.values)
df2_len = len(df2.columns.values)
column_width = 0.8 / (df1_len + df2_len)
# Calculate the position of each group of bars in the plot. This value is based on the definition of position:
# "Specify relative alignments for bar plot layout. From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center)"
# http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.plot.html
df1_posi = 0.5 + (df2_len/float(df1_len)) * 0.5
df2_posi = 0.5 - (df1_len/float(df2_len)) * 0.5
# To get nice colors, use seaborn's default color palette
df1.plot(kind='bar', ax=ax, width=column_width*df1_len, color=sns.color_palette()[:df1_len], position=df1_posi)
df2.plot(kind='bar', ax=ax2, width=column_width*df2_len, color=sns.color_palette()[df1_len:df1_len+df2_len], position=df2_posi)
ax.legend(loc="upper left")
# Pandas adds a zero line for each dataframe; hide it.
ax.lines[0].set_visible(False)
ax2.lines[0].set_visible(False)
# Specific to seaborn: remove the background grid lines of the second axis
ax2.grid(False, axis='both')
# Add some space on both sides, since xlim does not account for the new positions
column_length = (ax2.get_xlim()[1] - abs(ax2.get_xlim()[0])) / float(len(df1.index))
ax2.set_xlim([ax2.get_xlim()[0] - column_length, ax2.get_xlim()[1] + column_length])
fig.patch.set_facecolor('white')
plt.show()
And the result: http://i.stack.imgur.com/LZjK8.png
I didn't test every possibility, but it looks like it works fine whatever the number of columns in each dataframe.
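To see why the two position values line the groups up next to each other, here is a small arithmetic sketch with no plotting involved. It rests on an assumption about how pandas interprets position, consistent with the docstring quoted above: a bar group of total width w drawn with position=p spans from tick - w * p to tick + w * (1 - p). The function name bar_group_layout is made up for this illustration.

```python
def bar_group_layout(n_left, n_right, total_width=0.8):
    """Reproduce the width/position arithmetic from the answer above.

    Assumed pandas convention (not verified against every version):
    a group of width w at position=p spans [tick - w*p, tick + w*(1-p)].
    Spans are returned as offsets relative to the tick.
    """
    column_width = total_width / (n_left + n_right)
    width_left = column_width * n_left
    width_right = column_width * n_right

    # Same formulas as df1_posi and df2_posi
    pos_left = 0.5 + (n_right / n_left) * 0.5
    pos_right = 0.5 - (n_left / n_right) * 0.5

    span_left = (-width_left * pos_left, width_left * (1 - pos_left))
    span_right = (-width_right * pos_right, width_right * (1 - pos_right))
    return {
        "pos_left": pos_left,
        "pos_right": pos_right,
        "span_left": span_left,
        "span_right": span_right,
    }


# One column on the first axis, two on the second, as in the answer.
layout = bar_group_layout(1, 2)
```

Under that assumption, the left group ends exactly where the right group starts, and together they cover the full 0.8 width centered on the tick, which is what the formulas are designed to achieve.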