Is it possible to add some spacing in the heatmaps created by using mark_rect() in Altair python plots? The heatmap in figure 1 will be converted to the one in figure 2. You can assume that this is from a dataframe and each column corresponds to a variable. I deliberately drew the white bars like this to avoid any hardcoded indexed solution. Basically, I am looking for a solution where I can provide the column name and/or the index name to get white spacings drawn both vertically and/or horizontally.
You can control the spacing within heatmaps using the scale.bandPaddingInner configuration parameter, a number between zero and one that specifies the fraction of each band to reserve as padding between the rectangles; it defaults to zero for rect marks. For example:
import altair as alt
import numpy as np
import pandas as pd
# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(),
                       'y': y.ravel(),
                       'z': z.ravel()})

alt.Chart(source).mark_rect().encode(
    x='x:O',
    y='y:O',
    color='z:Q'
).configure_scale(
    bandPaddingInner=0.1
)
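If you prefer not to change the global scale config, the same effect can be achieved per encoding by setting paddingInner on the band scale of each axis. A minimal sketch, reusing the source dataframe from above:

# Padding set on the x and y band scales of this chart only, so other
# charts in the same document keep their default (zero) padding.
alt.Chart(source).mark_rect().encode(
    x=alt.X('x:O', scale=alt.Scale(paddingInner=0.1)),
    y=alt.Y('y:O', scale=alt.Scale(paddingInner=0.1)),
    color='z:Q'
)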
One way to create these bands is to facet the chart using custom bins. Here is how to do that, using pandas.cut to create the bins.
import pandas as pd
import altair as alt
df = (pd.util.testing.makeDataFrame()
.reset_index(drop=True) # drop string index
.reset_index() # add an index column
.melt(id_vars=['index'], var_name="column"))
# To include all the indices and not create NaNs, I add -1 and max(indices) + 1 to the desired bins.
bins = [-1, 3, 9, 15, 27, 30]
df['bins'] = pd.cut(df['index'], bins, labels=range(len(bins) - 1))
# This was done for the index, but a similar approach could be taken for the columns as well.
alt.Chart(df).mark_rect().encode(
    x=alt.X('index:O', title=None),
    y=alt.Y('column:O', title=None),
    color="value:Q",
    column=alt.Column("bins:O",
                      title=None,
                      header=alt.Header(labelFontSize=0))
).resolve_scale(
    x="independent"
).configure_facet(
    spacing=5
)
Note the resolve_scale(x='independent'), which keeps each facet's x-axis restricted to its own bin instead of repeating the full index, and the spacing parameter in configure_facet, which controls the width of the gaps. I set labelFontSize=0 in the header so that the bin names are not shown on top of each facet.
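If you also want horizontal gaps, the same trick can be applied to the columns by assigning each column to a group and faceting on it as a row. A minimal sketch, reusing df from above; col_groups is a hypothetical mapping from the column names produced by makeDataFrame (A through D) to two bands, so adjust it to your own columns:

# Hypothetical grouping of the original columns into horizontal bands
col_groups = {'A': 0, 'B': 0, 'C': 1, 'D': 1}
df['col_bins'] = df['column'].map(col_groups)

alt.Chart(df).mark_rect().encode(
    x=alt.X('index:O', title=None),
    y=alt.Y('column:O', title=None),
    color='value:Q',
    column=alt.Column('bins:O', title=None, header=alt.Header(labelFontSize=0)),
    row=alt.Row('col_bins:O', title=None, header=alt.Header(labelFontSize=0))
).resolve_scale(
    x='independent',
    y='independent'
).configure_facet(
    spacing=5
)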
I want to create a sort of stacked bar chart [I don't know the proper name]. I hand-drew the graph [for years 2016 and 2017] and attached it here.
The code to create the df is below:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = [[2016.0, 0.4862, 0.4115, 0.3905, 0.3483, 0.1196],
[2017.0, 0.4471, 0.4096, 0.3725, 0.2866, 0.1387],
[2018.0, 0.4748, 0.4016, 0.3381, 0.2905, 0.2012],
[2019.0, 0.4705, 0.4247, 0.3857, 0.3333, 0.2457],
[2020.0, 0.4755, 0.4196, 0.3971, 0.3825, 0.2965]]
cols = ['attribute_time', '100-81 percentile', '80-61 percentile', '60-41 percentile', '40-21 percentile', '20-0 percentile']
df = pd.DataFrame(data, columns=cols)
#set seaborn plotting aesthetics
sns.set(style='white')
#create stacked bar chart
df.set_index('attribute_time').plot(kind='bar', stacked=True)
The values don't need to be stacked on top of each other. The code above creates a stacked bar chart, but that's not exactly what needs to be displayed: each percentile should appear as a labeled horizontal line positioned above its year on the x axis. Does anyone have recommendations on how to achieve this? Is it a sort of modified stacked bar chart that needs to be visualized?
My approach to this is to represent the data as a categorical scatter plot (stripplot in Seaborn) using horizontal lines rather than points as markers. You'll have to make some choices about exactly how and where you want to plot things, but this should get you started!
I first modified the data a little bit:
df['attribute_time'] = df['attribute_time'].astype('int') # Just to get rid of the decimals.
df = df.melt(id_vars = ['attribute_time'],
value_name = 'pct_value',
var_name = 'pct_range')
Melting the DataFrame turns the wide data into long format, so the columns are now attribute_time (the year), pct_range, and pct_value, and there is one row per data point.
Next is the plotting:
fig, ax = plt.subplots()
sns.stripplot(data = df,
x = 'attribute_time',
y = 'pct_value',
hue = 'pct_range',
jitter = False,
marker = '_',
s = 40,
linewidth = 3,
ax = ax)
Instead of labeling each point with the range it belongs to, I thought it would be a lot cleaner to separate the ranges by color.
The jitter option is used when lots of points in a given category might overlap, to keep them from touching; we don't need that here, so I turned it off. The marker style '_' is the horizontal-line (hline) marker.
The s parameter is the horizontal width of each line, and the linewidth is the thickness, so you can play around with those a bit to see what works best for you.
Text is added to the figure using the ax.text method as follows:
for year, value in zip(df['attribute_time'], df['pct_value']):
    ax.text(year - 2016,
            value,
            str(value),
            ha = 'center',
            va = 'bottom',
            fontsize = 'small')
The categorical x positions are indexed starting from 0 even though the tick labels display the years, so the x position of each label is the year shifted left by the minimum year (2016). The y position is the value itself, and the text is a string representation of the value. The text is centered horizontally and sits slightly above the line because the vertical anchor is on the bottom.
There's obviously a lot you can tweak to make it look how you want with sizing and labeling and stuff, but hopefully this is at least a good start!
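For example, a few optional tweaks along those lines (the label text and legend placement are just suggestions) could look like this:

from matplotlib.ticker import PercentFormatter

# Move the legend outside the plot area, label the axes, and format the
# y ticks as percentages; purely cosmetic, adjust to taste.
ax.legend(title='Percentile range', bbox_to_anchor=(1.02, 1), loc='upper left')
ax.set_xlabel('Year')
ax.set_ylabel('Value')
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))
fig.tight_layout()
plt.show()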
I'm trying to create a plot that contains both a violin plot and a stripplot with jitter. How do I go about doing this? I provided my attempt below. The problem that I have been encountering is that the violin plot seems to be invisible in the plots.
# 1. Create violin plot
violin = alt.Chart(df).transform_density(
"n_genes_by_counts",
as_=["n_genes_by_counts", "density"],
).mark_area(orient="horizontal").encode(
y="n_genes_by_counts:Q",
x=alt.X("Density:Q", stack="center", title=None),
)
# 2. Create stripplot
stripplot = alt.Chart(df).mark_circle(size=8, color="black").encode(
y="n_gene_by_counts",
x=alt.X("jitter:Q", title=None),
).transform_calculate(
jitter="sqrt(-2*log(random()))*cos(2*PI*random())"
)
# 3. Combine both
combined = stripplot + violin
I have a feeling that it could be a problem with the scaling of the x axis. That is, density is much, much smaller than jitter. If that's the case, how do I make jitter so that it's on the same order of magnitude as density? Would it be possible for someone to show me how to create a violin + stripplot given a column name n_gene_by_counts that belongs to some pandas dataframe df? Here's an example image of the kind of plot I'm looking for:
As you suspected, the different scales will make the violin very small next to the stripplot unless you adjust for it. In your case, you have also accidentally capitalized Density:Q in the channel encoding, which means that your violin plot is actually empty since that field doesn't exist. This example works:
import altair as alt
from vega_datasets import data
df = data.cars()
# 1. Create violin plot
violin = alt.Chart(df).transform_density(
"Horsepower",
as_=["Horsepower", "density"],
).mark_area().encode(
x="Horsepower:Q",
y=alt.Y("density:Q", stack="center", title=None),
)
# 2. Create stripplot
stripplot = alt.Chart(df).mark_circle(size=8, color="black").encode(
    x="Horsepower",
    y=alt.Y("jitter:Q", title=None),
).transform_calculate(
    jitter="(random() / 400) + 0.0052"  # Narrowing and centering the points
)
# 3. Combine both
violin + stripplot
By using scipy, you could also lay out the points themselves in the shape of the violin, which I am personally quite fond of (see the discussion in this issue):
import altair as alt
import numpy as np
import pandas as pd
from scipy import stats
from vega_datasets import data
# NAs are not supported in SciPy's density calculation
df = data.cars().dropna()
y = 'Horsepower'
# Compute the density function of the data
dens = stats.gaussian_kde(df[y])
# Compute the density value for each data point
pdf = dens(df[y].sort_values())
# Randomly jitter points between 0 and the upper bound of the probability density function
density_cloud = np.empty(pdf.shape[0])
for i in range(pdf.shape[0]):
    density_cloud[i] = np.random.uniform(0, pdf[i])
# To create a symmetric density/violin, we make every second point negative
# Distributing every other point like this is also more likely to preserve the shape of the violin
violin_cloud = density_cloud.copy()
violin_cloud[::2] = violin_cloud[::2] * -1
# Append the density cloud to the original data in the correctly sorted order
df_with_density = pd.concat([
df,
pd.DataFrame({
'density_cloud': density_cloud,
'violin_cloud': violin_cloud
},
index=df['Horsepower'].sort_values().index)],
axis=1
)
# Visualize the jittered points laid out in the shape of the violin
alt.Chart(df_with_density).mark_circle().encode(
    x='Horsepower',
    y='violin_cloud'
)
Both these approaches will work with multiple categories without faceting in the next version of Altair, when support for the x/y offset channels is added.
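For reference, here is a rough sketch of what that could look like once the offset channels are available (this assumes Altair 5's xOffset channel and the cars dataset from above, grouping the jittered points by Origin):

# Strip plot with one jittered column of points per category (Altair >= 5)
alt.Chart(df).mark_circle(size=8, color="black").encode(
    x=alt.X("Origin:N", title=None),
    y="Horsepower:Q",
    xOffset="jitter:Q",
).transform_calculate(
    jitter="sqrt(-2*log(random()))*cos(2*PI*random())"  # Gaussian jitter
)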
I'd like to plot a heatmap with masked values using Altair. This can be done via passing a mask array to seaborn's heatmap method, but I want to do it using Altair. Thanks!
In Altair, you can apply a mask by removing from the dataset any data that you don't want to be shown. For example, here is a masked version of the Simple Heatmap example from Altair's documentation:
import altair as alt
import numpy as np
import pandas as pd
# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(),
'y': y.ravel(),
'z': z.ravel()})
mask = np.random.rand(len(source)) < 0.9
alt.Chart(source.iloc[mask]).mark_rect().encode(
x='x:O',
y='y:O',
color='z:Q'
)
If you want the masking to take place via the chart specification rather than via a preprocessing step, you can similarly filter rows using a Filter transform.
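For example, here is a minimal sketch of the same idea expressed as a filter transform, dropping cells above an arbitrary z threshold instead of using a random mask:

# The masking is part of the chart spec: rows failing the predicate are
# simply not rendered.
alt.Chart(source).transform_filter(
    alt.datum.z < 40
).mark_rect().encode(
    x='x:O',
    y='y:O',
    color='z:Q'
)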
Let's say I have a DataFrame that looks (simplified) like this
>>> df
   freq
2     2
3    16
1    25
where the index column represents a value, and the freq column represents the frequency of occurrence of that value, as in a frequency table.
I'd like to plot a density plot for this table like the one obtained from plot kind kde. However, that kind is apparently only meant for a pd.Series, and my df is too large to flatten out into a 1D Series, i.e. df = [2, 2, 3, 3, 3, ..., 1, 1].
How can I plot such a density plot under these circumstances?
I know you have asked for the case where df is too large to flatten out, but the following answer works where this isn't the case:
pd.Series(df.index.repeat(df.freq)).plot.kde()
Or more generally, when the values are in a column called val and not the index:
df.val.repeat(df.freq).plot.kde()
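For completeness, a self-contained sketch using the small frequency table from the question:

import matplotlib.pyplot as plt
import pandas as pd

# Rebuild the example frequency table, with the value as the index
df = pd.DataFrame({'freq': [2, 16, 25]}, index=[2, 3, 1])

# Repeat each index value according to its frequency, then plot the KDE
pd.Series(df.index.repeat(df.freq)).plot.kde()
plt.show()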
You can plot a density distribution using a bar plot if you normalize the y values by the sum of the frequencies (the size of the population). With unit-width bars, this makes the total area covered by the bars equal to 1.
import matplotlib.pyplot as plt

plt.bar(
    df.index,
    df.freq / df.freq.sum(),
    width=-1,
    align='edge'
)
The width and align parameters are to make sure each bar covers the interval (k-1, k].
Somebody with better knowledge of statistics should answer whether kernel density estimation actually makes sense for discrete distributions.
Maybe this will work:
import matplotlib.pyplot as plt
plt.plot(df.index, df['freq'])
plt.show()
Seaborn is built on top of Matplotlib for exactly this kind of plot and can calculate a kernel density estimate for you if you want.
import numpy as np
import pandas as pd
import seaborn as sns

x = pd.Series(np.random.randint(0, 20, size=10000), name='freq')
sns.distplot(x, kde=True)
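Note that distplot has since been deprecated (seaborn 0.11+); a roughly equivalent call with the current API would be something like:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

x = pd.Series(np.random.randint(0, 20, size=10000), name='freq')

# histplot replaces distplot; stat='density' puts the bars on the same
# scale as the overlaid kernel density estimate
sns.histplot(x, kde=True, stat='density')
plt.show()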
I have two or three csv files with the same header and would like to draw the histograms for each column overlaying one another on the same plot.
The following code gives me two separate figures, each containing all histograms for one of the files. Is there a compact way to plot them together on the same figure using pandas/matplotlib? I imagine something close to this, but using dataframes.
Code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('input1.csv')
df2 = pd.read_csv('input2.csv')
df.hist(bins=20)
df2.hist(bins=20)
plt.show()
In [17]: import matplotlib.pyplot as plt
In [18]: from pandas import DataFrame
In [19]: from numpy.random import randn
In [20]: df = DataFrame(randn(10, 2))
In [21]: df2 = DataFrame(randn(10, 2))
In [22]: axs = df.hist()
In [23]: for ax, (colname, values) in zip(axs.flat, df2.items()):
   ....:     values.hist(ax=ax, bins=10)
   ....:
In [24]: plt.show()
This gives a single figure with df2's histograms overlaid on df's.
The main issue of overlaying the histograms of two (or more) dataframes containing the same variables in side-by-side plots within a single figure has already been solved in the answer by Phillip Cloud.
This answer provides a solution to the issue raised by the author of the question (in the comments to the accepted answer) regarding how to enforce the same number of bins and range for the variables common to both dataframes. This can be accomplished by creating a list of bins common to all variables of both dataframes. In fact, this answer goes a little bit further by adjusting the plots for cases where the different variables contained in each dataframe cover slightly different ranges (but still within the same order of magnitude), as illustrated in the following example:
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
from matplotlib.lines import Line2D
# Set seed for random data
rng = np.random.default_rng(seed=1)
# Create two similar dataframes each containing two random variables,
# with df2 twice the size of df1
df1_size = 1000
df1 = pd.DataFrame(dict(var1 = rng.exponential(scale=1.0, size=df1_size),
var2 = rng.normal(loc=40, scale=5, size=df1_size)))
df2_size = 2*df1_size
df2 = pd.DataFrame(dict(var1 = rng.exponential(scale=2.0, size=df2_size),
var2 = rng.normal(loc=50, scale=10, size=df2_size)))
# Combine the dataframes to extract the min/max values of each variable
df_combined = pd.concat([df1, df2])
vars_min = [df_combined[var].min() for var in df_combined]
vars_max = [df_combined[var].max() for var in df_combined]
# Create custom bins based on the min/max of all values from both
# dataframes to ensure that in each histogram the bins are aligned
# making them easily comparable
nbins = 30
bin_edges, step = np.linspace(min(vars_min), max(vars_max), nbins+1, retstep=True)
# Create figure by combining the outputs of two pandas df.hist() function
# calls using the 'step' type of histogram to improve plot readability
htype = 'step'
alpha = 0.7
lw = 2
axs = df1.hist(figsize=(10,4), bins=bin_edges, histtype=htype,
linewidth=lw, alpha=alpha, label='df1')
df2.hist(ax=axs.flatten(), grid=False, bins=bin_edges, histtype=htype,
linewidth=lw, alpha=alpha, label='df2')
# Adjust x-axes limits based on min/max values and step between bins, and
# remove top/right spines: if, contrary to this example dataset, var1 and
# var2 cover the same range, setting the x-axes limits with this loop is
# not necessary
for ax, v_min, v_max in zip(axs.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
# Edit legend to get lines as legend keys instead of the default polygons:
# use legend handles and labels from any of the axes in the axs object
# (here taken from first one) seeing as the legend box is by default only
# shown in the last subplot when using the plt.legend() function.
handles, labels = axs.flatten()[0].get_legend_handles_labels()
lines = [Line2D([0], [0], lw=lw, color=h.get_facecolor()[:-1], alpha=alpha)
for h in handles]
plt.legend(lines, labels, frameon=False)
plt.suptitle('Pandas', x=0.5, y=1.1, fontsize=14)
plt.show()
It is worth noting that the seaborn package provides a more convenient way to create this kind of plot, where contrary to pandas the bins are automatically aligned. The only downside is that the dataframes must first be combined and reshaped to long format, as shown in this example using the same dataframes and bins as before:
import seaborn as sns # v 0.11.0
# Combine dataframes and convert the combined dataframe to long format
df_concat = pd.concat([df1, df2], keys=['df1','df2']).reset_index(level=0)
df_melt = df_concat.melt(id_vars='level_0', var_name='var_id')
# Create figure using seaborn displot: note that the bins are automatically
# aligned thanks to the 'common_bins' parameter of the seaborn histplot function
# (called here via kind='hist'), which is True by default. Here, the bins from
# the previous example are used to make the figures more comparable.
# Also note that the facets share the same x and y axes by default; this can
# be changed when var1 and var2 have different ranges and different
# distribution shapes, as is the case in this example.
g = sns.displot(df_melt, kind='hist', x='value', col='var_id', hue='level_0',
element='step', bins=bin_edges, fill=False, height=4,
facet_kws=dict(sharex=False, sharey=False))
# For some reason setting sharex as above does not automatically adjust the
# x-axes limits (even when not setting a bins argument, maybe due to a bug
# in this package version), which is why this is done in the following loop.
# Note that you still need to set 'sharex=False' in displot, or else
# 'ax.set_xlim' will have no effect.
for ax, v_min, v_max in zip(g.axes.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)
# Additional formatting
g.legend.set_bbox_to_anchor((.9, 0.75))
g.legend.set_title('')
plt.suptitle('Seaborn', x=0.5, y=1.1, fontsize=14)
plt.show()
As you may notice, the histogram line is cut off at the limits of the list of bin edges (not visible on the maximum side due to scale). To get a line more similar to the example with pandas, an empty bin can be added at each extremity of the list of bins, like this:
bin_edges = np.insert(bin_edges, 0, bin_edges.min()-step)
bin_edges = np.append(bin_edges, bin_edges.max()+step)
This example also illustrates the limits of this approach of setting common bins for both facets. Seeing as the ranges of var1 and var2 are somewhat different and 30 bins are used to cover the combined range, the histogram for var1 contains rather few bins and the histogram for var2 has slightly more bins than necessary. To my knowledge, there is no straightforward way of assigning a different list of bins to each facet when calling the plotting functions df.hist() and displot(df). So for cases where the variables cover significantly different ranges, these figures would have to be created from scratch using matplotlib or some other plotting library, as sketched below.
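For reference, a minimal sketch of that from-scratch approach, reusing df1 and df2 from above and letting each variable get its own set of bin edges (the choice of 30 bins per variable is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

for ax, var in zip(axes, ['var1', 'var2']):
    # Bin edges computed per variable from the combined range of both dataframes
    combined = np.concatenate([df1[var], df2[var]])
    bins = np.histogram_bin_edges(combined, bins=30)
    ax.hist(df1[var], bins=bins, histtype='step', linewidth=2, label='df1')
    ax.hist(df2[var], bins=bins, histtype='step', linewidth=2, label='df2')
    ax.set_title(var)
    ax.legend(frameon=False)

plt.tight_layout()
plt.show()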