Mask values in Python Altair heatmap plots - python

I'd like to plot a heatmap with masked values using Altair. This can be done via passing a mask array to seaborn's heatmap method, but I want to do it using Altair. Thanks!

In Altair, you can apply a mask by removing from the dataset any data that you don't want to be shown. For example, here is a masked version of the Simple Heatmap example from Altair's documentation:
import altair as alt
import numpy as np
import pandas as pd
# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(),
'y': y.ravel(),
'z': z.ravel()})
mask = np.random.rand(len(source)) < 0.9
alt.Chart(source.iloc[mask]).mark_rect().encode(
x='x:O',
y='y:O',
color='z:Q'
)
If you want the masking to take place via the chart specification rather than via a preprocessing step, you can similarly filter rows using a Filter transform.

Related

Plot existing covariance dataframe

I have computed a covariance of 26 inputs from another software. I have an existing table of the results. See image below:
What I want to do is enter the table as a pandas dataframe and plot the matrix. I have seen the thread here: Plot correlation matrix using pandas. However, the aforementioned example, computed the covariance first and plotted the 'covariance' object. In my case, I want to plot the dataframe object to look like the covariance matrix in the example.
Link to data: HERE.
IIUC, you can use seaborn.heatmap with annot=True :
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 4))
(
    pd.read_excel("/tmp/Covariance Matrix.xlsx", header=None)
    .pipe(lambda df: sns.heatmap(df.sample(10).sample(10, axis=1), annot=True, fmt=".1f"))
);
# for a sample of 10 rows / 10 columns
Output :
And, as suggested by stukituk in the comments, you can add cmap="coolwarm" for colors:
A clean option, in my opinion, comes from this other answer: How to plot only the lower triangle of a seaborn heatmap?
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline  (uncomment when running in a Jupyter notebook)
df = pd.read_excel('Covariance Matrix.xlsx', header=None)
# Get the upper triangle of the covariance matrix
matrix = np.triu(df)
# Use the upper triangle as the mask, so only the lower triangle is drawn
fig, ax = plt.subplots(figsize=(12,8))
sns.heatmap(df, ax=ax, fmt='.1g', annot=True, mask=matrix)
plt.show()
hope this helps
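One caveat with using np.triu(df) directly as the mask: it keeps the matrix values, and as far as I can tell seaborn coerces the mask to booleans, so a covariance of exactly zero in the upper triangle would not be masked. A minimal sketch of an explicit boolean mask follows; the 3x3 table is made-up data for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 covariance table with a zero off-diagonal entry
cov = pd.DataFrame([[4.0, 0.0, 1.5],
                    [0.0, 9.0, -2.0],
                    [1.5, -2.0, 1.0]])

# Value-based mask: the zero at row 0, column 1 is falsy, so it stays visible
value_mask = np.triu(cov).astype(bool)

# Boolean mask: every cell on or above the diagonal is True, whatever its value
bool_mask = np.triu(np.ones(cov.shape, dtype=bool))
```

Passing mask=bool_mask to sns.heatmap should then hide the whole upper triangle. Using np.triu(..., k=1) instead keeps the diagonal visible.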

Matplotlib pcolormesh with time on x-axis and boolean true/false as the colouring

I want a plot with the xticklabels as datetime-like objects and the colors to be True/False (1,0) depending on whether that time has been sampled.
The data I am working with looks like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
all_days = pd.date_range("2001-01-01", "2020-12-31", freq="D")
times = all_days[::10]
arr = np.isin(all_days, times).reshape(1, -1)
The plotting code I have is here:
Producing a plot without the colouring (but the correct labels)
plt.pcolormesh(all_days, np.ones(1), arr)
Or without the labels but with the colouring
plt.pcolormesh(arr)
Ultimately, I want a combination of these two plots, the xticklabels of the former and the colouring of the latter.
matplotlib.pyplot.pcolormesh optionally accepts X and Y arrays that specify the coordinates of the corners of the quadrilaterals of the mesh. Your data has a single row, so you need to pass a proper Y array describing:
a single row of cells
running from 0 to 1 on the y axis
So you just need to use:
plt.pcolormesh(all_days, [0, 1], arr)
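Depending on the Matplotlib version, corner arrays may need to be exactly one element longer than the colour array in each dimension, and an X of the same length as the number of columns can be rejected. A sketch that builds explicit edges under that assumption (the names x_edges and y_edges are introduced here for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

all_days = pd.date_range("2001-01-01", "2020-12-31", freq="D")
times = all_days[::10]
arr = np.isin(all_days, times).reshape(1, -1)

# One more edge than there are days, so every day gets its own cell
x_edges = pd.date_range(all_days[0], periods=len(all_days) + 1, freq="D").to_numpy()
# Two edges for the single row of cells
y_edges = [0, 1]

fig, ax = plt.subplots(figsize=(10, 2))
# Cast the booleans to 0/1 so the colour mapping is explicit
mesh = ax.pcolormesh(x_edges, y_edges, arr.astype(int))
ax.set_yticks([])  # the y axis carries no information here
```

Call plt.show() afterwards to display the figure. The x axis keeps its date tick labels because the edges are datetime values.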

Adding spacing in heatmaps of python altair plots

Is it possible to add some spacing in the heatmaps created by using mark_rect() in Altair python plots? The heatmap in figure 1 will be converted to the one in figure 2. You can assume that this is from a dataframe and each column corresponds to a variable. I deliberately drew the white bars like this to avoid any hardcoded indexed solution. Basically, I am looking for a solution where I can provide the column name and/or the index name to get white spacings drawn both vertically and/or horizontally.
You can specify the spacing within heatmaps using the scale.bandPaddingInner configuration parameter, which is a number between zero and one that specifies the fraction of the rectangle mark that should be padded, and defaults to zero. For example:
import altair as alt
import numpy as np
import pandas as pd
# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(),
'y': y.ravel(),
'z': z.ravel()})
alt.Chart(source).mark_rect().encode(
x='x:O',
y='y:O',
color='z:Q'
).configure_scale(
bandPaddingInner=0.1
)
One way to create these bands would be to facet the chart using custom bins. Here is a way to do that, using pandas.cut to create the bins.
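import numpy as np
import pandas as pd
import altair as alt
# pd.util.testing.makeDataFrame() is not available in recent pandas releases,
# so build an equivalent 30-row, 4-column frame of random values explicitly
df = (pd.DataFrame(np.random.randn(30, 4), columns=list("ABCD"))
      .reset_index()  # add an index column
      .melt(id_vars=['index'], var_name="column"))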
# To include all the indices and not create NaNs, I add -1 and max(indices) + 1 to the desired bins.
bins= [-1, 3, 9, 15, 27, 30]
df['bins'] = pd.cut(df['index'], bins, labels=range(len(bins) - 1))
# This was done for the index, but a similar approach could be taken for the columns as well.
alt.Chart(df).mark_rect().encode(
x=alt.X('index:O', title=None),
y=alt.Y('column:O', title=None),
color="value:Q",
column=alt.Column("bins:O",
title=None,
header=alt.Header(labelFontSize=0))
).resolve_scale(
x="independent"
).configure_facet(
spacing=5
)
Note the resolve_scale(x='independent'), which gives each facet its own x scale so that the full axis is not repeated in every facet, and the spacing parameter in configure_facet, which controls the width of the gaps. I set labelFontSize=0 in the header so that the bin names are not shown on top of each facet.
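To see how the bin edges decide where the gaps fall, here is a small sketch of what pandas.cut returns for a few sample index values. The sample values are chosen for illustration; the bin edges are the ones used above.

```python
import pandas as pd

bins = [-1, 3, 9, 15, 27, 30]
index = pd.Series([0, 3, 4, 9, 10, 27, 29])

# Intervals are right-closed: (-1, 3], (3, 9], (9, 15], (15, 27], (27, 30]
groups = pd.cut(index, bins, labels=range(len(bins) - 1))
print(groups.tolist())  # [0, 0, 1, 1, 2, 3, 4]
```

A gap is drawn after every index that equals a bin edge, so changing the edges moves the white bars without touching the chart code.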

Pandas plot density plot from frequency table

Let's say I have a DataFrame that looks (simplified) like this
>>> df
   freq
2     2
3    16
1    25
where the index column represents a value, and the freq column represents the frequency of occurrence of that value, as in a frequency table.
I'd like to plot a density plot for this table like the one obtained from plot kind kde. However, this kind is apparently only meant for pd.Series. My df is too large to flatten out to a 1D Series, i.e. df = [2, 2, 3, 3, 3, ..., 1, 1].
How can I plot such a density plot under these circumstances?
I know you have asked for the case where df is too large to flatten out, but the following answer works where this isn't the case:
pd.Series(df.index.repeat(df.freq)).plot.kde()
Or more generally, when the values are in a column called val and not the index:
df.val.repeat(df.freq).plot.kde()
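When the table really is too large to expand, one option is a weighted kernel density estimate that uses the frequencies as weights instead of repeating the values. The sketch below is a hand-rolled Gaussian KDE; the function name weighted_kde, the evaluation grid, and the fixed bandwidth of 0.5 are all assumptions made for illustration, not what pandas does internally.

```python
import numpy as np
import pandas as pd

def weighted_kde(values, weights, grid, bandwidth):
    """Gaussian KDE on `grid`, weighting each distinct value by its frequency."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # One Gaussian bump per distinct value, shape (len(grid), len(values))
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # Weighted average of the bumps: same result as repeating each value freq times
    return kernels @ weights / weights.sum()

df = pd.DataFrame({"freq": [2, 16, 25]}, index=[2, 3, 1])
grid = np.linspace(-2, 6, 801)
density = weighted_kde(df.index, df["freq"], grid, bandwidth=0.5)
```

The curve can then be drawn with plt.plot(grid, density). Memory use scales with the number of distinct values rather than the total count. If SciPy is available, scipy.stats.gaussian_kde also accepts a weights argument and chooses a bandwidth for you.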
You can plot a density distribution using a bar plot if you normalize the y values by the size of the population (the sum of the frequencies). With bars of unit width, this makes the area covered by the bars equal to 1.
plt.bar(
df.index,
df.freq / df.freq.sum(),
width=-1,
align='edge'
)
The width and align parameters are to make sure each bar covers the interval (k-1, k].
Somebody with better knowledge of statistics should answer whether kernel density estimation actually makes sense for discrete distributions.
Maybe this will work:
import matplotlib.pyplot as plt
plt.plot(df.index, df['freq'])
plt.show()
Seaborn was built to do this on top of Matplotlib and automatically calculates kernel density estimates if you want.
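import numpy as np
import pandas as pd
import seaborn as sns
x = pd.Series(np.random.randint(0, 20, size=10000), name='freq')
# distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement
sns.histplot(x, kde=True, stat='density')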

How can I create stacked line graph?

I would like to be able to produce a stacked line graph (similar to the method used here) with Python (preferably using matplotlib, but another library would be fine too). How can I do this?
This is similar to the stacked bar graph example on their website, except I'd like the top of each bar to be connected with a line segment and the area underneath to be filled. I might be able to approximate this by decreasing the gaps between bars and using lots of bars (but this seems like a hack, and besides I'm not sure if it is possible).
Newer versions of matplotlib contain the function plt.stackplot, which allows for several different "out-of-the-box" stacked area plots:
import numpy as np
import matplotlib.pyplot as plt
X = np.arange(0, 10, 1)
Y = X + 5 * np.random.random((5, X.size))
baseline = ["zero", "sym", "wiggle", "weighted_wiggle"]
# Draw one subplot per baseline mode
for n, v in enumerate(baseline):
    plt.subplot(2, 2, n + 1)
    plt.stackplot(X, *Y, baseline=v)
    plt.title(v)
    plt.axis('tight')
plt.show()
I believe Area Plot is a common term for this type of plot, and in the specific instance recited in the OP, Stacked Area Plot.
Matplotlib does not have an "out-of-the-box" function that combines both the data processing and drawing/rendering steps to create this type of plot, but it's easy to roll your own from components supplied by Matplotlib and NumPy.
The code below first stacks the data, then draws the plot.
import numpy as NP
from matplotlib import pyplot as PLT
# just create some random data
fnx = lambda : NP.random.randint(3, 10, 10)
y = NP.vstack((fnx(), fnx(), fnx()))  # row_stack is a deprecated alias of vstack
# this call to 'cumsum' (cumulative sum), passing in your y data,
# is necessary to avoid having to manually order the datasets
x = NP.arange(10)
y_stack = NP.cumsum(y, axis=0) # a 3x10 array
fig = PLT.figure()
ax1 = fig.add_subplot(111)
ax1.fill_between(x, 0, y_stack[0,:], facecolor="#CC6666", alpha=.7)
ax1.fill_between(x, y_stack[0,:], y_stack[1,:], facecolor="#1DACD6", alpha=.7)
ax1.fill_between(x, y_stack[1,:], y_stack[2,:], facecolor="#6E5160")
PLT.show()
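To make the stacking step concrete, here is a small numeric sketch with made-up data. The lower array is a name introduced for this illustration; it holds the bottom boundary of each band.

```python
import numpy as np

# Three hypothetical series measured at three x positions
y = np.array([[1, 2, 3],
              [2, 2, 2],
              [3, 1, 0]])

# Row k of y_stack is the upper boundary of band k
y_stack = np.cumsum(y, axis=0)
# The lower boundary of band k is row k-1 of y_stack, or zero for the first band
lower = np.vstack([np.zeros_like(y_stack[0]), y_stack[:-1]])

print(y_stack.tolist())  # [[1, 2, 3], [3, 4, 5], [6, 5, 5]]
```

With both boundaries available, the three explicit fill_between calls can be replaced by a loop such as for lo, hi in zip(lower, y_stack): ax.fill_between(x, lo, hi), which works for any number of series.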
If you have a dataframe, it's quite easy:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.area();
From: pandas documentation
A slightly less hackish way would be to use a line graph in the first place and matplotlib.pyplot.fill_between. To emulate the stacking you have to shift the points up yourself.
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 4)
y1 = np.array([1, 2, 4, 3])
y2 = np.array([5, 2, 1, 3])
# y2 should go on top, so shift it up by y1
y2s = y1 + y2
plt.plot(x, y1)
plt.plot(x, y2s)
plt.fill_between(x, y1, 0, color='blue')
plt.fill_between(x, y1, y2s, color='red')
plt.show()
