Altair: Creating a layered violin + stripplot - python

I'm trying to create a plot that contains both a violin plot and a stripplot with jitter. How do I go about doing this? I provided my attempt below. The problem that I have been encountering is that the violin plot seems to be invisible in the plots.
# 1. Create violin plot
violin = alt.Chart(df).transform_density(
"n_genes_by_counts",
as_=["n_genes_by_counts", "density"],
).mark_area(orient="horizontal").encode(
y="n_genes_by_counts:Q",
x=alt.X("Density:Q", stack="center", title=None),
)
# 2. Create stripplot
stripplot = alt.Chart(df).mark_circle(size=8, color="black").encode(
y="n_gene_by_counts",
x=alt.X("jitter:Q", title=None),
).transform_calculate(
jitter="sqrt(-2*log(random()))*cos(2*PI*random())"
)
# 3. Combine both
combined = stripplot + violin
I have a feeling that it could be a problem with the scaling of the X axis. That is, density is much, much smaller than jitter. If that's the case, how do I scale jitter so that it's on the same order of magnitude as density? Would it be possible for someone to show me how to create a violin+stripplot given a column name n_gene_by_counts that belongs to some pandas dataframe df? Here's an example image of the kind of plot I'm looking for:

As you suspected, the different scales will make the violin very small in the stripplot unless you adjust for it. In your case, you have also accidentally capitalized Density:Q in the channel encoding, which means that your violinplot is actually empty since this channel doesn't exist. This example works:
import altair as alt
from vega_datasets import data
df = data.cars()
# 1. Create violin plot
violin = alt.Chart(df).transform_density(
    "Horsepower",
    as_=["Horsepower", "density"],
).mark_area().encode(
    x="Horsepower:Q",
    y=alt.Y("density:Q", stack="center", title=None),
)
# 2. Create stripplot
stripplot = alt.Chart(df).mark_circle(size=8, color="black").encode(
    x="Horsepower",
    y=alt.Y("jitter:Q", title=None),
).transform_calculate(
    jitter="(random() / 400) + 0.0052"  # Narrowing and centering the points
)
# 3. Combine both
violin + stripplot
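The constants 400 and 0.0052 above are tuned by hand for the Horsepower column. A rough way to derive comparable numbers for another column is sketched below. It is only a sketch under two assumptions of mine, not something stated in the answer: that a normalized histogram peak is a good enough stand-in for the peak of the density Altair computes, and that stack="center" places the violin's midline at about half of that peak. The function name jitter_expression and its parameters are made up for illustration.

```python
import numpy as np

def jitter_expression(values, bins=30, fraction=0.3):
    """Return (peak, expr): an estimated density peak and a Vega expression
    string that jitters points in a band scaled to that peak."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing values
    # Approximate the density with a normalized histogram (assumption:
    # close enough to the kernel density estimate for choosing a scale)
    density, _ = np.histogram(values, bins=bins, density=True)
    peak = float(density.max())
    centre = peak / 2        # assumed midline of a centre-stacked area
    width = fraction * peak  # total width of the jitter band
    expr = f"(random() - 0.5) * {width:.6g} + {centre:.6g}"
    return peak, expr

# Made-up data roughly on the scale of a horsepower column
rng = np.random.default_rng(0)
peak, expr = jitter_expression(rng.normal(100, 40, size=400))
print(expr)
```

The resulting string could then be passed as jitter=expr to transform_calculate in place of the hand-tuned expression, adjusting fraction until the points sit comfortably inside the violin.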
By using SciPy, you could also lay out the points themselves in the shape of the violin, which I am personally quite fond of (discussion in this issue):
import altair as alt
import numpy as np
import pandas as pd
from scipy import stats
from vega_datasets import data
# NAs are not supported in SciPy's density calculation
df = data.cars().dropna()
y = 'Horsepower'
# Compute the density function of the data
dens = stats.gaussian_kde(df[y])
# Compute the density value for each data point
pdf = dens(df[y].sort_values())
# Randomly jitter points between 0 and the upper bound of the probability density function
density_cloud = np.empty(pdf.shape[0])
for i in range(pdf.shape[0]):
    density_cloud[i] = np.random.uniform(0, pdf[i])
# To create a symmetric density/violin, we make every second point negative
# Distributing every other point like this is also more likely to preserve the shape of the violin
violin_cloud = density_cloud.copy()
violin_cloud[::2] = violin_cloud[::2] * -1
# Append the density cloud to the original data in the correctly sorted order
df_with_density = pd.concat(
    [
        df,
        pd.DataFrame(
            {
                'density_cloud': density_cloud,
                'violin_cloud': violin_cloud
            },
            index=df['Horsepower'].sort_values().index
        ),
    ],
    axis=1
)
# Visualize the jittered points, with violin_cloud on the y axis
alt.Chart(df_with_density).mark_circle().encode(
x='Horsepower',
y='violin_cloud'
)
Both of these approaches will work with multiple categoricals without faceting in the next version of Altair, when support for x/y offset channels is added.
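The per-point loop in the SciPy example can probably be replaced by a single vectorized draw, since NumPy's uniform sampler accepts an array of upper bounds. The following is a minimal sketch of that idea, using only NumPy; the helper name make_violin_cloud and the pdf values are invented for illustration, and in practice pdf would come from the gaussian_kde call shown earlier.

```python
import numpy as np

def make_violin_cloud(pdf, rng=None):
    """Jitter each point between 0 and its density value, then mirror
    every second point to get a symmetric, violin-shaped cloud."""
    rng = np.random.default_rng() if rng is None else rng
    pdf = np.asarray(pdf, dtype=float)
    # One draw per point; the upper bound is that point's density value
    density_cloud = rng.uniform(0.0, pdf)
    violin_cloud = density_cloud.copy()
    violin_cloud[::2] *= -1  # flip every second point below the midline
    return density_cloud, violin_cloud

# Made-up density values standing in for dens(df[y].sort_values())
pdf_example = np.array([0.001, 0.004, 0.009, 0.012, 0.007, 0.002])
density_cloud, violin_cloud = make_violin_cloud(pdf_example, np.random.default_rng(1))
print(violin_cloud)
```

Passing a seeded generator, as in the usage lines, makes the layout reproducible between runs.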

Related

customization of plotly create_scattermatrix plots

A simple call to plotly's figure_factory routine to create a scatter matrix:
import pandas as pd
import numpy as np
from plotly import figure_factory
df = pd.DataFrame(np.random.randn(40,3))
fig = figure_factory.create_scatterplotmatrix(df, diag='histogram')
fig.show()
yields
My questions are:
How can I specify a single color for all the plots?
How can I set the axes ranges for each of the three variables on the scatter plot?
Is there a way to create a density (normalized) version of the histogram?
Is there a way to include the correlation coefficient (say, computed from df.corr()) in the upper right corner of the non-diagonal plots?
For the first question, update the marker color attribute of every trace in the generated figure data to the same color. For the second, update the axis ranges in the generated layout in the same way; the example only modifies the x axes, so apply the same technique to the y axes if necessary. For the third, set histnorm on the histogram traces on the diagonal, following the histogram example in the reference. If that does not give the normalization you want, I believe it is possible to replace the trace data with values obtained from np.histogram() or similar. For the fourth, I added the values obtained from df.corr() as annotations, specifying each subplot by its axis names.
import pandas as pd
import numpy as np
from plotly import figure_factory
import plotly.express as px
np.random.seed(20220529)
df = pd.DataFrame(np.random.randn(40,3))
density = px.histogram(df, x=[0,1,2], histnorm='probability density')
df_corr = df.corr()
fig = figure_factory.create_scatterplotmatrix(df, diag='histogram', height=600, width=600)
# 1.How can I specify a single color for all the plots?
for i in range(9):
    fig.data[i]['marker']['color'] = 'blue'
# 2.How can I set the axes ranges for each of the three variables on the scatter plot?
for axes in ['xaxis2','xaxis3','xaxis4','xaxis6','xaxis7']:
    fig.layout[axes]['range']=(-4,4)
# 3.Is there a way to create a density (normalized) version of the histogram?
fig['data'][0]['histnorm'] = 'probability density'
fig['data'][4]['histnorm'] = 'probability density'
fig['data'][8]['histnorm'] = 'probability density'
# 4.Is there a way to include the correlation coefficient (say, computed from df.corr())
# in the upper right corner of the non-diagonal plots?
for r,x,y in zip(df_corr.values.flatten(),
                 ['x1','x2','x3','x4','x5','x6','x7','x8','x9'],
                 ['y1','y2','y3','y4','y5','y6','y7','y8','y9']):
    if r == 1.0:
        pass
    else:
        fig.add_annotation(x=3.3, y=2, xref=x, yref=y, showarrow=False, text='R:'+str(round(r,2)))
fig.show()
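The annotation loop above hard-codes the axis names for a 3x3 grid and skips the diagonal by testing r == 1.0. A possible generalization is sketched below. It assumes, as the zip over the flattened correlation matrix in the answer implies, that subplot axes are numbered row by row (x1, y1 for the top-left cell, and so on). The function name correlation_annotations is hypothetical, and the sketch works on a plain NumPy array rather than a DataFrame.

```python
import numpy as np

def correlation_annotations(corr):
    """Build xref/yref/text entries for the off-diagonal cells of an
    n x n scatter matrix, assuming row-major axis numbering."""
    corr = np.asarray(corr, dtype=float)
    n = corr.shape[0]
    annotations = []
    for row in range(n):
        for col in range(n):
            if row == col:
                continue  # skip the diagonal by position, not by value
            k = row * n + col + 1  # 1-based subplot index, row-major
            annotations.append(
                {"xref": f"x{k}", "yref": f"y{k}", "text": f"R:{corr[row, col]:.2f}"}
            )
    return annotations

rng = np.random.default_rng(20220529)
sample = rng.standard_normal((40, 3))
annotations = correlation_annotations(np.corrcoef(sample, rowvar=False))
print([a["xref"] for a in annotations])
```

Each entry could then be applied with something like fig.add_annotation(x=3.3, y=2, showarrow=False, **a), reusing the position from the answer. Skipping by position avoids accidentally dropping an off-diagonal pair whose correlation happens to be exactly 1.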

While plotting scatter graph using plotly the plot is separated by categories even when the data is ordered

So I am trying to plot a scatter graph like this using plotly Code and plot without specifying color
The plot is correct as long as I don't specify the color attribute to color the point according to categories. But when I specify the color the graph turns into this and the categories are separated into 3 little plots Code and plot for after specifying color. Is there any way I can specify the color and preserve the order of the plot?
It's always better to ask an SO question with your sample code and sample data as marked-up text, not embedded in an image.
I have inferred the structure of your data frame and simulated it with random data.
Core concept: when the column used for color is categorical (string), plotly creates one trace per value in the categorical series. Given that your x axis is categorical as well (not continuous), this results in the traces being partitioned along the axis.
Since you only want one trace, I have updated marker_color of the single trace, using map() to convert each categorical value to a color.
import plotly.express as px
import pandas as pd
import numpy as np
# simulate data frame
df20 = pd.DataFrame(
    {
        "SectionContentDetailType": np.random.choice(["Lesson", "Quiz"], 100),
        "Page_Title": [
            chr(ord("A") + i // 26) + chr(ord("A") + i % 26) for i in range(100)
        ],
        "Number of Users": np.sort(np.random.randint(100, 2500, 100))[::-1],
    }
)
# create figure in same way
fig = px.scatter(
    df20, x="Page_Title", y="Number of Users", hover_data=["SectionContentDetailType"]
)
# add marker color to single trace figure
fig.update_traces(
    marker_color=df20["SectionContentDetailType"].map(
        {
            t: c
            for t, c in zip(
                df20["SectionContentDetailType"].unique(), px.colors.qualitative.Plotly
            )
        }
    )
)
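One detail of the mapping above: zip() stops at the shorter input, so if a column had more distinct categories than the palette has colors, the extra categories would be missing from the dictionary and map() would leave them without a color. A minimal sketch of a mapping that cycles through the palette instead is shown below. It is plain Python with no plotly dependency; the function name and the extra category labels are made up for illustration.

```python
from itertools import cycle

def map_categories_to_colors(categories, palette):
    """Assign each distinct category a color in order of first appearance,
    reusing palette entries once they run out."""
    mapping = {}
    colors = cycle(palette)
    for category in categories:
        if category not in mapping:
            mapping[category] = next(colors)
    return [mapping[c] for c in categories], mapping

# Illustrative three-color palette (hex strings)
palette = ["#636EFA", "#EF553B", "#00CC96"]
labels = ["Lesson", "Quiz", "Lesson", "Video", "Exam", "Quiz"]
marker_colors, mapping = map_categories_to_colors(labels, palette)
print(mapping)
```

The list marker_colors has one entry per row and could be passed to fig.update_traces(marker_color=...) in the same way as the mapped series in the answer.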

Size legend for plotly express scatterplot in Python

Here is a Plotly Express scatterplot with marker color, size and symbol representing different fields in the data frame. There is a legend for symbol and a colorbar for color, but there is nothing to indicate what marker size represents.
Is it possible to display a "size" legend? In the legend I'm hoping to show some example marker sizes and their respective values.
A similar question was asked for R and I'm hoping for a similar results in Python. I've tried adding markers using fig.add_trace(), and this would work, except I don't know how to make the sizes equal.
import pandas as pd
import plotly.express as px
import random
# create data frame
df = pd.DataFrame({
'X':list(range(1,11,1)),
'Y':list(range(1,11,1)),
'Symbol':['Yes']*5+['No']*5,
'Color':list(range(1,11,1)),
'Size':random.sample(range(10,150), 10)
})
# create scatterplot
fig = px.scatter(df, y='Y', x='X',color='Color',symbol='Symbol',size='Size')
# move legend
fig.update_layout(legend=dict(y=1, x=0.1))
fig.show()
Scatterplot Image:
Thank you
You cannot achieve this if your data is on a metric (continuous) scale, as with your range. Plotly will always try to interpret it as metric, even if it looks or is discrete in the output. So your data has to be a factor, as in R, since you are showing groups. One possible solution is to use a list comprehension and convert everything to str. I did it in two steps so you can follow:
import pandas as pd
import plotly.express as px
import random
check = sorted(random.sample(range(10,150), 10))
check = [str(num) for num in check]
# create data frame
df = pd.DataFrame({
'X':list(range(1,11,1)),
'Y':list(range(1,11,1)),
'Symbol':['Yes']*5+['No']*5,
'Color':check,
'Size':list(range(1,11,1))
})
# create scatterplot
fig = px.scatter(df, y='Y', x='X',color='Color',symbol='Symbol',size='Size')
# move legend
fig.update_layout(legend=dict(y=1, x=0.1))
fig.show()
That gives:
Keep in mind, that you also get the symbol label, as you now have TWO groups!
Maybe you want to sort the values in the list before converting to string!
Like in this picture (added it to the code above)
UPDATE
Hey There,
yes, but as far as I know, only in matplotlib, and it is a little bit hacky, as you simulate scatter plots. I can only show you a modified example from matplotlib, but maybe it helps you so you can fiddle it out by yourself:
import matplotlib.pyplot as plt
from numpy.random import randn
z = randn(10)
red_dot, = plt.plot(z, "ro", markersize=5)
red_dot_other, = plt.plot(z*2, "ro", markersize=20)
plt.legend([red_dot, red_dot_other], ["Yes", "No"], markerscale=0.5)
That gives:
As you can see, you are working with two different plots, to be exact one plot for each size shown in the legend. In the legend these plots are merged together. The size of the legend markers is further controlled through markerscale, which is applied relative to the markersize of each plot. And because we have two plots with two different markersizes, we can create a legend with different marker sizes. markerscale is normally a value between 0 and 1, but you can also use 150%, i.e. 1.5.
You can achieve this through fiddling around with the legend handler in matplotlib see here:
https://matplotlib.org/stable/tutorials/intermediate/legend_guide.html

How to make legends have colours that correspond to data points using python's plotly?

I'm following the plotly documentation for colouring a scatter graph. Here is my code:
I first create a fake data frame with the same shape as what I'm working with
import pandas
import colorlover as cl
import plotly
import plotly.graph_objs as go
import numpy
data = numpy.random.normal(0, 1, 3*6*11*2)
data = data.reshape(((3*6*11), 2))
data = pandas.DataFrame(data)
sub_experiments = ['Subexperiment_{}'.format(i) for i in [1, 2, 3]]
repeats = ['Repeat_{}'.format(i) for i in range(1, 7)]
time = ['{}h'.format(i) for i in range(11)]
array = [sub_experiments, repeats, time]
idx = pandas.MultiIndex.from_product(array, names=['SubExperiment', 'Repeats', 'Time'])
data.index = idx
Now I want to create a scatter graph with plotly:
scatters = []
for label, df in data.groupby(level=[0, 1]):
    # One colour per row of the group, taken from colorlover's qualitative 'Paired' scale
    scales = cl.scales[str(df.shape[0])]
    colour = scales['qual']['Paired']
    d = go.Scatter(
        x=df[0],
        y=df[1],
        mode='markers',
        # label is a (SubExperiment, Repeats) tuple; join it into one trace name
        name='_'.join(label),
        marker=dict(color=colour, line=dict(color='black')),
    )
    scatters.append(d)
And looks like this:
Note that since I've made up data for this example and I'm actually doing principal component analysis, the plot in the above screenshot shows clusters while the example code will not.
The problem here is that plotly has not coloured the legend like the points.
How can I colour the legend in the same way as the points ?
You are passing an array of colours to each trace, i.e. each trace has several different colours, so picking one of them for the legend doesn't really make sense (try removing color=colour to see the effect).
Based on your code I am not sure what you are expecting to see, i.e. whether the color should be based on the label, i.e subexperiment+repeat or time (as defined in the code).
In the first case, you could just pick one color for each trace. In the latter case, plotting the time and assigning one color to each time makes more sense in my opinion.

how to scale the histogram plot via matplotlib

You can see there is histogram below.
It is made like
pl.hist(data1,bins=20,color='green',histtype="step",cumulative=-1)
How to scale the histogram?
For example, let the height of the histogram be one third of what it is now.
Besides, is there a way to remove the vertical line at the left?
The matplotlib hist is actually just making calls to some other functions. It is often easier to use these directly allowing you to inspect the data and modify it directly:
import numpy as np
import matplotlib.pyplot as plt
# Generate some data
data = np.random.normal(size=1000)
# Generate the histogram data directly
hist, bin_edges = np.histogram(data, bins=10)
# Get the reversed cumulative sum
hist_neg_cumulative = [np.sum(hist[i:]) for i in range(len(hist))]
# Get the bin centres rather than the edges
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2.
# Plot
plt.step(bin_centers, hist_neg_cumulative)
plt.show()
The hist_neg_cumulative is the array of data being plotted, so you can rescale it as you wish before passing it to the plotting function. This also doesn't plot the vertical line.
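As a concrete illustration of the rescaling the question asks for, here is a minimal NumPy-only sketch that computes the same reversed cumulative counts with a cumulative sum and then divides them by three. The seeded random data is made up, and the factor of one third is taken from the question.

```python
import numpy as np

# Made-up data, seeded so the result is reproducible
rng = np.random.default_rng(0)
data = rng.normal(size=1000)

hist, bin_edges = np.histogram(data, bins=10)
# Reversed cumulative sum: entry i counts the samples in bin i and above
hist_neg_cumulative = np.cumsum(hist[::-1])[::-1]
# Rescale to one third of the original height
scaled = hist_neg_cumulative / 3.0
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2.0
print(scaled)
```

The scaled array can be passed to plt.step(bin_centers, scaled) in place of the unscaled counts.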
