draw quantile lines and connect two violin plots - python

How do I draw quantile lines and connect two violin plots in plotly in Python?
For example, there is a library to do this in R (https://github.com/GRousselet/rogme). The library provided does not necessarily work when there are more than two groups.

As far as I know, there is no built-in method for something this specific in Plotly. The best you can do is probably to draw the lines yourself, and consider writing a function or some loops if you need to do this for multiple groups of data and different quantile values.
Here is how I would get started. You can create a list or array to store the coordinates of the lines if you want to connect the same quantiles across the grouped violin plots. I acknowledge that what I have at the moment is hacky, as it relies on groups in Plotly having y-coordinates starting at 0 and increasing by 1. There might be a way to access the y-coordinates of grouped violin plots; I'd recommend looking into the documentation.
Some more work will be needed if you want to add text boxes to indicate the values of the quantiles.
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
# generate some random data that is normally distributed
np.random.seed(42)
y1 = np.random.normal(0, 1, 1000) * 1.5 + 6
y2 = np.random.normal(0, 5, 1000) + 6
# group the data together and combine into one dataframe
df1 = pd.DataFrame({'Group': 'Group1', 'Values': y1})
df2 = pd.DataFrame({'Group': 'Group2', 'Values': y2})
df_final = pd.concat([df1, df2])
fig = px.strip(df_final, x='Values', y='Group', color_discrete_sequence=['grey'])
quantiles_list = [0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95]
## this is a bit hacky and relies on y coordinates for groups starting from 0 and increasing by 1
y_diff = 0
## these store the coordinates in order to connect the quantile lines
lower_coordinates, upper_coordinates = [], []
for group_name in df_final.Group.unique():
    for quantile in quantiles_list:
        quantile_value = np.quantile(df_final[df_final['Group'] == group_name].Values, quantile)
        if group_name == 'Group1':
            lower_coordinates.append((quantile_value, 0.2+1*y_diff))
        if group_name == 'Group2':
            upper_coordinates.append((quantile_value, -0.2+1*y_diff))
        fig.add_shape(
            # short line marking this quantile on the current group's row
            dict(
                type="line",
                x0=quantile_value,
                y0=-0.2+1*y_diff,
                x1=quantile_value,
                y1=0.2+1*y_diff,
                line=dict(
                    color="black",
                    width=4
                )
            ),
        )
    y_diff += 1
## draw connecting lines
for idx in range(len(upper_coordinates)):
    fig.add_shape(
        dict(
            type="line",
            x0=lower_coordinates[idx][0],
            y0=lower_coordinates[idx][1],
            x1=upper_coordinates[idx][0],
            y1=upper_coordinates[idx][1],
            line=dict(
                color="chocolate",
                width=4
            )
        ),
    )
fig.show()
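For more than two groups, the coordinate bookkeeping above could be moved into a helper. The following is only a sketch under the same assumption as the answer (group i is drawn at y = i); the function name quantile_segments and its parameters are hypothetical, not part of Plotly.

```python
import numpy as np
import pandas as pd

def quantile_segments(df, group_col, value_col, quantiles, half_height=0.2):
    """Return (ticks, connectors); each item is an (x0, y0, x1, y1) tuple.

    Assumption: the i-th group (in order of appearance) is drawn at y == i.
    """
    groups = list(df[group_col].unique())
    # quantile values per group, in the same order as `quantiles`
    q = {g: np.quantile(df.loc[df[group_col] == g, value_col], quantiles)
         for g in groups}
    # one short vertical tick per (group, quantile)
    ticks = [(x, i - half_height, x, i + half_height)
             for i, g in enumerate(groups) for x in q[g]]
    # one connector per quantile between each pair of adjacent groups
    connectors = [(x_lo, i + half_height, x_hi, i + 1 - half_height)
                  for i, (g_lo, g_hi) in enumerate(zip(groups, groups[1:]))
                  for x_lo, x_hi in zip(q[g_lo], q[g_hi])]
    return ticks, connectors

# made-up demo data with three groups
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Group": np.repeat(["Group1", "Group2", "Group3"], 200),
    "Values": np.concatenate([rng.normal(6, 1.5, 200),
                              rng.normal(6, 5, 200),
                              rng.normal(8, 2, 200)]),
})
ticks, connectors = quantile_segments(demo, "Group", "Values", [0.25, 0.5, 0.75])
```

Each tuple can then be passed to fig.add_shape(type="line", x0=..., y0=..., x1=..., y1=...) in a loop, in the same way as the shapes drawn above.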

Related

Having trouble plotting histograms for continuous quantitative variables

I wanted to draw histograms(10 bins) for continuous quantitative variables within the range of their values. I've tried a few methods but the histograms don't look right.
I want to use the columns 'pricesold' and 'mileage' to draw histograms.
The 'mileage' column max is 350,000, and the 'pricesold' column max is 20,000.
The dataframe:
The code I've tried:
from matplotlib import pyplot as plt
x1 = df['pricesold']
x2 = df['Mileage']
plt.hist(df['pricesold'], bins=10)
plt.hist(df['Mileage'], bins = 10)
plt.hist([x1,x2], label = ['Price Sold', 'Mileage'], bins = 10)
plt.show()
I would like to suggest a different approach.
Sample input:
import random
import pandas as pd
df = pd.DataFrame({'pricesold': [abs(random.gauss(150_000, 75_000)) for i in range(160)],
                   'mileage': [abs(random.gauss(10_000, 5_000)) for i in range(160)]
                   })
Create a new column for the mileage bins:
df['mileage_bins'] = pd.cut(df['mileage'], bins=10).astype(str)
Plotting in a prettier way:
import plotly.express as px
fig = px.histogram(df, x='pricesold', y='mileage', color='mileage_bins', width=600, nbins=10)
fig.show()
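One possible reason the histograms in the question look wrong is that all three plt.hist calls draw on the same axes, even though the two columns have very different ranges. As a hedged sketch, the 10 bins can be computed separately for each column over that column's own range. The dataframe df_demo below is a made-up stand-in for the question's data, and ten_bin_histogram is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# hypothetical stand-in for the question's dataframe
df_demo = pd.DataFrame({
    "pricesold": rng.uniform(0, 20_000, 500),
    "mileage": rng.uniform(0, 350_000, 500),
})

def ten_bin_histogram(values, bins=10):
    """Counts and bin edges spanning exactly [min, max] of the given column."""
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=bins,
                                 range=(values.min(), values.max()))
    return counts, edges

price_counts, price_edges = ten_bin_histogram(df_demo["pricesold"])
mileage_counts, mileage_edges = ten_bin_histogram(df_demo["mileage"])
```

The edges can be passed to plt.hist(column, bins=edges), drawing each column on its own axes (for example via plt.subplots(1, 2)) so that the two scales do not interfere.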

Connecting data points with lines in a Plotly boxplot in Python

I am working on some boxplots. I found this code very helpful and I managed to replicate it for my needs:
import plotly.express as px
import plotly.graph_objects as go  # needed for go.Box below
import numpy as np
import pandas as pd
np.random.seed(1)
y0 = np.random.randn(50) - 1
y1 = np.random.randn(50) + 1
df = pd.DataFrame({'graph_name':['trace 0']*len(y0)+['trace 1']*len(y1),
'value': np.concatenate([y0,y1],0),
'color':np.random.choice([0,1,2,3,4,5,6,7,8,9], size=100, replace=True)}
)
fig = px.strip(df,
x='graph_name',
y='value',
color='color',
stripmode='overlay')
fig.add_trace(go.Box(y=df.query('graph_name == "trace 0"')['value'], name='trace 0'))
fig.add_trace(go.Box(y=df.query('graph_name == "trace 1"')['value'], name='trace 1'))
fig.update_layout(autosize=False,
width=600,
height=600,
legend={'traceorder':'normal'})
fig.show()
I am now trying to put some lines connecting the datapoints with the same colors, but I am lost. Any idea?
Something similar to this:
My first idea was to add lines to your figure by using plotly shapes and specifying the start and end points in x- and y-axis coordinates. However, when you use px.strip, plotly implements jittering (adding randomly generated small values, say between -0.1 and 0.1, to the x-coordinates under the hood to avoid points overlapping), but as far as I know, there is no way to retrieve the exact x-coordinates of each point.
However we can get around this by using go.Scatter to plot all the paired points individually, adding jittering as needed to the x-values and connecting each pair of points with a line. We are basically implementing px.strip ourselves but with full control of the exact coordinates of each point.
In order to toggle colors the same way that px.strip allows you to, we need to assign all points of the same color to the same legendgroup, and only show the legend entry the first time a color is plotted (as we don't want a legend entry for each point).
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import pandas as pd
np.random.seed(1)
y0 = np.random.randn(50) - 1
y1 = np.random.randn(50) + 1
## sort both sets of data so we can easily connect them with line annotations
y0.sort()
y1.sort()
df = pd.DataFrame({'graph_name':['trace 0']*len(y0)+['trace 1']*len(y1),
'value': np.concatenate([y0,y1],0)}
# 'color':np.random.choice([0,1,2,3,4,5,6,7,8,9], size=100, replace=True)}
)
fig = go.Figure()
## i will set jittering to 0.1
x0 = np.array([0]*len(y0)) + np.random.uniform(-0.1,0.1,len(y0))
x1 = np.array([1]*len(y0)) + np.random.uniform(-0.1,0.1,len(y0))
## px.colors.qualitative.D3 contains 10 distinct colors
## (alternative: colors_list = np.random.choice(px.colors.qualitative.D3, size=50))
## for simplicity, we repeat the palette 5 times instead of selecting randomly
## this guarantees the colors appear in order in the legend
colors_list = px.colors.qualitative.D3*5
color_number = {i:color for color,i in enumerate(px.colors.qualitative.D3)}
## keep track of whether the color is showing up for the first time as we build out the legend
colors_legend = {color:False for color in colors_list}
for x_start,x_end,y_start,y_end,color in zip(x0,x1,y0,y1,colors_list):
    ## if the color hasn't been added to the legend yet, add a legend entry
    if not colors_legend[color]:
        fig.add_trace(
            go.Scatter(
                x=[x_start,x_end],
                y=[y_start,y_end],
                mode='lines+markers',
                marker=dict(color=color),
                line=dict(color="rgba(100,100,100,0.5)"),
                legendgroup=color_number[color],
                name=color_number[color],
                showlegend=True,
                hoverinfo='skip'
            )
        )
        colors_legend[color] = True
    ## otherwise omit the legend entry, but add it to the same legend group
    else:
        fig.add_trace(
            go.Scatter(
                x=[x_start,x_end],
                y=[y_start,y_end],
                mode='lines+markers',
                marker=dict(color=color),
                line=dict(color="rgba(100,100,100,0.5)"),
                legendgroup=color_number[color],
                showlegend=False,
                hoverinfo='skip'
            )
        )
fig.add_trace(go.Box(y=df.query('graph_name == "trace 0"')['value'], name='trace 0'))
fig.add_trace(go.Box(y=df.query('graph_name == "trace 1"')['value'], name='trace 1'))
fig.update_layout(autosize=False,
width=600,
height=600,
legend={'traceorder':'normal'})
fig.show()
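The "show the legend entry only the first time a color appears" bookkeeping from the loop above can also be computed up front. This is a small sketch in plain Python; first_occurrence_flags is a hypothetical helper name, and the color strings are just example values.

```python
def first_occurrence_flags(labels):
    """Return one bool per label: True only the first time that label appears."""
    seen = set()
    flags = []
    for label in labels:
        flags.append(label not in seen)
        seen.add(label)
    return flags

# example: three distinct colors, two of them repeated
colors = ["#1f77b4", "#ff7f0e", "#1f77b4", "#2ca02c", "#ff7f0e"]
flags = first_occurrence_flags(colors)
```

With the flags precomputed, the two branches of the loop could be merged into a single go.Scatter call that passes showlegend=flag.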

Rounding Numbers in a Quartile Figures of a Plotly Box Plot

I have been digging around for a while trying to figure out how to round the numbers displayed in the quartile figures shown in the hover feature. There must be a straightforward way to do this, as there is with the x and y coordinates. In this case rounding to two decimals would be sufficient.
import pandas as pd
import plotly.graph_objects as go
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/tips.csv")
fig = go.Figure(data=go.Box(y=df['total_bill'],
name='total_bill',
boxmean=True,
)
)
fig.update_layout(width=800, height=800,
hoverlabel=dict(bgcolor="white",
font_size=16,
font_family="Arial",
)
)
fig.show()
Unfortunately this is something that it looks like Plotly cannot easily do. If you modify the hovertemplate, it will only apply to markers that you hover over (the outliers), and the decimals after each of the boxplot statistics will remain unchanged upon hovering. Another issue with plotly-python is that you cannot extract the boxplot statistics because this would require you to interact with the javascript under the hood.
However, you can calculate the boxplot statistics yourself using the same method as Plotly and round each of them to two decimal places. Then you can pass the statistics (lowerfence, q1, median, mean, q3, upperfence) to go.Box so that Plotly constructs the boxplot from them, and plot the outliers as a separate scatter trace.
This is a pretty ugly hack because you are essentially redoing all of the calculations Plotly already does, and then constructing the boxplot manually, but it does force the boxplot statistics to display to two decimal places.
from math import floor, ceil
from numpy import mean
import pandas as pd
import plotly.graph_objects as go
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/tips.csv")
## calculate quartiles as outlined in the plotly documentation
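def get_percentile(data, p):
    data = sorted(data)  # sort a copy so the caller's array is left untouched
    n = len(data)
    x = n*p + 0.5
    x1, x2 = floor(x), ceil(x)
    if x1 == x2:  # x is a whole number: no interpolation needed (avoids dividing by zero)
        return round(data[x1-1], 2)
    y1, y2 = data[x1-1], data[x2-1] # account for zero-indexing
    return round(y1 + ((x - x1) / (x2 - x1))*(y2 - y1), 2)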
## calculate all boxplot statistics
y = df['total_bill'].values
lowerfence = min(y)
q1, median, q3 = get_percentile(y, 0.25), get_percentile(y, 0.50), get_percentile(y, 0.75)
upperfence = max([y0 for y0 in y if y0 < (q3 + 1.5*(q3-q1))])
## construct the boxplot
fig = go.Figure(data=go.Box(
x=["total_bill"]*len(y),
q1=[q1], median=[median], mean=[round(mean(y),2)],
q3=[q3], lowerfence=[lowerfence],
upperfence=[upperfence], orientation='v', showlegend=False,
)
)
outliers = y[y>upperfence]
fig.add_trace(go.Scatter(x=["total_bill"]*len(outliers), y=outliers, showlegend=False, mode='markers', marker={'color':'#1f77b4'}))
fig.update_layout(width=800, height=800,
hoverlabel=dict(bgcolor="white",
font_size=16,
font_family="Arial",
)
)
fig.show()
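The code above takes min(y) as the lower fence and only plots outliers above the upper fence, which is enough when the data has no low outliers. A hedged sketch of a two-sided version, using the usual 1.5 × IQR rule, might look like this; box_fences is a hypothetical helper and the sample values are made up.

```python
import numpy as np

def box_fences(values, q1, q3):
    """Return (lowerfence, upperfence, outliers) using the 1.5 * IQR rule."""
    values = np.asarray(values, dtype=float)
    iqr = q3 - q1
    lo_limit, hi_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = values[(values >= lo_limit) & (values <= hi_limit)]
    outliers = values[(values < lo_limit) | (values > hi_limit)]
    # the fences are the most extreme points that are still inside the limits
    return inside.min(), inside.max(), outliers

sample = [-50, 1, 2, 3, 4, 5, 6, 7, 8, 40]
lowerfence, upperfence, outliers = box_fences(sample, q1=2, q3=7)
```

The returned fences can be passed to go.Box as lowerfence=[...] and upperfence=[...], and the outliers plotted as the extra scatter trace.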
For me, setting yaxis_tickformat=",.2f" worked:
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/tips.csv")
fig = go.Figure(data=go.Box(y=df['total_bill'],
name='total_bill',
boxmean=True,
)
)
fig.update_layout(width=800, height=800,
# >>>>
yaxis_tickformat=",.2f",
# <<<<
hoverlabel=dict(bgcolor="white",
font_size=16,
font_family="Arial",
)
)
fig.show()
If you want the y-axis tick labels to stay unchanged, you can additionally override them by setting the tick values and tick text explicitly:
fig.update_layout(
    yaxis=dict(
        tickformat=",.2f",
        tickmode='array',
        tickvals=[10, 20, 30, 40, 50],
        ticktext=["10", "20", "30", "40", "50"],
    ))
(tested on plotly 5.8.2)

Altair Scatterplot adds unwanted lines

When layered above a heatmap, the Altair scatterplot only seems to work if the point values are also on the axis of the heatmap. In any other case, white lines along the x- and y-values are added. Here's a minimal example:
import streamlit as st
import altair as alt
import numpy as np
import pandas as pd
# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(),
'y': y.ravel(),
'z': z.ravel()})
c = alt.Chart(source).mark_rect().encode(
x='x:O',
y='y:O',
color='z:Q'
)
scatter_source = pd.DataFrame({'x': [-1.001,-3], 'y': [0,1]})
s = alt.Chart(scatter_source).mark_circle(size=100).encode(
x='x:O',
y='y:O')
st.altair_chart(c + s)
Is there any way to prevent this behavior? I'd like to animate the points later on, so adding values to the heatmap axis is not an option.
Ordinal encodings (marked by :O) will always create a discrete axis with one bin per unique value. It sounds like you would like to visualize your data with a quantitative encoding (marked by :Q), which creates a continuous, real-valued axis.
In the case of the heatmap, though, this complicates things: if you're no longer treating the data as ordered categories, you must specify the starting and ending point for each bin along each axis. This requires some thought about what your bins represent: does the value "2" represent numbers spanning from 2 to 3? from 1 to 2? from 1.5 to 2.5? The answer will depend on context.
Here is an example of computing these bin boundaries using a calculate transform, assuming the values represent the center of unit bins:
c = alt.Chart(source).transform_calculate(
x1=alt.datum.x - 0.5,
x2=alt.datum.x + 0.5,
y1=alt.datum.y - 0.5,
y2=alt.datum.y + 0.5,
).mark_rect().encode(
x='x1:Q', x2='x2:Q',
y='y1:Q', y2='y2:Q',
color='z:Q'
).properties(
width=400, height=400
)
scatter_source = pd.DataFrame({'x': [-1.001,-3], 'y': [0,1]})
s = alt.Chart(scatter_source).mark_circle(size=100).encode(
x='x:Q',
y='y:Q'
)
st.altair_chart(c + s)
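As an alternative sketch, the same bin boundaries could be precomputed in pandas before the data reaches Altair, again assuming that each value is the center of a unit-width bin. The helper add_bin_edges is hypothetical; the column names x1, x2, y1, y2 follow the calculate transform above.

```python
import numpy as np
import pandas as pd

# same grid as in the question
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
source = pd.DataFrame({"x": x.ravel(),
                       "y": y.ravel(),
                       "z": (x ** 2 + y ** 2).ravel()})

def add_bin_edges(df, cols=("x", "y"), width=1.0):
    """Add <col>1 / <col>2 columns holding the lower and upper edge of each bin."""
    out = df.copy()
    for col in cols:
        out[f"{col}1"] = out[col] - width / 2
        out[f"{col}2"] = out[col] + width / 2
    return out

binned = add_bin_edges(source)
```

The binned dataframe can then be passed to alt.Chart with the same x='x1:Q', x2='x2:Q', y='y1:Q', y2='y2:Q' encoding, without the transform_calculate step.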
Alternatively, if you would like this binning to happen more automatically, you can use a bin transform on each axis:
c = alt.Chart(source).mark_rect().encode(
x=alt.X('x:Q', bin=True),
y=alt.Y('y:Q', bin=True),
color='z:Q'
).properties(
width=400,
height=400
)
scatter_source = pd.DataFrame({'x': [-1.001,-3], 'y': [0,1]})
s = alt.Chart(scatter_source).mark_circle(size=100).encode(
x='x:Q',
y='y:Q'
)

Plotly: How to set a fill color between two vertical lines?

Using matplotlib, we can "trivially" fill the area between two vertical lines using fill_between() as in the example:
https://matplotlib.org/3.2.1/gallery/lines_bars_and_markers/fill_between_demo.html#selectively-marking-horizontal-regions-across-the-whole-axes
Using matplotlib, I can make what I need:
We have two signals, and I'm computing the rolling/moving Pearson's and Spearman's correlation. When the correlations go either below -0.5 or above 0.5, I want to shade the period (blue for Pearson's and orange for Spearman's). I also darken the weekends in gray in all plots.
However, I'm having a hard time accomplishing the same using Plotly. It would also be helpful to know how to do it between two horizontal lines.
Note that I'm using Plotly and Dash to speed up the visualization of several plots. Users asked for a more "dynamic type of thing." However, I'm not a GUI guy and cannot spend time on this, although I need to feed them with initial results.
BTW, I tried Bokeh in the past, and I gave up for some reason I cannot remember. Plotly looks good since I can use either from Python or R, which are my main development tools.
Thanks,
Carlos
I don't think there is any built-in Plotly method that is equivalent to matplotlib's fill_between() method. However, you can draw shapes, so a possible workaround is to draw a grey rectangle and set the parameter layer="below" so that the signal is still visible. You can also set the coordinates of the rectangle outside of your axis range to ensure the rectangle extends to the edges of the plot.
You can fill the area in between horizontal lines by drawing a rectangle and setting the axes ranges in a similar manner.
import numpy as np
import plotly.graph_objects as go
x = np.arange(0, 4 * np.pi, 0.01)
y = np.sin(x)
fig = go.Figure()
fig.add_trace(go.Scatter(
x=x,
y=y
))
# hard-code the axes
fig.update_xaxes(range=[0, 4 * np.pi])
fig.update_yaxes(range=[-1.2, 1.2])
# specify the corners of the rectangles
fig.update_layout(
    shapes=[
        dict(
            type="rect",
            xref="x",
            yref="y",
            x0=4,
            y0=-1.3,
            x1=5,
            y1=1.3,
            fillcolor="lightgray",
            opacity=0.4,
            line_width=0,
            layer="below"
        ),
        dict(
            type="rect",
            xref="x",
            yref="y",
            x0=9,
            y0=-1.3,
            x1=10,
            y1=1.3,
            fillcolor="lightgray",
            opacity=0.4,
            line_width=0,
            layer="below"
        ),
    ]
)
fig.show()
You haven't provided a data sample, so I'm going to use a synthetic time series to show you how you can add a number of shapes with defined start and stop dates for several different categories using a custom function bgLevels.
Two vertical lines with a fill between them very quickly turn into a rectangle, and rectangles can easily be added as shapes using fig.add_shape. The example below will show you how to find start and stop dates for periods given by certain criteria. In your case these criteria are whether or not the value of a variable is higher or lower than a certain level.
Using shapes instead of traces with fig.add_trace() lets you define the position relative to the plot layers using layer='below'. And the shape outlines can easily be hidden using line=dict(color="rgba(0,0,0,0)").
Plot 1: Time series figure with random data
Plot 2: Background is set to a semi-transparent grey when A > 100
Plot 3: Background is also set to a semi-transparent red when D < -60
Complete code:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import datetime
pd.set_option('display.max_rows', None)
# data sample
nperiods = 200
np.random.seed(123)
df = pd.DataFrame(np.random.randint(-10, 12, size=(nperiods, 4)),
columns=list('ABCD'))
datelist = pd.date_range(datetime.datetime(2020, 1, 1).strftime('%Y-%m-%d'),periods=nperiods).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
df.iloc[0] = 0
df = df.cumsum().reset_index()
# plotly setup
fig = px.line(df, x='dates', y=df.columns[1:])
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='rgba(0,0,255,0.1)')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='rgba(0,0,255,0.1)')
# function to set background color for a
# specified variable and a specified level
# (note: it reads the module-level dataframe df)
def bgLevels(fig, variable, level, mode, fillcolor, layer):
    """
    Set a specified color as background for given
    levels of a specified variable using a shape.
    Keyword arguments:
    ==================
    fig -- plotly figure
    variable -- column name in a pandas dataframe
    level -- int or float
    mode -- set threshold above or below
    fillcolor -- any color type that plotly can handle
    layer -- position of shape in plotly figure, like "below"
    """
    if mode == 'above':
        m = df[variable].gt(level)
    elif mode == 'below':
        m = df[variable].lt(level)
    else:
        raise ValueError("mode must be 'above' or 'below'")
    # first and last date of each consecutive run where the condition holds
    df1 = df[m].groupby((~m).cumsum())['dates'].agg(['first','last'])
    for index, row in df1.iterrows():
        #print(row['first'], row['last'])
        fig.add_shape(type="rect",
                      xref="x",
                      yref="paper",
                      x0=row['first'],
                      y0=0,
                      x1=row['last'],
                      y1=1,
                      line=dict(color="rgba(0,0,0,0)",width=3,),
                      fillcolor=fillcolor,
                      layer=layer)
    return fig
fig = bgLevels(fig = fig, variable = 'A', level = 100, mode = 'above',
fillcolor = 'rgba(100,100,100,0.2)', layer = 'below')
fig = bgLevels(fig = fig, variable = 'D', level = -60, mode = 'below',
fillcolor = 'rgba(255,0,0,0.2)', layer = 'below')
fig.show()
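The groupby step inside bgLevels is the least obvious part, so here is a tiny worked sketch of the same idea on made-up data: cumulatively counting the rows where the condition is False gives every consecutive run of True rows its own id. The column names t and value are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"t": [0, 1, 2, 3, 4],
                    "value": [1, 5, 6, 2, 7]})

m = toy["value"].gt(4)        # condition: [False, True, True, False, True]
run_id = (~m).cumsum()[m]     # same id for consecutive True rows: [1, 1, 2]
runs = toy.loc[m, "t"].groupby(run_id).agg(["first", "last"])
# runs has one row per run: t from 1 to 2, and t from 4 to 4
```

Each row of runs corresponds to one rectangle, with first and last playing the role of x0 and x1.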
I think that fig.add_hrect() and fig.add_vrect() are the simplest approaches to reproducing Matplotlib's fill_between functionality in this case:
https://plotly.com/python/horizontal-vertical-shapes/
For your example, add_vrect() should do the trick.
