I am trying to create a chart in Altair depicting two different processes on the same timescale. Here is an example made in Excel.
I generated the stacked horizontal bar chart below in Excel using the following data. The numbers in red are offsets/gaps and should not be displayed in the final plot. There is nothing special about these numbers, so please feel free to use any other set.
I would have posted an attempt, but I am completely out of my depth guessing what functionality to begin with. Any help would be greatly appreciated.
Here's an example of how you might make a chart like this, using a conditional opacity to hide the offset values:
import altair as alt
import pandas as pd

df = pd.DataFrame({
    'axis': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'value': [0.5, 0.9, 2, 1, 3, 1, 0.8, 1, 1.4, 1.1, 4.1, 0.3, 1.1],
    'label': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', None, 'C', None, 'F', 'G']
})

# reset_index() exposes the row order as an 'index' column, used below for stacking order
alt.Chart(df.reset_index()).mark_bar().encode(
    x='value:Q',
    y=alt.Y('axis:O', scale=alt.Scale(domain=[2, 1])),
    color=alt.Color('label:N', legend=None),
    # rows without a label are the offsets, so draw them fully transparent
    opacity=alt.condition('isValid(datum.label)', alt.value(1), alt.value(0)),
    order=alt.Order('index:Q', sort='ascending')
)
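If the processes are recorded as start and end times rather than as pre-computed offsets, the `axis`/`value`/`label` rows used above could be derived along these lines. This is only a sketch: the helper name `to_stacked_rows` and the sample times are hypothetical, and it assumes the segments on one axis do not overlap.

```python
def to_stacked_rows(axis, segments):
    """Turn (label, start, end) segments into stacked-bar rows.

    Gaps between segments become rows whose label is None, which is
    what the conditional opacity relies on to hide them.
    Assumes the segments do not overlap.
    """
    rows = []
    cursor = 0.0  # right edge of the bar built so far
    for label, start, end in sorted(segments, key=lambda s: s[1]):
        if start > cursor:
            # invisible offset filling the gap before this segment
            rows.append({'axis': axis, 'value': start - cursor, 'label': None})
        rows.append({'axis': axis, 'value': end - start, 'label': label})
        cursor = end
    return rows


# Hypothetical start/end times for the second process
rows = to_stacked_rows(2, [('C', 1.5, 2.5), ('F', 6.5, 7.0)])
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` and concatenated with the rows for the other axis before charting.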
There is surprisingly little information out there about Python and the pyalluvial package. I'm hoping to combine stacked bars and a corresponding alluvial plot in the same figure.
In the data below I have three unique groups, given in Group. I want to display the proportion of each Group for each unique Point. I have the data formatted this way because I need a separate stacked bar for each Point.
The overall column (Ove) gives the overall proportion taken across all three Points: Group 1 makes up 70%, Group 2 makes up 20%, and Group 3 makes up 10%. But the proportion of each group changes from Point to Point. I'm hoping to show this as a standard stacked bar chart with the alluvial drawn over the top.
import pandas as pd
import pyalluvial.alluvial as alluvial

df = pd.DataFrame({
    'Group': [1, 2, 3],
    'Ove': [0.7, 0.2, 0.1],
    'Point 1': [0.8, 0.1, 0.1],
    'Point 2': [0.6, 0.2, 0.2],
    'Point 3': [0.7, 0.3, 0.0],
})

ax = alluvial.plot(
    df=df,
    xaxis_names=['Group', 'Point 1', 'Point 2', 'Point 3'],
    y_name='Ove',
    alluvium='Group',
)
The output shows the overall group proportion (first bar) correctly, but the stacked bars that follow do not show the proportions.
If I reshape the df so that the Points are in a single column, I don't get three separate bars.
As correctly pointed out by #darthbaba, pyalluvial expects the dataframe format to consist of frequencies matching different variable-type combinations. To give you an example of a valid input, each Point in each Group has been labelled as present (1) or absent (0):
import pandas as pd
import pyalluvial.alluvial as alluvial

# One row per (Group, Point 1, Point 2, Point 3) combination, with its frequency
df = pd.DataFrame({
    'Group': [1] * 6 + [2] * 6 + [3] * 6,
    'Point 1': [1, 1, 1, 1, 0, 0] * 3,
    'Point 2': [0, 1, 0, 1, 1, 0] * 3,
    'Point 3': [0, 0, 1, 1, 1, 1] * 3,
    'freq': [23, 11, 5, 7, 10, 12, 17, 3, 6, 17, 19, 20, 28, 4, 13, 8, 14, 9]
})

fig = alluvial.plot(df=df, xaxis_names=['Point 1', 'Point 2', 'Point 3'],
                    y_name='freq', alluvium='Group', ignore_continuity=False)
Clearly, the above code doesn't resolve the issue since pyalluvial has yet to support the inclusion of stacked bars, much like how it's implemented in ggalluvial (see example #5). Therefore, unless you want to use ggalluvial, your best option IMO is to add the required functionality yourself. I'd start by modifying line #85.
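For reference, a frequency table in the shape described above (one row per combination plus a count) can be built from raw observations with a plain pandas `groupby`. This is a minimal sketch under assumptions: the `raw` data below is invented for illustration, and only the column layout mirrors the example in this answer.

```python
import pandas as pd

# Hypothetical raw observations: one row per observed unit,
# with 1/0 marking whether it was present at each Point.
raw = pd.DataFrame({
    'Group':   [1, 1, 1, 2, 2, 3],
    'Point 1': [1, 1, 0, 1, 0, 1],
    'Point 2': [0, 0, 1, 1, 1, 0],
    'Point 3': [1, 1, 1, 0, 0, 0],
})

# Count how often each (Group, Point 1, Point 2, Point 3) combination occurs
freq = (
    raw.groupby(['Group', 'Point 1', 'Point 2', 'Point 3'])
       .size()
       .reset_index(name='freq')
)
```

Here `freq` has one row per distinct combination and a `freq` column holding the counts, which matches the layout of the valid input shown above.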
How can I label only the points where X >= 3? I don't see any points labelled with this output.
This is very similar to the simple labelled points example but I feel like I am missing something simple.
import altair as alt
import pandas as pd

source = pd.DataFrame({
    'x': [1, 3, 5, 7, 9],
    'y': [1, 3, 5, 7, 9],
    'label': ['A', 'B', 'C', 'D', 'E']
})

points = alt.Chart(source).mark_point().encode(
    x='x:Q',
    y='y:Q'
)

# The condition below is the part that does not label any points
text = points.mark_text(align='left', baseline='middle', dx=7).encode(
    text=alt.condition(alt.FieldGTEPredicate('x:Q', 3), 'label', alt.value(' '))
)

points + text
Predicates do not recognize encoding type shorthands; you should use the field name directly:
text=alt.condition(alt.FieldGTEPredicate('x', 3), 'label', alt.value(' '))
Even better, since this is essentially a filter operation, is to use a filter transform in place of the conditional value:
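import altair as alt
import pandas as pd

source = pd.DataFrame({
    'x': [1, 3, 5, 7, 9],
    'y': [1, 3, 5, 7, 9],
    'label': ['A', 'B', 'C', 'D', 'E']
})

points = alt.Chart(source).mark_point().encode(
    x='x:Q',
    y='y:Q'
)

# Keep only the rows with x >= 3 in the text layer
text = points.transform_filter(
    alt.datum.x >= 3
).mark_text(align='left', baseline='middle', dx=7).encode(text='label')

points + text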
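A third option, not part of the answer above, is to prepare the labels in pandas before the chart is built, so that the text layer needs neither a condition nor a filter. The sketch below assumes a new, hypothetical column name `shown_label`; it blanks the label wherever `x` is below 3.

```python
import pandas as pd

source = pd.DataFrame({
    'x': [1, 3, 5, 7, 9],
    'y': [1, 3, 5, 7, 9],
    'label': ['A', 'B', 'C', 'D', 'E']
})

# Keep the label where x >= 3, otherwise use an empty string
source['shown_label'] = source['label'].where(source['x'] >= 3, '')
```

The text layer could then encode `text='shown_label'` directly. The filter transform keeps this logic inside the chart specification, which may be preferable when the data frame is shared with other charts.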
I've created a bar chart as described here where I have multiple variables (indicated in the 'value' column) and they belong to repeat groups. I've colored the bars by their group membership.
I want to create a legend ultimately equivalent to the colors dictionary, showing the color corresponding to a given group membership.
Code here:
import pandas as pd

d = {'value': [1, 2, 4, 5, 7, 10], 'group': [1, 2, 3, 2, 2, 3]}
df = pd.DataFrame(data=d)
colors = {1: 'r', 2: 'b', 3: 'g'}
df['value'].plot(kind='bar', color=[colors[i] for i in df['group']])
In this way, I just get a legend with one color (1) instead of (1, 2, 3).
You can use seaborn:
import seaborn as sns
sns.barplot(data=df, x=df.index, y='value',
            hue='group', palette=colors, dodge=False)
With pandas, you could create your own legend as follows:
from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd

d = {'value': [1, 2, 4, 5, 7, 10], 'group': [1, 2, 3, 2, 2, 3]}
df = pd.DataFrame(data=d)
colors = {1: 'r', 2: 'b', 3: 'g'}
df['value'].plot(kind='bar', color=[colors[i] for i in df['group']])

# Build one legend entry per group from the colors dictionary
handles = [mpatches.Patch(color=colors[i]) for i in colors]
labels = [f'group {i}' for i in colors]
plt.legend(handles, labels)
I want to take the weighted mean of a column in a group-by statement, like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
'value': [0.4, 0.3, 0.2, 0.4, 0.3, 0.2],
'weight': [2, 2, 4, 3, 1, 2]})
df_grouped = df.groupby('group')[['value', 'weight']].apply(lambda x: sum(x['value']*x['weight'])/sum(x['weight']))
A 0.275000
B 0.316667
dtype: float64
So far all is well. However, in some cases the weights sum to zero, for instance
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
'value': [0.4, 0.3, 0.2, 0.4, 0.3, 0.2],
'weight': [1, 2, 3, 0, 0, 0]})
In this case I want to take the simple mean. The above expression obviously fails because of a division by zero.
The method I currently use is to replace the weights with one wherever a group's weights sum to zero:
df_temp = df.groupby('group')['weight'].transform('sum').reset_index()
df['new_weight'] = np.where(df_temp['weight']==0, 1, df['weight'])
df_grouped = df.groupby('group')[['value', 'new_weight']].apply(lambda x: sum(x['value']*x['new_weight'])/sum(x['new_weight']))
This is an OK solution. But can this be achieved with a one-liner, for instance with some special function?
If it needs to be a one-liner, you can check whether the group's weights sum to zero with a conditional expression inside the lambda, and fall back to the regular mean when they do:
df.groupby('group')[['value', 'weight']].apply(lambda x:sum(x['value'])/len(x['weight']) if (sum(x['weight'])) == 0 else sum(x['value']*x['weight'])/sum(x['weight']))
A 0.266667
B 0.300000
dtype: float64
The regular-mean part of the snippet above can be shortened as follows:
df.groupby('group')[['value', 'weight']].apply(lambda x:x['value'].mean() if (sum(x['weight'])) == 0 else sum(x['value']*x['weight'])/sum(x['weight']))
However, I think one-liners of this type reduce the readability of the code.
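If readability matters more than brevity, the same fallback logic can be moved into a named function. This is a minimal sketch rather than a special pandas feature: the function name `weighted_mean` and its parameters are hypothetical, and it uses `np.average` for the weighted case.

```python
import numpy as np
import pandas as pd


def weighted_mean(group, value_col='value', weight_col='weight'):
    """Weighted mean of one group, falling back to the simple mean
    when the group's weights sum to zero."""
    weights = group[weight_col]
    if weights.sum() == 0:
        return group[value_col].mean()
    return np.average(group[value_col], weights=weights)


df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'value': [0.4, 0.3, 0.2, 0.4, 0.3, 0.2],
                   'weight': [1, 2, 3, 0, 0, 0]})

result = df.groupby('group')[['value', 'weight']].apply(weighted_mean)
```

For this data, `result` should agree with the one-liner output shown above: roughly 0.266667 for group A and 0.3 for group B.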
I have a dataframe that has 4 fields in it, Responder, female, married, and children which I plotted as a histogram.
import pandas as pd
data2 = data1.groupby('Responder')
# Select several columns with a list (double brackets)
data3 = data2[['female', 'married', 'children']].mean()
As you can see in the output, it was grouped, which is what I wanted. The only thing I want to do now is have the bars for each variable grouped together. So, for example, you would have two blue bars for female, the first for N and the second for Y. Then next to that, the N and Y bars for married, and so on.
What is the syntax I need to do this?
When plotting a DataFrame, each column becomes a legend entry, and each row becomes a horizontal axis category.
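import pandas as pd

# Example data (different from yours):
df = pd.DataFrame({'Responder': ['Y', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'N'],
                   'female': [0, 1, 1, 0, 1, 1, 0, 1],
                   'married': [0, 1, 1, 1, 1, 0, 0, 1],
                   'children': [0, 1, 0, 1, 1, 0, 1, 0]})
g = df.groupby('Responder')
# Transpose so the variables become rows (axis categories) and N/Y become columns (legend)
res = g.mean().T
print(res)
res.plot(kind='bar')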
Responder     N     Y
female     1.00  0.25
married    0.75  0.50
children   0.25  0.75
By the way, I'm not sure if mean is the correct choice here, since your original data consists of binary counts. Would a normalized sum make more sense?
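One possible reading of "normalized sum" (an assumption, since it is not defined above) is each Responder group's share of the column total. A minimal sketch using the same example data, where the names `sums` and `share` are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Responder': ['Y', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'N'],
                   'female': [0, 1, 1, 0, 1, 1, 0, 1],
                   'married': [0, 1, 1, 1, 1, 0, 0, 1],
                   'children': [0, 1, 0, 1, 1, 0, 1, 0]})

cols = ['female', 'married', 'children']

# Count of 1s per Responder group for each variable
sums = df.groupby('Responder')[cols].sum()

# Divide by each variable's total, then transpose so variables are rows
share = (sums / sums.sum()).T
```

With this definition the N and Y values for each variable add up to 1, and `share` can be plotted the same way as `res`.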