I am trying to create a chart in Altair that depicts two different processes on the same timescale. Here is an example from Excel.
I generated the stacked horizontal bar chart below in Excel using the following data. The numbers in red are offsets (gaps) and should not be displayed in the final plot. There is nothing special about these numbers; please feel free to use any other set.
I would have posted an attempt, but I am completely out of my depth guessing what functionality to begin with. Any help would be greatly appreciated.
Here's an example of how you might make a chart like this, using a conditional opacity to hide the offset values:
import altair as alt
import pandas as pd
df = pd.DataFrame({
    'axis': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'value': [0.5, 0.9, 2, 1, 3, 1, 0.8, 1, 1.4, 1.1, 4.1, 0.3, 1.1],
    'label': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', None, 'C', None, 'F', 'G']
})
alt.Chart(df.reset_index()).mark_bar().encode(
    y=alt.Y('axis:O', scale=alt.Scale(domain=[2, 1])),
    x='value:Q',
    color=alt.Color('label:N', legend=None),
    opacity=alt.condition('isValid(datum.label)', alt.value(1), alt.value(0)),
    order=alt.Order('index:Q', sort='ascending')
)
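An alternative to hiding the offset bars, sketched here under the assumption that the same data is used: compute each segment's start and end with a per-axis cumulative sum, then drop the offset rows entirely. The `start` and `end` column names are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'axis': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'value': [0.5, 0.9, 2, 1, 3, 1, 0.8, 1, 1.4, 1.1, 4.1, 0.3, 1.1],
    'label': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', None, 'C', None, 'F', 'G']
})

# Running total within each axis gives the right edge of every segment.
df['end'] = df.groupby('axis')['value'].cumsum()
# The left edge is the right edge minus the segment's own width.
df['start'] = df['end'] - df['value']

# Offsets have no label, so dropping them leaves only the visible segments.
segments = df.dropna(subset=['label']).reset_index(drop=True)
print(segments[['axis', 'label', 'start', 'end']])
```

A frame like `segments` could then be drawn with Altair's `x` and `x2` encodings (for example `x='start:Q', x2='end:Q'`), so no invisible bars are needed.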
Surprisingly little info out there regarding python and the pyalluvial package. I'm hoping to combine stacked bars and a corresponding alluvial in the same figure.
Using below, I have three unique groups, which is outlined in Group. I want to display the proportion of each Group for each unique Point. I have the data formatted this way as I need three separate stacked bar charts for each Point.
So overall (Ove) highlight the overall proportion taken from all three Points. Group 1 makes up 70%, Group 2 makes up 20%, Group 3 makes up 10%. But the proportion of each group changes at different intervals Points. I'm hoping to show this like a standard stacked bar chart, but add the alluvial over the top.
import pandas as pd
import pyalluvial.alluvial as alluvial
df = pd.DataFrame({
    'Group': [1, 2, 3],
    'Ove': [0.7, 0.2, 0.1],
    'Point 1': [0.8, 0.1, 0.1],
    'Point 2': [0.6, 0.2, 0.2],
    'Point 3': [0.7, 0.3, 0.0],
})
ax = alluvial.plot(
    df=df,
    xaxis_names=['Group', 'Point 1', 'Point 2', 'Point 3'],
    y_name='Ove',
    alluvium='Group',
)
The output shows the overall group proportion (first bar) correctly, but the following stacked bars do not show the proportions.
If I transform the df and put the Points as a single column, then I don't get 3 separate bars.
As @darthbaba correctly pointed out, pyalluvial expects a dataframe of frequencies for each combination of variable values. As an example of valid input, each Point in each Group below is labelled as present (1) or absent (0):
df = pd.DataFrame({
    'Group': [1] * 6 + [2] * 6 + [3] * 6,
    'Point 1': [1, 1, 1, 1, 0, 0] * 3,
    'Point 2': [0, 1, 0, 1, 1, 0] * 3,
    'Point 3': [0, 0, 1, 1, 1, 1] * 3,
    'freq': [23, 11, 5, 7, 10, 12, 17, 3, 6, 17, 19, 20, 28, 4, 13, 8, 14, 9]
})
fig = alluvial.plot(
    df=df,
    xaxis_names=['Point 1', 'Point 2', 'Point 3'],
    y_name='freq',
    alluvium='Group',
    ignore_continuity=False,
)
Clearly, the above code doesn't resolve the issue since pyalluvial has yet to support the inclusion of stacked bars, much like how it's implemented in ggalluvial (see example #5). Therefore, unless you want to use ggalluvial, your best option IMO is to add the required functionality yourself. I'd start by modifying line #85.
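As a possible starting point for adding stacked bars yourself, here is a minimal sketch of the bar geometry only, using the proportions from the question. It does not call pyalluvial; the names `tops` and `bottoms` are illustrative, and how they would be wired into the package's drawing code is left open.

```python
import pandas as pd

df = pd.DataFrame({
    'Group': [1, 2, 3],
    'Point 1': [0.8, 0.1, 0.1],
    'Point 2': [0.6, 0.2, 0.2],
    'Point 3': [0.7, 0.3, 0.0],
})

# One column per Point, one row per Group.
props = df.set_index('Group')

# Cumulative sum down the groups gives the top edge of each stacked segment.
tops = props.cumsum()
# The bottom edge is the top edge minus the segment's own height.
bottoms = tops - props

print(bottoms)
```

Each segment for a given Group and Point would then span from `bottoms` to `tops`, which is the form that matplotlib's `ax.bar(x, height, bottom=...)` expects.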
How can I label only the points where X >= 3? I don't see any points labelled with this output.
This is very similar to the simple labelled points example but I feel like I am missing something simple.
import altair as alt
import pandas as pd
source = pd.DataFrame({
    'x': [1, 3, 5, 7, 9],
    'y': [1, 3, 5, 7, 9],
    'label': ['A', 'B', 'C', 'D', 'E']
})
points = alt.Chart(source).mark_point().encode(
    x='x:Q',
    y='y:Q'
)
text = points.mark_text(
    align='left',
    baseline='middle',
    dx=7
).encode(
    text=alt.condition(alt.FieldGTEPredicate('x:Q', 3), 'label', alt.value(' '))
)
points + text
Predicates do not recognize encoding type shorthands; you should use the field name directly:
text=alt.condition(alt.FieldGTEPredicate('x', 3), 'label', alt.value(' '))
Even better, since this is essentially a filter operation, is to use a filter transform in place of the conditional value:
import altair as alt
import pandas as pd
source = pd.DataFrame({
    'x': [1, 3, 5, 7, 9],
    'y': [1, 3, 5, 7, 9],
    'label': ['A', 'B', 'C', 'D', 'E']
})
points = alt.Chart(source).mark_point().encode(
    x='x:Q',
    y='y:Q'
)
text = points.transform_filter(
    alt.datum.x >= 3
).mark_text(
    align='left',
    baseline='middle',
    dx=7
).encode(
    text='label'
)
points + text
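To sanity-check which rows either approach should label, the same condition can be evaluated in pandas first. This is only a side check, assuming the same `source` frame; it does not replace the Altair code, and the names `labelled` and `text_values` are illustrative.

```python
import pandas as pd

source = pd.DataFrame({
    'x': [1, 3, 5, 7, 9],
    'y': [1, 3, 5, 7, 9],
    'label': ['A', 'B', 'C', 'D', 'E']
})

# Rows that a filter on x >= 3 keeps.
labelled = source[source['x'] >= 3]

# Counterpart of the conditional encoding: a blank string below the threshold.
text_values = source['label'].where(source['x'] >= 3, ' ')

print(labelled['label'].tolist())
print(text_values.tolist())
```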
I've created a bar chart as described here where I have multiple variables (indicated in the 'value' column) and they belong to repeat groups. I've colored the bars by their group membership.
I want to create a legend ultimately equivalent to the colors dictionary, showing the color corresponding to a given group membership.
Code here:
d = {'value': [1, 2, 4, 5, 7 ,10], 'group': [1, 2, 3, 2, 2, 3]}
df = pd.DataFrame(data=d)
colors = {1: 'r', 2: 'b', 3: 'g'}
df['value'].plot(kind='bar', color=[colors[i] for i in df['group']])
plt.legend(df['group'])
In this way, I just get a legend with one color (1) instead of (1, 2, 3).
Thanks!
You can use seaborn:
import seaborn as sns
sns.barplot(data=df, x=df.index, y='value',
            hue='group', palette=colors, dodge=False)
With pandas, you could create your own legend as follows:
from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
d = {'value': [1, 2, 4, 5, 7 ,10], 'group': [1, 2, 3, 2, 2, 3]}
df = pd.DataFrame(data=d)
colors = {1: 'r', 2: 'b', 3: 'g'}
df['value'].plot(kind='bar', color=[colors[i] for i in df['group']])
handles = [mpatches.Patch(color=colors[i]) for i in colors]
labels = [f'group {i}' for i in colors]
plt.legend(handles, labels)
plt.show()
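Another option, sketched here as an alternative rather than as part of the answers above, is to draw each group in its own `ax.bar` call with a `label`, so that matplotlib builds the legend entries itself. The `Agg` backend line is only an assumption made so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display needed)
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 4, 5, 7, 10], 'group': [1, 2, 3, 2, 2, 3]})
colors = {1: 'r', 2: 'b', 3: 'g'}

fig, ax = plt.subplots()
# One bar call per group: bars keep their original x positions,
# and each call contributes exactly one legend entry.
for group, sub in df.groupby('group'):
    ax.bar(sub.index, sub['value'], color=colors[group], label=f'group {group}')
ax.set_xticks(df.index)

legend = ax.legend()
legend_labels = [t.get_text() for t in legend.get_texts()]
print(legend_labels)
```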
I want to take the weighted mean of a column in a group-by statement, like this
import pandas as pd
import numpy as np
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
'value': [0.4, 0.3, 0.2, 0.4, 0.3, 0.2],
'weight': [2, 2, 4, 3, 1, 2]})
df_grouped = df.groupby('group')[['value', 'weight']].apply(lambda x: sum(x['value']*x['weight'])/sum(x['weight']))
df_grouped
Out[17]:
group
A 0.275000
B 0.316667
dtype: float64
So far all is well. However, in some cases the weights sum to zero, for instance
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
'value': [0.4, 0.3, 0.2, 0.4, 0.3, 0.2],
'weight': [1, 2, 3, 0, 0, 0]})
In this case I want to take the simple mean. The above expression obviously fails because of a division by zero.
The method I currently use is to replace the weights with one wherever a group's weights sum to zero:
df_temp = df.groupby('group')['weight'].transform('sum').reset_index()
df['new_weight'] = np.where(df_temp['weight']==0, 1, df['weight'])
df_grouped = df.groupby('group')[['value', 'new_weight']].apply(lambda x: sum(x['value']*x['new_weight'])/sum(x['new_weight']))
This is an ok solution. But can this be achieved by a one-liner? Some special function for instance?
If you need a one-liner, you can check inside the lambda whether the group's weights sum to zero, using a conditional expression. If they do, fall back to the regular mean:
df.groupby('group')[['value', 'weight']].apply(lambda x:sum(x['value'])/len(x['weight']) if (sum(x['weight'])) == 0 else sum(x['value']*x['weight'])/sum(x['weight']))
group
A 0.266667
B 0.300000
dtype: float64
The regular-mean branch of the above snippet can be shortened further:
df.groupby('group')[['value', 'weight']].apply(lambda x:x['value'].mean() if (sum(x['weight'])) == 0 else sum(x['value']*x['weight'])/sum(x['weight']))
However, I think this kind of one-liner reduces the readability of the code.
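If readability matters more than brevity, the same logic can be moved into a small named function. This is a sketch, not a built-in pandas feature: the function name `weighted_mean_or_mean` is made up for illustration, and it uses `np.average` for the weighted branch.

```python
import numpy as np
import pandas as pd

def weighted_mean_or_mean(group):
    """Weighted mean of 'value'; plain mean if the weights sum to zero."""
    if group['weight'].sum() == 0:
        return group['value'].mean()
    return np.average(group['value'], weights=group['weight'])

df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'value': [0.4, 0.3, 0.2, 0.4, 0.3, 0.2],
                   'weight': [1, 2, 3, 0, 0, 0]})

result = df.groupby('group')[['value', 'weight']].apply(weighted_mean_or_mean)
print(result)
```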
I have a dataframe that has 4 fields in it, Responder, female, married, and children which I plotted as a histogram.
import pandas as pd
data2 = data1.groupby('Responder')
# Double brackets select several columns; a bare tuple is rejected by recent pandas
data3 = data2[['female', 'married', 'children']].mean()
data3.plot(kind='bar')
As you can see in the output, the bars are grouped, which is what I wanted. The only thing I want to do now is have each variable's bars grouped together. For example, there would be two blue bars for female, the first for N and the second for Y. Next to those would be the N and Y bars for married, and so on.
What is the syntax I need to do this?
When plotting a DataFrame, each column becomes a legend entry, and each row becomes a horizontal axis category.
# Example data (different from yours):
df = pd.DataFrame({'Responder': ['Y', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'N'],
'female': [0, 1, 1, 0, 1, 1, 0, 1],
'married': [0, 1, 1, 1, 1, 0, 0, 1],
'children': [0, 1, 0, 1, 1, 0, 1, 0]})
g = df.groupby('Responder')
res = g.mean().T
res
Responder N Y
female 1.00 0.25
married 0.75 0.50
children 0.25 0.75
res.plot(kind='bar')
By the way, I'm not sure if mean is the correct choice here, since your original data consists of binary counts. Would a normalized sum make more sense?
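One possible reading of "normalized sum", offered as an assumption rather than the only interpretation: sum the binary indicators per Responder, then divide each variable by its total so that the N and Y shares add up to 1. The names `counts` and `shares` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Responder': ['Y', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'N'],
                   'female': [0, 1, 1, 0, 1, 1, 0, 1],
                   'married': [0, 1, 1, 1, 1, 0, 0, 1],
                   'children': [0, 1, 0, 1, 1, 0, 1, 0]})

# Count of 1s per Responder for every variable.
counts = df.groupby('Responder').sum()

# Divide each variable (column) by its total across Responder values,
# then transpose so variables become rows, as in the answer above.
shares = counts.div(counts.sum(axis=0), axis=1).T
print(shares)
```

Calling `shares.plot(kind='bar')` would give the same layout as `res.plot(kind='bar')`, with bar heights showing each Responder's share of the 1s for that variable.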