There is surprisingly little information out there about Python and the pyalluvial package. I'm hoping to combine stacked bars and a corresponding alluvial plot in the same figure.
Using the data below, I have three unique groups, which are listed in Group. I want to display the proportion of each Group at each unique Point. I have the data formatted this way because I need three separate stacked bar charts, one for each Point.
The overall column (Ove) gives the overall proportion taken across all three Points: Group 1 makes up 70%, Group 2 makes up 20%, and Group 3 makes up 10%. But the proportion of each group changes at the different Points. I'm hoping to show this as a standard stacked bar chart, with the alluvial drawn over the top.
import pandas as pd
import pyalluvial.alluvial as alluvial
df = pd.DataFrame({
'Group': [1, 2, 3],
'Ove': [0.7, 0.2, 0.1],
'Point 1': [0.8, 0.1, 0.1],
'Point 2': [0.6, 0.2, 0.2],
'Point 3': [0.7, 0.3, 0.0],
})
ax = alluvial.plot(
df = df,
xaxis_names = ['Group','Point 1','Point 2', 'Point 3'],
y_name = 'Ove',
alluvium = 'Group',
)
The output shows the overall group proportion (1st bar) correctly, but the following stacked bars do not show the proportions.
If I transform the df and put the Points as a single column, then I don't get 3 separate bars.
As correctly pointed out by @darthbaba, pyalluvial expects the dataframe to consist of frequencies matching different variable-type combinations. To give you an example of a valid input, each Point in each Group has been labelled as present (1) or absent (0):
df = pd.DataFrame({
'Group': [1] * 6 + [2] * 6 + [3] * 6,
'Point 1': [1, 1, 1, 1, 0, 0] * 3,
'Point 2': [0, 1, 0, 1, 1, 0] * 3,
'Point 3': [0, 0, 1, 1, 1, 1] * 3,
'freq': [23, 11, 5, 7, 10, 12, 17, 3, 6, 17, 19, 20, 28, 4, 13, 8, 14, 9]
})
fig = alluvial.plot(df=df, xaxis_names=['Point 1','Point 2', 'Point 3'], y_name='freq', alluvium='Group', ignore_continuity=False)
Clearly, the above code doesn't resolve the issue since pyalluvial has yet to support the inclusion of stacked bars, much like how it's implemented in ggalluvial (see example #5). Therefore, unless you want to use ggalluvial, your best option IMO is to add the required functionality yourself. I'd start by modifying line #85.
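Since pyalluvial does not draw the stacked bars, one option is to fall back to plain matplotlib. The following is a minimal sketch under that assumption, not pyalluvial functionality: it stacks the proportions from the question and joins adjacent bars with straight, semi-transparent ribbons instead of true alluvial curves. The names props, tops, bottoms and plot_stacked_with_flows are hypothetical and introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Group': [1, 2, 3],
    'Ove': [0.7, 0.2, 0.1],
    'Point 1': [0.8, 0.1, 0.1],
    'Point 2': [0.6, 0.2, 0.2],
    'Point 3': [0.7, 0.3, 0.0],
})

cols = ['Ove', 'Point 1', 'Point 2', 'Point 3']
props = df.set_index('Group')[cols]   # one row per group, one column per bar
tops = props.cumsum()                 # upper edge of each group's segment
bottoms = tops - props                # lower edge of each group's segment


def plot_stacked_with_flows(props, bottoms, tops, bar_width=0.4):
    """Hypothetical helper: stacked bars plus straight ribbons between bars."""
    import matplotlib.pyplot as plt   # imported lazily, only needed for drawing

    fig, ax = plt.subplots()
    x = list(range(len(props.columns)))
    for group in props.index:
        bars = ax.bar(x, props.loc[group], bottom=bottoms.loc[group],
                      width=bar_width, label=f'Group {group}')
        color = bars[0].get_facecolor()
        # Connect this group's segment in bar i to its segment in bar i + 1.
        for i in range(len(x) - 1):
            left, right = i + bar_width / 2, i + 1 - bar_width / 2
            ax.fill_between(
                [left, right],
                [bottoms.loc[group].iloc[i], bottoms.loc[group].iloc[i + 1]],
                [tops.loc[group].iloc[i], tops.loc[group].iloc[i + 1]],
                color=color, alpha=0.3)
    ax.set_xticks(x)
    ax.set_xticklabels(props.columns)
    ax.legend()
    return fig, ax
```

Calling plot_stacked_with_flows(props, bottoms, tops) followed by plt.show() should draw four stacked bars (Ove plus the three Points) with one ribbon per group between neighbouring bars.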
In the dataframe, each away column has a matching home column (i.e. away_win_perc -> home_win_perc, away_first_downs -> home_first_downs, and so forth). Screenshot of the data: https://i.stack.imgur.com/TTR3d.png
I just want subplots of bar charts comparing each pair of corresponding features (away_win_perc vs home_win_perc, away_first_downs vs home_first_downs, and so forth). I want this for each row of data because it's matchup specific.
Sample code:
df = pd.DataFrame({
'home_name': ['Packers', 'Rams', 'Texans'],
'away_name': ['Saints', 'Eagles', 'Colts'],
'week': [1, 1, 1],
'Height(in cm)': [150, 180, 160],
'home_win_perc': [.57, .65, .32],
'home_first_downs': [1, 5, 3],
'home_fumbles': [4, 2, 3],
'away_win_perc': [.57, .65, .32],
'away_first_downs': [1, 5, 3],
'away_fumbles': [4, 2, 3]})
I want something similar to this (https://i.stack.imgur.com/CWaFk.png), but for each feature and each row of data, with each subplot titled with the actual away team name vs the actual home team name. For example: Saints vs Packers for the first row. Ideally each bar color would correspond to the away or home team feature, so two colors throughout: one for away, one for home.
I have a problem with sns.histplot(). As far as I understand, the bins parameter should indicate how many bins the plot should have. Here is some code to visualize the strange (at least for me) behavior.
d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9 , 10 , 11 , 12, 13, 14, 15], 'col2': [1, 1, 1, 1, 1, 1, 2, 2, 2 , 2 , 2, 2, 2, 2, 2]}
df = pd.DataFrame(data=d)
sns.histplot(data=df, x='col1', multiple='dodge', hue='col2', binwidth=2, bins=8)
I have almost the same problem in my original code where I have:
hist = sns.histplot(data=Data, x=y_data, multiple='dodge', ax=axes[0], hue=i[2][1], binwidth=2, bins=10)
As you can see, there is only one bin where the data has its minimum and std, and the data is not split into the number of bins I declared. Why is the data not split into the provided number of bins? How can I change the code to ensure a constant number of bins?
I think the problem is the binwidth parameter. Maybe just try to delete that parameter, or set it to a smaller value (0.2 or 0.1).
From the docs, regarding the binwidth parameter:
Width of each bin, overrides bins but can be used with binrange.
So you can't specify both bins and binwidth at the same time.
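To see why the two parameters conflict, it can help to compute the bin edges by hand. The sketch below uses plain NumPy for illustration and is not a copy of seaborn's internals: a fixed bin count and a fixed bin width generally yield different numbers of bins for the same data, so only one of them can win.

```python
import numpy as np

x = np.arange(1, 16)  # same values as col1 in the question (1..15)

# Fixed number of bins: 8 equal-width bins need 9 edges.
edges_by_count = np.histogram_bin_edges(x, bins=8)

# Fixed bin width: the number of bins follows from the data range instead.
binwidth = 2
edges_by_width = np.arange(x.min(), x.max() + binwidth, binwidth)

n_bins_by_count = len(edges_by_count) - 1
n_bins_by_width = len(edges_by_width) - 1
print(n_bins_by_count, n_bins_by_width)  # 8 7
```

In the seaborn call from the question, keeping bins=8 and dropping binwidth=2 is therefore the way to get a constant number of bins.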
Let's say the data used is something like this
df = pd.DataFrame({'Order_id': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
'Order_date': ['10/1/2020', '10/1/2020', '11/1/2020', '11/1/2020', '12/1/2020', '12/1/2020', '12/1/2020', '12/1/2020', '13/1/2020', '13/1/2020'],
'Product_nr': [0, 2, 1, 0, 2, 0, 2, 1, 2, 0],
'Quantity': [3, 1, 6, 5, 10, 1, 2, 5, 4, 3]})
# transform the date column into datetime; the dates are day-first (e.g. 13/1/2020)
df['Order_date'] = pd.to_datetime(df['Order_date'], dayfirst=True)
and I'm trying to plot the number of ordered products per day per product over the given time span.
My initial idea would be something like
product_groups = df.groupby(['Product_nr'])
products_daily = pd.DataFrame()
for product, total_orders in product_groups:
    products_daily[product.day] = total_orders.values
products_daily.plot(subplots=True, legend=False)
pyplot.show()
I know there must be a groupby('Product_nr'), and the date should be split into days using Grouper(freq='D'). There should also be a for loop to combine them before plotting them all, but I really have no clue how to put those pieces together. How can I achieve this? My ultimate goal is actually to plot them per month per product for over 4 years of sales records, but given the example data here I changed it to daily.
Any suggestion or link for guides, tutorials are welcome too. Thank you very much!
You can pivot the table and use pandas' plot function:
(df.groupby(['Order_date', 'Product_nr'])
['Quantity'].sum()
.unstack('Product_nr')
.plot(subplots=True, layout=(1,3)) # change layout to fit your data
)
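For the monthly goal mentioned in the question, the same pivot can be grouped by calendar month instead of by day. This is a minimal sketch on the sample data, assuming the dates are day-first; the name monthly is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Order_id': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Order_date': ['10/1/2020', '10/1/2020', '11/1/2020', '11/1/2020',
                   '12/1/2020', '12/1/2020', '12/1/2020', '12/1/2020',
                   '13/1/2020', '13/1/2020'],
    'Product_nr': [0, 2, 1, 0, 2, 0, 2, 1, 2, 0],
    'Quantity': [3, 1, 6, 5, 10, 1, 2, 5, 4, 3],
})
# Assumption: dates are written day/month/year.
df['Order_date'] = pd.to_datetime(df['Order_date'], format='%d/%m/%Y')

# One row per calendar month, one column per product, summed quantities.
monthly = (df.groupby([df['Order_date'].dt.to_period('M'), 'Product_nr'])
             ['Quantity'].sum()
             .unstack('Product_nr', fill_value=0))
print(monthly)

# To draw one subplot per product (requires matplotlib):
# monthly.plot(subplots=True, layout=(1, 3))
```

With the sample data everything falls in January 2020, so monthly has a single row; with several years of records each month becomes its own row and the subplots show one line per product.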
I want to create a barplot from a dataframe. But I want to color each bar according to a value from the column 'red' in the dataframe.
I have the following code:
plt.bar(df.index, df['Mean'], yerr = df['yerr'], capsize=7, color = (df['red'], 0, 0, 0.6))
I would like to take the value from the column 'red' (which goes from 0 to 1) but it keeps failing. How would you do it?
Something like this will work. You have to create blue, green, and alpha columns of common length, then zip them all together with red.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = [
[1, 1, 0.1],
[2, 1, 0.2],
[3, 1, 0.3],
[4, 1, 0.4],
[5, 1, 0.5],
[6, 1, 0.6],
[7, 1, 0.7],
[8, 1, 0.8],
[9, 1, 0.9],
]
columns = ['Mean', 'yerr', 'red']
df = pd.DataFrame(data=data, columns=columns)
r = df['red']
g = np.zeros(r.shape[0])
b = np.zeros(r.shape[0])
a = np.ones(r.shape[0]) * 0.6
plt.bar(df.index, df['Mean'], yerr=df['yerr'], capsize=7,
color=list(zip(r, g, b, a)))
plt.show()
This paper has a nice way of visualizing clusters of a dataset with binary features by plotting a 2D matrix and sorting the values according to a cluster.
In this case, there are three clusters, as indicated by the black dividing lines; the rows are sorted, and show which examples are in each cluster, and the columns are the features of each example.
Given a vector of cluster assignments and a pandas DataFrame, how can I replicate this using a Python library (e.g. seaborn)? Plotting a DataFrame using seaborn isn't difficult, nor is sorting the rows of the DataFrame to align with the cluster assignments. What I am most interested in is how to display those black dividing lines which delineate each cluster.
Dummy data:
"""
col1 col2
x1_c0 0 1
x2_c0 0 1
================= I want a line drawn here
x3_c1 1 0
================= and here
x4_c2 1 0
"""
import pandas as pd
import seaborn as sns
df = pd.DataFrame(
data={'col1': [0, 0, 1, 1], 'col2': [1, 1, 0, 0]},
index=['x1_c0', 'x2_c0', 'x3_c1', 'x4_c2']
)
clus = [0, 0, 1, 2] # This is the cluster assignment
sns.heatmap(df)
The link that mwaskom posted in a comment is a good starting place. The trick is figuring out what the coordinates are for the vertical and horizontal lines.
To illustrate what the code is actually doing, it's worthwhile to just plot all of the lines individually:
%matplotlib inline
import matplotlib.pyplot as plt  # needed for plt.subplots below
import pandas as pd
import seaborn as sns
df = pd.DataFrame(data={'col1': [0, 0, 1, 1], 'col2': [1, 1, 0, 0]},
index=['x1_c0', 'x2_c0', 'x3_c1', 'x4_c2'])
f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(df)
ax.axvline(1, 0, 2, linewidth=3, c='w')
ax.axhline(1, 0, 1, linewidth=3, c='w')
ax.axhline(2, 0, 1, linewidth=3, c='w')
ax.axhline(3, 0, 1, linewidth=3, c='w')
f.tight_layout()
The way the axvline method works is that the first argument is the x location of the line, followed by the lower and upper bounds of the line (in this case 1, 0, 2). The horizontal line takes the y location and then the x start and x stop of the line. In matplotlib these bounds are given as fractions of the axes (0 to 1) rather than data coordinates. The defaults draw the line across the entire plot, so you can typically leave them out.
The code above creates a line for every value in the dataframe. If you want to create groups for the heatmap, you will want to create an index in your data frame, or some other list of values to loop through. For instance, with a more complicated example using code from this example:
df = pd.DataFrame(data={'col1': [0, 0, 1, 1, 1.5], 'col2': [1, 1, 0, 0, 2]},
index=['x1_c0', 'x2_c0', 'x3_c1', 'x4_c2', 'x5_c2'])
df['id_'] = df.index
df['group'] = [1, 2, 2, 3, 3]
df.set_index(['group', 'id_'], inplace=True)
df
             col1  col2
group id_
1     x1_c0   0.0     1
2     x2_c0   0.0     1
      x3_c1   1.0     0
3     x4_c2   1.0     0
      x5_c2   1.5     2
Then plot the heatmap with the groups:
f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(df)
groups = df.index.get_level_values(0)
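for i, group in enumerate(groups):
    # draw a line wherever the group label differs from the previous row
    if i and group != groups[i - 1]:
        # if your seaborn version puts row 0 at the top (inverted y-axis),
        # use ax.axhline(i, ...) here instead
        ax.axhline(len(groups) - i, c="w", linewidth=3)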
ax.axvline(1, c="w", linewidth=3)
f.tight_layout()
Because your heatmap is not symmetric, you may need to use a separate for loop for the columns.
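Starting from the clus vector in the question, the line positions can also be computed directly. This is a sketch under two assumptions: the rows are already sorted by cluster, and heatmap row i spans y from i to i + 1 with row 0 at the top (inspect ax.get_ylim() to confirm for your seaborn version). The helpers cluster_boundaries and draw_cluster_lines are hypothetical names, not part of seaborn.

```python
import numpy as np


def cluster_boundaries(clus):
    """Return the row indices at which the cluster label changes."""
    clus = np.asarray(clus)
    return (np.flatnonzero(clus[1:] != clus[:-1]) + 1).tolist()


def draw_cluster_lines(ax, clus, **line_kwargs):
    """Draw one horizontal line per boundary on an existing heatmap Axes."""
    style = {"color": "black", "linewidth": 2}
    style.update(line_kwargs)
    for y in cluster_boundaries(clus):
        ax.axhline(y, **style)
    return ax


clus = [0, 0, 1, 2]  # cluster assignment from the question
print(cluster_boundaries(clus))  # [2, 3]

# Usage with the heatmap from the question (requires seaborn):
# ax = sns.heatmap(df)
# draw_cluster_lines(ax, clus)
```

The same boundary function applied to a list of column clusters, combined with ax.axvline, would cover the column case.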