Easily show mean value for plotly express bar plot - python

Plotly Express's bar chart stacks the observations by default, showing the sum.
import seaborn as sns
import plotly.express as px
df = sns.load_dataset("penguins")
px.bar(data_frame=df, x="species", y="bill_depth_mm")
I'm trying to display the mean for each species, which is what most other popular Python libraries return.
I could manually calculate the mean of each species and make a new dictionary/DataFrame. However, I feel like there should be an easy way to display the mean directly from Plotly.
I've checked the docs and SO with no luck. What am I missing?

I don't think you're missing anything. I imagine the Plotly developers expected DataFrames passed to px.bar to have one y-value per unique category, as suggested by the documentation showing how Plotly Express works with long- or wide-format data. In the medals dataset, there are 9 bars for 9 unique categories.
As you said, this means that you would need to calculate the mean for each unique species, and this can be accomplished by passing a groupby mean of your DataFrame directly to the data_frame parameter, even if it's not the most elegant.
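fig = px.bar(
    # numeric_only=True skips text columns such as "island" and "sex",
    # which would otherwise make mean() raise in pandas 2.0 and later
    data_frame=df.groupby("species").mean(numeric_only=True).reset_index(),
    x="species",
    y="bill_depth_mm",
)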
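As a small sketch of the aggregation step, assuming a made-up four-row stand-in for the penguins data (the values below are illustrative only), the per-species mean can be computed once and then handed to px.bar. The second helper is an alternative not mentioned above: px.histogram accepts histfunc="avg", which lets Plotly Express do the averaging itself. Both helpers (their names are invented here) import Plotly lazily, so the aggregation can be checked without it.

```python
import pandas as pd

# Hypothetical stand-in for the penguins data; values are illustrative only.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "bill_depth_mm": [18.0, 19.0, 14.0, 15.0],
    }
)

# One row per species, holding the mean bill depth.
species_means = toy.groupby("species", as_index=False)["bill_depth_mm"].mean()


def bar_of_means(frame):
    """Bar chart of pre-aggregated means (needs plotly installed)."""
    import plotly.express as px

    return px.bar(frame, x="species", y="bill_depth_mm")


def bar_via_histogram(raw):
    """Let Plotly Express average the raw rows via histfunc='avg'."""
    import plotly.express as px

    return px.histogram(raw, x="species", y="bill_depth_mm", histfunc="avg")
```

Calling bar_of_means(species_means) or bar_via_histogram(toy) should give one bar per species at the mean height; the histogram variant labels the y axis as an average rather than with the bare column name.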

Related

How to modify my current data set in order to plot desired box plot using Seaborn?

I want to see the median as well as the outliers using a seaborn boxplot, with the boxes for all customers in a single plot. Example data looks like this:
I surveyed some drivers to capture how many times they press the horn each day.
Data Set
The numbers represent the number of times horn was pressed.
I want to make boxplots for each customer to identify outliers. Actual data is quite big.
You can pass vectors of data represented as lists, numpy arrays, or pandas Series to the Seaborn boxplot function.
For example
import seaborn
import pandas as pd
import numpy as np
df = pd.read_csv("your.csv")
seaborn.boxplot(data=df)
This will result in the following figure.
An alternative would be df.boxplot(), which will result in the following figure.
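By default, both seaborn and matplotlib draw as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. As a rough sketch, assuming a hypothetical wide table with one column per driver (the column names and counts below are invented), the medians and the flagged points can also be computed directly with pandas:

```python
import pandas as pd

# Hypothetical horn-press counts: one column per driver, one row per day.
horn = pd.DataFrame(
    {
        "driver_a": [10, 12, 11, 13, 50],
        "driver_b": [5, 6, 7, 6, 5],
    }
)

medians = horn.median()

# Whisker limits of the default box plot rule: 1.5 * IQR past the quartiles.
q1 = horn.quantile(0.25)
q3 = horn.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask of the values a box plot would draw as outlier points.
is_outlier = (horn < lower) | (horn > upper)
outliers = {col: horn.loc[is_outlier[col], col].tolist() for col in horn.columns}
```

With a table of this shape, seaborn.boxplot(data=horn) would draw one box per driver in a single plot, and the values collected in outliers are the ones that appear as individual points.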

difference between countplot and catplot

In Python's seaborn, what is the difference between countplot and catplot?
E.g.:
sns.catplot(x='class', y='survived', hue='sex', kind='bar', data=titanic);
sns.countplot(y='deck', hue='class', data=titanic);
seaborn.countplot
Shows the counts of observations in each categorical bin using bars.
seaborn.catplot
Provides access to several axes-level functions that show the relationship between a numerical and one or more categorical variables using one of several visual representations.
There is a lot of overhead in catplot, or for that matter in FacetGrid, that ensures the categories are synchronized along the grid. Consider, e.g., a variable plotted along the columns of the grid for which not every age group occurs in every column. You would still need to show the non-occurring age group and hold on to its color. Hence, two countplots next to each other do not necessarily make up one catplot.
However, if you are only interested in a single countplot, a catplot is clearly overkill. On the other hand, even a single countplot is overkill compared to a barplot of the counts.
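To illustrate that last point, here is a sketch assuming a small made-up table in place of the titanic dataset: the numbers a countplot draws are just per-category counts, which pandas can produce directly.

```python
import pandas as pd

# Made-up rows standing in for the titanic data.
passengers = pd.DataFrame(
    {
        "deck": ["A", "B", "B", "C", "C", "C"],
        "class": ["First", "First", "Second", "First", "Third", "Third"],
    }
)

# Counts per deck: what a plain countplot of 'deck' would show.
deck_counts = passengers["deck"].value_counts().sort_index()

# Counts per deck split by class: the equivalent of adding hue='class'.
deck_by_class = pd.crosstab(passengers["deck"], passengers["class"])
```

Either result can then go to an ordinary bar plot, for instance deck_counts.plot.bar().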

Skip weekends on stock charts with matplotlib

This is not a duplicate, because the existing answers to similar questions don't describe exactly what I need.
Matplotlib has great formatters inside and I love to use them:
ax.xaxis.set_major_locator(matplotlib.dates.MonthLocator())
ax.xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%b%y'))
They let me plot such stock market charts:
This is what I need, but it has one issue: weekends. They are present on the x axis and make my chart a little ugly.
Other questions about this issue advise creating a custom formatter, and they show examples of such formatters. But none of them does the pretty formatting that matplotlib does:
May19, Jun19, Jul19...
I mean this line of code:
ax.xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%b%y'))
My question is: please help me format the x axis the way matplotlib does (May19, Jun19, Jul19...) without showing the weekends, when the stock market is closed.
What you could almost always do is something similar to what Nic Wanavit suggested.
Manually set your labels, depending on what you need on your axis.
In this case especially, the plot looks a bit ugly because there are timespans with no actual data (here, the weekends), so pyplot simply connects the neighboring points across the corresponding gap on the x-axis.
What you can do then is plot your data equally spaced, which is correct if the data is daily; otherwise, consider interpolating it, e.g. with pandas' built-in interpolation.
To stop pyplot from automatically picking up the date index, I had to do this:
df['plotidx'] = range(len(df['close']))
Here all the closing values for the stock are stored in a column named 'close', obviously.
Then plot against this column.
Then you can obtain all the ticks created via
labels = [item.get_text() for item in ax.get_xticklabels()]
Adjust them as desired with
labels[i] = string_for_the_label_no_i
Then get them back on the graph using
ax.xaxis.set_ticklabels(labels)
You then need to "update" (redraw) the plot. Also keep in mind that, as the documentation notes, resizing the figure a lot can leave the labels in strange locations.
It is something of a workaround, but it worked fine for me, because it feels natural to plot the data points equally spaced next to each other rather than making up data for the weekends.
Greets
To set the x ticks, assuming that you have the dates in the DataFrame column df['dates']:
ax.xaxis.set_ticks(df['dates'])
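The equal-spacing idea can be combined with the '%b%y' labels from the question. The sketch below is only one possible approach and rests on stated assumptions: the prices are made up, the trading calendar is approximated with pd.bdate_range (which skips weekends but not holidays), and format_date is a helper name invented here. It plots against integer positions and uses matplotlib's FuncFormatter to translate each position back into a date label.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.ticker import FuncFormatter

# Made-up daily closes on weekdays only (real data would supply its own dates).
dates = pd.bdate_range("2019-05-01", "2019-07-31")
close = np.linspace(100.0, 120.0, len(dates))

# Plot against 0, 1, 2, ... so weekends take up no space on the x axis.
positions = np.arange(len(dates))


def format_date(x, pos=None):
    """Map an integer position back to its date, formatted like 'May19'."""
    i = int(round(x))
    if 0 <= i < len(dates):
        return dates[i].strftime("%b%y")
    return ""


# Put one tick at the first trading day of each month.
month_starts = [
    i for i in range(len(dates)) if i == 0 or dates[i].month != dates[i - 1].month
]

fig, ax = plt.subplots()
ax.plot(positions, close)
ax.set_xticks(month_starts)
ax.xaxis.set_major_formatter(FuncFormatter(format_date))
```

Because the labels come from a formatter rather than from fixed strings, they stay attached to the right positions when the figure is resized or zoomed.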

Hvplot/bokeh summed Bar chart from Pandas Dataframe

I'm trying to print a "simple" Bar chart, using HVPlot and bokeh in jupyter notebook.
Here is some simplified data:
My Data originally looks like this:
My goal is to get a bar chart like this (note that it doesn't have to be stacked; the only important thing is the totals):
Since I couldn't figure out how to get a bar chart with the sum of certain columns, I used pandas.melt to reshape the data to look like this:
With this data I can plot it, but then the values aren't summed. Instead, there are multiple bars behind each other.
Here is the code I used to test:
import numpy as np
import pandas as pd
import hvplot.pandas  # registers the .hvplot accessor on DataFrames
testd = {'Name': ['Item1', 'Item2', 'Item3', 'Item3'], 'Filter': ['F1', 'F2', 'F1', 'F1'],
         'Count': [1, 5, 2, 1], 'CountCategory': ['CountA', 'CountB', 'CountA', 'CountD']}
testdf = pd.DataFrame(data=testd)
testdf.hvplot.bar('CountCategory', 'Count', groupby='Filter', rot=90, aggregator=np.sum)
It doesn't change anything if I omit the aggregator=np.sum
Does anyone know how to properly plot this?
It doesn't have to use the "transposed" data since I'm only doing that because I have no idea how to plot the Original Data.
And another question would be if there is a possibility
The aggregator is used by the datashade/rasterize operation to aggregate the data and indeed has no effect on bar plots. If you want to aggregate the data, I recommend doing so using pandas methods. However, in your case I don't think that's the issue. The main problem in implementing the plot you requested is that in HoloViews the legend is generally linked to the styling, which means that you can't easily get the legend to display the filter and color each bar separately.
You could do this and add the Filter as a hover column, which means you still have access to it:
testdf.hvplot.bar('CountCategory', 'Count', by='Name', stacked=True, rot=90, hover_cols=['Filter'])
I'll probably raise an issue in HoloViews to support a legend decoupled from the styling.
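For the aggregation itself, a minimal sketch of the pandas route using the sample data from the question (the variable name totals is introduced here for illustration):

```python
import pandas as pd

# Same sample data as in the question.
testd = {
    "Name": ["Item1", "Item2", "Item3", "Item3"],
    "Filter": ["F1", "F2", "F1", "F1"],
    "Count": [1, 5, 2, 1],
    "CountCategory": ["CountA", "CountB", "CountA", "CountD"],
}
testdf = pd.DataFrame(data=testd)

# Sum the counts per filter and category before plotting.
totals = testdf.groupby(["Filter", "CountCategory"], as_index=False)["Count"].sum()
```

The summed frame can then be plotted the same way as before, e.g. totals.hvplot.bar('CountCategory', 'Count', groupby='Filter', rot=90) once hvplot.pandas has been imported, giving one bar per category instead of several bars behind each other.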

boxplot on groupby timegrouper without subplots using pandas

I am doing a groupby using pd.TimeGrouper on a time series dataset. When I plot a boxplot on this groupby object, it has subplots. I don't want to divide the plot area into subplots. I tried using the parameter subplots=False, but it throws an error: KeyError: "value".
This is the plot I am getting with subplots.
The code:
df['timestamp1'] = df['timestamp'].values.astype('datetime64[s]')
df = df.groupby(pd.TimeGrouper(key="timestamp1", freq="3H"), group_keys=True, as_index=True)
df.boxplot(column="value",subplots=True)
The DataFrame object I am using is:
I want to plot all the box plots in the same area without dividing it into subplots.
Thanks a lot in advance.
This might actually be a bug. You can get the desired outcome by selecting only the timestamp1 and value columns, therefore eliminating the need to use the column parameter.
df[['timestamp1', 'value']].groupby(pd.TimeGrouper('3H', key='timestamp1'))\
.boxplot(subplots=False)
I went ahead and submitted an issue for this on github.
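Note that pd.TimeGrouper has since been removed from pandas; pd.Grouper with a freq argument is its replacement. The sketch below assumes made-up hourly data with the question's column names. It draws all boxes on one axes by labelling each row with the start of its 3-hour bin (bin_start is a column name invented here) and passing that column to DataFrame.boxplot via by. This is one way to get a single plot, not necessarily how the issue mentioned above was resolved.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import pandas as pd

# Made-up data: twelve hourly readings.
ts = pd.DataFrame(
    {
        "timestamp1": pd.date_range("2020-01-01", periods=12, freq="h"),
        "value": range(12),
    }
)

# Current spelling of the question's grouping: pd.Grouper, not pd.TimeGrouper.
sizes = ts.groupby(pd.Grouper(key="timestamp1", freq="3h"))["value"].size()

# Label each row with the start of its 3-hour bin, then draw one box per bin
# on a single axes.
ts["bin_start"] = ts["timestamp1"].dt.floor("3h")
ax = ts.boxplot(column="value", by="bin_start", rot=90)
```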
