Create clustered bar chart across two columns in bokeh - python

I have a data frame that looks like this:
type price1 price2
0 A 5450.0 31980.0
1 B 5450.0 20000.0
2 C 15998.0 18100.0
What I want is a clustered bar chart that plots "type" against "price". The end goal is a chart that has two bars for each type, one bar for "price1" and the other for "price2". Both columns are in the same unit ($). Using Bokeh I can group by type, but I can't seem to group by a generic "price" unit. I have this code so far:
import pandas as pd
import numpy as np
from bokeh.charts import Bar, output_file, show
from bokeh.palettes import Category20 as palette
from bokeh.models import HoverTool, PanTool
p = Bar(
df,
plot_width=1300,
plot_height=900,
label='type',
values='price2',
bar_width=0.4,
legend='top_right',
agg='median',
tools=[HoverTool(), PanTool()],
palette=palette[20])
But that only gets me one column for each type.
How can I modify my code to get two bars for each type?

What you are looking for is a grouped bar plot.
You first have to reorganise your data a little, so that Bokeh (or rather pandas) is able to group it correctly:
df2 = pd.DataFrame(data={'type': ['A','B','C', 'A', 'B', 'C'],
'price':[5450, 5450, 15998, 3216, 20000, 15000],
'price_type':['price1', 'price1', 'price1', 'price2', 'price2', 'price2']})
p = Bar(
df2,
plot_width=1300,
plot_height=900,
label='type',
values='price',
bar_width=0.4,
group='price_type',
legend='top_right')
show(p)

Your table is in "wide" format. You want to melt it to long format first using the pd.melt() function. For visualization, I suggest you use the seaborn package to make your life easier; you can visualize everything in one line.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
your_df = pd.DataFrame(data={'type': ['A','B','C'],
'price1':[5450, 5450, 15998],
'price2' : [3216, 20000, 15000]})
long_df = pd.melt(your_df,id_vars = ['type'],value_vars =['price1','price2'])
print(long_df)
my_plot = sns.barplot(x="type", y="value", hue="variable", data=long_df)
plt.show()
A good post on long and wide formats can be found here:
Reshape Long Format Multivalue Dataframes with Pandas
If you insist on using Bokeh, here is how you do it, as renzop pointed out:
p = Bar(long_df,
plot_width=1000,
plot_height=800,
label='type',
values='value',
bar_width=0.4,
group='variable',
legend='top_right')
show(p)
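For reference, the reshaping described above can be tried on the question's own numbers. The sketch below is not the Bar API used in the answers; it is an alternative that assumes matplotlib is installed and relies on pandas' built-in plotting, which draws one cluster per index entry and one bar per column of a wide frame. The names wide, long_df and ax are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove to get a window
import matplotlib.pyplot as plt
import pandas as pd

# The wide frame from the question: one row per type, one column per price
wide = pd.DataFrame({
    "type": ["A", "B", "C"],
    "price1": [5450.0, 5450.0, 15998.0],
    "price2": [31980.0, 20000.0, 18100.0],
})

# Long format: one row per (type, price column) pair, as the answers build by hand
long_df = wide.melt(id_vars="type", var_name="price_type", value_name="price")

# pandas can also cluster directly from the wide frame:
# the index gives the groups, each selected column gives one bar per group
ax = wide.set_index("type")[["price1", "price2"]].plot(kind="bar", rot=0)
ax.set_ylabel("price ($)")
plt.tight_layout()
```

Calling plt.show() (with the Agg line removed) should then display three clusters of two bars each.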


Remove data of type category from plot

Say we have a df with a column defined as a category:
import pandas as pd
df = pd.DataFrame({'Color': ['Yellow', 'Blue', 'Red', 'Red']}, dtype='category') # data type is category
Now say we want to plot these data while removing one of the categorical levels:
# Exclude Yellow, save in new df
df2 = df.loc[df.Color != 'Yellow']
# Plot
df2.value_counts().plot(kind='bar')
Output:
Although the bar for Yellow is not displayed, the Yellow tick label is still visible.
My question: How do we completely remove Yellow from the plot?
I suspect this issue is due to the fact that the data type is category. But I don't want to convert the data type. The type category is sometimes useful, e.g., to reorder levels or other operations.
Ideal solution for me would also work with seaborn, where I found a similar issue:
# Remake a df based on the above and plot with seaborn
df2=pd.DataFrame(df2.value_counts()).reset_index()
import seaborn as sns
from matplotlib import pyplot as plt
sns.catplot(data=df2, x=0, y='Color', kind='bar')
plt.show()
Dani Mesejo's answer works, but only with histograms, I believe, and I need bar plots per se.
You can convert the categorical values to strings just for the plot (not in place); the data types in df2 stay the same:
df2 = df[df['Color'] != 'Yellow']
df2.Color.astype(str).value_counts().plot(kind='bar')
Or you can use a histogram for that:
df2 = df[df['Color'] != 'Yellow']['Color']
plt.hist(df2)
plt.xlabel('Color')
Using seaborn.histplot, you can do the following:
sns.histplot(data=df2, x="Color", shrink=.8)
plt.show()
Output
Note:
Don't forget to import seaborn
import seaborn as sns
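If the column has to stay categorical, another option (not mentioned in the answers above, so treat it as a suggestion) is to drop the level that no longer occurs with the pandas accessor cat.remove_unused_categories(). A minimal sketch using the question's data; the variable name counts is illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Yellow', 'Blue', 'Red', 'Red']}, dtype='category')

# Exclude Yellow; .copy() avoids assigning into a view of df
df2 = df.loc[df['Color'] != 'Yellow'].copy()

# Filtering rows keeps 'Yellow' as a (now empty) category level; drop it explicitly
df2['Color'] = df2['Color'].cat.remove_unused_categories()

# The column is still categorical, but only Blue and Red remain as levels
counts = df2['Color'].value_counts()
print(counts)

# counts.plot(kind='bar') should now show no Yellow tick
```

Because the unused level is gone from the dtype itself, seaborn should no longer reserve a slot for it either.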

Python vs matplotlib - Chart generation issue

I have the Python code below, but its output is a chart like the one in the attachment, and it looks really messy. Can anybody tell me how to fix this and put the days in ascending order on the x-axis?
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("C/desktop/data.xlsx")
df = df.loc[df['month'] == 8]
df = df.astype({'day': str})
plt.plot( 'day', 'cases', data=df)
At first I didn't cast day to str, so it came out like this.
Because it had decimal numbers, I converted it to str. Now this happens.
What you got is typical of an unsorted dataset with many points per group.
As you did not provide an example, here is one:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'day': np.random.randint(1,21,size=100),
'cases': np.random.randint(0,50000,size=100),
})
plt.plot('day', 'cases', data=df)
There is no reason to plot a line in this case, you can use a scatter plot instead:
plt.scatter('day', 'cases', data=df)
To make more sense of your data, you can also compute an aggregated value (ex. mean):
plt.plot('day', 'cases', data=df.groupby('day', as_index=False)['cases'].mean())
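If the goal is simply to have the days in ascending order, a sketch along these lines may help: keep day numeric instead of converting it to str (string labels are placed on the axis in the order they first appear), and aggregate or sort before plotting. The sample values are made up, and summing the cases per day is only an assumption about what the chart should show.

```python
import pandas as pd

# Made-up sample: unsorted days stored as floats, several rows per day
df = pd.DataFrame({
    'day': [3.0, 1.0, 2.0, 1.0, 3.0],
    'cases': [30, 10, 20, 5, 15],
})

# Cast to int to drop the decimals while keeping the column numeric
df['day'] = df['day'].astype(int)

# groupby sorts its keys by default, so the result is in ascending day order
daily = df.groupby('day', as_index=False)['cases'].sum()
print(daily)

# plt.plot('day', 'cases', data=daily) would then draw one point per day, left to right
```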

Plotly: How to plot time series in Dash Plotly

I've searched for days and didn't find an answer. How can I plot time series data in Dash Plotly as a line graph with selectable lines?
My data (a pandas dataframe) describes the GDP of different countries. The index is the country; the columns are years.
I can't find a way to pass the data to a Dash Plotly line graph. What are my x and y values?
fig = px.line(df, x=?, y=?)
By the looks of it, the solution in your example should be:
fig = px.line(df, x=df.index, y = df.columns)
Plot 1 - plot by columns as they appear in your dataset
From here, if you'd like to display countries in the legend and have time on the x-axis, you can just add df = df.T into the mix and get:
Plot 2 - transposed dataframe to show time on the x-axis
Details
There's a multitude of possibilities when it comes to plotting time series with plotly. Your example displays a dataset in wide format. With the latest versions, plotly handles datasets of both long and wide format elegantly straight out of the box. If you need specifics of long and wide data in plotly you can also take a closer look here.
The code snippet below uses the approach described above, but in order for this to work for you exactly the same way, your countries will have to be set as the dataframe row index. But you've stated that they are, so give it a try and let me know how it works out for you. And one more thing: you can freely select which traces to display by clicking the years in the plot legend. The figure produced by the snippet below can also be directly implemented in Dash by following the steps under the section What About Dash? here.
Complete code:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
# sample dataframe of a wide format
np.random.seed(5); cols = ['Canada', 'France', 'Germany']
X = np.random.randn(6,len(cols))
df=pd.DataFrame(X, columns=cols)
df.iloc[0]=0;df=df.cumsum()
df['Year'] = pd.date_range('2020', freq='Y', periods=len(df)).year.astype(str)
df = df.T
df.columns = df.iloc[-1]
df = df.head(-1)
df.index.name = 'Country'
# Want time on the x-axis? ###
# just include:
# df = df.T
##############################
# plotly
fig = px.line(df, x=df.index, y = df.columns)
fig.update_layout(template="plotly_dark")
fig.show()
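If you prefer explicit x and y columns, the same kind of wide frame can be reshaped to long format first. This is a sketch with made-up GDP values; the column names Year and GDP are assumptions introduced here, and the px.line call is left as a comment so the snippet only needs pandas.

```python
import pandas as pd

# Made-up wide frame: countries as the index, years as the columns
df = pd.DataFrame(
    {'2020': [1.0, 2.0], '2021': [1.5, 2.5]},
    index=pd.Index(['Canada', 'France'], name='Country'),
)

# Long format: one row per (country, year) pair
long_df = df.reset_index().melt(id_vars='Country', var_name='Year', value_name='GDP')
print(long_df)

# With plotly installed, this gives one selectable line per country:
# fig = px.line(long_df, x='Year', y='GDP', color='Country')
```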

Plot stacked bar chart from pandas data frame

I have dataframe:
payout_df.head(10)
What would be the easiest, smartest and fastest way to replicate the following excel plot?
I've tried different approaches, but couldn't get everything into place.
Thanks
If you just want a stacked bar chart, then one way is to loop over the columns of the dataframe, plot each one, and keep track of the cumulative sum, which you pass as the bottom argument of pyplot.bar:
import pandas as pd
import matplotlib.pyplot as plt
# If it's not already a datetime
payout_df['payout'] = pd.to_datetime(payout_df.payout)
cumval=0
fig = plt.figure(figsize=(12,8))
for col in payout_df.columns[~payout_df.columns.isin(['payout'])]:
    plt.bar(payout_df.payout, payout_df[col], bottom=cumval, label=col)
    cumval = cumval+payout_df[col]
_ = plt.xticks(rotation=30)
_ = plt.legend(fontsize=18)
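To see what the running cumval holds at each step of the loop, the bottoms can also be computed for all columns at once. This is only an illustration: the original payout_df is not shown, so the column names a, b, c and their values are invented.

```python
import pandas as pd

# Invented stand-in for the numeric columns of payout_df
values = pd.DataFrame({
    'a': [1.0, 2.0],
    'b': [3.0, 4.0],
    'c': [5.0, 6.0],
})

# The bottom of each bar is the sum of all columns to its left:
# the running total across columns minus the column's own value
bottoms = values.cumsum(axis=1) - values
print(bottoms)
```

Each column of bottoms matches the value cumval has just before that column is drawn in the loop.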
Besides the lack of data, I think the following code will produce the desired graph
import pandas as pd
import matplotlib.pyplot as plt
df.payout = pd.to_datetime(df.payout)
grouped = df.groupby(pd.Grouper(key='payout', freq='M')).sum()
grouped.plot(x=grouped.index.year, kind='bar', stacked=True)
plt.show()
I don't know how to reproduce this fancy x-axis style. Also, your payout column must be a datetime, otherwise pd.Grouper won't work (available frequencies).

Side-by-side boxplot of multiple columns of a pandas DataFrame

One year of sample data:
import pandas as pd
import numpy.random as rnd
import seaborn as sns
n = 365
df = pd.DataFrame(data = {"A":rnd.randn(n), "B":rnd.randn(n)+1},
index=pd.date_range(start="2017-01-01", periods=n, freq="D"))
I want to boxplot these data side-by-side grouped by the month (i.e., two boxes per month, one for A and one for B).
For a single column sns.boxplot(df.index.month, df["A"]) works fine. However, sns.boxplot(df.index.month, df[["A", "B"]]) throws an error (ValueError: cannot copy sequence with size 2 to array axis with dimension 365). Melting the data by the index (pd.melt(df, id_vars=df.index, value_vars=["A", "B"], var_name="column")) in order to use seaborn's hue property as a workaround doesn't work either (TypeError: unhashable type: 'DatetimeIndex').
(A solution doesn't necessarily need to use seaborn, if it is easier using plain matplotlib.)
Edit
I found a workaround that basically produces what I want. However, it becomes somewhat awkward to work with once the DataFrame includes more variables than I want to plot. So if there is a more elegant/direct way to do it, please share!
df_stacked = df.stack().reset_index()
df_stacked.columns = ["date", "vars", "vals"]
df_stacked.index = df_stacked["date"]
sns.boxplot(x=df_stacked.index.month, y="vals", hue="vars", data=df_stacked)
Produces:
Here's a solution using pandas melt and seaborn:
import pandas as pd
import numpy.random as rnd
import seaborn as sns
n = 365
df = pd.DataFrame(data = {"A": rnd.randn(n),
"B": rnd.randn(n)+1,
"C": rnd.randn(n) + 10, # will not be plotted
},
index=pd.date_range(start="2017-01-01", periods=n, freq="D"))
df['month'] = df.index.month
df_plot = df.melt(id_vars='month', value_vars=["A", "B"])
sns.boxplot(x='month', y='value', hue='variable', data=df_plot)
import matplotlib.pyplot as plt
# Split the frame into one sub-frame per calendar month
month_dfs = []
for group in df.groupby(df.index.month):
    month_dfs.append(group[1])
plt.figure(figsize=(30,5))
# One subplot per month, with one box per column
for i,month_df in enumerate(month_dfs):
    axi = plt.subplot(1, len(month_dfs), i + 1)
    month_df.plot(kind='box', subplots=False, ax = axi)
    plt.title(i+1)
    plt.ylim([-4, 4])
plt.show()
Will give this
Not exactly what you're looking for but you get to keep a readable DataFrame if you add more variables.
You can also easily remove the axis by using
if i > 0:
    y_axis = axi.axes.get_yaxis()
    y_axis.set_visible(False)
in the loop before plt.show()
This is quite straight-forward using Altair:
import altair as alt
alt.Chart(
df.reset_index().melt(id_vars = ["index"], value_vars=["A", "B"]).assign(month = lambda x: x["index"].dt.month)
).mark_boxplot(
extent='min-max'
).encode(
alt.X('variable:N', title=''),
alt.Y('value:Q'),
column='month:N',
color='variable:N'
)
The code above melts the DataFrame and adds a month column. Then Altair creates box-plots for each variable broken down by months as the plot columns.
