How to connect boxplot median values - python

It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.
I'm using this syntax to create the boxplot, so that it automatically generates the box plot for Y vs. X device without any external manipulation of the data frame:
df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)
One way I thought of doing this is to draw a line plot using the mean values from the boxplot, but I'm not sure how to extract that information from the plot.

You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.
First let's generate some sample data:
import pandas as pd
import numpy as np
import seaborn as sns
N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})
print(df.head())
group value
0 A 0.816847
1 A 0.468465
2 C 0.871975
3 B 0.933708
4 A 0.480170
...
Next, make the boxplot and save the axis object:
ax = df.boxplot(column='value', by='group', showfliers=True,
positions=range(df.group.unique().shape[0]))
Note: There's a curious positions argument in Pyplot/Pandas boxplot(), which can cause off-by-one errors. See more in this discussion, including the workaround I've employed here.
Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:
sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)
Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
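If you would rather not depend on seaborn, the same idea can be sketched with pandas and matplotlib alone. This is a minimal sketch, assuming the boxes are drawn at the default positions 1, 2, ..., n (no positions argument) and that the seeded random data stands in for your own frame; it plots medians, so swap in mean() if that is what you need:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical sample data, seeded so the result is reproducible
rng = np.random.default_rng(0)
N = 150
df = pd.DataFrame({
    'value': rng.random(N),
    'group': rng.choice(['A', 'B', 'C'], size=N),
})

fig, ax = plt.subplots()
df.boxplot(column='value', by='group', ax=ax)

# groupby sorts the keys, which matches the left-to-right order of the boxes
medians = df.groupby('group')['value'].median()

# Without a positions argument the boxes sit at x = 1, 2, ..., n
positions = np.arange(1, len(medians) + 1)
(line,) = ax.plot(positions, medians.to_numpy(), marker='o', color='C1')
```

Because the line is drawn on the same Axes at the box positions, no tick relabelling is needed.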

You can get the value of the medians by using the .get_data() property of the matplotlib.lines.Line2D objects that draw them, without having to use seaborn.
Let bp be your boxplot created as bp=plt.boxplot(data). Then, bp is a dict containing the medians key, among others. That key contains a list of matplotlib.lines.Line2D, from which you can extract the (x,y) position as follows:
bp = plt.boxplot(data)
X = []
Y = []
for m in bp['medians']:
    # each median is a horizontal line segment with two end points
    [[x0, x1], [y0, y1]] = m.get_data()
    X.append(np.mean((x0, x1)))
    Y.append(np.mean((y0, y1)))
plt.plot(X, Y, c='C1')
For an arbitrary dataset (data), this script generates this figure. Hope it helps!
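For completeness, here is a self-contained version of that approach that can be run as is. The three normal samples are made-up stand-ins for data, and the final check against np.median is only there to show that the extracted y-values really are the medians:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: three samples with different centres
rng = np.random.default_rng(1)
data = [rng.normal(loc=mu, size=50) for mu in (0.0, 1.0, 0.5)]

fig, ax = plt.subplots()
bp = ax.boxplot(data)

X, Y = [], []
for m in bp['medians']:
    # get_data() returns the x and y coordinates of the segment's end points
    (x0, x1), (y0, y1) = m.get_data()
    X.append((x0 + x1) / 2)  # centre of the box on the x-axis
    Y.append((y0 + y1) / 2)  # the median value itself

ax.plot(X, Y, c='C1')

# Sanity check: the extracted values agree with the medians of the raw data
expected = [np.median(d) for d in data]
```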

Related

How to create grouped and stacked bars

I have a very large dataset with many subsidiaries serving three customer groups in various countries, something like this (in reality there are many more subsidiaries and dates):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'subsidiary': ['EU','EU','EU','EU','EU','EU','EU','EU','EU','US','US','US','US','US','US','US','US','US'],'date': ['2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05'],'business': ['RETAIL','RETAIL','RETAIL','CORP','CORP','CORP','PUBLIC','PUBLIC','PUBLIC','RETAIL','RETAIL','RETAIL','CORP','CORP','CORP','PUBLIC','PUBLIC','PUBLIC'],'value': [500.36,600.45,700.55,750.66,950.89,1300.13,100.05,120.00,150.01,800.79,900.55,1000,3500.79,5000.36,4500.25,50.17,75.25,90.33]})
print(df)
I'd like to make an analysis per subsidiary by producing a stacked bar chart. To do this, I started by defining the x-axis to be the unique months and by defining a subset per business type in a country like this:
x=df['date'].drop_duplicates()
EUCORP = df[(df['subsidiary']=='EU') & (df['business']=='CORP')]
EURETAIL = df[(df['subsidiary']=='EU') & (df['business']=='RETAIL')]
EUPUBLIC = df[(df['subsidiary']=='EU') & (df['business']=='PUBLIC')]
I can then make a bar chart per business type:
plotEUCORP = plt.bar(x=x, height=EUCORP['value'], width=.35)
plotEURETAIL = plt.bar(x=x, height=EURETAIL['value'], width=.35)
plotEUPUBLIC = plt.bar(x=x, height=EUPUBLIC['value'], width=.35)
However, if I try to stack all three together in one chart, I keep failing:
plotEURETAIL = plt.bar(x=x, height=EURETAIL['value'], width=.35)
plotEUCORP = plt.bar(x=x, height=EUCORP['value'], width=.35, bottom=EURETAIL)
plotEUPUBLIC = plt.bar(x=x, height=EUPUBLIC['value'], width=.35, bottom=EURETAIL+EUCORP)
plt.show()
I always receive the below error message:
ValueError: Missing category information for StrCategoryConverter; this might be caused by unintendedly mixing categorical and numeric data
ConversionError: Failed to convert value(s) to axis units: subsidiary date business value
0 EU 2019-03 RETAIL 500.36
1 EU 2019-04 RETAIL 600.45
2 EU 2019-05 RETAIL 700.55
I tried converting the months into the dateformat and/or indexing it, but it actually confused me further...
I would really appreciate any help on any of the following, as I have already spent many hours trying to figure this out (I am still a Python noob, sorry):
How can I fix the error to create a stacked bar chart?
Assuming, the error can be fixed, is this the most efficient way to create the bar chart (e.g. do I really need to create three sub-dfs per subsidiary, or is there a more elegant way?)
Would it be possible to code an iteration, that produces a stacked bar chart by country, so that I don't need to create one per subsidiary?
As an FYI, stacked bars are not the best option, because they can make it difficult to compare bar values and can easily be misinterpreted. The purpose of a visualization is to present data in an easily understood format; make sure the message is clear. Side-by-side bars are often a better option.
Side-by-side stacked bars are difficult to construct manually; it's better to use a figure-level method like seaborn.catplot, which creates a single, easy-to-read visualization.
Bar plot ticks are located on a 0-indexed range (not at datetimes) and the dates are just labels, so it is not necessary to convert them to a datetime dtype.
Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.3, seaborn 0.11.2
seaborn
import seaborn as sns
sns.catplot(kind='bar', data=df, col='subsidiary', x='date', y='value', hue='business')
Create grouped and stacked bars
See Stacked Bar Chart and Grouped bar chart with labels
The issue with the stacked bars in the OP is that bottom is being set to the entire dataframe for that group, instead of only the values that make up the bar height.
"Do I really need to create three sub-dfs per subsidiary?" Yes, a DataFrame is needed for every group, so 6 in this case.
Creating the data subsets can be automated using a dict-comprehension to unpack the .groupby object into a dict.
data = {''.join(k): v for k, v in df.groupby(['subsidiary', 'business'])} to create a dict of DataFrames
Access the values like: data['EUCORP'].value
Automating the plot creation is more arduous, as can be seen x depends on how many groups of bars for each tick, and bottom depends on the values for each subsequent plot.
import numpy as np
import matplotlib.pyplot as plt
labels=df['date'].drop_duplicates() # set the dates as labels
x0 = np.arange(len(labels)) # create an array of values for the ticks that can perform arithmetic with width (w)
# create the data groups with a dict comprehension and groupby
data = {''.join(k): v for k, v in df.groupby(['subsidiary', 'business'])}
# build the plots
subs = df.subsidiary.unique()
stacks = len(subs) # how many stacks in each group for a tick location
business = df.business.unique()
# set the width
w = 0.35
# this needs to be adjusted based on the number of stacks; each location needs to be split into the proper number of locations
x1 = [x0 - w/stacks, x0 + w/stacks]
fig, ax = plt.subplots()
for x, sub in zip(x1, subs):
    bottom = 0  # each stack starts from the axis
    for bus in business:
        height = data[f'{sub}{bus}'].value.to_numpy()
        ax.bar(x=x, height=height, width=w, bottom=bottom)
        bottom += height  # the next segment sits on top of this one
ax.set_xticks(x0)
_ = ax.set_xticklabels(labels)
As you can see, small values are difficult to discern, and using ax.set_yscale('log') does not work as expected with stacked bars (e.g. it does not make small values more readable).
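To isolate the fix for the original error, here is a minimal sketch for the EU subsidiary only, using the EU values from the question. The point is that bottom receives a plain array of running totals, not a DataFrame; the variable names eu and order are illustrative assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# EU rows from the question
eu = pd.DataFrame({
    'date': ['2019-03', '2019-04', '2019-05'] * 3,
    'business': ['RETAIL'] * 3 + ['CORP'] * 3 + ['PUBLIC'] * 3,
    'value': [500.36, 600.45, 700.55,
              750.66, 950.89, 1300.13,
              100.05, 120.00, 150.01],
})

labels = eu['date'].drop_duplicates().to_numpy()
order = ['RETAIL', 'CORP', 'PUBLIC']  # stacking order, bottom to top

fig, ax = plt.subplots()
bottom = np.zeros(len(labels))  # running total per date
for bus in order:
    # only the numeric values, never the whole sub-DataFrame
    height = eu.loc[eu['business'] == bus, 'value'].to_numpy()
    ax.bar(labels, height, width=0.35, bottom=bottom, label=bus)
    bottom = bottom + height
ax.legend()
```

After the loop, bottom holds the total height of each stack, which is handy for annotating the bars.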
Create only stacked bars
As mentioned by @r-beginners, use .pivot, or .pivot_table, to reshape the dataframe to a wide form to create stacked bars where the x-axis is a tuple ('date', 'subsidiary').
Use .pivot if there are no repeat values for each category
Use .pivot_table, if there are repeat values that must be combined with aggfunc (e.g. 'sum', 'mean', etc.)
# reshape the dataframe
dfp = df.pivot(index=['date', 'subsidiary'], columns=['business'], values='value')
# plot stacked bars
dfp.plot(kind='bar', stacked=True, rot=0, figsize=(10, 4))
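The .pivot_table case can be sketched as follows. The small frame below is made up for illustration (it is not the question's data) and contains two CORP rows for the same date and subsidiary, which aggfunc='sum' combines into one cell:

```python
import pandas as pd

# Hypothetical data with a repeated (date, subsidiary, business) combination
dup = pd.DataFrame({
    'date': ['2019-03', '2019-03', '2019-03', '2019-04', '2019-04'],
    'subsidiary': ['EU', 'EU', 'EU', 'EU', 'EU'],
    'business': ['CORP', 'CORP', 'RETAIL', 'CORP', 'RETAIL'],
    'value': [100.0, 50.0, 20.0, 70.0, 30.0],
})

# Repeated entries are combined with aggfunc before reshaping to wide form
dfp = dup.pivot_table(index=['date', 'subsidiary'], columns='business',
                      values='value', aggfunc='sum')
```

The resulting dfp has one row per (date, subsidiary) pair and one column per business, so it can be passed to .plot(kind='bar', stacked=True) in the same way as above.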

Plot multiple lines in subplots

I'd like to plot lines from a 3D data frame, the third dimension being an extra level in the column index. But I can't manage to either wrangle the data in a proper format or call the plot function appropriately. What I'm looking for is a plot where many series are plotted in subplots arranged by the outer column index. Let me illustrate with some random data.
import numpy as np
import pandas as pd
n_points_per_series = 6
n_series_per_feature = 5
n_features = 4
shape = (n_points_per_series, n_features, n_series_per_feature)
data = np.random.randn(*shape).reshape(n_points_per_series, -1)
points = range(n_points_per_series)
features = [chr(ord('a') + i) for i in range(n_features)]
series = [f'S{i}' for i in range(n_series_per_feature)]
index = pd.Index(points, name='point')
columns = pd.MultiIndex.from_product((features, series)).rename(['feature', 'series'])
data = pd.DataFrame(data, index=index, columns=columns)
So for this particular data frame, 4 subplots (n_features) should be generated, each containing 5 (n_series_per_feature) series with 6 data points. Since the method plots lines in the index direction and subplots can be generated for each column, I tried some variations:
data.plot()
data.plot(subplots=True)
data.stack().plot()
data.stack().plot(subplots=True)
None of them work. Either too many lines are generated with no subplots, a subplot is made for each line separately or after stacking values along the index are joined to one long series. And I think the x and y arguments are not usable here, since converting the index to a column and using it in x just produces a long line jumping all over the place:
data.stack().reset_index().set_index('series').plot(x='point', y=features)
In my experience this sort of thing should be pretty straightforward in Pandas, but I'm at a loss. How could this subplot arrangement be achieved? If not with a single function call, are there any more convenient ways than generating subplots in matplotlib and indexing the series for plotting manually?
If you're okay with using seaborn, it can be used to produce subplots from a data frame column, onto which plots with other columns can then be mapped. With the same setup you had I'd try something along these lines:
import seaborn as sns
# Completely stack the data frame
df = data \
.stack() \
.stack() \
.rename("value") \
.reset_index()
# Create grid and map line plots
g = sns.FacetGrid(df, col="feature", col_wrap=2, hue="series")
g.map_dataframe(sns.lineplot, x="point", y="value")
g.add_legend()
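If you prefer to stay with pandas and matplotlib, a short loop over the outer column level is another option. This is a sketch under the same setup as the question (4 features, 5 series, 6 points); the 2x2 grid and the shared figure legend are my own choices, not something the question prescribes:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

features = ['a', 'b', 'c', 'd']
series = [f'S{i}' for i in range(5)]
columns = pd.MultiIndex.from_product((features, series),
                                     names=['feature', 'series'])
index = pd.Index(range(6), name='point')
values = np.random.default_rng(0).standard_normal((6, 20))
data = pd.DataFrame(values, index=index, columns=columns)

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for ax, feature in zip(axes.flat, features):
    # data[feature] selects the 5 series columns under that feature
    data[feature].plot(ax=ax, title=feature, legend=False)

# One legend for the whole figure instead of one per subplot
handles, labels = axes.flat[0].get_legend_handles_labels()
fig.legend(handles, labels, title='series', loc='center right')
```

Selecting with data[feature] drops the outer level, so each subplot receives an ordinary frame with one column per series.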

Seaborn distplot only returns one column when trying to plot each Pandas column in a for loop

I have a problem when I try to plot Pandas columns in a for loop.
When I use displot instead of distplot it works, but it only shows the overall distribution, not the distribution per group. Let's say I have a list of column names called columns and a Pandas dataframe n, which has a column named class. The goal is to show a distribution plot of each column for each class:
for w in columns:
    if w!=<discarded column> or w!=<discarded column>:
        sns.displot(n[w],kde=True)
but when I use distplot, it returns only the first column:
for w in columns:
    if w!=<discarded column> or w!=<discarded column>:
        sns.distplot(n[w],kde=True)
I'm still new to Seaborn, since I have never used any visualization and have relied on numerical analysis like p-values and correlation. Any help is appreciated.
You are probably getting only the figure corresponding to the last iteration of the loop.
So you have to explicitly ask for the figure to be shown in each iteration.
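import matplotlib.pyplot as plt
import seaborn as sns
# discarded_columns is a list of the column names to skip
for w in columns:
    if w not in discarded_columns:
        sns.distplot(n[w], kde=True)
        plt.show()  # show one figure per column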
or you can make subplots:
# Keep only target-columns
target_columns = list(filter(lambda x: x not in discarded_columns, columns))
# Plot with subplots
fig, axes = plt.subplots(len(target_columns)) # see the parameters, like: nrows, ncols ... figsize=(16,12)
for i, w in enumerate(target_columns):
    sns.distplot(n[w], kde=True, ax=axes[i])
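Neither loop above splits the distributions by class, which was the stated goal. One way to get that is sketched below with plain matplotlib histograms instead of seaborn; this swaps in a different technique, and the frame n, its column names and discarded_columns are made-up placeholders for your own data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataframe n from the question
rng = np.random.default_rng(0)
n = pd.DataFrame({
    'height': rng.normal(170, 10, 200),
    'weight': rng.normal(70, 8, 200),
    'id': np.arange(200),
    'class': rng.choice(['x', 'y'], size=200),
})
discarded_columns = ['id', 'class']
target_columns = [c for c in n.columns if c not in discarded_columns]

# squeeze=False keeps axes two-dimensional even for a single column
fig, axes = plt.subplots(len(target_columns), squeeze=False)
for ax, w in zip(axes.flat, target_columns):
    # one semi-transparent density histogram per class on the same axes
    for cls, grp in n.groupby('class'):
        ax.hist(grp[w], bins=20, density=True, alpha=0.5, label=str(cls))
    ax.set_title(w)
    ax.legend(title='class')
```

Using density=True keeps classes of different sizes comparable on the same axes.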

Why is matplotlib .plot(kind='bar') plot so different to .plot()

This may be a very stupid question, but when plotting a Pandas DataFrame using .plot() it is very quick and produces a graph with an appropriate index. As soon as I try to change this to a bar chart, it just seems to lose all formatting and the index goes wild. Why is this the case? And is there an easy way to just plot a bar chart with the same format as the line chart?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.date_range(start='01/01/2012', end='31/12/2018')
df['Value'] = np.random.randint(low=5, high=100, size=len(df))
df.set_index('Date', inplace=True)
df.plot()
plt.show()
df.plot(kind='bar')
plt.show()
Update:
For comparison, if I take the data and put it into Excel, then create a line plot and a bar ('column') plot it instantly will convert the plot and keep the axis labels as they were for the line plot. If I try to produce many (thousands) of bar charts in Python with years of daily data, this takes a long time. Is there just an equivalent way of doing this Excel transformation in Python?
Pandas bar plots are categorical in nature; i.e. each bar is a separate category and gets its own label. Plotting numeric bar plots (in the same manner as line plots) is not currently possible with pandas.
In contrast matplotlib bar plots are numerical if the input data is numbers or dates. So
plt.bar(df.index, df["Value"])
produces
Note however that because there are 2557 data points in your dataframe, distributed over only a few hundred pixels, not all bars are actually plotted. Conversely, if you want each bar to be shown, it needs to be at least one pixel wide in the final image. This means that with 5% margins on each side your figure needs to be more than 2800 pixels wide, or saved in a vector format.
So rather than showing daily data, maybe it makes sense to aggregate to monthly or quarterly data first.
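A minimal sketch of that aggregation, assuming monthly means are what you want and using random values in place of your data. The 'MS' (month start) frequency and the bar width of 20 days are illustrative choices:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range(start='2012-01-01', end='2018-12-31')
df = pd.DataFrame({'Value': rng.integers(5, 100, size=len(dates))},
                  index=dates)

# Aggregate the daily values to one mean per calendar month
monthly = df['Value'].resample('MS').mean()

fig, ax = plt.subplots()
# matplotlib keeps the x-axis numeric (dates); width is given in days
ax.bar(monthly.index, monthly.to_numpy(), width=20)
```

With 84 bars instead of 2557, every bar is wide enough to be rendered and the date axis keeps the same tick formatting as the line plot.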
The default .plot() connects all your data points with straight lines and produces a line plot.
On the other hand, the .plot(kind='bar') plots each data point as a discrete bar. To get a proper formatting on the x-axis, you will have to modify the tick-labels post plotting.
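As a rough sketch of what modifying the tick labels can look like: the pandas bar plot places the bars at integer positions 0, 1, ..., n-1, so you can keep only every k-th position and label it yourself. The 120-day range and the step of 30 are arbitrary values chosen for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range('2012-01-01', periods=120, freq='D')
df = pd.DataFrame({'Value': rng.integers(5, 100, size=len(dates))},
                  index=dates)

ax = df.plot(kind='bar', legend=False)

# Bars sit at integer positions, so thin the ticks by slicing those positions
step = 30
positions = np.arange(len(df))[::step]
ax.set_xticks(positions)
ax.set_xticklabels(list(df.index[::step].strftime('%Y-%m-%d')), rotation=0)
tick_texts = [t.get_text() for t in ax.get_xticklabels()]
```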

How do I plot my histogram for density rather than count? (Matplotlib)

I have a data frame called 'train' with a column 'string' and a column 'string length' and a column 'rank' which has ranking ranging from 0-4.
I want to create a histogram of the string length for each ranking and plot all of the histograms on one graph to compare. I am experiencing two issues with this:
The only way I can manage to do this is by creating separate datasets e.g. with the following type of code:
S0 = train.loc[train['rank'] == 0]
S1 = train.loc[train['rank'] == 1]
Then I create individual histograms for each dataset using:
plt.hist(train['string length'], bins = 100)
plt.show()
This code doesn't plot the density but instead plots the counts. How do I alter my code such that it plots density instead?
Is there also a way to do this without having to create separate datasets? I was told that my method is 'unpythonic'
You could do something like:
df.loc[:, df.columns != 'string'].groupby('rank').hist(density=True, bins =10, figsize=(5,5))
Basically, what it does is select all columns except string, group them by rank and make a histogram of each of them with the given arguments.
Setting density=True normalizes each histogram so that the area under it equals 1, so the y-axis shows a probability density instead of counts.
Hope this has helped.
EDIT:
If there are more variables and you want the histograms overlapped, try:
df.groupby('rank')['string length'].hist(density=True, histtype='step', bins =10,figsize=(5,5))
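The same overlay can be written with matplotlib directly, which also makes it easy to confirm what density=True does. This is a sketch with made-up data standing in for train; the areas dict is only there to check that each histogram integrates to 1:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the train dataframe from the question
rng = np.random.default_rng(0)
train = pd.DataFrame({
    'string length': rng.integers(1, 60, size=500),
    'rank': rng.integers(0, 5, size=500),
})

fig, ax = plt.subplots()
areas = {}
# groupby avoids building S0, S1, ... by hand
for rank, grp in train.groupby('rank'):
    heights, edges, _ = ax.hist(grp['string length'], bins=10, density=True,
                                histtype='step', label=f'rank {rank}')
    # density=True scales the bars so that sum(height * bin width) == 1
    areas[rank] = float(np.sum(heights * np.diff(edges)))
ax.legend()
```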
