How to create a plot with stacked and labeled line segments

I want to create a sort of stacked bar chart (I don't know the proper name). I hand drew the graph (for the years 2016 and 2017) and attached it here.
The code to create the df is below:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = [[2016.0, 0.4862, 0.4115, 0.3905, 0.3483, 0.1196],
[2017.0, 0.4471, 0.4096, 0.3725, 0.2866, 0.1387],
[2018.0, 0.4748, 0.4016, 0.3381, 0.2905, 0.2012],
[2019.0, 0.4705, 0.4247, 0.3857, 0.3333, 0.2457],
[2020.0, 0.4755, 0.4196, 0.3971, 0.3825, 0.2965]]
cols = ['attribute_time', '100-81 percentile', '80-61 percentile', '60-41 percentile', '40-21 percentile', '20-0 percentile']
df = pd.DataFrame(data, columns=cols)
#set seaborn plotting aesthetics
sns.set(style='white')
#create stacked bar chart
df.set_index('attribute_time').plot(kind='bar', stacked=True)
The data doesn't need to stack on top of each other. The code will create a stacked bar chart, but that's not exactly what needs to be displayed. The percentile needs to have labeled horizontal lines indicating the percentile on the x axis for that year. Does anyone have recommendations on how to achieve this goal? Is it a sort of modified stacked bar chart that needs to be visualized?

My approach to this is to represent the data as a categorical scatter plot (stripplot in Seaborn) using horizontal lines rather than points as markers. You'll have to make some choices about exactly how and where you want to plot things, but this should get you started!
I first modified the data a little bit:
df['attribute_time'] = df['attribute_time'].astype('int') # Just to get rid of the decimals.
df = df.melt(id_vars = ['attribute_time'],
value_name = 'pct_value',
var_name = 'pct_range')
Melting the DataFrame takes the wide data and makes it long instead, so the columns are now attribute_time, pct_range, and pct_value, and there is a row for each data point.
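To make the wide-to-long reshaping concrete, here is a minimal sketch on a cut-down version of the data (two years and two percentile columns, chosen only for illustration):

```python
import pandas as pd

# Cut-down, illustrative subset of the original wide data
wide = pd.DataFrame({
    "attribute_time": [2016, 2017],
    "100-81 percentile": [0.4862, 0.4471],
    "20-0 percentile": [0.1196, 0.1387],
})

# One row per (year, percentile range) pair
long = wide.melt(id_vars=["attribute_time"],
                 value_name="pct_value",
                 var_name="pct_range")
print(long)
```

Each of the two value columns contributes one row per year, so the long frame has four rows.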
Next is the plotting:
fig, ax = plt.subplots()
sns.stripplot(data = df,
x = 'attribute_time',
y = 'pct_value',
hue = 'pct_range',
jitter = False,
marker = '_',
s = 40,
linewidth = 3,
ax = ax)
Instead of labeling each point with the range that it belongs to, I thought it would be a lot cleaner to separate the ranges by color.
Jitter spreads out points within a category so that overlapping points don't hide each other. We don't need that here, so I turned the jitter off. The marker style '_' is matplotlib's horizontal line (hline) marker.
The s parameter is the marker size, which here sets the horizontal length of each line, and linewidth is its thickness, so you can play around with those a bit to see what works best for you.
Text is added to the figure using the ax.text method as follows:
for year, value in zip(df['attribute_time'],df['pct_value']):
    ax.text(year - 2016,
            value,
            str(value),
            ha = 'center',
            va = 'bottom',
            fontsize = 'small')
The positions on the categorical x-axis are indexed starting from 0, even though the tick labels display the years, so the x position of each text label is the year minus the minimum year (2016). The y position is equal to the value, and the text itself is a string representation of the value. The text is horizontally centered on the line and sits just above it because its vertical anchor is at the bottom.
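If you would rather not hard-code 2016, one option is to build the year-to-position mapping explicitly. The sketch below is an alternative written in plain matplotlib (ax.hlines instead of seaborn's stripplot), with a small illustrative subset of the data; the positions dict and the 0.3 half-width are assumptions made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Illustrative subset: one value per year
years = [2016, 2017, 2018]
values = [0.4862, 0.4471, 0.4748]

# Map each year to its 0-based position on the categorical axis
positions = {year: i for i, year in enumerate(sorted(set(years)))}

fig, ax = plt.subplots()
for year, value in zip(years, values):
    x = positions[year]
    # Horizontal segment centered on the year's position
    ax.hlines(value, x - 0.3, x + 0.3, linewidth=3)
    # Label centered horizontally, anchored just above the segment
    ax.text(x, value, str(value), ha="center", va="bottom", fontsize="small")

# Show the years as tick labels at positions 0, 1, 2
ax.set_xticks(list(positions.values()))
ax.set_xticklabels([str(year) for year in positions])
```

Because the mapping is derived from the data, the labels stay aligned if the first year changes.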
There's obviously a lot you can tweak to make it look how you want with sizing and labeling and stuff, but hopefully this is at least a good start!

Related

How to position the bar graph while using DataFrame so that the y-axis coordinates will be shown completely in the saved .png file

I am using a dictionary through DataFrame to plot a barh graph, as:
import pandas as pd
import matplotlib.pyplot as plt
#data is a dictionary and index is a list
df = pd.DataFrame (data = data, index = index)
df.plot.barh(stacked= True, figsize= (15,8), fontsize = 14, position = 2.5, title="Thinking about it")
fig = plt.gcf()
fig.savefig('tts_tds.png')
The problem I am facing is that when I open the .png file, the longer strings in the index are not completely shown on the y-axis, i.e., the y-axis tick labels are cut off.
For example:
if one of the elements in the index is 'ABCDEFGHIJKLMNOP QRSTUVWXYZ 1234TSR'
then in the .png file, at y-axis the coordinate is shown as: ABCDEFGHI
I know about the position argument that can be passed to df.plot.barh() to position the bar graph. But when I increase the number above 1 (like 2 or 3 or 5), the labels shift upwards and not towards the right side.
So, if there is any way to resolve this positioning problem, please let me know. Alternatively, if there is a way to show the labels separately in a vertical or horizontal column beside or below the graph, please tell me how to do it.
By below I mean well below the x-axis, not overlapping the x-axis tick labels, or like a separate image that is still part of the graph.
If you save the figure with bbox_inches="tight" (see the matplotlib documentation for savefig), it will accommodate the long string. I'm not sure about the "position" argument; I received an error with that included. I think this is what you meant:
import pandas as pd
import matplotlib.pyplot as plt
# DataFrame
data = {'factor1': [5,4,3], 'factor2': [9, 4, 3]}
index = ["ABCDEFGHIJKLMNOP QRSTUVWXYZ 1234TSR", "two", "three"]
df = pd.DataFrame (data = data, index = index)
# Plot
fig, ax = plt.subplots()
df.plot.barh(stacked= True, figsize= (15,8), fontsize = 14, title="Thinking about it", ax=ax)
fig.savefig('tts_tds.png', bbox_inches="tight")

How to create grouped and stacked bars

I have a very huge dataset with a lot of subsidiaries serving three customer groups in various countries, something like this (in reality there are much more subsidiaries and dates):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'subsidiary': ['EU','EU','EU','EU','EU','EU','EU','EU','EU','US','US','US','US','US','US','US','US','US'],'date': ['2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05'],'business': ['RETAIL','RETAIL','RETAIL','CORP','CORP','CORP','PUBLIC','PUBLIC','PUBLIC','RETAIL','RETAIL','RETAIL','CORP','CORP','CORP','PUBLIC','PUBLIC','PUBLIC'],'value': [500.36,600.45,700.55,750.66,950.89,1300.13,100.05,120.00,150.01,800.79,900.55,1000,3500.79,5000.36,4500.25,50.17,75.25,90.33]})
print(df)
I'd like to make an analysis per subsidiary by producing a stacked bar chart. To do this, I started by defining the x-axis to be the unique months and by defining a subset per business type in a country like this:
x=df['date'].drop_duplicates()
EUCORP = df[(df['subsidiary']=='EU') & (df['business']=='CORP')]
EURETAIL = df[(df['subsidiary']=='EU') & (df['business']=='RETAIL')]
EUPUBLIC = df[(df['subsidiary']=='EU') & (df['business']=='PUBLIC')]
I can then make a bar chart per business type:
plotEUCORP = plt.bar(x=x, height=EUCORP['value'], width=.35)
plotEURETAIL = plt.bar(x=x, height=EURETAIL['value'], width=.35)
plotEUPUBLIC = plt.bar(x=x, height=EUPUBLIC['value'], width=.35)
However, if I try to stack all three together in one chart, I keep failing:
plotEURETAIL = plt.bar(x=x, height=EURETAIL['value'], width=.35)
plotEUCORP = plt.bar(x=x, height=EUCORP['value'], width=.35, bottom=EURETAIL)
plotEUPUBLIC = plt.bar(x=x, height=EUPUBLIC['value'], width=.35, bottom=EURETAIL+EUCORP)
plt.show()
I always receive the below error message:
ValueError: Missing category information for StrCategoryConverter; this might be caused by unintendedly mixing categorical and numeric data
ConversionError: Failed to convert value(s) to axis units: subsidiary date business value
0 EU 2019-03 RETAIL 500.36
1 EU 2019-04 RETAIL 600.45
2 EU 2019-05 RETAIL 700.55
I tried converting the months into the dateformat and/or indexing it, but it actually confused me further...
I would really appreciate any help/support on any of the following, as I have already spent a lot of hours trying to figure this out (I am still a python noob, sry):
How can I fix the error to create a stacked bar chart?
Assuming, the error can be fixed, is this the most efficient way to create the bar chart (e.g. do I really need to create three sub-dfs per subsidiary, or is there a more elegant way?)
Would it be possible to code an iteration, that produces a stacked bar chart by country, so that I don't need to create one per subsidiary?
As an FYI, stacked bars are not the best option, because they can make it difficult to compare bar values and can easily be misinterpreted. The purpose of a visualization is to present data in an easily understood format; make sure the message is clear. Side-by-side bars are often a better option.
Side-by-side stacked bars are a difficult manual process to construct; it's better to use a figure-level method like seaborn.catplot, which will create a single, easy-to-read data visualization.
Bar plot ticks are located by 0 indexed range (not datetimes), the dates are just labels, so it is not necessary to convert them to a datetime dtype.
Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.3, seaborn 0.11.2
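The point about tick locations can be checked directly. This minimal sketch (plain matplotlib, with made-up heights) plots bars against string labels and reads back where the bars were actually placed:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

labels = ["2019-03", "2019-04", "2019-05"]
heights = [1, 2, 3]  # illustrative values only

fig, ax = plt.subplots()
bars = ax.bar(labels, heights, width=0.35)

# The strings are only labels; the bars are centered on 0, 1, 2
centers = [bar.get_x() + bar.get_width() / 2 for bar in bars]
print(centers)
```

This is why arithmetic such as shifting a bar by a fraction of its width has to be done on a numeric range of positions, not on the date strings.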
seaborn
import seaborn as sns
sns.catplot(kind='bar', data=df, col='subsidiary', x='date', y='value', hue='business')
Create grouped and stacked bars
See Stacked Bar Chart and Grouped bar chart with labels
The issue with the creation of the stacked bars in the OP is bottom is being set on the entire dataframe for that group, instead of only the values that make up the bar height.
"Do I really need to create three sub-dfs per subsidiary?" Yes, a DataFrame is needed for every group, so 6 in this case.
Creating the data subsets can be automated using a dict-comprehension to unpack the .groupby object into a dict.
data = {''.join(k): v for k, v in df.groupby(['subsidiary', 'business'])} to create a dict of DataFrames
Access the values like: data['EUCORP'].value
Automating the plot creation is more arduous: x depends on how many groups of bars there are for each tick, and bottom for each subsequent plot depends on the values already plotted.
import numpy as np
import matplotlib.pyplot as plt
labels=df['date'].drop_duplicates() # set the dates as labels
x0 = np.arange(len(labels)) # create an array of values for the ticks that can perform arithmetic with width (w)
# create the data groups with a dict comprehension and groupby
data = {''.join(k): v for k, v in df.groupby(['subsidiary', 'business'])}
# build the plots
subs = df.subsidiary.unique()
stacks = len(subs) # how many stacks in each group for a tick location
business = df.business.unique()
# set the width
w = 0.35
# this needs to be adjusted based on the number of stacks; each location needs to be split into the proper number of locations
x1 = [x0 - w/stacks, x0 + w/stacks]
fig, ax = plt.subplots()
for x, sub in zip(x1, subs):
    bottom = 0
    for bus in business:
        height = data[f'{sub}{bus}'].value.to_numpy()
        ax.bar(x=x, height=height, width=w, bottom=bottom)
        bottom += height
ax.set_xticks(x0)
_ = ax.set_xticklabels(labels)
As you can see, small values are difficult to discern, and using ax.set_yscale('log') does not work as expected with stacked bars (e.g. it does not make small values more readable).
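For the simpler case in the question (one subsidiary, three business types), the fix is the same idea: bottom must be numeric values, not a DataFrame. A minimal sketch using the EU values from the question, with the arrays typed in by hand as an assumption for brevity:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

dates = ["2019-03", "2019-04", "2019-05"]
# EU values from the question, one array per business type
retail = np.array([500.36, 600.45, 700.55])
corp = np.array([750.66, 950.89, 1300.13])
public = np.array([100.05, 120.00, 150.01])

fig, ax = plt.subplots()
bottom = np.zeros(len(dates))  # running top of the stack for each date
for label, height in [("RETAIL", retail), ("CORP", corp), ("PUBLIC", public)]:
    ax.bar(dates, height, width=0.35, bottom=bottom, label=label)
    bottom = bottom + height  # the next layer starts where this one ends
ax.legend()
```

After the loop, bottom holds the total height of each stack.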
Create only stacked bars
As mentioned by @r-beginners, use .pivot, or .pivot_table, to reshape the dataframe to a wide form to create stacked bars where the x-axis is a tuple ('date', 'subsidiary').
Use .pivot if there are no repeat values for each category
Use .pivot_table, if there are repeat values that must be combined with aggfunc (e.g. 'sum', 'mean', etc.)
# reshape the dataframe
dfp = df.pivot(index=['date', 'subsidiary'], columns=['business'], values='value')
# plot stacked bars
dfp.plot(kind='bar', stacked=True, rot=0, figsize=(10, 4))
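The difference between .pivot and .pivot_table only shows up when a category combination repeats. Here is a minimal sketch with a made-up frame in which ('2019-03', 'CORP') appears twice:

```python
import pandas as pd

# Illustrative data: the ('2019-03', 'CORP') combination is repeated
dup = pd.DataFrame({
    "date": ["2019-03", "2019-03", "2019-04"],
    "business": ["CORP", "CORP", "CORP"],
    "value": [100.0, 50.0, 70.0],
})

# .pivot would raise a ValueError here because of the duplicate entry;
# .pivot_table combines the repeated values with aggfunc instead
wide = dup.pivot_table(index="date", columns="business",
                       values="value", aggfunc="sum")
print(wide)
```

The two 2019-03 rows are summed into a single cell, so the result can be passed to .plot(kind='bar', stacked=True) like the pivoted frame above.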

How to center the histogram bars around tick marks using seaborn displot? Stacking bars is essential

I have searched for ways of centering histogram bars around tick marks but was not able to find a solution that works with seaborn displot. The displot function lets me stack the histogram according to a column in the dataframe, so I would prefer a solution using displot, or something else that allows stacking based on a column in a data frame with color-coding via a palette.
Even after setting the tick values, I am not able to get the bars to center around the tick marks.
Example code
# Center the histogram on the tick marks
tips = sns.load_dataset('tips')
sns.displot(x="total_bill",
hue="day", multiple = 'stack', data=tips)
plt.xticks(np.arange(0, 50, 5))
I would also like to plot a histogram of a variable that takes a single value and choose the bin width of the resulting histogram in such a way that it is centered around the value. (0.5 in this example.)
I can get the center point by choosing the number of bins equal to a number of tick marks but the resulting bar is very thin. How can I increase the bin size in this case, where there is only one bar but want to display all the other possible points. By displaying all the tick marks, the bar width is very tiny.
I want the same centering of the bar at the 0.5 tick mark but make it wider as it is the only value for which counts are displayed.
Any solutions?
tips['single'] = 0.5
sns.displot(x='single',
hue="day", multiple = 'stack', data=tips, bins = 10)
plt.xticks(np.arange(0, 1, 0.1))
Edit:
Would it be possible to have more control over the tick marks in the second case? I would not want to display the round off to 1 decimal place but chose which of the tick marks to display. Is it possible to display just one value in the tick mark and have it centered around that?
Do min_val and max_val in this case refer to the value of the variable, which will be 0 in this case? The x-axis would then extend to negative values even though there are none, and I don't want to display them.
For your first problem, you may want to figure out a few properties of the data that you're plotting, for example the range of the data. Additionally, you may want to choose beforehand the number of bins that you want displayed.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FormatStrFormatter
tips = sns.load_dataset('tips')
min_val = tips.total_bill.min()
max_val = tips.total_bill.max()
val_width = max_val - min_val
n_bins = 10
bin_width = val_width/n_bins
sns.histplot(x="total_bill",
hue="day", multiple = 'stack', data=tips,
bins=n_bins, binrange=(min_val, max_val),
palette='Paired')
plt.xlim(0, 55) # Define x-axis limits
Another thing to remember is that the width of a bar in a histogram identifies the bounds of its range. So a bar spanning [2,5] on the x-axis implies that the values represented by that bar belong to that range.
Considering this, it is easy to formulate a solution. Assume that we want ticks that identify the bounds of each bar in the original plot; one solution may look like
plt.xticks(np.arange(min_val-bin_width, max_val+bin_width, bin_width))
Now, if we offset the ticks by half a bin-width, we will get to the centers of the bars.
plt.xticks(np.arange(min_val-bin_width/2, max_val+bin_width/2, bin_width))
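The arithmetic behind the half-bin offset can be checked on its own. This sketch uses round, illustrative numbers (a range of 0 to 50 split into 10 bins) rather than the tips data, and np.linspace as an assumed alternative to np.arange for building the edges:

```python
import numpy as np

# Illustrative range and bin count, not taken from the tips dataset
min_val, max_val, n_bins = 0.0, 50.0, 10
bin_width = (max_val - min_val) / n_bins

# n_bins bars have n_bins + 1 edges
edges = np.linspace(min_val, max_val, n_bins + 1)

# Each bar's center is its left edge plus half a bin width
centers = edges[:-1] + bin_width / 2
print(centers)
```

Passing centers to plt.xticks puts one tick in the middle of every bar, while passing edges marks the bar boundaries.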
For your single value plot, the idea remains the same: control the bin width, the x-axis range, and the ticks. The bin width has to be set explicitly, since the automatically inferred bin width will probably be too narrow to be visible on the plot. Histogram bars always indicate a range, even when there is just one single value. This is illustrated in the following example and figure.
single_val = 23.5
tips['single'] = single_val
bin_width = 4
fig, axs = plt.subplots(1, 2, sharey=True, figsize=(12,4)) # Get 2 subplots
# Case 1 - With the single value as x-tick label on subplot 0
sns.histplot(x='single',
hue="day", multiple = 'stack', data=tips,
binwidth=bin_width, binrange=(single_val-bin_width, single_val+bin_width),
palette='rocket',
ax=axs[0])
ticks = [single_val, single_val+bin_width] # 2 ticks - given value and given_value + width
axs[0].set(
title='Given value as tick-label starts the bin on x-axis',
xticks=ticks,
xlim=(0, int(single_val*2)+bin_width)) # x-range such that bar is at middle of x-axis
axs[0].xaxis.set_major_formatter(FormatStrFormatter('%.1f'))
# Case 2 - With centering on the bin starting at single-value on subplot 1
sns.histplot(x='single',
hue="day", multiple = 'stack', data=tips,
binwidth=bin_width, binrange=(single_val-bin_width, single_val+bin_width),
palette='rocket',
ax=axs[1])
ticks = [single_val+bin_width/2] # Just the bin center
axs[1].set(
title='Bin centre is offset from single_value by bin_width/2',
xticks=ticks,
xlim=(0, int(single_val*2)+bin_width) ) # x-range such that bar is at middle of x-axis
axs[1].xaxis.set_major_formatter(FormatStrFormatter('%.1f'))
Output:
I feel from your description that what you really mean by a bar graph is a categorical bar graph. The centering is then automatic, because each bar is no longer a range but a discrete category. For the numeric and continuous nature of the variable in the example data, I would not recommend such an approach. Pandas supports categorical bar plots (DataFrame.plot.bar). For our example, one way to do this is as follows:
n_colors = len(tips['day'].unique()) # Get number of uniques categories
agg_df = tips[['single', 'day']].groupby(['day']).agg(
val_count=('single', 'count'),
val=('single','max')
).reset_index() # Get aggregated information along the categories
agg_df.pivot(columns='day', values='val_count', index='val').plot.bar(
stacked=True,
color=sns.color_palette("Paired", n_colors), # Choose "number of days" colors from palette
width=0.05 # Set bar width
)
plt.show()
This yields:

Sort categorical x-axis in a seaborn scatter plot

I am trying to plot the top 30 percent values in a data frame using a seaborn scatter plot as shown below.
The reproducible code for the same plot:
import seaborn as sns
df = sns.load_dataset('iris')
#function to return top 30 percent values in a dataframe.
def extract_top(df):
    n = int(0.3*len(df))
    top = df.sort_values('sepal_length', ascending = False).head(n)
    return top
#storing the top values
top = extract_top(df)
#plotting
sns.scatterplot(data = top,
x='species', y='sepal_length',
color = 'black',
s = 100,
marker = 'x',)
Here, I want to sort the x-axis in the order ['virginica','setosa','versicolor']. When I tried to use order as one of the parameters in sns.scatterplot(), it returned the error AttributeError: 'PathCollection' object has no property 'order'. What is the right way to do it?
Please note: in the dataframe, setosa is also a category in species; however, none of its values fall in the top 30%. Hence, that label is not shown in the example output from the reproducible code at the top. But I want that label on the x-axis as well, in the given order, as shown below:
scatterplot() is not the correct tool for the job. Since you have a categorical axis you want to use stripplot() and not scatterplot(). See the difference between relational and categorical plots here https://seaborn.pydata.org/api.html
sns.stripplot(data = top,
x='species', y='sepal_length',
order = ['virginica','setosa','versicolor'],
color = 'black', jitter=False)
sns.scatterplot() does not take order as one of its arguments. For the species setosa, you can use alpha to hide the scatter points while keeping the tick.
import pandas as pd
import seaborn as sns
df = sns.load_dataset('iris')
#function to return top 30 percent values in a dataframe.
def extract_top(df):
    n = int(0.3*len(df))
    top = df.sort_values('sepal_length', ascending = False).head(n)
    return top
#storing the top values
top = extract_top(df)
#append a copy of the first row as a placeholder 'setosa' point
#(pd.concat returns a new DataFrame, so the result must be reassigned)
top = pd.concat([top, top.iloc[[0]]], ignore_index=True)
top.iloc[-1,-1] = 'setosa'
order = ['virginica','setosa','versicolor']
#plotting
for species in order:
    alpha = 1 if species != 'setosa' else 0
    sns.scatterplot(x="species", y="sepal_length",
                    data=top[top['species']==species],
                    alpha=alpha,
                    marker='x',color='k')
the output is
For those wanting to make use of the extra arguments available in sns.scatterplot over sns.stripplot (size and style mappings for variables), it's possible to set the order of the x-axis simply by sorting the dataframe before passing it to seaborn. The following will sort alphabetically.
df.sort_values(feature)
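If the desired order is not alphabetical, one possible approach is to make the column an ordered categorical before sorting. This is a sketch on a tiny made-up frame; it only demonstrates the sorting step, and how a given seaborn version treats the categorical column when plotting is not something it verifies.

```python
import pandas as pd

# Illustrative rows only; note that 'setosa' does not occur in the data
frame = pd.DataFrame({
    "species": ["versicolor", "virginica", "versicolor"],
    "sepal_length": [7.0, 7.9, 6.9],
})
order = ["virginica", "setosa", "versicolor"]

# An ordered categorical makes sort_values follow the custom order
frame["species"] = pd.Categorical(frame["species"], categories=order, ordered=True)
frame_sorted = frame.sort_values("species")
print(frame_sorted)
```

The unused category 'setosa' is kept in the column's categories even though no row has that value.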

Garbled x-axis labels in matplotlib subplots

I am querying COVID-19 data and building a dataframe of day-over-day changes for one of the data points (positive test results) where each row is a day, each column is a state or territory (there are 56 altogether). I can then generate a chart for every one of the states, but I can't get my x-axis labels (the dates) to behave like I want. There are two problems which I suspect are related. First, there are too many labels -- usually matplotlib tidily reduces the label count for readability, but I think the subplots are confusing it. Second, I would like the labels to read vertically; but this only happens on the last of the plots. (I tried moving the rotation='vertical' inside the for block, to no avail.)
The dates are the same for all the subplots, so -- this part works -- the x-axis labels only need to appear on the bottom row of the subplots. Matplotlib is doing this automatically. But I need fewer of the labels, and for all of them to align vertically. Here is my code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# get current data
all_states = pd.read_json("https://covidtracking.com/api/v1/states/daily.json")
# convert the YYYYMMDD date to a datetime object
all_states[['gooddate']] = all_states[['date']].applymap(lambda s: pd.to_datetime(str(s), format = '%Y%m%d'))
# 'positive' is the cumulative total of COVID-19 test results that are positive
all_states_new_positives = all_states.pivot_table(index = 'gooddate', columns = 'state', values = 'positive', aggfunc='sum')
all_states_new_positives_diff = all_states_new_positives.diff()
fig, axes = plt.subplots(14, 4, figsize = (12,8), sharex = True )
plt.tight_layout
for i , ax in enumerate(axes.ravel()):
    # get the numbers for the last 28 days
    x = all_states_new_positives_diff.iloc[-28 :].index
    y = all_states_new_positives_diff.iloc[-28 : , i]
    ax.set_title(y.name, loc='left', fontsize=12, fontweight=0)
    ax.plot(x,y)
plt.xticks(rotation='vertical')
plt.subplots_adjust(left=0.5, bottom=1, right=1, top=4, wspace=2, hspace=2)
plt.show();
Suggestions:
Increase the height of the figure.
fig, axes = plt.subplots(14, 4, figsize = (12,20), sharex = True)
Rotate all the labels:
fig.autofmt_xdate(rotation=90)
Use tight_layout at the end instead of subplots_adjust:
fig.tight_layout()
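Put together, the suggestions might look like the sketch below. It uses a small 3x2 grid of random data instead of the COVID download so that it runs offline; the weekly tick locator is an extra assumption added here to reduce the number of labels, and is not part of the suggestions above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data: 28 days of random counts for 6 made-up series
dates = pd.date_range("2020-04-01", periods=28, freq="D")
rng = np.random.default_rng(0)

fig, axes = plt.subplots(3, 2, figsize=(8, 6), sharex=True)
for i, ax in enumerate(axes.ravel()):
    ax.plot(dates, rng.integers(0, 100, size=len(dates)))
    ax.set_title(f"series {i}", loc="left", fontsize=10)
    # Assumption: one tick per week is enough for 28 days of data
    ax.xaxis.set_major_locator(mdates.WeekdayLocator(interval=1))
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%m-%d"))

# Rotate the date labels on the bottom row, then let matplotlib set the spacing
fig.autofmt_xdate(rotation=90)
fig.tight_layout()
fig.savefig("subplots_dates.png")
```

Calling autofmt_xdate after all the plotting is done means the tick labels it rotates are the ones that end up in the saved figure.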
