I have just started learning Python and I am using the Titanic data set to practice.
I am not able to create a grouped bar chart; it is giving me this error:
'incompatible sizes: argument 'height' must be length 2 or scalar'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("Titanic/train.csv")
top_five = df.head(5)
print(top_five)
column_no = df.columns
print(column_no)
female_count = len([p for p in df["Sex"] if p == 'female'])
male_count = len([i for i in df["Sex"] if i == 'male'])
have_survived= len([m for m in df["Survived"] if m == 1])
not_survived = len([n for n in df["Survived"] if n == 0])
plt.bar([0],female_count, color ='b')
plt.bar([1],male_count,color = 'y')
plt.xticks([0+0.2,1+0.2],['females','males'])
plt.show()
plt.bar([0],not_survived, color ='r')
plt.bar([1],have_survived, color ='g')
plt.xticks([0+0.2,1+0.2],['not_survived','have_survived'])
plt.show()
It works fine up to here and I get two individual charts.
Instead, I want one chart that displays bars for males and females, with the bars color-coded by survival.
This does not seem to work:
N = 2
index = np.arange(N)
bar_width = 0.35
plt.bar(index, have_survived, bar_width, color ='b')
plt.bar(index + bar_width, not_survived, bar_width,color ='r',)
plt.xticks([0+0.2,1+0.2],['females','males'])
plt.legend()
Thanks in advance!!
How about replacing your second block of code (the one that raises the ValueError) with this:
bar_width = 0.35
tot_people_count = (female_count + male_count) * 1.0
plt.bar(0, female_count, bar_width, color ='b')
plt.bar(1, male_count, bar_width, color ='y',)
plt.bar(0, have_survived/tot_people_count*female_count, bar_width, color='r')
plt.bar(1, have_survived/tot_people_count*male_count, bar_width, color='g')
plt.xticks([0+0.2,1+0.2],['females','males'])
plt.legend(['female deceased', 'male deceased', 'female survivors', 'male survivors'],
loc='best')
I get this bar graph as an output,
The reason for the error you get is that the left and height parameters of plt.bar must either have the same length as each other, or one (or both) of them must be a scalar. (Current Matplotlib releases name the first parameter x rather than left.) That is why changing index in your code to the simple scalars 0 and 1 fixes the error.
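Note that the scaling above multiplies the overall survival rate by each group's size, so it shows the counts you would expect if both sexes survived at the same rate. A minimal sketch of a grouped chart built from the actual per-sex counts follows. The small data frame is a hypothetical stand-in for Titanic/train.csv, and pd.crosstab is a technique swapped in here, not something from the original answer.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for Titanic/train.csv: only the two columns used here.
df = pd.DataFrame({
    "Sex": ["female", "male", "male", "female", "male", "female", "male", "male"],
    "Survived": [1, 0, 0, 1, 1, 0, 0, 1],
})

# Rows are the sexes, columns are the Survived values 0 and 1.
counts = pd.crosstab(df["Sex"], df["Survived"]).reindex(["female", "male"])

index = np.arange(len(counts))  # one group position per sex
bar_width = 0.35

fig, ax = plt.subplots()
# x and height both have length 2 here, so the sizes are compatible.
ax.bar(index - bar_width / 2, counts[1], bar_width, color="g", label="survived")
ax.bar(index + bar_width / 2, counts[0], bar_width, color="r", label="not survived")
ax.set_xticks(index)
ax.set_xticklabels(counts.index)
ax.legend()
```

Call plt.show() afterwards to display the figure. With the real file, replacing the stand-in frame with pd.read_csv("Titanic/train.csv") should be the only change needed, assuming the Sex and Survived columns are coded as in the question.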
I want to create a stacked horizontal bar plot with values of each stack displayed inside it and the total value of the stacks just after the bar. Using python matplotlib, I could create a simple barh. My dataframe looks like below:
import pandas as pd
df = pd.DataFrame({"single":[168,345,345,352],
"comp":[481,44,23,58],})
item = ["white_rice",
"pork_and_salted_vegetables",
"sausage_and_potato_in_tomato_sauce",
"curry_vegetable",]
df.index = item
I expect to get a bar plot like the one below, except that it is not horizontal:
The code I tried is below, and I get AttributeError: 'DataFrame' object has no attribute 'rows'. Please help me with the horizontal bar plot. Thanks.
fig, ax = plt.subplots(figsize=(10,4))
colors = ['c', 'y']
ypos = np.zeros(len(df))
for i, row in enumerate(df.index):
    ax.barh(df.index, df[row], x=ypos, label=row, color=colors[i])
    bottom += np.array(df[row])
totals = df.sum(axis=0)
x_offset = 4
for i, total in enumerate(totals):
    ax.text(totals.index[i], total + x_offset, round(total), ha='center',) # weight='bold')
x_offset = -15
for bar in ax.patches:
    ax.text(
        # Put the text in the middle of each bar. get_x returns the start so we add half the width to get to the middle.
        bar.get_y() + bar.get_height() / 2,
        bar.get_width() + bar.get_x() + x_offset,
        # This is actual value we'll show.
        round(bar.get_width()),
        # Center the labels and style them a bit.
        ha='center',
        color='w',
        weight='bold',
        size=10)
labels = df.index
ax.set_title('Label Distribution Overview')
ax.set_yticklabels(labels, rotation=90)
ax.legend(fancybox=True)
Consider the following approach to get something similar with matplotlib only (I use matplotlib 3.5.0). Basically the job is done with bar/barh and bar_label combination. You may change label_type and add padding to tweak plot appearance. Also you may use fmt to format values. Edited code with total values added.
import matplotlib.pyplot as plt
import pandas as pd
import random
def main(data):
    data['total'] = data['male'] + data['female']
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.suptitle('Plot title')
    # vertical stacked bars: 'male' is drawn on top of 'female'
    ax1.bar(x=data['year'].astype(str), height=data['female'], label='female')
    ax1.bar_label(ax1.containers[0], label_type='center')
    ax1.bar(x=data['year'].astype(str), height=data['male'], bottom=data['female'], label='male')
    ax1.bar_label(ax1.containers[1], label_type='center')
    ax1.bar_label(ax1.containers[1], labels=data['total'], label_type='edge')
    ax1.legend()
    # horizontal stacked bars: 'male' starts where 'female' ends
    ax2.barh(y=data['year'].astype(str), width=data['female'], label='female')
    ax2.bar_label(ax2.containers[0], label_type='center')
    ax2.barh(y=data['year'].astype(str), width=data['male'], left=data['female'], label='male')
    ax2.bar_label(ax2.containers[1], label_type='center')
    ax2.bar_label(ax2.containers[1], labels=data['total'], label_type='edge')
    ax2.legend()
    plt.show()

if __name__ == '__main__':
    N = 4
    main(pd.DataFrame({
        'year': [2010 + val for val in range(N)],
        'female': [int(10 + 100 * random.random()) for dummy in range(N)],
        'male': [int(10 + 100 * random.random()) for dummy in range(N)]}))
Result (with total values added):
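Applied to the data frame from the question (columns single and comp), the same barh and bar_label combination might look like the sketch below. It assumes Matplotlib 3.4 or newer, where bar_label is available. The running-total array `left` and the white bold label styling are choices made here for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {"single": [168, 345, 345, 352], "comp": [481, 44, 23, 58]},
    index=["white_rice",
           "pork_and_salted_vegetables",
           "sausage_and_potato_in_tomato_sauce",
           "curry_vegetable"],
)

fig, ax = plt.subplots(figsize=(10, 4))
left = np.zeros(len(df))  # where the next segment of each bar starts
for column in df.columns:
    bars = ax.barh(df.index, df[column], left=left, label=column)
    # value of each segment, printed in the middle of the segment
    ax.bar_label(bars, label_type="center", color="w", weight="bold")
    left += df[column].to_numpy()

# after the loop, `left` holds the row totals and `bars` is the outermost segment
ax.bar_label(bars, labels=[str(int(t)) for t in left], label_type="edge", padding=4)
ax.set_title("Label Distribution Overview")
ax.legend()
fig.tight_layout()
```

Because every segment starts at the running total of the columns before it, more columns can be added to the frame without changing the loop.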
I was trying to reproduce this plot with Matplotlib:
Looking at the documentation, I found that the closest thing is a grouped bar chart. The problem is that I have a different number of "bars" for each category (subject, illumination, ...), whereas the example provided by matplotlib has only 2 classes (M, F) for each category (G1, G2, G3, ...). I don't know exactly where to start. Does anyone here have a clue? I think that in this case the trick they used to specify the bar locations:
x = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, men_means, width, label='Men')
rects2 = ax.bar(x + width/2, women_means, width, label='Women')
does not work at all as in the second class (for example) there is a different number of bars. It would be awesome if anyone could give me an idea. Thank you in advance!
Supposing the data resides in a dataframe, the bars can be generated by looping through the categories:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# first create some test data, similar in structure to the question's
categories = ['Subject', 'Illumination', 'Location', 'Daytime']
rows = []
for cat in categories:
    for _ in range(np.random.randint(2, 7)):
        rows.append({'Category': cat,
                     'Class': "".join(np.random.choice([*'tuvwxyz'], 10)),
                     'Value': np.random.uniform(10, 17)})
# DataFrame.append was removed in pandas 2.0, so the frame is built from a list of rows
df = pd.DataFrame(rows, columns=['Category', 'Class', 'Value'])
fig, ax = plt.subplots()
start = 0 # position for first label
gap = 1 # gap between labels
labels = [] # list for all the labels
label_pos = np.array([]) # list for all the label positions
# loop through the categories of the dataframe
# provide a list of colors (at least as long as the expected number of categories)
for (cat, df_cat), color in zip(df.groupby('Category', sort=False), ['navy', 'orange'] * len(df)):
    num_in_cat = len(df_cat)
    # add a text for the category, using "axes coordinates" for the y-axis;
    # the bars sit at start .. start + num_in_cat - 1, so this x is the center of the group
    ax.text(start + (num_in_cat - 1) / 2, 0.95, cat, ha='center', va='top', transform=ax.get_xaxis_transform())
    # positions for the labels of the current category
    this_label_pos = np.arange(start, start + num_in_cat)
    # create bars at the desired positions
    ax.bar(this_label_pos, df_cat['Value'], color=color)
    # store labels and their positions
    labels += df_cat['Class'].to_list()
    label_pos = np.append(label_pos, this_label_pos)
    start += num_in_cat + gap
# set the positions for the labels
ax.set_xticks(label_pos)
# set the labels
ax.set_xticklabels(labels, rotation=30)
# optionally set a new lower position for the y-axis
ax.set_ylim(ymin=9)
# optionally reduce the margin left and right
ax.margins(x=0.01)
plt.tight_layout()
plt.show()
Now let's assume I have a data file example.csv:
first,second,third,fourth,fifth,sixth
-42,11,3,L,D
4,21,40,L,Q
2,31,15,R,D
-42,122,50,S,L
print(df.head()) of the above is:
   first  second  third fourth fifth  sixth
0    -42      11      3      L     D    NaN
1      4      21     40      L     Q    NaN
2      2      31     15      R     D    NaN
3    -42     122     50      S     L    NaN
I want to draw the bar plot as a group, where the first and second columns will work as an index. Their different numbers will have different colors.
What I'm expecting is below which I have started working on.
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
filename = 'example.csv'
df = pd.read_csv(filename)
print(df.head())
first = df['first']
second = df['second']
third = df['third']
labels = df['third']
x = np.arange(len(labels))
width = 0.35
df.sort_values(by=['third'], axis=0, ascending=False)
fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, third, width, label='Parent 1')
rects2 = ax.bar(x + width/2, third, width, label='Parent 2')
ax.set_ylabel('Scores')
ax.set_title('Scores by group and gender')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3), # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')
autolabel(rects1)
autolabel(rects2)
fig.tight_layout()
From the bar plot, it is clear that the Y value comes from the column called "third", which is exactly what we are getting. But the labeling of the groups needs some modifications. I have drawn on the figure so you can see what I'm expecting.
The number on top of each bar should determine the bar's color. For example, in the first pair of bars we have the numbers (-42, 11), so we need to assign two different colors. If the same number reappears on another bar, it gets the same color again. That means each number has a unique bar color. The complete list of bar colors can be shown as a legend in the top left instead of what we have right now.
Another identifier goes at the bottom of the bars. For example, we have (L, D) under the first pair, which represent the fourth and fifth columns of the data file.
I wanted to draw the bars in descending order of the third column. I applied the command to sort the column in descending order, but it does not seem to have any effect on the plot.
df.sort_values(by=['third'], axis=0, ascending=False)
There are too many customizations here, so I think it's easier to loop through the rows and plot each bar individually. Also, sort_values returns a sorted copy by default; passing inplace=True makes it modify the dataframe in place:
# sort dataframe, notice `inplace`
df.sort_values(by=['third'], axis=0, ascending=False, inplace=True)
# we use this to change the colors with `cmap`
values = np.unique(df[['first','second']])
# scale the values to 0-1 for cmap
def scaled_value(val):
    return (val-values.min())/np.ptp(values)
# plt.get_cmap is used because cm.get_cmap is no longer available in recent Matplotlib releases
cmap = plt.get_cmap('viridis')
width = 0.35
fig, ax = plt.subplots()
for i, idx in enumerate(df.index):
    row = df.loc[idx]
    # draw the first
    ax.bar(i-width/2, row['third'],
           color=cmap(scaled_value(row['first'])), # specify color here
           width=width, edgecolor='w',
           label='Parent 1' if i==0 else None) # label first bar
    # draw the second
    ax.bar(i+width/2, row['third'],
           color=cmap(scaled_value(row['second'])),
           width=width, edgecolor='w', hatch='//',
           label='Parent 2' if i==0 else None)
# set the ticks manually
ax.set_xticks([i + o for i in range(df.shape[0]) for o in [-width/2, width/2]])
ax.set_xticklabels(df[['fourth','fifth']].values.ravel())
ax.legend()
Output:
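On the sorting point raised in this answer: the difference between discarding and keeping the result of sort_values can be checked on a few rows. This is a small sketch using values taken from the question's third and fourth columns. Reassigning the result, as done here, is an alternative to inplace=True, and the reset_index call is an optional extra added for illustration.

```python
import pandas as pd

df = pd.DataFrame({"third": [3, 40, 15, 50], "fourth": list("LLRS")})

# The sorted copy is thrown away, so df keeps its original order.
df.sort_values(by="third", ascending=False)
print(df["third"].tolist())  # [3, 40, 15, 50]

# Keeping the returned copy is equivalent to passing inplace=True.
df = df.sort_values(by="third", ascending=False)
print(df["third"].tolist())  # [50, 40, 15, 3]

# Optional: renumber the rows 0..n-1 so index labels match plot positions.
df = df.reset_index(drop=True)
```

Any Series pulled out of the frame before the sort (such as third = df['third'] in the question) should be read again afterwards so that the plotted values follow the new order.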
I believe you first need to work on the right data structure in the dataframe. I believe you want the following:
df['xaxis'] = df.fourth + ":" + df.fifth
df.groupby('xaxis').agg({'third':'sum','first':'sum'}).plot(kind='bar')
The aggregated frame is:
       third  first
xaxis
L:D        3    -42
L:Q       40      4
R:D       15      2
S:L       50    -42
which plots as:
I really don't understand what's going wrong with this... I've looked through what is pretty simple data several times and have restarted the kernel (running on Jupyter Notebook) and nothing seems to be solving it.
Here's the data frame I have (sorry I know the numbers look a bit silly, this is a really sparse dataset over a long time period, original is reindexed for 20 years):
DATE        NODP        NVP         VP          VDP
03/08/2002  0.083623    0.10400659  0.81235517  1.52458E-05
14/09/2003  0.24669167  0.24806379  0.5052293   1.52458E-05
26/07/2005  0.15553726  0.13324796  0.7111538   0.000060983
20/05/2006  0           0.23        0.315       0.455
05/06/2007  0.21280034  0.29139224  0.49579217  1.52458E-05
21/02/2010  0           0.55502195  0.4449628   1.52458E-05
09/04/2011  0.09531311  0.17514162  0.72954527  0
14/02/2012  0.19213217  0.12866237  0.67920546  0
17/01/2014  0.12438848  0.10297326  0.77263826  0
24/02/2017  0.01541347  0.09897548  0.88561105  0
Note that all of the rows add up to 1! I have triple, quadruple checked this...XD
I am trying to produce a stacked bar chart of this data, with the following code, which seems to have worked perfectly for everything else I have been using it for:
NODP = df['NODP']
NVP = df['NVP']
VDP = df['VDP']
VP = df['VP']
ind = np.arange(len(df.index))
width = 5.0
p1 = plt.bar(ind, NODP, width, label = 'NODP', bottom=NVP, color= 'grey')
p2 = plt.bar(ind, NVP, width, label = 'NVP', bottom=VDP, color= 'tan')
p3 = plt.bar(ind, VDP, width, label = 'VDP', bottom=VP, color= 'darkorange')
p4 = plt.bar(ind, VP, width, label = 'VP', color= 'darkgreen')
plt.ylabel('Ratio')
plt.xlabel('Year')
plt.title('Ratio change',x=0.06,y=0.8)
plt.xticks(np.arange(min(ind), max(ind)+1, 6.0), labels=xlabels) #the xticks were cumbersome so not included in this example code
plt.legend()
Which gives me the following plot:
As is evident, 1) NODP is not showing up at all, and 2) the remainder of them are being plotted with the wrong proportions...
I really don't understand what is wrong, it should be really simple, right?! I'm sorry if it is really simple, it's probably right under my nose. Any ideas greatly appreciated!
If you want to create stacked bars this way (so standard matplotlib without using pandas or seaborn for plotting), the bottom needs to be the sum of all the lower bars.
Here is an example with the given data.
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
columns = ['DATE', 'NODP', 'NVP', 'VP', 'VDP']
data = [['03/08/2002', 0.083623, 0.10400659, 0.81235517, 1.52458E-05],
['14/09/2003', 0.24669167, 0.24806379, 0.5052293, 1.52458E-05],
['26/07/2005', 0.15553726, 0.13324796, 0.7111538, 0.000060983],
['20/05/2006', 0, 0.23, 0.315, 0.455],
['05/06/2007', 0.21280034, 0.29139224, 0.49579217, 1.52458E-05],
['21/02/2010', 0, 0.55502195, 0.4449628, 1.52458E-05],
['09/04/2011', 0.09531311, 0.17514162, 0.72954527, 0],
['14/02/2012', 0.19213217, 0.12866237, 0.67920546, 0],
['17/01/2014', 0.12438848, 0.10297326, 0.77263826, 0],
['24/02/2017', 0.01541347, 0.09897548, 0.88561105, 0]]
df = pd.DataFrame(data=data, columns=columns)
# the dates are written day first (e.g. 14/09/2003), so tell pandas explicitly
ind = pd.to_datetime(df.DATE, dayfirst=True)
NODP = df.NODP.to_numpy()
NVP = df.NVP.to_numpy()
VP = df.VP.to_numpy()
VDP = df.VDP.to_numpy()
width = 120
p1 = plt.bar(ind, NODP, width, label='NODP', bottom=NVP+VDP+VP, color='grey')
p2 = plt.bar(ind, NVP, width, label='NVP', bottom=VDP+VP, color='tan')
p3 = plt.bar(ind, VDP, width, label='VDP', bottom=VP, color='darkorange')
p4 = plt.bar(ind, VP, width, label='VP', color='darkgreen')
plt.ylabel('Ratio')
plt.xlabel('Year')
plt.title('Ratio change')
plt.yticks(np.arange(0, 1.001, 0.1))
plt.legend()
plt.show()
Note that in this case the x-axis is measured in days, and each bar is located at its date. This shows the relative spacing between the dates, in case that is important. If it isn't, the x-positions can be chosen equidistant and labeled via the dates column.
To do so with standard matplotlib, the following lines would change:
ind = range(len(df))
width = 0.8
plt.xticks(ind, df.DATE, rotation=20)
plt.tight_layout() # needed to show the full labels of the x-axis
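The rule that each bottom must be the sum of all the lower bars can also be written as a loop with a running total, which avoids spelling out sums such as NVP+VDP+VP by hand. The sketch below uses three rows from the question's data and equidistant x-positions. The variable names and the choice of rows are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Three rows copied from the question's table.
df = pd.DataFrame(
    {"NODP": [0.083623, 0.0, 0.19213217],
     "NVP": [0.10400659, 0.23, 0.12866237],
     "VP": [0.81235517, 0.315, 0.67920546],
     "VDP": [1.52458e-05, 0.455, 0.0]},
    index=["03/08/2002", "20/05/2006", "14/02/2012"],
)

fig, ax = plt.subplots()
x = np.arange(len(df))
bottom = np.zeros(len(df))  # running sum of everything already drawn

# Draw from the lowest layer to the highest.
for column in ["VP", "VDP", "NVP", "NODP"]:
    ax.bar(x, df[column], width=0.8, bottom=bottom, label=column)
    bottom += df[column].to_numpy()

ax.set_xticks(x)
ax.set_xticklabels(df.index, rotation=20)
ax.set_ylabel("Ratio")
ax.legend()
fig.tight_layout()
```

After the loop, bottom holds the total height of each stack, which for this data should be very close to 1 for every date.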
Plot the dataframe
# using your data above
# the dates are written day first (e.g. 14/09/2003)
df.DATE = pd.to_datetime(df.DATE, dayfirst=True)
df.set_index('DATE', inplace=True)
ax = df.plot(stacked=True, kind='bar', figsize=(12, 8))
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)
# sets the tick labels so time isn't included
ax.xaxis.set_major_formatter(plt.FixedFormatter(df.index.to_series().dt.strftime("%Y-%m-%d")))
plt.show()
Add labels for clarity
By adding the following code before plt.show() you can add text annotations to the bars
# .patches is everything inside of the chart
for rect in ax.patches:
    # Find where everything is located
    height = rect.get_height()
    width = rect.get_width()
    x = rect.get_x()
    y = rect.get_y()
    # For vertical bars the height of the rectangle is the data value, so it is used as the label
    label_text = f'{height:.2f}'  # two decimal places
    label_x = x + width - 0.125
    label_y = y + height / 2
    # don't include label if it's equivalently 0
    if height > 0.001:
        ax.text(label_x, label_y, label_text, ha='right', va='center', fontsize=8)
plt.show()
I have this image from Matplotlib :
I would like to write for each category (cat i with i in [1-10] in the figure) the highest value and its corresponding legend on the graphic.
Below you can find visually what I would like to achieve :
The problem is that I don't know whether this is possible, given the way matplotlib draws the bars.
Basically, this is the part of the code for drawing multiple bars :
# create plot
fig, ax = plt.subplots(figsize = (9,9))
index = np.arange(len_category)
if multiple:
    bar_width = 0.3
else:
    bar_width = 1.5
opacity = 1.0
#test_array contains test1 and test2
cmap = get_cmap(len(test_array))
for i in range(len(test_array)):
    count = count + 1
    current_label = test_array[i]
    rects = plt.bar(index-0.2+bar_width*i, score_array[i], bar_width, alpha=opacity, color=np.random.rand(3,1),label=current_label )
plt.xlabel('Categories')
plt.ylabel('Scores')
plt.title('Scores by Categories')
plt.xticks(index + bar_width, categories_array)
plt.legend()
plt.tight_layout()
plt.show()
and this is the part I have added in order to do what I would like to achieve. But it searches the max across all the bars in the graphics. For example, the max of test1 will be in cat10 and the max of test2 will be cat2. Instead, I would like to have the max for each category.
for i in range(len(test_array)):
    count = count + 1
    current_label = test_array[i]
    rects = plt.bar(index-0.2+bar_width*i, score_array[i], bar_width,alpha=opacity,color=np.random.rand(3,1),label=current_label )
    max_score_current = max(score_array[i])
    list_rect = list()
    max_height = 0
    #The id of the rectangle who get the highest score
    max_idx = 0
    for idx,rect in enumerate(rects):
        list_rect.append(rect)
        height = rect.get_height()
        if height > max_height:
            max_height = height
            max_idx = idx
    highest_rect = list_rect[max_idx]
    plt.text(highest_rect.get_x() + highest_rect.get_width()/2.0, max_height, str(test_array[i]),color='blue', fontweight='bold')
    del list_rect[:]
Do you have an idea about how I can achieve that ?
Thank you
It is usually better to keep data generation and visualization separate. Instead of looping through the bars themselves, just compute the necessary data before plotting. This makes everything much simpler.
So first create a list of labels to use and then loop over the positions to annotate them. In the code below the labels are created by mapping the argmax of each row of the score array to the test name via a dictionary.
import numpy as np
import matplotlib.pyplot as plt
test1 = [6,4,5,8,3]
test2 = [4,5,3,4,6]
labeldic = {0:"test1", 1:"test2"}
a = np.c_[test1,test2]
maxi = np.max(a, axis=1)
l = ["{} {}".format(i,labeldic[j]) for i,j in zip(maxi, np.argmax(a, axis=1))]
for i in range(a.shape[1]):
    plt.bar(np.arange(a.shape[0])+(i-1)*0.3, a[:,i], width=0.3, align="edge",
            label = labeldic[i])
for i in range(a.shape[0]):
    plt.annotate(l[i], xy=(i,maxi[i]), xytext=(0,10),
                 textcoords="offset points", ha="center")
plt.margins(y=0.2)
plt.legend()
plt.show()
From your question it is not entirely clear what you want to achieve, but assuming that you want the relative height of each bar in one group printed above that bar, here is one way to achieve that:
from matplotlib import pyplot as plt
import numpy as np
score_array = np.random.rand(2,10)
index = np.arange(score_array.shape[1])
test_array=['test1','test2']
opacity = 1
bar_width = 0.25
for i,label in enumerate(test_array):
    rects = plt.bar(index-0.2+bar_width*i, score_array[i], bar_width,alpha=opacity,label=label)
    heights = [r.get_height() for r in rects]
    print(heights)
    rel_heights = [h/max(heights) for h in heights]
    idx = heights.index(max(heights))
    for r, h, rh in zip(rects, heights, rel_heights):
        plt.text(r.get_x() + r.get_width()/2.0, h, '{:.2}'.format(rh), color='b', fontweight ='bold', ha='center')
plt.show()
The result looks like this: