I'm working on a personal experimentation project. I have the following dataframes:
treat_repr = pd.DataFrame({'kpi': ['cpsink', 'hpu', 'mpu', 'revenue', 'wallet']
,'diff_pct': [0.655280, 0.127299, 0.229958, 0.613308, -0.718421]
,'me_pct': [1.206313, 0.182875, 0.170821, 1.336590, 2.229763]
,'p': [0.287025, 0.172464, 0.008328, 0.368466, 0.527718]
,'significance': ['insignificant', 'insignificant', 'significant', 'insignificant', 'insignificant']})
pre_treat_repr = pd.DataFrame({'kpi': ['cpsink', 'hpu', 'mpu', 'revenue', 'wallet']
,'diff_pct': [0.137174, 0.111005, 0.169490, -0.152929, -0.450667]
,'me_pct': [1.419080, 0.207081, 0.202014, 1.494588, 1.901672]
,'p': [0.849734, 0.293427, 0.100091, 0.841053, 0.642303]
,'significance': ['insignificant', 'insignificant', 'insignificant', 'insignificant', 'insignificant']})
I used the code below to construct an errorbar plot, which works fine:
def confint_plot(df):
    plt.style.use('fivethirtyeight')
    fig, ax = plt.subplots(figsize=(18, 10))
    plt.errorbar(df[df['significance'] == 'significant']["diff_pct"], df[df['significance'] == 'significant']["kpi"], xerr = df[df['significance'] == 'significant']["me_pct"], color = '#d62828', fmt = 'o', capsize = 10)
    plt.errorbar(df[df['significance'] == 'insignificant']["diff_pct"], df[df['significance'] == 'insignificant']["kpi"], xerr = df[df['significance'] == 'insignificant']["me_pct"], color = '#2a9d8f', fmt = 'o', capsize = 10)
    plt.legend(['significant', 'insignificant'], loc = 'best')
    ax.axvline(0, c='red', alpha=0.5, linewidth=3.0,
               linestyle = '--', ymin=0.0, ymax=1)
    plt.title("Confidence Intervals of Continous Metrics", size=14, weight='bold')
    plt.xlabel("% Difference of Control over Treatment", size=12)
    plt.show()
for which the output of confint_plot(treat_repr) looks like this:
Now if I run the same plot function on a pre-treatment dataframe confint_plot(pre_treat_repr), the plot looks like this:
Comparing the two plots, the order of the KPIs on the y-axis changes from the first plot to the second, depending on whether a KPI is significant (that is what I worked out after many attempts).
Questions:
How do I make a change to the code to dynamically allocate color maps without changing the order of the kpis on y axis?
Currently I have manually typed in the legends. Is there a way to dynamically populate legends?
Appreciate the help!
Because you plot the significant KPIs first, they take the first positions on the categorical y-axis and therefore always appear at the bottom of the chart. How you solve this while keeping the desired colors depends on the kind of chart you are making with matplotlib. With scatter charts, you can pass an array of colors in the c parameter. Error bar charts do not offer that functionality.
One way to work around that is to sort your KPIs, give them numeric positions (0, 1, 2, 3, ...), plot them in two passes (once for the significant ones, once for the insignificant ones) and re-tick the axis with the KPI names:
def confint_plot(df):
    plt.style.use('fivethirtyeight')
    fig, ax = plt.subplots(figsize=(18, 10))
    # Sort the KPIs alphabetically. You can change the order to anything
    # that fits your purpose
    df_plot = df.sort_values('kpi').assign(y=range(len(df)))
    for significance in ['significant', 'insignificant']:
        cond = df_plot['significance'] == significance
        color = '#d62828' if significance == 'significant' else '#2a9d8f'
        # Plot them in their numeric positions first
        plt.errorbar(
            df_plot.loc[cond, 'diff_pct'], df_plot.loc[cond, 'y'],
            xerr=df_plot.loc[cond, 'me_pct'], label=significance,
            fmt='o', capsize=10, c=color
        )
    plt.legend(loc='best')
    ax.axvline(0, c='red', alpha=0.5, linewidth=3.0,
               linestyle = '--', ymin=0.0, ymax=1)
    # Re-tick to show the KPIs
    plt.yticks(df_plot['y'], df_plot['kpi'])
    plt.title("Confidence Intervals of Continous Metrics", size=14, weight='bold')
    plt.xlabel("% Difference of Control over Treatment", size=12)
    plt.show()
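The same idea can be stretched to cover the second question (populating colors and legend entries dynamically). The following is only a sketch under a few assumptions: the helper name confint_plot_by_group, the group_col parameter, the smaller figure size and the choice of the "tab10" colormap are all made up for illustration, and the Agg backend is selected only so the snippet runs without a display. Each distinct value in the grouping column gets its own color and its own label, so the legend is built from the data rather than typed by hand.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def confint_plot_by_group(df, group_col="significance"):
    """Hypothetical helper: one errorbar series per distinct value of group_col."""
    fig, ax = plt.subplots(figsize=(9, 5))
    # Fix the y positions once, independently of the grouping
    df_plot = df.sort_values("kpi").assign(y=range(len(df)))
    groups = sorted(df_plot[group_col].unique())
    cmap = plt.get_cmap("tab10")  # assumption: any qualitative colormap would do
    for i, group in enumerate(groups):
        part = df_plot[df_plot[group_col] == group]
        ax.errorbar(part["diff_pct"], part["y"], xerr=part["me_pct"],
                    fmt="o", capsize=10, color=cmap(i), label=group)
    ax.axvline(0, c="red", alpha=0.5, linestyle="--")
    # Re-tick with the KPI names
    ax.set_yticks(list(df_plot["y"]))
    ax.set_yticklabels(list(df_plot["kpi"]))
    ax.legend(loc="best")  # entries come from the label= arguments above
    return fig, ax


# The treatment dataframe from the question
treat_repr = pd.DataFrame({
    "kpi": ["cpsink", "hpu", "mpu", "revenue", "wallet"],
    "diff_pct": [0.655280, 0.127299, 0.229958, 0.613308, -0.718421],
    "me_pct": [1.206313, 0.182875, 0.170821, 1.336590, 2.229763],
    "significance": ["insignificant", "insignificant", "significant",
                     "insignificant", "insignificant"],
})
fig, ax = confint_plot_by_group(treat_repr)
```

Because the y positions are assigned before the data is split into groups, the KPI order stays the same no matter which rows fall into which group.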
I have the electricity consumption of 25 houses, and I'm doing K-Means clustering on the dataset that holds those houses. After importing the dataset, pre-processing it, and applying K-Means with K=2, I plotted the data, but when I add the legend I get this:
No handles with labels found to put in legend.
The code runs without an error, but I want it to generate the legend automatically, holding the ID of each house from 0 to 24.
Here is the code where I plot the data:
plt.figure(figsize=(13,13))
import itertools
marker = itertools.cycle(('+', 'o', '*' , 'X', 's','8','>','1','<'))
for cluster_index in [0,1]:
    plt.subplot(2,1,cluster_index + 1)
    for index, row in data1.iterrows():
        if row.iloc[-1] == cluster_index:
            plt.plot(row.iloc[1:-1] ,marker = next(marker) , alpha=1)
            plt.legend(loc="right")
    plt.plot(kmeans.cluster_centers_[cluster_index], color='k' ,marker='o', alpha=1)
    ax = plt.gca()
    ax.tick_params(axis = 'x', which = 'major', labelsize = 10)
    plt.xticks(rotation="vertical")
    plt.ylabel('Monthly Mean Consumption 2018-2019', fontsize=10)
    plt.title(f'Cluster {cluster_index}', fontsize=15)
    plt.tight_layout()
plt.show()
plt.close()
I just want the legend in the output figure to show the ID of each house. Any help is appreciated.
As I do not have your data, I cannot test it in a plot right now, but I assume the problem comes from not passing a label argument to plt.plot, i.e.:
for index, row in data1.iterrows():
    if row.iloc[-1] == cluster_index:
        plt.plot(row.iloc[1:-1] ,marker = next(marker), alpha=1, label=index)
# calling legend once, after all lines carry a label, is enough
plt.legend(loc="right")
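Since the original data is not available, here is a small self-contained sketch of the label approach on made-up data. Everything about the data is an assumption: the six synthetic houses, the column names ("house", "m01" ... "m12", "cluster") and the hard-coded cluster assignment stand in for data1 and the K-Means result. The layout mirrors the question: the first column is skipped and the last column holds the cluster index.

```python
import itertools

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
months = [f"m{i:02d}" for i in range(1, 13)]
# Synthetic stand-in for data1: 6 houses, 12 monthly values, cluster id in the last column
data1 = pd.DataFrame(rng.random((6, 12)), columns=months)
data1.insert(0, "house", range(6))      # first column, skipped by iloc[1:-1]
data1["cluster"] = [0, 1, 0, 1, 0, 1]   # made-up cluster assignment

markers = itertools.cycle(("+", "o", "*", "X", "s"))
fig, axes = plt.subplots(2, 1, figsize=(8, 8))
for cluster_index, ax in enumerate(axes):
    for index, row in data1.iterrows():
        if row.iloc[-1] == cluster_index:
            # label= is what gives the legend something to show
            ax.plot(row.iloc[1:-1].astype(float), marker=next(markers),
                    label=f"House {index}")
    ax.legend(loc="right")  # one call per subplot, after all lines are drawn
    ax.set_title(f"Cluster {cluster_index}")
fig.tight_layout()
```

With 25 houses the legend gets long, so passing something like ncol=2 to legend may be worth trying.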
I really don't understand what's going wrong with this... I've looked through what is pretty simple data several times and have restarted the kernel (running on Jupyter Notebook) and nothing seems to be solving it.
Here's the data frame I have (sorry I know the numbers look a bit silly, this is a really sparse dataset over a long time period, original is reindexed for 20 years):
DATE NODP NVP VP VDP
03/08/2002 0.083623 0.10400659 0.81235517 1.52458E-05
14/09/2003 0.24669167 0.24806379 0.5052293 1.52458E-05
26/07/2005 0.15553726 0.13324796 0.7111538 0.000060983
20/05/2006 0 0.23 0.315 0.455
05/06/2007 0.21280034 0.29139224 0.49579217 1.52458E-05
21/02/2010 0 0.55502195 0.4449628 1.52458E-05
09/04/2011 0.09531311 0.17514162 0.72954527 0
14/02/2012 0.19213217 0.12866237 0.67920546 0
17/01/2014 0.12438848 0.10297326 0.77263826 0
24/02/2017 0.01541347 0.09897548 0.88561105 0
Note that all of the rows add up to 1! I have triple, quadruple checked this...XD
I am trying to produce a stacked bar chart of this data, with the following code, which seems to have worked perfectly for everything else I have been using it for:
NODP = df['NODP']
NVP = df['NVP']
VDP = df['VDP']
VP = df['VP']
ind = np.arange(len(df.index))
width = 5.0
p1 = plt.bar(ind, NODP, width, label = 'NODP', bottom=NVP, color= 'grey')
p2 = plt.bar(ind, NVP, width, label = 'NVP', bottom=VDP, color= 'tan')
p3 = plt.bar(ind, VDP, width, label = 'VDP', bottom=VP, color= 'darkorange')
p4 = plt.bar(ind, VP, width, label = 'VP', color= 'darkgreen')
plt.ylabel('Ratio')
plt.xlabel('Year')
plt.title('Ratio change',x=0.06,y=0.8)
plt.xticks(np.arange(min(ind), max(ind)+1, 6.0), labels=xlabels) #the xticks were cumbersome so not included in this example code
plt.legend()
Which gives me the following plot:
As is evident, 1) NODP is not showing up at all, and 2) the remainder of them are being plotted with the wrong proportions...
I really don't understand what is wrong, it should be really simple, right?! I'm sorry if it is really simple, it's probably right under my nose. Any ideas greatly appreciated!
If you want to create stacked bars this way (so standard matplotlib without using pandas or seaborn for plotting), the bottom needs to be the sum of all the lower bars.
Here is an example with the given data.
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
columns = ['DATE', 'NODP', 'NVP', 'VP', 'VDP']
data = [['03/08/2002', 0.083623, 0.10400659, 0.81235517, 1.52458E-05],
['14/09/2003', 0.24669167, 0.24806379, 0.5052293, 1.52458E-05],
['26/07/2005', 0.15553726, 0.13324796, 0.7111538, 0.000060983],
['20/05/2006', 0, 0.23, 0.315, 0.455],
['05/06/2007', 0.21280034, 0.29139224, 0.49579217, 1.52458E-05],
['21/02/2010', 0, 0.55502195, 0.4449628, 1.52458E-05],
['09/04/2011', 0.09531311, 0.17514162, 0.72954527, 0],
['14/02/2012', 0.19213217, 0.12866237, 0.67920546, 0],
['17/01/2014', 0.12438848, 0.10297326, 0.77263826, 0],
['24/02/2017', 0.01541347, 0.09897548, 0.88561105, 0]]
df = pd.DataFrame(data=data, columns=columns)
ind = pd.to_datetime(df.DATE, format='%d/%m/%Y')  # the dates are day/month/year
NODP = df.NODP.to_numpy()
NVP = df.NVP.to_numpy()
VP = df.VP.to_numpy()
VDP = df.VDP.to_numpy()
width = 120
p1 = plt.bar(ind, NODP, width, label='NODP', bottom=NVP+VDP+VP, color='grey')
p2 = plt.bar(ind, NVP, width, label='NVP', bottom=VDP+VP, color='tan')
p3 = plt.bar(ind, VDP, width, label='VDP', bottom=VP, color='darkorange')
p4 = plt.bar(ind, VP, width, label='VP', color='darkgreen')
plt.ylabel('Ratio')
plt.xlabel('Year')
plt.title('Ratio change')
plt.yticks(np.arange(0, 1.001, 0.1))
plt.legend()
plt.show()
Note that in this case the x-axis is measured in days, and each bar is located at its date. This shows the relative spacing between the dates, in case that is important. If it isn't, the x-positions can be chosen equidistant and labeled via the dates column.
To do so with standard matplotlib, the following lines would change:
ind = range(len(df))
width = 0.8
plt.xticks(ind, df.DATE, rotation=20)
plt.tight_layout() # needed to show the full labels of the x-axis
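The rule that each bottom is the sum of all the bars below it can also be written as a loop with a running total, which avoids spelling out NVP+VDP+VP by hand. This is only a sketch: it uses three of the rows from the question, equidistant x-positions, and a colors dictionary (an illustrative name) whose insertion order defines the stacking order from bottom to top.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Three rows taken from the data in the question
df = pd.DataFrame({
    "DATE": ["03/08/2002", "14/09/2003", "20/05/2006"],
    "NODP": [0.083623, 0.24669167, 0.0],
    "NVP": [0.10400659, 0.24806379, 0.23],
    "VP": [0.81235517, 0.5052293, 0.315],
    "VDP": [1.52458e-05, 1.52458e-05, 0.455],
})
# Stacking order, bottom to top, with the colors used above
colors = {"VP": "darkgreen", "VDP": "darkorange", "NVP": "tan", "NODP": "grey"}

fig, ax = plt.subplots()
x = np.arange(len(df))
bottom = np.zeros(len(df))  # running sum of everything already stacked
for name, color in colors.items():
    values = df[name].to_numpy()
    ax.bar(x, values, 0.8, bottom=bottom, label=name, color=color)
    bottom += values  # the next series starts where this one ends
ax.set_xticks(x)
ax.set_xticklabels(df["DATE"], rotation=20)
ax.legend()
fig.tight_layout()
```

After the loop, bottom holds the total height of each stacked bar, which is a quick way to check that every row really adds up to 1.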
Plot the dataframe
# using your data above
df.DATE = pd.to_datetime(df.DATE, format='%d/%m/%Y')  # the dates are day/month/year
df.set_index('DATE', inplace=True)
ax = df.plot(stacked=True, kind='bar', figsize=(12, 8))
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)
# sets the tick labels so time isn't included
ax.xaxis.set_major_formatter(plt.FixedFormatter(df.index.to_series().dt.strftime("%Y-%m-%d")))
plt.show()
Add labels for clarity
By adding the following code before plt.show(), you can add text annotations to the bars:
# .patches holds every bar rectangle in the chart
for rect in ax.patches:
    # Find where everything is located
    height = rect.get_height()
    width = rect.get_width()
    x = rect.get_x()
    y = rect.get_y()
    # The height of the bar is the data value and can be used as the label
    label_text = f'{height:.2f}'  # two decimal places
    label_x = x + width - 0.125
    label_y = y + height / 2
    # don't include the label if the value is effectively 0
    if height > 0.001:
        ax.text(label_x, label_y, label_text, ha='right', va='center', fontsize=8)
plt.show()
I am creating a figure with a title, 16 subplots, and a legend. I cannot for the life of me get it to save nicely. I am going to try my best to explain my predicament, but my vocabulary may not be correct, so I apologize in advance.
If I run my code (at the end) I receive the following output:
That is not pretty; everything is overlapping or cut off. If I add plt.savefig(), that is what gets saved.
I can drag the corners of the pop-up window, and that gives me a very nicely spaced figure, which is precisely what I want:
However, the save function at the bottom of that pop-up window does not always work, and I would much rather create a nice figure in my code and send it to the plt.savefig() function.
In all my searches I keep seeing tight_layout recommended as a fix for this. The issue is that it adjusts my plot sizes rather than the spacing between plots, so my titles overlap and my data isn't as visible:
I have also tried constrained_layout() with zero success.
I am really hoping there is an obvious solution I am missing, as taking screenshots of the plot isn't really working for me.
eq_csv = r'/here/is/the/file.csv'
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
eq_df = pd.read_csv(eq_csv)
eq_data = eq_df[['LON', 'LAT', 'DEPTH', 'MAG']] # retrieve only the columns I need
eq_data = eq_data.sort_values(['MAG'], ascending=False)
# Get the NSEW boundaries and the Magnitude min and max
nbound = max(eq_data.LAT) + 0.05
sbound = min(eq_data.LAT) - 0.05
ebound = max(eq_data.LON) + 0.01
wbound = min(eq_data.LON)
xlimit = (wbound, ebound)
ylimit = (sbound, nbound)
magmin = min(eq_data.MAG)
magmax = max(eq_data.MAG)
# Loop through depth slices and create a 4 x 4 figure of subplots
fig, axes = plt.subplots(4,4)
for ax in axes.flat:
    for n in list(range(1, 17)):
        km = eq_data[(eq_data.DEPTH > n - 1) & (eq_data.DEPTH <= n)]
        km = km.sort_values(['MAG'], ascending=True)
        plt.subplot(4, 4, n) # plot a 4x4 sub plot at the nth location
        scatter = plt.scatter(km["LON"], km['LAT'], s=10, c=km['MAG'], vmin=magmin, vmax=magmax, alpha = 0.5)
        plt.ylim(sbound, nbound)
        plt.xlim(wbound, ebound)
        plt.tick_params(axis='both', which='major', labelsize=4)
        plt.yticks(rotation = 90)
        plt.ylabel('Latitude', rotation = 90, size = 6)
        plt.xlabel('Longitude', size = 6)
        plt.subplots_adjust(hspace=0.65, wspace=0.25)
        plt.gca().set_title('Depth = ' + str(n - 1) + 'km to ' + str(n) + 'km', size=8, fontweight = 'bold') # set title of subplots
plt.suptitle('Magnitude of Events at Different Depth Slices, 1950 to Today', size = 20, fontweight = 'bold')
plt.tight_layout()
fig.subplots_adjust(right=0.8) # adjust location of plot
cbar_ax = fig.add_axes([0.85, 0.15, 0.01, 0.7]) # location of color bar
cbar = fig.colorbar(scatter, cax=cbar_ax)
cbar.set_alpha(1)
cbar.set_label('Magnitude', rotation = 270, labelpad = 10)
cbar.draw_all()
plt.show()
plt.savefig('save/location')
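No answer to this question is recorded here, so the following is only a sketch of one direction that might help, not a confirmed fix. Dragging the window larger works because the figure gets more room, so the sketch gives the figure an explicit figsize, lets constrained layout handle the spacing and the colorbar instead of mixing tight_layout with subplots_adjust, and saves before showing (after plt.show() returns in a script, the figure may already be closed, which tends to produce a blank file). All data here is synthetic and hypothetical, as are the figure size, the dpi value and the temporary output path.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for eq_data (made-up values, not a real catalogue)
lon = rng.uniform(-120.0, -119.0, 400)
lat = rng.uniform(35.0, 36.0, 400)
depth = rng.uniform(0.0, 16.0, 400)
mag = rng.uniform(1.0, 5.0, 400)

# A larger figure plus constrained layout leaves room for titles and labels
fig, axes = plt.subplots(4, 4, figsize=(14, 12), sharex=True, sharey=True,
                         constrained_layout=True)
for n, ax in enumerate(axes.flat, start=1):
    in_slice = (depth > n - 1) & (depth <= n)
    scatter = ax.scatter(lon[in_slice], lat[in_slice], s=10, c=mag[in_slice],
                         vmin=mag.min(), vmax=mag.max(), alpha=0.5)
    ax.set_title(f"Depth = {n - 1}km to {n}km", size=8, fontweight="bold")
    ax.tick_params(axis="both", which="major", labelsize=6)
# With shared axes, label only the outer row and column
for ax in axes[-1, :]:
    ax.set_xlabel("Longitude", size=8)
for ax in axes[:, 0]:
    ax.set_ylabel("Latitude", size=8)
fig.suptitle("Magnitude of Events at Different Depth Slices", size=16,
             fontweight="bold")
# Passing all axes lets the layout engine make room for the colorbar
cbar = fig.colorbar(scatter, ax=axes, shrink=0.8)
cbar.set_label("Magnitude", rotation=270, labelpad=12)

out_path = os.path.join(tempfile.mkdtemp(), "depth_slices.png")  # hypothetical location
fig.savefig(out_path, dpi=150)  # save first; call plt.show() afterwards if needed
```

The saved file has the size given by figsize and dpi regardless of how large the pop-up window happens to be.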
I have a bar chart of data from 8 separate buildings, separated by year. I'm trying to place the growth each building went through in the last year on top of the bars.
I have this written currently:
n_groups = 8
numbers_2017 = (122,96,42,23,23,22,0,0)
numbers_2018 = (284,224,122,52,41,24,3,1)
fig, ax = plt.subplots(figsize=(15, 10))
index = np.arange(n_groups)
bar_width = 0.35
events2017 = plt.bar(index, numbers_2017, bar_width,
alpha=0.7,
color='#fec615',
label='2017')
events2018 = plt.bar(index + bar_width, numbers_2018, bar_width,
alpha=0.7,
color='#044a05',
label='2018')
labels = ("8 specific buildings passed as strings")
labels = [ '\n'.join(wrap(l, 15)) for l in labels ]
plt.ylabel('Total Number of Events', fontsize=18, fontweight='bold', color = 'white')
plt.title('Number of Events per Building By Year\n', fontsize=20, fontweight='bold', color = 'white')
plt.xticks(index + bar_width / 2)
plt.yticks(color = 'white', fontsize=12)
ax.set_xticklabels((labels),fontsize=12, fontweight='bold', color = 'white')
plt.legend(loc='best', fontsize='xx-large')
plt.tight_layout()
plt.show()
Looking through similar questions on here many of them split the total count across all the bars, whereas I'm just trying to get a positive (or negative) growth percentage placed on top of the most recent year, 2018 in this case.
I found this excellent example online, however it does exactly what I explained earlier, splits up the percentages across the chart:
totals = []
# find the values and append to list
for i in ax.patches:
    totals.append(i.get_height())
# sum the heights of all bars
total = sum(totals)
# set individual bar labels using above list
for i in ax.patches:
    # get_x pulls left or right; get_height pushes up or down
    ax.text(i.get_x()-.03, i.get_height()+.5, \
            str(round((i.get_height()/total)*100, 1))+'%', fontsize=15,
            color='dimgrey')
Please let me know if I can list any examples or images that would help, and if this is a dupe please don't hesitate to send me to a (RELEVANT) original and I can shut this question down, Thanks!
I think you answered this yourself with the second piece of code.
The only change needed is to iterate over the bars you want the text above, which in this case is events2018, instead of ax.patches.
totals = []
for start, end in zip(events2017.patches, events2018.patches):
    if start.get_height() != 0:
        totals.append( (end.get_height() - start.get_height())/start.get_height() * 100)
    else:
        totals.append("NaN")
# set individual bar labels using above list
for ind, i in enumerate(events2018.patches):
    # get_x pulls left or right; get_height pushes up or down
    if totals[ind] != "NaN":
        plt.text(i.get_x(), i.get_height()+.5, \
                 str(round((totals[ind]), 1))+'%', fontsize=15,
                 color='dimgrey')
    else:
        plt.text(i.get_x(), i.get_height()+.5, \
                 totals[ind], fontsize=15, color='dimgrey')
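The growth calculation itself does not depend on matplotlib, so it can be checked separately from the plotting. Below is a small sketch using the numbers from the question. The function name growth_percentages, the use of None for undefined growth and the "n/a" label are choices made here for illustration, not part of the answer above.

```python
def growth_percentages(previous, current):
    """Percentage change from previous to current; None where previous is 0."""
    growth = []
    for start, end in zip(previous, current):
        if start == 0:
            growth.append(None)  # growth from zero is undefined
        else:
            growth.append((end - start) / start * 100)
    return growth


numbers_2017 = (122, 96, 42, 23, 23, 22, 0, 0)
numbers_2018 = (284, 224, 122, 52, 41, 24, 3, 1)

# Format with an explicit sign so negative growth would read naturally too
labels = ["n/a" if g is None else f"{g:+.1f}%"
          for g in growth_percentages(numbers_2017, numbers_2018)]
print(labels)
# ['+132.8%', '+133.3%', '+190.5%', '+126.1%', '+78.3%', '+9.1%', 'n/a', 'n/a']
```

These strings can then be passed to plt.text in the loop over events2018.patches, one per bar.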