I was trying to use matplotlib and pandas to create a bar chart that shows the daily change in COVID-19 cases for the USA.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')
display(data.head(10))
df = data.groupby('date').sum()
df['index'] = range(len(df))
df['IsChanged'] = df['cases'].diff()
df.at['2020-01-21', 'IsChanged'] = 0.0
x = df['index']
z = df['IsChanged']
plt.figure(figsize=(20,10))
plt.grid(linestyle='--')
plt.bar(x,z)
plt.show()
The graph that I get, though, looks like this (image omitted):
The widths of the bars in the chart are not even. I tried setting a specific width, but that didn't work. Is there a way to fix this?
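This can be resolved by raising the figure resolution, for example with dpi=300. With several hundred bars, each bar is only a few pixels wide at the default resolution, so rounding to whole pixels most likely makes neighbouring bars look uneven. The graph in this answer is an image of the output of your code with the DPI specified.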
plt.figure(figsize=(20,10),dpi=300)
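For readers who cannot download the CSV, here is a minimal sketch of the same idea. It assumes synthetic cumulative counts (the `cases` series and its random values are made up, not the NYT data) and only shows where `dpi` goes and how the daily change is computed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic cumulative case counts (assumed data, not the NYT file)
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-21", periods=300, freq="D")
cases = pd.Series(rng.integers(0, 500, size=len(dates)).cumsum(), index=dates)

# Daily change; the first day has no predecessor, so fill it with 0
daily_change = cases.diff().fillna(0)

# A higher dpi gives each bar more pixels to be drawn with
fig, ax = plt.subplots(figsize=(20, 10), dpi=300)
ax.grid(linestyle="--")
bars = ax.bar(np.arange(len(daily_change)), daily_change.to_numpy())
```

In a notebook, finish with plt.show(); in a script, fig.savefig('daily_change.png') writes the image at the figure's dpi (the file name is just an example).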
While this does not directly answer your question, you might want to consider plt.fill_between() when the bar plot has lots of bars, since a bar plot loses its purpose once you cannot distinguish the bars from each other.
For example
plt.fill_between(x, 0, z, step='mid', facecolor=(0.3, 0.3, 0.45 ,.4), edgecolor=(0, 0, 0, 1))
plt.grid(ls= ':', color='#6e6e6e', lw=0.5);
or even:
plt.fill_between(x, 0, z, facecolor=(0.3, 0.3, 0.45 ,.4), edgecolor=(0, 0, 0, 1))
plt.grid(ls= ':', color='#6e6e6e', lw=0.5);
I have this data (df) and I get their percentages (data=rel) and plotted a stacked bar graph.
Now I want to add values (non percentage values) to the centers of each bar but from my first dataframe.
My code for now:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from csv import reader
import seaborn as sns
df = pd.DataFrame({'IL':['Balıkesir', 'Bursa', 'Çanakkale', 'Edirne', 'İstanbul', 'Kırklareli', 'Kocaeli', 'Sakarya','Tekirdağ','Yalova'],'ENGELLIUYGUN':[7,13,3,1,142,1,14,1,2,2],'ENGELLIUYGUNDEGIL':[1,5,0,0,55,0,3,0,1,0]})
iller=df.iloc[:,[0]]
df_total = df["ENGELLIUYGUN"] + df["ENGELLIUYGUNDEGIL"]
df_rel = df[df.columns[1:]].div(df_total, 0)*100
rel=[]
rel=pd.DataFrame(df_rel)
rel['İller'] = iller
d=df.iloc[:,[1]] #I want to add these values to the center of blue bars.
f=df.iloc[:,[2]] #I want to add these values to the center of green bars.
sns.set_theme (style='whitegrid')
ax=rel.plot(x='İller',kind='bar', stacked=True, color=["#3a88e2","#5c9e1e"], label=("Uygun","Uygun Değil"))
plt.legend(["Evet","Hayır"],fontsize=8, bbox_to_anchor=(1, 0.5))
plt.xlabel('...........',fontsize=12)
plt.ylabel('..........',fontsize=12)
plt.title('.............',loc='center',fontsize=14)
plt.ylim(0,100)
ax.yaxis.grid(color='gray', linestyle='dashed')
plt.show()
I have this for now:
I want the exact same style of this photo:
I am using Anaconda-Jupyter Notebook.
Answering: I want to add values (non percentage values) to the centers of each bar but from my first dataframe.
The correct way to annotate bars is with .bar_label, as explained in this answer.
The values from df can be passed to the labels= parameter instead of the percentages.
This answer shows how to succinctly calculate the percentages, but plots the counts and annotates with percentage and value, whereas this OP wants to plot the percentage on the y-axis and annotate with counts.
This answer shows how to place the legend at the bottom of the plot.
This answer shows how to format the axis tick labels as percent.
See pandas.DataFrame.plot for an explanation of the available parameters.
Since you are using a Jupyter Notebook, everything from the comment # plot percent; ... onwards should be in the same notebook cell.
Tested in python 3.11, pandas 1.5.2, matplotlib 3.6.2
import pandas as pd
import matplotlib.ticker as tkr
# sample data
df = pd.DataFrame({'IL': ['Balıkesir', 'Bursa', 'Çanakkale', 'Edirne', 'İstanbul', 'Kırklareli', 'Kocaeli', 'Sakarya','Tekirdağ','Yalova'],
                   'ENGELLIUYGUN': [7, 13, 3, 1, 142, 1, 14, 1, 2, 2],
                   'ENGELLIUYGUNDEGIL': [1, 5, 0, 0, 55, 0, 3, 0, 1, 0]})
# set IL as the index
df = df.set_index('IL')
# calculate the percent
per = df.div(df.sum(axis=1), axis=0).mul(100)
# plot percent; adjust rot= for the rotation of the xtick labels
ax = per.plot(kind='bar', stacked=True, figsize=(10, 8), rot=0,
              color=['#3a88e2', '#5c9e1e'], yticks=range(0, 101, 10),
              title='my title', ylabel='', xlabel='')
# move the legend
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05), ncol=2, frameon=False)
# format the y-axis tick labels
ax.yaxis.set_major_formatter(tkr.PercentFormatter())
# iterate through the containers
for c in ax.containers:
    # get the current segment label (a string); corresponds to column / legend
    col = c.get_label()
    # use label to get the appropriate count values from df
    # customize the label to account for cases when there might not be a bar section
    labels = [v if v > 0 else '' for v in df[col]]
    # the following will also work
    # labels = df[col].replace(0, '')
    # add the annotation
    ax.bar_label(c, labels=labels, label_type='center', fontweight='bold')
Alternate Annotation Implementation
Since the column names in df and per are the same, they can be extracted directly from per.
# iterate through the containers and per column names
for c, col in zip(ax.containers, per):
    # add the annotations with custom labels from df
    ax.bar_label(c, labels=df[col].replace(0, ''), label_type='center', fontweight='bold')
I don't think any subtle method exists, so you have to print those yourself by adding the text explicitly, which is not that hard to do. For example, if you add this just after your plot
for i in range(len(d)):
    ax.text(i, df_rel.iloc[i,0]/2, d.iloc[i,0], ha='center', fontweight='bold', color='#ffff00', fontsize='small')
    ax.text(i, 50+df_rel.iloc[i,0]/2, f.iloc[i,0], ha='center', fontweight='bold', color='#400040', fontsize='small')
you get this result
You can of course change color, size, position, etc. (I am well known for my total lack of bon goût in those matters). You can also decide on some arbitrary rule, such as not printing '0'. That's the advantage of doing things explicitly: your code, your rule; you don't have to fight an existing API to convince it to do it your way.
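As an illustration of such a rule, here is a minimal sketch that skips zero counts. It is self-contained rather than a drop-in for the code above: the names `per`, `bottoms`, `centers` and `n_labels` are my own, and the segment centres are computed from cumulative percentages instead of the 50+... shortcut, which is an assumption that generalises to more than two columns.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'IL': ['Balıkesir', 'Bursa', 'Çanakkale', 'Edirne', 'İstanbul',
                          'Kırklareli', 'Kocaeli', 'Sakarya', 'Tekirdağ', 'Yalova'],
                   'ENGELLIUYGUN': [7, 13, 3, 1, 142, 1, 14, 1, 2, 2],
                   'ENGELLIUYGUNDEGIL': [1, 5, 0, 0, 55, 0, 3, 0, 1, 0]}).set_index('IL')

# Percentages plotted on the y-axis
per = df.div(df.sum(axis=1), axis=0).mul(100)
ax = per.plot(kind='bar', stacked=True, rot=0, color=['#3a88e2', '#5c9e1e'])

# Bottom edge of each segment is the cumulative percent of the columns before it
bottoms = per.cumsum(axis=1) - per
centers = bottoms + per / 2

n_labels = 0
for col in df.columns:
    for i, (count, y) in enumerate(zip(df[col], centers[col])):
        if count == 0:  # the explicit rule: do not label empty segments
            continue
        ax.text(i, y, str(count), ha='center', va='center', fontweight='bold')
        n_labels += 1
```

Because the rule is an ordinary if statement, it is easy to swap in another one, for example skipping segments below some percentage so the text does not overflow a thin bar.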
This question is related to a previous question I posted here. My code for my seaborn scatterplot looks as follows:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df = pd.DataFrame()
df['First PCA dimension'] = [1,2,3,4]
df['Second PCA dimension'] = [0,5,5,7]
df['Third PCA dimension'] = [1,2,6,4]
df['Data points'] = [1,2,3,4]
plt.figure(figsize=(42,30))
plt.title('2-D PCA of my data points',fontsize=32)
colors = ["#FF9926", "#2ACD37","#FF9926", "#FF0800"]
b = sns.scatterplot(x="First PCA dimension", y="Second PCA dimension", hue="Data points", palette=sns.color_palette(colors), data=df, legend="full", alpha=0.3)
sns.set_context("paper", rc={"font.size":48,"axes.titlesize":48,"axes.labelsize":48})
b.set_ylabel('mylabely', size=54)
b.set_xlabel('mylabelx', size=54)
b.set_xticklabels([1,2,3,4,5,6,7,8], fontsize = 36)
lgnd = plt.legend(fontsize='22')
for handle in lgnd.legendHandles:  # renamed to legend_handles in matplotlib 3.7
    handle.set_sizes([26.0])
plt.show()
The alpha value of 0.3 sets a transparency value for each point in my scatterplot. However, I would like to have a different transparency value for each data point (based on the category it belongs to) instead. Is this possible by providing a list of alpha values, similar to the way I provide a list of colours in the example above?
As noted in comments, this is something you can't currently do with seaborn.
However, you can hack it by using key colours for the markers, and find-replacing those colours using PathCollection.get_facecolor() and PathCollection.set_facecolor() with RGBA colours.
So for example, I needed a swarmplot on top of a violinplot, with certain classes of points at different opacities. To change greys into transparent blacks (what I needed to do), we can do:
import numpy
import seaborn
from matplotlib.collections import PathCollection
from numpy import array

seaborn.violinplot(...)
points = seaborn.swarmplot(...)
for c in points.collections:
    if not isinstance(c, PathCollection):
        continue
    fc = c.get_facecolor()
    if fc.shape[1] == 4:
        for i, r in enumerate(fc):
            # change mid-grey to 50% black
            if numpy.array_equiv(r, array([0.5, 0.5, 0.5, 1])):
                fc[i] = array([0, 0, 0, 0.5])
            # change white to transparent
            elif numpy.array_equiv(r, array([1, 1, 1, 1])):
                fc[i] = array([0, 0, 0, 0])
        c.set_facecolor(fc)
Very awful, but it got me what I needed for a one-shot graphic.
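If dropping down to plain matplotlib is acceptable, a less hacky route may be to build one RGBA colour per point yourself. This is a swapped-in technique, not something seaborn does for you: the sketch below uses matplotlib.colors.to_rgba and Axes.scatter, and the `alphas` mapping is a made-up example of per-category transparency.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import to_rgba

df = pd.DataFrame({
    'First PCA dimension': [1, 2, 3, 4],
    'Second PCA dimension': [0, 5, 5, 7],
    'Data points': [1, 2, 3, 4],
})

# Colour per category as in the question; the alpha values are hypothetical
colors = {1: '#FF9926', 2: '#2ACD37', 3: '#FF9926', 4: '#FF0800'}
alphas = {1: 0.2, 2: 0.5, 3: 0.8, 4: 1.0}

# One RGBA tuple per row, combining the category colour with its alpha
rgba = [to_rgba(colors[k], alpha=alphas[k]) for k in df['Data points']]

fig, ax = plt.subplots()
points = ax.scatter(df['First PCA dimension'], df['Second PCA dimension'],
                    c=rgba, s=200)
ax.set_xlabel('mylabelx')
ax.set_ylabel('mylabely')
```

The trade-off is that you lose seaborn's automatic legend and would have to build legend handles by hand.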
I have a dataframe with two columns, VOL and INVOL, and for a particular year their values are the same. Hence, when plotting in seaborn, I cannot see the value of one column where the two converge.
For example:
My dataframe is
When I use seaborn, using the below code
f5_test = df5_test.melt('FY', var_name='cols', value_name='vals')
g = sns.catplot(x="FY", y="vals", hue='cols', data=f5_test, kind='point')
the chart does not show both series at the shared point of 0.06.
I tried using pandas plotting, having the same result.
Please advise what I should do. Thanks in advance.
Your plot looks legitimate. The two lines overlap perfectly because the data from 2016 to 2018 is exactly the same. You could try plotting the two lines separately and adding or subtracting a small value to one of them to shift it a little. For example:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'FY': [2012, 2013, 2014, 2015, 2016, 2017, 2018],
                   'VOL_PCT': [0, 0.08, 0.07, 0.06, 0, 0, 0.06],
                   'INVOL_PC': [0, 0, 0, 0, 0, 0, 0.06]})
# plot
fig, ax = plt.subplots()
# x and y must be passed as keywords in recent seaborn versions
sns.lineplot(x=df.FY, y=df.VOL_PCT, ax=ax)
sns.lineplot(x=df.FY+.01, y=df.INVOL_PC-.001, ax=ax)
In addition, given the type of your data, you could also consider using stack plots. For example:
fig, ax = plt.subplots()
labels = ['VOL_PCT', 'INVOL_PC']
ax.stackplot(df.FY, df.VOL_PCT, df.INVOL_PC, labels=labels)
ax.legend(loc='upper left');
Ref. Stackplot
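To confirm that the missing points really are an overlap and not a plotting bug, you can check where the two columns coincide. This is a small sketch on the sample frame from this answer; `overlap_years` is a name I made up.

```python
import pandas as pd

df = pd.DataFrame({'FY': [2012, 2013, 2014, 2015, 2016, 2017, 2018],
                   'VOL_PCT': [0, 0.08, 0.07, 0.06, 0, 0, 0.06],
                   'INVOL_PC': [0, 0, 0, 0, 0, 0, 0.06]})

# Boolean mask of the rows where both series have the same value
same = df['VOL_PCT'].eq(df['INVOL_PC'])

# Years in which one marker is drawn exactly on top of the other
overlap_years = df.loc[same, 'FY'].tolist()
print(overlap_years)  # [2012, 2016, 2017, 2018]
```

Those are the years where only one marker is visible, including 2012, where both series are 0.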
I am preparing a graph of latency percentile results. This is what my pd.DataFrame looks like:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
result = pd.DataFrame(np.random.randint(133000, size=(5,3)), columns=list('ABC'), index=[99.0, 99.9, 99.99, 99.999, 99.9999])
I am using this function (commented lines are different pyplot methods I have already tried to achieve my goal):
def plot_latency_time_bar(result):
    ind = np.arange(4)
    means = []
    stds = []
    for index, row in result.iterrows():
        means.append(np.mean([row[0]//1000, row[1]//1000, row[2]//1000]))
        stds.append(np.std([row[0]//1000, row[1]//1000, row[2]//1000]))
    plt.bar(result.index.values, means, 0.2, yerr=stds, align='center')
    plt.xlabel('Percentile')
    plt.ylabel('Latency')
    plt.xticks(result.index.values)
    # plt.xticks(ind, ('99.0', '99.9', '99.99', '99.999', '99.99999'))
    # plt.autoscale(enable=False, axis='x', tight=False)
    # plt.axis('auto')
    # plt.margins(0.8, 0)
    # plt.semilogx(basex=5)
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    fig = plt.gcf()
    fig.set_size_inches(15.5, 10.5)
And here is the figure:
As you can see, the bars for all percentiles above 99.0 overlap and are completely unreadable. I would like a fixed distance between ticks so that all bars are evenly spaced.
Since you're using pandas, you can do all this from within that library:
# result is the DataFrame from the question
means = result.mean(axis=1)/1000
stds = result.std(axis=1)/1000
means.plot.bar(yerr=stds, fc='b')
# Make some room for the x-axis tick labels
plt.subplots_adjust(bottom=0.2)
plt.show()
Not wishing to take anything away from xnx's answer (which is the most elegant way to do things given that you're working in pandas, and therefore likely the best answer for you) but the key insight you're missing is that, in matplotlib, the x positions of the data you're plotting and the x tick labels are independent things. If you say:
nominalX = np.arange( 1, 6 ) ** 2
y = np.arange( 1, 6 ) ** 4
positionalX = np.arange(len(y))
plt.bar( positionalX, y ) # graph y against the numbers 1..n
plt.gca().set(xticks=positionalX, xticklabels=nominalX) # ...but superficially label the X values as something else
then that's different from tying positions to your nominal X values:
plt.bar( nominalX, y )
Note that since matplotlib 2.0, bar() centers each bar on its x position by default (align='center'), so the ticks already sit in the middle of the bars. In older versions, where bars were left-aligned, you had to add 0.4 to the tick positions, half the default bar width (width=0.8), to get the same result.
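Applied to the latency data from the question, the idea might look like the sketch below. It assumes a seeded random frame in place of the real measurements, divides by 1000 after averaging rather than floor-dividing each value first, and uses ddof=0 so the standard deviation matches np.std.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the question's data (random values, seeded for repeatability)
rng = np.random.default_rng(0)
result = pd.DataFrame(rng.integers(133000, size=(5, 3)), columns=list('ABC'),
                      index=[99.0, 99.9, 99.99, 99.999, 99.9999])

means = result.mean(axis=1) / 1000
stds = result.std(axis=1, ddof=0) / 1000  # ddof=0 matches np.std

# Plot against evenly spaced positions 0..n-1, not the percentile values
positions = np.arange(len(result))
fig, ax = plt.subplots()
bars = ax.bar(positions, means, width=0.6, yerr=stds)

# Label those positions with the percentiles they stand for
ax.set_xticks(positions)
ax.set_xticklabels([str(p) for p in result.index])
ax.set_xlabel('Percentile')
ax.set_ylabel('Latency')
```

The bars are now one unit apart regardless of how close 99.99 and 99.999 are numerically, which is also what the pandas means.plot.bar() call does internally for a bar plot.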
I want to create a bar chart of two series (say 'A' and 'B') contained in a Pandas dataframe. If I wanted to just plot them using a different y-axis, I can use secondary_y:
df = pd.DataFrame(np.random.uniform(size=10).reshape(5,2),columns=['A','B'])
df['A'] = df['A'] * 100
df.plot(secondary_y=['A'])
but if I want to create bar graphs, the equivalent command is ignored (it doesn't put different scales on the y-axis), so the bars from 'A' are so big that the bars from 'B' cannot be distinguished:
df.plot(kind='bar',secondary_y=['A'])
How can I do this in pandas directly? Or how would you create such a graph?
I'm using pandas 0.10.1 and matplotlib version 1.2.1.
I don't think pandas plotting supports this. Here is some manual matplotlib code; you can tweak it further:
import pylab as pl
fig = pl.figure()
ax1 = pl.subplot(111,ylabel='A')
#ax2 = gcf().add_axes(ax1.get_position(), sharex=ax1, frameon=False, ylabel='axes2')
ax2 =ax1.twinx()
ax2.set_ylabel('B')
ax1.bar(df.index,df.A.values, width =0.4, color ='g', align = 'center')
ax2.bar(df.index,df.B.values, width = 0.4, color='r', align = 'edge')
ax1.legend(['A'], loc = 'upper left')
ax2.legend(['B'], loc = 'upper right')
fig.show()
I am sure there are ways to tweak this further, e.g. moving the bars further apart or making one slightly transparent.
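One possible version of that tweak is sketched below, under my own assumptions: each series is shifted by half a bar width so the two bars sit side by side, and the second one is slightly transparent. The seeded random data and the names `bars_a` and `bars_b` are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(size=(5, 2)), columns=['A', 'B'])
df['A'] = df['A'] * 100

width = 0.4
x = np.arange(len(df))

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis

# Shift each series by half a bar width so the pair straddles the tick
bars_a = ax1.bar(x - width / 2, df['A'], width=width, color='g', label='A')
bars_b = ax2.bar(x + width / 2, df['B'], width=width, color='r', alpha=0.7, label='B')

ax1.set_ylabel('A')
ax2.set_ylabel('B')
ax1.set_xticks(x)
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
```

With two axes the legends stay separate, as in the code above; merging them would need the handles from both axes collected into one legend call.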
OK, I had the same problem recently, and even though it's an old question, I think I can give an answer in case someone else is losing their mind over this. Joop gave the basis of what to do, and it's easy when you only have (for example) two columns in your dataframe, but it gets really nasty when you have a different number of columns for the two axes, because you need to play with the position argument of the pandas plot() function. In my example I use seaborn, but it's optional:
import pandas as pd
import seaborn as sns
import pylab as plt
import numpy as np
df1 = pd.DataFrame(np.array([[i*99 for i in range(11)]]).transpose(), columns = ["100"], index = [i for i in range(11)])
df2 = pd.DataFrame(np.array([[i for i in range(11)], [i*2 for i in range(11)]]).transpose(), columns = ["1", "2"], index = [i for i in range(11)])
fig, ax = plt.subplots()
ax2 = ax.twinx()
# we must define the length of each column.
df1_len = len(df1.columns.values)
df2_len = len(df2.columns.values)
column_width = 0.8 / (df1_len + df2_len)
# we calculate the position of each column in the plot. This value is based on the position definition :
# Specify relative alignments for bar plot layout. From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center)
# http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.plot.html
df1_posi = 0.5 + (df2_len/float(df1_len)) * 0.5
df2_posi = 0.5 - (df1_len/float(df2_len)) * 0.5
# In order to have nice color, I use the default color palette of seaborn
df1.plot(kind='bar', ax=ax, width=column_width*df1_len, color=sns.color_palette()[:df1_len], position=df1_posi)
df2.plot(kind='bar', ax=ax2, width=column_width*df2_len, color=sns.color_palette()[df1_len:df1_len+df2_len], position=df2_posi)
ax.legend(loc="upper left")
# Pandas add line at x = 0 for each dataframe.
ax.lines[0].set_visible(False)
ax2.lines[0].set_visible(False)
# Specific to seaborn, we have to remove the background line
ax2.grid(False, axis='both')
# We need to add some space, the xlim don't manage the new positions
column_length = (ax2.get_xlim()[1] - abs(ax2.get_xlim()[0])) / float(len(df1.index))
ax2.set_xlim([ax2.get_xlim()[0] - column_length, ax2.get_xlim()[1] + column_length])
fig.patch.set_facecolor('white')
plt.show()
And the result : http://i.stack.imgur.com/LZjK8.png
I didn't test every possibility, but it seems to work fine whatever the number of columns in each dataframe.