I need some help making a set of stacked bar charts in Python with matplotlib.
My current plotting code looks like this:
plt.figure(figsize=(10, 14))
fig = plt.figure()
ax = sns.countplot(x="airlines", hue='typecode', data=trafic,
                   order=trafic.airlines.value_counts(ascending=False).iloc[:5].index,
                   hue_order=trafic.typecode.value_counts(ascending=False).iloc[:5].index,
                   )
ax.set(xlabel="Airlines code", ylabel='Count')
As the order and hue_order arguments show, I want to isolate the 5 most frequent airlines and the 5 most frequent aircraft types in my database.
I was advised to use a stacked bar plot to make the graph more presentable, but I can't find any functionality in seaborn for that, and I can't manage to plot it with matplotlib while still keeping only the 5 most frequent airlines and aircraft types.
Thanks for your help!
The following code uses seaborn's countplot with dodge=False. This draws all bars belonging to the same airline at the same x position, overlapping each other. In the next step, the bars are shifted upwards so that they stack:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
sns.set()
np.random.seed(123)
trafic = pd.DataFrame({'airlines': np.random.choice([*'abcdefghij'], 500),
                       'typecode': np.random.choice([*'qrstuvwxyz'], 500)})
fig = plt.figure(figsize=(10, 5))
ax = sns.countplot(x="airlines", hue='typecode', palette='rocket', dodge=False, data=trafic,
                   order=trafic.airlines.value_counts(ascending=False).iloc[:5].index,
                   hue_order=trafic.typecode.value_counts(ascending=False).iloc[:5].index)
ax.set(xlabel="Airlines code", ylabel='Count')
bottoms = {}
for bars in ax.containers:
    for bar in bars:
        x, y = bar.get_xy()
        h = bar.get_height()
        if x in bottoms:
            # a bar already exists at this x position: move this one on top of it
            bar.set_y(bottoms[x])
            bottoms[x] += h
        else:
            bottoms[x] = h
ax.relim() # the plot limits need to be updated with the moved bars
ax.autoscale()
plt.show()
Note that the airlines are sorted by their total number of airplanes, not by their total for the 5 overall most frequent airplane types.
PS: In the question's code, plt.figure() is called twice. That first creates an empty figure with the given figsize, and then a new figure with a default figsize.
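As an alternative to moving the bars by hand, the counts can be tabulated first and then plotted as stacked bars by pandas. The following is only a sketch under assumptions: the dataframe is synthetic demo data standing in for trafic, and the names top_airlines, top_types, subset and counts are introduced here for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the question's `trafic` dataframe (assumption)
rng = np.random.default_rng(123)
trafic = pd.DataFrame({'airlines': rng.choice(list('abcdefghij'), 500),
                       'typecode': rng.choice(list('qrstuvwxyz'), 500)})

# The 5 most frequent airlines and aircraft types, most frequent first
top_airlines = trafic['airlines'].value_counts().index[:5]
top_types = trafic['typecode'].value_counts().index[:5]

# Keep only rows that belong to both top-5 groups, then count each combination
subset = trafic[trafic['airlines'].isin(top_airlines) & trafic['typecode'].isin(top_types)]
counts = pd.crosstab(subset['airlines'], subset['typecode'])
counts = counts.reindex(index=top_airlines, columns=top_types, fill_value=0)

# One stacked bar per airline, one segment per aircraft type
ax = counts.plot(kind='bar', stacked=True, figsize=(10, 5))
ax.set(xlabel='Airlines code', ylabel='Count')
```

Calling plt.show() afterwards displays the figure. As in the answer above, the airlines here are ordered by their overall count, not by their count within the 5 selected aircraft types.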
I am trying to include 2 seaborn countplots with different scales on the same plot but the bars display as different widths and overlap as shown below. Any idea how to get around this?
Setting dodge=False doesn't work, as the bars then appear on top of each other.
The main problem with the approach in the question is that the first countplot doesn't take hue into account. The second countplot won't magically move the bars of the first. An additional categorical column could be added that only takes the 'weekend' value. Note that the column should be explicitly made categorical with two values, even if only one value is really used.
Things can be simplified a lot by starting from the original dataframe, which presumably already has a column 'is_weekend'. Creating the twinx ax beforehand makes it possible to write a loop, so the call to sns.countplot() is written only once, with parameters.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
sns.set_style('dark')
# create some demo data
data = pd.DataFrame({'ride_hod': np.random.normal(13, 3, 1000).astype(int) % 24,
                     'is_weekend': np.random.choice(['weekday', 'weekend'], 1000, p=[5 / 7, 2 / 7])})
# now, make 'is_weekend' a categorical column (not just strings)
data['is_weekend'] = pd.Categorical(data['is_weekend'], ['weekday', 'weekend'])
fig, ax1 = plt.subplots(figsize=(16, 6))
ax2 = ax1.twinx()
for ax, category in zip((ax1, ax2), data['is_weekend'].cat.categories):
    sns.countplot(data=data[data['is_weekend'] == category], x='ride_hod', hue='is_weekend', palette='Blues', ax=ax)
    ax.set_ylabel(f'Count ({category})')
ax1.legend_.remove() # both axes got a legend, remove one
ax1.set_xlabel('Hour of Day')
plt.tight_layout()
plt.show()
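The remark about making the column explicitly categorical can be checked in isolation. This small sketch (the series names are made up for illustration) compares a plain string column with a categorical one after filtering down to a single value:

```python
import pandas as pd

labels = pd.Series(['weekday', 'weekend', 'weekday', 'weekend'])
as_cat = pd.Series(pd.Categorical(labels, categories=['weekday', 'weekend']))

# Filter both versions down to the 'weekend' rows only
plain_subset = labels[labels == 'weekend']
cat_subset = as_cat[as_cat == 'weekend']

# The plain string column only knows about the values still present
print(list(plain_subset.unique()))

# The categorical column still remembers both categories,
# and value_counts reports the unused one with a count of 0
print(list(cat_subset.cat.categories))
print(cat_subset.value_counts())
```

Because the filtered categorical column still carries both levels, a hue based on it reserves a slot for each level, which is what keeps the bars of the two countplots from overlapping.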
Use plt.xticks(ticks, labels) to set the x-axis tick labels by hand.
I have a sample dataset as follows:
top5_places = pd.DataFrame({
    'Day_Duration': ['Evening', 'Evening', 'Evening', 'Evening', 'Evening',
                     'Morning', 'Morning', 'Morning', 'Morning', 'Morning',
                     'Night', 'Night', 'Night', 'Night', 'Night',
                     'Noon', 'Noon', 'Noon', 'Noon', 'Noon'],
    'place_category': ['Other', 'Italian', 'Japanese', 'Chinese', 'Burger',
                       'Other', 'Juice Bar', 'Donut', 'Bakery', 'American',
                       'Other', 'Italian', 'Japanese', 'Burger', 'American',
                       'Other', 'Italian', 'Burger', 'American', 'Salad'],
    'Percent_delivery': [14.03, 10.61, 9.25, 8.19, 6.89,
                         19.58, 10.18, 9.14, 8.36, 6.53,
                         13.60, 8.42, 8.22, 7.66, 6.67,
                         17.71, 10.62, 8.44, 8.33, 7.50]})
The goal is to draw a faceted barplot with Day_Duration serving as facets, hence 4 facets in total. I used the following code:
import seaborn as sns
#g = sns.FacetGrid(top5_places, col="Day_Duration")
g = sns.catplot(x="place_category", y="Percent_delivery", hue='place_category', col='Day_Duration',
                data=top5_places, ci=None, kind='bar', height=4, aspect=.7)
g.set_xticklabels(rotation=90)
Attached is the figure I got:
I would appreciate help with 2 things. First, is it possible to show only the 5 values that belong to each facet on its x-axis, rather than all the values in every facet? Second, is there a way to make the bars a bit wider?
Because you're using hue, the API applies a unique color to each value of place_category, but it also expects each category to be in every plot, as shown in your image.
The final figure is a FacetGrid. Using subplot is the manual way of creating one.
In order to plot only the top n categories for each Day_Duration, each plot will need to be done individually, with a custom color map.
cmap is a dictionary with place categories as keys and colors as values. It's used so there will be one legend and each category will be colored the same in each plot.
Because we're not using the legend automatically generated by the plot, one needs to be created manually.
patches uses Patch to create each item in the legend (i.e. the rectangle, with its associated color and name).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
# df is the dataframe from the question (called top5_places there)
# create a color map for the unique values of place_category
place_cat = df.place_category.unique()
colors = sns.color_palette('husl', n_colors=10)
cmap = dict(zip(place_cat, colors))
# plot a subplot for each Day_Duration
plt.figure(figsize=(16, 6))
for i, tod in enumerate(df.Day_Duration.unique(), 1):
    data = df[df.Day_Duration == tod].sort_values(['Percent_delivery'], ascending=False)
    plt.subplot(1, 4, i)
    p = sns.barplot(x='place_category', y='Percent_delivery', data=data, hue='place_category', palette=cmap)
    p.legend_.remove()
    plt.xticks(rotation=90)
    plt.title(f'Day Duration: {tod}')
plt.tight_layout()
patches = [Patch(color=v, label=k) for k, v in cmap.items()]
plt.legend(handles=patches, bbox_to_anchor=(1.04, 0.5), loc='center left', borderaxespad=0)
plt.show()
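The second part of the question (wider bars) is not addressed by the code above. One possible way, sketched here with plain matplotlib instead of seaborn, is to draw each facet with ax.bar and set its width parameter. The data is a small subset of the question's rows, and the tab10 colors and the 0.9 width are arbitrary choices for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# A subset of the question's rows, to keep the sketch short
df = pd.DataFrame({
    'Day_Duration': ['Morning'] * 3 + ['Night'] * 3,
    'place_category': ['Other', 'Donut', 'Bakery', 'Other', 'Italian', 'Burger'],
    'Percent_delivery': [19.58, 9.14, 8.36, 13.60, 8.42, 7.66],
})

# One fixed color per category, shared by all facets
categories = df['place_category'].unique()
cmap = {cat: plt.cm.tab10(i) for i, cat in enumerate(categories)}

durations = df['Day_Duration'].unique()
fig, axes = plt.subplots(1, len(durations), figsize=(8, 4), sharey=True)
for ax, tod in zip(axes, durations):
    data = df[df['Day_Duration'] == tod].sort_values('Percent_delivery', ascending=False)
    # width is given in x-axis units; 1.0 would make neighbouring bars touch
    ax.bar(data['place_category'], data['Percent_delivery'],
           color=[cmap[c] for c in data['place_category']], width=0.9)
    ax.set_title(f'Day Duration: {tod}')
    ax.tick_params(axis='x', labelrotation=90)

# One shared legend built by hand, as in the answer above
patches = [Patch(color=v, label=k) for k, v in cmap.items()]
fig.legend(handles=patches, loc='center right')
```

Each facet only receives its own rows, so its x-axis only shows its own categories. Calling plt.show() afterwards displays the figure.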
I'm trying to plot 60+ boxplots side by side from a dataframe and I was wondering if someone could suggest some possible solutions.
At the moment I have df_new, a dataframe with 66 columns, which I'm using to plot boxplots. The easiest way I found to plot the boxplots was to use the boxplot method built into pandas:
boxplot = df_new.boxplot(column=x, figsize = (100,50))
This gives me a very tiny chart with illegible axes, and I cannot seem to change their font size, so I'm trying to do this natively in matplotlib, but I cannot think of an efficient way of doing it. I'm trying to avoid creating 66 separate boxplots using something like:
fig, ax = plt.subplots(nrows=1,
                       ncols=66,
                       figsize=(10, 5),
                       sharex=True)
ax[0].boxplot(...)  # insert parameters here
I actually do not know how to get the data from df_new.describe() into the boxplot function, so any tips on this would be greatly appreciated! The documentation is confusing. I'm not sure what the x vectors should be.
Ideally I'd like to just give the boxplot function the dataframe and for it to automatically create all the boxplots by working out all the quartiles, column separations etc on the fly - is this even possible?
Thanks!
I tried replacing the boxplots with a ridge plot, which takes up less space because it requires half of the width, the ridges can partially overlap, and it develops vertically, so you can scroll down the whole plot.
I took the code from the seaborn documentation and adapted it a little in order to have 60 different ridges, normally distributed; here is the code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import itertools
sns.set(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
# Create the data
n = 20
x = list(np.random.randn(1, 60)[0])
g = [item[0] + item[1] for item in list(itertools.product(list('ABCDEFGHIJ'), list('123456')))]
df = pd.DataFrame({'x': n*x,
                   'g': n*g})
# Initialize the FacetGrid object
pal = sns.cubehelix_palette(10, rot=-.25, light=.7)
g = sns.FacetGrid(df, row="g", hue="g", aspect=15, height=.5, palette=pal)
# Draw the densities in a few steps
g.map(sns.kdeplot, "x", clip_on=False, shade=True, alpha=1, lw=1.5, bw=.2)
g.map(sns.kdeplot, "x", clip_on=False, color="w", lw=2, bw=.2)
g.map(plt.axhline, y=0, lw=2, clip_on=False)
# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
    ax = plt.gca()
    ax.text(0, .2, label, fontweight="bold", color=color,
            ha="left", va="center", transform=ax.transAxes)
g.map(label, "x")
# Set the subplots to overlap
g.fig.subplots_adjust(hspace=-.25)
# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[])
g.despine(bottom=True, left=True)
plt.show()
This is the result I get:
I don't know if this will suit your needs. In any case, keep in mind that showing so many distributions next to each other will always require a lot of space (and a very big screen).
Maybe you could try dividing the distributions into smaller groups and plotting them a few at a time?
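As for the original question of handing the whole dataframe to the boxplot function: matplotlib's boxplot computes the quartiles itself from the raw values, so the output of describe() is not needed. When it is given a 2D array, it draws one box per column. The sketch below uses random demo data in place of df_new and assumes the columns contain no missing values; the figure size and font size are arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Random stand-in for the question's df_new with 66 columns (assumption)
rng = np.random.default_rng(0)
df_new = pd.DataFrame(rng.normal(size=(200, 66)),
                      columns=[f'col{i}' for i in range(66)])

fig, ax = plt.subplots(figsize=(20, 6))

# A 2D array gives one box per column; the statistics are computed from the raw data
bp = ax.boxplot(df_new.to_numpy())

# Boxes are placed at x = 1..n, so label those positions with the column names
ax.set_xticks(range(1, df_new.shape[1] + 1))
ax.set_xticklabels(df_new.columns, rotation=90, fontsize=8)
fig.tight_layout()
```

Since the tick labels are set directly on the matplotlib axes, their font size and rotation can be adjusted freely. Calling plt.show() afterwards displays the figure.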
I want to plot a "highlighted" point on top of a swarmplot, like this:
The swarmplot doesn't have a y-axis, so I have no idea how to plot that point.
import seaborn as sns
sns.set(style="whitegrid")
tips = sns.load_dataset("tips")
ax = sns.swarmplot(x=tips["total_bill"])
This approach is predicated on knowing the index of the data point you wish to highlight, but it should work - although if you have multiple swarmplots on a single Axes instance it will become slightly more complex.
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
sns.set(style="whitegrid")
tips = sns.load_dataset("tips")
ax = sns.swarmplot(x=tips["total_bill"])
artists = ax.get_children()
offsets = []
for a in artists:
    # the swarm points are stored in a PathCollection
    if type(a) is matplotlib.collections.PathCollection:
        offsets = a.get_offsets()
        break
plt.scatter(offsets[50,0], offsets[50,1], marker='o', color='orange', zorder=10)
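The core of this approach, reading the point coordinates back from the PathCollection and drawing one of them again on top, can be tried without seaborn. This is only a sketch with a plain matplotlib scatter standing in for the swarmplot; the index 50 is the same arbitrary choice as above. With an actual swarmplot, some seaborn versions may only finalize the swarm positions when the figure is drawn, so calling fig.canvas.draw() before get_offsets() may be necessary; that is an assumption worth checking against the installed version.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import PathCollection

rng = np.random.default_rng(0)
fig, ax = plt.subplots()

# Plain scatter used as a stand-in for the swarmplot (assumption)
ax.scatter(rng.normal(size=100), rng.normal(size=100))

# ax.collections holds the collections of the axes; take the first PathCollection
points = next(c for c in ax.collections if isinstance(c, PathCollection))
offsets = points.get_offsets()  # array of shape (n_points, 2) with x, y positions

# Draw the chosen point again on top, in a different color
idx = 50
ax.scatter(offsets[idx, 0], offsets[idx, 1], marker='o', color='orange', zorder=10)
```

Using isinstance with ax.collections is a slightly more direct variant of looping over ax.get_children() and comparing types.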
You can highlight one or more points using the hue attribute if you add a grouping variable for the y axis (so that all points appear as a single group), and then use another variable to mark the points that you're interested in.
Then you can remove the y labels and styling and legend.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
# Get data and mark point you want to highlight
tips = sns.load_dataset("tips")
tips['highlighted_point'] = 0
tips.loc[tips[tips.total_bill > 50].index, 'highlighted_point'] = 1
# Add holding 'group' variable so they appear as one
tips['y_variable'] = 'testing'
# Use 'hue' to differentiate the highlighted point
ax = sns.swarmplot(x=tips["total_bill"], y=tips['y_variable'], hue=tips['highlighted_point'])
# Remove legend
ax.get_legend().remove()
# Hide y axis formatting
ax.set_ylabel('')
ax.get_yaxis().set_ticks([])
plt.show()
Is there a way to add a secondary legend to a scatterplot, where the size of the scatter is proportional to some data?
I have written the following code that generates a scatterplot. The color of the scatter represents the year (and is taken from a user-defined df) while the size of the scatter represents variable 3 (also taken from a df but is raw data):
import matplotlib.pyplot as plt
import pandas as pd
colors = pd.DataFrame({'1985':'red','1990':'b','1995':'k','2000':'g','2005':'m','2010':'y'}, index=[0,1,2,3,4,5])
fig = plt.figure()
ax = fig.add_subplot(111)
for i in df.keys():
    df[i].plot(kind='scatter',x='variable1',y='variable2',ax=ax,label=i,s=df[i]['variable3']/100, c=colors[i])
ax.legend(loc='upper right')
ax.set_xlabel("Variable 1")
ax.set_ylabel("Variable 2")
This code (with my data) produces the following graph:
So while the colors/years are well and clearly defined, the size of the scatter is not.
How can I add a secondary or additional legend that defines what the size of the scatter means?
You will need to create the second legend yourself, i.e. you need to create some artists to populate the legend with. In the case of a scatter we can use a normal plot and set the marker accordingly.
This is shown in the example below. To actually get a second legend, the first legend must be added to the axes as an artist, so that the new legend does not replace it.
import matplotlib.pyplot as plt
import matplotlib.colors
import numpy as np; np.random.seed(1)
import pandas as pd
plt.rcParams["figure.subplot.right"] = 0.8
v = np.random.rand(30,4)
v[:,2] = np.random.choice(np.arange(1980,2015,5), size=30)
v[:,3] = np.random.randint(5,13,size=30)
df= pd.DataFrame(v, columns=["x","y","year","quality"])
df.year = df.year.values.astype(int)
fig, ax = plt.subplots()
for i, (name, dff) in enumerate(df.groupby("year")):
    c = matplotlib.colors.to_hex(plt.cm.jet(i/7.))
    dff.plot(kind='scatter',x='x',y='y', label=name, c=c,
             s=dff.quality**2, ax=ax)
leg = plt.legend(loc=(1.03,0), title="Year")
ax.add_artist(leg)
h = [plt.plot([],[], color="gray", marker="o", ms=i, ls="")[0] for i in range(5,13)]
plt.legend(handles=h, labels=range(5,13),loc=(1.03,0.5), title="Quality")
plt.show()
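Matplotlib (version 3.1 and later) also offers legend_elements() on the object returned by scatter, which generates the legend handles for colors or sizes. The following sketch uses random demo data and assumes a single scatter call with numeric year values as colors, which differs from the per-year loop above. Since the sizes are passed as quality**2, func=np.sqrt converts them back to quality values for the labels.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random demo data (assumption), similar in spirit to the example above
rng = np.random.default_rng(1)
x, y = rng.random(30), rng.random(30)
year = rng.choice(np.arange(1985, 2015, 5), size=30)
quality = rng.integers(5, 13, size=30)

fig, ax = plt.subplots()
sc = ax.scatter(x, y, c=year, s=quality**2, cmap='viridis')

# First legend: one entry per color (year); add it as an artist so it is kept
leg1 = ax.legend(*sc.legend_elements(prop='colors'), loc='upper left', title='Year')
ax.add_artist(leg1)

# Second legend: marker sizes, with labels mapped back from size to quality
handles, labels = sc.legend_elements(prop='sizes', num=4, func=np.sqrt, color='gray')
ax.legend(handles, labels, loc='lower right', title='Quality')
```

As in the example above, ax.add_artist(leg1) is what keeps the first legend from being replaced by the second call to ax.legend. Calling plt.show() afterwards displays the figure.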
Have a look at http://matplotlib.org/users/legend_guide.html.
It shows how to have multiple legends (about halfway down) and there is another example that shows how to set the marker size.
If that doesn't work, then you can also create a custom legend (last example).