Overlaying multiple histograms using pandas - python

I have two or three csv files with the same header and would like to draw the histograms for each column overlaying one another on the same plot.
The following code gives me two separate figures, each containing all histograms for each of the files. Is there a compact way to go about plotting them together on the same figure using pandas/matplot lib? I imagine something close to this but using dataframes.
Code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('input1.csv')
df2 = pd.read_csv('input2.csv')
df.hist(bins=20)
df2.hist(bins=20)
plt.show()

In [18]: from pandas import DataFrame
In [19]: from numpy.random import randn
In [20]: df = DataFrame(randn(10, 2))
In [21]: df2 = DataFrame(randn(10, 2))
In [22]: axs = df.hist()
In [23]: for ax, (colname, values) in zip(axs.flat, df2.iteritems()):
....: values.hist(ax=ax, bins=10)
....:
In [24]: draw()
gives

The main issue of overlaying the histograms of two (or more) dataframes containing the same variables in side-by-side plots within a single figure has been already solved in the answer by Phillip Cloud.
This answer provides a solution to the issue raised by the author of the question (in the comments to the accepted answer) regarding how to enforce the same number of bins and range for the variables common to both dataframes. This can be accomplished by creating a list of bins common to all variables of both dataframes. In fact, this answer goes a little bit further by adjusting the plots for cases where the different variables contained in each dataframe cover slightly different ranges (but still within the same order of magnitude), as illustrated in the following example:
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
from matplotlib.lines import Line2D
# Set seed for random data
rng = np.random.default_rng(seed=1)
# Create two similar dataframes each containing two random variables,
# with df2 twice the size of df1
df1_size = 1000
df1 = pd.DataFrame(dict(var1 = rng.exponential(scale=1.0, size=df1_size),
var2 = rng.normal(loc=40, scale=5, size=df1_size)))
df2_size = 2*df1_size
df2 = pd.DataFrame(dict(var1 = rng.exponential(scale=2.0, size=df2_size),
var2 = rng.normal(loc=50, scale=10, size=df2_size)))
# Combine the dataframes to extract the min/max values of each variable
df_combined = pd.concat([df1, df2])
vars_min = [df_combined[var].min() for var in df_combined]
vars_max = [df_combined[var].max() for var in df_combined]
# Create custom bins based on the min/max of all values from both
# dataframes to ensure that in each histogram the bins are aligned
# making them easily comparable
nbins = 30
bin_edges, step = np.linspace(min(vars_min), max(vars_max), nbins+1, retstep=True)
# Create figure by combining the outputs of two pandas df.hist() function
# calls using the 'step' type of histogram to improve plot readability
htype = 'step'
alpha = 0.7
lw = 2
axs = df1.hist(figsize=(10,4), bins=bin_edges, histtype=htype,
linewidth=lw, alpha=alpha, label='df1')
df2.hist(ax=axs.flatten(), grid=False, bins=bin_edges, histtype=htype,
linewidth=lw, alpha=alpha, label='df2')
# Adjust x-axes limits based on min/max values and step between bins, and
# remove top/right spines: if, contrary to this example dataset, var1 and
# var2 cover the same range, setting the x-axes limits with this loop is
# not necessary
for ax, v_min, v_max in zip(axs.flatten(), vars_min, vars_max):
ax.set_xlim(v_min-2*step, v_max+2*step)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Edit legend to get lines as legend keys instead of the default polygons:
# use legend handles and labels from any of the axes in the axs object
# (here taken from first one) seeing as the legend box is by default only
# shown in the last subplot when using the plt.legend() function.
handles, labels = axs.flatten()[0].get_legend_handles_labels()
lines = [Line2D([0], [0], lw=lw, color=h.get_facecolor()[:-1], alpha=alpha)
for h in handles]
plt.legend(lines, labels, frameon=False)
plt.suptitle('Pandas', x=0.5, y=1.1, fontsize=14)
plt.show()
It is worth noting that the seaborn package provides a more convenient way to create this kind of plot, where contrary to pandas the bins are automatically aligned. The only downside is that the dataframes must first be combined and reshaped to long format, as shown in this example using the same dataframes and bins as before:
import seaborn as sns # v 0.11.0
# Combine dataframes and convert the combined dataframe to long format
df_concat = pd.concat([df1, df2], keys=['df1','df2']).reset_index(level=0)
df_melt = df_concat.melt(id_vars='level_0', var_name='var_id')
# Create figure using seaborn displot: note that the bins are automatically
# aligned thanks the 'common_bins' parameter of the seaborn histplot function
# (called here with 'kind='hist'') that is set to True by default. Here, the
# bins from the previous example are used to make the figures more comparable.
# Also note that the facets share the same x and y axes by default, this can
# be changed when var1 and var2 have different ranges and different
# distribution shapes, as it is the case in this example.
g = sns.displot(df_melt, kind='hist', x='value', col='var_id', hue='level_0',
element='step', bins=bin_edges, fill=False, height=4,
facet_kws=dict(sharex=False, sharey=False))
# For some reason setting sharex as above does not automatically adjust the
# x-axes limits (even when not setting a bins argument, maybe due to a bug
# with this package version) which is why this is done in the following loop,
# but note that you still need to set 'sharex=False' in displot, or else
# 'ax.set.xlim' will have no effect.
for ax, v_min, v_max in zip(g.axes.flatten(), vars_min, vars_max):
ax.set_xlim(v_min-2*step, v_max+2*step)
# Additional formatting
g.legend.set_bbox_to_anchor((.9, 0.75))
g.legend.set_title('')
plt.suptitle('Seaborn', x=0.5, y=1.1, fontsize=14)
plt.show()
As you may notice, the histogram line is cut off at the limits of the list of bin edges (not visible on the maximum side due to scale). To get a line more similar to the example with pandas, an empty bin can be added at each extremity of the list of bins, like this:
bin_edges = np.insert(bin_edges, 0, bin_edges.min()-step)
bin_edges = np.append(bin_edges, bin_edges.max()+step)
This example also illustrates the limits to this approach of setting common bins for both facets. Seeing as the ranges of var1 var2 are somewhat different and that 30 bins are used to cover the combined range, the histogram for var1 contains rather few bins and the histogram for var2 has slightly more bins than necessary. To my knowledge, there is no straightforward way of assigning a different list of bins to each facet when calling the plotting functions df.hist() and displot(df). So for cases where variables cover significantly different ranges, these figures would have to be created from scratch using matplotlib or some other plotting library.

Related

How to put two Pandas box plots next to each other? Or group them by variable?

I have two data frames (df1 and df2). Each have the same 10 variables with different values.
I created box plots of the variables in the data frames like so:
df1.boxplot()
df2.boxplot()
I get two graphs of 10 box plots next to each other for each variable. The actual output is the second graph, however, as obviously Python just runs the code in order.
Instead, I would either like these box plots to appear side by side OR ideally, I would like 10 graphs (one for each variable) comparing each variable by data frame (e.g. one graph for the first variable with two box plots in it, one for each data frame). Is that possible just using python library or do I have to use Matplotlib?
Thanks!
To get graphs, standard Python isn't enough. You'd need a graphical library such as matplotlib. Seaborn extends matplotlib to ease the creation of complex statistical plots. To work with Seaborn, the dataframes should be converted to long form (e.g. via pandas' melt) and then combined into one large dataframe.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# suppose df1 and df2 are dataframes, each with the same 10 columns
df1 = pd.DataFrame({i: np.random.randn(100).cumsum() for i in 'abcdefghij'})
df2 = pd.DataFrame({i: np.random.randn(150).cumsum() for i in 'abcdefghij'})
# pd.melt converts the dataframe to long form, pd.concat combines them
df = pd.concat({'df1': df1.melt(), 'df2': df2.melt()}, names=['source', 'old_index'])
# convert the source index to a column, and reset the old index
df = df.reset_index(level=0).reset_index(drop=True)
sns.boxplot(data=df, x='variable', y='value', hue='source', palette='turbo')
This creates boxes for each of the original columns, comparing the two dataframes:
Optionally, you could create multiple subplots with that same information:
sns.catplot(data=df, kind='box', col='variable', y='value', x='source',
palette='turbo', height=3, aspect=0.5, col_wrap=5)
By default, the y-axes are shared. You can disable the sharing via sharey=False. Here is an example, which also removes the repeated x axes and creates a common legend:
g = sns.catplot(data=df, kind='box', col='variable', y='value', x='source', hue='source', dodge=False,
palette='Reds', height=3, aspect=0.5, col_wrap=5, sharey=False)
g.set(xlabel='', xticks=[]) # remove x labels and ticks
g.add_legend()
PS: If you simply want to put two pandas boxplots next to each other, you can create a figure with two subplots, and pass the axes to pandas. (Note that pandas plotting is just an interface towards matplotlib.)
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 5))
df1.boxplot(ax=ax1)
ax1.set_title('df1')
df2.boxplot(ax=ax2)
ax2.set_title('df2')
plt.tight_layout()
plt.show()

Two seaborn plots with different scales displayed on same plot but bars overlap

I am trying to include 2 seaborn countplots with different scales on the same plot but the bars display as different widths and overlap as shown below. Any idea how to get around this?
Setting dodge=False, doesn't work as the bars appear on top of each other.
The main problem of the approach in the question, is that the first countplot doesn't take hue into account. The second countplot won't magically move the bars of the first. An additional categorical column could be added, only taking on the 'weekend' value. Note that the column should be explicitly made categorical with two values, even if only one value is really used.
Things can be simplified a lot, just starting from the original dataframe, which supposedly already has a column 'is_weeked'. Creating the twinx ax beforehand allows to write a loop (so writing the call to sns.countplot() only once, with parameters).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
sns.set_style('dark')
# create some demo data
data = pd.DataFrame({'ride_hod': np.random.normal(13, 3, 1000).astype(int) % 24,
'is_weekend': np.random.choice(['weekday', 'weekend'], 1000, p=[5 / 7, 2 / 7])})
# now, make 'is_weekend' a categorical column (not just strings)
data['is_weekend'] = pd.Categorical(data['is_weekend'], ['weekday', 'weekend'])
fig, ax1 = plt.subplots(figsize=(16, 6))
ax2 = ax1.twinx()
for ax, category in zip((ax1, ax2), data['is_weekend'].cat.categories):
sns.countplot(data=data[data['is_weekend'] == category], x='ride_hod', hue='is_weekend', palette='Blues', ax=ax)
ax.set_ylabel(f'Count ({category})')
ax1.legend_.remove() # both axes got a legend, remove one
ax1.set_xlabel('Hour of Day')
plt.tight_layout()
plt.show()
use plt.xticks(['put the label by hand in your x label'])

matplotlib: plotting more than one figure at once

I am working with 3 pandas dataframes having the same column structures(number and type), only that the datasets are for different years.
I would like to plot the ECDF for each of the dataframes, but everytime I do this, I do it individually (lack python skills). So also, one of the figures (2018) is scaled differently on x-axis making it a bit difficult to compare. Here's how I do it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from empiricaldist import Cdf
df1 = pd.read_csv('2016.csv')
df2 = pd.read_csv('2017.csv')
df3 = pd.read_csv('2018.csv')
#some info about the dfs
df1.columns.values
array(['id', 'trip_id', 'distance', 'duration', 'speed', 'foot', 'bike',
'car', 'bus', 'metro', 'mode'], dtype=object)
modal_class = df1['mode']
print(modal_class[:5])
0 bus
1 metro
2 bike
3 foot
4 car
def decorate_ecdf(title, x, y):
plt.xlabel(x)
plt.ylabel(y)
plt.title(title)
#plotting the ecdf for 2016 dataset
for name, group in df1.groupby('mode'):
Cdf.from_seq(group.speed).plot()
title, x, y = 'Speed distribution by travel mode (April 2016)','speed (m/s)', 'ECDF'
decorate_ecdf(title,x,y)
#plotting the ecdf for 2017 dataset
for name, group in df2.groupby('mode'):
Cdf.from_seq(group.speed).plot()
title, x, y = 'Speed distribution by travel mode (April 2017)','speed (m/s)', 'ECDF'
decorate_ecdf(title,x,y)
#plotting the ecdf for 2018 dataset
for name, group in df3.groupby('mode'):
Cdf.from_seq(group.speed).plot()
title, x, y = 'Speed distribution by travel mode (April 2018)','speed (m/s)', 'ECDF'
decorate_ecdf(title,x,y)
Output:
I am pretty sure this isn't the pythonist way of doing it, but a dirty way to get the work done. You can also see how the 2018 plot is scaled differently on the x-axis.
Is there a way to enforce that all figures are scaled the same way?
How do I re-write my code such that the figures are plotted by calling a function once?
When using pyplot, you can plot using an implicit method with plt.plot(), or you can use the explicit method, by creating and calling the figure and axis objects with fig, ax = plt.subplots(). What happened here is, in my view, a side-effect from using the implicit method.
For example, you can use two pd.DataFrame.plot() commands and have them share the same axis by supplying the returned axis to the other function.
foo = pd.DataFrame(dict(a=[1,2,3], b=[4,5,6]))
bar = pd.DataFrame(dict(c=[3,2,1], d=[6,5,4]))
ax = foo.plot()
bar.plot(ax=ax) # ax object is updated
ax.plot([0,3], [1,1], 'k--')
You can also create the figure and axis object previously, and supply as needed. Also, it's perfectly file to have multiple plot commands. Often, my code is 25% work, 75% fiddling with plots. Don't try to be clever and lose on readability.
fig, axes = plt.subplots(nrows=3, ncols=1, sharex=True)
# In this case, axes is a numpy array with 3 axis objects
# You can access the objects with indexing
# All will have the same x range
axes[0].plot([-1, 2], [1,1])
axes[1].plot([-2, 1], [1,1])
axes[2].plot([1,3],[1,1])
So you can combine both of these snippets to your own code. First, create the figure and axes object, then plot each dataframe, but supply the correct axis to them with the keyword ax.
Also, suppose you have three axis objects and they have different x limits. You can get them all, then set the three to have the same minimum value and the same maximum value. For example:
axis_list = [ax1, ax2, ax3] # suppose you created these separately and want to enforce the same axis limits
minimum_x = min([ax.get_xlim()[0] for ax in axis_list])
maximum_x = max([ax.get_xlim()[1] for ax in axis_list])
for ax in axis_list:
ax.set_xlim(minimum_x, maximum_x)

How to use different axis scales in pandas' DataFrame.plot.hist?

I find DataFrame.plot.hist to be amazingly convenient, but I cannot find a solution in this case.
I want to plot the distribution of many columns in the dataset. The problem is that pandas retains the same scale on all x axes, rendering most of the plots useless. Here is the code I'm using:
X.plot.hist(subplots=True, layout=(13, 6), figsize=(20, 45), bins=50, sharey=False, sharex=False)
plt.show()
And here's a section of the result:
It appears that the issue is that pandas uses the same bins on all the columns, irrespectively of their values. Is there a convenient solution in pandas or am I forced to do it by hand?
I centered the data (zero mean and unit variance) and the result improved a little, but it's still not acceptable.
There are a couple of options, here is the code and output:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Dummy data - value ranges differ a lot between columns
X = pd.DataFrame()
for i in range(18):
X['COL0{0}'.format(i+38)]=(2**i)*np.random.random(1000)
# Method 1 - just using the hist function to generate each plot
X.hist(layout=(3, 6), figsize=(20, 10), sharey=False, sharex=False, bins=50)
plt.title('Method 1')
plt.show()
# Method 2 - generate each plot separately
cols = plt.cm.spectral(np.arange(1,255,13))
fig, axes = plt.subplots(3,6,figsize=(20,10))
for index, column in enumerate(X.columns):
ax = axes.flatten()[index]
ax.hist(X[column],bins=50, label=column, fc=cols[index])
ax.legend(loc='upper right')
ax.set_ylim((0,1.2*ax.get_ylim()[1]))
fig.suptitle('Method 2')
fig.show()
The first plot:
The second plot:
I would definitely recommend the second method as you have much more control over the individual plots, for example you can change the axes scales, labels, grid parameters, and almost anything else.
I couldn't find anything that would allow you to modify the original plot.hist bins to accept individually calculated bins.
I hope this helps!

How to plot a histogram by different groups in matplotlib?

I have a table like:
value type
10 0
12 1
13 1
14 2
Generate a dummy data:
import numpy as np
value = np.random.randint(1, 20, 10)
type = np.random.choice([0, 1, 2], 10)
I want to accomplish a task in Python 3 with matplotlib (v1.4):
plot a histogram of value
group by type, i.e. use different colors to differentiate types
the position of the "bars" should be "dodge", i.e. side by side
since the range of value is small, I would use identity for bins, i.e. the width of a bin is 1
The questions are:
how to assign colors to bars based on the values of type and draw colors from colormap (e.g. Accent or other cmap in matplotlib)? I don't want to use named color (i.e. 'b', 'k', 'r')
the bars in my histogram overlap each other, how to "dodge" the bars?
Note
I have tried on Seaborn, matplotlib and pandas.plot for two hours and failed to get the desired histogram.
I read the examples and Users' Guide of matplotlib. Surprisingly, I found no tutorial about how to assign colors from colormap.
I have searched on Google but failed to find a succinct example.
I guess one could accomplish the task with matplotlib.pyplot, without import a bunch of modules such as matplotlib.cm, matplotlib.colors.
For your first question, we can create a dummy column equal to 1, and then generate counts by summing this column, grouped by value and type.
For your second question you can pass the colormap directly into plot using the colormap parameter:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn
seaborn.set() #make the plots look pretty
df = pd.DataFrame({'value': value, 'type': type})
df['dummy'] = 1
ag = df.groupby(['value','type']).sum().unstack()
ag.columns = ag.columns.droplevel()
ag.plot(kind = 'bar', colormap = cm.Accent, width = 1)
plt.show()
Whenever you need to plot a variable grouped by another (using color), seaborn usually provides a more convenient way to do that than matplotlib or pandas. So here is a solution using the seaborn histplot function:
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
import seaborn as sns # v 0.11.0
# Set parameters for random data
rng = np.random.default_rng(seed=1) # random number generator
size = 50
xmin = 1
xmax = 20
# Create random dataframe
df = pd.DataFrame(dict(value = rng.integers(xmin, xmax, size=size),
val_type = rng.choice([0, 1, 2], size=size)))
# Create histogram with discrete bins (bin width is 1), colored by type
fig, ax = plt.subplots(figsize=(10,4))
sns.histplot(data=df, x='value', hue='val_type', multiple='dodge', discrete=True,
edgecolor='white', palette=plt.cm.Accent, alpha=1)
# Create x ticks covering the range of all integer values of df['value']
ax.set_xticks(np.arange(df['value'].min(), df['value'].max()+1))
# Additional formatting
sns.despine()
ax.get_legend().set_frame_on(False)
plt.show()
As you can notice, this being a histogram and not a bar plot, there is no space between the bars except where values of the x axis are not present in the dataset, like for values 12 and 14.
Seeing as the accepted answer provided a bar plot in pandas and that a bar plot may be a relevant choice for displaying a histogram in certain situations, here is how to create one with seaborn using the countplot function:
# For some reason the palette argument in countplot is not processed the
# same way as in histplot so here I fetch the colors from the previous
# example to make it easier to compare them
colors = [c for c in set([patch.get_facecolor() for patch in ax.patches])]
# Create bar chart of counts of each value grouped by type
fig, ax = plt.subplots(figsize=(10,4))
sns.countplot(data=df, x='value', hue='val_type', palette=colors,
saturation=1, edgecolor='white')
# Additional formatting
sns.despine()
ax.get_legend().set_frame_on(False)
plt.show()
As this is a bar plot, the values 12 and 14 are not included which produces a somewhat deceitful plot as no empty space is shown for those values. On the other hand, there is some space between each group of bars which makes it easier to see what value each bar belongs to.

Categories