I am working with 3 pandas dataframes that have the same column structure (number and type of columns); the datasets just cover different years.
I would like to plot the ECDF for each of the dataframes, but every time I do this I plot each one individually (I lack the Python skills to do better). On top of that, one of the figures (2018) is scaled differently on the x-axis, which makes it a bit difficult to compare. Here's how I do it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from empiricaldist import Cdf
df1 = pd.read_csv('2016.csv')
df2 = pd.read_csv('2017.csv')
df3 = pd.read_csv('2018.csv')
#some info about the dfs
df1.columns.values
array(['id', 'trip_id', 'distance', 'duration', 'speed', 'foot', 'bike',
'car', 'bus', 'metro', 'mode'], dtype=object)
modal_class = df1['mode']
print(modal_class[:5])
0 bus
1 metro
2 bike
3 foot
4 car
def decorate_ecdf(title, x, y):
    plt.xlabel(x)
    plt.ylabel(y)
    plt.title(title)

#plotting the ecdf for 2016 dataset
for name, group in df1.groupby('mode'):
    Cdf.from_seq(group.speed).plot()
title, x, y = 'Speed distribution by travel mode (April 2016)', 'speed (m/s)', 'ECDF'
decorate_ecdf(title, x, y)

#plotting the ecdf for 2017 dataset
for name, group in df2.groupby('mode'):
    Cdf.from_seq(group.speed).plot()
title, x, y = 'Speed distribution by travel mode (April 2017)', 'speed (m/s)', 'ECDF'
decorate_ecdf(title, x, y)

#plotting the ecdf for 2018 dataset
for name, group in df3.groupby('mode'):
    Cdf.from_seq(group.speed).plot()
title, x, y = 'Speed distribution by travel mode (April 2018)', 'speed (m/s)', 'ECDF'
decorate_ecdf(title, x, y)
Output:
I am pretty sure this isn't the Pythonic way of doing it, just a dirty way to get the work done. You can also see how the 2018 plot is scaled differently on the x-axis.
Is there a way to enforce that all figures are scaled the same way?
How do I re-write my code such that the figures are plotted by calling a function once?
When using pyplot you can plot with the implicit interface, calling plt.plot() directly, or with the explicit interface, creating figure and axes objects with fig, ax = plt.subplots() and calling methods on them. What happened here is, in my view, a side effect of using the implicit interface.
For example, you can have two pd.DataFrame.plot() calls share the same axes by supplying the axes returned by the first call to the second:
foo = pd.DataFrame(dict(a=[1,2,3], b=[4,5,6]))
bar = pd.DataFrame(dict(c=[3,2,1], d=[6,5,4]))
ax = foo.plot()
bar.plot(ax=ax) # ax object is updated
ax.plot([0,3], [1,1], 'k--')
You can also create the figure and axes objects beforehand and supply them as needed. It's also perfectly fine to have multiple plot commands; often my code is 25% work and 75% fiddling with plots. Don't try to be clever at the cost of readability.
fig, axes = plt.subplots(nrows=3, ncols=1, sharex=True)
# In this case, axes is a numpy array with 3 axis objects
# You can access the objects with indexing
# All will have the same x range
axes[0].plot([-1, 2], [1,1])
axes[1].plot([-2, 1], [1,1])
axes[2].plot([1,3],[1,1])
So you can combine both of these snippets to your own code. First, create the figure and axes object, then plot each dataframe, but supply the correct axis to them with the keyword ax.
Also, suppose you have three axis objects and they have different x limits. You can get them all, then set the three to have the same minimum value and the same maximum value. For example:
axis_list = [ax1, ax2, ax3]  # suppose you created these separately and want to enforce the same axis limits
minimum_x = min([ax.get_xlim()[0] for ax in axis_list])
maximum_x = max([ax.get_xlim()[1] for ax in axis_list])
for ax in axis_list:
    ax.set_xlim(minimum_x, maximum_x)
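To answer both of your questions at once: create the figure and the three axes up front with sharex=True (and sharey=True, since an ECDF always runs from 0 to 1), then write one function that plots each dataframe onto its own axes. I can't run this against your CSV files, so treat it as a sketch; it reuses your decorate_ecdf helper and assumes that Cdf.plot() draws on the current axes and forwards the label keyword to matplotlib, as in the empiricaldist examples I've seen.

import matplotlib.pyplot as plt
from empiricaldist import Cdf

def plot_speed_ecdfs(dfs, titles, x='speed (m/s)', y='ECDF'):
    # One row of subplots; shared axes force the same scale for every year
    fig, axes = plt.subplots(nrows=1, ncols=len(dfs), sharex=True, sharey=True,
                             figsize=(5 * len(dfs), 4))
    for df, ax, title in zip(dfs, axes, titles):
        plt.sca(ax)  # Cdf.plot() draws on the current axes, so point it at this subplot
        for name, group in df.groupby('mode'):
            Cdf.from_seq(group.speed).plot(label=name)
        decorate_ecdf(title, x, y)  # reuses the helper defined in the question
    plt.legend()       # legend ends up on the last panel; move it if you prefer
    plt.tight_layout()
    return fig, axes

plot_speed_ecdfs([df1, df2, df3],
                 ['Speed distribution by travel mode (April 2016)',
                  'Speed distribution by travel mode (April 2017)',
                  'Speed distribution by travel mode (April 2018)'])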
I am trying to include 2 seaborn countplots with different scales on the same plot but the bars display as different widths and overlap as shown below. Any idea how to get around this?
Setting dodge=False doesn't work either, as the bars then appear on top of each other.
The main problem with the approach in the question is that the first countplot doesn't take hue into account, and the second countplot won't magically move the bars of the first. An additional categorical column could be added that only takes on the 'weekend' value. Note that the column should be explicitly made categorical with two values, even if only one value is actually used.
Things can be simplified a lot by starting from the original dataframe, which supposedly already has a column 'is_weekend'. Creating the twinx axes beforehand makes it possible to write a loop, so the call to sns.countplot() is written only once, with its parameters.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
sns.set_style('dark')
# create some demo data
data = pd.DataFrame({'ride_hod': np.random.normal(13, 3, 1000).astype(int) % 24,
                     'is_weekend': np.random.choice(['weekday', 'weekend'], 1000, p=[5 / 7, 2 / 7])})
# now, make 'is_weekend' a categorical column (not just strings)
data['is_weekend'] = pd.Categorical(data['is_weekend'], ['weekday', 'weekend'])
fig, ax1 = plt.subplots(figsize=(16, 6))
ax2 = ax1.twinx()
for ax, category in zip((ax1, ax2), data['is_weekend'].cat.categories):
    sns.countplot(data=data[data['is_weekend'] == category], x='ride_hod', hue='is_weekend', palette='Blues', ax=ax)
    ax.set_ylabel(f'Count ({category})')
ax1.legend_.remove() # both axes got a legend, remove one
ax1.set_xlabel('Hour of Day')
plt.tight_layout()
plt.show()
You can also put the x tick labels in by hand with plt.xticks(), passing the tick positions and the labels you want.
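For instance, a minimal illustration of that idea (the bar values and labels here are made up):

import matplotlib.pyplot as plt

plt.bar(range(3), [5, 3, 7])
# Put the labels in by hand at the tick positions you want
plt.xticks(range(3), ['weekday', 'weekend', 'holiday'])
plt.show()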
I have a boolean time series that I want to use to determine the parts of the plot that should be shaded.
Currently I have:
ax1.fill_between(data.index, r_min, r_max, where=data['USREC']==True, alpha=0.2)
where, r_min and r_max are just the min and max of the y-axis.
But fill_between doesn't fill all the way to the top and bottom of the plot, so I wanted to use axvspan() instead.
Is there an easy way to do this, given that axvspan only takes coordinates? The only way I can think of is to group all the dates that are next to each other and are True, then take the first and last of those dates and pass them to axvspan.
Thank you
You can still use fill_between, if you like. However, instead of specifying the y-coordinates in data coordinates (for which it is not a priori clear how large they need to be), you can specify them in axes coordinates. This can be achieved using a blended transform, where the x part is in data coordinates and the y part is in axes coordinates. The x-axis transform (ax.get_xaxis_transform()) is exactly such a transform. (This is not very surprising, since the x-axis is always independent of the y-coordinates.) So
ax.fill_between(data.index, 0,1, where=data['USREC'], transform=ax.get_xaxis_transform())
would do the job.
Here is a complete example:
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
x = np.linspace(0,100,350)
y = np.cumsum(np.random.normal(size=len(x)))
bo = np.zeros(len(y))
bo[y>5] = 1
fig, ax = plt.subplots()
ax.fill_between(x, 0, 1, where=bo, alpha=0.4, transform=ax.get_xaxis_transform())
plt.plot(x,y)
plt.show()
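For reference, the grouping approach described in the question (finding runs of consecutive True values and shading each run with axvspan) can also be written fairly compactly with the usual shift/cumsum run-labelling idiom. This is a standalone sketch with made-up demo data standing in for data['USREC']:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(0)

# Demo data: a date-indexed series plus a boolean flag (a stand-in for data['USREC'])
idx = pd.date_range('2000-01-01', periods=350, freq='D')
y = pd.Series(np.cumsum(np.random.normal(size=len(idx))), index=idx)
flag = y > 5

fig, ax = plt.subplots()
ax.plot(idx, y)

# Label consecutive runs of equal flag values, then shade each True run
runs = (flag != flag.shift()).cumsum()
for _, run in flag.groupby(runs):
    if run.iloc[0]:
        ax.axvspan(run.index[0], run.index[-1], alpha=0.2)
plt.show()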
Building on data from this question on faceting through looping, I was wondering if it is possible to call ax = df.plot(kind='bar') and assign the resulting AxesSubplot object to a specific axes position (like facet row 1, col 1, 2, 3, etc.)?
The reason I am asking is not really for faceting bar plots as such, but for faceting map production using the geopandas library. If it worked with bar plots, it might also work with geopandas geodataframe.plot() calls. I can't plot a map from axes themselves, therefore I seem to need to go the other way around--get axes as a byproduct of a plot call, and then place that as appropriate in a grid.
Non-working example: the loop is really pseudocode here; I don't move the axis index to plot a different panel each time (in fact, I overwrite the axes object from the subplots call). That is, though, what I would like to do: map the axes object generated by the plot call onto the axes (coordinate space) from the subplots call.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

N = 100
industry = ['a', 'b', 'c']
city = ['x', 'y', 'z']
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
jobs = np.random.randint(low=1, high=250, size=N)
df_city = pd.DataFrame({'industry': ind, 'city': cty, 'jobs': jobs})

## how many panels do we need?
cols = df_city.city.value_counts().shape[0]
fig, axes = plt.subplots(1, cols, figsize=(8, 8))
for x, city in enumerate(df_city.city.value_counts().index.values):
    data = df_city[df_city['city'] == city]
    data = data.groupby(['industry']).jobs.sum()
    axes = data.plot(kind='bar')
    print(type(axes))
fig.suptitle('Employment By Industry By City', fontsize=20)
<class 'matplotlib.axes._subplots.AxesSubplot'>
<class 'matplotlib.axes._subplots.AxesSubplot'>
<class 'matplotlib.axes._subplots.AxesSubplot'>
If I understand right, the accepted answer in that other question is overly pessimistic about the chances of doing this with pandas. How about this:
from matplotlib import pyplot

for ix, (key, group) in enumerate(df_city.groupby('industry')):
    ax = pyplot.subplot(1, 3, ix + 1)
    group.groupby('city')['jobs'].sum().plot(kind='bar', ax=ax)
    ax.set_xlabel('industry: {}'.format(key))
With that you get:
The idea is to group by the variable that you want to separate the subplots on, and iterate over the groups. For each group, use pyplot.subplot to target the desired subplot, and use another groupby on the group data to get the summary values to plot. You can pass the ax argument to DataFrame.plot to tell it to plot into an existing axes object. (I can't tell if you want them grouped by industry first and then city within each plot, or the other way around, but you just switch 'industry' and 'city' in the two groupby calls if you want it the other way around.)
In this version, the axes limits are not equalized across the subplots. But the general idea can be polished to handle that.
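For example, one way to polish it (a sketch that rebuilds demo data similar to the df_city frame from the question, just to keep the block self-contained) is to create the grid with sharey=True and hand each group its own axes via the ax keyword:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Rebuild demo data similar to df_city from the question
N = 100
rng = np.random.default_rng(0)
df_city = pd.DataFrame({'industry': rng.choice(['a', 'b', 'c'], N),
                        'city': rng.choice(['x', 'y', 'z'], N),
                        'jobs': rng.integers(1, 250, N)})

groups = df_city.groupby('city')
fig, axes = plt.subplots(1, len(groups), figsize=(8, 4), sharey=True)
for ax, (city, group) in zip(axes, groups):
    # Pass the target axes explicitly so each panel lands in its own slot
    group.groupby('industry')['jobs'].sum().plot(kind='bar', ax=ax)
    ax.set_xlabel('city: {}'.format(city))
fig.suptitle('Employment By Industry By City', fontsize=16)
plt.show()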
I have two Pandas DataFrames that I'm hoping to plot in a single figure. I'm using an IPython notebook.
I would like the legend to show the label for both of the DataFrames, but so far I've been able to get only the latter one to show. Also any suggestions as to how to go about writing the code in a more sensible way would be appreciated. I'm new to all this and don't really understand object oriented plotting.
%pylab inline
import pandas as pd
#creating data
prng = pd.period_range('1/1/2011', '1/1/2012', freq='M')
var = pd.DataFrame(randn(len(prng)), index=prng, columns=['total'])
shares = pd.DataFrame(randn(len(prng)), index=prng, columns=['average'])
#plotting
ax=var.total.plot(label='Variance')
ax=shares.average.plot(secondary_y=True,label='Average Age')
ax.left_ax.set_ylabel('Variance of log wages')
ax.right_ax.set_ylabel('Average age')
plt.legend(loc='upper center')
plt.title('Wage Variance and Mean Age')
plt.show()
This is indeed a bit confusing. I think it boils down to how Matplotlib handles the secondary axes. Pandas probably calls ax.twinx() somewhere, which superimposes a secondary axes on the first one, but this is actually a separate axes object, with its own lines, labels, and legend. Calling plt.legend() only applies to one of the axes (the active one), which in your example is the second one.
Pandas fortunately does store both axes, so you can grab all line objects from both of them and pass them to the .legend() command yourself. Given your example data:
You can plot exactly as you did:
ax = var.total.plot(label='Variance')
ax = shares.average.plot(secondary_y=True, label='Average Age')
ax.set_ylabel('Variance of log wages')
ax.right_ax.set_ylabel('Average age')
Both axes objects are available as ax (the left axes) and ax.right_ax, so you can grab the line objects from them. Matplotlib's .get_lines() returns a list, so you can merge them by simple addition.
lines = ax.get_lines() + ax.right_ax.get_lines()
The line objects have a label property which can be used to read and pass the label to the .legend() command.
ax.legend(lines, [l.get_label() for l in lines], loc='upper center')
And the rest of the plotting:
ax.set_title('Wage Variance and Mean Age')
plt.show()
edit:
It might be less confusing if you separate the Pandas (data) and the Matplotlib (plotting) parts more strictly, and avoid Pandas' built-in plotting (which only wraps Matplotlib anyway):
fig, ax = plt.subplots()
ax.plot(var.index.to_timestamp(), var.total, 'b', label='Variance')
ax.set_ylabel('Variance of log wages')
ax2 = ax.twinx()
ax2.plot(shares.index.to_timestamp(), shares.average, 'g', label='Average Age')
ax2.set_ylabel('Average age')
lines = ax.get_lines() + ax2.get_lines()
ax.legend(lines, [line.get_label() for line in lines], loc='upper center')
ax.set_title('Wage Variance and Mean Age')
plt.show()
When multiple series are plotted, the legend is not displayed by default.
The easy way to display a custom legend is to use the axes returned by the last plotted series/dataframe (my code from an IPython Notebook):
# Embed the plot
%matplotlib inline
import matplotlib.pyplot as plt
...
rates[rates.MovieID <= 25].groupby('MovieID').Rating.count().plot() # blue
(rates[rates.MovieID <= 25].groupby('MovieID').Rating.median() * 1000).plot() # green
(rates[rates.MovieID <= 25][rates.RateDelta <= 10].groupby('MovieID').Rating.count() * 2000).plot() # red
ax = (rates[rates.MovieID <= 25][rates.RateDelta <= 10].groupby('MovieID').Rating.median() * 1000).plot() # cyan
ax.legend(['Popularity', 'RateMedian', 'FirstPpl', 'FirstRM'])
You can use pd.concat to merge the two dataframes and then plot it using a secondary y-axis:
import numpy as np # For generating random data.
import pandas as pd
# Creating data.
np.random.seed(0)
prng = pd.period_range('1/1/2011', '1/1/2012', freq='M')
var = pd.DataFrame(np.random.randn(len(prng)), index=prng, columns=['total'])
shares = pd.DataFrame(np.random.randn(len(prng)), index=prng, columns=['average'])
# Plotting.
ax = (
pd.concat([var, shares], axis=1)
.rename(columns={
'total': 'Variance of Low Wages',
'average': 'Average Age'
})
.plot(
title='Wage Variance and Mean Age',
secondary_y='Average Age')
)
ax.set_ylabel('Variance of Low Wages')
ax.right_ax.set_ylabel('Average Age', rotation=-90)
I have two or three csv files with the same header and would like to draw the histograms for each column overlaying one another on the same plot.
The following code gives me two separate figures, each containing all the histograms for one of the files. Is there a compact way to plot them together on the same figure using pandas/matplotlib? I imagine something close to this, but using dataframes.
Code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('input1.csv')
df2 = pd.read_csv('input2.csv')
df.hist(bins=20)
df2.hist(bins=20)
plt.show()
from pandas import DataFrame
from numpy.random import randn
import matplotlib.pyplot as plt

df = DataFrame(randn(10, 2))
df2 = DataFrame(randn(10, 2))

# Plot the first dataframe's histograms, then draw the second into the same axes
axs = df.hist()
for ax, (colname, values) in zip(axs.flat, df2.items()):
    values.hist(ax=ax, bins=10)
plt.show()
gives
The main issue of overlaying the histograms of two (or more) dataframes containing the same variables in side-by-side plots within a single figure has already been solved in the answer by Phillip Cloud.
This answer provides a solution to the issue raised by the author of the question (in the comments to the accepted answer) regarding how to enforce the same number of bins and range for the variables common to both dataframes. This can be accomplished by creating a list of bins common to all variables of both dataframes. In fact, this answer goes a little bit further by adjusting the plots for cases where the different variables contained in each dataframe cover slightly different ranges (but still within the same order of magnitude), as illustrated in the following example:
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
from matplotlib.lines import Line2D
# Set seed for random data
rng = np.random.default_rng(seed=1)
# Create two similar dataframes each containing two random variables,
# with df2 twice the size of df1
df1_size = 1000
df1 = pd.DataFrame(dict(var1 = rng.exponential(scale=1.0, size=df1_size),
var2 = rng.normal(loc=40, scale=5, size=df1_size)))
df2_size = 2*df1_size
df2 = pd.DataFrame(dict(var1 = rng.exponential(scale=2.0, size=df2_size),
var2 = rng.normal(loc=50, scale=10, size=df2_size)))
# Combine the dataframes to extract the min/max values of each variable
df_combined = pd.concat([df1, df2])
vars_min = [df_combined[var].min() for var in df_combined]
vars_max = [df_combined[var].max() for var in df_combined]
# Create custom bins based on the min/max of all values from both
# dataframes to ensure that in each histogram the bins are aligned
# making them easily comparable
nbins = 30
bin_edges, step = np.linspace(min(vars_min), max(vars_max), nbins+1, retstep=True)
# Create figure by combining the outputs of two pandas df.hist() function
# calls using the 'step' type of histogram to improve plot readability
htype = 'step'
alpha = 0.7
lw = 2
axs = df1.hist(figsize=(10,4), bins=bin_edges, histtype=htype,
linewidth=lw, alpha=alpha, label='df1')
df2.hist(ax=axs.flatten(), grid=False, bins=bin_edges, histtype=htype,
linewidth=lw, alpha=alpha, label='df2')
# Adjust x-axes limits based on min/max values and step between bins, and
# remove top/right spines: if, contrary to this example dataset, var1 and
# var2 cover the same range, setting the x-axes limits with this loop is
# not necessary
for ax, v_min, v_max in zip(axs.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
# Edit legend to get lines as legend keys instead of the default polygons:
# use legend handles and labels from any of the axes in the axs object
# (here taken from first one) seeing as the legend box is by default only
# shown in the last subplot when using the plt.legend() function.
handles, labels = axs.flatten()[0].get_legend_handles_labels()
lines = [Line2D([0], [0], lw=lw, color=h.get_facecolor()[:-1], alpha=alpha)
for h in handles]
plt.legend(lines, labels, frameon=False)
plt.suptitle('Pandas', x=0.5, y=1.1, fontsize=14)
plt.show()
It is worth noting that the seaborn package provides a more convenient way to create this kind of plot, where contrary to pandas the bins are automatically aligned. The only downside is that the dataframes must first be combined and reshaped to long format, as shown in this example using the same dataframes and bins as before:
import seaborn as sns # v 0.11.0
# Combine dataframes and convert the combined dataframe to long format
df_concat = pd.concat([df1, df2], keys=['df1','df2']).reset_index(level=0)
df_melt = df_concat.melt(id_vars='level_0', var_name='var_id')
# Create figure using seaborn displot: note that the bins are automatically
# aligned thanks to the 'common_bins' parameter of the seaborn histplot function
# (called here with 'kind='hist'') that is set to True by default. Here, the
# bins from the previous example are used to make the figures more comparable.
# Also note that the facets share the same x and y axes by default, this can
# be changed when var1 and var2 have different ranges and different
# distribution shapes, as it is the case in this example.
g = sns.displot(df_melt, kind='hist', x='value', col='var_id', hue='level_0',
element='step', bins=bin_edges, fill=False, height=4,
facet_kws=dict(sharex=False, sharey=False))
# For some reason setting sharex as above does not automatically adjust the
# x-axes limits (even when not setting a bins argument, maybe due to a bug
# with this package version) which is why this is done in the following loop,
# but note that you still need to set 'sharex=False' in displot, or else
# 'ax.set.xlim' will have no effect.
for ax, v_min, v_max in zip(g.axes.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)
# Additional formatting
g.legend.set_bbox_to_anchor((.9, 0.75))
g.legend.set_title('')
plt.suptitle('Seaborn', x=0.5, y=1.1, fontsize=14)
plt.show()
As you may notice, the histogram line is cut off at the limits of the list of bin edges (not visible on the maximum side due to scale). To get a line more similar to the example with pandas, an empty bin can be added at each extremity of the list of bins, like this:
bin_edges = np.insert(bin_edges, 0, bin_edges.min()-step)
bin_edges = np.append(bin_edges, bin_edges.max()+step)
This example also illustrates the limits of this approach of setting common bins for both facets. Seeing as the ranges of var1 and var2 are somewhat different and 30 bins are used to cover the combined range, the histogram for var1 contains rather few bins and the histogram for var2 has slightly more bins than necessary. To my knowledge, there is no straightforward way of assigning a different list of bins to each facet when calling the plotting functions df.hist() and displot(). So for cases where the variables cover significantly different ranges, these figures would have to be created from scratch using matplotlib or some other plotting library, as sketched below.
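As a rough sketch of that from-scratch route, reusing df1, df2, df_combined and nbins from the first code block of this answer, and choosing the bins for each variable independently:

fig, axs = plt.subplots(1, 2, figsize=(10, 4))
for ax, var in zip(axs, df_combined.columns):
    # Bins spanning only this variable's combined range across both dataframes
    var_bins = np.linspace(min(df1[var].min(), df2[var].min()),
                           max(df1[var].max(), df2[var].max()), nbins + 1)
    ax.hist(df1[var], bins=var_bins, histtype='step', linewidth=2, label='df1')
    ax.hist(df2[var], bins=var_bins, histtype='step', linewidth=2, label='df2')
    ax.set_title(var)
axs[0].legend(frameon=False)
plt.suptitle('Matplotlib, per-variable bins', x=0.5, y=1.1, fontsize=14)
plt.show()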