I have two Pandas DataFrames that I'm hoping to plot in a single figure. I'm using IPython notebook.
I would like the legend to show the label for both of the DataFrames, but so far I've been able to get only the latter one to show. Also any suggestions as to how to go about writing the code in a more sensible way would be appreciated. I'm new to all this and don't really understand object oriented plotting.
%pylab inline
import pandas as pd
#creating data
prng = pd.period_range('1/1/2011', '1/1/2012', freq='M')
var=pd.DataFrame(randn(len(prng)),index=prng,columns=['total'])
shares=pd.DataFrame(randn(len(prng)),index=prng,columns=['average'])
#plotting
ax=var.total.plot(label='Variance')
ax=shares.average.plot(secondary_y=True,label='Average Age')
ax.left_ax.set_ylabel('Variance of log wages')
ax.right_ax.set_ylabel('Average age')
plt.legend(loc='upper center')
plt.title('Wage Variance and Mean Age')
plt.show()
This is indeed a bit confusing. I think it boils down to how Matplotlib handles secondary axes. Pandas probably calls ax.twinx() somewhere, which superimposes a secondary axes on the first one, but this is actually a separate axes object, with its own lines, labels and legend. Calling plt.legend() only applies to one of the axes (the active one), which in your example is the second axes.
Pandas fortunately does store both axes, so you can grab all line objects from both of them and pass them to the .legend() command yourself. Given your example data:
You can plot exactly as you did:
ax = var.total.plot(label='Variance')
ax = shares.average.plot(secondary_y=True, label='Average Age')
ax.set_ylabel('Variance of log wages')
ax.right_ax.set_ylabel('Average age')
Both axes objects are available as ax (the left axes) and ax.right_ax, so you can grab the line objects from them. Matplotlib's .get_lines() returns a list, so you can merge them by simple addition.
lines = ax.get_lines() + ax.right_ax.get_lines()
The line objects have a label property which can be used to read and pass the label to the .legend() command.
ax.legend(lines, [l.get_label() for l in lines], loc='upper center')
And the rest of the plotting:
ax.set_title('Wage Variance and Mean Age')
plt.show()
edit:
It might be less confusing if you separate the Pandas (data) and the Matplotlib (plotting) parts more strictly, so avoid using the Pandas built-in plotting (which only wraps Matplotlib anyway):
fig, ax = plt.subplots()
# PeriodIndex.to_timestamp() converts the periods to datetimes Matplotlib can plot
ax.plot(var.index.to_timestamp(), var.total, 'b', label='Variance')
ax.set_ylabel('Variance of log wages')
ax2 = ax.twinx()
ax2.plot(shares.index.to_timestamp(), shares.average, 'g', label='Average Age')
ax2.set_ylabel('Average age')
lines = ax.get_lines() + ax2.get_lines()
ax.legend(lines, [line.get_label() for line in lines], loc='upper center')
ax.set_title('Wage Variance and Mean Age')
plt.show()
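As a variation on the above, Matplotlib axes also have a get_legend_handles_labels() method, which returns the handles and labels that a plain legend() call on that axes would use. Here is a minimal sketch with made-up data (the random values and the integer x positions are assumptions for illustration, not your data):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(12)  # stand-in for the monthly index

fig, ax = plt.subplots()
ax.plot(x, rng.standard_normal(12), 'b', label='Variance')
ax2 = ax.twinx()
ax2.plot(x, rng.standard_normal(12), 'g', label='Average Age')

# Each axes only knows about its own labelled artists
handles1, labels1 = ax.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()

# Merge both and draw one legend; ax2 is drawn on top, so the legend goes there
combined_labels = labels1 + labels2
legend = ax2.legend(handles1 + handles2, combined_labels, loc='upper center')
```

This avoids the list comprehension over line objects, and it should also pick up labelled artists that are not lines (bars, for example).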
When multiple series are plotted then the legend is not displayed by default.
The easy way to display custom legends is just to use the axis from the last plotted series / dataframes (my code from IPython Notebook):
# Embed the plot
%matplotlib inline
import matplotlib.pyplot as plt
...
rates[rates.MovieID <= 25].groupby('MovieID').Rating.count().plot() # blue
(rates[rates.MovieID <= 25].groupby('MovieID').Rating.median() * 1000).plot() # green
(rates[rates.MovieID <= 25][rates.RateDelta <= 10].groupby('MovieID').Rating.count() * 2000).plot() # red
ax = (rates[rates.MovieID <= 25][rates.RateDelta <= 10].groupby('MovieID').Rating.median() * 1000).plot() # cyan
ax.legend(['Popularity', 'RateMedian', 'FirstPpl', 'FirstRM'])
You can use pd.concat to merge the two dataframes and then plot it using a secondary y-axis:
import numpy as np # For generating random data.
import pandas as pd
# Creating data.
np.random.seed(0)
prng = pd.period_range('1/1/2011', '1/1/2012', freq='M')
var = pd.DataFrame(np.random.randn(len(prng)), index=prng, columns=['total'])
shares = pd.DataFrame(np.random.randn(len(prng)), index=prng, columns=['average'])
# Plotting.
ax = (
pd.concat([var, shares], axis=1)
.rename(columns={
'total': 'Variance of Low Wages',
'average': 'Average Age'
})
.plot(
title='Wage Variance and Mean Age',
secondary_y='Average Age')
)
ax.set_ylabel('Variance of Low Wages')
ax.right_ax.set_ylabel('Average Age', rotation=-90)
I am trying to include 2 seaborn countplots with different scales on the same plot but the bars display as different widths and overlap as shown below. Any idea how to get around this?
Setting dodge=False doesn't work, as the bars appear on top of each other.
The main problem with the approach in the question is that the first countplot doesn't take hue into account. The second countplot won't magically move the bars of the first. An additional categorical column could be added, only taking on the 'weekend' value. Note that the column should be explicitly made categorical with two values, even if only one value is really used.
Things can be simplified a lot by starting from the original dataframe, which supposedly already has a column 'is_weekend'. Creating the twinx ax beforehand allows writing a loop (so the call to sns.countplot() is written only once, with parameters).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
sns.set_style('dark')
# create some demo data
data = pd.DataFrame({'ride_hod': np.random.normal(13, 3, 1000).astype(int) % 24,
'is_weekend': np.random.choice(['weekday', 'weekend'], 1000, p=[5 / 7, 2 / 7])})
# now, make 'is_weekend' a categorical column (not just strings)
data['is_weekend'] = pd.Categorical(data['is_weekend'], ['weekday', 'weekend'])
fig, ax1 = plt.subplots(figsize=(16, 6))
ax2 = ax1.twinx()
for ax, category in zip((ax1, ax2), data['is_weekend'].cat.categories):
    sns.countplot(data=data[data['is_weekend'] == category], x='ride_hod', hue='is_weekend', palette='Blues', ax=ax)
    ax.set_ylabel(f'Count ({category})')
ax1.legend_.remove() # both axes got a legend, remove one
ax1.set_xlabel('Hour of Day')
plt.tight_layout()
plt.show()
use plt.xticks(['put the label by hand in your x label'])
I am querying COVID-19 data and building a dataframe of day-over-day changes for one of the data points (positive test results) where each row is a day, each column is a state or territory (there are 56 altogether). I can then generate a chart for every one of the states, but I can't get my x-axis labels (the dates) to behave like I want. There are two problems which I suspect are related. First, there are too many labels -- usually matplotlib tidily reduces the label count for readability, but I think the subplots are confusing it. Second, I would like the labels to read vertically; but this only happens on the last of the plots. (I tried moving the rotation='vertical' inside the for block, to no avail.)
The dates are the same for all the subplots, so -- this part works -- the x-axis labels only need to appear on the bottom row of the subplots. Matplotlib is doing this automatically. But I need fewer of the labels, and for all of them to align vertically. Here is my code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# get current data
all_states = pd.read_json("https://covidtracking.com/api/v1/states/daily.json")
# convert the YYYYMMDD date to a datetime object
all_states[['gooddate']] = all_states[['date']].applymap(lambda s: pd.to_datetime(str(s), format = '%Y%m%d'))
# 'positive' is the cumulative total of COVID-19 test results that are positive
all_states_new_positives = all_states.pivot_table(index = 'gooddate', columns = 'state', values = 'positive', aggfunc='sum')
all_states_new_positives_diff = all_states_new_positives.diff()
fig, axes = plt.subplots(14, 4, figsize = (12,8), sharex = True )
plt.tight_layout
for i, ax in enumerate(axes.ravel()):
    # get the numbers for the last 28 days
    x = all_states_new_positives_diff.iloc[-28 :].index
    y = all_states_new_positives_diff.iloc[-28 : , i]
    ax.set_title(y.name, loc='left', fontsize=12, fontweight=0)
    ax.plot(x,y)
plt.xticks(rotation='vertical')
plt.subplots_adjust(left=0.5, bottom=1, right=1, top=4, wspace=2, hspace=2)
plt.show();
Suggestions:
Increase the height of the figure.
fig, axes = plt.subplots(14, 4, figsize = (12,20), sharex = True)
Rotate all the labels:
fig.autofmt_xdate(rotation=90)
Use tight_layout at the end instead of subplots_adjust:
fig.tight_layout()
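Putting the suggestions together, here is a rough sketch on synthetic data. The demo dataframe, its invented column names and the smaller 3x4 grid are assumptions standing in for all_states_new_positives_diff, so it runs without downloading anything:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 28 days, 12 made-up columns
dates = pd.date_range("2020-06-01", periods=28, freq="D")
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 500, size=(28, 12)),
                    index=dates,
                    columns=[f"S{i:02d}" for i in range(12)])

# A taller figure gives each row of subplots room to breathe
fig, axes = plt.subplots(3, 4, figsize=(12, 8), sharex=True)
for ax, name in zip(axes.ravel(), demo.columns):
    ax.plot(demo.index, demo[name])
    ax.set_title(name, loc="left", fontsize=12)

# Rotate the date labels on the bottom row, then let Matplotlib fix the spacing
fig.autofmt_xdate(rotation=90)
fig.tight_layout()

# Collected only to inspect the result: rotation of the bottom-left tick labels
bottom_rotations = [label.get_rotation() for label in axes[-1, 0].get_xticklabels()]
```

For the full 14x4 grid you would scale the figure height up accordingly, as in the first suggestion.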
I'm trying to plot two datasets into one plot with matplotlib. One of the two plots is misaligned by 1 on the x-axis.
This MWE pretty much sums up the problem. What do I have to adjust to bring the box-plot further to the left?
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
titles = ["nlnd", "nlmd", "nlhd", "mlnd", "mlmd", "mlhd", "hlnd", "hlmd", "hlhd"]
plotData = pd.DataFrame(np.random.rand(25, 9), columns=titles)
failureRates = pd.DataFrame(np.random.rand(9, 1), index=titles)
color = {'boxes': 'DarkGreen', 'whiskers': 'DarkOrange', 'medians': 'DarkBlue',
'caps': 'Gray'}
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = ax1.twinx()
plotData.plot.box(ax=ax1, color=color, sym='+')
failureRates.plot(ax=ax2, color='b', legend=False)
ax1.set_ylabel('Seconds')
ax2.set_ylabel('Failure Rate in %')
plt.xlim(-0.7, 8.7)
ax1.set_xticks(range(len(titles)))
ax1.set_xticklabels(titles)
fig.tight_layout()
fig.show()
Actual result. Note that it's only 8 box-plots instead of 9 and that they're starting at index 1.
The issue is a mismatch between how box() and plot() work: box() starts at x-position 1, while plot() depends on the index of the dataframe (which defaults to starting at 0). There are only 8 boxes because the 9th is being cut off by plt.xlim(-0.7, 8.7). There are several easy ways to fix this. As @Sheldore's answer indicates, you can explicitly set the positions for the boxplot. Another way is to change the index of the failureRates dataframe so that it starts at 1 when the dataframe is constructed, i.e.
failureRates = pd.DataFrame(np.random.rand(9, 1), index=range(1, len(titles)+1))
Note that you need not specify the xticks or the xlim for the question's MCVE, but you may need to for your complete code.
You can specify the positions on the x-axis where you want to have the box plots. Since you have 9 boxes, use the following which generates the figure below
plotData.plot.box(ax=ax1, color=color, sym='+', positions=range(9))
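To see where the offset comes from, here is a small sketch that reads the x-positions back from the drawn boxes. It calls Matplotlib's boxplot directly rather than going through pandas, and the box_centers helper is a hypothetical name introduced just for this check:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

data = np.random.rand(25, 9)  # same shape as plotData in the question

fig, (ax_default, ax_shifted) = plt.subplots(1, 2)
default_boxes = ax_default.boxplot(data)                      # default positions
shifted_boxes = ax_shifted.boxplot(data, positions=range(9))  # explicit positions

def box_centers(boxes):
    # The median line of each box is centred on that box's x-position
    return [float(np.mean(median.get_xdata())) for median in boxes['medians']]

default_centers = box_centers(default_boxes)  # 1, 2, ..., 9
shifted_centers = box_centers(shifted_boxes)  # 0, 1, ..., 8
```

With positions=range(9) the boxes line up with a line plotted against 0..8, which is what the failure-rate plot on the twin axes uses.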
Is there a way to add a secondary legend to a scatterplot, where the size of the scatter is proportional to some data?
I have written the following code that generates a scatterplot. The color of the scatter represents the year (and is taken from a user-defined df) while the size of the scatter represents variable 3 (also taken from a df but is raw data):
import matplotlib.pyplot as plt
import pandas as pd
colors = pd.DataFrame({'1985':'red','1990':'b','1995':'k','2000':'g','2005':'m','2010':'y'}, index=[0,1,2,3,4,5])
fig = plt.figure()
ax = fig.add_subplot(111)
for i in df.keys():
    df[i].plot(kind='scatter',x='variable1',y='variable2',ax=ax,label=i,s=df[i]['variable3']/100, c=colors[i])
ax.legend(loc='upper right')
ax.set_xlabel("Variable 1")
ax.set_ylabel("Variable 2")
This code (with my data) produces the following graph:
So while the colors/years are well and clearly defined, the size of the scatter is not.
How can I add a secondary or additional legend that defines what the size of the scatter means?
You will need to create the second legend yourself, i.e. you need to create some artists to populate the legend with. In the case of a scatter we can use a normal plot and set the marker accordingly.
This is shown in the example below. To actually add a second legend, we need to add the first legend to the axes explicitly (with ax.add_artist), so that the new legend does not overwrite the first one.
import matplotlib.pyplot as plt
import matplotlib.colors
import numpy as np; np.random.seed(1)
import pandas as pd
plt.rcParams["figure.subplot.right"] = 0.8
v = np.random.rand(30,4)
v[:,2] = np.random.choice(np.arange(1980,2015,5), size=30)
v[:,3] = np.random.randint(5,13,size=30)
df= pd.DataFrame(v, columns=["x","y","year","quality"])
df.year = df.year.values.astype(int)
fig, ax = plt.subplots()
for i, (name, dff) in enumerate(df.groupby("year")):
    c = matplotlib.colors.to_hex(plt.cm.jet(i/7.))
    dff.plot(kind='scatter',x='x',y='y', label=name, c=c,
             s=dff.quality**2, ax=ax)
leg = plt.legend(loc=(1.03,0), title="Year")
ax.add_artist(leg)
h = [plt.plot([],[], color="gray", marker="o", ms=i, ls="")[0] for i in range(5,13)]
plt.legend(handles=h, labels=range(5,13),loc=(1.03,0.5), title="Quality")
plt.show()
Have a look at http://matplotlib.org/users/legend_guide.html.
It shows how to have multiple legends (about halfway down) and there is another example that shows how to set the marker size.
If that doesn't work, then you can also create a custom legend (last example).
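If your Matplotlib is recent enough (3.1 or later, as far as I know), the object returned by scatter() also has a legend_elements() method that builds the handles for you. A minimal sketch with invented data; the year and quality arrays, the colormap and the legend positions are all assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.random(30), rng.random(30)
year = rng.choice(np.arange(1985, 2015, 5), size=30)  # made-up colour variable
quality = rng.integers(5, 13, size=30)                # made-up size variable

fig, ax = plt.subplots()
sc = ax.scatter(x, y, c=year, s=quality**2, cmap="viridis")

# First legend: entries derived from the colour values
color_handles, color_labels = sc.legend_elements(prop="colors")
color_legend = ax.legend(color_handles, color_labels, loc="upper left", title="Year")
ax.add_artist(color_legend)  # keep it when the second legend is created

# Second legend: entries derived from the marker sizes; func maps the
# plotted sizes (quality**2) back to the original quality values
size_handles, size_labels = sc.legend_elements(prop="sizes", num=4, func=np.sqrt)
size_legend = ax.legend(size_handles, size_labels, loc="lower right", title="Quality")
```

This needs a single scatter() call with numeric c and s arrays, so it does not map directly onto a loop that plots one group per call.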
I came across this different behaviour in the third example plot below. Why am I able to correctly edit the x-axis' ticks with pandas line() and area() plots, but not with bar()? What's the best way to fix the (general) third example?
import numpy as np
import pandas as pd
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt
x = np.arange(73,145,1)
y = np.cos(x)
df = pd.Series(y,x)
ax1 = df.plot.line()
ax1.xaxis.set_major_locator(ticker.MultipleLocator(10))
ax1.xaxis.set_minor_locator(ticker.MultipleLocator(2.5))
plt.show()
ax2 = df.plot.area(stacked=False)
ax2.xaxis.set_major_locator(ticker.MultipleLocator(10))
ax2.xaxis.set_minor_locator(ticker.MultipleLocator(2.5))
plt.show()
ax3 = df.plot.bar()
ax3.xaxis.set_major_locator(ticker.MultipleLocator(10))
ax3.xaxis.set_minor_locator(ticker.MultipleLocator(2.5))
plt.show()
Problem:
The bar plot is meant to be used with categorical data. Therefore the bars are not actually at the positions of x but at positions 0,1,2,...N-1. The bar labels are then adjusted to the values of x.
If you then put a tick only on every tenth bar, the second label will be placed at the tenth bar etc. The result is
You can see that the bars are actually positioned at integer values starting at 0 by using a normal ScalarFormatter on the axes:
ax3 = df.plot.bar()
ax3.xaxis.set_major_locator(ticker.MultipleLocator(10))
ax3.xaxis.set_minor_locator(ticker.MultipleLocator(2.5))
ax3.xaxis.set_major_formatter(ticker.ScalarFormatter())
Now you can of course define your own fixed formatter like this
n = 10
ax3 = df.plot.bar()
ax3.xaxis.set_major_locator(ticker.MultipleLocator(n))
ax3.xaxis.set_minor_locator(ticker.MultipleLocator(n/4.))
seq = ax3.xaxis.get_major_formatter().seq
ax3.xaxis.set_major_formatter(ticker.FixedFormatter([""]+seq[::n]))
which has the drawback that it starts at some arbitrary value.
Solution:
I would guess the best general solution is not to use the pandas plotting function at all (which is anyways only a wrapper), but the matplotlib bar function directly:
fig, ax3 = plt.subplots()
ax3.bar(df.index, df.values, width=0.72)
ax3.xaxis.set_major_locator(ticker.MultipleLocator(10))
ax3.xaxis.set_minor_locator(ticker.MultipleLocator(2.5))
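To double-check the claim about where the bars sit, one can read the positions back from the rectangle patches. A small sketch using the same series as in the question; bar_centers is a hypothetical helper name introduced for this check:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

x = np.arange(73, 145, 1)
df = pd.Series(np.cos(x), x)

def bar_centers(ax):
    # Each bar is a Rectangle patch; its centre is left edge + half its width
    return np.array([p.get_x() + p.get_width() / 2 for p in ax.patches])

# pandas bar plot: bars are placed at 0, 1, ..., N-1
fig1, ax_pandas = plt.subplots()
df.plot.bar(ax=ax_pandas)
pandas_centers = bar_centers(ax_pandas)

# Matplotlib bar: bars are placed at the actual index values 73, ..., 144
fig2, ax_mpl = plt.subplots()
ax_mpl.bar(df.index, df.values, width=0.72)
mpl_centers = bar_centers(ax_mpl)
```

Since the Matplotlib bars live on a numeric axis, MultipleLocator(10) then puts ticks at 80, 90, 100 and so on, as intended.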