I'm trying to use the convenience of the plot method of a pandas dataframe while adjusting the size of the figure produced. (I'm saving the figures to file as well as displaying them inline in a Jupyter notebook.) The method below works most of the time, except when I plot two lines on the same chart: then the figure goes back to the default size.
I suspect this might be due to the differences between plot on a series and plot on a dataframe.
Setup example code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = {
    'A': 90 + np.random.randn(366),
    'B': 85 + np.random.randn(366)
}
date_range = pd.date_range('2016-01-01', '2016-12-31')
index = pd.Index(date_range, name='Date')
df = pd.DataFrame(data=data, index=index)
Control - this code produces the expected result (a wide plot):
fig = plt.figure(figsize=(10,4))
df['A'].plot()
plt.savefig("plot1.png")
plt.show()
Result:
Plotting two lines - figure size is not (10,4)
fig = plt.figure(figsize=(10,4))
df[['A', 'B']].plot()
plt.savefig("plot2.png")
plt.show()
Result:
What's the right way to do this so that the figure size is consistently set regardless of the number of series selected?
The reason for the difference between the two cases is a bit hidden inside the logic of pandas.DataFrame.plot(). As one can see in the documentation this method allows a lot of arguments to be passed such that it will handle all kinds of different cases.
Here in the first case, you create a matplotlib figure via fig = plt.figure(figsize=(10,4)) and then plot a single column of the DataFrame (a Series). Now the internal logic of the pandas plot function is to check whether there is already a figure present in the matplotlib state machine, and if so, use its current axes to plot the column's values to it. This works as expected.
However in the second case, the data consists of two columns. There are several options how to handle such a plot, including using different subplots with shared or non-shared axes etc. In order for pandas to be able to apply any of those possible requirements, it will by default create a new figure to which it can add the axes to plot to. The new figure will not know about the already existing figure and its size, but rather have the default size, unless you specify the figsize argument.
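To see the difference for yourself, here is a minimal sketch that checks which figure each call ends up drawing on. The variable names (ax_series, ax_frame) are mine, and the Agg backend is selected only so that it runs without a display. With the behaviour described above, the first check should report the same figure and the second a different one, but treat that as something to verify on your own pandas version.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": 90 + np.random.randn(366), "B": 85 + np.random.randn(366)},
    index=pd.date_range("2016-01-01", "2016-12-31", name="Date"),
)

fig = plt.figure(figsize=(10, 4))

# Series.plot() without ax=: is the pre-made figure reused?
ax_series = df["A"].plot()
print("Series drawn on fig:", ax_series.figure is fig)

# DataFrame.plot() without ax=: is the pre-made figure reused?
ax_frame = df[["A", "B"]].plot()
print("DataFrame drawn on fig:", ax_frame.figure is fig)
print("Size of the DataFrame's figure:", ax_frame.figure.get_size_inches())

plt.close("all")
```

Comparing ax.figure with the figure you created is a quick way to debug this kind of surprise without having to look at the rendered image.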
In the comments, you say that a possible solution is to use df[['A', 'B']].plot(figsize=(10,4)). This is correct, but you then need to omit the creation of your initial figure. Otherwise it will produce 2 figures, which is probably undesired. In a notebook this will not be visible, but if you run this as a usual python script with plt.show() at the end, there will be two figure windows opening.
So the solution which lets pandas take care of figure creation is
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"A":[2,3,1], "B":[1,2,2]})
df[['A', 'B']].plot(figsize=(10,4))
plt.show()
A way to circumvent the creation of a new figure is to supply the ax argument to the pandas.DataFrame.plot(ax=ax) function, where ax is an externally created axes. This axes can be the standard axes you obtain via plt.gca().
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"A":[2,3,1], "B":[1,2,2]})
plt.figure(figsize=(10,4))
df[['A', 'B']].plot(ax = plt.gca())
plt.show()
Alternatively use the more object oriented way seen in the answer from PaulH.
Always operate explicitly and directly on your Figure and Axes objects. Don't rely on the pyplot state machine. In your case that means:
fig1, ax1 = plt.subplots(figsize=(10,4))
df['A'].plot(ax=ax1)
fig1.savefig("plot1.png")
fig2, ax2 = plt.subplots(figsize=(10,4))
df[['A', 'B']].plot(ax=ax2)
fig2.savefig("plot2.png")
plt.show()
I have two data frames (df1 and df2). Each have the same 10 variables with different values.
I created box plots of the variables in the data frames like so:
df1.boxplot()
df2.boxplot()
I get two graphs of 10 box plots next to each other for each variable. The actual output is the second graph, however, as obviously Python just runs the code in order.
Instead, I would either like these box plots to appear side by side OR, ideally, I would like 10 graphs (one for each variable) comparing each variable by data frame (e.g. one graph for the first variable with two box plots in it, one for each data frame). Is that possible using just the standard Python library, or do I have to use Matplotlib?
Thanks!
To get graphs, standard Python isn't enough. You'd need a graphical library such as matplotlib. Seaborn extends matplotlib to ease the creation of complex statistical plots. To work with Seaborn, the dataframes should be converted to long form (e.g. via pandas' melt) and then combined into one large dataframe.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# suppose df1 and df2 are dataframes, each with the same 10 columns
df1 = pd.DataFrame({i: np.random.randn(100).cumsum() for i in 'abcdefghij'})
df2 = pd.DataFrame({i: np.random.randn(150).cumsum() for i in 'abcdefghij'})
# pd.melt converts the dataframe to long form, pd.concat combines them
df = pd.concat({'df1': df1.melt(), 'df2': df2.melt()}, names=['source', 'old_index'])
# convert the source index to a column, and reset the old index
df = df.reset_index(level=0).reset_index(drop=True)
sns.boxplot(data=df, x='variable', y='value', hue='source', palette='turbo')
This creates boxes for each of the original columns, comparing the two dataframes:
Optionally, you could create multiple subplots with that same information:
sns.catplot(data=df, kind='box', col='variable', y='value', x='source',
palette='turbo', height=3, aspect=0.5, col_wrap=5)
By default, the y-axes are shared. You can disable the sharing via sharey=False. Here is an example, which also removes the repeated x axes and creates a common legend:
g = sns.catplot(data=df, kind='box', col='variable', y='value', x='source', hue='source', dodge=False,
palette='Reds', height=3, aspect=0.5, col_wrap=5, sharey=False)
g.set(xlabel='', xticks=[]) # remove x labels and ticks
g.add_legend()
PS: If you simply want to put two pandas boxplots next to each other, you can create a figure with two subplots, and pass the axes to pandas. (Note that pandas plotting is just an interface towards matplotlib.)
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 5))
df1.boxplot(ax=ax1)
ax1.set_title('df1')
df2.boxplot(ax=ax2)
ax2.set_title('df2')
plt.tight_layout()
plt.show()
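If you would rather stay with matplotlib and pandas only, the "one graph per variable, two boxes in each" layout can also be built by hand. This is just a sketch under my own assumptions: the data is random stand-in data, the 2x5 grid and figure size are arbitrary choices, and the Agg backend is there only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# stand-in data: two dataframes with the same 10 columns
rng = np.random.default_rng(0)
cols = list("abcdefghij")
df1 = pd.DataFrame(rng.normal(size=(100, 10)).cumsum(axis=0), columns=cols)
df2 = pd.DataFrame(rng.normal(size=(150, 10)).cumsum(axis=0), columns=cols)

# one subplot per variable, each holding one box per dataframe
fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(15, 6))
for ax, col in zip(axes.flat, cols):
    ax.boxplot([df1[col].to_numpy(), df2[col].to_numpy()])
    ax.set_xticks([1, 2])                 # boxplot places the boxes at x = 1, 2
    ax.set_xticklabels(["df1", "df2"])
    ax.set_title(col)
fig.tight_layout()
```

The seaborn version is shorter, but this one avoids reshaping the data to long form.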
I have a dataframe like below,
I am trying to plot the size distribution of different species from different projects. Here is what I have been trying (very simple code, as I am new to Python):
test=pd.read_excel(file,sheet_name="test",engine='openpyxl')
test.set_index('Species')
test=test.groupby('Project ID')
ax=test.boxplot(column='sizes',by='Species',return_type='axes')
The plot is exactly what I need (below)
However, this returns ax as a Series object rather than an axes, which makes it hard to handle plot formatting afterwards (adding y labels, etc.). Is there any way to fix this?
In matplotlib (which is what pandas uses), you always get one "axes" per subplot. Therefore, it makes sense that you have a collection (Series) of axes in your example (two subplots). This is actually good news, because now you can access the subplot you want to style very conveniently by name. Say, for example, you want to add a y-label to the left subplot, you can do:
ax_A = ax.loc["A"].loc["sizes"]
ax_B = ax.loc["B"].loc["sizes"]
ax_A.set_ylabel("My y-label")
Full example:
import numpy as np
import pandas as pd
test = pd.DataFrame({"Project ID": np.random.choice(["A", "B"], 100),
"Species": np.random.choice(["Plant1", "Plant2", "Plant3"], 100),
"sizes": np.random.random(100)})
test=test.groupby('Project ID')
ax=test.boxplot(column='sizes',by='Species',return_type='axes')
ax_A = ax.loc["A"].loc["sizes"]
ax_B = ax.loc["B"].loc["sizes"]
ax_A.set_ylabel("My y-label")
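If you prefer to hold plain matplotlib axes from the start, another option (not what the code above does) is to create the subplots yourself and hand one to each group. A minimal sketch, assuming the same made-up data layout as in the full example; the figure size and the "Project ..." titles are my own choices, and the Agg backend is there only so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
test = pd.DataFrame({"Project ID": rng.choice(["A", "B"], 100),
                     "Species": rng.choice(["Plant1", "Plant2", "Plant3"], 100),
                     "sizes": rng.random(100)})

# one subplot per project, created up front so we keep direct references
groups = list(test.groupby("Project ID"))
fig, axes = plt.subplots(ncols=len(groups), figsize=(10, 4), sharey=True)
for axis, (project, group) in zip(axes, groups):
    group.boxplot(column="sizes", by="Species", ax=axis)
    axis.set_title(f"Project {project}")

axes[0].set_ylabel("My y-label")
fig.suptitle("")  # clear the figure-level title that pandas adds when by= is used
```

Here axes is an ordinary array of matplotlib axes, so all the usual formatting calls apply directly.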
I have been given some data for which I need to plot a histogram. So I used the pandas hist() function and plotted it using matplotlib. The code runs on a remote server, so I cannot see the plot directly and hence I save the image. Here is what the image looks like
Here is my code below
import matplotlib.pyplot as plt
df_hist = pd.DataFrame(np.array(raw_data)).hist(bins=5)  # raw_data is the data supplied to me
plt.savefig('/path/to/file.png')
plt.close()
As you can see the x axis labels are overlapping. So I used this function plt.tight_layout() like so
import matplotlib.pyplot as plt
df_hist = pd.DataFrame(np.array(raw_data)).hist(bins=5)
plt.tight_layout()
plt.savefig('/path/to/file.png')
plt.close()
There is some improvement now
But still the labels are too close. Is there a way to ensure the labels do not touch each other and there is fair spacing between them? Also I want to resize the image to make it smaller.
I checked the documentation here https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html but not sure which parameter to use for savefig.
Since raw_data is not already a pandas dataframe there's no need to turn it into one to do the plotting. Instead you can plot directly with matplotlib.
There are many different ways to achieve what you'd like. I'll start by setting up some data which looks similar to yours:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gamma
raw_data = gamma.rvs(a=1, scale=1e6, size=100)
If we go ahead and use matplotlib to create the histogram we may find the xticks too close together:
fig, ax = plt.subplots(1, 1, figsize=[5, 3])
ax.hist(raw_data, bins=5)
fig.tight_layout()
The xticks are hard to read with all the zeros, regardless of spacing. So, one thing you may wish to do would be to use scientific formatting. This makes the x-axis much easier to interpret:
ax.ticklabel_format(style='sci', axis='x', scilimits=(0,0))
Another option, without using scientific formatting would be to rotate the ticks (as mentioned in the comments):
ax.tick_params(axis='x', rotation=45)
fig.tight_layout()
Finally, you also mentioned altering the size of the image. Note that this is best done when the figure is initialised. You can set the size of the figure with the figsize argument. The following would create a figure 5" wide and 3" in height:
fig, ax = plt.subplots(1, 1, figsize=[5, 3])
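Putting the pieces together for the remote-server case, something along these lines should work. It is only a sketch: the data is a numpy stand-in for the gamma sample above, the output path is a hypothetical temp-file location, and dpi=100 is an arbitrary choice. The saved image size in pixels is figsize times dpi, so figsize and the dpi argument of savefig together control how large the file comes out.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend: no display is needed on a remote server
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
raw_data = rng.gamma(shape=1.0, scale=1e6, size=100)  # stand-in for the supplied data

fig, ax = plt.subplots(figsize=(5, 3))  # 5 inches wide, 3 inches high
ax.hist(raw_data, bins=5)
ax.ticklabel_format(style="sci", axis="x", scilimits=(0, 0))  # compact x tick labels
fig.tight_layout()

# hypothetical output location; replace with your own path
out_path = os.path.join(tempfile.gettempdir(), "hist_example.png")
fig.savefig(out_path, dpi=100)  # pixel size = figsize * dpi
plt.close(fig)
```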
I think the two best fixes were mentioned by Pam in the comments.
You can rotate the labels with
plt.xticks(rotation=45)
For more information, look here: Rotate axis text in python matplotlib
The real problem is too many zeros that don't provide any extra info. Numpy arrays are pretty easy to work with, so pd.DataFrame(np.array(raw_data)/1000).hist(bins=5) should remove three zeros from the x-axis tick labels. Then just mention the factor (e.g. 'thousands') in the axis label.
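A rough sketch of the rescaling idea, with made-up stand-in data and label text of my own choosing. DataFrame.hist returns an array of axes, so np.ravel is used to pick out the single subplot regardless of the array's shape; the Agg backend is there only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
raw_data = rng.gamma(shape=1.0, scale=1e6, size=100)  # made-up stand-in data

# divide by 1000 so the tick labels lose three zeros
axes = pd.DataFrame(np.array(raw_data) / 1000).hist(bins=5)
ax = np.ravel(axes)[0]  # the one subplot that was created

# say what the rescaled axis means
ax.set_xlabel("value (thousands)")
ax.set_ylabel("count")
plt.tight_layout()
```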
To change the size of the graph use rcParams.
from matplotlib import rcParams
rcParams['figure.figsize'] = 7, 5.75 #the numbers are the dimensions
In the two snippets below, where the only difference seems to be the data source type (pd.Series vs pd.DataFrame), why does plt.figure(num=None, figsize=(12, 3), dpi=80) have an effect in one case but not in the other when using pd.DataFrame.plot?
Snippet 1 - Adjusting plot size when data is a pandas Series
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# data
np.random.seed(123)
df = pd.Series(np.random.randn(10000),index=pd.date_range('1/1/2000', periods=10000)).cumsum()
print(type(df))
# plot
plt.figure(num=None, figsize=(12, 3), dpi=80)
ax = df.plot()
plt.show()
Output 1
Snippet 2 - Now the data source is a pandas Dataframe
# imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# data
np.random.seed(123)
dfx = pd.Series(np.random.randn(100),index=pd.date_range('1/1/2000', periods=100)).cumsum()
dfy = pd.Series(np.random.randn(100),index=pd.date_range('1/1/2000', periods=100)).cumsum()
df = pd.concat([dfx, dfy], axis = 1)
print(type(df))
# plot
plt.figure(num=None, figsize=(12, 3), dpi=80)
ax = df.plot()
plt.show()
The only difference here seems to be the type of the datasource. Why would that have something to say for the matplotlib output?
It seems that pd.DataFrame.plot() works a bit differently from pd.Series.plot(). Since the dataframe might have any number of columns, which might require subplots, different axes, etc., pandas defaults to creating a new figure. The way around this is to feed the arguments directly to the plot call, i.e. df.plot(figsize=(12, 3)) (dpi isn't accepted as a keyword argument, unfortunately). You can read more about it in this great answer:
In the first case, you create a matplotlib figure via fig =
plt.figure(figsize=(10,4)) and then plot a single column DataFrame.
Now the internal logic of pandas plot function is to check if there is
already a figure present in the matplotlib state machine, and if so,
use its current axes to plot the column's values to it. This works as
expected.
However in the second case, the data consists of two columns. There
are several options how to handle such a plot, including using
different subplots with shared or non-shared axes etc. In order for
pandas to be able to apply any of those possible requirements, it will
by default create a new figure to which it can add the axes to plot
to. The new figure will not know about the already existing figure and
its size, but rather have the default size, unless you specify the
figsize argument.
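If you need dpi as well as figsize, one way that should work is to create the figure and axes yourself and pass the axes to pandas. This is a sketch with made-up data and column names ("x", "y"); the Agg backend is there only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(123)
idx = pd.date_range("1/1/2000", periods=100)
df = pd.DataFrame({"x": np.random.randn(100).cumsum(),
                   "y": np.random.randn(100).cumsum()}, index=idx)

# size and dpi are set on the figure; pandas only draws into the given axes
fig, ax = plt.subplots(figsize=(12, 3), dpi=80)
df.plot(ax=ax)
```

Because ax is passed explicitly, the same code behaves the same way for a Series and for a DataFrame.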
I have a function that plots a graph. I can call this function with different variables to alter the graph. I'd like to call it multiple times and plot the graphs alongside each other, but I'm not sure how to do so
def plt_graph(x, graph_title, horiz_label):
    df[x].plot(kind='barh')
    plt.title(graph_title)
    plt.ylabel("")
    plt.xlabel(horiz_label)
plt_graph('gross','Total value','Gross (in millions)')
In case you know the number of plots you want to produce beforehand, you can first create as many subplots as you need
fig, axes = plt.subplots(nrows=1, ncols=5)
(in this case 5) and then provide the axes to the function
def plt_graph(x, graph_title, horiz_label, ax):
    df[x].plot(kind='barh', ax=ax)
Finally, call every plot like this
plt_graph("framekey", "Some title", "some label", axes[4])
(where 4 stands for the fifth and last plot)
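For completeness, a self-contained sketch of the whole thing might look like the following. The dataframe contents, the second column ("budget") and the labels are invented for illustration, and the title and axis labels are set on the axes object rather than through plt so that they land on the right subplot. The Agg backend is there only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# made-up data; "budget" is a hypothetical second column
df = pd.DataFrame({"gross": [120.0, 85.5, 40.2], "budget": [60.0, 30.0, 25.0]},
                  index=["Film 1", "Film 2", "Film 3"])

def plt_graph(x, graph_title, horiz_label, ax):
    # draw column x as horizontal bars on the given axes
    df[x].plot(kind="barh", ax=ax)
    ax.set_title(graph_title)
    ax.set_ylabel("")
    ax.set_xlabel(horiz_label)

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 3))
plt_graph("gross", "Total value", "Gross (in millions)", axes[0])
plt_graph("budget", "Total budget", "Budget (in millions)", axes[1])
fig.tight_layout()
```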
I have created a tool to do this really easily. I use it all the time in jupyter notebooks and find it so much neater than a big column of charts. Copy the Gridplot class from this file:
https://github.com/simonm3/analysis/blob/master/analysis/plot.py
Usage:
gridplot = Gridplot()
plt.plot(x)
plt.plot(y)
It shows each new plot in a grid with 4 plots to a row. You can change the size of the charts or the number per row. It works for plt.plot, plt.bar, plt.hist and plt.scatter. However, it does require that you use matplotlib directly rather than pandas.
If you want to turn it off:
gridplot.off()
If you want to reset the grid to position 1:
gridplot.on()
Here is a way that you can do it. First you create the figure which will contain the axes object. With those axes you have something like a canvas, which is where every graph will be drawn.
fig, ax = plt.subplots(1,2)
Here I have created one figure with two axes. This is a one row and two columns figure. If you inspect the ax variable you will see two objects. This is what we'll use for all the plotting. Now, going back to the function, let's start with a simple dataset and define the function.
df = pd.DataFrame({"a": np.random.random(10), "b": np.random.random(10)})
def plt_graph(x, graph_title, horiz_label, ax):
    df[x].plot(kind = 'barh', ax = ax)
    ax.set_xlabel(horiz_label)
    ax.set_title(graph_title)
Then, to call the function you will simply do this:
plt_graph("a", "a", "a", ax=ax[0])
plt_graph("b", "b", "b", ax=ax[1])
Note that you pass each graph that you want to create to one of the axes you have. In this case, as you have two, you pass the first to the first axes and so on. Note also that with seaborn versions before 0.8, importing seaborn (import seaborn as sns) applied the seaborn style to your graphs automatically; in later versions you have to call sns.set_theme() (or sns.set()) yourself.
This is what your graphs will look like.
When you are creating plotting functions, you want to look at matplotlib's object oriented interface.
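A common pattern for such functions, shown here as a sketch rather than a rule, is to accept an optional ax argument, fall back to the current axes when none is given, and return the axes so the caller can keep styling it. The function signature is my own variation on the one above (it takes the data directly instead of reading a global df), and the Agg backend is there only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plt_graph(data, graph_title, horiz_label, ax=None):
    """Draw a horizontal bar chart of `data` on `ax` (or on the current axes)."""
    if ax is None:
        ax = plt.gca()  # fall back to the pyplot state machine only if needed
    data.plot(kind="barh", ax=ax)
    ax.set_title(graph_title)
    ax.set_xlabel(horiz_label)
    return ax

df = pd.DataFrame({"a": np.random.random(10), "b": np.random.random(10)})

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))
returned = plt_graph(df["a"], "a", "value of a", ax=left)  # explicit axes
plt.sca(right)                                             # make `right` current
fallback = plt_graph(df["b"], "b", "value of b")           # uses current axes
fig.tight_layout()
```

The same function then works both in quick interactive use and inside a larger grid of subplots.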