I am using the option to generate a separate histogram of a value for each group in a data frame like so (example code from documentation)
data = pd.Series(np.random.randn(1000))
data.hist(by=np.random.randint(0, 4, 1000), figsize=(6, 4))
This is great, but what I am not seeing is a way to set and standardize the axes. Is this possible?
To be specific, I would like to specify the x and y axes of the plots so that the y axis in particular has the same range for all plots. Otherwise it can be hard to compare distributions to one another.
You can pass keyword arguments to hist and it will forward them to the underlying plotting functions. The relevant ones here are sharex and sharey, which make all subplots use the same x and y ranges.
data = pd.Series(np.random.randn(1000))
data.hist(by=np.random.randint(0, 4, 1000), figsize=(6, 4),
sharex=True, sharey=True)
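If you also want to choose the ranges yourself, one option is to loop over the axes that hist returns and set the limits explicitly. This is a minimal sketch; the limits (-4 to 4 and 0 to 100) are arbitrary values picked for illustration, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so the sketch runs headless
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.Series(rng.standard_normal(1000))
groups = rng.integers(0, 4, 1000)

# hist returns an array of Axes, one per group
axes = data.hist(by=groups, figsize=(6, 4), sharex=True, sharey=True)

# Illustrative limits, applied to every subplot
for ax in np.ravel(axes):
    ax.set_xlim(-4, 4)
    ax.set_ylim(0, 100)
```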
I am trying to do EDA along with exploring the Matplotlib and Seaborn libraries.
The data_cat DataFrame has 4 columns and I want to create plots in a single row with 4 columns.
For that, I created a figure object with 4 axes objects.
fig, ax = plt.subplots(1,4, figsize = (16,4))
for i in range(len(data_cat.columns)):
sns.catplot(x = data_cat.columns[i], kind = 'count', data = data_cat, ax= ax[i])
The output for it is a figure with the 4 plots (as required) but it is followed by 4 blank plots that I think are the extra figure objects generated by the sns.catplot function.
Your code does not work as intended because sns.catplot() is a figure-level function that is designed to create its own grid of subplots if desired. So if you want to set up the subplot grid directly in matplotlib, as you do with your first line, you should use the appropriate axes-level function instead, in this case sns.countplot():
fig, ax = plt.subplots(1, 4, figsize = (16,4))
for i in range(4):
sns.countplot(x = data_cat.columns[i], data = data_cat, ax= ax[i])
Alternatively, you could use pandas' df.melt() method to tidy up your dataset so that all the values from your four columns are in one column (say 'col_all'), and you have another column (say 'subplot') that identifies from which original column each value is. Then you can get all the subplots with one call:
# data_cat here is the melted (tidy) DataFrame
sns.catplot(x='col_all', kind='count', data=data_cat, col='subplot')
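The melt step described above might look like the following sketch. The contents of data_cat are made up for illustration, the name tidy is hypothetical, and sharex=False is an optional choice so that each column shows only its own categories.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so the sketch runs headless
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for data_cat: four categorical columns
data_cat = pd.DataFrame({
    "a": ["x", "y", "x", "z"],
    "b": ["p", "p", "q", "q"],
    "c": ["m", "n", "m", "m"],
    "d": ["u", "v", "w", "u"],
})

# One row per (original column, value) pair
tidy = data_cat.melt(var_name="subplot", value_name="col_all")

# One count plot per original column, laid out in a single row
g = sns.catplot(x="col_all", kind="count", data=tidy, col="subplot", sharex=False)
```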
I answered a related question here.
I've been struggling to generate the frequency plot of 2 columns named "Country" and "Company" in my DataFrame and show them as 2 subplots. Here's what I've got.
Figure1 = plt.figure(1)
Subplot1 = Figure1.add_subplot(2,1,1)
and here I'm going to use the bar chart pd.value_counts(DataFrame['Country']).plot('barh')
to show as the first subplot.
The problem is, I can't just write Subplot1.pd.value_counts(DataFrame['Country']).plot('barh'), since Subplot1 has no attribute pd. Could anybody shed some light on this?
Thanks a million in advance for your tips,
R.
You don't have to create Figure and Axes objects separately, and you should probably avoid initial caps in variable names, to differentiate them from classes.
Here, you can use plt.subplots, which creates a Figure and a number of Axes and binds them together. Then, you can just pass the Axes objects to the plot method of pandas:
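import pandas as pd
from matplotlib import pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
# Pass kind as a keyword; recent pandas versions reject positional arguments to Series.plot
df['Country'].value_counts().plot(kind='barh', ax=ax1)
df['Company'].value_counts().plot(kind='barh', ax=ax2)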
Pandas' plot method can take in a Matplotlib axes object and direct the resulting plot into that subplot.
# If you want two plots, one above the other.
nrows = 2
ncols = 1
# Here axes contains 2 objects representing the two subplots
fig, axes = plt.subplots(nrows, ncols, figsize=(8, 4))
# Below, "my_data_frame" is the name of your Pandas dataframe.
# Change it accordingly for the code to work.
# Plot first subplot
# This counts the number of times each country appears and plots
# that as a bar chart in the first subplot represented by axes[0].
# kind is passed as a keyword; recent pandas versions reject it positionally.
my_data_frame['Country'].value_counts().plot(kind='barh', ax=axes[0])
# Plot second subplot
my_data_frame['Company'].value_counts().plot(kind='barh', ax=axes[1])
I find DataFrame.plot.hist to be amazingly convenient, but I cannot find a solution in this case.
I want to plot the distribution of many columns in the dataset. The problem is that pandas retains the same scale on all x axes, rendering most of the plots useless. Here is the code I'm using:
X.plot.hist(subplots=True, layout=(13, 6), figsize=(20, 45), bins=50, sharey=False, sharex=False)
plt.show()
And here's a section of the result:
It appears that the issue is that pandas uses the same bins for all the columns, irrespective of their values. Is there a convenient solution in pandas or am I forced to do it by hand?
I standardized the data (zero mean and unit variance) and the result improved a little, but it's still not acceptable.
There are a couple of options, here is the code and output:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Dummy data - value ranges differ a lot between columns
X = pd.DataFrame()
for i in range(18):
X['COL0{0}'.format(i+38)]=(2**i)*np.random.random(1000)
# Method 1 - just using the hist function to generate each plot
X.hist(layout=(3, 6), figsize=(20, 10), sharey=False, sharex=False, bins=50)
plt.suptitle('Method 1')  # plt.title would only label the last subplot
plt.show()
# Method 2 - generate each plot separately
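# plt.cm.spectral was removed from newer Matplotlib releases; nipy_spectral is its replacement
cols = plt.cm.nipy_spectral(np.arange(1,255,13))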
fig, axes = plt.subplots(3,6,figsize=(20,10))
for index, column in enumerate(X.columns):
ax = axes.flatten()[index]
ax.hist(X[column],bins=50, label=column, fc=cols[index])
ax.legend(loc='upper right')
ax.set_ylim((0,1.2*ax.get_ylim()[1]))
fig.suptitle('Method 2')
plt.show()
The first plot:
The second plot:
I would definitely recommend the second method as you have much more control over the individual plots, for example you can change the axes scales, labels, grid parameters, and almost anything else.
I couldn't find anything that would allow you to modify the original plot.hist bins to accept individually calculated bins.
I hope this helps!
I tried to make the title as clear as possible although I am not sure it is completely limpid.
I have three series of data (number of events over time). I would like to make a figure with subplots where the three time series are represented. You will find attached the best I could come up with. The last time series is significantly shorter, and that's why it is not visible here.
I'm also adding the corresponding code so you can better understand what I'm trying to do and advise me on the proper/smart way to do it.
import numpy as np
import matplotlib.pyplot as plt
x=np.genfromtxt('nbr_lig_bound1.dat')
x1=np.genfromtxt('nbr_lig_bound2.dat')
x2=np.genfromtxt('nbr_lig_bound3.dat')
# doing so because imshow requires a 2D array
# best way I found, and probably not the proper way to get it done
x=np.expand_dims(x, axis=0)
x=np.vstack((x,x))
x1=np.expand_dims(x1, axis=0)
x1=np.vstack((x1,x1))
x2=np.expand_dims(x2, axis=0)
x2=np.vstack((x2,x2))
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(6,6), sharex=True)
fig.subplots_adjust(hspace=0.001) # this seems to have no effect
# hoping that this would compensate for sharex shrinking my X range to
# the shortest array
ax[0].set_xlim(1,24)
ax[1].set_xlim(1,24)
ax[2].set_xlim(1,24)
p1=ax[0].imshow(x1[:,::10000], cmap='autumn_r')
p2=ax[1].imshow(x2[:,::10000], cmap='autumn_r')
p3=ax[2].imshow(x[:,::10000], cmap='autumn')
Here is what I could reach so far:
and here is a sketch of what I wish to have, since I could not find an example on the web. In short, I would like to remove the blank spaces around the plotted data in the two upper graphs. And as a more general question, I would like to know if imshow is the best way of obtaining such a plot (cf. intended results below).
Using fig.subplots_adjust(hspace=0) sets the vertical (height) space between subplots to zero but doesn't adjust the vertical space within each subplot. By default, imshow takes its aspect ratio from the rc parameter image.aspect, which is usually set so that pixels are square and images are reproduced accurately. To change this, pass aspect='auto' and adjust the ylim of your axes accordingly.
For example:
# you don't need all the `expand_dims` and `vstack`ing. Use `reshape`
x0 = np.linspace(5, 0, 25).reshape(1, -1)
x1 = x0**6
x2 = x0**2
fig, axes = plt.subplots(3, 1, sharex=True)
fig.subplots_adjust(hspace=0)
for ax, x in zip(axes, (x0, x1, x2)):
ax.imshow(x, cmap='autumn_r', aspect='auto')
ax.set_ylim(-0.5, 0.5) # alternatively pass extent=[0, 1, 0, 24] to imshow
ax.set_xticks([]) # remove all xticks
ax.set_yticks([]) # remove all yticks
plt.show()
yields
To add a colorbar, I recommend looking at this answer which uses fig.add_axes() or looking at the documentation for AxesDivider (which I personally like better).
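As a rough illustration of the fig.add_axes() route, the sketch below reserves space on the right and draws one colorbar there. The data, the shared vmin/vmax of 0 and 1, and the position numbers passed to add_axes are all assumptions chosen for the example; a single colorbar is only meaningful when every strip uses the same color scale.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: three 1 x 25 strips that all lie in [0, 1]
rows = [np.linspace(0, 1, 25).reshape(1, -1) ** p for p in (1, 2, 3)]

fig, axes = plt.subplots(3, 1, sharex=True)
fig.subplots_adjust(hspace=0, right=0.85)  # leave room on the right for the colorbar

for ax, row in zip(axes, rows):
    # Same vmin/vmax everywhere so one colorbar describes all strips
    im = ax.imshow(row, cmap='autumn_r', aspect='auto', vmin=0, vmax=1)
    ax.set_yticks([])

# [left, bottom, width, height] in figure coordinates (illustrative values)
cax = fig.add_axes([0.88, 0.11, 0.03, 0.77])
fig.colorbar(im, cax=cax)
```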
How can I determine whether a subplot (AxesSubplot) is empty or not? I would like to deactivate empty axes of empty subplots and remove completely empty rows.
For instance, in this figure only two subplots are filled and the remaining subplots are empty.
import matplotlib.pyplot as plt
# create figure with 3 rows and 7 cols; squeeze=False keeps axes a 2D array
fig, axes = plt.subplots(3, 7, squeeze=False)
x = [1,2]
y = [3,4]
# plot stuff only in two SubAxes; other axes are empty
axes[0][1].plot(x, y)
axes[1][2].plot(x, y)
# save figure
plt.savefig('image.png')
Note: It is mandatory to set squeeze to False.
Basically I want a sparse figure. Some subplots in rows can be empty, but they should be deactivated (no axes must be visible). Completely empty rows must be removed and must not be set to invisible.
You can use the fig.delaxes() method:
import matplotlib.pyplot as plt
# create figure with 3 rows and 7 cols; squeeze=False keeps axes a 2D array
fig, axes = plt.subplots(3, 7, squeeze=False)
x = [1,2]
y = [3,4]
# plot stuff only in two SubAxes; other axes are empty
axes[0][1].plot(x, y)
axes[1][2].plot(x, y)
# delete empty axes
for i in [0, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20]:
fig.delaxes(axes.flatten()[i])
# save figure
plt.savefig('image.png')
plt.show(block=False)
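The hard-coded index list can probably be avoided by asking each Axes whether anything has been drawn on it. A minimal sketch, assuming Axes.has_data() is an adequate test for "empty" in your case (it checks for plotted artists such as lines, images, patches, and collections); it deletes empty subplots but does not collapse empty rows:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so the sketch runs headless
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 7, squeeze=False)
x = [1, 2]
y = [3, 4]
axes[0][1].plot(x, y)
axes[1][2].plot(x, y)

# Remove every subplot that has nothing drawn on it
for ax in axes.flatten():
    if not ax.has_data():
        fig.delaxes(ax)
```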
One way of achieving what you require is to use matplotlib's subplot2grid feature. With it you can set the total size of the grid (3, 7 in your case) and choose to plot data only in certain cells of this grid. I have adapted your code below to give an example:
import matplotlib.pyplot as plt
x = [1,2]
y = [3,4]
# plt.figure() creates an empty figure; plt.subplots() would also add a full-size Axes
fig = plt.figure()
ax1 = plt.subplot2grid((3, 7), (0, 1))
ax2 = plt.subplot2grid((3, 7), (1, 2))
ax1.plot(x,y)
ax2.plot(x,y)
plt.show()
This gives the following graph:
EDIT:
Subplot2grid, in effect, does give you a list of axes. In your original question you use fig, axes = plt.subplots(3, 7, squeeze=False) and then use axes[0][1].plot(x, y) to specify which subplot your data will be plotted in. That is the same as what subplot2grid does, except that it only shows the subplots you have explicitly created.
So take ax1 = plt.subplot2grid((3, 7), (0, 1)) in my answer above. Here I have specified the shape of the 'grid', which is 3 by 7. That means I could have 21 subplots in that grid if I wanted, exactly like your original code. The difference is that your code displays all the subplots whereas subplot2grid does not. The (3, 7) in ax1 = ... above specifies the shape of the whole grid and the (0, 1) specifies where in that grid the subplot will be shown.
You can place a subplot at any position you like within that 3x7 grid. You can also fill all 21 cells of the grid with subplots that have data in them, if required, by going all the way up to ax21 = plt.subplot2grid(...).