Histograms of grouped data - python

I'm fairly new to Python and getting my head around this has been really hard.
I have code like this:
df = pd.read_csv("files/athena-query-1.txt", sep=";")
ax = df.hist(column="distance", range=[0.0, 0.5], bins=100, by="gate_id")
All I want is to see a distance distribution per gate on separate charts. If there are 400 gate_id values, I want to see 400 distribution plots.
It tells me that ax is a collection of AxesSubplot objects. When I try to plot this, I get only one graph, and it is unreadable. My guess is that it tries to put everything on a single chart (a Figure?).

EDIT:
I reproduced a minimal example of what I think you might mean:
import pandas as pd
import scipy.stats
# Create a DataFrame with 100 normally distributed random values for 'distance',
# and distribute the gate_ids (1, 2, 3, 4) evenly among them:
df = pd.DataFrame({'distance': scipy.stats.norm.rvs(size=100), 'gate_id': 25*[1,2,3,4]})
df.hist(column='distance', range=[0.0, 0.5], bins=100, by='gate_id')
This yields a figure with 4 subplots, corresponding to 'gate_id':
However, if I try 400 gates as you mentioned, the figure isn't even shown, probably because it's simply not big enough to hold 400 subplots. This is why I recommend the first solution I gave below.
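For a moderate number of gates, passing figsize and layout to df.hist may keep the grid readable. The following is only a sketch under assumptions: the data are random stand-ins, and the 3x4 layout and 16x9 inch figure size are arbitrary choices rather than values from the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data: 12 gates with 10 distances each
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "distance": rng.uniform(0.0, 0.5, size=120),
    "gate_id": np.tile(np.arange(1, 13), 10),
})

# layout=(rows, cols) controls the subplot grid; figsize enlarges the figure
axes = df.hist(column="distance", by="gate_id", range=[0.0, 0.5], bins=50,
               layout=(3, 4), figsize=(16, 9))
```

With several hundred gates even a large grid is hard to read, so separate figures are probably still the more practical route.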
ORIGINAL:
If you want 400 separate distribution plots, then why not create 400 figures using matplotlib?
from matplotlib import pyplot as plt
for i in range(400):
    fig, ax = plt.subplots()
    ax.plot(<dataframe['x']>, <dataframe['y']>)
or you can also try to plot a huge figure with many subplots, such as
fig, ( (ax1, ax2, ax3, ...<fill up here>..., ax10), (ax11, ..., ax20), ..., (ax91, ..., ax100)) = plt.subplots(nrows=10, ncols=10)
ax1.bar(<dataframe['x']>,<dataframe['y']>)
...
ax100.bar(<dataframe['x']>,<dataframe['y']>)
This is only for 100 subplots; I'm not sure whether 400 is simply too many.
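A minimal sketch of the one-figure-per-gate idea, using groupby so that each gate's rows are selected automatically. The DataFrame here is a random stand-in with only 4 gates, and the figures dictionary is a hypothetical helper for keeping track of the results; neither comes from the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data: 4 gates with 25 distances each
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "distance": rng.uniform(0.0, 0.5, size=100),
    "gate_id": np.tile([1, 2, 3, 4], 25),
})

figures = {}
for gate_id, group in df.groupby("gate_id"):
    # One new figure per gate, holding the histogram of that gate only
    fig, ax = plt.subplots()
    ax.hist(group["distance"], range=(0.0, 0.5), bins=100)
    ax.set_title(f"gate_id = {gate_id}")
    ax.set_xlabel("distance")
    figures[gate_id] = fig
```

With hundreds of gates it is probably better to save each figure with fig.savefig and then call plt.close(fig) inside the loop, since matplotlib warns when many figures are kept open at once.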

Related

Problems with plt.subplots, which should be the best option?

I'm new to both Python and Stack Overflow... I come from an R/ggplot2 background and I am still getting stuck with Python. I don't understand why I get an empty plot before my figure when using matplotlib... I just have a basic pandas series and I want to plot some of the rows in one subplot and the rest in the other (however, my display is terrible and I don't know why or how to fix it). Thank you in advance!
df = organism_df.T
fig, (ax1,ax2) = plt.subplots(nrows=1,ncols=2,figsize=(5,5))
ax1 = df.iloc[[0,2,3,-1]].plot(kind='bar')
ax1.get_legend().remove()
ax1.set_title('Number of phages/bacteria interacting vs actual DB')
ax2 = df.iloc[[1,4,5,6,7]].plot(kind='bar')
ax2.get_legend().remove()
ax2.set_title('Number of different taxonomies with interactions')
plt.tight_layout()
The pandas plot method needs the Axes passed as an argument, e.g., df.plot(ax=ax1, kind='bar'). In your example, the figure (consisting of ax1 and ax2) is created first; each call to plot then creates another figure, and the assignment overwrites the original ax1 (and later ax2) object.
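Applied to the code in the question, that could look roughly like the sketch below. The DataFrame is a made-up stand-in for organism_df.T (its column name and values are assumptions), and legend=False is used in place of removing the legend afterwards.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for organism_df.T: 8 rows, one numeric column
df = pd.DataFrame(
    {"count": [5, 3, 8, 2, 7, 1, 4, 6]},
    index=[f"row{i}" for i in range(8)],
)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

# Passing ax= draws into the existing Axes instead of opening a new figure
df.iloc[[0, 2, 3, -1]].plot(kind="bar", ax=ax1, legend=False)
ax1.set_title("Number of phages/bacteria interacting vs actual DB")

df.iloc[[1, 4, 5, 6, 7]].plot(kind="bar", ax=ax2, legend=False)
ax2.set_title("Number of different taxonomies with interactions")

fig.tight_layout()
```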

Seaborn histplot uses weird y axis limits?

I'm iterating through all columns of my df to plot their densities to see if and how I need to transform/normalize my data. I'm using Seaborn and this code:
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(16,40))
fig.tight_layout() # otherwise the plots overlapped each other and I couldn't see the column names
for i, column in enumerate(df.columns):
    sns.histplot(df[column], ax=axes[i//n_cols, i%n_cols], kde=True, legend=True, fmt='g')
This results in a mostly okay graph, however the scaling of the y axis is waaay too big in some cases:
City 3 and 4 are just fine, however, the highest Count for City 4 is at around 200, yet the plot scales y until 10 000, which makes the data hard to interpret. The x axis also goes way beyond where it should, as the highest cost is at about 1000000, but the plot goes until 25000000. When I plot City 4 separately and force a ylim of 200 and xlim of 1000000 I get a much more understandable plot:
Why is the y axis (and actually, the x axis also) scaled so weirdly, and how can I change my code to scale it down so that I don't get a ylim much higher than the actually displayed data?
Thank you!
Set the sharey parameter of plt.subplots to False.
Each subplot then scales its y axis to the maximum of its own data.
Example:
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(16,40), sharey=False)
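A small sketch of what independent y axes look like in practice. It uses matplotlib's own ax.hist rather than sns.histplot to stay dependency-free, and the two data arrays are invented stand-ins for columns with very different counts.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
small = rng.normal(size=200)       # hypothetical column with low counts
large = rng.normal(size=20_000)    # hypothetical column with high counts

# sharey=False lets each panel pick y limits from its own data
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3), sharey=False)
axes[0].hist(small, bins=30)
axes[1].hist(large, bins=30)

# If a panel is still stretched, its limits can be set explicitly per Axes
axes[0].set_xlim(-4, 4)
```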

Matplotlib: How to extract vlines and fill_between data from ax objects?

I have a figure comprised of two x/y curves, a vline and a fill_between in Matplotlib.
My ultimate aim is displaying this figure along with 2 other figures as subplots in a 4th new figure. And I really want to avoid creating all three figures from scratch again just for this new figure with subplots.
So, I'm looking to create a 1x3 figure (subplots, 1 row, 3 columns) like this:
[fig1, fig2, fig3]
I'm almost there. I've so far been able to extract the two x/y curves from the original figure's ax object. Moving through a for loop, I've been able to rebuild most of the three figures as subplots in my new figure:
(ax_a, ax_b, ax_c are ax objects belonging to the three original figures I want to add as subplots to my new figure)
fig = plt.figure(figsize = (16,4))
ax1 = fig.add_subplot(1,3,1)
ax2 = fig.add_subplot(1,3,2)
ax3 = fig.add_subplot(1,3,3)
for ax, ref in zip(fig.axes, [ax_a, ax_b, ax_c]):
    for line in ref.lines:
        x = line.get_xdata()
        y = line.get_ydata()
        ax.plot(x,y)
    ax.set_xlabel(ref.get_xlabel())
    ax.set_ylabel(ref.get_ylabel())
This actually creates a 1x3 grid of my original 3 plots. It's almost perfect.
What's missing are the fill_between component and the vlines component. If I could extract those objects from ax_a, ax_b and ax_c, I'd be done. But I can't find a way to do that.
Is there a way? If not, how would you solve a problem like this?
Thanks so much in advance for any advice offered.
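One possible approach, offered as a sketch rather than a confirmed answer: vlines and fill_between store their artists in ax.collections (as a LineCollection and a PolyCollection respectively), so their geometry can be read back and rebuilt on the new Axes. The source figure below is a made-up stand-in for one of the original figures.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection, PolyCollection

# Hypothetical stand-in for one of the original figures
x = np.linspace(0, 1, 50)
fig_src, ax_src = plt.subplots()
ax_src.plot(x, x**2)
ax_src.vlines(0.5, 0, 1, colors="red")
ax_src.fill_between(x, 0, x**2, alpha=0.3)

fig_new, ax_new = plt.subplots()
for coll in ax_src.collections:
    if isinstance(coll, LineCollection):
        # vlines: each segment is an array of (x, y) points
        ax_new.add_collection(
            LineCollection(coll.get_segments(), colors=coll.get_edgecolor()))
    elif isinstance(coll, PolyCollection):
        # fill_between: rebuild the polygons from the stored path vertices
        verts = [path.vertices for path in coll.get_paths()]
        ax_new.add_collection(
            PolyCollection(verts, facecolors=coll.get_facecolor()))
ax_new.autoscale_view()  # collections may not rescale the view on their own
```

This copies only geometry and colors; other properties such as line widths or z-order would need to be transferred in the same way if they matter.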

How to print multiple plots together in python?

I am trying to print about 42 plots in 7 rows, 6 columns, but the printed output in jupyter notebook, shows all the plots one under the other. I want them in (7,6) format for comparison. I am using matplotlib.subplot2grid() function.
Note: I do not get any error, and my code works, however the plots are one under the other, vs being in a grid/ matrix form.
Here is my code:
def draw_umap(n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean', title=''):
    fit = umap.UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        n_components=n_components,
        metric=metric
    )
    u = fit.fit_transform(df);
    plots = []
    plt.figure(0)
    fig = plt.figure()
    fig.set_figheight(10)
    fig.set_figwidth(10)
    for i in range(7):
        for j in range(6):
            plt.subplot2grid((7,6), (i,j), rowspan=7, colspan=6)
            plt.scatter(u[:,0], u[:,1], c= df.iloc[:,0])
            plt.title(title, fontsize=8)
n=range(7)
d=range(6)
for n in n_neighbors:
    for d in dist:
        draw_umap(n_neighbors=n, min_dist=d, title="n_neighbors={}".format(n) + " min_dist={}".format(d))
I did refer to this post to get the plots in a grid and followed the code.
I also referred to this post, and modified my code for size of the fig.
Is there a better way to do this using Seaborn?
What am I missing here? Please help!
Both questions that you have linked contain solutions that seem more complicated than necessary. Note that subplot2grid is useful only if you want to create subplots of varying sizes, which I understand is not your case. Also note that, according to the docs, using GridSpec (as demonstrated in the GridSpec demo) is generally preferred, and I would also recommend it only if you want to create subplots of varying sizes.
The simple way to create a grid of equal-sized subplots is to use plt.subplots which returns an array of Axes through which you can loop to plot your data as shown in this answer. That solution should work fine in your case seeing as you are plotting 42 plots in a grid of 7 by 6. But the problem is that in many cases you may find yourself not needing all the Axes of the grid, so you will end up with some empty frames in your figure.
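As a rough sketch of the plt.subplots approach for the 7 by 6 case: the random arrays below are hypothetical stand-ins for the 42 UMAP embeddings, and the panel titles are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

n_rows, n_cols = 7, 6
rng = np.random.default_rng(1)
# Hypothetical stand-in: one (50, 2) array of points per panel
embeddings = rng.normal(size=(n_rows * n_cols, 50, 2))

fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(18, 21))
# axes is a 7x6 array of Axes; axes.flat walks it row by row
for k, (ax, emb) in enumerate(zip(axes.flat, embeddings)):
    ax.scatter(emb[:, 0], emb[:, 1], s=5)
    ax.set_title(f"panel {k}", fontsize=8)
fig.tight_layout()
```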
Therefore, I suggest using a more general solution that works in any situation by first creating an empty figure and then adding each Axes with fig.add_subplot as shown in the following example:
import numpy as np # v 1.19.2
import matplotlib.pyplot as plt # v 3.3.4
# Create sample dataset
rng = np.random.default_rng(seed=1) # random number generator
nvars = 8
nobs = 50
xs = rng.uniform(size=(nvars, nobs))
ys = rng.normal(size=(nvars, nobs))
# Create figure with appropriate space between subplots
fig = plt.figure(figsize=(10, 8))
fig.subplots_adjust(hspace=0.4, wspace=0.3)
# Plot data by looping through arrays of variables and list of colors
colors = plt.get_cmap('tab10').colors
for idx, x, y, color in zip(range(len(xs)), xs, ys, colors):
    ax = fig.add_subplot(3, 3, idx+1)
    ax.scatter(x, y, color=color)
This could be done in seaborn as well, but I would need to see what your dataset looks like to provide a solution relevant to your case.
You can find a more elaborate example of this approach in the second solution in this answer.

How to draw multi-series histogram from several pd.Series objects?

I want to draw a multi-series histogram chart that looks like this:
[image: multi-series histogram chart]
I'm trying to add this to an existing Jupyter notebook that already had code in place to establish a double chart:
fig, (ax, ax2) = plt.subplots(2,1)
The existing code uses the style where the plotting is done using methods on the data objects themselves. For example, here's some of the existing code that plots line charts in one of the existing subplots:
ax = termstruct[i].T.plot.line(ax=ax, c=linecolor,
dashes=dash, grid=True, linewidth=width, figsize=FIGURE_SIZE)
The main point I'm making here is that the way the plotting is achieved is to use the .plot.line method on the Pandas pd.Series (termstruct). This is not at all consistent with the examples and tutorials I was able to find online for drawing charts with pyplot, but it works and it establishes a framework I'm trying to work within.
So I started by taking the obvious step of adding a 3rd subplot for my histogram by changing the subplots call to plt from above:
fig, (ax, ax2, ax3) = plt.subplots(3,1)
My data are in four separate pd.Series objects, where each one represents a series that should map to one of the colors in the chart example at the top of this post. But when I try following the same general coding style of using methods on the data objects to do the plotting, I always seem to wind up with the X and Y axes opposite what I want, like this:
[image: what I wound up with!]
The code that generated the above chart was:
ax3 = NakedPNLperMo.plot.hist(ax=ax3,grid=True, figsize=FIGURE_SIZE)
ax3 = H9PNLperMo.plot.hist(ax=ax3, grid=True, figsize=FIGURE_SIZE)
ax3 = H12PNLperMo.plot.hist(ax=ax3, grid=True, figsize=FIGURE_SIZE)
ax3 = H15PNLperMo.plot.hist(ax=ax3, grid=True, figsize=FIGURE_SIZE)
NakedPNLperMo and the other 3 pd.Series objects are full of arcane financial symbols, but a simplified version of their contents (to make this clear) would be:
NakedPNLperMo = pd.Series(data=[1.2,3.4,5.6,7.8,-2.3,-4.6],
                          index=['Month 1','Month 2','Month 3','Month 4',
                                 'Month 5','Month 6'])
My intention/goal is that the data are plotted on the Y axis and the index values ('Month 1', etc.) are like columns across the x axis, but I can't seem to get that output no matter what I try.
Clearly the problem is the axes are swapped. But when I went looking for how to fix that, I couldn't find any examples online that follow this approach of drawing the chart using methods on the data objects. Everything I found in online tutorials was using a bunch of calls to plt to set up the charts. And more to the point, I couldn't see any way to follow the style in those examples and still draw the chart as a 3rd subplot alongside the 2 subplots already defined by this program.
My first (and foremost) question is what I SHOULD be trying next... Does it make sense to figure out how to change the parameters of [data-object].plot.xxx to get the axes the way I need them, or would it make more sense to follow the completely different style of making a series of calls to plt to design and draw the charts? The former would be consistent with what I have, but I can't find any online help for using that coding style. (Should I infer that it's a deprecated style of doing things?)
If the answer to the above is to take the approach of calling plt like the online examples all seem to show, how can I use the ax3 that ties this chart into the existing subplots? If the answer to the above is to stick with the approach of [data-object].plot.xxx, where can I find help on using that style? All the online examples I could find followed a different coding style.
And of course the most obvious question: How do I swap the axes so the chart looks right? :)
Thanks!
I hope this code helps you. I have created three series to show how you can do it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#jupyter notebook only
%matplotlib inline
s1 = pd.Series(data=[1.2,3.4,5.6,7.8,-2.3,-4.6],
index=['Month 1','Month 2','Month 3','Month 4',
'Month 5','Month 6'])
s2=pd.Series(data=[5,3.4,7.4,-5.1,-2.3,3],
index=['Month 1','Month 2','Month 3','Month 4',
'Month 5','Month 6'])
s3=pd.Series(data=[5,2,-2.4,0,1,3],
index=['Month 1','Month 2','Month 3','Month 4',
'Month 5','Month 6'])
df=pd.concat([s1,s2,s3],axis=1)
df.columns=['s1','s2','s3']
print(df)
ax=df.plot(kind='bar',figsize=(10,10),fontsize=15)
#------------------------------------------------#
plt.xticks(rotation=-45)
#grid on
plt.grid()
# set y=0
ax.axhline(0, color='black', lw=1)
#change size of legend
ax.legend(fontsize=20)
#hiding upper and right axis layout
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
#changing the thickness
ax.spines['bottom'].set_linewidth(3)
ax.spines['left'].set_linewidth(3)
Output:
          s1   s2   s3
Month 1  1.2  5.0  5.0
Month 2  3.4  3.4  2.0
Month 3  5.6  7.4 -2.4
Month 4  7.8 -5.1  0.0
Month 5 -2.3 -2.3  1.0
Month 6 -4.6  3.0  3.0
Figure
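To tie this into the existing three-row layout from the question, the same grouped bar chart can presumably be drawn into ax3 by passing ax=ax3. This is a sketch that reuses the sample series from above; the figure size and rotation are arbitrary choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

months = [f"Month {i}" for i in range(1, 7)]
s1 = pd.Series([1.2, 3.4, 5.6, 7.8, -2.3, -4.6], index=months)
s2 = pd.Series([5, 3.4, 7.4, -5.1, -2.3, 3], index=months)
s3 = pd.Series([5, 2, -2.4, 0, 1, 3], index=months)

# One column per series, so each month gets a group of three bars
df = pd.concat([s1, s2, s3], axis=1, keys=["s1", "s2", "s3"])

fig, (ax, ax2, ax3) = plt.subplots(3, 1, figsize=(8, 12))
# plot.bar puts the index on the x axis and the values on the y axis;
# plot.hist would instead count how often each value occurs
df.plot.bar(ax=ax3, grid=True, rot=45)
ax3.axhline(0, color="black", lw=1)
```

The [data-object].plot.xxx style and the ax.method style can be mixed freely, because the pandas call returns the same matplotlib Axes that was passed in.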
