I have a dataframe with over 100 samples and 13 different features (12 numeric, one binary categorical [called 'Compare_this_one' below]). I am trying to quickly pull out a series of subplots comparing all of the features' statistics across the binary categories. The below code does most of what I want. I am just struggling with the aesthetic editing.
How do I remove the redundant x-axis labels (or all of them)?
How can I increase the title size in each subplot? I already adjusted all of the font sizes with rcParams (which worked fine for all my other plots), but it doesn't seem to have affected this plot.
How do I increase the padding between each plot? A couple of my y-axes have larger values, and they overlap with plots to the left.
Example code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(100, 12), columns=list('ABCDEFGHIJKL'))
df['Compare_this_one'] = np.random.choice(range(1, 3), df.shape[0])
fig, ax_test = plt.subplots(4,3, sharex=True)
bp = df.boxplot(by='Compare_this_one',ax=ax_test,layout=(4,3))
plt.show()
Thanks, I really appreciate the help!
The bp var is a list of the axes of the subplots. You can set the label of each of these to your liking:
[ax.set_xlabel('') for ax in bp]
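The same loop can also cover the other two questions (title size and padding). The following is only a sketch, assuming the demo data from the question; the font size of 14, the figure size, and the spacing values of 0.4 are arbitrary choices, not values from the original post:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same kind of demo data as in the question
df = pd.DataFrame(np.random.randn(100, 12), columns=list("ABCDEFGHIJKL"))
df["Compare_this_one"] = np.random.choice(range(1, 3), df.shape[0])

fig, ax_test = plt.subplots(4, 3, sharex=True, figsize=(10, 12))
bp = df.boxplot(by="Compare_this_one", ax=ax_test, layout=(4, 3))

for ax in np.ravel(bp):
    ax.set_xlabel("")                          # drop the repeated x-axis label
    ax.set_title(ax.get_title(), fontsize=14)  # re-set the same title at a larger size

# Widen the horizontal and vertical gaps between the subplots
fig.subplots_adjust(wspace=0.4, hspace=0.4)
```

Call plt.show() afterwards to display the figure. fig.tight_layout() is an alternative to subplots_adjust if you prefer automatic spacing.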
For research purposes at my university, I need to create a stacked bar chart for speech data. I would like to represent the hours of speech on the y-axis and the frequency on the x-axis. The speech comes from different components, hence the stacked part of the chart. The data resides in a Pandas dataframe, which has a lot of columns, but the important ones are "component", "hours" and "ps_med_freq", which are used in the graph.
A simplified view of the DF (it has 6.2k rows and 120 columns, a-k components):
| component | filename | ps_med_freq (rounded to integer) | hours (length) | ... |
|---|---|---|---|---|
| a | fn0001_ps | 230 | 0.23 | |
| b | fn0002_ps | 340 | 0.12 | |
| c | fn003_ps | 278 | 0.09 | |
I have already tried this with matplotlib, seaborn or just the plot method from the Pandas dataframe itself. None seem to work properly.
A snippet of seaborn code I have tried:
sns.barplot(data=meta_dataframe, x='ps_med_freq', y='hours', hue='component', dodge=False)
And basically all variations of this as well.
Below you can see one of the most "viable" results I've had so far:
[image: example of failed graph]
It seems to have a lot of inexplicable grey blobs, which I first attributed to the large dataset, but if I just plot it as a histogram and count the frequencies instead of showing them by hour, it works perfectly fine. Does anyone know a solution to this?
Thanks in advance!
P.S.: Yes, I realise this is a huge dataset and at first sight, the graph seems useless with that much data on it, but matplotlib has interactive graphs where you can zoom etc., where this kind of graph becomes useful for my purpose.
With sns.barplot you're creating a bar for each individual frequency value. You'll probably want to group similar frequencies together, as with sns.histplot(..., multiple='stack'). If you want a lot of detail, you can increase the number of bins for the histogram. Note that sns.barplot never creates stacks, it would just plot each bar transparently on top of the others.
You can create a histogram, using the hours as weights, so they get summed.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# create some suitable random test data
np.random.seed(20230104)
component_prob = np.random.uniform(0.1, 2, 7)
component_prob /= component_prob.sum()
df = pd.DataFrame({'component': np.random.choice([*'abcdefg'], 6200, p=component_prob),
                   'ps_med_freq': (np.random.normal(0.05, 1, 6200).cumsum() + 200).astype(int),
                   'hours': np.random.randint(1, 39, 6200) * .01})
# create bins for every range of 10, suitably rounded, and shifted by 0.001 to avoid floating point roundings
bins = np.arange(df['ps_med_freq'].min() // 10 * 10 - 0.001, df['ps_med_freq'].max() + 10, 10)
plt.figure(figsize=(16, 5))
ax = sns.histplot(data=df, x='ps_med_freq', weights='hours', hue='component', palette='bright',
                  multiple='stack', bins=bins)
# sns.move_legend(ax, loc='upper left', bbox_to_anchor=(1.01, 1.01)) # legend outside
sns.despine()
plt.tight_layout()
plt.show()
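To see what the weights do, here is a minimal sketch on a tiny made-up sample (the five rows and the two bins are hypothetical, chosen only for illustration). It uses np.histogram directly: without weights each bin holds a count of rows, with weights it holds the sum of the hours of those rows.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample with the same column names as above
df = pd.DataFrame({
    "ps_med_freq": [201, 204, 215, 218, 219],
    "hours": [0.10, 0.20, 0.05, 0.15, 0.30],
})

bins = np.arange(200, 221, 10)  # bin edges: 200, 210, 220

# Unweighted: number of rows that fall into each bin
counts, _ = np.histogram(df["ps_med_freq"], bins=bins)

# Weighted: the hours of the rows in each bin are summed
hours_per_bin, _ = np.histogram(df["ps_med_freq"], bins=bins, weights=df["hours"])

print(counts)
print(hours_per_bin)
```

The first bin contains two rows and the second three, while the weighted version reports the total hours in each bin, which is what the stacked histogram above draws per component.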
I am trying to include 2 seaborn countplots with different scales on the same plot but the bars display as different widths and overlap as shown below. Any idea how to get around this?
Setting dodge=False, doesn't work as the bars appear on top of each other.
The main problem of the approach in the question, is that the first countplot doesn't take hue into account. The second countplot won't magically move the bars of the first. An additional categorical column could be added, only taking on the 'weekend' value. Note that the column should be explicitly made categorical with two values, even if only one value is really used.
Things can be simplified a lot by starting from the original dataframe, which supposedly already has a column 'is_weekend'. Creating the twinx ax beforehand allows writing a loop (so the call to sns.countplot() is written only once, with parameters).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
sns.set_style('dark')
# create some demo data
data = pd.DataFrame({'ride_hod': np.random.normal(13, 3, 1000).astype(int) % 24,
'is_weekend': np.random.choice(['weekday', 'weekend'], 1000, p=[5 / 7, 2 / 7])})
# now, make 'is_weekend' a categorical column (not just strings)
data['is_weekend'] = pd.Categorical(data['is_weekend'], ['weekday', 'weekend'])
fig, ax1 = plt.subplots(figsize=(16, 6))
ax2 = ax1.twinx()
for ax, category in zip((ax1, ax2), data['is_weekend'].cat.categories):
    sns.countplot(data=data[data['is_weekend'] == category], x='ride_hod', hue='is_weekend', palette='Blues', ax=ax)
    ax.set_ylabel(f'Count ({category})')
ax1.legend_.remove() # both axes got a legend, remove one
ax1.set_xlabel('Hour of Day')
plt.tight_layout()
plt.show()
Alternatively, set the x tick positions and labels by hand with plt.xticks(ticks, labels).
I find DataFrame.plot.hist to be amazingly convenient, but I cannot find a solution in this case.
I want to plot the distribution of many columns in the dataset. The problem is that pandas retains the same scale on all x axes, rendering most of the plots useless. Here is the code I'm using:
X.plot.hist(subplots=True, layout=(13, 6), figsize=(20, 45), bins=50, sharey=False, sharex=False)
plt.show()
And here's a section of the result:
It appears that the issue is that pandas uses the same bins for all the columns, irrespective of their values. Is there a convenient solution in pandas or am I forced to do it by hand?
I centered the data (zero mean and unit variance) and the result improved a little, but it's still not acceptable.
There are a couple of options, here is the code and output:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Dummy data - value ranges differ a lot between columns
X = pd.DataFrame()
for i in range(18):
    X['COL0{0}'.format(i+38)]=(2**i)*np.random.random(1000)
# Method 1 - just using the hist function to generate each plot
X.hist(layout=(3, 6), figsize=(20, 10), sharey=False, sharex=False, bins=50)
plt.title('Method 1')
plt.show()
# Method 2 - generate each plot separately
# nipy_spectral replaces the old 'spectral' colormap, which was removed from matplotlib
cols = plt.cm.nipy_spectral(np.arange(1,255,13))
fig, axes = plt.subplots(3,6,figsize=(20,10))
for index, column in enumerate(X.columns):
    ax = axes.flatten()[index]
    ax.hist(X[column],bins=50, label=column, fc=cols[index])
    ax.legend(loc='upper right')
    ax.set_ylim((0,1.2*ax.get_ylim()[1]))
fig.suptitle('Method 2')
plt.show()
The first plot:
The second plot:
I would definitely recommend the second method as you have much more control over the individual plots, for example you can change the axes scales, labels, grid parameters, and almost anything else.
I couldn't find anything that would allow you to modify the original plot.hist bins to accept individually calculated bins.
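With the second method, individually calculated bins are straightforward. The following is a rough sketch, assuming dummy data similar to the above (six columns instead of eighteen, and hypothetical column names); np.histogram_bin_edges computes the edges from each column's own range:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Dummy data whose value ranges differ a lot between columns
rng = np.random.default_rng(0)
X = pd.DataFrame({f"COL{i:02d}": (2 ** i) * rng.random(1000) for i in range(6)})

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
edges_per_column = {}
for ax, column in zip(axes.flat, X.columns):
    values = X[column].to_numpy()
    # 50 equal-width bins spanning this column's own min and max
    edges = np.histogram_bin_edges(values, bins=50)
    edges_per_column[column] = edges
    ax.hist(values, bins=edges)
    ax.set_title(column)
fig.tight_layout()
```

Any other rule for choosing the edges (for example percentiles, or bins='auto') can be swapped in at the same place in the loop.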
I hope this helps!
I have two or three csv files with the same header and would like to draw the histograms for each column overlaying one another on the same plot.
The following code gives me two separate figures, each containing all histograms for one of the files. Is there a compact way to plot them together on the same figure using pandas/matplotlib? I imagine something close to this but using dataframes.
Code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('input1.csv')
df2 = pd.read_csv('input2.csv')
df.hist(bins=20)
df2.hist(bins=20)
plt.show()
In [18]: from pandas import DataFrame
In [19]: from numpy.random import randn
In [20]: import matplotlib.pyplot as plt
In [21]: df = DataFrame(randn(10, 2))
In [22]: df2 = DataFrame(randn(10, 2))
In [23]: axs = df.hist()
In [24]: for ax, (colname, values) in zip(axs.flat, df2.items()):
   ....:     values.hist(ax=ax, bins=10)
   ....:
In [25]: plt.show()
gives
The main issue of overlaying the histograms of two (or more) dataframes containing the same variables in side-by-side plots within a single figure has already been solved in the answer by Phillip Cloud.
This answer provides a solution to the issue raised by the author of the question (in the comments to the accepted answer) regarding how to enforce the same number of bins and range for the variables common to both dataframes. This can be accomplished by creating a list of bins common to all variables of both dataframes. In fact, this answer goes a little bit further by adjusting the plots for cases where the different variables contained in each dataframe cover slightly different ranges (but still within the same order of magnitude), as illustrated in the following example:
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
from matplotlib.lines import Line2D
# Set seed for random data
rng = np.random.default_rng(seed=1)
# Create two similar dataframes each containing two random variables,
# with df2 twice the size of df1
df1_size = 1000
df1 = pd.DataFrame(dict(var1 = rng.exponential(scale=1.0, size=df1_size),
                        var2 = rng.normal(loc=40, scale=5, size=df1_size)))
df2_size = 2*df1_size
df2 = pd.DataFrame(dict(var1 = rng.exponential(scale=2.0, size=df2_size),
                        var2 = rng.normal(loc=50, scale=10, size=df2_size)))
# Combine the dataframes to extract the min/max values of each variable
df_combined = pd.concat([df1, df2])
vars_min = [df_combined[var].min() for var in df_combined]
vars_max = [df_combined[var].max() for var in df_combined]
# Create custom bins based on the min/max of all values from both
# dataframes to ensure that in each histogram the bins are aligned
# making them easily comparable
nbins = 30
bin_edges, step = np.linspace(min(vars_min), max(vars_max), nbins+1, retstep=True)
# Create figure by combining the outputs of two pandas df.hist() function
# calls using the 'step' type of histogram to improve plot readability
htype = 'step'
alpha = 0.7
lw = 2
axs = df1.hist(figsize=(10,4), bins=bin_edges, histtype=htype,
               linewidth=lw, alpha=alpha, label='df1')
df2.hist(ax=axs.flatten(), grid=False, bins=bin_edges, histtype=htype,
         linewidth=lw, alpha=alpha, label='df2')
# Adjust x-axes limits based on min/max values and step between bins, and
# remove top/right spines: if, contrary to this example dataset, var1 and
# var2 cover the same range, setting the x-axes limits with this loop is
# not necessary
for ax, v_min, v_max in zip(axs.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
# Edit legend to get lines as legend keys instead of the default polygons:
# use legend handles and labels from any of the axes in the axs object
# (here taken from first one) seeing as the legend box is by default only
# shown in the last subplot when using the plt.legend() function.
handles, labels = axs.flatten()[0].get_legend_handles_labels()
lines = [Line2D([0], [0], lw=lw, color=h.get_facecolor()[:-1], alpha=alpha)
         for h in handles]
plt.legend(lines, labels, frameon=False)
plt.suptitle('Pandas', x=0.5, y=1.1, fontsize=14)
plt.show()
It is worth noting that the seaborn package provides a more convenient way to create this kind of plot, where contrary to pandas the bins are automatically aligned. The only downside is that the dataframes must first be combined and reshaped to long format, as shown in this example using the same dataframes and bins as before:
import seaborn as sns # v 0.11.0
# Combine dataframes and convert the combined dataframe to long format
df_concat = pd.concat([df1, df2], keys=['df1','df2']).reset_index(level=0)
df_melt = df_concat.melt(id_vars='level_0', var_name='var_id')
# Create figure using seaborn displot: note that the bins are automatically
# aligned thanks to the 'common_bins' parameter of the seaborn histplot function
# (called here with 'kind='hist'') that is set to True by default. Here, the
# bins from the previous example are used to make the figures more comparable.
# Also note that the facets share the same x and y axes by default, this can
# be changed when var1 and var2 have different ranges and different
# distribution shapes, as it is the case in this example.
g = sns.displot(df_melt, kind='hist', x='value', col='var_id', hue='level_0',
                element='step', bins=bin_edges, fill=False, height=4,
                facet_kws=dict(sharex=False, sharey=False))
# For some reason setting sharex as above does not automatically adjust the
# x-axes limits (even when not setting a bins argument, maybe due to a bug
# with this package version) which is why this is done in the following loop,
# but note that you still need to set 'sharex=False' in displot, or else
# 'ax.set_xlim' will have no effect.
for ax, v_min, v_max in zip(g.axes.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)
# Additional formatting
g.legend.set_bbox_to_anchor((.9, 0.75))
g.legend.set_title('')
plt.suptitle('Seaborn', x=0.5, y=1.1, fontsize=14)
plt.show()
As you may notice, the histogram line is cut off at the limits of the list of bin edges (not visible on the maximum side due to scale). To get a line more similar to the example with pandas, an empty bin can be added at each extremity of the list of bins, like this:
bin_edges = np.insert(bin_edges, 0, bin_edges.min()-step)
bin_edges = np.append(bin_edges, bin_edges.max()+step)
This example also illustrates the limits of this approach of setting common bins for both facets. Since the ranges of var1 and var2 are somewhat different and 30 bins are used to cover the combined range, the histogram for var1 contains rather few bins and the histogram for var2 has slightly more bins than necessary. To my knowledge, there is no straightforward way of assigning a different list of bins to each facet when calling the plotting functions df.hist() and displot(df). So for cases where variables cover significantly different ranges, these figures would have to be created from scratch using matplotlib or some other plotting library.
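For illustration, a from-scratch matplotlib version could look roughly like the sketch below. It assumes two dataframes built like df1 and df2 above; the dictionary bins_by_var is a hypothetical helper name, and the choice of 30 bins per variable is carried over from the earlier examples. Each variable gets its own bin edges, computed from the combined range of that variable only, so the two histograms in each subplot remain aligned with each other.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
df1 = pd.DataFrame({"var1": rng.exponential(1.0, 1000), "var2": rng.normal(40, 5, 1000)})
df2 = pd.DataFrame({"var1": rng.exponential(2.0, 2000), "var2": rng.normal(50, 10, 2000)})

nbins = 30
fig, axs = plt.subplots(1, 2, figsize=(10, 4))
bins_by_var = {}
for ax, var in zip(axs, df1.columns):
    # Bin edges span the combined range of this variable only
    combined = np.concatenate([df1[var].to_numpy(), df2[var].to_numpy()])
    edges = np.linspace(combined.min(), combined.max(), nbins + 1)
    bins_by_var[var] = edges
    for name, frame in (("df1", df1), ("df2", df2)):
        ax.hist(frame[var], bins=edges, histtype="step",
                linewidth=2, alpha=0.7, label=name)
    ax.set_title(var)
axs[-1].legend(frameon=False)
fig.tight_layout()
```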
I am trying to plot time series data using matplotlib. The problem is that there are too many observations, so the tick labels overlap and don't fit within the figure.
I can think of three solutions: shrink the label size, rotate the labels so the text is vertical or slanted, or label only the first and last few observations with dots between them. The code below demonstrates my point.
Can anyone help? Thanks
from datetime import date
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
N = 100
data = np.random.randn(N)
time_index = pd.date_range(date.today(), periods=len(data))
plt.plot(time_index, data)
plt.show()
For your simple plot, you could do
plt.xticks(rotation=90).
Alternatively, you could specify which ticks to display, and their labels, with
plt.xticks(<certain range of values>, <labels for those values>)
Edit:
Personally, I would switch to matplotlib's object-oriented interface.
f = plt.figure()
ax = f.add_subplot(111)
ax.plot(<stuff>)
ax.tick_params(axis='x', labelsize=8)
plt.setp(ax.xaxis.get_majorticklabels(), rotation=90)
# OR
xlabels = ax.get_xticklabels()
for label in xlabels:
    label.set_rotation(90)
plt.show()
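Since the x values here are dates, another option is to let matplotlib's date tools thin out and format the ticks. This is just a sketch using the data from the question; the maxticks value, the label size, and the 45 degree rotation are arbitrary choices:

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

N = 100
data = np.random.randn(N)
time_index = pd.date_range(pd.Timestamp.today().normalize(), periods=N)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(time_index, data)

# Ask the locator to keep the number of date ticks small
locator = mdates.AutoDateLocator(maxticks=8)
ax.xaxis.set_major_locator(locator)
# Compact date labels that match the chosen tick spacing
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))

ax.tick_params(axis="x", labelsize=8)  # smaller tick labels
fig.autofmt_xdate(rotation=45)         # rotate and right-align the date labels
```

Call plt.show() afterwards to display the figure.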