Pandas DataFrame.hist Seaborn equivalent - python

When exploring a I often use Pandas' DataFrame.hist() method to quickly display a grid of histograms for every numeric column in the dataframe, for example:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.hist(bins=50, figsize=(10,7))
plt.show()
Which produces a figure with separate plots for each column:
I've tried the following:
import pandas as pd
import seaborn as sns
from sklearn import datasets
data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
for col_id in df.columns:
sns.distplot(df[col_id])
But this produces a figure with a single plot and all columns overlayed:
Is there a way to produce a grid of histograms showing the data from a DataFrame's columns with Seaborn?

You can take advantage of seaborn's FacetGrid if you reorganize your dataframe using melt. Seaborn typically expects data organized this way (long format).
g = sns.FacetGrid(df.melt(), col='variable', col_wrap=2)
g.map(plt.hist, 'value')

There is no equivalent as seaborn displot itself will only pick 1-D array, or list, maybe you can try generating the subplots.
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
for i in range(ax.shape[0]):
for j in range(ax.shape[1]):
sns.distplot(df[df.columns[i*2+j]], ax=ax[i][j])

https://seaborn.pydata.org/examples/distplot_options.html
Here is an example how you can show 4 graphs using subplot, with seaborn.

Anothert useful SEABORN method to quickly display a grid of histograms for every numeric column in the dataframe for you could be the quick,clean and handy sns.pairplot()
try:
sns.pairplot(df)
this has a lot of cool parameters you can explor like Hue etc
pairplot example for iris dataset
if you DON'T want the scatters you can actually create a customised grid really really quickly using sns.PairGrid(df)
this creates an empty grid with all the spaces and you can map whatever you want on them :g = sns.pairgrid(df)
`g.map(sns.distplot)` or `g.map_diag(plt.scatter)`
etc

I ended up adapting jcaliz's to make it work more generally, i.e. not just when the DataFrame has four columns, I also added code to remove any unused axes and ensure axes appear in alphabetical order (as with df.hist()).
size = int(math.ceil(len(df.columns)**0.5))
fig, ax = plt.subplots(size, size, figsize=(10, 10))
for i in range(ax.shape[0]):
for j in range(ax.shape[1]):
data_index = i*ax.shape[1]+j
if data_index < len(df.columns):
sns.distplot(df[df.columns.sort_values()[data_index]], ax=ax[i][j])
for i in range(len(df.columns), size ** 2):
fig.delaxes(ax[i // size][i % size])

Related

How to put two Pandas box plots next to each other? Or group them by variable?

I have two data frames (df1 and df2). Each have the same 10 variables with different values.
I created box plots of the variables in the data frames like so:
df1.boxplot()
df2.boxplot()
I get two graphs of 10 box plots next to each other for each variable. The actual output is the second graph, however, as obviously Python just runs the code in order.
Instead, I would either like these box plots to appear side by side OR ideally, I would like 10 graphs (one for each variable) comparing each variable by data frame (e.g. one graph for the first variable with two box plots in it, one for each data frame). Is that possible just using python library or do I have to use Matplotlib?
Thanks!
To get graphs, standard Python isn't enough. You'd need a graphical library such as matplotlib. Seaborn extends matplotlib to ease the creation of complex statistical plots. To work with Seaborn, the dataframes should be converted to long form (e.g. via pandas' melt) and then combined into one large dataframe.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# suppose df1 and df2 are dataframes, each with the same 10 columns
df1 = pd.DataFrame({i: np.random.randn(100).cumsum() for i in 'abcdefghij'})
df2 = pd.DataFrame({i: np.random.randn(150).cumsum() for i in 'abcdefghij'})
# pd.melt converts the dataframe to long form, pd.concat combines them
df = pd.concat({'df1': df1.melt(), 'df2': df2.melt()}, names=['source', 'old_index'])
# convert the source index to a column, and reset the old index
df = df.reset_index(level=0).reset_index(drop=True)
sns.boxplot(data=df, x='variable', y='value', hue='source', palette='turbo')
This creates boxes for each of the original columns, comparing the two dataframes:
Optionally, you could create multiple subplots with that same information:
sns.catplot(data=df, kind='box', col='variable', y='value', x='source',
palette='turbo', height=3, aspect=0.5, col_wrap=5)
By default, the y-axes are shared. You can disable the sharing via sharey=False. Here is an example, which also removes the repeated x axes and creates a common legend:
g = sns.catplot(data=df, kind='box', col='variable', y='value', x='source', hue='source', dodge=False,
palette='Reds', height=3, aspect=0.5, col_wrap=5, sharey=False)
g.set(xlabel='', xticks=[]) # remove x labels and ticks
g.add_legend()
PS: If you simply want to put two pandas boxplots next to each other, you can create a figure with two subplots, and pass the axes to pandas. (Note that pandas plotting is just an interface towards matplotlib.)
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 5))
df1.boxplot(ax=ax1)
ax1.set_title('df1')
df2.boxplot(ax=ax2)
ax2.set_title('df2')
plt.tight_layout()
plt.show()

How to create seaborn violinplot with mean,median and mode displayed?

Is there a way to add a mean and a mode to a violinplot ? I have categorical data in one of my columns and the corresponding values in the next column. I tried looking into matplotlib violin plot as it technically offers the functionality I am looking for but it does not allow me to specify a categorical variable on the x axis, and this is crucial as I am looking at the distribution of the data per category. I have added a small table illustrating the shape of the data.
plt.figure(figsize=10,15)
ax=sns.violinplot(x='category',y='value',data=df)
First we calculate the the mode and means:
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame({'Category':[1,2,5,1,2,4,3,4,2],
'Value':[1.5,1.2,2.2,2.6,2.3,2.7,5,3,0]})
Means = df.groupby('Category')['Value'].mean()
Modes = df.groupby('Category')['Value'].agg(lambda x: pd.Series.mode(x)[0])
You can use seaborn to make the basic plot, below I remove the inner boxplot using the inner= argument, so that we can see the mode and means:
fig, ax = plt.subplots()
sns.violinplot(x='Category',y='Value',data=df,inner=None)
plt.setp(ax.collections, alpha=.3)
plt.scatter(x=range(len(Means)),y=Means,c="k")
plt.scatter(x=range(len(Modes)),y=Modes)

Grid of plots with lines overplotted in matplotlib

I have a dataframe that consists of a bunch of x,y data that I'd like to see in scatter form along with a line. The dataframe consists of data with its form repeated over multiple categories. The end result I'd like to see is some kind of grid of the plots, but I'm not totally sure how matplotlib handles multiple subplots of overplotted data.
Here's an example of the kind of data I'm working with:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
category = np.arange(1,10)
total_data = pd.DataFrame()
for i in category:
x = np.arange(0,100)
y = 2*x + 10
data = np.random.normal(0,1,100) * y
dataframe = pd.DataFrame({'x':x, 'y':y, 'data':data, 'category':i})
total_data = total_data.append(dataframe)
We have x data, we have y data which is a linear model of some kind of generated dataset (the data variable).
I had been able to generate individual plots based on subsetting the master dataset, but I'd like to see them all side-by-side in a 3x3 grid in this case. However, calling the plots within the loop just overplots them all onto one single image.
Is there a good way to take the following code block and make a grid out of the category subsets? Am I overcomplicating it by doing the subset within the plot call?
plt.scatter(total_data['x'][total_data['category']==1], total_data['data'][total_data['category']==1])
plt.plot(total_data['x'][total_data['category']==1], total_data['y'][total_data['category']==1], linewidth=4, color='black')
If there's a simpler way to generate the by-category scatter plus line, I'm all for it. I don't know if seaborn has a similar or more intuitive method to use than pyplot.
You can use either sns.FacetGrid or manual plt.plot. For example:
g = sns.FacetGrid(data=total_data, col='category', col_wrap=3)
g = g.map(plt.scatter, 'x','data')
g = g.map(plt.plot,'x','y', color='k');
Gives:
Or manual plt with groupby:
fig, axes = plt.subplots(3,3)
for (cat, data), ax in zip(total_data.groupby('category'), axes.ravel()):
ax.scatter(data['x'], data['data'])
ax.plot(data['x'], data['y'], color='k')
gives:

Seaborn plot pandas dataframe by multiple groupby

I have pandas dataframe where I have nested 4 categories (50,60,70,80) within two categories (positive, negative) and I would like to plot with seaborn kdeplot of a column (eg., A_mean...) based on groupby. What I want to achieve is this (this was done by splitting the pandas to a list). I went over several posts, this code (Multiple single plots in seaborn with pandas groupby data) works for one level but not for the two if I want to plot this for each Game_RS:
for i, group in df_hb_SLR.groupby('Condition'):
sns.kdeplot(data=group['A_mean_per_subject'], shade=True, color='blue', label = 'label name')
I tried to use this one (Seaborn groupby pandas Series) but the first answer did not work for me:
sns.kdeplot(df_hb_SLR.A_mean_per_subject, groupby=df_hb_SLR.Game_RS)
AttributeError: 'Line2D' object has no property 'groupby'
and the pivot answer I was not able to make work.
Is there a direct way from seaborn or any better way directly from pandas Dataframe?
My data are accessible in csv format under this link -- data and I load them as usual:
df_hb_SLR = pd.read_csv('data.csv')
Thank you for help.
Here is a solution using seaborn's FacetGrid, which makes this kind of things really easy
g = sns.FacetGrid(data=df_hb_SLR, col="Condition", hue='Game_RS', height=5, aspect=0.5)
g = g.map(sns.kdeplot, 'A_mean_per_subject', shade=True)
g.add_legend()
The downside of FacetGrid is that it creates a new figure, so If you'd like to integrate those plots into a larger ensemble of subplots, you could achieve the same result using groupby() and some looping:
group1 = "Condition"
N1 = len(df_hb_SLR[group1].unique())
group2 = 'Game_RS'
target = 'A_mean_per_subject'
height = 5
aspect = 0.5
colour = ['gray', 'blue', 'green', 'darkorange']
fig, axs = plt.subplots(1,N1, figsize=(N1*height*aspect,N1*height*aspect), sharey=True)
for (group1Name,df1),ax in zip(df_hb_SLR.groupby(group1),axs):
ax.set_title(group1Name)
for (group2Name,df2),c in zip(df1.groupby(group2), colour):
sns.kdeplot(df2[target], shade=True, label=group2Name, ax=ax, color = c)

Scatterplot groupby columns in python

I am trying to make a scatterplot over two different types of categorical variables, each with three different levels. Right now I am using the seaborn library in python:
sns.pairplot(x_vars = ['UTM_x'], y_vars = ['UTM_y'], data = df, hue = "Mobility_Provider", height = 5)
sns.pairplot(x_vars = ['UTM_x'], y_vars = ['UTM_y'], data = df, hue = "zone_number", height = 5)
which gives me two separate scatter plot, one grouped by Mobility_Provider, one grouped by zone_number. However, I was wondering if it's possible to combine these two graphs together, e.g. different levels of Mobility_Provider are represented in different colours, while different levels of zone_number are represented in different shapes/markers of the plot.
Thanks a lot!
A sample plot would be:
Plot1
Plot2
Each row of the df has x and y values, and two categorical variables ("Mobility_Provider" and "zone_number")
This can be easily done using seaborn's scatterplot, just use
hue = "Mobility_Provider",style="zone_number"
Something like this
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd
df = pd.DataFrame({'x':[1,2,3,4],'y':[1,2,3,4],'Mobility_Provider':[0,0,1,1],\
'zone_number':[0,1,0,1]})
sns.scatterplot(x="x", y="y",s=100,hue='Mobility_Provider',style='zone_number', data=df)
plt.show()

Categories