Creating whisker plots from grouped pandas Series - python

I have a dataset of values arriving at timestamped 5-minute intervals, which I'm visualising grouped by hour of day, as in the bar plot produced by the code below.
I want to turn this into a whisker/box plot for the added information. However, the matplotlib, seaborn and pandas implementations of this plot all expect an array of raw data and compute the plot's contents themselves.
Is there a way to create whisker plots from pre-computed/grouped mean, median, std and quartiles? I would like to avoid reinventing the wheel with a comparatively inefficient grouping algorithm to build per-day datasets just for this.
This is some code to produce toy data and a version of the current plot.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# some toy data in a 15-day range
data = [1.5+np.sin(x)*5 for x in np.arange(0, 403.3, .1)]
s = pd.Series(data=data, index=pd.date_range('2019-01-01', '2019-01-15', freq='5min'))
s.groupby(s.index.hour).mean().plot(kind='bar')
plt.show()
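Adding to @Quang Hoang's solution: you can use hlines() to display the median as well:
# `axis` is the Axes and `data` the precomputed DataFrame from that solution
wd = 0.8  # bar width (assumed value; 0.8 is matplotlib's default)
axis.bar(data.index, data['q75'] - data['q25'], bottom=data['q25'], width=wd)
axis.hlines(y=data['median'], xmin=data.index-wd/2, xmax=data.index+wd/2, color='black', linewidth=1)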

I don't think there is anything for that. But you can create a whisker plot fairly simply with two plot commands:
# precomputed data:
data = (s.groupby(s.index.hour)
         .agg(['mean', 'std', 'median',
               lambda x: x.quantile(.25),
               lambda x: x.quantile(.75)])
       )
data.columns = ['mean','std','median','q25','q75']
# plot the whiskers with `errorbar` from `mean` and `std`
fig, ax = plt.subplots(figsize=(12,6))
ax.errorbar(data.index, data['mean'],
            yerr=data['std']*1.96,
            linestyle='none',
            capsize=5
           )
# plot the boxes with `bar` at bottoms from quantiles
ax.bar(data.index, data['q75']-data['q25'], bottom=data['q25'])
Output:
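For what it's worth, matplotlib exposes the drawing half of its box plot as Axes.bxp, which takes a list of precomputed statistics (one dict per box with the keys med, q1, q3, whislo and whishi) instead of raw data. Below is a minimal sketch under stated assumptions: the whiskers are set to each hour's min and max, which is my choice rather than anything from the question, and the toy series and the names stats and bxpstats are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# toy data (hypothetical): one sine-shaped value every 5 minutes over two weeks
idx = pd.date_range('2019-01-01', '2019-01-15', freq='5min')
s = pd.Series(1.5 + 5 * np.sin(0.1 * np.arange(len(idx))), index=idx)

# precompute the per-hour statistics with ordinary groupby aggregations
grouped = s.groupby(s.index.hour)
stats = pd.DataFrame({
    'med': grouped.median(),
    'q1': grouped.quantile(.25),
    'q3': grouped.quantile(.75),
    'whislo': grouped.min(),   # assumption: whiskers reach the extremes
    'whishi': grouped.max(),
    'mean': grouped.mean(),
})

# bxp expects one dict of statistics per box
bxpstats = stats.to_dict('records')

fig, ax = plt.subplots(figsize=(12, 6))
artists = ax.bxp(bxpstats, positions=list(stats.index),
                 showfliers=False, showmeans=True)
ax.set_xlabel('hour of day')
# call plt.show() or fig.savefig(...) here to render the figure
```

If whiskers based on mean and std are preferred, as in the errorbar approach, the whislo and whishi columns could be replaced accordingly; bxp only draws what it is given.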

Related

Plotting multiple Pandas autocorrelation plots in different plots

My question is somewhat related to this one. I have a Pandas DataFrame and I want to plot the autocorrelation function of value separately for each item in category. Below is what I've tried; it plots all of the autocorrelation functions on the same plot. How can I plot them separately and also control the plot size?
# Import libraries
import pandas as pd
from pandas.plotting import autocorrelation_plot
# Create DataFrame
df = pd.DataFrame({
'category': ['sav','sav','sav','sav','sav','check','check','check','check','check','cd','cd','cd','cd','cd'],
'value': [1.2,1.3,1.5,1.7,1.8, 10,13,17,20,25, 7,8,8.5,9,9.3]
})
# Loop through for each item in category and plot autocorrelation function
for cat in df['category'].unique():
    s = df[df['category']==cat]['value']
    s = s.diff().iloc[1:] #First order difference to de-trend
    ax = autocorrelation_plot(s)
One easy way is to force rendering after each iteration with plt.show():
import matplotlib.pyplot as plt
# Loop through for each item in category and plot autocorrelation function
for cat in df['category'].unique():
    # create new figure, play with size
    plt.figure(figsize=(10,6))
    s = df[df['category']==cat]['value']
    s = s.diff().iloc[1:] #First order difference to de-trend
    ax = autocorrelation_plot(s)
    plt.show() # here
Also the syntax can be simplified with groupby:
for cat, data in df.groupby('category')['value']:
    plt.figure(figsize=(10,6))
    autocorrelation_plot(data.diff().iloc[1:])
    plt.title(cat)
    plt.show()
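If the plots should sit side by side in one figure rather than in separate figures, autocorrelation_plot also accepts an ax argument. The following is a rough sketch of that variant, reusing the question's DataFrame; the one-row layout and the figure size are arbitrary choices of mine.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import autocorrelation_plot

df = pd.DataFrame({
    'category': ['sav'] * 5 + ['check'] * 5 + ['cd'] * 5,
    'value': [1.2, 1.3, 1.5, 1.7, 1.8, 10, 13, 17, 20, 25, 7, 8, 8.5, 9, 9.3],
})

groups = df.groupby('category')['value']

# one Axes per category, all in a single row
fig, axes = plt.subplots(1, groups.ngroups, figsize=(15, 4))
for (cat, values), ax in zip(groups, axes):
    # first-order difference to de-trend, then draw onto this category's Axes
    autocorrelation_plot(values.diff().iloc[1:], ax=ax)
    ax.set_title(cat)
fig.tight_layout()
```

Note that groupby sorts the keys, so the panels appear in alphabetical order of category.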

Grid of plots with lines overplotted in matplotlib

I have a dataframe that consists of a bunch of x,y data that I'd like to see in scatter form along with a line. The dataframe consists of data with its form repeated over multiple categories. The end result I'd like to see is some kind of grid of the plots, but I'm not totally sure how matplotlib handles multiple subplots of overplotted data.
Here's an example of the kind of data I'm working with:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
category = np.arange(1,10)
frames = []
for i in category:
    x = np.arange(0,100)
    y = 2*x + 10
    data = np.random.normal(0,1,100) * y
    frames.append(pd.DataFrame({'x':x, 'y':y, 'data':data, 'category':i}))
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
total_data = pd.concat(frames)
We have x data, we have y data which is a linear model of some kind of generated dataset (the data variable).
I had been able to generate individual plots based on subsetting the master dataset, but I'd like to see them all side-by-side in a 3x3 grid in this case. However, calling the plots within the loop just overplots them all onto one single image.
Is there a good way to take the following code block and make a grid out of the category subsets? Am I overcomplicating it by doing the subset within the plot call?
plt.scatter(total_data['x'][total_data['category']==1], total_data['data'][total_data['category']==1])
plt.plot(total_data['x'][total_data['category']==1], total_data['y'][total_data['category']==1], linewidth=4, color='black')
If there's a simpler way to generate the by-category scatter plus line, I'm all for it. I don't know if seaborn has a similar or more intuitive method to use than pyplot.
You can use either sns.FacetGrid or plain matplotlib. For example, with seaborn:
import seaborn as sns
g = sns.FacetGrid(data=total_data, col='category', col_wrap=3)
g = g.map(plt.scatter, 'x', 'data')
g = g.map(plt.plot, 'x', 'y', color='k')
Gives:
Or manual plt with groupby:
fig, axes = plt.subplots(3,3)
for (cat, data), ax in zip(total_data.groupby('category'), axes.ravel()):
    ax.scatter(data['x'], data['data'])
    ax.plot(data['x'], data['y'], color='k')
gives:

Pandas DataFrame.hist Seaborn equivalent

When exploring a dataset I often use Pandas' DataFrame.hist() method to quickly display a grid of histograms for every numeric column in the dataframe, for example:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.hist(bins=50, figsize=(10,7))
plt.show()
Which produces a figure with separate plots for each column:
I've tried the following:
import pandas as pd
import seaborn as sns
from sklearn import datasets
data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
for col_id in df.columns:
    sns.distplot(df[col_id])
But this produces a figure with a single plot and all columns overlaid:
Is there a way to produce a grid of histograms showing the data from a DataFrame's columns with Seaborn?
You can take advantage of seaborn's FacetGrid if you reorganize your dataframe using melt. Seaborn typically expects data organized this way (long format).
g = sns.FacetGrid(df.melt(), col='variable', col_wrap=2)
g.map(plt.hist, 'value')
There is no direct equivalent, as seaborn's distplot only accepts a single 1-D array or list. You can generate the subplots yourself and draw one column per axis:
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
for i in range(ax.shape[0]):
    for j in range(ax.shape[1]):
        sns.distplot(df[df.columns[i*2+j]], ax=ax[i][j])
https://seaborn.pydata.org/examples/distplot_options.html
The link above is an example of how to show 4 graphs using subplots with seaborn.
Another useful seaborn method for quickly displaying a grid of histograms for every numeric column in the dataframe is the quick, clean and handy sns.pairplot(). Try:
sns.pairplot(df)
It has a lot of useful parameters you can explore, such as hue.
pairplot example for iris dataset
If you don't want the scatter plots, you can create a customised grid very quickly with sns.PairGrid(df). This creates an empty grid with all the axes, and you can map whatever you want onto them:
g = sns.PairGrid(df)
g.map_diag(sns.distplot) or g.map_offdiag(plt.scatter), etc.
I ended up adapting jcaliz's answer to make it work more generally, i.e. not just when the DataFrame has four columns. I also added code to remove any unused axes and to ensure the axes appear in alphabetical order (as with df.hist()).
import math
size = int(math.ceil(len(df.columns)**0.5))
# squeeze=False keeps `ax` two-dimensional even for a 1x1 grid
fig, ax = plt.subplots(size, size, figsize=(10, 10), squeeze=False)
for i in range(ax.shape[0]):
    for j in range(ax.shape[1]):
        data_index = i*ax.shape[1]+j
        if data_index < len(df.columns):
            sns.distplot(df[df.columns.sort_values()[data_index]], ax=ax[i][j])
# remove the axes that have no column to show
for i in range(len(df.columns), size ** 2):
    fig.delaxes(ax[i // size][i % size])

Sequential colors of timestamps in pairplot

I have 15 features in my data set which are time series.
I want to plot it in a pairplot, and have the colours of the points be corresponding to a sequential colormap like so:
Early datapoints will then have a brighter blue-color than the old ones.
One of the columns in my dataframe is called Index, and I tried using the hue='Index' parameter in the plotting function, without any luck.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="ticks", color_codes=True, palette='Blues_d')
#norm = plt.Normalize(df.Index.min(), df.Index.max())
#sm = plt.cm.ScalarMappable(cmap="Reds", norm=norm)
#sm.set_array([])
ax = sns.pairplot(df, vars=['AvgPower','energy_mean',
                            'ActPower','WindSpeed','NacelleDirection','AvgSpeed','rms','kurt','skewness','signal_mean','Power spectral entropy','B1','B2','B3','B4','B5'],
                  hue='Index') # I do not include 'Index' in the vars, so it isn't plotted.
ax.get_legend().remove()
ax.figure.colorbar(sm)
plt.show()
How can I get this to work?
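One possible approach, offered only as a sketch and not as the seaborn way of doing it, is to build the grid with plain matplotlib and pass the Index column to scatter through its c argument together with a sequential colormap. The random data below is a hypothetical stand-in, and only three of the question's column names are reused to keep the grid small.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# hypothetical stand-in data: three features plus a sequential 'Index' column
rng = np.random.default_rng(0)
cols = ['AvgPower', 'WindSpeed', 'rms']
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=cols)
df['Index'] = np.arange(len(df))

# one shared normalisation so every panel uses the same colour scale
norm = plt.Normalize(df['Index'].min(), df['Index'].max())

fig, axes = plt.subplots(len(cols), len(cols), figsize=(8, 8))
for i, ycol in enumerate(cols):
    for j, xcol in enumerate(cols):
        ax = axes[i, j]
        if i == j:
            # diagonal: plain histogram of the feature
            ax.hist(df[xcol], color='steelblue')
        else:
            # off-diagonal: scatter coloured by position in the time series
            ax.scatter(df[xcol], df[ycol], c=df['Index'],
                       cmap='Blues', norm=norm, s=10)
        if i == len(cols) - 1:
            ax.set_xlabel(xcol)
        if j == 0:
            ax.set_ylabel(ycol)

# a single colorbar for the whole grid
sm = plt.cm.ScalarMappable(cmap='Blues', norm=norm)
sm.set_array([])
fig.colorbar(sm, ax=axes, label='Index')
```

With cmap='Blues', low Index values come out light and high values dark; use 'Blues_r' to reverse that.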

Plot a histogram with normal curve and name the bins in seaborn

Hi all, I am trying to produce the following type of plot using seaborn with a different data set. The problem is that when a histogram is used, I cannot name the bins (like 2-2.5, 2.5-3, etc.), even though it provides kernel density curves. Bar plots don't have a function to draw the normal curve like in the picture. The image seems to have been made with the SPSS statistical package, which I know little about.
Following is the closest thing I can get (I have attached the code)
df = pd.DataFrame({'cat': ['1-1.5', '1.5-2', '2-2.5','2.5-3','3-3.5','3.5-4','4-4.5','4.5-5'],'val': [0,0,1,7,7,33,17,10]})
ax = sns.barplot(y='val', x='cat', data=df)
ax.set(xlabel='Categories', ylabel='Frequency')
plt.show()
So the problem is of course that you don't have the original data, but data that has already been binned. One could reverse this binning and start with an array of raw data. Then perform the histogramming again and use a sns.distplot which, by default, shows a KDE plot as well.
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
cat = ['1-1.5', '1.5-2', '2-2.5','2.5-3','3-3.5','3.5-4','4-4.5','4.5-5']
val = [0,0,1,7,7,33,17,10]
data = []
for i in range(len(cat)):
    data.extend([1.25+i*0.5]*val[i])
bins = np.arange(1,5.5, 0.5)
ax = sns.distplot(data, bins=bins, hist_kws= dict(edgecolor="k"))
ax.set(xlabel='Categories', ylabel='Frequency')
ax.set_xticks(bins[:-1]+0.25)
ax.set_xticklabels(cat)
plt.show()
Use the bw keyword argument of the KDE function to set the smoothness of the curve, e.g. sns.distplot(data, bins=bins, kde_kws=dict(bw=0.5), hist_kws=dict(edgecolor="k")) with bw=0.5.
Also try bw=0.1, bw=0.25, bw=0.35 and bw=2 to see the differences.
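If a fitted normal curve is wanted rather than a KDE, it can be estimated straight from the binned counts without reconstructing raw data. The sketch below assumes each bin's observations sit at the bin midpoint, which is an approximation; the curve is the normal density scaled by the total count times the bin width so that it lines up with the frequency axis.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

cat = ['1-1.5', '1.5-2', '2-2.5', '2.5-3', '3-3.5', '3.5-4', '4-4.5', '4.5-5']
val = np.array([0, 0, 1, 7, 7, 33, 17, 10])
width = 0.5
mids = 1.25 + width * np.arange(len(cat))  # bin midpoints

# mean and standard deviation estimated from the binned counts
mean = np.average(mids, weights=val)
std = np.sqrt(np.average((mids - mean) ** 2, weights=val))

# normal density scaled to counts: N * bin width * pdf(x)
x = np.linspace(1, 5, 200)
pdf = np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))
curve = val.sum() * width * pdf

fig, ax = plt.subplots()
ax.bar(mids, val, width=width, edgecolor='k')  # the pre-binned histogram
ax.plot(x, curve, color='k')                   # fitted normal curve
ax.set_xticks(mids)
ax.set_xticklabels(cat)
ax.set(xlabel='Categories', ylabel='Frequency')
```

Because the bars are drawn from the counts directly, the bin labels can be set freely, which was the sticking point with the histogram functions.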
