Weighted violinplot - python

I have a Pandas dataframe of data on generator plant capacity (MW) by fuel type. I wanted to show the estimated distribution of plant capacity in two different ways: by plant (easy) and by MW (harder). Here's an example:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns

# generate empty dataframe
df = pd.DataFrame(data=None, columns=['Fuel', 'MW'])

# create and seed a RandomState object (to make #s repeatable below)
rnd = np.random.RandomState(7)

# generate fake data for each fuel type and append to df
for myfuel in ['Biomass', 'Coal', 'Hydro', 'Natural Gas', 'Oil', 'Solar', 'Wind', 'Other']:
    mymean = rnd.uniform(low=2.8, high=3.2)
    mysigma = rnd.uniform(low=0.6, high=1.0)
    df = df.append(
        pd.DataFrame({'Fuel': myfuel,
                      'MW': np.array(rnd.lognormal(mean=mymean, sigma=mysigma, size=1000))
                      }),
        ignore_index=True
    )

# make violinplot
sns.violinplot(x='Fuel',
               y='MW',
               data=df,
               inner=None,
               scale='area',
               cut=0,
               linewidth=0.5
               )
And here's the plot of the estimated distributions of plant size by MW this code makes:
This violinplot is very deceptive, without more context. Because it is not weighted, the thin tail at the top of each category hides the fact that the relatively few plants in the tail contain a lot of (maybe even most of) the MWs of capacity. So I want a second plot with the distribution by MWs--basically a weighted version of this first violinplot.
I'd like to know if anyone has figured out an elegant way to make such a "weighted" violinplot, or has an idea about the best way to approach it.
I figured I could loop through each row of my plant-level dataframe and decompose the plant data (into a new dataframe) into MW-level data. For instance, for a row in the plant-level dataframe that shows a plant with 350 MW, I could decompose that into 3500 new rows of my new dataframe, each representing 100 kW of capacity. (I think I have to go to at least the 100 kW level of resolution, because some of these plants are pretty small, in the 100 kW range.) That new dataframe would be enormous, but I could then do a violinplot of that decomposed data. That seemed slightly brute force. Any better ideas for approach?
Update:
I implemented the brute force method described above. Here's what it looks like if anyone is interested. This is not the "answer" to this question, because I still would be interested if anyone knows a more elegant/simple/efficient way to do this. So please chime in if you know of such a way. Otherwise, I hope this brute force approach might be helpful to someone in the future.
So that it's easy to see that the weighted violinplot makes sense, I replaced the random data with a simple uniform series of numbers from 0 to 10. Under this new approach, the violinplot of df should look pretty uniform, and the violinplot of the weighted data (dfw) should get steadily wider towards the top of the violins. That's exactly what happens (see image of violinplots below).
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# generate empty dataframe
df = pd.DataFrame(data=None, columns=['Fuel', 'MW'])

# generate fake data for each fuel type and append to df
for myfuel in ['Biomass', 'Coal', 'Hydro', 'Natural Gas', 'Oil', 'Solar', 'Wind', 'Other']:
    df = df.append(
        pd.DataFrame({'Fuel': myfuel,
                      # To make it easy to see that the violinplot of dfw (below)
                      # makes sense, here we'll just use a simple range list from
                      # 0 to 10
                      'MW': np.array(range(11))
                      }),
        ignore_index=True
    )

# I have to recast the data type here to avoid an error when using violinplot below
df.MW = df.MW.astype(float)

# create another empty dataframe
dfw = pd.DataFrame(data=None, columns=['Fuel', 'MW'])

# since dfw will be huge, specify data types
# (in particular, use "category" for Fuel to limit dfw size)
dfw = dfw.astype(dtype={'Fuel': 'category', 'MW': 'float'})

# Define the MW size by which to normalize all of the units
# Careful: too big   -> loss of fidelity in data for small plants
#          too small -> dfw will need to store an enormous amount of data
norm = 0.1  # this is in MW, so 0.1 MW = 100 kW

# Define a var to represent (for each row) how many basic units
# of size = norm there are in each row
mynum = 0

# loop through rows of df
for index, row in df.iterrows():
    # calculate and store the number of norm MW there are within the MW of each plant
    mynum = int(round(row['MW'] / norm))
    # insert mynum rows into dfw, each with Fuel = row['Fuel'] and MW = row['MW']
    dfw = dfw.append(
        pd.DataFrame({'Fuel': row['Fuel'],
                      'MW': np.array([row['MW']] * mynum, dtype='float')
                      }),
        ignore_index=True
    )

# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey='row')

# make violinplot of df
sns.violinplot(x='Fuel',
               y='MW',
               data=df,
               inner=None,
               scale='area',
               cut=0,
               linewidth=0.5,
               ax=ax1
               )

# make violinplot of dfw
sns.violinplot(x='Fuel',
               y='MW',
               data=dfw,
               inner=None,
               scale='area',
               cut=0,
               linewidth=0.5,
               ax=ax2
               )

# loop through the set of tick labels for both axes
# set tick label size and rotation
for item in (ax1.get_xticklabels() + ax2.get_xticklabels()):
    item.set_fontsize(8)
    item.set_rotation(30)
    item.set_horizontalalignment('right')

plt.show()
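A possibly more compact variant of the same brute-force idea (an untested sketch, assuming the df and norm defined above) is to repeat each row in proportion to its capacity instead of appending inside an iterrows loop:

# Sketch: build the weighted dataframe by repeating each plant's row once per
# norm-sized unit of capacity, rather than appending rows in a Python loop.
counts = (df['MW'] / norm).round().astype(int)         # units of size norm per plant
dfw_alt = df.loc[df.index.repeat(counts)].reset_index(drop=True)
dfw_alt['Fuel'] = dfw_alt['Fuel'].astype('category')   # keep memory use down
dfw_alt['MW'] = dfw_alt['MW'].astype(float)
# dfw_alt can then be passed to sns.violinplot exactly like dfw above.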

Related

Easier way to plot multiple Relative Frequencies

Plotting multiple relative frequencies (where the bin counts sum to one, rather than the bin areas summing to one) was not as easy as I thought.
In Method A, I can use the weights argument and it plots properly, but it is not intuitive.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df_a = pd.DataFrame(np.random.randn(1000), columns=['a'])
df_b = pd.DataFrame(1 + np.random.randn(100), columns=['b'])

# Method A
ax = df_a.plot(kind='hist', weights=np.ones_like(df_a) / len(df_a), alpha=0.5)
df_b.plot(kind='hist', weights=np.ones_like(df_b) / len(df_b), alpha=0.5, ax=ax)
plt.title("Method A")
plt.show()
In Method B, the part that determines the relative frequencies, count_a/sum(count_a), is easy to understand, but the resulting diagram is not as nice.
# Method B
count_a,bins_a = np.histogram(df_a.a)
count_b,bins_b = np.histogram(df_b.b)
plt.bar(bins_a[:-1],count_a/sum(count_a),alpha=0.5 )
plt.bar(bins_b[:-1],count_b/sum(count_b),alpha=0.5 )
plt.title("Method B")
Is there another way to get a graph directly from the data without doing the calculations myself?
The problem with your bar plot is that the width is fixed by default to 0.8. This can easily be adjusted to account for the real width of your histogram:
plt.bar(bins_a[:-1], count_a/sum(count_a), width = bins_a[1:] - bins_a[:-1], alpha = 0.5, align = 'edge')
and this is the result:
In this example the bin widths are all equal, but because a sequence of widths is passed, the same approach also works with variable bin sizes.
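For instance, a small sketch of the variable-width case (with made-up bin edges) could look like this:

# Hypothetical, unevenly spaced bin edges for df_a; the bar widths are taken
# directly from the differences between consecutive edges.
bins = np.array([-3, -1, -0.5, 0, 0.5, 1, 3])
count_a, edges = np.histogram(df_a.a, bins=bins)
plt.bar(edges[:-1], count_a / count_a.sum(), width=np.diff(edges), align='edge', alpha=0.5)
plt.show()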
A different option is to use seaborn as suggested in the comment:
import seaborn as sns
df_hist = pd.concat([df_a, df_b]).melt()
sns.histplot(data = df_hist, x = 'value', hue = 'variable', stat = 'probability', common_norm = False)

Is there any nicer way to aggregate multiple columns on same grouped pandas dataframe?

I am trying to figure out how I should manipulate my data so I can aggregate multiple columns of the same grouped pandas dataframe. I need this because I want a stacked line chart that takes data from different aggregations of the same grouped data. How can I do this in a compact way? Can anyone suggest a possible approach in pandas?
My current attempt:
import pandas as pd
import matplotlib.pyplot as plt
url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])
df_re = df[df['retail_item'].str.contains("GROUND BEEF")]
df_rei = df_re.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_rei = df_rei.reset_index(level=[0,1])
df_rei['week'] = pd.DatetimeIndex(df_rei['date']).week
df_rei['year'] = pd.DatetimeIndex(df_rei['date']).year
df_rei['week'] = df_rei['date'].dt.strftime('%W').astype('uint8')
df_ret_df1 = df_rei.groupby(['retail_item', 'week'])['number_of_ads'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
Similarly, I need to do another aggregation like this:
df_re['price_gap'] = df_re['high_price'] - df_re['low_price']
dff_rei1 = df_re.groupby(['date', 'retail_item']).agg({'price_gap': 'mean'})
dff_rei1 = dff_rei1.reset_index(level=[0,1])
dff_rei1['week'] = pd.DatetimeIndex(dff_rei1['date']).week
dff_rei1['year'] = pd.DatetimeIndex(dff_rei1['date']).year
dff_rei1['week'] = dff_rei1['date'].dt.strftime('%W').astype('uint8')
dff_ret_df2 = dff_rei1.groupby(['retail_item', 'week'])['price_gap'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
Problem
When I do these aggregations, the lines of code are very similar:
df_rei = df_re.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_ret_df1 = df_rei.groupby(['retail_item', 'week'])['number_of_ads'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
and
dff_rei1 = df_re.groupby(['date', 'retail_item']).agg({'price_gap': 'mean'})
dff_ret_df2 = dff_rei1.groupby(['retail_item', 'week'])['price_gap'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
I think a better way might be to write a custom function with *args, **kwargs to handle the column aggregation, but how should I then show a stacked line chart where the y-axis shows different quantities? Is that doable in pandas?
Line plot
This is how I currently get the line chart:
for g, d in df_ret_df1.groupby('retail_item'):
    fig, ax = plt.subplots(figsize=(7, 4), dpi=144)
    sns.lineplot(x='week', y='vals', hue='mm', data=d, alpha=.8)
    y1 = d[d.mm == 'max']
    y2 = d[d.mm == 'min']
    plt.fill_between(x=y1.week, y1=y1.vals, y2=y2.vals)
    for year in df['year'].unique():
        data = df_rei[(df_rei.date.dt.year == year) & (df_rei.retail_item == g)]
        sns.lineplot(x='week', y='price_gap', ci=None, data=data, palette=cmap, label=year, alpha=.8)
I want to reduce this duplication so I can aggregate different columns and make a stacked line chart (two vertical subplots) that shares the week on the x-axis, with the y-axes showing the number of ads and the price range respectively for the same items over 52 weeks. Can anyone suggest a good way of doing this?
This answer builds on the one by Andreas who has already answered the main question of how to produce aggregate variables of multiple columns in a compact way. The goal here is to implement that solution specifically to your case and to give an example of how to produce a single figure from the aggregated data. Here are some key points:
The dates in the original dataset are already on a weekly frequency so groupby('week') is not needed for df_ret_df1 and dff_ret_df2, which is why these contain identical values for min, max, and mean.
This example uses pandas and matplotlib so the variables do not need to be stacked as when using seaborn.
The aggregation step produces a MultiIndex for the columns. You can access the aggregated variables (min, max, mean) of each high-level variable by using df.xs.
The date is set as the index of the aggregated dataframe to use as the x variable. Using the DatetimeIndex as the x variable gives you more flexibility for formatting the tick labels and ensures that the data is always plotted in chronological order.
It is not clear in the question how the data for separate years should be displayed (in separate figures?) so here the entire time series is shown in a single figure.
Import dataset and aggregate it as needed
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
# Import dataset
url = 'https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/\
raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv'
df = pd.read_csv(url, parse_dates=['date'])
# Create dataframe containing data for ground beef products, compute
# aggregate variables, and set the date as the index
df_gbeef = df[df['retail_item'].str.contains('GROUND BEEF')].copy()
df_gbeef['price_gap'] = df_gbeef['high_price'] - df_gbeef['low_price']
agg_dict = {'number_of_ads': [min, max, 'mean'],
            'price_gap': [min, max, 'mean']}
df_gbeef_agg = (df_gbeef.groupby(['date', 'retail_item']).agg(agg_dict)
                .reset_index('retail_item'))
df_gbeef_agg
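As a quick illustration of the MultiIndex access pattern mentioned above (a small sketch using the df_gbeef_agg frame just built), xs pulls out all aggregate columns of one top-level variable:

# Select the min/max/mean columns belonging to 'number_of_ads';
# this is the same access pattern used inside the plotting loop below.
ads = df_gbeef_agg.xs('number_of_ads', axis=1)
print(ads.head())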
Plot aggregated variables in single figure containing small multiples
variables = ['number_of_ads', 'price_gap']
colors = ['tab:orange', 'tab:blue']
nrows = len(variables)
ncols = df_gbeef_agg['retail_item'].nunique()
fig, axs = plt.subplots(nrows, ncols, figsize=(10, 5), sharex=True, sharey='row')

for axs_row, var, color in zip(axs, variables, colors):
    for i, (item, df_item) in enumerate(df_gbeef_agg.groupby('retail_item')):
        ax = axs_row[i]
        # Select data and plot it
        data = df_item.xs(var, axis=1)
        ax.fill_between(x=data.index, y1=data['min'], y2=data['max'],
                        color=color, alpha=0.3, label='min/max')
        ax.plot(data.index, data['mean'], color=color, label='mean')
        ax.spines['bottom'].set_position('zero')
        # Format x-axis tick labels
        fmt = plt.matplotlib.dates.DateFormatter('%W')  # is not equal to ISO week
        ax.xaxis.set_major_formatter(fmt)
        # Format subplot according to position within the figure
        if ax.is_first_row():
            ax.set_title(item, pad=10)
        if ax.is_last_row():
            ax.set_xlabel('Week number', size=12, labelpad=5)
        if ax.is_first_col():
            ax.set_ylabel(var, size=12, labelpad=10)
        if ax.is_last_col():
            ax.legend(frameon=False)

fig.suptitle('Cross-regional weekly ads and price gaps of ground beef products',
             size=14, y=1.02)
fig.subplots_adjust(hspace=0.1);
I am not sure if this answers your question fully, but based on your headline I guess it all boils down to:
import pandas as pd
url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])
# define which columns to group and in which way
dct = {'low_price': [max, min],
       'high_price': min,
       'year': 'mean'}
# actually group the columns
df.groupby(['region']).agg(dct)
Output:
              low_price       high_price         year
                    max   min        min         mean
region
ALASKA            16.99  1.33       1.33  2020.792123
HAWAII            12.99  1.33       1.33  2020.738318
MIDWEST           28.73  0.99       0.99  2020.690159
NORTHEAST         19.99  1.20       1.99  2020.709916
NORTHWEST         16.99  1.33       1.33  2020.736397
SOUTH CENTRAL     28.76  1.20       1.49  2020.700980
SOUTHEAST         21.99  1.33       1.48  2020.699655
SOUTHWEST         16.99  1.29       1.29  2020.704341

Seaborn scatterplot size and jitter

I have the following code for a scatter graph
dimens = (12, 10)
fig, ax = plt.subplots(figsize=dimens)
sns.scatterplot(data=information, x='latitude', y='longitude', hue="genre", s=200,
                x_jitter=4, y_jitter=4, ax=ax)
No matter what I change the jitter to, the points still remain very close together. What's wrong with it?
Example dataframe:
store       longitude    latitude   genre
mcdonalds   140.232323   40.434343  all
kfc         140.232323   40.434343  chicken
burgerking  138.434343   35.545433  burger
fiveguys    137.323984   36.543322  burger
The help page says:
{x, y}_jitter : booleans or floats. Currently non-functional.
You can either add a new column or do it on the fly:
import seaborn as sns
import pandas as pd
import numpy as np

information = pd.DataFrame({'store': ['mcdonalds', 'kfc', 'burgerking', 'fiveguys'],
                            'longitude': [140.232323, 140.232323, 138.434343, 137.323984],
                            'latitude': [40.434343, 40.434343, 35.545433, 36.543322],
                            'genre': ['all', 'chicken', 'burger', 'burger']})

def jitter(values, j):
    return values + np.random.normal(j, 0.1, values.shape)

sns.scatterplot(x=jitter(information.latitude, 2),
                y=jitter(information.longitude, 2),
                hue=information.genre, s=200, alpha=0.5)
The parameter s=200 sets the individual scatter points to a very large size.
Adding 4 points of jitter is very little compared to that.
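For instance (a rough sketch along the lines of the other answer, with made-up noise scales), shrinking the markers and adding noise comparable to the spread of the coordinates makes the overlapping stores separate visibly:

import numpy as np
import seaborn as sns

# Smaller markers (s=50) plus roughly 0.5-degree noise; both values are illustrative only.
lat = information.latitude + np.random.normal(0, 0.5, len(information))
lon = information.longitude + np.random.normal(0, 0.5, len(information))
sns.scatterplot(x=lat, y=lon, hue=information.genre, s=50, alpha=0.7)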

Line-based heatmap or 2D line histogram

I have a synthetic dataset with 1000 noisy polygons of various orders and sin/cos curves that I can plot as lines using python seaborn.
Since I have quite a few lines that are overlapping, I'd like to plot some sort of heatmap or histogram of my line graphs.
I've tried iterating over the columns and aggregating the counts to use seaborn's heatmap graph, but with many lines this takes quite a while.
The next best thing that results in what I want was a hexbin graph (with seaborn jointgraph).
But it's a compromise between runtime and granularity (the shown graph has gridsize 750). I couldn't find any other graph-type for my problem. But I also don't know exactly what it might be called.
I've also tried with line alpha set to 0.2. This results in a similar graph to what I want. But it's less precise (if more than 5 lines overlap at the same point I already have zero transparency left). Also, it misses the typical coloration of heatmaps.
(Search terms that turned up nothing: heatmap, 2D line histogram, line histogram, density plots...)
Does anybody know of packages that can plot this more efficiently and with high(er) quality, or know how to do it with the popular Python plotters (i.e. the matplotlib family: matplotlib, seaborn, bokeh)? I'm really fine with any package though.
It took me a while, but I finally solved this using Datashader. If using a notebook, the plots can be embedded into interactive Bokeh plots, which look really nice.
Anyhow, here is the code for static images, in case someone else is in need of something similar:
# coding: utf-8
import time
import numpy as np
from numpy.polynomial import polynomial
import pandas as pd
import matplotlib.pyplot as plt
import datashader as ds
import datashader.transfer_functions as tf
plt.style.use("seaborn-whitegrid")
def create_data():
    # ...

# Each column is one data sample
df = create_data()
# Following will append a nan-row and reshape the dataframe into two columns, with each sample stacked on top of each other
# THIS IS CRUCIAL TO OPTIMIZE SPEED: https://github.com/bokeh/datashader/issues/286
# Append row with nan-values
df = df.append(pd.DataFrame([np.array([np.nan] * len(df.columns))], columns=df.columns, index=[np.nan]))
# Reshape
x, y = df.shape
arr = df.as_matrix().reshape((x * y, 1), order='F')
df_reshaped = pd.DataFrame(arr, columns=list('y'), index=np.tile(df.index.values, y))
df_reshaped = df_reshaped.reset_index()
df_reshaped.columns.values[0] = 'x'
# Plotting parameters
x_range = (min(df.index.values), max(df.index.values))
y_range = (df.min().min(), df.max().max())
w = 1000
h = 750
dpi = 150
cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=h, plot_width=w)
# Aggregate data
t0 = time.time()
aggs = cvs.line(df_reshaped, 'x', 'y', ds.count())
print("Time to aggregate line data: {}".format(time.time()-t0))
# One colored plot
t1 = time.time()
stacked_img = tf.Image(tf.shade(aggs, cmap=["darkblue", "darkblue"]))
print("Time to create stacked image: {}".format(time.time() - t1))
# Save
f0 = plt.figure(figsize=(w / dpi, h / dpi), dpi=dpi)
ax0 = f0.add_subplot(111)
ax0.imshow(stacked_img.to_pil())
ax0.grid(False)
f0.savefig("stacked.png", bbox_inches="tight", dpi=dpi)
# Heat map - This uses a equalized histogram (built-in default), there are other options, though.
t2 = time.time()
heatmap_img = tf.Image(tf.shade(aggs, cmap=plt.cm.Spectral_r))
print("Time to create stacked image: {}".format(time.time() - t2))
# Save
f1 = plt.figure(figsize=(w / dpi, h / dpi), dpi=dpi)
ax1 = f1.add_subplot(111)
ax1.imshow(heatmap_img.to_pil())
ax1.grid(False)
f1.savefig("heatmap.png", bbox_inches="tight", dpi=dpi)
With the following run times (in seconds):
Time to aggregate line data: 0.7710442543029785
Time to create stacked image: 0.06000351905822754
Time to create stacked image: 0.05600309371948242
The resulting plots:
Although it seems you have tried this, plotting the counts seems to give a good representation of the data. However, it really depends on what you're trying to find in your data and what it is supposed to tell you.
The long run time comes from plotting so many lines; a heatmap based on the counts, however, will plot fairly quickly.
I created some dummy sine waves, varying the noise, number of lines, amplitude and shift, and added both a boxplot and a heatmap.
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
import random
import pandas as pd

np.random.seed(0)

# create dummy data
N = 200
sinuses = []
no_lines = 200

for i in range(no_lines):
    a = np.random.randint(5, 40) / 5                   # amplitude
    x = random.choice([int(N / 5), int(N / (2 / 5))])  # random shift
    sinuses.append(np.roll(a * np.sin(np.linspace(0, 2 * np.pi, N)) + np.random.randn(N), x))

fig = plt.figure(figsize=(20 / 2.54, 20 / 2.54))
sins = pd.DataFrame(sinuses, )

ax1 = plt.subplot2grid((3, 10), (0, 0), colspan=10)
ax2 = plt.subplot2grid((3, 10), (1, 0), colspan=10)
ax3 = plt.subplot2grid((3, 10), (2, 0), colspan=9)
ax4 = plt.subplot2grid((3, 10), (2, 9))

# plot line data
sins.T.plot(ax=ax1, color='lightblue', linewidth=.3)
ax1.legend_.remove()
ax1.set_xlim(0, N)

# try boxplot
sins.plot.box(ax=ax2, showfliers=False)
xticks = ax2.xaxis.get_major_ticks()
for index, label in enumerate(ax2.get_xaxis().get_ticklabels()):
    xticks[index].set_visible(False)  # hide ticks where labels are hidden

# make a list of bins
no_bins = 20
bins = list(np.arange(sins.min().min(), sins.max().max(), int(abs(sins.min().min()) + sins.max().max()) / no_bins))
bins.append(sins.max().max())

# calculate histogram
hists = []
for col in sins.columns:
    count, division = np.histogram(sins.iloc[:, col], bins=bins)
    hists.append(count)
hists = pd.DataFrame(hists, columns=[str(i) for i in bins[1:]])
print(hists.shape, '\n', hists.head())

cmap = mpl.colors.ListedColormap(['white', '#FFFFBB', '#C3FDB8', '#B5EAAA', '#64E986', '#54C571',
                                  '#4AA02C', '#347C17', '#347235', '#25383C', '#254117'])

# heatmap
im = ax3.pcolor(hists.T, cmap=cmap)
cbar = plt.colorbar(im, cax=ax4)

yticks = np.arange(0, len(bins))
yticklabels = hists.columns.tolist()
ax3.set_yticks(yticks)
ax3.set_yticklabels([round(i, 1) for i in bins])
ax3.set_title('Count')
yticks = ax3.yaxis.get_major_ticks()
for index, label in enumerate(ax3.get_yaxis().get_ticklabels()):
    if index % 3 != 0:                    # make some labels invisible
        yticks[index].set_visible(False)  # hide ticks where labels are hidden

plt.show()
Although the boxplot is easy to interpret, it doesn't show the actual distribution of the data very well, but knowing where the median and quantiles lie may be helpful.
Increasing the number of lines and the number of values per line increases plotting time considerably for the line plots; the heatmap is still fairly quick to generate, though. The boxplot becomes indiscernible, however.
I couldn't exactly replicate your data (or know the actual size of it), but perhaps the heatmap may be helpful.

Timeseries plot with min/max shading using Seaborn

I am trying to create a 3-line time series plot based on the following data, in a Week x Overload graph, where each Cluster is a different line.
I have multiple observations for each (Cluster, Week) pair (5 for each atm, will have 1000). I would like the points on the line to be the average Overload value for that specific (Cluster, Week) pair, and the band be the min/max values of it.
Currently using the following bit of code to plot it, but I'm not getting any lines, as I don't know what unit to specify using the current dataframe:
ax14 = sns.tsplot(data = long_total_cluster_capacity_overload_df, value = "Overload", time = "Week", condition = "Cluster")
GIST Data
I have a feeling I still need to reshape my dataframe, but I have no idea how. I'm looking for a final result that looks like this:
Based on this incredible answer, I was able to create a monkey patch that beautifully does what you are looking for.
import numpy as np
import pandas as pd
import seaborn as sns
import seaborn.timeseries

def _plot_range_band(*args, central_data=None, ci=None, data=None, **kwargs):
    upper = data.max(axis=0)
    lower = data.min(axis=0)
    #import pdb; pdb.set_trace()
    ci = np.asarray((lower, upper))
    kwargs.update({"central_data": central_data, "ci": ci, "data": data})
    seaborn.timeseries._plot_ci_band(*args, **kwargs)

seaborn.timeseries._plot_range_band = _plot_range_band

cluster_overload = pd.read_csv("TSplot.csv", delim_whitespace=True)
cluster_overload['Unit'] = cluster_overload.groupby(['Cluster', 'Week']).cumcount()

ax = sns.tsplot(time='Week', value="Overload", condition="Cluster", unit="Unit", data=cluster_overload,
                err_style="range_band", n_boot=0)
Output Graph:
Notice that the shaded regions line up with the true maximum and minimums in the line graph!
If you figure out why the unit variable is required, please let me know.
If you do not want them all on the same graph then:
import numpy as np
import pandas as pd
import seaborn as sns
import seaborn.timeseries

def _plot_range_band(*args, central_data=None, ci=None, data=None, **kwargs):
    upper = data.max(axis=0)
    lower = data.min(axis=0)
    #import pdb; pdb.set_trace()
    ci = np.asarray((lower, upper))
    kwargs.update({"central_data": central_data, "ci": ci, "data": data})
    seaborn.timeseries._plot_ci_band(*args, **kwargs)

seaborn.timeseries._plot_range_band = _plot_range_band

cluster_overload = pd.read_csv("TSplot.csv", delim_whitespace=True)
cluster_overload['subindex'] = cluster_overload.groupby(['Cluster', 'Week']).cumcount()

def customPlot(*args, **kwargs):
    df = kwargs.pop('data')
    pivoted = df.pivot(index='subindex', columns='Week', values='Overload')
    ax = sns.tsplot(pivoted.values, err_style="range_band", n_boot=0, color=kwargs['color'])

g = sns.FacetGrid(cluster_overload, row="Cluster", sharey=False, hue='Cluster', aspect=3)
g = g.map_dataframe(customPlot, 'Week', 'Overload', 'subindex')
Which produces the following, (you can obviously play with the aspect ratio if you think the proportions are off)
I finally used the good old plot with a design (subplots) that seems (to me) more readable.
import pandas as pd
import seaborn as sns

df = pd.read_csv('TSplot.csv', sep='\t', index_col=0)

# Compute the min, mean and max (could also be other values)
grouped = df.groupby(["Cluster", "Week"]).agg({'Overload': ['min', 'mean', 'max']}).unstack("Cluster")

# Plot with subplots since it is more readable
axes = grouped.loc[:, ('Overload', 'mean')].plot(subplots=True)

# Get the color palette used
palette = sns.color_palette()

# Initialize an index to get each cluster and each color
index = 0
for ax in axes:
    ax.fill_between(grouped.index, grouped.loc[:, ('Overload', 'mean', index + 1)],
                    grouped.loc[:, ('Overload', 'max', index + 1)], alpha=.2, color=palette[index])
    ax.fill_between(grouped.index,
                    grouped.loc[:, ('Overload', 'min', index + 1)], grouped.loc[:, ('Overload', 'mean', index + 1)], alpha=.2, color=palette[index])
    index += 1
I really thought I would be able to do it with seaborn.tsplot. But it does not quite look right. Here is the result I get with seaborn:
cluster_overload = pd.read_csv("TSplot.csv", delim_whitespace=True)
cluster_overload['Unit'] = cluster_overload.groupby(['Cluster','Week']).cumcount()
ax = sns.tsplot(time='Week',value="Overload", condition="Cluster", ci=100, unit="Unit", data=cluster_overload)
Outputs:
I am really confused as to why the unit parameter is necessary, since my understanding is that all the data is aggregated based on (time, condition). The Seaborn documentation defines unit as:
Field in the data DataFrame identifying the sampling unit (e.g. subject, neuron, etc.). The error representation will collapse over units at each time/condition observation. This has no role when data is an array.
I am not certain what 'collapse over' means here, especially since my reading of it wouldn't make unit a required variable.
Anyways, here's the output if you want exactly what you discussed, not nearly as pretty. I am not sure how to manually shade in those regions, but please share if you figure it out.
cluster_overload = pd.read_csv("TSplot.csv", delim_whitespace=True)
grouped = cluster_overload.groupby(['Cluster','Week'],as_index=False)
stats = grouped.agg(['min','mean','max']).unstack().T
stats.index = stats.index.droplevel(0)
colors = ['b','g','r']
ax = stats.loc['mean'].plot(color=colors, alpha=0.8, linewidth=3)
stats.loc['max'].plot(ax=ax,color=colors,legend=False, alpha=0.3)
stats.loc['min'].plot(ax=ax,color=colors,legend=False, alpha=0.3)
Outputs:
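For the manual shading mentioned above, something along these lines might work (a rough sketch, assuming stats.loc['min'] and stats.loc['max'] have the same Week-by-Cluster layout as the stats.loc['mean'] frame plotted above):

# Shade between each cluster's min and max, one color per cluster,
# on the same axes as the mean/min/max lines.
mean_df = stats.loc['mean']
for color, cluster in zip(colors, mean_df.columns):
    ax.fill_between(mean_df.index,
                    stats.loc['min'][cluster],
                    stats.loc['max'][cluster],
                    color=color, alpha=0.2)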