I am trying to plot data using the 2D kernel density plot of Seaborn's jointplot function (using statsmodels' KDEMultivariate function to calculate a data-driven bandwidth). I've plotted a 2D kernel density in R using the same data and the result looks very good (using the 'ks' package), while the Seaborn plot looks very different.
I am using the exact same data and the exact same bandwidth for each (taking the bandwidth given by KDEMultivariate and passing it to the R method).
Here is the input.csv data used: https://app.box.com/s/ot7d36t44wrr85pusp5657pc1w2kf5hj
Below is the code used for each, along with the output image from each.
Python / Seaborn:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

data = pd.read_csv("input.csv", dtype={'x': float, 'y': float}, skiprows=0)
# cross-validated maximum-likelihood bandwidth for each marginal
bw_ml_x = sm.nonparametric.KDEMultivariate(data=data['x'], var_type='c', bw='cv_ml')
bw_ml_y = sm.nonparametric.KDEMultivariate(data=data['y'], var_type='c', bw='cv_ml')

g = sns.jointplot(x='x', y='y', data=data, kind="kde", stat_func=None, bw=[bw_ml_x.bw, bw_ml_y.bw])
g.plot_joint(plt.scatter, c="w")
g.ax_joint.collections[0].set_alpha(0)
plt.show()
Img for Seaborn plot:
The bandwidths given by bw_ml_x.bw and bw_ml_y.bw are placed in a 2 x 2 R matrix H, where H[1,1] = bw_ml_x.bw, H[2,2] = bw_ml_y.bw, and the other entries are set to zero.
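In numpy terms (an illustration only; the actual matrix is built in R from the values computed by the Python snippet above), the diagonal placement looks like this:
import numpy as np
# KDEMultivariate returns each bandwidth as a length-1 array for 1-D data
H = np.diag([bw_ml_x.bw[0], bw_ml_y.bw[0]])  # off-diagonal entries stay zero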
R:
library(ks)
data <- read.csv("input.csv")
# H is the 2 x 2 diagonal bandwidth matrix described above
fhat <- kde(x=data.frame(data[1], data[2]), H=H)
plot(fhat, display="filled.contour2", cont=seq(10,90,by=10))
Img for R plot:
Looking at your Seaborn/Python plot, many of the points cluster along the (0, n) region and the (1, 1) region of your space, just as the KDE of the R plot shows. This indicates that Seaborn and R are looking at the same data; we simply need to reformulate the call to the KDE in Seaborn in order to visualize the KDE gradients.
If you modify your Python call to match the documentation for kernel density estimation in Seaborn, you'll get a proper 2D KDE out of Python:
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas as pd
import seaborn as sns
data = pd.read_csv("input.csv", dtype={'x': float, 'y': float}, skiprows=0)
bw_ml_x = sm.nonparametric.KDEMultivariate(data=data['x'], var_type='c', bw='cv_ml')
bw_ml_y = sm.nonparametric.KDEMultivariate(data=data['y'], var_type='c', bw='cv_ml')
g = sns.jointplot(x='x', y='y', data=data, kind="kde")
g.plot_joint(plt.scatter, c="w")
g.ax_joint.collections[0].set_alpha(0)
plt.show()
This accords with the R plot (though the kernel estimators seem to be slightly different, which would account for the variation in gradients between the plots):
Related
I'm wondering how I can plot a seaborn plot onto a different matplotlib plot. Currently I have two plots (one a heatmap, the other a soccer pitch), but when I plot the heatmap onto the pitch, I get the result below. (Plotting the pitch onto the heatmap isn't pretty either.) Any ideas how to fix it?
Note: the plots don't need a colorbar, and the grid structure isn't required either. I just care about the heatmap covering the entire space of the pitch. Thanks!
import pandas as pd
import numpy as np
from mplsoccer import Pitch
import seaborn as sns
nmf_shot_W = pd.read_csv('https://raw.githubusercontent.com/lucas-nelson-uiuc/datasets/main/nmf_show_W.csv').iloc[:, 1:]
nmf_shot_ThierryHenry = pd.read_csv('https://raw.githubusercontent.com/lucas-nelson-uiuc/datasets/main/nmf_show_Hth.csv')['Thierry Henry']
pitch = Pitch(pitch_type='statsbomb', line_zorder=2,
pitch_color='#22312b', line_color='#efefef')
dfdfdf = np.array(np.matmul(nmf_shot_W, nmf_shot_ThierryHenry)).reshape((24,25))
g_ax = sns.heatmap(dfdfdf)
pitch.draw(ax=g_ax)
Current output:
Desired output:
Use the built-in pitch.heatmap:
pitch.heatmap expects a stats dictionary of binned data, bin mesh, and bin centers:
stats (dict) – The keys are statistic (the calculated statistic), x_grid and y_grid (the bin's edges), and cx and cy (the bin centers).
In the mplsoccer heatmap demos, they construct this stats object using pitch.bin_statistic because they have raw data. However, you already have binned data ("calculated statistic"), so reconstruct the stats object manually by building the mesh and centers:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mplsoccer import Pitch
nmf_shot_W = pd.read_csv('71878281/nmf_show_W.csv', index_col=0)
nmf_shot_ThierryHenry = pd.read_csv('71878281/nmf_show_Hth.csv')['Thierry Henry']
statistic = np.dot(nmf_shot_W, nmf_shot_ThierryHenry.to_numpy()).reshape((24, 25))
# construct stats object from binned data, bin mesh, and bin centers
y, x = statistic.shape
x_grid = np.linspace(0, 120, x + 1)
y_grid = np.linspace(0, 80, y + 1)
cx = x_grid[:-1] + 0.5 * (x_grid[1] - x_grid[0])
cy = y_grid[:-1] + 0.5 * (y_grid[1] - y_grid[0])
stats = dict(statistic=statistic, x_grid=x_grid, y_grid=y_grid, cx=cx, cy=cy)
# use pitch.draw and pitch.heatmap as per mplsoccer demo
pitch = Pitch(pitch_type='statsbomb', line_zorder=2, pitch_color='#22312b', line_color='#efefef')
fig, ax = pitch.draw(figsize=(6.6, 4.125))
pcm = pitch.heatmap(stats, ax=ax, cmap='plasma')
cbar = fig.colorbar(pcm, ax=ax, shrink=0.6)
cbar.outline.set_edgecolor('#efefef')
cbar.ax.yaxis.set_tick_params(color='#efefef')
plt.setp(plt.getp(cbar.ax.axes, 'yticklabels'), color='#efefef')
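If you don't want the colorbar (as the question notes), drop the fig.colorbar call and the three styling lines that follow it.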
I'm plotting time series data using seaborn lineplot (https://seaborn.pydata.org/generated/seaborn.lineplot.html), plotting the median instead of the mean. Example code:
import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt

fmri = sns.load_dataset("fmri")
ax = sns.lineplot(x="timepoint", y="signal", estimator=np.median, data=fmri)
I want the error bands to show the interquartile range as opposed to the confidence interval. I know I can use ci = "sd" for standard deviation, but is there a simple way to add the IQR instead? I cannot figure it out.
Thank you!
I don't know if this can be done with seaborn alone, but here's one way to do it with matplotlib, keeping the seaborn style. The describe() method conveniently provides summary statistics for a DataFrame, among them the quartiles, which we can use to plot the medians with their interquartile ranges.
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
fmri_stats = fmri.groupby(['timepoint']).describe()
x = fmri_stats.index
medians = fmri_stats[('signal', '50%')]
medians.name = 'signal'
quartiles1 = fmri_stats[('signal', '25%')]
quartiles3 = fmri_stats[('signal', '75%')]
ax = sns.lineplot(x=x, y=medians)
ax.fill_between(x, quartiles1, quartiles3, alpha=0.3)
You can calculate the median within lineplot as you have done, set ci to None, and fill in using ax.fill_between():
import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
ax = sns.lineplot(x="timepoint", y="signal", estimator = np.median,
data=fmri,ci=None)
bounds = fmri.groupby('timepoint')['signal'].quantile((0.25,0.75)).unstack()
ax.fill_between(x=bounds.index, y1=bounds.iloc[:, 0], y2=bounds.iloc[:, 1], alpha=0.1)
This is possible since seaborn version 0.12; see here for the documentation.
pip install --upgrade seaborn
The estimator parameter specifies the point statistic, either by the name of a pandas method, such as 'median' or 'mean', or by a callable.
The errorbar parameter specifies how to plot the spread of the distribution, as a string, a (string, number) tuple, or a callable. To mark the median value and fill the area between the quartiles, you would need these parameters:
import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt

fmri = sns.load_dataset("fmri")
ax = sns.lineplot(data=fmri, x="timepoint", y="signal", estimator=np.median,
                  errorbar=lambda x: (np.quantile(x, 0.25), np.quantile(x, 0.75)))
You can now:
estimator="median", errorbar=("pi", 50)
A percentile interval of width 50 spans the 25th to 75th percentiles, i.e. the IQR.
https://seaborn.pydata.org/tutorial/error_bars
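Putting that together, a minimal sketch using the seaborn >= 0.12 API (with the same fmri example data as above):
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt

fmri = sns.load_dataset("fmri")
# "pi" is a percentile interval; a width of 50 covers the 25th to 75th percentiles (the IQR)
ax = sns.lineplot(data=fmri, x="timepoint", y="signal",
                  estimator="median", errorbar=("pi", 50))
plt.show()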
--- SAMPLE ---
I have a data set (sample) that contains 1,000 damage values (the values are very small, < 1e-6) in a 1-dimensional array (see the attached .json file). The sample seems to follow a lognormal distribution:
--- PROBLEM & WHAT I ALREADY TRIED ---
I tried the suggestions in this post Fitting empirical distribution to theoretical ones with Scipy (Python)? and this post Scipy: lognormal fitting to fit my data by a lognormal distribution. Neither of them works. :(
I always get something very large on the Y-axis, like the following:
Here is the code that I used in Python (and the data.json file can be downloaded from here):
from matplotlib import pyplot as plt
from scipy import stats as scistats
import json

with open("data.json", "r") as f:
    sample = json.load(f)  # load data: a 1000 x 1 array with many small values (< 1e-6)

fig, axis = plt.subplots()  # initiate a figure
N, nbins, patches = axis.hist(sample, bins=40)  # plot the sample as a histogram
axis.ticklabel_format(style='sci', scilimits=(-3, 4), axis='x')  # use scientific notation on the X-axis
axis.set_xlabel("Value")
axis.set_ylabel("Count")
plt.show()

fig, axis = plt.subplots()
param = scistats.lognorm.fit(sample)  # fit the data with a lognormal distribution
pdf_fitted = scistats.lognorm.pdf(nbins, *param[:-2], loc=param[-2], scale=param[-1])  # evaluate the fitted pdf at the bin edges
axis.plot(nbins, pdf_fitted)  # draw the fitted distribution on a new figure
plt.show()
I tried other kinds of distributions, but when I try to plot the result, the Y-axis is always too large and I can't plot it together with my histogram. Where did I fail?
I have also tried the suggestion in another question of mine: Use scipy lognormal distribution to fit data with small values, then show in matplotlib. But the value of the variable pdf_fitted is always far too big.
--- EXPECTING RESULT ---
Basically, what I want is something like this:
And here is the MATLAB code used to produce the above screenshot:
fname = 'data.json';
sample = jsondecode(fileread(fname));
% fitting distribution
pd = fitdist(sample, 'lognormal')
% A combined command for plotting histogram and distribution
figure();
histfit(sample,40,"lognormal")
So if you have any idea of the equivalent of fitdist and histfit in Python/Scipy/Numpy/Matplotlib, please post it!
Thanks a lot!
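For reference, a rough scipy equivalent of histfit might look like the sketch below (a sketch only, reusing the data.json file from the question). The key detail is density=True, which puts the histogram bars on the same probability-density scale as the fitted pdf; with values below 1e-6, the density values are legitimately huge, while raw counts are not.
import json
import numpy as np
from matplotlib import pyplot as plt
from scipy import stats as scistats

with open("data.json", "r") as f:
    sample = json.load(f)

fig, axis = plt.subplots()
# density=True scales the bars to integrate to 1, matching the pdf scale
counts, bins, patches = axis.hist(sample, bins=40, density=True)
shape, loc, scale = scistats.lognorm.fit(sample)
x = np.linspace(bins.min(), bins.max(), 200)
axis.plot(x, scistats.lognorm.pdf(x, shape, loc=loc, scale=scale))
plt.show()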
Try the distfit (or fitdist) library.
https://erdogant.github.io/distfit
pip install distfit
import numpy as np
# From the distfit library import the class distfit
from distfit import distfit

# Example data
X = np.random.normal(10, 3, 2000)

# Initialize
dist = distfit()
# Search for the best theoretical fit on your empirical data
dist.fit_transform(X)
# Plot the best fit
dist.plot()
# Summary plot
dist.plot_summary()
So in your case it would be:
dist = distfit(distr='lognorm')
dist.fit_transform(X)
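Here X would be your own sample loaded from data.json, rather than the normally distributed example data above.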
Try seaborn:
import seaborn as sns, numpy as np
sns.set(); np.random.seed(0)
x = np.random.randn(100)
ax = sns.distplot(x)
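Note that distplot is deprecated since seaborn 0.11; sns.histplot(x, kde=True) or sns.displot(x, kde=True) are the current equivalents.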
I tried your dataset with the OpenTURNS library, where x is the list given in your JSON file.
import openturns as ot
from openturns.viewer import View
import matplotlib.pyplot as plt
# first format your list x as a sample of dimension 1
sample = ot.Sample(x,1)
# use the LogNormalFactory to build a Lognormal distribution according to your sample
distribution = ot.LogNormalFactory().build(sample)
# draw the pdf of the obtained distribution
graph = distribution.drawPDF()
graph.setLegends(["LogNormal"])
View(graph)
plt.show()
If you want the parameters of the distribution:
print(distribution)
>>> LogNormal(muLog = -16.5263, sigmaLog = 0.636928, gamma = 3.01106e-08)
You can build the histogram the same way by calling HistogramFactory; then you can add one graph to another:
graph2 = ot.HistogramFactory().build(sample).drawPDF()
graph2.setColors(['blue'])
graph2.setLegends(["Histogram"])
graph2.add(graph)
view = View(graph2)
and set the boundary values if you want to zoom:
axes = view.getAxes()
_ = axes[0].set_xlim(-0.6e-07, 2.8e-07)
plt.show()
from astropy.io import ascii
import matplotlib.pyplot as plt

gh_data = ascii.read('http://dept.astro.lsa.umich.edu/~ericbell/data/GHOSTS/M81/ngc3031-field15.newphoto_radec')
ra = gh_data['col5'][:]
dec = gh_data['col6'][:]
f606 = gh_data['col3'][:]
f814 = gh_data['col4'][:]

plt.plot(f606 - f814, f814, 'bo', alpha=0.15)
plt.axis([-1, 2.5, 27, 23])
plt.xlabel('F606W-F814W')
plt.ylabel('F814W')
plt.title('Field 14')
plt.show()
The data set is imported and organized into different columns. I am trying to overlay a line of best fit (a linear regression) on the scatterplot, but I cannot figure out how. Thanks in advance.
As @rayryeng pointed out, your code just plots the data but doesn't actually compute any regression results to plot. Here's one way of doing it:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.DataFrame({"y": np.arange(1, 11) + np.random.rand(10),
                     "x": np.arange(1, 11) + np.random.rand(10)})
Use statsmodels OLS method to fit a regression line, and params to extract the coefficient on the single regressor:
beta_1 = sm.OLS(data.y, data.x).fit().params.iloc[0]  # single fitted coefficient (slope, no intercept)
Produce a scatterplot and add a regression line:
fig, ax = plt.subplots()
ax.scatter(data.x, data.y)
ax.plot(range(1,11), [i*beta_1 for i in range(1,11)], label = "best fit")
ax.legend(loc="best")
When the same dataset is plotted with a logistic regression fit using seaborn in Python and ggplot2 in R, the confidence intervals are drastically different, though the docs in both cases say they display a 95% CI by default. What am I missing here?
R code:
library("ggplot2")
library("MASS")
data(menarche)
ggplot(menarche, aes(x=Age, y=Menarche/Total)) + geom_point(shape=19) + geom_smooth(method="glm", family="binomial")
write.csv(menarche, file='menarche.csv')
Python code:
import seaborn as sns
from matplotlib import pyplot as plt
import pandas as pd
data = pd.read_csv('menarche.csv')
data['Fraction'] = data['Menarche']/data['Total']
sns.regplot(x='Age', y='Fraction', data=data, logistic=True)
plt.show()
Edit: binarizing the response creates a similar plot in ggplot2
Based on mwascom's comments, I converted the data to a binary response variable and used that for comparison. Now the confidence intervals look similar, and it appears this is what seaborn plots when given the fraction of successes. I have yet to figure out why the two differ when the fraction of successes is given as the response variable (the glm fit results are the same in terms of intercept and coefficient).
import numpy as np
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt
menarche = pd.read_csv('menarche.csv')
# Convert the data into binary (yes/no) response form
tmp = []
for ii, row in enumerate(menarche[['Age', 'Total', 'Menarche']].to_numpy()):
    ages = [row[0]] * int(row[1])
    yes_idx = np.random.choice(int(row[1]), size=int(row[2]), replace=False)
    response = np.zeros(int(row[1]))
    response[yes_idx] = 1
    group = [ii] * int(row[1])
    group_data = np.c_[group, ages, response]
    tmp.append(group_data)
binarized = np.vstack(tmp)
menarche_b = pd.DataFrame(binarized, columns=['Group', 'Age', 'Menarche'])
menarche_b.to_csv('menarche_binarized.csv')  # for feeding to R
menarche_b['intercept'] = 1.0
model = sm.GLM(menarche_b['Menarche'], menarche_b[['Age', 'intercept']], family=sm.families.Binomial())
result = model.fit()
result.summary()
import seaborn as sns
sns.regplot(x='Age', y='Menarche', data=menarche_b, logistic=True)
plt.show()
produces the same curve (now the data points are plotted at 0 and 1).
data = read.csv('menarche_binarized.csv')
model = glm(Menarche ~ Age, data=data, family=binomial(logit))
model
library("ggplot2")
ggplot(data, aes(x=Age, y=Menarche))+stat_smooth(method="glm", family=binomial(logit))
now produces output that looks similar to the seaborn plot (similar confidence intervals).