Confidence intervals for logistic fit in seaborn - python

When the same dataset is plotted with a logistic regression fit using seaborn in Python and ggplot2 in R, the confidence intervals are drastically different, even though the docs in both cases say they display a 95% CI by default. What am I missing here?
R code:
library("ggplot2")
library("MASS")
data(menarche)
ggplot(menarche, aes(x=Age, y=Menarche/Total)) + geom_point(shape=19) + geom_smooth(method="glm", method.args=list(family="binomial"))
write.csv(menarche, file='menarche.csv')
Python code:
import seaborn as sns
from matplotlib import pyplot as plt
import pandas as pd
data = pd.read_csv('menarche.csv')
data['Fraction'] = data['Menarche']/data['Total']
sns.regplot(x='Age', y='Fraction', data=data, logistic=True)
plt.show()
Edit: binarizing the response creates a similar plot in ggplot2
Based on mwascom's comments, I converted the data to a binary response variable and used it for comparison. Now the confidence intervals look similar, and it appears that this is what seaborn is plotting when given the fraction of successes. I have yet to figure out why the two differ when the fraction of successes is given as the response variable (the GLM fits are the same in terms of intercept and coefficient).
import numpy as np
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt
menarche = pd.read_csv('menarche.csv')
# Convert the data into binary (yes/no) response form
tmp = []
for ii, row in enumerate(menarche[['Age', 'Total', 'Menarche']].to_numpy()):
    ages = [row[0]] * int(row[1])
    # randomly mark `Menarche` of the `Total` girls in this age group as responders
    yes_idx = np.random.choice(int(row[1]), size=int(row[2]), replace=False)
    response = np.zeros(int(row[1]))
    response[yes_idx] = 1
    group = [ii] * int(row[1])
    group_data = np.c_[group, ages, response]
    tmp.append(group_data)
binarized = np.vstack(tmp)
menarche_b = pd.DataFrame(binarized, columns=['Group', 'Age', 'Menarche'])
menarche_b.to_csv('menarche_binarized.csv') # for feeding to R
menarche_b['intercept'] = 1.0
model = sm.GLM(menarche_b['Menarche'], menarche_b[['Age', 'intercept']], family=sm.families.Binomial())
result = model.fit()
print(result.summary())
import seaborn as sns
sns.regplot(x='Age', y='Menarche', data=menarche_b, logistic=True)
plt.show()
produces the same curve (now the data points are plotted at 0 and 1).
data = read.csv('menarche_binarized.csv')
model = glm(Menarche ~ Age, data=data, family=binomial(logit))
model
library("ggplot2")
ggplot(data, aes(x=Age, y=Menarche)) + stat_smooth(method="glm", method.args=list(family=binomial(logit)))
now produces what looks similar to the seaborn output (similar confidence intervals).
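For completeness, the grouped counts can also be fed to statsmodels directly: with a two-column (successes, failures) response, the binomial GLM gives the same intercept and Age coefficient as the binarized fit above (a minimal sketch, not from the original post, assuming the menarche.csv written out by the R code):
import pandas as pd
import statsmodels.api as sm
menarche = pd.read_csv('menarche.csv')
# two-column response: (number of successes, number of failures) per age group
endog = pd.DataFrame({'Menarche': menarche['Menarche'],
                      'NoMenarche': menarche['Total'] - menarche['Menarche']})
exog = sm.add_constant(menarche['Age'])
grouped = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(grouped.summary())  # intercept and Age coefficient match the binarized fit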

Related

Plotting tendency line in Python

I want to plot a tendency line on top of a data plot. This must be simple, but I have not been able to figure out how to do it.
Let us say I have the following:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), columns=list('A'))
ax = sns.lineplot(data=df)
ax.set(xlabel="Index",
       ylabel="Variable",
       title="Sample")
plt.show()
The resulting plot is:
What I would like to add is a tendency line. Something like the red line in the following:
I thank you for any feedback.
A moving average is one method (my first thought, and already suggested).
Another method is to use a polynomial fit. Since you had 100 points in your original data, I picked a 10th order fit (square root of data length) in the example below. With some modification of your original code:
idx = [i for i in range(100)]
rnd = np.random.randint(0, 100, size=100)
ser = pd.Series(rnd, idx)
# fit a 10th-order polynomial and evaluate it on the same x values
fit = np.polyfit(idx, rnd, 10)
pf = np.poly1d(fit)
plt.plot(idx, rnd, 'b', idx, pf(idx), 'r')
plt.show()
This code provides a plot like this:
You can do something like this using Rolling Average:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 1)), columns=list('A'))
# 7-point rolling average; shift(-3) centers the window on each point
df["rolling_avg"] = df.A.rolling(7).mean().shift(-3)
sns.lineplot(data=df)
plt.show()
You could also do a Regression plot to analyse how data can be interpolated using:
ax = sns.regplot(x=df.index, y="A", data=df,
                 scatter_kws={"s": 10},
                 order=10, ci=None)

Labeling points on matplotlib scatter plot from output of TSNE using pandas dataframe

The question topic is a little complex because I need a lot of help, lol. To explain, I have a CSV of data with labels (names) and numerical data...
name,post_count,follower_count,following_count,anonymous_pic,is_private,...
adam,3,997,435,0,0,1,0,0,0,0
bob,2,723,600,0,0,1,0,0,0,0
jill,11,3193,962,0,0,1,0,0,0,0
sara,0,225,298,0,0,1,0,0,0,0
and so on. This data is loaded into a pandas DataFrame from the CSV. Now, I wish to pass only the numerical parts of this data into a sklearn.manifold class called TSNE (t-distributed stochastic neighbor embedding), which will output a list the same size as the input data, where each element of the new list is a list of size k (where k is the number of components specified as an argument to the TSNE class). In my case k = 2.
I'm graphing this data on a 2-D scatter plot from matplotlib, and I'd like to be able to inspect the labels on the data. I know matplotlib has an annotate feature in which points can be labeled, but how do I go about separating these labels from the data for TSNE? And if I just separate the labels prior to transformation, how can I go about ensuring that I'm relabeling the right points?
I'd like to be able to inspect these names because I need to see if the transformation is useful on my data. This way I can analyze a few really bizarre places and see if something interesting is happening. Here is my code if you find it useful (although I'll admit it's just scratchwork):
# Data structuring
import pandas as pd
import numpy as np
# Plotting
import seaborn as sns
import matplotlib.pyplot as plt
sns.set() # for plot styling
# Load data
df = pd.read_csv('user_data.csv')
print(df.head())
# sklearn
from sklearn.mixture import GaussianMixture  # GMM was removed from scikit-learn; GaussianMixture replaces it
from sklearn.manifold import TSNE
tsne = TSNE(n_components = 2, init = 'random', random_state = 0)
lab_proj = tsne.fit_transform(df)
x = [i[0] for i in lab_proj]
y = [i[1] for i in lab_proj]
print(len(lab_proj))
df['PCA1'] = x
df['PCA2'] = y
model = GaussianMixture(n_components=1, covariance_type='full')
model.fit(df)
y_gmm = model.predict(df)
df['cluster'] = y_gmm
sns.lmplot(x='PCA1', y='PCA2', data=df, col='cluster', fit_reg=False)
plt.show()
Thanks!
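One common way to keep the labels straight is to set the name column aside before the transform: fit_transform preserves row order, so the held-out names still line up with the projected points and can be passed to matplotlib's annotate. A minimal sketch (not from the original thread, assuming the name column shown above):
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
df = pd.read_csv('user_data.csv')
labels = df['name']                      # keep the labels aside
numeric = df.drop(columns=['name'])      # TSNE only sees the numeric columns
proj = TSNE(n_components=2, init='random', random_state=0).fit_transform(numeric)
fig, ax = plt.subplots()
ax.scatter(proj[:, 0], proj[:, 1], s=10)
for name, (x, y) in zip(labels, proj):   # row order is preserved, so labels line up
    ax.annotate(name, (x, y), fontsize=8)
plt.show()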

How to create a scatter plot with regression line based on statsmodels OLS?

I am having difficulty adding a regression line (the one which statsmodels OLS is based on) onto a scatter plot. Note that with seaborn's lmplot, I can get a line (see example), but I would like to use the exact one coming from statsmodels OLS for total consistency.
How can I adjust code below to add in the regression line into the first scatter plot?
import statsmodels.api as sm
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(0)
data = {'Xvalue': range(20, 30), 'Yvalue': np.random.randint(low=10, high=100, size=10)}
data = pd.DataFrame(data)
X = data[['Xvalue']]
Y = data['Yvalue']
model2 = sm.OLS(Y, sm.add_constant(X))
model_fit = model2.fit()
print(model_fit.summary())
#Plot
data.plot(kind='scatter', x='Xvalue', y='Yvalue')
#Seaborn
sns.lmplot(x='Xvalue', y='Yvalue', data=data)
Scatter plot (trying to work out how to add in the statsmodels OLS regression line):
seaborn lmplot with its regression line (trying to mimic this):
Thanks to the link from @busybear, it now works!
import statsmodels.api as sm
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(0)
data = {'Xvalue': range(20, 30), 'Yvalue': np.random.randint(low=10, high=100, size=10)}
data = pd.DataFrame(data)
X = data[['Xvalue']]
Y = data['Yvalue']
model = sm.OLS(data['Yvalue'], sm.add_constant(data['Xvalue']))
model_fit = model.fit()
p = model_fit.params
print(model_fit.summary())
#Plot
print(p)
x = np.arange(0,40)
ax = data.plot(kind='scatter', x='Xvalue', y='Yvalue')
ax.plot(x, p.const + p.Xvalue * x)
ax.set_xlim([0,30])
#Seaborn
sns.lmplot(x='Xvalue', y='Yvalue', data=data)
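As a side note (a small sketch building on the variables above, not part of the original answer), the same line can come straight from the fitted model via predict, avoiding the manual params arithmetic:
x = np.arange(0, 40)
ax = data.plot(kind='scatter', x='Xvalue', y='Yvalue')
# exog must have the same columns as the fit: a constant plus Xvalue
ax.plot(x, model_fit.predict(sm.add_constant(x)))
ax.set_xlim([0, 30])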

Dendrogram using pandas and scipy

I wish to generate a dendrogram based on correlation using pandas and scipy. I use a dataset (as a DataFrame) consisting of returns, which is of size n x m, where n is the number of dates and m the number of companies. Then I simply run the script
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
import numpy as np
m = 5
dates = pd.date_range('2013-01-01', periods=365)
random_matrix = np.random.normal(0, 0.01, size=(len(dates), m))
dataframe = pd.DataFrame(data=random_matrix, index=dates)
z = hc.linkage(dataframe.values.T, method='average', metric='correlation')
dendrogram = hc.dendrogram(z, labels=dataframe.columns)
plt.show()
and I get a nice dendrogram. Now, the thing is that I'd also like to use other correlation measures apart from just ordinary Pearson correlation, a feature that's incorporated in pandas by simply invoking DataFrame.corr(method='<method>'). So at first I thought it would be enough to simply run the following code
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
import numpy as np
m = 5
dates = pd.date_range('2013-01-01', periods=365)
random_returns = np.random.normal(0, 0.01, size=(len(dates), m))
dataframe = pd.DataFrame(data=random_returns, index=dates)
corr = dataframe.corr()
z = hc.linkage(corr.values, method='average')
dendrogram = hc.dendrogram(z, labels=corr.columns)
plt.show()
However, if I do this I get strange values on the y-axis: the maximum value is > 1.4, whereas if I run the first script it's about 1. What am I doing wrong? Am I using the wrong metric in hc.linkage?
EDIT: I might add that the shape of the dendrogram is exactly the same. Do I have to normalize the third column of the resulting z with the maximum value?
Found the solution. If you have already calculated a square distance matrix (here 1 - corr, so that the correlation similarity becomes a distance), you simply have to condense it using distance.squareform. That is,
dataframe = pd.DataFrame(data=random_returns, index=dates)
corr = 1 - dataframe.corr()
corr_condensed = hc.distance.squareform(corr) # convert to condensed
z = hc.linkage(corr_condensed, method='average')
dendrogram = hc.dendrogram(z, labels=corr.columns)
plt.show()
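As a quick sanity check (a sketch using dataframe and hc from the snippets above, not part of the original answer), linkage on the raw observations with metric='correlation' and linkage on the condensed 1 - corr matrix should produce the same merge heights up to floating-point noise:
import numpy as np
from scipy.spatial.distance import squareform
z_raw = hc.linkage(dataframe.values.T, method='average', metric='correlation')
z_corr = hc.linkage(squareform((1 - dataframe.corr()).values, checks=False), method='average')
print(np.allclose(z_raw[:, 2], z_corr[:, 2]))  # should print True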

Major Difference in 2D kernel Density Plots: Seaborn and R

I am trying to plot data using the 2D kernel density plot of Seaborn's jointplot function (using statsmodels' KDEMultivariate function to calculate a data-driven bandwidth). I've plotted a 2D kernel density in R using the same data and the result looks very good (using the 'ks' package), while the Seaborn plot looks very, very different.
I am using the same exact data and the same exact bandwidth for each (taking the bandwidth given by KDEMultivariate and passing that to the R method).
Here is the input.csv data used: https://app.box.com/s/ot7d36t44wrr85pusp5657pc1w2kf5hj
Below are the code used in each and output images from each.
Python / Seaborn:
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas as pd
import seaborn as sns
data = pd.read_csv("input.csv", dtype={'x': float, 'y': float}, skiprows=0)
bw_ml_x = sm.nonparametric.KDEMultivariate(data=data['x'], var_type='c', bw='cv_ml')
bw_ml_y = sm.nonparametric.KDEMultivariate(data=data['y'], var_type='c', bw='cv_ml')
g = sns.jointplot(x='x', y='y', data=data, kind="kde", stat_func=None, bw=[bw_ml_x.bw, bw_ml_y.bw])
g.plot_joint(plt.scatter, c="w")
g.ax_joint.collections[0].set_alpha(0)
plt.show()
Img for Seaborn plot:
The bandwidth given by bw_ml_x.bw and bw_ml_y.bw is placed in a 2 x 2 R matrix H, where H[1,1] = bw_ml_x.bw, H[2,2] = bw_ml_y.bw, and the other values are set to zero.
R:
library(ks)
fhat <- kde(x=as.data.frame(data[1], data[2]), H=H)
plot(fhat, display="filled.contour2", cont=seq(10,90,by=10))
Img for R plot:
Looking at your Seaborn/Python plot, many of the points cluster along the (0,n) region and the (1,1) region of your space, just as the KDE of the R plot shows. This indicates that Seaborn and R are looking at the same data; we simply need to reformulate the call to the kde in Seaborn in order to visualize the KDE gradients.
If you modify your Python call to match the documentation for Kernel Density Estimation in Seaborn, you'll get a proper 2D KDE out of Python:
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas as pd
import seaborn as sns
data = pd.read_csv("input.csv", dtype={'x': float, 'y': float}, skiprows=0)
# bandwidths computed as before, but no longer passed to jointplot below
bw_ml_x = sm.nonparametric.KDEMultivariate(data=data['x'], var_type='c', bw='cv_ml')
bw_ml_y = sm.nonparametric.KDEMultivariate(data=data['y'], var_type='c', bw='cv_ml')
g = sns.jointplot(x='x', y='y', data=data, kind="kde")
g.plot_joint(plt.scatter, c="w")
g.ax_joint.collections[0].set_alpha(0)
plt.show()
This accords with the R plot (though the kernel estimators seem to be slightly different, which would account for the variation in gradients between the plots):
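If you want full control over what is being estimated, one way to cross-check both plots is to build the 2D density by hand with scipy's gaussian_kde and draw the contours yourself (a sketch, not part of either snippet above; note that gaussian_kde uses a single scalar bandwidth factor rather than the per-dimension bandwidths from KDEMultivariate):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
data = pd.read_csv("input.csv", dtype={'x': float, 'y': float}, skiprows=0)
values = np.vstack([data['x'], data['y']])
kde = gaussian_kde(values)  # Scott's rule bandwidth by default
xx, yy = np.mgrid[data['x'].min():data['x'].max():100j,
                  data['y'].min():data['y'].max():100j]
density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)
plt.contourf(xx, yy, density, levels=10)
plt.scatter(data['x'], data['y'], c='w', s=5)
plt.show()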
