I fitted an OLS model to the Boston data set. My graph looks like the one below.
How can I annotate the linear regression equation just above the line, or somewhere else in the graph? How do I print the equation in Python?
I am fairly new to this area and am exploring Python at the moment. If somebody can help me, it would speed up my learning.
Many thanks!
I tried this as well.
My problem is: how do I annotate the above on the graph in equation format?
You can use coefficients of linear fit to make a legend like in this example:
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
tips = sns.load_dataset("tips")
# get coeffs of linear fit
slope, intercept, r_value, p_value, std_err = stats.linregress(tips['total_bill'],tips['tip'])
# use line_kws to set line label for legend
ax = sns.regplot(x="total_bill", y="tip", data=tips, color='b',
line_kws={'label':"y={0:.1f}x+{1:.1f}".format(slope,intercept)})
# plot legend
ax.legend()
plt.show()
If you use a more complex fitting function, you can use LaTeX notation: https://matplotlib.org/users/usetex.html
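One detail with the label format used above: when the intercept is negative, "y={0:.1f}x+{1:.1f}" renders as something like y=0.1x+-0.9. A minimal sketch of a helper that handles the sign is shown here; the name format_equation is hypothetical and is not part of seaborn or scipy.

```python
def format_equation(slope, intercept, precision=2):
    """Return a string such as 'y = 0.11x - 0.92' for a fitted line."""
    # choose the operator from the sign of the intercept, then print its magnitude
    sign = "-" if intercept < 0 else "+"
    return f"y = {slope:.{precision}f}x {sign} {abs(intercept):.{precision}f}"


label = format_equation(0.107, -0.92)
print(label)
```

The returned string can be passed as line_kws={'label': label} to regplot, or drawn directly with ax.text.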
To annotate multiple linear regression lines in the case of using seaborn lmplot you can do the following.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
df = pd.read_excel('data.xlsx')
# assume some random columns called EAV and PAV in your DataFrame
# assume a third variable used for grouping called "Mammal" which will be used for color coding
p = sns.lmplot(x='EAV', y='PAV',
data=df, hue='Mammal',
line_kws={'label':"Linear Reg"}, legend=True)
ax = p.axes[0, 0]
ax.legend()
leg = ax.get_legend()
L_labels = leg.get_texts()
# assuming you computed r_squared which is the coefficient of determination somewhere else
slope, intercept, r_value, p_value, std_err = stats.linregress(df['EAV'],df['PAV'])
label_line_1 = r'$y={0:.1f}x+{1:.1f}$'.format(slope,intercept)
label_line_2 = r'$R^2:{0:.2f}$'.format(0.21) # as an example, or whatever value you computed
L_labels[0].set_text(label_line_1)
L_labels[1].set_text(label_line_2)
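The snippet above uses a placeholder value for R². For a simple (one-variable) linear fit, the coefficient of determination equals the squared Pearson correlation, so it can be computed rather than hard-coded. A minimal sketch with made-up data follows; it uses np.corrcoef to stay dependency-light, and squaring the r_value returned by stats.linregress should give the same number.

```python
import numpy as np

# made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
r_squared = r ** 2            # R^2 for a simple linear regression
label_r2 = r"$R^2: {0:.2f}$".format(r_squared)
print(label_r2)
```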
Result:
Simpler syntax, same result.
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
slope, intercept, r_value, pv, se = stats.linregress(df['alcohol'],df['magnesium'])
sns.regplot(x="alcohol", y="magnesium", data=df,
ci=None, label="y={0:.1f}x+{1:.1f}".format(slope, intercept)).legend(loc="best")
I extended the solution by @RMS to work for a multi-panel lmplot example (using data from a sleep-deprivation study (Belenky et al., J Sleep Res 2003) available in pydataset). This allows one to have axis-specific legends/labels without having to use, e.g., regplot and plt.subplots.
Edit: Added second method using the map_dataframe() method from FacetGrid(), as suggested in the answer by Marcos here.
import numpy as np
import scipy as sp
import pandas as pd
import seaborn as sns
import pydataset as pds
import matplotlib.pyplot as plt
# use seaborn theme
sns.set_theme(color_codes=True)
# Load data from sleep deprivation study (Belenky et al, J Sleep Res 2003)
# ['Reaction', 'Days', 'Subject'] = [reaction time (ms), deprivation time, Subj. No.]
df = pds.data("sleepstudy")
# convert integer label to string
df['Subject'] = df['Subject'].apply(str)
# perform linear regressions outside of seaborn to get parameters
subjects = np.unique(df['Subject'].to_numpy())
fit_str = []
for s in subjects:
    ddf = df[df['Subject'] == s]
    m, b, r_value, p_value, std_err = \
        sp.stats.linregress(ddf['Days'],ddf['Reaction'])
    fs = f"y = {m:.2f} x + {b:.1f}"
    fit_str.append(fs)
method_one = False
method_two = True
if method_one:
    # Access legend on each axis to write equation
    #
    # Create 18 panel lmplot with seaborn
    g = sns.lmplot(x="Days", y="Reaction", col="Subject",
                   col_wrap=6, height=2.5, data=df,
                   line_kws={'label':"Linear Reg"}, legend=True)
    # write string with fit result into legend string of each axis
    axes = g.axes # 18 element list of axes objects
    i=0
    for ax in axes:
        ax.legend() # create legend on axis
        leg = ax.get_legend()
        leg_labels = leg.get_texts()
        leg_labels[0].set_text(fit_str[i])
        i += 1
elif method_two:
    # use the .map_dataframe() method from FacetGrid() to annotate plot
    # https://stackoverflow.com/questions/25579227 (answer by @Marcos)
    #
    # Create 18 panel lmplot with seaborn
    g = sns.lmplot(x="Days", y="Reaction", col="Subject",
                   col_wrap=6, height=2.5, data=df)
    def annotate(data, **kws):
        m, b, r_value, p_value, std_err = \
            sp.stats.linregress(data['Days'],data['Reaction'])
        ax = plt.gca()
        ax.text(0.5, 0.9, f"y = {m:.2f} x + {b:.1f}",
                horizontalalignment='center',
                verticalalignment='center',
                transform=ax.transAxes)
    g.map_dataframe(annotate)
# write figure to pdf
plt.savefig("sleepstudy_data_w-fits.pdf")
Output (Method 1):
Output (Method 2):
Update 2022-05-11: Unrelated to the plotting techniques, it turns out that this interpretation of the data (and that provided, e.g., in the original R repository) is incorrect. See the reported issue here. Fits should be done to days 2-9, corresponding to zero to seven days of sleep deprivation (3h sleep per night). The first three data points correspond to training and baseline days (all with 8h sleep per night).
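A minimal sketch of the restricted fit described in this update, assuming the same 'Days' and 'Reaction' column names as the sleepstudy frame. The helper name fit_deprivation_days is hypothetical, the demo numbers are made up, and np.polyfit is used in place of linregress purely to keep the example short.

```python
import numpy as np
import pandas as pd


def fit_deprivation_days(data, first_day=2):
    """Fit Reaction against days of deprivation, skipping the earlier days.

    Day `first_day` is treated as zero days of sleep deprivation.
    Returns (slope, intercept) of a least-squares line.
    """
    subset = data[data["Days"] >= first_day]
    x = (subset["Days"] - first_day).to_numpy(dtype=float)
    y = subset["Reaction"].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 fit, highest power first
    return slope, intercept


# made-up numbers for a single subject, for illustration only
demo = pd.DataFrame({
    "Days": np.arange(10),
    "Reaction": [260.0, 255.0] + [250.0 + 10.0 * d for d in range(8)],
})
m, b = fit_deprivation_days(demo)
print(f"y = {m:.2f} x + {b:.1f}")
```

Inside the annotate function of method two, the same filtering could be applied to the per-panel data before fitting.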
In Pandas, I am doing:
bp = p_df.groupby('class').plot(kind='kde')
p_df is a dataframe object.
However, this is producing two plots, one for each class.
How do I force one plot with both classes in the same plot?
Version 1:
You can create your axis, and then use the ax keyword of DataFrameGroupBy.plot to add everything to these axes:
import pandas as pd
import matplotlib.pyplot as plt
p_df = pd.DataFrame({"class": [1,1,2,2,1], "a": [2,3,2,3,2]})
fig, ax = plt.subplots(figsize=(8,6))
bp = p_df.groupby('class').plot(kind='kde', ax=ax)
This is the result:
Unfortunately, the labeling of the legend does not make too much sense here.
Version 2:
Another way would be to loop through the groups and plot the curves manually:
classes = ["class 1"] * 5 + ["class 2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
p_df = pd.DataFrame({"class": classes, "vals": vals})
fig, ax = plt.subplots(figsize=(8,6))
for label, df in p_df.groupby('class'):
    df.vals.plot(kind="kde", ax=ax, label=label)
plt.legend()
This way you can easily control the legend. This is the result:
import matplotlib.pyplot as plt
p_df.groupby('class').plot(kind='kde', ax=plt.gca())
Another approach is to use the seaborn module. This plots the two density estimates on the same axes without needing a variable to hold the axes, as follows (using some of the data frame setup from the other answer):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# data to create an example data frame
classes = ["c1"] * 5 + ["c2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
# the data frame
df = pd.DataFrame({"cls": classes, "vals": vals})
# this is to plot the kde
sns.kdeplot(df.vals[df.cls == "c1"],label='c1');
sns.kdeplot(df.vals[df.cls == "c2"],label='c2');
# beautifying the labels
plt.xlabel('value')
plt.ylabel('density')
plt.show()
This results in the following image.
There are two easy methods to plot each group in the same plot.
When using pandas.DataFrame.groupby, specify the column to be plotted (e.g. the aggregation column).
Use seaborn.kdeplot or seaborn.displot and specify the hue parameter.
Using pandas v1.2.4, matplotlib 3.4.2, seaborn 0.11.1
The OP is specific to plotting the kde, but the steps are the same for many plot types (e.g. kind='line', sns.lineplot, etc.).
Imports and Sample Data
For the sample data, the groups are in the 'kind' column, and the kde of 'duration' will be plotted, ignoring 'waiting'.
import pandas as pd
import seaborn as sns
df = sns.load_dataset('geyser')
# display(df.head())
   duration  waiting   kind
0     3.600       79   long
1     1.800       54  short
2     3.333       74   long
3     2.283       62  short
4     4.533       85   long
Plot with pandas.DataFrame.plot
Reshape the data using .groupby or .pivot
.groupby
Specify the aggregation column, ['duration'], and kind='kde'.
ax = df.groupby('kind')['duration'].plot(kind='kde', legend=True)
.pivot
ax = df.pivot(columns='kind', values='duration').plot(kind='kde')
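To see why the .pivot route puts both groups on one plot, it may help to look at the reshaped frame: each group becomes its own column, and .plot draws one curve per column. A small sketch with a made-up frame in the same long format (the names long_df and wide_df are illustrative only):

```python
import pandas as pd

# small made-up frame in the same long format as the geyser data
long_df = pd.DataFrame({
    "duration": [3.6, 1.8, 3.3, 2.3],
    "kind": ["long", "short", "long", "short"],
})

# one column per group; rows belonging to the other group are NaN
wide_df = long_df.pivot(columns="kind", values="duration")
print(wide_df)
```

The NaN entries are skipped when the density of each column is estimated, so each curve only uses the rows of its own group.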
Plot with seaborn.kdeplot
Specify hue='kind'
ax = sns.kdeplot(data=df, x='duration', hue='kind')
Plot with seaborn.displot
Specify hue='kind' and kind='kde'
fig = sns.displot(data=df, kind='kde', x='duration', hue='kind')
Plot
Maybe you can try this:
fig, ax = plt.subplots(figsize=(10,8))
# 'class' is a reserved word in Python, so use bracket access instead of df.class
classes = list(df['class'].unique())
for c in classes:
    df2 = df.loc[df['class'] == c]
    df2.vals.plot(kind="kde", ax=ax, label=c)
plt.legend()
I want to process "burst" data: a time series with bursts. The data can be pretty noisy. I am really only interested in the burst duration, but my burst detection algorithm only works if the data has no slope. My question is: how do I find a linear slope for this type of data without doing it manually? My main problem is that there can be bursts which extend past either end of my time (x) axis. Otherwise I could probably just take the mean of the first and last 20 data points and fit a linear function.
Basically, I want to find the red line in the following picture and subtract it. I guess a linear regression through either the bursts or the sloped baseline would do the trick, but I am somehow stuck.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams,pylab
from matplotlib.patches import Rectangle
import random
#set plot properties
sns.set_style("white")
rcParams['font.size'] = 14
FIG_SIZE=(12, 15)
# Simulate data
timepoints = 4000
r = pd.Series(np.floor(np.ones(timepoints)*20 + np.random.normal(scale=10,
size=timepoints))) #target events
r[r<0] = 0 #set negative values to 0
# #add some bursts to the data
heights = [35,45,50,55,40,60,70]
starts =[100,300,700,950,1200,1800,2100,2550,2800,3100,3500,3800]
ends=[200,400,800,1100,1500,1900,2400,2625,2950,3350,3700,4000]
for x,y in zip(starts,ends):
    r[x:y] = r[x:y] + random.choice(heights) #+ np.random.normal(scale=10, size=200)
# add linear slope to data
slope=0.02
linear_slope=np.arange(timepoints) *slope
burst_data_with_slope= r + linear_slope
#Fig setup
fig, (ax1, ax2) = plt.subplots(2, figsize=FIG_SIZE,sharey=False)
#
ax1.set_ylabel('proportion of target events', size=14)
ax1.set_xlabel('time (sec)', size=14)
ax2.set_xlabel('time (sec)', size=14)
ax2.set_ylabel('proportion of target events', size=14)
ax1.set_xlim([0, timepoints])
ax2.set_xlim([0, timepoints])
ax1.plot(burst_data_with_slope, color='#00bbcc', linewidth=1)
ax1.set_title('Original Data', size=14)
ax2.plot(burst_data_with_slope, color='#00bbcc', linewidth=1)
ax2.plot(linear_slope, color='red', linewidth=1)
ax2.set_title('Original Data subtracted slope', size=14)
# Finally plot
plt.subplots_adjust(hspace=0.5)
plt.show()
Thanks in advance !
gh_data = ascii.read('http://dept.astro.lsa.umich.edu/~ericbell/data/GHOSTS/M81/ngc3031- field15.newphoto_radec')
ra = gh_data['col5'][:]
dec = gh_data['col6'][:]
f606 = gh_data['col3'][:]
f814 = gh_data['col4'][:]
plot(f606-f814, f814, 'bo', alpha=0.15)
axis([-1,2.5,27,23])
xlabel('F606W-F814W')
ylabel('F814W')
title('Field 14')
The data set is imported and organized into different columns. I am trying to overlay a line of best fit (a linear regression) on the scatterplot created, but I cannot figure out how. Thanks in advance.
As @rayryeng pointed out, your code just plots the data, but doesn't actually compute any regression results to plot. Here's one way of doing it:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.DataFrame({"y": range(1,11)+np.random.rand(10),
"x": range(1,11)+np.random.rand(10)})
Use statsmodels OLS method to fit a regression line, and params to extract the coefficient on the single regressor:
beta_1 = sm.OLS(data.y, data.x).fit().params.iloc[0]  # scalar slope; params is a pandas Series
Produce a scatterplot and add a regression line:
fig, ax = plt.subplots()
ax.scatter(data.x, data.y)
ax.plot(range(1,11), [i*beta_1 for i in range(1,11)], label = "best fit")
ax.legend(loc="best")
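Note that sm.OLS(data.y, data.x) fits a line through the origin: statsmodels does not add an intercept unless the design matrix contains a constant column (sm.add_constant is the usual way to add one). As a dependency-light sketch of a fit that includes an intercept, the following uses np.polyfit instead of statsmodels; the data are made up and noiseless so the result is easy to check.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1, 11) + rng.random(10)
y = 3.0 + 2.0 * x  # made-up noiseless line: intercept 3, slope 2

# degree-1 polynomial fit; coefficients are returned highest power first
slope, intercept = np.polyfit(x, y, 1)

# points along the fitted line, ready to pass to ax.plot(x_line, y_line)
x_line = np.linspace(x.min(), x.max(), 50)
y_line = slope * x_line + intercept
print(f"y = {slope:.2f}x + {intercept:.2f}")
```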
I am trying to make QQ-plots using the statsmodels package. However, the resolution of the figure is so low that I could not possibly use the results in a presentation.
I know that to make networkX graph plot a higher resolution image I can use:
plt.figure( figsize=(N,M) )
networkx.draw(G)
and change the values of N and M to attain desirable results.
However, when I try the same method with a QQ-plot from the statsmodels package, it seems to have no impact on the size of the resulting figure, i.e., when I use
plt.Figure( figsize = (N,M) )
statsmodels.qqplot_2samples(sample1, sample2, line = 'r')
changing M and N has no effect on the figure size. Any ideas on how to fix this (and why this method isn't working)?
You can use mpl.rc_context to temporarily set the default figsize before plotting.
import numpy as np
import matplotlib as mpl
from statsmodels.graphics.gofplots import qqplot_2samples
np.random.seed(10)
sample1 = np.random.rand(10)
sample2 = np.random.rand(10)
n, m = 6, 6
with mpl.rc_context():
    mpl.rc("figure", figsize=(n,m))
    qqplot_2samples(sample1, sample2, line = 'r')
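As a small variation, mpl.rc_context also accepts a dict of settings directly, which avoids the separate mpl.rc call. The sketch below uses only matplotlib (no statsmodels) to show that a figure created inside the block picks up the temporary size and that the default is restored afterwards; the variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib as mpl
import matplotlib.pyplot as plt

default_size = list(mpl.rcParams["figure.figsize"])

with mpl.rc_context({"figure.figsize": (6, 6)}):
    fig = plt.figure()  # any figure created here uses the temporary default
    size_inside = list(fig.get_size_inches())

size_after = list(mpl.rcParams["figure.figsize"])  # restored on exit
plt.close(fig)
print(size_inside, size_after)
```

Any plotting function that creates its own figure inside the with block, such as qqplot_2samples, should pick up the same temporary size.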
This is a great solution and works for other plots too - I upvoted it. Here is the implementation for acf and pacf plots.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf
N, M = 12, 6
fig, ax = plt.subplots(figsize=(N, M))
plot_pacf(df2, lags = 40, title='Daily Female Births', ax=ax)
plt.show()
The qqplot_2samples function has an ax parameter which allows you to specify
a matplotlib axes object on which the plot should be drawn. If you don't supply
the ax, then a new axes object is created for you.
So, as an alternative to cel's solution, if you wish to create your own figure,
then you should also pass the figure's axes object to qqplot_2samples:
sm.qqplot_2samples(sample1, sample2, line='r', ax=ax)
For example,
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
N, M = 6, 5
fig, ax = plt.subplots(figsize=(N, M))
sample1 = stats.norm.rvs(5, size=1000)
sample2 = stats.norm.rvs(10, size=1000)
sm.qqplot_2samples(sample1, sample2, line='r', ax=ax)
plt.show()
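On the "why" part of the question: as far as I can tell, plt.Figure (capital F) is just the Figure class, so calling it builds a figure object that pyplot never registers as the current figure, and a plotting function that creates its own figure will ignore it. A minimal matplotlib-only sketch of the difference (variable names are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

plt.close("all")

orphan = plt.Figure(figsize=(12, 8))     # plain Figure object, unknown to pyplot
n_after_class = len(plt.get_fignums())   # still no pyplot-managed figures

fig, ax = plt.subplots(figsize=(12, 8))  # created and registered by pyplot
n_after_subplots = len(plt.get_fignums())
width, height = fig.get_size_inches()
plt.close(fig)
print(n_after_class, n_after_subplots, width, height)
```

Passing the ax created by plt.subplots to the plotting function, as in the answer above, is what ties the plot to the figure with the requested size.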
Just use plt.rc("figure", figsize=(16,8)) before plotting.
Check this link here.
I used plt.rc()
plt.rc("figure", figsize=(10,6))
sm.graphics.tsa.plot_acf(nifty_50['close_price'], lags=36000);