Subtract linear slope from burst data after linear regression - python

I want to process "burst" data: a time series with bursts. The data can be pretty noisy. I am really only interested in the burst duration, but my burst detection algorithm only works if the data has no slope. My question is: how do I find the linear slope for this type of data without doing it manually? My main problem is that there can be bursts which extend past either end of my time (x) axis. Otherwise I could probably just take the mean of the first and last 20 data points and fit a linear function.
Basically I want to find the red line in the following picture and subtract it. I guess a linear regression through either the bursts or the sloped baseline would do the trick, but I am somehow stuck.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams,pylab
from matplotlib.patches import Rectangle
import random
#set plot properties
sns.set_style("white")
rcParams['font.size'] = 14
FIG_SIZE=(12, 15)
# Simulate data
timepoints = 4000
r = pd.Series(np.floor(np.ones(timepoints)*20 + np.random.normal(scale=10,
size=timepoints))) #target events
r[r<0] = 0 #set negative values to 0
# #add some bursts to the data
heights = [35,45,50,55,40,60,70]
starts =[100,300,700,950,1200,1800,2100,2550,2800,3100,3500,3800]
ends=[200,400,800,1100,1500,1900,2400,2625,2950,3350,3700,4000]
for x,y in zip(starts,ends):
    r[x:y] = r[x:y] + random.choice(heights) #+ np.random.normal(scale=10, size=200)
# add linear slope to data
slope=0.02
linear_slope=np.arange(timepoints) *slope
burst_data_with_slope= r + linear_slope
#Fig setup
fig, (ax1, ax2) = plt.subplots(2, figsize=FIG_SIZE,sharey=False)
#
ax1.set_ylabel('proportion of target events', size=14)
ax1.set_xlabel('time (sec)', size=14)
ax2.set_xlabel('time (sec)', size=14)
ax2.set_ylabel('proportion of target events', size=14)
ax1.set_xlim([0, timepoints])
ax2.set_xlim([0, timepoints])
ax1.plot(burst_data_with_slope, color='#00bbcc', linewidth=1)
ax1.set_title('Original Data', size=14)
ax2.plot(burst_data_with_slope, color='#00bbcc', linewidth=1)
ax2.plot(linear_slope, color='red', linewidth=1)
ax2.set_title('Original Data subtracted slope', size=14)
# Finally plot
plt.subplots_adjust(hspace=0.5)
plt.show()
Thanks in advance !
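One possible way to approach this, sketched under assumptions rather than taken from an answer in this thread: because the bursts only ever add to the signal, a low percentile taken in coarse windows tends to follow the sloped baseline rather than the bursts, even when a burst runs off the end of the axis. A straight line fitted through those per-window values then approximates the red line. The function name `estimate_linear_baseline`, the window length of 500 samples and the 5th percentile are illustrative choices, and every window has to contain at least some baseline samples for this to work.

```python
import numpy as np

def estimate_linear_baseline(y, window=500, percentile=5):
    """Fit a line through a low percentile of y taken in consecutive windows.

    Hypothetical helper: returns (slope, intercept) of the fitted line.
    """
    y = np.asarray(y, dtype=float)
    centers, lows = [], []
    for start in range(0, len(y), window):
        chunk = y[start:start + window]
        centers.append(start + (len(chunk) - 1) / 2)   # middle of this window
        lows.append(np.percentile(chunk, percentile))  # approximate baseline level
    slope, intercept = np.polyfit(centers, lows, 1)
    return slope, intercept

# Synthetic check: noisy baseline, three bursts (one touching the end), known slope
rng = np.random.default_rng(0)
n = 4000
x = np.arange(n)
signal = 20 + rng.normal(scale=5, size=n)
for start, end in [(300, 500), (1500, 1800), (3800, 4000)]:
    signal[start:end] += 40
true_slope = 0.02
sloped = signal + true_slope * x

slope_est, _ = estimate_linear_baseline(sloped)
detrended = sloped - slope_est * x  # subtract only the slope, keep the offset
```

The window needs to be longer than the longest burst; otherwise a window that lies entirely inside a burst pulls the fitted line upwards.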

Related

Calculating the area under multiple Peaks using Python

My problem is calculating the area under the peaks in my FT-IR analysis. I usually work with Origin but I would like to see if I get a better result working with Python. The data I'm using is linked here and the code is below. The problem I'm facing is, I don't know how to find the start and the end of the peak to calculate the area and how to set a Baseline.
I found this answered question about how to calculate the area under multiple peaks but I don't know how to implement it in my code: How to get value of area under multiple peaks
import numpy as np
from numpy import trapz
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv(r'CuCO3.csv', skiprows=5)
print(df)
Wavenumber = df.iloc[:,0]
Absorbance = df.iloc[:,1]
Wavenumber_Peak = Wavenumber.iloc[700:916] #Where the peaks start/end that i want to calculate the area
Absorbance_Peak = Absorbance.iloc[700:916] #Where the peaks start/end that i want to calculate the area
plt.figure()
plt.plot(Wavenumber_Peak, Absorbance_Peak)
plt.show()
Plot of the peaks to calculate the area:
Okay, I have quickly added the code from the other post to your beginning and checked that it works. Unfortunately, the file that you linked did not work with your code, so I had to change some stuff in the beginning to make it work (in a very inelegant way, because I do not really know how to work with dataframes). If your local file is different and processing the file in this way does not work, then just replace my beginning with yours.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import peakutils
df = pd.read_csv(r'CuCO3.csv', skiprows=5)
data = np.asarray([[float(y) for y in x[0].split(",")] for x in df.to_numpy()])
Wavenumber = np.arange(700, 916)
Absorbance = data[700:916,1]
indices = peakutils.indexes(Absorbance, thres=0.35, min_dist=0.1)
peak_values = [Absorbance[i] for i in indices]
peak_Wavenumbers = [Wavenumber[i] for i in indices]
plt.figure()
plt.scatter(peak_Wavenumbers, peak_values)
plt.plot(Wavenumber, Absorbance)
plt.show()
ixpeak = Wavenumber.searchsorted(peak_Wavenumbers)
ixmin = np.array([np.argmin(i) for i in np.split(Absorbance, ixpeak)])
ixmin[1:] += ixpeak
mins = Wavenumber[ixmin]
# split up the x and y values based on those minima
xsplit = np.split(Wavenumber, ixmin[1:-1])
ysplit = np.split(Absorbance, ixmin[1:-1])
# find the areas under each peak
areas = [np.trapz(ys, xs) for xs, ys in zip(xsplit, ysplit)]
# plotting stuff
plt.figure(figsize=(5, 7))
plt.subplots_adjust(hspace=.33)
plt.subplot(211)
plt.plot(Wavenumber, Absorbance, label='trace 0')
plt.plot(peak_Wavenumbers, Absorbance[ixpeak], '+', c='red', ms=10, label='peaks')
plt.plot(mins, Absorbance[ixmin], 'x', c='green', ms=10, label='mins')
plt.xlabel('dep')
plt.ylabel('indep')
plt.title('Example data')
plt.ylim(-.1, 1.6)
plt.legend()
plt.subplot(212)
plt.bar(np.arange(len(areas)), areas)
plt.xlabel('Peak number')
plt.ylabel('Area under peak')
plt.title('Area under the peaks of trace 0')
plt.show()
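The code above integrates the raw absorbance, so it does not cover the baseline part of the question. A minimal sketch of one common choice, assuming a straight baseline drawn between the first and last point of a single peak segment, could look like this. The helper name `peak_area_above_baseline` is made up for illustration, and the x values are assumed to be in ascending order.

```python
import numpy as np

def peak_area_above_baseline(x, y):
    """Area between y and a straight baseline joining its first and last points.

    Hypothetical helper; assumes x is sorted in ascending order.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    baseline = np.interp(x, [x[0], x[-1]], [y[0], y[-1]])
    corrected = y - baseline
    # trapezoidal rule written out explicitly
    return float(np.sum((corrected[1:] + corrected[:-1]) / 2 * np.diff(x)))

# Made-up test peak: triangle of height 2 and width 10 on a sloped baseline
x = np.arange(0, 11)
y = 1 + 0.1 * x + 2 * (1 - np.abs(x - 5) / 5)
area = peak_area_above_baseline(x, y)  # triangle area: 0.5 * 10 * 2 = 10
```

Each pair from `xsplit` and `ysplit` could be passed to such a function instead of integrating the raw values.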

Making an animated time-series graph that progresses through the data in real time

I am attempting to graph some heart rate data that is collected over the course of 4 minutes. The ideal graph will be an animation of the heart rate that moves in real time with the data (i.e., a 4 minute animated graph). The data looks something like this:
import pandas as pd
import random
import more_itertools as mit
data = pd.DataFrame({'time': mit.random_combination(range(708709, 987067), r=410),
'HR': [random.randint(70,110) for x in range(410)]})
The original code I tried from this very helpful article (https://towardsdatascience.com/dynamic-replay-of-time-series-data-819e27212b4b) managed to make a dynamic time series graph that was unable to stretch across the needed 4 minutes. As you can see there are inconsistent lapses when the device collected HR data and thus the original solution was not ideal. (Original solution below). Ideally I would like to avoid imputing missing time values.
Thanks for your time!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import deque
%matplotlib qt5
plt.ion()
Heartrate = 'HR'
visible = 40
dy1 = deque(np.zeros(visible), visible)
dx = deque(np.zeros(visible), visible)
#interval = np.linspace(0, data.shape[0], num=data.shape[0])
interval = data['time']
fig = plt.figure(figsize=(15,10))
ah1 = fig.add_subplot(111)
ah1.set_xlabel("Time [ms]", fontsize=14, labelpad=10)
ah1.set_ylabel("Heart Rate Last 5 Seconds", fontsize=14, labelpad=5)
l1, = ah1.plot(dx, dy1, color='rosybrown', label='HR')
ah1.legend(loc="upper right", fontsize=12, fancybox=True, framealpha=0.5)
start = 0
while start+visible <= data.shape[0]-1:
    # extend deques (both x and y axes)
    dy1.extend(data[Heartrate].iloc[start:start+visible])
    dx.extend(interval[start:start+visible])
    # update axes
    l1.set_ydata(dy1)
    l1.set_xdata(dx)
    # get mean of deques
    mdy1 = np.mean(dy1)
    # set x- and y-limits based on their mean
    dist = 20
    ah1.set_ylim(-dist+min(data['HR']), max(data['HR'])+dist) # static y-axis
    #ah1.set_ylim(-35+mdy1, 35+mdy1) # dynamic y-axis
    ah1.set_xlim(interval[start], interval[start+visible])
    # control speed of moving time-series
    start += 1
    fig.canvas.draw()
    fig.canvas.flush_events()
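The loop above advances by sample index, which is why gaps in the recording distort the playback speed. One way to avoid imputing values, sketched here as an assumption and not as a tested replacement, is to advance a clock instead and look up which samples fall inside the visible time window. The helper `visible_slice` and the example timestamps are hypothetical, and the timestamps are assumed to be sorted.

```python
import numpy as np

def visible_slice(times_ms, t_now_ms, window_ms=5000):
    """Return (lo, hi) so that times_ms[lo:hi] lies in [t_now - window, t_now]."""
    lo = np.searchsorted(times_ms, t_now_ms - window_ms, side="left")
    hi = np.searchsorted(times_ms, t_now_ms, side="right")
    return int(lo), int(hi)

times = np.array([0, 1000, 2500, 7000])  # irregular timestamps in ms (made up)
lo, hi = visible_slice(times, t_now_ms=7000)
# In a plotting loop, one could step t_now_ms in fixed increments, call
# l1.set_data(times[lo:hi], values[lo:hi]), move the x-limits to the window,
# and pause for the same increment so playback follows the recorded clock.
```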

Is there a simple method to smooth a curve without taking into account future values and without a time shift?

I have a Unix time series (x) with an associated signal value (y) which is generated every minute, dropping the first value and appending a new one. I am trying to smooth the resulting curve without losing time accuracy, with a specific emphasis on the final value of the smoothed curve, which will be written to a database. I would like to be able to adjust the smoothing to a considerable degree.
I have studied (as a mathematical layman, more or less) all the options I could find and master. I came across Savitzky-Golay, which looked perfect until I realized it works well on past data but fails to produce a reliable final value if no future data is available for smoothing. I have tried many other methods which produced results but could not be adjusted like Savgol.
import pandas as pd
from bokeh.plotting import figure, show, output_file
from bokeh.layouts import column
from math import pi
from scipy.signal import savgol_filter
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from scipy.interpolate import splrep, splev
from scipy.ndimage import gaussian_filter1d
from scipy.signal import lfilter
from scipy.interpolate import UnivariateSpline
import matplotlib.pyplot as plt
df_sim = pd.read_csv("/home/20190905_Signal_Smooth_Test.csv")
#sklearn Polynomial*****************************************
poly = PolynomialFeatures(degree=4)
X = df_sim.iloc[:, 0:1].values
print(X)
y = df_sim.iloc[:, 1].values
print(y)
X_poly = poly.fit_transform(X)
poly.fit(X_poly, y)
lin2 = LinearRegression()
lin2.fit(X_poly, y)
# Visualising the Polynomial Regression results
plt.scatter(X, y, color='blue')
plt.plot(X, lin2.predict(poly.fit_transform(X)), color='red')
plt.title('Polynomial Regression')
plt.xlabel('Time')
plt.ylabel('Signal')
plt.show()
#scipy interpolate********************************************
bspl = splrep(df_sim['timestamp'], df_sim['signal'], s=5)
bspl_y = splev(df_sim['timestamp'], bspl)
df_sim['signal_spline'] = bspl_y
#scipy gaussian filter****************************************
smooth = gaussian_filter1d(df_sim['signal'], 3)
df_sim['signal_gauss'] = smooth
#scipy lfilter************************************************
n = 5 # the larger n is, the smoother curve will be
b = [1.0 / n] * n
a = 1
histo_filter = lfilter(b, a, df_sim['signal'])
df_sim['signal_lfilter'] = histo_filter
print(df_sim)
#scipy UnivariateSpline**************************************
s = UnivariateSpline(df_sim['timestamp'], df_sim['signal'], s=5)
xs = df_sim['timestamp']
ys = s(xs)
df_sim['signal_univariante'] = ys
#scipy savgol filter****************************************
sg = savgol_filter(df_sim['signal'], 11, 3)
df_sim['signal_savgol'] = sg
df_sim['date'] = pd.to_datetime(df_sim['timestamp'], unit='s')
#plotting it all********************************************
print(df_sim)
w = 60000
TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=1000, plot_height=250,
title=f"Various Signals y vs Timestamp x")
p.xaxis.major_label_orientation = pi / 4
p.grid.grid_line_alpha = 0.9
p.line(x=df_sim['date'], y=df_sim['signal'], color='green')
p.line(x=df_sim['date'], y=df_sim['signal_spline'], color='blue')
p.line(x=df_sim['date'], y=df_sim['signal_gauss'], color='red')
p.line(x=df_sim['date'], y=df_sim['signal_lfilter'], color='magenta')
p.line(x=df_sim['date'], y=df_sim['signal_univariante'], color='yellow')
p1 = figure(x_axis_type="datetime", tools=TOOLS, plot_width=1000, plot_height=250,
title=f"Savgol vs Signal")
p1.xaxis.major_label_orientation = pi / 4
p1.grid.grid_line_alpha = 0.9
p1.line(x=df_sim['date'], y=df_sim['signal'], color='green')
p1.line(x=df_sim['date'], y=df_sim['signal_savgol'], color='blue')
output_file("signal.html", title="Signal Test")
show(column(p, p1)) # open a browser
I expect a result that is similar to Savitzky-Golay but with valid final smoothed values for the data series. None of the other methods offer the same flexibility to adjust the degree of smoothing. Most other methods shift the curve to the right. I can provide the CSV file for testing.
This really depends on why you are smoothing the data. Every smoothing method has side effects, such as letting some 'noise' through more than others. Research 'phase response of filtering'.
A common technique to avoid the problem of missing data at the end of a symmetric filter is to just forecast your data a few points ahead and use that. For example, if you are using a 5-term moving average filter you will be missing 2 data points when you go to calculate your end value.
To forecast these two points, you could use the auto_arima() function from the pmdarima module, or look at the fbprophet module (which I find quite good for this kind of situation).
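To illustrate the forecast-then-filter idea without any forecasting package, here is a rough sketch that swaps in a plain straight-line extrapolation of the last few samples as the "forecast". That is a much cruder stand-in for auto_arima or a similar model. The function name `smooth_with_forecast` and its parameters are illustrative, and the window length is assumed to be odd.

```python
import numpy as np

def smooth_with_forecast(y, window=5, fit_points=10):
    """Centered moving average whose right edge is padded with forecast values.

    Hypothetical helper; assumes an odd window and len(y) >= fit_points.
    """
    y = np.asarray(y, dtype=float)
    half = window // 2
    n = len(y)
    # Stand-in forecast: extend a straight line fitted to the last fit_points samples
    recent_x = np.arange(n - fit_points, n)
    coeffs = np.polyfit(recent_x, y[-fit_points:], 1)
    forecast = np.polyval(coeffs, np.arange(n, n + half))
    # Repeat the first sample at the front so the output keeps the input length
    padded = np.concatenate([np.full(half, y[0]), y, forecast])
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

y = 2.0 * np.arange(20) + 1.0  # made-up, perfectly linear signal
smoothed = smooth_with_forecast(y)
```

On perfectly linear data the last smoothed value equals the last raw value, so the filter itself adds no time shift at the end. On real data, the quality of the final value depends entirely on how good the forecast is.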

Seaborn: annotate the linear regression equation

I tried fitting an OLS for Boston data set. My graph looks like below.
How to annotate the linear regression equation just above the line or somewhere in the graph? How do I print the equation in Python?
I am fairly new to this area. Exploring python as of now. If somebody can help me, it would speed up my learning curve.
Many thanks!
I tried this as well.
My problem is - how to annotate the above in the graph in equation format?
You can use coefficients of linear fit to make a legend like in this example:
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
tips = sns.load_dataset("tips")
# get coeffs of linear fit
slope, intercept, r_value, p_value, std_err = stats.linregress(tips['total_bill'],tips['tip'])
# use line_kws to set line label for legend
ax = sns.regplot(x="total_bill", y="tip", data=tips, color='b',
line_kws={'label':"y={0:.1f}x+{1:.1f}".format(slope,intercept)})
# plot legend
ax.legend()
plt.show()
If you use a more complex fitting function, you can use LaTeX notation: https://matplotlib.org/users/usetex.html
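One small detail with the format string used for the label: when the intercept is negative, `"y={0:.1f}x+{1:.1f}"` renders as something like `y=0.1x+-0.9`. A minimal sketch of a formatter that handles the sign is shown here; the helper name `format_line_equation` is made up for illustration.

```python
def format_line_equation(slope, intercept, decimals=2):
    """Build a label such as 'y = 1.25x - 0.90' from fitted coefficients."""
    sign = "-" if intercept < 0 else "+"
    return f"y = {slope:.{decimals}f}x {sign} {abs(intercept):.{decimals}f}"

label = format_line_equation(1.25, -0.9)
```

The resulting string can be passed as the `label` entry of `line_kws` or to `ax.text`.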
To annotate multiple linear regression lines in the case of using seaborn lmplot you can do the following.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
df = pd.read_excel('data.xlsx')
# assume some random columns called EAV and PAV in your DataFrame
# assume a third variable used for grouping called "Mammal" which will be used for color coding
p = sns.lmplot(x='EAV', y='PAV',
               data=df, hue='Mammal',
               line_kws={'label':"Linear Reg"}, legend=True)
ax = p.axes[0, 0]
ax.legend()
leg = ax.get_legend()
L_labels = leg.get_texts()
# assuming you computed r_squared which is the coefficient of determination somewhere else
slope, intercept, r_value, p_value, std_err = stats.linregress(df['EAV'],df['PAV'])
label_line_1 = r'$y={0:.1f}x+{1:.1f}$'.format(slope,intercept)
label_line_2 = r'$R^2:{0:.2f}$'.format(0.21) # as an example, or whatever value you want
L_labels[0].set_text(label_line_1)
L_labels[1].set_text(label_line_2)
Result:
Simpler syntax.. same result.
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
slope, intercept, r_value, pv, se = stats.linregress(df['alcohol'],df['magnesium'])
sns.regplot(x="alcohol", y="magnesium", data=df,
ci=None, label="y={0:.1f}x+{1:.1f}".format(slope, intercept)).legend(loc="best")
I extended the solution by @RMS to work for a multi-panel lmplot example (using data from a sleep-deprivation study (Belenky et al., J Sleep Res 2003) available in pydataset). This allows one to have axis-specific legends/labels without having to use, e.g., regplot and plt.subplots.
Edit: Added second method using the map_dataframe() method from FacetGrid(), as suggested in the answer by Marcos here.
import numpy as np
import scipy as sp
import pandas as pd
import seaborn as sns
import pydataset as pds
import matplotlib.pyplot as plt
# use seaborn theme
sns.set_theme(color_codes=True)
# Load data from sleep deprivation study (Belenky et al, J Sleep Res 2003)
# ['Reaction', 'Days', 'Subject'] = [reaction time (ms), deprivation time, Subj. No.]
df = pds.data("sleepstudy")
# convert integer label to string
df['Subject'] = df['Subject'].apply(str)
# perform linear regressions outside of seaborn to get parameters
subjects = np.unique(df['Subject'].to_numpy())
fit_str = []
for s in subjects:
    ddf = df[df['Subject'] == s]
    m, b, r_value, p_value, std_err = \
        sp.stats.linregress(ddf['Days'],ddf['Reaction'])
    fs = f"y = {m:.2f} x + {b:.1f}"
    fit_str.append(fs)
method_one = False
method_two = True
if method_one:
    # Access legend on each axis to write equation
    #
    # Create 18 panel lmplot with seaborn
    g = sns.lmplot(x="Days", y="Reaction", col="Subject",
                   col_wrap=6, height=2.5, data=df,
                   line_kws={'label':"Linear Reg"}, legend=True)
    # write string with fit result into legend string of each axis
    axes = g.axes # 18 element list of axes objects
    i=0
    for ax in axes:
        ax.legend() # create legend on axis
        leg = ax.get_legend()
        leg_labels = leg.get_texts()
        leg_labels[0].set_text(fit_str[i])
        i += 1
elif method_two:
    # use the .map_dataframe () method from FacetGrid() to annotate plot
    # https://stackoverflow.com/questions/25579227 (answer by @Marcos)
    #
    # Create 18 panel lmplot with seaborn
    g = sns.lmplot(x="Days", y="Reaction", col="Subject",
                   col_wrap=6, height=2.5, data=df)
    def annotate(data, **kws):
        m, b, r_value, p_value, std_err = \
            sp.stats.linregress(data['Days'],data['Reaction'])
        ax = plt.gca()
        ax.text(0.5, 0.9, f"y = {m:.2f} x + {b:.1f}",
                horizontalalignment='center',
                verticalalignment='center',
                transform=ax.transAxes)
    g.map_dataframe(annotate)
# write figure to pdf
plt.savefig("sleepstudy_data_w-fits.pdf")
Output (Method 1):
Output (Method 2):
Update 2022-05-11: Unrelated to the plotting techniques, it turns out that this interpretation of the data (and that provided, e.g., in the original R repository) is incorrect. See the reported issue here. Fits should be done to days 2-9, corresponding to zero to seven days of sleep deprivation (3h sleep per night). The first three data points correspond to training and baseline days (all with 8h sleep per night).

Overlay Linear Regression Line on Scatter Plot (iPython Notebook)

gh_data = ascii.read('http://dept.astro.lsa.umich.edu/~ericbell/data/GHOSTS/M81/ngc3031- field15.newphoto_radec')
ra = gh_data['col5'][:]
dec = gh_data['col6'][:]
f606 = gh_data['col3'][:]
f814 = gh_data['col4'][:]
plot(f606-f814,f814, 'bo', alpha=0.15)
axis([-1,2.5,27,23])
xlabel('F606W-F814W')
ylabel('F814W')
title('Field 14')
The data set is imported and organized into different columns, I am trying to overlay a line of best fit, or linear regression over the scatterplot created, but I cannot figure out how. Thanks in advance.
As @rayryeng pointed out, your code just plots the data but doesn't actually compute any regression results to plot. Here's one way of doing it:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.DataFrame({"y": range(1,11)+np.random.rand(10),
"x": range(1,11)+np.random.rand(10)})
Use statsmodels OLS method to fit a regression line, and params to extract the coefficient on the single regressor:
beta_1 = sm.OLS(data.y, data.x).fit().params
Produce a scatterplot and add a regression line:
fig, ax = plt.subplots()
ax.scatter(data.x, data.y)
ax.plot(range(1,11), [i*beta_1 for i in range(1,11)], label = "best fit")
ax.legend(loc="best")
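Note that `sm.OLS(data.y, data.x)` as written contains no constant term, so the fitted line is forced through the origin. If an intercept is wanted, a minimal alternative sketch using only NumPy is a degree-1 `np.polyfit`. The data below is made up and lies exactly on a line so the result is easy to check.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y = 3.0 * x + 2.0                       # made-up data lying exactly on a line
slope, intercept = np.polyfit(x, y, 1)  # degree-1 fit: highest power first
y_fit = slope * x + intercept           # values to pass to ax.plot(x, y_fit)
```

For the scatter plot in the question, the same call would take the colour `f606-f814` as x and `f814` as y, and `y_fit` would be drawn on top of the existing points.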
