Calculating the area under multiple peaks using Python

My problem is calculating the area under the peaks in my FT-IR analysis. I usually work with Origin, but I would like to see if I get a better result working with Python. The data I'm using is linked here and the code is below. The problem I'm facing is that I don't know how to find the start and end of each peak for the area calculation, or how to set a baseline.
I found this answered question about how to calculate the area under multiple peaks but I don't know how to implement it in my code: How to get value of area under multiple peaks
import numpy as np
from numpy import trapz
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv(r'CuCO3.csv', skiprows=5)
print(df)
Wavenumber = df.iloc[:,0]
Absorbance = df.iloc[:,1]
Wavenumber_Peak = Wavenumber.iloc[700:916]  # index range containing the peaks whose area I want to calculate
Absorbance_Peak = Absorbance.iloc[700:916]  # index range containing the peaks whose area I want to calculate
plt.figure()
plt.plot(Wavenumber_Peak, Absorbance_Peak)
plt.show()
Plot of the peaks to calculate the area:

Okay, I have quickly added the code from the other post to your beginning and checked that it works. Unfortunately, the file that you linked did not work with your code, so I had to change some things at the beginning to make it work (in a very inelegant way, because I do not really know how to work with dataframes). If your local file is different and processing the file in this way does not work, then just replace my beginning with yours.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import peakutils
df = pd.read_csv(r'CuCO3.csv', skiprows=5)
# the linked file is read as a single column of comma-separated strings, so parse it manually
data = np.asarray([[float(y) for y in x[0].split(",")] for x in df.to_numpy()])
Wavenumber = np.arange(700, 916)
Absorbance = data[700:916,1]
indices = peakutils.indexes(Absorbance, thres=0.35, min_dist=0.1)
peak_values = [Absorbance[i] for i in indices]
peak_Wavenumbers = [Wavenumber[i] for i in indices]
plt.figure()
plt.scatter(peak_Wavenumbers, peak_values)
plt.plot(Wavenumber, Absorbance)
plt.show()
# indices of the detected peaks within the Wavenumber array
ixpeak = Wavenumber.searchsorted(peak_Wavenumbers)
# find the minimum between adjacent peaks; these minima act as the peak boundaries
ixmin = np.array([np.argmin(i) for i in np.split(Absorbance, ixpeak)])
ixmin[1:] += ixpeak
mins = Wavenumber[ixmin]
# split up the x and y values based on those minima
xsplit = np.split(Wavenumber, ixmin[1:-1])
ysplit = np.split(Absorbance, ixmin[1:-1])
# find the areas under each peak
areas = [np.trapz(ys, xs) for xs, ys in zip(xsplit, ysplit)]
# plotting stuff
plt.figure(figsize=(5, 7))
plt.subplots_adjust(hspace=.33)
plt.subplot(211)
plt.plot(Wavenumber, Absorbance, label='trace 0')
plt.plot(peak_Wavenumbers, Absorbance[ixpeak], '+', c='red', ms=10, label='peaks')
plt.plot(mins, Absorbance[ixmin], 'x', c='green', ms=10, label='mins')
plt.xlabel('Wavenumber')
plt.ylabel('Absorbance')
plt.title('Example data')
plt.ylim(-.1, 1.6)
plt.legend()
plt.subplot(212)
plt.bar(np.arange(len(areas)), areas)
plt.xlabel('Peak number')
plt.ylabel('Area under peak')
plt.title('Area under the peaks of trace 0')
plt.show()
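The question also asks how to set a baseline. peakutils provides a baseline() helper that fits a low-order polynomial, which can be subtracted before integrating. A minimal sketch reusing the Wavenumber and Absorbance arrays from the code above (the polynomial degree is an assumption and would need tuning for real spectra):
# estimate and subtract a polynomial baseline before integrating the peaks
baseline = peakutils.baseline(Absorbance, deg=3)
Absorbance_corrected = Absorbance - baseline
plt.figure()
plt.plot(Wavenumber, Absorbance, label='raw')
plt.plot(Wavenumber, baseline, label='baseline')
plt.plot(Wavenumber, Absorbance_corrected, label='baseline-corrected')
plt.legend()
plt.show()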

Related

How to find series of highest peaks of a repeating pattern using find_peaks() in Python?

I'm trying to determine the highest peaks of the pattern blocks in the following waveform:
Basically, I need to detect the following peaks only (highlighted):
If I use scipy.find_peaks(), it's unable to detect the appropriate peaks:
indices = find_peaks(my_waveform, prominence = 1)[0]
It ends up detecting all of the following points, which is not what I am looking for:
I can't pass distance or height thresholds to scipy.signal.find_peaks(), since many of the desired peaks at either extreme are lower than the undesired peaks in the middle.
Note: I had already de-trended the waveform to help with this, as you can see in the snapshot above, but it still doesn't give the right results.
So can anyone help with a correct way to tackle this?
Here's the code to fully reproduce the dataset I've shown ("autocorr" is the final waveform of interest):
import json
import sys, os
import numpy as np
import pandas as pd
import glob
import pickle
from statsmodels.tsa.stattools import adfuller, acf, pacf
from scipy.signal import find_peaks, square
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
#GENERATION OF A FUNCTION WITH DUAL SEASONALITY & NOISE
def white_noise(mu, sigma, num_pts):
    """Generate Gaussian noise.

    Args:
        mu: mean value
        sigma: standard deviation
        num_pts: number of points
    Returns:
        generated Gaussian noise
    """
    noise = np.random.normal(mu, sigma, num_pts)
    return noise

def signal_line_plot(input_signal: pd.Series, title: str = "", y_label: str = "Signal"):
    """Plot a time series signal.

    Args:
        input_signal: time series signal to plot
        title: plot title
        y_label: label of the signal being plotted
    """
    plt.plot(input_signal)
    plt.title(title)
    plt.ylabel(y_label)
    plt.show()
t_week = np.linspace(1,480, 480)
t_weekend=np.linspace(1,192,192)
T=96 #Time Period
x_weekday = 10*square(2*np.pi*t_week/T, duty=0.7)+10 + white_noise(0, 1,480)
x_weekend = 2*square(2*np.pi*t_weekend/T, duty=0.7)+2 + white_noise(0,1,192)
x_daily_weekly = np.concatenate((x_weekday, x_weekend))
x_daily_weekly_long = np.tile(x_daily_weekly, 10)  # repeat the weekly pattern 10 times
signal_line_plot(x_daily_weekly_long)
signal_line_plot(x_daily_weekly_long[0:1000])
#x_daily_weekly_long is the final waveform on which I'm carrying out Autocorrelation
#PERFORMING AUTOCORRELATION:
import scipy.signal as signal
autocorr = signal.correlate(x_daily_weekly_long, x_daily_weekly_long, mode = "same")
lags = signal.correlation_lags(len(x_daily_weekly_long), len(x_daily_weekly_long), mode = "same")
#VISUALIZATION:
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
As you have some kind of double (or even triple) periodicity in the signal, I would attempt a double smoothing:
one pass to remove the overall trend and one to remove the sharp noise.
A picture is probably better than a long explanation:
from scipy.signal import find_peaks
import pandas as pd
import numpy as np
def smooth(s, win):
    return pd.Series(s).rolling(window=win, center=True).mean().ffill().bfill()
plt.plot(lags, autocorr, label='data')
WINDOW = 100 # needs to be determined empirically
# and so are the multipliers below
# double smoothing difference + clipping
ddiff = np.clip(smooth(autocorr, 2*WINDOW)-smooth(autocorr, 10*WINDOW), 0, np.inf)
plt.plot(lags, ddiff, label='smooth+clip')
peaks = find_peaks(ddiff, width=WINDOW)[0]
plt.plot(lags[peaks], autocorr[peaks], marker='o', ls='')
plt.plot(lags[peaks], ddiff[peaks], marker='o', ls='')
plt.legend()
output:
smoothing the original signal
As is often the case in data analysis, the earlier you apply a transformation, the better. You could also clean your original signal before running the autocorrelation. Here is a quick example (using the smooth function defined above):
from scipy.signal import find_peaks
x2 = smooth(x_daily_weekly_long, 100)
autocorr2 = signal.correlate(x2, x2, mode = "same")
plt.plot(lags, autocorr2)
idx = find_peaks(autocorr2)[0]
plt.plot(lags[idx], autocorr2[idx], marker='o', ls='')
cleaned signal:
For testing purposes I used a rough reconstruction of your signal.
import numpy as np
from scipy.signal import find_peaks, square
import matplotlib.pyplot as plt
x = np.linspace(3,103,10000)
sin = np.clip(np.sin(0.6*x)-0.5,0,10)
tri = np.concatenate([np.linspace(0,0.3,5000),np.linspace(0.3,0,5000)],axis =0)
sig = np.sin(6*x-1.2)
full = sin+tri+sig
peak run #1
peaks = find_peaks(full)[0]
plt.plot(full)
plt.scatter(peaks,full[peaks], color='red', s=5)
plt.show()
peak run #2 + index reextraction (this needs the actual values from your signal)
peaks2 = find_peaks(full[peaks])[0]
index = peaks[peaks2]
plt.plot(full)
plt.scatter(index,full[index], color='red', s=5)
plt.show()
If you know the period you can do this:
w=T
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
smoothed = signal.convolve(autocorr, np.ones(w)/w, mode="same")  # moving average over one period
peaks = signal.find_peaks(smoothed)[0]
plt.scatter(lags[peaks], autocorr[peaks], color="r")
Result:
I don't know if it works in other cases.
EDIT:
Another approach is to find the maximum in a sliding window, but in this case too you must define the window size empirically.
from scipy.ndimage import maximum_filter

w = 900
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
# a point is kept if it is the maximum within a window of size w centred on it
mask = maximum_filter(autocorr, size=w) == autocorr
plt.scatter(lags[mask], autocorr[mask], color="r")
Result:

How to get the intersection of 2 lines in a plot?

I would like to determine the intersection of two Matplotlib plots.
The input data for the first plot is stored in a CSV file that looks like this:
Time;Channel A;Channel B;Channel C;Channel D
(s);(mV);(mV);(mV);(mV)
0,00000000;-16,28006000;2,31961900;13,29508000;-0,98889020
0,00010000;-16,28006000;1,37345900;12,59309000;-1,34293700
0,00020000;-16,16408000;1,49554400;12,47711000;-1,92894600
0,00030000;-17,10414000;1,25747800;28,77549000;-1,57489900
0,00040000;-16,98205000;1,72750600;6,73299900;0,54327920
0,00050000;-16,28006000;2,31961900;12,47711000;-0,51886220
0,00060000;-16,39604000;2,31961900;12,47711000;0,54327920
0,00070000;-16,39604000;2,19753400;12,00708000;-0,04883409
0,00080000;-17,33610000;7,74020200;16,57917000;-0,28079600
0,00090000;-16,98205000;2,31961900;9,66304500;1,48333500
This is a shortened version of the CSV file; the original has a lot more data.
This is the code I have so far to get the FFT of Channel D:
import matplotlib.pyplot as plt
import pandas as pd
from numpy.fft import rfft, rfftfreq
a=pd.read_csv('20210629-0007.csv', sep = ';', skiprows=[1,2],usecols = [4],dtype=float, decimal=',')
dt = 1/10000
#print(a.head())
n=len(a)
#time increment in each data
acc=a.values.flatten() #to convert DataFrame to 1D array
#acc value must be in numpy array format for half way mirror calculation
fft=rfft(acc)*dt
freq=rfftfreq(n,d=dt)
FFT=abs(fft)
plt.plot(freq,FFT)
plt.axvline(x=150, color = 'red')
plt.show()
Does anybody know how to get the intersection of those two plots (the red line and the blue line, at the same frequency)?
I would be very grateful for any help!
manually
This is not really a programming question; it's basic mathematics.
Here is your plot:
Let's call (x1,y1) and (x2,y2) the first two points of your blue line and (x,y) the coordinates of the intersection.
You have this relationship between the points: (x-x1)/(x2-x1) = (y-y1)/(y2-y1)
Thus: y=y1+(x-x1)*(y2-y1)/(x2-x1)
Which, with x1 = freq[0] = 0, gives FFT[0] + (150 - 0)*(FFT[1] - FFT[0])/(freq[1] - freq[0])
The coordinates of the intersection are (150, 0.000189).
programmatically
You can use the pd.Series.interpolate method:
import numpy as np
import pandas as pd
np.random.seed(0)
s = pd.Series(np.random.randint(0, 100, 20),
              index=sorted(np.random.choice(range(100), 20))).sort_index()
ax = s.plot()
ax.axvline(35, color='r')
s.loc[35] = np.nan
ax.plot(35, s.sort_index().interpolate(method='index').loc[35], marker='o')
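Applied to the FFT example from the question, the same idea reduces to a single linear interpolation of the spectrum at 150 Hz. A minimal sketch, assuming the freq and FFT arrays from the question's code:
# y-value of the blue FFT curve where it crosses the vertical line x = 150
x_line = 150
y_cross = np.interp(x_line, freq, FFT)  # freq is sorted, which rfftfreq guarantees
print(x_line, y_cross)
plt.plot(freq, FFT)
plt.axvline(x=x_line, color='red')
plt.plot(x_line, y_cross, marker='o', color='black')
plt.show()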

Plotting probability density function in Python

I want to plot two probability density functions (pdf) based on the values of a certain column in a dataframe: the first for all values in rows with target label 0, and the second for rows with target label 1.
My attempt is below, but as you can see the curves do not look like pdfs (the maximum value is 0, and they are not confined to the x-axis ranges 0-1 and 5-6). I assume I can get something close by playing with the bandwidth factor, but I am looking for a one-liner that just figures out the right parameters and plots a pdf (including figuring out the right x-axis range to use). Is there any such built-in function? If not, I would appreciate some pointers on how to build something like this.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.neighbors import KernelDensity
values = np.random.rand(10)
values_shift5 = np.random.rand(10) + 5
df = pd.DataFrame({'values' : values, 'label' : np.zeros(10)})
df = pd.concat([df, pd.DataFrame({'values' : values_shift5, 'label' : np.ones(10)})])
kde_label_0 = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(df[df.label == 0]['values'].values.reshape(-1, 1))
kde_label_1 = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(df[df.label == 1]['values'].values.reshape(-1, 1))
X_plot = np.linspace(0, 10, 50).reshape(-1, 1)
log_density_0 = kde_label_0.score_samples(X_plot)
log_density_1 = kde_label_1.score_samples(X_plot)
plt.plot(X_plot, log_density_0, label='Label 0')
plt.plot(X_plot, log_density_1, label='Label 1')
plt.legend()
plt.show()
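Note that KernelDensity.score_samples returns the log of the density, which is why the curves top out at 0. A minimal sketch of a fix, exponentiating before plotting; as a possible one-liner, seaborn's kdeplot picks its own grid and bandwidth (seaborn is an assumed extra dependency):
# score_samples returns log density; exponentiate to get actual pdf values
plt.plot(X_plot, np.exp(log_density_0), label='Label 0')
plt.plot(X_plot, np.exp(log_density_1), label='Label 1')
plt.legend()
plt.show()

# possible one-liner (assumes seaborn is installed):
# import seaborn as sns
# sns.kdeplot(data=df, x='values', hue='label')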

How to set a threshold when coloring and labeling scatterplot points in python

I saw a python graph that looks like the following:
I think doing something like this really puts emphasis on certain data points and takes away a lot of clutter. Using the adjustText library, I know how to label points with the following code:
from adjustText import adjust_text
texts = [plt.text(x0, y0, name, ha='right', va='bottom')
         for x0, y0, name in zip(df1.x, df1.y, df1.label)]
adjust_text(texts)
What could I add to this code to only label points that are, say, greater than 5?
Also, how could I go about coloring all data points outside of that threshold (less than 5) gray, as seen in the picture?
I've been reading documentation to no avail, so I decided to ask you all here. Thanks in advance!
EDIT: I am using a dictionary to color the points, so I'm good there. I just need to know how to turn the points that don't meet the requirement gray.
Here's my code for coloring the points:
for i in range(len(df)):
    ax.scatter(df.x.iloc[i], df.y.iloc[i], alpha=.7, color=COLORS[df.color.iloc[i]])
Calling scatter for each point isn't the most efficient. You can call scatter twice: once for data below and once for above the threshold:
threshold = 5
ix = df.y < threshold
ax.scatter(df.x[ix], df.y[ix], c='gray')
ax.scatter(df.x[~ix], df.y[~ix], c=df.color[~ix].map(COLORS))  # look up each remaining point's color in the COLORS dict
Here is an example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import cycle
colors = list('rgbcmyk')
color_cycle = cycle(colors)
np.random.seed(42)
n = 100
df = pd.DataFrame(np.random.random((n, 2)), columns=['x', 'y'])
df['colors'] = [c for c, _ in zip(color_cycle, range(n))]
ix = df.y < 0.75
fig, ax = plt.subplots()
ax.scatter(df.x[ix], df.y[ix], c='gray')
ax.scatter(df.x[~ix], df.y[~ix], c=df.colors[~ix])
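To label only the points above the threshold, the same boolean mask can be reused when building the list of texts for adjust_text. A minimal sketch, assuming df also has a label column and the threshold of 5 from the question:
from adjustText import adjust_text

threshold = 5
ix = df.y < threshold
# create text annotations only for the points at or above the threshold
texts = [plt.text(x0, y0, name, ha='right', va='bottom')
         for x0, y0, name in zip(df.x[~ix], df.y[~ix], df.label[~ix])]
adjust_text(texts)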

Is there a simple method to smooth a curve without taking into account future values and without a time shift?

I have a Unix time series (x) with an associated signal value (y) which is generated every minute, dropping the first value and appending a new one. I am trying to smooth the resulting curve without losing time accuracy, with a specific emphasis on the final value of the smoothed curve, which will be written to a database. I would like to be able to adjust the smoothing to a considerable degree.
I have studied (as a mathematical layman, more or less) all the options I could find and master. I came across Savitzky-Golay, which looked perfect until I realized it works well on past data but fails to produce a reliable final value if no future data is available for smoothing. I have tried many other methods which produced results but could not be adjusted the way Savgol can.
import pandas as pd
from bokeh.plotting import figure, show, output_file
from bokeh.layouts import column
from math import pi
from scipy.signal import savgol_filter
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from scipy.interpolate import splrep, splev
from scipy.ndimage import gaussian_filter1d
from scipy.signal import lfilter
from scipy.interpolate import UnivariateSpline
import matplotlib.pyplot as plt
df_sim = pd.read_csv("/home/20190905_Signal_Smooth_Test.csv")
#sklearn Polynomial*****************************************
poly = PolynomialFeatures(degree=4)
X = df_sim.iloc[:, 0:1].values
print(X)
y = df_sim.iloc[:, 1].values
print(y)
X_poly = poly.fit_transform(X)
poly.fit(X_poly, y)
lin2 = LinearRegression()
lin2.fit(X_poly, y)
# Visualising the Polynomial Regression results
plt.scatter(X, y, color='blue')
plt.plot(X, lin2.predict(poly.fit_transform(X)), color='red')
plt.title('Polynomial Regression')
plt.xlabel('Time')
plt.ylabel('Signal')
plt.show()
#scipy interpolate********************************************
bspl = splrep(df_sim['timestamp'], df_sim['signal'], s=5)
bspl_y = splev(df_sim['timestamp'], bspl)
df_sim['signal_spline'] = bspl_y
#scipy gaussian filter****************************************
smooth = gaussian_filter1d(df_sim['signal'], 3)
df_sim['signal_gauss'] = smooth
#scipy lfilter************************************************
n = 5  # the larger n is, the smoother the curve will be
b = [1.0 / n] * n
a = 1
histo_filter = lfilter(b, a, df_sim['signal'])
df_sim['signal_lfilter'] = histo_filter
print(df_sim)
#scipy UnivariateSpline**************************************
s = UnivariateSpline(df_sim['timestamp'], df_sim['signal'], s=5)
xs = df_sim['timestamp']
ys = s(xs)
df_sim['signal_univariante'] = ys
#scipy savgol filter****************************************
sg = savgol_filter(df_sim['signal'], 11, 3)
df_sim['signal_savgol'] = sg
df_sim['date'] = pd.to_datetime(df_sim['timestamp'], unit='s')
#plotting it all********************************************
print(df_sim)
w = 60000
TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=1000, plot_height=250,
           title="Various Signals y vs Timestamp x")
p.xaxis.major_label_orientation = pi / 4
p.grid.grid_line_alpha = 0.9
p.line(x=df_sim['date'], y=df_sim['signal'], color='green')
p.line(x=df_sim['date'], y=df_sim['signal_spline'], color='blue')
p.line(x=df_sim['date'], y=df_sim['signal_gauss'], color='red')
p.line(x=df_sim['date'], y=df_sim['signal_lfilter'], color='magenta')
p.line(x=df_sim['date'], y=df_sim['signal_univariante'], color='yellow')
p1 = figure(x_axis_type="datetime", tools=TOOLS, plot_width=1000, plot_height=250,
            title="Savgol vs Signal")
p1.xaxis.major_label_orientation = pi / 4
p1.grid.grid_line_alpha = 0.9
p1.line(x=df_sim['date'], y=df_sim['signal'], color='green')
p1.line(x=df_sim['date'], y=df_sim['signal_savgol'], color='blue')
output_file("signal.html", title="Signal Test")
show(column(p, p1)) # open a browser
I expect a result similar to Savitzky-Golay but with valid final smoothed values for the data series. None of the other methods offer the same flexibility to adjust the degree of smoothing, and most of them shift the curve to the right. I can provide the CSV file for testing.
This really depends on why you are smoothing the data. Every smoothing method has side effects, such as letting some 'noise' through more than others. Research 'phase response of filtering'.
A common technique to avoid the problem of missing data at the end of a symmetric filter is to forecast your data a few points ahead and use that. For example, if you are using a 5-term moving average filter, you will be missing 2 data points when you go to calculate your end value.
To forecast these two points, you could use the auto_arima() function from the pmdarima module, or look at the fbprophet module (which I find quite good for this kind of situation).
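A minimal sketch of that idea, using pmdarima's auto_arima to forecast past the end of the series before applying the Savitzky-Golay filter from the question's code (pmdarima is an assumed extra dependency, and the forecast horizon simply covers half the filter window):
import numpy as np
import pmdarima as pm
from scipy.signal import savgol_filter

y = df_sim['signal'].to_numpy()
window = 11              # Savitzky-Golay window from the question's code
horizon = window // 2    # future points the symmetric window needs at the series end

# fit a quick ARIMA model and forecast a few points past the end of the series
model = pm.auto_arima(y, suppress_warnings=True)
forecast = np.asarray(model.predict(n_periods=horizon))

# smooth the extended series, then keep only the original length
extended = np.concatenate([y, forecast])
smoothed = savgol_filter(extended, window_length=window, polyorder=3)[:len(y)]
df_sim['signal_savgol_forecast'] = smoothed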
