Plot specific element values in matplotlib - python

I have a list as below:
freq = [29342, 28360, 26029, 21418, 20771, 18372, 18239, 18070, 17261, 17102]
I want to show the values of the n-th and m-th elements on the x-axis and draw a vertical line at each of them:
plt.plot(freq[0:1000])
For example, in the graph above, how can I show the value of the 100th element on the x-axis on the line?
I tried kneed, but it shows only one elbow. I guess it is at the 50th element? But what exactly are its x and y values?
from kneed import KneeLocator
kn = KneeLocator(list(range(0, 1000)), freq[0:1000], curve='convex', direction='decreasing')
import matplotlib.pyplot as plt
kn.plot_knee()
#plt.axvline(x=50, color='black', linewidth=2, alpha=.7)
plt.annotate(freq[50], xy=(50, freq[50]), size=10)

You might think that everybody knows this library kneed. Well, I don't know about others but I have never seen that one before (it does not even have a tag here on SO).
But their documentation is excellent (qhull take note!). So, you could do something like this:
#fake data generation
import numpy as np
x=np.linspace(1, 10, 100)
freq=x**(-1.9)
#here happens the actual plotting
from kneed import KneeLocator
import matplotlib.pyplot as plt
kn = KneeLocator(x, freq, curve='convex', direction='decreasing')
xk = kn.knee
yk = kn.knee_y
kn.plot_knee()
plt.annotate(f'Found knee at x={xk:.2f}, y={yk:.2f}', xy=(xk*1.1, yk*1.1) )
plt.show()
Sample output:
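For the original request (labelling the n-th and m-th elements and drawing a vertical line at each), a minimal sketch using only matplotlib could look like the following. The helper name mark_elements and the positions 2 and 7 are made up for illustration; the data is the short freq list from the question.

```python
import matplotlib.pyplot as plt

freq = [29342, 28360, 26029, 21418, 20771, 18372, 18239, 18070, 17261, 17102]

def mark_elements(ax, values, positions):
    """Draw a vertical line, a marker and a value label at each index."""
    labels = []
    for i in positions:
        ax.axvline(x=i, color='black', linewidth=1, alpha=.7)
        ax.plot(i, values[i], 'o', color='red')
        # offset the text a few points so it does not sit on the curve
        labels.append(ax.annotate(f'{values[i]}', xy=(i, values[i]),
                                  xytext=(5, 5), textcoords='offset points',
                                  size=10))
    return labels

fig, ax = plt.subplots()
ax.plot(freq)
labels = mark_elements(ax, freq, [2, 7])  # hypothetical n=2 and m=7
# plt.show()  # uncomment in an interactive session
```

Here x is simply the list index and y is values[x], which is also how the knee coordinates above are to be read.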

Related

Calculating the area under multiple Peaks using Python

My problem is calculating the area under the peaks in my FT-IR analysis. I usually work with Origin, but I would like to see if I get a better result working with Python. The data I'm using is linked here and the code is below. The problem I'm facing is that I don't know how to find the start and the end of each peak to calculate the area, or how to set a baseline.
I found this answered question about how to calculate the area under multiple peaks but I don't know how to implement it in my code: How to get value of area under multiple peaks
import numpy as np
from numpy import trapz
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv(r'CuCO3.csv', skiprows=5)
print(df)
Wavenumber = df.iloc[:,0]
Absorbance = df.iloc[:,1]
Wavenumber_Peak = Wavenumber.iloc[700:916] #Where the peaks start/end that i want to calculate the area
Absorbance_Peak = Absorbance.iloc[700:916] #Where the peaks start/end that i want to calculate the area
plt.figure()
plt.plot(Wavenumber_Peak, Absorbance_Peak)
plt.show()
Plot of the peaks to calculate the area:
Okay, I have quickly added the code from the other post to the beginning of yours and checked that it works. Unfortunately, the file that you linked did not work with your code, so I had to change some things at the beginning to make it work (in a very inelegant way, because I do not really know how to work with dataframes). If your local file is different and processing the file in this way does not work, just replace my beginning with yours.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import peakutils
df = pd.read_csv(r'CuCO3.csv', skiprows=5)
data = np.asarray([[float(y) for y in x[0].split(",")] for x in df.to_numpy()])
Wavenumber = np.arange(700, 916)
Absorbance = data[700:916,1]
indices = peakutils.indexes(Absorbance, thres=0.35, min_dist=0.1)
peak_values = [Absorbance[i] for i in indices]
peak_Wavenumbers = [Wavenumber[i] for i in indices]
plt.figure()
plt.scatter(peak_Wavenumbers, peak_values)
plt.plot(Wavenumber, Absorbance)
plt.show()
ixpeak = Wavenumber.searchsorted(peak_Wavenumbers)
ixmin = np.array([np.argmin(i) for i in np.split(Absorbance, ixpeak)])
ixmin[1:] += ixpeak
mins = Wavenumber[ixmin]
# split up the x and y values based on those minima
xsplit = np.split(Wavenumber, ixmin[1:-1])
ysplit = np.split(Absorbance, ixmin[1:-1])
# find the areas under each peak
areas = [np.trapz(ys, xs) for xs, ys in zip(xsplit, ysplit)]
# plotting stuff
plt.figure(figsize=(5, 7))
plt.subplots_adjust(hspace=.33)
plt.subplot(211)
plt.plot(Wavenumber, Absorbance, label='trace 0')
plt.plot(peak_Wavenumbers, Absorbance[ixpeak], '+', c='red', ms=10, label='peaks')
plt.plot(mins, Absorbance[ixmin], 'x', c='green', ms=10, label='mins')
plt.xlabel('dep')
plt.ylabel('indep')
plt.title('Example data')
plt.ylim(-.1, 1.6)
plt.legend()
plt.subplot(212)
plt.bar(np.arange(len(areas)), areas)
plt.xlabel('Peak number')
plt.ylabel('Area under peak')
plt.title('Area under the peaks of trace 0')
plt.show()
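The question also asks how to set a baseline, which the code above does not cover. One simple choice, shown here only as a sketch under that assumption, is a straight line through the first and last point of a peak segment; the area is then integrated above that line with the trapezoidal rule. The function name area_above_linear_baseline and the synthetic data are hypothetical.

```python
import numpy as np

def area_above_linear_baseline(x, y):
    """Trapezoidal area between y and a straight line joining its endpoints."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # straight baseline through (x[0], y[0]) and (x[-1], y[-1])
    baseline = y[0] + (y[-1] - y[0]) * (x - x[0]) / (x[-1] - x[0])
    corrected = y - baseline
    # trapezoidal rule written out explicitly
    return float(np.sum((corrected[1:] + corrected[:-1]) * np.diff(x) / 2.0))

# synthetic example: a triangular peak of height 1 on a sloped background
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 0.5 * x + np.array([0.0, 0.0, 1.0, 0.0, 0.0])
area = area_above_linear_baseline(x, y)
```

It could be applied to each pair in zip(xsplit, ysplit) instead of the plain trapezoidal call. Note that if the x values are stored in descending order, the sign of the result flips.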

How to get the intersection of 2 lines in a plot?

I would like to determine the intersection of two Matplotlib plots.
The input data for the first plot is stored in a CSV file that looks like this:
Time;Channel A;Channel B;Channel C;Channel D
(s);(mV);(mV);(mV);(mV)
0,00000000;-16,28006000;2,31961900;13,29508000;-0,98889020
0,00010000;-16,28006000;1,37345900;12,59309000;-1,34293700
0,00020000;-16,16408000;1,49554400;12,47711000;-1,92894600
0,00030000;-17,10414000;1,25747800;28,77549000;-1,57489900
0,00040000;-16,98205000;1,72750600;6,73299900;0,54327920
0,00050000;-16,28006000;2,31961900;12,47711000;-0,51886220
0,00060000;-16,39604000;2,31961900;12,47711000;0,54327920
0,00070000;-16,39604000;2,19753400;12,00708000;-0,04883409
0,00080000;-17,33610000;7,74020200;16,57917000;-0,28079600
0,00090000;-16,98205000;2,31961900;9,66304500;1,48333500
This is a shortened version of the CSV file. The original has a lot more data.
I got this code so far to get the FFT of Channel D:
import matplotlib.pyplot as plt
import pandas as pd
from numpy.fft import rfft, rfftfreq
a=pd.read_csv('20210629-0007.csv', sep = ';', skiprows=[1,2],usecols = [4],dtype=float, decimal=',')
dt = 1/10000
#print(a.head())
n=len(a)
#time increment in each data
acc=a.values.flatten() #to convert DataFrame to 1D array
#acc value must be in numpy array format for half way mirror calculation
fft=rfft(acc)*dt
freq=rfftfreq(n,d=dt)
FFT=abs(fft)
plt.plot(freq,FFT)
plt.axvline(x=150, color = 'red')
plt.show()
Does anybody know how to get the intersection of those two plots (the red line and the blue line at the same frequency)?
I would be very grateful for any help!
manually
This is not really a programming question, rather basic mathematics.
Here is your plot:
Let's call (x1,y1) and (x2,y2) the first two points of your blue line and (x,y) the coordinates of the intersection.
You have this relationship between the points: (x-x1)/(x2-x1) = (y-y1)/(y2-y1)
Thus: y=y1+(x-x1)*(y2-y1)/(x2-x1)
Which gives FFT[0]+(150-0)*(FFT[1]-FFT[0])/(freq[1]-freq[0])
Coordinates of the intersection are (150, 0.000189)
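The same linear interpolation can be sketched with np.interp, which finds the two neighbouring points automatically, so it also works when the vertical line does not fall between the first two samples. The arrays below are made-up stand-ins for freq and FFT, and value_at is a hypothetical helper name.

```python
import numpy as np

def value_at(x_line, xs, ys):
    """Linearly interpolate ys at x_line; xs must be increasing."""
    return float(np.interp(x_line, xs, ys))

# made-up stand-ins for freq and FFT
xs = np.array([0.0, 1000.0, 2000.0])
ys = np.array([0.0002, 0.0001, 0.0003])

y_cross = value_at(150, xs, ys)

# the manual formula from above, for comparison
y_manual = ys[0] + (150 - xs[0]) * (ys[1] - ys[0]) / (xs[1] - xs[0])
```

The intersection is then the point (150, y_cross), which can be drawn with plt.plot(150, y_cross, 'o').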
programmatically
You can use the pd.Series.interpolate method
import numpy as np
import pandas as pd
np.random.seed(0)
s = pd.Series(np.random.randint(0,100,20),
index=sorted(np.random.choice(range(100), 20))).sort_index()
ax = s.plot()
ax.axvline(35, color='r')
s.loc[35] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
ax.plot(35, s.sort_index().interpolate(method='index').loc[35], marker='o')

Plotting tendency line in Python

I want to plot a tendency line on top of a data plot. This must be simple but I have not been able to figure out how to get to it.
Let us say I have the following:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), columns=list('A'))
ax = sns.lineplot(data=df)
ax.set(xlabel="Index",
ylabel="Variable",
title="Sample")
plt.show()
The resulting plot is:
What I would like to add is a tendency line. Something like the red line in the following:
I thank you for any feedback.
A moving average is one method (my first thought, and already suggested).
Another method is to use a polynomial fit. Since you had 100 points in your original data, I picked a 10th order fit (square root of data length) in the example below. With some modification of your original code:
idx = [i for i in range(100)]
rnd = np.random.randint(0,100,size=100)
ser = pd.Series(rnd, idx)
fit = np.polyfit(idx, rnd, 10)
pf = np.poly1d(fit)
plt.plot(idx, rnd, 'b', idx, pf(idx), 'r')
This code provides a plot like this:
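As a variation on the polynomial fit, here is a sketch using numpy.polynomial.Polynomial.fit, which rescales the x values internally and tends to be better conditioned for high orders such as 10. The seeded generator and the variable names are assumptions for reproducibility, not part of the original answer.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)          # seeded only for reproducibility
idx = np.arange(100)
rnd = rng.integers(0, 100, size=100)

# 10th-order least-squares fit, as in the answer above
trend = Polynomial.fit(idx, rnd, deg=10)
trend_values = trend(idx)

# sanity check on data with a known shape: an exact quadratic is recovered
quad = 2.0 * idx**2 + 1.0
quad_fit = Polynomial.fit(idx, quad, deg=2)
```

The trend can then be drawn exactly as before with plt.plot(idx, rnd, 'b', idx, trend_values, 'r').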
You can do something like this using Rolling Average:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), columns=list('A'))
df["rolling_avg"] = df.A.rolling(7).mean().shift(-3)
sns.lineplot(data=df)
plt.show()
You could also do a Regression plot to analyse how data can be interpolated using:
ax = sns.regplot(x=df.index, y="A",
data=df,
scatter_kws={"s": 10},
order=10,
ci=None)

Equivalent to sns.distplot(data, fit=scipy.stats.norm) with plotly

For a while, I've been using both seaborn and plotly for visualization, depending on my needs at the moment. Lately, I've been trying to move completely to plotly, but there are still things that I can't figure out how to do.
For example, I used to use seaborn to check the distribution of some data, to see how well it fitted to the gaussian distribution. This can be easily done with the following snippet:
import seaborn as sns
from scipy.stats import norm
sns.distplot(data, fit=norm)
I've been trying to achieve some similar quick gaussian check with plotly express (px.histogram to be more specific), but I can't get it done. Could you please help me with this matter?
EDIT
An example for "data" would be:
import numpy as np
np.random.seed(123)
data = np.random.noncentral_chisquare(3, 20, 1000)
The output should show data histogram with its KDE, plus a gaussian equivalent KDE. This is helpful when testing transformations results (log, box-cox...)
I think you may be interested in reading this. Apparently, at the moment the easiest way is to use plotly.figure_factory.create_distplot, but from the link above it looks like it is going to be discontinued.
import numpy as np
import plotly.figure_factory as ff
np.random.seed(123)
data = np.random.noncentral_chisquare(3, 20, 1000)
m = data.mean()
s = data.std()
gaussian_data = np.random.normal(m, s, 10000)
fig = ff.create_distplot(
[data, gaussian_data],
group_labels=["plot", "gaussian"],
curve_type="kde")
fig.data = [fig.data[0], fig.data[2], fig.data[3]]
fig.update_layout(showlegend=False)
fig.show()
And if instead of fig.data = ... you use
lst = list(fig.data)
lst.pop(1)
fig.data = tuple(lst)
you'll get

Plotting histogram with weighted bell curve

I'm plotting a histogram with a bell curve and I'm running into problems with the bell curve part. Basically, my data consists of 3 columns, an ITEM_TYPE, QTY, and WIDTH. The data of my histogram needs to account for the quantity column, and I have no problem doing that, however, when I try to do the same for the bell curve, I'm not sure how to best go about it. View my code below:
import pandas as pd
import matplotlib.pylab as plt
import matplotlib.ticker as mtick
import numpy as np
from scipy import stats
import seaborn as sns
import statsmodels.api as sm
df = pd.read_csv('Size_Overview.csv')
df2 = df[df['ITEM_TYPE'] == 'Fixed Window']
weighted = sm.nonparametric.KDEUnivariate(df2['WIDTH'])
weighted.fit(fft=False, weights=df2['QTY_ORD'])
ax = plt.subplot()
ax.hist(df2['WIDTH'], bins = [0,1,2,3,4,5,6,7,8,9,10], weights=df2['QTY_ORD'])
lnspc = np.linspace(0, 10, len(df2['WIDTH']))
m, s = stats.norm.fit(df2['WIDTH'], weights=df2['QTY_ORD'])
pdf_g = stats.norm.pdf(lnspc, m, s)
plt.plot(lnspc, pdf_g, label='Norm', c='red')
ax.set_ylabel('Unit Count (% of Total)')
ax.set_xlabel('Width (in Feet)')
ax.set_title('Width Distribution (Fixed Windows)')
ax.set_xticks(np.arange(0, 11, 1.0))
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
plt.xticks(rotation=45)
plt.show()
So basically, the "weights" argument in weighted.fit and ax.hist works perfectly; however, the same argument in stats.norm.fit seems to be ignored completely. So if I have one row in my data with a really high quantity, the histogram makes the proper adjustment, but the bell curve stays exactly the same every time. It's basically just calculating the mean and standard deviation of the WIDTH column, completely ignoring the quantity.
Here's what my chart looks like:
Here's what it looks like if I add a really large quantity to an item between 1 and 2 feet:
As you can see, the histogram adjusted correctly but the bell curve stayed roughly the same. How can I make the bell curve adjust for my quantity column? All tips are appreciated, thanks in advance
Edit: Looks like nobody could help, but I figured it out anyway. Here's the workaround I came up with:
Added this code:
values = df2['WIDTH'].values
qty = df2['QTY_ORD'].astype(int)
count = qty.values
full_values = np.repeat(values, count)
and replaced:
m, s = stats.norm.fit(df2['WIDTH'], weights=df2['QTY_ORD'])
with:
m, s = stats.norm.fit(full_values)
So basically use the numpy repeat function to pass in the entire width column based on the number in the qty column. That's it!!
So now my second chart looks like this:
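A sketch of an equivalent approach that avoids building the repeated array: for a normal distribution, the fit on the repeated values is just the quantity-weighted mean and the quantity-weighted (population) standard deviation, which np.average can compute directly. The helper name weighted_norm_params and the sample numbers are made up for illustration.

```python
import numpy as np

def weighted_norm_params(values, weights):
    """Weighted mean and population standard deviation."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.average(values, weights=weights)
    var = np.average((values - mean) ** 2, weights=weights)
    return mean, np.sqrt(var)

# made-up widths and order quantities
widths = np.array([1.5, 2.5, 4.0, 6.5])
qty = np.array([10, 3, 1, 2])

m, s = weighted_norm_params(widths, qty)

# the np.repeat workaround from above, for comparison
full_values = np.repeat(widths, qty)
```

This also works with non-integer weights, where np.repeat cannot be used.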
