Plotting histogram with weighted bell curve - python

I'm plotting a histogram with a bell curve, and I'm running into problems with the bell curve part. My data consists of 3 columns: ITEM_TYPE, a quantity column (QTY_ORD), and WIDTH. The histogram needs to account for the quantity column, and I have no problem doing that. However, I'm not sure how best to do the same for the bell curve. My code is below:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import numpy as np
from scipy import stats
import seaborn as sns
import statsmodels.api as sm
df = pd.read_csv('Size_Overview.csv')
df2 = df[df['ITEM_TYPE'] == 'Fixed Window']
weighted = sm.nonparametric.KDEUnivariate(df2['WIDTH'])
weighted.fit(fft=False, weights=df2['QTY_ORD'])
ax = plt.subplot()
ax.hist(df2['WIDTH'], bins = [0,1,2,3,4,5,6,7,8,9,10], weights=df2['QTY_ORD'])
lnspc = np.linspace(0, 10, len(df2['WIDTH']))
m, s = stats.norm.fit(df2['WIDTH'], weights=df2['QTY_ORD'])
pdf_g = stats.norm.pdf(lnspc, m, s)
plt.plot(lnspc, pdf_g, label='Norm', c='red')
ax.set_ylabel('Unit Count (% of Total)')
ax.set_xlabel('Width (in Feet)')
ax.set_title('Width Distribution (Fixed Windows)')
ax.set_xticks(np.arange(0, 11, 1.0))
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
plt.xticks(rotation=45)
plt.show()
The "weights" argument works perfectly in weighted.fit and ax.hist, but the same argument in stats.norm.fit seems to be ignored completely. If I have 1 row in my data with a really high quantity, the histogram makes the proper adjustment, but the bell curve stays exactly the same every time. It's just calculating the mean and std of the WIDTH column, completely ignoring the quantity.
Here's what my chart looks like:
Here's what it looks like if I add a really large quantity to an item between 1 and 2 feet:
As you can see, the histogram adjusted correctly but the bell curve stayed roughly the same. How can I make the bell curve adjust for my quantity column? All tips are appreciated, thanks in advance
Edit: Looks like nobody could help, but I figured it out anyway. Here's the workaround I came up with:
Added this code:
values = df2['WIDTH'].values
qty = df2['QTY_ORD'].astype(int)
count = qty.values
full_values = np.repeat(values, count)
and replaced:
m, s = stats.norm.fit(df2['WIDTH'], weights=df2['QTY_ORD'])
with:
m, s = stats.norm.fit(full_values)
This uses the numpy repeat function to repeat each width as many times as the number in the qty column, so the fit sees one value per unit. That's it!!
So now my second chart looks like this:
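An alternative to expanding the data with np.repeat is to compute the weighted mean and standard deviation directly. This is only a sketch, assuming a plain normal fit is wanted: for a normal distribution the maximum-likelihood estimates are the mean and the population standard deviation, so the weighted versions should match what the repeat workaround produces. The function name weighted_normal_params and the sample numbers are made up for illustration.

```python
import numpy as np

def weighted_normal_params(values, weights):
    """Return (mean, std) of a normal distribution fitted to weighted data."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Weighted mean: each value counts as many times as its weight
    mean = np.average(values, weights=weights)
    # Weighted population variance (no Bessel correction)
    variance = np.average((values - mean) ** 2, weights=weights)
    return mean, np.sqrt(variance)

# Made-up numbers standing in for df2['WIDTH'] and df2['QTY_ORD']
widths = np.array([1.5, 2.0, 3.5, 4.0, 6.5])
qty = np.array([100, 3, 5, 2, 1])

m, s = weighted_normal_params(widths, qty)

# Cross-check against the np.repeat workaround
full_values = np.repeat(widths, qty)
print(m, full_values.mean())
print(s, full_values.std())
```

The resulting m and s can be passed to stats.norm.pdf exactly as before. This version also works with non-integer weights, which np.repeat cannot handle.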

Related

Calculating the area under multiple Peaks using Python

My problem is calculating the area under the peaks in my FT-IR analysis. I usually work with Origin, but I would like to see if I get a better result working with Python. The data I'm using is linked here and the code is below. The problem I'm facing is that I don't know how to find the start and the end of each peak to calculate the area, or how to set a baseline.
I found this answered question about how to calculate the area under multiple peaks but I don't know how to implement it in my code: How to get value of area under multiple peaks
import numpy as np
from numpy import trapz
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv(r'CuCO3.csv', skiprows=5)
print(df)
Wavenumber = df.iloc[:,0]
Absorbance = df.iloc[:,1]
Wavenumber_Peak = Wavenumber.iloc[700:916] #Where the peaks start/end that i want to calculate the area
Absorbance_Peak = Absorbance.iloc[700:916] #Where the peaks start/end that i want to calculate the area
plt.figure()
plt.plot(Wavenumber_Peak, Absorbance_Peak)
plt.show()
Plot of the peaks to calculate the area:
Okay, I have quickly added the code from the other post to your beginning and checked that it works. Unfortunately, the file that you linked did not work with your code, so I had to change some things at the beginning to make it work (in a very inelegant way, because I do not really know how to work with dataframes). If your local file is different and processing the file in this way does not work, then just replace my beginning with yours.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import peakutils
df = pd.read_csv(r'CuCO3.csv', skiprows=5)
data = np.asarray([[float(y) for y in x[0].split(",")] for x in df.to_numpy()])
Wavenumber = np.arange(700, 916)
Absorbance = data[700:916,1]
indices = peakutils.indexes(Absorbance, thres=0.35, min_dist=0.1)
peak_values = [Absorbance[i] for i in indices]
peak_Wavenumbers = [Wavenumber[i] for i in indices]
plt.figure()
plt.scatter(peak_Wavenumbers, peak_values)
plt.plot(Wavenumber, Absorbance)
plt.show()
ixpeak = Wavenumber.searchsorted(peak_Wavenumbers)
ixmin = np.array([np.argmin(i) for i in np.split(Absorbance, ixpeak)])
ixmin[1:] += ixpeak
mins = Wavenumber[ixmin]
# split up the x and y values based on those minima
xsplit = np.split(Wavenumber, ixmin[1:-1])
ysplit = np.split(Absorbance, ixmin[1:-1])
# find the areas under each peak
areas = [np.trapz(ys, xs) for xs, ys in zip(xsplit, ysplit)]
# plotting stuff
plt.figure(figsize=(5, 7))
plt.subplots_adjust(hspace=.33)
plt.subplot(211)
plt.plot(Wavenumber, Absorbance, label='trace 0')
plt.plot(peak_Wavenumbers, Absorbance[ixpeak], '+', c='red', ms=10, label='peaks')
plt.plot(mins, Absorbance[ixmin], 'x', c='green', ms=10, label='mins')
plt.xlabel('Wavenumber')
plt.ylabel('Absorbance')
plt.title('Example data')
plt.ylim(-.1, 1.6)
plt.legend()
plt.subplot(212)
plt.bar(np.arange(len(areas)), areas)
plt.xlabel('Peak number')
plt.ylabel('Area under peak')
plt.title('Area under the peaks of trace 0')
plt.show()
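The code above integrates down to zero, so it does not address the baseline part of the question. One common simplification is to draw a straight line between the two edges of a peak and integrate only what lies above it. The sketch below illustrates that idea on synthetic data; the function name peak_area_above_baseline, the index limits, and the Gaussian test peak are all assumptions for illustration, and it expects x in ascending order.

```python
import numpy as np

def peak_area_above_baseline(x, y, start, stop):
    """Area between y and a straight baseline joining the points at start and stop.

    Assumes x is sorted in ascending order and start < stop.
    """
    xs = np.asarray(x[start:stop + 1], dtype=float)
    ys = np.asarray(y[start:stop + 1], dtype=float)
    # Straight line through the first and last point of the segment
    baseline = np.interp(xs, [xs[0], xs[-1]], [ys[0], ys[-1]])
    corrected = ys - baseline
    # Trapezoidal rule written out explicitly
    return float(np.sum((corrected[1:] + corrected[:-1]) / 2 * np.diff(xs)))

# Synthetic "spectrum": one Gaussian peak sitting on a sloped baseline
x = np.linspace(0, 100, 1001)
peak = 2.0 * np.exp(-0.5 * ((x - 50) / 3.0) ** 2)
y = peak + 0.01 * x + 0.5

area = peak_area_above_baseline(x, y, start=300, stop=700)
expected = 2.0 * 3.0 * np.sqrt(2 * np.pi)  # analytic area of the Gaussian
print(area, expected)
```

With real data, the start and stop indices could come from the minima found between the peaks (ixmin in the code above). If the wavenumber column is stored in descending order, it would need to be reversed first.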

Plot specific element values in matplotlib

I have a list as below:
freq = [29342, 28360, 26029, 21418, 20771, 18372, 18239, 18070, 17261, 17102]
I want to show the values of the n-th and m-th elements on the x-axis and draw a vertical line at each:
plt.plot(freq[0:1000])
For example, in the graph above, how can I show the value on the line at the 100th element on the x-axis?
I tried kneed, but it shows only one elbow. I guess it is at the 50th element? But what exactly are x and y?
from kneed import KneeLocator
kn = KneeLocator(list(range(0, 1000)), freq[0:1000], curve='convex', direction='decreasing')
import matplotlib.pyplot as plt
kn.plot_knee()
#plt.axvline(x=50, color='black', linewidth=2, alpha=.7)
plt.annotate(freq[50], xy=(50, freq[50]), size=10)
You might think that everybody knows this library kneed. Well, I don't know about others but I have never seen that one before (it does not even have a tag here on SO).
But their documentation is excellent (qhull take note!). So, you could do something like this:
#fake data generation
import numpy as np
x=np.linspace(1, 10, 100)
freq=x**(-1.9)
#here happens the actual plotting
from kneed import KneeLocator
import matplotlib.pyplot as plt
kn = KneeLocator(x, freq, curve='convex', direction='decreasing')
xk = kn.knee
yk = kn.knee_y
kn.plot_knee()
plt.annotate(f'Found knee at x={xk:.2f}, y={yk:.2f}', xy=(xk*1.1, yk*1.1) )
plt.show()
Sample output:
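For the first part of the question (marking the n-th and m-th elements), plain matplotlib may be enough without kneed. The following is a minimal sketch using synthetic data in place of the full freq list; the helper name mark_elements and the chosen indices 50 and 100 are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def mark_elements(ax, y, indices):
    """Draw a vertical line and a value label at each given index of y."""
    for i in indices:
        ax.axvline(x=i, color='black', linewidth=1, alpha=0.7)
        # Place the label 5 points up and to the right of the data point
        ax.annotate(f'{y[i]:g}', xy=(i, y[i]), xytext=(5, 5),
                    textcoords='offset points', size=10)

# Synthetic decreasing data standing in for the real freq list
freq = 30000 / np.arange(1, 1001)

fig, ax = plt.subplots()
ax.plot(freq)
mark_elements(ax, freq, [50, 100])
# plt.show()  # uncomment to display the figure
```

Because the data is plotted against its index, x is simply the position in the list and y is freq at that position.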

Pandas line graph - y-axis high values at the bottom and low values at the top (flipped 180 degrees)

I am new to pandas and just want to plot my rank against my friend's rank.
Because a lower rank is better than a higher one (#1 is better than #2),
I want the graph to rise, not fall. With the code I have, the graph falls. Please help.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
"Me" : [10,9,7,6,3,2,1],
"My friend" : [20,19,18,15,14,10,6]},
index=list(range(7)))
lines = df.plot.line()
plt.show()
So over time I gain a higher rank, but pandas draws a falling graph instead of a rising one.
I hope you understand what I mean. Thanks for your help.
Are you looking for invert_yaxis?
fig, ax = plt.subplots()
lines = df.plot.line(ax=ax)
# Flip the y-axis so that rank 1 is at the top
ax.invert_yaxis()
plt.show()
Output:

Creating a visualization with 2 Y-Axis scales

I am currently trying to plot the price of the 1080 graphics card against the price of bitcoin over time, but the scales of the Y axis are just way off. This is my code so far:
import pandas as pd
from datetime import date
import matplotlib.pyplot as plt
from matplotlib.pyplot import *
import numpy as np
GPUDATA = pd.read_csv("1080Prices.csv")
BCDATA = pd.read_csv("BitcoinPrice.csv")
date = pd.to_datetime(GPUDATA["Date"])
price = GPUDATA["Price_USD"]
date1 = pd.to_datetime(BCDATA["Date"])
price1 = BCDATA["Close"]
plot(date, price)
plot(date1, price1)
And that produces this:
The GPU prices, of course, are in blue and the price of bitcoin is in orange. I am fairly new to visualizations and I'm having a rough time finding anything online that could help me fix this issue. Some of the suggestions I found on here seem to deal with plotting data from a single datasource, but my data comes from 2 datasources.
One has entries of the GPU price in a given day, the other has the open, close, high, and low price of bitcoin in a given day. I am struggling to find a solution, any advice would be more than welcome! Thank you!
What you want to do is twin the X-axis, such that both plots will share the X-axis, but have separate Y-axes. That can be done in this way:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
GPUDATA = pd.read_csv("1080Prices.csv")
BCDATA = pd.read_csv("BitcoinPrice.csv")
gpu_dates = pd.to_datetime(GPUDATA["Date"])
gpu_prices = GPUDATA["Price_USD"]
btc_dates = pd.to_datetime(BCDATA["Date"])
btc_prices = BCDATA["Close"]
fig, ax1 = plt.subplots()
ax2 = ax1.twinx() # Create a new Axes object sharing ax1's x-axis
ax1.plot(gpu_dates, gpu_prices, color='blue')
ax2.plot(btc_dates, btc_prices, color='red')
As you have not provided sample data, I am unable to show a relevant demonstration, but this should work.

How to plot a time series plot for each party and the total votes they got each year

Here's my dataset
Newbie here.
I want to plot the total votes each party got in each year. I think a bar plot would be a good fit here, but I don't understand how to do it.
I want to do it with plotly.
The output should be something like this.
Here is a working sample for your use case:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = {'Partyname': ['Independents', 'INC','Independents','Independents','Independents'], 'Year': [1977, 1977,1980,1980,1980], "totPoll":[25168,35400,109,125,405]}
df = pd.DataFrame(data)
grpByParty = df.groupby(['Partyname'])
sumVotes = grpByParty['totPoll'].sum()
y_values = sumVotes.keys().tolist()
y_pos = np.arange(len(y_values))
votes = sumVotes.tolist()
plt.bar(y_pos, votes, align='center', alpha=0.5)
plt.xticks(y_pos, y_values)
plt.ylabel('votes')
plt.title('party wise votes ')
plt.show()
The approach taken here:
Group the data by party.
Get the sum of the total votes per party using an aggregate.
Collect the x and y values into lists.
Plot the diagram using matplotlib.pyplot.
Output will look like this.
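The sample above sums votes per party across all years, while the question asks for totals per party in each year. One way to get that shape is a pivot table with years as rows and parties as columns. This is a sketch that reuses the made-up sample data from the answer; the variable name votes_by_year is an assumption.

```python
import pandas as pd

data = {
    'Partyname': ['Independents', 'INC', 'Independents', 'Independents', 'Independents'],
    'Year': [1977, 1977, 1980, 1980, 1980],
    'totPoll': [25168, 35400, 109, 125, 405],
}
df = pd.DataFrame(data)

# One row per year, one column per party, summed votes in each cell;
# party/year combinations with no rows are filled with 0
votes_by_year = df.pivot_table(index='Year', columns='Partyname',
                               values='totPoll', aggfunc='sum', fill_value=0)
print(votes_by_year)
```

Calling votes_by_year.plot.bar() should then draw one group of bars per year with matplotlib. Since the question mentions plotly, a similar grouped chart can probably be produced with plotly.express by summing with df.groupby(['Year', 'Partyname'], as_index=False)['totPoll'].sum() and passing the result to px.bar with x='Year', y='totPoll', color='Partyname', barmode='group'.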
