plot the average of every x value - python

I plot a function which is based on the results of a curve fit I did in the query. Now I want to see how the curve fit actually fits the average values for every x value. I treid it with a for loop and a groupby.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
plt.style.use('seaborn-colorblind')
x = dataset['mrwSmpVWi']
c = dataset['c']
a = dataset['a']
b = dataset['b']
Snr = dataset['Seriennummer']
dataset["y"] = (c / (1 + (a) * np.exp(-b*(x))))
for number in dataset.groupby('mrwSmpVWi'):
dataset['m'] = dataset['mrwSmpP'].mean()
fig, ax = plt.subplots(figsize=(30,15))
for name, group in dataset.groupby('Seriennummer'):
group.plot(x="mrwSmpVWi", y="m", ax=ax, marker='o', linestyle='', ms=12, label =name)
group.plot(x="mrwSmpVWi", y="y", ax=ax, label =name)
plt.show()
The dataset with the values is huge and not sorted by mrwSmpVWi.
Has someone an idea why I only get a straight line for my average values?

You got to take a look at what you're doing with this line:
for number in dataset.groupby('mrwSmpVWi'):
dataset['m'] = dataset['mrwSmpP'].mean()
You probably want:
dataset['m'] = dataset.groupby('Seriennummer')['mrwSmpVWi'].transform('mean')
(assuming you were intending to calculate the mean of each group of Serienummer)

Related

Create a Seaborn style histogram / kernel density plot using the actual density function

I really like to the look of Seaborn's KDE plot:
I was wondering how can I replicate this for line plot.
In my case I actually have the function to generate the density instead of samples of the data.
So assuming I have the data in a data frame:
x - The value of x per sample.
y - The value of the density function at y.
μσ - Categorical variable to group data from the same density (In the code, I use the mean and standard deviation of a normal distribution).
I can use Seaborn's lineplot to get what I want without the area below the curve as in the image above.
I'm after achieving the look as above for the data I have.
Is there a way to replicate this theme, area under the curve included, for lineplot?
The code below shows what I got so far:
import numpy as np
import scipy as sp
import pandas as pd
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
num_grid_pts = 1000
val_μ = [0, -1, 1, 0]
val_σ = [1, 2, 3, 4]
num_var = len(val_μ) # variations
x = np.linspace(-10, 10, num_grid_pts)
P = np.zeros((num_grid_pts, num_var)) # PDF
μσ = [f'μ = {μ}, σ = {σ}' for μ, σ in zip(val_μ, val_σ)]
for ii, (μ, σ) in enumerate(zip(val_μ, val_σ)):
randVar = norm(μ, σ)
P[:, ii] = randVar.pdf(x)
df_P = pd.DataFrame(data = {'x': np.tile(x, num_var), 'PDF': P.flatten('F'), 'μσ': np.repeat(μσ, len(x))})
f, ax = plt.subplots(figsize=(15, 10))
sns.lineplot(data=df_P, x='x', y='PDF', hue='μσ', ax=ax)
plot_lines = ax.get_lines()
for ii in range(num_var):
ax.fill_between(x=plot_lines[ii].get_xdata(), y1=plot_lines[ii].get_ydata(), alpha=0.25, color=plot_lines[ii].get_color())
ax.set_title(f'Normal Distribution')
ax.set_xlabel(f'Value')
ax.set_ylabel(f'Probability')
plt.show()
I used the lineplot to create the lines and then created the fills. But this is a hack, I was wondering if I can do it more naturally within Seaborn.
I found a way to manually play with the elements do so using the area object:
(
so.Plot(healthexp, "Year", "Spending_USD", color="Country")
.add(so.Area(alpha=.7), so.Stack())
)
The result is:
Yet for some reason the example code doesn't work.
What I did was using Seabron's lineplot() and then manually add fill_between() polygon:
ax = sns.lineplot(data=data_frame, x='data_x', y='data_y', hue='data_color')
plot_lines = ax.get_lines()
for i in range(num_unique_colors):
ax.fill_between(x=plot_lines[i].get_xdata(), y1=plot_lines[i].get_ydata(), alpha=0.25, color=plot_lines[i].get_color())

Retrieving line data from multiple weighted seaborn distribution plots?

I have the code below with randomly generated dataframes and I would like to extract the x and y values of both plotted lines. These line plots show the Price on the Y-axis and are Volume weighted.
For some reason, the line values for the second distribution plot, cannot be stored on the variables "df_2_x", "df_2_y". The values of "df_1_x", "df_1_y" are also written on the other variables. Both print statements return True, so the arrays are completely equal.
If I put them in separate cells in a notebook, it does work.
I also looked at this solution: How to retrieve all data from seaborn distribution plot with mutliple distributions?
But this does not work for weighted distplots.
import pandas as pd
import random
import seaborn as sns
import matplotlib.pyplot as plt
Price_1 = [round(random.uniform(2,12), 2) for i in range(30)]
Volume_1 = [round(random.uniform(100,3000)) for i in range(30)]
Price_2 = [round(random.uniform(0,10), 2) for i in range(30)]
Volume_2 = [round(random.uniform(100,1500)) for i in range(30)]
df_1 = pd.DataFrame({'Price_1' : Price_1,
'Volume_1' : Volume_1})
df_2 = pd.DataFrame({'Price_2' : Price_2,
'Volume_2' :Volume_2})
df_1_x, df_1_y = sns.distplot(df_1.Price_1, hist_kws={"weights":list(df_1.Volume_1)}).get_lines()[0].get_data()
df_2_x, df_2_y = sns.distplot(df_2.Price_2, hist_kws={"weights":list(df_2.Volume_2)}).get_lines()[0].get_data()
print((df_1_x == df_2_x).all())
print((df_1_y == df_2_y).all())
Why does this happen, and how can I fix this?
Whether or not weight is used, doesn't make a difference here.
The principal problem is that you are extracting again the first curve in df_2_x, df_2_y = sns.distplot(df_2....).get_lines()[0].get_data(). You'd want the second curve instead: df_2_x, df_2_y = sns.distplot(df_2....).get_lines()[1].get_data().
Note that seaborn isn't really meant to concatenate commands. Sometimes it works, but it usually adds a lot of confusion. E.g. sns.distplot returns an ax (which represents a subplot). Graphical elements such as lines are added to that ax.
Also note that sns.distplot has been deprecated. It will be removed from Seaborn in one of the next versions. It is replaced by sns.histplot and sns.kdeplot.
Here is how the code could look like:
import pandas as pd
import random
import seaborn as sns
import matplotlib.pyplot as plt
Price_1 = [round(random.uniform(2, 12), 2) for i in range(30)]
Volume_1 = [round(random.uniform(100, 3000)) for i in range(30)]
Price_2 = [round(random.uniform(0, 10), 2) for i in range(30)]
Volume_2 = [round(random.uniform(100, 1500)) for i in range(30)]
df_1 = pd.DataFrame({'Price_1': Price_1,
'Volume_1': Volume_1})
df_2 = pd.DataFrame({'Price_2': Price_2,
'Volume_2': Volume_2})
ax = sns.histplot(x=df_1.Price_1, weights=list(df_1.Volume_1), bins=10, kde=True, kde_kws={'cut': 3})
sns.histplot(x=df_2.Price_2, weights=list(df_2.Volume_2), bins=10, kde=True, kde_kws={'cut': 3}, ax=ax)
df_1_x, df_1_y = ax.lines[0].get_data()
df_2_x, df_2_y = ax.lines[1].get_data()
# use fill_between to demonstrate where the extracted curves lie
ax.fill_between(df_1_x, 0, df_1_y, color='b', alpha=0.2)
ax.fill_between(df_2_x, 0, df_2_y, color='r', alpha=0.2)
plt.show()

peak_widths w.r.t to x axis

I am using scipy.signal to calculate the width of the different peaks. I have 100 values wrt to different time points. I am using following code to calculate the peak, then width. The problem is it is not considering the time on x axis while calculating the width.
peaks_control, _ = find_peaks(x_control, height=2100)
time_control = time[:100]
width_control = peak_widths(x_control, peaks_control, rel_height=0.9)
The output of width_control is
array([12.84785714, 13.21299534, 13.4502381 , 12.71311143]),
array([2042.5, 2048.8, 2057.4, 2065. ]),
array([ 5.795 ,28.29469697, 51.245 , 74.17150396]),
array([18.64285714, 41.50769231, 64.6952381 , 86.88461538]))
I am using following to use time on x axis and show the signals, which is correct
plt.plot(time_control, x_control)
plt.plot(time_control[peaks_control], x_control[peaks_control], "x")
#plt.plot(np.zeros_like(x_control), "--", color="gray")
#plt.xlim(time_control.tolist())
plt.title('Control')
plt.xlabel('Time (s)')
plt.ylabel('RFU')
plt.show()
I am using following code to show the width also, but not able to put the actual time on x axis.
plt.plot(x_control)
plt.plot(peaks_control, x_control[peaks_control], "x")
plt.hlines(*width_control[1:], color="C3")
plt.title('Control')
plt.xlabel('Time (s)')
plt.ylabel('RFU')
plt.show()
I had the same problem just now, so here's my solution (there are probably more elegant solutions but this worked for me):
peak_widths() returns the widths (in samples), height at which the widths were calculated, and the interpolated positions of left and right intersection points of a horizontal line at the respective evaluation height (also in samples).
To convert those values from samples back to our x-axis we can use scipy.interpolate.interp1():
from scipy.signal import find_peaks, peak_widths
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
import numpy as np
def index_to_xdata(xdata, indices):
"interpolate the values from signal.peak_widths to xdata"
ind = np.arange(len(xdata))
f = interp1d(ind,xdata)
return f(indices)
x = np.linspace(0, 1, 10)
y = np.sin(4*x-0.2)
peaks, _ = find_peaks(y)
widths, width_heights, left_ips, right_ips = peak_widths(y, peaks)
widths = index_to_xdata(x, widths)
left_ips = index_to_xdata(x, left_ips)
right_ips = index_to_xdata(x, right_ips)
plt.plot(x,y)
plt.plot(x[peaks], y[peaks], "x")
plt.hlines(width_heights, left_ips, right_ips, color='r')
plt.xlabel('x values')
plt.ylabel('y values')
plt.show()
image of plot

How to plot median values in each bin, and to show the 25 and 75 percent value

I would like to plot my data similar to the following figure with showing median in each bin and 25 and 75 percent value.[The solid line and open circles show the median values in each bin, and the broken lines show the 25% and 75% values.]
I have this sample data. And I did like this to get the similar plot
import numpy as np
import matplotlib.pyplot as plt
from astropy.table import Table
data=Table.read('sample_data.fits')
# Sample data
X=data['density']
Y=data['lineflux']
total_bins = 15
bins = np.linspace(min(X),max(X), total_bins)
delta = bins[1]-bins[0]
idx = np.digitize(X,bins)
running_median = [np.median(Y[idx==k]) for k in range(total_bins)]
plt.plot(X,Y,'.')
plt.plot(bins-delta/2,running_median,'--r',marker='o',fillstyle='none',markersize=20,alpha=1)
plt.xlabel('log $\delta_{5th}[Mpc^{-3}]$')
plt.ylabel('log OII[flux]')
plt.loglog()
plt.axis('tight')
plt.show()
And I got this plot.
There is a large offset. I change the size of the bin also, still, I got the large offset.
How to plot in the correct way and how to include the 25 and 75 percent value like the previous figure in my plot.
To also answer the other question: you can use np.percentile. I had to lower the bin number (there was a bin without data, this leads to problems with the percentile). For the logarithmic bins see my comment above:
import numpy as np
import matplotlib.pyplot as plt
from astropy.table import Table
data=Table.read('sample_data.fits')
# Sample data
X=data['density']
Y=data['lineflux']
total_bins = 10
#bins = np.linspace(min(X), max(X), total_bins)
bins = np.logspace(np.log10(0.0001), np.log10(0.1), total_bins)
delta = bins[1]-bins[0]
idx = np.digitize(X, bins)
running_median = [np.median(Y[idx==k]) for k in range(total_bins)]
running_prc25 = [np.percentile(Y[idx==k], 25) for k in range(total_bins)]
running_prc75 = [np.percentile(Y[idx==k], 75) for k in range(total_bins)]
plt.plot(X,Y,'.')
plt.plot(bins-delta/2,running_median,'-r',marker='o',fillstyle='none',markersize=20,alpha=1)
plt.plot(bins-delta/2,running_prc25,'--r',marker=None,fillstyle='none',markersize=20,alpha=1)
plt.plot(bins-delta/2,running_prc75,'--r',marker=None,fillstyle='none',markersize=20,alpha=1)
plt.xlabel('log $\delta_{5th}[Mpc^{-3}]$')
plt.ylabel('log OII[flux]')
plt.loglog()
plt.axis('tight')
plt.show()
which produces
EDIT:
To show a filled plot you may try (just relevant section shown):
fig, ax = plt.subplots()
plt.plot(X,Y,'.')
plt.plot(bins-delta/2,running_median,'-r',marker='o',fillstyle='none',markersize=20,alpha=1)
#plt.plot(bins-delta/2,running_prc25,'--r',marker=None,fillstyle='none',markersize=20,alpha=1)
#plt.plot(bins-delta/2,running_prc75,'--r',marker=None,fillstyle='none',markersize=20,alpha=1)
ax.fill_between(bins-delta/2,running_prc25,running_median, facecolor='orange')
ax.fill_between(bins-delta/2,running_prc75,running_median, facecolor='orange')
which produces

How to scale histogram y-axis in million in matplotlib

I am plotting a histogram using matplotlib but my y-axis range is in the millions. How can I scale the y-axis so that instead of printing 5000000 it will print 5
Here is my code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
filename = './norstar10readlength.csv'
df=pd.read_csv(filename, sep=',',header=None)
n, bins, patches = plt.hist(x=df.values, bins=10, color='#0504aa',
alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('My Very Own Histogram')
maxfreq = n.max()
# Set a clean upper y-axis limit.
plt.ylim(ymax=np.ceil(maxfreq / 10) * 10 if maxfreq % 10 else maxfreq + 10)
plt.show()
And here is the plot I am generating now
An elegant solution is to apply a FuncFormatter to format y labels.
Instead of your source data, I used the following DataFrame:
Val
0 800000
1 2600000
2 6700000
3 1400000
4 1700000
5 1600000
and made a bar plot. "Ordinary" bar plot:
df.Val.plot.bar(rot=0, width=0.75);
yields a picture with original values on the y axis (1000000 to
7000000).
But if you run:
from matplotlib.ticker import FuncFormatter
def lblFormat(n, pos):
return str(int(n / 1e6))
lblFormatter = FuncFormatter(lblFormat)
ax = df.Val.plot.bar(rot=0, width=0.75)
ax.yaxis.set_major_formatter(lblFormatter)
then y axis labels are integers (the number of millions):
So you can arrange your code something like this:
n, bins, patches = plt.hist(x=df.values, ...)
#
# Other drawing actions, up to "plt.ylim" (including)
#
ax = plt.gca()
ax.yaxis.set_major_formatter(lblFormatter)
plt.show()
You can modify your df itself, you just need to decide one ratio
so if you want to make 50000 to 5 then it means the ratio is 5/50000 which is 0.0001
Once you have the ratio just multiply all the values of y-axis with the ratio in your DataFrame itself.
Hope this helps!!

Categories