I want to know how scipy.stats uses its methods fit and pdf. According to the documentation, fit(data, a, loc = 0, scale = 1) estimates parameters for data while pdf(x, a, loc=0, scale=1) computes probability density function . But I couldn't find how fit and pdf are actually performed, statistically and mathematically.
I am using the sm.datasets.elnino data, using the code from tmthydvnprt
import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels as sm
import matplotlib
import matplotlib.pyplot as plt
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())
y, x = np.histogram(data, bins = 50, density = True)
x = (x + np.roll(x, -1))[:-1] / 2.0
distribution = st.gennorm
params = distribution.fit(data)
arg = params[:-2]
loc = params[-2]
scale = params[-1]
pdf = distribution.pdf(x, loc = loc, scale = scale, *arg)
sse = np.sum(np.power(y - pdf, 2.0))
Using data, arg = 4.3836, loc = 23.2991, scale = 3.8499.
I want to know what arg, loc, and scale represent and how they are calculated.
Thank you.
Related
I'm trying to determine the highest peaks of the pattern blocks in the following waveform:
Basically, I need to detect the following peaks only (highlighted):
If I use scipy.find_peaks(), it's unable to detect the appropriate peaks:
indices = find_peaks(my_waveform, prominence = 1)[0]
It ends up detecting all of the following points, which is not what I am looking for:
I can't provide the input arguments of distance or height thresholds to scipy.find_peaks() since there are many of the desired peaks on either extremes which are lower in height than the non-desired peaks in the middle.
Note: I had de-trended the waveform in order to aid this above problem too as you can see in the above snapshot, but it still doesn't give the right results.
So can anyone help with a correct way to tackle this?
Here's the code to fully reproduce the dataset I've shown ("autocorr" is the final waveform of interest)
import json
import sys, os
import numpy as np
import pandas as pd
import glob
import pickle
from statsmodels.tsa.stattools import adfuller, acf, pacf
from scipy.signal import find_peaks, square
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
#GENERATION OF A FUNCTION WITH DUAL SEASONALITY & NOISE
def white_noise(mu, sigma, num_pts):
""" Function to generate Gaussian Normal Noise
Args:
sigma: std value
num_pts: no of points
mu: mean value
Returns:
generated Gaussian Normal Noise
"""
noise = np.random.normal(mu, sigma, num_pts)
return noise
def signal_line_plot(input_signal: pd.Series, title: str = "", y_label: str = "Signal"):
""" Function to plot a time series signal
Args:
input_signal: time series signal that you want to plot
title: title on plot
y_label: label of the signal being plotted
Returns:
signal plot
"""
plt.plot(input_signal)
plt.title(title)
plt.ylabel(y_label)
plt.show()
t_week = np.linspace(1,480, 480)
t_weekend=np.linspace(1,192,192)
T=96 #Time Period
x_weekday = 10*square(2*np.pi*t_week/T, duty=0.7)+10 + white_noise(0, 1,480)
x_weekend = 2*square(2*np.pi*t_weekend/T, duty=0.7)+2 + white_noise(0,1,192)
x_daily_weekly = np.concatenate((x_weekday, x_weekend))
x_daily_weekly_long = np.concatenate((x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly))
signal_line_plot(x_daily_weekly_long)
signal_line_plot(x_daily_weekly_long[0:1000])
#x_daily_weekly_long is the final waveform on which I'm carrying out Autocorrelation
#PERFORMING AUTOCORRELATION:
import scipy.signal as signal
autocorr = signal.correlate(x_daily_weekly_long, x_daily_weekly_long, mode = "same")
lags = signal.correlation_lags(len(x_daily_weekly_long), len(x_daily_weekly_long), mode = "same")
#VISUALIZATION:
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
As you have some kind of double (or even triple) signal, I would attempt a double smoothing.
One to remove the overall trend, and one to remove the sharp noise.
A picture is probably better than a long explanation:
from scipy.signal import find_peaks
import pandas as pd
import numpy as np
def smooth(s, win):
return pd.Series(s).rolling(window=win, center=True).mean().ffill().bfill()
plt.plot(lags, autocorr, label='data')
WINDOW = 100 # needs to be determined empirically
# and so are the multipliers below
# double smoothing difference + clipping
ddiff = np.clip(smooth(autocorr, 2*WINDOW)-smooth(autocorr, 10*WINDOW), 0, np.inf)
plt.plot(lags, ddiff, label='smooth+clip')
peaks = find_peaks(ddiff, width=WINDOW)[0]
plt.plot(lags[peaks], autocorr[peaks], marker='o', ls='')
plt.plot(lags[peaks], ddiff[peaks], marker='o', ls='')
plt.legend()
output:
smoothing the original signal
As often in data analysis, the earlier you perform a transformation might be the better. You could also clean your original signal before running the autocorrelation. Here is a quick example (using the smooth function defined above):
from scipy.signal import find_peaks
x2 = smooth(x_daily_weekly_long, 100)
autocorr2 = signal.correlate(x2, x2, mode = "same")
plt.plot(lags, autocorr2)
idx = find_peaks(autocorr2)[0]
plt.plot(lags[idx], autocorr2[idx], marker='o', ls='')
cleaned signal:
For testing purposes i used a rough reconstruction of your signal.
import numpy as np
from scipy.signal import find_peaks, square
import matplotlib.pyplot as plt
x = np.linspace(3,103,10000)
sin = np.clip(np.sin(0.6*x)-0.5,0,10)
tri = np.concatenate([np.linspace(0,0.3,5000),np.linspace(0.3,0,5000)],axis =0)
sig = np.sin(6*x-1.2)
full = sin+tri+sig
peak run #1
peaks = find_peaks(full)[0]
plt.plot(full)
plt.scatter(peaks,full[peaks], color='red', s=5)
plt.show()
peak run #2 + index reextraction (this needs the actual values from your signal)
peaks2 = find_peaks(full[peaks])[0]
index = peaks[peaks2]
plt.plot(full)
plt.scatter(index,full[index], color='red', s=5)
plt.show()
If you know the period you can do this:
w=T
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
plt.scatter(lags[signal.find_peaks(signal.convolve(autocorr, np.ones(w)/w, mode="same"))[0]], autocorr[signal.find_peaks(signal.convolve(autocorr, np.ones(w)/w, mode="same"))[0]], color="r")
Result:
I don't know if it works in other cases.
EDIT:
another approach is to find the maximum in a slicing window, but also in this case you must define empirically a window size.
w=900
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
plt.scatter(lags[filters.maximum_filter(autocorr, size=W)==autocorr], autocorr[filters.maximum_filter(autocorr, size=W)==autocorr], color="r")
Result:
I have used seaborn's kdeplot on some data.
import seaborn as sns
import numpy as np
sns.kdeplot(np.random.rand(100))
Is it possible to return the fwhm from the curve created?
And if not, is there another way to calculate it?
You can extract the generated kde curve from the ax. Then get the maximum y value and search the x positions nearest to the half max:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
ax = sns.kdeplot(np.random.rand(100))
kde_curve = ax.lines[0]
x = kde_curve.get_xdata()
y = kde_curve.get_ydata()
halfmax = y.max() / 2
maxpos = y.argmax()
leftpos = (np.abs(y[:maxpos] - halfmax)).argmin()
rightpos = (np.abs(y[maxpos:] - halfmax)).argmin() + maxpos
fullwidthathalfmax = x[rightpos] - x[leftpos]
ax.hlines(halfmax, x[leftpos], x[rightpos], color='crimson', ls=':')
ax.text(x[maxpos], halfmax, f'{fullwidthathalfmax:.3f}\n', color='crimson', ha='center', va='center')
ax.set_ylim(ymin=0)
plt.show()
Note that you can also calculate a kde curve from scipy.stats.gaussian_kde if you don't need the plotted version. In that case, the code could look like:
import numpy as np
from scipy.stats import gaussian_kde
data = np.random.rand(100)
kde = gaussian_kde(data)
x = np.linspace(data.min(), data.max(), 1000)
y = kde(x)
halfmax = y.max() / 2
maxpos = y.argmax()
leftpos = (np.abs(y[:maxpos] - halfmax)).argmin()
rightpos = (np.abs(y[maxpos:] - halfmax)).argmin() + maxpos
fullwidthathalfmax = x[rightpos] - x[leftpos]
print(fullwidthathalfmax)
I don't believe there's a way to return the fwhm from the random dataplot without writing the code to calculate it.
Take into account some example data:
import numpy as np
arr_x = np.linspace(norm.ppf(0.00001), norm.ppf(0.99999), 10000)
arr_y = norm.pdf(arr_x)
Find the minimum and maximum points and calculate difference.
difference = max(arr_y) - min(arr_y)
Find the half max (in this case it is half min)
HM = difference / 2
Find the nearest data point to HM:
nearest = (np.abs(arr_y - HM)).argmin()
Calculate the distance between nearest and min to get the HWHM, then mult by 2 to get the FWHM.
I have a set of integer values, and I want to set them to Weibull distribution and get the best fit parameters. Then I draw the histogram of data together with the pdf of Weibull distribution, using the best fit parameters. This is the code I used.
from jtlHandler import *
import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
def get_pdf(latencies):
a = np.array(latencies)
ag = st.gaussian_kde(a)
ak = np.linspace(np.min(a), np.max(a), len(a))
agv = ag(ak)
plt.plot(ak,agv)
plt.show()
return (ak,agv)
def fit_to_distribution(distribution, data):
params = distribution.fit(data)
# Return MLEs for shape (if applicable), location, and scale parameters from data.
#
# MLE stands for Maximum Likelihood Estimate. Starting estimates for the fit are given by input arguments; for any arguments not provided with starting estimates, self._fitstart(data) is called to generate such.
return params
def make_distribution_pdf(dist, params, end):
arg = params[:-2]
loc = params[-2]
scale = params[-1]
# Build PDF and turn into pandas Series
x = np.linspace(0, end, end)
y = dist.pdf(x, loc=loc, scale=scale, *arg)
pdf = pd.Series(y, x)
return pdf
latencies = getLatencyList("filename")
latencies = latencies[int(9*(len(latencies)/10)):len(latencies)]
data = pd.Series(latencies)
params = fit_to_distribution(st.weibull_max, data)
print("Parameters for the fit: "+str(params))
# Make PDF
pdf = make_distribution_pdf(st.weibull_max, params, max(latencies))
# Display
plt.figure()
ax = pdf.plot(lw=2, label='PDF', legend=True)
data.plot(kind='hist', bins=200, normed=True, alpha=0.5, label='Data',
legend=True, ax=ax)
ax.set_title('Weibull distribution')
ax.set_xlabel('Latnecy')
ax.set_ylabel('Frequency')
plt.savefig("image.png")
This is the resulting figure.
As it is seen, the Weibull approximation is not simmilar to the original distribution of data.
How can I get the best Weibull approximation to my data?
You can fit a data set (set of numbers) to any distribution using the following two methods.
import os
import matplotlib.pyplot as plt
import sys
import math
import numpy as np
import scipy.stats as st
from scipy.stats._continuous_distns import _distn_names
from scipy.optimize import curve_fit
def fit_to_distribution(distribution, latency_values):
distribution = getattr(st, distribution)
params = distribution.fit(latency_values)
return params
def make_distribution_pdf(distribution, latency_list):
distribution = getattr(st, distribution)
params = distribution.fit(latency_list)
arg = params[:-2]
loc = params[-2]
scale = params[-1]
x = np.linspace(min(latency_list), max(latency_list), 10000)
y = distribution.pdf(x, loc=loc, scale=scale, *arg)
return x, y
I want to create a customized distribution based on a Levy truncated law, which reads
p(r) = (r + r0)**(-beta)*exp(-r/k).
So I defined it in the following way:
import numpy as np
import scipy.stats as st
class LevyPDF(st.rv_continuous):
def _pdf(self,r):
r0 = 100
k = 1500
beta = 1.6
return (r + r0)**(-beta)*np.exp(-r/k)
Suppose that I want to find the distribution of distances between r = 0 and r = 50km. Then:
nmin = 0
nmax = 50
my_cv = LevyPDF(a=nmin, b=nmax, name='LevyPDF')
x = np.linspace(nmin, nmax, (nmax-nmin)*2)
I do not understand why:
sum(my_cv.cdf(x)) = 2.22
instead of 1.
Then how can I define an histogram of N = 2000000 random distances based on the distribution that I defined?
Using your minimal example (slightly adapted):
import scipy.stats as st
import numpy as np
import matplotlib.pyplot as plt
class LevyPDF(st.rv_continuous):
def _pdf(self,r):
r0 = 100
k = 1500
beta = 1.6
return (r + r0)**(-beta)*np.exp(-r/k)
nmin = 0
nmax = 50
my_cv = LevyPDF(a=nmin, b=nmax, name='LevyPDF')
To sample from your random variable, use rvs() method from rv_continuous class:
N = 50000
X = my_cv.rvs(size=N, random_state=1)
Will return an array of size (N,) with random variates sampled from your distribution. Use random_state option to freeze your example and make your script repeatable (it defines random seed for your sampling).
Note as N softly increases, computation time drastically increases.
To plot histogram, use matplotlib library, see hist:
fig, axe = plt.subplots()
n, bins, patches = axe.hist(X, 50, normed=1, facecolor='green', alpha=0.75)
plt.show(axe)
Bellow a example of sampling from Chi Square with 40 Degrees of Freedom:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
rv = stats.chi2(40)
N = 200000
X = rv.rvs(size=N, random_state=1)
fig, axe = plt.subplots()
n, bins, patches = axe.hist(X, 50, normed=1, facecolor='green', alpha=0.75)
plt.show(axe)
It leads to:
I am running some goodness of fit tests using scipy.stats in Python 2.7.10.
for distrName in distrNameList:
distr = getattr(distributions, distrName)
param = distr.fit(sample)
pdf = distr.pdf(???)
What do I pass into distr.pdf() to get the values of the best-fit pdf on the list of sample points of interest, called abscissas?
From the documentation, the .fit() method returns:
shape, loc, scale : tuple of floats
MLEs for any shape statistics, followed by those for location and scale.
and the .pdf() method accepts:
x : array_like
quantiles
arg1, arg2, arg3,... : array_like
The shape parameter(s) for the distribution (see docstring of the instance object for more information)
loc : array_like, optional
location parameter (default=0)
scale : array_like, optional
So essentially you would do something like this:
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
# some random variates drawn from a beta distribution
rvs = stats.beta.rvs(2, 5, loc=0, scale=1, size=1000)
# estimate distribution parameters, in this case (a, b, loc, scale)
params = stats.beta.fit(rvs)
# evaluate PDF
x = np.linspace(0, 1, 1000)
pdf = stats.beta.pdf(x, *params)
# plot
fig, ax = plt.subplots(1, 1)
ax.hold(True)
ax.hist(rvs, normed=True)
ax.plot(x, pdf, '--r')
To evaluate the pdf at abscissas, you would pass abcissas as the first argument to pdf. To specify the parameters, use the * operator to unpack the param tuple and pass those values to distr.pdf:
pdf = distr.pdf(abscissas, *param)
For example,
import numpy as np
import scipy.stats as stats
distrNameList = ['beta', 'expon', 'gamma']
sample = stats.norm(0, 1).rvs(1000)
abscissas = np.linspace(0,1, 10)
for distrName in distrNameList:
distr = getattr(stats.distributions, distrName)
param = distr.fit(sample)
pdf = distr.pdf(abscissas, *param)
print(pdf)