plot gaussian between points - python

Hi, I have a list of points and I would like to plot a Gaussian curve between each pair of consecutive points to generate a kind of time series.
For example, here I use a date range:
import pandas as pd
a=pd.date_range(start="2015-06-16 ",end="2015-06-23 ", freq='H')
I would like a Gaussian density curve (i.e. a normal distribution) between "2015-06-16" and "2015-06-17", another one between "2015-06-17" and "2015-06-18", and so on.
I have no idea how to do that.
Thank you.

import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
a = pd.date_range(start="2015-06-16 ",end="2015-06-23 ", freq='H')
x = np.linspace(-3, 3, 24)
norm_pdf = stats.norm.pdf(x, 0, 1)
density = np.tile(norm_pdf, (len(a) - 1) // 24)  # integer division: np.tile needs an integer repeat count
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(a[1:], density)
ax.set_ylim([0, 1])
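The tiling idea above can also be wrapped into a small helper that returns a pandas Series, which is convenient if the curve is meant to be used as a time series rather than only plotted. This is a minimal sketch, not the answerer's code: the function name daily_gaussian_series and the sigma_hours parameter are hypothetical, and it assumes hourly sampling with each bump centred at noon.

```python
import numpy as np
import pandas as pd
from scipy import stats


def daily_gaussian_series(start, n_days, sigma_hours=4.0):
    """Return an hourly Series holding one Gaussian bump per day.

    sigma_hours (hypothetical parameter) is the standard deviation of each
    bump, expressed in hours; every bump peaks at 12:00.
    """
    index = pd.date_range(start=start, periods=24 * n_days, freq="h")
    hours = np.arange(24)
    one_day = stats.norm.pdf(hours, loc=12, scale=sigma_hours)
    # Repeat the same 24-sample bump once per day
    return pd.Series(np.tile(one_day, n_days), index=index)


series = daily_gaussian_series("2015-06-16", n_days=7)
```

Calling series.plot() then draws the seven bumps against the date axis.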

Assuming constant volatility and no drift (given your short time horizon), the following should work:
import pandas as pd
import numpy as np
annualized_vol = 0.30 # i.e. 30%
delta_t = 1 / 252. / 24. # Assuming 252 days in a trading year and 24 hours in a trading day.
initial_price = 100
idx = pd.date_range(start="2015-06-16 ", end="2015-06-23 ", freq='H')
dx = pd.DataFrame(np.random.randn(len(idx)), index=idx) * annualized_vol * delta_t ** .5
(initial_price * (1 + dx.cumsum())).plot()  # apply the cumulative return to the starting price

Related

Python : How I can draw FFT graph with Pandas DataFrame which is made by time and values

I am trying to make an FFT graph from a Pandas DataFrame.
Here is the source code I tried:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.fft import fftfreq
plt.style.use("seaborn")
data = pd.read_csv("/Users/kyungyunlee/Desktop/ IRP reference/Data/PIXEL_DATA/1_piece.csv")
t = data["time"].loc[data["time"] > 5].loc[data["time"] < 10]
N = len(t)  # must come after t is defined
s = data["y_value"].loc[data["time"] > 5].loc[data["time"] < 10]
print(len(s))
fft = np.fft.fft(s)
fftfreq = np.fft.fftfreq(len(s))
plt.subplot(1, 2, 1)
plt.xlabel("Frequency Domain")
plt.ylabel("Amplitude")
plt.plot(fftfreq, fft)
plt.subplot(1, 2, 2)
plt.plot(t, s)
plt.show()
The picture below is the result of the source code.
The left graph is the FFT and the right graph is amplitude against time. I can't understand why my FFT graph looks like that; it looks wrong, but I can't find the problem.
The data is very simple: a time column (maybe in ms) and a value column.
time,y_value
5.009026,614
5.035417,550
5.061302,554
5.08712,611
5.114184,613
5.140525,614
5.167711,573
5.19439,532
5.220309,596
5.247532,607
5.273929,608
5.300062,588
5.326553,529
5.352314,577
5.378559,602
5.404629,602
5.431329,597
5.459119,547
5.486477,556
5.512459,597
5.539668,594
5.567103,597
5.594013,564
5.621206,539
5.646212,586
5.671964,594
5.698939,594
5.726222,577
5.777665,574
5.804736,590
5.831811,590
5.858152,583
5.885826,543
5.912285,562
5.937549,587
5.991617,585
6.018168,555
6.044418,547
6.07098,581
6.097121,585
6.124821,585
6.151159,566
6.177994,536
6.205361,573
6.232069,582
6.25743,582
6.284097,573
6.31036,537
6.336849,564
6.363457,580
6.390022,580
6.417727,576
6.444151,549
6.471022,553
6.498445,576
6.551982,577
6.578571,557
6.60393,544
6.631363,571
6.657855,576
6.685089,576
6.711603,563
6.763428,565
6.789426,574
6.815717,574
6.841412,569
6.867886,543
6.867886,517
6.89452,558
6.921834,572
6.974582,570
7.00143,550
7.029219,550
7.055249,569
7.109767,570
7.137385,556
7.188917,565
7.215901,569
7.215901,543
7.243045,569
7.270299,561
7.32553,560
What I want to do is draw an FFT graph from this data, but I don't know why the code is not working.
I hope I can get some feedback. Thank you.
If you suppress the DC component (the zero-frequency term) and adjust the axes, the results seem pretty reasonable:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
plt.style.use("seaborn-v0_8")  # the style was called "seaborn" before Matplotlib 3.6
data = pd.read_csv("x.data")
print(data)
t = data["time"].loc[data["time"] > 5].loc[data["time"] < 10]
s = data["y_value"].loc[data["time"] > 5].loc[data["time"] < 10]
fft = np.fft.fft(s)
fft[0] = 0
fftfreq = np.fft.fftfreq(len(s))*len(s)/(t.max()-t.min())
plt.subplot(1, 2, 1)
plt.xlabel("Frequency Domain")
plt.ylabel("Amplitude")
plt.plot(fftfreq, fft)
plt.subplot(1, 2, 2)
plt.plot(t, s)
plt.show()
Output:
And if you plot the magnitude spectrum (np.abs(fft); square it for the power spectrum), you get:
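For reference, the same steps can be sketched with a real-input FFT that returns only non-negative frequencies. This is an illustration under stated assumptions, not the answerer's code: the helper name amplitude_spectrum is hypothetical, the timestamps are assumed to be sorted and in seconds, and the signal is a synthetic stand-in for the CSV. Because the posted timestamps are not evenly spaced, the sketch first interpolates onto a uniform grid, which the FFT implicitly assumes.

```python
import numpy as np


def amplitude_spectrum(t, s):
    """Return (freqs_hz, magnitude) for a real signal sampled at times t."""
    t = np.asarray(t, dtype=float)
    s = np.asarray(s, dtype=float)
    # Resample onto an evenly spaced grid, since the FFT assumes uniform sampling
    t_uniform = np.linspace(t.min(), t.max(), len(t))
    s_uniform = np.interp(t_uniform, t, s)
    dt = t_uniform[1] - t_uniform[0]
    # Subtracting the mean removes the DC offset that otherwise dominates the plot
    s_uniform = s_uniform - s_uniform.mean()
    spectrum = np.fft.rfft(s_uniform)
    freqs = np.fft.rfftfreq(len(s_uniform), d=dt)
    return freqs, np.abs(spectrum)


# Synthetic stand-in for the CSV: a 4 Hz oscillation around 570
t = np.linspace(5.0, 10.0, 200, endpoint=False)
s = 570 + 30 * np.sin(2 * np.pi * 4.0 * t)

freqs, magnitude = amplitude_spectrum(t, s)
power = magnitude ** 2                    # power spectrum
peak_hz = freqs[np.argmax(magnitude)]     # dominant frequency
```

Plotting freqs against magnitude (or power) gives a one-sided spectrum with real values, so nothing is discarded when matplotlib draws it.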

How to draw the Probability Density Function (PDF) plot in Python?

I'd like to ask how to draw the Probability Density Function (PDF) plot in Python.
This is my code.
import numpy as np
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.random.normal(50, 3, 1000)
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source)
df
I generated a data frame. Then, I tried to draw a PDF graph.
df["AGW"].sort_values()
df_mean = np.mean(df["AGW"])
df_std = np.std(df["AGW"])
pdf = stats.norm.pdf(df["AGW"], df_mean, df_std)
plt.plot(df["AGW"], pdf)
I obtained the graph above. What did I do wrong? Could you tell me how to draw the probability density function (PDF) plot, also known as the normal distribution graph?
Which code (or library) do I need to draw the PDF graph?
Many thanks!
You just need to sort the values (I haven't checked what was added in the edit):
pdf = stats.norm.pdf(df["AGW"].sort_values(), df_mean, df_std)
plt.plot(df["AGW"].sort_values(), pdf)
And it will work.
The line df["AGW"].sort_values() doesn't change df. Maybe you meant df.sort_values(by=['AGW'], inplace=True).
In that case the full code will be :
import numpy as np
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.random.normal(50, 3, 1000)
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source)
df.sort_values(by=['AGW'], inplace=True)
df_mean = np.mean(df["AGW"])
df_std = np.std(df["AGW"])
pdf = stats.norm.pdf(df["AGW"], df_mean, df_std)
plt.plot(df["AGW"], pdf)
Which gives :
Edit :
I think here we already have the distribution (x is normally distributed), so we don't need to evaluate the pdf at the sampled x values. The pdf is normally used like this:
import math
mu = 50
variance = 9  # standard deviation 3, as in np.random.normal(50, 3, 1000)
sigma = math.sqrt(variance)
x = np.linspace(mu - 5*sigma, mu + 5*sigma, 1000)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.show()
Here we don't need to generate the distribution from the x points; we only need to plot the density of the distribution we already have.
So you might use this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = np.random.normal(50, 3, 1000) #Generating Data
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source) #Converting to pandas DataFrame
df.plot(kind = 'density'); # or df["AGW"].plot(kind = 'density');
Which gives :
You might use other packages if you want, like seaborn :
import seaborn as sns
plt.figure(figsize = (5,5))
sns.kdeplot(df["AGW"], bw_method=0.5, fill=True)  # bw was replaced by bw_method/bw_adjust in seaborn 0.11
plt.show()
Or this :
import seaborn as sns
sns.set_style("whitegrid") # Setting style(Optional)
plt.figure(figsize = (10,5)) #Specify the size of figure
sns.histplot(x=df["AGW"], bins=10, kde=True, color='teal',
             line_kws=dict(linewidth=4))  # histplot replaces the deprecated distplot; kde=True overlays a density estimate
plt.show()
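If the goal is the smooth bell curve of the fitted normal distribution rather than a kernel density estimate, one option is to estimate the parameters from the data and evaluate the pdf on an evenly spaced grid, which is sorted by construction. A minimal sketch, assuming the same mean and standard deviation as the question; the variable names and the seed are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)           # seeded so the sketch is repeatable
samples = rng.normal(50, 3, 1000)        # same parameters as the question

# Maximum-likelihood estimates of the mean and standard deviation
mu_hat, sigma_hat = stats.norm.fit(samples)

# An evenly spaced grid is already sorted, so the line plot will not zigzag
grid = np.linspace(samples.min(), samples.max(), 200)
pdf_on_grid = stats.norm.pdf(grid, mu_hat, sigma_hat)

# Rough sanity check: the area under the curve over the data range is close to 1
area = np.sum(pdf_on_grid) * (grid[1] - grid[0])
```

plt.hist(samples, bins=30, density=True) followed by plt.plot(grid, pdf_on_grid) then shows the histogram with the fitted curve on top.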

How to get the full width at half maximum (FWHM) from kdeplot

I have used seaborn's kdeplot on some data.
import seaborn as sns
import numpy as np
sns.kdeplot(np.random.rand(100))
Is it possible to return the fwhm from the curve created?
And if not, is there another way to calculate it?
You can extract the generated kde curve from the ax. Then get the maximum y value and search the x positions nearest to the half max:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
ax = sns.kdeplot(np.random.rand(100))
kde_curve = ax.lines[0]
x = kde_curve.get_xdata()
y = kde_curve.get_ydata()
halfmax = y.max() / 2
maxpos = y.argmax()
leftpos = (np.abs(y[:maxpos] - halfmax)).argmin()
rightpos = (np.abs(y[maxpos:] - halfmax)).argmin() + maxpos
fullwidthathalfmax = x[rightpos] - x[leftpos]
ax.hlines(halfmax, x[leftpos], x[rightpos], color='crimson', ls=':')
ax.text(x[maxpos], halfmax, f'{fullwidthathalfmax:.3f}\n', color='crimson', ha='center', va='center')
ax.set_ylim(ymin=0)
plt.show()
Note that you can also calculate a kde curve from scipy.stats.gaussian_kde if you don't need the plotted version. In that case, the code could look like:
import numpy as np
from scipy.stats import gaussian_kde
data = np.random.rand(100)
kde = gaussian_kde(data)
x = np.linspace(data.min(), data.max(), 1000)
y = kde(x)
halfmax = y.max() / 2
maxpos = y.argmax()
leftpos = (np.abs(y[:maxpos] - halfmax)).argmin()
rightpos = (np.abs(y[maxpos:] - halfmax)).argmin() + maxpos
fullwidthathalfmax = x[rightpos] - x[leftpos]
print(fullwidthathalfmax)
I don't believe there's a way to return the FWHM from the plot itself without writing code to calculate it.
Take some example data:
import numpy as np
from scipy.stats import norm
arr_x = np.linspace(norm.ppf(0.00001), norm.ppf(0.99999), 10000)
arr_y = norm.pdf(arr_x)
Find the minimum and maximum values of the curve and calculate their difference:
difference = max(arr_y) - min(arr_y)
Find the half maximum (half of that difference):
HM = difference / 2
Find the nearest data point to HM:
nearest = (np.abs(arr_y - HM)).argmin()
Calculate the distance between the x value at nearest and the x value at the peak to get the HWHM, then multiply by 2 to get the FWHM.
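Put together, the steps might look like the sketch below. It is one possible completion, not the answerer's code: it takes the half maximum as max / 2 (treating the baseline as zero, an assumption) and searches each side of the peak separately instead of doubling the half width. For a standard normal curve the result can be compared with the analytic value 2 * sqrt(2 * ln 2).

```python
import numpy as np
from scipy.stats import norm

arr_x = np.linspace(norm.ppf(0.00001), norm.ppf(0.99999), 10000)
arr_y = norm.pdf(arr_x)

half_max = arr_y.max() / 2     # baseline assumed to be zero
peak = arr_y.argmax()

# Search each side of the peak separately so both crossings are found
left = np.abs(arr_y[:peak] - half_max).argmin()
right = np.abs(arr_y[peak:] - half_max).argmin() + peak

fwhm = arr_x[right] - arr_x[left]
expected = 2 * np.sqrt(2 * np.log(2))   # analytic FWHM for sigma = 1
```

The accuracy is limited by the grid spacing; a finer arr_x or linear interpolation around the two crossings would tighten it.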

Populate Pandas Dataframe with normal distribution

I would like to populate a dataframe with numbers that follow a normal distribution. Currently I'm populating it randomly, but the distribution is flat. Column a has mean and sd of 5 and 1 respectively, and column b has mean and sd of 15 and 1.
import pandas as pd
import numpy as np
n = 10
df = pd.DataFrame(dict(
a=np.random.randint(1,10,size=n),
b=np.random.randint(100,110,size=n)
))
Try this. randint does not sample from a normal distribution; normal does. Also, I'm not sure where the 100 and 110 in the min and max arguments for b came from, given the stated mean of 15.
n = 10
a_bar = 5; a_sd = 1
b_bar = 15; b_sd = 1
df = pd.DataFrame(dict(a=np.random.normal(a_bar, a_sd, size=n),
b=np.random.normal(b_bar, b_sd, size=n)),
columns=['a', 'b'])
This should work:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
n = 200
df = pd.DataFrame(dict(
a=np.random.normal(5, 1, size=n),   # arguments are (mean, sd), not (low, high)
b=np.random.normal(15, 1, size=n)
))
plt.style.use("ggplot")
fig, ax = plt.subplots()
ax.plot(df["a"])
ax.plot(df["b"], color="b")
plt.show()
plt.clf()
I think you are using the wrong numpy function: np.random.randint returns random integers from the discrete uniform distribution. If you want a random normal distribution, you need to use np.random.normal, namely:
import pandas as pd
import numpy as np
n = 10
df = pd.DataFrame(dict(
a=np.random.normal(loc=5,scale=1,size=n),
b=np.random.normal(15,1,size=n)
))
where loc corresponds to the mean value, and scale to the standard deviation value of the distribution.
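A quick way to confirm that the columns have the intended parameters is to draw a larger sample and compare the sample mean and standard deviation with loc and scale. The sketch below uses NumPy's Generator interface (np.random.default_rng) instead of the legacy np.random.normal function; the seed and the sample size are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # seeded generator for repeatable draws
n = 10_000

df = pd.DataFrame({
    "a": rng.normal(loc=5, scale=1, size=n),    # mean 5, sd 1
    "b": rng.normal(loc=15, scale=1, size=n),   # mean 15, sd 1
})

# Sample mean and standard deviation of each column
summary = df.agg(["mean", "std"])
```

With n = 10 as in the question the sample statistics will scatter noticeably around the targets; that is expected for such a small sample.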

Dendrogram using pandas and scipy

I wish to generate a dendrogram based on correlation using pandas and scipy. I use a dataset (as a DataFrame) consisting of returns, which is of size n x m, where n is the number of dates and m the number of companies. Then I simply run the script
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
import numpy as np
m = 5
dates = pd.date_range('2013-01-01', periods=365)
random_matrix = np.random.normal(0, 0.01, size=(len(dates), m))
dataframe = pd.DataFrame(data=random_matrix, index=dates)
z = hc.linkage(dataframe.values.T, method='average', metric='correlation')
dendrogram = hc.dendrogram(z, labels=dataframe.columns)
plt.show()
and I get a nice dendrogram. Now, the thing is that I'd also like to use other correlation measures apart from just ordinary Pearson correlation, which is a feature that's incorporated in pandas by simply invoking DataFrame.corr(method='<method>'). So, I thought at first that it was to simply run the following code
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
import numpy as np
m = 5
dates = pd.date_range('2013-01-01', periods=365)
random_returns = np.random.normal(0, 0.01, size=(len(dates), m))
dataframe = pd.DataFrame(data=random_returns, index=dates)
corr = dataframe.corr()
z = hc.linkage(corr.values, method='average')
dendrogram = hc.dendrogram(z, labels=corr.columns)
plt.show()
However, if I do this I get strange values on the y-axis as the maximum value > 1.4. Whereas if I run the first script it's about 1. What am I doing wrong? Am I using the wrong metric in hc.linkage?
EDIT I might add that the shape of the dendrogram is exactly the same. Do I have to normalize the third column of the resulting z with the maximum value?
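Found the solution. If you have already calculated a distance matrix (be it correlation-based or anything else), you simply have to condense the matrix using scipy.spatial.distance.squareform (reachable here as hc.distance.squareform) and pass the condensed form to linkage. Passing the square matrix directly makes linkage treat each row as an observation vector and compute new distances between the rows. That is,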
dataframe = pd.DataFrame(data=random_returns, index=dates)
corr = 1 - dataframe.corr()
corr_condensed = hc.distance.squareform(corr) # convert to condensed
z = hc.linkage(corr_condensed, method='average')
dendrogram = hc.dendrogram(z, labels=corr.columns)
plt.show()
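The same approach extends to the other correlation measures offered by DataFrame.corr. The following is a sketch under a few assumptions: the helper name correlation_linkage is hypothetical, the distance is taken as 1 - correlation as in the answer, and the diagonal is forced to exactly zero and the matrix symmetrised before condensing, in case floating-point noise would otherwise trip squareform's validity checks.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import squareform


def correlation_linkage(returns, corr_method="spearman", linkage_method="average"):
    """Build a linkage matrix from 1 - correlation distances between columns."""
    corr = returns.corr(method=corr_method)
    dist = 1.0 - corr.to_numpy()
    dist = (dist + dist.T) / 2          # remove tiny asymmetries
    np.fill_diagonal(dist, 0.0)         # self-distance must be exactly zero
    condensed = squareform(dist, checks=False)
    return hc.linkage(condensed, method=linkage_method), list(corr.columns)


rng = np.random.default_rng(0)
dates = pd.date_range("2013-01-01", periods=365)
returns = pd.DataFrame(rng.normal(0, 0.01, size=(len(dates), 5)), index=dates)

z, labels = correlation_linkage(returns, corr_method="spearman")
# no_plot=True returns the tree layout without drawing; drop it to plot
tree = hc.dendrogram(z, labels=labels, no_plot=True)
```

Since correlations lie in [-1, 1], the merge heights in z[:, 2] stay within [0, 2], which avoids the unexpectedly large y-axis values described in the question.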
