Populate Pandas Dataframe with normal distribution - python

I would like to populate a dataframe with numbers that follow a normal distribution. Currently I'm populating it randomly, but the distribution is flat. Column a has mean and sd of 5 and 1 respectively, and column b has mean and sd of 15 and 1.
import pandas as pd
import numpy as np
n = 10
df = pd.DataFrame(dict(
a=np.random.randint(1,10,size=n),
b=np.random.randint(100,110,size=n)
))

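Try this. np.random.randint draws integers from a uniform distribution; np.random.normal draws from a normal distribution. Also, it is unclear where the 100 and 110 bounds for b come from, since you describe b as having a mean of 15 and an sd of 1.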
n = 10
a_bar = 5; a_sd = 1
b_bar = 15; b_sd = 1
df = pd.DataFrame(dict(a=np.random.normal(a_bar, a_sd, size=n),
b=np.random.normal(b_bar, b_sd, size=n)),
columns=['a', 'b'])

This should work:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
n = 200
df = pd.DataFrame(dict(
a=np.random.normal(5,1,size=n),  # arguments are (mean, sd): mean 5, sd 1
b=np.random.normal(15,1,size=n)  # mean 15, sd 1
))
plt.style.use("ggplot")
fig, ax = plt.subplots()
ax.plot(df["a"])
ax.plot(df["b"], color="b")
plt.show()
plt.clf()
Generated Plot

I think you are using the wrong numpy function: np.random.randint returns random integers from the discrete uniform distribution. If you want a random normal distribution, you need to use np.random.normal, namely:
import pandas as pd
import numpy as np
n = 10
df = pd.DataFrame(dict(
a=np.random.normal(loc=5,scale=1,size=n),
b=np.random.normal(15,1,size=n)
))
where loc corresponds to the mean value, and scale to the standard deviation value of the distribution.
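A minimal sketch of the same idea using NumPy's Generator API, with a quick check that the sample statistics land near the requested mean and standard deviation. The seed value 0 and the larger n are illustrative choices, not part of the question:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility (assumption)
n = 10_000                      # larger than the question's n so the statistics are stable

df = pd.DataFrame({
    "a": rng.normal(loc=5, scale=1, size=n),   # mean 5, sd 1
    "b": rng.normal(loc=15, scale=1, size=n),  # mean 15, sd 1
})

# Sample mean and standard deviation of each column
summary = df.agg(["mean", "std"])
print(summary)
```

With only n = 10 rows, as in the question, the sample mean and sd can differ noticeably from 5 and 1, which is expected for such a small sample.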

Related

How can I force seaborn or matplotlib to always render isolated values in heatmaps?

I am analyzing long series of events using heatmaps. The values in each column are 0 most of the time and occasionally 1. Unfortunately, the rendering often hides the 1 occurrences because of the 0s surrounding them. I have tried antialiased=False, but it did not solve the problem.
This code reproduces the issue:
import numpy as np
import pandas as pd
import seaborn as sns
d = pd.DataFrame(np.zeros((2000, 4)))
for i in range(4):
    for j in [34,223,56,666]:
        d.loc[j, i] = 1  # set row j of column i (avoids chained assignment)
axS = sns.heatmap(d,antialiased=False)
There should be 4 lines, but only one is visible. Of course, if I stretch the plot I get better results, but some values are still hidden.
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10,30)
axB = sns.heatmap(d,antialiased=False)
I would like to force the rendering of isolated values. Is there any way to get this behaviour?
P.S. I need heatmaps because I compare multiple variables with float values, so spy for instance is not a good option for me.
If these events are infrequent, you could reduce the resolution of the vertical axis by aggregating the values into bins.
import numpy as np
import pandas as pd
import seaborn as sns
d = pd.DataFrame(np.zeros((2000, 4)))
for i in range(4):
    for j in [34,223,56,666]:
        d.loc[j, i] = 1  # set row j of column i (avoids chained assignment)
x1 = d.index.min() - 1e-9
x2 = d.index.max()
bin_width = 50
bin_edge = np.arange(x1, x2 + bin_width, bin_width)
bin_center = np.arange(x1 + bin_width/2, x2, bin_width)
index_binned = pd.cut(d.index, bins=bin_edge, labels=bin_center)
d = d.join(pd.Series(index_binned, name="index_binned"))
d_binned = d.groupby('index_binned').sum()
sns.heatmap(data=d_binned, antialiased=False)
Output:
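A more compact variant of the same binning idea, as a sketch: group the rows by integer-dividing the row index by the bin width and take the maximum of each bin, so a bin shows 1 if any event fell inside it. Using the maximum instead of the sum is an assumption about what should be displayed; bin_width = 50 is copied from the answer.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame(np.zeros((2000, 4)))
for j in [34, 223, 56, 666]:
    d.loc[j, :] = 1  # mark the event rows in every column

bin_width = 50
# Rows 0-49 fall in bin 0, rows 50-99 in bin 1, and so on
d_binned = d.groupby(d.index // bin_width).max()
print(d_binned.shape)
```

d_binned can then be passed to sns.heatmap as before. It has 40 rows instead of 2000, so each event occupies a band that is tall enough to be drawn.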

How to draw the Probability Density Function (PDF) plot in Python?

I'd like to ask how to draw the Probability Density Function (PDF) plot in Python.
This is my code.
import numpy as np
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.random.normal(50, 3, 1000)
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source)
df
I generated a data frame. Then, I tried to draw a PDF graph.
df["AGW"].sort_values()
df_mean = np.mean(df["AGW"])
df_std = np.std(df["AGW"])
pdf = stats.norm.pdf(df["AGW"], df_mean, df_std)
plt.plot(df["AGW"], pdf)
I obtained the graph above. What did I do wrong? Could you let me know how to draw the Probability Density Function (PDF) plot, which is also known as the normal distribution graph?
Could you let me know which code (or library) I need to use to draw the PDF graph?
Many thanks!
You just need to sort the values (I have not really checked what comes after the edit):
pdf = stats.norm.pdf(df["AGW"].sort_values(), df_mean, df_std)
plt.plot(df["AGW"].sort_values(), pdf)
And it will work.
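A small sketch of why the sorting matters: plt.plot joins the points in the order they are given, so unsorted x-values make the line jump back and forth across the axis. The density is computed here by hand with NumPy as an illustrative stand-in for stats.norm.pdf, and the seed is an assumption added for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
x = rng.normal(50, 3, 1000)

def normal_pdf(values, mean, sd):
    """Density of a normal distribution, written out by hand."""
    z = (values - mean) / sd
    return np.exp(-0.5 * z**2) / (sd * np.sqrt(2 * np.pi))

x_sorted = np.sort(x)  # ascending x, so a line plot runs left to right
pdf_sorted = normal_pdf(x_sorted, x.mean(), x.std())

# Number of times consecutive points step backwards along the x-axis
backward_unsorted = int(np.sum(np.diff(x) < 0))
backward_sorted = int(np.sum(np.diff(x_sorted) < 0))
print(backward_unsorted, backward_sorted)
```

Every backward step in the unsorted data is a segment drawn from right to left, which is what produces the scribbled look of the original plot.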
The line df["AGW"].sort_values() doesn't change df. Maybe you meant df.sort_values(by=['AGW'], inplace=True).
In that case the full code will be :
import numpy as np
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.random.normal(50, 3, 1000)
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source)
df.sort_values(by=['AGW'], inplace=True)
df_mean = np.mean(df["AGW"])
df_std = np.std(df["AGW"])
pdf = stats.norm.pdf(df["AGW"], df_mean, df_std)
plt.plot(df["AGW"], pdf)
Which gives :
Edit :
I think that here we already have the sample (x is normally distributed), so we don't need to evaluate the pdf at the sampled points. The pdf is typically used like this, on an evenly spaced grid:
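import math
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
mu = 50
variance = 3
sigma = math.sqrt(variance)  # standard deviation is the square root of the variance
x = np.linspace(mu - 5*sigma, mu + 5*sigma, 1000)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.show()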
Here we don't need to generate the distribution from the x points; we only need to plot the density of the data we already have.
So you might use this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = np.random.normal(50, 3, 1000) #Generating Data
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source) #Converting to pandas DataFrame
df.plot(kind = 'density'); # or df["AGW"].plot(kind = 'density');
Which gives :
You might use other packages if you want, like seaborn :
import seaborn as sns
plt.figure(figsize = (5,5))
sns.kdeplot(df["AGW"] , bw_method = 0.5 , fill = True)  # bw_method replaces the deprecated bw argument
plt.show()
Or this :
import seaborn as sns
sns.set_style("whitegrid") # Setting style(Optional)
plt.figure(figsize = (10,5)) #Specify the size of figure
sns.distplot(x = df["AGW"] , bins = 10 , kde = True , color = 'teal'
, kde_kws=dict(linewidth = 4 , color = 'black')) # kde=True overlays a kernel density estimate; distplot is deprecated since seaborn 0.11
plt.show()
Check this article for more.

Plotting step function with empirical data cumulative x-axis

I have a dummy dataset, df:
   Demand    WTP
0    13.0  111.3
1   443.9  152.9
2   419.6   98.2
3   295.9  625.5
4   150.2  210.4
I would like to plot this data as a step function in which the "WTP" are y-values and "Demand" are x-values.
The step curve should start from the row with the lowest value in "WTP", and then increase gradually with the corresponding x-values from "Demand". However, I can't get the x-values to be cumulative, and instead my plot becomes this:
I'm trying to get something that looks like this:
but instead of a proportion along the y-axis, I want the actual values from my dataset:
This is my code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
Demand_quantity = pd.Series([13, 443.9, 419.6, 295.9, 150.2])
Demand_WTP = [111.3, 152.9, 98.2, 625.5, 210.4]
demand_data = {'Demand':Demand_quantity, 'WTP':Demand_WTP}
Demand = pd.DataFrame(demand_data)
Demand.sort_values(by = 'WTP', axis = 0, inplace = True)
print(Demand)
# sns.ecdfplot(data = Demand_WTP, x = Demand_quantity, stat = 'count')
plt.step(Demand['Demand'], Demand['WTP'], label='pre (default)')
plt.legend(title='Parameter where:')
plt.title('plt.step(where=...)')
plt.show()
You can try:
import matplotlib.pyplot as plt
import pandas as pd
df=pd.DataFrame({"Demand":[13, 443.9, 419.6, 295.9, 150.2],"WTP":[111.3, 152.9, 98.2, 625.5, 210.4]})
df=df.sort_values(by=["Demand"])
plt.step(df.Demand,df.WTP)
But I am not really sure what you want to do. If the x-values are df.Demand, then the dataframe should be sorted by this column.
If you want to accumulate the x-values, then try numpy.cumsum:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df=pd.DataFrame({"Demand":[13, 443.9, 419.6, 295.9, 150.2],"WTP":[111.3, 152.9, 98.2, 625.5, 210.4]})
df=df.sort_values(by=["WTP"])
plt.step(np.cumsum(df.Demand),df.WTP)
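To make the cumulative version concrete, here is a sketch that sorts by WTP and stores the running total of Demand in a new column. The column name cum_demand is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Demand": [13, 443.9, 419.6, 295.9, 150.2],
    "WTP": [111.3, 152.9, 98.2, 625.5, 210.4],
})

# Lowest WTP first, then accumulate Demand along that order
df = df.sort_values(by="WTP").reset_index(drop=True)
df["cum_demand"] = df["Demand"].cumsum()  # hypothetical column name
print(df)
```

Passing df["cum_demand"] and df["WTP"] to plt.step then gives x-values that only grow, ending at the total demand of 1322.6.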

Dendrogram using pandas and scipy

I wish to generate a dendrogram based on correlation using pandas and scipy. I use a dataset (as a DataFrame) consisting of returns, which is of size n x m, where n is the number of dates and m the number of companies. Then I simply run the script
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
import numpy as np
m = 5
dates = pd.date_range('2013-01-01', periods=365)
random_matrix = np.random.normal(0, 0.01, size=(len(dates), m))
dataframe = pd.DataFrame(data=random_matrix, index=dates)
z = hc.linkage(dataframe.values.T, method='average', metric='correlation')
dendrogram = hc.dendrogram(z, labels=dataframe.columns)
plt.show()
and I get a nice dendrogram. Now, I'd also like to use correlation measures other than ordinary Pearson correlation, which pandas supports through DataFrame.corr(method='<method>'). So at first I thought I could simply run the following code
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
import numpy as np
m = 5
dates = pd.date_range('2013-01-01', periods=365)
random_returns = np.random.normal(0, 0.01, size=(len(dates), m))
dataframe = pd.DataFrame(data=random_returns, index=dates)
corr = dataframe.corr()
z = hc.linkage(corr.values, method='average')
dendrogram = hc.dendrogram(z, labels=corr.columns)
plt.show()
However, if I do this I get strange values on the y-axis: the maximum value is greater than 1.4, whereas with the first script it is about 1. What am I doing wrong? Am I using the wrong metric in hc.linkage?
EDIT I might add that the shape of the dendrogram is exactly the same. Do I have to normalize the third column of the resulting z with the maximum value?
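Found the solution. hc.linkage treats a 2-D array as a set of observation vectors, so passing the square correlation matrix makes it compute Euclidean distances between the rows of that matrix. If you have already calculated a distance matrix (be it correlation-based or anything else), you have to convert it to condensed form with distance.squareform and pass that instead. That is,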
dataframe = pd.DataFrame(data=random_returns, index=dates)
corr = 1 - dataframe.corr()
corr_condensed = hc.distance.squareform(corr) # convert to condensed
z = hc.linkage(corr_condensed, method='average')
dendrogram = hc.dendrogram(z, labels=corr.columns)
plt.show()
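As a sanity check, the sketch below builds the linkage both ways on seeded random data and compares the results. The seed is an assumption, and checks=False is passed to squareform only to skip its symmetry and zero-diagonal validation in case of floating-point noise.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)  # illustrative seed
returns = pd.DataFrame(rng.normal(0, 0.01, size=(365, 5)))

# Route 1: let linkage compute correlation distances from the raw observations
z_raw = hc.linkage(returns.values.T, method="average", metric="correlation")

# Route 2: build the distance matrix with pandas, condense it, then link
dist = 1 - returns.corr().values            # square matrix, zeros on the diagonal
condensed = squareform(dist, checks=False)  # vector of length m*(m-1)/2
z_precomputed = hc.linkage(condensed, method="average")

print(np.allclose(z_raw, z_precomputed))
```

To use another correlation measure, swap in returns.corr(method="spearman") (or another supported method) when building dist.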

plot gaussian between points

Hi, I have a list of points and I would like to plot a Gaussian curve between those points to generate a kind of time series.
For example, here I use a date range:
import pandas as pd
a=pd.date_range(start="2015-06-16 ",end="2015-06-23 ", freq='H')
and I would like a Gaussian density curve (i.e. a normal distribution) between "2015-06-16" and "2015-06-17", another one between "2015-06-17" and "2015-06-18", and so on.
I have no idea how to do that.
Thank you
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
a = pd.date_range(start="2015-06-16 ",end="2015-06-23 ", freq='H')
x = np.linspace(-3, 3, 24)
norm_pdf = stats.norm.pdf(x, 0, 1)
density = np.tile(norm_pdf, (len(a)-1)//24)  # integer division: np.tile needs an integer repeat count
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(a[1:], density)
ax.set_ylim([0, 1])
Assuming constant vol and no drift (given your short time horizon), the following should work:
import pandas as pd
import numpy as np
annualized_vol = 0.30 # i.e. 30%
delta_t = 1 / 252. / 24. # Assuming 252 days in a trading year and 24 hours in a trading day.
initial_price = 100
idx = pd.date_range(start="2015-06-16 ", end="2015-06-23 ", freq='H')
dx = pd.DataFrame(np.random.randn(len(idx)), index=idx) * annualized_vol * delta_t ** .5
(initial_price * (1 + dx.cumsum())).plot()  # 1 + cumulative return, so the path starts near initial_price
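A seeded variant of the same random walk, as a sketch: it compounds the hourly shocks multiplicatively, so the path starts at initial_price and cannot go negative. The seed, the exponential compounding, and pinning the first shock to zero are assumptions added for illustration; the volatility scaling is the one used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # illustrative seed
annualized_vol = 0.30           # i.e. 30%
delta_t = 1 / 252 / 24          # one hour as a fraction of a trading year
initial_price = 100.0

idx = pd.date_range(start="2015-06-16", end="2015-06-23", freq=pd.Timedelta(hours=1))

# Hourly shocks: standard normal draws scaled by vol * sqrt(time step)
shocks = rng.standard_normal(len(idx)) * annualized_vol * np.sqrt(delta_t)
shocks[0] = 0.0  # pin the first point to the initial price (assumption)

prices = pd.Series(initial_price * np.exp(np.cumsum(shocks)), index=idx)
print(prices.head())
```

prices.plot() draws the path against the hourly timestamps in the same way as the DataFrame plot above.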
