Python Plotly CDF with Frequency Distribution Data - python

How do you make a CDF plot with frequency distribution data in a Pandas DataFrame using Plotly? Suppose the following toy data:
value freq
1 3
2 2
3 1
All of the examples show how to do it with raw data that looks like:
value
1
1
1
2
2
3
I am able to do it with Pandas .plot like so (but I would prefer to do the same with Plotly):
stats_df = df
stats_df['pdf'] = stats_df['count'] / sum(stats_df['count'])
# calculate CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()
stats_df = stats_df.reset_index()
# plot
stats_df.plot(x = 'n_calls',
              y = ['pdf', 'cdf'],
              logx = True,
              kind = 'line',
              grid = True)
If you would like to demonstrate with a toy dataset, here's one: https://raw.githubusercontent.com/plotly/datasets/master/2010_alcohol_consumption_by_country.csv
References:
https://plotly.com/python/v3/discrete-frequency/
https://plotly.com/python/distplot/

Plotly's figure factory has no ready-made CDF chart.
With create_distplot, only a PDF curve and a histogram can be plotted (see below for the alcohol sample).
The code for the graph above looks like this:
import plotly.figure_factory as ff
import pandas as pd
data = pd.read_csv(
'https://raw.githubusercontent.com/plotly/datasets/master/2010_alcohol_consumption_by_country.csv')
x = data['alcohol'].values.tolist()
group_labels = ['']
fig = ff.create_distplot([x], group_labels,
                         bin_size=.25, show_rug=False)
fig.show()
If you need exactly the CDF, then use third-party libraries for data preparation.
In the example below, I am using NumPy.
The code for the graph above looks like this:
import plotly.graph_objs as go
import numpy as np
import pandas as pd
data = pd.read_csv(
'https://raw.githubusercontent.com/plotly/datasets/master/2010_alcohol_consumption_by_country.csv')
x = data['alcohol'].values.tolist()
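hist, bin_edges = np.histogram(x, bins=100, density=True)
# cumulative probability up to the right edge of each bin
cdf = np.cumsum(hist * np.diff(bin_edges))
# np.histogram returns one more edge than bins, so align x with y explicitly
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
fig = go.Figure(data=[
    go.Bar(x=bin_centers, y=hist, name='Histogram'),
    go.Scatter(x=bin_edges[1:], y=cdf, name='CDF')
])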
fig.show()
Note that the CDF is a piecewise-linear (broken) line. This is because it is computed empirically from the binned data and is not a fitted function for the unknown distribution.
To get a smooth function, you need to know the distribution law.
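For data that is already a frequency table, as in the question, the same idea can be applied without np.histogram: normalise the counts and take the cumulative sum. The following is a minimal sketch, assuming the toy columns value and freq from the question; the helper names add_pdf_cdf and make_cdf_figure are made up for illustration and are not part of Plotly or pandas.

```python
import pandas as pd


def add_pdf_cdf(freq_df, value_col="value", freq_col="freq"):
    """Return a copy sorted by value with 'pdf' and 'cdf' columns added."""
    out = freq_df.sort_values(value_col).reset_index(drop=True)
    out["pdf"] = out[freq_col] / out[freq_col].sum()
    out["cdf"] = out["pdf"].cumsum()
    return out


def make_cdf_figure(stats_df, value_col="value"):
    """Build a Plotly figure with the PDF as bars and the CDF as a step line."""
    # Imported here so the data preparation above also runs without Plotly.
    import plotly.graph_objects as go

    return go.Figure(data=[
        go.Bar(x=stats_df[value_col], y=stats_df["pdf"], name="PDF"),
        go.Scatter(x=stats_df[value_col], y=stats_df["cdf"], name="CDF",
                   mode="lines+markers", line=dict(shape="hv")),
    ])


# Toy frequency table from the question.
df = pd.DataFrame({"value": [1, 2, 3], "freq": [3, 2, 1]})
stats_df = add_pdf_cdf(df)
```

Calling make_cdf_figure(stats_df).show() displays the chart. The "hv" line shape draws the CDF as steps, which suits discrete values; drop it if straight segments between points are preferred.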

Related

Annotate Min/Max/Median in Matplotlib Violin Plot

Given this example code:
import pandas as pd
import matplotlib.pyplot as plt
data = 'https://raw.githubusercontent.com/marsja/jupyter/master/flanks.csv'
df = pd.read_csv(data, index_col=0)
# Subsetting using Pandas query():
congruent = df.query('TrialType == "congruent"')['RT']
incongruent = df.query('TrialType == "incongruent"')['RT']
# Combine data
plot_data = list([incongruent, congruent])
fig, ax = plt.subplots()
xticklabels = ['Incongruent', 'Congruent']
ax.set_xticks([1, 2])
ax.set_xticklabels(xticklabels)
ax.violinplot(plot_data, showmedians=True)
Which results in the following plot:
How can I annotate the min, max, and mean lines with their respective values?
I haven't been able to find examples online that show how to annotate violin plots in this way. If we set plot = ax.violinplot(plot_data, showmedians=True) then we can access attributes like plot['cmaxes'], but I can't quite figure out how to use that for annotations.
Here is an example of what I am trying to achieve:
So this was as easy as getting the medians/mins/maxes and then enumerating, adding the annotation with plt.text, and adding some small values for positioning:
medians = results_df.groupby(['model_cat'])['test_f1'].median()
for i, v in enumerate(medians):
    plt.text((i+.85), (v+.001), str(round(v, 3)), fontsize = 12)
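The same approach can be sketched for the plot_data list from the question, covering the min, median, and max of each violin. This is only an illustration: the helper names violin_labels and annotate_violins and the x_offset value are assumptions, and it relies on ax.violinplot placing the violins at positions 1, 2, ... by default.

```python
import numpy as np


def violin_labels(groups, decimals=3):
    """Return (x_position, y_value, text) for min, median and max of each group."""
    labels = []
    # ax.violinplot puts the first violin at x=1, the second at x=2, and so on.
    for position, values in enumerate(groups, start=1):
        values = np.asarray(values, dtype="float64")
        for stat in (np.min, np.median, np.max):
            y = float(stat(values))
            labels.append((position, y, str(round(y, decimals))))
    return labels


def annotate_violins(ax, groups, x_offset=0.05):
    """Write each statistic next to its line on an existing violin plot."""
    for position, y, text in violin_labels(groups):
        ax.text(position + x_offset, y, text, va="center", fontsize=10)
```

After ax.violinplot(plot_data, showmedians=True), calling annotate_violins(ax, plot_data) places the numbers just to the right of each violin's centre line; adjust x_offset if they overlap the violin body.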

How to draw the Probability Density Function (PDF) plot in Python?

I'd like to ask how to draw the Probability Density Function (PDF) plot in Python.
This is my code.
import numpy as np
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.random.normal(50, 3, 1000)
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source)
df
I generated a data frame. Then, I tried to draw a PDF graph.
df["AGW"].sort_values()
df_mean = np.mean(df["AGW"])
df_std = np.std(df["AGW"])
pdf = stats.norm.pdf(df["AGW"], df_mean, df_std)
plt.plot(df["AGW"], pdf)
I obtained the graph above. What did I do wrong? Could you let me know how to draw the Probability Density Function (PDF) plot, which is also known as the normal distribution graph?
Could you let me know which code (or library) I need to use to draw the PDF graph?
Many thanks!
You just need to sort the values before plotting (I did not really check what comes after the edit):
pdf = stats.norm.pdf(df["AGW"].sort_values(), df_mean, df_std)
plt.plot(df["AGW"].sort_values(), pdf)
And it will work.
The line df["AGW"].sort_values() doesn't change df. Maybe you meant df.sort_values(by=['AGW'], inplace=True).
In that case the full code will be :
import numpy as np
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.random.normal(50, 3, 1000)
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source)
df.sort_values(by=['AGW'], inplace=True)
df_mean = np.mean(df["AGW"])
df_std = np.std(df["AGW"])
pdf = stats.norm.pdf(df["AGW"], df_mean, df_std)
plt.plot(df["AGW"], pdf)
Which gives :
Edit :
I think here we already have the distribution (x is normally distributed), so we don't need to generate the PDF from x. The PDF is normally used for something like this:
import math
mu = 50
variance = 3
# note: np.random.normal(50, 3, 1000) above uses 3 as the standard deviation, not the variance
sigma = math.sqrt(variance)
x = np.linspace(mu - 5*sigma, mu + 5*sigma, 1000)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.show()
Here we don't need to generate the distribution from the x points; we only need to plot the density of the distribution we already have.
So you might use this :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = np.random.normal(50, 3, 1000) #Generating Data
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source) #Converting to pandas DataFrame
df.plot(kind = 'density'); # or df["AGW"].plot(kind = 'density');
Which gives :
You might use other packages if you want, like seaborn :
import seaborn as sns
plt.figure(figsize = (5,5))
sns.kdeplot(df["AGW"] , bw_method = 0.5 , fill = True) # bw is deprecated in newer seaborn; bw_method replaces it
plt.show()
Or this :
import seaborn as sns
sns.set_style("whitegrid") # Setting style(Optional)
plt.figure(figsize = (10,5)) #Specify the size of figure
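# distplot is deprecated in newer seaborn; histplot with kde=True is the closest replacement
sns.histplot(x = df["AGW"] , bins = 10 , kde = True , stat = 'density' , color = 'teal'
             , line_kws=dict(linewidth = 4)) #kde curve drawn over the histogram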
plt.show()
Check this article for more.

Plotting probability density function in Python

I want to plot two probability density functions (pdf) based on values of a certain column in a dataframe. The first one for all the values that correspond to rows with target label = 0 and second one where target label = 1.
My attempt is below, but as you can see the curves do not look like a PDF (the max value is 0, and they are not confined to the X-axis ranges 0-1 and 5-6). I assume I can get something close by playing around with the bandwidth factor, but I am looking for a one-liner that just figures out the right parameters and plots a PDF (including figuring out the right X-axis start/end to use). Is there any such built-in function that does this? If not, I would appreciate some pointers on how to build something like this.
#matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.neighbors import KernelDensity
values = np.random.rand(10)
values_shift5 = np.random.rand(10) + 5
df = pd.DataFrame({'values' : values, 'label' : np.zeros(10)})
df = pd.concat([df, pd.DataFrame({'values' : values_shift5, 'label' : np.ones(10)})])
kde_label_0 = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(df[df.label == 0]['values'].values.reshape(-1, 1))
kde_label_1 = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(df[df.label == 1]['values'].values.reshape(-1, 1))
X_plot = np.linspace(0, 10, 50).reshape(-1, 1)
log_density_0 = kde_label_0.score_samples(X_plot)
log_density_1 = kde_label_1.score_samples(X_plot)
plt.plot(X_plot, log_density_0, label='Label 0')
plt.plot(X_plot, log_density_1, label='Label 1')
plt.legend()
plt.show()
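One likely reason the curves do not look like densities is that KernelDensity.score_samples returns the logarithm of the density, which is why the plotted maximum sits around 0. A minimal sketch of the fix, assuming the same kind of random toy data and bandwidth as above; the helper kde_density and its rule for padding the x-axis range are illustrative choices, not a built-in of scikit-learn.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


def kde_density(values, bandwidth=0.5, points=200, pad=4.0):
    """Return (grid, density) for a Gaussian KDE of a 1-D array."""
    values = np.asarray(values, dtype="float64")
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(values.reshape(-1, 1))
    # Extend the grid a few bandwidths past the data so the tails are visible.
    grid = np.linspace(values.min() - pad * bandwidth,
                       values.max() + pad * bandwidth, points)
    # score_samples returns the log-density; exponentiate to get the density.
    density = np.exp(kde.score_samples(grid.reshape(-1, 1)))
    return grid, density


rng = np.random.default_rng(0)
values = rng.random(10)             # stands in for the label 0 rows
values_shift5 = rng.random(10) + 5  # stands in for the label 1 rows
grid_0, density_0 = kde_density(values)
grid_1, density_1 = kde_density(values_shift5)
```

Plotting plt.plot(grid_0, density_0, label='Label 0') and the same for label 1 then gives non-negative curves whose area is close to 1. With only 10 points per label, the shape still depends strongly on the chosen bandwidth.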

Uncertain why trendline is not appearing on matplotlib scatterplot

I am trying to plot a trendline for a matplotlib scatterplot and am uncertain why the trendline is not appearing. What should I change in my code to make the trendline appear? Event is a categorical data type.
I've followed what most other stackoverflow questions suggest about plotting a trendline, but am uncertain why my trendline is not appearing.
#import libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pandas.plotting import register_matplotlib_converters
#register datetime converters
register_matplotlib_converters()
#read dataset using pandas
dataset = pd.read_csv("UsrNonCallCDCEvents_CDCEventType.csv")
#convert date to datetime type
dataset['Interval'] = pd.to_datetime(dataset['Interval'])
#convert other columns to numeric type
for cols in list(dataset):
    if cols != 'Interval' and cols != 'CDCEventType':
        dataset[cols] = pd.to_numeric(dataset[cols])
#create pivot of dataset
pivot_dataset = dataset.pivot(index='Interval',columns='CDCEventType',values='AvgWeight(B)')
#create scatterplot with trendline
x = pivot_dataset.index.values.astype('float64')
y = pivot_dataset['J-STD-025']
plt.scatter(x,y)
z = np.polyfit(x,y,1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.show()
This is the graph currently being output. I am trying to get this same graph, but with a trendline: https://imgur.com/a/o18a5Y3
It's also fine that x axis is not showing dates
A snippet of my dataframe looks like this: https://imgur.com/a/xJAcgEI
I've painted out the irrelevant column names
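No answer is recorded for this question here. One common cause, assuming the pivot leaves NaN values in the J-STD-025 column, is that np.polyfit cannot produce usable coefficients from missing values, so the trendline comes out as NaN and is not drawn. Datetimes cast to float64 are also very large numbers (nanoseconds since the epoch), which can make the fit poorly conditioned. The sketch below handles both; the helper name fit_trendline is hypothetical.

```python
import numpy as np


def fit_trendline(x, y, degree=1):
    """Fit a polynomial trend, ignoring NaN points and rescaling x for stability."""
    x = np.asarray(x, dtype="float64")
    y = np.asarray(y, dtype="float64")
    # Keep only the points where both coordinates are real numbers.
    mask = np.isfinite(x) & np.isfinite(y)
    # Centre and scale x so huge timestamp values do not dominate the fit.
    x_mean = x[mask].mean()
    x_scale = x[mask].std() or 1.0
    coeffs = np.polyfit((x[mask] - x_mean) / x_scale, y[mask], degree)
    poly = np.poly1d(coeffs)

    def trend(new_x):
        new_x = np.asarray(new_x, dtype="float64")
        return poly((new_x - x_mean) / x_scale)

    return trend


# Small made-up example with one missing y value.
trend = fit_trendline([0, 1, 2, 3, 4], [1, np.nan, 5, 7, 9])
```

In the question's code this would replace the polyfit and poly1d lines: trend = fit_trendline(x, y), followed by plt.plot(x, trend(x), "r--").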

What's the equivalent of fitdist and histfit in Python?

--- SAMPLE ---
I have a data set (sample) that contains 1 000 damage values (the values are very small, <1e-6) in a 1-dimensional array (see the attached .json file). The sample seems to follow a lognormal distribution:
--- PROBLEM & WHAT I ALREADY TRIED ---
I tried the suggestions in this post Fitting empirical distribution to theoretical ones with Scipy (Python)? and this post Scipy: lognormal fitting to fit my data by lognormal distribution. None of these works. :(
I always get something very large in Y-axis as the following:
Here is the code that I used in Python (and the data.json file can be downloaded from here):
from matplotlib import pyplot as plt
from scipy import stats as scistats
import json
with open("data.json", "r") as f:
    sample = json.load(f) # load data: a 1000 * 1 array with many small values( < 1e-6)
fig, axis = plt.subplots() # initiate a figure
N, nbins, patches = axis.hist(sample, bins = 40) # plot sample by histogram
axis.ticklabel_format(style = 'sci', scilimits = (-3, 4), axis = 'x') # make X-axis use scientific notation
axis.set_xlabel("Value")
axis.set_ylabel("Count")
plt.show()
fig, axis = plt.subplots()
param = scistats.lognorm.fit(sample) # fit data by Lognormal distribution
pdf_fitted = scistats.lognorm.pdf(nbins, * param[: -2], loc = param[-2], scale = param[-1]) # prepare data for ploting fitted distribution
axis.plot(nbins, pdf_fitted) # draw fitted distribution on the same figure
plt.show()
I tried other kinds of distributions, but when I try to plot the result, the Y-axis is always too large and I can't plot it together with my histogram. Where did I go wrong?
I have also tried out the suggestion in another question of mine: Use scipy lognormal distribution to fit data with small values, then show in matplotlib. But the value of the variable pdf_fitted is always too big.
--- EXPECTING RESULT ---
Basically, what I want is like this:
And here is the Matlab code that I used in the above screenshot:
fname = 'data.json';
sample = jsondecode(fileread(fname));
% fitting distribution
pd = fitdist(sample, 'lognormal')
% A combined command for plotting histogram and distribution
figure();
histfit(sample,40,"lognormal")
So if you have any idea of the equivalent command of fitdist and histfit in Python/Scipy/Numpy/Matplotlib, please post it !
Thanks a lot !
Try the distfit (or fitdist) library.
https://erdogant.github.io/distfit
pip install distfit
import numpy as np
# Example data
X = np.random.normal(10, 3, 2000)
y = [3,4,5,6,10,11,12,18,20]
# From the distfit library import the class distfit
from distfit import distfit
# Initialize
dist = distfit()
# Search for best theoretical fit on your empirical data
dist.fit_transform(X)
# Plot
dist.plot()
# summary plot
dist.plot_summary()
So in your case it would be:
dist = distfit(distr='lognorm')
dist.fit_transform(X)
Try seaborn:
import seaborn as sns, numpy as np
sns.set(); np.random.seed(0)
x = np.random.randn(100)
ax = sns.distplot(x)
I tried your dataset using the OpenTURNS library.
x is the list given in your JSON file.
import openturns as ot
from openturns.viewer import View
import matplotlib.pyplot as plt
# first format your list x as a sample of dimension 1
sample = ot.Sample(x,1)
# use the LogNormalFactory to build a Lognormal distribution according to your sample
distribution = ot.LogNormalFactory().build(sample)
# draw the pdf of the obtained distribution
graph = distribution.drawPDF()
graph.setLegends(["LogNormal"])
View(graph)
plt.show()
If you want the parameters of the distribution
print(distribution)
>>> LogNormal(muLog = -16.5263, sigmaLog = 0.636928, gamma = 3.01106e-08)
You can build the histogram the same way by calling HistogramFactory, then you can add one graph to another:
graph2 = ot.HistogramFactory().build(sample).drawPDF()
graph2.setColors(['blue'])
graph2.setLegends(["Histogram"])
graph2.add(graph)
view = View(graph2)
and set the boundaries values if you want to zoom
axes = view.getAxes()
_ = axes[0].set_xlim(-0.6e-07, 2.8e-07)
plt.show()
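With plain NumPy, the large Y-axis values in the question can be explained and worked around: a PDF for values near 1e-7 is naturally of the order 1e7, while the histogram shows counts. Multiplying the fitted density by n * bin_width puts the curve on the count scale, which is roughly what a histfit-style overlay needs. The sketch below assumes a two-parameter lognormal (no location shift) and uses synthetic data as a stand-in for data.json; the function names and the generator parameters are illustrative only.

```python
import numpy as np


def fit_lognormal(sample):
    """Maximum-likelihood mu and sigma of log(sample) for a lognormal without shift."""
    logs = np.log(np.asarray(sample, dtype="float64"))
    return logs.mean(), logs.std()


def lognormal_pdf(x, mu, sigma):
    """Density of a lognormal distribution with log-mean mu and log-std sigma."""
    x = np.asarray(x, dtype="float64")
    return (np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2))
            / (x * sigma * np.sqrt(2 * np.pi)))


def histfit_curve(sample, bins=40, points=200):
    """Return x, fitted curve on the count scale, and the histogram counts/edges."""
    sample = np.asarray(sample, dtype="float64")
    mu, sigma = fit_lognormal(sample)
    counts, edges = np.histogram(sample, bins=bins)
    x = np.linspace(edges[0], edges[-1], points)
    bin_width = edges[1] - edges[0]
    # Scale the density so its area matches the area of the count histogram.
    curve = lognormal_pdf(x, mu, sigma) * sample.size * bin_width
    return x, curve, counts, edges


# Synthetic stand-in for data.json: 1000 small positive values (assumption).
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=-16.5, sigma=0.64, size=1000)
mu, sigma = fit_lognormal(sample)
x, curve, counts, edges = histfit_curve(sample)
```

Drawing axis.hist(sample, bins=40) and axis.plot(x, curve) on the same axes then overlays the fit on the counts. Alternatively, pass density=True to axis.hist and plot the unscaled lognormal_pdf values.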
