For a while, I've been using both seaborn and plotly for visualization, depending on my needs at the moment. Lately, I've been trying to move completely to plotly, but there are still some things I can't figure out how to do.
For example, I used to use seaborn to check the distribution of some data, to see how well it fits a Gaussian distribution. This can be easily done with the following snippet:
import seaborn as sns
from scipy.stats import norm
sns.distplot(data, fit=norm)
I've been trying to achieve a similar quick Gaussian check with plotly express (px.histogram, to be more specific), but I can't get it done. Could you please help me with this?
EDIT
An example for "data" would be:
import numpy as np
np.random.seed(123)
data = np.random.noncentral_chisquare(3, 20, 1000)
The output should show the data histogram with its KDE, plus the KDE of an equivalent Gaussian. This is helpful when testing the results of transformations (log, Box-Cox...).
I think you may be interested in reading this. Apparently, at the moment the easiest way is to use plotly.figure_factory.create_distplot, but from the link above it looks like it's going to be discontinued.
import numpy as np
import plotly.figure_factory as ff
np.random.seed(123)
data = np.random.noncentral_chisquare(3, 20, 1000)
m = data.mean()
s = data.std()
gaussian_data = np.random.normal(m, s, 10000)
fig = ff.create_distplot(
    [data, gaussian_data],
    group_labels=["plot", "gaussian"],
    curve_type="kde")
fig.data = [fig.data[0], fig.data[2], fig.data[3]]
fig.update_layout(showlegend=False)
fig.show()
And if instead of fig.data = ... you use
lst = list(fig.data)
lst.pop(1)
fig.data = tuple(lst)
you'll get the same figure with the rug plots kept.
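Since create_distplot may go away, here is a minimal sketch of the same check built from plain graph objects. It is only one possible approach, not an official replacement: the helper names density_curves and gaussian_check_figure are made up for this example, and instead of a KDE of random normal samples it draws the exact normal pdf with the mean and standard deviation of the data.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm


def density_curves(data, n_points=200):
    """Return a grid, the KDE of `data` on that grid, and the pdf of a
    normal distribution with the same mean and standard deviation."""
    data = np.asarray(data, dtype=float)
    grid = np.linspace(data.min(), data.max(), n_points)
    kde_values = gaussian_kde(data)(grid)
    normal_values = norm.pdf(grid, loc=data.mean(), scale=data.std())
    return grid, kde_values, normal_values


def gaussian_check_figure(data):
    """Histogram (as a density) with the KDE and the matching normal pdf."""
    # Imported here so density_curves() stays usable without plotly.
    import plotly.graph_objects as go

    grid, kde_values, normal_values = density_curves(data)
    fig = go.Figure()
    fig.add_trace(go.Histogram(x=data, histnorm="probability density", name="data"))
    fig.add_trace(go.Scatter(x=grid, y=kde_values, mode="lines", name="KDE"))
    fig.add_trace(go.Scatter(x=grid, y=normal_values, mode="lines", name="normal fit"))
    return fig


np.random.seed(123)
data = np.random.noncentral_chisquare(3, 20, 1000)
grid, kde_values, normal_values = density_curves(data)
# gaussian_check_figure(data).show()  # uncomment to display the figure
```

The histogram uses histnorm="probability density" so that the bars and the two curves share the same y scale.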
I have the following dataset
ids count
1 2000210
2 -23123
3 100
4 500
5 102300120
...
1 million 123213
I want a graph with the group of ids (all unique ids) on the x axis and count on the y axis, as a distribution chart that looks like the following
How can I achieve this with a pandas DataFrame in Python?
I tried different ways, but I only get a basic plot, nothing as complex as the drawing.
What I tried
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
df.plot(x="range", y="count");
But the plots don't make any sense. I am also new to plotting in pandas. I searched the internet for a long time for charts like this and could really use some help with such graphs.
From what I understood from your question and comments here is what you can do:
1) Import the libraries and set the default theme:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
2) Create your dataframe:
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
3) Plot your data
3.1) Simple take using only the seaborn library:
sns.kdeplot(data=df, x="count", weights="range")
Output:
3.2) More complex take using seaborn and matplotlib libraries:
sns.histplot(x=df["count"], weights=df["range"], discrete=True,
color='darkblue', edgecolor='black',
kde=True, kde_kws={'cut': 2}, line_kws={'linewidth': 4})
plt.ylabel("range")
plt.show()
Output:
Personal note: please make sure to check all the solutions; if they are not enough, leave a comment and we will work together to find you a solution.
For a distribution plot of ids you can use:
import numpy as np
import pandas as pd
np.random.seed(seed=123)
df = pd.DataFrame(np.random.randn(1000000), columns=["ids"])
df['ids'].plot(kind='kde')
I have the following .csv data:
Simulation Run,[urea] (μM),[NO3-] (μM),[NH4+] (μM),[NO2-] (μM),[O2] (μM),[HCO3-] (μM),[OH-] (μM),[H+] (μM),[H2O] (μM)
/Run_01,1124.3139186264032,49.79709670397852,128.31458304321205,0.0,4.0,140000.0,0.1,0.1,55000000.0
/Run_02,1.0017668367460492e-159,2426.7395169966485,3.1544859186304598e-09,1.975005700484566e-10,4.0,140000.0,0.1,0.1,55000000.0
/Run_03,9.905001536507822e-160,2426.739516996945,2.861369463189477e-09,1.7910618538551373e-10,4.0,140000.0,0.1,0.1,55000000.0
/Run_04,1123.3362048916795,49.7956932352008,130.27141398143655,0.0,4.0,140000.0,0.1,0.1,55000000.0
/Run_05,1101.9594005273052,49.792379912298884,173.02833603309404,0.0,4.0,140000.0,0.1,0.1,55000000.0
I would like to plot it in a series of scatterplot matrices to look at the relationships between the different variables. Much like how it is done here. NOTE: In the linked example the person is asking how to accomplish this in altair. I want to do this in Matplotlib.
Using the above code as reference, here is the code I'm working with:
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
from math import ceil
def graph_data(f: str):
    """
    Represents the data
    as a series of scatter-plot matrices.
    """
    df = pd.read_csv(f)
    NROWS = ceil((len(df.columns) - 1) / 3)
    # Although the number of variables could vary,
    # I would like no more than 3 charts per row.
    NCOLS = 3
    fname = f[:-4] + '.pdf'
    with PdfPages(fname) as pdf:
        scatter_matrix(df, alpha=0.2, figsize=(NROWS, NCOLS), diagonal='kde')
        pdf.savefig(bbox_inches='tight')
        plt.close()
When I try to run this, here is the error I get:
[LOTS OF TRACEBACK]...numpy.linalg.LinAlgError: singular matrix
Is this happening because the number of variables isn't a perfect square number (thereby not yielding a square matrix)? Is there a way to avoid this?
EDIT:
I forgot to specify my import statements so I have those in now.
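One possible cause, offered as an assumption because the traceback is truncated: diagonal='kde' fits a Gaussian KDE to every column, and a column whose values are all identical has zero variance, which makes the covariance matrix singular. In the sample rows, [O2], [HCO3-], [OH-], [H+] and [H2O] are constant, so the number of variables not being a perfect square is probably not the issue. The sketch below uses an abbreviated copy of the data (shortened column names and rounded values, for illustration only) and keeps only the columns that vary.

```python
import io

import pandas as pd
from pandas.plotting import scatter_matrix

# Abbreviated, rounded copy of a few rows from the question (illustration only).
csv_text = """Simulation Run,urea,NO3,O2,H2O
/Run_01,1124.31,49.797,4.0,55000000.0
/Run_02,1.0e-159,2426.74,4.0,55000000.0
/Run_04,1123.34,49.796,4.0,55000000.0
/Run_05,1101.96,49.792,4.0,55000000.0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep numeric columns that take more than one distinct value.
numeric = df.select_dtypes("number")
varying = numeric.loc[:, numeric.nunique() > 1]
print(list(varying.columns))


def plot_varying(frame):
    """Draw the scatter matrix for the non-constant columns only."""
    return scatter_matrix(frame, alpha=0.2, diagonal="kde")
```

Calling plot_varying(varying) then draws the matrix; the constant columns carry no relationship to plot anyway.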
I have a list as below:
freq = [29342, 28360, 26029, 21418, 20771, 18372, 18239, 18070, 17261, 17102]
I want to show the values of n-th and m-th element of the x-axis and draw a vertical line
plt.plot(freq[0:1000])
For example, in the graph above, how can I show the value of the 100th element on the x-axis on the line?
I tried kneed, but it shows only one elbow. I guess it is at the 50th element? But what exactly are its x and y values?
from kneed import KneeLocator
kn = KneeLocator(list(range(0, 1000)), freq[0:1000], curve='convex', direction='decreasing')
import matplotlib.pyplot as plt
kn.plot_knee()
#plt.axvline(x=50, color='black', linewidth=2, alpha=.7)
plt.annotate(freq[50], xy=(50, freq[50]), size=10)
You might think that everybody knows this library kneed. Well, I don't know about others but I have never seen that one before (it does not even have a tag here on SO).
But their documentation is excellent (qhull take note!). So, you could do something like this:
#fake data generation
import numpy as np
x=np.linspace(1, 10, 100)
freq=x**(-1.9)
#here happens the actual plotting
from kneed import KneeLocator
import matplotlib.pyplot as plt
kn = KneeLocator(x, freq, curve='convex', direction='decreasing')
xk = kn.knee
yk = kn.knee_y
kn.plot_knee()
plt.annotate(f'Found knee at x={xk:.2f}, y={yk:.2f}', xy=(xk*1.1, yk*1.1) )
plt.show()
Sample output:
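For the other part of the question (showing the values of the n-th and m-th elements with a vertical line), a minimal sketch with plain matplotlib could look like this. The helper name mark_positions is made up for this example, and indices 2 and 7 simply stand in for n and m.

```python
import matplotlib.pyplot as plt

freq = [29342, 28360, 26029, 21418, 20771, 18372, 18239, 18070, 17261, 17102]


def mark_positions(values, positions, ax=None):
    """Plot `values` and mark each index in `positions` with a vertical
    line and a text label showing the value at that index."""
    if ax is None:
        _, ax = plt.subplots()
    ax.plot(values)
    marked = {}
    for pos in positions:
        y = values[pos]
        ax.axvline(x=pos, color="black", linewidth=1, alpha=0.7)
        # Offset the label by a few points so it does not sit on the line.
        ax.annotate(str(y), xy=(pos, y), xytext=(5, 5), textcoords="offset points")
        marked[pos] = y
    return ax, marked


# Indices 2 and 7 stand in for the n-th and m-th elements.
ax, marked = mark_positions(freq, [2, 7])
print(marked)  # {2: 26029, 7: 18070}
# plt.show()  # uncomment to display the figure
```

The x coordinate is just the list index and the y coordinate is the list value at that index, which is also what kn.knee and kn.knee_y report for the knee.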
--- SAMPLE ---
I have a data set (sample) that contains 1 000 damage values (the values are very small, <1e-6) in a 1-dimensional array (see the attached .json file). The sample seems to follow a lognormal distribution:
--- PROBLEM & WHAT I ALREADY TRIED ---
I tried the suggestions in this post Fitting empirical distribution to theoretical ones with Scipy (Python)? and this post Scipy: lognormal fitting to fit my data with a lognormal distribution. Neither of them works. :(
I always get very large values on the Y-axis, as in the following:
Here is the code that I used in Python (and the data.json file can be downloaded from here):
from matplotlib import pyplot as plt
from scipy import stats as scistats
import json
with open("data.json", "r") as f:
    sample = json.load(f) # load data: a 1000 * 1 array with many small values ( < 1e-6)
fig, axis = plt.subplots() # initiate a figure
N, nbins, patches = axis.hist(sample, bins = 40) # plot sample as a histogram
axis.ticklabel_format(style = 'sci', scilimits = (-3, 4), axis = 'x') # make the X-axis use scientific notation
axis.set_xlabel("Value")
axis.set_ylabel("Count")
plt.show()
fig, axis = plt.subplots()
param = scistats.lognorm.fit(sample) # fit data with a lognormal distribution
pdf_fitted = scistats.lognorm.pdf(nbins, * param[: -2], loc = param[-2], scale = param[-1]) # evaluate the fitted pdf at the bin edges
axis.plot(nbins, pdf_fitted) # draw the fitted distribution
plt.show()
I tried other kinds of distributions, but when I plot the result, the Y-axis values are always too large and I can't plot them together with my histogram. Where did I go wrong?
I have also tried the suggestion in my other question: Use scipy lognormal distribution to fit data with small values, then show in matplotlib. But the values in pdf_fitted are always too big.
--- EXPECTING RESULT ---
Basically, what I want is like this:
And here is the Matlab code that I used in the above screenshot:
fname = 'data.json';
sample = jsondecode(fileread(fname));
% fitting distribution
pd = fitdist(sample, 'lognormal')
% A combined command for plotting histogram and distribution
figure();
histfit(sample,40,"lognormal")
So if you know an equivalent of fitdist and histfit in Python/Scipy/Numpy/Matplotlib, please post it!
Thanks a lot !
Try the distfit (or fitdist) library.
https://erdogant.github.io/distfit
pip install distfit
import numpy as np
# Example data
X = np.random.normal(10, 3, 2000)
y = [3,4,5,6,10,11,12,18,20]
# From the distfit library import the class distfit
from distfit import distfit
# Initialize
dist = distfit()
# Search for the best theoretical fit on your empirical data
dist.fit_transform(X)
# Plot
dist.plot()
# summary plot
dist.plot_summary()
So in your case it would be:
dist = distfit(distr='lognorm')
dist.fit_transform(X)
Try seaborn:
import seaborn as sns, numpy as np
sns.set(); np.random.seed(0)
x = np.random.randn(100)
ax = sns.distplot(x)
I tried your dataset using the OpenTURNS library.
x is the list given in your json file.
import openturns as ot
from openturns.viewer import View
import matplotlib.pyplot as plt
# first format your list x as a sample of dimension 1
sample = ot.Sample(x,1)
# use the LogNormalFactory to build a Lognormal distribution according to your sample
distribution = ot.LogNormalFactory().build(sample)
# draw the pdf of the obtained distribution
graph = distribution.drawPDF()
graph.setLegends(["LogNormal"])
View(graph)
plt.show()
If you want the parameters of the distribution
print(distribution)
>>> LogNormal(muLog = -16.5263, sigmaLog = 0.636928, gamma = 3.01106e-08)
You can build the histogram the same way by calling HistogramFactory, then you can add one graph to another:
graph2 = ot.HistogramFactory().build(sample).drawPDF()
graph2.setColors(['blue'])
graph2.setLegends(["Histogram"])
graph2.add(graph)
view = View(graph2)
and set the axis limits if you want to zoom:
axes = view.getAxes()
_ = axes[0].set_xlim(-0.6e-07, 2.8e-07)
plt.show()
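A likely explanation for the "too large" Y values in the question, with a sketch in plain SciPy and Matplotlib: a pdf integrates to 1, so for values around 1e-7 its height is of order 1e7, while the histogram shows counts. Multiplying the pdf by (number of samples * bin width) puts both on the count scale, similar in spirit to what histfit draws. The sample below is synthetic (the real data.json is not available here), and fixing the location at 0 is an assumption; under that assumption the maximum-likelihood lognormal parameters are simply the mean and standard deviation of log(sample).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic stand-in for data.json (assumed values, not the real data).
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=-16.5, sigma=0.6, size=1000)

# With the location fixed at 0, the maximum-likelihood lognormal fit is
# the mean and standard deviation of log(sample).
log_sample = np.log(sample)
mu, sigma = log_sample.mean(), log_sample.std()
fitted = stats.lognorm(s=sigma, scale=np.exp(mu))

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(sample, bins=40, label="sample")
bin_width = edges[1] - edges[0]

# Evaluate the pdf on a fine grid rather than only at the bin edges.
x = np.linspace(edges[0], edges[-1], 400)
pdf = fitted.pdf(x)

# Rescale the density to expected counts per bin.
pdf_as_counts = pdf * sample.size * bin_width
ax.plot(x, pdf_as_counts, label="lognormal fit")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
ax.legend()
# plt.show()  # uncomment to display the figure
```

An alternative is to keep the pdf as it is and pass density=True to ax.hist, so the histogram is drawn on the density scale instead.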
I have a series of data which consists of values from several experiments (1-40, in the MWE it is 1-5). The overall amount of entries in my original data is ~4.000.000, which I try to smooth in order to display it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import spline
from statsmodels.nonparametric.smoothers_lowess import lowess
df = pd.DataFrame()
df["values"] = np.random.randint(100000, 200000, 1000)
df["id"] = [1,2,3,4,5] * 200
plt.figure(1, figsize=(11.69,8.27))
# Both fail for my amount of data:
plt.plot(spline(df["values"], df["id"], range(100)), "r-")
plt.plot(lowess(df["values"], df["id"]), "r-")
Both scipy.interpolate and statsmodels.nonparametric.smoothers_lowess.lowess throw out-of-memory exceptions for my data. Is there an efficient way to solve this, like in, e.g., GNU R using ggplot2 and geom_smooth()?
I can't quite tell what you're getting at with all the dimensions to your data, but one very simple thing you can try is to just use the 'markevery' kwarg like so:
import numpy as np
import matplotlib.pyplot as plt
x=np.linspace(1,100,10**7)  # the number of samples must be an integer
y=x**2
plt.figure(1, figsize=(11.69,8.27))
plt.plot(x,y,markevery=100)
plt.show()
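Note that markevery only controls where markers are drawn (every 100th point here); the line itself is still drawn through all points, so you need to set a marker style to see any effect.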
If that doesn't help then you may want to try just a simple numpy interpolation with fewer samples like so:
x_large=np.linspace(1,100,10**7)
y_large=x_large**2
x_small=np.linspace(1,100,10**3)
y_small=np.interp(x_small,x_large,y_large)
plt.plot(x_small,y_small)
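If the goal is smoothing rather than just thinning, another cheap option is to average consecutive blocks of points before plotting. This is only a sketch under my own assumptions: the helper name block_mean and the block size of 10 are arbitrary, and it is a plain block average, not what geom_smooth() computes.

```python
import numpy as np
import pandas as pd


def block_mean(values, block_size):
    """Average consecutive blocks of `block_size` values.
    Trailing values that do not fill a whole block are dropped."""
    values = np.asarray(values, dtype=float)
    n_blocks = values.size // block_size
    trimmed = values[: n_blocks * block_size]
    return trimmed.reshape(n_blocks, block_size).mean(axis=1)


# Same shape of data as in the question's example.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "values": rng.integers(100000, 200000, 1000),
    "id": [1, 2, 3, 4, 5] * 200,
})

# One smoothed curve per experiment id, each reduced from 200 to 20 points.
smoothed = {
    exp_id: block_mean(group["values"], 10)
    for exp_id, group in df.groupby("id")
}
```

Each entry of smoothed can then be passed to plt.plot. Memory use stays linear in the input size, because the reshape only creates a view of the trimmed array.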