I have a histogram from measured data and I want to find an envelope (a continuous function) of this histogram. What do you suggest? How can I do this in Python?
import matplotlib.pyplot as plt

def plot_histogram_of_real_data(file_name='/home/me/data.txt'):
    plt.figure('Histogram of real data')
    data = load_measured_data(file_name)  # the asker's own loader (not shown)
    n, bins, patches = plt.hist(data, 30, facecolor='green', alpha=0.75)
    plt.grid()
    plt.show()
You can fit the data that you get from the histogram in one of several ways:
Use numpy.polyfit for polynomial fits (https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html)
Use scipy.optimize.curve_fit for fitting arbitrary functions
There is also kernel density estimation: scipy.stats.gaussian_kde, which is a standard representation for most statisticians.
In seaborn, you can plot sns.kdeplot for a single set of data, and sns.violinplot for multiple sets of data. For data which may vary significantly, I would suggest using the Kernel density estimates, rather than fitting some function of your own from histograms.
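As a minimal sketch of the kernel density suggestion (not the answerer's own code), the example below builds a smooth envelope with scipy.stats.gaussian_kde and rescales it from a probability density to raw counts so it can sit on top of an unnormalized histogram. The synthetic `data` array, the variable names, and the `plot_envelope` helper are assumptions for illustration; the measured data would take the place of `data`.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for the measured data (assumption: a 1-D array of floats).
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.5, size=2000)

n_bins = 30
counts, bin_edges = np.histogram(data, bins=n_bins)
bin_width = bin_edges[1] - bin_edges[0]

# gaussian_kde estimates a probability density (it integrates to 1).
kde = gaussian_kde(data)
x_grid = np.linspace(bin_edges[0], bin_edges[-1], 400)
density = kde(x_grid)

# Rescale the density to counts per bin so it matches the bar heights
# of a histogram drawn without density normalization.
envelope = density * data.size * bin_width


def plot_envelope():
    """Hypothetical helper: draw the histogram with the KDE envelope on top."""
    import matplotlib.pyplot as plt

    plt.hist(data, bins=n_bins, facecolor='green', alpha=0.75)
    plt.plot(x_grid, envelope, color='black')
    plt.grid()
    plt.show()
```

Calling `plot_envelope()` shows the bars and the curve together. The default bandwidth is used here; whether it suits a given data set is a judgment call, and gaussian_kde accepts a `bw_method` argument to change it.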
Related
So I wrote some code with the help of lmfit to fit a Gaussian curve on some histogram data. While the curve itself is fine, when I try to plot the results in matplotlib, it displays the fit along with the data points. In reality, I want to plot histogram bars with the curve fit. How do you do this? Or alternatively, is there a way in lmfit to only show the fit curve and then add the histogram plot and combine them together?
Relevant part of my code:
counts, bin_edges = np.histogram(some_array, bins=1000)
bin_widths = np.diff(bin_edges)
x = bin_edges[:-1] + (bin_widths / 2)
y = counts
mod = GaussianModel()
pars = mod.guess(y, x=x)
final_fit = mod.fit(y, pars, x=x)
final_fit.plot_fit()
plt.show()
Here's the graphed result:
Gaussian curve
lmfit's builtin plotting routines are minimal wrappers around matplotlib, intended to give reasonable default plots for many cases. They don't make histograms.
But the arrays are readily available and using matplotlib to make a histogram is easy. I think all you need is:
import matplotlib.pyplot as plt
plt.hist(some_array, bins=1000, rwidth=0.5, label='binned data')
plt.plot(x, final_fit.best_fit, label='best fit')
plt.legend()
plt.show()
I have a DataFrame df_data with two columns 'Egy' and 'fx' that I plot in this way:
plot_1 = df_data.plot(x="Egy", y="fx", color="red", ax=ax1, linewidth=0.85)
plot_1.set_xscale('log')
plt.show()
But then I want to smooth this curve using spline like this:
from scipy.interpolate import spline
import numpy as np
x_new = np.linspace(df_data['Egy'].min(), df_data['Egy'].max(),500)
f = spline(df_data['Egy'], df_data['fx'],x_new)
plot_1 = ax1.plot(x_new, f, color="black", linewidth=0.85)
plot_1.set_xscale('log')
plt.show()
And the plot I get is this (ignore the blue scatter points).
There are a lot of "peaks" in the smooth curve, mainly at lower x. How can I smooth this curve properly?
When I follow the "busybear" suggestion to use np.logspace instead of np.linspace, I get the following picture, which is not very satisfactory either.
You have your x values linearly scaled with np.linspace but your plot is log scaled. You could try np.geomspace for your x values or plot without the log scale.
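To illustrate the point about spacing, here is a small sketch comparing np.linspace and np.geomspace over a range that covers several decades. The range limits and variable names are made up for the example.

```python
import numpy as np

# Hypothetical x range spanning five decades.
x_min, x_max = 0.01, 1000.0

x_lin = np.linspace(x_min, x_max, 500)   # equal steps in x
x_geo = np.geomspace(x_min, x_max, 500)  # equal steps in log(x)

# Number of evaluation points that fall in the lowest decade, [0.01, 0.1).
n_lin = int((x_lin < 0.1).sum())
n_geo = int((x_geo < 0.1).sum())
print(n_lin, n_geo)
```

With linear spacing almost all evaluation points sit at the high end of the range, so on a log axis the left part of the curve is drawn from very few points. Geometric spacing spreads them evenly across the decades.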
Spline interpolation only works well for data that are already smooth. What you need is to regularize the data first and interpolate afterwards. This will help to smooth out the bumps. Regularization is an advanced topic, and it would not be appropriate to discuss it in detail here.
Update: for regularization using machine learning, you might look into the scikit-learn library for Python.
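The answer does not say which regularization it has in mind. One possible reading, sketched below under that assumption, is a smoothing spline: scipy.interpolate.UnivariateSpline with a positive smoothing factor `s` trades fidelity to each noisy point for a smoother curve, while `s=0` passes through every point. The fit is done in log10(x) because the plot uses a log axis. The synthetic `egy` and `fx` arrays stand in for df_data['Egy'] and df_data['fx'].

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Synthetic stand-ins for df_data['Egy'] and df_data['fx'] (assumption).
rng = np.random.default_rng(1)
egy = np.geomspace(1e-2, 1e3, 60)
log_egy = np.log10(egy)
noise_sigma = 0.1
fx = np.sin(log_egy) + rng.normal(scale=noise_sigma, size=egy.size)

# s is an upper bound on the sum of squared residuals.
# A common starting guess is (number of points) * (noise variance).
smooth = UnivariateSpline(log_egy, fx, k=3, s=egy.size * noise_sigma**2)
exact = UnivariateSpline(log_egy, fx, k=3, s=0)  # interpolates every point

# Evaluate on points evenly spaced in log(x) to match the log axis.
x_new = np.geomspace(egy.min(), egy.max(), 500)
f_smooth = smooth(np.log10(x_new))
f_exact = exact(np.log10(x_new))
```

Plotting `f_smooth` and `f_exact` against `x_new` on a log x axis shows the difference: the interpolating spline follows every bump in the noise, the smoothing spline does not. The right value of `s` depends on the noise level in the real data and usually needs some trial and error.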
I'm facing a silly problem while plotting a graph from a regression function fitted with scikit-learn. After constructing the function, I need to plot a graph that shows X and Y from the original data together with the points calculated by my function. The problem is that my function is not a straight line: although the model is linear, it uses a Fourier series to give the right shape for my curve. When I try to plot the line using:
ax.plot(df['GDPercapita'], modelp1.predict(df1), color='k')
I get a graph like this:
Plot
But the true graph is supposed to be a line following those black points:
Dots to be connected
I'm generating the graph using the following code:
fig, ax = plt.subplots()
ax.scatter(df['GDPercapita'], df['LifeExpectancy'], edgecolors=(0, 0, 0))
ax.scatter(df['GDPercapita'], modelp1.predict(df1),color='k') #this line is changed to get the first pic.
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show(block=True)
Does anyone have an idea about what to do?
POST DISCUSSION EDIT:
Ok, so first things first:
The data can be downloaded at: http://www.est.ufmg.br/~marcosop/est171-ML/dados/worldDevelopmentIndicators.csv
I had to generate new data using a Fourier expansion, with normalized values of GDPercapita, in order to run an exhaustive optimization over the regression function used to predict LifeExpectancy. I found the number of parameters p that gives the best regression function: p=22.
Now I have to generate a polynomial function using the prediction points of the regression function with p=22, to show how the best regression function compares to a polynomial function of degree 22.
To generate the prediction I use the following code:
from sklearn import linear_model
modelp22 = linear_model.LinearRegression()
modelp22.fit(xp22,y_train)
df22 = df[p22]
fig, ax = plt.subplots()
ax.scatter(df['GDPercapita'], df['LifeExpectancy'], edgecolors=(0, 0, 0))
ax.scatter(df['GDPercapita'], modelp22.predict(df22),color='k')
ax.set_xlabel('GDPercapita')
ax.set_ylabel('LifeExpectancy')
plt.show(block=True)
Now I need to use the prediction points to create a polynomial function and plot a graph with the original data (first scatter), the prediction points (second scatter) and the polynomial function (a curve or line plot) to show their visual relation.
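No answer to this question is included here. One likely cause of the zigzag in the first picture, offered as an assumption, is that ax.plot connects points in the order they appear in the DataFrame, and the GDPercapita values are not sorted. The sketch below sorts x and the predictions together before drawing the line. The synthetic arrays stand in for df['GDPercapita'] and modelp1.predict(df1), and `plot_prediction_curve` is a hypothetical helper.

```python
import numpy as np

# Synthetic stand-ins (assumption): unsorted x values and model predictions.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=50)   # like df['GDPercapita']
y_pred = np.sin(2 * np.pi * x)       # like modelp1.predict(df1)

# Sort both arrays by x so the line is drawn from left to right.
order = np.argsort(x)
x_sorted = x[order]
y_sorted = y_pred[order]


def plot_prediction_curve():
    """Hypothetical helper: prediction points plus a line through them."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(x, y_pred, color='k')
    ax.plot(x_sorted, y_sorted, color='k')
    plt.show(block=True)
```

With a DataFrame, the same effect can be had by sorting the frame by 'GDPercapita' before calling predict and plot, so that the feature rows and the x values stay aligned.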
Plotting my data in excel as a scatter plot with smooth line and markers produces the type of figure I'm expecting. Image of Excel plots:
However, when trying to plot the data in matplotlib I'm running into some issues with interpolation. I'm using the interpolation package from SciPy and have tried a range of interpolation methods, including spline interpolation and BarycentricInterpolator as suggested previously. The resulting plots are obviously very different from the Excel-produced plots, however:
I've tried different smoothing and k values for spline interpolation. The curve changes, but the root problem still exists.
How would I be able to produce a fitted curve similar to the excel-produced plots?
Thanks
The problem is that you interpolate the data on a linear scale but expect the outcome to look smooth on a logarithmic scale.
The idea is therefore to perform the interpolation on a log scale: transform the x data to its logarithm first and then interpolate. You can then transform the result back to linear scale so that you can plot it on a log axis again.
from scipy.interpolate import interp1d, Akima1DInterpolator
import numpy as np
import matplotlib.pyplot as plt
x = np.array([0.02,0.2,2,20,200])
y = np.array([700,850,680,410, 700])
plt.plot(x,y, marker="o", ls="")
sx=np.log10(x)
xi_ = np.linspace(sx.min(),sx.max(), num=201)
xi = 10**(xi_)
f = interp1d(sx,y, kind="cubic")
yi = f(xi_)
plt.plot(xi,yi, label="cubic spline")
f2 = Akima1DInterpolator(sx, y)
yi2 = f2(xi_)
plt.plot(xi,yi2, label="Akima")
plt.gca().set_xscale("log")
plt.legend()
plt.show()
I am new here, although I have been reading answers to questions for quite a while. I have a problem, I have a seismic hazard curve looking roughly as follows:
I need to plot it as a histogram. That is what a hazard curve looks like; I would need to plot the median as a histogram.
I have tried to use plt.hist as follows:
n, bins, patches = plt.hist(x, 50, facecolor='green', alpha=0.75)
where x is my frequency data array:
x = [1.00E-02, 1.00E-03, 1.00E-04, 1.00E-05, 1.00E-06, 1.00E-07, 1.00E-08, 1.00E-09, 1.00E-10]
but it gives me back an empty image. I think it is because it's used to plot probability density functions (and mine is not a probability density function) but I am not sure if I am right. Can someone give me some pointers on how to do this?