Best fit curve (polynomial) on scatter plot with bokeh - python

I have created a scatter plot with bokeh. I want to generate a best fit polynomial curve on the data, and superimpose the curve on the cloud of points.
I have generated a 2nd degree polyline with polyfit:
import numpy as np
from bokeh.plotting import figure, output_file, show
model2 = np.poly1d(np.polyfit(df['Dist'], df['Speed'], 2))
polyline = np.linspace(1,16000,900)
graph = figure(title = "Speed function of flight distance")
graph.scatter(df['Dist'],df['Speed'])
show(graph)
What is the instruction for showing this polyline on top of the scatter plot? I see how to generate a line of best fit, my need is for a polyline.

As mentioned in the comments, graph.line() adds a line plot. Now, we just need an evenly spaced x-range over which we plot the fitted function:
import numpy as np
from bokeh.plotting import figure, output_file, show
#data generation
import pandas as pd
np.random.seed(123)
dist = np.sort(np.random.choice(range(100), 20, replace=False))
speed = 0.3 * dist ** 2 - 2.7 * dist - 1 + np.random.randint(-10, 10, dist.size)
df = pd.DataFrame({'Dist': dist, 'Speed': speed})
model2 = np.poly1d(np.polyfit(df['Dist'], df['Speed'], 2))
x_fit = np.linspace(df['Dist'].min(), df['Dist'].max(), 100)
graph = figure(title = "Speed function of flight distance")
graph.scatter(df['Dist'],df['Speed'])
graph.line(x_fit, model2(x_fit))
show(graph)
Sample output:
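If you want the fitted curve to stand out from the scatter, bokeh's line glyph also accepts the usual styling keywords; the colour, width, and legend text below are just illustrative choices:
graph.line(x_fit, model2(x_fit), line_width=2, color='red', legend_label='2nd-degree fit')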

Related

How to plot heatmap onto mplsoccer pitch?

Wondering how I can plot a seaborn plot onto a different matplotlib plot. Currently I have two plots (one a heatmap, the other a soccer pitch), but when I plot the heatmap onto the pitch, I get the results below. (Plotting the pitch onto the heatmap isn't pretty either.) Any ideas how to fix it?
Note: Plots don't need a colorbar and the grid structure isn't required either. Just care about the heatmap covering the entire space of the pitch. Thanks!
import pandas as pd
import numpy as np
from mplsoccer import Pitch
import seaborn as sns
nmf_shot_W = pd.read_csv('https://raw.githubusercontent.com/lucas-nelson-uiuc/datasets/main/nmf_show_W.csv').iloc[:, 1:]
nmf_shot_ThierryHenry = pd.read_csv('https://raw.githubusercontent.com/lucas-nelson-uiuc/datasets/main/nmf_show_Hth.csv')['Thierry Henry']
pitch = Pitch(pitch_type='statsbomb', line_zorder=2,
              pitch_color='#22312b', line_color='#efefef')
dfdfdf = np.array(np.matmul(nmf_shot_W, nmf_shot_ThierryHenry)).reshape((24,25))
g_ax = sns.heatmap(dfdfdf)
pitch.draw(ax=g_ax)
Current output:
Desired output:
Use the built-in pitch.heatmap:
pitch.heatmap expects a stats dictionary of binned data, bin mesh, and bin centers:
stats (dict) – The keys are statistic (the calculated statistic), x_grid and y_grid (the bin's edges), and cx and cy (the bin centers).
In the mplsoccer heatmap demos, they construct this stats object using pitch.bin_statistic because they have raw data. However, you already have binned data ("calculated statistic"), so reconstruct the stats object manually by building the mesh and centers:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mplsoccer import Pitch
nmf_shot_W = pd.read_csv('71878281/nmf_show_W.csv', index_col=0)
nmf_shot_ThierryHenry = pd.read_csv('71878281/nmf_show_Hth.csv')['Thierry Henry']
statistic = np.dot(nmf_shot_W, nmf_shot_ThierryHenry.to_numpy()).reshape((24, 25))
# construct stats object from binned data, bin mesh, and bin centers
y, x = statistic.shape
x_grid = np.linspace(0, 120, x + 1)
y_grid = np.linspace(0, 80, y + 1)
cx = x_grid[:-1] + 0.5 * (x_grid[1] - x_grid[0])
cy = y_grid[:-1] + 0.5 * (y_grid[1] - y_grid[0])
stats = dict(statistic=statistic, x_grid=x_grid, y_grid=y_grid, cx=cx, cy=cy)
# use pitch.draw and pitch.heatmap as per mplsoccer demo
pitch = Pitch(pitch_type='statsbomb', line_zorder=2, pitch_color='#22312b', line_color='#efefef')
fig, ax = pitch.draw(figsize=(6.6, 4.125))
pcm = pitch.heatmap(stats, ax=ax, cmap='plasma')
cbar = fig.colorbar(pcm, ax=ax, shrink=0.6)
cbar.outline.set_edgecolor('#efefef')
cbar.ax.yaxis.set_tick_params(color='#efefef')
plt.setp(plt.getp(cbar.ax.axes, 'yticklabels'), color='#efefef')
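For comparison: if you had raw shot coordinates instead of an already-binned matrix, the demo approach with pitch.bin_statistic would apply directly. In the sketch below, x and y are hypothetical event coordinates, and pitch and ax are reused from the snippet above:
# x, y: hypothetical raw event coordinates (x in 0-120, y in 0-80 for a StatsBomb pitch)
stats = pitch.bin_statistic(x, y, statistic='count', bins=(25, 24))
pcm = pitch.heatmap(stats, ax=ax, cmap='plasma')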

Discontinuous, non-monotonic x-axis on contourf

I am plotting a 3D shape in spherical coordinates. In order to rotate it, I am shifting the phi values by 30 deg, as phi_lin and phi_rot show in the following code. I would expect the result in panel 4 to have the same distribution as panel 2, but rigidly shifted to the right by 30 degrees.
I guess the problem is that the plotting function contourf cannot deal with the phi_rot input vector since it is non-monotonic. The discontinuity due to the shifting is visible in panel 3. How can I overcome this problem?
Here is a working example:
import glob
import math
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.colors import LightSource
%matplotlib inline
import itertools
def ellips(THETA, PHI):
    """
    Definition of the ellipsoid
    from https://arxiv.org/pdf/1104.5145.pdf
    """
    a = 1; b = 2; c = 3
    R = (a*b*c) / np.sqrt(b**2*c**2*np.cos(THETA)**2 + c**2*a**2*np.sin(THETA)**2*np.cos(PHI)**2 + a**2*b**2*np.sin(THETA)**2*np.sin(PHI)**2)
    return np.array(R)
nth=13
theta = np.linspace(0, np.pi, nth)
#length = 13
phi_lin=[-180,-150,-120,-90,-60,-30,0,30,60,90,120,150,180]
phi_rot=[-150,-120,-90,-60,-30,0,30,60,90,120,150,180,-180]
THETA_lin, PHI_lin = np.meshgrid(theta, phi_lin)
THETA_rot, PHI_rot = np.meshgrid(theta, phi_rot)
THETA_deg_lin=[el*180/np.pi for el in THETA_lin]
THETA_deg_rot=[el*180/np.pi for el in THETA_rot]
PHI_deg_lin=[el for el in PHI_lin]
PHI_deg_rot=[el for el in PHI_rot]
fig1, ax = plt.subplots(2,2, figsize=(15,15), constrained_layout=True)
ax[0,0].plot(PHI_deg_lin, "o")
ax[0,0].set_xlabel("# element")
ax[0,0].set_ylabel('phi [DEG]')
ax[0,0].set_title("initial coordinates")
ax[0,1].contourf(PHI_deg_lin, THETA_deg_lin, ellips(THETA_deg_lin,PHI_deg_lin).reshape(len(phi_lin),nth))
ax[0,1].set_xlabel('phi [DEG]')
ax[0,1].set_ylabel('theta [DEG]')
ax[0,1].set_title("Original ellipsoid in spherical coordinates")
ax[1,0].plot(PHI_deg_rot, "o")
ax[1,0].set_xlabel("# element")
ax[1,0].set_ylabel('phi [DEG]')
ax[1,0].set_title("shifted coordinates")
ax[1,1].contourf(PHI_deg_rot, THETA_deg_rot, ellips(THETA_deg_rot,PHI_deg_rot).reshape(len(phi_rot),nth))
ax[1,1].set_xlabel('phi [DEG]')
ax[1,1].set_ylabel('theta [DEG]')
ax[1,1].set_title("Shifted ellipsoid in spherical coordinates")
and the output:
UPDATE: I tried to create an interpolation function z=f(x,y) with the rotated coordinates and to plot the new z:
from scipy import interpolate
i2d = interpolate.interp2d(theta, phi_rot, ellips(THETA_deg_rot,PHI_deg_rot))
znew = i2d(theta,phi_lin)
ax[1,1].contourf(PHI_deg_rot, THETA_deg_rot,znew.reshape(len(phi_rot),nth))
the shifting occurs, as you can see in the following output, but the non-linearly-spaced x-axis prevents a smooth contour:
Any idea how to fix it?
The solution has been inspired by this post.
Since contourf doesn't accept a non-linearly-spaced axis, it is necessary to interpolate the rotated data:
from scipy import interpolate
i2d = interpolate.interp2d(theta, phi_rot, ellips(THETA_deg_rot,PHI_deg_rot))
evaluate it on the same axis (lin or rot doesn't matter at this point):
znew = i2d(theta,phi_lin)
and plot it using tricontourf with a suitable number of levels:
ax[1,1].tricontourf(np.array(PHI_deg_rot).reshape(-1), np.array(THETA_deg_rot).reshape(-1),znew.reshape(-1),10)
the output is the expected one:
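A side note: interp2d is deprecated in recent SciPy releases. A rough equivalent, assuming the same variables as in the code above, is RectBivariateSpline, which needs a strictly increasing axis, hence the sort:
from scipy.interpolate import RectBivariateSpline
# sort the rotated phi axis so it is strictly increasing, as RectBivariateSpline requires
order = np.argsort(phi_rot)
z_rot = ellips(THETA_deg_rot, PHI_deg_rot)        # shape (len(phi_rot), nth)
spline = RectBivariateSpline(np.array(phi_rot)[order], theta, z_rot[order, :])
znew = spline(phi_lin, theta)                     # evaluated on the monotonic phi axis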

Is there a simple method to smooth a curve without taking into account future values and without a time shift?

I have a Unix time series (x) with an associated signal value (y) which is generated every minute, dropping the first value and appending a new one. I am trying to smooth the resulting curve without losing time accuracy, with a specific emphasis on the final value of the smoothed curve, which will be written to a database. I would like to be able to adjust the smoothing to a considerable degree.
I have studied (as a mathematical layman, more or less) all options I could find and master. I came across Savitzky-Golay, which looked perfect until I realized it works well on past data but fails to produce a reliable final value if no future data is available for smoothing. I have tried many other methods which produced results but could not be adjusted like Savgol.
import pandas as pd
from bokeh.plotting import figure, show, output_file
from bokeh.layouts import column
from math import pi
from scipy.signal import savgol_filter
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from scipy.interpolate import splrep, splev
from scipy.ndimage import gaussian_filter1d
from scipy.signal import lfilter
from scipy.interpolate import UnivariateSpline
import matplotlib.pyplot as plt
df_sim = pd.read_csv("/home/20190905_Signal_Smooth_Test.csv")
#sklearn Polynomial*****************************************
poly = PolynomialFeatures(degree=4)
X = df_sim.iloc[:, 0:1].values
print(X)
y = df_sim.iloc[:, 1].values
print(y)
X_poly = poly.fit_transform(X)
poly.fit(X_poly, y)
lin2 = LinearRegression()
lin2.fit(X_poly, y)
# Visualising the Polynomial Regression results
plt.scatter(X, y, color='blue')
plt.plot(X, lin2.predict(poly.fit_transform(X)), color='red')
plt.title('Polynomial Regression')
plt.xlabel('Time')
plt.ylabel('Signal')
plt.show()
#scipy interpolate********************************************
bspl = splrep(df_sim['timestamp'], df_sim['signal'], s=5)
bspl_y = splev(df_sim['timestamp'], bspl)
df_sim['signal_spline'] = bspl_y
#scipy gaussian filter****************************************
smooth = gaussian_filter1d(df_sim['signal'], 3)
df_sim['signal_gauss'] = smooth
#scipy lfilter************************************************
n = 5  # the larger n is, the smoother the curve will be
b = [1.0 / n] * n
a = 1
histo_filter = lfilter(b, a, df_sim['signal'])
df_sim['signal_lfilter'] = histo_filter
print(df_sim)
#scipy UnivariateSpline**************************************
s = UnivariateSpline(df_sim['timestamp'], df_sim['signal'], s=5)
xs = df_sim['timestamp']
ys = s(xs)
df_sim['signal_univariante'] = ys
#scipy savgol filter****************************************
sg = savgol_filter(df_sim['signal'], 11, 3)
df_sim['signal_savgol'] = sg
df_sim['date'] = pd.to_datetime(df_sim['timestamp'], unit='s')
#plotting it all********************************************
print(df_sim)
w = 60000
TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=1000, plot_height=250,
           title="Various Signals y vs Timestamp x")
p.xaxis.major_label_orientation = pi / 4
p.grid.grid_line_alpha = 0.9
p.line(x=df_sim['date'], y=df_sim['signal'], color='green')
p.line(x=df_sim['date'], y=df_sim['signal_spline'], color='blue')
p.line(x=df_sim['date'], y=df_sim['signal_gauss'], color='red')
p.line(x=df_sim['date'], y=df_sim['signal_lfilter'], color='magenta')
p.line(x=df_sim['date'], y=df_sim['signal_univariante'], color='yellow')
p1 = figure(x_axis_type="datetime", tools=TOOLS, plot_width=1000, plot_height=250,
            title="Savgol vs Signal")
p1.xaxis.major_label_orientation = pi / 4
p1.grid.grid_line_alpha = 0.9
p1.line(x=df_sim['date'], y=df_sim['signal'], color='green')
p1.line(x=df_sim['date'], y=df_sim['signal_savgol'], color='blue')
output_file("signal.html", title="Signal Test")
show(column(p, p1)) # open a browser
I expect a result that is similar to Savitzky-Golay but with valid final smoothed values for the data series. None of the other methods offer the same flexibility to adjust the degree of smoothing. Most other methods shift the curve to the right. I can provide the csv file for testing.
This really depends on why you are smoothing the data. Every smoothing method has side effects, such as letting some 'noise' through more than others. Research 'phase response of filtering'.
A common technique to avoid the problem of missing data at the end of a symmetric filter is to just forecast your data a few points ahead and use that. For example, if you are using a 5-term moving average filter you will be missing 2 data points when you go to calculate your end value.
To forecast these two points, you could use the auto_arima() function from the pmdarima module, or look at the fbprophet module (which I find quite good for this kind of situation).
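As a rough sketch of that idea (the column name and Savitzky-Golay settings are taken from the question's code; the auto_arima call is just illustrative), you can pad the end of the series with a short forecast and then apply the symmetric filter:
import numpy as np
import pmdarima as pm
from scipy.signal import savgol_filter
window, polyorder = 11, 3
pad = window // 2                                   # future points a centred window needs
y = df_sim['signal'].to_numpy()
model = pm.auto_arima(y, suppress_warnings=True)    # fit an ARIMA model to the observed data
forecast = model.predict(n_periods=pad)             # forecast the missing tail
padded = np.concatenate([y, forecast])
df_sim['signal_savgol_padded'] = savgol_filter(padded, window, polyorder)[:len(y)]  # drop the forecast part again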

Overlay Linear Regression Line on Scatter Plot (iPython Notebook)

from astropy.io import ascii
from matplotlib.pyplot import plot, axis, xlabel, ylabel, title
gh_data = ascii.read('http://dept.astro.lsa.umich.edu/~ericbell/data/GHOSTS/M81/ngc3031- field15.newphoto_radec')
ra = gh_data['col5'][:]
dec = gh_data['col6'][:]
f606 = gh_data['col3'][:]
f814 = gh_data['col4'][:]
plot(f606 - f814, f814, 'bo', alpha=0.15)
axis([-1, 2.5, 27, 23])
xlabel('F606W-F814W')
ylabel('F814W')
title('Field 14')
The data set is imported and organized into different columns. I am trying to overlay a line of best fit (a linear regression) on the scatter plot created, but I cannot figure out how. Thanks in advance.
As @rayryeng pointed out, your code just plots the data, but doesn't actually compute any regression results to plot. Here's one way of doing it:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.DataFrame({"y": np.arange(1, 11) + np.random.rand(10),
                     "x": np.arange(1, 11) + np.random.rand(10)})
Use statsmodels OLS method to fit a regression line, and params to extract the coefficient on the single regressor:
beta_1 = sm.OLS(data.y, data.x).fit().params
Produce a scatterplot and add a regression line:
fig, ax = plt.subplots()
ax.scatter(data.x, data.y)
ax.plot(range(1,11), [i*beta_1 for i in range(1,11)], label = "best fit")
ax.legend(loc="best")
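Applied to the photometric columns from the question (assuming the colour axis is f606 - f814 and the magnitude axis is f814, as in the original plot call), the same idea with numpy's polyfit would look roughly like this:
import numpy as np
import matplotlib.pyplot as plt
color = f606 - f814                                  # x values built from the question's columns
slope, intercept = np.polyfit(color, f814, 1)        # degree-1 (straight line) fit
x_line = np.linspace(color.min(), color.max(), 100)
plt.plot(color, f814, 'bo', alpha=0.15)
plt.plot(x_line, slope * x_line + intercept, 'r-', label='best fit')
plt.legend()
plt.show()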

Major Difference in 2D kernel Density Plots: Seaborn and R

I am trying to plot data using the 2D kernel density plot of Seaborn's jointplot function (using statsmodels' KDEMultivariate function to calculate a data-driven bandwidth). I've plotted a 2D kernel density in R using the same data and the result looks very good (using the 'ks' package), while the Seaborn plot looks very very different.
I am using the exact same data and the exact same bandwidth for each (taking the bandwidth given by KDEMultivariate and passing that to the R method).
Here is the input.csv data used: https://app.box.com/s/ot7d36t44wrr85pusp5657pc1w2kf5hj
Below are the code used in each and output images from each.
Python / Seaborn:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
data = pd.read_csv("input.csv", dtype={'x': float, 'y': float}, skiprows=0)
bw_ml_x = sm.nonparametric.KDEMultivariate(data=data['x'], var_type='c', bw='cv_ml')
bw_ml_y = sm.nonparametric.KDEMultivariate(data=data['y'], var_type='c', bw='cv_ml')
g = sns.jointplot(x='x', y='y', data=data, kind="kde", stat_func=None, bw=[bw_ml_x.bw, bw_ml_y.bw])
g.plot_joint(plt.scatter, c="w")
g.ax_joint.collections[0].set_alpha(0)
plt.show()
Img for Seaborn plot:
The bandwidth given by bw_ml_x.bw and bw_ml_y.bw is placed in a 2 x 2 R matrix H, where H[1,1] = bw_ml_x.bw, H[2,2] = bw_ml_y.bw, and the other values are set to zero.
R:
library(ks)
fhat <- kde(x=as.data.frame(data[1], data[2]), H=H)
plot(fhat, display="filled.contour2", cont=seq(10,90,by=10))
Img for R plot:
Looking at your Seaborn/Python plot, many of the points cluster along the (0,n) region and the (1,1) region of your space, just as the KDE of the R plot shows. This indicates that Seaborn and R are looking at the same data; we simply need to reformulate the call to the kde in Seaborn in order to visualize the KDE gradients.
If you modify your Python call to match the documentation for Kernel Density Estimation in Seaborn you'll get a proper 2d-kdf out of Python:
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas as pd
import seaborn as sns
data = pd.read_csv("input.csv", dtype={'x': float, 'y': float}, skiprows=0)
bw_ml_x = sm.nonparametric.KDEMultivariate(data=data['x'], var_type='c', bw='cv_ml')
bw_ml_y = sm.nonparametric.KDEMultivariate(data=data['y'], var_type='c', bw='cv_ml')
g = sns.jointplot(x='x', y='y', data=data, kind="kde")
g.plot_joint(plt.scatter, c="w")
g.ax_joint.collections[0].set_alpha(0)
plt.show()
This accords with the R plot (though the kernel estimators seem to be slightly different, which would account for the variation in gradients between the plots):
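A side note for anyone on a recent seaborn release: stat_func and the bw argument have been removed from jointplot, but the amount of smoothing can still be adjusted by forwarding kdeplot's bw_adjust (or bw_method) through joint_kws; note this is a single scalar rather than per-axis bandwidths:
# hedged sketch for seaborn >= 0.11; bw_adjust scales seaborn's default bandwidth estimate
g = sns.jointplot(data=data, x='x', y='y', kind='kde', joint_kws=dict(bw_adjust=1.0))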
