I'm using seaborn to plot data. Everything was fine until my mentor asked me how the plot is made in, for example, the following code.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
x = np.random.normal(size=100)
sns.distplot(x)
plt.show()
The result of this code is:
My questions:
How does distplot manage to plot this?
Why does the plot start at -3 and end at 4?
Is there any parametric function or any specific mathematical function that distplot uses to plot the data like this?
I use distplot and kind='kde' to plot my data, but I would like to know the maths behind those functions.
Here is some code trying to illustrate how the kde curve is drawn.
The code starts with a random sample of 100 xs.
These xs are shown in a histogram. With density=True the histogram is normalized so that its total area is 1. (By default, the bar heights are raw counts, so they grow with the number of points. With density=True, each count is divided by the number of samples times the bin width, which makes the bar areas sum to 1.)
To draw the kde, a gaussian "bell" curve is drawn around each of the N samples. These curves are summed, and normalized by dividing by N.
The sigma of these curves is a free parameter. By default it is calculated by Scott's rule: N ** (-1/5) times the standard deviation of the sample, which is about 0.4 for 100 points drawn from a standard normal distribution (the green curve in the example plot).
The code below shows the result for different choices of sigma. Smaller sigmas follow the given data more closely; larger sigmas give a smoother curve. There is no perfect choice for sigma; it depends strongly on the data and on what is known (or guessed) about the underlying distribution.
import matplotlib.pyplot as plt
import numpy as np
def gauss(x, mu, sigma):
    return np.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))
N = 100
xs = np.random.normal(0, 1, N)
plt.hist(xs, density=True, label='Histogram', alpha=.4, ec='w')
x = np.linspace(xs.min() - 1, xs.max() + 1, 100)
for sigma in np.arange(.2, 1.2, .2):
    plt.plot(x, sum(gauss(x, xi, sigma) for xi in xs) / N, label=f'$\\sigma = {sigma:.1f}$')
plt.xlim(x[0], x[-1])
plt.legend()
plt.show()
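As a cross-check on the description above, here is a minimal sketch that builds the sum-of-gaussians curve by hand with the Scott's-rule sigma and compares it with scipy.stats.gaussian_kde, which uses Scott's rule by default. The seed and the names rng, grid, manual and reference are my own choices. Extending the grid three sigmas past the extreme samples is an assumption meant to mimic what seaborn documents as its cut parameter, which would explain why the curve runs from about -3 to 4 although the data span a narrower range.

```python
import numpy as np
from scipy.stats import gaussian_kde


def gauss(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return np.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))


rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
xs = rng.normal(0, 1, 100)
N = xs.size

# Scott's rule: factor N**(-1/5), scaled by the sample standard deviation
sigma = N ** (-1 / 5) * xs.std(ddof=1)

# evaluation grid reaching 3 sigmas beyond the smallest and largest sample
grid = np.linspace(xs.min() - 3 * sigma, xs.max() + 3 * sigma, 400)

# one bell curve per sample, summed and divided by N
manual = sum(gauss(grid, xi, sigma) for xi in xs) / N

# scipy's estimate with its default (Scott) bandwidth
reference = gaussian_kde(xs)(grid)

# a simple Riemann sum shows that the curve encloses an area of about 1
area = manual.sum() * (grid[1] - grid[0])
print(f"sigma = {sigma:.3f}, area = {area:.3f}")
print("max difference to gaussian_kde:", np.abs(manual - reference).max())
```

Plotting grid against manual on top of the density=True histogram should give the same kind of curve as in the plot above.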
PS: Instead of a histogram or a kde, other ways to visualize 100 random numbers are a set of short lines:
plt.plot(np.repeat(xs, 3), np.tile((0, -0.05, np.nan), N), lw=1, c='k', alpha=0.5)
plt.ylim(ymin=-0.05)
or dots (jittered, so they don't overlap):
plt.scatter(xs, -np.random.rand(N)/10, s=1, color='crimson')
plt.ylim(ymin=-0.099)
I have a numpy array vertices of shape (N,3) containing the N vertices of a spherical polygon in 3D, i.e. all these points lie on the surface of a sphere. The center and radius of the sphere are known (take the unit sphere for example). I would like to plot the spherical polygon bounded by these vertices. (Mathematically speaking, I want to plot the spherically convex hull generated by these vertices.)
How can I do that using matplotlib? I tried Poly3DCollection, but this only plots the Euclidean polygon. I managed to plot the entire unit sphere using plot_surface like this:
u = np.linspace(0, 2 * np.pi, 100)
v = np.linspace(0, np.pi, 100)
x = np.outer(np.cos(u), np.sin(v))
y = np.outer(np.sin(u), np.sin(v))
z = np.outer(np.ones(np.size(u)), np.cos(v))
ax.plot_surface(x, y, z, rstride=5, cstride=5, color='y', alpha=0.1)
I guess one could manually calculate what points to remove from x, y, z and then still use plot_surface in order to plot the polygon. Would this be the correct way to use matplotlib or does it have another module, which I could use directly?
In case there is no convenient way to do this in matplotlib, can you recommend any other library, which does that?
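One possible starting point, offered as a sketch rather than a complete answer: the edges of a spherical polygon are great-circle arcs, and these can be sampled with spherical linear interpolation between consecutive vertices. The function names great_circle_arc and spherical_polygon_outline are hypothetical, the vertices are assumed to be unit vectors given in order, and no two consecutive vertices may be antipodal. This only produces the outline, not a filled surface.

```python
import numpy as np


def great_circle_arc(a, b, n=50):
    """Sample n points on the shorter great-circle arc from unit vector a to b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # angle between the two vertices as seen from the center of the sphere
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return np.tile(a, (n, 1))  # both vertices coincide
    t = np.linspace(0.0, 1.0, n)[:, None]
    # spherical linear interpolation keeps every sample on the unit sphere
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)


def spherical_polygon_outline(vertices, n=50):
    """Concatenate the arcs between consecutive vertices into a closed outline."""
    vertices = np.asarray(vertices, dtype=float)
    following = np.roll(vertices, -1, axis=0)  # vertex i paired with vertex i + 1
    return np.vstack([great_circle_arc(p, q, n) for p, q in zip(vertices, following)])


# example: the octant triangle spanned by the three coordinate axes
vertices = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
outline = spherical_polygon_outline(vertices)
print(outline.shape)
```

The columns of outline can be passed to ax.plot(outline[:, 0], outline[:, 1], outline[:, 2]) on a 3D axes to draw the boundary on top of the sphere.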
I was inspired by this answer by @James to see how griddata and map_coordinates might be used. In the examples below I'm showing 2D data, but my interest is in 3D. I noticed that griddata only provides splines for 1D and 2D, and is limited to linear interpolation for 3D and higher (probably for very good reasons). However, map_coordinates seems to be fine with 3D using higher order (smoother than piece-wise linear) interpolation.
My primary question: if I have random, unstructured data (where I cannot use map_coordinates) in 3D, is there some way to get smoother than piece-wise linear interpolation within the NumPy/SciPy universe, or at least nearby?
My secondary question: is spline for 3D not available in griddata because it is difficult or tedious to implement, or is there a fundamental difficulty?
The images and horrible python below show my current understanding of how griddata and map_coordinates can or can't be used. Interpolation is done along the thick black line.
STRUCTURED DATA:
UNSTRUCTURED DATA:
Horrible python:
import numpy as np
import matplotlib.pyplot as plt
def g(x, y):
    return np.exp(-((x-1.0)**2 + (y-1.0)**2))
def findit(x, X): # or could use some 1D interpolation
    fraction = (x - X[0]) / (X[-1]-X[0])
    return fraction * float(X.shape[0]-1)
nth, nr = 12, 11
theta_min, theta_max = 0.2, 1.3
r_min, r_max = 0.7, 2.0
theta = np.linspace(theta_min, theta_max, nth)
r = np.linspace(r_min, r_max, nr)
R, TH = np.meshgrid(r, theta)
Xp, Yp = R*np.cos(TH), R*np.sin(TH)
array = g(Xp, Yp)
x, y = np.linspace(0.0, 2.0, 200), np.linspace(0.0, 2.0, 200)
X, Y = np.meshgrid(x, y)
blob = g(X, Y)
xtest = np.linspace(0.25, 1.75, 40)
ytest = np.zeros_like(xtest) + 0.75
rtest = np.sqrt(xtest**2 + ytest**2)
thetatest = np.arctan2(ytest, xtest) # arctan2 takes (y, x); this matches Xp = R*cos(TH), Yp = R*sin(TH)
ir = findit(rtest, r)
it = findit(thetatest, theta)
plt.figure()
plt.subplot(2,1,1)
plt.scatter(100.0*Xp.flatten(), 100.0*Yp.flatten())
plt.plot(100.0*xtest, 100.0*ytest, '-k', linewidth=3)
# plt.hold was removed in Matplotlib 3.0; axes always keep earlier artists now
plt.imshow(blob, origin='lower', cmap='gray')
plt.text(5, 5, "don't use jet!", color='white')
exact = g(xtest, ytest)
import scipy.ndimage as spndint # the scipy.ndimage.interpolation namespace is deprecated
ndint0 = spndint.map_coordinates(array, [it, ir], order=0)
ndint1 = spndint.map_coordinates(array, [it, ir], order=1)
ndint2 = spndint.map_coordinates(array, [it, ir], order=2)
import scipy.interpolate as spint
points = np.vstack((Xp.flatten(), Yp.flatten())).T # could use np.array(zip(...))
grid_x = xtest
grid_y = np.array([0.75])
g0 = spint.griddata(points, array.flatten(), (grid_x, grid_y), method='nearest')
g1 = spint.griddata(points, array.flatten(), (grid_x, grid_y), method='linear')
g2 = spint.griddata(points, array.flatten(), (grid_x, grid_y), method='cubic')
plt.subplot(4,2,5)
plt.plot(exact, 'or')
#plt.plot(ndint0)
plt.plot(ndint1)
plt.plot(ndint2)
plt.title("map_coordinates")
plt.subplot(4,2,6)
plt.plot(exact, 'or')
#plt.plot(g0)
plt.plot(g1)
plt.plot(g2)
plt.title("griddata")
plt.subplot(4,2,7)
#plt.plot(ndint0 - exact)
plt.plot(ndint1 - exact)
plt.plot(ndint2 - exact)
plt.title("error map_coordinates")
plt.subplot(4,2,8)
#plt.plot(g0 - exact)
plt.plot(g1 - exact)
plt.plot(g2 - exact)
plt.title("error griddata")
plt.show()
seed_points_rand = 2.0 * np.random.random((400, 2))
rr = np.sqrt((seed_points_rand**2).sum(axis=-1))
thth = np.arctan2(seed_points_rand[...,1], seed_points_rand[...,0])
isinside = (rr>r_min) * (rr<r_max) * (thth>theta_min) * (thth<theta_max)
points_rand = seed_points_rand[isinside]
Xprand, Yprand = points_rand.T # unpack
array_rand = g(Xprand, Yprand)
grid_x = xtest
grid_y = np.array([0.75])
plt.figure()
plt.subplot(2,1,1)
plt.scatter(100.0*Xprand.flatten(), 100.0*Yprand.flatten())
plt.plot(100.0*xtest, 100.0*ytest, '-k', linewidth=3)
# plt.hold was removed in Matplotlib 3.0; axes always keep earlier artists now
plt.imshow(blob, origin='lower', cmap='gray')
plt.text(5, 5, "don't use jet!", color='white')
g0rand = spint.griddata(points_rand, array_rand.flatten(), (grid_x, grid_y), method='nearest')
g1rand = spint.griddata(points_rand, array_rand.flatten(), (grid_x, grid_y), method='linear')
g2rand = spint.griddata(points_rand, array_rand.flatten(), (grid_x, grid_y), method='cubic')
plt.subplot(4,2,6)
plt.plot(exact, 'or')
#plt.plot(g0rand)
plt.plot(g1rand)
plt.plot(g2rand)
plt.title("griddata")
plt.subplot(4,2,8)
#plt.plot(g0rand - exact)
plt.plot(g1rand - exact)
plt.plot(g2rand - exact)
plt.title("error griddata")
plt.show()
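For the structured case, the index conversion that findit performs carries over to 3D directly. The following is a minimal sketch under my own assumptions: a uniformly spaced Cartesian grid, a 3D analogue f of the test function g, and mode="mirror" as the boundary handling. The names f, axes, to_index and the query line are made up for illustration.

```python
import numpy as np
from scipy import ndimage


def f(x, y, z):
    """3D analogue of the 2D test function g (an assumption for this sketch)."""
    return np.exp(-((x - 1.0) ** 2 + (y - 1.0) ** 2 + (z - 1.0) ** 2))


# structured data: samples on a uniformly spaced 21 x 21 x 21 grid
axes = [np.linspace(0.0, 2.0, 21) for _ in range(3)]
X, Y, Z = np.meshgrid(*axes, indexing="ij")
samples = f(X, Y, Z)


def to_index(values, axis):
    """Convert physical coordinates to fractional indices along a uniform axis."""
    return (values - axis[0]) / (axis[-1] - axis[0]) * (axis.size - 1)


# query points along a straight line through the volume
xq = np.linspace(0.25, 1.75, 40)
yq = np.full_like(xq, 0.75)
zq = np.full_like(xq, 1.1)
coords = np.array([to_index(q, a) for q, a in zip((xq, yq, zq), axes)])

exact = f(xq, yq, zq)
linear = ndimage.map_coordinates(samples, coords, order=1, mode="mirror")
cubic = ndimage.map_coordinates(samples, coords, order=3, mode="mirror")

print("max error, order=1:", np.abs(linear - exact).max())
print("max error, order=3:", np.abs(cubic - exact).max())
```

The two printed errors can be compared in the same way as the error panels in the 2D figures.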
Good question! (and nice plots!)
For unstructured data, you'll want to switch back to functions meant for unstructured data. griddata is one option, but uses triangulation with linear interpolation in between. This leads to "hard" edges at triangle boundaries.
The smooth alternative is a radial basis function (RBF) interpolant, which is how splines are usually generalized to scattered data. In scipy terms, you want scipy.interpolate.Rbf. I'd recommend using function="linear" or function="thin_plate" over cubic splines, but cubic is available as well. (Cubic splines will exacerbate problems with "overshooting" compared to linear or thin-plate splines.)
One caveat is that this particular implementation of radial basis functions will always use all points in your dataset. This is the most accurate and smooth approach, but it scales poorly as the number of input observation points increases. There are several ways around this, but things will get more complex. I'll leave that for another question.
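Since the question is about 3D, here is a hedged sketch of one way around the scaling problem. It assumes SciPy 1.7 or newer, where scipy.interpolate.RBFInterpolator is available; its neighbors argument restricts each evaluation to the nearest data points instead of using all of them. The test function f, the seed, the point count and neighbors=30 are arbitrary choices of mine, not recommendations.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility


def f(p):
    """Smooth 3D test function evaluated on an (n, 3) array of points."""
    return np.exp(-np.sum((p - 1.0) ** 2, axis=-1))


# unstructured input: 500 random points in the cube [0, 2]^3
points = rng.uniform(0.0, 2.0, size=(500, 3))
values = f(points)

# global interpolant: every query uses all 500 points
full = RBFInterpolator(points, values, kernel="thin_plate_spline")

# local interpolant: every query uses only its 30 nearest points
local = RBFInterpolator(points, values, kernel="thin_plate_spline", neighbors=30)

# evaluate both along a line through the volume
query = np.column_stack([
    np.linspace(0.25, 1.75, 40),
    np.full(40, 0.75),
    np.full(40, 1.1),
])
print("max error, all points:  ", np.abs(full(query) - f(query)).max())
print("max error, 30 neighbors:", np.abs(local(query) - f(query)).max())
```

The local version trades some smoothness for speed and memory; how many neighbors are enough depends on the data.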
At any rate, here's a simplified example. We'll generate random data and then interpolate it at points that are on a regular grid. (Note that the input is not on a regular grid, and the interpolated points don't need to be either.)
import numpy as np
import scipy.interpolate
import matplotlib.pyplot as plt
np.random.seed(1977)
x, y, z = np.random.random((3, 10))
interp = scipy.interpolate.Rbf(x, y, z, function='thin_plate')
yi, xi = np.mgrid[0:1:100j, 0:1:100j]
zi = interp(xi, yi)
plt.plot(x, y, 'ko')
plt.imshow(zi, extent=[0, 1, 1, 0], cmap='gist_earth')
plt.colorbar()
plt.show()
Choice of spline type
I chose "thin_plate" as the type of spline. Our input observation points range from 0 to 1 (they're created by np.random.random). Notice that our interpolated values go slightly above 1 and well below zero. This is "overshooting".
Linear splines will completely avoid overshooting, but you'll wind up with "bullseye" patterns (nowhere near as severe as with IDW methods, though). For example, here's the exact same data interpolated with a linear radial basis function. Notice that our interpolated values never go above 1 or below 0:
Higher order splines will make trends in the data more continuous but will overshoot more. The default "multiquadric" is fairly similar to a thin-plate spline, but will make things a bit more continuous and overshoot a bit worse:
However, as you go to even higher order splines such as "cubic" (third order):
and "quintic" (fifth order)
You can really wind up with unreasonable results as soon as you move even slightly beyond your input data.
At any rate, here's a simple example to compare different radial basis functions on random data:
import numpy as np
import scipy.interpolate
import matplotlib.pyplot as plt
np.random.seed(1977)
x, y, z = np.random.random((3, 10))
yi, xi = np.mgrid[0:1:100j, 0:1:100j]
interp_types = ['multiquadric', 'inverse', 'gaussian', 'linear', 'cubic',
                'quintic', 'thin_plate']
for kind in interp_types:
    interp = scipy.interpolate.Rbf(x, y, z, function=kind)
    zi = interp(xi, yi)
    fig, ax = plt.subplots()
    ax.plot(x, y, 'ko')
    im = ax.imshow(zi, extent=[0, 1, 1, 0], cmap='gist_earth')
    fig.colorbar(im)
    ax.set(title=kind)
    fig.savefig(kind + '.png', dpi=80)
plt.show()
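If you'd rather see the overshoot as numbers than as colors, a small sketch along the same lines prints the range of the interpolated values for a few of the basis functions. It reuses the same seed and data as above; the dictionaries ranges and node_error are names I made up for this illustration.

```python
import numpy as np
import scipy.interpolate

np.random.seed(1977)
x, y, z = np.random.random((3, 10))  # observed values lie between 0 and 1
yi, xi = np.mgrid[0:1:100j, 0:1:100j]

ranges = {}
node_error = {}
for kind in ["linear", "thin_plate", "cubic"]:
    interp = scipy.interpolate.Rbf(x, y, z, function=kind)
    zi = interp(xi, yi)
    # values below 0 or above 1 on the grid indicate overshooting
    ranges[kind] = (zi.min(), zi.max())
    # the interpolant should pass through the observations themselves
    node_error[kind] = np.abs(interp(x, y) - z).max()
    print(f"{kind:>10}: min = {zi.min():.3f}, max = {zi.max():.3f}")
```

Comparing the printed minima and maxima with the 0 to 1 range of the inputs shows how far each basis function strays outside the data.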
I have run into a problem with drawing an ellipsoid.
The ellipsoid that I am trying to draw is the following:
x**2/16 + y**2/16 + z**2/16 = 1.
I have seen a lot of references on calculating and plotting an ellipsoid, and several questions mention a Cartesian-to-spherical (or vice versa) conversion.
I ran into a website that had a calculator for it, but I had no idea how to perform this calculation successfully. I am also not sure what the linspaces should be set to. I have seen the ones that I have there used as defaults, but as I have no previous experience with these libraries, I really don't know what to expect from them.
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(figsize=plt.figaspect(1)) # Square figure
ax = fig.add_subplot(111, projection='3d')
multip = (1, 1, 1)
# Radii corresponding to the coefficients:
rx, ry, rz = 1/np.sqrt(multip)
# Spherical Angles
u = np.linspace(0, 2 * np.pi, 100)
v = np.linspace(0, np.pi, 100)
# Cartesian coordinates
#Lots of uncertainty.
#x =
#y =
#z =
# Plot:
ax.plot_surface(x, y, z, rstride=4, cstride=4, color='b')
# Axis modifications
max_radius = max(rx, ry, rz)
for axis in 'xyz':
    getattr(ax, 'set_{}lim'.format(axis))((-max_radius, max_radius))
plt.show()
Your ellipsoid is not just an ellipsoid, it's a sphere.
Notice that if you use the substitution formulas written below for x, y and z, you'll get an identity. It is in general easier to plot such a surface of revolution in a different coordinate system (spherical in this case), rather than attempting to solve an implicit equation (which in most plotting programs ends up jagged, unless you take some countermeasures).
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np
phi = np.linspace(0,2*np.pi, 256).reshape(256, 1) # the angle of the projection in the xy-plane
theta = np.linspace(0, np.pi, 256).reshape(-1, 256) # the angle from the polar axis, ie the polar angle
radius = 4
# Transformation formulae for a spherical coordinate system.
x = radius*np.sin(theta)*np.cos(phi)
y = radius*np.sin(theta)*np.sin(phi)
z = radius*np.cos(theta)
fig = plt.figure(figsize=plt.figaspect(1)) # Square figure
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x, y, z, color='b')
plt.show()
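For a true ellipsoid with three different semi-axes, the same substitution works with one radius per axis. A minimal sketch, where ellipsoid_surface is a hypothetical helper name and rx, ry, rz are the semi-axes (the square roots of the denominators in the equation, so 4, 4, 4 for the surface in the question):

```python
import numpy as np


def ellipsoid_surface(rx, ry, rz, n=100):
    """Return 2D coordinate arrays for x**2/rx**2 + y**2/ry**2 + z**2/rz**2 = 1."""
    u = np.linspace(0, 2 * np.pi, n)  # angle in the xy-plane
    v = np.linspace(0, np.pi, n)      # angle from the polar axis
    x = rx * np.outer(np.cos(u), np.sin(v))
    y = ry * np.outer(np.sin(u), np.sin(v))
    z = rz * np.outer(np.ones_like(u), np.cos(v))
    return x, y, z


# the surface from the question: every denominator is 16, so every radius is 4
x, y, z = ellipsoid_surface(4.0, 4.0, 4.0)

# substituting back into the implicit equation gives zero up to rounding
residual = x**2 / 16 + y**2 / 16 + z**2 / 16 - 1
print(np.abs(residual).max())
```

The returned arrays can be passed straight to ax.plot_surface(x, y, z) on a 3D axes.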
I have a set of data that I want to use to produce a contour plot in polar co-ordinates using Matplotlib.
My data is the following:
theta - 1D array of angle values
radius - 1D array of radius values
value - 1D array of values that I want to use for the contours
These are all 1D arrays that align properly - eg:
theta   radius   value
30      1        2.9
30      2        5.3
35      5        9.2
That is, all of the values are repeated enough times so that each row of this 'table' of three variables defines one point.
How can I create a polar contour plot from these values? I've thought about converting the radius and theta values to x and y values and doing it in cartesian co-ordinates, but the contour function seems to require 2D arrays, and I can't quite understand why.
Any ideas?
Matplotlib's contour() function expects data to be arranged as a 2D grid of points and corresponding grid of values for each of those grid points. If your data is naturally arranged in a grid you can convert r, theta to x, y and use contour(r*np.cos(theta), r*np.sin(theta), values) to make your plot.
If your data isn't naturally gridded, you should follow Stephen's advice and use griddata() to interpolate your data onto a grid.
The following script shows examples of both.
import matplotlib.pyplot as plt
from scipy.interpolate import griddata  # matplotlib.mlab.griddata was removed in Matplotlib 3.1
import numpy as np
# data on a grid
r = np.linspace(0, 1, 100)
t = np.linspace(0, 2*np.pi, 100)
r, t = np.meshgrid(r, t)
z = (t-np.pi)**2 + 10*(r-0.5)**2
plt.subplot(121)
plt.contour(r*np.cos(t), r*np.sin(t), z)
# ungrid data, then re-grid it
r = r.flatten()
t = t.flatten()
x = r*np.cos(t)
y = r*np.sin(t)
z = z.flatten()
xgrid = np.linspace(x.min(), x.max(), 100)
ygrid = np.linspace(y.min(), y.max(), 100)
xgrid, ygrid = np.meshgrid(xgrid, ygrid)
# scipy's griddata takes the sample points as a tuple of coordinate arrays;
# grid nodes outside the convex hull of the samples come back as NaN
zgrid = griddata((x, y), z, (xgrid, ygrid), method='linear')
plt.subplot(122)
plt.contour(xgrid, ygrid, zgrid)
plt.show()
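If the three 1D columns happen to cover a complete grid (every theta paired with every radius exactly once), no interpolation is needed: the columns can simply be rearranged into the 2D arrays that contour() expects. A minimal sketch, assuming theta is given in degrees as in the question's table; polar_table_to_grid is a hypothetical helper and the four sample rows are made-up values, not the asker's data.

```python
import numpy as np


def polar_table_to_grid(theta_deg, radius, value):
    """Rearrange aligned 1D columns into 2D x, y and value arrays.

    Assumes every (theta, radius) combination occurs exactly once.
    """
    theta_deg = np.ravel(theta_deg).astype(float)
    radius = np.ravel(radius).astype(float)
    value = np.ravel(value).astype(float)

    # unique axis values, plus the position of every row along each axis
    t_unique, t_idx = np.unique(theta_deg, return_inverse=True)
    r_unique, r_idx = np.unique(radius, return_inverse=True)
    if t_unique.size * r_unique.size != value.size:
        raise ValueError("the columns do not form a complete theta x radius grid")

    grid = np.empty((t_unique.size, r_unique.size))
    grid[t_idx, r_idx] = value  # drop every row into its grid cell

    T, R = np.meshgrid(np.radians(t_unique), r_unique, indexing="ij")
    return R * np.cos(T), R * np.sin(T), grid


# made-up sample rows: two angles times two radii
theta = [30, 30, 35, 35]
radius = [1, 2, 1, 2]
value = [2.9, 5.3, 9.2, 4.0]
x2d, y2d, z2d = polar_table_to_grid(theta, radius, value)
print(z2d)
```

With real data, plt.contour(x2d, y2d, z2d) would then draw the contours in Cartesian coordinates. If the check raises the ValueError, the data is not gridded and the griddata route above is the one to take.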
I don't know if it's possible to do a polar contour plot directly, but if you convert to cartesian coordinates you can use the griddata function to convert your 1D arrays to 2D.