I am plotting a 5th degree polynomial. For simplicity, let's just go with y=(x-3)(x-2)x(x+2)(x+3). On reasonable intervals for x, say from -5 to 5, the graph isn't very informative because the function grows very quickly outside of the "interesting" range, about -3 to 3:
A symlog scale is somewhat better, but now I'm looking at the log of a 5th degree polynomial, which is a bit hard for me to interpret:
Ideally, I could plot this on a polynomial scale. Since I know I have a 5th degree polynomial, then a 5th root scale would be able to fit all of my data, and the graph should behave linearly out near the edges. Is it possible to scale my axes with an arbitrary function?
I adjusted this example as follows:
import numpy as np
import matplotlib.pyplot as plt
y = np.random.normal(loc=0.5, scale=0.4, size=1000)
x = np.arange(len(y))
fig, ax = plt.subplots(figsize=(6, 8), constrained_layout=True)
t = np.arange(1, 170.0, 0.1)
s = t / 2.
ax.plot(t, s, '-', lw=2)
ax.set_yscale('function', functions=(lambda x: x**5, lambda x: x**(0.2)))
ax.grid(True)
ax.set_ylim(0,5)
plt.show()
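For the polynomial in the question, a fifth-root scale would need the root as the forward transform and the fifth power as its inverse, and the root has to preserve sign because the polynomial takes negative values. The following is a minimal sketch under those assumptions; the helper names signed_fifth_root and fifth_power are made up for illustration, and the default tick locations may still need tuning.

```python
import numpy as np
import matplotlib.pyplot as plt

def signed_fifth_root(v):
    # Forward transform: fifth root that keeps the sign of negative values.
    return np.sign(v) * np.abs(v) ** 0.2

def fifth_power(v):
    # Inverse transform: an odd power, so the sign is preserved automatically.
    return v ** 5

x = np.linspace(-5, 5, 1001)
y = (x - 3) * (x - 2) * x * (x + 2) * (x + 3)

fig, ax = plt.subplots()
ax.plot(x, y)
# 'function' scales take a (forward, inverse) pair of callables.
ax.set_yscale('function', functions=(signed_fifth_root, fifth_power))
ax.grid(True)
```

Call plt.show() to display the figure. Near the edges the curve should look close to linear, since the fifth root of a fifth degree polynomial grows roughly like x there.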
I am trying to smooth my data with a spline. The y-axis is a cumulative percentile and the x-axis is the reference point each percentile refers to. Most of it works, but the interpolated y values are not monotonic: as the spline plot below shows, they rise and fall instead of only increasing.
I still want a smooth curve, but y should never decrease as x increases, i.e. each subsequent y value should be equal to or slightly larger than the previous one, rather than increasing and then decreasing.
Reproducible code:
import pandas as pd
import numpy as np
from scipy.interpolate import make_interp_spline
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
percentile_block_df = pd.DataFrame({
'x' : [0.5,100.5,200.5,400.5,800.5,900.5,1000.5],
'percentile' : [0.0001,0.01,0.065,0.85,0.99,0.9973,0.9999]
})
figure(figsize=(8, 6), dpi=80)
y = percentile_block_df.percentile
x = percentile_block_df.x
X_Y_Spline = make_interp_spline(x, y)
# Returns evenly spaced numbers
# over a specified interval.
X_ = np.linspace(x.min(), x.max(), 1000)
Y_ = X_Y_Spline(X_)
figure(figsize=(18, 6), dpi=80)
plt.subplot(1, 2, 1) # row 1, col 2 index 1
plt.plot(x, y,"ro")
plt.plot(x, y)
plt.title("Original")
plt.xlabel('X')
plt.ylabel('Percentile ')
plt.subplot(1, 2, 2) # index 2
plt.plot(x, y,"ro")
plt.plot(X_, Y_,"green")
plt.title("Spline Plot")
plt.xlabel('X')
plt.ylabel('Percentile ')
plt.show()
What you are looking for is "monotonicity preserving interpolation". A quick search shows that scipy.interpolate.PchipInterpolator does just that. Here is the result for your example when simply plugging in from scipy.interpolate import PchipInterpolator instead of from scipy.interpolate import make_interp_spline.
Whether or not that's appropriate depends of course on your specific requirements for the interpolation. I encourage you to research the other options which are out there.
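As a rough check of the suggestion above, the sketch below interpolates the data from the question with both interpolators and tests whether the sampled curves ever decrease. The variable names x_fine, y_pchip and y_spline are illustrative and not part of the original code.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator, make_interp_spline

# Data from the question.
x = np.array([0.5, 100.5, 200.5, 400.5, 800.5, 900.5, 1000.5])
y = np.array([0.0001, 0.01, 0.065, 0.85, 0.99, 0.9973, 0.9999])

x_fine = np.linspace(x.min(), x.max(), 1000)
y_pchip = PchipInterpolator(x, y)(x_fine)     # shape-preserving cubic
y_spline = make_interp_spline(x, y)(x_fine)   # ordinary cubic B-spline

# A curve is non-decreasing if no consecutive difference is negative.
print("PCHIP non-decreasing: ", bool(np.all(np.diff(y_pchip) >= 0)))
print("Spline non-decreasing:", bool(np.all(np.diff(y_spline) >= 0)))
```

PCHIP also passes through every data point, so the interpolated values stay within the range of the input percentiles.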
Similar question:
Fully monotone interpolation in python
Code that eventually worked for me:
This link explains the need for monotone cubic interpolation.
from scipy.interpolate import PchipInterpolator
# Interpolate log(y) monotonically, then map back with exp to smooth the data
B_spline_coeff1 = PchipInterpolator(x, np.log(y))
X1_Final = np.linspace(x.min(), x.max(), 1000)
Y1_Final = np.exp(B_spline_coeff1(X1_Final))
#plot subplots
figure(figsize=(18, 6), dpi=80)
plt.subplot(1, 2, 1) # row 1, col 2 index 1
plt.plot(x, y,"ro")
plt.plot(x, y)
plt.title("Original")
plt.xlabel('X')
plt.ylabel('Percentile ')
plt.subplot(1, 2, 2) # index 2
plt.plot(x, y,"ro")
plt.plot(X1_Final, Y1_Final,"green")
plt.title("Spline Plot")
plt.xlabel('X')
plt.ylabel('Percentile ')
plt.show()
What I am trying to do is to play around with some random distribution. I don't want it to be normal. But for the time being normal is easier.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
ws = norm.rvs(4.0, 1.5, size=100)
# 'normed' was removed from np.histogram; 'density' is the supported keyword
density, bins = np.histogram(ws, 50, density=True)
unity_density = density / density.sum()
fig, ((ax1, ax2)) = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(12,6))
widths = bins[:-1] - bins[1:]
ax1.bar(bins[1:], unity_density, width=widths)
ax2.bar(bins[1:], unity_density.cumsum(), width=widths)
fig.tight_layout()
Then what I can do is visualize the CDF in terms of points.
density1=unity_density.cumsum()
x=bins[:-1]
y=density1
plt.plot(x, density1, 'o')
So what I have been trying to do is use the np.interp function on the output of np.histogram to obtain a smooth curve representing the CDF, and then extract the percent points and plot them. Ideally, I need to do this both manually and with the ppf function from scipy.
I always struggled with statistics as an undergraduate. I am in grad school now and try to put myself through as many exercises like this as possible to get a deeper understanding of what is happening. I've reached a point of desperation with this task.
Thank you!
One possibility to get smoother results is to use more samples. With 10^5 samples and 100 bins I get the following images:
ws = norm.rvs(loc=4.0, scale=1.5, size=100000)
density, bins = np.histogram(ws, bins=100, density=True)
In general you could use SciPy's interpolation module to smooth your CDF.
For 100 samples and a smoothing factor of s=0.01 I get:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import splev, splrep
density1 = unity_density.cumsum()
x = bins[:-1]
y = density1
# Interpolation
spl = splrep(x, y, s=0.01, per=False)
x2 = np.linspace(x[0], x[-1], 200)
y2 = splev(x2, spl)
# Plotting
fig, ax = plt.subplots()
plt.plot(x, density1, 'o')
plt.plot(x2, y2, 'r-')
The third possibility is to calculate the CDF analytically. If you generate the noise yourself with a numpy / scipy function most of the time there is already an implementation of the CDF available, otherwise you should find it on Wikipedia. If your samples come from measurements that is of course a different story.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
x = np.linspace(-2, 10)
y = norm(loc=4.0, scale=1.5).cdf(x)
ax.plot(x, y, 'bo-')
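The question also asks about extracting percent points with np.interp and comparing them with ppf. One possible sketch, assuming the empirical CDF is evaluated at the histogram bin edges: swap the roles of the CDF and the bin edges in np.interp to invert it. The probability levels in probs and the fixed seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
ws = norm.rvs(loc=4.0, scale=1.5, size=100000, random_state=rng)

counts, bins = np.histogram(ws, bins=100)
# Empirical CDF at the bin edges: 0 at the left-most edge, 1 at the right-most.
cdf = np.concatenate(([0.0], counts.cumsum() / counts.sum()))

probs = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
# Inverting the CDF: interpolate the edges as a function of the CDF values.
empirical_ppf = np.interp(probs, cdf, bins)
exact_ppf = norm.ppf(probs, loc=4.0, scale=1.5)

print(empirical_ppf)
print(exact_ppf)
```

With only 100 samples the two sets of percent points would agree much less closely, which is the same effect that makes the CDF plot look jagged.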
As a minimal reproducible example, suppose I have the following multivariate normal distribution:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy.stats import multivariate_normal, gaussian_kde
# Choose mean vector and variance-covariance matrix
mu = np.array([0, 0])
sigma = np.array([[2, 0], [0, 3]])
# Create surface plot data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
rv = multivariate_normal(mean=mu, cov=sigma)
Z = np.array([rv.pdf(pair) for pair in zip(X.ravel(), Y.ravel())])
Z = Z.reshape(X.shape)
# Plot it
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
pos = ax.plot_surface(X, Y, Z)
plt.show()
This gives the following surface plot:
My goal is to marginalize this and use Kernel Density Estimation to get a nice and smooth 1D Gaussian. I am running into 2 problems:
Not sure my marginalization technique makes sense.
After marginalizing I am left with a barplot, but gaussian_kde requires actual data (not frequencies of it) in order to fit KDE, so I am unable to use this function.
Here is how I marginalize it:
# find marginal distribution over y by summing over all x
y_distribution = Z.sum(axis=1) / Z.sum() # Do I need to normalize?
# plot bars
plt.bar(y, y_distribution)
plt.show()
and this is the barplot that I obtain:
Next, I follow this StackOverflow question to find the KDE only from "histogram" data. To do this, we resample the histogram and fit KDE on the resamples:
# sample the histogram
resamples = np.random.choice(y, size=1000, p=y_distribution)
kde = gaussian_kde(resamples)
# plot bars
fig, ax = plt.subplots(nrows=1, ncols=2)
ax[0].bar(y, y_distribution)
ax[1].plot(y, kde.pdf(y))
plt.show()
This produces the following plot:
which looks "okay-ish" but the two plots are clearly not on the same scale.
Coding Issue
How come the KDE is coming out on a different scale? Or rather, why is the barplot on a different scale than the KDE?
To further highlight this, I've changed the variance covariance matrix so that we know that the marginal distribution over y is a normal distribution centered at 0 with variance 3. At this point we can compare the KDE with the actual normal distribution as follows:
from scipy.stats import norm
plt.plot(y, norm.pdf(y, loc=0, scale=np.sqrt(3)), label='norm')
plt.plot(y, kde.pdf(y), label='kde')
plt.legend()
plt.show()
This gives:
This means the bar plot is on the wrong scale. What coding issue put the bar plot on the wrong scale?
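A likely explanation, sketched here rather than taken from the thread: dividing by Z.sum() makes the bar heights sum to 1, so each bar is a probability mass per grid point, whereas kde.pdf and norm.pdf return densities that integrate to 1. The two differ by the grid spacing, so dividing the masses by dy should put them on the density scale. The names y_mass, dy and y_density are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([0, 0])
sigma = np.array([[2, 0], [0, 3]])
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
# pdf accepts an array whose last axis holds the (x, y) components.
Z = multivariate_normal(mean=mu, cov=sigma).pdf(np.dstack((X, Y)))

y_mass = Z.sum(axis=1) / Z.sum()  # probability mass per grid point, sums to 1
dy = y[1] - y[0]                  # grid spacing along y
y_density = y_mass / dy           # density, integrates to roughly 1

true_density = norm.pdf(y, loc=0, scale=np.sqrt(3))
print("sum of masses:   ", y_mass.sum())
print("max density error:", np.max(np.abs(y_density - true_density)))
```

Plotting y_density instead of y_distribution next to kde.pdf(y) should then show both curves at comparable heights. A small mismatch remains because the grid only covers -5 to 5 and cuts off the tails.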
A good way to show the concentration of the data points in a plot is using a scatter plot with non-unit transparency. As a result, the areas with more concentration would appear darker.
# this is synthetic example
N = 10000 # a very very large number
x = np.random.normal(0, 1, N)
y = np.random.normal(0, 1, N)
plt.scatter(x, y, marker='.', alpha=0.1) # an area full of dots, darker wherever the number of dots is more
which gives something like this:
Now imagine that we want to emphasize the outliers instead. The situation is almost reversed: we want a plot in which the less-concentrated areas are bolder. (There might be a trick that works for my simple example, but imagine a general case where the distribution of the points is not known in advance, or where it is difficult to define a rule for the transparency or color weight.)
I was wondering whether there is anything as handy as alpha that is designed specifically for this job. Other ideas for emphasizing outliers are also welcome.
UPDATE: This is what happens when more than one data point is scattered on the same area:
I'm looking for something like the picture below: the more data points, the less transparent the marker.
To answer the question: You can calculate the density of points, normalize it and encode it in the alpha channel of a colormap.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
# this is synthetic example
N = 10000 # a very very large number
x = np.random.normal(0, 1, N)
y = np.random.normal(0, 1, N)
fig, (ax,ax2) = plt.subplots(ncols=2, figsize=(8,5))
ax.scatter(x, y, marker='.', alpha=0.1)
values = np.vstack([x,y])
kernel = stats.gaussian_kde(values)
weights = kernel(values)
weights = weights/weights.max()
cols = plt.cm.Blues([0.8, 0.5])
cols[:,3] = [1., 0.005]
cmap = LinearSegmentedColormap.from_list("", cols)
ax2.scatter(x, y, c=weights, s = 1, marker='.', cmap=cmap)
plt.show()
Left is the original image, right is the image where higher density points have a lower alpha.
Note, however, that this is undesirable, because high density transparent points are indistinguishable from low density. I.e. in the right image it really looks as though you have a hole in the middle of your distribution.
Clearly, a solution with a colormap which does not contain the color of the background is a lot less confusing to the reader.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# this is synthetic example
N = 10000 # a very very large number
x = np.random.normal(0, 1, N)
y = np.random.normal(0, 1, N)
fig, ax = plt.subplots(figsize=(5,5))
values = np.vstack([x,y])
kernel = stats.gaussian_kde(values)
weights = kernel(values)
weights = weights/weights.max()
ax.scatter(x, y, c = weights, s=9, edgecolor="none", marker='.', cmap="magma")
plt.show()
Here, low density points are still emphasized by darker color, but at the same time it's clear to the viewer that the highest density lies in the middle.
As far as I know, there is no "direct" solution to this quite interesting problem. As a workaround, I propose this solution:
N = 10000 # a very very large number
x = np.random.normal(0, 1, N)
y = np.random.normal(0, 1, N)
fig = plt.figure() # create figure directly to be able to extract the bg color
ax = fig.gca()
ax.scatter(x, y, marker='.') # plot all markers without alpha
bgcolor = ax.get_facecolor() # extract current background color
# plot with alpha, "overwriting" dense points
ax.scatter(x, y, marker='.', color=bgcolor, alpha=0.2)
This will plot all points without transparency and then plot all points again with some transparency, "overwriting" those points with the highest density the most. Setting alpha to a higher value will put more emphasis on outliers, and a lower value less.
Of course the color of the second scatter plot needs to be adjusted to your background color. In my example this is done by extracting the background color and setting it as the new scatter plot's color.
This solution is independent of the kind of distribution. It only depends on the density of the points. However it produces twice the amount of points, thus may take slightly longer to render.
Reproducing the edit in the question, my solution shows exactly the desired behavior. The leftmost point is a single point and is the darkest; the rightmost consists of three points and has the lightest color.
x = [0, 1, 1, 2, 2, 2]
y = [0, 0, 0, 0, 0, 0]
fig = plt.figure() # create figure directly to be able to extract the bg color
ax = fig.gca()
ax.scatter(x, y, marker='.', s=10000) # plot all markers without alpha
bgcolor = ax.get_facecolor() # extract current background color
# plot with alpha, "overwriting" dense points
ax.scatter(x, y, marker='.', color=bgcolor, alpha=0.2, s=10000)
Assuming that the distributions are centered around a specific point (e.g. (0,0) in this case), I would use this:
import numpy as np
import matplotlib.pyplot as plt
N = 500
# 0 mean, 0.2 std
x = np.random.normal(0,0.2,N)
y = np.random.normal(0,0.2,N)
# calculate the distance to (0, 0).
color = np.sqrt((x-0)**2 + (y-0)**2)
plt.scatter(x , y, c=color, cmap='plasma', alpha=0.7)
plt.show()
Results:
I don't know if this helps, because it's not exactly what you asked for, but you can simply color the points whose values exceed some threshold. For example:
import numpy as np
import matplotlib.pyplot as plt
num = 100
threshold = 80
x = np.linspace(0, 100, num=num)
y = np.random.normal(size=num)*45
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(x[np.abs(y) < threshold], y[np.abs(y) < threshold], color="#00FFAA")
ax.scatter(x[np.abs(y) >= threshold], y[np.abs(y) >= threshold], color="#AA00FF")
plt.show()
I have a graph where the x-axis is the temperature in GeV, but I also need to show the temperature in Kelvin for reference, so I thought of adding a parasite axis with the temperature in K. I tried to follow the answer to "How to add a second x-axis in matplotlib"; my code is below. I get a second axis at the top of my graph, but it does not show the temperature in K as I need.
import numpy as np
import matplotlib.pyplot as plt
tt = np.logspace(-14,10,100)
yy = np.logspace(-10,-2,100)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = ax1.twiny()
ax1.loglog(tt,yy)
ax1.set_xlabel('Temperature (GeV)')
new_tick_locations = np.array([.2, .5, .9])
def tick_function(X):
    # convert GeV to Kelvin and format as tick labels
    V = X*1.16e13
    return ["%.1f" % z for z in V]
ax2.set_xlim(ax1.get_xlim())
ax2.set_xticks(new_tick_locations)
ax2.set_xticklabels(tick_function(new_tick_locations))
ax2.set_xlabel('Temp (Kelvin)')
plt.show()
This is what I get when I run the code.
I need the parasite axis to be proportional to the original x-axis, so that anyone looking at the graph can easily read off the temperature in Kelvin. Thanks in advance.
A general purpose solution may look as follows. Since you have a non-linear scale, the idea is to find the positions of nice ticks in Kelvin, convert to GeV, set the positions in units of GeV, but label them in units of Kelvin. This sounds complicated, but the advantage is that you do not need to find the ticks yourself, just rely on matplotlib for finding them.
What this requires though is the functional dependence between the two scales, i.e. the conversion between GeV and Kelvin and its inverse.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
tt = np.logspace(-14,10,100)
yy = np.logspace(-10,-2,100)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = ax1.twiny()
plt.setp([ax1,ax2], xscale="log", yscale="log")
ax1.get_shared_x_axes().join(ax1, ax2)
ax1.plot(tt,yy)
ax1.set_xlabel('Temperature (GeV)')
ax2.set_xlabel('Temp (Kelvin)')
fig.canvas.draw()
# 1 GeV == 1.16 × 10^13 Kelvin
Kelvin2GeV = lambda k: k / 1.16e13
GeV2Kelvin = lambda gev: gev * 1.16e13
loc = mticker.LogLocator()
locs = loc.tick_values(*GeV2Kelvin(np.array(ax1.get_xlim())))
ax2.set_xticks(Kelvin2GeV(locs))
ax2.set_xlim(ax1.get_xlim())
f = mticker.ScalarFormatter(useOffset=False, useMathText=True)
g = lambda x,pos : "${}$".format(f._formatSciNotation('%1.10e' % GeV2Kelvin(x)))
fmt = mticker.FuncFormatter(g)
ax2.xaxis.set_major_formatter(mticker.FuncFormatter(fmt))
plt.show()
The problem appears to be the following: When you use ax2.set_xlim(ax1.get_xlim()), you are basically setting the limit of upper x-axis to be the same as that of the lower x-axis. Now if you do
print(ax1.get_xlim())
print(ax2.get_xlim())
you get for both axes the same values as
(6.309573444801943e-16, 158489319246.11108)
(6.309573444801943e-16, 158489319246.11108)
but your lower x-axis has a logarithmic scale. When you assign the limits using ax2.set_xlim(), the limits of ax2 are the same but its scale is still linear. That's why, when you set the ticks at [.2, .5, .9], these values appear as ticks on the far left of the upper x-axis, as in your figure.
The solution is to set the upper x-axis also to be a logarithmic scale. This is required because your new_tick_locations corresponds to the actual values on the lower x-axis. You just want to rename these values to show the ticklabels in Kelvin. It is clear from your variable names that new_tick_locations corresponds to the new tick locations. I use some modified values of new_tick_locations to highlight the problem.
I am using scientific formatting '%.0e' because 1 GeV = 1.16e13 K and so 0.5 GeV would be a very large value with many zeros.
Below is a sample answer:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
tt = np.logspace(-14,10,100)
yy = np.logspace(-10,-2,100)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = ax1.twiny()
ax1.loglog(tt,yy)
ax1.set_xlabel('Temperature (GeV)')
new_tick_locations = np.array([0.000002, 0.05, 9000])
def tick_function(X):
    # convert GeV to Kelvin and format in scientific notation
    V = X*1.16e13
    return ["%.0e" % z for z in V]
ax2.set_xscale('log') # Setting the logarithmic scale
ax2.set_xlim(ax1.get_xlim())
ax2.set_xticks(new_tick_locations)
# Fixed labels in Kelvin; setting a formatter afterwards would replace them
ax2.set_xticklabels(tick_function(new_tick_locations))
ax2.set_xlabel('Temp (Kelvin)')
plt.show()
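Another option, not used in the answers above, is Matplotlib's secondary_xaxis (available in Matplotlib 3.1 and later), which takes the conversion and its inverse and keeps the top axis tied to the bottom one. The sketch below assumes the same conversion factor of 1.16e13 K per GeV; the helper names gev_to_kelvin and kelvin_to_gev are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

GEV_TO_KELVIN = 1.16e13  # 1 GeV == 1.16e13 K, as stated in the answers above

def gev_to_kelvin(gev):
    return gev * GEV_TO_KELVIN

def kelvin_to_gev(kelvin):
    return kelvin / GEV_TO_KELVIN

tt = np.logspace(-14, 10, 100)
yy = np.logspace(-10, -2, 100)

fig, ax1 = plt.subplots()
ax1.loglog(tt, yy)
ax1.set_xlabel('Temperature (GeV)')

# The secondary axis follows the parent axis through the (forward, inverse) pair.
ax2 = ax1.secondary_xaxis('top', functions=(gev_to_kelvin, kelvin_to_gev))
ax2.set_xlabel('Temp (Kelvin)')

fig.canvas.draw()  # forces the secondary axis to sync its limits
```

Call plt.show() to display the figure. With this approach the Kelvin ticks are chosen automatically, and the top axis should stay consistent with the bottom one when the limits change.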