I'm trying to plot a probability distribution (say, the probability of k events). The horizontal axis should be on a log scale, since the behavior at large k looks like k^{-alpha}, which comes out as a straight line for large k on a log-log plot.
But k = 0 happens, and 0 has no place on a log axis.
I want to plot this in a way that is easy to interpret.
For an example, consider a probability defined so that p_0 = 0.5 and, for k = 1, 2, 3, ..., p_k = C k^{-2}, where (if I've calculated correctly) C = 3/pi^2. This sums to 1 and produces a nice straight line for k > 0, but obviously I can't stick 0 in. Nevertheless, it's important that the person looking at the image understands that 0 exists and has significant probability.
I'm using matplotlib (in Python), but really I'm interested in how to visualize this at all; the implementation can be sorted out later.
To put 0 into the plot, you can apply a symlog scale to the x axis and a log scale to the y axis. I'm putting some code here in case you are not familiar with matplotlib, so you can start with the code below. For details, please check the documentation.
import numpy as np
import matplotlib.pyplot as plt

n = 100
x = np.arange(0, n)                   # k = 0, 1, ..., n - 1
y = 3 / (np.pi * np.pi) / x[1:]**2    # p_k = C * k^{-2} for k >= 1
y = np.concatenate([[0.5], y])        # prepend p_0 = 0.5
fig, ax = plt.subplots(1, 1, figsize=(7.2, 7.2))
ax.plot(x, y, 'x')
ax.set_xlim(-1, n)
ax.set_xscale('symlog')               # linear around 0, logarithmic beyond
ax.set_yscale('log')
plt.show()
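If the linear region around zero comes out too wide or too narrow, the symlog threshold can be tuned. A minimal sketch (assumption: the keyword is linthresh in recent matplotlib versions; older versions use linthreshx for the x axis):

ax.set_xscale('symlog', linthresh=1)  # width of the linear region around 0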
Hi, I created a program that generates deviations from a real trajectory; it is complicated, and unfortunately I do not have a simple example.
It calculates a path with stochastic initial conditions from the real path and does this for x iterations; the goal is to show that the deviations become larger at greater times.
The real path and the deviations are shown below.
However, I want to show that the deviations grow the further along in time we are. Of course I could just calculate the variance and plot mean + var and mean - var at each time step, but I was wondering if I could plot something like this using hist2d.
You can see that the blocks are not as smooth as I would like, and this is not that great to use.
Then I looked at Python's KDE and created the following.
This is also not preferable, as I think it bins more points at the minima and maxima. It is also 'too smeared out'. Especially in the beginning, all the points are the same, so I want just a straight line there to really show that the deviations only start later on.
I guess my question is: is what I want even possible, and what package/command should I use? I haven't found what I am looking for in other questions. Or does anyone have a suggestion for a nicer way to show what I want?
Here is an idea: plot multiple curves with transparency on top of each other:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 200)
for _ in range(1000):
    # each perturbed curve is almost transparent; overlaps build up darker regions
    plt.plot(x, np.sin(x * np.random.normal(1, 0.1)) * np.random.normal(1, 0.1),
             color='r', alpha=0.02)
plt.plot(x, np.sin(x), color='b')  # the real curve on top
plt.margins(x=0)
plt.show()
Another option is to create a 2D histogram:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 200)
all_curves = np.array([np.sin(x * np.random.normal(1, 0.1)) * np.random.normal(1, 0.1)
                       for _ in range(100)])
# tile the x grid once per curve so every sample gets an (x, y) pair
plt.hist2d(x=np.tile(x, all_curves.shape[0]), y=all_curves.ravel(),
           bins=(100, 100), cmap='inferno')
plt.show()
Still another approach would use fill_between (as suggested by @bramb) between confidence intervals:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 200)
all_curves = np.array([np.sin(x * np.random.normal(1, 0.1)) * np.random.normal(1, 0.1)
                       for _ in range(1000)])
# draw nested confidence bands (95%, 80% and 50%); overlaps appear darker
for ci in [95, 80, 50]:
    low = np.percentile(all_curves, 50 - ci / 2, axis=0)
    high = np.percentile(all_curves, 50 + ci / 2, axis=0)
    plt.fill_between(x, low, high, color='r', alpha=0.2)
plt.plot(x, np.sin(x), color='b')
plt.margins(x=0)
plt.show()
You could use something like the matplotlib.pyplot.fill_between method. It fills everything between y1 (max) and y2 (min) for a given (common) x array. You would then be able to accentuate that the filled region keeps widening with increasing x.
However, this requires you to find the minimal and maximal value of your deviations at each time point and save these to two separate arrays. Exactly how to do this depends on how you are storing the individual runs.
In case they are separate lists or arrays, you can convert them to a numpy array (or pandas DataFrame) and use the minimum/maximum methods along the relevant axis, as sketched below.
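A minimal sketch of that min/max idea, assuming the runs are equal-length 1D arrays over a common time grid (the data below is a stand-in, since the original simulation isn't available):

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 10, 200)
# placeholder runs standing in for the stochastic trajectories
runs = [np.sin(t * np.random.normal(1, 0.1)) for _ in range(100)]
all_runs = np.array(runs)           # shape (n_runs, n_times)
low = all_runs.min(axis=0)          # pointwise minimum across runs
high = all_runs.max(axis=0)         # pointwise maximum across runs
plt.fill_between(t, low, high, alpha=0.3)
plt.plot(t, np.sin(t))              # the real trajectory
plt.margins(x=0)
plt.show()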
I am having some trouble understanding the proper way to marginalize variables out of probability distributions. As I understand it, the proper way is to sum over the variables being marginalized out, leaving only the variables to be kept. For a normal distribution, the result is also a normal distribution. I can show this part with equations and by doing the integrals, but when I try to check it in Python I get incorrect results: the peak of the resulting distribution is much too high.
Here is an example (the code is from Marginalize a surface plot and use kernel density estimation (kde) on it):
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy.stats import multivariate_normal, gaussian_kde
# Choose mean vector and variance-covariance matrix
mu = np.array([0, 0])
sigma = np.array([[2, 0], [0, 3]])
# Create surface plot data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
rv = multivariate_normal(mean=mu, cov=sigma)
Z = np.array([rv.pdf(pair) for pair in zip(X.ravel(), Y.ravel())])
Z = Z.reshape(X.shape)
# Plot it
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
pos = ax.plot_surface(X, Y, Z)
plt.show()
This makes a plot of a two-variable normal distribution. If I sum over one of the variables to get a marginal distribution,
Zmarg_y = Z.sum(axis=0)  # sum over the rows of the grid (the y direction)
plt.plot(x, Zmarg_y)
plt.show()
the result is not the same as if I simply drop the variable instead of marginalizing it out. I tried this also with a three-variable Gaussian distribution, where I marginalized out one variable to get a two-variable distribution; the result was likewise on a higher scale. Is there a problem with normalization here? I am studying probability for the first time and am trying to understand every single detail, and I think I am misunderstanding something important about this. Thank you.
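Edit: for reference, here is the check I am trying to pass, continuing from the code above (my assumption: scaling the sum by the grid spacing dy should turn it into a Riemann approximation of the integral over y, in which case the marginal of x would be N(0, 2) given the covariance matrix above):

from scipy.stats import norm

dy = y[1] - y[0]               # grid spacing of the y axis
marg_x = Z.sum(axis=0) * dy    # Riemann sum approximating the integral over y
plt.plot(x, marg_x, label='sum * dy')
plt.plot(x, norm(0, np.sqrt(2)).pdf(x), '--', label='analytic N(0, 2)')
plt.legend()
plt.show()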
I have an equation, represented below:
sin(x)*sin(y)*sin(z)+cos(x)*sin(y)*cos(z)=0
I know the code to plot a function z = f(x, y) using matplotlib, but I don't know the code to plot the function above. I tried MATLAB MuPAD code, which is as follows:
Plot(sin(x)*sin(y)*sin(z)+cos(x)*sin(y)*cos(z),#3d)
This will be much easier if you can isolate z. Your equation is the same as sin(z)/cos(z) = -cos(x)*sin(y)/(sin(x)*sin(y)), so z = atan(-cos(x)/sin(x)) (the sin(y) factors cancel wherever sin(y) != 0).
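For instance, a quick sketch of one branch of that surface (the ranges are my own choice, and sin(x) = 0 is avoided so the division is safe):

from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0.1, np.pi - 0.1, 100)   # stay away from sin(x) = 0
y = np.linspace(0, 2 * np.pi, 100)
X, Y = np.meshgrid(x, y)
Z = np.arctan(-np.cos(X) / np.sin(X))    # one branch of the solution set
fig = plt.figure()
ax = plt.axes(projection="3d")
ax.plot_surface(X, Y, Z)
plt.show()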
Please don't take this the wrong way, but I think the equation you want to plot can be reduced to a simple 2D plot.
sin(x)*sin(y)*sin(z) + cos(x)*sin(y)*cos(z) = 0
sin(y)*[sin(x)*sin(z) + cos(x)*cos(z)] = 0
sin(y)*cos(x - z) = 0
Hence sin(y) = 0 or cos(x - z) = 0,
so y = n*pi ... (1)
or x - z = (2*n + 1)*pi/2, i.e. x = z + (2*n + 1)*pi/2 ... (2)
For (1), each n gives a straight horizontal line y = n*pi; in the second case you get parallel lines (slope 1 in the x-z plane) that cut the x axis at (2*n + 1)*pi/2, with consecutive lines spaced pi apart along the x axis (keeping n constant picks out a single line).
Assuming sin(y) is never zero (i.e. y is not a multiple of pi), you could simplify this to a 2D plot with just x and z.
And to answer your original question: you need mplot3d to make 3D plots. But as with any graphing tool, you need values or points for x, y, z (you can compute the candidate points programmatically). Then you feed those points to the plot, like below.
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes(projection="3d")
xs = []  # X values (fill in with computed solution points)
ys = []  # Y values
zs = []  # Z values
ax.plot3D(xs, ys, zs)
plt.show()
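For instance, a sketch of how such points could be generated for the first solution family, y = n*pi (the ranges and the values of n are my own assumptions):

from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes(projection="3d")
# each integer n gives a plane y = n*pi on which the equation holds
X, Z = np.meshgrid(np.linspace(-5, 5, 20), np.linspace(-5, 5, 20))
for n in (-1, 0, 1):
    ax.plot_surface(X, np.full_like(X, n * np.pi), Z, alpha=0.4)
plt.show()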
I want to plot a KDE for some data that covers a large range of x values, so I want to use a logarithmic scale for the x axis. For plotting I was using seaborn and the solution from Plotting 2D Kernel Density Estimation with Python, both of which fail once I set the x scale to logarithmic. When I take the logarithm of my x data beforehand, everything looks fine, except the ticks and tick labels are still linear, with the logarithm of the actual values as the labels. I could manually change the ticks using something like:
labels = np.array(ax.get_xticks().tolist(), dtype=np.float64)
new_labels = [r'$10^{%.1f}$' % (labels[i]) for i in range(len(labels))]
ax.set_xticklabels(new_labels)
but to my eyes that just looks wrong and is nothing like the axis labels (including the minor ticks) that I would get by simply using
ax.set_xscale('log')
Is there an easier way to plot a KDE with logarithmic x data? Or is it possible to change just the tick or label scale without changing the scaling of the data, so that I could plot the logarithmic values of x and fix the scaling of the labels afterwards?
Edit:
The plot I want to create looks like this:
The two right columns are what it is supposed to look like. There I used the x data with the logarithm already applied. I don't like the labels on the x axis, though.
The left column shows the plots when the original data is used for the KDE and all the other plots, with the scale changed afterwards using
ax.set_xscale('log')
For some reason the KDE does not look the way it is supposed to. This is not a result of erroneous data either, since everything looks fine when the logarithmic data is used.
Edit 2:
A working code example:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = np.random.multivariate_normal((0, 0), [[0.8, 0.05], [0.05, 0.7]], 100)
x = np.power(10, data[:, 0])
y = data[:, 1]
fig, ax = plt.subplots(2, 1)
sns.kdeplot(data=np.log10(x), data2=y, ax=ax[0])  # KDE on pre-logged x values
sns.kdeplot(data=x, data2=y, ax=ax[1])            # KDE on raw x, log scale set below
ax[1].set_xscale('log')
plt.show()
The ax[1] plot is not displayed correctly for me (the x-axis is inverted), but the general behavior is the same as for the case described above. I believe the problem lies with the bandwidth of the kde, which should probably account for the logarithmic x-data.
I found an answer that works for me and wanted to post it in case someone else has a similar problem.
Based on the accepted answer from this post, I defined a function that first applies the logarithm to the x data and, after the KDE is performed, transforms the x values back to the original values. Afterwards I can simply plot the contours and use ax.set_xscale('log').
import numpy as np
import scipy.stats as st

def logx_kde(x, y, xmin, xmax, ymin, ymax):
    # note: xmin and xmax are expected in log10 units
    x = np.log10(x)
    # Perform the kernel density estimate on the logged x values
    xx, yy = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
    positions = np.vstack([xx.ravel(), yy.ravel()])
    values = np.vstack([x, y])
    kernel = st.gaussian_kde(values)
    f = np.reshape(kernel(positions).T, xx.shape)
    # transform the grid back to the original scale for plotting
    return np.power(10, xx), yy, f
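A minimal usage sketch (assuming the x and y arrays from the example further up; note that xmin and xmax are passed in log10 units here, because the grid is built after taking the logarithm):

import matplotlib.pyplot as plt

xx, yy, f = logx_kde(x, y, np.log10(x).min(), np.log10(x).max(), y.min(), y.max())
fig, ax = plt.subplots()
ax.contour(xx, yy, f)
ax.set_xscale('log')
plt.show()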
I have a very large and sparse dataset of spam Twitter accounts, and I need to scale the x axis in order to visualize the distribution (histogram, KDE, etc.) and CDF of the various variables (tweets_count, number of followers/following, etc.).
> describe(spammers_class1$tweets_count)
   var       n   mean      sd median trimmed mad min    max  range  skew kurtosis   se
1    1 1076817 443.47 3729.05     35   57.29  43   0 669873 669873 53.23  5974.73 3.59
In this dataset the value 0 has huge importance (actually, 0 should have the highest density). However, on a logarithmic scale these values are dropped. I thought of changing the zeros to 0.1, for example, but it would make no sense for a spam account to have 10^-1 followers.
So, what would be a workaround in Python and matplotlib?
Add 1 to each x value, then take the log:
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.ticker as ticker

fig, ax = plt.subplots()
x = [0, 10, 100, 1000]
y = [100, 20, 10, 50]
x = np.asarray(x) + 1          # shift so that 0 maps to 1 (log(1) = 0)
y = np.asarray(y)
ax.plot(x, y)
ax.set_xscale('log')
ax.set_xlim(x.min(), x.max())
# label the ticks with the original (unshifted) values
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{0:g}'.format(x - 1)))
ax.xaxis.set_major_locator(ticker.FixedLocator(x))
plt.show()
Use
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{0:g}'.format(x-1)))
ax.xaxis.set_major_locator(ticker.FixedLocator(x))
to relabel the tick marks according to the non-log values of x.
(My original suggestion was to use plt.xticks(x, x - 1), but that acts on whichever axes happens to be current. To pin the changes to one particular axes, I changed all the calls to go through ax rather than plt.)
matplotlib removes points which contain a NaN, inf or -inf value. Since log(0) is -inf, the point corresponding to x=0 would be dropped from a log plot.
If you increase all the x values by 1, then since log(1) = 0, the point corresponding to x=0 will now be plotted at x = log(1) = 0 on the log plot.
The remaining x values are also shifted by one, but this is imperceptible, since log(x+1) is very close to log(x) for large values of x (for example, log10(1001) is about 3.0004 versus log10(1000) = 3).
You could set the lower x limit to 0 instead:
ax1.set_xlim(0, 1e3)
Here is the example from the matplotlib documentation; there it sets the limit values of the axes this way:
ax1.set_xlim(1e1, 1e3)
ax1.set_ylim(1e2, 1e3)