I'm using matplotlib to plot a probability distribution which looks something like the sum of a Pareto distribution and a Gaussian with positive mean. In other words it has very large values near 0, a small local minimum at x = a, a local maximum at x = b > a, and a long right tail decaying to 0.
I'd like to set the y limits based only on those values to the right of the local minimum, i.e. cut off the left-most values so as to focus on the local maximum. I know I can do this with:
plt.plot(pdf)
plt.ylim((0, local_maximum))
However, this sets ymax to exactly the value of the local maximum, which makes the plot look ugly for two reasons:
the local maximum touches the top boundary of the plot, with no space above
the y-axis maximum is not a round multiple of the ytick spacing, so it's not obvious what the maximum value is
Matplotlib's algorithm for choosing default axis limits is pretty good, so my current hack is to plot twice: the first time I plot only the data above the local minimum, just to get a good ylim, and the second time I plot all the data, as follows:
fig, ax = plt.subplots()
# first plot the data above the local minimum x=a, just to get a good ymax
plt.plot(pdf[a:])
ymin, ymax = plt.ylim()
# now plot all the data using the nice ymax
fig.clear()
plt.ylim((0, ymax))
plt.plot(pdf)
This gives me a good ymax that is a round multiple of the ytick spacing and fits all the data up to the local maximum, with a bit of whitespace above.
Is there a better way, that doesn't require plotting twice?
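A possible alternative, sketched below with made-up stand-ins for pdf and a, is to plot once and compute ymax yourself from the data to the right of the local minimum; note that this only reproduces the default 5% headroom, not the rounding to a tick multiple:
import numpy as np
import matplotlib.pyplot as plt

# toy stand-ins for the question's pdf and the index a of its local minimum
x = np.linspace(0.01, 10, 500)
pdf = 0.05 / x + np.exp(-0.5 * (x - 4) ** 2)
a = np.argmin(pdf[:300])  # index of the local minimum

fig, ax = plt.subplots()
ax.plot(pdf)
ax.set_ylim(0, 1.05 * pdf[a:].max())  # 5% headroom above the local maximum
plt.show()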
I have the equation: z(x,y)=1+x^(2/3)y^(-3/4)
I would like to calculate values of z for x = [0, 100] and y = [10^1, 10^4], using 100 points in each axis direction, so my grid will be 100x100 points. In the x-direction I want the points spaced linearly; in the y-direction I want them spaced logarithmically.
Were I to need these values I could easily go through the following:
import numpy as np

x = np.linspace(0, 100, 100)
y = np.logspace(1, 4, 100)
z = np.zeros((len(x), len(y)))
for i in range(len(x)):
    for j in range(len(y)):
        z[i, j] = 1 + x[i]**(2/3) * y[j]**(-3/4)
The problem for me comes with visualizing these results. I know that I would need to create a grid of points. I feel my options are to create a meshgrid with the values and then use pcolor.
My issue here is that the values at the centres of the blocks do not coincide with the calculated values. In the x-direction I could fix this by shifting the x-vector by half of dx (the step between successive values), but I'm not sure how I would do that for the y-axis. Furthermore, if I wanted to compute values for each of the y-direction values, including the end points, they would not all show up.
In the final visualization I would like to have the y-axis as a log scale and the x axis as a linear scale. I would also like the tick marks to fall in the center of the cells, correlating with the correct value. Can someone point me to the correct plotting functions for this. I have to resolve the issue using pcolor or pcolormesh.
Should you require more details, please let me know.
In current matplotlib, you can use pcolormesh with shading='nearest', and it will center the cells on the given coordinate values:
import numpy as np
import matplotlib.pyplot as plt

# x, y, z as computed in the question
y_plot = np.log10(y)
z[5, 5] = 0 # to make it more evident
plt.pcolormesh(x, y_plot, z, shading="nearest")
plt.colorbar()
ax = plt.gca()
ax.set_xticks(x)
ax.set_yticks(y_plot)
plt.axvline(x[5])
plt.axhline(y_plot[5])
Output:
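If you would rather keep the actual y values and display them on a logarithmic axis (as the question asks), a variation of the same idea could look like the sketch below; this is not part of the original answer, and it rebuilds the question's arrays in vectorized form:
import numpy as np
import matplotlib.pyplot as plt

# rebuild the question's arrays (vectorized form of the loops above)
x = np.linspace(0, 100, 100)
y = np.logspace(1, 4, 100)
z = 1 + x[:, None]**(2/3) * y[None, :]**(-3/4)

fig, ax = plt.subplots()
mesh = ax.pcolormesh(x, y, z.T, shading="nearest")  # z.T so that rows correspond to y
ax.set_yscale("log")  # show the actual y values on a logarithmic axis
fig.colorbar(mesh)
plt.show()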
I am just learning some basics of Data Analysis.
I have a simple csv data file like the one below.
START,FIRST,SECOND,ITEM
1,100,200,A
2,100,200,B
2,100,300,C
2,200,300,D
3,200,100,E
3,200,100,F
3,200,100,G
3,200,100,H
3,200,100,I
3,200,100,J
I wrote this small program to read this csv file and then print a histogram using matplotlib for the three columns START, FIRST, and SECOND. I also print a scatter plot for FIRST vs SECOND columns.
#!/exp/anaconda3/bin/python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
file_name = 'junk.csv'
data = pd.read_csv(file_name)
print(data.describe())
plt.rcParams['axes.grid'] = True
fig, axs = plt.subplots(2, 2, figsize=(15, 10))
axs[0, 0].hist(data['START'], 100, density=True, facecolor='g', alpha=0.8)
axs[1, 0].scatter(data['FIRST'], data['SECOND'], facecolor='violet')
axs[0, 1].hist(data['FIRST'], 100, density=True, facecolor='r', alpha=0.8)
axs[1, 1].hist(data['SECOND'], 100, density=True, facecolor='b', alpha=0.8)
plt.show()
What I do not understand is, in the histogram plots (for example the bottom-right image with blue bars in the attached picture), why it does not simply plot how many times the number 200 occurs, instead of showing that 200 occurs 0.10 times. How is that possible? The same goes for 300.
Can someone help me understand how matplotlib is coming up with these y-axis values? They do not make sense to me.
Thank you.
Ruby Drew
Try density=False. The density parameter tells matplotlib whether or not to normalise the bar heights so that they represent a probability density.
First note that a histogram is primarily meant to count continuous samples in small bins. For discrete data, the bins should be chosen carefully, with boundaries nicely in-between the values. When you add bins=N, matplotlib assumes a continuous distribution and subdivides the space from the smallest to the largest sample into N equally-sized bins. For discrete data this can have unexpected side effects, such as samples that fall exactly on a bin boundary ending up in either of the two adjacent bins.
With density=True, the heights of the bars are rescaled such that the total area of all bins sums to 1. For a continuous distribution with many samples, this resembles the probability density function and can be used to draw a kde plot with the same y-axis.
So, what's happening in the blue histogram:
100 bins are created between 100 and 300. Each bin will be 2 wide.
3 bins get values: the bin 100-102 gets a count of 6, either the bin 198-200 or the bin 200-202 gets a count of 2, and the bin 298-300 also gets a count of 2.
The total count is 10. As the bins are 2 wide, the histogram counts need to be divided by (total_count * bin_width) = 10 * 2 to obtain a total area of 1, giving heights of 0.3, 0.1 and 0.1.
Indeed, the sum of height times width over the bars is 1: 0.3*2 + 0.1*2 + 0.1*2 = 1.
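You can reproduce these numbers directly with numpy.histogram (a quick check using the SECOND column from the sample file; np.histogram is what plt.hist uses internally):
import numpy as np

second = np.array([200, 200, 300, 300, 100, 100, 100, 100, 100, 100])
counts, edges = np.histogram(second, bins=100)              # raw counts per bin
density, _ = np.histogram(second, bins=100, density=True)   # bar heights used by density=True
width = np.diff(edges)[0]                                   # each bin is 2 wide

print(counts[counts > 0])        # [6 2 2]
print(density[density > 0])      # [0.3 0.1 0.1]
print((density * width).sum())   # 1.0 (up to floating-point rounding)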
Seaborn's histplot (since version 0.11) has a parameter discrete= to indicate that a distribution is discrete, and a parameter stat= where you can choose between 'count' for bin heights showing the usual counts and 'probability' for heights relative to their probability, mimicking a probability mass function. The blue histogram could be drawn as:
import seaborn as sns

# data and axs as defined in the question
sns.histplot(data, x='SECOND', discrete=True, stat='probability', facecolor='b', alpha=0.8, ax=axs[1, 1])
I am trying to do a plot of a seismic wave using plt.contour.
I have 3 arrays:
time (x-axis)
frequency (y-axis)
amplitude (z-axis)
This is my result so far:
The problem is that I want to change the scaling of the colorbar: I'd like a smooth gradation, without this white colour where the amplitude is low. I am not able to do so, even though I have spent a lot of time browsing the docs.
I read that plt.pcolormesh is not appropriate here (it only works here because I am in a special case), but this is what I want to get in terms of colours and colorbar:
This is the code I wrote:
T = len(time[0])*(time[0][1] - time[0][0]) # multiply ampFFT with T to offset
Z = abs(ampFFT)*(T) # abbreviation
# freq = frequency, ampFFT = Fast Fourier Transform of the amplitude of the wave
# freq, amFFT and time have same dimensions: 40 x 1418 (40 steps of time discretization x steps to have the total time. 2D because it is easier to use)
maxFreq = abs(freq).max() # maximum frequency for plot boundaries
maxAmpFFT = abs(Z).max()/2 # maximum ampFFT for the colorbar boundaries, divided by 2 to scale better with the colors
minAmpFFT = abs(Z).min()
plt.figure(1)
plt.contour(time, freq, Z, vmin=minAmpFFT, vmax=maxAmpFFT)
plt.colorbar()
plt.ylim(0,maxFreq) # 0 to remove the negative frequencies useless here
plt.title("Amplitude intensity regarding to time and frequency")
plt.xlabel('time (in seconds)')
plt.ylabel('frequency (in Hz)')
plt.show()
Thank you for your attention!
NB: in case you were wondering about plt.pcolormesh: the plot is completely messed up when I increase the time discretization (here I split the time into 40 steps, but when I split it into 1000 steps the plot is not correct, and I want to be able to split the time into smaller pieces).
EDIT: When I use plt.contourf instead of plt.contour I get this plot:
which is not really convincing either. I understand why the yellow colour takes up so much space (it is because I set a low vmax), but I don't understand why there is still white colour in my plot.
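One common reason for white regions with contourf is that values above the highest contour level are simply not drawn; passing extend='max' (or 'both') paints them with the colormap's end colour instead. A self-contained sketch of that idea, with toy arrays standing in for the real time/freq/Z data:
import numpy as np
import matplotlib.pyplot as plt

# toy stand-ins for the question's time, freq and Z arrays
t = np.linspace(0, 10, 200)
f = np.linspace(0, 5, 100)
time, freq = np.meshgrid(t, f)
Z = np.exp(-(freq - 0.5)**2) * (1 + np.sin(time))

vmax = Z.max() / 2                       # deliberately lower than the data maximum, as in the question
levels = np.linspace(Z.min(), vmax, 50)  # explicit contour levels up to vmax
plt.contourf(time, freq, Z, levels=levels, extend='max')  # colour values above vmax instead of leaving them white
plt.colorbar()
plt.show()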
EDIT 2: My teacher plotted my data, and I have the correct data. The only problem left is the white background in my plot (and the deep blue on the left and right borders, for no apparent reason, when I use plt.contourf). Despite those problems, the highest amplitude is located around 0.5 Hz, which is in agreement with my teacher's work.
He used gnuplot, but since I don't know gnuplot, I prefer to use python.
Solution/Workaround I found
Here is what I did to display my data the way contourf does, but without the display problems:
Explanation: for the surface, I took abs(freq) instead of just freq because I have negative frequencies.
This is because when you compute the frequencies of an FFT, the spectrum repeats itself, like this:
There are two ways of laying out these frequencies:
- all frequencies are positive, and the array spans twice the Nyquist frequency (so if you keep only the first half of the array, you have the whole spectrum without repetition);
- the frequencies start negative and go to positive, and the array also spans twice the Nyquist frequency (so if you remove the negative values, you have the whole spectrum without repetition).
NumPy's fft.fftfreq uses the second option. plot_surface didn't cope well with me removing part of the array (for me the removed data was still displayed), so I took the absolute value of the frequencies and the problem disappeared.
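As a small illustration of that ordering (not from the original post), np.fft.fftfreq returns the positive frequencies first, followed by the negative half:
import numpy as np

freqs = np.fft.fftfreq(8)  # default sample spacing d=1
print(freqs)
# [ 0.     0.125  0.25   0.375 -0.5   -0.375 -0.25  -0.125]
# positive frequencies first, then the negative half; abs() folds them together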
import matplotlib.pyplot as plt
from matplotlib import cm

# time, freq, Z, minAmpFFT, maxAmpFFT, maxFreq as defined above
fig = plt.figure(1, figsize=(18,15)) # figsize: increase plot size
ax = fig.add_subplot(projection='3d') # fig.gca(projection='3d') no longer works in recent matplotlib
surf = ax.plot_surface(time, abs(freq), Z, rstride=1, cstride=1, cmap=cm.magma, linewidth=0, antialiased=False, vmin=minAmpFFT, vmax=maxAmpFFT)
ax.set_zlim(0, maxAmpFFT)
ax.set_ylim(0, maxFreq)
ax.view_init(azim=90, elev=90) # change view to top view, with axis in the right direction
plt.title("Amplitude intensity (m/Hz^0.5) regarding to time and frequency")
plt.xlabel('x : time (in seconds)')
plt.ylabel('y : frequency (in Hz)')
# ax.yaxis._set_scale('log') # should be in log, but does not work
plt.gca().invert_xaxis() # invert x axis !! MUST BE AFTER X,Y,Z LIM
plt.gca().invert_yaxis() # invert y axis !! MUST BE AFTER X,Y,Z LIM
plt.colorbar(surf)
fig.tight_layout()
plt.show()
This is the plot I got:
Let's say I have a large data set that I can manipulate for some sort of analysis, such as looking at the values of a probability distribution.
Now that I have this large data set, I want to compare known, actual data to it; primarily, how many of the values in my data set share a value or property with the known data. For example:
This is a cumulative distribution. The continuous lines are from generated data from simulations and the decreasing intensities are just predicted percentages. The stars are then observational (known) data, plotted against generated data.
Another example I made shows how the points might be projected onto a histogram:
I'm having difficulty marking where the known data points fall in the generated data set and plotting them cumulatively alongside the distribution of the generated data.
If I were to try to retrieve the number of points that fall in the vicinity of the generated data, I would start out like this (it's not right):
def SameValue(SimData, DefData, uncert):
    numb = [(DefData-uncert) < i < (DefData+uncert) for i in SimData]
    return sum(numb)
But I am having trouble accounting for the points falling in the value ranges and then having it all set up to where I can plot it. Any idea on how to gather this data and project this onto a cumulative distribution?
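A vectorized variant of that counting idea, sketched with hypothetical names and not taken from the original post, would broadcast the comparison over all known values at once:
import numpy as np

def count_within(sim_data, known_values, uncert):
    """For each known value, count how many simulated values lie within +/- uncert of it."""
    sim = np.asarray(sim_data)
    known = np.asarray(known_values)
    # broadcast: one row per known value, one column per simulated value
    return np.sum(np.abs(sim[None, :] - known[:, None]) < uncert, axis=1)

print(count_within([0.1, 0.4, 0.6, 1.2], [0.5, 1.0], 0.3))  # [2 1]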
The question is pretty chaotic, with lots of irrelevant information while staying vague on the essential points. I will try to interpret it as best I can.
I think what you are after is the following: Given a finite sample from an unknown distribution, what is the probability to obtain a new sample at a fixed value?
I'm not sure if there is a general answer to it, but in any case that would be a question to be asked to statistics or mathematics people. My guess is that you would need to make some assumptions about the distribution itself.
For the practical case however, it might be sufficient to find out in which bin of the sampled distribution the new value would lie.
So assuming we have a distribution x, which we divide into bins. We can compute the histogram h, using numpy.histogram. The probability to find a value in each bin is then given by h/h.sum().
Having a value v=0.77, of which we want to know the probability according to the distribution, we can find out the bin in which it would belong by looking for the index ind in the bin array where this value would need to be inserted for the array to stay sorted. This can be done using numpy.searchsorted.
import numpy as np; np.random.seed(0)
x = np.random.rayleigh(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
prob = h/float(h.sum())
ind = np.searchsorted(bins, 0.77, side="right")
print(prob[ind-1])  # the bin containing 0.77; prints about 0.058
So the probability is 5.8% to sample a value in the bin around 0.77.
A different option would be to interpolate the histogram between the bin centers, so as to find the probability.
In the code below we plot a distribution similar to the one from the picture in the question and use both methods, the first for the frequency histogram, the second for the cumulative distribution.
import numpy as np; np.random.seed(0)
import matplotlib.pyplot as plt
x = np.random.rayleigh(size=1000)
y = np.random.normal(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
hcum = np.cumsum(h)/float(np.cumsum(h).max())
points = [[.77,-.55],[1.13,1.08],[2.15,-.3]]
markers = ['$\u2660$', '$\u2665$', '$\u263B$']
colors = ["k", "crimson" , "gold"]
labels = list("ABC")
kws = dict(height_ratios=[1,1,2], hspace=0.0)
fig, (axh, axc, ax) = plt.subplots(nrows=3, figsize=(6,6), gridspec_kw=kws, sharex=True)
cbins = np.zeros(len(bins)+1)
cbins[1:-1] = bins[1:]-np.diff(bins[:2])[0]/2.
cbins[-1] = bins[-1]
hcumc = np.linspace(0,1, len(cbins))
hcumc[1:-1] = hcum
axc.plot(cbins, hcumc, marker=".", markersize="2", mfc="k", mec="k" )
axh.bar(bins[:-1], h, width=np.diff(bins[:2])[0], alpha=0.7, ec="C0", align="edge")
ax.scatter(x,y, s=10, alpha=0.7)
for p, m, l, c in zip(points, markers, labels, colors):
    kw = dict(ls="", marker=m, color=c, label=l, markeredgewidth=0, ms=10)
    # plot points in scatter distribution
    ax.plot(p[0], p[1], **kw)
    # plot points in bar histogram, find bin in which to plot point
    # shift by half the bin width to plot it in the middle of the bar
    pix = np.searchsorted(bins, p[0], side="right")
    axh.plot(bins[pix-1]+np.diff(bins[:2])[0]/2., h[pix-1]/2., **kw)
    # plot in cumulative histogram, interpolate, such that the point is on the curve
    yi = np.interp(p[0], cbins, hcumc)
    axc.plot(p[0], yi, **kw)
ax.legend()
plt.tight_layout()
plt.show()
I'm trying to fill the area under a curve with matplotlib. The script below works fine.
import matplotlib.pyplot as plt
from math import sqrt
x = range(100)
y = [sqrt(i) for i in x]
plt.plot(x,y,color='k',lw=2)
plt.fill_between(x,y,0,color='0.8')
plt.show()
However, if I set the y-scale to logarithmic (see below), it sometimes fills the area above the curve! Can anyone help me? I would like to fill the area between the curve and y = 0.
x = range(100)
y = [sqrt(i) for i in x]
plt.plot(x,y,color='k',lw=2)
plt.fill_between(x,y,0,color='0.8')
plt.yscale('log')
plt.show()
Thanks in advance!
With a logarithmic y-scale, fill_between(x, y, 0) tells matplotlib to fill the region between log(0) = -infinity and log(y). Naturally, it balks. You can avoid the problem by changing 0 to some small number like 1e-6.
As mentioned, 0 -> -inf in a log scale. Thus, any plotted value that was less than or equal to zero would be problematic (requiring an infinite ylim in log space). This problem exists independently of whether you are using fill_between() or not.
Fortunately, matplotlib provides a way to handle this nicely. In the default behavior, matplotlib masks every value less than or equal to zero. In your example, this means that your entire y=0 baseline is masked and excluded from the polygon defining the filled-between area. The result is that the polygon is simply closed by drawing a line from (100, 10) down and leftward to (0, 0). Another option is to clip the values: in that case they are set to a tiny positive number (1e-300) and are not consulted when determining the ylim of the plot. So to get your desired result, do the following:
plt.yscale('log', nonposy='clip')
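Putting it together with the question's snippet (note that on matplotlib 3.3 and later the keyword is spelled nonpositive rather than nonposy):
import matplotlib.pyplot as plt
from math import sqrt

x = range(100)
y = [sqrt(i) for i in x]
plt.plot(x, y, color='k', lw=2)
plt.fill_between(x, y, 0, color='0.8')
plt.yscale('log', nonpositive='clip')  # use nonposy='clip' on matplotlib < 3.3
plt.show()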