Numpy/matplot: How to plot First X% is in range Y%? - python

Assume I have the following observations:
1,2,3,4,5,6,7,100
Now I want to plot how the observations are distributed percentage-wise:
First 12.5% of the observations is <=1 (1 out of 8)
First 50% of the observations is <=4 (4 out of 8)
First 87.5% of the observations is <=7 (7 out of 8)
First 100% of the observations is <=100 (8 out of 8)
My questions:
What is this kind of plot called? (So the maximum observation per percentile on the y axis, and the percentile on the x axis.) Is it a kind of histogram?
How can I create such a plot in Matplotlib/NumPy?
Thanks

I'm not sure what such a plot would be called (edit: it appears it's called a cumulative frequency plot, or something similar). However, it's easy to do.
Essentially, if you have sorted data, then the percentage of observations <= a value at index i is just (i+1)/len(data). It's easy to create an x array using arange that satisfies this. So, for example:
from matplotlib import pylab
import numpy as np
a = np.array([1,2,3,4,5,6,7,100])
pylab.plot( np.arange(1,len(a)+1)/len(a), a, # This part is required
'-', drawstyle='steps' ) # This part is stylistic
Gives:
If you'd prefer your x axis go from 0 to 100 rather than 0 to 1, multiply the x values by 100.
Note too that this works for your example data because it is already sorted. If you are using unsorted data, then sort it first, with np.sort for example:
c = np.random.randn(100)
c.sort()
pylab.plot( np.arange(1,len(c)+1)/len(c), c, '-', drawstyle='steps' )
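As a quick numeric check of the recipe above, here is a minimal sketch that sorts the data and prints the percentages from the question. The helper name `cumulative_fractions` is made up for illustration and is not part of NumPy or Matplotlib.

```python
import numpy as np

def cumulative_fractions(data):
    """Return (fractions, values): the first fractions[i] of the
    sorted data is <= values[i]."""
    values = np.sort(np.asarray(data, dtype=float))
    fractions = np.arange(1, len(values) + 1) / len(values)
    return fractions, values

# The observations from the question, deliberately shuffled
fractions, values = cumulative_fractions([100, 3, 1, 7, 5, 2, 6, 4])
for f, v in zip(fractions, values):
    print(f"first {100 * f:.1f}% of the observations are <= {v:g}")
```

The two returned arrays can be passed straight to `pylab.plot` as the x and y arguments.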

Related

How can I plot a CDF in Matplotlib without binning my data? [duplicate]

I can easily make a CDF in Matplotlib by using a cumulative histogram:
data = np.linspace(0, 100, num=10000)
plt.hist(data, cumulative=True, density=1)
And the result is this:
I can crank up the bin count to get a better approximation:
plt.hist(data, bins=50, cumulative=True, density=1)
Now the result is:
This is still not great. I know I can just make the bin count even higher, but that's a pretty unsatisfying solution for me.
Is there a way to plot a CDF that doesn't make me lose some precision? Like a binless histogram or something else?
You're talking about the ECDF (empirical cumulative distribution function) derived from the sample, and a cumulative histogram isn't how it's typically done. What's usually done is sorting the sample, finding the unique values, and finding the proportion of the sample less than or equal to those unique values; no need to adjust bin-widths.
The ECDF has discontinuous jumps at every unique value, so you'd want 2 values for each jump for plotting's sake. The following code will give you the x and y to plot an ECDF:
def ecdf4plot(seq, assumeSorted = False):
    """
    In:
    seq - sorted-able object containing values
    assumeSorted - specifies whether seq is sorted or not
    Out:
    0. values of support at both points of jump discontinuities
    1. values of ECDF at both points of jump discontinuities
       ECDF's true value at a jump discontinuity is the higher one """
    if not assumeSorted:
        seq = sorted(seq)
    prev = seq[0]
    n = len(seq)
    support = [prev]
    ECDF = [0.]
    for i in range(1, n):
        seqi = seq[i]
        if seqi != prev:
            preP = i/n
            support.append(prev)
            ECDF.append(preP)
            support.append(seqi)
            ECDF.append(preP)
            prev = seqi
    support.append(prev)
    ECDF.append(1.)
    return support, ECDF
# example usage
import numpy as np
from matplotlib import pyplot as plt
plt.plot(*ecdf4plot(np.random.randn(100)))
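An alternative sketch of the same idea (sort, find the unique values, take the proportion of the sample at or below each one) using NumPy only. The names `ecdf_unique` and `plot_ecdf` are my own, and drawing with `plt.step(..., where='post')` is a swapped-in plotting choice that avoids storing two points per jump. It is not the approach used in the function above.

```python
import numpy as np

def ecdf_unique(sample):
    """Return (sorted unique values, ECDF evaluated at each of them)."""
    values, counts = np.unique(np.asarray(sample), return_counts=True)
    heights = np.cumsum(counts) / counts.sum()
    return values, heights

def plot_ecdf(sample):
    # matplotlib is imported here so the numeric part works without it
    import matplotlib.pyplot as plt
    values, heights = ecdf_unique(sample)
    # where='post' holds each height until the next unique value
    plt.step(values, heights, where='post')
    plt.show()

values, heights = ecdf_unique([3, 1, 2, 2, 5])
print(values, heights)
```

Calling `plot_ecdf(np.random.randn(100))` should give a staircase comparable to the one from `ecdf4plot`, without the initial rise from zero.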

Distance between two group of values in a numpy array

I have a very basic question which in theory is easy to do (with fewer points and a lot of manual labour in ArcGIS), but I am not able to start at all with the coding to solve this problem (also I am new to complicated python coding).
I have 2 variables, 'Root zone' aka RZS and 'Tree cover' aka TC. Both are arrays of 250x186 values (which are basically grids, with each grid cell having a specific value). The values in TC vary from 0 to 100. Each grid cell is 0.25 degrees in size (which might be helpful in understanding the distance).
My problem is: "For each grid cell whose TC value lies between 50 and 100, I want to calculate the distance to the nearest grid cell whose TC value lies between 0 and 30."
Just take into consideration that we are not looking at the np.nan part of the TC. So the white part in TC is also white in RZS.
What I want to do is create a 2-dimensional scatter plot with X-axis denoting the 'distance of 50-100 TC from 0-30 values', Y-axis denoting 'RZS of those 50-100 TC points'. The above figure might make things more clear.
I wish I could have provided some code for this, but I am not even able to start on the distance part.
Please provide any suggestion on how should I proceed with this.
Let's consider an example:
If you look at the x: 70 and y:70, one can see a lot of points with values from 0-30 of the tree cover all across the dataset. But I only want the distance from the nearest value to my point which falls between 0-30.
The following code might work, with random example data:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
# Create some completely random data, and include an area of NaNs as well
rzs = np.random.uniform(0, 100, size=(250, 168))
tc = np.random.lognormal(3.0, size=(250, 168))
tc = np.clip(tc, 0, 100)
rzs[60:80,:] = np.nan
tc[60:80,:] = np.nan
plt.subplot(2,2,1)
plt.imshow(rzs)
plt.colorbar()
plt.subplot(2,2,2)
plt.imshow(tc)
plt.colorbar()
Now do the real work:
# Select the low- and high-valued points.
# This may result in warnings here because of NaNs;
# the NaNs compare to False in all the comparisons, and thus
# are excluded from both 'low' and 'high'
low = (tc >= 0) & (tc <= 30)
high = (tc >= 50) & (tc <= 100)
# Get the coordinates for the low- and high-valued points,
# combine and transpose them to be in the correct format
y, x = np.where(low)
low_coords = np.array([x, y]).T
y, x = np.where(high)
high_coords = np.array([x, y]).T
# We now calculate the distances between *all* low-valued points, and *all* high-valued points.
# Both the run time and the memory cost (of the output) scale as
# O(n_low * n_high), so be wary when using it with large input sizes.
from scipy.spatial.distance import cdist
distances = cdist(low_coords, high_coords)
# Each column of `distances` belongs to one high-valued point.
# Its distance to the nearest low-valued point is the minimum over
# the low-valued points, which here is the first axis.
mindistances = distances.min(axis=0)
# Boolean indexing returns values in the same order as np.where,
# so these RZS values line up with the columns of `distances`.
minrzs = rzs[high]
plt.scatter(mindistances, minrzs)
The resulting plot looks a bit weird, since there are rather discrete distances because of the grid (1, sqrt(1^2+1^2), 2, sqrt(1^2+2^2), sqrt(2^2+2^2), 3, sqrt(1^2+3^2), ...); this is because both TC values are randomly distributed, and thus low values may end up directly adjacent to high values (and because we're looking for minimum distances, most plotted points are for these cases). The vertical distribution is because the RZS values were uniformly distributed between 0 and 100.
This is simply a result of the input example data, which is not too representative of the real data.
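For larger grids, the full distance matrix can become expensive. A possible alternative, sketched here under my own assumptions, is a nearest-neighbour query with `scipy.spatial.cKDTree`, which avoids building all pairwise distances. The function name `nearest_low_distance` and the tiny example grid are hypothetical, and distances are in grid cells, not geographic units.

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_low_distance(tc, rzs, low_max=30, high_min=50):
    """For every high-TC cell, return the distance (in grid cells) to the
    nearest low-TC cell, together with the RZS value of that high-TC cell."""
    # NaN compares as False, so NaN cells end up in neither mask
    low = (tc >= 0) & (tc <= low_max)
    high = (tc >= high_min) & (tc <= 100)
    low_coords = np.argwhere(low)     # (row, col) pairs, row-major order
    high_coords = np.argwhere(high)   # same order as rzs[high]
    tree = cKDTree(low_coords)
    distances, _ = tree.query(high_coords, k=1)
    return distances, rzs[high]

# Tiny made-up grid: one low cell at (0, 0), three high cells, one NaN
tc = np.array([[10.0, 40.0, 60.0],
               [np.nan, 80.0, 90.0]])
rzs = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])
distances, high_rzs = nearest_low_distance(tc, rzs)
print(distances, high_rzs)
```

Multiplying `distances` by 0.25 converts grid cells to degrees for the grid described in the question; the result can then go into `plt.scatter(distances, high_rzs)`.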

Comparing two arrays which have very dispersed values

I have a very sparse array that looks like:
Array A: min = -68093253945.0 max=8.54631971208e+13
Array B: min=-1e+15 max = 1.87343e+14
And also each array will have concentration at certain levels e.g. near 2000, near 1m, near 0.05 and so on.
I am trying to compare these two arrays in terms of concentration, and want to do so in a way that is invariant to the number of entries in each. I also want to account for huge outliers if possible and maybe compress the bins to be between 0 and 1 or something of this sort.
The aim is to make a histogram via:
plt.hist(A,alpha=0.5,label='A') # plt.hist passes its arguments to np.histogram
ion()
plt.hist(B,alpha=0.5,label='B')
plt.title("Histogram of Values")
plt.legend(loc='upper right')
plt.savefig('valuecomp.png')
How do I do this? I have experimented with:
A = stats.zscore(A)
B = stats.zscore(B)
A = preprocessing.scale(A)
B = preprocessing.scale(B)
A = preprocessing.scale(A, axis=0, with_mean=True, with_std=True, copy=True)
B = preprocessing.scale(B, axis=0, with_mean=True, with_std=True, copy=True)
And then for my histograms, adding normed=True, range(0,100). All the methods give me a histogram with a massive vertical chunk near to 0.0 instead of distributing the values smoothly. range(0,100) looks good but it ignores any values like 1m outside of 100.
Perhaps I need to remove outliers from my data first and then do a histogram?
@sascha's suggestion of using AstroML was a good one, but the Knuth and Freedman versions seem to take astronomically long (excuse the pun), and the blocks version simply thinned the blocks.
I took the sigmoid of each value via from scipy.special import expit and then plotted the histogram that way. Only way I could get this to work.
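A rough sketch of what the sigmoid approach could look like. The post does not say whether or how the values were scaled before applying `expit`, so the median/IQR scaling below is purely my assumption, as are the helper names `squash` and `relative_hist` and the synthetic arrays. Bar heights are fractions of entries, so arrays of different lengths stay comparable.

```python
import numpy as np
from scipy.special import expit

def squash(values):
    """Map values into [0, 1]: scale by median and IQR (an assumed,
    outlier-tolerant choice), then apply the sigmoid."""
    values = np.asarray(values, dtype=float)
    q25, q50, q75 = np.percentile(values, [25, 50, 75])
    iqr = q75 - q25
    scale = iqr if iqr > 0 else 1.0
    return expit((values - q50) / scale)

def relative_hist(values, bins=20):
    """Histogram of the squashed values; heights are fractions of entries."""
    counts, edges = np.histogram(squash(values), bins=bins, range=(0.0, 1.0))
    return counts / counts.sum(), edges

# Synthetic stand-ins for A and B: a concentration plus huge outliers
rng = np.random.default_rng(0)
A = np.concatenate([rng.normal(2000, 50, 500), [-6.8e10, 8.5e13]])
B = np.concatenate([rng.normal(1e6, 1e4, 2000), [-1e15, 1.9e14]])
frac_a, edges = relative_hist(A)
frac_b, _ = relative_hist(B)
```

The fractions can be drawn with `plt.bar(edges[:-1], frac_a, width=np.diff(edges), align='edge', alpha=0.5)`, and likewise for `frac_b`.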

Matplotlib: How to make a histogram with bins of equal area?

Given some list of numbers following some arbitrary distribution, how can I define bin positions for matplotlib.pyplot.hist() so that the area in each bin is equal to (or close to) some constant area, A? The area should be calculated by multiplying the number of items in the bin by the width of the bin and its value should be no greater than A.
Here is a MWE to display a histogram with normally distributed sample data:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(100)
plt.hist(x, bin_pos)
plt.show()
Here bin_pos is a list representing the positions of the boundaries of the bins (see related question here).
I found this question intriguing. The solution depends on whether you want to plot a density function, or a true histogram. The latter case turns out to be quite a bit more challenging. Here is more info on the difference between a histogram and a density function.
Density Functions
This will do what you want for a density function:
def histedges_equalN(x, nbin):
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))
x = np.random.randn(1000)
n, bins, patches = plt.hist(x, histedges_equalN(x, 10), density=True)
Note the use of density=True (spelled normed=True in older Matplotlib versions), which specifies that we're calculating and plotting a density function. In this case the areas are identically equal (you can check by looking at n * np.diff(bins)). Also note that this solution involves finding bins that have the same number of points.
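The equal-area claim for the density case can be checked without plotting. The sketch below repeats the function so it runs on its own and uses `np.histogram` in place of `plt.hist`; the seed and sample size are arbitrary choices of mine.

```python
import numpy as np

def histedges_equalN(x, nbin):
    """Bin edges chosen so that each bin holds (about) the same number of points."""
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))

rng = np.random.default_rng(42)
x = rng.standard_normal(1000)
edges = histedges_equalN(x, 10)

# Raw counts per bin, and the density-normalised bar heights
counts, _ = np.histogram(x, bins=edges)
density, _ = np.histogram(x, bins=edges, density=True)

# Area of each bar = height * width
areas = density * np.diff(edges)
print(counts)
print(areas)
```

Since each bar's area equals the fraction of points in its bin, equal counts imply equal areas.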
Histograms
Here is a solution that gives approximately equal area boxes for a histogram:
def histedges_equalA(x, nbin):
    pow = 0.5
    dx = np.diff(np.sort(x))
    tmp = np.cumsum(dx ** pow)
    tmp = np.pad(tmp, (1, 0), 'constant')
    return np.interp(np.linspace(0, tmp.max(), nbin + 1),
                     tmp,
                     np.sort(x))
n, bins, patches = plt.hist(x, histedges_equalA(x, 10), density=False)
These boxes, however, are not all of equal area. The first and last, in particular, tend to be about 30% larger than the others. This is an artifact of the sparse distribution of the data at the tails of the normal distribution, and I believe it will persist any time there is a sparsely populated region in a data set.
Side note: I played with the value pow a bit, and found that a value of about 0.56 had a lower RMS error for the normal distribution. I stuck with the square-root because it performs best when the data is tightly-spaced (relative to the bin-width), and I'm pretty sure there is a theoretical basis for it that I haven't bothered to dig into (anyone?).
The issue with equal-area histograms
As far as I can tell it is not possible to obtain an exact solution to this problem. This is because it is sensitive to the discretization of the data. For example, suppose the first point in your dataset is an outlier at -13 and the next value is at -3, as depicted by the red dots in this image:
Now suppose the total "area" of your histogram is 150 and you want 10 bins. In that case the area of each histogram bar should be about 15, but you can't get there because as soon as your bar includes the second point, its area jumps from 10 to 20. That is, the data does not allow this bar to have an area between 10 and 20. One solution for this might be to adjust the lower-bound of the box to increase its area, but this starts to become arbitrary and does not work if this 'gap' is in the middle of the data set.
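The jump described above can be reproduced with a few lines of arithmetic. This is only an illustrative sketch: the data beyond the first two points and the helper name `bar_area` are made up.

```python
import numpy as np

# Hypothetical data: an outlier at -13, the next value at -3, then denser points
data = np.array([-13.0, -3.0, -2.5, -2.0, -1.5])

def bar_area(data, left, right):
    """Area of a histogram bar: (number of points in [left, right]) * width."""
    count = np.count_nonzero((data >= left) & (data <= right))
    return count * (right - left)

# Bar that stops just short of the second point
just_before = bar_area(data, -13.0, -3.0 - 1e-9)
# Bar that reaches the second point and therefore counts it
including = bar_area(data, -13.0, -3.0)
print(just_before, including)
```

No choice of right edge between the two points yields an area strictly between these two values, which is why a target area of 15 is unreachable for this bar.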

plot histogram of datetime.time python / matplotlib

I am trying to plot a histogram of datetime.time values, where these values are discretized into five-minute slices. The data looks like this, in a list:
['17:15:00', '18:20:00', '17:15:00', '13:10:00', '17:45:00', '18:20:00']
I would like to plot a histogram, or some form of distribution graph so that the number of occurrences of each time can be examined easily.
NB. Given that each time is discretised, the maximum number of bins in a histogram would be 288 = (60 / 5 * 24).
I have looked at matplotlib.pyplot.hist, but it requires some sort of continuous scalar.
I did what David Zwicker said and used seconds, and then changed the x axis. I will look at what Dave said about 'bins'. This works roughly and gives a bar per hour plot to start with.
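import matplotlib.pyplot as plt

def chart(occurance_list):
    # occurance_list is expected to hold datetime.time objects
    hour_list = [t.hour for t in occurance_list]
    print(hour_list)
    numbers = list(range(24))
    labels = [str(n) for n in numbers]
    plt.xticks(numbers, labels)
    plt.xlim(0, 24)
    # one bin per hour: edges at 0, 1, ..., 24
    plt.hist(hour_list, bins=range(25))
    plt.show()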
You have to convert the data into two variables first, and then you can use pylab to plot the histogram.
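To keep the five-minute resolution from the question, one option is to turn each time string into a slot index (0 to 287) and count occurrences per slot. This is a sketch under the assumption that the data are 'HH:MM:SS' strings as shown in the question; `slot_index` and `plot_counts` are names I made up.

```python
from collections import Counter
from datetime import datetime

times = ['17:15:00', '18:20:00', '17:15:00', '13:10:00', '17:45:00', '18:20:00']

def slot_index(time_string, slot_minutes=5):
    """Index of the slot containing the given 'HH:MM:SS' time (0..287 for 5 minutes)."""
    t = datetime.strptime(time_string, '%H:%M:%S').time()
    return (t.hour * 60 + t.minute) // slot_minutes

# Number of occurrences per five-minute slot
counts = Counter(slot_index(s) for s in times)

def plot_counts(counts, slot_minutes=5):
    # matplotlib is imported here so the counting part works without it
    import matplotlib.pyplot as plt
    slots = sorted(counts)
    plt.bar(slots, [counts[s] for s in slots], width=1.0, align='edge')
    # Put one tick per hour on the x axis
    per_hour = 60 // slot_minutes
    plt.xticks(range(0, 24 * per_hour + 1, per_hour),
               [str(h) for h in range(25)])
    plt.xlim(0, 24 * per_hour)
    plt.xlabel('hour of day')
    plt.ylabel('occurrences')
    plt.show()

print(counts)
```

Calling `plot_counts(counts)` draws one bar per occupied slot, so identical times stack into a single taller bar.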
