I have two vectors rev_count and stars. The elements of those form pairs (let's say rev_count is the x coordinate and stars is the y coordinate).
I would like to bin the data by rev_count and then average the stars in a single rev_count bin (I want to bin along the x axis and compute the average y coordinate in that bin).
This is the code that I tried to use (inspired by my matlab background):
import matplotlib.pyplot as plt
import numpy
binwidth = numpy.max(rev_count)/10
revbin = range(0, numpy.max(rev_count), binwidth)
revbinnedstars = [None]*len(revbin)
for i in range(0, len(revbin)-1):
revbinnedstars[i] = numpy.mean(stars[numpy.argwhere((revbin[i]-binwidth/2) < rev_count < (revbin[i]+binwidth/2))])
print('Plotting binned stars with count')
plt.figure(3)
plt.plot(revbin, revbinnedstars, '.')
plt.show()
However, this seems to be incredibly slow/inefficient. Is there a more natural way to do this in python?
Scipy has a function for this:
from scipy.stats import binned_statistic
revbinnedstars, edges, _ = binned_statistic(rev_count, stars, 'mean', bins=10)
revbin = edges[:-1]
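If you would rather plot against the bin centers instead of the left edges, a small sketch using the same edges array returned above:
revbin_centers = (edges[:-1] + edges[1:]) / 2
plt.plot(revbin_centers, revbinnedstars, '.')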
If you don't want to use scipy there's also a histogram function in numpy:
sums, edges = numpy.histogram(rev_count, bins=10, weights=stars)
counts, _ = numpy.histogram(rev_count, bins=10)
revbinnedstars = sums / counts
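One caveat with this version: a bin that receives no points produces a 0/0 division and a NaN mean. A small sketch that suppresses the warning and simply leaves NaN in empty bins:
with numpy.errstate(invalid='ignore'):
    revbinnedstars = sums / counts  # empty bins end up as NaN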
I suppose you are using Python 2, but if not, you should change the division used to calculate the step to // (floor division); otherwise range will complain that it cannot interpret a float as the step.
binwidth = numpy.max(rev_count)//10 # Changed this to floor division
revbin = range(0, numpy.max(rev_count), binwidth)
revbinnedstars = [None]*len(revbin)
for i in range(0, len(revbin)-1):
# I actually don't know what you wanted to do but I guess you wanted the
# "logical and" combination in that bin (you don't need to use np.where here)
# You can put that all in one statement but it gets crowded so I'll split it:
index1 = revbin[i]-binwidth/2 < rev_count
index2 = rev_count < revbin[i]+binwidth/2
revbinnedstars[i] = numpy.mean(stars[numpy.logical_and(index1, index2)])
That at least should work and gives the right results. It will be very inefficient if you have huge datasets and want more than 10 bins.
One very important takeaway:
Don't use np.argwhere if you want to index an array. Its result is only meant to be human readable. If you really want the coordinates, use np.where. That can be used as an index, but it isn't that pretty to read if you have multidimensional inputs.
The numpy documentation supports me on that point:
The output of argwhere is not suitable for indexing arrays. For this purpose use where(a) instead.
That's also the reason why your code was so slow. It tried to do something you don't want it to do, which can be very expensive in memory and CPU usage, without giving you the right result.
What I have used here are boolean masks. They are shorter to write than np.where(condition) and involve one less calculation.
A completely vectorized approach defines a grid that knows which stars are in which bin:
bins = 10
binwidth = np.max(rev_count)//bins
revbin = np.arange(0, np.max(rev_count)+binwidth+1, binwidth)
An even better way to define the bins is the following. Beware that you have to add one to the maximum (since you want to include it) and one to the number of bins (because you are interested in the bin start and end points, not the bin centers):
number_of_bins = 10
revbin = np.linspace(np.min(rev_count), np.max(rev_count)+1, number_of_bins+1)
and then you can setup the grid:
grid = np.logical_and(rev_count[None, :] >= revbin[:-1, None], rev_count[None, :] < revbin[1:, None])
The grid is bins x rev_count big (because of broadcasting: I increased the dimensions of each of those arrays by one, but along different axes). It essentially checks if a point is bigger than the lower bin edge and smaller than the upper bin edge (hence the [:-1] and [1:] indices). This is done in a multidimensional way, where the counts are in the second dimension (numpy axis=1) and the bins are in the first dimension (numpy axis=0).
So we can get the Y coordinates of the stars in the appropriate bin by just multiplying these with this grid:
stars * grid
To calculate the mean, we need the sum of the star values in each bin divided by the number of stars in that bin (the sum runs along axis=1, over the stars; stars that are not in a given bin contribute only zeros along this axis):
revbinnedstars = np.sum(stars * grid, axis=1) / np.sum(grid, axis=1)
I actually don't know if that's more efficient. It'll be a lot more expensive in memory but maybe a bit less expensive in CPU.
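For reference, here is the whole vectorized approach as one self-contained sketch; the rev_count and stars arrays are made up, since the originals aren't shown, and empty bins would still produce NaN from the 0/0 division:
import numpy as np

rng = np.random.default_rng(0)
rev_count = rng.integers(0, 1000, size=500)   # made-up example data
stars = rng.uniform(1, 5, size=500)

number_of_bins = 10
revbin = np.linspace(np.min(rev_count), np.max(rev_count) + 1, number_of_bins + 1)

# bins x points boolean grid: True where a point falls into a bin
grid = np.logical_and(rev_count[None, :] >= revbin[:-1, None],
                      rev_count[None, :] < revbin[1:, None])

# sum of the star values per bin divided by the number of points per bin
revbinnedstars = np.sum(stars * grid, axis=1) / np.sum(grid, axis=1)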
The function I use for binning (x, y) data and determining summary statistics, such as mean values in those bins, is based upon the scipy.stats.binned_statistic() function. I have written a wrapper for it, because I use it a lot. You may find it useful...
from scipy import stats

def binXY(x, y, statistic='mean', xbins=10, xrange=None):
    """
    Finds statistical value of x and y values in each x bin.
    Returns the same type of statistic for both x and y.
    See scipy.stats.binned_statistic() for options.

    Parameters
    ----------
    x : array
        x values.
    y : array
        y values.
    statistic : string or callable, optional
        See documentation for scipy.stats.binned_statistic(). Default is mean.
    xbins : int or sequence of scalars, optional
        If xbins is an integer, it is the number of equal bins within xrange.
        If xbins is an array, then it is the location of xbin edges, similar
        to definitions used by np.histogram. Default is 10 bins.
        All but the last (righthand-most) bin is half-open. In other words, if
        bins is [1, 2, 3, 4], then the first bin is [1, 2) (including 1, but
        excluding 2) and the second [2, 3). The last bin, however, is [3, 4],
        which includes 4.
    xrange : (float, float) or [(float, float)], optional
        The lower and upper range of the bins. If not provided, range is
        simply (x.min(), x.max()). Values outside the range are ignored.

    Returns
    -------
    x_stat : array
        The x statistic (e.g. mean) in each bin.
    y_stat : array
        The y statistic (e.g. mean) in each bin.
    n : array of dtype int
        The count of y values in each bin.
    """
    x_stat, xbin_edges, binnumber = stats.binned_statistic(
        x, x, statistic=statistic, bins=xbins, range=xrange)
    y_stat, xbin_edges, binnumber = stats.binned_statistic(
        x, y, statistic=statistic, bins=xbins, range=xrange)
    n, xbin_edges, binnumber = stats.binned_statistic(
        x, y, statistic='count', bins=xbins, range=xrange)
    return x_stat, y_stat, n
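A hypothetical usage example (the x and y arrays here are invented purely for illustration):
import numpy as np

x = np.random.uniform(0, 100, 500)           # invented example data
y = 2 * x + np.random.normal(0, 10, 500)

x_mean, y_mean, n = binXY(x, y, statistic='mean', xbins=10)
print(x_mean, y_mean, n)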
Related
I want to create two samples: first a y sample with values in the range from 10^3 to 10^10, and an x sample with values in the range from 10^-5 to 10^10, for a logarithmic plot. I tried the following:
y = np.linspace(1e3,1e10, num = 1000)
x = np.linspace(1e-5,1e10, num = 1000)
but it returns a sample that is not evenly distributed on a log scale: for x there is only 1 value of the order of 10^-5, many values of the order of 10^9, and no values between 10^-5 and 10^7. This is what I get for x:
[1.00000000e-05 1.00100100e+07 2.00200200e+07 3.00300300e+07
4.00400400e+07 5.00500501e+07 6.00600601e+07 7.00700701e+07
8.00800801e+07 9.00900901e+07 1.00100100e+08 1.10110110e+08
1.20120120e+08 1.30130130e+08 1.40140140e+08 1.50150150e+08
...
I want a sample with values evenly separated on a log scale: the same number of values for each order of magnitude, because I need it for a logarithmic plot. Why is linspace not working, and how can I fix it?
linspace returns linearly spaced values, meaning there is the same distance from each number to the next.
logspace on the other hand creates logarithmically spaced values, which are what you are looking for.
https://numpy.org/devdocs/reference/generated/numpy.logspace.html
Edit:
Beware that logspace takes the exponents as start and stop values. This means you must write np.logspace(3, 10, num=1000) and np.logspace(-5, 10, num=1000).
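As a quick sketch, the same samples written with explicit exponents (np.log10 of the original bounds gives the same thing):
import numpy as np

y = np.logspace(3, 10, num=1000)                              # 10**3 ... 10**10
x = np.logspace(np.log10(1e-5), np.log10(1e10), num=1000)     # same as np.logspace(-5, 10, num=1000)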
Check out geomspace:
import numpy as np
y = np.geomspace(1e3, 1e10, num=8)
print(y)
[1.e+03 1.e+04 1.e+05 1.e+06 1.e+07 1.e+08 1.e+09 1.e+10]
You are using the wrong function for what you are trying to achieve here. You should use "logspace" for that.
y = np.logspace(3,10, num = 1000)
print (y)
I have a very basic question which in theory is easy to do (with fewer points and a lot of manual labour in ArcGIS), but I am not able to start at all with the coding to solve this problem (also I am new to complicated python coding).
I have 2 variables, 'Root zone' aka RTZ and 'Tree cover' aka TC, both of which are 250x186 arrays (basically grids in which each cell has a specific value). The values in TC vary from 0 to 100. Each grid cell is 0.25 degrees in size (which might be helpful for understanding the distances).
My problem is: I want to calculate, for each TC value between 50-100 (i.e. each TC value greater than 50 at each lat and lon), the distance to the nearest point where TC is between 0-30 (less than 30).
Just take into consideration that we are not looking at the np.nan part of the TC. So the white part in TC is also white in RZS.
What I want to do is create a 2-dimensional scatter plot with X-axis denoting the 'distance of 50-100 TC from 0-30 values', Y-axis denoting 'RZS of those 50-100 TC points'. The above figure might make things more clear.
I wish I could have provided some code for this, but I am not even able to start on the distance part.
Please provide any suggestions on how I should proceed.
Let's consider an example:
If you look at the point at x=70 and y=70, one can see a lot of tree-cover values from 0-30 all across the dataset, but I only want the distance from my point to the nearest value that falls between 0-30.
The following code might work, with random example data:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
# Create some completely random data, and include an area of NaNs as well
rzs = np.random.uniform(0, 100, size=(250, 168))
tc = np.random.lognormal(3.0, size=(250, 168))
tc = np.clip(tc, 0, 100)
rzs[60:80,:] = np.nan
tc[60:80,:] = np.nan
plt.subplot(2,2,1)
plt.imshow(rzs)
plt.colorbar()
plt.subplot(2,2,2)
plt.imshow(tc)
plt.colorbar()
Now do the real work:
# Select the indices of the low- and high-valued points
# This will results in warnings here because of NaNs;
# the NaNs should be filtered out in the indices, since they will
# compare to False in all the comparisons, and thus not be
# indexed by 'low' and 'high'
low = (tc >= 0) & (tc <= 30)
high = (tc >= 50) & (tc <= 100)
# Get the coordinates for the low- and high-valued points,
# combine and transpose them to be in the correct format
y, x = np.where(low)
low_coords = np.array([x, y]).T
y, x = np.where(high)
high_coords = np.array([x, y]).T
# We now calculate the distances between *all* low-valued points, and *all* high-valued points.
# This calculation scales quadratically with the number of points,
# as does the memory cost (of the output),
# so be wary when using it with large input sizes.
from scipy.spatial.distance import cdist
distances = cdist(low_coords, high_coords)
# Now find the minimum distance along the axis of the high-valued coords,
# which here is the second axis.
# Since we also want to find values corresponding to those minimum distances,
# we should use the `argmin` function instead of a normal `min` function.
indices = distances.argmin(axis=1)
mindistances = distances[np.arange(distances.shape[0]), indices]
minrzs = rzs[high][indices]  # RZS at the nearest high-valued point for each low-valued point
plt.scatter(mindistances, minrzs)
The resulting plot looks a bit weird, since the distances are rather discrete because of the grid (1, sqrt(1^2+1^2), 2, sqrt(1^2+2^2), sqrt(2^2+2^2), 3, sqrt(1^2+3^2), ...); this is because the TC values are randomly distributed, and thus low values may end up directly adjacent to high values (and because we're looking for minimum distances, most plotted points are for these cases). The vertical distribution is because the RZS values were uniformly distributed between 0 and 100.
This is simply a result of the input example data, which is not too representative of the real data.
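If the real grids are large, the full cdist matrix may not fit in memory. A hedged alternative sketch using scipy.spatial.cKDTree, reusing the low/high masks and coordinate arrays built above; note that this version queries, for each high-valued point, its nearest low-valued point, which matches the question's framing:
from scipy.spatial import cKDTree

# Build a KD-tree on the low-valued coordinates; querying it with the
# high-valued coordinates returns, for each high point, the distance to
# (and index of) its nearest low point, without forming the full matrix.
tree = cKDTree(low_coords)
mindistances, nearest_low = tree.query(high_coords)

# RZS values of the high-valued points, in the same order as high_coords
minrzs = rzs[high]

plt.scatter(mindistances, minrzs)
plt.show()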
Ideally, I want to do the following without reading the data from the hard disk too many times. The data is big and memory can't hold all of it at the same time.
Input is a stream x[t] from the hard disk. The stream of numbers contains N elements.
It is possible to have a histogram of x with m bins.
The m bins are defined by the bin edges e0 < e1 < ... < em. For example, if e_i <= x[0] < e_(i+1), then x[0] belongs to the i-th bin.
Find the bin edges that make each bin hold a nearly equal number of elements from the stream. Ideally, the number of elements in each bin should be within some threshold percent of N/m, because if we evenly distribute N elements among m bins, each bin should hold about N/m elements.
Current solution:
import numpy as np
def test_data(size):
x = np.random.normal(0, 0.5, size // 2)
x = np.hstack([x, np.random.normal(4, 1, size // 2)])
return x
def bin_edge_as_index(n_bin, fine_hist, fine_n_bin, data_size):
cum_sum = np.cumsum(fine_hist)
bin_id = np.empty((n_bin + 1), dtype=int)
count_per_bin = data_size * 1.0 / n_bin
for i in range(1, n_bin):
bin_id[i] = np.argmax(cum_sum > count_per_bin * i)
bin_id[0] = 0
bin_id[n_bin] = fine_n_bin
return bin_id
def get_bin_count(bin_edge, data):
n_bin = bin_edge.shape[0] - 1
result = np.zeros((n_bin), dtype=int)
for i in range(n_bin):
cmp0 = (bin_edge[i] <= data)
cmp1 = (data < bin_edge[i + 1])
result[i] = np.sum(cmp0 & cmp1)
return result
# Test Setting
test_size = 10000
n_bin = 6
fine_n_bin = 2000 # use a big number and hope it works
# Test Data
x = test_data(test_size)
# Fine Histogram
fine_hist, fine_bin_edge = np.histogram(x, fine_n_bin)
# Index of the bins of the fine histogram that contains
# the required bin edges (e_1, e_2, ... e_n)
bin_id = bin_edge_as_index(
n_bin, fine_hist, fine_n_bin, test_size)
# Find the bin edges
bin_edge = fine_bin_edge[bin_id]
print("bin_edges:")
print(bin_edge)
# Check
bin_count = get_bin_count(bin_edge, x)
print("bin_counts:")
print(bin_count)
print("ideal count per bin:")
print(test_size * 1.0 / n_bin)
Output of program:
bin_edges:
[-1.86507282 -0.22751473 0.2085489 1.30798591 3.57180559 4.40218207
7.41287669]
bin_counts:
[1656 1675 1668 1663 1660 1677]
ideal count per bin:
1666.6666666666667
Problem:
I can't specify a threshold s and expect the bin counts to be at most s% different from the ideal count per bin.
Assuming that the distribution is not outrageously skewed (like 10000 values between 1.0000001 and 1.0000002 and 10000 others between 9.0000001 and 9.0000002), you can proceed as below.
Compute a histogram with a sufficient resolution, say K bins, which covers the whole range (hopefully known beforehand). This will take a single pass over the data.
Then compute the cumulative histogram and, as you go, identify the m+1 quantile edges (where the cumulative counts cross multiples of N/m).
The accuracy that you will get is dictated by the maximum number of elements in a bin of the original histogram.
For N elements, using a histogram of K bins and assuming some "nonuniformity factor" f (equal to a few units for reasonable distributions), the maximum error will be f·N/K.
You can improve accuracy if you like by considering m+1 auxiliary histograms which only accumulate the values that fall in the quantile bins of the global histogram. Then you can refine the quantiles to the resolution of these auxiliary histograms.
This will cost you an extra pass, but the error will be reduced to f·N/(K·K'), using only K and then m·K' histogram bins, instead of K·K'.
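A minimal sketch of the single-pass part described above (coarse K-bin histogram, cumulative sum, edges where the cumulative counts cross multiples of N/m); the chunked reading interface and the auxiliary refinement pass are left out, and the function name and signature are invented:
import numpy as np

def approx_quantile_edges(chunks, lo, hi, m, K=2000):
    """Approximate m equal-count bin edges for a stream whose values lie in [lo, hi]."""
    fine_edges = np.linspace(lo, hi, K + 1)
    counts = np.zeros(K, dtype=np.int64)
    n = 0
    for chunk in chunks:                      # single pass over the data, chunk by chunk
        h, _ = np.histogram(chunk, bins=fine_edges)
        counts += h
        n += chunk.size
    cum = np.cumsum(counts)
    targets = n / m * np.arange(1, m)         # cumulative counts to cross: N/m, 2N/m, ...
    idx = np.searchsorted(cum, targets)       # fine bins where the crossings happen
    return np.concatenate(([lo], fine_edges[idx + 1], [hi]))

Splitting the question's test data into chunks, e.g. approx_quantile_edges(np.array_split(x, 10), x.min(), x.max(), 6), should give edges close to the ones printed above.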
If you can assume your data is random with a defined distribution (that is: taking any non-trivial percentage of your data in sequence is going to "sketch" the same distribution as the entire data, only with a coarser precision), I imagine there are a number of options:
read a part of your data into some oversampled histogram. Based on this, choose an approximation for the bin edges the way you do now (as explained in your question), then uniformly oversample these bins, then read another chunk of your data into the new bins, and so on. If you have enough data, processing it in chunks of 10% would allow for 10 iterations to improve your bin structure in a single pass.
start with a number of bins and accumulate some (not all) of the data. Look it over, and if one bin has bin_width*count disproportionately higher than its neighbours (maybe this is where the precision/error comes into play), divide that bin into two and heuristically assign the old bin count to the newly created bins (one possible heuristic: proportional to the counts of the neighbours). At the end, you should have a division somehow controlled by the acceptable error, from which to more or less interpolate your distribution.
Of course, the above are only ideas of approaches, can't offer any warranty about how well they'll work.
I am trying to fit these values:
This is my code:
for i in range(-area,area):
stDev1= []
for j in range(-area,area):
stDev0 = stDev[i+i0][j+j0]
stDev1.append(stDev0)
slices[i] = stDev1
fitV = []
xV = []
for l in range(-area,area):
y = np.asarray(slices[l])
x = np.arange(0,2*area,1)
for m in range(-area,area):
fitV.append(slices[m][l])
xV.append(l)
fit = np.polyfit(xV,fitV,4)
yfit = function(fit,area)
x100 = np.arange(0,100,1)
plt.plot(xV,fitV,'.')
plt.savefig("fits1.png")
def function(fit,area):
yfit = []
for x in range(-area,area):
yfit.append(fit[0]+fit[1]*x+fit[2]*x**2+fit[3]*x**3+fit[4]*x**4)
return(yfit)
i0 = 400
j0 = 400
area = 50
stdev = 2d np.array([1300][800]) #just an image of "noise" feel free to add any image // 2d np array you like.
This yields:
Obviously this is completely wrong? I assume I misunderstand the concept of polyfit? From the docs, the requirement is that I feed it two arrays of shape x[i], y[i]? My x values are:
xV = [ x_1_-50,x_1_-49,...,x_1_49,x_2_-50,...,x_49_49]
and my ys are:
fitV = [y_1_-50,y_1_-49,...,y_1_49,...y_2_-50,...,y_2_49]
I do not completely understand your program. In the future, it would be helpful if you were to distill your issue to a MCVE. But here are some thoughts:
It seems, in your data, that for a given value of x there are multiple values of y. Given (x, y) data, polyfit returns an array of coefficients that represents a polynomial function, but no function can map a single value of x onto multiple values of y. As a first step, consider collapsing each set of y values into a single representative value using, for example, the mean, median, or mode. Or perhaps, in your domain, there's a more natural way to do this.
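As a hedged sketch of that collapsing step, using the xV/fitV lists from the question and the mean as the representative value:
import numpy as np

xV = np.asarray(xV)
fitV = np.asarray(fitV)

# one representative (mean) y value per distinct x value
unique_x = np.unique(xV)
collapsed_y = np.array([fitV[xV == ux].mean() for ux in unique_x])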
Second, there is an idiomatic way to use the pair of functions np.polyfit and np.polyval, and you're not using them in the standard way. Of course, numerous useful departures from this pattern exist, but first make sure you understand the basic pattern of these two functions.
a. Given your measurements y_data, taken at times or locations x_data, plot them and make a guess as to the order of the fit. That is, does it look like a line? Like a parabola? Let's assume you believe your data to be parabolic, and that you'll use a second order polynomial fit.
b. Make sure that your arrays are sorted in order of increasing x. There are many ways to do this, but np.argsort is an easy one.
c. Run polyfit: p = polyfit(x_data, y_data, 2), which returns an array p containing the 2nd, 1st, and 0th order coefficients, (c2, c1, c0).
d. In the idiomatic use of polyfit and polyval, next you would generate your fit: polyval(p,x_data). Or perhaps you want the fit to be sampled more coarsely or finely, in which case you might take a subset of x_data or interpolate more values in x_data.
A complete example is below.
import numpy as np
from matplotlib import pyplot as plt
# these are your measurements, unsorted
x_data = np.array([18, 6, 9, 12 , 3, 0, 15])
y_data = np.array([583.26347805, 63.16059915, 100.94286909, 183.72581827, 62.24497418,
134.99558191, 368.78421529])
# first, sort both vectors in increasing-x order:
sorted_indices = np.argsort(x_data)
x_data = x_data[sorted_indices]
y_data = y_data[sorted_indices]
# now, plot and observe the parabolic shape:
plt.plot(x_data,y_data,'ks')
plt.show()
# generate the 2nd order fitting polynomial:
p = np.polyfit(x_data,y_data,2)
# make a more finely sampled x_fit vector with, for example
# 1024 equally spaced points between the first and last
# values of x_data
x_fit = np.linspace(x_data[0],x_data[-1],1024)
# now, compute the fit using your polynomial:
y_fit = np.polyval(p,x_fit)
# and plot them together:
plt.plot(x_data,y_data,'ks')
plt.plot(x_fit,y_fit,'b--')
plt.show()
Hope that helps.
Hi, I wanted to generate some random numbers with a Pareto distribution. I've found that this is possible using numpy, but I don't know how to shape the outcome. For example, I want to have results in the range 10-20; how can I achieve this?
I know the syntax for using pareto from numpy
numpy.random.pareto(m, s)
I can't understand what m is for (I've been looking at Wikipedia, but I don't understand it one bit). I know that s is the size of the generated tuple.
The documentation seems to have a mistake which might be confusing you.
Normally the parameter names in the call signature:
numpy.random.pareto(a, size=None)
Match the parameter names with the given details:
Parameters
----------
shape : float, > 0.
Shape of the distribution.
size : tuple of ints
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn.
But you see that the first parameter is called both a and shape. Pass your desired shape as the first argument to the function to get a distribution of size numbers (they're not a tuple, but a numpy array).
If you need to change the second parameter (called xm on wikipedia), then just add it to all values, as in the example from the docs:
Examples
--------
Draw samples from the distribution:
>>> a, m = 3., 1. # shape and mode
>>> s = np.random.pareto(a, 1000) + m
So, it is trivial to implement a lower bound: just use your lower bound for m:
lower = 10 # the lower bound for your values
shape = 1 # the distribution shape parameter, also known as `a` or `alpha`
size = 1000 # the size of your sample (number of random values)
And create the distribution with the lower bound:
x = np.random.pareto(shape, size) + lower
However, the Pareto distribution is not bounded from above, so if you try to cut it off, it will really be a truncated version of the distribution, which is not quite the same thing, so be careful. If the shape parameter a is much bigger than 1, the distribution decays algebraically, as x^-(a+1), so you won't see very many large values anyway.
If you choose to implement the upper bound, a simple way is to generate the ordinary sample then remove any values that exceed your limit:
upper = 20
x = x[x<upper] # only values where x < upper
But now the size of your sample is (possibly) smaller. You could keep adding new values (and filtering out the ones that are too large) until the size is what you want, but it would be simpler to make the sample sufficiently large in the first place, then keep only size of them:
x = np.random.pareto(shape, size*5//4) + lower  # floor division: the size must be an integer
x = x[x<upper][:size]
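If you do need exactly size values below the upper bound, a hedged sketch of the "keep adding and filtering" idea mentioned above, reusing lower, upper, shape and size from before:
import numpy as np

# keep drawing until enough values survive the upper-bound filter, then trim
x = np.empty(0)
while x.size < size:
    draw = np.random.pareto(shape, size) + lower
    x = np.concatenate([x, draw[draw < upper]])
x = x[:size]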
@askewchan Has the documentation changed?
According to the latest doc, m should be used like this
a, m = 3., 2. # shape and mode
s = (np.random.pareto(a) + 1) * m
where a is the shape and m is the scale (which is x_m on Wikipedia).
This is the test code; the expected mean matches the simulation result.
import numpy

a = 2
m = 10

def subtask_service_time():
    return (numpy.random.pareto(a) + 1) * m

print('Simulation mean:', sum([subtask_service_time() for _ in range(1000)]) / 1000)
print('Expected mean:', a * m / (a - 1))
>>>>Simulation mean: 20.383399962437686
>>>>Expected mean: 20.0
Note that Generator does not provide a version compatibility guarantee. You can use numpy.random.Generator.pareto to generate numbers in NumPy 1.18.1,
for example:
a, m = 3., 2. # shape and mode
s = (np.random.default_rng().pareto(a, 1000) + 1) * m
In the Pareto distribution, the mode is also the lower bound, and the Pareto distribution has no upper bound.
The shape can be calculated as shape = E / (E - mode), where E is the mean.
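For example, rearranging the mean formula used in the test above, E = a*m/(a - 1), gives a = E/(E - m); with a desired mean E = 20 and mode m = 10, this recovers a = 20/(20 - 10) = 2.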