Simulating a for loop 100 times with a matrix - Python

I am currently trying to run a simulation of an MBS 100 times. I have written the for loop that I need; however, I need to make this loop run 100 times. After consulting with a friend, I believe that I need to specify the "size" parameter in np.random.normal as a matrix, however my coding skills are limited and I would greatly appreciate your help with doing so. Specifically, for a sequence of values of the correlation parameter ρ (rho) between 0 and 1, I need to simulate 100 MBSs and report the average payoff of each tranche across the simulations. Below is my code with notes.
EDIT:
I appreciate the code proposed in the answers and it is indeed very useful. I have one last hurdle to include: how do I include a payment structure that is sequential? Specifically, the junior tranche is the first to absorb losses from the underlying collateral pool and does so until the portfolio loss exceeds 5% (i.e. the proportion of defaults exceeds 10%), at which point the junior tranche becomes worthless. The mezzanine tranche begins to absorb losses once the portfolio loss exceeds 5% and continues to do so until the portfolio loss reaches 10% (i.e. the proportion of defaults exceeds 20%). Finally, the senior tranche absorbs portfolio losses in excess of 10%. So far we've only considered an average payoff given a specific share of the total payoff; a rough sketch of the waterfall I have in mind is included after my code below.
import numpy as np

rho_list = [0, 0.2, 0.5, 0.6, 1]

# parameters
n_borrowers = 10
payoff_default = 0.5
payoff_nodefault = 1
threshold = -1.65

# draw of income shocks (I have to draw new values of s and eps for each simulation)
s = np.random.normal(0, 1, size=1)              # common s for all borrowers
eps = np.random.normal(0, 1, size=n_borrowers)  # each borrower has their own eps

for rho in rho_list:
    # compute the borrower's income
    x = np.sqrt(rho) * s + np.sqrt(1 - rho) * eps
    # which borrower defaults?
    loan_payoff = (x < threshold) * payoff_default + (x >= threshold) * payoff_nodefault
    # total pool
    total_payoff = np.sum(loan_payoff)
    # how much does each investor receive from total_payoff?
    senior_payoff = 0.82 * total_payoff
    mezz_payoff = 0.12 * total_payoff
    junior_payoff = 0.06 * total_payoff
    print('total: {}, senior: {}, mezz: {}, junior: {}'.format(total_payoff, senior_payoff, mezz_payoff, junior_payoff))

# next steps (this is what I need help with)
# repeat this for 100 simulations and compute the average payoff to each investor
# is it possible to generate the income for all simulations in one step?
# Idea: specify the "size" parameter in np.random.normal as a matrix
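As for the EDIT above about the sequential structure: something like the sketch below is what I have in mind for the waterfall, though I'm not sure it is right. I'm assuming each loan has a notional of 1 (so the pool notional equals n_borrowers) and using the 5%/10% loss thresholds as the tranche attachment points.

def waterfall_payoffs(total_payoff, n_borrowers, junior_detach=0.05, mezz_detach=0.10):
    """Sequential (waterfall) tranche payoffs -- just a sketch.

    Losses hit the junior tranche first (up to junior_detach of the pool
    notional), then the mezzanine tranche (up to mezz_detach), and any
    remaining loss hits the senior tranche. Assumes a notional of 1 per loan.
    """
    notional = n_borrowers
    loss = notional - total_payoff

    junior_size = junior_detach * notional
    mezz_size = (mezz_detach - junior_detach) * notional
    senior_size = notional - junior_size - mezz_size

    junior_loss = np.minimum(loss, junior_size)
    mezz_loss = np.minimum(np.maximum(loss - junior_size, 0), mezz_size)
    senior_loss = np.maximum(loss - junior_size - mezz_size, 0)

    return senior_size - senior_loss, mezz_size - mezz_loss, junior_size - junior_loss

Using np.minimum/np.maximum rather than min/max means the same helper should work elementwise on an array of simulated total payoffs.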

I'm not certain I've understood the problem entirely correctly, but I think this does what you want without using any for loops, by taking advantage of numpy's broadcasting. I'm by no means an expert in numpy, and multidimensional calculations are something I'm not super comfortable with, but I believe my logic is sound. I'd be more than happy to get any feedback.
Solution
# Setup
import pandas as pd
import numpy as np
investors = np.array([0.82, 0.12, 0.06])
rhos = np.linspace(0, 1, 11)[..., None, None]
default = 0.5
nodefault = 1
thresh = -1.65
n_sims = 100
n_borrowers = 10
s = np.random.normal(0, 1, size=(1, n_sims, 1))
eps = np.random.normal(0, 1, size=(n_sims, n_borrowers))
# Solution
x = np.sqrt(rhos) * s + np.sqrt(1 - rhos) * eps
payoffs = (x < thresh) * default + (x >= thresh) * nodefault
avgs = payoffs.sum(axis=2).mean(axis=1)
investor_payouts = avgs[..., None] * investors[None, ...]
data = np.hstack([rhos.reshape(-1, 1), investor_payouts])
df = pd.DataFrame(data, columns=["rho", "senior", "mezz", "junior"])
Output:
     rho  senior    mezz  junior
0    0.0  8.0155  1.1730  0.5865
1    0.1  7.9909  1.1694  0.5847
2    0.2  7.9991  1.1706  0.5853
3    0.3  7.9868  1.1688  0.5844
4    0.4  7.9991  1.1706  0.5853
5    0.5  7.9786  1.1676  0.5838
6    0.6  7.9745  1.1670  0.5835
7    0.7  7.9745  1.1670  0.5835
8    0.8  7.9458  1.1628  0.5814
9    0.9  7.9007  1.1562  0.5781
10   1.0  7.8310  1.1460  0.5730
Rationale
With n_sims = 100 and n_borrowers = 10, and rhos = np.linspace(0, 1, 11) we have the shapes
>>> rhos.shape
(11, 1, 1)
>>> s.shape
(1, 100, 1)
>>> eps.shape
(100, 10)
The reasoning for the shapes of rhos and s is that broadcasting can then be done more easily.
For each simulation, we essentially need to calculate the payoffs for each ρ. In essence, we want an array of shape (11, 100, 10) where along the first axis are the values of ρ, and the second and third axis are one hundred simulations of 10 borrowers.
The first term of your equation is sqrt(ρ) * s, and we want (11, 100, 1) so that we can broadcast later.
np.sqrt(rhos) * s
# shapes (11, 1, 1) * (1, 100, 1) = (11, 100, 1)
This gives us the same 100 simulated values for s, each multiplied by a different value of sqrt(ρ) (e.g., for ρ=0, which is the first value in rhos, the first row of this (11, 100) matrix is all zeros). We've added an extra dimension to get (11, 100, 1) in order to add to the second term.
The second term follows a similar logic, we want the values of sqrt(1 - ρ) to be multiplied across 100 simulations of 10 borrowers. Since eps.shape == (100, 10) and rhos.shape == (11,), and we want (11, 100, 10), we need to add two new axes to rhos:
np.sqrt(1 - rhos) * eps
# shapes (11, 1, 1) * (100, 10) = (11, 100, 10)
Now we want to combine those two terms for a final array of shape (11, 100, 10). This is why we gave the first term a new axis to get (11, 100, 1), which allows us to broadcast the values of the first term over the second term's last axis:
np.sqrt(rhos) * s + np.sqrt(1 - rhos) * eps
# shapes (11, 100, 1) + (11, 100, 10) = (11, 100, 10)
We're doing this because, in your original code, you are taking a scalar s and broadcasting it over eps, which was an array of length 10. In order to do that, numpy needed to broadcast s into an array of shape (10,) to match the shape of eps. We're doing the same thing here, except we're trying to do it for 100 simulations AND 11 different ρ values.
After all that nasty broadcasting, we arrive at an array that we can collapse: first a sum across borrowers (total_payoff = np.sum(loan_payoff) in your original code), and then an average across all 100 simulations. Both are achieved with the axis arguments to those respective functions: axis 2 has 10 elements, representing the borrowers; axis 1 has 100 elements, representing each simulation. So we use
payoffs.sum(axis=2).mean(axis=1)
Note that the calculation of the intermediary x is the same as in your original code.
At this point, we've obtained the average total payoff for 100 simulations across 10 borrowers, for 11 different values of ρ. From here we want to break out the average payoff by investor. In other words, we have 11 average payoffs (one for each ρ), and 3 investor rates, and we want to broadcast the 3 investor rates over the 11 average payoffs to get an array of shape (11, 3).
Right now avgs.shape == (11,) and investors.shape == (3,) so we need to add some axes to get our desired result:
investor_payouts = avgs[..., None] * investors[None, ...]
# shapes (11, 1) * (1, 3) = (11, 3)
Finally, the np.hstack stuff isn't strictly necessary; that's just me stacking the ρ values with the results so that I could put everything in a dataframe. You could just as easily create the resulting dataframe in a number of other ways, depending on what you need.
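If you want to convince yourself the broadcasting matches the loop version, a quick (and deliberately slow) sanity check is to recompute a single ρ with plain Python loops on the same draws of s and eps and compare it with avgs — a rough check, nothing more:

# Plain-loop reference for a single rho (index 2, i.e. rho = 0.2), reusing s and eps from above.
rho = rhos[2, 0, 0]
totals = []
for i in range(n_sims):
    x_i = np.sqrt(rho) * s[0, i, 0] + np.sqrt(1 - rho) * eps[i]
    payoff_i = (x_i < thresh) * default + (x_i >= thresh) * nodefault
    totals.append(payoff_i.sum())

assert np.isclose(np.mean(totals), avgs[2])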

Related

Python weighted quantile as R wtd.quantile()

I want to convert the R function Hmisc::wtd.quantile() into Python.
Here is the example in R:
I took this as a reference, and it seems the logic is different from R:
import numpy as np

# First function
def weighted_quantile(values, quantiles, sample_weight=None,
                      values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `array`
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), 'quantiles should be in [0, 1]'
    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]
    # weighted_quantiles = np.cumsum(sample_weight)
    # weighted_quantiles /= np.sum(sample_weight)
    weighted_quantiles = np.cumsum(sample_weight) / np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)

weighted_quantile(values=[0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319],
                  quantiles=np.arange(0, 1 + 1 / 5, 1 / 5),
                  sample_weight=[1, 1, 1, 1, 1])
>> array([0.2136325, 0.2136325, 0.4079128, 0.4890342, 0.5083345, 0.6197319])
# Second function
def weighted_percentile(data, weights, perc):
    """
    perc : percentile in [0-1]!
    """
    data = np.array(data)
    weights = np.array(weights)
    ix = np.argsort(data)
    data = data[ix]        # sort data
    weights = weights[ix]  # sort weights
    cdf = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights)  # 'like' a CDF function
    return np.interp(perc, cdf, data)

weighted_percentile([0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319],
                    [1, 1, 1, 1, 1],
                    np.arange(0, 1 + 1 / 5, 1 / 5))
>> array([0.2136325 , 0.31077265, 0.4484735 , 0.49868435, 0.5640332 , 0.6197319 ])
Both are different from the R output. Any ideas?
I am Python-illiterate, but from what I see and after some quick checks I can tell you the following.
Here you use uniform (sampling) weights, so you could also directly use the quantile() function. Not surprisingly, it gives the same results as wtd.quantile() with uniform weights:
x <- c(0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319)
n <- length(x)
x <- sort(x)
quantile(x, probs = seq(0,1,0.2))
# 0% 20% 40% 60% 80% 100%
# 0.2136325 0.3690567 0.4565856 0.4967543 0.5306140 0.6197319
The R quantile() function gets the quantiles in a 'textbook' way, i.e. by determining the index i of the observation to use with i = q(n+1).
In your case:
seq(0,1,0.2)*(n+1)
# 0.0 1.2 2.4 3.6 4.8 6.0
Of course since you have 5 values/obs and you want quintiles, the indices are not integers. But you know for example that the first quintile (i = 1.2) lies between obs 1 and obs 2. More precisely, it is a linear combination of the two observations (the 'weights' are derived from the value of the index):
0.2*x[1] + 0.8*x[2]
# 0.3690567
You can do the same for all the quintiles, on the basis of the indices:
q <-
c(min(x), ## 0: actually, the first obs
0.2*x[1] + 0.8*x[2], ## 1.2: quintile lies between obs 1 and 2
0.4*x[2] + 0.6*x[3], ## 2.4: quintile lies between obs 2 and 3
0.6*x[3] + 0.4*x[4], ## 3.6: quintile lies between obs 3 and 4
0.8*x[4] + 0.2*x[5], ## 4.8: quintile lies between obs 4 and 5
max(x) ## 6: actually, the last obs
)
q
# 0.2136325 0.3690567 0.4565856 0.4967543 0.5306140 0.6197319
You can see that you get exactly the output of quantile() and wtd.quantile().
If instead of 0.2*x[1] + 0.8*x[2] we consider the following:
0.5*x[1] + 0.5*x[2]
# 0.3107726
We get the output of your second Python function. It appears that your second function considers uniform 'weights' (obviously I am not talking about the sampling weights here) when combining the two observations. The issue (at least for the second Python function) seems to come from this. I know these are just insights, but I hope they will help.
EDIT: note that the difference between the two is not necessarily an 'issue' with the Python code. There are different quantile estimators (and their weighted versions), and the Python functions could simply rely on a different estimator than Hmisc::wtd.quantile(). I think the latter uses the weighted version of the Harrell-Davis quantile estimator. If you really want to implement this one, you should check the source code of Hmisc::wtd.quantile() and try to 'directly' translate it into Python.
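For completeness, you can check the unweighted side of this from Python: numpy's np.quantile with its default linear interpolation reproduces the quantile()/wtd.quantile() numbers above for this data, which suggests the discrepancy lies in the interpolation rule used by the two Python functions rather than in the weighting itself (a quick check, nothing more):

import numpy as np

x = np.array([0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319])
print(np.quantile(x, np.linspace(0, 1, 6)))
# ~[0.2136325, 0.3690567, 0.4565856, 0.4967543, 0.5306140, 0.6197319]  (matches the R output above)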

Transpose property of matrix multiplication does not hold exactly when the last dimension of the array is 1

I have a weird problem that I ran into the other day and I wondered if someone knows the reason for this weirdness. I'm sorry if it's a duplicate of some other posts, but I couldn't find similar posts on this.
I was testing the transpose property on matrix multiplications, where (A @ B).T == B.T @ A.T holds true.
I wrote some simple code to test it out, and to my surprise, when the last dimension of the array is 1, the arrays are "close" but not "equal" in some cases (ignore the ugly code...).
import numpy as np

def transpose_property(last_dim: int = 1):
    A = np.random.rand(4, 3)
    B = np.random.rand(3, last_dim)
    AT = A.transpose((1, 0)).copy()
    BT = B.transpose((1, 0)).copy()
    M_left = (A @ B).transpose(1, 0).copy()
    M_right = BT @ AT
    equal = np.array_equal(M_left, M_right)
    close = np.allclose(M_left, M_right)
    return [equal, close]
I ran some tests where I change the last dimension (last_dim) of B from 1 to 3 and count how many times (A @ B).T and B.T @ A.T were "equal" (np.array_equal) and "close" (np.allclose):
np.random.seed(0)
num_tests = 10000
for dim in range(1, 4):
    results = [transpose_property(last_dim=dim) for _ in range(num_tests)]
    equals = [r[0] for r in results if r[0]]
    closes = [r[1] for r in results if r[1]]
    print(f"dim={dim}:")
    print(f"\tequals: {len(equals)}/{num_tests}")
    print(f"\tcloses: {len(closes)}/{num_tests}")
Results of this script are shown below:
dim=1:
    equals: 3452/10000
    closes: 10000/10000
dim=2:
    equals: 10000/10000
    closes: 10000/10000
dim=3:
    equals: 10000/10000
    closes: 10000/10000
I'm puzzled that when the last dimension is 1, the number of equals is low, but when the last dimension is greater than 1, all of them match.
I think it might be due to floating-point precision rounding, but I don't understand why it's only when the last dimension is 1. How could this be?
I would like to note that:
this happens when you add more dimensions (e.g., A.shape: (3, 3, 3) and B.shape: (3, 3, 1) and transpose the last two dimensions).
this will not happen when the array is an integer (e.g., A = np.arange(12).reshape(6, 2) and B = np.arange(2).reshape(2, 1)).
This problem is not detrimental since the end arrays are "close". It's just that this has been on my mind and wanted to figure out why.
Here is a simple test to show you what is going on:
from itertools import count
import numpy as np

last_dim = 1
for i in count(1):
    A = np.random.rand(4, 3)
    B = np.random.rand(3, last_dim)
    AT = A.transpose((1, 0)).copy()
    BT = B.transpose((1, 0)).copy()
    M_left = (A @ B).transpose(1, 0).copy()
    M_right = BT @ AT
    equal = np.array_equal(M_left, M_right)
    if not equal:
        break

print(i)
print(M_left - M_right)
On my machine, I get a discrepancy on the first iteration:
array([[ 1.11022302e-16, 0.00000000e+00, 0.00000000e+00, -1.11022302e-16]])
Floats are finite-precision integers with a scale factor. The order of multiplication and addition can cause an error in the last bit to creep in, which is what you are seeing here:
>>> 2**-53
1.1102230246251565e-16
Remember that adding four numbers of order 1 changes the scale of the number to order 4, so you may end up losing up to two bits, depending on how the rounding works out. You will lose more bits as the size of the matrix increases.
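You can see the same non-associativity with plain Python floats; regrouping the same sum changes the last bit, which is exactly the kind of difference the two multiplication orders produce:

>>> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
False
>>> (0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3)
(0.6000000000000001, 0.6)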

Given an existing distribution, how can I draw samples of size N with std of X?

I have an existing distribution of values and I want to draw samples of size 5, but those 5 samples need to have a std of X within some tolerance. For example, I need 5 samples that have a std of 10 (even though the overall distribution has std ≈ 32).
The example code below somewhat works, but is quite slow for large datasets. It randomly samples the distribution until it finds something close to the target std, then removes those elements so they can't be drawn again.
Is there a smarter way to do this properly and faster? It works ok for some target_std (above 6), but it isn't accurate below 6.
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(23)

# Create a distribution
d1 = np.random.normal(95, 5, 200)
d2 = np.random.normal(125, 5, 200)
d3 = np.random.normal(115, 10, 200)
d4 = np.random.normal(70, 10, 100)
d5 = np.random.normal(160, 5, 200)
d6 = np.random.normal(170, 20, 100)
dist = np.concatenate((d1, d2, d3, d4, d5, d6))
print(f"Full distribution: len={len(dist)}, mean={np.mean(dist)}, std={np.std(dist)}")
plt.hist(dist, bins=100)
plt.title("Full Distribution")
plt.show()

batch_size = 5
num_batches = math.ceil(len(dist) / batch_size)
target_std = 10
tolerance = 1
# how many samples to search
num_samples = 100
result = []

# Find samples of batch_size that are closest to target_std
for i in range(num_batches):
    samples = []
    idxs = np.arange(len(dist))
    for j in range(num_samples):
        indices = np.random.choice(idxs, size=batch_size, replace=False)
        sample = dist[indices]
        std = sample.std()
        err = abs(std - target_std)
        samples.append((sample, indices, std, err, np.mean(sample), max(sample), min(sample)))
        if err <= tolerance:
            # close enough, stop sampling
            break
    # sort by smallest err first, then take the first/best result
    samples = sorted(samples, key=lambda x: x[3])
    best = samples[0]
    if i % 100 == 0:
        print(f"{i}, std={best[2]}, err={best[3]}, nsamples={num_samples}")
    result.append(best)
    # remove the data from our source
    dist = np.delete(dist, best[1])

df_samples = pd.DataFrame(result, columns=["sample", "indices", "std", "err", "mean", "max", "min"])
df_samples["err"].plot(title="Errors (target_std - batch_std)")
batch_std = df_samples["std"].mean()
batch_err = df_samples["err"].mean()
print(f"RESULT: Target std: {target_std}, Mean batch std: {batch_std}, Mean batch err: {batch_err}")
Since your problem is not restricted to a particular distribution, I use a normal distribution here, but this should work for any distribution. However, the run time will depend on the population size.
import numpy as np

population = np.random.randn(1000) * 32
std = 10.
tol = 1.
n_samples = 5

samples = list(np.random.choice(population, n_samples))
while True:
    center = np.mean(samples)
    dis = [abs(i - center) for i in samples]
    if np.std(samples) > (std + tol):
        # std too high: drop the sample farthest from the mean
        samples.pop(dis.index(max(dis)))
    elif np.std(samples) < (std - tol):
        # std too low: drop the sample closest to the mean
        samples.pop(dis.index(min(dis)))
    else:
        break
    # replace the dropped sample with a fresh draw
    samples.append(np.random.choice(population, 1)[0])
Here is how the code works.
First, draw n_samples; the std will probably not be in the range you want, so we calculate the mean and the absolute distance of each sample from the mean. If the std is larger than the desired value plus the tolerance, we kick out the sample furthest from the mean and draw a new one, and vice versa.
Note that if this takes too much time to calculate for your data, then after kicking the outlier out you can calculate the range in which the next element should lie and draw from that part of the population, instead of taking one at random (see the sketch after the disclaimer below). Hopefully this works for you.
DISCLAIMER: This is not a random draw anymore, and you should be aware that the draw is biased and is not representative of the population.
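To expand on the note above about computing the admissible range instead of redrawing blindly: after kicking one sample out you can solve directly for the value(s) that would hit the target std, then pick the population member closest to one of them. A sketch, assuming the population std (np.std's default ddof=0):

import numpy as np

def target_std_candidates(samples, target_std):
    """Values x such that np.std(samples + [x]) equals target_std.

    With S = sum(samples), Q = sum(samples**2) and n = len(samples) + 1, the
    population variance of samples + [x] equals target_std**2 exactly when
    (n - 1) * x**2 - 2 * S * x + (n * Q - S**2 - n**2 * target_std**2) == 0.
    Returns the two roots, or None if no real solution exists.
    """
    s = np.asarray(samples, dtype=float)
    n = len(s) + 1
    S, Q, t = s.sum(), (s ** 2).sum(), target_std
    disc = S ** 2 - (n - 1) * (n * Q - S ** 2 - n ** 2 * t ** 2)
    if disc < 0:
        return None
    root = np.sqrt(disc)
    return (S - root) / (n - 1), (S + root) / (n - 1)

You would then append the population value nearest to either candidate instead of a random draw (the bias caveat in the disclaimer applies all the more).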

Moving average on 3D array in Dask

I have a 3D array and I would like to use Dask to chunk up my 3D array into blocks of traces of a certain window size around each trace. A trace is just one vector of size (1, 1, z). I can do this using the numpy as_strided tricks as follows:
import numpy as np
from numpy.lib.stride_tricks import as_strided
input_volume = np.linspace(1, 1000, 1000, dtype=int).reshape((10, 10, 10))
window_size = 5
x, y, z = input_volume.shape
# Create a view on the volume of sub-cubes window_size traces wide overlapping by 1 trace in each direction
half_w = (window_size - 1) // 2
padded = np.pad(input_volume[...], [(half_w, half_w), (half_w, half_w), (0, 0)], 'edge')
x_str, y_str, z_str = padded.strides
blocks = as_strided(padded, (x, y, window_size, window_size, z), (x_str, y_str, x_str, y_str, z_str))
averaged_volume = np.mean(blocks, (2, 3))
First I pad my 3D cube in the x and y dimensions by half the window. I get the average trace from each block, so in this case a block of (5, 5, z) gets reduced to a single trace. I then end up with a volume the same size as the original that has been averaged over the window size. This effectively gives me a "view" of my 3D array with a shape of (10, 10, 5, 5, 10).
This works but if the volume is large it will load the whole volume into memory.
I have been trying to achieve the same thing with a chunked array in dask but I'm having trouble getting the depth and boundaries correct to give me the same answer. How can I achieve the same thing in dask so it only loads each block of traces into memory at a time and writes back out to the average cube?
EDIT:
This is the dask code I have been trying so far, but when it runs I get an IndexError: tuple index out of range when it tries to do the average calculation:
import dask.array as da

def average(block):
    return np.mean(block, axis=(0, 1))

dask_volume = da.from_array(da.pad(input_volume, [(half_w, half_w), (half_w, half_w), (0, 0)], 'edge'),
                            chunks=(window_size, window_size, -1))
dask_overlapping = da.overlap.overlap(dask_volume, depth={0: window_size - 1, 1: window_size - 1},
                                      boundary={0: 'none', 1: 'none'})
dask_average = dask_overlapping.map_blocks(average, chunks=(1, 1, z)).compute()
Thanks,
Mike
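One direction that might work (a sketch, not tested against your exact setup): instead of reproducing the strided view, let dask add the halo of half a window in x and y via map_overlap and run an ordinary mean filter inside each block; boundary="nearest" plays the role of np.pad(..., 'edge'), and the halo is trimmed off again after the filter runs. This assumes window_size is odd with half_w = (window_size - 1) // 2, as in the numpy version:

import dask.array as da
import numpy as np
from scipy.ndimage import uniform_filter

input_volume = np.linspace(1, 1000, 1000, dtype=int).reshape((10, 10, 10))
window_size = 5
half_w = (window_size - 1) // 2

def block_average(block):
    # Moving average over the x/y axes only; edge effects inside each block fall
    # in the halo region, which map_overlap trims away afterwards.
    return uniform_filter(block.astype(float), size=(window_size, window_size, 1))

dask_volume = da.from_array(input_volume, chunks=(5, 5, -1))
averaged = dask_volume.map_overlap(block_average,
                                   depth={0: half_w, 1: half_w, 2: 0},
                                   boundary="nearest",  # replicate edges, like np.pad(..., 'edge')
                                   dtype=float)
result = averaged.compute()

If the depths are right, result should match the averaged_volume computed with as_strided above (np.allclose), while only ever materialising one overlapped block at a time.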

Python: how to make an histogram with equally *sized* bins

I have a set of data, and I want to make a histogram of it. I need the bins to have the same size, by which I mean that they must contain the same number of objects, rather than the more common (numpy.histogram) problem of having equally spaced bins.
This will naturally come at the expense of the bin widths, which can - and in general will - be different.
I will specify the number of desired bins and the data set, obtaining the bins edges in return.
Example:
data = numpy.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
bins_edges = somefunc(data, nbins=3)
print(bins_edges)
>> [1.,1.3,2.1,2.12]
So the bins all contain 2 points, but their widths (0.3, 0.8, 0.02) are different.
There are two limitations:
- if a group of data is identical, the bin containing them could be bigger.
- if there are N data points and M bins are requested, each bin will contain N//M points, plus one extra bin for the remainder if N%M is not 0.
This piece of code is some cruft I've written, which worked nicely for small data sets. What if I have 10**9+ points and want to speed up the process?
import numpy as np

def def_equbin(in_distr, binsize=None, bin_num=None):
    distr_size = len(in_distr)

    bin_size = distr_size // bin_num
    odd_bin_size = distr_size % bin_num

    args = in_distr.argsort()

    hist = np.zeros((bin_num, bin_size))

    for i in range(bin_num):
        hist[i, :] = in_distr[args[i * bin_size: (i + 1) * bin_size]]

    if odd_bin_size == 0:
        odd_bin = None
        bins_limits = np.arange(bin_num) * bin_size
        bins_limits = args[bins_limits]
        bins_limits = np.concatenate((in_distr[bins_limits],
                                      [in_distr[args[-1]]]))
    else:
        odd_bin = in_distr[args[bin_num * bin_size:]]
        bins_limits = np.arange(bin_num + 1) * bin_size
        bins_limits = args[bins_limits]
        bins_limits = in_distr[bins_limits]
        bins_limits = np.concatenate((bins_limits, [in_distr[args[-1]]]))

    return (hist, odd_bin, bins_limits)
Using your example case (bins of 2 points, 6 total data points):
from scipy import stats
bin_edges = stats.mstats.mquantiles(data, [0, 2./6, 4./6, 1])
>> array([1. , 1.24666667, 2.05333333, 2.12])
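The same idea generalizes to any number of bins by asking mquantiles for the k/nbins quantiles, k = 0, ..., nbins (a small convenience wrapper, nothing more):

import numpy as np
from scipy import stats

def equal_count_edges(data, nbins):
    # Bin edges at the 0, 1/nbins, 2/nbins, ..., 1 quantiles of the data.
    return stats.mstats.mquantiles(data, np.linspace(0, 1, nbins + 1))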
I would also like to mention the existence of pandas.qcut, which does equi-populated binning quite efficiently. In your case it would work something like this:
import numpy as np
import pandas as pd

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
# parameter q specifies the number of bins
qc = pd.qcut(data, q=3, precision=1)
# bin definition
bins = qc.categories
print(bins)
>> Index(['[1, 1.3]', '(1.3, 2.03]', '(2.03, 2.1]'], dtype='object')
# bin corresponding to each point in data
codes = qc.codes
print(codes)
>> array([0, 0, 1, 1, 2, 2], dtype=int8)
Update for skewed distributions:
I came across the same problem as @astabada, wanting to create bins each containing an equal number of samples. When applying the solution proposed by @aganders3, I found that it didn't work particularly well for skewed distributions. In the case of skewed data (for example something with a whole lot of zeros), stats.mstats.mquantiles for a predefined number of quantiles will not guarantee an equal number of samples in each bin. You will get bin edges that look like this:
[0. 0. 4. 9.]
In which case the first bin will be empty.
In order to deal with skewed cases, I created a function that calls stats.mstats.mquantiles and then dynamically modifies the number of bins if samples are not equal within a certain tolerance (30% of the smallest sample size in the example code). If samples are not equal between bins, the code reduces the number of equally-spaced quantiles by 1 and calls stats.mstats.mquantiles again until sample sizes are equal or only one bin exists.
I hard coded the tolerance in the example, but this could be modified to a keyword argument if desired.
I also prefer giving the number of equally spaced quantiles as an argument to my function instead of giving user defined quantiles to stats.mstats.mquantiles in order to reduce accidental errors (i.e. something like [0., 0.25, 0.7, 1.]).
Here's the code:
import numpy as np
from scipy import stats

def equibins(dat, binnum, **kwargs):
    numin = binnum
    while numin > 1.:
        qtls = np.linspace(0., 1.0, num=numin, endpoint=False)
        ebins = stats.mstats.mquantiles(dat, qtls, alphap=kwargs['alpha'], betap=kwargs['beta'])
        allhist, allbin = np.histogram(dat, bins=ebins)
        if (np.unique(ebins).shape != ebins.shape or tolerence(allhist, 0.3) == False) and numin > 2:
            numin = numin - 1
            del qtls, ebins
        else:
            numin = 0
    return ebins

def tolerence(narray, percent):
    if percent > 1.0:
        per = percent / 100.
    else:
        per = percent
    lev_tol = per * narray.min()
    tolerate = np.all(narray[1:] - narray[0] < lev_tol)
    return tolerate
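A hypothetical example call (the data here is made up purely to illustrate the required alpha/beta keywords, which are forwarded to mquantiles as alphap/betap):

# Hypothetical skewed data: lots of zeros plus an exponential tail.
skewed_data = np.concatenate([np.zeros(50), np.random.exponential(2.0, 150)])
edges = equibins(skewed_data, 5, alpha=0.4, beta=0.4)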
Just sort the data, and divide it into fixed bins by length! Obviously you can never divide into exactly equally populated bins, if the number of samples does not divide exactly by the number of bins.
import math
import numpy as np

data = np.array([2, 3, 5, 6, 8, 5, 5, 6, 3, 2, 3, 7, 8, 9, 8, 6, 6, 8, 9, 9, 0, 7, 5, 3, 3, 4, 5, 6, 7])
data_sorted = np.sort(data)

nbins = 3
step = math.ceil(len(data_sorted) / nbins)

binned_data = []
for i in range(0, len(data_sorted), step):
    binned_data.append(data_sorted[i:i + step])
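np.array_split does essentially the same slicing and spreads any remainder over the first bins for you, so a slightly shorter variant of the above (same idea, just letting numpy pick the split points) is:

binned_data = np.array_split(data_sorted, nbins)
bin_edges = np.array([b[0] for b in binned_data] + [data_sorted[-1]])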
