Divide a stream into bins with equal counts - python

Ideally, I want the following without reading the data from the hard disk too many times. The data is big and memory cannot hold all of it at the same time.
Input is a stream x[t] read from the hard disk. The stream contains N numbers.
It is possible to build a histogram of x with m bins.
The m bins are defined by the bin edges e0 < e1 < ... < em. For example, if ei <= x[0] < ei+1, then x[0] belongs to the ith bin.
Find the bin edges that make each bin hold a nearly equal number of elements from the stream. The number of elements in each bin should ideally be within some threshold percent of N/m. This is because if we evenly distribute N elements among m bins, each bin should hold about N/m elements.
Current solution:
import numpy as np
def test_data(size):
    x = np.random.normal(0, 0.5, size // 2)
    x = np.hstack([x, np.random.normal(4, 1, size // 2)])
    return x

def bin_edge_as_index(n_bin, fine_hist, fine_n_bin, data_size):
    cum_sum = np.cumsum(fine_hist)
    bin_id = np.empty((n_bin + 1), dtype=int)
    count_per_bin = data_size * 1.0 / n_bin
    for i in range(1, n_bin):
        bin_id[i] = np.argmax(cum_sum > count_per_bin * i)
    bin_id[0] = 0
    bin_id[n_bin] = fine_n_bin
    return bin_id

def get_bin_count(bin_edge, data):
    n_bin = bin_edge.shape[0] - 1
    result = np.zeros((n_bin), dtype=int)
    for i in range(n_bin):
        cmp0 = (bin_edge[i] <= data)
        cmp1 = (data < bin_edge[i + 1])
        result[i] = np.sum(cmp0 & cmp1)
    return result
# Test Setting
test_size = 10000
n_bin = 6
fine_n_bin = 2000 # use a big number and hope it works
# Test Data
x = test_data(test_size)
# Fine Histogram
fine_hist, fine_bin_edge = np.histogram(x, fine_n_bin)
# Index of the bins of the fine histogram that contains
# the required bin edges (e_1, e_2, ... e_n)
bin_id = bin_edge_as_index(
    n_bin, fine_hist, fine_n_bin, test_size)
# Find the bin edges
bin_edge = fine_bin_edge[bin_id]
print("bin_edges:")
print(bin_edge)
# Check
bin_count = get_bin_count(bin_edge, x)
print("bin_counts:")
print(bin_count)
print("ideal count per bin:")
print(test_size * 1.0 / n_bin)
Output of program:
bin_edges:
[-1.86507282 -0.22751473 0.2085489 1.30798591 3.57180559 4.40218207
7.41287669]
bin_counts:
[1656 1675 1668 1663 1660 1677]
ideal count per bin:
1666.6666666666667
Problem:
I can't specify a threshold s and expect the bin counts to be at most s% away from the ideal count per bin.
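For illustration, the kind of check I would like to pass is something like the following (a rough sketch only; within_threshold is just a name made up for this example, reusing bin_count, test_size and n_bin from the code above):

def within_threshold(bin_count, data_size, n_bin, s):
    # True if every bin count is within s percent of the ideal N/m count
    ideal = data_size / n_bin
    return np.all(np.abs(bin_count - ideal) <= ideal * s / 100.0)

print(within_threshold(bin_count, test_size, n_bin, s=1.0))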

Assuming that the distribution is not outrageously skewed (like 10000 values between 1.0000001 and 1.0000002 and 10000 others between 9.0000001 and 9.0000002), you can proceed as below.
Compute a histogram with a sufficient resolution, say K bins, which covers the whole range (hopefully known beforehand). This will take a single pass over the data.
Then compute the cumulative histogram and, as you go, identify the m+1 quantile edges (where the cumulative counts cross multiples of N/m).
The accuracy that you will get is dictated by the maximum number of elements in a bin of the original histogram.
For N elements, using a histogram of K bins and assuming some "nonuniformity factor" f (equal to a few units for reasonable distributions), the maximum error will be f·N/K.
You can improve accuracy if you like by considering m+1 auxiliary histograms which only accumulate the values that fall in the quantile bins of the global histogram. Then you can refine the quantiles to the resolution of these auxiliary histograms.
This will cost you an extra pass, but the error will be reduced to f·N/(K·K'), using only K and then m·K' histogram space, instead of K·K'.
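A minimal sketch of the first pass plus the quantile extraction (not the answerer's exact code; it assumes the overall range [lo, hi] is known beforehand and the stream arrives as NumPy chunks read from disk):

import numpy as np

def equal_count_edges(chunks, lo, hi, m, K=10000):
    """Approximate the m+1 equal-count bin edges of a data stream."""
    coarse_edges = np.linspace(lo, hi, K + 1)
    counts = np.zeros(K, dtype=np.int64)
    n = 0
    for chunk in chunks:  # single pass over the data
        c, _ = np.histogram(chunk, bins=coarse_edges)
        counts += c
        n += chunk.size
    cum = np.cumsum(counts)
    # coarse bin where the cumulative count crosses each multiple of N/m
    targets = n / m * np.arange(1, m)
    idx = np.searchsorted(cum, targets)
    # use the upper edge of the crossing bin; the error is at most the
    # count of one coarse bin
    return np.concatenate(([lo], coarse_edges[idx + 1], [hi]))

With the test data from the question this could be called as equal_count_edges([x], x.min(), x.max(), n_bin); the refinement pass described above repeats the same idea inside each resulting bin with m auxiliary histograms.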

If (and only if) you can assume your data is random with a defined distribution (that is: taking any non-trivial percentage of your data in sequence will "sketch" the same distribution as the entire data set, only with coarser precision), I imagine there are a number of options:
read a part of your data into some oversampled histogram. Based on this, choose an approximation for the bin edges the way you do now (as explained in your question), then uniformly oversample these bins, then read another chunk of your data into the new bins, and so on. If you have enough data, processing it in chunks of 10% would allow for 10 iterations to improve your bin structure in a single pass (a rough sketch of this idea follows after this list).
start with a number of bins and accumulate some (not all) of the data. Look over them and, if one bin has bin_width*count disproportionately higher than its neighbours (maybe this is where the precision/error comes into play), divide that bin in two and heuristically assign the old bin count to the newly created bins (one possible heuristic: proportionally to the counts of the neighbours). At the end, you should have a division somewhat controlled by the acceptable error, from which to sort-of interpolate your distribution.
Of course, the above are only sketches of approaches; I can't offer any guarantee about how well they'll work.
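A rough sketch of the first idea (an untested outline under the same assumptions; refine_edges, chunks and oversample are names invented here, and a known range [lo, hi] is assumed):

import numpy as np

def refine_edges(chunks, lo, hi, n_bin, oversample=50):
    # Refine approximate equal-count edges one chunk at a time (single pass).
    edges = np.linspace(lo, hi, n_bin * oversample + 1)  # oversampled start grid
    counts = np.zeros(edges.size - 1, dtype=np.float64)
    seen = 0
    coarse = np.linspace(lo, hi, n_bin + 1)
    for chunk in chunks:
        c, _ = np.histogram(chunk, bins=edges)
        counts += c
        seen += chunk.size
        # approximate equal-count edges from the accumulated counts
        cum = np.cumsum(counts)
        targets = seen / n_bin * np.arange(1, n_bin)
        coarse = np.concatenate(([lo], edges[np.searchsorted(cum, targets)], [hi]))
        # uniformly oversample the new coarse bins for the next chunk and
        # re-bin the accumulated counts onto the new grid (approximately,
        # by their left edges -- the price of staying single-pass)
        new_edges = np.unique(np.concatenate(
            [np.linspace(coarse[i], coarse[i + 1], oversample + 1)
             for i in range(n_bin)]))
        counts = np.histogram(edges[:-1], bins=new_edges, weights=counts)[0]
        edges = new_edges
    return coarse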

Related

What is the most efficient way to generate a list of random numbers all within a range that have a fixed sum so that their boundaries are approached?

I am trying to generate a list of 12 random weights for a stock portfolio in order to determine how the portfolio would have performed in the past given different weights assigned to each stock. The sum of the weights must of course be 1 and there is an additional restriction: each stock must have a weight between 1/24 and 1/4.
Although I am able to generate random numbers such that they all fall within the interval by using random.uniform(), as well as guarantee their sum is 1 by dividing each weighting by the sum of the weightings, I'm finding that
a) each subsequent array of weightings is very similar. I am rarely getting values for weightings that are near the upper boundary of 1/4
b) random.seed() does not seem to be working properly, whether I put it in the randweight() function or at the beginning of the for loop. I'm confused as to why because I thought that generating a random seed value would make my array of weights unique for each iteration. Currently, it's cyclical, with a period of 3.
The following is my code:
import random
import numpy as np

# boundaries on weightings
n = 12
min_weight = (1/(2*n))
max_weight = 25 / 100

def rand_weight(e):
    random.seed()
    return e + np.random.uniform(min_weight, max_weight)

for i in range(100):
    weights = np.empty(12)
    while not (np.all(weights > min_weight) and np.all(weights < max_weight)):
        weights = np.array(list(map(rand_weight, weights)))
        weights /= np.sum(weights)
I have already tried scattering the weights by changing the min_weight and max_weight inside the for loop so that rand_weight generates newer values, but this makes the runtime really slow because the "not" condition in the while loop takes longer to evaluate to false (since the probability of all the numbers being in the range decreases).
Let's start with simple facts. If you want 12 i.i.d. numbers in the range [0.042, 0.25] that sum to one, then for the mean value:
Sum(Xi) = 1
E[Sum(Xi)] = Sum(E[Xi]) = N·E[Xi] = 1
E[Xi] = 1/N = 1/12 ≈ 0.083
One corollary is that it will be hard to get numbers close to the upper boundary.
And instead of sampling arbitrary values and then normalizing them so that the sum is 1, it is better to use a known distribution whose values sum to 1 to begin with.
So let's use the Dirichlet distribution and sample points uniformly in the simplex, which means the alpha (concentration) vector is all ones.
import numpy as np
N = 12
s = np.random.dirichlet(N*[1.0], 1)
print(np.sum(s))
Some values will fall outside the bounds, and you can reject those samples:
def sampleWeights(alpha, lo, hi):
    while True:
        s = np.random.dirichlet(alpha, 1)[0]
        if np.any(s > hi):
            continue  # reject
        if np.any(s < lo):
            continue  # reject
        return s  # accept
and call it like this
N=12
alpha = N*[1.0]
q = sampleWeights(alpha, 1./24., 1./4.)
If you check, you will find that most rejections happen at the lower bound rather than the upper bound.
The beauty of using the known Dirichlet distribution is that you can "concentrate" the sampled values around the mean, e.g.
alpha = N*[10.0]
q = sampleWeights(alpha, 1./24., 1./4.)
will produce i.i.d. values with the same mean of 1/12 but a much smaller standard deviation, i.e. RVs much more concentrated around the mean.
And if you want non-identically distributed RVs, use different alphas:
alpha = [1.,2.,3.,4.,5.,6.,6.,5.,4.,3.,2.,1.]
q = sampleWeights(alpha, 1./24., 1./4.)
then some of the RVs will be close to the upper boundary and some close to the lower boundary. There are many advantages to using a known distribution.
The following works. What was particularly confusing to me is that np.empty(12) seemed to always return the same array, so once it had been initialized, it stayed the same.
This seems to produce numbers above 0.22 reasonably often.
import numpy as np
from random import random, seed

# boundaries on weightings
n = 12
min_weight = (1/(2*n))
max_weight = 25 / 100

seed(666)
for i in range(100):
    weights = np.zeros(n)
    while not (np.all(weights > min_weight) and np.all(weights < max_weight)):
        weights = np.array([random() for _ in range(n)])
        weights /= np.sum(weights) - min_weight * n
        weights += min_weight
    print(weights)

How to build (or precompute) a histogram from a file too large for memory

Is there a graphing library for python that doesn't require storing all raw data points as a numpy array or list in order to graph a histogram?
I have a dataset too large for memory, and I don't want to use subsampling to reduce the data size.
What I'm looking for is a library that can take the output of a generator (each data point yielded from a file, as a float), and build a histogram on the fly.
This includes computing bin size as the generator yields each data point from the file.
If such a library doesn't exist, I'd like to know if numpy is able to precompute a counter of {bin_1:count_1, bin_2:count_2...bin_x:count_x} from yielded datapoints.
Datapoints are held as a vertical matrix in a tab-delimited file, arranged as node-node-score, like below:
node node 5.55555
More information:
104301133 lines in data (so far)
I don't know the min or max values
bin widths should be the same
number of bins could be 1000
Attempted Answer:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

low = np.inf
high = -np.inf

# find the overall min/max
chunksize = 1000
loop = 0
for chunk in pd.read_table('gsl_test_1.txt', header=None, chunksize=chunksize, delimiter='\t'):
    low = np.minimum(chunk.iloc[:, 2].min(), low)
    high = np.maximum(chunk.iloc[:, 2].max(), high)
    loop += 1
lines = loop*chunksize

nbins = math.ceil(math.sqrt(lines))

bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.int64)  # np.ndarray filled with np.uint32 zeros, CHANGED TO int64

# iterate over your dataset in chunks of 1000 lines (increase or decrease this
# according to how much you can hold in memory)
for chunk in pd.read_table('gsl_test_1.txt', header=None, chunksize=chunksize, delimiter='\t'):
    # compute bin counts over the 3rd column
    subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)  # np.ndarray filled with np.int64
    # accumulate bin counts over chunks
    total += subtotal

plt.hist(bin_edges[:-1], bins=bin_edges, weights=total)
# plt.bar(np.arange(total.shape[0]), total, width=1)
plt.savefig('gsl_test_hist.svg')
Output: (resulting histogram plot not shown)
You could iterate over chunks of your dataset and use np.histogram to accumulate your bin counts into a single vector (you would need to define your bin edges a priori and pass them to np.histogram using the bins= parameter), e.g.:
import numpy as np
import pandas as pd
bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.uint)
# iterate over your dataset in chunks of 1000 lines (increase or decrease this
# according to how much you can hold in memory)
for chunk in pd.read_table('/path/to/my/dataset.txt', header=None, chunksize=1000):
    # compute bin counts over the 3rd column
    subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)
    # accumulate bin counts over chunks
    total += subtotal.astype(np.uint)
If you want to ensure that your bins span the full range of values in your array, but you don't already know the minimum and maximum then you will need to loop over it once beforehand to compute these (e.g. using np.min/np.max), for example:
low = np.inf
high = -np.inf
# find the overall min/max
for chunk in pd.read_table('/path/to/my/dataset.txt', header=None, chunksize=1000):
    low = np.minimum(chunk.iloc[:, 2].min(), low)
    high = np.maximum(chunk.iloc[:, 2].max(), high)
Once you have your array of bin counts, you can then generate a bar plot directly using plt.bar:
plt.bar(bin_edges[:-1], total, width=1)
It's also possible to use the weights= parameter to plt.hist in order to generate a histogram from a vector of counts rather than samples, e.g.:
plt.hist(bin_edges[:-1], bins=bin_edges, weights=total, ...)

Python: Binning one coordinate and averaging another based on these bins

I have two vectors rev_count and stars. The elements of those form pairs (let's say rev_count is the x coordinate and stars is the y coordinate).
I would like to bin the data by rev_count and then average the stars in a single rev_count bin (I want to bin along the x axis and compute the average y coordinate in that bin).
This is the code that I tried to use (inspired by my matlab background):
import matplotlib.pyplot as plt
import numpy
binwidth = numpy.max(rev_count)/10
revbin = range(0, numpy.max(rev_count), binwidth)
revbinnedstars = [None]*len(revbin)
for i in range(0, len(revbin)-1):
    revbinnedstars[i] = numpy.mean(stars[numpy.argwhere((revbin[i]-binwidth/2) < rev_count < (revbin[i]+binwidth/2))])
print('Plotting binned stars with count')
plt.figure(3)
plt.plot(revbin, revbinnedstars, '.')
plt.show()
However, this seems to be incredibly slow/inefficient. Is there a more natural way to do this in python?
Scipy has a function for this:
from scipy.stats import binned_statistic
revbinnedstars, edges, _ = binned_statistic(rev_count, stars, 'mean', bins=10)
revbin = edges[:-1]
If you don't want to use scipy there's also a histogram function in numpy:
sums, edges = numpy.histogram(rev_count, bins=10, weights=stars)
counts, _ = numpy.histogram(rev_count, bins=10)
revbinnedstars = sums / counts
I suppose you are using Python 2, but if not you should change the division when calculating the step to // (floor division), otherwise range will complain that it cannot interpret a float as its step.
binwidth = numpy.max(rev_count)//10  # Changed this to floor division
revbin = range(0, numpy.max(rev_count), binwidth)
revbinnedstars = [None]*len(revbin)
for i in range(0, len(revbin)-1):
    # I actually don't know what you wanted to do but I guess you wanted the
    # "logical and" combination in that bin (you don't need to use np.where here)
    # You can put that all in one statement but it gets crowded so I'll split it:
    index1 = revbin[i]-binwidth/2 < rev_count
    index2 = rev_count < revbin[i]+binwidth/2
    revbinnedstars[i] = numpy.mean(stars[numpy.logical_and(index1, index2)])
That at least should work and gives the right results. It will be very inefficient if you have huge datasets and want more than 10 bins.
One very important takeaway:
Don't use np.argwhere if you want to index an array. Its result is just supposed to be human readable. If you really want the coordinates, use np.where. That can be used as an index, but it isn't that pretty to read if you have multidimensional inputs.
The numpy documentation supports me on that point:
The output of argwhere is not suitable for indexing arrays. For this purpose use where(a) instead.
That's also the reason why your code was so slow: it tried to do something you didn't want it to do, which can be very expensive in memory and CPU usage, and it didn't give you the right result either.
What I have used here are boolean masks. They are shorter to write than np.where(condition) and involve one less calculation.
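A tiny illustration of the difference, with made-up toy data:

import numpy as np

a = np.array([1.0, 4.0, 2.0, 8.0, 5.0])
mask = (a > 1.5) & (a < 6.0)   # boolean mask, usable directly as an index
print(a[mask].mean())          # 3.666...

idx = np.where(mask)           # tuple of index arrays, also usable as an index
print(a[idx].mean())           # same result, one extra step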
A completely vectorized approach could be used by defining a grid that knows which stars are in which bin:
bins = 10
binwidth = numpy.max(rev_count)//bins
revbin = np.arange(0, np.max(rev_count)+binwidth+1, binwidth)
An even better approach for defining the bins would be the following. Beware that you have to add one to the maximum, since you want to include it, and one to the number of bins, because you are interested in the bin start and end points, not the centers of the bins:
number_of_bins = 10
revbin = np.linspace(np.min(rev_count), np.max(rev_count)+1, number_of_bins+1)
and then you can setup the grid:
grid = np.logical_and(rev_count[None, :] >= revbin[:-1, None], rev_count[None, :] < revbin[1:, None])
The grid is bins x rev_count in size (because of the broadcasting: I increased the dimensions of each of those arrays by one, but not the same one). This essentially checks whether a point is greater than or equal to the lower bin edge and smaller than the upper bin edge (hence the [:-1] and [1:] indices). This is done multidimensionally, with the counts in the second dimension (numpy axis=1) and the bins in the first dimension (numpy axis=0).
So we can get the Y coordinates of the stars in the appropriate bin by just multiplying these with this grid:
stars * grid
To calculate the mean we need the sum of the coordinates in this bin and divide it by the number of stars in that bin (bins are along the axis=1, stars that are not in this bin only have a value of zero along this axis):
revbinnedstars = np.sum(stars * grid, axis=1) / np.sum(grid, axis=1)
I actually don't know if that's more efficient. It'll be a lot more expensive in memory but maybe a bit less expensive in CPU.
The function I use for binning (x, y) data and determining summary statistics, such as the mean values in those bins, is based upon the scipy.stats.binned_statistic() function. I have written a wrapper for it, because I use it a lot. You may find this useful...
from scipy import stats

def binXY(x, y, statistic='mean', xbins=10, xrange=None):
    """
    Finds statistical value of x and y values in each x bin.
    Returns the same type of statistic for both x and y.
    See scipy.stats.binned_statistic() for options.

    Parameters
    ----------
    x : array
        x values.
    y : array
        y values.
    statistic : string or callable, optional
        See documentation for scipy.stats.binned_statistic(). Default is mean.
    xbins : int or sequence of scalars, optional
        If xbins is an integer, it is the number of equal bins within xrange.
        If xbins is an array, then it is the location of xbin edges, similar
        to definitions used by np.histogram. Default is 10 bins.
        All but the last (righthand-most) bin is half-open. In other words, if
        bins is [1, 2, 3, 4], then the first bin is [1, 2) (including 1, but
        excluding 2) and the second [2, 3). The last bin, however, is [3, 4],
        which includes 4.
    xrange : (float, float) or [(float, float)], optional
        The lower and upper range of the bins. If not provided, range is
        simply (x.min(), x.max()). Values outside the range are ignored.

    Returns
    -------
    x_stat : array
        The x statistic (e.g. mean) in each bin.
    y_stat : array
        The y statistic (e.g. mean) in each bin.
    n : array of dtype int
        The count of y values in each bin.
    """
    x_stat, xbin_edges, binnumber = stats.binned_statistic(
        x, x, statistic=statistic, bins=xbins, range=xrange)
    y_stat, xbin_edges, binnumber = stats.binned_statistic(
        x, y, statistic=statistic, bins=xbins, range=xrange)
    n, xbin_edges, binnumber = stats.binned_statistic(
        x, y, statistic='count', bins=xbins, range=xrange)
    return x_stat, y_stat, n
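For instance, applied to data like the rev_count / stars pair from the question, a call might look like this (a usage sketch with made-up stand-in data):

import numpy as np
import matplotlib.pyplot as plt

# stand-in data for rev_count and stars
rev_count = np.random.randint(1, 500, size=1000)
stars = np.random.uniform(1, 5, size=1000)

rev_mean, stars_mean, n = binXY(rev_count, stars, statistic='mean', xbins=10)
plt.plot(rev_mean, stars_mean, '.')
plt.show()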

Fastest way to get average value of frequencies within range [closed]

I am new to Python as well as to signal processing. I am trying to calculate the mean value over some frequency range of a signal.
What I am trying to do is as follows:
import numpy as np
data = <my 1d signal>
lF = <lower frequency>
uF = <upper frequency>
ps = np.abs(np.fft.fft(data)) ** 2 #array of power spectrum
time_step = 1.0 / 2000.0
freqs = np.fft.fftfreq(data.size, time_step) # array of frequencies
idx = np.argsort(freqs) # sorting frequencies
sum = 0
c = 0
for i in idx:
    if (freqs[i] >= lF) and (freqs[i] <= uF):
        sum += ps[i]
        c += 1
avgValue = sum/c
print 'mean value is=', avgValue
I think the calculation is fine, but it takes a lot of time for data of more than 15 GB, and the processing time grows quickly. Is there a faster way to get the mean value of the power spectrum within some frequency range? Thanks in advance.
EDIT 1
I followed this code for calculation of power spectrum.
EDIT 2
This doesn't answer my question, as it calculates the mean over the whole array/list, but I want the mean over part of the array.
EDIT 3
The solution by jez of using a mask reduces the time. Actually I have more than 10 channels of 1-D signal and I want to treat them all in the same manner, i.e. average the frequencies in a range for each channel separately. I think Python loops are slow. Is there an alternative for that?
Like this:
for i in xrange(0, 15):
    data = signals[:, i]
    ps = np.abs(np.fft.fft(data)) ** 2
    freqs = np.fft.fftfreq(data.size, time_step)
    mask = np.logical_and(freqs >= lF, freqs <= uF)
    avgValue = ps[mask].mean()
    print 'mean value is=', avgValue
The following performs a mean over a selected region:
mask = numpy.logical_and( freqs >= lF, freqs <= uF )
avgValue = ps[ mask ].mean()
For proper scaling of power values that have been computed as abs(fft coefficients)**2, you will need to multiply by (2.0 / len(data))**2 (Parseval's theorem).
Note that it gets slightly fiddly if your frequency range includes the Nyquist frequency: for precise results, the handling of that single frequency component would then need to depend on whether data.size is even or odd. So for simplicity, ensure that uF is strictly less than max(freqs). For similar reasons you should ensure lF > 0.
The reasons for this are tedious to explain and even more tedious to correct for, but basically: the DC component is represented once in the DFT, whereas most other frequency components are represented twice (positive frequency and negative frequency) at half-amplitude each time. The even-more-annoying exception is the Nyquist frequency which is represented once at full amplitude if the signal length is even, but twice at half amplitude if the signal length is odd. All of this would not affect you if you were averaging amplitude: in a linear system, being represented twice compensates for being at half amplitude. But you're averaging power, i.e. squaring the values before averaging, so this compensation doesn't work out.
I've pasted my code for grokking all of this. This code also shows how you can work with multiple signals stacked in one numpy array, which addresses your follow-up question about avoiding loops in the multi-channel case. Remember to supply the correct axis argument both to numpy.fft.fft() and to my fft2ap().
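As a sketch of the multi-channel case without a Python loop (assuming signals has shape (n_samples, n_channels) as in the loop above, and ignoring the DC/Nyquist subtleties just discussed):

import numpy as np

# one FFT per channel along axis=0
ps = np.abs(np.fft.fft(signals, axis=0)) ** 2
freqs = np.fft.fftfreq(signals.shape[0], time_step)
mask = (freqs >= lF) & (freqs <= uF)

# mean power in the selected band, computed for all channels at once
avgValues = ps[mask, :].mean(axis=0)
print(avgValues)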
If you really have a signal of 15 GB, you'll not be able to calculate the FFT in an acceptable time. You can avoid the FFT if it is acceptable for you to approximate your frequency range by a band-pass filter. The justification is Parseval's theorem, which states that the sum of squares is not changed by an FFT (or: the power is preserved). Staying in the time domain lets the processing time rise only proportionally to the signal length.
The following code designs a Butterworth band-pass filter, plots the filter response and filters a sample signal:
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
dd = np.random.randn(10**4) # generate sample data
T = 1./2e3 # sampling interval
n, f_s = len(dd), 1./T # number of points and sampling frequency
# design band path filter:
f_l, f_u = 50, 500 # Band from 50 Hz to 500 Hz
wp = np.array([f_l, f_u])*2/f_s  # normalized pass band frequencies
ws = np.array([0.8*f_l, 1.2*f_u])*2/f_s  # normalized stop band frequencies
b, a = signal.iirdesign(wp, ws, gpass=60, gstop=80, ftype="butter",
                        analog=False)
# plot filter response:
w, h = signal.freqz(b, a, whole=False)
ff_w = w*f_s/(2*np.pi)
fg, ax = plt.subplots()
ax.set_title('Butterworth filter amplitude response')
ax.plot(ff_w, np.abs(h))
ax.set_ylabel('relative Amplitude')
ax.grid(True)
ax.set_xlabel('Frequency in Hertz')
fg.canvas.draw()
# do the filtering:
zi = signal.lfilter_zi(b, a)*dd[0]
dd1, _ = signal.lfilter(b, a, dd, zi=zi)
# calculate the average power:
avg = np.mean(dd1**2)
print("Mean square value is %g" % avg)
plt.show()
Read the documentation to Scipy's Filter design to learn how to modify the parameters of the filter.
If you want to stay with the FFT, read the docs on signal.welch and plt.psd. The Welch algorithm is a method to efficiently calculate the power spectral density of a signal (with some trade-offs).
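For example, a minimal Welch-based sketch of the band average, reusing the sample data dd, the sampling frequency f_s and the band f_l to f_u from the code above (the nperseg value is an assumption):

f, Pxx = signal.welch(dd, fs=f_s, nperseg=1024)  # PSD via Welch's method
band = (f >= f_l) & (f <= f_u)
print("Mean PSD in band: %g" % Pxx[band].mean())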
It is much easier to work with the FFT if your array length is a power of 2. When you do an FFT, the frequencies range from -pi/timestep to pi/timestep (assuming that frequency is defined as w = 2*pi/t; change the values accordingly if you use the f = 1/t representation). The raw spectrum is arranged from 0 up to the maximum frequency, and then from the minimum (most negative) frequency back towards zero. You can now use the fftshift function to swap the two halves so that your spectrum looks like minfreq -- DC -- maxfreq, and you can easily determine your desired frequency range because it is already sorted.
The frequency step is dw = 2*pi/(time span), or max frequency/(N/2), where N is the array size.
After the shift, the N/2 point is DC (0 frequency) and the Nth position is the maximum frequency, so you can easily determine your range:
Lower_freq_index = N/2 + N/2*Lower_freq/Max_freq
Higher_freq_index = N/2 + N/2*Higher_freq/Max_freq
avg = sum(ps[Lower_freq_index:Higher_freq_index])/(Higher_freq_index - Lower_freq_index)
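A hedged sketch of that indexing approach, assuming ps, freqs, lF and uF from the question's code and an even-length signal:

import numpy as np

ps_shifted = np.fft.fftshift(ps)        # reorder power spectrum: minfreq .. DC .. maxfreq
freqs_shifted = np.fft.fftshift(freqs)  # matching reordered frequency axis

N = freqs_shifted.size
max_freq = freqs_shifted.max()
lower_idx = int(N // 2 + N // 2 * lF / max_freq)
higher_idx = int(N // 2 + N // 2 * uF / max_freq)
avgValue = ps_shifted[lower_idx:higher_idx].mean()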
I hope this will help. Regards.

Monte Carlo Simulation with Python: building a histogram on the fly

I have a conceptual question on building a histogram on the fly with Python. I am trying to figure out if there is a good algorithm or maybe an existing package.
I wrote a function, which runs a Monte Carlo simulation, gets called 1,000,000,000 times, and returns a 64 bit floating number at the end of each run. Below is the said function:
def MonteCarlo(df, head, span):
    # Pick initial truck
    rnd_truck = np.random.randint(0, len(df))
    full_length = df['length'][rnd_truck]
    full_weight = df['gvw'][rnd_truck]
    # Loop using other random trucks until the bridge is full
    while True:
        rnd_truck = np.random.randint(0, len(df))
        full_length += head + df['length'][rnd_truck]
        if full_length > span:
            break
        else:
            full_weight += df['gvw'][rnd_truck]
    # Return average weight per foot on the bridge
    return(full_weight/span)
df is a Pandas dataframe object having columns labeled as 'length' and 'gvw', which are truck lengths and weights, respectively. head is the distance between two consecutive trucks, span is the bridge length. The function randomly places trucks on the bridge as long as the total length of the truck train is less than the bridge length. Finally, it calculates the average weight per foot of the trucks on the bridge (total weight on the bridge divided by the bridge length).
As a result I would like to build a tabular histogram showing the distribution of the returned values, which can be plotted later. I had some ideas in mind:
Keep collecting the returned values in a numpy vector, then use existing histogram functions once the MonteCarlo analysis is completed. This would not be feasible, since if my calculation is correct, I would need 7.5 GB of memory for that vector alone (1,000,000,000 64 bit floats ~ 7.5 GB).
Initialize a numpy array with a given range and number of bins. Increase the count in the matching bin by one at the end of each run. The problem is, I do not know the range of values I would get, so setting up a histogram with a suitable range and bin size is an unknown. I also have to figure out how to assign values to the correct bins, but I think it is doable (a sketch of this idea is given after this list).
Do it somehow on the fly. Modify ranges and bin sizes each time the function returns a number. This would be too tricky to write from scratch I think.
Well, I bet there may be a better way to handle this problem. Any ideas would be welcome!
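A minimal sketch of the second idea, assuming a guessed range [lo, hi] (values falling outside it are lumped into underflow/overflow bins); the names here are made up for illustration:

import numpy as np

lo, hi, nbins = 0.0, 10.0, 1000               # guessed range and resolution
edges = np.linspace(lo, hi, nbins + 1)
counts = np.zeros(nbins + 2, dtype=np.int64)  # two extra bins for under/overflow

def record(value):
    # np.searchsorted maps the value to a bin: 0 is underflow, nbins + 1 is overflow
    counts[np.searchsorted(edges, value, side='right')] += 1

# inside the simulation loop:
# record(MonteCarlo(df_basic, 15., 200.))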
On a second note, I tested running the above function for 1,000,000,000 times only to get the largest value that is computed (the code snippet is below). And this takes around an hour when span = 200. The computation time would increase if I run it for longer spans (the while loop runs longer to fill the bridge with trucks). Is there a way to optimize this you think?
max_w = 0
i = 1
while i < 1000000000:
    w = MonteCarlo(df_basic, 15., 200.)  # call once and reuse the result
    if max_w < w:
        max_w = w
    i += 1
print max_w
Thanks!
Here is a possible solution, with fixed bin size, and bins of the form [k * size, (k + 1) * size). The function finalizebins returns two lists: one with the bin counts (a), and the other (b) with the bin lower bounds (the upper bound is deduced by adding binsize).
import math, random

def updatebins(bins, binsize, x):
    i = math.floor(x / binsize)
    if i in bins:
        bins[i] += 1
    else:
        bins[i] = 1

def finalizebins(bins, binsize):
    imin = min(bins.keys())
    imax = max(bins.keys())
    a = [0] * (imax - imin + 1)
    b = [binsize * k for k in range(imin, imax + 1)]
    for i in range(imin, imax + 1):
        if i in bins:
            a[i - imin] = bins[i]
    return a, b

# A test with a mixture of gaussian distributions
def check(n):
    bins = {}
    binsize = 5.0
    for i in range(n):
        if random.random() > 0.5:
            x = random.gauss(100, 50)
        else:
            x = random.gauss(-200, 150)
        updatebins(bins, binsize, x)
    return finalizebins(bins, binsize)

a, b = check(10000)
# This must be 10000
sum(a)

# Plot the data
from matplotlib.pyplot import *
bar(b, a)
show()
