I have a conceptual question on building a histogram on the fly with Python. I am trying to figure out if there is a good algorithm or maybe an existing package.
I wrote a function, which runs a Monte Carlo simulation, gets called 1,000,000,000 times, and returns a 64 bit floating number at the end of each run. Below is the said function:
def MonteCarlo(df,head,span):
# Pick initial truck
rnd_truck = np.random.randint(0,len(df))
full_length = df['length'][rnd_truck]
full_weight = df['gvw'][rnd_truck]
# Loop using other random trucks until the bridge is full
while True:
rnd_truck = np.random.randint(0,len(df))
full_length += head + df['length'][rnd_truck]
if full_length > span:
break
else:
full_weight += df['gvw'][rnd_truck]
# Return average weight per feet on the bridge
return(full_weight/span)
df is a Pandas dataframe object having columns labeled as 'length' and 'gvw', which are truck lengths and weights, respectively. head is the distance between two consecutive trucks, span is the bridge length. The function randomly places trucks on the bridge as long as the total length of the truck train is less than the bridge length. Finally, calculates the average weight of the trucks existing on the bridge per foot (total weight existing on the bridge divided by the bridge length).
As a result I would like to build a tabular histogram showing the distribution of the returned values, which can be plotted later. I had some ideas in mind:
Keep collecting the returned values in a numpy vector, then use existing histogram functions once the MonteCarlo analysis is completed. This would not be feasable, since if my calculation is correct, I would need 7.5 GB of memory for that vector only (1,000,000,000 64 bit floats ~ 7.5 GB)
Initialize a numpy array with a given range and number of bins. Increase the number of items in the matching bin by one at the end of each run. The problem is, I do not know the range of values I would get. Setting up a histogram with a range and an appropriate bin size is an unknown. I also have to figure out how to assign values to the correct bins, but I think it is doable.
Do it somehow on the fly. Modify ranges and bin sizes each time the function returns a number. This would be too tricky to write from scratch I think.
Well, I bet there may be a better way to handle this problem. Any ideas would be welcome!
On a second note, I tested running the above function for 1,000,000,000 times only to get the largest value that is computed (the code snippet is below). And this takes around an hour when span = 200. The computation time would increase if I run it for longer spans (the while loop runs longer to fill the bridge with trucks). Is there a way to optimize this you think?
max_w = 0
i = 1
while i < 1000000000:
if max_w < MonteCarlo(df_basic, 15., 200.):
max_w = MonteCarlo(df_basic, 15., 200.)
i += 1
print max_w
Thanks!
Here is a possible solution, with fixed bin size, and bins of the form [k * size, (k + 1) * size[. The function finalizebins returns two lists: one with bin counts (a), and the other (b) with bin lower bounds (the upper bound is deduced by adding binsize).
import math, random
def updatebins(bins, binsize, x):
i = math.floor(x / binsize)
if i in bins:
bins[i] += 1
else:
bins[i] = 1
def finalizebins(bins, binsize):
imin = min(bins.keys())
imax = max(bins.keys())
a = [0] * (imax - imin + 1)
b = [binsize * k for k in range(imin, imax + 1)]
for i in range(imin, imax + 1):
if i in bins:
a[i - imin] = bins[i]
return a, b
# A test with a mixture of gaussian distributions
def check(n):
bins = {}
binsize = 5.0
for i in range(n):
if random.random() > 0.5:
x = random.gauss(100, 50)
else:
x = random.gauss(-200, 150)
updatebins(bins, binsize, x)
return finalizebins(bins, binsize)
a, b = check(10000)
# This must be 10000
sum(a)
# Plot the data
from matplotlib.pyplot import *
bar(b,a)
show()
Related
I am trying to generate a list of 12 random weights for a stock portfolio in order to determine how the portfolio would have performed in the past given different weights assigned to each stock. The sum of the weights must of course be 1 and there is an additional restriction: each stock must have a weight between 1/24 and 1/4.
Although I am able to generate random numbers such that they all fall within the interval by using random.uniform(), as well as guarantee their sum is 1 by dividing each weighting by the sum of the weightings, I'm finding that
a) each subsequent array of weightings is very similar. I am rarely getting values for weightings that are near the upper boundary of 1/4
b) random.seed() does not seem to be working properly, whether I put it in the randweight() function or at the beginning of the for loop. I'm confused as to why because I thought that generating a random seed value would make my array of weights unique for each iteration. Currently, it's cyclical, with a period of 3.
The following is my code:
# boundaries on weightings
n = 12
min_weight = (1/(2*n))
max_weight = 25 / 100
def rand_weight(e):
random.seed()
return e + np.random.uniform(min_weight, max_weight)
for i in range(100):
weights = np.empty(12)
while not (np.all(weights > min_weight) and np.all(weights < max_weight)):
weights = np.array(list(map(rand_weight, weights)))
weights /= np.sum(weights)
I have already tried scattering the weights by changing the min_weight and max_weight inside the for loop so that rand_weight generates newer values, but this makes the runtime really slow because the "not" condition in the while loop takes longer to evaluate to false (since the probability of all the numbers being in the range decreases).
Lets start with simple facts first. If you want numbers to be in the range [0.042...0.25] and 12 iid numbers in total summed to one, then for mean value
Sum(Xi)=1
E[Sum(Xi)]=Sum(E[Xi])=N E[Xi] = 1
E[Xi]=1/N = 1/12 = 0.083
One corollary is that it would be hard to get numbers close to upper range boundary.
And instead doing things like sampling whatever and then normalizing to get sum to 1, better to use known distribution where sum of values is 1 to begin with.
So lets use Dirichlet distribution, and sample points uniformly in simplex, which means alpha (concentration) vector is all ones.
import numpy as np
N = 12
s = np.random.dirichlet(N*[1.0], 1)
print(np.sum(s))
Some value would be larger (or smaller), and you could reject them
def sampleWeights(alpha, lo, hi):
while True:
s = np.random.dirichlet(alpha, 1)[0]
if np.any(s > hi):
continue # reject
if np.any(s < lo):
continue # reject
return s # accept
and call it like this
N=12
alpha = N*[1.0]
q = sampleWeights(alpha, 1./24., 1./4.)
if you could check it, a lot of rejections happens at low bound, rather then high bound.
BEauty of using known Dirichlet distribution is that you could "concentrate" sampled values around mean, e.g.
alpha = N*[10.0]
q = sampleWeights(alpha, 1./24., 1./4.)
will produce same iid with mean of 1/12 but a lot smaller std.deviation, RV a lot more concentrated around mean
And if you want non-identically distributed RVs, use different alphas
alpha = [1.,2.,3.,4.,5.,6.,6.,5.,4.,3.,2.,1.]
q = sampleWeights(alpha, 1./24., 1./4.)
then some of RVs would be close to upper boundary, and some close to lower boundary. Lots of advantages to use known distribution
The following works. Particularly confusing to me is that np.empty(12) seemed to always return the same array. So once it had been initialized, it stayed the same.
This seems to produce numbers above 0.22 reasonably often.
import numpy as np
from random import random, seed
# boundaries on weightings
n = 12
min_weight = (1/(2*n))
max_weight = 25 / 100
seed(666)
for i in range(100):
weights = np.zeros(n)
while not (np.all(weights > min_weight) and np.all(weights < max_weight)):
weights = np.array([random() for _ in range(n)])
weights /= np.sum(weights) - min_weight * n
weights += min_weight
print(weights)
Ideally, I want the following without reading the data from hard disk for too many times. The data is big and memory can't hold all of the data at the same time.
Input is a stream x[t] from hard disk. The stream of numbers contains N elements.
It is possible to have a histogram of x with m bins.
The n bins are defined by the bin edges e0 < e1, ..., < em. For example, if ei =< x[0] < ei+1, then x[0] belongs to the ith bin.
Find the bin edges that makes the bin holds an nearly equal number of elements from the stream. The number of elements in each bin should be ideally within some threshold percent from N/m. This is because if we evenly distribute N elements among m bins, each bins should hold about N/m elements.
Current solution:
import numpy as np
def test_data(size):
x = np.random.normal(0, 0.5, size // 2)
x = np.hstack([x, np.random.normal(4, 1, size // 2)])
return x
def bin_edge_as_index(n_bin, fine_hist, fine_n_bin, data_size):
cum_sum = np.cumsum(fine_hist)
bin_id = np.empty((n_bin + 1), dtype=int)
count_per_bin = data_size * 1.0 / n_bin
for i in range(1, n_bin):
bin_id[i] = np.argmax(cum_sum > count_per_bin * i)
bin_id[0] = 0
bin_id[n_bin] = fine_n_bin
return bin_id
def get_bin_count(bin_edge, data):
n_bin = bin_edge.shape[0] - 1
result = np.zeros((n_bin), dtype=int)
for i in range(n_bin):
cmp0 = (bin_edge[i] <= data)
cmp1 = (data < bin_edge[i + 1])
result[i] = np.sum(cmp0 & cmp1)
return result
# Test Setting
test_size = 10000
n_bin = 6
fine_n_bin = 2000 # use a big number and hope it works
# Test Data
x = test_data(test_size)
# Fine Histogram
fine_hist, fine_bin_edge = np.histogram(x, fine_n_bin)
# Index of the bins of the fine histogram that contains
# the required bin edges (e_1, e_2, ... e_n)
bin_id = bin_edge_as_index(
n_bin, fine_hist, fine_n_bin, test_size)
# Find the bin edges
bin_edge = fine_bin_edge[bin_id]
print("bin_edges:")
print(bin_edge)
# Check
bin_count = get_bin_count(bin_edge, x)
print("bin_counts:")
print(bin_count)
print("ideal count per bin:")
print(test_size * 1.0 / n_bin)
Output of program:
bin_edges:
[-1.86507282 -0.22751473 0.2085489 1.30798591 3.57180559 4.40218207
7.41287669]
bin_counts:
[1656 1675 1668 1663 1660 1677]
ideal count per bin:
1666.6666666666667
Problem:
I can't specify a threshold s, and expect the bin counts are at most s% different from the ideal counts per bin.
Assuming that the distribution is not outrageously skewed (like 10000 values between 1.0000001 and 1.0000002 and 10000 others between 9.0000001 and 9.0000002), you can proceed as below.
Compute a histogram with a sufficient resolution, say K bins, which covers the whole range (hopefully known beforehand). This will take a single pass over the data.
Then compute the cumulated histogram and as you go, identify the m+1 quantile edges (where the cumulated counts cross multiples of N/m).
The accuracy that you will get is dictated by the maximum number of elements in a bin of the original histogram.
For N elements, using an histogram of K bins and assuming some "nonuniformity factor" (equal to a few units for reasonable distributions), the maximum error will be f.N/K.
You can improve accuracy if you like by considering m+1 auxiliary histograms which only accumulate the values that fall in the quantile bins of the global histogram. Then you can refine the quantiles to the resolution of these auxiliary histograms.
This will cost you an extra pass, but the error will be reduced to f.N/(K.K'), using K then m.K' histogram space only, instead of K.K'.
Iff you can assume your data is random with a defined distribution (that is: taking any non-trivial percentage of your data in sequence is going to "sketch" the same distribution as the entire data, only with a coarser precision), I imagine there are a number of options:
read a part of your data in some oversampled histogram. Based on this, choose an approximation for the bin edges the way you do now (as explained in your question), then uniformly oversample these bins, then read another chunk of your data into the new bins and so on. If you have enough data, processing them in chunks 0f 10% would allow for 10 iterations to improve your bins structure in a single pass.
start with a number of bins and accumulate some (not all) data. Look over them and if one has bin_width*count disproportionately higher then the neighbours (maybe this is where the precision/error may come into play), divide that bin in two and heuristically assign the old bin count into the newly created bins (one possible heuristic - proportional with the count of the neighbours). At the end, you should have a division somehow controlled by the acceptable error from which to sort-of interpolate your distribution.
Of course, the above are only ideas of approaches, can't offer any warranty about how well they'll work.
I am wanting to generate eight random numbers within a range (0 to pi/8), add them together, take the sine of this sum, and after doing this N times, take the mean result. After scaling this up I get the correct answer, but it is too slow for N > 10^6, especially when I am averaging over N trials n_t = 25 more times! I am currently getting this code to run in around 12 seconds for N = 10^5, meaning that it will take 20 minutes for N = 10^7, which doesn't seem optimal (it may be, I don't know!).
My code is as follows:
import random
import datetime
from numpy import pi
from numpy import sin
import numpy
t1 = datetime.datetime.now()
def trial(N):
total = []
uniform = numpy.random.uniform
append = total.append
for j in range(N):
sum = 0
for i in range (8):
sum+= uniform(0, pi/8)
append(sin(sum))
return total
N = 1000000
n_t = 25
total_squared = 0
ans = []
for k in range (n_t):
total = trial(N)
f_mean = (numpy.sum(total))/N
ans.append(f_mean*((pi/8)**8)*1000000)
sum_square = 0
for e in ans:
sum_square += e**2
sum = numpy.sum(ans)
mean = sum/n_t
variance = sum_square/n_t - mean**2
s_d = variance**0.5
print (mean, " ± ", s_d)
t2 = datetime.datetime.now()
print ("Execution time: %s" % (t2-t1))
If anyone can help me optimise this it would be much appreciated!
Thank you :)
Given your requirement of obtaining the result with this method, np.sin(np.random.uniform(0,np.pi/8,size=(8,10**6,25)).sum(axis=0)).mean(axis=0) gets you your 25 trials pretty quickly... This is fully vectorised (and concise which is always a bonus!) so I doubt you could do any better...
Explanation:
You generate a big random 3d array of size (8 x 10**6 x 25). .sum(axis=0) will get you the sum over the first dimension (8). np.sin(...) applies elementwise. .mean(axis=0) will get you the mean over the first remaining dimension (10**6) and leave you with a 1d array of length (25) corresponding to your trials.
By the Central Limit Theorem, your random variable will closely follow a normal law.
The sum of the eight uniform variables has a bell-shaped distribution over the range [0, π]. If I am right, the distribution can be represented as a B-spline of order 8. Taking the sine gives a value in range [0, 1]. You can find the expectation µ and variance σ² by simple numerical integration.
Then use a normal generator with mean µ and variance σ²/N. That will be instantaneous in comparison.
I was hoping for a bit of help to make my code run faster.
Basically I have a square grid of lat,long points in a list insideoceanlist. Then there is a directory containing data files of lat, long coords which represent lightning strikes for a particular day. The idea is for each day, we want to know how many lightning strikes there were around each point on the square grid. At the moment it is just two for loops, so for every point on the square grid, you check how far away every lightning strike was for that day. If it was within 40km I add one to that point to make a density map.
The starting grid has the overall shape of a rectangle, made up of squares with width of 0.11 and length 0.11. The entire rectange is about 50x30. Lastly I have a shapefile which outlines the 'forecast zones' in Australia, and if any point in the grid is outside this zone then we omit it. So all the leftover points (insideoceanlist) are the ones in Australia.
There are around 100000 points on the square grid and even for a slow day there are around 1000 lightning strikes, so it takes a long time to process. Is there a way to do this more efficiently? I really appreciate any advice.
By the way I changed list2 into list3 because I heard that iterating over lists is faster than arrays in python.
for i in range(len(list1)): #list1 is a list of data files containing lat,long coords for lightning strikes for each day
dict_density = {}
for k in insideoceanlist: #insideoceanlist is a grid of ~100000 lat,long points
dict_density[k] = 0
list2 = np.loadtxt(list1[i],delimiter = ",") #this open one of the files containing lat,long coords and puts it into an array
list3 = map(list,list2) #converts the array into a list
# the following part is what I wanted to improve
for j in insideoceanlist:
for l in list3:
if great_circle(l,j).meters < 40000: #great_circle is a function which measures distance between points the two lat,long points
dict_density[j] += 1
#
filename = 'example' +str(i) + '.txt'
with open(filename, 'w') as f:
for m in range(len(insideoceanlist)):
f.write('%s\n' % (dict_density[insideoceanlist[m]])) #writes each point in the same order as the insideoceanlist
f.close()
To elaborate a bit on #DanGetz's answer, here is some code that uses the strike data as the driver, rather than iterating the entire grid for each strike point. I'm assuming you're centered on Australia's median point, with 0.11 degree grid squares, even though the size of a degree varies by latitude!
Some back-of-the-envelope computation with a quick reference to Wikipedia tells me that your 40km distance is a ±4 grid-square range from north to south, and a ±5 grid-square range from east to west. (It drops to 4 squares in the lower latitudes, but ... meh!)
The tricks here, as mentioned, are to convert from strike position (lat/lon) to grid square in a direct, formulaic manner. Figure out the position of one corner of the grid, subtract that position from the strike, then divide by the size of the grid - 0.11 degrees, truncate, and you have your row/col indexes. Now visit all the surrounding squares until the distance grows too great, which is at most 1 + (2 * 2 * 4 * 5) = 81 squares checking for distance. Increment the squares within range.
The result is that I'm doing at most 81 visits times 1000 strikes (or however many you have) as opposed to visiting 100,000 grid squares times 1000 strikes. This is a significant performance gain.
Note that you don't describe your incoming data format, so I just randomly generated numbers. You'll want to fix that. ;-)
#!python3
"""
Per WikiPedia (https://en.wikipedia.org/wiki/Centre_points_of_Australia)
Median point
============
The median point was calculated as the midpoint between the extremes of
latitude and longitude of the continent.
24 degrees 15 minutes south latitude, 133 degrees 25 minutes east
longitude (24°15′S 133°25′E); position on SG53-01 Henbury 1:250 000
and 5549 James 1:100 000 scale maps.
"""
MEDIAN_LAT = -(24.00 + 15.00/60.00)
MEDIAN_LON = (133 + 25.00/60.00)
"""
From the OP:
The starting grid has the overall shape of a rectangle, made up of
squares with width of 0.11 and length 0.11. The entire rectange is about
50x30. Lastly I have a shapefile which outlines the 'forecast zones' in
Australia, and if any point in the grid is outside this zone then we
omit it. So all the leftover points (insideoceanlist) are the ones in
Australia.
"""
DELTA_LAT = 0.11
DELTA_LON = 0.11
GRID_WIDTH = 50.0 # degrees
GRID_HEIGHT = 30.0 # degrees
GRID_ROWS = int(GRID_HEIGHT / DELTA_LAT) + 1
GRID_COLS = int(GRID_WIDTH / DELTA_LON) + 1
LAT_SIGN = 1.0 if MEDIAN_LAT >= 0 else -1.0
LON_SIGN = 1.0 if MEDIAN_LON >= 0 else -1.0
GRID_LOW_LAT = MEDIAN_LAT - (LAT_SIGN * GRID_HEIGHT / 2.0)
GRID_HIGH_LAT = MEDIAN_LAT + (LAT_SIGN * GRID_HEIGHT / 2.0)
GRID_MIN_LAT = min(GRID_LOW_LAT, GRID_HIGH_LAT)
GRID_MAX_LAT = max(GRID_LOW_LAT, GRID_HIGH_LAT)
GRID_LOW_LON = MEDIAN_LON - (LON_SIGN * GRID_WIDTH / 2.0)
GRID_HIGH_LON = MEDIAN_LON + (LON_SIGN * GRID_WIDTH / 2.0)
GRID_MIN_LON = min(GRID_LOW_LON, GRID_HIGH_LON)
GRID_MAX_LON = max(GRID_LOW_LON, GRID_HIGH_LON)
GRID_PROXIMITY_KM = 40.0
"""https://en.wikipedia.org/wiki/Longitude#Length_of_a_degree_of_longitude"""
_Degree_sizes_km = (
(0, 110.574, 111.320),
(15, 110.649, 107.551),
(30, 110.852, 96.486),
(45, 111.132, 78.847),
(60, 111.412, 55.800),
(75, 111.618, 28.902),
(90, 111.694, 0.000),
)
# For the Australia situation, +/- 15 degrees means that our worst
# case scenario is about 40 degrees south. At that point, a single
# degree of longitude is smallest, with a size about 80 km. That
# in turn means a 40 km distance window will span half a degree or so.
# Since grid squares a 0.11 degree across, we have to check +/- 5
# cols.
GRID_SEARCH_COLS = 5
# Latitude degrees are nice and constant-like at about 110km. That means
# a .11 degree grid square is 12km or so, making our search range +/- 4
# rows.
GRID_SEARCH_ROWS = 4
def make_grid(rows, cols):
return [[0 for col in range(cols)] for row in range(rows)]
Grid = make_grid(GRID_ROWS, GRID_COLS)
def _col_to_lon(col):
return GRID_LOW_LON + (LON_SIGN * DELTA_LON * col)
Col_to_lon = [_col_to_lon(c) for c in range(GRID_COLS)]
def _row_to_lat(row):
return GRID_LOW_LAT + (LAT_SIGN * DELTA_LAT * row)
Row_to_lat = [_row_to_lat(r) for r in range(GRID_ROWS)]
def pos_to_grid(pos):
lat, lon = pos
if lat < GRID_MIN_LAT or lat >= GRID_MAX_LAT:
print("Lat limits:", GRID_MIN_LAT, GRID_MAX_LAT)
print("Position {} is outside grid.".format(pos))
return None
if lon < GRID_MIN_LON or lon >= GRID_MAX_LON:
print("Lon limits:", GRID_MIN_LON, GRID_MAX_LON)
print("Position {} is outside grid.".format(pos))
return None
row = int((lat - GRID_LOW_LAT) / DELTA_LAT)
col = int((lon - GRID_LOW_LON) / DELTA_LON)
return (row, col)
def visit_nearby_grid_points(pos, dist_km):
row, col = pos_to_grid(pos)
# +0, +0 is not symmetric - don't increment twice
Grid[row][col] += 1
for dr in range(1, GRID_SEARCH_ROWS):
for dc in range(1, GRID_SEARCH_COLS):
misses = 0
gridpos = Row_to_lat[row+dr], Col_to_lon[col+dc]
if great_circle(pos, gridpos).meters <= dist_km:
Grid[row+dr][col+dc] += 1
else:
misses += 1
gridpos = Row_to_lat[row+dr], Col_to_lon[col-dc]
if great_circle(pos, gridpos).meters <= dist_km:
Grid[row+dr][col-dc] += 1
else:
misses += 1
gridpos = Row_to_lat[row-dr], Col_to_lon[col+dc]
if great_circle(pos, gridpos).meters <= dist_km:
Grid[row-dr][col+dc] += 1
else:
misses += 1
gridpos = Row_to_lat[row-dr], Col_to_lon[col-dc]
if great_circle(pos, gridpos).meters <= dist_km:
Grid[row-dr][col-dc] += 1
else:
misses += 1
if misses == 4:
break
def get_pos_from_line(line):
"""
FIXME: Don't know the format of your data, just random numbers
"""
import random
return (random.uniform(GRID_LOW_LAT, GRID_HIGH_LAT),
random.uniform(GRID_LOW_LON, GRID_HIGH_LON))
with open("strikes.data", "r") as strikes:
for line in strikes:
pos = get_pos_from_line(line)
visit_nearby_grid_points(pos, GRID_PROXIMITY_KM)
If you know the formula that generates the points on your grid, you can probably find the closest grid point to a given point quickly by reversing that formula.
Below is a motivating example, that isn't quite right for your purposes because the Earth is a sphere, not flat or cylindrical. If you can't easily reverse the grid point formula to find the closest grid point, then maybe you can do the following:
create a second grid (let's call it G2) that is a simple formula like below, with big enough boxes such that you can be confident that the closest grid point to any point in one box will either be in the same box, or in one of the 8 neighboring boxes.
create a dict which stores which original grid (G1) points are in which box of the G2 grid
take the point p you're trying to classify, and find the G2 box it would go into
compare p to all the G1 points in this G2 box, and all the immediate neighbors of that box
choose the G1 point of these that's closest to p
Motivating example with a perfect flat grid
If you had a perfect square grid on a flat surface, that isn't rotated, with sides of length d, then their points can be defined by a simple mathematical formula. Their latitude values will all be of the form
lat0 + d * i
for some integer value i, where lat0 is the lowest-numbered latitude, and their longitude values will be of the same form:
long0 + d * j
for some integer j. To find what the closest grid point is for a given (lat, long) pair, you can separately find its latitude and longitude. The closest latitude number on your grid will be where
i = round((lat - lat0) / d)
and likewise j = round((long - long0) / d) for the longitude.
So one way you can go forward is to plug that in to the formulas above, and get
grid_point = (lat0 + d * round((lat - lat0) / d),
long0 + d * round((long - long0) / d)
and just increment the count in your dict at that grid point. This should make your code much, much faster than before, because instead of checking thousands of grid points for distance, you directly found the grid point with a couple calculations.
You can probably make this a little faster by using the i and j numbers as indexes into a multidimensional array, instead of using grid_point in a dict.
Have you tried using Numpy for the indexing? You can use multi-dimensional arrays, and the indexing should be faster because Numpy arrays are essentially a Python wrapper around C arrays.
If you need further speed increases, take a look at Cython, a Python to optimized C converter. It is especially good for multi-dimensional indexing, and should be able to speed this type of code by about an order of magnitude. It'll add a single additional dependency to your code, but it's a quick install, and not too difficult to implement.
(Benchmarks), (Tutorial using Numpy with Cython)
Also as a quick aside, use
for listI in list1:
...
list2 = np.loadtxt(listI, delimiter=',')
# or if that doesn't work, at least use xrange() rather than range()
essentially you should only ever use range() when you explicity need the list generated by the range() function. In your case, it shouldn't do much because it is the outer-most loop.
I have a range image of a scene. I traverse the image and calculate the average change in depth under the detection window. The detection windows changes size based on the average depth of the surrounding pixels of the current location. I accumulate the average change to produce a simple response image.
Most of the time is spent in the for loop, it is taking about 40+s for a 512x52 image on my machine. I was hoping for some speed up. Is there a more efficient/faster way to traverse the image? Is there a better pythonic/numpy/scipy way to visit each pixel? Or shall I go learn cython?
EDIT: I have reduced running time to about 18s by using scipy.misc.imread() instead of skimage.io.imread(). Not sure what the difference is, I will try to investigate.
Here is a simplified version of the code:
import matplotlib.pylab as plt
import numpy as np
from skimage.io import imread
from skimage.transform import integral_image, integrate
import time
def intersect(a, b):
'''Determine the intersection of two rectangles'''
rect = (0,0,0,0)
r0 = max(a[0],b[0])
c0 = max(a[1],b[1])
r1 = min(a[2],b[2])
c1 = min(a[3],b[3])
# Do we have a valid intersection?
if r1 > r0 and c1 > c0:
rect = (r0,c0,r1,c1)
return rect
# Setup data
depth_src = imread("test.jpg", as_grey=True)
depth_intg = integral_image(depth_src) # integrate to find sum depth in region
depth_pts = integral_image(depth_src > 0) # integrate to find num points which have depth
boundary = (0,0,depth_src.shape[0]-1,depth_src.shape[1]-1) # rectangle to intersect with
# Image to accumulate response
out_img = np.zeros(depth_src.shape)
# Average dimensions of bbox/detection window per unit length of depth
model = (0.602,2.044) # width, height
start_time = time.time()
for (r,c), junk in np.ndenumerate(depth_src):
# Find points around current pixel
r0, c0, r1, c1 = intersect((r-1, c-1, r+1, c+1), boundary)
# Calculate average of depth of points around current pixel
scale = integrate(depth_intg, r0, c0, r1, c1) * 255 / 9.0
# Based on average depth, create the detection window
r0 = r - (model[0] * scale/2)
c0 = c - (model[1] * scale/2)
r1 = r + (model[0] * scale/2)
c1 = c + (model[1] * scale/2)
# Used scale optimised detection window to extract features
r0, c0, r1, c1 = intersect((r0,c0,r1,c1), boundary)
depth_count = integrate(depth_pts,r0,c0,r1,c1)
if depth_count:
depth_sum = integrate(depth_intg,r0,c0,r1,c1)
avg_change = depth_sum / depth_count
# Accumulate response
out_img[r0:r1,c0:c1] += avg_change
print time.time() - start_time, " seconds"
plt.imshow(out_img)
plt.gray()
plt.show()
Michael, interesting question. It seems that the main performance problem you have is that each pixel in the image has two integrate() functions computed on it, one of size 3x3 and the other of a size which is not known in advance. Calculating individual integrals in this way is extremely inefficient, regardless of what numpy functions you use; it's an algorithmic issue, not an implementation issue. Consider an image of size NN. You can calculate all integrals of any size KK in that image using only approximately 4*NN operations, not (as one might naively expect) NNKK. The way you do that is first calculate an image of sliding sums over a window K in each row, and then sliding sums over the result in each column. Updating each sliding sum to move to the next pixel requires only adding the newest pixel in the current window and subtracting the oldest pixel in the previous window, thus two operations per pixel regardless of window size. We do have to do that twice (for rows and columns), therefore 4 operations per pixel.
I am not sure if there is a sliding window sum built into numpy, but this answer suggests a couple of ways to do it, using stride tricks: https://stackoverflow.com/a/12713297/1828289. You can certainly accomplish the same with one loop over columns and one loop over rows (taking slices to extract a row/column).
Example:
# img is a 2D ndarray
# K is the size of sums to calculate using sliding window
row_sums = numpy.zeros_like(img)
for i in range( img.shape[0] ):
if i > K:
row_sums[i,:] = row_sums[i-1,:] - img[i-K-1,:] + img[i,:]
elif i > 1:
row_sums[i,:] = row_sums[i-1,:] + img[i,:]
else: # i == 0
row_sums[i,:] = img[i,:]
col_sums = numpy.zeros_like(img)
for j in range( img.shape[1] ):
if j > K:
col_sums[:,j] = col_sums[:,j-1] - row_sums[:,j-K-1] + row_sums[:,j]
elif j > 1:
col_sums[:,j] = col_sums[:,j-1] + row_sums[:,j]
else: # j == 0
col_sums[:,j] = row_sums[:,j]
# here col_sums[i,j] should be equal to numpy.sum(img[i-K:i, j-K:j]) if i >=K and j >= K
# first K rows and columns in col_sums contain partial sums and can be ignored
How do you best apply that to your case? I think you might want to pre-compute the integrals for 3x3 (average depth) and also for several larger sizes, and use the value of the 3x3 to select one of the larger sizes for the detection window (assuming I understand the intent of your algorithm). The range of larger sizes you need might be limited, or artificially limiting it might still work acceptably well, just pick the nearest size. Calculating all integrals together using sliding sums is so much more efficient that I am almost certain it is worth calculating them for a lot of sizes you would never use at a particular pixel, especially if some of the sizes are large.
P.S. This is a minor addition, but you may want to avoid calling intersect() for every pixel: either (a) only process pixels which are farther from the edge than the max integral size, or (b) add margins to the image of the max integral size on all sides, filling the margins with either zeros or nans, or (c) (best approach) use slices to take care of this automatically: a slice index outside the boundary of an ndarray is automatically limited to the boundary, except of course negative indexes are wrapped around.
EDIT: added example of sliding window sums