I'm trying to work out how to speed up a Python function which uses numpy. The output I have received from line_profiler is below, and it shows that the vast majority of the time is spent on the line ind_y, ind_x = np.where(seg_image == i).
seg_image is an integer array which is the result of segmenting an image, thus finding the pixels where seg_image == i extracts a specific segmented object. I am looping through lots of these objects (in the code below I'm just looping through 5 for testing, but I'll actually be looping through over 20,000), and it takes a long time to run!
Is there any way the np.where call can be sped up? Or, alternatively, can the penultimate line (which also takes a good proportion of the time) be sped up?
The ideal solution would be to run the code on the whole array at once, rather than looping, but I don't think this is possible as there are side-effects to some of the functions I need to run (for example, dilating a segmented object can make it 'collide' with the next region and thus give incorrect results later on).
Does anyone have any ideas?
Line # Hits Time Per Hit % Time Line Contents
==============================================================
5 def correct_hot(hot_image, seg_image):
6 1 239810 239810.0 2.3 new_hot = hot_image.copy()
7 1 572966 572966.0 5.5 sign = np.zeros_like(hot_image) + 1
8 1 67565 67565.0 0.6 sign[:,:] = 1
9 1 1257867 1257867.0 12.1 sign[hot_image > 0] = -1
10
11 1 150 150.0 0.0 s_elem = np.ones((3, 3))
12
13 #for i in xrange(1,seg_image.max()+1):
14 6 57 9.5 0.0 for i in range(1,6):
15 5 6092775 1218555.0 58.5 ind_y, ind_x = np.where(seg_image == i)
16
17 # Get the average HOT value of the object (really simple!)
18 5 2408 481.6 0.0 obj_avg = hot_image[ind_y, ind_x].mean()
19
20 5 333 66.6 0.0 miny = np.min(ind_y)
21
22 5 162 32.4 0.0 minx = np.min(ind_x)
23
24
25 5 369 73.8 0.0 new_ind_x = ind_x - minx + 3
26 5 113 22.6 0.0 new_ind_y = ind_y - miny + 3
27
28 5 211 42.2 0.0 maxy = np.max(new_ind_y)
29 5 143 28.6 0.0 maxx = np.max(new_ind_x)
30
31 # 7 is + 1 to deal with the zero-based indexing, + 2 * 3 to deal with the 3 cell padding above
32 5 217 43.4 0.0 obj = np.zeros( (maxy+7, maxx+7) )
33
34 5 158 31.6 0.0 obj[new_ind_y, new_ind_x] = 1
35
36 5 2482 496.4 0.0 dilated = ndimage.binary_dilation(obj, s_elem)
37 5 1370 274.0 0.0 border = mahotas.borders(dilated)
38
39 5 122 24.4 0.0 border = np.logical_and(border, dilated)
40
41 5 355 71.0 0.0 border_ind_y, border_ind_x = np.where(border == 1)
42 5 136 27.2 0.0 border_ind_y = border_ind_y + miny - 3
43 5 123 24.6 0.0 border_ind_x = border_ind_x + minx - 3
44
45 5 645 129.0 0.0 border_avg = hot_image[border_ind_y, border_ind_x].mean()
46
47 5 2167729 433545.8 20.8 new_hot[seg_image == i] = (new_hot[ind_y, ind_x] + (sign[ind_y, ind_x] * np.abs(obj_avg - border_avg)))
48 5 10179 2035.8 0.1 print obj_avg, border_avg
49
50 1 4 4.0 0.0 return new_hot
EDIT I have left my original answer at the bottom for the record, but I have actually looked into your code in more detail over lunch, and I think that using np.where is a big mistake:
In [63]: a = np.random.randint(100, size=(1000, 1000))
In [64]: %timeit a == 42
1000 loops, best of 3: 950 us per loop
In [65]: %timeit np.where(a == 42)
100 loops, best of 3: 7.55 ms per loop
You could get a boolean array (that you can use for indexing) in 1/8 of the time you need to get the actual coordinates of the points!!!
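For example, the per-object average in your loop only needs the boolean mask, not the coordinate arrays. A minimal sketch, reusing the names from your code:

mask = seg_image == i                # boolean mask; roughly 8x cheaper than np.where here
obj_avg = hot_image[mask].mean()     # same value as hot_image[ind_y, ind_x].mean()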
There is of course the cropping of the features that you do, but ndimage has a find_objects function that returns enclosing slices, and appears to be very fast:
In [66]: %timeit ndimage.find_objects(a)
100 loops, best of 3: 11.5 ms per loop
This returns a list of tuples of slices enclosing all of your objects, in 50% more time than it takes to find the indices of one single object.
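As a small illustration (not from the original post), find_objects returns one (row slice, column slice) pair per label, i.e. the bounding box of each object:

import numpy as np
from scipy import ndimage

seg = np.array([[1, 1, 0],
                [0, 0, 2],
                [0, 2, 2]])
print(ndimage.find_objects(seg))
# [(slice(0, 1, None), slice(0, 2, None)), (slice(1, 3, None), slice(1, 3, None))]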
It may not work out of the box as I cannot test it right now, but I would restructure your code into something like the following:
def correct_hot_bis(hot_image, seg_image):
    # Need this to not index out of bounds when computing border_avg
    hot_image_padded = np.pad(hot_image, 3, mode='constant',
                              constant_values=0)
    new_hot = hot_image.copy()
    sign = np.ones_like(hot_image, dtype=np.int8)
    sign[hot_image > 0] = -1
    s_elem = np.ones((3, 3))
    for j, slice_ in enumerate(ndimage.find_objects(seg_image)):
        hot_image_view = hot_image[slice_]
        seg_image_view = seg_image[slice_]
        new_shape = tuple(dim+6 for dim in hot_image_view.shape)
        new_slice = tuple(slice(dim.start,
                                dim.stop+6,
                                None) for dim in slice_)
        indices = seg_image_view == j+1
        obj_avg = hot_image_view[indices].mean()
        obj = np.zeros(new_shape)
        obj[3:-3, 3:-3][indices] = True
        dilated = ndimage.binary_dilation(obj, s_elem)
        border = mahotas.borders(dilated)
        border &= dilated
        border_avg = hot_image_padded[new_slice][border == 1].mean()
        new_hot[slice_][indices] += (sign[slice_][indices] *
                                     np.abs(obj_avg - border_avg))
    return new_hot
You would still need to figure out the collisions, but you could get about a 2x speed-up by computing all the indices simultaneously using a np.unique based approach:
a = np.random.randint(100, size=(1000, 1000))
def get_pos(arr):
    pos = []
    for j in xrange(100):
        pos.append(np.where(arr == j))
    return pos

def get_pos_bis(arr):
    unq, flat_idx = np.unique(arr, return_inverse=True)
    pos = np.argsort(flat_idx)
    counts = np.bincount(flat_idx)
    cum_counts = np.cumsum(counts)
    multi_dim_idx = np.unravel_index(pos, arr.shape)
    return zip(*(np.split(coords, cum_counts) for coords in multi_dim_idx))
In [33]: %timeit get_pos(a)
1 loops, best of 3: 766 ms per loop
In [34]: %timeit get_pos_bis(a)
1 loops, best of 3: 388 ms per loop
Note that the pixels for each object are returned in a different order, so you can't simply compare the return values of the two functions to check for equality, but both should return the same sets of coordinates.
One thing you could do to save a little bit of time is to cache the result of seg_image == i so that you don't need to compute it twice. You're computing it on lines 15 & 47; you could add seg_mask = seg_image == i and then reuse that result (it might also be good to separate out that piece for profiling purposes).
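A minimal sketch of that change, using the names from your code:

seg_mask = seg_image == i                  # computed once per object
ind_y, ind_x = np.where(seg_mask)          # line 15 of the profile
# ... rest of the loop body unchanged ...
new_hot[seg_mask] = (new_hot[ind_y, ind_x] +
                     (sign[ind_y, ind_x] * np.abs(obj_avg - border_avg)))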
While there are some other minor things you could do to eke out a little bit of performance, the root issue is that you're using an O(M * N) algorithm, where M is the number of segments and N is the size of your image. It's not obvious to me from your code whether there is a faster algorithm to accomplish the same thing, but that's the first place I'd look for a speedup.
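For what it's worth, some of the per-object work can be done for every segment in a single pass; for example, the object averages can be computed with np.bincount (a sketch, not part of the original answers):

labels = seg_image.ravel()
sums = np.bincount(labels, weights=hot_image.ravel())
counts = np.bincount(labels)
obj_avgs = sums / counts   # obj_avgs[i] == hot_image[seg_image == i].mean()
                           # (label values that never occur come out as nan)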
Related
Recently I was trying out this problem and my code got 60% of the marks, with the remaining cases returning TLEs.
Bazza and Shazza do not like bugs. They wish to clear out all the bugs
on their garden fence. They come up with a brilliant idea: they buy
some sugar frogs and release them near the fence, letting them eat up
all the bugs.
The plan is a great success and the bug infestation is gone. But
strangely, they now have a sugar frog infestation. Instead of getting
rid of the frogs, Bazza and Shazza decide to set up an obstacle course
and watch the frogs jump along it for their enjoyment.
The fence is a series of \$N\$ fence posts of varying heights. Bazza and
Shazza will select three fence posts to create the obstacle course,
where the middle post is strictly higher than the other two. The frogs
are to jump up from the left post to the middle post, then jump down
from the middle post to the right post. The three posts do not have to
be next to each other as frogs can jump over other fence posts,
regardless of the height of those other posts.
The difficulty of an obstacle course is the height of the first jump
plus the height of the second jump. The height of a jump is equal to
the difference in height between its two fence posts. Your task is to
help Bazza and Shazza find the most difficult obstacle course for the
frogs to jump.
Input
Your program should read from the file. The file will describe
a single fence.
The first line of input will contain one integer \$N\$: the number of
fence posts. The next \$N\$ lines will each contain one integer \$h_i\$: the
height of the \$i\$th fence post. You are guaranteed that there will be at
least one valid obstacle course: that is, there will be at least one
combination of three fence posts where the middle post is strictly
higher than the other two.
Output
Your program should write to the file. Your output file should
contain one line with one integer: the greatest difficulty of any
possible obstacle course.
Constraints
To evaluate your solution, the judges will run your
program against several different input files. All of these files will
adhere to the following bounds:
\$3 \leq N \leq 100,000\$ (the number of fence posts)
\$1 \leq h_i \leq 100,000\$ (the height of each post)
As some of the test cases will be quite large,
you may need to think about how well your solution scales for larger
input values. However, not all the cases will be large. In particular:
For 30% of the marks, \$N \leq 300\$. For an additional 30% of the
marks, \$N \leq 3,000\$. For the remaining 40% of the marks, no special constraints apply.
Hence, I was wondering if anyone could think of a way to optimize my code (below), or perhaps provide a more elegant, efficient algorithm than the one I am currently using.
Here is my code:
infile = open('frogin.txt', 'r')
outfile = open('frogout.txt', 'w')
N = int(infile.readline())
l = []
for i in range(N):
    l.append(int(infile.readline()))
m = 0
# find maximum z-x+z-y such that the middle number z is the largest of x, y, z
for j in range(1, N - 1):
    x = min(l[0: j])
    y = min(l[j + 1:])
    z = l[j]
    if x < z and y < z:
        n = z - x + z - y
        m = n if n > m else m
outfile.write(str(m))
infile.close()
outfile.close()
exit()
If you require additional information regarding my solution or the problem, please do comment below.
Ok, first let's evaluate your program. I created a test file like
from random import randint
n = 100000
max_ = 100000
with open("frogin.txt", "w") as outf:
outf.write(str(n) + "\n")
outf.write("\n".join(str(randint(1, max_)) for _ in range(n)))
then ran your code in IPython like
%load_ext line_profiler
def test():
    infile = open('frogin.txt', 'r')
    outfile = open('frogout.txt', 'w')
    N = int(infile.readline())
    l = []
    for i in range(N):
        l.append(int(infile.readline()))
    m = 0
    for j in range(1, N - 1):
        pre_l = l[0: j]     # I split these lines
        x = min(pre_l)      # for a bit more detail
        post_l = l[j + 1:]  # on exactly which operations
        y = min(post_l)     # are taking the most time
        z = l[j]
        if x < z and y < z:
            n = z - x + z - y
            m = n if n > m else m
    outfile.write(str(m))
    infile.close()
    outfile.close()
%lprun -f test test() # instrument the `test` function, then run `test()`
which gave
Total time: 197.565 s
File: <ipython-input-37-afa35ce6607a>
Function: test at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def test():
2 1 479 479.0 0.0 infile = open('frogin.txt', 'r')
3 1 984 984.0 0.0 outfile = open('frogout.txt', 'w')
4 1 195 195.0 0.0 N = int(infile.readline())
5 1 2 2.0 0.0 l = []
6 100001 117005 1.2 0.0 for i in range(N):
7 100000 269917 2.7 0.0 l.append(int(infile.readline()))
8 1 2 2.0 0.0 m = 0
9 99999 226984 2.3 0.0 for j in range(1, N - 1):
10 99998 94137525 941.4 12.2 pre_l = l[0: j]
11 99998 300309109 3003.2 38.8 x = min(pre_l)
12 99998 85915575 859.2 11.1 post_l = l[j + 1:]
13 99998 291183808 2911.9 37.7 y = min(post_l)
14 99998 441185 4.4 0.1 z = l[j]
15 99998 212870 2.1 0.0 if x < z and y < z:
16 99978 284920 2.8 0.0 n = z - x + z - y
17 99978 181296 1.8 0.0 m = n if n > m else m
18 1 114 114.0 0.0 outfile.write(str(m))
19 1 170 170.0 0.0 infile.close()
20 1 511 511.0 0.0 outfile.close()
which shows that 23.3% of your time (46 s) is spent repeatedly slicing your array, and 76.5% (151 s) is spent running min() on the slices 200k times.
So - how can we speed this up? Consider
a = min(l[0:50001]) # 50000 comparisons
b = min(l[0:50002]) # 50001 comparisons
c = min(a, l[50001]) # 1 comparison
Here's the magic: b and c are exactly equivalent, but b takes something like 10k times longer to run. You have to have computed a first - but you can repeat the same trick, shifted back by one, to get a cheaply, and the same for a's predecessor, and so on.
In one pass from start to end you can keep a running tally of 'minimum value seen previous to this index'. You can then do the same thing from end to start, keeping a running tally of 'minimum value seen after this index'. You can then zip all three arrays together and find the maximum achievable values.
I wrote a quick version,
def test():
    ERROR_VAL = 1000000  # too big to be part of any valid solution
    # read input file
    with open("frogin.txt") as inf:
        nums = [int(i) for i in inf.read().split()]
    # check contents
    n = nums.pop(0)
    if len(nums) < n:
        raise ValueError("Input file is too short!")
    elif len(nums) > n:
        raise ValueError("Input file is too long!")
    # min_pre[i] == min(nums[:i])
    min_pre = [0] * n
    min_pre[0] = ERROR_VAL
    for i in range(1, n):
        min_pre[i] = min(nums[i - 1], min_pre[i - 1])
    # min_post[i] == min(nums[i+1:])
    min_post = [0] * n
    min_post[n - 1] = ERROR_VAL
    for i in range(n - 2, -1, -1):
        min_post[i] = min(nums[i + 1], min_post[i + 1])
    return max((nums[i] - min_pre[i]) + (nums[i] - min_post[i]) for i in range(1, n - 1) if min_pre[i] < nums[i] > min_post[i])
and profiled it,
Total time: 0.300842 s
File: <ipython-input-99-2097216e4420>
Function: test at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def test():
2 1 5 5.0 0.0 ERROR_VAL = 1000000 # too big to be part of any valid solution
3 # read input file
4 1 503 503.0 0.0 with open("frogin.txt") as inf:
5 1 99903 99903.0 8.5 nums = [int(i) for i in inf.read().split()]
6 # check contents
7 1 212 212.0 0.0 n = nums.pop(0)
8 1 7 7.0 0.0 if len(nums) < n:
9 raise ValueError("Input file is too short!")
10 1 2 2.0 0.0 elif len(nums) > n:
11 raise ValueError("Input file is too long!")
12 # min_pre[i] == min(nums[:i])
13 1 994 994.0 0.1 min_pre = [0] * n
14 1 3 3.0 0.0 min_pre[0] = ERROR_VAL
15 100000 162915 1.6 13.8 for i in range(1, n):
16 99999 267593 2.7 22.7 min_pre[i] = min(nums[i - 1], min_pre[i - 1])
17 # min_post[i] == min(nums[i+1:])
18 1 1050 1050.0 0.1 min_post = [0] * n
19 1 3 3.0 0.0 min_post[n - 1] = ERROR_VAL
20 100000 167021 1.7 14.2 for i in range(n - 2, -1, -1):
21 99999 272080 2.7 23.1 min_post[i] = min(nums[i + 1], min_post[i + 1])
22 1 205222 205222.0 17.4 return max((nums[i] - min_pre[i]) + (nums[i] - min_post[i]) for i in range(1, n - 1) if min_pre[i] < nums[i] > min_post[i])
and you can see the run-time for processing 100k values has dropped from 197 s to 0.3 s.
I have data like this:
ID 8-Jan 15-Jan 22-Jan 29-Jan 5-Feb 12-Feb LowerBound UpperBound
001 618 720 645 573 503 447 - -
002 62 80 67 94 81 65 - -
003 32 10 23 26 26 31 - -
004 22 13 1 28 19 25 - -
005 9 7 9 6 8 4 - -
I want to create two columns with lower and upper bounds for each product using 95% confidence intervals. I know the manual way of writing a function which loops through each product ID:
import numpy as np
import scipy as sp
import scipy.stats
# Method copied from http://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data
def mean_confidence_interval(data, confidence=0.95):
    a = 1.0*np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * sp.stats.t._ppf((1+confidence)/2., n-1)
    return m-h, m+h
Is there a more efficient way to do this in Pandas (a one-liner kind of thing)?
Of course, you want df.apply. Note you need to modify mean_confidence_interval to return pd.Series([m-h, m+h]).
df[['LowerBound','UpperBound']] = df.apply(mean_confidence_interval, axis=1)
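A minimal sketch of that modification (week_cols is an assumed name for the list of numeric date columns from the table above; the public t.ppf is used in place of the private _ppf):

import numpy as np
import pandas as pd
import scipy.stats

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    return pd.Series([m - h, m + h])   # was: return m-h, m+h

df[['LowerBound', 'UpperBound']] = df[week_cols].apply(mean_confidence_interval, axis=1)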
Standard error of the mean is pretty straightforward to calculate so you can easily vectorize this:
import scipy.stats as ss
df.mean(axis=1) + ss.t.ppf(0.975, df.shape[1]-1) * df.std(axis=1)/np.sqrt(df.shape[1])
will give you the upper bound. Use - ss.t.ppf for the lower bound.
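Put together, a sketch that computes both bounds and writes them back (assuming df holds only the numeric week columns):

import numpy as np
import scipy.stats as ss

row_mean = df.mean(axis=1)                # compute these before adding the new columns
half_width = ss.t.ppf(0.975, df.shape[1] - 1) * df.std(axis=1) / np.sqrt(df.shape[1])
df['LowerBound'] = row_mean - half_width
df['UpperBound'] = row_mean + half_width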
Also, pandas seems to have a sem method. If you have a large dataset, I don't suggest using apply over rows. It is pretty slow. Here are some timings:
df = pd.DataFrame(np.random.randn(100, 10))
%timeit df.apply(mean_confidence_interval, axis=1)
100 loops, best of 3: 18.2 ms per loop
%%timeit
dist = ss.t.ppf(0.975, df.shape[1]-1) * df.sem(axis=1)
mean = df.mean(axis=1)
mean - dist, mean + dist
1000 loops, best of 3: 598 µs per loop
Since you already created a function for calculating the confidence interval, simply apply it to each row of your data:
def mean_confidence_interval(data):
    confidence = 0.95
    m = data.mean()
    se = scipy.stats.sem(data)
    h = se * sp.stats.t._ppf((1 + confidence) / 2, data.shape[0] - 1)
    return pd.Series((m - h, m + h))
interval = df.apply(mean_confidence_interval, axis=1)
interval.columns = ("LowerBound", "UpperBound")
pd.concat([df, interval],axis=1)
I am currently trying to process an experimental time-series dataset which has missing values. I would like to calculate the sliding-window mean of this dataset along time, while handling NaN values. The correct way for me to do it is to compute, inside each window, the sum of the finite elements and divide it by their number. This nonlinearity forces me to use non-convolutional methods, so I have a severe time bottleneck in this part of the process. As a code example of what I am trying to accomplish, I present the following:
import numpy as np

#Construct sample data
n = 50
n_miss = 20
win_size = 3
data = np.random.random(50)
data[np.random.randint(0, n-1, n_miss)] = None

#Compute mean
result = np.zeros(data.size)
for count in range(data.size):
    part_data = data[max(count - (win_size - 1) / 2, 0): min(count + (win_size + 1) / 2, data.size)]
    mask = np.isfinite(part_data)
    if np.sum(mask) != 0:
        result[count] = np.sum(part_data[mask]) / np.sum(mask)
    else:
        result[count] = None

print 'Input:\t', data
print 'Output:\t', result
with output:
Input: [ 0.47431791 0.17620835 0.78495647 0.79894688 0.58334064 0.38068788
0.87829696 nan 0.71589171 nan 0.70359557 0.76113969
0.13694387 0.32126573 0.22730891 nan 0.35057169 nan
0.89251851 0.56226354 0.040117 nan 0.37249799 0.77625334
nan nan nan nan 0.63227417 0.92781944
0.99416471 0.81850753 0.35004997 nan 0.80743783 0.60828597
nan 0.01410721 nan nan 0.6976317 nan
0.03875394 0.60924066 0.22998065 nan 0.34476729 0.38090961
nan 0.2021964 ]
Output: [ 0.32526313 0.47849424 0.5867039 0.72241466 0.58765847 0.61410849
0.62949242 0.79709433 0.71589171 0.70974364 0.73236763 0.53389305
0.40644977 0.22850617 0.27428732 0.2889403 0.35057169 0.6215451
0.72739103 0.49829968 0.30119027 0.20630749 0.57437567 0.57437567
0.77625334 nan nan 0.63227417 0.7800468 0.85141944
0.91349722 0.7209074 0.58427875 0.5787439 0.7078619 0.7078619
0.31119659 0.01410721 0.01410721 0.6976317 0.6976317 0.36819282
0.3239973 0.29265842 0.41961066 0.28737397 0.36283845 0.36283845
0.29155301 0.2021964 ]
Can this result be produced by numpy operations, without using a for loop?
You can do that using the rolling function of Pandas:
import numpy as np
import pandas as pd
#Construct sample data
n = 50
n_miss = 20
win_size = 3
data = np.random.random(n)
data[np.random.randint(0, n-1, n_miss)] = None
windowed_mean = pd.Series(data).rolling(window=win_size, min_periods=1).mean()
print(pd.DataFrame({'Data': data, 'Windowed mean': windowed_mean}) )
Output:
Data Windowed mean
0 0.589376 0.589376
1 0.639173 0.614274
2 0.343534 0.524027
3 0.250329 0.411012
4 0.911952 0.501938
5 NaN 0.581141
6 0.224964 0.568458
7 NaN 0.224964
8 0.508419 0.366692
9 0.215418 0.361918
10 NaN 0.361918
11 0.638118 0.426768
12 0.587478 0.612798
13 0.097037 0.440878
14 0.688689 0.457735
15 0.858593 0.548107
16 0.408903 0.652062
17 0.448993 0.572163
18 NaN 0.428948
19 0.877453 0.663223
20 NaN 0.877453
21 NaN 0.877453
22 0.021798 0.021798
23 0.482054 0.251926
24 0.092387 0.198746
25 0.251766 0.275402
26 0.093854 0.146002
27 NaN 0.172810
28 NaN 0.093854
29 NaN NaN
30 0.965669 0.965669
31 0.695999 0.830834
32 NaN 0.830834
33 NaN 0.695999
34 NaN NaN
35 0.613727 0.613727
36 0.837533 0.725630
37 NaN 0.725630
38 0.782295 0.809914
39 NaN 0.782295
40 0.777429 0.779862
41 0.401355 0.589392
42 0.491709 0.556831
43 0.127813 0.340292
44 0.781625 0.467049
45 0.960466 0.623301
46 0.637618 0.793236
47 0.651264 0.749782
48 0.154911 0.481264
49 0.159145 0.321773
Here's a convolution based approach using np.convolve -
mask = np.isnan(data)
K = np.ones(win_size,dtype=int)
out = np.convolve(np.where(mask,0,data), K)/np.convolve(~mask,K)
Please note that this would have one extra element on either side.
If you are working with 2D data, we can use Scipy's 2D convolution.
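A rough 2D analogue of the same ratio-of-convolutions idea (a sketch; data2d is an assumed 2D float array containing NaNs):

from scipy.signal import convolve2d

mask2d = np.isnan(data2d)
K2 = np.ones((win_size, win_size))
out2d = convolve2d(np.where(mask2d, 0, data2d), K2, mode='same') / \
        convolve2d((~mask2d).astype(float), K2, mode='same')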
Approaches -
def original_app(data, win_size):
    #Compute mean
    result = np.zeros(data.size)
    for count in range(data.size):
        part_data = data[max(count - (win_size - 1) / 2, 0): \
                         min(count + (win_size + 1) / 2, data.size)]
        mask = np.isfinite(part_data)
        if np.sum(mask) != 0:
            result[count] = np.sum(part_data[mask]) / np.sum(mask)
        else:
            result[count] = None
    return result

def numpy_app(data, win_size):
    mask = np.isnan(data)
    K = np.ones(win_size,dtype=int)
    out = np.convolve(np.where(mask,0,data), K)/np.convolve(~mask,K)
    return out[1:-1]  # Slice out the one-extra elems on sides
Sample run -
In [118]: #Construct sample data
...: n = 50
...: n_miss = 20
...: win_size = 3
...: data= np.random.random(50)
...: data[np.random.randint(0,n-1, n_miss)] = np.nan
...:
In [119]: original_app(data, win_size = 3)
Out[119]:
array([ 0.88356487, 0.86829731, 0.85249541, 0.83776219, nan,
nan, 0.61054015, 0.63111926, 0.63111926, 0.65169837,
0.1857301 , 0.58335324, 0.42088104, 0.5384565 , 0.31027752,
0.40768907, 0.3478563 , 0.34089655, 0.55462903, 0.71784816,
0.93195716, nan, 0.41635575, 0.52211653, 0.65053379,
0.76762282, 0.72888574, 0.35250449, 0.35250449, 0.14500637,
0.06997668, 0.22582318, 0.18621848, 0.36320784, 0.19926647,
0.24506199, 0.09983572, 0.47595439, 0.79792941, 0.5982114 ,
0.42389375, 0.28944089, 0.36246113, 0.48088139, 0.71105449,
0.60234163, 0.40012839, 0.45100475, 0.41768466, 0.41768466])
In [120]: numpy_app(data, win_size = 3)
__main__:36: RuntimeWarning: invalid value encountered in divide
Out[120]:
array([ 0.88356487, 0.86829731, 0.85249541, 0.83776219, nan,
nan, 0.61054015, 0.63111926, 0.63111926, 0.65169837,
0.1857301 , 0.58335324, 0.42088104, 0.5384565 , 0.31027752,
0.40768907, 0.3478563 , 0.34089655, 0.55462903, 0.71784816,
0.93195716, nan, 0.41635575, 0.52211653, 0.65053379,
0.76762282, 0.72888574, 0.35250449, 0.35250449, 0.14500637,
0.06997668, 0.22582318, 0.18621848, 0.36320784, 0.19926647,
0.24506199, 0.09983572, 0.47595439, 0.79792941, 0.5982114 ,
0.42389375, 0.28944089, 0.36246113, 0.48088139, 0.71105449,
0.60234163, 0.40012839, 0.45100475, 0.41768466, 0.41768466])
Runtime test -
In [122]: #Construct sample data
...: n = 50000
...: n_miss = 20000
...: win_size = 3
...: data= np.random.random(n)
...: data[np.random.randint(0,n-1, n_miss)] = np.nan
...:
In [123]: %timeit original_app(data, win_size = 3)
1 loops, best of 3: 1.51 s per loop
In [124]: %timeit numpy_app(data, win_size = 3)
1000 loops, best of 3: 1.09 ms per loop
In [125]: import pandas as pd
# #jdehesa's pandas solution
In [126]: %timeit pd.Series(data).rolling(window=3, min_periods=1).mean()
100 loops, best of 3: 3.34 ms per loop
I'm fairly new to Python, so I don't know all the tips and tricks quite yet. I'm trying to read in data line by line from a file and then into a numpy array. I have to read it in line by line in this manner, but I have freedom when it comes to moving that data into the array. Here is the relevant code:
xyzi_point_array = np.zeros((0,4))
x_list = []
y_list = []
z_list = []
i_list = []
points_read = 0

while True:  #FOR EVERY LINE DO:
    line = decryptLine(inFile.readline())  #grabs the next line of data
    if not line: break
    .
    .
    .
    index = 0
    for entry in line:  #FOR EVERY VALUE IN THE LINE
        x_list.append(X)
        y_list.append(Y)
        z_list.append(z_catalog[index])
        i_list.append(entry)
        index += 1
        points_read += 1

xyzi_point_array = np.zeros((points_read,4))
xyzi_point_array[:,0] = x_list
xyzi_point_array[:,1] = y_list
xyzi_point_array[:,2] = z_list
xyzi_point_array[:,3] = i_list
Here, X and Y are scalars which are different for each line, and z_catalog is a 1D numpy array.
For smaller data sets, the embedded for loop is the biggest draw, with the xyzi_point_array[points_read,:] = line taking the majority of processor time. However, with larger data sets, working with tempArr to expand xyzi_point_array becomes the worst, so I'll need to optimize both.
Any ideas? General tips on how to better handle numpy arrays are also welcome; I come from a C++ background, and am probably not handling these arrays in the most Pythonic way.
For reference, here is the lineprofiler readout for this bit of the code:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
138 150 233 1.6 0.0 index = 0
139 489600 468293 1.0 11.6 for entry in line: #FOR EVERY VALUE IN THE LINE
140 489450 457227 0.9 11.4 x_list.append(lineX)
141 489450 441687 0.9 11.0 y_list.append(lineY)
142 489450 541891 1.1 13.5 z_list.append(z_catalog[index])
143 489450 450191 0.9 11.2 i_list.append(entry)
144 489450 421573 0.9 10.5 index += 1
145 489450 408764 0.8 10.2 points_read += 1
146
149 1 78 78.0 0.0 xyzi_point_array = np.zeros((points_read,4))
150 1 39539 39539.0 1.0 xyzi_point_array[:,0] = x_list
151 1 33876 33876.0 0.8 xyzi_point_array[:,1] = y_list
152 1 48619 48619.0 1.2 xyzi_point_array[:,2] = z_list
153 1 47219 47219.0 1.2 xyzi_point_array[:,3] = i_list
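One possible restructuring of the loop above (a sketch, not from the original thread; decryptLine, X, Y, z_catalog and inFile are the names used in the question, and each line is assumed to be a sequence of numeric values):

rows = []
while True:
    line = decryptLine(inFile.readline())
    if not line:
        break
    # ... X, Y for this line are set here, as in the original ...
    vals = np.asarray(line, dtype=float)
    k = len(vals)
    # build all four columns for this line at once instead of appending per value
    rows.append(np.column_stack((np.full(k, X),
                                 np.full(k, Y),
                                 z_catalog[:k],
                                 vals)))
xyzi_point_array = np.vstack(rows)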
New to Pandas, looking for the most efficient way to do this.
I have a Series of DataFrames. Each DataFrame has the same columns but different indexes, and they are indexed by date. The Series is indexed by ticker symbol. So each item in the Sequence represents a single time series of each individual stock's performance.
I need to randomly generate a list of n data frames, where each dataframe is a subset of some random assortment of the available stocks' histories. It's OK if there is overlap, so long as the start and end dates are different.
The following code does it, but it's really slow, and I'm wondering if there's a better way to go about it:
Code
def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if type(data) != pd.Series:
        return None
    if subset=='validate':
        offset = 0
    elif subset=='test':
        offset = 200
    elif subset=='train':
        offset = 400
    tickers = np.random.randint(0, len(data), size=len(data))
    ret_data = []
    while len(ret_data) != batch_size:
        for t in tickers:
            data_t = data[t]
            max_len = len(data_t)-timesteps-1
            if len(ret_data)==batch_size: break
            if max_len-offset < 0: continue
            index = np.random.randint(offset, max_len)
            d = data_t[index:index+timesteps]
            if len(d)==timesteps: ret_data.append(d)
    return ret_data
Profile output:
Timer unit: 1e-06 s
File: finance.py
Function: random_sample at line 137
Total time: 0.016142 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
137 #profile
138 def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
139 1 5 5.0 0.0 if type(data) != pd.Series:
140 return None
141
142 1 1 1.0 0.0 if subset=='validate':
143 offset = 0
144 1 1 1.0 0.0 elif subset=='test':
145 offset = 200
146 1 0 0.0 0.0 elif subset=='train':
147 1 1 1.0 0.0 offset = 400
148
149 1 1835 1835.0 11.4 tickers = np.random.randint(0, len(data), size=len(data))
150
151 1 2 2.0 0.0 ret_data = []
152 2 3 1.5 0.0 while len(ret_data) != batch_size:
153 116 148 1.3 0.9 for t in tickers:
154 116 2497 21.5 15.5 data_t = data[t]
155 116 317 2.7 2.0 max_len = len(data_t)-timesteps-1
156 116 80 0.7 0.5 if len(ret_data)==batch_size: break
157 115 69 0.6 0.4 if max_len-offset < 0: continue
158
159 100 101 1.0 0.6 index = np.random.randint(offset, max_len)
160 100 10840 108.4 67.2 d = data_t[index:index+timesteps]
161 100 241 2.4 1.5 if len(d)==timesteps: ret_data.append(d)
162
163 1 1 1.0 0.0 return ret_data
Are you sure you need to find a faster method? Your current method isn't that slow. The following changes might simplify your code, but won't necessarily make it any faster:
Step 1: Take a random sample (with replacement) from the list of dataframes
rand_stocks = np.random.randint(0, len(data), size=batch_size)
You can treat this array rand_stocks as a list of indices to be applied to your Series of dataframes. The size is already batch_size, so that eliminates the need for the while loop and your comparison on line 156.
That is, you can iterate over rand_stocks and access the stock like so:
for idx in rand_stocks:
    stock = data.ix[idx]
    # Get a sample from this stock.
Step 2: Get a random date range for each stock you have randomly selected.
start_idx = np.random.randint(offset, len(stock)-timesteps)
d = stock[start_idx:start_idx+timesteps]
I don't have your data, but here's how I put it together:
def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if subset=='train': offset = 0  # you can obviously change this back
    rand_stocks = np.random.randint(0, len(data), size=batch_size)
    ret_data = []
    for idx in rand_stocks:
        stock = data[idx]
        start_idx = np.random.randint(offset, len(stock)-timesteps)
        d = stock[start_idx:start_idx+timesteps]
        ret_data.append(d)
    return ret_data
Creating a dataset:
In [22]: import numpy as np
In [23]: import pandas as pd
In [24]: rndrange = pd.DateRange('1/1/2012', periods=72, freq='H')
In [25]: rndseries = pd.Series(np.random.randn(len(rndrange)), index=rndrange)
In [26]: rndseries.head()
Out[26]:
2012-01-02 2.025795
2012-01-03 1.731667
2012-01-04 0.092725
2012-01-05 -0.489804
2012-01-06 -0.090041
In [27]: data = [rndseries,rndseries,rndseries,rndseries,rndseries,rndseries]
Testing the function:
In [42]: random_sample(data, timesteps=2, batch_size = 2)
Out[42]:
[2012-01-23 1.464576
2012-01-24 -1.052048,
2012-01-23 1.464576
2012-01-24 -1.052048]