I am performing a complex transformation on a DataFrame. I thought it would be quick for Pandas, but the only way I've managed to do it is with some nested groupbys and applys, using lambda functions, and it is slow. It seems like the sort of thing where there should be built-in, faster methods. At n_rows=1000 it's 2 seconds, but I'll be doing 10^7 rows, so this is far too slow. It's difficult to explain what we're doing, so here's the code and profile, then I'll explain:
import pandas as pd
from numpy import array, arange
from numpy.random import randint
n_rows = 1000
d = pd.DataFrame(randint(1,10,(n_rows,8))) #Raw data
dgs = array([3,4,1,8,9,2,3,7,10,8]) #Values we will look up, referenced by index
grps = pd.cut(randint(1,5,n_rows),arange(1,5)) #Grouping
f = lambda x: dgs[x.index].mean() #Works on a grouped Series
g = lambda x: x.groupby(x).apply(f) #Works on a Series
h = lambda x: x.apply(g,axis=1).mean(axis=0) #Works on a grouped DataFrame
q = d.groupby(grps).apply(h) #Slow
824984 function calls (816675 primitive calls) in 1.850 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
221770 0.105 0.000 0.105 0.000 {isinstance}
7329 0.104 0.000 0.217 0.000 index.py:86(__new__)
8309 0.089 0.000 0.423 0.000 series.py:430(__new__)
5375 0.081 0.000 0.081 0.000 {method 'reduce' of 'numpy.ufunc' objects}
34225 0.068 0.000 0.133 0.000 {method 'view' of 'numpy.ndarray' objects}
36780/36779 0.067 0.000 0.067 0.000 {numpy.core.multiarray.array}
5349 0.065 0.000 0.567 0.000 series.py:709(_get_values)
985/1 0.063 0.000 1.847 1.847 groupby.py:608(apply)
5349 0.056 0.000 0.198 0.000 _methods.py:42(_mean)
5358 0.050 0.000 0.232 0.000 index.py:332(__getitem__)
8309 0.049 0.000 0.228 0.000 series.py:3299(_sanitize_array)
9296 0.047 0.000 0.116 0.000 index.py:1341(__new__)
984 0.039 0.000 0.092 0.000 algorithms.py:105(factorize)
Group the DataFrame rows by the groupings. Within each grouping, for each row, group the row's entries by value (i.e. all the cells holding 3 versus all the cells holding 4). For each column index in a value group, look up the corresponding entry in dgs and average. Then average those per-row results across the rows of the grouping.
::exhale::
Any suggestions on how to rearrange this for speed would be appreciated.
You can do the apply and groupby with one multilevel groupby; here is the code:
import pandas as pd
import numpy as np
from numpy import array, arange
from numpy.random import randint, seed
seed(42)
n_rows = 1000
d = pd.DataFrame(randint(1,10,(n_rows,8))) #Raw data
dgs = array([3,4,1,8,9,2,3,7,10,8]) #Values we will look up, referenced by index
grps = pd.cut(randint(1,5,n_rows),arange(1,5)) #Grouping
f = lambda x: dgs[x.index].mean() #Works on a grouped Series
g = lambda x: x.groupby(x).apply(f) #Works on a Series
h = lambda x: x.apply(g,axis=1).mean(axis=0) #Works on a grouped DataFrame
print d.groupby(grps).apply(h) #Slow
### my code starts from here ###
def group_process(df2):
    s = df2.stack()
    v = np.repeat(dgs[None, :df2.shape[1]], df2.shape[0], axis=0).ravel()
    return pd.Series(v).groupby([s.index.get_level_values(0), s.values]).mean().mean(level=1)
print d.groupby(grps).apply(group_process)
output:
1 2 3 4 5 6 7 \
(1, 2] 4.621575 4.625887 4.775235 4.954321 4.566441 4.568111 4.835664
(2, 3] 4.446347 4.138528 4.862613 4.800538 4.582721 4.595890 4.794183
(3, 4] 4.776144 4.510119 4.391729 4.392262 4.930556 4.695776 4.630068
8 9
(1, 2] 4.246085 4.520384
(2, 3] 5.237360 4.418934
(3, 4] 4.829167 4.681548
[3 rows x 9 columns]
1 2 3 4 5 6 7 \
(1, 2] 4.621575 4.625887 4.775235 4.954321 4.566441 4.568111 4.835664
(2, 3] 4.446347 4.138528 4.862613 4.800538 4.582721 4.595890 4.794183
(3, 4] 4.776144 4.510119 4.391729 4.392262 4.930556 4.695776 4.630068
8 9
(1, 2] 4.246085 4.520384
(2, 3] 5.237360 4.418934
(3, 4] 4.829167 4.681548
[3 rows x 9 columns]
It's about 70x faster, but I don't know whether it will scale to 10**7 rows.
Following some online research (1, 2, numpy, scipy, scikit, math), I have found several ways for calculating the Euclidean Distance in Python:
# 1
numpy.linalg.norm(a-b)
# 2
distance.euclidean(vector1, vector2)
# 3
sklearn.metrics.pairwise.euclidean_distances
# 4
sqrt((xa-xb)^2 + (ya-yb)^2 + (za-zb)^2)
# 5
dist = [(a - b)**2 for a, b in zip(vector1, vector2)]
dist = math.sqrt(sum(dist))
# 6
math.hypot(x, y)
I was wondering if someone could provide insight into which of the above (or any others I have not found) is considered best in terms of efficiency and precision. If someone is aware of any resources that discuss the subject, that would also be great.
The context I am interested in is calculating the Euclidean distance between pairs of number tuples, e.g. the distance between (52, 106, 35, 12) and (33, 153, 75, 10).
Conclusion first:
From the timeit results below, we can conclude the following regarding efficiency:
Method5 (zip, math.sqrt) > Method1 (numpy.linalg.norm) > Method2 (scipy.spatial.distance) > Method3 (sklearn.metrics.pairwise.euclidean_distances)
I didn't really test Method4, as it is not suitable for general cases and is essentially equivalent to Method5.
Quite surprisingly, Method5 is the fastest. Method1, which uses numpy and, as expected, is heavily optimized in C, is the second fastest.
For scipy.spatial.distance, if you go to the function definition, you will see that it actually uses numpy.linalg.norm, except that it first validates the two input vectors. That's why it is slightly slower than numpy.linalg.norm.
Finally for sklearn, according to the documentation:
This formulation has two advantages over other ways of computing distances. First, it is computationally efficient when dealing with sparse data. Second, if one argument varies but the other remains unchanged, then dot(x, x) and/or dot(y, y) can be pre-computed.
However, this is not the most precise way of doing this computation, and the distance matrix returned by this function may not be exactly symmetric as required
Since your question uses a fixed set of data, the advantage of this implementation is not reflected. And because of that trade-off between performance and precision, it also gives the worst precision of all the methods.
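For reference, the formulation the documentation describes is the algebraic expansion of the squared distance. Here is a small sketch of both forms on the vectors from the question (my own illustration, not taken from the sklearn source):
import numpy as np
x = np.array([52, 106, 35, 12], dtype=float)
y = np.array([33, 153, 75, 10], dtype=float)
# Expanded form: ||x - y||^2 = x.x - 2*x.y + y.y. dot(x, x) and dot(y, y) can be
# cached when one argument is reused (the speed advantage); subtracting large,
# nearly equal terms is where precision can suffer.
d_expanded = np.sqrt(np.dot(x, x) - 2 * np.dot(x, y) + np.dot(y, y))
d_direct = np.sqrt(np.dot(x - y, x - y))
print(d_expanded, d_direct)  # both ~64.6065 for these small integer vectors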
Regarding precision, Method5 = Method1 = Method2 > Method3.
Efficiency Test Script:
import numpy as np
from scipy.spatial import distance
from sklearn.metrics.pairwise import euclidean_distances
import math
# 1
def eudis1(v1, v2):
    return np.linalg.norm(v1-v2)
# 2
def eudis2(v1, v2):
    return distance.euclidean(v1, v2)
# 3
def eudis3(v1, v2):
    return euclidean_distances(v1, v2)
# 5
def eudis5(v1, v2):
    dist = [(a - b)**2 for a, b in zip(v1, v2)]
    dist = math.sqrt(sum(dist))
    return dist
dis1 = (52, 106, 35, 12)
dis2 = (33, 153, 75, 10)
v1, v2 = np.array(dis1), np.array(dis2)
import timeit
def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped
wrappered1 = wrapper(eudis1, v1, v2)
wrappered2 = wrapper(eudis2, v1, v2)
wrappered3 = wrapper(eudis3, v1, v2)
wrappered5 = wrapper(eudis5, v1, v2)
t1 = timeit.repeat(wrappered1, repeat=3, number=100000)
t2 = timeit.repeat(wrappered2, repeat=3, number=100000)
t3 = timeit.repeat(wrappered3, repeat=3, number=100000)
t5 = timeit.repeat(wrappered5, repeat=3, number=100000)
print('\n')
print('t1: ', sum(t1)/len(t1))
print('t2: ', sum(t2)/len(t2))
print('t3: ', sum(t3)/len(t3))
print('t5: ', sum(t5)/len(t5))
Efficiency Test Output:
t1: 0.654838958307
t2: 1.53977598714
t3: 6.7898791732
t5: 0.422228400305
Precision Test Script & Result:
In [8]: eudis1(v1,v2)
Out[8]: 64.60650122085238
In [9]: eudis2(v1,v2)
Out[9]: 64.60650122085238
In [10]: eudis3(v1,v2)
Out[10]: array([[ 64.60650122]])
In [11]: eudis5(v1,v2)
Out[11]: 64.60650122085238
This is not exactly answering the question, but it is probably worth mentioning that if you aren't interested in the actual Euclidean distance, but just want to compare Euclidean distances against each other, the square root is a monotone function: x**(1/2) < y**(1/2) if and only if x < y (for nonnegative x and y).
So if you don't need the explicit distance, but for instance just want to know which vector in a list of vectors, called vectorlist, is closest to vector1, you can avoid the expensive (in terms of both precision and time) square root and make do with something like
min(vectorlist, key=lambda compare: sum([(a - b)**2 for a, b in zip(vector1, compare)]))
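For example (with made-up candidate vectors), you can keep the comparison on squared distances and only take the root once at the end if you need the actual value:
import math
vector1 = (52, 106, 35, 12)
vectorlist = [(33, 153, 75, 10), (50, 100, 30, 15), (0, 0, 0, 0)]
def sq_dist(compare):
    # squared distance is enough for comparisons, since sqrt is monotone
    return sum((a - b) ** 2 for a, b in zip(vector1, compare))
closest = min(vectorlist, key=sq_dist)
print(closest)                      # (50, 100, 30, 15)
print(math.sqrt(sq_dist(closest)))  # the actual distance, computed only once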
Here is an example of how to use just numpy.
import numpy as np
a = np.array([3, 0])
b = np.array([0, 4])
c = np.sqrt(np.sum(((a - b) ** 2)))
# c == 5.0
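If you have many points, the same idea vectorizes with broadcasting (a small sketch; the points are made up):
import numpy as np
pts = np.array([[3, 0], [0, 4], [6, 8]])
b = np.array([0, 4])
d = np.sqrt(((pts - b) ** 2).sum(axis=1))
# d == array([5., 0., 7.2111...]) -- distance from b to every row of pts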
Improving on the benchmark in the accepted answer, I've found that, assuming you already have the input in numpy array format, Method5 can be better written as:
import numpy as np
from numba import jit
@jit(nopython=True)
def euclidian_distance(y1, y2):
    return np.sqrt(np.sum((y1-y2)**2))  # based on the Pythagorean theorem
Speed test:
euclidian_distance(y1, y2)
# 2.03 µs ± 138 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
np.linalg.norm(y1-y2)
# 17.6 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Fun fact: you can also add jit to the numpy-based function:
@jit(nopython=True)
def jit_linalg(y1, y2):
    return np.linalg.norm(y1-y2)
jit_linalg(y[i],y[j])
# 2.91 µs ± 261 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
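For completeness, a sketch of the setup these timings assume (y1 and y2 are not defined in the answer itself; the values below are just an example pair):
y1 = np.array([52, 106, 35, 12], dtype=np.float64)
y2 = np.array([33, 153, 75, 10], dtype=np.float64)
print(euclidian_distance(y1, y2))  # ~64.6065; the first call also pays the JIT compilation cost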
As a general rule of thumb, stick to the scipy and numpy implementations where possible, since they're vectorized and much faster than native Python code. The main reasons are that they are implemented in C and that vectorization eliminates the type-checking overhead incurred by looping.
(Aside: my answer doesn't cover precision here, but I think the same principle applies to precision as to efficiency.)
As a bit of a bonus, I'll chip in with a bit of information on how you can profile your code, to measure efficiency. If you're using the IPython interpreter, the secret is to use the %prun line magic.
In [1]: import numpy
In [2]: from scipy.spatial import distance
In [3]: c1 = numpy.array((52, 106, 35, 12))
In [4]: c2 = numpy.array((33, 153, 75, 10))
In [5]: %prun distance.euclidean(c1, c2)
35 function calls in 0.000 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 linalg.py:1976(norm)
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.dot}
6 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.array}
4 0.000 0.000 0.000 0.000 numeric.py:406(asarray)
1 0.000 0.000 0.000 0.000 distance.py:232(euclidean)
2 0.000 0.000 0.000 0.000 distance.py:152(_validate_vector)
2 0.000 0.000 0.000 0.000 shape_base.py:9(atleast_1d)
1 0.000 0.000 0.000 0.000 misc.py:11(norm)
1 0.000 0.000 0.000 0.000 function_base.py:605(asarray_chkfinite)
2 0.000 0.000 0.000 0.000 numeric.py:476(asanyarray)
1 0.000 0.000 0.000 0.000 {method 'ravel' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 linalg.py:111(isComplexType)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
4 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {method 'squeeze' of 'numpy.ndarray' objects}
In [6]: %prun numpy.linalg.norm(c1 - c2)
10 function calls in 0.000 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 linalg.py:1976(norm)
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.dot}
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 numeric.py:406(asarray)
1 0.000 0.000 0.000 0.000 {method 'ravel' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 linalg.py:111(isComplexType)
1 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.array}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
What %prun does is tell you how long a function call takes to run, including a bit of trace to figure out where the bottleneck might be. In this case, both the scipy.spatial.distance.euclidean and numpy.linalg.norm implementations are pretty fast. Assuming you defined a function dist(vect1, vect2), you can profile using the same IPython magic call. As another added bonus, %prun also works inside the Jupyter notebook, and you can do %%prun to profile an entire cell of code, rather than just one function, simply by making %%prun the first line of that cell.
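For example, profiling your own dist function looks exactly the same (the body below is just one possible implementation, not code from the question):
In [7]: def dist(vect1, vect2):
   ...:     return numpy.sqrt(numpy.sum((vect1 - vect2) ** 2))
In [8]: %prun dist(c1, c2)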
I don't know how the precision and speed compare to the other libraries you mentioned, but you can do it for 2D vectors using the built-in math.hypot() function:
from math import hypot
def pairwise(iterable):
    "s -> (s0, s1), (s1, s2), (s2, s3), ..."
    a, b = iter(iterable), iter(iterable)
    next(b, None)
    return zip(a, b)
a = (52, 106, 35, 12)
b = (33, 153, 75, 10)
dist = [hypot(p2[0]-p1[0], p2[1]-p1[1]) for p1, p2 in pairwise(tuple(zip(a, b)))]
print(dist) # -> [131.59027319676787, 105.47511554864494, 68.94925670375281]
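As a side note, if you are on Python 3.8 or newer, the standard library also covers the n-dimensional case directly (a small sketch, separate from the 2D approach above):
from math import dist, hypot
a = (52, 106, 35, 12)
b = (33, 153, 75, 10)
print(dist(a, b))                                 # 64.60650122085238
print(hypot(*(pa - pb for pa, pb in zip(a, b))))  # same value; hypot accepts any number of coordinates since 3.8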
I am trying to build a DataFrame in Python and update its various rows and columns through a loop, based on various calculations. The calculations are all correct, but when I try to display the DataFrame once the loop is complete, only some of the calculated numbers are displayed and I mostly see zeros. Below is an example code:
import numpy as np
import pandas as pd
map = pd.DataFrame(np.zeros([16, 6]), columns=['A', 'B', 'C', 'D', 'E', 'F'])
for i in range(0, len(map)):
    map.A[i] = 1+1  #Some calculations
    map.B[i] = map.A[i] + 2
print map
Result (just an example):
A B C D E F
1 2 4 0.000 0.000 0.000 0.000
2 0.000 0.000 0.000 0.000 0.000 0.000
3 0.000 0.000 0.000 0.000 0.000 0.000
4 0.000 0.000 0.000 0.000 0.000 0.000
5 0.000 0.000 0.000 0.000 0.000 0.000
(continues for 16 rows)
However, if I were to print a specific column, I would get the real calculated numbers. Also, the B calculation uses the correct numbers from A, so it has to be just a print issue. I am guessing it has something to do with initializing the array and the memory, but I am not sure. I originally used np.empty([16, 6]), but the same result occurred. How do I get the DataFrame to print the actual numbers, not the zeros?
Below is a function I wrote to label certain rows based on ranges of indexes. For convenience, I'm making the two function arguments, samples and matdat available for download in pickle format.
from operator import itemgetter
from itertools import izip, imap
import pandas as pd
def _insert_design_columns(samples, matdat):
    """Add columns for design factors, label lines that correspond to a given trial and
    then fill in said columns with the appropriate value on lines that belong to a
    trial.

    samples : DataFrame
        DataFrame of eyetracker samples.
        column `t`: time sample, in ms
        column `event`: TTL event
        columns x, y: x and y coordinates of gaze
        column cr: corneal reflection area
    matdat : dict of numpy arrays
        dict mapping matlab variable name to numpy array

    returns : modified `samples` dataframe
    """
    ## This is fairly trivial preparation and data formatting for the nested
    # for-loop below. We're just fixing types, adding empty columns, and
    # ensuring that our numpy arrays have the right shape.

    # Grab variables from the dict & squeeze the numpy arrays
    key = ('cuepos', 'targetpos', 'targetorientation', 'soa', 'normalizedResp')
    cpos, tpos, torient, soa, resp = map(pd.np.squeeze, imap(matdat.get, key))
    cpos = cpos.astype(float)
    cpos[cpos < 0] = pd.np.nan
    cong = tpos == cpos
    cong[pd.isnull(cpos)] = pd.np.nan

    # Add empty columns for each factor. These will contain the factor level
    # that corresponds to a trial (i.e. between a `TrialStart` and `ReportCueOnset` in
    # `samples.event`)
    samples['soa'] = pd.np.nan
    samples['cpos'] = pd.np.nan
    samples['tpos'] = pd.np.nan
    samples['cong'] = pd.np.nan
    samples['torient'] = pd.np.nan
    samples['normalizedResp'] = pd.np.nan

    ## This is important, but not the part we need to optimize.
    # Here, we're finding the start and end indexes for every trial. Trials
    # are composed of continuous slices of rows.

    # Assign trial numbers
    tstart = samples[samples.event == 'TrialStart'].t  # each trial starts on a `TrialStart`
    tstop = samples[samples.event == 'ReportCueOnset'].t  # ... and ends on a `ReportCueOnset`
    samples['trial'] = pd.np.nan  # make an empty column which will contain trial num

    ## This is the sub-optimal part. Here, we're iterating through our start/end index
    # pairs, slicing the dataframe to get the rows we need, and then:
    # 1. Assigning a trial number to that slice of rows
    # 2. Assigning the correct value to corresponding columns (see `factor_names`)
    samples.set_index(['t'], inplace=True)
    for i, (start, stop) in enumerate(izip(tstart, tstop)):
        samples.loc[start:stop, 'trial'] = i + 1  # label the interval's trial number
        # Now that we've labeled a range of rows as a trial, we can add factor levels
        # to the corresponding columns
        idx = itemgetter(i - 1)
        # factor_values/names has the same length as the number of trials we're going to
        # find. Get the corresponding value for the current trial so that we can
        # assign it.
        factor_values = imap(idx, (cpos, tpos, torient, soa, resp, cong))
        factor_names = ('cpos', 'tpos', 'torient', 'soa', 'resp', 'cong')
        for c, v in izip(factor_names, factor_values):  # loop through columns and assign
            samples.loc[start:stop, c] = v
    samples.reset_index(inplace=True)
    return samples
I've performed a %prun, the first few lines of which read:
548568 function calls (547462 primitive calls) in 9.380 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
11360 6.074 0.001 6.084 0.001 index.py:604(__contains__)
2194 0.949 0.000 0.949 0.000 {method 'copy' of 'numpy.ndarray' objects}
1430 0.730 0.001 0.730 0.001 {pandas.lib.infer_dtype}
1098 0.464 0.000 0.467 0.000 internals.py:277(set)
1093/1092 0.142 0.000 9.162 0.008 indexing.py:157(_setitem_with_indexer)
1100 0.106 0.000 1.266 0.001 frame.py:1851(__setitem__)
166 0.047 0.000 0.047 0.000 {method 'astype' of 'numpy.ndarray' objects}
107209 0.037 0.000 0.066 0.000 {isinstance}
14 0.029 0.002 0.029 0.002 {numpy.core.multiarray.concatenate}
39362/38266 0.026 0.000 6.101 0.000 {getattr}
7829/7828 0.024 0.000 0.030 0.000 {numpy.core.multiarray.array}
1092 0.023 0.000 0.457 0.000 internals.py:564(setitem)
5 0.023 0.005 0.023 0.005 {pandas.algos.take_2d_axis0_float64_float64}
4379 0.021 0.000 0.108 0.000 index.py:615(__getitem__)
1101 0.020 0.000 0.582 0.001 frame.py:1967(_sanitize_column)
2192 0.017 0.000 0.946 0.000 internals.py:2236(apply)
8 0.017 0.002 0.017 0.002 {method 'repeat' of 'numpy.ndarray' objects}
Judging by the line that reads 1093/1092 0.142 0.000 9.162 0.008 indexing.py:157(_setitem_with_indexer), I strongly suspect my nested loop assignment with loc to be the culprit. The whole function takes about 9.3 seconds to execute and has to be performed 144 times in total (i.e. ~22 minutes).
Is there a way to vectorize or otherwise optimize the assignment I'm trying to do?
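A possible direction, sketched here with made-up data rather than the real samples/matdat (so an illustration only, not a drop-in replacement): assign the trial number to every row in one vectorized pass with np.searchsorted, then attach all the per-trial factor levels with a single merge instead of assigning column-by-column inside the loop.
import numpy as np
import pandas as pd
# Toy stand-ins for the real data
samples = pd.DataFrame({'t': [0, 1, 2, 5, 10, 11, 12, 20, 21]})
tstart = np.array([0, 10, 20])   # trial start times
tstop = np.array([2, 12, 21])    # trial stop times
# Which start does each sample time fall after? (1-based trial number)
trial = np.searchsorted(tstart, samples['t'].values, side='right')
in_trial = samples['t'].values <= tstop[trial - 1]   # drop samples past the trial's stop
samples['trial'] = np.where(in_trial, trial, np.nan)
# Per-trial factor levels live in one small table, attached with a single merge
factors = pd.DataFrame({'trial': [1.0, 2.0, 3.0], 'soa': [100, 200, 150]})
samples = samples.merge(factors, on='trial', how='left')
print(samples)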
I am using np.random.multinomial to sample a multinomial distribution M times (given probabilities [X_0 X_1 .. X_n] it returns counts [C_0 C_1 ... C_n] sampled from the specified multinomial, where \sum_i C_i = M). Given these sampled values (the C_i's), I want to assign them uniformly at random to some objects I have.
Currently what I'm doing is:
draws = np.random.multinomial(M, probs, size=1)
draws = draws[0]
draws_list = []
for idx, num in enumerate(draws):
    draws_list += [idx]*num
random.shuffle(draws_list)
Then draws_list is a randomly shuffled list of the sampled values.
The problem is that populating draws_list (the for loop) is very slow. Is there a better/faster way to do this?
Try this code. The strategy is to allocate the memory first, then fill in the data.
draws_list1 = np.empty(M, dtype=np.int)
acc = 0
for idx, num in enumerate(draws):
    draws_list1[acc:acc+num].fill(idx)
    acc += num
Here's the full code for profiling.
import numpy as np
import cProfile
M=10000000
draws = np.random.multinomial(M, [1/6.]*6, size=1)
draws = draws[0]
draws_list1 = np.empty(M, dtype=np.int)
def impl0():
    draws_list0 = []
    for idx, num in enumerate(draws):
        draws_list0 += [idx]*num
    return draws_list0

def impl1():
    acc = 0
    for idx, num in enumerate(draws):
        draws_list1[acc:acc+num].fill(idx)
        acc += num
    return draws_list1
cProfile.run("impl0()")
cProfile.run("impl1()")
Here is the result of cProfile. If the np.empty statement is moved inside impl1, the elapsed time becomes 0.020 seconds.
3 function calls in 0.095 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.020 0.020 0.095 0.095 <string>:1(<module>)
1 0.076 0.076 0.076 0.076 prof.py:11(impl0)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
9 function calls in 0.017 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.017 0.017 <string>:1(<module>)
1 0.000 0.000 0.017 0.017 prof.py:17(impl1)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
6 0.017 0.003 0.017 0.003 {method 'fill' of 'numpy.ndarray' objects}
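If you want to avoid the Python-level loop entirely, np.repeat can build the same array in one vectorized call (a further sketch, not included in the profile above):
import numpy as np
draws = np.random.multinomial(10000000, [1/6.]*6, size=1)[0]
draws_list2 = np.repeat(np.arange(len(draws)), draws)  # each idx repeated num times
np.random.shuffle(draws_list2)                         # shuffle in place, as in the question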
Look at the code below; I solve the problem in two ways (simple recursion and DP). Why is the DP way slower?
What's your suggestion?
#!/usr/local/bin/python2.7
# encoding: utf-8
"""
Problem: There is an array of positive integers. Given a positive integer S,
find the total number of combinations in which the numbers' sum is S.
"""
import sys
import timeit
import cProfile
Method I:
def find_sum_recursive(number_list, sum_to_find):
    count = 0
    for i in range(len(number_list)):
        sub_sum = sum_to_find - number_list[i]
        if sub_sum < 0:
            continue
        elif sub_sum == 0:
            count += 1
            continue
        else:
            sub_list = number_list[i + 1:]
            count += find_sum_recursive(sub_list, sub_sum)
    return count
Method II:
def find_sum_DP(number_list, sum_to_find):
    count = 0
    if (0 == sum_to_find):
        count = 1
    elif ([] != number_list and sum_to_find > 0):
        count = find_sum_DP(number_list[:-1], sum_to_find) + find_sum_DP(number_list[:-1], sum_to_find - number_list[:].pop())
    return count
Running it:
def main(argv=None):  # IGNORE:C0111
    number_list = [5, 5, 10, 3, 2, 9, 8]
    sum_to_find = 15
    input_setup = ';number_list = [5, 5, 10, 3, 2, 9, 8, 7, 6, 4, 3, 2, 9, 5, 4, 7, 2, 8, 3];sum_to_find = 15'
    print 'Calculating...'
    print 'recursive starting'
    count = find_sum_recursive(number_list, sum_to_find)
    print timeit.timeit('count = find_sum_recursive(number_list, sum_to_find)', setup='from __main__ import find_sum_recursive' + input_setup, number=10)
    cProfile.run('find_sum_recursive(' + str(number_list) + ',' + str(sum_to_find) + ')')
    print 'recursive ended:', count
    print 'DP starting'
    count_DP = find_sum_DP(number_list, sum_to_find)
    print timeit.timeit('count_DP = find_sum_DP(number_list, sum_to_find)', setup='from __main__ import find_sum_DP' + input_setup, number=10)
    cProfile.run('find_sum_DP(' + str(number_list) + ',' + str(sum_to_find) + ')')
    print 'DP ended:', count_DP
    print 'Finished.'

if __name__ == '__main__':
    sys.exit(main())
I rewrote Method II, and it works correctly now:
def find_sum_DP(number_list, sum_to_find):
    count = [[0 for i in xrange(0, sum_to_find + 1)] for j in xrange(0, len(number_list) + 1)]
    for i in range(len(number_list) + 1):
        for j in range(sum_to_find + 1):
            if (0 == i and 0 == j):
                count[i][j] = 1
            elif (i > 0 and j > 0):
                if (j > number_list[i - 1]):
                    count[i][j] = count[i - 1][j] + count[i - 1][j - number_list[i - 1]]
                elif (j < number_list[i - 1]):
                    count[i][j] = count[i - 1][j]
                else:
                    count[i][j] = count[i - 1][j] + 1
            else:
                count[i][j] = 0
    return count[len(number_list)][sum_to_find]
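A quick sanity check that the rewritten version agrees with Method I on the original input (the 6 matches the output below):
print(find_sum_DP([5, 5, 10, 3, 2, 9, 8], 15))  # 6, same as the recursive version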
Comparison between Method I & II:
Calculating...
recursive starting
0.00998711585999
92 function calls (63 primitive calls) in 0.000 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
30/1 0.000 0.000 0.000 0.000 FindSum.py:18(find_sum_recursive)
30 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
30 0.000 0.000 0.000 0.000 {range}
recursive ended: 6
DP starting
0.00171685218811
15 function calls in 0.000 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 FindSum.py:33(find_sum_DP)
3 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
9 0.000 0.000 0.000 0.000 {range}
DP ended: 6
Finished.
If you're using IPython, %prun is your friend here.
Take a look at the output for the recursive version:
2444 function calls (1631 primitive calls) in 0.002 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
814/1 0.002 0.000 0.002 0.002 <ipython-input-1-7488a6455e38>:1(find_sum_recursive)
814 0.000 0.000 0.000 0.000 {range}
814 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.002 0.002 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
And now, for the DP version:
10608 function calls (3538 primitive calls) in 0.007 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
7071/1 0.007 0.000 0.007 0.007 <ipython-input-15-3535e3ab26eb>:1(find_sum_DP)
3535 0.001 0.000 0.001 0.000 {method 'pop' of 'list' objects}
1 0.000 0.000 0.007 0.007 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
7071 is quite a bit higher than 814!
Your problem here is that your dynamic programming method isn't dynamic programming! The point of dynamic programming is that, when you have a problem with overlapping subproblems, as you do here, you store the results of each subproblem, and then when you need a result again, you take it from that store rather than recalculating it. Your code doesn't do that: every time you call find_sum_DP, you're recalculating, even if the same calculation has already been done. The result is that your _DP method is actually not only recursive, but recursive with more function calls than your recursive method.
(I'm currently writing a DP version to demonstrate)
Edit:
I need to add the caveat that, while I should know much more about dynamic programming, I very embarrassingly don't. I'm also writing this quickly and late at night, a bit as an exercise for myself. Nevertheless, here is a dynamic programming implementation of the function:
import numpy as np
def find_sum_realDP( number_list, sum_to_find ):
    memo = np.zeros( (len(number_list),sum_to_find+1) ,dtype=np.int)-1
    # This will store our results. memo[l][n] will give us the result
    # for number_list[0:l+1] and a sum_to_find of n. If it hasn't been
    # calculated yet, it will give us -1. This is not at all efficient
    # storage, but isn't terribly bad.

    # Now that we have that, we'll call the real function. Instead of modifying
    # the list and making copies or views, we'll keep the same list, and keep
    # track of the index we're on (nli).
    return find_sum_realDP_do( number_list, len(number_list)-1, sum_to_find, memo ),memo

def find_sum_realDP_do( number_list, nli, sum_to_find, memo ):
    # Our count is 0 by default.
    ret = 0
    # If we aren't at the sum to find yet, do we have any numbers left after this one?
    if ((sum_to_find > 0) and nli>0):
        # Each of these checks to see if we've already stored the result of the calculation.
        # If so, we use that, if not, we calculate it.
        if memo[nli-1,sum_to_find]>=0:
            ret += memo[nli-1,sum_to_find]
        else:
            ret += find_sum_realDP_do(number_list, nli-1, sum_to_find, memo)
        # This one is a bit tricky, and was a bug when I first wrote it. We don't want to
        # have a negative sum_to_find, because that will be very bad; we'll start using results
        # from other places in memo because it will wrap around.
        if (sum_to_find-number_list[nli]>=0) and memo[nli-1,sum_to_find-number_list[nli]]>=0:
            ret += memo[nli-1,sum_to_find-number_list[nli]]
        elif (sum_to_find-number_list[nli]>=0):
            ret += find_sum_realDP_do(number_list, nli-1, sum_to_find-number_list[nli], memo)
    # Do we not actually have any sum to find left?
    elif (0 == sum_to_find):
        ret = 1
    # If we only have one number left, will it get us there?
    elif (nli == 0) and (sum_to_find-number_list[nli] == 0 ):
        ret = 1
    # Store our result.
    memo[nli,sum_to_find] = ret
    # Return our result.
    return ret
Note that this uses numpy. It's very likely that you don't have this installed, but I'm not sure how to write a reasonably-performing dynamic programming algorithm in Python without it; I don't think Python lists have anywhere near the performance of Numpy arrays. Note also that this code and your code deal with zeros differently, so rather than debug this I'll just say that this code is for nonzero positive integers in the number list. Now, with this algorithm, profiling gives us:
243 function calls (7 primitive calls) in 0.001 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
237/1 0.001 0.000 0.001 0.001 <ipython-input-155-4a624e5a99b7>:9(find_sum_realDP_do)
1 0.000 0.000 0.001 0.001 <ipython-input-155-4a624e5a99b7>:1(find_sum_realDP)
1 0.000 0.000 0.000 0.000 {numpy.core.multiarray.zeros}
1 0.000 0.000 0.001 0.001 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
243 is a great deal better than even the recursive version! But your example data is small enough that it doesn't really show off how much better a dynamic programming algorithm is.
Let's try nlist2 = [7, 6, 2, 3, 7, 7, 2, 7, 4, 2, 4, 5, 6, 1, 7, 4, 6, 3, 2, 1, 1, 1, 4,
2, 3, 5, 2, 4, 4, 2, 4, 5, 4, 2, 1, 7, 6, 6, 1, 5, 4, 5, 3, 2, 3, 7,
1, 7, 6, 6], with the same sum_to_find=15. This has 50 values, and 900206 ways to get 15...
With find_sum_recursive:
3335462 function calls (2223643 primitive calls) in 14.137 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1111820/1 13.608 0.000 14.137 14.137 <ipython-input-46-7488a6455e38>:1(find_sum_recursive)
1111820 0.422 0.000 0.422 0.000 {range}
1111820 0.108 0.000 0.108 0.000 {len}
1 0.000 0.000 14.137 14.137 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
And now with find_sum_realDP:
736 function calls (7 primitive calls) in 0.007 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
730/1 0.007 0.000 0.007 0.007 <ipython-input-155-4a624e5a99b7>:9(find_sum_realDP_do)
1 0.000 0.000 0.007 0.007 <ipython-input-155-4a624e5a99b7>:1(find_sum_realDP)
1 0.000 0.000 0.000 0.000 {numpy.core.multiarray.zeros}
1 0.000 0.000 0.007 0.007 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
So we have less than 1/1000th of the calls, and run in less than 1/2000th of the time. Of course, the bigger a list you use, the better the DP algorithm will work. On my computer, running with sum_to_find of 15 and a list of 600 random numbers from 1 to 8, realDP only takes 0.09 seconds, and has less than 10,000 function calls; it's around this point that the 64-bit integers I'm using start overflowing and we have all sorts of other problems. Needless to say, the recursive algorithm would never be able to handle a list anywhere near that size before the computer stopped functioning, either from the materials inside it breaking down or the heat death of the universe.
One thing is that your code does a lot of list copying. It would be faster if it just passed an index or indices to define a "window view", rather than copying the lists all over. For the first method you can easily add a parameter starting_index and use it in your for loop (see the sketch below). In the second method, you write number_list[:].pop(), copying the whole list just to get the last element, which you could simply get as number_list[-1]. You could also add a parameter ending_index and use it in your test (len(number_list) == ending_index instead of number_list != []; by the way, even plain number_list is a better truth test than comparing against an empty list).
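Here is a sketch of that "window view" idea for the first method (the parameter name starting_index is my own choice; the logic otherwise mirrors the original):
def find_sum_recursive_idx(number_list, sum_to_find, starting_index=0):
    count = 0
    for i in range(starting_index, len(number_list)):
        sub_sum = sum_to_find - number_list[i]
        if sub_sum < 0:
            continue
        elif sub_sum == 0:
            count += 1
        else:
            # recurse on the same list, just moving the window's start forward
            count += find_sum_recursive_idx(number_list, sub_sum, i + 1)
    return count
print(find_sum_recursive_idx([5, 5, 10, 3, 2, 9, 8], 15))  # 6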