I'm currently iterating through a very large data set, ~85 GB (~600M lines), and simply using Newton-Raphson to compute a new parameter. As it stands my code is extremely slow; any tips on how to speed it up? The methods from BSCallClass & BSPutClass are closed-form, so there's nothing really to speed up there. Thanks.
class NewtonRaphson:
    def __init__(self, theObject):
        self.theObject = theObject

    def solve(self, Target, Start, Tolerance, maxiter=500):
        y = self.theObject.Price(Start)
        x = Start
        i = 0
        while abs(y - Target) > Tolerance:
            i += 1
            d = self.theObject.Vega(x)
            x += (Target - y) / d
            y = self.theObject.Price(x)
            if i > maxiter:
                x = float('nan')  # give up after maxiter iterations
                break
        return x
def main():
    for row in a.iterrows():
        print row[1]["X.1"]
        T = (row[1]["X.7"] - row[1]["X.8"]).days
        Spot = row[1]["X.2"]
        Strike = row[1]["X.9"]
        MktPrice = abs(row[1]["X.10"] - row[1]["X.11"]) / 2
        CPflag = row[1]["X.6"]
        if CPflag == 'call':
            option = BSCallClass(0, 0, T, Spot, Strike)
        elif CPflag == 'put':
            option = BSPutClass(0, 0, T, Spot, Strike)
        a["X.15"][row[0]] = NewtonRaphson(option).solve(MktPrice, .05, .0001)
EDIT:
For those curious, I ended up speeding this entire process significantly by using the scipy suggestion, as well as using the multiprocessing module.
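For anyone wanting a starting point for the multiprocessing part: below is a rough sketch of the chunk-and-pool pattern, not the actual code used here; solve_row and its dummy body are purely illustrative stand-ins for one per-row implied-vol solve.

import multiprocessing

def solve_row(row):
    # stand-in for one per-row solve (spot, strike, market price, days to expiry)
    spot, strike, mkt_price, T_days = row
    return (spot - strike) / max(T_days, 1)   # dummy computation, not a real solver

if __name__ == '__main__':
    # in practice these tuples would be pulled out of the DataFrame in chunks
    rows = [(100.0, 95.0, 6.2, 30), (100.0, 105.0, 2.1, 60)] * 1000
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(solve_row, rows, chunksize=256)
    print(len(results))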
Don't code your own Newton-Raphson method in Python. You'll get better performance using one of the root finders in scipy.optimize such as brentq or newton.
(Presumably, if you have pandas, you'd also install scipy.)
Back of the envelope calculation:
Making 600M calls to brentq should be manageable on standard hardware:
import scipy.optimize as optimize

def f(x):
    return x**2 - 2

In [28]: %timeit optimize.brentq(f, 0, 10)
100000 loops, best of 3: 4.86 us per loop
So if each call to optimize.brentq takes 4.86 microseconds, 600M calls will take about 4.86 * 600 ≈ 2900 seconds, i.e. roughly 50 minutes.
newton may be slower, but still manageable:
def f(x):
    return x**2 - 2

def fprime(x):
    return 2*x

In [40]: %timeit optimize.newton(f, 10, fprime)
100000 loops, best of 3: 8.22 us per loop
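To make the suggestion concrete, here is a minimal sketch of solving for implied volatility with brentq. The Black-Scholes pricer below (zero rate, zero dividend) is only a stand-in for the asker's BSCallClass.Price, and the bracket [1e-6, 5.0] and the tolerance are arbitrary choices.

import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def bs_call_price(sigma, spot, strike, T_years):
    # plain Black-Scholes call with r = q = 0 (assumption for illustration only)
    if sigma <= 0 or T_years <= 0:
        return max(spot - strike, 0.0)
    d1 = (np.log(spot / strike) + 0.5 * sigma**2 * T_years) / (sigma * np.sqrt(T_years))
    d2 = d1 - sigma * np.sqrt(T_years)
    return spot * norm.cdf(d1) - strike * norm.cdf(d2)

def implied_vol(mkt_price, spot, strike, T_years):
    # root of f(sigma) = model price - market price, bracketed on [1e-6, 5.0]
    f = lambda sigma: bs_call_price(sigma, spot, strike, T_years) - mkt_price
    try:
        return brentq(f, 1e-6, 5.0, xtol=1e-6)
    except ValueError:  # no sign change inside the bracket
        return float('nan')

print(implied_vol(10.45, spot=100.0, strike=100.0, T_years=1.0))  # roughly 0.26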
Related
I'm working with DNA sequence alignments and trying to implement a simple scoring algorithm. Since I have to use a matrix for the calculations, I thought numpy should be way faster than a list of lists, but when I tested both, the Python lists seem to be way faster. I found this thread (Why use numpy over list based on speed?), but still: I'm comparing preallocated numpy against preallocated lists, and the lists of lists are the clear winners.
Here is my code:
Lists
def edirDistance(x, y):
    x_dim = len(x)+1
    y_dim = len(y)+1
    D = []
    for i in range(x_dim):
        D.append([0] * (y_dim))
    #Filling the matrix borders
    for i in range(x_dim):
        D[i][0] = i
    for i in range(y_dim):
        D[0][i] = i
    for i in range(1, x_dim):
        for j in range(1, y_dim):
            distHor = D[i][j-1] + 1
            distVer = D[i-1][j] + 1
            if x[i-1] == y[j-1]:
                distDiag = D[i-1][j-1]
            else:
                distDiag = D[i-1][j-1] + 1
            D[i][j] = min(distHor, distVer, distDiag)
    return D
Numpy
def NP_edirDistance(x, y):
    x_dim = len(x)+1
    y_dim = len(y)+1
    D = np.zeros((x_dim, y_dim))
    #Filling the matrix borders
    for i in range(x_dim):
        D[i][0] = i
    for i in range(y_dim):
        D[0][i] = i
    for i in range(1, x_dim):
        for j in range(1, y_dim):
            distHor = D[i][j-1] + 1
            distVer = D[i-1][j] + 1
            if x[i-1] == y[j-1]:
                distDiag = D[i-1][j-1]
            else:
                distDiag = D[i-1][j-1] + 1
            D[i][j] = min(distHor, distVer, distDiag)
    return D
I'm not timing the np import.
a = 'ACGTACGACTATCGACTAGCTACGAA'
b = 'ACCCACGTATAACGACTAGCTAGGGA'
%%time
edirDistance(a, b)
total: 1.41 ms
%%time
NP_edirDistance(a, b)
total: 4.43 ms
Replacing D[i][j] with D[i,j] greatly improved the time, but it is still slower than the lists. (Thanks @Learning is a mess!)
total: 2.64 ms
I tested with even larger DNA sequences (around 10,000 letters each) and lists still win.
Can someone help me improve timing?
Are lists better for this use?
One way to get a faster run is to use GPU/TPU-aided accelerators such as numba and …. I tested your code with those a and b on a Google Colab TPU runtime without using any accelerator:
1000 loops, best of 5: 563 µs per loop
1000 loops, best of 5: 1.95 ms per loop # NumPy
But using numba with nopython=True, without any changes to your code:

import numba as nb

@nb.njit()
def edirDistance(x, y):
    ...

@nb.njit()
def NP_edirDistance(x, y):
    ...
It gets:
1000 loops, best of 5: 213 µs per loop
1000 loops, best of 5: 153 µs per loop # NumPy
The difference between them becomes significant with huge samples, or by further improving and vectorizing your NumPy code. This method gives the results below for samples of length 10,000:
35.50053691864014  # lists (seconds)
22.95994758605957  # NumPy (seconds)
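For completeness, here is one way the jitted version could look in full; this is my own sketch, not the answerer's exact benchmark code. It assumes the sequences are first encoded as uint8 arrays (numba handles numeric arrays more comfortably than Python strings) and uses the faster D[i, j] indexing.

import numpy as np
import numba as nb

@nb.njit(cache=True)
def edit_distance_jit(x, y):
    x_dim = x.shape[0] + 1
    y_dim = y.shape[0] + 1
    D = np.zeros((x_dim, y_dim))
    for i in range(x_dim):          # fill the matrix borders
        D[i, 0] = i
    for j in range(y_dim):
        D[0, j] = j
    for i in range(1, x_dim):
        for j in range(1, y_dim):
            distHor = D[i, j-1] + 1
            distVer = D[i-1, j] + 1
            distDiag = D[i-1, j-1] + (0 if x[i-1] == y[j-1] else 1)
            D[i, j] = min(distHor, distVer, distDiag)
    return D

a = np.frombuffer(b'ACGTACGACTATCGACTAGCTACGAA', dtype=np.uint8)
b = np.frombuffer(b'ACCCACGTATAACGACTAGCTAGGGA', dtype=np.uint8)
edit_distance_jit(a, b)   # the first call compiles; time the second call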
I'm concerned with the speed of the following function:
def cch(tau):
    return np.sum(abs(-1*np.diff(cartprod) - tau) < 0.001)
Where "cartprod" is a variable for a list that looks like this:
cartprod = np.ndarray([[0.0123,0.0123],[0.0123,0.0459],...])
The length of this list is about 25 million. Basically, I'm trying to find a significantly faster way to return a list of differences for every pair list in that np.ndarray. Is there an algorithmic way or function that's faster than np.diff? Or, is np.diff the end all be all? I'm also open to anything else.
EDIT: Thank you all for your solutions!
I think you're hitting a wall because you repeatedly create temporary arrays of length ~25 million, not because np.diff itself is slow. I wrote an equivalent function that iterates over the array and tallies the results as it goes along. The function needs to be jitted with numba to be fast. I hope that is acceptable.
import numpy as np
from numba import jit

arr = np.random.rand(25000000, 2)

def cch(tau, cartprod):
    return np.sum(abs(-1*np.diff(cartprod) - tau) < 0.001)

%timeit cch(0.01, arr)

@jit(nopython=True)
def cch_jit(tau, cartprod):
    count = 0
    tau = -tau
    for i in range(cartprod.shape[0]):
        count += np.less(np.abs(tau - (cartprod[i, 1] - cartprod[i, 0])), 0.001)
    return count

%timeit cch_jit(0.01, arr)
produces
294 ms ± 2.82 ms
42.7 ms ± 483 µs
which is about 6 times faster.
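As a small aside (my own sketch, not part of the answer above): even without numba, computing the pairwise difference by column slicing avoids the temporary array that np.diff builds, which already trims some of that allocation cost.

import numpy as np

def cch_cols(tau, cartprod):
    # same count as np.sum(abs(-1*np.diff(cartprod) - tau) < 0.001)
    return np.count_nonzero(np.abs(cartprod[:, 0] - cartprod[:, 1] - tau) < 0.001)

arr = np.random.rand(1000000, 2)
print(cch_cols(0.01, arr))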
We can leverage multiple cores with the numexpr module for large data, and gain memory efficiency (and hence performance) with some help from array slicing -
import numpy as np
import numexpr as ne

def cch_numexpr(a, tau):
    d = {'a0': a[:,0], 'a1': a[:,1], 'tau': tau}
    return np.count_nonzero(ne.evaluate('abs(a0-a1-tau)<0.001', d))
Sample run and timings on 25M sized data -
In [83]: cartprod = np.random.rand(25000000,2)
In [84]: cch(cartprod, tau=0.5) == cch_numexpr(cartprod, tau=0.5)
Out[84]: True
In [85]: %timeit cch(cartprod, tau=0.5)
10 loops, best of 3: 150 ms per loop
In [86]: %timeit cch_numexpr(cartprod, tau=0.5)
10 loops, best of 3: 25.5 ms per loop
Around 6x speedup.
This was with 8 threads. Thus, with more threads available for compute, it should improve further. There is a related post on how to control numexpr's multi-core functionality.
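For reference, a tiny sketch of that thread control (the thread count of 4 is just an example value):

import numexpr as ne

print(ne.detect_number_of_cores())  # cores numexpr detected on this machine
ne.set_num_threads(4)               # threads used by subsequent ne.evaluate() calls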
Just out of curiosity I compared the solutions of @Divakar (numexpr) and @alexdor (numba.jit). Going by the timings below, the numba-jitted implementation comes out roughly twice as fast as numexpr.evaluate on my machine. The results are shown for 100 runs each:
np.sum: 111.07543396949768
numexpr: 12.282189846038818
JIT: 6.2505223751068115
'np.sum' returns same result as 'numexpr'
'np.sum' returns same result as 'jit'
'numexpr' returns same result as 'jit'
Script to reproduce the results:
import numpy as np
import time
import numba
import numexpr

arr = np.random.rand(25000000, 2)
runs = 100

def cch(tau, cartprod):
    return np.sum(abs(-1*np.diff(cartprod) - tau) < 0.001)

def cch_ne(tau, cartprod):
    d = {'a0': cartprod[:,0], 'a1': cartprod[:,1], 'tau': tau}
    count = np.count_nonzero(numexpr.evaluate('abs(a0-a1-tau)<0.001', d))
    return count

@numba.jit(nopython=True)
def cch_jit(tau, cartprod):
    count = 0
    tau = -tau
    for i in range(cartprod.shape[0]):
        count += np.less(np.abs(tau - (cartprod[i, 1] - cartprod[i, 0])), 0.001)
    return count

start = time.time()
for x in range(runs):
    x1 = cch(0.01, arr)
print('np.sum:\t\t', time.time() - start)

start = time.time()
for x in range(runs):
    x2 = cch_ne(0.01, arr)
print('numexpr:\t', time.time() - start)

x3 = cch_jit(0.01, arr)  # warm-up call so the JIT compilation is not timed
start = time.time()
for x in range(runs):
    x3 = cch_jit(0.01, arr)
print('JIT:\t\t', time.time() - start)

if x1 == x2: print('\'np.sum\' returns same result as \'numexpr\'')
if x1 == x3: print('\'np.sum\' returns same result as \'jit\'')
if x2 == x3: print('\'numexpr\' returns same result as \'jit\'')
Is there a faster way to write the "compute_optimal_weights" function in Python? I run it hundreds of millions of times, so any speed increase would help. The arguments of the function are different each time I run it.
import random
import time

c1 = 0.25
c2 = 0.67

def compute_optimal_weights(input_prices):
    input_weights_optimal = {}
    for i in input_prices:
        price = input_prices[i]
        input_weights_optimal[i] = c2 / sum([(price/n) ** c1 for n in input_prices.values()])
    return input_weights_optimal

input_sellers_ID = range(10)
input_prices = {}
for i in input_sellers_ID:
    input_prices[i] = random.uniform(0, 1)

t0 = time.time()
for i in xrange(1000000):
    compute_optimal_weights(input_prices)
t1 = time.time()
print "old time", (t1 - t0)
The number of elements in the list and dictionary varies, but on average there are about 10 elements. The keys in input_prices are the same across all calls, but the values change, so the same key will have different values over different runs.
Using a little bit of math, you can calculate part of your sum_price_ratio_scaled as a constant earlier in the loop and speed up your program by ~80% (for the average input size of 10).
Optimized Implementation (Python 3):
def compute_optimal_weights(ids, prices):
    scaled_sum = 0
    for i in ids:
        scaled_sum += prices[i] ** -0.25
    result = {}
    for i in ids:
        result[i] = 0.67 * (prices[i] ** -0.25) / scaled_sum
    return result
Edit, in response to this answer: While using numpy will prove more performant with massive data sets, given that "on average there are about 10 elements" in your input_sellers_ID list, I doubt that this approach is worth its own weight for your particular application.
Although it might be tempting to leverage the terseness of generator expressions and dictionary comprehensions, I noticed when running on my machine that the best performance was obtained by using regular for-in loops and avoiding function calls like sum(...). For the sake of completeness, though, here is what the above implementation would look like in a more 'pythonic' style:
def compute_optimal_weights(ids, prices):
    scaled_sum = sum(prices[i] ** -0.25 for i in ids)
    return {i: 0.67 * (prices[i] ** -0.25) / scaled_sum for i in ids}
Reasoning / Math:
Based on your posted algorithm, you are trying to create a dictionary whose values are given by the function f(i) below, where i is one of the elements in your input_sellers_ID list:

    f(i) = c2 / sum_j (prices[i] / prices[j]) ** c1

When you initially write out the formula for f(i), it appears as though the ratio prices[i] / prices[j] must be recalculated at every step of the summation, which is costly. Simplifying the expression with the rules of exponents, however,

    f(i) = c2 / (prices[i] ** c1 * sum_j prices[j] ** -c1) = c2 * prices[i] ** -c1 / sum_j prices[j] ** -c1

you can see that the only summation needed to determine f(i), sum_j prices[j] ** -c1, is actually independent of i (only the index j is ever used), meaning that term is a constant and can be calculated outside of the loop which sets the dictionary values.
Note that above I refer to input_prices as prices and input_sellers_ID as ids.
Performance Profile (~80% speed improvement on my machine, size 10):
import time
import random

c1 = 0.25
c2 = 0.67

def compute_optimal_weights(ids, prices):
    scaled_sum = 0
    for i in ids:
        scaled_sum += prices[i] ** -0.25
    result = {}
    for i in ids:
        result[i] = 0.67 * (prices[i] ** -0.25) / scaled_sum
    return result

def compute_optimal_weights_old(input_sellers_ID, input_prices):
    input_weights_optimal = {}
    for i in input_sellers_ID:
        sum_price_ratio_scaled = 0
        for j in input_sellers_ID:
            price_ratio = input_prices[i] / input_prices[j]
            scaled_price_ratio = price_ratio ** c1
            sum_price_ratio_scaled += scaled_price_ratio
        input_weights_optimal[i] = c2 / sum_price_ratio_scaled
    return input_weights_optimal

input_sellers_ID = range(10)
input_prices = {i: random.uniform(0, 1) for i in input_sellers_ID}

start = time.perf_counter()
for _ in range(1000000):
    compute_optimal_weights_old(input_sellers_ID, input_prices) and None
old_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(1000000):
    compute_optimal_weights(input_sellers_ID, input_prices) and None
new_time = time.perf_counter() - start

print('Old:', compute_optimal_weights_old(input_sellers_ID, input_prices))
print('New:', compute_optimal_weights(input_sellers_ID, input_prices))
print('New algorithm is {:.2%} faster.'.format(1 - new_time / old_time))
I believe we could speed up the function by factoring the loop. Let a = price, b = n and c = c1. If my maths are not wrong (e.g. (5/6)**3 == 5**3 / 6**3):
(5./6.)**2 + (5./4.)**2
==
5**2 / 6.**2 + 5**2 / 4.**2
==
5**2 * (1/6.**2 + 1/4.**2)
With variables:
sum( (a / b) ** c for each b)
==
sum( a**c * (1/b) ** c for each b)
==
a**c * sum((1./b)**c for each b)
The second factor (the sum) is the same for every key, so it can be computed once and taken out. Which leaves:
Faster implementation - Raw Python
Using generators and dict-comprehension:
def compute_optimal_weights(input_prices):
    sconst = sum(1/w**c1 for w in input_prices.values())
    return {k: c2 / (v**c1 * sconst) for k, v in input_prices.items()}
NOTE: if you are using Python 2, replace .values() and .items() with .itervalues() and .iteritems() for an extra speedup (a few ms with large inputs).
Even Faster - Numpy
Additionally, if you don't care that much about the dictionary and just want the values, you could speed it up using numpy (for large inputs >100):
import numpy as np

def compute_optimal_weights_np(input_prices):
    data = np.asarray(list(input_prices.values())) ** c1
    return c2 / (data * np.sum(1./data))
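As a quick sanity check (my own snippet, using the two functions defined just above), the NumPy version returns the same weights as the dictionary version, only as a plain array:

import random
import numpy as np

c1, c2 = 0.25, 0.67
prices = {i: random.uniform(0.1, 1.0) for i in range(10)}

d = compute_optimal_weights(prices)      # dict-comprehension version above
v = compute_optimal_weights_np(prices)   # NumPy version above
print(np.allclose(sorted(d.values()), sorted(v)))   # True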
A few timings for different input sizes:
N = 10 inputs:
MINE: 100000 loops, best of 3: 6.02 µs per loop
NUMPY: 100000 loops, best of 3: 10.6 µs per loop
YOURS: 10000 loops, best of 3: 23.8 µs per loop
N = 100 inputs:
MINE: 10000 loops, best of 3: 49.1 µs per loop
NUMPY: 10000 loops, best of 3: 22.6 µs per loop
YOURS: 1000 loops, best of 3: 1.86 ms per loop
N = 1000 inputs:
MINE: 1000 loops, best of 3: 458 µs per loop
NUMPY: 10000 loops, best of 3: 121 µs per loop
YOURS: 10 loops, best of 3: 173 ms per loop
N = 100000 inputs:
MINE: 10 loops, best of 3: 54.2 ms per loop
NUMPY: 100 loops, best of 3: 11.1 ms per loop
YOURS: didn't finish in a couple of minutes
Both options here are considerably faster than the one presented in the question. The benefit of using numpy, if you can provide consistent input (in the form of an array instead of a dictionary), becomes apparent as the size grows.
Given the Fourier series coefficients a[n] and b[n] (for cosines and sines respectively) of a function with period T, and t an equally spaced interval, the following code will evaluate the partial sum for all points in the interval t (a, b, t are all numpy arrays). Note that len(t) is not necessarily equal to len(a).
import numpy as np

yn = np.ones(len(t))*a[0]
for n in range(1, len(a)):
    yn = yn + (a[n]*np.cos(2*np.pi*n*t/T) - b[n]*np.sin(2*np.pi*n*t/T))
My question is: Can this for loop be vectorized?
Here's one vectorized approach making use of broadcasting to create the 2D array version of the cosine/sine input 2*pi*n*t/T, and then using matrix multiplication with np.dot for the sum-reduction -
r = np.arange(1,len(a))
S = 2*np.pi*r[:,None]*t/T
cS = np.cos(S)
sS = np.sin(S)
out = a[1:].dot(cS) - b[1:].dot(sS) + a[0]
Further performance boost
For a further boost, we can make use of the numexpr module to compute those trigonometric steps -
import numexpr as ne
cS = ne.evaluate('cos(S)')
sS = ne.evaluate('sin(S)')
Runtime test -
Approaches -
def original_app(t,a,b,T):
    yn = np.ones(len(t))*a[0]
    for n in range(1,len(a)):
        yn = yn + (a[n]*np.cos(2*np.pi*n*t/T) - b[n]*np.sin(2*np.pi*n*t/T))
    return yn

def vectorized_app(t,a,b,T):
    r = np.arange(1,len(a))
    S = (2*np.pi/T)*r[:,None]*t
    cS = np.cos(S)
    sS = np.sin(S)
    return a[1:].dot(cS) - b[1:].dot(sS) + a[0]

def vectorized_app_v2(t,a,b,T):
    r = np.arange(1,len(a))
    S = (2*np.pi/T)*r[:,None]*t
    cS = ne.evaluate('cos(S)')
    sS = ne.evaluate('sin(S)')
    return a[1:].dot(cS) - b[1:].dot(sS) + a[0]
Also including function PP from @Paul Panzer's post.
Timings -
In [22]: # Setup inputs
...: n = 10000
...: t = np.random.randint(0,9,(n))
...: a = np.random.randint(0,9,(n))
...: b = np.random.randint(0,9,(n))
...: T = 3.45
...:
In [23]: print np.allclose(original_app(t,a,b,T), vectorized_app(t,a,b,T))
...: print np.allclose(original_app(t,a,b,T), vectorized_app_v2(t,a,b,T))
...: print np.allclose(original_app(t,a,b,T), PP(t,a,b,T))
...:
True
True
True
In [25]: %timeit original_app(t,a,b,T)
...: %timeit vectorized_app(t,a,b,T)
...: %timeit vectorized_app_v2(t,a,b,T)
...: %timeit PP(t,a,b,T)
...:
1 loops, best of 3: 6.49 s per loop
1 loops, best of 3: 6.24 s per loop
1 loops, best of 3: 1.54 s per loop
1 loops, best of 3: 1.96 s per loop
Can't beat numexpr, but if it's not available we can save on the transcendentals (testing and benchmarking code heavily based on @Divakar's code, in case you didn't notice ;-) ):
import numpy as np
from timeit import timeit

def PP(t,a,b,T):
    CS = np.empty((len(t), len(a)-1), complex)
    CS[...] = np.exp(2j*np.pi*(t[:, None])/T)
    np.cumprod(CS, axis=-1, out=CS)
    return a[1:].dot(CS.T.real) - b[1:].dot(CS.T.imag) + a[0]

def original_app(t,a,b,T):
    yn = np.ones(len(t))*a[0]
    for n in range(1,len(a)):
        yn = yn + (a[n]*np.cos(2*np.pi*n*t/T) - b[n]*np.sin(2*np.pi*n*t/T))
    return yn

def vectorized_app(t,a,b,T):
    r = np.arange(1,len(a))
    S = 2*np.pi*r[:,None]*t/T
    cS = np.cos(S)
    sS = np.sin(S)
    return a[1:].dot(cS) - b[1:].dot(sS) + a[0]

n = 1000
t = 2000
t = np.random.randint(0,9,(t))
a = np.random.randint(0,9,(n))
b = np.random.randint(0,9,(n))
T = 3.45

print(np.allclose(original_app(t,a,b,T), vectorized_app(t,a,b,T)))
print(np.allclose(original_app(t,a,b,T), PP(t,a,b,T)))

print('{:18s} {:9.6f}'.format('orig', timeit(lambda: original_app(t,a,b,T), number=10)/10))
print('{:18s} {:9.6f}'.format('Divakar no numexpr', timeit(lambda: vectorized_app(t,a,b,T), number=10)/10))
print('{:18s} {:9.6f}'.format('PP', timeit(lambda: PP(t,a,b,T), number=10)/10))
Prints:
True
True
orig 0.166903
Divakar no numexpr 0.179617
PP 0.060817
Btw, if the time step delta t divides T, one can potentially save even more, or even run the full FFT and discard what's too much.
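To illustrate that last remark (my reading of it, as a sketch): if t is the uniform grid t_k = k*T/N for k = 0..N-1, the partial sum is just a[0] plus the real part of an inverse DFT of the coefficients c_n = a_n + i*b_n (with c_0 = 0).

import numpy as np

def fourier_partial_sum_fft(a, b, N):
    # requires len(a) - 1 < N harmonics and evaluation on the grid t_k = k*T/N
    c = np.zeros(N, dtype=complex)
    c[1:len(a)] = a[1:] + 1j * b[1:len(a)]
    return a[0] + N * np.fft.ifft(c).real

# small check against the direct loop
N, M, T = 64, 10, 3.45
a = np.random.rand(M); b = np.random.rand(M)
t = np.arange(N) * T / N
direct = np.full(N, a[0])
for n in range(1, M):
    direct += a[n]*np.cos(2*np.pi*n*t/T) - b[n]*np.sin(2*np.pi*n*t/T)
print(np.allclose(direct, fourier_partial_sum_fft(a, b, N)))   # True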
This is not really another answer but a comment on @Paul Panzer's one, written as an answer because I needed to post some code. If there is a way to post properly formatted code in a comment, please advise.
Inspired by @Paul Panzer's cumprod idea, I came up with the following:
an = np.ones((len(a)-1, len(te))) * 2j*np.pi*te/T
CS = np.exp(np.cumsum(an, axis=0))
out = (a[1:].dot(CS.real) - b[1:].dot(CS.imag)) + a[0]
Although it seems properly vectorized and produces correct results, its performance is miserable. It is not only much slower than the cumprod version, which is expected since len(a)-1 more exponentiations are made, but also 50% slower than the original unvectorized version. What is the cause of this poor performance?
I wrote physics simulation code in Python using numpy and then rewrote it in C++. In C++ it takes only 0.5 seconds, while in Python it takes around 40 s. Can someone please help me find what I did horribly wrong?
import numpy as np

def myFunc(i):
    uH = np.copy(u)
    for j in range(1, xmax-1):
        u[i][j] = a*uH[i][j-1] + (1-2*a)*uH[i][j] + a*uH[i][j+1]
    u[i][0] = u[i][0]/b
    for x in range(1, xmax):
        u[i][x] = (u[i][x] + a*u[i][x-1])/(b + a*c[x-1])
    for x in range(xmax-2, -1, -1):
        u[i][x] = u[i][x] - c[x]*u[i][x+1]

xmax = 101
tmax = 2000
#All other variables are defined here but I removed that for visibility
uH = np.zeros((xmax, xmax))
u = np.zeros((xmax, xmax))
c = np.full(xmax, -a)
uH[50][50] = 10000

for t in range(1, tmax):
    if t % 2 == 0:
        for i in range(0, xmax):
            myFunc(i)
    else:
        for i in range(0, xmax):
            myFunc(i)
In case someone wants to run it, here is the whole code: http://pastebin.com/20ZSpBqQ
EDIT: all variables are defined in the whole code, which can be found on pastebin. Sorry for the confusion, I thought removing all the clutter would make the code easier to understand.
Fundamentally, C is a compiled language while Python is an interpreted one: speed against ease of use. NumPy can fill the gap, but you must avoid for-loops over items, which often takes some skill.
For example,
def block1():
    for i in range(xmax):
        for j in range(1, xmax-1):
            u[i][j] = a*uH[i][j-1] + (1-2*a)*uH[i][j] + a*uH[i][j+1]
is, in numpy style:
def block2():
    u[:,1:-1] += a*np.diff(u,2)
which is shorter and faster (and easier to read and understand?):
In [37]: %timeit block1()
10 loops, best of 3: 25.8 ms per loop
In [38]: %timeit block2()
10000 loops, best of 3: 123 µs per loop
Finally, you can speed up numpy code with just-in-time compilation, which Numba allows. Just change the beginning of your code like:
import numba

@numba.jit
def myFunc(u, i):
    ...
and the calls to myFunc(u, i) at the end of the script (u must be a parameter for automatic determination of types), and you will reach the same performance (0.4 s on my PC).
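A fleshed-out version of that suggestion, as a sketch of my own (the values of a, b and xmax below are dummies, since the question stripped those definitions): pass everything the function touches as arguments so numba can infer the types, and compile in nopython mode.

import numpy as np
import numba

@numba.njit
def myFunc(u, c, a, b, xmax, i):
    uH = u.copy()
    for j in range(1, xmax - 1):
        u[i, j] = a*uH[i, j-1] + (1 - 2*a)*uH[i, j] + a*uH[i, j+1]
    u[i, 0] = u[i, 0] / b
    for x in range(1, xmax):
        u[i, x] = (u[i, x] + a*u[i, x-1]) / (b + a*c[x-1])
    for x in range(xmax - 2, -1, -1):
        u[i, x] = u[i, x] - c[x]*u[i, x+1]

xmax, a = 101, 0.25   # dummy values for illustration
b = 1 + 2*a
u = np.zeros((xmax, xmax))
u[50, 50] = 10000.0
c = np.full(xmax, -a)
for i in range(xmax):
    myFunc(u, c, a, b, xmax, i)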
So when I ran your numpy Python code it took four minutes to run; once I removed the numpy code and replaced it with standard Python code, it only took one minute! (I have a not-so-fast computer.)
Here's that code:
#import numpy as np

def impl(i, row):
    if row:
        uH = [r[:] for r in u]  # row-wise copy of the list-of-lists 'u'
        for j in range(1, xmax-1):
            u[i][j] = a*uH[i][j-1] + (1-2*a)*uH[i][j] + a*uH[i][j+1]
        u[i][0] = u[i][0]/b
        for x in range(1, xmax):
            u[i][x] = (u[i][x] + a*u[i][x-1])/(b + a*c[x-1])
        for x in range(xmax-2, -1, -1):
            u[i][x] = u[i][x] - c[x]*u[i][x+1]
    else:
        uH = [r[:] for r in u]  # row-wise copy of the list-of-lists 'u'
        for j in range(1, xmax-1):
            u[j][i] = a*uH[j-1][i] + (1-2*a)*uH[j][i] + a*uH[j+1][i]
        u[0][i] = u[0][i]/b
        for y in range(1, xmax):
            u[y][i] = (u[y][i] + a*u[y-1][i])/(b + a*c[y-1])
        for y in range(xmax-2, -1, -1):
            u[y][i] = u[y][i] - c[y]*u[y+1][i]

#Init
xmax = 101
tmax = 2000
D = 0.5
l = 1
tSec = 0.1

uH = [[0.0]*xmax for _ in range(xmax)]  # np.zeros((xmax,xmax)); independent rows, not xmax references to one row
u = [[0.0]*xmax for _ in range(xmax)]   # np.zeros((xmax,xmax))

dx = l / xmax
dt = tSec / tmax
a = (D*dt)/(dx*dx)
b = 1 + 2*a
print("dx==" + str(dx))
print("dt==" + str(dt))
print(" a==" + str(a))

#coefficient c in the tridiagonal matrix
c = [-a]*xmax  # np.full(xmax,-a)
c[0] = c[0]/b
for i in range(1, xmax):
    c[i] = c[i]/(b + a*c[i-1])

uH[50][50] = 10000
u = uH

for t in range(1, tmax):
    if t % 2 == 0:
        for i in range(0, xmax):
            impl(i, False)
    else:
        for i in range(0, xmax):
            impl(i, True)
I believe that this could be much faster if you had used numpy the correct way rather than as a substitute for arrays; however, not using numpy arrays cut the time to a quarter of the original.
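For what "using numpy the correct way" could look like, here is a small sketch of my own: the first inner loop of impl() is a pure three-point stencil, so it can be written as a single slice expression with no per-element Python overhead. (The forward and backward sweeps that follow have sequential dependencies along the row, so they cannot be sliced away as easily.)

import numpy as np

xmax, a = 101, 0.25   # dummy values for illustration
u = np.zeros((xmax, xmax))
u[50, 50] = 10000.0
uH = u.copy()
i = 50

# loop version:   for j in range(1, xmax-1):
#                     u[i][j] = a*uH[i][j-1] + (1-2*a)*uH[i][j] + a*uH[i][j+1]
u[i, 1:-1] = a*uH[i, :-2] + (1 - 2*a)*uH[i, 1:-1] + a*uH[i, 2:]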