I am trying to compute the function below with theano/aesara in a preferably vectorized manner:
![image|620x182](upload://9Px5wAGjZdkBXVBg4fqmuSPorPr.png)
The solution I have is not vectorized and therefore way too slow:
import numpy as np
import theano.tensor as tt

def apply_adstock_with_lag(x, L, P, D):
    """
    params:
    x: original array
    L: length
    P: peak, delay in effect
    D: decay, retain
    """
    x = np.append(np.zeros(L - 1), x)

    weights = [0 for _ in range(L)]
    for l in range(L):
        weight = D ** ((l - P) ** 2)
        weights[L - 1 - l] = weight
    weights = np.array(weights)

    adstocked_x = []
    for i in range(L - 1, len(x)):
        x_array = x[i - L + 1:i + 1]
        xi = sum(x_array * weights) / sum(weights)
        adstocked_x.append(xi)
    adstocked_x = tt.as_tensor_variable(adstocked_x)

    return adstocked_x
A similar, though simpler, function and its vectorized solution can be found below; note that it is much quicker, probably due to the vectorized operations:
![image|252x39](upload://ucZeqCmCXcBRAHLdA7lJ0crs1Oz.png)
def adstock_geometric_theano_pymc3(x, theta):
    x = tt.as_tensor_variable(x)

    def adstock_geometric_recurrence_theano(index, input_x, decay_x, theta):
        return tt.set_subtensor(decay_x[index], tt.sum(input_x + theta * decay_x[index - 1]))

    len_observed = x.shape[0]

    x_decayed = tt.zeros_like(x)
    x_decayed = tt.set_subtensor(x_decayed[0], x[0])

    output, _ = theano.scan(
        fn=adstock_geometric_recurrence_theano,
        sequences=[tt.arange(1, len_observed), x[1:len_observed]],
        outputs_info=x_decayed,
        non_sequences=theta,
        n_steps=len_observed - 1
    )

    return output[-1]
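For reference, the recurrence this scan implements is just x_decayed[t] = x[t] + theta * x_decayed[t - 1]; a plain NumPy version of the same recurrence (useful only for checking the scan output, not for the symbolic graph) would be:

import numpy as np

def adstock_geometric_numpy(x, theta):
    # same recurrence as the scan above, computed eagerly
    x_decayed = np.empty(len(x), dtype=float)
    x_decayed[0] = x[0]
    for t in range(1, len(x)):
        x_decayed[t] = x[t] + theta * x_decayed[t - 1]
    return x_decayed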
I can't come up with a vectorized solution to my adstock function; can anyone give it a go?
Have you tried:
import numpy as np
import aesara.tensor as at

def apply_adstock_with_lag(x, L, P, D):
    weights = D ** ((np.arange(0, L, 1) - P) ** 2)
    adstocked_x = np.convolve(x, weights)[:-(L - 1)] / sum(weights)
    adstocked_x = at.as_tensor_variable(adstocked_x)
    return adstocked_x
This should work
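A quick NumPy-only sanity check (with hypothetical L, P, D values, and skipping the tensor conversion) that the convolution reproduces the original loop:

import numpy as np

x = np.random.rand(20)   # example series
L, P, D = 4, 1, 0.6      # hypothetical length, peak and decay values

# loop version: weights stored in reverse so index L-1 holds lag 0
weights = np.array([D ** ((l - P) ** 2) for l in range(L)])[::-1]
padded = np.append(np.zeros(L - 1), x)
loop_out = np.array([np.dot(padded[i - L + 1:i + 1], weights) / weights.sum()
                     for i in range(L - 1, len(padded))])

# convolution version from the answer above
kernel = D ** ((np.arange(0, L, 1) - P) ** 2)
conv_out = np.convolve(x, kernel)[:-(L - 1)] / kernel.sum()

assert np.allclose(loop_out, conv_out)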
Related
I'm writing code that solves a heat equation using an implicit method. The problem is that the values between the first and last layers of the matrix are NaNs. What could be the problem?
From my point of view, the main issue might be with line 105, which represents the conversion of the original function to the one that includes the boundary functions.
Boundary functions code:
def func(x, t):
    return x*(1 - x)*np.exp(-2*t)

# boundary functions for x = 0 and x = 1
def q0(t):
    return t*np.exp(-t/0.1)*np.cos(t)  # boundary condition at x = 0

def q1(t):
    return t*np.exp(-t/0.5)*np.cos(t)  # boundary condition at x = 1

def derivative(f, x0, step):
    return (f(x0+step) - f(x0))/step

# boundary function for t = 0
def u_x0(x):
    return (-x + 1)*x
Function that solves the tridiagonal matrix equation:
def solution(a, b):
    n = len(a)
    x = [0 for k in range(0, n)]
    # forward sweep
    v = [0 for k in range(0, n)]
    u = [0 for k in range(0, n)]
    # first row (t = 0)
    v[0] = a[0][1] / (-a[0][0])
    u[0] = (-b[0]) / (-a[0][0])
    for i in range(1, n - 1):
        v[i] = a[i][i+1] / (-a[i][i] - a[i][i-1]*v[i-1])
        u[i] = (a[i][i-1]*u[i-1] - b[i]) / (-a[i][i] - a[i][i-1]*v[i-1])
    # last row (t = 1)
    v[n-1] = 0
    u[n-1] = (a[n-1][n-2]*u[n-2] - b[n-1]) / (-a[n-1][n-1] - a[n-1][n-2]*v[n-2])
    # backward substitution
    x[n-1] = u[n-1]
    for i in range(n-1, 0, -1):
        x[i-1] = v[i-1] * x[i] + u[i-1]
    return x
Coefficient matrix values:
A = -t/h**2
B = 1 + 2*t/h**2
C = -t/h**2
Code that actually solves the matrix:
i = 1
X = []
while i < 99:
    X = solution(cool_array, f)
    k = 0
    while k < len(x_i):
        #line-105
        X[k] += 0.01*(func(x_i[k], x_i[i]) - (1 - x_i[i])*derivative(q0, x_i[i], 0.01) - (x_i[i])*derivative(q1, x_i[i], 0.01))
        k += 1
    a = 1
    while a < 98:
        w_h_t[i][a] = X[a]
        a += 1
    f = X
    f[0] = w_h_t[i][0]
    f[99] = w_h_t[i][99]
    i += 1
print(w_h_t)
As far as I understand, the algorithm in solution(a, b) is written properly, so I guess the problem might be with the boundary functions or with line 105. The output I expect is at least an array of numbers, not NaNs.
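To rule out the solver itself, here is a quick standalone check (with hypothetical t and h values) that the solution() function above reproduces np.linalg.solve on a small tridiagonal system built from the same A, B, C pattern:

import numpy as np

n = 6
t, h = 0.01, 0.1
A = -t / h**2
B = 1 + 2 * t / h**2
C = -t / h**2

# assemble the full matrix the way solution() indexes it
a = np.zeros((n, n))
for i in range(n):
    a[i][i] = B
    if i > 0:
        a[i][i - 1] = A
    if i < n - 1:
        a[i][i + 1] = C

b = np.random.rand(n)
assert np.allclose(solution(a, b), np.linalg.solve(a, b))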
I'm trying to write a function to evaluate the probability mass function for the bivariate poisson distribution.
This is easy when all of the parameters (x, y, theta1, theta2, theta0) are scalars, but tricky to scale up without loops to allow these parameters to be vectors. I need it to scale such that, for:
theta0 being a scalar - the "correlation parameter" in the equation
theta1 and theta2 having length l
x, y both having length n
the output array would have shape (l, n, n). For example, a slice [j, :, :] from the output array would correspond to a single (theta1, theta2) pair evaluated over every combination of x and y.
The first part (the constant, before the summation) I think I've figured out:
import numpy as np
from scipy.special import factorial

def constant(theta1, theta2, theta0, x, y):
    exponential_part = np.exp(-(theta1 + theta2 + theta0)).reshape(-1, 1, 1)
    x = np.tile(x, (len(x), 1)).transpose()
    y = np.tile(y, (len(y), 1))
    double_factorial = (np.power(np.array(theta1).reshape(-1, 1, 1), x)/factorial(x)) * \
                       (np.power(np.array(theta2).reshape(-1, 1, 1), y)/factorial(y))
    return exponential_part * double_factorial
But I'm struggling with the summation part. How can I vectorize a summation where the limits depend on variable arrays?
I think I have this figured out, based on the approach that @w-m suggests: calculate every possible summation term which could appear, based on the maximum x or y value present, and use a mask to get rid of the ones you don't want. Assuming your x and y terms go from 0 to N in consecutive order, this calculates up to three times more terms than are actually required, but that is offset by getting to use vectorization.
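Here is the masking idea in isolation, on a toy sum whose upper limit varies per element (not the Poisson terms themselves):

import numpy as np

limits = np.array([2, 5, 3])                      # per-element upper limits n_j
i = np.arange(limits.max() + 1).reshape(-1, 1)    # every candidate index 0..max
terms = 0.5 ** i                                  # any per-term expression
masked = np.where(i <= limits, terms, 0.0)        # zero out terms past each limit
result = masked.sum(axis=0)                       # sum_{i=0}^{n_j} 0.5**i per element

# check against explicit loops
expected = np.array([sum(0.5 ** k for k in range(n + 1)) for n in limits])
assert np.allclose(result, expected)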
Reference implementation
I wrote this by first writing a pure-Python reference implementation, which just implements your problem using loops. With 4 nested loops, it's not exactly fast, but it's handy to have while testing the numpy version.
import numpy as np
from scipy.special import factorial, comb
import operator as op
from functools import reduce

def choose(n, r):
    # https://stackoverflow.com/a/4941932/530160
    r = min(r, n-r)
    numer = reduce(op.mul, range(n, n-r, -1), 1)
    denom = reduce(op.mul, range(1, r+1), 1)
    return numer // denom  # or / in Python 2

def reference_impl_constant(s_theta1, s_theta2, s_theta0, s_x, s_y):
    # Cast to float to prevent overflow
    s_theta1 = float(s_theta1)
    s_theta2 = float(s_theta2)
    s_theta0 = float(s_theta0)
    s_x = float(s_x)
    s_y = float(s_y)
    term1 = np.exp(-(s_theta1 + s_theta2 + s_theta0))
    term2 = (s_theta1 ** s_x / factorial(s_x))
    term3 = (s_theta2 ** s_y / factorial(s_y))
    assert term1 >= 0
    assert term2 >= 0
    assert term3 >= 0
    return term1 * term2 * term3

def reference_impl_constant_loop(theta1, theta2, theta0, x, y):
    theta_len = theta1.shape[0]
    xy_len = x.shape[0]
    constant_array = np.zeros((theta_len, xy_len, xy_len))
    for i in range(theta_len):
        for j in range(xy_len):
            for k in range(xy_len):
                s_theta1 = theta1[i]
                s_theta2 = theta2[i]
                s_theta0 = theta0
                s_x = x[j]
                s_y = y[k]
                constant_term = reference_impl_constant(s_theta1, s_theta2, s_theta0, s_x, s_y)
                assert constant_term >= 0
                constant_array[i, j, k] = constant_term
    return constant_array

def reference_impl_summation(s_theta1, s_theta2, s_theta0, s_x, s_y):
    sum_ = 0
    for i in range(min(s_x, s_y) + 1):
        sum_ += choose(s_x, i) * choose(s_y, i) * factorial(i) * ((s_theta0/s_theta1/s_theta2) ** i)
    assert sum_ >= 0
    return sum_

def reference_impl_summation_loop(theta1, theta2, theta0, x, y):
    theta_len = theta1.shape[0]
    xy_len = x.shape[0]
    summation_array = np.zeros((theta_len, xy_len, xy_len))
    for i in range(theta_len):
        for j in range(xy_len):
            for k in range(xy_len):
                s_theta1 = theta1[i]
                s_theta2 = theta2[i]
                s_theta0 = theta0
                s_x = x[j]
                s_y = y[k]
                summation_term = reference_impl_summation(s_theta1, s_theta2, s_theta0, s_x, s_y)
                assert summation_term >= 0
                summation_array[i, j, k] = summation_term
    return summation_array

def reference_impl(theta1, theta2, theta0, x, y):
    # all array inputs must be 1D
    assert len(theta1.shape) == 1
    assert len(theta2.shape) == 1
    assert len(x.shape) == 1
    assert len(y.shape) == 1
    # theta vectors must have same length
    theta_len = theta1.shape[0]
    assert theta2.shape[0] == theta_len
    # x and y must have same length
    xy_len = x.shape[0]
    assert y.shape[0] == xy_len
    # theta0 is scalar
    assert isinstance(theta0, (int, float))
    constant_array = np.zeros((theta_len, xy_len, xy_len))
    output = np.zeros((theta_len, xy_len, xy_len))
    constant_array = reference_impl_constant_loop(theta1, theta2, theta0, x, y)
    summation_array = reference_impl_summation_loop(theta1, theta2, theta0, x, y)
    output = constant_array * summation_array
    return output
Numpy implementation
I split the implementation of this across two functions.
The fast_constant() function calculates everything to the left of the summation symbol. The fast_summation() function calculates everything inside the summation symbol.
import numpy as np
from scipy.special import factorial, comb

def fast_summation(theta1, theta2, theta0, x, y):
    x = np.tile(x, (len(x), 1)).transpose()
    y = np.tile(y, (len(y), 1))
    sum_limit = np.minimum(x, y)
    max_sum_limit = np.max(sum_limit)
    i = np.arange(max_sum_limit + 1).reshape(-1, 1, 1)
    summation_mask = (i <= sum_limit)
    theta_ratio = (theta0 / (theta1 * theta2)).reshape(-1, 1, 1, 1)
    theta_to_power = np.power(theta_ratio, i)
    terms = comb(x, i) * comb(y, i) * factorial(i) * theta_to_power
    # mask out terms which aren't part of sum
    terms *= summation_mask
    # axis 0 is theta
    # axis 1 is i
    # axis 2 & 3 are x and y
    # so sum across axis 1
    terms = terms.sum(axis=1)
    return terms

def fast_constant(theta1, theta2, theta0, x, y):
    theta1 = theta1.astype('float64')
    theta2 = theta2.astype('float64')
    exponential_part = np.exp(-(theta1 + theta2 + theta0)).reshape(-1, 1, 1)
    # x and y must be 1D
    assert len(x.shape) == 1
    assert len(y.shape) == 1
    # x and y must have same shape
    assert x.shape == y.shape
    x_len, y_len = x.shape[0], y.shape[0]
    x = x.reshape((x_len, 1))
    y = y.reshape((1, y_len))
    double_factorial = (np.power(np.array(theta1).reshape(-1, 1, 1), x)/factorial(x)) * \
                       (np.power(np.array(theta2).reshape(-1, 1, 1), y)/factorial(y))
    return exponential_part * double_factorial

def fast_impl(theta1, theta2, theta0, x, y):
    return fast_summation(theta1, theta2, theta0, x, y) * fast_constant(theta1, theta2, theta0, x, y)
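A small test of the fast version against the reference implementation above, with hypothetical parameter values:

import numpy as np

theta1 = np.array([0.8, 1.5])
theta2 = np.array([1.2, 0.7])
theta0 = 0.5
x = np.arange(8)
y = np.arange(8)

ref = reference_impl(theta1, theta2, theta0, x, y)
fast = fast_impl(theta1, theta2, theta0, x, y)
assert ref.shape == fast.shape == (2, 8, 8)
assert np.allclose(ref, fast)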
Benchmarking
Assuming that X and Y range from 0 to 20, and that theta is centered somewhere inside that range, I get the result that the numpy version is roughly 280 times faster than the pure python reference.
Numerical stability
I'm unsure how numerically stable this is. For example, when I center theta at 100, I get a floating-point overflow. Typically, when computing an expression which has lots of choose and factorial expressions inside it, you'll use some mathematical equivalent which results in smaller intermediate sums. In this case I have so little understanding of the math that I don't know how you'd do that.
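If overflow does become a problem in practice, one standard trick is to compute each summation term in log space and combine them with logsumexp; a sketch for a single (x, y) pair with scalar parameters, not wired into the code above:

import numpy as np
from scipy.special import gammaln, logsumexp

def log_summation(theta1, theta2, theta0, x, y):
    # log of choose(x, i) * choose(y, i) * i! * (theta0/(theta1*theta2))**i,
    # summed over i in log space to avoid overflowing intermediates
    i = np.arange(min(x, y) + 1)
    log_choose_x = gammaln(x + 1) - gammaln(i + 1) - gammaln(x - i + 1)
    log_choose_y = gammaln(y + 1) - gammaln(i + 1) - gammaln(y - i + 1)
    log_terms = (log_choose_x + log_choose_y + gammaln(i + 1)
                 + i * np.log(theta0 / (theta1 * theta2)))
    return logsumexp(log_terms)

# values this large overflow the direct computation but are fine here
print(log_summation(100.0, 100.0, 10.0, 300, 280))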
I'm trying to implement a calculation, but I can't figure out how to vectorize my code without using loops.
Let me explain: I have a matrix M[N,C] of zeros and ones, another matrix Y[N,1] containing values in [0, C-1] (my classes), and another matrix ds[N,M], which is my dataset.
My output matrix is of size grad[M,C] and should be calculated as follows (I'll explain for grad[:,0]; the same logic applies to any other column):
For each row (sample) in ds, if Y[that sample] != 0 (the current column of the output matrix) and M[that sample, 0] > 0, then grad[:,0] += ds[that sample].
If Y[that sample] == 0, then grad[:,0] -= (ds[that sample] * <number of non-zeros in M[that sample,:]>).
Here is my iterative approach:
for i in range(M.size(dim=1)):
    for j in range(ds.size(dim=0)):
        if y[j] == i:
            grad[:,i] = grad[:,i] - (ds[j,:].T * sum(M[j,:]))
        else:
            if M[j,i] > 0:
                grad[:,i] = grad[:,i] + ds[j,:].T
Since you are dealing with three dimensions n, m, and c (in lowercase to avoid ambiguity), it can be useful to bring all your tensors to shape (n, m, c) by replicating their values over the missing dimension (e.g. M, of shape (n, c), becomes a tensor of shape (n, m, c)).
However, you can skip the explicit replication and rely on broadcasting, so it is sufficient to unsqueeze the missing dimension (e.g. M, of shape (n, c), becomes (n, 1, c)).
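A small illustration of the unsqueeze/broadcasting idea on toy shapes (unrelated to the actual data), showing how the singleton dimensions expand without copying:

import torch

n, m, c = 4, 3, 2
ds = torch.rand(n, m)
M = torch.rand(n, c)

print(ds.unsqueeze(2).shape)                      # torch.Size([4, 3, 1])
print(M.unsqueeze(1).shape)                       # torch.Size([4, 1, 2])
print((ds.unsqueeze(2) * M.unsqueeze(1)).shape)   # torch.Size([4, 3, 2])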
Given these considerations, the vectorization of your code becomes as follows:
cond = y.unsqueeze(2) == torch.arange(M.size(dim=1)).unsqueeze(0)
pos = ds.unsqueeze(2) * M.unsqueeze(1) * cond
neg = ds.unsqueeze(2) * M.unsqueeze(1).sum(dim=0, keepdim=True) * ~cond
grad += (pos - neg).sum(dim=0)
Here is a small test to check the validity of the solution
import torch

n, m, c = 11, 5, 7

y = torch.randint(c, size=(n, 1))
ds = torch.rand(n, m)
M = torch.randint(2, size=(n, c))
grad = torch.rand(m, c)

def slow_grad(y, ds, M, grad):
    for i in range(M.size(dim=1)):
        for j in range(ds.size(dim=0)):
            if y[j] == i:
                grad[:,i] = grad[:,i] - (ds[j,:].T * sum(M[j,:]))
            else:
                if M[j,i] > 0:
                    grad[:,i] = grad[:,i] + ds[j,:].T
    return grad

def fast_grad(y, ds, M, grad):
    cond = y.unsqueeze(2) == torch.arange(M.size(dim=1)).unsqueeze(0)
    pos = ds.unsqueeze(2) * M.unsqueeze(1) * cond
    neg = ds.unsqueeze(2) * M.unsqueeze(1).sum(dim=0, keepdim=True) * ~cond
    grad += (pos - neg).sum(dim=0)
    return grad

# Assert equality of all elements of the function outputs; throws an exception if false
assert torch.all(slow_grad(y, ds, M, grad) == fast_grad(y, ds, M, grad))
Feel free to test on other cases as well!
I am implementing a sequential algorithm (a Kalman filter) with a particular structure where a lot of the inner looping can be done in parallel. I need to get as much performance out of this function as possible. Currently, it runs in about 600 ms on my machine with representative data inputs (n, p = 12, d = 3, T = 3000).
I have used @numba.jit with nopython=True, parallel=True and annotated my ranges with numba.prange. However, even with very large data inputs (n > 5000) there is clearly no parallelism occurring (based on just watching the cores with top).
There is quite a bit of code here, so I'm showing only the main chunk. Is there a reason Numba wouldn't be able to parallelize the array operations under the prange? I have also checked numba.config.NUMBA_NUM_THREADS (it is 8) and played with different numba.config.THREADING_LAYER settings (it is currently 'tbb'). I have also tried both the OpenBLAS and MKL builds of numpy+scipy; the MKL version appears to be slightly slower, and there is still no parallelization.
The annotation is:
@numba.jit(nopython=True, cache=False, parallel=True,
           fastmath=True, nogil=True)
And the main part of the function:
P = np.empty((T + 1, n, p, d, d))
m = np.empty((T + 1, n, p, d))
P[0] = P0
m[0] = m0
phi = 0.0
Xt = np.empty((n, p))
for t in range(1, T + 1):
    sum_P00 = 0.0
    v = y[t - 1]
    # Purely for convenience, little performance impact
    for tau in range(1, p + 1):
        Xt[:, tau - 1] = X[p + t - 1 - tau]
    # Predict
    for i in numba.prange(n):
        for tau in range(p):
            # Prediction step
            m[t, i, tau] = Phi[i, tau] @ m[t - 1, i, tau]
            P[t, i, tau] = Phi[i, tau] @ P[t - 1, i, tau] @ Phi[i, tau].T
    # Auxiliary gain variables
    for i in numba.prange(n):
        for tau in range(p):
            v = v - Xt[i, tau] * m[t, i, tau, 0]
            sum_P00 = sum_P00 + P[t, i, tau, 0, 0]
    # Energy function update
    s = np.linalg.norm(Xt)**2 * sum_P00 + sv2
    phi += np.pi * s + 0.5 * v**2 / s
    # Update
    for i in numba.prange(n):
        for tau in range(p):
            k = Xt[i, tau] * P[t, i, tau, :, 0]  # Gain
            m[t, i, tau] = m[t, i, tau] + (v / s) * k
            P[t, i, tau] = P[t, i, tau] + (k / s) @ k.T
It appears to have simply been a problem with running interactively in IPython. Running a test script from the console leads to parallel execution, as expected.
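For anyone debugging a similar situation, Numba can also report what it actually parallelized; a minimal sketch (using a toy function, and assuming a Numba version that provides parallel diagnostics):

import numba
import numpy as np

@numba.njit(parallel=True)
def row_sums(a):
    out = np.empty(a.shape[0])
    for i in numba.prange(a.shape[0]):
        out[i] = a[i].sum()
    return out

row_sums(np.random.rand(1000, 100))      # compile and run once
print(numba.threading_layer())           # which threading backend was used
row_sums.parallel_diagnostics(level=1)   # summary of parallelized regions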
I have the following code snippet:
def func1(self, X, y):
    # X.shape = (455,13)
    # y.shape = (455)
    num_examples, num_features = np.shape(X)
    self.weights = np.random.uniform(-1 / (2 * num_examples), 1 / (2 * num_examples), num_features)
    while condition:
        new_weights = np.zeros(num_features)
        K = (np.dot(X, self.weights) - y)
        for j in range(num_features):
            summ = 0
            for i in range(num_examples):
                summ += K[i] * X[i][j]
            new_weights[j] = self.weights[j] - ((self.alpha / num_examples) * summ)
        self.weights = new_weights
This code runs too slowly. Is there any optimization I can do?
You can efficiently use np.einsum(). See a testing version below:
def func2(X, y):
    num_examples, num_features = np.shape(X)
    weights = np.random.uniform(-1./(2*num_examples), 1./(2*num_examples), num_features)
    K = (np.dot(X, weights) - y)
    return weights - alpha/num_examples*np.einsum('i,ij->j', K, X)
You can get new_weights directly using matrix multiplication with np.dot, like so:
new_weights = self.weights- ((self.alpha / num_examples) * np.dot(K[None],X))
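A quick equivalence check of both vectorized forms against the original double loop (random data, standalone variables instead of self attributes):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((455, 13))
y = rng.random(455)
weights = rng.random(13)
alpha = 0.01
num_examples, num_features = X.shape

K = np.dot(X, weights) - y

# explicit loops, as in the original snippet
slow = np.zeros(num_features)
for j in range(num_features):
    summ = 0.0
    for i in range(num_examples):
        summ += K[i] * X[i][j]
    slow[j] = weights[j] - (alpha / num_examples) * summ

# einsum version and np.dot version
fast_einsum = weights - alpha / num_examples * np.einsum('i,ij->j', K, X)
fast_dot = weights - (alpha / num_examples) * np.dot(K[None], X)  # note: result has shape (1, 13)

assert np.allclose(slow, fast_einsum)
assert np.allclose(slow, fast_dot)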