The task is to combine two arrays row by row (constructing all row combinations), keeping a combination only based on the product of two corresponding vectors. For example, given the rows:
Row1_A,
Row2_A,
Row3_A,
Row1_B,
Row2_B,
Row3_B,
The result should be: Row1_A_Row1_B, Row1_A_Row2_B, Row1_A_Row3_B, Row2_A_Row1_B, etc..
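To make the pairing concrete, here is a tiny sketch (hypothetical 2-row arrays, not part of the setup below) showing the row combinations without any filtering:

import numpy as np

A_small = np.array([[1, 2], [3, 4]])   # Row1_A, Row2_A
B_small = np.array([[5, 6], [7, 8]])   # Row1_B, Row2_B
# every row of A_small paired with every row of B_small
pairs = np.concatenate([np.repeat(A_small, len(B_small), axis=0),
                        np.tile(B_small, (len(A_small), 1))], axis=1)
# pairs -> [[1 2 5 6], [1 2 7 8], [3 4 5 6], [3 4 7 8]]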
Given the following initial arrays:
n_rows = 1000
A = np.random.randint(10, size=(n_rows, 5))
B = np.random.randint(10, size=(n_rows, 5))
P_A = np.random.rand(n_rows, 1)
P_B = np.random.rand(n_rows, 1)
Arrays P_A and P_B are the vectors of floats corresponding to the rows of A and B. A combined row should only appear in the final array if the product of the two corresponding values surpasses a certain threshold, for example:
lim = 0.8
I have thought of the following functions or ways to solve this problem, but I would be interested in faster solutions. I am open to using numba or other libraries, but ideally I would like to improve the vectorized solution using numpy.
Method A
def concatenate_per_row(A, B):
    m1, n1 = A.shape
    m2, n2 = B.shape
    out = np.zeros((m1, m2, n1 + n2), dtype=A.dtype)
    out[:, :, :n1] = A[:, None, :]
    out[:, :, n1:] = B
    return out.reshape(m1 * m2, -1)
%%timeit
A_B = concatenate_per_row(A, B)
P_A_B = (P_A[:, None]*P_B[None, :])
P_A_B = P_A_B.flatten()
idx = P_A_B > lim
A_B = A_B[idx, :]
P_A_B = P_A_B[idx]
37.8 ms ± 660 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Method B
%%timeit
A_B = []
P_A_B = []
for i in range(len(P_A)):
    P_A_B_i = P_A[i]*P_B
    idx = np.where(P_A_B_i > lim)[0]
    if len(idx) > 0:
        P_A_B.append(P_A_B_i[idx])
        A_B_i = np.zeros((len(idx), A.shape[1] + B.shape[1]), dtype='int')
        A_B_i[:, :A.shape[1]] = A[i]
        A_B_i[:, A.shape[1]:] = B[idx, :]
        A_B.append(A_B_i)
A_B = np.concatenate(A_B)
P_A_B = np.concatenate(P_A_B)
9.65 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
First of all, there is a more efficient algorithm. Indeed, you can pre-compute the size of the output array, so the values can be written directly into the final output arrays rather than stored temporarily in lists. To find that size efficiently, you can sort the array P_B and then do a binary search to find the number of values greater than lim/P_A[i,0] for every possible i (P_B*P_A[i,0] > lim is equivalent to P_B > lim/P_A[i,0]). The per-i counts can be stored temporarily so the filtered items can be looped over quickly.
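As a minimal illustration of the counting step (a sketch in plain NumPy, using the lim, P_A and P_B defined in the question):

P_B_sorted = np.sort(P_B.ravel())
# for every i at once: how many j satisfy P_B[j] * P_A[i] > lim
counts = P_B.size - np.searchsorted(P_B_sorted, lim / P_A[:, 0], side='right')
n_out = counts.sum()   # exact number of rows in the final output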
Moreover, you can use Numba to significantly speed up the computation of the main loop.
Here is the resulting code:
import numpy as np
import numba as nb

@nb.njit('(int_[:,::1], int_[:,::1], float64[:,::1], float64[:,::1])')
def compute(A, B, P_A, P_B):
    assert P_A.shape[1] == 1
    assert P_B.shape[1] == 1
    P_B_sorted = np.sort(P_B.reshape(P_B.size))
    counts = len(P_B) - np.searchsorted(P_B_sorted, lim/P_A[:,0], side='right')
    n = np.sum(counts)
    mA, mB = A.shape[1], B.shape[1]
    m = mA + mB
    A_B = np.empty((n, m), dtype=np.int_)
    P_A_B = np.empty((n, 1), dtype=np.float64)
    k = 0
    for i in range(P_A.shape[0]):
        if counts[i] > 0:
            idx = np.where(P_B > lim/P_A[i, 0])[0]
            assert counts[i] == len(idx)
            start, end = k, k + counts[i]
            A_B[start:end, :mA] = A[i, :]
            A_B[start:end, mA:] = B[idx, :]
            P_A_B[start:end, :] = P_B[idx, :] * P_A[i, 0]
            k += counts[i]
    return A_B, P_A_B
Here are performance results on my machine:
Original: 35.6 ms
Optimized original: 18.2 ms
Proposed (with order): 0.9 ms
Proposed (no ordering): 0.3 ms
The algorithm proposed above is 20 times faster than the optimized original algorithm. It can be made even faster. Indeed, if the order of the items does not matter, you can use an argsort to reorder both B and P_B. This lets you avoid computing idx in the hot loop and instead select the last elements of B and P_B directly (they are guaranteed to be above the threshold, though not in the same order as in the original code). Because the selected items are stored contiguously in memory, this implementation is much faster. In the end, this last implementation is about 60 times faster than the optimized original algorithm. Note that the proposed implementations are significantly faster than the original ones even without Numba.
Here is the implementation that does not preserve the order of the items:
@nb.njit('(int_[:,::1], int_[:,::1], float64[:,::1], float64[:,::1])')
def compute(A, B, P_A, P_B):
    assert P_A.shape[1] == 1
    assert P_B.shape[1] == 1
    nA, mA = A.shape
    nB, mB = B.shape
    m = mA + mB
    order = np.argsort(P_B.reshape(nB))
    P_B_sorted = P_B[order, :]
    B_sorted = B[order, :]
    counts = nB - np.searchsorted(P_B_sorted.reshape(nB), lim/P_A[:,0], side='right')
    nRes = np.sum(counts)
    A_B = np.empty((nRes, m), dtype=np.int_)
    P_A_B = np.empty((nRes, 1), dtype=np.float64)
    k = 0
    for i in range(P_A.shape[0]):
        if counts[i] > 0:
            start, end = k, k + counts[i]
            A_B[start:end, :mA] = A[i, :]
            A_B[start:end, mA:] = B_sorted[nB-counts[i]:, :]
            P_A_B[start:end, :] = P_B_sorted[nB-counts[i]:, :] * P_A[i, 0]
            k += counts[i]
    return A_B, P_A_B
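For completeness, a minimal usage sketch (assuming the arrays and lim from the question; both versions are called the same way):

A_B, P_A_B = compute(A, B, P_A, P_B)
assert A_B.shape[0] == P_A_B.shape[0]   # one probability per combined row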
Related
I'm struggling to figure out memory coalescence in CUDA. In order to evaluate the performance difference between coalesced and uncoalesced memory accesses, I have implemented two different versions of a kernel that adds two 2D matrices:
from numba import cuda
@cuda.jit
def uncoalesced_matrix_add(a, b, out):
    x, y = cuda.grid(2)
    out[x][y] = a[x][y] + b[x][y]

@cuda.jit
def coalesced_matrix_add(a, b, out):
    x, y = cuda.grid(2)
    out[y][x] = a[y][x] + b[y][x]
When I test the code above with square matrices everything works fine: both kernels produce the same result and the coalesced version is significantly faster:
import numpy as np
nrows, ncols = 2048, 2048
tpb = 32
threads_per_block = (tpb, tpb)
blocks = ((nrows + (tpb - 1))//tpb, (ncols + (tpb - 1))//tpb)
size = nrows*ncols
a = np.arange(size).reshape(nrows, ncols).astype(np.int32)
b = np.ones(shape=a.shape, dtype=np.int32)
out = np.empty_like(a).astype(np.int32)
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.to_device(out)
uncoalesced_matrix_add[blocks, threads_per_block](d_a, d_b, d_out)
slow = d_out.copy_to_host()
coalesced_matrix_add[blocks, threads_per_block](d_a, d_b, d_out)
fast = d_out.copy_to_host()
np.array_equal(slow, fast)
# True
However, if I change nrows = 1024 so that the matrices are no longer square, coalesced_matrix_add() throws the following error:
CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
What am I missing here?
Edit
For completeness, I'm attaching some profiling results. These data were obtained by using the workaround proposed by Robert Crovella with nrows = 1024 and ncols = 2048:
In [40]: %timeit uncoalesced_matrix_add[blocksu, threads_per_block](d_a, d_b, d_out)
289 µs ± 498 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [41]: %timeit coalesced_matrix_add[blocksc, threads_per_block](d_a, d_b, d_out)
164 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
When you make your array non-square, this calculation is no longer correct for one of your two kernels:
blocks = ((nrows + (tpb - 1))//tpb, (ncols + (tpb - 1))//tpb)
For both thread and block dimensions, the x index dimension comes first, followed by y index dimension (referring to the x and y as they appear in the in-kernel built-in variables x and y).
For the usage in your first kernel:
out[x][y] = a[x][y] + b[x][y]
we want x to index through the rows. That is consistent with your grid definition.
For the usage in your second kernel:
out[y][x] = a[y][x] + b[y][x]
we want y to index through the rows. That is not consistent with your grid definition.
The result is out-of-bounds access on the second kernel call. The orientation of your rectangular grid does not match the orientation of your rectangular data.
In the square case, such reversal was immaterial, as both dimensions were the same.
Here is a possible "fix":
$ cat t62.py
from numba import cuda
import numpy as np
@cuda.jit
def uncoalesced_matrix_add(a, b, out):
    x, y = cuda.grid(2)
    out[x][y] = a[x][y] + b[x][y]

@cuda.jit
def coalesced_matrix_add(a, b, out):
    x, y = cuda.grid(2)
    out[y][x] = a[y][x] + b[y][x]
nrows, ncols = 512, 1024
tpb = 32
threads_per_block = (tpb, tpb)
blocksu = ((nrows + (tpb - 1))//tpb, (ncols + (tpb - 1))//tpb)
blocksc = ((ncols + (tpb - 1))//tpb, (nrows + (tpb - 1))//tpb)
size = nrows*ncols
a = np.arange(size).reshape(nrows, ncols).astype(np.int32)
b = np.ones(shape=a.shape, dtype=np.int32)
out = np.empty_like(a).astype(np.int32)
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.to_device(out)
uncoalesced_matrix_add[blocksu, threads_per_block](d_a, d_b, d_out)
slow = d_out.copy_to_host()
coalesced_matrix_add[blocksc, threads_per_block](d_a, d_b, d_out)
fast = d_out.copy_to_host()
print(np.array_equal(slow, fast))
# True
$ python t62.py
True
$
Also note that this grid sizing strategy only works for dimensions that are whole-number divisible by the block size.
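For dimensions that are not whole-number multiples of the block size, a common remedy (a sketch, not part of the answer above) is to round the grid up and add an explicit bounds check inside the kernel:

from numba import cuda

@cuda.jit
def coalesced_matrix_add_guarded(a, b, out):
    x, y = cuda.grid(2)
    # extra threads from the rounded-up grid simply do nothing
    if y < out.shape[0] and x < out.shape[1]:
        out[y][x] = a[y][x] + b[y][x]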
I am trying to implement LDA using Gibbs sampling, and in the step of updating each topic proportion I have a 4-layer loop that runs extremely slowly; I am not sure how to improve the efficiency of this code. The code I have now is the following:
N_W is the number of words, N_D is the number of documents, Z[i,j] is the topic assignment (1 to K possible assignments), X[i,j] is the count of the j-th word in the i-th document, and Beta is of dimension [K, N_W].
And the update is the following:
for k in range(K):  # iteratively for each topic update
    n_k = np.zeros(N_W)  # vocab size
    for w in range(N_W):
        for i in range(N_D):
            for j in range(N_W):
                # counting number of times a word is assigned to a topic
                n_k[w] += (X[i,j] == w) and (Z[i,j] == k)
    # update
    Beta[k,:] = np.random.dirichlet(gamma + n_k)
You could get rid of the last two for loops using logical functions:
for k in range(K):  # iteratively for each topic update
    n_k = np.zeros(N_W)  # vocab size
    for w in range(N_W):
        a = np.logical_not(X-w)  # all X(i,j) == w become True, others False
        b = np.logical_not(Z-k)  # all Z(i,j) == k become True, others False
        c = np.logical_and(a, b)  # all (i,j) where X(i,j) == w and Z(i,j) == k are True, others False
        n_k[w] = np.sum(c)  # sum all True values
Or even as a one liner:
n_k = np.array([[np.sum(np.logical_and(np.logical_not(X[:N_D,:N_W]-w), np.logical_not(Z[:N_D,:N_W]-k))) for w in range(N_W)] for k in range(K)])
Each row in n_k can then be used for the beta calculation. The one-liner also takes N_W and N_D as restrictions, in case they are not equal to the sizes of X and Z.
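For example (a sketch, reusing the gamma from the question and the n_k matrix built above), the Dirichlet updates can then be drawn row by row from that count matrix:

Beta = np.array([np.random.dirichlet(gamma + n_k[k]) for k in range(K)])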
I did some testing with the following matrices:
import numpy as np
K = 90
N_W = 100
N_D = 11
N_W = 12
Z = np.random.randint(0, K, size=(N_D, N_W))
X = np.random.randint(0, N_W, size=(N_D, N_W))
gamma = 1
The original code:
%%timeit
Beta = np.zeros((K, N_W))
for k in range(K):  # iteratively for each topic update
    n_k = np.zeros(N_W)  # vocab size
    for w in range(N_W):
        for i in range(N_D):
            for j in range(N_W):
                # counting number of times a word is assigned to a topic
                n_k[w] += (X[i,j] == w) and (Z[i,j] == k)
    # update
    Beta[k,:] = np.random.dirichlet(gamma + n_k)
865 ms ± 8.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Then vectorising only the inner two loops:
%%timeit
Beta = np.zeros((K, N_W))
for k in range(K):  # iteratively for each topic update
    n_k = np.zeros(N_W)  # vocab size
    for w in range(N_W):
        n_k[w] = np.sum((X == w) & (Z == k))
    # update
    Beta[k,:] = np.random.dirichlet(gamma + n_k)
21.6 ms ± 542 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Finally with some creative application of broadcasting and extracting common elements:
%%timeit
Beta = np.zeros((K, N_W))
w = np.arange(N_W)
X_eq_w = np.equal.outer(X, w)
for k in range(K):  # iteratively for each topic update
    n_k = np.sum(X_eq_w & (Z == k)[:, :, None], axis=(0, 1))
    # update
    Beta[k,:] = np.random.dirichlet(gamma + n_k)
4.6 ms ± 92.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The trade-off here is between speed and memory. For the shapes I used this was not so memory-intensive, but the intermediate three-dimensional arrays I built in the last solution could get quite large.
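If memory does become a concern, one hedged alternative (not part of the answer above, and assuming X holds word indices in [0, N_W) as in the test data here) is to build the whole (K, N_W) count table in a single scatter-add pass, which avoids the K x N_D x N_W boolean intermediate entirely:

counts = np.zeros((K, N_W))
# counts[k, w] = number of (i, j) with Z[i, j] == k and X[i, j] == w
np.add.at(counts, (Z.ravel(), X.ravel()), 1)
Beta = np.array([np.random.dirichlet(gamma + counts[k]) for k in range(K)])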
My problem is the following. I have two arrays X and Y of shape n, p where p >> n (e.g. n = 50, p = 10000).
I also have a boolean mask mask (a 1-D array of size p, i.e. along the p axis) of small density (e.g. np.mean(mask) is 0.05).
I try to compute, as fast as possible, the inner product of X and Y with respect to mask: the output inner is an array of shape n, n, and is such that inner[i, j] = np.sum(X[i, np.logical_not(mask)] * Y[j, np.logical_not(mask)]).
I have tried using the numpy.ma library, but it is quite slow for my use:
import numpy as np
import numpy.ma as ma
n, p = 50, 10000
density = 0.05
mask = np.array(np.random.binomial(1, density, size=p), dtype=np.bool_)
mask_big = np.ones(n)[:, None] * mask[None, :]
X = np.random.randn(n, p)
Y = np.random.randn(n, p)
X_ma = ma.array(X, mask=mask_big)
Y_ma = ma.array(Y, mask=mask_big)
But then, on my machine, X_ma.dot(Y_ma.T) is about 5 times slower than X.dot(Y.T)...
To begin with, I think the problem is that .dot does not know that the mask only applies along p, but I don't know if it's possible to use this information.
I'm looking for a way to perform the computation without being much slower than the naive dot.
Thanks a lot !
We can use matrix multiplication with and without the masked columns: subtracting the masked-columns product from the full product yields the desired output -
inner = X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
Or simply use the inverted mask, though this would be slower for a sparse mask -
inner = X[:,~mask].dot(Y[:,~mask].T)
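As a quick sanity check that the two expressions agree (a sketch using the arrays from the question):

ref = X[:, ~mask].dot(Y[:, ~mask].T)
fast = X.dot(Y.T) - X[:, mask].dot(Y[:, mask].T)
print(np.allclose(ref, fast))   # expected: True (up to floating-point error)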
Timings -
In [34]: np.random.seed(0)
...: p,n = 10000,50
...: X = np.random.rand(n,p)
...: Y = np.random.rand(n,p)
...: mask = np.random.rand(p)>0.95
In [35]: mask.mean()
Out[35]: 0.0507
In [36]: %timeit X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
100 loops, best of 3: 2.54 ms per loop
In [37]: %timeit X[:,~mask].dot(Y[:,~mask].T)
100 loops, best of 3: 4.1 ms per loop
In [39]: %%timeit
...: inner = np.empty((n,n))
...: for i in range(X.shape[0]):
...: for j in range(X.shape[0]):
...: inner[i, j] = np.sum(X[i, ~mask] * Y[j, ~mask])
1 loop, best of 3: 302 ms per loop
I have an array that is not monotonically increasing. I would like to make it monotonically increasing by applying a constant rate where the array decreases.
I have created a small example here where the rate is 0.2:
import numpy as np
import matplotlib.pyplot as plt

# Rate
rate = 0.2
# Array to interpolate
arr1 = np.array([0,1,2,3,4,4,4,3,2,2.5,3.5,5.2,7,10,9.5,np.nan,np.nan,np.nan,11.2, 11.4, 12,10,9,9.5,10.2,10.5,10.8,12,12.5,15],dtype=float)
# Line with constant rate at first monotonic decrease (index 6)
xx1 = 6
xr1 = np.array(np.arange(0,arr1.shape[0]+1),dtype=float)
yr1 = rate*xr1 + (arr1[xx1]-rate*xx1)
# Line with constant rate at second monotonic decrease (index 13)
xx2 = 13
xr2 = np.array(np.arange(0,arr1.shape[0]+1),dtype=float)
yr2 = rate*xr2 + (arr1[xx2]-rate*xx2)
# Line with constant rate at third monotonic decrease (index 20)
xx3 = 20
xr3 = np.array(np.arange(0,arr1.shape[0]+1),dtype=float)
yr3 = rate*xr3 + (arr1[xx3]-rate*xx3)
plt.figure()
plt.plot(arr1,'.-',label='Original')
plt.plot(xr1,yr1,label='Const Rate line 1')
plt.plot(xr2,yr2,label='Const Rate line 2')
plt.plot(xr3,yr3,label='Const Rate line 3')
plt.legend()
plt.grid()
The "Original" array is my dataset.
The final result I would like is the blue + red-dashed line. In the figure I also highlighted the "constant rate" curves.
Since I have very large arrays (millions of records), I would like to avoid for-loops over the entire array.
Thanks a lot to everybody for the help!
Here's a different option: if you are interested in plotting a monotonically increasing curve from your data, you can simply skip the unwanted points between two successive increasing points, e.g. between arr1[6] = 4 and arr1[11] = 5.2, by connecting them with a line.
import numpy as np
import matplotlib.pyplot as plt
arr1 = np.array([0,1,2,3,4,4,4,3,2,2.5,3.5,5.2,7,10,9.5,np.nan,np.nan,np.nan,11.2, 11.4, 12,10,9,9.5,10.2,10.5,10.8,12,12.5,15],dtype=float)
mask = (arr1 == np.maximum.accumulate(np.nan_to_num(arr1)))
x = np.arange(len(arr1))
plt.figure()
plt.plot(x, arr1,'.-',label='Original')
plt.plot(x[mask], arr1[mask], 'r-', label='Interp.')
plt.legend()
plt.grid()
arr2 = arr1[1:] - arr1[:-1]
ind = np.where(arr2 < 0)[0]
for i in ind:
    arr1[i] = arr1[i - 1] + rate
You may first need to replace any np.nan values, e.g. with the minimum of the array.
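For example (a sketch):

arr1[np.isnan(arr1)] = np.nanmin(arr1)   # np.nanmin ignores the NaNs when taking the minimum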
I would like to avoid for-loops over the entire array.
Frankly speaking, it is hard to avoid for-loops entirely, because NumPy itself, as a library written in C, runs loops implemented in C/C++ under the hood, and any search or comparison routine (like np.argwhere, np.all, etc.) also has to iterate over the data.
Instead, I suggest using at least one explicit Python loop (the array is iterated over only once):
arr0 = np.zeros_like(arr1)
num = 1
rate = .2
while num < len(arr1):
    if arr1[num] < arr1[num-1] or np.isnan(arr1[num]):
        start = arr1[num-1]
        while start > arr1[num] or np.isnan(arr1[num]):
            print(arr1[num])
            arr0[num] = arr0[num-1] + rate
            num += 1
        continue
    arr0[num] = arr1[num]
    num += 1
Your problem can be expressed as one simple recursive difference equation:
y[n] = max(y[n-1] + 0.2, x[n])
So the direct Python form would be
def func(a):
    out = np.zeros_like(a)
    out[0] = a[0]
    for i in range(1, len(a)):
        out[i] = max(out[i-1] + 0.2, a[i])
    return out
Unfortunately, this equation is recursive and non-linear, so finding a vectorized algorithm may be difficult.
However, using Numba we can speed up this loop-based algorithm by a factor of 300:
import numba
fastfunc = numba.jit(func)
arr1 = np.random.rand(1000000)
%timeit func(arr1)
# 599 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit fastfunc(arr1)
# 2.22 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I finally managed to do what I wanted with a while loop.
# data['myvar'] is the original dataset I want to reshape
data['myvar_corrected'] = data['myvar'].values
temp_d = data['myvar'].fillna(0).values*1.0
dtc = np.maximum.accumulate(temp_d)
data.loc[temp_d < np.maximum.accumulate(dtc), 'myvar_corrected'] = float('nan')
stay_in_while = True
min_rate = 5/200000/(24*60)
idx_next = 0
while stay_in_while:
    df_temp = data.iloc[idx_next:]
    if df_temp['myvar'].isnull().sum() > 0:
        idx_first_nan = df_temp.reset_index()['myvar_corrected'].isnull().argmax()
        idx_nan_or = (data_new.index.values == df_temp.index.values[idx_first_nan]).argmax()
        x = np.arange(idx_first_nan-1, df_temp.shape[0])
        y0 = df_temp.iloc[idx_first_nan-1]['myvar_corrected']
        rate_curve = min_rate*x + (y0 - min_rate*(idx_first_nan-1))
        damage_m_rate = df_temp.iloc[idx_first_nan-1:]['myvar_corrected'] - rate_curve
        try:
            idx_intercept = (data_new.index.values == damage_m_rate[damage_m_rate > 0].index.values[0]).argmax()
            data_new.iloc[idx_nan_or:idx_intercept]['myvar'] = rate_curve[0:(damage_m_rate.index.values == damage_m_rate[damage_m_rate > 0].index.values[0]).argmax()-1]
            idx_next = idx_intercept + 1
        except:
            stay_in_while = False
    else:
        stay_in_while = False
# Finally I have my result stored in data_new['myvar']
The result is shown in the following picture.
Thanks to everybody for the contribution!
I have to apply a mathematical formula that I've written in Python as:
for s in range(tdim):
    sum1 = 0.0
    for i in range(dim):
        for j in range(dim):
            sum1 += 0.5*np.cos(theta[s]*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j]) \
                    - 0.5*np.sin(theta[s]*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j])
    PHi2.append(sum1)
Now, this is correct but clearly inefficient; the other way around is to do:
for i in range(dim):
    for j in range(dim):
        PHi2 = 0.5*np.cos(theta*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j])-0.5*np.sin(theta*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j])
However, the second example gives me the same number in all elements of PHi2, so it is faster but the answer is wrong. How can I do this correctly and more efficiently?
NOTE: eig1 and eig2 have the same dimension d, and theta and PHi2 have the same dimension D, BUT d != D.
You can use a brute-force broadcasting approach, but you end up creating an intermediate array of shape (D, d, d), which can get out of hand if your arrays are even moderately large. Furthermore, with plain broadcasting you recompute, in the innermost loop, a lot of quantities that only need to be computed once. If you first compute the necessary terms for all possible values of i - j and add them together, you can reuse those values in the outer loop, e.g.:
def fast_ops(eig1, eig2, theta):
    d = len(eig1)
    d_arr = np.arange(d)
    i_j = d_arr[:, None] - d_arr[None, :]
    reidx = i_j + d - 1
    mult1 = eig1[:, None] * eig1[None, :] + eig2[:, None] + eig2[None, :]
    mult2 = eig1[None, :] * eig2[:, None] - eig1[:, None] * eig2[None, :]
    mult1_reidx = np.bincount(reidx.ravel(), weights=mult1.ravel())
    mult2_reidx = np.bincount(reidx.ravel(), weights=mult2.ravel())
    angles = theta[:, None] * np.arange(1 - d, d)
    return 0.5 * (np.einsum('ij,j->i', np.cos(angles), mult1_reidx) -
                  np.einsum('ij,j->i', np.sin(angles), mult2_reidx))
If we rewrite M4rtini's code as a function for comparison:
def fast_ops1(eig1, eig2, theta):
    d = len(eig1)
    D = len(theta)
    s = np.array(range(D))[:, None, None]
    i = np.array(range(d))[:, None]
    j = np.array(range(d))
    ret = 0.5 * (np.cos(theta[s]*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j]) -
                 np.sin(theta[s]*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j]))
    return ret.sum(axis=(-1, -2))
And we make up some data:
d, D = 100, 200
eig1 = np.random.rand(d)
eig2 = np.random.rand(d)
theta = np.random.rand(D)
The speed improvement is very noticeable: 80x on top of the 115x over your original code, leading to a whopping ~9000x speed-up:
In [22]: np.allclose(fast_ops1(eig1, eig2, theta), fast_ops(eig1, eig2, theta))
Out[22]: True
In [23]: %timeit fast_ops1(eig1, eig2, theta)
10 loops, best of 3: 145 ms per loop
In [24]: %timeit fast_ops(eig1, eig2, theta)
1000 loops, best of 3: 1.85 ms per loop
This works by broadcasting.
For tdim = 200 and dim = 100.
14 seconds with original.
120 ms with this version.
s = np.array(range(tdim))[:, None, None]
i = np.array(range(dim))[:, None]
j = np.array(range(dim))
PHi2 =(0.5*np.cos(theta[s]*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j])-0.5*np.sin(theta[s]*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j])).sum(axis=2).sum(axis=1)
In the first bit of code, you have 0.5*np.cos(theta[s]*(i-j))... but in the second it's 0.5*np.cos(theta*(i-j)).... Unless you've got theta defined differently for the second bit of code, this could well be the cause of the trouble.