I am currently investigating how to efficiently build a convolutional neural network that operates on 5- or 6-dimensional arrays.
Since many of the common tools for convolutional neural networks do not really handle N-D convolutions, I decided to try writing an implementation of helix convolution, whereby the N-D convolution is treated as a single large 1-D convolution (see Reference 1, http://sepwww.stanford.edu/public/docs/sep95/jon1/paper_html/node2.html , and Reference 2, https://sites.ualberta.ca/~mostafan/Files/Papers/md_convolution_TLE2009.pdf , for details of the concept).
I did this under the (possibly incorrect) assumption that a single large one-dimensional convolution is likely to be easier to run on a GPU than a multidimensional one, and because the method scales trivially to N dimensions.
In particular, Reference 2 states:
We have not found important gains in computational efficiency between N-D standard convolution versus using the algorithm described in the text. We have, however, found that writing codes for seismic data regularization with the described trick leads to algorithms that can easily handle regularization problems with any number of spatial dimensions (Naghizadeh and Sacchi, 2009).
I have written an implementation of the function below and compared it against scipy.signal.fftconvolve. It is slower on the CPU, but I would nonetheless like to see how it performs on the GPU in PyTorch as a forward convolutional layer.
Can someone kindly help me port this code to PyTorch so I can verify how it behaves?
"""
HELIX CONVOLUTION FUNCTION
Shrink:
CROPS THE CONVOLVED SIGNAL DOWN TO THE SIZE OF THE ORIGINAL SIGNAL.
Pad:
PADS THE DIFFERENCE BETWEEN THE ORIGINAL SHAPE AND THE DESIRED, CONVOLVED SHAPE FOR KERNEL AND SIGNAL.
GetLength:
EXTRACTS THE LENGTH OF THE UNWOUND STRIP OF THE SIGNAL AND KERNEL THAT IS TO BE CONVOLVED.
FFTConvolve:
USES THE NUMPY FFT PACKAGE TO PERFORM FAST FOURIER CONVOLUTION ON THE SIGNALS
Convolve:
USES HELIX CONVOLUTION ON AN INPUT ARRAY AND KERNEL.
"""
import numpy as np
from scipy import signal
import operator
import time


class HelixCPU:
    @classmethod
    def Shrink(cls, array, bounding):
        start = tuple(map(lambda a, da: (a-da)//2, array.shape, bounding))
        end = tuple(map(operator.add, start, bounding))
        slices = tuple(map(slice, start, end))
        return array[slices]

    @classmethod
    def Pad(cls, array, target_shape):
        diff = target_shape - array.shape
        padder = [(0, val) for val in diff]
        padded = np.pad(array, padder, 'constant')
        return padded

    @classmethod
    def GetLength(cls, array_shape, padded_shape):
        temp = 1
        steps = np.zeros_like(array_shape)
        for i, entry in enumerate(padded_shape[::-1]):
            if i == len(padded_shape) - 1:
                steps[i] = 1
            else:
                temp = entry * temp
                steps[i] = temp
        steps = np.roll(steps, 1)
        steps = steps[::-1]
        ones = np.ones_like(array_shape)
        ones[-1] = 0
        out = np.multiply(steps, array_shape - ones)
        length = np.sum(out)
        return length

    @classmethod
    def FFTConvolve(cls, in1, in2, len1, len2):
        s1 = len1
        s2 = len2
        shape = s1 + s2 - 1
        fsize = 2 ** np.ceil(np.log2(shape)).astype(int)
        fslice = slice(0, shape)
        conv = np.fft.ifft(np.fft.fft(in1, int(fsize)) * np.fft.fft(in2, int(fsize)))[fslice].copy()
        # the imaginary part is numerical noise for real inputs
        return conv.real

    @classmethod
    def Convolve(cls, array, kernel):
        m = array.shape
        n = kernel.shape
        mn = np.add(m, n)
        mn = mn - np.ones_like(mn)
        k_pad = cls.Pad(kernel, mn)
        a_pad = cls.Pad(array, mn)
        length_k = cls.GetLength(kernel.shape, k_pad.shape)
        length_a = cls.GetLength(array.shape, a_pad.shape)
        k_flat = k_pad.flatten()[0:length_k]
        a_flat = a_pad.flatten()[0:length_a]
        conv = cls.FFTConvolve(a_flat, k_flat, length_a, length_k)
        conv = np.resize(conv, mn)
        conv = cls.Shrink(conv, m)
        return conv


def main():
    array = np.random.rand(25, 25, 41, 51)
    kernel = np.random.rand(10, 10, 10, 10)

    start2 = time.process_time()
    test2 = HelixCPU.Convolve(array, kernel)
    end2 = time.process_time()

    start1 = time.process_time()
    test1 = signal.fftconvolve(array, kernel, "same")
    end1 = time.process_time()

    print("")
    print("========================")
    print("SOME LARGE CONVOLVED RANDOM ARRAYS. ")
    print("========================")
    print("")
    print("Random Calorimeter Image of Size {0} Created".format(array.shape))
    print("Random Kernel of Size {0} Created".format(kernel.shape))
    print("")
    print("Value\tOriginal\tHelix\tRatio")
    print("Time Taken [s]\t{0}\t{1}\t{2}".format((end1-start1), (end2-start2), (end2-start2)/(end1-start1)))
    print("Maximum Value\t{:03.2f}\t{:13.2f}".format(np.max(test1), np.max(test2)))
    print("Matrix Norm \t{:03.2f}\t{:13.2f}".format(np.linalg.norm(test1), np.linalg.norm(test2)))
    print("All Close?\t{0}".format(np.allclose(test1, test2)))


if __name__ == "__main__":
    main()
Sorry, I cannot add a comment due to low rep, so I am asking my question as an answer in the hope that it helps answer yours.
By helix convolution, do you mean defining the convolution operation as a single matrix multiplication? If so, I did try this in the past, but it is too memory-inefficient to be practical.
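To illustrate what I mean, here is a 1-D toy sketch of that single-matrix (Toeplitz) formulation, with made-up sizes. The convolution matrix has (n + k - 1) x n entries, so its memory footprint grows quadratically with the signal length, which is what made it impractical for me:

import numpy as np
from scipy.linalg import toeplitz

# 1-D convolution written as multiplication by a Toeplitz matrix
n, k = 5, 3
kernel = np.arange(1, k + 1, dtype=float)
col = np.r_[kernel, np.zeros(n - 1)]     # first column of the matrix
row = np.r_[kernel[0], np.zeros(n - 1)]  # first row (zeros after the corner)
T = toeplitz(col, row)                   # shape (n + k - 1, n)

x = np.random.rand(n)
assert np.allclose(T @ x, np.convolve(x, kernel))  # matches 'full' convolution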
Let's say I have a matrix W of shape (n_words, model_dim), where n_words is the number of words in a sentence and model_dim is the dimension of the space in which the word vectors are represented. What is the fastest way to compute the moving average of these vectors?
For example, with a window size of 2 (window length = 5), I could have something like this (which raises TypeError: JAX 'Tracer' objects do not support item assignment):
from jax import random
import jax.numpy as jnp

# Fake word vectors (17 word vectors of dimension 32)
W = random.normal(random.PRNGKey(0), shape=(17, 32))

ws = 2  # window size
N = W.shape[0]  # number of words
new_W = jnp.zeros(W.shape)
for i in range(N):
    window = W[max(0, i-ws):min(N, i+ws+1)]
    n = window.shape[0]
    for j in range(n):
        new_W[i] += W[j] / n
I guess there is a faster solution with jnp.convolve but I'm not familiar with it.
This looks like you're trying to do a convolution, so jnp.convolve or similar would likely be a more performant approach.
That said, your example is a bit strange, because n is never larger than 4, so you never access any but the first four elements of W. Also, you overwrite the previous value in each iteration of the inner loop, so each row of new_W just contains a scaled copy of one of the first four rows of W.
Changing your code to what I think you meant, and using JAX's .at indexed-update syntax to make it compatible with JAX's immutable arrays, gives this:
from jax import random
import jax.numpy as jnp

# Fake word vectors (17 word vectors of dimension 32)
W = random.normal(random.PRNGKey(0), shape=(17, 32))

ws = 2  # window size
N = W.shape[0]  # number of words
new_W = jnp.zeros(W.shape)
for i in range(N):
    window = W[max(0, i-ws):min(N, i+ws)]
    n = window.shape[0]
    for j in range(n):
        new_W = new_W.at[i].add(window[j] / n)
and here is the equivalent in terms of a much more efficient convolution:
from jax.scipy.signal import convolve
kernel = jnp.ones((4, 1))
new_W_2 = convolve(W, kernel, mode='same') / convolve(jnp.ones_like(W), kernel, mode='same')
jnp.allclose(new_W, new_W_2)
# True
My task is fairly simple: I have a large 2D matrix, containing only zeros and ones. For each position in this matrix, I want to sum all pixels in a window around this position. The problem is that the matrix has the shape (166667, 17668) and window sizes range from (333, 333) to (5333, 5333). So far I have only tried on a subset of the data. The code I arrived at:
out_arr = np.zeros(in_arr.shape)  # was np.array(in_arr.shape), which creates a 2-element array
in_arr = np.pad(in_arr, windowsize//2, mode='reflect')

for y in range(out_arr.shape[0]):
    for x in range(out_arr.shape[1]):
        out_arr[y, x] = np.sum(in_arr[y:y+windowsize, x:x+windowsize])
Obviously, this takes a long time. But in my case it was faster than a rolling-window approach using numpy.stride_tricks.as_strided, as described here. I tried compiling it using Cython, to no effect.
What would be your suggestions to speed this up, apart from parallelizing?
I have an Nvidia Titan X at hand. Is there a way to benefit from it (e.g. using cupy)?
For windowed summation, convolution is actually overkill, since a simple O(n) solution exists:
import numpy as np
from scipy.signal import convolve

def winsum(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2+1, mode='reflect')[:-1, :-1]
    in_arr[0] = 0
    in_arr[:, 0] = 0
    ps = in_arr.cumsum(0).cumsum(1)
    return ps[windowsize:, windowsize:] + ps[:-windowsize, :-windowsize] \
        - ps[windowsize:, :-windowsize] - ps[:-windowsize, windowsize:]
This is already fast, but you can save even more, because ps, calculated once for the largest window size, can be reused for all smaller window sizes.
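For instance, here is a sketch of that sharing; the indexing is my own reconstruction on top of a zero-fringed integral image (assuming odd window sizes), so verify it before relying on it:

def winsum_multi(in_arr, window_sizes):
    # build one integral image, padded for the largest window,
    # and answer every (odd) window size from it
    wmax = max(window_sizes)
    p = wmax // 2
    padded = np.pad(in_arr, p, mode='reflect')
    ps = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    ps[1:, 1:] = padded.cumsum(0).cumsum(1)
    H, W = in_arr.shape
    out = {}
    for w in window_sizes:
        a = p - w // 2  # top-left offset of this window size in padded coordinates
        out[w] = (ps[a+w:a+w+H, a+w:a+w+W] + ps[a:a+H, a:a+W]
                  - ps[a:a+H, a+w:a+w+W] - ps[a+w:a+w+H, a:a+W])
    return out

Here winsum_multi(data, [99, 333])[333] should agree with winsum(data, 333).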
However, there is one potential drawback, which is the very large numbers that may arise from summing everything like that. A numerically more sound version eliminates this problem by taking the differences first. Downside: the extra saving through sharing ps is no longer available.
def winsum_safe(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
    in_arr[windowsize:] -= in_arr[:-windowsize]
    in_arr[:, windowsize:] -= in_arr[:, :-windowsize]
    return in_arr.cumsum(0)[windowsize-1:].cumsum(1)[:, windowsize-1:]
For reference, here is the closest competitor, which is FFT-based convolution. You need an up-to-date version of scipy for this to work efficiently. On older versions, use fftconvolve instead of convolve.
def winsumc(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
    kernel = np.ones((windowsize, windowsize), in_arr.dtype)
    return convolve(in_arr, kernel, 'valid')
The next one simulates scipy's old (and excruciatingly slow) behavior.
def winsum_nofft(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
    kernel = np.ones((windowsize, windowsize), in_arr.dtype)
    return convolve(in_arr, kernel, 'valid', method='direct')
Testing and benchmarking:
data = np.random.random((1000, 1000))
assert np.allclose(winsum(data, 333), winsumc(data, 333))
assert np.allclose(winsum(data, 333), winsum_safe(data, 333))
kwds = dict(globals=globals(), number=10)
from timeit import timeit
from time import perf_counter
print('data 1000x1000, window 333x333')
print('cumsum: ', timeit('winsum(data, 333)', **kwds)*100, 'ms')
print('cumsum safe: ', timeit('winsum_safe(data, 333)', **kwds)*100, 'ms')
print('fftconv: ', timeit('winsumc(data, 333)', **kwds)*100, 'ms')
t = perf_counter()
res = winsum_nofft(data, 99) # 333 just takes too long
t = perf_counter() - t
assert np.allclose(winsum(data, 99), res)
print('data 1000x1000, window 99x99')
print('conv: ', t*1000, 'ms')
Sample output:
data 1000x1000, window 333x333
cumsum: 70.33260859316215 ms
cumsum safe: 59.98647050000727 ms
fftconv: 298.60571819590405 ms
data 1000x1000, window 99x99
conv: 135224.8261970235 ms
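As for the Titan X: the cumsum trick should port almost verbatim to the GPU with cupy, since cupy mirrors the numpy API. A sketch (assuming a recent cupy whose cupy.pad supports mode='reflect'; I have not benchmarked it):

import cupy as cp

def winsum_gpu(in_arr, windowsize):
    # same prefix-sum trick as winsum, executed on the GPU
    a = cp.pad(cp.asarray(in_arr), windowsize//2+1, mode='reflect')[:-1, :-1]
    a[0] = 0
    a[:, 0] = 0
    ps = a.cumsum(0).cumsum(1)
    res = ps[windowsize:, windowsize:] + ps[:-windowsize, :-windowsize] \
        - ps[windowsize:, :-windowsize] - ps[:-windowsize, windowsize:]
    return cp.asnumpy(res)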
@Divakar pointed out in the comments that you can use conv2d, and he is right. Here is an example:
import numpy as np
from scipy import signal
data = np.random.rand(5,5) # your original data that you want to sum
kernel = np.ones((2,2)) # square matrix of your dimensions, filled with ones
output = signal.convolve2d(data,kernel,mode='same') # the convolution
I wrote some code to implement the modified Gram-Schmidt process. When I tested it on real matrices, it was correct. However, when I tested it on complex matrices, it went wrong.
I believe my code is correct, having checked it step by step. Therefore, I wonder if there are numerical reasons why the modified Gram-Schmidt process fails on complex vectors.
Following is the code:
import numpy as np

def modifiedGramSchmidt(A):
    """
    Gives an orthonormal matrix, using the modified Gram-Schmidt procedure
    :param A: a matrix of column vectors
    :return: a matrix of orthonormal column vectors
    """
    # assuming A is a square matrix
    dim = A.shape[0]
    Q = np.zeros(A.shape, dtype=A.dtype)
    for j in range(0, dim):
        q = A[:, j]
        for i in range(0, j):
            rij = np.vdot(q, Q[:, i])
            q = q - rij * Q[:, i]
        rjj = np.linalg.norm(q, ord=2)
        if np.isclose(rjj, 0.0):
            raise ValueError("invalid input matrix")
        else:
            Q[:, j] = q / rjj
    return Q
Following is the test code:
import numpy as np

# If testing on random matrices:
# X = np.random.rand(dim,dim)*10 + np.random.rand(dim,dim)*5 *1j

# If testing on a known good one
v1 = np.array([1, 0, 1j]).reshape((3, 1))
v2 = np.array([-1, 1j, 1]).reshape((3, 1))
v3 = np.array([0, -1, 1j+1]).reshape((3, 1))
X = np.hstack([v1, v2, v3])
dim = X.shape[0]  # dim was previously undefined

Y = modifiedGramSchmidt(X)
Y3 = np.linalg.qr(X, mode="complete")[0]

if np.isclose(Y3.conj().T.dot(Y3), np.eye(dim, dtype=complex)).all():
    print("The QR-complete gives orthonormal vectors")

if np.isclose(Y.conj().T.dot(Y), np.eye(dim, dtype=complex)).all():
    print("The Gram Schmidt process is tested against a random matrix")
else:
    print("But my modified GS goes wrong!")
    print(Y.conj().T.dot(Y))
Update
The problem is that I implemented an algorithm designed for an inner product that is linear in its first argument, whereas I thought it was linear in its second argument.
Thanks @landogardner
Your problem is to do with how numpy.vdot handles complex numbers: the complex conjugate of the first argument is used for the calculation (ref). So you're calculating rij as q*.Q[:,i] instead of q.Q[:,i]*. Just swap the order of the args:
rij = np.vdot(Q[:,i], q)
This got the test code working for me.
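A quick demonstration of the convention:

import numpy as np

# np.vdot conjugates its *first* argument
print(np.vdot(1j, 1))  # -1j, i.e. conj(1j) * 1
print(np.vdot(1, 1j))  #  1j, i.e. conj(1)  * 1j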
I wish to perform 2D convolution on images of size 600 x 400 using a 10 x 10 filter. The filter is not separable. scipy.signal.convolve2d works well for me currently, but I am expecting much bigger images soon.
To counter that, I have two ideas:
resizing images
subsampling (or striding)?
Focusing on the subsampling part: Theano has a function that does convolution the same way as scipy's convolve2d (see theano conv2d), and it also has a subsampling option. But installing Theano on Windows has been painful for me. How do I get subsampling to work with scipy.signal.convolve2d? Any other alternatives (that don't require installing some heavyweight library)?
You could implement subsampling by hand; I'll only sketch the 1-D case for simplicity. Say you want to sample s = d * f on a regular subgrid with spacing k. Then your n-th sample is

s_{nk} = \sum_{i=0}^{10} f_i \, d_{nk-i}

The thing to observe here is that the indices of f and d always sum to a multiple of k. This suggests splitting the sum up into sub-sums:

s_{nk} = \sum_{j=0}^{k-1} \sum_{i=0}^{10/k} f_{j+ik} \, d_{-j+(n-i)k}

So what you need to do is: subsample d and f on grids with spacing k at all offsets 0, ..., k-1; convolve all pairs of subsampled d and f whose offsets sum to 0 or k; and add the results.
Here's some code for 1d. It roughly implements the above; only the grids are placed slightly differently, to make index management easier. The second function does it the stupid way, i.e. computes the full convolution and then decimates. It is for testing the first function against.
import numpy as np
from scipy import signal

def ss_conv(d1, d2, decimate):
    n = (len(d1) + len(d2) - 1) // decimate
    out = np.zeros((n,))
    for i in range(decimate):
        d1d = d1[i::decimate]
        d2d = d2[decimate-i-1::decimate]
        cv = signal.convolve(d1d, d2d, 'full')
        out[:len(cv)] += cv
    return out

def conv_ss(d1, d2, decimate):
    return signal.convolve(d1, d2, 'full')[decimate-1::decimate]
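A quick consistency check of the two (arbitrary sizes):

d1 = np.random.rand(100)
d2 = np.random.rand(10)
assert np.allclose(ss_conv(d1, d2, 4), conv_ss(d1, d2, 4))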
Edit: 2d version:
import numpy as np
from scipy import signal

def ss_conv_2d(d1, d2, decy, decx):
    ny = (d1.shape[0] + d2.shape[0] - 1) // decy
    nx = (d1.shape[1] + d2.shape[1] - 1) // decx
    out = np.zeros((ny, nx))
    for i in range(decy):
        for j in range(decx):
            d1d = d1[i::decy, j::decx]
            d2d = d2[decy-i-1::decy, decx-j-1::decx]
            cv = signal.convolve2d(d1d, d2d, 'full')
            out[:cv.shape[0], :cv.shape[1]] += cv
    return out

def conv_ss_2d(d1, d2, decy, decx):
    return signal.convolve2d(d1, d2, 'full')[decy-1::decy, decx-1::decx]
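And the corresponding check in 2-D, using the sizes from the question:

d1 = np.random.rand(600, 400)
d2 = np.random.rand(10, 10)
assert np.allclose(ss_conv_2d(d1, d2, 3, 3), conv_ss_2d(d1, d2, 3, 3))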
I am trying to implement the back-propagation algorithm using numpy in Python. I have been using this site to implement the matrix form of back-propagation. While testing this code on XOR, my network does not converge even after multiple runs of thousands of iterations. I think there is some sort of logic error. I would be very grateful if anyone would be willing to look it over. Fully runnable code can be found on GitHub.
import numpy as np

def backpropagate(network, tests, iterations=50):
    #convert tests into numpy matrices
    tests = [(np.matrix(inputs, dtype=np.float64).reshape(len(inputs), 1),
              np.matrix(expected, dtype=np.float64).reshape(len(expected), 1))
             for inputs, expected in tests]

    for _ in range(iterations):
        #accumulate the weight and bias deltas
        weight_delta = [np.zeros(matrix.shape) for matrix in network.weights]
        bias_delta = [np.zeros(matrix.shape) for matrix in network.bias]

        #iterate over the tests
        for potentials, expected in tests:
            #input the potentials into the network
            #calling the network with trace == True returns a list of matrices,
            #representing the potentials of each layer
            trace = network(potentials, trace=True)
            errors = [expected - trace[-1]]

            #iterate over the layers backwards
            for weight_matrix, layer in reversed(list(zip(network.weights, trace))):
                #compute the error vector for a layer
                errors.append(np.multiply(weight_matrix.transpose()*errors[-1],
                                          network.sigmoid.derivative(layer)))

            #remove the input layer
            errors.pop()
            errors.reverse()

            #compute the deltas for bias and weight
            for index, error in enumerate(errors):
                bias_delta[index] += error
                weight_delta[index] += error * trace[index].transpose()

        #apply the deltas
        for index, delta in enumerate(weight_delta):
            network.weights[index] += delta
        for index, delta in enumerate(bias_delta):
            network.bias[index] += delta
Additionally, here is the code that computes the output, along with my sigmoid function. It is less likely that the bug lies here; I was able to train a network to simulate XOR using simulated annealing.
# the call function of the neural network
def __call__(self, potentials, trace=True):
    #ensure the input is properly formatted
    potentials = np.matrix(potentials, dtype=np.float64).reshape(len(potentials), 1)
    #accumulate the trace
    trace = [potentials]
    #iterate over the weights
    for index, weight_matrix in enumerate(self.weights):
        potentials = weight_matrix * potentials + self.bias[index]
        potentials = self.sigmoid(potentials)
        trace.append(potentials)
    return trace

#The sigmoid function that is stored in the network
def sigmoid(x):
    return np.tanh(x)
#derivative expressed in terms of the tanh output, not the pre-activation
sigmoid.derivative = lambda x : (1-np.square(x))
The problem is the missing step-size parameter. The gradient should be additionally scaled so you don't take the whole step in weight space at once. So instead of network.weights[index] += delta and network.bias[index] += delta, it should be:
def backpropagate(network, tests, stepSize=0.01, iterations=50):
    #...
    network.weights[index] += stepSize * delta
    #...
    network.bias[index] += stepSize * delta