Using Theano.scan with shared variables - python

I want to calculate the sumproduct of two arrays in Theano. Both arrays are declared as shared variables and are the result of prior computations. Reading the tutorial, I found out how to use scan to compute what I want using 'normal' tensor arrays, but when I tried to adapt the code to shared arrays I got the error message TypeError: function() takes at least 1 argument (1 given). (See minimal running code example below)
Where is the mistake in my code? Where is my misconception? I am also open to a different approach for solving my problem.
Generally I would prefer a version which takes the shared variables directly, because in my understanding, converting the arrays back to NumPy arrays first and then passing them to Theano again would be wasteful.
Error message producing sumproduct code using shared variables:
import theano
import theano.tensor as T
import numpy
a1 = [1,2,4]
a2 = [3,4,5]
Ta1_shared = theano.shared(numpy.array(a1))
Ta2_shared = theano.shared(numpy.array(a2))
outputs_info = T.as_tensor_variable(numpy.asarray(0, 'float64'))
Tsumprod_result, updates = theano.scan(fn=lambda Ta1_shared, Ta2_shared, prior_value:
                                       prior_value + Ta1_shared * Ta2_shared,
                                       outputs_info=outputs_info,
                                       sequences=[Ta1_shared, Ta2_shared])
Tsumprod_result = Tsumprod_result[-1]
Tsumprod = theano.function(outputs=Tsumprod_result)
print Tsumprod()
Error message:
TypeError: function() takes at least 1 argument (1 given)
Working sumproduct code using non-shared variables:
import theano
import theano.tensor as T
import numpy
a1 = [1, 2, 4]
a2 = [3, 4, 5]
Ta1 = theano.tensor.vector("a1")
Ta2 = theano.tensor.vector("coefficients")
outputs_info = T.as_tensor_variable(numpy.asarray(0, 'float64'))
Tsumprod_result, updates = theano.scan(fn=lambda Ta1, Ta2, prior_value:
                                       prior_value + Ta1 * Ta2,
                                       outputs_info=outputs_info,
                                       sequences=[Ta1, Ta2])
Tsumprod_result = Tsumprod_result[-1]
Tsumprod = theano.function(inputs=[Ta1, Ta2], outputs=Tsumprod_result)
print Tsumprod(a1, a2)

You need to change the compilation line to this one:
Tsumprod = theano.function([], outputs=Tsumprod_result)
theano.function() always needs a list of inputs. If the function takes no inputs, as in this case, you still need to pass an empty list for the inputs.
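Putting that fix into the shared-variable snippet from the question, the whole thing would look like this (untested sketch; only the compilation line is changed, and the lambda arguments are renamed so they no longer shadow the shared variables):
import theano
import theano.tensor as T
import numpy

a1 = [1, 2, 4]
a2 = [3, 4, 5]
Ta1_shared = theano.shared(numpy.array(a1))
Ta2_shared = theano.shared(numpy.array(a2))

outputs_info = T.as_tensor_variable(numpy.asarray(0, 'float64'))
Tsumprod_result, updates = theano.scan(fn=lambda v1, v2, prior_value:
                                       prior_value + v1 * v2,
                                       outputs_info=outputs_info,
                                       sequences=[Ta1_shared, Ta2_shared])
Tsumprod_result = Tsumprod_result[-1]

# The inputs list is empty, but it must still be given.
Tsumprod = theano.function([], outputs=Tsumprod_result)
print Tsumprod()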

Related

Input arguments for MATLAB Engine function

I'm trying to use MATLAB Engine to call a MATLAB function from Python, but I'm having some problems. After managing to deal with NumPy arrays as inputs to the function, I now get an error from MATLAB:
MatlabExecutionError: Undefined function 'simple_test' for input
arguments of type 'int64'.
My Python code is:
import numpy as np
import matlab
import matlab.engine
eng = matlab.engine.start_matlab()
eng.cd()
Nn = 30
x = 250*np.ones((1,Nn))
y = 100*np.ones((1,Nn))
z = 32
xx = matlab.double(x.tolist())
yy = matlab.double(y.tolist())
Output = eng.simple_test(xx,yy,z,nargout=4)
A = np.array(Output[0]).astype(float)
B = np.array(Output[1]).astype(float)
C = np.array(Output[2]).astype(float)
D = np.array(Output[3]).astype(float)
and the Matlab function is:
function [A,B,C,D] = simple_test(x,y,z)
    A = 3*x+2*y;
    B = x*ones(length(x),length(x));
    C = ones(z);
    D = x*y';
end
It's a very simple example, but I'm not able to run it!
I know the problem is in the z variable, because when I define z=32 the error is the one I mentioned, and when I change it to z=32. the error changes to
MatlabExecutionError: Undefined function 'simple_test' for input
arguments of type 'double'.
but I don't know how to define z.

Cython Optimization of Numpy for Loop

I am new to cython and have the following code for a numpy for loop that I am trying to optimize. So far, this Cython code isn't much faster than the numpy for loop.
# cython: infer_types = True
import numpy as np
cimport numpy

DTYPE = np.double

def hdcfTransfomation(scanData):
    cdef Py_ssize_t position
    scanLength = scanData.shape[0]
    hdcfFunction_np = np.zeros(scanLength, dtype = DTYPE)
    cdef double [::1] hdcfFunction = hdcfFunction_np
    for position in range(scanLength - 1):
        topShift = scanData[1 + position:]
        bottomShift = scanData[:-(position + 1)]
        arrayDiff = np.subtract(topShift, bottomShift)
        arraySquared = np.square(arrayDiff)
        arrayMean = np.mean(arraySquared, axis = 0)
        hdcfFunction[position] = arrayMean
    return hdcfFunction
I know that using C math library functions would be better than calling back into numpy (subtract, square, mean), but I am not sure where I can find a list of functions that can be called in this manner.
I have been trying to figure out ways to optimize this code by using different types, etc., but nothing is providing the performance that I think is possible with a fully optimized implementation of Cython.
Here is a working example of the numpy for-loop:
def hdcfTransfomation(scanData):
    scanLength = scanData.shape[0]
    hdcfFunction = np.zeros(scanLength)
    for position in range(scanLength - 1):
        topShift = scanData[1 + position:]
        bottomShift = scanData[:-(position + 1)]
        arrayDiff = np.subtract(topShift, bottomShift)
        arraySquared = np.square(arrayDiff)
        arrayMean = np.mean(arraySquared, axis = 0)
        hdcfFunction[position] = arrayMean
    return hdcfFunction

scanDataArray = np.random.rand(80000, 1)
transformedScan = hdcfTransfomation(scanDataArray)
Always provide as much information as possible (some example data, Python/Cython version, compiler version/settings and CPU model).
Without that it is quite hard to compare any timings. For example, this problem benefits quite well from SIMD vectorization. It will make quite a difference which compiler you use, or whether you want to redistribute a compiled version which should also run on low-end or quite old CPUs (e.g. no AVX).
I am not very familiar with Cython, but I think your main problem is the missing declaration for scanData. Maybe the C compiler needs additional flags like march=native, but the exact syntax is compiler dependent. I am also not sure how Cython or the C compiler optimizes this part:
arrayDiff = np.subtract(topShift, bottomShift)
arraySquared = np.square(arrayDiff)
arrayMean = np.mean(arraySquared, axis = 0)
If those loops (all vectorized commands are actually loops) are not fused, but temporary arrays are created as in pure Python, this will slow down the code. It would be a good idea to create a contiguous 1D array first (eg. scanData=scanData[::1]); a rough sketch of such a typed, fused Cython loop is shown below.
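An untested sketch of what such a typed, fused Cython loop could look like (assuming scanData is passed as a contiguous 1D float64 array and the usual Cython build setup):
# cython: boundscheck=False, wraparound=False
import numpy as np

def hdcfTransfomation_typed(double[::1] scanData):
    cdef Py_ssize_t position, i, n
    cdef Py_ssize_t scanLength = scanData.shape[0]
    cdef double diff, acc
    hdcfFunction_np = np.zeros(scanLength)
    cdef double[::1] hdcfFunction = hdcfFunction_np
    for position in range(scanLength - 1):
        acc = 0.0
        n = scanLength - (position + 1)
        # subtract, square and mean fused into one pass, no temporary arrays
        for i in range(n):
            diff = scanData[i + position + 1] - scanData[i]
            acc += diff * diff
        hdcfFunction[position] = acc / n
    return hdcfFunction_np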
As said, I am not that familiar with Cython, so I tried what is possible with Numba. At least it shows what should also be possible with a reasonably good Cython implementation.
Maybe easier for the compiler to optimize:
import numba as nb
import numpy as np

@nb.njit(fastmath=True, error_model='numpy', parallel=True)
def hdcfTransfomation(scanData):
    # scanData is a 1D array here
    scanLength = scanData.shape[0]
    hdcfFunction = np.zeros(scanLength, dtype=scanData.dtype)
    for position in nb.prange(scanLength - 1):
        topShift = scanData[1 + position:]
        bottomShift = scanData[:scanData.shape[0] - (position + 1)]
        sum = 0.
        jj = 0
        for i in range(scanLength - (position + 1)):
            jj += 1
            sum += (topShift[i] - bottomShift[i])**2
        hdcfFunction[position] = sum / jj
    return hdcfFunction
I also used parallelization here, because the problem is embarrassingly parallel. At least with a size of 80_000 and Numba it doesn't matter if you use a slightly modified version of your code (1D-array), or the code above.
Timings
# Quadcore Core i7-4th gen, Numba 0.4dev, Python 3.6
scanData = np.random.rand(80_000)
# The first call to the function isn't measured (compilation overhead), but the following calls are.
Pure Python: 5900ms
Numba single-threaded: 947ms
Numba parallel: 260ms
Especially for arrays larger than np.random.rand(80_000) there may be better approaches (loop tiling for better cache usage), but for this size it should be more or less OK (at least it fits in the L3 cache).
Naive GPU Implementation
from numba import cuda, float32

@cuda.jit('void(float32[:], float32[:])')
def hdcfTransfomation_gpu(scanData, out_data):
    scanLength = scanData.shape[0]
    position = cuda.grid(1)
    if position < scanLength - 1:
        sum = float32(0.)
        offset = 1 + position
        for i in range(scanLength - offset):
            sum += (scanData[i + offset] - scanData[i])**2
        out_data[position] = sum / (scanLength - offset)

hdcfTransfomation_gpu[scanData.shape[0] // 64, 64](scanData, res_3)
This gives about 400 ms on a GT640 (float32) and 970 ms (float64). For a good implementation, shared arrays should be considered.
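The launch line above assumes scanData and res_3 already live on the GPU; a minimal, untested setup sketch (reusing the names from the snippet, float32 case) could look like this:
import numpy as np
from numba import cuda

scanData_host = np.random.rand(80_000).astype(np.float32)
scanData = cuda.to_device(scanData_host)
res_3 = cuda.device_array(scanData_host.shape[0], dtype=np.float32)

# Round the block count up so every position gets a thread.
blocks = (scanData_host.shape[0] + 63) // 64
hdcfTransfomation_gpu[blocks, 64](scanData, res_3)
result = res_3.copy_to_host()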
Putting Cython aside, does this do the same thing as your current code but without a for loop? We can tighten it up and correct for inaccuracies, but the first port of call is to try applying operations in numpy to 2D arrays before turning to Cython for for loops. It's too long to put in a comment.
import numpy as np
# Setup
arr = np.random.choice(np.arange(10), 100).reshape(10, 10)
top_shift = arr[:, :-1]
bottom_shift = arr[:, 1:]
arr_diff = top_shift - bottom_shift
arr_squared = np.square(arr_diff)
arr_mean = arr_squared.mean(axis=1)

Python gsl_vector_set

I'm trying to initialize two vectors in memory using gsl_vector_set(). In the main code they are initialized to zero by default, but I wanted to initialize them to some non-zero value. I made a test code based on a working function that uses gsl_vector_set().
from ctypes import *;
gsl = cdll.LoadLibrary('libgsl-0.dll');
gsl.gsl_vector_get.restype = c_double;
gsl.gsl_matrix_get.restype = c_double;
gsl.gsl_vector_set.restype = c_double;
foo = dict(
    x_ht = [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
            0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
    x_ht_m = [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
              0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
);

for f in range(0,18):
    gsl.gsl_vector_set(foo['x_ht_m'],f,c_double(1.0));
    gsl.gsl_vector_set(foo['x_ht'],f,c_double(1.0));
When I run the code I get this error.
ArgumentError: argument 1: <type 'exceptions.TypeError'>: Don't know how to convert parameter 1
I'm new to using ctypes and the GSL functions, so I'm not sure what the issue is or what the error message means. I am also not sure whether there is a better way I should be trying to save a vector to memory.
Thank you @CristiFati for pointing out that I needed gsl_vector_calloc in my test code. I noticed that in the main code I was working in, the vector I needed to set was
NAV.KF_dictnry['x_hat_m']
instead of
NAV.KF_dictnry['x_ht_m']
So I fixed the test code to mirror the real code a bit better by creating a class holding the dictionary, and added the ability to change each value in the vector to an arbitrary value.
from ctypes import *;
gsl = cdll.LoadLibrary('libgsl-0.dll');
gsl.gsl_vector_get.restype = c_double;
gsl.gsl_matrix_get.restype = c_double;
gsl.gsl_vector_set.restype = c_double;
class foo(object):
    fu = dict(
        x_hat = gsl.gsl_vector_calloc(c_size_t(18)),
        x_hat_m = gsl.gsl_vector_calloc(c_size_t(18)),
    );

x_ht = [1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,
        1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]
x_ht_m = [1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,
          1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]

for f in range(0,18):
    gsl.gsl_vector_set(foo.fu['x_hat_m'],f,c_double(x_ht_m[f]));
    gsl.gsl_vector_set(foo.fu['x_hat'],f,c_double(x_ht[f]));
After running I checked with:
gsl.gsl_vector_get(foo.fu['x_hat_m'],0)
and got out a 1.0 (worked for the entire vector).
Turned out to just be some stupid mistakes on my end.
Thanks again!
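A side note for anyone copying this: ctypes assumes a C int return type unless told otherwise, so on 64-bit builds it is safer to declare the pointer-returning and pointer-taking GSL calls explicitly. An untested sketch of those declarations:
from ctypes import cdll, c_void_p, c_double, c_size_t

gsl = cdll.LoadLibrary('libgsl-0.dll')

# Declare return/argument types so the gsl_vector* handles are not truncated.
gsl.gsl_vector_calloc.restype = c_void_p
gsl.gsl_vector_calloc.argtypes = [c_size_t]
gsl.gsl_vector_set.restype = None            # gsl_vector_set returns void
gsl.gsl_vector_set.argtypes = [c_void_p, c_size_t, c_double]
gsl.gsl_vector_get.restype = c_double
gsl.gsl_vector_get.argtypes = [c_void_p, c_size_t]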

Passing extra arguments to broyden1

I'm trying to execute scipy's broyden1 function with extra parameters (called "data" in the example); here is the code:
data = [radar_wavelen, satpos, satvel, ellipsoid_semimajor_axis, ellipsoid_semiminor_axis, srange]
target_xyz = broyden1(Pixlinexyx_2Bsolved, start_xyz, args=data)
def Pixlinexyx_2Bsolved(target, *data):
    radar_wavelen, satpos, satvel, ellipsoid_semimajor_axis, ellipsoid_semiminor_axis, srange = data
    print target
    print radar_wavelen, satpos, satvel, ellipsoid_semimajor_axis, ellipsoid_semiminor_axis, srange
Pixlinexyx_2Bsolved is the function whose root I want to find.
start_xyz is initial guess of the solution:
start_xyz = [4543557.208584103, 1097477.4119051248, 4176990.636060918]
And data is this list containing a lot of numbers that will be used inside the Pixlinexyx_2Bsolved function:
data = [0.056666, [5147114.2523595653, 1584731.770061729, 4715875.3525346108], [5162.8213179936156, -365.24378919717839, -5497.6237250296626], 6378144.0430000005, 6356758.789000001, 850681.12442702544]
When I call the function broyden1 (as in the second line of the example code) I get the following error:
target_xyz = broyden1(Pixlinexyx_2Bsolved, start_xyz, args=data)
File "<string>", line 5, in broyden1
TypeError: __init__() got an unexpected keyword argument 'args'
What am I doing wrong?
Now, looking at the documentation of fsolve, it seems to be able to take extra args for the callable func... Here is a similar question to mine.
There is a similar question at scipy's issue tracker, including a solution using Python's functools module (see PEP 309 -- Partial Function Application).
Small example based on the above link and the original problem from the docs:
import numpy as np
import scipy.optimize

""" No external data """
def F(x):
    return np.cos(x) + x[::-1] - [1, 2, 3, 4]

x = scipy.optimize.broyden1(F, [1,1,1,1], f_tol=1e-14)
print(x)

""" External data """
from functools import partial

def G(data, x):
    return np.cos(x) + x[::-1] - data

data = [1,2,3,4]
G_partial = partial(G, data)
x = scipy.optimize.broyden1(G_partial, [1,1,1,1], f_tol=1e-14)
print(x)
Out
[ 4.04674914 3.91158389 2.71791677 1.61756251]
[ 4.04674914 3.91158389 2.71791677 1.61756251]
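Applied to the question's own function, the same idea could look roughly like this (untested sketch using the names from the question; it assumes Pixlinexyx_2Bsolved actually returns the residual rather than only printing it, and since the extra data comes after the unknown here, a small wrapper or lambda does the binding instead of partial):
import scipy.optimize

# Bind the trailing extra arguments; broyden1 then only sees f(target).
def Pixlinexyx_wrapped(target):
    return Pixlinexyx_2Bsolved(target, *data)

target_xyz = scipy.optimize.broyden1(Pixlinexyx_wrapped, start_xyz)

# Equivalent one-liner:
# target_xyz = scipy.optimize.broyden1(lambda t: Pixlinexyx_2Bsolved(t, *data), start_xyz)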

How to do indexing properly in numba.cuda?

I am writing code which needs to do some indexing in Python using numba.
However, I cannot do it correctly; it seems something is prohibited.
The code is as follows:
from numba import cuda
import numpy as np

@cuda.jit
def function(output, size, random_array):
    i_p, i_k1, i_k2 = cuda.grid(3)
    if i_p < size and i_k1 < size and i_k2 < size:
        a1 = i_p**2 + i_k1
        a2 = i_p**2 + i_k2
        a3 = i_k1**2 + i_k2**2
        a = [a1, a2, a3]
        for i in range(len(random_array)):
            output[i_p, i_k1, i_k2, i] = a[int(random_array[i])]

output = cuda.device_array((10, 10, 10, 5))
random_array = cuda.to_device(np.array([np.random.random()*3 for i in range(5)]))
size = 10
threadsperblock = (8, 8, 8)
blockspergridx = (size + (threadsperblock[0] - 1)) // threadsperblock[0]
blockspergrid = (blockspergridx, blockspergridx, blockspergridx)

# Start the kernel
function[blockspergrid, threadsperblock](output, size, random_array)
print(output.copy_to_host())
It yields an error:
LoweringError: Failed at nopython (nopython mode backend)
'CUDATargetContext' object has no attribute 'build_list'
File "<ipython-input-57-6058e2bfe8b9>", line 10
[1] During: lowering "$40.21 = build_list(items=[Var(a1, <ipython-input-57-6058e2bfe8b9> (7)), Var(a2, <ipython-input-57-6058e2bfe8b9> (8)), Var(a3, <ipython-input-57-6058e2bfe8b9> (9))])" at <ipython-input-57-6058e2bfe8b9> (10
Can anyone help me with this?
One option is to also feed a as an input to the function, but when a is really large, like a 1000*1000*1000*7 array, it always gives me an out-of-memory error.
The problem has nothing to do with array indexing. Within the kernel, this line:
a=[a1,a2,a3]
is not supported. You cannot create a list within a @cuda.jit function. The exact list of supported Python types within kernels is fully documented here.
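One possible workaround, as an untested sketch (assuming float64 data and otherwise keeping the kernel unchanged), is to replace the Python list with a fixed-size per-thread local array:
from numba import cuda, float64

@cuda.jit
def function(output, size, random_array):
    i_p, i_k1, i_k2 = cuda.grid(3)
    if i_p < size and i_k1 < size and i_k2 < size:
        # Fixed-size local storage instead of a Python list.
        a = cuda.local.array(3, float64)
        a[0] = i_p**2 + i_k1
        a[1] = i_p**2 + i_k2
        a[2] = i_k1**2 + i_k2**2
        for i in range(random_array.shape[0]):
            output[i_p, i_k1, i_k2, i] = a[int(random_array[i])]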
