I have a linear system of equations,
A x = b, where A is a large matrix and b is known.
The matrix A is set up in Python.
Now I want to solve this system for x.
A and b are passed to a Fortran 90 program via a shared object. I compiled the Fortran program using numpy.f2py:
import numpy.f2py.f2py2e as f2py2e
import sys, os
sys.argv += "-lmkl_rt -c -m MKL_MODULE MKL_WRAPPER.f90".split()
f2py2e.main()
Finally, I call the f90 subroutine:
MKL_MODULE.mkl_wrapper.call_dgelsd(A, b, np.shape(A)[0], np.shape(A)[1])
When the Fortran routine is called, memory usage doubles, apparently because of an internal copy of A and b.
However, once I have the vector x I am no longer interested in A or b.
Is there any way to pass A to the Fortran program without that internal copy?
I already considered saving A and b to disk and reading them from the Fortran program, but this takes a very long time and is not really an option for matrices of the size I'm dealing with.
No internal copy will be made if the arrays are in F-order; see
[How to force numpy array order to fortran style?][1]
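A minimal sketch of what "F-order" means on the NumPy side, assuming small stand-in arrays in place of the real A and b (the names A_c, A_f and b_f are illustrative only). It shows how to allocate in Fortran order from the start, how to convert an existing array, and how to check the layout before the call. The dtype is also set explicitly, on the assumption that the wrapper expects double precision.

```python
import numpy as np

m, n = 4, 3  # hypothetical small sizes; the real A is much larger

# Preferred: allocate in Fortran (column-major) order and fill in place,
# so that no conversion is needed later.
A_f = np.zeros((m, n), dtype=np.float64, order="F")
A_f[:, :] = np.arange(m * n, dtype=np.float64).reshape(m, n)

# Alternative: convert an existing C-ordered array. This makes one copy
# here, on the Python side, if the layout does not already match.
A_c = np.arange(m * n, dtype=np.float64).reshape(m, n)
A_converted = np.asfortranarray(A_c)

# A 1-D array is both C- and F-contiguous, so b needs no conversion.
b_f = np.ones(m, dtype=np.float64)

print(A_c.flags["F_CONTIGUOUS"], A_f.flags["F_CONTIGUOUS"])
```

Checking `flags["F_CONTIGUOUS"]` just before the call is a cheap way to confirm that the array handed over already has the expected layout.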
I am trying to solve two coupled systems of equations, called here system A and system B. One of the two is an ODE system.
To avoid copying the data shared between the two systems, I would like a structure that behaves like pointers. For that, I use NumPy views.
A bit of code:
import numpy as np
import scipy.integrate
def f(t, y, k):
    return -k * y ### placeholder right-hand side; the post does not show f
t0,t1,dt = 0.0,5.0, 1.0
data = np.ones((5,2))
data[:,1]*=2
y=np.array([0.0,0.0]) ### no matter default value
r = scipy.integrate.ode(f)
r.set_integrator('dopri5', rtol=1e-3, atol=1e-6 )
r.set_f_params(0.05)
#r.set_initial_value(y, t0); r._y = data[2] ### Apparently equivalent
r.set_initial_value(data[2], t0) ### Apparently equivalent
print(np.shares_memory(r.y,y))
print(np.shares_memory(r.y,data))
Here, in the initial state, r.y (system A) and data[2] are synchronized (the variable named data holds the data of system B). If I modify one, the other is modified as well, and vice versa. Typing r.y.base confirms that r.y is just a view of the array named data. That is the behavior I want.
Now the problem starts. If I advance my ODE system:
while r.successful() and r.t < t1:
    r.integrate(r.t+dt, step=True)
    print(r.t+dt,r.y)
    print(np.shares_memory(r.y,data))
    print(data)
data and r.y are no longer synchronized; r.y is no longer a view of data.
It looks like the integrate function creates a new instance of its attribute r.y rather than just updating it. I have read the source code of this function
https://github.com/scipy/scipy/blob/v0.19.1/scipy/integrate/_ode.py#L396
but it quickly hands over to Fortran code, and my understanding stops there.
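The difference suspected above, between rebinding an attribute to a new array and updating the existing array in place, can be reproduced with plain NumPy. The sketch below is not scipy's code; Holder is a hypothetical stand-in for an object that keeps a state array the way r.y is kept.

```python
import numpy as np

class Holder:
    """Hypothetical stand-in for an object that stores a state array."""

    def __init__(self, y):
        # For a float64 input, asarray returns the same array (no copy),
        # so self.y still refers to the caller's memory.
        self.y = np.asarray(y, dtype=float)

    def step_rebind(self):
        # The sum is a new array; the attribute now points at new memory.
        self.y = self.y + 1.0

    def step_inplace(self):
        # The result is written into the existing buffer.
        self.y[...] = self.y + 1.0

data = np.ones((5, 2))
h = Holder(data[2])
shared_at_start = np.shares_memory(h.y, data)

h.step_inplace()
shared_after_inplace = np.shares_memory(h.y, data)

h.step_rebind()
shared_after_rebind = np.shares_memory(h.y, data)

print(shared_at_start, shared_after_inplace, shared_after_rebind)
```

After the in-place step data[2] has changed along with h.y; after the rebinding step only h.y changes, which matches the loss of synchronization seen in the loop.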
How can I solve (or work around) this problem other than by copying data between r.y and data (which also implies managing the synchronization manually)?
Could this be a bug in scipy?
Thanks for your help
I am using the Anaconda suite with ipython 3.6.1 and its accelerate package. It has a cufft sub-package with two functions, fft and ifft. As far as I understand, these take a numpy array as input and return a numpy array, both in system RAM; that is, all GPU memory allocation and all transfers between system and GPU memory are handled automatically, and the GPU memory is released when the function returns. This all seems very nice and works for me. However, I would like to run multiple fft/ifft calls on the same array and extract just one number from the array each time. It would be nice to keep the array in GPU memory to minimize system <-> GPU transfers. Am I correct that this is not possible with this package? If so, is there another package that would do the same? I have noticed the reikna project, but it doesn't seem to be available in Anaconda.
What I am doing (and would like to do efficiently on the GPU) is shown in short here, using numpy.fft:
import math as m
import numpy as np
import numpy.fft as dft
nr = 100
nh = 2**16
h = np.random.rand(nh)*1j
H = np.zeros(nh,dtype='complex64')
h[10] = 1
r = np.zeros(nr,dtype='complex64')
fftscale = m.sqrt(nh)
corr = 0.12j
for i in np.arange(nr):
    r[i] = h[10]
    H = dft.fft(h,nh)/fftscale
    h = dft.ifft(h*corr)*fftscale
r[nr-1] = h[10]
print(r)
Thanks in advance!
So I found Arrayfire which seems rather easy to work with.
I am dealing with big matrices, and from time to time my code ends with a 'killed:9' message in my terminal. I'm working on Mac OS X.
A wise programmer tells me the problem in my code is linked to the stored matrix I am dealing with.
import numpy as np
nn = 35000
dd = 35
XX = np.random.rand(nn,dd)
XX = XX.dot(XX.T) #it should be faster than np.dot(XX,XX.T)
yy = np.random.rand(nn,1)
XX = np.multiply(XX,yy.T)
I have to store this huge matrix XX. My guess is to split the matrix with
upp = np.triu(XX)
Do I actually save space in terms of stored data?
What if later on I store
low = upp.T
am I wasting memory and computational time?
It should take up the same total amount of memory. To avoid the error you are probably looking at a few options:
1. Process batch-wise
If you create your model over the CPLEX API, I believe the data is handled by CPLEX once you have supplied it. So you could split the data, load it piece by piece, and add it to the model consecutively.
2. Allocate memory manually
If you use Cython, you can use the function malloc to allocate memory manually for your array; the size will very likely be no issue then.
Option 1 would be the preferred option in my opinion.
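The statement that the upper triangle takes up the same amount of memory can be checked on a small matrix. This is a sketch with a hypothetical size n = 1000 standing in for nn = 35000: np.triu returns a full-size array with zeros below the diagonal, and the transpose is a view that allocates nothing new.

```python
import numpy as np

n = 1000  # small stand-in for nn = 35000
XX = np.random.rand(n, n)

upp = np.triu(XX)   # full n x n array, zeros below the diagonal
low = upp.T         # a view on upp, no new data buffer

same_size = upp.nbytes == XX.nbytes
transpose_is_view = np.shares_memory(low, upp)

# Size of the full float64 matrix from the question, in GB.
full_size_gb = 35000 * 35000 * 8 / 1e9

print(same_size, transpose_is_view, full_size_gb)
```

So triu saves nothing by itself, while storing upp.T costs nothing extra; the roughly 9.8 GB of the dense 35000 x 35000 matrix is what exhausts memory.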
EDIT:
I constructed a little example. It actually combines the two options. The array is not stored as a Python object, but as a C array and the values are computed piecewise.
I am allocating the memory for the array using Cython and malloc. To run the code you have to install Cython. Then you can open a Python interpreter in the directory where you saved the file and write:
import pyximport;pyximport.install()
import nameofscript
An example for processing your array:
import numpy as np
from libc.stdlib cimport malloc # Allocate memory manually
from cython.parallel import prange # Parallel processing without GIL
dd = 35
# With cdef we can define C variables in Cython.
cdef double **XXN
cdef double y[35000]
cdef int i, j, nn
nn = 35000
# Allocate memory for the Matrix with 1.225 billion double elements
XXN = <double **>malloc(nn * sizeof(double *))
for i in range(nn):
    XXN[i] = <double *>malloc(nn * sizeof(double))
XX = np.random.rand(nn,dd)
for i in range(nn):
    for j in range(nn):
        # Compute the values for the new matrix element by element
        XXN[i][j] = XX[i].dot(XX[j].T)
# Multiply the new matrix with y column wise
for i in prange(nn, nogil=True, num_threads=4):
    for j in range(nn):
        XXN[i][j] = XXN[i][j] * y[i]
Save this file as nameofscript.pyx and run it as described above. I have briefly tested this script and it runs for about half an hour on my machine. You can extend this script and use the result array XXN for your further computations.
A little example for parallelization: I did not initialize y and did not assign any values to it. If you declare y as a C array, you can, e.g., fill it with values from Python objects. Then you can carry out the last multiplication without the GIL, in parallel, as shown in the code sample.
Regarding computational efficiency: this is probably not the fastest way (which may be to write your code entirely against the CPLEX C interface), but it does not throw the memory error and runs in an acceptable time if you do not have to repeat this computation too often.
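The batch-wise idea (option 1) can also be sketched in plain NumPy, without Cython. The helper scaled_gram_blocks and the block size are hypothetical, and the per-block row sum is only a placeholder for whatever the later computation needs from each block. The point is that only block_rows x nn values exist at any one time.

```python
import numpy as np

def scaled_gram_blocks(X, y, block_rows):
    """Yield (start, block), where block holds rows start..start+block_rows
    of (X @ X.T) * y.T without ever forming the full matrix."""
    n = X.shape[0]
    for start in range(0, n, block_rows):
        stop = min(start + block_rows, n)
        yield start, (X[start:stop] @ X.T) * y.T

# Small hypothetical sizes; the question uses nn = 35000, dd = 35.
nn, dd = 200, 35
X = np.random.rand(nn, dd)
y = np.random.rand(nn, 1)

row_sums = np.empty(nn)
for start, block in scaled_gram_blocks(X, y, block_rows=64):
    # Placeholder reduction: keep one number per row, discard the block.
    row_sums[start:start + block.shape[0]] = block.sum(axis=1)
```

With nn = 35000 and 64 rows per block, each block holds 64 x 35000 doubles, about 18 MB, instead of 9.8 GB for the whole matrix.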
Background:
I have a Python script that uses Fortran code for its intensive calculations. I'm using F2Py to do this. One particular Fortran subroutine builds a matrix used in later calculations. This subroutine is called in a loop, and the matrix is solved at each step. A snippet of the code using the essential arrays and variables is given below:
for i in xrange(steps):
    x+=dx
    F_Output=Matrix_Build_F2Py.hamiltonian_solve(array_1, array_2, array_3, array_4)
    #Do things with F_Output
SUBROUTINE Hamiltonian_Solve(array_1, array_2, array_3, array_4, output_array)
!N_Long, N_Short are implied, Work, RWork, LWork, INFO
INTEGER, INTENT(IN), DIMENSION(0:N_Long-1) :: array_1, array_2, array_3
INTEGER, INTENT(IN), DIMENSION(0:N_Short-1) :: array_4
COMPLEX*16,ALLOCATABLE :: Hamiltonian(:,:)
COMPLEX*16, DIMENSION(0:N_Short) :: Complex_Var
DOUBLE PRECISION, INTENT(OUT), DIMENSION(0:N_Short-1) :: E
INTEGER :: LWork, INFO, j
COMPLEX*16, ALLOCATABLE :: Work(:)
ALLOCATE(Hamiltonian(0:N_Short-1, 0:N_Short-1))
ALLOCATE(RWork(MAX(1,3*(N_Short-2))))
ALLOCATE(Work(MAX(1,LWork)))
ALLOCATE(E(0:N_Short-1))
DO h=0, N_Long-1
    Hamiltonian(array_1(h),array_2(h))=Hamiltonian(array_1(h),array_2(h))-Complex_Var(h)
END DO
CALL ZHEEV('N','U',N_Short,Hamiltonian,N_Short,E,Work,LWork,RWork,INFO)
DO j=0,N_Short-1
    Output_Array(j)=E(j)
END DO
END SUBROUTINE
However, for some reason this subroutine crashes my Python program, and generates the following malloc error:
error for object 0x1015f9808: incorrect checksum for freed object - object was probably modified after being freed.
This error is unusual in that it does not occur every time, but only a significant percentage of the time. I have determined that the root of the error lies in the line:
Hamiltonian(array_1(h),array_2(h))=Hamiltonian(array_1(h),array_2(h))-Complex_Var(h)
because if I change it to:
Hamiltonian(array_1(h),array_2(h))=Hamiltonian(array_1(h),array_2(h))
The error stops. However, Complex_Var is essential to the output; otherwise the program simply produces zeroes. This thread bears some similarity to my issue, but that issue seemed to occur on every run, while mine does not. I have taken care to ensure the arrays are not mismatched; other arrangements (i.e. not accounting for numpy's different array formats) immediately create a segmentation fault, as expected.
Question
Why is Complex_Var breaking the code? Why is the problem intermittent rather than systematic? And are there any obvious (or not so obvious) tips to avoid this?
Any help would be much appreciated!
updated per first comment and revision of question:
I see that some arrays in the problem expression have upper bound N_Long-1 (i.e., array_1 and array_2), while the array Complex_Var is dimensioned with N_Short. The loop iterates up to N_Long-1. Do you know that N_Long-1 <= N_Short? If not, you might be accessing an illegal subscript of Complex_Var. And do you know that the values in array_1 and array_2 are always legal subscripts for Hamiltonian? If you write outside the reserved size of that array, you could corrupt the information the memory allocator stored when it created some array, preventing it from freeing that array later.
If this is the problem, your compiler's option for run-time subscript checking (for example, -fcheck=bounds with gfortran) can help you find such errors.
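The same two questions can also be asked on the Python side before the Fortran routine is called. The helper below is a hypothetical sketch, not part of the original code: check_subscripts and its parameter names are made up, and it assumes that subscripts run from 0 to N_Short-1 as in the ALLOCATE statement of the question.

```python
import numpy as np

def check_subscripts(array_1, array_2, n_short, n_complex_var):
    """Raise ValueError if the loop over h could leave the bounds of
    Hamiltonian (0..n_short-1 in each dimension) or of Complex_Var."""
    array_1 = np.asarray(array_1)
    array_2 = np.asarray(array_2)
    if array_1.shape != array_2.shape:
        raise ValueError("array_1 and array_2 differ in length")
    if array_1.size > n_complex_var:
        raise ValueError("loop over h would run past the end of Complex_Var")
    for name, arr in (("array_1", array_1), ("array_2", array_2)):
        if arr.size and (arr.min() < 0 or arr.max() > n_short - 1):
            raise ValueError(
                "%s holds a subscript outside 0..%d" % (name, n_short - 1)
            )

# Passes silently: every subscript is legal for a 3 x 3 Hamiltonian.
check_subscripts([0, 1, 2], [2, 1, 0], n_short=3, n_complex_var=3)
```

A check like this is cheap compared with the eigenvalue solve and turns an intermittent heap corruption into an immediate, readable error.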
It could be because you don't have any deallocate commands. However it is hard to tell with this obviously incomplete code - could you post the actual code (i.e. something that will compile)?
The following program loads two images with PyGame, converts them to Numpy arrays, and then performs some other Numpy operations (such as FFT) to emit a final result (of a few numbers). The inputs can be large, but at any moment only one or two large objects should be live.
A test image is about 10M pixels, which translates to 10MB once it's greyscaled. It gets converted to a Numpy array of dtype uint8, which after some processing (applying Hamming windows), is an array of dtype float64. Two images are loaded into arrays this way; later FFT steps result in an array of dtype complex128. Prior to adding the excessive gc.collect calls, the program memory size tended to increase with each step. Additionally, it seems most Numpy operations will give a result in the highest precision available.
Running the test (sans the gc.collect calls) on my 1GB Linux machine results in prolonged thrashing, which I have not waited out. I don't yet have detailed memory use stats -- I tried some Python modules and the time command to no avail; now I'm looking into valgrind. Watching ps (and dealing with machine unresponsiveness in the later stages of the test) suggests a maximum memory usage of about 800 MB.
A 10 million cell array of complex128 should occupy 160 MB. Having (ideally) at most two of these live at one time, plus the not-insubstantial Python and Numpy libraries and other paraphernalia, probably means allowing for 500 MB.
I can think of two angles from which to attack the problem:
Discarding intermediate arrays as soon as possible. That's what the gc.collect calls are for -- they seem to have improved the situation, as it now completes with only a few minutes of thrashing ;-). I think one can expect that memory-intensive programming in a language like Python will require some manual intervention.
Using less-precise Numpy arrays at each step. Unfortunately the operations that return arrays, like fft2, do not appear to allow the type to be specified.
So my main question is: is there a way of specifying output precision in Numpy array operations?
More generally, are there other common memory-conserving techniques when using Numpy?
Additionally, does Numpy have a more idiomatic way of freeing array memory? (I imagine this would leave the array object live in Python, but in an unusable state.) Explicit deletion followed by immediate GC feels hacky.
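For element-wise operations at least, NumPy ufuncs accept an out= argument, which gives some control over both precision and temporaries. The sketch below applies the two Hamming windows in float32 and in place; the random uint8 array is a hypothetical stand-in for the image returned by pygame.surfarray.array2d. It does not cover fft2.

```python
import numpy as np

# Hypothetical stand-in for the greyscale image array (dtype uint8).
a = np.random.randint(0, 256, size=(300, 200)).astype(np.uint8)

hw_rows = np.hamming(a.shape[0]).astype(np.float32)
hw_cols = np.hamming(a.shape[1]).astype(np.float32)

# One float32 working copy: half the size of the float64 result that
# a * hw would produce with the default float64 window.
w = a.astype(np.float32)

# out=w writes each product back into w, so no further full-size
# temporaries are created.
np.multiply(w, hw_rows[:, None], out=w)
np.multiply(w, hw_cols[None, :], out=w)

print(w.dtype, w.nbytes, a.size * 8)
```

Broadcasting with hw_rows[:, None] and hw_cols[None, :] replaces the two transposes in get_image_data and produces the same windowed values, up to float32 rounding.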
import sys
import numpy
import pygame
import gc
def get_image_data(filename):
    im = pygame.image.load(filename)
    im2 = im.convert(8)
    a = pygame.surfarray.array2d(im2)
    hw1 = numpy.hamming(a.shape[0])
    hw2 = numpy.hamming(a.shape[1])
    a = a.transpose()
    a = a*hw1
    a = a.transpose()
    a = a*hw2
    return a
def check():
    gc.collect()
    print 'check'
def main(args):
    pygame.init()
    pygame.sndarray.use_arraytype('numpy')
    filename1 = args[1]
    filename2 = args[2]
    im1 = get_image_data(filename1)
    im2 = get_image_data(filename2)
    check()
    out1 = numpy.fft.fft2(im1)
    del im1
    check()
    out2 = numpy.fft.fft2(im2)
    del im2
    check()
    out3 = out1.conjugate() * out2
    del out1, out2
    check()
    correl = numpy.fft.ifft2(out3)
    del out3
    check()
    maxs = correl.argmax()
    # argmax indexes the flattened array; unravel_index maps it back to (row, column)
    maxpt = numpy.unravel_index(maxs, correl.shape)
    print correl[maxpt], maxpt, (correl.shape[0] - maxpt[0], correl.shape[1] - maxpt[1])
if __name__ == '__main__':
    args = sys.argv
    exit(main(args))
This post on SO says "Scipy 0.8 will have single precision support for almost all the fft code",
and SciPy 0.8.0 beta 1 is just out.
(Haven't tried it myself, cowardly.)
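A minimal sketch of what single-precision support looks like, assuming a SciPy version recent enough to provide the scipy.fft module (which is newer than the 0.8 release mentioned above). The random float32 array is a hypothetical stand-in for a windowed image. A float32 input yields a complex64 spectrum, half the size of complex128, and rfft2 stores only about half of the spectrum for real input.

```python
import numpy as np
from scipy import fft as sfft

# Hypothetical stand-in for a windowed image, stored in single precision.
im = np.random.rand(256, 128).astype(np.float32)

spec = sfft.fft2(im)     # complex64 result for float32 input
half = sfft.rfft2(im)    # real-input transform, last axis has n//2 + 1 entries

print(spec.dtype, spec.nbytes)
print(half.dtype, half.shape, half.nbytes)
```

For the 10-million-pixel images in the question, complex64 would mean about 80 MB per spectrum instead of 160 MB.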
If I understand correctly, you are calculating a convolution between two images. The Scipy package contains a dedicated module for that (ndimage), which might be more memory efficient than the "manual" approach via Fourier transforms. It would be worth trying it instead of going through Numpy.