I am working on a project where I have a .so (shared object) containing a bunch of functions that deal with matrix-matrix addition and matrix-matrix multiplication. I LD_PRELOAD this .so file while running a Python script that uses TensorFlow.
In a nutshell, instead of using the matrix-multiplication functions built into TensorFlow, I am using the ones in this .so file (MKL-enabled TensorFlow uses the libmklml_intel.so library for its matrix-multiply functions, via the cblas interface).
Now, the problem at hand is that I need to monitor three things:
What are the dimensions of matrices that are multiplied?
How many times is this function called?
What percentage of the script's total run time is spent in this function?
I can think of two approaches:
Trace where these function calls are made from within the TensorFlow framework, then make changes there to extract the required information. But I feel this approach involves modifications in multiple places, as there is not just one single function that calls cblas. Additionally, I would have to dig through the TensorFlow source code, which is a huge task, and it may also call for building TF from source again.
Alternatively, somehow monitor the interaction of the .so file with TF and register all the data sent with each call, maybe with a clever shell script or with a tool that I am unaware of. What are my best options? Would perf be of any help here?
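One idea I had: since the .so I preload is under my control, each wrapped cblas call could append a line such as "M N K seconds" to a log file, which I could then aggregate afterwards. A hypothetical sketch of the aggregation side in Python (the log format and file name are made up for illustration):

from collections import Counter

# Hypothetical log: one line per intercepted GEMM call, "M N K seconds".
calls, dims, total = 0, Counter(), 0.0
with open("gemm_calls.log") as f:
    for line in f:
        m, n, k, secs = line.split()
        calls += 1
        dims[(int(m), int(n), int(k))] += 1
        total += float(secs)

print(f"calls: {calls}, total time in GEMM: {total:.3f} s")
for shape, count in dims.most_common():
    print(shape, count)

Dividing the accumulated GEMM time by the script's overall wall-clock time would then give the percentage I am after.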
Thanks in Advance
Newbie here
I am working on a machine learning project in Python, and I often find myself rerunning some algorithm with different tweaks each time (changing a few parameters, different normalization, some extra feature engineering, etc.). Each time, most of the computation is similar except for a few steps. I can, of course, save some intermediate states to disk and load them next time instead of computing the same thing over and over again.
The thing is that there are so many such intermediate results that manually saving them and keeping a record of them would be a pain. I looked at a Python memoization decorator here that can make things a bit easier. However, the problem with that implementation is that it always returns the result from the first time you called the function, even when the function has arguments and should therefore produce different results for different arguments. I really need to memoize the output of a function for different arguments.
I googled extensively on this topic, and the closest thing I found is IncPy by Philip Guo. IncPy (Incremental Python) is an enhanced Python interpreter that speeds up script execution times by automatically memoizing (caching) the results of long-running function calls and then re-using those results, rather than re-computing them, when it is safe to do so.
I really like the idea and think it would be very useful for data science and machine learning, but the code was written nine years ago for Python 2.6 and is no longer maintained.
So my question is: are there any other automatic caching/memoization techniques in Python that can handle relatively large datasets?
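For what it's worth, joblib's Memory class implements exactly this kind of argument-aware, disk-backed memoization, and it is designed with large numpy arrays in mind. A minimal sketch (the function name and cache directory are placeholders):

import numpy as np
from joblib import Memory

# Cache results on disk, keyed by the function's arguments.
memory = Memory(location="./cachedir", verbose=0)

@memory.cache
def preprocess(data, normalization="zscore"):
    # Stand-in for an expensive step; different arguments
    # produce (and cache) different results.
    if normalization == "zscore":
        return (data - data.mean()) / data.std()
    return data / np.abs(data).max()

x = np.random.rand(1000)
a = preprocess(x)                        # computed and cached
b = preprocess(x, normalization="max")   # different args -> new cache entry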
I am using the scipy minimize function. The function that it's calling was compiled with Cython and has an underlying C++ implementation that I wrote, but that shouldn't really matter. For some reason, when I run my program, it creates as many threads as it can to fill all my CPUs. For example, if I run top I see that 800% of a CPU is being used, or on htop I can see that 8 individual processors are being used, when I only wrote the program to run on one. I didn't think that scipy even had parallel-processing functionality, and I can't find any documentation related to this. What could possibly be going on, and is there any way to control it?
If some BLAS implementation (with threading support) is available (the default on Ubuntu, for example), some expressions like np.dot() (only the dense case, as far as I know) will automatically run in parallel (reference). Another possible example is sparse-matrix factorization with SuperLU.
Of course, different minimizers will behave differently.
Newton-type methods (core task: solving a system of sparse linear equations) are probably based on SuperLU (if the code is not one of the common old Fortran/C ones, where the whole code is self-contained). CG-type methods are heavily based on matrix-vector products (np.dot), so the dense case will run in parallel.
For some control over this, start with this SO question.
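As a concrete starting point, the BLAS thread count can usually be capped with environment variables set before numpy is imported, or scoped at runtime with the threadpoolctl package. A minimal sketch (which variables matter depends on the BLAS your numpy is linked against):

import os

# These must be set before numpy (and its BLAS) is first imported.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(2000, 2000)

# Alternatively, limit the BLAS thread pool only for a specific block:
with threadpool_limits(limits=1, user_api="blas"):
    b = a @ a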
I'm currently working on a project where I use Python functions, wrapped in a module, within MATLAB code. The MATLAB part of the code is an MCMC (Markov chain Monte Carlo) computation with multiple chains, so to speed up the code I'm using a parfor loop on a cluster.
To be more specific, the algorithm can be thought of as follows:
for i = 1:number of chain steps (== number of iterations)
    parfor j = 1:number of chains (== number of workers)
        load the module and use the Python functions defined there
    end
    do something related to the evolution of the chains
end
My problem is that the only way I have found for MATLAB to use the Python-defined functions is to reload the Python module in each parfor iteration; but, the way the code works, this also means once per chain step (the parfor is nested inside), and that is where I lose time.
My question is: is there a smarter, faster way to use Python libraries within MATLAB (something equivalent to MEX?)? Otherwise, is there a way to "store" the Python module info in each worker at the beginning, so there is no need to reload the module every time I step forward in the outer loop?
Any hint will be really really appreciated!! Thanks a lot
Giulia
I believe you're looking for pctRunOnAll. From the Matlab documentation:
This is useful if there are setup changes that need to be performed on all the workers and the client.
You should be able to modify your algorithm with
pctRunOnAll load the module
for i = 1:number of chain steps (== number of iterations)
    parfor j = 1:number of chains (== number of workers)
        use the Python functions from the pre-loaded module
    end
    do something related to the evolution of the chains
end
You might be able to take advantage of parallel.pool.Constant here. This allows you to set up some "constant" data to be used by multiple iterations of a parfor loop, even multiple parfor loops. The linked reference page shows you how to build a parallel.pool.Constant using a function handle - you probably want that to be a function handle that loads your module.
My python code is performing fairly complex numerical calculations, and in many cases I am unable to provide known solutions to enable unit testing (especially for intermediate results).
However, I have found that I can catch a lot of bugs with nose, by performing regression testing using the following workflow:
Write test code to solve some relatively small problem
Run once, inspect the results (often in the form of a matplotlib plot), and decide by comparison with analytical results or other numerical software or physical intuition that the results are correct to within acceptable numerical accuracy.
Save the resulting numpy arrays to text files to act as a reference (FWIW I was avoiding numpy's saving routines as a workaround for this bug, but as this has been fixed in a released version, I think I can use them now).
The test code performs the calculation, and compares it with the reference data read in from the file using numpy's assert_allclose.
The test function is written in such a way that, by default, it performs the test, but by passing non-default values for its arguments I can plot the results and overwrite the reference file if that becomes necessary. The reference file is checked into git, so there is little risk of accidentally overwriting the test values without noticing.
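In outline, each such test ends up looking something like the following sketch (compute_solution, the reference file name, and the tolerances are placeholders):

import numpy as np
from numpy.testing import assert_allclose

REF_FILE = "reference_result.txt"

def compute_solution():
    # Stand-in for the actual numerical calculation under test.
    x = np.linspace(0.0, 1.0, 101)
    return np.sin(2 * np.pi * x)

def test_solution(update_reference=False, plot=False):
    result = compute_solution()
    if plot:
        import matplotlib.pyplot as plt
        plt.plot(result)
        plt.show()
    if update_reference:
        np.savetxt(REF_FILE, result)  # deliberately overwrite the reference
        return
    assert_allclose(result, np.loadtxt(REF_FILE), rtol=1e-10, atol=1e-12)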
However, I find myself writing a lot of boilerplate code to implement the above functionality, which outweighs the actual test code itself. Cleaning this up would make it much easier to increase test coverage.
Is there some python testing framework or plugin for nose that could easily automate the above workflow?
A few months ago I wrote the nrtest utility in an attempt to make this workflow easier. It sounds like it might help you too.
Here's a quick overview. Each test is defined by its input files and its expected output files. Following execution, output files are stored in a portable benchmark directory. A second step then compares this benchmark to a reference benchmark. A recent update has enabled user extensions, so you can define comparison functions for your custom data.
I hope it helps.
I'm looking into speeding up my python code, which is all matrix math, using some form of CUDA. Currently my code is using Python and Numpy, so it seems like it shouldn't be too difficult to rewrite it using something like either PyCUDA or CudaMat.
However, on my first attempt using CudaMat, I realized I had to rearrange a lot of the equations in order to keep the operations all on the GPU. This included the creation of many temporary variables so I could store the results of the operations.
I understand why this is necessary, but it turns what were once easy-to-read equations into somewhat of a mess that is difficult to inspect for correctness. Additionally, I would like to be able to easily modify the equations later on, which isn't easy in their converted form.
The package Theano manages to do this by first creating a symbolic representation of the operations, then compiling them to CUDA. However, after trying Theano out for a bit, I was frustrated by how opaque everything was. For example, just getting the actual value of myvar.shape[0] is made difficult, since the tree doesn't get evaluated until much later. I would also much prefer less of a framework, in which my code must conform to a library that acts invisibly in place of Numpy.
Thus, what I would really like is something much simpler. I don't want automatic differentiation (there are other packages like OpenOpt that can do that if I require it), or optimization of the tree, but just a conversion from standard Numpy notation to CudaMat/PyCUDA/somethingCUDA. In fact, I want to be able to have it evaluate to just Numpy without any CUDA code for testing.
I'm currently considering writing this myself, but before even considering such a venture, I wanted to see if anyone else knows of similar projects or a good starting place. The only other project I know of that might be close to this is SymPy, but I don't know how easy it would be to adapt for this purpose.
My current idea would be to create an array class that looks like a Numpy array class. Its only function would be to build a tree. At any time, that symbolic array class could be converted to a Numpy array class and evaluated (there would also be one-to-one parity). Alternatively, the array class could be traversed and have CudaMat commands generated from it. If optimizations are required, they can be done at that stage (e.g. re-ordering of operations, creation of temporary variables, etc.) without getting in the way of inspecting what's going on.
Any thoughts/comments/etc. on this would be greatly appreciated!
Update
A usage case may look something like the following (where sym is the theoretical module, and we might be doing something such as calculating a gradient):
W = sym.array(np.random.rand(numVisible, numHidden))
delta_o = -(x - z)
delta_h = sym.dot(delta_o, W)*h*(1.0-h)
grad_W = sym.dot(X.T, delta_h)
In this case, grad_W would actually just be a tree containing the operations that needed to be done. If you wanted to evaluate the expression normally (i.e. via Numpy) you could do:
npGrad_W = grad_W.asNumpy()
which would just execute the Numpy commands that the tree represents. If on the other hand, you wanted to use CUDA, you would do:
cudaGrad_W = grad_W.asCUDA()
which would convert the tree into expressions that can be executed via CUDA (this could happen in a couple of different ways).
That way it should be trivial to: (1) test grad_W.asNumpy() == grad_W.asCUDA(), and (2) convert your pre-existing code to use CUDA.
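To make the idea concrete, here is a bare-bones sketch of such a deferred-evaluation wrapper (all names are hypothetical, and only the Numpy back end is shown; the CUDA back end would walk the same tree and emit CudaMat/PyCUDA calls instead):

import numpy as np

class SymArray:
    # Records operations in a tree instead of executing them.
    def __init__(self, op, args):
        self.op, self.args = op, args

    @staticmethod
    def array(value):
        return SymArray("leaf", [value])

    def __mul__(self, other):
        return SymArray("mul", [self, other])

    def __sub__(self, other):
        return SymArray("sub", [self, other])

    @staticmethod
    def dot(a, b):
        return SymArray("dot", [a, b])

    def asNumpy(self):
        # Evaluate the recorded tree with plain Numpy.
        if self.op == "leaf":
            return self.args[0]
        ev = lambda v: v.asNumpy() if isinstance(v, SymArray) else v
        a, b = (ev(v) for v in self.args)
        return {"mul": np.multiply, "sub": np.subtract, "dot": np.dot}[self.op](a, b)

x = SymArray.array(np.ones((3, 3)))
y = SymArray.dot(x, x) - x * x   # builds a tree, computes nothing yet
print(y.asNumpy())               # evaluates via Numpy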
Have you looked at the GPUArray portion of PyCUDA?
http://documen.tician.de/pycuda/array.html
While I haven't used it myself, it seems like it would be what you're looking for. In particular, check out the "Single-pass Custom Expression Evaluation" section near the bottom of that page.
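To give a flavor, here is a minimal sketch combining GPUArray arithmetic with an ElementwiseKernel for the single-pass evaluation (the fused kernel here is just an example expression):

import numpy as np
import pycuda.autoinit            # creates a CUDA context
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

x = gpuarray.to_gpu(np.random.randn(1024).astype(np.float32))
y = gpuarray.to_gpu(np.random.randn(1024).astype(np.float32))

# Numpy-like arithmetic on the GPU; each operation is a separate kernel launch.
z = 2 * x + y

# The same expression fused into a single kernel pass:
fused = ElementwiseKernel(
    "float *out, float *x, float *y",
    "out[i] = 2.0f * x[i] + y[i]",
    "fused_axpy")
out = gpuarray.empty_like(x)
fused(out, x, y)

np.testing.assert_allclose(out.get(), z.get(), rtol=1e-6)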