CPP Call slower than Python Interface Call - python

The Github Page of Caffe contains a Windows Branch. I have taken this branch and created a Windows DLL. It is losely based on https://github.com/BVLC/caffe/blob/master/examples/cpp_classification/classification.cpp.
The DLL works and outputs correct classification results. But it is 1.5-5 times slower than the pyCaffe interface. It is very interesting that the pyCaffe Interface takes around 1 second for four images using AlexNet on all computers tested. The DLL time ranges from 1.5 seconds to 2 seconds to 4 seconds.
We have measured the time before and after the loop (using Easily measure elapsed time) of
template <typename Dtype> Dtype Net<Dtype>::ForwardFromTo(int start, int end)
This function resides in https://github.com/BVLC/caffe/blob/master/src/caffe/net.cpp and is called by the CPP and Python Code.
We have compiled Caffe as 32-bit programm without GPU support using Visual Studio 2013
Possible things we have checked so far.
Compiler Optimizations
The data
OS and computer configurations (like CPU/Memory etc.)
We have measured multiple times in one execution, such that the benchmark is more stable.
We have also profiled the code using CodeXL but I could not find anything unusual, but that of course is a little bit vague.

We concluded following: Caffe uses GLog. GLog has Fatal Warning which may look like this
CHECK(a<=b) << "a must be bigger than b";
These warnings let the program crash and are hardly catchable. For that we have created a class to replace GLog. It is fairly simple and uses std::stringstream. Google has done something clever. Whenever the condition is true, the right hand side is not evaluated.
https://github.com/google/glog/blob/de6149ef8e67b064a433a8b88924fa9f606ad5d5/src/windows/glog/logging.h#L569
They solved it using the (void) 0. We missed that part. When I wanted to post the profiling data here, I recognised that some time is lost due to the << operator. We started looking at the profiling data closer and increased the number of function calls, such that every number gets a little bit bigger and clearer. This then has lead us to the solution.

Related

Numba bytecode generation for generic x64 processors? Rather than 1st run compiling a SLOW #njit(cache=True) argument

I have a pretty large project converted to Numba, and after run #1 with #nb.njit(cache=True, parallel=True, nogil=True), well, it's slow on run #1 (like 15 seconds vs. 0.2-1 seconds after compiling). I realize it's compiling byte code optimized for the specific PC I'm running it on, but since the code is distributed to a large audience, I don't want it to take forever compiling the first run after we deploy our model. What is not covered in the documentation is a "generic x64" cache=True method. I don't care if the code is a little slower on a PC that doesn't have my specific processor, I only care that the initial and subsequent runtimes are quick, and prefer that they don't differ by a huge margin if I distribute a cache file for the #njit functions at deployment.
Does anyone know if such a "generic" x64 implementation is possible using Numba? Or are we stuck with a slow run #1 and fast ones thereafter?
Please comment if you want more details; basically it's around a 50 lines of code function that gets JIT compiled via Numba and afterwards runs quite fast in parallel with no GIL. But I'm willing to give up some extreme optimization if the code can work in a generic form across multiple processors. As where I work, the PCs can vary quite a bit in terms of how advanced they are.
I looked briefly at the AOT (ahead of time) compilation of Numba functions, but these functions, in this case, have so many variables being altered I think it would take me weeks to decorate properly the functions to be compiled without a Numba dependency. I really don't have the time to do AOT, it would make more sense to just rewrite in Cython the whole algorithm, and that's more like C/C++ and more time consuming that I want to devote to this project. Unfortunately there is not (to my knowledge) a Numba -> Cython compiler project out there already. Maybe there will be in the future (which would be outstanding), but I don't know of such a project out there currently.
Unfortunately, you mainly listed all the current available options. Numba functions can be cached and the signature can be specified so to perform an eager compilation (compilation at the time of the function definition) instead of a lazy one (compilation during the first execution). Note that the cache=True flag is only meant to skip the compilation when it as already been done on the same platform before and not to share the code between multiple machine. AFAIK, the internal JIT used by Numba (LLVM-Lite) does not support that. In fact, it is exactly the purpose of the AOT compilation to do that. That being said, the AOT compilation requires the signatures to be provided (this is mandatory whatever the approach/tool used as long as the function is compiled) and it has quite strong limitations (eg. currently there is no support for parallel codes and fastmath). Keep in mind that Numba’s main use case is Just-in-Time compilation and not the ahead-of-time compilation.
Regarding your use-case, using Cython appears to make much more sense: the functions are pre-compiled once for some generic platforms and the compiled binaries can directly be provided to users without the need for recompilation on the target machine.
I don't care if the code is a little slower on a PC that doesn't have my specific processor.
Well, regarding your code, using a "generic" x86-64 code can be much slower. The main reasons lie in the use of SIMD instructions. Indeed, x86-64 processors all supports the SSE2 instruction set which provide basic 128-bit SIMD registers working on integers and floating-point numbers. Since about a decade, x86-processors supports the 256-bit AVX instruction set which significantly speed up floating-point computations. Since at least 7 years, almost all mainstream x86-64 processors supports the AVX-2 instruction set which mainly speed up integer computations (although it also improves floating-point thanks to new features). Since nearly a decade, the FMA instruction set can speed up codes using fuse-multiply adds by a factor of 2. Recent Intel processors support the 512-bit AVX-512 instruction set which not only double the number of items that can be computed per instruction but also adds many useful features. In the end, SIMD-friendly codes can be up to an order of magnitude faster with the new instruction sets compared to the obsolete "generic" SSE2 instruction set. Compilers (eg. GCC, Clang, ICC) are meant to generate a portable code by default and thus they only use SSE2 by default. Note that Numpy already uses such "new" features to speed up a lot many functions (see sorts, argmin/argmax, log/exp, etc.).

Is Python sometimes simply not fast enough for a task?

I noticed a lack of good soundfont-compatible synthesizers written in Python. So, a month or so ago, I started some work on my own (for reference, it's here). Making this was also a challenge that I set for myself.
I keep coming up against the same problem again and again and again, summarized by this:
To play sound, a stream of data with a more-or-less constant rate of flow must be sent to the audio device
To synthesize sound in real time based on user input, little-to-no buffering can be used
Thus, there is a cap on the amount of time one 'buffer generation loop' can take
Python, as a language, simply cannot run fast enough to do synthesize sound within this time limit
The problem is not my code, or at least, I've tried to optimize it to extreme levels - using local variables in time-sensitive parts of the code, avoiding using dots to access variables in loops, using itertools for iteration, using pre-compiled macros like max, changing thread switching parameters, doing as few calculations as possible, making approximations, this list goes on.
Using Pypy helps, but even that starts to struggle after not too long.
It's worth noting that (at best) my synth at the moment can play about 25 notes simultaneously. But this isn't enough. Fluidsynth, a synth written in C, has a cap on the number of notes per instrument at 128 notes. It also supports multiple instruments at a time.
Is my assertion that Python simply cannot be used to write a synthesizer correct? Or am I missing something very important?

Speeding up loop with fsolve using PyPy or Cython?

I have a Python script containing a loop with a lot of calls to scipy.optimize.fsolve (99 (55 + 54) times per time step, and right now I need around 10^5 time steps). The rest of the script isn't very fast either, but as far as I can tell from the output of the Spyder Profiler, the calls to fsolve are by far the most time consuming. With my current settings, running the script takes over 10 hours, so I could do with a little speed-up.
Based on what I read here and elsewhere on internet, I already gave PyPy a first try: installed it in a separate environment under conda (MacOS 10.15.5, pypy 7.3.1 and pypy3.6 7.3.1), along with its own versions of numpy, scipy, and pandas, but so far it's actually a bit slower than just Python (195 s vs 171 s, for 100 time steps).
From what I read here (PyPy Status Blog, October '17), this might have to do with using numpy instead of numpypy, and/or repeated allocation of temporary arrays. Apart from calling fsolve 10 million+ times, I'm also using quite a lot of numpy arrays, so that makes sense as far as I can see.
The thing is, I'm not a developer, and I'm completely new to PyPy, so terms like JIT traces don't mean much to me, and deciphering what's in them is likely going to be challenging for me. Moreover, what used to be true in October 2017 may no longer be relevant now. Also, even if I'd manage to speed up the numpy array bit, I'm still not sure about the fsolve part.
Can anyone indicate whether it could be worthwhile for me to invest time in PyPy? Or would Cython be more suitable in this case? Or even mpi4py?
I'd happily share my code if that helps, but it includes a module of over 800 lines of code, so just including it in this post didn't seem like a great idea to me.
Many thanks!
Sita
EDIT: Thanks everyone for your quick and kind response! That's a fair point, about needing to see my code, I put it here (link valid until 19 June 2020). Arterial_1D.py is the module, CoronaryTree.py is the script that calls Arterial_1D.py. For a minimal working example, I put in one extra line, to be uncommented in that case (clearly marked in the code). Also, I set the number of time steps to 100, to have the code run in reasonable time (0.61 s for the minimal example, 37.3 s for the full coronary tree, in my case).
EDIT 2: Silly of me, in my original post I mentioned times of 197 and 171 s for running 100 steps of my code using PyPy and Python, respectively, but in that case I called Python from within the PyPy environment, so it was using the PyPy version of Numpy. From within my base environment running 100 steps takes a little over 30 s. So PyPy is A LOT slower than Python in this case, which motivates me to look into this PyPy Status Blog post anyway.
We can't really help you to optimize without looking at your code. But since you have quite the description going on up there, let me reply with what I think you can try to speed things up.
First thing's first. The Scipy library.
From the source for scipy.optimize.fsolve, it wraps around MINPACK's hybrd and hybrj algorithms which are considerably fast FORTRAN subroutines. So in your case, switching to PyPy is not going to do much good, if any at all for this identified bottleneck.
What can you do to optimize your Scipy fsolve ? One of the most obvious numerical thing to do is to vectorize your function's args. But seems that you are running a sort of time step algorithm and Most standard time stepping algos are not able to vectorize in time. IF your 'XX times per time step' is a sort of implicit spatial loop per time step (i.e. your grid), you can consider vectorizing this to achieve some gains in speed. Next is to zoom into your function's guess / starting root estimate. See if you can mod your algorithm to capitalize on a good starting solution over the whole time interval (do some literature digging). Note that this has less to do with the 'programming' than your knowledge on numerical methods.
Next, on your comment on "rest of the script isn't very fast either". Well, I'd go with Cython to sweat the remaining python parts of your code, esp the loops. It is very actively developed, great community and is battle tested. I personally have used it in many HPC type problems. Cython also has a convenient html annotation that highlights potential optimizations that may be possible over your native python implementation.
Hope this helps! Cheers

Auto-memoizing Python interpreter for machine learning(Incpy alternative?)

I am working on a machine learning project in python and many times I found myself rerun some algorithm with different tweaks each time(changing few parameters, different normalization, some extra feature engineering, etc). Each time most computation is similar except a few steps. I can, of course, save some immediate states on disk and load it next time instead of the computing the same thing over and over again.
The thing is that there are so many such immediate results that manually save them and keep a record of them would be a pain. I looked at some python decorator here that can make things a bit easier. However, the problem with this implementation is, that it will always return the same result from the first time you called the function, even when your function has arguments and therefore should produce different results for different arguments. I really need to memorize the output of a function with different arguments.
I googled extensively on this topic and the closest thing that I found that is IncPy by Philip Guo. IncPy (Incremental Python) is an enhanced Python interpreter that speeds up script execution times by automatically memoizing (caching) the results of long-running function calls and then re-using those results rather than re-computing, when safe to do so.
I really like the idea and think it will be very useful for data science and machine learning but the code is written nine years ago for python 2.6 and is no longer maintained.
So my question is that is there any other alternative automatical caching/memorizing techniques in python that can handle relatively large dataset?

PyCUDA/CUDA: Causes of non-deterministic launch failures?

Anyone following CUDA will probably have seen a few of my queries regarding a project I'm involved in, but for those who haven't I'll summarize. (Sorry for the long question in advance)
Three Kernels, One Generates a data set based on some input variables (deals with bit-combinations so can grow exponentially), another solves these generated linear systems, and another reduction kernel to get the final result out. These three kernels are ran over and over again as part of an optimisation algorithm for a particular system.
On my dev machine (Geforce 9800GT, running under CUDA 4.0) this works perfectly, all the time, no matter what I throw at it (up to a computational limit based on the stated exponential nature), but on a test machine (4xTesla S1070's, only one used, under CUDA 3.1) the exact same code (Python base, PyCUDA interface to CUDA kernels), produces the exact results for 'small' cases, but in mid-range cases, the solving stage fails on random iterations.
Previous problems I've had with this code have been to do with the numeric instability of the problem, and have been deterministic in nature (i.e fails at exactly the same stage every time), but this one is frankly pissing me off, as it will fail whenever it wants to.
As such, I don't have a reliable way to breaking the CUDA code out from the Python framework and doing proper debugging, and PyCUDA's debugger support is questionable to say the least.
I've checked the usual things like pre-kernel-invocation checking of free memory on the device, and occupation calculations say that the grid and block allocations are fine. I'm not doing any crazy 4.0 specific stuff, I'm freeing everything I allocate on the device at each iteration and I've fixed all the data types as being floats.
TL;DR, Has anyone come across any gotchas regarding CUDA 3.1 that I haven't seen in the release notes, or any issues with PyCUDA's autoinit memory management environment that would cause intermittent launch failures on repeated invocations?
Have you tried:
cuda-memcheck python yourapp.py
You likely have an out of bounds memory access.
You can use nVidia CUDA Profiler and see what gets executed before the failure.

Categories