I am working on a machine learning project in Python, and many times I have found myself rerunning some algorithm with different tweaks each time (changing a few parameters, different normalization, some extra feature engineering, etc.). Each time most of the computation is similar except for a few steps. I can, of course, save some intermediate states to disk and load them next time instead of computing the same thing over and over again.
The thing is that there are so many such intermediate results that manually saving them and keeping a record of them would be a pain. I looked at a Python decorator here that can make things a bit easier. However, the problem with this implementation is that it will always return the same result from the first time you called the function, even when your function has arguments and should therefore produce different results for different arguments. I really need to memoize the output of a function with different arguments.
I googled extensively on this topic, and the closest thing I found is IncPy by Philip Guo. IncPy (Incremental Python) is an enhanced Python interpreter that speeds up script execution times by automatically memoizing (caching) the results of long-running function calls and then re-using those results rather than re-computing them, when it is safe to do so.
I really like the idea and think it would be very useful for data science and machine learning, but the code was written nine years ago for Python 2.6 and is no longer maintained.
So my question is: is there any other automatic caching/memoization technique in Python that can handle relatively large datasets?
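For the kind of thing I mean, a per-function disk cache like joblib's Memory gets partway there: it caches a function's return value to disk keyed on its arguments, and it copes with large numpy arrays. A minimal sketch (the function, arguments and cache directory below are just illustrative):

```python
# Sketch using joblib.Memory: results are written to disk and keyed on the
# function's arguments, so different arguments get different cache entries.
import numpy as np
from joblib import Memory

memory = Memory("cache_dir", verbose=0)  # cache directory is illustrative

@memory.cache
def preprocess(data, normalize=True):
    # stand-in for an expensive step (normalization, feature engineering, ...)
    if normalize:
        return (data - data.mean(axis=0)) / data.std(axis=0)
    return data.copy()

X = np.random.rand(10_000, 50)
X1 = preprocess(X, normalize=True)   # computed and written to disk
X2 = preprocess(X, normalize=True)   # loaded from the cache, even across runs
```

But what I'm really after is something more automatic than decorating every function by hand, which is what drew me to IncPy in the first place.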
I noticed a lack of good soundfont-compatible synthesizers written in Python. So, a month or so ago, I started some work on my own (for reference, it's here). Making this was also a challenge that I set for myself.
I keep coming up against the same problem again and again and again, summarized by this:
To play sound, a stream of data with a more-or-less constant rate of flow must be sent to the audio device
To synthesize sound in real time based on user input, little-to-no buffering can be used
Thus, there is a cap on the amount of time one 'buffer generation loop' can take (a rough budget is sketched after this list)
Python, as a language, simply cannot run fast enough to synthesize sound within this time limit
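To put a number on that cap (rough arithmetic; the sample rate and buffer size are typical values, not my exact settings):

```python
# Rough time budget for one 'buffer generation loop' (illustrative numbers).
SAMPLE_RATE = 44_100   # samples per second
BUFFER_SIZE = 256      # samples per buffer; small, to keep latency low

budget_ms = BUFFER_SIZE / SAMPLE_RATE * 1000
print(f"time budget per buffer: {budget_ms:.2f} ms")  # roughly 5.8 ms

# Every active note has to be mixed into every sample within that budget,
# so 25 simultaneous notes already means 256 * 25 inner-loop iterations.
```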
The problem is not my code, or at least, I've tried to optimize it to extreme levels: using local variables in time-sensitive parts of the code, avoiding dotted attribute lookups inside loops, using itertools for iteration, using compiled built-ins like max instead of hand-written equivalents, changing thread switching parameters, doing as few calculations as possible, making approximations; the list goes on.
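By "avoiding dots" and pulling names into locals I mean micro-optimizations along these lines (a contrived sketch, not my actual synth code):

```python
import math

def mix_slow(samples):
    # math.sin is looked up (module attribute + global) on every iteration
    return [math.sin(s) for s in samples]

def mix_fast(samples):
    # bind the function to a local name once; local lookups are cheaper
    sin = math.sin
    return [sin(s) for s in samples]
```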
Using PyPy helps, but even that starts to struggle before long.
It's worth noting that (at best) my synth at the moment can play about 25 notes simultaneously. But this isn't enough. Fluidsynth, a synth written in C, caps the number of simultaneous notes per instrument at 128. It also supports multiple instruments at a time.
Is my assertion that Python simply cannot be used to write a synthesizer correct? Or am I missing something very important?
I am currently working on an ML NLP project and I want to measure the execution time of certain parts and also potentially predict how long the execution will take. For example, I want to measure the ML training process (including sub-processes like the data preprocessing part). I have been looking online and have come across different Python modules that can measure the execution time of functions (like time or timeit). However, I still haven't found a concrete solution for predicting the time it will take for a function to execute. I have thought about running the code several times, saving the (data_size, time) values, and then using those to extrapolate for future data. I also thought about then updating this estimate with the time it took to run several subparts of a function (e.g. seeing how much of the process has been computed and how long it took, and then using that to adjust the time left).
However, I am not sure of any of this and I wanted to see if there were better options out there that I wasn't aware of, so if anyone has a better idea, I'd be thankful if you could share it.
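To make the extrapolation idea concrete, this is roughly what I had in mind (a sketch; the measurements are made up and a straight-line fit is an assumption that may not hold):

```python
# Record (data_size, seconds) pairs from past runs, fit a simple model,
# and extrapolate the expected runtime for a new data size.
import time
import numpy as np

def timed(fn, *args, **kwargs):
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# hypothetical measurements from earlier runs: (number of samples, seconds)
history = np.array([
    (1_000, 2.1),
    (5_000, 9.8),
    (10_000, 21.5),
])

# degree-1 polynomial; swap in a higher degree or a log-log fit if the
# algorithm does not scale roughly linearly with the data size
predict = np.poly1d(np.polyfit(history[:, 0], history[:, 1], deg=1))
print(f"estimated time for 50,000 samples: {predict(50_000):.1f} s")
```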
Have you looked into using profiling? It should give a detailed breakdown of the function execution times, the number of calls, etc. You will have to execute the script with profiling, and then you will get the detailed breakdown.
https://docs.python.org/3/library/profile.html#module-cProfile
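A minimal way to use it from code is sketched below; the train() function is just a placeholder for whatever you want to measure (you can also profile a whole script with python -m cProfile -s cumulative your_script.py):

```python
# Profile one call and print the ten slowest entries by cumulative time.
import cProfile
import pstats

def train():
    # placeholder for the real training / preprocessing pipeline
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
train()
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```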
If you want live progress reports while the code runs, there are a couple of libraries I've seen. https://pypi.org/project/tqdm/
https://pypi.org/project/progressbar2/
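For example, with tqdm (the loop body is a placeholder for one training step):

```python
# tqdm wraps any iterable and shows a live progress bar with an ETA,
# which doubles as a rough time-remaining estimate while training runs.
from time import sleep
from tqdm import tqdm

for batch in tqdm(range(100), desc="training"):
    sleep(0.01)  # placeholder for one training step
```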
Hope these help!
I've read online that Scala is faster than Python, e.g. here. I've also seen a comparison between different front ends that concluded that R was so slow that the tester gave up trying to measure its performance (here; although this was specifically a test for user-defined functions and may not have used the sparklyr package).
I also know that sparklyr now has arrow integration, which has led to performance improvements for user defined functions as well as copying data to/from the cluster, as shown here.
My question: how fast is sparklyr compared to Python/Scala? I'm mostly interested in standard 'out-of-the-box' functions, but would also be interested to know how it stacks up for user-defined functions now that arrow has been integrated. And are there particular circumstances in which it performs well or badly?
I ask because I've built an app in sparklyr that is slower than I would have hoped despite lots of tinkering with tuning parameters, and I'm wondering if this is partly because of limitations in the package.
On a local PC, most of the time, plain R is much faster than Spark.
Is there a particular software resource monitor that researchers or academics use to compare execution time and other resource usage metrics between programming environments? For instance, if I have a routine in C++, one in Python, and another in Matlab, all identical in function and with similar implementations, how would I make an objective, measurable comparison of which is the most efficient? Likewise, is there a tool that could also analyze performance between versions of the same code, to track improvements in processing efficiency? Please try to answer this question without generalizations like "oh, C++ is always more efficient than Python and Python will always be more efficient than Matlab."
The correct way is to write tests. Get the current time before the actual algorithm starts, and get the current time after it ends. There are ways to do that in C++, Python, and Matlab.
You must not treat the results as 100% precise, because of the system's process scheduling etc., but it is a good way to compare before/after results.
A good way to get more precise results is to run your code multiple times.
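In Python, for example, that could look like the sketch below; timeit handles the repetition, and the routine being timed is a placeholder:

```python
# Time the same routine several times and report the best run; the minimum
# is usually the most stable figure, since slower runs mostly reflect
# scheduling noise from the rest of the system.
import timeit

def routine():
    return sum(i * i for i in range(100_000))  # placeholder for the real algorithm

runs = timeit.repeat(routine, repeat=5, number=10)
print(f"best of 5: {min(runs) / 10 * 1000:.3f} ms per call")
```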
(This is a follow-up to Statistical profiler for PyPy)
I'm running some Python code under PyPy and would like to optimize it.
In CPython, I would use statprof or lineprofiler to know which exact lines are causing the slowdown and try to work around them. In PyPy, though, neither of those tools really reports sensible results, as PyPy might optimize away some lines. I would also prefer not to use cProfile, as I find it very difficult to distil which part of the reported function is the bottleneck.
Does anyone have some tips on how to proceed? Perhaps another profiler which works nicely under PyPy? In general, how does one go about optimizing Python code for PyPy?
If you understand the way the PyPy architecture works, you'll realize that trying to pinpoint individual lines of code isn't really productive. You start with a Python interpreter written in RPython, which then gets run through a tracing JIT that generates flow graphs and then transforms those graphs to optimize the RPython interpreter. What this means is that the layout of the Python code being run by the JIT'ed RPython interpreter may have a very different structure than the optimized assembler actually being run. Furthermore, keep in mind that the JIT always works on a loop or a function, so getting line-by-line stats is not as meaningful. Consequently, I think cProfile may really be a good option for you, since it will give you an idea of where to concentrate your optimization. Once you know which functions are your bottlenecks, you can spend your optimization efforts targeting those slower functions, rather than trying to fix a single line of Python code.
Keep in mind as you do this that PyPy has very different performance characteristics than CPython. Always try to write code in as simple a way as possible (that doesn't mean as few lines as possible, by the way). There are a few other heuristics that help, such as using specialized lists, using objects instead of dicts when you have a small number of mostly constant keys, and avoiding C extensions that use the CPython C API.
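For instance, the "objects over dicts" heuristic looks roughly like this (a sketch; the class and field names are arbitrary):

```python
# With a small, mostly constant set of keys, a plain class lets PyPy's JIT
# specialize attribute access, whereas a dict forces generic hashed lookups.
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

def total_objects(points):
    return sum(p.x + p.y for p in points)

def total_dicts(points):
    return sum(p["x"] + p["y"] for p in points)

pts_as_objects = [Point(i, i + 1) for i in range(100_000)]
pts_as_dicts = [{"x": i, "y": i + 1} for i in range(100_000)]
```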
If you really, really insist on trying to optimize at the line level, there are a few options. One is called JitViewer (https://foss.heptapod.net/pypy/jitviewer), which will give you a very low-level view of what the JIT is doing to your code. For instance, you can even see the assembler instructions which correspond to a Python loop. Using that tool, you can really get a sense of just how fast PyPy will be with certain parts of the code, as you can now do silly things like count the number of assembler instructions used for your loop.