Save data to disk to avoid re-computation - python

I'm a beginner with Python, but I'm at the final stages of a project I've been working on for the past year and I need help at the final step.
If needed I'll post my code though it's not really relevant.
Here is my problem:
I have a database of images, say 100 images for example. On each of those images I run an algorithm called ICA. This algorithm is computationally heavy and each picture individually usually takes 7-10 seconds, so 100 pictures can take 700-1000 seconds, and that's way too long to wait.
Thing is, my database of images never changes. I never add or delete pictures, so the output of the ICA algorithm is always the same. In reality, every time I run my code I wait forever, only to get the same output.
Is there a way to save the data to the hard disk, and extract it at a later time?
Say, I compute the ICA of the 100 images, it takes forever, and I save it and close my computer. Now when I run the program, I don't want it to recompute ICA, I want it to use the values I stored previously.
Would such a thing be possible in Python? If so, how?

Since you're running computation-heavy algorithms, I'm going to assume you're using NumPy. If not, you should be.
NumPy has a numpy.save() function that lets you save arrays in binary format. You can then load them back with numpy.load().
EDIT: Docs for aforementioned functions can be found here under the "NPZ Files" section.
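For illustration, here is a minimal caching sketch along those lines - run_ica() and the cache file name are placeholders for your own ICA routine and a path of your choosing:

import os
import numpy as np

CACHE_FILE = "ica_results.npy"   # hypothetical location for the cached output

def get_ica_results(images):
    # If the results were saved on a previous run, just load them from disk.
    if os.path.exists(CACHE_FILE):
        return np.load(CACHE_FILE)
    # Otherwise do the expensive computation once and save it for next time.
    results = np.array([run_ica(image) for image in images])   # run_ica is your ICA step
    np.save(CACHE_FILE, results)
    return results

On the second and later runs, np.load() returns in a fraction of a second instead of re-running ICA. If the per-image results have different shapes, np.savez() with one named array per image works instead of a single stacked array.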

Related

Optimising drawing graphs from a model: code architecture

OK, I have a question about how to lay out code efficiently.
I have a model written in Python which generates results that I use to produce graphs in matplotlib. As written, the model is contained within a single file, and I have 15 other run-files which call on it with complicated configurations and produce graphs. It takes a while to go through and run each of these run-files, but since they all use substantially different settings for the model, I need to have complicated setup files anyway, and it all works.
I have the output set up for figures that could go in an academic paper. I have now realised that I am going to need each of these figures again in other formats - one for presentations (low dpi, medium size, different font) and one for a poster (high dpi, much bigger, different font again).
This means I could potentially have 45-odd files to wade through every time I want to make a change to my model. I would also have to cut and paste a lot of boilerplate matplotlib code with minor alterations (each run-file would become 3 different files - one for each graph).
Can anybody explain to me how (and if) I could speed things up? At the moment, I think it's taking me much longer than it should.
As I see it there are 3 main options:
- Set up 3 run-files for each actual model run (duplicating a fair amount, and running the model a lot more than I need to), but I can then tweak everything independently (though I risk missing something important).
- Add another layer - save the results as .csv or equivalent and then read them into the files that produce the graphs. This means more files, but I only have to run the model once per 3 graphs (which might save some time).
- Keep the graph and model parameter files integrated, but add another file which sets up graphing templates, so every time I run the file it spits out 3 graphs. This might speed things up a bit, and will certainly keep the number of files down, but the files will get very big (and probably much more complicated).
- Something else?
Can anybody point me to a resource or provide me with some advice on how best to handle this?
Thanks!
I think you are close to finding what you want.
If the calculations take some time, store the results in files so they can be processed later without recalculation.
Most important: separate the code from the configuration, instead of copy-pasting variations of such a mixture.
If the model takes parameters, define a Model class. Maybe instantiate the model only once, but have the model know how to load_config, read_input_data and run. The model also does write_results. That way you can loop over a sequence of load_config, read_data, run, write_results for every configuration and, if needed, every input data set.
Write the config files by hand, in INI format for example, and use the configparser module to load them.
Do something similar for your Graph class. Put the template definitions in configuration files, including output format, sizes, fonts, and so on.
In the end you will be able to "manage" the intended workflow with a single script that uses these facilities. Maybe store groups of related configuration files, output templates and input data together, one group per folder for each modelling session.
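A rough sketch of that layout, assuming hypothetical method names, file names and an invented [model] INI section (this is an illustration of the idea, not working model code):

import configparser

class Model:
    def load_config(self, path):
        # e.g. an INI file with a [model] section holding the parameters
        parser = configparser.ConfigParser()
        parser.read(path)
        self.params = dict(parser["model"])

    def read_input_data(self, path):
        # placeholder: read whatever input the model actually needs
        with open(path) as f:
            self.data = f.read()

    def run(self):
        # placeholder for the real computation
        self.results = {"n_lines": len(self.data.splitlines()), **self.params}

    def write_results(self, path):
        # store results so the graphing step can pick them up later
        with open(path, "w") as f:
            f.write(repr(self.results))

# one model instance, looped over as many configurations as needed
model = Model()
for name in ["run1", "run2"]:
    model.load_config(name + ".ini")
    model.read_input_data(name + ".csv")
    model.run()
    model.write_results(name + ".results")

A Graph class can follow the same pattern, reading its template (format, size, fonts) from a configuration file instead of hard-coding it in each run-file.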

Dask: why has CPU usage suddenly dropped?

I'm doing some Monte Carlo for a model and figured that Dask could be quite useful for this purpose. For the first 35 hours or so, things were running quite "smoothly" (apart from the fan noise giving a sense that the computer was taking off). Each model run would take about 2 seconds and there were 8 partitions running it in parallel. Activity monitor was showing 8 python3.6 instances.
However, the computer has become "silent" and CPU usage (as displayed in Spyder) hardly exceeds 20%. Model runs are happening sequentially (not in parallel) and taking about 4 seconds each. This happened today at some point while I was working on other things. I understand that, depending on the sequence of actions, Dask won't use all cores at the same time. However, in this case there is really just one task to be performed (see further below), so one could expect all partitions to run and finish more or less simultaneously. Edit: the whole setup has run successfully for 10,000 simulations in the past; the difference now is that there are nearly 500,000 simulations to run.
Edit 2: now it has shifted to doing 2 partitions in parallel (instead of the previous 1 and original 8). It appears that something is making it change how many partitions are simultaneously processed.
Edit 3: Following recommendations, I have used a dask.distributed.Client to track what is happening, and ran it for the first 400 rows. An illustration of what it looks like after completing is included below. I am struggling to understand the x-axis labels; hovering over the rectangles shows about 143 s.
Some questions therefore are:
Is there any relationship between running other software (Chrome, MS Word) and having the computer "take back" some CPU from python?
Or instead, could it be related to the fact that at some point I ran a second Spyder instance?
Or even, could the computer have somehow run out of memory? But then wouldn't the command have stopped running?
... any other possible explanation?
Is it possible to "tell" Dask to keep up the hard work and go back to using all CPU power while it is still running the original command?
Is it possible to interrupt an execution and keep whichever calculations have already been performed? I have noticed that stopping the current command doesn't seem to do much.
Is it possible to inquire on the overall progress of the computation while it is running? I would like to know how many model runs are left to have an idea of how long it would take to complete in this slow pace. I have tried using the ProgressBar in the past but it hangs on 0% until a few seconds before the end of the computations.
To be clear, uploading the model and the necessary data would be very complex. I haven't created a reproducible example either, out of fear of making the issue worse (for now the model is still running, at least...), and because - as you can probably tell by now - I have very little idea of what could be causing it, so I am not expecting anyone to be able to reproduce it. I'm aware this is not best practice and apologise in advance. However, I would really appreciate some thoughts on what could be going on and possible ways to go about it, if anyone has been through something similar before and/or has experience with Dask.
Running:
- macOS 10.13.6 (Memory: 16 GB | Processor: 2.5 GHz Intel Core i7 | 4 cores)
- Spyder 3.3.1
- dask 0.19.2
- pandas 0.23.4
Please let me know if anything needs to be made clearer.
If you believe it can be relevant, the main idea of the script is:
- Create a pandas DataFrame where each column is a parameter and each row is a possible parameter combination (cartesian product). At the end of each row, some columns to store the respective values of some objective functions are pre-allocated too.
- Generate a dask dataframe that is the DataFrame above split into 8 partitions.
- Define a function that takes a partition and, for each row:
  - runs the model with the coefficient values defined in the row;
  - retrieves the values of the objective functions;
  - assigns these values to the respective columns of the current row in the partition (the columns have been pre-allocated);
  - and then returns the partition with the objective-function columns populated with the calculated values.
- map_partitions() this function over the dask dataframe.
Any thoughts?
This shows how simple the script is: [screenshot of the script]
The dashboard: [screenshot of the dashboard]
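For what it's worth, a stripped-down sketch of that structure is below; run_model(), the parameter names and the grid sizes are stand-ins for illustration, not the actual script:

import pandas as pd
import dask.dataframe as dd
from itertools import product

def run_model(a, b):
    # stand-in for the real, expensive model run; returns one objective value
    return a * b

# cartesian product of parameters, with a pre-allocated objective column
params = pd.DataFrame(list(product(range(100), range(100))), columns=["a", "b"])
params["objective"] = float("nan")

def process_partition(part):
    # run the model row by row and fill in the pre-allocated column
    part = part.copy()
    part["objective"] = [run_model(row.a, row.b) for row in part.itertuples()]
    return part

ddf = dd.from_pandas(params, npartitions=8)
result = ddf.map_partitions(process_partition).compute()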
Update: The approach I took was to:
- Set a large number of partitions (npartitions=nCores*200). This made it much easier to visualise the progress. I'm not sure whether setting so many partitions is good practice, but it worked without much of a slowdown.
- Instead of trying to get a single huge pandas DataFrame at the end with .compute(), I had the dask dataframe written to Parquet, so that each partition was written to a separate file. Later, reading all the files back into a dask dataframe and compute()-ing it into a pandas DataFrame wasn't difficult, and if something went wrong in the middle, at least I wouldn't lose the partitions that had already been successfully processed and written.
This is what it looked like at a given point: [screenshot of the dashboard]
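In sketch form, that approach looks roughly like this (the parameter table and paths are illustrative, and to_parquet needs a Parquet engine such as fastparquet or pyarrow installed):

import pandas as pd
import dask.dataframe as dd

nCores = 4
params = pd.DataFrame({"a": range(10000)})        # illustrative parameter table
params["objective"] = float("nan")

def process_partition(part):
    part = part.copy()
    part["objective"] = part["a"] * 2             # stand-in for the model run
    return part

ddf = dd.from_pandas(params, npartitions=nCores * 200)
# each partition ends up in its own Parquet file as it completes
ddf.map_partitions(process_partition).to_parquet("results_parquet")

# later (even after an interruption) read back whatever partitions were written
result = dd.read_parquet("results_parquet").compute()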
Dask has many diagnostic tools to help you understand what is going on inside your computation. See http://docs.dask.org/en/latest/understanding-performance.html
In particular I recommend using the distributed scheduler locally and watching the Dask dashboard to get a sense of what is going on in your computation. See http://docs.dask.org/en/latest/diagnostics-distributed.html#dashboard
This is a webpage that you can visit that will tell you exactly what is going on in all of your processors.
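For example, something along these lines starts a local cluster; the dashboard is then typically served at http://localhost:8787/status (the port can vary if 8787 is taken):

from dask.distributed import Client

client = Client()   # local cluster: a scheduler plus one worker process per core by default
print(client)       # the repr shows the scheduler address and the number of workers

# run the dask computation as usual; once a Client exists, dask uses the distributed
# scheduler automatically, and the dashboard shows the task stream, progress and
# per-worker CPU/memory use while it runs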

Is faster than 10 screenshots per second possible with Python?

Right now I need a really fast way to take screenshots to feed into a CNN that will update mouse movement based on the screenshot. I'm looking to model the same kind of behavior presented in this paper, and similarly do the steps featured in Figure 6 (without the polar conversion). Since I need really fast input, I've searched around a bit and have ended up with this script from here, slightly modified, which outputs 10 fps:
from PIL import ImageGrab
from datetime import datetime

while True:
    # grab a fixed region of the screen (left, top, right, bottom)
    im = ImageGrab.grab([320, 180, 1600, 900])
    dt = datetime.now()
    # file name carries the time plus tenths of a second, e.g. pic_1230_45.7.png
    fname = "pic_{}.{}.png".format(dt.strftime("%H%M_%S"), dt.microsecond // 100000)
    im.save(fname, 'png')
Can I expect anything faster? I'd be fine with using a different program if it's available.
Writing to disk is very slow, and is probably a big part of what's making your loop take so long. Try commenting out the im.save() line and see how many screenshots you can capture (add a counter variable or something similar to keep track).
Assuming the disk I/O is the bottleneck, you'll want to split these two tasks up. Have one loop that just captures the screenshots and stores them in memory (e.g. to a dictionary with the timestamp as the key), then in a separate thread pull elements out of the dictionary and write them to disk.
See this question for pointers on threading in Python if you haven't done much of that before.
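A rough sketch of that split, using a queue rather than a dictionary for the hand-off (the crop box matches the snippet above; note that memory use will grow if capturing outpaces the disk for long):

import threading
import queue
from datetime import datetime
from PIL import ImageGrab

frames = queue.Queue()

def writer():
    # background thread: take frames off the queue and do the slow disk writes
    while True:
        fname, im = frames.get()
        im.save(fname, 'png')
        frames.task_done()

threading.Thread(target=writer, daemon=True).start()

while True:
    im = ImageGrab.grab([320, 180, 1600, 900])
    dt = datetime.now()
    fname = "pic_{}.{}.png".format(dt.strftime("%H%M_%S"), dt.microsecond // 100000)
    frames.put((fname, im))   # the capture loop never waits on disk I/O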

How to save large Python numpy datasets?

I'm attempting to create an autonomous RC car, and my Python program is supposed to query the live stream at a given interval and add the result to a training dataset. The data I want to collect is the array of the current image from OpenCV together with the current speed and angle of the car. I would then like to load it into Keras for processing.
I found out that numpy.save() just saves one array to a file. What is the best/most efficient way of saving data for my needs?
As with anything regarding performance or efficiency, test it yourself. The problem with recommendations for the "best" of anything is that they might change from year to year.
First, you should determine whether this is even an issue you should be tackling. If you're not experiencing performance or storage issues, then don't bother optimizing until it becomes a problem. Whatever you do, don't waste your time on premature optimization.
Next, assuming it actually is an issue, try out every method for saving to see which one yields the smallest results in the shortest amount of time. Maybe compression is the answer, but that might slow things down? Maybe pickling objects would be faster? Who knows until you've tried.
Finally, weigh the trade-offs and decide which method you can compromise on; you'll almost never have one silver-bullet solution. While you're at it, determine whether just throwing more CPU, RAM or disk space at the problem would solve it. Cloud computing affords you a lot of headroom in those areas.
The simplest way is np.savez_compressed(). This saves any number of arrays using the same format as np.save(), but encapsulated in a standard Zip file.
If you need to be able to add more arrays to an existing file, you can do that easily, because after all the NumPy ".npz" format is just a Zip file. So open or create a Zip file using zipfile, and then write arrays into it using np.save(). The APIs aren't perfectly matched for this, so you can first construct an in-memory BytesIO "file" (StringIO in Python 2), write into it with np.save(), then use writestr() on the zipfile to add the buffer's contents.
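A sketch of both steps, with made-up file and array names (in Python 3 the in-memory buffer needs to be io.BytesIO, since np.save() writes bytes):

import io
import zipfile
import numpy as np

# save several arrays at once, compressed, under names of your choosing
np.savez_compressed('training.npz', images=np.zeros((10, 64, 64)), speeds=np.arange(10))

# later, append another array to the same .npz, which is just a Zip archive
buf = io.BytesIO()
np.save(buf, np.linspace(-30, 30, 10))            # serialise the new array into the buffer
with zipfile.ZipFile('training.npz', 'a') as zf:
    zf.writestr('angles.npy', buf.getvalue())     # the ".npy" suffix is what np.load expects

data = np.load('training.npz')
print(data.files)                                 # now lists 'angles' alongside the others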

Python dangers of multithreading?

I've got a program that combines multiple images (up to 48) that are all color-coded. Running through just 32 of the 1024x768 images takes about 7.5 minutes to process them and produce the output image. I handle each image separately using getdata (wxPython) and then go through it pixel by pixel to get the underlying value, adding it to the final image list before creating/displaying the final image. The entire program works just like I want it to, with one SLOW exception: it takes 7.5 minutes to run, and if I want to do anything slightly different I have to start all over.
What I'm wondering is whether going through the images with threading would save time, but I'm concerned about one possible problem: the same exact code being called at the same time by different threads. Is that something to be worried about or not? And is there anything else I'm not thinking of that I should be worried about?
Read about the GIL - threading won't do the trick for your problem. It helps when you can represent your computation as some kind of pipeline, or when it is dominated by I/O; only then is the synchronisation worthwhile. If you run the same CPU-bound code on several threads instead of a single-threaded loop, you'll just end up with the computations happening in a random order, one thread at a time.
You probably want to map your data (maybe parts of a picture, maybe whole pictures) across many processes, as they can work in parallel without worrying about the GIL, because they are separate interpreter processes. You can use multiprocessing.Pool here.
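As a hedged sketch of that idea (process_image stands in for your per-image getdata/pixel work, and the file names are made up):

from multiprocessing import Pool

def process_image(path):
    # stand-in for the per-image work: load one image and reduce it to its pixel values;
    # each call runs in a separate process, so the GIL does not serialise the work
    with open(path, 'rb') as f:
        return len(f.read())

if __name__ == '__main__':
    paths = ['image_{:02d}.png'.format(i) for i in range(48)]
    with Pool() as pool:                      # one worker process per CPU core by default
        per_image = pool.map(process_image, paths)
    # combine the per-image results into the final output image here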
