Detect memory swapping in Python

How to detect when OS starts swapping some resources of the running process to disk?
I came here from basically the same question. The psutil library is obviously great and provides a lot of information, yet I don't know how to use it to solve this problem.
I created a simple test script.
import psutil
import os
import numpy as np
MAX = 45000
ar = []
pr_mem = []
swap_mem = []
virt_mem = []
process = psutil.Process()
for i in range(MAX):
    ar.append(np.zeros(100_000))
    pr_mem.append(process.memory_info())
    swap_mem.append(psutil.swap_memory())
    virt_mem.append(psutil.virtual_memory())
Then, I plotted the course of those statistics.
import matplotlib.pyplot as plt
plt.figure(figsize=(16,12))
plt.plot([x.rss for x in pr_mem], label='Resident Set Size')
plt.plot([x.vms for x in pr_mem], label='Virtual Memory Size')
plt.plot([x.available for x in virt_mem], label='Available Memory')
plt.plot([x.total for x in swap_mem], label='Total Swap Memory')
plt.plot([x.used for x in swap_mem], label='Used Swap Memory')
plt.plot([x.free for x in swap_mem], label='Free Swap Memory')
plt.legend(loc='best')
plt.show()
I cannot see how to use the information about swap memory to detect swapping of my process.
The 'Used Swap Memory' figure is of little use on its own: it counts global swap consumption, so it is high from the very first moment, when the data of the process are obviously not swapped yet.
It seems best to look at the difference between the 'Virtual Memory Size' and the 'Resident Set Size' of the process. If VMS greatly exceeds RSS, it is a sign that the data are not in RAM but on disk.
However, a problem with a sudden explosion of VMS is described here, which, if I understand it correctly, makes this signal unreliable in some cases.
Another approach is to watch 'Available Memory' and make sure it does not drop below a certain threshold, as the psutil documentation suggests. But it seems complicated to me to set the threshold properly. The documentation suggests (in a small snippet) 100 MB, but on my machine something like 500 MB would be needed.
So the question stands. How to detect when OS starts swapping some resources of the running process to disk?
I work on Windows, but the solution needs to be cross-platform (or at least as cross-platform as possible).
Any suggestion, advice or useful link is welcomed. Thank you!
For the context, I write a library which needs to manage its memory consumption.
I believe that with knowledge of the program logic, my manual swapping (serializing data to disk) can work better (faster) than OS swapping. More importantly, the OS swap space is limited, so sometimes manual swapping, which does not use the OS swap space, is necessary.
In order to start the manual swapping at the right time, it is crucial to know when OS swapping starts.
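For reference, below is a minimal sketch of the kind of check discussed above, combining the VMS-vs-RSS gap with an available-memory floor. The threshold values and the helper name are illustrative assumptions, not a tested recipe; both numbers would need tuning per machine.
import psutil

# Illustrative thresholds -- tune these for the target machine.
AVAILABLE_FLOOR = 500 * 1024**2   # alarm if less than ~500 MB of RAM is available
VMS_RSS_GAP = 200 * 1024**2       # alarm if VMS exceeds RSS by more than ~200 MB

def swapping_suspected(process=psutil.Process()):
    """Heuristic: True if this process is likely being swapped out."""
    mem = process.memory_info()
    vm = psutil.virtual_memory()
    low_memory = vm.available < AVAILABLE_FLOOR
    large_gap = (mem.vms - mem.rss) > VMS_RSS_GAP
    return low_memory and large_gap

if swapping_suspected():
    print("start manual swapping (serialize cold data to disk)")
On Linux, psutil.Process().memory_full_info() additionally exposes a per-process swap field, which is a more direct signal, but it is not available on every platform, so it cannot be the only cross-platform mechanism.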

Related

Numpy matrix inverse appears to use multiple threads

So I have this really simple code:
import numpy as np
import scipy as sp
import scipy.linalg  # make sure the linalg submodule is loaded so sp.linalg works
mat = np.identity(4)
for i in range(100000):
    np.linalg.inv(mat)
for i in range(100000):
    sp.linalg.inv(mat)
Now, the first bit, where the inverse is done through numpy, for some reason launches 3 additional threads (so 4 total, including the main one), and collectively they consume roughly 3 of my available cores, causing the fans on my computer to go wild.
The second bit, where I use SciPy, has no noticeable impact on CPU use and there's only one thread, the main thread. This runs about 20% slower than the numpy loop.
Anyone have any idea what's going on? Is numpy doing threading in the background? Why is it so inefficient?
I faced the same issue; the fix was to run export OPENBLAS_NUM_THREADS=1 and then run the Python script in the same terminal.
My issue was that a simple code block consisting of np.linalg.inv() was consuming more than 50% of CPU usage. After setting the OPENBLAS_NUM_THREADS parameter, the CPU usage dropped to around 3% and the total execution time was reduced as well. I read somewhere that this is an issue with the OpenBLAS library (which is used by the numpy.linalg.inv function). Hope this helps!
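For reference, a minimal sketch of setting the same limit from inside the script instead of the shell. The variable must be set before NumPy is imported, otherwise OpenBLAS has already started its thread pool; the extra variables are a precaution for builds linked against other BLAS backends.
import os

# Limit BLAS threading; this must happen before numpy is imported.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np

mat = np.identity(4)
for _ in range(100000):
    np.linalg.inv(mat)  # now runs single-threaded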

Astronomical FITS Image calibration: Indexing issue using ccdproc

I seem to be having an issue with some basic astronomical image processing/calibration using the python package ccdproc.
I'm currently compiling 30 bias frames into a single image average of the component frames. Before going through the combination I iterate over each image in order to subtract the overscan region using subtract_overscan() and then select the image dimensions I want to retain using trim_image().
I believe my indexing is correct, but when I get to the combination it takes extremely long (more than a couple of hours). I'm not sure this is normal; I suspect something is being misinterpreted by my computer. I've created the averaged image before without any of the other processing and it didn't take long (5-10 minutes or so), which is why I think it might be an issue with my indexing.
If anyone can verify that my code is correct and/or comment on what might be the issue it'd be a lot of help.
Image dimensions: NAXIS1 = 3128, NAXIS2 = 3080, and allfiles is a ccdproc.ImageFileCollection.
from astropy.io import fits
from astropy import units as u  # needed for u.adu below
import ccdproc as cp

biasImages = []
for filename in allfiles.files_filtered(NAXIS1=3128, NAXIS2=3080, OBSTYPE='BIAS'):
    ccd = fits.getdata(allfiles.location + filename)
    # print(ccd)
    ccd = cp.CCDData(ccd, unit=u.adu)
    # print(ccd)
    ccd = cp.subtract_overscan(ccd, overscan_axis=1, fits_section='[3099:3124,:]')
    # print(ccd)
    ccd = cp.trim_image(ccd, fits_section='[27:3095,3:3078]')
    # print(ccd)
    biasImages.append(ccd)

master_bias = cp.combine(biasImages, output_file=path + 'mbias_avg.fits', method='average')
The code looks similar to my own code for combining biases together (see this example), so there is nothing jumping out immediately as a red flag. I rarely do such a large number of biases and the ccdproc.combine task could be far more optimized, so I'm not surprised it is very slow.
One thing I sometimes run into is issues with garbage collection. If you are running this in a notebook or as part of a larger script, the memory may not be cleared. It is useful to watch what is happening in memory, and I sometimes delete the biasImages object (or another list of CCD objects) once it has been used and is no longer needed.
I'm happy to respond further here, or if you have further issues please open an issue at the github repo.
In case you're just looking for a solution, skip ahead to the end of this answer; but in case you're interested in why that (probably) happens, don't skip ahead.
it takes extremely long (more than a couple of hours).
That sounds like you're running out of RAM and your computer then starts using swap memory. That means it will save part (or all) of the objects on your hard disk and remove them from RAM so it can load them again when needed. In some cases swap memory can be very efficient because it only needs to reload from the hard disk rarely, but in other cases it has to reload lots of times and then you're going to notice a "whole system slowdown" and "never-ending operations".
After some investigation I think the problem is mainly that the numpy.array created by ccdproc.combine is stacked along the first axis and the operation is performed along the first axis. The first axis would be fine if it were a Fortran-contiguous array, but ccdproc doesn't specify any order, so the array will be C-contiguous. That means the elements along the last axis are stored next to each other in memory (if it were Fortran-contiguous, the elements along the first axis would be next to each other). So if you run out of RAM and your computer starts using swap memory, it puts parts of the array on the disk, but because the operation is performed along the first axis, the memory addresses of the elements used in each operation are far away from each other. That means it cannot use the swap memory efficiently, because it basically has to reload parts of the array from the hard disk for each next item.
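To make the contiguity point concrete, here is a small illustrative snippet (not from the original answer) showing how the memory order changes the byte stride along the stacking axis:
import numpy as np

# Stack 30 small "images" along a new first axis, as ccdproc.combine does.
stack_c = np.stack([np.ones((4, 5)) for _ in range(30)])  # C order (NumPy default)
stack_f = np.asfortranarray(stack_c)                      # Fortran-ordered copy

print(stack_c.flags['C_CONTIGUOUS'], stack_c.strides)  # True (160, 40, 8)
print(stack_f.flags['F_CONTIGUOUS'], stack_f.strides)  # True (8, 240, 960)

# In the C-ordered stack, stepping along axis 0 (the image index) jumps 160 bytes,
# so a reduction along axis 0 touches addresses that are far apart; in the
# Fortran-ordered copy the same step is only 8 bytes.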
It's not actually very important to know that; I just included it in case you're interested in the reason for the observed behavior. The main point to take away is that if you notice the system becoming very slow when you run a program and it doesn't seem to make much progress, it's because you have run out of RAM!
The easiest solution (although it has nothing to do with programming) is to buy more RAM.
The complicated solution would be to reduce the memory footprint of your program.
Let's first do a small calculation how much memory we're dealing with:
Your images are 3128 * 3080, which is 9,634,240 elements. They might be any type when you read them, but after ccdproc.subtract_overscan they will be floats. One float (well, actually np.float64) uses 8 bytes, so we're dealing with 77,073,920 bytes. That's roughly 73 MB per bias image. You have 30 bias images, so we're dealing with roughly 2.2 GB of data here. That assumes your images don't have an uncertainty or mask; if they do, that would add another 2.2 GB for the uncertainties or 0.26 GB for the masks.
2.2 GB sounds like a small enough number, but ccdproc.combine stacks the NumPy arrays. That means it will create a new array and copy the data of your ccds into it, which doubles the memory right there. It makes sense to stack them, because even though it takes more memory, the actual "combining" will be much faster; it just doesn't come for free.
All in all, 4.4 GB could already exhaust your RAM. Some computers only have 4 GB of RAM, and don't forget that your OS and the other programs need some RAM as well. However, it's unlikely that you'd run out of RAM with 8 GB or more, so given the numbers and your observations I assume you have only 4-6 GB of RAM.
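The numbers above can be reproduced with a few lines (a quick check; using the trimmed image size instead would give a slightly smaller figure):
rows, cols = 3080, 3128        # NAXIS2, NAXIS1
n_images = 30
bytes_per_float = 8            # np.float64

per_image = rows * cols * bytes_per_float
data_total = n_images * per_image
stacked_total = 2 * data_total  # combine stacks the data into a new array

print(f"per image:  {per_image / 1024**2:.1f} MiB")     # ~73.5 MiB
print(f"all biases: {data_total / 1024**3:.2f} GiB")    # ~2.15 GiB
print(f"stacked:    {stacked_total / 1024**3:.2f} GiB") # ~4.31 GiB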
The interesting question is actually how to avoid the problem. That really depends on the amount of memory you have:
Less than 4GB RAM
That's tricky because you won't have much free RAM after you deduct the size of all the CCDData objects and what your OS and the other processes need. In that case it would be best to process, for example, 5 bias images at a time and then combine the results of those first combinations. That's probably going to work because you use average as the method (it wouldn't work if you used median), because (A+B+C+D) / 4 is equal to ((A+B)/2 + (C+D)/2) / 2.
That would be (I haven't actually checked this code, so please inspect it carefully before you run it):
biasImages = []
biasImagesCombined = []
for idx, filename in enumerate(allfiles.files_filtered(NAXIS1=3128, NAXIS2=3080, OBSTYPE='BIAS')):
    ccd = fits.getdata(allfiles.location + filename)
    ccd = cp.CCDData(ccd, unit=u.adu)
    ccd = cp.subtract_overscan(ccd, overscan_axis=1, fits_section='[3099:3124,:]')
    ccd = cp.trim_image(ccd, fits_section='[27:3095,3:3078]')
    biasImages.append(ccd)
    # Combine every 5 bias images. This only works correctly if the number of
    # images is a multiple of 5 and you use average as the combine method.
    if (idx + 1) % 5 == 0:
        tmp_bias = cp.combine(biasImages, method='average')
        biasImages = []
        biasImagesCombined.append(tmp_bias)

master_bias = cp.combine(biasImagesCombined, output_file=path + 'mbias_avg.fits', method='average')
4GB of RAM
In that case you probably have 500 MB to spare, so you could simply use mem_limit to limit the amount of RAM the combine will take additionally. In that case just change your last line to:
# To account for additional memory usage I chose 100 MB of additional memory;
# you could try to adapt the actual number.
master_bias = cp.combine(biasImages, mem_limit=1024*1024*100, output_file=path + 'mbias_avg.fits', method='average')
More than 4GB of RAM but less than 8GB
In that case you probably have 1 GB of free RAM that could be used. It's still the same approach as the 4GB option but you could use a much higher mem_limit. I would start with 500 MB: mem_limit=1024*1024*500.
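For completeness, that is the same call as above with the limit raised to roughly 500 MB (reusing biasImages, path and cp from the earlier snippets):
# Same call as above, but allowing combine roughly 500 MB of working memory.
master_bias = cp.combine(biasImages, mem_limit=1024*1024*500, output_file=path + 'mbias_avg.fits', method='average')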
More than 8GB of RAM
In that case I must have missed something because using ~4.5GB of RAM shouldn't actually exhaust your RAM.

Python randomly drops to 0% CPU usage, causing the code to "hang up", when handling large numpy arrays?

I have been running some code, a part of which loads in a large 1D numpy array from a binary file, and then alters the array using the numpy.where() method.
Here is an example of the operations performed in the code:
import numpy as np
num = 2048
threshold = 0.5
with open(file, 'rb') as f:
    arr = np.fromfile(f, dtype=np.float32, count=num**3)
arr *= threshold
arr = np.where(arr >= 1.0, 1.0, arr)
vol_avg = np.sum(arr)/(num**3)
# both arr and vol_avg needed later
I have run this many times (on a free machine, i.e. no other inhibiting CPU or memory usage) with no issue. But recently I have noticed that sometimes the code hangs for an extended period of time, making the runtime an order of magnitude longer. On these occasions I have been monitoring %CPU and memory usage (using gnome system monitor), and found that python's CPU usage drops to 0%.
Using basic prints in between the above operations to debug, it seems to be arbitrary as to which operation causes the pausing (i.e. open(), np.fromfile(), np.where() have each separately caused a hang on a random run). It is as if I am being throttled randomly, because on other runs there are no hangs.
I have considered things like garbage collection or this question, but I cannot see any obvious relation to my problem (for example keystrokes have no effect).
Further notes: the binary file is 32GB, the machine (running Linux) has 256GB memory. I am running this code remotely, via an ssh session.
EDIT: This may be incidental, but I have noticed that there are no hang ups if I run the code after the machine has just been rebooted. It seems they begin to happen after a couple of runs, or at least other usage of the system.
np.where is creating a copy there and assigning it back into arr. So we could save memory there by avoiding the copying step, like so:
vol_avg = (np.sum(arr) - (arr[arr >= 1.0] - 1.0).sum())/(num**3)
We are using boolean indexing to select the elements that are greater than or equal to 1.0, taking their offsets from 1.0, summing those up and subtracting the result from the total sum. Hopefully the number of such elements is small, so this won't incur any more noticeable memory requirement. I am assuming this hanging issue with large arrays is a memory-based one.
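If arr itself also has to keep the clipped values (the question notes both arr and vol_avg are needed later), an in-place clip avoids the full-size temporary that np.where allocates. A small sketch with a stand-in array:
import numpy as np

arr = np.random.rand(2048**2).astype(np.float32) * 2  # stand-in for the loaded data
np.minimum(arr, 1.0, out=arr)   # clip in place; no second full-size array is allocated
vol_avg = arr.sum() / arr.size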
The drops in CPU usage were unrelated to python or numpy, but were indeed a result of reading from a shared disk, and network I/O was the real culprit. For such large arrays, reading into memory can be a major bottleneck.
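Since the culprit turned out to be I/O, one mitigation worth sketching (my own suggestion, not something proposed in the thread) is to process the file in fixed-size chunks, which keeps resident memory small and computes the average incrementally. It only applies if the full arr is not actually needed afterwards; file is the same path variable used in the question.
import numpy as np

num = 2048
threshold = 0.5
chunk = 64 * 1024**2           # elements per chunk (256 MB of float32)

total = 0.0
with open(file, 'rb') as f:
    while True:
        part = np.fromfile(f, dtype=np.float32, count=chunk)
        if part.size == 0:
            break
        part *= threshold
        np.minimum(part, 1.0, out=part)      # clip in place, as above
        total += part.sum(dtype=np.float64)  # accumulate in double precision

vol_avg = total / num**3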
Did you click in or select text in the console window? That can "hang" the process: the console enters "QuickEdit" mode, and pressing any key resumes the process.

Biopython Global Alignment : Out of Memory

I'm trying the global alignment method from the Biopython module. Using it on short sequences is easy and gives an alignment matrix straight away. However, I really need to run it on the longer sequences I have (an average length of 2000 nucleotides/characters), and I keep running into an Out of Memory error. I looked on SO and found this previous question. The answers provided are not helpful, as they link to this same website, which can't be accessed now. Apart from this, I have tried these steps:
I tried using 64-bit Python, since my personal computer has 4 GB of RAM.
I SSHed to a small school server with 16 GB of RAM and tried running it there. It's still running after close to 4 hours.
Since it is a small script, I'm unsure how to modify it. Any help will be greatly appreciated.
My script:
import os
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

file_list = [each for each in os.listdir(os.getcwd()) if each.endswith(".dna")]
align_file = open("seq_align.aln", "w")
seq_list = []
for each_file in file_list:
    f_o = open(each_file, "r")
    seq_list.append(f_o.read())

for a in pairwise2.align.globalmx(seq_list[0], seq_list[1]):
    align_file.write(format_alignment(*a))
align_file.close()
So the school server finally completed the task. What I realized was that for each alignment there were 1000 matrices constructed and calculated. The align.globalxx method has a variable MAX_ALIGNMENT which is set to 1000 by default. Changing it via monkey patching didn't really change anything. The documentation says the method tries all possible alignments (yes, 1000), but in my case all the matrices have the same alignment score (as did the few test sequences I tried). Finally, a small comment in the documentation states that if you need only one alignment (rather than all of them), use the optional parameter one_alignment_only, which accepts a boolean value. All I did was this:
for a in pairwise2.align.globalmx(seq_list[0], seq_list[1], one_alignment_only=True):
    align_file.write(format_alignment(*a))
This reduced the time considerably. However, my PC still crashed, so I assume this is a very memory-intensive task that requires much more RAM (the 16 GB on the small server). So a more memory-efficient way of building the alignment matrix probably needs to be found.
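If only the alignment score is needed (not the aligned sequences themselves), the pairwise2 documentation also describes a score_only keyword that skips recovering the alignments entirely and uses far less memory. A minimal sketch, using the globalxx variant mentioned above and two short stand-in sequences:
from Bio import pairwise2

seq_a = "ACGTACGTTTGACA"   # stand-ins for the real ~2000-nucleotide sequences
seq_b = "ACGTTCGTTTGCCA"

# score_only=True returns just the best score as a number, so no alignment
# traceback has to be stored; one_alignment_only=True returns a single alignment.
score = pairwise2.align.globalxx(seq_a, seq_b, score_only=True)
print(score)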

Memory issue in parallel Python

I have a Python script like this:
from modules import functions

a = 1
parameters = par_vals
for i in range(large_number):
    # do lots of stuff dependent on a, plot stuff, save plots as png
When I run this for a value of "a" it takes half an hour and uses only one core of my 6-core machine.
I want to run this code for 100 different values of "a".
The question is: How can I parallelize this so I use all cores and try all values of "a"?
My first approach, following an online suggestion, was:
from joblib import Parallel, delayed

def repeat(a):
    from modules import functions
    parameters = par_vals
    for i in range(large_number):
        # do lots of stuff dependent on a, plot stuff, save plots as png

A = list_100_a  # list of 100 different a values
Parallel(n_jobs=6, verbose=0)(delayed(repeat)(a) for a in A)
This successfully used all the cores in my computer, but it was computing all 100 values of a at the same time. After 4 hours my 64 GB of RAM and 64 GB of swap were saturated and the performance dropped drastically.
So I tried to manually queue the function, running it 6 at a time inside a for loop, but the memory was consumed in that case as well.
I don't know where the problem is. I guess the program is somehow holding on to memory it doesn't need.
What can I do so that I don't have this memory problem?
In summary:
When I run this function for a specific value of "a" everything is ok.
When I run this function in parallel for 6 values of "a" everything is ok.
When I sequentially run this function in parallel, the memory gradually increases until the computer can no longer work.
UPDATE: I found a solution to the memory problem, even though I don't understand why it works.
It appears that changing the matplotlib backend to 'Agg' makes the memory problem go away.
Just add this before any other matplotlib import and you should be fine:
from matplotlib import use
use('Agg')
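A related point (an assumption about the underlying cause, not stated by the poster): with the default interactive backends, pyplot keeps every figure alive until it is explicitly closed, so a loop that saves many PNGs can accumulate memory. Closing each figure after saving releases it:
import matplotlib
matplotlib.use('Agg')          # non-interactive backend, no GUI figure manager
import matplotlib.pyplot as plt
import numpy as np

for i in range(100):
    fig, ax = plt.subplots()
    ax.plot(np.random.rand(1000))
    fig.savefig(f"plot_{i:03d}.png")
    plt.close(fig)             # release the figure's memory before the next iteration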
Here's how I would do it with multiprocessing. I'll use your repeat function to do the work for one value of a.
def repeat(a):
    from modules import functions
    parameters = par_vals
    for i in range(large_number):
        # do lots of stuff dependent on a, plot stuff, save plots as png
Then I'd use multiprocessing.Pool like this:
import multiprocessing
pool = multiprocessing.Pool(processes=6) # Create a pool with 6 workers.
A = list_100_a  # list of 100 different a values
# Use the workers in the pool to call repeat on each value of a in A. We
# throw away the result of calling map, since it looks like the point of calling
# repeat(a) is for the side effects (files created, etc).
pool.map(repeat, A)
# Close the pool so no more jobs can be submitted to it, then wait for
# all workers to exit.
pool.close()
pool.join()
If you wanted the result of calling repeat, you could just do result = pool.map(repeat, A).
I don't think you'll run into any issues, but it's also useful to read the programming guidelines for using multiprocessing.
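If memory still grows across tasks (for example because the worker processes accumulate state), one option worth noting (my addition, not part of the original answer) is Pool's maxtasksperchild argument, which makes each worker exit and be replaced after a fixed number of tasks so its memory is returned to the OS. A minimal sketch with a placeholder work function:
import multiprocessing

def repeat(a):
    # placeholder for the real work: compute, plot and save PNGs for one value of a
    return a * a

if __name__ == '__main__':
    A = range(100)  # stand-in for list_100_a
    # Each worker is recycled after a single task, so per-task memory cannot accumulate.
    with multiprocessing.Pool(processes=6, maxtasksperchild=1) as pool:
        results = pool.map(repeat, A)
    print(results[:5])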
