Consider the memory usage of the following code after executing each statement:
import numpy
import scipy.signal
import gc
# memory at this point is ~35 MB
a = numpy.ones(10**7)
b = numpy.ones(10**7)
# memory at this point is ~187 MB
c = scipy.signal.fftconvolve(a, numpy.flipud(b), mode="full")
# memory usage at this point is ~645 MB
# given that a, b and c take up about 305 MB in total, this is much
# larger than expected
# If we delete a, b and c and garbage collect...
del a, b, c
gc.collect()
# ...the memory usage drops to ~340 MB, which is a drop of 305 MB
# (as expected, since that is the size of what we deleted),
# but the remaining 340 MB is still much larger than the
# starting value of ~35 MB
Why is the process using so much extra memory? Is there a leak in the fftconvolve function, or is there some other explanation?
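(For reference, measurements like the ones above can be reproduced with something like the following sketch, assuming the psutil package is available; psutil is not part of the code in question, it is just one way to read the process's resident memory.)
import os
import gc
import numpy
import scipy.signal
import psutil

proc = psutil.Process(os.getpid())

def rss_mb():
    # resident set size of this process, in MB
    return proc.memory_info().rss / 2**20

print("start:", rss_mb())
a = numpy.ones(10**7)
b = numpy.ones(10**7)
print("after a, b:", rss_mb())
c = scipy.signal.fftconvolve(a, numpy.flipud(b), mode="full")
print("after fftconvolve:", rss_mb())
del a, b, c
gc.collect()
print("after del + gc:", rss_mb())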
Platform:
CPython 3.5.1 x64
Scipy 0.17.0 (Christoph Gohlke's package)
Numpy+MKL 1.11.0rc1 (Christoph Gohlke's package)
Windows 8.1 x64
Related
I am using the Anaconda suite with IPython 3.6.1 and their Accelerate package. Its cufft sub-package has two functions, fft and ifft. As far as I understand, these take in a numpy array and output a numpy array, both in system RAM, i.e. all GPU memory and the transfer between system and GPU memory are handled automatically, and GPU memory is released when the function returns. This all seems very nice and works for me. However, I would like to run multiple fft/ifft calls on the same array and each time extract just one number from the result. It would be nice to keep the array in GPU memory to minimize system <-> GPU transfers. Am I correct that this is not possible using this package? If so, is there another package that can do it? I have noticed the reikna project, but that does not seem to be available in Anaconda.
The thing I am doing (and would like to do efficiently on the GPU) is, in short, shown here using numpy.fft:
import math as m
import numpy as np
import numpy.fft as dft
nr = 100
nh = 2**16
h = np.random.rand(nh)*1j
H = np.zeros(nh,dtype='complex64')
h[10] = 1
r = np.zeros(nr,dtype='complex64')
fftscale = m.sqrt(nh)
corr = 0.12j
for i in np.arange(nr):
    r[i] = h[10]
    H = dft.fft(h,nh)/fftscale
    h = dft.ifft(h*corr)*fftscale
r[nr-1] = h[10]
print(r)
Thanks in advance!
So I found ArrayFire, which seems rather easy to work with.
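For illustration, here is roughly what the GPU-resident pattern looks like with CuPy (just a sketch of the idea, not the Accelerate cufft or ArrayFire API; the array stays in GPU memory and only a single value is copied back per iteration):
import math as m
import numpy as np
import cupy as cp  # not the package from the question; used only to illustrate the pattern

nr = 100
nh = 2**16
h = cp.asarray(np.random.rand(nh) * 1j)  # one host -> device copy up front
h[10] = 1
r = np.zeros(nr, dtype='complex64')      # results stay on the host
fftscale = m.sqrt(nh)
corr = 0.12j
for i in range(nr):
    # copy back only one scalar per iteration
    r[i] = complex(cp.asnumpy(h[10]))
    H = cp.fft.fft(h, nh) / fftscale      # unused here, as in the numpy version above
    h = cp.fft.ifft(h * corr) * fftscale  # stays on the GPU
r[nr - 1] = complex(cp.asnumpy(h[10]))
print(r)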
For example, I have code that produces many integers.
import sys
import random
a = [random.randint(0, sys.maxint) for i in xrange(10000000)]
After running it, I got VIRT 350M, RES 320M (as viewed in htop).
Then I do:
del a
But the memory usage is still VIRT 272M, RES 242M (before producing the integers it was VIRT 24M, RES 6M).
The pmap of the process shows that there are two big pieces of [anon] memory.
Python 3.4 does not have this behavior: the memory is freed when I delete the list.
What is happening? Does Python keep the integers in memory?
Here's how I can duplicate it. If I start python 2.7, the interpreter uses about 4.5 MB of memory. (I'm quoting "Real Mem" values from the Mac OS X Activity Monitor.app).
>>> a = [random.randint(0, sys.maxint) for i in xrange(10000000)]
Now, memory usage is ~ 305.7 MB.
>>> del a
Removing a seems to have no effect on memory.
>>> import gc
>>> gc.collect() # perform a full collection
Now, memory usage is 27.7 MB. Sometimes, the first call to collect() doesn't seem to do anything, but a second collect() call will clean things up.
But this behavior is by design; Python isn't leaking. This old FAQ on effbot.org explains a bit more about what's happening:
“For speed”, Python maintains an internal free list for integer objects. Unfortunately, that free list is both immortal and unbounded in size. floats also use an immortal & unbounded free list.
Essentially, Python hangs on to the freed integer objects on a free list, under the assumption that you might create lots of integers again.
Consider this:
# 4.5 MB
>>> a = [object() for i in xrange(10000000)]
# 166.7 MB
>>> del a
# 9.1 MB
In this case, it's pretty obvious that Python is not keeping the objects around in memory, and removing a triggers a garbage collection which cleans everything up.
As I recall, CPython will actually keep small integers in memory forever (the cached range is -5 through 256). This may explain why the gc.collect() call doesn't return as much memory as removing the list of objects.
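You can see the small-integer cache at the interactive prompt (a CPython implementation detail, not a language guarantee):
>>> a = 256
>>> b = 256
>>> a is b   # both names point at the cached 256 object
True
>>> a = 100000
>>> b = 100000
>>> a is b   # large values entered separately are distinct objects
False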
I looked through the PEPs a bit to figure out why Python 3 is different, but I didn't see anything obvious. If you really wanted to know, you could dig around in the source code.
Suffice it to say that in Python 3, either the integer free-list behavior has changed or the garbage collector got better.
Many things are better in Python 3.
I am using numpy and my model involves intensive matrix-matrix multiplication.
To speed up, I use OpenBLAS multi-threaded library to parallelize the numpy.dot function.
My setting is as follows,
OS : CentOS 6.2 server #CPUs = 12, #MEM = 96GB
python version: Python2.7.6
numpy : numpy 1.8.0
OpenBLAS + IntelMKL
$ OMP_NUM_THREADS=8 python test_mul.py
The code, which I took from https://gist.github.com/osdf/, is:
test_mul.py :
import numpy
import sys
import timeit
try:
    import numpy.core._dotblas
    print 'FAST BLAS'
except ImportError:
    print 'slow blas'
print "version:", numpy.__version__
print "maxint:", sys.maxint
print
x = numpy.random.random((1000,1000))
setup = "import numpy; x = numpy.random.random((1000,1000))"
count = 5
t = timeit.Timer("numpy.dot(x, x.T)", setup=setup)
print "dot:", t.timeit(count)/count, "sec"
when I use OMP_NUM_THREADS=1 python test_mul.py, the result is
dot: 0.200172233582 sec
OMP_NUM_THREADS=2
dot: 0.103047609329 sec
OMP_NUM_THREADS=4
dot: 0.0533880233765 sec
Things go well.
However, when I set OMP_NUM_THREADS=8, the code only works occasionally: sometimes it runs fine, and sometimes it does not even run and gives me core dumps.
When OMP_NUM_THREADS > 10, the code seems to break all the time.
I am wondering what is happening here. Is there some maximum number of threads that each process can use? Can I raise that limit, given that I have 12 CPUs in my machine?
Thanks
Firstly, I don't really understand what you mean by 'OpenBLAS + IntelMKL'. Both of those are BLAS libraries, and numpy should only link to one of them at runtime. You should probably check which of these two numpy is actually using. You can do this by calling:
$ ldd <path-to-site-packages>/numpy/core/_dotblas.so
Update: numpy/core/_dotblas.so was removed in numpy v1.10, but you can check the linkage of numpy/core/multiarray.so instead.
For example, I link against OpenBLAS:
...
libopenblas.so.0 => /opt/OpenBLAS/lib/libopenblas.so.0 (0x00007f788c934000)
...
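If ldd isn't convenient, you can also ask numpy what it was built against (this reports build-time configuration rather than runtime linkage, so ldd remains the more definitive check):
>>> import numpy
>>> numpy.__config__.show()   # prints the BLAS/LAPACK libraries numpy was built with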
If you are indeed linking against OpenBLAS, did you build it from source? If you did, you should see that in the Makefile.rule there is a commented option:
...
# You can define maximum number of threads. Basically it should be
# less than actual number of cores. If you don't specify one, it's
# automatically detected by the script.
# NUM_THREADS = 24
...
By default OpenBLAS will try to set the maximum number of threads to use automatically, but you could try uncommenting and editing this line yourself if it is not detecting this correctly.
Also, bear in mind that you will probably see diminishing returns in terms of performance from using more threads. Unless your arrays are very large it is unlikely that using more than 6 threads will give much of a performance boost because of the increased overhead involved in thread creation and management.
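Independently of the build-time NUM_THREADS option, you can also cap the thread count at runtime with environment variables, provided they are set before the BLAS library is loaded. A minimal sketch (OPENBLAS_NUM_THREADS applies to OpenBLAS, MKL_NUM_THREADS to MKL):
import os
# must be set before "import numpy" pulls in the BLAS library
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["OPENBLAS_NUM_THREADS"] = "4"   # honored by OpenBLAS
os.environ["MKL_NUM_THREADS"] = "4"        # honored by MKL

import numpy
x = numpy.random.random((1000, 1000))
numpy.dot(x, x.T)   # should now use at most 4 BLAS threads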
I have an application that uses a number of classes inheriting from HasTraits. Some of these classes manage access to data and others provide functions for analyzing that data. This works wonderfully for a gui -- I can check that the data and analysis code is doing what it should. However, I've noticed that when I use these classes for gui-less computations, all the cpus on the system end up getting used.
Here is a small example that shows the cpu usage:
from traits.api import HasTraits, List, Int, Enum, Instance
import numpy as np
import psutil
from itertools import combinations
"""
Small example of high CPU usage by traited classes
"""
class DataStorage(HasTraits):
    nsamples = Int(2000)
    samples = List

    def _samples_default(self):
        return np.random.randn(self.nsamples,2000).tolist()

    def sample_samples(self,indices):
        """ return a 2D array of data at indices """
        return np.array(
            [self.samples[i] for i in indices])

class DataAccessor(HasTraits):
    """ Class that grabs data and computes something """
    measure = Enum("correlation","covariance")
    data_source = Instance(DataStorage,())

    def compute_measure(self,indices):
        """ example of some computation """
        samples = self.data_source.sample_samples(indices)
        percentage = psutil.cpu_percent(interval=0, percpu=True)
        if self.measure == "correlation":
            result = np.corrcoef(samples)
        elif self.measure == "covariance":
            result = np.cov(samples)
        return percentage

# Run a simulation to see cpu usage
analyzer = DataAccessor()
usage = []
n_iterations = 0
max_iterations = 500
for combo in combinations(np.arange(2000),500):
    # evaluate the measurement on a subset of the data
    usage.append(analyzer.compute_measure(combo))
    n_iterations += 1
    if n_iterations > max_iterations:
        break
print n_iterations
use_percents = np.array(usage).T
When I run this on an 8-cpu machine running CentOS, top reports the python process at roughly 600%.
>>> use_percents.mean(1)
shows
array([ 67.05548902, 67.06906188, 66.89041916, 67.28942116,
66.69421158, 67.61437126, 99.8007984 , 67.31996008])
Question:
My computation is embarrassingly parallel, so it would be great to have the other cpus available to split up the job. Does anyone know what's happening here? A plain python version of this uses 100% on a single cpu.
Is there a way to keep everything local to a single CPU without rewriting all of my classes to drop Traits?
Traits is not causing the CPU usage. It's easy to rewrite this bit of code without Traits, and you will see that you get the same pattern of CPU usage (at least, I do).
Instead, what you are probably seeing is the CPU usage of the BLAS library that your build of numpy is linked against. numpy.corrcoef() calls numpy.cov(), and much of the computation of numpy.cov() is taken up by a numpy.dot() call, which does a matrix-matrix multiplication using BLAS. If it is an optimized BLAS library, it will usually use non-Python threads internally to split these computations among your CPUs. You will have to consult the documentation of your optimized BLAS library to find out how to change this.
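For example, one portable way to limit the BLAS threads from Python is the third-party threadpoolctl package (a sketch, assuming it is installed; setting an environment variable such as OMP_NUM_THREADS=1 before starting Python also works):
from threadpoolctl import threadpool_limits
import numpy as np

a = np.random.randn(500, 2000)

# restrict the BLAS thread pool to a single thread for this block only
with threadpool_limits(limits=1, user_api="blas"):
    result = np.corrcoef(a)   # now runs on one CPU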
I'm thoroughly confused about the memory usage of a specific python script. I guess I don't really know how to profile the usage despite advice from several SO Questions/Answers.
My questions are: What's the difference between memory_profiler and guppy.hpy? Why is one telling me I'm using huge amounts of memory, and the other is telling me I'm not?
I'm working with pysam, a library for accessing bioinformatics SAM/BAM files. My main script is running out of memory quickly when converting SAM (ASCII) to BAM (Binary) and manipulating the files in between.
I created a small test example to understand how much memory gets allocated at each step.
# test_pysam.py:
import pysam
#from guppy import hpy
TESTFILENAME = ('/projectnb/scv/yannpaul/MAR_CEJ082/' +
'test.sam')
#H = hpy()
@profile # for memory_profiler
def samopen(filename):
    # H.setrelheap()
    samf = pysam.Samfile(filename)
    # print H.heap()
    pass

if __name__ == "__main__":
    samopen(TESTFILENAME)
Monitoring the memory usage with memory_profiler (python -m memory_profiler test_pysam.py) results in the following output:
Filename: test_pysam.py
Line #    Mem usage    Increment   Line Contents
================================================
    10                             @profile # for memory_profiler
    11                             def samopen(filename):
    12     10.48 MB      0.00 MB       # print H.setrelheap()
    13    539.51 MB    529.03 MB       samf = pysam.Samfile(filename)
    14                                 # print H.heap()
    15    539.51 MB      0.00 MB       pass
Then, commenting out the @profile decorator and uncommenting the guppy-related lines, I get the following output (python test_pysam.py):
Partition of a set of 3 objects. Total size = 624 bytes.
 Index  Count   %     Size   %  Cumulative  %  Kind (class / dict of class)
     0      1  33      448  72         448  72  types.FrameType
     1      1  33       88  14         536  86  __builtin__.weakref
     2      1  33       88  14         624 100  csamtools.Samfile
The memory attributed to line 13 is 529.03 MB in one case and just 624 bytes in the other. What's actually going on here? 'test.sam' is a ~52MB SAM (again, an ASCII format) file. It's a bit tricky for me to dig deep into pysam, as it's a wrapper around a C library related to samtools. Regardless of what a Samfile actually is, I think I should be able to learn how much memory is allocated to create it. What procedure should I use to correctly profile the memory usage of each step of my larger, more complex python program?
What's the difference between memory_profiler and guppy.hpy?
Do you understand the difference between your internal view of the heap and the OS's external view of your program? (For example, when the Python interpreter calls free on 1MB, that doesn't immediately—or maybe even ever—return 1MB worth of pages to the OS, for multiple reasons.) If you do, then the answer is pretty easy: memory_profiler is asking the OS for your memory use; guppy is figuring it out internally from the heap structures.
Beyond that, memory_profiler has one feature guppy doesn't: automatically instrumenting your function to print a report after each line of code. It's otherwise much simpler and easier to use, but less flexible. If there's something you know you want to do and memory_profiler doesn't seem to do it, it probably can't; with guppy, maybe it can, so study the docs and the source.
Why is one telling me I'm using huge amounts of memory, and the other is telling me I'm not?
It's hard to be sure, but here are some guesses; the answer is likely to be a combination of more than one:
Maybe samtools uses mmap to map small enough files entirely into memory. This would increase your page usage by the size of the file, but not increase your heap usage at all.
Maybe samtools or pysam creates a lot of temporary objects that are quickly freed. You could have lots of fragmentation (only a couple live PyObjects on each page), or your system's malloc may have decided it should keep lots of nodes in its freelist because of the way you've been allocating, or it may not have returned pages to the OS yet, or the OS's VM may not have reclaimed pages that were returned. The exact reason is almost always impossible to guess; the simplest thing to do is to assume that freed memory is never returned.
What procedure should I use to correctly profile the memory usage of each step of my larger, more complex python program?
If you're asking about memory usage from the OS point of view, memory_profiler is doing exactly what you want. While major digging into pysam may be difficult, it should be trivial to wrap a few of the functions with the @profile decorator. Then you'll know which C functions are responsible for memory; if you want to dig any deeper, you obviously have to profile at the C level (unless there's information in the samtools docs or from the samtools community).
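As a sketch of what that could look like (the filename and the second function are placeholders, not part of your actual program):
from memory_profiler import profile
import pysam

@profile
def load(filename):
    # OS-level memory growth of this call shows up line by line in the report
    return pysam.Samfile(filename)

@profile
def convert(samf):
    # ...whatever SAM -> BAM manipulation the real program does...
    pass

if __name__ == "__main__":
    samf = load('test.sam')   # placeholder path
    convert(samf)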