Matplotlib multiprocessing fonts corruption using savefig - python

I have had trouble with multiprocessing in Matplotlib since version 1.5: the fonts randomly jump around their original positions (example image omitted here).
A simple example to reproduce this bug:
import multiprocessing
import matplotlib.pyplot as plt

fig = plt.figure()

def plot(i):
    fig = plt.gcf()
    plt.plot([], [])
    fig.savefig('%d.png' % i)

plot(0)
pool = multiprocessing.Pool(4)
pool.map(plot, range(10))
If the order of creating the multiprocessing pool and the plain plotting call is reversed:
pool = multiprocessing.Pool(4)
plot(0)
pool.map(plot, range(10))
then it works, but this workaround is useless for my purpose.
Thank you.

I've recently run into this same problem while testing methods for parallel plotting large numbers of plots. While I haven't found a solution using the multiprocessing module, I've found that I do not see the same errors using the Parallel Python package (http://www.parallelpython.com/). It seems to be ~50% slower than the multiprocessing module in my early tests, but still a significant speedup over serial plotting. It's also a little finicky regarding module imports so I would ultimately prefer to find a solution using multiprocessing, but for now this is a passable workaround (for me at least). That said, I'm pretty new to parallel processing so there may be some nuances of the two approaches that I'm missing here.
###############################################################################
import os
import sys
import time
#import numpy as np
import numpy                # Importing with 'as' doesn't work with Parallel Python
#import matplotlib.pyplot as plt
import matplotlib.pyplot    # Importing with 'as' doesn't work with Parallel Python
import pp
import multiprocessing as mp
###############################################################################
path1 = './Test_PP'
path2 = './Test_MP'
nplots = 100
###############################################################################
def plotrandom(plotid, N, path):
    numpy.random.seed()     # Required for multiprocessing module but not Parallel Python...
    x = numpy.random.randn(N)
    y = x**2
    matplotlib.pyplot.scatter(x, y)
    matplotlib.pyplot.savefig(os.path.join(path, 'test_%d.png' % (plotid)), dpi=150)
    matplotlib.pyplot.close('all')
###############################################################################
# Parallel Python implementation
tstart_1 = time.time()
if not os.path.exists(path1):
    os.makedirs(path1)
ppservers = ()
if len(sys.argv) > 1:
    ncpus = int(sys.argv[1])
    job_server = pp.Server(ncpus, ppservers=ppservers)
else:
    job_server = pp.Server(ppservers=ppservers)
print "Starting Parallel Python v2 with", job_server.get_ncpus(), "workers"
jobs = [(input_i, job_server.submit(plotrandom, (input_i, 10, path1), (), ("numpy", "matplotlib.pyplot"))) for input_i in range(nplots)]
for input_i, job in jobs:
    job()
tend_1 = time.time()
t1 = tend_1 - tstart_1
print 'Parallel Python = %0.5f sec' % (t1)
job_server.print_stats()
###############################################################################
# Multiprocessing implementation
tstart_2 = time.time()
if not os.path.exists(path2):
    os.makedirs(path2)
if len(sys.argv) > 1:
    ncpus = int(sys.argv[1])
else:
    ncpus = mp.cpu_count()
print "Starting multiprocessing v2 with %d workers" % (ncpus)
pool = mp.Pool(processes=ncpus)
jobs = [pool.apply_async(plotrandom, args=(i, 10, path2)) for i in range(nplots)]
results = [r.get() for r in jobs]   # This line actually runs the jobs
pool.close()
pool.join()
tend_2 = time.time()
t2 = tend_2 - tstart_2
print 'Multiprocessing = %0.5f sec' % (t2)
###############################################################################

I have found a solution. The main cause of the trouble is the font caching in the dictionary _fontd in matplotlib/backends/backend_agg.py.
Therefore, I use a different cache key for each process by adding multiprocessing.current_process().pid to the hash stored in key in the function _get_agg_font.
If anybody knows a more elegant solution that does not require modifying matplotlib's files, please let me know.

Here is how I changed the function _get_agg_font in backend_agg.py:
from multiprocessing import current_process

def _get_agg_font(self, prop):
    """
    Get the font for text instance t, cacheing for efficiency
    """
    if __debug__: verbose.report('RendererAgg._get_agg_font',
                                 'debug-annoying')

    key = hash(prop)
    key += current_process().pid
    font = RendererAgg._fontd.get(key)

    if font is None:
        fname = findfont(prop)
        #font = RendererAgg._fontd.get(fname)
        if font is None:
            font = FT2Font(
                fname,
                hinting_factor=rcParams['text.hinting_factor'])
            RendererAgg._fontd[fname] = font
        RendererAgg._fontd[key] = font

    font.clear()
    size = prop.get_size_in_points()
    font.set_size(size, self.dpi)

    return font
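If you would rather not edit matplotlib's source files at all, a minimal sketch of the same idea is to monkey-patch RendererAgg._get_agg_font at runtime in each worker. This relies on the same private internals used above (RendererAgg._fontd, findfont, FT2Font), so it is tied to that matplotlib version and may break in later releases:

from multiprocessing import current_process
from matplotlib import rcParams
from matplotlib.backends import backend_agg
from matplotlib.font_manager import findfont
from matplotlib.ft2font import FT2Font

def _get_agg_font_per_process(self, prop):
    # Same logic as the patched function above, but defined outside matplotlib;
    # the worker pid makes the cache key unique per process.
    key = hash(prop) + current_process().pid
    font = backend_agg.RendererAgg._fontd.get(key)
    if font is None:
        fname = findfont(prop)
        font = FT2Font(fname, hinting_factor=rcParams['text.hinting_factor'])
        backend_agg.RendererAgg._fontd[key] = font
    font.clear()
    font.set_size(prop.get_size_in_points(), self.dpi)
    return font

backend_agg.RendererAgg._get_agg_font = _get_agg_font_per_process

Applying this patch in every worker (for example from a Pool initializer) should give each process its own cache entries without touching the installed matplotlib files.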

The solution to this is to place your matplotlib imports inside the function you're passing to multiprocessing.
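A minimal sketch of that idea (the plot function here is hypothetical; the point is that pyplot is imported inside the worker, so each process builds its own matplotlib state):

import multiprocessing

def plot(i):
    # Import matplotlib inside the worker so each process initializes
    # its own backend and font caches.
    import matplotlib
    matplotlib.use('Agg')   # non-interactive backend for worker processes
    import matplotlib.pyplot as plt
    fig = plt.figure()
    fig.gca().plot([0, 1], [0, 1])
    fig.savefig('%d.png' % i)
    plt.close(fig)

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    pool.map(plot, range(10))
    pool.close()
    pool.join()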

I had the same problem: concurrent use of cached fonts created glitches in plotted text (e.g. axis numbers). The problem appears to be font caching by an lru_cache:
https://github.com/matplotlib/matplotlib/blob/7b6eb77731ff2b58c43c0d75a9cc038ada8d89cd/lib/matplotlib/font_manager.py#L1316
For me, upgrading to Python 3.7 solved it (it apparently supports cleaning up this state after forking).
Running the following in your workers might help as well:
import matplotlib
matplotlib.font_manager._get_font.cache_clear()
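For example, a minimal sketch that clears the cache once per worker through a Pool initializer (this assumes a matplotlib version where the private font_manager._get_font lru_cache linked above exists; being a private API, it may change):

import multiprocessing

def clear_font_cache():
    import matplotlib.font_manager
    # Drop any font objects inherited from the parent process at fork time.
    matplotlib.font_manager._get_font.cache_clear()

def plot(i):
    import matplotlib.pyplot as plt
    plt.plot([0, 1], [0, i])
    plt.savefig('%d.png' % i)
    plt.close('all')

if __name__ == '__main__':
    pool = multiprocessing.Pool(4, initializer=clear_font_cache)
    pool.map(plot, range(10))
    pool.close()
    pool.join()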

Related

Why does a python function work in parallel even if it should not?

I am running this code using the healpy package. I am not using multiprocessing and I need it to run on a single core. It worked for a certain amount of time, but, when I run it now, the function healpy.projector.GnomonicProj.projmap takes all the available cores.
This is the offending code block:
def Stacking():
    f = lambda x, y, z: pixelfunc.vec2pix(xsize, x, y, z, nest=False)
    map_array = pixelfunc.ma_to_array(data)
    im = np.zeros((xsize, xsize))
    plt.figure()
    for i in range(nvoids):
        sys.stdout.write("\r" + str(i+1) + "/" + str(nvoids))
        sys.stdout.flush()
        proj = hp.projector.GnomonicProj(rot=[rav[i], decv[i]], xsize=xsize,
                                         reso=2*nRad*rad_deg[i]*60/(xsize))
        im += proj.projmap(map_array, f)
    im /= nvoids
    plt.imshow(im)
    plt.colorbar()
    plt.title(title + " (Map)")
    plt.savefig("../Plots/stackedMap_" + name + ".png")
    return im
Does someone know why this function is running in parallel? And, most importantly, does someone know a way to run it on a single core?
Thank you!
In this thread they recommend setting the environment variable OMP_NUM_THREADS accordingly.
This worked:
import os
os.environ['OMP_NUM_THREADS'] = '1'
import healpy as hp
import numpy as np
Setting os.environ['OMP_NUM_THREADS'] = '1' has to be done before importing the numpy and healpy libraries.
As to the why: they probably use some parallelization technique wrapped inside their implementation of the functions you call. Judging by the name of the variable, I would guess it is OpenMP.

python multiprocessing.pool application

The following (simplified) code applies an interpolation function using the multiprocessing module:
from multiprocessing import Pool
from scipy.interpolate import LinearNDInterpolator
...
if __name__ == "__main__":
    p = Pool(4)
    lndi = LinearNDInterpolator(points, valuesA)
    valuesB = list(np.split(valuesA, 4))
    ret = p.map(lndi.__call__, valuesB)
When I run the .py, Python freezes; if the last line is run separately, everything works fine and I get the speed-up I hoped for.
Does anyone know how to fix the code so that it works automatically?
Thanks in advance.
Edit: a GitHub issue was opened -> https://github.com/spyder-ide/spyder/issues/3367

Plotting pandas dataframe and multiprocessing in Python

I have a pandas dataframe and I want to plot slices of it in a function, using multiprocessing. Even though the function process_expression works when I call it on its own, when I use the multiprocessing option it does not produce any plots.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import seaborn as sns
import sys
from multiprocessing import Pool
import os

os.system("taskset -p 0xff %d" % os.getpid())

def process_expression(gn_name, df_gn=df_coding):
    df_part = df_gn.loc[df_gn['Gene_id'] == gn_name]
    df_part = df_part.drop('Gene_id', 1)
    df_part = df_part.drop('Transcript_biotype', 1)
    COUNT100 = df_part[df_part > 100].count()
    COUNT10 = (df_part[df_part > 10].count()) - COUNT100
    COUNT1 = (df_part[df_part > 1].count()) - COUNT100 - COUNT10
    COUNT0 = (df_part[df_part > 0].count()) - COUNT100 - COUNT10 - COUNT1
    result = pd.concat([COUNT0, COUNT1, COUNT10, COUNT100], axis=1)
    result.columns = ['0 TO 1', '1 TO 10', '10 TO 100', '>100']
    result.plot(kind='bar', figsize=(50, 20), fontsize=7, stacked=True)
    plt.savefig('./expression_levels/all_genes/' + gn_name + '.png')  #, bbox_inches='tight')
    plt.close()

pool = Pool()
gn = pool.map(process_expression, gene_ids)
pool.close()
pool.join()
The df_coding table looks something like this (it has more columns; I erased some):
Isoform_name,heart,heart.1,lung.3,Gene_id,Transcript_biotype
ENST00000296782,0.14546900000000001,0.161245,0.09479889999999999,ENSG00000164327,protein_coding
ENST00000357387,6.53902,5.86969,7.057689999999999,ENSG00000164327,protein_coding
ENST00000514735,0.0,0.0,0.0,ENSG00000164327,protein_coding
The input dataframe df_coding is a dataframe with a column Gene_id. In this column I have a list of gn_name. What I want is to take each time only the parts of the dataframe which have the name gn_name[i] in the Gene_id column and plot a barplot based on this dataframe.
For example, if I call process_expression('ENSG00000164327'), where 'ENSG00000164327' is a specific gn_name, the output is a stacked bar plot (example image omitted here).
What am I doing wrong? I know that the process stops at the plotting command when I run it with multiprocessing.
The problem lies between multiprocessing and matplotlib. With multiprocessing you create a completely new context in each process. The new context does not (and cannot) successfully initialize the plotting context, because it is already initialized in the parent process.
If you are trying to overcome a performance issue, then you may be on the right track. However, plotting back to the correctly initialized context of the parent process will require you to go a lot deeper into the underlying matplotlib internals. Here is an example of setting up a data pipe back to the original application. Really, this is only going to help if you are doing a lot of processing of the data before it is plotted. It doesn't look like that is what you are doing here.
If you are trying to get a visual effect like stacked / overlaid results, then you probably want to look into repeating the plot function or modifying the data structure to better represent what you want to visualize.
So: what problem are you trying to solve? A performance problem, or a visualization problem? If it is a visualization problem, then you do NOT want to use multiprocessing.
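If it is a performance problem, a minimal sketch of the pattern described above is to do the heavy per-gene computation in worker processes and keep all plotting in the parent process, whose matplotlib context is already correctly initialized (count_expression and gene_ids below are hypothetical stand-ins for the data handling in the question):

import multiprocessing
import matplotlib.pyplot as plt

def count_expression(gn_name):
    # ... heavy pandas/numpy work for one gene would go here ...
    # Return plain data (lists, arrays), never figures or axes.
    return gn_name, [5, 3, 2, 1]

if __name__ == '__main__':
    gene_ids = ['ENSG00000164327']          # hypothetical input list
    pool = multiprocessing.Pool()
    results = pool.map(count_expression, gene_ids)
    pool.close()
    pool.join()
    for gn_name, counts in results:         # plotting stays in the parent process
        plt.figure()
        plt.bar(range(len(counts)), counts)
        plt.savefig(gn_name + '.png')
        plt.close()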

Multiprocessing IOError: bad message length

I get an IOError: bad message length when passing large arguments to the map function. How can I avoid this?
The error occurs when I set N=1500 or bigger.
The code is:
import numpy as np
import multiprocessing

def func(args):
    i = args[0]
    images = args[1]
    print i
    return 0

N = 1500   # N=1000 works fine
images = []
for i in np.arange(N):
    images.append(np.random.random_integers(1, 100, size=(500, 500)))

iter_args = []
for i in range(0, 1):
    iter_args.append([i, images])

pool = multiprocessing.Pool()
print pool
pool.map(func, iter_args)
In the multiprocessing docs there is a function recv_bytes that raises an IOError. Could that be the cause? (https://python.readthedocs.org/en/v2.7.2/library/multiprocessing.html)
EDIT
If I use images as a numpy array instead of a list, I get a different error: SystemError: NULL result without error in PyObject_Call.
A bit different code:
import numpy as np
import multiprocessing

def func(args):
    i = args[0]
    images = args[1]
    print i
    return 0

N = 1500   # N=1000 works fine
images = []
for i in np.arange(N):
    images.append(np.random.random_integers(1, 100, size=(500, 500)))
images = np.array(images)   # new

iter_args = []
for i in range(0, 1):
    iter_args.append([i, images])

pool = multiprocessing.Pool()
print pool
pool.map(func, iter_args)
EDIT 2: The actual function that I use is:
def func(args):
    i = args[0]
    images = args[1]
    image = np.mean(images, axis=0)
    np.savetxt("image%d.txt" % (i), image)
    return 0
Additionally, the iter_args do not contain the same set of images:
iter_args = []
for i in range(0, 1):
    rand_ind = np.random.random_integers(0, N-1, N)
    iter_args.append([i, images[rand_ind]])
You're creating a pool and sending all the images at once to func(). If you can get away with working on a single image at once, try something like this, which runs to completion with N=10000 in 35s with Python 2.7.10 for me:
import numpy as np
import multiprocessing

def func(args):
    i = args[0]
    img = args[1]
    print "{}: {} {}".format(i, img.shape, img.sum())
    return 0

N = 10000
images = ((i, np.random.random_integers(1, 100, size=(500, 500))) for i in xrange(N))

pool = multiprocessing.Pool(4)
pool.imap(func, images)
pool.close()
pool.join()
The key here is to use iterators so you don't have to hold all the data in memory at once. For instance I converted images from an array holding all the data to a generator expression to create the image only when needed. You could modify this to load your images from disk or whatever. I also used pool.imap instead of pool.map.
If you can, try to load the image data in the worker function. Right now you have to serialize all the data and ship it across to another process. If your image data is larger, this might be a bottleneck.
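For example, a minimal sketch of loading the data inside the worker instead of shipping it through the pool (image_files is a hypothetical list of paths standing in for however the images are stored on disk):

import numpy as np
import multiprocessing

def func(args):
    i, path = args
    img = np.loadtxt(path)      # load the image in the worker, not in the parent
    return i, img.sum()

if __name__ == '__main__':
    image_files = ['image%d.txt' % i for i in range(100)]
    pool = multiprocessing.Pool(4)
    results = pool.map(func, list(enumerate(image_files)))
    pool.close()
    pool.join()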
[update now that we know func has to handle all images at once]
You could do an iterative mean on your images. Here's a solution without using multiprocessing. To use multiprocessing, you could divide your images into chunks and farm those chunks out to the pool (a sketch of that chunked variant follows the code below).
import numpy as np

N = 10000
shape = (500, 500)

def func(images):
    average = np.full(shape, 0)
    for i, img in images:
        average += img / N
    return average

images = ((i, np.full(shape, i)) for i in range(N))
print func(images)
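And a minimal sketch of the chunked multiprocessing variant mentioned above: each worker builds (or loads) only its own slice of the images and returns a partial sum, and the parent combines the partial sums into the mean, so the full image stack is never shipped between processes:

import numpy as np
import multiprocessing

N = 10000
shape = (500, 500)

def partial_sum(index_range):
    total = np.zeros(shape)
    for i in index_range:
        total += np.full(shape, i)   # stand-in for loading image i in the worker
    return total

if __name__ == '__main__':
    nworkers = 4
    chunks = [range(k, N, nworkers) for k in range(nworkers)]
    pool = multiprocessing.Pool(nworkers)
    partials = pool.map(partial_sum, chunks)
    pool.close()
    pool.join()
    average = sum(partials) / N
    print(average)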
Python is likely to load your data into RAM, and you need that memory to be available. Have you checked your computer's memory usage?
Also, as Patrick mentioned, you're loading 3 GB of data, so make sure you use the 64-bit version of Python, as you are reaching the 32-bit memory constraint. This could cause your process to crash: 32 vs 64 bits Python
Another improvement would be to use Python 3.4 instead of 2.7. The Python 3 implementation seems to be optimized for very large ranges; see Python3 vs Python2 list/generator range performance
When running your program, it actually gives me a clear error:
OSError: [Errno 12] Cannot allocate memory
As mentioned by other users, the solution to your problem is simple: add memory (a lot of it) or change the way your program handles the images.
The reason it's using so much memory is that you allocate the memory for your images at module level. So when multiprocessing forks your process, it also copies all the images (which isn't free, according to Shared-memory objects in python multiprocessing). This is unnecessary, because you are also passing the images as an argument to the function, which the multiprocessing module copies again using IPC and pickle; this would still likely result in a lack of memory. Try one of the solutions proposed by the other users.
This is what solved the problem: declaring the images global.
import numpy as np
import multiprocessing

N = 1500   # N=1000 works fine
images = []
for i in np.arange(N):
    images.append(np.random.random_integers(1, 100, size=(500, 500)))

def func(args):
    i = args[0]
    # 'images' is the module-level (global) list, inherited by the worker
    # processes when they are forked, so it is not passed as an argument.
    print i
    return 0

iter_args = []
for i in range(0, 1):
    iter_args.append([i])

pool = multiprocessing.Pool()
print pool
pool.map(func, iter_args)
The reason you get IOError: bad message length when passing around large objects is a hard-coded limit of 0x7fffffff bytes (around 2.1 GB) in older CPython versions (3.2 and earlier): https://github.com/python/cpython/blob/v2.7.5/Modules/_multiprocessing/multiprocessing.h#L182
This CPython changeset (which is in CPython 3.3 and later) removed the hard coded limit: https://github.com/python/cpython/commit/87cf220972c9cb400ddcd577962883dcc5dca51a#diff-4711c9abeca41b149f648d4b3c15b6a7d2baa06aa066f46359e4498eb8e39f60L182

How can I release memory after creating matplotlib figures

I have several matplotlib functions rolled into some django-celery tasks.
Every time the tasks are called, more RAM is dedicated to Python. Before too long, Python is taking up all of the RAM.
QUESTION: How can I release this memory?
UPDATE 2 - A Second Solution:
I asked a similar question specifically about the memory locked up when matplotlib errors, but I got a good answer to this question: .clf(), .close(), and gc.collect() aren't needed if you use multiprocessing to run the plotting function in a separate process whose memory will automatically be freed once the process ends.
Matplotlib errors result in a memory leak. How can I free up that memory?
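A minimal sketch of that second solution: run the plotting in a short-lived child process, so whatever memory matplotlib allocates is returned to the operating system when the process exits, with no .clf()/.close()/gc.collect() bookkeeping:

import multiprocessing
import numpy as np

def make_plot(fname):
    # Import and use matplotlib only inside the throwaway process.
    import matplotlib
    matplotlib.use('Agg')
    import matplotlib.pyplot as plt
    a = np.arange(1000000)
    b = np.random.randn(1000000)
    plt.plot(a, b)
    plt.savefig(fname)

if __name__ == '__main__':
    p = multiprocessing.Process(target=make_plot, args=('test.png',))
    p.start()
    p.join()    # all memory used for plotting is freed when the child exits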
UPDATE - The Solution:
These stackoverflow posts suggested that I can release the memory used by matplotlib objects with the following commands:
.clf(): Matplotlib runs out of memory when plotting in a loop
.close(): Python matplotlib: memory not being released when specifying figure size
import gc
gc.collect()
Here is the example I used to test the solution:
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pylab import figure, savefig
import numpy as np
import gc
a = np.arange(1000000)
b = np.random.randn(1000000)
fig = plt.figure(num=1, dpi=100, facecolor='w', edgecolor='w')
fig.set_size_inches(10,7)
ax = fig.add_subplot(111)
ax.plot(a, b)
fig.clf()
plt.close()
del a, b
gc.collect()
Did you try running your task function several times (in a for loop) to be sure that your function itself is not leaking, regardless of Celery?
Make sure that django.settings.DEBUG is set to False (the connection object holds all queries in memory when DEBUG=True).
import matplotlib.pyplot as plt
from datetime import datetime
import gc

class MyClass:
    def plotmanytimesandsave(self):
        plt.plot([1, 2, 3])
        ro2 = datetime.now()
        f = ro2.second
        name = str(f) + ".jpg"
        plt.savefig(name)
        plt.draw()
        plt.clf()
        plt.close("all")

for y in range(1, 10):
    k = MyClass()
    k.plotmanytimesandsave()
    del k
    k = "now our class object is a string"
    print(k)
    del k
    gc.collect()
With this program you can save directly as many figures as you want without the plt.show() command, and the memory consumption will stay low.
