I am trying to process a large amount of text with multiprocessing.
import pandas as pd
from multiprocessing import Pool
from itertools import repeat
# ...import data...
type(train)
type(train[0])
output:
pandas.core.series.Series
str
I need to import a really large file (1.2GB) to use for my function. It is stored in an object named Gmodel.
My function takes 4 parameters, one of them being an object like Gmodel:
my_func(text, model=Gmodel, param2, param3)
Then I use the multiprocessing.Pool function:
from functools import partial
# make the multi-parameter function suitable for map
# which takes in single-parameter functions
partial_my_func = partial(my_func, model=Gmodel, param2=100, param3=True)
if __name__ == '__main__':
    p = Pool(processes=10, maxtasksperchild=200)
    train_out = p.map(partial_my_func, train)
When I run the last 3 lines and execute htop in my terminal, I see several processes with VIRT and RES over 20G. I am using a shared server and I'm not allowed to use that much memory. Is there a way for me to cut memory usage here?
System information:
3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 12:22:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
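One approach that may reduce the duplication (a sketch, not part of the original post, assuming the platform uses fork and the model is read-only in the workers): keep Gmodel in a module-level global instead of binding it into partial(), so the child processes share the parent's memory pages copy-on-write rather than each receiving its own pickled copy of the 1.2GB object. load_model below is a hypothetical placeholder for however Gmodel is actually loaded; my_func and train are the names from the question.
from multiprocessing import Pool

# Hypothetical loader: load the 1.2GB model once, at module level, before the Pool exists.
Gmodel = load_model('/path/to/model')

def worker(text):
    # The model is read from the module-level global instead of being
    # shipped to each worker through partial().
    return my_func(text, model=Gmodel, param2=100, param3=True)

if __name__ == '__main__':
    p = Pool(processes=10, maxtasksperchild=200)
    train_out = p.map(worker, train)
    p.close()
    p.join()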
I am using scipy.stats.rv_continuous (v0.19.0) in order to create random values from a custom probability distribution. The code I am using looks as follows (using a Gaussian for debugging purposes):
from scipy.stats import rv_continuous
import numpy as np
import resource
import scipy
import sys
print "numpy version: {}".format(np.version.full_version)
print "Scipy version: {}".format(scipy.version.full_version)
print "Python {}".format(sys.version)
class gaussian_gen(rv_continuous):
    "Gaussian distribution"
    def _pdf(self, x):
        return np.exp(-x**2 / 2.) / np.sqrt(2.0 * np.pi)

def print_mem():
    mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print 'Memory usage: %s (kb)' % mem
print_mem()
gaussian = gaussian_gen(name='gaussian')
print_mem()
values = gaussian.rvs(size=1000)
print_mem()
values = gaussian.rvs(size=5000)
print_mem()
Which outputs:
numpy version: 1.12.0
Scipy version: 0.19.0
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609]
Memory usage: 69672 (kb)
Memory usage: 69672 (kb)
Memory usage: 426952 (kb)
Memory usage: 2215576 (kb)
As you can see, the memory consumption of this snippet seems really unreasonable. I have found this question, but it is slightly different since I am not creating a new class instance in a loop.
I thought I was using rv_continuous correctly, but I cannot see why it consumes such enormous amounts of memory. How do I use it correctly? Ideally, I would like a solution that I can actually call in a loop, but one step at a time.
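One thing worth trying (a sketch of my own, not from the post above): the generic rvs() machinery numerically inverts a CDF it builds from _pdf, which is an expensive path; supplying an analytic _ppf (here via scipy.special.ndtri for the Gaussian used for debugging) lets rvs() sample directly and sidesteps that generic machinery.
from scipy.stats import rv_continuous
from scipy.special import ndtri  # inverse of the standard normal CDF
import numpy as np

class gaussian_gen(rv_continuous):
    "Gaussian distribution with an explicit inverse CDF"
    def _pdf(self, x):
        return np.exp(-x**2 / 2.) / np.sqrt(2.0 * np.pi)
    def _ppf(self, q):
        return ndtri(q)

gaussian = gaussian_gen(name='gaussian')
values = gaussian.rvs(size=5000)  # sampled via _ppf, no generic numeric inversion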
I am creating an interpolation object with the "tri" module of matplotlib and would like to pickle it, because in the actual application it takes a long time to generate. Unfortunately, when I call the unpickled object, Python 2.7 crashes with a segfault.
I would like to know three things: 1) How do I pickle this LinearTriInterpolator object successfully? 2) Is the segfault due to my ignorance, a problem in matplotlib, or a problem in pickle? 3) What is causing it?
I have created a simple test: when called, the first interpolation returns 1.0, and the second, which uses the unpickled object, causes a segfault. cPickle shows the same behavior.
from pylab import *
from numpy import *
import cPickle as pickle
import matplotlib.tri as tri
#make points
x=array([0.0,1.0,1.0,0.0])
y=array([0.0,0.0,1.0,1.0])
z=x+y
#make triangulation
triPnts=tri.Triangulation(x,y)
theInterper=tri.LinearTriInterpolator(triPnts,z)
#test interpolator
print 'Iterped value is ',theInterper([0.5],[0.5])
#now pickle and unpickle interper
pickle.dump(theInterper,open('testPickle.pckl','wb'),-1)
#load pickle
unpickled_Interper=pickle.load(open('testPickle.pckl','rb'))
#and test
print 'Iterped value is ',unpickled_Interper([0.5],[0.5])
My python is:
Enthought Canopy Python 2.7.6 | 64-bit | (default, Sep 15 2014,
17:43:19) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
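A workaround sketch (my own, not from the post): rather than pickling the interpolator itself, pickle only the raw arrays and rebuild the LinearTriInterpolator after loading, so nothing depends on whatever internal state fails to survive the round trip. The file name is hypothetical.
import cPickle as pickle
import numpy as np
import matplotlib.tri as tri

x = np.array([0.0, 1.0, 1.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
z = x + y

# pickle only the plain arrays
with open('triData.pckl', 'wb') as f:
    pickle.dump((x, y, z), f, -1)

# rebuild the triangulation and interpolator after unpickling
with open('triData.pckl', 'rb') as f:
    x2, y2, z2 = pickle.load(f)
theInterper = tri.LinearTriInterpolator(tri.Triangulation(x2, y2), z2)
print 'Interped value is ', theInterper([0.5], [0.5])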
I am new to Python. I am adapting someone else's code from Python 2.x to 3.5. The code loads a file via cPickle. I changed all "cPickle" occurrences to "pickle", as I understand pickle superseded cPickle in 3.5. I get this execution error:
NameError: name 'cPickle' is not defined
Pertinent code:
import pickle
import gzip
...
def load_data():
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = pickle.load(f, fix_imports=True)
    f.close()
    return (training_data, validation_data, test_data)
The error occurs on the pickle.load line when load_data() is called by another function. However, a) neither cPickle nor cpickle appears anywhere in the project's source files any more (searched globally), and b) the error does not occur if I run the lines inside load_data() individually in the Python shell (although I then get a different data-format error). Is pickle calling cPickle, and if so, how do I stop it?
Shell:
Python 3.5.0 |Anaconda 2.4.0 (x86_64)| (default, Oct 20 2015, 14:39:26)
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
IDE: IntelliJ 15.0.1, Python 3.5.0, anaconda
Unclear how to proceed. Any help appreciated. Thanks.
Actually, if you have pickled objects from Python 2.x, they can generally be read by Python 3.x. Also, if you have pickled objects from Python 3.x, they can generally be read by Python 2.x, but only if they were dumped with protocol 2 or lower.
Python 2.7.10 (default, Sep 2 2015, 17:36:25)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> x = [1,2,3,4,5]
>>> import math
>>> y = math.sin
>>>
>>> import pickle
>>> f = open('foo.pik', 'w')
>>> pickle.dump(x, f)
>>> pickle.dump(y, f)
>>> f.close()
>>>
dude#hilbert>$ python3.5
Python 3.5.0 (default, Sep 15 2015, 23:57:10)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> with open('foo.pik', 'rb') as f:
... x = pickle.load(f)
... y = pickle.load(f)
...
>>> x
[1, 2, 3, 4, 5]
>>> y
<built-in function sin>
Also, if you are looking for cPickle, it's now _pickle, not pickle.
>>> import _pickle
>>> _pickle
<module '_pickle' from '/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/lib-dynload/_pickle.cpython-35m-darwin.so'>
>>>
You also asked how to stop pickle from using the built-in (C) version. You can do this by using _dump and _load, or the _Pickler class if you like to work with class objects. Confused? The old cPickle is now _pickle; however, dump, load, dumps, and loads all point to _pickle, while _dump, _load, _dumps, and _loads point to the pure-Python version. For instance:
>>> import pickle
>>> # _dumps is a python function
>>> pickle._dumps
<function _dumps at 0x109c836a8>
>>> # dumps is a built-in (C)
>>> pickle.dumps
<built-in function dumps>
>>> # the Pickler points to _pickle (C)
>>> pickle.Pickler
<class '_pickle.Pickler'>
>>> # the _Pickler points to pickle (pure python)
>>> pickle._Pickler
<class 'pickle._Pickler'>
>>>
So if you don't want to use the built-in version, then you can use pickle._loads and the like.
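For example, a quick round trip through the pure-Python functions:
>>> import pickle
>>> data = pickle._dumps([1, 2, 3])      # pure-Python pickler
>>> pickle._loads(data)                  # pure-Python unpickler
[1, 2, 3]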
It's looking like the pickled data that you're trying to load was generated by a version of the program that was running on Python 2.7. The data is what contains the references to cPickle.
The problem is that Pickle, as a serialization format, assumes that your standard library (and to a lesser extent your code) won't change layout between serialization and deserialization. Which it did -- a lot -- between Python 2 and 3. And when that happens, Pickle has no path for migration.
Do you have access to the program that generated mnist.pkl.gz? If so, port it to Python 3 and re-run it to regenerate a Python 3-compatible version of the file.
If not, you'll have to write a Python 2 program that loads that file and exports it to a format that can be loaded from Python 3 (depending on the shape of your data, JSON and CSV are popular choices), then write a Python 3 program that loads that format then dumps it as Python 3 pickle. You can then load that Pickle file from your original program.
Of course, what you should really do is stop at the point where you have the ability to load the exported format from Python 3 -- and use that format as your actual, long-term storage format.
Using Pickle for anything other than short-term serialization between trusted programs (loading Pickle is equivalent to running arbitrary code in your Python VM) is something you should actively avoid, among other things because of the exact case you find yourself in.
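As a rough illustration of that migration path (a sketch with hypothetical file names; it assumes the MNIST splits are pairs of numpy arrays, and uses NumPy's .npz container rather than JSON/CSV since the data is numeric):
# Python 2 side: load the legacy pickle and re-export it in a version-agnostic format
import cPickle as pickle
import gzip
import numpy as np

with gzip.open('../data/mnist.pkl.gz', 'rb') as f:
    train, valid, test = pickle.load(f)
np.savez_compressed('../data/mnist_export.npz',
                    train_x=train[0], train_y=train[1],
                    valid_x=valid[0], valid_y=valid[1],
                    test_x=test[0], test_y=test[1])

# Python 3 side: no pickle involved, so no compatibility issues
import numpy as np
d = np.load('../data/mnist_export.npz')
training_data = (d['train_x'], d['train_y'])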
In Anaconda Python3.5 :
one can access cPickle as
import _pickle as cPickle
credits to Mike McKerns
This bypasses the technical issues, but there might be a Python 3 version of that file named mnist_py3k.pkl.gz. If so, try opening that file instead.
There is code on GitHub that does it: https://gist.github.com/rebeccabilbro/2c7bb4d1acfbcdcf9156e7b9b7577cba
I have tried it and it worked. You just need to specify the encoding, which in this case is 'latin1':
pickle.load(open('mnist.pkl','rb'), encoding = 'latin1')
I've got this very peculiar hang happening on my machine when using Python's multiprocessing Pool with numpy and PySide imported. This is the most entangled bug I have seen in my life so far :) The following code:
import numpy as np
import PySide
def hang():
    import multiprocessing
    pool = multiprocessing.Pool(processes=1)
    pool.map(f, [None])

def f(ignore):
    print('before dot..')
    np.dot(np.zeros((128, 1)), np.zeros((1, 32)))
    print('after dot.')

if __name__ == "__main__":
    hang()
    print('success!')
hangs, printing only 'before dot..'. But it is supposed to print
before dot..
after dot.
success!
I'm no gdb expert, but it looks like gdb shows that the process exits (or crashes) on the 'np.dot' line:
[Inferior 1 (process 2884) exited normally]
There are several magical modifications I can make to prevent the hang:
if you decrease the shape of the arrays going into 'dot' (e.g. from 128 to 127)
(!) if you increase the shape of the arrays going into 'dot' from 128 to 256
if you do not use multiprocessing and just run the function 'f'
(!!!) if you comment out the PySide import, which is not used anywhere in the code
Any help is appreciated!
Packages version:
numpy 1.8.1 or 1.7.1, PySide 1.2.1 or 1.2.2
Python version:
Python 2.7.5 (default, Sep 12 2013, 21:33:34)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
or
Python 2.7.6 (default, Apr 9 2014, 11:48:52)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.38)] on darwin
Notice: While hunting for information, I simplified the original code and the question a bit. But here is a stack of updates to keep the history for others who may encounter this bug (e.g. I started with matplotlib, not with PySide).
Update: I narrowed the pylab import down to importing matplotlib with the PySide backend and updated the code to run.
Update: I'm modifying the post to import only PySide instead of:
import matplotlib
matplotlib.use('qt4agg')
matplotlib.rcParams['backend.qt4']='PySide'
import matplotlib.pyplot
Update: Initial statistics show that it is a Mac-only issue: 3 people have it working on Ubuntu, 2 people got it hanging on Mac.
Update: print(os.getpid()) before the dot operation gives me a pid that I don't see in 'top', which apparently means the child crashes and multiprocessing waits for a dead process. For this reason I cannot attach a debugger to it. I edited the main question accordingly.
This is a general issue with some of the BLAS libraries numpy uses for dot.
Apple Accelerate and OpenBLAS built with GNU OpenMP are known not to be safe to use on both sides of a fork (the parent and the child processes that multiprocessing creates). They will deadlock.
This cannot be fixed in numpy, but there are three workarounds:
use netlib BLAS, ATLAS, or a git-master OpenBLAS built on pthreads (2.8.0 does not work)
use Python 3.4 and its new multiprocessing spawn or forkserver start methods (a sketch follows this list)
use threading instead of multiprocessing; numpy releases the GIL for its most expensive operations, so you can achieve decent threading speedups on typical desktop machines
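A minimal sketch of the second workaround, using Python 3.4+'s spawn start method so each worker gets a fresh interpreter instead of a forked copy of the parent:
import multiprocessing as mp
import numpy as np

def f(_):
    return np.dot(np.zeros((128, 1)), np.zeros((1, 32))).shape

if __name__ == "__main__":
    ctx = mp.get_context('spawn')        # or 'forkserver'
    with ctx.Pool(processes=1) as pool:  # BLAS never runs in a forked child
        print(pool.map(f, [None]))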
I believe this to be an issue with the multiprocessing module.
Try using the following instead.
import numpy as np
import PySide
def hang():
    import multiprocessing.dummy as multiprocessing
    pool = multiprocessing.Pool(processes=1)
    pool.map(f, [None])

def f(ignore):
    print('before dot..')
    np.dot(np.zeros((128, 1)), np.zeros((1, 32)))
    print('after dot.')

if __name__ == "__main__":
    hang()
    print('success!')
I ran into this exact problem. There was a deadlock when the child process used numpy.dot, but it ran when I reduced the size of the matrix. So instead of a dot product on a matrix with 156000 floats, I performed 3 dot products of 52000 each and concatenated the results. I'm not sure what the maximum size is, or whether it depends on the number of child processes, available memory, or other factors, but if the largest matrix that does not deadlock can be identified by trial and error, then the following code should help.
def get_batch(X, update_iter, batchsize):
    curr_ptr = update_iter * batchsize
    if X.shape[0] - curr_ptr <= batchsize:
        X_i = X[curr_ptr:, :]
    else:
        X_i = X[curr_ptr:curr_ptr + batchsize, :]
    return X_i

def batch_dot(X, w, batchsize):
    y = np.zeros((1,))
    num_batches = X.shape[0] // batchsize   # integer division so range() below gets an int
    if X.shape[0] % batchsize != 0:
        num_batches += 1
    for batch_iter in range(0, num_batches):
        X_batch = get_batch(X, batch_iter, batchsize)
        y_batch = X_batch.dot(w)
        y = np.hstack((y, y_batch))
    return y[1:]
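Hypothetical usage with the sizes mentioned above (shapes are made up; w must match X's column count):
import numpy as np

X = np.random.rand(156000, 10)
w = np.random.rand(10)
y = batch_dot(X, w, 52000)   # three dot products of 52000 rows each, concatenated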
The code flow is something like this:
import threading

result = []

def Discover(myList=[]):
    for item in myList:
        t = threading.Thread(target=myFunc, args=[item])   # note: 'args', not 'Args'
        t.start()

def myFunc(item):
    result.append(item + item)
Now this will start multiple threads, and in the current scenario the threads do some memory-intensive tasks. I therefore want to include a semaphore so that myList behaves like a queue and the number of threads stays within a limited size. What is the best way to do that?
Never use mutable objects as default parameter values in a function definition. In your case: def Discover(myList=[]).
Use Queue.Queue instead of a list to provide myList if it's necessary to update the list of "tasks" while threads are running. Or use multiprocessing.pool.ThreadPool in order to limit the number of threads running at the same time (see the sketch after the interpreter session below).
Use Queue.Queue instead of a list for the result variable. The list implementation is not thread-safe, so you will probably run into problems with it.
You can find some examples in other SO questions, e.g. here.
P.S. ThreadPool is available in Python 2.7+:
$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from multiprocessing.pool import ThreadPool
>>> ThreadPool
<class 'multiprocessing.pool.ThreadPool'>
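A minimal sketch of the ThreadPool suggestion, adapted to the question's Discover/myFunc names; the pool size caps how many threads run at once:
from multiprocessing.pool import ThreadPool

def myFunc(item):
    return item + item

def Discover(myList, num_threads=4):
    pool = ThreadPool(processes=num_threads)   # at most num_threads run concurrently
    try:
        return pool.map(myFunc, myList)        # map collects results, replacing the shared list
    finally:
        pool.close()
        pool.join()

result = Discover([1, 2, 3, 4, 5])             # [2, 4, 6, 8, 10]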