Unpickled object causes segfault when called; whose fault?

I am creating an interpolation object with the "tri" module of matplotlib and would like to pickle it, because in the actual application it takes a long time to generate. Unfortunately, when I call the unpickled object, Python 2.7 crashes with a segfault.
I would like to know three things: 1) How can I pickle this LinearTriInterpolator object successfully? 2) Is the segfault due to my ignorance, a problem in matplotlib, or a problem in pickle? 3) What is causing it?
I have created a simple test case; the first interpolation call returns 1.0, and the second, using the unpickled object, causes a segfault. cPickle shows the same behavior.
from pylab import *
from numpy import *
import cPickle as pickle
import matplotlib.tri as tri
#make points
x=array([0.0,1.0,1.0,0.0])
y=array([0.0,0.0,1.0,1.0])
z=x+y
#make triangulation
triPnts=tri.Triangulation(x,y)
theInterper=tri.LinearTriInterpolator(triPnts,z)
#test interpolator
print 'Iterped value is ',theInterper([0.5],[0.5])
#now pickle and unpickle interper
pickle.dump(theInterper,open('testPickle.pckl','wb'),-1)
#load pickle
unpickled_Interper=pickle.load(open('testPickle.pckl','rb'))
#and test
print 'Iterped value is ',unpickled_Interper([0.5],[0.5])
My python is:
Enthought Canopy Python 2.7.6 | 64-bit | (default, Sep 15 2014,
17:43:19) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
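One workaround worth trying (a sketch continuing the test script above, not a confirmed fix, and it assumes rebuilding the interpolator after loading is acceptable): pickle only plain arrays, including the triangulation's already-computed triangle indices, and reconstruct the LinearTriInterpolator after loading instead of pickling it directly.
#pickle the raw data rather than the interpolator itself
pickle.dump((x, y, z, triPnts.triangles),
            open('testData.pckl', 'wb'), -1)
#rebuild on load, reusing the cached triangle indices
x2, y2, z2, triangles = pickle.load(open('testData.pckl', 'rb'))
rebuiltTri = tri.Triangulation(x2, y2, triangles)
rebuiltInterper = tri.LinearTriInterpolator(rebuiltTri, z2)
print 'Iterped value is ', rebuiltInterper([0.5], [0.5])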


RecursionError: maximum recursion depth exceeded on integrate, sympy 1.1.1

Bug entered at https://github.com/sympy/sympy/issues/14877
Is this a known issue or a new bug? I will report it if it is new.
What could be causing it?
>which python
/opt/anaconda/bin/python
>pip list | grep sympy
sympy 1.1.1
>python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
from sympy import *
x=symbols('x');
integrate(exp(1-exp(x**2)*x+2*x**2)*(2*x**3+x)/(1-exp(x**2)*x)**2,x)
gives
.....
File "/opt/anaconda/lib/python3.6/site-packages/sympy/core/mul.py", line 1067, in <genexpr>
a.is_commutative for a in self.args)
RecursionError: maximum recursion depth exceeded
>>>
By the way, the antiderivative should be
-exp(1-exp(x^2)*x)/(-1+exp(x^2)*x)
It is a known issue that SymPy fails to integrate many functions. This particular function probably wasn't reported yet, so by all means, add it to the ever-growing list.
SymPy tries several integration approaches. One of them, called "manual integration", is highly recursive: a substitution or integration by parts is attempted, and then the process is repeated for the resulting integral.
In this specific case, the expression has several subexpressions that look like candidates for substitution: x**2, the denominator, the argument of the other exponential. SymPy goes down an endless chain of substitutions that leads not to a solution but to a stack overflow. There is no pattern implemented in integrate that would tell SymPy to make the crucial substitution u = 1 - x*exp(x**2).
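As a sanity check on the antiderivative quoted in the question: SymPy cannot find it, but it can verify it by differentiating and simplifying (a quick sketch):
from sympy import symbols, exp, diff, simplify

x = symbols('x')
integrand = exp(1 - exp(x**2)*x + 2*x**2)*(2*x**3 + x)/(1 - exp(x**2)*x)**2
candidate = -exp(1 - exp(x**2)*x)/(-1 + exp(x**2)*x)

# should print 0 if the candidate antiderivative is correct
print(simplify(diff(candidate, x) - integrand))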
There is a separate, experimental integrator called RUBI, which could be used with
from sympy.integrals.rubi.rubi import rubi_integrate
rubi_integrate(exp(1-exp(x**2)*x+2*x**2)*(2*x**3+x)/(1-exp(x**2)*x)**2, x)
but it relies on MatchPy which I don't have installed, so I can't tell if it would help here.

How to make a Python multiprocessing pool memory efficient?

I am trying to process a large amount of text with multiprocessing.
import pandas as pd
from multiprocessing import Pool
from itertools import repeat
# ...import data...
type(train)
type(train[0])
output:
pandas.core.series.Series
str
I need to load a really large file (1.2 GB) that my function uses; it is stored in an object named Gmodel.
My function takes four parameters, one of which is a model object like Gmodel:
my_func(text, model, param2, param3)
Then I use multiprocessing.Pool:
from functools import partial
# make the multi-parameter function suitable for map,
# which takes single-parameter functions
partial_my_func = partial(my_func, model=Gmodel, param2=100, param3=True)
if __name__ == '__main__':
    p = Pool(processes=10, maxtasksperchild=200)
    train_out = p.map(partial_my_func, train)
When I run the last three lines and check htop in my terminal, I see several processes with VIRT and RES over 20 GB. I am using a shared server and I'm not allowed to use that much memory. Is there a way for me to cut memory usage here?
System information:
3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 12:22:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
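One common way to cut the serialization and memory overhead (a sketch, not from the original post, assuming the my_func and train from the question and a hypothetical load_model helper standing in for however Gmodel is built): make the model a module-level global that the forked workers inherit, instead of binding it into partial, which has to be pickled and sent along with the tasks.
from multiprocessing import Pool

# Load the big model once at module level, before the Pool is created.
# On Linux the worker processes are forked and inherit this object
# copy-on-write, so it is not serialized and re-sent with the tasks.
Gmodel = load_model('/path/to/model')   # hypothetical loader

def process_one(text):
    # uses the inherited module-level Gmodel instead of a per-task copy
    return my_func(text, model=Gmodel, param2=100, param3=True)

if __name__ == '__main__':
    p = Pool(processes=10, maxtasksperchild=200)
    train_out = p.map(process_one, train)
    p.close()
    p.join()
If memory is still tight, lowering processes reduces the total footprint directly, since each worker that touches the model may still fault in its own copies of some pages.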

How to keep memory consumption at bay when using scipy.stats.rv_continuous?

I am using scipy.stats.rv_continuous (v0.19.0) in order to create random values from a custom probability distribution. The code I am using looks as follows (using a Gaussian for debugging purposes):
from scipy.stats import rv_continuous
import numpy as np
import resource
import scipy
import sys
print "numpy version: {}".format(np.version.full_version)
print "Scipy version: {}".format(scipy.version.full_version)
print "Python {}".format(sys.version)
class gaussian_gen(rv_continuous):
    "Gaussian distribution"
    def _pdf(self, x):
        return np.exp(-x**2 / 2.) / np.sqrt(2.0 * np.pi)

def print_mem():
    mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print 'Memory usage: %s (kb)' % mem
print_mem()
gaussian = gaussian_gen(name='gaussian')
print_mem()
values = gaussian.rvs(size=1000)
print_mem()
values = gaussian.rvs(size=5000)
print_mem()
Which outputs:
numpy version: 1.12.0
Scipy version: 0.19.0
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609]
Memory usage: 69672 (kb)
Memory usage: 69672 (kb)
Memory usage: 426952 (kb)
Memory usage: 2215576 (kb)
As you can see, the memory consumption of this snippet seems really unreasonable. I have found this question, but it is slightly different since I am not creating a new class instance in a loop.
I thought I was using rv_continuous correctly, but I cannot see why it would consume such enormous amounts of memory. How do I use it correctly? Ideally, I would like a solution that I can actually call in a loop, but one step at a time.
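One thing worth trying (a sketch, not a confirmed fix for the memory growth): give the subclass an analytic _ppf. By default rvs() draws uniform samples and pushes them through _ppf, and the generic _ppf numerically inverts a CDF that is itself obtained by numerically integrating _pdf, so supplying the inverse CDF directly removes those generic numeric layers. For the Gaussian it is known in closed form:
from scipy.stats import rv_continuous
from scipy.special import erfinv
import numpy as np

class gaussian_gen(rv_continuous):
    "Gaussian distribution with an explicit inverse CDF"
    def _pdf(self, x):
        return np.exp(-x**2 / 2.) / np.sqrt(2.0 * np.pi)
    def _ppf(self, q):
        # analytic inverse CDF of the standard normal
        return np.sqrt(2.0) * erfinv(2.0 * q - 1.0)

gaussian = gaussian_gen(name='gaussian')
values = gaussian.rvs(size=5000)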

Python pickle calls cPickle?

I am new to Python. I am adapting someone else's code from Python 2.x to 3.5. The code loads a file via cPickle. I changed all occurrences of "cPickle" to "pickle", as I understand pickle superseded cPickle in 3.5. I get this execution error:
NameError: name 'cPickle' is not defined
Pertinent code:
import pickle
import gzip
...
def load_data():
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = pickle.load(f, fix_imports=True)
    f.close()
    return (training_data, validation_data, test_data)
The error occurs on the pickle.load line when load_data() is called by another function. However, a) neither cPickle nor cpickle appears in any source file anywhere in the project (searched globally), and b) the error does not occur if I run the lines inside load_data() individually in the Python shell (though I do get another data-format error). Is pickle calling cPickle, and if so, how do I stop it?
Shell:
Python 3.5.0 |Anaconda 2.4.0 (x86_64)| (default, Oct 20 2015, 14:39:26)
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
IDE: IntelliJ 15.0.1, Python 3.5.0, anaconda
Unclear how to proceed. Any help appreciated. Thanks.
Actually, if you have pickled objects from Python 2.x, they generally can be read by Python 3.x. Also, if you have pickled objects from Python 3.x, they generally can be read by Python 2.x, but only if they were dumped with a protocol set to 2 or less.
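For the Python 3 to Python 2 direction, that means being explicit about the protocol when dumping; a minimal illustration (not part of the original answer):
import pickle

with open('foo3.pik', 'wb') as f:
    # protocol 2 is the highest protocol Python 2 can read
    pickle.dump([1, 2, 3, 4, 5], f, protocol=2)
The sessions below then show the Python 2 to Python 3 direction with the default protocol.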
Python 2.7.10 (default, Sep 2 2015, 17:36:25)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> x = [1,2,3,4,5]
>>> import math
>>> y = math.sin
>>>
>>> import pickle
>>> f = open('foo.pik', 'w')
>>> pickle.dump(x, f)
>>> pickle.dump(y, f)
>>> f.close()
>>>
dude#hilbert>$ python3.5
Python 3.5.0 (default, Sep 15 2015, 23:57:10)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> with open('foo.pik', 'rb') as f:
... x = pickle.load(f)
... y = pickle.load(f)
...
>>> x
[1, 2, 3, 4, 5]
>>> y
<built-in function sin>
Also, if you are looking for cPickle, it's now _pickle, not pickle.
>>> import _pickle
>>> _pickle
<module '_pickle' from '/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/lib-dynload/_pickle.cpython-35m-darwin.so'>
>>>
You also asked how to stop pickle from using the built-in (C) version. You can do this by using _dump and _load, or the _Pickler class if you like to work with the class objects. Confused? The old cPickle is now _pickle; however, dump, load, dumps, and loads all point to _pickle, while _dump, _load, _dumps, and _loads point to the pure Python version. For instance:
>>> import pickle
>>> # _dumps is a python function
>>> pickle._dumps
<function _dumps at 0x109c836a8>
>>> # dumps is a built-in (C)
>>> pickle.dumps
<built-in function dumps>
>>> # the Pickler points to _pickle (C)
>>> pickle.Pickler
<class '_pickle.Pickler'>
>>> # the _Pickler points to pickle (pure python)
>>> pickle._Pickler
<class 'pickle._Pickler'>
>>>
So if you don't want to use the built-in version, then you can use pickle._loads and the like.
It's looking like the pickled data that you're trying to load was generated by a version of the program that was running on Python 2.7. The data is what contains the references to cPickle.
The problem is that Pickle, as a serialization format, assumes that your standard library (and to a lesser extent your code) won't change layout between serialization and deserialization. Which it did -- a lot -- between Python 2 and 3. And when that happens, Pickle has no path for migration.
Do you have access to the program that generated mnist.pkl.gz? If so, port it to Python 3 and re-run it to regenerate a Python 3-compatible version of the file.
If not, you'll have to write a Python 2 program that loads that file and exports it to a format that can be loaded from Python 3 (depending on the shape of your data, JSON and CSV are popular choices), then write a Python 3 program that loads that format then dumps it as Python 3 pickle. You can then load that Pickle file from your original program.
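For example, the Python 2 half of such a bridge might look like this (a sketch only: the real mnist.pkl.gz holds tuples of numpy arrays, so the exact conversion depends on your data, and mnist_export.json is just a placeholder name):
# run under Python 2: load the old pickle and export it to JSON
import gzip
import json
import cPickle as pickle

with gzip.open('../data/mnist.pkl.gz', 'rb') as f:
    training_data, validation_data, test_data = pickle.load(f)

with open('mnist_export.json', 'w') as out:
    # numpy arrays are not JSON-serializable, so convert them to plain lists
    json.dump([[a.tolist() for a in part]
               for part in (training_data, validation_data, test_data)], out)
A Python 3 script can then json.load that file, rebuild its arrays, and re-dump them with pickle, after which the original load_data() can read the result.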
Of course, what you should really do is stop at the point where you have the ability to load the exported format from Python 3 -- and use the aforementioned format as your actual, long-term storage format.
Using Pickle for anything other than short-term serialization between trusted programs (loading Pickle is equivalent to running arbitrary code in your Python VM) is something you should actively avoid, among other things because of the exact case you find yourself in.
In Anaconda Python 3.5, one can access cPickle as
import _pickle as cPickle
Credit to Mike McKerns.
This bypasses the technical issues, but there might be a Python 3 version of that file named mnist_py3k.pkl.gz. If so, try opening that file instead.
There is code on GitHub that does it: https://gist.github.com/rebeccabilbro/2c7bb4d1acfbcdcf9156e7b9b7577cba
I have tried it and it worked. You just need to specify the encoding; in this case it is 'latin1':
pickle.load(open('mnist.pkl','rb'), encoding = 'latin1')

OpenCV and Numpy interacting badly

Can anyone explain why importing cv and numpy would change the behaviour of python's struct.unpack? Here's what I observe:
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from struct import pack, unpack
>>> unpack("f",pack("I",31))[0]
4.344025239406933e-44
This is correct
>>> import cv
libdc1394 error: Failed to initialize libdc1394
>>> unpack("f",pack("I",31))[0]
4.344025239406933e-44
Still ok, after importing cv
>>> import numpy
>>> unpack("f",pack("I",31))[0]
4.344025239406933e-44
And OK after importing cv and then numpy
Now I restart python:
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from struct import pack, unpack
>>> unpack("f",pack("I",31))[0]
4.344025239406933e-44
>>> import numpy
>>> unpack("f",pack("I",31))[0]
4.344025239406933e-44
So far so good, but now I import cv AFTER importing numpy:
>>> import cv
libdc1394 error: Failed to initialize libdc1394
>>> unpack("f",pack("I",31))[0]
0.0
I've repeated this a number of times, including on multiple servers, and it always goes the same way. I've also tried it with struct.unpack and struct.pack, which also makes no difference.
I can't understand how importing numpy and cv could have any impact at all on the output of struct.unpack (pack remains the same, btw).
The "libdc1394" thing is, I believe, a red-herring: ctypes error: libdc1394 error: Failed to initialize libdc1394
Any ideas?
tl;dr: importing numpy and then opencv changes the behaviour of struct.unpack.
UPDATE: Paulo's answer below shows that this is reproducible. Seborg's comment suggests that it's something to do with the way python handles subnormals, which sounds plausible. I looked into Contexts but that didn't seem to be the problem, as the context was the same after the imports as it had been before them.
This isn't an answer, but it's too big for a comment. I played with the values a bit to find the limits.
Without loading numpy and cv:
>>> unpack("f", pack("i", 8388608))
(1.1754943508222875e-38,)
>>> unpack("f", pack("i", 8388607))
(1.1754942106924411e-38,)
After loading numpy and cv, the first line is the same, but the second:
>>> unpack("f", pack("i", 8388607))
(0.0,)
You'll notice that the first result is the lower limit for 32-bit floats. I then tried the same with the "d" (double) format.
Without loading the libraries:
>>> unpack("d", pack("xi", 1048576))
(2.2250738585072014e-308,)
>>> unpack("d", pack("xi", 1048575))
(2.2250717365114104e-308,)
And after loading the libraries:
>>> unpack("d",pack("xi", 1048575))
(0.0,)
Now the first result is the lower limit for 64 bit float precision.
It seems that, for some reason, loading the numpy and cv libraries in that order limits unpack to the normal 32- and 64-bit float ranges and makes it return 0 for smaller values.
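Those two limits are exactly the smallest normal float32 and float64 values, which fits seborg's subnormal theory mentioned in the question: everything below them is subnormal, and after the imports the subnormals come back as 0.0. The thresholds can be checked with numpy itself:
import numpy as np

print(np.finfo(np.float32).tiny)   # == 1.1754943508222875e-38 (2**-126), smallest normal float32
print(np.finfo(np.float64).tiny)   # == 2.2250738585072014e-308 (2**-1022), smallest normal float64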
