I'm trying to understand the Python 2 library remotely, which helps to run code remotely through XML-RPC.
On the client, the author dumps the function's code object with marshal and loads the result sent back from the server with pickle:
def run(self, func, *args, **kwds):
    code_str = base64.b64encode(marshal.dumps(func.func_code))
    output = self.proxy.run(self.api_key, self.a_sync, code_str, *args, **kwds)
    return pickle.loads(base64.b64decode(output))
And on the server side, he does it the other way around:
def run(self, api_key, a_sync, func_str, *args, **kwds):
    # ... truncated code
    code = marshal.loads(base64.b64decode(func_str))
    func = types.FunctionType(code, globals(), "remote_func")
    # ... truncated code
    output = func(*args, **kwds)
    output = base64.b64encode(pickle.dumps(output))
    return output
What's the purpose of dumping with marshal and loading the result with pickle? (and vice-versa)
The object that is getting sent first using marshal is of a very specific type: it's a code object, and only that type needs to be supported. That is exactly the type the marshal module is designed to handle. The return value, on the other hand, can be of any type; it's determined by what the func function returns. The pickle module has a much more general protocol that can serialize many different types of objects, so there's a decent chance it will support the return value.
You couldn't simply use pickle for both data items, because pickle can't serialize code objects out of the box: in Python 2, pickle.dumps(func.func_code) raises TypeError: can't pickle code objects. The marshal module, by contrast, exists precisely to serialize code objects (it's the same format used for .pyc files), and its output is compact.
So to summarize, the marshal module is used to send the code because only code objects need to be sent and it's a lower-level serialization designed for exactly that type. The return value is sent back using pickle because the program can't predict what type of object it will be, and pickle can serialize far more kinds of values than marshal can, at the cost of some additional complexity and (sometimes) a larger serialization.
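To make the division of labor concrete, here is a minimal, self-contained Python 2 sketch of the same round trip (no XML-RPC or base64; the square function and variable names are made up for illustration):

import marshal
import pickle
import types

def square(x):
    return x * x

# "Client": marshal only the code object.
code_str = marshal.dumps(square.func_code)

# "Server": rebuild a callable from the bytecode and run it,
# then pickle the arbitrary return value.
code = marshal.loads(code_str)
func = types.FunctionType(code, globals(), "remote_func")
result_str = pickle.dumps(func(7))

# "Client": unpickle the result.
print pickle.loads(result_str)  # 49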
I'm working with python-gnupg to decrypt a file, and the decrypted file content is very large, so loading the entire contents into memory is not feasible.
I would like to short-circuit the write method in order to manipulate the decrypted contents as they are written.
Here are some failed attempts:
import gnupg
from StringIO import StringIO

# works but not feasible due to memory limitations
decrypted_data = gpg_client.decrypt_file(encrypted_file)

# works but no access to the buffer write method
gpg_client.decrypt_file(encrypted_file, output=buffer())

# fails with TypeError: coercing to Unicode: need string or buffer, instance found
class TestBuffer:
    def __init__(self):
        self.buffer = StringIO()

    def write(self, data):
        print('writing')
        self.buffer.write(data)

gpg_client.decrypt_file(encrypted_file, output=TestBuffer())
Can anyone think of any other ideas that would allow me to create a file-like str or buffer object to output the data to?
You can implement a subclass of one of the classes in the io module described in I/O Base Classes, presumably io.BufferedIOBase. The standard library contains an example of something quite similar in the form of the zipfile.ZipExtFile class. At least this way, you won't have to implement complex functions like readline yourself.
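As a rough illustration, a minimal write-only subclass might look like this (ChunkProcessor and process_chunk are made-up names, and whether decrypt_file's output parameter accepts a file-like object depends on your python-gnupg version):

import io

class ChunkProcessor(io.BufferedIOBase):
    """Write-only sink that processes each decrypted chunk as it arrives."""

    def __init__(self, process_chunk):
        # process_chunk is any callable, e.g. one that hashes the
        # chunk or streams it somewhere else, so the full plaintext
        # never has to sit in memory.
        self.process_chunk = process_chunk

    def writable(self):
        return True

    def write(self, data):
        self.process_chunk(data)
        return len(data)

Something like gpg_client.decrypt_file(encrypted_file, output=ChunkProcessor(my_callback)) would then see each chunk in write as it is produced.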
In Python 3.6.1, I've tried wrapping a tempfile.SpooledTemporaryFile in an io.TextIOWrapper:
with tempfile.SpooledTemporaryFile() as tfh:
    do_some_download(tfh)
    tfh.seek(0)
    wrapper = io.TextIOWrapper(tfh, encoding='utf-8')
    yield from do_some_text_formatting(wrapper)
The line wrapper = io.TextIOWrapper(tfh, encoding='utf-8') gives me an error:
AttributeError: 'SpooledTemporaryFile' object has no attribute 'readable'
If I create a simple class like this, I can bypass the error (I get similar errors for writable and seekable):
class MySpooledTempfile(tempfile.SpooledTemporaryFile):
    @property
    def readable(self):
        return self._file.readable

    @property
    def writable(self):
        return self._file.writable

    @property
    def seekable(self):
        return self._file.seekable
Is there a good reason why tempfile.SpooledTemporaryFile doesn't already have these attributes?
SpooledTemporaryFile actually uses two different _file implementations under the hood: initially an in-memory io buffer (StringIO or BytesIO), until it rolls over and creates a "file-like object" via tempfile.TemporaryFile() (for example, when max_size is exceeded).
io.TextIOWrapper requires a BufferedIOBase-style interface, which io.StringIO and io.BytesIO provide, but which the object returned by TemporaryFile() doesn't necessarily provide (in my testing on OS X, TemporaryFile() returned an _io.BufferedRandom object, which had the desired interface; my theory is this may depend on the platform).
So, I would expect your MySpooledTempfile wrapper to possibly fail on some platforms after rollover.
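You can observe the two phases directly with a small probe; note it peeks at the private _file attribute, a CPython implementation detail, so treat it as a diagnostic sketch only:

import tempfile

with tempfile.SpooledTemporaryFile(max_size=10) as f:
    print(type(f._file))  # in-memory buffer, e.g. <class '_io.BytesIO'>
    f.write(b"x" * 100)   # exceed max_size to force rollover to disk
    print(type(f._file))  # real file object, e.g. <class '_io.BufferedRandom'>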
This is fixed in Python 3.11; see the changelog for reference.
I have a sample Python class:
class bean:
    def __init__(self, puid, pogid, bucketId, dt, at):
        self.puid = puid
        self.pogid = pogid
        self.bucketId = bucketId
        self.dt = (datetime.datetime.today() - datetime.datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")).days
        self.absdt = dt
        self.at = at
Now I know that in Java, to make a class serializable, we just have to extend Serializable and override a few methods, and life is simple. Though Python is so simplistic, I can't find a way to serialize the objects of this class.
This class should be serializable over the network because objects of this class go to Apache Spark, which distributes the objects over the network.
What is the best way to do that?
I also found this, but I don't know if it is the best way to do it.
I also read:
Classes, functions, and methods cannot be pickled -- if you pickle an object, the object's class is not pickled, just a string that identifies what class it belongs to.
So does that mean those classes can't be serialized?
PS: There would be millions of objects of this class, as the data is huge. So please provide two solutions: the easiest way and the most efficient way of doing so.
EDIT:
For clarification, I have to use it something like this:
def myfun():
    # ... some logic ...
    t1 = bean(<params>)
    t2 = bean(<params2>)
    temp = list()
    temp.append(t1)
    temp.append(t2)
    return temp
Now, here is how it is finally called:
PairRDD.map(myfun).collect()
which throws the exception:
<function __init__ at 0x7f3549853c80> is not JSON serializable
First, for your example pickle will work great. pickle doesn't serialize "functions", it only serializes "data" - so if you have the type you are trying to serialize on the remote script, i.e. if you have the type bean imported on the receiving end, you can use pickle or cPickle and everything will work. The disclaimer you mentioned states that pickle doesn't keep the code of the class, meaning that if you don't have it imported on the receiving end, pickle won't work for you.
All cross-language serialization solutions (i.e. JSON, XML) will never provide the ability to transfer class "code" because there's no reasonable way to represent it. If you're using the same language on both ends (like here), there are ways to get this to work - you could, for example, marshal the object's code, pickle the result, send it over, receive it on the other end, unpickle and unmarshal, and you have an object with its functions - this is in fact sending the code and eval()-ing it on the receiving end.
Here's a quick example based on your class for pickling an object:
test.py
import datetime
import pickle

class bean:
    def __init__(self, puid, pogid, bucketId, dt, at):
        self.puid = puid
        self.pogid = pogid
        self.bucketId = bucketId
        self.dt = (datetime.datetime.today() - datetime.datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")).days
        self.absdt = dt
        self.at = at

    def whoami(self):
        return "%d %d" % (self.puid, self.pogid)

def myfun():
    t1 = bean(1, 2, 3, "2015-12-31 11:50:25", 4)
    t2 = bean(5, 6, 7, "2015-12-31 12:50:25", 8)
    tmp = list()
    tmp.append(t1)
    tmp.append(t2)
    return tmp

if __name__ == "__main__":
    with open("test.data", "w") as f:
        pickle.dump(myfun(), f)
    with open("test.data", "r") as f2:
        obj = pickle.load(f2)
    print "|".join([bean.whoami() for bean in obj])
running it:
ben#ben-lnx:~$ python test.py
1 2|5 6
So you can see pickle works as long as you have the type imported.
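Conversely, here's a sketch of the failure mode the disclaimer warns about. Since test.py was run directly, the class was recorded in the pickle as __main__.bean, so loading from a separate script that doesn't define or import bean fails:

import pickle

with open("test.data") as f:
    obj = pickle.load(f)
# AttributeError: 'module' object has no attribute 'bean'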
As long as all the arguments you pass to __init__ (puid, pogid, bucketId, dt, at) can be serialized, there should be no need for any additional steps. If you experience any problems, it most likely means you didn't properly distribute your modules over the cluster.
While PySpark automatically distributes variables and functions referenced inside closures, distributing modules, libraries and classes is your responsibility. In the case of simple classes, creating a separate module and passing it via SparkContext.addPyFile should be just enough:
# https://www.python.org/dev/peps/pep-0008/#class-names
from some_module import Bean
sc.addPyFile("some_module.py")
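For illustration, a minimal sketch of what some_module.py might contain, reusing the bean example above (the Bean name follows PEP 8, and the parallelize call is made up):

# some_module.py
import datetime

class Bean(object):
    def __init__(self, puid, pogid, bucketId, dt, at):
        self.puid = puid
        self.pogid = pogid
        self.bucketId = bucketId
        self.dt = (datetime.datetime.today() - datetime.datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")).days
        self.absdt = dt
        self.at = at

And on the driver:

sc.addPyFile("some_module.py")

from some_module import Bean

rdd = sc.parallelize([(1, 2, 3, "2015-12-31 11:50:25", 4)])
beans = rdd.map(lambda args: Bean(*args))
print beans.count()  # workers can now unpickle Bean instances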
I have a situation where there's a complex object that can be referenced by a unique name like package.subpackage.MYOBJECT. While it's possible to pickle this object using the standard pickle algorithm, the resulting data string will be very big.
I'm looking for some way to get the same pickling semantics for an object that already exist for classes and functions: Python's pickle just dumps their fully qualified names, not their code. This way, just the string package.subpackage.MYOBJECT would be dumped, and upon unpickling the object would be imported, just like it happens for functions or classes.
It seems that this task boils down to making the object aware of the variable name it's bound to, but I have no clue how to do it.
Here's a short example to explain myself clearly (obvious imports are skipped).
File bigpackage/bigclasses/models.py:
class SomeInterface(object):
    __metaclass__ = ABCMeta

    @abstractmethod
    def operation(self):
        pass

class ImplementationA(SomeInterface):
    def operation(self):
        print "ImplementationA"

class ImplementationB(SomeInterface):
    def operation(self):
        print "ImplementationB"

IMPL_A = ImplementationA()
IMPL_B = ImplementationB()
File bigpackage/bigclasses/tasks.py:
@celery.task
def background_task(impl, somearg):
    assert isinstance(impl, SomeInterface)
    impl.operation()
    print somearg
File bigpackage/bigclasses/work.py:
from bigpackage.bigclasses.models import IMPL_A, IMPL_B
from bigpackage.bigclasses.tasks import background_task
background_task.submit(IMPL_A, "arg1")
background_task.submit(IMPL_B, "arg2")
Here I have a trivial background Celery task that accepts one of two available implementations of SomeInterface as an argument. The task's arguments are pickled by Celery, passed to a queue and executed on some worker server that runs exactly the same code base. My idea is to avoid deep pickling of IMPL_A and IMPL_B and instead pass them as bigpackage.bigclasses.models.IMPL_A and bigpackage.bigclasses.models.IMPL_B respectively. That will help with performance and total traffic for the queue server, and also provide some safety against changes in IMPL_A and IMPL_B that would make them non-pickleable (for example, a lambda anywhere in the object attribute hierarchy).
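For what it's worth, pickle's __reduce__ hook can produce exactly this import-by-name behavior. Here is a minimal sketch under that assumption; the _locate helper, the PickleByName mixin and the _pickle_name attribute are all made up, and each singleton has to record the name it's bound to:

import importlib

def _locate(module_name, attr_name):
    # Runs on the unpickling side: re-import the module and
    # return the already-constructed singleton.
    return getattr(importlib.import_module(module_name), attr_name)

class PickleByName(object):
    # Instances must set _pickle_name to the (module, attribute)
    # pair they are bound to at import time.
    _pickle_name = None

    def __reduce__(self):
        return (_locate, self._pickle_name)

In models.py, the implementations would mix this in and record their bindings:

class ImplementationA(PickleByName, SomeInterface):
    def operation(self):
        print "ImplementationA"

IMPL_A = ImplementationA()
IMPL_A._pickle_name = ("bigpackage.bigclasses.models", "IMPL_A")

With that, pickle.dumps(IMPL_A) stores little more than the two strings, and pickle.loads re-imports the live object.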
Recently a question was posed regarding some Python code attempting to facilitate distributed computing through the use of pickled processes. Apparently that functionality has historically been possible, but it was disabled for security reasons. On the second attempt at transmitting a function object through a socket, only the reference was transmitted. Correct me if I am wrong, but I do not believe this issue is related to Python's late binding. Given the presumption that process and thread objects cannot be pickled, is there any way to transmit a callable object? We would like to avoid transmitting compressed source code for each job, as that would probably make the entire attempt pointless. Only the Python core library can be used, for portability reasons.
You could marshal the bytecode and pickle the other function things:
import marshal
import pickle
marshaled_bytecode = marshal.dumps(your_function.func_code)
# In this process, other function things are lost, so they have to be sent separately.
pickled_name = pickle.dumps(your_function.func_name)
pickled_arguments = pickle.dumps(your_function.func_defaults)
pickled_closure = pickle.dumps(your_function.func_closure)
# Send the marshaled bytecode and the other function things through a socket (they are byte strings).
send_through_a_socket((marshaled_bytecode, pickled_name, pickled_arguments, pickled_closure))
In another python program:
import marshal
import pickle
import types
# Receive the marshaled bytecode and the other function things.
marshaled_bytecode, pickled_name, pickled_arguments, pickled_closure = receive_from_a_socket()
your_function = types.FunctionType(
    marshal.loads(marshaled_bytecode),
    globals(),
    pickle.loads(pickled_name),
    pickle.loads(pickled_arguments),
    pickle.loads(pickled_closure),
)
And any references to globals inside the function would have to be recreated in the script that receives the function.
In Python 3, the function attributes used are __code__, __name__, __defaults__ and __closure__.
Please do note that send_through_a_socket and receive_from_a_socket do not actually exist, and you should replace them by actual code that transmits data through sockets.
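For completeness, a self-contained Python 3 version of the same round trip might look like this (everything in one process, no sockets; the add function is just an example):

import marshal
import pickle
import types

def add(a, b=1):
    return a + b

# "Sender": marshal the bytecode, pickle the other function things.
payload = (marshal.dumps(add.__code__),
           pickle.dumps(add.__name__),
           pickle.dumps(add.__defaults__),
           pickle.dumps(add.__closure__))

# "Receiver": rebuild the function from its parts.
code, name, defaults, closure = payload
rebuilt = types.FunctionType(marshal.loads(code), globals(),
                             pickle.loads(name),
                             pickle.loads(defaults),
                             pickle.loads(closure))

print(rebuilt(2))  # 3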