I have a sample Python class
class bean:
    def __init__(self, puid, pogid, bucketId, dt, at):
        self.puid = puid
        self.pogid = pogid
        self.bucketId = bucketId
        self.dt = (datetime.datetime.today() - datetime.datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")).days
        self.absdt = dt
        self.at = at
Now, I know that in Java, to make a class serializable we just implement Serializable and override a few methods, and life is simple. Python is supposed to be simpler, yet I can't find a way to serialize the objects of this class.
This class should be serializable over the network, because its objects are passed to Apache Spark, which distributes them across the cluster.
What is the best way to do that?
I also found this, but I don't know if it is the best way to do it.
I also read
Classes, functions, and methods cannot be pickled -- if you pickle an object, the object's class is not pickled, just a string that identifies what class it belongs to.
So does that mean such classes can't be serialized?
PS: There will be millions of objects of this class, as the data is huge. So please provide two solutions: the easiest way and the most efficient way of doing so.
EDIT:
For clarification, I have to use it something like this:
def myfun():
    # Some Logic
    t1 = bean(<params>)
    t2 = bean(<params2>)
    temp = list()
    temp.append(t1)
    temp.append(t2)
    return temp
Here is how it is finally called:
PairRDD.map(myfun).collect()
which throws the exception:
<function __init__ at 0x7f3549853c80> is not JSON serializable
First, for your example pickle will work great. pickle doesn't serialize "functions", it only serializes "data" - so if you have the types you are trying to serialize available on the remote script, i.e. if you have the type "bean" imported on the receiving end, you can use pickle or cPickle and everything will work. The disclaimer you mentioned states that it doesn't keep the code of the class, meaning that if you don't have it imported on the receiving end, pickle won't work for you.
All cross-language serialization solutions (i.e. JSON, XML) will never provide the ability to transfer class "code", because there's no reasonable way to represent it. If you're using the same language on both ends (like here), there are ways to get this to work - you could, for example, marshal the object, pickle the result, send it over, receive it on the receiving end, unpickle and unmarshal it, and you have an object with its functions - this is in fact sending the code and eval()-ing it on the receiving end.
Here's a quick example based on your class for pickling objects:
test.py
import datetime
import pickle
class bean:
    def __init__(self, puid, pogid, bucketId, dt, at):
        self.puid = puid
        self.pogid = pogid
        self.bucketId = bucketId
        self.dt = (datetime.datetime.today() - datetime.datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")).days
        self.absdt = dt
        self.at = at

    def whoami(self):
        return "%d %d" % (self.puid, self.pogid)

def myfun():
    t1 = bean(1, 2, 3, "2015-12-31 11:50:25", 4)
    t2 = bean(5, 6, 7, "2015-12-31 12:50:25", 8)
    tmp = list()
    tmp.append(t1)
    tmp.append(t2)
    return tmp

if __name__ == "__main__":
    with open("test.data", "w") as f:
        pickle.dump(myfun(), f)
    with open("test.data", "r") as f2:
        obj = pickle.load(f2)
    print "|".join([bean.whoami() for bean in obj])
running it:
ben@ben-lnx:~$ python test.py
1 2|5 6
So you can see pickle works as long as you have the type imported on both ends.
As long as all the arguments you pass to __init__ (puid, pogid, bucketId, dt, at) can be serialized, there should be no need for any additional steps. If you experience any problems, it most likely means you didn't properly distribute your modules over the cluster.
While PySpark automatically distributes variables and functions referenced inside closures, distributing modules, libraries and classes is your responsibility. In the case of simple classes, creating a separate module and passing it via SparkContext.addPyFile should be enough:
# https://www.python.org/dev/peps/pep-0008/#class-names
from some_module import Bean
sc.addPyFile("some_module.py")
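Once the module has been shipped, a hypothetical usage sketch could look like the following (the rdd variable and the record layout are assumptions for illustration, not part of the original answer):

def to_bean(record):
    # assumed record layout matching Bean's constructor
    puid, pogid, bucket_id, dt, at = record
    return Bean(puid, pogid, bucket_id, dt, at)

# beans = rdd.map(to_bean).collect()  # rdd is assumed to already exist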
Related
I want to use Python for creating JSON.
Since I found no library which can help me, I want to know whether it's possible to inspect the order of the classes defined in a Python file.
Example
# example.py
class Foo:
    pass

class Bar:
    pass
If I import example, I want to know the order of the classes. In this case it is [Foo, Bar] and not [Bar, Foo].
Is this possible? If "yes", how?
Background
I am not happy with yaml/json. I have a vague idea of creating config via Python classes (only classes, not instances).
Answers which help me to get to my goal (Create JSON with a tool which is easy and fun to use) are welcome.
The inspect module can tell the line numbers of the class declarations:
import inspect

def get_classes(module):
    for name, value in inspect.getmembers(module):
        if inspect.isclass(value):
            _, line = inspect.getsourcelines(value)
            yield line, name
So the following code:
import example

for line, name in sorted(get_classes(example)):
    print line, name
Prints:
2 Foo
5 Bar
First up, as I see it, there are two things you can do:
1. Continue trying to use Python source files as configuration files. (I won't recommend this. It's analogous to using a bulldozer to strike a nail or converting a shotgun to a wheel.)
2. Switch to something like TOML, JSON or YAML for configuration files, which are designed for the job.
Nothing in JSON or YAML prevents them from holding "ordered" key-value pairs. Python's dict data type is unordered by default (at least till 3.5) and list data type is ordered. These map directly to object and array in JSON respectively, when using the default loaders. Just use something like Python's OrderedDict when deserializing them and voila, you preserve order!
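For example, a minimal sketch of order-preserving deserialization with the standard json module (the keys here are invented for illustration):

import json
from collections import OrderedDict

text = '{"first": 1, "second": 2, "third": 3}'
config = json.loads(text, object_pairs_hook=OrderedDict)
print(list(config.keys()))  # ['first', 'second', 'third']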
With that out of the way, if you really want to use Python source files for the configuration, I suggest trying to process the file using the ast module. Abstract Syntax Trees are a powerful tool for syntax level analysis.
I whipped up a quick script for extracting class line numbers and names from a file.
You (or anyone really) can use it or extend it with more checks, as extensively as you want.
import sys
import ast
import json


class ClassNodeVisitor(ast.NodeVisitor):
    def __init__(self):
        super(ClassNodeVisitor, self).__init__()
        self.class_defs = []

    def visit(self, node):
        super(ClassNodeVisitor, self).visit(node)
        return self.class_defs

    def visit_ClassDef(self, node):
        self.class_defs.append(node)


def read_file(fpath):
    with open(fpath) as f:
        return f.read()


def get_classes_from_text(text):
    try:
        tree = ast.parse(text)
    except Exception as e:
        raise e
    class_extractor = ClassNodeVisitor()
    li = []
    for definition in class_extractor.visit(tree):
        li.append([definition.lineno, definition.name])
    return li


def main():
    fpath = "/tmp/input_file.py"
    try:
        text = read_file(fpath)
    except Exception as e:
        print("Could not load file due to " + repr(e))
        return 1
    print(json.dumps(get_classes_from_text(text), indent=4))


if __name__ == '__main__':
    sys.exit(main())
Here's a sample run on the following file:
input_file.py:
class Foo:
    pass


class Bar:
    pass
Output:
$ py_to_json.py input_file.py
[
    [
        1,
        "Foo"
    ],
    [
        5,
        "Bar"
    ]
]
If I import example,
If you're going to import the module, the example module needs to be on the import path. Importing means executing any Python code in the example module. This is a pretty big security hole - you're loading a user-editable file in the same context as the rest of the application.
I'm assuming that since you care about preserving class-definition order, you also care about preserving the order of definitions within each class.
It is worth pointing out that this is now the default behavior in Python, since Python 3.6.
Also see PEP 520: Preserving Class Attribute Definition Order.
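A small illustration of that behavior (Python 3.6+); the class and attribute names are placeholders:

class Config:
    first = 1
    second = 2
    third = 3

# attribute definition order is preserved in the class namespace
print([name for name in vars(Config) if not name.startswith('__')])
# ['first', 'second', 'third']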
(Moving my comments to an answer)
That's a great vague idea. You should give Figura a shot! It does exactly that.
(Full disclosure: I'm the author of Figura.)
I should point out that the order of declarations is not preserved in Figura, nor in JSON.
I'm not sure about order-preservation in YAML, but I did find this on wikipedia:
... according to the specification, mapping keys do not have an order
It might be the case that specific YAML parsers maintain the order, though they aren't required to.
You can use a metaclass to record each class's creation time, and later, sort the classes by it.
This works in python2:
class CreationTimeMetaClass(type):
    creation_index = 0

    def __new__(cls, clsname, bases, dct):
        dct['__creation_index__'] = cls.creation_index
        cls.creation_index += 1
        return type.__new__(cls, clsname, bases, dct)

__metaclass__ = CreationTimeMetaClass

class Foo: pass
class Bar: pass

classes = [cls for cls in globals().values() if hasattr(cls, '__creation_index__')]
print(sorted(classes, key=lambda cls: cls.__creation_index__))
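If you need the same trick on Python 3, a rough equivalent (not part of the original answer) passes the metaclass explicitly instead of setting the module-level __metaclass__:

class CreationTimeMetaClass(type):
    creation_index = 0

    def __new__(cls, clsname, bases, dct):
        dct['__creation_index__'] = cls.creation_index
        cls.creation_index += 1
        return type.__new__(cls, clsname, bases, dct)

class Foo(metaclass=CreationTimeMetaClass): pass
class Bar(metaclass=CreationTimeMetaClass): pass

classes = [cls for cls in globals().values() if hasattr(cls, '__creation_index__')]
print(sorted(classes, key=lambda cls: cls.__creation_index__))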
The standard json module is easy to use and works well for reading and writing JSON config files.
Objects are not ordered within JSON structures but lists/arrays are, so put order-dependent information into a list.
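For instance, a tiny sketch with invented configuration content:

import json

config_text = '{"steps": ["checkout", "build", "test", "deploy"]}'
config = json.loads(config_text)
for step in config["steps"]:  # JSON arrays keep their order
    print(step)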
I have used classes as a configuration tool; what I did was derive them from a base class which was customised by the particular class variables. By using classes like this I did not need a factory class. For example:
from .artifact import Application
class TempLogger(Application): partno='03459'; path='c:/apps/templog.exe'; flag=True
class GUIDisplay(Application): partno='03821'; path='c:/apps/displayer.exe'; flag=False
in the installation script
from .install import Installer
import app_configs
installer = Installer(apps=(TempLogger(), GUIDisplay()))
installer.baseline('1.4.3.3475')
print installer.versions()
print installer.bill_of_materials()
One should use the right tools for the job, so perhaps Python classes are not the right tool if you need ordering.
Another Python tool I have used to create JSON files is the Mako templating system. This is very powerful. We used it to populate variables like IP addresses etc. into static JSON files that were then read by C++ programs.
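As a rough illustration of that approach (assuming Mako is installed; the template and values are invented):

from mako.template import Template

template = Template('{"host": "${ip}", "port": ${port}}')
print(template.render(ip="10.0.0.1", port=8080))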
I'm not sure if this answers your question, but it might be relevant. Take a look at the excellent attrs module. It's great for creating classes to use as data types.
Here's an example from glyph's blog (creator of Twisted Python):
import attr
@attr.s
class Point3D(object):
    x = attr.ib()
    y = attr.ib()
    z = attr.ib()
It saves you writing a lot of boilerplate code - you get things like str representation and comparison for free, and the module has a convenient asdict function which you can pass to the json library:
>>> p = Point3D(1, 2, 3)
>>> str(p)
'Point3D(x=1, y=2, z=3)'
>>> p == Point3D(1, 2, 3)
True
>>> json.dumps(attr.asdict(p))
'{"y": 2, "x": 1, "z": 3}'
The module uses a strange naming convention, but read attr.s as "attrs" and attr.ib as "attrib" and you'll be okay.
Just touching on the point about creating JSON from Python: there is an excellent library called jsonpickle which lets you dump Python objects to JSON. (Using this alone, or together with the other methods mentioned here, you can probably get what you wanted.)
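A minimal sketch of what that looks like (class and values invented; assumes jsonpickle is installed):

import jsonpickle

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

frozen = jsonpickle.encode(Point(1, 2))  # a JSON string
thawed = jsonpickle.decode(frozen)       # a Point instance again
print(frozen, thawed.x, thawed.y)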
I have a situation where there's a complex object that can be referenced by a unique name like package.subpackage.MYOBJECT. While it's possible to pickle this object using the standard pickle algorithm, the resulting data string will be very big.
I'm looking for some way to get the same pickling semantics for an object that already exist for classes and functions: Python's pickle just dumps their fully qualified names, not their code. This way just a string like package.subpackage.MYOBJECT would be dumped, and upon unpickling the object would be imported, just like it happens for functions or classes.
It seems that this task boils down to making the object aware of the variable name it's bound to, but I have no clue how to do it.
Here's a short example to explain myself clearly (obvious imports are skipped).
File bigpackage/bigclasses/models.py:
class SomeInterface():
    __meta__ = ABCMeta

    @abstractmethod
    def operation():
        pass

class ImplementationA(SomeInterface):
    def operation():
        print "ImplementationA"

class ImplementationB(SomeInterface):
    def operation():
        print "ImplementationB"

IMPL_A = ImplementationA()
IMPL_B = ImplementationB()
File bigpackage/bigclasses/tasks.py:
@celery.task
def background_task(impl, somearg):
    assert isinstance(impl, SomeInterface)
    impl.operation()
    print somearg
File bigpackage/bigclasses/work.py:
from bigpackage.bigclasses.models import IMPL_A, IMPL_B
from bigpackage.bigclasses.tasks import background_task
background_task.submit(IMPL_A, "arg1")
background_task.submit(IMPL_B, "arg2")
Here I have a trivial background Celery task that accepts one of the two available implementations of SomeInterface as an argument. The task's arguments are pickled by Celery, passed to a queue and executed on some worker server that runs exactly the same code base. My idea is to avoid deep pickling of IMPL_A and IMPL_B and instead pass them as bigpackage.bigclasses.models.IMPL_A and bigpackage.bigclasses.models.IMPL_B correspondingly. That would help with performance and total traffic for the queue server, and also provide some safety against changes in IMPL_A and IMPL_B that would make them non-pickleable (for example, a lambda anywhere in the object attribute hierarchy).
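One common way to get this pickle-by-name behaviour (not an answer from the thread, just a sketch that follows the asker's module layout) is to implement __reduce__ so that pickle stores only a dotted name and re-imports the singleton on the worker:

import importlib

def _load_by_name(module_name, attr_name):
    # called by pickle on the receiving side
    return getattr(importlib.import_module(module_name), attr_name)

class ImplementationA(object):
    def operation(self):
        print("ImplementationA")

    def __reduce__(self):
        # module path matches the asker's layout; adjust as needed
        return (_load_by_name, ("bigpackage.bigclasses.models", "IMPL_A"))

IMPL_A = ImplementationA()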
I'm trying to transfer a function across a network connection (using asyncore). Is there an easy way to serialize a python function (one that, in this case at least, will have no side effects) for transfer like this?
I would ideally like to have a pair of functions similar to these:
def transmit(func):
    obj = pickle.dumps(func)
    [send obj across the network]

def receive():
    [receive obj from the network]
    func = pickle.loads(s)
    func()
You could serialise the function bytecode and then reconstruct it on the caller. The marshal module can be used to serialise code objects, which can then be reassembled into a function, i.e.:
import marshal
def foo(x): return x*x
code_string = marshal.dumps(foo.__code__)
Then in the remote process (after transferring code_string):
import marshal, types
code = marshal.loads(code_string)
func = types.FunctionType(code, globals(), "some_func_name")
func(10) # gives 100
A few caveats:
marshal's format (any Python bytecode for that matter) may not be compatible between major Python versions.
Will only work for the CPython implementation.
If the function references globals (including imported modules, other functions etc) that you need to pick up, you'll need to serialise these too, or recreate them on the remote side. My example just gives it the remote process's global namespace.
You'll probably need to do a bit more to support more complex cases, like closures or generator functions.
Check out Dill, which extends Python's pickle library to support a greater variety of types, including functions:
>>> import dill as pickle
>>> def f(x): return x + 1
...
>>> g = pickle.dumps(f)
>>> f(1)
2
>>> pickle.loads(g)(1)
2
It also supports references to objects in the function's closure:
>>> def plusTwo(x): return f(f(x))
...
>>> pickle.loads(pickle.dumps(plusTwo))(1)
3
Pyro is able to do this for you.
The simplest way is probably inspect.getsource(object) (see the inspect module), which returns a string with the source code of a function or a method.
It all depends on whether you generate the function at runtime or not:
If you do, inspect.getsource(object) won't work for dynamically generated functions, as it gets the object's source from the .py file, so only functions defined before execution can be retrieved as source.
And if your functions are placed in files anyway, why not give the receiver access to them and only pass around module and function names?
The only solution for dynamically created functions that I can think of is to construct the function as a string before transmission, transmit the source, and then eval() it on the receiver side.
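A rough sketch of that source-string approach (only safe when both ends trust each other, since it executes received code):

func_source = "def add(a, b):\n    return a + b\n"
# ... transmit func_source over the network ...
namespace = {}
exec(func_source, namespace)
print(namespace["add"](2, 3))  # 5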
Edit: the marshal solution also looks pretty smart; I didn't know you could serialize something other than built-ins.
In modern Python you can pickle functions, and many variants of them. Consider this:
import pickle, time
def foobar(a, b):
    print("%r %r" % (a, b))
you can pickle it
p = pickle.dumps(foobar)
q = pickle.loads(p)
q(2,3)
you can pickle closures
import functools
foobar_closed = functools.partial(foobar,'locked')
p = pickle.dumps(foobar_closed)
q = pickle.loads(p)
q(2)
even if the closure uses a local variable
def closer():
    z = time.time()
    return functools.partial(foobar, z)
p = pickle.dumps(closer())
q = pickle.loads(p)
q(2)
but if you close it using an internal function, it will fail
def builder():
    z = 'internal'
    def mypartial(b):
        return foobar(z, b)
    return mypartial
p = pickle.dumps(builder())
q = pickle.loads(p)
q(2)
with error
pickle.PicklingError: Can't pickle <function mypartial at 0x7f3b6c885a50>: it's not found as __main__.mypartial
Tested with Python 2.7 and 3.6
The cloud package (pip install cloud) can pickle arbitrary code, including dependencies. See https://stackoverflow.com/a/16891169/1264797.
import pickle

code_string = '''
def foo(x):
    return x * 2

def bar(x):
    return x ** 2
'''
obj = pickle.dumps(code_string)
Now
exec(pickle.loads(obj))
foo(1)
> 2
bar(3)
> 9
Cloudpickle is probably what you are looking for.
Cloudpickle is described as follows:
cloudpickle is especially useful for cluster computing where Python
code is shipped over the network to execute on remote hosts, possibly
close to the data.
Usage example:
import pickle
import cloudpickle

def add_one(n):
    return n + 1

pickled_function = cloudpickle.dumps(add_one)
pickle.loads(pickled_function)(42)
You can do this:
def fn_generator():
    def fn(x, y):
        return x + y
    return fn
Now, transmit(fn_generator()) will send the actual definition of fn(x, y) instead of a reference to the module name.
You can use the same trick to send classes across network.
The basic functions of this module cover your query, plus you get the best compression over the wire; see the instructive source code:
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net
Here is a helper class you can use to wrap functions in order to make them picklable. Caveats already mentioned for marshal will apply but an effort is made to use pickle whenever possible. No effort is made to preserve globals or closures across serialization.
import marshal
import pickle
import types

class PicklableFunction:
    def __init__(self, fun):
        self._fun = fun

    def __call__(self, *args, **kwargs):
        return self._fun(*args, **kwargs)

    def __getstate__(self):
        try:
            return pickle.dumps(self._fun)
        except Exception:
            return marshal.dumps((self._fun.__code__, self._fun.__name__))

    def __setstate__(self, state):
        try:
            self._fun = pickle.loads(state)
        except Exception:
            code, name = marshal.loads(state)
            self._fun = types.FunctionType(code, {}, name)
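A quick usage sketch (not part of the original answer; the wrapped function is invented):

def square(x):
    return x * x

wrapped = PicklableFunction(square)
data = pickle.dumps(wrapped)   # __getstate__ tries pickle, falls back to marshal
restored = pickle.loads(data)
print(restored(4))             # 16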
I'm new to OOP and trying to use COM objects (arcobjects) in Python. The program is GIS-related, but I did not get any answers on GIS.SE, so I am asking here. Below is a piece of my code. I am stuck at the end, where I receive iFrameElement. ESRI describes it as a member/interface of an abstract class, which cannot create objects itself. I need to pass the information contained in it to an object of its CoClass (MapFrame).
Any suggestions how to do this?
Also, where can I find naming conventions for objects in Python? There are p and i prefixes, and I am not sure where they come from.
from comtypes.client import CreateObject, GetModule
import arcpy
def CType(obj, interface):
    """Casts obj to interface and returns comtypes POINTER or None"""
    try:
        newobj = obj.QueryInterface(interface)
        return newobj
    except:
        return None

def NewObj(MyClass, MyInterface):
    """Creates a new comtypes POINTER object where\n\
    MyClass is the class to be instantiated,\n\
    MyInterface is the interface to be assigned"""
    from comtypes.client import CreateObject
    try:
        ptr = CreateObject(MyClass, interface=MyInterface)
        return ptr
    except:
        return None

esriCarto = GetModule(r"C:\Program Files (x86)\ArcGIS\Desktop10.0\com\esriCarto.olb")
esriCartoUI = GetModule(r"C:\Program Files (x86)\ArcGIS\Desktop10.0\com\esriCartoUI.olb")
esriMapUI = GetModule(r"C:\Program Files (x86)\ArcGIS\Desktop10.0\com\esriArcMapUI.olb")
esriFrame = GetModule(r"C:\Program Files (x86)\ArcGIS\Desktop10.0\com\esriFramework.olb")

arcpy.SetProduct('Arcinfo')
pApp = NewObj(esriFrame.AppROT, esriFrame.IAppROT).Item(0)
pDoc = pApp.Document
pMxDoc = CType(pDoc, esriMapUI.IMxDocument)
pLayout = pMxDoc.PageLayout
pGraphContLayout = CType(pLayout, esriCarto.IGraphicsContainer)
iFrameElement = pGraphContLayout.FindFrame(pMxDoc.ActiveView.FocusMap)
As far as I understand, iFrameElement is an interface of an abstract class from which I need to inherit attributes (a pointer) to the MapFrame object. How do I do that? How do I get to an object with the IMapGrids interface? Any suggestions?
IFrameElement is an interface, so you can't create an instance of it per se. This interface is implemented by various classes, including MapFrame, which means (in basic terms) that an instance of any of those objects 'behaves' like an IFrameElement. So if you get an IFrameElement from IGraphicsContainer.FindFrame(), you can pass it to something else that expects an IFrameElement without having to find out what the actual type of the object is.
I would suggest reading up on what Interfaces mean in OOP, because ESRI's code uses them a lot.
On naming conventions - there is no hard and fast rule on what to name your variables.
By the looks of your code, the p refers to an object with a distinct type, and i refers to an object defined only by an interface. But on that note, calling a variable by the same name as the interface it's referencing (except with a lower-case 'i') is a bad way to do things, and will lead to confusion. (IMO)
Edit:
To answer your final question (sorry, I missed it originally):
If pGraphContLayout.FindFrame() returns an object of type MapFrame (and there is no guarantee that it does) then you should be able to simply cast it across to IMapGrids:
pGraphContLayout = CType(pLayout, esriCarto.IGraphicsContainer)
pFrame = pGraphContLayout.FindFrame(pMxDoc.ActiveView.FocusMap)
pGrids = CType(pFrame, IMapGrids)
It sounds like you may be getting confused by Python's abstract base classes, which seem to serve the purpose of interfaces...? This thread is useful: Difference between abstract class and interface in Python
I am working on a Python module that is supposed to check out some code from SVN and build it. After much refactoring of some legacy code, I got fairly decent coverage of the code; however, I have a gaping hole in the code that uses pysvn.
Admittedly the concept of Mock objects is new to me, but after reading some of the documentation of MiniMock and pymox (both are available in my environment), I came to the conclusion that I will need to capture some pysvn output and have it returned in my test code.
But here I find myself (pardon the pun) in a pickle. The objects returned from the pysvn.Client() commands do not behave nicely when I try to pickle them, or even to compare them.
Any suggestions on how to serialize or otherwise mock pysvn or other objects that don't behave Pythonically?
Naturally, I am willing to accept that I am approaching this problem from the wrong direction, or that I am simply an idiot. In that case any advice will be helpful.
Additional information 0:
Some pysvn objects can be reduced to a dict by accessing their data property, and can be reproduced by passing this dict into the appropriate __init__().
For example:
>>> svn=pysvn.Client()
>>> svn.list('http://svn/svn/')[0][0]
<PysvnList u'http://svn/svn'>
>>> d=svn.list('http://svn/svn/')[0][0].data
>>> pysvn.PysvnList(d)
<PysvnList u'http://svn/svn'>
However inside this object there might be some unpicklable objects:
>>> cPickle.dumps(d)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
cPickle.UnpickleableError: Cannot pickle <type 'node_kind'> objects
Additional Information 1:
As per @H. Dunlop's request, here is a (simplified) snippet of my code.
It gets a list out of SVN and lets the user choose an item from that list:
class Menu(object):
    """a well covered class"""
    # ...

class VersionControl(object):
    """A poorly covered class"""
    def __init__(self):
        self.svn = pysvn.Client()
        # ...

    def list(self, url):
        """svn ls $url"""
        return [os.path.basename(x['path']) for (x, _) in self.svn.list(url)[1:]]

    def choose(self, choice, url):
        """Displays a menu from svn list, and get's the users choice form it.
        Returns the svn item (path).
        """
        menu = Menu(prompt="Please choose %s from list:\n" % choice,
                    items=self.list(url),
                    muliple_choice=False)
        menu.present()
        return menu.chosen()
In this answer I used MiniMock; I'm not actually that familiar with it and would suggest using http://www.voidspace.org.uk/python/mock/ instead - this code would end up a bit cleaner. But you specified MiniMock or pymox, so here goes:
from minimock import TraceTracker, Mock, mock
import unittest
import pysvn

from code_under_test import VersionControl


class TestVersionControl(unittest.TestCase):

    def test_init(self):
        mock_svn = Mock(name='svn_client')
        mock('pysvn.Client', returns=mock_svn)
        vc = VersionControl()
        self.assertEqual(vc.svn, mock_svn)

    def test_list_calls_svn_list_and_returns_urls(self):
        tracker = TraceTracker()
        test_url = 'a test_url'
        mock_data = [
            ({'path': 'first result excluded'}, None),
            ({'path': 'url2'}, None),
            ({'path': 'url3', 'info': 'not in result'}, None),
            ({'path': 'url4'}, None),
        ]
        vc = VersionControl()
        mock('vc.svn.list', returns=mock_data, tracker=tracker)
        response = vc.list(test_url)
        self.assertEqual(['url2', 'url3', 'url4'], response)
        self.assertTrue("Called vc.svn.list('a test_url')" in tracker.dump())


if __name__ == '__main__':
    unittest.main()
If you wanted to test more of the underlying dictionaries returned by pysvn, you could just modify the list of tuples of dictionaries that you make it return. You could even write a little bit of code that just dumps out the dictionaries from the pysvn objects.
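For example, a rough helper along those lines (assumes a reachable repository; the URL follows the question's example):

import pprint
import pysvn

client = pysvn.Client()
entries = client.list('http://svn/svn/')
# keep only the plain 'data' dicts, stringifying values that won't pickle
plain = [{k: str(v) for k, v in entry.data.items()} for entry, _ in entries]
pprint.pprint(plain)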
Have you considered using pickle instead of cPickle?
"In the cPickle module the callables Pickler() and Unpickler() are functions, not classes. This means that you cannot use them to derive custom pickling and unpickling subclasses."