How to copy an object without referencing to it? - python

I'm not sure if my title is correct for what I'm looking for, but I think that the referencing is the problem.
I have a Reader object through which I can loop:
msrun = pymzml.run.Reader(mzmlFile)
for feature in msrun:
    print feature['id']
With this code I get the id's, starting at 1, of all the features in msrun. However, I need to loop through the code first and get all the keys that I want and put them in a list, like this:
def getKeys(msrun, excludeList):
    spectrumKeys = []
    done = False
    for spectrum in msrun:
        if done:
            break
        if spectrum['ms level'] == 2:
            for key in spectrum:
                if key not in excludeList and not key.startswith('MS:'):
                    done = True
                    spectrumKeys.append(key)
            spectrumKeys.extend(spectrum['precursors'][0].keys())
            precursorKeys = spectrum['precursors'][0].keys()
            break
    return spectrumKeys, precursorKeys
However, if I run this code:
msrun = pymzml.run.Reader(mzmlFile)
specKeys, precursKeys = getKeys(msrun, ['title','name'])
for feature in msrun:
    print feature['id']
it starts off at the first id that wasn't reached by the loop in getKeys() (it starts at 11 instead of 1). So I guess pymzml.run.Reader() works like a generator object. So I tried copying the object. First I tried
copyMsrun = msrun
specKeys, precursKeys = getKeys(copyMsrun, ['title','name'])
But this gives the same problem, if I understand correctly, because copyMsrun = msrun makes both names point to the same object.
Then I tried
import copy
copyMsrun = copy.copy(msrun)
But I still had the same problem. I used copy.copy instead of copy.deepcopy because I don't think that the Reader object contains other objects, and when I try deepcopy I get
TypeError: object.__new__(generator) is not safe, use generator.__new__().
So how do I copy an object so that looping through one doesn't affect the other? Should I just do
msrun = pymzml.run.Reader(mzmlFile)
copyMsrun = pymzml.run.Reader(mzmlFile)
?
Edit:
On Ade YU's comment, I tried that too but when I do
spectrumList = []
for spectrum in msrun:
    print spectrum['id']
    spectrumList.append(spectrum)
for spectrum in spectrumList:
    print spectrum['id']
The first print gives me 1-10, but the second print gives me ten times 10.

From the publication of pymzML and the documentation, it is clear that this "pathological design" is done on purpose. Initializing thousands of spectrum objects would create a huge computational overhead, in both memory and CPU cycles, that is simply not needed. Normally, parsing large sets of mzML naturally calls for an analyze-while-parsing approach rather than collecting everything one needs in order to analyze it later.
Having said this, pymzML still offers a way to "deep copy" the spectrum, simply by calling spectrum.deRef(). The advantage of using this function is that all unnecessary data is stripped prior to copying, which yields smaller objects. See deRef in the pymzML documentation.
run = pymzml.run.Reader(file_to_read, MS1_Precision = 5e-6, MSn_Precision = 20e-6)
for spec in run:
    tmp = spec.deRef()
Hope that helps.

It looks like you're dealing with a pathologically designed class. There are some serious flaws in the library you are using, especially the part where the iterator yields the same object over and over again.
You'll probably need to copy the output of the iterator, like this:
import copy
objs = [copy.deepcopy(obj) for obj in pymzml.run.Reader(mzmlFile)]
for obj in objs:
    pass # do something
for obj in objs:
    pass # do something
If that doesn't work, you need to find whoever wrote the library and confiscate their computer.
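The "ten times 10" symptom from the question is what you would expect whenever an iterator hands back the same mutable object on every step. The following is only a sketch of that mechanism in Python 3: ReusingReader is a hypothetical stand-in written for this illustration and is not part of pymzML, whose internals may differ.

```python
import copy


class ReusingReader:
    """Hypothetical stand-in for a reader that recycles one record object."""

    def __init__(self, count):
        self.count = count
        self._record = {}  # the single dict handed out on every step

    def __iter__(self):
        for i in range(1, self.count + 1):
            self._record["id"] = i  # mutate in place instead of creating a new dict
            yield self._record


# Storing the yielded objects directly keeps several references to one dict.
aliased = list(ReusingReader(3))
aliased_ids = [rec["id"] for rec in aliased]

# Copying each object as it is yielded preserves its state at that moment.
copied = [copy.deepcopy(rec) for rec in ReusingReader(3)]
copied_ids = [rec["id"] for rec in copied]

print(aliased_ids)  # [3, 3, 3]
print(copied_ids)   # [1, 2, 3]
```

The copy has to be taken inside the loop, while the object still holds the current values; copying the list afterwards would only duplicate the final state.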

Try itertools.tee, which gives you independent iterators. If this doesn't work, you are probably in trouble, because the objects yielded by your generator depend on some external state (id = number of objects yielded so far?), and there is no way to automatically help in that situation. deepcopy is your best bet, but if that doesn't work, you'll have to write your own class that captures all the info from the spectrum objects.
spectrumList = []
for spectrum in msrun:
    spectrumList.append(MySpectrum(spectrum))
or the shorter variant
spectrums = list(map(MySpectrum, msrun))
You'll need something like
class MySpectrum:
    def __init__(self, spectrum):
        # the question reads the id with spectrum['id'], so do the same here
        self.id = spectrum['id']
        ...
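As a rough illustration of the itertools.tee suggestion, here is a small Python 3 sketch. The numbers generator is made up for the example; it yields fresh immutable values, which is the case where tee helps.

```python
import itertools


def numbers():
    """Hypothetical generator yielding fresh, immutable values."""
    for i in range(1, 6):
        yield i


first, second = itertools.tee(numbers())

# Advancing one iterator does not advance the other.
head = [next(first), next(first), next(first)]
all_from_second = list(second)
rest_of_first = list(first)

print(head)             # [1, 2, 3]
print(all_from_second)  # [1, 2, 3, 4, 5]
print(rest_of_first)    # [4, 5]
```

Note that tee buffers the yielded objects without copying them. If the source keeps yielding one and the same mutable object, both iterators still see that single object, so tee alone would not cure the "ten times 10" symptom. That is why a copy, or a class like MySpectrum above, is still needed in that case.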

Use deepcopy from the copy module to get a copy that does not point to the same object:
from copy import deepcopy
myq=deepcopy(transq)

Related

Python + Iterator resulting from map sharing current state with the initial iterator

Let me illustrate this with an example my students and I came across:
>>> a_lot = (i for i in range(10**50))
>>> twice_a_lot = map(lambda x: 2*x, a_lot)
>>> next(a_lot)
0
>>> next(a_lot)
1
>>> next(a_lot)
2
>>> next(twice_a_lot)
6
So somehow these iterators share their current state, as crazy and uncomfortable as it sounds...
Any hints as to the model Python uses behind the scenes?
This may be surprising at first but upon a little reflection, it should seem obvious.
When you create an iterator from another iterator, there is no way to recover the original state over whatever underlying container you are iterating over (in this case, the range object). At least not in general.
Consider the simplest case of this: iter(something).
When something is an iterator, then according to the iterator protocol specification, iterator.__iter__ must:
Return the iterator object itself
In other words, if you've implemented the protocol correctly, then the following identity will always hold:
iter(iterator) is iterator
Of course, map could have some convention that would allow it to recover and create an independent iterator, but there is no such convention. In general, if you want to create independent iterators, you need to create them from the source.
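To make this concrete, here is a short Python 3 sketch (where map is lazy). The variable names are chosen for the illustration only.

```python
numbers = range(5)          # a re-iterable container, not an iterator
source = iter(numbers)      # an iterator over it

# An iterator's __iter__ returns the iterator itself.
same_object = iter(source) is source

# map() wraps `source`; it does not copy it, so both advance one shared position.
doubled = map(lambda x: 2 * x, source)
consumed = [next(source), next(source)]   # takes 0 and 1 from the shared position
next_doubled = next(doubled)              # pulls 2 from `source`, returns 4

# Independent iterators come from going back to the container.
independent_a = iter(numbers)
independent_b = map(lambda x: 2 * x, numbers)
next(independent_a)
first_of_b = next(independent_b)          # still starts at 0

print(same_object, consumed, next_doubled, first_of_b)  # True [0, 1] 4 0
```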
And of course, there are iterators where this really is not possible without storing all previous results. Consider:
import random
def random_iterator():
    while True:
        yield random.random()
In that case, how should map behave with the following?
iterator = random_iterator()
twice = map(lambda x: x*2, iterator)
OK, thanks to all the comments received (in less than 5 minutes!), I understood two related things. If I want two independent iterators, I won't use map to compute the second from the first:
>>> a_lot = (i for i in range(10**50))
>>> twice_a_lot = (2*i for i in range(10**50))
and I'll remember that map is lazy, because there's no other way that could make sense.
That was a nice SO lesson.
Thanks.

Python Multiprocessor and a list of variables to be passed to a function

Okay, so I've never used the python multiprocessing library, and I don't really know how to word my search. I read the docs for the library, and I have tried searching for examples of my problem and I couldn't find anything.
I have a list of file names (~2400), a dictionary (called cond, and is a global), and a function. I want to run my function on each processor, and each time the function is running it is using one of the file names as the variable. So I want it to be running 4 processes, 1 for each processor, and it should work its way through the list, when one function ends it carries onto the next item in the list, and each of those functions are going to be updating a single shared dictionary.
Pseudo-function code:
def PSC(fnom):
    f = open(fnom,"r")
    r = xml.dom.minidom.parse(f)
    cond[fnom] = otherfunc(r)
    f.close()
So, a) is it possible to use multiprocessing on this function, b) if it is, what method from the multiprocessing library would be able to handle it, and c) if you're extra nice, how do I iterate through a list, passing each item as an arg each time?
Musings about the way it would work (pseudo bulls*** code):
if __name__ == "__main__":
    name_list = name_list_func()
    method = multiprocessing.[method]() #no idea what method
    method.something(target=PSC, iter=name_list) #no idea either
This is easy, except for the "single shared dictionary" part. Processes don't share memory. That's a lie, but it's one you should believe at first ;-) I'm going to keep the dict in the main program here, because that's far more efficient than any actual way of sharing the dict across processes:
NUM_CPUS = None # defaults to all available cores
def PSC(fnom):
    return fnom, len(fnom)
if __name__ == "__main__":
    import multiprocessing as mp
    pool = mp.Pool(NUM_CPUS)
    list_of_strings = list("abcdefghijklm")
    cond = {}
    for fnom, result in pool.imap_unordered(PSC, list_of_strings):
        cond[fnom] = result
    pool.close()
    pool.join()
    print cond
That's code you can actually run. Plugging in file-opening cruft, XML parsing, etc, doesn't change any of what you need to get the multiprocessing part working right.
In current Python 3, this can be made a little simpler. The code here is for Python 2.
Note that instead of imap_unordered(), you could also use imap() or map(). imap_unordered() gives the implementation the most freedom to arrange things as efficiently as possible, although so far the implementation isn't really smart enough to take much advantage of that. Looking ahead ;-)
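For readers on Python 3, one possible simplification is sketched below: the pool is used as a context manager, so the explicit close() and join() calls go away. The names psc and build_cond are made up for this sketch, and the worker is the same placeholder as above rather than real XML parsing.

```python
import multiprocessing as mp


def psc(fnom):
    """Placeholder worker: returns the key and a result computed from it."""
    return fnom, len(fnom)


def build_cond(names, processes=None):
    """Run psc over names in a pool and collect the results in the parent."""
    cond = {}
    # Leaving the with-block terminates the pool; every result has already
    # been consumed by then because the for loop finishes inside the block.
    with mp.Pool(processes) as pool:
        for fnom, result in pool.imap_unordered(psc, names):
            cond[fnom] = result
    return cond


if __name__ == "__main__":
    print(build_cond(["a", "bb", "ccc"]))
```

The dictionary still lives only in the parent process. The workers just return (key, value) pairs, which keeps the design identical to the Python 2 version.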

Python - using a string in for in statement?

So I know this is a bit of a workaround and there's probably a better way to do this, but here's the deal. I've simplified the code from where it's gathering this info and just given solid values.
curSel = nuke.selectedNodes()
knobToChange = "label"
codeIn = "[value in]"
kcPrefix = "x"
kcStart = "['"
kcEnd = "']"
changerString = kcPrefix+kcStart+knobToChange+kcEnd
for x in curSel:
    changerString.setValue(codeIn)
But I get the error I figured I would, which is that a string has no attribute "setValue".
If I just type x['label'] instead of changerString, it works, but even though changerString says the exact same thing, it's being read as a string instead of code.
Any ideas?
It looks like you're looking for something to evaluate the string into a python object based on your current namespace. One way to do that would be to use the globals dictionary:
globals()['x']['label'].setValue(...)
In other words, globals()['x']['label'] is the same thing as x['label'].
Or to spell it out explicitly for your case:
globals()[kcPrefix][knobToChange].setValue(codeIn)
Others might suggest eval:
eval('x["label"]').setValue(...) #insecure and inefficient
but globals is definitely a better idea here.
Finally, when you want to do something like this, you're usually better off using a dictionary or some other sort of data structure in the first place to keep your data more organized.
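The underlying point is that x['label'] is an indexing operation, so the key can be any string value, including one held in a variable; no source text has to be assembled. Here is a minimal Python sketch of that idea. FakeKnob and FakeNode are hypothetical stand-ins invented for this illustration; they are not the Nuke API and only mimic the indexing and setValue calls used in the question.

```python
class FakeKnob:
    """Hypothetical stand-in for a knob object with a setValue method."""

    def __init__(self):
        self.value = None

    def setValue(self, value):
        self.value = value


class FakeNode:
    """Hypothetical stand-in for a node that is indexed by knob name."""

    def __init__(self, knob_names):
        self._knobs = {name: FakeKnob() for name in knob_names}

    def __getitem__(self, name):
        return self._knobs[name]


nodes = [FakeNode(["label", "note"]), FakeNode(["label"])]
knob_to_change = "label"   # a plain string held in a variable
code_in = "[value in]"

for node in nodes:
    # The string is used as a key; nothing is assembled into source code.
    node[knob_to_change].setValue(code_in)

labels = [node["label"].value for node in nodes]
print(labels)  # ['[value in]', '[value in]']
```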
Righto, there's two things you're falling afoul of. Firstly, in your original code where you are trying to do the setValue() call on a string you're right in that it won't work. Ideally use one of the two calls (x.knob('name_of_the_knob') or x['name_of_the_knob'], whichever is consistent with your project/facility/personal style) to get and set the value of the knob object.
From the comments, your code would look like this (my comments added for other people who aren't quite as au fait with Nuke):
# select all the nodes
curSel = nuke.selectedNodes()
# nuke.thisNode() returns the script's context
# i.e. the node from which the script was invoked
knobToChange = nuke.thisNode()['knobname'].getValue()
codeIn = nuke.thisNode()['codeinput'].getValue()
for x in curSel:
    x.knob(knobToChange).setValue(codeIn)
Using this sample UI with the values in the two fields as shown and the button firing off the script...
...this code is going to give you an error message of 'Nothing is named "foo"' when you execute it because the .getValue() call is actually returning you the evaluated result of the knob - which is the error message as it tries to execute the TCL [value foo], and finds that there isn't any object named foo.
What you should ideally do is instead invoke .toScript() which returns the raw text.
# select all the nodes
curSel = nuke.selectedNodes()
# nuke.thisNode() returns the script's context
# i.e. the node from which the script was invoked
knobToChange = nuke.thisNode()['knobname'].toScript()
codeIn = nuke.thisNode()['codeinput'].toScript()
for x in curSel:
    x.knob(knobToChange).setValue(codeIn)
You can sidestep this problem, as you've noted, by building up a string and adding in square brackets and so on as per your original code. But yes, it's a pain and a maintenance nightmare, and it starts you down the route of building objects up from strings (which @mgilson explains how to do with either globals() or eval()).

python: dict as prototype/template for another dict - std.lib way to solve this?

I was looking for a way to create multiple ad-hoc copies of a dictionary to hold some "evolutionary states", with just slight generational deviations and found this little prototype-dict:
class ptdict(dict):
    def asprototype(self, arg=None, **kwargs):
        clone = self.__class__(self.copy())
        if isinstance(arg, (dict, ptdict)):
            clone.update(arg)
        clone.update(self.__class__(kwargs))
        return clone
Basically I want something like:
generation0 = dict(propertyA="X", propertyB="Y")
generations = [generation0]
while not endofevolution():
    # prev. generation = template for next generation:
    nextgen = generations[-1].mutate(propertyB="Z", propertyC="NEW")
    generations.append(nextgen)
and so on.
I was wondering if the author of this class and I were missing something, because I just can't imagine that there's no standard-library approach for this. But neither collections nor itertools seemed to provide a similarly simple approach.
Can something like this be accomplished with itertools.tee?
Update:
It's not a question of copy and update, because that's exactly what this ptdict is doing. But update doesn't return a dict, whereas ptdict does, so I can for example chain results or do in-place tests, which would enhance readability quite a bit. (My example is maybe a bit too trivial, but I didn't want to confuse things with big matrices.)
I apologise for not having been precise enough. Maybe the following example clarifies why I'm interested in getting a dictionary with a single copy/update step:
nextgen = nextgen.mutate(inject_mutagen("A")) if nextgen.mutate(inject_mutagen("A")).get("alive") else nextgen.mutate(inject_mutagen("C"))
I guess you're looking for something like this:
first = {'x':1, 'y':100, 'foo':'bar'}
second = dict(first, x=2, y=200) # {'y': 200, 'x': 2, 'foo': 'bar'}
See the documentation for dict.
You can do it right away without custom types. Just use dict and instead of:
nextgen = generations[-1].mutate(propertyB="Z", propertyC="NEW")
do something like this:
nextgen = generations[-1].copy() # "clone" previous generation
nextgen.update(propertyB="Z", propertyC="NEW") # update properties of this gen.
and this should be enough, if you do not have nested dictionaries and do not need deep copy instead of simple copy.
The copy module contains functions for shallow and deep copying.
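Since the question asks for a single copy-and-update step that returns the new dictionary, the dict(first, x=2) idiom can be wrapped in a small helper. The sketch below is one way to do it; mutate is a hypothetical helper name borrowed from the question, not a standard-library function, and the copy it makes is shallow.

```python
def mutate(previous, **changes):
    """Return a new dict based on previous with changes applied (shallow copy)."""
    return dict(previous, **changes)


generation0 = dict(propertyA="X", propertyB="Y")
generations = [generation0]

for _ in range(2):
    # The previous generation serves as the template for the next one.
    generations.append(mutate(generations[-1], propertyB="Z", propertyC="NEW"))

# The expression form can be used inline, for example inside a conditional.
candidate = mutate(generation0, alive=False)
survivor = candidate if candidate.get("alive") else mutate(generation0, alive=True)

print(generations[-1])
print(survivor)
```

In Python 3.5 and later, {**previous, **changes} is an equivalent expression that also works when the keys are not valid keyword names.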

The use of id() in Python

Of what use is id() in real-world programming? I have always thought this function is there just for academic purposes. Where would I actually use it in programming?
I have been programming applications in Python for some time now, but I have never encountered any "need" for using id(). Could someone throw some light on its real world usage?
It can be used for creating a dictionary of metadata about objects:
For example:
someobj = int(1)
somemetadata = "The type is an int"
data = {id(someobj):somemetadata}
Now if I encounter this object somewhere else, I can find out whether metadata about it exists in O(1) time (instead of looping with is).
I use id() frequently when writing temporary files to disk. It's a very lightweight way of getting a pseudo-random number.
Let's say that during data processing I come up with some intermediate results that I want to save off for later use. I simply create a file name using the pertinent object's id.
fileName = "temp_results_" + str(id(self))
Although there are many other ways of creating unique file names, this is my favorite. In CPython, the id is the memory address of the object. Thus, among objects that are alive at the same time, I'm guaranteed to never have a naming collision. That's all for the cost of 1 address lookup. The other methods that I'm aware of for getting a unique string are much more intense.
A concrete example would be a word-processing application where each open document is an object. I could periodically save progress to disk with multiple files open using this naming convention.
Anywhere where one might conceivably need id() one can use either is or a weakref instead. So, no need for it in real-world code.
The only time I've found id() useful outside of debugging or answering questions on comp.lang.python is with a WeakValueDictionary, that is a dictionary which holds a weak reference to the values and drops any key when the last reference to that value disappears.
Sometimes you want to be able to access a group (or all) of the live instances of a class without extending the lifetime of those instances and in that case a weak mapping with id(instance) as key and instance as value can be useful.
However, I don't think I've had to do this very often, and if I had to do it again today then I'd probably just use a WeakSet (but I'm pretty sure that didn't exist last time I wanted this).
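A minimal Python 3 sketch of the weak-mapping idea described above. The Document class and its live_names method are invented for the illustration, and the immediate disappearance of the entry after del relies on CPython's reference counting.

```python
import gc
import weakref


class Document:
    """Hypothetical class whose live instances we want to enumerate."""

    _live = weakref.WeakValueDictionary()  # id(instance) -> instance

    def __init__(self, name):
        self.name = name
        Document._live[id(self)] = self

    @classmethod
    def live_names(cls):
        return sorted(doc.name for doc in cls._live.values())


a = Document("a")
b = Document("b")
names_before = Document.live_names()

del b            # drop the last strong reference to the second instance
gc.collect()     # not needed in CPython for this case, but harmless
names_after = Document.live_names()

print(names_before)  # ['a', 'b']
print(names_after)   # ['a']
```

Using id(instance) as the key works even when the instances are not hashable, and the registry never keeps an instance alive on its own.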
In one program I used it to compute the intersection of lists of non-hashables, like:
from functools import reduce # built in on Python 2, needs this import on Python 3
def intersection(*lists):
    id_row_row = {} # id(row):row
    key_id_row = {} # key:set(id(row))
    for key, rows in enumerate(lists):
        key_id_row[key] = set()
        for row in rows:
            id_row_row[id(row)] = row
            key_id_row[key].add(id(row))
    from operator import and_
    def intersect(sets):
        if len(sets) > 0:
            return reduce(and_, sets)
        else:
            return set()
    seq = [ id_row_row[id_row] for id_row in intersect( key_id_row.values() ) ]
    return seq
