Why use a generator object in this particular case? - python

I was looking at a bit of code I downloaded from the internet. It's for a basic webcrawler. I came across the following for loop:
for link in (links.pop(0) for _ in xrange(len(links))):
...
Now, I feel the following code will also work:
for link in links:
....
links=[]
Researching, I found out that the first version empties links as a side effect while producing a generator object (genexpr). Since links itself is never referenced inside the loop body, its shrinking length has no effect on the rest of the code.
Is there any particular reason for using xrange and popping an element each time? That is, is there any advantage to using a generator object over iterating the plain list? More generally, in what cases is a generator useful, and why?

It's hard to see any justification for the code you quoted.
The only thing I can think of is that the objects in links might be large, or otherwise associated with scarce resources, and so it might be important to free them as soon as possible (rather than waiting until the end of the loop to free them all). But (a) if so, it would be better to process each link as you created it (perhaps using a generator to organize the code), instead of building up the whole list of links before starting to process it; and (b) even if you had no choice but to build up the whole list before processing it, it would be cheaper to clear each list entry than to pop the list:
for i, link in enumerate(links):
    links[i] = None
    ...
(Popping the first element off a list with n items takes O(n), although in practice it will be fairly fast since it's implemented using memmove.)
Even if you absolutely insisted on repeatedly popping a list as you iterated across it, it would be better to write the loop like this:
while links:
    link = links.pop(0)
    ...
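Going back to point (a), here's a rough sketch of what I mean by using a generator to organize the code; find_anchors, page and process are hypothetical stand-ins for whatever the crawler actually uses:

def iter_links(page):
    # Yield each link as it's found, instead of accumulating them in a list.
    for anchor in find_anchors(page):   # hypothetical parsing helper
        href = anchor.get('href')
        if href:
            yield href

for link in iter_links(page):
    process(link)   # each link can be freed as soon as it's been handled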

The purpose of generators is to avoid building large collections of intermediate objects that won't serve any external use.
If all the code is doing is building the set of links on a page, the second snippet is fine. But perhaps what's wanted is the set of root website names (e.g. google.com rather than google.com/q=some_search_term...). If that's the case, you'd take the list of links and then go through the full list, stripping out just the first part.
It's for this second stripping portion where you'd gain more by using a generator. Rather than having needlessly built a list of links which takes memory and time to build, you can now pass through each link one-by-one, getting the website name without a big intermediary list of all links.
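A minimal sketch of that, assuming the links are full URLs (in Python 2 the import comes from the urlparse module; in Python 3 it lives in urllib.parse):

from urlparse import urlparse   # Python 3: from urllib.parse import urlparse

def site_names(links):
    # Yield just the host name of each link, one at a time.
    for link in links:
        yield urlparse(link).netloc

for name in site_names(["http://google.com/q=some_search_term",
                        "http://example.com/page"]):
    print(name)   # google.com, then example.com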

Related

Should I create a copy when iterating over a list using enumerate in Python

When answering this question, I came across something I never thought about in Python (pointed by a user).
Basically, I already know (here's an interesting thread about it) that I have to make a copy when iterating while mutating a list in Python in order to avoid strange behaviors.
Now, my question is, does using enumerate overcome that problem?
test_list = [1, 2, 3, 4]
for index, item in enumerate(test_list):
    if item == 1:
        test_list.pop(index)
Would this code be considered safe, or should I use:
for index,item in enumerate(test_list[:]):
First, let’s answer your direct question:
enumerate doesn’t help anything here. It works as if it held an iterator to the underlying iterable (and, at least in CPython, that’s exactly what it does), so anything that wouldn’t be legal or safe to do with a list iterator isn’t legal or safe to do with an enumerate object wrapped around that list iterator.
Your original use case—setting test_list[index] = new_value—is safe in practice, but I’m not sure whether it’s guaranteed to be safe.
Your new use case—calling test_list.pop(index)—is probably not safe.
The most obvious implementation of a list iterator is basically just a reference to the list and an index into that list. So, if you insert or delete at the current position, or to the left of that position, you’re definitely breaking the iterator. For example, if you delete lst[i], that shifts everything from i + 1 to the end up one position, so when you move on to i + 1, you’re skipping over the original i + 1th value, because it’s now the ith. But if you insert or delete to the right of the current position, that’s not a problem.
Since test_list.pop(index) deletes at or to the left of the current position, it's not safe even with this implementation. (Of course, if you've carefully written your algorithm so that skipping over the value after a hit never matters, maybe that's fine. But most algorithms won't handle that.)
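A quick demonstration of that skipping with a loop like the one in the question:

lst = [1, 1, 2, 3]
for i, x in enumerate(lst):
    if x == 1:
        lst.pop(i)   # shifts everything left, so the next value gets skipped
print(lst)           # [1, 2, 3] -- the second 1 was never visited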
It’s conceivable that a Python implementation could instead store a raw pointer to the current position in the array used for the list’s storage. That would mean that inserting anywhere could break the iterator, because an insert can cause the whole list to be reallocated to new memory. And so could deleting anywhere, if the implementation sometimes reallocates lists on shrinking. I don't think the language reference disallows implementations that do any of this, so if you want to be paranoid, it may be safer to just never insert or delete while iterating.
If you’re just replacing an existing value, it’s hard to imagine how that could break the iterator under any reasonable implementation. But, as far as I'm aware, the language reference and list library reference1 don't actually make any promises about the implementation of list iterators.2
So, it's up to you whether you care about "safe in my implementation", "safe in every implementation ever written to date", "safe in every conceivable (to me) implementation", or "guaranteed safe by the reference".
I think most people happily replace list items during iteration, but avoid shrinking or growing the list. However, there's definitely production code out there that at least deletes to the right of the iterator.
1. I believe the tutorial just says somewhere to never modify any data structure while iterating over it—but that’s the tutorial. It’s certainly safe to always follow that rule, but it may also be safe to follow a less strict rule.
2. Except that if the key function or anything else tries to access the list in any way in the middle of a sort, the result is undefined.
Since it was my comment which led to this, I'll add my follow-up:
enumerate can be thought of as a generator: it takes any sequence (or iterator, actually) and "generates" an incrementing index that is yielded along with each item from the passed sequence. So it isn't making a copy or mutating the list; it just wraps it in an "enumerate object".
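In other words, enumerate behaves roughly like this pure-Python sketch (the real implementation is in C):

def my_enumerate(iterable, start=0):
    index = start
    for item in iterable:
        yield index, item   # no copy of the underlying sequence is made
        index += 1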
In the case of the code in that question, you were never changing the length of the list you were iterating over, and once the if statement ran, the value of the element no longer mattered. So the copy wasn't needed. It would be needed when an element is removed, because the iterator's index is shared with the list and doesn't account for removed elements.
The Python Ninja has a great example of when you should use a copy (or move to list comprehension)
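For completeness, the two usual safe patterns for removing items, sketched with the question's own test_list:

test_list = [1, 2, 3, 4]

# Iterate over a copy, so mutating the real list can't confuse the iterator.
for item in test_list[:]:
    if item == 1:
        test_list.remove(item)

# Or sidestep mutation entirely and build a new list.
test_list = [item for item in test_list if item != 1]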

Is fetch() better than list(Model.all().run()) for returning a list from a datastore query?

Using Google App Engine Python 2.7 Query Class -
I need to produce a list of results that I pass to my django template. There are two ways I've found to do this.
Use fetch(); however, the docs say that fetch should almost never be used. https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_fetch
Use run() and then wrap it in list(), thereby creating the list object.
Is one preferable to the other in terms of memory usage? Is there another way I could be doing this?
The key here is why fetch “should almost never be used”. The documentation says that fetch will get all the results, therefore having to keep all of them in memory at the same time. If the data you get is big, you will need lots of memory.
You say you can wrap run inside list. Sure, you can do that, but you will hit exactly the same problem—list will force all the elements into memory. So, this solution is actually discouraged on the same basis as using fetch.
Now, you could say: so what should I do? The answer is: in most cases you can deal with elements of your data one by one, without keeping them all in memory at the same time. For example, if all you need is to put the result data into a django template, and you know that it will be used at most once in your template, then the django template will happily take any iterator—so you can pass the run call result directly without wrapping it into list.
Similarly, if you need to do some processing, for example go over the results to find the element with the highest price or ranking, or whatever, you can just iterate over the result of run.
But if your usage requires having all the elements in memory (e.g. your django template uses the data from the query several times), then you have a case where fetch or list(run(…)) actually makes sense. In the end, this is just the typical trade-off: if your application needs an algorithm that requires all the data in memory, you have to pay for it with memory. So, you can either redesign your algorithms and usage to work with an iterator, or use fetch and pay for it with longer processing times and higher memory usage. Google of course encourages you to do the first thing. And this is what “should almost never be used” actually means.
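A rough sketch of the difference, using the old db-style API from the question; MyModel and its price property are hypothetical stand-ins for the real model:

query = MyModel.all()

# run() returns an iterator: entities are streamed in batches, so you can
# process them one at a time without keeping the whole result set in memory.
most_expensive = max(query.run(), key=lambda entity: entity.price)

# fetch() (or list() around run()) materialises everything up front; only
# worth it if you genuinely need all the entities in memory at once.
all_entities = MyModel.all().fetch(1000)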

How to add a track to an iTunes playlist using Python and Scripting Bridge

I learned how to create a playlist in a previous question, but now I can't figure out how to add tracks to it. Right now I have:
tracks.sort(key=lambda tup: tup[0])
i = 0
for trackList in generatePlaylists(tracks, 10):
    i += 1
    playlistname = str(i)
    p = {'name': playlistname}
    playlist = iTunes.classForScriptingClass_("playlist").alloc().initWithProperties_(p)
    iTunes.sources()[0].playlists().insertObject_atIndex_(playlist, 0)
    # Find the playlist I just made
    for playlist in iTunes.sources()[0].playlists():
        if playlist.name() == playlistname:
            newPlaylist = playlist
    # Add the tracks to it
    for track in trackList:
        print track[1].name()
        iTunes.add_to_(track[1], newPlaylist)
My tracks are in a list of tuples tracks, where the first element of the tuple is a score and the second is the actual track object. generatePlaylists is an iterator which splits all library tracks into 10 lists.
The above code runs without error, but in iTunes the playlists are empty.
First, here's the short answer:
track.duplicateTo_(newPlaylist)
The problem is that iTunes.add_to_ sends the add command, which takes a file (alias) and imports it into a playlist; you want to send the duplicate command, which takes any object and makes another copy of the object. You don't have a file, you have a track. (You could get a file via track.location(), but you don't want to re-import the file, just copy the track over.)
Also, in this case, you need to call the method on the track, rather than calling it on the app and passing it the track.
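Applied to the inner loop from the question, that looks roughly like this:

# Add the tracks to it, sending the duplicate command to the track itself
for track in trackList:
    print track[1].name()
    track[1].duplicateTo_(newPlaylist)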
The first half of this is hard to explain without a solid understanding of the iTunes object model (and the AE model underneath it). But you don't really need to understand it. In most cases, by looking over the iTunes scripting dictionary (in AppleScript Editor) and through trial and error (in AppleScript Editor or with py-appscript) you can figure out what you want. (Just make sure you're working on a scrap library, or have a backup…) In this case, the only commands it could possibly be are add, copy, duplicate, or move, so just try them all and see what they do. Or, alternatively, go to dougscripts and download a bunch of samples and find one that does what you want.
The second half of this, figuring out how to translate to ScriptingBridge… well, I can't explain it without going into a long rant on SB (which hhas does much better than me, if you want to read one). But the basics are this: As far as iTunes is concerned, duplicate is a command. If you give it a direct object (tell application "iTunes" to duplicate theTrack to thePlaylist) it'll use that; if not, you're asking the subject to duplicate itself (tell theTrack to duplicate to thePlaylist). It works exactly like English. But SB insists on an object-oriented model, where duplicate is a method on some object. So, only one of those two forms is going to work. In general, you can figure out which by just looking at dir(iTunes) and dir(track) to see which one has a method that looks like the command you want.
As you can tell from the above, you've got a lot of trial and error ahead of you if you're trying to do anything complicated. Good luck, and keep asking.
PS, I have no idea why your code fails silently. The obvious way the add_to_ method should translate into a command should raise a -1708 error (as appscript iTunes.add(track, to=newPlaylist) or AppleScript add theTrack to newPlaylist both do…).

I found myself swinging the list comprehension hammer

... and every for-loop looked like a list comprehension.
Instead of:
for stuff in all_stuff:
do(stuff)
I was doing (not assigning the list to anything):
[ do(stuff) for stuff in all_stuff ]
This is a common pattern found on list-comp how-to's. 1) OK, so no big deal right? Wrong. 2) Can't this just be code style? Super wrong.
1) Yeah, that was wrong. As NiklasB points out, the point of the HowTos is to build up a new list.
2) Maybe, but it's not obvious and explicit, so better not to use it.
I didn't keep in mind that those how-to's were largely command-line based. After my team yelled at me wondering why the hell I was building up massive lists and then letting them go, it occurred to me that I might be introducing a major memory-related bug.
So here are my questions. If I were to do this in a very long-running process where lots of data was being consumed, would this "list" just keep consuming my memory until it's let go? When will the garbage collector claim the memory back? After the scope the list is built in is lost?
My guess is yes, it will keep consuming my memory. I don't know how the python garbage collector works, but I would venture to say that this list will exist until after the last next is called on all_stuff.
EDIT.
The essence of my question is relayed much cleaner in this question (thanks for the link Niklas)
If I were to do this in a very long running process, where lots of data was being consumed, would this "list" just continue consuming my memory until let go?
Absolutely.
When will the garbage collector claim the memory back? After the scope this list is built in is lost?
CPython uses reference counting, so that is the most likely case. Other implementations work differently, so don't count on it.
Thanks to Karl for pointing out that due to the complex memory management mechanisms used by CPython this does not mean that the memory is immediately returned to the OS after that.
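A tiny illustration of that reference-counting behaviour in CPython, using a weak reference to observe when the object actually dies:

import weakref

class Blob(object):
    pass

b = Blob()
r = weakref.ref(b)
print(r())   # <Blob object at ...> -- still alive
del b        # last reference gone; CPython finalises it immediately
print(r())   # None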
I don't know how the python garbage collector works, but I would venture to say that this list will exist until after the last next is called on all_stuff.
I don't think any garbage collector works like that. Usually they use mark-and-sweep, so it could be quite some time before the list is garbage collected.
This is a common pattern found on list-comp how-to's.
Absolutely not. The point is that you iterate the list with the purpose of doing something with every item (do is called for its side effects). In all the examples of the List-comp HOWTO, the list is iterated to build up a new list based on the items of the old one. Let's look at an example:
# list comp, creates the list [0,1,2,3,4,5,6,7,8,9]
[i for i in range(10)]
# loop, does nothing
for i in range(10):
    i  # meh, just an expression which doesn't have an effect
Maybe you'll agree that this loop is utterly senseless, as it doesn't do anything, in contrast to the comprehension, which builds a list. In your example, it's the other way round: the comprehension is completely senseless, because you don't need the list! You can find more information about the issue in a related question.
By the way, if you really want to write that loop in one line, use a generator consumer like deque.extend. This will be slightly slower than a raw for loop in this simple example, though:
>>> from collections import deque
>>> consume = deque(maxlen=0).extend
>>> consume(do(stuff) for stuff in all_stuff)
Try manually doing GC and dumping the statistics.
gc.DEBUG_STATS
Print statistics during collection. This information can be useful when tuning the collection frequency.
FROM
http://docs.python.org/library/gc.html
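A minimal sketch of using it:

import gc

gc.set_debug(gc.DEBUG_STATS)   # print per-generation statistics on each collection
gc.collect()                   # force a collection and watch the output on stderr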
The CPython GC will reap it once there are no references to it outside of a cycle. Jython and IronPython follow the rules of the underlying GCs.
If you like that idiom, if do returns something that always evaluates to either True or False, and if you'd consider a similar alternative without the ugly side effects, you can use a generator expression combined with either any or all.
For functions that return False values (or don't return):
any(do(stuff) for stuff in all_stuff)
For functions that return True values:
all(do(stuff) for stuff in all_stuff)
I don't know how the python garbage collector works, but I would venture to say that this list will exist until after the last next is called on all_stuff.
Well, of course it will, since you're building a list that will have the same number of elements as all_stuff. The interpreter can't discard the list before it's finished, can it? You could call gc.collect between one of these loops and the next, but each list will be fully constructed before it can be reclaimed.
In some cases you could use a generator expression instead of a list comprehension, so it doesn't have to build a list with all your values:
(do_something(i) for i in xrange(1000))
However, you'd still have to "exhaust" that generator in some way...

Are there memory efficiencies gained when code is wrapped in functions?

I have been working on some code. My usual approach is to first solve all of the pieces of the problem, creating the loops and other pieces of code I need as I work through it, and then, if I expect to reuse the code, I go back through it and group the parts that I think belong together into functions.
I have just noticed that creating functions and calling them seems to be much more efficient than writing lines of code and deleting containers as I am finished with them.
for example:
def someFunction(aList):
    # do things to aList
    # that create a dictionary
    return aDict
seems to release more memory at the end than
>> do things to aList
>> that create a dictionary
>> del(aList)
Is this expected behavior?
EDIT added example code
When this function finishes running, the PF usage shows an increase of about 100 MB; the filingList has about 8 million lines.
def getAllCIKS(filingList):
    cikDICT = defaultdict(int)
    for filing in filingList:
        if filing.startswith('.'):
            del(filing)
            continue
        cik = filing.split('^')[0].strip()
        cikDICT[cik] += 1
        del(filing)
    ciklist = cikDICT.keys()
    ciklist.sort()
    return ciklist

allCIKS = getAllCIKS(open(r'c:\filinglist.txt').readlines())
If I run this instead, I see an increase of almost 400 MB:
cikDICT = defaultdict(int)
for filing in open(r'c:\filinglist.txt').readlines():
    if filing.startswith('.'):
        del(filing)
        continue
    cik = filing.split('^')[0].strip()
    cikDICT[cik] += 1
    del(filing)
ciklist = cikDICT.keys()
ciklist.sort()
del(cikDICT)
EDIT
I have been playing around with this some more today. My observation and question should be refined a bit, since my focus has been on the PF usage. Unfortunately I can only poke at this between my other tasks. However, I am starting to wonder about references versus copies. If I create a dictionary from a list, does the dictionary container hold a copy of the values that came from the list, or does it hold references to the values in the list? My bet is that the values are copied instead of referenced.
Another thing I noticed is that items in the GC list were items from containers that had been deleted. Does that make sense? So I have a list, and suppose each of the items in the list was [(aTuple), anInteger, [another list]]. When I started learning how to manipulate and inspect the gc objects, I found those objects in the gc even though the list had been forcefully deleted, and even though I passed the 0, 1 & 2 values to the method (whose name I don't remember) to try to delete them as well.
I appreciate the insights people have been sharing. Unfortunately I am always interested in figuring out how things work under the hood.
Maybe you used some local variables in your function, which are implicitly released by reference counting at the end of the function, while they are not released at the end of your code segment?
You can use the Python garbage collector interface provided to more closely examine what (if anything) is being left around in the second case. Specifically, you may want to check out gc.get_objects() to see what is left uncollected, or gc.garbage to see if you have any reference cycles.
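A small sketch of what that inspection might look like after running the inline version of the code:

import gc

gc.collect()                   # force a full collection first
print(len(gc.get_objects()))   # how many objects the collector is tracking
print(gc.garbage)              # unreachable objects it could not free (reference cycles)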
Some extra memory is freed when you return from a function, but that's exactly as much extra memory as was allocated to call the function in the first place. In any case, if you're seeing a large difference, that's likely an artifact of the state of the runtime and not something you should really be worrying about. If you are running low on memory, the way to solve the problem is to keep more data on disk using things like B-trees (or just use a database), or to use algorithms that use less memory. Also, keep an eye out for making unnecessary copies of large data structures.
The real memory savings in creating functions is in your short-term memory. By moving something into a function, you reduce the amount of detail you need to remember by encapsulating part of the minutia away.
Maybe you should re-engineer your code to get rid of unnecessary variables (that may not be freed instantly)... how about the following snippet?
myfile = file(r"c:\filinglist.txt")
ciklist = sorted(set(x.split("^")[0].strip() for x in myfile if not x.startswith(".")))
EDIT: I don't know why this answer was voted negative... Maybe because it's short? Or maybe because the dude who voted was unable to understand how this one-liner does the same as the code in the question without creating unnecessary temporary containers?
Sigh...
I asked another question about copying lists, and the answers, particularly the one directing me to look at deepcopy, caused me to think about some dictionary behavior. The problem I was experiencing had to do with the fact that the original list is never garbage collected, because the dictionary maintains references to it. I need to use the information about weakref in the Python docs.
Objects referenced by dictionaries seem to stay alive. I think (but am not sure) that pushing the dictionary out of the function forces the copy process and kills the object. This is not complete; I need to do some more research.
