Generator expressions vs. list comprehensions (Python)
When should you use generator expressions and when should you use list comprehensions in Python?
# Generator expression
(x*2 for x in range(256))
# List comprehension
[x*2 for x in range(256)]
John's answer is good (that list comprehensions are better when you want to iterate over something multiple times). However, it's also worth noting that you should use a list if you want to use any of the list methods. For example, the following code won't work:
def gen():
    return (something for something in get_some_stuff())

print(gen()[:2])       # generators don't support indexing or slicing
print([5, 6] + gen())  # generators can't be added to lists
Basically, use a generator expression if all you're doing is iterating once. If you want to store and use the generated results, then you're probably better off with a list comprehension.
Since performance is the most common reason to choose one over the other, my advice is to not worry about it and just pick one; if you find that your program is running too slowly, then and only then should you go back and worry about tuning your code.
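For instance, a quick sketch of the one-pass behaviour (the numbers here are arbitrary):
squares_gen = (x*x for x in range(5))
squares_list = [x*x for x in range(5)]
print(sum(squares_gen))   # 30
print(sum(squares_gen))   # 0 -- the generator is already exhausted
print(sum(squares_list))  # 30
print(sum(squares_list))  # 30 -- the list can be iterated again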
Iterating over the generator expression or the list comprehension will do the same thing. However, the list comprehension will create the entire list in memory first while the generator expression will create the items on the fly, so you are able to use it for very large (and also infinite!) sequences.
Use list comprehensions when the result needs to be iterated over multiple times, or where speed is paramount. Use generator expressions where the range is large or infinite.
See Generator expressions and list comprehensions for more info.
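As a rough sketch of the "infinite" case (itertools.count gives an endless stream; the cutoff of 100 is arbitrary):
from itertools import count

doubled = (x*2 for x in count())   # conceptually infinite, nothing is computed yet
for value in doubled:
    if value > 100:
        break
    print(value)
The equivalent list comprehension, [x*2 for x in count()], would never finish building its list.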
The important point is that the list comprehension creates a new list. The generator expression creates an iterable object that will "filter" the source material on the fly as you consume it.
Imagine you have a 2TB log file called "hugefile.txt", and you want the content and length for all the lines that start with the word "ENTRY".
So you start out by writing a list comprehension:
logfile = open("hugefile.txt","r")
entry_lines = [(line,len(line)) for line in logfile if line.startswith("ENTRY")]
This slurps up the whole file, processes each line, and stores the matching lines in a list. The list could therefore contain up to 2 TB of content. That's a lot of RAM, and probably not practical for your purposes.
So instead we can use a generator to apply a "filter" to our content. No data is actually read until we start iterating over the result.
logfile = open("hugefile.txt","r")
entry_lines = ((line,len(line)) for line in logfile if line.startswith("ENTRY"))
Not even a single line has been read from our file yet. In fact, say we want to filter our result even further:
long_entries = ((line,length) for (line,length) in entry_lines if length > 80)
Still nothing has been read, but we've specified now two generators that will act on our data as we wish.
Let's write our filtered lines out to another file:
outfile = open("filtered.txt", "a")
for entry, length in long_entries:
    outfile.write(entry)
Now we read the input file. As our for loop continues to request additional lines, the long_entries generator demands lines from the entry_lines generator, returning only those whose length is greater than 80 characters. And in turn, the entry_lines generator requests lines (filtered as indicated) from the logfile iterator, which in turn reads the file.
So instead of "pushing" data to your output function in the form of a fully-populated list, you're giving the output function a way to "pull" data only when it's needed. In our case this is much more efficient, but not quite as flexible. Generators are one-way and single-pass; the data from the log file gets discarded as soon as we've read it, so we can't go back to a previous line. On the other hand, we don't have to worry about keeping data around once we're done with it.
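Putting the whole pipeline together, a minimal sketch (it only adds with blocks so both files get closed; the file names are the same placeholders as above):
with open("hugefile.txt", "r") as logfile, open("filtered.txt", "a") as outfile:
    entry_lines = ((line, len(line)) for line in logfile if line.startswith("ENTRY"))
    long_entries = ((line, length) for (line, length) in entry_lines if length > 80)
    for entry, length in long_entries:
        outfile.write(entry)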
The benefit of a generator expression is that it uses less memory since it doesn't build the whole list at once. Generator expressions are best used when the list is an intermediary, such as summing the results, or creating a dict out of the results.
For example:
sum(x*2 for x in range(256))
dict((k, some_func(k)) for k in some_list_of_keys)
The advantage there is that the list is never completely materialized, so little memory is used (and it should also be faster).
You should, though, use a list comprehension when the desired final product is a list. You are not going to save any memory using a generator expression, since you want the resulting list anyway. You also get the benefit of being able to use list operations such as slicing and reversed() (note that sorted() accepts any iterable, but reversed() needs an actual sequence).
For example:
reversed([x*2 for x in range(256)])
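For instance, a rough sketch of the difference:
gen = (x*2 for x in range(256))
reversed(gen)                                  # raises TypeError -- a generator is not reversible
list(reversed([x*2 for x in range(256)]))[:3]  # [510, 508, 506]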
When creating a generator from a mutable object (such as a list), be aware that the generator is evaluated against the state of the object at the time the generator is consumed, not at the time it is created:
>>> mylist = ["a", "b", "c"]
>>> gen = (elem + "1" for elem in mylist)
>>> mylist.clear()
>>> for x in gen: print(x)
# nothing
If there is any chance of your list being modified (or of a mutable object inside that list changing), but you need the state at the time the generator is created, use a list comprehension instead.
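For contrast, a list comprehension snapshots the values right away (same toy list as above):
>>> mylist = ["a", "b", "c"]
>>> snapshot = [elem + "1" for elem in mylist]
>>> mylist.clear()
>>> snapshot
['a1', 'b1', 'c1']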
Python 3.7:
List comprehensions are faster.
Generators are more memory efficient.
As everyone else has said, if you need to handle very large or infinite data, you'll eventually need a generator. For relatively small and medium-sized jobs where speed matters, a list comprehension is best.
Sometimes you can get away with the tee function from itertools; it returns multiple independent iterators over the same generator.
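A small sketch of what that looks like (keep in mind that tee buffers items internally if its copies are consumed at very different rates):
from itertools import tee

gen = (x*2 for x in range(5))
first, second = tee(gen)
print(sum(first))   # 20
print(max(second))  # 8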
I'm using the Hadoop Mincemeat module. I think this is a great example to take note of:
import mincemeat

def mapfn(k, v):
    for w in v:
        yield 'sum', w
        # yield 'count', 1

def reducefn(k, v):
    r1 = sum(v)
    r2 = len(v)
    print(r2)
    m = r1 / r2
    std = 0
    for i in range(r2):
        std += pow(abs(v[i] - m), 2)
    res = pow((std / r2), 0.5)
    return r1, r2, res
Here the generator gets numbers out of a text file (up to 15 GB in size) and applies simple math to those numbers using Hadoop's map-reduce. If I had not used yield, but instead built a list, it would have taken much longer to calculate the sums and the average (not to mention the space complexity).
Hadoop is a great example of a setting that benefits from all the advantages of generators.
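Outside of Hadoop, the same streaming idea can be sketched with a plain generator function (the file name and the one-number-per-line format are just assumptions for the sketch):
def numbers(path):
    # yields one float at a time; the whole file is never held in memory
    with open(path) as f:
        for line in f:
            yield float(line)

total = 0.0
count = 0
for n in numbers("numbers.txt"):   # hypothetical input file
    total += n
    count += 1
mean = total / count if count else 0.0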
Some notes on built-in Python functions:
Use a generator expression if you need to exploit the short-circuiting behaviour of any or all. These functions are designed to stop iterating when the answer is known, but a list comprehension must evaluate every element before the function can be called.
For example, if we have
from time import sleep

def long_calculation(value):
    sleep(1)  # for simulation purposes
    return value == 1
then any([long_calculation(x) for x in range(10)]) takes about ten seconds, as long_calculation will be called for every x. any(long_calculation(x) for x in range(10)) takes only about two seconds, since long_calculation will only be called with 0 and 1 inputs.
When any and all iterate over the list comprehension, they will still stop checking elements for truthiness once an answer is known (as soon as any finds a true result, or all finds a false one); however, this is usually trivial compared to the actual work done by the comprehension.
Generator expressions are of course more memory efficient, when it's possible to use them. List comprehensions will be slightly faster with the non-short-circuiting min, max and sum (timings for max shown here):
$ python -m timeit "max(_ for _ in range(1))"
500000 loops, best of 5: 476 nsec per loop
$ python -m timeit "max([_ for _ in range(1)])"
500000 loops, best of 5: 425 nsec per loop
$ python -m timeit "max(_ for _ in range(100))"
50000 loops, best of 5: 4.42 usec per loop
$ python -m timeit "max([_ for _ in range(100)])"
100000 loops, best of 5: 3.79 usec per loop
$ python -m timeit "max(_ for _ in range(10000))"
500 loops, best of 5: 468 usec per loop
$ python -m timeit "max([_ for _ in range(10000)])"
500 loops, best of 5: 442 usec per loop
List comprehensions are eager but generators are lazy.
With a list comprehension, all the objects are created right away, so it takes longer to create and return the list. With a generator expression, object creation is delayed until an item is requested with next(); the generator object itself is created and returned immediately.
Iteration over the list comprehension is faster because the objects have already been created.
If you iterate over all the elements, the overall time is about the same for both. Even though the generator expression returns its generator object right away, it does not create all the elements up front; each time you ask for a new element, it creates and returns it.
But if you do not iterate through all the elements, the generator is more efficient. Say you build a list comprehension with millions of items but only use 10 of them: you still have to create all the millions of items, wasting time on millions of calculations (or millions of API requests) to use only 10. Because generator expressions are lazy, they only do the calculations or API calls that are actually requested, so a generator expression is more efficient here.
With a list comprehension, the entire collection is held in memory. A generator expression, once it has returned a value from a next() call, is done with it and does not need to keep it in memory; only a single item is live at a time. If you are iterating over a huge file on disk, a list might not fit in memory at all, so a generator expression is the more efficient (and sometimes the only practical) choice.
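For instance, a rough sketch of the "only need 10 out of millions" case (expensive_calculation is just a stand-in for whatever per-item work or API call you actually do):
from itertools import islice

def expensive_calculation(x):
    return x * x   # stand-in for real work or an API call

lazy = (expensive_calculation(x) for x in range(10_000_000))
first_ten = list(islice(lazy, 10))   # only 10 calculations are actually performed
print(first_ten)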
There is something that I think most of the answers have missed. A list comprehension builds the entire list in memory; if that list is extremely large, your process can run out of memory and be killed. A generator is preferable in that case, because its values are not all held in memory at once: it is a stateful iterator that produces one value at a time. Creation speed also differs; a list comprehension takes longer to create than the equivalent generator expression, which returns its generator object immediately.
In short: use a list comprehension when the result is not excessively large, otherwise use a generator expression.
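A quick illustration of the memory difference (sys.getsizeof measures only the container or generator object itself, and exact numbers vary by Python version and platform):
import sys

big_list = [x for x in range(1_000_000)]
big_gen = (x for x in range(1_000_000))
print(sys.getsizeof(big_list))  # on the order of several megabytes
print(sys.getsizeof(big_gen))   # a couple of hundred bytes, regardless of the range size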
For functional programming, we want to use as little indexing as possible. If we want to keep using the remaining elements after taking a first slice, islice() is a good choice, because the iterator's state is preserved.
from itertools import islice

def slice_and_continue(sequence):
    seq_i = iter(sequence)        # create an iterator from the list
    seq_slice = islice(seq_i, 3)  # take the first 3 elements and print them
    for x in seq_slice:
        print(x, end=" ")
    for x in seq_i:
        print(x**2, end=" ")      # square the rest of the numbers

slice_and_continue([1, 2, 3, 4, 5])
Output: 1 2 3 16 25
Related
python optimization: [a, b, c] > [str(a), str(b), str(c)]
I need to turn a list of various entities into strings. So far I use:
all_ents_dead = []  # converted to strings
for i in all_ents:
    all_ents_dead.append(str(i))
Is there an optimized way of doing that?
EDIT: I then need to find which of these contain a certain string. So far I have:
matching = [s for s in all_ents_dead if "GROUPS" in s]
Whenever you have a name = [], then name.append() in a loop pattern, consider using a list comprehension. A list comprehension builds a list from a loop without having to use list.append() lookups and calls, making it faster:
all_ents_dead = [str(i) for i in all_ents]
This directly echoes the code you had, but with the expression inside all_ents_dead.append(...) moved to the front of the for loop.
If you don't actually need a list, but only need to iterate over the str() conversions, you should consider lazy conversion options. You can turn the list comprehension into a generator expression:
all_ents_dead = (str(i) for i in all_ents)
or, when only applying a function, the faster alternative is the map() function:
all_ents_dead = map(str, all_ents)  # assuming Python 3
both of which lazily apply str() as you iterate over the resulting object. This helps avoid creating a new list object where you don't actually need one, saving on memory. Do note that a generator expression can be slower, however; if performance is at stake, consider all options based on input sizes, memory constraints and time trials.
For your specific search example, you could just embed the map() call:
matching = [s for s in map(str, all_ents) if "GROUPS" in s]
which would produce a list of matching strings, without creating an intermediary list of string objects that you then don't use anywhere else.
Use the map() function. This will take your existing list, run a function on each item, and return a new list/iterator (see below) with the result of the function applied to each element.
all_ents_dead = map(str, all_ents)
In Python 3+, map() will return an iterator, while in Python 2 it will return a list. An iterator can have the optimisations you desire, since it generates the values on demand rather than all at once (as opposed to a list).
Which is faster and more efficient for iterating over a large list: a generator expression or itertools.chain?
I have a large list of strings and I want to iterate over it. I want to figure out the best way to do that. I have tried the following:
Generator expression:
g = (x for x in list)
itertools.chain:
ch = itertools.chain(list)
Is there another approach, better than these two, for list iteration?
The fastest way is just to iterate over the list. If you already have a list, layering more iterators/generators isn't going to speed anything up. A good old for item in a_list: is going to be just as fast as any other option, and definitely more readable.
Iterators and generators are for when you don't already have a list sitting around in memory. itertools.count(), for instance, just generates a single number at a time; it's not working off an existing list of numbers.
Another possible use is when you're chaining a number of operations - your intermediate steps can create iterators/generators rather than creating intermediate lists. For instance, if you want to chain a lookup for each item in the list with a sum() call, you could use a generator expression for the output of the lookups, which sum() would then consume:
total_inches_of_snow = sum(inches_of_snow(date) for date in list_of_dates)
This allows you to avoid creating an intermediate list with all of the individual inches of snow and instead just generate them as sum() consumes them, thus saving memory.
Python Sequence Syntax
I am new to Python and was reading through some code for a Sublime Text plugin when I came across some code I am not familiar with:
views = [v for v in sublime.active_window().views()]
It is the "[v for v" part that I don't understand. What in the heck is this piece of code doing? Thanks in advance!
That's a list comprehension. It is equivalent to (but more efficient than):
views = []
for v in sublime.active_window().views():
    views.append(v)
Note that in this case, they should have just used list():
views = list(sublime.active_window().views())
There are other types of comprehensions that were introduced in Python 2.7:
set comprehensions: {x for x in iterable}
dict comprehensions: {k: v for k, v in iterable_that_yields_2_tuples}
So, this is an inefficient way to create a dictionary where all the values are 1:
{k: 1 for k in ("foo", "bar", "baz")}
Finally, Python also supports generator expressions (introduced in Python 2.4):
(x for x in iterable)
This works like a list comprehension, but it returns an iterable object. Generators aren't particularly useful until you actually iterate over them. The advantage is that a generator calculates its values on the fly (rather than storing them in a list which you then iterate over later). They're more memory efficient, but in some circumstances they execute more slowly than list comprehensions - in others, they outshine list comprehensions because it's easy to say "just give me the first 3 elements, please", whereas with a list comprehension you'd have to calculate all the elements up front, which is sometimes an expensive procedure.
This is a list comprehension. It's a bit like an expression with an inline for loop, used to create a quick list on the fly. In this case, it's creating a shallow copy of the list returned by sublime.active_window().views().
List comprehensions really shine when you need to transform each value. For example, here's a quick list comprehension to get the first ten perfect squares:
[x*x for x in range(1, 11)]
Python - Best way to read a file and break out the lines by a delimiter
What is the best way to read a file and break out the lines by a delimiter? The data returned should be a list of tuples. Can this method be beaten? Can this be done faster / using less memory?
def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        return [tuple(line.split(delim)) for line in f]
Your posted code reads the entire file and builds a copy of the file in memory as a single list of all the file contents split into tuples, one tuple per line. Since you ask about how to use less memory, you may only need a generator function:
def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        for line in f:
            yield tuple(line.split(delim))
BUT! There is a major caveat! You can only iterate over the tuples returned by readfile once.
lines_as_tuples = readfile(mydata, ',')
for linedata in lines_as_tuples:
    # do something
This is okay so far, and a generator and a list look the same. But let's say your file was going to contain lots of floating point numbers, and your iteration through the file computed an overall average of those numbers. You could use the "# do something" code to calculate the overall sum and number of numbers, and then compute the average. But now let's say you wanted to iterate again, this time to find the differences from the average of each value. You'd think you'd just add another for loop:
for linedata in lines_as_tuples:
    # do another thing
    # BUT - this loop never does anything, because lines_as_tuples has been consumed!
BAM! This is a big difference between generators and lists. At this point in the code, the generator has been completely consumed - but no special exception is raised; the for loop simply does nothing and continues on, silently!
In many cases, the list that you would get back is only iterated over once, in which case a conversion of readfile to a generator would be fine. But if what you want is a more persistent list, which you will access multiple times, then just using a generator will give you problems, since you can only iterate over a generator once.
My suggestion? Make readfile a generator, so that in its own little view of the world it just yields each incremental bit of the file, nice and memory-efficient. Put the burden of retaining the data onto the caller - if the caller needs to refer to the returned data multiple times, then the caller can simply build its own list from the generator - easily done in Python using list(readfile('file.dat', ',')).
Memory use could be reduced by using a generator instead of a list, and a list instead of a tuple, so you don't need to read the whole file into memory at once:
def readfile(path, delim):
    return (ln.split(delim) for ln in open(path, 'r'))
You'll have to rely on the garbage collector to close the file, though.
As for returning tuples: don't do it if it's not necessary, since lists are a tiny fraction faster, constructing the tuple has a minute cost, and (importantly) your lines will be split into variable-size sequences, which are conceptually lists.
Speed can be improved only by going down to the C/Cython level, I guess; str.split is hard to beat since it's written in C, and list comprehensions are AFAIK the fastest loop construct in Python.
More importantly, this is very clear and Pythonic code. I wouldn't try optimizing this apart from the generator bit.