I wrote two functions, f and g, with the same functionality:
def f(l, count):
    if count > 1:
        for i in f(l, count-1):
            yield i + 1
    else:
        yield from l

for i in f(range(100000), 900):
    pass
print('f')
and
def g(l, count):
    if count > 1:
        tmp = []
        for i in g(l, count-1):
            tmp.append(i+1)
        return tmp
    else:
        return l

for i in g(range(100000), 900):
    pass
print('g')
I think f should be faster, but g is faster when I run it.
time for g
real 0m5.977s
user 0m5.956s
sys 0m0.020s
time for f
real 0m7.389s
user 0m7.376s
sys 0m0.012s
There are a couple of big differences between a solution that yields a result and one that computes the complete result.
A generator keeps returning the next result until it is exhausted, while the complete calculation is always done in full. So if you have a test that might terminate your computation early (often the case), the yield-based method will only be called enough times to meet that criterion, which often results in faster code.
The yield-based version only consumes enough memory to hold the generator and a single result at any moment in time; the full calculation consumes enough memory to hold all of the results at once. With really large data sets, that can make the difference between something that runs regardless of size and something that crashes.
So yield is slightly more expensive per operation, but much more reliable, and often faster in cases where you don't exhaust the results.
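As a quick sketch of that early-exit advantage (the function names here are made up for the example):

def squares_gen(n):
    # Lazily yields one square at a time
    for i in range(n):
        yield i * i

def squares_list(n):
    # Builds and returns the entire list up front
    return [i * i for i in range(n)]

# Find the first square greater than 100. The generator version stops
# after 12 items; the list version still computes all million squares.
first = next(x for x in squares_gen(10**6) if x > 100)
first_too = next(x for x in squares_list(10**6) if x > 100)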
I have no idea what your f and g functions are doing internally, but remember this:
yield is a keyword that is used like return, except the function returns a generator, and that lazy, one-value-at-a-time production is where the extra time goes.
There is a wonderful explanation of what yield does. Check this answer on Stack Overflow.
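As a minimal illustration (hypothetical names): calling a function that contains yield does not run its body at all; it immediately returns a generator object, which produces values only as you iterate it.

def numbers():
    yield 1
    yield 2

g = numbers()   # nothing in the body has run yet
print(g)        # <generator object numbers at 0x...>
print(next(g))  # 1 -- runs the body up to the first yield
print(list(g))  # [2] -- consumes the rest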
So suppose I have an array of some elements, and each element has some number of properties.
I need to filter this list against some subsets of values determined by predicates; these subsets can of course intersect.
I also need to determine the number of values in each such subset.
Using an imperative approach I could write code like the following, and it would have a running time of about 2*n: one iteration to copy the array, and another to filter it and count the subset sizes.
from itertools import groupby
a = [{'some_number': i, 'some_time': str(i) + '0:00:00'} for i in range(10)]

# imperative style
wrong_number_count = 0
wrong_time_count = 0
for item in a[:]:
    if predicate1(item):
        delete_original(item, a)
        wrong_number_count += 1
    if predicate2(item):
        delete_original(item, a)
        wrong_time_count += 1
    update_some_data(item)
do_something_with_filtered(a, wrong_number_count, wrong_time_count)

def do_something_with_filtered(a, c1, c2):
    print('filtered a {}'.format(a))
    print('{} items had wrong number'.format(c1))
    print('{} items had wrong time'.format(c2))

def predicate1(x):
    return x['some_number'] < 3

def predicate2(x):
    return x['some_time'] < '50:00:00'
Somehow I can't think of a way to do this in Python in a functional style with the same running time.
In a functional style I could probably use groupby multiple times, or write a comprehension for each predicate, but that would obviously be slower than the imperative approach.
I think such a thing is possible in Haskell using stream fusion (am I right?).
But how do I do it in Python?
Python has strong support for "stream processing" in the form of its iterators, and what you ask seems straightforward to do. You just need a way to associate each predicate with its counter; a dictionary where the predicate itself is the key works well.
A simple iterator function that takes your predicate data structure along with the data to be processed can do what you want. The iterator has the side effect of updating your data structure with the predicate counts. If you wanted "pure functions" you would have to duplicate the predicate information beforehand, and perhaps pass and retrieve all predicate and counter values to the iterator (through the send method) for each element; I don't think that level of purism would be worth it.
That said, your code could look something like this:
from collections import OrderedDict

def predicate1(...):
    ...
...
def predicateN(...):
    ...

def do_something_with_filtered(item):
    ...

def multifilter(data, predicates):
    for item in data:
        for predicate in predicates:
            if predicate(item):
                predicates[predicate] += 1
                break
        else:
            yield item

def do_it(data):
    predicates = OrderedDict([(predicate1, 0), ..., (predicateN, 0)])
    for item in multifilter(data, predicates):
        do_something_with_filtered(item)
    for predicate, value in predicates.items():
        print("{} filtered out {} items".format(predicate.__name__, value))

a = ...
do_it(a)
(If you have to count an item against every predicate it fails, an obvious change from the break statement to a flag variable is enough; see the sketch below.)
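A minimal sketch of that flag-variable variant (assuming the same predicate-to-counter dictionary as above):

def multifilter_all(data, predicates):
    # Counts every predicate an item fails, and only yields items
    # that pass all of them.
    for item in data:
        passed = True
        for predicate in predicates:
            if predicate(item):
                predicates[predicate] += 1
                passed = False  # keep checking the remaining predicates
        if passed:
            yield item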
Yes, fusion in Haskell will often turn something written as two passes into a single pass. Though in the case of lists, it's actually foldr/build fusion rather than stream fusion.
That's not generally possible in languages that don't enforce purity, though. When side effects are involved, it's no longer correct to fuse multiple passes into one. What if each pass performed output? Unfused, you get all the output from each pass separately. Fused, you get the output from both passes interleaved.
It's possible to write a fusion-style framework in Python that will work correctly if you promise to only ever use it with pure functions, but I'm doubtful such a thing exists at the moment. (I'd love to be proven wrong, though.)
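A very small sketch of the idea (my own illustration, not an existing library): if each "pass" is written as a generator stage over the previous one, composing the stages traverses the data only once, which is fusion by construction -- but it is only equivalent to separate passes when the stage functions are pure.

def map_stage(f, xs):
    # One logical pass: transform each element lazily
    return (f(x) for x in xs)

def filter_stage(p, xs):
    # Another logical pass: keep elements satisfying p, lazily
    return (x for x in xs if p(x))

pipeline = map_stage(lambda x: x * 2,
                     filter_stage(lambda x: x % 3 == 0, range(10)))
print(list(pipeline))  # [0, 6, 12, 18] -- two passes' work, one traversal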
Which one of these is considered the more pythonic, taking into account scalability and readability?
Using enumerate:
group = ['A','B','C']
tag = ['a','b','c']

for idx, x in enumerate(group):
    print(x, tag[idx])
or using zip:
for x, y in zip(group, tag):
    print(x, y)
The reason I ask is that I have been using a mix of both. I should keep to one standard approach, but which should it be?
No doubt, zip is more pythonic. It doesn't require a variable to store an index (which you don't otherwise need), and it handles the two lists uniformly, whereas with enumerate you iterate over one list and index into the other, i.e. non-uniform handling.
However, you should be aware of the caveat that zip runs only up to the shorter of the two lists. To avoid duplicating someone else's answer I'd just include a reference here: someone else's answer.
#user3100115 aptly points out that in python2 you should prefer itertools.izip over zip, due to its lazy nature (it is faster and more memory efficient). In python3, zip already behaves like py2's izip.
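To make the shorter-list caveat concrete (a small sketch; the '-' fill value is arbitrary): zip simply stops, while itertools.zip_longest pads the shorter input.

from itertools import zip_longest

group = ['A', 'B', 'C']
tag = ['a', 'b']

print(list(zip(group, tag)))
# [('A', 'a'), ('B', 'b')] -- the 'C' is silently dropped
print(list(zip_longest(group, tag, fillvalue='-')))
# [('A', 'a'), ('B', 'b'), ('C', '-')]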
While others have pointed out that zip is in fact more pythonic than enumerate, I came here to see if it was any more efficient. According to my tests, zip is around 10 to 20% faster than enumerate when simply accessing and using items from multiple lists in parallel.
Here I have three lists of (the same) increasing length being accessed in parallel. When the lists are more than a couple of items long, the time ratio of zip/enumerate is below one, i.e. zip is faster.
Code I used:
import timeit

setup = \
"""
import random
size = {}
a = [ random.randint(0,i+1) for i in range(size) ]
b = [ random.random()*i for i in range(size) ]
c = [ random.random()+i for i in range(size) ]
"""

code_zip = \
"""
data = []
for x,y,z in zip(a,b,c):
    data.append(x+z+y)
"""

code_enum = \
"""
data = []
for i,x in enumerate(a):
    data.append(x+c[i]+b[i])
"""

runs = 10000
sizes = [ 2**i for i in range(16) ]
data = []
for size in sizes:
    formatted_setup = setup.format(size)
    time_zip = timeit.timeit(code_zip, formatted_setup, number=runs)
    time_enum = timeit.timeit(code_enum, formatted_setup, number=runs)
    ratio = time_zip/time_enum
    row = (size, time_zip, time_enum, ratio)
    data.append(row)

with open("testzipspeed.csv", 'w') as csv_file:
    csv_file.write("size,time_zip,time_enumerate,ratio\n")
    for row in data:
        csv_file.write(",".join([ str(i) for i in row ])+"\n")
The answer to the question asked in your title, "Which is more pythonic; zip or enumerate...?" is: they both are. enumerate is just a special case of zip.
The answer to your more specific question about that for loop is: use zip, but not for the reasons you've seen so far.
The biggest advantage of zip in that loop has nothing to do with zip itself. It has to do with avoiding the assumptions made in your enumerate loop. To explain, I'll make two different generators based on your two examples:
def process_items_and_tags(items, tags):
    "Do something with two iterables: items and tags."
    for item, tag in zip(items, tags):
        yield process(item, tag)

def process_items_and_list_of_tags(items, tags_list):
    "Do something with an iterable of items and an indexable collection of tags."
    for idx, item in enumerate(items):
        yield process(item, tags_list[idx])
Both generators can take any iterable as their first argument (items), but they differ in how they handle their second argument. The enumerate-based approach can only process tags in a list-like collection with [] indexing. That rules out a huge number of iterables, like file streams and generators, for no good reason.
Why is one parameter more tightly constrained than the other? The restriction isn't inherent in the problem the user is trying to solve, since the generator could just as easily have been written the other way 'round:
def process_list_of_items_and_tags(items_list, tags):
    "Do something with an indexable collection of items and an iterable of tags."
    for idx, tag in enumerate(tags):
        yield process(items_list[idx], tag)
Same result, different restriction on the inputs. Why should your caller have to know or care about any of that?
As an added penalty, anything of the form some_list[some_index] could raise an IndexError, which you would have to either catch or prevent in some way. That's not normally a problem when your loop both enumerates and accesses the same list-like collection, but here you're enumerating one and then accessing items from another. You'd have to add more code to handle an error that could not have happened in the zip-based version.
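You can see both points with a quick sketch (process here is just a stand-in for the real work): the zip-based generator happily consumes a generator of tags, while the enumerate-based one fails as soon as it tries to index it.

def process(item, tag):  # stand-in for the real processing
    return (item, tag)

tags_gen = (t for t in ['a', 'b', 'c'])  # a generator, not a list
print(list(process_items_and_tags(['A', 'B', 'C'], tags_gen)))
# [('A', 'a'), ('B', 'b'), ('C', 'c')]

tags_gen = (t for t in ['a', 'b', 'c'])
# list(process_items_and_list_of_tags(['A', 'B', 'C'], tags_gen))
# TypeError: 'generator' object is not subscriptable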
Avoiding the unnecessary idx variable is also nice, but hardly the deciding difference between the two approaches.
For more on the subject of iterables, generators, and functions that use them, see Ned Batchelder's PyCon US 2013 talk, "Loop Like a Native" (text, 30-minute video).
As others have said, zip is more pythonic because it doesn't require another variable, but you could also use:
import sys
from collections import deque

deque(map(lambda x, y: sys.stdout.write(x+" "+y+"\n"), group, tag), maxlen=0)
Since we are printing output here, the None values produced by the map need to be discarded (that is what the zero-length deque does), and this also assumes your lists are of the same length.
Update: in this case it may not be as good, because you are printing the group and tag values and sys.stdout.write is what produces those None values; but if you actually needed to compute and collect values, it would work better.
zip might be more Pythonic, but it has a gotcha. If you want to change elements in place, you need to use indexing. Iterating over the elements will not work. For example:
x = [1, 2, 3]
for elem in x:
    elem *= 10
print(x)

Output: [1, 2, 3]

y = [1, 2, 3]
for index in range(len(y)):
    y[index] *= 10
print(y)

Output: [10, 20, 30]
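For what it's worth, enumerate gives you the index and the element together, so in-place updates stay readable without range(len(...)):

z = [1, 2, 3]
for i, elem in enumerate(z):
    z[i] = elem * 10  # assign through the index to mutate the list
print(z)  # [10, 20, 30]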
This is a trivial starting question, but I don't think range(len(list)) is pythonic, so let me try a different solution.
Thinking about it and reading the excellent Python documentation, I'd say enumerate is the solution for iterables when you need a for loop, and a comprehension is a concise way to build an iterable.
list_a = ['a', 'b', 'c']
list_2 = ['1', '2', '3']

[print(a) for a in list_a]
This executes the print for each element, though perhaps a generator is better:
item = generator_item = (print(i, a) for i, a in enumerate(list_a) if a.find('a') == 0)
next(item)
For multiline bodies and more complex for loops, we can use enumerate(zip(...)):
for i, (arg1, arg2) in enumerate(zip(list_a, list_2)):
    print('multiline')  # do complex code
But perhaps in more elaborate code we can use another construct with itertools; note the idx iterator passed as the last zip argument to supply the index:
from itertools import count as idx

for arg1, arg2, i in zip(list_a, list_2, idx(start=1)):
    print(f'multiline {i}: {arg1}, {arg2}')  # do complex code
The following python code produces [(0, 0), (0, 7)...(0, 693)] instead of the expected list of tuples combining all of the multiples of 3 and multiples of 7:
multiples_of_3 = (i*3 for i in range(100))
multiples_of_7 = (i*7 for i in range(100))
list((i,j) for i in multiples_of_3 for j in multiples_of_7)
This code fixes the problem:
list((i,j) for i in (i*3 for i in range(100)) for j in (i*7 for i in range(100)))
Questions:
The generator object seems to play the role of an iterator itself, instead of providing a new iterator object each time the generated list is to be enumerated. The latter strategy seems to be adopted by .NET LINQ query objects. Is there an elegant way around this?
How come the second piece of code works? Shall I understand that the generator's iterator is not reset after looping through all multiples of 7?
Don't you think that this behavior is counter intuitive if not inconsistent?
A generator object is an iterator, and therefore one-shot. It's not an iterable that can produce any number of independent iterators. This behavior is not something you can change with a switch somewhere, so any workaround amounts to either using an iterable (e.g. a list) instead of a generator, or repeatedly constructing generators.
The second snippet does the latter. It is by definition equivalent to the loops
for i in (i*3 for i in range(100)):
    for j in (i*7 for i in range(100)):
        ...
Hopefully it isn't surprising that here, the latter generator expression is evaluated anew on each iteration of the outer loop.
As you discovered, the object created by a generator expression is an iterator (more precisely a generator-iterator), designed to be consumed only once. If you need a resettable generator, simply create a real generator and use it in the loops:
def multiples_of_3():  # generator
    for i in range(100):
        yield i * 3

def multiples_of_7():  # generator
    for i in range(100):
        yield i * 7

list((i, j) for i in multiples_of_3() for j in multiples_of_7())
Your second code works because the expression list of the inner loop ((i*7 ...)) is evaluated on each pass of the outer loop. This results in creating a new generator-iterator each time around, which gives you the behavior you want, but at the expense of code clarity.
To understand what is going on, remember that there is no "resetting" of an iterator when the for loop iterates over it. (This is a feature; such a reset would break iterating over a large iterator in pieces, and it would be impossible for generators.) For example:
multiples_of_2 = iter(xrange(0, 100, 2))  # iterator (Python 2)
for i in multiples_of_2:
    print i
# prints nothing because the iterator is spent
for i in multiples_of_2:
    print i

...as opposed to this:

multiples_of_2 = xrange(0, 100, 2)  # iterable sequence, converted to iterator
for i in multiples_of_2:
    print i
# prints again because a new iterator gets created
for i in multiples_of_2:
    print i
A generator expression is equivalent to an invoked generator and can therefore only be iterated over once.
The real issue, as I found out, is about single- versus multiple-pass iterables, and the fact that there is currently no standard mechanism to determine whether an iterable is single- or multi-pass: see Single- vs. Multi-pass iterability.
If you want to convert a generator expression to a multipass iterable, then it can be done in a fairly routine fashion. For example:
class MultiPass(object):
    def __init__(self, initfunc):
        self.initfunc = initfunc

    def __iter__(self):
        return self.initfunc()

multiples_of_3 = MultiPass(lambda: (i*3 for i in range(20)))
multiples_of_7 = MultiPass(lambda: (i*7 for i in range(20)))
print(list((i, j) for i in multiples_of_3 for j in multiples_of_7))
From the point of view of defining the thing it's a similar amount of work to typing:
def multiples_of_3():
    return (i*3 for i in range(20))
but from the point of view of the user, they write multiples_of_3 rather than multiples_of_3(), which means the object multiples_of_3 is polymorphic with any other iterable, such as a tuple or list.
The need to type lambda: is a bit inelegant, true. I don't suppose there would be any harm in introducing "iterable comprehensions" to the language, to give you what you want while maintaining backward compatibility. But there are only so many punctuation characters, and I doubt this would be considered worth one.
Files in Python, for example, are iterable: iterating over a file yields its lines. I want to count the number of lines.
One quick way is to do this:
lines = len(list(open(fname)))
However, this loads the whole file into memory (at once). This rather defeats the purpose of an iterator (which only needs to keep the current line in memory).
This doesn't work:
lines = len(line for line in open(fname))
as generators don't have a length.
Is there any way to do this short of defining a count function?
def count(i):
    c = 0
    for el in i:
        c += 1
    return c
To clarify, I understand that the whole file will have to be read! I just don't want it in memory all at once
Short of iterating through the iterable and counting the number of iterations, no. That's what makes it an iterable and not a list. This isn't really even a python-specific problem. Look at the classic linked-list data structure. Finding the length is an O(n) operation that involves iterating the whole list to find the number of elements.
As mcrute mentioned above, you can probably reduce your function to:
def count_iterable(i):
    return sum(1 for e in i)
Of course, if you're defining your own iterable object you can always implement __len__ yourself and keep an element count somewhere.
If you need a count of lines you can do this, I don't know of any better way to do it:
line_count = sum(1 for line in open("yourfile.txt"))
The cardinality package provides an efficient count() function and some related functions to count and check the size of any iterable: http://cardinality.readthedocs.org/
import cardinality
it = some_iterable(...)
print(cardinality.count(it))
Internally it uses enumerate() and collections.deque() to move all the actual looping and counting logic to the C level, resulting in a considerable speedup over for loops in Python.
I've used this redefinition for some time now:
def len(thingy):
    try:
        return thingy.__len__()
    except AttributeError:
        return sum(1 for item in iter(thingy))
It turns out there is an implemented solution for this common problem. Consider using the ilen() function from more_itertools.
more_itertools.ilen(iterable)
An example of printing a number of lines in a file (we use the with statement to safely handle closing files):
# Example
import more_itertools

with open("foo.py", "r+") as f:
    print(more_itertools.ilen(f))
# Output: 433
This example returns the same result as solutions presented earlier for totaling lines in a file:
# Equivalent code
with open("foo.py", "r+") as f:
    print(sum(1 for line in f))
# Output: 433
Absolutely not, for the simple reason that iterables are not guaranteed to be finite.
Consider this perfectly legal generator function:
def forever():
    while True:
        yield "I will run forever"
Attempting to calculate the length of this function with len([x for x in forever()]) will clearly not work.
As you noted, much of the purpose of iterators/generators is to be able to work on a large dataset without loading it all into memory. The fact that you can't get an immediate length should be considered a tradeoff.
Because apparently the duplication wasn't noticed at the time, I'll post an extract from my answer to the duplicate here as well:
There is a way to perform meaningfully faster than sum(1 for i in it) when the iterable may be long (and not meaningfully slower when it is short), while maintaining fixed memory overhead (unlike len(list(it))), so you avoid swap thrashing and reallocation overhead for larger inputs.
# On Python 2 only, get zip that lazily generates results instead of returning a list
from future_builtins import zip

from collections import deque
from itertools import count

def ilen(it):
    # Make a stateful counting iterator
    cnt = count()
    # zip it with the input iterator, then drain until input exhausted at C level
    deque(zip(it, cnt), 0)  # cnt must be second zip arg to avoid advancing too far
    # Since count is 0-based, the next value is the count
    return next(cnt)
Like len(list(it)), ilen(it) performs the loop in C code on CPython (deque, count and zip are all implemented in C); avoiding byte code execution per loop is usually the key to performance in CPython.
Rather than repeat all the performance numbers here, I'll just point you to my answer with the full perf details.
For filtering, this variation can be used:
sum(is_good(item) for item in iterable)
which can be naturally read as "count good items" and is shorter and simpler (although perhaps less idiomatic) than:
sum(1 for item in iterable if is_good(item))
Note: The fact that True evaluates to 1 in numeric contexts is specified in the docs
(https://docs.python.org/3.6/library/stdtypes.html#boolean-values), so this coercion is not a hack (as opposed to some other languages like C/C++).
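A quick demonstration of that documented coercion:

print(True == 1)                      # True
print(sum([True, False, True]))       # 2
print(sum(x > 1 for x in [0, 2, 3]))  # 2 -- counting the "good" items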
Well, if you think about it, how do you propose to find the number of lines in a file without reading the whole file for newlines? Sure, you can find the size of the file, and if you can guarantee that the length of every line is x, you can get the number of lines in a file. But unless you have some kind of constraint, I fail to see how this can work at all. Also, since iterables can be infinitely long...
I ran a test between the two common procedures in some code of mine, which finds how many graphs on n vertices there are, to see which method of counting the elements of a generated list is faster. Sage has a generator graphs(n) which generates all graphs on n vertices. I created two functions which obtain the length of a list produced by an iterator in two different ways, and timed each of them (averaging over 100 test runs) using the time.time() function. The functions were as follows:
def test_code_list(n):
    l = graphs(n)
    return len(list(l))
and
def test_code_sum(n):
    S = sum(1 for _ in graphs(n))
    return S
Now I time each method
import time

t0 = time.time()
for i in range(100):
    test_code_list(5)
t1 = time.time()
avg_time = (t1-t0)/100
print 'average list method time = %s' % avg_time

t0 = time.time()
for i in range(100):
    test_code_sum(5)
t1 = time.time()
avg_time = (t1-t0)/100
print "average sum method time = %s" % avg_time
average list method time = 0.0391882109642
average sum method time = 0.0418473792076
So computing the number of graphs on n=5 vertices this way, the list method is slightly faster (although 100 test runs isn't a great sample size). But when I increased the length of the list being computed by trying graphs on n=7 vertices (i.e. changing graphs(5) to graphs(7)), the result was this:
average list method time = 4.14753051996
average sum method time = 3.96504004002
In this case the sum method was slightly faster. All in all, the two methods are approximately the same speed, but the difference might depend on the length of your list (it might also just be that I only averaged over 100 test runs, which isn't very many; averaging more would have taken forever).