I am trying to understand Python's iterators in the context of the pysam module. By using the fetch method on a so-called AlignmentFile object, one gets a proper iterator iter consisting of records from the file file. I can then use various methods to access each record, for instance its name with query_name:
import pysam
iter = pysam.AlignmentFile(file, "rb", check_sq=False).fetch(until_eof=True)
for record in iter:
    print(record.query_name)
It happens that records come in pairs, so one would like something like:
while True:
    r1 = iter.__next__()
    r2 = iter.__next__()
    print(r1.query_name)
    print(r2.query_name)
Calling next() manually is probably not the right way for millions of records, but how can one use a for loop to consume the same iterator in pairs of records? I looked at the grouper recipe from itertools and the SO questions Iterate an iterator by chunks (of n) in Python? [duplicate] (even a duplicate!) and What is the most "pythonic" way to iterate over a list in chunks?, but I cannot get it to work.
First of all, don't use the variable name iter, because that's already the name of a built-in function.
To answer your question, simply use itertools.izip (Python 2) or zip (Python 3) on the iterator.
Your code may look as simple as
for next_1, next_2 in zip(iterator, iterator):
    # stuff
edit: whoops, my original answer was the correct one all along, don't mind the itertools recipe.
edit 2: Consider itertools.izip_longest (zip_longest in Python 3) if you deal with iterators that could yield an uneven number of objects:
>>> from itertools import izip_longest
>>> iterator = (x for x in (1,2,3))
>>>
>>> for next_1, next_2 in izip_longest(iterator, iterator):
...     next_1, next_2
...
(1, 2)
(3, None)
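To tie this back to the original question, here is a minimal self-contained Python 3 sketch; records is a stand-in for the pysam fetch() iterator, with made-up read names:
records = iter(["read1/1", "read1/2", "read2/1", "read2/2"])
# zip pulls from the same iterator twice per loop turn, yielding pairs.
for r1, r2 in zip(records, records):
    print(r1, r2)
# read1/1 read1/2
# read2/1 read2/2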
Related
I'd like to iterate over a number of infinite generators:
def x(y):
    while True:
        for i in xrange(y):
            yield i

for i, j in zip(x(5), x(3)):
    print i, j
The code above will produce nothing. What am I doing wrong?
That's because Python 2's zip tries to create a list by getting all the elements the generators will ever produce. What you want is an iterator, i.e. itertools.izip.
In Python 3 zip works like izip.
zip is not the right tool for generators. Try itertools.izip instead!
(Or even better, use Python 3, where your code works fine - once you add parentheses to the print)
You just need to use a variant of zip that returns an iterator instead of a list. Fortunately, there's one of them in the itertools module.
import itertools

def x(y):
    while True:
        for i in xrange(y):
            yield i

for i, j in itertools.izip(x(5), x(3)):
    print i, j
Note that in Python 3, itertools.izip doesn't exist because the vanilla zip is already an iterator.
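For illustration, a Python 3 sketch of the same loop; since both generators are infinite, itertools.islice caps the output (my addition, not part of the original answer):
from itertools import islice

def x(y):
    while True:
        for i in range(y):
            yield i

# Take only the first 6 pairs from the two infinite generators.
for i, j in islice(zip(x(5), x(3)), 6):
    print(i, j)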
Also in itertools there's a function called cycle which infinitely cycles over an iterable.
Make an iterator returning elements from the iterable and saving a
copy of each. When the iterable is exhausted, return elements from the
saved copy. Repeats indefinitely.
So itertools.cycle(range(5)) does the same thing as your x(5); you can also pass xrange(5) to cycle, it's not fussy. ;)
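A quick sketch of that equivalence, pairing cycle with islice to keep the output finite:
from itertools import cycle, islice

# cycle(range(5)) repeats 0..4 forever, just like x(5) above;
# islice takes a finite sample so the expression terminates.
print(list(islice(cycle(range(5)), 12)))  # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1]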
From what I understand, you are trying to iterate over two iterators simultaneously. You can always use a while loop if nothing else works.
gen1 = x(5)
gen2 = x(3)

while True:
    try:
        print(next(gen1), next(gen2))
    except StopIteration:
        break
If you are using Python 3.3 or above, your function x can be refactored with yield from; note that the while True is still needed to keep the generator infinite, and range replaces Python 2's xrange:
def x(y):
    while True:
        yield from range(y)
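A quick check of the refactored generator (a sketch using islice to sample the infinite stream):
from itertools import islice

def x(y):
    while True:
        yield from range(y)

print(list(islice(x(3), 7)))  # [0, 1, 2, 0, 1, 2, 0]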
Other than using the list and sorted methods to convert an itertools.chain object to get an unordered and ordered list, respectively, are there more efficient ways of doing the same in python3? I read in this answer that list is for debugging. Is this true?
Below is an example code where I time the processes:
from itertools import chain
from time import time

def foo(n):
    for i in range(n):
        yield range(n)

def check(n):
    # check list method
    start = time()
    a = list(chain.from_iterable(foo(n)))
    end = time() - start
    print('Time for list = ', end)

    # check sorted method
    start = time()
    b = sorted(chain.from_iterable(foo(n)))
    end = time() - start
    print('Time for sorted = ', end)
Results:
>>> check(1000)
Time for list = 0.04650092124938965
Time for sorted = 0.08582258224487305
>>> check(10000)
Time for list = 1.615750789642334
Time for sorted = 8.84056806564331
>>>
Other than using the list and sorted methods to convert an itertools.chain object to get an unordered and ordered list, respectively, are there more efficient ways of doing the same in python3?
Simple answer: no. When working with Python generators and iterators, the main caveat to avoid is converting a generator into a list, then into a generator, then into a list again, and so on…
i.e. a chain like that would be stupid:
list(sorted(list(filter(list(map(…
because you would then lose all the added value of the generators.
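For illustration, a minimal sketch of the streaming alternative: keep every stage lazy and materialize exactly once at the end.
from itertools import chain

nested = ([i, i + 1] for i in range(5))      # lazy source
flat = chain.from_iterable(nested)           # still lazy
evens = (x for x in flat if x % 2 == 0)      # still lazy
result = sorted(evens)                       # materialize exactly once
print(result)  # [0, 2, 2, 4, 4]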
I read in this answer that list is for debugging. Is this true?
It depends on your context; generally speaking, a list() is not for debugging, it's a different way to represent an iterable.
You might want to use a list() if you need to access an element at a given index, or if you want to know the length of the dataset.
You'll want to not use a list() if you can consume the data as it goes.
Think of all the generators/iterator scheme as a way to apply an algorithm for each item as they are available, whereas you work on lists as bulk.
About the question you quote: it is very specific, asking how one can introspect a generator from the REPL in order to know what is inside. The advice from the person who answered it is to use list(chain) only for introspection, but to keep the generator as it was originally.
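In REPL terms, that advice amounts to something like this sketch (chain_obj is a hypothetical name):
from itertools import chain

chain_obj = chain([1, 2], [3, 4])
snapshot = list(chain_obj)   # for inspection only; this exhausts chain_obj
print(snapshot)              # [1, 2, 3, 4]
chain_obj = iter(snapshot)   # restore an iterator over the same elements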
The most efficient way is using list(). But if you want to flatten a nested iterable with itertools.chain(), or concatenate some iterables and then convert the result to a list, you can just use a nested list comprehension in one step. Also, the reason that sorted() takes more time is that it sorts the iterable, while list() just calls the generator's methods (like __next__) in order to copy the items into a list object.
Note that in terms of run time, itertools.chain can perform slightly faster than a list comprehension (in both Python 2.x and Python 3.x). Here is an example:
In [27]: lst = [range(10000) for _ in range(10000)]
In [28]: %timeit [i for sub in lst for i in sub]
1 loops, best of 3: 3.94 s per loop
In [29]: %timeit list(chain.from_iterable(lst))
1 loops, best of 3: 2.75 s per loop
Which one of these is considered the more pythonic, taking into account scalability and readability?
Using enumerate:
group = ['A','B','C']
tag = ['a','b','c']
for idx, x in enumerate(group):
    print(x, tag[idx])
or using zip:
for x, y in zip(group, tag):
    print(x, y)
The reason I ask is that I have been using a mix of both. I should keep to one standard approach, but which should it be?
No doubt, zip is more pythonic. It doesn't require that you use a variable to store an index (which you don't otherwise need), and using it allows handling the lists uniformly, while with enumerate, you iterate over one list, and index the other list, i.e. non-uniform handling.
However, you should be aware of the caveat that zip runs only up to the shorter of the two lists. To avoid duplicating someone else's answer, I'll just include a reference to it here.
#user3100115 aptly points out that in Python 2 you should prefer itertools.izip over zip, due to its lazy nature (faster and more memory efficient). In Python 3, zip already behaves like Python 2's izip.
While others have pointed out that zip is in fact more pythonic than enumerate, I came here to see if it was any more efficient. According to my tests, zip is around 10 to 20% faster than enumerate when simply accessing and using items from multiple lists in parallel.
Here I have three lists of (the same) increasing length being accessed in parallel. When the lists are more than a couple of items in length, the time ratio zip/enumerate is below one, i.e. zip is faster.
Code I used:
import timeit

setup = \
"""
import random
size = {}
a = [ random.randint(0,i+1) for i in range(size) ]
b = [ random.random()*i for i in range(size) ]
c = [ random.random()+i for i in range(size) ]
"""

code_zip = \
"""
data = []
for x,y,z in zip(a,b,c):
    data.append(x+z+y)
"""

code_enum = \
"""
data = []
for i,x in enumerate(a):
    data.append(x+c[i]+b[i])
"""

runs = 10000
sizes = [ 2**i for i in range(16) ]
data = []

for size in sizes:
    formatted_setup = setup.format(size)
    time_zip = timeit.timeit(code_zip, formatted_setup, number=runs)
    time_enum = timeit.timeit(code_enum, formatted_setup, number=runs)
    ratio = time_zip/time_enum
    row = (size, time_zip, time_enum, ratio)
    data.append(row)

with open("testzipspeed.csv", 'w') as csv_file:
    csv_file.write("size,time_zip,time_enumerate,ratio\n")
    for row in data:
        csv_file.write(",".join([ str(i) for i in row ])+"\n")
The answer to the question asked in your title, "Which is more pythonic; zip or enumerate...?" is: they both are. enumerate is just a special case of zip.
The answer to your more specific question about that for loop is: use zip, but not for the reasons you've seen so far.
The biggest advantage of zip in that loop has nothing to do with zip itself. It has to do with avoiding the assumptions made in your enumerate loop. To explain, I'll make two different generators based on your two examples:
def process_items_and_tags(items, tags):
    "Do something with two iterables: items and tags."
    for item, tag in zip(items, tags):
        yield process(item, tag)
def process_items_and_list_of_tags(items, tags_list):
    "Do something with an iterable of items and an indexable collection of tags."
    for idx, item in enumerate(items):
        yield process(item, tags_list[idx])
Both generators can take any iterable as their first argument (items), but they differ in how they handle their second argument. The enumerate-based approach can only process tags in a list-like collection with [] indexing. That rules out a huge number of iterables, like file streams and generators, for no good reason.
Why is one parameter more tightly constrained than the other? The restriction isn't inherent in the problem the user is trying to solve, since the generator could just as easily have been written the other way 'round:
def process_list_of_items_and_tags(items_list, tags):
    "Do something with an indexable collection of items and an iterable of tags."
    for idx, tag in enumerate(tags):
        yield process(items_list[idx], tag)
Same result, different restriction on the inputs. Why should your caller have to know or care about any of that?
As an added penalty, anything of the form some_list[some_index] could raise an IndexError, which you would have to either catch or prevent in some way. That's not normally a problem when your loop both enumerates and accesses the same list-like collection, but here you're enumerating one and then accessing items from another. You'd have to add more code to handle an error that could not have happened in the zip-based version.
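To make the contrast concrete, here is a minimal sketch (process is a stand-in defined here, not from the original):
def process(item, tag):
    return item + ":" + tag

items = ["A", "B", "C"]
tags = (t.lower() for t in items)   # a generator: iterable, but not indexable

# The zip-based version handles it fine:
print([process(i, t) for i, t in zip(items, tags)])  # ['A:a', 'B:b', 'C:c']

# The enumerate-based version would fail on the same input:
# tags[idx]  ->  TypeError: 'generator' object is not subscriptable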
Avoiding the unnecessary idx variable is also nice, but hardly the deciding difference between the two approaches.
For more on the subject of iterables, generators, and functions that use them, see Ned Batchelder's PyCon US 2013 talk, "Loop Like a Native" (text, 30-minute video).
As others have said, zip is more pythonic since it doesn't require another variable; you could also use
import sys
from collections import deque

deque(map(lambda x, y: sys.stdout.write(x + " " + y + "\n"), group, tag), maxlen=0)
Since we are printing output here, the list of None values returned by sys.stdout.write has to be discarded (that is what the maxlen=0 deque does), and this also assumes your lists are of the same length.
Update: Well, in this case it may not be as good, because you are printing the group and tag values and sys.stdout.write generates None values; but if you actually needed to fetch computed values, it would work better.
zip might be more Pythonic, but it has a gotcha. If you want to change elements in place, you need to use indexing. Iterating over the elements will not work. For example:
x = [1, 2, 3]
for elem in x:
    elem *= 10
print(x)
Output: [1, 2, 3]
y = [1, 2, 3]
for index in range(len(y)):
    y[index] *= 10
print(y)
Output: [10, 20, 30]
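For completeness, a small sketch of a middle ground: enumerate provides both the index (for assignment) and the element (for reading), avoiding a bare range(len(...)):
z = [1, 2, 3]
for index, elem in enumerate(z):
    z[index] = elem * 10
print(z)  # [10, 20, 30]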
This started as a trivial question for me, but I think range(len(list)) isn't pythonic, so I tried some alternatives. Thinking about it and reading the excellent Python documentation (I like numpy-style docs in simple pythonic code), I'd say enumerate is a solution for iterables when you need a for loop, while a comprehension is the more compact way to build an iterable.
list_a = ['a', 'b', 'c']
list_2 = ['1', '2', '3']

[print(a) for a in list_a]
executes the print for each element; perhaps a generator is better:
item = generator_item = (print(i, a) for i, a in enumerate(list_a) if a.find('a') == 0)
next(item)
For multiline bodies and more complex for loops, we can use enumerate(zip(...)):
for i, (arg1, arg2) in enumerate(zip(list_a, list_2)):
    print('multiline')  # do complex code
But perhaps in more elaborate pythonic code we can use another construction with itertools; note idx (an alias for itertools.count) placed at the end of the zip call:
from itertools import count as idx

for arg1, arg2, i in zip(list_a, list_2, idx(start=1)):
    print(f'multiline {i}: {arg1}, {arg2}')  # do complex code
I've written a for-loop using map, with a function that has a side-effect. Here's a minimal working example of what I mean:
def someFunc(t):
    n, d = t
    d[n] = str(n)

def main():
    d = {}
    map(someFunc, ((i, d) for i in range(10**3)))
    print(len(d))
So it's clear that someFunc, which is mapped onto the non-negative numbers under 1000, has the side-effect of populating a dictionary, which is later used for something else.
Now, given the way that the above code has been structured, the expected output of print(len(d)) is 0, since map returns an iterator, and not a list (unlike python2.x). So if I really want to see the changes applied to d, then I would have to iterate over that map object until completion. One way I could do so is:
d = {}
for i in map(someFunc, ((i, d) for i in range(10**3))):
    pass
But that doesn't seem very elegant. I could call list on the map object, but that would require O(n) memory, which is inefficient. Is there a way to force a full iteration over the map object?
You don't want to do this (run a map() just for the side effects), but there is an itertools consume recipe that applies here:
from collections import deque

deque(map(someFunc, ((i, d) for i in range(10**3))), maxlen=0)
The collections.deque() object, configured to a maximum size of 0, consumes the map() iterable with no additional memory use. The deque object is specifically optimized for this use-case.
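For reference, the consume recipe from the itertools documentation generalizes this pattern (copied essentially verbatim):
from collections import deque
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n steps ahead. If n is None, consume entirely."
    if n is None:
        # feed the entire iterator into a zero-length deque
        deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)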
Can someone explain why these output different things in Python 2.7.4? They output the same thing in python 3.3.1. I'm just wondering if this is a bug in 2.7 that was fixed in 3, or if it is due to some change in the language.
>>> from itertools import groupby
>>> for (i,j),k in zip(groupby([1,1,2,2,3,3]), [4,5,6]):
...     print list(j)
...
[]
[]
[3]
>>> for i,j in groupby([1,1,2,2,3,3]):
...     print list(j)
...
[1, 1]
[2, 2]
[3, 3]
This isn't a mistake. It has to do with when the groupby iterable gets consumed. Try the following with python3 and you'll see the same behavior:
from itertools import groupby
for (i,j),k in list(zip(groupby([1,1,2,2,3,3]), [4,5,6])):
    print(i, list(j), k)
Note that if you remove the outer list, then you get the result you expect. The "problem" here is that the grouper object (returned in j) is an iterable which yields elements as long as they are the same. It doesn't know ahead of time what it will yield or how many elements there are. It just receives an iterable as input and then yields from that iterable. If you move on to the next "group", then the iterable ends up being consumed before you ever get a chance to look at the elements. This is a design decision to allow groupby to operate on iterables which yield arbitrary (even infinite) numbers of elements.
In python2.x, zip will create a list, effectively moving past each "group" before the loop even starts. In doing so, it ends up consuming each of the "group" objects returned by groupby, which is why only the last group still has anything left to report. The fix for python2.x is to use itertools.izip rather than zip. In python3.x, izip became the builtin zip. As I see it, the only way to support both in this script is via something like:
from __future__ import print_function
from itertools import groupby

try:
    from itertools import izip
except ImportError:  # python3.x
    izip = zip

for (i,j),k in izip(groupby([1,1,2,2,3,3]), [4,5,6]):
    print(i, list(j), k)
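Alternatively, a version-agnostic workaround (a sketch) is to materialize each group eagerly; once nothing is lazy, even an eager zip is safe:
from __future__ import print_function
from itertools import groupby

# Materializing each grouper up front means nothing is consumed lazily later,
# so this behaves the same under Python 2's eager zip and Python 3's lazy zip.
groups = [(key, list(grp)) for key, grp in groupby([1, 1, 2, 2, 3, 3])]
for (key, grp), k in zip(groups, [4, 5, 6]):
    print(key, grp, k)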