zip and groupby curiosity in Python 2.7

Can someone explain why these output different things in Python 2.7.4? They output the same thing in Python 3.3.1. I'm just wondering if this is a bug in 2.7 that was fixed in 3, or if it is due to some change in the language.
>>> for (i,j),k in zip(groupby([1,1,2,2,3,3]), [4,5,6]):
...     print list(j)
...
[]
[]
[3]
>>> for i,j in groupby([1,1,2,2,3,3]):
...     print list(j)
...
[1, 1]
[2, 2]
[3, 3]

This isn't a mistake. It has to do with when the groupby iterable gets consumed. Try the following with Python 3 and you'll see the same behavior:
from itertools import groupby
for (i,j),k in list(zip(groupby([1,1,2,2,3,3]), [4,5,6])):
    print(i, list(j), k)
Note that if you remove the outer list, then you get the result you expect. The "problem" here is that the grouper object (returned in j) is an iterable which yields elements as long as they are the same. It doesn't know ahead of time what it will yield or how many elements there are. It just receives an iterable as input and then yields from that iterable. If you move on to the next "group", then the iterable ends up being consumed before you ever get a chance to look at the elements. This is a design decision to allow groupby to operate on iterables which yield arbitrary (even infinite) numbers of elements.
In Python 2.x, zip creates a list, effectively moving past each "group" before the loop even starts. In doing so, it consumes each of the "group" objects returned by groupby, which is why only the last group still has an element to report. The fix for Python 2.x is to use itertools.izip rather than zip. In Python 3.x, izip became the builtin zip. As I see it, the only way to support both in this script is via something like:
from __future__ import print_function
from itertools import groupby
try:
    from itertools import izip
except ImportError:  # Python 3.x
    izip = zip

for (i,j),k in izip(groupby([1,1,2,2,3,3]), [4,5,6]):
    print(i, list(j), k)

Related

Removing earlier duplicates from a list and keeping order

I want to define a function that takes a list as an argument and removes all duplicates from the list except the last one.
For example:
remove_duplicates([3,4,4,3,6,3]) should return [4,6,3]. The answers to the other post do not solve this one.
The function should remove each element if it also occurs later in the list.
This is my code:
def remove(y):
    for x in y:
        if y.count(x) > 1:
            y.remove(x)
    return y
and for this list:
[1,2,1,2,1,2,3] I am getting this output:
[2,1,2,3]. The expected output is [1,2,3].
Where am I going wrong and how do I fix it?
The other post does actually answer the question, but there's an extra step: reverse the input then reverse the output. You could use reversed to do this, with an OrderedDict:
from collections import OrderedDict

def remove_earlier_duplicates(sequence):
    d = OrderedDict.fromkeys(reversed(sequence))
    return reversed(d)
The output is a reversed iterator object for greater flexibility, but you can easily convert it to a list.
>>> list(remove_earlier_duplicates([3,4,4,3,6,3]))
[4, 6, 3]
>>> list(remove_earlier_duplicates([1,2,1,2,1,2,3]))
[1, 2, 3]
BTW, your remove function doesn't work because you're changing the size of the list as you're iterating over it, meaning certain items get skipped.
I found this way to do it after a bit of research. #wjandrea provided me with the fromkeys method idea and helped me out a lot.
def retain_order(arr):
    return list(dict.fromkeys(arr[::-1]))[::-1]
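For example (note this relies on plain dicts preserving insertion order, which is only guaranteed from Python 3.7 on):

```python
def retain_order(arr):
    # reverse, dedupe keeping the first occurrence (= last in the original), reverse back
    return list(dict.fromkeys(arr[::-1]))[::-1]

print(retain_order([3, 4, 4, 3, 6, 3]))     # [4, 6, 3]
print(retain_order([1, 2, 1, 2, 1, 2, 3]))  # [1, 2, 3]
```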

Multiprocessing iteration without repeating duplicates in Python

In python, I have a large list of numbers (~0.5 billion items) with some duplicates in it, for example:
[1,1,2,2,3,4,5,6,6]
I want to apply a function over these numbers to create a dictionary, keyed by the number, giving the function result. So if the function is simply (say) lambda x: x*10, I want a dictionary like:
{1:10, 2:20, 3:30, 4:40, 5:50, 6:60}
The thing is, I want to use the Python multiprocessing module to do this (I don't care in what order the functions are run), and I don't really want to make the list of numbers unique beforehand: I'd prefer to check when iterating over the numbers whether there's a duplicate, and if so, not add the calculation to the multiprocessing pool or queue.
So should I use something like multiprocessing.Pool.imap_unordered for this, and check for previously visited values, e.g.:
import multiprocessing
import time

def f(x):
    print(x)
    time.sleep(0.1)
    return x, x*10.0

input = [1, 1, 2, 2, 3, 4, 5, 6, 6]
result = {}

def unique_everseen(iterable):
    for element in iterable:
        if element not in result:
            result[element] = None  # Mark this value as being processed
            yield element

with multiprocessing.Pool(processes=2) as pool:
    for k, v in pool.imap_unordered(f, unique_everseen(input)):
        result[k] = v
I ask, as it seems a little hacky to use the result dictionary to also check whether we have visited a value before (I've done this to save creating a separate set of half a billion items just to check for duplicates). Is there a more Pythonic way to do this, perhaps adding the items to a Queue or something? I haven't used multiprocessing much before, so perhaps I'm doing this wrong and, e.g., opening myself up to race conditions?
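For what it's worth, the seen-tracking can at least be decoupled from result by passing an explicit set. A minimal sketch of just that generator (the Pool usage would stay the same):

```python
def unique_everseen(iterable, seen=None):
    # yield each value at most once, recording membership in a separate set
    seen = set() if seen is None else seen
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element

print(list(unique_everseen([1, 1, 2, 2, 3, 4, 5, 6, 6])))  # [1, 2, 3, 4, 5, 6]
```

Whether the extra set is acceptable for half a billion items is exactly the trade-off the question raises.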

How to print list of strings with lambda?

I have a list of strings that print out just fine using a normal loop:
for x in listing:
    print(x)
I thought it should be pretty simple to use a lambda to reduce the loop syntax, and kickstart my learning of lambdas in Python (I'm pretty new to Python).
Given that the syntax for map is map(function, iterable, ...) I tried:
map(lambda x: print(x), listing)
But this does not print anything out (it also does not produce an error). I've done some searching through material online but everything I have found to date is based on Python 2, namely mentioning that with Python 2 this isn't possible but that it should be in Python 3, without explicitly mentioning how so.
What am I doing wrong?
In Python 3, map returns an iterator:
>>> map(print, listing)
<map object at 0x7fabf5d73588>
This iterator is lazy, which means that it won't do anything until you iterate over it. Once you do iterate over it, you get the values of your list printed:
>>> listing = [1, 2, 3]
>>> for _ in map(print, listing):
...     pass
...
1
2
3
What this also means is that map isn't the right tool for the job. map creates an iterator, so it should only be used if you're planning to iterate over that iterator. It shouldn't be used for side effects, like printing values. See also When should I use map instead of a for loop.
I wouldn't recommend using map here, as you don't really care about the iterator. If you want to simplify the basic "for loop", you could instead use str.join():
>>> mylist = ['hello', 'there', 'everyone']
>>> print('\n'.join(mylist))
hello
there
everyone
Or if you have a non-string list:
>>> mylist = [1, 2, 3, 4]
>>> print('\n'.join(map(str, mylist)))
1
2
3
4
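As a side note, print itself can take the unpacked list along with a separator, assuming Python 3:

```python
listing = ['hello', 'there', 'everyone']
# unpack the list into separate arguments; sep joins them with newlines
print(*listing, sep='\n')
```

This prints each string on its own line without building a joined string by hand.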

Python consume an iterator pair-wise

I am trying to understand Python's iterators in the context of the pysam module. Using the fetch method of the so-called AlignmentFile class, one gets a proper iterator iter consisting of records from the file file. I can then use various methods to access each record, for instance the name with query_name:
import pysam

iter = pysam.AlignmentFile(file, "rb", check_sq=False).fetch(until_eof=True)
for record in iter:
    print(record.query_name)
It happens that records come in pairs so that one would like something like:
while True:
    r1 = iter.__next__()
    r2 = iter.__next__()
    print(r1.query_name)
    print(r2.query_name)
Calling next() by hand is probably not the right way for millions of records, but how can one use a for loop to consume the same iterator in pairs? I looked at the grouper recipe from itertools and the SO questions Iterate an iterator by chunks (of n) in Python? [duplicate] (even a duplicate!) and What is the most “pythonic” way to iterate over a list in chunks? but cannot get it to work.
First of all, don't use the variable name iter, because that's already the name of a builtin function.
To answer your question, simply use itertools.izip (Python 2) or zip (Python 3) on the iterator.
Your code may look as simple as
for next_1, next_2 in zip(iterator, iterator):
    # stuff
Also, consider itertools.izip_longest if you deal with iterators that could yield an uneven number of objects:
>>> from itertools import izip_longest
>>> iterator = (x for x in (1,2,3))
>>>
>>> for next_1, next_2 in izip_longest(iterator, iterator):
...     next_1, next_2
...
(1, 2)
(3, None)
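In Python 3 the same function is named itertools.zip_longest, and a fillvalue can replace the default None; a quick sketch:

```python
from itertools import zip_longest

iterator = iter((1, 2, 3))
# both arguments are the same iterator, so items are taken pairwise
print(list(zip_longest(iterator, iterator, fillvalue=0)))  # [(1, 2), (3, 0)]
```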

Python how to iterate two infinite generators at the same time?

I'd like to iterate over a number of infinite generators:
def x(y):
    while True:
        for i in xrange(y):
            yield i

for i,j in zip(x(5),x(3)):
    print i,j
The code above will produce nothing. What am I doing wrong?
That's because Python 2 zip tries to create a list by getting all the elements the generator will ever produce. What you want is an iterator, i.e. itertools.izip.
In Python 3 zip works like izip.
zip is not the right tool for generators. Try itertools.izip instead!
(Or even better, use Python 3, where your code works fine - once you add parentheses to the print)
You just need to use a variant of zip that returns an iterator instead of a list. Fortunately, there's one of them in the itertools module.
import itertools

def x(y):
    while True:
        for i in xrange(y):
            yield i

for i,j in itertools.izip(x(5),x(3)):
    print i,j
Note that in Python 3, itertools.izip doesn't exist because the vanilla zip is already an iterator.
Also in itertools there's a function called cycle which infinitely cycles over an iterable.
Make an iterator returning elements from the iterable and saving a
copy of each. When the iterable is exhausted, return elements from the
saved copy. Repeats indefinitely.
So itertools.cycle(range(5)) does the same thing as your x(5); you can also pass xrange(5) to cycle, it's not fussy. ;)
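For instance, paired with itertools.islice to take a finite slice of the infinite cycle:

```python
from itertools import cycle, islice

# cycle(range(5)) repeats 0..4 forever; islice takes the first 12 items
print(list(islice(cycle(range(5)), 12)))  # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1]
```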
From what I understood, you are trying to iterate over two iterators simultaneously. You can always use a while loop if nothing else works.
gen1 = x(5)
gen2 = x(3)
while True:
    try:
        print(next(gen1), next(gen2))
    except StopIteration:
        break
If you're using Python 3.3 or above, the inner loop of x can also be refactored with yield from (note that the outer while True must stay, and range replaces Python 2's xrange):
def x(y):
    while True:
        yield from range(y)
