Assuming I have a generator yielding hashable values (str / int etc.) is there a way to prevent the generator from yielding the same value twice?
I'm using a generator precisely so that I don't have to unpack all the values first, so something like yield from set(some_generator) is not an option, since that would consume the entire generator.
Example:
# Current result
for x in my_generator():
    print(x)
>>> 1
>>> 17
>>> 15
>>> 1 # <-- This shouldn't be here
>>> 15 # <-- This neither!
>>> 3
>>> ...
# Wanted result
for x in my_no_duplicate_generator():
    print(x)
>>> 1
>>> 17
>>> 15
>>> 3
>>> ...
What's the most Pythonic solution for this?
There is a unique_everseen recipe in the Python itertools documentation that is roughly equivalent to @NikosOikou's answer.
The main drawback of these solutions is that they rely upon the hypothesis that elements of the iterable are hashable:
>>> L = [[1], [2,3], [1]]
>>> seen = set()
>>> for e in L: seen.add(e)
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
The more-itertools module refines the implementation to accept unhashable elements, and the docs give a tip on how to keep good speed in some cases (disclaimer: I'm the "author" of the tip).
You can check the source code.
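To make the idea concrete, here is a minimal sketch of a deduplicating generator that falls back to a list when an element is unhashable. It only illustrates the general approach and is not the more-itertools source; the name unique_everseen_sketch and the two container names are my own. Lookups in the fallback list are linear, so passing a key that maps items to something hashable (tuple, for lists) keeps the fast set path in use.

```python
def unique_everseen_sketch(iterable, key=None):
    """Yield unique elements in first-seen order (illustrative sketch only)."""
    seen_hashable = set()    # fast path for hashable keys
    seen_unhashable = []     # slow fallback for unhashable keys
    for element in iterable:
        k = element if key is None else key(element)
        try:
            # Membership tests on a set raise TypeError for unhashable keys.
            if k not in seen_hashable:
                seen_hashable.add(k)
                yield element
        except TypeError:
            if k not in seen_unhashable:
                seen_unhashable.append(k)
                yield element


print(list(unique_everseen_sketch([[1], [2, 3], [1]])))             # [[1], [2, 3]]
print(list(unique_everseen_sketch([[1], [2, 3], [1]], key=tuple)))  # [[1], [2, 3]]
```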
You can try this:
def my_no_duplicate_generator(iterable):
    seen = set()
    for x in iterable:
        if x not in seen:
            yield x
            seen.add(x)
You can use it by passing your generator as an argument:
for x in my_no_duplicate_generator(my_generator()):
    print(x)
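As a quick check that this stays lazy, the sketch below feeds it an endless source and takes only the first five unique values. The source noisy_counter is a made-up example, not something from the question; the point is that the wrapper never tries to exhaust its input.

```python
from itertools import islice


def my_no_duplicate_generator(iterable):
    seen = set()
    for x in iterable:
        if x not in seen:
            seen.add(x)
            yield x


def noisy_counter():
    """Hypothetical endless source that keeps repeating earlier values."""
    n = 0
    while True:
        yield n % 3   # repeats 0, 1, 2 forever
        yield n       # plus a steadily growing value
        n += 1


# islice stops after five unique values; the endless source is never exhausted.
first_five = list(islice(my_no_duplicate_generator(noisy_counter()), 5))
print(first_five)  # [0, 1, 2, 3, 4]
```

Note that the seen set still grows with the number of distinct values, so memory use is bounded by the unique values rather than by the length of the stream.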
This is an exercise on Kaggle/Python/Strings and Dictionaries. I wasn't able to solve it so I peeked at the solution and tried to write it in a way I would do it (i.e. not necessarily as sophisticated but in a way I understood). I use Python tutor to visualise what's going on behind the code and understand most things but the for-loop is getting me.
normalised = (token.strip(",.").lower() for token in tokens)
This works and gives me index [0], but if I rewrite it as:
for token in tokens:
    normalised = token.strip(",.").lower()
it doesn't work; it gives me indices [0, 2] (presumably because casino is in casinoville). Can someone write the multi-line equivalent, i.e. for token in tokens: ...?
code is below for a bit more context.
def word_search(doc_list, keyword):
    """
    Takes a list of documents (each document is a string) and a keyword.
    Returns list of the index values into the original list for all documents
    containing the keyword.

    Example:
    doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
    >>> word_search(doc_list, 'casino')
    >>> [0]
    """
    indices = []
    counter = 0
    for doc in doc_list:
        tokens = doc.split()
        normalised = (token.strip(",.").lower() for token in tokens)  # <-- the line in question
        if keyword.lower() in normalised:
            indices.append(counter)
        counter += 1
    return indices
#Test - output should be [0]
doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
keyword = 'Casino'
print(word_search(doc_list,keyword))
normalised = (token.strip(",.").lower() for token in tokens) is a generator expression: it creates a generator, not a tuple. Let's explore this:
>>> a = [1,2,3]
>>> [x**2 for x in a]
[1, 4, 9]
This is a list comprehension. The multi-line equivalent is:
>>> a = [1,2,3]
>>> b = []
>>> for x in a:
...     b.append(x**2)
...
>>> print(b)
[1, 4, 9]
Using parentheses instead of square brackets does not return a tuple (as one might suspect naively, as I did earlier), but a generator:
>>> a = [1,2,3]
>>> (x**2 for x in a)
<generator object <genexpr> at 0x0000024BD6E33B48>
We can iterate over this object with next:
>>> a = [1,2,3]
>>> b = (x**2 for x in a)
>>> next(b)
1
>>> next(b)
4
>>> next(b)
9
>>> next(b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
The same generator can be written as a multi-line generator function like this:
>>> a = [1,2,3]
>>> def my_iterator(x):
...     for k in x:
...         yield k**2
...
>>> b = my_iterator(a)
>>> next(b)
1
>>> next(b)
4
>>> next(b)
9
>>> next(b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
In the original example, an in comparison is used. This works for both the list and the generator, but for the generator it only works once, because the membership test consumes the generator's items as it searches:
>>> a = [1,2,3]
>>> b = [x**2 for x in a]
>>> 9 in b
True
>>> 5 in b
False
>>> b = (x**2 for x in a)
>>> 9 in b
True
>>> 9 in b
False
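To see how much of a generator a membership test uses up, the following sketch (with a slightly longer sample list of my own) checks what is left afterwards. A successful test stops at the match and leaves the remaining items; a failed test runs the generator to the end.

```python
a = [1, 2, 3, 4]

squares = (x**2 for x in a)
found = 4 in squares        # consumes 1 and 4, then stops at the match
leftover = list(squares)    # whatever was not consumed yet
print(found, leftover)      # True [9, 16]

squares = (x**2 for x in a)
absent = 5 in squares       # no match, so every item is consumed
after_miss = list(squares)
print(absent, after_miss)   # False []
```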
Here is a discussion of the issue with generator reset: Resetting generator object in Python
I hope that clarified the differences between list comprehensions, generators and multi-line loops.
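Applied to the question, a multi-line equivalent has to collect every normalised token, not just the last one. In the asker's rewrite, normalised ends up holding a single string (the last token), so in performs a substring check, and 'casino' in 'casinoville' is True; that is where the extra index 2 comes from. Below is a sketch of the explicit-loop version; the name word_search_loop and the use of enumerate instead of a manual counter are my own choices.

```python
def word_search_loop(doc_list, keyword):
    """Same idea as word_search, with the generator expression unrolled."""
    indices = []
    for index, doc in enumerate(doc_list):
        normalised = []                      # collect ALL tokens of this document
        for token in doc.split():
            normalised.append(token.strip(",.").lower())
        if keyword.lower() in normalised:    # list membership: whole-word match
            indices.append(index)
    return indices


doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
print(word_search_loop(doc_list, "Casino"))   # [0]
print("casino" in "casinoville")              # True: substring check on a str
print("casino" in ["casinoville"])            # False: element check on a list
```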
first line of code:
for i in list:
print(i)
second line of code:
print(i for i in list)
what would I use each of them for?
You can see for yourself what the difference is.
The first one iterates over the range and prints each integer.
>>> for i in range(4):
...     print(i)
...
0
1
2
3
The second one is a generator expression.
>>> print(i for i in range(4))
<generator object <genexpr> at 0x10b6c20f0>
Here is how iteration works with a generator. Python generators are a simple way of creating iterators.
Simply speaking, a generator expression (or a call to a generator function) returns an iterator object that we can iterate over one value at a time.
>>> g=(i for i in range(4))
>>> print(g)
<generator object <genexpr> at 0x100f015d0>
>>> print(next(g))
0
>>>
>>> print(next(g))
1
>>> print(next(g))
2
>>> print(next(g))
3
>>> print(next(g))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
>>> g=(i for i in range(4))
>>> for i in g:
...     print(i)
...
0
1
2
3
>>> for i in g:
...     print(i)
...
>>>
>>>
In Python 3, you can use iterable unpacking (the * operator) to print the generator's values, if that's what you were going for.
>>> print(*(i for i in range(4)))
0 1 2 3
The first code snippet will iterate over your list and print the value of i for each pass through the loop. In most cases you will want to use something like this to print the values in a list:
my_list = list(range(5))
for i in my_list:
    print(i)
0
1
2
3
4
The second snippet evaluates the expression passed to print and prints the result. Since that expression, i for i in my_list, is a generator expression, what gets printed is the string representation of the generator object. I cannot think of any real-world case where that is the result you would want.
my_list = list(range(5))
print(i for i in my_list)
<generator object <genexpr> at 0x0E9EB2F0>
The first way is just a loop going through a list and printing the elements one by one:
l = [1, 2, 3]
for i in l:
    print(i)
output:
1
2
3
A list comprehension, which is written with square brackets, creates a list you can store (dict and set comprehensions similarly create dictionaries and sets):
l = [i for i in l]
print( l ) #[1, 2, 3]
print( l[0] ) #1
print( l[1:] ) #[2, 3]
output:
[1, 2, 3]
1
[2, 3]
A list comprehension is useful for applying one operation to all the elements, e.g. turning all elements from strings into ints:
l = ['1', '2', '3']
l = [int(i) for i in l] #now the list is [1, 2, 3]
Loops are better when each iteration does a lot of things:
for i in range(4):
    # Code
    # more code
    # lots more code
    pass
In response to the (since edited) answers that suggested otherwise:
the second one is NOT a list comprehension.
the two code snippets do NOT do the same thing.
>>> x = [1,2,3,4,5]
>>> print(i for i in x)
<generator object <genexpr> at 0x000002322FA1BA50>
The second one is printing a generator object because (i for i in x) is a generator. The first snippet simply prints the elements in the list one at a time.
BTW: don't use list as a variable name. It's the name of a built-in type in Python, so when you use it as a variable name, you shadow that type's constructor. The built-in list() is not erased, but it is no longer reachable under that name in your scope.
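A small sketch of what that shadowing looks like in practice. The function shadowing_demo is a made-up example; the built-in is still available through the builtins module and outside the scope where the name was reused.

```python
import builtins


def shadowing_demo():
    list = [1, 2, 3]          # local name hides the built-in inside this function
    try:
        list("abc")           # now tries to call a list instance
    except TypeError:
        return "shadowed"     # 'list' object is not callable
    return "not shadowed"


result = shadowing_demo()
print(result)                    # shadowed
print(builtins.list("abc"))      # ['a', 'b', 'c']
print(list("abc"))               # ['a', 'b', 'c'] (name is untouched out here)
```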
The second one is a generator expression. Both give the same values if you convert the generator expression to a list comprehension. In short, a list comprehension is a concise (and often slightly faster) way to build a list than an explicit loop, while a generator expression saves memory by producing its values one at a time. Both are best suited to small pieces of logic, generally one or two lines.
For more information, see this link on official python website - https://docs.python.org/3/tutorial/datastructures.html?highlight=list%20comprehensions
>>> a = filter(lambda x: x^1, [1,2])
>>> list(a)
[2]
>>> list(a)
[]
It's quite counter intuitive, isn't it? So if anyone has an explanation for why it is like that, feel free!
I am using Python 3.8.2 by the way
Since you knew to wrap the result of the call to filter with list(), I am assuming you are familiar with generators and their ilk. The filter function returns an iterator that, like a generator, can only be iterated once. See below:
>>> a = filter(lambda x: x^1, [1,2])
>>> type(a)
<class 'filter'>
>>> it = iter(a)
>>> next(it)
2
>>> next(it)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
>>> it = iter(a) # try to iterate the filter a second time
>>> next(it) # you will get a StopIteration exception the very first time
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
>>>
The above code is essentially equivalent to:
a = filter(lambda x: x^1, [1,2])
print(type(a))
for item in a:
    print(item)
for item in a:
    print(item)
a is an iterator, the items of which were consumed the first time you called list(a). Subsequent list(a) calls get nothing.
Similarly,
a = (i for i in range(10))
list(a)
[0, 1, ..., 9]
list(a)
[]
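One way to tell the two kinds of object apart, sketched with the values from the question: an iterator returns itself from iter(), so every loop shares the same position, whereas a list hands out a fresh iterator each time.

```python
numbers = [1, 2]
filtered = filter(lambda x: x ^ 1, numbers)

# A list is a re-iterable container: iter() creates a new iterator each time.
list_is_own_iterator = iter(numbers) is numbers        # False

# A filter object is already an iterator: iter() returns the same object.
filter_is_own_iterator = iter(filtered) is filtered    # True

first = list(filtered)    # [2]
second = list(filtered)   # [] (already exhausted)
print(list_is_own_iterator, filter_is_own_iterator, first, second)
```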
I prefer C# behavior that has distinct IEnumerable and IEnumerator interfaces.
I was playing with the map object and noticed that it didn't print if I do list() beforehand. When I viewed only the map beforehand, the printing worked. Why?
map returns an iterator and you can consume an iterator only once.
Example:
>>> a=map(int,[1,2,3])
>>> a
<map object at 0x1022ceeb8>
>>> list(a)
[1, 2, 3]
>>> next(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
>>> list(a)
[]
Another example where I consume the first element and create a list with the rest
>>> a=map(int,[1,2,3])
>>> next(a)
1
>>> list(a)
[2, 3]
As per the answer from @newbie, this is happening because you are consuming the map iterator before you use it. (Here is another great answer on this topic from @LukaszRogalski)
Example 1:
w = [[1,5,7],[2,2,2,9],[1,2],[0]]
m = map(sum,w) # map iterator is generated
list(m) # map iterator is consumed here (output: [13,15,3,0])
for v in m:
    print(v) # there is nothing left in m, so there's nothing to print
Example 2:
w = [[1,5,7],[2,2,2,9],[1,2],[0]]
m = map(sum,w) #map iterator is generated
for v in m:
    print(v) #map iterator is consumed here
# if you try and print again, you won't get a result
for v in m:
    print(v) # there is nothing left in m, so there's nothing to print
So you have two options here. If you only want to iterate over the results once, Example 2 will work fine. However, if you want to keep using m as a list in your code, you need to amend Example 1 like so:
Example 1 (amended):
w = [[1,5,7],[2,2,2,9],[1,2],[0]]
m = map(sum,w) # map iterator is generated
m = list(m) # map iterator is consumed here, but it is converted to a reusable list.
for v in m:
    print(v) # now you are iterating a list, so you should have no issue iterating
             # and reiterating to your heart's content!
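If you would rather not keep the whole list around, another possibility is to create a fresh map iterator whenever you need one. This is only a sketch; the helper name make_sums is hypothetical, and it recomputes the sums on every pass.

```python
w = [[1, 5, 7], [2, 2, 2, 9], [1, 2], [0]]


def make_sums():
    """Return a brand-new map iterator over w each time it is called."""
    return map(sum, w)


first_pass = list(make_sums())
second_pass = list(make_sums())   # a new iterator, so nothing is "used up"
print(first_pass)    # [13, 15, 3, 0]
print(second_pass)   # [13, 15, 3, 0]
```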
It's because map returns an iterator, which behaves like a generator in this respect. Here is a clearer example:
>>> gen=(i for i in (1,2,3))
>>> list(gen)
[1, 2, 3]
>>> for i in gen:
        print(i)
>>>
Explanation:
Converting it to a list loops through every element, so when you loop over it again afterwards it simply carries on from where it stopped, and there are no elements left.
so best thing to do is:
>>> M=list(map(sum,W))
>>> M
[13, 15, 3, 0]
>>> for i in M:
        print(i)
13
15
3
0
You can either use this:
list(map(sum,W))
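or this, which builds a set rather than a list (so duplicate sums are dropped and the order is not guaranteed):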
{*map(sum,W)}
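The two forms are not interchangeable. The sketch below uses sample data of my own (this answer does not show W) in which two rows have the same sum, to show what the set version loses.

```python
W = [[1, 2], [3], [0]]        # assumed sample data: two rows sum to 3

as_list = list(map(sum, W))   # keeps order and duplicates
as_set = {*map(sum, W)}       # unordered, duplicates collapsed

print(as_list)       # [3, 3, 0]
print(len(as_set))   # 2
```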
I have a generator defined like this:
def gen():
    r = [0]
    yield r
    r[0] = 1
    yield r
    r[0] = 2
    yield r
it will yield three lists of one element going from 0 to 2:
>>> a = gen()
>>> next(a)
[0]
>>> next(a)
[1]
>>> next(a)
[2]
>>> next(a)
Traceback (most recent call last):
File "<pyshell#313>", line 1, in <module>
next(a)
StopIteration
Now, when I go to make a list from the generator, I got this:
>>> list(gen())
[[2], [2], [2]]
That is, it seems to yield each time the very last computed value.
Is this a python bug or am I missing something?
It's not a bug, it does exactly what you told it to do. You're yielding the very same object several times, so you get several references to that object. The only reason you don't see three [2]s in your first snippet is that Python won't go back in time and change previous output to match when objects are mutated. Try storing the values you get when calling next explicitly in variables and check them at the end - you'll get the same result.
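The experiment suggested above can be sketched like this (variable names are mine): keep each value returned by next and inspect them only after the generator has finished.

```python
def gen():
    r = [0]
    yield r
    r[0] = 1
    yield r
    r[0] = 2
    yield r


a = gen()
first = next(a)
second = next(a)
third = next(a)

# All three names refer to the one list object that gen() keeps mutating.
same_object = first is second is third
snapshot = [first, second, third]
print(same_object)   # True
print(snapshot)      # [[2], [2], [2]]
```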
Such an iterator is only useful if no yielded value is used after the iterator is advanced another time. Therefore I'd generally avoid it, as it produces unexpected results when trying to pre-compute some or all results (this also means it breaks various useful tricks such as itertools.tee and iterable unpacking).
You want:
def gen():
    for i in (0,1,2):
        yield [i]
That will yield three lists, not one list three times.
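If the generator really does need to keep mutating one working list, an alternative is to yield a copy at each step. This is a sketch of that variation (gen_copies is a hypothetical name), not a replacement for the simpler version above.

```python
def gen_copies():
    r = [0]
    yield r.copy()   # hand out a snapshot, keep mutating the original
    r[0] = 1
    yield r.copy()
    r[0] = 2
    yield r.copy()


result = list(gen_copies())
print(result)   # [[0], [1], [2]]
```

Note that list.copy() is a shallow copy, which is enough here because the elements are integers.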