I recently had to debug some code that went something like this:
for key, group in itertools.groupby(csvGrid, lambda x: x[0]):
    value1 = sum(row[1] for row in group)
    value2 = sum(row[2] for row in group)
    results.append([key, value1, value2])
In every result set, value2 came out as 0. When I looked into it, I found that the first time the code iterated over group, it consumed it, so that the second time there were zero elements to iterate over.
Intuitively, I would expect group to be a list which can be iterated over an indefinite number of times, but instead it behaves like an iterator which can only be iterated once. Is there any good reason why this is the case?
itertools is an iterator library, and like just about everything else in the library, the itertools.groupby groups are iterators. There isn't a single function in all of itertools that returns a sequence.
The reasons the groupby groups are iterators are the same reasons everything else in itertools is an iterator:
It's more memory efficient.
The groups could be infinite.
You can get results immediately instead of waiting for the whole group to be ready.
Additionally, the groups are iterators because you might only want the keys, in which case materializing the groups would be a waste.
itertools.groupby is not intended to be an exact match for any LINQ construct, SQL clause, or other thing that goes by the name "group by". Its grouping behavior is closer to an extension of Unix's uniq command than what LINQ or SQL do, although the fact that it makes groups means it's not an exact match for uniq either.
As an example of something you could do with itertools.groupby that you couldn't with the other tools I've named, here's a run-length encoder:
from itertools import groupby

def runlengthencode(iterable):
    for key, group in groupby(iterable):
        yield (key, sum(1 for val in group))
Intuitively, I would expect group to be a list which can be iterated over an indefinite number of times, but instead it behaves like an iterator which can only be iterated once.
That's correct.
Is there any good reason why this is the case?
It's potentially more memory efficient: you don't need to build an entire list first and then store it in memory, only to then iterate over it. Instead, you can process the elements as you iterate.
It's potentially more CPU efficient: by not generating all data up front, e.g. by producing a list, you can bail out early: if you find a particular group which matches some predicate, you can stop iteration - no further work needs to be done.
The decision of whether you need all data and iterate it multiple times is not hardcoded by the callee but is left to the caller.
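For example, a caller that does need to iterate a group twice can materialize it explicitly. A minimal sketch of one way to fix the code in the question (same csvGrid and results as above):

import itertools

results = []
for key, group in itertools.groupby(csvGrid, lambda x: x[0]):
    rows = list(group)                    # materialize the group once
    value1 = sum(row[1] for row in rows)  # both sums now see all the rows
    value2 = sum(row[2] for row in rows)
    results.append([key, value1, value2])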
From the docs:
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list.
Interestingly, if you don't consume g yourself, groupby will consume it before yielding the next group.
>>> def vals():
...     for i in range(10):
...         print(i)
...         yield i
...
>>> for k, g in itertools.groupby(vals(), lambda x: x < 5):
...     print('processing group')
...
0
processing group
1
2
3
4
5
processing group
6
7
8
9
I got the same issue when trying to access a groupby-returned iterator multiple times.
Based on the Python 3 docs, the suggestion is to convert the group iterator to a list, so that it can be accessed later.
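A minimal sketch of that approach (data and keyfunc here stand in for whatever iterable and key function you pass to groupby):

import itertools

groups = [(key, list(group)) for key, group in itertools.groupby(data, keyfunc)]
# each stored list can now be iterated as many times as needed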
I am looking for an efficient python method to utilise a hash table that has two keys:
E.g.:
(1,5) --> {a}
(2,3) --> {b,c}
(2,4) --> {d}
Further I need to be able to retrieve whole blocks of entries, for example all entries that have "2" at the 0-th position (here: (2,3) as well as (2,4)).
In another post it was suggested to use list comprehension, i.e.:
sum(val for key, val in dict.items() if key[0] == 'B')
I learned that dictionaries are (probably?) the most efficient way to retrieve a value from a collection of key:value pairs. However, querying with an incomplete tuple key is a bit different from querying the whole key, where I either get a value or nothing. Can Python still return the matching values in time proportional to the number of key:value pairs that match? Or alternatively, is the tuple-keyed dictionary (plus list comprehension) better than using pandas.df.groupby() (which would occupy rather a lot of memory)?
The "standard" way would be something like
from random import randint

d = {(randint(1, 10), i): "something" for i in range(200)}

def byfilter(n, d):
    return list(filter(lambda x: x[0] == n, d.keys()))

byfilter(5, d)  # returns a list of tuples where x[0] == 5
In similar situations I have often used next() to iterate manually when I didn't need the full list.
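For example, to get just the first matching key (a sketch, reusing the d defined above):

first_match = next(filter(lambda x: x[0] == 5, d.keys()), None)  # first key with x[0] == 5, or None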
However there may be some use cases where we can optimize that. Suppose you need to do a couple or more accesses by key first element, and you know the dict keys are not changing meanwhile. Then you can extract the keys in a list and sort it, and make use of some itertools functions, namely dropwhile() and takewhile():
from itertools import dropwhile, takewhile

ls = [x for x in d.keys()]
ls.sort()  # I do not know why, but this seems faster than ls = sorted(d.keys())

def bysorted(n, ls):
    return list(takewhile(lambda x: x[0] == n, dropwhile(lambda x: x[0] != n, ls)))

bysorted(5, ls)  # returns the same list as above
This can be up to 10x faster in the best case (n=1 in my example) and take more or less the same time in the worst case (n=10), because we are trimming the number of iterations needed.
Of course you can do the same for accessing keys by x[1]; you just need to add a key parameter to the sort() call.
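For example, to search on the second element instead (a sketch, reusing the same d and the takewhile/dropwhile imports from above):

ls = [x for x in d.keys()]
ls.sort(key=lambda x: x[1])

def bysorted_second(n, ls):
    # keys with x[1] == n form one contiguous run in the sorted list
    return list(takewhile(lambda x: x[1] == n, dropwhile(lambda x: x[1] != n, ls)))

bysorted_second(5, ls)  # returns the keys whose second element is 5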
I am a Python beginner and was facing an issue with iterating over grouped data more than once. I understand that once consumed, an iterator can't be re-used, but is it possible to get multiple iterators from a single groupby()?
This answer ("Multiple Iterators") says that multiple iterators can be created over lists etc., but I don't understand how I can do the same for groupby.
What I am trying to do is as follows:
I have data that are (key, value) pairs and I want to group by key.
There is some special kind of data, based on the value part, in each group, and I want to extract these special pairs and process them separately.
After I am done I need to go back to the original data and process the remaining pairs (this is where I need the second iterator).
If you need to see my code, here is the basic layout of what I am doing, though I don't know if it is really required:
for current_vertex, group in groupby(data, itemgetter(0)):
    try:
        # Special data extraction
        matching = [int(value.rstrip().split(':')[0]) for key, value in group if CURRENT_NODE_IDENTIFIER in value]
        if len(matching) != 0:
            # Do something with the data extracted (some variables generated here -- say x, y, z)
            for key, value in group:
                if not CURRENT_NODE_IDENTIFIER in value:
                    # Do something with remaining key, value pairs (use x, y, z)
In case anyone is wondering the same, I resolved the problem by duplicating the iterator as described here:
How to duplicate an Iterator?
Since the group itself is an iterator, all I had to do was duplicate it:
from itertools import tee

# To duplicate an iterator, given the iterator group
group, duplicate_iterator = tee(group)
Don't forget to import the tee function from itertools. I don't know if this is the best way possible, but at least it works and gets the job done.
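Applied to the layout in the question, that looks roughly like this (a sketch; data and CURRENT_NODE_IDENTIFIER are as in the original code, and the processing bodies are placeholders):

from itertools import groupby, tee
from operator import itemgetter

for current_vertex, group in groupby(data, itemgetter(0)):
    group, duplicate_iterator = tee(group)
    # first pass: special data extraction
    matching = [int(value.rstrip().split(':')[0]) for key, value in group
                if CURRENT_NODE_IDENTIFIER in value]
    if matching:
        # second pass over the same group via the duplicate
        for key, value in duplicate_iterator:
            if CURRENT_NODE_IDENTIFIER not in value:
                pass  # process the remaining pairs here (use x, y, z)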
I have two lists of US state abbreviations (for example):
s1=['CO','MA','IN','OH','MA','CA','OH','OH']
s2=['MA','FL','CA','GA','MA','OH']
What I want to end up with is this (basically an ordered frequency table):
S=[['CA',2],['CO',1],['FL',1],['GA',1],['IN',1],['MA',4],['OH',4]]
The way I came up with was:
s3=s1+s2
S=[[x,s3.count(x)] for x in set(s3)]
This works great - though, tbh, I don't know that this is very memory efficient.
BUT... there is a catch.
s1+s2
...is too big to hold in memory, so what I'm doing is appending to s1 until it reaches a length of 10K (yes, resources are THAT limited), then summarizing it (using the list comprehension step above), deleting the contents of s1, and re-filling s1 with the next chunk of data (only represented as 's2' above for purpose of demonstration). ...and so on through the loop until it reaches the end of the data.
So with each iteration of the loop, I want to sum the 'base' list of lists 'S' with the current iteration's list of lists 's'. My question is, essentially, how do I add these:
(the current master data):
S=[['CA',1],['CO',1],['IN',1],['MA',2],['OH',3]]
(the new data):
s=[['CA',1],['FL',1],['GA',1],['MA',2],['OH',1]]
...to get (the new master data):
S=[['CA',2],['CO',1],['FL',1],['GA',1],['IN',1],['MA',4],['OH',4]]
...in some sort of reasonably efficient way. If this is better to do with dictionaries or something else, I am fine with that. What I can't do, unfortunately, is make use of ANY remotely specialized Python module -- all I have to work with is the most stripped-down version of Python 2.6 imaginable in a closed-off, locked-down, resource-poor Linux environment (hazards of the job). Any help is greatly appreciated!!
You can use itertools.chain to chain two iterators efficiently:
import itertools
import collections
counts = collections.Counter()
for val in itertools.chain(s1, s2): # memory efficient
counts[val] += 1
A collections.Counter object is a dict specialized for counting; if you know how to use a dict, you can use a collections.Counter. It also allows you to write the above more succinctly as:
counts = collections.Counter(itertools.chain(s1, s2))
Also note that the following construction:
S=[[x,s3.count(x)] for x in set(s3)]
happens to be very time inefficient, since you are calling s3.count in a loop and each call scans the whole list. This might not be too bad if len(set(s3)) << len(s3), though.
Note, you can do the chaining "manually" by doing something like:
it1 = iter(s1)
it2 = iter(s2)
for val in it1:
...
for val in it2:
...
You can run Counter.update as many times as you like, cutting your data into chunks that fit in memory or streaming them as you like.
import collections
counter = collections.Counter()
counter.update(['foo', 'bar'])
assert counter['foo'] == counter['bar'] == 1
counter.update(['foo', 'bar', 'foo'])
assert counter['foo'] == 3
assert counter['bar'] == 2
assert sorted(counter.items(), key=lambda rec: -rec[1]) == [('foo', 3), ('bar', 2)]
The last line uses negated count as the sorting key to make the higher counts come first.
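Applied to the chunked workflow described in the question, that might look like this (a sketch; read_next_chunk is a hypothetical stand-in for whatever produces the next batch of up to 10K abbreviations):

import collections

counts = collections.Counter()
while True:
    chunk = read_next_chunk()  # hypothetical: returns up to 10K abbreviations, or [] when the data is exhausted
    if not chunk:
        break
    counts.update(chunk)

# ordered frequency table in the same [state, count] shape as S
S = sorted([state, n] for state, n in counts.items())

(counts.most_common() is also available if you want the result ordered by count instead.)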
If even then your count structure does not fit in memory, you need a (disk-based) database such as Postgres, or more likely just a machine with more memory and a more efficient key-value store such as Redis.
So suppose I have an array of some elements. Each element has some number of properties.
I need to filter some subsets of values, determined by predicates, out of this list. These subsets can of course have intersections.
I also need to determine the amount of values in each such subset.
So using an imperative approach I could write code like the following, and it would have a running time of 2*n: one iteration to copy the array and another one to filter it and count the subset sizes.
from itertools import groupby
a = [{'some_number': i, 'some_time': str(i) + '0:00:00'} for i in range(10)]

# imperative style
wrong_number_count = 0
wrong_time_count = 0
for item in a[:]:
    if predicate1(item):
        delete_original(item, a)
        wrong_number_count += 1
    if predicate2(item):
        delete_original(item, a)
        wrong_time_count += 1
    update_some_data(item)

do_something_with_filtered(a)

def do_something_with_filtered(a, c1, c2):
    print('filtered a {}'.format(a))
    print('{} items had wrong number'.format(c1))
    print('{} items had wrong time'.format(c2))

def predicate1(x):
    return x['some_number'] < 3

def predicate2(x):
    return x['some_time'] < '50:00:00'
Somehow I can't think of a way to do that in Python in a functional style with the same running time.
In functional style I could probably have used groupby multiple times, or written a comprehension for each predicate, but that would obviously be slower than the imperative approach.
I think such a thing is possible in Haskell using stream fusion (am I right?)
But how do I do that in Python?
Python has strong support for "stream processing" in the form of its iterators, and what you ask seems just trivial to do. You just need a way to tie your predicates and their counters together; it could be a dictionary where the predicate itself is the key.
That said, a simple iterator function that takes in your predicate data structure, along with the data to be processed, could do what you want. The iterator would have the side effect of updating your data structure with the predicate counts. If you want "pure functions" you'd just have to duplicate the predicate information beforehand, and maybe pass and retrieve all predicate and counter values to the iterator (through the send method) for each element; I don't think it would be worth that level of purism.
With that, your code could be something along these lines:
from collections import OrderedDict

def predicate1(...):
    ...

...

def predicateN(...):
    ...

def do_something_with_filtered(item):
    ...

def multifilter(data, predicates):
    for item in data:
        for predicate in predicates:
            if predicate(item):
                predicates[predicate] += 1
                break
        else:
            yield item

def do_it(data):
    predicates = OrderedDict([(predicate1, 0), ..., (predicateN, 0)])
    for item in multifilter(data, predicates):
        do_something_with_filtered(item)
    for predicate, value in predicates.items():
        print("{} filtered out {} items".format(predicate.__name__, value))

a = ...
do_it(a)
(If you have to count an item for all predicates that it fails, then an obvious change from the "break" statement to a state flag variable is enough)
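A sketch of that variant, keeping the same predicates mapping as above (multifilter_count_all is a hypothetical name):

def multifilter_count_all(data, predicates):
    # variant of multifilter: count the item against every predicate it matches
    for item in data:
        matched = False
        for predicate in predicates:
            if predicate(item):
                predicates[predicate] += 1
                matched = True
        if not matched:
            yield item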
Yes, fusion in Haskell will often turn something written as two passes into a single pass. Though in the case of lists, it's actually foldr/build fusion rather than stream fusion.
That's not generally possible in languages that don't enforce purity, though. When side effects are involved, it's no longer correct to fuse multiple passes into one. What if each pass performed output? Unfused, you get all the output from each pass separately. Fused, you get the output from both passes interleaved.
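A tiny Python illustration of that difference (a sketch where each "pass" just prints):

data = [1, 2]

# two separate passes: all of A's output comes before any of B's
for x in data:
    print("A", x)
for x in data:
    print("B", x)
# A 1, A 2, B 1, B 2

# fused into a single pass: the output interleaves
for x in data:
    print("A", x)
    print("B", x)
# A 1, B 1, A 2, B 2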
It's possible to write a fusion-style framework in Python that will work correctly if you promise to only ever use it with pure functions. But I'm doubtful such a thing exists at the moment. (I'd love to be proven wrong, though.)
I have a list like:
["asdf-1-bhd","uuu-2-ggg","asdf-2-bhd","uuu-1-ggg","asdf-3-bhd"]
that I want to split into the two groups whose elements are equal after I remove the number:
"asdf-1-bhd", "asdf-2-bhd", "asdf-3-bhd"
"uuu-2-ggg", "uuu-1-ggg"
I have been using itertools.groupby with
for key, group in itertools.groupby(elements, key= lambda x : removeIndexNumber(x)):
but this does not work when the elements to be grouped are not consecutive.
I have thought about using list comprehensions, but this seems impossible since the number of groups is not fixed.
tl;dr:
I want to group stuff, two problems:
I don't know the number of chunks I will obtain
The elements that will be grouped into a chunk might not be consecutive
Why don't you think about it a bit differently? You can map everything into a dict:
import re
from collections import defaultdict
regex = re.compile(r'([a-z]+-)\d(-[a-z]+)')
t = ["asdf-1-bhd","uuu-2-ggg","asdf-2-bhd","uuu-1-ggg","asdf-3-bhd"]
maps = defaultdict(list)
for x in t:
    parts = regex.match(x).groups()
    maps[parts[0]+parts[1]].append(x)
Output of list(maps.values()):
[['asdf-1-bhd', 'asdf-2-bhd', 'asdf-3-bhd'], ['uuu-2-ggg', 'uuu-1-ggg']]
This is really fast because you don't have to compare one thing to another.
Edit:
On Thinking differently
Your original approach was to iterate through each item and compare them to one another. This is overcomplicated and unnecessary.
Let's consider what my code does. First it gets the stripped down version:
"asdf-1-bhd" -> "asdf--bhd"
"uuu-2-ggg" -> "uuu--ggg"
"asdf-2-bhd" -> "asdf--bhd"
"uuu-1-ggg" -> "uuu--ggg"
"asdf-3-bhd" -> "asdf--bhd"
You can already start to see the groups, and we haven't compared anything yet!
We now do a sort of reverse mapping. We take the stripped-down version on the right and make it a key, and we put each original value on the left into a list mapped by that key:
'asdf--bhd' -> ['asdf-1-bhd', 'asdf-2-bhd', 'asdf-3-bhd']
'uuu--ggg' -> ['uuu-2-ggg', 'uuu-1-ggg']
And there we have our groups defined by their common computed value (key). This will work for any amount of elements and groups.
Ok, simple solution (it must be too late over here):
Use itertools.groupby, but first sort the list.
As for the example given above:
elements = ["asdf-1-bhd","uuu-2-ggg","asdf-2-bhd","uuu-1-ggg","asdf-3-bhd"]
elements.sort(key=lambda x: removeIndexNumber(x))

for key, group in itertools.groupby(elements, key=lambda x: removeIndexNumber(x)):
    for element in group:
        pass  # do stuff
As you can see, the condition for sorting is the same as for grouping. That way, the elements that will eventually have to be grouped are first put into consecutive order. After this has been done, itertools.groupby can work properly.