I am a Python beginner and ran into an issue with iterating over grouped data more than once. I understand that an iterator can't be reused once it has been consumed, but is it possible to get multiple iterators from a single groupby()?
This answer says that multiple iterators can be created over lists etc., but I don't understand how I can do the same for groupby.
Multiple Iterators
What I am trying to do is as follows:
I have data that are (key, value) pairs and I want to groupby key.
There is some special kind of data based on the value part in each
group and I want to extract these special pairs and process them
separately.
After I am done I need to go back to the original data and process
the remaining pairs (this is where I need the second iterator).
If you need to see my code here is the basic layout of what I am doing but I dunno if it is really required:
for current_vertex, group in groupby(data, itemgetter(0)):
    try:
        # Special data extraction
        matching = [int(value.rstrip().split(':')[0]) for key, value in group if CURRENT_NODE_IDENTIFIER in value]
        if len(matching) != 0:
            # Do something with the data extracted (some variables generated here -- say x, y z)
        for key, value in group:
            if not CURRENT_NODE_IDENTIFIER in value:
                # Do something with remaining key, value pairs (use x, y, z)
In case anyone is wondering the same, I resolved the problem by duplicating the iterator as described here:
How to duplicate an Iterator?
Since the group itself is an iterator all I had to do was duplicate it as:
# To duplicate an iterator given the iterator group
group, duplicate_iterator = tee(group)
Don't forget to import the tee function from itertools. I don't know if this is the best way possible, but at least it works and gets the job done.
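For anyone who wants to try this, here is a minimal sketch of the tee() approach on made-up data. The MARKER constant and the sample pairs are assumptions that stand in for CURRENT_NODE_IDENTIFIER and the real data.

```python
from itertools import groupby, tee
from operator import itemgetter

MARKER = "*"  # hypothetical stand-in for CURRENT_NODE_IDENTIFIER
data = [("a", "1:*"), ("a", "2:x"), ("b", "3:y"), ("b", "4:*")]  # made-up, already ordered by key

results = {}
for key, group in groupby(data, itemgetter(0)):
    # tee() hands back two independent iterators over the same group
    first_pass, second_pass = tee(group)
    # First pass: pull out the "special" pairs
    special = [int(value.split(":")[0]) for _, value in first_pass if MARKER in value]
    # Second pass: the group is still available through the duplicate
    rest = [value for _, value in second_pass if MARKER not in value]
    results[key] = (special, rest)
```

Since the first pass reads the whole group before the second pass starts, tee() ends up buffering the entire group anyway, so rows = list(group) followed by two loops over rows would probably do the same job with less machinery.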
I have a dataframe that I want to separate in order to apply a certain function.
I have the fields df['beam'], df['track'], df['cycle'] and want to separate the dataframe by the unique values of each of these three. Then, I want to apply this function (it works between two individual dataframes) to each pair of sub-dataframes for which df['track'] differs between the two. Also, the result doesn't change if you switch the order of the pair, so I'd like to avoid unnecessary calls to the function if possible.
I currently work it through with four nested for loops into an if conditional, but I'm absolutely sure there's a better, cleaner way.
I'd appreciate all help!
Edit: I ended up solving it like this:
1. I split the original dataframe into one sub-dataframe per group by using df.groupby() and collecting the result in a dictionary:
dfsplit=dict(list(df.groupby(['beam','track','cycle'])))
This generates a dictionary where the keys are all the unique ['beam','track','cycle'] combinations as tuples and the values are the matching sub-dataframes.
2. I combined all possible ['beam','track','cycle'] pairs with the use of itertools.combinations():
keys=list(itertools.combinations(dfsplit.keys(),2))
This generates a list of 2-element tuples where each element is one ['beam','track','cycle'] tuple itself, and it doesn't include the tuple with the order swapped, so I avoid calling the function twice for what would be the same case.
3. I removed the combinations where 'track' was the same through a for loop:
for k in keys.copy():
    if k[0][1]==k[1][1]:
        keys.remove(k)
4. Now I can call my function by looping through the list of combinations:
for k in keys:
    function(dfsplit[k[0]],dfsplit[k[1]])
Step 3 is taking a long time, probably because I have a very large number of unique ['beam','track','cycle'] combinations so the list is very long, but also probably because I'm doing it sub-optimally. I'll keep the question open in case someone realizes a better way to do this last step.
EDIT 2:
Solved the problem with step 3, once again with itertools, just by doing
keys=list(itertools.filterfalse(lambda k : k[0][1]==k[1][1], keys))
itertools.filterfalse returns an iterator over the elements for which the given function returns false, so it does the same as the previous for loop but selects the false cases instead of removing the true ones. It's very fast and I believe this solves my problem for good.
I don't know how to mark the question as solved so I'll just repeat the solution here:
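import itertools

# One sub-dataframe per unique (beam, track, cycle) tuple
dfsplit=dict(list(df.groupby(['beam','track','cycle'])))
keys=list(itertools.combinations(dfsplit.keys(),2))
# Keep only the pairs whose 'track' values differ
keys=list(itertools.filterfalse(lambda k : k[0][1]==k[1][1], keys))
for k in keys:
    function(dfsplit[k[0]],dfsplit[k[1]])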
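The pair selection itself doesn't depend on pandas, so it can be sketched with plain tuples. The group_keys below are made-up (beam, track, cycle) values and cross_track_pairs is a hypothetical helper name. Using a generator avoids building the full list of combinations before filtering, which may help when there are many groups.

```python
from itertools import combinations

# Made-up (beam, track, cycle) keys standing in for the real group keys
group_keys = [(1, "A", 0), (1, "B", 0), (2, "A", 1), (2, "C", 1)]

def cross_track_pairs(keys):
    """Lazily yield each unordered pair of keys whose track (index 1) differs."""
    return ((a, b) for a, b in combinations(keys, 2) if a[1] != b[1])

pairs = list(cross_track_pairs(group_keys))
```

In the real code the loop would iterate over cross_track_pairs(dfsplit.keys()) directly instead of materializing pairs.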
I am looking for an efficient python method to utilise a hash table that has two keys:
E.g.:
(1,5) --> {a}
(2,3) --> {b,c}
(2,4) --> {d}
Further I need to be able to retrieve whole blocks of entries, for example all entries that have "2" at the 0-th position (here: (2,3) as well as (2,4)).
In another post it was suggested to use list comprehension, i.e.:
sum(val for key, val in dict.items() if key[0] == 'B')
I learned that dictionaries are (probably?) the most efficient way to retrieve a value from an object of key:value-pairs. However, calling only an incomplete tuple-key is a bit different than querying the whole key where I either get a value or nothing. I want to ask if python can still return the values in a time proportional to the number of key:value-pairs that match? Or alternatively, is the tuple-dictionary (plus list comprehension) better than using pandas.df.groupby() (but that would occupy a bit much memory space)?
The "standard" way would be something like
from random import randint

d = {(randint(1, 10), i): "something" for i in range(200)}

def byfilter(n, d):
    # Full scan: keep only the keys whose first element equals n
    return list(filter(lambda x: x[0] == n, d.keys()))

byfilter(5, d)  # returns a list of tuples where x[0] == 5
Although in similar situations I often used next() to iterate manually, when I didn't need the full list.
However there may be some use cases where we can optimize that. Suppose you need to do a couple or more accesses by key first element, and you know the dict keys are not changing meanwhile. Then you can extract the keys in a list and sort it, and make use of some itertools functions, namely dropwhile() and takewhile():
from itertools import dropwhile, takewhile

ls = [x for x in d.keys()]
ls.sort()  # I do not know why but this seems faster than ls = sorted(d.keys())

def bysorted(n, ls):
    # Skip keys until the first match, then stop at the first non-match
    return list(takewhile(lambda x: x[0] == n, dropwhile(lambda x: x[0] != n, ls)))

bysorted(5, ls)  # returns the same list as above
This can be up to 10x faster in the best case (n=1 in my example) and takes more or less the same time in the worst case (n=10), because we are trimming the number of iterations needed.
Of course you can do the same for accessing keys by x[1]; you just need to add a key parameter to the sort() call.
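If lookups by first element are frequent, another option (not part of the approach above, just a common alternative) is to index the entries by the first key element up front, so that fetching a block costs roughly the size of the block rather than a scan over all keys. A minimal sketch using the example data from the question; the names table and by_first are made up:

```python
from collections import defaultdict

# The example entries from the question
table = {(1, 5): {"a"}, (2, 3): {"b", "c"}, (2, 4): {"d"}}

# Two-level index: first key element -> {second key element: value}
by_first = defaultdict(dict)
for (first, second), value in table.items():
    by_first[first][second] = value

# All entries with 2 at position 0; .get() avoids creating empty entries
block = by_first.get(2, {})
single = by_first[1][5]
```

The trade-off is the extra memory for the index and the need to keep it in sync if the table changes.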
I recently had to debug some code that went something like this:
for key, group in itertools.groupby(csvGrid, lambda x: x[0]):
    value1 = sum(row[1] for row in group)
    value2 = sum(row[2] for row in group)
    results.append([key, value1, value2])
In every result set, value2 came out as 0. When I looked into it, I found that the first time the code iterated over group, it consumed it, so that the second time there were zero elements to iterate over.
Intuitively, I would expect group to be a list which can be iterated over an indefinite number of times, but instead it behaves like an iterator which can only be iterated once. Is there any good reason why this is the case?
itertools is an iterator library, and like just about everything else in the library, the itertools.groupby groups are iterators. There isn't a single function in all of itertools that returns a sequence.
The reasons the groupby groups are iterators are the same reasons everything else in itertools is an iterator:
It's more memory efficient.
The groups could be infinite.
You can get results immediately instead of waiting for the whole group to be ready.
Additionally, the groups are iterators because you might only want the keys, in which case materializing the groups would be a waste.
itertools.groupby is not intended to be an exact match for any LINQ construct, SQL clause, or other thing that goes by the name "group by". Its grouping behavior is closer to an extension of Unix's uniq command than what LINQ or SQL do, although the fact that it makes groups means it's not an exact match for uniq either.
As an example of something you could do with itertools.groupby that you couldn't with the other tools I've named, here's a run-length encoder:
from itertools import groupby

def runlengthencode(iterable):
    for key, group in groupby(iterable):
        # Count the items in each run without storing them
        yield (key, sum(1 for val in group))
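To make the uniq comparison concrete, here is a small sketch on a made-up string. Note that the key 'a' shows up twice, because groupby only groups consecutive equal items.

```python
from itertools import groupby

# Each run of equal consecutive characters becomes one (key, length) pair
runs = [(key, len(list(group))) for key, group in groupby("aaabbaac")]
# runs == [('a', 3), ('b', 2), ('a', 2), ('c', 1)]
```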
Intuitively, I would expect group to be a list which can be iterated over an indefinite number of times, but instead it behaves like an iterator which can only be iterated once.
That's correct.
Is there any good reason why this is the case?
It's potentially more memory efficient: you don't need to build an entire list first and then store it in memory, only to then iterate over it. Instead, you can process the elements as you iterate.
It's potentially more CPU efficient: by not generating all data up front, e.g. by producing a list, you can bail out early: if you find a particular group which matches some predicate, you can stop iteration - no further work needs to be done.
The decision of whether you need all data and iterate it multiple times is not hardcoded by the callee but is left to the caller.
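A small sketch of the early bail-out point, assuming a hypothetical helper first_group_with_key. The source here is infinite (itertools.count), so building every group up front would never finish, while the lazy version stops as soon as the wanted group has been read.

```python
from itertools import count, groupby

def first_group_with_key(iterable, keyfunc, wanted):
    """Return the first group whose key equals wanted, as a list, or None."""
    for key, group in groupby(iterable, keyfunc):
        if key == wanted:
            # Only this group is materialized; the rest of the source is never read
            return list(group)
    return None

# count() yields 0, 1, 2, ... forever; n // 3 puts every three numbers in one group
found = first_group_with_key(count(), lambda n: n // 3, 2)
```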
From the docs
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list
Interestingly, if you don't consume g yourself, groupby will do it before returning the next iteration.
>>> import itertools
>>> def vals():
...     for i in range(10):
...         print(i)
...         yield i
...
>>> for k,g in itertools.groupby(vals(), lambda x: x<5):
...     print('processing group')
...
0
processing group
1
2
3
4
5
processing group
6
7
8
9
I ran into the same issue when trying to iterate over a group returned by groupby more than once.
The Python 3 docs suggest converting the iterator to a list so that it can be accessed later.
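Applied to the code from the question, that could look like the sketch below. The csv_grid rows are made up for illustration.

```python
from itertools import groupby

# Made-up rows, already ordered by the first column
csv_grid = [["a", 1, 10], ["a", 2, 20], ["b", 3, 30]]

results = []
for key, group in groupby(csv_grid, lambda row: row[0]):
    rows = list(group)  # store the group once so it can be iterated twice
    value1 = sum(row[1] for row in rows)
    value2 = sum(row[2] for row in rows)
    results.append([key, value1, value2])
```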
I'd like to get all unique values in a collection for a particular key in a MongoDB. I can loop through the entire collection to get them:
values = []
for item in collection.find():
    # Only store values that have not been seen yet
    if item['key'] not in values:
        values.append(item['key'])
But this seems incredibly inefficient, since I have to check every entry, and loop through the list each time (which gets slow as the number of values gets high). Alternatively, I can put all the values in a list and then make a set (which I think is faster, though I haven't tried to figure out how to test speed yet):
values = []
for item in collection.find():
    values.append(item['key'])
unique_values = set(values)
Or with a list comprehension:
unique_values = set([item['key'] for item in collection.find()])
But I'm wondering if there's a built-in function that wouldn't require looping through the entire collection (like if these values are stored in hash tables or something), or if there's some better way to get this.
The distinct() method does this. It returns an array (list) of the distinct values for the given key:
unique_values = collection.distinct("key")
MongoDB has a built-in method for this problem:
db.collection.distinct(FIELD)
I have a list like:
["asdf-1-bhd","uuu-2-ggg","asdf-2-bhd","uuu-1-ggg","asdf-3-bhd"]
that I want to split into the two groups whose elements are equal after I remove the number:
"asdf-1-bhd", "asdf-2-bhd", "asdf-3-bhd"
"uuu-2-ggg", "uuu-1-ggg"
I have been using itertools.groupby with
for key, group in itertools.groupby(elements, key= lambda x : removeIndexNumber(x)):
but this does not work when the elements to be grouped are not consecutive.
I have thought about using list comprehensions, but this seems impossible since the number of groups is not fixed.
tl;dr:
I want to group stuff, two problems:
I don't know the number of chunks I will obtain
The elements that will be grouped into a chunk might not be consecutive
Why don't you think about it a bit differently? You can map everything into a dict:
import re
from collections import defaultdict

# Capture the parts before and after the number
regex = re.compile(r'([a-z]+-)\d(-[a-z]+)')
t = ["asdf-1-bhd","uuu-2-ggg","asdf-2-bhd","uuu-1-ggg","asdf-3-bhd"]
maps = defaultdict(list)
for x in t:
    parts = regex.match(x).groups()
    maps[parts[0]+parts[1]].append(x)
print(list(maps.values()))
Output:
[['asdf-1-bhd', 'asdf-2-bhd', 'asdf-3-bhd'], ['uuu-2-ggg', 'uuu-1-ggg']]
This is really fast because you don't have to compare one thing to another.
Edit:
On Thinking differently
Your original approach was to iterate through each item and compare them to one another. This is overcomplicated and unnecessary.
Let's consider what my code does. First it gets the stripped down version:
"asdf-1-bhd" -> "asdf--bhd"
"uuu-2-ggg" -> "uuu--ggg"
"asdf-2-bhd" -> "asdf--bhd"
"uuu-1-ggg" -> "uuu--ggg"
"asdf-3-bhd" -> "asdf--bhd"
You can already start to see the groups, and we haven't compared anything yet!
We now do a sort of reverse mapping. We take everything on the right and make it a key, and we put each item on the left into the list that its right-hand value maps to:
'asdf--bhd' -> ['asdf-1-bhd', 'asdf-2-bhd', 'asdf-3-bhd']
'uuu--ggg' -> ['uuu-2-ggg', 'uuu-1-ggg']
And there we have our groups defined by their common computed value (key). This will work for any number of elements and groups.
Ok, simple solution (it must be too late over here):
Use itertools.groupby, but sort the list first.
As for the example given above:
elements = ["asdf-1-bhd","uuu-2-ggg","asdf-2-bhd","uuu-1-ggg","asdf-3-bhd"]
elements.sort(key=removeIndexNumber)
for key, group in itertools.groupby(elements, key=removeIndexNumber):
    for element in group:
        pass  # do stuff
As you can see, the condition for sorting is the same as for grouping. That way, the elements that will eventually have to be grouped are first put into consecutive order. After this has been done, itertools.groupby can work properly.
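Here is a runnable sketch of the same idea. The remove_index_number function below is only a hypothetical stand-in for the asker's removeIndexNumber, which isn't shown; it assumes the number always sits between two hyphens.

```python
import re
from itertools import groupby

def remove_index_number(name):
    # Hypothetical stand-in: "asdf-1-bhd" -> "asdf--bhd"
    return re.sub(r"-\d+-", "--", name)

elements = ["asdf-1-bhd", "uuu-2-ggg", "asdf-2-bhd", "uuu-1-ggg", "asdf-3-bhd"]
# Sort by the grouping key so equal keys become consecutive
elements.sort(key=remove_index_number)
groups = [list(group) for _, group in groupby(elements, key=remove_index_number)]
```

Because list.sort() is stable, elements within a group keep their original relative order. The sort costs O(n log n), so for large inputs the dict-based approach from the other answer may be cheaper.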