Split list into chunks by condition - python

I have a list like:
["asdf-1-bhd","uuu-2-ggg","asdf-2-bhd","uuu-1-ggg","asdf-3-bhd"]
that I want to split into the two groups whose elements are equal after I remove the number:
"asdf-1-bhd", "asdf-2-bhd", "asdf-3-bhd"
"uuu-2-ggg" , uuu-1-ggg"
I have been using itertools.groupby with
for key, group in itertools.groupby(elements, key= lambda x : removeIndexNumber(x)):
but this does not work when the elements to be grouped are not consecutive.
I have thought about using list comprehensions, but this seems impossible since the number of groups is not fixed.
tl;dr:
I want to group stuff, two problems:
I don't know the number of chunks I will obtain
The elements that will be grouped into a chunk might not be consecutive

Why don't you think about it a bit differently? You can map everything into a dict:
import re
from collections import defaultdict
regex = re.compile(r'([a-z]+-)\d(-[a-z]+)')
t = ["asdf-1-bhd","uuu-2-ggg","asdf-2-bhd","uuu-1-ggg","asdf-3-bhd"]
maps = defaultdict(list)
for x in t:
    parts = regex.match(x).groups()
    maps[parts[0] + parts[1]].append(x)
Output (that is, list(maps.values())):
[['asdf-1-bhd', 'asdf-2-bhd', 'asdf-3-bhd'], ['uuu-2-ggg', 'uuu-1-ggg']]
This is really fast because you don't have to compare one thing to another.
Edit:
On Thinking differently
Your original approach was to iterate through each item and compare them to one another. This is overcomplicated and unnecessary.
Let's consider what my code does. First it gets the stripped down version:
"asdf-1-bhd" -> "asdf--bhd"
"uuu-2-ggg" -> "uuu--ggg"
"asdf-2-bhd" -> "asdf--bhd"
"uuu-1-ggg" -> "uuu--ggg"
"asdf-3-bhd" -> "asdf--bhd"
You can already start to see the groups, and we haven't compared anything yet!
We now do a sort of reverse mapping. We take each stripped value on the right and make it a key, and put each original string on the left into a list that is mapped by that key:
'asdf--bhd' -> ['asdf-1-bhd', 'asdf-2-bhd', 'asdf-3-bhd']
'uuu--ggg' -> ['uuu-2-ggg', 'uuu-1-ggg']
And there we have our groups defined by their common computed value (key). This will work for any number of elements and groups.

Ok, simple solution (it must be too late over here):
Use itertools.groupby, but first sort the list.
As for the example given above:
elements = ["asdf-1-bhd","uuu-2-ggg","asdf-2-bhd","uuu-1-ggg","asdf-3-bhd"]
elements.sort(key=lambda x: removeIndexNumber(x))
for key, group in itertools.groupby(elements, key=lambda x: removeIndexNumber(x)):
    for element in group:
        # do stuff
As you can see, the condition for sorting is the same as for grouping. That way, the elements that will eventually have to be grouped are first put into consecutive order. After this has been done, itertools.groupby can work properly.
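Since removeIndexNumber isn't defined in the post, here is a minimal self-contained sketch of the sort-then-group approach, assuming a regex-based key function that strips the numeric part (the function name and regex are just for illustration):
import itertools
import re

def removeIndexNumber(s):
    # strip the digits, e.g. "asdf-1-bhd" -> "asdf--bhd"
    return re.sub(r'\d+', '', s)

elements = ["asdf-1-bhd", "uuu-2-ggg", "asdf-2-bhd", "uuu-1-ggg", "asdf-3-bhd"]
elements.sort(key=removeIndexNumber)  # make equal keys consecutive
for key, group in itertools.groupby(elements, key=removeIndexNumber):
    print(key, list(group))
# asdf--bhd ['asdf-1-bhd', 'asdf-2-bhd', 'asdf-3-bhd']
# uuu--ggg ['uuu-2-ggg', 'uuu-1-ggg']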

Related

How to split a dataframe and select all possible pairs?

I have a dataframe that I want to separate in order to apply a certain function.
I have the fields df['beam'], df['track'], df['cycle'] and want to separate it by the unique values of each of these three. Then, I want to apply this function (it works between two individual dataframes) to each pair where df['track'] differs between the two. Also, the result doesn't change if you switch the order of the pair, so I'd like to avoid unnecessary calls to the function if possible.
I currently work it through with four nested for loops into an if conditional, but I'm absolutely sure there's a better, cleaner way.
I'd appreciate all help!
Edit: I ended up solving it like this:
I split the original dataframe into multiple sub-dataframes by using df.groupby() and turning the result into a dictionary
dfsplit = dict(list(df.groupby(['beam','track','cycle'])))
This gives a dictionary where the keys are all the unique ['beam','track','cycle'] combinations as tuples and the values are the corresponding sub-dataframes
I combined all possible ['beam','track','cycle'] pairs with the use of itertools.combinations()
keys=list(itertools.combinations(dfsplit.keys(),2))
This generates a list of 2-element tuples where each element is one ['beam','track','cycle'] tuple itself, and it doesn't include the tuple with the order swapped, so I avoid calling the function twice for what would be the same case.
I removed the combinations where 'track' was the same through a for loop
for k in keys.copy():
    if k[0][1]==k[1][1]:
        keys.remove(k)
Now I can call my function by looping through the list of combinations
for k in keys:
    function(dfsplit[k[0]],dfsplit[k[1]])
Step 3 is taking a long time, probably because I have a very large number of unique ['beam','track','cycle'] combinations so the list is very long, but also probably because I'm doing it sub-optimally. I'll keep the question open in case someone realizes a better way to do this last step.
EDIT 2:
Solved the problem with step 3, once again with itertools, just by doing
keys=list(itertools.filterfalse(lambda k : k[0][1]==k[1][1], keys))
itertools.filterfalse returns all elements of the list for which the given function returns false, so it's doing the same as the previous for loop but selecting the false cases instead of removing the true ones. It's very fast and I believe this solves my problem for good.
I don't know how to mark the question as solved so I'll just repeat the solution here:
dfsplit = dict(list(df.groupby(['beam','track','cycle'])))
keys=list(itertools.combinations(dfsplit.keys(),2))
keys=list(itertools.filterfalse(lambda k : k[0][1]==k[1][1], keys))
for k in keys:
    function(dfsplit[k[0]],dfsplit[k[1]])
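For reference, a self-contained sketch of this pipeline; the toy DataFrame and the placeholder function are invented here purely for illustration:
import itertools
import pandas as pd

# toy data, invented for illustration
df = pd.DataFrame({
    'beam':  [1, 1, 2, 2],
    'track': ['a', 'b', 'a', 'b'],
    'cycle': [1, 1, 1, 1],
    'value': [10, 20, 30, 40],
})

def function(df1, df2):
    # placeholder for the real pairwise computation
    return len(df1) + len(df2)

# dictionary mapping each unique (beam, track, cycle) tuple to its sub-dataframe
dfsplit = dict(list(df.groupby(['beam', 'track', 'cycle'])))

# all unordered pairs of group keys, keeping only pairs whose 'track' differs
keys = itertools.combinations(dfsplit.keys(), 2)
keys = list(itertools.filterfalse(lambda k: k[0][1] == k[1][1], keys))

results = [function(dfsplit[k[0]], dfsplit[k[1]]) for k in keys]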

Python pick a random value from hashmap that has a list as value?

So I have a defaultdict(list) hashmap, potential_terms:
potential_terms={9: ['leather'], 10: ['type', 'polyester'], 13:['hello','bye']}
What I want to output are the 2 values (words) with the lowest keys, so 'leather' is definitely the first output. But 'type' and 'polyester' both have k=10; when the key is the same, I want a random choice of either 'type' or 'polyester'.
What I did is:
out=[v for k,v in sorted(potential_terms.items(), key=lambda x:(x[0],random.choice(x[1])))][:2]
but when I print out I get :
[['leather'], ['type', 'polyester']]
My guess is of course that the problem is the 2nd part of the lambda function: random.choice(x[1]). Any ideas on how to make it work as expected by outputting either 'type' or 'polyester'?
Thanks
EDIT: See Karl's answer and comment as to why this solution isn't correct for OP's problem.
I leave it here because it does demonstrate what OP originally got wrong.
key= doesn't transform the data itself, it only tells sorted how to sort;
you want to apply choice to v when selecting it for the comprehension, like so:
out=[random.choice(v) for k,v in sorted(potential_terms.items())[:2]]
(I also moved the [:2] inside, to shorten the list before the comprehension)
Output:
['leather', 'type']
OR
['leather', 'polyester']
You have (with some extra formatting to highlight the structure):
out = [
    v
    for k, v in sorted(
        potential_terms.items(),
        key=lambda x: (x[0], random.choice(x[1]))
    )
][:2]
This means (reading from the inside out): sort the items according to the key, breaking ties using a random choice from the value list. Extract the values (which are lists) from those sorted items into a list (of lists). Finally, get the first two items of that list of lists.
This doesn't match the problem description, and is also somewhat nonsensical: since the keys are, well, keys, there cannot be duplicates, and thus there cannot be ties to break.
What we wanted: sort the items according to the key, then put all the contents of those individual lists next to each other to make a flattened list of strings, but randomizing the order within each sublist (i.e., shuffling those sublists). Then, get the first two items of that list of strings.
Thus, applying the technique from the link, and shuffling the sublists "inline" as they are discovered by the comprehension:
out = [
    term
    for k, v in sorted(
        potential_terms.items(),
        key=lambda x: x[0]  # this is not actually necessary now, since
                            # the natural sort order of the items will work
    )
    for term in random.sample(v, len(v))
][:2]
Please also see https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/ to understand how the list flattening and result ordering works in a two-level comprehension like this.
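As a quick illustration of that flattening order, the comprehension above behaves like this pair of nested loops (same names as above):
import random

potential_terms = {9: ['leather'], 10: ['type', 'polyester'], 13: ['hello', 'bye']}

out = []
for k, v in sorted(potential_terms.items()):
    for term in random.sample(v, len(v)):  # shuffled copy of each sublist
        out.append(term)
out = out[:2]
print(out)  # ['leather', 'type'] or ['leather', 'polyester']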
Instead of that, a simpler approach is:
d = list(potential_terms.values()), which stores all the values.
It will store the values as:
[['leather'], ['type', 'polyester'], ['hello', 'bye']]
You can access ['leather'] as d[0] and the list ['type', 'polyester'] as d[1]. Now we just use random.shuffle(d[1]) and take d[1][0]. (Note this relies on the dict's insertion order already matching the key order, as it does here.)
Which would get us a random word, type or polyester.
Final code should be like this:
import random
potential_terms={9: ['leather'], 10: ['type', 'polyester'], 13:['hello','bye']}
d = list(potential_terms.values())
random.shuffle(d[1])
c = []
c.append(d[0][0])
c.append(d[1][0])
Which gives the desired output,
either ['leather', 'polyester'] or ['leather', 'type'].

Tuple-key dictionary in python: Accessing a whole block of entries

I am looking for an efficient Python method to utilise a hash table whose keys have two components:
E.g.:
(1,5) --> {a}
(2,3) --> {b,c}
(2,4) --> {d}
Further I need to be able to retrieve whole blocks of entries, for example all entries that have "2" at the 0-th position (here: (2,3) as well as (2,4)).
In another post it was suggested to use list comprehension, i.e.:
sum(val for key, val in dict.items() if key[0] == 'B')
I have learned that dictionaries are (probably?) the most efficient way to retrieve a value from a collection of key:value pairs. However, querying only an incomplete tuple key is a bit different from querying the whole key, where I either get a value or nothing. Can Python still return the values in time proportional to the number of matching key:value pairs? Or alternatively, is the tuple-keyed dictionary (plus list comprehension) better than using pandas.df.groupby() (which would occupy rather more memory)?
The "standard" way would be something like
from random import randint

d = {(randint(1,10), i): "something" for i in range(200)}

def byfilter(n, d):
    return list(filter(lambda x: x[0] == n, d.keys()))

byfilter(5, d) ## returns a list of tuples where x[0] == 5
Although in similar situations I often used next() to iterate manually, when I didn't need the full list.
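For example, when only the first matching key is needed, a generator expression with next() avoids building the whole list (a small sketch reusing d from above):
first_match = next((k for k in d.keys() if k[0] == 5), None)  # None if no key matches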
However there may be some use cases where we can optimize that. Suppose you need to do a couple of accesses or more by the first element of the key, and you know the dict keys are not changing in the meantime. Then you can extract the keys into a list and sort it, and make use of some itertools functions, namely dropwhile() and takewhile():
from itertools import dropwhile, takewhile

ls = list(d.keys())
ls.sort() ## I do not know why, but this seems faster than ls = sorted(d.keys())

def bysorted(n, ls):
    return list(takewhile(lambda x: x[0] == n, dropwhile(lambda x: x[0] != n, ls)))

bysorted(5, ls) ## returns the same list as above
This can be up to 10x faster in the best case (n=1 in my example) and take more or less the same time in the worst case (n=10), because we are trimming the number of iterations needed.
Of course you can do the same for accessing keys by x[1]; you just need to add a key parameter to the sort() call, as sketched below.
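A minimal sketch of that variant, reusing d and the itertools imports from above (the helper name bysorted2 is just for illustration):
ls = list(d.keys())
ls.sort(key=lambda x: x[1])  # order by the second element of each key

def bysorted2(n, ls):
    return list(takewhile(lambda x: x[1] == n, dropwhile(lambda x: x[1] != n, ls)))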

itertools.groupby() cannot generate the correct result

Code:
import itertools
first_letter = lambda x: x[0]
names = ['Alan', 'Adam', 'Wes', 'Albert', 'Steven']
for letter, name in itertools.groupby(names, first_letter):
    print(letter, list(name))
Output:
A ['Alan', 'Adam']
W ['Wes']
A ['Albert']
S ['Steven']
I want to group by the first letter, but it does not seem to work well. What's wrong here?
As you would expect from any function in itertools, groupby operates on sequences of elements that share a common key. You have to remember that an iterator can be any source of sequential data, possibly one that doesn't store its own elements the way a list does.
What this means is that if the data is not already grouped within the iterator, groupby won't work the way you expect. Put another way, groupby starts another group whenever the key changes, regardless of whether the key has already appeared in the sequence or not.
Probably the easiest way to pre-group the data in your case is to sort it. Lists can be sorted in-place:
names = ['Alan', 'Adam', 'Wes', 'Albert', 'Steven']
names.sort()
for letter, name in itertools.groupby(names, first_letter):
    print(letter, list(name))
A similar result could be obtained by distributing your list into a dictionary. I use collections.defaultdict below because it makes adding new elements easier. You could use a regular dictionary just as easily:
import collections

grouped = collections.defaultdict(list)
for name in names:
    grouped[name[0]].append(name)
for letter, group in grouped.items():
    print(letter, group)
In either case, the point is that you can't expect groupby to do exactly what you want with the order of elements in your raw data.

Collapse list of lists to eliminate redundancy

I have a couple of long lists of lists of related objects that I'd like to group to reduce redundancy. Pseudocode:
>>>list_of_lists = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]...]
>>>remove_redundancy(list_of_lists)
[[1,2,3,4,8,9,10],[5,6,7]...]
So lists that share elements would be collapsed into a single list. Collapsing them is easy: once I find lists to combine I can make the lists into sets and take their union, but I'm not sure how to compare the lists. Do I need to do a series of for loops?
My first thought was that I should loop through and check whether each item in a sublist is in any of the other lists, if yes, merge the lists and then start over, but that seems terribly inefficient. I did some searching and found this: Python - dividing a list-of-lists to groups but my data isn't structured. Also, my actual data is a series of strings and thus not sortable in any meaningful sense.
I can write some gnarly looping code to make this work, but I was wondering if there are any built-in functions that would make this sort of comparison easier. Maybe something in list comprehensions?
I think this is a reasonably efficient way of doing it, if I understand your question correctly. The result here will be a list of sets.
Maybe the missing bit of knowledge was d & g (also written d.intersection(g)) for finding the set intersection, along with the fact that an empty set is "falsey" in Python.
data = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]]
result = []
for d in data:
    d = set(d)
    matched = [d]
    unmatched = []
    # first divide into matching and non-matching groups
    for g in result:
        if d & g:
            matched.append(g)
        else:
            unmatched.append(g)
    # then combine all matching groups into one group
    # while leaving unmatched groups intact
    result = unmatched + [set().union(*matched)]
print(result)
# [{5, 6, 7}, {1, 2, 3, 4, 8, 9, 10}]
We start with no groups at all (result = []). Then we take the first list from the data. We then check which of the existing groups intersect this list and which don't. Then we merge all of these matching groups along with the list (achieved by starting with matched = [d]). We don't touch the non-matching groups (though maybe some of these will end up being merged in a later iteration). If you add a line print(result) in each loop you should be able to see how it's built up.
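For the example data, that trace of result after each iteration looks like this:
after [1,2,3]:    [{1, 2, 3}]
after [3,4]:      [{1, 2, 3, 4}]
after [5,6,7]:    [{1, 2, 3, 4}, {5, 6, 7}]
after [1,8,9,10]: [{5, 6, 7}, {1, 2, 3, 4, 8, 9, 10}]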
The union of all the sets in matched is computed by set().union(*matched). For reference:
Pythonic Way to Create Union of All Values Contained in Multiple Lists
What does the Star operator mean?
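In short, the * unpacks the list of sets into separate arguments to union, so for example:
set().union(*[{1, 2}, {2, 3}, {4}])  # {1, 2, 3, 4}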
I assume that you want to merge lists that contain any common element.
Here is a function that checks efficiently (to the best of my knowledge) whether two lists contain at least one common element (according to the == operator):
import functools  # python 2.5+

def seematch(X, Y):
    return functools.reduce(lambda x, y: x | y,
                            functools.reduce(lambda x, y: x + y,
                                             [[k == l for k in X] for l in Y]))
It would be even faster if you used a reduce that could be interrupted when finding "true", as described here:
Stopping a Reduce() operation mid way. Functional way of doing partial running sum
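For reference, the built-in any() with a generator expression already short-circuits on the first match, so a simpler equivalent of seematch would be:
def seematch(X, Y):
    # stops at the first pair of equal elements
    return any(k == l for l in Y for k in X)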
I was trying to find an elegant way to iterate quickly once that is in place, but I think a good way would be simply looping once and creating another container that will hold the "merged" lists: you loop once over the lists contained in the original list, and for each of them over the new lists created so far in the proxy container.
Having said that - it seems there might be a much better option - see if you can do away with that redundancy by some sort of book-keeping on the previous steps.
I know this is an incomplete answer - hope that helped anyway!
