Grouping a grouped list of str without duplicates

I have a grouped list of strings that looks roughly like this. The inner lists always contain 5 elements:
text_list = [['aaa','bbb','ccc','ddd','eee'],
['fff','ggg','hhh','iii','jjj'],
['xxx','mmm','ccc','bbb','aaa'],
['fff','xxx','aaa','bbb','ddd'],
['aaa','bbb','ccc','ddd','eee'],
['fff','xxx','aaa','ddd','eee'],
['iii','xxx','ggg','jjj','aaa']]
The objective is simple: group all the lists that are similar, where two lists are similar if the first 3 elements of one all appear somewhere among the elements of the other.
So from the above example the output might look like this (output is the index of the list):
[[0,2,4],[3,5]]
Notice that a group containing the same elements as another group, just in a different order, is removed.
I've written the following code to extract the groups, but it returns duplicates and I am unsure how to proceed. I also suspect this is not the most efficient way to do the extraction, as the real list can contain millions of groups:
grouped_list = []
for i in range(0,len(text_list)):
    int_temp = []
    for m in range(0,len(text_list)):
        if i == m:
            continue
        bool_check = all( x in text_list[m] for x in text_list[i][0:3])
        if bool_check:
            if len(int_temp) == 0:
                int_temp.append(i)
                int_temp.append(m)
                continue
            int_temp.append(m)
    grouped_list.append(int_temp)
## remove index with no groups
grouped_list = [x for x in grouped_list if x != []]
Is there a better way to go about this? How do I remove the duplicate group afterwards? Thank you.
Edit:
To be clearer, I would like to retrieve the lists that are similar to each other, comparing only the first 3 elements of one list against all elements of the others. For example, take the first 3 elements of list A and check whether lists B, C, D... each contain all 3 of them. Repeat for every list, then remove any group that duplicates another group's elements.

You can build a set of frozensets to keep track of indices of groups with the first 3 items being a subset of the rest of the members:
groups = set()
sets = list(map(set, text_list))
for i, lst in enumerate(text_list):
    groups.add(frozenset((i, *(j for j, s in enumerate(sets) if set(lst[:3]) <= s))))
print([sorted(group) for group in groups if len(group) > 1])
If the input list is long, it would be faster to create a set of frozensets of the first 3 items of all sub-lists and use that set to filter all combinations of 3 items from each sub-list. Despite the overhead of generating combinations (a constant 10 per 5-element sub-list), the time complexity is then essentially linear in the length of the input list rather than quadratic:
from itertools import combinations
sets = {frozenset(lst[:3]) for lst in text_list}
groups = {}
for i, lst in enumerate(text_list):
    for c in map(frozenset, combinations(lst, 3)):
        if c in sets:
            groups.setdefault(c, []).append(i)
print([sorted(group) for group in groups.values() if len(group) > 1])
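For reuse, the combination-based approach can be wrapped in a function. The following is only a sketch under stated assumptions, not the answerer's exact code: the name group_by_prefix and its k parameter are made up for illustration, and the function additionally drops groups whose index lists are identical.

```python
from itertools import combinations

def group_by_prefix(rows, k=3):
    """Group row indices whose items contain the first k items of some row.

    Hypothetical helper: the name and the k parameter are illustrative.
    """
    # Every k-item prefix, stored as a frozenset so order does not matter.
    prefixes = {frozenset(row[:k]) for row in rows}
    groups = {}
    for i, row in enumerate(rows):
        # set() removes repeated combinations, so a row with repeated items
        # is never appended to the same group twice.
        for combo in set(map(frozenset, combinations(row, k))):
            if combo in prefixes:
                groups.setdefault(combo, []).append(i)
    # Keep groups with at least two members and drop repeated index lists.
    unique = {tuple(members) for members in groups.values() if len(members) > 1}
    return sorted(list(members) for members in unique)

text_list = [['aaa', 'bbb', 'ccc', 'ddd', 'eee'],
             ['fff', 'ggg', 'hhh', 'iii', 'jjj'],
             ['xxx', 'mmm', 'ccc', 'bbb', 'aaa'],
             ['fff', 'xxx', 'aaa', 'bbb', 'ddd'],
             ['aaa', 'bbb', 'ccc', 'ddd', 'eee'],
             ['fff', 'xxx', 'aaa', 'ddd', 'eee'],
             ['iii', 'xxx', 'ggg', 'jjj', 'aaa']]
print(group_by_prefix(text_list))  # [[0, 2, 4], [3, 5]]
```

Returning a sorted list makes the output order deterministic, which the set-based versions do not guarantee.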

Related

python efficient way to compare nested lists and append matches to new list

I wish to compare two nested lists. If there is a match between the first element of each sublist, I wish to add the matched element to a new list for further operations. Below is an example and what I've tried so far:
Example:
x = [['item1','somethingelse1'], ['item2', 'somethingelse2']...]
y = [['item1','somethingelse3'], ['item3','somethingelse4']...]
What I've tried so far:
match = []
for itemx in x:
    for itemy in y:
        if itemx[0] == itemy[0]:
            match.append(itemx)
The code above does append the matched items to the new list, but my two nested lists are very long and this approach is very slow on them. Is there a more efficient way to get the matched items between two nested lists?

Yes, use a data structure with constant-time membership testing. So, using a set, for example:
seen = set()
for first,_ in x:
    seen.add(first)
matched = []
for first,_ in y:
    if first in seen:
        matched.append(first)
Or, more succinctly using set/list comprehensions:
seen = {first for first,_ in x}
matched = [first for first,_ in y if first in seen]
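Since the question was later edited to append the whole sublist (itemx) rather than just its first element, the same set idea can be adapted. A minimal sketch, assuming every sublist has at least one element; the sample data is the question's example with the elided entries left out, and the name keys_in_y is illustrative:

```python
x = [['item1', 'somethingelse1'], ['item2', 'somethingelse2']]
y = [['item1', 'somethingelse3'], ['item3', 'somethingelse4']]

# Collect the first elements of y once; set membership tests are O(1) on average.
keys_in_y = {item[0] for item in y}

# Keep each sublist of x whose first element also starts some sublist of y.
match = [item for item in x if item[0] in keys_in_y]
print(match)  # [['item1', 'somethingelse1']]
```

Unlike the nested loop, this appends each sublist of x at most once, even when several sublists of y share the same first element.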

(This was before the OP changed the question from append(itemx[0]) to append(itemx)...)
>>> {a[0] for a in x} & {b[0] for b in y}
{'item1'}
Or if the inner lists are always pairs:
>>> dict(x).keys() & dict(y)
{'item1'}

IIUC using numpy:
import numpy as np
y=[l[0] for l in y]
x=np.array(x)
x[np.isin(x[:, 0], y)]

Find indexes of common items in two python lists

I have two lists in python list_A and list_B and I want to find the common item they share. My code to do so is the following:
both = []
for i in list_A:
    for j in list_B:
        if i == j:
            both.append(i)
The list both ends up containing the common items. However, I also want to return the indexes of those elements in the two original lists. How can I do so?

In Python it is generally advised to avoid explicit for loops when a better method is available. You can efficiently find the common elements of the two lists using a set:
both = set(list_A).intersection(list_B)
Then you can find the indices using the built-in index method, which returns the position of the first occurrence of each item:
indices_A = [list_A.index(x) for x in both]
indices_B = [list_B.index(x) for x in both]
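Note that list.index() scans the list on every call and reports only the first occurrence of each item. For long lists, one alternative is to build value-to-positions dictionaries first. This is a sketch, assuming the items are hashable; the helper name positions and the sample lists are made up for illustration:

```python
from collections import defaultdict

def positions(seq):
    """Map each value to the list of indexes where it occurs (hypothetical helper)."""
    index_map = defaultdict(list)
    for i, value in enumerate(seq):
        index_map[value].append(i)
    return index_map

list_A = ['a', 'b', 'c', 'b']
list_B = ['b', 'd', 'a']

pos_A, pos_B = positions(list_A), positions(list_B)
# Dictionary key views support set intersection directly.
common = {value: (pos_A[value], pos_B[value])
          for value in pos_A.keys() & pos_B.keys()}
print(common)
```

Each list is traversed once, and every occurrence of a shared item is recorded rather than only the first.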

Instead of iterating through the list, access elements by index:
both = []
for i in range(len(list_A)):
    for j in range(len(list_B)):
        if list_A[i] == list_B[j]:
            both.append((i,j))
Here i and j will take integer values and you can check values in list_A and list_B by index.

You can also get the common elements and their indexes with numpy.intersect1d():
import numpy as np
common_elements, a_indexes, b_indexes = np.intersect1d(list_A, list_B, return_indices=True)

How to filter the list by selecting for unique combinations of characters in the elements (Python)?

I have the following pairs stored in this list:
sample = [[CGCG,ATAT],[CGCG,CATC],[ATAT,TATA]]
Each pairwise comparison may contain only two unique combinations of characters; otherwise that pair is eliminated. E.g.,
In sample[1]
C C
G A
C T
G C
Look at the corresponding elements in both sub-lists: CC, GA, CT, GC.
Here there are more than two types of pairs: (CC), (GA), (CT) and (GC). So this pairwise comparison cannot occur.
Every comparison can have only 2 combinations out of (AA, GG,CC,TT, AT,TA,AC,CA,AG,GA,GC,CG,GT,TG,CT,TC) ... basically all possible combinations of ACGT where order matters.
In the above example, more than 2 such combinations are found.
However,
In sample[0]
C A
G T
C A
G T
There are only 2 unique combinations: CA and GT
Thus, the only pairs, that remain are:
output = [[CGCG,ATAT],[ATAT,TATA]]
I would prefer if the code was in traditional for-loop format and not comprehensions
This is a small part of the question listed here. This portion of the question is re-asked, as the answer provided earlier provided incorrect output.

def filter_sample(sample):
    filtered_sample = []
    for s1, s2 in sample:
        pairs = {pair for pair in zip(s1, s2)}
        if len(pairs) <= 2:
            filtered_sample.append([s1, s2])
    return filtered_sample
Running this
sample = [["CGCG","ATAT"],["CGCG","CATC"],["ATAT","TATA"]]
filter_sample(sample)
Returns this
[['CGCG', 'ATAT'], ['ATAT', 'TATA']]

sample = [["CGCG","ATAT"],["CGCG","CATC"],["ATAT","CATC"]]
result = []
for s in sample:
    first = s[0]
    second = s[1]
    # unique [first_char, second_char] pairs seen at matching positions
    combinations = []
    for i in range(0,len(first)):
        comb = [first[i],second[i]]
        if comb not in combinations:
            combinations.append(comb)
    if len(combinations) == 2:
        result.append(s)
print(result)

The core of this task is extracting the pairs from your sublists and counting the number of unique pairs. Assuming your samples actually contain strings, you can use zip(*sub_list) to get the pairs. Then you can use set() to remove duplicate entries.
sample = [['CGCG','ATAT'],['CGCG','CATC'],['ATAT','CATC']]
def has_n_pairs(sub_list, n_pairs):
    # named has_n_pairs rather than filter to avoid shadowing the built-in filter()
    pairs = zip(*sub_list)
    return len(set(pairs)) == n_pairs
Then you can use a for loop or a list comprehension to apply this function to your main list.
new_sample = [sub_list for sub_list in sample if has_n_pairs(sub_list, 2)]
...or as a for loop...
new_sample = []
for sub_list in sample:
    if has_n_pairs(sub_list, 2):
        new_sample.append(sub_list)

Matching values in variable length lists containing sublists in python

I am trying to iterate through a dictionary where each key contains a list which in turn contains from 0 up to 20+ sub-lists. The goal is to iterate through the values of dictionary 1, check if they are in any of the sublists of dictionary 2 for the same key, and if so, add +1 to a counter and not consider that sublist again.
The code looks somewhat like this:
dict1={"key1":[[1,2],[6,7]],"key2":[[1,2,3,4,5,6,7,8,9]]}
dict2={"key1":[[0,1,2,3],[5,6,7,8],[11,13,15]],"key2":[[7,8,9,10,11],[16,17,18]]}
for (k,v), (k2,v2) in zip(dict1.iteritems(),dict2.iteritems()):
    temp_hold=[]
    span_overlap=0
    for x in v:
        if x in v2 and v2 not in temp_hold:
            span_overlap+=1
            temp_hold.append(v2)
        else:
            continue
    print temp_hold, span_overlap
This obviously does not work, mainly because the code cannot check hierarchically through the list and its sublists, and partly because the iteration syntax is likely incorrect. I do not have the greatest grasp of nested loops and iteration, which makes this a pain. Another option would be to first join the sublists into a single list using:
v=[y for x in v for y in x]
Which would make it easy to check if one value is in another dictionary, but then I lose the ability to work specifically with the sublist which contained parts of the values iterated through, nor can I count that sublist only once.
The desired output is a count of 2 for key1, and 1 for key2, as well as being able to handle the matching sublists for further analysis.

Here is one solution. I am first converting the list of lists into a list of sets. If you have any control over the lists, make them sets.
def matching_sublists(dict1, dict2):
    result = dict()
    for k in dict1:
        assert(k in dict2)
        result[k] = 0
        A = [set(l) for l in dict1[k]]
        B = [set(l) for l in dict2[k]]
        for sublistA in A:
            result[k] += sum([1 for sublistB in B if not sublistA.isdisjoint(sublistB) ])
    return result
if __name__=='__main__':
    dict1={"key1":[[1,2],[6,7]],"key2":[[1,2,3,4,5,6,7,8,9]]}
    dict2={"key1":[[0,1,2,3],[5,6,7,8],[11,13,15]],"key2":[[7,8,9,10,11],[16,17,18]]}
    print(matching_sublists(dict1, dict2))
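The question also asks to keep the matching sublists for further analysis and to count each sublist of dict2 only once. One way to sketch that is to iterate over the dict2 sublists instead; the function name overlapping_sublists is an assumption made for this example, and "matching" is taken to mean sharing at least one value:

```python
def overlapping_sublists(dict1, dict2):
    """For each shared key, list the dict2 sublists that overlap any dict1 sublist."""
    result = {}
    for key in dict1.keys() & dict2.keys():
        spans = [set(sub) for sub in dict1[key]]
        # Each dict2 sublist is examined once, so it is counted at most once.
        result[key] = [sub for sub in dict2[key]
                       if any(not span.isdisjoint(sub) for span in spans)]
    return result

dict1 = {"key1": [[1, 2], [6, 7]], "key2": [[1, 2, 3, 4, 5, 6, 7, 8, 9]]}
dict2 = {"key1": [[0, 1, 2, 3], [5, 6, 7, 8], [11, 13, 15]],
         "key2": [[7, 8, 9, 10, 11], [16, 17, 18]]}

matches = overlapping_sublists(dict1, dict2)
counts = {key: len(subs) for key, subs in matches.items()}
print(matches)
print(counts)
```

With the sample data this gives a count of 2 for key1 and 1 for key2, and matches holds the sublists themselves.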

python union of 2 nested lists with index

I want to get the union of 2 nested lists plus an index to the common values.
I have two lists like A = [[1,2,3],[4,5,6],[7,8,9]] and B = [[1,2,3,4],[3,3,5,7]] but the length of each list is about 100 000. To A belongs an index vector with len(A): I = [2,3,4]
What I want is to find all sublists in B where the first 3 elements are equal to a sublist in A. In this example I want to get B[0] returned ([1,2,3,4]) because its first three elements are equal to A[0]. In addition, I also want the index to A[0] in this example, that is I[0].
I tried different things, but nothing worked so far :(
First I tried this:
Common = []
for i in range(len(B)):
    if B[i][:3] in A:
        id = [I[x] for x,y in enumerate(A) if y == B[i][:3]][0]
        Common.append([int(id)] + B[i])
But that takes ages, or never finishes
Then I transformed A and B into sets and took the union from both, which was very quick, but then I don't know how to get the corresponding indices
Does anyone have an idea?

Create an auxiliary dict (work is O(len(A))), assuming the first three items of a sublist in A uniquely identify it (otherwise you need a dict of lists):
aud = dict((tuple(a[:3]), i) for i, a in enumerate(A))
Use said dict to loop once on B (work is O(len(B))) to get B sublists and A indices:
result = [(b, aud[tuple(b[:3])]) for b in B if tuple(b[:3]) in aud]
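If the prefixes in A are not unique, or if the value from the index vector I is wanted rather than the position in A, the dict-of-lists variant mentioned above could look roughly like this. It is a sketch using the question's sample data; the name prefix_to_ids is invented here, and the output layout copies the question's own `[id] + B[i]` attempt:

```python
from collections import defaultdict

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 2, 3, 4], [3, 3, 5, 7]]
I = [2, 3, 4]  # index vector belonging to A, as in the question

# Map each 3-item prefix of A to every I value that shares it.
prefix_to_ids = defaultdict(list)
for a, idx in zip(A, I):
    prefix_to_ids[tuple(a[:3])].append(idx)

# One pass over B; .get avoids inserting empty entries into the defaultdict.
common = [[idx] + b for b in B for idx in prefix_to_ids.get(tuple(b[:3]), [])]
print(common)  # [[2, 1, 2, 3, 4]]
```

Building the dict is O(len(A)) and the pass over B is O(len(B)), so two lists of about 100 000 sublists should be processed quickly.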
