Order sublists within a list depending on some characters - python

Hello, I need some help ordering the sublists within a list depending on some characters.
Here is an example. I have this list:
[['QANH01000554.1_32467-32587_+__Sp_1', 'QANH01000554.1_32371-32464_+__Sp_1'], ['QANH01000809.1_27675-27794_-__Sp_1', 'QANH01000809.1_27798-27890_-__Sp_1']]
and I would like to get :
[['QANH01000554.1_32467-32587_+__Sp_1', 'QANH01000554.1_32371-32464_+__Sp_1'], [ 'QANH01000809.1_27798-27890_-__Sp_1','QANH01000809.1_27675-27794_-__Sp_1']]
So I would like to iterate over each sublist and, if it contains a -, sort that sublist on the numbers matched by [\d]+-[\d]+ (the entry whose first [\d]+ is highest coming first).
As you can see in the sublist
['QANH01000809.1_27675-27794_-__Sp_1', 'QANH01000809.1_27798-27890_-__Sp_1']
27798 > 27675, so I reordered it to
['QANH01000809.1_27798-27890_-__Sp_1', 'QANH01000809.1_27675-27794_-__Sp_1']

This is one approach using sorted with a custom key.
Ex:
import re
data = [['QANH01000554.1_32467-32587_+__Sp_1', 'QANH01000554.1_32371-32464_+__Sp_1'], ['QANH01000809.1_27675-27794_-__Sp_1', 'QANH01000809.1_27798-27890_-__Sp_1']]
print([sorted(i, key=lambda x: tuple(int(n) for n in re.search(r"(\d+)\-(\d+)", x).groups()), reverse=True) for i in data])
Output:
[['QANH01000554.1_32467-32587_+__Sp_1', 'QANH01000554.1_32371-32464_+__Sp_1'], ['QANH01000809.1_27798-27890_-__Sp_1', 'QANH01000809.1_27675-27794_-__Sp_1']]
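Note that this sorts every sublist in descending order; it matches the expected output here only because the + sublist was already in descending order. If you want to reorder only the sublists on the - strand (assuming, as in the example data, that the strand sign appears as the _-_ segment of the name), a minimal sketch:
import re

data = [['QANH01000554.1_32467-32587_+__Sp_1', 'QANH01000554.1_32371-32464_+__Sp_1'],
        ['QANH01000809.1_27675-27794_-__Sp_1', 'QANH01000809.1_27798-27890_-__Sp_1']]

def start_coord(s):
    # first number of the "start-end" coordinate pair, e.g. 27675 in "27675-27794"
    return int(re.search(r"(\d+)-(\d+)", s).group(1))

# reverse-sort only the sublists whose names carry the - strand marker
result = [sorted(sub, key=start_coord, reverse=True) if any('_-_' in s for s in sub) else sub
          for sub in data]
print(result)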

Related

removing a sublist if a string in the sublist contains a substring (all values within all sublists are strings)

Given nested list: mistake_list = [['as','as*s','sd','*ssa'],['a','ds','dfg','mal']]
Required output: corrected_list = [['a','ds','dfg','mal']]
The given list can contain hundreds or thousands of sublists, and the strings may or may not contain the special character *; if any string in a sublist does, that whole sublist has to be removed.
I have shown an example above where the mistake_list is the input nested list, and corrected_list is the output nested list.
NOTE: all sublists have an equal number of elements (I don't think it is necessary to know this though)
The filter function can help you:
mistake_list = [['as','as*s','sd','*ssa'],['a','ds','dfg','mal']]
corrected_list = list(filter(lambda l: not any("*" in x for x in l), mistake_list))
print(corrected_list)
[['a', 'ds', 'dfg', 'mal']]
You can use a list comprehension:
mistake_list = [['as','as*s','sd','*ssa'],['a','ds','dfg','mal']]
corrected_list = [sublst for sublst in mistake_list if not any('*' in s for s in sublst)]
print(corrected_list) # [['a', 'ds', 'dfg', 'mal']]
The filtering condition here checks whether any item of sublst contains a '*' character, keeping only the sublists where none does.

Grouping a grouped list of str without duplicates

I have a grouped list of strings that sort of looks like this; the lists inside these groups will always contain 5 elements:
text_list = [['aaa','bbb','ccc','ddd','eee'],
             ['fff','ggg','hhh','iii','jjj'],
             ['xxx','mmm','ccc','bbb','aaa'],
             ['fff','xxx','aaa','bbb','ddd'],
             ['aaa','bbb','ccc','ddd','eee'],
             ['fff','xxx','aaa','ddd','eee'],
             ['iii','xxx','ggg','jjj','aaa']]
The objective is simple: group together all lists that are similar, where "similar" means the first 3 elements of one list all appear (in any order) among the elements of another list.
So from the above example the output might look like this (the output is the index of each list):
[[0,2,4],[3,5]]
Notice that when another list contains the same elements in a different order, the duplicate group it would produce is removed.
I've written the following code to extract the groups, but it returns duplicates and I am unsure how to proceed. I also think this might not be the most efficient way to do the extraction, as the real list can contain upwards of millions of groups:
grouped_list = []
for i in range(0, len(text_list)):
    int_temp = []
    for m in range(0, len(text_list)):
        if i == m:
            continue
        bool_check = all(x in text_list[m] for x in text_list[i][0:3])
        if bool_check:
            if len(int_temp) == 0:
                int_temp.append(i)
                int_temp.append(m)
                continue
            int_temp.append(m)
    grouped_list.append(int_temp)
## remove index with no groups
grouped_list = [x for x in grouped_list if x != []]
Is there a better way to go about this? How do I remove the duplicate groups afterwards? Thank you.
Edit:
To be clearer, I would like to retrieve the lists that are similar to each other, comparing only the first 3 elements of one list against the other lists. For example, using the first 3 elements from list A, check whether lists B, C, D... contain all 3 of those elements. Repeat for the entire list, then remove any group that duplicates another.
You can build a set of frozensets, where each frozenset holds the indices of the sub-lists that contain all of the first 3 items of a given sub-list:
groups = set()
sets = list(map(set, text_list))
for i, lst in enumerate(text_list):
    groups.add(frozenset((i, *(j for j, s in enumerate(sets) if set(lst[:3]) <= s))))
print([sorted(group) for group in groups if len(group) > 1])
If the input list is long, it would be faster to create a set of frozensets of the first 3 items of all sub-lists and use that set to filter all combinations of 3 items from each sub-list, so that the time complexity is essentially linear in the size of the input list rather than quadratic, despite the overhead of generating combinations:
from itertools import combinations
sets = {frozenset(lst[:3]) for lst in text_list}
groups = {}
for i, lst in enumerate(text_list):
    for c in map(frozenset, combinations(lst, 3)):
        if c in sets:
            groups.setdefault(c, []).append(i)
print([sorted(group) for group in groups.values() if len(group) > 1])
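With the example text_list, both snippets print [[0, 2, 4], [3, 5]], matching the expected output; the first one may print the two groups in either order, since it iterates over a set.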

python efficient way to compare nested lists and append matches to new list

I wish to compare two nested lists. If there is a match between the first element of each sublist, I wish to add the matched element to a new list for further operations. Below is an example and what I've tried so far:
Example:
x = [['item1','somethingelse1'], ['item2', 'somethingelse2']...]
y = [['item1','somethingelse3'], ['item3','somethingelse4']...]
What I've tried so far:
match = []
for itemx in x:
    for itemy in y:
        if itemx[0] == itemy[0]:
            match.append(itemx)
The above did the job of appending the matched items to the new list, but my two nested lists are very long and this approach is very slow. Is there a more efficient way to get the matched items between two nested lists?
Yes, use a data structure with constant-time membership testing. So, using a set, for example:
seen = set()
for first, _ in x:
    seen.add(first)
matched = []
for first, _ in y:
    if first in seen:
        matched.append(first)
Or, more succinctly using set/list comprehensions:
seen = {first for first,_ in x}
matched = [first for first,_ in y if first in seen]
(This was before the OP changed the question from append(itemx[0]) to append(itemx)...)
>>> {a[0] for a in x} & {b[0] for b in y}
{'item1'}
Or if the inner lists are always pairs:
>>> dict(x).keys() & dict(y)
{'item1'}
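For the updated question, where the whole sublist itemx should be appended rather than just the shared first element, the same set idea still applies; a minimal variant that appends each matching sublist of x once:
seen = {itemy[0] for itemy in y}
match = [itemx for itemx in x if itemx[0] in seen]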
IIUC using numpy:
import numpy as np
y = [l[0] for l in y]
x = np.array(x)
x[np.isin(x[:, 0], y)]

How to get common elements together in a python list?

This might sound like a stupid question but I have the following list:
list = ['a','b','c','d','a','b','c','d']
And I want to get common elements together to rearrange it as:
sorted_list = ['a','a','b','b','c','c','d','d']
Is there any built-in function in Python to do that?
Well, to get a sorted list you could just use:
sorted_list = sorted(list)
which gives the output ['a','a','b','b','c','c','d','d']
To sort and group the elements by value (iterating over the unique values so each group appears only once):
list = sorted(list)
sorted_list = [[y for y in list if y == x] for x in sorted(set(list))]
which gives the output [['a','a'],['b','b'],['c','c'],['d','d']]
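An alternative sketch using itertools.groupby on the sorted list (named data here to avoid shadowing the built-in list):
from itertools import groupby

data = ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd']
grouped = [list(g) for _, g in groupby(sorted(data))]
print(grouped)  # [['a', 'a'], ['b', 'b'], ['c', 'c'], ['d', 'd']]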

Pair strings in list based on containing text in Python

I'm looking to take a list of strings and create a list of tuples that groups items based on whether they contain the same text.
For example, say I have the following list:
MyList=['Apple1','Pear1','Apple3','Pear2']
I want to pair them based on all but the last character of their string, so that I would get:
ListIWant=[('Apple1','Apple3'),('Pear1','Pear2')]
We can assume that only the last character of the string distinguishes the items, meaning I'm looking to group the strings by the following unique values:
>>> list(set([x[:-1] for x in MyList]))
['Pear', 'Apple']
In [69]: from itertools import groupby
In [70]: MyList=['Apple1','Pear1','Apple3','Pear2']
In [71]: [tuple(v) for k, v in groupby(sorted(MyList, key=lambda x: x[:-1]), lambda x: x[:-1])]
Out[71]: [('Apple1', 'Apple3'), ('Pear1', 'Pear2')]
Consider this code:
def alphagroup(lst):
    # group the strings by their first letter (lower-cased)
    results = {}
    for i in lst:
        letter = i[0].lower()
        if letter not in results:
            results[letter] = [i]
        else:
            results[letter].append(i)
    # flatten the dict into a list of groups
    output = []
    for k in results:
        output.append(results[k])
    return output

arr = ["Apple1", "Pear", "Apple2", "Pack"]
print(alphagroup(arr))
This groups the strings by their first letter, which works for your example. If each group must be a tuple, use the tuple() builtin to convert each group. Hope this helps; I tested the code.
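If you need to key on everything but the last character (as stated in the question) rather than on the first letter, a minimal sketch using a plain dict (pairs is just an intermediate name used here):
MyList = ['Apple1', 'Pear1', 'Apple3', 'Pear2']

pairs = {}
for item in MyList:
    pairs.setdefault(item[:-1], []).append(item)  # key on all but the last character

ListIWant = [tuple(group) for group in pairs.values()]
print(ListIWant)  # [('Apple1', 'Apple3'), ('Pear1', 'Pear2')]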
