Pair strings in list based on containing text in Python - python

I'm looking to take a list of strings and create a list of tuples that groups items based on whether they contain the same text.
For example, say I have the following list:
MyList=['Apple1','Pear1','Apple3','Pear2']
I want to pair them based on all but the last character of their string, so that I would get:
ListIWant=[('Apple1','Apple3'),('Pear1','Pear2')]
We can assume that only the last character of each string is used as the identifier. In other words, I'm looking to group the strings by the following unique values:
>>> list(set([x[:-1] for x in MyList]))
['Pear', 'Apple']

In [69]: from itertools import groupby
In [70]: MyList=['Apple1','Pear1','Apple3','Pear2']
In [71]: [tuple(v) for k, v in groupby(sorted(MyList, key=lambda x: x[:-1]), lambda x: x[:-1])]
Out[71]: [('Apple1', 'Apple3'), ('Pear1', 'Pear2')]
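The one-liner above is easier to follow when unpacked into steps. The sketch below should produce the same result; the helper name `stem` is made up for illustration. Sorting comes first because groupby only merges adjacent items that share a key.

```python
from itertools import groupby

def stem(s):
    # Hypothetical helper: everything except the last character.
    return s[:-1]

my_list = ['Apple1', 'Pear1', 'Apple3', 'Pear2']

# groupby only groups adjacent items, so sort by the same key first.
ordered = sorted(my_list, key=stem)

# Each group is a lazy iterator; tuple() materializes it.
pairs = [tuple(group) for _, group in groupby(ordered, key=stem)]
print(pairs)  # [('Apple1', 'Apple3'), ('Pear1', 'Pear2')]
```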

Consider this code:
def alphagroup(lst):
    results = {}
    for i in lst:
        # Group key: the lowercased first letter of each string
        letter = i[0].lower()
        if letter not in results:
            results[letter] = [i]
        else:
            results[letter].append(i)
    # Collect the groups into a list of lists
    output = []
    for k in results:
        output.append(results[k])
    return output
arr = ["Apple1", "Pear", "Apple2", "Pack"]
print(alphagroup(arr))
This groups the strings by their first letter. To group by everything except the last character, as in the question, use i[:-1] as the key instead. If each element must be a tuple, use the tuple() builtin to convert each group. Hope this helps.
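For the grouping the question describes (everything but the last character), the same dictionary idea could be sketched as follows. The function name `group_by_stem` is an assumption for illustration and is not part of the answer above; the output order relies on dicts preserving insertion order (Python 3.7+).

```python
def group_by_stem(items):
    groups = {}
    for item in items:
        # Key on everything except the last character.
        groups.setdefault(item[:-1], []).append(item)
    # Convert each group to a tuple, in first-seen key order.
    return [tuple(group) for group in groups.values()]

print(group_by_stem(['Apple1', 'Pear1', 'Apple3', 'Pear2']))
# [('Apple1', 'Apple3'), ('Pear1', 'Pear2')]
```

Unlike the groupby version, this does not need the input to be sorted.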

Related

Order sublist within a list depending on some characters

Hello, I need some help ordering the sublists within a list depending on some characters.
Here is an example. I have a list:
[['QANH01000554.1_32467-32587_+__Sp_1', 'QANH01000554.1_32371-32464_+__Sp_1'], ['QANH01000809.1_27675-27794_-__Sp_1', 'QANH01000809.1_27798-27890_-__Sp_1']]
and I would like to get :
[['QANH01000554.1_32467-32587_+__Sp_1', 'QANH01000554.1_32371-32464_+__Sp_1'], [ 'QANH01000809.1_27798-27890_-__Sp_1','QANH01000809.1_27675-27794_-__Sp_1']]
So I would like to iterate over each sublist and, if it is marked with a - (rather than a +), sort the sublist by the numbers matched by [\d]+[-]+[\d], with the highest first number coming first.
As you can see in the sublist
['QANH01000809.1_27675-27794_-__Sp_1', 'QANH01000809.1_27798-27890_-__Sp_1']
27798 > 27675 so I changed to
['QANH01000809.1_27798-27890_-__Sp_1','QANH01000809.1_27675-27794_-__Sp_1']
This is one approach using sorted with a custom key.
Ex:
import re
data = [['QANH01000554.1_32467-32587_+__Sp_1', 'QANH01000554.1_32371-32464_+__Sp_1'], ['QANH01000809.1_27675-27794_-__Sp_1', 'QANH01000809.1_27798-27890_-__Sp_1']]
print([sorted(i, key=lambda x: tuple(int(n) for n in re.search(r"(\d+)\-(\d+)", x).groups()), reverse=True) for i in data])
Output:
[['QANH01000554.1_32467-32587_+__Sp_1', 'QANH01000554.1_32371-32464_+__Sp_1'], ['QANH01000809.1_27798-27890_-__Sp_1', 'QANH01000809.1_27675-27794_-__Sp_1']]
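Note that the snippet above reverses every sublist; the '+' sublist in this sample is unchanged only because it was already in descending order. If only the '-' sublists should be reordered, as the question describes, a sketch could look like this. The helper names `coords` and `order_sublist`, and the `"_-_"` test for the minus marker, are assumptions based on the sample identifiers.

```python
import re

def coords(name):
    # Extract the "start-end" pair as integers, e.g. (27675, 27794).
    start, end = re.search(r"(\d+)-(\d+)", name).groups()
    return int(start), int(end)

def order_sublist(names):
    # Assumption: a minus entry contains "_-_" right after the coordinates.
    if all("_-_" in name for name in names):
        return sorted(names, key=coords, reverse=True)
    # Leave '+' sublists in their original order.
    return list(names)

data = [['QANH01000554.1_32467-32587_+__Sp_1', 'QANH01000554.1_32371-32464_+__Sp_1'],
        ['QANH01000809.1_27675-27794_-__Sp_1', 'QANH01000809.1_27798-27890_-__Sp_1']]
result = [order_sublist(sub) for sub in data]
print(result)
```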

split list into sublists according to variable string and variable block length

I have a list of strings:
['splitter001','stringA','stringB','splitter_1234','stringC']
and I want my end result to be:
[ ['splitter001','stringA','stringB'] , ['splitter_1234','stringC'] ]
The splitter dividers are not identical strings.
I've tried to .find the 'splitter' if the element index > 0, and then delete the indexes [:2nd splitter] and append the first group into a new list, but this doesn't work properly.
I am iterating a for loop over all the strings and it doesn't work for the second group so I can get:
[ ['splitter001','stringA','stringB'] ] as my new list, but the second group is missing.
I've read many answers on this topic and the closest solution was to use:
[list(x[1]) for x in itertools.groupby(myList, lambda x: x=='#') if not x[0]]
but I do not understand this syntax... I've read about groupby and itertools but I'm not sure this is helpful for my situation.
Here's one way to do this with groupby. We tell groupby to look for strings that start with 'splitter'. This creates two kinds of groups: strings that start with 'splitter', and all the other strings. Eg,
from itertools import groupby
data = ['splitter001','stringA','stringB','splitter_1234','stringC']
for k, g in groupby(data, key=lambda s: s.startswith('splitter')):
    print(k, list(g))
output
True ['splitter001']
False ['stringA', 'stringB']
True ['splitter_1234']
False ['stringC']
So we can put those groups into two lists and then zip them together to make the final list.
from itertools import groupby
data = ['splitter001','stringA','stringB','splitter_1234','stringC']
head = []
tail = []
for k, g in groupby(data, key=lambda s: s.startswith('splitter')):
    if k:
        head.append(list(g))
    else:
        tail.append(list(g))
out = [u+v for u, v in zip(head, tail)]
print(out)
output
[['splitter001', 'stringA', 'stringB'], ['splitter_1234', 'stringC']]
Here's a more compact way to do the same thing, using a list of lists to store the head and tail lists:
from itertools import groupby
data = ['splitter001','stringA','stringB','splitter_1234','stringC']
results = [[], []]
for k, g in groupby(data, key=lambda s: s.startswith('splitter')):
    # k is False (index 0) for tail groups and True (index 1) for head groups
    results[k].append(list(g))
out = [v+u for u, v in zip(*results)]
print(out)
output
[['splitter001', 'stringA', 'stringB'], ['splitter_1234', 'stringC']]
If you want to print each sublist on a separate line, the simple way is to do it with a for loop instead of creating the out list.
for u, v in zip(*results):
    print(v + u)
output
['splitter001', 'stringA', 'stringB']
['splitter_1234', 'stringC']
Another way is to convert the sublists to strings and then join them together with newlines to create one big string.
print('\n'.join([str(v + u) for u, v in zip(*results)]))
This final variation stores both kinds of groups into a single iterator object. I think you'll agree that the previous versions are easier to read. :)
it = iter(list(g) for k, g in groupby(data, key=lambda s: s.startswith('splitter')))
out = [u+v for u, v in zip(it, it)]
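The `zip(it, it)` idiom works because both arguments are the same iterator, so each output pair consumes two consecutive items. A small sketch on plain values (the list here is made up for illustration):

```python
values = ['head1', 'tail1', 'head2', 'tail2']
it = iter(values)

# zip pulls from the same iterator twice per pair,
# so the items are consumed two at a time.
pairs = list(zip(it, it))
print(pairs)  # [('head1', 'tail1'), ('head2', 'tail2')]
```

This pairing assumes the data starts with a splitter and that every splitter is followed by at least one other string; a trailing splitter with nothing after it would be dropped by zip.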
Get the indices of the elements that start with 'splitter', then slice the list at those indices:
sl = ['splitter001','stringA','stringB','splitter_1234','stringC']
si = [i for i, e in enumerate(sl) if e.startswith('splitter')]
[sl[i:j] for i, j in zip(si, si[1:] + [len(sl)])]
Out[66]: [['splitter001', 'stringA', 'stringB'], ['splitter_1234', 'stringC']]
Here is an approach using a for loop, as you mentioned you tried, that handles the case of the second group:
# define list of strings for input
strings = ['splitter001','stringA','stringB','splitter_1234','stringC']
split_strings = [] # this is going to hold the final output
current_list = [] # this is a temporary list
# loop over strings in the input
for s in strings:
    if 'splitter' in s:
        # if current_list is not empty
        if current_list:
            split_strings.append(current_list) # append to output
            current_list = [] # reset current_list
    current_list.append(s)
# outside of the loop, append the leftover strings (if any)
if current_list:
    split_strings.append(current_list)
The key here is that you do one more append at the end, outside of your loop, to capture the last group.
Output:
[['splitter001', 'stringA', 'stringB'], ['splitter_1234', 'stringC']]
EDIT: Adding explanation of code.
We create a temp variable current_list to hold each list that we will append to the final output split_strings.
Loop over the strings in the input. For each string s, check if it contains 'splitter'. If it does AND the current_list is not empty, this means that we've hit the next delimiter. Append current_list to the output and clear it out so we can begin collecting items for the next set of strings.
After this check, append the current string to current_list. This works because we cleared it out (setting it equal to []) after we found a delimiter.
At the end of the list, we append whatever is leftover to the output, if anything.
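The steps described above can also be sketched as a reusable generator. The name `split_on` and its `is_splitter` parameter are hypothetical, introduced here only for illustration.

```python
def split_on(items, is_splitter):
    current = []
    for item in items:
        # A splitter closes the group collected so far (if any).
        if is_splitter(item) and current:
            yield current
            current = []
        current.append(item)
    # Emit whatever is left after the loop ends.
    if current:
        yield current

strings = ['splitter001', 'stringA', 'stringB', 'splitter_1234', 'stringC']
groups = list(split_on(strings, lambda s: s.startswith('splitter')))
print(groups)
# [['splitter001', 'stringA', 'stringB'], ['splitter_1234', 'stringC']]
```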
You can try something like this:
First get the indices where a splitter appears, then chunk the list according to those indices:
sl = ['splitter001','stringA','stringB','splitter_1234','stringC']
si = [index for index, value in enumerate(sl) if value.startswith('splitter')]
for i in range(len(si)):
    bounds = si[i:i+2]  # this splitter's index and, if present, the next one's
    if len(bounds) == 2:
        print(sl[bounds[0]:bounds[1]])
    else:
        # last splitter: take everything up to the end of the list
        print(sl[bounds[0]:])
output:
['splitter001', 'stringA', 'stringB']
['splitter_1234', 'stringC']

Comparing a 3-tuple to a list of 3-tuples using only the first two parts of the tuple

I have a list of 3-tuples in a Python program that I'm building while looking through a file (so one at a time), with the following setup:
(feature,combination,durationOfTheCombination),
such that if a unique combination of feature and combination is found, it will be added to the list. The list itself holds a similar setup, but durationOfTheCombination is the sum of all durations that share the same (feature,combination) pair. Therefore, when deciding whether a tuple should be added to the list, I need to compare only the first two parts of the tuple, and if a match is found, add the duration to the corresponding list item.
Here's an example for clarity. If the input is
(ABC,123,10);(ABC,123,10);(DEF,123,5);(ABC,123,30);(EFG,456,30)
The output will be (ABC,123,50);(DEF,123,5);(EFG,456,30).
Is there any way to do this comparison?
You can do this with Counter,
In [42]: from collections import Counter
In [43]: lst = [('ABC',123,10),('ABC',123,10),('DEF',123,5)]
In [44]: [(i[0],i[1],i[2]*j) for i,j in Counter(lst).items()]
Out[44]: [('DEF', 123, 5), ('ABC', 123, 20)]
As the OP pointed out, the Counter version does not merge tuples whose durations differ. In that case, use groupby:
In [25]: from itertools import groupby
In [26]: lst = [('ABC',123,10),('ABC',123,10),('ABC',123,25),('DEF',123,5)]
In [27]: [tuple(list(n)+[sum([i[2] for i in g])]) for n,g in groupby(sorted(lst,key = lambda x:x[:2]), key = lambda x:x[:2])]
Out[27]: [('ABC', 123, 45), ('DEF', 123, 5)]
If you don't want to use Counter, you can use a dict instead.
setOf3Tuples = dict()
def add3TupleToSet(a):
    # The first two fields identify the entry; the third is accumulated
    key = a[0:2]
    if key in setOf3Tuples:
        setOf3Tuples[key] += a[2]
    else:
        setOf3Tuples[key] = a[2]
def getRaw3Tuple():
    # Rebuild (feature, combination, total) tuples from the dict
    for k in setOf3Tuples:
        yield k + (setOf3Tuples[k],)
if __name__ == "__main__":
    add3TupleToSet(("ABC",123,10))
    add3TupleToSet(("ABC",123,10))
    add3TupleToSet(("DEF",123,5))
    print([i for i in getRaw3Tuple()])
It seems a dict is more suited than a list here, with the first 2 fields as key. And to avoid checking each time if the key is already here you can use a defaultdict.
from collections import defaultdict
d = defaultdict(int)
for t in your_list:
    d[t[:2]] += t[-1]
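Applied to the input from the question, the defaultdict approach could be sketched as below. The variable names are illustrative, and the output order relies on dicts preserving insertion order (Python 3.7+).

```python
from collections import defaultdict

records = [('ABC', 123, 10), ('ABC', 123, 10), ('DEF', 123, 5),
           ('ABC', 123, 30), ('EFG', 456, 30)]

totals = defaultdict(int)
for feature, combination, duration in records:
    # Missing keys start at 0, so no membership check is needed.
    totals[(feature, combination)] += duration

# Rebuild 3-tuples from the (feature, combination) keys and their sums.
result = [key + (total,) for key, total in totals.items()]
print(result)  # [('ABC', 123, 50), ('DEF', 123, 5), ('EFG', 456, 30)]
```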
Assuming your input is collected in a list as below, you can use pandas groupby to accomplish this quickly:
import pandas as pd
input = [('ABC',123,10),('ABC',123,10),('DEF',123,5),('ABC',123,30),('EFG',456,30)]
output = [tuple(x) for x in pd.DataFrame(input).groupby([0,1])[2].sum().reset_index().values]

Ordering a string by its substring numerical value in python

I have a list of strings that need to be sorted in numerical order, using two substrings as integer keys.
Calling sort() orders my strings alphabetically, so I get 1, 10, 2..., which is not what I'm looking for.
Searching around, I found that a key parameter can be passed to sort(), and that sort(key=int) should do the trick. But since my key is a substring and not the whole string, that would raise a conversion error.
Supposing my strings are something like:
test1txtfgf10
test1txtfgg2
test2txffdt3
test2txtsdsd1
I want my list to be ordered in numeric order on the basis of the first integer and then on the second, so I would have:
test1txtfgg2
test1txtfgf10
test2txtsdsd1
test2txffdt3
I think I could extract the integer values, sort only them keeping track of what string they belong to and then ordering the strings, but I was wondering if there's a way to do this thing in a more efficient and elegant way.
Thanks in advance
Try the following
In [26]: import re
In [27]: f = lambda s: [int(n) for n in re.findall(r'\d+', s)]
In [28]: sorted(strings, key=f)
Out[28]: ['test1txtfgg2', 'test1txtfgf10', 'test2txtsdsd1', 'test2txffdt3']
This uses regex (the re module) to find all integers in each string, then compares the resulting lists. For example, f('test1txtfgg2') returns [1, 2], which is then compared against other lists.
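As a self-contained sketch of the same idea, with the list from the question filled in (the function name `numeric_key` is made up for illustration):

```python
import re

strings = ['test1txtfgf10', 'test1txtfgg2', 'test2txffdt3', 'test2txtsdsd1']

def numeric_key(s):
    # All digit runs in the string, as integers, in order of appearance.
    return [int(n) for n in re.findall(r'\d+', s)]

print(numeric_key('test1txtfgf10'))  # [1, 10]

# Lists compare element by element, so [1, 2] sorts before [1, 10].
ordered = sorted(strings, key=numeric_key)
print(ordered)
# ['test1txtfgg2', 'test1txtfgf10', 'test2txtsdsd1', 'test2txffdt3']
```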
Extract the numeric parts and sort using them
import re
d = """test1txtfgf10
test1txtfgg2
test2txffdt3
test2txtsdsd1"""
lines = d.split("\n")
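re_numeric = re.compile(r"^[^\d]+(\d+)[^\d]+(\d+)$")
def key(line):
    """Returns a tuple (n1, n2) of the numeric parts of line."""
    m = re_numeric.match(line)
    if m:
        # group(n) returns the n-th captured string; groups() returns a tuple
        return (int(m.group(1)), int(m.group(2)))
    else:
        # Lines that do not match sort last instead of breaking the comparison
        return (float("inf"), float("inf"))
lines.sort(key=key)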
Now lines are
['test1txtfgg2', 'test1txtfgf10', 'test2txtsdsd1', 'test2txffdt3']
import re
k = [
"test1txtfgf10",
"test1txtfgg2",
"test2txffdt3",
"test2txtsdsd1"
]
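# Pair each string with the list of integers it contains
tmp = [([int(e) for e in re.split("[a-z]", el) if e], el) for el in k]
# Sort the pairs by their integer lists and keep the sorted result
tmp = sorted(tmp, key=lambda pair: pair[0])
tmp = [res for cm, res in tmp]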

Splitting a list by first character of each element

I have a Python list mylist whose elements are a sublist containing a string of a letter and number. I was wondering how I could split mylist by the character at the start of the string without using code with individual statements/cases for each character.
Say I want to split mylist into lists a, b, c:
mylist = [['a1'],['a2'],['c1'],['b1']]
a = [['a1'],['a2']]
b = [['b1']]
c = [['c1']]
It is important that I keep them as a list-of-lists (even though it's only a single element in each sublist).
This will work:
import itertools as it
mylist = [['a1'],['a2'],['c1'],['b1']]
keyfunc = lambda x: x[0][0]
mylist = sorted(mylist, key=keyfunc)
a, b, c = [list(g) for k, g in it.groupby(mylist, keyfunc)]
The line where sorted() is used is necessary only if the elements in mylist are not already sorted by the character at the start of the string.
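To see why the sort matters, here is a small sketch with a reordered list (made up for illustration) in which the 'a' items are not adjacent. Without sorting, groupby starts a new group every time the key changes.

```python
from itertools import groupby

mylist = [['a1'], ['c1'], ['a2'], ['b1']]  # 'a' items are not adjacent
keyfunc = lambda x: x[0][0]

# Without sorting, 'a' shows up as two separate groups.
unsorted_keys = [k for k, _ in groupby(mylist, keyfunc)]
# After sorting by the same key, each letter forms exactly one group.
sorted_keys = [k for k, _ in groupby(sorted(mylist, key=keyfunc), keyfunc)]

print(unsorted_keys)  # ['a', 'c', 'a', 'b']
print(sorted_keys)    # ['a', 'b', 'c']
```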
EDIT :
As pointed out in the comments, a more general solution (one that does not restrict the number of variables to just three) would be using dictionary comprehensions (available in Python 2.7+) like this:
result_dict = {k: list(g) for k, g in it.groupby(mylist, keyfunc)}
Now the answer is keyed in the dictionary by the first character:
result_dict['a']
> [['a1'],['a2']]
result_dict['b']
> [['b1']]
result_dict['c']
> [['c1']]
Using a dictionary could work too
mylist = [['a1'],['a2'],['c1'],['b1']]
from collections import defaultdict
dicto = defaultdict(list)
for ele in mylist:
dicto[ele[0][0]].append(ele)
Result:
>>> dicto
defaultdict(<type 'list'>, {'a': [['a1'], ['a2']], 'c': [['c1']], 'b': [['b1']]})
It does not give the exact result you were asking for; however, it is quite easy to access a list of lists associated with each letter
>>> dicto['a']
[['a1'], ['a2']]
You can also get these sublists by using a simple function:
def get_items(mylist, letter):
    return [item for item in mylist if item[0][0] == letter]
The expression item[0][0] simply means to get the first letter of the first element of the current item. You can then call the function for each letter:
a = get_items(mylist, 'a')
b = get_items(mylist, 'b')
c = get_items(mylist, 'c')
