How to group list items based on a specific condition?


I have this text:
>A1
KKKKKKKK
DDDDDDDD
>A2
FFFFFFFF
FFFFOOOO
DAA
>A3
OOOZDDD
KKAZAAA
A
When I split it and remove the line breaks, I get this list:
['>A1', 'KKKKKKKK', 'DDDDDDDD', '>A2', 'FFFFFFFF', 'FFFFOOOO', 'DAA', '>A3', 'OOOZDDD', 'KKAZAAA', 'A']
I'm trying to merge all the strings between each part that starts with '>', such that it looks like:
['KKKKKKKKDDDDDDDD', 'FFFFFFFFFFFFOOOODAA', 'OOOZDDDKKAZAAAA']
This is what I have so far, but it doesn't do anything and I'm lost:
my_list = ['>A1', 'KKKKKKKK', 'DDDDDDDD', '>A2', 'FFFFFFFF', 'FFFFOOOO', 'DAA', '>A3', 'OOOZDDD', 'KKAZAAA', 'A']
result = []
for item in range(len(my_list)):
    if my_list[item][0] == '>':
        temp = ''
        while my_list[item] != '>':
            temp += my_list[item]
    result.append(temp)
print(result)

@Andrej has given compact code for your problem, but I want to help by pointing out some issues in your original code.
You have a while inside the if, but item never changes inside that while, so it cannot walk over the strings that follow a '>' line. The correct approach is to add an else branch that concatenates the following strings.
You append temp to result at every iteration, before it holds the full concatenated string. The right time to append is when you meet the next '>'.
After fixing both, you get something like this:
result = []
for item in range(len(my_list)):
    if my_list[item][0] == '>':
        if item != 0:
            result.append(temp)
        temp = ''
    else:
        temp += my_list[item]
if item != 0:
    result.append(temp)  # append the last group, which has no '>' after it
print(result)
You can simplify it further:
Avoid list indexing by iterating over the list directly.
Avoid the repeated final check by adding a sentinel.
result = []
concat_string = '' # renamed from temp for readability
for string in my_list + ['>']: # iterate over list directly and add a sentinel
    if string[0] == '>': # or string.startswith('>')
        if concat_string:
            result.append(concat_string)
            concat_string = ''
    else:
        concat_string += string
print(result)

You can use itertools.groupby for the task:
from itertools import groupby
lst = [
    ">A1",
    "KKKKKKKK",
    "DDDDDDDD",
    ">A2",
    "FFFFFFFF",
    "FFFFOOOO",
    "DAA",
    ">A3",
    "OOOZDDD",
    "KKAZAAA",
    "A",
]
out = []
for k, g in groupby(lst, lambda s: s.startswith(">")):
    if not k:
        out.append("".join(g))
print(out)
Prints:
['KKKKKKKKDDDDDDDD', 'FFFFFFFFFFFFOOOODAA', 'OOOZDDDKKAZAAAA']
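To see why the `if not k` check is needed, it may help to look at what groupby actually yields for this kind of input. The following is only a sketch on a shortened list; `pairs` is a name introduced here for illustration.

```python
from itertools import groupby

lst = [">A1", "KKKKKKKK", "DDDDDDDD", ">A2", "FFFFFFFF", "FFFFOOOO", "DAA"]

# groupby yields one (key, group) pair per run of consecutive items
# that share the same key; here the key is True for header lines.
pairs = [(k, list(g)) for k, g in groupby(lst, lambda s: s.startswith(">"))]

for key, group in pairs:
    print(key, group)

# Keep only the runs whose key is False (the sequence lines) and join each run.
out = ["".join(group) for key, group in pairs if not key]
print(out)
```

Note that two header lines in a row fall into the same True run, so a header without any sequence lines does not produce an empty string in the output.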

Regex version:
data = """>A1
KKKKKKKK
DDDDDDDD
>A2
FFFFFFFF
FFFFOOOO
DAA
>A3
OOOZDDD
KKAZAAA
A"""
import re
patre = re.compile("^>.+\n",re.MULTILINE)
#split on `>xxx`
chunks = patre.split(data)
#remove whitespaces and newlines
blocks = [v.replace("\n","").strip() for v in chunks]
#get rid of leading trailing empty blocks
blocks = [v for v in blocks if v]
print(blocks)
output:
['KKKKKKKKDDDDDDDD', 'FFFFFFFFFFFFOOOODAA', 'OOOZDDDKKAZAAAA']

Related

Managing duplicates when sorting by character order in string

I am trying to solve a challenge where I have to reorder the letters of string s into the order in which they appear in string t. For example:
For s = "weather" and t = "therapyw", the output should be
sortByString(s, t) = "theeraw";
For s = "good" and t = "odg", the output should be
sortByString(s, t) = "oodg".
This is my code:
def sortByString(s, t):
    s_list = list(s)
    t_list = list(t)
    output = []
    for i in range(len(t_list)):
        if t_list[i] in s_list:
            output.insert(i, t_list[i])
    return ''.join(output)
It works for all cases except if the same letter exists more than once.
s: "weather"
t: "therapyw"
Output:
"theraw"
Expected Output:
"theeraw"
How can I handle this situation in my code above? What am I missing? I appreciate all help but instead of just blurting out the answer, I would like to know what I'm doing wrong.

The issue with your current code is that it only adds one copy of each character in t to output, regardless of how many times it occurs in s. You can work around that by looping over the count of that character in s and appending to output for each count:
def sortByString(s, t):
    s_list = list(s)
    t_list = list(t)
    output = []
    for i in range(len(t_list)):
        for _ in range(s_list.count(t_list[i])):
            output.append(t_list[i])
    return ''.join(output)
print(sortByString('weather',"therapyw"))
print(sortByString('good',"odg"))
Output:
theeraw
oodg
You can simplify the loop by just adding copies of a list with the current character according to the count of the character in s:
for c in t_list:
    output = output + [c] * s_list.count(c)
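Calling s_list.count(c) rescans s for every character of t. If that ever matters, the counts can be computed once up front. This is a sketch using collections.Counter, not the only way to do it; `sort_by_string` is a name introduced here, and like the loop above it drops characters of s that do not appear in t.

```python
from collections import Counter

def sort_by_string(s, t):
    counts = Counter(s)  # one pass over s: character -> number of occurrences
    # A Counter returns 0 for missing keys, so characters of t that are
    # absent from s contribute an empty string.
    return ''.join(c * counts[c] for c in t)

print(sort_by_string('weather', 'therapyw'))
print(sort_by_string('good', 'odg'))
```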

Easy way
Use enumerate to turn t into a dict that maps each character to its position, then sort the characters of s by that position:
def sortByString(s, t):
    s_list = list(s)
    t_list = list(t)
    orderdict = {char: index for index, char in enumerate(t_list)}
    output = sorted(s_list, key=orderdict.get)  # sort s itself, not a hard-coded string
    return ''.join(output)
This handles repeated characters.
Example
>>> sortByString('weather',"therapyw")
'theeraw'
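One caveat with the dict approach: orderdict.get returns None for a character of s that is missing from t, and in Python 3 sorting a mix of None and int keys raises TypeError. Below is a minimal sketch of one way around that. It assumes such characters should simply go to the end, which is my assumption and not part of the challenge statement; `sort_by_string` is an illustrative name.

```python
def sort_by_string(s, t):
    order = {char: index for index, char in enumerate(t)}
    # Assumption: characters missing from t get the key len(t),
    # so they sort after every character that does appear in t.
    return ''.join(sorted(s, key=lambda c: order.get(c, len(t))))

print(sort_by_string('weather', 'therapyw'))
print(sort_by_string('weatherz', 'therapyw'))
```

Because sorted is stable, characters with the same key keep their original relative order.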
Modification to OP's code
Append each character as many times as it appears in s:
def sortByString(s,t):
    s_list = list(s)
    t_list = list(t)
    output = []
    for i in range(len(t_list)):
        if t_list[i] in s_list:
            output.append(t_list[i]*s_list.count(t_list[i]))
    return ''.join(output)
Output:
>>> sortByString('weather',"therapyw")
'theeraw'

2 steps:
a. create a sorted list of (position in t, character) pairs for the characters in s, using index()
b. use zip(*a) to extract the sorted characters
s = "weather"
t = "therapyw"
a = sorted([(t.index(c),c) for c in s])
b = ''.join(list(zip(*a))[1])
print(b)
Output:
theeraw

Python filtering out a list using elements from another list

I am trying to filter a list using another list. However, the elements of the list I am filtering with are not identical to the strings in the other list. Please see my example, as it will make more sense:
mylist = ['14001IB_L1P0', '14001OB_L1P1', '14002IB_L3P0', '14003OB_L1P1', '14001OB_L2P0']
remove_list = ['14001', '14002']
I want to remove the values from mylist that start with the values from remove_list.
I have tried doing this:
filtered_mylist = mylist[:]
for x in remove_list:
    for i in filtered_mylist:
        if x in i:
            print('remove ' +i)
            filtered_mylist.remove(i)
        else:
            print('keep '+i)
However, this is the result:
remove 14001IB_L1P0
keep 14002IB_L3P0
keep 14003OB_L1P1
remove 14001OB_L2P0
keep 14001OB_L1P1
remove 14002IB_L3P0
and this is what filtered_mylist consists of:
['14001OB_L1P1', '14003OB_L1P1']
However, it should consist of only 1 element:
['14003OB_L1P1']
It seems to me that for some reason, the loop has skipped over '14001OB_L1P1', the second element in the first loop. Why has this happened?

Here's a one liner
mylist = list(filter(lambda x: all([x.find(y) != 0 for y in remove_list]), mylist))
#Output
['14003OB_L1P1']
The all([x.find(y) != 0 for y in remove_list]) will return True if and only if x does not start with any value from remove_list.
all() requires every element to be True, and x.find(y) != 0 means x does not begin with y.
The rest is just executing the filter.
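The same check can also be written with str.startswith, which accepts a tuple of prefixes. This is a sketch of an equivalent filter; `prefixes` and `filtered` are names introduced here.

```python
mylist = ['14001IB_L1P0', '14001OB_L1P1', '14002IB_L3P0', '14003OB_L1P1', '14001OB_L2P0']
remove_list = ['14001', '14002']

# startswith accepts a str or a tuple of str, but not a list
prefixes = tuple(remove_list)

# keep only the items that start with none of the prefixes
filtered = [x for x in mylist if not x.startswith(prefixes)]
print(filtered)
```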

Would this help?
remove_final = []
keep_final = []
for element in mylist:
    if any([element.startswith(x) for x in remove_list]):
        print(f'remove {element}')
        remove_final.append(element)
    else:
        print(f'keep {element}')
        keep_final.append(element)
Output:
remove 14001IB_L1P0
remove 14001OB_L1P1
remove 14002IB_L3P0
keep 14003OB_L1P1
remove 14001OB_L2P0
And final lists:
keep_final
['14003OB_L1P1']
remove_final
['14001IB_L1P0', '14001OB_L1P1', '14002IB_L3P0', '14001OB_L2P0']

Hope this code helps you.
mylist = ['14001IB_L1P0', '14001OB_L1P1', '14002IB_L3P0', '14003OB_L1P1', '14001OB_L2P0']
remove_list = ['14001', '14002']
filtered_mylist = mylist[:]
for x in remove_list:
    i = 0
    while i < len(filtered_mylist):
        if x in filtered_mylist[i]:
            print('remove ' + filtered_mylist[i])
            filtered_mylist.remove(filtered_mylist[i])
        else:
            print('keep '+ filtered_mylist[i])
            i+=1  # only advance when nothing was removed
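The while loop above avoids the problem from the question because it only advances i when nothing was removed. The original for loop skipped '14001OB_L1P1' because removing an item during iteration shifts the remaining items one place to the left, while the loop's internal position still moves forward. A small sketch that makes this visible (`items` and `visited` are illustrative names):

```python
items = ['14001IB_L1P0', '14001OB_L1P1', '14002IB_L3P0']
visited = []

for item in items:
    visited.append(item)
    if item.startswith('14001'):
        # After this removal the next element slides into the current
        # position, so the loop's next step jumps over it.
        items.remove(item)

print(visited)  # the second element never shows up here
print(items)    # ... and it is still in the list
```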

Here's another method: build a new list with append.
Use a filter function plus append instead of remove. That is much safer, because the list you iterate over is never modified.
mylist = ['14001IB_L1P0', '14001OB_L1P1', '14002IB_L3P0', '14003OB_L1P1', '14001OB_L2P0']
remove_list = ['14001', '14002']
def is_valid(item):
    for pattern in remove_list:
        if item.startswith(pattern):
            return False
    return True
res = []
for item in mylist:
    if is_valid(item):
        res.append(item)
print(res)

Split a string using a list of value at the same time

I have a string and a list:
src = 'ways to learn are read and execute.'
temp = ['ways to','are','and']
What I wanted is to split the string using the list temp's values and produce:
['learn','read','execute']
at the same time.
I tried a for loop:
for x in temp:
    src.split(x)
This is what it produced:
['','to learn are read and execute.']
['ways to learn','read and execute.']
['ways to learn are read','execute.']
What I want is to use all the values in the list together to split the string, rather than one value at a time.
Does anyone have a solution?

re.split is the conventional solution for splitting on multiple separators:
import re
src = 'ways to learn are read and execute.'
temp = ['ways to','are','and']
pattern = "|".join(re.escape(item) for item in temp)
result = re.split(pattern, src)
print(result)
Result:
['', ' learn ', ' read ', ' execute.']
You can also filter out blank items and strip the spaces+punctuation with a simple list comprehension:
result = [item.strip(" .") for item in result if item]
print(result)
Result:
['learn', 'read', 'execute']
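One thing to watch with a joined pattern like this: regex alternation tries the alternatives from left to right, so if one separator is a prefix of another, the shorter one can win. This does not happen with the separators in the question. The sketch below uses a made-up `separators` list to show the effect and one possible fix, sorting the separators longest first.

```python
import re

src = 'read and then execute and repeat'
separators = ['and', 'and then']  # hypothetical: one separator is a prefix of the other

naive = "|".join(re.escape(s) for s in separators)
longest_first = "|".join(
    re.escape(s) for s in sorted(separators, key=len, reverse=True)
)

# 'and' is tried first, so 'then' is left behind in the second piece
naive_parts = [p.strip() for p in re.split(naive, src) if p.strip()]
# 'and then' is tried first, so the whole separator is consumed
ordered_parts = [p.strip() for p in re.split(longest_first, src) if p.strip()]

print(naive_parts)
print(ordered_parts)
```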

This method is plain Python and does not rely on regular expressions. It's more verbose and more complex:
result = []
current = 0
for part in temp:
    too_long_result = src.split(part)[1]
    if current + 1 < len(temp):
        result.append(too_long_result.split(temp[current+1])[0].lstrip().rstrip())
    else:
        result.append(too_long_result.lstrip().rstrip())
    current += 1
print(result)
You can remove the .lstrip().rstrip() calls if you don't want to remove the leading and trailing whitespace in the list entries.

Loop solution. You can add conditions such as strip if you need them.
src = 'ways to learn are read and execute.'
temp = ['ways to','are','and']
copy_src = src
result = []
for x in temp:
    left, right = copy_src.split(x)
    if left:
        result.append(left) #or left.strip()
    copy_src = right
result.append(copy_src) #or copy_src.strip()

just keep it simple
src = 'ways to learn are read and execute.'
temp = ['ways','to','are','and']
res=''
for w1 in src.split():
    if w1 not in temp:
        if w1 not in res.split():
            res=res+w1+" "
print(res)

python string splitting with multiple splitting points

OK, I'll get straight to the point. Here is my code:
def digestfragmentwithenzyme(seqs, enzymes):
    fragment = []
    for seq in seqs:
        for enzyme in enzymes:
            results = []
            prog = re.compile(enzyme[0])
            for dingen in prog.finditer(seq):
                results.append(dingen.start() + enzyme[1])
            results.reverse()
            #result = 0
            for result in results:
                fragment.append(seq[result:])
                seq = seq[:result]
            fragment.append(seq[:result])
    fragment.reverse()
    return fragment
The input for this function is a list of strings (seqs), e.g.:
List = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
And enzymes as input:
[["TC", 1],["GC",1]]
(note: there can be multiple enzymes, but most of them are made up of the letters A, T, C and G)
The function should return a list that, in this example, contains 2 lists:
Outputlist = [["AATT","CCGGT","CGGGG","CT","CGGGGG"],["AAAG","CAAAAT","CAAAAAAG","CAAAAAAT","C"]]
Right now I am having trouble splitting it twice and getting the right output.
A little more information about the function: it looks through the string (seq) for the recognition site, in this case TC or GC, and splits at the offset given as the second element of each enzyme entry. It should do that for both strings in the list with both enzymes.

Assuming the idea is to split each sequence at every enzyme occurrence, with the cut placed at the given index inside the enzyme string (so for a two-letter enzyme with index 1 the cut falls between the two letters), you don't need regex.
You can do this by finding the occurrences, inserting a split marker at the correct index, and then post-processing the result to actually split.
For example:
def digestfragmentwithenzyme(seqs, enzymes):
    # preprocess enzymes once, then apply to each sequence
    replacements = []
    for enzyme in enzymes:
        replacements.append((enzyme[0], enzyme[0][0:enzyme[1]] + '|' + enzyme[0][enzyme[1]:]))
    result = []
    for seq in seqs:
        for r in replacements:
            seq = seq.replace(r[0], r[1]) # So AATTC becomes AATT|C
        result.append(seq.split('|')) # So AATT|C becomes AATT, C
    return result
def test():
    seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
    enzymes = [["TC", 1],["GC",1]]
    print(digestfragmentwithenzyme(seqs, enzymes))

Here is my solution:
Replace TC with "T C" and GC with "G C" (the position of the space comes from the given index), then split on whitespace.
def digest(seqs, enzymes):
    res = []
    for li in seqs:
        for en in enzymes:
            li = li.replace(en[0],en[0][:en[1]]+" " + en[0][en[1]:])
        r = li.split()
        res.append(r)
    return res
seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1],["GC",1]]
#enzymes = [["AAT", 2],["GC",1]]
print(seqs)
print(digest(seqs, enzymes))
The results are:
for ([["TC", 1],["GC",1]])
['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
for ([["AAT", 2],["GC",1]])
['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AA', 'TTCCGGTCGGGG', 'CTCGGGGG'], ['AAAG', 'CAAAA', 'TCAAAAAAG', 'CAAAAAA', 'TC']]

Here is something that should work using regex. In this solution, I find all occurrences of your enzyme strings and split using their corresponding index.
import re
def digestfragmentwithenzyme(seqs, enzymes):
    out = []
    dic = dict(enzymes) # dictionary of enzyme indices
    for seq in seqs:
        sub = []
        pos1 = 0
        enzstr = '|'.join(enz[0] for enz in enzymes) # "TC|GC" in this case
        for match in re.finditer('('+enzstr+')', seq):
            index = dic[match.group(0)]
            pos2 = match.start()+index
            sub.append(seq[pos1:pos2])
            pos1 = pos2
        sub.append(seq[pos1:])
        out.append(sub)
    # [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
    return out

Use positive lookbehind and lookahead regex search:
import re
def digest_fragment_with_enzyme(sequences, enzymes):
    pattern = '|'.join('((?<={})(?={}))'.format(strs[:ind], strs[ind:]) for strs, ind in enzymes)
    print(pattern) # prints ((?<=T)(?=C))|((?<=G)(?=C))
    for seq in sequences:
        indices = [0] + [m.start() for m in re.finditer(pattern, seq)] + [len(seq)]
        yield [seq[start: end] for start, end in zip(indices, indices[1:])]
seq = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1], ["GC", 1]]
print(list(digest_fragment_with_enzyme(seq, enzymes)))
Output:
[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'],
['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]

The simplest answer I can think of:
input_list = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = ['TC', 'GC']
output = []
for string in input_list:
    parts = []
    left = 0
    for right in range(1,len(string)):
        if string[right-1:right+1] in enzymes:
            parts.append(string[left:right])
            left = right
    parts.append(string[left:])
    output.append(parts)
print(output)

Throwing my hat in the ring here.
Using a dict for the patterns rather than a list of lists.
Joining the patterns as others have done, to avoid fancy regexes.
import re
sequences = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
patterns = { 'TC': 1, 'GC': 1 }
def intervals(patterns, text):
    pattern = '|'.join(patterns.keys())
    start = 0
    for match in re.finditer(pattern, text):
        index = match.start() + patterns.get(match.group())
        yield text[start:index]
        start = index
    yield text[start:]  # use start so this also works when nothing matched
print([list(intervals(patterns, s)) for s in sequences])
# [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]

python : Noob IndexError

I am currently trying to write code that scans a string and puts the position of each letter found into a list associated with that letter (e.g. if S is the 35th, 48th and 120th letter of the string, it will put 35, 48 and 120 in a list for the letter S). It will then put this list in a dictionary as a value, with S as the key.
My problem is simple: I get IndexError: list assignment index out of range when I try to put the value in the list, but I can't find out why.
string = "Squalalanoussommespartisetjetedteste"
taille = len(string)
dico = dict()
dico = {}
i = 0
for i in range(taille):
    if string[i] == "A" or string[i] == "a" :
        va = 0
        valA = []
        valA[va] = i
        va = va + 1
print(valA)
I apologize for my poor English, and thank you in advance for the help.

You don't need to specify an index when adding an item to a list in Python; use append. Also create the list once, before the loop, so it is not reset on every match. Try this:
valA = []
for i in range(taille):
    if string[i] == "A" or string[i] == "a" :
        valA.append(i)
print(valA)

You are getting this error in these lines
va = 0
valA = []
valA[va] = i
valA is an empty list here, with zero elements, so when you try to assign a value to index 0, it raises IndexError.
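A tiny sketch of that behaviour, separate from the original program: index assignment only works for positions that already exist, while append creates a new one. The name `caught` is introduced here only to record what happened.

```python
valA = []
caught = None

try:
    valA[0] = 3              # no position 0 exists yet in an empty list
except IndexError as exc:
    caught = type(exc).__name__
    print("IndexError:", exc)

valA.append(3)               # append grows the list by one element
valA[0] = 5                  # now index 0 exists, so assignment is fine
print(valA)
```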
Also, to get the indexes for each character you can loop over the string directly, like this:
s = "Squalalanoussommespartisetjetedteste"
d = dict()
for i, c in enumerate(s):
    d.setdefault(c.lower(), []).append(i)
print(d)

I found some errors in the code.
The index error occurs because you tried to assign to position 0 of an empty list.
valA = []
The list is empty. Then you tried to replace the value at position 0 when there is no position 0:
valA[va] = i
I made some changes to the code. In the ninth line you initialize an empty list. You should do that before the for loop; otherwise the loop re-creates it every time and you lose the values from previous iterations.
Here is the modified code:
string = "Squalalanoussommespartisetjetedteste"
taille = len(string)
dico = dict()
dico = {}
i = 0
valA = []
for i in range(taille):
    if string[i] == "A" or string[i] == "a":
        valA.append(i)
print(valA)
The output I got is:
[3, 5, 7, 19]

Though you may use the straightforward approach, Python has some useful modules that may help. For example:
import collections
s = "Squalalanoussommespartisetjetedteste"
result = collections.defaultdict(list)
for i,char in enumerate(s):
    result[char].append(i)
result will contain a dictionary with the string's characters as keys and lists of each character's indexes as values.
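As a usage sketch, here is the same idea with a lookup at the end. Lowercasing the string first is my assumption about the goal, based on the question checking for both "A" and "a"; `dico` is reused from the question as the name of the final plain dict.

```python
import collections

s = "Squalalanoussommespartisetjetedteste"
result = collections.defaultdict(list)

# lower() so that "S" and "s" share one key (an assumption about the goal)
for i, char in enumerate(s.lower()):
    result[char].append(i)

print(result["a"])
print(result["s"])

# Convert to a plain dict once the collection step is done, so that
# looking up a missing letter raises KeyError instead of adding a key.
dico = dict(result)
```

Looking up a letter that never occurred on the defaultdict itself would silently insert an empty list for it, which is why the conversion at the end can be handy.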

You are redefining the variables below on every iteration, so move them out of the loop.
va = 0
valA = []
Also, use the list's insert method. (insert is for when you need to specify the index; otherwise append is enough.)
So the final code is:
string = "Squalalanoussommespartisetjetedteste"
taille = len(string)
dico = dict()
dico = {}
i = 0
va = 0
valA = []
for i in range(taille):
    if string[i] == "A" or string[i] == "a" :
        valA.insert(va, i)
        va = va + 1
print(valA)
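To make the difference between the two methods concrete, here is a short sketch with made-up values, independent of the program above:

```python
valA = []
valA.append(3)       # append always adds at the end
valA.append(7)
valA.insert(0, 99)   # insert places 99 at index 0 and shifts the rest right
valA.insert(50, 1)   # an index past the end does not raise; the item lands at the end
print(valA)
```

Since the code above always inserts at the current end of the list, append gives the same result with less bookkeeping.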

You get the index error because valA is an empty list, which means it has no valid indexes.
Use append, and declare valA outside the loop:
string = "Squalalanoussommespartisetjetedteste"
taille = len(string)
dico = dict()
dico = {}
i = 0
valA = []
for i in range(taille):
    if string[i] == "A" or string[i] == "a" :
        valA.append(i)
print(valA)
