Let's assume I have a list like this:
List=["Face123","Body234","Face565"]
I would like to obtain as output a list with the characters/substrings listed in another list removed:
NonDesiredPattern=["Face","Body"]
Output=[123,234,565]
Create a function which returns a string without the undesired patterns.
Then use this function in a list comprehension:
import re

def remove_pattern(string, patterns):
    result = string
    for p in patterns:
        result = re.sub(p, '', result)
    return result

inputs = ["Face123", "Body234", "Face565"]
undesired_patterns = ["Face", "Body"]
outputs = [remove_pattern(e, undesired_patterns) for e in inputs]
I'm not sure this is 100% efficient, but you could do something like this:
def eval_list(og_list):
    list_parts = []
    list_nums = []
    for element in og_list:
        part = ""
        num = ""
        for char in element:
            if char.isalpha():
                part += char
            else:
                num += char
        list_parts.append(part)
        list_nums.append(num)
    return list_parts, list_nums
(This assumes each element is always alphabetic characters followed by a number.)
Use re.compile and re.sub
import re
lst = ["Face123", "Body234", "Face565"]
lst_no_desired_pattern = ["Face","Body"]
pattern = re.compile("|".join(lst_no_desired_pattern))
lst_output = [re.sub(pattern, "", word) for word in lst]
Result:
['123', '234', '565']
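One caveat when joining the substrings into a single regex: characters such as "." or "(" inside an undesired substring would be interpreted as regex syntax. A minimal sketch that escapes each substring with re.escape first; the function name strip_substrings is hypothetical and not taken from the answers above:

```python
import re

def strip_substrings(items, unwanted):
    """Remove every literal occurrence of the unwanted substrings from each item."""
    # Longest first, so a longer substring wins over one of its own prefixes.
    ordered = sorted(unwanted, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(u) for u in ordered))
    return [pattern.sub("", item) for item in items]

print(strip_substrings(["Face123", "Body234", "Face565"], ["Face", "Body"]))
# ['123', '234', '565']
```

Without the escaping step, a substring like "a.b" would also remove "axb".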
OK, I'll get straight to the point. Here is my code:
def digestfragmentwithenzyme(seqs, enzymes):
    fragment = []
    for seq in seqs:
        for enzyme in enzymes:
            results = []
            prog = re.compile(enzyme[0])
            for dingen in prog.finditer(seq):
                results.append(dingen.start() + enzyme[1])
            results.reverse()
            #result = 0
            for result in results:
                fragment.append(seq[result:])
                seq = seq[:result]
            fragment.append(seq[:result])
    fragment.reverse()
    return fragment
The input for this function is a list of multiple strings (seqs), e.g.:
List = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
And the enzymes input:
[["TC", 1],["GC",1]]
(Note: more enzymes can be given, but most of them are patterns made of the letters A, T, C, and G.)
The function should return a list that, in this example, contains 2 lists:
Outputlist = [["AATT","CCGGT","CGGGG","CT","CGGGGG"],["AAAG","CAAAAT","CAAAAAAG","CAAAAAAT","C"]]
Right now I am having trouble splitting on both enzymes and getting the right output.
A little more information about the function: it looks through the string (seq) for the recognition site, in this case TC or GC, and splits at the offset given as the second element of each enzyme entry. It should do that for both strings in the list, with both enzymes.
Assuming the idea is to split at each enzyme's recognition site, at the given index within the (multi-letter) enzyme string, so that the split in essence falls between two letters, you don't need regex.
You can do this by looking for the occurrences, inserting a split marker at the correct index, and then post-processing the result to actually split.
For example:
def digestfragmentwithenzyme(seqs, enzymes):
    # preprocess enzymes once, then apply to each sequence
    replacements = []
    for enzyme in enzymes:
        replacements.append((enzyme[0], enzyme[0][0:enzyme[1]] + '|' + enzyme[0][enzyme[1]:]))
    result = []
    for seq in seqs:
        for r in replacements:
            seq = seq.replace(r[0], r[1])  # So AATTC becomes AATT|C
        result.append(seq.split('|'))  # So AATT|C becomes AATT, C
    return result

def test():
    seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
    enzymes = [["TC", 1],["GC",1]]
    print(digestfragmentwithenzyme(seqs, enzymes))
Here is my solution:
Replace TC with "T C" and GC with "G C" (the position of the space is based on the given index), then split on the space character.
def digest(seqs, enzymes):
    res = []
    for li in seqs:
        for en in enzymes:
            li = li.replace(en[0], en[0][:en[1]] + " " + en[0][en[1]:])
        r = li.split()
        res.append(r)
    return res

seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1],["GC",1]]
#enzymes = [["AAT", 2],["GC",1]]
print(seqs)
print(digest(seqs, enzymes))
The results are:
For [["TC", 1],["GC",1]]:
['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
For [["AAT", 2],["GC",1]]:
['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AA', 'TTCCGGTCGGGG', 'CTCGGGGG'], ['AAAG', 'CAAAA', 'TCAAAAAAG', 'CAAAAAA', 'TC']]
Here is something that should work using regex. In this solution, I find all occurrences of your enzyme strings and split using their corresponding index.
import re

def digestfragmentwithenzyme(seqs, enzymes):
    out = []
    dic = dict(enzymes)  # dictionary of enzyme indices
    for seq in seqs:
        sub = []
        pos1 = 0
        enzstr = '|'.join(enz[0] for enz in enzymes)  # "TC|GC" in this case
        for match in re.finditer('('+enzstr+')', seq):
            index = dic[match.group(0)]
            pos2 = match.start()+index
            sub.append(seq[pos1:pos2])
            pos1 = pos2
        sub.append(seq[pos1:])
        out.append(sub)
    # [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
    return out
Use positive lookbehind and lookahead regex search:
import re

def digest_fragment_with_enzyme(sequences, enzymes):
    pattern = '|'.join('((?<={})(?={}))'.format(strs[:ind], strs[ind:]) for strs, ind in enzymes)
    print(pattern)  # prints ((?<=T)(?=C))|((?<=G)(?=C))
    for seq in sequences:
        indices = [0] + [m.start() for m in re.finditer(pattern, seq)] + [len(seq)]
        yield [seq[start: end] for start, end in zip(indices, indices[1:])]

seq = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1], ["GC", 1]]
print(list(digest_fragment_with_enzyme(seq, enzymes)))
Output:
[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'],
['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
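Because the lookbehind/lookahead pattern matches a zero-width position, the same idea can probably be expressed with re.split directly. This is a sketch under two assumptions: Python 3.7 or later (earlier versions do not split on empty matches), and a cut offset strictly inside each recognition site. The function name digest is hypothetical.

```python
import re

def digest(sequences, enzymes):
    """Split each sequence at the cut offset inside every recognition site."""
    # Zero-width pattern: the text before the cut is a lookbehind,
    # the text after the cut is a lookahead.
    pattern = "|".join(
        "(?<={})(?={})".format(re.escape(site[:cut]), re.escape(site[cut:]))
        for site, cut in enzymes
    )
    # re.split only splits on empty matches in Python 3.7 and later.
    return [re.split(pattern, seq) for seq in sequences]

seqs = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
print(digest(seqs, [["TC", 1], ["GC", 1]]))
# [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
```

The pattern deliberately has no capturing groups, since re.split would otherwise include the captured text in its result.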
The simplest answer I can think of:
input_list = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = ['TC', 'GC']
output = []
for string in input_list:
    parts = []
    left = 0
    for right in range(1,len(string)):
        if string[right-1:right+1] in enzymes:
            parts.append(string[left:right])
            left = right
    parts.append(string[left:])
    output.append(parts)
print(output)
Throwing my hat in the ring here.
This uses a dict for the patterns rather than a list of lists, and joins the patterns into a single alternation, as others have done, to avoid fancy regexes.
import re

sequences = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
patterns = { 'TC': 1, 'GC': 1 }

def intervals(patterns, text):
    pattern = '|'.join(patterns.keys())
    start = 0
    for match in re.finditer(pattern, text):
        index = match.start() + patterns.get(match.group())
        yield text[start:index]
        start = index
    # use start here so that a text with no match still yields the whole string
    yield text[start:]

print([list(intervals(patterns, s)) for s in sequences])
# [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
I want to get rid of repeated consecutive punctuation marks and leave only one of them.
If I have
string = 'Is it raining????'
I want to get
string = 'Is it raining?'
But I don't want to collapse '...'
I also need to do this without using regular expressions. I am a beginner in Python and would appreciate any advice or hints. Thanks :)
Yet another groupby approach:
from itertools import groupby
from string import punctuation
punc = set(punctuation) - set('.')
s = 'Thisss is ... a test!!! string,,,,, with 1234445556667 rrrrepeats????'
print(s)
newtext = []
for k, g in groupby(s):
    if k in punc:
        newtext.append(k)
    else:
        newtext.extend(g)
print(''.join(newtext))
output
Thisss is ... a test!!! string,,,,, with 1234445556667 rrrrepeats????
Thisss is ... a test! string, with 1234445556667 rrrrepeats?
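The same groupby idea can be wrapped in a reusable function. This is only a sketch; the name collapse_punctuation and its keep parameter are hypothetical, with keep listing the marks (by default the period) whose runs should be left alone:

```python
from itertools import groupby
from string import punctuation

def collapse_punctuation(text, keep="."):
    """Collapse runs of one repeated punctuation mark, except marks in `keep`."""
    collapsible = set(punctuation) - set(keep)
    parts = []
    for char, run in groupby(text):
        # A run of a collapsible mark contributes one character;
        # any other run is kept whole.
        parts.append(char if char in collapsible else "".join(run))
    return "".join(parts)

print(collapse_punctuation("Is it raining????"))  # Is it raining?
print(collapse_punctuation("Wait... what!!"))     # Wait... what!
```

No regular expressions are involved, which matches the constraint in the question.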
import string
from itertools import groupby
# get all punctuation minus period.
puncs = set(string.punctuation)-set('.')
s = 'Is it raining???? No but...,,,, it is snowing!!!!!!!###!######'
# get count of consecutive characters
t = [[k,len(list(g))] for k, g in groupby(s)]
s = ''
for ele in t:
    char = ele[0]
    count = ele[1]
    if char in puncs and count > 1:
        count = 1
    s += char*count
print(s)
# Is it raining? No but..., it is snowing!#!#
How about the following kind of approach:
import string
text = 'Is it raining???? No,,,, but...,,,, it is snoooowing!!!!!!!'
for punctuation in string.punctuation:
    if punctuation != '.':
        while True:
            replaced = text.replace(punctuation * 2, punctuation)
            if replaced == text:
                break
            text = replaced
print(text)
This would give the following output:
Is it raining? No, but..., it is snoooowing!
Or for a more efficient version giving the same results:
import string
text = 'Is it raining???? No,,,, but...,,,, it is snoooowing!!!!!!!'
last = None
output = []
for c in text:
    if c == '.':
        output.append(c)
    elif c != last:
        if c in string.punctuation:
            last = c
        else:
            last = None
        output.append(c)
print(''.join(output))
from itertools import groupby
s = 'Is it raining???? okkkk!!! ll... yeh""" ok?'
replaceables = [ch for i, ch in enumerate(s) if i > 0 and s[i - 1] == ch and (not ch.isalpha() and ch != '.')]
replaceables = [list(g) for k, g in groupby(replaceables)]
start = 0
for replaceable in replaceables:
    replaceable = ''.join(replaceable)
    start = s.find(replaceable, start)
    r = s[start:].replace(replaceable, '', 1)
    s = s.replace(s[start:], r)
print(s)
Python has string.find() and string.rfind() to get the index of a substring in a string.
I'm wondering whether there is something like string.find_all() which can return all found indexes (not only the first from the beginning or the first from the end).
For example:
string = "test test test test"
print string.find('test') # 0
print string.rfind('test') # 15
#this is the goal
print string.find_all('test') # [0,5,10,15]
For counting the occurrences, see Count number of occurrences of a substring in a string.
There is no simple built-in string function that does what you're looking for, but you could use the more powerful regular expressions:
import re
[m.start() for m in re.finditer('test', 'test test test test')]
#[0, 5, 10, 15]
If you want to find overlapping matches, lookahead will do that:
[m.start() for m in re.finditer('(?=tt)', 'ttt')]
#[0, 1]
If you want a reverse find-all without overlaps, you can combine positive and negative lookahead into an expression like this:
search = 'tt'
[m.start() for m in re.finditer('(?=%s)(?!.{1,%d}%s)' % (search, len(search)-1, search), 'ttt')]
#[1]
re.finditer returns an iterator, so you could change the [] in the above to () to get a generator expression instead of a list, which is more efficient if you only iterate through the results once.
>>> help(str.find)
Help on method_descriptor:
find(...)
S.find(sub [,start [,end]]) -> int
Thus, we can build it ourselves:
def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub) # use start += 1 to find overlapping matches

list(find_all('spam spam spam spam', 'spam')) # [0, 5, 10, 15]
No temporary strings or regexes required.
Here's a (very inefficient) way to get all (i.e. even overlapping) matches:
>>> string = "test test test test"
>>> [i for i in range(len(string)) if string.startswith('test', i)]
[0, 5, 10, 15]
Use re.finditer:
import re
sentence = input("Give me a sentence ")
word = input("What word would you like to find ")
for match in re.finditer(word, sentence):
    print((match.start(), match.end()))
For word = "this" and sentence = "this is a sentence this this" this will yield the output:
(0, 4)
(19, 23)
(24, 28)
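Since the word comes from user input here, it may contain characters that have a special meaning in a regex (for example "."). A hedged sketch that treats the input as a literal string by passing it through re.escape; the helper name find_spans is hypothetical:

```python
import re

def find_spans(sentence, word):
    """Return (start, end) for each non-overlapping literal occurrence of word."""
    # re.escape makes characters such as "." match only themselves.
    return [m.span() for m in re.finditer(re.escape(word), sentence)]

print(find_spans("this is a sentence this this", "this"))
# [(0, 4), (19, 23), (24, 28)]
print(find_spans("a.b axb a.b", "a.b"))
# [(0, 3), (8, 11)]
```

Without the escaping, the second call would also report "axb", because an unescaped "." matches any character.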
Again, old thread, but here's my solution using a generator and plain str.find.
def findall(p, s):
    '''Yields all the positions of
    the pattern p in the string s.'''
    i = s.find(p)
    while i != -1:
        yield i
        i = s.find(p, i+1)
Example
x = 'banananassantana'
[(i, x[i:i+2]) for i in findall('na', x)]
returns
[(2, 'na'), (4, 'na'), (6, 'na'), (14, 'na')]
You can use re.finditer() for non-overlapping matches.
>>> import re
>>> aString = 'this is a string where the substring "is" is repeated several times'
>>> print [(a.start(), a.end()) for a in list(re.finditer('is', aString))]
[(2, 4), (5, 7), (38, 40), (42, 44)]
but it won't find overlapping matches, for example:
In [1]: aString="ababa"
In [2]: print [(a.start(), a.end()) for a in list(re.finditer('aba', aString))]
Output: [(0, 3)]
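One way to also report the overlapping occurrence is to wrap the substring in a lookahead with a capturing group, so each match consumes no characters. A minimal sketch, assuming the substring should be matched literally; the function name overlapping_spans is hypothetical:

```python
import re

def overlapping_spans(text, sub):
    """Return (start, end) for every occurrence of sub, including overlapping ones."""
    # The lookahead consumes nothing, so the scan advances one position at a time;
    # group 1 inside the lookahead still records where the substring sits.
    pattern = re.compile("(?=({}))".format(re.escape(sub)))
    return [m.span(1) for m in pattern.finditer(text)]

print(overlapping_spans("ababa", "aba"))  # [(0, 3), (2, 5)]
```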
Come, let us recurse together.
def locations_of_substring(string, substring):
    """Return a list of locations of a substring."""
    substring_length = len(substring)
    def recurse(locations_found, start):
        location = string.find(substring, start)
        if location != -1:
            return recurse(locations_found + [location], location+substring_length)
        else:
            return locations_found
    return recurse([], 0)

print(locations_of_substring('this is a test for finding this and this', 'this'))
# prints [0, 27, 36]
No need for regular expressions this way.
If you're just looking to count occurrences of a single character, this would work:
from functools import reduce  # reduce is no longer a builtin in Python 3
string = "dooobiedoobiedoobie"
match = 'o'
reduce(lambda count, char: count + 1 if char == match else count, string, 0)
# produces 7
Also,
string = "test test test test"
match = "test"
len(string.split(match)) - 1
# produces 4
My hunch is that neither of these (especially #2) is terribly performant.
This is an old thread, but I got interested and wanted to share my solution.
def find_all(a_string, sub):
    result = []
    k = 0
    while k < len(a_string):
        k = a_string.find(sub, k)
        if k == -1:
            return result
        else:
            result.append(k)
            k += 1  # change to k += len(sub) to not search overlapping results
    return result
It should return a list of positions where the substring was found.
Please comment if you see an error or room for improvement.
This does the trick for me using re.finditer
import re
text = 'This is sample text to test if this pythonic '\
'program can serve as an indexing platform for '\
'finding words in a paragraph. It can give '\
'values as to where the word is located with the '\
'different examples as stated'
# find all occurrences of the word 'as' in the above text
find_the_word = re.finditer('as', text)
for match in find_the_word:
    print('start {}, end {}, search string \'{}\''.
          format(match.start(), match.end(), match.group()))
This thread is a little old but this worked for me:
numberString = "onetwothreefourfivesixseveneightninefiveten"
testString = "five"
marker = 0
while marker < len(numberString):
    try:
        print(numberString.index(testString, marker))
        marker = numberString.index(testString, marker) + 1
    except ValueError:
        print("String not found")
        marker = len(numberString)
You can try :
>>> string = "test test test test"
>>> for index,value in enumerate(string):
...     if string[index:index+(len("test"))] == "test":
...         print(index)
...
0
5
10
15
You can try :
import re
str1 = "This dress looks good; you have good taste in clothes."
substr = "good"
result = [_.start() for _ in re.finditer(substr, str1)]
# result = [17, 32]
When looking for a large amount of key words in a document, use flashtext
from flashtext import KeywordProcessor
words = ['test', 'exam', 'quiz']
txt = 'this is a test'
kwp = KeywordProcessor()
kwp.add_keywords_from_list(words)
result = kwp.extract_keywords(txt, span_info=True)
Flashtext runs faster than regex on a large list of search words.
This function does not examine every position in the string one by one, so it does not waste compute resources. My attempt:
def findAll(string,word):
    all_positions=[]
    next_pos=-1
    while True:
        next_pos=string.find(word,next_pos+1)
        if(next_pos<0):
            break
        all_positions.append(next_pos)
    return all_positions
To use it, call it like this:
result=findAll('this word is a big word man how many words are there?','word')
src = input() # we will find substring in this string
sub = input() # substring
res = []
pos = src.find(sub)
while pos != -1:
    res.append(pos)
    pos = src.find(sub, pos + 1)
All the solutions provided by others are based on find() or other built-in methods.
What is the basic underlying algorithm for finding all occurrences of a substring in a string?
def find_all(string,substring):
    """
    Function: Returning all the index of substring in a string
    Arguments: String and the search string
    Return:Returning a list
    """
    length = len(substring)
    c=0
    indexes = []
    while c < len(string):
        if string[c:c+length] == substring:
            indexes.append(c)
        c=c+1
    return indexes
You can also subclass str and put this function in the new class, as below.
class newstr(str):
    def find_all(string,substring):
        """
        Function: Returning all the index of substring in a string
        Arguments: String and the search string
        Return:Returning a list
        """
        length = len(substring)
        c=0
        indexes = []
        while c < len(string):
            if string[c:c+length] == substring:
                indexes.append(c)
            c=c+1
        return indexes
Calling the method:
newstr.find_all('Do you find this answer helpful? then upvote this!','this')
This is a solution to a similar question from HackerRank. I hope it helps.
import re
a = input()
b = input()
if b not in a:
    print((-1,-1))
else:
    # collect the start index of every match, including overlapping ones
    start_indc = [m.start() for m in re.finditer('(?=' + b + ')', a)]
    for i in range(len(start_indc)):
        print((start_indc[i], start_indc[i]+len(b)-1))
Output:
aaadaa
aa
(0, 1)
(1, 2)
(4, 5)
Here's a solution that I came up with, using assignment expression (new feature since Python 3.8):
string = "test test test test"
phrase = "test"
start = -1
result = [(start := string.find(phrase, start + 1)) for _ in range(string.count(phrase))]
Output:
[0, 5, 10, 15]
I think the cleanest solution is one without libraries or yields:
def find_all_occurrences(string, sub):
    index_of_occurrences = []
    current_index = 0
    while True:
        current_index = string.find(sub, current_index)
        if current_index == -1:
            return index_of_occurrences
        else:
            index_of_occurrences.append(current_index)
            current_index += len(sub)

find_all_occurrences(string, substr)
Note: the find() method returns -1 when it can't find anything.
A Pythonic way, when searching for a single character, would be:
mystring = 'Hello World, this should work!'
find_all = lambda c,s: [x for x in range(c.find(s), len(c)) if c[x] == s]
# c is the string to search in
# s is the single character to search for
find_all(mystring,'o') # will return all positions of 'o'
[4, 7, 20, 26]
If you only want to use numpy, here is a solution:
import numpy as np
S= "test test test test"
S2 = 'test'
inds = np.cumsum([len(k)+len(S2) for k in S.split(S2)[:-1]])- len(S2)
print(inds)
If you want to do it without re (regex), then:
find_all = lambda _str,_w : [ i for i in range(len(_str)) if _str.startswith(_w,i) ]
string = "test test test test"
print( find_all(string, 'test') ) # >>> [0, 5, 10, 15]
Please look at the code below:
#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''

def get_substring_indices(text, s):
    result = [i for i in range(len(text)) if text.startswith(s, i)]
    return result

if __name__ == '__main__':
    text = "How much wood would a wood chuck chuck if a wood chuck could chuck wood?"
    s = 'wood'
    print(get_substring_indices(text, s))
def find_index(string, let):
    enumerated = [place for place, letter in enumerate(string) if letter == let]
    return enumerated
For example:
find_index("hey doode find d", "d")
returns:
[4, 7, 13, 15]
Not exactly what OP asked, but you could also use the split function to get a list of the pieces where the substring doesn't occur. OP didn't specify the end goal of the code, but if your goal is to remove the substrings anyway, then this could be a simple one-liner. There are probably more efficient ways to do this with larger strings; regular expressions would be preferable in that case.
# Extract all non-substrings
s = "an-example-string"
s_no_dash = s.split('-')
# >>> s_no_dash
# ['an', 'example', 'string']
# Or extract and join them into a sentence
s_no_dash2 = ' '.join(s.split('-'))
# >>> s_no_dash2
# 'an example string'
I only skimmed the other answers, so apologies if this is already up there.
def count_substring(string, sub_string):
    c=0
    # stop at the last position where sub_string can still fit
    for i in range(0, len(string)-len(sub_string)+1):
        if string[i:i+len(sub_string)] == sub_string:
            c+=1
    return c

if __name__ == '__main__':
    string = input().strip()
    sub_string = input().strip()
    count = count_substring(string, sub_string)
    print(count)
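Note that a sliding comparison like this counts overlapping occurrences, whereas the built-in str.count counts only non-overlapping ones. A small sketch of the difference using str.find; the helper name count_overlapping is hypothetical:

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing overlaps."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # step forward by one so overlapping matches are seen
        start = text.find(sub, start + 1)
    return count

print("aaaa".count("aa"))               # 2
print(count_overlapping("aaaa", "aa"))  # 3
```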
I ran into the same problem and did this:
hw = 'Hello oh World!'
list_hw = list(hw)
o_in_hw = []
while True:
    o = hw.find('o')
    if o != -1:
        o_in_hw.append(o)
        list_hw[o] = ' '
        hw = ''.join(list_hw)
    else:
        print(o_in_hw)
        break
I'm pretty new at coding, so you can probably simplify it (and, if you plan to use it repeatedly, of course make it a function).
All in all, it works as intended for what I was doing.
Edit: Please note this is for single characters only, and it will change your variable, so you have to copy the string into a new variable first if you want to keep it. I didn't put that in the code because it's easy, and the code is only meant to show how I made it work.
By slicing, we build every possible substring, append each one to a list, and then use the list's count function to find how many times the target occurs.
s=input()
n=len(s)
l=[]
f=input()
for i in range(0,n):
    for j in range(1,n+1):
        l.append(s[i:j])
if f in l:
    print(l.count(f))
To count the occurrences of each character in a given string and return them as a dictionary,
e.g. for hello the
result is:
{'h':1, 'e':1, 'l':2, 'o':1}
def count(string):
    result = {}
    if(string):
        for i in string:
            result[i] = string.count(i)
        return result
    return {}
Or else you can do it like this:
from collections import Counter
def count(string):
    return Counter(string)
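A short usage sketch for the Counter variant (the variable name counts is just for illustration). A Counter behaves like a dict, but missing keys report zero instead of raising KeyError:

```python
from collections import Counter

counts = Counter("hello")
print(counts["l"])            # 2
print(counts["z"])            # 0, missing keys count as zero
print(counts.most_common(1))  # [('l', 2)]
```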