Find multiple elements in string in Python

My problem is that I need to find multiple elements in one string.
For example, I have a string that looks like this:
line = if ((var.equals("INPUT")) || (var.equals("OUTPUT"))
and this code to find everything between ' (" ' and ' ") ':
char1 = '("'
char2 = '")'
add = line[line.find(char1)+2 : line.find(char2)]
list.append(add)
The current result is just:
['INPUT']
but I need the result to look like this:
['INPUT','OUTPUT', ...]
After finding the first match it stops searching, but I need every substring in the line that matches this search.
Each match should also be appended to the list.

The simplest:
>>> import re
>>> s = """line = if ((var.equals("INPUT")) || (var.equals("OUTPUT"))"""
>>> r = re.compile(r'\("(.*?)"\)')
>>> r.findall(s)
['INPUT', 'OUTPUT']
The trick is to use .*? which is a non-greedy *.
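
To see why the non-greedy quantifier matters, here is a small comparison on the same sample string. It is only a sketch; the names greedy and lazy are mine, not part of the answer above.

```python
import re

s = 'if ((var.equals("INPUT")) || (var.equals("OUTPUT"))'

# Greedy: .* runs to the LAST '")' in the string, so everything is one match.
greedy = re.findall(r'\("(.*)"\)', s)

# Non-greedy: .*? stops at the FIRST '")' after each '("'.
lazy = re.findall(r'\("(.*?)"\)', s)

print(greedy)
print(lazy)
```

The greedy pattern returns a single long match that swallows both calls, while the non-greedy one returns the two names separately.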

You should look into regular expressions because that's a perfect fit for what you're trying to achieve.
Let's examine a regular expression that does what you want:
import re
regex = re.compile(r'\("([^"]+)"\)')
It matches the string (" then captures anything that isn't a quotation mark and then matches ") at the end.
By using it with findall you will get all the captured groups:
In [1]: import re
In [2]: regex = re.compile(r'\("([^"]+)"\)')
In [3]: line = 'if ((var.equals("INPUT")) || (var.equals("OUTPUT"))'
In [4]: regex.findall(line)
Out[4]: ['INPUT', 'OUTPUT']

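If you don't want to use regex, you can repeat the search on the remainder of the line:
line = 'if ((var.equals("INPUT")) || (var.equals("OUTPUT"))'
char1 = '("'
char2 = '")'
matches = []  # avoid naming this `list`, which would shadow the built-in
add = line[line.find(char1)+2 : line.find(char2)]
matches.append(add)
line1 = line[line.find(char2)+1:]  # the rest of the line after the first match
add = line1[line1.find(char1)+2 : line1.find(char2)]
matches.append(add)
print(matches)  # ['INPUT', 'OUTPUT']
Just add those three lines to your code and you're done. Note that this handles exactly two matches.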
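
The snippet above covers two matches. A loop that passes a start position to str.find extends the same idea to any number of matches. This is a minimal sketch; the function name find_all_between is hypothetical.

```python
def find_all_between(text, start_marker, end_marker):
    """Return every substring found between start_marker and end_marker."""
    results = []
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            break  # no further opening marker
        start += len(start_marker)
        end = text.find(end_marker, start)
        if end == -1:
            break  # opening marker without a closing one
        results.append(text[start:end])
        pos = end + len(end_marker)  # continue after this match
    return results

line = 'if ((var.equals("INPUT")) || (var.equals("OUTPUT"))'
print(find_all_between(line, '("', '")'))  # ['INPUT', 'OUTPUT']
```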

If I understand you correctly, then something like this should help:
line = 'line = if ((var.equals("INPUT")) || (var.equals("OUTPUT"))'
items = []
start = 0
end = 0
c = 0
while c < len(line) - 1:  # stop one short so line[c + 1] is always valid
    if line[c] == '(' and line[c + 1] == '"':
        start = c + 2
    if line[c] == '"' and line[c + 1] == ')':
        end = c
    if start and end:
        items.append(line[start:end])
        start = end = None
    c += 1
print(items) # ['INPUT', 'OUTPUT']

Related

python replace multiple occurrences of string with different values

I am writing a Python script that replaces every occurrence of a math function such as log with its value, but I am unable to replace multiple occurrences of a function with their answers.
text = "69+log(2)+log(3)-log(57)/420"
log_list = []
log_answer = []
z = ""
c = 0
hit_l = False
for r in text:
if hit_l:
c += 1
if c >= 4 and r != ")":
z += r
elif r == ")":
hit_l = False
if r == "l":
hit_l = True
log_list.append(z)
if z != '':
logs = log_list[-1]
logs = re.sub("og\\(", ";", logs)
log_list = logs.split(";")
for ans in log_list:
log_answer.append(math.log(int(ans)))
for a in log_answer:
text = re.sub(f"log\\({a}\\)", str(a), text)
I want to replace log(10) and log(2) with 1 and 0.301 respectively. I tried using re.sub, but it is not working: the functions are not replaced with their answers. Any help would be appreciated, thank you.

Here is my take on this using eval along with re.sub with a callback function:
import math
import re
x = "log(10)+log(2)"
output = re.sub(r'log\((\d+(?:\.\d+)?)\)', lambda m: str(eval('math.log(' + m.group(1) + ', 10)')), x)
print(output) # roughly 1.0+0.30103
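
Since the pattern already captures a plain number, the same callback idea works without eval. The sketch below assumes base-10 logarithms (the question expects log(10) to become 1); the name replace_logs is hypothetical.

```python
import math
import re

def replace_logs(expr):
    """Replace every log(<number>) in expr with its base-10 logarithm."""
    def evaluate(match):
        # match.group(1) is the number inside the parentheses
        return str(math.log10(float(match.group(1))))
    return re.sub(r'log\((\d+(?:\.\d+)?)\)', evaluate, expr)

print(replace_logs("log(10)+log(2)"))
print(replace_logs("69+log(100)"))
```

Converting the captured text with float keeps arbitrary input away from eval, which matters if the expression comes from a user.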

As long as your string contains no spaces and there are + signs between different logarithmic functions, eval could be a way to do it.
>>> a = 'log(10)+log(2)'
>>> b = a.split('+')
>>> b
['log(10)', 'log(2)']
>>> from math import log10 as log
>>> [eval(i) for i in b]
[1.0, 0.3010299956639812]
EDIT:
You could repeatedly use str.replace method to replace all mathematical operators (if there are more than one) with whitespaces and eventually use str.split like:
>>> text.replace('+', ' ').replace('-', ' ').replace('*', ' ').replace('/', ' ').split()
['69', 'log(2)', 'log(3)', 'log(57)', '420']

Is there a regex script that can be used to extract texts by defining a start and an end in a text file [duplicate]

Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.
I only know the few characters directly before the part I am interested in (AAA) and directly after it (ZZZ).
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
How to do the same thing in Python?

Using regular expressions (see the re module documentation for further reference):
import re
text = 'gfgfdAAA1234ZZZuijjk'
m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)
# found: 1234
or:
import re
text = 'gfgfdAAA1234ZZZuijjk'
try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = '' # apply your error handling
# found: 1234

>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'
Then you can use regexps with the re module as well, if you want, but that's not necessary in your case.
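
One caveat with the find approach: when a marker is missing, str.find returns -1 and the slice quietly yields the wrong text. A small wrapper can make that case explicit. This is a sketch; the name between and the choice to return None are my assumptions.

```python
def between(text, left, right):
    """Return the text between the first `left` and the next `right`, or None."""
    start = text.find(left)
    if start == -1:
        return None  # left marker missing
    start += len(left)
    end = text.find(right, start)
    if end == -1:
        return None  # right marker missing after the left one
    return text[start:end]

print(between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
print(between('no markers here', 'AAA', 'ZZZ'))       # None
```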

regular expression
import re
re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)
The above as-is will fail with an AttributeError if there are no "AAA" and "ZZZ" in your_text
string methods
your_text.partition("AAA")[2].partition("ZZZ")[0]
The above will return an empty string if "AAA" doesn't exist in your_text; if only "ZZZ" is missing, it returns everything after "AAA".
PS Python Challenge?

Surprised that nobody has mentioned this which is my quick version for one-off scripts:
>>> x = 'gfgfdAAA1234ZZZuijjk'
>>> x.split('AAA')[1].split('ZZZ')[0]
'1234'

You can do it in just one line of code, provided the part you want is the only run of digits in the string:
>>> import re
>>> re.findall(r'\d{1,5}','gfgfdAAA1234ZZZuijjk')
['1234']
The result is a list.

import re
print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))

You can use re module for that:
>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)

In Python, extracting a substring from a string can be done using the findall function in the regular expression (re) module.
>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print(ss)
['1234']

text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'
print(text[text.index(left)+len(left):text.index(right)])
Gives
string

>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
'10508'

With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
You could do the same with re.sub function using the same regex.
>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'
In basic sed, capturing groups are written as \(..\), but in Python they are written as (..).

You can find first substring with this function in your code (by character index). Also, you can find what is after a substring.
def FindSubString(strText, strSubString, Offset=None):
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1 # Not Found
        else:
            if Offset == None:
                Result = strText[Start+len(strSubString):]
            elif Offset == 0:
                return Start
            else:
                AfterSubString = Start+len(strSubString)
                Result = strText[AfterSubString:AfterSubString + int(Offset)]
            return Result
    except:
        return -1
# Example:
Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"
print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")
print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")
print("What is after substring \"%s\"?" %(subText))
print(FindSubString(Text, subText))
# Your answer:
Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"
AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforText2 = FindSubString(Text, subText2, 0)
print("\nYour answer:\n%s" %(Text[AfterText1:BeforText2]))

Using PyParsing
import pyparsing as pp
word = pp.Word(pp.alphanums)
s = 'gfgfdAAA1234ZZZuijjk'
rule = pp.nestedExpr('AAA', 'ZZZ')
for match in rule.searchString(s):
    print(match)
which yields:
[['1234']]

One liner with Python 3.8 if text is guaranteed to contain the substring:
text[text.find(start:='AAA')+len(start):text.find('ZZZ')]

In case somebody has to do the same thing that I did: I had to extract everything inside parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:
import re
regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'
That is, you need to escape the parentheses with a backslash (\). This is more a question about regular expressions than about Python.
Also, in some cases you may see an 'r' prefix before the regex definition. It marks a raw string; without the r prefix, you need to escape backslashes as you would in C.

Also, you can find all combinations with the function below:
s = 'Part 1. Part 2. Part 3 then more text'
def find_all_places(text, word):
    word_places = []
    i = 0
    while True:
        word_place = text.find(word, i)
        if word_place < 0:
            break
        word_places.append(word_place)
        i = word_place + len(word)  # resume just past this occurrence
    return word_places
def find_all_combination(text, start, end):
    start_places = find_all_places(text, start)
    end_places = find_all_places(text, end)
    combination_list = []
    for start_place in start_places:
        for end_place in end_places:
            if start_place >= end_place:
                continue
            combination_list.append(text[start_place:end_place])
    return combination_list
print(find_all_combination(s, "Part", "Part"))
result:
['Part 1. ', 'Part 1. Part 2. ', 'Part 2. ']

In case you want to look for multiple occurrences:
content = "Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
strings = []
for c in content.split('Prefix_'):
    spos = c.find('_Suffix')
    if spos != -1:
        strings.append(c[:spos])
print(strings)
Or, as a one-liner:
strings = [ c[:c.find('_Suffix')] for c in content.split('Prefix_') if c.find('_Suffix')!=-1 ]
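
For comparison, the same multiple-occurrence extraction can be written with re.findall. This is a sketch, not part of the answer above; the helper name all_between is hypothetical, and re.escape is used so the markers are treated as literal text.

```python
import re

def all_between(text, prefix, suffix):
    """Return every shortest substring enclosed by prefix and suffix."""
    pattern = re.escape(prefix) + r'(.*?)' + re.escape(suffix)
    return re.findall(pattern, text)

content = "Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
print(all_between(content, 'Prefix_', '_Suffix'))  # ['helloworld', '42']
```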

Here's a solution without regex that also accounts for scenarios where the first substring contains the second substring. This function will only find a substring if the second marker is after the first marker.
def find_substring(string, start, end):
    len_until_end_of_first_match = string.find(start) + len(start)
    after_start = string[len_until_end_of_first_match:]
    return string[string.find(start) + len(start):len_until_end_of_first_match + after_start.find(end)]

Another way of doing it is using lists (supposing the substring you are looking for is made of numbers, only) :
string = 'gfgfdAAA1234ZZZuijjk'
numbersList = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
output = []
for char in string:
    if char in numbersList:
        output.append(char)
print(f"output: {''.join(output)}")
### output: 1234

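TypeScript. Gets the string in between two other strings.
Searches for the shortest string between the prefixes and postfixes.
prefixes - string / array of strings / null (null means search from the start).
postfixes - string / array of strings / null (null means search until the end).
It relies on a this.indexOf helper (not shown) that returns the match position (pos) and the matched string (sub).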
public getStringInBetween(str: string, prefixes: string | string[] | null,
                          postfixes: string | string[] | null): string {
    if (typeof prefixes === 'string') {
        prefixes = [prefixes];
    }
    if (typeof postfixes === 'string') {
        postfixes = [postfixes];
    }
    if (!str || str.length < 1) {
        throw new Error(str + ' should contain ' + prefixes);
    }
    let start = prefixes === null ? { pos: 0, sub: '' } : this.indexOf(str, prefixes);
    const end = postfixes === null ? { pos: str.length, sub: '' } : this.indexOf(str, postfixes, start.pos + start.sub.length);
    let value = str.substring(start.pos + start.sub.length, end.pos);
    if (!value || value.length < 1) {
        throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
    }
    while (true) {
        try {
            start = this.indexOf(value, prefixes);
        } catch (e) {
            break;
        }
        value = value.substring(start.pos + start.sub.length);
        if (!value || value.length < 1) {
            throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
        }
    }
    return value;
}

a simple approach could be the following:
string_to_search_in = 'could be anything'
start = string_to_search_in.find(str("sub string u want to identify"))
length = len("sub string u want to identify")
First_part_removed = string_to_search_in[start:]
end_coord = length
Extracted_substring=First_part_removed[:end_coord]

One liners that return other string if there was no match.
Edit: improved version uses next function, replace "not-found" with something else if needed:
import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )
My other method, less optimal, runs a regex a second time; I still haven't found a shorter way:
import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )

add space between arabic and english word by regular expression in python

I want to add a space between Arabic/Farsi and English words in my text.
It should be with regular expression in python.
for example:
input: "علیAli" output: "علی Ali"
input: "علیAliرضا" output: "علی Ali رضا"
input: "AliعلیRezaرضا" output: "Ali علی Reza رضا"
and anything similar.

You can do it using re.sub, as in the following (Python 3):
import re
rx = r'[a-zA-Z]+'
output = re.sub(rx, r' \g<0> ', input).strip()  # strip() drops the space added at either end
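
Padding every Latin run with spaces can double up spaces where the text is already separated. An alternative sketch inserts a space only at the zero-width boundary between an Arabic-script letter and a Latin letter. The character range U+0600 to U+06FF is my assumption (the basic Arabic block, which includes the Farsi letters in the examples); the names BOUNDARY and space_scripts are hypothetical.

```python
import re

# Zero-width boundary between an Arabic-script letter and a Latin letter,
# in either order. Assumption: U+0600-U+06FF covers the letters of interest.
BOUNDARY = re.compile(
    r'(?<=[\u0600-\u06FF])(?=[A-Za-z])|(?<=[A-Za-z])(?=[\u0600-\u06FF])'
)

def space_scripts(text):
    """Insert one space wherever Arabic script and Latin letters touch."""
    return BOUNDARY.sub(' ', text)

print(space_scripts("علیAliرضا"))
```

Because the pattern matches positions rather than characters, text that is already spaced is left unchanged.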

Instead of a regular expression, I think this can be done by comparing Unicode code points. I tried to code it but didn't know how to split on \r\n again to get the required output. This code might be useful to someone.
import codecs,string
def detect_language(character):
    maxchar = max(character)
    if u'\u0041' <= maxchar <= u'\u007a':
        return 'eng'
with codecs.open('letters.txt', encoding='utf-8') as f:
    eng_list = []
    eng_var =0
    arab_list = []
    arab_var=0
    input = f.read()
    for i in input:
        isEng = detect_language(i)
        if isEng == "eng":
            eng_list.append(i)
            eng_var = eng_var + 1
        elif '\n' in i or '\r' in i:
            eng_list.append(i)
            arab_list.append(i)
        else:
            arab_list.append(i)
            arab_var =arab_var +1
    temp = str(eng_list)
    temp1 = temp.encode('ascii','ignore')

How to copy spaces from one string to another in Python?

I need a way to copy all of the positions of the spaces of one string to another string that has no spaces.
For example:
string1 = "This is a piece of text"
string2 = "ESTDTDLATPNPZQEPIE"
output = "ESTD TD L ATPNP ZQ EPIE"

Insert characters as appropriate into a placeholder list and concatenate it after using str.join.
it = iter(string2)
output = ''.join(
[next(it) if not c.isspace() else ' ' for c in string1]
)
print(output)
ESTD TD L ATPNP ZQ EPIE
This is efficient as it avoids repeated string concatenation.
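
The iterator approach assumes string2 has exactly as many characters as string1 has non-space characters. If it has fewer, next(it) fails with StopIteration partway through; if it has more, the extra characters are silently dropped. A sketch that checks this up front (the function name copy_spaces is hypothetical):

```python
def copy_spaces(template, letters):
    """Re-insert the spaces of `template` into `letters`, which has none."""
    needed = sum(1 for c in template if not c.isspace())
    if needed != len(letters):
        raise ValueError(f"expected {needed} letters, got {len(letters)}")
    it = iter(letters)
    # Emit a space where the template has one, otherwise the next letter.
    return ''.join(' ' if c.isspace() else next(it) for c in template)

print(copy_spaces("This is a piece of text", "ESTDTDLATPNPZQEPIE"))
# ESTD TD L ATPNP ZQ EPIE
```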

You need to iterate over the indexes and characters in string1 using enumerate().
On each iteration, if the character is a space, add a space to the output string (note that this is inefficient as you are creating a new object as strings are immutable), otherwise add the character in string2 at that index to the output string.
So that code would look like:
output = ''
si = 0
for i, c in enumerate(string1):
    if c == ' ':
        si += 1
        output += ' '
    else:
        output += string2[i - si]
However, it would be more efficient to use a very similar method, but with a generator and then str.join. This removes the slow concatenations to the output string:
def chars(s1, s2):
    si = 0
    for i, c in enumerate(s1):
        if c == ' ':
            si += 1
            yield ' '
        else:
            yield s2[i - si]
output = ''.join(chars(string1, string2))

You can try the insert method:
string1 = "This is a piece of text"
string2 = "ESTDTDLATPNPZQEPIE"
string3 = list(string2)
for j, i in enumerate(string1):
    if i == ' ':
        string3.insert(j, ' ')
print("".join(string3))
Output:
ESTD TD L ATPNP ZQ EPIE

Cut string within a specific pattern in python

I have a string of some length consisting of only four characters: A, T, G and C. The pattern 'GAATTC' is present multiple times in the given string. I have to cut the string wherever this pattern occurs.
For example for a string, 'ATCGAATTCATA', I should get output of
string one - ATCGA
string two - ATTCATA
I am new to Python, but I have come up with the following (incomplete) code:
seq = seq.upper()
str1 = "GAATTC"
seqlen = len(seq)
seq = list(seq)
for i in range(0,seqlen-1):
    site = seq.find(str1)
    print(site[0:(i+2)])
Any help would be really appreciated.

First, let's develop your idea of using find, so you can see your mistakes.
seq = 'ATCGAATTCATAATCGAATTCATAATCGAATTCATA'
seq = seq.upper()
pattern = "GAATTC"
split_at = 2
seqlen = len(seq)
i = 0
while i < seqlen:
    site = seq.find(pattern, i)
    if site != -1:
        print(seq[i: site + split_at])
        i = site + split_at
    else:
        print(seq[i:])
        break
Python strings also have a powerful replace method that directly replaces fragments of a string. The snippet below uses replace to insert separators where needed:
seq = 'ATCGAATTCATAATCGAATTCATAATCGAATTCATA'
seq = seq.upper()
pattern = "GA","ATTC"
pattern1 = ''.join(pattern) # 'GAATTC'
pattern2 = ' '.join(pattern) # 'GA ATTC'
splited_seq = seq.replace(pattern1, pattern2) # 'ATCGA ATTCATAATCGA ATTCATAATCGA ATTCATA'
print(splited_seq.split())
I believe it is more intuitive and should be faster than a regular expression (which might have lower performance, depending on library and usage).

Here is a simple solution :
seq = 'ATCGAATTCATA'
seq_split = seq.upper().split('GAATTC')
result = [
    (seq_split[i] + 'GA') if i % 2 == 0 else ('ATTC' + seq_split[i])
    for i in range(len(seq_split)) if len(seq_split[i]) > 0
]
Result :
print(result)
['ATCGA', 'ATTCATA']
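
The comprehension above alternates on i % 2, which appears to work only when the sequence contains a single cut site: with several sites, every middle fragment needs 'ATTC' in front and 'GA' behind. A sketch that handles any number of sites (the function name cut and its parameters are hypothetical):

```python
def cut(seq, left='GA', right='ATTC'):
    """Cut seq inside every left+right site, keeping `left` on the left piece."""
    parts = seq.upper().split(left + right)
    last = len(parts) - 1
    return [
        # Every piece but the first starts with `right`;
        # every piece but the last ends with `left`.
        (right if i > 0 else '') + part + (left if i < last else '')
        for i, part in enumerate(parts)
    ]

print(cut('ATCGAATTCATA'))
print(cut('ATCGAATTCATAATCGAATTCATA'))
```

Joining the returned pieces gives back the original sequence, which is a quick way to check a result.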

BioPython has a restriction enzyme package to do exactly what you're asking.
from Bio.Restriction import *
from Bio.Seq import Seq
from Bio.Alphabet.IUPAC import IUPACAmbiguousDNA # Bio.Alphabet was removed in Biopython 1.78; on newer versions drop this import and use Seq(test)
print(EcoRI.site) # You will see that this is the enzyme you listed above
test = 'ATCGAATTCATA'.upper() # This is the sequence you want to search
my_seq = Seq(test, IUPACAmbiguousDNA()) # Create a biopython Seq object with our sequence
cut_sites = EcoRI.search(my_seq)
cut_sites contains a list of exactly where to cut the input sequence (such that GA is in the left sequence and ATTC is in the right sequence).
You can then split the sequence into contigs using:
cut_sites = [0] + cut_sites # We add a leading zero so this works for the first
# contig. This might not always be needed.
contigs = [test[i:j] for i,j in zip(cut_sites, cut_sites[1:]+[None])]
You can see this page for more details about BioPython.

My code is a bit sloppy, but you could try something like this when you want to iterate over multiple occurrences of the string:
def split_strings(seq):
    string1 = seq[:seq.find(str1) + 2]
    string2 = seq[seq.find(str1) + 2:]
    return string1, string2
test = 'ATCGAATTCATA'.upper()
str1 = 'GAATTC'
seq = test
while str1 in seq:
    string1, seq = split_strings(seq)
    print(string1)
print(seq)

Here's a solution using the regular expression module:
import re
seq = 'ATCGAATTCATA'
restriction_site = re.compile('GAATTC')
subseq_start = 0
for match in restriction_site.finditer(seq):
    print(seq[subseq_start:match.start()+2])
    subseq_start = match.start()+2
print(seq[subseq_start:])
Output:
ATCGA
ATTCATA
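
A more compact variant of the regex idea splits on the zero-width position inside the site. This is a sketch that assumes Python 3.7 or later, where re.split accepts patterns that match an empty string.

```python
import re

seq = 'ATCGAATTCATAATCGAATTCATA'

# Split at the position preceded by 'GA' and followed by 'ATTC',
# i.e. inside every 'GAATTC' site, without consuming any characters.
pieces = re.split(r'(?<=GA)(?=ATTC)', seq)
print(pieces)
```

The lookbehind and lookahead consume nothing, so the pieces still contain the full recognition site, split between them.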
