I am required to use the regex module.
I have coded this little program to replace certain regex matches, such as orange, with # signs of the same length. For example, if orange is in the string, it will be replaced with ######.
If a string has been changed, " !! This string has been changed !!" is added to the end of the string.
If a string has not been changed but has a # in it, " !! This string has been changed !!" is not added.
Is there a more efficient way of coding this, using regex functions and better Python code?
orange = re.compile(r'\borange\b', re.IGNORECASE)
frog = re.compile(r'\bfrog\b', re.IGNORECASE)
cat = re.compile(r'\bcat\b', re.IGNORECASE)
num = 0
if re.search(orange, s):
    s = re.sub(orange, "!!!!!!", s)
    num += 1
if re.search(frog, s):
    s = re.sub(frog, "!!!!", s)
    num += 1
if re.search(cat, s):
    s = re.sub(cat, "!!!", s)
    num += 1
if num > 0:
    return s + " !! This string has been changed !!"
else:
    return s
Assuming your input line can contain 'orange', 'frog' and 'cat' simultaneously, one possible solution is to build a single regex pattern that matches any of the words, iterate over the matches, and replace each match with as many 'x' characters as the matched word is long. The string is then printed, modified or not, as the case may be.
Code is:
import re
string = "orange frog cat test"
#string = "one two tree testing stackoverflow"
regex_pattern = re.compile(r"\b(orange|frog|cat)\b", re.IGNORECASE)
total_matches = regex_pattern.finditer(string)
# We find either of the options? then changes will be made
changes_done = regex_pattern.search(string)
for match in total_matches:
    element_find = match.group(0)
    string = regex_pattern.sub("x" * len(element_find), string, 1)
if changes_done:
    print(string + " | changes were made")
else:
    print(string + " | no changes made")
What really shines in this particular solution is the third parameter of sub (count), which limits how many matches are replaced. As I said, this is one particular solution to your problem.
The output generated for the replacement is: xxxxxx xxxx xxx test | changes were made
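Building on the single combined pattern, a possible variation does all replacements in one pass: re.subn accepts a function as the replacement and also returns the number of substitutions made, which can stand in for a separate search or counter. This is only a sketch; the helper name mask_words is hypothetical, and '#' is used as the fill character as described in the question.

```python
import re

# One pattern for all words; (?:...) groups without capturing.
WORDS = re.compile(r"\b(?:orange|frog|cat)\b", re.IGNORECASE)


def mask_words(s):
    """Hypothetical helper: mask each matched word with '#' of equal length."""
    # subn returns (new_string, number_of_substitutions).
    masked, count = WORDS.subn(lambda m: "#" * len(m.group()), s)
    if count:
        return masked + " !! This string has been changed !!"
    return masked


print(mask_words("An Orange eaten by a frog and a cat"))
print(mask_words("nothing # to see"))
```

Because the suffix depends on the substitution count rather than on the presence of '#', a string that already contains '#' but no matching word is returned unchanged.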
I guess you're using this code inside a function, since you're returning some values.
Anyway, without the num counter:
import re
pattern = r"\b(orange|frog|cat)\b"
s = "an orange eaten by a frog and a cat"
rgx_matches = re.findall(pattern, s, flags=re.IGNORECASE)
for rgx_match in rgx_matches:
    # replace each found word in s itself, so that the changes accumulate
    s = re.sub(r"\b%s\b" % re.escape(rgx_match), "#" * len(rgx_match), s)
if rgx_matches:
    print(s + " !! This string has been changed !!")
I have a long string that is a phylogenetic tree and I want to do a very specific filtering.
(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;
Basically, every x#y is species#gene_id information. What I am trying to do is trim this down so that I only have x instead of x#y.
(Esy, Aar,(Spa,Cpl))...
I tried splitting the string first, but the problem is that the string has different 'split points' for what I want to achieve, i.e. some x#y parts end with a , and others with a ). I searched for a solution and came across regular expressions, but I am new to Python and couldn't be sure whether that is what I should be focusing on. I also thought about strip(), but it seems I would need to specify the characters to be stripped.
The main problem is that there is no 'pattern' for me to tell Python to follow. The only regularity is that all species IDs are 3 letters long and come right before a # character.
Is there a method that can do what I want? I will be really glad if you can help me out with my problem. Thanks in advance.
Give this a try:
import re
pat = re.compile(r'(\w{3})#')
txt = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
pat.findall(txt)
Result:
['Esy', 'Aar', 'Spa', 'Cpl', 'Bst', 'Aly', 'Ath', 'Chi', 'Cru', 'Hco', 'Hlo', 'Hla', 'Hse', 'Esa', 'Aal']
If you need the structure intact, we can try to remove the unnecessary parts instead:
pat = re.compile(r'(#|:)[^/),]*')
pat.sub('', txt).replace(',', ', ')
Result:
'(Esy, Aar, ((Spa, Cpl), (((Bst, ((Aly, Ath), (Chi, Cru))), (((Hco, Hlo), Hla), Hse)), (Esa, Aal))))'
How about this kind of function:
def parse_string(string):
    new_string = ''
    skip = False
    for char in string:
        if char == '#':
            skip = True
        if char == ',':
            skip = False
        if not skip or char in ['(', ')']:
            new_string += char
    return new_string
Calling it on your string:
string = '(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'
parse_string(string)
> '(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))))'
You can use a regex:
import re
s = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
p = r"...?(?=#)|\(|\)"
result = re.findall(p, s)
You now have your result as a list, so you can join it into a string or do anything else with it.
To explain what is happening:
p is the regular expression pattern.
In this pattern:
. matches any single character
...?(?=#) matches two or three characters that are immediately followed by #; (?=#) is a lookahead, so the # itself is not part of the match. Because the match is greedy, this gives you the three characters before each #
| is the or operator; I used it here to add further alternatives
\( and \) match the literal ( and ) characters
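To see the lookahead in isolation, here is a small sketch on a shortened, made-up sample in the same species#gene_id:length format. It uses \w{3} instead of ...? as an assumed variation, so that only word characters can be part of the species code.

```python
import re

# Shortened, hypothetical sample; not the full tree from the question.
sample = "(Esy#ESY15_c0:0.07,(Spa#Tp2g18720:0.03,Cpl#CP2_c:0.02):9.0e-05);"

# \w{3}(?=#) matches three word characters only when a '#' follows.
# The lookahead checks for the '#' without consuming it.
species = re.findall(r"\w{3}(?=#)", sample)
print(species)
```

The brackets and commas are not captured here; add alternatives such as \( and \) to the pattern if they are needed in the result.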
Try this regex if you need the brackets in the output:
import re
# remove "#gene_id", ":branch_length" and the trailing ";"
regex = r"#[A-Za-z0-9_.]+|:[0-9.e+-]+|;"
phylogenetic_tree = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
print(re.sub(regex,"",phylogenetic_tree))
Output:
(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))))
Because you are trying to parse a phylogenetic tree, I highly suggest letting BioPython do the heavy lifting for you.
You can easily parse and display a phylogenetic tree with Bio.Phylo. Then it is just a matter of iterating over all tree elements and splitting the names at the '#' sign.
Because Phylo expects the input to be in a file, we create an in-memory file-like object with io.StringIO. Getting the complete tree is then as easy as
Phylo.read(io.StringIO(s), 'newick')
In order to check if the parsed tree looks sane, I print it once with print(tree).
Now we want to change all node names that contain a '#'. With tree.find_elements we get access to all nodes. Some nodes don't have a name and some might not contain a '#'. So to be extra careful, we first check if n.name and '#' in n.name. Only then do we split each node's name at the '#' and take just the first part (index 0) of it:
n.name = n.name.split('#')[0]
In order to recreate the initial string representation, we use Phylo.write:
out = io.StringIO()
Phylo.write(tree, out, "newick")
print(out.getvalue())
Again, write wants to get a file argument - if we just want to get a string, we can use a StringIO object again.
Full code:
import io
from Bio import Phylo
if __name__ == '__main__':
s = '(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'
tree = Phylo.read(io.StringIO(s), 'newick')
print(' before '.center(20, '='))
print(tree)
for n in tree.find_elements():
if n.name and '#' in n.name:
n.name = n.name.split('#')[0]
print(' result '.center(20, '='))
out = io.StringIO()
Phylo.write(tree, out, "newick")
print(out.getvalue())
Output:
====== before ======
Tree(rooted=False, weight=1.0)
Clade(branch_length=0.0129090235079)
Clade(branch_length=0.0726396855636, name='Esy#ESY15_g64743_DN3_SP7_c0')
Clade(branch_length=0.137507902808, name='Aar#AA_maker7399_1')
Clade(branch_length=0.0129090235079)
Clade(branch_length=9.05326020871e-05)
Clade(branch_length=0.0318934795022, name='Spa#Tp2g18720')
Clade(branch_length=0.0273465005242, name='Cpl#CP2_g48793_DN3_SP8_c')
Clade(branch_length=0.00328120860999)
Clade(branch_length=0.00859075940423)
Clade(branch_length=0.0340484449097)
Clade(branch_length=0.0332592496158, name='Bst#Bostr_13083s0053_1')
Clade(branch_length=0.0150356382287)
Clade(branch_length=0.0205924636564)
Clade(branch_length=0.0328569260951, name='Aly#AL8G21130_t1')
Clade(branch_length=0.0391706378372, name='Ath#AT5G48370_1')
Clade(branch_length=0.00998579652059)
Clade(branch_length=0.0954469923893, name='Chi#CARHR183840_1')
Clade(branch_length=0.0570981548016, name='Cru#Carubv10026342m')
Clade(branch_length=0.0372829371381)
Clade(branch_length=0.0206478928557)
Clade(branch_length=0.0144626717872)
Clade(branch_length=0.00823215335663, name='Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100')
Clade(branch_length=0.0085462978729, name='Hlo#DN13684_c0_g1_i1_p1')
Clade(branch_length=0.0225079453622, name='Hla#DN22821_c0_g1_i1_p1')
Clade(branch_length=0.048590776459, name='Hse#DN23412_c0_g1_i3_p1')
Clade(branch_length=1.00000050003e-06)
Clade(branch_length=0.0378509854703, name='Esa#Thhalv10004228m')
Clade(branch_length=0.0712272454125, name='Aal#Aa_G102140_t1')
====== result ======
(Esy:0.07264,Aar:0.13751,((Spa:0.03189,Cpl:0.02735):0.00009,(((Bst:0.03326,((Aly:0.03286,Ath:0.03917):0.02059,(Chi:0.09545,Cru:0.05710):0.00999):0.01504):0.03405,(((Hco:0.00823,Hlo:0.00855):0.01446,Hla:0.02251):0.02065,Hse:0.04859):0.03728):0.00859,(Esa:0.03785,Aal:0.07123):0.00000):0.00328):0.01291):0.01291;
The default format of Phylo uses fewer digits than your original tree. In order to keep the numbers unchanged, just override the branch length format string with '%s':
Phylo.write(tree, out, "newick", format_branch_length="%s")
Parsing code can be hard to follow. Tatsu lets you write readable parsing code by combining grammars and python:
text = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
import sys
import tatsu
grammar = r"""
start = things ';'
;
things = thing [ ',' things ]
;
thing = x '#' y ':' number
| '(' things ')' ':' number
;
x = /\w+/
;
y = /\w+/
;
number = /[+-]?\d+\.?\d*(e?[+-]?\d*)/
;
"""
class Semantics:
def x(self, ast):
# the method name matches the rule name
print('X =', ast)
parser = tatsu.compile(grammar, semantics=Semantics())
parser.parse(text)
My question is quite simple.
I'm trying to come up with a regex to select any set of words or statement in between two characters.
For example, if the strings are something like this:
') as whatever '
and it can also look like
') as whatever\r\n'
So I need to extract 'whatever' from this string.
The regex I came up with is this:
\)\sas\s(.*?)\s
It works fine and extracts 'whatever', but it will only work for the first example, not the second. What should I do in the case of the second string?
I'm basically looking for an OR condition kind of thing!
Any help would be appreciated
Thanks in advance
The question is not very clear, but the regular expression you are looking for might be something like this:
\)\sas\s(.*?)[ \r\n]
It says that the string you are interested in is followed by a space or one of the other listed characters.
EDIT
As an example, take the following code in Python 2. A character class in square brackets already means "any one of these characters", so the alternatives are simply listed without '|' (inside brackets, '|' and spaces are matched literally). Below, the captured text ends at whitespace, a '.' or a 'd'.
import re
a = ') as whatever '
b = ') as whatever\r\n'
c = ') as whatever.'
d = ') as whateverd'
a_res = re.findall(r'\)\sas\s(.*?)[\s]', a)[0] #ending with any whitespace (space, \r or new line char)
b_res = re.findall(r'\)\sas\s(.*?)[\s]', b)[0]
c_res = re.findall(r'\)\sas\s(.*?)[\s.]', c)[0] #ending with whitespace or .
d_res = re.findall(r'\)\sas\s(.*?)[\s.d]', d)[0] #ending with whitespace, . or d
print(a_res, len(a_res))
print(b_res, len(b_res))
print(c_res, len(c_res))
print(d_res, len(d_res))
Your pattern already works as you intended, because \s also matches \r and \n. Please check it:
import re
a =') as whatever '
b=') as whatever\r\n'
print re.findall(r'\)\sas\s(.*?)\s', a)[0]
print re.findall(r'\)\sas\s(.*?)\s', b)[0]
This will output:
whatever
whatever
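The same check can be written for Python 3. The sketch below runs the question's pattern on both sample strings, and then shows one assumed variation, (\S+), which captures the run of non-whitespace characters and therefore also works when nothing at all follows the word.

```python
import re

# The pattern from the question; \s matches space, \r and \n alike.
pattern = re.compile(r"\)\sas\s(.*?)\s")

samples = [") as whatever ", ") as whatever\r\n"]
results = [pattern.search(s).group(1) for s in samples]
print(results)

# Assumed variation: capture non-whitespace characters directly,
# so no trailing character is required after the word.
alt = re.search(r"\)\sas\s(\S+)", ") as whatever").group(1)
print(alt)
```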
I'm working on code to take a user input business name and print out reviews for it. When I run my final loop, I tell python to right justify the reviews by four spaces, yet nothing happens. I've tried multiple solutions and am honestly at a loss.
(Problem area is the very last line)
import json
import textwrap
import sys
f = open('reviews.json')
f1= open('businesses.json')
line1= f1.readline()
business= json.loads(line1)
line = f.readline()
review = json.loads(line)
idlist=[]
reviewlist=[]
bizname= raw_input('Enter a business name => ')
print bizname
for line in f1:
    business= json.loads(line)
    if bizname == business['name']:
        idlist.append(business['business_id'])
if len(idlist)==0:
    print 'This business is not found'
    sys.exit()
for line in f:
    review = json.loads(line)
    for item in idlist:
        if item == review['business_id']:
            reviewlist.append(review['text'])
if len(reviewlist)==0:
    print 'No reviews for this business are found'
    sys.exit()
for i in range(len(reviewlist)):
    w = textwrap.TextWrapper(replace_whitespace=False)
    print 'Review',str(i+1)+':'
    print w.fill(reviewlist[i] , ).rjust(4,' ')
I suggest you check the output of w.fill(reviewlist[i]).
Its length is probably more than 4, so rjust appears not to work, e.g. 'abcdef'.rjust(4, ' '):
>>> 'abcdef'.rjust(4, ' ')
'abcdef'
>>> 'abcdef'.rjust(20, ' ')
'              abcdef'
https://docs.python.org/2/library/string.html#string.rjust
"Right justify by 4 spaces" doesn't make sense, so it's unclear what you really want. The first argument to .rjust() is the total width of the field, and if the string is already at least that long, nothing at all is done. Some examples:
>>> "abcde".rjust(4, " ") # nothing done: 5 > 4
'abcde'
>>> "abcd".rjust(4, " ") # nothing done: 4 == 4
'abcd'
>>> "abc".rjust(4, " ") # extends to 4 with 1 blank on left
' abc'
>>> "ab".rjust(4, " ") # extends to 4 with 2 blanks on left
'  ab'
>>> "a".rjust(4, " ") # extends to 4 with 3 blanks on left
'   a'
>>> "".rjust(4, " ") # extends to 4 with 4 blanks
'    '
Assuming that you actually want to indent the text, you can do it with the TextWrapper object:
indent = ' ' * 4
w = textwrap.TextWrapper(replace_whitespace=False, initial_indent=indent, subsequent_indent=indent)
Demo
>>> indent = ' ' * 4
>>> w = textwrap.TextWrapper(width=20, replace_whitespace=False, initial_indent=indent, subsequent_indent=indent)
>>> print(w.fill('A longish paragraph to demonstrate indentation with TextWrapper objects.'))
    A longish
    paragraph to
    demonstrate
    indentation with
    TextWrapper
    objects.
Note that the indent is included in the line width, so you might want to adjust the width accordingly:
>>> w = textwrap.TextWrapper(width=20+len(indent), replace_whitespace=False, initial_indent=indent, subsequent_indent=indent)
>>> print(w.fill('A longish paragraph to demonstrate indentation with TextWrapper objects.'))
    A longish paragraph
    to demonstrate
    indentation with
    TextWrapper objects.
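If you are on Python 3.3 or later (the question uses Python 2, so this is an assumption about your environment), textwrap.indent offers another way to get the same effect: wrap to the text width you want first, then prefix every line. A minimal sketch:

```python
import textwrap

text = 'A longish paragraph to demonstrate indentation with TextWrapper objects.'

# Wrap to the desired text width first; the indent is not counted here.
wrapped = textwrap.fill(text, width=20)

# Then prefix every resulting line with four spaces.
indented = textwrap.indent(wrapped, ' ' * 4)
print(indented)
```

With this approach there is no need to add the indent length to the width by hand, since the prefix is applied after wrapping.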
Most likely it doesn't work because fill() returns a single string that is much longer than 4 characters. Example:
'hello'.rjust(3, '*')
output:
'hello'
While, if you do:
'hello'.rjust(10, '*')
Output:
'*****hello'
So, if I understand what you are trying to do, you may need to split the wrapped string and then apply the right justification to each string in the list, while you print it:
wrapped = w.fill(reviewlist[i], )
for line in wrapped.split('\n'):
    print line.rjust(4, ' ')
Although I am not sure that justifying on a width of only four characters is really what you need.
There are a couple of problems you're facing here:
.rjust(4,' ') says you want the result to be 4 characters wide, not that you want to indent the line by 4 spaces.
.rjust() just looks at the length of the string, and after you've run it through textwrap it has a bunch of newlines in it that make the length of the string different than the width it prints out to.
You don't want to right justify, really, you want to indent.
The solution given above about indents is correct.
Formatting text through fixed space fonts is very old school, but also very fragile. Perhaps you could think about generating HTML output in a subsequent revision of your application. HTML tables work well for this and are appropriate for tabular data. Alternatively, consider doing a CSV file, and then you can import the result into Excel.
import textwrap
import re
# A sample input string.
inputStr = 'This is a long string which I want to right justify by 70 chars with four spaces on left'
# Wrap by 70 (default) chars. This would result in a multi-line string
w = textwrap.fill(inputStr, )
# Using RegEx read the lines and right justify for 75 chars.
m = re.sub(r"^(\w+.*)$", lambda g: g.group(0).rjust(75), w, flags=re.MULTILINE)
# Print the result
print(m)
I'm a newbie to regular expressions in Python.
I have a list of strings, and I want to find the ones that contain an employee name.
The employee name can be:
at the beginning, followed by a space
OR followed by ®
OR followed by a space
OR at the end, with a space before it
The match is not case sensitive.
ListSentence = ["Steve®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel","Done daniel"]
ListEmployee = ["Steve", "Rob", "daniel"]
The expected output from ListSentence is:
["Steve®", "Rob spring", "Car Daniel", "Done daniel"]
First take all your employee names and join them with a | character and wrap the string so it looks like:
(?:^|\s)((?:Steve|Rob|Daniel)(?:®)?)(?=\s|$)
By first joining all the names together you avoid the performance overhead of using nested for loops.
I don't know Python well enough to offer a Python example, however in PowerShell I'd write it like this:
[array]$names = @("Steve", "Rob", "daniel")
[array]$ListSentence = @("Steve®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel","Done daniel")
# build the regex, and insert the names as a "|" delimited string
$Regex = "(?:^|\s)((?:" + $($names -join "|") + ")(?:®)?)(?=\s|$)"
# use case insensitive match to find any matching array values
$ListSentence -imatch $Regex
Yields
Steve®
Rob spring
Car Daniel
Done daniel
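For reference, the same idea might look like this in Python 3. This is a sketch rather than a tested translation of the PowerShell above; the variable names are my own, and re.escape is added as a precaution in case a name contains regex metacharacters.

```python
import re

list_sentence = ["Steve®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel", "Done daniel"]
names = ["Steve", "Rob", "daniel"]

# Build the regex, inserting the names as a "|"-delimited alternation.
regex = re.compile(
    r"(?:^|\s)((?:" + "|".join(re.escape(n) for n in names) + r")®?)(?=\s|$)",
    re.IGNORECASE,
)

# Keep every sentence in which the pattern matches somewhere.
matches = [s for s in list_sentence if regex.search(s)]
print(matches)
```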
Why do you want to use regular expressions? I'd generally recommend avoiding them in Python - you can use string methods instead.
For example:
def string_has_employee_name_in_it(test_string):
    test_string = test_string.lower() # case insensitive
    for name in ListEmployee:
        name = name.lower()
        if name == test_string:
            return True
        elif name + '®' == test_string:
            return True
        elif test_string.endswith(' ' + name):
            return True
        elif test_string.startswith(name + ' '):
            return True
        elif (' ' + name + ' ') in test_string:
            return True
    return False
final_list = []
for string in ListSentence:
    if string_has_employee_name_in_it(string):
        final_list.append(string)
final_list is the list you want. This is longer than a regex, but it's also a lot easier to parse and maintain. You can make it a lot shorter in various ways (e.g. combining the tests in the function, and using a list comprehension instead of a loop), but as you're starting out with Python it's a good idea to be clear with what's going on.
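One way the shorter version could look is sketched below, assuming Python 3. Splitting on single spaces covers the "whole string", "at the start", "at the end" and "in the middle" tests in one membership check, and the ® case is kept as a separate comparison. The function and variable names are hypothetical.

```python
list_sentence = ["Steve®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel", "Done daniel"]
list_employee = ["Steve", "Rob", "daniel"]


def has_employee_name(text, employees):
    """Hypothetical condensed version of the checks in the function above."""
    text = text.lower()  # case insensitive
    tokens = text.split(' ')
    return any(
        name.lower() in tokens or text == name.lower() + '®'
        for name in employees
    )


final_list = [s for s in list_sentence if has_employee_name(s, list_employee)]
print(final_list)
```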
I don't think you need to check for all of those scenarios. I think all you need to do is check for word breaks.
You can join the ListEmployee list with | to make an either or regex (also, lowercase it for case-insensitivity), surrounded by \b for word breaks, and that should work:
regex = '|'.join(ListEmployee).lower()
import re
[l for l in ListSentence if re.search(r'\b(%s)\b' % regex, l.lower())]
Should output:
['Steve®', 'Rob spring', 'Car Daniel', 'Done daniel']
If you're just looking for strings containing a space, as your example indicates, it should be something like this:
[i for i in ListSentence if i.endswith('®') or (' ' in i)]
A possible solution:
import re
ListSentence = ["Steve®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel","Done daniel"]
ListEmployee = ["Steve", "Rob", "daniel"]
def findEmployees(employees, sentence):
    retval = []
    for employee in employees:
        expr = re.compile(r'(^%(employee)s(®)?(\s|$))|((^|\s)%(employee)s(®)?(\s|$))|((^|\s)%(employee)s(®)?$)'
                          % {'employee': employee},
                          re.IGNORECASE)
        for part in sentence:
            if expr.search(part):
                retval.append(part)
    return retval
findEmployees(ListEmployee, ListSentence)
>> Returns ['Steve®', 'Rob spring', 'Car Daniel', 'Done daniel']