Replace a string that contains a dynamic character in Python

I am trying to replace part of a string with a regular expression, without success.
The strings are "LIVE_CUS2_PHLR182", "LIVE_CUS2ee_PHLR182" and "PHLR182 - testing recovery". I need to get PHLR182 as the output for every string, but in the second string the "ee" is not constant: it can be any two letters or digits. Below is the code I have tried.
For the first and last strings I simply used the replace function, like below.
s = "LIVE_CUS2_PHLR182"
s.replace("LIVE_CUS2_", ""), s.replace(" - testing recovery","")
>>> PHLR182
But for the second string I tried the following.
1. s= "LIVE_CUS2ee_PHLR182"
s.replace(r'LIVE_CUS2(\w+)*_','')
2. batRegex = re.compile(r'LIVE_CUS2(\w+)*_PHLR182')
mo2 = batRegex.search('LIVE_CUS2dd_PHLR182')
mo2.group()
3. re.sub(r'LIVE_CUS2(?is)/s+_PHLR182', '', r)
In all cases I could not get "PHLR182" as the output. Please help me.

I think this is what you need:
import re
texts = """LIVE_CUS2_PHLR182
LIVE_CUS2ee_PHLR182
PHLR182 - testing recovery""".split('\n')
pat = re.compile(r'(LIVE_CUS2\w{,2}_| - testing recovery)')
# The pattern has two alternatives separated by |:
# look for 'LIVE_CUS2' followed by up to two word characters and '_'
# ... or look for ' - testing recovery'
results = [pat.sub('', text) for text in texts]
# replace the matched pattern with empty string
print(f'Original: {texts}')
print(f'Results: {results}')
Result:
Original: ['LIVE_CUS2_PHLR182', 'LIVE_CUS2ee_PHLR182', 'PHLR182 - testing recovery']
Results: ['PHLR182', 'PHLR182', 'PHLR182']
Python Demo: https://repl.it/repls/ViolentThirdAutomaticvectorization
Regex Demo: https://regex101.com/r/JiEVqn/2
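As a hedged variation on the answer above, the sketch below anchors the prefix to the start of the string and the suffix to the end, and limits the variable part to at most two letters or digits. The helper name strip_decorations is hypothetical, and the anchoring is an assumption about the data rather than something the question states. It also shows why the first attempt in the question changes nothing: str.replace treats its argument as literal text, not as a pattern.

```python
import re

# Assumption: the prefix always starts the string and the suffix always ends it.
DECORATION = re.compile(r'^LIVE_CUS2[A-Za-z0-9]{0,2}_| - testing recovery$')


def strip_decorations(text):
    """Remove the LIVE_CUS2.._ prefix or the ' - testing recovery' suffix."""
    return DECORATION.sub('', text)


samples = ['LIVE_CUS2_PHLR182', 'LIVE_CUS2ee_PHLR182', 'PHLR182 - testing recovery']
cleaned = [strip_decorations(s) for s in samples]
print(cleaned)

# str.replace does no regex matching, so the pattern is searched for literally
# and the string comes back unchanged.
unchanged = 'LIVE_CUS2ee_PHLR182'.replace(r'LIVE_CUS2(\w+)*_', '')
print(unchanged)
```

Unlike \w, the class [A-Za-z0-9] does not match an underscore, which keeps the variable part from swallowing the separator.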

Related

How to start at a specific letter and end when it hits a digit?

I have some sample strings:
s = 'neg(able-23, never-21) s2-1/3'
i = 'amod(Market-8, magical-5) s1'
I've got the problem where I can figure out if the string has 's1' or 's3' using:
word = re.search(r's\d$', s)
But if I want to know whether the string contains 's2-1/3', it won't work.
Is there a regular expression that works for both cases: 's#' on its own and 's#' followed by extra characters such as '-1/3'?
Thanks!
You can allow the characters "-" and "/" to be captured as well, in addition to just digits. It's hard to tell the exact pattern you're going for here, but something like this would capture "s2-1/3" from your example:
import re
s = "neg(able-23, never-21) s2-1/3"
word = re.search(r"s\d[-/\d]*$", s)
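A minimal sketch that runs this pattern against both sample strings from the question, to check that it still matches the plain 's1' form. The variable names are illustrative only.

```python
import re

# s, one digit, then any run of digits, '-' or '/' up to the end of the string.
pattern = re.compile(r"s\d[-/\d]*$")

samples = [
    "neg(able-23, never-21) s2-1/3",
    "amod(Market-8, magical-5) s1",
]

tags = []
for text in samples:
    match = pattern.search(text)
    # search() returns None when nothing matches, so guard before .group().
    tags.append(match.group() if match else None)

print(tags)
```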
I'm guessing that maybe you would want to extract that with some expression, such as:
(s\d+)-?(.*)$
or:
(s\d+)-?([0-9]+)?\/?([0-9]+)?$
Test
import re
expression = r"(s\d+)-?(.*)$"
string = """
neg(able-23, never-21) s211-12/31
neg(able-23, never-21) s2-1/3
amod(Market-8, magical-5) s1
"""
print(re.findall(expression, string, re.M))
Output
[('s211', '12/31'), ('s2', '1/3'), ('s1', '')]

Get all occurrences of a regex in a string in Python

From the string TreeModel/Node/Node[1]/Node[4]/Node[1] I am trying to extract the following prefixes:
TreeModel/Node
TreeModel/Node/Node[1]
TreeModel/Node/Node[1]/Node[4]
TreeModel/Node/Node[1]/Node[4]/Node[1]
I want to do this with a regular expression in Python. Here is the code I tried:
import re
string = 'TreeModel/Node/Node[1]/Node[4]/Node[1]'
pattern = r'.+?Node\[[1-9]\]'
print(re.findall(pattern=pattern, string=string))
# result: ['TreeModel/Node/Node[1]', '/Node[4]', '/Node[1]']
# expected result: ['TreeModel/Node', 'TreeModel/Node/Node[1]', 'TreeModel/Node/Node[1]/Node[4]', 'TreeModel/Node/Node[1]/Node[4]/Node[1]']
You can use split here:
>>> s = 'TreeModel/Node/Node[1]/Node[4]/Node[1]'
>>> split_s = s.split('/')
>>> ['/'.join(split_s[:i]) for i in range(2, len(split_s)+1)]
['TreeModel/Node',
'TreeModel/Node/Node[1]',
'TreeModel/Node/Node[1]/Node[4]',
'TreeModel/Node/Node[1]/Node[4]/Node[1]']
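The same prefixes can also be built incrementally instead of re-joining a slice for every length. This is only a sketch of an alternative using itertools.accumulate from the standard library; it assumes, as the expected output in the question does, that the bare first segment 'TreeModel' should be skipped.

```python
from itertools import accumulate

s = 'TreeModel/Node/Node[1]/Node[4]/Node[1]'
parts = s.split('/')

# accumulate() yields 'TreeModel', 'TreeModel/Node', ... by appending one
# segment at a time; [1:] drops the bare first segment.
prefixes = list(accumulate(parts, lambda acc, part: acc + '/' + part))[1:]

for prefix in prefixes:
    print(prefix)
```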
You can also use regex:
import re
for i in range(2, s.count('/') + 2):
    s_ = '[^/]+/*'
    regex = re.search(r'(' + s_ * i + ')', s).group(0)
    print(regex)
TreeModel/Node/
TreeModel/Node/Node[1]/
TreeModel/Node/Node[1]/Node[4]/
TreeModel/Node/Node[1]/Node[4]/Node[1]
I'm not good at Python at all, but for the regex part, given the specific structure of your string, the regex below matches each segment:
/?(?:{[^{}]*})?[^/]+
The leading / and the braces are both optional. It matches a slash (if any), then braces with their content (if any), then the rest up to the next slash.
Python code:
import re
string = 'TreeModel/Node/Node[1]/Node[4]/Node[1]'
matches = re.findall(r'/?(?:{[^{}]*})?[^/]+', string)
output = ''
for segment in matches:
    output += segment  # grow the prefix by one segment
    print(output)
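A hedged variant of the segment-matching idea: rather than concatenating the matched segments, slice the original string up to the end of each match. The pattern is simplified to /?[^/]+ on the assumption that the input contains no braces, and the first prefix is dropped to match the expected result in the question.

```python
import re

string = 'TreeModel/Node/Node[1]/Node[4]/Node[1]'

# Each match is one path segment with its leading slash (if any);
# m.end() is the position where the prefix ending at that segment stops.
prefixes = [string[:m.end()] for m in re.finditer(r'/?[^/]+', string)]

# Drop the bare 'TreeModel' prefix, which the question does not ask for.
prefixes = prefixes[1:]

for prefix in prefixes:
    print(prefix)
```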

Splitting a string using re module of python

I have a string
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
#I have to capture only the field 'count_EVENT_GENRE'
field = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()
#o/p is 'cou'
# for s = 'sum_EVENT_GENRE in [1,2,3,4,5]' o/p = 'sum_EVENT_GENRE'
which is fine.
My doubt: for any character in (in)(like), it splits the string s at that character and gives me the first slice (after "cou" it finds a matching character, i.e. n). This happens for any string that contains any character from (in)(like).
Example: 'percentage_AMOUNT' gives 'p',
as it finds a matching character 'e' after p.
So I would like some advice on how to treat (in)(like) as words, not as individual characters, when splitting.
Please suggest a syntax.
Answering your question, the [(==)(>=)(<=)(in)(like)] is a character class matching single characters you defined inside the class. To match sequences of characters, you need to remove [ and ] and use alternation:
r'==?|>=?|<=?|\b(?:in|like)\b'
or better:
r'[=><]=?|\b(?:in|like)\b'
Your code would look like:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]', 'coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    field = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()
    print(field)
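To make the difference between the character class and the alternation concrete, the sketch below runs both patterns side by side. The second sample string ('percentage_AMOUNT >= 10') is a made-up input for illustration; only its field name comes from the question.

```python
import re

samples = ['count_EVENT_GENRE in [1,2,3,4,5]', 'percentage_AMOUNT >= 10']

# Inside [...] every character is its own alternative, so this splits on
# any single one of: ( ) = > < i n l k e
char_class = re.compile(r'[(==)(>=)(<=)(in)(like)]')

# Without the brackets, | separates whole operators, and \b keeps "in" and
# "like" from matching inside longer words such as "count".
alternation = re.compile(r'[=><]=?|\b(?:in|like)\b')

broken = [char_class.split(s)[0].strip() for s in samples]
fixed = [alternation.split(s)[0].strip() for s in samples]

print(broken)
print(fixed)
```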
However, there might be other (easier, or safer - depending on the actual specifications) ways to get what you want (splitting with space and getting the first item, use re.match with r'\w+' or r'[a-z]+(?:_[A-Z]+)+', etc.)
If your value is at the start of the string, starts with lowercase ASCII letters, and can then have any number of sequences of _ followed by uppercase ASCII letters, use:
re.match(r'[a-z]+(?:_[A-Z]+)*', s)
Full demo code:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    fieldObj = re.match(r'[a-z]+(?:_[A-Z]+)*', s)
    if fieldObj:
        print(fieldObj.group())
If you want only the first word of your string, then this should do the job:
import re
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
field = re.split(r'\W', s)[0]
# count_EVENT_GENRE
Is there anything wrong with using split?
>>> s = 'count_EVENT_GENRE in [1,2,3,4,5]'
>>> s.split(' ')[0]
'count_EVENT_GENRE'
>>> s = 'coint_EVENT_GENRE = "ROMANCE"'
>>> s.split(' ')[0]
'coint_EVENT_GENRE'

Regex match everything between special tag

I have the following string that I need to parse and get the values of anything inside the defined \$ tags
for example, the string
The following math equation: \$f(x) = x^2\$ is the same as \$g(x) = x^(4/2) \$
I want to parse whatever is in between the \$ tags, so that the result will contain both equations
'f(x) = x^2'
'g(x) = x^(4/2) '
I tried something like re.compile(r'\\\$(.)*\\$') but it didn't work.
You almost got it; you were just missing a backslash and a question mark (so it stops as soon as it finds the second \$ and doesn't match the longest string possible): r'\\\$(.*?)\\\$'
>>> import re
>>> pattern = r'\\\$(.*?)\\\$'
>>> data = r"The following math equation: \$f(x) = x^2\$ is the same as \$g(x) = x^(4/2) \$"
>>> re.findall(pattern, data)
['f(x) = x^2', 'g(x) = x^(4/2) ']
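The effect of the question mark is easiest to see by running the greedy and the lazy pattern on the same input. A small sketch, using a raw string for the data so that \$ is kept as a literal backslash followed by a dollar sign:

```python
import re

data = r"The following math equation: \$f(x) = x^2\$ is the same as \$g(x) = x^(4/2) \$"

# Greedy: .* runs to the LAST \$ on the line, so both equations and the
# text between them end up in a single match.
greedy = re.findall(r'\\\$(.*)\\\$', data)

# Lazy: .*? stops at the NEXT \$, giving one match per equation.
lazy = re.findall(r'\\\$(.*?)\\\$', data)

print(len(greedy), greedy)
print(len(lazy), lazy)
```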
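This regex can fit (note that .{0,} is greedy, so if a line holds several \$ pairs it matches from the first \$ to the last one):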
/\\\$.{0,}\\\$/g
/ - begin
\\\$ - escaped: \$
. - any character between
{0,} - at least 0 chars (any number of chars, actually)
\\\$ - escaped: \$
/ - end
g - global search
This works:
import re
regex = r'\\\$(.*)\\\$'
r = re.compile(regex)
print(r.match(r"\$f(x) = x^2\$").group(1))
print(r.match(r"\$g(x) = x^(4/2) \$").group(1))

Breaking up substrings in Python based on characters

I am trying to write code that will take a string and remove specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to take a substring of a string that lies between two known characters marking the start and stop points?
You can extract all the text between pairs of " characters using regular expressions:
import re
inputString = 'type="NN" span="123..145" confidence="1.0" '
pat = re.compile('"([^"]*)"')
strings = []  # collected matches
while True:
    mat = pat.search(inputString)
    if mat is None:
        break
    strings.append(mat.group(1))
    inputString = inputString[mat.end():]  # continue after this match
print(strings)
or, easier:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
strings=re.findall('"([^"]*)"', inputString)
print(strings)
Output for both versions:
['NN', '123..145', '1.0']
Another option is to split on the quote character:
fields = inputString.split('"')
print(fields[1], fields[3], fields[5])
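Splitting on the quote character generalises to any number of fields with a stride slice. A minimal sketch, assuming the quotes are balanced and never escaped inside a value:

```python
input_string = 'type="NN" span="123..145" confidence="1.0" '

# Splitting on '"' alternates between text outside and inside the quotes,
# so the quoted values sit at the odd indices 1, 3, 5, ...
values = input_string.split('"')[1::2]

print(values)
```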
You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print(input_string_split)
['type="NN"', 'span="123..145"', 'confidence="1.0"']
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
...     match_obj = re.search(pattern, substring)
...     print(match_obj.group(1))
...
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The parentheses form a capture group that marks the part of the match you are interested in.
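If the attribute names are needed as well as the values, a second capture group can pick up the key in the same pass. This is a sketch only: it assumes every key is a run of word characters directly followed by ="...", which holds for the sample string but is not stated in the question.

```python
import re

input_string = 'type="NN" span="123..145" confidence="1.0" '

# Group 1 captures the key, group 2 the text between the quotes.
# With two groups, findall() returns a list of (key, value) tuples.
pairs = re.findall(r'(\w+)="([^"]*)"', input_string)

attributes = dict(pairs)
print(attributes)
```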
