Find and replace symbols with regex python - python

I have a sample like this:
sample = 'TEXT/xx_271802_1A'
p = re.compile("(/[a-z]{2})")
print p.match(sample)
In place of xx there may be any two letters from [a-z]:
TEXT/qq_271802_1A TEXT/sg_271802_1A TEXT/ut_271802_1A
How can I find this xx and, for example, replace it with 'WW'?
TEXT/WW_271802_1A TEXT/WW_271802_1A TEXT/WW_271802_1A
My code returns None.

import re
sample = 'TEXT/xx_271802_1A'
p = re.compile("(/[a-z]{2})")
print(p.search(sample).group())
Your code returns None because you are using match, which only matches at the start of the string. You need search or findall, since what you are looking for can be anywhere in the string, not just at the start.
For replacement use
re.sub(r'(?<=/)[a-z]{2}','WW',sample)
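To see match, search and sub side by side, here is a minimal Python 3 sketch. The variable names (at_start, anywhere, replaced) are only illustrative and do not come from the question.

```python
import re

sample = 'TEXT/xx_271802_1A'
pattern = re.compile(r'/[a-z]{2}')

# match() only tries the pattern at index 0, where the string has 'T', not '/'
at_start = pattern.match(sample)

# search() scans forward until the pattern fits somewhere in the string
anywhere = pattern.search(sample)

# sub() with a lookbehind replaces the two letters and leaves the slash alone
replaced = re.sub(r'(?<=/)[a-z]{2}', 'WW', sample)

print(at_start)          # None
print(anywhere.group())  # /xx
print(replaced)          # TEXT/WW_271802_1A
```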

You can try the following regular expression:
>>> sample = 'TEXT/xx_271802_1A'
>>> import re
>>> re.findall(r'([a-z])\1',sample)
['x']
>>> re.sub(r'([a-z])\1','WW',sample)
'TEXT/WW_271802_1A'
>>> sample = 'TEXT/WW_271802_1A TEXT/WW_271802_1A TEXT/WW_271802_1A'
>>> re.sub(r'([a-z])\1','WW',sample)
'TEXT/WW_271802_1A TEXT/WW_271802_1A TEXT/WW_271802_1A'
The regex ([a-z])\1 searches for one lowercase letter and matches only if that same letter repeats immediately.
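Because of the backreference, this pattern fits a doubled letter such as xx or qq, but a pair of different letters such as sg is left alone. A quick check in Python 3, using two of the sample strings from the question (the variable names are illustrative):

```python
import re

doubled = 'TEXT/qq_271802_1A'    # same letter twice
mixed = 'TEXT/sg_271802_1A'      # two different letters

backref = r'([a-z])\1'           # one letter, then the same letter again
two_letters = r'(?<=/)[a-z]{2}'  # any two lowercase letters right after a slash

doubled_result = re.sub(backref, 'WW', doubled)
mixed_result = re.sub(backref, 'WW', mixed)
mixed_fixed = re.sub(two_letters, 'WW', mixed)

print(doubled_result)  # TEXT/WW_271802_1A
print(mixed_result)    # TEXT/sg_271802_1A (unchanged)
print(mixed_fixed)     # TEXT/WW_271802_1A
```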

You only need to do this:
sample = re.sub(r'(?<=/)[a-z]{2}', 'WW', sample)
There is no need to check the string with match first; re.sub makes the replacement wherever the pattern is found.
(?<=..) is a lookbehind assertion and means "preceded by". It is only a check and is not part of the match result, so the / is not replaced.
In the same way, you can add a lookahead (?=_) ("followed by") at the end of the pattern if you want to check that the underscore is there.
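A small sketch of the lookbehind and lookahead combined, in Python 3. The third sample string (with a hyphen instead of an underscore) is made up here to show a case the lookahead rejects.

```python
import re

# Two lowercase letters that follow '/' and are followed by '_';
# neither the slash nor the underscore is part of the match
pattern = re.compile(r'(?<=/)[a-z]{2}(?=_)')

samples = [
    'TEXT/qq_271802_1A',
    'TEXT/sg_271802_1A',
    'TEXT/ut-271802_1A',  # hypothetical: no underscore after the letters
]
results = [pattern.sub('WW', s) for s in samples]

for line in results:
    print(line)
# TEXT/WW_271802_1A
# TEXT/WW_271802_1A
# TEXT/ut-271802_1A
```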

Related

Python replace between two chars (no split function)

I am currently working on a problem where I want to replace something in a string.
For example, I have the following string:
'123.49, 19.30, 02\n'
I only want the first two numbers, like '123.49, 19.30'. Using the split function is not possible, because I have a lot of data, some with and some without the last number.
I tried something like this:
import re as regex
#result = regex.match(', (.*)\n', string)
result = re.search(', (.*)\\n', string)
print(result.group(1))
This is not working properly. Can someone help me?
Thanks in advance
You could do something like this:
reg=r'(\d+\.\d+), (\d+\.\d+).*'
if re.search(reg, your_text):
    match = re.search(reg, your_text)
    first_num = match.group(1)
    second_num = match.group(2)
Alternatively, you can add the ^ anchor at the beginning to make sure you always take only the first two numbers.
import re
string = '123.49, 19.30, 02\n'
pattern = re.compile(r'^(\d*\.?\d*), (\d*\.?\d*)')
result = re.findall(pattern, string)
result
Output:
[('123.49', '19.30')]
In the code you are using import re as regex. If you do that, you have to use regex.search instead of re.search.
But in this case you can just use import re.
If you use , (.*) you capture everything after the first occurrence of the comma, and you are not taking digits into account.
If you want the first 2 numbers as stated in the question ('123.49, 19.30'), separated by commas, you can match them without using capture groups:
\b\d+\.\d+,\s*\d+\.\d+\b
Or matching 1 or more repetitions preceded by a comma:
\b\d+\.\d+(?:,\s*\d+\.\d+)+\b
As re.search can also return None, you can first check if there is a result (no need to run re.search twice)
import re
regex = r"\b\d+\.\d+(?:,\s*\d+\.\d+)+\b"
s = "123.49, 19.30, 02"
match = re.search(regex, s)
if match:
    print(match.group())
Output
123.49, 19.30
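Since the question mentions many lines, some with and some without the trailing number, the same idea can be wrapped in a small helper. This is only a sketch: the function name first_two_numbers and the extra sample lines are assumptions made for illustration.

```python
import re

# Two decimal numbers separated by a comma; anything after them is ignored
FIRST_TWO = re.compile(r'\b\d+\.\d+,\s*\d+\.\d+\b')

def first_two_numbers(line):
    """Return the leading 'x.xx, y.yy' part of a line, or None if absent."""
    match = FIRST_TWO.search(line)
    return match.group() if match else None

# Hypothetical input: with a third number, without one, and without any match
lines = ['123.49, 19.30, 02\n', '7.50, 0.25\n', 'no numbers here\n']
extracted = [first_two_numbers(line) for line in lines]
print(extracted)  # ['123.49, 19.30', '7.50, 0.25', None]
```

Compiling the pattern once and checking the match object before calling group() avoids both the repeated search and the AttributeError on lines that do not match.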

How to print substring using RegEx in Python?

These are two texts:
1) 'provider:sipoutilp1.ym.ms'
2) 'provider:sipoutqtm.ym.ms'
I would like to print ilp when it reaches the first line and qtm when it reaches the second line.
This is my solution, but it is not working.
RE_PROVIDER = re.compile(r'(?P<provider>\((ilp+|qtm+)')
Also, to reduce the line below,
182938,DOMINICAN REPUBLIC-MOBILE
to DOMINICAN REPUBLIC, can I use the same re.compile approach?
Thank you for any help.
Your regex is not correct because it has an escaped opening parenthesis \( before your keywords, and there is no such character in your lines.
More generally, you can capture the alphabetical characters that follow sipout (or provider:sipout).
>>> s1 = 'provider:sipoutilp1.ym.ms'
>>> s2 = 'provider:sipoutqtm.ym.ms'
>>> RE_PROVIDER = re.compile(r'(?P<provider>(?<=sipout)(ilp|qtm))')
>>> RE_PROVIDER.search(s1).groupdict()
{'provider': 'ilp'}
>>> RE_PROVIDER.search(s2).groupdict()
{'provider': 'qtm'}
(?<=sipout) is a positive lookbehind, which makes the regex engine match the pattern only where it is preceded by sipout.
After edit:
If you want to match multiple strings with different structures, you need optional preceding patterns in front of your keywords. Because a lookbehind cannot contain a variable-length pattern, you cannot use one for this purpose. Instead, you can use a capture-group trick.
You can put the optional preceding patterns in a non-capturing group and your keyword in a capturing group, then read the keyword from that group after the match (group(1); group(0) is the whole match).
>>> RE_PROVIDER = re.compile(r'(?:sipout|\d+,)(?P<provider>(ilp|qtm|[A-Z\s]+))')
>>> RE_PROVIDER.search(s1).groupdict()
{'provider': 'ilp'}
>>> RE_PROVIDER.search(s2).groupdict()
{'provider': 'qtm'}
>>> s3 = "182938,DOMINICAN REPUBLIC-MOBILE"
>>> RE_PROVIDER.search(s3).groupdict()
{'provider': 'DOMINICAN REPUBLIC'}
Note that groupdict doesn't work in this case because it will return
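The fixed-width restriction and the capture-group workaround can be checked with a short Python 3 sketch. It drops the inner unnamed group from the pattern above and reads the named group directly with group('provider'); the names lookbehind_ok and providers are illustrative.

```python
import re

# Python's re module rejects a lookbehind that is not fixed-width,
# and \d+ can be any number of digits
try:
    re.compile(r'(?<=sipout|\d+,)(ilp|qtm|[A-Z\s]+)')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Same prefixes in a non-capturing group, keyword in a named group
RE_PROVIDER = re.compile(r'(?:sipout|\d+,)(?P<provider>ilp|qtm|[A-Z\s]+)')

lines = [
    'provider:sipoutilp1.ym.ms',
    'provider:sipoutqtm.ym.ms',
    '182938,DOMINICAN REPUBLIC-MOBILE',
]
providers = [RE_PROVIDER.search(line).group('provider') for line in lines]

print(lookbehind_ok)  # False
print(providers)      # ['ilp', 'qtm', 'DOMINICAN REPUBLIC']
```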

Python Regex matching already matched sub-string

I'm fairly new to Python Regex and I'm not able to understand the following:
I'm trying to find one small letter surrounded by three capital letters.
My first problem is that the below regex is giving only one match instead of the two matches that are present ['AbAD', 'DaDD']
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[A-Z][a-z][A-Z][A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print regex.findall(str)
['AbAD']
I guess the above is due to the fact that the last D of the first match is no longer available for matching. Is there any way to turn off this kind of matching?
The second issue is the following regex:
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[^A-Z][A-Z][a-z][A-Z][A-Z][^A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print regex.findall(str)
[]
Basically what I want is that there shouldn't be more than three capital letters surrounding a small letter, and therefore I placed a negative match around them. But ['AbAD'] should be matched, but it is not getting matched. Any ideas?
It's mainly because the matches overlap. Just put your regex inside a lookahead in order to handle this type of overlapping match.
(?=([A-Z][a-z][A-Z][A-Z]))
Code:
>>> s = 'AbADaDD'
>>> re.findall(r'(?=([A-Z][a-z][A-Z][A-Z]))', s)
['AbAD', 'DaDD']
For the 2nd one, you should use negative lookahead and lookbehind assertion like below,
(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))
Code:
>>> re.findall(r'(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))', s)
['AbAD']
The problem with your second regex is that [^A-Z] has to consume a character, and there is no character before the first A for it to consume. The negative lookbehind (?<![A-Z]) makes a similar check without consuming anything: it asserts that the match is not preceded by an uppercase letter, which is also true at the very start of the string. That's why your pattern finds no match while the lookaround version does.
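The four variants discussed so far can be compared in one Python 3 sketch. The variable names (core, plain, overlapping, bounded, consuming) are made up for this illustration.

```python
import re

s = 'AbADaDD'
core = r'[A-Z][a-z][A-Z][A-Z]'

# Consuming pattern: the second candidate needs the 'D' already used by the first
plain = re.findall(core, s)

# Lookahead wrapper: each match is zero-width, so candidates may overlap
overlapping = re.findall(r'(?=(' + core + r'))', s)

# Lookarounds as boundaries: no capital directly before or after the group
bounded = re.findall(r'(?=(?<![A-Z])(' + core + r')(?![A-Z]))', s)

# Consuming boundaries need a real character on both sides,
# so nothing can match starting at index 0
consuming = re.findall(r'[^A-Z]' + core + r'[^A-Z]', s)

print(plain)        # ['AbAD']
print(overlapping)  # ['AbAD', 'DaDD']
print(bounded)      # ['AbAD']
print(consuming)    # []
```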
The problem with your regex is that it eats up the string as it progresses, leaving nothing for the second match. Use a lookahead to make sure it does not eat up the string.
pat = '(?=([A-Z][a-z][A-Z][A-Z]))'
For your second regex again do the same.
print(re.findall(r"(?=([A-Z][a-z][A-Z][A-Z](?=[^A-Z])))", s))
For more insight:
1) After the first match, the string left is aDD, because the first part has already been matched.
2) aDD does not satisfy pat = '[A-Z][a-z][A-Z][A-Z]', so it is not part of your match.
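The two points above can be observed directly from the match span. This is a minimal Python 3 sketch; the names first, leftover, after_first and from_three are illustrative.

```python
import re

s = 'AbADaDD'
pat = re.compile(r'[A-Z][a-z][A-Z][A-Z]')

first = pat.search(s)
leftover = s[first.end():]  # what is left once the first match is consumed

# Scanning resumes at the end of the first match, where nothing fits
after_first = pat.search(s, first.end())

# Starting one character earlier would have found the overlapping match
from_three = pat.search(s, 3)

print(first.span())        # (0, 4)
print(leftover)            # aDD
print(after_first)         # None
print(from_three.group())  # DaDD
```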
1st issue,
You should use this pattern,
r'([A-Z]{1}[a-z]{1}[A-Z]{1})'
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'([A-Z]{1}[a-z]{1}[A-Z]{1})', str)
['AbA', 'DaD']
2nd issue
You should use,
(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))', str)
['AbAD']

Regex pattern to extract substring

mystring = "q1)whatq2)whenq3)where"
want something like ["q1)what", "q2)when", "q3)where"]
My approach is to find the q\d+\) pattern then move till I find this pattern again and stop. But I'm not able to stop.
I did req_list = re.compile("q\d+\)[*]\q\d+\)").split(mystring)
But this gives the whole string.
How can I do it?
You could try the code below, which uses the re.findall function:
>>> import re
>>> s = "q1)whatq2)whenq3)where"
>>> m = re.findall(r'q\d+\)(?:(?!q\d+).)*', s)
>>> m
['q1)what', 'q2)when', 'q3)where']
Explanation:
q\d+\) matches the letter q followed by one or more digits and then a ) symbol.
(?:(?!q\d+).)* matches any character, zero or more times, as long as a new q\d+ sequence does not start at that character. The negative lookahead is checked before each character is consumed.
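As a cross-check, the same split can also be written with a lazy quantifier and a positive lookahead. This alternative is not from the answer above; it is a sketch under the assumption that every item starts with a q<digits>) marker and contains no newlines.

```python
import re

s = 'q1)whatq2)whenq3)where'

# Pattern from the answer: consume one character at a time
# unless a new q<digits> sequence starts there
tempered = re.findall(r'q\d+\)(?:(?!q\d+).)*', s)

# Alternative sketch: lazy body that stops right before the next marker or the end
lazy = re.findall(r'q\d+\).*?(?=q\d+\)|$)', s)

print(tempered)  # ['q1)what', 'q2)when', 'q3)where']
print(lazy)      # ['q1)what', 'q2)when', 'q3)where']
```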

Regex related to * and + in python

I am new to Python. I don't understand the behaviour of these programs.
import re
sub="dear"
pat="[aeiou]+"
m=re.search(pat,sub)
print(m.group())
This prints "ea"
import re
sub="dear"
pat="[aeiou]*"
m=re.search(pat,sub)
print(m.group())
This doesn't print anything.
I know + matches 1 or more occurrences and * matches 0 or more occurrences. I am expecting it to print "ea" in both programs, but it doesn't.
Why this happens?
"This doesn't print anything."
Not exactly. It prints an empty string, which of course you didn't notice because it's not visible. Try using this code instead:
l = re.findall(pat, sub)
print(l)
This will print:
['', 'ea', '', '']
Why this behaviour?
This is because when you use the * quantifier, as in [aeiou]*, the pattern also matches the empty string before every non-matching character, as well as the empty string at the end. So, for your string dear, it matches like this:
*d*ea*r* // * where the pattern matches.
All the *'s denote the position of your matches.
d doesn't match the pattern, so the first match is the empty string before it.
ea matches the pattern, so the next match is ea.
r doesn't match the pattern, so the match is the empty string before r.
The last match is the empty string after r.
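The four match positions listed above can be printed with re.finditer. This is a small Python 3 sketch; the name spans is illustrative.

```python
import re

sub = 'dear'

# Record where each match starts and ends, together with the matched text
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r'[aeiou]*', sub)]

print(spans)  # [(0, 0, ''), (1, 3, 'ea'), (3, 3, ''), (4, 4, '')]
```

The empty matches have identical start and end indexes, which is why nothing visible is printed for them.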
With [aeiou]*, the pattern matches at the very beginning of the string. You can confirm that using MatchObject.start:
>>> import re
>>> sub="dear"
>>> pat="[aeiou]*"
>>> m=re.search(pat,sub)
>>> m.start()
0
>>> m.end()
0
>>> m.group()
''
+ matches at least one occurrence of the character or group before it. [aeiou]+ will thus match at least one of a, e, i, o or u (vowels).
The regex engine will look through the whole string for the minimum of one vowel it needs, and so it does what you expect (it keeps trying positions until the condition is satisfied).
* however means at least 0, which also means it can match nothing. When the regex engine starts looking for a match at the beginning of the string, it finds no vowel there, so the zero-occurrence condition is already satisfied at that position, and the empty match is the result you obtain.
If you had used the string ear, note that you would have ea as match.
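A short Python 3 comparison of the three cases mentioned in this answer; the names star and plus are illustrative.

```python
import re

star = re.compile(r'[aeiou]*')
plus = re.compile(r'[aeiou]+')

star_dear = star.search('dear').group()  # empty match at index 0, before 'd'
star_ear = star.search('ear').group()    # the vowels are at index 0 this time
plus_dear = plus.search('dear').group()  # + keeps scanning until it finds a vowel

print(repr(star_dear))  # ''
print(repr(star_ear))   # 'ea'
print(repr(plus_dear))  # 'ea'
```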
