How to get the matching word in a regex with alternations? - python

In python, suppose I want to search the string
"123"
for occurrences of the pattern
"abc|1.*|def|.23" .
I would currently do this as follows:
import re
re.match ("abc|1.*|def|.23", "123") .
The above returns a match object from which I can retrieve the starting and ending indices of the match in the string, which in this case would be 0 and 3.
My question is: How can I retrieve the particular word(s) in the regular expression which matched with
"123" ?
In other words: I would like to get "1.*" and ".23". Is this possible?

Provided the alternatives in your pattern are always separated by a common separator (here "|"), you can try:
import re
pattern = "abc|1.*|def|.23"
matches = [s for s in pattern.split("|") if re.match(s, "123")]
print(matches)
output:
['1.*', '.23']
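One thing to watch with this approach: re.match only anchors at the start of the string, so an alternative that matches just a prefix is also reported. A minimal sketch, assuming you want alternatives that match the whole string, could use re.fullmatch instead. The helper name matching_alternatives is made up for illustration, and it assumes the alternatives are already available as a list (splitting on "|" breaks if an alternative itself contains a "|").

```python
import re

def matching_alternatives(alternatives, text):
    """Return the alternatives whose pattern matches all of `text`."""
    return [alt for alt in alternatives if re.fullmatch(alt, text)]

# Hypothetical input: the alternatives from the question, already split.
alternatives = ["abc", "1.*", "def", ".23"]
print(matching_alternatives(alternatives, "123"))   # ['1.*', '.23']
print(matching_alternatives(alternatives, "1234"))  # ['1.*']
```

With re.match, ".23" would also be reported for "1234", because it matches the prefix "123".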

Another approach would be to create one capture group for each token in the alternation:
import re
s = 'def'
rgx = r'\b(?:(abc)|(1.*)|(def)|(.23))\b'
m = re.match(rgx, s)
print(m.group(0)) #=> def
print(m.group(1)) #=> None
print(m.group(2)) #=> None
print(m.group(3)) #=> def
print(m.group(4)) #=> None
This example shows the match is 'def' and was matched by the 3rd capture group,(def).
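If you do not want to inspect every group by hand, the match object's lastindex attribute gives the number of the group that matched. The sketch below builds the combined pattern from a list; it assumes none of the alternatives contains capture groups of its own, otherwise the group numbers would shift. Note that an alternation stops at the first alternative that succeeds, so for "123" this reports only "1.*", not ".23".

```python
import re

alternatives = ["abc", "1.*", "def", ".23"]
# One capture group per alternative, in order: (abc)|(1.*)|(def)|(.23)
combined = "|".join("({})".format(alt) for alt in alternatives)

m = re.match(combined, "123")
if m:
    # lastindex is the 1-based number of the group that matched.
    print(alternatives[m.lastindex - 1])  # 1.*
```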

Related

How to find a string in text and return the string from text?

I need to find a string from which the special characters have already been removed. What I want is to find that string in a sentence and return it with its special characters.
Ex: string = France09
Sentence : i leaved in France'09.
I tried re.search('France09', sentence), which will return True or False, but I want to get France'09 as the output.
Can anyone help me?
According to the docs (https://docs.python.org/2/library/re.html#re.search), search does not return True or False:
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
Have a look at https://regex101.com/r/18NJ2E/1
TL;DR
import re
regex = r"(?P<relevant_info>France'09)"
test_str = "Sentence : i leaved in France'09."
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    print(match.group('relevant_info'))
Try This:
Input_str = "i leaved in France'09"
Word_list = Input_str.split(" ")
for val in Word_list:
    if not val.isalnum():
        print(val)
Output:
France'09
You will need to create a regular expression that matches the special characters at any location:
import re
Sentence = "i leaved in France'09"
Match = 'France09'
Match2 = "[']*".join(Match)
m = re.search(Match2, Sentence)
print(m.group(0))
Match2 gets the value "F[']*r[']*a[']*n[']*c[']*e[']*0[']*9". You can add other special characters into the ['] part.
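As a variation on the same idea, here is a small sketch that allows any non-alphanumeric character between the letters instead of listing each special character by hand. The variable names are illustrative, and re.escape is used in case the search string itself contains regex metacharacters. Be aware that spaces also count as non-alphanumeric here, so the pattern can bridge word boundaries.

```python
import re

sentence = "i leaved in France'09."
needle = "France09"

# Allow any run of non-alphanumeric characters between consecutive characters.
gap = r"[^A-Za-z0-9]*"
pattern = gap.join(re.escape(ch) for ch in needle)

m = re.search(pattern, sentence)
if m:
    print(m.group(0))  # France'09
```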

How to use regex to tell if first and last character of a string match?

I'm relatively new to Python and regex, and I want to check whether a string's first and last characters are the same.
If first and last characters are same, then return 'True' (Ex: 'aba')
If first and last characters are not same, then return 'False' (Ex: 'ab')
Below is the code, I've written:
import re
string = 'aba'
pattern = re.compile(r'^/w./1w$')
matches = pattern.finditer(string)
for match in matches:
    print(match)
But from the above code, I don't see any output
If, and only if, you really want to use a regex (for learning purposes):
import re
string = 'aba'
string2 = 'no match'
pattern = re.compile(r'^(.).*\1$')
if re.match(pattern, string):
    print('ok')
else:
    print('nok')
if re.match(pattern, string2):
    print('ok')
else:
    print('nok')
output:
ok
nok
Explanations:
^(.).*\1$
^ start of line anchor
(.) match the first character of the line and store it in a group
.* match any character (except a newline), zero or more times
\1 backreference to the first group, in this case the first character to impose that the first char and the last one are equal
$ end of line anchor
Demo: https://regex101.com/r/DaOPEl/1/
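To see how the pattern explained above behaves on edge cases, the sketch below compares it with a plain index comparison. The function names are made up for illustration. The single-character string 'a' is the interesting case: the regex rejects it, because (.) and \1 each need to consume a character, while the index comparison accepts it.

```python
import re

PATTERN = re.compile(r'^(.).*\1$')

def same_ends_regex(text):
    """True if the regex from the answer matches `text`."""
    return PATTERN.match(text) is not None

def same_ends_index(text):
    """True if `text` is non-empty and its first and last characters are equal."""
    return bool(text) and text[0] == text[-1]

for sample in ['aba', 'ab', 'a']:
    print(sample, same_ends_regex(sample), same_ends_index(sample))
```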
Otherwise, the best approach is simply to compare string[0] == string[-1]:
string = 'aba'
if string[0] == string[-1]:
    print('same')
output:
same
Why overengineer this with a regex at all? One principle of programming is to keep it simple, like:
string[0] == string[-1]
Or is there a need for regex?
The answer above by Tobias is perfect and simple, but if you want a solution using regex, try the code below.
Code:
import re
string = 'abbaaaa'
pattern = re.compile(r'^(.).*\1$')
matches = pattern.finditer(string)
for match in matches:
    print(match)
Output :
<_sre.SRE_Match object; span=(0, 7), match='abbaaaa'>
I think this is the regex you are trying to execute:
Code:
import re
string = 'aba'
pattern = re.compile(r'^(\w).(\1)$')
matches = pattern.finditer(string)
for match in matches:
    print(match.group(0))
Output:
aba
If you want to check with a regex, use the following:
import re
string = 'aba is a cowa'
pat = r'^(.).*\1$'
if re.findall(pat, string):
    print(string)
The pattern matches only when the first and last characters of the string are the same. In that case findall returns a list containing that character, so the string is printed; otherwise findall returns an empty list and the string is skipped.

Find and replace symbols with regex python

I have the following sample:
import re
sample = 'TEXT/xx_271802_1A'
p = re.compile("(/[a-z]{2})")
print(p.match(sample))
In place of xx there may be any two characters from [a-z]:
TEXT/qq_271802_1A TEXT/sg_271802_1A TEXT/ut_271802_1A
How can I find this xx and, for example, replace it with 'WW'?
TEXT/WW_271802_1A TEXT/WW_271802_1A TEXT/WW_271802_1A
my code returns None
sample = 'TEXT/xx_271802_1A'
p = re.compile("(/[a-z]{2})")
print(p.search(sample).group())
Your code returns None because you are using match, which only matches at the start of the string. You need search or findall, since the pattern occurs elsewhere in the string rather than at the start.
For replacement use
re.sub(r'(?<=/)[a-z]{2}','WW',sample)
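The difference between match and search described in this answer can be checked directly. This is just a quick sketch using the sample string from the question:

```python
import re

sample = 'TEXT/xx_271802_1A'
p = re.compile(r"/[a-z]{2}")

print(p.match(sample))           # None: the string does not start with "/"
print(p.search(sample).group())  # /xx
```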
You can try the following Regular expression :
>>> sample = 'TEXT/xx_271802_1A'
>>> import re
>>> re.findall(r'([a-z])\1',sample)
['x']
>>> re.sub(r'([a-z])\1','WW',sample)
'TEXT/WW_271802_1A'
>>> sample = 'TEXT/WW_271802_1A TEXT/WW_271802_1A TEXT/WW_271802_1A'
>>> re.sub(r'([a-z])\1','WW',sample)
'TEXT/WW_271802_1A TEXT/WW_271802_1A TEXT/WW_271802_1A'
The regex ([a-z])\1 matches one lowercase letter followed immediately by the same letter. It therefore handles xx or qq, but not a pair of different letters such as sg or ut.
You only need to do this:
sample = re.sub(r'(?<=/)[a-z]{2}', 'WW', sample)
No need to check the string before with match. re.sub makes the replacement when the pattern is found.
(?<=..) is a lookbehind assertion and means preceded by, it's only a check and is not part of the match result. So / is not replaced.
In the same way, you can add a lookahead (?=_) (followed by) at the end of the pattern, if you want to check if there is the underscore.
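Putting the lookbehind and the optional lookahead together, a minimal sketch could look like this. The sample strings are the ones from the question; the last call uses a made-up string to show that nothing is replaced when the underscore is missing.

```python
import re

samples = ['TEXT/qq_271802_1A', 'TEXT/sg_271802_1A', 'TEXT/ut_271802_1A']

# Two lowercase letters, preceded by "/" and followed by "_".
# Neither "/" nor "_" is part of the match, so both are kept.
pattern = re.compile(r'(?<=/)[a-z]{2}(?=_)')

replaced = [pattern.sub('WW', s) for s in samples]
print(replaced)

# Hypothetical input without the underscore: left unchanged.
print(pattern.sub('WW', 'TEXT/ab-1'))
```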

How to print regex match results in python 3?

I was in IDLE, and decided to use regex to sort out a string. But when I typed in what the online tutorial told me to, all it would do was print:
<_sre.SRE_Match object at 0x00000000031D7E68>
Full program:
import re
reg = re.compile("[a-z]+8?")
str = "ccc8"
print(reg.match(str))
result:
<_sre.SRE_Match object at 0x00000000031D7ED0>
Could anybody tell me how to actually print the result?
You need to call .group() on the result of match to print the matched string; otherwise, printing the match object only shows that a match occurred. To print the characters captured by a capturing group, pass the corresponding group index to .group().
>>> import re
>>> reg = re.compile("[a-z]+8?")
>>> str = "ccc8"
>>> print(reg.match(str).group())
ccc8
Regex with capturing group.
>>> reg = re.compile("([a-z]+)8?")
>>> print(reg.match(str).group(1))
ccc
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
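The note about MULTILINE mode can be illustrated with a short sketch. The two-line test string is made up for this example:

```python
import re

text = "first\nccc8"

# re.match only tries the beginning of the string, even with re.MULTILINE.
print(re.match(r"^[a-z]+8", text, re.MULTILINE))           # None

# re.search with ^ and re.MULTILINE can match at the start of any line.
print(re.search(r"^[a-z]+8", text, re.MULTILINE).group())  # ccc8
```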
If you need to get the whole match value, you should use
m = re.match(r"[a-z]+8?", text)
if m:  # Always check if a match occurred to avoid NoneType issues
    print(m.group())  # Print the matched string
If you need to extract a part of the regex match, you need to use capturing groups in your regular expression. Enclose those patterns with a pair of unescaped parentheses.
To only print captured group results, use Match.groups:
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.
So, to get ccc and 8 and display only those, you may use
import re
reg = re.compile("([a-z]+)(8?)")
s = "ccc8"
m = reg.match(s)
if m:
    print(m.groups())  # => ('ccc', '8')
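The default argument mentioned in the documentation quote matters when a group is optional and does not take part in the match. A small sketch, using a slightly different pattern in which the whole second group is optional:

```python
import re

reg = re.compile(r"([a-z]+)(8)?")
m = reg.match("ccc")
if m:
    print(m.groups())    # ('ccc', None)
    print(m.groups(""))  # ('ccc', '')
```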

python return matching and non-matching patterns of string

I would like to split a string into parts that match a regexp pattern and parts that do not match into a list.
For example
import re
string = 'my_file_10'
pattern = r'\d+$'
# I know the matching part can be obtained with:
m = re.search(pattern, string).group()
print(m)  # prints 10
# The final result should be as follows:
# ['my_file_', '10']
Put parentheses around the pattern to make it a capturing group, then use re.split() to produce a list of matching and non-matching elements:
pattern = r'(\d+$)'
re.split(pattern, string)
Demo:
>>> import re
>>> string = 'my_file_10'
>>> pattern = r'(\d+$)'
>>> re.split(pattern, string)
['my_file_', '10', '']
Because you are splitting on digits at the end of the string, an empty string is included.
If you only ever expect one match, at the end of the string (which the $ in your pattern forces here), then just use the m.start() method to obtain an index to slice the input string:
pattern = r'\d+$'
match = re.search(pattern, string)
not_matched, matched = string[:match.start()], match.group()
This returns:
>>> pattern = r'\d+$'
>>> match = re.search(pattern, string)
>>> string[:match.start()], match.group()
('my_file_', '10')
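The slicing approach fails with an AttributeError if the pattern does not match, because re.search then returns None. A minimal sketch that wraps it in a helper and handles that case; the function name split_trailing_match is made up for illustration, and returning the whole string as a one-element list is just one possible choice:

```python
import re

def split_trailing_match(pattern, text):
    """Split `text` into [non-matching prefix, match]; [text] if no match."""
    match = re.search(pattern, text)
    if match is None:
        return [text]
    return [text[:match.start()], match.group()]

print(split_trailing_match(r'\d+$', 'my_file_10'))  # ['my_file_', '10']
print(split_trailing_match(r'\d+$', 'my_file'))     # ['my_file']
```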
You can use re.split to make a list of those separate parts and then use filter, which drops all elements that are considered false (empty strings):
>>> import re
>>> list(filter(None, re.split(r'(\d+$)', 'my_file_015_01')))
['my_file_015_', '01']
