How can I get part of regex match as a variable in python? - python

In Perl it is possible to do something like this (I hope the syntax is right...):
$string =~ m/lalala(I want this part)lalala/;
$whatIWant = $1;
I want to do the same in Python and get the text inside the parentheses as a string, like $1.

If you want to get parts by name you can also do this:
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
>>> m.groupdict()
{'first_name': 'Malcom', 'last_name': 'Reynolds'}
The example was taken from the re docs
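As a small sketch of the same idea, individual named groups can also be read one at a time instead of through groupdict(). The pattern and input are the ones from the example above; the variable names first, last and both are only illustrative.

```python
import re

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")

first = m.group("first_name")   # look up a single group by name
last = m["last_name"]           # subscript access, available since Python 3.6
both = m.group(1, 2)            # named groups are still numbered, too
print(first, last, both)
```

Named groups keep their positional numbers, so code written against group(1) keeps working after names are added to a pattern.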

See: Python regex match objects
>>> import re
>>> p = re.compile("lalala(I want this part)lalala")
>>> p.match("lalalaI want this partlalala").group(1)
'I want this part'

import re
astr = 'lalalabeeplalala'
match = re.search('lalala(.*)lalala', astr)
whatIWant = match.group(1) if match else None
print(whatIWant)
A small note: in Perl, when you write
$string =~ m/lalala(.*)lalala/;
the regexp can match anywhere in the string. The equivalent is accomplished with the re.search() function, not the re.match() function, which requires that the pattern match starting at the beginning of the string.
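The difference can be seen in a short sketch. The input string here is made up for illustration: it has extra text in front of the part we want, so match() fails where search() succeeds.

```python
import re

astr = "xxlalalabeeplalalaxx"
pattern = re.compile(r"lalala(.*)lalala")

anchored = pattern.match(astr)    # must match at index 0, so this is None
anywhere = pattern.search(astr)   # scans forward, like Perl's m//

print(anchored)
print(anywhere.group(1))
```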

import re
data = "some input data"
m = re.search("some (input) data", data)
if m:  # the match was successful
    print(m.group(1))
Check the docs for more.

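There's no need for a regex here; str.split is simpler. Note that in this example the parentheses are literal characters in the string, so they appear in the result.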
>>> "lalala(I want this part)lalala".split("lalala")
['', '(I want this part)', '']
>>> "lalala(I want this part)lalala".split("lalala")[1]
'(I want this part)'
>>>

import re
match = re.match('lalala(I want this part)lalala', 'lalalaI want this partlalala')
print(match.group(1))

import re
string_to_check = "other_text...lalalaI want this partlalala...other_text"
p = re.compile("lalala(I want this part)lalala") # regex pattern
m = p.search(string_to_check) # use p.match if what you want is always at beginning of string
if m:
    print(m.group(1))
While converting a Perl program that parses function names out of modules to Python, I ran into this problem: I got an error saying that "group" was undefined. I soon realized the exception was raised because p.match / p.search return None when the string does not match.
None has no group method, so the call fails. To avoid the exception, check that a match object was returned before calling group.
import re
filename = './file_to_parse.py'
p = re.compile(r'def (\w*)')  # \w* greedily matches word characters (letters, digits, underscore)
with open(filename, 'r') as source:
    for each_line in source:
        m = p.match(each_line)  # tries to match the pattern at the start of the line
        if m:
            print(m.group(1))
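The None check can also be folded into a small helper. This is only a sketch: first_group is a hypothetical name, not something provided by the re module, and the sample inputs are made up.

```python
import re

def first_group(pattern, text, default=None):
    """Return capture group 1 of the first match, or default if nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else default

name = first_group(r"def (\w+)", "def parse_header(line):")
missing = first_group(r"def (\w+)", "x = 1")
print(name, missing)
```

Returning a default instead of raising keeps the calling loop free of repeated if-checks, at the cost of hiding whether a line matched at all.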

Related

Regular expression for YYYY-MM-DDTHH:MM:SS is not detecting the presence of .00Z [duplicate]

Suppose I have a string like test-123.
I want to test whether it matches a pattern like test-<number>, where <number> means one or more digit symbols.
I tried this code:
import re
correct_string = 'test-251'
wrong_string = 'test-123x'
regex = re.compile(r'test-\d+')
if regex.match(correct_string):
    print('Matching correct string.')
if regex.match(wrong_string):
    print('Matching wrong_string.')
How can I make it so that only the correct_string matches, and the wrong_string doesn't? I tried using .search instead of .match but it didn't help.
Try specifying the start and end anchors in your regex:
re.compile(r'^test-\d+$')
For an exact match, use regex = r'^(some-regex-here)$'
^ : Start of string
$ : End of string
Since Python 3.4 you can use re.fullmatch to avoid adding ^ and $ to your pattern.
>>> import re
>>> p = re.compile(r'\d{3}')
>>> bool(p.match('1234'))
True
>>> bool(p.fullmatch('1234'))
False
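One difference between the two approaches is worth a quick sketch: $ also matches just before a trailing newline, while fullmatch requires the pattern to consume the whole string. The third sample string below is made up to show that case.

```python
import re

anchored = re.compile(r'^test-\d+$')   # explicit anchors, used with match()
plain = re.compile(r'test-\d+')        # no anchors, used with fullmatch()

candidates = ['test-251', 'test-123x', 'test-251\n']
results = {
    s: (bool(anchored.match(s)), bool(plain.fullmatch(s)))
    for s in candidates
}
for s, (with_anchors, with_fullmatch) in results.items():
    print(repr(s), with_anchors, with_fullmatch)
```

If input may carry a trailing newline (for example, lines read from a file), either strip it first or pick the variant whose behavior you actually want.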
I think this may help you:
import re
pattern = r"test-[0-9]+$"
s = input()
if re.match(pattern, s):
    print('matched')
else:
    print('not matched')
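You can try re.findall(), which returns every match in the string. Like search, it finds substrings, so add anchors if the whole string must match:
import re
correct_string = 'test-251'
if len(re.findall(r"test-\d+", correct_string)) > 0:
    print("Match found")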
A pattern such as \btest-\d+\b should work:
matches = re.search(r'\btest-\d+\b', search_string)
This requires word boundaries on both sides of the match, which prevents other word characters from occurring directly after the digits.
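A brief sketch of how the word-boundary version behaves. The third input is an invented example showing that, unlike a whole-string match, this pattern still finds the token inside longer text.

```python
import re

bounded = re.compile(r'\btest-\d+\b')

candidates = ['test-251', 'test-123x', 'run test-42 now']
found = [bool(bounded.search(s)) for s in candidates]
for s, hit in zip(candidates, found):
    print(repr(s), hit)
```

So word boundaries are the right tool when the token may be embedded in a sentence, and anchors or fullmatch are the right tool when the entire string must be the token.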

I need help formulating a specific regex

I do not consider myself a newbie in regex, but I seem to have found a problem that stumped me (it's also Friday evening, so brain not at peak performance).
I am trying to substitute a place-holder inside a string with some other value. I am having great difficulty getting a syntax that behaves the way I want.
My place-holder has this format: {swap}
I want it to capture and replace these:
{swap} # NewValue
x{swap}x # xNewValuex
{swap}x # NewValuex
x{swap} # xNewValue
But I want it to NOT match these:
{{swap}} # NOT {NewValue}
x{{swap}}x # NOT x{NewValue}x
{{swap}}x # NOT {NewValue}x
x{{swap}} # NOT x{NewValue}
In all of the above, x can be any string, of any length, be it "word" or not.
I'm trying to do this using python3's re.sub() but anytime I satisfy one subset of criteria I lose another in the process. I'm starting to think it might not be possible to do in a single command.
Cheers!
If you're able to use the newer regex module, you can use (*SKIP)(*FAIL):
{{.*?}}(*SKIP)(*FAIL)|{.*?}
See a demo on regex101.com.
Broken down, this says:
{{.*?}}(*SKIP)(*FAIL) # match any {{...}} and "throw them away"
| # or ...
{.*?} # match your desired pattern
In Python this would be:
import regex as re
rx = re.compile(r'{{.*?}}(*SKIP)(*FAIL)|{.*?}')
string = """
{swap}
x{swap}x
{swap}x
x{swap}
{{swap}}
x{{swap}}x
{{swap}}x
x{{swap}}"""
string = rx.sub('NewValue', string)
print(string)
This yields:
NewValue
xNewValuex
NewValuex
xNewValue
{{swap}}
x{{swap}}x
{{swap}}x
x{{swap}}
For the sake of completeness, you can also achieve this with Python's own re module but here, you'll need a slightly adjusted pattern as well as a replacement function:
import re
rx = re.compile(r'{{.*?}}|({.*?})')
string = """
{swap}
x{swap}x
{swap}x
x{swap}
{{swap}}
x{{swap}}x
{{swap}}x
x{{swap}}"""
def repl(match):
    if match.group(1) is not None:
        return "NewValue"
    else:
        return match.group(0)
string = rx.sub(repl, string)
print(string)
Use negative lookahead and lookbehind:
import re
s1 = "x{swap}x"
s2 = "x{{swap}}x"
pattern = r"(?<!\{)\{[^}]+\}(?!})"
re.sub(pattern, "foo", s1)
#'xfoox'
re.sub(pattern, "foo", s2)
#'x{{swap}}x'
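As a sketch, the lookaround idea can be checked against all eight cases from the question. This version assumes the placeholder is always the literal text {swap}, which is narrower than the [^}]+ pattern above; the names placeholder, cases and replaced are illustrative.

```python
import re

# Match {swap} only when it is not wrapped in a second pair of braces.
placeholder = re.compile(r"(?<!\{)\{swap\}(?!\})")

cases = ["{swap}", "x{swap}x", "{swap}x", "x{swap}",
         "{{swap}}", "x{{swap}}x", "{{swap}}x", "x{{swap}}"]

replaced = [placeholder.sub("NewValue", case) for case in cases]
for before, after in zip(cases, replaced):
    print(before, "->", after)
```

Note that the lookarounds only inspect one character on each side, so an unbalanced input such as {{swap} is also left untouched.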

Using regex to extract information from string

I am trying to write a regex in Python to extract some information from a string.
Given:
"Only in Api_git/Api/folder A: new.txt"
I would like to print:
Folder Path: Api_git/Api/folder A
Filename: new.txt
After having a look at some examples on the re manual page, I'm still a bit stuck.
This is what I've tried so far
m = re.match(r"(Only in ?P<folder_path>\w+:?P<filename>\w+)","Only in Api_git/Api/folder A: new.txt")
print(m.group('folder_path'))
print(m.group('filename'))
Can anybody point me in the right direction??
Get the matched groups at index 1 and 2 using capturing groups.
^Only in ([^:]*): (.*)$
Sample code:
import re
p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = "Only in Api_git/Api/folder A: new.txt"
re.findall(p, test_str)
If you want to print it in the format below, try a substitution.
Folder Path: Api_git/Api/folder A
Filename: new.txt
Sample code:
import re
p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = "Only in Api_git/Api/folder A: new.txt"
subst = r"Folder Path: \1\nFilename: \2"  # Python uses \1, not $1, for backreferences
result = re.sub(p, subst, test_str)
print(result)
Your pattern: (Only in ?P<folder_path>\w+:?P<filename>\w+) has a few flaws in it.
The ?P construct is only valid as the first thing inside a parenthesized expression, so we need this:
(Only in (?P<folder_path>\w+):(?P<filename>\w+))
The \w character class matches only letters, digits, and underscores. It won't match /, ., or a space, for example. We need a character class that more closely fits the requirements. In fact, we can just use ., which matches nearly any character:
(Only in (?P<folder_path>.+):(?P<filename>.+))
The colon has a space after it in your example text. We need to match it:
(Only in (?P<folder_path>.+): (?P<filename>.+))
The outermost parentheses are not needed. They aren't wrong, just not needed:
Only in (?P<folder_path>.+): (?P<filename>.+)
It is often convenient to provide the regular expression separate from the call to the regular expression engine. This is easily accomplished by creating a new variable, for example:
regex = r'Only in (?P<folder_path>.+): (?P<filename>.+)'
... # several lines later
m = re.match(regex, "Only in Api_git/Api/folder A: new.txt")
The above is purely for the convenience of the programmer: it neither saves nor squanders time or memory space. There is, however, a technique that can save some of the time involved in regular expressions: compiling.
Consider this code segment:
regex = r'Only in (?P<folder_path>.+): (?P<filename>.+)'
for line in input_file:
    m = re.match(regex, line)
    ...
On each iteration of the loop, re.match must turn the pattern string into a compiled pattern (or at least look it up in the module's internal cache) before applying it to the line variable. The re module allows us to separate the interpretation from the application; we can interpret once but apply several times:
regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')
for line in input_file:
    m = regex.match(line)
    ...
Now, your original program should look like this:
regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')
m = regex.match("Only in Api_git/Api/folder A: new.txt")
print(m.group('folder_path'))
print(m.group('filename'))
However, I'm a fan of using comments to explain regular expressions. My version, including some general cleanup, looks like this:
import re
regex = re.compile(r'''(?x)    # Verbose mode
    Only\ in\                  # Literal match
    (?P<folder_path>.+)        # Longest sequence of anything, stored in 'folder_path'
    :\                         # Literal match
    (?P<filename>.+)           # Longest sequence of anything, stored in 'filename'
''')
with open('diff.out') as input_file:
    for line in input_file:
        m = regex.match(line)
        if m:
            print(m.group('folder_path'))
            print(m.group('filename'))
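To produce the exact two-line output asked for in the question, the named groups can be fed straight into a format string. This is a minimal sketch using the sample line from the question; the report variable and the format template are my own additions.

```python
import re

regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')

sample = "Only in Api_git/Api/folder A: new.txt"
m = regex.match(sample)
if m:
    # groupdict() maps each group name to its matched text
    report = "Folder Path: {folder_path}\nFilename: {filename}".format(**m.groupdict())
    print(report)
```

Because the template uses the group names directly, adding another named group to the pattern only requires adding a matching field to the template.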
It really depends on how constrained the input is. If this is the only kind of input, the following will do the trick (note the escaped dot before txt):
^Only in (?P<folder_path>[a-zA-Z_/ ]*): (?P<filename>[a-z]*\.txt)$

Python regular expression not matching

This is one of those things where I'm sure I'm missing something simple, but... In the sample program below, I'm trying to use Python's RE library to parse the string "line" to get the floating-point number just before the percent sign, i.e. "90.31". But the code always prints "no match".
I've tried a couple other regular expressions as well, all with the same result. What am I missing?
#!/usr/bin/python
import re
line = ' 0 repaired, 90.31% done'
pct_re = re.compile(r' (\d+\.\d+)% done$')
#pct_re = re.compile(r', (.+)% done$')
#pct_re = re.compile(r' (\d+.*)% done$')
match = pct_re.match(line)
if match:
    print('got match, pct=' + match.group(1))
else:
    print('no match')
match only matches from the beginning of the string. Your code works fine if you do pct_re.search(line) instead.
You should use re.findall instead:
>>> import re
>>> line = ' 0 repaired, 90.31% done'
>>> pattern = re.compile(r"\d+[.]\d+(?=%)")
>>> re.findall(pattern, line)
['90.31']
re.match only matches at the start of the string, so you would need a regex that covers the complete string.
Try this if you really want to use match:
re.match(r'.*?(\d+\.\d+)% done$', line)
The leading .*? is non-greedy so that it does not swallow the first digits of the number.
r'...' is a "raw" string, which leaves backslash escape sequences uninterpreted; using it is good practice for regexes in Python. – kratenko
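One thing to watch when putting a leading .* in front of a capture group is greediness. The sketch below compares a greedy and a non-greedy prefix on the line from the question; the names greedy and lazy are only illustrative.

```python
import re

line = ' 0 repaired, 90.31% done'

# Greedy .* grabs as much as it can, leaving the group only the minimum it needs.
greedy = re.match(r'.*(\d+\.\d+)% done$', line)
# Non-greedy .*? stops as early as possible, so the group gets the whole number.
lazy = re.match(r'.*?(\d+\.\d+)% done$', line)

print(greedy.group(1))
print(lazy.group(1))
```

With the greedy prefix the group captures 0.31, because .* has already consumed the leading 9; the non-greedy prefix yields 90.31.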
