Match a pattern only when previous pattern matches - python

I have a situation where I need to match a pattern only when a previous regex pattern has matched. The two patterns are different and match on different lines. For example,
Text:
blah blah blah MyHost="xxxx"
again blah blah blah MyIp= "x.x.x.x"
I am only interested in what comes after MyHost and MyIp. I also have a requirement that MyIp should match only when there is a match (MyHost="xxxx") on the line above.
I am able to match the MyHost value and the MyIp value separately, but I am having a hard time finding the logic to match both as per the requirement. Please note that I am fairly new to Python; I searched a lot and ended up here.

MyIp should match only when there is a match(MyHost="xxxx") in the above line.
Capture the value in group 1, using a lazy quantifier to skip ahead to MyIp. You already know what comes after MyHost:
\bMyHost="xxxx"\r?\n.*?MyIp=\s*\"([^"]*)
Sample code:
import re
p = re.compile(r'\bMyHost="xxxx"\r?\n.*?MyIp=\s*\"([^"]*)', re.IGNORECASE)
test_str = "blah blah blah MyHost=\"xxxx\"\nagain blah blah blah MyIp= \"x.x.x.x\""
print(p.findall(test_str))  # ['x.x.x.x']

You could do this with the third-party regex module, which supports variable-length lookbehind.
>>> import regex
>>> s = '''blah blah blah MyHost="xxxx"
... foo bar
... again blah blah blah MyIp= "x.x.x.x"
...
... blah blah blah MyHost="xxxx"
... again blah blah blah MyIp= "x.x.x.x"'''
>>> m = regex.search(r'(?<=MyHost="xxxx"[^\n]*\n.*?MyIp=\s*")[^"]*', s)
>>> m.group()
'x.x.x.x'
This matches the value of MyIp only if the string MyHost="xxxx" is present on the previous line.
If you want to list both, try the code below.
>>> m = regex.findall(r'(?<=(MyHost="[^"]*")[^\n]*\n.*?)(MyIp=\s*"[^"]*")', s)
>>> m
[('MyHost="xxxx"', 'MyIp= "x.x.x.x"')]
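If installing the regex module is not an option, a similar result seems reachable with the standard re module by dropping the lookbehind and using two plain capture groups. This is only a sketch: the host and IP values in the sample are made up, and it assumes the quoted values never contain a double quote.

```python
import re

s = '''blah blah blah MyHost="aaaa"
foo bar
again blah blah blah MyIp= "1.1.1.1"

blah blah blah MyHost="bbbb"
again blah blah blah MyIp= "2.2.2.2"'''

# '.' does not cross newlines by default, so MyIp must sit on the line
# directly after the one that contains MyHost.
pairs = re.findall(r'MyHost="([^"]*)"[^\n]*\n.*?MyIp=\s*"([^"]*)"', s)
print(pairs)  # [('bbbb', '2.2.2.2')]
```

The first MyHost is skipped because the line after it ("foo bar") has no MyIp.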

Generally, if you want to use a regex, you need to match "MyHost" and the value that follows it, and "MyIp" and the value that follows it up to the end of the line.
So basically you want to write a regex similar to this one:
MyHost="(\w+)"
This matches MyHost="..." and captures the word characters between the quotes in group 1.
Afterwards you can retrieve the captured value and do the computation you need.
To handle the requirement that the host must match first, a simple if condition is enough: check for the host name before looking for the IP.
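The "check the host first" idea could be sketched as a line-by-line scan. The names HOST_RE, IP_RE and ips_after_host are hypothetical, and the sketch assumes that MyIp only counts when it is on the line immediately after MyHost.

```python
import re

# Hypothetical pattern names; the quoted values are assumed not to contain '"'.
HOST_RE = re.compile(r'\bMyHost="([^"]*)"')
IP_RE = re.compile(r'\bMyIp=\s*"([^"]*)"')

def ips_after_host(text):
    """Return (host, ip) pairs where MyIp is on the line right after MyHost."""
    pairs = []
    lines = text.splitlines()
    # Walk over each line together with the line that follows it.
    for prev_line, line in zip(lines, lines[1:]):
        host = HOST_RE.search(prev_line)
        if host is None:
            continue  # no MyHost on the previous line, so skip the MyIp check
        ip = IP_RE.search(line)
        if ip is not None:
            pairs.append((host.group(1), ip.group(1)))
    return pairs

sample = (
    'blah blah blah MyHost="xxxx"\n'
    'again blah blah blah MyIp= "x.x.x.x"\n'
    'unrelated line\n'
    'orphan MyIp= "y.y.y.y"\n'
)
print(ips_after_host(sample))  # [('xxxx', 'x.x.x.x')]
```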

(?=.*? MyHost=\"xxx\" .*) .*? MyIp=\"(\S+)\" .*
The xxx can be changed as required. The value of MyIp is captured in group 1.
You can use a lookahead in Python: the regex goes on to fetch the IP only when xxx matches.
(?=regex)regex1
This matches regex1 only when regex also matches at the same position.
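A minimal sketch of the lookahead idea in Python follows. It assumes the two lines are searched as one string and uses re.DOTALL so that '.' can cross the newline. Note that a lookahead only confirms that MyHost="xxxx" occurs somewhere ahead of the start of the string; it does not by itself enforce that MyHost is on the line directly above MyIp.

```python
import re

# Assumption: the whole two-line record is searched at once, so DOTALL lets
# '.' cross the newline between the MyHost line and the MyIp line.
pattern = re.compile(
    r'^(?=.*?MyHost="xxxx").*?MyIp=\s*"([^"]*)"',
    re.DOTALL,
)

with_host = 'blah MyHost="xxxx"\nagain blah MyIp= "x.x.x.x"'
without_host = 'blah MyHost="other"\nagain blah MyIp= "x.x.x.x"'

m = pattern.search(with_host)
print(m.group(1) if m else None)     # x.x.x.x
print(pattern.search(without_host))  # None
```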

You should take advantage of short-circuiting, which Python supports. With short-circuiting, the second condition is evaluated only if the first one is true (for AND operations). So your code will look like the following:
patternMatch1(MyHost) and patternMatch2(MyIp)
Here both pattern-match functions would return a truthy value when their pattern matches.
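As a rough illustration of the short-circuit approach, the two functions could return match objects, which are truthy, or None. The helper names match_host and match_ip are hypothetical stand-ins for patternMatch1 and patternMatch2.

```python
import re

def match_host(line):
    """Hypothetical helper: match object if the line has MyHost="...", else None."""
    return re.search(r'MyHost="([^"]*)"', line)

def match_ip(line):
    """Hypothetical helper: match object if the line has MyIp="...", else None."""
    return re.search(r'MyIp=\s*"([^"]*)"', line)

prev_line = 'blah blah blah MyHost="xxxx"'
line = 'again blah blah blah MyIp= "x.x.x.x"'

# `and` calls match_ip only when match_host returned a truthy match object.
ip = match_host(prev_line) and match_ip(line)
print(ip.group(1) if ip else None)  # x.x.x.x

# Without a host on the previous line, the expression stops at None.
ip_without_host = match_host('no host here') and match_ip(line)
print(ip_without_host)  # None
```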
Please let me know if you have any questions!

Related

regex lookback get the first word

I am looking to get the word that precedes a date value using regex. There isn't an exact pattern of words; I am just looking to obtain the first word preceding the date (not the entire string).
What I have so far:
regexDate = r'.*(?=)[w]*\d{2}\/\d{2}\/\d{4}|.*(?=)[a-zA-Z]+\s[0-9][0-9],?\s[0-9][0-9][0-9][0-9]|.*(?=)\d{2}-\d{2}-\d{4}|\d{2}/\w*/\d{4}|.*(?=)\d{2}-\w*-\d{4}|.*(?=)\d{2}.\w*.\d{4}|.*(?=)\d{4}/\d{2}/\d{2}|.*(?=)\d{2}-\w*-\d{2}|.*(?=)\d{2}/\d{2}/\d{2}'
text = 'blah blah Start: 01-Oct-2018 blah blah End: 30-Sep-2019'
find = re.findall(regexDate, text)
result:
['blah blah Start: 01-Oct-2018 blah blah End: 30-Sep-2019']
In this case, I'm just looking to get 'Start: 01-Oct-2018' and 'End: 30-Sep-2019'.
Note: Start and End may be a different set of words in other circumstances. Date formats also differ.
I did try wrapping .*(?=) around the possibilities, but it would just return 30-Sep-2019. Maybe someone can suggest a cleaner version as well?
This seems to do it:
re.findall(r'\b\S+\s+\d\d?-[A-Za-z]{3}-\d{4}\b', text)
returns:
['Start: 01-Oct-2018', 'End: 30-Sep-2019']
The \b at the start forces a start-of-word (word boundary), followed by a run of non-whitespace characters and one or more whitespace characters, then followed by your date expression.
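Since the question mentions that date formats differ, the same "word, whitespace, date" shape could be extended with an alternation of date patterns. The three formats below are only an assumed sample, and the names DATE and WORD_THEN_DATE are illustrative.

```python
import re

# Assumed set of date shapes; extend the alternation for other formats.
DATE = r'(?:\d{1,2}-[A-Za-z]{3}-\d{4}|\d{2}/\d{2}/\d{4}|\d{4}/\d{2}/\d{2})'
# Non-capturing group, so findall returns the whole match (word plus date).
WORD_THEN_DATE = re.compile(r'\b\S+\s+' + DATE + r'\b')

text = 'blah blah Start: 01-Oct-2018 blah blah End: 2019/09/30'
print(WORD_THEN_DATE.findall(text))
# ['Start: 01-Oct-2018', 'End: 2019/09/30']
```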
I suppose you're looking for something like this? (Python's re module spells named groups as (?P<name>...).) (?i)start:\s*(?P<start>\d\d-\w+-\d{4}).*end:\s*(?P<end>\d\d-\w+-\d{4})
Late answer, but here's a way to do it:
[^ ]+: [^ ]+
import re
text = 'blah blah Start: 01-Oct-2018 blah blah End: 30-Sep-2019'
result = re.findall("[^ ]+: [^ ]+", text)
# ['Start: 01-Oct-2018', 'End: 30-Sep-2019']
Use this regex to match either kind of month (number or abbreviation):
import re
regex = r"(\S+\s*\d{2}-\w+-\d{4})"
test_str = "blah blah Start: 01-Oct-2018 blah blah End: 30-Sep-2019"
# test_str = "test test Start:01-06-2018 blah blah End: 30-08-2019"
match = re.findall(regex,test_str)
print(match)
# ['Start: 01-Oct-2018', 'End: 30-Sep-2019']

Python and Regex. Or statement

I will explain my problem with an example. Here are two different versions of my text:
Version 1:
Blah: 1 2345 $ blah blah blah
Version 2:
Blah: 1 2345 $ (9 8546 $) blah blah blah
I am trying to write a regex in Python that returns the number in the parentheses if the text is in Version 2, and the number outside otherwise.
pat = re.compile(r"Blah: [0-9]+\s[0-9]+ /$ \(([0-9]+\s[0-9]+)|Blah: ([0-9]+\s[0-9]+)")
pat.findall(text)
The problem is that it returns ('1 2345', '') or ('', '9 8546') in each case.
How can I change the regex to return only the number?
If you are pretty comfortable with the regex you wrote, then I would suggest not changing it and getting the value like this:
print("".join(pat.findall(text)[0]))
This will just concatenate the matching results. Since the other group captures nothing, you will get a single string.
Note: Also, you need to escape $ in your RegEx, like \$, otherwise it will be considered as the end of line.
Don't use findall. The only situation in which it is useful is when you have a simple regex and you want to get all its matches. Once you start using capturing groups, it easily becomes quite useless.
The finditer method returns the actual match objects created during matching instead of returning the tuples of the matched groups. You can slightly modify your regex to use capturing groups:
pat = re.compile(r'Blah: (\d+\s\d+) \$ (\((\d+\s\d+)\s*\$\))?')
Afterwards, to get the matched number, you can use match.group(3) or match.group(1) to select one or the other, depending on whether there was a parenthesized match:
text = 'Blah: 1 2345 $ (9 8546 $) blah blah blah\nBlah: 1 2345 $ blah blah blah'
[m.group(3) or m.group(1) for m in pat.finditer(text)]
Outputs:
Out[12]: ['9 8546', '1 2345']
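If a single value per match from findall is still the goal, one possible variation is to make the "outside" amount an optional non-capturing prefix, so the pattern has only one capture group. This is a sketch that assumes the exact spacing shown in the question's two versions.

```python
import re

# The optional non-capturing prefix consumes "1 2345 $ (" when a
# parenthesized amount follows, so the single capture group lands on it.
# Otherwise the prefix is skipped and the group captures the outside number.
pat = re.compile(r'Blah: (?:\d+\s\d+ \$ \()?(\d+\s\d+)')

v1 = pat.findall('Blah: 1 2345 $ blah blah blah')
v2 = pat.findall('Blah: 1 2345 $ (9 8546 $) blah blah blah')
print(v1)  # ['1 2345']
print(v2)  # ['9 8546']
```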

Split a string of a specific pattern into three parts

I am given a string which is of this pattern:
[blah blah blah] [more blah] some text
I want to split the string into three parts: blah blah blah, more blah and some text.
A crude way to do it is to use mystr.split('] ') and then remove the leading [ from the first two elements. Is there a better, more performant way? (I need to do this for thousands of strings very quickly.)
You can use a regular expression to extract the text, if you know that it will be in that form. For efficiency, you can precompile the regex and then repeatedly use it when matching.
import re
prog = re.compile(r'\[([^\]]*)\]\s*\[([^\]]*)\]\s*(.*)')
for mystr in string_list:
    result = prog.match(mystr)
    groups = result.groups()
If you'd like an explanation on the regex itself, you can get one using this tool.
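A small runnable sketch of the precompiled approach, with a guard for strings that do not have the expected shape (prog.match returns None for those). The helper name split_parts is hypothetical.

```python
import re

# Same pattern as above, written as a raw string; compiled once and reused.
prog = re.compile(r'\[([^\]]*)\]\s*\[([^\]]*)\]\s*(.*)')

def split_parts(s):
    """Hypothetical helper: return the three parts, or None if s has another shape."""
    m = prog.match(s)
    return m.groups() if m else None

print(split_parts('[blah blah blah] [more blah] some text'))
# ('blah blah blah', 'more blah', 'some text')
print(split_parts('no brackets here'))  # None
```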
You can use a regular expression to split where you want to leave out characters:
>>> import re
>>> s = '[...] [...] ...'
>>> re.split(r'\[|\] *\[?', s)[1:]
['...', '...', '...']

Full expression for findall

I have a regular expression that looks for a url in some text like:
my_urlfinder = re.compile(r'\shttp:\/\/(\S+.|)blah.com/users/(\d+)(\/|)')
text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"
for match in my_urlfinder.findall(text):
    print(match)  # prints a tuple with all the individual parts of the regex
How do I get the entire url? Currently match just prints out the matched parts (which I need for other things)...but I also want the full url.
You should make your groups non-capturing:
my_urlfinder = re.compile(r'\shttp:\/\/(?:\S+.|)blah.com/users/(?:\d+)(?:\/|)')
findall() changes behaviour when the pattern contains capturing groups. With groups, it returns only the groups; without capturing groups, it returns the whole matched text instead.
Demo:
>>> text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"
>>> my_urlfinder = re.compile(r'\shttp:\/\/(?:\S+.|)blah.com/users/(?:\d+)(?:\/|)')
>>> for match in my_urlfinder.findall(text):
...     print(match)
...
 http://blah.com/users/123
 http://blah.com/users/353
An alternative to not using any capturing groups would be to add another one around everything:
my_urlfinder = re.compile(r'\s(http:\/\/(\S+.|)blah.com/users/(\d+)(\/|))')
This will allow you to keep the inner capturing groups while still having the whole result.
For the demo text it would yield these results:
('http://blah.com/users/123', '', '123', '')
('http://blah.com/users/353', '', '353', '')
As a side note, beware that the current expression requires whitespace in front of the URL, so a URL at the very start of the text would not be matched.
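Another way to get both the full URL and the individual parts might be finditer, which yields match objects instead of tuples. In this sketch the dots are escaped and the leading \s is moved into a lookbehind so that it is not part of the match; both are assumed tidy-ups, not part of the original pattern.

```python
import re

# Variation on the question's pattern: escaped dots, and (?<=\s) instead of \s
# so the whitespace before the URL is required but not included in the match.
my_urlfinder = re.compile(r'(?<=\s)http://(\S+\.|)blah\.com/users/(\d+)(/|)')

text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"
results = []
for m in my_urlfinder.finditer(text):
    # group(0) is the whole match; groups() still holds the individual parts.
    results.append((m.group(0), m.groups()))
    print(m.group(0), m.groups())
# http://blah.com/users/123 ('', '123', '')
# http://blah.com/users/353 ('', '353', '')
```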

Python regular expressions - re.search() vs re.findall()

For school I'm supposed to write a Python RE script that extracts IP addresses. The regular expression I'm using seems to work with re.search() but not with re.findall().
exp = "(\d{1,3}\.){3}\d{1,3}"
ip = "blah blah 192.168.0.185 blah blah"
match = re.search(exp, ip)
print(match.group())
The match for that is always 192.168.0.185, but it's different when I do re.findall()
exp = "(\d{1,3}\.){3}\d{1,3}"
ip = "blah blah 192.168.0.185 blah blah"
matches = re.findall(exp, ip)
print(matches[0])
0.
I'm wondering why re.findall() yields 0. when re.search() yields 192.168.0.185, since I'm using the same expression for both functions.
And what can I do to make it so re.findall() will actually follow the expression correctly? Or am I making some kind of mistake?
findall returns a list of matches, and from the documentation:
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
So your original expression had one group that matched three times in the string, and the last match was "0.".
To fix your problem, use exp = "(?:\d{1,3}\.){3}\d{1,3}". With the non-capturing version no groups are returned, so the whole match is returned in both cases.
You're only capturing the "0." in that regex, as it's the last one that's caught.
Change the expression to capture the entire IP, and the repeated part to be a non-capturing group:
In [2]: ip = "blah blah 192.168.0.185 blah blah"
In [3]: exp = "((?:\d{1,3}\.){3}\d{1,3})"
In [4]: m = re.findall(exp, ip)
In [5]: m
Out[5]: ['192.168.0.185']
And if it helps to explain the regex:
In [6]: re.compile(exp, re.DEBUG)
subpattern 1
  max_repeat 3 3
    subpattern None
      max_repeat 1 3
        in
          category category_digit
      literal 46
  max_repeat 1 3
    in
      category category_digit
This explains the subpatterns. Subpattern 1 is what gets captured by findall.
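The difference between the capturing and non-capturing forms can be checked side by side. The second address in the sample string is made up for illustration.

```python
import re

ip = "blah blah 192.168.0.185 and 10.0.0.1 blah"

capturing = r"(\d{1,3}\.){3}\d{1,3}"
non_capturing = r"(?:\d{1,3}\.){3}\d{1,3}"

# With a capturing group, findall returns the group's last repetition per match.
print(re.findall(capturing, ip))      # ['0.', '0.']
# Without capturing groups, findall returns the whole matches.
print(re.findall(non_capturing, ip))  # ['192.168.0.185', '10.0.0.1']
# finditer gives match objects, so group() is the whole match either way.
print([m.group() for m in re.finditer(capturing, ip)])
# ['192.168.0.185', '10.0.0.1']
```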
