Python regular expressions - re.search() vs re.findall() - python

For school I'm supposed to write a Python RE script that extracts IP addresses. The regular expression I'm using seems to work with re.search() but not with re.findall().
import re
exp = r"(\d{1,3}\.){3}\d{1,3}"
ip = "blah blah 192.168.0.185 blah blah"
match = re.search(exp, ip)
print(match.group())
The match for that is always 192.168.0.185, but it's different when I do re.findall():
exp = r"(\d{1,3}\.){3}\d{1,3}"
ip = "blah blah 192.168.0.185 blah blah"
matches = re.findall(exp, ip)
print(matches[0])
0.
I'm wondering why re.findall() yields 0. when re.search() yields 192.168.0.185, since I'm using the same expression for both functions.
And what can I do to make it so re.findall() will actually follow the expression correctly? Or am I making some kind of mistake?

findall returns a list of matches, and from the documentation:
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
So your original expression had one group that matched three times in the string, and the last of those matches was "0.", which is what findall returned.
To fix your problem, use exp = r"(?:\d{1,3}\.){3}\d{1,3}". With the non-capturing version there are no groups to return, so the whole match is returned in both cases.
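A minimal sketch of the difference, reusing the pattern and sample text from the question (the variable names are illustrative only):

```python
import re

text = "blah blah 192.168.0.185 blah blah"

capturing = r"(\d{1,3}\.){3}\d{1,3}"        # one capturing group, repeated 3 times
non_capturing = r"(?:\d{1,3}\.){3}\d{1,3}"  # same pattern without a capturing group

# re.search returns a match object; group() with no argument is the whole match.
whole_match = re.search(capturing, text).group()

# With one capturing group, findall returns that group's last captured value.
with_group = re.findall(capturing, text)

# Without capturing groups, findall returns the whole matched text.
without_group = re.findall(non_capturing, text)

print(whole_match)    # 192.168.0.185
print(with_group)     # ['0.']
print(without_group)  # ['192.168.0.185']
```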

You're only capturing the 0 in that regex, as it'll be the last one that's caught.
Change the expression to capture the entire IP, and the repeated part to be a non-capturing group:
In [2]: ip = "blah blah 192.168.0.185 blah blah"
In [3]: exp = "((?:\d{1,3}\.){3}\d{1,3})"
In [4]: m = re.findall(exp, ip)
In [5]: m
Out[5]: ['192.168.0.185']
And if it helps to explain the regex:
In [6]: re.compile(exp, re.DEBUG)
subpattern 1
  max_repeat 3 3
    subpattern None
      max_repeat 1 3
        in
          category category_digit
      literal 46
  max_repeat 1 3
    in
      category category_digit
This explains the subpatterns. Subpattern 1 is what gets captured by findall.
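If the capturing group has to stay in the pattern, one alternative (not mentioned in the answers above, so treat it as a sketch) is re.finditer, which yields match objects so that both the whole match and the individual groups stay available. The second address in the sample text is an assumption added for illustration:

```python
import re

text = "blah blah 192.168.0.185 blah 10.0.0.1 blah"
pattern = re.compile(r"(\d{1,3}\.){3}\d{1,3}")  # original capturing pattern

# group(0) is always the entire match, regardless of capturing groups.
addresses = [m.group(0) for m in pattern.finditer(text)]

# group(1) still holds only the last repetition of the repeated group.
last_repeats = [m.group(1) for m in pattern.finditer(text)]

print(addresses)     # ['192.168.0.185', '10.0.0.1']
print(last_repeats)  # ['0.', '0.']
```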

Related

RegEx pattern not behaving as wanted

I am using regex pattern
[^A-Za-z](email,|help|BGN|won't|go|corner|issues|disconected|We|group|No|send|Bv|connecting|has|Pittsburgh,|Many|(Akustica,|Toluca|cannot|Restarting|they|not|PI2|one|condition|entire|LAN|experincing|bar|Exchange,|server|Are|PA)|OutLook|right|says|Rose|Montalvo|back|computer|are|Jane|thier|Disconnected|Nrd|and/or|network|for|Appears|e-mail|unable|Connected|then|Broadview,|issue|email|shows|available|be|we|exchange|error|address|based|My|Microsoft|received|working|created|receive|impacted|WIFI|through|connection|including|or|IL|outlook|via|facility|Everyone's|servers|Also|message|"The|your|Status|doesn't|service|SI-MBX82.de.bosch.com,|next|appears|"disconnected"|Encryption|eMail/file|today|"Waiting|"send/receive"|but|it|trying|SAP|disconnected|e-mails|this|getting|can|of|connect|Incorrect|manually|is|site|an|folder"|cant|Other|have|in|Receiving|if|Plant|no|SI-MBX80.de.bosch.com|that|when|online|persists."|Customer|administrator|users|update|applications|"Disconnected"|SI-MBX81.de.bosch.com|The|on|lower|Some|It|contact|In|the|having)[^A-Za-z]
And applying but it is not able to find "Jane" in the sentence
"Issue with eMail/file Encryption Incorrect email address created for Jane Rose Montalvo."
While Jane is present in the above pattern that I am using.
What could be the reason?
The problem is that your regex consumes the non-letter character (here, whitespace) before and after each word as part of the match.
Hello Jane
Once Hello and the space after it have been matched, that space is already consumed, so Jane has no unconsumed character before it and cannot match. You should make the delimiter an assertion rather than part of the match.
Use (?<=[^a-zA-Z]) instead of a plain [^a-zA-Z]. See the demo:
http://regex101.com/r/lU7jH1/9
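A small sketch of the effect, using the sentence from the question and a shortened, hypothetical word list in place of the full alternation. It uses lookarounds on both sides, which is a slight variation on the suggestion above:

```python
import re

sentence = ("Issue with eMail/file Encryption Incorrect email address "
            "created for Jane Rose Montalvo.")
words = r"(for|Jane|Rose|Montalvo)"  # shortened, hypothetical word list

# Consuming delimiters: the space after one word is used up, so the next
# word has no leading delimiter left to match.
consuming = re.findall(r"[^A-Za-z]" + words + r"[^A-Za-z]", sentence)

# Lookarounds assert the delimiters without consuming them.
asserting = re.findall(r"(?<=[^A-Za-z])" + words + r"(?=[^A-Za-z])", sentence)

print(consuming)  # ['for', 'Rose']
print(asserting)  # ['for', 'Jane', 'Rose', 'Montalvo']
```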
This happens because the delimiter characters overlap between adjacent matches. Put a capturing group inside a lookahead in order to capture the overlapping matches:
(?=[^A-Za-z](email,|help|BGN|won't|go|corner|issues|disconected|We|group|No|send|Bv|connecting|has|Pittsburgh,|Many|(Akustica,|Toluca|cannot|Restarting|they|not|PI2|one|condition|entire|LAN|experincing|bar|Exchange,|server|Are|PA)|OutLook|right|says|Rose|Montalvo|back|computer|are|Jane|thier|Disconnected|Nrd|and/or|network|for|Appears|e-mail|unable|Connected|then|Broadview,|issue|email|shows|available|be|we|exchange|error|address|based|My|Microsoft|received|working|created|receive|impacted|WIFI|through|connection|including|or|IL|outlook|via|facility|Everyone's|servers|Also|message|"The|your|Status|doesn't|service|SI-MBX82\.de\.bosch\.com,|next|appears|"disconnected"|Encryption|eMail/file|today|"Waiting|"send/receive"|but|it|trying|SAP|disconnected|e-mails|this|getting|can|of|connect|Incorrect|manually|is|site|an|folder"|cant|Other|have|in|Receiving|if|Plant|no|SI-MBX80\.de\.bosch\.com|that|when|online|persists\."|Customer|administrator|users|update|applications|"Disconnected"|SI-MBX81\.de\.bosch.com|The|on|lower|Some|It|contact|In|the|having)[^A-Za-z])
If for some reason you cannot or do not want to modify your pattern and you have overlapping matches that you want to capture, you can use re.search in a loop - moving the starting point for the search to the character just after the beginning of the previous match.
# s is the sentence and p the pattern string from the question
# recursive
def foo(s, p, start=0):
    m = p.search(s, start)
    if not m:
        return ''
    return m.group() + foo(s, p, m.start() + 1)

# iterative
def foo1(s, p):
    result = ''
    m = p.search(s, 0)
    while m:
        result += m.group()
        m = p.search(s, m.start() + 1)
    return result

print(foo(s, re.compile(p)))
print(foo1(s, re.compile(p)))
>>>
eMail/file Encryption Incorrect email address created for Jane Rose Montalvo.
eMail/file Encryption Incorrect email address created for Jane Rose Montalvo.
>>>

Match a pattern only when previous pattern matches

I have a situation where I need to match a pattern only when a previous regex pattern has matched. The two patterns are different and match on different lines. For example,
Text:
blah blah blah MyHost="xxxx"
again blah blah blah MyIp= "x.x.x.x"
I am only interested in what comes after MyHost and MyIp. I also have a requirement that MyIp should match only when there is a match (MyHost="xxxx") on the line above.
I am able to match the MyHost value and the MyIp value separately, but I am having a hard time finding logic that matches both as required. Please note I am fairly new to Python; I searched a lot and ended up here.
MyIp should match only when there is a match(MyHost="xxxx") in the above line.
Get the matched group at index 1, using a lazy quantifier. You already know what comes next after MyHost:
\bMyHost="xxxx"\r?\n.*?MyIp=\s*\"([^"]*)
sample code:
import re
p = re.compile(r'\bMyHost="xxxx"\r?\n.*?MyIp=\s*"([^"]*)', re.IGNORECASE)
test_str = u"blah blah blah MyHost=\"xxxx\"\nagain blah blah blah MyIp= \"x.x.x.x\""
re.findall(p, test_str)
You could do this through regex module.
>>> import regex
>>> s = '''blah blah blah MyHost="xxxx"
... foo bar
... again blah blah blah MyIp= "x.x.x.x"
...
... blah blah blah MyHost="xxxx"
... again blah blah blah MyIp= "x.x.x.x"'''
>>> m = regex.search(r'(?<=MyHost="xxxx"[^\n]*\n.*?MyIp=\s*")[^"]*', s)
>>> m.group()
'x.x.x.x'
This matches the value of MyIp only if the string MyHost="xxxx" is present on the previous line.
If you want to list both, then try the code below.
>>> m = regex.findall(r'(?<=(MyHost="[^"]*")[^\n]*\n.*?)(MyIp=\s*"[^"]*")', s)
>>> m
[('MyHost="xxxx"', 'MyIp= "x.x.x.x"')]
Generally, if you want to use a regex, you need to match "MyHost" and everything that follows it, and "MyIp" and everything that follows it up to the end of the line.
So basically you want to write a regex similar to this one:
MyHost="\w+"
This matches MyHost="..." where the quoted value consists of word characters (\w+). Wrap \w+ in parentheses to capture the value;
afterwards you can retrieve the captured value and do the computation you need.
To handle the requirement that the host must match first,
a simple if condition can check the host name before looking for the IP.
(?=.*? MyHost=\"xxx\" .*) .*? MyIp=\"(\S+)\" .*
The xxx can be changed as required. The MyIp value will be captured.
You can use a lookahead in Python. Only when xxx matches will the regex go on to fetch the IP.
(?=regex)regex1
This matches regex1 only when regex has matched.
You should take advantage of short-circuiting, which Python supports. With short-circuiting, the second condition is evaluated only if the first one is true (for and operations). So your code will look like the following:
patternMatch1(MyHost) and patternMatch2(MyIp)
Here both pattern-match functions would return a true value when their pattern matches.
Please let me know if you have any questions!
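The "check the host first, then the IP" idea can also be sketched without a multi-line regex by walking the text line by line. The function name host_ip_pairs and the sample host and IP values are hypothetical, chosen only for illustration:

```python
import re

host_re = re.compile(r'MyHost="([^"]*)"')
ip_re = re.compile(r'MyIp=\s*"([^"]*)"')

def host_ip_pairs(text):
    """Return (host, ip) pairs where MyIp sits on the line right after MyHost."""
    pairs = []
    previous_host = None  # host found on the previous line, if any
    for line in text.splitlines():
        ip_match = ip_re.search(line)
        # Accept MyIp only when the previous line contained MyHost.
        if previous_host is not None and ip_match:
            pairs.append((previous_host, ip_match.group(1)))
        host_match = host_re.search(line)
        previous_host = host_match.group(1) if host_match else None
    return pairs

sample = '''blah blah blah MyHost="alpha"
foo bar
again blah blah blah MyIp= "1.1.1.1"
blah blah blah MyHost="beta"
again blah blah blah MyIp= "2.2.2.2"'''

print(host_ip_pairs(sample))  # [('beta', '2.2.2.2')]
```

The first MyIp is skipped because an unrelated line separates it from its MyHost line.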

Python and Regex. Or statement

I will explain my problem with an example. Here is two different version of my text:
Version 1:
Blah: 1 2345 $ blah blah blah
Version 2:
Blah: 1 2345 $ (9 8546 $) blah blah blah
I try to write a regex in Python where if the text is in Version 2, then it will return the number in the parenthesis. Otherwise, it will return the number outside.
pat = re.compile(r"Blah: [0-9]+\s[0-9]+ /$ \(([0-9]+\s[0-9]+)|Blah: ([0-9]+\s[0-9]+)")
pat.findall(text)
The problem is that it returns ('1 2345', '') or ('', '9 8546') in each case.
How can I change the regex to return only the number?
If you are pretty comfortable with the RegEx you wrote, then I would suggest not to change the RegEx and get the value like this
print "".join(pat.findall(text)[0])
This will just concatenate the matching results. Since the other group captures nothing, you will get a single string.
Note: Also, you need to escape $ in your RegEx, like \$, otherwise it will be considered as the end of line.
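A short sketch combining both points, assuming the question's alternation with the dollar sign escaped as \$ (the helper name extract_number is hypothetical):

```python
import re

# Same alternation as in the question, with the dollar sign escaped as \$.
pat = re.compile(
    r"Blah: [0-9]+\s[0-9]+ \$ \(([0-9]+\s[0-9]+)|Blah: ([0-9]+\s[0-9]+)"
)

version1 = "Blah: 1 2345 $ blah blah blah"
version2 = "Blah: 1 2345 $ (9 8546 $) blah blah blah"

def extract_number(text):
    # findall returns one tuple per match; exactly one slot is non-empty,
    # so joining the tuple yields the number on its own.
    return "".join(pat.findall(text)[0])

print(extract_number(version1))  # 1 2345
print(extract_number(version2))  # 9 8546
```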
Don't use findall. The only situation in which it is useful is when you have a simple regex and you want to get all its matches. Once you start adding capturing groups, it easily becomes quite useless.
The finditer method returns the actual match objects created during matching instead of returning the tuples of the matched groups. You can slightly modify your regex to use capturing groups:
pat = re.compile(r'Blah: (\d+\s\d+) \$ (\((\d+\s\d+)\s*\$\))?')
Afterwards to get the matched number you can use match.group(3) or match.group(1) to select one or the other depending whether there was a parenthesized match:
text = 'Blah: 1 2345 $ (9 8546 $) blah blah blah\nBlah: 1 2345 $ blah blah blah'
[m.group(3) or m.group(1) for m in pat.finditer(text)]
Outputs:
Out[12]: ['9 8546', '1 2345']

Python re.finditer match.groups() does not contain all groups from match

I am trying to use regex in Python to find and print all matching lines from a multiline search.
The text that I am searching through may have the below example structure:
AAA
ABC1
ABC2
ABC3
AAA
ABC1
ABC2
ABC3
ABC4
ABC
AAA
ABC1
AAA
From this I want to retrieve the ABC*s that occur at least once and are preceded by an AAA.
The problem is that, despite the group catching what I want:
match = <_sre.SRE_Match object; span=(19, 38), match='AAA\nABC2\nABC3\nABC4\n'>
... I can access only the last match of the group:
match groups = ('AAA\n', 'ABC4\n')
Below is the example code that I use for this problem.
#! python
import sys
import re
import os
string = "AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nABC2\nABC3\nABC4\nABC\nAAA\nABC1\nAAA\n"
print(string)
p_MATCHES = []
p_MATCHES.append( (re.compile('(AAA\n)(ABC[0-9]\n){1,}')) ) #
matches = re.finditer(p_MATCHES[0],string)
for match in matches:
    strout = ''
    gr_iter=0
    print("match = "+str(match))
    print("match groups = "+str(match.groups()))
    for group in match.groups():
        gr_iter+=1
        sys.stdout.write("TEST GROUP:"+str(gr_iter)+"\t"+group) # test output
        if group is not None:
            if group != '':
                strout+= '"'+group.replace("\n","",1)+'"'+'\n'
    sys.stdout.write("\nCOMPLETE RESULT:\n"+strout+"====\n")
Here is your regular expression:
(AAA\r\n)(ABC[0-9]\r\n){1,}
Debuggex Demo
Your goal is to capture all ABC#s that immediately follow AAA. As you can see in this Debuggex demo, all ABC#s are indeed being matched (they're highlighted in yellow). However, since only the "what is being repeated" part
ABC[0-9]\r\n
is being captured (is inside the parentheses), and its quantifier,
{1,}
is not being captured, all repetitions except the final one are discarded. To get them, you must also capture the quantifier:
AAA\r\n((?:ABC[0-9]\r\n){1,})
Debuggex Demo
I've placed the "what is being repeated" part (ABC[0-9]\r\n) into a non-capturing group. (I've also stopped capturing AAA, as you don't seem to need it.)
The captured text can be split on the newline, and will give you all the pieces as you wish.
(Note that \n by itself doesn't work in Debuggex. It requires \r\n.)
This is a workaround. Not many regular expression flavors offer the capability of iterating through repeating captures. A more normal approach is to loop through the matches and process each one as it is found. Here's an example from Java:
import java.util.regex.*;
public class RepeatingCaptureGroupsDemo {
    public static void main(String[] args) {
        String input = "I have a cat, but I like my dog better.";
        Pattern p = Pattern.compile("(mouse|cat|dog|wolf|bear|human)");
        Matcher m = p.matcher(input);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}
Output:
cat
dog
(From http://ocpsoft.org/opensource/guide-to-regular-expressions-in-java-part-1/, about a 1/4 down)
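In Python, the same "loop over the matches" idea might look like the sketch below. It assumes the sample string from the question and uses two passes: an outer finditer for each AAA block and an inner findall for each repetition inside it. The names block_re and item_re are illustrative:

```python
import re

string = ("AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nABC2\nABC3\nABC4\n"
          "ABC\nAAA\nABC1\nAAA\n")

block_re = re.compile(r"AAA\n((?:ABC[0-9]\n)+)")  # a run of ABC lines after AAA
item_re = re.compile(r"ABC[0-9]")                 # a single ABC line in that run

# Outer loop: one match per AAA block. Inner findall: every repetition in it.
blocks = [item_re.findall(m.group(1)) for m in block_re.finditer(string)]

for block in blocks:
    print(block)
# ['ABC1', 'ABC2', 'ABC3']
# ['ABC1', 'ABC2', 'ABC3', 'ABC4']
# ['ABC1']
```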
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. The links in this answer come from it.
You want the run of consecutive ABC\n lines that follows an AAA\n, matched greedily. You also want only that whole run, not a tuple of the run and the most recent ABC\n. So in your regex, exclude the subgroup within the group.
First write the pattern that represents the whole string:
AAA\n(ABC[0-9]\n)+
Then capture the one you are interested in with (), while remembering to exclude subgroup(s)
AAA\n((?:ABC[0-9]\n)+)
You can then use either findall() or finditer(). I find finditer() easier, especially when you are dealing with more than one capture.
finditer:
import re
matches_iter = re.finditer(r'AAA\n((?:ABC[0-9]\n)+)', string)
for m in matches_iter:
    print(m.group(1))
findall, using the original {1,}, which is a more verbose form of +:
matches_all = re.findall(r'AAA\n((?:ABC[0-9]\n){1,})', string)
for block in matches_all:
    for line in block.split("\n"):
        print(line)

Full expression for findall

I have a regular expression that looks for a url in some text like:
my_urlfinder = re.compile(r'\shttp:\/\/(\S+.|)blah.com/users/(\d+)(\/|)')
text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"
for match in my_urlfinder.findall(text):
    print match #prints an array with all the individual parts of the regex
How do I get the entire url? Currently match just prints out the matched parts (which I need for other things)...but I also want the full url.
You should make your groups non-capturing:
my_urlfinder = re.compile(r'\shttp:\/\/(?:\S+.|)blah.com/users/(?:\d+)(?:\/|)')
findall() changes behaviour when there are capturing groups. With capturing groups, it returns only the groups; without them, it returns the whole matched text instead.
Demo:
>>> text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"
>>> my_urlfinder = re.compile(r'\shttp:\/\/(?:\S+.|)blah.com/users/(?:\d+)(?:\/|)')
>>> for match in my_urlfinder.findall(text):
...     print match
...
http://blah.com/users/123
http://blah.com/users/353
An alternative to not using any capturing groups would be to add another one around everything:
my_urlfinder = re.compile(r'\s(http:\/\/(\S+.|)blah.com/users/(\d+)(\/|))')
This will allow you to keep the inner capturing groups while still having the whole result.
For the demo text it would yield these results:
('http://blah.com/users/123', '', '123', '')
('http://blah.com/users/353', '', '353', '')
As a side note, beware that the current expression requires a whitespace character in front of the URL, so a URL at the very start of the text would not be matched.
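One possible way around the leading-whitespace requirement, sketched under a few assumptions: the negative lookbehind (?<!\S) replaces the leading \s, the dots are escaped, and finditer is used so that both the full URL and the inner groups are available. None of these choices come from the answers above, and the sample text is made up for illustration:

```python
import re

text = ("http://blah.com/users/7 blah http://blah.com/users/123 "
        "blah http://www.blah.com/users/353/")

# (?<!\S) asserts that no non-whitespace character precedes the URL, so a URL
# at the very start of the text matches and no whitespace is consumed.
url_re = re.compile(r"(?<!\S)http://(\S+\.|)blah\.com/users/(\d+)(/|)")

# group(0) is the full URL; groups() keeps the individual parts.
results = [(m.group(0), m.groups()) for m in url_re.finditer(text)]

for url, parts in results:
    print(url, parts)
# http://blah.com/users/7 ('', '7', '')
# http://blah.com/users/123 ('', '123', '')
# http://www.blah.com/users/353/ ('www.', '353', '/')
```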
