How can I ignore empty groups?

How can I ignore empty groups? - python

Pretty straightforward regex, I am trying to extract IP from logs. But group(1) is empty, which is given. Is there a better way to approach this problem?
sourceip_regex_extract = re.compile(r"{}".format(sourceip_syslog_regex))
sourceip_extract = sourceip_regex_extract.search(message)
sourceip_txt = sourceip_extract.group(1)
Regex101: https://regex101.com/r/jmtQci/1

First of all, when you search for a match with a regex, make sure you actually get a match and only then access the first group value.
Next, r"{}".format(sourceip_syslog_regex) makes no sense, it is the same as sourceip_syslog_regex.
To fix the current issue, you can use a (?:from |inside:) alternation to match either from or inside:
sourceip_syslog_regex = r'(?:from |inside:)(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
sourceip_regex_extract = re.compile(sourceip_syslog_regex)
sourceip_extract = sourceip_regex_extract.search(message)
if sourceip_extract:
sourceip_txt = sourceip_extract.group(1)
See the regex demo
Note you can shorten the IP address matching pattern a bit and use (?:from |inside:)(\d{1,3}(?:\.\d{1,3}){3}).
Details:
(?:from |inside:) - either from or inside:
(\d{1,3}(?:\.\d{1,3}){3}) - Group 1: one to three digits and then three occurrences of a . and one to three digits.

Related

Python split before a certain character

I have following string:
BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6
I am trying to split it in a way I would get back the following dict / other data structure:
BUCKET1 -> /dir1/dir2/, BUCKET1 -> /dir3/dir4/, BUCKET2 -> /dir5/dir6/
I can somehow split it if I only have one BUCKET, not multiple, like this:
res.split(res.split(':', 1)[0].replace('.', '').upper()) -> it's not perfect
Input: ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/
Output: [(ADRIAN, /dir1/dir11), (DANIEL, /dir2/), (CULEA, /dir3/), (ADRIAN, /dir5/), (ADRIAN, /dir6/)
As per Wiktor Stribiżew comments, the following regex does the job:
r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"

If you're experienced, I'd recommend learning Regex just as the others have suggested. However, if you're looking for an alternative, here's a way of doing such without Regex. It also produces the output you're looking for.
string = input("Enter:") #Put your own input here.
tempList = string.replace("BUCKET",':').split(":")
outputList = []
for i in range(1,len(tempList)-1,2):
someTuple = ("BUCKET"+tempList[i],tempList[i+1])
outputList.append(someTuple)
print(outputList) #Put your own output here.
This will produce:
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
This code is hopefully easier to understand and manipulate if you're unfamiliar with Regex, although I'd still personally recommend Regex to solve this if you're familiar with how to use it.

Use re.findall() function:
s = "ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/"
result = re.findall(r'(\w+):([^:]+\/)', s)
print(result)
The output:
[('ADRIAN', '/dir1/dir11/'), ('DANIEL', '/dir2/'), ('ADI_BUCKET', '/dir3/'), ('CULEA', '/dir4/'), ('ADRIAN', '/dir5/'), ('ADRIAN', '/dir6/')]

Use regex instead?
impore re
test = 'BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6'
output = re.findall(r'(?P<bucket>[A-Z0-9]+):(?P<path>[/a-z0-9]+)', test)
print(output)
Which gives
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]

It appears you have a list of predefined "buckets" that you want to use as boundaries for the records inside the string.
That means, the easiest way to match these key-value pairs is by matching one of the buckets, then a colon and then any chars not starting a sequence of chars equal to those bucket names.
You may use
r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"
Compile with re.S / re.DOTALL if your values span across multiple lines. See the regex demo.
Details:
(BUCKET1|BUCKET2) - capture group one that matches and stores in .group(1) any of the bucket names
: - a colon
(.*?) - any 0+ chars, as few as possible (as *? is a lazy quantifier), up to the first occurrence of (but not inlcuding)...
(?=(?:BUCKET1|BUCKET2)|$) - any of the bucket names or end of string.
Build it dynamically while escaping bucket names (just to play it safe in case those names contain * or + or other special chars):
import re
buckets = ['BUCKET1','BUCKET2']
rx = r"({0}):(.*?)(?=(?:{0})|$)".format("|".join([re.escape(bucket) for bucket in buckets]))
print(rx)
s = "BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6"
print(re.findall(rx, s))
# => (BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
See the online Python demo.

What should I use the Non-greedy match in this case

Assume I have a string which includes some data fields that are separated by "|", like
|1|2|3|4|5|6|7|8|
My purpose is to get the 8th field. This is what I'm doing:
pattern = re.compile(r'^\s+(\|.*?\|){8}')
match = pattern.match(test_line)
if match:
print:match.group(8)
But looks like it can not match. I know in this case I need to use ? for non-greedy match, but why I can not get the 8th field?
Thanks

Regex might be complicating this problem rather than simplifying it. A simple way to get an eighth item from a | delimited string is using split():
a = '|here|is|some|data|separated|by|bars|hooray!|'
print a.split('|')[8]
RETURNS
hooray!
Using regex, one way to get it would be:
import re
a = '|here|is|some|data|separated|by|bars|hooray!|'
pattern = re.compile(r'([^\|]+)')
match = pattern.findall(a)
print match[7]
RETURNS
hooray!

Optional grouping in a simple python regex

All I want to do is search a string for instances of two consecutive digits. If such an instance is found I want to group it, otherwise return none for that particular groups. I thought this would be trivial, but I can't understand where I'm going wrong. In the example below, removing the optional (?) character gets me the numbers, but in strings without numbers, the r evaluates to None, so r.groups() throws an exception.
p = re.compile(r'(\d{2})?')
r = p.search('wqddsel78ffgr')
print r.groups()
>>>(None, ) # why not ('78', )?
# --- update/clarification --- #
Thanks for the answers, but the explanations given are leaving me none-the-wiser. Here's a another go at pin-pointing exactly what it is I don't understand.
pattern = re.compile(r'z.*(A)?')
_string = "aazaa90aabcdefA"
result = pattern.search(_string)
result.group()
>>> zaa90aabcdefA
result.groups()
>>> (None, )
I understand why result.group() produces the result it does, but why doesn't result.groups() produce ('A', )? I thought it worked like this: once the regex hits the z it then matches right to the end of the line using .*. In spite of .* matching everything, the regex engine is aware that it passed over an optional group, and since ? means it will try to match if it can, it should work backwards to try and match. Replacing ? with + does return ('A', ). This suggests that ? won't try and match if it doesn't have to, but this seems to contrast with much of what I've read on the subject (esp. J. Friedl's excellent book).

This works for me:
p = re.compile('\D*(\d{2})?')
r = p.search('wqddsel78ffgr')
print r.groups() # ('78',)
r = p.search('wqddselffgr')
print r.groups() # (None,)

Use regex pattern
(\d{2}|(?!.*\d{2}))
(see this demo)
If you want be sure there are exactly 2 consecutive digits and not 3 or more, go with
((?<!\d)\d{2}(?!\d)|(?!.*(?<!\d)\d{2}(?!\d)))
(see this demo)

The ? makes your regex match the empty string. If you omit it, you could just check the result like this:
p = re.compile(r'(\d{2})')
r = p.search('wqddsel78ffgr')
print r.groups() if r else ('',)

Remember that you can search for all matches of a RE in a string easily using findall():
re.findall(r'\d{2}', 'wqddsel78ffgr') # => ['78']
If you don't need the positions where the match occurs, this seems like a simpler way to accomplish what you're doing.

? - is 0 or 1 repetitions. So the regex processor first tries to find 0 repetitions, and... finds it :)

Python Regex match or potential match

Question:
How do I use Python's regular expression module (re) to determine if a match has been made, or that a potential match could be made?
Details:
I want a regex pattern which searches for a pattern of words in a correct order regardless of what's between them. I want a function which returns Yes if found, Maybe if a match could still be found or No if no match can be found. We are looking for the pattern One|....|Two|....|Three, here are some examples (Note the names, their count, or their order are not important, all I care about is the three words One, Two and Three, and the acceptable words in between are John, Malkovich, Stamos and Travolta).
Returns YES:
One|John|Malkovich|Two|John|Stamos|Three|John|Travolta
Returns YES:
One|John|Two|John|Three|John
Returns YES:
One|Two|Three
Returns MAYBE:
One|Two
Returns MAYBE:
One
Returns NO:
Three|Two|One
I understand the examples are not airtight, so here is what I have for the regex to get YES:
if re.match('One\|(John\||Malkovich\||Stamos\||Travolta\|)*Two\|(John\||Malkovich\||Stamos\||Travolta\|)*Three\|(John\||Malkovich\||Stamos\||Travolta\|)*', 'One|John|Malkovich|Two|John|Stamos|Three|John|Travolta') != None
return 'Yes'
Obviously if the pattern is Three|Two|One the above will fail, and we can return No, but how do I check for the Maybe case? I thought about nesting the parentheses, like so (note, not tested)
if re.match('One\|((John\||Malkovich\||Stamos\||Travolta\|)*Two(\|(John\||Malkovich\||Stamos\||Travolta\|)*Three\|(John\||Malkovich\||Stamos\||Travolta\|)*)*)*', 'One|John|Malkovich|Two|John|Stamos|Three|John|Travolta') != None
return 'Yes'
But I don't think that will do what I want it to do.
More Details:
I am not actually looking for Travoltas and Malkovichs (shocking, I know). I am matching against inotify Patterns such as IN_MOVE, IN_CREATE, IN_OPEN, and I am logging them and getting hundreds of them, then I go in and then look for a particular pattern such as IN_ACCESS...IN_OPEN....IN_MODIFY, but in some cases I don't want an IN_DELETE after the IN_OPEN and in others I do. I'm essentially pattern matching to use inotify to detect when text editors gone wild and they try to crush programmers souls by doing a temporary-file-swap-save instead of just modifying the file. I don't want to free up those logs instantly, but I only want to hold on to them for as long as is necessary. Maybe means dont erase the logs. Yes means do something then erase the log and No means don't do anything but still erase the logs. As I will have multiple rules for each program (ie. vim v gedit v emacs) I wanted to use a regular expression which would be more human readable and easier to write then creating a massive tree, or as user Joel suggested, just going over the words with a loop

I wouldn't use a regex for this. But it's definitely possible:
regex = re.compile(
r"""^ # Start of string
(?: # Match...
(?: # one of the following:
One() # One (use empty capturing group to indicate match)
| # or
\1Two() # Two if One has matched previously
| # or
\1\2Three() # Three if One and Two have matched previously
| # or
John # any of the other strings
| # etc.
Malkovich
|
Stamos
|
Travolta
) # End of alternation
\|? # followed by optional separator
)* # any number of repeats
$ # until the end of the string.""",
re.VERBOSE)
Now you can check for YES and MAYBE by checking if you get a match at all:
>>> yes = regex.match("One|John|Malkovich|Two|John|Stamos|Three|John|Travolta")
>>> yes
<_sre.SRE_Match object at 0x0000000001F90620>
>>> maybe = regex.match("One|John|Malkovich|Two|John|Stamos")
>>> maybe
<_sre.SRE_Match object at 0x0000000001F904F0>
And you can differentiate between YES and MAYBE by checking whether all of the groups have participated in the match (i. e. are not None):
>>> yes.groups()
('', '', '')
>>> maybe.groups()
('', '', None)
And if the regex doesn't match at all, that's a NO for you:
>>> no = regex.match("Three|Two|One")
>>> no is None
True

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski
Perhaps an algorithm like this would be more appropriate. Here is some pseudocode.
matchlist.current = matchlist.first()
for each word in input
if word = matchlist.current
matchlist.current = matchlist.next() // assuming next returns null if at end of list
else if not allowedlist.contains(word)
return 'No'
if matchlist.current = null // we hit the end of the list
return 'Yes'
return 'Maybe'

Matching alternative regexps in Python

I'm using Python to parse a file in search for e-mail addresses, but I can't figure out what the syntax for alternative regexps should be. Here's the code:
addresses = []
pattern = '(\w+)#(\w+\.com)|(\w+)#(it.\w+\.com)'
for line in file:
matches = re.findall(pattern,line)
for m in matches:
address = '%s#%s' % m
addresses.append(address)
So I want to find addresses that look like john#company.com or john#it.company.com, but the above code doesn't work because either the first two groups are empty or the last two groups are empty. What is the correct solution? I need to use groups to store the user name (before #) and server name (after #) separately.
EDIT: Matching email adresses is only an example. What I'm trying to find out is how to match different regexps that have only one thing in common - they match two groups.

(\w+)#((?:it\.)?\w+\.com)
You want to capture the part after the # whether it's example.com or it.example.com, so you put both options inside the same capture group. But since they share a similar format, you can condense (it\.\w+\.com|\w+\.com) to just ((it\.)?\w+\.com)
The (?: ) makes that parens a non-capturing group, so it won't take part in your matched groups. There will be one match for the first (\w+), and one match for the whole ((?:it\.)?\w+\.com) after the #. That's two matches total, plus the default group-0 match for the full string.
EDIT: To answer your new question, all you have to do is follow the grouping I used, but stop before you condense it.
If your test cases are:
1) example#abcdef
2) example#123456
You could write your regex as such: (\w+)#([a-zA-Z]+|\d+), which would always have the part before the # in group 1, and the part after in group 2. Notice that there are only two pairs of parens, and the |("or") operator appears inside of the second parens group.

I once found here a well written email regex, it was build for extracting a wide range of valid email adresses from a generic string, so it should also be able to do what you're looking for.
Example:
>>> email_regex = re.compile("""((([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~]+|"([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~(),:;<>#\[\]\.]|\\[ \\"])*")\.)*([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~]+|"([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~(),:;<>#\[\]\.]|\\[ \\"])*"))#((([a-zA-Z0-9]([a-zA-Z0-9]*(\-[a-zA-Z0-9]*)*)?\.)*[a-zA-Z]+|\[((0?\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\.){3}(0?\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\]|\[[Ii][Pp][vV]6(:[0-9a-fA-F]{0,4}){6}\]))""")
>>>
>>> m = email_regex.search('john#it.company.com')
>>> m.group(0)
'john#it.company.com'
>>> m.group(1)
'john'
>>> m.group(7)
'it.company.com'
>>>
>>> n = email_regex.search('john#company.com')
>>> n.group(0)
'john#company.com'
>>> n.group(1)
'john'
>>> n.group(7)
'company.com'

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can I ignore empty groups? - python

Related

Python split before a certain character

What should I use the Non-greedy match in this case

Optional grouping in a simple python regex

Python Regex match or potential match

Matching alternative regexps in Python

Categories

Resources