I'm using Python to parse an SDDL string with a regex. The SDDL always has the form 'type:some text', repeated up to 4 times. The type is one of 'O', 'G', 'D', or 'S', followed by a colon. The 'some text' part varies in length.
Here is a sample SDDL:
O:DAG:S-1-5-21-2021943911-1813009066-4215039422-1735D:(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)S:NO_ACCESS_CONTROL
Here is what I have so far. Two of the tuples are returned just fine, but the other two - ('G','S-1-5-21-2021943911-1813009066-4215039422-1735') and ('S','NO_ACCESS_CONTROL') are not.
import re
sddl="O:DAG:S-1-5-21-2021943911-1813009066-4215039422-1735D:(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)S:NO_ACCESS_CONTROL"
matches = re.findall('(.):(.*?).:',sddl)
print(matches)
[('O', 'DA'), ('D', '(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)')]
What I'd like to have returned is:
[('O', 'DA'), ('G','S-1-5-21-2021943911-1813009066-4215039422-1735'), ('D', '(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)'),('S','NO_ACCESS_CONTROL')]
Try the following:
(.):(.*?)(?=.:|$)
Example:
>>> re.findall(r'(.):(.*?)(?=.:|$)', sddl)
[('O', 'DA'), ('G', 'S-1-5-21-2021943911-1813009066-4215039422-1735'), ('D', '(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)'), ('S', 'NO_ACCESS_CONTROL')]
This regex starts out the same way as yours, but instead of including the .: at the end as a part of the match, a lookahead is used. This is necessary because re.findall() will not return overlapping matches, so you need each match to stop before the next match begins.
The lookahead (?=.:|$) essentially means "match only if the next characters are anything followed by a colon, or we are at the end of the string".
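To make the difference between consuming the next key and merely peeking at it concrete, here is a minimal sketch that runs both patterns on the sample string from the question. The final dict conversion is only an illustration and assumes each type letter occurs at most once.

```python
import re

sddl = ("O:DAG:S-1-5-21-2021943911-1813009066-4215039422-1735"
        "D:(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)S:NO_ACCESS_CONTROL")

# Original pattern: the trailing ".:" is consumed, so the next key is
# swallowed by the current match and every other pair is lost.
consuming = re.findall(r'(.):(.*?).:', sddl)

# Lookahead version: the next key is checked but not consumed,
# so it is still available as the start of the following match.
peeking = re.findall(r'(.):(.*?)(?=.:|$)', sddl)

print(consuming)
print(peeking)

# Assumption: each type letter appears at most once, so a dict is safe.
by_type = dict(peeking)
print(by_type['G'])
```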
It seems like using regex isn't the best solution to this problem. Really, all you want to do is split across the colons and then do some transformations on the resulting list.
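chunks = sddl.split(':')
# Each key is the last character of one chunk; its value is the next chunk,
# minus that chunk's trailing key character (the final chunk has none).
pairs = [(chunks[i][-1],
          chunks[i+1][:-1] if i < len(chunks) - 2 else chunks[i+1])
         for i in range(len(chunks) - 1)]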
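The same split-based idea can be written as a small function with an explicit loop, which may be easier to follow. This is only a sketch: the name split_sddl is made up for illustration, and it assumes that the text after each type never contains a colon of its own.

```python
def split_sddl(sddl):
    """Split 'K:value' runs into (key, value) pairs.

    Assumption: values never contain ':' themselves.
    """
    chunks = sddl.split(':')
    pairs = []
    for i in range(len(chunks) - 1):
        key = chunks[i][-1]          # last character before the colon
        value = chunks[i + 1]
        if i < len(chunks) - 2:
            value = value[:-1]       # drop the next pair's key character
        pairs.append((key, value))
    return pairs


sddl = ("O:DAG:S-1-5-21-2021943911-1813009066-4215039422-1735"
        "D:(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)S:NO_ACCESS_CONTROL")
print(split_sddl(sddl))
```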
I have a file (.VAR) that gives me a position and a length within a string on each row; see the example below.
*STRING1 1L8:StringONE
*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO
*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE
How do I retrieve the " xxLxxx:" value, which is always preceded by a space and always ends with a colon, but is never at the same location within the string?
Preferably I would like to get the number before L as the position and the number after L as the length, but searching only for "L" would also match other values within the string. Therefore I think I have to match the whole space_number_L_number_colon sequence to recognize this part, but I don't know how.
Any thoughts? TIA
You can use a regex here.
Example:
s='''*STRING1 1L8:StringONE
*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO
*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE'''
import re
out = re.findall(r'\s(\d+)L(\d+):', s)
output: [('1', '8'), ('29', '4'), ('33', '2')]
As integers:
out = [tuple(map(int, x)) for x in re.findall(r'\s(\d+)L(\d+):', s)]
output: [(1, 8), (29, 4), (33, 2)]
regex:
\s # space
(\d+) # capture one or more digits
L # literal L
(\d+) # capture one or more digits
: # literal :
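The breakdown above can also live inside the pattern itself by using re.VERBOSE and named groups. The following is a sketch under assumptions: the group names pos and length and the variable names are chosen here for illustration, and each row is assumed to contain at most one such field.

```python
import re

# Same pattern as above, written in verbose mode with named groups.
FIELD = re.compile(r"""
    \s               # whitespace before the field
    (?P<pos>\d+)     # one or more digits: the position
    L                # literal L
    (?P<length>\d+)  # one or more digits: the length
    :                # literal colon
""", re.VERBOSE)

rows = [
    "*STRING1 1L8:StringONE",
    "*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO",
    "*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE",
]

fields = []
for row in rows:
    m = FIELD.search(row)
    if m:  # skip rows without a matching field
        fields.append((int(m.group('pos')), int(m.group('length'))))

print(fields)
```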
This question is about the '?' operator. My previous question was mistakenly marked as a duplicate, so I am asking again here.
I would like to ask why the first expression doesn't output ('a', 'b','c','d') from the string 'axxxxxbcd'.
import re
match = re.findall(r'(a).*?(b)?.*?(c)?(d)','awsssd axxxxxbcd ad adfdfdcdfdd awsbdfdfdcd')
print (match)
output[1]: [('a', '', '', 'd'), ('a', '', 'c', 'd'), ('a', '', '', 'd'), ('a', '', '', 'd'), ('a', '', '', 'd')]
import re
match = re.findall(r'(a).*?(b)?(c)?(d)','awsssd axxxxxbcd ad adfdfdcdfdd awsbdfdfdcd')
print (match)
output[2]: [('a', '', '', 'd'), ('a', 'b', 'c', 'd'), ('a', '', '', 'd'), ('a', '', '', 'd'), ('a', 'b', '', 'd')]
@Isaac
You can better understand what is going on by wrapping every element of the regex in capturing parentheses:
import re
rgx1 = re.compile(r'(a)(.*?)(b)?(.*?)(c)?(d)')
m1 = rgx1.search('axxxxxbcd')
print(m1.groups())
Output:
('a', '', None, 'xxxxxb', 'c', 'd')
Here's what happens:
# Group 1: 'a'
# Group 2: capture as little as possible, so we get ''
# Group 3: 'b' is not present, but it's optional, so we get None
# Group 4: 'xxxxxb'
# Group 5: 'c'
# Group 6: 'd'
Why does group 4 end up with content rather than group 2? Initially, both capture as little as possible (nothing), but that causes the overall regex to fail, so the engine has to start expanding either group 2 or group 4. It expands the later group first, because a backtracking engine returns to the most recent choice point before it revisits earlier ones. To demonstrate that both groups do indeed pursue a non-greedy strategy, you can add a d earlier in the string: for example, use the input text axxdxxxbcd. In that case, group 4 ends up holding just xx.
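Both cases described above can be checked directly. This short sketch reuses the fully parenthesized pattern on the two input strings mentioned; the variable names are arbitrary.

```python
import re

rgx = re.compile(r'(a)(.*?)(b)?(.*?)(c)?(d)')

# Group 4 (the later lazy group) is the one that grows to reach 'c' and 'd'.
first = rgx.search('axxxxxbcd').groups()
print(first)   # ('a', '', None, 'xxxxxb', 'c', 'd')

# With an earlier 'd', group 4 stops as soon as the rest can match.
second = rgx.search('axxdxxxbcd').groups()
print(second)  # ('a', '', None, 'xx', None, 'd')
```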
The following approach might do what you want:
rgx1 = re.compile(r'(a)(?:.*?(b)|.*?)(?:.*?(c)|.*?)(d)')
m1 = rgx1.search('a...b...cd')
print(m1.groups()) # Output: ('a', 'b', 'c', 'd')
But I probably wouldn't solve the problem that way. Regexes where every (or nearly every) element is optional are often tricky to get right. Sometimes you are better off to parse the text in a few simple stages rather than in a big hairy regex.
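As one possible illustration of parsing in stages, the sketch below first isolates the shortest a...d span and then looks for the optional markers inside it. The function name find_markers and the exact rules (first span only, c must come after b when b is present) are assumptions made for this example, not something stated in the question.

```python
import re

def find_markers(text):
    """Return ('a', 'b' or None, 'c' or None, 'd') for the first a...d span,
    or None if there is no such span."""
    # Stage 1: locate the shortest span from an 'a' to the next 'd'.
    span = re.search(r'a(.*?)d', text)
    if span is None:
        return None
    middle = span.group(1)
    # Stage 2: look for the optional markers inside that span, in order.
    b_pos = middle.find('b')
    c_pos = middle.find('c', b_pos + 1)  # b_pos + 1 == 0 when 'b' is absent
    return ('a',
            'b' if b_pos != -1 else None,
            'c' if c_pos != -1 else None,
            'd')

print(find_markers('axxxxxbcd'))
print(find_markers('ad'))
```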
You are getting that output because the second .*? lazily matches any character (except for line terminators) and ends up consuming the b before (b)? can capture it:
(a).*?(b)?.*?(c)?(d)
Group 1. 0-1 `a`
Group 3. 7-8 `c`
Group 4. 8-9 `d`
1st Capturing Group (a)
a matches the character a literally (case sensitive)
.*? matches any character (except for line terminators)
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
2nd Capturing Group (b)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
b matches the character b literally (case sensitive)
.*? matches any character (except for line terminators)
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
3rd Capturing Group (c)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
c matches the character c literally (case sensitive)
4th Capturing Group (d)
d matches the character d literally (case sensitive)
But if you want the output ('a', 'b', 'c', 'd') from the string 'axxxxxbcd', your regular expression should be:
(a).*?(b)?(c)?(d)
Group 1. 0-1 `a`
Group 2. 6-7 `b`
Group 3. 7-8 `c`
Group 4. 8-9 `d`
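The two group listings above can be reproduced in Python with a quick side-by-side check on the single string 'axxxxxbcd' (variable names are arbitrary):

```python
import re

text = 'axxxxxbcd'

# With the extra .*? between (b)? and (c)?, that .*? consumes the 'b'.
with_middle = re.search(r'(a).*?(b)?.*?(c)?(d)', text)
print(with_middle.groups(), with_middle.span(3))

# Without it, the first .*? must stop right before 'b' for the rest to match.
without_middle = re.search(r'(a).*?(b)?(c)?(d)', text)
print(without_middle.groups(), without_middle.span(2))
```

Note that Python reports an unmatched optional group as None from groups(), whereas re.findall() shows it as an empty string.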
I need to extract all letters after the + sign or at the beginning of a string like this:
formula = "X+BC+DAF"
I tried the following, but I do not want to see the + sign in the result. I want to see only ['X', 'B', 'D'].
>>> re.findall("^[A-Z]|[+][A-Z]", formula)
['X', '+B', '+D']
When I grouped with parenthesis, I got this strange result:
re.findall("^([A-Z])|[+]([A-Z])", formula)
[('X', ''), ('', 'B'), ('', 'D')]
Why did it create tuples when I tried to group? How to write the regexp directly such that it returns ['X', 'B', 'D']?
If there are any capturing groups in the regular expression then re.findall returns only the values captured by the groups. If there are no groups the entire matched string is returned.
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
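The three cases in the documentation excerpt (no group, one group, several groups) can be seen side by side in this small sketch, which uses the formula string from the question; the patterns are simplified variations chosen only to show the return shapes.

```python
import re

formula = "X+BC+DAF"

# No groups: each item is the whole match.
no_group = re.findall(r"\+[A-Z]", formula)
print(no_group)    # ['+B', '+D']

# One group: each item is just that group's text.
one_group = re.findall(r"\+([A-Z])", formula)
print(one_group)   # ['B', 'D']

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(\+)([A-Z])", formula)
print(two_groups)  # [('+', 'B'), ('+', 'D')]
```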
How to write the regexp directly such that it returns ['X', 'B', 'D'] ?
Instead of using a capturing group you can use a non-capturing group:
>>> re.findall(r"(?:^|\+)([A-Z])", formula)
['X', 'B', 'D']
Or for this specific case you could try a simpler solution using a word boundary:
>>> re.findall(r"\b[A-Z]", formula)
['X', 'B', 'D']
Or a solution using str.split that doesn't use regular expressions:
>>> [s[0] for s in formula.split('+')]
['X', 'B', 'D']
I would like to have a RegEx that matches several of the same character in a row, within a range of possible characters but does not return those pattern matches as one pattern. How can this be accomplished?
For clarification:
I want a pattern that starts with [a-c] and non-greedily returns any number of the same character, but not the other characters in the range. In the sequence 'aafaabbybcccc' it would find patterns for:
('aa', 'aa', 'bb', 'b', 'cccc')
but would exclude the following:
('f', 'aabb', 'y', 'bcccc')
I don't want to use multiple RegEx pattern searches because the order in which I find the patterns will determine the output of another function. This question is for the purposes of self study (Python), not homework. (I'm also under 15 rep but will come back and upvote when I can.)
Good question. Use a regex like:
(?P<L>[a-c])(?P=L)+
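This is more robust - you're not limited to a-c, you can replace it with a-z if you like. It first defines any character within a-c as L, then checks whether that same character occurs again one or more times. Note that re.findall() returns only the captured group (a single character per match) when the pattern contains a group, so iterate with re.finditer() and take m.group(0) to get each full run.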
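Here is a minimal sketch of how the named-group pattern behaves on the string from the question. The second pattern, with * instead of +, is a variation added here (not part of the answer) so that a single character such as the lone 'b' also counts as a run.

```python
import re

text = 'aafaabbybcccc'
pattern = re.compile(r'(?P<L>[a-c])(?P=L)+')

# findall returns only the captured group, i.e. one character per match.
only_groups = pattern.findall(text)
print(only_groups)  # ['a', 'a', 'b', 'c']

# finditer gives match objects; group(0) is the whole run.
runs = [m.group(0) for m in pattern.finditer(text)]
print(runs)         # ['aa', 'aa', 'bb', 'cccc']

# Variation: '*' allows zero repetitions, so single characters match too.
runs_with_singles = [m.group(0)
                     for m in re.finditer(r'(?P<L>[a-c])(?P=L)*', text)]
print(runs_with_singles)  # ['aa', 'aa', 'bb', 'b', 'cccc']
```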
You can use the backreferences \1 through \9 to match again whatever the 1st through 9th group captured.
/([a-c])(\1+)/
[a-c]: Matches one of the characters a, b, or c.
\1+ : Matches one or more further repetitions of the character captured by group 1.
Perl:
perl -e '@m = "ccccbbb" =~ /([a-c])(\1+)/; print $m[0], $m[1]'
cccc
Python:
>>> import re
>>> [m.group(0) for m in re.finditer(r"([a-c])\1+", 'aafaabbybcccc')]
['aa', 'aa', 'bb', 'cccc']
I am looking for a regexp that returns only three matches for the string "A :B C:D",
where A, B, C and D stand for words (\w+).
The following Python code also prints an unwanted (None, None).
I just want ('A', None), (None, 'B') and ('C', 'D') using one regexp (no added Python code for filtering).
import re
for m in re.compile(r'(?:(\w+)|)(?:(?::)(\w+)|)').finditer('A :B C:D'):
    print(m.groups())
This might do the trick:
(?=[\w:])(\w*)(?::(\w*))?
(\w*)(?::(\w*))? describes the structure you want, but it has the problem that it also matches the empty string; thus we have to ensure that there is at least one non-space character at the start (which will then be consumed by the greedy quantifiers), and the lookahead at the start does that.
Edit: wrong paste :)
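Running the lookahead pattern on the sample string shows one detail worth knowing: a missing first word comes back as an empty string rather than None, because \w* succeeds with an empty match. The second pattern below is a variation assumed here for illustration (not part of the answer) that makes the first group optional instead, which leaves it as None.

```python
import re

text = 'A :B C:D'

# Pattern from the answer: group 1 is '' when the word is missing.
original = [m.groups()
            for m in re.finditer(r'(?=[\w:])(\w*)(?::(\w*))?', text)]
print(original)  # [('A', None), ('', 'B'), ('C', 'D')]

# Variation: (\w+)? does not participate when there is no word,
# so group 1 is reported as None.
variant = [m.groups()
           for m in re.finditer(r'(?=[\w:])(\w+)?(?::(\w*))?', text)]
print(variant)   # [('A', None), (None, 'B'), ('C', 'D')]
```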
import re
print([m.groups() for m in re.finditer(
    r'''(?x)            # verbose mode
    (\w+)?              # optionally capture one or more \w's
    (?: :|\s)           # match (without capturing) a colon or a space
    (\w+ (?:\s|\Z))?    # optionally capture one or more \w's followed by a space or end of string
    ''',
    'A :B C:D')])
yields
[('A', None), (None, 'B '), ('C', 'D')]