Use of findall and parentheses in Python

I need to extract all letters after the + sign or at the beginning of a string like this:
formula = "X+BC+DAF"
I tried the following, but I do not want to see the + sign in the result. I wish to see only ['X', 'B', 'D'].
>>> re.findall("^[A-Z]|[+][A-Z]", formula)
['X', '+B', '+D']
When I grouped with parentheses, I got this strange result:
re.findall("^([A-Z])|[+]([A-Z])", formula)
[('X', ''), ('', 'B'), ('', 'D')]
Why did it create tuples when I tried to group? How to write the regexp directly such that it returns ['X', 'B', 'D']?

If there are any capturing groups in the regular expression then re.findall returns only the values captured by the groups. If there are no groups the entire matched string is returned.
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
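As a minimal sketch of the three return shapes the documentation describes, here is the formula from the question run against three illustrative patterns (the patterns are made up for the demonstration and are not the final answer):

```python
import re

formula = "X+BC+DAF"

# No groups: each item is the whole match.
no_groups = re.findall(r"[+][A-Z]", formula)

# One group: each item is the text captured by that group.
one_group = re.findall(r"[+]([A-Z])", formula)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"([+])([A-Z])", formula)

print(no_groups)   # ['+B', '+D']
print(one_group)   # ['B', 'D']
print(two_groups)  # [('+', 'B'), ('+', 'D')]
```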
How to write the regexp directly such that it returns ['X', 'B', 'D'] ?
Use a non-capturing group for the alternation and keep a single capturing group for the letter, so that findall returns only that one group:
>>> re.findall(r"(?:^|\+)([A-Z])", formula)
['X', 'B', 'D']
Or for this specific case you could try a simpler solution using a word boundary:
>>> re.findall(r"\b[A-Z]", formula)
['X', 'B', 'D']
Or a solution using str.split that doesn't use regular expressions:
>>> [s[0] for s in formula.split('+')]
['X', 'B', 'D']
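The three approaches agree on the formula from the question but not on every input. The sketch below uses a hypothetical formula containing a '-' separator (not from the question) to show where they diverge:

```python
import re

# Hypothetical input with a '-' separator, not taken from the question.
sample = "X+BC-DAF"

# Only letters at the start of the string or right after a '+'.
anchored = re.findall(r"(?:^|\+)([A-Z])", sample)

# Any letter that starts a run of word characters, whatever precedes it.
boundary = re.findall(r"\b[A-Z]", sample)

# First character of each '+'-separated piece.
by_split = [s[0] for s in sample.split('+')]

print(anchored)  # ['X', 'B']
print(boundary)  # ['X', 'B', 'D']
print(by_split)  # ['X', 'B']
```

Note also that the str.split version raises IndexError if a piece is empty, for example with "X++B".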

Related

Variable delimiters in sorted list to string

I'm currently trying to convert a sorted list of characters into a string. The delimiter between two characters should be '-' if the characters before and after it are different, but '&' if they are equal.
An example:
The list ['1', '1', '2', '9', 'A', 'A', 'A', 'B', 'C'] should become '1&1-2-9-A&A&A-B-C'.
This will happen for approximately 250K lists in a pandas DataFrame. I'm thinking of creating a string with all delimiters '-' and replacing them using str.replace() function, but getting stuck at the final part.
A simplification of my current code is as follows (where column 'sorted' contains a sorted list of the characters as above):
df['joined'] = df['sorted'].str.join('-')
df['correct'] = df['joined'].str.replace(r"\-(.\-)\1{1,}?", xxxx, regex=True)
Is there a replacement pattern that I can put in place of xxxx and that reuses what the first pattern matched, with the '.' being the original character? Or is there another solution (for example, a matching positive lookbehind and lookahead)?
Thanks!
I would do it the following way:
import re
chars = ['1', '1', '2', '9', 'A', 'A', 'A', 'B', 'C']
joined = '-'.join(chars)
result = re.sub(r'(.)-(?=\1)', r'\1&', joined)
print(result) # 1&1-2-9-A&A&A-B-C
Explanation: I used a positive lookahead here to check whether the - is followed by the same character as the one before it. A lookahead is a zero-length assertion, so it does not consume the character it checks; that character can then start the next match, which is what allows every - inside a run of equal characters to be replaced. Consider
A-A-A
This results in the matches:
(A-)(A-)A
If we used r'(.)-\1' as the pattern, it would be:
(A-A)-A
thus leaving the second - unchanged.
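The difference between the two patterns can be checked directly on the A-A-A example. This is only a small sketch; the replacement string r"\1&\1" for the consuming pattern is an assumption chosen so that both calls aim at the same output:

```python
import re

joined = "A-A-A"

# Consuming pattern: the second A becomes part of the first match, so it
# cannot start another match and the second '-' is left alone.
consuming = re.sub(r"(.)-\1", r"\1&\1", joined)

# Lookahead pattern: the following A is only checked, not consumed, so it
# is still available as the start of the next match.
lookahead = re.sub(r"(.)-(?=\1)", r"\1&", joined)

print(consuming)  # A&A-A
print(lookahead)  # A&A&A
```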

The reason for the result of `re.findall(r'(.)*', 'aabc')` in Python 3

The line re.findall(r'(.)*', 'aabc') returns ['c', ''] instead of ['a', 'a', 'b', 'c', '']. Why is that?
Thank you
Your pattern says match one capturing group of one character (.) zero or more times *.
There are two matches with this pattern. First, it matches 'aabc' as (.) (a capturing group of a single character) four times. The result in findall for that will be the content of the captured group, which is c, since the last thing your group captures is c.
The second match found is the empty string (which is a match since * can mean "zero times"), in which case nothing is captured in the capturing group, and you get an empty string as a result.
If you want the result ['a', 'a', 'b', 'c', ''], you could use
re.findall(r'.?', 'aabc')
which is "match any single character optionally".
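To see the two matches described above, one can list the span and the group content of each match with re.finditer. This is a small sketch, assuming Python 3.7 or later:

```python
import re

text = "aabc"

# Each item is (span of the whole match, content of group 1).
# A group that did not take part in the match is None here;
# re.findall reports it as '' instead.
star = [(m.span(), m.group(1)) for m in re.finditer(r"(.)*", text)]
print(star)  # [((0, 4), 'c'), ((4, 4), None)]

# One optional character per match, plus the empty match at the end.
optional = re.findall(r".?", text)
print(optional)  # ['a', 'a', 'b', 'c', '']
```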

Trouble understanding re.findall() behavior

I'm having some trouble understanding the behavior of re.findall. Quoting from the documentation:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
Based on this, I would expect the following line
re.findall(f"(a)|(b)|(c)","c")
to produce the result
[(c)]
However, it produces the result
[('', '', 'c')]
I don't understand why the two empty strings are included, since I don't see an empty match anywhere.
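It's because the pattern has three capturing groups: findall returns one tuple per match with one entry per group, and a group that did not take part in the match is reported as an empty string: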
import re
print(re.findall(r"(a)|(b)|(c)","d"))
print(re.findall(r"(a)|(b)|(c)","c"))
print(re.findall(r"(?:a)|(?:b)|(?:c)","c"))
print(re.findall(r"(?:a)|(b)|(c)","c"))
print(re.findall(r"(?:a|b|c)","c"))
print(re.findall(r"a|b|c","c"))
Output
[]
[('', '', 'c')]
['c']
[('', 'c')]
['c']
['c']
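The empty strings come from groups that did not participate in the match. A short sketch with a match object makes this visible; the last part, which collapses each match to its one non-empty group via lastindex, is just one possible way to do it and uses a made-up input "abc":

```python
import re

pattern = r"(a)|(b)|(c)"

# On a match object, groups that did not participate are None.
match = re.search(pattern, "c")
print(match.groups())   # (None, None, 'c')
print(match.lastindex)  # 3

# findall reports the same groups, with None replaced by ''.
found = re.findall(pattern, "c")
print(found)  # [('', '', 'c')]

# Keep the groups but collapse each match to the group that matched.
flat = [hit.group(hit.lastindex) for hit in re.finditer(pattern, "abc")]
print(flat)  # ['a', 'b', 'c']
```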

Regex to parse SDDL

I'm using python to parse out an SDDL using regex. The SDDL is always in the form of 'type:some text' repeated up to 4 times. The types can be either 'O', 'G', 'D', or 'S' followed by a colon. The 'some text' will be variable in length.
Here is a sample SDDL:
O:DAG:S-1-5-21-2021943911-1813009066-4215039422-1735D:(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)S:NO_ACCESS_CONTROL
Here is what I have so far. Two of the tuples are returned just fine, but the other two - ('G','S-1-5-21-2021943911-1813009066-4215039422-1735') and ('S','NO_ACCESS_CONTROL') are not.
import re
sddl="O:DAG:S-1-5-21-2021943911-1813009066-4215039422-1735D:(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)S:NO_ACCESS_CONTROL"
matches = re.findall('(.):(.*?).:',sddl)
print(matches)
[('O', 'DA'), ('D', '(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)')]
what I'd like to have returned is
[('O', 'DA'), ('G','S-1-5-21-2021943911-1813009066-4215039422-1735'), ('D', '(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)'),('S','NO_ACCESS_CONTROL')]
Try the following:
(.):(.*?)(?=.:|$)
Example:
>>> re.findall(r'(.):(.*?)(?=.:|$)', sddl)
[('O', 'DA'), ('G', 'S-1-5-21-2021943911-1813009066-4215039422-1735'), ('D', '(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)'), ('S', 'NO_ACCESS_CONTROL')]
This regex starts out the same way as yours, but instead of including the .: at the end as a part of the match, a lookahead is used. This is necessary because re.findall() will not return overlapping matches, so you need each match to stop before the next match begins.
The lookahead (?=.:|$) essentially means "match only if the next characters are anything followed by a colon, or we are at the end of the string".
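The effect of the lookahead is visible in the text of the first match under each pattern. The sketch below uses the sample SDDL from the question; turning the pairs into a dict is an extra step added here for illustration and assumes each type occurs at most once:

```python
import re

sddl = ("O:DAG:S-1-5-21-2021943911-1813009066-4215039422-1735"
        "D:(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)S:NO_ACCESS_CONTROL")

# Original pattern: the trailing '.:' swallows the next type and its colon.
first_consuming = re.search(r"(.):(.*?).:", sddl).group(0)

# Lookahead pattern: the next type is checked but left in the string.
first_lookahead = re.search(r"(.):(.*?)(?=.:|$)", sddl).group(0)

print(first_consuming)  # O:DAG:
print(first_lookahead)  # O:DA

# Map each type letter to its text.
fields = dict(re.findall(r"(.):(.*?)(?=.:|$)", sddl))
print(fields["G"])  # S-1-5-21-2021943911-1813009066-4215039422-1735
print(fields["S"])  # NO_ACCESS_CONTROL
```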
It seems like using regex isn't the best solution to this problem. Really, all you want to do is split across the colons and then do some transformations on the resulting list.
chunks = sddl.split(':')
pairs = [(chunks[i][-1],
          chunks[i+1][:-1] if i < len(chunks) - 2 else chunks[i+1])
         for i in range(len(chunks) - 1)]

Greedy match with negative lookahead in a regular expression

I have a regular expression in which I'm trying to extract every group of letters that is not immediately followed by a "(" symbol. For example, the following regular expression operates on a mathematical formula that includes variable names (x, y, and z) and function names (movav and movsum), both of which are composed entirely of letters but where only the function names are followed by an "(".
re.findall(r"[a-zA-Z]+(?!\()", "movav(x/2, 2)*movsum(y, 3)*z")
I would like the expression to return the array
['x', 'y', 'z']
but it instead returns the array
['mova', 'x', 'movsu', 'y', 'z']
I can see in theory why the regular expression would be returning the second result, but is there a way I can modify it to return just the array ['x', 'y', 'z']?
Another solution which doesn't rely on word boundaries:
Check that the letters aren't followed by either a ( or by another letter.
>>> re.findall(r'[a-zA-Z]+(?![a-zA-Z(])', "movav(x/2, 2)*movsum(y, 3)*z")
['x', 'y', 'z']
Add a word-boundary matcher \b:
>>> re.findall(r'[a-zA-Z]+\b(?!\()', "movav(x/2, 2)*movsum(y, 3)*z")
['x', 'y', 'z']
\b matches the empty string at the beginning or end of a word, so now you're looking for letters followed by a word boundary that isn't immediately followed by (. For more details, see the re docs.
You need to limit matches to whole words. So use \b to match the beginning or end of a word:
re.findall(r"\b[a-zA-Z]+\b(?!\()", "movav(x/2, 2)*movsum(y, 3)*z")
An alternate approach: find strings of letters followed by either end-of-string or by a non-letter, non-bracket character; then capture the letter portion.
re.findall("([a-zA-Z]+)(?:[^a-zA-Z(]|$)", "movav(x/2, 2)*movsum(y, 3)*z")
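The following sketch puts the original pattern next to the word-boundary version on the formula from the question. The last pattern, which collects the function names instead, is a hypothetical complement added for illustration and was not part of the question:

```python
import re

expr = "movav(x/2, 2)*movsum(y, 3)*z"

# Without a boundary the engine backtracks by one letter until the
# negative lookahead succeeds, which yields 'mova' and 'movsu'.
backtracked = re.findall(r"[a-zA-Z]+(?!\()", expr)
print(backtracked)  # ['mova', 'x', 'movsu', 'y', 'z']

# With \b the run of letters must end at a word boundary, so giving up
# a letter no longer produces a match.
variables = re.findall(r"\b[a-zA-Z]+\b(?!\()", expr)
print(variables)  # ['x', 'y', 'z']

# Hypothetical complement: names that are followed by '('.
functions = re.findall(r"\b[a-zA-Z]+(?=\()", expr)
print(functions)  # ['movav', 'movsum']
```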
