The reason for the result of `re.findall(r'(.)*', 'aabc')` in python3 - python

The line re.findall(r'(.)*', 'aabc') returns ['c', ''] instead of ['a', 'a', 'b', 'c', '']. Why is that?
Thank you

Your pattern says: match a capturing group of one character, (.), zero or more times, *.
There are two matches with this pattern. First, it matches 'aabc' as (.) (a capturing group of a single character) four times. findall reports the content of the capturing group for that match, which is c, because a repeated group keeps only the last thing it captured.
The second match is the empty string at the end of the input (a valid match, since * can mean "zero times"). In that case the capturing group captures nothing, and you get an empty string as the result.
If you want the result ['a', 'a', 'b', 'c', ''], you could use
re.findall(r'.?', 'aabc')
which is "match any single character optionally".
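The behaviour described above can be checked directly. This is a minimal sketch; the variable names are only for illustration, and the results assume Python 3.7 or later.

```python
import re

text = 'aabc'

# A repeated capturing group keeps only its last iteration, so findall
# reports 'c' for the first match and '' for the trailing empty match.
last_only = re.findall(r'(.)*', text)

# Without a group, findall returns each whole match instead.
each_char = re.findall(r'.?', text)

# finditer exposes the whole match next to the group value.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r'(.)*', text)]

print(last_only)  # ['c', '']
print(each_char)  # ['a', 'a', 'b', 'c', '']
print(pairs)      # [('aabc', 'c'), ('', None)]
```

In the second pair the group is None because it did not take part in the empty match; findall turns that into ''.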

Related

Variable delimiters in sorted list to string

I'm currently trying to convert a sorted list of characters into a string. The delimiter of this string should be '-' if the character before and after are different, but should be '&' if these characters are equal.
An example:
The list ['1', '1', '2', '9', 'A', 'A', 'A', 'B', 'C'] should become '1&1-2-9-A&A&A-B-C'.
This will happen for approximately 250K lists in a pandas DataFrame. I'm thinking of creating a string with '-' for all delimiters and then replacing some of them using the str.replace() function, but I'm getting stuck at the final part.
A simplification of my current code is as follows (where the column 'sorted' contains a sorted list of characters as above).
df['joined'] = df['sorted'].str.join('-')
df['correct'] = df['joined'].str.replace(r"\-(.\-)\1{1,}?", xxxx, regex=True)
Is there a regex pattern that can replace the xxxx and do the same job as the first pattern, with the '.' being the original character? Or is there another solution (for example, a matching positive lookbehind and lookahead)?
Thanks!
I would do it the following way:
import re
chars = ['1', '1', '2', '9', 'A', 'A', 'A', 'B', 'C']
joined = '-'.join(chars)
result = re.sub(r'(.)-(?=\1)', r'\1&', joined)
print(result) # 1&1-2-9-A&A&A-B-C
Explanation: I used a positive lookahead here to check whether the - is followed by the same character as the one before it. A lookahead is a zero-length assertion and does not consume the character it checks, so that character is still available as the start of the next match. This gives the correct replacement when the hyphens are only one character apart. Consider
A-A-A
The matches are:
(A-)(A-)A
If we used r'(.)-\1' as the pattern instead, the match would be:
(A-A)-A
thus leaving the second - unchanged.
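To see the difference on the full example, the two patterns can be run side by side. This is a sketch that assumes single-character items, none of them '-'. The itertools.groupby variant is not part of the answer above; it is a regex-free alternative added for comparison.

```python
import re
from itertools import groupby

chars = ['1', '1', '2', '9', 'A', 'A', 'A', 'B', 'C']
joined = '-'.join(chars)

# The lookahead checks the next character without consuming it, so
# overlapping pairs such as A-A-A are all replaced.
with_lookahead = re.sub(r'(.)-(?=\1)', r'\1&', joined)

# Consuming the second character skips every other pair.
consuming = re.sub(r'(.)-\1', r'\1&\1', joined)

# Alternative without regex: join equal runs with '&', then runs with '-'.
via_groupby = '-'.join('&'.join(run) for _, run in groupby(chars))

print(with_lookahead)  # 1&1-2-9-A&A&A-B-C
print(consuming)       # 1&1-2-9-A&A-A-B-C
print(via_groupby)     # 1&1-2-9-A&A&A-B-C
```

The groupby version works on the list itself, so it would also handle items longer than one character.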

How to get number of repetition of each group of regexp in a line?

How can I get the number of repetitions of each group in a regexp, using Python, and get a list of these groups?
For example:
This regex (ab)*.*?(cd)* on the string ababababcdcddscdcdfscdcd
Should return 4 for the first group, because ab exists 4 times in the string.
And return 6 for the second group, because cd exists 6 times in the string.
This or maybe another function should also return a list of groups and another part of the line. For this string it must be list with [ab,ab,ab,ab,cd,cd,ds,cd,cd,fs,cd,cd]. I tried to use match object, but I can't find a way to get the number of repetitions of every group.
Thanks very much everybody for the help.
When you quantify a capture group, it only keeps the last repetition, not all of them, so you can't get [ab, ab, ab, ab, ...].
You can put the quantifier inside a group, so that all the repetitions are captured at once.
((ab)*).*?((cd)*)
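For the example string, the capture groups will be (the last two cover only the first run of cd, because .*? matches as little as possible):
["abababab", "ab", "cdcd", "cd"]
Dividing the length of each outer group (elements 0 and 2) by the length of the element that follows it gives the number of repetitions in that run.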
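The length-division idea can be sketched as follows, using re.match on the string from the question. The counts it produces are per contiguous run, so a separate findall is shown for counting every occurrence in the string; that second step is an addition here, not part of the answer above.

```python
import re

s = "ababababcdcddscdcdfscdcd"

m = re.match(r'((ab)*).*?((cd)*)', s)
groups = m.groups()

# Outer groups sit at indexes 0 and 2, the repeated unit right after each.
# Skip a pair if the inner group did not match at all.
run_counts = [len(groups[i]) // len(groups[i + 1])
              for i in (0, 2) if groups[i + 1]]

# Counting every occurrence in the whole string, not just the first run.
total_ab = len(re.findall(r'ab', s))
total_cd = len(re.findall(r'cd', s))

print(groups)              # ('abababab', 'ab', 'cdcd', 'cd')
print(run_counts)          # [4, 2]
print(total_ab, total_cd)  # 4 6
```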
In your pattern you are repeating a capturing group which will give you the value of the last iteration in a group. So for example this part (ab)* will contain the value of the last occurrence of ab.
()()()    matched
abababab
      ()  captured
One option is to split on either ab or cd using a capturing group (ab|cd) to keep the delimiter and remove the empty entries from the result.
For example
import re
s = "ababababcdcddscdcdfscdcd"
pattern = r"(ab|cd)"
result = list(filter(None, re.split(pattern, s)))
print(result)
Output
['ab', 'ab', 'ab', 'ab', 'cd', 'cd', 'ds', 'cd', 'cd', 'fs', 'cd', 'cd']
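If the counts from the question are also needed, one possible follow-up is to tally the split result with collections.Counter. This is a sketch added for illustration; the names pieces and counts are not from the answer.

```python
import re
from collections import Counter

s = "ababababcdcddscdcdfscdcd"

# Same split as above; the list comprehension drops the empty strings.
pieces = [p for p in re.split(r"(ab|cd)", s) if p]

# Tally how often each piece occurs.
counts = Counter(pieces)

print(counts['ab'])  # 4
print(counts['cd'])  # 6
```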

How to comprehend the python regex compile matching result: `re.compile(r'a*')`

import re
pattern = re.compile(r'a*')
pattern.findall("aba")
result:
['a', '', 'a', '']
Why are there empty matches in the result? How should I understand this?
To be more specific, what do the two empty matches ('') in the result correspond to in the string "aba"?
findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
You are searching for a*. * matches zero or more repetitions of the character, so a* also matches the empty string, for example at the position of b and at the end of the string. It seems like you want a+ instead, which matches one or more repetitions of the character.
Let me try to explain, as I also could not find good information on the outputs. The documentation states that
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
Changed in version 3.7: Non-empty matches can now start just after a previous empty match.
import re
text = 'abcaad'
print(f"'a' matches {re.findall('a' , text)}")
print(f"'a+' matches {re.findall('a+', text)}")
print(f"'a*' matches {re.findall('a*', text)}")
print(f"'z*' matches {re.findall('z*', text)}")
The output is
'a' matches ['a', 'a', 'a']
'a+' matches ['a', 'aa']
'a*' matches ['a', '', '', 'aa', '', '']
'z*' matches ['', '', '', '', '', '', '']
a matches the single character a, which occurs three times.
a+ matches one or more consecutive occurrences of a.
a* matches zero or more occurrences of a.
Besides matching a and aa, it also produces an empty match at each position where no a starts: before b, c and d, and at the end of the string.
z* matches zero or more occurrences of z.
Since there is no z, it produces an empty match before each of a, b, c, a, a, d and at the end of the string.
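To see where each empty match sits, re.finditer can report the start and end index of every match. A minimal sketch for the "aba" example from the question, assuming Python 3.7 or later:

```python
import re

# (start, end, text) for every match of a* in "aba".
spans = [(m.start(), m.end(), m.group())
         for m in re.finditer(r'a*', 'aba')]

for start, end, matched in spans:
    print(start, end, repr(matched))
# 0 1 'a'
# 1 1 ''
# 2 3 'a'
# 3 3 ''
```

The two empty matches are the zero-width positions just before b (index 1) and at the end of the string (index 3).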

Use of findall and parenthesis in Python

I need to extract all letters after the + sign or at the beginning of a string like this:
formula = "X+BC+DAF"
I tried the following, but I do not want to see the + sign in the result. I want to see only ['X', 'B', 'D'].
>>> re.findall("^[A-Z]|[+][A-Z]", formula)
['X', '+B', '+D']
When I grouped with parenthesis, I got this strange result:
>>> re.findall("^([A-Z])|[+]([A-Z])", formula)
[('X', ''), ('', 'B'), ('', 'D')]
Why does it create tuples when I try to group? How can I write the regexp directly so that it returns ['X', 'B', 'D']?
If there are any capturing groups in the regular expression then re.findall returns only the values captured by the groups. If there are no groups the entire matched string is returned.
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
How to write the regexp directly such that it returns ['X', 'B', 'D'] ?
Instead of two capturing groups, you can use a non-capturing group for the alternation and a single capturing group for the letter:
>>> re.findall(r"(?:^|\+)([A-Z])", formula)
['X', 'B', 'D']
Or for this specific case you could try a simpler solution using a word boundary:
>>> re.findall(r"\b[A-Z]", formula)
['X', 'B', 'D']
Or a solution using str.split that doesn't use regular expressions:
>>> [s[0] for s in formula.split('+')]
['X', 'B', 'D']
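Two further variations, as a sketch: the tuples from the two-group pattern can be merged after the fact, and a lookbehind avoids capturing groups entirely. Neither is from the answer above; both are illustrations of the same findall rule.

```python
import re

formula = "X+BC+DAF"

# Two groups give one tuple per match, with '' for the group that
# did not take part in that match.
tuples = re.findall(r"^([A-Z])|[+]([A-Z])", formula)

# Merge each tuple by taking whichever element is non-empty.
merged = [a or b for a, b in tuples]

# No capturing group at all: the lookbehind checks for '+' without
# including it in the match, so findall returns the whole match.
lookbehind = re.findall(r"(?:^|(?<=\+))[A-Z]", formula)

print(tuples)      # [('X', ''), ('', 'B'), ('', 'D')]
print(merged)      # ['X', 'B', 'D']
print(lookbehind)  # ['X', 'B', 'D']
```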

Separating RegEx pattern matches that have the same potential starting characters

I would like to have a RegEx that matches several of the same character in a row, within a range of possible characters but does not return those pattern matches as one pattern. How can this be accomplished?
For clarification:
I want a pattern that starts with [a-c] and non-greedily returns any number of the same character, but not the other characters in the range. In the sequence 'aafaabbybcccc' it would find patterns for:
('aa', 'aa', 'bb', 'b', 'cccc')
but would exclude the following:
('f', 'aabb', 'y', 'bcccc')
I don't want to use multiple RegEx pattern searches because the order in which I find the patterns will determine the output of another function. This question is for the purposes of self study (python), not homework. (I'm also under 15 rep but will come back and upvote when I can.)
Good question. Use a regex like:
(?P<L>[a-c])(?P=L)+
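This is more robust: you're not limited to a-c, you can replace it with a-z if you like. It first defines any character within a-c as L, then checks whether that character occurs again one or more times. Because of the +, a run must be at least two characters long, so the single b in your example is not matched; use * instead of + to include it. Note that re.findall() with this regex returns only the captured character L for each match; to get the whole run, use re.finditer() and take m.group(0).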
You can use the backreferences \1 to \9 to refer to the text previously matched by the 1st to 9th group.
/([a-c])(\1+)/
[a-c]: Matches one of the characters a, b or c.
\1+ : Matches one or more further occurrences of the previously matched character.
Perl:
perl -e '@m = "ccccbbb" =~ /([a-c])(\1+)/; print $m[0], $m[1]'
cccc
Python:
>>> import re
>>> [m.group(0) for m in re.finditer(r"([a-c])\1+", 'aafaabbybcccc')]
['aa', 'aa', 'bb', 'cccc']
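The difference between findall and finditer for these patterns, and between + and *, can be checked on the string from the question. This is a sketch, with variable names chosen only for illustration.

```python
import re

text = 'aafaabbybcccc'

# findall returns only the group content, i.e. one character per run.
groups_only = re.findall(r'(?P<L>[a-c])(?P=L)+', text)

# finditer with group(0) returns the whole run (two or more characters).
runs_2plus = [m.group(0)
              for m in re.finditer(r'(?P<L>[a-c])(?P=L)+', text)]

# With * instead of +, a single character also counts as a run.
runs_1plus = [m.group(0) for m in re.finditer(r'([a-c])\1*', text)]

print(groups_only)  # ['a', 'a', 'b', 'c']
print(runs_2plus)   # ['aa', 'aa', 'bb', 'cccc']
print(runs_1plus)   # ['aa', 'aa', 'bb', 'b', 'cccc']
```

The last list is the one requested in the question, including the single b.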
