I don't understand why the regex ^(.)+$ matches the last letter of a string. I thought it would match the whole string.
Example in Python:
>>> text = 'This is a sentence'
>>> re.findall('^(.)+$', text)
['e']
If there's a capturing group (or groups), re.findall returns differently:
If one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result unless they touch the
beginning of another match.
And according to MatchObject.group documentation:
If a group matches multiple times, only the last match is accessible:
If you want to get whole string, use a non-capturing group:
>>> re.findall('^(?:.)+$', text)
['This is a sentence']
or don't use capturing groups at all:
>>> re.findall('^.+$', text)
['This is a sentence']
or change the group to capturing all:
>>> re.findall('^(.+)$', text)
['This is a sentence']
>>> re.findall('(^.+$)', text)
['This is a sentence']
Alternatively, you can use re.finditer which yield match objects. Using MatchObject.group(), you can get the whole matched string:
>>> [m.group() for m in re.finditer('^(.)+$', text)]
['This is a sentence']
Because the capture group is just one character (.). The regex engine will continue to match the whole string because of the + quantifier, and each time, the capture group will be updated to the latest match. In the end, the capture group will be the last character.
Even if you use findall, the first time the regex is applied, because of the + quantifier it will continue to match the whole string up to the end. And since the end of the string was reached, the regex won't be applied again, and the call returns just one result.
If you remove the + quantifier, then the first time, the regex will match just one character, so the regex will be applied again and again, until the whole string will be consumed, and findall will return a list of all the characters in the string.
NOte that + is greedy by default which matches all the characters upto the last. Since only the dot present inside the capturing group, the above regex matches all the characters from the start but captures only the last character. Since findall function gives the first preference to groups, it just prints out the chars present inside the groups.
re.findall('^(.+)$', text)
Related
I have a string s = '10000',
I need using only the Python re.findall to get how many 0\d0 in the string s
For example: for the string s = '10000' it should return 2
explanation:
the first occurrence is 10000 while the second occurrence is 10000
I just need how many occurrences and not interested in the occurrence patterns
I've tried the following regex statements:
re.findall(r'(0\d0)', s) #output: ['000']
re.findall(r'(0\d0)*', s) #output: ['', '', '000', '', '', '']
Finally, if I want to make this regex generic to fetch any number then
any_number_included_my_number then the_same_number_again, how can I do it?
How to get all possible occurrences?
The regex
As I mentioned in my comment, you can use the following pattern:
(?=(0\d0))
How it works:
(?=...) is a positive lookahead ensuring what follows matches. This doesn't consume characters (allowing us to check for a match at each position in the string as a regex would otherwise resume pattern matching after the consumed characters).
(0\d0) is a capture group matching 0, then any digit, then 0
The code
Your code becomes:
See code in use here
re.findall(r'(?=(0\d0))', s)
The result is:
['000', '000']
The python re.findall method states the following
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
This means that our matches are the results of capture group 1 rather than the full match as many would expect.
How to generalize the pattern?
The regex
You can use the following pattern:
(\d)\d\1
How this works:
(\d) captures any digit into capture group 1
\d matches any digit
\1 is a backreference that matches the same text as most recently matched by capture group 1
The code
Your code becomes:
See code in use here
re.findall(r'(?=((\d)\d\2))', s)
print([n[0] for n in x])
Note: The code above has two capture groups, so we need to change the backreference to \2 to match correctly. Since we now have two capture groups, we will get tuples as the documentation states and can use list comprehension to get the expected results.
The result is:
['000', '000']
s = "[abc]abx[abc]b"
s = re.sub("\[([^\]]*)\]a", "ABC", s)
'ABCbx[abc]b'
In the string, s, I want to match 'abc' when it's enclosed in [], and followed by a 'a'. So in that string, the first [abc] will be replaced, and the second won't.
I wrote the pattern above, it matches:
match anything starting with a '[', followed by any number of characters which is not ']', then followed by the character 'a'.
However, in the replacement, I want the string to be like:
[ABC]abx[abc]b . // NOT ABCbx[abc]b
Namely, I don't want the whole matched pattern to be replaced, but only anything with the bracket []. How to achieve that?
match.group(1) will return the content in []. But how to take advantage of this in re.sub?
Why not simply include [ and ] in the substitution?
s = re.sub("\[([^\]]*)\]a", "[ABC]a", s)
There exist more than 1 method, one of them is exploting groups.
import re
s = "[abc]abx[abc]b"
out = re.sub('(\[)([^\]]*)(\]a)', r'\1ABC\3', s)
print(out)
Output:
[ABC]abx[abc]b
Note that there are 3 groups (enclosed in brackets) in first argument of re.sub, then I refer to 1st and 3rd (note indexing starts at 1) so they remain unchanged, instead of 2nd group I put ABC. Second argument of re.sub is raw string, so I do not need to escape \.
This regex uses lookarounds for the prefix/suffix assertions, so that the match text itself is only "abc":
(?<=\[)[^]]*(?=\]a)
Example: https://regex101.com/r/NDlhZf/1
So that's:
(?<=\[) - positive look-behind, asserting that a literal [ is directly before the start of the match
[^]]* - any number of non-] characters (the actual match)
(?=\]a) - positive look-ahead, asserting that the text ]a directly follows the match text.
I'm looking to find words in a string that match a specific pattern.
Problem is, if the words are part of an email address, they should be ignored.
To simplify, the pattern of the "proper words" \w+\.\w+ - one or more characters, an actual period, and another series of characters.
The sentence that causes problem, for example, is a.a b.b:c.c d.d#e.e.e.
The goal is to match only [a.a, b.b, c.c] . With most Regexes I build, e.e returns as well (because I use some word boundary match).
For example:
>>> re.findall(r"(?:^|\s|\W)(?<!#)(\w+\.\w+)(?!#)\b", "a.a b.b:c.c d.d#e.e.e")
['a.a', 'b.b', 'c.c', 'e.e']
How can I match only among words that do not contain "#"?
I would definitely clean it up first and simplify the regex.
first we have
words = re.split(r':|\s', "a.a b.b:c.c d.d#e.e.e")
then filter out the words that have an # in them.
words = [re.search(r'^((?!#).)*$', word) for word in words]
Properly parsing email addresses with a regex is extremely hard, but for your simplified case, with a simple definition of word ~ \w\.\w and the email ~ any sequence that contains #, you might find this regex to do what you need:
>>> re.findall(r"(?:^|[:\s]+)(\w+\.\w+)(?=[:\s]+|$)", "a.a b.b:c.c d.d#e.e.e")
['a.a', 'b.b', 'c.c']
The trick here is not to focus on what comes in the next or previous word, but on what the word currently captured has to look like.
Another trick is in properly defining word separators. Before the word we'll allow multiple whitespaces, : and string start, consuming those characters, but not capturing them. After the word we require almost the same (except string end, instead of start), but we do not consume those characters - we use a lookahead assertion.
You may match the email-like substrings with \S+#\S+\.\S+ and match and capture your pattern with (\w+\.\w+) in all other contexts. Use re.findall to only return captured values and filter out empty items (they will be in re.findall results when there is an email match):
import re
rx = r"\S+#\S+\.\S+|(\w+\.\w+)"
s = "a.a b.b:c.c d.d#e.e.e"
res = filter(None, re.findall(rx, s))
print(res)
# => ['a.a', 'b.b', 'c.c']
See the Python demo.
See the regex demo.
I want to match a floating number that might be in the form of 0.1234567 or 1.23e-5
Here is my python code:
import re
def main():
m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')
for svs_elem in m2:
print svs_elem
main()
It prints blank... Based on my test, the problem was in (e-\d+)? part.
See emphasis:
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
You have a group, so it’s returned instead of the entire match, but it doesn’t match in any of your cases. Make it non-capturing with (?:e-\d+):
m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')
Use a non-capturing group. The matches are succeeding, but the output is the contents of the optional groups that don't actually match.
See the output when your input includes something like e-6:
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['', '', 'e-6']
With a non-capturing group ((?:...)):
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['1:0.00003', '3:0.123456', '8:-0.12345e-6']
Here's are some simpler examples to demonstrate how capturing groups work and how they influence the output of findall. First, no groups:
>>> re.findall("a[bc]", "ab")
["ab"]
Here, the string "ab" matched the regex, so we print everything the regex matched.
>>> re.findall("a([bc])", "ab")
["b"]
This time, we put the [bc] inside a capturing group, so even though the entire string is still matched by the regex, findall only includes the part inside the capturing group in its output.
>>> re.findall("a(?:[bc])", "ab")
["ab"]
Now, by converting the capturing group to a non-capturing group, findall again uses the match of the entire regex in its output.
>>> re.findall("a([bc])?", "a")
['']
>>> re.findall("a(?:[bc])?", "a")
['a']
In both of these final case, the regular expression as a whole matches, so the return value is a non-empty list. In the first one, the capturing group itself doesn't match any text, though, so the empty string is part of the output. In the second, we don't have a capturing group, so the match of the entire regex is used for the output.
If I have a string that may look like this:
"[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
How do I extract the categories and put them into a list?
I'm having a hard time getting the regular expression to work.
To expand on the explanation of the regex used by Avinash in his answer:
Category:([^\[\]]*) consists of several parts:
Category: which matches the text "Category:"
(...) is a capture group meaning roughly "the expression inside this group is a block that I want to extract"
[^...] is a negated set which means "do not match any characters in this set".
\[ and \] match "[" and "]" in the text respectively.
* means "match zero or more of the preceding regex defined items"
Where I have used ... to indicate that I removed some characters that were not important for the explanation.
So putting it all together, the regex does this:
Finds "Category:" and then matches any number (including zero) characters after that that are not the excluded characters "[" or "]". When it hits an excluded character it stops and the text matched by the regex inside the (...) part is returned. So the regex does not actually look for "[[" or "]]" as you might expect and so will match even if they are left out. You could force it to look for the double square brackets at the beginning and end by changing it to \[\[Category:([^\[\]]*)\]\].
For the second regex, Category:[^\[\]]*, the capture group (...) is excluded, so Python returns everything matched which includes "Category:".
Seems like you want something like this,
>>> str = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
>>> re.findall(r'Category:([^\[\]]*)', str)
['Political culture', 'Political ideologies']
>>> re.findall(r'Category:[^\[\]]*', str)
['Category:Political culture', 'Category:Political ideologies']
By default re.findall will print only the strings which are matched by the pattern present inside a capturing group. If no capturing group was present, then only the findall function would return the matches in list. So in our case , this Category: matches the string category: and this ([^\[\]]*) would capture any character but not of [ or ] zero or more times. Now the findall function would return the characters which are present inside the group index 1.
Python code:
s = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
cats = [line.strip().strip("[").strip("]") for line in s.splitlines() if line]
print(cats)
Output:
['Category:Political culture', 'Category:Political ideologies']