How to avoid capturing groups in RegEx splitting result? - python

I'm trying to use re to match a pattern that starts with '\n', followed by a possible 'real(r8)', followed by zero or more white spaces and then followed by the word 'function', and then I want to split the string at where matches occur. So for this string,
text = '''functional \n function disdat \nkitkat function wakawak\nreal(r8) function noooooo \ndoit'''
I would like:
['functional ',
' disdat \nkitkat function wakawak',
' noooooo \ndoit']
However,
regex = re.compile(r'''\n(real\(r8\))?\s*\bfunction\b''')
regex.split(text)
returns
['functional ',
None,
' disdat \nkitkat function wakawak',
'real(r8)',
' noooooo \ndoit']
split returns the matches' groups too. How do I ask it not to?

You can use non-capturing groups, like this
>>> regex = re.compile(r'\n(?:real\(r8\))?\s*\bfunction\b')
>>> regex.split(text)
['functional ', ' disdat \nkitkat function wakawak', ' noooooo \ndoit']
Note the ?: in (?:real\(r8\)). Quoting the Python documentation for (?:...):
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
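To make the difference concrete, here is a minimal side-by-side sketch of the two patterns applied to the question's text. The variable names are illustrative only and do not come from the original posts.

```python
import re

# The question's sample text; '\n' here is a real newline character.
text = ('functional \n function disdat \nkitkat function wakawak'
        '\nreal(r8) function noooooo \ndoit')

capturing = re.compile(r'\n(real\(r8\))?\s*\bfunction\b')
non_capturing = re.compile(r'\n(?:real\(r8\))?\s*\bfunction\b')

# split() interleaves each capturing group's value (None when the
# optional group did not take part in the match) with the pieces.
with_groups = capturing.split(text)

# With (?:...) there is nothing to interleave, so only the pieces remain.
without_groups = non_capturing.split(text)

print(with_groups)
print(without_groups)
```

If the captured text is needed in some calls but not others, keeping two compiled patterns as above is one option; another is to keep the capturing pattern and drop every second element of the result.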

Related

Split concatenated functions keeping the delimiters

I am trying to split strings containing python functions, so that the resulting output keeps separate functions as list elements.
s='hello()there()' should be split into ['hello()', 'there()']
To do so I use a regex lookahead to split on the closing parenthesis, but not at the end of the string.
While the lookahead seems to work, I cannot keep the ) in the resulting strings as suggested in various posts. Simply splitting with the regex discards the separator:
import re
s='hello()there()'
t=re.split("\)(?!$)", s)
This results in: ['hello(', 'there()'].
s='hello()there()'
t=re.split("(\))(?!$)", s)
Wrapping the separator as a group results in the ) being retained as a separate element: ['hello(', ')', 'there()']
As does this approach using the filter() function:
s='hello()there()'
u = list(filter(None, re.split("(\))(?!$)", s)))
resulting again in the parenthesis as a separate element: ['hello(', ')', 'there()']
How can I split such a string so that the functions remain intact in the output?
Use re.findall()
\w+\(\) matches one or more word characters followed by an opening and a closing parenthesis. That part matches hello() and there().
t = re.findall(r"\w+\(\)", s)
['hello()', 'there()']
Edit:
.*? is a non-greedy match, meaning it will match as few characters as possible in the parenthesis.
s = "hello(x, ...)there(6L)now()"
t = re.findall(r"\w+\(.*?\)", s)
print(t)
['hello(x, ...)', 'there(6L)', 'now()']
You can split on a lookbehind for () and a negative lookahead for the end of the string. This relies on re.split splitting on empty matches, which requires Python 3.7 or later.
t = re.split(r'(?<=\(\))(?!$)', s)
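The two approaches can be compared on the longer sample string. This is a sketch under one assumption: the lookbehind is shortened to (?<=\)) so that calls with arguments are also handled, which is a variation on the pattern above rather than the answer's exact regex.

```python
import re

s = 'hello(x, ...)there(6L)now()'

# Approach 1: collect each call with findall instead of splitting.
calls_found = re.findall(r'\w+\(.*?\)', s)

# Approach 2: split at the empty position right after a ')' that is not
# the end of the string. Splitting on empty matches needs Python 3.7+.
calls_split = re.split(r'(?<=\))(?!$)', s)

print(calls_found)
print(calls_split)
```

Neither pattern understands nesting: for input such as f(g()) both would cut after the first closing parenthesis, so a real parser is likely a better fit for arbitrary Python expressions.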

Python regular expression truncate string by special character with one leading space

I need to truncate a string at the special characters '-', '(', '/' when they are preceded by one whitespace character, i.e. at ' -', ' (', ' /'.
How can I do that?
patterns=r'[-/()]'
try:
    return row.split(re.findall(patterns, row)[0], 1)[0]
except IndexError:
    return row
The above code picks up all the special characters, but without the leading space.
patterns=r'[s-/()]'
This one does not work.
Try this pattern
patterns=r'^\s[-/()]'
or remove ^ depending on your needs.
It looks like you want to get a part of the string before the first occurrence of \s[-(/] pattern.
Use
return re.sub(r'\s[-(/].*', '', row)
This code returns the part of row before the first occurrence of a whitespace character (\s) followed by -, ( or / ([-(/]).
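The same truncation can also be written with re.split and maxsplit=1. The sketch below wraps it in a hypothetical truncate() helper and runs it on made-up sample rows; neither the helper name nor the rows come from the original question.

```python
import re

def truncate(row):
    """Return the part of row before the first whitespace + '-', '(' or '/'."""
    # maxsplit=1 stops after the first delimiter; element 0 is the prefix.
    return re.split(r'\s[-(/]', row, maxsplit=1)[0]

# Hypothetical rows: two contain a delimiter, one only has a bare hyphen.
samples = ['Acme Corp - East', 'Widget (blue) / large', 'well-known item']
truncated = [truncate(row) for row in samples]

print(truncated)
```

A hyphen with no whitespace in front of it, as in the last row, is left alone, and a row without any delimiter comes back unchanged, so no try/except is needed.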
Please try this pattern patterns = r'\s+-|\s\/|\s\(|\s\)'

Python split with multiple delimiters not working

I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, such as the accepted answer to "Python: Split string with multiple delimiters":
import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print(re.split('\' .,()_', row))
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']
re.split('\' .,()_', row)
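As a regular expression, this is one sequence to match in order (a quote, a space, any character, a comma, an empty group, and an underscore), not a list of separate delimiters. You probably meant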
re.split('[\' .,()_]', row)
re.split takes a regular expression as the first argument. To say "this OR that" in regular expressions, you can write a|b and it will match either a or b. If you wrote ab, it would only match a followed by b. Luckily, instead of writing every delimiter as a separate alternative joined by |, you can use a character class: everything inside [] is treated as "match one of these".
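A short sketch of the difference, using the question's line with a trailing newline as it would come from a file. Adding \n and + to the class is an assumption made here to absorb the line ending and runs of delimiters; it is not part of the answer's pattern.

```python
import re

row = 'feature.append(freq_and_feature(text, freq))\n'

# Without brackets the pattern is a sequence: a quote, a space, any
# character, a comma, an empty group and an underscore, in that order.
# Nothing in row matches that, so the row comes back in one piece.
as_sequence = re.split(r"' .,()_", row)

# Inside [...] each character is an alternative, and '.', '(' and ')'
# lose their special meaning. '+' merges runs of adjacent delimiters.
as_class = re.split(r"[' .,()_\n]+", row)

# The delimiters at the very end still leave one empty string behind.
words = [w for w in as_class if w]

print(as_sequence)
print(words)
```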
It seems you want to split a string with non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
The [\W_]+ regex matches one or more characters that are either non-word characters (\W, which is [^a-zA-Z0-9_] for ASCII text) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
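As a minimal sketch of that trimming variant (variable names are illustrative):

```python
import re

s = 'feature.append(freq_and_feature(text, freq))'

# Strip delimiter runs from both ends first, so split() has no leading
# or trailing delimiter to turn into an empty string.
trimmed = re.sub(r'^[\W_]+|[\W_]+$', '', s)
words = re.split(r'[\W_]+', trimmed)

print(words)
```

One caveat: if the input consists only of delimiters, trimmed is empty and split returns [''], so the filtering version may still be the safer default.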
You can try this
parts = re.split('[.(_,)]+', row)
parts.pop()  # drop the last element, the newline left after the final '))'
print(parts)
This will result:
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
For ASCII text, [^A-Za-z0-9] is equivalent to [\W_].
Python Code
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work, indeed
p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))

What do round brackets in Regex mean?

I don't understand why the regex ^(.)+$ matches the last letter of a string. I thought it would match the whole string.
Example in Python:
>>> text = 'This is a sentence'
>>> re.findall('^(.)+$', text)
['e']
If the pattern contains a capturing group (or groups), re.findall returns the groups instead of the whole match:
If one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result unless they touch the
beginning of another match.
And according to MatchObject.group documentation:
If a group matches multiple times, only the last match is accessible:
If you want to get whole string, use a non-capturing group:
>>> re.findall('^(?:.)+$', text)
['This is a sentence']
or don't use capturing groups at all:
>>> re.findall('^.+$', text)
['This is a sentence']
or change the group to capturing all:
>>> re.findall('^(.+)$', text)
['This is a sentence']
>>> re.findall('(^.+$)', text)
['This is a sentence']
Alternatively, you can use re.finditer, which yields match objects. Using MatchObject.group(), you can get the whole matched string:
>>> [m.group() for m in re.finditer('^(.)+$', text)]
['This is a sentence']
Because the capture group is just one character (.). The regex engine will continue to match the whole string because of the + quantifier, and each time, the capture group will be updated to the latest match. In the end, the capture group will be the last character.
Even if you use findall, the first time the regex is applied, because of the + quantifier it will continue to match the whole string up to the end. And since the end of the string was reached, the regex won't be applied again, and the call returns just one result.
If you remove the + quantifier and the ^ and $ anchors, leaving just (.), each match consumes a single character, so the regex is applied again and again until the whole string is consumed, and findall returns a list of all the characters in the string.
Note that + is greedy by default, so it matches all the characters up to the last one. Since only the dot is inside the capturing group, the regex matches all the characters from the start but captures only the last character. Because findall gives preference to groups, it returns only the characters captured by the group.
re.findall('^(.+)$', text)
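The behaviour described in these answers can be checked directly on a match object. This is a small sketch; the variable names are illustrative.

```python
import re

text = 'This is a sentence'

m = re.match(r'^(.)+$', text)
whole_match = m.group(0)   # everything the pattern matched
last_capture = m.group(1)  # only the final repetition of the group

# findall returns group 1 because the pattern has exactly one group.
groups_only = re.findall(r'^(.)+$', text)

# Without '+' and the anchors, every character is its own match.
each_char = re.findall(r'(.)', 'abc')

print(whole_match, last_capture, groups_only, each_char)
```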

Match single quotes from python re

How can I match the following? I want all the names within the single quotes:
This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'
How do I extract only the names within the single quotes?
name = re.compile(r'^\'+\w+\'')
The following regex finds all single words enclosed in quotes:
In [6]: re.findall(r"'(\w+)'", s)
Out[6]: ['Tom', 'Harry', 'rock']
Here:
the ' matches a single quote;
the \w+ matches one or more word characters;
the ' matches a single quote;
the parentheses form a capture group: they define the part of the match that gets returned by findall().
If you only wish to find words that start with a capital letter, the regex can be modified like so:
In [7]: re.findall(r"'([A-Z]\w*)'", s)
Out[7]: ['Tom', 'Harry']
I'd suggest
r = re.compile(r"\B'\w+'\B")
apos = r.findall("This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'")
Result:
>>> apos
["'Tom'", "'Harry'", "'rock'"]
The "negative word boundaries" (\B) prevent matches like the 'n' in words like Rock'n'Roll.
Explanation:
\B # make sure that we're not at a word boundary
' # match a quote
\w+ # match one or more alphanumeric characters
' # match a quote
\B # make sure that we're not at a word boundary
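To see what the \B guards change, the sketch below runs the pattern with and without them on a made-up string that contains both a quoted name and an in-word 'n'. The sample string is hypothetical.

```python
import re

pattern_plain = re.compile(r"'\w+'")
pattern_guarded = re.compile(r"\B'\w+'\B")

# Hypothetical sample: an in-word 'n' plus a genuinely quoted name.
sample = "Rock'n'Roll by 'Tom'"

plain_hits = pattern_plain.findall(sample)
guarded_hits = pattern_guarded.findall(sample)

print(plain_hits)
print(guarded_hits)
```

The guarded pattern rejects 'n' because a letter sits directly next to each quote, which makes those positions word boundaries.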
^ ('hat' or 'caret', among other names) in regex means "start of the string" (or, given particular options, "start of a line"), which you don't care about. Omitting it makes your regex work fine:
>>> re.findall(r'\'+\w+\'', s)
["'Tom'", "'Harry'", "'rock'"]
The regexes others have suggested might be better for what you're trying to achieve; this is just the minimal change to fix your problem.
Your regex can only match a pattern following the start of the string. Try something like: r"'([^']*)'"
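One caveat with [^']* on this particular sentence: it also pairs up the apostrophes in hasn't and turn's. A quick sketch comparing it with the \w+ version from the earlier answer (variable names are illustrative):

```python
import re

s = ("This hasn't been much that much of a twist and turn's to "
     "'Tom','Harry' and u know who..yes its 'rock'")

# [^']* accepts spaces, so the first two apostrophes form a "quote" too.
any_chars = re.findall(r"'([^']*)'", s)

# \w+ only accepts word characters between the quotes.
word_chars = re.findall(r"'(\w+)'", s)

print(any_chars)
print(word_chars)
```

So [^']* is probably the better choice only when the quoted values may contain spaces and the surrounding text has no stray apostrophes.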
