>>> import re
>>> sentence = "Thomas Jefferson began building Monticello at the age of 26."
>>> tokens1 = re.split(r"([-\s.,;!?])+", sentence)
>>> tokens2 = re.split(r"[-\s.,;!?]+", sentence)
>>> tokens1
['Thomas', ' ', 'Jefferson', ' ', 'began', ' ', 'building', ' ', 'Monticello', ' ', 'at', ' ', 'the', ' ', 'age', ' ', 'of', ' ', '26', '.', '']
>>> tokens2
['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '']
Can you explain the purpose of ( and )?
(...) in a regex denotes a capturing group (also called "capturing parentheses"). Capturing groups are used when you want to extract values from a match. In this case, you are using the re.split function, which behaves in a specific way when the pattern contains capturing groups. According to the documentation:
re.split(pattern, string, maxsplit=0, flags=0)
Split string by the occurrences of pattern. If capturing parentheses
are used in pattern, then the text of all groups in the pattern are
also returned as part of the resulting list.
So normally, the delimiters used to split the string are not present in the result, like in your second example. However, if you use (), the text captured in the groups will also be in the result of the split. This is why you get a lot of ' ' in the first example. That is what is captured by your group ([-\s.,;!?]).
With a capturing group (()) in the regex used to split a string, split will include the captured parts.
In your case, you are splitting on one or more whitespace and/or punctuation characters, and capturing only the last of those characters to include in the split parts, which seems like a weird thing to do. I'd have expected you to want to capture the whole separator, which would look like r"([-\s.,;!?]+)" (capturing one or more whitespace/punctuation characters, rather than matching one or more but capturing only the last).
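To make the difference concrete, here is a small sketch comparing the three patterns. The sample string "Hello, world" is made up for illustration; it was chosen because its separator is two characters long (a comma followed by a space).

```python
import re

text = "Hello, world"  # illustrative input, not from the question

# Group inside the repetition: only the LAST separator character is captured.
last_only = re.split(r"([-\s.,;!?])+", text)

# Repetition inside the group: the whole separator run is captured.
whole_run = re.split(r"([-\s.,;!?]+)", text)

# No group at all: separators are dropped from the result.
no_group = re.split(r"[-\s.,;!?]+", text)

print(last_only)  # ['Hello', ' ', 'world']
print(whole_run)  # ['Hello', ', ', 'world']
print(no_group)   # ['Hello', 'world']
```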
Here is my example string. I need to keep the delimiters (dot, dash, and space), with a dot or dash staying attached to the word it follows.
str example:
a = 'Beautiful. is. better5-than ugly'
what I tried
re.split(r'\W+', a)
['Beautiful', 'is', 'better5', 'than', 'ugly']
expected output:
['Beautiful.', ' ', 'is.', ' ', 'better5-', 'than', ' ', 'ugly']
Is it possible?
>>> import re
>>> a = 'Beautiful. is. better5-than ugly'
>>> re.findall(r"\w+[.-]?|\s+", a)
['Beautiful.', ' ', 'is.', ' ', 'better5-', 'than', ' ', 'ugly']
\w+[.-]? matches words with an optional dot or hyphen at the end.
\s+ matches whitespace.
| makes sure we match either of the above.
Since we want the delimiters to be part of the result, we should not consume them when splitting, so I used both "lookbehind" and "lookahead" assertions in the regex. You can read about them in the re module's documentation. Note that splitting on a pattern that can match an empty string requires Python 3.7 or later.
import re
a = 'Beautiful. is. better5-than ugly'
print(re.split(r'(?<=[-. ])|(?= )', a))
Additional note: with the "lookbehind" assertion alone I could achieve almost the same result, but the space after "than" would stay attached to it ("than "), so I added a "lookahead" assertion (the |(?= ) part) to split before that space too.
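The effect of the extra lookahead can be checked with a quick comparison. This is only a sketch and assumes Python 3.7 or later, where re.split can split on empty matches; the variable names are illustrative.

```python
import re

a = 'Beautiful. is. better5-than ugly'

# Lookbehind only: split after every '-', '.' or ' '.
# The space after "than" stays glued to the word.
lookbehind_only = re.split(r'(?<=[-. ])', a)

# Adding the lookahead also splits just before a space.
with_lookahead = re.split(r'(?<=[-. ])|(?= )', a)

print(lookbehind_only)
# ['Beautiful.', ' ', 'is.', ' ', 'better5-', 'than ', 'ugly']
print(with_lookahead)
# ['Beautiful.', ' ', 'is.', ' ', 'better5-', 'than', ' ', 'ugly']
```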
In the Python documentation, I came across the following code snippet:
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
What confuses me is that \W matches any character which is not a Unicode word character, but ',' is a Unicode character. And what do the parentheses mean? I know they define a group, but there is only one group in the pattern. Why is ', ' also returned?
A "Unicode word character" is a character that can be part of a word: basically a letter, a digit, or an underscore. \W matches any character that is not one of those.
Comma cannot be part of a word.
And the comma is included in the resulting list because the split regex is in parentheses (defining a group inside the split regex). That's how re.split works (and that's the difference between your two code snippets).
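A short sketch of what \w and \W actually match may help here. The sample strings are invented for illustration; the behavior shown is that of str patterns in Python 3, where \w is Unicode-aware by default.

```python
import re

sample = 'a_b, c1!'  # illustrative input

# \W matches everything that is NOT a letter, digit, or underscore.
non_word = re.findall(r'\W', sample)

# \w matches letters (including accented ones), digits, and underscore.
words = re.findall(r'\w+', 'café, naïve_1')

print(non_word)  # [',', ' ', '!']
print(words)     # ['café', 'naïve_1']
```

So the comma is a Unicode character, but it is not a Unicode *word* character, which is why \W+ treats ', ' as a separator.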
I know this is probably a really easy question, but I'm struggling to split a string in Python. My regex has grouping parentheses, like this:
myRegex = r"(\W+)"
And I want to parse this string into words:
testString = "This is my test string, hopefully I can get the word i need"
testAgain = re.split(r"(\W+)", testString)
Here's the results:
['This', ' ', 'is', ' ', 'my', ' ', 'test', ' ', 'string', ', ', 'hopefully', ' ', 'I', ' ', 'can', ' ', 'get', ' ', 'the', ' ', 'word', ' ', 'i', ' ', 'need']
Which isn't what I expected. I am expecting the list to contain:
['This','is','my','test']......etc
Now I know it's something to do with the grouping in my regex, and I can fix the issue by removing the brackets. But how can I keep the brackets and get the result above?
Sorry about this question. I have read the official Python documentation on regex splitting with groups, but I still don't understand why the spaces are in my list.
As described in this answer, How to split but ignore separators in quoted strings, in python?, you can simply slice the list once it's split. It's easy to do because you want every other member, starting with the first one (that is, indices 0, 2, 4, 6, and so on).
You can use the [start:end:step] notation as described below:
testString = "This is my test string, hopefully I can get the word i need"
testAgain = re.split(r"(\W+)", testString)
testAgain = testAgain[0::2]
Also, I must point out that \W matches any non-word character, including punctuation. If you want to keep your punctuation, you'll need to change your regex.
You can simply do:
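As a quick illustration of the slicing idea, the sketch below separates the words from the captured separators. The shortened input string and the variable names are assumptions made for this example.

```python
import re

test_string = "This is my test string, hopefully"  # shortened sample input

parts = re.split(r"(\W+)", test_string)

# Even indices hold the pieces between separators (the words).
words = parts[0::2]
# Odd indices hold the text captured by the group (the separators).
separators = parts[1::2]

print(words)       # ['This', 'is', 'my', 'test', 'string', 'hopefully']
print(separators)  # [' ', ' ', ' ', ' ', ', ']
```

This relies on the pattern having exactly one capturing group; with more groups, each split would contribute more than one extra item and the step of 2 would no longer line up.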
testAgain = testString.split() # built-in split with space
Different regex ways of doing this:
testAgain = re.split(r"\s+", testString) # split with space
testAgain = re.findall(r"\w+", testString) # find all words
testAgain = re.findall(r"\S+", testString) # find all non space characters
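These alternatives do not all treat punctuation the same way. The following sketch runs them on a short made-up string so the differences are visible; the variable names are illustrative.

```python
import re

test_string = "test string, hopefully"  # illustrative input

by_split = test_string.split()                  # whitespace split, keeps the comma
by_re_split = re.split(r"\s+", test_string)     # same result for this input
only_words = re.findall(r"\w+", test_string)    # drops punctuation
non_space = re.findall(r"\S+", test_string)     # keeps punctuation attached

print(by_split)     # ['test', 'string,', 'hopefully']
print(by_re_split)  # ['test', 'string,', 'hopefully']
print(only_words)   # ['test', 'string', 'hopefully']
print(non_space)    # ['test', 'string,', 'hopefully']
```

Only the \w+ version gives bare words. Also note that str.split() ignores leading and trailing whitespace, whereas re.split(r"\s+", ...) would produce empty strings at the ends in that case.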
I want to parse a file, that contains some programming language. I want to get a list of all symbols etc.
I tried a few patterns and decided that this is the most successful yet:
pattern = r"\b(\w+|\W+)\b"
Using this on my text, that is something like:
string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
re.findall(pattern, string)
will result in my required output, but I have some chars that I don't want and some unwanted formatting:
['the', ' ', 'quick', ' ', 'brown', '(', 'fox', ').', 'jumps', '(', 'over',
' + ', 'the', ') = ', 'lazy', '[', 'dog']
My list contains some whitespace that I would like to get rid of and some double symbols, like (., that I would like to have as single chars. Of course I have to modify the \W+ to get this done, but I need a little help.
The other is that my regex doesn't match the ending ];, which I also need.
Why use \W+ for one or more, if you want single non-word characters in output? Additionally exclude whitespace by use of a negated class. Also it seems like you could drop the word boundaries.
re.findall(r"\w+|[^\w\s]", string)
This matches
\w+ one or more word characters
|[^\w\s] or one character, that is neither a word character nor a whitespace
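Applied to the string from the question, the pattern can be tried out as follows. This is a minimal sketch; the variable name `source` is chosen here to avoid shadowing built-in names.

```python
import re

source = "the quick brown(fox).jumps(over + the) = lazy[dog];"

# Words as whole tokens, every other non-space character as its own token.
tokens = re.findall(r"\w+|[^\w\s]", source)

print(tokens)
# ['the', 'quick', 'brown', '(', 'fox', ')', '.', 'jumps', '(', 'over',
#  '+', 'the', ')', '=', 'lazy', '[', 'dog', ']', ';']
```

Whitespace never appears in the result because neither alternative can match it, and the trailing ];  is picked up as two single-character tokens.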
I'm trying to use re to match a pattern that starts with '\n', followed by an optional 'real(r8)', followed by zero or more whitespace characters and then the word 'function', and then I want to split the string wherever a match occurs. So for this string,
text = '''functional \n function disdat \nkitkat function wakawak\nreal(r8) function noooooo \ndoit'''
I would like:
['functional ',
' disdat \nkitkat function wakawak',
' noooooo \ndoit']
However,
regex = re.compile(r'''\n(real\(r8\))?\s*\bfunction\b''')
regex.split(text)
returns
['functional ',
None,
' disdat \nkitkat function wakawak',
'real(r8)',
' noooooo \ndoit']
split returns the matches' groups too. How do I ask it not to?
You can use non-capturing groups, like this
>>> regex = re.compile(r'\n(?:real\(r8\))?\s*\bfunction\b')
>>> regex.split(text)
['functional ', ' disdat \nkitkat function wakawak', ' noooooo \ndoit']
Note ?: in (?:real\(r8\)). Quoting Python documentation for (?:..)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
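The same effect can be seen on a much smaller input. The sketch below uses a made-up string and pattern, purely to show how an optional capturing group adds None or the captured text to the split result, while the non-capturing form does not.

```python
import re

text = 'a-bx-c'  # illustrative input: split on '-' with an optional 'x' before it

# Capturing group: each split also yields the group's value (None if it did not participate).
capturing = re.split(r'(x)?-', text)

# Non-capturing group: same matches, but nothing extra is added to the result.
non_capturing = re.split(r'(?:x)?-', text)

print(capturing)      # ['a', None, 'b', 'x', 'c']
print(non_capturing)  # ['a', 'b', 'c']
```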