Python regular expression split with \W - python

In the Python documentation, I came across the following code snippet:
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
What confuses me is that \W matches any character that is not a Unicode word character, yet ',' is a Unicode character. And what do the parentheses mean? I know they define a group, but there is only one group in the pattern. Why is ', ' also returned?

A "Unicode word character" is a character that can be part of a word: basically a letter, a digit, or an underscore. \W matches any character that is not one of those.
A comma cannot be part of a word, so \W matches it.
The comma is included in the resulting list because the split regex is in parentheses, which defines a capturing group. When the pattern contains a capturing group, re.split also returns the text that the group matched. That is the difference between your two code snippets.
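A minimal sketch of the behavior described above, using only the standard re module. The variable names are illustrative and not part of the original snippet.

```python
import re

text = 'Words, words, words.'

# \W matches every character that is not a letter, digit, or underscore,
# so the commas, the spaces, and the final period all qualify.
non_word_chars = re.findall(r'\W', text)

# No capturing group: the separators are dropped.
plain = re.split(r'\W+', text)

# Capturing group: the text matched by the group is kept in the result.
captured = re.split(r'(\W+)', text)

# Non-capturing group (?:...): groups the pattern without keeping separators.
non_capturing = re.split(r'(?:\W+)', text)

print(non_word_chars)  # [',', ' ', ',', ' ', '.']
print(plain)           # ['Words', 'words', 'words', '']
print(captured)        # ['Words', ', ', 'words', ', ', 'words', '.', '']
print(non_capturing)   # ['Words', 'words', 'words', '']
```

The trailing '' appears because the string ends with a separator, so the piece after the last split is empty.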

Related

Regex: Use \b (word boundary) separator but ignore some characters

Given this example:
s = "Hi, domain: (foo.bar.com) bye"
I'd like to create a regex that matches both word and non-word strings, separately, i.e:
re.findall(regex, s)
# Returns: ["Hi", ", ", "domain", ": (", "foo.bar.com", ") ", "bye"]
My approach was to use the word boundary separator \b to catch any string that is bound by two word-to-non-word switches. From the re module docs:
\b is defined as the boundary between a \w and a \W character (or vice versa)
Therefore I tried as a first step:
regex = r'(?:^|\b).*?(?=\b|$)'
re.findall(regex, s)
# Returns: ["Hi", ",", "domain", ": (", "foo", ".", "bar", ".", "com", ") ", "bye"]
The problem is that I don't want the dot (.) character to be a separator too, I'd like the regex to see foo.bar.com as a whole word and not as three words separated by dots.
I tried to find a way to use a negative lookahead on dot but did not manage to make it work.
Is there any way to achieve that?
I don't mind that the dot won't be a separator at all in the regex, it doesn't have to be specific to domain names.
I looked at Regex word boundary alternative, Capture using word boundaries without stopping at "dot" and/or other characters and Regex word boundary excluding the hyphen, but they do not fit my case, as I cannot use the space as a separator condition.
Exclude some characters from word boundary is the only one that got me close, but I did not manage to get it working.
You may use this regex in findall:
\w+(?:\.\w+)*|\W+
This matches either a word followed by zero or more dot-separated words, or one or more non-word characters.
Code:
import re
s = "Hi, domain: (foo.bar.com) bye"
print (re.findall(r'\w+(?:\.\w+)*|\W+', s))
Output:
['Hi', ', ', 'domain', ': (', 'foo.bar.com', ') ', 'bye']
For your example, you could just split on [^\w.]+, using a capturing group around it to keep those values in the output:
import re
s = "Hi, domain: (foo.bar.com) bye"
re.split(r'([^\w.]+)', s)
# ['Hi', ', ', 'domain', ': (', 'foo.bar.com', ') ', 'bye']
If your string might start or end with non-word/space characters, you can filter out the resulting empty strings in the list with a comprehension:
s = "!! Hello foo.bar.com, your domain ##"
re.split(r'([^\w.]+)', s)
# ['', '!! ', 'Hello', ' ', 'foo.bar.com', ', ', 'your', ' ', 'domain', ' ##', '']
[w for w in re.split(r'([^\w.]+)', s) if w]
# ['!! ', 'Hello', ' ', 'foo.bar.com', ', ', 'your', ' ', 'domain', ' ##']
Lookarounds let you easily say "dot, except if it's surrounded by alphabetics on both sides" if that's what you mean;
re.findall(r'(?:^|\b)(\w+(?:\.\w+)*|\W+)(?!\.\w)(?=\b|$)', s)
or simply "word boundary, unless it's a dot":
re.findall(r'(?:^|(?<!\.)\b(?!\.)).+?(?=(?<!\.)\b(?!\.)|$)', s)
Notice that the latter will join text across a word boundary if it's a dot; so, for example, 'bye. ' would be extracted as one string.
(Perhaps try to be more precise about your requirements?)
Demo: https://ideone.com/dvQhFO

Match words that don't start with a certain letter using regex

I am learning regex but have not been able to find the right regex in Python for selecting words that do not start with a particular letter.
Example below
text='this is a test'
match=re.findall('(?!t)\w*',text)
# match returns
['his', '', 'is', '', 'a', '', 'est', '']
match=re.findall('[^t]\w+',text)
# match
['his', ' is', ' a', ' test']
Expected : ['is','a']
With regex
Use the negative set [^\Wt] to match any alphanumeric character that is not t. To avoid matching subsets of words, add the word boundary metacharacter, \b, at the beginning of your pattern.
Also, do not forget that you should use raw strings for regex patterns.
import re
text = 'this is a test'
match = re.findall(r'\b[^\Wt]\w*', text)
print(match) # prints: ['is', 'a']
Without regex
Note that this is also achievable without regex.
text = 'this is a test'
match = [word for word in text.split() if not word.startswith('t')]
print(match) # prints: ['is', 'a']
You are almost on the right track. You just forgot \b (word boundary) token:
\b(?!t)\w+
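As a rough illustration, the pattern above can be applied to the sample text from the question like this (the variable names are only for this sketch):

```python
import re

text = 'this is a test'

# \b anchors the match at the start of a word; (?!t) then rejects words
# whose first character is 't'; \w+ consumes the rest of the word.
words = re.findall(r'\b(?!t)\w+', text)
print(words)  # ['is', 'a']
```

The lookahead is case-sensitive, so a word starting with an uppercase 'T' would still be matched unless the pattern is changed to something like (?![tT]) or the re.IGNORECASE flag is used.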

How to split sentence to words with regular expression?

"She's so nice!" -> ["she","'","s","so","nice","!"]
I want to split a sentence like this.
I wrote the code below, but the result includes whitespace.
How can I do this using only a regular expression?
words = re.findall(r'\W+|\w+', "She's so nice!")
-> ["she", "'","s", " ", "so", " ", "nice", "!"]
words = [word for word in words if not word.isspace()]
Regex: [A-Za-z]+|[^A-Za-z ]
In [^A-Za-z ] add chars you don't want to match.
Details:
[] Match a single character present in the list
[^] Match a single character NOT present in the list
+ Matches between one and unlimited times
| Or
Python code:
text = "She's so nice!"
matches = re.findall(r'[A-Za-z]+|[^A-Za-z ]', text)
Output:
['She', "'", 's', 'so', 'nice', '!']
Before Python 3.7, the re module could not split on zero-width matches. You can use the regex package from PyPI instead (specifying VERSION1, which handles zero-width matches properly).
import regex
s = "She's so nice!"
x = regex.split(r"\s+|\b(?!^|$)", s, flags=regex.VERSION1)
print(x)
Output: ['She', "'", 's', 'so', 'nice', '!']
\s+|\b(?!^|$) Match either of the following options
\s+ Match one or more whitespace characters
\b(?!^|$) Assert position as a word boundary, but not at the beginning or end of the line
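If adding a third-party dependency is not desirable, matching the tokens instead of splitting is one possible alternative that stays within the standard library. This is a sketch under that assumption, not the approach this answer takes; the pattern treats every non-word, non-whitespace character as its own token.

```python
import re

s = "She's so nice!"

# \w+       : a run of word characters (one token per word)
# [^\w\s]   : a single character that is neither a word character nor whitespace
tokens = re.findall(r"\w+|[^\w\s]", s)
print(tokens)  # ['She', "'", 's', 'so', 'nice', '!']
```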

What's the difference between ([])+ and []+?

>>> sentence = "Thomas Jefferson began building Monticello at the age of 26."
>>> tokens1 = re.split(r"([-\s.,;!?])+", sentence)
>>> tokens2 = re.split(r"[-\s.,;!?]+", sentence)
>>> tokens1
['Thomas', ' ', 'Jefferson', ' ', 'began', ' ', 'building', ' ', 'Monticello', ' ', 'at', ' ', 'the', ' ', 'age', ' ', 'of', ' ', '26', '.', '']
>>> tokens2
['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '']
Can you explain the purpose of ( and )?
(..) in a regex denotes a capturing group (aka "capturing parenthesis"). They are used when you want to extract values out of a pattern. In this case, you are using re.split function which behaves in a specific way when the pattern has capturing groups. According to the documentation:
re.split(pattern, string, maxsplit=0, flags=0)
Split string by the occurrences of pattern. If capturing parentheses
are used in pattern, then the text of all groups in the pattern are
also returned as part of the resulting list.
So normally, the delimiters used to split the string are not present in the result, like in your second example. However, if you use (), the text captured in the groups will also be in the result of the split. This is why you get a lot of ' ' in the first example. That is what is captured by your group ([-\s.,;!?]).
With a capturing group (()) in the regex used to split a string, split will include the captured parts.
In your case, you are splitting on one or more whitespace and/or punctuation characters but capturing only the last of them to include in the split parts, which seems an odd thing to do. I would have expected you to capture the whole separator, which would look like r"([-\s.,;!?]+)" (capturing one or more whitespace/punctuation characters, rather than matching one or more but capturing only the last).
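The difference between the two placements of + can be seen on a separator that is longer than one character. The sample sentence below is an assumption chosen for this sketch, not taken from the question.

```python
import re

sentence = "Hello, world!"  # hypothetical input with a two-character separator ", "

# + outside the group: the group is repeated, and only its last
# repetition (here the space) is kept as the captured text.
last_char_only = re.split(r"([-\s.,;!?])+", sentence)

# + inside the group: the whole run of separator characters is captured.
whole_separator = re.split(r"([-\s.,;!?]+)", sentence)

print(last_char_only)   # ['Hello', ' ', 'world', '!', '']
print(whole_separator)  # ['Hello', ', ', 'world', '!', '']
```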

Regex parsing text and get relevant words / characters

I want to parse a file, that contains some programming language. I want to get a list of all symbols etc.
I tried a few patterns and decided that this is the most successful yet:
pattern = r"\b(\w+|\W+)\b"
Using this on my text, that is something like:
string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
re.findall(pattern, string)
will result in my required output, but I have some chars that I don't want and some unwanted formatting:
['the', ' ', 'quick', ' ', 'brown', '(', 'fox', ').', 'jumps', '(', 'over',
' + ', 'the', ') = ', 'lazy', '[', 'dog']
My list contains some whitespace that I would like to get rid of and some double symbols, like (., that I would like to have as single chars. Of course I have to modify the \W+ to get this done, but I need a little help.
The other is that my regex doesn't match the ending ];, which I also need.
Why use \W+ for one or more, if you want single non-word characters in output? Additionally exclude whitespace by use of a negated class. Also it seems like you could drop the word boundaries.
re.findall(r"\w+|[^\w\s]", string)
This matches
\w+ one or more word characters
|[^\w\s] or one character, that is neither a word character nor a whitespace
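Applied to the string from the question, the pattern can be tried out as follows. This is a small sketch; the variable name tokens is only illustrative.

```python
import re

string = "the quick brown(fox).jumps(over + the) = lazy[dog];"

# Words are kept whole, each symbol becomes its own item,
# and whitespace is skipped because neither alternative matches it.
tokens = re.findall(r"\w+|[^\w\s]", string)
print(tokens)
# ['the', 'quick', 'brown', '(', 'fox', ')', '.', 'jumps', '(', 'over',
#  '+', 'the', ')', '=', 'lazy', '[', 'dog', ']', ';']
```

Because no word boundary is required, the trailing ]; is matched as well.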
