I need to split a string but keep the delimiters (dot, dash, and space): a dot or dash should stay attached to the word before it, and each space should be its own item.
str example:
a = 'Beautiful. is. better5-than ugly'
What I tried:
re.split(r'\W+', a)
['Beautiful', 'is', 'better5', 'than', 'ugly']
expected output:
['Beautiful.', ' ', 'is.', ' ', 'better5-', 'than', ' ', 'ugly']
Is it possible?
>>> import re
>>> a = 'Beautiful. is. better5-than ugly'
>>> re.findall(r"\w+[.-]?|\s+", a)
['Beautiful.', ' ', 'is.', ' ', 'better5-', 'than', ' ', 'ugly']
\w+[.-]? matches words with an optional dot or hyphen at the end.
\s+ matches whitespace.
| makes sure we capture either of the above.
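As a quick sanity check on the pattern above, here is a minimal sketch (the helper name tokenize is my own, not from the question). It also checks that nothing is lost: joining the tokens should give back the original string, at least for input made only of words, dots, hyphens and whitespace.

```python
import re

def tokenize(text):
    # A word with an optional trailing dot/hyphen, or a run of whitespace.
    return re.findall(r"\w+[.-]?|\s+", text)

a = 'Beautiful. is. better5-than ugly'
tokens = tokenize(a)
print(tokens)

# Characters the pattern does not cover (e.g. ',') are silently dropped,
# so this round-trip check only holds for input like the example.
print(''.join(tokens) == a)
```

If other punctuation can appear in the input, the character class [.-] would need to be extended accordingly.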
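Since we want the delimiters to be part of the result, we have to split without consuming them, so I used both "lookbehind" and "lookahead" assertions in the regex. You can read about them in the re module's documentation. Note that splitting on a pattern that can match an empty string, as this one does, requires Python 3.7 or later.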
import re
a = 'Beautiful. is. better5-than ugly'
print(re.split(r'(?<=[-. ])|(?= )', a))
Additional note: the "lookbehind" assertion alone achieves almost the same result, but it leaves the space attached to the word, giving "than " as one item. To split off that space as well, I added the "lookahead" assertion |(?= ) to the pattern.
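To make the difference concrete, here is a small sketch comparing the lookbehind-only pattern with the combined one (the variable names are mine; Python 3.7+ is assumed, since both patterns match empty strings):

```python
import re

a = 'Beautiful. is. better5-than ugly'

# Lookbehind only: split after every '-', '.' or ' '.
behind_only = re.split(r'(?<=[-. ])', a)

# Lookbehind plus lookahead: also split before every ' '.
both = re.split(r'(?<=[-. ])|(?= )', a)

print(behind_only)  # 'than ' still carries its trailing space
print(both)         # the space after 'than' is now its own item
```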
Related
Given this example:
s = "Hi, domain: (foo.bar.com) bye"
I'd like to create a regex that matches both word and non-word strings, separately, i.e:
re.findall(regex, s)
# Returns: ["Hi", ", ", "domain", ": (", "foo.bar.com", ") ", "bye"]
My approach was to use the word boundary separator \b to catch any string that is bound by two word-to-non-word switches. From the re module docs:
\b is defined as the boundary between a \w and a \W character (or vice versa)
Therefore I tried as a first step:
regex = r'(?:^|\b).*?(?=\b|$)'
re.findall(regex, s)
# Returns: ["Hi", ",", "domain", ": (", "foo", ".", "bar", ".", "com", ") ", "bye"]
The problem is that I don't want the dot (.) character to be a separator too, I'd like the regex to see foo.bar.com as a whole word and not as three words separated by dots.
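To see why the dot ends up as a separator, it may help to list where \b actually matches in a dotted name. This is only an illustrative sketch, not part of my original attempt:

```python
import re

# Start offsets of every \b (word boundary) in a dotted name.
positions = [m.start() for m in re.finditer(r'\b', 'foo.bar.com')]
print(positions)  # a boundary sits on both sides of each dot
```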
I tried to find a way to use a negative lookahead on dot but did not manage to make it work.
Is there any way to achieve that?
I don't mind that the dot won't be a separator at all in the regex, it doesn't have to be specific to domain names.
I looked at Regex word boundary alternative, Capture using word boundaries without stopping at "dot" and/or other characters and Regex word boundary excluding the hyphen but it does not fit my case as I cannot use the space as a separator condition.
Exclude some characters from word boundary is the only one that got me close, but I couldn't get it to work.
You may use this regex in findall:
\w+(?:\.\w+)*|\W+
This matches either a word followed by zero or more dot-separated words, or a run of one or more non-word characters.
Code:
import re
s = "Hi, domain: (foo.bar.com) bye"
print(re.findall(r'\w+(?:\.\w+)*|\W+', s))
Output:
['Hi', ', ', 'domain', ': (', 'foo.bar.com', ') ', 'bye']
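One property worth checking is what happens to a dot that does not sit between two words. A short sketch (the second input string is made up for illustration):

```python
import re

pattern = re.compile(r'\w+(?:\.\w+)*|\W+')

first = pattern.findall("Hi, domain: (foo.bar.com) bye")
# A dot only joins two words; a trailing dot falls into the non-word run.
second = pattern.findall("See foo.bar.com. Bye")

print(first)
print(second)
```

So a sentence-ending dot is not glued to the preceding word, which is probably what you want for ordinary text.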
For your example, you could just split on [^\w.]+, using a capturing group around it to keep those values in the output:
import re
s = "Hi, domain: (foo.bar.com) bye"
re.split(r'([^\w.]+)', s)
# ['Hi', ', ', 'domain', ': (', 'foo.bar.com', ') ', 'bye']
If your string might start or end with non-word/space characters, you can filter out the resulting empty strings in the list with a comprehension:
s = "!! Hello foo.bar.com, your domain ##"
re.split(r'([^\w.]+)', s)
# ['', '!! ', 'Hello', ' ', 'foo.bar.com', ', ', 'your', ' ', 'domain', ' ##', '']
[w for w in re.split(r'([^\w.]+)', s) if len(w)]
# ['!! ', 'Hello', ' ', 'foo.bar.com', ', ', 'your', ' ', 'domain', ' ##']
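As an alternative sketch (not something the question asked for), re.findall with two complementary character classes should give the same list without producing empty strings in the first place:

```python
import re

s = "!! Hello foo.bar.com, your domain ##"

# Split on the separators, then drop the empty strings at the edges.
via_split = [w for w in re.split(r'([^\w.]+)', s) if w]

# Match runs of "word or dot" characters, or runs of everything else.
via_findall = re.findall(r'[\w.]+|[^\w.]+', s)

print(via_findall)
print(via_split == via_findall)
```

Note that both variants treat the dot as a word character everywhere, so a sentence-ending dot stays attached to the word before it.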
Lookarounds let you easily say "dot, except if it's surrounded by alphabetics on both sides" if that's what you mean;
re.findall(r'(?:^|\b)(\w+(?:\.\w+)*|\W+)(?!\.\w)(?=\b|$)', s)
or simply "word boundary, unless it's a dot":
re.findall(r'(?:^|(?<!\.)\b(?!\.)).+?(?=(?<!\.)\b(?!\.)|$)', s)
Notice that the latter will join text across a word boundary if it's a dot; so, for example, 'bye. ' would be extracted as one string.
(Perhaps try to be more precise about your requirements?)
Demo: https://ideone.com/dvQhFO
>>> sentence = "Thomas Jefferson began building Monticello at the age of 26."
>>> tokens1 = re.split(r"([-\s.,;!?])+", sentence)
>>> tokens2 = re.split(r"[-\s.,;!?]+", sentence)
>>> tokens1
['Thomas', ' ', 'Jefferson', ' ', 'began', ' ', 'building', ' ', 'Monticello', ' ', 'at', ' ', 'the', ' ', 'age', ' ', 'of', ' ', '26', '.', '']
>>> tokens2
['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '']
Can you explain the purpose of ( and )?
(..) in a regex denotes a capturing group (aka "capturing parentheses"). They are used when you want to extract values out of a pattern. In this case, you are using the re.split function, which behaves in a specific way when the pattern has capturing groups. According to the documentation:
re.split(pattern, string, maxsplit=0, flags=0)
Split string by the occurrences of pattern. If capturing parentheses
are used in pattern, then the text of all groups in the pattern are
also returned as part of the resulting list.
So normally, the delimiters used to split the string are not present in the result, like in your second example. However, if you use (), the text captured in the groups will also be in the result of the split. This is why you get a lot of ' ' in the first example. That is what is captured by your group ([-\s.,;!?]).
With a capturing group (()) in the regex used to split a string, split will include the captured parts.
In your case, you are splitting on one or more characters of whitespace and/or punctuation, and capturing only the last of those characters to include in the split parts, which seems kind of a weird thing to do. I'd have expected you might want to capture all of the separator, which would look like r"([-\s.,;!?]+)" (capturing one or more whitespace/punctuation characters, rather than matching one or more but only capturing the last).
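The three behaviours can be compared side by side. This is a minimal sketch on a made-up input with a two-character separator; the variable names are mine:

```python
import re

text = "a, b"

# No group: the separators are dropped.
no_group = re.split(r"[\s,]+", text)

# Repeated group: only the last character matched by the group is kept.
last_char = re.split(r"([\s,])+", text)

# Repetition inside the group: the whole separator is kept.
whole_sep = re.split(r"([\s,]+)", text)

print(no_group)
print(last_char)
print(whole_sep)
```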
I want to use re.sub to remove leading and trailing whitespace from single-quoted strings embedded in a larger string. If I have, say,
textin = " foo ' bar nox ': glop ,' frox ' "
I want to produce
desired = " foo 'bar nox': glop ,'frox' "
Removing the leading whitespace is relatively straightforward.
>>> lstripped = re.sub(r"'\s*([^']*')", r"'\1", textin)
>>> lstripped
" foo 'bar nox ': glop ,'frox ' "
The problem is removing the trailing whitespace. I tried, for example,
>>> rstripped = re.sub(r"('[^']*)(\s*')", r"\1'", lstripped)
>>> rstripped
" foo 'bar nox ': glop ,'frox ' "
but that fails because the [^']* matches the trailing whitespace.
I thought about using lookbehind patterns, but the re docs say they can only contain fixed-length patterns.
I'm sure this is a previously solved problem but I'm stumped.
Thanks!
EDIT: The solution needs to handle strings containing a single non-whitespace character and empty strings, i.e. ' p ' --> 'p' and ' ' --> ''.
[^\']* is greedy, i.e. it also consumes the trailing spaces and/or tabs, so let's use the non-greedy version: [^\']*?
In [66]: re.sub(r'\'\s*([^\']*?)\s*\'','\'\\1\'', textin)
Out[66]: " foo 'bar nox': glop ,'frox' "
Less escaped version:
re.sub(r"'\s*([^']*?)\s*'", r"'\1'", textin)
The way to catch the whitespace is to make the preceding * non-greedy: instead of r"('[^']*)(\s*')" use r"('[^']*?)(\s*')".
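A small sketch of the difference, starting from the already left-stripped string in the question (the variable names greedy and lazy are mine):

```python
import re

lstripped = " foo 'bar nox ': glop ,'frox ' "

# Greedy: [^']* swallows the trailing spaces, so \s* has nothing left to match.
greedy = re.sub(r"('[^']*)(\s*')", r"\1'", lstripped)

# Non-greedy: [^']*? stops as early as possible, leaving the spaces to \s*.
lazy = re.sub(r"('[^']*?)(\s*')", r"\1'", lstripped)

print(repr(greedy))  # unchanged
print(repr(lazy))    # trailing whitespace inside the quotes removed
```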
You can also catch both sides with a single regex:
stripped = re.sub(r"'\s*([^']*?)\s*'", r"'\1'", textin)
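To check the edge cases mentioned in the question's EDIT (a single non-whitespace character and an empty string), here is a sketch that wraps the single regex in a helper. The name strip_quoted is hypothetical, and the helper assumes the quotes are balanced:

```python
import re

def strip_quoted(text):
    # Hypothetical helper: trims whitespace just inside each pair of single quotes.
    return re.sub(r"'\s*([^']*?)\s*'", r"'\1'", text)

print(repr(strip_quoted(" foo ' bar nox ': glop ,' frox ' ")))
print(repr(strip_quoted("' p '")))  # single non-whitespace character
print(repr(strip_quoted("' '")))    # whitespace only
```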
This seems to work:
'(\s*)(.*?)(\s*)'
' # an apostrophe
(\s*) # 0 or more white-space characters (leading white-space)
(.*?) # 0 or more any character, lazily matched (keep)
(\s*) # 0 or more white-space characters (trailing white-space)
' # an apostrophe
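The commented breakdown above can be written almost literally in Python with the re.VERBOSE flag. This is a sketch; the replacement r"'\2'" (keep only the second group, re-add the quotes) is my assumption about how the pattern is meant to be used:

```python
import re

pattern = re.compile(r"""
    '        # an apostrophe
    (\s*)    # 0 or more white-space characters (leading white-space)
    (.*?)    # 0 or more of any character, lazily matched (keep)
    (\s*)    # 0 or more white-space characters (trailing white-space)
    '        # an apostrophe
""", re.VERBOSE)

textin = " foo ' bar nox ': glop ,' frox ' "
stripped = pattern.sub(r"'\2'", textin)
print(repr(stripped))
```

Keep in mind that . does not match a newline by default, so quoted strings spanning several lines would need the re.DOTALL flag as well.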
I know this is probably a really easy question, but I'm struggling to split a string in Python. My regex has grouping parentheses, like this:
myRegex = r"(\W+)"
And I want to parse this string into words:
testString = "This is my test string, hopefully I can get the word i need"
testAgain = re.split(r"(\W+)", testString)
Here are the results:
['This', ' ', 'is', ' ', 'my', ' ', 'test', ' ', 'string', ', ', 'hopefully', ' ', 'I', ' ', 'can', ' ', 'get', ' ', 'the', ' ', 'word', ' ', 'i', ' ', 'need']
Which isn't what I expected. I am expecting the list to contain:
['This','is','my','test']......etc
Now I know it's something to do with the grouping in my regex, and I can fix the issue by removing the brackets. But how can I keep the brackets and get the result above?
Sorry about this question. I have read the official Python documentation on regex splitting with groups, but I still don't understand why the spaces are in my list.
As described in this answer, How to split but ignore separators in quoted strings, in python?, you can simply slice the list once it's split. It's easy to do so because you want every other member, starting with the first one (so the 1st, 3rd, 5th, 7th, ... items, i.e. indices 0, 2, 4, 6, ...).
You can use the [start:end:step] notation as described below:
testString = "This is my test string, hopefully I can get the word i need"
testAgain = re.split(r"(\W+)", testString)
testAgain = testAgain[0::2]
Also, I must point out that \W matches any non-word characters, including punctuation. If you want to keep your punctuation, you'll need to change your regex.
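Here is a short sketch of the slicing on a shortened test string. The last part is an assumption about the intent: if the parentheses are only needed for grouping, a non-capturing group (?:...) keeps them without adding the separators to the result.

```python
import re

testString = "This is my test string, hopefully"
parts = re.split(r"(\W+)", testString)

words = parts[0::2]        # even indices: the text between separators
separators = parts[1::2]   # odd indices: what the group captured

# Non-capturing group: brackets stay in the regex, separators are not returned.
words_again = re.split(r"(?:\W+)", testString)

print(words)
print(separators)
print(words_again == words)
```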
You can simply do:
testAgain = testString.split() # built-in split with space
Different regex ways of doing this:
testAgain = re.split(r"\s+", testString) # split with space
testAgain = re.findall(r"\w+", testString) # find all words
testAgain = re.findall(r"\S+", testString) # find all non space characters
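These variants are not equivalent once punctuation is involved. A quick sketch on a shortened input (the variable names are mine) shows which ones keep the comma attached to the word:

```python
import re

text = "my test string, hopefully"

by_str_split = text.split()               # whitespace split, comma stays on 'string,'
by_re_split = re.split(r"\s+", text)      # same result for this input
only_words = re.findall(r"\w+", text)     # punctuation is dropped
non_space = re.findall(r"\S+", text)      # comma stays on 'string,'

print(by_str_split)
print(by_re_split)
print(only_words)
print(non_space)
```

Only the \w+ version returns bare words here; the other three leave 'string,' with its comma.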
I want to parse a file, that contains some programming language. I want to get a list of all symbols etc.
I tried a few patterns and decided that this is the most successful yet:
pattern = r"\b(\w+|\W+)\b"
Using this on my text, that is something like:
string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
re.findall(pattern, string)
will result in my required output, but I have some chars that I don't want and some unwanted formatting:
['the', ' ', 'quick', ' ', 'brown', '(', 'fox', ').', 'jumps', '(', 'over',
' + ', 'the', ') = ', 'lazy', '[', 'dog']
My list contains some whitespace that I would like to get rid of and some double symbols, like )., that I would like to have as single chars. Of course I have to modify the \W+ to get this done, but I need a little help.
The other problem is that my regex doesn't match the ending ];, which I also need.
Why use \W+ for one or more, if you want single non-word characters in output? Additionally exclude whitespace by use of a negated class. Also it seems like you could drop the word boundaries.
re.findall(r"\w+|[^\w\s]", string)
This matches
\w+ one or more word characters
|[^\w\s] or one character, that is neither a word character nor a whitespace
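Applied to the string from the question, a minimal sketch looks like this (tokens is my own variable name):

```python
import re

string = "the quick brown(fox).jumps(over + the) = lazy[dog];"

# Words as whole tokens; every other non-space character as its own token.
tokens = re.findall(r"\w+|[^\w\s]", string)
print(tokens)
```

Whitespace is skipped entirely, ). comes out as two separate tokens, and the trailing ]; is included.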