is re.split("\W") = re.split("\w")? - python

As the title says, is re.split("\W") the same as re.split("\w") because the results I get are the same whichever I use. The same goes if it has + or not. Is this right? Or it works in some cases, and if yes why? Thank you in advance.

They are not the same thing at all:
>>> test_string = 'hello world'
>>> import re
>>> re.split('\w', test_string)
['', '', '', '', '', ' ', '', '', '', '', '']
>>> re.split('\W', test_string)
['hello', 'world']
re.split does the following:
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings.
\w and \W are:
\w Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
With LOCALE, it will match the set [0-9_] plus characters defined
as letters for the current locale.
\W Matches the complement of \w.

Related

Substitute all non letters/numbers with regex

I have the string 'hello (new)' and I would like to remove all non numbers and letters. One way to do this is by finding all letters and joining them:
>>> ''.join(re.findall(r'[a-zA-Z0-0]', 'hello (new)'))
'hellonew'
How would I do the reverse, that is, subtituting all non-characters to ''? So far I had:
>>> re.sub(r'^[a-zA-Z0-9]+', '', 'hello (new)')
' (new)'
But it's off a bit.
You should use a negated character class instead and remove the anchor at the front:
re.sub(r'[^a-z0-9]+', '', 'hello (new)', re.IGNORECASE)
You could match any non word character 1+ times using \W+.
If you want to keep the underscore which is matched by \w you could use a character class [\W_]+.
Python demo
import re
print(re.sub(r'\W+', '', 'hello (new)'))
Output
hellonew

Match only non-quoted words using a regex in python

While trying to process some code, I needed to find instances in which variables from a certain list were used. Problem is, the code is obfuscated and those variable names could also appear in a string, for example, which I didn't want to match.
However, I haven't been able to find a regex to match only non-quoted words that works in python...
"[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\w+)"
Should match any non-quoted word to the last group (6th group, index 5 with 0-based indexing). Minor modifications are required to avoid matching strings which begin with quotes.
Explanation:
[^\\\\] Match any character but an escape character. Escaped quotes do not start a string.
((\")|(')) Immediately after the non-escaped character, match either " or ', which starts a string. This is group 1, which contains groups 2 (\") and 3 (')
(?(2) if we matched group 2 (a double-quote)
([^\"]|\\\")*| match anything but double quotes, or match escaped double quotes. Otherwise:
([^']|\\')*) match anything but a single quote or match an escaped single quote.
If you wish to retrieve the string inside the quotes, you will have to add another group: (([^\"]|\\\")*) will allow you to retrieve the whole consumed string, rather than just the last matched character.
Note that the last character of a quoted string will actually be consumed by the last [^\\\\]. To retrieve it, you have to turn it into a group: ([^\\\\]). Additionally, The first character before the quote will also be consumed by [^\\\\], which might be meaningful in cases such as r"Raw\text".
[^\\\\]\\1 will match any non-escape character followed by what the first group matched again. That is, if ((\")|(')) matched a double quote, we requite a double quote to end the string. Otherwise, it matched a single quote, which is what we require to end the string.
|(\w+) will match any word. This will only match if non-quoted strings, as quoted strings will be consumed by the previous regex.
For example:
import re
non_quoted_words = "[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\w+)"
quote = "This \"is an example ' \\\" of \" some 'text \\\" like wtf' \\\" is what I said."
print(quote)
print(re.findall(non_quoted_words,quote))
will return:
This "is an example ' \" of " some 'text \" like wtf' \" is what I said.
[('', '', '', '', '', 'This'), ('"', '"', '', 'f', '', ''), ('', '', '', '', '', 'some'), ("'", '', "'", '', 't', ''), ('', '', '', '', '', 'is'), ('', '', '', '', '', 'what'), ('', '', '', '', '', 'I'), ('', '', '', '', '', 'said')]

Match words that don't start with a certain letter using regex

I am learning regex but have not been able to find the right regex in python for selecting characters that start with a particular alphabet.
Example below
text='this is a test'
match=re.findall('(?!t)\w*',text)
# match returns
['his', '', 'is', '', 'a', '', 'est', '']
match=re.findall('[^t]\w+',text)
# match
['his', ' is', ' a', ' test']
Expected : ['is','a']
With regex
Use the negative set [^\Wt] to match any alphanumeric character that is not t. To avoid matching subsets of words, add the word boundary metacharacter, \b, at the beginning of your pattern.
Also, do not forget that you should use raw strings for regex patterns.
import re
text = 'this is a test'
match = re.findall(r'\b[^\Wt]\w*', text)
print(match) # prints: ['is', 'a']
See the demo here.
Without regex
Note that this is also achievable without regex.
text = 'this is a test'
match = [word for word in text.split() if not word.startswith('t')]
print(match) # prints: ['is', 'a']
You are almost on the right track. You just forgot \b (word boundary) token:
\b(?!t)\w+
Live demo

Python regular expression split with \W

In Python document, I came across the following code snippet
>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split('(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
What I am confusing is that \W matches any character which is not a Unicode word character, but ',' is Unicode character. And what does the parentheses mean? I know it match a group but there is only one group in the pattern. Why ', ' is also return?
"any character which is not a Unicode word character" is a character being part of a word: letter or digit basically.
Comma cannot be part of a word.
And comma is included in the resulting list because the split regex is into parentheses (defining a group inside the split regex). That's how re.split works (That's the difference between your 2 code snippets)

Python how to separate punctuation from text

So I want to separate group of punctuation from the text with spaces.
my_text = "!where??and!!or$$then:)"
I want to have a ! where ?? and !! or $$ then :) as a result.
I wanted something like in Javascript, where you can use $1 to get your matching string. What I have tried so far:
my_matches = re.findall('[!"\$%&\'()*+,\-.\/:;=##?\[\\\]^_`{|}~]*', my_text)
Here my_matches is empty so I had to delete \\\ from the expression:
my_matches = re.findall('[!"\$%&\'()*+,\-.\/:;=##?\^_`{|}~]*', my_text)
I have this result:
['!', '', '', '', '', '', '??', '', '', '', '!!', '', '', '$$', '', '', '', '',
':)', '']
So I delete all the redundant entry like this:
my_matches_distinct = list(set(my_matches))
And I have a better result:
['', '??', ':)', '$$', '!', '!!']
Then I replace every match by himself and space:
for match in my_matches:
if match != '':
my_text = re.sub(match, ' ' + match + ' ', my_text)
And of course it's not working ! I tried to cast the match as a string, but it's not working either... When I try to put directly the string to replace it's working though.
But I think I'm not doing it right, because I will have problems with '!' et '!!' right?
Thanks :)
It is recommended to use raw string literals when defining a regex pattern. Besides, do not escape arbitrary symbols inside a character class, only \ must be always escaped, and others can be placed so that they do not need escaping. Also, your regex matches an empty string - and it does - due to *. Replace with + quantifier. Besides, if you want to remove these symbols from your string, use re.sub directly.
import re
my_text = "!where??and!!or$$then:)"
print(re.sub(r'[]!"$%&\'()*+,./:;=##?[\\^_`{|}~-]+', r' \g<0> ', my_text).strip())
# => ! where ?? and !! or $$ then :)
See the Python demo
Details: The []!"$%&'()*+,./:;=##?[\^_`{|}~-]+ matches any 1+ symbols from the set (note that only \ is escaped here since - is used at the end, and ] at the start of the class), and the replacement inserts a space + the whole match (the \g<0> is the backreference to the whole match) and a space. And .strip() will remove leading/trailing whitespace after the regex finishes processing the string.
string.punctuation NOTE
Those who think that they can use f"[{string.punctuation}]+" make a mistake because this won't match \. Why? Because the resulting pattern looks like [!"#$%&'()*+,-./:;<=>?#[\]^_`{|}~]+ and the \] part does not match a backslash or ], it only matches a ] since the \ escapes the ] char.
If you plan to use string.punctuation, you need to escape ] and \ (it would be also correct to escape - and ^ as these are the only special chars inside square brackets, but in this case, it would be redundant):
from string import punctuation
my_text = "!where??and!!or$$then:)"
pattern = "[" + punctuation.replace('\\','\\\\').replace(']', r'\]') + "]+"
print(re.sub(pattern, r' \g<0> ', my_text).strip())
# => ! where ?? and !! or $$ then :)
See this Python demo.
Use sub() method in re library. You can do this as follows,
import re
str = '!where??and!!or$$then:)'
print re.sub(r'([!##%\^&\*\(\):;"\',\./\\]+)', r' \1 ', str).strip()
I hope this code should solve your problem. If you are obvious with regex then the regex part is not a big deal. Just it is to use the right function.
Hope this helps! Please comment if you have any queries. :)
References:
Python re library

Categories