Python: how to separate punctuation from text

I want to separate groups of punctuation from the text with spaces.
my_text = "!where??and!!or$$then:)"
I want to get ! where ?? and !! or $$ then :) as a result.
I wanted something like in JavaScript, where you can use $1 to get the matched string. What I have tried so far:
my_matches = re.findall('[!"\$%&\'()*+,\-.\/:;=##?\[\\\]^_`{|}~]*', my_text)
Here my_matches is empty so I had to delete \\\ from the expression:
my_matches = re.findall('[!"\$%&\'()*+,\-.\/:;=##?\^_`{|}~]*', my_text)
I have this result:
['!', '', '', '', '', '', '??', '', '', '', '!!', '', '', '$$', '', '', '', '',
':)', '']
So I removed all the redundant entries like this:
my_matches_distinct = list(set(my_matches))
And I have a better result:
['', '??', ':)', '$$', '!', '!!']
Then I replace every match with itself surrounded by spaces:
for match in my_matches:
    if match != '':
        my_text = re.sub(match, ' ' + match + ' ', my_text)
And of course it's not working! I tried to cast the match to a string, but that's not working either. When I put the string to replace in directly, it works, though.
But I think I'm not doing it right, because I will have problems with '!' and '!!', right?
Thanks :)

It is recommended to use raw string literals when defining a regex pattern. Also, do not escape arbitrary symbols inside a character class: only \ must always be escaped, and the others can be positioned so that they do not need escaping. Your regex also matches an empty string because of the * quantifier; replace it with +. Finally, if you want to surround these symbols with spaces, use re.sub directly.
import re
my_text = "!where??and!!or$$then:)"
print(re.sub(r'[]!"$%&\'()*+,./:;=##?[\\^_`{|}~-]+', r' \g<0> ', my_text).strip())
# => ! where ?? and !! or $$ then :)
Details: The []!"$%&'()*+,./:;=##?[\\^_`{|}~-]+ pattern matches one or more symbols from the set (note that only \ is escaped here, since - is placed at the end and ] at the start of the class). The replacement inserts a space, the whole match (\g<0> is the backreference to the whole match), and another space. .strip() then removes leading/trailing whitespace after the regex finishes processing the string.
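For readers coming from JavaScript's $1, here is a minimal sketch of three equivalent ways to refer to the matched text in re.sub. The short character class and the pad_* function names are assumptions made for this illustration only; they are not part of the answer above.

```python
import re

my_text = "!where??and!!or$$then:)"

# Deliberately small punctuation class, chosen only for this demo.
PUNCT_RUN = r'[!?$:()]+'

def pad_whole_match(text):
    # \g<0> refers to the entire match, so no capturing group is needed.
    return re.sub(PUNCT_RUN, r' \g<0> ', text).strip()

def pad_group(text):
    # \1 refers to the first capturing group, much like $1 in JavaScript.
    return re.sub('(' + PUNCT_RUN + ')', r' \1 ', text).strip()

def pad_callable(text):
    # A function receives the match object and returns the replacement text.
    return re.sub(PUNCT_RUN, lambda m: ' ' + m.group(0) + ' ', text).strip()

print(pad_whole_match(my_text))  # ! where ?? and !! or $$ then :)
```

All three helpers should give the same result on this input; the callable form is usually the most convenient when the replacement depends on what was matched.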
string.punctuation NOTE
Those who think they can use f"[{string.punctuation}]+" are mistaken, because this pattern won't match \. Why? The resulting pattern looks like [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]+, and the \] part does not match a backslash or ]; it only matches ], since the \ escapes the ] char.
If you plan to use string.punctuation, you need to escape ] and \ (it would be also correct to escape - and ^ as these are the only special chars inside square brackets, but in this case, it would be redundant):
import re
from string import punctuation
my_text = "!where??and!!or$$then:)"
pattern = "[" + punctuation.replace('\\','\\\\').replace(']', r'\]') + "]+"
print(re.sub(pattern, r' \g<0> ', my_text).strip())
# => ! where ?? and !! or $$ then :)
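The pitfall can be checked directly. The sketch below compares the naive class with one built through re.escape, which is an alternative to the manual .replace() calls and not what the answer itself uses; the variable names are illustrative.

```python
import re
from string import punctuation

# Naive class: the \] inside it is read as an escaped ], so no backslash is in the set.
naive = re.compile("[" + punctuation + "]+")

# Alternative (an assumption of this sketch): let re.escape handle ], \, - and ^.
escaped = re.compile("[" + re.escape(punctuation) + "]+")

backslash = "\\"
print(naive.fullmatch(backslash))                # None
print(escaped.fullmatch(backslash) is not None)  # True

# Every character of string.punctuation should be matched by the escaped class.
missed = [ch for ch in punctuation if not escaped.fullmatch(ch)]
print(missed)  # []
```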

Use the sub() function from the re library. You can do this as follows:
import re
text = '!where??and!!or$$then:)'
print(re.sub(r'([!?$@#%^&*():;"\',./\\]+)', r' \1 ', text).strip())
This should solve your problem. If you are comfortable with regex, the pattern itself is not a big deal; it is just a matter of using the right function.
Hope this helps! Please comment if you have any queries. :)
References:
Python re library

Related

Regex - Remove space between two punctuation marks but not between punctuation mark and letter

I have the following regex for removing spaces between punctuation marks.
re.sub(r'\s*(\W)\s*', r'\1', s)
which works fine in almost all of my test cases, except for this one:
This is! ? a test! ?
For which I need to have
This is!? a test!?
and get
This is!?a test!?
How do I NOT remove the space between that ? and 'a'? What am I missing?
This should work:
import re
text = 'This is! ? a test! ?'
res = re.sub(r'(?<=[?!])\s+(?=[?!])', '', text)
print(res)
Output:
This is!? a test!?
Explanation:
(?<=[?!]) # positive lookbehind, make sure we have a punctuation before (you can add all punctuations you want to check)
\s+ # 1 or more spaces
(?=[?!]) # positive lookahead, make sure we have a punctuation after
Try this:
string = "This is! ? a test! ?"
string = re.sub(r"(\W)\s*(\W)", r"\1\2", string)
print(string)
Output:
This is!? a test!?
To match a punctuation char with a regex in Python, you may use the (?:[^\w\s]|_) pattern, which matches any char other than a letter, digit or whitespace.
So, you need to match one or more whitespaces (\s+) that are immediately preceded by a punctuation char ((?<=[^\w\s]|_)) and immediately followed by such a char ((?=[^\w\s]|_)):
(?<=[^\w\s]|_)\s+(?=[^\w\s]|_)
Python demo:
import re
text = "This is! ? a test! ?"
print( re.sub(r"(?<=[^\w\s]|_)\s+(?=[^\w\s]|_)", "", text) )
# => This is!? a test!?
Another option is to use the PyPI regex module, which supports \p{Punct} inside positive lookarounds to match the punctuation marks. For example:
import regex
pattern = r"(?<=\p{Punct})\s+(?=\p{Punct})"
s = 'This is! ? a test! ?'
print(regex.sub(pattern, '', s))
Output
This is!? a test!?
Note that \s could also match a newline. You could also use [^\S\r\n] to match a whitespace char except newlines.
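A small sketch of that difference, using the standard re module and the [?!] lookarounds from the earlier answer. The sample string is made up for the illustration.

```python
import re

# Hypothetical input with a newline between two punctuation marks.
s = "Stop!\n?Go! ?"

# \s+ also matches the newline, so the two lines are joined.
any_ws = re.sub(r'(?<=[?!])\s+(?=[?!])', '', s)

# [^\S\r\n]+ matches whitespace except CR and LF, so the line break survives.
no_newline = re.sub(r'(?<=[?!])[^\S\r\n]+(?=[?!])', '', s)

print(repr(any_ws))      # 'Stop!?Go!?'
print(repr(no_newline))  # 'Stop!\n?Go!?'
```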

Python regular expression truncate string by special character with one leading space

I need to truncate a string at the special characters '-', '(', '/' when they have one leading whitespace, i.e. ' -', ' (', ' /'.
How can I do that?
patterns=r'[-/()]'
try:
    return row.split(re.findall(patterns, row)[0], 1)[0]
except:
    return row
The above code picks up all the special characters, but without the leading space.
patterns=r'[s-/()]'
This one does not work.
Try this pattern
patterns=r'^\s[-/()]'
or remove ^ depending on your needs.
It looks like you want to get a part of the string before the first occurrence of \s[-(/] pattern.
Use
return re.sub(r'\s[-(/].*', '', row)
This returns the part of row before the first occurrence of a whitespace (\s) followed by -, ( or / ([-(/]); everything from that point on is removed. If there is no such occurrence, row is returned unchanged.
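As a runnable sketch, the re.sub call can be wrapped in a function. The name truncate_at_marker and the sample strings are assumptions made for this illustration.

```python
import re

def truncate_at_marker(row):
    # Drop everything from the first whitespace char that is followed by
    # -, ( or / up to the end of the line (. does not match a newline).
    return re.sub(r'\s[-(/].*', '', row)

print(truncate_at_marker("Acme Corp - West"))  # Acme Corp
print(truncate_at_marker("Widget (blue)"))     # Widget
print(truncate_at_marker("well-known item"))   # well-known item
print(truncate_at_marker("A/B test /v2"))      # A/B test
```

A hyphen or slash without a leading whitespace char is left alone, which is the behaviour the question asks for.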
Please try this pattern patterns = r'\s+-|\s\/|\s\(|\s\)'

Remove Twitter mentions from Pandas column

I have a dataset that includes Tweets from Twitter. Some of them also have user mentions such as #thisisauser. I try to remove that text at the same time I do other cleaning processes.
def clean_text(row, options):
    if options['lowercase']:
        row = row.lower()
    if options['decode_html']:
        txt = BeautifulSoup(row, 'lxml')
        row = txt.get_text()
    if options['remove_url']:
        row = row.replace('http\S+|www.\S+', '')
    if options['remove_mentions']:
        row = row.replace('#[A-Za-z0-9]+', '')
    return row
clean_config = {
    'remove_url': True,
    'remove_mentions': True,
    'decode_utf8': True,
    'lowercase': True
}
df['tweet'] = df['tweet'].apply(clean_text, args=(clean_config,))
However, when I run the above code, all the Twitter mentions are still on the text. I verified with a Regex online tool that my Regex is working correctly, so the problem should be on the Pandas's code.
You are misusing the replace method of str: it does not accept regular expressions, only fixed strings (see docs at https://docs.python.org/2/library/stdtypes.html#str.replace for more).
The right way to achieve what you need is to use the re module, like this:
import re
re.sub("#[A-Za-z0-9]+","", "#thisisauser text")
' text'
The problem is with the way you used the replace method, not with Pandas. See the output from the REPL:
>>> my_str = "#thisisause"
>>> my_str.replace('#[A-Za-z0-9]+', '')
'#thisisause'
str.replace doesn't support regex. Use Python's regular expression library instead, as mentioned in the other answer:
>>> import re
>>> my_str = 'hello #username hi'
>>> re.sub("#[A-Za-z0-9]+", "", my_str)
'hello  hi'
To remove Twitter mentions, or words that start with a # char, in Pandas, you can use one of the following:
df['tweet'] = df['tweet'].str.replace(r'\s*#\w+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*\B#\w+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*#\S+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*#\S+\b', '', regex=True)
If you need to remove remaining leading/trailing spaces after the replacement, add .str.strip() after .str.replace call.
Details:
\s*#\w+ - zero or more whitespaces, # and one or more "word" chars (letters, digits, underscores and, in Python 3.x, other connector punctuation and some diacritics)
\s*\B#\w+ - zero or more whitespaces, a position other than a word boundary, # and one or more "word" chars
\s*#\S+ - zero or more whitespaces, # and one or more non-whitespace chars
\s*#\S+\b - zero or more whitespaces, # and one or more non-whitespace chars (as many as possible) followed by a word boundary.
Without Pandas, use one of the above regexps in re.sub:
text = re.sub(r'...pattern here...', '', text)
# or
text = re.sub(r'...pattern here...', '', text).strip()
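To see how the four patterns differ, here is a minimal sketch that applies each of them with re.sub to one made-up string. It keeps the # prefix used in the text above, and the sample string is an assumption of this illustration.

```python
import re

# Hypothetical tweet: two mentions plus a '#' glued to a preceding word char.
text = "thanks #alice and #bob_2! mail: a#b.com"

patterns = [r'\s*#\w+', r'\s*\B#\w+', r'\s*#\S+', r'\s*#\S+\b']
results = {p: re.sub(p, '', text) for p in patterns}

for pattern, cleaned in results.items():
    print(pattern.ljust(12), repr(cleaned))
# \s*#\w+      'thanks and! mail: a.com'
# \s*\B#\w+    'thanks and! mail: a#b.com'
# \s*#\S+      'thanks and mail: a'
# \s*#\S+\b    'thanks and! mail: a'
```

Only the \B variant leaves a#b.com intact, because the # there sits right after a word char.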

Add string between tabs and text

I simply want to add string after (0 or more) tabs in the beginning of a string.
i.e.
a = '\t\t\tHere is the next part of string. More garbage.'
(insert Added String here.)
to
b = '\t\t\t Added String here. Here is the next part of string. More garbage.'
What is the easiest/simplest way to go about it?
Simple:
re.sub(r'^(\t*)', r'\1 Added String here. ', inputtext)
The ^ caret matches the start of the string, and \t matches a tab character, of which there should be zero or more (*). The parentheses capture the matched tabs for use in the replacement string, where \1 inserts them again in front of the string you are adding.
Demo:
>>> import re
>>> a = '\t\t\tHere is the next part of string. More garbage.'
>>> re.sub(r'^(\t*)', r'\1 Added String here. ', a)
'\t\t\t Added String here. Here is the next part of string. More garbage.'
>>> re.sub(r'^(\t*)', r'\1 Added String here. ', 'No leading tabs.')
' Added String here. No leading tabs.'
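If the text to insert may itself contain backslashes, a replacement template would interpret them as escapes. A variant that avoids this passes a function instead of a template; this is a sketch, and the name insert_after_tabs is an assumption of this illustration.

```python
import re

def insert_after_tabs(line, added):
    # The value returned by a callable replacement is used literally, so
    # backslashes in `added` are not treated as escapes such as \1 or \n.
    return re.sub(r'^\t*', lambda m: m.group(0) + added, line, count=1)

print(repr(insert_after_tabs('\t\tbody', 'X ')))    # '\t\tX body'
print(repr(insert_after_tabs('body', r'C:\new ')))  # 'C:\\new body'
```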

Python Regex - Match a character without consuming it

I would like to convert the following string
"For "The" Win","Way "To" Go"
to
"For ""The"" Win","Way ""To"" Go"
The straightforward regex would be
str2 = re.sub(r'(?<!,|^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,flags=re.MULTILINE)
i.e., Double the quotes that are
Followed by a letter but not preceded by a comma or the beginning of line
Preceded by a letter but not followed by a comma or the end of line
The problem is that I am using Python, and its regex engine does not allow using the OR operator in the lookbehind construct. I get the error
sre_constants.error: look-behind requires fixed-width pattern
What I am looking for is a regex that will replace the '"' around 'The' and 'To' with '""'.
I can use the following regex (An answer provided to another question)
\b\s*"(?!,|[ \t]*$)
but that consumes the space just before the 'The' and 'To' and I get the below
"For""The"" Win","Way""To"" Go"
Is there a workaround so that I can double the quotes around 'The' and 'To' without consuming the spaces just before them?
Instead of saying not preceded by comma or the line start, say preceded by a non-comma character:
r'(?<=[^,])"(?=\w)|(?<=\w)"(?!,|$)'
Looks to me like you don't need to bother with anchors.
If there is a character before the quote, you know it's not at the beginning of the string.
If that character is not a newline, you're not at the beginning of a line.
If the character is not a comma, you're not at the beginning of a field.
So you don't need to use anchors, just do a positive lookbehind/lookahead for a single character:
result = re.sub(r'(?<=[^",\r\n])"(?=[^,"\r\n])', '""', subject)
I threw in the " on the chance that there might be some quotes that are already escaped. But realistically, if that's the case you're probably screwed anyway. ;)
re.sub(r'\b(\s*)"(?!,|[ \t]*$)', r'\1""', s)
Most direct workaround whenever you encounter this issue: explode the look-behind into two look-behinds.
str2 = re.sub(r'(?<!,)(?<!^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,flags=re.MULTILINE)
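A runnable sketch of this workaround on the sample string from the question. The combined_ok flag is a hypothetical name that only records whether the original single look-behind compiles on the Python version in use.

```python
import re

str1 = '"For "The" Win","Way "To" Go"'

# The combined look-behind mixes a 1-char alternative (,) with a 0-width one (^),
# which the re module is expected to reject as not fixed-width.
try:
    re.compile(r'(?<!,|^)"(?=\w)')
    combined_ok = True
except re.error:
    combined_ok = False

# Two separate look-behinds, each of fixed width, express the same condition.
split_pattern = r'(?<!,)(?<!^)"(?=\w)|(?<=\w)"(?!,|$)'
str2 = re.sub(split_pattern, '""', str1, flags=re.MULTILINE)

print(combined_ok)
print(str2)  # "For ""The"" Win","Way ""To"" Go"
```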
(don't name your strings str)
str2 = re.sub(r'(?<=[^,])"(?=\w)'
              r'|'
              r'(?<=\w)"(?!,|$)',
              '""', ss,
              flags=re.MULTILINE)
Note that I changed your str, which is the name of a builtin class, to ss. The pattern pieces are raw strings because \w is not a valid escape sequence in an ordinary string literal, and recent Python versions warn about it.
For fun:
str2 = re.sub(r'"'
              r'('
              r'(?<=[^,]")(?=\w)'
              r'|'
              r'(?<=\w")(?!,|$)'
              r')',
              '""', ss,
              flags=re.MULTILINE)
or also
str2 = re.sub(r'(?<=[^,]")(?=\w)'
              r'|'
              r'(?<=\w")(?!,|$)',
              '"', ss,
              flags=re.MULTILINE)
