Python how to separate punctuation from text

I want to separate groups of punctuation from the text with spaces.
my_text = "!where??and!!or$$then:)"
I want to get ! where ?? and !! or $$ then :) as a result.
I wanted something like in Javascript, where you can use $1 to get your matching string. What I have tried so far:
my_matches = re.findall('[!"\$%&\'()*+,\-.\/:;=##?\[\\\]^_`{|}~]*', my_text)
Here my_matches is empty so I had to delete \\\ from the expression:
my_matches = re.findall('[!"\$%&\'()*+,\-.\/:;=##?\^_`{|}~]*', my_text)
I have this result:
['!', '', '', '', '', '', '??', '', '', '', '!!', '', '', '$$', '', '', '', '',
':)', '']
So I removed all the redundant entries like this:
my_matches_distinct = list(set(my_matches))
And I have a better result:
['', '??', ':)', '$$', '!', '!!']
Then I replace every match with itself surrounded by spaces:
for match in my_matches:
    if match != '':
        my_text = re.sub(match, ' ' + match + ' ', my_text)
And of course it's not working! I tried to cast the match to a string, but that doesn't work either. When I pass the string to replace directly, it works, though.
But I think I'm not doing it right, because I will have problems with '!' and '!!', right?
Thanks :)

It is recommended to use raw string literals when defining a regex pattern. Also, do not escape arbitrary symbols inside a character class: only \ must always be escaped, and the others can be placed so that they do not need escaping. Your regex also matches an empty string because of the * quantifier, which is why the result contains empty entries; replace it with the + quantifier. Finally, if you want to replace the matches in your string (here, pad them with spaces), use re.sub directly.
import re
my_text = "!where??and!!or$$then:)"
print(re.sub(r'[]!"$%&\'()*+,./:;=##?[\\^_`{|}~-]+', r' \g<0> ', my_text).strip())
# => ! where ?? and !! or $$ then :)
Details: The []!"$%&'()*+,./:;=##?[\^_`{|}~-]+ matches any 1+ symbols from the set (note that only \ is escaped here since - is used at the end, and ] at the start of the class), and the replacement inserts a space + the whole match (the \g<0> is the backreference to the whole match) and a space. And .strip() will remove leading/trailing whitespace after the regex finishes processing the string.
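The question asks for something like JavaScript's $1. As a minimal sketch of the equivalents in re.sub: \g<0> refers to the whole match, \1 to the first capturing group, and a function can build the replacement from the match object. The character class PUNCT_RUN below is a shortened, illustrative one (an assumption for this example, not the full class from the answer).

```python
import re

# Shortened, illustrative class; not the full punctuation set from the answer.
PUNCT_RUN = r'[!?$:;)(]+'

text = "!where??and!!or$$then:)"

# 1. \g<0> is the whole match, so no capturing group is needed.
whole_match = re.sub(PUNCT_RUN, r' \g<0> ', text).strip()

# 2. \1 needs an explicit capturing group around the pattern.
group_one = re.sub('(' + PUNCT_RUN + ')', r' \1 ', text).strip()

# 3. A function receives the match object and returns the replacement.
callback = re.sub(PUNCT_RUN, lambda m: ' ' + m.group(0) + ' ', text).strip()

print(whole_match)
print(group_one == whole_match, callback == whole_match)
```

All three calls produce the same string; \g<0> is usually the shortest to write when the whole match is reused.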
string.punctuation NOTE
Those who think they can use f"[{string.punctuation}]+" are mistaken, because this won't match \. Why? Because the resulting pattern looks like [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]+ and the \] part does not match a backslash or ]; it only matches a ], since the \ escapes the ] char.
If you plan to use string.punctuation, you need to escape ] and \ (it would also be correct to escape - and ^, the only other special chars inside square brackets, but in this case it would be redundant):
import re
from string import punctuation
my_text = "!where??and!!or$$then:)"
pattern = "[" + punctuation.replace('\\','\\\\').replace(']', r'\]') + "]+"
print(re.sub(pattern, r' \g<0> ', my_text).strip())
# => ! where ?? and !! or $$ then :)
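As an alternative sketch, re.escape can do the escaping instead of the chained .replace calls. This is not part of the answer above; the helper name pad_punctuation is hypothetical, and the example assumes only ASCII punctuation matters.

```python
import re
from string import punctuation

# re.escape() backslash-escapes the characters that are special in a regex,
# including ], \, - and ^, so the result is safe to place inside [...].
PUNCT_RUN = re.compile("[" + re.escape(punctuation) + "]+")

def pad_punctuation(text):
    """Put a space on each side of every run of ASCII punctuation."""
    return PUNCT_RUN.sub(r" \g<0> ", text).strip()

print(pad_punctuation("!where??and!!or$$then:)"))
print(pad_punctuation(r"path\to]file"))  # backslash and ] are both matched
```

Compiling the pattern once keeps the helper cheap to call in a loop.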

Use the sub() function from the re library. You can do this as follows:
import re
text = '!where??and!!or$$then:)'
print(re.sub(r'([!?$@#%^&*():;"\',./\\]+)', r' \1 ', text).strip())
If you are familiar with regex, the pattern itself is not a big deal; it is just a matter of using the right function.
Hope this helps! Please comment if you have any queries. :)
References:
Python re library

Related

Regex - Remove space between two punctuation marks but not between punctuation mark and letter

I have the following regex for removing spaces between punctuation marks.
re.sub(r'\s*(\W)\s*', r'\1', s)
which works fine in almost all of my test cases, except for this one:
This is! ? a test! ?
For which I need to have
This is!? a test!?
and get
This is!?a test!?
How do I NOT remove the space between that ? and 'a'? What am I missing?

This should work:
import re
text = 'This is! ? a test! ?'
res = re.sub(r'(?<=[?!])\s+(?=[?!])', '', text)
print(res)
Output:
This is!? a test!?
Explanation:
(?<=[?!]) # positive lookbehind, make sure we have a punctuation before (you can add all punctuations you want to check)
\s+ # 1 or more spaces
(?=[?!]) # positive lookahead, make sure we have a punctuation after

Try this:
import re
string = "This is! ? a test! ?"
string = re.sub(r"(\W)\s*(\W)", r"\1\2", string)
print(string)
Output:
This is!? a test!?
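One difference between this pattern and a lookaround-based one is worth seeing on a longer run of punctuation. In the sketch below (the sample string is made up for illustration), the capturing version consumes both neighbouring characters, so the second punctuation mark cannot start the next match, while lookarounds match only the whitespace.

```python
import re

consuming = r"(\W)\s*(\W)"            # both neighbours are part of the match
lookaround = r"(?<=[?!])\s+(?=[?!])"  # only the whitespace is matched

sample = "Really! ? ! Yes"

with_groups = re.sub(consuming, r"\1\2", sample)
with_lookarounds = re.sub(lookaround, "", sample)

print(repr(with_groups))
print(repr(with_lookarounds))
```

Note also that \W matches whitespace itself, so the capturing pattern treats a space as a "punctuation" neighbour.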

To match a punctuation char with a regex in Python, you may use the (?:[^\w\s]|_) pattern, which matches any char other than a letter, digit or whitespace.
So, you need to match one or more whitespace chars (\s+) that are immediately preceded by a punctuation char ((?<=[^\w\s]|_)) and immediately followed by such a char ((?=[^\w\s]|_)):
(?<=[^\w\s]|_)\s+(?=[^\w\s]|_)
Python demo:
import re
text = "This is! ? a test! ?"
print( re.sub(r"(?<=[^\w\s]|_)\s+(?=[^\w\s]|_)", "", text) )
# => This is!? a test!?
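To see what the punctuation pattern picks up on its own, here is a small sketch using re.findall. The sample sentence is made up; the point is that _ is included even though \w normally treats it as a word char.

```python
import re

# Anything that is neither a word char nor whitespace, plus the underscore.
PUNCT = r"[^\w\s]|_"

found = re.findall(PUNCT, "snake_case, really?! yes.")
print(found)
```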

Another option is to use the PyPI regex module, with \p{Punct} inside positive lookarounds to match the punctuation marks.
For example:
import regex
pattern = r"(?<=\p{Punct})\s+(?=\p{Punct})"
s = 'This is! ? a test! ?'
print(regex.sub(pattern, '', s))
Output
This is!? a test!?
Note that \s could also match a newline. You could also use [^\S\r\n] to match a whitespace char except newlines.
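The effect of the newline note can be sketched with the standard re module. The example below uses a small [?!] class instead of \p{Punct} (an assumption, so that it runs without the third-party regex package) and a made-up two-line string.

```python
import re

text = "Wait!\n? yes! ?"

# \s+ also matches the line break between "!" and "?".
any_whitespace = re.sub(r"(?<=[?!])\s+(?=[?!])", "", text)

# [^\S\r\n]+ matches whitespace except CR and LF, so the line break stays.
no_newlines = re.sub(r"(?<=[?!])[^\S\r\n]+(?=[?!])", "", text)

print(repr(any_whitespace))
print(repr(no_newlines))
```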

Python regular expression truncate string by special character with one leading space

I need to truncate string by special characters '-', '(', '/' with one leading whitespace, i.e. ' -', ' (', ' /'.
How can I do that?
patterns=r'[-/()]'
try:
    return row.split(re.findall(patterns, row)[0], 1)[0]
except:
    return row
The above code picks up all the special characters, but without the leading space.
patterns=r'[s-/()]'
This one does not work.

Try this pattern
patterns=r'^\s[-/()]'
or remove ^ depending on your needs.

It looks like you want to get a part of the string before the first occurrence of \s[-(/] pattern.
Use
return re.sub(r'\s[-(/].*', '', row)
This code returns the part of the row string before the first occurrence of a whitespace char (\s) followed by -, ( or / ([-(/]). If there is no such occurrence, row is returned unchanged.
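A minimal sketch of the same truncation as a standalone function, here written with re.split rather than re.sub (a variant, not the answer's exact code). The function name truncate_at_marker and the sample rows are hypothetical.

```python
import re

# A whitespace char followed by one of the marker characters -, ( or /.
MARKER = re.compile(r"\s[-(/]")

def truncate_at_marker(row):
    """Return the text before the first whitespace + '-', '(' or '/'."""
    # maxsplit=1 stops at the first marker; with no marker the whole
    # string comes back as the only list element.
    return MARKER.split(row, maxsplit=1)[0]

for row in ["Blue shirt - large", "Acme Corp (UK)", "TCP /IP", "well-known"]:
    print(repr(truncate_at_marker(row)))
```

A marker without a leading whitespace char, as in "well-known", leaves the row untouched.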

Please try this pattern patterns = r'\s+-|\s\/|\s\(|\s\)'

Remove Twitter mentions from Pandas column

I have a dataset that includes Tweets from Twitter. Some of them also have user mentions such as #thisisauser. I try to remove that text at the same time I do other cleaning processes.
def clean_text(row, options):
    if options['lowercase']:
        row = row.lower()
    if options['decode_html']:
        txt = BeautifulSoup(row, 'lxml')
        row = txt.get_text()
    if options['remove_url']:
        row = row.replace('http\S+|www.\S+', '')
    if options['remove_mentions']:
        row = row.replace('#[A-Za-z0-9]+', '')
    return row
clean_config = {
    'remove_url': True,
    'remove_mentions': True,
    'decode_utf8': True,
    'lowercase': True
}
df['tweet'] = df['tweet'].apply(clean_text, args=(clean_config,))
However, when I run the above code, all the Twitter mentions are still on the text. I verified with a Regex online tool that my Regex is working correctly, so the problem should be on the Pandas's code.

You are misusing the str.replace method: it does not accept regular expressions, only fixed strings (see the docs at https://docs.python.org/2/library/stdtypes.html#str.replace for more).
The right way to do this is with the re module:
import re
re.sub("#[A-Za-z0-9]+","", "#thisisauser text")
' text'

The problem is with the way you used the replace method, not with Pandas.
See the output from the REPL:
>>> my_str = "#thisisause"
>>> my_str.replace('#[A-Za-z0-9]+', '')
'#thisisause'
str.replace doesn't support regex. Instead, use Python's regular expressions library, as mentioned in the other answer:
>>> import re
>>> my_str = 'hello #username hi'
>>> re.sub("#[A-Za-z0-9]+", "", my_str)
'hello  hi'
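Putting this together, the question's clean_text could be sketched with re.sub as below. This is an illustration under assumptions: the HTML-decoding step is left out so the example runs without BeautifulSoup, options.get is used so missing keys do not raise, the dot in www. is escaped, and the names URL_RE and MENTION_RE are hypothetical.

```python
import re

# Precompiled patterns; str.replace would treat these as literal text.
URL_RE = re.compile(r'http\S+|www\.\S+')
MENTION_RE = re.compile(r'#[A-Za-z0-9]+')

def clean_text(row, options):
    """Apply the enabled cleaning steps to one tweet."""
    if options.get('lowercase'):
        row = row.lower()
    if options.get('remove_url'):
        row = URL_RE.sub('', row)
    if options.get('remove_mentions'):
        row = MENTION_RE.sub('', row)
    return row

config = {'remove_url': True, 'remove_mentions': True, 'lowercase': True}
print(repr(clean_text("Hi #User see http://x.y/z now", config)))
```

The function can then be passed to df['tweet'].apply(clean_text, args=(config,)) as in the question.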

To remove Twitter mentions, or words that start with a # char, in Pandas, you can use any of the following:
df['tweet'] = df['tweet'].str.replace(r'\s*#\w+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*\B#\w+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*#\S+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*#\S+\b', '', regex=True)
If you need to remove remaining leading/trailing spaces after the replacement, add .str.strip() after the .str.replace call.
Details:
\s*#\w+ - zero or more whitespaces, # and one or more "word" chars (letters, digits, underscores and, in Python 3.x, other connector punctuation and some diacritics)
\s*\B#\w+ - zero or more whitespaces, a position other than a word boundary, # and one or more "word" chars
\s*#\S+ - zero or more whitespaces, # and one or more non-whitespace chars
\s*#\S+\b - zero or more whitespaces, # and one or more non-whitespace chars (as many as possible) followed by a word boundary.
Without Pandas, use one of the above regexps in re.sub:
text = re.sub(r'...pattern here...', '', text)
## or
text = re.sub(r'...pattern here...', '', text).strip()
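To compare the four patterns side by side, here is a small sketch that runs each of them through re.sub on one made-up sample string. The dictionary keys are just descriptive labels chosen for this example.

```python
import re

sample = "mail me#home or #user_1! and #tag-x."

patterns = {
    "word chars": r"\s*#\w+",
    "non-boundary + word chars": r"\s*\B#\w+",
    "non-whitespace": r"\s*#\S+",
    "non-whitespace up to a word boundary": r"\s*#\S+\b",
}

# Apply every pattern to the same input so the differences are visible.
results = {name: re.sub(rx, "", sample) for name, rx in patterns.items()}
for name, cleaned in results.items():
    print(f"{name:38} {cleaned!r}")
```

Only the \B variant leaves me#home alone, because the # there directly follows a word char; the \S-based variants also swallow trailing punctuation unless \b cuts the match short.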

Add string between tabs and text

I simply want to add a string after the (0 or more) tabs at the beginning of a string.
i.e.
a = '\t\t\tHere is the next part of string. More garbage.'
(insert Added String here.)
to
b = '\t\t\t Added String here. Here is the next part of string. More garbage.'
What is the easiest/simplest way to go about it?

Simple:
re.sub(r'^(\t*)', r'\1 Added String here. ', inputtext)
The ^ caret matches the start of the string, and \t matches a tab character, of which there should be zero or more (*). The parentheses capture the matched tabs for use in the replacement string, where \1 inserts them again in front of the string you are adding.
Demo:
>>> import re
>>> a = '\t\t\tHere is the next part of string. More garbage.'
>>> re.sub(r'^(\t*)', r'\1 Added String here. ', a)
'\t\t\t Added String here. Here is the next part of string. More garbage.'
>>> re.sub(r'^(\t*)', r'\1 Added String here. ', 'No leading tabs.')
' Added String here. No leading tabs.'
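For comparison, the same result can be sketched without a regex by counting the leading tabs with str.lstrip. The function name insert_after_tabs is hypothetical and not part of the answer above.

```python
def insert_after_tabs(text, addition):
    """Insert `addition` right after the leading tab characters of `text`."""
    stripped = text.lstrip('\t')          # text without its leading tabs
    n_tabs = len(text) - len(stripped)    # how many tabs were removed
    return text[:n_tabs] + addition + stripped

a = '\t\t\tHere is the next part of string. More garbage.'
print(repr(insert_after_tabs(a, ' Added String here. ')))
print(repr(insert_after_tabs('No leading tabs.', ' Added String here. ')))
```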

Python Regex - Match a character without consuming it

I would like to convert the following string
"For "The" Win","Way "To" Go"
to
"For ""The"" Win","Way ""To"" Go"
The straightforward regex would be
str2 = re.sub(r'(?<!,|^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,flags=re.MULTILINE)
i.e., Double the quotes that are
Followed by a letter but not preceded by a comma or the beginning of line
Preceded by a letter but not followed by a comma or the end of line
The problem is that I am using Python, and its regex engine does not allow using the OR operator in the lookbehind construct. I get the error
sre_constants.error: look-behind requires fixed-width pattern
What I am looking for is a regex that will replace the '"' around 'The' and 'To' with '""'.
I can use the following regex (An answer provided to another question)
\b\s*"(?!,|[ \t]*$)
but that consumes the space just before the 'The' and 'To' and I get the below
"For""The"" Win","Way""To"" Go"
Is there a workaround so that I can double the quotes around 'The' and 'To' without consuming the spaces just before them?

Instead of saying not preceded by comma or the line start, say preceded by a non-comma character:
r'(?<=[^,])"(?=\w)|(?<=\w)"(?!,|$)'

Looks to me like you don't need to bother with anchors.
If there is a character before the quote, you know it's not at the beginning of the string.
If that character is not a newline, you're not at the beginning of a line.
If the character is not a comma, you're not at the beginning of a field.
So you don't need to use anchors, just do a positive lookbehind/lookahead for a single character:
result = re.sub(r'(?<=[^",\r\n])"(?=[^,"\r\n])', '""', subject)
I threw in the " on the chance that there might be some quotes that are already escaped. But realistically, if that's the case you're probably screwed anyway. ;)
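A short sketch of why the extra " in the classes helps: with it, running the substitution on a line whose quotes are already doubled leaves the line unchanged. The helper name double_inner_quotes is hypothetical.

```python
import re

# A quote with a "plain" character on both sides: not a quote, comma,
# or line break.
FIELD_QUOTE = re.compile(r'(?<=[^",\r\n])"(?=[^,"\r\n])')

def double_inner_quotes(line):
    """Double the quotes that sit inside a field."""
    return FIELD_QUOTE.sub('""', line)

raw = '"For "The" Win","Way "To" Go"'
once = double_inner_quotes(raw)
twice = double_inner_quotes(once)  # already-doubled quotes are skipped

print(once)
print(once == twice)
```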

re.sub(r'\b(\s*)"(?!,|[ \t]*$)', r'\1""', s)

Most direct workaround whenever you encounter this issue: explode the look-behind into two look-behinds.
str2 = re.sub(r'(?<!,)(?<!^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,flags=re.MULTILINE)
(don't name your strings str)
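The reason the split works is that each lookbehind now has a single fixed width (1 for the comma, 0 for ^), whereas the original alternation mixed the two. A minimal sketch on a made-up two-line input, to show that re.MULTILINE keeps the quotes at line starts and line ends untouched:

```python
import re

# Two fixed-width lookbehinds replace the single (?<!,|^), whose
# alternatives have different widths (1 and 0).
INNER_QUOTE = re.compile(r'(?<!,)(?<!^)"(?=\w)|(?<=\w)"(?!,|$)', re.MULTILINE)

lines = '"For "The" Win","Way "To" Go"\n"Say "Hi" Now","All "Is" Well"'
doubled = INNER_QUOTE.sub('""', lines)

print(doubled)
```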

str2 = re.sub('(?<=[^,])"(?=\w)'
              '|'
              '(?<=\w)"(?!,|$)',
              '""', ss,
              flags=re.MULTILINE)
I always wonder why people use raw strings for regex patterns when it isn't needed.
Note that I changed your str, which is the name of a builtin class, to ss.
For "fun":
str2 = re.sub('"'
              '('
              '(?<=[^,]")(?=\w)'
              '|'
              '(?<=\w")(?!,|$)'
              ')',
              '""', ss,
              flags=re.MULTILINE)
or also
str2 = re.sub('(?<=[^,]")(?=\w)'
              '|'
              '(?<=\w")(?!,|$)',
              '"', ss,
              flags=re.MULTILINE)
