Remove punctuation in Python but keep emoticons

I'm doing research on sentiment analysis. In a list of data, I'd like to remove all punctuation, in order to get at the words in their pure form. But I would like to keep emoticons, such as :) and :/.
Is there a way to say in Python that I want to remove all punctuation signs unless they appear in a combination such as ":)", ":/", "<3"?
Thanks in advance
This is my code for the stripping:
for message in messages:
    message = message.lower()
    message = message.replace("!", "")
    message = message.replace(".", "")
    message = message.replace(",", "")
    message = message.replace(";", "")
    message = message.replace("?", "")
    message = message.replace("/", "")
    message = message.replace("#", "")

You can try this regex:
(?<=\w)[^\s\w](?![^\s\w])
Usage:
import re
print(re.sub(r'(?<=\w)[^\s\w](?![^\s\w])', '', your_data))
The idea is to match a single special character if it is preceded by a word character and not followed by another special character.
If the regex doesn't work as you expect, you can customize it a little. For example if you don't want it to match commas, you can remove them from the character class like so: (?<=\w)[^\s\w,](?![^\s\w]). Or if you want to remove the emoticon :-), you can add it to the regex like so: (?<=\w)[^\s\w](?![^\s\w])|:-\).
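For instance, a quick sanity check of the extended pattern on a made-up sample sentence (the sentence here is purely illustrative):

import re

sample = "Loved it. Would buy again :-) and also :)"
print(re.sub(r'(?<=\w)[^\s\w](?![^\s\w])|:-\)', '', sample))
# Loved it Would buy again  and also :)
# (note the leftover double space where :-) was removed)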

Going off of the work you've already done using str.replace, you could do something like this:
lines = [
    "Sentence 1.",
    "Sentence 2 :)",
    "Sentence <3 ?"
]
emoticons = {
    ":)": "000smile",
    "<3": "000heart"
}
emoticons_inverse = {v: k for k, v in emoticons.items()}
punctuation = ",./<>?;':\"[]\\{}|`~!@#$%^&*()_+-="
lines_clean = []
for line in lines:
    # Replace emoticons with non-punctuation placeholders
    for emote, rpl in emoticons.items():
        line = line.replace(emote, rpl)
    # Remove punctuation
    for char in line:
        if char in punctuation:
            line = line.replace(char, "")
    # Revert placeholders back to emoticons
    for rpl, emote in emoticons_inverse.items():
        line = line.replace(rpl, emote)
    lines_clean.append(line)
print(lines_clean)
This is not super efficient, though, so if performance becomes a bottleneck you might want to examine how you can make this faster.
Output (python3 test.py):
['Sentence 1', 'Sentence 2 :)', 'Sentence <3 ']
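If that does become a bottleneck, one possible direction (a sketch only, not benchmarked) is a single compiled regex pass that matches the emoticons first and strips any other punctuation; the emoticon list here is assumed, so extend it as needed:

import re

emoticons = [":)", "<3", ":/"]
# Emoticons come first in the alternation so they win over single characters.
pattern = re.compile("|".join(map(re.escape, emoticons)) + r"|[^\w\s]")

def strip_punct_keep_emoticons(line):
    # Keep the match if it is a whole emoticon, otherwise drop it.
    return pattern.sub(lambda m: m.group() if m.group() in emoticons else "", line)

print(strip_punct_keep_emoticons("Sentence <3 ?"))  # 'Sentence <3 '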

Your best bet might be to simply declare a list of emoticons as a variable. Then compare your punctuation to the list. If it's not in the list, remove it from the string.
Edit: Instead of using a whole block of str.replace() over and over, you might try something like:
to_remove = ".,;:!()\""
for char in to_remove:
    message = message.replace(char, "")
Edit 2:
The simplest way (skill-wise) might be to try this:
from string import punctuation
emoticons = [":)", ":D", ":("]
table = str.maketrans("", "", punctuation)
word_list = message.split(" ")
# Build a new list: reassigning the loop variable would not update word_list.
word_list = [word if word in emoticons else word.translate(table) for word in word_list]
output = " ".join(word_list)
Once again, this will only work on emoticons that are separated from other characters, e.g. "Sure :D" but not "Sorry:(".
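One possible workaround (a sketch, with an assumed emoticon list) is to split the emoticons out with a regex first, strip punctuation from the text between them, and then re-interleave the pieces, so "Sorry:(" is handled too:

import re
from string import punctuation

emoticons = [":)", ":D", ":("]
emoticon_re = re.compile("|".join(map(re.escape, emoticons)))

def clean(message):
    parts = emoticon_re.split(message)    # the text between emoticons
    found = emoticon_re.findall(message)  # the emoticons themselves, in order
    table = str.maketrans("", "", punctuation)
    cleaned = [p.translate(table) for p in parts]
    # Re-interleave the stripped text with the preserved emoticons.
    out = cleaned[0]
    for emote, text in zip(found, cleaned[1:]):
        out += emote + text
    return out

print(clean("Sorry:( see you tomorrow :)"))  # 'Sorry:( see you tomorrow :)'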

Related

Easy way of converting a string to lowercase in python

I have a text as follows.
mytext = "This is AVGs_ABB and NMN_ABB and most importantly GFD_ABB This is so important that you have to CLEAN the lab everyday"
I want to convert it to lowercase, except the words that has _ABB in it.
So, my output should look as follows.
mytext = "this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday"
My current code is as follows.
splits = mytext.split()
newtext = []
for item in splits:
    if '_ABB' not in item:
        item = item.lower()
        newtext.append(item)
    else:
        newtext.append(item)
However, I want to know if there is any easy way of doing this, possibly in one line?
You can use a one-liner: split the string into words, check each word with str.endswith(), and then join the words back together:
' '.join(w if w.endswith('_ABB') else w.lower() for w in mytext.split())
# 'this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday'
Of course use the in operator rather than str.endswith() if '_ABB' can actually occur anywhere in the word and not just at the end.
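For example, the containment-based variant is the same one-liner with a different test:

' '.join(w if '_ABB' in w else w.lower() for w in mytext.split())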
Extended regex approach:
import re
mytext = "This is AVGs_ABB and NMN_ABB and most importantly GFD_ABB This is so important that you have to CLEAN the lab everyday"
result = re.sub(r'\b((?!_ABB)\S)+\b', lambda m: m.group().lower(), mytext)
print(result)
The output:
this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday
Details:
\b - word boundary
(?!_ABB) - lookahead negative assertion, ensures that the given pattern will not match
\S - non-whitespace character
\b((?!_ABB)\S)+\b - the whole pattern matches a word NOT containing substring _ABB
Here is another possible (though not elegant) one-liner:
mytext = "This is AVGs_ABB and NMN_ABB and most importantly GFD_ABB This is so important that you have to CLEAN the lab everyday"
print(' '.join(map(lambda x : x if '_ABB' in x else x.lower(), mytext.split())))
Which Outputs:
this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday
Note: This assumes that your text separates words only by spaces, so split() suffices here. If your text includes punctuation such as ",!.", you will need to use regex instead to split up the words.
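If you do need to cope with punctuation glued to words, a small sketch of one regex-based way (the sample sentence is adapted from the question) is to lowercase each word-like token unless it contains _ABB, leaving punctuation in place:

import re

def lower_except_abb(text):
    return re.sub(r'\w+',
                  lambda m: m.group() if '_ABB' in m.group() else m.group().lower(),
                  text)

print(lower_except_abb("This is AVGs_ABB, and you have to CLEAN the lab!"))
# this is AVGs_ABB, and you have to clean the lab!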

Stripping Punctuation from Python String

I seem to be having a bit of an issue stripping punctuation from a string in Python. Here, I'm given a text file (specifically a book from Project Gutenberg) and a list of stopwords. I want to return a dictionary of the 10 most commonly used words. Unfortunately, I keep getting one hiccup in my returned dictionary.
import sys
import collections
from string import punctuation
import operator

# should return a string without punctuation
def strip_punc(s):
    return ''.join(c for c in s if c not in punctuation)

def word_cloud(infile, stopwordsfile):
    wordcount = {}
    # Reads the stopwords into a list
    stopwords = [x.strip() for x in open(stopwordsfile, 'r').readlines()]
    # reads data from the text file into a list
    lines = []
    with open(infile) as f:
        lines = f.readlines()
    lines = [line.split() for line in lines]
    # does the wordcount
    for line in lines:
        for word in line:
            word = strip_punc(word).lower()
            if word not in stopwords:
                if word not in wordcount:
                    wordcount[word] = 1
                else:
                    wordcount[word] += 1
    # sorts the dictionary, grabs 10 most common words
    output = dict(sorted(wordcount.items(),
                         key=operator.itemgetter(1), reverse=True)[:10])
    print(output)

if __name__ == '__main__':
    try:
        word_cloud(sys.argv[1], sys.argv[2])
    except Exception as e:
        print('An exception has occurred:')
        print(e)
        print('Try running as python3 word_cloud.py <input-text> <stopwords>')
This will print out
{'said': 659, 'mr': 606, 'one': 418, '“i': 416, 'lorry': 322, 'upon': 288, 'will': 276, 'defarge': 268, 'man': 264, 'little': 263}
The "i shouldn't be there. I don't understand why it isn't eliminated in my helper function.
Thanks in advance.
The character “ is not ".
string.punctuation only includes the following ASCII characters:
In [1]: import string
In [2]: string.punctuation
Out[2]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
so you will need to augment the list of characters you are stripping.
Something like the following should accomplish what you need:
extended_punc = punctuation + '“' # and any other characters you need to strip
def strip_punc(s):
    return ''.join(c for c in s if c not in extended_punc)
Alternatively, you could use the package unidecode to ASCII-fy your text and not worry about creating a list of unicode characters you may need to handle:
from unidecode import unidecode

def strip_punc(s):
    # Transliterate unicode punctuation (e.g. “ and ”) to ASCII first,
    # then strip using string.punctuation as before (Python 3 form).
    s = unidecode(s)
    return ''.join(c for c in s if c not in punctuation)
As stated in other answers, the problem is that string.punctuation only contains ASCII characters, so the typographical ("fancy") quotes like “ are missing, among many others.
You could replace your strip_punc function with the following:
import re

def strip_punc(s):
    '''
    Remove all punctuation characters.
    '''
    return re.sub(r'[^\w\s]', '', s)
This approach uses the re module.
The regular expression works as follows:
It matches any character that is neither alphanumeric (\w) nor whitespace (\s) and replaces it with the empty string (i.e. deletes it).
This solution takes advantage of the fact that the "special sequences" \w and \s are unicode-aware, i.e. they work equally well for characters of any script, not only ASCII:
>>> strip_punc("I said “naïve”, didn't I!")
'I said naïve didnt I'
Please note that \w includes the underscore (_), because it is considered "alphanumeric".
If you want to strip it as well, change the pattern to:
r'[^\w\s]|_'
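For example:

>>> re.sub(r'[^\w\s]|_', '', "snake_case, naïve!")
'snakecase naïve'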
Without knowing what is in the stopwords list, the quickest fix is to add this:
#Reads the stopwords into a list
stopwords = [x.strip() for x in open(stopwordsfile, 'r').readlines()]
stopwords.append('“i')
And continue with the rest of your code.
I'd change the logic of the strip_punc function:
from string import ascii_letters

def strip_punc(word):
    return ''.join(c for c in word if c in ascii_letters)
This is an explicit allow list rather than an explicit deny list: you only let through the values you want, instead of blocking the values you know you don't want, which leaves out any edge cases you didn't think about.
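One caveat worth noting: an ASCII-letters allow list also drops digits and accented letters, for example:

>>> from string import ascii_letters
>>> ''.join(c for c in "naïve42" if c in ascii_letters)
'nave'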
Also note this related question: Best way to strip punctuation from a string in Python.

Removing Punctuation and Replacing it with Whitespace using Replace in Python

I'm trying to remove the following punctuation in Python. I need to use the replace method to remove these punctuation characters and replace them with whitespace: ,.:;'"-?!/
here is my code:
text_punct_removed = raw_text.replace(".", "")
text_punct_removed = raw_text.replace("!", "")
print("\ntext with punctuation characters removed:\n", text_punct_removed)
It will only remove the last one I try to replace (each call starts again from raw_text), so I tried combining them:
text_punct_removed = raw_text.replace(".", "" , "!", "")
print("\ntext with punctuation characters removed:\n", text_punct_removed)
but I get an error message. How do I remove multiple punctuation characters? Also, there will be an issue if I put the " character in quotes like this: """, is there a way around that? Thanks.
If you don't need to explicitly use replace:
exclude = set(",.:;'\"-?!/")
text = "".join([(ch if ch not in exclude else " ") for ch in text])
Here's a naive but working solution:
for sp in '.,"':
    raw_text = raw_text.replace(sp, '')
If you need to replace all punctuation with spaces, you can use the built-in string.punctuation constant:
Python 3
import string
import re

my_string = "(I hope...this works!)"
translator = re.compile('[%s]' % re.escape(string.punctuation))
my_string = translator.sub(' ', my_string)
print(my_string)
# Result: ' I hope   this works  '
Afterwards, if you want to collapse the repeated spaces inside the string, you can do:
my_string = re.sub(' +', ' ', my_string).strip()
print(my_string)
# Result: 'I hope this works'
This works in Python 3.5.3:
from string import punctuation
raw_text_with_punctuations = "text, with: punctuation; characters? all over ,.:;'\"-?!/"
print(raw_text_with_punctuations)
for char in punctuation:
    raw_text_with_punctuations = raw_text_with_punctuations.replace(char, '')
print(raw_text_with_punctuations)
Either remove one character at a time:
raw_text.replace(".", "").replace("!", "")
Or, better, use regular expressions (re.sub()):
re.sub(r"\.|!", "", raw_text)
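A character class does the same job and extends more easily than alternation, e.g. to cover more of the question's characters:

re.sub(r"[.!,;:?]", "", raw_text)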

Splitting strings in Python using specific characters

I'm trying to split an inputted document at specific characters. I need to split them at [ and ] but I'm having a difficult time figuring this out.
def main():
    for x in docread:
        words = x.split('[]')
        for word in words:
            doclist.append(word)
this is the part of the code that splits them into my list. However, it is returning each line of the document.
For example, I want to convert
['I need to [go out] to lunch', 'and eat [some food].']
to
['I need to', 'go out', 'to lunch and eat', 'some food', '.']
Thanks!
You could try using re.split() instead:
>>> import re
>>> re.split(r"[\[\]]", "I need to [go out] to lunch")
['I need to ', 'go out', ' to lunch']
The odd-looking regular expression [\[\]] is a character class that means split on either [ or ]. The internal \[ and \] must be backslash-escaped because they use the same characters as the [ and ] to surround the character class.
str.split() splits at the exact string you pass to it, not at any of its characters. Passing "[]" would split at occurrences of [], but not at individual brackets. Possible solutions are:
splitting twice (demonstrated after this list):
words = [z for y in x.split("[") for z in y.split("]")]
using re.split().
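For example, the double split applied to the question's sample line gives the same result as re.split():

>>> x = "I need to [go out] to lunch"
>>> [z for y in x.split("[") for z in y.split("]")]
['I need to ', 'go out', ' to lunch']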
string.split(s), the one you are using, treats the entire content of 's' as a separator. In other words, your input should've looked like "[]'I need to []go out[] to lunch', 'and eat []some food[].'[]" for it to give you the results you want.
You need to use split(s) from the re module, which will treat s as a regex
import re

def main():
    for x in docread:
        words = re.split(r'[\[\]]', x)
        for word in words:
            doclist.append(word)

Efficient way to search for invalid characters in python

I am building a forum application in Django and I want to make sure that users don't enter certain characters in their forum posts. I need an efficient way to scan their whole post to check for the invalid characters. What I have so far is the following, although it does not work correctly and I do not think the idea is very efficient.
def clean_topic_message(self):
    topic_message = self.cleaned_data['topic_message']
    words = topic_message.split()
    if topic_message == "":
        raise forms.ValidationError(_(u'Please provide a message for your topic'))
    for word in words:
        if re.match(r'[^<>/\{}[]~`]$', topic_message):
            raise forms.ValidationError(_(u'Topic message cannot contain the following: <>/\{}[]~`'))
    return topic_message
Thanks for any help.
For a regex solution, there are two ways to go here:
Find one invalid char anywhere in the string.
Validate every char in the string.
Here is a script that implements both:
import re
topic_message = 'This topic is a-ok'

# Option 1: Invalidate one char in string.
re1 = re.compile(r"[<>/{}[\]~`]")
if re1.search(topic_message):
    print("RE1: Invalid char detected.")
else:
    print("RE1: No invalid char detected.")

# Option 2: Validate all chars in string.
re2 = re.compile(r"^[^<>/{}[\]~`]*$")
if re2.match(topic_message):
    print("RE2: All chars are valid.")
else:
    print("RE2: Not all chars are valid.")
Take your pick.
Note: the original regex erroneously has a right square bracket in the character class which needs to be escaped.
Benchmarks: After seeing gnibbler's interesting solution using set(), I was curious to find out which of these methods would actually be fastest, so I decided to measure them. Here are the benchmark data and statements measured and the timeit result values:
Test data:
r"""
TEST topic_message STRINGS:
ok: 'This topic is A-ok. This topic is A-ok.'
bad: 'This topic is <not>-ok. This topic is {not}-ok.'
MEASURED PYTHON STATEMENTS:
Method 1: 're1.search(topic_message)'
Method 2: 're2.match(topic_message)'
Method 3: 'set(invalid_chars).intersection(topic_message)'
"""
Results:
r"""
Seconds to perform 1000000 Ok-match/Bad-no-match loops:
Method Ok-time Bad-time
1 1.054 1.190
2 1.830 1.636
3 4.364 4.577
"""
The benchmark tests show that Option 1 is slightly faster than option 2 and both are much faster than the set().intersection() method. This is true for strings which both match and don't match.
You have to be much more careful when using regular expressions - they are full of traps.
In the case of [^<>/\{}[]~] the first ] closes the group, which is probably not what you intended. If you want to use ] in a group it has to be the first character after the [, e.g. []^<>/\{}[~]
A simple test confirms this:
>>> import re
>>> re.search("[[]]","]")
>>> re.search("[][]","]")
<_sre.SRE_Match object at 0xb7883db0>
regex is overkill for this problem anyway
def clean_topic_message(self):
    topic_message = self.cleaned_data['topic_message']
    invalid_chars = '^<>/\{}[]~`$'
    if topic_message == "":
        raise forms.ValidationError(_(u'Please provide a message for your topic'))
    if set(invalid_chars).intersection(topic_message):
        raise forms.ValidationError(_(u'Topic message cannot contain the following: %s' % invalid_chars))
    return topic_message
If efficiency is a major concern I would re.compile() the re string, since you're going to use the same regex many times.
re.match and re.search behave differently. Splitting words is not required to search using regular expressions.
import re

symbols_re = re.compile(r"[<>/\\{}\[\]~`]")
if symbols_re.search(self.cleaned_data['topic_message']):
    ...  # raise the validation error here
I can't say what would be more efficient, but you certainly should get rid of the $ (unless it's an invalid character for the message)... right now you only match the re if the characters are at the end of topic_message because $ anchors the match to the right-hand side of the line.
In any case you need to scan the entire message. So wouldn't something simple like this work ?
def checkMessage(topic_message):
    for char in topic_message:
        if char in "<>/\{}[]~`":
            return False
    return True
is_valid = not any(k in text for k in '<>/{}[]~`')
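For example:

>>> text = "This topic is <not>-ok"
>>> not any(k in text for k in '<>/{}[]~`')
False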
I agree with gnibbler, regex is overkill for this situation. Probably after removing these unwanted chars you'll want to remove unwanted words also; here's a basic way to do it:
def remove_bad_words(title):
    '''Helper to remove bad words from a sentence based on a list of words.
    '''
    word_list = title.split(' ')
    # Iterate over a copy: removing items from a list while iterating over it skips elements.
    for word in word_list[:]:
        if word in BAD_WORDS:  # BAD_WORDS is a list of unwanted words
            word_list.remove(word)
    # let's build the string again
    return ' '.join(word_list)
Example: just tailor to your needs.
### valid chars: 0-9 , a-z, A-Z only
import re
REGEX_FOR_INVALID_CHARS=re.compile( r'[^0-9a-zA-Z]+' )
list_of_invalid_chars_found=REGEX_FOR_INVALID_CHARS.findall( topic_message )
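The resulting list is empty when the message contains only valid characters, so a usage check might look like this:

if list_of_invalid_chars_found:
    print("Invalid characters found:", "".join(list_of_invalid_chars_found))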
