Stripping Punctuation from Python String

I seem to be having a bit of an issue stripping punctuation from a string in Python. Here, I'm given a text file (specifically a book from Project Gutenberg) and a list of stopwords. I want to return a dictionary of the 10 most commonly used words. Unfortunately, I keep getting one hiccup in my returned dictionary.
import sys
import collections
from string import punctuation
import operator
# should return a string without punctuation
def strip_punc(s):
    return ''.join(c for c in s if c not in punctuation)
def word_cloud(infile, stopwordsfile):
    wordcount = {}
    # Reads the stopwords into a list
    stopwords = [x.strip() for x in open(stopwordsfile, 'r').readlines()]
    # Reads data from the text file into a list
    lines = []
    with open(infile) as f:
        lines = f.readlines()
    lines = [line.split() for line in lines]
    # Does the word count
    for line in lines:
        for word in line:
            word = strip_punc(word).lower()
            if word not in stopwords:
                if word not in wordcount:
                    wordcount[word] = 1
                else:
                    wordcount[word] += 1
    # Sorts the dictionary, grabs the 10 most common words
    output = dict(sorted(wordcount.items(),
                         key=operator.itemgetter(1), reverse=True)[:10])
    print(output)
if __name__ == '__main__':
    try:
        word_cloud(sys.argv[1], sys.argv[2])
    except Exception as e:
        print('An exception has occurred:')
        print(e)
        print('Try running as python3 word_cloud.py <input-text> <stopwords>')
This will print out
{'said': 659, 'mr': 606, 'one': 418, '“i': 416, 'lorry': 322, 'upon': 288, 'will': 276, 'defarge': 268, 'man': 264, 'little': 263}
The “i shouldn't be there. I don't understand why it isn't eliminated by my helper function.
Thanks in advance.

The character “ is not ".
string.punctuation only includes the following ASCII characters:
In [1]: import string
In [2]: string.punctuation
Out[2]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
so you will need to augment the list of characters you are stripping.
Something like the following should accomplish what you need:
extended_punc = punctuation + '“'  # and any other characters you need to strip

def strip_punc(s):
    return ''.join(c for c in s if c not in extended_punc)
Alternatively, you could use the package unidecode to ASCII-fy your text and not worry about creating a list of unicode characters you may need to handle:
from unidecode import unidecode

def strip_punc(s):
    s = unidecode(s)
    return ''.join(c for c in s if c not in punctuation)

As stated in other answers, the problem is that string.punctuation only contains ASCII characters, so typographical ("fancy") quotes like “ are missing, among many others.
You could replace your strip_punc function with the following:
import re

def strip_punc(s):
    '''
    Remove all punctuation characters.
    '''
    return re.sub(r'[^\w\s]', '', s)
This approach uses the re module.
The regular expression works as follows:
It matches any character that is neither alphanumeric (\w) nor whitespace (\s) and replaces it with the empty string (i.e. deletes it).
This solution takes advantage of the fact that the "special sequences" \w and \s are Unicode-aware, i.e. they work equally well for characters of any script, not only ASCII:
>>> strip_punc("I said “naïve”, didn't I!")
'I said naïve didnt I'
Please note that \w includes the underscore (_), because it is considered "alphanumeric".
If you want to strip it as well, change the pattern to:
r'[^\w\s]|_'
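For instance, a variant that also strips underscores could look like this (a small sketch; the function name is my own):

```python
import re

def strip_punc_no_underscore(s):
    # Delete anything that is neither alphanumeric nor whitespace,
    # and also delete underscores (which \w would otherwise keep).
    return re.sub(r'[^\w\s]|_', '', s)

print(strip_punc_no_underscore('snake_case, “quoted”!'))  # -> 'snakecase quoted'
```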

Without knowing what is in the stopwords list, the quickest fix is to add this:
#Reads the stopwords into a list
stopwords = [x.strip() for x in open(stopwordsfile, 'r').readlines()]
stopwords.append('“i')
And continue with the rest of your code.

I'd change the logic of the strip_punc function:
from string import ascii_letters

def strip_punc(word):
    return ''.join(c for c in word if c in ascii_letters)
This logic is an explicit allow list rather than an explicit deny list: you admit only the values you want instead of blocking the values you know you don't want, which avoids any edge cases you didn't think about.
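To see the difference on the example from the question (a sketch; note that an ASCII-letters allow list also drops digits and accented letters, which may or may not be what you want):

```python
from string import ascii_letters

def strip_punc(word):
    # Keep only ASCII letters; curly quotes, punctuation, digits and
    # accented characters are all dropped.
    return ''.join(c for c in word if c in ascii_letters)

print(strip_punc('“I'))    # the curly quote is gone, no special-casing needed
print(strip_punc('naïve'))  # but the accented ï is dropped too
```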
Also note this.
Best way to strip punctuation from a string in Python

Related

Find and remove slightly different substring on string

I want to find out if a substring is contained in the string and remove it from it without touching the rest of the string. The thing is that the substring pattern that I have to perform the search on is not exactly what will be contained in the string. In particular the problem is due to spanish accent vocals and, at the same time, uppercase substring, so for example:
myString = "I'm júst a tésting stríng"
substring = 'TESTING'
Perform something to obtain:
resultingString = "I'm júst a stríng"
Right now I've read that the difflib library can compare two strings and somehow weigh their similarity, but I'm not sure how to apply this to my case (not to mention that I failed to install the lib).
Thanks!
This normalize() method might be a little overkill, and maybe using the code from @Harpe at https://stackoverflow.com/a/71591988/218663 works fine.
Here I am going to break the original string into "words" and then join all the non-matching words back into a string:
import unicodedata

def normalize(text):
    return unicodedata.normalize("NFD", text).encode('ascii', 'ignore').decode('utf-8').lower()

myString = "I'm júst a tésting stríng"
substring = "TESTING"

newString = " ".join(word for word in myString.split(" ")
                     if normalize(word) != normalize(substring))
print(newString)
giving you:
I'm júst a stríng
If your "substring" could be multi-word I might think about switching strategies to a regex:
import re
import unicodedata

def normalize(text):
    return unicodedata.normalize("NFD", text).encode('ascii', 'ignore').decode('utf-8').lower()

myString = "I'm júst á tésting stríng"
substring = "A TESTING"

match = re.search(f"\\s{ normalize(substring) }\\s", normalize(myString))
if match:
    found_at = match.span()
    first_part = myString[:found_at[0]]
    second_part = myString[found_at[1]:]
    print(f"{first_part} {second_part}".strip())
I think that will give you:
I'm júst stríng
You can use the package unicodedata to normalize accented letters to ascii code letters like so:
import unicodedata
output = unicodedata.normalize('NFD', "I'm júst a tésting stríng").encode('ascii', 'ignore')
print(str(output))
which will give
b"I'm just a testing string"
You can then compare this with your input
"TESTING".lower() in str(output).lower()
which should return True.
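Putting the normalization and the comparison together as one helper (a sketch; contains_normalized is a name I made up):

```python
import unicodedata

def contains_normalized(haystack, needle):
    # Decompose accented letters (NFD), drop the combining marks via
    # the ascii/ignore round-trip, then compare case-insensitively.
    def norm(text):
        return (unicodedata.normalize('NFD', text)
                .encode('ascii', 'ignore')
                .decode('ascii')
                .lower())
    return norm(needle) in norm(haystack)

print(contains_normalized("I'm júst a tésting stríng", "TESTING"))  # True
```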

Read a text file and remove all characters except alphabets & spaces in Python

I am trying to remove all characters except alphabets along with the spaces.
This is what my code looks like.
Where sampletext.txt contains words with multiple characters, I am writing the result in removed.txt.
When I run this code. I am getting only blanks in removed.txt
import re
import sys
filename = open("removed.txt", 'w')
sys.stdout = filename
from string import ascii_letters
allowed = set(ascii_letters + ' ')
with open("/Desktop/stem_analysis/sampletext.txt", 'r') as f:
    answer = ''.join(l for l in f if l in allowed)
print(answer)
What's the problem with my code?
I am trying to remove all characters except alphabets along with the
spaces.
I'm not 100% sure of what you're trying to do, but to remove all characters except alphabets along with the spaces, you can use something like:
import re

with open("old_file.txt", "r") as f, open("new_file.txt", "w") as n:
    x = f.read()
    result = re.sub(r"[^a-z\s]", "", x, 0, re.IGNORECASE | re.MULTILINE)
    n.write(result)
This will give you all the characters that are not in the alphabet. Add another if statement to check for spaces.
def letters(input):
    return ''.join([c for c in input if not c.isalpha()])
Something like this:
import re
re.sub(r'[^a-zA-Z]', '', your_string)
should do what you're asking except for the spaces part (note that the ^ goes inside the character class, so the class matches everything that is not a letter). I'm sure you can figure out how to add spaces to the regex as well.
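If the spaces should survive, one way (a quick sketch) is to put the space inside the negated character class:

```python
import re

def keep_letters_and_spaces(s):
    # Negated class: delete every character that is not a letter or a space.
    return re.sub(r'[^a-zA-Z ]', '', s)

print(keep_letters_and_spaces('Hello, World! 123'))  # -> 'Hello World '
```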

Search file for exact match of word list

There are many many questions surrounding this, some using regex, some using with open, and others but I have found none suitably fit my requirements.
I am opening a xml file which contains strings, 1 per line. e.g
<string name="AutoConf_5">setup is in progress…</string>
I want to iterate over each line in the file and search each line for exact matches of words in a list. The current code seems to work and prints out matches, but it doesn't do exact matches: e.g. 'pass' finds 'passed', and 'pro' finds 'provide', 'process', 'proceed', etc.
def stringRun(self, file):
    str_file = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']
    with open(file, 'r') as sf:
        for s in sf:
            if any(x in str(s) for x in str_file):
                self.progressBox.AppendText(s)
Instead of using the in operator, which matches any substring in the line, you should use regex via re.search.
I haven't checked this with Python, so minor syntax errors might have slipped in, but this is the general idea; replace the if in your code with this:
if any(re.search(x, str(s)) for x in str_file):
Then you can use the power of regex to search for the words in the list with word boundaries. You need to add r'\b' to the beginning and end of each search string, or add it for all of them in the condition:
if any(re.search(r'\b' + x + r'\b', str(s)) for x in str_file):
If you want an exact match, IMO, the best way is to prepare the strings to match and then search each string in each line.
For instances, you can prepare a mapping between tagged string and strings you want to match:
tagged = {'<string name="AutoConf_5">{0}</string>'.format(s): s
          for s in str_file}
This dict is an association between the tagged string you want to match and the actual string.
You can use it like that:
for line in sf:
    line = line.strip()
    if line in tagged:
        self.progressBox.AppendText(tagged[line])
Note: if any of your string contains "&", "<" or ">", you need to escape those characters, like this:
from xml.sax.saxutils import escape
tagged = {'<string name="AutoConf_5">{0}</string>'.format(escape(s)): s
          for s in str_file}
Another solution is to use lxml to parse your XML tree and find nodes which match a given xpath expression.
EDIT: match at least a word (form a words list)
You have a list of strings containing words. To match the XML content which contains at least of word of this list, you can use regular expression.
You may encounter two difficulties:
XML content, parsed like a text file, can contain "&", "<" or ">", so you need to unescape the XML content.
Some words from your word list may contain RegEx special characters (like "[" or "(") which must be escaped.
First, you can prepare a RegEx (and a function) to find all occurrences of a word in a string. To do that, you can use "\b", which matches the empty string, but only at the beginning or end of a word:
import re

str_file = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']
re_any_word = r"\b(?:" + r"|".join(re.escape(e) for e in str_file) + r")\b"
find_any_word = re.compile(re_any_word, flags=re.DOTALL).findall
For instance:
>>> find_any_word("Time has passed")
[]
>>> find_any_word("I pass my exam, I'm a pro")
['pass', 'pro']
To extract the content of an XML fragment, you can also use a RegEx (even if that is not recommended in the general case, it is worth it here).
The following RegEx (and function) matches a "<string>...</string>" fragment and select the content in the first group:
re_string = r'<string[^>]*>(.*?)</string>'
match_string = re.compile(re_string, flags=re.DOTALL).match
For instance:
>>> match_string('<string name="AutoConf_5">setup is in progress…</string>').group(1)
'setup is in progress…'
Now, all you have to do is to parse your file, line by line.
For the demo, I used a list of strings:
from xml.sax import saxutils

lines = [
    '<string name="AutoConf_5">setup is in progress…</string>\n',
    '<string name="AutoConf_5">it has passed</string>\n',
    '<string name="AutoConf_5">I pass my exam, I am a pro</string>\n',
]

for line in lines:
    line = line.strip()
    mo = match_string(line)
    if mo:
        content = saxutils.unescape(mo.group(1))
        words = find_any_word(content)
        if words:
            print(line + " => " + ", ".join(words))
You get:
<string name="AutoConf_5">I pass my exam, I am a pro</string> => pass, pro

Remove punctuation in Python but keep emoticons

I'm doing research on sentiment analysis. In a list of data, I'd like to remove all punctuation, in order to get the words in their pure form. But I would like to keep emoticons, such as :) and :/.
Is there a way to say in Python that I want to remove all punctuation signs unless they appear in a combination such as ":)", ":/", "<3"?
Thanks in advance
This is my code for the stripping:
for message in messages:
    message = message.lower()
    message = message.replace("!", "")
    message = message.replace(".", "")
    message = message.replace(",", "")
    message = message.replace(";", "")
    message = message.replace("?", "")
    message = message.replace("/", "")
    message = message.replace("#", "")
You can try this regex:
(?<=\w)[^\s\w](?![^\s\w])
Usage:
import re
print(re.sub(r'(?<=\w)[^\s\w](?![^\s\w])', '', your_data))
Here is an online demo.
The idea is to match a single special character if it is preceded by a letter.
If the regex doesn't work as you expect, you can customize it a little. For example if you don't want it to match commas, you can remove them from the character class like so: (?<=\w)[^\s\w,](?![^\s\w]). Or if you want to remove the emoticon :-), you can add it to the regex like so: (?<=\w)[^\s\w](?![^\s\w])|:-\).
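A quick check of that pattern against a sample sentence (my own example text):

```python
import re

text = 'great movie! loved it :) but the ending :/ hmm.'
# A lone punctuation mark after a word character is deleted;
# multi-character emoticons like :) and :/ survive.
cleaned = re.sub(r'(?<=\w)[^\s\w](?![^\s\w])', '', text)
print(cleaned)  # -> 'great movie loved it :) but the ending :/ hmm'
```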
Going off of the work you've already done using str.replace, you could do something like this:
lines = [
    "Sentence 1.",
    "Sentence 2 :)",
    "Sentence <3 ?"
]
emoticons = {
    ":)": "000smile",
    "<3": "000heart"
}
emoticons_inverse = {v: k for k, v in emoticons.items()}
punctuation = ",./<>?;':\"[]\\{}|`~!@#$%^&*()_+-="
lines_clean = []
for line in lines:
    # Replace emoticons with non-punctuation placeholders
    for emote, rpl in emoticons.items():
        line = line.replace(emote, rpl)
    # Remove punctuation
    for char in line:
        if char in punctuation:
            line = line.replace(char, "")
    # Revert emoticons
    for rpl, emote in emoticons_inverse.items():
        line = line.replace(rpl, emote)
    lines_clean.append(line)
print(lines_clean)
This is not super efficient, though, so if performance becomes a bottleneck you might want to examine how you can make this faster.
Output (python3 test.py):
['Sentence 1', 'Sentence 2 :)', 'Sentence <3 ']
Your best bet might be to simply declare a list of emoticons as a variable. Then compare your punctuation to the list. If it's not in the list, remove it from the string.
Edit: Instead of using a whole block of str.replace() over and over, you might try something like:
to_remove = '.,;:!()"'
for char in to_remove:
    message = message.replace(char, "")
Edit 2:
The simplest way (skill-wise) might be to try this:
from string import punctuation

emoticons = [":)", ":D", ":("]
word_list = message.split(" ")
word_list = [word if word in emoticons
             else word.translate(str.maketrans('', '', punctuation))
             for word in word_list]
output = " ".join(word_list)
Once again, this will only work on emoticons that are separated from other characters, i.e. "Sure :D" but not "Sorry:(".
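One way around that limitation (my own sketch, not from the answer above) is to protect known emoticons with a placeholder before stripping, so that "Sorry:(" also works:

```python
import re
from string import punctuation

emoticons = [':)', ':D', ':(']
# Build one alternation that matches any known emoticon, escaped so
# characters like ( and ) are taken literally.
emote_re = re.compile('|'.join(map(re.escape, emoticons)))

def strip_punc_keep_emotes(message):
    # Swap emoticons for a placeholder that contains no punctuation,
    # strip the remaining punctuation, then restore the emoticons in order.
    placeholder = '\x00EMOTE\x00'
    found = emote_re.findall(message)
    message = emote_re.sub(placeholder, message)
    message = ''.join(c for c in message if c not in punctuation)
    for emote in found:
        message = message.replace(placeholder, emote, 1)
    return message

print(strip_punc_keep_emotes('Sorry:( see you soon!'))  # -> 'Sorry:( see you soon'
```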

Search and replace with "whole word only" option [duplicate]

This question already has answers here:
Match a whole word in a string using dynamic regex
(1 answer)
Word boundary with words starting or ending with special characters gives unexpected results
(2 answers)
Closed 4 years ago.
I have a script that runs into my text and search and replace all the sentences I write based in a database.
The script:
with open('C:/Users/User/Desktop/Portuguesetranslator.txt') as f:
    for l in f:
        s = l.split('*')
        editor.replace(s[0], s[1])
And the Database example:
Event*Evento*
result*resultado*
And so on...
Now what is happening is that I need the "whole word only" option in that script, because I'm running into problems.
For example with Result and Event: when I replace them with Resultado and Evento and then run the script one more time on the text, the script replaces them again.
And the result after I run the script looks like Resultadoado and Eventoo.
Just so you know, it's not only Event and Result; there are 1000+ entries that I have already set up for the search and replace.
I don't need a simple search and replace for two words, because I'm going to be editing the database over and over for different sentences.
You want a regular expression. You can use the token \b to match a word boundary: i.e., \bresult\b would match only the exact word "result."
import re

with open('C:/Users/User/Desktop/Portuguesetranslator.txt') as f:
    for l in f:
        s = l.split('*')
        editor = re.sub(r"\b%s\b" % s[0], s[1], editor)
Use re.sub:
import re

replacements = {'the': 'a',
                'this': 'that'}

def replace(match):
    return replacements[match.group(0)]

# notice that the 'this' in 'thistle' is not matched
print(re.sub('|'.join(r'\b%s\b' % re.escape(s) for s in replacements),
             replace, 'the cat has this thistle.'))
Prints
a cat has that thistle.
Notes:
All the strings to be replaced are joined into a single pattern so that the string needs to be looped over just once.
The source strings are passed to re.escape to avoid interpreting them as regular expressions.
The words are surrounded by r'\b' to make sure matches are for whole words only.
A replacement function is used so that any match can be replaced.
Use re.sub instead of a normal string replace to replace only whole words. That way your script, even if it runs again, will not replace the already-replaced words.
>>> import re
>>> editor = "This is result of the match"
>>> new_editor = re.sub(r"\bresult\b","resultado",editor)
>>> new_editor
'This is resultado of the match'
>>> newest_editor = re.sub(r"\bresult\b","resultado",new_editor)
>>> newest_editor
'This is resultado of the match'
It is very simple: use re.sub, don't use replace.
import re

replacements = {r'\bthe\b': 'a',
                r'\bthis\b': 'that'}

def replace_all(text, dic):
    for i, j in dic.items():
        text = re.sub(i, j, text)
    return text

print(replace_all("the cat has this thistle.", replacements))
It will print
a cat has that thistle.
import re

match = {}  # create a dictionary of words-to-replace and words-to-replace-with

with open("filename", "r") as f:
    data = f.read()  # string of all file content

def replace_all(text, dic):
    for i, j in dic.items():
        # r"\b%s\b" enables replacing by whole-word matches only
        text = re.sub(r"\b%s\b" % i, j, text)
    return text

data = replace_all(data, match)
print(data)  # you can copy and paste the result to whatever file you like
