Stop words nltk/python problem - python

I have some code that processes a dataset for later use. The code I'm using for the stop words seems to be OK, but I think the problem lies in the rest of my code, as it only removes some of the stop words.
import re
import nltk
# Quran subset
filename = 'subsetQuran.txt'
# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
word_list2 = [w for w in word_list if not w in nltk.corpus.stopwords.words('english')]
# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
for word in word_list2:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # form dictionary
    try:
        freq_dic[word] += 1
    except:
        freq_dic[word] = 1
print '-'*30
print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
freq_list3 = list(freq_list2)
# display result
for freq, word in freq_list2:
    print word, freq
f = open("wordfreq.txt", "w")
f.write( str(freq_list3) )
f.close()
The output looks like this:
[(71, 'allah'), (65, 'ye'), (46, 'day'), (21, 'lord'), (20, 'truth'), (20, 'say'), (20, 'and')
This is just a small sample; there are other words that should have been removed.
Any help is appreciated.

Try stripping your words while making your word_list2:
word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]
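A related point worth checking (not part of the original answer): in the question's code the punctuation is only stripped after the stop-word filter has run, so tokens such as 'and,' or '"and' never match the stop list, which is likely why words like 'and' survive. A minimal sketch of doing the cleanup first, in the same Python 2 style as the question, with the stop list held in a set for faster membership tests:
stop_words = set(nltk.corpus.stopwords.words('english'))
punctuation = re.compile(r'[-.?!,":;()|0-9]')
# strip punctuation/digits and surrounding whitespace first, then drop stop words and empty tokens
cleaned = [punctuation.sub("", w).strip() for w in word_list]
word_list2 = [w for w in cleaned if w and w not in stop_words]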

Related

Deleting duplicated words in a very large words list

I'm a beginner at this and I wrote a program that generates a wordlist following specific algorithms. The problem is that it produces duplicates.
So I'm looking for a way to make the code iterate through the given range (the number of words to generate) without duplicating words,
OR to write another program that goes through the wordlist the first program made and deletes any duplicated words in that file, which is going to take time but is worth it.
The words that should be generated should be like this one X4K7GB9y, 8 characters in length, following the rule
[A-Z][0-9][A-Z][0-9][A-Z][A-Z][0-9][a-z], and the code is this:
import random
import string
random.seed(0)
NUM_WORDS = 100000000
with open("wordlist.txt", "w", encoding="utf-8") as ofile:
for _ in range(NUM_WORDS):
uppc = random.sample(string.ascii_uppercase, k=4)
lowc = random.sample(string.ascii_lowercase, k=1)
digi = random.sample(string.digits, k=3)
word = uppc[0] + digi[0] + uppc[1] + digi[1] + uppc[2] + uppc[3] + digi[2] + lowc[0]
print(word, file=ofile)
I'd appreciate it if you could modify the code so it doesn't produce duplicates, or write other code that checks the wordlist for duplicates and deletes them. Thank you so much in advance.
Given that your algorithm creates a list of words (unique or not), you could use a set to retain only the unique words; look at the example below.
word_list = ["word1", "word2", "word3", "word1"]
unique_words = set(word_list)
unique_words is now a set that contains only {"word1", "word2", "word3"}.
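As a minimal sketch of the second option from the question (cleaning up a wordlist file that has already been written), assuming one word per line; the output filename wordlist_unique.txt is just an illustrative name:
# read all words, keep only the unique ones, and write them back out
with open("wordlist.txt", encoding="utf-8") as infile:
    unique_words = set(line.strip() for line in infile)
with open("wordlist_unique.txt", "w", encoding="utf-8") as outfile:
    for word in unique_words:
        print(word, file=outfile)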
You can prevent duplicate words from the get-go by remembering what you have already created and not writing it again.
This needs a bit of memory to hold 100,000,000 8-letter words - you can lessen that by only remembering the hashes of the words. You will miss out on some hash collisions, but with about 26**5 * 10**3 = 11,881,376,000 possible combinations you should be fine.
import random
import string
random.seed(0)
NUM_WORDS = 100 # reduced for testing purposes
found = 0
words = set()
with open("wordlist.txt", "w", encoding="utf-8") as ofile:
while found < NUM_WORDS:
# get 5 upper case letters, use the 5h as .lower()
l = random.sample(string.ascii_uppercase, k=5)
d = random.sample(string.digits, k=3)
word = l[0] + d[0] + l[1] + d[1] + l[2] + l[3] + d[2] + l[4].lower()
if hash(word) in words:
continue
print(word, file=ofile)
words.add(hash(word))
found += 1
Here is a possible solution using a set() to deduplicate the words list:
import random
import string
random.seed(0)
words_count = 100_000_000
words = set()
while len(words) < words_count:
    u = random.sample(string.ascii_uppercase, k=4)
    l = random.sample(string.ascii_lowercase, k=1)
    d = random.sample(string.digits, k=3)
    words.add(f'{u[0]}{d[0]}{u[1]}{d[1]}{u[2]}{u[3]}{d[2]}{l[0]}')

with open('wordlist.txt', 'w', encoding='utf-8') as f:
    print(*words, file=f, sep='\n')
Bear in mind that it will take lots of memory and a long time to generate a hundred million random words.
The program below generates unique values following the rule and also writes them into a text file.
This part of the code creates the unique values:
import random
import string
n = 100
l = []
for i in range(n):
    word = chr(random.randint(65, 90)) + str(random.randint(1, 9)) + chr(random.randint(65, 90)) + str(random.randint(1, 9)) + chr(random.randint(65, 90)) + chr(random.randint(65, 90)) + str(random.randint(1, 9)) + chr(random.randint(65, 90)).lower()
    l.append(word)

finallist = list(set(l))
And the code below writes the outcome into a file:
with open("Uniquewords.txt", "w") as f:
for i in finallist:
f.write(i)
f.write("\n")
f.close()

Replacing word in a sentence by another one work but does not output well the punctuation

I use a list of synonyms to replace words in my sentence. The function works, but there is a slight problem with the output.
#Function
eda(t, alpha_sr=0.1, num_aug=3)
Original : "Un abricot est bon."
New sentence : 'Un aubercot est bon .'
As you can see, the replacement was made, but the punctuation is separated from the last word, unlike in the original. I would like to modify the code so that I obtain this result for each punctuation mark:
New sentence : 'Un aubercot est bon.'
augmented_sentences.append(' '.join(a_words)) # the problem arises here: since I join the words after splitting them, the punctuation is also joined with a space
Since I am working with some quite long reviews, the punctuation is really important.
The code is below :
import re
import random
from random import shuffle

def cleaning(texte):
    texte = re.sub(r"<iwer>.*?</iwer>", " ", str(texte))  # clean
    return texte

def eda(sentence, alpha_sr=float, num_aug=int):
    sentence = cleaning(sentence)
    sent_doc = nlp(sentence)
    words = [token.text for token in sent_doc if token.pos_ != "SPACE"]
    num_words = len(words)
    augmented_sentences = []
    num_new_per_technique = int(num_aug/4)+1

    if (alpha_sr > 0):
        n_sr = max(1, int(alpha_sr*num_words))
        for _ in range(num_new_per_technique):
            a_words = synonym_replacement(words, n_sr)
            print(a_words)
            augmented_sentences.append(' '.join(a_words))  # the problem is here: joining with spaces also puts a space before the punctuation tokens

    shuffle(augmented_sentences)

    # trim so that we have the desired number of augmented sentences
    if num_aug >= 1:
        augmented_sentences = augmented_sentences[:num_aug]
    else:
        keep_prob = num_aug / len(augmented_sentences)
        augmented_sentences = [s for s in augmented_sentences if random.uniform(0, 1) < keep_prob]

    # append the original sentence
    augmented_sentences.append(sentence)
    #print(len(augmented_sentences))
    return augmented_sentences

def synonym_replacement(words, n):
    new_words = words.copy()
    random_word_list = [word for word in words if word not in stop_words]
    random_word_list = ' '.join(new_words)
    #print("random list :", random_word_list)
    sent_doc = nlp(random_word_list)
    random_word_list = [token.lemma_ for token in sent_doc if token.pos_ == "NOUN" or token.pos_ == "ADJ" or token.pos_ == "VERB" or token.pos_ == "ADV"]
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        if len(synonyms) >= 1:
            synonym = random.choice(list(synonyms))
            new_words = [synonym if word == random_word else word for word in new_words]
            #print("replaced", random_word, "with", synonym)
            num_replaced += 1
        if num_replaced >= n:  # only replace up to n words
            break

    # this is stupid but we need it, trust me
    sentence = ' '.join(new_words)
    new_words = sentence.split(' ')
    return new_words

def get_synonyms(word):
    synonyms = []
    for k_syn, v_syn in word_syn_map.items():
        if k_syn == word:
            print(v_syn)
            synonyms.extend(v_syn)
    synonyms = set(synonyms)
    if word in synonyms:
        synonyms.remove(word)
    return list(synonyms)
The dictionary of synonyms looks like this:
#word_syn_map
defaultdict(<class 'list'>,
            {'ADN': ['acide désoxyribonucléique', 'acide désoxyribonucléique'],
             'abdomen': ['bas-ventre',
                         'bide',
                         'opisthosome',
                         'panse',
                         'ventre',
                         'bas-ventre',
                         'bide',
                         'opisthosome',
                         'panse',
                         'ventre'],
             'abricot': ['aubercot', 'michemis', 'aubercot', 'michemis']})
Tokenization:
import stanza
import spacy_stanza
stanza.download('fr')
nlp = spacy_stanza.load_pipeline('fr', processors='tokenize,mwt,pos,lemma')
Two answers to this:
I can't see your nlp function, so I don't know how you're tokenising the string, but it looks like you're doing it by treating punctuation as a separate token. That's why it's picking up a space: the punctuation is being treated like any other word. You either need to adjust your tokenisation algorithm so that it includes the punctuation in the word, or, if you can't do that, you need to do an extra pass through the words list at the start to stick punctuation back onto the token it belongs to (i.e. if a given token is punctuation, and you'll need a list of punctuation tokens, glue it together with the token before it). Either way, you then need to adjust your matching algorithm so it ignores punctuation and matches the rest of the word.
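Not part of the original answer, but a simpler variant of the same "glue the punctuation back" idea is to close up the space after joining; a minimal sketch (the helper name rejoin is just illustrative):
import re
def rejoin(a_words):
    sentence = ' '.join(a_words)
    # remove the space that ' '.join() puts before punctuation tokens
    return re.sub(r"\s+([.,;:!?])", r"\1", sentence)
For the example in the question, rejoin(['Un', 'aubercot', 'est', 'bon', '.']) gives 'Un aubercot est bon.'.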
This feels like you're overcomplicating the problem. I'd be inclined to do something like this:
import random
import re

def get_synonym(wordmatch):
    word = wordmatch.group(2)
    # pick one synonym for the matched word at random, falling back to the word itself
    synonyms = word_syn_map.get(word) or word_syn_map.get(word.lower()) or [word]
    return wordmatch.group(1) + random.choice(synonyms) + wordmatch.group(3)

new_sentence = sentence  # strings are immutable, so no explicit copy is needed
for original_word in word_syn_map:
    # group 1: start or whitespace, group 2: the word, group 3: trailing space/punctuation - add more punctuation to this class if needed
    wordexp = re.compile(r'(^|\s)(' + re.escape(original_word) + r')([\s.!?,;-]|$)', re.IGNORECASE)
    new_sentence = wordexp.sub(get_synonym, new_sentence)
Not guaranteed to work, I haven't tested it (and you'll certainly need to do something to maintain capitalisation or it'll lowercase everything), but I'd do something with regexes, myself.

Counting/Print Unique Words in Directory up to x instances

I am attempting to take all unique words in tale4653, count their instances, and then read off the top 100 most frequent unique words.
My struggle is sorting the dictionary so that I can print both the unique word and its respective count.
My code thus far:
import string
fhand = open('tale4653.txt')
counts = dict()
for line in fhand:
    line = line.translate(None, string.punctuation)
    line = line.lower()
    words = line.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
fhand.close()
rangedValue = sorted(counts.values(), reverse=True)
i = 0
while i < 100:
    print rangedValue[i]
    i = i + 1
Thank you community,
You lose the word (the key in your dictionary) when you do counts.values().
You can do this instead:
rangedValue = sorted(counts.items(), reverse=True, key=lambda x: x[1])
for word, count in rangedValue[:100]:  # top 100
    print word + ': ' + str(count)
When you do counts.items() it will return a list of tuples of key and value, like this:
[('the', 1), ('end', 2)]
and when we sort it we tell it to take the second value (the count) as the "key" to sort with.
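For example, sorting that small list by the second element of each tuple:
>>> sorted([('the', 1), ('end', 2)], key=lambda x: x[1], reverse=True)
[('end', 2), ('the', 1)]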
DorElias is correct about the initial problem: you need to use counts.items() with key=lambda x: x[1] or key=operator.itemgetter(1), the latter of which would be faster.
However, I'd like to show how I'd do it, completely avoiding sorted in your code. collections.Counter is an optimal data structure for this task. I also prefer that the logic of reading words from a file be wrapped in a generator:
import string
from collections import Counter
def read_words(filename):
    with open(filename) as fhand:
        for line in fhand:
            line = line.translate(None, string.punctuation)
            line = line.lower()
            words = line.split()
            for word in words:  # in Python 3 one can use `yield from words`
                yield word

counts = Counter(read_words('tale4653.txt'))
for word, count in counts.most_common(100):
    print('{}: {}'.format(word, count))

Find words that appear only once

I am retrieving only unique words in a file; here is what I have so far. However, is there a better way to achieve this in Python in terms of big-O notation? Right now this is n squared.
def retHapax():
    file = open("myfile.txt")
    myMap = {}
    uniqueMap = {}
    for i in file:
        myList = i.split(' ')
        for j in myList:
            j = j.rstrip()
            if j in myMap:
                del uniqueMap[j]
            else:
                myMap[j] = 1
                uniqueMap[j] = 1
    file.close()
    print uniqueMap
If you want to find all unique words and consider foo the same as foo., then you also need to strip the punctuation.
from collections import Counter
from string import punctuation
with open("myfile.txt") as f:
word_counts = Counter(word.strip(punctuation) for line in f for word in line.split())
print([word for word, count in word_counts.iteritems() if count == 1])
If you want to ignore case you also need to use line.lower(). If you want to accurately get unique words then there is more involved than just splitting the lines on whitespace.
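A minimal variant of the Counter snippet above with the case folding added (it keeps the Python 2 iteritems call used above; on Python 3 use items()); hapaxes is just an illustrative variable name:
with open("myfile.txt") as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.lower().split())
hapaxes = [word for word, count in word_counts.iteritems() if count == 1]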
I'd go with the collections.Counter approach, but if you only wanted to use sets, then you could do so by:
with open('myfile.txt') as input_file:
    all_words = set()
    dupes = set()
    for word in (word for line in input_file for word in line.split()):
        if word in all_words:
            dupes.add(word)
        all_words.add(word)
    unique = all_words - dupes
Given an input of:
one two three
two three four
four five six
Has an output of:
{'five', 'one', 'six'}
Try this to get the unique words in a file, using Counter:
from collections import Counter
with open("myfile.txt") as input_file:
word_counts = Counter(word for line in input_file for word in line.split())
>>> [word for (word, count) in word_counts.iteritems() if count==1]
-> list of unique words (words that appear exactly once)
You could slightly modify your logic and remove a word from unique on its second occurrence (example using sets instead of dicts):
words = set()
unique_words = set()
with open('myfile.txt') as f:
    for w in (word.strip() for line in f for word in line.split(' ')):
        if w in words:
            continue
        if w in unique_words:
            unique_words.remove(w)
            words.add(w)
        else:
            unique_words.add(w)
print(unique_words)

Searching from a list of word to words in a text file

I am trying to write a program which reads a text file and then sorts the comments in it into positive, negative or neutral. I have tried all sorts of ways to do this, each time to no avail. I can search for one word with no problems, but any more than that and it doesn't work. Also, I have an if statement, but I've had to use else twice underneath it as it wouldn't allow me to use elif. Any help with where I'm going wrong would be really appreciated. Thanks in advance.
middle = open("middle_test.txt", "r")
positive = []
negative = [] #the empty lists
neutral = []
pos_words = ["GOOD", "GREAT", "LOVE", "AWESOME"] #the lists I'd like to search
neg_words = ["BAD", "HATE", "SUCKS", "CRAP"]
for tweet in middle:
    words = tweet.split()
    if pos_words in words:  # doesn't work
        positive.append(words)
    else:  # can't use elif for some reason
        if 'BAD' in words:  # works but is only 1 word not list
            negative.append(words)
        else:
            neutral.append(words)
Use a Counter, see http://docs.python.org/2/library/collections.html#collections.Counter:
import urllib2
from collections import Counter
from string import punctuation
# data from http://inclass.kaggle.com/c/si650winter11/data
target_url = "http://goo.gl/oMufKm"
data = urllib2.urlopen(target_url).read()
word_freq = Counter([i.lower().strip(punctuation) for i in data.split()])
pos_words = ["good", "great", "love", "awesome"]
neg_words = ["bad", "hate", "sucks", "crap"]
for i in pos_words:
    try:
        print i, word_freq[i]
    except:  # if word not in data
        pass
[out]:
good 638
great 1082
love 7716
awesome 2032
You could use the code below to count the number of positive and negative words in a paragraph:
from collections import Counter
def readwords(filename):
    f = open(filename)
    words = [line.rstrip() for line in f.readlines()]
    return words
# >cat positive.txt
# good
# awesome
# >cat negative.txt
# bad
# ugly
positive = readwords('positive.txt')
negative = readwords('negative.txt')
print positive
print negative
paragraph = 'this is really bad and in fact awesome. really awesome.'
count = Counter(paragraph.split())
pos = 0
neg = 0
for key, val in count.iteritems():
    key = key.rstrip('.,?!\n')  # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val
print pos, neg
You are not reading the lines from the file. And this line:
if pos_words in words:
is checking for the whole list ["GOOD", "GREAT", "LOVE", "AWESOME"] inside words. That is, you are looking in the list of words for the list pos_words = ["GOOD", "GREAT", "LOVE", "AWESOME"] itself, not for its individual elements.
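Not in the original answer, but a minimal illustration of the difference between checking the list as a whole and checking each word in it:
words = "I LOVE this, it is GREAT".split()
pos_words = ["GOOD", "GREAT", "LOVE", "AWESOME"]
print(pos_words in words)                  # False: looks for the whole list as a single element
print(any(w in words for w in pos_words))  # True: checks each positive word individually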
You have some problems. First, you can create functions that read the comments from the file and divide the comments into words. Write them and check that they work as you want. Then the main procedure can look like this:
for comment in get_comments(file_name):
    words = get_words(comment)
    classified = False
    # at first look for negative comment
    for neg_word in NEGATIVE_WORDS:
        if neg_word in words:
            classified = True
            negatives.append(comment)
            break
    # now look for positive
    if not classified:
        for pos_word in POSITIVE_WORDS:
            if pos_word in words:
                classified = True
                positives.append(comment)
                break
    if not classified:
        neutral.append(comment)
Be careful: open() returns a file object.
>>> f = open('workfile', 'w')
>>> print f
<open file 'workfile', mode 'w' at 80a0960>
Use this:
>>> f.readline()
'This is the first line of the file.\n'
Then use set intersection:
positive += list(set(pos_words) & set(tweet.split()))
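Putting the set-intersection idea back into the question's loop might look like the sketch below; this is an untested sketch that assumes you want to collect whole tweets (append the matched words instead if that is what you need), and it upper-cases each tweet so it can match the upper-case word lists:
pos_set = set(pos_words)
neg_set = set(neg_words)
for tweet in middle:
    words = set(tweet.upper().split())
    if words & pos_set:        # at least one positive word
        positive.append(tweet)
    elif words & neg_set:      # at least one negative word
        negative.append(tweet)
    else:
        neutral.append(tweet)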
