Removing stopwords from a list using Python 3

I have been trying to remove stopwords from a CSV file that I'm reading with Python, but my code does not seem to work. I have tried using sample text in the code to validate it, but the result is the same. Below is my code, and I would appreciate it if anyone can help me rectify the issue. Here is the code:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import csv
article = ['The computer code has a little bug',
           'im learning python',
           'thanks for helping me',
           'this is trouble',
           'this is a sample sentence'
           'cat in the hat']
tokenized_models = [word_tokenize(str(i)) for i in article]
stopset = set(stopwords.words('english'))
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
print('token:'+str(stop_models))

Your tokenized_models is a list of tokenized sentences, so a list of lists. Ergo, the following line tries to check whether an entire (stringified) list of words is in the stopword set:
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
Instead, iterate again through words. Something like:
clean_models = []
for m in tokenized_models:
    stop_m = [i for i in m if str(i).lower() not in stopset]
    clean_models.append(stop_m)
print(clean_models)
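The same filtering can also be written as a nested list comprehension; a minimal equivalent sketch using the names from the question:
# One list of filtered tokens per tokenized sentence.
clean_models = [[w for w in m if w.lower() not in stopset] for m in tokenized_models]
print(clean_models)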
Off-topic useful hint:
To define a multi-line string, use parentheses and no commas:
article = ('The computer code has a little bug'
           'im learning python'
           'thanks for helping me'
           'this is trouble'
           'this is a sample sentence'
           'cat in the hat')
This version would work with your original code

word_tokenize(str(i)) returns a list of words, so tokenized_models is a list of lists. You need to flatten that list, or better yet just make article a single string, since I don't see why it's a list at the moment.
This is because the in operator won't search through a list and then through strings in that list at the same time, e.g.:
>>> 'a' in 'abc'
True
>>> 'a' in ['abc']
False
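For completeness, a minimal sketch of the flattening approach mentioned above, reusing the question's article and stopset:
from itertools import chain
from nltk.tokenize import word_tokenize

# Flatten the per-sentence token lists into one stream of words, then filter out stopwords.
tokens = chain.from_iterable(word_tokenize(s) for s in article)
stop_models = [w for w in tokens if w.lower() not in stopset]
print(stop_models)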

Related

How to replace multiple substrings in a list of sentences using regex in python?

I have a list of sentences as below:
sentences = ["I am learning to code", "coding seems to be intresting in python", "how to code in python", "practicing how to code is the key"]
Now I wish to replace a few substrings in this list of sentences using a dictionary of words and their replacements.
word_list = {'intresting': 'interesting', 'how to code': 'learning how to code', 'am learning':'love learning', 'in python': 'using python'}
I tried the following code:
replaced_sentences = [' '.join([word_list.get(w, w) for w in sentence.split()])
for sentence in sentences]
But only the one-word strings are getting replaced, not the keys with more than one word. This is because I am using sentence.split(), which tokenizes the sentences word by word and therefore misses substrings longer than one word.
How do I replace the substrings with an exact match using regex, or are there any other suggestions?
expected output:
sentences = ["I love learning to code", "coding seems to be interesting using python", "learning how to code using python", "practicing learning how to code is the key"]
Thanks in advance.
It's probably easiest to read if you break this into a function that replaces all the words for a single sentence. Then you can apply it to all the sentences in the list. Here we make a single regex by concatenating all the keys of the dict with '|'. Then re.sub looks up the value associated with each matched key and returns it as the replacement.
import re
def replace_words(s, word_lookup):
    rx = '|'.join(word_lookup.keys())
    return re.sub(rx, lambda match: word_lookup[match.group(0)], s)
[replace_words(s, word_list) for s in sentences]
This will result in:
['I love learning to code',
'coding seems to be interesting using python',
'learning how to code using python',
'practicing learning how to code is the key']
You could optimize a bit by making the regex once instead of each time in the function. This would allow you to do something like:
import re
rx = re.compile('|'.join(word_list.keys()))
[rx.sub(lambda match: word_list[match.group(0)], s) for s in sentences]
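One hedged aside (not needed for the keys shown here): if any dictionary key could contain regex metacharacters, it is safer to escape the keys when building the pattern:
import re
# re.escape guards against keys containing regex metacharacters; the matched text itself is unchanged.
rx = re.compile('|'.join(re.escape(k) for k in word_list.keys()))
[rx.sub(lambda match: word_list[match.group(0)], s) for s in sentences]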

How do I match exact strings from a list to a larger string taking white spaces into account?

I have a large list of strings and I want to check whether a string occurs in a larger string. The list contains strings of one word as well as strings of multiple words. To do so I have written the following code:
example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as wel as a headache"
emptylist = []
for i in example_text:
    res = [ele for ele in example_list if(ele in i)]
    emptylist.append(res)
However, the problem here is that 'pain' is also added to emptylist, which it should not be, as I only want something from example_list to be added if it exactly matches the text. I also tried using sets:
word_set = set(example_list)
phrase_set = set(example_text.split())
word_set.intersection(phrase_set)
This however chops up 'morning sickness' into 'morning' and 'sickness'. Does anyone know the correct way to tackle this problem?
Nice examples have already been provided in this post by members.
I made the matching text a little more challenging, so that 'pain' occurs more than once. I also aimed for a little more information about where each match starts, and ended up with the following code.
I worked on the following sentence:
"The patient has not only kneepain but headache and arm pain, stomach pain and sickness"
import re
from collections import defaultdict
example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has not only kneepain but headache and arm pain, stomach pain and sickness"
TruthFalseDict = defaultdict(list)
for i in example_list:
    MatchedTruths = re.finditer(r'\b%s\b' % i, example_text)
    if MatchedTruths:
        for j in MatchedTruths:
            TruthFalseDict[i].append(j.start())
print(dict(TruthFalseDict))
The above gives me the following output.
{'pain': [55, 69], 'headache': [38], 'sickness': [78]}
Using PyParsing:
import pyparsing as pp
example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as wel as a headache morning sickness"
list_of_matches = []
for word in example_list:
    rule = pp.OneOrMore(pp.Keyword(word))
    for t, s, e in rule.scanString(example_text):
        if t:
            list_of_matches.append(t[0])
print(list_of_matches)
Which yields:
['headache', 'sickness', 'morning sickness']
You should be able to use a regex with word boundaries:
>>> import re
>>> [word for word in example_list if re.search(r'\b{}\b'.format(word), example_text)]
['headache']
This will not match 'pain' in 'kneepain' since that does not begin with a word boundary. But it would properly match substrings that contained whitespace.
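To illustrate the whitespace point with an assumed example text that actually contains a multi-word entry:
>>> text_with_phrase = "The patient reports morning sickness and a headache"
>>> [w for w in example_list if re.search(r'\b{}\b'.format(w), text_with_phrase)]
['headache', 'sickness', 'morning sickness']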

Tokenize multi word in python

I'm new to Python. I have a big data set from Twitter and I want to tokenize it.
But I don't know how I can tokenize multi-word verbs like "look for", "take off", "grow up", etc., and this is important to me.
My code is:
>>> from nltk.tokenize import word_tokenize
>>> s = "I'm looking for the answer"
>>> word_tokenize(s)
['I', "'m", 'looking', 'for', 'the', 'answer']
My data set is big, so I can't use the code from this page:
Find multi-word terms in a tokenized text in Python
So, how can I solve my problem?
You need to use part-of-speech tags for that, or actually dependency parsing would be more accurate. I haven't tried it with nltk, but with spaCy you can do it like this:
import spacy
nlp = spacy.load('en_core_web_lg')
def chunk_phrasal_verbs(lemmatized_sentence):
    ph_verbs = []
    for word in nlp(lemmatized_sentence):
        if word.dep_ == 'prep' and word.head.pos_ == 'VERB':
            ph_verb = word.head.text + ' ' + word.text
            ph_verbs.append(ph_verb)
    return ph_verbs
I also suggest lemmatizing the sentence first to get rid of conjugations. Also, if you need noun phrases, you can use the compound relation in a similar way.
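A hypothetical usage sketch, reusing the question's example sentence (the exact output depends on the spaCy model and its parse):
# Assumes nlp and chunk_phrasal_verbs from the answer above are already defined.
print(chunk_phrasal_verbs("I'm looking for the answer"))
# Likely prints something like ['looking for'], depending on the parse.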

Python - specify corpus from whose sentence to perform a function on?

I have imported all the books from the NLTK Book library, and I am just trying to figure out how to specify a corpus and then a sentence from it to be printed.
For example, if I wanted to print sentence 1 of text 3, then sentence 2 of text 4:
import nltk
from nltk.book import *
print(???)
print(???)
I've tried the below combinations, which do not work:
print(text3.sent1)
print(text4.sent2)
print(sent1.text3)
print(sent2.text4)
print(text3(sent1))
print(text4(sent2))
I am new to Python, so it is likely a very basic question, but I cannot seem to find the solution elsewhere.
Many thanks, in advance!
A simple example can be given as:
from nltk.tokenize import sent_tokenize

# List of sentences
sentences = "This is first sentence. This is second sentence. Let's try to tokenize the sentences. how are you? I am doing good"

# define function
def sentence_tokenizer(sentences):
    sentence_tokenize_list = sent_tokenize(sentences)
    print("tokenized sentences are =", sentence_tokenize_list)
    return sentence_tokenize_list

# call function
tokenized_sentences = sentence_tokenizer(sentences)

# print first sentence
print(tokenized_sentences[0])
Hope this helps.
You need to split the texts into lists of sentences first.
If you already have text3 and text4:
from nltk.tokenize import sent_tokenize

# text3 and text4 are nltk.Text objects (lists of tokens), so join them into strings first
sents = sent_tokenize(' '.join(text3))
print(sents[0])  # the first sentence in the list is at position 0

sents = sent_tokenize(' '.join(text4))
print(sents[1])  # the second sentence in the list is at position 1

print(text3[0])  # prints the first word of text3
You seem to need both an NLTK tutorial and a Python tutorial. Luckily, the NLTK book is both.

To find frequency of every word in text file in python

I want to find the frequency of all words in my text file so that I can find the most frequently occurring words from them.
Can someone please help me with the command to be used for that?
import nltk
text1 = "hello he heloo hello hi "  # example text
fdist1 = nltk.FreqDist(text1)
I have used the above code, but the problem is that it is not giving word frequency; rather, it is displaying the frequency of every character.
I also want to know how to input the text from a text file.
I ran your example and saw the same thing you were seeing: in order for it to work properly, you have to split the string on spaces. If you do not do this, it counts each character, which is what you were seeing. The following returns the proper counts of each word, not of each character.
import nltk
text1 = 'hello he heloo hello hi '
text1 = text1.split(' ')
fdist1 = nltk.FreqDist(text1)
print (fdist1.most_common(50))
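One small caveat, not part of the original answer: because the sample string ends with a space, text1.split(' ') also yields a trailing empty string that gets counted. Splitting with no argument avoids this:
import nltk
text1 = 'hello he heloo hello hi '
fdist1 = nltk.FreqDist(text1.split())  # split() with no argument drops empty strings
print(fdist1.most_common(50))  # [('hello', 2), ('he', 1), ('heloo', 1), ('hi', 1)] (tie order may vary)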
If you want to read from a file and get the word count, you can do it like so:
input.txt
hello he heloo hello hi
my username is heinst
your username is frooty
python code
import nltk
with open ("input.txt", "r") as myfile:
data=myfile.read().replace('\n', ' ')
data = data.split(' ')
fdist1 = nltk.FreqDist(data)
print (fdist1.most_common(50))
For what it's worth, NLTK seems like overkill for this task. The following will give you word frequencies, in order from highest to lowest.
from collections import Counter
input_string = [...] # get the input from a file
word_freqs = Counter(input_string.split())
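As a quick sanity check with the question's sample text rather than a file (tie order may differ):
from collections import Counter
print(Counter("hello he heloo hello hi".split()))
# Counter({'hello': 2, 'he': 1, 'heloo': 1, 'hi': 1})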
text1 in the nltk book is a collection of tokens (words, punctuation) unlike in your code example where text1 is a string (collection of Unicode codepoints):
>>> from nltk.book import text1
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text1[99] # 100th token in the text
','
>>> from nltk import FreqDist
>>> FreqDist(text1)
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024,
'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
If your input is indeed space-separated words then to find the frequency, use #Boa's answer:
freq = Counter(text_with_space_separated_words.split())
Note: FreqDist is a Counter but it also defines additional methods such as .plot().
If you want to use nltk tokenizers instead:
#!/usr/bin/env python3
from itertools import chain
from nltk import FreqDist, sent_tokenize, word_tokenize # $ pip install nltk
with open('your_text.txt') as file:
    text = file.read()
words = chain.from_iterable(map(word_tokenize, sent_tokenize(text)))
freq = FreqDist(map(str.casefold, words))
freq.pprint()
# -> FreqDist({'hello': 2, 'hi': 1, 'heloo': 1, 'he': 1})
sent_tokenize() tokenizes the text into sentences. Then word_tokenize tokenizes each sentence into words. There are many ways to tokenize text in nltk.
In order to have the frequencies as well as the words in a dictionary, the following code will be helpful:
import nltk
from nltk.tokenize import word_tokenize

# inputSentence is assumed to be the string you want to analyze
fre = nltk.FreqDist(word_tokenize(inputSentence))
word_freq = {}
for f in word_tokenize(inputSentence):
    word_freq[f] = fre[f]
print(word_freq)
I think the code below is useful for getting the frequency of each word in a file in dictionary form:
myfile = open('greet.txt')
temp = myfile.read()
x = temp.split("\n")
y = list()
for item in x:
    z = item.split(" ")
    y.append(z)
count = dict()
for name in y:
    for items in name:
        if items not in count:
            count[items] = 1
        else:
            count[items] = count[items] + 1
print(count)
