How to find the frequency of every word in a text file in Python

I want to find the frequency of all the words in my text file so that I can find out the most frequently occurring words in it.
Can someone please tell me the command to use for that?
import nltk
text1 = "hello he heloo hello hi "  # example text
fdist1 = FreqDist(text1)
I have used the above code, but the problem is that it is not giving word frequencies; rather, it is displaying the frequency of every character.
I also want to know how to read the text in from a text file.

I tried your example and saw the same thing you were seeing. For it to work properly, you have to split the string on spaces; if you do not, it counts each character, which is what you were seeing. Splitting first returns the proper count of each word, not each character.
import nltk
text1 = 'hello he heloo hello hi '
text1 = text1.split(' ')
fdist1 = nltk.FreqDist(text1)
print(fdist1.most_common(50))
If you want to read from a file and get the word count, you can do it like so:
input.txt
hello he heloo hello hi
my username is heinst
your username is frooty
python code
import nltk
with open ("input.txt", "r") as myfile:
data=myfile.read().replace('\n', ' ')
data = data.split(' ')
fdist1 = nltk.FreqDist(data)
print (fdist1.most_common(50))

For what it's worth, NLTK seems like overkill for this task. The following gives you the word frequencies; call most_common() on the Counter to get them in order from highest to lowest.
from collections import Counter
input_string = [...] # get the input from a file
word_freqs = Counter(input_string.split())
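For example, a minimal end-to-end sketch (assuming the text lives in a file called input.txt, which is just a placeholder name):
from collections import Counter

# read the whole file into one string (input.txt is a placeholder filename)
with open("input.txt") as f:
    input_string = f.read()

word_freqs = Counter(input_string.split())
print(word_freqs.most_common(10))  # the 10 most frequent words, highest first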

text1 in the nltk book is a collection of tokens (words, punctuation), unlike in your code example, where text1 is a string (a sequence of Unicode codepoints):
>>> from nltk.book import text1
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text1[99] # 100th token in the text
','
>>> from nltk import FreqDist
>>> FreqDist(text1)
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024,
'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
If your input is indeed space-separated words, then to find the frequency use @Boa's answer:
freq = Counter(text_with_space_separated_words.split())
Note: FreqDist is a Counter but it also defines additional methods such as .plot().
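For instance, a small sketch of the shared API (the .plot() call additionally requires matplotlib to be installed):
from nltk import FreqDist

freq = FreqDist('hello he heloo hello hi'.split())
print(freq.most_common(3))  # Counter-style API, e.g. [('hello', 2), ('he', 1), ('heloo', 1)]
freq.plot(10)               # extra FreqDist method: draws a frequency plot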
If you want to use nltk tokenizers instead:
#!/usr/bin/env python3
from itertools import chain
from nltk import FreqDist, sent_tokenize, word_tokenize # $ pip install nltk
with open('your_text.txt') as file:
    text = file.read()
words = chain.from_iterable(map(word_tokenize, sent_tokenize(text)))
freq = FreqDist(map(str.casefold, words))
freq.pprint()
# -> FreqDist({'hello': 2, 'hi': 1, 'heloo': 1, 'he': 1})
sent_tokenize() tokenizes the text into sentences. Then word_tokenize tokenizes each sentence into words. There are many ways to tokenize text in nltk.
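As one illustration of an alternative, wordpunct_tokenize splits on whitespace and punctuation and does not need a separate sentence-splitting step (a small sketch):
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize

text = "hello he heloo hello hi "
freq = FreqDist(word.casefold() for word in wordpunct_tokenize(text))
print(freq.most_common(5))  # e.g. [('hello', 2), ('he', 1), ('heloo', 1), ('hi', 1)]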

In order to have the words together with their frequencies as a dictionary, the following code may be helpful:
import nltk
from nltk.tokenize import word_tokenize

inputSentence = "hello he heloo hello hi"  # replace with your own text
freq = nltk.FreqDist(word_tokenize(inputSentence))
word_freq_dict = dict(freq)  # plain dict mapping word -> count
print(word_freq_dict)

I think the code below will be useful for getting the frequency of each word in the file, in dictionary form:
myfile = open('greet.txt')
temp = myfile.read()
x = temp.split("\n")
y = list()
for item in x:
    z = item.split(" ")
    y.append(z)
count = dict()
for name in y:
    for items in name:
        if items not in count:
            count[items] = 1
        else:
            count[items] = count[items] + 1
print(count)

Related

Unstructured data, NLP Lemmatize Book Review

Here I am trying to read the content of, let's say, 'book1.txt', remove all the special characters and punctuation marks, and word-tokenize the content using nltk's word tokenizer.
Then I lemmatize those tokens using WordNetLemmatizer
and write the tokens into a csv file one by one.
Here is the code I am using, which obviously is not working, but I just need some suggestions on this please.
import nltk
from nltk.stem import WordNetLemmatizer
import csv
from nltk.tokenize import word_tokenize
file_out = open('data.csv', 'w')
with open('book1.txt', 'r') as myfile:
    for s in myfile:
        words = nltk.word_tokenize(s)
        words = [word.lower() for word in words if word.isalpha()]
        for word in words:
            token = WordNetLemmatizer().lemmatize(words, 'v')
            filtered_sentence = [""]
            for n in words:
                if n not in token:
                    filtered_sentence.append("" + n)
                    file_out.writelines(filtered_sentence + ["\n"])
There are some issues here, most notably with the last two for loops.
The way you are doing it, the output gets written as follows:
word1
word1word2
word1word2word3
word1word2word3word4
........etc
I'm guessing that is not the expected output. I'm assuming the expected output is:
word1
word2
word3
word4
........etc (without creating duplicates)
I applied the code below to a 3 paragraph Cat Ipsum file. Note that I changed some variable names due to my own naming conventions.
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from pprint import pprint
# read the text into a single string.
with open("book1.txt") as infile:
    text = ' '.join(infile.readlines())

words = word_tokenize(text)
words = [word.lower() for word in words if word.isalpha()]

# create the lemmatized word list
results = []
for word in words:
    # you were using words instead of word below
    token = WordNetLemmatizer().lemmatize(word, "v")
    # check if token not already in results.
    if token not in results:
        results.append(token)

# sort results, just because :)
results.sort()

# print and save the results
pprint(results)
print(len(results))
with open("nltk_data.csv", "w") as outfile:
    outfile.write('\n'.join(results))  # writelines(results) alone would concatenate the words without newlines

Python - how to specify the corpus and the sentence within it to perform a function on?

I have imported all the books from the NLTK Book library, and I am just trying to figure out how to specify a corpus and then the sentence from it to be printed.
For example, if I wanted to print sentence 1 of text 3, then sentence 2 of text 4
import nltk
from nltk.book import *
print(???)
print(???)
I've tried the below combinations, which do not work:
print(text3.sent1)
print(text4.sent2)
print(sent1.text3)
print(sent2.text4)
print(text3(sent1))
print(text4(sent2))
I am new to Python, so it is likely a very basic question, but I cannot seem to find the solution elsewhere.
Many thanks, in advance!
A simple example can be given as:
from nltk.tokenize import sent_tokenize

# string containing several sentences
sentences = "This is first sentence. This is second sentence. Let's try to tokenize the sentences. how are you? I am doing good"

# define function
def sentence_tokenizer(sentences):
    sentence_tokenize_list = sent_tokenize(sentences)
    print("tokenized sentences are = ", sentence_tokenize_list)
    return sentence_tokenize_list

# call function
tokenized_sentences = sentence_tokenizer(sentences)

# print first sentence
print(tokenized_sentences[0])
Hope this helps.
You need to split the texts into lists of sentences first.
If you already have text3 and text4:
from nltk.tokenize import sent_tokenize
# text3 and text4 are lists of tokens, so join them into strings before sentence-splitting
sents = sent_tokenize(' '.join(text3))
print(sents[0])  # the first sentence in the list is at position 0
sents = sent_tokenize(' '.join(text4))
print(sents[1])  # the second sentence in the list is at position 1
print(text3[0]) # prints the first word of text3
You seem to need both an NLTK tutorial and a Python tutorial. Luckily, the NLTK book is both.

Removing stopwords from list using python3

I have been trying to remove stopwords from a csv file that I'm reading using Python code, but my code does not seem to work. I have tried using a sample text in the code to validate it, but the result is still the same. Below is my code; I would appreciate it if anyone could help me rectify the issue.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import csv
article = ['The computer code has a little bug',
           'im learning python',
           'thanks for helping me',
           'this is trouble',
           'this is a sample sentence'
           'cat in the hat']
tokenized_models = [word_tokenize(str(i)) for i in article]
stopset = set(stopwords.words('english'))
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
print('token:' + str(stop_models))
Your tokenized_models is a list of tokenized sentences, so a list of lists. Ergo, the following line tries to match a list of words to a stopword:
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
Instead, iterate again through words. Something like:
clean_models = []
for m in tokenized_models:
    stop_m = [i for i in m if str(i).lower() not in stopset]
    clean_models.append(stop_m)
print(clean_models)
Off-topic useful hint:
To define a multi-line string, use parentheses and no commas:
article = ('The computer code has a little bug'
           'im learning python'
           'thanks for helping me'
           'this is trouble'
           'this is a sample sentence'
           'cat in the hat')
This version would work with your original code.
word_tokenize(str(i)) returns a list of words, so tokenized_models is a list of lists. You need to flatten that list (see the sketch after the examples below), or better yet just make article a single string, since I don't see why it's a list at the moment.
This is because the in operator won't search through a list and then through strings in that list at the same time, e.g.:
>>> 'a' in 'abc'
True
>>> 'a' in ['abc']
False
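For illustration, one way to flatten the list of lists before filtering out stopwords could look like this (a sketch reusing the names from the question):
from itertools import chain

# flatten the list of token lists into one flat list of words
all_tokens = list(chain.from_iterable(tokenized_models))
stop_models = [w for w in all_tokens if w.lower() not in stopset]
print('token:' + str(stop_models))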

Converting a Text File to a String in Python

I am new to Python and am trying to find the longest word in alice_in_wonderland.txt. I think I have a good system set up (see below), but my output is returning a "word" with dashes connecting multiple words. Is there some way to remove the dashes in the input of the file? For the text file, visit here.
sample from text file:
That's very important,' the King said, turning to the jury. They were
just beginning to write this down on their slates, when the White
Rabbit interrupted: UNimportant, your Majesty means, of course,' he
said in a very respectful tone, but frowning and making faces at him
as he spoke. " UNimportant, of course, I meant,' the King hastily
said, and went on to himself in an undertone, important--unimportant--
unimportant--important--' as if he were trying which word sounded
best."
code:
#String input
with open("alice_in_wonderland.txt", "r") as myfile:
    string = myfile.read().replace('\n', '')
#initialize list
my_list = []
#Split words into list
for word in string.split(' '):
    my_list.append(word)
#initialize list
uniqueWords = []
#Fill in new list with unique words to shorten final printout
for i in my_list:
    if i not in uniqueWords:
        uniqueWords.append(i)
#Length of longest word
count = 0
#Longest word place holder
longest = []
for word in uniqueWords:
    if len(word) > count:
        longest = word
        count = len(longest)
print longest
>>> import nltk # pip install nltk
>>> nltk.download('gutenberg')
>>> words = nltk.corpus.gutenberg.words('carroll-alice.txt')
>>> max(words, key=len) # find the longest word
'disappointment'
Here's one way using re and mmap:
import re
import mmap

with open('your alice in wonderland file') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    words = re.finditer(r'\w+', mf)
    print max((word.group() for word in words), key=len)
    # disappointment
Far more efficient than loading the file to physical memory.
Use str.replace to replace the dashes with spaces (or whatever you want). To do this, simply add another call to replace after the first call on line 3:
string = myfile.read().replace('\n', '').replace('-', ' ')
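Putting it together, a minimal sketch of the whole longest-word search with the dashes stripped (assuming the same alice_in_wonderland.txt file):
# strip newlines and dashes, split into words, then take the longest one
with open("alice_in_wonderland.txt", "r") as myfile:
    text = myfile.read().replace('\n', ' ').replace('-', ' ')
longest = max(text.split(), key=len)
print(longest)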

How to print tokenized arabic text in python/nltk?

I'm doing sentiment analysis for the Arabic language, using Python/NLTK and the DreamPie shell. This problem occurs when I apply the tokenization function; how do I display these words?
>>> import nltk
>>> sentence = "مصادمات عنيفه في"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['\xd9\x85\xd8\xb5\xd8\xa7\xd8\xaf\xd9\x85\xd8\xa7\xd8\xaa', '\xd8\xb9\xd9\x86\xd9\x8a\xd9\x81\xd9\x87', '\xd9\x81\xd9\x8a']
By printing tokens, you're printing the list, and the \x... sequences are the escaped byte representations of the UTF-8-encoded strings. If you want to print out the Arabic form, just loop through the list and print the tokens one by one.
>>> import nltk
>>> sentence = "مصادمات عنيفه في"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['\xd9\x85\xd8\xb5\xd8\xa7\xd8\xaf\xd9\x85\xd8\xa7\xd8\xaa', '\xd8\xb9\xd9\x86\xd9\x8a\xd9\x81\xd9\x87', '\xd9\x81\xd9\x8a']
>>> for i in tokens:
... print i
...
مصادمات
عنيفه
في
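For what it's worth, on Python 3 (where str is Unicode by default) the token list already displays the Arabic characters directly; a small sketch, assuming nltk and its punkt data are installed:
import nltk

sentence = "مصادمات عنيفه في"
tokens = nltk.word_tokenize(sentence)
print(tokens)  # ['مصادمات', 'عنيفه', 'في']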
