Word frequency counter in file - python

I am working on an assignment and I have hit a wall. The assignment requires me to count the frequency of words in a text file. I got my code to count the words and put them into a dictionary, but I cannot combine words that differ only in case. For example, I need the output to show {'a': 16, ...} but instead it outputs {'A': 2, ..., 'a': 14}. Here is my code. Any help would be much appreciated.
file = open("phrases.txt", "r")
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
print(wordcount)

You can use collections.Counter for this, as an alternative to looping through the list yourself.
For example:
from collections import Counter
file = open("phrases.txt","r")
data = file.read().lower().split()  # added lower() to convert everything to lower case
wordcount = dict(Counter(data))
print(wordcount)
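If you also want the most frequent words, Counter provides a most_common() method; for example:
print(Counter(data).most_common(3))  # the three most frequent words with their counts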

It sounds like you're saying in the question that there is an uppercase/lowercase issue, so why not:
file = open("phrases.txt", "r")
wordcount = {}
for word in file.read().split():
    if word.lower() not in wordcount:
        wordcount[word.lower()] = 1
    else:
        wordcount[word.lower()] += 1
print(wordcount)
Or:
file = open("phrases.txt", "r")
words = [w.lower() for w in file.read().split()]  # read the file once; it can only be read through a single time
wordcount = {}.fromkeys(words, 0)
for word in words:
    wordcount[word] += 1
print(wordcount)

Lower-case all the words when comparing, for example:
for word in file.read().split():
    word = word.lower()

You can convert the words to lowercase, and then count them. So, your code changes to something like this.
file = open("phrases.txt", "r")
wordcount = {}
for word in file.read().split():
    newWord = word.lower()
    if newWord not in wordcount:
        wordcount[newWord] = 1
    else:
        wordcount[newWord] += 1
print(wordcount)
Basically, the keys you store in the dict are the lower-case versions of each word.
Do note that you lose information this way: the original casing is gone, which matters if you later need case-sensitive operations.
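If you ever need the original spellings as well, one rough sketch (just an idea, not part of the assignment) is to count case-insensitively while also remembering the variants seen for each key:
file = open("phrases.txt", "r")
wordcount = {}
variants = {}
for word in file.read().split():
    key = word.lower()
    wordcount[key] = wordcount.get(key, 0) + 1    # case-insensitive count
    variants.setdefault(key, set()).add(word)     # remember the original spellings
print(wordcount)
print(variants)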

Related

What's the difference between "word = line.split()" and "for word in line.split()"?

I'm new to programming and this is my first question here. I feel it might be a very silly beginner doubt, but here goes.
On multiple occasions, I've typed out the whole code right except for this one line, on which I make the same mistake every time.
Could someone please explain to me what the computer understands when I type each of the following lines, and what the difference is?
word = line.split()
for word in line.split()
The difference between the expected and my actual output is just because I typed the former instead of the latter:
word = line.split()
This will split the line variable (using the default "any amount of white space" separator) and give you back a list of words built from it. You then bind the variable word to that list.
On the other hand:
for word in line.split()
initially does the same thing the previous command did (splitting the line to get a list) but, instead of binding the word variable to that entire list, it iterates over the list, binding word to each string in the list in turn.
The following transcript hopefully makes this clearer:
>>> line = 'pax is good-looking'
>>> word = line.split() ; print(word)
['pax', 'is', 'good-looking']
>>> for word in line.split(): print(word)
...
pax
is
good-looking
split() breaks a string into a list of substrings.
word = line.split() will return a list built by splitting line on whitespace (the default separator).
for word in line.split() will iterate over that list (line.split()).
Here is an example for clarification.
line = "Stackoverflow is amazing"
word = line.split()
print(word)
>>>['Stackoverflow','is','amazing']
for word in line.split():
print(word)
>>>
'Stackoverflow'
'is'
'amazing'

Python: how to use a method in a list comprehension

I have a method to remove punctuation from every word in an array of words and I want to use it in a list comprehension. All I can think of with my basic Python knowledge is:
import string

def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))

def letters_only(astr):
    return astr.isalpha()

# lemmatizer, all_names and all_names_lower are defined elsewhere in my script
def clean_text(docs):
    cleaned_docs = []
    for doc in docs:
        cleaned_docs.append(' '.join([lemmatizer.lemmatize(remove_punctuation(word.lower()))
                                      for word in doc.split()
                                      if letters_only(word)
                                      and remove_punctuation(word) not in all_names
                                      and remove_punctuation(word) not in all_names_lower]))
    return cleaned_docs
As you can see I am using the remove_punctuation method in many places. Is there any way to use it only once, or more efficiently?
Thanks!
*letters_only comes from a tutorial; unfortunately it sees a word like "best!" with an exclamation mark at the end and removes the whole word, but I am trying to make it remove only the exclamation mark.
Since you provided the definitions for letters_only and remove_punctuation we can now say that your code is equivalent to:
[lemmatizer.lemmatize(word.lower())
for word in doc.split()
if letters_only(word) and word.lower() not in all_names_lower]
So all the calls to remove_punctuation are useless, because they run only when letters_only(word) is true, which means word does not contain any punctuation.
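To see why, note that isalpha() is False for any word that still contains punctuation, so punctuated words never make it past the filter in the first place:
>>> "best!".isalpha()
False
>>> "best".isalpha()
True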
Not really. The best you can do is zip together the original list with a generator that removes punctuation:
original_words = doc.split()
no_punct_words = map(remove_punctuation, original_words)
cleaned_docs.append(' '.join([lemmatizer.lemmatize(no_punct_word.lower())
                              for word, no_punct_word in zip(original_words, no_punct_words)
                              if letters_only(word)
                              and no_punct_word not in all_names
                              and no_punct_word not in all_names_lower]))
Anyway, your conditions do not make much sense: if the letters_only(word) condition is true, I'd expect remove_punctuation to do nothing to word, so you could drop it.
Also: the two conditions:
no_punct_word not in all_names and no_punct_word not in all_names_lower
could probably become just:
no_punct_word.lower() not in all_names_lower
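(That assumes all_names_lower is simply the lower-cased version of all_names, e.g. built once as something like:)
all_names_lower = {name.lower() for name in all_names}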
As an aside: if the conditions you want to apply should always be applied to remove_punctuation(word) then you can do better: you can just map that function:
no_punct_words = map(remove_punctuation, doc.split())
# ...
[lemmatizer.lemmatize(word.lower())
for word in no_punct_words if letters_only(word)
and word.lower() not in all_names_lower]
And maybe you can do the same with .lower():
lower_no_punct_words = map(str.lower, map(remove_punctuation, doc.split()))
# ...
[lemmatizer.lemmatize(word)
for word in lower_no_punct_words if letters_only(word)
and word not in all_names_lower]
Trying to guess the intention (the code seems to have a few bugs), I'd say you should be good with something like the below. Note the laziness of the whole thing; it should make the code less greedy on memory consumption.
def normalized_words_of(doc):
    for word in doc.split():
        if letters_only(word):
            yield remove_punctuation(word.lower())

def clean_text(docs):
    for doc in docs:
        yield ' '.join(word for word in normalized_words_of(doc)
                       if word not in all_names_lower)

print(list(clean_text(['hi there, you', 'good bye - till next time'])))

Python: making a wordcounter that excludes words less than three letters long

I'm very new to python and I've been trying to make a wordcounter that excludes words less than three letters long. Right now my basic counter looks like this:
wordcount = len(raw_input("Paste your document here: ").split())
print wordcount
This returns the word count but I can't figure out how to make it exclude words with three letters or less. Every time I try something new, I get an iteration error. I've been scouring the internet for some ideas on how to make python recognize how long certain words are, but I haven't had much luck. Any help would be appreciated.
Code -
wordcount = raw_input("Paste your document here: ").split()
wordcount = [word for word in wordcount if len(word) >= 3]
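If you then want the count itself (as in your original snippet), take the length of the filtered list:
print len(wordcount)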
You're on the right path. Just split the input, then use a list comprehension to select only the words with len >= 3:
words = raw_input("Paste your document here: ").split()
newwords = [word for word in words if len(word) >= 3]
wordcount = len(newwords)
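If you would rather get the count directly, without keeping the intermediate list, a sum() over a generator expression does the same thing; a small sketch:
words = raw_input("Paste your document here: ").split()
wordcount = sum(1 for word in words if len(word) >= 3)
print wordcount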

Python lists not working properly

import random

words = ["Football", "Happy", "Sad", "Love", "Human"]
for word in words:
    word = random.choice(words)
    print(word)
    words.remove(word)
Why does the above code only print out 3 words instead of all 5? Am I going about printing the words from words in a random order the wrong way?
You can't modify a list (by adding or removing elements) while iterating over it; the behaviour is undefined. Here's a possible alternative for what you're doing that doesn't have that problem:
random.shuffle(words)
for word in words:
    print(word)
This is because you are not looping correctly. Try this:
import random

words = ["Football", "Happy", "Sad", "Love", "Human"]
while words:
    word = random.choice(words)
    print(word)
    words.remove(word)
The while words: condition keeps looping until the list is empty, and this way you are not modifying a list while iterating over it.
People have mostly explained why you're not getting the behavior you want, but just to throw an alternate solution into the mix using a different idiom:
import random

words = ["Football", "Happy", "Sad", "Love", "Human"]
random.shuffle(words)
while words:
    print(words.pop())
You should not modify a list while iterating over it. Try:
for _ in range(len(words)):
    word = random.choice(words)
    words.remove(word)
    print(word)
To explicitly state blogbeard's suggestion:
>>>import random
>>>random.shuffle(words)
>>>print(*words)

python - remove string from words in an array

#!/usr/bin/python
# This looks for words in the dictionary that begin with 'in' and whose suffix is a real word.
wordlist = [line.strip() for line in open('/usr/share/dict/words')]
newlist = []
for word in wordlist:
    if word.startswith("in"):
        newlist.append(word)
for word in newlist:
    word = word.split('in')
print newlist
How would I get the program to remove the string "in" from all the words that start with it? Right now it does not work.
#!/usr/bin/env python
# Look for all words beginning with 'in'
# such that the rest of the word is also
# a valid word.

# load the dictionary:
with open('/usr/share/dict/words') as inf:
    allWords = set(word.strip() for word in inf)  # one word per line
Using 'with' ensures the file is always properly closed. I make allWords a set, which makes membership testing an O(1) operation.
Then we can do:
# get the remainder of all words beginning with 'in'
inWords = [word[2:] for word in allWords if word.startswith("in")]
# filter to get just those which are valid words
inWords = [word for word in inWords if word in allWords]
or combine it into a single statement, like:
inWords = [word for word in (word[2:] for word in allWords if word.startswith("in")) if word in allWords]
Doing it the second way also lets us use a generator for the inside loop, reducing memory requirements.
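Putting those pieces together, a complete sketch (still assuming one word per line in /usr/share/dict/words) would look something like:
#!/usr/bin/env python
with open('/usr/share/dict/words') as inf:
    allWords = set(word.strip() for word in inf)

# remainders of 'in...' words that are themselves valid words
inWords = [word for word in (word[2:] for word in allWords if word.startswith("in"))
           if word in allWords]
print inWords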
split() returns a list of the segments obtained by splitting. Furthermore,
word = word.split('in')
doesn't modify your list; it just rebinds the loop variable.
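A quick illustration of that point:
>>> newlist = ['inside', 'input']
>>> for word in newlist:
...     word = word.split('in')    # rebinds the name 'word' only
...
>>> newlist
['inside', 'input']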
Try replacing your second loop with this:
for i in range(len(newlist)):
    word = newlist[i].split('in', 1)
    newlist[i] = word[1]
It's difficult to tell from your question exactly what you want in newlist. If you just want words that start with "in", with the "in" removed, then you can use a slice:
newlist = [word[2:] for word in wordlist if word.startswith('in')]
If you want words that start with "in" and are still in wordlist once they've had the "in" removed (is that what you meant by "real" in your comment?), then you need something a little different:
newlist = [word for word in wordlist if word.startswith('in') and word[2:] in wordlist]
Note that in Python we use a list, not an "array".
Suppose that wordlist is the list of words. The following code should do the trick:
for i in range(len(wordlist)):
    if wordlist[i].startswith("in"):
        wordlist[i] = wordlist[i][2:]
It may be better to use a while loop if the number of words in the list is quite big.
