Sorting and counting words from a text file - python

I'm new to programming and stuck on my current program. I have to read in a story from a file, sort the words, and count the number of occurrences of each word. It counts the words, but it doesn't sort them, remove the punctuation, or remove duplicate words. I can't see why it's not working. Any advice would be helpful.
ifile = open("Story.txt",'r')
fileout = open("WordsKAI.txt",'w')
lines = ifile.readlines()
wordlist = []
countlist = []
for line in lines:
    wordlist.append(line)
    line = line.split()
    # line.lower()
    for word in line:
        word = word.strip(". , ! ? : ")
        # word = list(word)
        wordlist.sort()
        sorted(wordlist)
        countlist.append(word)
        print(word,countlist.count(word))

The main problem in your code is this line:
wordlist.append(line)
You are appending the whole line to wordlist, which I doubt is what you want. Because of this, nothing that goes into wordlist has been split into words or had .strip() applied to it.
What you have to do is add each word only after you have strip()ped it, and only after checking that it is not already in the list (no duplicates):
ifile = open("Story.txt",'r')
lines = ifile.readlines()
wordlist = []
countlist = []
for line in lines:
    # Get all the words in the current line
    words = line.split()
    for word in words:
        # Perform whatever manipulation to the word here
        # Remove any punctuation from the word
        word = word.strip(".,!?:;'\"")
        # Make the word lowercase
        word = word.lower()
        # Add the word into wordlist only if it is not in wordlist
        if word not in wordlist:
            wordlist.append(word)
        # Add the word to countlist so that it can be counted later
        countlist.append(word)
# Sort the wordlist
wordlist.sort()
# Print the wordlist
for word in wordlist:
    print(word, countlist.count(word))
Another way to do this is with a dictionary, storing the word as the key and the number of occurrences as the value:
ifile = open("Story.txt", "r")
lines = ifile.readlines()
word_dict = {}
for line in lines:
    # Get all the words in the current line
    words = line.split()
    for word in words:
        # Perform whatever manipulation to the word here
        # Remove any punctuation from the word
        word = word.strip(".,!?:;'\"")
        # Make the word lowercase
        word = word.lower()
        # Add the word to word_dict
        word_dict[word] = word_dict.get(word, 0) + 1
# Create a wordlist to display the words sorted
word_list = list(word_dict.keys())
word_list.sort()
for word in word_list:
    print(word, word_dict[word])
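For comparison, the same counting step can be sketched with collections.Counter from the standard library. This is only a sketch: the function name count_words and the sample lines are made up for illustration, and it strips every character in string.punctuation rather than the shorter list used above.

```python
import string
from collections import Counter


def count_words(lines):
    """Count lowercased, punctuation-stripped words across an iterable of lines."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            # strip() only removes characters from the two ends of the word
            word = word.strip(string.punctuation).lower()
            if word:  # skip tokens that consisted of punctuation only
                counts[word] += 1
    return counts


# Hypothetical sample lines standing in for the contents of Story.txt
sample = ["The cat sat.", "The dog sat, too!"]
counts = count_words(sample)
for word in sorted(counts):
    print(word, counts[word])
```

Because the function accepts any iterable of lines, an open file object can be passed in directly instead of the sample list.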

To sort the words without regard to case, you have to pass a key function to the sorting call.
Try this:
r = sorted(wordlist, key=str.lower)
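To see what the key function changes, here is a small sketch with a made-up list. Without a key, Python compares strings by code point, so every uppercase letter sorts before any lowercase one.

```python
words = ["banana", "Cherry", "apple"]  # made-up sample

# Default sort: "Cherry" comes first because "C" sorts before any lowercase letter
default_order = sorted(words)
print(default_order)

# Case-insensitive sort: each word is compared by its lowercased form
case_insensitive = sorted(words, key=str.lower)
print(case_insensitive)
```

Note that sorted() returns a new list and leaves the original untouched, which is why the bare sorted(wordlist) call in the question has no visible effect.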

punctuation = ".,!?: "
counts = {}
with open("Story.txt",'r') as infile:
    for line in infile:
        for word in line.split():
            # strip() removes any of these characters from both ends of the word
            word = word.strip(punctuation)
            if word not in counts:
                counts[word] = 0
            counts[word] += 1
with open("WordsKAI.txt",'w') as outfile:
    for word in sorted(counts): # if you want to sort by counts instead, use sorted(counts, key=counts.get)
        outfile.write("{}: {}\n".format(word, counts[word]))
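The comment in the code above mentions sorting by count. A minimal sketch of that variant, using a made-up counts dictionary, could look like this. Sorting on a tuple puts the most frequent words first and breaks ties alphabetically.

```python
counts = {"the": 3, "cat": 1, "sat": 2, "dog": 1}  # made-up counts

# Negate the count so larger counts sort first; the word itself breaks ties
by_count = sorted(counts, key=lambda w: (-counts[w], w))
for word in by_count:
    print("{}: {}".format(word, counts[word]))
```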

Related

How can I get two txt files by finding common occurrences?

I need to know which English words were used in an Italian chat and count how many times each was used.
But the output also includes words that never appear in the example chat (for example 'baby-blue-eyes': 0).
english_words = {}
with open("dizionarioen.txt") as f:
    for line in f:
        for word in line.strip().split():
            english_words[word] = 0
with open("_chat.txt") as f:
    for line in f:
        for word in line.strip().split():
            if word in english_words:
                english_words[word] += 1
print(english_words)
You can simply iterate over your result and remove all elements that have value 0:
english_words = {}
with open("dizionarioen.txt") as f:
    for line in f:
        for word in line.strip().split():
            english_words[word] = 0
with open("_chat.txt") as f:
    for line in f:
        for word in line.strip().split():
            if word in english_words:
                english_words[word] += 1
result = {key: value for key, value in english_words.items() if value}
print(result)
Here is another solution that counts the words with Counter:
from collections import Counter
with open("dizionarioen.txt") as f:
    all_words = set(word for line in f for word in line.split())
with open("_chat.txt") as f:
    result = Counter([word for line in f for word in line.split() if word in all_words])
print(result)
If you want to remove the words without occurrence after indexing, just delete these entries:
for w in list(english_words.keys()):
    if english_words[w] == 0:
        del english_words[w]
Then, your dictionary only contains words that occurred. Was that the question?
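The list(...) wrapper in that loop matters: a dictionary cannot change size while it is being iterated. The sketch below, with made-up counts, shows the failure first and then the snapshot approach.

```python
counts = {"hello": 2, "baby-blue-eyes": 0, "ok": 1}  # made-up counts

# Deleting while iterating over the dict itself raises RuntimeError
broken = dict(counts)
try:
    for w in broken:
        if broken[w] == 0:
            del broken[w]
except RuntimeError as exc:
    print("RuntimeError:", exc)

# Iterating over a snapshot of the keys makes the deletion safe
for w in list(counts):
    if counts[w] == 0:
        del counts[w]

print(counts)
```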

If the character 'p' is in a word, add the word to a list variable

So my assignment is this: Using the file school_prompt.txt, if the character 'p' is in a word, then add the word to a list called p_words.
I'm not sure what progress I've made but I've gotten stuck.
wordsFile = open("school_prompt.txt", 'r')
words = wordsFile.read()
wordsFile.close()
wordList = words.split()
p_words = 0
for words in wordList:
    if words[0] == 'p':
        p_words += 1
What you want is pretty straightforward; I'm not sure why you are making p_words a count of words instead of a list of words.
p_words = [word for word in wordList if 'p' in word]
As answered by Henrik, this can be done with an if statement. Also, p_words should be a list, not a number.
file=open("school_prompt.txt","r")
p_words=[]
file=file.read()
wordlist=file.split()
for i in wordlist:
    if 'p' in i:
        p_words.append(i)
This works. I tried to do it in one line using a list comprehension but couldn't get it to work.
fileref = open('school_prompt.txt', 'r')
words = fileref.read().split()
p_words = [word for word in words if 'p' in word]
One way is to nest a for loop inside another for loop to check each word for the letter "p".
Here's the code.
file = open("school_prompt.txt", "r")
content = file.readlines()
p_words = []
for lines in content:
    lines = lines.split()
    for words in lines:
        if "p" in words:
            p_words.append(words)
print(p_words)
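The difference between the question's test and the answers' test is easy to miss: words[0] == 'p' only looks at the first letter, while 'p' in word looks anywhere in the word. A small sketch on a made-up sentence (not the contents of school_prompt.txt) shows both:

```python
text = "please open the book and help people"  # made-up sample
word_list = text.split()

# What the question's condition selects: words that START with "p"
starts_with_p = [w for w in word_list if w[0] == "p"]

# What the assignment asks for: words that CONTAIN "p" anywhere
contains_p = [w for w in word_list if "p" in w]

print(starts_with_p)
print(contains_p)
```

Both tests are case-sensitive, so a word such as "Paris" would only match if it were lowercased first.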

how to disregard counting a certain word in python dictionary

Hello, I'm wondering how to build a dictionary that does not count the word 'the', or that removes it from the dictionary entirely. I came up with this code:
counts = dict()
print('Enter a line of text:')
line = input('')
words = line.split()
print('Words:', words)
print('counting...')
for word in words :
    if words != 'the':
        counts[word]= counts.get(word,0)+1
    else:
        counts[word] = counts.get(word,0)
print('Counts', counts)
Can you help me make it right?
While @Steve's answer is correct, the code can be tidied up and simplified:
from collections import Counter
line = input('Enter a line of text:')
words = line.split()
print('Words:', words)
print('counting...')
c = Counter(words)
del c['the'] # remove 'the' key from counter
print('Counts', dict(c))
Why not just this?
counts = dict()
print('Enter a line of text:')
line = input('')
words = line.split()
print('Words:', words)
print('counting...')
for word in words :
    if word != 'the':
        counts[word]= counts.get(word,0)+1
print('Counts', counts)
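If more than one word should be ignored, the same idea extends to a set of excluded words. The sketch below is an assumption about how that might look: the name stop_words and its contents are made up, and the input line is hard-coded instead of read with input().

```python
from collections import Counter

stop_words = {"the", "a", "an"}  # hypothetical set of words to ignore

line = "the cat chased the mouse and a bird"  # stands in for user input

# Skip excluded words before they ever reach the counter
counts = Counter(word for word in line.split() if word.lower() not in stop_words)
print('Counts', dict(counts))
```

Filtering before counting avoids creating an entry for 'the' at all, so nothing has to be deleted afterwards.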

How to search words from txt file to python

How can I show the words in a text file that are 20 characters long?
I know I can list all the words with the following code:
# Program for finding the words of length 20 in the words.txt file
def main():
    file = open("words.txt","r")
    lines = file.readlines()
    file.close()
    for line in lines:
        print (line)
    return
main()
But I'm not sure how to pick out and show only the words with 20 letters.
Many thanks.
If your lines contain several words and not just a single word per line, you first have to split them, which returns a list of the words:
words = line.split(' ')
Then you can iterate over each word in this list and check whether its length is 20.
for word in words:
    if len(word) == 20:
        # Do what you want to do here
If each line has a single word, you can just operate on line directly and skip the for loop. You may need to strip the trailing end-of-line character though, word = line.strip('\n'). If you just want to collect them all, you can do this:
words_longer_than_20 = []
for word in words:
    if len(word) > 20:
        words_longer_than_20.append(word)
If your file has one word only per line, and you want only the words with 20 letters you can simply use:
with open("words.txt", "r") as f:
    words = f.read().splitlines()
found = [x for x in words if len(x) == 20]
You can then print the list or print each word separately.
You can try this:
f = open('file.txt')
new_file = f.read().splitlines()
# Filter the lines already read; the file object f is exhausted after read()
words = [i for i in new_file if len(i) == 20]
f.close()
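The pieces above can be combined into one helper that works whether a line holds one word or several. This is a sketch under assumptions: the function name words_of_length is made up, and the sample lines use a length of 3 so the result is easy to check by eye.

```python
def words_of_length(lines, n):
    """Return every whitespace-separated word in lines with exactly n characters."""
    # split() with no arguments also discards the trailing newline
    return [word for line in lines for word in line.split() if len(word) == n]


# Made-up lines standing in for the contents of words.txt
sample = ["cat elephant\n", "dog\n", "horse mouse cow\n"]
found = words_of_length(sample, 3)
print(found)

# With a real file the call would look like this (not run here):
# with open("words.txt") as f:
#     found = words_of_length(f, 20)
```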

Create a dict whose value is the set of all possible anagrams for a given word

So what I'm trying to do is create a dict whose:
key is the word in sorted order and
value is the set of all its anagrams (generated by an anagram program).
When I run my program I get, for example, word : {('w', 'o', 'r', 'd')} instead of word : dorw, wrdo, rowd. The text file just contains a lot of words, one per line.
Code:
def main():
    wordList = readMatrix()
    print(lengthWord())
def readMatrix():
    wordList = []
    strFile = open("words.txt", "r")
    lines = strFile.readlines()
    for line in lines:
        word = sorted(line.rstrip().lower())
        wordList.append(tuple(word))
    return tuple(wordList)
def lengthWord():
    lenWord = 4
    sortDict = {}
    wordList = readMatrix()
    for word in wordList:
        if len(word) == lenWord:
            sortWord = ''.join(sorted(word))
            if sortWord not in sortDict:
                sortDict[sortWord] = set()
            sortDict[sortWord].add(word)
    return sortDict
main()
You are creating tuples of each word in the file:
for line in lines:
    word = sorted(line.rstrip().lower())
    wordList.append(tuple(word))
This sorts the letters of every word as it is read, so all anagrams of a word collapse into the same sorted character tuple and the original spellings are lost.
If you wanted to track all possible words, you should not produce tuples here. Just read the words:
for line in lines:
    word = line.rstrip().lower()
    wordList.append(word)
and process those words with your lengthWord() function; this function does need to take the wordList value as an argument:
def lengthWord(wordList):
    # ...
and you need to pass that in from main():
def main():
    wordList = readMatrix()
    print(lengthWord(wordList))
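Put together, the corrected flow might look like the sketch below. It is not the asker's exact program: the function name group_anagrams is made up, the words come from a hard-coded list instead of words.txt, and collections.defaultdict replaces the explicit "if key not in dict" check.

```python
from collections import defaultdict


def group_anagrams(words, length=4):
    """Map each sorted-letter key to the set of words of the given length that share it."""
    groups = defaultdict(set)
    for word in words:
        word = word.strip().lower()  # keep the original spelling, only normalized
        if len(word) == length:
            key = "".join(sorted(word))  # sort the letters only to build the key
            groups[key].add(word)
    return dict(groups)


# Made-up sample standing in for the lines of words.txt
sample = ["word\n", "drow\n", "Rowd\n", "tree\n", "cat\n"]
anagrams = group_anagrams(sample)
print(anagrams)
```

The key point is that the sorted form is used only as the dictionary key, while the unsorted word is what gets added to the set.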
