I tried to find and count specific 3-word-phrases in txt files by using this code:
phrases = ['hi there you','eat sausage bread', ...]
with open('test.txt') as f:
    for word in phrases:
        contents = f.read()
        count = contents.count('word')
        print(word, count)
Python lists every phrase, but it doesn't count them accurately. Instead, the count for the first phrase is always 63 and every following one is 0. As I have more than 100 phrases and also lots of different files, it would be a waste of time to count each phrase on its own (which, by the way, works with this script). Maybe someone could point out my obvious mistake or knows a possible solution; I'd be very thankful.
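You call f.read() once for each phrase. The first call reads the entire file and leaves the file pointer at the end; since you never move it back to the start of the file, every later call returns an empty string, so those counts are 0. Also note that count('word') counts the literal string 'word' rather than the phrase held in the variable word.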
Fix by reading the file only once.
with open('test.txt') as f:
    contents = f.read()
for word in phrases:
    count = contents.count(word)
    print(word, count)
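To see the file-pointer behaviour in isolation, here is a minimal sketch. It uses io.StringIO as a stand-in for a real file (an assumption, so the snippet runs without test.txt), and the sample text and phrases are made up for illustration.

```python
import io

# In-memory stand-in for open('test.txt'); the text is invented for illustration.
f = io.StringIO("hi there you\neat sausage bread\nhi there you\n")

first = f.read()    # reads everything and leaves the pointer at the end
second = f.read()   # nothing is left to read, so this is an empty string

f.seek(0)           # move the pointer back to the start
third = f.read()    # the full text again

phrases = ['hi there you', 'eat sausage bread']
counts = {phrase: first.count(phrase) for phrase in phrases}
print(repr(second), counts)
```

Calling seek(0) before every read would also work, but reading once and reusing the string avoids re-reading the file for each of the 100+ phrases.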
Related
def word_counter(s):
    word_list = s.split()
    return len(word_list)

f = open("a.txt")
total = 0
for i in f.readlines():
    total += word_counter(i)
print(total)
I want to count the number of letters (excluding blanks), the number of words, and the average word length for each of 'a.txt', 'b.txt', 'c.txt', 'd.txt' and 'e.txt'. Finally, I want to write a 'total.txt' covering all the txt files combined.
I don't know how to go further.
Please help.
You actually have the concept right. You just need to add a little more to reach your desired output.
Remember when you use f = open("a.txt"), make sure you call f.close(). Or, use the with keyword, like I did in the example. It automatically closes the file for you, even if you forget to.
I won't give the exact code as it is, but will provide the steps so that you learn the concepts.
Put all the .txt files names in a list.
Example, list_FileNames = ["a.txt", "b.txt"]
Then open each file, and get the entire file into a string.
for file in list_FileNames:
    with open(file, 'r') as inFile:
        # replace newlines with spaces so words on adjacent lines are not glued together
        myFileInOneString = inFile.read().replace('\n', ' ')
You have the right function to count words. For characters: len(myFileInOneString) - myFileInOneString.count(' ')
Save all these values into variables and write them to another file. Check how to write to a file: How to Write to a File in Python
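Put together, the steps above might look like the sketch below. The function names text_stats and summarize, the output format, and the default file name are assumptions for illustration, not part of the original answer. The sketch counts characters by summing word lengths, which excludes newlines and tabs as well as spaces.

```python
def text_stats(text):
    """Return (non-blank characters, words, average word length) for a string."""
    words = text.split()                    # splits on spaces, tabs and newlines
    chars = sum(len(w) for w in words)      # characters excluding all whitespace
    average = chars / len(words) if words else 0.0
    return chars, len(words), average


def summarize(file_names, out_name="total.txt"):
    """Write one line of stats per file, plus a combined line, to out_name."""
    combined = []
    with open(out_name, "w") as out:
        for name in file_names:
            with open(name) as in_file:
                text = in_file.read()
            combined.append(text)
            chars, words, avg = text_stats(text)
            out.write(f"{name}: {chars} characters, {words} words, {avg:.2f} average\n")
        chars, words, avg = text_stats("\n".join(combined))
        out.write(f"total: {chars} characters, {words} words, {avg:.2f} average\n")
```

A call such as summarize(["a.txt", "b.txt"]) would then write the per-file lines and the combined line to total.txt, assuming those files exist in the working directory.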
My code is trying to count how many words there are in the file, which is printed above. After that, I want to be able to enter a word and have the code tell me how many times that word occurs in the text and at which positions.
2 seconds the code did not paste.
Will not let me post image so here is the code
import os
os.path.isfile('text1.txt')
file = open('text1.txt','r')
print(file.readline())
count = 0
with open(text1, "rb") as fp:
    data = data.translate(string.maketrans("",""), string.punctuation)
    for word in data.split():
        if word in input_list:
            count += 1
print(count)
The first problem with your code: os.path.isfile('text1.txt') tests whether the file text1.txt exists, so it returns either True or False. Calling it without using the result in a condition (for example an if statement) accomplishes nothing.
Now for why your code prints the first line correctly but does not count words. The first time you open the file you pass the name 'text1.txt' correctly, but the second time you try to open it through the variable text1, and as far as I can see from the code you provided, no such variable exists. So the correct way would be something like this:
# pass the file name as a string instead of an undefined variable
with open('text1.txt', "r") as fp:  # use only "r": 'b' is for binary, and this is a text file
    data = fp.read()  # read the contents into a string before translating them
    data = data.translate(string.maketrans("",""), string.punctuation)
    for word in data.split():
        if word in input_list:
            count += 1
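Additionally, I don't know where this data.translate call came from, so I can't tell whether it's interfering. I don't even know if it works; it didn't work for me. (The two-argument translate call with string.maketrans is a Python 2 idiom, and string.maketrans does not exist in Python 3.)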
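For reference, here is a Python 3 sketch of what the question seems to be after: strip punctuation, then report how often a word occurs and where. The function name find_word and the sample sentence are hypothetical, and positions are counted in words starting from 1, which is only one possible reading of "position".

```python
import string

def find_word(text, target):
    """Return (total words, occurrences of target, 1-based word positions)."""
    # str.maketrans with three arguments builds a table that deletes every
    # character in the third argument (here: all ASCII punctuation).
    table = str.maketrans("", "", string.punctuation)
    words = text.translate(table).split()
    positions = [i for i, w in enumerate(words, start=1) if w == target]
    return len(words), len(positions), positions

sample = "The cat sat. The dog, too, sat!"
print(find_word(sample, "sat"))   # (7, 2, [3, 7])
```

To use it on a file, read the file into a string first, for example with open('text1.txt') as fp: text = fp.read().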
To make a long story short, I am writing a Python script that asks the user to drop a .docx file, which is converted to .txt. Python looks for keywords within the .txt file and displays them in the shell. I was running into UnicodeDecodeError (the charmap codec, etc.). I overcame that by writing word.decode("charmap") within my for loop. Now Python is not displaying the keywords it finds in the shell. Any advice on how to overcome this? Maybe have Python skip the characters it cannot decode and continue reading through the rest? Here is my code:
import sys
import os
import codecs
filename = input("Drag and drop resume here: ")
keywords =['NGA', 'DoD', 'Running', 'Programing', 'Enterprise', 'impossible', 'meets']
file_words = []
with open(filename, "rb") as file:
    for line in file:
        for word in line.split():
            word.decode("charmap")
            file_words.append(word)
comparison = []
for words in file_words:
    if words in keywords:
        comparison.append(words)
def remove_duplicates(comparison):
    output = []
    seen = set()
    for words in comparison:
        if words not in seen:
            output.append(words)
            seen.add(words)
    return output
comparison = remove_duplicates(comparison)
print ("Keywords found:",comparison)
key_count = 0
word_count = 0
for element in comparison:
    word_count += 1
for element in keywords:
    key_count += 1
Threshold = word_count / key_count
if Threshold <= 0.7:
    print ("The candidate is not qualified for")
else:
    print ("The candidate is qualified for")
file.close()
And the output:
Drag and drop resume here: C:\Users\User\Desktop\Resume_Newton Love_151111.txt
Keywords found: []
The candidate is not qualified for
In Python 3, don't open text files in binary mode. By default, the file is decoded to Unicode using locale.getpreferredencoding(False) (cp1252 on US Windows):
with open(filename) as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
or specify an encoding:
with open(filename, encoding='utf8') as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
You do need to know the encoding of your file. There are other options to open as well, including errors='ignore' or errors='replace' but you shouldn't get errors if you know the correct encoding.
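As a rough illustration of those options, the sketch below writes a small file containing one byte that is valid cp1252 but not valid UTF-8, then reads it back three ways. The file name and contents are invented for the demonstration.

```python
import os
import tempfile

# Write bytes that are valid cp1252 but not valid UTF-8 (0xE9 is 'é' in cp1252).
path = os.path.join(tempfile.mkdtemp(), "resume.txt")
with open(path, "wb") as f:
    f.write(b"caf\xe9 Enterprise\n")

# Correct encoding: decodes cleanly.
with open(path, encoding="cp1252") as f:
    right = f.read().split()

# Wrong encoding, errors='replace': the bad byte becomes U+FFFD.
with open(path, encoding="utf8", errors="replace") as f:
    replaced = f.read().split()

# Wrong encoding, errors='ignore': the bad byte is silently dropped.
with open(path, encoding="utf8", errors="ignore") as f:
    ignored = f.read().split()

print(right, replaced, ignored)
```

Note that both error handlers alter the word containing the bad byte, so a keyword with a non-ASCII character would no longer match; knowing the real encoding remains the better fix.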
As others have said, posting a sample of your text file that reproduces the error and the error traceback would help diagnose your specific issue.
In case anyone cares. It's been a long time, but wanted to clear up that I didn't even know the difference between binary and txt files back in these days. I eventually found a doc/docx module for python that made things easier. Sorry for the headache!
Maybe posting the code producing the traceback would be easier to fix.
I'm not sure this is the only problem, maybe this would work better:
with open(filename, "rb") as file:
    for line in file:
        for word in line.split():
            file_words.append(word.decode("charmap"))
Alright, I figured it out by adding encoding = 'charmap' (code below). However, I then tried a docx file that seemed to be more complex, and when it was converted to .txt the entire file consisted of special characters. So now I am thinking that I should move to the python-docx module, since it deals with XML-based files like Word documents.
with open(filename, encoding = 'charmap') as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
So I'm trying to find a way to read a txt file and find a specific word. I have been reading the file with
myfile=open('daily.txt','r')
r=myfile.readlines()
That returns a list with a string for each line in the file; I want to find a word in one of the strings inside the list.
edit:
I'm sorry, I meant whether there is a way to find where the word is in the txt file, like x=myfile[12] x=x[2:6]
def findLines():
    myWord = 'someWordIWantToSearchFor'
    answer = []
    with open('daily.txt') as myfile:
        lines = myfile.readlines()
    for line in lines:
        if myWord in line:
            answer.append(line)
    return answer
with open('daily.txt') as myfile:
    for line in myfile:
        if "needle" in line:
            print("found it:", line)
With the above, you don't need to allocate memory for the entire file at once, only one line at a time. This will be much more efficient if your file is large. It also closes the file automatically at the end of the with.
I'm not sure if the suggested answers solve the problem or not, because I'm not sure what the original proposer means. If he really means "words," not "substrings" then the solutions don't work, because, for example,
'cat' in line
evaluates to True if line contains the word 'catastrophe.' I think you may want to amend these answers along the lines of
if word in line.split(): ...
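A small sketch of the difference, using a made-up sentence. The regular-expression variant is an addition of mine rather than something proposed in the thread; it is included because a plain split() leaves punctuation attached to the tokens.

```python
import re

line = "A catastrophe, not a cat."

# Substring test: matches, but would also match on 'catastrophe' alone.
substring_hit = 'cat' in line

# Token test: split() yields 'cat.' (with the period), so this misses.
split_hit = 'cat' in line.split()

# Word-boundary test: \b matches between a word character and a non-word one.
regex_hit = re.search(r'\bcat\b', line) is not None

# With only 'catastrophe' present, the word-boundary test correctly fails.
false_friend = re.search(r'\bcat\b', "A catastrophe") is not None

print(substring_hit, split_hit, regex_hit, false_friend)
```

If the search word can contain characters that are special in regular expressions, wrapping it in re.escape() before building the pattern is the usual precaution.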
I'm new to Python and am working on a program that will count the instances of words in a simple text file. The name of the text file is passed on the command line, so I have included code for reading command line arguments. The code is below:
import sys
count={}
with open(sys.argv[1],'r') as f:
    for line in f:
        for word in line.split():
            if word not in count:
                count[word] = 1
            else:
                count[word] += 1
print(word,count[word])
file.close()
count is a dictionary to store the words and the number of times they occur. I want to be able to print out each word and the number of times it occurs, starting from most occurrences to least occurrences.
I'd like to know if I'm on the right track, and if I'm using sys properly. Thank you!!
What you did looks fine to me. You could also use collections.Counter (available in Python 2.7 and newer), which does the counting for you and can list the words by frequency. My solution would look like this; some improvement is probably possible.
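import sys
from collections import Counter

c = Counter()
with open(sys.argv[1], 'r') as f:
    for line in f:
        # update() expects an iterable of items; passing a single string
        # would count its characters instead of whole words
        c.update(line.split())

# most_common() yields (word, count) pairs, highest count first
for word, n in c.most_common():
    print(word, n)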
Your final print doesn't have a loop, so it will just print the count for the last word you read, which still remains as the value of word.
Also, with a with context manager, you don't need to close() the file handle.
Finally, as pointed out in a comment, you'll want to remove the final newline from each line before you split.
For a simple program like this, it's probably not worth the trouble, but you might want to look at defaultdict from the collections module to avoid the special case for initializing a new key in the dictionary.
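A minimal sketch of the defaultdict variant follows. The helper name count_words and the sample lines are assumptions for illustration; in the original program the lines would come from the opened file.

```python
from collections import defaultdict

def count_words(lines):
    """Count whitespace-separated words across an iterable of lines."""
    count = defaultdict(int)    # missing keys start at int() == 0
    for line in lines:
        for word in line.split():
            count[word] += 1    # no "if word not in count" branch needed
    return count

count = count_words(["a b a", "b a"])
print(dict(count))
```

Because a file object is itself an iterable of lines, count_words(f) would work directly inside the with block.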
I just noticed a typo: you open the file as f but you close it as file. As tripleee said, you shouldn't close files that you open in a with statement. Also, it's bad practice to use the names of builtin functions, like file or list, for your own identifiers. Sometimes it works, but sometimes it causes nasty bugs. And it's confusing for people who read your code; a syntax highlighting editor can help avoid this little problem.
To print the data in your count dict in descending order of count you can do something like this:
# sort the (word, count) pairs by count, highest first
items = sorted(count.items(), key=lambda kv: kv[1], reverse=True)
print('\n'.join('%s: %d' % (k, v) for k, v in items))
See the Python Library Reference for more details on sorted(), the list.sort() method and other handy dict methods.
I did something similar using the re library. This computes the average number of words per line in a text file; you would need to adapt it to get the number of words on each line.
import re

# this program gets the average number of words per line
def main():
    # get the name of the file
    filename = input('Enter a filename:')
    try:
        # open the file and read its contents; `with` closes it afterwards
        with open(filename, 'r') as infile:
            contents = infile.read()
        # count lines with splitlines() so a last line without '\n' is included
        line = len(contents.splitlines())
        count = len(re.findall(r'\w+', contents))
        # avoid dividing by zero for an empty file
        average = count // line if line else 0
        # display file contents
        print(contents)
        print('there is an average of', average, 'words per line')
    except IOError:
        print('An error occurred when trying to read')
        print('the file', filename)

# call main
main()