I'm trying to write a script to pull the word count of many files within a directory. I have it working fairly close to what I want, but there is one part that is throwing me off. The code so far is:
import glob
directory = "/Users/.../.../files/*"
output = "/Users/.../.../output.txt"
filepath = glob.glob(directory)
def wordCount(filepath):
    for file in filepath:
        name = file
        fileO = open(file, 'r')
        for line in fileO:
            sentences = 0
            sentences += line.count('.') + line.count('!') + line.count('?')
            tempwords = line.split()
            words = 0
            words += len(tempwords)
            outputO = open(output, "a")
            outputO.write("Name: " + name + "\n" + "Words: " + str(words) + "\n")
wordCount(filepath)
This writes the word counts to a file named "output.txt" and gives me output that looks like this:
Name: /Users/..../..../files/Bush1989.02.9.txt
Words: 10
Name: /Users/..../..../files/Bush1989.02.9.txt
Words: 0
Name: /Users/..../..../files/Bush1989.02.9.txt
Words: 3
Name: /Users/..../..../files/Bush1989.02.9.txt
Words: 0
Name: /Users/..../..../files/Bush1989.02.9.txt
Words: 4821
And this repeats for each file in the directory. As you can see, it gives me multiple counts for each file. The files are formatted such as:
Address on Administration Goals Before a Joint Session of Congress
February 9, 1989
Mr. Speaker, Mr. President, and distinguished Members of the House and
Senate...
So, it seems that the script is giving me a count of each "part" of the file, such as the 10 words on the first line, 0 on the line break, 3 on the next, 0 on the next, and then the count for the body of the text.
What I'm looking for is a single count for each file. Any help/direction is appreciated.
The last two lines of your inner loop, which print out the filename and word count, should be part of the outer loop, not the inner loop - as it is, they're being run once per line.
You're also resetting the sentence and word counts for each line - these should be in the outer loop, before the start of the inner loop.
Here's what your code should look like after the changes:
import glob
directory = "/Users/.../.../files/*"
output = "/Users/.../.../output.txt"
filepath = glob.glob(directory)
def wordCount(filepath):
    for file in filepath:
        name = file
        fileO = open(file, 'r')
        sentences = 0
        words = 0
        for line in fileO:
            sentences += line.count('.') + line.count('!') + line.count('?')
            tempwords = line.split()
            words += len(tempwords)
        outputO = open(output, "a")
        outputO.write("Name: " + name + "\n" + "Words: " + str(words) + "\n")
wordCount(filepath)
Isn't your indentation wrong? The last lines run once per line, but you really mean once per file, don't you?
(Also, try to avoid "file" as an identifier: it is a built-in type in Python 2.)
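As a variation on the corrected loop, here is a minimal sketch that closes every file with a `with` block and avoids `file` as a name. The function names `count_words` and `write_counts` are hypothetical, and a "word" is assumed to be any whitespace-separated token, as in the code above.

```python
import glob


def count_words(path):
    """Return the number of whitespace-separated words in one file."""
    with open(path, "r") as handle:
        # Sum the per-line counts so the total covers the whole file.
        return sum(len(line.split()) for line in handle)


def write_counts(pattern, output_path):
    """Append one Name/Words record per file matching the glob pattern."""
    with open(output_path, "a") as out:
        for path in sorted(glob.glob(pattern)):
            out.write("Name: " + path + "\n")
            out.write("Words: " + str(count_words(path)) + "\n")
```

Calling `write_counts(directory, output)` with the two variables from the question should then produce one record per file instead of one per line.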
This is for a project I am working on and I have simplified my issue. I have two files at hand. First file contains a list of terms such as:
dog
apple
gold
boy
tomato
My second file contains a paragraph possibly containing the terms found in 1st file, but does not have to. Example:
the dog for some reason had a grand
appetite for eating golden apples,
but the dog did not like eating tomatoes
the dog only likes eating gold colored foods
My goal is to open the 1st file, assign variable "wanted_word" the term on the first line (in this case "dog"). Then I want to search for this "wanted_word" in the second file in each line. If the string is found, I want to create a file that contains the first 3 terms of the line that the "wanted_word" is found in. So the output I want would be:
the dog for
but the dog
the dog only
With my current code, I can achieve this. My issue is that after the file is created, I want to move onto the string on the next line in the first file (in this case: "apple"). The idea of the code is to repeat the whole process, create a new file each time a string in first file is found in the second file. If the string is not in the 2nd file, then I want the program to move onto next line.
My code so far:
def word_match(Listofwords, string):
    wordnumber = 0
    listOfAssociatedWords = []
    with open(Listofwords, 'r') as read_obj:
        for line in read_obj:
            wordnumber += 1
            if string in line:
                listOfAssociatedWords.append(line.split()[:3])
    return listOfAssociatedWords
#------------------------------------------------------------------------------
Firstfile = open("/Directory/firstfilename", "r")
wanted_word = Firstfile.readline(3) #This part also undermines my goal since I limit the read to 3 chars
Firstfile.close()
#------------------------------------------------------------------------------
matched_words = word_match("/Directory/secondfilename", wanted_word)
NewFile = open(wanted_word + '.txt', "w") #this is for the file creation part
for elem in matched_words:
    NewFile.write(elem[0] + " " + elem[1] + " " + elem[2])
    NewFile.write("\n")
So at the end, by this logic, I would have 4 files: one for every term except "boy", which does not appear in the second file. I am aware I need a loop, but I am inexperienced with Python and would appreciate some help.
You need to loop over words, and inside of the loop go over each line:
with open("/Directory/firstfilename") as words:
    for word in words:
        word = word.strip()  # drop the trailing newline before matching
        if not word:
            continue  # skip blank lines in the word list
        found_lines = []
        with open("/Directory/secondfilename") as lines:
            for line in lines:
                if word in line:
                    found_lines.append(' '.join(line.split()[:3]))
        if found_lines:
            with open(word + '.txt', 'w') as out_file:
                for line in found_lines:
                    out_file.write(line + '\n')
This should write a new file "wordsX.txt" for every matching word in your word list:
def wordMatch(wordListFileLocation, paragraphListFileLocation):
    fileCounter = 0
    with open(paragraphListFileLocation) as file:
        paragraphs = file.readlines()
    with open(wordListFileLocation) as wordListFile:
        for word in wordListFile:
            word = word.strip()  # drop the trailing newline before matching
            matching = [s for s in paragraphs if word and word in s]
            if matching:
                with open('words{0}.txt'.format(fileCounter), 'w') as newFile:
                    for s in matching:
                        # write the first three words of each matching line
                        newFile.write(' '.join(s.split()[:3]) + '\n')
                fileCounter += 1
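Both answers boil down to the same steps: strip each term, collect the first three words of every line containing it, and write a file only when something matched. A minimal sketch of that logic, separated from the file handling so it can be checked on its own, might look like this. The names `first_words_per_term` and `write_term_files` are hypothetical, and matching is a plain substring test, so "gold" also matches "golden".

```python
import os


def first_words_per_term(terms, lines, n=3):
    """Map each term to the first n words of every line that contains it."""
    results = {}
    for term in terms:
        term = term.strip()  # remove the trailing newline read from the file
        if not term:
            continue
        hits = [" ".join(line.split()[:n]) for line in lines if term in line]
        if hits:  # terms with no match get no entry, hence no file
            results[term] = hits
    return results


def write_term_files(results, directory="."):
    """Write one '<term>.txt' file per matched term."""
    for term, hits in results.items():
        with open(os.path.join(directory, term + ".txt"), "w") as out_file:
            out_file.write("\n".join(hits) + "\n")
```

With the sample files from the question, this should yield entries for "dog", "apple", "gold" and "tomato", and none for "boy".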
I am using this code to count the same words in a text file.
filename = input("Enter name of input file: ")
file = open(filename, "r", encoding="utf8")
wordCounter = {}
with open(filename,'r',encoding="utf8") as fh:
    for line in fh:
        # Replacing punctuation characters. Making the string lower case.
        # The split will split the line into a list.
        word_list = line.replace(',','').replace('\'','').replace('.','').replace("'",'').replace('"','').replace('"','').replace('#','').replace('!','').replace('^','').replace('$','').replace('+','').replace('%','').replace('&','').replace('/','').replace('{','').replace('}','').replace('[','').replace(']','').replace('(','').replace(')','').replace('=','').replace('*','').replace('?','').lower().split()
        for word in word_list:
            # Adding the word into the wordCounter dictionary.
            if word not in wordCounter:
                wordCounter[word] = 1
            else:
                # if the word is already in the dictionary update its count.
                wordCounter[word] = wordCounter[word] + 1
print('{:15}{:3}'.format('Word','Count'))
print('-' * 18)
# printing the words and their occurrences.
for word,occurance in wordCounter.items():
    print(word,occurance)
I need them to be in order in bigger number to smaller number as output. For example:
word 1: 25
word 2: 12
word 3: 5
.
.
.
I also need to accept only a ".txt" file as input. If the user enters anything else, the program must show the error "Write a valid file name".
How can I sort the output and add this error check at the same time?
For printing in order, you can sort them prior to printing by the occurrence like this:
for word,occurance in sorted(wordCounter.items(), key=lambda x: x[1], reverse=True):
    print(word,occurance)
In order to check whether the file is valid in the way that you want, you can consider using:
import os
path1 = "path/to/file1.txt"
path2 = "path/to/file2.png"
if not path1.lower().endswith('.txt'):
    print("Write a valid file name")
if not os.path.exists(path1):
    print("File does not exist!")
You can try:
if filename[-4:] != '.txt':
    print('Please input a valid file name')
and then repeat the input prompt.
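The two requests can be combined in a short sketch: keep asking until the name ends in ".txt" and the file exists, then print the counts from highest to lowest. This assumes Python 3; the helper names `is_valid_txt`, `ask_for_filename` and `sorted_counts` are hypothetical, and `collections.Counter` is swapped in for the hand-written dictionary.

```python
import os
from collections import Counter


def is_valid_txt(filename):
    """True when the name ends in .txt and the file actually exists."""
    return filename.lower().endswith(".txt") and os.path.isfile(filename)


def ask_for_filename(read=input):
    """Prompt repeatedly until a valid .txt file name is entered."""
    filename = read("Enter name of input file: ")
    while not is_valid_txt(filename):
        print("Write a valid file name")
        filename = read("Enter name of input file: ")
    return filename


def sorted_counts(words):
    """Return (word, count) pairs ordered from most to least frequent."""
    return Counter(words).most_common()
```

For example, `sorted_counts("a b a c a b".split())` gives `[('a', 3), ('b', 2), ('c', 1)]`, which can be printed in a loop just like `wordCounter.items()`.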
I have 2 txt files (file_a.txt and file_b.txt).
file_a.txt contains a long list of 4-letter combinations (one combination per line):
aaaa
bcsg
aacd
gdee
aadw
hwer
etc.
file_b.txt contains a list of letter combinations of various length (some with spaces):
aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
etc.
I am looking for a python script that would allow me to do the following:
read file_a.txt line by line
take each 4-letter combination (e.g. aaai)
read file_b.txt and find all the various-length letter combinations starting with the 4-letter combination (eg. aaaibjkes, aaailoiersaaageehikjaaa, aaailoiuwegoiglkjaaaike etc.)
print the results of each search in a separate txt file named with the 4-letter combination.
File aaai.txt:
aaaibjkes
aaailoiersaaageehikjaaa
aaailoiuwegoiglkjaaake
etc.
File bcsi.txt:
bcspwiopiejowih
bcsiweyoieotpwe
etc.
I'm sorry I'm a newbie. Can someone point me in the right direction, please. So far I've got only:
#I presume I will have to use regex at some point
import re
file1 = open('file_a.txt', 'r').readlines()
file2 = open('file_b.txt', 'r').readlines()
#Should I look into findall()?
I hope this helps:
file1 = open('file_a.txt', 'r')
file2 = open('file_b.txt', 'r')
# get every item in your second file into a list
mylist = file2.readlines()
file2.close()
# read each line in the first file
for line in file1:
    searchStr = line.strip()
    if not searchStr:
        continue
    # find the lines in your second file that start with this string
    exists = [s for s in mylist if s.startswith(searchStr)]
    if exists:
        # if there are matches, create a file named after the search string
        fileNew = open(searchStr + '.txt', 'w')
        for match in exists:
            fileNew.write(match)
        fileNew.close()
file1.close()
What you can do is to open both files and run both files down line by line using for loops.
You can have two for loops, the first one reading file_a.txt as you will be reading through it only once. The second will read through file_b.txt and look for the string at the start.
To do so, you will have to use .find() to search for the string. Since it is at the start, the value should be 0.
file_a = open("file_a.txt", "r")
file_b = open("file_b.txt", "r")
for a_line in file_a:
    # This result value will be written into your new file
    result = ""
    # This is what we will search with
    search_val = a_line.strip("\n")
    print("---- Using " + search_val + " from file_a to search. ----")
    for b_line in file_b:
        print("Searching file_b using " + b_line.strip("\n"))
        if b_line.strip("\n").find(search_val) == 0:
            result += (b_line)
    print("---- Search ended ----")
    # Set the read pointer to the start of the file again
    file_b.seek(0, 0)
    if result:
        # Write the contents of "result" into a file with the name of "search_val"
        with open(search_val + ".txt", "a") as f:
            f.write(result)
file_a.close()
file_b.close()
Test Cases:
I am using the test cases in your question:
file_a.txt
aaaa
bcsg
aacd
gdee
aadw
hwer
file_b.txt
aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
The program produces an output file bcsg.txt as it is supposed to with bcsgiweyoieotpwe inside.
Try this:
f1 = open("a.txt","r").readlines()
f2 = open("b.txt","r").readlines()
file1 = [word.replace("\n","") for word in f1]
file2 = [word.replace("\n","") for word in f2]
data = []
data_dict ={}
for short_word in file1:
    data += ([[short_word,w] for w in file2 if w.startswith(short_word)])
for single_data in data:
    if single_data[0] in data_dict:
        data_dict[single_data[0]].append(single_data[1])
    else:
        data_dict[single_data[0]]=[single_data[1]]
for key,val in data_dict.items():
    open(key+".txt","w").writelines("\n".join(val))
    print(key + ".txt created")
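The answers above rescan file_b.txt once for every line of file_a.txt. If every search string really is exactly four letters long, as in the question, a single pass is enough: take the first four characters of each candidate and look them up in a set. This is only a sketch under that assumption; `group_by_prefix` and its `prefix_len` parameter are hypothetical names.

```python
def group_by_prefix(prefixes, candidates, prefix_len=4):
    """Map each wanted prefix to the candidate strings that start with it.

    Assumes every prefix has exactly prefix_len characters.
    """
    wanted = {p.strip() for p in prefixes}
    wanted.discard("")  # ignore blank lines in the prefix list
    groups = {}
    for candidate in candidates:
        candidate = candidate.strip()
        key = candidate[:prefix_len]
        if key in wanted:
            groups.setdefault(key, []).append(candidate)
    return groups
```

Each key of the returned dictionary can then be written to its own file, for example `key + ".txt"`, with one matching string per line.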
I have a code to go through text files in a folder and look for specific word matches and count those. For example in file 1.txt I have word 'one' mentioned two times. So, my output should be:
1.txt | 2
print >> out, paper + "|" + str(hit_count)
It does not print anything. Maybe str(hit_count) is not the right variable to print?
Any advice? Thanks.
for word in text:
    if re.match("(.*)(one|two)(.*)", word):
        hit_count = hit_count + 1
print >> out, paper + "|" + str(hit_count)
If I understand what you are trying to do, you don't really need a regex.
import glob
# glob.glob the directory to get a list of files - you didn't specify it,
# so the pattern below is only a placeholder
file_list = glob.glob('*.txt')
for fname in file_list:
    with open(fname,'r') as f:
        # if files are very long consider line by line
        # for line in f:
        file_content = f.read()
        count = file_content.count('one')
        print('{0} | {1}'.format(fname, count))
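Note that `str.count('one')` counts substrings, so "someone" and "money" also add to the total. If whole words are what you want (an assumption, since the question does not say), a regular expression with word boundaries is one option. The function name `count_hits` and its default word list are hypothetical.

```python
import re


def count_hits(text, words=("one", "two")):
    """Count whole-word, case-insensitive occurrences of the given words."""
    alternatives = "|".join(re.escape(word) for word in words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return len(pattern.findall(text))
```

Reading each file with `f.read()` and passing the result to `count_hits` would then replace the `file_content.count('one')` line.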
In the code below I'm opening a fileList and check for each file in the fileList.
If the name of the file corresponds with first 4 characters of each line in another text file, I extract the number which is written in the text file with line.split()[1] and then assign the int of this string to d. Afterwards I will use this d to divide the counter.
Here's a part of my function :
fp=open('yearTerm.txt' , 'r') #open the text file
def parsing():
    fileList = pathFilesList()
    for f in fileList:
        date_stamp = f[15:-4]
        #problem is here that this for , finds d for first file and use it for all
        for line in fp :
            if date_stamp.startswith(line[:4]) :
                d = int(line.split()[1])
                print d
        print "Processing file: " + str(f)
        fileWordList = []
        fileWordSet = set()
        # One word per line, strip space. No empty lines.
        fw = open(f, 'r')
        fileWords = Counter(w for w in fw.read().split())
        # For each unique word, count occurance and store in dict.
        for stemWord, stemFreq in fileWords.items():
            Freq= stemFreq / d
            if stemWord not in wordDict:
                wordDict[stemWord] = [(date_stamp, Freq)]
            else:
                wordDict[stemWord].append((date_stamp, Freq))
This runs, but it gives me the wrong output: the for loop that finds d only runs for the first file, whereas I want it to run for every file, since each file has a different d. I don't know how to change this loop, or what else to use, to get the right d for each file.
I appreciate any advice.
I don't quite understand what you are trying to do, but if you want to do some processing for each "good" line in fp, you should move corresponding code under that if:
def parsing():
    fileList = pathFilesList()
    for f in fileList:
        date_stamp = f[15:-4]
        #problem is here that this for , finds d for first file and use it for all
        for line in fp :
            if date_stamp.startswith(line[:4]) :
                d = int(line.split()[1])
                print d
                print "Processing file: " + str(f)
                fileWordList = []
                fileWordSet = set()
                ...
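A likely cause of the behaviour described in the question is that `fp` is a file object, and a file object is an iterator: once the first `for line in fp` has reached the end, later loops over it yield nothing, so d keeps its old value. Calling `fp.seek(0)` before each inner loop is one fix. Another is to read yearTerm.txt once into a dictionary and look d up per file, as in the sketch below. The names `load_divisors` and `divisor_for` are hypothetical, and the line format (a four-character key followed by a number) is inferred from the question's code.

```python
def load_divisors(lines):
    """Map the first four characters of each line to its integer second field."""
    divisors = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            divisors[line[:4]] = int(parts[1])
    return divisors


def divisor_for(date_stamp, divisors):
    """Return the divisor whose key matches the start of date_stamp, or None."""
    return divisors.get(date_stamp[:4])
```

In `parsing()`, the dictionary would be built once before the loop over `fileList` (for example from `open('yearTerm.txt')`), and `d = divisor_for(date_stamp, divisors)` would replace the inner `for line in fp` loop.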