Creating a list with words counted from multiple .docx files - python

I'm trying to do a project where I automate my invoices for translation jobs. Basically, the script reads multiple .docx files in a folder, counts the words in each file, then writes the filenames and the corresponding word counts into an Excel file.
I've written a word-counter script, but I can't figure out how to collect the counts into a list so I can later extract values from it for my Excel file and create an invoice.
Here is my code:
import docx
import os
import re

# Folder to work with
folder = r'D:/Tulk_1'
files = os.listdir(folder)

# Lists with the names of files and word counts for each file
list_files = []
list_words = []

for file in files:
    # Getting the absolute location
    location = folder + '/' + file
    # Adding filenames to the list
    list_files.append(file)
    # Word counter
    document = docx.Document(location)
    newparatextlist = []
    for paratext in document.paragraphs:
        newparatextlist.append(paratext.text)
    # Printing file names
    print(file)
    # Printing word counts for each file
    print(len(re.findall(r'\w+', '\n'.join(newparatextlist))))
Output:
cold calls.docx
2950
Kristības.docx
1068
Tulkojums starpniecības līgums.docx
946
Tulkojums_PL_ULIHA_39_41_4 (1).docx
788
Unfortunately I copied the counter part from the web and the last line is too complicated for me:
print(len(re.findall(r'\w+', '\n'.join(newparatextlist))))
So I don't know how to extract the results out of it into a list.
When I try to store the last line into a variable like this:
x = len(re.findall(r'\w+', '\n'.join(newparatextlist)))
the output only contains the word count for the last file:
cold calls.docx
Kristības.docx
Tulkojums starpniecības līgums.docx
Tulkojums_PL_ULIHA_39_41_4 (1).docx
788
Maybe you could help me to break the last line into smaller steps? Or perhaps there are easier solutions to my task?
EDIT:
The desired output for:
print(list_words)
should be:
[2950, 1068, 946, 788]
Similar to what I already have for the file names:
print(list_files)
output:
['cold calls.docx', 'Kristības.docx', 'Tulkojums starpniecības līgums.docx', 'Tulkojums_PL_ULIHA_39_41_4 (1).docx']
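For reference, the one-liner from the question can be broken into named steps, and the count appended to list_words inside the loop (a minimal sketch based on the code above; the intermediate variable names are illustrative):
# inside the for-file loop, after newparatextlist is filled
full_text = '\n'.join(newparatextlist)   # join all paragraphs into one string
words = re.findall(r'\w+', full_text)    # every run of word characters is one match
count = len(words)                       # number of matches = word count
list_words.append(count)                 # collect it alongside list_files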

Related

Creating and then modifying pdf file in python

I am writing some code that merges PDFs from their file paths and then writes some text on each page of the merged document. My problem is this: I can do both things separately, merge PDFs and write text to a PDF, but I can't seem to do it all in one go.
My code is below. The PDFs are merged from file paths contained in an Excel workbook, then saved as a single PDF with a file name obtained from the workbook (this changes depending on which PDFs are merged, so it needs to be dynamic), and I am then attempting to write text (a question number) to this merged PDF.
I keep getting the error "cannot save with zero pages" and I'm not sure why, since I can save the merged file, and I can write the desired text to any other PDF with the function I made if I pass the document file path into it. Any ideas on how I can merge these PDFs into a single file, then edit it with the inserted text and save it under the chosen file name from the Excel doc? Hopefully you get what I mean!
import fitz  # PyMuPDF
import xlwings as xw
from pypdf import PdfMerger

def insert_qu_numbers(document):
    qu_numbers = fitz.open(document)
    counter = 0
    for page in qu_numbers:
        page.clean_contents()
        counter += 1
        text = f"Question {counter}"
        text_length = fitz.get_text_length(text, fontname="times-roman")
        print(text_length)
        rect_x0 = 70
        rect_y0 = 50
        rect_x1 = rect_x0 + text_length + 35
        rect_y1 = rect_y0 + 40
        rect = fitz.Rect(rect_x0, rect_y0, rect_x1, rect_y1)
        page.insert_textbox(rect, text, fontsize=16, fontname="times-roman", align=0)
    qu_numbers.write()

# opens the workbook and gets the file paths
wbxl = xw.Book('demo.xlsm')
get_links = wbxl.sheets['Sheet1'].range('C2:C5').value

# gets rid of any blank cells in the range and makes a list of all the file paths, called filenames
filenames = []
for file in get_links:
    if file is not None:
        filenames.append(file)

# gets each file path from the filenames list and appends it to the merged pdf
merged_pdf = PdfMerger()
for i in range(len(filenames)):
    merged_pdf.append(filenames[i], 'rb')

# merges the separate file paths into one pdf and names it with the output name from the given cell
output_name = wbxl.sheets['Sheet1'].range('C7').value
final = merged_pdf.write(output_name + ".pdf")
insert_qu_numbers(final)
You can use PyMuPDF for the merging and modification as well:
# filelist = list of files to merge
doc = fitz.open()  # the output to receive the merged PDFs
for file in filelist:
    src = fitz.open(file)
    doc.insert_pdf(src)  # append input file
    src.close()
for page in doc:  # now iterate through the pages of the result
    page.insert_text(...blahblah ...)  # insert text or whatever was on your mind
doc.ez_save("output.pdf")
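Tying that back to the question, the page loop could reuse the textbox logic from insert_qu_numbers. A sketch, under the assumption that the coordinates and font from the question are what you want:
import fitz  # PyMuPDF

doc = fitz.open()  # empty output document
for file in filelist:
    src = fitz.open(file)
    doc.insert_pdf(src)  # append each input file
    src.close()

# number the pages of the merged result before saving
for counter, page in enumerate(doc, start=1):
    text = f"Question {counter}"
    width = fitz.get_text_length(text, fontname="times-roman", fontsize=16)
    rect = fitz.Rect(70, 50, 70 + width + 35, 90)
    page.insert_textbox(rect, text, fontsize=16, fontname="times-roman", align=0)

doc.ez_save("output.pdf")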

os.walk() to match contents of files against words from a CSV file

I have two words I would like to search for using a CSV file; this is done with the os.walk() method to recursively look through each file within rootDir, but I'm not sure what I'm missing to complete my code. The two words I am looking for, which are in two separate files, are:
XZOXNEOXXTWX, YOEYTWOYZYNY
To start off, I created a CSV with the words to look for; I then used the os.walk() method and tried reading the text from the CSV file to output the matching contents. I have looked through a fair bit of material, but the output is not matching up with what I would like.
import os

appendData = []
mPath = r"C:\Users\test\Documents\test"
wordstoSearch = r"C:\Users\test\Documents\test\strings.csv"

for rootDir, subDir, files in os.walk(mPath, topdown=True):
    print('Root Directory:', rootDir)
    for x in files:
        with open(os.path.join(files)):
            with open('strings.csv', 'rt') as stringSearch:
                if wordstoSearch in stringSearch.read():
                    appendData.append('File: {}\nMatching Content: {}\n'.format(x, wordstoSearch))

appendData = '\n'.join(appendData)
print(appendData)
You should read the words to check once using csv at the top of the script. Then be careful how you build the filename to check - you need to join the root directory to the filename. Finally, read each target file and then go through the list of words to check.
import csv
import os

appendData = []
mPath = r"C:\Users\test\Documents\test"
wordstoSearch = r"C:\Users\test\Documents\test\strings.csv"

# get the words to check
with open(wordstoSearch) as fp:
    reader = csv.reader(fp)
    words = [cell.strip() for cell in next(reader)]

# walk the tree
for rootDir, subDir, files in os.walk(mPath, topdown=True):
    print('Root Directory:', rootDir)
    for x in files:
        with open(os.path.join(rootDir, x)) as check_fp:
            text = check_fp.read()
            for word in words:
                if word in text:
                    appendData.append(
                        'File: {}\nMatching Content: {}\n'.format(x, word))

appendData = '\n'.join(appendData)
print(appendData)

How to remove rows from a csv file when compared to a list in a txt file using Python?

I have a list of 12,000 dictionary entries (the words only, without their definitions) stored in a .txt file.
I have a complete dictionary with 62,000 entries (the words with their definitions) stored in a .csv file.
I need to compare the small list in the .txt file with the larger list in the .csv file and delete the rows containing the entries that don't appear on the smaller list. In other words, I want to purge this dictionary down to only 12,000 entries.
The .txt file has one word per line, like this:
word1
word2
word3
The .csv file is ordered like this:
ID (column 1) WORD (column 2) MEANING (column 3)
How do I accomplish this using Python?
Good answers so far. If you want to get minimalistic...
import csv

lookup = set(l.strip().lower() for l in open(path_to_file3))
map(csv.writer(open(path_to_file2, 'w')).writerow,
    (row for row in csv.reader(open(path_to_file))
     if row[1].lower() in lookup))
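Note that in Python 3 map() is lazy, so the one-liner above does nothing unless its result is consumed; an explicit loop (a sketch using the same hypothetical path variables) is safer:
import csv

with open(path_to_file3) as f3:
    lookup = set(l.strip().lower() for l in f3)

with open(path_to_file) as fin, open(path_to_file2, 'w', newline='') as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        if row[1].lower() in lookup:  # keep only words on the small list
            writer.writerow(row)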
The following will not scale well, but should work for the number of records indicated.
import csv

csv_in = csv.reader(open(path_to_file, 'r'))
csv_out = csv.writer(open(path_to_file2, 'w'))
# strip the trailing newlines so the lookup keys match the CSV cells
use_words = [word.strip() for word in open(path_to_file3, 'r').readlines()]
lookup = dict([(word, None) for word in use_words])

for line in csv_in:
    if line[1] in lookup:  # the word is in column 2
        csv_out.writerow(line)
One of the lesser-known facts about editors is that when you delete a line from a text file and save the file, most of the time the editor does this:
load the file into memory
write a temporary file with the rows you want
close the files and move the temp over the original
So you have to load your wordlist:
with open('wordlist.txt') as i:
    wordlist = set(word.strip() for word in i)  # you said the file was small
Then you open the input file:
import csv
import os

with open('input.csv') as i:
    with open('output.csv', 'w') as o:
        output = csv.writer(o)
        for line in csv.reader(i):  # iterate over the CSV line by line
            if line[1] in wordlist:  # keep the row only if the word at column 2 is on the list
                output.writerow(line)
os.rename('output.csv', 'input.csv')  # move the temp over the original
This is untested, now go do your homework and comment here if you find any bug... :-)
I would use pandas for this. The data set's not large, so you can do it in memory with no problem.
import pandas as pd

words = pd.read_csv('words.txt', header=None)  # one word per line, no header row
defs = pd.read_csv('defs.csv')

words.set_index(0, inplace=True)
defs.set_index('WORD', inplace=True)

new_defs = words.join(defs)
new_defs.to_csv('new_defs.csv')
you might need to manipulate new_defs to make it look like you want it to, but that's the gist of it.
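A slightly more explicit variant (a sketch; the column names ID, WORD, MEANING come from the question) filters the full dictionary directly and keeps the original layout:
import pandas as pd

# the word list has no header row, so name its single column explicitly
words = pd.read_csv('words.txt', header=None, names=['WORD'])
defs = pd.read_csv('defs.csv')  # columns: ID, WORD, MEANING

# keep only the rows whose WORD appears in the small list
purged = defs[defs['WORD'].isin(words['WORD'])]
purged.to_csv('new_defs.csv', index=False)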

How would I read and write from multiple files in a single directory? Python

I am writing a Python program and would like some more insight on how to approach this issue.
I am trying to read in multiple files, in order, that end with .log. From these, I hope to write specific values to a .csv file.
Within the text files, there are X/Y values that are extracted as below:
Textfile.log:
X/Y = 5
X/Y = 6
Textfile.log.2:
X/Y = 7
X/Y = 8
Desired output in the CSV file:
5
6
7
8
Here is the code I've come up with so far:
def readfile():
    import os
    i = 0
    for file in os.listdir("\mydir"):
        if file.endswith(".log"):
            return file

def main():
    import re
    import csv
    list = []
    list = readfile()
    for line in readfile():
        x = re.search(r'(?<=X/Y = )\d+', line)
        if x:
            list.append(x.group())
        else:
            break
    f = csv.write(open(output, "wb"))
    while 1:
        if (i > len(list - 1)):
            break
        else:
            f.writerow(list(i))
            i += 1

if __name__ == '__main__':
    main()
I'm confused about how to make it read the .log file, then the .log.2 file.
Is it possible to just have it automatically read all the files in 1 directory without typing them in individually?
Update: I'm using Windows 7 and Python V2.7
The simplest way to read files sequentially is to build a list and then loop over it. Something like:
for fname in list_of_files:
    with open(fname, 'r') as f:
        ...  # Do all the stuff you do to each file
This way, whatever you do to read each file will be repeated and applied to every file in list_of_files. Since lists are ordered, the files will be processed in the same order the list is sorted in.
Borrowing from #The2ndSon's answer, you can pick up the files with os.listdir(dir). This will simply list all files and directories within dir in an arbitrary order. From this you can pull out and order all of your files like this:
allFiles = os.listdir(some_dir)
logFiles = [fname for fname in allFiles if "log" in fname.split('.')]
logFiles.sort(key = lambda x: x.split('.')[-1])
logFiles[0], logFiles[-1] = logFiles[-1], logFiles[0]
The above code will work with file names like "somename.log", "somename.log.2" and so on. You can then take logFiles and plug it in as list_of_files. Note that the last line is only necessary if the first file is "somename.log" instead of "somename.log.1". If the first file has a number on the end, just exclude the last step.
Line By Line Explanation:
allFiles = os.listdir(some_dir)
This line takes all files and directories within some_dir and returns them as a list
logFiles = [fname for fname in allFiles if "log" in fname.split('.')]
Perform a list comprehension to gather all of the files with log in the name as part of the extension. "something.log.somethingelse" will be included, "log_something.somethingelse" will not.
logFiles.sort(key = lambda x: x.split('.')[-1])
Sort the list of log files in place by the last extension. x.split('.')[-1] splits the file name into a list of period delimited values and takes the last entry. If the name is "name.log.5", it will be sorted as "5". If the name is "name.log", it will be sorted as "log".
logFiles[0], logFiles[-1] = logFiles[-1], logFiles[0]
Swap the first and last entries of the list of log files. This is necessary because the sorting operation will put "name.log" as the last entry and "name.log.1" as the first.
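A more robust alternative to the swap (a sketch, not from the original answer) is to sort with a numeric key, so the bare .log file naturally comes first and the numbered ones follow in order:
import os

def log_sort_key(fname):
    # "name.log" -> 0, "name.log.2" -> 2, so the bare .log file sorts first
    tail = fname.split('.')[-1]
    return int(tail) if tail.isdigit() else 0

logFiles = sorted(
    (f for f in os.listdir(some_dir) if "log" in f.split('.')),
    key=log_sort_key,
)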
If you change the naming scheme for your log files, you can easily return a list of files that have the ".log" extension. For example, if you change the file names to Textfile1.log and Textfile2.log, you can update readfile() to be:
import os

def readfile():
    my_list = []
    for file in os.listdir("."):
        if file.endswith(".log"):
            my_list.append(file)
    return my_list
Printing my_list will give ['Textfile1.log', 'Textfile2.log']. Using the word list as a variable name is generally avoided, as it is also the name of a built-in type in Python.

how to find frequency of the keys in a dictionary across multiple text files?

I am supposed to count the frequency of all the key values of dictionary d across all the files in the document "individual-articles".
Here, the document "individual-articles" has around 20,000 txt files, with filenames 1, 2, 3, 4, ...
For example: suppose d["Britain"] = [5, 76, 289]; I must find the number of times Britain appears in the files 5.txt, 76.txt and 289.txt belonging to the document "individual-articles", and I also need to find its frequency across all the files in the same document. I need to store these values in another dictionary d2.
For the same example, d2 must contain (Britain, 26, 1200), where 26 is the frequency of the word Britain in the files 5.txt, 76.txt and 289.txt, and 1200 is its frequency across all the files.
I am a Python newbie, and I have only tried a little! Please help!
import collections
import os
import re
import sys
from collections import Counter
from glob import glob

sys.stdout = open('dictionary.txt', 'w')

def removegarbage(text):
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    return text

folderpath = 'd:/individual-articles'
counter = Counter()
filepaths = glob(os.path.join(folderpath, '*.txt'))

d2 = {}
with open('topics.txt') as f:
    d = collections.defaultdict(list)
    for line in f:
        value, *keys = line.strip().split('~')
        for key in filter(None, keys):
            d[key].append(value)

for filepath in filepaths:
    with open(filepath, 'r') as filehandle:
        lines = filehandle.read()
        words = removegarbage(lines).split()
        for k in d.keys():
            d2[k] = words.count(k)

for i in d2.items():
    print(i)
Well, I'm not exactly sure what you mean by all the files in the document "X", but I assume it's analogous to pages in a book. With this interpretation, I would do my best to store the data in the easiest way. Putting the data in an easily manipulable form adds efficiency later, because you can always add a method for producing whatever type of output you want.
Since it seems the main key you're looking at is the keyword, I would create a nested Python dictionary with this structure:
{keyword: {filename: count}}
Once it's in this form, you can do any type of manipulation on the data really easily.
To create this dict,
import os

# yields the words in the file, one at a time
def words_generator(fileobj):
    for line in fileobj:
        for word in line.split():
            yield word

word_count_dict = {}
for dirpath, dnames, fnames in os.walk("./"):
    for file in fnames:
        # join the directory path so files in subdirectories open correctly
        with open(os.path.join(dirpath, file), "r") as f:
            words = words_generator(f)
            for word in words:
                if word not in word_count_dict:
                    word_count_dict[word] = {"total": 0}
                if file not in word_count_dict[word]:
                    word_count_dict[word][file] = 0
                word_count_dict[word][file] += 1
                word_count_dict[word]["total"] += 1
This will create an easily parsable dictionary.
Want the total number of times Britain appears?
word_count_dict["Britain"]["total"]
Want the number of times Britain is in files 74.txt and 75.txt?
sum(word_count_dict["Britain"].get(file, 0) for file in ["74.txt", "75.txt"])
Want to see all files that the word Britain shows up in?
[fname for fname in word_count_dict["Britain"] if fname != "total"]
You can of course write functions that perform these operations with a simple call.
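For instance (a sketch over the dictionary built above; the helper name is made up):
def count_in_files(word_count_dict, word, files):
    # total occurrences of `word` across the given files
    per_file = word_count_dict.get(word, {})
    return sum(per_file.get(f, 0) for f in files)

print(count_in_files(word_count_dict, "Britain", ["74.txt", "75.txt"]))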
