Docx - update and create different versions?

I would like to create several Word documents from a template and an Excel input file.
I read the xlsx file, change the content of the template, and want to save several docx files. The following code works fine for the first document (everything gets replaced and saved as expected), but all following documents have the same content as the first one. I tried to reset the working document for every row in the Excel sheet with
docWork = doc
But it seems that this re-initialization does not work.
This is the full code I am using:
from docx import Document
import os, sys
import xlwings as xw
import time
from datetime import datetime

if __name__ == '__main__':
    print(f"Start Program V8...")
    SAVE_INTERVAL = 5
    WAIT = 3
    FN = "dataCreateCover.xlsx"
    path = os.path.abspath(os.path.dirname(sys.argv[0]))
    fn = os.path.join(path, FN)
    print(f"Read {fn}...")
    wb = xw.Book (fn)
    ws = wb.sheets[0]
    inpData = ws.range ("A2:Z5000").value
    inpData = [x for x in inpData if x[0] != None]
    tday = str(datetime.today().date())
    print(f"Read DOCX-Files...")
    FNDOC = "template.docx"
    fnDoc = os.path.join(path, FNDOC)
    print(f"Path for DOCX: {fnDoc}...")
    doc = Document(fnDoc)
    for elem in inpData:
        dictWords = {}
        # intended to reset the working document for each row
        docWork = doc
        elem = [x for x in elem if x != None]
        for idxElem, valElem in enumerate(elem):
            dictWords[f"[section{idxElem + 1}]"] = valElem
        for idx,para in enumerate(docWork.paragraphs):
            for k,v in dictWords.items():
                if k in para.text:
                    inline = para.runs
                    for item in inline:
                        if k in item.text:
                            item.text = item.text.replace(k, v)
                            print(f"Replaced {k} with {v}...")
                            break
        docFN = f"{tday}_{elem[1]}.docx"
        docWork.save(docFN)
    print(f"Document <{docFN}> created - pls press <enter> to close the window...")
How can I use the docx-template and write different word-docs as output?
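
One likely explanation, offered as a sketch rather than a confirmed diagnosis: docWork = doc does not copy anything, it only gives the same Document object a second name. After the first row the placeholders are gone from that single object, so later rows find nothing left to replace. The example below uses a hypothetical stand-in class (FakeDocument, which is not part of python-docx) to show the pattern of working on a fresh copy for every row. With python-docx the equivalent would presumably be to call Document(fnDoc) again inside the loop; copy.deepcopy(doc) is a common alternative, but treat that as an assumption to verify with your version.

```python
import copy


class FakeDocument:
    """Hypothetical stand-in for a loaded template; not part of python-docx."""

    def __init__(self, paragraphs):
        self.paragraphs = list(paragraphs)


def fill(template, replacements):
    """Return a filled-in copy and leave the template untouched."""
    work = copy.deepcopy(template)  # independent copy for this output file
    for i, text in enumerate(work.paragraphs):
        for key, value in replacements.items():
            text = text.replace(key, str(value))
        work.paragraphs[i] = text
    return work


template = FakeDocument(["Dear [section1]", "Ref: [section2]"])
rows = [["Alice", "A-1"], ["Bob", "B-2"]]

results = []
for row in rows:
    # Same placeholder naming as in the question: [section1], [section2], ...
    words = {f"[section{n + 1}]": v for n, v in enumerate(row)}
    results.append(fill(template, words).paragraphs)

print(results)
```

Because every row starts from an unmodified copy, the placeholders are still present for the second and later rows.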

Related

Loop an existing script over multiple PDF files in a directory

I have a script which runs on a single input PDF file. Is there a way to run this script for multiple PDF files in a directory?
The snippet below is where the single input PDF file is set:
# Select the Master PDF Path. Located in "INPUT" folder
masterPDF_path = r"C:\Users\rohitpandey\Downloads\OneDrive_1_1-25-2023\CLAIMS Analysis\Format 2(Wire)"
master_pdf_document = 'Payment# 79724.pdf'
The complete script that runs on a single PDF file is as follows:
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader
import fitz
from datetime import datetime
import os
# Select the Master PDF Path. Located in "INPUT" folder
masterPDF_path = r"C:\Users\rohitpandey\Downloads\OneDrive_1_1-25-2023\CLAIMS Analysis\Format 2(Wire)"
master_pdf_document = 'Payment# 79724.pdf'
os.chdir(masterPDF_path)
# Choose the Path of Where the Doc Split Invoices should go. Located in "OUTPUT" folder
docSplit_dumpPath = r"C:\Users\rohitpandey\Downloads\OneDrive_1_1-25-2023\New folder"
#=========================================================================================#
#===================================== GET NUMBER OF PAGES ===============================#
#=========================================================================================#
String1 = "WIRE TRANSFER/ACH RECORD VOUCHER"
page_range = {}
pdfstarter = 0
doc = fitz.open(masterPDF_path+ "\\" + master_pdf_document)
docPageCount = doc.page_count
#================= PARSE PDF INTO THE DICTIONARY - LONG COMPUTE TIME ======================#
for i in range(0, docPageCount):
    pageText = doc.load_page(i)
    totalpage = i + 1
    pageiText = pageText.get_text('text')
    if String1 in pageiText:
        page_range.update({pdfstarter:totalpage})
        pdfstarter = totalpage
    #print("Current Page: ", i, " out of ", docPageCount)
#================= PARSE PDF INTO THE DICTIONARY - LONG COMPUTE TIME ======================#
invoiceList = []
for i in range(0,docPageCount):
    pageText = doc.load_page(i)
    pageiText = pageText.get_text('text')
    if String1 in pageiText:
        pageiText = pageiText.split("\n")
        test_list = pageiText
        # Checking if exists in list
        # (inner variable renamed from i to line so it no longer shadows the page index)
        for line in test_list:
            if(line == String1):
                invoice = "PAYMENT_WIRE_BANK STATEMENT_STEP_1_" + master_pdf_document
                #print(invoice)
                invoiceList.append(invoice)
#========================================= SETUP ==========================================#
### SPLITING INTO n
n = int(len(invoiceList))
### CREATING FOUR LIST OF Invoice LIST
fourSplit_invoiceList = [invoiceList[i:i + n] for i in range(0, len(invoiceList), n)]
### CONVERTING DIC TO LIST CONTAINING TUPLES
page_rangeList = [(k,v) for k, v in page_range.items()]
### CREATING FOUR LIST OF PAGE RANGE
fourSplit_pageRange = [page_rangeList[i:i + n] for i in range(0, len(page_rangeList), n)]
TotalNumberOfDocs = len(fourSplit_invoiceList[0])
#=========================================================================================#
#=========================================================================================#
#==================================== CREATE PDFs ========================================#
#=========================================================================================#
openpdf = PyPDF2.PdfFileReader(masterPDF_path + "\\" + master_pdf_document)
for i in range(len(fourSplit_invoiceList[0])):
    page_numberstart = fourSplit_pageRange[0][i][0]
    page_numberend = fourSplit_pageRange[0][i][1]
    outputfile = fourSplit_invoiceList[0][i]
    outputfile = os.path.join(docSplit_dumpPath, outputfile)
    try:
        assert page_numberstart < openpdf.numPages
        pdf_writer1 = PdfFileWriter()
        for page in range(page_numberstart, page_numberend):
            pdf_writer1.addPage(openpdf.getPage(page))
        with open("{}".format(outputfile), 'wb') as file0:
            pdf_writer1.write(file0)
    except AssertionError as e:
        print("Error: The PDF you are cutting has less pages than you want to cut!")

If you have a list of file names you can loop over them:
files = ['Payment# 1.pdf', 'Payment# 2.pdf']
for file in files:
    master_pdf_document = file
Or, if you want to loop over your payment numbers and the 'Payment' string remains unchanged:
payment_numbers = [1,2]
for payment_number in payment_numbers:
    master_pdf_document = 'Payment# '+str(payment_number)+'.pdf'
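
If the file names are not known in advance, the list can also be built from the directory itself. The following is a minimal sketch using only the standard library; find_pdfs and process_pdf are hypothetical names, and process_pdf merely stands in for the single-file script from the question wrapped in a function.

```python
from pathlib import Path
import tempfile


def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def process_pdf(pdf_path):
    """Hypothetical placeholder for the single-file splitting logic."""
    return f"processed {pdf_path.name}"


# Demo with a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Payment# 2.pdf", "Payment# 1.pdf", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    handled = [process_pdf(p) for p in find_pdfs(tmp)]

print(handled)
```

In the real script, the body of process_pdf would contain everything that currently depends on master_pdf_document, so each file gets its own page_range and invoiceList.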

How to extract radiobutton / checkbox information with python from a pdf-file?

I would like to get the radio button / checkbox information from a PDF document.
I had a look at pdfplumber and PyPDF2, but was not able to find a solution with these modules.
I can parse the text using the code below, but for the radio buttons I only get the text and no information about which button (or checkbox) is selected.
import pdfplumber
import os
import sys
if __name__ == '__main__':
    path = os.path.abspath(os.path.dirname(sys.argv[0]))
    fn = os.path.join(path, "input.pdf")
    pdf = pdfplumber.open(fn)
    page = pdf.pages[0]
    text = page.extract_text()
I have also uploaded an example file here:
https://easyupload.io/8y8k2v
Is there any way to get this information from the pdf-file using python?

I think I found a solution using pdfplumber
(probably not elegant, but I can check whether the radio buttons are selected or not).
Generally:
I read all chars and all curves for all pages
then I sort all elements by x and y (to get the chars and elements in the same order as in the PDF)
then I concatenate the chars and add a separator when the distance between two chars is larger than within a word
I check the pts information of the curves to find out whether the radio button is selected or not
the final lines and the yes/no information are stored in a list, line by line, for further processing
import pdfplumber
import os
import sys

# path was not defined in the original snippet; taken from the question's code
path = os.path.abspath(os.path.dirname(sys.argv[0]))
fn = os.path.join(path, "input.pdf")
pdf = pdfplumber.open(fn)
finalContent = []
for idx,page in enumerate(pdf.pages, start=1):
    print(f"Reading page {idx}")
    contList = []
    for e in page.chars:
        tmpRow = ["char", e["text"], e["x0"], e["y0"]]
        contList.append(tmpRow)
    for e in page.curves:
        tmpRow = ["curve", e["pts"], e["x0"], e["y0"]]
        contList.append(tmpRow)
    # two stable sorts: result is ordered by y (top first), then by x
    contList.sort(key=lambda x: x[2])
    contList.sort(key=lambda x: x[3], reverse=True)
    workContent = []
    workText = ""
    workDistCharX = False
    for e in contList:
        if e[0] == "char":
            if workDistCharX != False and \
               (e[2] - workDistCharX > 20 or e[3] - workDistCharY < -2):
                workText += " / "
            workText += e[1]
            workDistCharX = e[2]
            workDistCharY = e[3]
            continue
        if e[0] == "curve":
            if workText != "":
                workContent.append(workText)
                workText = ""
            # threshold specific to the example file
            if e[1][0][0] < 100:
                tmpVal = "SELECT-YES"
            else:
                tmpVal = "SELECT-NO"
            workContent.append(f"CURVE {tmpVal}, None, None")
    finalContent.extend(workContent)
    workContent = "\n".join(workContent)
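
The ordering and gap-detection steps from the description can be tried out without a PDF. This is only a sketch on synthetic data: Glyph and join_glyphs are hypothetical names, and the thresholds (20 and 2) are simply copied from the code above, so they may need tuning for other files.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    """Minimal stand-in for one pdfplumber char: its text and lower-left corner."""
    text: str
    x0: float
    y0: float


def join_glyphs(glyphs, gap=20.0, line_drop=2.0, sep=" / "):
    """Order glyphs top-to-bottom, left-to-right and join them into one string.

    A separator is inserted when the horizontal gap to the previous glyph
    exceeds `gap` or when the glyph sits more than `line_drop` lower.
    """
    ordered = sorted(glyphs, key=lambda g: (-g.y0, g.x0))
    parts = []
    prev = None
    for g in ordered:
        if prev is not None and (
            g.x0 - prev.x0 > gap or g.y0 - prev.y0 < -line_drop
        ):
            parts.append(sep)
        parts.append(g.text)
        prev = g
    return "".join(parts)


sample = [
    Glyph("b", 16, 700),
    Glyph("a", 10, 700),
    Glyph("c", 10, 680),
    Glyph("d", 60, 680),
]
result = join_glyphs(sample)
print(result)
```

Sorting with a single composite key gives the same order as the two consecutive stable sorts used in the answer.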

Why does my python script with sleep in infinite loop stop running?

I'm working on a Python script to transfer data from an .xlsx file to an HTML file: I read/parse the Excel file with pandas and use BeautifulSoup to edit the HTML (reading the paths to these two files from two .txt files). This, on its own, works. However, the script has to run constantly, so everything is called in an infinite while loop that repeats every 15 minutes, printing messages to the console each time.
My problem is the following: for some reason, after a random number of loops, the code just doesn't run anymore, and by that I mean no text on the console and no changes in the HTML file. When this happens, I have to rerun it to get it working again.
Here is the main function:
def mainFunction():
    if getattr(sys, 'frozen', False):
        application_path = os.path.dirname(sys.executable)
    elif __file__:
        application_path = os.path.dirname(__file__)
    excelFiles = open(str(application_path) +"\\pathsToExcels.txt")
    htmlFiles = open(str(application_path) +"\\pathsToHTMLs.txt")
    sheetFiles = open(str(application_path) +"\\sheetNames.txt")
    print("Reading file paths ...")
    linesEx = excelFiles.readlines()
    linesHtml = htmlFiles.readlines()
    linesSheet = sheetFiles.readlines()
    print("Begining transfer")
    for i in range (len(linesEx)):
        excel = linesEx[i].strip()
        html = linesHtml[i].strip()
        sheet = linesSheet[i].strip()
        print("Transfering data for " + sheet)
        updater = UpdateHtml(excel, sheet, str(application_path) + "\\pageTemplate.html", html)
        updater.refreshTable()
        updater.addData()
        updater.saveHtml()
    print("Transfer done")
    excelFiles.close()
    htmlFiles.close()
    sheetFiles.close()
UpdateHtml is the one actually responsible for the data transfer.
The "__main__" which also contains the while loop:
if __name__ == "__main__":
    while(True):
        print("Update at " + str(datetime.now()))
        mainFunction()
        print("Next update in 15 minutes\n")
        time.sleep(900)
And finally, the batch code that launches this
python "C:\Users\Me\PythonScripts\excelToHtmlTransfer.py"
pause
From what I've noticed through trials, this situation doesn't occur when sleep is set to under 5 minutes (still happens for 5 minutes) or if it's omitted altogether.
Does anyone have any clue why this might be happening? Or any alternatives to sleep in this context?
EDIT: UpdateHtml:
import pandas as pd
from bs4 import BeautifulSoup

class UpdateHtml:
    def __init__(self, pathToExcel, sheetName, pathToHtml, pathToFinalHtml):
        with open(pathToHtml, "r") as htmlFile:
            self.soup = BeautifulSoup(htmlFile.read(), features="html.parser")
        self.df = pd.read_excel (pathToExcel, sheet_name=sheetName)
        self.html = pathToFinalHtml
        self.sheet = sheetName

    def refreshTable(self):
        #deletes the inner html of all table cells
        for i in range(0, 9):
            td = self.soup.find(id = 'ok' + str(i))
            td.string = ''
            td = self.soup.find(id = 'acc' + str(i))
            td.string = ''
            td = self.soup.find(id = 'nok' + str(i))
            td.string = ''
            td = self.soup.find(id = 'problem' + str(i))
            td.string = ''

    def prepareData(self):
        #changes the names of columns according to their data
        counter = 0
        column_names = {}
        for column in self.df.columns:
            if 'OK' == str(self.df[column].values[6]):
                column_names[self.df.columns[counter]] = 'ok'
            elif 'Acumulate' == str(self.df[column].values[6]):
                column_names[self.df.columns[counter]] = 'acc'
            elif 'NOK' == str(self.df[column].values[6]):
                column_names[self.df.columns[counter]] = 'nok'
            elif 'Problem Description' == str(self.df[column].values[7]):
                column_names[self.df.columns[counter]] = 'prob'
            counter += 1
        self.df.rename(columns = column_names, inplace=True)

    def saveHtml(self):
        with open(self.html, "w") as htmlFile:
            htmlFile.write(self.soup.prettify())

    def addData(self):
        groupCounter = 0
        index = 0
        self.prepareData()
        for i in range(8, 40):
            #Check if we have a valid value in the ok column
            if pd.notna(self.df['ok'].values[i]) and str(self.df['ok'].values[i]) != "0":
                td = self.soup.find(id = 'ok' + str(index))
                td.string = str(self.df['ok'].values[i])
            #Check if we have a valid value in the accumulate column
            if pd.notna(self.df['acc'].values[i]) and str(self.df['acc'].values[i]) != "0":
                td = self.soup.find(id = 'acc' + str(index))
                td.string = str(self.df['acc'].values[i])
            #Check if we have a valid value in the nok column
            if pd.notna(self.df['nok'].values[i]) and str(self.df['nok'].values[i]) != "0":
                td = self.soup.find(id = 'nok' + str(index))
                td.string = str(self.df['nok'].values[i])
            #Check if we have a valid value in the problem column
            if pd.notna(self.df['prob'].values[i]):
                td = self.soup.find(id = 'problem' + str(index))
                td.string = str(self.df['prob'].values[i])
            if groupCounter == 3:
                index += 1
                groupCounter = 0
            else:
                groupCounter += 1
The Excel file I'm working with is a bit strange, which is why I perform so many (seemingly) redundant operations. Still, it has to remain in its current form.
The main thing is that each 'row' that contains data is actually made up of 4 regular rows, hence the need for groupCounter.

Found a workaround for this problem. Basically, what I did was move the loop into the batch script, like so:
:whileLoop
python "C:\Users\Me\PythonScripts\excelToHtmlTransfer.py"
timeout /t 900 /nobreak
goto :whileLoop
After leaving it to run for a few hours, the problem didn't occur anymore. Unfortunately, I still don't know what caused it.
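
Since the cause was never established in this thread, the following is only a diagnostic sketch, not a fix: it keeps the loop in Python but logs a timestamp and the full traceback whenever an update fails, so that a stall can at least be told apart from a crash. run_periodically and flaky_update are hypothetical names; max_runs and the injectable sleep exist only so the demo finishes immediately.

```python
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)


def run_periodically(task, interval_s, max_runs=None, sleep=time.sleep):
    """Call `task` every `interval_s` seconds and log failures with a traceback.

    Returns (runs, failures) once `max_runs` is reached; with max_runs=None
    the loop never ends, like the `while True` in the question.
    """
    runs = 0
    failures = 0
    while max_runs is None or runs < max_runs:
        try:
            task()
        except Exception:
            failures += 1
            logging.exception("update failed, will retry on the next cycle")
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_s)
    return runs, failures


def flaky_update():
    """Simulated update that always fails, to exercise the error path."""
    raise RuntimeError("simulated failure")


summary = run_periodically(
    flaky_update, interval_s=900, max_runs=3, sleep=lambda s: None
)
print(summary)
```

In the real script, task would be mainFunction and the defaults for max_runs and sleep would be left untouched.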

Pyinstaller executable closes instantly

I need to create a one file portable executable. I wrote a python program for word suggestions and I wanted to build a one file executable for the program. I ran pyinstaller --F prog.py
However, the exe file built from this does not run and instead just shows a blank screen. My program:
import pandas as pd
import csv
import timeit
import copy

dictionary = {}
with open('EnglishDictionary.csv', mode='r') as infile:
    reader = csv.reader(infile)
    for rows in reader:
        dictionary[rows[0]] = int(rows[1])
dictionary = dict(sorted(dictionary.items(), key=lambda x: x[1], reverse=True))
a = sorted(dictionary, key=dictionary.get, reverse=True)

def recommend(word):
    global a
    count = 0
    num = len(word)
    res = []
    a = [i for i in a if i[:num]==word]
    if len(a)>5:
        return a[:5]
    elif len(a)==0:
        return ['No match found!']
    else:
        return a

word = ''
char = ''
while(1):
    char = input("Enter character: ")[0]
    if char=='#':
        break
    start_time = timeit.default_timer()
    word+=char
    x = recommend(word)
    time = str(int((timeit.default_timer() - start_time)*(10**6)))+' microseconds'
    for i in x:
        if i!=x[-1]:
            i+=','
        print("{0: <10}".format(i),end=' ')
    print(time)
    if x[0]=='No match found!':
        print("Exiting")
        break
I'm new to building executables from Python files, so any other way of building that works would also be very helpful!
Link to CSV file: https://drive.google.com/file/d/12UJl_TjV_JlMVS9XCGLEXfboPXO4lMi3/view?usp=sharing
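
The thread has no answer, so what follows is an assumption to rule out rather than a diagnosis: the script opens 'EnglishDictionary.csv' with a relative path, and if the file is not found at run time the program raises an exception and the console window closes at once. Starting the exe from an already open command prompt should show the traceback. The sketch below builds the path relative to the program location instead; resource_path is a hypothetical helper, and sys._MEIPASS is the attribute PyInstaller sets in one-file builds (the CSV would still have to be bundled, for instance with PyInstaller's --add-data option, or placed where the helper looks).

```python
import os
import sys


def resource_path(filename):
    """Build a path to a data file that works both from source and when frozen."""
    if getattr(sys, "frozen", False):
        # One-file PyInstaller builds unpack bundled files to sys._MEIPASS;
        # fall back to the executable's folder if that attribute is missing.
        base = getattr(sys, "_MEIPASS", os.path.dirname(sys.executable))
    else:
        base = os.path.dirname(os.path.abspath(sys.argv[0]))
    return os.path.join(base, filename)


csv_path = resource_path("EnglishDictionary.csv")
if not os.path.exists(csv_path):
    # A visible message is easier to debug than a window that closes instantly.
    print(f"Data file not found: {csv_path}")
```

The script would then call open(csv_path, mode='r') instead of using the bare file name.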

List index out of range error in breaking while loop in python

Hi, I am new to Python and struggling to find my way. Currently I am working on a task that appends Excel files, and here is my sample code. I am getting a list index out of range error because, as far as I can tell, the while loop is not breaking at the end of each Excel file. Any help would be appreciated. Thanks:
import xlrd
import glob
import os
import openpyxl
import csv
from xlrd import open_workbook
from os import listdir
row = {}
basedir = '../files/'
files = listdir('../files')
sheets = [filename for filename in files if filename.endswith("xlsx")]
header_is_written = False
for filename in sheets:
    print('Parsing {0}{1}\r'.format(basedir,filename))
    worksheet = open_workbook(basedir+filename).sheet_by_index(0)
    print (worksheet.cell_value(5,6))
    counter = 0
    while True:
        row['plan name'] = worksheet.cell_value(1+counter,1).strip()
        row_values = worksheet.row_slice(counter+1,start_colx=0, end_colx=30)
        row['Dealer'] = int(row_values[0].value)
        row['Name'] = str(row_values[1].value)
        row['City'] = str(row_values[2].value)
        row['State'] = str(row_values[3].value)
        row['Zip Code'] = int(row_values[4].value)
        row['Region'] = str(row_values[5].value)
        row['AOM'] = str(row_values[6].value)
        row['FTS Short Name'] = str(row_values[7].value)
        row['Overall Score'] = float(row_values[8].value)
        row['Overall Rank'] = int(row_values[9].value)
        row['Count of Ros'] = int(row_values[10].value)
        row['Count of PTSS Cases'] = int(row_values[11].value)
        row['% of PTSS cases'] = float(row_values[12].value)
        row['Rank of Cases'] = int(row_values[13].value)
        row['% of Not Prepared'] = float(row_values[14].value)
        row['Rank of Not Prepared'] = int(row_values[15].value)
        row['FFVt Pre Qrt'] = float(row_values[16].value)
        row['Rank of FFVt'] = int(row_values[17].value)
        row['CSI Pre Qrt'] = int(row_values[18].value)
        row['Rank of CSI'] = int(row_values[19].value)
        row['FFVC Pre Qrt'] = float(row_values[20].value)
        row['Rank of FFVc'] = int(row_values[21].value)
        row['OnSite'] = str(row_values[22].value)
        row['% of Onsite'] = str(row_values[23].value)
        row['Not Prepared'] = int(row_values[24].value)
        row['Open'] = str(row_values[25].value)
        row['Cost per Vin Pre Qrt'] = float(row_values[26].value)
        row['Damages per Visit Pre Qrt'] = float(row_values[27].value)
        row['Claim Sub time pre Qrt'] = str(row_values[28].value)
        row['Warranty Index Pre Qrt'] = str(row_values[29].value)
        counter += 1
        if row['plan name'] is None:
            break
        with open('table.csv', 'a',newline='') as f:
            w=csv.DictWriter(f, row.keys())
            if header_is_written is False:
                w.writeheader()
                header_is_written = True
            w.writerow(row)

Use a for loop in place of while True:
row['plan name'] = worksheet.cell_value(1 + counter, 1).strip()
row_values = worksheet.row_slice(counter + 1, start_colx=0, end_colx=30)
for values in row_values:
    row['Dealer'] = int(values.value)
    row['Name'] = str(values.value)
    ....
This is because while True runs the loop indefinitely (or until it reaches a break statement inside the loop).

A while True loop basically means: execute the following code block forever, unless a break or sys.exit statement gets you out.
So in your case, you need to terminate once the rows of the Excel file to append are exhausted. You have two options: check whether there are more rows to append, and break if there are none.
A more suitable approach when writing a file is a for loop. This kind of loop terminates when it is exhausted.
Also, you should consider gathering the content of the Excel file in one operation and saving it to a variable. Then, once you have it, iterate over it and append it to the CSV.
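
A minimal sketch of that suggestion, under stated assumptions: sheet_rows is a hypothetical in-memory stand-in for the worksheet (with xlrd, the loop bound would be worksheet.nrows), only three of the thirty columns are shown, and the output goes to an in-memory buffer instead of table.csv so the example runs on its own.

```python
import csv
import io

# Stand-in for the worksheet content; the first entry is the header row.
sheet_rows = [
    ["Dealer", "Name", "City"],
    [101.0, "North Motors", "Austin"],
    [102.0, "Lakeside Auto", "Denver"],
]

fieldnames = ["Dealer", "Name", "City"]

# Gather everything first. The for loop is bounded by the number of rows,
# so it cannot run past the end of the sheet the way `while True` can.
records = []
for row_idx in range(1, len(sheet_rows)):
    values = sheet_rows[row_idx]
    records.append({
        "Dealer": int(values[0]),
        "Name": str(values[1]),
        "City": str(values[2]),
    })

# Then write once: header first, followed by all collected rows.
buffer = io.StringIO()  # a real script would open 'table.csv' here
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

Collecting the records before writing also means the CSV file is opened once rather than once per row.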
