Creating and then modifying pdf file in python

I am writing some code that merges several PDFs from their file paths and then writes some text on each page of the merged document. My problem is this: I can do both things separately (merge PDFs, and write text to a PDF), but I can't seem to do both in one go.
My code is below. The PDFs are merged from file paths stored in an Excel workbook and saved as a single PDF whose file name is also read from the workbook (the name changes depending on which PDFs are merged, so it needs to be dynamic). I then attempt to write text (a question number) onto each page of this merged PDF.
I keep getting the error "cannot save with zero pages" and I am not sure why, since I can save the merged file, and my function can write the desired text to any other PDF if I pass that document's file path into it. Any ideas on how I can merge these PDFs into a single file, then edit it with the inserted text and save it under the file name chosen in the Excel workbook?
import fitz  # PyMuPDF
import xlwings as xw
from pypdf import PdfMerger
def insert_qu_numbers(document):
    qu_numbers = fitz.open(document)
    counter = 0
    for page in qu_numbers:
        page.clean_contents()
        counter += 1
        text = f"Question {counter}"
        text_length = fitz.get_text_length(text, fontname= "times-roman")
        print(text_length)
        rect_x0 = 70
        rect_y0 = 50
        rect_x1 = rect_x0 + text_length + 35
        rect_y1 = rect_y0 + 40
        rect = fitz.Rect(rect_x0, rect_y0, rect_x1, rect_y1)
        page.insert_textbox(rect, text, fontsize = 16, fontname = "times-roman", align = 0)
    qu_numbers.write()
# opens the workbook and gets the file paths.
wbxl = xw.Book('demo.xlsm')
get_links = wbxl.sheets['Sheet1'].range('C2:C5').value
# gets rid of any blank cells in range and makes a list of all the file paths called filenames
filenames = []
for file in get_links:
    if file is not None:
        filenames.append(file)
# gets each file path from filename list and adds it to merged pdf where it will be merged
merged_pdf = PdfMerger()
for i in range(len(filenames)):
    merged_pdf.append(filenames[i], 'rb')
# merges separate file paths into one pdf and names it the output name in the given cell
output_name = wbxl.sheets['Sheet1'].range('C7').value
final = merged_pdf.write(output_name + ".pdf")
insert_qu_numbers(final)

You can use PyMuPDF for merging and modification as well:
import fitz  # PyMuPDF
# filelist = list of files to merge
doc = fitz.open()  # the output to receive the merged PDFs
for file in filelist:
    src = fitz.open(file)
    doc.insert_pdf(src)  # append input file
    src.close()
for page in doc:  # now iterate through the pages of the result
    page.insert_text(...blahblah ...)  # insert text or whatever was on your mind
doc.ez_save("output.pdf")
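As for the question's code, the likely cause of "cannot save with zero pages" is that `PdfMerger.write()` does not return the output path, so `final` is not a file name, and opening it with `fitz.open` gives a new, empty document. Below is a minimal sketch of the single-pass approach, assuming PyMuPDF is installed. The names `label_rect` and `merge_and_number` are hypothetical helpers, not part of any library; the font, size and box geometry are copied from the question.

```python
def label_rect(text_length, x0=70, y0=50, pad=35, height=40):
    """Return (x0, y0, x1, y1) for a box wide enough to hold the label."""
    return (x0, y0, x0 + text_length + pad, y0 + height)


def merge_and_number(filenames, output_path):
    """Merge the given PDFs, write 'Question N' on each page, save once."""
    import fitz  # PyMuPDF; imported here so label_rect works without it

    doc = fitz.open()  # empty output document
    for path in filenames:
        src = fitz.open(path)
        doc.insert_pdf(src)  # append all pages of this input file
        src.close()

    for number, page in enumerate(doc, start=1):
        text = f"Question {number}"
        # Measure at the same font size that is used for insertion
        width = fitz.get_text_length(text, fontname="times-roman", fontsize=16)
        rect = fitz.Rect(*label_rect(width))
        page.insert_textbox(rect, text, fontsize=16, fontname="times-roman", align=0)

    doc.ez_save(output_path)  # pass the path explicitly; nothing is returned
    doc.close()
```

With the question's variables this would be called as `merge_and_number(filenames, output_name + ".pdf")`.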

Related

Creating a list with words counted from multiple .docx files

I'm trying to do a project where I automate my invoices for translation jobs. Basically, the script reads multiple .docx files in a folder, counts the words in each file, then writes the file names and the corresponding word counts into an Excel file.
I've created a word counter script, but I can't figure out how to add the counted words to a list, so that I can later extract values from that list for my Excel file and create an invoice.
Here is my code:
import docx
import os
import re
from docx import Document
#Folder to work with
folder = r'D:/Tulk_1'
files = os.listdir(folder)
#List with the names of files and word counts for each file
list_files = []
list_words = []
for file in files:
    #Getting the absolute location
    location = folder + '/' + file
    #Adding filenames to the list
    list_files.append(file)
    #Word counter
    document = docx.Document(location)
    newparatextlist = []
    for paratext in document.paragraphs:
        newparatextlist.append(paratext.text)
    #Printing file names
    print(file)
    #Printing word counts for each file
    print(len(re.findall(r'\w+', '\n'.join(newparatextlist))))
Output:
cold calls.docx
2950
Kristības.docx
1068
Tulkojums starpniecības līgums.docx
946
Tulkojums_PL_ULIHA_39_41_4 (1).docx
788
Unfortunately I copied the counter part from the web and the last line is too complicated for me:
print(len(re.findall(r'\w+', '\n'.join(newparatextlist))))
So I don't know how to extract the results out of it into a list.
When I try to store the last line into a variable like this:
x = len(re.findall(r'\w+', '\n'.join(newparatextlist)))
The output is only word count for one of the files:
cold calls.docx
Kristības.docx
Tulkojums starpniecības līgums.docx
Tulkojums_PL_ULIHA_39_41_4 (1).docx
788
Maybe you could help me to break the last line into smaller steps? Or perhaps there are easier solutions to my task?
EDIT:
The desired output for:
print(list_words)
should be:
[2950, 1068, 946, 788]
Similar to what I already get for the file names:
print(list_files)
output:
['cold calls.docx', 'Kristības.docx', 'Tulkojums starpniecības līgums.docx', 'Tulkojums_PL_ULIHA_39_41_4 (1).docx']
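The last line of the counter can be read as three steps: join the paragraphs into one string, find every run of word characters, and count the matches. A minimal sketch follows, using plain strings in place of .docx paragraphs so that it runs without python-docx. The function name `count_words` and the sample data are made up for illustration.

```python
import re


def count_words(paragraphs):
    """Count the words across a list of paragraph strings."""
    full_text = "\n".join(paragraphs)      # step 1: one string for the whole file
    words = re.findall(r"\w+", full_text)  # step 2: list of every word found
    return len(words)                      # step 3: number of words


# Stand-in data (hypothetical): file name -> paragraph texts
documents = {
    "first.docx": ["Hello world", "second paragraph here"],
    "second.docx": ["Just three words"],
}

list_files = []
list_words = []
for name, paragraphs in documents.items():
    list_files.append(name)
    # Append inside the loop so every file contributes one count
    list_words.append(count_words(paragraphs))

print(list_files)
print(list_words)
```

In the original script, `paragraphs` would be `[p.text for p in document.paragraphs]`, and the `list_words.append(...)` call would replace the final `print`.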

A series of graphviz diagrams in a single pdf file

I create a graphviz diagram using the code below.
network1 = G.Digraph(
    graph_attr={...},
    node_attr={...},
    edge_attr={..} )
I add nodes
network1.node("node_edge_name",...)
...
and edges
network1.edge("A", "B")
...
and then call the code below, which creates a PDF file and a dot file.
network1.view(file_name)
Done this way, my diagram becomes very complicated. What I want is to create a series of network objects instead of one and to visualize them in a single PDF file, page by page. In the end, I hope to have multiple dot files and a single PDF file.
Can someone tell me whether there is a way to do that, and how?
Many thanks,
Ferda
Graphviz does not seem to directly support this (maybe postscript output format does). But there are several tools that will allow you to combine multiple pdfs into a single file.
This ghostscript command works:
gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=combined.pdf -dBATCH f1.pdf f2.pdf f3.pdf
See also: https://sites.astro.caltech.edu/observatories/coo/solicit/mergePDF.html
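If each graph is rendered to its own PDF from Python, the same Ghostscript call can be scripted. This is a sketch only, assuming the `gs` executable is on the PATH; `build_gs_merge_command` and `merge_pdfs` are hypothetical helper names, and the flags are the ones from the command above.

```python
import subprocess


def build_gs_merge_command(output_pdf, input_pdfs):
    """Return the Ghostscript argument list that merges the inputs in order."""
    return [
        "gs",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        f"-sOUTPUTFILE={output_pdf}",
        "-dBATCH",
        *input_pdfs,
    ]


def merge_pdfs(output_pdf, input_pdfs):
    """Run Ghostscript; raises CalledProcessError if gs reports a failure."""
    subprocess.run(build_gs_merge_command(output_pdf, input_pdfs), check=True)


command = build_gs_merge_command("combined.pdf", ["f1.pdf", "f2.pdf", "f3.pdf"])
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.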
I used the PdfFileReader and PdfFileWriter classes from PyPDF2 to merge multiple PDF files into one. This was what I wanted.
import PyPDF2
# Paths of the PDFs to merge, in order (file_name1 ... file_name5 are defined elsewhere)
input_names = [file_name1, file_name2, file_name3, file_name4]
# Create a new PdfFileWriter object which represents a blank PDF document
pdfWriter = PyPDF2.PdfFileWriter()
# Open the input files; they must stay open until the merged file is written
input_files = [open(name, 'rb') for name in input_names]
for pdfFile in input_files:
    # Read the file that you have opened
    pdfReader = PyPDF2.PdfFileReader(pdfFile)
    # Loop through all the page numbers of this document and copy each page
    for pageNum in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        pdfWriter.addPage(pageObj)
# Now that you have copied all the pages of all the documents, write them into a new document
with open(file_name5+"_Merged_Files.pdf", 'wb') as pdfOutputFile:
    pdfWriter.write(pdfOutputFile)
# Close the input files once the output has been written
for pdfFile in input_files:
    pdfFile.close()
One possible problem is that node positions may change from one graph to the next. The layers feature will keep nodes from changing position (https://graphviz.org/faq/#FaqOverlays)

How to use Python to import (.tex file), create a new (.tex file) and append the new (.tex file) from the imported (.tex file)

I have several (1000+) .tex files which look something like this:
File1.tex:
\documentclass[15pt]{article}
\begin{document}
Question1:
$f(x)=sin(x)$\\
Question2:
$f(x)=tan(x)$
\end{document}
File2.tex is similar in structure:
\documentclass[15pt]{article}
\begin{document}
Question1:
$f(x)=cos(x)$\\
Question2:
$f(x)=sec(x)$\\
Question3:
$f(x)=cot(x)$
\end{document}
What I would like to do is write a Python script that allows me to select question 1 from file1.tex and question 3 from file2.tex and compile a new file3.tex file (or PDF) with the following format:
\documentclass[15pt]{article}
\begin{document}
Question1:
$f(x)=sin(x)$\\
Question2:
$f(x)=cot(x)$
\end{document}
PS: I don't mind if this type of work can be carried out in LaTeX itself. I just thought that with Python I could eventually create a GUI.
So far I've managed to read and append to a .tex file by manually typing what I want, rather than creating some sort of system that lets me "copy" specific sections of one or more .tex files into another .tex file.
I used exactly what you had for file1 and file2.tex. I left comments throughout rather than explain step by step.
PreProcess
The preprocess involves creating an xlsx file that has the names of all the tex files in the first column.
import os
import xlsxwriter
workbook = xlsxwriter.Workbook('Filenames.xlsx')
worksheet = workbook.add_worksheet("FileNames")
worksheet.write(0, 0, "NameCol")
path = os.getcwd() # get path to directory
filecount = 1
for file in os.listdir(path): # for loop over files in directory
    if file.split('.')[-1] == 'tex': # only open latex files
        worksheet.write(filecount, 0, file)
        filecount += 1
workbook.close()
Select Problems
Now go through the spreadsheet and, in the columns to the right of each file name, list the numbers of the questions you want out of that file.
PostProcess
Now we can run through our xlsx file and create a new latex file from it.
import pandas as pd
import math
import os
# get data
allfileqs = []
df = pd.read_excel('Filenames.xlsx')
for row in df.iterrows():
    tempqs = []
    for i in range(len(row[1].values) - 1):
        if math.isnan(row[1].values[i + 1]):
            continue
        else:
            tempqs.append(int(row[1].values[i + 1]))
    allfileqs.append(tempqs)
print(allfileqs)
allfileqs = [["Question" + str(allfileqs[i][j]) + ':' for j in range(len(allfileqs[i]))] for i in range(len(allfileqs))]
starttex = [r'\documentclass[15pt]{article}', r'\begin{document}']
endtex = [r'\end{document}']
alloflines = []
path = os.getcwd() # get path to directory
for file in os.listdir(path): # for loop over files in directory
    if file.split('.')[-1] == 'tex': # only open latex files
        lf = open(file, 'r')
        lines = lf.readlines()
        # remove all new lines, each item is on new line we know
        filt_lines = [lines[i].replace('\n', '') for i in range(len(lines)) if lines[i] != '\n']
        alloflines.append(filt_lines) # save data for later
        lf.close() # close file
# good now we have filtered lines
newfile = []
questcount = 1
for i in range(len(alloflines)):
    for j in range(len(alloflines[i])):
        if alloflines[i][j] in allfileqs[i]:
            newfile.append("Question" + str(questcount) + ":")
            newfile.append(alloflines[i][j + 1])
            questcount += 1
# okay cool we have beg, middle (newfile) and end of tex
newest = open('file3.tex', 'w') # open as write only
starter = '\n\n'.join(starttex) + '\n' + '\n\n'.join(newfile) + '\n\n' + endtex[0]
for i in range(len(starter)):
    newest.write(starter[i])
newest.close()
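The selection and renumbering step can also be sketched without the spreadsheet, working directly on the text of the .tex files. This is a minimal illustration that assumes every question is a `QuestionN:` line followed by exactly one body line, as in the sample files. The helper names `extract_questions` and `build_document` are hypothetical.

```python
import re


def extract_questions(tex_source):
    """Map question number -> body line for each 'QuestionN:' label."""
    lines = [line.strip() for line in tex_source.splitlines() if line.strip()]
    questions = {}
    for label, body in zip(lines, lines[1:]):
        match = re.fullmatch(r"Question(\d+):", label)
        if match:
            # Drop a trailing LaTeX line break so bodies can be recombined freely
            if body.endswith("\\\\"):
                body = body[:-2]
            questions[int(match.group(1))] = body
    return questions


def build_document(bodies):
    """Renumber the chosen bodies from 1 and wrap them in the preamble."""
    out = [r"\documentclass[15pt]{article}", r"\begin{document}"]
    for number, text in enumerate(bodies, start=1):
        out.append(f"Question{number}:")
        is_last = number == len(bodies)
        out.append(text if is_last else text + r"\\")
    out.append(r"\end{document}")
    return "\n".join(out) + "\n"


file1 = r"""\documentclass[15pt]{article}
\begin{document}
Question1:
$f(x)=sin(x)$\\
Question2:
$f(x)=tan(x)$
\end{document}
"""
file2 = r"""\documentclass[15pt]{article}
\begin{document}
Question1:
$f(x)=cos(x)$\\
Question2:
$f(x)=sec(x)$\\
Question3:
$f(x)=cot(x)$
\end{document}
"""

# Question 1 from file1 and question 3 from file2, as in the example
picked = [extract_questions(file1)[1], extract_questions(file2)[3]]
file3 = build_document(picked)
print(file3)
```

In a real script the two strings would come from reading File1.tex and File2.tex, and `file3` would be written to file3.tex.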

Using same code for multiple text files and generate multiple text files as output using python

I have more than 30 text files. I need to do some processing on each text file and save them again in text files with different names.
Example-1: precise_case_words.txt ---- processing ---- precise_case_sentences.txt
Example-2: random_case_words.txt ---- processing ---- random_case_sentences.txt
I need to do this for all of the text files.
Present code:
new_list = []
with open('precise_case_words.txt') as inputfile:
    for line in inputfile:
        new_list.append(line)
final = open('precise_case_sentences.txt', 'w+')
for item in new_list:
    final.write("%s\n" % item)
I am manually copying and pasting this code and changing the file names every time. Please suggest a way to avoid this manual work using Python.
Suppose you have all your *_case_words.txt files in the current directory:
import glob
in_file = glob.glob('*_case_words.txt')
prefix = [i.split('_')[0] for i in in_file]
for i, ifile in enumerate(in_file):
    data = []
    with open(ifile, 'r') as f:
        for line in f:
            data.append(line)  # do the per-line processing here
    with open(prefix[i] + '_case_sentences.txt', 'w') as f:
        f.writelines(data)  # write() accepts only a string, not a list
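The same idea can be written with pathlib, deriving each output name from the input name instead of keeping a separate list of prefixes. This is a sketch under the assumption that every input file ends in `_case_words.txt`; `output_path_for`, `process_line` and `process_all` are hypothetical names, and `process_line` is a placeholder for the real processing.

```python
from pathlib import Path


def output_path_for(input_path):
    """precise_case_words.txt -> precise_case_sentences.txt, same folder."""
    stem = input_path.stem  # e.g. "precise_case_words"
    new_stem = stem[: -len("_words")] + "_sentences"
    return input_path.with_name(new_stem + input_path.suffix)


def process_line(line):
    # Placeholder (assumption): the real per-line processing goes here
    return line


def process_all(directory):
    """Process every *_case_words.txt file and return the names written."""
    written = []
    for in_path in sorted(Path(directory).glob("*_case_words.txt")):
        out_path = output_path_for(in_path)
        with in_path.open() as src, out_path.open("w") as dst:
            for line in src:
                dst.write(process_line(line))
        written.append(out_path.name)
    return written
```

Calling `process_all(".")` would handle all matching files in the current directory in one run.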
This should give you an idea about how to handle it:
def rename(name,suffix):
    """renames a file with one . in it by splitting and inserting suffix before the ."""
    a,b = name.split('.')
    return ''.join([a,suffix,'.',b]) # recombine parts including suffix in it
def processFn(name):
    """Open file 'name', process it, save it under other name"""
    # scramble data by sorting and writing anew to renamed file
    with open(name,"r") as r, open(rename(name,"_mang"),"w") as w:
        for line in r:
            scrambled = ''.join(sorted(line.strip("\n")))+"\n"
            w.write(scrambled)
# list of filenames, see link below for how to get them with os.listdir()
names = ['fn1.txt','fn2.txt','fn3.txt']
# create demo data
for name in names:
    with open(name,"w") as w:
        for i in range(12):
            w.write("someword"+str(i)+"\n")
# process files
for name in names:
    processFn(name)
For file listings: see How do I list all files of a directory?
I chose to read and write line by line; you can instead read one file in fully, process it and write it out again in one block, to your liking.
fn1.txt:
someword0
someword1
someword2
someword3
someword4
someword5
someword6
someword7
someword8
someword9
someword10
someword11
into fn1_mang.txt:
0demoorsw
1demoorsw
2demoorsw
3demoorsw
4demoorsw
5demoorsw
6demoorsw
7demoorsw
8demoorsw
9demoorsw
01demoorsw
11demoorsw
I happened just today to be writing some code that does this.

Iteratively populate array in Python

I have multiple files in a directory which I'm trying to read from and save each file's contents into the same array.
getFileNames returns all of the file names from the base directory, and they are returned and saved correctly into the allFiles array.
I've tried the below code but it only returns the data from the first file. Actually, the array file has just one item, allFiles[0].
basePath = '/home/resume_examples/'
allFiles = getFileNames(basePath)
for document in allFiles:
    fileTexts = [getFileText(basePath + document)]
print(fileTexts)
I have also tried the following, but there's still only one item in the array (the contents of the last file read).
basePath = '/home/resume_examples/'
allFiles = getFileNames(basePath)
for document in allFiles:
    fileTexts = []
    fileTexts.append(getFileText(basePath + document))
print(fileTexts[2])
I understand that my array gets overwritten at every iteration, but I can't see why even append doesn't work. Can someone please explain how I should define and populate my array with each call of the getFileText function?
You reset the list to [] on every iteration. Do that only once, before the loop:
fileTexts = []
for document in allFiles:
    fileTexts.append(getFileText(basePath + document))
or use a list comprehension:
file_texts = [getFileText(basePath + document) for document in allFiles]
https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions
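The difference between the two placements can be seen in a small self-contained run. In this sketch `get_file_text` is a hypothetical stand-in for the asker's `getFileText`, and the file names are made up.

```python
def get_file_text(path):
    # Hypothetical stand-in for getFileText: returns a fake file body
    return "contents of " + path


all_files = ["a.txt", "b.txt", "c.txt"]

# List created inside the loop: it is replaced on every pass,
# so only the last file's text is left afterwards
for document in all_files:
    overwritten = []
    overwritten.append(get_file_text(document))

# List created once, before the loop: every pass adds to the same list
file_texts = []
for document in all_files:
    file_texts.append(get_file_text(document))

print(overwritten)
print(len(file_texts))
```

This also explains the asker's `fileTexts[2]` attempt: with the list reset inside the loop it never holds more than one item, so index 2 does not exist.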
