At the moment I use a script to populate a template for each entry in our database and generate a docx file per entry. I then convert that docx file to a PDF and mail it to the user.
For this I use the following code:
from docxtpl import DocxTemplate
from docx2pdf import convert
pathToTemplate='template.docx'
outputPath='output.docx'
template = DocxTemplate(pathToTemplate)
context = person.get_context(short) # gets the context used to render the document
template.render(context)
template.save(outputPath)
pdfpath = outputPath[:-4]+'pdf'
convert(outputPath, pdfpath)
This code runs inside a loop. Generating the context from the database (in the person.get_context(short) function) and rendering the docx file takes between 0.5 s and 1.0 s per entry. Converting that docx to PDF takes 5.0 s to 7.0 s.
Because the loop runs over more than 1000 users, the difference adds up. Does anyone know how DocxTemplate can save to PDF directly (and how fast that is), or whether there is a faster way to generate the PDF files?
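Per-stage timings like the ones quoted above could be collected with a small helper such as the sketch below. The names `timed` and `totals` are hypothetical and not part of docxtpl or docx2pdf, and the stand-in workload only marks where the render and convert steps from the script would go.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per stage name (hypothetical helper, not part of docxtpl)
totals = defaultdict(float)

@contextmanager
def timed(stage):
    """Add the elapsed time of the enclosed block to totals[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[stage] += time.perf_counter() - start

# Stand-in loop: replace the bodies with template.render(...)/template.save(...)
# and convert(...) from the script above.
for _ in range(3):
    with timed("render"):
        sum(range(1000))
    with timed("convert"):
        sum(range(1000))

for stage, seconds in totals.items():
    print("%s: %.4f s" % (stage, seconds))
```

Summing per stage over the whole loop makes it easier to see which step dominates across all users rather than for a single entry.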
As far as I know, you can't do this with the docx library itself, but I have found an alternative: convert the docx to PDF through Word using the following code.
import os
import pandas as pd
from docxtpl import DocxTemplate
from win32com import client

df = pd.read_excel("Data.xlsx")
# One context dict per spreadsheet row, built once instead of on every iteration
records = df.to_dict(orient="records")
rod = os.path.dirname(os.path.abspath(__file__))
print(rod)
# Start Word once and reuse it for every document
word_app = client.Dispatch("Word.Application")
for context in records:
    tpl = DocxTemplate("Invoice_Template.docx")
    tpl.render(context)
    # Note: every row overwrites hello.docx and hello.pdf; vary the name per row to keep them all
    docx_path = os.path.join(rod, "hello.docx")
    tpl.save(docx_path)
    # converting to pdf: open the file that was just saved (FileFormat=17 is wdFormatPDF)
    doc = word_app.Documents.Open(docx_path)
    doc.SaveAs(os.path.join(rod, "hello.pdf"), FileFormat=17)
    doc.Close()
word_app.Quit()
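If Word is not a requirement, another route that may be faster for large batches is LibreOffice in headless mode, which can convert many files in a single process start. This is a swapped-in technique rather than anything docxtpl provides. The sketch assumes the `soffice` executable is installed and on the PATH, and the helper names are hypothetical. Rendering can differ from Word, so it is worth checking a few outputs by eye.

```python
import subprocess

def build_convert_command(docx_paths, out_dir, soffice="soffice"):
    """Build one LibreOffice command line that converts every given file to PDF."""
    return [soffice, "--headless", "--convert-to", "pdf", "--outdir", out_dir] + list(docx_paths)

def convert_batch(docx_paths, out_dir, soffice="soffice"):
    """Run the conversion; raises CalledProcessError if LibreOffice exits with an error."""
    subprocess.run(build_convert_command(docx_paths, out_dir, soffice), check=True)

# Example: the command that would convert two rendered documents in one call
print(build_convert_command(["a.docx", "b.docx"], "pdf_out"))
```

Rendering all docx files first and then calling `convert_batch` once (or once per chunk of files) avoids paying the application start-up cost for every document.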
I have a PDF file and am extracting data from it using pdfquery and pandas.
The code is as follows:
import pdfquery
import pandas as pd
pdf = pdfquery.PDFQuery('data/BUSTA_PAGA - 2.pdf')
pdf.load()
pdf.tree.write('pdfXML.txt', pretty_print = True)
Name = pdf.pq('LTTextLineHorizontal:overlaps_bbox("25.509, 188.273, 188.558, 748.621")').text()
# Wrap the extracted text in a one-row DataFrame with a named column
s = pd.DataFrame({'Name': [Name]})
s.to_csv('file_name.csv')
When I run this, it gives me the full content of the text box, which is what I asked for, but I only want specific data from it. How would I do that?
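One possible approach is to post-process the extracted string. The sketch below assumes the text contains known labels followed by a colon; the helper name `extract_fields`, the labels, and the sample string are all made up for illustration, since the real content of the PDF is not shown here.

```python
import re

def extract_fields(text, labels):
    """Split text into {label: value} using the known labels as delimiters."""
    # Match any known label followed by a colon
    pattern = "(" + "|".join(re.escape(label) for label in labels) + r")\s*:"
    parts = re.split(pattern, text)
    # parts = [prefix, label1, value1, label2, value2, ...]
    return {label: value.strip() for label, value in zip(parts[1::2], parts[2::2])}

# Made-up sample standing in for the string returned by .text()
sample = "Name: Mario Rossi Month: 03/2021 Net pay: 1.234,56"
print(extract_fields(sample, ["Name", "Month", "Net pay"]))
```

If the values are not labelled in the text, narrowing the bounding box passed to overlaps_bbox to the area of interest is the other obvious option.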
The code below works fine in Python, but how do I pass parameter values to it from HTML?
import pandas as pd
import pandas_profiling
# read the file
df = pd.read_csv('Dataprofile.csv')
# run the profile report
profile = df.profile_report(title='Pandas Profiling Report')
# save the report as html file
profile.to_file(output_file="pandas_profiling1.html")
# save the report as json file
profile.to_file(output_file="pandas_profiling2.json")
Instead of to_file, use to_html:
df.to_html("Table.html")
You can style it with CSS in the HTML file.
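A minimal sketch of the CSS styling idea, assuming pandas is installed. The DataFrame contents, the class name `report`, and the style rules are made up for illustration.

```python
import pandas as pd

# Made-up data standing in for Dataprofile.csv
df = pd.DataFrame({"column": ["a", "b"], "value": [1, 2]})

# classes= adds a CSS class to the <table> tag so a stylesheet can target it
table_html = df.to_html(classes="report", index=False)

page = """<html>
<head>
<style>
table.report { border-collapse: collapse; }
table.report th, table.report td { border: 1px solid #999; padding: 4px 8px; }
</style>
</head>
<body>
%s
</body>
</html>""" % table_html

print(page)
```

Writing `page` to a file such as Table.html gives a standalone document. Calling to_html without a file name returns the markup as a string, which is what makes the wrapping possible.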
I want to add an image as a background/watermark to a new Word document using Python. I tried python-docx but couldn't find anything useful.
from docx import Document
from docx.shared import Inches
document = Document()
document.add_picture(r'D:\Python\Projects\raw_imgs\3b057d6199d95c4339ef532001cb20cd.jpg', width=Inches(6))
document.save('demo.docx')
The above code just inserts the image but I want to add it as the background image.
Aspose.Words Cloud SDK for Python can insert an image as a background into a DOC/DOCX file. Although it is a paid product, its free trial allows 150 free API calls per month.
# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxxxxxxxxxxx'
words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'
localFile = 'C:/Temp/Sections.docx'
imageFile= 'C:/Temp/Tulips.jpg'
outputFile= 'C:/Temp/Watermark.docx'
request = asposewordscloud.models.requests.InsertWatermarkImageOnlineRequest(document=open(localFile, 'rb'), image_file=open(imageFile, 'rb'))
result = words_api.insert_watermark_image_online(request)
copyfile(result.document, outputFile)
I need to read data from hundreds of PDF forms. The forms consist entirely of text entry boxes and are not editable. I have been trying to use Python and PyPDF2 to read these forms into a CSV file (since the ultimate goal is an Excel database).
I have tried Acrobat's export-as-CSV function, but it is extremely slow because each form has 4 embedded images that are exported as plain text. I have the following code:
from PyPDF2 import PdfFileReader
infile = "FormSample.pdf"
pdf_reader = PdfFileReader(open(infile, "rb"))
with open('exportharvest.csv','w') as exportharvestcsv:
    dictionary = pdf_reader.getFields(fileobj = exportharvestcsv)
textfields = pdf_reader.getFormTextFields()
dest = pdf_reader.getNamedDestinations()
print(dest)
The issue with the above code is as follows: the getFields call only returns the ~4 digital signature fields in the form (the form has ~300 entries). Is there some way to instruct Python to look through all the fields? I know the field names in the document, as they are listed when I export to pdf.
getFormTextFields() returns an empty dictionary, {}.
getNamedDestinations() returns an empty dictionary, {}.
Thanks for any help.
In my experience pyPDF is slow as well.
This should do what you want:
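import csv
from pprint import pprint
from PyPDF2 import PdfFileReader

pdf_file_name = 'formdocument.pdf'
reader = PdfFileReader(pdf_file_name)
# getFields() returns None when the PDF has no form fields
fields = reader.getFields() or {}
# Map each field name to its value ('/V'); unfilled fields become ''
fdfinfo = dict((k, v.get('/V', '')) for k, v in fields.items())
pprint(fdfinfo)
with open('test.csv', 'w', newline='') as f2:
    # The csv module handles quoting and embedded quotes in values
    writer = csv.writer(f2, quoting=csv.QUOTE_ALL)
    for key, value in fdfinfo.items():
        # str() guards against values that are not plain strings
        writer.writerow([key, str(value)])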
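Since the goal is a single table built from hundreds of forms, one row per form may be more convenient than one row per field. The following is a sketch under that assumption, using only the standard library. `write_forms_csv` is a hypothetical helper, and the input dictionaries stand in for per-file results like `fdfinfo` above.

```python
import csv
import io

def write_forms_csv(rows, out_file):
    """Write one CSV row per form; columns are the union of all field names."""
    fieldnames = sorted({name for row in rows for name in row})
    # restval fills in fields that a particular form does not have
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for row in rows:
        # str() guards against values that are not plain strings
        writer.writerow({name: str(value) for name, value in row.items()})

# Made-up field dictionaries standing in for two parsed forms
forms = [{"Name": "A", "Plot": "12"}, {"Name": "B"}]
buffer = io.StringIO()
write_forms_csv(forms, buffer)
print(buffer.getvalue())
```

In real use the `io.StringIO` buffer would be replaced by a file opened with `newline=''`, and `forms` would be filled by looping over the PDF files.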
I need to read a large amount of data from a temp file in Spotfire using IronPython.
First I exported my TIBCO data table to a temp file using the ExportText() method:
#Temp file for storing the TablePlot data
tempFolder = Path.GetTempPath()
tempFilename = Path.GetTempFileName()
#Export TablePlot data to the temp file
tp = tablePlotViz.As[TablePlot]()
writer = StreamWriter(tempFilename)
tp.ExportText(writer)
After that, I opened the temp file using the open() method.
f = open(tempFilename)
When I read the data from the opened file and write it into a string variable, it takes too much time and my Spotfire screen stops responding.
Does anyone have an idea about this?
My data table is 8 MB in size.
The code is:
from Spotfire.Dxp.Application.Visuals import TablePlot, HtmlTextArea
import clr
import sys
clr.AddReference('System.Data')
import System
from System.Data import DataSet, DataTable, XmlReadMode
from Spotfire.Dxp.Data import DataType, DataTableSaveSettings
from System.IO import StringReader, StreamReader, StreamWriter, MemoryStream, SeekOrigin, FileStream, FileMode,Path, File
from Spotfire.Dxp.Data.Export import DataWriterTypeIdentifiers
from System.Threading import Thread
from Spotfire.Dxp.Data import IndexSet
from Spotfire.Dxp.Data import RowSelection
from Spotfire.Dxp.Data import DataValueCursor
from Spotfire.Dxp.Data import DataSelection
from Spotfire.Dxp.Data import DataPropertyClass
from Spotfire.Dxp.Data import Import
from Spotfire.Dxp.Data.Import import TextFileDataSource, TextDataReaderSettings
from System import Array
from Spotfire.Dxp.Application.Visuals import VisualContent
from Spotfire.Dxp.Application.Visuals import TablePlot
from System.IO import Path, StreamWriter
from System.Text import StringBuilder
#Temp file for storing the TablePlot data
tempFolder = Path.GetTempPath()
tempFilename = Path.GetTempFileName()
#Export TablePlot data to the temp file
tp = tablePlotViz.As[TablePlot]()
writer = StreamWriter(tempFilename)
tp.ExportText(writer)
#Build the table
sb = StringBuilder()
#Open the temp file for reading
f = open(tempFilename)
#build the html table
html = " <TABLE id='table' style='display:none;'>\n"
html += "<THEAD>"
html += " <TR><TH>"
html += " </TH><TH>".join(f.readline().split("\t")).strip()
html += " </TH></TR>"
html += "</THEAD>\n"
html += "<TBODY>\n"
for line in f:
    html += "<TR><TD>"
    html += "</TD><TD>".join(line.split("\t")).strip()
    html += "</TD></TR>\n"
#Output all the HTML data for the text area
print html
The code works fine with short data.
If I understand correctly, the intention behind the code is to read Table Plot visualization data into a string for further use in an HTML Text Area.
There is an alternative way of doing this without writing the data to a temporary file: export the data to a memory stream and convert the exported text to a string for reuse. The sample code can be found here.
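Independently of where the exported text comes from, repeated `html +=` on a large table may itself contribute to the slowdown, since each concatenation can copy the growing string. The sketch below collects the pieces in a list and joins them once. It assumes tab-separated lines with a header row, as in the question, and `build_html_table` is a hypothetical helper name, not a Spotfire API.

```python
def build_html_table(lines):
    """Build the hidden HTML table from an iterable of tab-separated lines."""
    lines = iter(lines)
    # Strip only the line ending so that empty trailing cells are kept
    header = next(lines).rstrip("\r\n").split("\t")
    parts = ["<TABLE id='table' style='display:none;'>\n<THEAD><TR><TH>"]
    parts.append("</TH><TH>".join(header))
    parts.append("</TH></TR></THEAD>\n<TBODY>\n")
    for line in lines:
        cells = line.rstrip("\r\n").split("\t")
        parts.append("<TR><TD>" + "</TD><TD>".join(cells) + "</TD></TR>\n")
    parts.append("</TBODY>\n</TABLE>")
    # A single join at the end avoids re-copying the growing string on every row
    return "".join(parts)

# Made-up rows standing in for the exported Table Plot text
sample = ["a\tb\n", "1\t2\n", "3\t4\n"]
print(build_html_table(sample))
```

The function accepts any iterable of lines, so an open file object or the lines of a string read from a memory stream can be passed in directly.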