I have a PDF file and am extracting data from it using pdfquery and pandas.
The code is as follows:
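import pdfquery
import pandas as pd
pdf = pdfquery.PDFQuery('data/BUSTA_PAGA - 2.pdf')
pdf.load()
pdf.tree.write('pdfXML.txt', pretty_print=True)
# text of every horizontal text line that overlaps the bounding box (x0, y0, x1, y1)
Name = pdf.pq('LTTextLineHorizontal:overlaps_bbox("25.509, 188.273, 188.558, 748.621")').text()
# one-row DataFrame holding the extracted text (a set literal is not accepted here)
s = pd.DataFrame({'Name': [Name]})
s.to_csv('file_name.csv')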
When I run this, it gives me the full contents of the text box, as expected, but I only want to extract one specific piece of data from it. How would I do that?
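One way to pull a specific value out of the full text box is to post-process the extracted string with a regular expression. The sketch below assumes the value of interest directly follows a known label; the helper name `extract_after_label` and the sample text are hypothetical and are not taken from the PDF in the question.

```python
import re

def extract_after_label(text, label):
    """Return the first whitespace-delimited token that follows `label`, or None."""
    # Allow optional whitespace and an optional colon between label and value
    pattern = re.escape(label) + r"\s*:?\s*(\S+)"
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Hypothetical text standing in for the string returned by pdf.pq(...).text()
sample = "Name: Rossi Code: 12345"
print(extract_after_label(sample, "Code"))
```

In the original script the same helper could be applied to `Name` before building the DataFrame. If the wanted value sits in its own region of the page, narrowing the bounding box may be simpler than parsing the text.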
import pandas as pd
# read by default 1st sheet of an excel file
dataframe1 = pd.read_excel(r'E:\Images\New Folder\afec9b91-5c2f-4cab-aca8-abd7bde854e0\P_SA_C0002_DcW_R1_01_FMV_000000000000.xlsx')
print(dataframe1)
Output
Sensor Longitude  Sensor Latitude  Survey ID
72.69362          32.090865        P_SA_C0002_DcW_R1_01
Now I want to write that output onto a specific image.
from PIL import Image,ImageDraw,ImageFont
import glob
import os
images=glob.glob(r"E:\Images/*.jpg")
for img in images:
    images=Image.open(img)
    draw=ImageDraw.Draw(images)
    font=ImageFont.load_default()
    import pandas as pd
    # read by default 1st sheet of an excel file
    dataframe1 = pd.read_excel(r'E:\Images\New Folder\afec9b91-5c2f-4cab-aca8-abd7bde854e0\P_SA_C0002_DcW_R1_01_FMV_000000000000.xlsx')
    #print(dataframe1)
    # write ="dataframe1"
    text="print"
    draw.text((0,240),text,(250,250,250),font=font)
    images.save(img)
I am trying to write the output onto the image using the code above, but it is not working. Please help.
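In the code above, `text="print"` draws the literal word "print" rather than the spreadsheet contents. A minimal sketch of one possible fix is to build the string from the column names and cell values first. The helper name `row_to_text` is hypothetical, and the values are copied from the printed output shown in the question.

```python
def row_to_text(headers, values):
    """Join column names and cell values into one 'name: value' line per column."""
    return "\n".join(f"{h}: {v}" for h, v in zip(headers, values))

# Values copied from the printed output in the question
headers = ["Sensor Longitude", "Sensor Latitude", "Survey ID"]
values = [72.69362, 32.090865, "P_SA_C0002_DcW_R1_01"]
text = row_to_text(headers, values)
print(text)
```

With the DataFrame from the question, the same helper could presumably be called as `row_to_text(dataframe1.columns, dataframe1.iloc[0])`, and the result passed to `draw.text((0,240),text,(250,250,250),font=font)` in place of the literal string.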
I need to read an XML file and load its data into a DataFrame. I have written this to extract the data from one XML file.
import pandas as pd
import numpy as np
import xml.etree.ElementTree as et  # xml.etree.cElementTree was removed in Python 3.9
import datetime
tree=et.parse('/data/dump_xml/1013.xml')
root=tree.getroot()
NAME = []
for name in root.iter('name'):
    NAME.append(name.text)
print(NAME[0])
print(NAME[1])
UPDATE = []
for update in root.iter('lastupdate'):
    UPDATE.append(update.text)
updated = datetime.datetime.fromtimestamp(int(UPDATE[0]))
lastupdate=updated.strftime('%Y-%m-%d %H:%M:%S')
ParaValue = []
for parameterevalue in root.iter('value'):
    ParaValue.append(parameterevalue.text)
print(ParaValue[0])
print(ParaValue[1])
print(lastupdate,NAME[0],ParaValue[0])
print(lastupdate,NAME[1],ParaValue[1])
For each file, I need the following two lines as output:
2022-05-23 11:25:01 in 1.5012356187e+05
2022-05-23 11:25:01 out 1.7723777592e+05
Now I need to do this for all the XML files in /data/dump_xml/ and build a DataFrame with all the data in one execution. Can someone help me improve my code?
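A minimal sketch of one way to do this: wrap the single-file logic in a function and call it for every file that `glob` finds. The function names `parse_dump` and `collect_rows` are hypothetical. The sketch assumes every file has one `lastupdate` element, that the i-th `name` belongs with the i-th `value` (as in the single-file code above), and that none of these elements is empty.

```python
import glob
import os
import xml.etree.ElementTree as ET
from datetime import datetime

def parse_dump(path):
    """Return one (lastupdate, name, value) tuple per <name> element in the file."""
    root = ET.parse(path).getroot()
    stamp = datetime.fromtimestamp(int(root.findtext(".//lastupdate").strip()))
    lastupdate = stamp.strftime("%Y-%m-%d %H:%M:%S")
    names = [el.text.strip() for el in root.iter("name")]
    values = [el.text.strip() for el in root.iter("value")]
    # zip stops at the shorter list, pairing the i-th name with the i-th value
    return [(lastupdate, n, v) for n, v in zip(names, values)]

def collect_rows(folder):
    """Parse every .xml file in `folder` and concatenate the resulting rows."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "*.xml"))):
        rows.extend(parse_dump(path))
    return rows
```

The rows can then be turned into a single DataFrame with `pd.DataFrame(collect_rows("/data/dump_xml"), columns=["lastupdate", "name", "value"])`. The values stay as strings here; converting them with `float` is a separate step if numbers are needed.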
I am writing a script that reads a folder of .pdfs and extracts their fillable fields to a pandas df. I had success extracting one .pdf with the following code:
import numpy as np
import pandas as pd
import PyPDF2
import glob, os
pwd = os.getcwd()
pdfFileObj = open('pdf_filename', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
fields_dict = pdfReader.getFormTextFields()
series = pd.Series(fields_dict).to_frame()
df = pd.DataFrame(pd.Series(fields_dict)).T
I want to build a function that runs this script for all pdfs in the directory. My first idea was to use a function in glob that collects all pdfs. Here is what I have so far:
import numpy as np
import pandas as pd
import PyPDF2
import glob, os
pwd = os.getcwd()
def readfiles():
    os.chdir(pwd)
    pdfs = []
    for file in glob.glob("*.pdf"):
        print(file)
        pdfs.append(file)
pdfFileObj = open(readfiles, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
fields_dict = pdfReader.getFormTextFields()
series = pd.Series(fields_dict).to_frame()
df = pd.DataFrame(pd.Series(fields_dict)).T
Unfortunately, this doesn't work because I cannot put a function in the pdfFileReader. Does anyone have suggestions on a better way to do this? Thanks!
I can't comment yet (new account), but you could try making your readfiles function return the list pdfs.
Then, in the code below it, loop over that list:
listofPDF=readfiles()
arrayofDF=list()
for file in listofPDF:
    pdfFileObj = open(file , 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    ##execute your code to obtain a single dataframe from a pdf here
    fields_dict = pdfReader.getFormTextFields()
    series = pd.Series(fields_dict).to_frame()
    df = pd.DataFrame(pd.Series(fields_dict)).T
    arrayofDF.append(df)
You would end up with a list of dataframes, one per PDF file, provided the first part of the code (which builds the dataframe from a single PDF file) works.
Additionally, you could build a dictionary like {"filename": file, "dataframe": df} for each file and append that to your list, so you can later look up a dataframe by the name of its file. It all depends on what you plan to do with the dataframes later.
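A variation on the dictionary idea, sketched under assumptions: key a single dict by file name, so each result can be looked up directly. The names `collect_by_filename` and `extract` are hypothetical. `extract` stands in for whatever turns one PDF path into a dataframe, for example the PyPDF2 code from the question wrapped in a function.

```python
import glob
import os

def collect_by_filename(folder, extract):
    """Map each PDF's file name to whatever `extract(path)` returns for it."""
    results = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        results[os.path.basename(path)] = extract(path)
    return results
```

Calling `collect_by_filename(os.getcwd(), my_extract)` would then give something like `results["form1.pdf"]`, where `my_extract` and `form1.pdf` are placeholder names. Keeping the extraction step as a parameter also makes the loop easy to try out with a stub before any real PDFs are involved.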
At the moment I use a script to populate a template for each of the entries in our database and generate a docx file for each entry. Following that I convert that docx file to a pdf and mail it to the user.
For this I use the following code:
from docxtpl import DocxTemplate
from docx2pdf import convert
pathToTemplate='template.docx'
outputPath='output.docx'
template = DocxTemplate(pathToTemplate)
context = person.get_context(short) # gets the context used to render the document
template.render(context)
template.save(outputPath)
pdfpath = outputPath[:-4]+'pdf'
convert(outputPath, pdfpath)
This part of the code is embedded in a loop and when measuring the time needed to generate the context from the database (in the person.get_context(short) function) and generating the docx file it gives me a result between 0.5s - 1.0s. When measuring the time needed to convert this docx to pdf it gives me a time of 5.0s - 7.0s.
Because the loop has to run over more than 1000 users, this difference adds up. Does anyone know whether DocxTemplate can save to PDF directly (and how fast that is), or whether there is a faster way to generate the PDF files?
As far as I know, you can't do this with the docx library itself, but I have found an alternative: on Windows, you can convert the docx to PDF through Word itself using the following code.
import os
import pandas as pd
from docxtpl import DocxTemplate
from win32com import client
df = pd.read_excel("Data.xlsx")
# one context dict per spreadsheet row, built once instead of on every iteration
records = df.to_dict(orient="records")
rod = os.path.dirname(os.path.abspath(__file__))
print(rod)
# start Word once and reuse it for every document
word_app = client.Dispatch("Word.Application")
for i, context in enumerate(records):
    tpl = DocxTemplate("Invoice_Template.docx")
    tpl.render(context)
    # name the files by row index so each row gets its own output
    docx_path = os.path.join(rod, f"{i}.docx")
    tpl.save(docx_path)
    # converting to pdf (FileFormat=17 is wdFormatPDF)
    doc = word_app.Documents.Open(docx_path)
    doc.SaveAs(os.path.join(rod, f"{i}.pdf"), FileFormat=17)
    doc.Close()
word_app.Quit()
The code below works fine in Python, but how do I pass parameter values from HTML?
import pandas as pd
import pandas_profiling
# read the file
df = pd.read_csv('Dataprofile.csv')
# run the profile report
profile = df.profile_report(title='Pandas Profiling Report')
# save the report as html file
profile.to_file(output_file="pandas_profiling1.html")
# save the report as json file
profile.to_file(output_file="pandas_profiling2.json")
Instead of to_file use to_html.
df.to_html("Table.html")
You can style it with CSS in the HTML file.