How to access data from PDF forms with Python?

I need to access data from pdf form fields. I tried the package PyPDF2 with this code:
import PyPDF2
reader = PyPDF2.PdfReader('formular.pdf')
print(reader.pages[0].extract_text())
But this gives me only the text of the normal pdf data, not the form fields.
Does anyone know how to read text from the form fields?

You can use the get_form_text_fields() method of PdfReader to return a dictionary of the text form fields. Older releases exposed the same feature as PdfFileReader.getFormTextFields() (see https://pythonhosted.org/PyPDF2/PdfFileReader.html), but that spelling no longer works in PyPDF2 3.0 and later. Use the dictionary keys (the field names) to access the values (the field values). The following example might help:
from PyPDF2 import PdfReader
infile = "myInputPdf.pdf"
pdf_reader = PdfReader(infile)
dictionary = pdf_reader.get_form_text_fields() # returns a Python dictionary {field name: field value}
my_field_value = str(dictionary['my_field_name']) # use field name (dictionary key) to access field value (dictionary value)

There are Python libraries that can access PDF data. A PDF is not a plain-text format like CSV, TXT or TSV, so Python cannot read the data inside a PDF file directly.
One such library is slate. Read the slate documentation; I hope it answers your question.

Related

How to extract very nested json without pattern

I've been trying to normalize a JSON file and I'm looking for a Python (pandas) or PySpark script, as generic as possible, that can extract data from a deeply nested MongoDB JSON document and return it as a relational dataset that we can consume from the data lake. The data comes from a third-party API and is saved in MongoDB.
There are a lot of records and fields, so we can't fit everything into a single dataframe. Also, the layout does not follow a fixed pattern.
Could you please help us?
What is the best way to do this, following best practices and, if possible, recursively?
Below is a chunk of the JSON file:
https://raw.githubusercontent.com/migueelcruz/sample_json/main/sample.json
We expect multiple dataframes that link to each other so we can consume the data like a relational database. Each output file should also look like a database table.
Thanks a lot for your help!
A way to approach this problem would be using the json module to deserialize the data into a python dictionary.
# Get the data
import urllib.request as urllib
link = "https://raw.githubusercontent.com/migueelcruz/sample_json/main/sample.json"
f = urllib.urlopen(link)
myfile = f.read()
# Deserialize
import json
data = json.loads(myfile)
data
Now you can reach into the data using Python dictionary syntax.
For example, to get eventos, which is under nfe, which in turn is under dados, you would write:
data["dados"]["nfe"]["eventos"]
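Since the question asks for several linked tables rather than a single lookup, here is a minimal sketch of one recursive approach, using only the standard library. It is an illustration under assumptions, not a finished normalizer: the function name split_into_tables, the "_id" / "_parent_id" column names and the sample fields numero and tipo are all hypothetical (only dados, nfe and eventos come from the question), and lists nested directly inside lists are stored as-is rather than split further.

```python
from collections import defaultdict
from itertools import count


def split_into_tables(record, root_name="root"):
    """Split one nested JSON record into flat, linked tables.

    Returns {table_name: [row_dict, ...]}. Every row gets an "_id";
    "_parent_id" points at the row it was nested under (None for the root).
    """
    tables = defaultdict(list)
    ids = count(1)  # one counter shared by all tables, so ids never collide

    def walk(node, table, parent_id):
        row = {"_id": next(ids), "_parent_id": parent_id}
        tables[table].append(row)
        for key, value in node.items():
            child_table = f"{table}_{key}"
            if isinstance(value, dict):
                # Nested object: becomes a row in its own child table.
                walk(value, child_table, row["_id"])
            elif isinstance(value, list):
                for item in value:
                    if isinstance(item, dict):
                        walk(item, child_table, row["_id"])
                    else:
                        # List of scalars: one child row per value.
                        tables[child_table].append(
                            {"_id": next(ids), "_parent_id": row["_id"], "value": item}
                        )
            else:
                # Scalar: stays as a column on the current row.
                row[key] = value

    walk(record, root_name, None)
    return dict(tables)


# Hypothetical miniature of the structure described in the question.
sample = {"dados": {"nfe": {"numero": 1, "eventos": [{"tipo": "a"}, {"tipo": "b"}]}}}
tables = split_into_tables(sample, "doc")
for name, rows in tables.items():
    print(name, rows)
```

Each list of rows can then be handed to pandas.DataFrame(rows) and written out as its own table, with "_parent_id" acting as the foreign key back to the parent table.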

How can I automatically copy data from Excel and paste it into a Word document?

I tried to use docxtpl with Python, but the result is very ugly, nearly unreadable. I tried using a DataFrame, a list, and so on, but I don't get a clean table in my Word document. Does anyone know how to do this with Python? Or is it simpler using VBA?
(And docx doesn't allow me to add the table I want INSIDE the Word document.)
With a DataFrame the table is truncated. And with a list, the columns don't fit.
Thanks a lot
from docxtpl import DocxTemplate
doc = DocxTemplate(fichier_test)
context = {para_multiple[i]: liste_dataframe[i] for i in range(len(para_multiple))}
doc.render(context)
doc.save(file_location)
Here para_multiple is a list of all the tags in the .doc, and liste_dataframe is the list of DataFrames containing the data I need in the document.
(This is what I get for now; I can't find out how to display it correctly.)
I need to remove the tab characters and the index.
https://i.stack.imgur.com/r3CQJ.png
On Windows, if the data you're trying to copy from Excel is in a named range, you could paste it into your Word document using the following bit:
from win32com import client
excel = client.Dispatch("Excel.Application")
word = client.Dispatch("Word.Application")
doc = word.Documents.Open("C:/word_file.docx")
book = excel.Workbooks.Open("C:/excel_file.xlsx")
sheet = book.Worksheets(1) # depends on which sheet your named range is located
sheet.Range(NAMED_RANGE_OF_YOUR_TABLE_IN_EXCEL).Copy()
target = doc.Range()
findtext = "WHERE_I_WANT_TO_PASTE_MY_TABLE_IN_WORD" # use a placeholder in your Word document where you want the table to appear
if target.Find.Execute(FindText=findtext):
    # after a successful Find, target covers the placeholder text
    table_range = doc.Range(Start=target.Start, End=target.End)
    table_range.PasteExcelTable(False, False, False)
It will keep the formatting from the workbook.

Convert DICOM tag into something more readable

I'm working with pydicom and a DICOMWeb client. The latter I use to fetch metadata from a DICOM repository.
When retrieving DICOM metadata, I only get the DICOM tags as tuples of hexadecimals. I was wondering how to look up the tags and get a readable identifier using pydicom.
For example, how to convert the tag 0x10,0x20 into its string representation/keyword ("PatientID")? (See specs of the DICOM data dictionary)
pydicom offers some utility functions to handle the DICOM data dictionary:
import pydicom as dicom
tag = dicom.tag.Tag(0x10,0x20)
# Option 1) Retrieve the keyword:
keyword = dicom.datadict.keyword_for_tag(tag)
# Option 2) Retrieve the complete datadict entry:
entry = dicom.datadict.get_entry(tag)
representation, multiplicity, name, is_retired, keyword = entry
# keyword: "PatientID"
# name: "Patient ID"
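If the metadata arrives as DICOMweb JSON, the tags are usually the keys of the JSON object, written as eight hexadecimal digits such as "00100020". The following is a small sketch of how such a key, or a (group, element) pair, could be normalized before it is passed to the lookup shown above. The helper names parse_tag and format_tag are made up for this illustration and are not part of pydicom.

```python
def parse_tag(tag):
    """Return (group, element) as ints.

    Accepts an 8-hex-digit string such as "00100020" (the DICOM JSON key form)
    or an existing (group, element) pair such as (0x10, 0x20).
    """
    if isinstance(tag, str):
        if len(tag) != 8:
            raise ValueError(f"expected 8 hex digits, got {tag!r}")
        return int(tag[:4], 16), int(tag[4:], 16)
    group, element = tag
    return int(group), int(element)


def format_tag(tag):
    """Render a tag in the conventional (GGGG,EEEE) notation."""
    group, element = parse_tag(tag)
    return f"({group:04X},{element:04X})"


print(parse_tag("00100020"))
print(format_tag((0x10, 0x20)))

# With pydicom installed, the pair can be fed to the dictionary lookup, e.g.:
#   import pydicom
#   group, element = parse_tag("00100020")
#   pydicom.datadict.keyword_for_tag(pydicom.tag.Tag(group, element))
```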

Declaring path to invoke a template file python27

I have tried various ways of declaring the path to the template file, but for some reason the script fails to pick up the template from the specified location, load the report contents into it and export the result to PDF.
Here is the code:
#Build the html report using the html template and save to the set location
output_from_parsed_template = buildTemplate()
with open(r"C:\python_report_scripts\anram_report.html","wb") as fh:
    fh.write(output_from_parsed_template)
#Convert html to pdf
subprocess.call('C:\omniformat\html2pdf995.exe r"C:\python_report_scripts\anram_report.html" r"C:\ANRAM_Requests\Working_folder\\'+sectionName+'.pdf"')
#Run a summary report when there is at least 1 section in the section list
if len(sectionInfo)>1:
    #Flag that this is a summary report
    summary = 1
    #Generate the Existing Map image
    MapExisting(summary)
    #Generate the Existing Risk Graph image
    RiskgraphExisting(summary)
    #Generate the New Map Image
    MapNew(summary)
    #Generate the New Risk Graph image
    RiskgraphNew(summary)
    #Generate the tabular results for the report
    populateResults(summary)
    #Indicate that the report is a summary, which then forms the report title
    #global sectionName
    sectionName = 'SUMMARY - '+sectionName
    #Build the html report using the html template and save to the set location
    output_from_parsed_template = buildTemplate()
    with open(r"C:\python_report_scripts\anram_report.html","wb") as fh:
        fh.write(output_from_parsed_template)
    #Convert html to pdf
    subprocess.call('C:\omniformat\html2pdf995.exe r"C:\python_report_scripts\anram_report.html" r"C:\ANRAM_Requests\Working_folder\\'+sectionName+'.pdf"')
So am I declaring it correctly?
Please advise.
This line looks weird:
#Convert html to pdf
subprocess.call('C:\omniformat\html2pdf995.exe r"C:\python_report_scripts\anram_report.html" r"C:\ANRAM_Requests\Working_folder\\'+sectionName+'.pdf"')
You seem to be mixing up Python quoting and Python raw strings with shell quoting, and you still have some double backslashes around.
I suggest:
Use subprocess.check_call instead, that way Python will report if there are any errors running the command.
Pass it a list, instead of a string, that way it's more clear which argument is which and you don't depend so much on shell quoting and word splitting.
Use raw strings consistently (r'...' or r"...", either one is fine.)
Use os.path.join to join paths!
So, putting it all together, try this instead:
import os
import subprocess

# Convert html to pdf
subprocess.check_call([
    r'C:\omniformat\html2pdf995.exe',
    r'C:\python_report_scripts\anram_report.html',
    os.path.join(r'C:\ANRAM_Requests\Working_folder', sectionName + '.pdf')
])
I hope this solves the issue you're seeing... Or, if it doesn't, at least gives you a more meaningful error message that you can act on.
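To make the quoting problem concrete, here is a short sketch of what Python does with the pieces of the original command string. It uses ntpath (the Windows flavour of os.path, importable on any platform) so that it behaves the same everywhere. The section name "A1" and the tool.exe command are made-up examples.

```python
import ntpath  # Windows implementation of os.path, available on every platform

# In a normal string literal, "\a" is a single character (ASCII bell, code 7),
# so a path segment like "\anram_report.html" silently loses its backslash.
plain = "\anram_report.html"
# With the r prefix outside the quotes, the backslash is kept as typed.
raw = r"\anram_report.html"
print(len(plain), len(raw))
print(repr(plain[0]), repr(raw[0]))

# An r"..." written INSIDE another string is not a raw-string prefix.
# The letter r and the quotes are ordinary characters of the command text.
command = 'tool.exe r"in.html"'
print(command.split()[1])

# Joining the folder and file name avoids hand-written trailing backslashes.
section_name = "SUMMARY - " + "A1"  # hypothetical section name
pdf_path = ntpath.join(r"C:\ANRAM_Requests\Working_folder", section_name + ".pdf")
print(pdf_path)
```

In the real script os.path.join does the same job as ntpath.join, because on Windows os.path is ntpath.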

Django/Python: Save an HTML table to Excel

I have an HTML table that I'd like to be able to export to an Excel file. I already have an option to export the table into an IQY file, but I'd prefer something that didn't allow the user to refresh the data via Excel. I just want a feature that takes a snapshot of the table at the time the user clicks the link/button.
I'd prefer it if the feature was a link/button on the HTML page that allows the user to save the query results displayed in the table. It would also be nice if the formatting from the HTML/CSS could be retained. Is there a way to do this at all? Or, something I can modify with the IQY?
I can try to provide more details if needed. Thanks in advance.
You can use the excellent xlwt module.
It is very easy to use, and creates files in xls format (Excel 2003).
Here is an (untested!) example of use for a Django view:
from django.http import HttpResponse
import xlwt

def excel_view(request):
    normal_style = xlwt.easyxf("font: name Verdana")
    # current Django versions take content_type; the old mimetype argument was removed
    response = HttpResponse(content_type='application/vnd.ms-excel')
    wb = xlwt.Workbook()
    ws0 = wb.add_sheet('Worksheet')
    ws0.write(0, 0, "something", normal_style)
    wb.save(response)  # the response is file-like, so the workbook is written into it
    return response
Use CSV. There's a module in Python ("csv") to generate it, and excel can read it natively.
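As a rough sketch of the CSV route, the snippet below turns the table's header and rows into CSV text with the standard csv module. The function name rows_to_csv and the sample data are invented for the example. In a Django view, the returned text could be sent back in an HttpResponse with content_type "text/csv".

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize a header and a list of rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)  # fields containing commas are quoted automatically
    return buffer.getvalue()


# Hypothetical snapshot of the HTML table's contents.
text = rows_to_csv(["name", "qty"], [["Widget", 3], ["Nut, hex", 10]])
print(text)
```

Note that CSV carries only the values, so the HTML/CSS formatting mentioned in the question is not preserved with this option.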
Excel supports opening an HTML file containing a table as a spreadsheet (even with CSS formatting).
You basically have to serve that HTML content from a Django view with an Excel content type (application/vnd.ms-excel is the registered one), as Roberto said.
Or if you feel adventurous, you could use something like Downloadify to prepare the file to be downloaded on the client side.
