Web app to extract certain fields from PDF invoices - python

I want to create a web app where I can upload invoices in PDF format (all invoices share a similar format) and have certain specific fields extracted and stored.
To extract the fields from the PDF, I am currently using regex, because for some reason OCR does not seem to work very accurately.
What would be the best way to go about this web app?
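A minimal sketch of the regex step, assuming the PDF text has already been extracted into a plain string by whatever PDF library is in use. The field names and patterns below are hypothetical and would need to be adapted to the labels that actually appear on the invoices.

```python
import re

# Hypothetical patterns: adjust them to the labels used in your invoice layout.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"\bInvoice\s*(?:Number|No\.?|#)\s*[:\-]?\s*(\S+)", re.IGNORECASE
    ),
    "date": re.compile(r"\bDate\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(
        r"\bTotal\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
}


def extract_fields(text):
    """Return a dict of field name -> captured string (None if not found)."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


if __name__ == "__main__":
    sample = "Invoice No: INV-1001\nDate: 2023-04-05\nSubtotal: $90.00\nTotal: $99.50"
    print(extract_fields(sample))
```

In a web app, the upload view would pass the extracted text to `extract_fields` and store the resulting dict. Returning None for missing fields makes it easy to flag invoices that need manual review.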

Related

How to embed an XLSX local file into HTML page with Python and Django

For a Python web project (with Django) I developed a tool that generates an XLSX file. For ergonomics and ease of use, I would like to embed this Excel file in my HTML page.
So I first thought of converting the XLSX to an HTML table with the xlsx2html Python library. It works, but since I can't set the desired size of my cells or trim the content during conversion, I end up with huge cells and tiny text.
I found an interesting approach using the HTML tag associated with OneDrive to embed an Excel window in a web page, but since my file lives in my code and not on Excel Online, I cannot import it that way. Yet the display is perfect, and I don't need the user to interact with this table.
I have searched a lot for other methods, but apart from writing a function that walks through my file and generates the HTML table markup line by line, I have the feeling that there is no simple way to convert or display it on my web page.
I am not familiar with this kind of requirement and wonder whether there is a cleaner method to display an Excel file in HTML.
Does it make sense to develop a function that builds my HTML table markup as a string? Or should I find a library that does it? Maybe there is a specific Django library?
Thank you for sharing your experience.
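A minimal sketch of the "build the HTML table as a string" option, using only the standard library. It assumes the sheet has already been read into rows of cell values (for example with openpyxl's `worksheet.iter_rows(values_only=True)`). The function name, the `max_chars` trimming parameter, and the CSS class are hypothetical.

```python
from html import escape


def rows_to_html_table(rows, max_chars=30):
    """Render rows as an HTML table string; the first row is the header."""

    def cell(value, tag):
        text = "" if value is None else str(value)
        # Trim long content so one cell cannot blow up the layout.
        if len(text) > max_chars:
            text = text[: max_chars - 3] + "..."
        # Escape so cell content cannot inject markup into the page.
        return f"<{tag}>{escape(text)}</{tag}>"

    lines = ['<table class="xlsx-preview">']
    for index, row in enumerate(rows):
        tag = "th" if index == 0 else "td"
        lines.append("<tr>" + "".join(cell(v, tag) for v in row) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_html_table([("Name", "Qty"), ("Widget", 3), (None, 1.5)]))
```

Cell sizes can then be controlled from CSS through the table's class, which is the part the conversion library did not allow. In a Django template the resulting string would have to be marked as safe before rendering.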

How to make PDF fillable?

I have links to unfillable PDFs and I want to make them fillable in my Python application.
I could use the Adobe SDK for Python, but does it allow me to automate making PDFs fillable? I just want to create the fillable fields and display the PDF for users to fill out. I have 10,000 PDFs, so I want to simply call an API to make them fillable. The fields are different in every PDF.
Adobe's manual Fill and Sign application works really well. Do they have an API that mimics it?

django wysiwyg text editor doesn't work properly with Android

I'm currently using summernote in my admin panel, where I can enter text and images, and it saves the formatted content in the DB by applying some HTML tags (<b>, <i>, etc.).
But when I render that content on Android through a JSON API, the HTML formatting is not applied correctly.
Now let me describe my actual scenario. I'm creating an application similar to Quora. It's a simple app where users can post questions and other users can answer. In the question details or in an answer, they can attach images as well.
For me, the order of the images matters (similar to Quora again): when a user attaches an image after some text, it should be saved in the DB in that order. I can't save the text in one place and the images in another, because then I won't be able to recover the actual position of the text and images. Where should I put image1 in the whole text: after line 1? Line 5? Line 7?
Django-summernote solved the image positioning problem but not the formatting. Is there a better library or approach that maintains both image positioning and text formatting for web and mobile applications (through a JSON API)?
PS: according to https://www.djangopackages.com/grids/g/wysiwyg/
django-summernote seems to be the best.

Form through Email that can be Parsed Using Python

I want to email out a document that will be filled in by many people and emailed back to me. I will then parse the responses using Python and load them into my database.
What is the best format to send out the initial document in?
I was thinking of an interactive .pdf but do not want to have to pay for Adobe XI. Alternatively, maybe a .html file, but I'm not sure how easy it is to save its state once it's been filled in so that it can be emailed back to me. A .xls file may also be a solution, but I'm leaning away from it simply because it would not be a particularly professional-looking format.
The key points are:
Answers can be easily parsed using Python
The format should be common enough to open on most computers
The document should look relatively pleasing to the eye
Send them a web-page with a FORM section, complete with some Javascript to grab the contents of the controls and send them to you (e.g. in JSON format) when they press "submit".
Another option is to set it up as a web application. There are several Python web frameworks that could be used for that. You could then e-mail people a link to the web-app.
Why don't you use Google Docs for the form? Create the form in Google Docs and save the answers in an Excel sheet. Then use any Python Excel reader (search for one) to read the file. This way you don't need to parse through emails, and it will be performance-friendly too. Or you could just make a simple form using App Engine and save the data directly to the database.
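Whichever of these options is chosen, the parsing side can stay small if the responses arrive as JSON. The following is a minimal sketch, assuming each submission is a JSON object with hypothetical `name`, `email`, and `answer` keys, and using SQLite from the standard library as a stand-in for the real database.

```python
import json
import sqlite3


def create_table(conn):
    """Create the (hypothetical) responses table if it does not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS responses (name TEXT, email TEXT, answer TEXT)"
    )


def store_response(conn, raw_json):
    """Parse one JSON form submission and insert it as a row."""
    data = json.loads(raw_json)
    # Parameterized query: values are never interpolated into the SQL text.
    conn.execute(
        "INSERT INTO responses (name, email, answer) VALUES (?, ?, ?)",
        (data["name"], data["email"], data["answer"]),
    )
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_table(connection)
    store_response(
        connection,
        '{"name": "Ada", "email": "ada@example.com", "answer": "yes"}',
    )
    print(connection.execute("SELECT * FROM responses").fetchall())
```

A submission with a missing key raises `KeyError`, which is a convenient place to log incomplete responses rather than load them.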

Generate multiple OpenOffice pages from a one-page template

I have a one-page OpenOffice document which is a POD template. Basically, I use this template to replace tags in the template with automatically generated data in the rendered document.
This is how it works:
# loads the POD OpenOffice template engine
from appy.pod.renderer import Renderer as OORenderer
# defines data for template tags
ctx = {'template_tag': 'my_data'}
# renders the template in my_model.odt using defined data
r = OORenderer('my_model.odt', ctx, 'rendered_file.pdf')
# saves into the specified PDF file
r.run()
I would like to generate several pages like that, and merge them into a single PDF file.
Is there a way to do it using POD?
Or maybe can I programmatically merge generated PDF files into a single file?
Thank you!
Regarding "and merge them into a single PDF file": just try pyPDF.
A Pure-Python library built as a PDF toolkit. It is capable of:
extracting document information (title, author, ...),
splitting documents page by page,
merging documents page by page,
cropping pages,
merging multiple pages into a single page,
encrypting and decrypting PDF files.
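A minimal sketch of the merge step. Note that it uses `pypdf`, the maintained successor of pyPDF, rather than the original package named above, and assumes it is installed. The function name and file names are hypothetical; each input would be one of the PDFs rendered from the template.

```python
def merge_pdfs(input_paths, output_path):
    """Concatenate the given PDF files, in order, into one output file."""
    if not input_paths:
        raise ValueError("at least one input PDF is required")

    # Imported here so the argument check above works without the dependency.
    from pypdf import PdfWriter

    writer = PdfWriter()
    for path in input_paths:
        # append() adds every page of the source document to the output.
        writer.append(path)
    writer.write(output_path)
    writer.close()


if __name__ == "__main__":
    # Hypothetical file names, e.g. one rendered PDF per template run.
    merge_pdfs(["rendered_1.pdf", "rendered_2.pdf"], "merged.pdf")
```

Rendering each page to its own PDF and merging afterwards keeps the one-page template unchanged.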
