I have a use case. Say there is a PDF report containing test data for some manufactured components,
and this PDF report is loaded into a database using internally developed software.
We need to develop a reconciliation program that compares the data in the PDF report with the data in the database. We can assume the PDF file has a fixed template.
If the PDF contains many tables and some raw text, how would MySQL store this data: in one table or in many tables?
Please suggest an approach (preferably in Python) for comparing the data.
Have a look at the example in "Finding and extracting specific text from URL PDF files, without downloading or writing (solution)" and see if it helps. It worked efficiently for me. That example reads the PDF from a URL, but you could simply change the input source to be your DB. In your case you can remove the two if statements under the if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal): line. You mention that your PDFs share the same template. If you want to extract text from one specific area of the template, use the print statement that has been commented out to find the coordinates of the desired data. Then, as in the example, use those coordinates in if statements.
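A minimal sketch of that coordinate-based approach, assuming pdfminer.six is installed and that the fixed template puts each field inside a known bounding box. The function names, field names and regions are hypothetical, and db_values stands in for whatever your query against the loaded tables returns.

```python
def extract_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for every horizontal text box in the PDF."""
    # Imported lazily so the helpers below can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBoxHorizontal

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        for obj in page:
            if isinstance(obj, LTTextBoxHorizontal):
                # bbox is (x0, y0, x1, y1) in PDF points, origin at bottom-left.
                yield page_number, obj.bbox, obj.get_text().strip()


def text_in_region(boxes, page_number, region):
    """Return the text of the first box on the page lying fully inside region."""
    rx0, ry0, rx1, ry1 = region
    for page, (x0, y0, x1, y1), text in boxes:
        if page == page_number and x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1:
            return text
    return None


def reconcile(pdf_values, db_values):
    """Return {field: (pdf_value, db_value)} for every field that differs."""
    fields = set(pdf_values) | set(db_values)
    return {
        f: (pdf_values.get(f), db_values.get(f))
        for f in sorted(fields)
        if pdf_values.get(f) != db_values.get(f)
    }
```

To use it, materialise the boxes once with boxes = list(extract_text_boxes("report.pdf")), build pdf_values by calling text_in_region for each field of the template, and pass that together with the database row to reconcile. An empty result means the two sides agree. Numeric fields will likely need normalising (units, rounding) before comparison.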
Related
For a Python web project (with Django) I developed a tool that generates an XLSX file. For ergonomics and ease of use, I would like to embed this Excel file in my HTML page.
So I first thought of converting the XLSX to an HTML table with the xlsx2html Python library. It works, but since I can't set the cell sizes or trim the content during conversion, I end up with huge cells and tiny text.
I found an interesting way to embed an Excel window in a web page using the HTML tag associated with OneDrive, but since my file lives in my code and not on Excel Online, I cannot import it like that. Yet the display is perfect, and I don't need the user to interact with this table.
I have searched a lot for other methods, but apart from writing a function that walks through my file and generates the HTML table markup line by line, I have the feeling that there is no simple way to convert or display it on my web page.
I am not used to this kind of requirement and wonder whether there is a cleaner way to display an Excel file in HTML.
Does it make sense to write a function that builds my HTML table markup as a string? Or should I find a library that does it? Maybe there is a specific Django library?
Thank you for your experience
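For what it's worth, the "build the table as a string" option can stay quite small. This is only a sketch, assuming the workbook is read with openpyxl (any reader that yields rows of cell values would do). The function names and the max_chars trimming limit are made up for illustration.

```python
from html import escape


def rows_to_html_table(rows, max_chars=40):
    """Build an HTML <table> string from an iterable of rows of cell values."""
    lines = ["<table>"]
    for i, row in enumerate(rows):
        tag = "th" if i == 0 else "td"  # treat the first row as the header
        cells = []
        for value in row:
            text = "" if value is None else str(value)
            if len(text) > max_chars:
                # Trim long content so cells stay a manageable size.
                text = text[: max_chars - 1] + "…"
            cells.append(f"<{tag}>{escape(text)}</{tag}>")
        lines.append("<tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


def xlsx_to_html(path, sheet_name=None):
    """Read one worksheet with openpyxl and render it as an HTML table."""
    from openpyxl import load_workbook  # lazy import: third-party dependency

    workbook = load_workbook(path, read_only=True, data_only=True)
    sheet = workbook[sheet_name] if sheet_name else workbook.active
    return rows_to_html_table(sheet.iter_rows(values_only=True))
```

In a Django view you would pass the returned string to the template context and render it with the safe filter, since the cell contents are already escaped. Cell widths and font size can then be controlled from CSS rather than during conversion.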
I have a question for you. I'm working on a new Jenkins instance, and as a result of the job I get a CSV file with errors, if there were any during the test. I would like to generate an HTML report from this CSV file, which would be more convenient than opening Excel and loading the CSV file to see the errors. I came across the HTML Publisher plugin, but I don't know if it supports generating HTML reports from CSV files. Alternatively, I could do something like this with a Python script and show the resulting HTML file in the artifacts. Do you have any ideas?
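The Python script route needs only the standard library. A minimal sketch, assuming the first CSV row is a header; the function name, default title and file names are placeholders.

```python
import csv
from html import escape


def csv_to_html(csv_path, html_path, title="Test errors"):
    """Render a CSV file as a standalone HTML page; return the number of data rows."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header, data = (rows[0], rows[1:]) if rows else ([], [])

    parts = [
        "<!DOCTYPE html>",
        f"<html><head><meta charset='utf-8'><title>{escape(title)}</title></head><body>",
        f"<h1>{escape(title)}</h1>",
    ]
    if not data:
        parts.append("<p>No errors reported.</p>")
    else:
        parts.append("<table border='1'>")
        parts.append("<tr>" + "".join(f"<th>{escape(h)}</th>" for h in header) + "</tr>")
        for row in data:
            parts.append("<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>")
        parts.append("</table>")
    parts.append("</body></html>")

    with open(html_path, "w", encoding="utf-8") as f:
        f.write("\n".join(parts))
    return len(data)
```

The job would call csv_to_html("errors.csv", "report.html") as a build step after the tests. The generated file can then be archived as an artifact or handed to a publishing plugin, which as far as I know displays HTML the build has already produced rather than generating it from CSV.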
I've come across an assignment which requires me to extract tabular data from images in a PDF file into neatly formatted dataframes via Python code. There are several files to be processed, and the relevant pages may have different page numbers in each file, hence the sequence of steps for this problem (my assumption) is:
Navigate to relevant section of the pdf
Extract images of the tabular data
Extract data from the images, format and convert to dataframes.
Some Google searches turned up libraries for PDF text extraction, table extraction and more, but only modular solutions.
I would appreciate some help in this regard. What packages should I use? Is my approach correct?
Can I get references to any helpful code snippets for similar problems?
page structure of the required tables
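A rough sketch of the last step only, assuming an OCR engine has already produced one entry per word with its pixel position (for example, pytesseract's image_to_data reports left and top for each word). The function name and the row_tolerance value are illustrative. It groups words, not cells, so a cell containing several words comes out as several entries.

```python
def words_to_rows(words, row_tolerance=10):
    """Group OCR words into table rows by vertical position.

    words: iterable of dicts with 'text', 'left' and 'top' (pixel coordinates).
    Returns a list of rows, each a list of word texts ordered left to right.
    """
    rows = []  # each entry: [reference_top, [word dicts]]
    for word in sorted(words, key=lambda w: w["top"]):
        if not word["text"].strip():
            continue  # skip empty OCR tokens
        if rows and abs(word["top"] - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append(word)  # close enough vertically: same row
        else:
            rows.append([word["top"], [word]])  # start a new row
    return [
        [w["text"] for w in sorted(members, key=lambda w: w["left"])]
        for _, members in rows
    ]
```

If every row ends up with the same number of entries, pandas.DataFrame(rows[1:], columns=rows[0]) turns the result into a dataframe with the first row as the header. Tables with merged or multi-word cells need column boundaries as well, which is where a dedicated table extraction tool or service tends to earn its keep.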
This started as a comment. I believe the answer is valid as it is in no way an endorsement of the service. I don't even use it. I know Azure uses SO as well.
This is the stuff of commercial services. You can try Azure Form Recognizer (with which I am not affiliated):
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer
Here are some python examples of how to use it:
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/how-to-guides/try-sdk-rest-api?pivots=programming-language-python
The AWS equivalent is Textract https://aws.amazon.com/textract
The Google Cloud version is called Form Parser - see https://cloud.google.com/document-ai/docs/processors-list#processor_form-parser
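As an illustration of what calling one of these services looks like, here is a sketch against the Azure Python SDK, assuming the azure-ai-formrecognizer package (version 3.2 or later) and the prebuilt layout model. The endpoint and key are yours to supply, and both function names are made up. The service has since been renamed, so check the linked documentation for the current package before relying on this.

```python
def cells_to_grid(table):
    """Turn a table with row_count, column_count and cells into a list of row lists."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return grid


def analyze_tables(path, endpoint, key):
    """Send one document to the prebuilt layout model and return its tables as grids."""
    # Lazy imports: these come from third-party packages, not the standard library.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-layout", document=f)
        result = poller.result()  # blocks until the analysis has finished
    return [cells_to_grid(table) for table in result.tables]
```

Each grid is a plain list of rows, so pandas.DataFrame(grid[1:], columns=grid[0]) gives a dataframe when the first row holds the headers.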
I want to automate some report creation. Some elements that I need in the report are saved as rich text, that is, as an HTML file. There are a couple of libraries for converting HTML to PDF, such as html2pdf or pdfforge. However, I would also like to add extra information to the report that is not in this HTML file, for example a title or some information queried from the DB.
Does anyone have a suggestion to do this?
Thanks in advance.
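One possible approach is to compose a complete HTML page first and only then convert it. A minimal sketch, assuming the stored rich text is a trusted HTML fragment and the extra values have already been queried from the DB. The function names are hypothetical, and WeasyPrint is used for the conversion step purely as an example; it is not one of the libraries named in the question.

```python
from html import escape


def build_report_html(title, rich_text_html, extra_rows):
    """Wrap a stored rich-text HTML fragment with a title and a table of extra values."""
    rows = "".join(
        f"<tr><th>{escape(str(k))}</th><td>{escape(str(v))}</td></tr>"
        for k, v in extra_rows
    )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<title>{escape(title)}</title></head><body>"
        f"<h1>{escape(title)}</h1>"
        f"<table>{rows}</table>"
        # The fragment is inserted unescaped because it is already HTML.
        f"{rich_text_html}"
        "</body></html>"
    )


def write_pdf(html, pdf_path):
    """Render the composed HTML to a PDF file."""
    from weasyprint import HTML  # lazy import: third-party dependency

    HTML(string=html).write_pdf(pdf_path)
```

For example, build_report_html("Batch report", fragment, [("Operator", name), ("Date", run_date)]) returns a single string that any HTML-to-PDF converter can take as input, so the extra information never has to live in the original HTML file.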
I want to email out a document that will be filled in by many people and emailed back to me. I will then parse the responses using Python and load them into my database.
What is the best format to send out the initial document in?
I was thinking of an interactive .pdf, but I do not want to have to pay for Adobe XI. Alternatively, maybe a .html file, but I'm not sure how easy it is to save its state once it's been filled in so that it can be emailed back to me. A .xls file may also be a solution, but I'm leaning away from it simply because it would not be a particularly professional-looking format.
The key points are:
Answers can be easily parsed using Python
The format should be common enough to open on most computers
The document should look relatively pleasing to the eye
Send them a web page with a FORM section, complete with some JavaScript to grab the contents of the controls and send them to you (e.g. in JSON format) when they press "submit".
Another option is to set it up as a web application. There are several Python web frameworks that could be used for that. You could then e-mail people a link to the web-app.
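To give an idea of how small such a web app can be, here is a sketch using only the standard library's WSGI support instead of a full framework. The form fields and the in-memory RESPONSES list are placeholders; a real deployment would write to your database and use a proper framework and server.

```python
import json
from urllib.parse import parse_qs

FORM_PAGE = b"""<!DOCTYPE html>
<html><body>
<form method="post" action="/">
  <label>Name <input name="name"></label>
  <label>Answer <input name="answer"></label>
  <button type="submit">Submit</button>
</form>
</body></html>"""

RESPONSES = []  # stand-in for a real database table


def app(environ, start_response):
    """WSGI app: GET serves the form, POST stores the submitted fields."""
    if environ["REQUEST_METHOD"] == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length).decode("utf-8")
        # parse_qs returns a list per field; keep the first value of each.
        fields = {k: v[0] for k, v in parse_qs(body).items()}
        RESPONSES.append(fields)
        payload = json.dumps({"saved": fields}).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json")])
        return [payload]
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [FORM_PAGE]
```

For a local trial, wsgiref.simple_server.make_server("", 8000, app).serve_forever() serves the form on port 8000. Since the answers arrive as named fields, there is nothing to parse out of emails or attachments afterwards.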
Why don't you use Google Docs for the form? Create the form in Google Docs and save the answers to an Excel sheet. Then use any Python Excel reader (search for one) to read the file. This way you don't need to parse emails, and it performs well too. Or you could just make a simple form using App Engine and save the data directly to the database.
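If the response sheet is exported as CSV rather than Excel, the standard library is enough to load it. A sketch, assuming a header row in the export and using SQLite as a stand-in for your database; the function name and table name are placeholders.

```python
import csv
import sqlite3


def load_responses(csv_file, conn, table="responses"):
    """Insert every row of an exported responses CSV into a table; return the row count."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    if not columns:
        return 0
    # Quote column names so headers containing spaces or punctuation still work.
    quoted = ", ".join('"' + c.replace('"', '""') + '"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({quoted})')
    rows = [[row[c] for c in columns] for row in reader]
    conn.executemany(f'INSERT INTO "{table}" ({quoted}) VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)
```

Typical use is to open the exported file with open(path, newline="", encoding="utf-8") and pass the file object in, together with an open database connection.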