I have a question: I'm working on a new Jenkins instance, and the job produces a CSV file listing any errors that occurred during the test. I would like to generate an HTML report from this CSV file, which would be more convenient than opening Excel and loading the CSV to see the errors. I came across the HTML Publisher plugin, but I don't know whether it supports generating HTML reports from CSV files. Alternatively, this could be done with a Python script, with the resulting HTML file shown in the artifacts. Do you have any ideas?
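A minimal sketch of the Python-script route mentioned above, assuming the CSV has a header row (file names are placeholders); pandas can do the CSV-to-HTML conversion in a couple of lines:

import pandas as pd

# Read the error CSV produced by the test job.
errors = pd.read_csv("errors.csv")

# Render a simple HTML table; archive report.html as a build artifact
# or point HTML Publisher at it.
with open("report.html", "w", encoding="utf-8") as f:
    f.write(errors.to_html(index=False))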
I have one use case. Let's say there is a PDF report with data from testing of some manufacturing components,
and this PDF report is loaded into a DB using some internally developed software.
We need to develop a reconciliation program in which the data from the PDF report is compared against the database. We can assume the PDF file has a fixed template.
If the PDF contains many tables and some raw text data, how should MySQL store this PDF data: in one table or in many tables?
Please suggest an approach (preferably in Python) for comparing the data.
Have a look at this example and see if it helps: Finding and extracting specific text from URL PDF files, without downloading or writing (solution). I found it worked efficiently for me. It assumes the PDF is URL-based, but you could simply change the input source to your DB. In your case you can remove the two if statements under the if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal): line. You mention having PDFs with the same template; if you are looking to extract text from one specific area of the template, use the print statement that has been commented out to find the coordinates of the desired data. Then, as is done in the example, use those coordinates in if statements.
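A condensed sketch of that approach with pdfminer.six's high-level API; the coordinate window below is made up, so find real values with the print call first:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBoxHorizontal

for page_layout in extract_pages("report.pdf"):
    for obj in page_layout:
        if isinstance(obj, LTTextBoxHorizontal):
            # print(obj.bbox, obj.get_text())  # uncomment to discover coordinates
            x0, y0, x1, y1 = obj.bbox
            # Hypothetical coordinate window for the field of interest:
            if 50 < x0 < 300 and 600 < y1 < 750:
                print(obj.get_text())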
I want to automate some report creation. Some elements that I need in the report are saved as rich text, i.e. an HTML file. There are a couple of libraries for converting that, such as html2pdf or pdfforge. However, I would also like to add extra information to the report that is not in this HTML file, for example a title or some information queried from the DB.
Does anyone have a suggestion to do this?
Thanks in advance.
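One way (a sketch only; WeasyPrint is my assumption here, the question doesn't name it) is to wrap the stored rich-text fragment in a template that carries the extra pieces, then render the combined document to PDF:

from weasyprint import HTML

# Hypothetical inputs: the stored rich-text fragment, plus extra data
# (e.g. a title queried from the DB) that is not in the HTML file.
with open("body.html", encoding="utf-8") as f:
    body = f.read()
title = "Monthly test report"  # placeholder for the DB-sourced value

full_html = f"<html><body><h1>{title}</h1>{body}</body></html>"
HTML(string=full_html).write_pdf("report.pdf")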
I'm trying to parse the PDF into HTML, and then I would like to extract the headings and subheadings from the tags. The PDF document was generated by Microsoft Word, so I'm pretty sure there must be a way to get those headings.
So far I have tried parsing with Apache Tika and PDFMiner.six, but the HTML I get doesn't have any tags I could use to extract the headings and subheadings of the document.
I wonder if there is a way to do it; I would appreciate any help. Thank you.
I suggest you use GROBID, which is a machine learning library for extracting, parsing, and re-structuring raw documents such as PDF into structured XML/TEI encoded documents, with a particular focus on technical and scientific publications.
A simple python client for GROBID REST services is available at https://github.com/kermitt2/grobid-client-python
This Python client can be used to process a set of PDFs in a given directory with the GROBID service. Results are written to a given output directory and include the resulting XML/TEI representation of each PDF.
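A minimal sketch with that client, assuming a GROBID server is already running and that config.json points the client at it:

from grobid_client.grobid_client import GrobidClient

# config.json holds the GROBID server address (assumed to be running).
client = GrobidClient(config_path="./config.json")

# Process every PDF in ./pdfs and write the TEI XML results to ./tei.
client.process("processFulltextDocument", "./pdfs", output="./tei", n=10)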
Hope this helps.
I am trying to extract a table from a PDF document (example). It's not a scan/an image, so please focus on non-OCR solutions. OCR table extraction is here.
I tried the route of pdf -> html -> extract table. The PDF mentioned above produces garbage when converted to HTML, maybe because of the font; the document is not in English.
Extracting from the PDF by x and y coordinates is not an option, as the solution needs to work for future PDFs from the URL mentioned above, which will contain the table but not always in the same position.
The PDF does not contain explicit table data. It only contains lines and character glyphs which we tend to interpret as tables. Thus your task involves putting our human table recognition capabilities into code which is quite a task.
Generally speaking, if you are sure enough future PDFs will be generated by the same software in a very similar manner, it might be worth the time to investigate the file for some easy to follow hints to recognize the contents of individual fields.
Your specific document, though, has an additional shortcoming: It does not contain the required information for direct text extraction! You can try copying & pasting from Adobe Reader and you'll get (at least I do) semi-random characters from the WinAnsi range.
This is due to the fact that all fonts in the document claim to use WinAnsiEncoding even though the characters referenced this way are definitely not from the WinAnsi character set.
Thus reliable text extraction from your document without OCR is impossible after all!
(Trying copy&paste from Adobe Reader generally is a good first test whether text extraction is feasible at all; the text extraction methods of the Reader have been developed for many many years and, therefore, have become quite good. If you cannot extract anything sensible with Acrobat Reader, text extraction will be a very difficult task indeed.)
You could use Tabula:
http://tabula.nerdpower.org
It's free and kinda easy to use.
One option is to use pdf-table-extract: https://github.com/ashima/pdf-table-extract.
Extracting tables from PDF documents is extremely hard as PDF does not contain a semantic layer.
Camelot
You can try camelot, maybe even in combination with its web interface excalibur:
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{'accuracy': 99.02, 'whitespace': 12.24, 'order': 1, 'page': 1}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
See also python-camelot
Tabula
tabula can be installed via
pip install tabula-py
But it requires Java, as tabula-py is only a wrapper for the Java project.
It's used like this:
import tabula
# Read pdf into list of DataFrame
dfs = tabula.read_pdf("test.pdf", pages='all')
See also:
Reading a specific table with tabula
tabula
AWS Textract
I haven't tried it recently, but AWS Textract claims:
Amazon Textract can extract tables in a document, and extract cells, merged cells, and column headers within a table.
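A sketch with boto3 (my assumption; nothing AWS-specific is shown above). The synchronous API below handles single-page documents; multi-page PDFs go through the asynchronous start_document_analysis instead:

import boto3

# Assumes AWS credentials are configured in the environment.
textract = boto3.client("textract")
with open("example.pdf", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES"],
    )

# Tables come back as Block dicts; cells are separate CELL blocks
# linked to their parent TABLE block via Relationships.
tables = [b for b in response["Blocks"] if b["BlockType"] == "TABLE"]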
PdfPlumber
pdfplumber table extraction methods:
import pdfplumber

# Open the PDF and extract the first table from the first page.
with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()  # list of rows, each a list of cell strings
See also
Tabula vs Camelot