How to extract tables from a PDF using Camelot? - Python

I want to extract all tables from a PDF using Camelot in Python 3.
import camelot
# PDF file to extract tables from
file = "./pdf_file/ooo.pdf"
tables = camelot.read_pdf(file)
# number of tables extracted
print("Total tables extracted:", tables.n)
# print the first table as Pandas DataFrame
print(tables[0].df)
# export individually
tables[0].to_csv("./pdf_file/ooo.csv")
This gives me only one table, from the first page of the PDF.
How can I extract all the tables from the PDF file?

tables = camelot.read_pdf(file, pages='1-end')
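If the pages parameter is not specified, Camelot processes only the first page (the default is pages='1'). Passing pages='1-end' or pages='all' makes it process the whole document.
For more detail, see the official documentation.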
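As a minimal sketch of the same idea, the function below reads every page and writes each detected table to its own CSV file. The helper names, the output directory, and the file naming scheme are assumptions made for illustration; only read_pdf, to_csv, and the page and order attributes come from Camelot.

```python
from pathlib import Path


def csv_path_for(pdf_path, page, order, out_dir):
    """Build an output name such as out/ooo_page2_table1.csv (naming scheme is an assumption)."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page{page}_table{order}.csv"


def export_all_tables(pdf_path, out_dir):
    """Read every page of the PDF and write each detected table to its own CSV file."""
    import camelot  # imported here so the naming helper works without Camelot installed

    tables = camelot.read_pdf(pdf_path, pages="1-end")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for table in tables:
        # table.page is the page number, table.order the table's position on that page
        target = csv_path_for(pdf_path, table.page, table.order, out_dir)
        table.to_csv(str(target))
        written.append(target)
    return written
```

It could be called as export_all_tables("./pdf_file/ooo.pdf", "./pdf_file/csv"), which returns the list of files written.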

To extract tables from a PDF with Camelot, you can use the code below. The stream flavor is worth trying when the tables have no ruling lines between cells, because it infers the table structure from whitespace, whereas the default lattice flavor relies on drawn lines. If the extraction is still poor, tune the row_tol and edge_tol parameters, for example row_tol=0 and edge_tol=500.
import camelot
pdf_archive = camelot.read_pdf(file_path, pages="all", flavor="stream")
# iterate over the extracted tables and print each one as a DataFrame
for pdf_table in pdf_archive:
    print(pdf_table.df)

Related

Python extract text between two tables as title for the table(outside tables) from pdf with tabula

I am trying to extract tables from a PDF file. After trying several packages, tabula turned out to be the best at extracting the tables from my PDF correctly.
Each table has a title above it that is not part of the table itself.
import tabula.io as tb
file_path = ""
tables = tb.read_pdf(file_path, pages="1")
I would like to extract the title of each table as well. I tried other packages, but they also return text from inside the tables, and I cannot tell whether a piece of text came from inside or outside a table.
I have tried Camelot as well. I know it can extract text from the whole page, but it messes up my table format.
Is there a way to extract only the text outside the tables, or to extract each table and its title at the same time?
Thanks!
Reference table image from: https://pspdfkit.com/guides/ios/customizing-the-interface/changing-the-document-title/
Camelot exposes the page dimensions through the utils.get_page_layout function:
import camelot
# self.path is the path to the PDF file (this snippet comes from inside a class)
metadata, dim = camelot.utils.get_page_layout(self.path)
The page dimensions can be used to compute the area where the table title is likely to be:
box_for_table_name = (
    table._bbox[0],
    dim[1] - table._bbox[3] - 35,
    table._bbox[2],
    dim[1] - table._bbox[3] + 2
)
This calculation converts Camelot's PDF coordinates, whose origin is the bottom-left corner of the page, into the top-left-origin coordinates that fitz uses. It selects a strip from 35 points above the table's top edge to 2 points below it.
I am not sure these offsets fit your case, but you can adjust them according to the font size of the title and the gap between the title and the table.
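You can then extract the title with fitz (PyMuPDF):
import fitz
clip = fitz.Rect(box_for_table_name).round()
# self.extract_text is the answerer's own helper; with PyMuPDF directly, page.get_text(clip=clip) returns the text inside the rectangle
title = self.extract_text(clip=clip)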
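The coordinate conversion in this answer can be sketched as a small standalone function. The function name and the sample numbers are made up for illustration; the default offsets of 35 and 2 points are the ones used in the answer and will likely need tuning for other documents.

```python
def title_box_above_table(table_bbox, page_height, above=35, below=2):
    """Return (x0, y0, x1, y1) in top-left-origin coordinates for a strip above a table.

    table_bbox is (x1, y1, x2, y2) in PDF coordinates (origin at the bottom-left),
    with y2 being the table's top edge. above/below are offsets in points.
    """
    x_left, _y_bottom, x_right, y_top = table_bbox
    # distance from the top of the page down to the table's top edge
    top_edge_from_page_top = page_height - y_top
    return (
        x_left,
        top_edge_from_page_top - above,
        x_right,
        top_edge_from_page_top + below,
    )


# Illustrative numbers only: a page 792 pt high, table spanning y = 400..600
box = title_box_above_table((72, 400, 540, 600), page_height=792)
print(box)
```

The resulting tuple can be passed to fitz.Rect to build the clip rectangle.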

Pdfplumber cannot recognise table python [duplicate]

I use pdfplumber to extract the table on page 2, which is normally in section 3. It works for some PDFs but not for others. For the PDFs that fail, pdfplumber seems to read the bottom table instead of the table I want.
How can I get the table?
link of the pdf which doesn't work:
pdfA
link of the pdf which works:
pdfB
Here is my code:
import pdfplumber
pdf = pdfplumber.open("/Users/chueckingmok/Desktop/selenium/Shell Omala 68.pdf")
page = pdf.pages[1]
table=page.extract_table()
import pandas as pd
df = pd.DataFrame(table[1:], columns=table[0])
df
and the result is
But the table I want in page 2 is
However, this code works for pdfB (which I mentioned above).
Btw, the table I want in each pdf is in section 3.
Can anyone help?
Many thanks
Joan
Updated:
I just found a good package that extracts from these PDF files without any problems.
The package is PyMuPDF, which is imported as fitz.
Here is a solution to the problem, but first a few points.
You used pdfplumber for table extraction, but it is worth reading about its table settings. There are many of them, and adjusting them to your document usually solves problems like this one. See the table extraction section of the pdfplumber documentation.
The solution is below, but do check the documentation: it also covers text extraction, word extraction, and more, so it should answer most future questions about pdfplumber.
To better understand the table settings, you can also use visual debugging. It is one of pdfplumber's most useful features, because it shows exactly how a given set of table settings affects which tables are found and how they are extracted.
Here is the solution to your problem:
import pandas as pd
import pdfplumber
pdf = pdfplumber.open("GSAP_msds_01259319.pdf")
p1 = pdf.pages[1]
table = p1.extract_table(table_settings={
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "snap_tolerance": 4,
})
df = pd.DataFrame(table[1:], columns=table[0])
df
See the output of the Above Code
To extract two tables from the same page, I use this code:
import pdfplumber
with pdfplumber.open("file.pdf") as pdf:
    # find_tables() returns every table detected on the page
    first_page = pdf.pages[0].find_tables()
    t1_content = first_page[0].extract(x_tolerance=5)
    t2_content = first_page[1].extract(x_tolerance=5)
    print(t1_content, '\n', t2_content)

Is it possible extract a specific table with format from a PDF?

I am trying to extract a specific table from a PDF. The PDF looks like the image below.
I tried several Python libraries.
With tabula-py
from tabula import read_pdf
from tabulate import tabulate
df = read_pdf("./tmp/pdf/Food Calories List.pdf")
df
With PyPDF2
import pandas as pd
import PyPDF2
# Note: PdfFileReader, getNumPages, getPage and extractText are the legacy PyPDF2 (pre-3.0) names
pdf_file = open("./tmp/pdf/Food Calories List.pdf", 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
# one row per text line, split into cells on ';'
df = pd.DataFrame([x.split(';') for x in page_content.split('\n')])
Even with textract and Beautiful Soup the result is the same: the output format is a mess. Is there any way to extract this table in a cleaner format?
I suspect the issues stem from the fact that the table has merged cells (on the left). Reading data from a table only works reliably when the rows and cells are consistent, not when some are merged and some are not.
I'd skip the first two columns and then recreate and populate them on the left-hand side once the table is loaded (as a pandas DataFrame, for example).
Then you have one label per row and can work with the data consistently; otherwise the number of cells per column will be inconsistent.
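One way to repopulate the labels is to copy each label down into the rows where the merged cell left a blank. The sketch below does this on plain lists of rows; the function name and the sample rows are invented for illustration, and it assumes that a blank or None cell means "same label as the row above".

```python
def fill_merged_labels(rows, label_columns=1):
    """Copy the last seen label down into rows where a merged cell left it empty."""
    filled = []
    last_labels = [""] * label_columns
    for row in rows:
        row = list(row)
        for i in range(label_columns):
            if row[i] is None or str(row[i]).strip() == "":
                row[i] = last_labels[i]   # blank: inherit the label from the row above
            else:
                last_labels[i] = row[i]   # new label: remember it
        filled.append(row)
    return filled


# Made-up sample rows, as they might come out of a table with a merged left column
rows = [
    ["Fruit", "Apple", 52],
    ["", "Banana", 89],
    ["Vegetables", "Carrot", 41],
    [None, "Potato", 77],
]
result = fill_merged_labels(rows)
print(result)
```

With the table already in a pandas DataFrame, a forward fill on the label columns should achieve the same effect once the blanks are converted to missing values.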
I would look into using tabula templates which you can dynamically generate based on word locations on page. This will give tabula more guidance on which area to consider and lead to more accurate extraction. See tabula.read_pdf_with_template as documented here: https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.read_pdf_with_template.
Camelot can be another Python library to try. Its advanced settings seem to show that it can handle merged cells. However, this will likely require some adjustments to certain settings such as copy_text and shift_text.
Note: Camelot can only read text-based tables. If the table is inside an image, it won't be able to extract it.
If the above is not an issue, try the sample code below:
import camelot
tables = camelot.read_pdf('./tmp/pdf/Food Calories List.pdf', pages='1', copy_text=['v'])
print(tables[0].df)

How can I classify the chapters of a PDF file and analyze the content per chapter?

I want to classify and analyze chapters and subchapters from a book in PDF format. Specifically, I want to count the words and find out how often each word occurs in each chapter.
# install first with: pip install PyPDF2
import PyPDF2
# Creating a pdf file object
pdf = open('C:/Users/Dominik/Desktop/bsc/pdf1.pdf',"rb")
# creating pdf reader object
pdf_reader = PyPDF2.PdfFileReader(pdf)
# checking number of pages in a pdf file
print(pdf_reader.numPages)
print(pdf_reader.getDocumentInfo())
# creating a page object
page = pdf_reader.getPage(0)
# finally extracting text from the page
print(page.extractText())
# Extracting entire PDF
for i in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(i)
    a = str(1+pdf_reader.getPageNumber(page))
    print (a)
    page_content = page.extractText()
    print (page_content)
# closing the pdf file
pdf.close()
This code already works. Now I want to do more analysis, such as
storing each chapter in its own variable and counting its words.
In the end, everything should be stored in an Excel file.
I tried something similar with CVs in PDF format, and this is what I learned:
PDF is an unstructured format, so there is no general way to extract structured information from every PDF. But if you know how the book is laid out, you can identify the chapter titles by a distinguishing feature, for example bold or italic formatting. This link can help you extract that information.
You can then read through a chapter until you hit the next chapter title.
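A rough sketch of the "read until the next chapter title" step, working on already extracted text lines. Note that it recognizes headings by a text pattern rather than by bold or italic formatting, and the pattern itself ("Chapter" followed by a number) is an assumption that would need to match the actual book. The function names and sample text are invented for illustration.

```python
import re
from collections import Counter

# Assumed heading pattern: a line such as "Chapter 1" or "Chapter 12: Title"
CHAPTER_HEADING = re.compile(r"^\s*Chapter\s+\d+\b.*$", re.IGNORECASE)


def split_into_chapters(lines):
    """Group lines under the most recent chapter heading; text before the first heading is skipped."""
    chapters = {}
    current = None
    for line in lines:
        if CHAPTER_HEADING.match(line):
            current = line.strip()
            chapters[current] = []
        elif current is not None:
            chapters[current].append(line)
    return chapters


def word_counts(lines):
    """Count lower-cased words in a list of text lines."""
    words = re.findall(r"[A-Za-z']+", " ".join(lines).lower())
    return Counter(words)


# Made-up sample text standing in for the extracted page contents
sample = [
    "Chapter 1: Beginnings",
    "The cat sat on the mat.",
    "Chapter 2: Endings",
    "The dog left. The end.",
]
chapters = split_into_chapters(sample)
counts = {title: word_counts(body) for title, body in chapters.items()}
totals = {title: sum(c.values()) for title, c in counts.items()}
print(totals)
```

The per-chapter counters could then be collected into a pandas DataFrame and written out with to_excel, if an Excel file is the desired output.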

Creating publication quality tables in python

I'd like to create publication quality tables for output as svg or jpg or png images using python.
I'm familiar with the texttable module which produces nice text tables but if I have for example
data = [['Head 1','Head 2','Head 3'],['Sample Set Type 1',12.8,True],['Sample Set Type 2',15.7,False]]
and I wanted to produce something that looked like
Is there a module I can turn to, or can you point me to a process for going about it?
There are many options.
You can convert a Pandas dataframe to Latex as per https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_latex.html
You can also use Tabular to output latex source as per http://en.wikibooks.org/wiki/LaTeX/Tables
You can use ReportLab, as per Python reportlab inserting image into table
You could also just write an HTML table file and style it with css.
with open("example.html", "w") as of:
    of.write("<html><table>")
    for index, row in enumerate(data):
        # the first row holds the column headers, so its cells are <th>
        tag = "th" if index == 0 else "td"
        of.write("<tr>")
        for cell in row:
            # str() is needed because cells may be numbers or booleans
            of.write("<" + tag + ">" + str(cell) + "</" + tag + ">")
        of.write("</tr>")
    of.write("</table></html>")
You can do something similar with Latex tables as an output.
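For the LaTeX variant, a minimal sketch could look like the following. The function name, the left-aligned column spec, and the placement of the \hline rules are choices made for illustration, and only a few special characters are escaped.

```python
def to_latex_tabular(data):
    """Render a list of rows (first row = header) as a LaTeX tabular environment."""
    def escape(cell):
        # escape the characters most likely to break a table cell
        text = str(cell)
        for ch in ("&", "%", "_", "#"):
            text = text.replace(ch, "\\" + ch)
        return text

    n_cols = len(data[0])
    lines = ["\\begin{tabular}{" + "l" * n_cols + "}", "\\hline"]
    for index, row in enumerate(data):
        lines.append(" & ".join(escape(cell) for cell in row) + " \\\\")
        if index == 0:
            lines.append("\\hline")  # rule under the header row
    lines.append("\\hline")
    lines.append("\\end{tabular}")
    return "\n".join(lines)


data = [['Head 1', 'Head 2', 'Head 3'],
        ['Sample Set Type 1', 12.8, True],
        ['Sample Set Type 2', 15.7, False]]
latex = to_latex_tabular(data)
print(latex)
```

The returned string can be pasted into a LaTeX document and compiled to PDF, which can then be converted to an image format.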
