Pdfplumber cannot recognise table [duplicate]

This question already has answers here:
How can I extract tables from PDF documents?
(4 answers)
Closed 10 days ago.
I use pdfplumber to extract the table on page 2, section 3. It works on some PDFs but not others; for the failing files, pdfplumber seems to read the bottom table instead of the one I want.
How can I get the table?
link of the pdf which doesn't work:
pdfA
link of the pdf which works:
pdfB
Here is my code:
import pdfplumber
pdf = pdfplumber.open("/Users/chueckingmok/Desktop/selenium/Shell Omala 68.pdf")
page = pdf.pages[1]
table=page.extract_table()
import pandas as pd
df = pd.DataFrame(table[1:], columns=table[0])
df
and the result is the bottom table rather than the table I want on page 2 (both screenshots omitted).
However, this code works for pdfB (which I mentioned above).
Btw, the table I want in each pdf is in section 3.
Anyone can help?
Many thanks
Joan
Updated:
I just found a package that extracts these PDF files without any problems: fitz, also known as PyMuPDF.

Hey, here is a solution to that problem, but first please read a few points below.
You used pdfplumber for table extraction, but you should read about its table settings: there are many of them, and once you tune them to your document you will usually find your answer there. See the pdfplumber API documentation on table extraction.
I give a solution to your problem below, but do check the pdfplumber documentation properly first; it covers table extraction as well as text extraction, word extraction, and more, so you are unlikely to need to ask again.
For a better understanding of the table settings, also use visual debugging, one of pdfplumber's best features: it shows exactly what the table settings do and how the tables are detected. See "Visual Debugging of Tables" in the documentation.
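As a sketch of how the settings and visual debugging fit together (the strategy names and the debug_tablefinder call are from pdfplumber's documented API; the file name and values here are illustrative):

```python
# Candidate strategies for pdfplumber's table finder: "lines" uses ruling
# lines drawn in the PDF, "text" infers edges from text alignment, and
# "explicit" uses coordinates you supply yourself.
table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "snap_tolerance": 4,  # edges closer than 4 points are snapped together
}

# Inside a pdfplumber session, visual debugging shows what these settings
# detect before you commit to extract_table():
#   import pdfplumber
#   with pdfplumber.open("GSAP_msds_01259319.pdf") as pdf:
#       im = pdf.pages[1].to_image()
#       im.debug_tablefinder(table_settings)
#       im.save("debug.png")
```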
Below is the solution to your problem:
import pandas as pd
import pdfplumber
pdf = pdfplumber.open("GSAP_msds_01259319.pdf")
p1 = pdf.pages[1]
table = p1.extract_table(table_settings={"vertical_strategy": "lines",
                                         "horizontal_strategy": "text",
                                         "snap_tolerance": 4})
df = pd.DataFrame(table[1:], columns=table[0])
df
See the output of the above code (screenshot omitted).

To extract two tables from the same pages, I use this code:
import pdfplumber
with pdfplumber.open("file.pdf") as pdf:
    first_page = pdf.pages[0].find_tables()
    t1_content = first_page[0].extract(x_tolerance=5)
    t2_content = first_page[1].extract(x_tolerance=5)
    print(t1_content, '\n', t2_content)

Related

Python extract text between two tables as title for the table(outside tables) from pdf with tabula

I am trying to extract tables from a PDF file. After trying multiple packages, tabula is the one that extracts the tables from my PDF correctly.
The thing is that each table has a title above it (not included in the table itself).
import tabula.io as tb
from tabula.io import read_pdf
file_path = ""
tables = tb.read_pdf(file_path, pages = "1")
I would like to extract the title along with each table. I tried other packages, but they also extract text from inside the table, and I cannot tell whether a given piece of text is inside or outside a table.
I have tried camelot as well; I know it can extract text from the whole page, but that messes up my table format.
Is there any way to extract only the text outside the tables, or any suggestion for extracting a table and its title at the same time?
Thanks!
Reference table image from https://pspdfkit.com/guides/ios/customizing-the-interface/changing-the-document-title/
Camelot provides the page dimensions via its utils.get_page_layout function:
import camelot
layout, dim = camelot.utils.get_page_layout("file.pdf")
The dimensions could be useful to detect coordinates of possible area for table name:
box_for_table_name = (
table._bbox[0],
dim[1] - table._bbox[3] - 35,
table._bbox[2],
dim[1] - table._bbox[3] + 2
)
This calculation converts PDF coordinates (origin at the bottom-left) into the top-left-origin coordinates the clip rectangle needs.
The exact offsets may not fit your case, but you can adjust them according to the title's font size and the gap between the title and the table.
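The coordinate flip can be sketched as a small standalone helper (a sketch only; the 35- and 2-point offsets are illustrative padding for the title band, and the bbox values in the example are made up):

```python
def title_band_above_table(bbox, page_height, pad_above=35, pad_below=2):
    """Given a table bbox (x0, y0, x1, y1) in PDF coordinates (origin at
    the bottom-left), return a top-left-origin rectangle covering the
    band just above the table, where its title is likely to sit."""
    x0, x1, y_top = bbox[0], bbox[2], bbox[3]
    return (
        x0,                               # same left edge as the table
        page_height - y_top - pad_above,  # top of the title band
        x1,                               # same right edge as the table
        page_height - y_top + pad_below,  # just past the table's top edge
    )

# Example: a table whose top edge is 700 points up an 842-point-tall page
box = title_band_above_table((50, 400, 550, 700), page_height=842)
# box == (50, 107, 550, 144)
```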
Then you are able to extract the title you want using fitz:
import fitz
doc = fitz.open("file.pdf")
page = doc[0]  # the page holding the table
clip = fitz.Rect(box_for_table_name).round()
title = page.get_text(clip=clip)

Partially searchable pdf document

I'm using the tabula library to read PDFs. Each PDF contains a table with headers (columns) and corresponding information. It all worked perfectly except for the last PDF.
code:
import tabula
df = tabula.read_pdf(path, pages="2", multiple_tables=False,
                     output_format="dataframe", pandas_options={"header": None})
part of the dataframe output (example):
nan SBI nan nan nan nan nan nan nan nan nan nan
JKL1LU1UKDAO/ /O/NEPLW45WF3CKL AF HSF1P PUAVKM RO0SA OSOAEAUMM5M31/6 PO LLÅ F
KLMIMOG 0TLSL P0EK RV V OKÅ GVJAVUAMNAWA ACADFUIF S JN FKFKLLLGLDAA2F LEV KA OTIF 2A4 KACNATULO01F2NVSCFRE BB AG05ANJA OLE4CPIVL1SGA 2AFK MR0HASET2PMG MLIONEKO0KF 0IEOJB1 L E NECGCVL1GXLDA 7019N8BVPV90
It is definitely not the code, since I even tried the web-based tabula at https://tabula.technology/, where you can specify the extraction area (as in the code I used), and it only occasionally recognizes a word or a character.
It seems to have to do with the way the table was constructed in the PDF. When I open the PDF for editing, I see a bunch of text boxes: sometimes chunks of text grouped together, sometimes separate letters or words.
There is also some sort of hidden layer of information on parts of the pages.
Even after cropping specific parts, deleting metadata, and removing hidden and overlapping objects, then exporting to PDF again (in Adobe), the problem remains when I load the PDF.
The only way I could get the right text from the pdf is to scrape only the text with the following lib and code:
import fitz
text = ""
path = "file.pdf"
doc = fitz.open(path)
for page in doc:
    text += page.get_text()  # getText() in older PyMuPDF versions
This gives me the text exactly as it appears in the PDF, but it is far from a dataframe: it would take quite a while to pre-process, clean, and parse into the right format to get the desired dataframe, which should be possible directly with tabula.
I tried two more libraries, PyPDF2 and pdfminer; both produce string output that would also require lengthy preprocessing.
from pdfminer.high_level import extract_text
text = extract_text(path)
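To illustrate the kind of pre-processing such string output needs, whitespace-separated text can be split into rows and columns by hand (a minimal sketch on made-up data; real layouts usually need position-aware parsing):

```python
# Made-up text in the shape that plain-text extraction often produces
raw = """Name  Qty  Price
Bolt  12   0.40
Nut   8    0.15"""

# Split into lines, then into cells on runs of whitespace
rows = [line.split() for line in raw.splitlines() if line.strip()]
header, body = rows[0], rows[1:]
# header == ['Name', 'Qty', 'Price']
# body   == [['Bolt', '12', '0.40'], ['Nut', '8', '0.15']]
```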
Thus, my questions would be:
What would be the best-practice approach here? Should I try transforming the PDF into a fully searchable one, and if so, what is the most Pythonic way?
Cropping outside of Python seems like a rookie approach: I end up cropping and deleting things just to fix the layout while losing some data. There must be a way to access all this information and produce a dataframe.
The main idea is to read the PDF as it is and reproduce its tables in a dataframe so they can be manipulated. Any suggestions are welcome.
Thanks in advance!
The solution for extracting a table from a partially searchable PDF is to run OCR on it first, for example with the OCR feature in Adobe Acrobat. After that, tabula is able to read and extract the table.

Is it possible extract a specific table with format from a PDF?

I am trying to extract a specific table from a PDF; the page layout is shown in the image below (screenshot omitted).
I tried different libraries in Python.
With tabula-py
from tabula import read_pdf
from tabulate import tabulate
df = read_pdf("./tmp/pdf/Food Calories List.pdf")
df
With PyPDF2
import PyPDF2
import pandas as pd

pdf_file = open("./tmp/pdf/Food Calories List.pdf", 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
data = page_content
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
Even with textract and Beautiful Soup, the issue I am facing is that the output format is a mess. Is there any way to extract this table in a better format?
I suspect the issues stem from the fact that the table has merged cells (on the left), and reading data from a table only works well when the rows and cells are consistent, rather than some merged and some not.
I'd skip over the first two columns and then recreate/populate them on the left-hand side once you have the table loaded (as a pandas DataFrame, for example).
Then you can have one label per row and work with the data consistently; otherwise your cells per column will be inconsistently numbered.
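Recreating the merged labels amounts to a forward fill down the affected column; a minimal sketch without pandas (with pandas you could use Series.ffill instead):

```python
def forward_fill(column):
    """Replace empty cells with the last non-empty value above them, as
    needed when a merged cell's label only appears in its first row."""
    filled, last = [], None
    for cell in column:
        if cell not in (None, ""):
            last = cell
        filled.append(last)
    return filled

# Example: a label that was merged across three rows in the PDF
labels = forward_fill(["Fruits", "", "", "Vegetables", ""])
# labels == ['Fruits', 'Fruits', 'Fruits', 'Vegetables', 'Vegetables']
```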
I would look into using tabula templates, which you can generate dynamically based on word locations on the page. This gives tabula more guidance on which area to consider and leads to more accurate extraction. See tabula.read_pdf_with_template as documented here: https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.read_pdf_with_template.
Camelot can be another Python library to try. Its advanced settings seem to show that it can handle merged cells. However, this will likely require some adjustments to certain settings such as copy_text and shift_text.
Note: Camelot can only read text-based tables. If the table is inside an image, it won't be able to extract it.
If the above is not an issue, try the sample code below:
import camelot
tables = camelot.read_pdf('./tmp/pdf/Food Calories List.pdf', pages='1', copy_text=['v'])
print(tables[0].df)

how to extract tables from pdf using camelot?

I want to extract all tables from pdf using camelot in python 3.
import camelot
# PDF file to extract tables from
file = "./pdf_file/ooo.pdf"
tables = camelot.read_pdf(file)
# number of tables extracted
print("Total tables extracted:", tables.n)
# print the first table as Pandas DataFrame
print(tables[0].df)
# export individually
tables[0].to_csv("./pdf_file/ooo.csv")
and then I get only one table, from the first page of the PDF.
How do I extract all the tables from the PDF file?
tables = camelot.read_pdf(file, pages='1-end')
If the pages parameter is not specified, Camelot analyzes only the first page.
For a fuller explanation, see the official documentation.
To extract PDF tables with Camelot, you can use the following code. The stream flavor is worth trying because it detects many tables that the default lattice flavor misses. If you still have problems with the extraction, add the row_tol and edge_tol parameters, for example row_tol=0 and edge_tol=500.
import camelot
pdf_archive = camelot.read_pdf(file_path, pages="all", flavor="stream")
for pdf_table in pdf_archive:
    print(pdf_table.df)

How to Read a WebPage with Python and write to a flat file?

Very much a novice at Python here.
I am trying to read the table presented at this page (with the current filters set as-is) and then write it to a CSV file:
http://www65.myfantasyleague.com/2017/options?L=47579&O=243&TEAM=DAL&POS=RB
I tried the approach below. It creates the CSV file but does not fill it with the actual table contents.
I appreciate any help in advance. Thanks.
import requests
import pandas as pd
url = 'http://www65.myfantasyleague.com/2017/optionsL=47579&O=243&TEAM=DAL&POS=RB'
csv_file='DAL.RB.csv'
pd.read_html(requests.get(url).content)[-1].to_csv(csv_file)
Generally, try to state your problem more precisely, debug step by step, and don't put everything on one line. That said, your specific problems here were the list index and the missing ? in the URL (after options):
import requests
import pandas as pd
url = 'http://www65.myfantasyleague.com/2017/options?L=47579&O=243&TEAM=DAL&POS=RB'
# -^-
csv_file='DAL.RB.csv'
pd.read_html(requests.get(url).content)[1].to_csv(csv_file)
# -^-
This yields a CSV file with the table in it.
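The role of the ? is easy to verify with the standard library: without it, everything after options is treated as part of the path and no query parameters are recognized (a small demonstration with the same URL, query shortened):

```python
from urllib.parse import urlparse, parse_qs

good = urlparse("http://www65.myfantasyleague.com/2017/options?L=47579&O=243")
bad = urlparse("http://www65.myfantasyleague.com/2017/optionsL=47579&O=243")

print(parse_qs(good.query))  # {'L': ['47579'], 'O': ['243']}
print(parse_qs(bad.query))   # {} -- the parameters were swallowed by the path
```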
