Partially searchable pdf document - python

I'm using the tabula library to read each PDF. Each PDF contains a table with its headers (columns) and the corresponding information. It all worked perfectly except for the last PDF.
code:
import tabula

df = tabula.read_pdf(path, pages="2", multiple_tables=False,
                     output_format='dataframe', pandas_options={'header': None})
part of the dataframe output (example):
nan SBI nan nan nan nan nan nan nan nan nan nan
JKL1LU1UKDAO/ /O/NEPLW45WF3CKL AF HSF1P PUAVKM RO0SA OSOAEAUMM5M31/6 PO LLŠF
KLMIMOG 0TLSL P0EK RV V OKŠGVJAVUAMNAWA ACADFUIF S JN FKFKLLLGLDAA2F LEV KA OTIF 2A4 KACNATULO01F2NVSCFRE BB AG05ANJA OLE4CPIVL1SGA 2AFK MR0HASET2PMG MLIONEKO0KF 0IEOJB1 L E NECGCVL1GXLDA 7019N8BVPV90
It is definitely not the code, since I even tried the web-based Tabula tool (https://tabula.technology/), where you can specify the extraction area (as I also did in the code), and it only occasionally recognizes a word or character.
It seems to have to do with the way the table was constructed inside the PDF. When I open the PDF for editing I can see a bunch of text boxes, sometimes with chunks of text grouped together, sometimes with separate letters, words, etc.
There is also some sort of hidden layer of information on parts of the pages.
Even after cropping specific parts, deleting metadata, hidden and overlapping objects, and then exporting it to PDF again (in Adobe Acrobat), the problem remains when I load the PDF.
The only way I could get the right text out of the PDF is to scrape just the text with the following library and code:
import fitz  # PyMuPDF

path = "file.pdf"
text = ""
doc = fitz.open(path)
for page in doc:
    text += page.getText()  # page.get_text() in newer PyMuPDF versions
This gives me the text exactly as it appears in the PDF, but that is far from a dataframe: it would take quite a while to pre-process, clean, and parse it into the right format to ultimately get the desired dataframe, which should be possible directly with tabula.
I tried two more libraries, PyPDF2 and pdfminer; both produce string output, which would likewise require lengthy preprocessing:
from pdfminer.high_level import extract_text

text = extract_text(path)  # path = "file.pdf"
Thus, my questions would be:
What would be the best-practice approach here? Should I try transforming the PDF into fully searchable text? If so, what would be the most Pythonic way?
Cropping outside of Python seems like a rookie approach, where I crop and delete things to get the right extraction area and throw away some data along the way. There must be a way to access all this information programmatically so as to get a dataframe.
The main idea is to read the PDF as it is and actually reproduce the tables in a dataframe so I can manipulate them. Any suggestions are welcome.
Thanks in advance!

The solution for extracting tables from partially searchable PDF files is to run OCR on the document first, e.g. with the OCR feature in Adobe Acrobat. After that, tabula is able to read and extract the tables.
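If you would rather keep the OCR step in Python, ocrmypdf is one option. A minimal sketch, assuming Tesseract is installed and using placeholder file names:

import ocrmypdf

# Rasterize the pages and add a fresh, searchable text layer.
# force_ocr=True replaces any existing (broken) text layer, which matches
# the "partially searchable" situation described in the question.
ocrmypdf.ocr("input.pdf", "output.pdf", force_ocr=True)

The OCR'd output.pdf can then be passed to tabula.read_pdf as before.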

Related

How to resolve UTF-8 encoding issues in python strings while extracting data from pdf?

I used PyPDF2 to extract text from a PDF file.
This is the code I wrote.
import PyPDF2 as pdf

file = open("file_to_scrape.pdf", 'rb')
pdf_reader = pdf.PdfFileReader(file)  # legacy PyPDF2 (pre-2.0) reader API
page_1 = pdf_reader.getPage(0)
print(page_1.extractText())
This gave out the following output.
˜˚Power: An Enabler for Industrialization
and Regional Cooperation
˜˚.˜ Introduction
The weird characters in front of "Power" and "Introduction" are supposed to be numbers: 15 and 15.1, to be precise.
I copied them and tried to encode them to UTF-8, but this is what I got:
b'\xcb\x9c\xcb\x9aPower: An Enabler for Industrialization and Regional Cooperation\xcb\x9c\xcb\x9a.\xcb\x9c Introduction'
This is how the page looks.
Could someone please help me figure out this issue? My aim is to extract a list of all figures and headings in the PDF along with their numbering.
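Judging from the bytes above, the PDF's font decodes its digit glyphs as '˜' (U+02DC) and '˚' (U+02DA) rather than as digits: '˜˚' sits where 15 should be and '˜˚.˜' where 15.1 should be, so '˜' stands for 1 and '˚' for 5. A hedged, document-specific workaround is to translate the glyphs back after extraction:

# Map the mis-decoded glyphs back to the digits they stand for in this
# particular file ('˜' -> '1', '˚' -> '5'). This mapping is specific to
# this document's font and will not transfer to other PDFs.
fix_digits = str.maketrans({"\u02dc": "1", "\u02da": "5"})
print(page_1.extractText().translate(fix_digits))

This only patches the symptom; a robust fix would need a proper text layer, e.g. via re-OCR.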

Exporting pandas dataframe to csv causes random line breaks

I'm trying to scrape Wikipedia for data on some famous people. I have no problems getting the data, but when I try to export it to CSV there are always a few entries causing a major issue. Basically, the output CSV is formatted fine for most entries, except for a few that cause random line breaks that I can't seem to overcome. Here are sample data and code:
# 1. pull out wiki pages
sample_names_list = [{'name': 'Mikhail Fridman', 'index': 11.0},   # will work fine
                     {'name': 'Roman Abramovich', 'index': 12.0},  # will cause issue
                     {'name': 'Marit Rausing', 'index': 13.0}]     # has no wiki page, hence 'try' in loops below

# 1.1 get page title for each name in list
import wikipedia as wk

for person in sample_names_list:
    try:
        wiki_page = person['name']
        person['wiki_page'] = wk.page(title=wiki_page, auto_suggest=True)
    except: pass

# 1.2 get page content for each page title in list
for person in sample_names_list:
    try:
        person_page = person['wiki_page']
        person['wiki_text'] = person_page.content
    except: pass

# 2. convert to dataframe
import pandas as pd
sample_names_data = pd.DataFrame(sample_names_list)
sample_names_data.drop('wiki_page', axis=1, inplace=True)  # drop unnecessary col

# 3. export csv
sample_names_data.to_csv('sample_names_data.csv')
Here is a screenshot of the output where, as you can see, random line-breaks are inserted in one of the entries and dispersed throughout with no apparent pattern:
I've tried fiddling with the data types in sample_names_list, I've tried messing with to_csv's parameters, I've tried other ways to export the csv. None of these approaches worked. I'm new to python so it could well be a very obvious solution. Any help much appreciated!
The wikipedia content has newlines in it, which are hard to reliably represent in a line-oriented format such as CSV.
You can use Excel's Open dialog (not just double-clicking the file) and select "Text file" as the format, which lets you choose how to interpret various delimiters and quoted strings... but preferably just don't use CSV for data interchange at all.
If you need to work with Excel, use .to_excel() in Pandas.
If you need to just work with Pandas, use e.g. .to_pickle().
If you need interoperability with other software, .to_json() would be a decent choice.
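If CSV is truly unavoidable, a pragmatic sketch (at the cost of losing the paragraph breaks in the scraped text) is to flatten the newlines before exporting:

# Replace embedded newlines with spaces so each record stays on one CSV line;
# fillna('') guards against entries that have no wiki_text at all.
sample_names_data['wiki_text'] = (
    sample_names_data['wiki_text']
    .fillna('')
    .str.replace('\n', ' ', regex=False)
)
sample_names_data.to_csv('sample_names_data.csv', index=False)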

Pdfplumber cannot recognise table python [duplicate]

I use pdfplumber to extract the table on page 2, section 3 (normally). But it only works on some PDFs; others do not work. For the failed PDF files, it seems like pdfplumber reads the bottom table instead of the table I want.
How can I get the table?
link of the pdf which doesn't work:
pdfA
link of the pdf which works:
pdfB
Here is my code:
import pdfplumber
import pandas as pd

pdf = pdfplumber.open("/Users/chueckingmok/Desktop/selenium/Shell Omala 68.pdf")
page = pdf.pages[1]
table = page.extract_table()
df = pd.DataFrame(table[1:], columns=table[0])
df
and the result is
But the table I want on page 2 is
However, this code works for pdfB (which I mentioned above).
Btw, the table I want in each pdf is in section 3.
Can anyone help?
Many thanks
Joan
Updated:
I just found a package that extracts from this pdf file without any problems: fitz, also known as PyMuPDF.
Here is a proper solution for that problem, but first please read a few of my points below.
You used pdfplumber for table extraction, but I think you should read about table settings: there are many of them, and if you study them according to your needs you will surely find your answer there. The pdfplumber API documentation for table extraction is here.
I give a solution to your specific problem below, but first check the pdfplumber API documentation properly; you can find answers there not only for table extraction but also for other things like text extraction, word extraction, etc.
For a better understanding of the table settings you can also use visual debugging, a very handy pdfplumber feature for seeing exactly what the table settings do and how the tables get extracted with them: see Visual Debugging of Tables.
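As a rough illustration of that visual debugging feature, using the file and settings from the solution below (the output image name is just a placeholder):

import pdfplumber

pdf = pdfplumber.open("GSAP_msds_01259319.pdf")
page = pdf.pages[1]
# Render the page and overlay the rows and cells the table finder detects
# with the given settings, then save the annotated image for inspection.
im = page.to_image(resolution=150)
im.debug_tablefinder({"vertical_strategy": "lines",
                      "horizontal_strategy": "text",
                      "snap_tolerance": 4})
im.save("debug_tablefinder.png")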
Below is the solution to your problem:
import pandas as pd
import pdfplumber

pdf = pdfplumber.open("GSAP_msds_01259319.pdf")
p1 = pdf.pages[1]
table = p1.extract_table(table_settings={"vertical_strategy": "lines",
                                         "horizontal_strategy": "text",
                                         "snap_tolerance": 4})
df = pd.DataFrame(table[1:], columns=table[0])
df
See the output of the above code.
To extract two tables from the same page, I use this code:
import pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    first_page = pdf.pages[0].find_tables()
    t1_content = first_page[0].extract(x_tolerance=5)
    t2_content = first_page[1].extract(x_tolerance=5)
    print(t1_content, '\n', t2_content)

Is it possible to extract a specific table with format from a PDF?

I am trying to extract a specific table from a PDF; the PDF looks like the image below.
I tried different libraries in Python.
With tabula-py
from tabula import read_pdf
from tabulate import tabulate
df = read_pdf("./tmp/pdf/Food Calories List.pdf")
df
With PyPDF2
import PyPDF2
import pandas as pd

pdf_file = open("./tmp/pdf/Food Calories List.pdf", 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
df = pd.DataFrame([x.split(';') for x in page_content.split('\n')])
Even with textract and Beautiful Soup, the issue I am facing is that the output format is a mess. Is there any way to extract this table with a better format?
I suspect the issues stem from the fact that the table has merged cells (on the left), and reading data from a table only works well when the rows and cells are consistent, rather than some merged and some not.
I'd skip over the first two columns and then recreate/populate them on the left-hand side once you have the table loaded (as a pandas DataFrame, for example).
Then you can have one label per row and work with the data consistently; otherwise your cells per column will be inconsistently numbered. A sketch of that step follows.
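A minimal sketch, assuming table is the list of rows returned by one of the extractors above and that merged label cells come back empty:

import numpy as np
import pandas as pd

# table[0] is assumed to hold the header row.
df = pd.DataFrame(table[1:], columns=table[0])

# Merged cells usually extract with a value in their first row only;
# forward-fill the label column so every row carries its own label.
df.iloc[:, 0] = df.iloc[:, 0].replace('', np.nan).ffill()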
I would look into using tabula templates which you can dynamically generate based on word locations on page. This will give tabula more guidance on which area to consider and lead to more accurate extraction. See tabula.read_pdf_with_template as documented here: https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.read_pdf_with_template.
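A hedged sketch of that call, assuming a template.json exported from the Tabula web app sits next to the PDF (the template path is hypothetical):

import tabula

# template.json defines the page areas Tabula should read; it can be
# exported from the Tabula web UI after manually selecting the table region.
dfs = tabula.read_pdf_with_template("./tmp/pdf/Food Calories List.pdf",
                                    "./tmp/pdf/template.json")
print(dfs[0])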
Camelot can be another Python library to try. Its advanced settings seem to show that it can handle merged cells. However, this will likely require some adjustments to certain settings such as copy_text and shift_text.
Note: Camelot can only read text-based tables. If the table is inside an image, it won't be able to extract it.
If the above is not an issue, try the sample code below:
import camelot
tables = camelot.read_pdf('./tmp/pdf/Food Calories List.pdf', pages='1', copy_text=['v'])
print(tables[0].df)

Tabula-py omitting pages from a PDF document I am trying to extract

I am trying to extract tables from a multi-page PDF with tabula-py, and while the tables on some of the pages of the PDF are extracted perfectly, some pages are omitted entirely.
The omissions seem to be random and don't follow any visible visual features on the PDF (as each page looks the same), and so tabula omitted page 1, extracted page 2, omitted pages 3 and 4, extracted page 5, omitted page 6, extracted pages 8 and 9, omitted 10, extracted 11, etc. I have macOS Sierra 10.12.6 and Python 3.6.3 :: Anaconda custom (64-bit).
I've tried splitting the PDF into shorter sections, even into one-pagers, but the omitted pages don't seem to be extractable no matter what I try. I've read the related documentation and the issues filed on the tabula-py GitHub page as well as here on Stack Overflow, but I can't seem to find a solution.
The code I use through iPython notebooks is as follows:
To install tabula through the terminal:
pip install tabula-py
To extract the tables in my PDF:
from tabula import read_pdf
df = read_pdf("document_name.pdf", pages="all")
I also tried the following, which didn't make any difference:
df = read_pdf("document_name.pdf", pages="1-361")
To save the data frame into csv:
df.to_csv('document_name.csv')
I'd be really thankful if you could help me with this, as I feel stuck with a PDF from which I've only managed to extract around 50% of the data. This is infuriating, as that 50% looks absolutely perfect, but the other 50% seems out of my reach and renders the larger project of analyzing the data impossible.
I also wonder if this might be an issue with the PDF rather than with Tabula: could the file mistakenly be set as protected or locked, and does anyone know how I could check for that and open it up?
Thanks a ton in advance!
This could be because the area of your data in the PDF file exceeds the area that is being read by tabula. Try the following:
First get the location of your data by parsing one of the pages into JSON format (here I chose page 2), then extract and print the locations:
tables = read_pdf("document_name.pdf", output_format="json", pages=2, silent=True)
top = tables[0]["top"]
left = tables[0]["left"]
bottom = tables[0]["height"] + top
right = tables[0]["width"] + left
print(f"{top=}\n{bottom=}\n{left=}\n{right=}")
You can now try to expand these locations slightly by experimentation, until you receive more data from the PDF document:
# area = [top, left, bottom, right]
# Example from page 2 json output: area = [30.0, 59.0, 761.0, 491.0]
# You could then nudge these locations slightly to include a wider data area:
test_area = [10.0, 30.0, 770.0, 500.0]
df = read_pdf(
    "document_name.pdf",
    multiple_tables=True,
    pages="all",
    area=test_area,
    silent=True,  # suppress all stderr output
)
and the df variable will now hold your tables with the PDF data.
Try increasing the memory available to the Java VM that tabula-py runs, using java_options like:
java_options="-Xmx4g"
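For instance, a sketch of passing that option through read_pdf (the 4 GB heap size is an arbitrary value to adjust):

from tabula import read_pdf

# java_options is forwarded to the JVM running the Tabula engine;
# -Xmx4g raises its maximum heap to 4 GB, which can help on large PDFs.
df = read_pdf("document_name.pdf", pages="all", java_options="-Xmx4g")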
