I'm trying to extract some data from a PDF file using tabula.
The issue I'm facing is that it only extracts from one page, even though the pages argument is specified.
I'm not sure what's going on; any insight would be greatly appreciated!
The code:
import tabula
tables = tabula.read_pdf("testfile.pdf", pages='all')
tabula.convert_into("testfile.pdf", "test_file_tables.csv")
THANK YOU!
After looking at the documentation, I realised I had forgotten to specify the pages argument in the call to tabula.convert_into.
The correct code is:
tables = tabula.read_pdf("testfile.pdf", pages='all')
tabula.convert_into("testfile.pdf", "test_file_tables.csv", pages='all')
I am attempting to read the following CSV so I can process it further, but I am getting a pandas.errors.ParserError. I would really appreciate any help on how to read it. Can you help me identify what I am doing wrong?
My code:
import pandas as pd
logic_df = pd.read_csv("http://www.sharecsv.com/s/6cb912f54d87d45f4728f8fb1510a5eb/random.csv")
I don't think the CSV itself is at fault: I ran it through CSV Lint, which reported it as fine, so I am not sure what the issue is.
I also tried to do the following
logic_df = pd.read_csv("http://www.sharecsv.com/s/6cb912f54d87d45f4728f8fb1510a5eb/random.csv", error_bad_lines=False)
with no luck.
Changing the url to the direct link of the table should work:
df = pd.read_csv("http://www.sharecsv.com/dl/6cb912f54d87d45f4728f8fb1510a5eb/random.csv")
The thing is, your URL points to an HTML page, not to a CSV file per se. You can either use the URL above, or read the source of your URL with pd.read_html, like this:
df = pd.read_html('http://www.sharecsv.com/s/6cb912f54d87d45f4728f8fb1510a5eb/random.csv', header=0)[0]
Hope it helps!
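One way to catch this kind of problem before pandas raises a ParserError is to look at the start of the downloaded text. The sketch below uses only the standard library. `looks_like_html` is a hypothetical helper name, the two payloads are made up for illustration, and the check is a heuristic, not an exhaustive test.

```python
def looks_like_html(text):
    """Heuristically decide whether text is an HTML page rather than CSV data."""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


# Sample payloads (made up for illustration).
html_payload = "<!DOCTYPE html>\n<html><body><table></table></body></html>"
csv_payload = "name,score\nalice,1\nbob,2\n"

print(looks_like_html(html_payload))
print(looks_like_html(csv_payload))
```

If the check reports HTML, the URL most likely points at a viewer page, and either a direct download link or pd.read_html is the better fit.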
I use pdfplumber to extract the table on page 2, which is normally in section 3. It works on some PDFs but not on others. For the PDFs that fail, pdfplumber seems to read the bottom table instead of the table I want.
How can I get the right table?
link of the pdf which doesn't work:
pdfA
link of the pdf which works:
pdfB
Here is my code:
import pdfplumber
pdf = pdfplumber.open("/Users/chueckingmok/Desktop/selenium/Shell Omala 68.pdf")
page = pdf.pages[1]
table=page.extract_table()
import pandas as pd
df = pd.DataFrame(table[1:], columns=table[0])
df
and the result is
But the table I want in page 2 is
However, this code works for pdfB (which I mentioned above).
Btw, the table I want in each pdf is in section 3.
Can anyone help?
Many thanks
Joan
Updated:
I just found a good package that extracts from the PDF files without any problems.
The package is fitz, also known as PyMuPDF.
Here is a solution to the problem, but first please read the points below.
You used pdfplumber for table extraction, but I think you should read about its table settings. There are many of them, and once you know them you can tune the extraction to your needs. See the table extraction section of the pdfplumber API documentation.
I give a solution for your problem below, but do check the pdfplumber documentation properly first. It covers table extraction as well as text extraction, word extraction, and so on, so you should find answers to future questions there too.
For a better understanding of the table settings you can also use visual debugging. It is one of pdfplumber's best features for seeing exactly what each table setting does and how the table is extracted. See the visual debugging section of the documentation.
Below is the solution to your problem:
import pandas as pd
import pdfplumber
pdf = pdfplumber.open("GSAP_msds_01259319.pdf")
p1 = pdf.pages[1]
table = p1.extract_table(table_settings={
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "snap_tolerance": 4,
})
df = pd.DataFrame(table[1:], columns=table[0])
df
See the output of the Above Code
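A note on why the wrong table can come back: as far as I understand the pdfplumber documentation, extract_table() returns only the largest table on the page, while extract_tables() returns all of them. When a page holds several tables, one option is to fetch them all and choose by content. The sketch below shows only the choosing step on made-up data. `pick_table` and the keyword "CAS No." are hypothetical and not taken from the PDFs in the question.

```python
def pick_table(tables, keyword):
    """Return the first table that has a cell containing keyword, else None.

    tables is a list of tables, each a list of rows, each row a list of
    cell strings (or None for empty cells).
    """
    needle = keyword.lower()
    for table in tables:
        for row in table:
            if any(cell and needle in cell.lower() for cell in row):
                return table
    return None


# Made-up data standing in for the result of page.extract_tables().
tables = [
    [["Product name", "Example Oil"], ["Supplier", "Example Ltd"]],
    [["Chemical name", "CAS No.", "Concentration"], ["Base oil", None, "90%"]],
]

wanted = pick_table(tables, "CAS No.")
print(wanted[0])
```

The function returns None when nothing matches, so the caller can check for that before building a DataFrame from table[1:] and table[0].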
To extract two tables from the same page, I use this code:
import pdfplumber
with pdfplumber.open("file.pdf") as pdf:
    first_page = pdf.pages[0].find_tables()
    t1_content = first_page[0].extract(x_tolerance=5)
    t2_content = first_page[1].extract(x_tolerance=5)
    print(t1_content, '\n', t2_content)
Very novice at Python here.
Trying to read the table presented at this page (w/ the current filters set as is) and then write it to a csv file.
http://www65.myfantasyleague.com/2017/options?L=47579&O=243&TEAM=DAL&POS=RB
I tried this next approach. It creates the csv file but does not fill it w/ the actual table contents.
Appreciate any help in advance. thanks.
import requests
import pandas as pd
url = 'http://www65.myfantasyleague.com/2017/optionsL=47579&O=243&TEAM=DAL&POS=RB'
csv_file='DAL.RB.csv'
pd.read_html(requests.get(url).content)[-1].to_csv(csv_file)
Generally, try to emphasize your problems better, try to debug and don't put everything in one line. With that said, your specific problem here was the index and the missing ? in the code (after options):
import requests
import pandas as pd
url = 'http://www65.myfantasyleague.com/2017/options?L=47579&O=243&TEAM=DAL&POS=RB'
# ^ note the '?' added after 'options'
csv_file='DAL.RB.csv'
pd.read_html(requests.get(url).content)[1].to_csv(csv_file)
# ^ note the index [1] instead of [-1]
This yields a CSV file with the table in it.
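To avoid losing the ? when typing a long URL by hand, the query string can be built from a dictionary instead. This is a minimal sketch using the standard library. The parameter names and values are the ones from the URL in the question, and the variable names `base` and `params` are only illustrative.

```python
from urllib.parse import urlencode

base = "http://www65.myfantasyleague.com/2017/options"
params = {"L": 47579, "O": 243, "TEAM": "DAL", "POS": "RB"}

# urlencode joins the pairs with '&'; the '?' separator is added explicitly once.
url = base + "?" + urlencode(params)
print(url)
```

requests can do the same thing for you: requests.get(base, params=params) appends the encoded query string to the base URL.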
I am practicing for a machine learning contest hosted on GitHub, using Python. I started from someone else's submission, but got stuck at the first step, using pandas to read a CSV file:
import pandas as pd
import numpy as np
filename = './facies_vectors.csv'
training_data = pd.read_csv(filename)
print(set(training_data["Well Name"]))
training_data.head()
This gave me the following error message:
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 104, saw 3
I do not understand why the .csv file identifies itself as an HTML DOCTYPE. Please help.
Representative segments of the CSV content are attached. Thanks.
It turns out I had downloaded the CSV file the way one usually does on the web: right click and "Save as". The right way is to open the item on GitHub and then open it from GitHub Desktop. I have the tables now. Working with HTML files from Python is definitely something I would like to learn more about. Thanks.
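The message "Expected 1 fields in line 104, saw 3" means the parser had settled on one column from the earlier lines and then met a line with three comma-separated fields, which is typical when the file is really an HTML page. A rough way to see this for yourself is to count fields per row with the csv module. In this sketch `field_count_report` is a hypothetical helper and the sample text is made up.

```python
import csv
import io


def field_count_report(text):
    """Map each field count to the 1-based row numbers that have it."""
    rows_by_count = {}
    for row_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        rows_by_count.setdefault(len(row), []).append(row_number)
    return rows_by_count


# Made-up sample: an HTML page saved under a .csv name.
sample = "<!DOCTYPE html>\n<html>\n<a href='x'>a, b, c</a>\n"

report = field_count_report(sample)
print(report)
```

A genuine CSV normally yields a single field count for every row, so more than one key in the report is a hint that the file is not what its extension claims.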
Fairly simple; I've got the data I want out of the excel file, but can't seem to find anything inside the XLRD readme that explains how to go from this:
xldate:40397.007905092592
number:10000.0
text:u'No'
number:0.1203
number:0.096000000000000002
number:0.126
to their respective python datatypes. Any ideas?
Did you try the documentation? See date_function.
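For the xldate value, xlrd ships its own converter, xlrd.xldate_as_tuple(xldate, datemode), where datemode comes from the workbook. To show what such a conversion does, here is a standard-library sketch that assumes the workbook uses Excel's 1900 date system. `excel_serial_to_datetime` is a hypothetical helper, not part of xlrd.

```python
from datetime import datetime, timedelta

# Day zero of Excel's 1900 date system (assumed here; 1904-based workbooks differ).
EXCEL_1900_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Convert an Excel serial date (1900 system, on or after 1900-03-01) to datetime."""
    return EXCEL_1900_EPOCH + timedelta(days=serial)


# The xldate value from the question.
moment = excel_serial_to_datetime(40397.007905092592)
print(moment.date())
```

The number and text cells need no conversion function: each xlrd cell exposes its Python value through the value attribute, so cell.value already gives a float or a string.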
I had the same issue and used the following as a last resort:
def numobj2fl(p):
    return float(str(p).split(":")[1])
for converting the 'number object' to float.