PDF Table object list to csv format in Python

I am trying to build a panel dataset by appending, row-wise, the tables on pp. 149-157 of this PDF file (they all share the same column names):
https://www.uv.mx/personal/clelanda/files/2013/02/Garber-2000-Famous-first-bubbles.pdf
Here is the code I am currently using:
!pip install tabula-py
!pip install pandas
import pandas as pd
import tabula
from google.colab import files

def getLocalFiles():
    # Upload files from the local machine into the Colab runtime
    _files = files.upload()
    if len(_files) > 0:
        for k, v in _files.items():
            with open(k, 'wb') as f:
                f.write(v)

getLocalFiles()
# directory contents
!ls
# Reading pdf tables; read_pdf returns a list of DataFrames, one per table
path = 'bubbles.pdf'
tables = tabula.read_pdf(path, pages='149-157', columns=(1, 2, 3, 4, 5, 6, 7))
print(tables)
# passing the first table to csv format
df = tables[0]
print(df)
df.to_csv('test.csv', index=False)
This is the output data: a list of DataFrame objects, one per table. How could I append all of the PDF tables into a single one? Thanks in advance.
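Since tabula.read_pdf already returns a list of DataFrames, and the question says the tables share the same column names, a minimal sketch is to stack the list row-wise with pd.concat:
import pandas as pd
import tabula

# One DataFrame per detected table on pp. 149-157
tables = tabula.read_pdf('bubbles.pdf', pages='149-157')
# Append all tables by rows; ignore_index renumbers the rows 0..n-1
panel = pd.concat(tables, ignore_index=True)
panel.to_csv('test.csv', index=False)
If the header row is repeated inside some of the extracted tables, you may need to drop those rows before concatenating.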

Related

Separate values on the basis of '\n'

I have a PDF file and am extracting the data from it using pdfquery and pandas. The code is as follows:
import pdfquery
import pandas as pd

pdf = pdfquery.PDFQuery('data/BUSTA_PAGA - 2.pdf')
pdf.load()
# Dump the parsed layout tree so element coordinates can be inspected
pdf.tree.write('pdfXML.txt', pretty_print=True)
# Select every horizontal text line overlapping the given bounding box
Name = pdf.pq('LTTextLineHorizontal:overlaps_bbox("25.509, 188.273, 188.558, 748.621")').text()
s = pd.DataFrame({'Name': [Name]})
s.to_csv('file_name.csv')
When I run this, it returns the text of the full text box, which is what I wanted, but there is specific data inside it that I want to extract. How would I do that?
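Per the title, if the values inside the box are separated by '\n', one sketch is to split the extracted string into one row per line; which index you keep afterwards is an assumption to adjust after inspecting the output:
# Check what separators the extracted string actually contains first
print(repr(Name))
# Assuming newline separators, split into individual values
parts = Name.split('\n')
s = pd.DataFrame({'Name': parts})
# e.g. parts[2] would then be the third value in the box (hypothetical index)
s.to_csv('file_name.csv', index=False)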

Read all .pdf files in directory; Extract fillable fields to pandas df

I am writing a script that reads a folder of .pdfs and extracts their fillable fields into a pandas DataFrame. I had success extracting one .pdf with the following code:
import numpy as np
import pandas as pd
import PyPDF2
import glob, os
pwd = os.getcwd()
pdfFileObj = open('pdf_filename', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
fields_dict = pdfReader.getFormTextFields()
series = pd.Series(fields_dict).to_frame()
df = pd.DataFrame(pd.Series(fields_dict)).T
I want to build a function that runs this script for all pdfs in the directory. My first idea was to use a function in glob that collects all pdfs. Here is what I have so far:
import numpy as np
import pandas as pd
import PyPDF2
import glob, os
pwd = os.getcwd()
def readfiles():
    os.chdir(pwd)
    pdfs = []
    for file in glob.glob("*.pdf"):
        print(file)
        pdfs.append(file)
pdfFileObj = open(readfiles, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
fields_dict = pdfReader.getFormTextFields()
series = pd.Series(fields_dict).to_frame()
df = pd.DataFrame(pd.Series(fields_dict)).T
Unfortunately, this doesn't work because I cannot pass a function to PdfFileReader. Does anyone have suggestions on a better way to do this? Thanks!
I can't comment, new account. But you could try making your readfiles function return the list pdfs. Then, in the code executed below, just:
listofPDF = readfiles()
arrayofDF = list()
for file in listofPDF:
    pdfFileObj = open(file, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # execute your code to obtain a single dataframe from a pdf here
    fields_dict = pdfReader.getFormTextFields()
    series = pd.Series(fields_dict).to_frame()
    df = pd.DataFrame(pd.Series(fields_dict)).T
    arrayofDF.append(df)
You would end up having a list of DataFrames, each one corresponding to one of the PDF files, provided the first part of the code (in which you get the DataFrame from a single PDF file) works. Additionally, you could make a dictionary like {'filename': file, 'dataframe': df} and append that to your list, so you can later recover a DataFrame based on the name of the file. It all depends on what you plan to do with the DataFrames later; see the sketch below.
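Putting those suggestions together, a minimal sketch (assuming the PDFs sit in the working directory and actually contain fillable text fields) might look like this:
import glob
import pandas as pd
import PyPDF2

def readfiles():
    # Return the list of PDF paths instead of only printing them
    return glob.glob("*.pdf")

records = []
for file in readfiles():
    with open(file, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        fields_dict = pdfReader.getFormTextFields()
        # One-row frame per PDF, fields as columns
        df = pd.DataFrame(pd.Series(fields_dict)).T
        records.append({'filename': file, 'dataframe': df})

# If the fields line up across files, stack everything into one frame
combined = pd.concat([r['dataframe'] for r in records], ignore_index=True)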

Read XLS file with Pandas & xlrd returns error; xlrd opens file on its own

I am writing some automated scripts to process Excel files in Python; some are in XLS format. Here's a code snippet of my attempt to do so with pandas:
df = pd.read_excel(contents, engine='xlrd', skiprows=5, names=['some', 'column', 'headers'])
contents is the file contents pulled from an AWS S3 bucket. When this line runs I get [ERROR] ValueError: File is not a recognized excel file.
In troubleshooting this, I have tried to access the spreadsheet using xlrd directly:
book = xlrd.open_workbook(file_contents=contents)
print("Number of worksheets is {}".format(book.nsheets))
print("Worksheet names: {}".format(book.sheet_names()))
This works without errors so xlrd seems to recognize it as an Excel file, just not when asked to do so by Pandas.
Anyone know why Pandas won't read the file with xlrd as the engine? Or can someone help me take the sheet from xlrd and convert it into a Pandas dataframe?
"Or can someone help me take the sheet from xlrd and convert it into a Pandas dataframe?"
pd.read_excel can take a book...
import pandas as pd
import xlrd

book = xlrd.open_workbook(filename='./file_check/file.xls')
df = pd.read_excel(book, skiprows=5)
print(df)
   some   column headers
0     1     some     foo
1     2  strings     bar
2     3     here     yes
3     4      too      no
I'll include the code below that may help if you want to check/handle Excel file types; maybe you can adapt it for your needs. The code loops through a local folder and records each file's path and extension, but then uses python-magic to drill into the actual content. It also has a column showing the guess from mimetypes, but that isn't as good. In the resulting frame you can see that some .xls files are not what the extension says, and that a .txt is actually an Excel file.
import pandas as pd
import glob
import mimetypes
import os
# https://pypi.org/project/python-magic/
import magic
path = r'./file_check' # use your path
all_files = glob.glob(path + "/*.*")
data = []
for file in all_files:
name, extension = os.path.splitext(file)
data.append([file, extension, magic.from_file(file, mime=True), mimetypes.guess_type(file)[0]])
df = pd.DataFrame(data, columns=['Path', 'Extension', 'magic.from_file(file, mime=True)', 'mimetypes.guess_type'])
# del df['magic.from_file(file, mime=True)']
df
From there you could filter files based on their type:
xlsx_file_format = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
xls_file_format = 'application/vnd.ms-excel'
for file in all_files:
    if magic.from_file(file, mime=True) == xlsx_file_format:
        print('xlsx')
        # DO SOMETHING SPECIAL WITH XLSX FILES
    elif magic.from_file(file, mime=True) == xls_file_format:
        print('xls')
        # DO SOMETHING SPECIAL WITH XLS FILES
    else:
        continue
dfs = []
for file in all_files:
    if (magic.from_file(file, mime=True) == xlsx_file_format) or \
       (magic.from_file(file, mime=True) == xls_file_format):
        # who cares, it all works with this for the demo...
        df = pd.read_excel(file, skiprows=5, names=['some', 'column', 'headers'])
        dfs.append(df)

print('\nHow many frames did we get from seven files? ', len(dfs))
Output:
xlsx
xls
xls
xlsx
How many frames did we get from seven files? 4
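Back to the original S3 scenario: pandas accepts a path or a file-like object, so a hedged workaround (assuming contents holds the raw bytes pulled from the S3 object, as in the question) is to wrap the bytes in io.BytesIO, or to hand pandas the already-opened xlrd book as shown above:
import io
import pandas as pd

# contents: raw bytes from the S3 object (assumption from the question)
df = pd.read_excel(io.BytesIO(contents), engine='xlrd',
                   skiprows=5, names=['some', 'column', 'headers'])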

Using Pandas with XLSB File

Trying to read an .xlsb file to create a DataFrame in pandas.
import pandas as pd

a_data = pd.ExcelFile(r'C:\Desktop\a.xlsb')
df_data = pd.read_excel(a_data, 'Sheet1', engine='pyxlsb')
print(df_data.head())
When I run the script I keep getting this error.
OSError: File contains no valid workbook part
You can use pyxlsb; all recent versions of pandas support it. The error most likely comes from the plain pd.ExcelFile call, which opens the file with a default engine, so pass the path straight to pd.read_excel with the engine specified:
import pandas as pd

df = pd.read_excel(r'C:\Desktop\a.xlsb', sheet_name='Sheet1', engine='pyxlsb')
You will have to install pyxlsb first using the command: pip install pyxlsb
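If you need the sheet names, or more than one sheet, a small sketch using the question's path (adjust as needed):
import pandas as pd

# Open once with the pyxlsb engine, then parse any sheet by name
with pd.ExcelFile(r'C:\Desktop\a.xlsb', engine='pyxlsb') as xlsb:
    print(xlsb.sheet_names)
    dfs = {name: xlsb.parse(name) for name in xlsb.sheet_names}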

How do I download a CSV file from a website using Python code for my Jupyter notebook?

I want to download the daily data about COVID-19 cases from the ECDC website. How do I do that with Python code and import it into my notebook? I have previously downloaded the data from GitHub, but I have no idea how to download data from a link provided on a live website.
# PyGithub (pip install PyGithub) lists the files in the repository directory
from github import Github

g = Github('KEY')
repo = g.get_repo("CSSEGISandData/COVID-19")
file_list = repo.get_contents("csse_covid_19_data/csse_covid_19_daily_reports")
github_dir_path = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_daily_reports/'
# Take the second-to-last entry (the latest daily report) and build its raw URL
file_path = github_dir_path + str(file_list[-2]).split('/')[-1].split(".")[0] + '.csv'
I was just able to use this and download it. Is your issue getting the list of files, or were you unaware that you can use URLs in read_csv?
import pandas as pd

# Use the raw-content URL; the /blob/ page returns HTML, not CSV
url = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv'
df = pd.read_csv(url, error_bad_lines=False)
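For the ECDC data specifically, the same pattern applies. A sketch, assuming the CSV endpoint ECDC published at the time (verify the current link on their site):
import pandas as pd

# ECDC COVID-19 case-distribution CSV endpoint (an assumption; check it is still live)
url = 'https://opendata.ecdc.europa.eu/covid19/casedistribution/csv'
df = pd.read_csv(url)
df.to_csv('ecdc_daily.csv', index=False)  # keep a local copy for the notebook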
