I am using pdfplumber to extract tables from pdf. But the table in use does not have visible vertical lines separating content so the the data extracted are into 3 rows and one huge column.
I would like the above table to come into 13 rows.
import pdfplumber
import pandas as pd
import numpy as np
with pdfplumber.open('test.pdf') as pdf:
page = pdf.pages[0]
tables = page.extract_tables()
print(tables)
From the documentation I could not understand if there was a specific table settings I could apply. I tried some but it did not help.
Please add below settings when using extract_tables() option (This may need to be changed based on your input file) :
import pdfplumber
import pandas as pd
import numpy as np
with pdfplumber.open(r'document.pdf') as pdf:
page = pdf.pages[0]
table = page.extract_table(table_settings={"vertical_strategy": "lines",
"horizontal_strategy": "text",
"snap_tolerance": 4,})
df = pd.DataFrame(table, columns=table[0]).T
Morover, Please have a read on pdfplumber documentation (extracting-tables) section, as there is many options to include in your code based in your input file :
https://github.com/jsvine/pdfplumber#extracting-tables
You can use pandas.DataFrame to customize your table instead of directly printing the table.
df = pd.DataFrame(tables[1:], columns=tables[0])
for column in df.columns.tolist():
df[column] = df[column].str.replace(" ", "")
print(df)
Related
Output
I want to work with PDF files, specially with tables. I code this
import pandas as pd
import numpy as np
import tabula
from tabula import read_pdf
tab= tabula.read_pdf('..\PDFs\Ala.pdf',encoding='latin-1', pages ='all')
tab
But I get a list of values, like this:
[ Nombres Edad Ciudad
0 Noelia 20 Lima
1 Michelie 45 Lima
2 Ximena 18 Lima
3 Miguel 43 Lima]
I cannot analyze it die it's not a data frame. This is just an example the real PDF file contains tables between texts and several pages
So, please could someone help me with this issue?
tabula should return a list of Pandas dataframes, one for each table found in the PDF. You could display (and work with them) as follows:
import pandas as pd
import numpy as np
import tabula
from tabula import read_pdf
dfs = tabula.read_pdf('..\PDFs\Ala.pdf', encoding='latin-1', pages='all')
print(f"Found {len(dfs)} tables")
# display each of the dataframes
for df in dfs:
print(df.size)
print(df)
tabula returns a list of Pandas DataFrame. But we can convert this list to Pandas DataFrame using the below statement.
import tabula
import pandas
tab = pandas.DataFrame(tabula.read_pdf('..\PDFs\Ala.pdf', pages ='all')[0])
I have been working on automating a series of reports in python. I have been trying to create a series of pivot tables from an imported csv (binlift.csv). I have found the Pandas library very useful for this however, I cant seem to find anything that helps me write the Panda created pivot tables to my excel document (Template.xlsx) and was wondering if anyone can help. So far I have the written the following code
import openpyxl
import csv
from datetime import datetime
import datetime
import pandas as pd
import numpy as np
file1 = "Template.xlsx" # template file
file2 = "binlift.csv" # raw data csv
wb1 = openpyxl.load_workbook(file1) # opens template
ws1 = wb1.create_sheet("Raw Data") # create a new sheet in template called Raw Data
summary = wb1.worksheets[0] # variables given to sheets for manipulation
rawdata = wb1.worksheets[1]
headings = ["READER","BEATID","LIFTYEAR","LIFTMONTH","LIFTWEEK","LIFTDAY","TAGGED","UNTAGGEDLIFT","LIFT"]
df = pd.read_csv(file2, names=headings)
pivot_1 = pd.pivot_table(df, index=["LIFTYEAR", "LIFTMONTH","LIFTWEEK"], values=["TAGGED","UNTAGGEDLIFT","LIFT"],aggfunc=np.sum)
pivot_2 = pd.pivot_table(df, index=["LIFTYEAR", "LIFTMONTH"], values=["TAGGED","UNTAGGEDLIFT"],aggfunc=np.sum)
pivot_3 = pd.pivot_table(df, index=["READER"], values=["TAGGED","UNTAGGEDLIFT","LIFT"],aggfunc=np.sum)
print(pivot_1)
print(pivot_2)
print(pivot_3)
wb1.save('test.xlsx')enter code here
There is an option in pandas to write the 'xlsx' files.
Here basically we get all the indices (at level 0) of the pivot table, and then one by one we go over these indices to subset the table and write that part of the table.
writer = pd.ExcelWriter('output.xlsx')
for manager in pivot_1.index.get_level_values(0).unique():
temp_df = pivot_1.xs(manager, level=0)
temp_df.to_excel(writer, manager)
writer.save()
I am new to scraping and python. I am trying to scrape multiple tables from this URL: https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes. I did the scraping and now I am trying to save the dataframe to a csv file. I tried but it just stores the the first table from the page.
code:
from pandas.io.html import read_html
page = 'https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes'
wikitables = read_html(page, index_col=0, attrs={"class":"wikitable plainrowheaders wikiepisodetable"})
print ("Extracted {num} wikitables".format(num=len(wikitables)))
for line in range(7):
df= pd.DataFrame(wikitables[line].head())
df.to_csv('file1.csv')
You need to reshape the list of dataframes into a single dataframe and then you need to export it to csv file.
wikitable = wikitables[0]
for i in range(1,len(wikitables)):
wikitable = wikitable.append(wikitables[i],sort=True)
wikitable.to_csv('wikitable.csv')
You forgot
import pandas as pd
but you don't need it because read_html gives list of dataframes and you don't have to convert it o dataframes. You can write it directly.
from pandas.io.html import read_html
url = 'https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes'
wikitables = read_html(url, index_col=0, attrs={"class":"wikitable plainrowheaders wikiepisodetable"})
print("Extracted {num} wikitables".format(num=len(wikitables)))
for i, dataframe in enumerate(wikitables):
dataframe.to_csv('file{}.csv'.format(i))
I am new to python and am stuck. I cant figure out how to only output one of the tables given. In the output, it gives the desired table, but three versions of them. The first two are awfully formatted, and the last table is the table desired.
I have tried running a for loop and counting to only print the third table.
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
for df in dfs:
print(df[0:])
Just use index to print the table.
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
print(dfs[2])
OR
print(dfs[-1])
OR If you want to use loop then try that.
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
for df in range(len(dfs)):
if df==2:
print(dfs[df])
import requests
import pandas as pd
shot_chart_url = 'http://stats.nba.com/stats/shotchartdetail?CFID=33&CFPAR'\
'AMS=2014-15&ContextFilter=&ContextMeasure=FGA&DateFrom=&D'\
'ateTo=&GameID=&GameSegment=&LastNGames=0&LeagueID=00&Loca'\
'tion=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&'\
'PaceAdjust=N&PerMode=PerGame&Period=0&PlayerID=201935&Plu'\
'sMinus=N&Position=&Rank=N&RookieYear=&Season=2014-15&Seas'\
'onSegment=&SeasonType=Regular+Season&TeamID=0&VsConferenc'\
'e=&VsDivision=&mode=Advanced&showDetails=0&showShots=1&sh'\
'owZones=0'
# Get the webpage containing the data
response = requests.get(shot_chart_url)
# Grab the headers to be used as column headers for our DataFrame
headers = response.json()['resultSets'][0]['headers']
# Grab the shot chart data
shots = response.json()['resultSets'][0]['rowSet']
shot_df = pd.DataFrame(shots, columns=headers)
# View the head of the DataFrame and all its columns
from IPython.display import display
with pd.option_context('display.max_columns', None):
display(shot_df.head())
I want to dump the data from the pandas table into a CSV but I'm unsure of how to implement pandas.DataFrame.to_csv
Call the DataFrame instance's to_csv method with at least the path argument
shot_df.to_csv('/path/to/file.csv')