Converting an HTML table to a pandas DataFrame - Python

I have been trying to import an HTML table from a website and convert it into a pandas DataFrame. This is my code:
import pandas as pd
table = pd.read_html("http://www.sharesansar.com/c/today-share-price.html")
dfs = pd.DataFrame(data = table)
print dfs
It just displays this:
0 S.No ...
But if I do:
for df in dfs:
    print df
It outputs the table.
How can I use pd.DataFrame to scrape the table?

The HTML table at the given URL is rendered with JavaScript, and pd.read_html() doesn't support JavaScript-rendered pages. You can try dryscrape like so:
import pandas as pd
import dryscrape
# Render the page in a JavaScript-capable session first
s = dryscrape.Session()
s.visit("http://www.sharesansar.com/c/today-share-price.html")
# s.body() returns the rendered HTML; the share-price table is the sixth one
df = pd.read_html(s.body())[5]
df.head()
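dryscrape is no longer maintained; if it won't install, the same idea works with Selenium, which also runs the JavaScript before you hand the HTML to pandas. A minimal sketch, assuming Selenium 4+ with a Chrome driver available:
import pandas as pd
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.sharesansar.com/c/today-share-price.html")
# page_source holds the HTML after the scripts have run;
# index 5 matches the table position used above
df = pd.read_html(driver.page_source)[5]
driver.quit()
df.head()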

Related

How do I export a read_html df to Excel, when it relates to a table ID rather than data in the code?

I am experiencing this error with the code below:
File "\<stdin\>", line 1, in \<module\>
AttributeError: 'list' object has no attribute 'to_excel'
I want to save the table I am scraping from Wikipedia to an Excel file - but I can't work out how to adjust the code to get the data list from the terminal to the Excel file using to_excel.
I can see it works for a similar problem when a dataset has data set out as a DataFrame
(i.e. df = pd.DataFrame(data, columns=['Product', 'Price'])).
But I can't work out how to adjust my code for the df = pd.read_html(str(congress_table)) line - which I think is the issue (i.e. using read_html and sourcing the data from a table ID).
How can I adjust the code to make it save an excel file to the path specified?
from bs4 import BeautifulSoup
import requests
import pandas as pd
wiki_url = 'https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives'
table_id = 'votingmembers'
response = requests.get(wiki_url)
soup = BeautifulSoup(response.text, 'html.parser')
congress_table = soup.find('table', attrs={'id': table_id})
df = pd.read_html(str(congress_table))
df.to_excel (r'C:\Users\name\OneDrive\Code\.vscode\Test.xlsx', index = False, header=True)
print(df)
I was expecting the data list to be saved to Excel at the folder path specified.
I tried following multiple guides, but they don't show the read_html item, only DataFrame solutions.
pandas.read_html() creates a list of DataFrame objects, one per table, so you have to pick one by index - in your case [0]. You also do not need requests and BeautifulSoup separately; just go with pandas.read_html():
pd.read_html(wiki_url, attrs={'id': table_id})[0]
Example
import pandas as pd
wiki_url = 'https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives'
table_id = 'votingmembers'
# read_html fetches the page itself and filters tables by the id attribute;
# it still returns a list, so take the first match
df = pd.read_html(wiki_url, attrs={'id': table_id})[0]
df.to_excel(r'C:\Users\name\OneDrive\Code\.vscode\Test.xlsx', index=False, header=True)
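If you'd rather keep the requests/BeautifulSoup route from the question, the only change needed is the same [0] index, because read_html returns a list even when it finds a single table. A minimal sketch, reusing the question's own code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
wiki_url = 'https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives'
table_id = 'votingmembers'
response = requests.get(wiki_url)
soup = BeautifulSoup(response.text, 'html.parser')
congress_table = soup.find('table', attrs={'id': table_id})
# read_html parses the extracted fragment and returns a one-element list
df = pd.read_html(str(congress_table))[0]
df.to_excel(r'C:\Users\name\OneDrive\Code\.vscode\Test.xlsx', index=False, header=True)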

How to extract table details into rows and columns using pdfplumber

I am using pdfplumber to extract tables from a pdf. But the table in use does not have visible vertical lines separating the content, so the extracted data comes out as 3 rows and one huge column.
I would like the table to be extracted as 13 rows.
import pdfplumber
import pandas as pd
import numpy as np
with pdfplumber.open('test.pdf') as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    print(tables)
From the documentation I could not understand whether there were specific table settings I could apply. I tried some, but it did not help.
Please add the settings below when using extract_table() (these may need to be changed based on your input file):
import pdfplumber
import pandas as pd
import numpy as np
with pdfplumber.open(r'document.pdf') as pdf:
    page = pdf.pages[0]
    # Use text positions rather than ruling lines to find the row boundaries
    table = page.extract_table(table_settings={
        "vertical_strategy": "lines",
        "horizontal_strategy": "text",
        "snap_tolerance": 4,
    })
df = pd.DataFrame(table, columns=table[0]).T
Moreover, please have a read of the pdfplumber documentation's extracting-tables section, as there are many options to include in your code based on your input file:
https://github.com/jsvine/pdfplumber#extracting-tables
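For example, when a page has no ruling lines at all, switching both strategies to "text" lets pdfplumber infer cell boundaries from the word positions. A minimal sketch; whether this yields the 13 rows depends on your file's layout:
import pdfplumber
with pdfplumber.open(r'document.pdf') as pdf:
    page = pdf.pages[0]
    table = page.extract_table(table_settings={
        "vertical_strategy": "text",    # infer column edges from words
        "horizontal_strategy": "text",  # infer row edges from words
    })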
You can use pandas.DataFrame to customize your table instead of directly printing it. Note that extract_tables() returns a list of tables, so pick one and split off its header row:
table = tables[0]
df = pd.DataFrame(table[1:], columns=table[0])
for column in df.columns.tolist():
    df[column] = df[column].str.replace(" ", "")
print(df)

How to store pandas dataframe information in a csv file

I am new to scraping and python. I am trying to scrape multiple tables from this URL: https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes. I did the scraping and now I am trying to save the dataframes to a csv file. I tried, but it just stores the first table from the page.
code:
from pandas.io.html import read_html
page = 'https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes'
wikitables = read_html(page, index_col=0, attrs={"class":"wikitable plainrowheaders wikiepisodetable"})
print("Extracted {num} wikitables".format(num=len(wikitables)))
for line in range(7):
    df = pd.DataFrame(wikitables[line].head())
    df.to_csv('file1.csv')
You need to combine the list of dataframes into a single dataframe, and then you can export it to a csv file.
wikitable = wikitables[0]
for i in range(1, len(wikitables)):
    wikitable = wikitable.append(wikitables[i], sort=True)
wikitable.to_csv('wikitable.csv')
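Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat builds the same combined frame in one call:
import pandas as pd
# Concatenate the whole list of dataframes at once
wikitable = pd.concat(wikitables, sort=True)
wikitable.to_csv('wikitable.csv')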
You forgot
import pandas as pd
but you don't need it, because read_html gives a list of dataframes, so you don't have to convert them to dataframes. You can write them directly.
from pandas.io.html import read_html
url = 'https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes'
wikitables = read_html(url, index_col=0, attrs={"class":"wikitable plainrowheaders wikiepisodetable"})
print("Extracted {num} wikitables".format(num=len(wikitables)))
for i, dataframe in enumerate(wikitables):
    dataframe.to_csv('file{}.csv'.format(i))
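If you want all the tables in one file instead, as in the question, you can append to the same csv and write the header only once; a minimal sketch:
for i, dataframe in enumerate(wikitables):
    # mode='a' appends to the file; write the header only for the first table
    dataframe.to_csv('file1.csv', mode='a', header=(i == 0))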

Extract JSON data in a web page using pd.read_json()?

Trying to extract the table from this page "https://www.hkex.com.hk/Market-Data/Statistics/Consolidated-Reports/Monthly-Bulletin?sc_lang=en#select1=0&select2=28". Via the Inspect/Network function of Chrome, the data request link is "https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?_=1574650413485". This link looks like JSON format when accessed directly. However, the code using this link does not work.
My codes:
import pandas as pd
url="https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?_=1574650413485"
df = pd.read_json(url)
print(df.info(verbose=True))
print(df)
also tried:
url="https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?"
You can try downloading the JSON first and then converting it to a DataFrame. The response is nested (a 'tables' list with separate 'header' and 'body' entries), which is presumably why pd.read_json() cannot parse it into a sensible frame:
import pandas as pd
import urllib.request, json
url = 'https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?_=1574650413485'
with urllib.request.urlopen(url) as r:
    data = json.loads(r.read().decode())
# The payload nests the table: 'header' holds the column texts and
# 'body' holds one record per cell, tagged with its row number
df = pd.DataFrame(data['tables'][0]['body'])
columns = [item['text'] for item in data['tables'][0]['header']]
row_count = max(df['row'])
new_df = pd.DataFrame(df.text.values.reshape((row_count, -1)), columns=columns)
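A shorter variant with requests does the same thing; a sketch, assuming the response has the same nested layout:
import requests
import pandas as pd
url = 'https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?_=1574650413485'
data = requests.get(url).json()
columns = [item['text'] for item in data['tables'][0]['header']]
body = pd.DataFrame(data['tables'][0]['body'])
# One record per cell; reshape the flat cell list back into rows
new_df = pd.DataFrame(body.text.values.reshape((max(body['row']), -1)), columns=columns)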

Web scraping golf data from ESPN. I am receiving 3 outputs of the same table and only want 1. How can I limit this?

I am new to python and am stuck. I can't figure out how to output only one of the tables given. The output contains the desired table, but three versions of it. The first two are awfully formatted, and the last table is the one desired.
I have tried running a for loop and counting to only print the third table.
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header=0)
for df in dfs:
    print(df[0:])
Just use the index to print the table you want.
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
print(dfs[2])
OR
print(dfs[-1])
OR, if you want to use a loop, then try this:
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header=0)
for df in range(len(dfs)):
    if df == 2:
        print(dfs[df])
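The same loop reads more naturally with enumerate, which pairs each index with its DataFrame:
for i, df in enumerate(dfs):
    # index 2 is the fully formatted leaderboard table
    if i == 2:
        print(df)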
