How to scrape a table on a web page into a DataFrame? - Python

I want to scrape a table on a page (https://www.binance.com/en/futures/funding-history/1) into a dataframe with columns named "Contracts" and "Funding Rate".
This is what I have tried so far, but I still can't work it out. I'd appreciate it if anyone could help me with this.
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.binance.com/en/futures/funding-history/1"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

# This comes back as None: the table is most likely rendered by JavaScript
# after the page loads, so it never appears in the HTML that urlopen fetches.
table = soup.find("table", attrs={"class": "bnc-table-row-cell-break-word"})
rows = table.find_all("tr") if table else []
The attached image shows the dataframe I want to generate with pandas.
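One possible approach, since the table is very likely rendered by JavaScript and therefore invisible to urlopen: drive a real browser with Selenium and hand the rendered source to pandas. This is a sketch, not a verified solution; it assumes a local ChromeDriver is available, and the column names are taken from the question, so they may not match the rendered headers exactly.
from io import StringIO
from time import sleep

import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.binance.com/en/futures/funding-history/1")
sleep(5)                    # crude wait for the JavaScript table to render
html = driver.page_source
driver.quit()

tables = pd.read_html(StringIO(html))          # one DataFrame per <table> on the page
df = tables[0][["Contracts", "Funding Rate"]]  # column names assumed from the question
print(df.head())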

Related

Web scraping by BeautifulSoup in Python

I tried to retrieve the table data from the link below with Python, but unfortunately it returned all the HTML tags without the table. Could you help me out?
https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/exp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-02-01&r9=2022-02-01
My code:
import requests
from bs4 import BeautifulSoup
url = 'https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/exp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-02-01&r9=2022-02-01'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup)
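For an HTML table like this one, pandas can often do the whole job. A minimal sketch, assuming the table is present in the static HTML; if StatCan builds it with JavaScript, a browser-driven approach such as Selenium would be needed instead:
import pandas as pd

url = ('https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/exp-eng.htm'
       '?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-02-01&r9=2022-02-01')
tables = pd.read_html(url)   # one DataFrame per <table> found on the page
print(len(tables))
print(tables[0].head())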

How to extract table from NHC website in Python?

Here,
https://www.nhc.noaa.gov/gis/
There is a table under the "Data & Products" section. I want to extract the table and save it to a CSV file. I wrote this basic code:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.nhc.noaa.gov/gis/")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)
I only know the basics of scraping. Please, guide me from here. Thanks!
You can use pandas:
import pandas as pd

url = 'https://www.nhc.noaa.gov/gis/'
df = pd.read_html(url)[0]   # read_html returns a list of tables; take the first
df.to_csv("mycsv.csv")      # save the table to a CSV file
It's hard to know exactly what you're after, but I'd guess this is what you want:
from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.nhc.noaa.gov/gis/')
soup = BeautifulSoup(r.content, 'html.parser')

for a in soup.find_all('a'):
    if a.get('href'):
        if '.' in a.get('href').split('/')[-1] \
                and 'html' not in a.get('href') \
                and '.php' not in a.get('href') \
                and 'http' not in a.get('href') \
                and 'mailto' not in a.get('href'):
            print('https://www.nhc.noaa.gov' + a.get('href'))
prints:
https://www.nhc.noaa.gov/gis/examples/al112017_5day_020.zip
https://www.nhc.noaa.gov/gis/examples/AL112017_020adv_CONE.kmz
https://www.nhc.noaa.gov/gis/examples/AL112017_020adv_TRACK.kmz
https://www.nhc.noaa.gov/gis/examples/AL112017_020adv_WW.kmz
https://www.nhc.noaa.govforecast/archive/al092020_5day_latest.zip
https://www.nhc.noaa.gov/storm_graphics/api/AL092020_CONE_latest.kmz
https://www.nhc.noaa.gov/storm_graphics/api/AL092020_TRACK_latest.kmz
https://www.nhc.noaa.gov/storm_graphics/api/AL092020_WW_latest.kmz
https://www.nhc.noaa.govforecast/archive/al102020_5day_latest.zip
https://www.nhc.noaa.gov/storm_graphics/api/AL102020_CONE_latest.kmz
https://www.nhc.noaa.gov/storm_graphics/api/AL102020_TRACK_latest.kmz
https://www.nhc.noaa.gov/storm_graphics/api/AL102020_WW_latest.kmz
https://www.nhc.noaa.gov/gis/examples/al112017_fcst_020.zip
https://www.nhc.noaa.gov/gis/examples/AL112017_initialradii_020adv.kmz
https://www.nhc.noaa.gov/gis/examples/AL112017_forecastradii_020adv.kmz
https://www.nhc.noaa.govforecast/archive/al092020_fcst_latest.zip
https://www.nhc.noaa.gov/storm_graphics/api/AL092020_initialradii_latest.kmz
https://www.nhc.noaa.gov/storm_graphics/api/AL092020_forecastradii_latest.kmz
https://www.nhc.noaa.govforecast/archive/al102020_fcst_latest.zip
.. and so on...
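If the goal is to actually download one of those files rather than just print the links, a short follow-up; the URL below is copied from the output above:
import requests

url = 'https://www.nhc.noaa.gov/gis/examples/al112017_5day_020.zip'
r = requests.get(url)
r.raise_for_status()                       # fail loudly on a bad response
with open(url.split('/')[-1], 'wb') as f:  # save under the original filename
    f.write(r.content)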

Python BeautifulSoup - trouble parsing table from webpage

I'd like to parse the table data from the following site (the EC2 pricing data) and create a dataframe with all of the table values (vCPU, Memory, Storage, Price). However, with the following code, I can't seem to find the table on the page. Can someone help me figure out how to parse out the values?
Using pd.read_html, an error shows up saying that no tables are found.
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = "https://aws.amazon.com/ec2/pricing/on-demand/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

tables = soup.find_all('table')   # comes back empty
df = pd.read_html(url)            # raises ValueError: No tables found
If you're having trouble because of dynamic content, a good workaround is Selenium: it simulates the browser experience, so you don't have to worry about managing cookies and the other problems that come with dynamic web content. I was able to scrape the page with the following:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep

driver = webdriver.Firefox()
driver.get('https://aws.amazon.com/ec2/pricing/on-demand/')
sleep(3)                    # give the JavaScript time to render the tables
html = driver.page_source   # grab the fully rendered HTML
driver.close()

soup = BeautifulSoup(html, 'lxml')
tables = soup.find_all('table')
print(tables)
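From here, the rendered page source can also go straight to pandas instead of walking the soup by hand. A sketch, assuming read_html can parse the pricing tables:
from io import StringIO

dfs = pd.read_html(StringIO(html))  # one DataFrame per <table> in the page
print(len(dfs))
print(dfs[0].head())                # assumes the first table is the pricing one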

Download table from Wunderground with Beautiful Soup

I would like to download the Weather History & Observations table from the following link:
https://www.wunderground.com/history/airport/HDY/2011/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2011&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=
This is the code I have so far:
import pandas as pd
from bs4 import BeautifulSoup
import requests

link = 'https://www.wunderground.com/history/airport/HDY/2011/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2011&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo='
resp = requests.get(link)
c = resp.text
soup = BeautifulSoup(c, 'html.parser')   # pass an explicit parser
I would like to know the next step to access the table info at the bottom of the page (assuming this website's format allows it). Thank you!
You can use find_all:
# locate the observations table by its class, then collect its rows
table = soup.find('table', class_="responsive obs-table daily")
rows = table.find_all('tr')
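To turn those rows into a DataFrame, one plausible continuation; it assumes the first row holds the column headers, which the live page may not guarantee:
import pandas as pd

# extract the text of every cell; 'th'/'td' covers header and data rows alike
parsed = [[cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
          for row in rows]
df = pd.DataFrame(parsed[1:], columns=parsed[0])  # assumes row 0 is the header
print(df.head())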

Extracting table info using BeautifulSoup (bs4)

Could anyone please give me a snippet of BeautifulSoup code to extract some of the items in the table found here?
Here's my attempt:
from bs4 import BeautifulSoup
from urllib.request import urlopen   # urllib2 on Python 2

url = "http://biology.burke.washington.edu/conus/accounts/../recordview/record.php?ID=1ll&tabs=21100111&frms=1&res=&pglimit=A"
html = urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
tables = soup.find_all("table")      # findAll is the legacy spelling
However, this is failing -- tables turns out to be empty.
Sorry, I'm a BeautifulSoup noob.
Thanks!
The page at the given URL does not contain any table element in its source; the table is generated by JavaScript inside an iframe. You can request the iframe's page directly instead:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://biology.burke.washington.edu/conus/recordview/description.php?ID=1l9l0l421l55llll&tabs=21100111&frms=1&pglimit=A&offset=&res=&srt=&sql2='
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
#print(tables)
Selenium solution:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

url = "http://biology.burke.washington.edu/conus/accounts/../recordview/record.php?ID=1ll&tabs=21100111&frms=1&res=&pglimit=A"
driver = webdriver.Firefox()
driver.get(url)
# the table lives inside an iframe, so switch into it before parsing
driver.switch_to.frame(driver.find_elements(By.TAG_NAME, 'iframe')[0])
soup = BeautifulSoup(driver.page_source, 'html.parser')
tables = soup.find_all('table')
#print(tables)
driver.quit()
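Either way, once tables is non-empty, the markup can be handed to pandas. A sketch, assuming at least one table was found:
import pandas as pd
from io import StringIO

# parse the first extracted <table> into a DataFrame
df = pd.read_html(StringIO(str(tables[0])))[0]
print(df.head())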
This is my current workflow:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://somewebpage.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
