There is a table under the "Data & Products" section of this page:
https://www.nhc.noaa.gov/gis/
I want to extract the table and save it to a CSV file. I wrote this basic code:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.nhc.noaa.gov/gis/")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)
I only know the basics of scraping, so please guide me from here. Thanks!
You can use pandas' read_html for this:
import pandas as pd
url = 'https://www.nhc.noaa.gov/gis/'
# read_html returns a list of all the tables on the page; take the first one
df = pd.read_html(url)[0]
# save it as a CSV file
df.to_csv("mycsv.csv")
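Note that read_html returns one DataFrame per <table> it finds on the page, so if [0] isn't the table under "Data & Products", inspect the others first. A minimal sketch:

import pandas as pd

url = 'https://www.nhc.noaa.gov/gis/'
tables = pd.read_html(url)  # one DataFrame per <table> found on the page

# print each table's index and shape to find the one you want
for i, t in enumerate(tables):
    print(i, t.shape)

tables[0].to_csv("mycsv.csv", index=False)  # index=False drops the row numbers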
It is hard to know for sure, but I guess this is what you want:
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

r = requests.get('https://www.nhc.noaa.gov/gis/')
soup = BeautifulSoup(r.content, 'html.parser')

for a in soup.find_all('a', href=True):
    href = a['href']
    # keep only links whose last path segment looks like a file,
    # skipping pages, external sites, and mailto links
    if ('.' in href.split('/')[-1]
            and 'html' not in href
            and '.php' not in href
            and 'http' not in href
            and 'mailto' not in href):
        # urljoin resolves both root-relative and page-relative hrefs correctly
        print(urljoin(r.url, href))
prints:
https://www.nhc.noaa.gov/gis/examples/al112017_5day_020.zip
https://www.nhc.noaa.gov/gis/examples/AL112017_020adv_CONE.kmz
https://www.nhc.noaa.gov/gis/examples/AL112017_020adv_TRACK.kmz
https://www.nhc.noaa.gov/gis/examples/AL112017_020adv_WW.kmz
https://www.nhc.noaa.gov/gis/forecast/archive/al092020_5day_latest.zip
https://www.nhc.noaa.gov/storm_graphics/api/AL092020_CONE_latest.kmz
https://www.nhc.noaa.gov/storm_graphics/api/AL092020_TRACK_latest.kmz
https://www.nhc.noaa.gov/storm_graphics/api/AL092020_WW_latest.kmz
https://www.nhc.noaa.gov/gis/forecast/archive/al102020_5day_latest.zip
https://www.nhc.noaa.gov/storm_graphics/api/AL102020_CONE_latest.kmz
https://www.nhc.noaa.gov/storm_graphics/api/AL102020_TRACK_latest.kmz
https://www.nhc.noaa.gov/storm_graphics/api/AL102020_WW_latest.kmz
https://www.nhc.noaa.gov/gis/examples/al112017_fcst_020.zip
https://www.nhc.noaa.gov/gis/examples/AL112017_initialradii_020adv.kmz
https://www.nhc.noaa.gov/gis/examples/AL112017_forecastradii_020adv.kmz
https://www.nhc.noaa.gov/gis/forecast/archive/al092020_fcst_latest.zip
https://www.nhc.noaa.gov/storm_graphics/api/AL092020_initialradii_latest.kmz
https://www.nhc.noaa.gov/storm_graphics/api/AL092020_forecastradii_latest.kmz
https://www.nhc.noaa.gov/gis/forecast/archive/al102020_fcst_latest.zip
.. and so on...
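If you want to download the files rather than just print their links, a minimal follow-up sketch (assuming you collect the printed URLs into a list first):

import os
import requests

urls = ['https://www.nhc.noaa.gov/gis/examples/al112017_5day_020.zip']  # collected as above

for u in urls:
    resp = requests.get(u)
    resp.raise_for_status()  # stop early on a bad link
    # save each file under its own name in the current directory
    with open(os.path.basename(u), 'wb') as f:
        f.write(resp.content)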
I would like to scrape this page with Python: https://statusinvest.com.br/acoes/proventos/ibovespa.
With this code:
import requests
from bs4 import BeautifulSoup as bs
URL = "https://statusinvest.com.br/acoes/proventos/ibovespa"
page = 1
req = requests.get(URL+str(page))
soup = bs(req.text, 'html.parser')
container = soup.find('div', attrs={'class','list'})
dividends = container.find('a')
for dividend in dividends:
    links = dividend.find_all('a')
    print(links)
But it doesn't return anything.
Can someone help me please?
Edit: see the updated code below to access any of the data you mentioned in the comment; you can adapt it to your needs, since all the data on that page is inside the data variable.
Updated Code:
import json
import requests
from bs4 import BeautifulSoup as bs
url = "https://statusinvest.com.br"
links = []
req = requests.get(f"{url}/acoes/proventos/ibovespa")
soup = bs(req.content, 'html.parser')
data = json.loads(soup.find('input', attrs={'id': 'result'})["value"])
print("Date Com Data")
for datecom in data["dateCom"]:
print(f"{datecom['code']}\t{datecom['companyName']}\t{datecom['companyNameClean']}\t{datecom['companyId']}\t{datecom['companyId']}\t{datecom['resultAbsoluteValue']}\t{datecom['dateCom']}\t{datecom['paymentDividend']}\t{datecom['earningType']}\t{datecom['dy']}\t{datecom['recentEvents']}\t{datecom['recentEvents']}\t{datecom['uRLClear']}")
print("\nDate Payment Data")
for datePayment in data["datePayment"]:
print(f"{datePayment['code']}\t{datePayment['companyName']}\t{datePayment['companyNameClean']}\t{datePayment['companyId']}\t{datePayment['companyId']}\t{datePayment['resultAbsoluteValue']}\t{datePayment['dateCom']}\t{datePayment['paymentDividend']}\t{datePayment['earningType']}\t{datePayment['dy']}\t{datePayment['recentEvents']}\t{datePayment['recentEvents']}\t{datePayment['uRLClear']}")
print("\nProvisioned Data")
for provisioned in data["provisioned"]:
print(f"{provisioned['code']}\t{provisioned['companyName']}\t{provisioned['companyNameClean']}\t{provisioned['companyId']}\t{provisioned['companyId']}\t{provisioned['resultAbsoluteValue']}\t{provisioned['dateCom']}\t{provisioned['paymentDividend']}\t{provisioned['earningType']}\t{provisioned['dy']}\t{provisioned['recentEvents']}\t{provisioned['recentEvents']}\t{provisioned['uRLClear']}")
Looking at the source code of that website, you can fetch the JSON directly and get your desired links; follow the code below.
Code:
import json
import requests
from bs4 import BeautifulSoup as bs
url = "https://statusinvest.com.br"
links=[]
req = requests.get(f"{url}/acoes/proventos/ibovespa")
soup = bs(req.content, 'html.parser')
data = json.loads(soup.find('input', attrs={'id': 'result'})["value"])
for datecom in data["dateCom"]:
    links.append(f"{url}{datecom['uRLClear']}")
for datePayment in data["datePayment"]:
    links.append(f"{url}{datePayment['uRLClear']}")
for provisioned in data["provisioned"]:
    links.append(f"{url}{provisioned['uRLClear']}")
print(links)
Let me know if you have any questions :)
I want to scrape a table on a page into a dataframe with the column names "Contracts" and "Funding Rate" (https://www.binance.com/en/futures/funding-history/1).
This is what I have tried so far, but I still can't work it out. I'd appreciate it if anyone could help me with this.
from pandas.io.html import read_html
import lxml.html as lh
url="https://www.binance.com/cn/futures/funding-history/1"
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
text = soup.get_text()
text1= soup.find("table", attrs={"class": "bnc-table-row-cell-break-word"})
text1.find_all("tr")
The attached image shows the dataframe I want to generate with pandas.
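A note on why the BeautifulSoup attempt comes back empty: that Binance page builds its table with JavaScript, so the HTML that urlopen receives contains no matching <table> at all. One alternative is to query a public Binance futures endpoint instead of the page; a hedged sketch, assuming the fapi.binance.com/fapi/v1/premiumIndex endpoint (not mentioned in the question) is still available and returns symbol and lastFundingRate fields:

import pandas as pd
import requests

# the funding-history page is rendered client-side, so hit the API directly
resp = requests.get("https://fapi.binance.com/fapi/v1/premiumIndex")
resp.raise_for_status()

df = pd.DataFrame(resp.json())[["symbol", "lastFundingRate"]]
df.columns = ["Contracts", "Funding Rate"]
print(df.head())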
Good afternoon! How do I make BeautifulSoup grab only what is between multiple sets of "[:" and ":]"? So far I have the entire page in my soup, but it has no tags, sadly.
What it looks like so far
I have tried a couple of things so far:
soup.findAll(text="[")
keys = soup.find("span", attrs = {"class": "objectBox objectBox-string"})
import bs4 as bs
import urllib.request
source = urllib.request.urlopen("https://login.microsoftonline.com/common/discovery/keys").read()
soup = bs.BeautifulSoup(source,'lxml')
# ---------------------------------------------
# prior script that I was playing with trying to tackle this issue
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# Set URL to scrape new certs from
newcerts = "https://login.microsoftonline.com/common/discovery/keys"
# Connect to the URL
response = requests.get(newcerts)
# Parse HTML and save to BeautifulSoup Object
soup = BeautifulSoup(response.text, "html.parser")
keys = soup.find("span", attrs = {"class": "objectBox objectBox-string"})
End goal is to retrieve the public PKI keys from Azure's website at https://login.microsoftonline.com/common/discovery/keys
Not sure if this is what you meant to grab. Try the script below:
import json
import requests
url = 'https://login.microsoftonline.com/common/discovery/keys'
res = requests.get(url)
jsonobject = json.loads(res.content)
for item in jsonobject['keys']:
    print(item['x5c'])
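Since the end goal is the public keys: the endpoint serves JSON (a JWKS document), not HTML, so there is nothing for BeautifulSoup to parse, and each x5c entry is a base64-encoded DER certificate. A sketch of writing them out as PEM files (the kid-based file names are just an illustrative choice):

import textwrap
import requests

url = 'https://login.microsoftonline.com/common/discovery/keys'
res = requests.get(url)

for item in res.json()['keys']:
    cert = item['x5c'][0]  # base64 DER certificate
    # PEM is the same base64 wrapped at 64 characters between header lines
    pem = ("-----BEGIN CERTIFICATE-----\n"
           + textwrap.fill(cert, 64)
           + "\n-----END CERTIFICATE-----\n")
    with open(f"{item['kid']}.pem", 'w') as f:
        f.write(pem)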
I would like to scrape the Weather History & Observations table from the following link:
https://www.wunderground.com/history/airport/HDY/2011/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2011&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=
This is the code I have so far:
import pandas as pd
from bs4 import BeautifulSoup
import requests
link = 'https://www.wunderground.com/history/airport/HDY/2011/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2011&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo='
resp = requests.get(link)
c = resp.text
soup = BeautifulSoup(c, 'html.parser')  # name a parser explicitly to avoid the parser warning
I would like to know what is the next step to access the table info at the bottom of the page (assuming this is a good website format to allow this to happen).
Thank you
You can use find to locate the table and find_all to get its rows:
table = soup.find('table', class_="responsive obs-table daily")
rows = table.find_all('tr')
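From there, a sketch of pulling the cell text out of each row so you can hand it to pandas or csv (assuming the class names above actually match the page's markup):

# extract the text of every cell, row by row
data = []
for row in rows:
    cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    if cells:  # skip rows with no cells
        data.append(cells)
print(data[:5])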
I'm trying to parse through this HTML and get the 53.1 and 41.7 values. I'm not quite sure how to do it.
I've been trying to do it using Beautiful Soup
Any suggestions or ideas would be greatly appreciated. Thanks.
from bs4 import BeautifulSoup
from urllib.request import urlopen

r = urlopen('url/to/open').read()
soup = BeautifulSoup(r, 'html.parser')
print(type(soup))
-OR-
from bs4 import BeautifulSoup
import requests

url = input("Enter a website to extract the URLs from: ")
r = requests.get("http://" + url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))
Notice the .find_all() method. Try exploring all of BeautifulSoup's helper methods. Good luck.
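For the 53.1 and 41.7 values specifically, the right selector depends on markup that isn't shown here, but the pattern is the same; a sketch with a hypothetical class name:

# 'temp-value' is a placeholder; substitute the real class from your HTML
for span in soup.find_all('span', class_='temp-value'):
    print(span.get_text(strip=True))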