I'm trying to scrape the data from the table in the specifications section of this webpage:
Lochinvar Water Heaters
I'm using Beautiful Soup 4. I've tried searching for it by class, for example class="Table__Cell-sc-1e0v68l-0 kdksLO", but bs4 can't find that class on the page. I listed all the classes it could find, and none of them look useful. Any help is appreciated.
Here's the code I used to list the classes:
import requests
from bs4 import BeautifulSoup

URL = "https://www.lochinvar.com/products/commercial-water-heaters/armor-condensing-water-heater"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

results = soup.find_all("div", class_='Table__Wrapper-sc-1e0v68l-3 iFOFNW')

classes = [value
           for element in soup.find_all(class_=True)
           for value in element["class"]]
classes = sorted(classes)
for cass in classes:
    print(cass)
The page is populated with JavaScript, but fortunately in this case much of the data (including the specs table you want) seems to be inside a script tag in the fetched HTML. The script contains just one statement, so it's fairly easy to extract it as JSON:
import json
### copied from your q ####
import requests
from bs4 import BeautifulSoup
URL = "https://www.lochinvar.com/products/commercial-water-heaters/armor-condensing-water-heater"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
###########################
wrInf = soup.find(lambda l: l.name == 'script' and '__routeInfo' in l.text)
wrInf = wrInf.text.replace('window.__routeInfo = ', '', 1) # remove variable name
wrInf = wrInf.strip()[:-1] # get rid of ; at end
wrInf = json.loads(wrInf) # convert to python dictionary
specsTables = wrInf['data']['product']['specifications'][0]['table'] # get table (tsv string)
specsTables = [tuple(row.split('\t')) for row in specsTables.split('\n')] # convert rows to tuples
To view it, you could use pandas,
import pandas
headers = specsTables[0]
st_df = pandas.DataFrame([dict(zip(headers, r)) for r in specsTables[1:]])
# or just
# st_df = pandas.DataFrame(specsTables[1:], columns=headers)
print(st_df.head())
or you could simply print it
for i, r in enumerate(specsTables):
    print(" | ".join([f'{c:^18}' for c in r]))
    if i == 0: print()
output:
Model Number | Btu/Hr Input | Thermal Efficiency | GPH # 100ºF Rise | A | B | C | D | E | F | G | H | I | J | K | L | M | Gas Conn. | Water Conn. | Air Inlet | Vent Size | Ship. Wt.
AWH0400NPM | 399,000 | 99% | 479 | 45" | 24" | 30-1/2" | 42-1/2" | 29-3/4" | 20-1/4" | 12" | 20" | 38" | 3-1/2" | 10-1/2" | 19-1/4" | 20" | 1" | 2" | 4" | 4" | 326
AWH0500NPM | 500,000 | 99% | 600 | 45" | 24" | 30-1/2" | 42-1/2" | 29-3/4" | 20-1/4" | 12" | 20" | 38" | 3-1/2" | 10-1/2" | 19-1/4" | 20" | 1" | 2" | 4" | 4" | 333
AWH0650NPM | 650,000 | 98% | 772 | 45" | 24" | 41" | 53" | 30-1/2" | 15-1/4" | 12" | 20" | 38" | 3-1/2" | 10-1/2" | 19-1/4" | 20" | 1-1/4" | 2" | 4" | 6" | 424
AWH0800NPM | 800,000 | 98% | 950 | 45" | 24" | 41" | 53" | 30-1/2" | 15-1/4" | 12" | 20" | 38" | 3-1/2" | 10-1/2" | 19-1/4" | 20" | 1-1/4" | 2" | 4" | 6" | 434
AWH1000NPM | 999,000 | 98% | 1,187 | 45" | 24" | 48" | 62" | 30-1/2" | 15-3/4" | 12" | 20" | 38" | 3-1/2" | 10-1/2" | 19-1/4" | 20" | 1-1/4" | 2-1/2" | 6" | 6" | 494
AWH1250NPM | 1,250,000 | 98% | 1,485 | 51-1/2" | 34" | 49" | 59" | 5-1/2" | 5-1/2" | 13-1/2" | 6-3/4" | 46-3/4" | 5-3/4" | 19-3/4" | 23" | 22-1/2" | 1-1/2" | 2-1/2" | 8" | 8" | 1,568
AWH1500NPM | 1,500,000 | 98% | 1,782 | 51-1/2" | 34" | 52-3/4" | 62-3/4" | 4-1/2" | 4-1/2" | 13-1/2" | 6-3/4" | 46-3/4" | 5-3/4" | 19-3/4" | 23" | 22-1/2" | 1-1/2" | 2-1/2" | 8" | 8" | 1,649
AWH2000NPM | 1,999,000 | 98% | 2,375 | 51-1/2" | 34" | 65-1/2" | 75-1/2" | 7" | 5-3/4" | 14-3/4" | 7-1/4" | 46-3/4" | 6-3/4" | 18-3/4" | 23" | 23-1/2" | 1-1/2" | 2-1/2" | 8" | 8" | 1,911
AWH3000NPM | 3,000,000 | 98% | 3,564 | 67-1/4" | 48-1/4" | 79-3/4" | 93-3/4" | 4-3/4" | 6-3/4" | 17-3/4" | 8-3/4" | 60-1/4" | 8-1/2" | 25-1/2" | 29-1/2" | 40" | 2" | 4" | 10" | 10" | 3,147
AWH4000NPM | 4,000,000 | 98% | 4,752 | 67-1/4" | 48-1/4" | 96" | 110" | 5" | 7-1/2" | 17-3/4" | 8-3/4" | 60-1/4" | 8-1/2" | 25-1/2" | 29-1/2" | 40" | 2-1/2" | 4" | 12" | 12" | 3,694
If you wanted a specific model's specs:
modelNo = 'AWH1000NPM'
mSpecs = [r for r in specsTables if r[0] == modelNo]
mSpecs = [[]] if mSpecs == [] else mSpecs # in case there is no match
mSpecs = dict(zip(specsTables[0], mSpecs[0])) # convert to dictionary
print(mSpecs)
output:
{'Model Number': 'AWH1000NPM', 'Btu/Hr Input': '999,000', 'Thermal Efficiency': '98%', 'GPH # 100ºF Rise': '1,187', 'A': '45"', 'B': '24"', 'C': '48"', 'D': '62"', 'E': '30-1/2"', 'F': '15-3/4"', 'G': '12"', 'H': '20"', 'I': '38"', 'J': '3-1/2"', 'K': '10-1/2"', 'L': '19-1/4"', 'M': '20"', 'Gas Conn.': '1-1/4"', 'Water Conn.': '2-1/2"', 'Air Inlet': '6"', 'Vent Size': '6"', 'Ship. Wt.': '494'}
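If you need the same table from more than one product page, the extraction above could be wrapped in a small helper. A minimal sketch along the same lines (the function name and the empty-list fallback are mine, not something the site provides):

import json
import requests
from bs4 import BeautifulSoup

def get_specs_table(url):
    # fetch the page and pull the window.__routeInfo JSON out of its script tag
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    script = soup.find(lambda t: t.name == 'script' and '__routeInfo' in t.text)
    if script is None:
        return []  # layout changed or the request was blocked
    raw = script.text.replace('window.__routeInfo = ', '', 1).strip()[:-1]  # drop variable name and trailing ;
    info = json.loads(raw)
    tsv = info['data']['product']['specifications'][0]['table']  # tab-separated table string
    return [tuple(row.split('\t')) for row in tsv.split('\n')]

specsTables = get_specs_table("https://www.lochinvar.com/products/commercial-water-heaters/armor-condensing-water-heater")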
The contents for constructing the table are within a script tag. You can extract the relevant string and re-create the table through string manipulation.
import requests, re
import pandas as pd

r = requests.get('https://www.lochinvar.com/products/commercial-water-heaters/armor-condensing-water-heater/').text
# pull the tab/newline-delimited table string out of the embedded JSON and unescape the quotes
s = re.sub(r'\\"', '"', re.search(r'table":"([\s\S]+?)(?:","tableFootNote)', r).group(1))
lines = [i.split('\\t') for i in s.split('\\n')]  # \t and \n are still escaped in the raw source
df = pd.DataFrame(lines[1:], columns=lines[0])
df.head(5)
I have been using the code below to pull MLB lineups from BaseballPress.com. However, this pulls the official MLB lineups, which don't normally get posted until about an hour before the game.
import requests
import pandas as pd
import openpyxl
from bs4 import BeautifulSoup

url = "https://www.baseballpress.com/lineups/2022-08-09"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

def get_name(tag):
    if tag.select_one(".desktop-name"):
        return tag.select_one(".desktop-name").get_text()
    elif tag.select_one(".mobile-name"):
        return tag.select_one(".mobile-name").get_text()
    else:
        return tag.get_text()

data = []
for card in soup.select(".lineup-card"):
    header = [
        c.get_text(strip=True, separator=" ")
        for c in card.select(".lineup-card-header .c")
    ]
    h_p1, h_p2 = [
        get_name(p) for p in card.select(".lineup-card-header .player")
    ]
    data.append([*header, h_p1, h_p2])
    for p1, p2 in zip(
        card.select(".col--min:nth-of-type(1) .player"),
        card.select(".col--min:nth-of-type(2) .player"),
    ):
        p1 = get_name(p1).split(maxsplit=1)[-1]
        p2 = get_name(p2).split(maxsplit=1)[-1]
        data.append([*header, p1, p2])

df = pd.DataFrame(
    data, columns=["Team1", "Date", "Team2", "Player1", "Player2"]
)
df.to_excel("MLB Games.xlsx", sheet_name='sheet1', index=False)
print(df.head(10).to_markdown(index=False))
In order to get around this, I found out that Rotowire releases the projected lineups about 24 hours in advance, which is what I need for this analysis. I have changed the Python script to match that website, except I am not sure how to alter the get_name() function. Does anyone know how I would address this portion of the code? See the new code below:
import requests
import pandas as pd
import openpyxl
from bs4 import BeautifulSoup

url = "https://www.rotowire.com/baseball/daily-lineups.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

def get_name(tag):
    if tag.select_one(".desktop-name"):
        return tag.select_one(".desktop-name").get_text()
    elif tag.select_one(".mobile-name"):
        return tag.select_one(".mobile-name").get_text()
    else:
        return tag.get_text()

data = []
for card in soup.select(".lineup__main"):
    header = [
        c.get_text(strip=True, separator=" ")
        for c in card.select(".lineup__teams .c")
    ]
    h_p1, h_p2 = [
        get_name(p) for p in card.select(".lineup__teams .lineup__player")
    ]
    data.append([*header, h_p1, h_p2])
    for p1, p2 in zip(
        card.select(".lineup__list is-visit:nth-of-type(1) .lineup__player"),
        card.select(".lineup__list is-home:nth-of-type(2) .lineup__player"),
    ):
        p1 = get_name(p1).split(maxsplit=1)[-1]
        p2 = get_name(p2).split(maxsplit=1)[-1]
        data.append([*header, p1, p2])

df = pd.DataFrame(
    data, columns=["Team1", "Date", "Team2", "Player1", "Player2"]
)
df.to_excel("MLB Predicted Lineups.xlsx", sheet_name='sheet1', index=False)
print(df.head(10).to_markdown(index=False))
You need to look at the actual HTML to see what tags and attributes the source is using, in order to correctly identify the content you want. I made a script to do exactly what you are asking a while back, so I'm just posting that here.
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

def get_players(home_away_dict):
    rows = []
    for home_away, v in home_away_dict.items():
        players = v['players']
        print("\n{} - {}".format(v['team'], v['lineupStatus']))
        for idx, player in enumerate(players):
            if home_away == 'Home':
                team = home_away_dict['Home']['team']
                opp = home_away_dict['Away']['team']
            else:
                team = home_away_dict['Away']['team']
                opp = home_away_dict['Home']['team']
            if player.find('span', {'class': 'lineup__throws'}):
                # pitchers list a throwing hand, batters a batting hand
                playerPosition = 'P'
                handedness = player.find('span', {'class': 'lineup__throws'}).text
            else:
                playerPosition = player.find('div', {'class': 'lineup__pos'}).text
                handedness = player.find('span', {'class': 'lineup__bats'}).text
            if 'title' in list(player.find('a').attrs.keys()):
                playerName = player.find('a')['title'].strip()
            else:
                playerName = player.find('a').text.strip()
            playerRow = {
                'Bat Order': idx,
                'Name': playerName,
                'Position': playerPosition,
                'Team': team,
                'Opponent': opp,
                'Home/Away': home_away,
                'Handedness': handedness,
                'Lineup Status': home_away_dict[home_away]['lineupStatus']}
            rows.append(playerRow)
            print('{} {}'.format(playerRow['Position'], playerRow['Name']))
    return rows

rows = []
url = 'https://www.rotowire.com/baseball/daily-lineups.php'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

lineupBoxes = soup.find_all('div', {'class': 'lineup__box'})
for lineupBox in lineupBoxes:
    try:
        awayTeam = lineupBox.find('div', {'class': 'lineup__team is-visit'}).text.strip()
        homeTeam = lineupBox.find('div', {'class': 'lineup__team is-home'}).text.strip()
        print(f'\n\n############\n {awayTeam} # {homeTeam}\n############')

        awayLineup = lineupBox.find('ul', {'lineup__list is-visit'})
        homeLineup = lineupBox.find('ul', {'lineup__list is-home'})

        awayLineupStatus = awayLineup.find('li', {'class': re.compile('lineup__status.*')}).text.strip()
        homeLineupStatus = homeLineup.find('li', {'class': re.compile('lineup__status.*')}).text.strip()

        awayPlayers = awayLineup.find_all('li', {'class': re.compile('lineup__player.*')})
        homePlayers = homeLineup.find_all('li', {'class': re.compile('lineup__player.*')})

        home_away_dict = {
            'Home': {
                'team': homeTeam, 'players': homePlayers, 'lineupStatus': homeLineupStatus},
            'Away': {
                'team': awayTeam, 'players': awayPlayers, 'lineupStatus': awayLineupStatus}}

        playerRows = get_players(home_away_dict)
        rows += playerRows
    except:
        # skip boxes that don't follow the expected structure (e.g. postponed games)
        continue

df = pd.DataFrame(rows)
Output: First 20 of 300 rows
print(df.head(20).to_markdown(index=False))
| Bat Order | Name | Position | Team | Opponent | Home/Away | Handedness | Lineup Status |
|------------:|:-----------------|:-----------|:-------|:-----------|:------------|:-------------|:----------------|
| 0 | Nick Lodolo | P | CIN | PHI | Home | L | Expected Lineup |
| 1 | Jonathan India | 2B | CIN | PHI | Home | R | Expected Lineup |
| 2 | Nick Senzel | CF | CIN | PHI | Home | R | Expected Lineup |
| 3 | Kyle Farmer | 3B | CIN | PHI | Home | R | Expected Lineup |
| 4 | Joey Votto | 1B | CIN | PHI | Home | L | Expected Lineup |
| 5 | Aristides Aquino | DH | CIN | PHI | Home | R | Expected Lineup |
| 6 | Albert Almora | LF | CIN | PHI | Home | R | Expected Lineup |
| 7 | Matt Reynolds | RF | CIN | PHI | Home | R | Expected Lineup |
| 8 | Jose Barrero | SS | CIN | PHI | Home | R | Expected Lineup |
| 9 | Austin Romine | C | CIN | PHI | Home | R | Expected Lineup |
| 0 | Ranger Suarez | P | PHI | CIN | Away | L | Expected Lineup |
| 1 | Jean Segura | 2B | PHI | CIN | Away | R | Expected Lineup |
| 2 | Kyle Schwarber | LF | PHI | CIN | Away | L | Expected Lineup |
| 3 | Rhys Hoskins | 1B | PHI | CIN | Away | R | Expected Lineup |
| 4 | J.T. Realmuto | C | PHI | CIN | Away | R | Expected Lineup |
| 5 | Nick Castellanos | RF | PHI | CIN | Away | R | Expected Lineup |
| 6 | Alec Bohm | 3B | PHI | CIN | Away | R | Expected Lineup |
| 7 | Darick Hall | DH | PHI | CIN | Away | L | Expected Lineup |
| 8 | Bryson Stott | SS | PHI | CIN | Away | L | Expected Lineup |
| 9 | Matt Vierling | CF | PHI | CIN | Away | R | Expected Lineup |
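If you want the same Excel output your original BaseballPress script produced, the DataFrame built above can be written out the same way. A minimal follow-up (the filename is just an example, and openpyxl must be installed as in your original imports):

# write the scraped Rotowire lineups to Excel, mirroring the original script's last step
df.to_excel("MLB Predicted Lineups.xlsx", sheet_name="sheet1", index=False)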
I am trying to scrape a table which in some cells has a "graphical" element (an up/down arrow), using R. Unfortunately, the rvest function html_table seems to skip these elements. This is how such a cell with an arrow looks in HTML:
<td>
    <span style="font-weight: bold; color: darkgreen">Ba2</span>
    <i class="glyphicon glyphicon-arrow-down" title="negative outlook"></i>
</td>
The code I am using is:
require(rvest)
require(tidyverse)
url = "https://tradingeconomics.com/country-list/rating"
#bypass company firewall
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
tables <- content %>% html_table(fill = TRUE, trim=TRUE)
But for the cell above, for example, it gives me only the string Ba2. Is there a way to also include the arrows somehow (as text, e.g. Ba2 neg)? A solution in Python would also be useful if R does not have such functionality.
Thank you!
I don't know if this is possible in R but in Python this will give you the required results.
I have tried to print the first few rows to give you an idea of how the data looks.
pos denotes arrow-up and neg denotes arrow-down.
from bs4 import BeautifulSoup
import requests

url = 'https://tradingeconomics.com/country-list/rating'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

t = soup.find('table', attrs={'id': 'ctl00_ContentPlaceHolder1_ctl01_GridView1'})
tr = t.findAll('tr')
for i in range(1, 10):
    tds = tr[i].findAll('td')
    for j in tds:
        fa_down = j.find('i', class_='glyphicon-arrow-down')
        fa_up = j.find('i', class_='glyphicon-arrow-up')
        if fa_up:
            print(f'{j.text.strip()} (pos)')
        elif fa_down:
            print(f'{j.text.strip()} (neg)')
        else:
            print(f'{j.text.strip()}')
Output:
+------------+---------+-----------+-----------+---------+---------+
| Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 |
+------------+---------+-----------+-----------+---------+---------+
| Albania | B+ | B1 | | | 35 |
| Andorra | BBB | | BBB+ | | 62 |
| Angola | CCC+ | Caa1 | CCC | | 21 |
| Argentina | CCC+ | Ca | CCC | CCC | 15 |
| Armenia | | Ba3 | B+ | | 16 |
| Aruba | BBB | | BB | | 52 |
| Australia | AAA | Aaa | AAA (neg) | AAA | 100 |
| Austria | AA+ | Aa1 | AA+ | AAA | 96 |
| Azerbaijan | BB+ | Ba2 (pos) | BB+ | | 48 |
+------------+---------+-----------+-----------+---------+---------+
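If you would rather end up with a data frame (closer to what html_table gives you in R) than printed rows, the same arrow check can be applied to every row. A rough sketch continuing from the t table found above; treating the first tr as the header row is an assumption, so adjust it if the page structure differs:

import pandas as pd

data = []
for row in t.findAll('tr')[1:]:  # skip the header row
    cells = []
    for td in row.findAll('td'):
        text = td.text.strip()
        if td.find('i', class_='glyphicon-arrow-up'):
            text += ' (pos)'
        elif td.find('i', class_='glyphicon-arrow-down'):
            text += ' (neg)'
        cells.append(text)
    data.append(cells)

headers = [c.text.strip() for c in t.findAll('tr')[0].findAll(['th', 'td'])]
df = pd.DataFrame(data, columns=headers if data and len(headers) == len(data[0]) else None)
print(df.head())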
from bs4 import BeautifulSoup
import numpy as np
import requests
from selenium import webdriver
from nltk.tokenize import sent_tokenize, word_tokenize

html = webdriver.Firefox(executable_path=r'D:\geckodriver.exe')
html.get("https://www.tsa.gov/coronavirus/passenger-throughput")

def TSA_travel_numbers(html):
    soup = BeautifulSoup(html, 'lxml')
    for i, rows in enumerate(soup.find('div', class_='view-content'), 1):
        # print(rows.content)
        for header in rows.find('tr'):
            number = rows.find_all('td', class_='views-field views-field field-2021-throughput views-align-center')
            print(number.text)

TSA_travel_numbers(html.page_source)
My error as follows :
Traceback (most recent call last):
File "TSA_travel.py", line 23, in <module>
TSA_travel_numbers(html.page_source)
File "TSA_travel.py", line 15, in TSA_travel_numbers
for header in rows.find('tr'):
TypeError: 'int' object is not iterable
What is happening here?
I can't iterate through the 'tr' tags. Please help me solve this problem.
Sorry for your time, and thanks in advance!
As the error says, you can't iterate over an int, which is what rows.find('tr') returns here: iterating over that div yields plain-text NavigableString children as well as Tags, and on a string .find is str.find, which returns an integer index.
Also, there's no need for a webdriver as data on the page is static.
Here's my take on it:
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

def get_page(url):
    return requests.get(url).text

def get_data(page):
    soup = BeautifulSoup(page, 'lxml')
    return [
        item.getText(strip=True) for item in soup.select(".views-align-center")
    ]

def build_table(table_rows):
    t = [table_rows[i:i + 4] for i in range(0, len(table_rows[1:]), 4)]
    h = t[0]
    return t[1:], h

if __name__ == '__main__':
    source = "https://www.tsa.gov/coronavirus/passenger-throughput"
    table, header = build_table(get_data(get_page(source)))
    print(tabulate(table, headers=header, tablefmt="pretty"))
Output:
+------------+--------------------------+--------------------------+--------------------------+
| Date | 2021 Traveler Throughput | 2020 Traveler Throughput | 2019 Traveler Throughput |
+------------+--------------------------+--------------------------+--------------------------+
| 5/9/2021 | 1,707,805 | 200,815 | 2,419,114 |
| 5/8/2021 | 1,429,657 | 169,580 | 1,985,942 |
| 5/7/2021 | 1,703,267 | 215,444 | 2,602,631 |
| 5/6/2021 | 1,644,050 | 190,863 | 2,555,342 |
| 5/5/2021 | 1,268,938 | 140,409 | 2,270,662 |
| 5/4/2021 | 1,134,103 | 130,601 | 2,106,597 |
| 5/3/2021 | 1,463,672 | 163,692 | 2,470,969 |
| 5/2/2021 | 1,626,962 | 170,254 | 2,512,598 |
| 5/1/2021 | 1,335,535 | 134,261 | 1,968,278 |
| 4/30/2021 | 1,558,553 | 171,563 | 2,546,029 |
| 4/29/2021 | 1,526,681 | 154,695 | 2,499,461 |
| 4/28/2021 | 1,184,326 | 119,629 | 2,256,442 |
| 4/27/2021 | 1,077,199 | 110,913 | 2,102,068 |
| 4/26/2021 | 1,369,410 | 119,854 | 2,412,770 |
| 4/25/2021 | 1,571,220 | 128,875 | 2,506,809 |
| 4/24/2021 | 1,259,724 | 114,459 | 1,990,464 |
| 4/23/2021 | 1,521,393 | 123,464 | 2,521,897 |
| 4/22/2021 | 1,509,649 | 111,627 | 2,526,961 |
| 4/21/2021 | 1,164,099 | 98,968 | 2,254,209 |
| 4/20/2021 | 1,082,443 | 92,859 | 2,227,475 |
| 4/19/2021 | 1,412,500 | 99,344 | 2,594,171 |
| 4/18/2021 | 1,572,383 | 105,382 | 2,356,802 |
| 4/17/2021 | 1,277,815 | 97,236 | 1,988,205 |
| 4/16/2021 | 1,468,218 | 106,385 | 2,457,133 |
| 4/15/2021 | 1,491,435 | 95,085 | 2,616,158 |
| 4/14/2021 | 1,152,703 | 90,784 | 2,317,381 |
| 4/13/2021 | 1,085,034 | 87,534 | 2,208,688 |
| 4/12/2021 | 1,468,972 | 102,184 | 2,484,580 |
| 4/11/2021 | 1,561,495 | 90,510 | 2,446,801 |
and so on ...
Or, for an even shorter approach, just use pandas:
import pandas as pd
import requests
from tabulate import tabulate

if __name__ == '__main__':
    source = "https://www.tsa.gov/coronavirus/passenger-throughput"
    df = pd.read_html(requests.get(source).text, flavor="bs4")[0]
    print(tabulate(df.head(10), tablefmt="pretty", showindex=False))
Output:
+-----------+-----------+--------+---------+
| 5/9/2021 | 1707805.0 | 200815 | 2419114 |
| 5/8/2021 | 1429657.0 | 169580 | 1985942 |
| 5/7/2021 | 1703267.0 | 215444 | 2602631 |
| 5/6/2021 | 1644050.0 | 190863 | 2555342 |
| 5/5/2021 | 1268938.0 | 140409 | 2270662 |
| 5/4/2021 | 1134103.0 | 130601 | 2106597 |
| 5/3/2021 | 1463672.0 | 163692 | 2470969 |
| 5/2/2021 | 1626962.0 | 170254 | 2512598 |
| 5/1/2021 | 1335535.0 | 134261 | 1968278 |
| 4/30/2021 | 1558553.0 | 171563 | 2546029 |
+-----------+-----------+--------+---------+
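pd.read_html normally picks up the column names from the table's header row; they just aren't shown above because tabulate was called without headers. Assuming the headers were parsed, you can print them the same way as in the first snippet:

print(tabulate(df.head(10), headers="keys", tablefmt="pretty", showindex=False))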
I am trying to scrape the th element, but the result keeps returning None. What am I doing wrong?
This is the code I have tried:
import requests
import bs4
import urllib3
dateList = []
openList = []
closeList = []
highList = []
lowList = []
r = requests.get(
'https://coinmarketcap.com/currencies/bitcoin/historical-data/')
soup = bs4.BeautifulSoup(r.text, 'lxml')
td = soup.find('th')
print(td)
There's an API endpoint so you can fetch the data from there.
Here's how:
import pandas as pd
import requests
from tabulate import tabulate
api_endpoint = "https://web-api.coinmarketcap.com/v1/cryptocurrency/ohlcv/historical?id=1&convert=USD&time_start=1609804800&time_end=1614902400"
bitcoin = requests.get(api_endpoint).json()
df = pd.DataFrame([q["quote"]["USD"] for q in bitcoin["data"]["quotes"]])
print(tabulate(df, headers="keys", showindex=False, disable_numparse=True, tablefmt="pretty"))
Output:
+----------------+----------------+----------------+----------------+--------------------+-------------------+--------------------------+
| open | high | low | close | volume | market_cap | timestamp |
+----------------+----------------+----------------+----------------+--------------------+-------------------+--------------------------+
| 34013.614533 | 36879.69856854 | 33514.03374162 | 36824.36441009 | 75289433810.59091 | 684671246323.6501 | 2021-01-06T23:59:59.999Z |
| 36833.87435728 | 40180.3679073 | 36491.18981083 | 39371.04235311 | 84762141031.49448 | 732062681138.1346 | 2021-01-07T23:59:59.999Z |
| 39381.76584266 | 41946.73935079 | 36838.63599637 | 40797.61071993 | 88107519479.50471 | 758625941266.7522 | 2021-01-08T23:59:59.999Z |
| 40788.64052286 | 41436.35000639 | 38980.87690625 | 40254.54649816 | 61984162837.0747 | 748563483043.1383 | 2021-01-09T23:59:59.999Z |
| 40254.21779758 | 41420.19103255 | 35984.62712175 | 38356.43950662 | 79980747690.35463 | 713304617760.9486 | 2021-01-10T23:59:59.999Z |
| 38346.52950301 | 38346.52950301 | 30549.59876946 | 35566.65594049 | 123320567398.62296 | 661457321418.0524 | 2021-01-11T23:59:59.999Z |
| 35516.36114084 | 36568.52697414 | 32697.97662163 | 33922.9605815 | 74773277909.4566 | 630920422745.0479 | 2021-01-12T23:59:59.999Z |
| 33915.11958124 | 37599.96059774 | 32584.66767186 | 37316.35939997 | 69364315979.27992 | 694069582193.7559 | 2021-01-13T23:59:59.999Z |
| 37325.10763475 | 39966.40524241 | 36868.5632453 | 39187.32812109 | 63615990033.01017 | 728904366964.3611 | 2021-01-14T23:59:59.999Z |
| 39156.7080858 | 39577.71118833 | 34659.58974449 | 36825.36585131 | 67760757880.723885 | 685005864471.3622 | 2021-01-15T23:59:59.999Z |
| 36821.64873201 | 37864.36887891 | 35633.55401669 | 36178.13890106 | 57706187875.104546 | 673000645230.8221 | 2021-01-16T23:59:59.999Z |
| 36163.64923243 | 36722.34987621 | 34069.32218533 | 35791.27792129 | 52359854336.21185 | 665831621390.9865 | 2021-01-17T23:59:59.999Z |
| 35792.23666766 | 37299.28580604 | 34883.84404829 | 36630.07568284 | 49511702429.3542 | 681470030572.0747 | 2021-01-18T23:59:59.999Z |
| 36642.23272357 | 37755.89185872 | 36069.80639361 | 36069.80639361 | 57244195485.50075 | 671081200699.8711 | 2021-01-19T23:59:59.999Z |
and so on ...
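The time_start and time_end query parameters in that endpoint look like Unix timestamps (1609804800 is 2021-01-05 00:00 UTC), so a different window can presumably be requested by rebuilding the URL. A sketch under that assumption (I can't guarantee the endpoint still accepts these exact parameters):

from datetime import datetime, timezone
import requests

def ohlcv_url(start, end, coin_id=1, convert="USD"):
    # time_start / time_end appear to be Unix epoch seconds; id=1 is Bitcoin in the URL above
    def ts(d):
        return int(d.replace(tzinfo=timezone.utc).timestamp())
    return ("https://web-api.coinmarketcap.com/v1/cryptocurrency/ohlcv/historical"
            f"?id={coin_id}&convert={convert}&time_start={ts(start)}&time_end={ts(end)}")

bitcoin = requests.get(ohlcv_url(datetime(2021, 1, 5), datetime(2021, 3, 5))).json()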
The Response object does have a text attribute, but you could try parsing the raw bytes instead:
soup = bs4.BeautifulSoup(r.content, 'lxml')
More likely, though, the historical-data table is rendered client-side, so the th elements simply aren't in the fetched HTML, which is why the API approach above works better.