Okay, I've been beating my head against the wall on this one - I'm stuck! I'm trying to build a function that takes the Favorite from Sagarin's College Football site and calculates the spread, including the home advantage.
I am trying to pull the "Predictions_with_Totals" from Sagarin's site:
http://sagarin.com/sports/cfsend.htm#Predictions_with_Totals
I can get to it with the following code:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
html = requests.get("http://sagarin.com/sports/cfsend.htm#Predictions_with_Totals").text
soup = bs(html, "html.parser")
#find and create the table we want to import
collegeFB_HOME_ALL = soup.find_all("pre")
collegeFB_HOME = collegeFB_HOME_ALL[6]
df_collegeFB = collegeFB_HOME
This gets me a very nice table with a few headers I would need to get past to get to the "meat" of the data.
Predictions_with_Totals
These are the "regular method". _
HOME or CLOSEBY (c) team in CAPS _
both teams in lower case _
means "n" for NEUTRAL location Rating Favorite _
MONEY=odds to 100 _
FAVORITE Rating Predict Golden Recent UNDERDOG ODDS PCT% TOTAL _
======================================================================================================
CENTRAL FLORIDA(UCF) 6.35 4.66 5.99 7.92 smu 200 67% 52.40
ALABAMA 20.86 19.07 17.01 26.30 texas a&m 796 89% 42.65
snipped.....
However, I can't get rid of the top HTML code to format this into something useful. If he had made this a table or even a list, I think I would find it a lot easier.
I have tried to make a dictionary and use row.find based on searches here, but I don't know why it isn't working for me - maybe I need to trash the first few rows before the "FAVORITES" row? How would I do that?
output = []
for row in df_collegeFB:
    test = {}
    test["headers"] = row.find("FAVORITES")
    test['data'] = row.find('all')
    output.append(test)
Just gives me garbage. I'm sure I'm putting garbage in so not surprised I'm getting garbage out.
print(output)
[{'headers': -1, 'data': -1}, {'headers': None, 'data': None}, {'headers': -1, 'data': 1699}, {'headers': None, 'data': None}, {'headers': -1, 'data': -1}]
Not sure exactly what you are after, but if you are trying to get that table, you can use regex. It's probably not the most efficient way, but nonetheless it gets the table into a dataframe:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import re
html = requests.get("http://sagarin.com/sports/cfsend.htm#Predictions_with_Totals").text
soup = bs(html, "html.parser")
#find and create the table we want to import
collegeFB_HOME_ALL = str(soup.find_all("pre")[6])
pattern = re.compile(r"\s{1,}([a-zA-Z\(\)].*)\s{1,}([0-9\.\-]+)\s{1,}([0-9\.\-]+)\s{1,}([0-9\.\-]+)\s{1,}([0-9\.\-]+)\s{1,}(\D+)([0-9]+)\s{1,}([0-9%]+)\s{1,}([0-9\.]+)")
rows = []
# find all matches to groups
for match in pattern.finditer(collegeFB_HOME_ALL):
    row = {}
    for i, col in enumerate(['FAVORITE', 'Rating', 'Predict', 'Golden', 'Recent', 'UNDERDOG', 'ODDS', 'PCT%', 'TOTAL'], start=1):
        row[col] = match.group(i).strip()
    rows.append(row)
df = pd.DataFrame(rows)
Output:
print(df)
FAVORITE Rating Predict ... ODDS PCT% TOTAL
0 CENTRAL FLORIDA(UCF) 6.35 4.66 ... 200 67% 52.40
1 ALABAMA 20.86 19.07 ... 796 89% 42.65
2 oregon 12.28 11.89 ... 362 78% 75.82
3 washington 8.28 8.47 ... 244 71% 64.72
4 james madison 8.08 8.52 ... 239 70% 64.71
.. ... ... ... ... ... ... ...
104 east tennessee state 7.92 7.75 ... 235 70% 41.16
105 WEBER STATE 15.32 17.25 ... 482 83% 62.36
106 delaware 2.10 2.89 ... 126 56% 38.73
107 YALE 0.87 0.83 ... 110 52% 54.32
108 YOUNGSTOWN STATE 2.11 4.51 ... 127 56% 48.10
[109 rows x 9 columns]
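As a possible follow-up (not part of the original answer): the regex captures everything as strings, so if you want to do the spread arithmetic from the question, such as applying a home-advantage adjustment, it may help to convert the numeric columns first. The spread_for helper and the home_edge parameter below are hypothetical names, just to sketch the idea:
for col in ['Rating', 'Predict', 'Golden', 'Recent', 'TOTAL']:
    df[col] = pd.to_numeric(df[col], errors='coerce')  # non-numeric values become NaN

def spread_for(favorite, home_edge=0.0):
    # Hypothetical helper: look up a favorite (case-insensitive) and return its
    # predicted spread plus whatever home-advantage adjustment you decide on.
    match = df[df['FAVORITE'].str.lower() == favorite.lower()]
    return None if match.empty else match['Predict'].iloc[0] + home_edge

print(spread_for('ALABAMA'))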
I've read through countless other posts and tried a lot of techniques, but I can't seem to get the data I want from a table on the below website. I can only return other divs and their classes, but not the values.
I am looking to get all the rows from the three columns (by airline, by origin airport, by destination airport) here:
https://flightaware.com/live/cancelled
I've tried searching for the 'th class' but it only returns the div information and not the data.
Any help is appreciated
Thank you
my attempt:
rows = soup.findAll('table', attrs={'class': 'cancellation_boards'})
for r in rows:
    t = r.find_all_next('div', attrs={'class': 'cancellation_board'})
for r in rows:
    r.text
The data you see is loaded via an Ajax request, so BeautifulSoup doesn't see it. You can simulate the request via requests. To load the data into one big dataframe, you can use the following example:
import requests
import pandas as pd
url = "https://flightaware.com/ajax/airport/cancelled_count.rvt"
params = {
    "type": "airline",
    "timeFilter": "((b.sch_block_out BETWEEN '2022-04-02 8:00' AND '2022-04-03 8:00') OR (b.sch_block_out IS NULL AND b.filed_departuretime BETWEEN '2022-04-02 8:00' AND '2022-04-03 8:00'))",
    "timePeriod": "today",
    "airportFilter": "",
}
all_dfs = []
for params["type"] in ("airline", "destination", "origin"):
    df = pd.read_html(requests.get(url, params=params).text)[0]
    df["type"] = params["type"]
    all_dfs.append(df)
df_final = pd.concat(all_dfs)
print(df_final)
df_final.to_csv("data.csv", index=False)
Prints:
Airline Airport Cancelled Delayed type
Airline Airport # % # %
0 China Eastern NaN 509 45% 28 2% airline
1 Spring Airlines NaN 443 82% 5 0% airline
2 Southwest NaN 428 12% 1369 39% airline
3 American Airlines NaN 317 10% 472 16% airline
4 Delta NaN 229 8% 444 16% airline
5 Spirit NaN 190 23% 207 26% airline
6 Hainan Airlines NaN 167 41% 9 2% airline
7 JetBlue NaN 144 14% 494 48% airline
8 Lion Air NaN 129 20% 53 8% airline
9 easyJet NaN 121 8% 471 32% airline
...
and saves data.csv.
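One thing to note (my addition, not from the original answer): the timeFilter above is hardcoded to the window 2022-04-02 8:00 to 2022-04-03 8:00. Assuming the endpoint simply expects the same string with a current 24-hour window, it could be built dynamically, for example:
from datetime import datetime, timedelta

start = datetime.utcnow().replace(hour=8, minute=0, second=0, microsecond=0)
end = start + timedelta(days=1)
fmt = "%Y-%m-%d %H:%M"  # zero-padded hour; assumed equivalent to the original '8:00' form
params["timeFilter"] = (
    "((b.sch_block_out BETWEEN '{0}' AND '{1}') "
    "OR (b.sch_block_out IS NULL AND b.filed_departuretime BETWEEN '{0}' AND '{1}'))"
).format(start.strftime(fmt), end.strftime(fmt))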
As the url is dynamic, you can also grab the table data with pandas and selenium.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
driver = webdriver.Chrome(ChromeDriverManager().install())
url ="https://flightaware.com/live/cancelled"
driver.maximize_window()
driver.get(url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'lxml')
table=soup.select_one('table.cancellation_boards')
#driver.close()
df = pd.read_html(str(table),header=0)[0]
print(df)
Output:
By airline Unnamed: 1 By origin airport Unnamed: 3 By destination airport
0 Cancelled Cancelled Delayed Delayed Airline
1 # % # % Airline
2 Cancelled Cancelled Delayed Delayed Airport
3 # % # % Airport
4 Cancelled Cancelled Delayed Delayed Airport
.. ... ... ... ... ...
308 10 3% 55 17% Luis Munoz Marin Intl (SJU)
309 10 3% 126 39% Geneva Cointrin Int'l (GVA)
310 9 2% 33 10% Sydney (SYD)
311 9 11% 17 21% Punta Gorda (PGD)
312 9 3% 10 3% Chengdu Shuangliu Int'l (CTU)
[313 rows x 5 columns]
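As a possible cleanup step (my addition, not part of the answer above): the first few rows of that frame are repeated header fragments rather than data, so one rough way to keep only the real rows is to drop rows whose first column is not numeric:
clean = df[pd.to_numeric(df.iloc[:, 0], errors="coerce").notna()].reset_index(drop=True)
print(clean)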
I'm trying to extract just the header values from a Wikipedia table into a list. The following code is what I have so far, but I can't get the output correctly.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
soup = BeautifulSoup(page.text, "html.parser")
table = soup.find('table')
column_names = [item.get_text() for item in table.find_all('th')]
column_names[2:18]
# current output: ['Origin of name[2][3]\n', 'Group\n','Period\n', 'Block\n' ...]
# expected outout ['Atomic Number', 'Symbol', 'Name', 'Origin of name',
# 'Group', 'Period', 'Standard atomic weight', 'Density',
# 'Melting Point'...]
I believe you need to do some data cleaning based on how the html is structured. The table has a multi-index header, so you won't get a flat list of columns. Remember pandas has the read_html() function, which lets you pass a raw html string and does the parsing for you, removing the need to use BeautifulSoup or do any html parsing yourself.
Thinking pragmatically, I believe for this particular case it's better to do it manually; otherwise you will need a lot of string manipulation to get a clean list of column names, and it's faster to just write them out.
Given you have already done most of the writing, for an easier and more time-efficient solution I recommend:
df = pd.read_html(page.text)[0]
column_names = ['Atomic Number', 'Symbol', 'Name', 'Origin of name', 'Group', 'Period','Block','Standard atomic weight', 'Density', 'Melting Point','Boiling Point','Specific heat capacity','Electro-negativity',"Abundance in Earth's crust",'Origin','Phase at r.t.']
df.columns = column_names
Which outputs a nice and readable:
Atomic Number Symbol ... Origin Phase at r.t.
0 1 H ... primordial gas
1 2 He ... primordial gas
2 3 Li ... primordial solid
3 4 Be ... primordial solid
4 5 B ... primordial solid
.. ... ... ... ... ...
113 114 Fl ... synthetic unknown phase
114 115 Mc ... synthetic unknown phase
115 116 Lv ... synthetic unknown phase
116 117 Ts ... synthetic unknown phase
117 118 Og ... synthetic unknown phase
Otherwise if you want to go for a fully-automated approach:
page = requests.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
df = pd.read_html(page.text)[0]
df.columns = df.columns.droplevel()
Outputs:
Element Origin of name[2][3] Group Period Block Standardatomicweight[a] Density[b][c] Melting point[d] Boiling point[e] Specificheatcapacity[f] Electronegativity[g] Abundancein Earth'scrust[h] Origin[i] Phase at r.t.[j]
Atomic number.mw-parser-output .nobold{font-weight:normal}Z Symbol Name Unnamed: 3_level_2 Unnamed: 4_level_2 Unnamed: 5_level_2 Unnamed: 6_level_2 (Da) ('"`UNIQ--templatestyles-00000016-QINU`"'g/cm3) (K) (K) (J/g · K) Unnamed: 12_level_2 (mg/kg) Unnamed: 14_level_2 Unnamed: 15_level_2
0 1 H Hydrogen Greek elements hydro- and -gen, 'water-forming' 1.0 1 s-block 1.008 0.00008988 14.01 20.28 14.304 2.20 1400 primordial gas
1 2 He Helium Greek hḗlios, 'sun' 18.0 1 s-block 4.0026 0.0001785 –[k] 4.22 5.193 – 0.008 primordial gas
2 3 Li Lithium Greek líthos, 'stone' 1.0 2 s-block 6.94 0.534 453.69 1560 3.582 0.98 20 primordial solid
3 4 Be Beryllium Beryl, a mineral (ultimately from the name of ... 2.0 2 s-block 9.0122 1.85 1560 2742 1.825 1.57 2.8 primordial solid
4 5 B Boron Borax, a mineral (from Arabic bawraq) 13.0 2 p-block 10.81 2.34 2349 4200 1.026 2.04 10 primordial solid
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113 114 Fl Flerovium Flerov Laboratory of Nuclear Reactions, part o... 14.0 7 p-block [289] (9.928) (200)[b] (380) – – – synthetic unknown phase
114 115 Mc Moscovium Moscow, Russia, where the element was first sy... 15.0 7 p-block [290] (13.5) (700) (1400) – – – synthetic unknown phase
115 116 Lv Livermorium Lawrence Livermore National Laboratory in Live... 16.0 7 p-block [293] (12.9) (700) (1100) – – – synthetic unknown phase
116 117 Ts Tennessine Tennessee, United States, where Oak Ridge Nati... 17.0 7 p-block [294] (7.2) (700) (883) – – – synthetic unknown phase
117 118 Og Oganesson Yuri Oganessian, Russian physicist 18.0 7 p-block [294] (7) (325) (450) – – – synthetic unknown phase
And the string cleaning needed to make it look nice and tidy is going to take a lot longer than writing a few column names.
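For completeness, a rough sketch of what that cleaning might look like on the question's original th extraction (my addition; it assumes the footnote markers such as "[2][3]" and trailing whitespace are the only noise):
import re

column_names = [
    re.sub(r"\[.*?\]", "", th.get_text()).strip()  # drop footnote markers and whitespace
    for th in table.find_all('th')
]
print(column_names[2:18])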
I have a very specific issue which I have not been able to find a solution to.
Recently, I began a project for which I am monitoring about 100 ETFs and Mutual funds based on specific data acquired from Morningstar. The current solution works great - but I later found out that I need more data from another "Tab" within the website. Specifically, I am trying to get data from the 1st table from the following website: https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1
Right now, I have the code below for scraping data from a table on the "Indhold" tab of the website and exporting it to Excel. My question is therefore: how do I adjust the code to scrape data from another part of the website?
To briefly explain the code and reiterate: the code below scrapes data from another tab of the same websites. The many, many IDs are for each website, representing each mutual fund/ETF. The setup works very well, so I am hoping to simply adjust it (if that is possible) to extract the table from the link above. I have very limited knowledge of the topic, so any help is much, much appreciated.
import requests
import re
import pandas as pd
from openpyxl import load_workbook
auth = 'https://www.morningstar.dk/Common/funds/snapshot/PortfolioSAL.aspx'
# Create a Pandas Excel writer using XlsxWriter as the engine.
path= r'/Users/karlemilthulstrup/Downloads/data2.xlsm'
book = load_workbook(path ,read_only = False, keep_vba=True)
writer = pd.ExcelWriter(path, engine='openpyxl')
writer.book = book
ids = ['F00000VA2N','F0GBR064OO','F00000YKC2','F000015MVX','F0000020YA','0P00015YTR','0P00015YTT','F0GBR05V8D','F0GBR06XKI','F000013CKH','F00000MG6K','F000014G49',
'F00000WC0Z','F00000QSD2','F000016551','F0000146QH','F0000146QI','F0GBR04KZS','F0GBR064VU','F00000VXLM','F0000119R1','F0GBR04L4T','F000015CS3','F000015CS5','F000015CS6',
'F000015CS4','F000013BZE','F0GBR05W0Q','F000016M1C','F0GBR04L68','F00000Z9T9','F0GBR04JI8','F00000Z9TG','F0GBR04L2P','F000014CU8','F00000ZG2G','F00000MLEW',
'F000013ZOY','F000016614','F00000WUI9','F000015KRL','F0GBR04LCR','F000010ES9','F00000P780','F0GBR04HC3','F000015CV6','F00000YWCK','F00000YWCJ','F00000NAI5',
'F0GBR04L81','F0GBR05KNU','F0GBR06XKB','F00000NAI3','F0GBR06XKF','F000016UA9','F000013FC2','F000014NRE','0P0000CNVT','0P0000CNVX','F000015KRI',
'F000015KRG','F00000XLK7','F0GBR04IDG','F00000XLK6','F00000073J','F00000XLK4','F000013CKG','F000013CKJ','F000013CKK','F000016P8R','F000016P8S','F000011JG6',
'F000014UZQ','F0000159PE','F0GBR04KZG','F0000002OY','F00000TW9K','F0000175CC','F00000NBEL','F000016054','F000016056','F00000TEYP','F0000025UI','F0GBR04FV7',
'F00000WP01','F000011SQ4','F0GBR04KZO','F000010E19','F000013ZOX','F0GBR04HD7','F00000YKC1','F0GBR064UG','F00000JSDD','F000010ROF','F0000100CA','F0000100CD',
'FOGBR05KQ0','F0GBR04LBB','F0GBR04LBZ','F0GBR04LCN','F00000WLA7','F0000147D7','F00000ZB5E','F00000WC0Y']
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36'}
payload = {
    'languageId': 'da-DK',
    'locale': 'da-DK',
    'clientId': 'MDC_intl',
    'benchmarkId': 'category',
    'component': 'sal-components-mip-factor-profile',
    'version': '3.40.1'}
for api_id in ids:
    payload = {
        'Site': 'dk',
        'FC': '%s' %api_id,
        'IT': 'FO',
        'LANG': 'da-DK',}
    response = requests.get(auth, params=payload)
    search = re.search('(tokenMaaS:[\w\s]*\")(.*)(\")', response.text, re.IGNORECASE)
    bearer = 'Bearer ' + search.group(2)
    headers.update({'Authorization': bearer})
    url = 'https://www.us-api.morningstar.com/sal/sal-service/fund/factorProfile/%s/data' %api_id
    jsonData = requests.get(url, headers=headers, params=payload).json()
    rows = []
    for k, v in jsonData['factors'].items():
        row = {}
        row['factor'] = k
        historicRange = v.pop('historicRange')
        row.update(v)
        for each in historicRange:
            row.update(each)
        rows.append(row.copy())
    df = pd.DataFrame(rows)
    sheetName = jsonData['id']
    df.to_excel(writer, sheet_name=sheetName, index=False)
    print('Finished: %s' %sheetName)
writer.save()
writer.close()
If I understand you correctly, you want to get the first table of that URL in the form of a pandas dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# load the page into soup:
url = "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# find correct table:
tbl = soup.select_one(".returnsCalenderYearTable")
# remove the first row (it's not header):
tbl.tr.extract()
# convert the html to pandas DF:
df = pd.read_html(str(tbl))[0]
# move the first row to header:
df.columns = map(str, df.loc[0])
df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
print(df)
Prints:
Name 2014* 2015* 2016* 2017* 2018 2019 2020 31-08
0 Samlet afkast % 2627 1490 1432 584 -589 2648 -482 1841
1 +/- Kategori 1130 583 808 -255 164 22 -910 -080
2 +/- Indeks 788 591 363 -320 -127 -262 -1106 -162
3 Rank i kategori 2 9 4 80 38 54 92 63
EDIT: To load from multiple URLs:
import requests
import pandas as pd
from bs4 import BeautifulSoup
urls = [
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1",
]
all_data = []
for url in urls:
    print("Loading URL {}".format(url))
    # load the page into soup:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # find correct table:
    tbl = soup.select_one(".returnsCalenderYearTable")
    # remove the first row (it's not header):
    tbl.tr.extract()
    # convert the html to pandas DF:
    df = pd.read_html(str(tbl))[0]
    # move the first row to header:
    df.columns = map(lambda x: str(x).replace("*", "").strip(), df.loc[0])
    df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
    df["Company"] = soup.h1.text.split("\n")[0].strip()
    df["URL"] = url
    all_data.append(df.loc[:, ~df.isna().all()])
df = pd.concat(all_data, ignore_index=True)
print(df)
Prints:
Name 2016 2017 2018 2019 2020 31-08 Company URL
0 Samlet afkast % 1755.0 942.0 -1317.0 1757.0 -189.0 3018 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
1 +/- Kategori 966.0 -54.0 -186.0 -662.0 -967.0 1152 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
2 +/- Indeks 686.0 38.0 -854.0 -1153.0 -813.0 1015 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
3 Rank i kategori 10.0 24.0 85.0 84.0 77.0 4 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
4 Samlet afkast % NaN 1016.0 -940.0 1899.0 767.0 2238 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
5 +/- Kategori NaN 20.0 190.0 -520.0 -12.0 373 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
6 +/- Indeks NaN 112.0 -478.0 -1011.0 143.0 235 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
7 Rank i kategori NaN 26.0 69.0 92.0 43.0 25 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
8 Samlet afkast % NaN NaN -939.0 1898.0 766.0 2239 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
9 +/- Kategori NaN NaN 191.0 -521.0 -12.0 373 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
10 +/- Indeks NaN NaN -477.0 -1012.0 142.0 236 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
11 Rank i kategori NaN NaN 68.0 92.0 44.0 24 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
12 Samlet afkast % NaN NaN NaN NaN NaN 2384 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
13 +/- Kategori NaN NaN NaN NaN NaN 518 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
14 +/- Indeks NaN NaN NaN NaN NaN 381 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
15 Rank i kategori NaN NaN NaN NaN NaN 18 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
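If you want to keep the question's Excel workflow, here is a minimal sketch (my addition; the output filename is arbitrary and the sheet names are just the fund ids pulled out of each URL) of writing each scraped frame to its own sheet:
with pd.ExcelWriter("calendar_year_returns.xlsx") as writer:
    for url, one_df in zip(urls, all_data):
        fund_id = url.split("id=")[1].split("&")[0]  # e.g. 'F00000VA2N'
        one_df.to_excel(writer, sheet_name=fund_id, index=False)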
According to the reply I found in my previous question, I am able to grab the table by web scraping in Python from the URL: https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html. But it only grabs the table partially, up to the row where "Show all" appears.
How can I grab the complete table in Python, which is hidden beyond "Show all"?
Here is the code I am using:
import pandas as pd
import requests
from bs4 import BeautifulSoup
#
vaccineDF = pd.read_html('https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html')[0]
vaccineDF = vaccineDF.reset_index(drop=True)
print(vaccineDF.head(100))
The output only grabs 15 rows (until Show All):
Unnamed: 0_level_0 Doses administered ... Unnamed: 8_level_0 Unnamed: 9_level_0
Unnamed: 0_level_1 Per 100 people ... Unnamed: 8_level_1 Unnamed: 9_level_1
0 World 11 ... NaN NaN
1 Israel 116 ... NaN NaN
2 Seychelles 116 ... NaN NaN
3 U.A.E. 99 ... NaN NaN
4 Chile 69 ... NaN NaN
5 Bahrain 66 ... NaN NaN
6 Bhutan 63 ... NaN NaN
7 U.K. 62 ... NaN NaN
8 United States 61 ... NaN NaN
9 San Marino 60 ... NaN NaN
10 Maldives 59 ... NaN NaN
11 Malta 55 ... NaN NaN
12 Monaco 53 ... NaN NaN
13 Hungary 45 ... NaN NaN
14 Serbia 44 ... NaN NaN
15 Show all Show all ... Show all Show all
Below is a screenshot of the partial table up to "Show all" on the web page (left) and the corresponding inspected elements (right).
You can't get the whole table directly because the full data only appears after clicking the Show all button. So the first step is to trigger a click event on the Show all button, and then we can fetch the whole table.
I have used the Selenium library for the click event on the Show all button. For this scenario I used Selenium's Firefox() webdriver to fetch all the data from the URL. Refer to the code below for fetching the whole table from the given COVID dataset URL:
# Import all the Important Libraries
from selenium import webdriver # This module help to fetch data and on-click event purpose
from pandas.io.html import read_html # This module will help to read 'html' source. So, we can __scrape__ data from it
import pandas as pd # This Module will help to Convert Our Data into 'DataFrame'
# Create 'FireFox' Webdriver Object
driver = webdriver.Firefox()
# Get Website
driver.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")
# Find 'Show all' Button Using 'XPath'
show_all_button = driver.find_element_by_xpath("/html/body/div[1]/main/article/section/div/div/div[4]/div[1]/div/table/tbody/tr[16]")
# Click 'Show all' Button
show_all_button.click()
# Get 'HTML' Content of Page
html_data = driver.page_source
After fetching the whole page, let's see how many tables there are in our COVID dataset URL:
covid_data_tables = read_html(html_data, attrs = {"class":"g-summary-table svelte-2wimac"}, header = None)
# Print Number of Tables Extracted
print ("\nExtracted {num} COVID Data Table".format(num = len(covid_data_tables)), "\n")
# Output of Above Cell:-
Extracted 1 COVID Data Table
Now, let's fetch the Data Table:-
# Print Table Data
covid_data_tables[0].head(20)
# Output of above cell:-
Unnamed: 0_level_0 Doses administered Pct. of population
Unnamed: 0_level_1 Per 100 people Total Vaccinated Fully vaccinated
0 World 11 877933955 – –
1 Israel 116 10307583 60% 56%
2 Seychelles 116 112194 68% 47%
3 U.A.E. 99 9489684 – –
4 Chile 69 12934282 41% 28%
5 Bahrain 66 1042463 37% 29%
6 Bhutan 63 478219 63% –
7 U.K. 62 41505768 49% 13%
8 United States 61 202282923 38% 24%
9 San Marino 60 20424 35% 25%
10 Maldives 59 303752 53% 5.6%
11 Malta 55 264658 38% 17%
12 Monaco 53 20510 30% 23%
13 Hungary 45 4416581 32% 14%
14 Serbia 44 3041740 26% 17%
15 Qatar 43 1209648 – –
16 Uruguay 38 1310591 30% 8.3%
17 Singapore 30 1667522 20% 9.5%
18 Antigua and Barbuda 28 27032 28% –
19 Iceland 28 98672 20% 8.1%
As you can see, "Show all" no longer appears in our dataset, so now we can convert this data table to a DataFrame. To do that, we store the data in CSV format and then read it back into a DataFrame. The code for this is given below:
# HTML Table to CSV Format Conversion For COVID Dataset
covid_data_file = 'covid_data.csv'
covid_data_tables[0].to_csv(covid_data_file, sep = ',')
# Read CSV Data From Data Table for Further Analysis
covid_data = pd.read_csv("covid_data.csv")
So, after storing all the data in CSV format, let's load it into a DataFrame and print the whole dataset:
# Store 'CSV' Data into 'DataFrame' Format
vaccineDF = pd.DataFrame(covid_data)
vaccineDF = vaccineDF.drop(columns=["Unnamed: 0"], axis = 1) # 'drop' Unneccesary Columns from the Dataset
# Print Whole Dataset
vaccineDF
# Output of above cell:-
Unnamed: 0_level_0 Doses administered Doses administered.1 Pct. of population Pct. of population.1
0 Unnamed: 0_level_1 Per 100 people Total Vaccinated Fully vaccinated
1 World 11 877933955 – –
2 Israel 116 10307583 60% 56%
3 Seychelles 116 112194 68% 47%
4 U.A.E. 99 9489684 – –
... ... ... ... ... ...
154 Syria <0.1 2500 <0.1% –
155 Papua New Guinea <0.1 1081 <0.1% –
156 South Sudan <0.1 947 <0.1% –
157 Cameroon <0.1 400 <0.1% –
158 Zambia <0.1 106 <0.1% –
159 rows × 5 columns
From the output above we can see that we have successfully fetched the whole data table. Hope this solution helps you.
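An alternative to the CSV round-trip (my suggestion, not part of the answer above): assuming read_html produced the two-level header shown earlier, the columns can be flattened directly:
vaccineDF = covid_data_tables[0].copy()
vaccineDF.columns = [
    # join the two header levels, dropping the "Unnamed" placeholders
    " ".join(part for part in col if not str(part).startswith("Unnamed")).strip()
    for col in vaccineDF.columns
]
print(vaccineDF.head())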
OWID provides this data, which effectively comes from JHU.
If you want the latest vaccination data by country, it's simple to use the CSV interface:
import requests, io
import pandas as pd
dfraw = pd.read_csv(io.StringIO(requests.get("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv").text))
dfraw["date"] = pd.to_datetime(dfraw["date"])
dfraw.sort_values(["iso_code","date"]).groupby("iso_code", as_index=False).last()
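The resulting frame has many columns; as an example follow-up (column names assumed from the OWID vaccinations.csv schema), you could keep just a few of the commonly used ones:
latest = dfraw.sort_values(["iso_code", "date"]).groupby("iso_code", as_index=False).last()
cols = ["location", "date", "total_vaccinations",
        "people_vaccinated_per_hundred", "people_fully_vaccinated_per_hundred"]
print(latest[cols].head())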
I am working on scraping text using Python from the link: tournament link
Here is my code to get the tabular data;
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr') ## find the table rows
Now, the goal is to obtain the data as a dataframe.
listnew = []
for row in rows:
    row_td = row.find_all('td')
    str_cells = str(row_td)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text() ## obtain text part
    listnew.append(cleantext) ## append to list
df = pd.DataFrame(listnew)
df.head(10)
Then we get following output;
0 []
1 [Finishers:, 577]
2 [Male:, 414]
3 [Female:, 163]
4 []
5 [1, 814, \r\n\r\n JARED WIL...
6 [2, 573, \r\n\r\n NATHAN A ...
7 [3, 687, \r\n\r\n FRANCISCO...
8 [4, 623, \r\n\r\n PAUL MORR...
9 [5, 569, \r\n\r\n DEREK G O..
I don't know why there are newline and carriage-return characters (\r\n\r\n). How can I remove them and get a dataframe in the proper format? Thanks in advance.
Pandas can parse HTML tables, give this a try:
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
table_1_html = soup.find('table', attrs={'id': 'individualResults'})
t_1 = pd.read_html(table_1_html.prettify())[0]
print(t_1)
Output:
Place Bib Name ... Chip Pace Gun Time Team
0 1 814 JARED WILSON ... 5:51 36:24 NaN
1 2 573 NATHAN A SUSTERSIC ... 5:55 36:45 INTEL TEAM F
2 3 687 FRANCISCO MAYA ... 6:05 37:48 NaN
3 4 623 PAUL MORROW ... 6:13 38:37 NaN
4 5 569 DEREK G OSBORNE ... 6:20 39:24 INTEL TEAM F
.. ... ... ... ... ... ... ...
572 573 273 RACHEL L VANEY ... 15:51 1:38:34 NaN
573 574 467 ROHIT B DSOUZA ... 15:53 1:40:32 INTEL TEAM I
574 575 471 CENITA D'SOUZA ... 15:53 1:40:34 NaN
575 576 338 PRANAVI APPANA ... 16:15 1:42:01 NaN
576 577 443 LIBBY B MITCHELL ... 16:20 1:42:10 NaN
[577 rows x 10 columns]
It seems like some cells in the HTML code have a lot of leading and trailing spaces and newlines:
<td>
JARED WILSON
</td>
Use str.strip to remove all leading and trailing whitespace, like this:
BeautifulSoup(str_cells, "lxml").get_text().strip().
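Applied to the question's loop, a per-cell variant (a sketch of my own, stripping each td individually rather than the joined string) could look like this:
listnew = []
for row in rows:
    row_td = row.find_all('td')
    cleantext = [td.get_text().strip() for td in row_td]  # strip each cell individually
    listnew.append(cleantext)
df = pd.DataFrame(listnew)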
Well, looking at the url you provided, you can see the newlines in the HTML source:
...
<td>814</td>
<td>
JARED WILSON
</td>
...
so that's what you get when you scrape. These can easily be removed by the very convenient .strip() string method.
Your DataFrame is not formatted correctly because you are giving it a list of lists, which are not all of the same size (see the first 4 lines), which come from another table located on the top right. One easy fix is to remove the first 4 lines, though it would be way more robust to select the table you want based on its id ("individualResults").
df = pd.DataFrame(listnew[4:])
df.head(10)
Have a look here: BeautifulSoup table to dataframe