Scrape Each Table from Drop Down Menu in Python

I am looking to scrape Division 3 College Basketball stats from the following NCAA stats page:
https://stats.ncaa.org/rankings/change_sport_year_div
To get to the page I am on: after clicking the link, select Sport = Men's Basketball, Year = 2019-2020, and Div = III.
Upon clicking the link, there is a dropdown above the top-left table, labeled "Additional Stats". For each stat there is a table, which you can download as an Excel file, but I want to be more efficient. I was thinking there could be a way to iterate through the dropdown using BeautifulSoup (or perhaps even pd.read_html) to get a dataframe for every stat listed. Is there a way to do this? Going through each stat manually, downloading the Excel file, and reading it into pandas would be a pain. Thank you.

Here is my suggestion: use a combination of requests, BeautifulSoup, and a great HTML table parser from Scott Rome (I modified the parse_html_table function a bit to remove \n and strip whitespace).
First, when you inspect the source code of the page, you can see that the URL takes the form "https://stats.ncaa.org/rankings/national_ranking?academic_year=2020.0&division=3.0&ranking_period=110.0&sport_code=MBB&stat_seq=145.0",
for instance for stat 145, i.e. "Scoring Offense".
You can therefore use the following code on each of these URLs, replacing the 145.0 with the values corresponding to the different stats, which you can also see when you inspect the source code of the page:
# <option value="625">3-pt Field Goal Attempts</option>
# <option value="474">Assist Turnover Ratio</option>
# <option value="216">Assists Per Game</option>
# ...
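If you'd rather not hard-code that list, you could also scrape the option values themselves. A minimal sketch, assuming the <option> elements appear in the HTML of any of the ranking pages (you may need to narrow the search to the stats dropdown's <select> element if other dropdowns on the page also carry numeric values):
import requests
from bs4 import BeautifulSoup

opts_page = requests.get(
    "https://stats.ncaa.org/rankings/national_ranking"
    "?academic_year=2020.0&division=3.0&ranking_period=110.0"
    "&sport_code=MBB&stat_seq=145.0"
).content.decode('utf-8')
opts_soup = BeautifulSoup(opts_page, "html.parser")
# map option value -> stat name, e.g. {'625': '3-pt Field Goal Attempts', ...}
stat_options = {opt["value"]: opt.get_text(strip=True)
                for opt in opts_soup.find_all("option")
                if opt.get("value", "").isdigit()}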
For a specific stat, for instance scoring offense, you can use the following code to extract the table as a pandas DataFrame:
import pandas as pd
from bs4 import BeautifulSoup
import requests
el = "https://stats.ncaa.org/rankings/national_ranking?academic_year=2020.0&division=3.0&ranking_period=110.0&sport_code=MBB&stat_seq=145.0"
page = requests.get(el).content.decode('utf-8')
soup = BeautifulSoup(page, "html.parser")
ta = soup.find_all('table', {"id": "rankings_table"})
# Scott Rome function tweaked a bit
def parse_html_table(table):
    n_columns = 0
    n_rows = 0
    column_names = []
    # Find number of rows and columns
    # we also find the column titles if we can
    for row in table.find_all('tr'):
        # Determine the number of rows in the table
        td_tags = row.find_all('td')
        if len(td_tags) > 0:
            n_rows += 1
            if n_columns == 0:
                # Set the number of columns for our table
                n_columns = len(td_tags)
        # Handle column names if we find them
        th_tags = row.find_all('th')
        if len(th_tags) > 0 and len(column_names) == 0:
            for th in th_tags:
                column_names.append(th.get_text())
    # Safeguard on Column Titles
    if len(column_names) > 0 and len(column_names) != n_columns:
        raise Exception("Column titles do not match the number of columns")
    columns = column_names if len(column_names) > 0 else range(0, n_columns)
    df = pd.DataFrame(columns=columns, index=range(0, n_rows))
    row_marker = 0
    for row in table.find_all('tr'):
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            df.iat[row_marker, column_marker] = column.get_text()
            column_marker += 1
        if len(columns) > 0:
            row_marker += 1
    # remove \n and strip whitespace
    for col in df:
        try:
            df[col] = df[col].str.replace("\n", "")
            df[col] = df[col].str.strip()
        except ValueError:
            pass
    # Convert to float if possible
    for col in df:
        try:
            df[col] = df[col].astype(float)
        except ValueError:
            pass
    return df

example = parse_html_table(ta[0])
The result is
Rank Team GM W-L PTS PPG
0 1 Greenville (SLIAC) 27.0 14-13 3,580 132.6
1 2 Grinnell (Midwest Conference) 25.0 13-12 2,717 108.7
2 3 Pacific (OR) (NWC) 25.0 7-18 2,384 95.4
3 4 Whitman (NWC) 28.0 20-8 2,646 94.5
4 5 Valley Forge (ACAA) 22.0 12-11 2,047 93.0
...
Now, what you have to do is apply this to all the stat values mentioned above.
You can make a function out of the code above and apply it in a for loop to the URL "https://stats.ncaa.org/rankings/national_ranking?academic_year=2020.0&division=3.0&ranking_period=110.0&sport_code=MBB&stat_seq={}".format(stat), where stat ranges over the list of all possible values.
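A minimal sketch of that loop, assuming the parse_html_table function above and a hand-collected list of stat codes (extend it with every <option> value you need):
stat_codes = [625, 474, 216]  # hypothetical subset of the dropdown values

base_url = ("https://stats.ncaa.org/rankings/national_ranking"
            "?academic_year=2020.0&division=3.0&ranking_period=110.0"
            "&sport_code=MBB&stat_seq={}")

frames = {}
for stat in stat_codes:
    page = requests.get(base_url.format(stat)).content.decode('utf-8')
    soup = BeautifulSoup(page, "html.parser")
    tables = soup.find_all('table', {"id": "rankings_table"})
    if tables:  # guard in case a stat page has no rankings table
        frames[stat] = parse_html_table(tables[0])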
Hope it helps.

Maybe a more concise way to do this:
import requests as rq
from bs4 import BeautifulSoup as bs
import pandas as pd
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0"}
params = {"sport_code": "MBB", "stat_seq": "518", "academic_year": "2020.0", "division":"3.0", "ranking_period":"110.0"}
url = "https://stats.ncaa.org/rankings/national_ranking"
resp = rq.post(url, headers=headers, params=params)
soup = bs(resp.content, "html.parser")
colnames = [th.text.strip() for th in soup.find_all("thead")[0].find_all("th")]
data = [[td.text.strip() for td in tr.find_all('td')] for tr in soup.find_all('tbody')[0].find_all("tr")]
df = pd.DataFrame(data, columns=colnames)
df = df.astype({"GM": "int32"})  # convert a column to the dtype you want (astype returns a copy)
You have to look at the XHR requests (in Firefox: F12 -> Network -> XHR).
When you select an item from the dropdown list, the page makes a POST request to the following URL: https://stats.ncaa.org/rankings/national_ranking.
Some params are required to make this POST request; one of them is "stat_seq". Its value corresponds to the "value" attribute of the dropdown options.
The inspector gives you the list of value-to-stat-name correspondences:
<option value="625" selected="selected">3-pt Field Goal Attempts</option>
<option value="474">Assist Turnover Ratio</option>
<option value="216">Assists Per Game</option>
<option value="214">Blocked Shots Per Game</option>
<option value="859">Defensive Rebounds per Game</option>
<option value="642">Fewest Fouls</option>
...
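A minimal sketch of looping that POST request over several stats, reusing the url, headers, and params defined above (the stat names here are just a hand-collected subset of the options):
stat_options = {"625": "3-pt Field Goal Attempts",
                "474": "Assist Turnover Ratio",
                "216": "Assists Per Game"}  # extend as needed

all_dfs = {}
for value, stat_name in stat_options.items():
    params["stat_seq"] = value
    resp = rq.post(url, headers=headers, params=params)
    soup = bs(resp.content, "html.parser")
    colnames = [th.text.strip() for th in soup.find_all("thead")[0].find_all("th")]
    data = [[td.text.strip() for td in tr.find_all("td")]
            for tr in soup.find_all("tbody")[0].find_all("tr")]
    all_dfs[stat_name] = pd.DataFrame(data, columns=colnames)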

Related

How to scrape table data with th and td with BeautifulSoup?

I am new to programming and have been trying to practice web scraping. I found an example where one of the columns I wish to have in my output is part of the table header. I am able to extract all the table data I wish, but have been unable to get the Year dates to show.
from bs4 import BeautifulSoup # this module helps in web scraping.
import requests # this module helps us to download a web page
import pandas as pd
"https://en.wikipedia.org/wiki/World_population"
data = requests.get(url).text
soup = BeautifulSoup(data,"html.parser")
tables = soup.find_all('table')
len(tables)
for index, table in enumerate(tables):
    if ("Global annual population growth" in str(table)):
        table_index = index
print(table_index)
print(tables[table_index].prettify())
population_data = pd.DataFrame(columns=["Year", "Population", "Growth"])
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if (col != []):
        Population = col[0].text.strip()
        Growth = col[1].text.strip()
        population_data = population_data.append({"Population": Population, "Growth": Growth}, ignore_index=True)
population_data
You could use pandas directly here to reach your goal, with pandas.read_html() to scrape the table and DataFrame.T to transpose it:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/World_population')[0].T.reset_index()
df.columns = df.loc[0]
df = df[1:]
df
or the same result with BeautifulSoup and stripped_strings:
import requests
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://en.wikipedia.org/wiki/World_population').text)
pd.DataFrame(
    {list(e.stripped_strings)[0]: list(e.stripped_strings)[1:] for e in soup.table.select('tr')}
)
Output

Population  Year  Years elapsed
         1  1804       200,000+
         2  1930            126
         3  1960             30
         4  1974             14
         5  1987             13
         6  1999             12
         7  2011             12
         8  2022             11
         9  2037             15
        10  2057             20
Actually it's because you are scraping <td> in this line:
col = row.find_all('td')
But if you take a look at a <tr> in the developer tools (F12), you can see that the table also contains a <th> tag, which keeps the year and which you are not scraping. So all you have to do is add this line after the if condition:
year = row.find('th').text, and after that you can append it to population_data.
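Put together, a minimal corrected loop might look like this (keeping your structure; note that DataFrame.append is deprecated in recent pandas, so this sketch collects rows in a list first):
rows = []
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if col != []:
        year = row.find('th').text.strip()  # the year lives in the <th> cell
        population = col[0].text.strip()
        growth = col[1].text.strip()
        rows.append({"Year": year, "Population": population, "Growth": growth})
population_data = pd.DataFrame(rows, columns=["Year", "Population", "Growth"])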

bs4 soup.select() vs. soup.find()

I am trying to scrape the text of some elements in a table using requests and BeautifulSoup, specifically the country names and the 2-letter country codes from this website.
Here is my code, which I have progressively walked back:
import requests
import bs4
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
for i in range(3):
    row = soup.find(f'#row{i} td')
    print(row)  # printing to check progress for now
I had hoped to go row-by-row and walk the tags to get the strings like so (over range 249). However, soup.find() doesn't appear to work; it just prints None. soup.select(), however, works fine:
for i in range(3):
    row = soup.select(f'#row{i} td')
    print(row)
Why does soup.find() not work as expected here?
While .find() returns only the first occurrence of an element, .select() / .find_all() will give you a ResultSet you can iterate over.
There are a lot of ways to reach your goal, but the basic pattern is mostly the same: select the rows of the table and iterate over them.
In this first case I selected the table by its id and, close to your initial approach, the <tr> elements also by their id, using the CSS selector [id^="row"], which matches id attributes whose value starts with "row". In addition I used .stripped_strings to extract the text from the elements, stored it in a list, and picked values by index:
for row in soup.select('#countriesTable tr[id^="row"]'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
or, more precisely, selecting all <tr> in the <tbody> of the tag with id countriesTable:
for row in soup.select('#countriesTable tbody tr'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
...
An alternative, and in my opinion the best, way to scrape tables is to use pandas.read_html(), which works with beautifulsoup under the hood and does most of the work for you:
import pandas as pd
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,:]
or to get only the two specific columns:
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,[1,2]]
   Name            ISO 2
0  Afghanistan     AF
1  Åland Islands   AX
2  Albania         AL
3  Algeria         DZ
4  American Samoa  AS
5  Andorra         AD
...
find() expects the first argument to be the name of the tag you're searching for; it won't work with CSS selectors.
So you'll need:
row = soup.find('tr', { 'id': f"row{i}" })
to get the tr with the desired ID.
Then, to get the 2-letter country code, find the first a with the title "ISO 3166-1 alpha-2 code" and get its .text:
iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text
To get the full name, there is no class name to search for, so I'd take the third td, then the third span inside it, which contains the country name:
name = row.findAll('td')[2].findAll('span')[2].text
Putting it all together gives:
import requests
import bs4
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for i in range(3):
    row = soup.find('tr', { 'id': f"row{i}" })
    iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text
    name = row.findAll('td')[2].findAll('span')[2].text
    print(name, iso)
Which outputs:
Afghanistan  AF
Åland Islands  AX
Albania  AL
find_all() and select() return a list of elements, while find() and select_one() return only a single element.
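For example, a tiny illustration of the difference, assuming a soup object parsed from the page as in the code below:
# find() takes a tag name plus attributes and returns the first match (a Tag) or None
first = soup.find('tr', {'id': 'row0'})
# select() takes a CSS selector and returns a ResultSet (a list) of all matches
rows = soup.select('tr[id^="row"]')
print(type(first), len(rows))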
import requests
import bs4
import pandas as pd
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,'lxml')
data=[]
for row in soup.select('.tablesorter.mark > tbody tr'):
    name = row.find("span", class_="sortkey").text
    country_code = row.select_one('td:nth-child(4)').text.replace('\n', '').strip()
    data.append({
        'name': name,
        'country_code': country_code})
df = pd.DataFrame(data)
print(df)
Output:
name country_code
0 afghanistan AF
1 aland-islands AX
2 albania AL
3 algeria DZ
4 american-samoa AS
.. ... ...
244 wallis-and-futuna WF
245 western-sahara EH
246 yemen YE
247 zambia ZM
248 zimbabwe ZW
[249 rows x 2 columns]

How to scrape data off morningstar

So I'm new to the world of web scraping, and so far I've only really been using BeautifulSoup to scrape text and images off websites. I thought I'd try and scrape some data points off a graph to test my understanding, but I got a bit confused by this graph.
After inspecting the element of the piece of data I wanted to extract, I saw this:
<span id="TSMAIN">: 100.7490637</span>
The problem is, my original idea for scraping the data points was to iterate through some sort of id list containing all the different data points (if that makes sense?).
Instead, it seems that all the data points are contained within this same element, and the value depends on where your cursor is on the graph.
My problem is, if I use BeautifulSoup's find function with that specific element and the attribute id = TSMAIN, I get a None return, because I am guessing that unless I have my cursor on the actual graph, nothing will show up there.
Code:
from bs4 import BeautifulSoup
import requests
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"}
url = "https://www.morningstar.co.uk/uk/funds/snapshot/snapshot.aspx?id=F0GBR050AQ&tab=13"
source=requests.get(url,headers=headers)
soup = BeautifulSoup(source.content,'lxml')
data = soup.find("span",attrs={"id":"TSMAIN"})
print(data)
Output
None
How can I extract all the data points of this graph?
Seems like the data can be pulled from an API. The only thing is that the values it returns are relative to the start date entered in the payload: it sets the output at the start date to 0, and the numbers after that are relative to it.
import requests
import pandas as pd
from datetime import datetime
from dateutil import relativedelta
userInput = input('Choose:\n\t1. 3 Month\n\t2. 6 Month\n\t3. 1 Year\n\t4. 3 Year\n\t5. 5 Year\n\t6. 10 Year\n\n -->: ')
userDict = {'1':3,'2':6,'3':12,'4':36,'5':60,'6':120}
n = datetime.now()
n = n - relativedelta.relativedelta(days=1)
n = n - relativedelta.relativedelta(months=userDict[userInput])
dateStr = n.strftime('%Y-%m-%d')
url = 'https://tools.morningstar.co.uk/api/rest.svc/timeseries_cumulativereturn/t92wz0sj7c'
data = []
idDict = {
    'Schroder Managed Balanced Instl Acc': 'F0GBR050AQ]2]0]FOGBR$$ALL',
    'GBP Moderately Adventurous Allocation': 'EUCA000916]8]0]CAALL$$ALL',
    'Mixed Investment 40-85% Shares': 'LC00000012]8]0]CAALL$$ALL',
    '': 'F00000ZOR1]7]0]IXALL$$ALL',
}
for k, v in idDict.items():
    payload = {
        'encyId': 'GBP',
        'idtype': 'Morningstar',
        'frequency': 'daily',
        'startDate': dateStr,
        'performanceType': '',
        'outputType': 'COMPACTJSON',
        'id': v,
        'decPlaces': '8',
        'applyTrackRecordExtension': 'false'}
    temp_data = requests.get(url, params=payload).json()
    df = pd.DataFrame(temp_data)
    df['timestamp'] = pd.to_datetime(df[0], unit='ms')
    df['date'] = df['timestamp'].dt.date
    df = df[['date', 1]]
    df.columns = ['date', k]
    data.append(df)
final_df = pd.concat(
    (iDF.set_index('date') for iDF in data),
    axis=1, join='inner'
).reset_index()
final_df.plot(x="date", y=list(idDict.keys()), kind="line")
Output:
print (final_df.head(5).to_string())
date Schroder Managed Balanced Instl Acc GBP Moderately Adventurous Allocation Mixed Investment 40-85% Shares
0 2019-12-22 0.000000 0.000000 0.000000 0.000000
1 2019-12-23 0.357143 0.406784 0.431372 0.694508
2 2019-12-24 0.714286 0.616217 0.632422 0.667586
3 2019-12-25 0.714286 0.616217 0.632422 0.655917
4 2019-12-26 0.714286 0.612474 0.629152 0.664124
....
To get those ids, it took a little investigating of the requests. Searching through those, I was able to find the corresponding id values, and with a little bit of trial and error worked out which values meant what.
Those are the "alternate" ids used, and where those line graphs get their data from (in those 4 requests, look at the Preview pane and you'll see the data in there).
Here's the final output/graph:

How to get a value from a text document that has an unstructured table

I am trying to get the total assets values from the 10-K text filings. The problem is that the html format varies from one company to another.
Take Apple 10-K as an example:
Total assets is in a table that has a balance sheet header, and typical terms like cash, inventories, ... exist in some rows of that table. In the last row, there is a summation of assets: 290,479 for 2015 and 231,839 for 2014. I want to get the number for 2015 --> 290,479. I have not been able to find a way that:
1) finds the relevant table that has some specific headings (like balance sheet) and words in its rows (cash, ...);
2) gets the value in the row that has the words "total assets" and belongs to the later year (2015 in our example).
import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, "xml")
for tag in soup.find_all(text=re.compile(r'Total\sassets')):
    print(tag.findParent('table').findParent('table'))
Using lxml or html.parser instead of xml I can get
title > CONSOLIDATED BALANCE SHEETS
row > Total assets
column 0 > Total assets
column 1 >
column 2 > $
column 3 > 290,479
column 4 >
column 5 >
column 6 > $
column 7 > 231,839
column 8 >
using this code:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # or "lxml"
# get all `b` to find title
all_b = soup.find_all('b')
for item in all_b:
    # check text in every `b`
    title = item.get_text(strip=True)
    if title == 'CONSOLIDATED BALANCE SHEETS':
        print('title >', title)
        # get first `table` after `b`
        table = item.parent.findNext('table')
        # all rows in table
        all_tr = table.find_all('tr')
        for tr in all_tr:
            # all columns in row
            all_td = tr.find_all('td')
            if not all_td:
                continue  # skip rows without <td> cells
            # text in first column
            text = all_td[0].get_text(strip=True)
            if text == 'Total assets':
                print('row >', text)
                for i, td in enumerate(all_td):
                    print('column', i, '>', td.get_text(strip=True))
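If you only need the 2015 figure as a number, you could extend the 'Total assets' branch to parse the relevant cell. A minimal sketch, assuming the column layout printed above, where column 3 holds the 2015 value:
# inside the `if text == 'Total assets':` branch
raw = all_td[3].get_text(strip=True)             # '290,479' per the output above
total_assets_2015 = float(raw.replace(',', ''))  # -> 290479.0
print('total assets 2015 >', total_assets_2015)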

Scraping multiple tables along with their headers on a Wikipedia page using python requests and BeautifulSoup?

Using the Python libraries requests and BeautifulSoup, I am trying to scrape the tables on this Wikipedia page: https://en.wikipedia.org/wiki/Mobile_country_code. I can get all the data in the tables; however, I want to add another column called Country, populated with the table names. Here is an example:
Wikipedia table (above) and the desired table (below).
The code below allows me to get all the data without the Country column:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
wiki = requests.get('https://en.wikipedia.org/wiki/Mobile_country_code')
soup = BeautifulSoup(wiki.content, 'html.parser')
# Get all the tables
tables = soup.find_all('table',class_="wikitable")
# extract the column names
column_names = [item.get_text() for item in tables[0].find_all('th')]
# extract the content
contents = [item.get_text() for item in tables[0].find_all('td')]
# put all the content into a list
values = []
for table in tables:
    for item in table.select('td'):
        temp = item.get_text()
        values.append(temp)
# Since there are 7 columns, obtain the number of rows and reshape the table
len(values)/7  # 2452 rows
# change the shape of the table
data = np.reshape(values, (2452, 7))
# put all the data into a dataframe
df = pd.DataFrame(data=data, columns=column_names)
Try with:
#This is the table which I want to extract
# Get all the tables
tables = soup.find_all('table',class_="wikitable")
# extract the column names
column_names = [item.get_text() for item in tables[0].find_all('th')]
# extract the content
contents = [item.get_text() for item in tables[0].find_all('td')]
# put all the content into a list
values_list = []
# find all countries
countries = soup.find_all('h3')
international = [soup.find('span', {"id": "International_operators"}).parent]
countries = countries + international
for c in countries:
    table = c.find_next_sibling("table")
    if table is not None:  # check the country has a table
        for item in table.select('tr')[1:]:
            values = [e.get_text() for e in item.select('td')]
            values = [c.text] + values
            values_list.append(values)
header_list = ["COUNTRY"] + column_names
# put all the data into a dataframe
df = pd.DataFrame(values_list, columns=header_list)
df will be:
COUNTRY MCC MNC Brand Operator Status Bands (MHz) References and notes
0 Abkhazia - GE-AB 289 67 Aquafon Aquafon JSC Operational GSM 900 / GSM 1800 / UMTS 2100 / LTE 800 MCC is not listed by ITU;[85] LTE band 20[95]
1 Abkhazia - GE-AB 289 88 A-Mobile A-Mobile LLSC Operational GSM 900 / GSM 1800 / UMTS 2100 / LTE 800 / LTE... MCC is not listed by ITU[85]
...
