Okay, I've been beating my head against the wall on this one long enough - I'm stuck! I'm trying to build a function that takes the Favorite from Sagarin's College Football site as input and calculates the spread, including the home advantage.
I am trying to pull the "Predictions_with_Totals" from Sagarin's site:
http://sagarin.com/sports/cfsend.htm#Predictions_with_Totals
I can get to it with the following code:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
html = requests.get("http://sagarin.com/sports/cfsend.htm#Predictions_with_Totals").text
soup = bs(html, "html.parser")
#find and create the table we want to import
collegeFB_HOME_ALL = soup.find_all("pre")
collegeFB_HOME = collegeFB_HOME_ALL[6]
df_collegeFB = collegeFB_HOME
This gets me a very nice table with a few headers I would need to get past to get to the "meat" of the data.
Predictions_with_Totals
These are the "regular method". _
HOME or CLOSEBY (c) team in CAPS _
both teams in lower case _
means "n" for NEUTRAL location Rating Favorite _
MONEY=odds to 100 _
FAVORITE Rating Predict Golden Recent UNDERDOG ODDS PCT% TOTAL _
======================================================================================================
CENTRAL FLORIDA(UCF) 6.35 4.66 5.99 7.92 smu 200 67% 52.40
ALABAMA 20.86 19.07 17.01 26.30 texas a&m 796 89% 42.65
snipped.....
However, I can't get rid of the header text at the top to format this into something useful. If he had made this a table or even a list, I think I would find it a lot easier.
I have tried to make a dictionary and use row.find based on searches here but I don't know why it isn't working for me - maybe I need to trash the first few rows before the "FAVORITES" row? How would I do that?
output = []
for row in df_collegeFB:
    test = {}
    test["headers"] = row.find("FAVORITES")
    test['data'] = row.find('all')
    output.append(test)
Just gives me garbage. I'm sure I'm putting garbage in so not surprised I'm getting garbage out.
print(output)
[{'headers': -1, 'data': -1}, {'headers': None, 'data': None}, {'headers': -1, 'data': 1699}, {'headers': None, 'data': None}, {'headers': -1, 'data': -1}]
Not sure what exactly you are after, but if you are trying to get that table, you can use regex. Probably not the most efficient way, but nonetheless it gets that table into a dataframe:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import re
html = requests.get("http://sagarin.com/sports/cfsend.htm#Predictions_with_Totals").text
soup = bs(html, "html.parser")
#find and create the table we want to import
collegeFB_HOME_ALL = str(soup.find_all("pre")[6])
pattern = re.compile(r"\s{1,}([a-zA-Z\(\)].*)\s{1,}([0-9\.\-]+)\s{1,}([0-9\.\-]+)\s{1,}([0-9\.\-]+)\s{1,}([0-9\.\-]+)\s{1,}(\D+)([0-9]+)\s{1,}([0-9%]+)\s{1,}([0-9\.]+)")
rows = []
# find all matches to groups
for match in pattern.finditer(collegeFB_HOME_ALL):
    row = {}
    for i, col in enumerate(['FAVORITE', 'Rating', 'Predict', 'Golden', 'Recent', 'UNDERDOG', 'ODDS', 'PCT%', 'TOTAL'], start=1):
        row[col] = match.group(i).strip()
    rows.append(row)
df = pd.DataFrame(rows)
Output:
print(df)
FAVORITE Rating Predict ... ODDS PCT% TOTAL
0 CENTRAL FLORIDA(UCF) 6.35 4.66 ... 200 67% 52.40
1 ALABAMA 20.86 19.07 ... 796 89% 42.65
2 oregon 12.28 11.89 ... 362 78% 75.82
3 washington 8.28 8.47 ... 244 71% 64.72
4 james madison 8.08 8.52 ... 239 70% 64.71
.. ... ... ... ... ... ... ...
104 east tennessee state 7.92 7.75 ... 235 70% 41.16
105 WEBER STATE 15.32 17.25 ... 482 83% 62.36
106 delaware 2.10 2.89 ... 126 56% 38.73
107 YALE 0.87 0.83 ... 110 52% 54.32
108 YOUNGSTOWN STATE 2.11 4.51 ... 127 56% 48.10
[109 rows x 9 columns]
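Since everything lives in one fixed-width pre block, another option is to split the text into lines and drop everything up to the ====== separator - the "trash the first few rows" idea from the question. A rough sketch, assuming the separator row of = signs always directly precedes the data:
# take the same <pre> block and work line by line
text = soup.find_all("pre")[6].get_text()
lines = text.splitlines()
# the data starts right after the row of = signs
start = next(i for i, ln in enumerate(lines) if ln.lstrip().startswith("=")) + 1
data_lines = [ln for ln in lines[start:] if ln.strip()]
print(data_lines[0])  # first matchup row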
I'm trying to pull information off of this web page (which is populated by an AJAX call to this page).
I'm able to print out the whole page, but the find_all function just returns a blank list. What am I doing wrong?
from bs4 import BeautifulSoup
import requests
url = "http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1653673850875&t=XNAS:AAPL®ion=usa&culture=en-US&cur=&order=asc&_=1653673850919"
def pageText():
    result = requests.get(url)
    doc = BeautifulSoup(result.text, "html.parser")
    return doc
specialNum = pageText()
print(specialNum)
specialNum = pageText().find_all('literally anything I am trying to pull off of the page')
print(specialNum) #This will always print a blank list
Apologies if this is a stupid question. I'm a bit of a beginner.
EDIT
As mentioned by @furas, if you remove the parameter and value callback=jsonp1653673850875 from the url, the server will send pure JSON and you can get the HTML directly via r.json()['componentData'].
The simplest approach in my opinion is to unwrap the JSON string and convert it with json.loads() to access the HTML.
From there you can go with beautifulsoup or pandas to scrape the content.
Example beautifulsoup
import json, requests
from bs4 import BeautifulSoup
r = requests.get('http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1653673850875&t=XNAS:AAPL&region=usa&culture=en-US&cur=&order=asc&_=1653673850919')
soup = BeautifulSoup(
    json.loads(
        r.text.split('(',1)[-1].rsplit(')',1)[0]
    )['componentData'],
    'html.parser'
)
for row in soup.select('table tr'):
    ...
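The loop body above is left open; one way to finish it, assuming the component is a plain HTML table of th/td cells:
# continues from the soup object built above
data = []
for row in soup.select('table tr'):
    cells = [c.get_text(strip=True) for c in row.select('th, td')]
    if cells:
        data.append(cells)
print(data[:3])  # first few rows of the table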
Example pandas
import json, requests
import pandas as pd
r = requests.get('http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1653673850875&t=XNAS:AAPL&region=usa&culture=en-US&cur=&order=asc&_=1653673850919')
pd.read_html(
    json.loads(
        r.text.split('(',1)[-1].rsplit(')',1)[0]
    )['componentData']
)[0].dropna()
Output
Unnamed: 0                      2012-09  2013-09  2014-09  2015-09  2016-09  2017-09  2018-09  2019-09  2020-09  2021-09     TTM
Revenue USD Mil                  156508   170910   182795   233715   215639   229234   265595   260174   274515   365817  386017
Gross Margin %                     43.9     37.6     38.6     40.1     39.1     38.5     38.3     37.8     38.2     41.8    43.3
Operating Income USD Mil          55241    48999    52503    71230    60024    61344    70898    63930    66288   108949  119379
Operating Margin %                 35.3     28.7     28.7     30.5     27.8     26.8     26.7     24.6     24.1     29.8    30.9
Net Income USD Mil                41733    37037    39510    53394    45687    48351    59531    55256    57411    94680  101935
Earnings Per Share USD             1.58     1.42     1.61     2.31     2.08      2.3     2.98     2.97     3.28     5.61    6.15
Dividends USD                      0.09     0.41     0.45     0.49     0.55      0.6     0.68     0.75      0.8     0.85    0.88
Payout Ratio % *                      —     27.4     28.5     22.3     24.8     26.5     23.7     25.1     23.7     16.3    14.3
Shares Mil                        26470    26087    24491    23172    22001    21007    20000    18596    17528    16865   16585
Book Value Per Share * USD         4.25      4.9     5.15     5.63     5.93     6.46     6.04     5.43     4.26     3.91    4.16
Operating Cash Flow USD Mil       50856    53666    59713    81266    65824    63598    77434    69391    80674   104038  116426
Cap Spending USD Mil              -9402    -9076    -9813   -11488   -13548   -12795   -13313   -10495    -7309   -11085  -10633
Free Cash Flow USD Mil            41454    44590    49900    69778    52276    50803    64121    58896    73365    92953  105793
Free Cash Flow Per Share * USD     1.58     1.61     1.93     2.96     2.24     2.41     2.88     3.07     4.04     5.57       —
Working Capital USD Mil           19111    29628     5083     8768    27863    27831    14473    57101    38321     9355       —
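Following the note above about dropping the callback parameter, you can skip the string splitting entirely. A sketch, assuming the server then responds with plain JSON:
import requests
import pandas as pd

# same endpoint, but without the callback=... parameter
url = ('http://financials.morningstar.com/finan/financials/getFinancePart.html'
       '?t=XNAS:AAPL&region=usa&culture=en-US&order=asc')
r = requests.get(url)
df = pd.read_html(r.json()['componentData'])[0].dropna()
print(df)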
I have a very specific issue which I have not been able to find a solution to.
Recently, I began a project for which I am monitoring about 100 ETFs and Mutual funds based on specific data acquired from Morningstar. The current solution works great - but I later found out that I need more data from another "Tab" within the website. Specifically, I am trying to get data from the 1st table from the following website: https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1
Right now, I have the code below for scraping data from a table on the "Indhold" tab of the website and exporting it to Excel. My question is therefore: how do I adjust the code to scrape data from another part of the website?
To briefly explain and reiterate: the code below scrapes data from another tab of the same website. The many, many IDs each represent a mutual fund/ETF page. The setup works very well, so I am hoping to simply adjust it (if that is possible) to extract the table from the link above. I have very limited knowledge of the topic, so any help is much, much appreciated.
import requests
import re
import pandas as pd
from openpyxl import load_workbook
auth = 'https://www.morningstar.dk/Common/funds/snapshot/PortfolioSAL.aspx'
# Create a Pandas Excel writer using openpyxl as the engine.
path= r'/Users/karlemilthulstrup/Downloads/data2.xlsm'
book = load_workbook(path ,read_only = False, keep_vba=True)
writer = pd.ExcelWriter(path, engine='openpyxl')
writer.book = book
ids = ['F00000VA2N','F0GBR064OO','F00000YKC2','F000015MVX','F0000020YA','0P00015YTR','0P00015YTT','F0GBR05V8D','F0GBR06XKI','F000013CKH','F00000MG6K','F000014G49',
'F00000WC0Z','F00000QSD2','F000016551','F0000146QH','F0000146QI','F0GBR04KZS','F0GBR064VU','F00000VXLM','F0000119R1','F0GBR04L4T','F000015CS3','F000015CS5','F000015CS6',
'F000015CS4','F000013BZE','F0GBR05W0Q','F000016M1C','F0GBR04L68','F00000Z9T9','F0GBR04JI8','F00000Z9TG','F0GBR04L2P','F000014CU8','F00000ZG2G','F00000MLEW',
'F000013ZOY','F000016614','F00000WUI9','F000015KRL','F0GBR04LCR','F000010ES9','F00000P780','F0GBR04HC3','F000015CV6','F00000YWCK','F00000YWCJ','F00000NAI5',
'F0GBR04L81','F0GBR05KNU','F0GBR06XKB','F00000NAI3','F0GBR06XKF','F000016UA9','F000013FC2','F000014NRE','0P0000CNVT','0P0000CNVX','F000015KRI',
'F000015KRG','F00000XLK7','F0GBR04IDG','F00000XLK6','F00000073J','F00000XLK4','F000013CKG','F000013CKJ','F000013CKK','F000016P8R','F000016P8S','F000011JG6',
'F000014UZQ','F0000159PE','F0GBR04KZG','F0000002OY','F00000TW9K','F0000175CC','F00000NBEL','F000016054','F000016056','F00000TEYP','F0000025UI','F0GBR04FV7',
'F00000WP01','F000011SQ4','F0GBR04KZO','F000010E19','F000013ZOX','F0GBR04HD7','F00000YKC1','F0GBR064UG','F00000JSDD','F000010ROF','F0000100CA','F0000100CD',
'FOGBR05KQ0','F0GBR04LBB','F0GBR04LBZ','F0GBR04LCN','F00000WLA7','F0000147D7','F00000ZB5E','F00000WC0Y']
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36'}
payload = {
    'languageId': 'da-DK',
    'locale': 'da-DK',
    'clientId': 'MDC_intl',
    'benchmarkId': 'category',
    'component': 'sal-components-mip-factor-profile',
    'version': '3.40.1'}
for api_id in ids:
    payload = {
        'Site': 'dk',
        'FC': '%s' % api_id,
        'IT': 'FO',
        'LANG': 'da-DK'}

    # fetch the page once just to extract the bearer token
    response = requests.get(auth, params=payload)
    search = re.search(r'(tokenMaaS:[\w\s]*\")(.*)(\")', response.text, re.IGNORECASE)
    bearer = 'Bearer ' + search.group(2)
    headers.update({'Authorization': bearer})

    url = 'https://www.us-api.morningstar.com/sal/sal-service/fund/factorProfile/%s/data' % api_id
    jsonData = requests.get(url, headers=headers, params=payload).json()

    rows = []
    for k, v in jsonData['factors'].items():
        row = {}
        row['factor'] = k
        historicRange = v.pop('historicRange')
        row.update(v)
        for each in historicRange:
            row.update(each)
        rows.append(row.copy())

    df = pd.DataFrame(rows)
    sheetName = jsonData['id']
    df.to_excel(writer, sheet_name=sheetName, index=False)
    print('Finished: %s' % sheetName)

writer.save()
writer.close()
If I understand you correctly, you want to get the first table of that URL in the form of a pandas dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# load the page into soup:
url = "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# find correct table:
tbl = soup.select_one(".returnsCalenderYearTable")
# remove the first row (it's not header):
tbl.tr.extract()
# convert the html to pandas DF:
df = pd.read_html(str(tbl))[0]
# move the first row to header:
df.columns = map(str, df.loc[0])
df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
print(df)
Prints:
Name 2014* 2015* 2016* 2017* 2018 2019 2020 31-08
0 Samlet afkast % 2627 1490 1432 584 -589 2648 -482 1841
1 +/- Kategori 1130 583 808 -255 164 22 -910 -080
2 +/- Indeks 788 591 363 -320 -127 -262 -1106 -162
3 Rank i kategori 2 9 4 80 38 54 92 63
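If you want to push this into your existing Excel workflow, a minimal sketch (the file name funds.xlsx and the sheet name are assumptions, adjust to your workbook):
import pandas as pd

# assumed file and sheet names, not taken from your original code
with pd.ExcelWriter("funds.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="F00000Z1MC", index=False)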
EDIT: To load from multiple URLs:
import requests
import pandas as pd
from bs4 import BeautifulSoup
urls = [
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1",
]
all_data = []
for url in urls:
    print("Loading URL {}".format(url))
    # load the page into soup:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # find correct table:
    tbl = soup.select_one(".returnsCalenderYearTable")
    # remove the first row (it's not header):
    tbl.tr.extract()
    # convert the html to pandas DF:
    df = pd.read_html(str(tbl))[0]
    # move the first row to header:
    df.columns = map(lambda x: str(x).replace("*", "").strip(), df.loc[0])
    df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
    df["Company"] = soup.h1.text.split("\n")[0].strip()
    df["URL"] = url
    all_data.append(df.loc[:, ~df.isna().all()])

df = pd.concat(all_data, ignore_index=True)
print(df)
Prints:
Name 2016 2017 2018 2019 2020 31-08 Company URL
0 Samlet afkast % 1755.0 942.0 -1317.0 1757.0 -189.0 3018 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
1 +/- Kategori 966.0 -54.0 -186.0 -662.0 -967.0 1152 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
2 +/- Indeks 686.0 38.0 -854.0 -1153.0 -813.0 1015 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
3 Rank i kategori 10.0 24.0 85.0 84.0 77.0 4 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
4 Samlet afkast % NaN 1016.0 -940.0 1899.0 767.0 2238 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
5 +/- Kategori NaN 20.0 190.0 -520.0 -12.0 373 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
6 +/- Indeks NaN 112.0 -478.0 -1011.0 143.0 235 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
7 Rank i kategori NaN 26.0 69.0 92.0 43.0 25 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
8 Samlet afkast % NaN NaN -939.0 1898.0 766.0 2239 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
9 +/- Kategori NaN NaN 191.0 -521.0 -12.0 373 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
10 +/- Indeks NaN NaN -477.0 -1012.0 142.0 236 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
11 Rank i kategori NaN NaN 68.0 92.0 44.0 24 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
12 Samlet afkast % NaN NaN NaN NaN NaN 2384 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
13 +/- Kategori NaN NaN NaN NaN NaN 518 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
14 +/- Indeks NaN NaN NaN NaN NaN 381 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
15 Rank i kategori NaN NaN NaN NaN NaN 18 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
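Since you already have the full list of fund IDs, you don't need to type the URLs out by hand; they can be generated:
ids = ['F00000VA2N', 'F0GBR064OO', 'F00000YKC2', 'F000015MVX']  # your full list here
urls = [
    "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id={}&tab=1".format(fund_id)
    for fund_id in ids
]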
I want to extract the data from the table given in 'https://statisticstimes.com/demographics/india/indian-states-population.php' and put it in a list or a dictionary.
I am a beginner in Python. From what I have learned so far all I could do is:
import urllib.request , urllib.error , urllib.parse
from bs4 import BeautifulSoup
url = input("Enter url: ")
html = urllib.request.urlopen(url).read()
x = BeautifulSoup(html , 'html.parser')
tags = x('tr')
lst = list()
for tag in tags:
    lst.append(tag.findAll('td'))
print(lst)
You can use requests and pandas.
Here's how:
import pandas as pd
import requests
from tabulate import tabulate
url = "https://statisticstimes.com/demographics/india/indian-states-population.php"
df = pd.read_html(requests.get(url).text, flavor="bs4")[-1]
print(tabulate(df.head(10), showindex=False))
Output:
--- ---------------- -------- -------- ------- ----- ---- -------------------- ---
NCT Delhi 18710922 16787941 1922981 11.45 1.36 Malawi 63
18 Haryana 28204692 25351462 2853230 11.25 2.06 Venezuela 51
14 Kerala 35699443 33406061 2293382 6.87 2.6 Morocco 41
20 Himachal Pradesh 7451955 6864602 587353 8.56 0.54 China, Hong Kong SAR 104
16 Punjab 30141373 27743338 2398035 8.64 2.2 Mozambique 48
12 Telangana 39362732 35004000 4358732 12.45 2.87 Iraq 36
25 Goa 1586250 1458545 127705 8.76 0.12 Bahrain 153
19 Uttarakhand 11250858 10086292 1164566 11.55 0.82 Haiti 84
UT3 Chandigarh 1158473 1055450 103023 9.76 0.08 Eswatini 159
9 Gujarat 63872399 60439692 3432707 5.68 4.66 France 23
--- ---------------- -------- -------- ------- ----- ---- -------------------- ---
With:
df.to_csv("your_table.csv", index=False)
you can dump the table to a .csv file.
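And since the question asks for a list or a dictionary rather than a dataframe, the result converts directly:
records = df.to_dict(orient="records")  # a list with one dict per table row
print(records[0])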
I'm trying to retrieve Financial Information from reuters.com, especially the Long Term Growth Rates of Companies. The element I want to scrape doesn't appear on all Webpages, in my example not for the Ticker 'AMCR'. All scraped info shall be appended to a list.
I've already figured out how to exclude the element when it doesn't exist, but instead of the "NaN" being appended at the position where it belongs, it ends up as the last element of the list.
import requests
from bs4 import BeautifulSoup
LTGRMean = []
tickers = ['MMM','AES','LLY','LOW','PWR','TSCO','YUM','ICE','FB','AAPL','AMCR','FLS','GOOGL','FB','MSFT']
Ticker LTGRMean
0 MMM 3.70
1 AES 9.00
2 LLY 10.42
3 LOW 13.97
4 PWR 12.53
5 TSCO 11.44
6 YUM 15.08
7 ICE 8.52
8 FB 19.07
9 AAPL 12.00
10 AMCR 19.04
11 FLS 16.14
12 GOOGL 19.07
13 FB 14.80
14 MSFT NaN
My individual text "not existing" isn't appearing.
For AMCR, where Reuters doesn't provide any information, the growth rate of FLS (19.04) is filled in instead. As a result, everything is shifted up one index from where it should be, and the NaN that should appear next to AMCR lands at the end.
The stack() function in pandas stacks the dataframe's columns into rows at level 1.
import requests
from bs4 import BeautifulSoup
import pandas as pd
LTGRMean = []
tickers = ['MMM', 'AES', 'LLY', 'LOW', 'PWR', 'TSCO', 'YUM', 'ICE', 'FB', 'AAPL', 'AMCR', 'FLS', 'GOOGL', 'FB', 'MSFT']
for i in tickers:
    Test = requests.get('https://www.reuters.com/finance/stocks/financial-highlights/' + i)
    ReutSoup = BeautifulSoup(Test.content, 'html.parser')
    td = ReutSoup.find('td', string="LT Growth Rate (%)")

    my_dict = {}

    # validate that the td object is not None
    if td is not None:
        result = td.findNext('td').findNext('td').text
    else:
        result = "NaN"

    my_dict[i] = result
    LTGRMean.append(my_dict)

df = pd.DataFrame(LTGRMean)
print(df.stack())
Output:
0 MMM 3.70
1 AES 9.00
2 LLY 10.42
3 LOW 13.97
4 PWR 12.53
5 TSCO 11.44
6 YUM 15.08
7 ICE 8.52
8 FB 19.90
9 AAPL 12.00
10 AMCR NaN
11 FLS 19.04
12 GOOGL 16.14
13 FB 19.90
14 MSFT 14.80
dtype: object
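If you'd rather avoid the MultiIndex that stack() produces, a small variation is to collect plain (ticker, value) pairs and build the frame with explicit columns; the NaN then stays aligned with AMCR:
# variation: explicit columns instead of one-entry dicts plus stack()
pairs = [(t, d[t]) for t, d in zip(tickers, LTGRMean)]
df = pd.DataFrame(pairs, columns=['Ticker', 'LTGRMean'])
print(df)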