Scraping data from Morningstar using an API - python

I have a very specific issue which I have not been able to find a solution to.
Recently, I began a project for which I am monitoring about 100 ETFs and Mutual funds based on specific data acquired from Morningstar. The current solution works great - but I later found out that I need more data from another "Tab" within the website. Specifically, I am trying to get data from the 1st table from the following website: https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1
Right now, I have the code below for scraping data from a table on the tab "Indhold" on the website, and exporting it to Excel. My question is therefore: how do I adjust the code to scrape data from another part of the website?
To briefly explain the code and reiterate: the code below scrapes data from another tab of the same kind of page; the many IDs each identify the page of one mutual fund/ETF. The setup works very well, so I am hoping to simply adjust it (if that is possible) to extract the table from the link above. I have very limited knowledge of the topic, so any help is much appreciated.
import requests
import re
import pandas as pd
from openpyxl import load_workbook
auth = 'https://www.morningstar.dk/Common/funds/snapshot/PortfolioSAL.aspx'
# Open the existing macro-enabled workbook and create a Pandas Excel writer using openpyxl as the engine.
path = r'/Users/karlemilthulstrup/Downloads/data2.xlsm'
book = load_workbook(path, read_only=False, keep_vba=True)
writer = pd.ExcelWriter(path, engine='openpyxl')
writer.book = book
ids = ['F00000VA2N','F0GBR064OO','F00000YKC2','F000015MVX','F0000020YA','0P00015YTR','0P00015YTT','F0GBR05V8D','F0GBR06XKI','F000013CKH','F00000MG6K','F000014G49',
'F00000WC0Z','F00000QSD2','F000016551','F0000146QH','F0000146QI','F0GBR04KZS','F0GBR064VU','F00000VXLM','F0000119R1','F0GBR04L4T','F000015CS3','F000015CS5','F000015CS6',
'F000015CS4','F000013BZE','F0GBR05W0Q','F000016M1C','F0GBR04L68','F00000Z9T9','F0GBR04JI8','F00000Z9TG','F0GBR04L2P','F000014CU8','F00000ZG2G','F00000MLEW',
'F000013ZOY','F000016614','F00000WUI9','F000015KRL','F0GBR04LCR','F000010ES9','F00000P780','F0GBR04HC3','F000015CV6','F00000YWCK','F00000YWCJ','F00000NAI5',
'F0GBR04L81','F0GBR05KNU','F0GBR06XKB','F00000NAI3','F0GBR06XKF','F000016UA9','F000013FC2','F000014NRE','0P0000CNVT','0P0000CNVX','F000015KRI',
'F000015KRG','F00000XLK7','F0GBR04IDG','F00000XLK6','F00000073J','F00000XLK4','F000013CKG','F000013CKJ','F000013CKK','F000016P8R','F000016P8S','F000011JG6',
'F000014UZQ','F0000159PE','F0GBR04KZG','F0000002OY','F00000TW9K','F0000175CC','F00000NBEL','F000016054','F000016056','F00000TEYP','F0000025UI','F0GBR04FV7',
'F00000WP01','F000011SQ4','F0GBR04KZO','F000010E19','F000013ZOX','F0GBR04HD7','F00000YKC1','F0GBR064UG','F00000JSDD','F000010ROF','F0000100CA','F0000100CD',
'FOGBR05KQ0','F0GBR04LBB','F0GBR04LBZ','F0GBR04LCN','F00000WLA7','F0000147D7','F00000ZB5E','F00000WC0Y']
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36'}
payload = {
    'languageId': 'da-DK',
    'locale': 'da-DK',
    'clientId': 'MDC_intl',
    'benchmarkId': 'category',
    'component': 'sal-components-mip-factor-profile',
    'version': '3.40.1'}
for api_id in ids:
    # request the fund page first to pick up a fresh API token for this id
    payload = {
        'Site': 'dk',
        'FC': '%s' % api_id,
        'IT': 'FO',
        'LANG': 'da-DK'}
    response = requests.get(auth, params=payload)
    search = re.search(r'(tokenMaaS:[\w\s]*")(.*)(")', response.text, re.IGNORECASE)
    bearer = 'Bearer ' + search.group(2)
    headers.update({'Authorization': bearer})
    # call the factorProfile API for this fund and flatten the JSON into rows
    url = 'https://www.us-api.morningstar.com/sal/sal-service/fund/factorProfile/%s/data' % api_id
    jsonData = requests.get(url, headers=headers, params=payload).json()
    rows = []
    for k, v in jsonData['factors'].items():
        row = {}
        row['factor'] = k
        historicRange = v.pop('historicRange')
        row.update(v)
        for each in historicRange:
            row.update(each)
        rows.append(row.copy())
    df = pd.DataFrame(rows)
    sheetName = jsonData['id']
    df.to_excel(writer, sheet_name=sheetName, index=False)
    print('Finished: %s' % sheetName)
writer.save()
writer.close()

If I understand you correctly, you want to get the first table of that URL in the form of a pandas DataFrame:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# load the page into soup:
url = "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# find correct table:
tbl = soup.select_one(".returnsCalenderYearTable")
# remove the first row (it's not header):
tbl.tr.extract()
# convert the html to pandas DF:
df = pd.read_html(str(tbl))[0]
# move the first row to header:
df.columns = map(str, df.loc[0])
df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
print(df)
Prints:
              Name 2014* 2015* 2016* 2017*  2018  2019  2020 31-08
0  Samlet afkast %  2627  1490  1432   584  -589  2648  -482  1841
1     +/- Kategori  1130   583   808  -255   164    22  -910  -080
2       +/- Indeks   788   591   363  -320  -127  -262 -1106  -162
3  Rank i kategori     2     9     4    80    38    54    92    63
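One thing to watch: pd.read_html drops the Danish decimal commas, so a value such as 26,27 comes through as 2627. If that matters for your analysis, here is a small variant of the snippet above (just a sketch, assuming the site consistently uses "," as the decimal separator and "." for thousands) that keeps the values as proper floats:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
tbl = soup.select_one(".returnsCalenderYearTable")
tbl.tr.extract()  # drop the non-header first row, as above
# decimal="," / thousands="." tell read_html to parse Danish-formatted numbers,
# so "26,27" becomes the float 26.27 instead of the integer 2627:
df = pd.read_html(str(tbl), decimal=",", thousands=".")[0]
df.columns = map(str, df.loc[0])
df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
print(df)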
EDIT: To load from multiple URLs:
import requests
import pandas as pd
from bs4 import BeautifulSoup
urls = [
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1",
]
all_data = []
for url in urls:
    print("Loading URL {}".format(url))
    # load the page into soup:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # find correct table:
    tbl = soup.select_one(".returnsCalenderYearTable")
    # remove the first row (it's not header):
    tbl.tr.extract()
    # convert the html to pandas DF:
    df = pd.read_html(str(tbl))[0]
    # move the first row to header:
    df.columns = map(lambda x: str(x).replace("*", "").strip(), df.loc[0])
    df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
    df["Company"] = soup.h1.text.split("\n")[0].strip()
    df["URL"] = url
    all_data.append(df.loc[:, ~df.isna().all()])
df = pd.concat(all_data, ignore_index=True)
print(df)
Prints:
Name 2016 2017 2018 2019 2020 31-08 Company URL
0 Samlet afkast % 1755.0 942.0 -1317.0 1757.0 -189.0 3018 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
1 +/- Kategori 966.0 -54.0 -186.0 -662.0 -967.0 1152 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
2 +/- Indeks 686.0 38.0 -854.0 -1153.0 -813.0 1015 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
3 Rank i kategori 10.0 24.0 85.0 84.0 77.0 4 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
4 Samlet afkast % NaN 1016.0 -940.0 1899.0 767.0 2238 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
5 +/- Kategori NaN 20.0 190.0 -520.0 -12.0 373 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
6 +/- Indeks NaN 112.0 -478.0 -1011.0 143.0 235 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
7 Rank i kategori NaN 26.0 69.0 92.0 43.0 25 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
8 Samlet afkast % NaN NaN -939.0 1898.0 766.0 2239 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
9 +/- Kategori NaN NaN 191.0 -521.0 -12.0 373 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
10 +/- Indeks NaN NaN -477.0 -1012.0 142.0 236 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
11 Rank i kategori NaN NaN 68.0 92.0 44.0 24 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
12 Samlet afkast % NaN NaN NaN NaN NaN 2384 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
13 +/- Kategori NaN NaN NaN NaN NaN 518 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
14 +/- Indeks NaN NaN NaN NaN NaN 381 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
15 Rank i kategori NaN NaN NaN NaN NaN 18 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
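To tie this back to your original script, here is a sketch that runs the same scrape over your full ids list and writes one sheet per fund into your existing workbook. Only the first few IDs are shown; it assumes the snapshot.aspx?id=<ID>&tab=1 URL pattern holds for every ID and that your pandas/openpyxl versions still accept writer.book = book, as in your working code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from openpyxl import load_workbook

ids = ['F00000VA2N', 'F0GBR064OO', 'F00000YKC2']  # ... the full list from the question
path = r'/Users/karlemilthulstrup/Downloads/data2.xlsm'
book = load_workbook(path, read_only=False, keep_vba=True)
writer = pd.ExcelWriter(path, engine='openpyxl')
writer.book = book
for api_id in ids:
    url = "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id={}&tab=1".format(api_id)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    tbl = soup.select_one(".returnsCalenderYearTable")
    if tbl is None:  # some funds may not publish this table
        print("No table found for {}".format(api_id))
        continue
    tbl.tr.extract()
    df = pd.read_html(str(tbl))[0]
    df.columns = map(lambda x: str(x).replace("*", "").strip(), df.loc[0])
    df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
    df.to_excel(writer, sheet_name=api_id, index=False)
    print("Finished: {}".format(api_id))
writer.save()
writer.close()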

Related

Web-scraping with Python to extract microdata for each page from a sitemap.xml

I'm trying to extract name, brand, prices, stock microdata from pages extracted from sitemap.xml
But I'm blocked at the following step; thank you for helping me, as I'm a newbie and can't work out what is blocking me.
Scrape the sitemap.xml to have list of urls : OK
Extract the metadata : OK
Extract the product schema : OK
Extract the products not OK
Crawl the site and store the products not OK
Scrape the sitemap.xml to have list of urls : OK
import pandas as pd
import requests
import extruct
from w3lib.html import get_base_url
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import advertools as adv
proximus_sitemap = adv.sitemap_to_df('https://www.proximus.be/iportal/sitemap.xml')
proximus_sitemap = proximus_sitemap[proximus_sitemap['loc'].str.contains('boutique')]
proximus_sitemap = proximus_sitemap[proximus_sitemap['loc'].str.contains('/fr/')]
Extract the metadata : OK
def extract_metadata(url):
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    metadata = extruct.extract(r.text,
                               base_url=base_url,
                               uniform=True,
                               syntaxes=['json-ld',
                                         'microdata',
                                         'opengraph'])
    return metadata
metadata = extract_metadata('https://www.proximus.be/fr/id_cr_apple-iphone-13-128gb-blue/particuliers/equipement/boutique/apple-iphone-13-128gb-blue.html')
metadata
Extract the product schema : OK
def get_dictionary_by_key_value(dictionary, target_key, target_value):
    for key in dictionary:
        if len(dictionary[key]) > 0:
            for item in dictionary[key]:
                if item[target_key] == target_value:
                    return item
Product = get_dictionary_by_key_value(metadata, "#type", "Product")
Product
Extract the products: not OK => error message: KeyError: 'offers'
def get_products(metadata):
    Product = get_dictionary_by_key_value(metadata, "#type", "Product")
    if Product:
        products = []
        for offer in Product['offers']['offers']:
            product = {
                'product_name': Product.get('name', ''),
                'brand': offer.get('description', ''),
                'availability': offer.get('availability', ''),
                'lowprice': offer.get('lowPrice', ''),
                'highprice': offer.get('highPrice', ''),
                'price': offer.get('price', ''),
                'priceCurrency': offer.get('priceCurrency', ''),
            }
            products.append(product)
        return products
Crawl the site and store the products: not OK, since I'm blocked at the previous step
def scrape_products(proximus_sitemap, url='url'):
    df_products = pd.DataFrame(columns=['product_name', 'brand', 'name', 'availability',
                                        'lowprice', 'highprice', 'price', 'priceCurrency'])
    for index, row in proximus_sitemap.iterrows():
        metadata = extract_metadata(row[url])
        products = get_products(metadata)
        if products is not None:
            for product in products:
                df_products = df_products.append(product, ignore_index=True)
    return df_products
df_products = scrape_products(proximus_sitemap, url='loc')
df_products.to_csv('patch.csv', index=False)
df_products.head()
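(A side note before switching tools: one hedged guess at the KeyError: 'offers' is that some product pages expose a single schema.org Offer rather than an AggregateOffer, so there is no nested 'offers' list to loop over. A defensive sketch of get_products under that assumption:)
def get_products(metadata):
    product = get_dictionary_by_key_value(metadata, "#type", "Product")
    if not product:
        return []
    offers = product.get('offers', {})
    # an AggregateOffer nests individual offers under another 'offers' key;
    # a plain Offer does not, so fall back to wrapping it in a list:
    offer_list = offers.get('offers', [offers]) if isinstance(offers, dict) else offers
    products = []
    for offer in offer_list:
        products.append({
            'product_name': product.get('name', ''),
            'brand': offer.get('description', ''),
            'availability': offer.get('availability', ''),
            'lowprice': offer.get('lowPrice', ''),
            'highprice': offer.get('highPrice', ''),
            'price': offer.get('price', ''),
            'priceCurrency': offer.get('priceCurrency', ''),
        })
    return products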
You can simply continue by using the advertools SEO crawler. It has a crawl function that also extracts structured data by default (JSON-LD, OpenGraph, and Twitter).
I tried to crawl a sample of ten pages, and this is what the output looks like:
adv.crawl(proximus_sitemap['loc'], 'proximums.jl')
proximus_crawl = pd.read_json('proximums.jl', lines=True)
proximus_crawl.filter(regex='jsonld').columns
Index(['jsonld_#context', 'jsonld_#type', 'jsonld_name', 'jsonld_url',
'jsonld_potentialAction.#type', 'jsonld_potentialAction.target',
'jsonld_potentialAction.query-input', 'jsonld_1_#context',
'jsonld_1_#type', 'jsonld_1_name', 'jsonld_1_url', 'jsonld_1_logo',
'jsonld_1_sameAs', 'jsonld_2_#context', 'jsonld_2_#type',
'jsonld_2_itemListElement', 'jsonld_2_name', 'jsonld_2_image',
'jsonld_2_description', 'jsonld_2_sku', 'jsonld_2_review',
'jsonld_2_brand.#type', 'jsonld_2_brand.name',
'jsonld_2_aggregateRating.#type',
'jsonld_2_aggregateRating.ratingValue',
'jsonld_2_aggregateRating.reviewCount', 'jsonld_2_offers.#type',
'jsonld_2_offers.priceCurrency', 'jsonld_2_offers.availability',
'jsonld_2_offers.price', 'jsonld_3_#context', 'jsonld_3_#type',
'jsonld_3_itemListElement', 'jsonld_image', 'jsonld_description',
'jsonld_sku', 'jsonld_review', 'jsonld_brand.#type',
'jsonld_brand.name', 'jsonld_aggregateRating.#type',
'jsonld_aggregateRating.ratingValue',
'jsonld_aggregateRating.reviewCount', 'jsonld_offers.#type',
'jsonld_offers.lowPrice', 'jsonld_offers.highPrice',
'jsonld_offers.priceCurrency', 'jsonld_offers.availability',
'jsonld_offers.price', 'jsonld_offers.offerCount',
'jsonld_1_itemListElement', 'jsonld_2_offers.lowPrice',
'jsonld_2_offers.highPrice', 'jsonld_2_offers.offerCount',
'jsonld_itemListElement'],
dtype='object')
These are some of the columns you might be interested in (containing price, currency, availability, etc.)
  jsonld_2_description  jsonld_2_offers.priceCurrency  jsonld_2_offers.availability  jsonld_2_offers.price  jsonld_description  jsonld_offers.lowPrice  jsonld_offers.priceCurrency  jsonld_offers.availability  jsonld_offers.price  jsonld_2_offers.lowPrice
0                  nan                            nan                           nan                    nan                 nan                     nan                          nan                         nan                  nan                       nan
1             Numéro 7                            EUR                    OutOfStock                 369.99                 nan                     nan                          nan                         nan                  nan                       nan
2                  nan                            nan                           nan                    nan               Apple                   81.82                          EUR                     InStock                487.6                       nan
3                  nan                            nan                           nan                    nan                 nan                     nan                          nan                         nan                  nan                       nan
4                  nan                            nan                           nan                    nan              Huawei                     nan                          EUR                  OutOfStock               330.57                       nan
5                  nan                            nan                           nan                    nan               Apple                   81.82                          EUR         LimitedAvailability                487.6                       nan
6                Apple                            EUR                       InStock                 589.99                 nan                     nan                          nan                         nan                  nan                        99
7                Apple                            EUR           LimitedAvailability                 589.99                 nan                     nan                          nan                         nan                  nan                        99
8                  nan                            nan                           nan                    nan                 nan                     nan                          nan                         nan                  nan                       nan
9                  nan                            nan                           nan                    nan                 nan                     nan                          nan                         nan                  nan                       nan
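If you then want to collapse those duplicated jsonld column groups into one tidy product table, here is a short sketch (assuming the column names shown above; the numeric prefixes can differ between crawls, so any missing columns are simply skipped):
import pandas as pd

proximus_crawl = pd.read_json('proximums.jl', lines=True)

def first_available(df, candidates):
    # coalesce: take the first non-null value across the candidate columns that exist
    existing = [c for c in candidates if c in df.columns]
    if not existing:
        return pd.Series([None] * len(df), index=df.index)
    return df[existing].bfill(axis=1).iloc[:, 0]

products = pd.DataFrame({
    'url': proximus_crawl['url'],
    'name': first_available(proximus_crawl, ['jsonld_2_name', 'jsonld_name']),
    'price': first_available(proximus_crawl, ['jsonld_2_offers.price', 'jsonld_offers.price']),
    'currency': first_available(proximus_crawl, ['jsonld_2_offers.priceCurrency', 'jsonld_offers.priceCurrency']),
    'availability': first_available(proximus_crawl, ['jsonld_2_offers.availability', 'jsonld_offers.availability']),
})
print(products)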

Python table how to scrape with bs4

I have been trying to figure out how to scrape a table off a website using BS4 and the HTML, and I've been seeing the same type of code around this forum.
import requests
import pandas as pd
from bs4 import BeautifulSoup, NavigableString
url = "https://www.basketball-reference.com/leagues/NBA_2020.html"
res = requests.get(url)
id = "all_misc_stats"
html = BeautifulSoup(res.content, 'html.parser')
data = pd.read_html(html.find_all(string=lambda x: isinstance(x, NavigableString) and id in x))
pace = pd.read_html(data)[0]
I'm trying to get the Miscellaneous stats table, but it keeps telling me it is either out of range or cannot parse. What should I do?
The table data you are looking for is placed inside an HTML comment, so a possible solution would be to parse these elements, and return when it finds the matching id.
from urllib.request import urlopen
from bs4 import BeautifulSoup, Comment  # import the Comment object
import pandas as pd
url = "https://www.basketball-reference.com/leagues/NBA_2020.html"
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    ele = BeautifulSoup(c.strip(), 'html.parser')
    if tbl := ele.find("table"):  # the := assignments require Python 3.8+
        if (tbl_id := tbl.get("id")) == "misc_stats":
            pace = pd.read_html(str(tbl), header=1)[0]
            print(pace.head())
Output:
Rk Team Age W L PW ... TOV%.1 DRB% FT/FGA.1 Arena Attend. Attend./G
0 1.0 Milwaukee Bucks* 29.2 56.0 17.0 57 ... 12.0 81.6 0.178 Fiserv Forum 549036 17711
1 2.0 Los Angeles Clippers* 27.4 49.0 23.0 50 ... 12.2 77.6 0.206 STAPLES Center 610176 19068
2 3.0 Los Angeles Lakers* 29.5 52.0 19.0 48 ... 14.1 78.8 0.205 STAPLES Center 588907 18997
3 4.0 Toronto Raptors* 26.6 53.0 19.0 50 ... 14.6 76.7 0.202 Scotiabank Arena 633456 19796
4 5.0 Boston Celtics* 25.3 48.0 24.0 50 ... 13.5 77.4 0.215 TD Garden 610864 19090

How to format CSV data downloaded using requests library in an intuitive way?

I have this code I am using to extract data from a site:
import csv
import requests
url = 'https://covid-19.dataflowkit.com/v1'
response = requests.get(url)
with open('covid.csv', 'w') as f:
    writer = csv.writer(f)
    for line in response.iter_lines():
        writer.writerows(line.decode('utf-8').split(','))
I am able to get data out in a CSV file, but the format is wrong and confusing.
How do I format the output in a meaningful way in the CSV file?
Or how can I insert this result/data into a table in SQL server?
The response is JSON, and I would say the response itself is already structured in a sensible way.
import requests
import pandas as pd
url = 'https://covid-19.dataflowkit.com/v1'
response = requests.get(url)
df = pd.DataFrame(response.json())
df.to_csv("data.csv", index=False)
What does the CSV look like?
Active Cases_text Country_text Last Update New Cases_text New Deaths_text Total Cases_text Total Deaths_text Total Recovered_text
0 4,871,695 World 2020-07-12 20:16 +175,247 +3,530 13,008,752 570,564 7,566,493
1 1,757,520 USA 2020-07-12 20:16 +52,144 +331 3,407,790 137,733 1,512,537
2 579,069 Brazil 2020-07-12 19:16 +23,869 +608 1,864,681 72,100 1,213,512
3 301,850 India 2020-07-12 19:16 +29,108 +500 879,466 23,187 554,429
4 214,766 Russia 2020-07-12 20:16 +6,615 +130 727,162 11,335 501,061
.. ... ... ... ... ... ... ... ...
212 0 Caribbean Netherlands NaN 7 7
213 0 St. Barth NaN 6 6
214 0 Anguilla NaN 3 3
215 1 Saint Pierre Miquelon NaN 2 1
If you want to pull meaning out of the data, I would suggest analysing it in the pandas DataFrame.
If you want to analyse the data in a database, then you can use this answer for SQL Server: https://stackoverflow.com/a/25662997/6849682
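For the SQL Server part of the question, here is a rough sketch with pandas and SQLAlchemy. The connection string, driver and table name below are placeholders for your own environment, and it assumes pyodbc and the Microsoft ODBC driver are installed:
import requests
import pandas as pd
from sqlalchemy import create_engine

url = 'https://covid-19.dataflowkit.com/v1'
df = pd.DataFrame(requests.get(url).json())
# placeholder connection string: adjust user, password, server, database and
# ODBC driver to match your own SQL Server setup
engine = create_engine(
    'mssql+pyodbc://user:password@myserver/mydatabase?driver=ODBC+Driver+17+for+SQL+Server'
)
df.to_sql('covid_stats', engine, if_exists='replace', index=False)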

BeautifulSoup webscraping find_all( ): excluded element appended as last element

I'm trying to retrieve Financial Information from reuters.com, especially the Long Term Growth Rates of Companies. The element I want to scrape doesn't appear on all Webpages, in my example not for the Ticker 'AMCR'. All scraped info shall be appended to a list.
I've already figured out how to handle the element when it doesn't exist, but instead of the "NaN" being appended in the place where it belongs, it ends up as the last element of the list.
import requests
from bs4 import BeautifulSoup
LTGRMean = []
tickers = ['MMM','AES','LLY','LOW','PWR','TSCO','YUM','ICE','FB','AAPL','AMCR','FLS','GOOGL','FB','MSFT']
Ticker LTGRMean
0 MMM 3.70
1 AES 9.00
2 LLY 10.42
3 LOW 13.97
4 PWR 12.53
5 TSCO 11.44
6 YUM 15.08
7 ICE 8.52
8 FB 19.07
9 AAPL 12.00
10 AMCR 19.04
11 FLS 16.14
12 GOOGL 19.07
13 FB 14.80
14 MSFT NaN
My individual text "not existing" isn't appearing.
Instead, for AMCR, where Reuters doesn't provide any information, the growth rate of FLS (19.04) is shown. As a result, everything below is shifted up one index, whereas NaN should appear next to AMCR.
The DataFrame stack() function stacks the columns into rows at level 1.
import requests
from bs4 import BeautifulSoup
import pandas as pd
LTGRMean = []
tickers = ['MMM', 'AES', 'LLY', 'LOW', 'PWR', 'TSCO', 'YUM', 'ICE', 'FB', 'AAPL', 'AMCR', 'FLS', 'GOOGL', 'FB', 'MSFT']
for i in tickers:
    Test = requests.get('https://www.reuters.com/finance/stocks/financial-highlights/' + i)
    ReutSoup = BeautifulSoup(Test.content, 'html.parser')
    td = ReutSoup.find('td', string="LT Growth Rate (%)")
    my_dict = {}
    # validate that the td object is not None
    if td is not None:
        result = td.findNext('td').findNext('td').text
    else:
        result = "NaN"
    my_dict[i] = result
    LTGRMean.append(my_dict)
df = pd.DataFrame(LTGRMean)
print(df.stack())
O/P:
0 MMM 3.70
1 AES 9.00
2 LLY 10.42
3 LOW 13.97
4 PWR 12.53
5 TSCO 11.44
6 YUM 15.08
7 ICE 8.52
8 FB 19.90
9 AAPL 12.00
10 AMCR NaN
11 FLS 19.04
12 GOOGL 16.14
13 FB 19.90
14 MSFT 14.80
dtype: object
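An alternative that avoids the stacking step altogether is to build one record per ticker with explicit column names, so a missing value stays aligned with its ticker. A sketch along the same lines (note that the Reuters page layout may have changed since this was written):
import requests
from bs4 import BeautifulSoup
import pandas as pd

tickers = ['MMM', 'AES', 'AMCR']  # ... the full list from the question
records = []
for ticker in tickers:
    page = requests.get('https://www.reuters.com/finance/stocks/financial-highlights/' + ticker)
    soup = BeautifulSoup(page.content, 'html.parser')
    td = soup.find('td', string="LT Growth Rate (%)")
    # if the row is missing, record NaN for this ticker instead of skipping it
    value = td.findNext('td').findNext('td').text if td is not None else float('nan')
    records.append({'Ticker': ticker, 'LTGRMean': value})
df = pd.DataFrame(records, columns=['Ticker', 'LTGRMean'])
print(df)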

Turn an HTML table into a CSV file

How do I turn a table like this (a batting gamelogs table) into a CSV file using Python and BeautifulSoup?
I want the first header where it says Rk, Gcar, Gtm, etc. and not any of the other headers within the table (the ones for each month of the season).
Here is the code I have so far:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv

def stir_the_soup():
    player_links = open('player_links.txt', 'r')
    player_ID_nums = open('player_ID_nums.txt', 'r')
    id_nums = [x.rstrip('\n') for x in player_ID_nums]
    idx = 0
    for url in player_links:
        print url
        soup = BeautifulSoup(urlopen(url), "lxml")
        p_type = ""
        if url[-12] == 'p':
            p_type = "pitching"
        elif url[-12] == 'b':
            p_type = "batting"
        table = soup.find(lambda tag: tag.name == 'table' and tag.has_attr('id') and tag['id'] == (p_type + "_gamelogs"))
        header = [[val.text.encode('utf8') for val in table.find_all('thead')]]
        rows = []
        for row in table.find_all('tr'):
            rows.append([val.text.encode('utf8') for val in row.find_all('th')])
            rows.append([val.text.encode('utf8') for val in row.find_all('td')])
        with open("%s.csv" % id_nums[idx], 'wb') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(row for row in rows if row)
        idx += 1
    player_links.close()

if __name__ == "__main__":
    stir_the_soup()
The id_nums list contains all of the id numbers for each player to use as the names for the separate CSV files.
For each row, the leftmost cell is a <th> tag and the rest of the row is <td> tags. In addition to grabbing the header, how do I put the <th> and <td> cells of each row into one row?
This code gets you the big table of stats, which is what I think you want.
Make sure you have lxml, beautifulsoup4 and pandas installed.
import pandas as pd
df = pd.read_html(r'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010')
print(df[4])
Here is the output of the first 5 rows. You may need to clean it slightly, as I don't know what your exact end goal is:
df[4].head(5)
Rk Gcar Gtm Date Tm Unnamed: 5 Opp Rslt Inngs PA ... CS BA OBP SLG OPS BOP aLI WPA RE24 Pos
0 1 66 2 (1) Apr 6 ARI NaN SDP L,3-6 7-8 1 ... 0 1.000 1.000 1.000 2.000 9 .94 0.041 0.51 PH
1 2 67 3 Apr 7 ARI NaN SDP W,5-3 7-8 1 ... 0 .500 .500 .500 1.000 9 1.16 -0.062 -0.79 PH
2 3 68 4 Apr 9 ARI NaN PIT W,9-1 8-GF 1 ... 0 .667 .667 .667 1.333 2 .00 0.000 0.13 PH SS
3 4 69 5 Apr 10 ARI NaN PIT L,3-6 CG 4 ... 0 .500 .429 .500 .929 2 1.30 -0.040 -0.37 SS
4 5 70 7 (1) Apr 13 ARI # LAD L,5-9 6-6 1 ... 0 .429 .375 .429 .804 9 1.52 -0.034 -0.46 PH
to select certain columns within this DataFrame: df[4]['COLUMN_NAME_HERE'].head(5)
Example: df[4]['Gcar']
Also, if typing df[4] is getting annoying, you could always just assign it to another DataFrame: df2 = df[4]
import pandas as pd
from bs4 import BeautifulSoup
import urllib2
url = 'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010'
html=urllib2.urlopen(url)
bs = BeautifulSoup(html,'lxml')
table = str(bs.find('table',{'id':'batting_gamelogs'}))
dfs = pd.read_html(table)
This uses Pandas, which is pretty useful for stuff like this. It also puts it in a pretty reasonable format to do other operations on.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html
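As a follow-up, to get from here to the CSV the question asks for (a single header row only), here is a small sketch; it assumes real game rows always have a numeric Rk value, so anything else (repeated per-month headers, separator rows) can be dropped:
import pandas as pd

url = 'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010'
df = pd.read_html(url)[4]  # same table index as in the first answer above
# keep only real game rows: anything without a numeric Rk (repeated month
# headers, separator rows) is dropped
df = df[pd.to_numeric(df['Rk'], errors='coerce').notna()].reset_index(drop=True)
df.to_csv('abreuto01_batting_gamelogs.csv', index=False)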
