Scraping a table from a site - finding blank cells (Python)

Here's the site I'm working with: http://www.fantasypros.com/mlb/probable-pitchers.php
What I want to do is run the code every day and have it return a list of the pitchers pitching that day, i.e. just the first column. Here's what I have so far.
from bs4 import BeautifulSoup
import requests

url = 'http://www.fantasypros.com/mlb/probable-pitchers.php'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
table = soup.find('table', {'class': 'table table-condensed'})
table2 = table.find('tbody')  # just the rows with pitchers (excludes dates)

daysOnPage = []
for row in table.findAll('th'):
    daysOnPage.append(row.text)
daysOnPage.pop(0)
#print(daysOnPage)

pitchers = []
for row in table2.findAll('a', {'class': 'available mpb-available'}):
    pitchers.append(row.text)
This returns a list of every pitcher on the page. If every cell in the table were always filled, I could do something like deleting every nth player, but that seems pretty inelegant, and it also doesn't work since you never know which cells will be blank. I've looked through the table2.prettify() output but I can't find anything that indicates to me where a blank cell is coming from.
Thanks for the help.
Edit: Tinkering a little bit, I've figured this much out:
for row in table2.find('tr'):
    for a in row.findAll('a', {'class': 'available mpb-available'}):
        pitchers.append(a.text)
    continue
That prints the first row of pitchers, which is also a problem I was going to tackle later. Why does the continue not make it iterate through the rows?

When I hear table, I think pandas. You can have pandas.read_html do the parsing for you, then use pandas.Series.dropna to return only valid values.
In [1]: import pandas as pd
In [2]: dfs = pd.read_html('http://www.fantasypros.com/mlb/probable-pitchers.php')
In [3]: df = dfs[0].head(10) # get the first dataframe and just use the first 10 teams for this example
In [4]: print(df['Thu Aug 6']) # Selecting just one day by label
0 #WSHJ. Hellickson(7-6)SP 124
1 MIAM. Wisler(5-1)SP 306
2 NaN
3 #NYYE. Rodriguez(6-3)SP 177
4 SFK. Hendricks(4-5)SP 51
5 STLM. Lorenzen(3-6)SP 300
6 NaN
7 NaN
8 NaN
9 KCB. Farmer(0-2)SP 270
Name: Thu Aug 6, dtype: object
In [5]: active = df['Thu Aug 6'].dropna() # now just drop any fields that are NaNs
In [6]: print(active)
0 #WSHJ. Hellickson(7-6)SP 124
1 MIAM. Wisler(5-1)SP 306
3 #NYYE. Rodriguez(6-3)SP 177
4 SFK. Hendricks(4-5)SP 51
5 STLM. Lorenzen(3-6)SP 300
9 KCB. Farmer(0-2)SP 270
Name: Thu Aug 6, dtype: object
I suppose the last thing you'll want to do is parse the strings in the table to get just the pitcher's name.
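For example, here is a minimal sketch of that last parsing step, assuming the cells keep the format shown above (an optional '#', a team code, then an initial, a surname, and the record in parentheses):
import re

def pitcher_name(cell):
    # Pull 'J. Hellickson' out of '#WSHJ. Hellickson(7-6)SP 124'.
    # The pattern is an assumption based on the sample rows above.
    match = re.search(r"([A-Z]\.\s*[A-Za-z.'-]+)\(", cell)
    return match.group(1) if match else None

names = active.map(pitcher_name)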
If you want to write the Series to a csv, you can do so directly by:
In [7]: active.to_csv('active.csv')
This gives you a csv that looks something like this:
0,#WSHJ. Hellickson(7-6)SP 124
1,MIAM. Wisler(5-1)SP 306
3,#NYYE. Rodriguez(6-3)SP 177
4,SFK. Hendricks(4-5)SP 51
5,STLM. Lorenzen(3-6)SP 300
9,KCB. Farmer(0-2)SP 270

Related

Struggling to grab data from baseball reference

I'm trying to grab the "batting against" tables for all pitchers found on this page.
I believe the problem lies with the data being behind a comment.
For the sake of the example, I'd like to find, say, Sandy Alcantara's home runs allowed.
import requests
from bs4 import BeautifulSoup as bs
url="https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page=requests.get(url)
soup=bs(page.content,"html.parser")
for tr in soup.find_all('tr'):
    td = tr.find_all('td')
    print(td)
This prints a lot of team data, but doesn't print the pitcher data.
How can I cleverly get it to print the pitcher data? Ideally, I'd have it in a list or something.
object[0]=Rk
object[1]=Name
object[4]=IP
object[13]=HR
The problem with extracting the table content is that the table itself is stored inside a comment string.
After you have fetched your web page and loaded it into BeautifulSoup, you can solve this scraping issue by following these steps:
1. gather the div tagged id='all_players_batting_pitching', which contains your table;
2. extract the table from the comment using the decode_contents function, then reload the text into a new soup;
3. extract each record of your table by looking for the tr tags, then each value by looking for the td tags, keeping a value only if its td index is 0, 3 or 12 (one less than your object indices 1, 4 and 13, because the Rk column is a th, not a td);
4. load your values into a pandas.DataFrame, ready to be used.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
# fetching web page
url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page = requests.get(url)
# extracting table from html
soup = bs(page.content,"html.parser")
table = soup.find(id = 'all_players_batting_pitching')
tab_text = table.decode_contents().split('--')[1].strip()
tab_soup = bs(tab_text,"html.parser")
# extracting records from table
records = []
for i, row in enumerate(tab_soup.find_all('tr')):
    record = [ele.text.strip() for j, ele in enumerate(row.find_all('td')) if j in [0, 3, 12]]
    if record != []:
        records.append([row.a['href']] + [i] + record)
# loading records into a DataFrame
df = pd.DataFrame(records, columns=['href', 'Rk', 'Name', 'IP', 'HR'])
Output:
href Rk Name IP HR
0 /players/a/abbotco01.shtml 1 Cory Abbott 48.0 12
1 /players/a/abreual01.shtml 2 Albert Abreu 38.2 5
2 /players/a/abreual01.shtml 3 Albert Abreu 8.2 2
3 /players/a/abreual01.shtml 4 Albert Abreu 4.1 1
4 /players/a/abreual01.shtml 5 Albert Abreu 25.2 2
... ... ... ... ... ...
1063 /players/z/zastrro01.shtml 1106 Rob Zastryzny* 1.0 0
1064 /players/z/zastrro01.shtml 1107 Rob Zastryzny* 3.0 0
1065 /players/z/zerpaan01.shtml 1108 Angel Zerpa* 11.0 2
1066 /players/z/zeuchtj01.shtml 1109 T.J. Zeuch 10.2 5
1067 /players/z/zimmebr02.shtml 1110 Bruce Zimmermann* 73.2 21
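If you'd rather not rely on splitting on '--', an alternative sketch is to pull the comment node out directly with bs4's Comment class (reusing the soup and the same div id from the code above):
from bs4 import Comment

div = soup.find(id='all_players_batting_pitching')
comment = div.find(string=lambda text: isinstance(text, Comment))
tab_soup = bs(comment, 'html.parser')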

How to scrape table data with th and td with BeautifulSoup?

I am new to programming and have been trying to practice web scraping. I found an example where one of the columns I wish to have in my output is part of the table header. I am able to extract all the table data I wish, but have been unable to get the Year dates to show.
from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us download a web page
import pandas as pd

url = "https://en.wikipedia.org/wiki/World_population"
data = requests.get(url).text
soup = BeautifulSoup(data, "html.parser")
tables = soup.find_all('table')
len(tables)
for index, table in enumerate(tables):
    if "Global annual population growth" in str(table):
        table_index = index
print(table_index)
print(tables[table_index].prettify())
population_data = pd.DataFrame(columns=["Year", "Population", "Growth"])
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if col != []:
        Population = col[0].text.strip()
        Growth = col[1].text.strip()
        population_data = population_data.append({"Population": Population, "Growth": Growth}, ignore_index=True)
population_data
population_data
You could use pandas directly here to reach your goal: pandas.read_html() to scrape the table and DataFrame.T to transpose it:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/World_population')[0].T.reset_index()
df.columns = df.loc[0]
df = df[1:]
df
or same result with BeautifulSoup and stripped_strings:
import requests
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://en.wikipedia.org/wiki/World_population').text, 'html.parser')
pd.DataFrame(
    {list(e.stripped_strings)[0]: list(e.stripped_strings)[1:] for e in soup.table.select('tr')}
)
Output

Population   Year   Years elapsed
1            1804   200,000+
2            1930   126
3            1960   30
4            1974   14
5            1987   13
6            1999   12
7            2011   12
8            2022   11
9            2037   15
10           2057   20
Actually it's because you are only scraping <td> in this line:
col = row.find_all('td')
But if you take a look at a <tr> in the developer tools (F12), you can see that the table also contains a <th> tag, which holds the year and which you are not scraping. So all you have to do is add this line after the if condition:
year = row.find('th').text
and after that you can append it to population_data.
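A minimal sketch of the loop with that fix folded in (collecting rows in a list and building the DataFrame once at the end, since DataFrame.append was removed in pandas 2.0):
rows = []
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if col != []:
        year = row.find('th').text.strip()  # the year lives in a <th>, not a <td>
        rows.append({"Year": year,
                     "Population": col[0].text.strip(),
                     "Growth": col[1].text.strip()})
population_data = pd.DataFrame(rows, columns=["Year", "Population", "Growth"])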

Data/Table Scraping from Website using Python

I'm trying to scrape a data from a table on a website.
However, I am continuously running into "ValueError: cannot set a row with mismatched columns".
The set-up is:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://kr.youtubers.me/united-states/all/top-500-youtube-channels-in-united-states/en'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table1 = soup.find('div', id='content')
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
my_data = pd.DataFrame(columns=headers)
my_data = my_data.iloc[:, :-4]
Here, I was able to make an empty dataframe with headers same as the table (I did iloc because there were some repeating columns at the end).
Now, I wanted to fill in the empty dataframe through:
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(my_data)
    my_data.loc[length] = row
However, as mentioned, I get "ValueError: cannot set a row with mismatched columns" when the loop reaches my_data.loc[length] = row.
I would really appreciate any help to solve this problem and to fill in the empty dataframe.
Thanks in advance.
Rather than trying to fill an empty DataFrame, it is simpler to use .read_html, which returns a list of DataFrames after parsing every table tag within the HTML.
Even though this page has only two tables ("Top Youtube channels" and "Top Youtube channels - detail stats"), 3 DataFrames are returned because the second table is split into two table tags between rows 12 and 13 for some reason; but they can all be combined into one DataFrame.
dfList = pd.read_html(url) # OR
# dfList = pd.read_html(page.text) # OR
# dfList = pd.read_html(soup.prettify())
allTime = dfList[0].set_index(['rank', 'Youtuber'])
# (header row in 1st half so 2nd half reads as headerless to pandas)
dfList[2].columns = dfList[1].columns
perYear = pd.concat(dfList[1:]).set_index(['rank', 'Youtuber'])
columns_ordered = [
'started', 'category', 'subscribers', 'subscribers/year',
'video views', 'Video views/Year', 'video count', 'Video count/Year'
] # re-order columns as preferred
combinedDf = pd.concat([allTime, perYear], axis='columns')[columns_ordered]
If the [columns_ordered] part is omitted from the last line, then the expected column order would be 'subscribers', 'video views', 'video count', 'category', 'started', 'subscribers/year', 'Video views/Year', 'Video count/Year'.
combinedDf should then contain, for each channel, the all-time stats and the per-year stats side by side, indexed by rank and Youtuber.
You can try to use pd.read_html to read the table into a dataframe:
import pandas as pd
url = "https://kr.youtubers.me/united-states/all/top-500-youtube-channels-in-united-states/en"
df = pd.read_html(url)[0]
print(df)
Prints:
rank Youtuber subscribers video views video count category started
0 1 ✿ Kids Diana Show 106000000 86400421379 1052 People & Blogs 2015
1 2 Movieclips 58500000 59672883333 39903 Film & Animation 2006
2 3 Ryan's World 34100000 53568277882 2290 Entertainment 2015
3 4 Toys and Colors 38300000 44050683425 901 Entertainment 2016
4 5 LooLoo Kids - Nursery Rhymes and Children's Songs 52200000 30758617681 605 Music 2014
5 6 LankyBox 22500000 30147589773 6913 Comedy 2016
6 7 D Billions 24200000 27485780190 582 NaN 2019
7 8 BabyBus - Kids Songs and Cartoons 31200000 25202247059 1946 Education 2016
8 9 FGTeeV 21500000 23255537029 1659 Gaming 2013
...and so on.

How can I scrape tables that seem to be hidden by jquery?

I'm trying to scrape these words with their meanings from this website. I scraped the first table, but even after revealing word list 2 by clicking on it, bs4 can't find that table (or any of the other hidden tables). Is there anything different I'm meant to do for toggled/hidden elements like this?
Here's what I used to access the first table:
import requests
from bs4 import BeautifulSoup

root = "https://www.graduateshotline.com/gre-word-list.html#x2"
content = requests.get(root).text
soup = BeautifulSoup(content, 'html.parser')
table = soup.find_all('table', attrs={'class': 'tablex border1'})[0]
print(table)
The hidden word lists aren't in the page's initial HTML at all; they are loaded on demand by jQuery from separate URLs (e.g. load.php?file=list2.html), so you can read those URLs directly:
import pandas as pd

df = pd.read_html('https://www.graduateshotline.com/gre/load.php?file=list2.html',
                  attrs={'class': 'tablex border1'})[0]
print(df)
Output:
0 1
0 multifarious varied; motley; greatly diversified
1 substantiation giving facts to support (statement)
2 feud bitter quarrel over a long period of time
3 indefatigability not easily exhaustible; tirelessness
4 convoluted complicated;coiled; twisted
.. ... ...
257 insensible unconscious; unresponsive; unaffected
258 gourmand a person who is devoted to eating and drinking...
259 plead address a court of law as an advocate
260 morbid diseased; unhealthy (e.g.. about ideas)
261 enmity hatred being an enemy
[262 rows x 2 columns]
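If the remaining word lists follow the same URL pattern (an assumption based on the list2 URL above; check your browser's network tab to confirm), here is a sketch that collects them all into one DataFrame:
import pandas as pd

base = 'https://www.graduateshotline.com/gre/load.php?file=list{}.html'
frames = [pd.read_html(base.format(n), attrs={'class': 'tablex border1'})[0]
          for n in range(1, 6)]  # hypothetical range; adjust to however many lists exist
words = pd.concat(frames, ignore_index=True)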

Nested string in a list - need to split nested string to help turn it into a dataframe

I'm working on a web scraping project with NBA stats. When I scrape, I can get all of the information. However, all of the stats come back as one string which, turned into a dataframe, puts all the stats in one column. I'm attempting to split this string and replace it in its own nested area.
Hopefully the image explains this better.
I am webscraping from https://stats.nba.com/players/traditional/?sort=PTS&dir=-1 using selenium because I am planning on clicking though all of the pages
Here is the function I'm working on so far. In the last line I would like to replace z[2] with the split version I've created, but when I try z[2] = z[2].split(' ') I get the error AttributeError: 'list' object has no attribute 'split'.
new_split = []
for i in player:
    player_stats.append(i.text.split('\n'))
for z in player_stats:
    new_split.append(z[2].split(' '))
The official nba website offers its data in JSON format, which is always desirable when web scraping. (I've updated the url in my code; it's still the same API, which returns information for all 457 players, so there is no need to use selenium to navigate through the other pages.)
import requests
import json
# url = "https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2019-20&SeasonType=Regular+Season&StatCategory=PTS"
url = "https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision=&Weight="
response = requests.get(url)
response.raise_for_status()
data = json.loads(response.text)
players = []
for player_data in data["resultSet"]["rowSet"]:
    player = dict(zip(data["resultSet"]["headers"], player_data))
    players.append(player)

for player in players[:10]:
    print(f"{player['PLAYER']} ({player['TEAM_ABBREVIATION']}) is rank {player['RANK']} with a GP of {player['GP']}")
Output:
James Harden (HOU) is rank 1 with a GP of 18
Giannis Antetokounmpo (MIL) is rank 2 with a GP of 19
Luka Doncic (DAL) is rank 3 with a GP of 18
Bradley Beal (WAS) is rank 4 with a GP of 17
Trae Young (ATL) is rank 5 with a GP of 18
Damian Lillard (POR) is rank 6 with a GP of 18
Karl-Anthony Towns (MIN) is rank 7 with a GP of 16
Anthony Davis (LAL) is rank 8 with a GP of 18
Brandon Ingram (NOP) is rank 9 with a GP of 15
LeBron James (LAL) is rank 10 with a GP of 19
Note: I have no idea what a "GP" is - I just picked that for demonstration. Here's a screenshot of Chrome's network logger, showing a small part of the expanded JSON resource (EDIT The json response from the new url looks exactly the same, except some of the headers are different, like "TEAM" -> "TEAM_ABBREVIATION"):
You can see the values - which you're struggling to extract out of one giant string - nicely separated into separate elements. The code I posted above creates key-value pairs using the headers ("PLAYER_ID", "RANK", etc. found in data["resultSet"]["headers"]) and these values.
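Incidentally, since the end goal was a dataframe: pandas can build one directly from those same headers and rows (a sketch reusing the data variable from the code above):
import pandas as pd

df = pd.DataFrame(data["resultSet"]["rowSet"],
                  columns=data["resultSet"]["headers"])
print(df.head())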
If the second column is a string, you could try to split this string into a list, turn each element of this list into a series, and then concat this new data frame with the first two columns of the original data frame.
df_stats = df["2"].apply(lambda x: x[0].split(" ")).apply(pd.Series)
df_end = pd.concat([df[["0","1"]].reset_index(drop=True), df_stats], axis=1)
Example:
df = pd.DataFrame({"0": [1, 2],
"1": ["Name1", "Name2"],
"2":[["HOU 30 80"], ["LA 30 50"]]})
df_stats = df["2"].apply(lambda x: x[0].split(" ")).apply(pd.Series)
df_end = pd.concat([df[["0","1"]].reset_index(drop=True), df_stats], axis=1)
0 1 0 1 2
0 1 Name1 HOU 30 80
1 2 Name2 LA 30 50
