I am trying to retrieve 3 columns (NFL Team, Player Name, College Team) from the following Wikipedia page. I am new to Python and have been trying to use BeautifulSoup to get this done. I only need the rows that belong to QBs, but I haven't even been able to get all the columns regardless of position. This is what I have so far; it outputs nothing and I'm not entirely sure why. I believe it is due to the a tags, but I do not know what to change. Any help would be greatly appreciated.
import urllib2
from bs4 import BeautifulSoup

wiki = "http://en.wikipedia.org/wiki/2008_NFL_draft"
header = {'User-Agent': 'Mozilla/5.0'}  # Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

rnd = ""
pick = ""
NFL = ""
player = ""
pos = ""
college = ""
conf = ""
notes = ""

table = soup.find("table", {"class": "wikitable sortable"})
#print table
#output = open('output.csv', 'w')
for row in table.findAll("tr"):
    cells = row.findAll("href")
    print "---"
    print cells.text
    print "---"
    #For each "tr", assign each "td" to a variable.
    #if len(cells) > 1:
    #    NFL = cells[1].find(text=True)
    #    player = cells[2].find(text=True)
    #    pos = cells[3].find(text=True)
    #    college = cells[4].find(text=True)
    #    write_to_file = player + " " + NFL + " " + college + " " + pos
    #    print write_to_file
    #    output.write(write_to_file)
#output.close()
I know a lot of it is commented out because I was trying to find where the breakdown was.
Here is what I would do:
find the Player Selections paragraph
get the next wikitable using find_next_sibling()
find all tr tags inside
for every row, find td and th tags and get the desired cells by index
Here is the code:
filter_position = 'QB'
player_selections = soup.find('span', id='Player_selections').parent
for row in player_selections.find_next_sibling('table', class_='wikitable').find_all('tr')[1:]:
    cells = row.find_all(['td', 'th'])

    try:
        nfl_team, name, position, college = cells[3].text, cells[4].text, cells[5].text, cells[6].text
    except IndexError:
        continue

    if position != filter_position:
        continue

    print nfl_team, name, position, college
And here is the output (only quarterbacks are filtered):
Atlanta Falcons Ryan, MattMatt Ryan† QB Boston College
Baltimore Ravens Flacco, JoeJoe Flacco QB Delaware
Green Bay Packers Brohm, BrianBrian Brohm QB Louisville
Miami Dolphins Henne, ChadChad Henne QB Michigan
New England Patriots O'Connell, KevinKevin O'Connell QB San Diego State
Minnesota Vikings Booty, John DavidJohn David Booty QB USC
Pittsburgh Steelers Dixon, DennisDennis Dixon QB Oregon
Tampa Bay Buccaneers Johnson, JoshJosh Johnson QB San Diego
New York Jets Ainge, ErikErik Ainge QB Tennessee
Washington Redskins Brennan, ColtColt Brennan QB Hawaiʻi
New York Giants Woodson, Andre'Andre' Woodson QB Kentucky
Green Bay Packers Flynn, MattMatt Flynn QB LSU
Houston Texans Brink, AlexAlex Brink QB Washington State
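Since the original snippet was trying to write to output.csv, here is a minimal sketch (Python 3, reusing the same soup and table traversal as above; the filename and header row are arbitrary) that writes the matching rows out with the csv module instead of printing them:
import csv

filter_position = 'QB'
table = soup.find('span', id='Player_selections').parent.find_next_sibling('table', class_='wikitable')

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['NFL Team', 'Player Name', 'College Team'])  # arbitrary header row
    for row in table.find_all('tr')[1:]:
        cells = row.find_all(['td', 'th'])
        if len(cells) < 7:
            continue  # skip rows that don't have the full set of columns
        nfl_team, name, position, college = (cells[i].get_text(strip=True) for i in (3, 4, 5, 6))
        if position == filter_position:
            writer.writerow([nfl_team, name, college])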
I am looking to do a data science project where I will be able to sum up fantasy football points by the college the players went to (e.g., Alabama has 56 active players in the NFL, so I will go through a database, add up all of their fantasy points, and compare with other schools).
I was looking at the website:
https://fantasydata.com/nfl/fantasy-football-leaders?season=2020&seasontype=1&scope=1&subscope=1&aggregatescope=1&range=3
and I was going to use Beautiful Soup to scrape the rows of players and statistics and ultimately, fantasy football points.
However, I am having trouble figuring out how to extract the players' college alma mater. To do so, I would have to:
Click each player's name
Scrape each and every profile of the hundreds of NFL players for one line "College"
Place all of this information into its own column.
Any suggestions here?
There's no need for Selenium or any other headless, automated browser. That's overkill.
If you take a look at your browser's network traffic, you'll notice that your browser makes a POST request to this REST API endpoint: https://fantasydata.com/NFL_FantasyStats/FantasyStats_Read
If the POST request is well-formed, the API responds with JSON containing information about every single player. Normally, this information would be used to populate the DOM asynchronously using JavaScript. There's quite a lot of information there, but unfortunately the college information isn't part of the JSON response. However, there is a field PlayerUrlString, a relative URL to a given player's profile page, which does contain the college name. So:
Make a POST request to the API to get information about all players
For each player in the response JSON:
Visit that player's profile
Use BeautifulSoup to extract the college name from the current player's profile
Code:
def main():

    import requests
    from bs4 import BeautifulSoup

    url = "https://fantasydata.com/NFL_FantasyStats/FantasyStats_Read"

    data = {
        "sort": "FantasyPoints-desc",
        "pageSize": "50",
        "filters.season": "2020",
        "filters.seasontype": "1",
        "filters.scope": "1",
        "filters.subscope": "1",
        "filters.aggregatescope": "1",
        "filters.range": "3",
    }

    response = requests.post(url, data=data)
    response.raise_for_status()

    players = response.json()["Data"]

    for player in players:
        url = "https://fantasydata.com" + player["PlayerUrlString"]
        response = requests.get(url)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, "html.parser")
        college = soup.find("dl", {"class": "dl-horizontal"}).findAll("dd")[-1].text.strip()
        print(player["Name"] + " went to " + college)

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Patrick Mahomes went to Texas Tech
Kyler Murray went to Oklahoma
Aaron Rodgers went to California
Russell Wilson went to Wisconsin
Josh Allen went to Wyoming
Deshaun Watson went to Clemson
Ryan Tannehill went to Texas A&M
Lamar Jackson went to Louisville
Dalvin Cook went to Florida State
...
You can also edit the pageSize POST parameter in the data dictionary. The 50 corresponds to information about the first 50 players in the JSON response (according to the filters set by the other POST parameters). Changing this value will yield more or fewer players in the JSON response.
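For example, to ask for a larger slice in a single request, you could change pageSize before posting (a sketch reusing the url and data dictionary from the script above; I haven't checked whether the API caps the page size):
data["pageSize"] = "300"  # request the first 300 players instead of 50
response = requests.post(url, data=data)
response.raise_for_status()
players = response.json()["Data"]
print(len(players))  # number of player records returned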
I agree, APIs are the way to go if they are there. My second "go to" is pandas' .read_html() (which uses BeautifulSoup under the hood to parse <table> tags). Here's an alternate solution using ESPN's API to get team roster links, then using pandas to pull the table from each link. It saves you the trouble of having to iterate through each player to get the college (I wish they just had an API that returned all players; nfl.com used to have that, but it's no longer publicly available, as far as I know).
Code:
import requests
import pandas as pd

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/athletes/101'

all_teams = []
roster_links = []
for i in range(1, 35):
    url = 'http://site.api.espn.com/apis/site/v2/sports/football/nfl/teams/{teamId}'.format(teamId=i)
    jsonData = requests.get(url).json()
    print (jsonData['team']['displayName'])

    for link in jsonData['team']['links']:
        if link['text'] == 'Roster':
            roster_links.append(link['href'])
            break

for link in roster_links:
    print (link)
    tables = pd.read_html(link)
    df = pd.concat(tables).drop('Unnamed: 0', axis=1)
    # ESPN appends the jersey number to the Name column, so split it into its own column
    df['Jersey'] = df['Name'].str.replace("([A-Za-z.' ]+)", '')
    df['Name'] = df['Name'].str.extract("([A-Za-z.' ]+)")
    all_teams.append(df)

final_df = pd.concat(all_teams).reset_index(drop=True)
Output:
print (final_df)
Name POS Age HT WT Exp College Jersey
0 Matt Ryan QB 35 6' 4" 217 lbs 13 Boston College 2
1 Matt Schaub QB 39 6' 6" 245 lbs 17 Virginia 8
2 Todd Gurley II RB 26 6' 1" 224 lbs 6 Georgia 21
3 Brian Hill RB 25 6' 1" 219 lbs 4 Wyoming 23
4 Qadree Ollison RB 24 6' 1" 232 lbs 2 Pittsburgh 30
... .. ... ... ... .. ... ...
1772 Jonathan Owens S 25 5' 11" 210 lbs 2 Missouri Western 36
1773 Justin Reid S 23 6' 1" 203 lbs 3 Stanford 20
1774 Ka'imi Fairbairn PK 26 6' 0" 183 lbs 5 UCLA 7
1775 Bryan Anger P 32 6' 3" 205 lbs 9 California 9
1776 Jon Weeks LS 34 5' 10" 242 lbs 11 Baylor 46
[1777 rows x 8 columns]
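Once final_df is built, the per-college numbers the question mentions (e.g. how many active players each school has) are a groupby away; a small sketch using the dataframe from above:
# Count how many players on current rosters came from each college.
players_per_college = (final_df.groupby('College')['Name']
                               .count()
                               .sort_values(ascending=False))
print(players_per_college.head(10))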
I'm creating a web scraper in order to pull the name of a company from a chamber of commerce website directory.
I'm using BeautifulSoup. The page and soup objects appear to be working, but when I scrape the HTML content, an empty list is returned when it should be filled with the directory names on the page.
Web page trying to scrape: https://www.austinchamber.com/directory
Here is the HTML:
<div>
  <ul class="item-list item-list--small">
    <li>
      <div class="item-content">
        <div class="item-description">
          <h5 class="h5">Women Helping Women LLC</h5>
Here is the Python code:
import requests
from bs4 import BeautifulSoup

def pageRequest(url):
    page = requests.get(url)
    return page

def htmlSoup(page):
    soup = BeautifulSoup(page.content, "html.parser")
    return soup

def getNames(soup):
    name = soup.find_all('h5', class_='h5')
    return name

page = pageRequest("https://www.austinchamber.com/directory")
soup = htmlSoup(page)
name = getNames(soup)

for n in name:
    print(n)
The data is loaded dynamically via Ajax. To get the data, you can use this script:
import json
import requests

url = 'https://www.austinchamber.com/api/v1/directory?filter[categories]=&filter[show]=all&page={page}&limit=24'

page = 1
for page in range(1, 10):
    print('Page {}..'.format(page))
    data = requests.get(url.format(page=page)).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for d in data['data']:
        print(d['title'])
Prints:
...
Indeed
Austin Telco Federal Credit Union - Taos
Green Bank
Seton Medical Center Austin
Austin Telco Federal Credit Union - Jollyville
Page 42..
Texas State SBDC - San Marcos Office
PlainsCapital Bank - Motor Bank
University of Texas - Thompson Conference Center
Lamb's Tire & Automotive Centers - #2 Research & Braker
AT&T Labs
Prosperity Bank - Rollingwood
Kerbey Lane Cafe - Central
Lamb's Tire & Automotive Centers - #9 Bee Caves
Seton Medical Center Hays
PlainsCapital Bank - North Austin
Ellis & Salazar Body Shop
aLamb's Tire & Automotive Centers - #6 Lake Creek
Rudy's Country Store and BarBQ
...
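The hard-coded range(1, 10) only walks the first nine pages. A sketch that keeps requesting pages until the API returns an empty data list (assuming the endpoint behaves that way, which I haven't verified):
import requests

url = 'https://www.austinchamber.com/api/v1/directory?filter[categories]=&filter[show]=all&page={page}&limit=24'

titles = []
page = 1
while True:
    data = requests.get(url.format(page=page)).json()
    if not data.get('data'):  # stop once a page comes back empty
        break
    titles.extend(d['title'] for d in data['data'])
    page += 1

print('Collected {} directory entries'.format(len(titles)))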
Hi, I am trying to scrape this website with Python 3 and noticed that the source code does not give a clear indication of how I would scrape the names of the winners in these primary elections. Can you show me how to scrape a list of all the winners in every MD primary election from this website?
https://elections2018.news.baltimoresun.com/results/
The parsing is a little bit complicated, because the results are spread across many subpages. This script collects them and prints the results (all data is stored in the variable data):
from bs4 import BeautifulSoup
import requests

url = "https://elections2018.news.baltimoresun.com/results/"

r = requests.get(url)
data = {}
soup = BeautifulSoup(r.text, 'lxml')

for race in soup.select('div[id^=race]'):
    r = requests.get(f"https://elections2018.news.baltimoresun.com/results/contests/{race['id'].split('-')[1]}.html")
    s = BeautifulSoup(r.text, 'lxml')

    l = []
    data[(s.find('h3').text, s.find('div', {'class': 'party-header'}).text)] = l
    for candidate, votes, percent in zip(s.select('td.candidate'), s.select('td.votes'), s.select('td.percent')):
        l.append((candidate.text, votes.text, percent.text))

print('Winners:')
for (race, party), v in data.items():
    print(race, party, v[0])

# print(data)
Outputs:
Winners:
Governor / Lt. Governor Democrat ('Ben Jealous and Susan Turnbull', '227,764', '39.6%')
U.S. Senator Republican ('Tony Campbell', '50,915', '29.2%')
U.S. Senator Democrat ('Ben Cardin', '468,909', '80.4%')
State's Attorney Democrat ('Marilyn J. Mosby', '39,519', '49.4%')
County Executive Democrat ('John "Johnny O" Olszewski, Jr.', '27,270', '32.9%')
County Executive Republican ('Al Redmer, Jr.', '17,772', '55.7%')
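Note that v[0] assumes each contest page lists the leading candidate first. If you don't want to rely on that ordering, here is a small sketch that picks the winner by vote count instead (assuming the votes column always parses as a number):
print('Winners:')
for (race, party), candidates in data.items():
    # candidates are (name, votes, percent) tuples; strip the thousands separators before comparing
    winner = max(candidates, key=lambda c: int(c[1].replace(',', '')))
    print(race, party, winner)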
I am having trouble appending data into a list as I iterate through the following:
import urllib
import urllib.request
from bs4 import BeautifulSoup
import pandas

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    thepage.addheaders = [('User-Agent', 'Mozilla/5.0')]
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

soup = make_soup('https://www.wellstar.org/locations/pages/default.aspx')

locationdata = []

for table in soup.findAll('table', class_ = 's4-wpTopTable'):
    for name in table.findAll('div', 'PurpleBackgroundHeading'):
        name = name.get_text(strip = True)
    for loc_type in table.findAll('h3', class_ = 'WebFont SpotBodyGreen'):
        loc_type = loc_type.get_text()
    for address in table.findAll('div', class_ = ['WS_Location_Address', 'WS_Location_Adddress']):
        address = address.get_text(strip = True, separator = ' ')
        locationdata.append([name, loc_type, address])

df = pandas.DataFrame(columns = ['name', 'loc_type', 'address'], data = locationdata)
print(df)
The produced dataframe includes all unique addresses; however, it only shows the last possible text for the name. For example, even though 'WellStar Windy Hill Hospital' is the last hospital within the hospital category/type, it appears as the name for all hospitals. If possible, I would prefer a list.append solution, as I have several more similar steps to go through to finalize this project.
This is occurring because you're looping through all the names and loc_types before you're appending to locationdata.
You can instead do:
import itertools as it
from pprint import pprint as pp
for table in soup.findAll('table', class_='s4-wpTopTable'):
    names = [name.get_text(strip=True) for
             name in table.findAll('div', 'PurpleBackgroundHeading')]
    loc_types = [loc_type.get_text() for
                 loc_type in table.findAll('h3', class_='WebFont SpotBodyGreen')]
    addresses = [address.get_text(strip=True, separator=' ') for
                 address in table.findAll('div', class_=['WS_Location_Address',
                                                         'WS_Location_Adddress'])]
    for name, loc_type, address in it.zip_longest(names, loc_types, addresses):
        locationdata.append([name, loc_type, address])
Result:
>>> pp(locationdata)
[[u'WellStar Urgent Care in Acworth',
u'WellStar Urgent Care Centers',
u'4550 Cobb Parkway NW Suite 101 Acworth, GA 30101 770-917-8140'],
[u'WellStar Urgent Care in Kennesaw',
None,
u'3805 Cherokee Street Kennesaw, GA 30144 770-426-5665'],
[u'WellStar Urgent Care in Marietta - Delk Road',
None,
u'2890 Delk Road Marietta, GA 30067 770-955-8620'],
[u'WellStar Urgent Care in Marietta - East Cobb',
None,
u'3747 Roswell Road Ne Suite 107 Marietta, GA 30062 470-956-0150'],
[u'WellStar Urgent Care in Marietta - Kennestone',
None,
u'818 Church Street Suite 100 Marietta, GA 30060 770-590-4190'],
[u'WellStar Urgent Care in Marietta - Sandy Plains Road',
None,
u'3600 Sandy Plains Road Marietta, GA 30066 770-977-4547'],
[u'WellStar Urgent Care in Smyrna',
None,
u'4480 North Cooper Lake Road SE Suite 100 Smryna, GA 30082 770-333-1300'],
[u'WellStar Urgent Care in Woodstock',
None,
u'120 Stonebridge Parkway Suite 310 Woodstock, GA 30189 678-494-2500']]
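If, as the zip_longest output above suggests, each s4-wpTopTable block carries at most one category heading, a variant that repeats that heading for every row (so the loc_type column is never None) could look like this sketch, appending to the same locationdata list:
for table in soup.findAll('table', class_='s4-wpTopTable'):
    names = [n.get_text(strip=True)
             for n in table.findAll('div', 'PurpleBackgroundHeading')]
    addresses = [a.get_text(strip=True, separator=' ')
                 for a in table.findAll('div', class_=['WS_Location_Address',
                                                       'WS_Location_Adddress'])]
    # assume one category heading per table block; fall back to None if it is missing
    loc_type_tag = table.find('h3', class_='WebFont SpotBodyGreen')
    loc_type = loc_type_tag.get_text() if loc_type_tag else None
    for name, address in zip(names, addresses):
        locationdata.append([name, loc_type, address])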
For class, I need to scrape data from https://www.cia.gov/Library/publications/the-world-factbook/fields/2047.html. I am able to scrape single data points, specifically the Country Name and the Highest 10% (which is all I need for this assignment). With the code below, I can scrape the name "Afghanistan" and the data point for highest 10% "24":
import urllib2
from bs4 import BeautifulSoup

f = open('cia.txt', 'w')

import os
os.getcwd()

ciapage = 'https://www.cia.gov/Library/publications/the-world-factbook/fields/2047.html'
page = urllib2.urlopen(ciapage)
soup = BeautifulSoup(page, "html.parser")
soup.title

soup.findAll(attrs={"class": "country"})
country = soup.findAll(attrs={"class": "country"})
print country[0]
countries = country[0].string
print countries
f.write(countries + "\n")
f.close()

f = open('cia.txt', 'w')
import gettext
percents = soup.findAll(attrs={"class": "fieldData"})
print percents[0].get_text()
print percents[0].contents
for string in percents[0].strings:
    print(repr(string))
for string in percents[0].stripped_strings:
    print(repr(string))
print percents[0].contents[6]
f.write(percents[0].contents[6])
f.close()
While all of that runs well, I do not know how to do it for all country names and highest 10%s. I have done very little Python, so comments (#) explaining what each line of code means would be very helpful. I need my final product to be a .txt file with comma-delimited values (e.g. Afghanistan, 24%).
import requests
from bs4 import BeautifulSoup

url = "https://www.cia.gov/Library/publications/the-world-factbook/fields/2047.html"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("table", id="fieldListing")  # the table that lists one country per row

with open('a.txt', 'w') as f:
    for tr in table('tr', id=True):  # only rows that have an id attribute (the country rows)
        l = list(tr.stripped_strings)  # ['Afghanistan', 'lowest 10%:', '3.8%', 'highest 10%:', '24% (2008)']
        country = l[0]
        highest = l[-1].split()[0]  # first token of the last string, e.g. '24%'
        f.write(country + ' ' + highest + '\n')
Output:
Afghanistan 24%
Albania 20.5%
Algeria 26.8%
American Samoa NA%
Andorra NA%
Angola 44.7%
Anguilla NA%
Antigua and Barbuda NA%
Argentina 32.3%
Armenia 24.8%
Aruba NA%
Australia 25.4%
Austria 23.5%
Azerbaijan 27.4%
Bahamas, The 22%
Bahrain NA%
Bangladesh 27%
Barbados NA%
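Since the question asked for comma-delimited values (e.g. Afghanistan, 24%), the same loop can hand its rows to a csv writer instead; a minimal sketch reusing the table variable from above (the filename is arbitrary):
import csv

with open('cia.csv', 'w') as f:
    writer = csv.writer(f)
    for tr in table('tr', id=True):
        l = list(tr.stripped_strings)
        country = l[0]
        highest = l[-1].split()[0]  # e.g. '24%' from 'highest 10%: 24% (2008)'
        writer.writerow([country, highest])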