How to pull only certain fields with BeautifulSoup

How to pull only certain fields with BeautifulSoup - python

I'm trying to print all the fields that have England in them, the current code i have prints all the Nationalities into a txt file for me, but i want just the england fields to print. the page im pulling from is https://www.premierleague.com/players
import requests
from bs4 import BeautifulSoup
r=requests.get("https://www.premierleague.com/players")
c=r.content
soup=BeautifulSoup(c, "html.parser")
players = open("playerslist.txt", "w+")
for playerCountry in soup.findAll("span", {"class":"playerCountry"}):
players.write(playerCountry.text.strip())
players.write("\n")

Just need to check if it's not equal 'England', and if so, skip to next item in list:
import requests
from bs4 import BeautifulSoup
r=requests.get("https://www.premierleague.com/players")
c=r.content
soup=BeautifulSoup(c, "html.parser")
players = open("playerslist.txt", "w+")
for playerCountry in soup.findAll("span", {"class":"playerCountry"}):
if playerCountry.text.strip() != 'England':
continue
players.write(playerCountry.text.strip())
players.write("\n")

Or, you could just use pandas.read_html() and a couple lines of code:
import pandas as pd
df = pd.read_html("https://www.premierleague.com/players")[0]
print(df.loc[df['Nationality'] != 'England'])
Prints:
Player Position Nationality
2 Charlie Adam Midfielder Scotland
3 Adrián Goalkeeper Spain
4 Adrien Silva Midfielder Portugal
5 Ibrahim Afellay Midfielder Netherlands
6 Benik Afobe Forward The Democratic Republic Of Congo
7 Sergio Agüero Forward Argentina
9 Soufyan Ahannach Midfielder Netherlands
10 Ahmed Hegazi Defender Egypt
11 Nathan Aké Defender Netherlands
14 Toby Alderweireld Defender Belgium
15 Aleix García Midfielder Spain
17 Ali Gabr Defender Egypt
18 Allan Nyom Defender Cameroon
19 Allan Souza Midfielder Brazil
20 Joe Allen Midfielder Wales
22 Marcos Alonso Defender Spain
23 Paulo Alves Midfielder Portugal
24 Daniel Amartey Midfielder Ghana
25 Jordi Amat Defender Spain
27 Ethan Ampadu Defender Wales
28 Nordin Amrabat Forward Morocco

Related

NetworkX graph with some specifications based on two dataframes

I have two dataframes. The first shows the name of people of a program, called df_student.
Student-ID
Name
20202456
Luke De Paul
20202713
Emil Smith
20202456
Alexander Müller
20202713
Paul Bernard
20202456
Zoe Michailidis
20202713
Joanna Grimaldi
20202456
Kepler Santos
20202713
Dominic Borg
20202456
Jessica Murphy
20202713
Danielle Dominguez
And the other shows a dataframe where people reach the best grades with at least one person from the df_student in a course and is called df_course.
Course-ID
Name
Grade
UNI44
Luke De Paul, Benjamin Harper
17
UNI45
Dominic Borg
20
UNI61
Luke De Paul, Jonathan MacAllister
20
UNI62
Alexander Müller, Kepler Santos
17
UNI63
Joanna Grimaldi
19
UNI65
Emil Smith, Filippo Visconti
18
UNI71
Moshe Azerad, Emil Smith
18
UNI72
Luke De Paul, Jessica Murphy
18
UNI73
Luke De Paul, Filippo Visconti
17
UNI74
Matthias Noem, Kepler Santos
19
UNI75
Luke De Paul, Kepler Santos
16
UNI76
Kepler Santos
17
UNI77
Kepler Santos, Benjamin Harper
17
UNI78
Dominic Borg, Kepler Santos
18
UNI80
Luke De Paul, Gabriel Martin
18
UNI81
Dominic Borg, Alexander Müller
19
UNI82
Luke De Paul, Giancarlo Di Lorenzo
20
UNI83
Emil Smith,Joanna Grimaldi
20
I would like to create a NetworkX graph where there is a vertex for each student from df_student and also from each student from df_course. There should also be an unweighted each between two vertices only if two student received the best grade in the same course.
Now what I tried is this
import networkx as nx
G = nx.Graph()
G.add_edge(student, course)
But when I doing is it say that argument is not right. And so I don't know how to continue

Try:
import networkx as nx
import pandas as pd
df_students = pd.read_clipboard()
df_course = pd.read_clipboard()
df_s_t = df_course['Name'].str.split(',', expand=True)
G = nx.from_pandas_edgelist(df_net, 0, 1)
df_net = df_s_t[df_s_t.notna().all(1)]
G.add_nodes_from(pd.concat([df_students['Name'],
df_s_t.loc[~df_s_t.notna().all(1),0]]))
fig, ax = plt.subplots(1,1, figsize=(15,15))
nx.draw_networkx(G)
Output:

cleaning up web scrape data and combining together?

The website URL is https://www.justia.com/lawyers/criminal-law/maine
I'm wanting to scrape only the name of the lawyer and where their office is.
response = requests.get(url)
soup= BeautifulSoup(response.text,"html.parser")
Lawyer_name= soup.find_all("a","url main-profile-link")
for i in Lawyer_name:
print(i.find(text=True))
address= soup.find_all("span","-address -hide-landscape-tablet")
for x in address:
print(x.find_all(text=True))
The name prints out just find but the address is printing off with extra that I want to remove:
['\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t88 Hammond Street', '\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tBangor,\t\t\t\t\tME 04401\t\t\t\t\t\t ']
so the output I'm attempting to get for each lawyer is like this (the 1st one example):
Hunter J Tzovarras
88 Hammond Street
Bangor, ME 04401
two issues I'm trying to figure out
How can I clean up the address so it is easier to read?
How can I save the matching lawyer name with the address so they
don't get mixed up.

Use x.get_text() instead of x.find_all
for x in address:
print(x.get_text(strip=True))
Full working code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.justia.com/lawyers/criminal-law/maine'
response = requests.get(url)
soup= BeautifulSoup(response.text,"html.parser")
n=[]
ad=[]
Lawyer_name= [x.get('title').strip() for x in soup.select('a.lawyer-avatar')]
n.extend(Lawyer_name)
#print(Lawyer_name)
address= [x.get_text(strip=True).replace('\t','').strip() for x in soup.find_all("span",class_="-address -hide-landscape-tablet")]
#print(address)
ad.extend(address)
df = pd.DataFrame(data=list(zip(n,ad)),columns=[['Lawyer_name','address']])
print(df)
Output:
Lawyer_name address
0 William T. Bly Esq 119 Main StreetKennebunk,ME 04043
1 John S. Webb 949 Main StreetSanford,ME 04073
2 William T. Bly Esq 20 Oak StreetEllsworth,ME 04605
3 Christopher Causey Esq 16 Middle StSaco,ME 04072
4 Robert Van Horn 88 Hammond StreetBangor,ME 04401
5 John S. Webb 37 Western Ave., Unit #307Kennebunk,ME 04043
6 Hunter J Tzovarras 4 Union Park RoadTopsham,ME 04086
7 Michael Stephen Bowser Jr. 241 Main StreetP.O. Box 57Saco,ME 04072
8 Richard Regan 6 City CenterSuite 301Portland,ME 04101
9 Robert Guillory Esq 75 Pearl St. Suite 400Portland,ME 04101
10 Dylan R. Boyd 160 Capitol StreetP.O. Box 79Augusta,ME 04332
11 Luke Rioux Esq 10 Stoney Brook LaneLyman,ME 04002
12 David G. Webbert 15 Columbia Street, Ste. 301Bangor,ME 04401
13 Amy Fairfield 32 Saco AveOld Orchard Beach,ME 04064
14 Mr. Richard Lyman Hartley 62 Portland Rd., Ste. 44Kennebunk,ME 04043
15 Neal L Weinstein Esq 647 U.S. Route One#203York,ME 03909
16 Albert Hansen 76 Tandberg Trail (Route 115)Windham,ME 04062
17 Russell Goldsmith Esq Two Canal PlazaPO Box 4600Portland,ME 04112
18 Miklos Pongratz Esq 18 Market Square Suite 5Houlton,ME 04730
19 Bradford Pattershall Esq 5 Island View DrCumberland Foreside,ME 04110
20 Michele D L Kenney 12 Silver StreetP.O. Box 559Waterville,ME 04903
21 John Simpson 344 Mount Hope Ave.Bangor,ME 04402
22 Mariah America Gleaton 192 Main StreetEllsworth,ME 04605
23 Wayne Foote Esq 85 Brackett StreetPortland,ME 04102
24 Will Ashe 16 Union StreetBrunswick,ME 04011
25 Peter J Cyr Esq 482 Congress Street Suite 402Portland,ME 04101
26 Jonathan Steven Handelman Esq PO Box 335York,ME 03909
27 Richard Smith Berne 36 Ossipee Trl W.Standish,ME 04084
28 Meredith G. Schmid 75 Pearl St.Suite 216Portland,ME 04101
29 Gregory LeClerc 28 Long Sands Road, Suite 5York,ME 03909
30 Cory McKenna 20 Mechanic StCamden,ME 04843
31 Thomas P. Elias P.O. Box 1049304 Hancock St. Suite 1KBangor,ME...
32 Christopher MacLean 1250 Forest Avenue, Ste 3APortland,ME 04103
33 Zachary J. Smith 415 Congress StreetSuite 202Portland,ME 04101
34 Stephen Sweatt 919 Ridge RoadP.O. BOX 119Bowdoinham,ME 04008
35 Michael Turndorf Esq 1250 Forest Avenue, Ste 3APortland,ME 04103
36 Andrews Bruce Campbell Esq 133 State StreetAugusta,ME 04330
37 Timothy Zerillo 110 Portland StreetFryeburg,ME 04037
38 Walter McKee Esq 440 Walnut Hill RdNorth Yarmouth,ME 04097
39 Shelley Carter 70 State StreetEllsworth,ME 04605

for your second query You can save them into a dictionary like this -
url = 'https://www.justia.com/lawyers/criminal-law/maine'
response = requests.get(url)
soup= BeautifulSoup(response.text,"html.parser")
# parse all names and save them in a list
lawyer_names = soup.find_all("a","url main-profile-link")
lawyer_names = [name.find(text=True).strip() for name in lawyer_names]
# parse all addresses and save them in a list
lawyer_addresses = soup.find_all("span","-address -hide-landscape-tablet")
lawyer_addresses = [re.sub('\s+',' ', address.get_text(strip=True)) for address in lawyer_addresses]
# map names with addresses
lawyer_dict = dict(zip(lawyer_names, lawyer_addresses))
print(lawyer_dict)
Output dictionary -
{'Albert Hansen': '62 Portland Rd., Ste. 44Kennebunk, ME 04043',
'Amber Lynn Tucker': '415 Congress St., Ste. 202P.O. Box 7542Portland, ME 04112',
'Amy Fairfield': '10 Stoney Brook LaneLyman, ME 04002',
'Andrews Bruce Campbell Esq': '919 Ridge RoadP.O. BOX 119Bowdoinham, ME 04008',
'Bradford Pattershall Esq': 'Two Canal PlazaPO Box 4600Portland, ME 04112',
'Christopher Causey Esq': '949 Main StreetSanford, ME 04073',
'Cory McKenna': '75 Pearl St.Suite 216Portland, ME 04101',
'David G. Webbert': '160 Capitol StreetP.O. Box 79Augusta, ME 04332',
'David Nelson Wood Esq': '120 Main StreetSuite 110Saco, ME 04072',
'Dylan R. Boyd': '6 City CenterSuite 301Portland, ME 04101',
'Gregory LeClerc': '36 Ossipee Trl W.Standish, ME 04084',
'Hunter J Tzovarras': '88 Hammond StreetBangor, ME 04401',
'John S. Webb': '16 Middle StSaco, ME 04072',
'John Simpson': '5 Island View DrCumberland Foreside, ME 04110',
'Jonathan Steven Handelman Esq': '16 Union StreetBrunswick, ME 04011',
'Luke Rioux Esq': '75 Pearl St. Suite 400Portland, ME 04101',
'Mariah America Gleaton': '12 Silver StreetP.O. Box 559Waterville, ME 04903',
'Meredith G. Schmid': 'PO Box 335York, ME 03909',
'Michael Stephen Bowser Jr.': '37 Western Ave., Unit #307Kennebunk, ME 04043',
'Michael Turndorf Esq': '415 Congress StreetSuite 202Portland, ME 04101',
'Michele D L Kenney': '18 Market Square Suite 5Houlton, ME 04730',
'Miklos Pongratz Esq': '76 Tandberg Trail (Route 115)Windham, ME 04062',
'Mr. Richard Lyman Hartley': '15 Columbia Street, Ste. 301Bangor, ME 04401',
'Neal L Weinstein Esq': '32 Saco AveOld Orchard Beach, ME 04064',
'Peter J Cyr Esq': '85 Brackett StreetPortland, ME 04102',
'Richard Regan': '4 Union Park RoadTopsham, ME 04086',
'Richard Smith Berne': '482 Congress Street Suite 402Portland, ME 04101',
'Robert Guillory Esq': '241 Main StreetP.O. Box 57Saco, ME 04072',
'Robert Van Horn': '20 Oak StreetEllsworth, ME 04605',
'Russell Goldsmith Esq': '647 U.S. Route One#203York, ME 03909',
'Shelley Carter': '110 Portland StreetFryeburg, ME 04037',
'Thaddeus Day Esq': '440 Walnut Hill RdNorth Yarmouth, ME 04097',
'Thomas P. Elias': '28 Long Sands Road, Suite 5York, ME 03909',
'Timothy Zerillo': '1250 Forest Avenue, Ste 3APortland, ME 04103',
'Todd H Crawford Jr': '1288 Roosevelt Trl, Ste #3P.O. Box 753Raymond, ME 04071',
'Walter McKee Esq': '133 State StreetAugusta, ME 04330',
'Wayne Foote Esq': '344 Mount Hope Ave.Bangor, ME 04402',
'Will Ashe': '192 Main StreetEllsworth, ME 04605',
'William T. Bly Esq': '119 Main StreetKennebunk, ME 04043',
'Zachary J. Smith': 'P.O. Box 1049304 Hancock St. Suite 1KBangor, ME 04401'}

Python Scaling loops

For each letter in the alphabet. The code should go to website.com/a and grab a table. Then it should check for a next button grab the link and makesoup and grab the next table and repeat until there is no valid next link. Then move to website.com/b(next letter in alphabet) and repeat. But I can only get as far as 2 pages for each letter. the first for loop grabs page 1 and the second grabs page 2 for each letter. I know I could write a loop for as many pages as needed but that is not scalable. How can I fix this?
from nfl_fun import make_soup
import urllib.request
import os
from string import ascii_lowercase
import requests
letter = ascii_lowercase
link = "https://www.nfl.com"
for letter in ascii_lowercase:
soup = make_soup(f"https://www.nfl.com/players/active/{letter}")
for tbody in soup.findAll("tbody"):
for tr in tbody.findAll("a"):
if tr.has_attr("href"):
print(tr.attrs["href"])
for letter in ascii_lowercase:
soup = make_soup(f"https://www.nfl.com/players/active/{letter}")
for page in soup.footer.findAll("a", {"nfl-o-table-pagination__next"}):
pagelink = ""
footer = ""
footer = page.attrs["href"]
pagelink = f"{link}{footer}"
print(footer)
getpage = requests.get(pagelink)
if getpage.status_code == 200:
next_soup = make_soup(pagelink)
for next_page in next_soup.footer.findAll("a", {"nfl-o-table-pagination__next"}):
print(getpage)
for tbody in next_soup.findAll("tbody"):
for tr in tbody.findAll("a"):
if tr.has_attr("href"):
print(tr.attrs["href"])
soup = next_soup
Thank You again,

There is an element in there that says when the "Next" button is inactive. So that'll tell you you are on the last page. So what you can do is a while loop, and just keep going to the next page, until it reaches the last page (Ie "Next" is inactive) and then tell it to stop the loop and go to the next letter:
from bs4 import BeautifulSoup
from string import ascii_lowercase
import requests
import pandas as pd
import re
letters = ascii_lowercase
link = "https://www.nfl.com"
results = pd.DataFrame()
for letter in letters:
continueToNextPage = True
after = ''
page=1
while continueToNextPage == True:
# Get the Table
url = f"https://www.nfl.com/players/active/{letter}?query={letter}&after={after}"
response = requests.get(url, 'html.parser')
soup = BeautifulSoup(response.text, 'html.parser')
temp_df = pd.read_html(response.text)[0]
results = results.append(temp_df, sort=False).reset_index(drop=True)
print ("{letter}: Page: {page}".format(letter=letter.upper(), page=page))
# Check if next page is inactive
buttons = soup.find('div', {'class':'nfl-o-table-pagination__buttons'})
regex = re.compile('.*pagination__next.*is-inactive.*')
if buttons.find('span', {'class':regex}):
continueToNextPage = False
else:
after = buttons.find('a', {'title':'Next'})['href'].split('after=')[-1]
page+=1
Output:
print (results)
Player Current Team Position Status
0 Chidobe Awuzie Dallas Cowboys CB ACT
1 Josh Avery Seattle Seahawks DT ACT
2 Genard Avery Philadelphia Eagles DE ACT
3 Anthony Averett Baltimore Ravens CB ACT
4 Lee Autry Chicago Bears DT ACT
5 Denico Autry Indianapolis Colts DT ACT
6 Tavon Austin Dallas Cowboys WR UFA
7 Blessuan Austin New York Jets CB ACT
8 Antony Auclair Tampa Bay Buccaneers TE ACT
9 Jeremiah Attaochu Denver Broncos LB ACT
10 Hunter Atkinson Atlanta Falcons OT ACT
11 John Atkins Detroit Lions DE ACT
12 Geno Atkins Cincinnati Bengals DT ACT
13 Marcell Ateman Las Vegas Raiders WR ACT
14 George Aston New York Giants RB ACT
15 Dravon Askew-Henry New York Giants DB ACT
16 Devin Asiasi New England Patriots TE ACT
17 George Asafo-Adjei New York Giants OT ACT
18 Ade Aruna Las Vegas Raiders DE ACT
19 Grayland Arnold Philadelphia Eagles SAF ACT
20 Dan Arnold Arizona Cardinals TE ACT
21 Damon Arnette Las Vegas Raiders CB UDF
22 Ray-Ray Armstrong Dallas Cowboys LB UFA
23 Ka'John Armstrong Denver Broncos OT ACT
24 Dorance Armstrong Dallas Cowboys DE ACT
25 Cornell Armstrong Houston Texans CB ACT
26 Terron Armstead New Orleans Saints OT ACT
27 Ryquell Armstead Jacksonville Jaguars RB ACT
28 Arik Armstead San Francisco 49ers DE ACT
29 Alex Armah Carolina Panthers FB ACT
... ... ... ...
3180 Clive Walford Miami Dolphins TE UFA
3181 Cameron Wake Tennessee Titans DE UFA
3182 Corliss Waitman Pittsburgh Steelers P ACT
3183 Rick Wagner Green Bay Packers OT ACT
3184 Bobby Wagner Seattle Seahawks MLB ACT
3185 Ahmad Wagner Chicago Bears WR ACT
3186 Colby Wadman Denver Broncos P ACT
3187 Christian Wade Buffalo Bills RB ACT
3188 LaAdrian Waddle Buffalo Bills OT UFA
3189 Oshane Ximines New York Giants LB ACT
3190 Trevon Young Cleveland Browns DE ACT
3191 Sam Young Las Vegas Raiders OT ACT
3192 Kenny Young Los Angeles Rams ILB ACT
3193 Chase Young Washington Redskins DE UDF
3194 Bryson Young Atlanta Falcons DE ACT
3195 Isaac Yiadom Denver Broncos CB ACT
3196 T.J. Yeldon Buffalo Bills RB ACT
3197 Deon Yelder Kansas City Chiefs TE ACT
3198 Rock Ya-Sin Indianapolis Colts CB ACT
3199 Eddie Yarbrough Minnesota Vikings DE ACT
3200 Marshal Yanda Baltimore Ravens OG ACT
3201 Tavon Young Baltimore Ravens CB ACT
3202 Brandon Zylstra Carolina Panthers WR ACT
3203 Jabari Zuniga New York Jets DE UDF
3204 Greg Zuerlein Dallas Cowboys K ACT
3205 Isaiah Zuber New England Patriots WR ACT
3206 Justin Zimmer Cleveland Browns DT ACT
3207 Anthony Zettel Minnesota Vikings DE ACT
3208 Kevin Zeitler New York Giants OG ACT
3209 Olamide Zaccheaus Atlanta Falcons WR ACT
[3210 rows x 4 columns]

Scraping table by beautiful soup 4

Hello I am trying to scrape this table in this url: https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc
There are 50 rows in this table.. however if you click Show more (just below the table), more of the rows appear. My beautiful soup code works fine, But the problem is it retrieves only the first 50 rows. It doesnot retrieve rows that appear after clicking the Show more. How can i get all the rows including first 50 and also those appears after clicking Show more?
Here is the code:
#Request to get the target wiki page
rqst = requests.get("https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc")
soup = BeautifulSoup(rqst.content,'lxml')
table = soup.find_all('table')
NFL_player_stats = pd.read_html(str(table))
players = NFL_player_stats[0]
players.shape
out[0]: (50,1)

Using DevTools in Firefox I see it gets data (in JSON format) for next page from
https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page=2
If you change value in page= then you can get other pages.
import requests
url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='
for page in range(1, 4):
print('\n---', page, '---\n')
r = requests.get(url + str(page))
data = r.json()
#print(data.keys())
for item in data['athletes']:
print(item['athlete']['displayName'])
Result:
--- 1 ---
Ezekiel Elliott
Saquon Barkley
Todd Gurley II
Joe Mixon
Chris Carson
Christian McCaffrey
Derrick Henry
Adrian Peterson
Phillip Lindsay
Nick Chubb
Lamar Miller
James Conner
David Johnson
Jordan Howard
Sony Michel
Marlon Mack
Melvin Gordon
Alvin Kamara
Peyton Barber
Kareem Hunt
Matt Breida
Tevin Coleman
Aaron Jones
Doug Martin
Frank Gore
Gus Edwards
Lamar Jackson
Isaiah Crowell
Mark Ingram II
Kerryon Johnson
Josh Allen
Dalvin Cook
Latavius Murray
Carlos Hyde
Austin Ekeler
Deshaun Watson
Kenyan Drake
Royce Freeman
Dion Lewis
LeSean McCoy
Mike Davis
Josh Adams
Alfred Blue
Cam Newton
Jamaal Williams
Tarik Cohen
Leonard Fournette
Alfred Morris
James White
Mitchell Trubisky
--- 2 ---
Rashaad Penny
LeGarrette Blount
T.J. Yeldon
Alex Collins
C.J. Anderson
Chris Ivory
Marshawn Lynch
Russell Wilson
Blake Bortles
Wendell Smallwood
Marcus Mariota
Bilal Powell
Jordan Wilkins
Kenneth Dixon
Ito Smith
Nyheim Hines
Dak Prescott
Jameis Winston
Elijah McGuire
Patrick Mahomes
Aaron Rodgers
Jeff Wilson Jr.
Zach Zenner
Raheem Mostert
Corey Clement
Jalen Richard
Damien Williams
Jaylen Samuels
Marcus Murphy
Spencer Ware
Cordarrelle Patterson
Malcolm Brown
Giovani Bernard
Chase Edmonds
Justin Jackson
Duke Johnson
Taysom Hill
Kalen Ballage
Ty Montgomery
Rex Burkhead
Jay Ajayi
Devontae Booker
Chris Thompson
Wayne Gallman
DJ Moore
Theo Riddick
Alex Smith
Robert Woods
Brian Hill
Dwayne Washington
--- 3 ---
Ryan Fitzpatrick
Tyreek Hill
Andrew Luck
Ryan Tannehill
Josh Rosen
Sam Darnold
Baker Mayfield
Jeff Driskel
Rod Smith
Matt Ryan
Tyrod Taylor
Kirk Cousins
Cody Kessler
Darren Sproles
Josh Johnson
DeAndre Washington
Trenton Cannon
Javorius Allen
Jared Goff
Julian Edelman
Jacquizz Rodgers
Kapri Bibbs
Andy Dalton
Ben Roethlisberger
Dede Westbrook
Case Keenum
Carson Wentz
Brandon Bolden
Curtis Samuel
Stevan Ridley
Keith Ford
Keenan Allen
John Kelly
Kenjon Barner
Matthew Stafford
Tyler Lockett
C.J. Beathard
Cameron Artis-Payne
Devonta Freeman
Brandin Cooks
Isaiah McKenzie
Colt McCoy
Stefon Diggs
Taylor Gabriel
Jarvis Landry
Tavon Austin
Corey Davis
Emmanuel Sanders
Sammy Watkins
Nathan Peterman
EDIT: get all data as DataFrame
import requests
import pandas as pd
url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='
df = pd.DataFrame() # emtpy DF at start
for page in range(1, 4):
print('page:', page)
r = requests.get(url + str(page))
data = r.json()
#print(data.keys())
for item in data['athletes']:
player_name = item['athlete']['displayName']
position = item['athlete']['position']['abbreviation']
gp = item['categories'][0]['totals'][0]
other_values = item['categories'][2]['totals']
row = [player_name, position, gp] + other_values
df = df.append( [row] ) # append one row
df.columns = ['NAME', 'POS', 'GP', 'ATT', 'YDS', 'AVG', 'LNG', 'BIG', 'TD', 'YDS/G', 'FUM', 'LST', 'FD']
print(len(df)) # 150
print(df.head(20))

Splitting a string into a dataframe pandas

I have a string the exact same as below, My goal is so split it into a dataframe but I am finding trouble getting it to work. I have tried search on stack but have got nowhere.
'Position Players Average Form\nGoalkeeper Manuel Neuer 4.17017132535\n Defender Diego Godin 4.14973163459\n Defender Giorgio Chiellini 4.10115207373\n Defender Thiago Silva 3.93318274318\n Defender Andrea Barzagli 3.85132973289\nMidfielder Arjen Robben 4.80556193806\nMidfielder Alexander Meier 4.51037598508\nMidfielder Franck Ribery 4.48063714064\nMidfielder David Silva 3.76028050109\n Forward Cristiano Ronaldo 7.87909462636\n Forward Zlatan Ibrahimovic 6.85401665065'
Is there a way to turn this into a dataframe, in a reproducible way so I could do it with other strings?
My goal dataframe would look like as follows:
Position name Average
Goalkeeper Manuel 4.17017132535
Defender Diego 4.14973163459
Defender Giorgio 4.10115207373
Defender Thiago 3.93318274318
Defender Andrea 3.85132973289
Midfielder Arjen 4.80556193806
Midfielder Alexander 4.51037598508
Midfielder Franck 4.48063714064
Midfielder David 3.76028050109
Forward Cristiano 7.87909462636
Forward Hnery 6.85401665065
I am new to pandas so any help would be greatly appreciated

This is one way.
import pandas as pd
mystr = 'Position Players Average Form\nGoalkeeper Manuel Neuer 4.17017132535\n Defender Diego Godin 4.14973163459\n Defender Giorgio Chiellini 4.10115207373\n Defender Thiago Silva 3.93318274318\n Defender Andrea Barzagli 3.85132973289\nMidfielder Arjen Robben 4.80556193806\nMidfielder Alexander Meier 4.51037598508\nMidfielder Franck Ribery 4.48063714064\nMidfielder David Silva 3.76028050109\n Forward Cristiano Ronaldo 7.87909462636\n Forward Zlatan Ibrahimovic 6.85401665065'
lst = mystr.split()
data = [lst[pos:pos+4] for pos in range(0, len(lst), 4)]
df = pd.DataFrame(data[1:], columns=data[0])
print(df)
# Position Players Average Form
# 0 Goalkeeper Manuel Neuer 4.17017132535
# 1 Defender Diego Godin 4.14973163459
# 2 Defender Giorgio Chiellini 4.10115207373
# 3 Defender Thiago Silva 3.93318274318
# 4 Defender Andrea Barzagli 3.85132973289
# 5 Midfielder Arjen Robben 4.80556193806
# 6 Midfielder Alexander Meier 4.51037598508
# 7 Midfielder Franck Ribery 4.48063714064
# 8 Midfielder David Silva 3.76028050109
# 9 Forward Cristiano Ronaldo 7.87909462636
# 10 Forward Zlatan Ibrahimovic 6.85401665065
This method will not be perfect in these instances:
Whitespace in column names, as above. In this case, you will need to redefine column names.
Whitespace in player names. This does not appear to be a problem with the data provided.

Here is how you will work your way around that.
import pandas as pd
from io import StringIO
data = StringIO('Position Players Average Form\nGoalkeeper Manuel Neuer 4.17017132535\n Defender Diego Godin 4.14973163459\n Defender Giorgio Chiellini 4.10115207373\n Defender Thiago Silva 3.93318274318\n Defender Andrea Barzagli 3.85132973289\nMidfielder Arjen Robben 4.80556193806\nMidfielder Alexander Meier 4.51037598508\nMidfielder Franck Ribery 4.48063714064\nMidfielder David Silva 3.76028050109\n Forward Cristiano Ronaldo 7.87909462636\n Forward Zlatan Ibrahimovic 6.85401665065')
df = pd.read_csv(data, sep="\n")
print(df)
Output :
Position Players Average Form
0 Goalkeeper Manuel Neuer 4.17017132535
1 Defender Diego Godin 4.14973163459
2 Defender Giorgio Chiellini 4.10115207373
3 Defender Thiago Silva 3.93318274318
4 Defender Andrea Barzagli 3.85132973289
5 Midfielder Arjen Robben 4.80556193806
6 Midfielder Alexander Meier 4.51037598508
7 Midfielder Franck Ribery 4.48063714064
8 Midfielder David Silva 3.76028050109
9 Forward Cristiano Ronaldo 7.87909462636
10 Forward Zlatan Ibrahimovic 6.85401665065

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to pull only certain fields with BeautifulSoup - python

Related

NetworkX graph with some specifications based on two dataframes

cleaning up web scrape data and combining together?

Python Scaling loops

Scraping table by beautiful soup 4

Splitting a string into a dataframe pandas

Categories

Resources