Scraping table by beautiful soup 4

Hello, I am trying to scrape the table at this URL: https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc
The table shows 50 rows, but clicking "Show more" (just below the table) loads additional rows. My Beautiful Soup code works, but it retrieves only the first 50 rows and none of the rows that appear after clicking "Show more". How can I get all the rows, both the first 50 and those that appear after clicking "Show more"?
Here is the code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Request the target ESPN stats page
rqst = requests.get("https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc")
soup = BeautifulSoup(rqst.content, 'lxml')
table = soup.find_all('table')
NFL_player_stats = pd.read_html(str(table))
players = NFL_player_stats[0]
players.shape
Out[0]: (50, 1)

Using DevTools in Firefox, I can see that the page fetches the data for the next page (in JSON format) from
https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page=2
If you change the value of page=, you get the other pages.
import requests

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='

for page in range(1, 4):
    print('\n---', page, '---\n')
    r = requests.get(url + str(page))
    data = r.json()
    #print(data.keys())
    for item in data['athletes']:
        print(item['athlete']['displayName'])
Result:
--- 1 ---
Ezekiel Elliott
Saquon Barkley
Todd Gurley II
Joe Mixon
Chris Carson
Christian McCaffrey
Derrick Henry
Adrian Peterson
Phillip Lindsay
Nick Chubb
Lamar Miller
James Conner
David Johnson
Jordan Howard
Sony Michel
Marlon Mack
Melvin Gordon
Alvin Kamara
Peyton Barber
Kareem Hunt
Matt Breida
Tevin Coleman
Aaron Jones
Doug Martin
Frank Gore
Gus Edwards
Lamar Jackson
Isaiah Crowell
Mark Ingram II
Kerryon Johnson
Josh Allen
Dalvin Cook
Latavius Murray
Carlos Hyde
Austin Ekeler
Deshaun Watson
Kenyan Drake
Royce Freeman
Dion Lewis
LeSean McCoy
Mike Davis
Josh Adams
Alfred Blue
Cam Newton
Jamaal Williams
Tarik Cohen
Leonard Fournette
Alfred Morris
James White
Mitchell Trubisky
--- 2 ---
Rashaad Penny
LeGarrette Blount
T.J. Yeldon
Alex Collins
C.J. Anderson
Chris Ivory
Marshawn Lynch
Russell Wilson
Blake Bortles
Wendell Smallwood
Marcus Mariota
Bilal Powell
Jordan Wilkins
Kenneth Dixon
Ito Smith
Nyheim Hines
Dak Prescott
Jameis Winston
Elijah McGuire
Patrick Mahomes
Aaron Rodgers
Jeff Wilson Jr.
Zach Zenner
Raheem Mostert
Corey Clement
Jalen Richard
Damien Williams
Jaylen Samuels
Marcus Murphy
Spencer Ware
Cordarrelle Patterson
Malcolm Brown
Giovani Bernard
Chase Edmonds
Justin Jackson
Duke Johnson
Taysom Hill
Kalen Ballage
Ty Montgomery
Rex Burkhead
Jay Ajayi
Devontae Booker
Chris Thompson
Wayne Gallman
DJ Moore
Theo Riddick
Alex Smith
Robert Woods
Brian Hill
Dwayne Washington
--- 3 ---
Ryan Fitzpatrick
Tyreek Hill
Andrew Luck
Ryan Tannehill
Josh Rosen
Sam Darnold
Baker Mayfield
Jeff Driskel
Rod Smith
Matt Ryan
Tyrod Taylor
Kirk Cousins
Cody Kessler
Darren Sproles
Josh Johnson
DeAndre Washington
Trenton Cannon
Javorius Allen
Jared Goff
Julian Edelman
Jacquizz Rodgers
Kapri Bibbs
Andy Dalton
Ben Roethlisberger
Dede Westbrook
Case Keenum
Carson Wentz
Brandon Bolden
Curtis Samuel
Stevan Ridley
Keith Ford
Keenan Allen
John Kelly
Kenjon Barner
Matthew Stafford
Tyler Lockett
C.J. Beathard
Cameron Artis-Payne
Devonta Freeman
Brandin Cooks
Isaiah McKenzie
Colt McCoy
Stefon Diggs
Taylor Gabriel
Jarvis Landry
Tavon Austin
Corey Davis
Emmanuel Sanders
Sammy Watkins
Nathan Peterman
EDIT: get all the data as a DataFrame
import requests
import pandas as pd

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='
rows = []  # one list of values per player

for page in range(1, 4):
    print('page:', page)
    r = requests.get(url + str(page))
    data = r.json()
    #print(data.keys())
    for item in data['athletes']:
        player_name = item['athlete']['displayName']
        position = item['athlete']['position']['abbreviation']
        gp = item['categories'][0]['totals'][0]
        other_values = item['categories'][2]['totals']
        row = [player_name, position, gp] + other_values
        rows.append(row)  # collect one row

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
df = pd.DataFrame(rows, columns=['NAME', 'POS', 'GP', 'ATT', 'YDS', 'AVG', 'LNG', 'BIG', 'TD', 'YDS/G', 'FUM', 'LST', 'FD'])
print(len(df)) # 150
print(df.head(20))
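
The loops above hard-code three pages. As a rough sketch, the page count can be left open by requesting pages until one comes back without athletes. This assumes the endpoint returns an empty or missing 'athletes' list past the last page, which I have not verified. The names collect_athletes, fake_fetch and FAKE_PAGES are made up for the illustration, and the fake pages stand in for real responses so the snippet runs offline.

```python
from typing import Callable, Dict, List


def collect_athletes(fetch_page: Callable[[int], Dict], max_pages: int = 100) -> List[Dict]:
    """Request pages 1, 2, 3, ... and stop at the first page without athletes."""
    athletes: List[Dict] = []
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        batch = data.get('athletes') or []
        if not batch:
            # Assumption: an empty or missing list means there are no more pages
            break
        athletes.extend(batch)
    return athletes


# Stand-in for requests.get(url + str(page)).json(); the page contents are made up
FAKE_PAGES = {
    1: {'athletes': [{'athlete': {'displayName': 'Ezekiel Elliott'}}]},
    2: {'athletes': [{'athlete': {'displayName': 'Rashaad Penny'}}]},
    3: {'athletes': []},
}


def fake_fetch(page: int) -> Dict:
    return FAKE_PAGES.get(page, {'athletes': []})


names = [a['athlete']['displayName'] for a in collect_athletes(fake_fetch)]
print(names)
```

With the real API, the fetcher would be something like lambda page: requests.get(url + str(page)).json(). The max_pages cap is only a guard against looping forever if the assumption about empty pages turns out to be wrong.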

Related

cleaning up web scrape data and combining together?

The website URL is https://www.justia.com/lawyers/criminal-law/maine
I want to scrape only each lawyer's name and where their office is.
import requests
from bs4 import BeautifulSoup

url = 'https://www.justia.com/lawyers/criminal-law/maine'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
Lawyer_name = soup.find_all("a", "url main-profile-link")
for i in Lawyer_name:
    print(i.find(text=True))
address = soup.find_all("span", "-address -hide-landscape-tablet")
for x in address:
    print(x.find_all(text=True))
The names print out just fine, but each address prints with extra whitespace that I want to remove:
['\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t88 Hammond Street', '\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tBangor,\t\t\t\t\tME 04401\t\t\t\t\t\t ']
So the output I'm trying to get for each lawyer looks like this (using the first one as an example):
Hunter J Tzovarras
88 Hammond Street
Bangor, ME 04401
There are two issues I'm trying to figure out:
How can I clean up the address so it is easier to read?
How can I save each lawyer's name with the matching address so they don't get mixed up?

Use x.get_text() instead of x.find_all():
for x in address:
    print(x.get_text(strip=True))
Full working code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.justia.com/lawyers/criminal-law/maine'
response = requests.get(url)
soup= BeautifulSoup(response.text,"html.parser")
n=[]
ad=[]
Lawyer_name= [x.get('title').strip() for x in soup.select('a.lawyer-avatar')]
n.extend(Lawyer_name)
#print(Lawyer_name)
address= [x.get_text(strip=True).replace('\t','').strip() for x in soup.find_all("span",class_="-address -hide-landscape-tablet")]
#print(address)
ad.extend(address)
df = pd.DataFrame(data=list(zip(n, ad)), columns=['Lawyer_name', 'address'])
print(df)
Output:
Lawyer_name address
0 William T. Bly Esq 119 Main StreetKennebunk,ME 04043
1 John S. Webb 949 Main StreetSanford,ME 04073
2 William T. Bly Esq 20 Oak StreetEllsworth,ME 04605
3 Christopher Causey Esq 16 Middle StSaco,ME 04072
4 Robert Van Horn 88 Hammond StreetBangor,ME 04401
5 John S. Webb 37 Western Ave., Unit #307Kennebunk,ME 04043
6 Hunter J Tzovarras 4 Union Park RoadTopsham,ME 04086
7 Michael Stephen Bowser Jr. 241 Main StreetP.O. Box 57Saco,ME 04072
8 Richard Regan 6 City CenterSuite 301Portland,ME 04101
9 Robert Guillory Esq 75 Pearl St. Suite 400Portland,ME 04101
10 Dylan R. Boyd 160 Capitol StreetP.O. Box 79Augusta,ME 04332
11 Luke Rioux Esq 10 Stoney Brook LaneLyman,ME 04002
12 David G. Webbert 15 Columbia Street, Ste. 301Bangor,ME 04401
13 Amy Fairfield 32 Saco AveOld Orchard Beach,ME 04064
14 Mr. Richard Lyman Hartley 62 Portland Rd., Ste. 44Kennebunk,ME 04043
15 Neal L Weinstein Esq 647 U.S. Route One#203York,ME 03909
16 Albert Hansen 76 Tandberg Trail (Route 115)Windham,ME 04062
17 Russell Goldsmith Esq Two Canal PlazaPO Box 4600Portland,ME 04112
18 Miklos Pongratz Esq 18 Market Square Suite 5Houlton,ME 04730
19 Bradford Pattershall Esq 5 Island View DrCumberland Foreside,ME 04110
20 Michele D L Kenney 12 Silver StreetP.O. Box 559Waterville,ME 04903
21 John Simpson 344 Mount Hope Ave.Bangor,ME 04402
22 Mariah America Gleaton 192 Main StreetEllsworth,ME 04605
23 Wayne Foote Esq 85 Brackett StreetPortland,ME 04102
24 Will Ashe 16 Union StreetBrunswick,ME 04011
25 Peter J Cyr Esq 482 Congress Street Suite 402Portland,ME 04101
26 Jonathan Steven Handelman Esq PO Box 335York,ME 03909
27 Richard Smith Berne 36 Ossipee Trl W.Standish,ME 04084
28 Meredith G. Schmid 75 Pearl St.Suite 216Portland,ME 04101
29 Gregory LeClerc 28 Long Sands Road, Suite 5York,ME 03909
30 Cory McKenna 20 Mechanic StCamden,ME 04843
31 Thomas P. Elias P.O. Box 1049304 Hancock St. Suite 1KBangor,ME...
32 Christopher MacLean 1250 Forest Avenue, Ste 3APortland,ME 04103
33 Zachary J. Smith 415 Congress StreetSuite 202Portland,ME 04101
34 Stephen Sweatt 919 Ridge RoadP.O. BOX 119Bowdoinham,ME 04008
35 Michael Turndorf Esq 1250 Forest Avenue, Ste 3APortland,ME 04103
36 Andrews Bruce Campbell Esq 133 State StreetAugusta,ME 04330
37 Timothy Zerillo 110 Portland StreetFryeburg,ME 04037
38 Walter McKee Esq 440 Walnut Hill RdNorth Yarmouth,ME 04097
39 Shelley Carter 70 State StreetEllsworth,ME 04605
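
In the output above the street and the city run together ("88 Hammond StreetBangor,ME 04401") because get_text(strip=True) joins the text nodes without a separator. As a small sketch of an alternative, each text node can be cleaned separately and kept as its own line. The function name clean_address is made up here, and the input is the raw list shown in the question.

```python
def clean_address(parts):
    """Collapse runs of whitespace inside each text node and drop empty nodes."""
    lines = [' '.join(part.split()) for part in parts]
    return [line for line in lines if line]


# Raw text nodes as printed in the question by x.find_all(text=True)
raw = [
    '\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t88 Hammond Street',
    '\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tBangor,\t\t\t\t\tME 04401\t\t\t\t\t\t ',
]

address_lines = clean_address(raw)
print('\n'.join(address_lines))
# 88 Hammond Street
# Bangor, ME 04401
```

In the scraping loop this would be called as clean_address(x.find_all(text=True)) for each address span.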

For your second question, you can save them in a dictionary like this:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.justia.com/lawyers/criminal-law/maine'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# parse all names and save them in a list
lawyer_names = soup.find_all("a", "url main-profile-link")
lawyer_names = [name.find(text=True).strip() for name in lawyer_names]
# parse all addresses and save them in a list
lawyer_addresses = soup.find_all("span", "-address -hide-landscape-tablet")
lawyer_addresses = [re.sub(r'\s+', ' ', address.get_text(strip=True)) for address in lawyer_addresses]
# map names to addresses
lawyer_dict = dict(zip(lawyer_names, lawyer_addresses))
print(lawyer_dict)
Output dictionary -
{'Albert Hansen': '62 Portland Rd., Ste. 44Kennebunk, ME 04043',
'Amber Lynn Tucker': '415 Congress St., Ste. 202P.O. Box 7542Portland, ME 04112',
'Amy Fairfield': '10 Stoney Brook LaneLyman, ME 04002',
'Andrews Bruce Campbell Esq': '919 Ridge RoadP.O. BOX 119Bowdoinham, ME 04008',
'Bradford Pattershall Esq': 'Two Canal PlazaPO Box 4600Portland, ME 04112',
'Christopher Causey Esq': '949 Main StreetSanford, ME 04073',
'Cory McKenna': '75 Pearl St.Suite 216Portland, ME 04101',
'David G. Webbert': '160 Capitol StreetP.O. Box 79Augusta, ME 04332',
'David Nelson Wood Esq': '120 Main StreetSuite 110Saco, ME 04072',
'Dylan R. Boyd': '6 City CenterSuite 301Portland, ME 04101',
'Gregory LeClerc': '36 Ossipee Trl W.Standish, ME 04084',
'Hunter J Tzovarras': '88 Hammond StreetBangor, ME 04401',
'John S. Webb': '16 Middle StSaco, ME 04072',
'John Simpson': '5 Island View DrCumberland Foreside, ME 04110',
'Jonathan Steven Handelman Esq': '16 Union StreetBrunswick, ME 04011',
'Luke Rioux Esq': '75 Pearl St. Suite 400Portland, ME 04101',
'Mariah America Gleaton': '12 Silver StreetP.O. Box 559Waterville, ME 04903',
'Meredith G. Schmid': 'PO Box 335York, ME 03909',
'Michael Stephen Bowser Jr.': '37 Western Ave., Unit #307Kennebunk, ME 04043',
'Michael Turndorf Esq': '415 Congress StreetSuite 202Portland, ME 04101',
'Michele D L Kenney': '18 Market Square Suite 5Houlton, ME 04730',
'Miklos Pongratz Esq': '76 Tandberg Trail (Route 115)Windham, ME 04062',
'Mr. Richard Lyman Hartley': '15 Columbia Street, Ste. 301Bangor, ME 04401',
'Neal L Weinstein Esq': '32 Saco AveOld Orchard Beach, ME 04064',
'Peter J Cyr Esq': '85 Brackett StreetPortland, ME 04102',
'Richard Regan': '4 Union Park RoadTopsham, ME 04086',
'Richard Smith Berne': '482 Congress Street Suite 402Portland, ME 04101',
'Robert Guillory Esq': '241 Main StreetP.O. Box 57Saco, ME 04072',
'Robert Van Horn': '20 Oak StreetEllsworth, ME 04605',
'Russell Goldsmith Esq': '647 U.S. Route One#203York, ME 03909',
'Shelley Carter': '110 Portland StreetFryeburg, ME 04037',
'Thaddeus Day Esq': '440 Walnut Hill RdNorth Yarmouth, ME 04097',
'Thomas P. Elias': '28 Long Sands Road, Suite 5York, ME 03909',
'Timothy Zerillo': '1250 Forest Avenue, Ste 3APortland, ME 04103',
'Todd H Crawford Jr': '1288 Roosevelt Trl, Ste #3P.O. Box 753Raymond, ME 04071',
'Walter McKee Esq': '133 State StreetAugusta, ME 04330',
'Wayne Foote Esq': '344 Mount Hope Ave.Bangor, ME 04402',
'Will Ashe': '192 Main StreetEllsworth, ME 04605',
'William T. Bly Esq': '119 Main StreetKennebunk, ME 04043',
'Zachary J. Smith': 'P.O. Box 1049304 Hancock St. Suite 1KBangor, ME 04401'}
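
One caveat with a dictionary keyed by name: a lawyer who is listed more than once keeps only the last address. The DataFrame output earlier on this page lists William T. Bly Esq and John S. Webb twice each, so this can happen with this data. A minimal sketch that keeps every pair as its own record follows; the three sample rows are taken from that output, with separators added for readability.

```python
names = ['William T. Bly Esq', 'John S. Webb', 'William T. Bly Esq']
addresses = [
    '119 Main Street, Kennebunk, ME 04043',
    '949 Main Street, Sanford, ME 04073',
    '20 Oak Street, Ellsworth, ME 04605',
]

# One record per listing, so repeated names are preserved
records = [{'name': n, 'address': a} for n, a in zip(names, addresses)]

# For comparison: the dict keeps one entry per distinct name
as_dict = dict(zip(names, addresses))

print(len(records))  # 3
print(len(as_dict))  # 2
print(as_dict['William T. Bly Esq'])  # 20 Oak Street, Ellsworth, ME 04605
```

A list of records like this can also be passed straight to pd.DataFrame(records) if a table is wanted.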

webscraping stars from imdb page using beautifulsoup

I am trying to get the names of the stars from an IMDb page. Below is my code:
from requests import get
url = 'https://www.imdb.com/search/title/?title_type=tv_movie,tv_series&user_rating=6.0,10.0&adult=include&ref_=adv_prv'
response = get(url)
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
movie_containers = html_soup.find_all('div', 'lister-item mode-advanced')
first_movie = movie_containers[0]
first_stars = first_movie.select('a[href*="name"]')
first_stars
I got the following output
[Bob Odenkirk,
Rhea Seehorn,
Jonathan Banks,
Michael Mando]
I am trying to get only the names of the stars, but first_stars.text gives the following error:
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_3104\1297903165.py in <module>
1 first_stars = first_movie.select('a[href*="name"]')
----> 2 first_stars.text
~\Anaconda3\lib\site-packages\bs4\element.py in __getattr__(self, key)
2288 """Raise a helpful exception to explain a common code fix."""
2289 raise AttributeError(
-> 2290 "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
2291 )
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
When I tried
first_stars = first_movie.find('a[href*="name"]')
first_stars.text
I also got the following error:
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_3104\2359725208.py in <module>
1 first_stars = first_movie.find('a[href*="name"]')
----> 2 first_stars.text
AttributeError: 'NoneType' object has no attribute 'text'
Any idea how I can extract only the names of the stars?

If you need the star names for each title as one string, without separating the individual stars, this might help (soup here is the parsed page, called html_soup in the question):
block = soup.find_all("div", attrs={"class":"lister-item mode-advanced"})
starList = list()
for star in block:
    starList.append(star.find("p", attrs={"class":""}).text.replace("Stars:", "").replace("\n", "").strip())
print(starList)
It prints:
['Bob Odenkirk, Rhea Seehorn, Jonathan Banks, Michael Mando', 'Jason Bateman, Laura Linney, Sofia Hublitz, Skylar Gaertner', 'Gia Sandhu, Anson Mount, Ethan Peck, Jess Bush', 'Josh Brolin, Imogen Poots, Lili Taylor, Tom Pelphrey', 'Pablo Schreiber, Shabana Azmi, Natasha Culzac, Olive Gray', 'Titus Welliver, Mimi Rogers, Madison Lintz, Stephen A. Chang', 'Rachel Griffiths, Sophia Ali, Shannon Berry, Jenna Clause', 'Adam Scott, Zach Cherry, Britt Lower, Tramell Tillman', 'Milo Ventimiglia, Mandy Moore, Sterling K. Brown, Chrissy Metz', 'Emilia Clarke, Peter Dinklage, Kit Harington, Lena Headey', 'Bryan Cranston, Aaron Paul, Anna Gunn, Betsy Brandt', 'Millie Bobby Brown, Finn Wolfhard, Winona Ryder, David Harbour', 'Joe Locke, Kit Connor, Yasmin Finney, William Gao', 'John C. Reilly, Quincy Isaiah, Jason Clarke, Gaby Hoffmann', 'Bill Hader, Stephen Root, Sarah Goldberg, Anthony Carrigan', 'Kaley Cuoco, Zosia Mamet, Griffin Matthews, Rosie Perez', 'Caitríona Balfe, Sam Heughan, Sophie Skelton, Richard Rankin', 'Evan Rachel Wood, Jeffrey Wright, Ed Harris, Thandiwe Newton', 'Patrick Stewart, Alison Pill, Michelle Hurd, Santiago Cabrera', 'Nicola Coughlan, Jonathan Bailey, Ruth Gemmell, Florence Hunt', 'Andrew Lincoln, Norman Reedus, Melissa McBride, Lauren Cohan', 'Cillian Murphy, Paul Anderson, Sophie Rundle, Helen McCrory', 'Manuel Garcia-Rulfo, Becki Newton, Neve Campbell, Christopher Gorham', 'Jane Fonda, Lily Tomlin, Sam Waterston, Martin Sheen', 'Ellen Pompeo, Chandra Wilson, James Pickens Jr., Justin Chambers', 'Alexander Dreymon, Eliza Butterworth, Arnas Fedaravicius, Mark Rowley', 'Luke Grimes, Kelly Reilly, Wes Bentley, Cole Hauser', 'Elisabeth Moss, Wagner Moura, Phillipa Soo, Chris Chalk', "Scott Whyte, Nolan North, Steven Pacey, Emily O'Brien", 'Steve Carell, Jenna Fischer, John Krasinski, Rainn Wilson', 'Jodie Whittaker, Peter Capaldi, Pearl Mackie, Matt Smith', 'Ansel
Elgort, Ken Watanabe, Rachel Keller, Shô Kasamatsu', 'James Spader, Megan Boone, Diego Klattenhoff, Ryan Eggold', 'Mark Harmon, David McCallum, Sean Murray, Pauley Perrette', 'Zendaya, Hunter Schafer, Angus Cloud, Jacob Elordi', 'Niv Sultan, Shaun Toub, Shervin Alenabi, Arash Marandi', 'Asa Butterfield, Gillian Anderson, Emma Mackey, Ncuti Gatwa', 'Jack Lowden, Kristin Scott Thomas, Gary Oldman, Chris Reilly', 'Karl Urban, Jack Quaid, Antony Starr, Erin Moriarty', 'Mariska Hargitay, Christopher Meloni, Ice-T, Dann Florek', "Nathan Fillion, Alyssa Diaz, Richard T. Jones, Melissa O'Neil", "Saoirse-Monica Jackson, Louisa Harland, Tara Lynne O'Neill, Kathy Kiera Clarke", 'Donald Glover, Brian Tyree Henry, LaKeith Stanfield, Zazie Beetz', 'Jennifer Aniston, Courteney Cox, Lisa Kudrow, Matt LeBlanc', 'Jared Padalecki, Jensen Ackles, Jim Beaver, Misha Collins', 'Julia Roberts, Sean Penn, Dan Stevens, Betty Gilpin', 'James Gandolfini, Lorraine Bracco, Edie Falco, Michael Imperioli', 'Natasha Lyonne, Charlie Barnett, Greta Lee, Elizabeth Ashley', 'Jean Smart, Hannah Einbinder, Carl Clemons-Hopkins, Rose Abdoo', 'Katheryn Winnick, Gustaf Skarsgård, Alexander Ludwig, Georgia Hirst']
Or, if you need both the title and its stars:
block = soup.find_all("div", attrs={"class":"lister-item mode-advanced"})
starList = list()
for star in block:
    movieDict = {
        "moviename": star.find("h3", attrs={"class":"lister-item-header"}).text.split("\n")[2],
        "stars": star.find("p", attrs={"class":""}).text.replace("Stars:", "").replace("\n", "").strip()
    }
    starList.append(movieDict)
print(starList)
This will print:
[{'moviename': 'Better Call Saul', 'stars': 'Bob Odenkirk, Rhea Seehorn, Jonathan Banks, Michael
Mando'}, {'moviename': 'Ozark', 'stars': 'Jason Bateman, Laura Linney, Sofia Hublitz, Skylar Gaertner'}, {'moviename': 'Star Trek: Strange New Worlds', 'stars': 'Gia Sandhu, Anson Mount, Ethan Peck, Jess Bush'}, {'moviename': 'Outer Range', 'stars': 'Josh Brolin, Imogen Poots, Lili Taylor,
Tom Pelphrey'}, {'moviename': 'Halo', 'stars': 'Pablo Schreiber, Shabana Azmi, Natasha Culzac, Olive Gray'}, {'moviename': 'Bosch: Legacy', 'stars': 'Titus Welliver, Mimi Rogers, Madison Lintz,
Stephen A. Chang'}, {'moviename': 'The Wilds', 'stars': 'Rachel Griffiths, Sophia Ali, Shannon Berry, Jenna Clause'}, {'moviename': 'Severance', 'stars': 'Adam Scott, Zach Cherry, Britt Lower, Tramell Tillman'}, {'moviename': 'This Is Us', 'stars': 'Milo Ventimiglia, Mandy Moore, Sterling K. Brown, Chrissy Metz'}, {'moviename': 'Game of Thrones', 'stars': 'Emilia Clarke, Peter Dinklage, Kit Harington, Lena Headey'}, {'moviename': 'Breaking Bad', 'stars': 'Bryan Cranston, Aaron Paul, Anna Gunn, Betsy Brandt'}, {'moviename': 'Stranger Things', 'stars': 'Millie Bobby Brown, Finn Wolfhard, Winona Ryder, David Harbour'}, {'moviename': 'Heartstopper', 'stars': 'Joe Locke, Kit
Connor, Yasmin Finney, William Gao'}, {'moviename': 'Winning Time: The Rise of the Lakers Dynasty', 'stars': 'John C. Reilly, Quincy Isaiah, Jason Clarke, Gaby Hoffmann'}, {'moviename': 'Barry', 'stars': 'Bill Hader, Stephen Root, Sarah Goldberg, Anthony Carrigan'}, {'moviename': 'The Flight Attendant', 'stars': 'Kaley Cuoco, Zosia Mamet, Griffin Matthews, Rosie Perez'}, {'moviename':
'Outlander', 'stars': 'Caitríona Balfe, Sam Heughan, Sophie Skelton, Richard Rankin'}, {'moviename': 'Westworld', 'stars': 'Evan Rachel Wood, Jeffrey Wright, Ed Harris, Thandiwe Newton'}, {'moviename': 'Star Trek: Picard', 'stars': 'Patrick Stewart, Alison Pill, Michelle Hurd, Santiago Cabrera'}, {'moviename': 'Bridgerton', 'stars': 'Nicola Coughlan, Jonathan Bailey, Ruth Gemmell, Florence Hunt'}, {'moviename': 'The Walking Dead', 'stars': 'Andrew Lincoln, Norman Reedus, Melissa McBride, Lauren Cohan'}, {'moviename': 'Peaky Blinders', 'stars': 'Cillian Murphy, Paul Anderson,
Sophie Rundle, Helen McCrory'}, {'moviename': 'The Lincoln Lawyer', 'stars': 'Manuel Garcia-Rulfo, Becki Newton, Neve Campbell, Christopher Gorham'}, {'moviename': 'Grace and Frankie', 'stars':
'Jane Fonda, Lily Tomlin, Sam Waterston, Martin Sheen'}, {'moviename': "Grey's Anatomy", 'stars': 'Ellen Pompeo, Chandra Wilson, James Pickens Jr., Justin Chambers'}, {'moviename': 'The Last Kingdom', 'stars': 'Alexander Dreymon, Eliza Butterworth, Arnas Fedaravicius, Mark Rowley'}, {'moviename': 'Yellowstone', 'stars': 'Luke Grimes, Kelly Reilly, Wes Bentley, Cole Hauser'}, {'moviename': 'Shining Girls', 'stars': 'Elisabeth Moss, Wagner Moura, Phillipa Soo, Chris Chalk'}, {'moviename': 'Love, Death & Robots', 'stars': "Scott Whyte, Nolan North, Steven Pacey, Emily O'Brien"}, {'moviename': 'The Office', 'stars': 'Steve Carell, Jenna Fischer, John Krasinski, Rainn Wilson'}, {'moviename': 'Doctor Who', 'stars': 'Jodie Whittaker, Peter Capaldi, Pearl Mackie, Matt Smith'}, {'moviename': 'Tokyo Vice', 'stars': 'Ansel Elgort, Ken Watanabe, Rachel Keller, Shô Kasamatsu'}, {'moviename': 'The Blacklist', 'stars': 'James Spader, Megan Boone, Diego Klattenhoff, Ryan
Eggold'}, {'moviename': 'NCIS: Naval Criminal Investigative Service', 'stars': 'Mark Harmon, David McCallum, Sean Murray, Pauley Perrette'}, {'moviename': 'Euphoria', 'stars': 'Zendaya, Hunter Schafer, Angus Cloud, Jacob Elordi'}, {'moviename': 'Tehran', 'stars': 'Niv Sultan, Shaun Toub, Shervin Alenabi, Arash Marandi'}, {'moviename': 'Sex Education', 'stars': 'Asa Butterfield, Gillian Anderson, Emma Mackey, Ncuti Gatwa'}, {'moviename': 'Slow Horses', 'stars': 'Jack Lowden, Kristin Scott Thomas, Gary Oldman, Chris Reilly'}, {'moviename': 'The Boys', 'stars': 'Karl Urban, Jack Quaid, Antony Starr, Erin Moriarty'}, {'moviename': 'Law & Order: Special Victims Unit', 'stars': 'Mariska Hargitay, Christopher Meloni, Ice-T, Dann Florek'}, {'moviename': 'The Rookie', 'stars': "Nathan Fillion, Alyssa Diaz, Richard T. Jones, Melissa O'Neil"}, {'moviename': 'Derry Girls', 'stars': "Saoirse-Monica Jackson, Louisa Harland, Tara Lynne O'Neill, Kathy Kiera Clarke"}, {'moviename': 'Atlanta', 'stars': 'Donald Glover, Brian Tyree Henry, LaKeith Stanfield, Zazie Beetz'}, {'moviename': 'Friends', 'stars': 'Jennifer Aniston, Courteney Cox, Lisa Kudrow, Matt LeBlanc'}, {'moviename': 'Supernatural', 'stars': 'Jared Padalecki, Jensen Ackles, Jim Beaver, Misha Collins'}, {'moviename': 'Gaslit', 'stars': 'Julia Roberts, Sean Penn, Dan Stevens, Betty Gilpin'}, {'moviename': 'The Sopranos', 'stars': 'James Gandolfini, Lorraine Bracco, Edie Falco, Michael Imperioli'}, {'moviename': 'Russian Doll', 'stars': 'Natasha Lyonne, Charlie Barnett, Greta Lee, Elizabeth Ashley'}, {'moviename': 'Hacks', 'stars': 'Jean Smart, Hannah Einbinder, Carl Clemons-Hopkins, Rose Abdoo'}, {'moviename': 'Vikings', 'stars': 'Katheryn Winnick, Gustaf Skarsgård, Alexander Ludwig, Georgia Hirst'}]

You have to iterate the ResultSet:
first_stars = [s.text for s in first_movie.select('a[href*="name"]')]
first_stars
Output:
['Bob Odenkirk', 'Rhea Seehorn', 'Jonathan Banks', 'Michael Mando']
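
The second error in the question has a different cause: find() expects a tag name (plus optional attribute filters), not a CSS selector, so find('a[href*="name"]') looks for a tag with that literal name and returns None. The CSS counterpart of find() is select_one(). A small sketch of the difference, run against hypothetical markup (the HTML and the hrefs below are made up, only loosely modelled on one result block):

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the hrefs are placeholders, not real IMDb links
html = '''
<div class="lister-item mode-advanced">
  <p>Stars:
    <a href="/name/example-1/">Bob Odenkirk</a>,
    <a href="/name/example-2/">Rhea Seehorn</a>
  </p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
selector = 'a[href*="name"]'

not_found = soup.find(selector)         # treats the selector as a tag name
first_star = soup.select_one(selector)  # CSS selector, first match only
all_stars = [a.get_text(strip=True) for a in soup.select(selector)]

print(not_found)        # None
print(first_star.text)  # Bob Odenkirk
print(all_stars)        # ['Bob Odenkirk', 'Rhea Seehorn']
```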

How to insert a variable in xpath within a for loop?

for i in range(length):
    # print(i)
    driver.execute_script("window.history.go(-1)")
    range = driver.find_element_by_xpath("(//a[@class = 'button'])[i]").click()
    content2 = driver.page_source.encode('utf-8').strip()
    soup2 = BeautifulSoup(content2,"html.parser")
    name2 = soup2.find('h1', {'data-qa-target': 'ProviderDisplayName'}).text
    phone2 = soup2.find('a', {'class': 'click-to-call-button-secondary hg-track mobile-click-to-call'}).text
    print(name2, phone2)
Hey guys, I am trying to scrape the first and last name and the telephone number of each person on this website: https://www.healthgrades.com/family-marriage-counseling-directory. I want the button lookup (line 4) to adapt to the variable i. If I manually replace i with a number, everything works perfectly fine, but as soon as I put the variable i in, it doesn't work. Any help is much appreciated!

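Instead of this:
range = driver.find_element_by_xpath("(//a[@class = 'button'])[i]").click()
use an f-string, so that the value of i is substituted into the XPath instead of the literal letter i:
driver.find_element_by_xpath(f"(//a[@class = 'button'])[{i}]").click()
The range = assignment is dropped because click() returns None and the name shadows the built-in range().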
Update 1 :
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(50)
driver.get("https://www.healthgrades.com/family-marriage-counseling-directory")
for name in driver.find_elements(By.CSS_SELECTOR, "a[data-qa-target='provider-details-provider-name']"):
    print(name.text)
Output :
Noe Gutierrez, MSW
Melissa Huston, LCSW
Gina Kane, LMHC
Dr. Mary Marino, PHD
Emili-Erin Puente, MED
Richard Vogel, LMFT
Lynn Bednarz, LCPC
Nicole Palow, LMHC
Dennis Hart, LPCC
Dr. Robert Meeks, PHD
Jody Davis
Dr. Kim Logan, PHD
Artemis Paschalis, LMHC
Mark Webb, LMFT
Deirdre Holland, LCSW-R
John Paul Dilorenzo, LMHC
Joseph Hayes, LPC
Dr. Maylin Batista, PHD
Ella Gray, LCPC
Cynthia Mack-Ernsdorff, MA
Dr. Edward Muldrow, PHD
Rachel Sievers, LMFT
Dr. Lisa Burton, PHD
Ami Owen, LMFT
Sharon Lorber, LCSW
Heather Rowley, LCMHC
Dr. Bonnie Bryant, PHD
Marilyn Pearlman, LCSW
Charles Washam, BCD
Dr. Liliana Wolf, PHD
Christy Kobe, LCSW
Dana Paine, LPCC
Scott Kohner, LCSW
Elizabeth Krzewski, LMHC
Luisa Contreras, LMFT
Dr. Joel Nunez, PHD
Susanne Sacco, LISW
Lauren Reminger, MA
Thomas Recher, AUD
Kristi Smith, LCSW
Kecia West, LPC
Gregory Douglas, MED
Gina Smith, LCPC
Anne Causey, LPC
Dr. David Greenfield, PHD
Olga Rothschild, LMHC
Dr. Susan Levin, PHD
Ferguson Jennifer, LMHC
Marci Ober, LMFT
Christopher Checke, LMHC
Process finished with exit code 0
Update 2 :
leng = len(driver.find_elements(By.CSS_SELECTOR, "a[data-qa-target='provider-details-provider-name']"))
# XPath positions are 1-based, so count from 1 to leng inclusive
for i in range(1, leng + 1):
    driver.find_element(By.XPATH, f"(//a[text()='View Profile'])[{i}]").click()
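
One detail worth keeping in mind when interpolating a loop variable: XPath positions start at 1, while a plain Python range() starts at 0, and position 0 matches nothing. A tiny helper can make the shift explicit. The name nth_match_xpath is made up for this sketch; it only builds the string and does not touch Selenium.

```python
def nth_match_xpath(base_xpath: str, zero_based_index: int) -> str:
    """Build an XPath for the n-th match of base_xpath from a 0-based Python index."""
    if zero_based_index < 0:
        raise ValueError("index must be >= 0")
    # XPath positions are 1-based, so shift the Python index by one
    return f"({base_xpath})[{zero_based_index + 1}]"


for i in range(3):
    print(nth_match_xpath("//a[text()='View Profile']", i))
# (//a[text()='View Profile'])[1]
# (//a[text()='View Profile'])[2]
# (//a[text()='View Profile'])[3]
```

The resulting string can then be passed to the driver's XPath lookup inside the loop.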

How to take row values from one pandas dataframe and use them as reference to get values from another dataframe

I have two dataframes. One contains contact information for constituents. The other was created to pair up constituents that might be part of the same household.
Sample:
data1 = {'Household_0':['1234567','2345678','3456789','4567890'],
'Individual_0':['1111111','2222222','3333333','4444444'],
'Individual_1':['5555555','6666666','7777777','']}
df1=pd.DataFrame(data1)
data2 = {'Constituent Id':['1234567','2345678','3456789','4567890',
'1111111','2222222','3333333','4444444',
'5555555','6666666','7777777'],
'Display Name':['Clark Kent and Lois Lane','Bruce Banner and Betty Ross',
'Tony Stark and Pepper Pots','Steve Rogers','Clark Kent','Bruce Banner',
'Tony Stark','Steve Rogers','Lois Lane','Betty Ross','Pepper Pots']}
df2=pd.DataFrame(data2)
Resulting in:
df1
Household_0 Individual_0 Individual_1
0 1234567 1111111 5555555
1 2345678 2222222 6666666
2 3456789 3333333 7777777
3 4567890 4444444
df2
Constituent Id Display Name
0 1234567 Clark Kent and Lois Lane
1 2345678 Bruce Banner and Betty Ross
2 3456789 Tony Stark and Pepper Pots
3 4567890 Steve Rogers
4 1111111 Clark Kent
5 2222222 Bruce Banner
6 3333333 Tony Stark
7 4444444 Steve Rogers
8 5555555 Lois Lane
9 6666666 Betty Ross
10 7777777 Pepper Pots
I would like to take df1, reference the Constituent Id out of df2, and create a new dataframe that has the names of the constituents instead of their IDs, so that we can ensure they are truly family/household members.
I believe I can do this by iterating, but that seems like the wrong approach. Is there a straightforward way to do this?

You can map each column of df1 through a Series built from df2: set Constituent Id as the index and select the Display Name column. Use apply to repeat the mapping on every column.
print (df1.apply(lambda x: x.map(df2.set_index('Constituent Id')['Display Name'])))
Household_0 Individual_0 Individual_1
0 Clark Kent and Lois Lane Clark Kent Lois Lane
1 Bruce Banner and Betty Ross Bruce Banner Betty Ross
2 Tony Stark and Pepper Pots Tony Stark Pepper Pots
3 Steve Rogers Steve Rogers NaN
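
The NaN in the last row comes from the empty string in Individual_1, which has no entry in the lookup. As a sketch under the assumption that a blank cell is the preferred placeholder, the lookup Series can be built once and the unmatched cells blanked afterwards. The frames below are a cut-down copy of the sample data, and the names lookup and named are introduced only for this example.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Household_0': ['1234567', '4567890'],
    'Individual_0': ['1111111', '4444444'],
    'Individual_1': ['5555555', ''],
})
df2 = pd.DataFrame({
    'Constituent Id': ['1234567', '4567890', '1111111', '4444444', '5555555'],
    'Display Name': ['Clark Kent and Lois Lane', 'Steve Rogers',
                     'Clark Kent', 'Steve Rogers', 'Lois Lane'],
})

# Build the ID -> name lookup once instead of once per column
lookup = df2.set_index('Constituent Id')['Display Name']

# IDs missing from the lookup (here the empty string) map to NaN; fillna blanks them
named = df1.apply(lambda col: col.map(lookup)).fillna('')
print(named)
```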

You can chain melt, merge and pivot_table:
df3 = (
    df1
    .reset_index()
    .melt('index')
    .merge(df2, left_on='value', right_on='Constituent Id')
    .pivot_table(values='Display Name', index='index', columns='variable', aggfunc='last')
)
print(df3)
Output:
variable Household_0 Individual_0 Individual_1
index
0 Clark Kent and Lois Lane Clark Kent Lois Lane
1 Bruce Banner and Betty Ross Bruce Banner Betty Ross
2 Tony Stark and Pepper Pots Tony Stark Pepper Pots
3 Steve Rogers Steve Rogers NaN

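You can also use .applymap() with a dictionary lookup to link the two together (since pandas 2.1, applymap is deprecated in favour of DataFrame.map, which is called the same way):
reference = df2.set_index('Constituent Id')['Display Name'].to_dict()
df1[df1.columns] = df1[df1.columns].applymap(reference.get)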

How to pull only certain fields with BeautifulSoup

I'm trying to print only the fields that contain England. My current code writes all the nationalities into a text file, but I want only the England entries. The page I'm pulling from is https://www.premierleague.com/players
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.premierleague.com/players")
c = r.content
soup = BeautifulSoup(c, "html.parser")
players = open("playerslist.txt", "w+")
for playerCountry in soup.findAll("span", {"class":"playerCountry"}):
    players.write(playerCountry.text.strip())
    players.write("\n")

You just need to check whether the text is not equal to 'England' and, if so, skip to the next item in the list:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.premierleague.com/players")
c = r.content
soup = BeautifulSoup(c, "html.parser")
# the with block closes the file once the loop has finished
with open("playerslist.txt", "w+") as players:
    for playerCountry in soup.findAll("span", {"class":"playerCountry"}):
        if playerCountry.text.strip() != 'England':
            continue
        players.write(playerCountry.text.strip())
        players.write("\n")

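Or you could use pandas.read_html() and a couple of lines of code:
import pandas as pd
df = pd.read_html("https://www.premierleague.com/players")[0]
print(df.loc[df['Nationality'] != 'England'])
As written, the != comparison selects the players who are not from England (use == to keep only the English players), so this prints: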
Player Position Nationality
2 Charlie Adam Midfielder Scotland
3 Adrián Goalkeeper Spain
4 Adrien Silva Midfielder Portugal
5 Ibrahim Afellay Midfielder Netherlands
6 Benik Afobe Forward The Democratic Republic Of Congo
7 Sergio Agüero Forward Argentina
9 Soufyan Ahannach Midfielder Netherlands
10 Ahmed Hegazi Defender Egypt
11 Nathan Aké Defender Netherlands
14 Toby Alderweireld Defender Belgium
15 Aleix García Midfielder Spain
17 Ali Gabr Defender Egypt
18 Allan Nyom Defender Cameroon
19 Allan Souza Midfielder Brazil
20 Joe Allen Midfielder Wales
22 Marcos Alonso Defender Spain
23 Paulo Alves Midfielder Portugal
24 Daniel Amartey Midfielder Ghana
25 Jordi Amat Defender Spain
27 Ethan Ampadu Defender Wales
28 Nordin Amrabat Forward Morocco
