Web scraping star names from an IMDb page using BeautifulSoup (Python)
I am trying to get the names of the stars from an IMDb page. Below is my code:
from requests import get
url = 'https://www.imdb.com/search/title/?title_type=tv_movie,tv_series&user_rating=6.0,10.0&adult=include&ref_=adv_prv'
response = get(url)
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
movie_containers = html_soup.find_all('div', 'lister-item mode-advanced')
first_movie = movie_containers[0]
first_stars = first_movie.select('a[href*="name"]')
first_stars
I got the following output
[Bob Odenkirk,
Rhea Seehorn,
Jonathan Banks,
Michael Mando]
I am trying to get only the names of the stars, but first_stars.text gives the following error:
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_3104\1297903165.py in <module>
1 first_stars = first_movie.select('a[href*="name"]')
----> 2 first_stars.text
~\Anaconda3\lib\site-packages\bs4\element.py in __getattr__(self, key)
2288 """Raise a helpful exception to explain a common code fix."""
2289 raise AttributeError(
-> 2290 "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
2291 )
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
When I tried:
first_stars = first_movie.find('a[href*="name"]')
first_stars.text
I also got the following error:
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_3104\2359725208.py in <module>
1 first_stars = first_movie.find('a[href*="name"]')
----> 2 first_stars.text
AttributeError: 'NoneType' object has no attribute 'text'
Any idea how I can extract only the names of the stars?
If you just need the star names for every title on the page, without distinguishing between titles, this might help:
# soup is the parsed page (html_soup in the question)
block = soup.find_all("div", attrs={"class": "lister-item mode-advanced"})
starList = list()
for star in block:
    starList.append(star.find("p", attrs={"class": ""}).text.replace("Stars:", "").replace("\n", "").strip())
print(starList)
It prints
['Bob Odenkirk, Rhea Seehorn, Jonathan Banks, Michael Mando', 'Jason Bateman, Laura Linney, Sofia Hublitz, Skylar Gaertner', 'Gia Sandhu, Anson Mount, Ethan Peck, Jess Bush', 'Josh Brolin, Imogen Poots, Lili Taylor, Tom Pelphrey', 'Pablo Schreiber, Shabana Azmi, Natasha Culzac, Olive Gray', 'Titus Welliver, Mimi Rogers, Madison Lintz, Stephen A. Chang', 'Rachel Griffiths, Sophia Ali, Shannon Berry, Jenna Clause', 'Adam Scott, Zach Cherry, Britt Lower, Tramell Tillman', 'Milo Ventimiglia, Mandy Moore, Sterling K. Brown, Chrissy Metz', 'Emilia Clarke, Peter Dinklage, Kit Harington, Lena Headey', 'Bryan Cranston, Aaron Paul, Anna Gunn, Betsy Brandt', 'Millie Bobby Brown, Finn Wolfhard, Winona Ryder, David Harbour', 'Joe Locke, Kit Connor, Yasmin Finney, William Gao', 'John C. Reilly, Quincy Isaiah, Jason Clarke, Gaby Hoffmann', 'Bill Hader, Stephen Root, Sarah Goldberg, Anthony Carrigan', 'Kaley Cuoco, Zosia Mamet, Griffin Matthews, Rosie Perez', 'Caitríona Balfe, Sam Heughan, Sophie Skelton, Richard Rankin', 'Evan Rachel Wood, Jeffrey Wright, Ed Harris, Thandiwe Newton', 'Patrick Stewart, Alison Pill, Michelle Hurd, Santiago Cabrera', 'Nicola Coughlan, Jonathan Bailey, Ruth Gemmell, Florence Hunt', 'Andrew Lincoln, Norman Reedus, Melissa McBride, Lauren Cohan', 'Cillian Murphy, Paul Anderson, Sophie Rundle, Helen McCrory', 'Manuel Garcia-Rulfo, Becki Newton, Neve Campbell, Christopher Gorham', 'Jane Fonda, Lily Tomlin, Sam Waterston, Martin Sheen', 'Ellen Pompeo, Chandra Wilson, James Pickens Jr., Justin Chambers', 'Alexander Dreymon, Eliza Butterworth, Arnas Fedaravicius, Mark Rowley', 'Luke Grimes, Kelly Reilly, Wes Bentley, Cole Hauser', 'Elisabeth Moss, Wagner Moura, Phillipa Soo, Chris Chalk', "Scott Whyte, Nolan North, Steven Pacey, Emily O'Brien", 'Steve Carell, Jenna Fischer, John Krasinski, Rainn Wilson', 'Jodie Whittaker, Peter Capaldi, Pearl Mackie, Matt Smith', 'Ansel
Elgort, Ken Watanabe, Rachel Keller, Shô Kasamatsu', 'James Spader, Megan Boone, Diego Klattenhoff, Ryan Eggold', 'Mark Harmon, David McCallum, Sean Murray, Pauley Perrette', 'Zendaya, Hunter Schafer, Angus Cloud, Jacob Elordi', 'Niv Sultan, Shaun Toub, Shervin Alenabi, Arash Marandi', 'Asa Butterfield, Gillian Anderson, Emma Mackey, Ncuti Gatwa', 'Jack Lowden, Kristin Scott Thomas, Gary Oldman, Chris Reilly', 'Karl Urban, Jack Quaid, Antony Starr, Erin Moriarty', 'Mariska Hargitay, Christopher Meloni, Ice-T, Dann Florek', "Nathan Fillion, Alyssa Diaz, Richard T. Jones, Melissa O'Neil", "Saoirse-Monica Jackson, Louisa Harland, Tara Lynne O'Neill, Kathy Kiera Clarke", 'Donald Glover, Brian Tyree Henry, LaKeith Stanfield, Zazie Beetz', 'Jennifer Aniston, Courteney Cox, Lisa Kudrow, Matt LeBlanc', 'Jared Padalecki, Jensen Ackles, Jim Beaver, Misha Collins', 'Julia Roberts, Sean Penn, Dan Stevens, Betty Gilpin', 'James Gandolfini, Lorraine Bracco, Edie Falco, Michael Imperioli', 'Natasha Lyonne, Charlie Barnett, Greta Lee, Elizabeth Ashley', 'Jean Smart, Hannah Einbinder, Carl Clemons-Hopkins, Rose Abdoo', 'Katheryn Winnick, Gustaf Skarsgård, Alexander Ludwig, Georgia Hirst']
Or, if you need both the title and its stars:
block = soup.find_all("div", attrs={"class": "lister-item mode-advanced"})
starList = list()
movieDict = dict()
for star in block:
    movieDict = {
        "moviename": star.find("h3", attrs={"class": "lister-item-header"}).text.split("\n")[2],
        "stars": star.find("p", attrs={"class": ""}).text.replace("Stars:", "").replace("\n", "").strip()
    }
    starList.append(movieDict)
print(starList)
This will print:
[{'moviename': 'Better Call Saul', 'stars': 'Bob Odenkirk, Rhea Seehorn, Jonathan Banks, Michael
Mando'}, {'moviename': 'Ozark', 'stars': 'Jason Bateman, Laura Linney, Sofia Hublitz, Skylar Gaertner'}, {'moviename': 'Star Trek: Strange New Worlds', 'stars': 'Gia Sandhu, Anson Mount, Ethan Peck, Jess Bush'}, {'moviename': 'Outer Range', 'stars': 'Josh Brolin, Imogen Poots, Lili Taylor,
Tom Pelphrey'}, {'moviename': 'Halo', 'stars': 'Pablo Schreiber, Shabana Azmi, Natasha Culzac, Olive Gray'}, {'moviename': 'Bosch: Legacy', 'stars': 'Titus Welliver, Mimi Rogers, Madison Lintz,
Stephen A. Chang'}, {'moviename': 'The Wilds', 'stars': 'Rachel Griffiths, Sophia Ali, Shannon Berry, Jenna Clause'}, {'moviename': 'Severance', 'stars': 'Adam Scott, Zach Cherry, Britt Lower, Tramell Tillman'}, {'moviename': 'This Is Us', 'stars': 'Milo Ventimiglia, Mandy Moore, Sterling K. Brown, Chrissy Metz'}, {'moviename': 'Game of Thrones', 'stars': 'Emilia Clarke, Peter Dinklage, Kit Harington, Lena Headey'}, {'moviename': 'Breaking Bad', 'stars': 'Bryan Cranston, Aaron Paul, Anna Gunn, Betsy Brandt'}, {'moviename': 'Stranger Things', 'stars': 'Millie Bobby Brown, Finn Wolfhard, Winona Ryder, David Harbour'}, {'moviename': 'Heartstopper', 'stars': 'Joe Locke, Kit
Connor, Yasmin Finney, William Gao'}, {'moviename': 'Winning Time: The Rise of the Lakers Dynasty', 'stars': 'John C. Reilly, Quincy Isaiah, Jason Clarke, Gaby Hoffmann'}, {'moviename': 'Barry', 'stars': 'Bill Hader, Stephen Root, Sarah Goldberg, Anthony Carrigan'}, {'moviename': 'The Flight Attendant', 'stars': 'Kaley Cuoco, Zosia Mamet, Griffin Matthews, Rosie Perez'}, {'moviename':
'Outlander', 'stars': 'Caitríona Balfe, Sam Heughan, Sophie Skelton, Richard Rankin'}, {'moviename': 'Westworld', 'stars': 'Evan Rachel Wood, Jeffrey Wright, Ed Harris, Thandiwe Newton'}, {'moviename': 'Star Trek: Picard', 'stars': 'Patrick Stewart, Alison Pill, Michelle Hurd, Santiago Cabrera'}, {'moviename': 'Bridgerton', 'stars': 'Nicola Coughlan, Jonathan Bailey, Ruth Gemmell, Florence Hunt'}, {'moviename': 'The Walking Dead', 'stars': 'Andrew Lincoln, Norman Reedus, Melissa McBride, Lauren Cohan'}, {'moviename': 'Peaky Blinders', 'stars': 'Cillian Murphy, Paul Anderson,
Sophie Rundle, Helen McCrory'}, {'moviename': 'The Lincoln Lawyer', 'stars': 'Manuel Garcia-Rulfo, Becki Newton, Neve Campbell, Christopher Gorham'}, {'moviename': 'Grace and Frankie', 'stars':
'Jane Fonda, Lily Tomlin, Sam Waterston, Martin Sheen'}, {'moviename': "Grey's Anatomy", 'stars': 'Ellen Pompeo, Chandra Wilson, James Pickens Jr., Justin Chambers'}, {'moviename': 'The Last Kingdom', 'stars': 'Alexander Dreymon, Eliza Butterworth, Arnas Fedaravicius, Mark Rowley'}, {'moviename': 'Yellowstone', 'stars': 'Luke Grimes, Kelly Reilly, Wes Bentley, Cole Hauser'}, {'moviename': 'Shining Girls', 'stars': 'Elisabeth Moss, Wagner Moura, Phillipa Soo, Chris Chalk'}, {'moviename': 'Love, Death & Robots', 'stars': "Scott Whyte, Nolan North, Steven Pacey, Emily O'Brien"}, {'moviename': 'The Office', 'stars': 'Steve Carell, Jenna Fischer, John Krasinski, Rainn Wilson'}, {'moviename': 'Doctor Who', 'stars': 'Jodie Whittaker, Peter Capaldi, Pearl Mackie, Matt Smith'}, {'moviename': 'Tokyo Vice', 'stars': 'Ansel Elgort, Ken Watanabe, Rachel Keller, Shô Kasamatsu'}, {'moviename': 'The Blacklist', 'stars': 'James Spader, Megan Boone, Diego Klattenhoff, Ryan
Eggold'}, {'moviename': 'NCIS: Naval Criminal Investigative Service', 'stars': 'Mark Harmon, David McCallum, Sean Murray, Pauley Perrette'}, {'moviename': 'Euphoria', 'stars': 'Zendaya, Hunter Schafer, Angus Cloud, Jacob Elordi'}, {'moviename': 'Tehran', 'stars': 'Niv Sultan, Shaun Toub, Shervin Alenabi, Arash Marandi'}, {'moviename': 'Sex Education', 'stars': 'Asa Butterfield, Gillian Anderson, Emma Mackey, Ncuti Gatwa'}, {'moviename': 'Slow Horses', 'stars': 'Jack Lowden, Kristin Scott Thomas, Gary Oldman, Chris Reilly'}, {'moviename': 'The Boys', 'stars': 'Karl Urban, Jack Quaid, Antony Starr, Erin Moriarty'}, {'moviename': 'Law & Order: Special Victims Unit', 'stars': 'Mariska Hargitay, Christopher Meloni, Ice-T, Dann Florek'}, {'moviename': 'The Rookie', 'stars': "Nathan Fillion, Alyssa Diaz, Richard T. Jones, Melissa O'Neil"}, {'moviename': 'Derry Girls', 'stars': "Saoirse-Monica Jackson, Louisa Harland, Tara Lynne O'Neill, Kathy Kiera Clarke"}, {'moviename': 'Atlanta', 'stars': 'Donald Glover, Brian Tyree Henry, LaKeith Stanfield, Zazie Beetz'}, {'moviename': 'Friends', 'stars': 'Jennifer Aniston, Courteney Cox, Lisa Kudrow, Matt LeBlanc'}, {'moviename': 'Supernatural', 'stars': 'Jared Padalecki, Jensen Ackles, Jim Beaver, Misha Collins'}, {'moviename': 'Gaslit', 'stars': 'Julia Roberts, Sean Penn, Dan Stevens, Betty Gilpin'}, {'moviename': 'The Sopranos', 'stars': 'James Gandolfini, Lorraine Bracco, Edie Falco, Michael Imperioli'}, {'moviename': 'Russian Doll', 'stars': 'Natasha Lyonne, Charlie Barnett, Greta Lee, Elizabeth Ashley'}, {'moviename': 'Hacks', 'stars': 'Jean Smart, Hannah Einbinder, Carl Clemons-Hopkins, Rose Abdoo'}, {'moviename': 'Vikings', 'stars': 'Katheryn Winnick, Gustaf Skarsgård, Alexander Ludwig, Georgia Hirst'}]
You have to iterate the ResultSet:
first_stars = [s.text for s in first_movie.select('a[href*="name"]')]
first_stars
Output:
['Bob Odenkirk', 'Rhea Seehorn', 'Jonathan Banks', 'Michael Mando']
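For what it's worth, the second attempt in the question returned None because find() matches on tag name and attributes, not on CSS selectors; select_one() is the CSS-selector counterpart. A minimal sketch of the difference, assuming a small hand-written HTML fragment (the markup and the names in it are made up, not taken from IMDb):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one IMDb list item.
html = """
<div class="lister-item mode-advanced">
  <p>Stars:
    <a href="/name/nm0000001/">First Star</a>,
    <a href="/name/nm0000002/">Second Star</a>
  </p>
</div>
"""
item = BeautifulSoup(html, "html.parser").div

# find() treats its first argument as a tag name, so a CSS selector
# string matches no tag and the result is None.
assert item.find('a[href*="name"]') is None

# select_one() takes a CSS selector and returns the first match only.
first_name = item.select_one('a[href*="name"]').text

# select() returns every match; read .text from each element in turn.
all_names = [a.text for a in item.select('a[href*="name"]')]
print(first_name, all_names)
```

So select_one() is the option when only the first star is wanted, and the list comprehension over select() when all of them are.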
Related
Cleaning up web scrape data and combining it together?
The website URL is https://www.justia.com/lawyers/criminal-law/maine. I want to scrape only the name of each lawyer and where their office is.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
Lawyer_name = soup.find_all("a", "url main-profile-link")
for i in Lawyer_name:
    print(i.find(text=True))
address = soup.find_all("span", "-address -hide-landscape-tablet")
for x in address:
    print(x.find_all(text=True))
The names print out just fine, but the addresses print with extra whitespace that I want to remove:
['\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t88 Hammond Street', '\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tBangor,\t\t\t\t\tME 04401\t\t\t\t\t\t ']
So the output I'm trying to get for each lawyer looks like this (the first one as an example):
Hunter J Tzovarras 88 Hammond Street Bangor, ME 04401
There are two issues I'm trying to figure out:
How can I clean up the address so it is easier to read?
How can I save each lawyer's name with the matching address so they don't get mixed up?
Use x.get_text() instead of x.find_all():
for x in address:
    print(x.get_text(strip=True))
Full working code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.justia.com/lawyers/criminal-law/maine'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
n = []
ad = []
Lawyer_name = [x.get('title').strip() for x in soup.select('a.lawyer-avatar')]
n.extend(Lawyer_name)
#print(Lawyer_name)
address = [x.get_text(strip=True).replace('\t', '').strip() for x in soup.find_all("span", class_="-address -hide-landscape-tablet")]
#print(address)
ad.extend(address)
# a flat list of column names gives ordinary (non-MultiIndex) columns
df = pd.DataFrame(data=list(zip(n, ad)), columns=['Lawyer_name', 'address'])
print(df)
Output:
Lawyer_name address
0 William T. Bly Esq 119 Main StreetKennebunk,ME 04043
1 John S. Webb 949 Main StreetSanford,ME 04073
2 William T. Bly Esq 20 Oak StreetEllsworth,ME 04605
3 Christopher Causey Esq 16 Middle StSaco,ME 04072
4 Robert Van Horn 88 Hammond StreetBangor,ME 04401
5 John S. Webb 37 Western Ave., Unit #307Kennebunk,ME 04043
6 Hunter J Tzovarras 4 Union Park RoadTopsham,ME 04086
7 Michael Stephen Bowser Jr. 241 Main StreetP.O. Box 57Saco,ME 04072
8 Richard Regan 6 City CenterSuite 301Portland,ME 04101
9 Robert Guillory Esq 75 Pearl St. Suite 400Portland,ME 04101
10 Dylan R. Boyd 160 Capitol StreetP.O. Box 79Augusta,ME 04332
11 Luke Rioux Esq 10 Stoney Brook LaneLyman,ME 04002
12 David G. Webbert 15 Columbia Street, Ste. 301Bangor,ME 04401
13 Amy Fairfield 32 Saco AveOld Orchard Beach,ME 04064
14 Mr. Richard Lyman Hartley 62 Portland Rd., Ste. 44Kennebunk,ME 04043
15 Neal L Weinstein Esq 647 U.S. Route One#203York,ME 03909
16 Albert Hansen 76 Tandberg Trail (Route 115)Windham,ME 04062
17 Russell Goldsmith Esq Two Canal PlazaPO Box 4600Portland,ME 04112
18 Miklos Pongratz Esq 18 Market Square Suite 5Houlton,ME 04730
19 Bradford Pattershall Esq 5 Island View DrCumberland Foreside,ME 04110
20 Michele D L Kenney 12 Silver StreetP.O. Box 559Waterville,ME 04903
21 John Simpson 344 Mount Hope Ave.Bangor,ME 04402
22 Mariah America Gleaton 192 Main StreetEllsworth,ME 04605
23 Wayne Foote Esq 85 Brackett StreetPortland,ME 04102
24 Will Ashe 16 Union StreetBrunswick,ME 04011
25 Peter J Cyr Esq 482 Congress Street Suite 402Portland,ME 04101
26 Jonathan Steven Handelman Esq PO Box 335York,ME 03909
27 Richard Smith Berne 36 Ossipee Trl W.Standish,ME 04084
28 Meredith G. Schmid 75 Pearl St.Suite 216Portland,ME 04101
29 Gregory LeClerc 28 Long Sands Road, Suite 5York,ME 03909
30 Cory McKenna 20 Mechanic StCamden,ME 04843
31 Thomas P. Elias P.O. Box 1049304 Hancock St. Suite 1KBangor,ME...
32 Christopher MacLean 1250 Forest Avenue, Ste 3APortland,ME 04103
33 Zachary J. Smith 415 Congress StreetSuite 202Portland,ME 04101
34 Stephen Sweatt 919 Ridge RoadP.O. BOX 119Bowdoinham,ME 04008
35 Michael Turndorf Esq 1250 Forest Avenue, Ste 3APortland,ME 04103
36 Andrews Bruce Campbell Esq 133 State StreetAugusta,ME 04330
37 Timothy Zerillo 110 Portland StreetFryeburg,ME 04037
38 Walter McKee Esq 440 Walnut Hill RdNorth Yarmouth,ME 04097
39 Shelley Carter 70 State StreetEllsworth,ME 04605
For your second question, you can save them in a dictionary like this:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.justia.com/lawyers/criminal-law/maine'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# parse all names and save them in a list
lawyer_names = soup.find_all("a", "url main-profile-link")
lawyer_names = [name.find(text=True).strip() for name in lawyer_names]
# parse all addresses and save them in a list
lawyer_addresses = soup.find_all("span", "-address -hide-landscape-tablet")
lawyer_addresses = [re.sub(r'\s+', ' ', address.get_text(strip=True)) for address in lawyer_addresses]
# map names with addresses
lawyer_dict = dict(zip(lawyer_names, lawyer_addresses))
print(lawyer_dict)
Output dictionary:
{'Albert Hansen': '62 Portland Rd., Ste. 44Kennebunk, ME 04043', 'Amber Lynn Tucker': '415 Congress St., Ste. 202P.O. Box 7542Portland, ME 04112', 'Amy Fairfield': '10 Stoney Brook LaneLyman, ME 04002', 'Andrews Bruce Campbell Esq': '919 Ridge RoadP.O. BOX 119Bowdoinham, ME 04008', 'Bradford Pattershall Esq': 'Two Canal PlazaPO Box 4600Portland, ME 04112', 'Christopher Causey Esq': '949 Main StreetSanford, ME 04073', 'Cory McKenna': '75 Pearl St.Suite 216Portland, ME 04101', 'David G. Webbert': '160 Capitol StreetP.O. Box 79Augusta, ME 04332', 'David Nelson Wood Esq': '120 Main StreetSuite 110Saco, ME 04072', 'Dylan R. Boyd': '6 City CenterSuite 301Portland, ME 04101', 'Gregory LeClerc': '36 Ossipee Trl W.Standish, ME 04084', 'Hunter J Tzovarras': '88 Hammond StreetBangor, ME 04401', 'John S. Webb': '16 Middle StSaco, ME 04072', 'John Simpson': '5 Island View DrCumberland Foreside, ME 04110', 'Jonathan Steven Handelman Esq': '16 Union StreetBrunswick, ME 04011', 'Luke Rioux Esq': '75 Pearl St. Suite 400Portland, ME 04101', 'Mariah America Gleaton': '12 Silver StreetP.O. Box 559Waterville, ME 04903', 'Meredith G. Schmid': 'PO Box 335York, ME 03909', 'Michael Stephen Bowser Jr.': '37 Western Ave., Unit #307Kennebunk, ME 04043', 'Michael Turndorf Esq': '415 Congress StreetSuite 202Portland, ME 04101', 'Michele D L Kenney': '18 Market Square Suite 5Houlton, ME 04730', 'Miklos Pongratz Esq': '76 Tandberg Trail (Route 115)Windham, ME 04062', 'Mr. Richard Lyman Hartley': '15 Columbia Street, Ste. 301Bangor, ME 04401', 'Neal L Weinstein Esq': '32 Saco AveOld Orchard Beach, ME 04064', 'Peter J Cyr Esq': '85 Brackett StreetPortland, ME 04102', 'Richard Regan': '4 Union Park RoadTopsham, ME 04086', 'Richard Smith Berne': '482 Congress Street Suite 402Portland, ME 04101', 'Robert Guillory Esq': '241 Main StreetP.O. Box 57Saco, ME 04072', 'Robert Van Horn': '20 Oak StreetEllsworth, ME 04605', 'Russell Goldsmith Esq': '647 U.S. Route One#203York, ME 03909', 'Shelley Carter': '110 Portland StreetFryeburg, ME 04037', 'Thaddeus Day Esq': '440 Walnut Hill RdNorth Yarmouth, ME 04097', 'Thomas P. Elias': '28 Long Sands Road, Suite 5York, ME 03909', 'Timothy Zerillo': '1250 Forest Avenue, Ste 3APortland, ME 04103', 'Todd H Crawford Jr': '1288 Roosevelt Trl, Ste #3P.O. Box 753Raymond, ME 04071', 'Walter McKee Esq': '133 State StreetAugusta, ME 04330', 'Wayne Foote Esq': '344 Mount Hope Ave.Bangor, ME 04402', 'Will Ashe': '192 Main StreetEllsworth, ME 04605', 'William T. Bly Esq': '119 Main StreetKennebunk, ME 04043', 'Zachary J. Smith': 'P.O. Box 1049304 Hancock St. Suite 1KBangor, ME 04401'}
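One thing worth noting about the outputs above: street and city run together ("88 Hammond StreetBangor") because get_text(strip=True) joins the text pieces with no separator. A rough sketch of an alternative, working on the list of strings that find_all(text=True) printed in the question; clean_address and the variable names are made up for illustration, and the sample strings are copied from the question with the tab runs shortened:

```python
# Sample values from the question (tab runs shortened); on the real page
# they would come from find_all(text=True) on each address span.
raw_names = ["Hunter J Tzovarras"]
raw_addresses = [
    ["\n\t\t\t88 Hammond Street", "\n\t\t\tBangor,\t\t\t\t\tME 04401\t\t\t "],
]

def clean_address(parts):
    """Collapse whitespace inside each piece, then join the pieces with ', '."""
    cleaned = [" ".join(part.split()) for part in parts]
    return ", ".join(piece for piece in cleaned if piece)

# zip() pairs each name with the address at the same position.
records = [
    {"name": name, "address": clean_address(parts)}
    for name, parts in zip(raw_names, raw_addresses)
]
print(records)
```

A list of records is used here instead of a dict keyed by name because the same name can appear more than once on the page (the DataFrame output above lists William T. Bly Esq and John S. Webb twice), and a dict would keep only the last address for each.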
How to insert a variable in xpath within a for loop?
for i in range(length):
    # print(i)
    driver.execute_script("window.history.go(-1)")
    range = driver.find_element_by_xpath("(//a[@class = 'button'])[i]").click()
    content2 = driver.page_source.encode('utf-8').strip()
    soup2 = BeautifulSoup(content2, "html.parser")
    name2 = soup2.find('h1', {'data-qa-target': 'ProviderDisplayName'}).text
    phone2 = soup2.find('a', {'class': 'click-to-call-button-secondary hg-track mobile-click-to-call'}).text
    print(name2, phone2)
Hey guys, I am trying to scrape the first and last name and the telephone number of each person on this website: https://www.healthgrades.com/family-marriage-counseling-directory. I want the button XPath on line 4 to adapt to the variable i. If I manually change i to a number, everything works perfectly fine, but as soon as I put in the variable i it doesn't work. Any help is much appreciated!
Instead of this:
range = driver.find_element_by_xpath("(//a[@class = 'button'])[i]").click()
do this:
range = driver.find_element_by_xpath(f"(//a[@class = 'button'])[{i}]").click()
Update 1:
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(50)
driver.get("https://www.healthgrades.com/family-marriage-counseling-directory")
for name in driver.find_elements(By.CSS_SELECTOR, "a[data-qa-target='provider-details-provider-name']"):
    print(name.text)
Output:
Noe Gutierrez, MSW
Melissa Huston, LCSW
Gina Kane, LMHC
Dr. Mary Marino, PHD
Emili-Erin Puente, MED
Richard Vogel, LMFT
Lynn Bednarz, LCPC
Nicole Palow, LMHC
Dennis Hart, LPCC
Dr. Robert Meeks, PHD
Jody Davis
Dr. Kim Logan, PHD
Artemis Paschalis, LMHC
Mark Webb, LMFT
Deirdre Holland, LCSW-R
John Paul Dilorenzo, LMHC
Joseph Hayes, LPC
Dr. Maylin Batista, PHD
Ella Gray, LCPC
Cynthia Mack-Ernsdorff, MA
Dr. Edward Muldrow, PHD
Rachel Sievers, LMFT
Dr. Lisa Burton, PHD
Ami Owen, LMFT
Sharon Lorber, LCSW
Heather Rowley, LCMHC
Dr. Bonnie Bryant, PHD
Marilyn Pearlman, LCSW
Charles Washam, BCD
Dr. Liliana Wolf, PHD
Christy Kobe, LCSW
Dana Paine, LPCC
Scott Kohner, LCSW
Elizabeth Krzewski, LMHC
Luisa Contreras, LMFT
Dr. Joel Nunez, PHD
Susanne Sacco, LISW
Lauren Reminger, MA
Thomas Recher, AUD
Kristi Smith, LCSW
Kecia West, LPC
Gregory Douglas, MED
Gina Smith, LCPC
Anne Causey, LPC
Dr. David Greenfield, PHD
Olga Rothschild, LMHC
Dr. Susan Levin, PHD
Ferguson Jennifer, LMHC
Marci Ober, LMFT
Christopher Checke, LMHC
Process finished with exit code 0
Update 2:
leng = len(driver.find_elements(By.CSS_SELECTOR, "a[data-qa-target='provider-details-provider-name']"))
for i in range(leng):
    # XPath positions start at 1, so shift the zero-based loop index
    driver.find_element_by_xpath(f"(//a[text()='View Profile'])[{i + 1}]").click()
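To make the difference concrete: inside a plain string literal the letter i stays in the XPath as-is, whereas an f-string substitutes the loop value. XPath positions are also 1-based, so a zero-based range() index needs shifting. A small browser-free sketch, where nth_match_xpath is a hypothetical helper name and not part of Selenium:

```python
def nth_match_xpath(base_xpath: str, zero_based_index: int) -> str:
    """Build an XPath selecting one node from everything base_xpath matches.

    XPath positions start at 1, so the zero-based index is shifted by one.
    """
    return f"({base_xpath})[{zero_based_index + 1}]"

# The strings a loop over range(3) would hand to the driver.
xpaths = [nth_match_xpath("//a[@class = 'button']", i) for i in range(3)]
print(xpaths)
```

Each resulting string can then be passed to the driver's XPath lookup in place of the hard-coded one.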
startswith() function help needed in Pandas Dataframe
I have a name column in a DataFrame that contains multiple names.
DataFrame
import pandas as pd
df = pd.DataFrame({'name': ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux', "Mr. Roderick Robert Crispin", "Cunningham", " Mr. Alfred Fleming"]})
OUTPUT
Name
0 Brailey, Mr. William Theodore Ronald
1 Roger Marie Bricoux
2 Mr. Roderick Robert Crispin
3 Cunningham
4 Mr. Alfred Fleming
I wrote a row classification function: if I pass it a row/name, it should return the output class.
mus = ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux', 'John Frederick Preston Clarke']
def classify_role(row):
    if row.loc['name'] in mus:
        return 'musician'
Calling the function:
is_brailey = df['name'].str.startswith('Brailey')
print(classify_role(df[is_brailey].iloc[0]))
This should show 'musician', but the output shows a different class. I think I am writing something wrong in classify_role(). It must be this line:
if row.loc['name'] in mus:
Summary: I need a solution where, if I pass startswith() the first name of a person who is in mus, it returns 'musician'.
EDIT: If you want to test whether values exist in lists, you can create a dictionary and test membership with Series.isin:
mus = ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux', 'John Frederick Preston Clarke']
cat1 = ['Mr. Alfred Fleming','Cunningham']
d = {'musician':mus, 'category':cat1}
for k, v in d.items():
    df.loc[df['Name'].isin(v), 'type'] = k
print (df)
Name type
0 Brailey, Mr. William Theodore Ronald musician
1 Roger Marie Bricoux musician
2 Mr. Roderick Robert Crispin NaN
3 Cunningham category
4 Mr. Alfred Fleming category
Your solution should be changed to:
mus = ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux', 'John Frederick Preston Clarke']
def classify_role(row):
    if row in mus:
        return 'musician'
df['type'] = df['Name'].apply(classify_role)
print (df)
Name type
0 Brailey, Mr. William Theodore Ronald musician
1 Roger Marie Bricoux musician
2 Mr. Roderick Robert Crispin None
3 Cunningham None
4 Mr. Alfred Fleming None
You can pass the values as a tuple to Series.str.startswith. The solution can be expanded to match more categories with a dictionary:
d = {'musician': ['Brailey, Mr. William Theodore Ronald'], 'cat1':['Roger Marie Bricoux', 'Cunningham']}
for k, v in d.items():
    df.loc[df['Name'].str.startswith(tuple(v)), 'type'] = k
print (df)
Name type
0 Brailey, Mr. William Theodore Ronald musician
1 Roger Marie Bricoux cat1
2 Mr. Roderick Robert Crispin NaN
3 Cunningham cat1
4 Mr. Alfred Fleming NaN
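Since the question asks about matching on just the start of a name, here is a shortened sketch of the tuple form of Series.str.startswith using short prefixes instead of full names. It uses the lowercase 'name' column from the question; prefixes_by_role and the 'role' column are illustrative names, not something from the original posts:

```python
import pandas as pd

df = pd.DataFrame({"name": [
    "Brailey, Mr. William Theodore Ronald",
    "Roger Marie Bricoux",
    "Mr. Roderick Robert Crispin",
]})

# Hypothetical mapping from a role to the name prefixes that identify it.
prefixes_by_role = {"musician": ("Brailey", "Roger")}

df["role"] = None
for role, prefixes in prefixes_by_role.items():
    # With a tuple, startswith is True when any one of the prefixes matches.
    mask = df["name"].str.startswith(prefixes)
    df.loc[mask, "role"] = role

roles = df["role"].tolist()
print(roles)
```

Rows that match no prefix keep the initial missing value, so unmatched names are easy to spot afterwards.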
Scraping a table with Beautiful Soup 4
Hello, I am trying to scrape the table at this URL: https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc
There are 50 rows in this table. However, if you click "Show more" (just below the table), more rows appear. My Beautiful Soup code works fine, but the problem is that it retrieves only the first 50 rows. It does not retrieve the rows that appear after clicking "Show more". How can I get all the rows, including the first 50 and those that appear after clicking "Show more"?
Here is the code:
#Request to get the target wiki page
rqst = requests.get("https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc")
soup = BeautifulSoup(rqst.content,'lxml')
table = soup.find_all('table')
NFL_player_stats = pd.read_html(str(table))
players = NFL_player_stats[0]
players.shape
out[0]: (50,1)
Using DevTools in Firefox, I see that it gets the data (in JSON format) for the next page from
https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page=2
If you change the value in page= then you can get the other pages.
import requests

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='
for page in range(1, 4):
    print('\n---', page, '---\n')
    r = requests.get(url + str(page))
    data = r.json()
    #print(data.keys())
    for item in data['athletes']:
        print(item['athlete']['displayName'])
Result:
--- 1 ---
Ezekiel Elliott Saquon Barkley Todd Gurley II Joe Mixon Chris Carson Christian McCaffrey Derrick Henry Adrian Peterson Phillip Lindsay Nick Chubb Lamar Miller James Conner David Johnson Jordan Howard Sony Michel Marlon Mack Melvin Gordon Alvin Kamara Peyton Barber Kareem Hunt Matt Breida Tevin Coleman Aaron Jones Doug Martin Frank Gore Gus Edwards Lamar Jackson Isaiah Crowell Mark Ingram II Kerryon Johnson Josh Allen Dalvin Cook Latavius Murray Carlos Hyde Austin Ekeler Deshaun Watson Kenyan Drake Royce Freeman Dion Lewis LeSean McCoy Mike Davis Josh Adams Alfred Blue Cam Newton Jamaal Williams Tarik Cohen Leonard Fournette Alfred Morris James White Mitchell Trubisky
--- 2 ---
Rashaad Penny LeGarrette Blount T.J. Yeldon Alex Collins C.J. Anderson Chris Ivory Marshawn Lynch Russell Wilson Blake Bortles Wendell Smallwood Marcus Mariota Bilal Powell Jordan Wilkins Kenneth Dixon Ito Smith Nyheim Hines Dak Prescott Jameis Winston Elijah McGuire Patrick Mahomes Aaron Rodgers Jeff Wilson Jr. Zach Zenner Raheem Mostert Corey Clement Jalen Richard Damien Williams Jaylen Samuels Marcus Murphy Spencer Ware Cordarrelle Patterson Malcolm Brown Giovani Bernard Chase Edmonds Justin Jackson Duke Johnson Taysom Hill Kalen Ballage Ty Montgomery Rex Burkhead Jay Ajayi Devontae Booker Chris Thompson Wayne Gallman DJ Moore Theo Riddick Alex Smith Robert Woods Brian Hill Dwayne Washington
--- 3 ---
Ryan Fitzpatrick Tyreek Hill Andrew Luck Ryan Tannehill Josh Rosen Sam Darnold Baker Mayfield Jeff Driskel Rod Smith Matt Ryan Tyrod Taylor Kirk Cousins Cody Kessler Darren Sproles Josh Johnson DeAndre Washington Trenton Cannon Javorius Allen Jared Goff Julian Edelman Jacquizz Rodgers Kapri Bibbs Andy Dalton Ben Roethlisberger Dede Westbrook Case Keenum Carson Wentz Brandon Bolden Curtis Samuel Stevan Ridley Keith Ford Keenan Allen John Kelly Kenjon Barner Matthew Stafford Tyler Lockett C.J. Beathard Cameron Artis-Payne Devonta Freeman Brandin Cooks Isaiah McKenzie Colt McCoy Stefon Diggs Taylor Gabriel Jarvis Landry Tavon Austin Corey Davis Emmanuel Sanders Sammy Watkins Nathan Peterman
EDIT: get all the data as a DataFrame
import requests
import pandas as pd

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='
rows = []  # collect rows here; DataFrame.append was removed in pandas 2.0
for page in range(1, 4):
    print('page:', page)
    r = requests.get(url + str(page))
    data = r.json()
    #print(data.keys())
    for item in data['athletes']:
        player_name = item['athlete']['displayName']
        position = item['athlete']['position']['abbreviation']
        gp = item['categories'][0]['totals'][0]
        other_values = item['categories'][2]['totals']
        row = [player_name, position, gp] + other_values
        rows.append(row)  # append one row
df = pd.DataFrame(rows, columns=['NAME', 'POS', 'GP', 'ATT', 'YDS', 'AVG', 'LNG', 'BIG', 'TD', 'YDS/G', 'FUM', 'LST', 'FD'])
print(len(df))  # 150
print(df.head(20))
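The loops above hard-code three pages. If the number of pages is not known in advance, one possible approach is to keep requesting pages until one comes back without athletes. This is only a sketch: it assumes the endpoint returns an empty 'athletes' list past the last page, which has not been verified here, and it uses an offline stand-in (fake_fetch, FAKE_PAGES, both made up) in place of the real request so it can be run without network access:

```python
from typing import Callable, Iterator

def iter_athletes(fetch_page: Callable[[int], dict], max_pages: int = 100) -> Iterator[dict]:
    """Yield athlete items page by page until a page comes back empty."""
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        athletes = data.get("athletes") or []
        if not athletes:
            break  # assumed end-of-data signal, see the note above
        yield from athletes

# Offline stand-in for requests.get(url + str(page)).json(). The payload
# shape mirrors the keys used in the answer; the values are made up.
FAKE_PAGES = {
    1: {"athletes": [{"athlete": {"displayName": "Player A"}}]},
    2: {"athletes": [{"athlete": {"displayName": "Player B"}}]},
}

def fake_fetch(page: int) -> dict:
    return FAKE_PAGES.get(page, {"athletes": []})

names = [item["athlete"]["displayName"] for item in iter_athletes(fake_fetch)]
print(names)
```

With the real API, fetch_page would be a small function wrapping the requests.get call from the answer; max_pages is a safety cap in case the empty-page assumption does not hold.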
A better/faster way to handle human names in Pandas columns?
I am dealing with a large amount of data that includes the standard five columns for human names (prefix, firstname, middlename, lastname, suffix), and I would like to merge them into a separate column as a readable name. The issue I have is with handling blank values, which cause spacing problems. Also, I cannot modify the original columns. My current process feels a little insane (but it works!), so I am looking for a more elegant solution.
My current code:
def add_space_prefix(x):
    x = str(x)
    if len(x) > 0:
        return x + ' '
    else:
        return x

def add_space_middle(x):
    x = str(x)
    if len(x) > 0:
        return ' ' + x
    else:
        return x

def add_space_suffix(x):
    x = str(x)
    if len(x) > 0:
        return ', ' + x
    else:
        return x

df["middlename"] = df["middlename"].map(lambda x: add_space_middle(x))
df["prefix"] = df["prefix"].map(lambda x: add_space_prefix(x))
df["suffix"] = df["suffix"].map(lambda x: add_space_suffix(x))
df['fullname'] = df["prefix"] + df["firstname"] + df["middlename"] + ' ' + df["lastname"] + df['suffix']
Sample Dataframe
prefix firstname middlename lastname suffix fullname
0 Michael Hobart Jr. Michael Hobart, Jr.
1 Mr. Alan Lilt Mr. Alan Lilt
2 Jon A. Smith III Jon A. Smith, III
3 Joe Miller Joe Miller
4 Mika Jennifer Shabosky Mika Jennifer Shabosky
5 Mrs. Angela Calder Mrs. Angela Calder
6 Boris Al Bert Esq. Boris Al Bert, Esq.
7 Dr. Natasha Chorus Dr. Natasha Chorus
8 Bill Gibbons Bill Gibbons
Option 1
' '.join and pd.Series.str
In this solution we join the entire row by spaces. This may lead to spaces at the beginning or end of the string, or to two or more spaces in the middle. We handle this by chaining string accessor methods.
df.assign(
    lastname=df.lastname + ','
).apply(' '.join, 1).str.replace(r'\s+', ' ', regex=True).str.strip(' ,')
0 Michael Hobart, Jr.
1 Mr. Alan Lilt
2 Jon A. Smith, III
3 Joe Miller
4 Mika Jennifer Shabosky
5 Mrs. Angela Calder
6 Boris Al Bert, Esq.
7 Dr. Natasha Chorus
8 Bill Gibbons
dtype: object
df['fullname'] = df.assign(
    lastname=df.lastname + ','
).apply(' '.join, 1).str.replace(r'\s+', ' ', regex=True).str.strip(' ,')
df
prefix firstname middlename lastname suffix fullname
0 Michael Hobart Jr. Michael Hobart, Jr.
1 Mr. Alan Lilt Mr. Alan Lilt
2 Jon A. Smith III Jon A. Smith, III
3 Joe Miller Joe Miller
4 Mika Jennifer Shabosky Mika Jennifer Shabosky
5 Mrs. Angela Calder Mrs. Angela Calder
6 Boris Al Bert Esq. Boris Al Bert, Esq.
7 Dr. Natasha Chorus Dr. Natasha Chorus
8 Bill Gibbons Bill Gibbons
Option 2
list comprehension
In this solution, we perform the same steps as in the first solution, but we bundle the string operations together inside a comprehension.
import re
[re.sub(r'\s+', ' ', ' '.join(s)).strip(' ,') for s in df.assign(lastname=df.lastname + ',').values.tolist()]
['Michael Hobart, Jr.', 'Mr. Alan Lilt', 'Jon A. Smith, III', 'Joe Miller', 'Mika Jennifer Shabosky', 'Mrs. Angela Calder', 'Boris Al Bert, Esq.', 'Dr. Natasha Chorus', 'Bill Gibbons']
df['fullname'] = [re.sub(r'\s+', ' ', ' '.join(s)).strip(' ,') for s in df.assign(lastname=df.lastname + ',').values.tolist()]
df
prefix firstname middlename lastname suffix fullname
0 Michael Hobart Jr. Michael Hobart, Jr.
1 Mr. Alan Lilt Mr. Alan Lilt
2 Jon A. Smith III Jon A. Smith, III
3 Joe Miller Joe Miller
4 Mika Jennifer Shabosky Mika Jennifer Shabosky
5 Mrs. Angela Calder Mrs. Angela Calder
6 Boris Al Bert Esq. Boris Al Bert, Esq.
7 Dr. Natasha Chorus Dr. Natasha Chorus
8 Bill Gibbons Bill Gibbons
Option 3
pd.replace and pd.DataFrame.stack
This one is a bit different in that we replace blanks '' with np.nan, so that when we stack, the np.nan values are naturally dropped. This makes the joining with ' ' more straightforward.
import numpy as np
df.assign(
    lastname=df.lastname + ','
).replace('', np.nan).stack().groupby(level=0).apply(' '.join).str.strip(',')
0 Michael Hobart, Jr.
1 Mr. Alan Lilt
2 Jon A. Smith, III
3 Joe Miller
4 Mika Jennifer Shabosky
5 Mrs. Angela Calder
6 Boris Al Bert, Esq.
7 Dr. Natasha Chorus
8 Bill Gibbons
dtype: object
df['fullname'] = df.assign(
    lastname=df.lastname + ','
).replace('', np.nan).stack().groupby(level=0).apply(' '.join).str.strip(',')
df
prefix firstname middlename lastname suffix fullname
0 Michael Hobart Jr. Michael Hobart, Jr.
1 Mr. Alan Lilt Mr. Alan Lilt
2 Jon A. Smith III Jon A. Smith, III
3 Joe Miller Joe Miller
4 Mika Jennifer Shabosky Mika Jennifer Shabosky
5 Mrs. Angela Calder Mrs. Angela Calder
6 Boris Al Bert Esq. Boris Al Bert, Esq.
7 Dr. Natasha Chorus Dr. Natasha Chorus
8 Bill Gibbons Bill Gibbons
Timing
Bundling within a comprehension is fastest!
%timeit df.assign(fullname=df.replace('', np.nan).stack().groupby(level=0).apply(' '.join))
%timeit df.assign(fullname=df.apply(' '.join, 1).str.replace(r'\s+', ' ', regex=True).str.strip())
%timeit df.assign(fullname=[re.sub(r'\s+', ' ', ' '.join(s)).strip() for s in df.values.tolist()])
100 loops, best of 3: 2.51 ms per loop
1000 loops, best of 3: 979 µs per loop
1000 loops, best of 3: 384 µs per loop
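For comparison, the same idea can also be written as a plain Python function that skips blank parts instead of cleaning up the spacing afterwards. This is only a sketch: full_name is a made-up helper, and it assumes blanks are empty strings (as Option 3's replace('', np.nan) suggests) and not NaN:

```python
def full_name(prefix, first, middle, last, suffix):
    """Join the non-empty name parts with spaces; add the suffix after a comma."""
    parts = [p.strip() for p in (prefix, first, middle, last) if p and p.strip()]
    name = " ".join(parts)
    suffix = (suffix or "").strip()
    return f"{name}, {suffix}" if suffix else name

# Two rows from the sample data, passed in column order.
names = [
    full_name("", "Michael", "", "Hobart", "Jr."),
    full_name("Mr.", "Alan", "", "Lilt", ""),
]
print(names)
```

On a DataFrame this could be applied in a comprehension over the five name columns, for example [full_name(*row) for row in df[cols].values.tolist()] with cols listing them in order. No timing was done for this variant, so whether it beats the options above is untested.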