cleaning up web scrape data and combining together? - python
The website URL is https://www.justia.com/lawyers/criminal-law/maine
I want to scrape only the name of each lawyer and where their office is.
import requests
from bs4 import BeautifulSoup

url = "https://www.justia.com/lawyers/criminal-law/maine"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

Lawyer_name = soup.find_all("a", "url main-profile-link")
for i in Lawyer_name:
    print(i.find(text=True))

address = soup.find_all("span", "-address -hide-landscape-tablet")
for x in address:
    print(x.find_all(text=True))
The name prints out just fine, but the address is printing with extra whitespace that I want to remove:
['\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t88 Hammond Street', '\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tBangor,\t\t\t\t\tME 04401\t\t\t\t\t\t ']
The output I'm attempting to get for each lawyer looks like this (using the first lawyer as an example):
Hunter J Tzovarras
88 Hammond Street
Bangor, ME 04401
There are two issues I'm trying to figure out:
1. How can I clean up the address so it is easier to read?
2. How can I save the matching lawyer name with the address so they don't get mixed up?
Use x.get_text() instead of x.find_all(text=True):
for x in address:
print(x.get_text(strip=True))
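If you also want the street and the city/state on separate lines, as in the desired output, get_text accepts a separator argument. A minimal sketch; the tab runs left inside each text fragment still need collapsing, hence the re.sub:

import re

for x in address:
    # join the span's text fragments with a newline, then squeeze
    # the leftover runs of tabs/spaces inside each fragment
    text = x.get_text("\n", strip=True)
    print(re.sub(r"[ \t]+", " ", text))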
Full working code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.justia.com/lawyers/criminal-law/maine'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# each profile link carries the lawyer's name in its title attribute
names = [x.get('title').strip() for x in soup.select('a.lawyer-avatar')]
#print(names)

# strip the tab/newline padding out of each address block
addresses = [x.get_text(strip=True).replace('\t', '') for x in soup.find_all("span", class_="-address -hide-landscape-tablet")]
#print(addresses)

# note: columns must be a flat list -- a nested [['Lawyer_name','address']]
# would create a MultiIndex column header
df = pd.DataFrame(data=list(zip(names, addresses)), columns=['Lawyer_name', 'address'])
print(df)
Output:
Lawyer_name address
0 William T. Bly Esq 119 Main StreetKennebunk,ME 04043
1 John S. Webb 949 Main StreetSanford,ME 04073
2 William T. Bly Esq 20 Oak StreetEllsworth,ME 04605
3 Christopher Causey Esq 16 Middle StSaco,ME 04072
4 Robert Van Horn 88 Hammond StreetBangor,ME 04401
5 John S. Webb 37 Western Ave., Unit #307Kennebunk,ME 04043
6 Hunter J Tzovarras 4 Union Park RoadTopsham,ME 04086
7 Michael Stephen Bowser Jr. 241 Main StreetP.O. Box 57Saco,ME 04072
8 Richard Regan 6 City CenterSuite 301Portland,ME 04101
9 Robert Guillory Esq 75 Pearl St. Suite 400Portland,ME 04101
10 Dylan R. Boyd 160 Capitol StreetP.O. Box 79Augusta,ME 04332
11 Luke Rioux Esq 10 Stoney Brook LaneLyman,ME 04002
12 David G. Webbert 15 Columbia Street, Ste. 301Bangor,ME 04401
13 Amy Fairfield 32 Saco AveOld Orchard Beach,ME 04064
14 Mr. Richard Lyman Hartley 62 Portland Rd., Ste. 44Kennebunk,ME 04043
15 Neal L Weinstein Esq 647 U.S. Route One#203York,ME 03909
16 Albert Hansen 76 Tandberg Trail (Route 115)Windham,ME 04062
17 Russell Goldsmith Esq Two Canal PlazaPO Box 4600Portland,ME 04112
18 Miklos Pongratz Esq 18 Market Square Suite 5Houlton,ME 04730
19 Bradford Pattershall Esq 5 Island View DrCumberland Foreside,ME 04110
20 Michele D L Kenney 12 Silver StreetP.O. Box 559Waterville,ME 04903
21 John Simpson 344 Mount Hope Ave.Bangor,ME 04402
22 Mariah America Gleaton 192 Main StreetEllsworth,ME 04605
23 Wayne Foote Esq 85 Brackett StreetPortland,ME 04102
24 Will Ashe 16 Union StreetBrunswick,ME 04011
25 Peter J Cyr Esq 482 Congress Street Suite 402Portland,ME 04101
26 Jonathan Steven Handelman Esq PO Box 335York,ME 03909
27 Richard Smith Berne 36 Ossipee Trl W.Standish,ME 04084
28 Meredith G. Schmid 75 Pearl St.Suite 216Portland,ME 04101
29 Gregory LeClerc 28 Long Sands Road, Suite 5York,ME 03909
30 Cory McKenna 20 Mechanic StCamden,ME 04843
31 Thomas P. Elias P.O. Box 1049304 Hancock St. Suite 1KBangor,ME...
32 Christopher MacLean 1250 Forest Avenue, Ste 3APortland,ME 04103
33 Zachary J. Smith 415 Congress StreetSuite 202Portland,ME 04101
34 Stephen Sweatt 919 Ridge RoadP.O. BOX 119Bowdoinham,ME 04008
35 Michael Turndorf Esq 1250 Forest Avenue, Ste 3APortland,ME 04103
36 Andrews Bruce Campbell Esq 133 State StreetAugusta,ME 04330
37 Timothy Zerillo 110 Portland StreetFryeburg,ME 04037
38 Walter McKee Esq 440 Walnut Hill RdNorth Yarmouth,ME 04097
39 Shelley Carter 70 State StreetEllsworth,ME 04605
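A caveat on this approach: zipping two independently collected lists keeps names and addresses aligned only as long as every listing yields exactly one of each, and the two selectors above come from different parts of the page, which is why some pairs in the output above look shuffled. A more defensive pattern is to walk each result card and read both fields from the same element. A sketch, assuming each listing sits in its own container; the div.jcard selector is a guess and should be verified against the page's actual markup:

rows = []
for card in soup.select("div.jcard"):  # hypothetical per-listing container
    name_tag = card.select_one("a.lawyer-avatar")
    addr_tag = card.select_one("span.-address")
    if name_tag and addr_tag:  # skip cards missing either field
        rows.append({"Lawyer_name": name_tag.get("title", "").strip(),
                     "address": addr_tag.get_text(" ", strip=True)})

df = pd.DataFrame(rows)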
For your second query, you can save them into a dictionary like this:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.justia.com/lawyers/criminal-law/maine'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# parse all names and save them in a list
lawyer_names = soup.find_all("a", "url main-profile-link")
lawyer_names = [name.find(text=True).strip() for name in lawyer_names]

# parse all addresses and save them in a list
lawyer_addresses = soup.find_all("span", "-address -hide-landscape-tablet")
lawyer_addresses = [re.sub(r'\s+', ' ', address.get_text(strip=True)) for address in lawyer_addresses]

# map names to addresses
lawyer_dict = dict(zip(lawyer_names, lawyer_addresses))
print(lawyer_dict)

Output dictionary:
{'Albert Hansen': '62 Portland Rd., Ste. 44Kennebunk, ME 04043',
'Amber Lynn Tucker': '415 Congress St., Ste. 202P.O. Box 7542Portland, ME 04112',
'Amy Fairfield': '10 Stoney Brook LaneLyman, ME 04002',
'Andrews Bruce Campbell Esq': '919 Ridge RoadP.O. BOX 119Bowdoinham, ME 04008',
'Bradford Pattershall Esq': 'Two Canal PlazaPO Box 4600Portland, ME 04112',
'Christopher Causey Esq': '949 Main StreetSanford, ME 04073',
'Cory McKenna': '75 Pearl St.Suite 216Portland, ME 04101',
'David G. Webbert': '160 Capitol StreetP.O. Box 79Augusta, ME 04332',
'David Nelson Wood Esq': '120 Main StreetSuite 110Saco, ME 04072',
'Dylan R. Boyd': '6 City CenterSuite 301Portland, ME 04101',
'Gregory LeClerc': '36 Ossipee Trl W.Standish, ME 04084',
'Hunter J Tzovarras': '88 Hammond StreetBangor, ME 04401',
'John S. Webb': '16 Middle StSaco, ME 04072',
'John Simpson': '5 Island View DrCumberland Foreside, ME 04110',
'Jonathan Steven Handelman Esq': '16 Union StreetBrunswick, ME 04011',
'Luke Rioux Esq': '75 Pearl St. Suite 400Portland, ME 04101',
'Mariah America Gleaton': '12 Silver StreetP.O. Box 559Waterville, ME 04903',
'Meredith G. Schmid': 'PO Box 335York, ME 03909',
'Michael Stephen Bowser Jr.': '37 Western Ave., Unit #307Kennebunk, ME 04043',
'Michael Turndorf Esq': '415 Congress StreetSuite 202Portland, ME 04101',
'Michele D L Kenney': '18 Market Square Suite 5Houlton, ME 04730',
'Miklos Pongratz Esq': '76 Tandberg Trail (Route 115)Windham, ME 04062',
'Mr. Richard Lyman Hartley': '15 Columbia Street, Ste. 301Bangor, ME 04401',
'Neal L Weinstein Esq': '32 Saco AveOld Orchard Beach, ME 04064',
'Peter J Cyr Esq': '85 Brackett StreetPortland, ME 04102',
'Richard Regan': '4 Union Park RoadTopsham, ME 04086',
'Richard Smith Berne': '482 Congress Street Suite 402Portland, ME 04101',
'Robert Guillory Esq': '241 Main StreetP.O. Box 57Saco, ME 04072',
'Robert Van Horn': '20 Oak StreetEllsworth, ME 04605',
'Russell Goldsmith Esq': '647 U.S. Route One#203York, ME 03909',
'Shelley Carter': '110 Portland StreetFryeburg, ME 04037',
'Thaddeus Day Esq': '440 Walnut Hill RdNorth Yarmouth, ME 04097',
'Thomas P. Elias': '28 Long Sands Road, Suite 5York, ME 03909',
'Timothy Zerillo': '1250 Forest Avenue, Ste 3APortland, ME 04103',
'Todd H Crawford Jr': '1288 Roosevelt Trl, Ste #3P.O. Box 753Raymond, ME 04071',
'Walter McKee Esq': '133 State StreetAugusta, ME 04330',
'Wayne Foote Esq': '344 Mount Hope Ave.Bangor, ME 04402',
'Will Ashe': '192 Main StreetEllsworth, ME 04605',
'William T. Bly Esq': '119 Main StreetKennebunk, ME 04043',
'Zachary J. Smith': 'P.O. Box 1049304 Hancock St. Suite 1KBangor, ME 04401'}
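One caveat with dict(zip(...)): dictionary keys are unique, so lawyers who appear more than once in the listing (as some do in the DataFrame output above) collapse to a single entry. If duplicates matter, keep the pairs in a list instead; a minimal sketch that also writes them to CSV:

import csv

# keep every (name, address) pair, duplicates included
lawyer_pairs = list(zip(lawyer_names, lawyer_addresses))

with open("lawyers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Lawyer_name", "address"])
    writer.writerows(lawyer_pairs)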
Related
Draw a Map of cities in python
I have a ranking of cities across the world in a variable called rank_2000 that looks like this:

Seoul Tokyo Paris New_York_Greater Shizuoka Chicago Minneapolis Boston Austin Munich Salt_Lake Greater_Sydney Houston Dallas London San_Francisco_Greater Berlin Seattle Toronto Stockholm Atlanta Indianapolis Fukuoka San_Diego Phoenix Frankfurt_am_Main Stuttgart Grenoble Albany Singapore Washington_Greater Helsinki Nuremberg Detroit_Greater TelAviv Zurich Hamburg Pittsburgh Philadelphia_Greater Taipei Los_Angeles_Greater Miami_Greater MannheimLudwigshafen Brussels Milan Montreal Dublin Sacramento Ottawa Vancouver Malmo Karlsruhe Columbus Dusseldorf Shenzen Copenhagen Milwaukee Marseille Greater_Melbourne Toulouse Beijing Dresden Manchester Lyon Vienna Shanghai Guangzhou San_Antonio Utrecht New_Delhi Basel Oslo Rome Barcelona Madrid Geneva Hong_Kong Valencia Edinburgh Amsterdam Taichung The_Hague Bucharest Muenster Greater_Adelaide Chengdu Greater_Brisbane Budapest Manila Bologna Quebec Dubai Monterrey Wellington Shenyang Tunis Johannesburg Auckland Hangzhou Athens Wuhan Bangalore Chennai Istanbul Cape_Town Lima Xian Bangkok Penang Luxembourg Buenos_Aires Warsaw Greater_Perth Kuala_Lumpur Santiago Lisbon Dalian Zhengzhou Prague Changsha Chongqing Ankara Fuzhou Jinan Xiamen Sao_Paulo Kunming Jakarta Cairo Curitiba Riyadh Rio_de_Janeiro Mexico_City Hefei Almaty Beirut Belgrade Belo_Horizonte Bogota_DC Bratislava Dhaka Durban Hanoi Ho_Chi_Minh_City Kampala Karachi Kuwait_City Manama Montevideo Panama_City Quito San_Juan

What I would like to do is a map of the world where those cities are colored according to their position in the ranking above. I am open to other solutions for the representation (such as bubbles of increasing size according to the position of the cities in the rank or, if necessary, representing only a sample of cities taken from the top of the rank, the middle, and the bottom). Thank you, Federico
Your question has two parts: finding the location of each city, and then drawing the cities on the map. Assuming you have the latitude and longitude of each city, here's how you'd tackle the latter part. I like Folium (https://pypi.org/project/folium/) for drawing maps. Here's an example of how you might draw a circle for each city, where its position in the list determines the size of that circle.

import folium

cities = [
    {'name': 'Seoul', 'coords': [37.5639715, 126.9040468]},
    {'name': 'Tokyo', 'coords': [35.5090627, 139.2094007]},
    {'name': 'Paris', 'coords': [48.8588787, 2.2035149]},
    {'name': 'New York', 'coords': [40.6976637, -74.1197631]},
    # etc. etc.
]

m = folium.Map(zoom_start=15)

for counter, city in enumerate(cities):
    circle_size = 5 + counter
    folium.CircleMarker(
        location=city['coords'],
        radius=circle_size,
        popup=city['name'],
        color="crimson",
        fill=True,
        fill_color="crimson",
    ).add_to(m)

m.save('map.html')

The output is an HTML file (map.html) showing a crimson circle over each city. You may need to adjust the circle_size calculation a little to work with the number of cities you want to include.
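For the first part, a geocoding library can turn the city names into coordinates. A sketch using geopy's Nominatim geocoder; the user_agent string is an arbitrary placeholder, and Nominatim is rate-limited, so treat this as illustrative only:

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="city-rank-map-example")  # placeholder agent name

cities = []
for name in ["Seoul", "Tokyo", "Paris"]:
    location = geolocator.geocode(name)
    if location:  # skip names the geocoder cannot resolve
        cities.append({"name": name, "coords": [location.latitude, location.longitude]})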
webscraping stars from imdb page using beautifulsoup
I am trying to get the names of the stars from an IMDb page. Below is my code:

from requests import get
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/search/title/?title_type=tv_movie,tv_series&user_rating=6.0,10.0&adult=include&ref_=adv_prv'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
movie_containers = html_soup.find_all('div', 'lister-item mode-advanced')
first_movie = movie_containers[0]
first_stars = first_movie.select('a[href*="name"]')
first_stars

I got the following output:

[Bob Odenkirk, Rhea Seehorn, Jonathan Banks, Michael Mando]

I am trying to get only the names of the stars, and first_stars.text gives the following error:

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

When I tried

first_stars = first_movie.find('a[href*="name"]')
first_stars.text

I also got the following error:

AttributeError: 'NoneType' object has no attribute 'text'

Any idea how I can extract only the names of the stars?
If you need the star names without distinction, this might help:

block = soup.find_all("div", attrs={"class": "lister-item mode-advanced"})
starList = list()
for star in block:
    starList.append(star.find("p", attrs={"class": ""}).text.replace("Stars:", "").replace("\n", "").strip())
print(starList)

It prints:

['Bob Odenkirk, Rhea Seehorn, Jonathan Banks, Michael Mando', 'Jason Bateman, Laura Linney, Sofia Hublitz, Skylar Gaertner', 'Gia Sandhu, Anson Mount, Ethan Peck, Jess Bush', 'Josh Brolin, Imogen Poots, Lili Taylor, Tom Pelphrey', 'Pablo Schreiber, Shabana Azmi, Natasha Culzac, Olive Gray', 'Titus Welliver, Mimi Rogers, Madison Lintz, Stephen A. Chang', 'Rachel Griffiths, Sophia Ali, Shannon Berry, Jenna Clause', 'Adam Scott, Zach Cherry, Britt Lower, Tramell Tillman', 'Milo Ventimiglia, Mandy Moore, Sterling K. Brown, Chrissy Metz', 'Emilia Clarke, Peter Dinklage, Kit Harington, Lena Headey', 'Bryan Cranston, Aaron Paul, Anna Gunn, Betsy Brandt', 'Millie Bobby Brown, Finn Wolfhard, Winona Ryder, David Harbour', 'Joe Locke, Kit Connor, Yasmin Finney, William Gao', 'John C. Reilly, Quincy Isaiah, Jason Clarke, Gaby Hoffmann', 'Bill Hader, Stephen Root, Sarah Goldberg, Anthony Carrigan', 'Kaley Cuoco, Zosia Mamet, Griffin Matthews, Rosie Perez', 'Caitríona Balfe, Sam Heughan, Sophie Skelton, Richard Rankin', 'Evan Rachel Wood, Jeffrey Wright, Ed Harris, Thandiwe Newton', 'Patrick Stewart, Alison Pill, Michelle Hurd, Santiago Cabrera', 'Nicola Coughlan, Jonathan Bailey, Ruth Gemmell, Florence Hunt', 'Andrew Lincoln, Norman Reedus, Melissa McBride, Lauren Cohan', 'Cillian Murphy, Paul Anderson, Sophie Rundle, Helen McCrory', 'Manuel Garcia-Rulfo, Becki Newton, Neve Campbell, Christopher Gorham', 'Jane Fonda, Lily Tomlin, Sam Waterston, Martin Sheen', 'Ellen Pompeo, Chandra Wilson, James Pickens Jr., Justin Chambers', 'Alexander Dreymon, Eliza Butterworth, Arnas Fedaravicius, Mark Rowley', 'Luke Grimes, Kelly Reilly, Wes Bentley, Cole Hauser', 'Elisabeth Moss, Wagner Moura, Phillipa Soo, Chris Chalk', "Scott Whyte, Nolan North, Steven Pacey, Emily O'Brien", 'Steve Carell, Jenna Fischer, John Krasinski, Rainn Wilson', 'Jodie Whittaker, Peter Capaldi, Pearl Mackie, Matt Smith', 'Ansel Elgort, Ken Watanabe, Rachel Keller, Shô Kasamatsu', 'James Spader, Megan Boone, Diego Klattenhoff, Ryan Eggold', 'Mark Harmon, David McCallum, Sean Murray, Pauley Perrette', 'Zendaya, Hunter Schafer, Angus Cloud, Jacob Elordi', 'Niv Sultan, Shaun Toub, Shervin Alenabi, Arash Marandi', 'Asa Butterfield, Gillian Anderson, Emma Mackey, Ncuti Gatwa', 'Jack Lowden, Kristin Scott Thomas, Gary Oldman, Chris Reilly', 'Karl Urban, Jack Quaid, Antony Starr, Erin Moriarty', 'Mariska Hargitay, Christopher Meloni, Ice-T, Dann Florek', "Nathan Fillion, Alyssa Diaz, Richard T. Jones, Melissa O'Neil", "Saoirse-Monica Jackson, Louisa Harland, Tara Lynne O'Neill, Kathy Kiera Clarke", 'Donald Glover, Brian Tyree Henry, LaKeith Stanfield, Zazie Beetz', 'Jennifer Aniston, Courteney Cox, Lisa Kudrow, Matt LeBlanc', 'Jared Padalecki, Jensen Ackles, Jim Beaver, Misha Collins', 'Julia Roberts, Sean Penn, Dan Stevens, Betty Gilpin', 'James Gandolfini, Lorraine Bracco, Edie Falco, Michael Imperioli', 'Natasha Lyonne, Charlie Barnett, Greta Lee, Elizabeth Ashley', 'Jean Smart, Hannah Einbinder, Carl Clemons-Hopkins, Rose Abdoo', 'Katheryn Winnick, Gustaf Skarsgård, Alexander Ludwig, Georgia Hirst']

Or if you need both the title and its stars:

block = soup.find_all("div", attrs={"class": "lister-item mode-advanced"})
starList = list()
for star in block:
    movieDict = {
        "moviename": star.find("h3", attrs={"class": "lister-item-header"}).text.split("\n")[2],
        "stars": star.find("p", attrs={"class": ""}).text.replace("Stars:", "").replace("\n", "").strip()
    }
    starList.append(movieDict)
print(starList)

This will print:

[{'moviename': 'Better Call Saul', 'stars': 'Bob Odenkirk, Rhea Seehorn, Jonathan Banks, Michael Mando'}, {'moviename': 'Ozark', 'stars': 'Jason Bateman, Laura Linney, Sofia Hublitz, Skylar Gaertner'}, {'moviename': 'Star Trek: Strange New Worlds', 'stars': 'Gia Sandhu, Anson Mount, Ethan Peck, Jess Bush'}, {'moviename': 'Outer Range', 'stars': 'Josh Brolin, Imogen Poots, Lili Taylor, Tom Pelphrey'}, {'moviename': 'Halo', 'stars': 'Pablo Schreiber, Shabana Azmi, Natasha Culzac, Olive Gray'}, {'moviename': 'Bosch: Legacy', 'stars': 'Titus Welliver, Mimi Rogers, Madison Lintz, Stephen A. Chang'}, {'moviename': 'The Wilds', 'stars': 'Rachel Griffiths, Sophia Ali, Shannon Berry, Jenna Clause'}, {'moviename': 'Severance', 'stars': 'Adam Scott, Zach Cherry, Britt Lower, Tramell Tillman'}, {'moviename': 'This Is Us', 'stars': 'Milo Ventimiglia, Mandy Moore, Sterling K. Brown, Chrissy Metz'}, {'moviename': 'Game of Thrones', 'stars': 'Emilia Clarke, Peter Dinklage, Kit Harington, Lena Headey'}, {'moviename': 'Breaking Bad', 'stars': 'Bryan Cranston, Aaron Paul, Anna Gunn, Betsy Brandt'}, {'moviename': 'Stranger Things', 'stars': 'Millie Bobby Brown, Finn Wolfhard, Winona Ryder, David Harbour'}, {'moviename': 'Heartstopper', 'stars': 'Joe Locke, Kit Connor, Yasmin Finney, William Gao'}, {'moviename': 'Winning Time: The Rise of the Lakers Dynasty', 'stars': 'John C. Reilly, Quincy Isaiah, Jason Clarke, Gaby Hoffmann'}, {'moviename': 'Barry', 'stars': 'Bill Hader, Stephen Root, Sarah Goldberg, Anthony Carrigan'}, {'moviename': 'The Flight Attendant', 'stars': 'Kaley Cuoco, Zosia Mamet, Griffin Matthews, Rosie Perez'}, {'moviename': 'Outlander', 'stars': 'Caitríona Balfe, Sam Heughan, Sophie Skelton, Richard Rankin'}, {'moviename': 'Westworld', 'stars': 'Evan Rachel Wood, Jeffrey Wright, Ed Harris, Thandiwe Newton'}, {'moviename': 'Star Trek: Picard', 'stars': 'Patrick Stewart, Alison Pill, Michelle Hurd, Santiago Cabrera'}, {'moviename': 'Bridgerton', 'stars': 'Nicola Coughlan, Jonathan Bailey, Ruth Gemmell, Florence Hunt'}, {'moviename': 'The Walking Dead', 'stars': 'Andrew Lincoln, Norman Reedus, Melissa McBride, Lauren Cohan'}, {'moviename': 'Peaky Blinders', 'stars': 'Cillian Murphy, Paul Anderson, Sophie Rundle, Helen McCrory'}, {'moviename': 'The Lincoln Lawyer', 'stars': 'Manuel Garcia-Rulfo, Becki Newton, Neve Campbell, Christopher Gorham'}, {'moviename': 'Grace and Frankie', 'stars': 'Jane Fonda, Lily Tomlin, Sam Waterston, Martin Sheen'}, {'moviename': "Grey's Anatomy", 'stars': 'Ellen Pompeo, Chandra Wilson, James Pickens Jr., Justin Chambers'}, {'moviename': 'The Last Kingdom', 'stars': 'Alexander Dreymon, Eliza Butterworth, Arnas Fedaravicius, Mark Rowley'}, {'moviename': 'Yellowstone', 'stars': 'Luke Grimes, Kelly Reilly, Wes Bentley, Cole Hauser'}, {'moviename': 'Shining Girls', 'stars': 'Elisabeth Moss, Wagner Moura, Phillipa Soo, Chris Chalk'}, {'moviename': 'Love, Death & Robots', 'stars': "Scott Whyte, Nolan North, Steven Pacey, Emily O'Brien"}, {'moviename': 'The Office', 'stars': 'Steve Carell, Jenna Fischer, John Krasinski, Rainn Wilson'}, {'moviename': 'Doctor Who', 'stars': 'Jodie Whittaker, Peter Capaldi, Pearl Mackie, Matt Smith'}, {'moviename': 'Tokyo Vice', 'stars': 'Ansel Elgort, Ken Watanabe, Rachel Keller, Shô Kasamatsu'}, {'moviename': 'The Blacklist', 'stars': 'James Spader, Megan Boone, Diego Klattenhoff, Ryan Eggold'}, {'moviename': 'NCIS: Naval Criminal Investigative Service', 'stars': 'Mark Harmon, David McCallum, Sean Murray, Pauley Perrette'}, {'moviename': 'Euphoria', 'stars': 'Zendaya, Hunter Schafer, Angus Cloud, Jacob Elordi'}, {'moviename': 'Tehran', 'stars': 'Niv Sultan, Shaun Toub, Shervin Alenabi, Arash Marandi'}, {'moviename': 'Sex Education', 'stars': 'Asa Butterfield, Gillian Anderson, Emma Mackey, Ncuti Gatwa'}, {'moviename': 'Slow Horses', 'stars': 'Jack Lowden, Kristin Scott Thomas, Gary Oldman, Chris Reilly'}, {'moviename': 'The Boys', 'stars': 'Karl Urban, Jack Quaid, Antony Starr, Erin Moriarty'}, {'moviename': 'Law & Order: Special Victims Unit', 'stars': 'Mariska Hargitay, Christopher Meloni, Ice-T, Dann Florek'}, {'moviename': 'The Rookie', 'stars': "Nathan Fillion, Alyssa Diaz, Richard T. Jones, Melissa O'Neil"}, {'moviename': 'Derry Girls', 'stars': "Saoirse-Monica Jackson, Louisa Harland, Tara Lynne O'Neill, Kathy Kiera Clarke"}, {'moviename': 'Atlanta', 'stars': 'Donald Glover, Brian Tyree Henry, LaKeith Stanfield, Zazie Beetz'}, {'moviename': 'Friends', 'stars': 'Jennifer Aniston, Courteney Cox, Lisa Kudrow, Matt LeBlanc'}, {'moviename': 'Supernatural', 'stars': 'Jared Padalecki, Jensen Ackles, Jim Beaver, Misha Collins'}, {'moviename': 'Gaslit', 'stars': 'Julia Roberts, Sean Penn, Dan Stevens, Betty Gilpin'}, {'moviename': 'The Sopranos', 'stars': 'James Gandolfini, Lorraine Bracco, Edie Falco, Michael Imperioli'}, {'moviename': 'Russian Doll', 'stars': 'Natasha Lyonne, Charlie Barnett, Greta Lee, Elizabeth Ashley'}, {'moviename': 'Hacks', 'stars': 'Jean Smart, Hannah Einbinder, Carl Clemons-Hopkins, Rose Abdoo'}, {'moviename': 'Vikings', 'stars': 'Katheryn Winnick, Gustaf Skarsgård, Alexander Ludwig, Georgia Hirst'}]
You have to iterate the ResultSet:

first_stars = [s.text for s in first_movie.select('a[href*="name"]')]
first_stars

Output:

['Bob Odenkirk', 'Rhea Seehorn', 'Jonathan Banks', 'Michael Mando']
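If you then want a single comma-separated string like the first answer produces, join the list; a minimal sketch:

stars_line = ", ".join(first_stars)
print(stars_line)  # Bob Odenkirk, Rhea Seehorn, Jonathan Banks, Michael Mando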
Delete the rows with repeated characters in the dataframe
I have a large dataset from a csv file to clean using the patterns I've identified, but I can't upload the file here, so I've hardcoded a small sample to give an overview of what I'm looking for. The identified patterns are the repeated characters in the values. However, if you look at the dataframe below, there are repeated 'single characters' like ssssss, fffff, aaaaa, etc., repeated 'double characters' like dgdg, bvbvbv, tutu, etc., and repeated 'triple characters' such as yutyut and fdgfdg. Given this, would it also be possible to delete the rows with ANY repeated single/double/triple characters, so that I can apply the same rule to the large dataset? For example, the dataframe here only shows the patterns I identified above; however, the large dataset could contain repeated characters of ANY letters, like 'uuuu', 'zzzz', 'eded', 'rsrsrs', 'xyzxyz', etc.

        Address1        Address2            Address3        Address4
0    High Street     Park Avenue     St. John’s Road       The Grove
1        wssssss    The Crescent             tyutyut       Mill Road
2      qfdgfdgdg        dddfffff  qdffgfdgfggfbvbvbv  sefsdfdyuytutu
3     Green Lane  Highfield Road    Springfield Road     School Lane
4       Kingsway    Stanley Road       George Street     Albert Road
5  Church Street      New Street           Queensway        Broadway
6       qaaaaass          mjkhjk           chfghfghh         fghfhfh

Here's the code:

import pandas as pd
import numpy as np

data = {'Address1': ['High Street', 'wssssss', 'qfdgfdgdg', 'Green Lane', 'Kingsway', 'Church Street', 'qaaaaass'],
        'Address2': ['Park Avenue', 'The Crescent', 'dddfffff', 'Highfield Road', 'Stanley Road', 'New Street', 'mjkhjk'],
        'Address3': ['St. John’s Road', 'tyutyut', 'qdffgfdgfggfbvbvbv', 'Springfield Road', 'George Street', 'Queensway', 'chfghfghh'],
        'Address4': ['The Grove', 'Mill Road', 'sefsdfdyuytutu', 'School Lane', 'Albert Road', 'Broadway', 'fghfhfh']}
address_details = pd.DataFrame(data)

# Code to delete the rows with the identified patterns goes here

print(address_details)

The output I expect is:

        Address1        Address2          Address3     Address4
0    High Street     Park Avenue   St. John’s Road    The Grove
1     Green Lane  Highfield Road  Springfield Road  School Lane
2       Kingsway    Stanley Road     George Street  Albert Road
3  Church Street      New Street         Queensway     Broadway

Please advise, thank you!
Try with str.contains and loc with agg:

print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"(.)\1+\b"), axis=1).any(1)])

Output:

        Address1        Address2          Address3     Address4
0    High Street     Park Avenue   St. John’s Road    The Grove
3     Green Lane  Highfield Road  Springfield Road  School Lane
4       Kingsway    Stanley Road     George Street  Albert Road
5  Church Street      New Street         Queensway     Broadway

Or, if you care about the index:

print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"(.)\1+\b"), axis=1).any(1)].reset_index(drop=True))

Output:

        Address1        Address2          Address3     Address4
0    High Street     Park Avenue   St. John’s Road    The Grove
1     Green Lane  Highfield Road  Springfield Road  School Lane
2       Kingsway    Stanley Road     George Street  Albert Road
3  Church Street      New Street         Queensway     Broadway

Edit: for only lowercase letters, try:

print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"([a-z]+)\1{1,}\b"), axis=1).any(1)].reset_index(drop=True))
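To see why the edited pattern also catches the double/triple repeats: (.)\1+ only matches one character immediately repeated, while ([a-z]+)\1{1,} matches any run of lowercase letters followed by itself. A quick standalone check, as a sketch:

import re

samples = ["wssssss", "tyutyut", "bvbvbv", "High Street"]
for s in samples:
    single = bool(re.search(r"(.)\1+\b", s))         # single char repeated
    multi = bool(re.search(r"([a-z]+)\1{1,}\b", s))  # group of letters repeated
    print(s, single, multi)
    # wssssss True True / tyutyut False True / bvbvbv False True / High Street False False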
Python Scaling loops
For each letter in the alphabet, the code should go to website.com/a and grab a table. Then it should check for a next button, grab the link, make soup, grab the next table, and repeat until there is no valid next link. Then it should move to website.com/b (the next letter in the alphabet) and repeat. But I can only get as far as 2 pages for each letter: the first for loop grabs page 1 and the second grabs page 2 for each letter. I know I could write a loop for as many pages as needed, but that is not scalable. How can I fix this?

from nfl_fun import make_soup
import urllib.request
import os
from string import ascii_lowercase
import requests

letter = ascii_lowercase
link = "https://www.nfl.com"

for letter in ascii_lowercase:
    soup = make_soup(f"https://www.nfl.com/players/active/{letter}")
    for tbody in soup.findAll("tbody"):
        for tr in tbody.findAll("a"):
            if tr.has_attr("href"):
                print(tr.attrs["href"])

for letter in ascii_lowercase:
    soup = make_soup(f"https://www.nfl.com/players/active/{letter}")
    for page in soup.footer.findAll("a", {"nfl-o-table-pagination__next"}):
        footer = page.attrs["href"]
        pagelink = f"{link}{footer}"
        print(footer)
        getpage = requests.get(pagelink)
        if getpage.status_code == 200:
            next_soup = make_soup(pagelink)
            for next_page in next_soup.footer.findAll("a", {"nfl-o-table-pagination__next"}):
                print(getpage)
            for tbody in next_soup.findAll("tbody"):
                for tr in tbody.findAll("a"):
                    if tr.has_attr("href"):
                        print(tr.attrs["href"])
            soup = next_soup

Thank you again,
There is an element in there that says when the "Next" button is inactive, and that tells you when you are on the last page. So what you can do is a while loop that keeps going to the next page until it reaches the last page (i.e. "Next" is inactive), then stops and moves on to the next letter:

from bs4 import BeautifulSoup
from string import ascii_lowercase
import requests
import pandas as pd
import re

letters = ascii_lowercase
link = "https://www.nfl.com"

results = pd.DataFrame()
for letter in letters:
    continueToNextPage = True
    after = ''
    page = 1
    while continueToNextPage == True:
        # Get the table
        url = f"https://www.nfl.com/players/active/{letter}?query={letter}&after={after}"
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        temp_df = pd.read_html(response.text)[0]
        results = results.append(temp_df, sort=False).reset_index(drop=True)
        print("{letter}: Page: {page}".format(letter=letter.upper(), page=page))

        # Check if the next page is inactive
        buttons = soup.find('div', {'class': 'nfl-o-table-pagination__buttons'})
        regex = re.compile('.*pagination__next.*is-inactive.*')
        if buttons.find('span', {'class': regex}):
            continueToNextPage = False
        else:
            after = buttons.find('a', {'title': 'Next'})['href'].split('after=')[-1]
            page += 1

Output:

print(results)

      Player Current Team Position Status
0 Chidobe Awuzie Dallas Cowboys CB ACT
1 Josh Avery Seattle Seahawks DT ACT
2 Genard Avery Philadelphia Eagles DE ACT
3 Anthony Averett Baltimore Ravens CB ACT
4 Lee Autry Chicago Bears DT ACT
5 Denico Autry Indianapolis Colts DT ACT
6 Tavon Austin Dallas Cowboys WR UFA
7 Blessuan Austin New York Jets CB ACT
8 Antony Auclair Tampa Bay Buccaneers TE ACT
9 Jeremiah Attaochu Denver Broncos LB ACT
10 Hunter Atkinson Atlanta Falcons OT ACT
11 John Atkins Detroit Lions DE ACT
12 Geno Atkins Cincinnati Bengals DT ACT
13 Marcell Ateman Las Vegas Raiders WR ACT
14 George Aston New York Giants RB ACT
15 Dravon Askew-Henry New York Giants DB ACT
16 Devin Asiasi New England Patriots TE ACT
17 George Asafo-Adjei New York Giants OT ACT
18 Ade Aruna Las Vegas Raiders DE ACT
19 Grayland Arnold Philadelphia Eagles SAF ACT
20 Dan Arnold Arizona Cardinals TE ACT
21 Damon Arnette Las Vegas Raiders CB UDF
22 Ray-Ray Armstrong Dallas Cowboys LB UFA
23 Ka'John Armstrong Denver Broncos OT ACT
24 Dorance Armstrong Dallas Cowboys DE ACT
25 Cornell Armstrong Houston Texans CB ACT
26 Terron Armstead New Orleans Saints OT ACT
27 Ryquell Armstead Jacksonville Jaguars RB ACT
28 Arik Armstead San Francisco 49ers DE ACT
29 Alex Armah Carolina Panthers FB ACT
... ... ... ... ...
3180 Clive Walford Miami Dolphins TE UFA
3181 Cameron Wake Tennessee Titans DE UFA
3182 Corliss Waitman Pittsburgh Steelers P ACT
3183 Rick Wagner Green Bay Packers OT ACT
3184 Bobby Wagner Seattle Seahawks MLB ACT
3185 Ahmad Wagner Chicago Bears WR ACT
3186 Colby Wadman Denver Broncos P ACT
3187 Christian Wade Buffalo Bills RB ACT
3188 LaAdrian Waddle Buffalo Bills OT UFA
3189 Oshane Ximines New York Giants LB ACT
3190 Trevon Young Cleveland Browns DE ACT
3191 Sam Young Las Vegas Raiders OT ACT
3192 Kenny Young Los Angeles Rams ILB ACT
3193 Chase Young Washington Redskins DE UDF
3194 Bryson Young Atlanta Falcons DE ACT
3195 Isaac Yiadom Denver Broncos CB ACT
3196 T.J. Yeldon Buffalo Bills RB ACT
3197 Deon Yelder Kansas City Chiefs TE ACT
3198 Rock Ya-Sin Indianapolis Colts CB ACT
3199 Eddie Yarbrough Minnesota Vikings DE ACT
3200 Marshal Yanda Baltimore Ravens OG ACT
3201 Tavon Young Baltimore Ravens CB ACT
3202 Brandon Zylstra Carolina Panthers WR ACT
3203 Jabari Zuniga New York Jets DE UDF
3204 Greg Zuerlein Dallas Cowboys K ACT
3205 Isaiah Zuber New England Patriots WR ACT
3206 Justin Zimmer Cleveland Browns DT ACT
3207 Anthony Zettel Minnesota Vikings DE ACT
3208 Kevin Zeitler New York Giants OG ACT
3209 Olamide Zaccheaus Atlanta Falcons WR ACT
[3210 rows x 4 columns]
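One maintenance note on the answer above: DataFrame.append was deprecated and removed in pandas 2.0, so on current pandas the accumulation step fails. The usual replacement is to collect the per-page frames in a list and concatenate once at the end; a minimal sketch:

import pandas as pd

frames = []  # collect one DataFrame per page
# inside the while loop, instead of results.append(...):
#     frames.append(pd.read_html(response.text)[0])

results = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()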
Scraping table by beautiful soup 4
Hello, I am trying to scrape the table at this URL: https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc

There are 50 rows in this table; however, if you click Show More (just below the table), more rows appear. My Beautiful Soup code works fine, but the problem is it retrieves only the first 50 rows. It does not retrieve the rows that appear after clicking Show More. How can I get all the rows, including the first 50 and also those that appear after clicking Show More? Here is the code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Request to get the target page
rqst = requests.get("https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc")
soup = BeautifulSoup(rqst.content, 'lxml')
table = soup.find_all('table')
NFL_player_stats = pd.read_html(str(table))
players = NFL_player_stats[0]
players.shape

out[0]: (50, 1)
Using DevTools in Firefox, I see that it gets the data (in JSON format) for the next page from:

https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page=2

If you change the value in page= then you can get other pages.

import requests

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='

for page in range(1, 4):
    print('\n---', page, '---\n')
    r = requests.get(url + str(page))
    data = r.json()
    #print(data.keys())
    for item in data['athletes']:
        print(item['athlete']['displayName'])

Result:

--- 1 ---
Ezekiel Elliott Saquon Barkley Todd Gurley II Joe Mixon Chris Carson Christian McCaffrey Derrick Henry Adrian Peterson Phillip Lindsay Nick Chubb Lamar Miller James Conner David Johnson Jordan Howard Sony Michel Marlon Mack Melvin Gordon Alvin Kamara Peyton Barber Kareem Hunt Matt Breida Tevin Coleman Aaron Jones Doug Martin Frank Gore Gus Edwards Lamar Jackson Isaiah Crowell Mark Ingram II Kerryon Johnson Josh Allen Dalvin Cook Latavius Murray Carlos Hyde Austin Ekeler Deshaun Watson Kenyan Drake Royce Freeman Dion Lewis LeSean McCoy Mike Davis Josh Adams Alfred Blue Cam Newton Jamaal Williams Tarik Cohen Leonard Fournette Alfred Morris James White Mitchell Trubisky

--- 2 ---
Rashaad Penny LeGarrette Blount T.J. Yeldon Alex Collins C.J. Anderson Chris Ivory Marshawn Lynch Russell Wilson Blake Bortles Wendell Smallwood Marcus Mariota Bilal Powell Jordan Wilkins Kenneth Dixon Ito Smith Nyheim Hines Dak Prescott Jameis Winston Elijah McGuire Patrick Mahomes Aaron Rodgers Jeff Wilson Jr. Zach Zenner Raheem Mostert Corey Clement Jalen Richard Damien Williams Jaylen Samuels Marcus Murphy Spencer Ware Cordarrelle Patterson Malcolm Brown Giovani Bernard Chase Edmonds Justin Jackson Duke Johnson Taysom Hill Kalen Ballage Ty Montgomery Rex Burkhead Jay Ajayi Devontae Booker Chris Thompson Wayne Gallman DJ Moore Theo Riddick Alex Smith Robert Woods Brian Hill Dwayne Washington

--- 3 ---
Ryan Fitzpatrick Tyreek Hill Andrew Luck Ryan Tannehill Josh Rosen Sam Darnold Baker Mayfield Jeff Driskel Rod Smith Matt Ryan Tyrod Taylor Kirk Cousins Cody Kessler Darren Sproles Josh Johnson DeAndre Washington Trenton Cannon Javorius Allen Jared Goff Julian Edelman Jacquizz Rodgers Kapri Bibbs Andy Dalton Ben Roethlisberger Dede Westbrook Case Keenum Carson Wentz Brandon Bolden Curtis Samuel Stevan Ridley Keith Ford Keenan Allen John Kelly Kenjon Barner Matthew Stafford Tyler Lockett C.J. Beathard Cameron Artis-Payne Devonta Freeman Brandin Cooks Isaiah McKenzie Colt McCoy Stefon Diggs Taylor Gabriel Jarvis Landry Tavon Austin Corey Davis Emmanuel Sanders Sammy Watkins Nathan Peterman

EDIT: get all the data as a DataFrame:

import requests
import pandas as pd

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='

df = pd.DataFrame()  # empty DF at start

for page in range(1, 4):
    print('page:', page)
    r = requests.get(url + str(page))
    data = r.json()
    #print(data.keys())
    for item in data['athletes']:
        player_name = item['athlete']['displayName']
        position = item['athlete']['position']['abbreviation']
        gp = item['categories'][0]['totals'][0]
        other_values = item['categories'][2]['totals']
        row = [player_name, position, gp] + other_values
        df = df.append([row])  # append one row

df.columns = ['NAME', 'POS', 'GP', 'ATT', 'YDS', 'AVG', 'LNG', 'BIG', 'TD', 'YDS/G', 'FUM', 'LST', 'FD']

print(len(df))  # 150
print(df.head(20))
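A note on the hard-coded range(1, 4): if you don't know the page count in advance, one option is to keep requesting pages until the response stops returning athletes. This is a sketch under that assumption; the API may also expose an explicit page-count field, which would be cleaner if present:

import requests

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='

names = []
page = 1
while True:
    data = requests.get(url + str(page)).json()
    athletes = data.get('athletes', [])
    if not athletes:  # no rows back: assume we've run out of pages
        break
    names.extend(item['athlete']['displayName'] for item in athletes)
    page += 1

print(len(names))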