I am trying to get data from this Wikipedia article containing a table of each National Park along with some details about each park. Adapting code from a similar tutorial I found, I was able to display the name and state of each park, though the area of the park is not working. I suspect this is because the name and state are links in the Wikipedia article, though I am not certain. How would I change my code to be able to display the area as well?
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States"
res = requests.get(URL).text
soup = BeautifulSoup(res, 'html.parser')

for items in soup.find('table', class_='wikitable').find_all('tr')[1::1]:
    data = items.find_all(['th', 'td'])
    try:
        parkName = data[0].a.text
        parkState = data[2].a.text
        parkArea = data[4].span.text
    except IndexError:
        pass
    print("{} | {} | {}".format(parkName, parkState, parkArea))
Snippet of my Output
To get the text of the area, you can use .get_text() and then str.rsplit() to keep only the area in acres:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

rows = iter(soup.select('.wikitable tr:has(td, th)'))
next(rows)  # skip headers

for tr in rows:
    name, _, state, _, area, *_ = tr.select('td, th')
    name = name.get_text(strip=True)
    state = state.a.get_text(strip=True)
    area = area.get_text(strip=True).rsplit(maxsplit=2)[0]
    print('{:<35}{:<25}{}'.format(name, state, area))
Prints:
Acadia Maine 49,076.63 acres
American Samoa American Samoa 8,256.67 acres
Arches Utah 76,678.98 acres
Badlands South Dakota 242,755.94 acres
Big Bend Texas 801,163.21 acres
Biscayne Florida 172,971.11 acres
Black Canyon of the Gunnison Colorado 30,779.83 acres
Bryce Canyon Utah 35,835.08 acres
Canyonlands Utah 337,597.83 acres
Capitol Reef Utah 241,904.50 acres
Carlsbad Caverns* New Mexico 46,766.45 acres
Channel Islands California 249,561.00 acres
Congaree South Carolina 26,476.47 acres
Crater Lake Oregon 183,224.05 acres
Cuyahoga Valley Ohio 32,571.88 acres
Death Valley California 3,408,406.73 acres
Denali Alaska 4,740,911.16 acres
Dry Tortugas Florida 64,701.22 acres
Everglades Florida 1,508,938.57 acres
Gates of the Arctic Alaska 7,523,897.45 acres
Gateway Arch Missouri 192.83 acres
Glacier Montana 1,013,125.99 acres
Glacier Bay Alaska 3,223,383.43 acres
Grand Canyon* Arizona 1,201,647.03 acres
Grand Teton Wyoming 310,044.36 acres
Great Basin Nevada 77,180.00 acres
Great Sand Dunes Colorado 107,341.87 acres
Great Smoky Mountains North Carolina 522,426.88 acres
Guadalupe Mountains Texas 86,367.10 acres
Haleakalā Hawaii 33,264.62 acres
Hawaiʻi Volcanoes Hawaii 325,605.28 acres
Hot Springs Arkansas 5,554.15 acres
Indiana Dunes Indiana 15,349.08 acres
Isle Royale Michigan 571,790.30 acres
Joshua Tree California 795,155.85 acres
Katmai Alaska 3,674,529.33 acres
Kenai Fjords Alaska 669,650.05 acres
Kings Canyon California 461,901.20 acres
Kobuk Valley Alaska 1,750,716.16 acres
Lake Clark Alaska 2,619,816.49 acres
Lassen Volcanic California 106,589.02 acres
Mammoth Cave Kentucky 54,011.91 acres
Mesa Verde* Colorado 52,485.17 acres
Mount Rainier Washington 236,381.64 acres
North Cascades Washington 504,780.94 acres
Olympic Washington 922,649.41 acres
Petrified Forest Arizona 221,390.21 acres
Pinnacles California 26,685.73 acres
Redwood* California 138,999.37 acres
Rocky Mountain Colorado 265,807.25 acres
Saguaro Arizona 91,715.72 acres
Sequoia California 404,062.63 acres
Shenandoah Virginia 199,223.77 acres
Theodore Roosevelt North Dakota 70,446.89 acres
Virgin Islands U.S. Virgin Islands 15,052.53 acres
Voyageurs Minnesota 218,222.35 acres
White Sands New Mexico 146,344.31 acres
Wind Cave South Dakota 33,970.84 acres
Wrangell–St. Elias* Alaska 8,323,146.48 acres
Yellowstone Wyoming 2,219,790.71 acres
Yosemite* California 761,747.50 acres
Zion Utah 147,242.66 acres
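For reference, the rsplit(maxsplit=2)[0] call above simply drops the last two whitespace-separated tokens of the cell text (the km² figure and its unit); a small illustration with a made-up cell string, since the exact text on the live page may differ:

# Hypothetical cell text, for illustration only.
cell_text = "49,076.63 acres (198.6 km2)"

# rsplit splits from the right, at most twice:
# ['49,076.63 acres', '(198.6', 'km2)']
print(cell_text.rsplit(maxsplit=2)[0])  # -> 49,076.63 acres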
You can change this line:
parkArea = data[4].span.text
to this one if you want the area in acres:
parkArea = data[4].text.split(' ')[0]
or to this one for the area in km²:
parkArea = data[4].text.split(' ')[2]
I have a ranking of cities across the world in a variable called rank_2000 that looks like this:
Seoul
Tokyo
Paris
New_York_Greater
Shizuoka
Chicago
Minneapolis
Boston
Austin
Munich
Salt_Lake
Greater_Sydney
Houston
Dallas
London
San_Francisco_Greater
Berlin
Seattle
Toronto
Stockholm
Atlanta
Indianapolis
Fukuoka
San_Diego
Phoenix
Frankfurt_am_Main
Stuttgart
Grenoble
Albany
Singapore
Washington_Greater
Helsinki
Nuremberg
Detroit_Greater
TelAviv
Zurich
Hamburg
Pittsburgh
Philadelphia_Greater
Taipei
Los_Angeles_Greater
Miami_Greater
MannheimLudwigshafen
Brussels
Milan
Montreal
Dublin
Sacramento
Ottawa
Vancouver
Malmo
Karlsruhe
Columbus
Dusseldorf
Shenzen
Copenhagen
Milwaukee
Marseille
Greater_Melbourne
Toulouse
Beijing
Dresden
Manchester
Lyon
Vienna
Shanghai
Guangzhou
San_Antonio
Utrecht
New_Delhi
Basel
Oslo
Rome
Barcelona
Madrid
Geneva
Hong_Kong
Valencia
Edinburgh
Amsterdam
Taichung
The_Hague
Bucharest
Muenster
Greater_Adelaide
Chengdu
Greater_Brisbane
Budapest
Manila
Bologna
Quebec
Dubai
Monterrey
Wellington
Shenyang
Tunis
Johannesburg
Auckland
Hangzhou
Athens
Wuhan
Bangalore
Chennai
Istanbul
Cape_Town
Lima
Xian
Bangkok
Penang
Luxembourg
Buenos_Aires
Warsaw
Greater_Perth
Kuala_Lumpur
Santiago
Lisbon
Dalian
Zhengzhou
Prague
Changsha
Chongqing
Ankara
Fuzhou
Jinan
Xiamen
Sao_Paulo
Kunming
Jakarta
Cairo
Curitiba
Riyadh
Rio_de_Janeiro
Mexico_City
Hefei
Almaty
Beirut
Belgrade
Belo_Horizonte
Bogota_DC
Bratislava
Dhaka
Durban
Hanoi
Ho_Chi_Minh_City
Kampala
Karachi
Kuwait_City
Manama
Montevideo
Panama_City
Quito
San_Juan
What I would like to do is a map of the world where those cities are colored according to their position in the ranking above. I am open to other representations as well (such as bubbles whose size increases with the city's position in the ranking or, if necessary, plotting only a sample of cities taken from the top, middle, and bottom of the ranking).
Thank you,
Federico
Your question has two parts: finding the location of each city, and then drawing them on the map. Assuming you have the latitude and longitude of each city, here's how you'd tackle the latter part.
I like Folium (https://pypi.org/project/folium/) for drawing maps. Here's an example of how you might draw a circle for each city, with its position in the list used to determine the size of that circle.
import folium

cities = [
    {'name': 'Seoul', 'coords': [37.5639715, 126.9040468]},
    {'name': 'Tokyo', 'coords': [35.5090627, 139.2094007]},
    {'name': 'Paris', 'coords': [48.8588787, 2.2035149]},
    {'name': 'New York', 'coords': [40.6976637, -74.1197631]},
    # etc. etc.
]

m = folium.Map(zoom_start=15)

for counter, city in enumerate(cities):
    circle_size = 5 + counter
    folium.CircleMarker(
        location=city['coords'],
        radius=circle_size,
        popup=city['name'],
        color="crimson",
        fill=True,
        fill_color="crimson",
    ).add_to(m)

m.save('map.html')
Output:
You may need to adjust the circle_size calculation a little to work with the number of cities you want to include.
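For the first part of the question (finding each city's coordinates), a geocoding library is one option; a minimal sketch, assuming the geopy package and the public Nominatim service are acceptable for your use case:

from geopy.geocoders import Nominatim  # pip install geopy

# Nominatim requires a descriptive user_agent string; the value here is arbitrary.
geolocator = Nominatim(user_agent="city-ranking-map")

cities = []
for name in ['Seoul', 'Tokyo', 'Paris']:  # subset of rank_2000 for illustration
    location = geolocator.geocode(name.replace('_', ' '))
    if location is not None:
        cities.append({'name': name, 'coords': [location.latitude, location.longitude]})

print(cities)

Note that Nominatim rate-limits requests, so for the full list you may want to pause between lookups or cache the results.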
I am attempting to web scrape using Python and Beautiful Soup. URL for reference: https://www.zoopla.co.uk/for-sale/property/london/?q=London&results_sort=newest_listings&search_source=home
This is how far I have managed to get:
>>>address = container.find_all("span")
>>>print(address)
[<span class="price-modifier">Guide price</span>, <span class="listing-results-just-added">Just added</span>, <span><a class="listing-results-address" href="/for-sale/details/50074267">Wolseley Road, Crouch End, London N8</a></span>, <span class="interface nearby_stations_schools_national_rail_station" title="Hornsey"></span>, <span class="nearby_stations_schools_name" title="Hornsey">Hornsey</span>, <span class="interface nearby_stations_schools_national_rail_station" title="Crouch Hill"></span>, <span class="nearby_stations_schools_name" title="Crouch Hill">Crouch Hill</span>]
Why does the following not work?
address = container.find_all("span", attrs={"class": "listing-results-address"})
I am trying to get the address only i.e. Wolseley Road, Crouch End, London N8
You should search for the <a> tag, not the <span> tag. The listing-results-address class is on the anchor inside the span, not on the span itself, so filtering <span> elements by that class matches nothing:
import requests
from bs4 import BeautifulSoup

url = 'https://www.zoopla.co.uk/for-sale/property/london/?q=London&results_sort=newest_listings&search_source=home'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for a in soup.find_all('a', class_='listing-results-address'):
    print(a.get_text(strip=True))
Prints:
Linstead Way, London SW18
Tudor Court, London E17
Pendlestone Road, Walthamstow, ...
Discovery House, Juniper Drive, Wandsworth, London SW18
Woodlea Grove, Northwood HA6
Elsham Road, London W14
Lytham Street, London SE17
Isleworth, London TW7
Islip Manor Road, Northolt UB5
Teignmouth Road, Welling, Kent DA16
Wimpole Street, London W1G
Cranborne Crescent, Potters Bar, Herts EN6
Forest Road, London E17
Highclere Road, New Malden KT3
Coppermill Lane, London E17
Diana Road, London E17
Chiswick High Road, London W4
Holmesdale Road, London SE25
Warrington Crescent, London W9
Grasmere Road, Purley CR8
Bonar Place, Chislehurst BR7
Samos Road, London SE20
Tredegar Road, London E3
Widdenham Road, Islington, London N7
Eddystone Road, London SE4
Benhurst Avenue, Hornchurch RM12
Woodfield Gardens, New Malden KT3
Old Road, London SE13
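If you already have the container element from your snippet, the same address can also be reached through the <span>, as long as you step into the child <a>; a small sketch, assuming container is the element from your question:

# Either keep the <span> search and step into the anchor it wraps...
for span in container.find_all('span'):
    a = span.find('a', class_='listing-results-address')
    if a:
        print(a.get_text(strip=True))

# ...or select the anchor directly with a CSS selector.
for a in container.select('a.listing-results-address'):
    print(a.get_text(strip=True))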
EDIT: To get the price along with the address, you can do:
import requests
from bs4 import BeautifulSoup

url = 'https://www.zoopla.co.uk/for-sale/property/london/?q=London&results_sort=newest_listings&search_source=home'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for a in soup.find_all('a', class_='listing-results-address'):
    price = a.find_previous(class_='listing-results-price').find(text=True).strip()
    print('{:<15} {}'.format(price, a.get_text(strip=True)))
Prints:
£1,500,000 Brondesbury Park, Brondesbury ...
£460,000 2D Harold Road, Upper Norwood SE19
£450,000 Anerley Road, London SE20
£450,000 Grange Road, London SE19
£225,000 Bath Road, Harlington, Hayes UB3
£440,000 George Beard Road, London SE8
£615,000 Cumberland Drive, Chessington KT9
£800,000 Woodmansterne Road, Carshalton SM5
£600,000 Willow Close, Bexley DA5
£165,000 Essex Road, Islington On The Green, Islington, London N1
£695,000 Advance House, 101 Ladbroke Grove, London W11
£1,500,000 Riverview Gardens, London SW13
£350,000 Church Road, London SE19
£935,000 Ansdell Road, Nunhead SE15
£350,000 Marlborough Close, London SE17
£380,000 Graveney Road, London SW17
£360,000 Violet Lane, Croydon CR0
£325,000 Montana Gardens, Sutton SM1
£550,000 Albert Road, Bromley, Kent BR2
£365,000 Hadleigh Walk, London E6
£650,000 Eton Rise, Eton College Road, London NW3
£480,000 Russell Road, London N13
£500,000 Heligan House, Watergarden Square, Canada Water SE16
£1,850,000 Melrose Gardens, Brook Green, London W6
£475,000 Cowper Close, Welling DA16
£4,950,000 Edwardes Square, London W8
£735,000 Arbuthnot Road, New Cross SE14
£750,000 Gosterwood Street, London SE8
For each letter in the alphabet, the code should go to website.com/a and grab a table. Then it should check for a next button, grab the link, make soup, grab the next table, and repeat until there is no valid next link. Then it should move to website.com/b (the next letter in the alphabet) and repeat. But I can only get as far as 2 pages for each letter: the first for loop grabs page 1 and the second grabs page 2 for each letter. I know I could write a loop for as many pages as needed, but that is not scalable. How can I fix this?
from nfl_fun import make_soup
import urllib.request
import os
from string import ascii_lowercase
import requests

letter = ascii_lowercase
link = "https://www.nfl.com"

for letter in ascii_lowercase:
    soup = make_soup(f"https://www.nfl.com/players/active/{letter}")
    for tbody in soup.findAll("tbody"):
        for tr in tbody.findAll("a"):
            if tr.has_attr("href"):
                print(tr.attrs["href"])

for letter in ascii_lowercase:
    soup = make_soup(f"https://www.nfl.com/players/active/{letter}")
    for page in soup.footer.findAll("a", {"nfl-o-table-pagination__next"}):
        pagelink = ""
        footer = ""
        footer = page.attrs["href"]
        pagelink = f"{link}{footer}"
        print(footer)
        getpage = requests.get(pagelink)
        if getpage.status_code == 200:
            next_soup = make_soup(pagelink)
            for next_page in next_soup.footer.findAll("a", {"nfl-o-table-pagination__next"}):
                print(getpage)
                for tbody in next_soup.findAll("tbody"):
                    for tr in tbody.findAll("a"):
                        if tr.has_attr("href"):
                            print(tr.attrs["href"])
            soup = next_soup
Thank You again,
There is an element in there that says when the "Next" button is inactive, which tells you that you are on the last page. So what you can do is use a while loop and just keep going to the next page until it reaches the last page (i.e. "Next" is inactive), then stop the loop and move on to the next letter:
from bs4 import BeautifulSoup
from string import ascii_lowercase
import requests
import pandas as pd
import re

letters = ascii_lowercase
link = "https://www.nfl.com"
results = pd.DataFrame()

for letter in letters:
    continueToNextPage = True
    after = ''
    page = 1
    while continueToNextPage == True:
        # Get the table
        url = f"https://www.nfl.com/players/active/{letter}?query={letter}&after={after}"
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        temp_df = pd.read_html(response.text)[0]
        results = results.append(temp_df, sort=False).reset_index(drop=True)
        print("{letter}: Page: {page}".format(letter=letter.upper(), page=page))

        # Check if the next page is inactive
        buttons = soup.find('div', {'class': 'nfl-o-table-pagination__buttons'})
        regex = re.compile('.*pagination__next.*is-inactive.*')
        if buttons.find('span', {'class': regex}):
            continueToNextPage = False
        else:
            after = buttons.find('a', {'title': 'Next'})['href'].split('after=')[-1]
            page += 1
Output:
print (results)
Player Current Team Position Status
0 Chidobe Awuzie Dallas Cowboys CB ACT
1 Josh Avery Seattle Seahawks DT ACT
2 Genard Avery Philadelphia Eagles DE ACT
3 Anthony Averett Baltimore Ravens CB ACT
4 Lee Autry Chicago Bears DT ACT
5 Denico Autry Indianapolis Colts DT ACT
6 Tavon Austin Dallas Cowboys WR UFA
7 Blessuan Austin New York Jets CB ACT
8 Antony Auclair Tampa Bay Buccaneers TE ACT
9 Jeremiah Attaochu Denver Broncos LB ACT
10 Hunter Atkinson Atlanta Falcons OT ACT
11 John Atkins Detroit Lions DE ACT
12 Geno Atkins Cincinnati Bengals DT ACT
13 Marcell Ateman Las Vegas Raiders WR ACT
14 George Aston New York Giants RB ACT
15 Dravon Askew-Henry New York Giants DB ACT
16 Devin Asiasi New England Patriots TE ACT
17 George Asafo-Adjei New York Giants OT ACT
18 Ade Aruna Las Vegas Raiders DE ACT
19 Grayland Arnold Philadelphia Eagles SAF ACT
20 Dan Arnold Arizona Cardinals TE ACT
21 Damon Arnette Las Vegas Raiders CB UDF
22 Ray-Ray Armstrong Dallas Cowboys LB UFA
23 Ka'John Armstrong Denver Broncos OT ACT
24 Dorance Armstrong Dallas Cowboys DE ACT
25 Cornell Armstrong Houston Texans CB ACT
26 Terron Armstead New Orleans Saints OT ACT
27 Ryquell Armstead Jacksonville Jaguars RB ACT
28 Arik Armstead San Francisco 49ers DE ACT
29 Alex Armah Carolina Panthers FB ACT
... ... ... ...
3180 Clive Walford Miami Dolphins TE UFA
3181 Cameron Wake Tennessee Titans DE UFA
3182 Corliss Waitman Pittsburgh Steelers P ACT
3183 Rick Wagner Green Bay Packers OT ACT
3184 Bobby Wagner Seattle Seahawks MLB ACT
3185 Ahmad Wagner Chicago Bears WR ACT
3186 Colby Wadman Denver Broncos P ACT
3187 Christian Wade Buffalo Bills RB ACT
3188 LaAdrian Waddle Buffalo Bills OT UFA
3189 Oshane Ximines New York Giants LB ACT
3190 Trevon Young Cleveland Browns DE ACT
3191 Sam Young Las Vegas Raiders OT ACT
3192 Kenny Young Los Angeles Rams ILB ACT
3193 Chase Young Washington Redskins DE UDF
3194 Bryson Young Atlanta Falcons DE ACT
3195 Isaac Yiadom Denver Broncos CB ACT
3196 T.J. Yeldon Buffalo Bills RB ACT
3197 Deon Yelder Kansas City Chiefs TE ACT
3198 Rock Ya-Sin Indianapolis Colts CB ACT
3199 Eddie Yarbrough Minnesota Vikings DE ACT
3200 Marshal Yanda Baltimore Ravens OG ACT
3201 Tavon Young Baltimore Ravens CB ACT
3202 Brandon Zylstra Carolina Panthers WR ACT
3203 Jabari Zuniga New York Jets DE UDF
3204 Greg Zuerlein Dallas Cowboys K ACT
3205 Isaiah Zuber New England Patriots WR ACT
3206 Justin Zimmer Cleveland Browns DT ACT
3207 Anthony Zettel Minnesota Vikings DE ACT
3208 Kevin Zeitler New York Giants OG ACT
3209 Olamide Zaccheaus Atlanta Falcons WR ACT
[3210 rows x 4 columns]
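Note that newer pandas releases (2.0 and later) removed DataFrame.append; if you hit that, a common replacement is to collect the per-page frames in a list and concatenate them once at the end, for example:

import pandas as pd

frames = []  # one DataFrame per scraped page

# inside the while loop, instead of results.append(...):
#     frames.append(pd.read_html(response.text)[0])

# after all letters and pages have been scraped:
results = pd.concat(frames, sort=False).reset_index(drop=True)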
I am trying to read Excel file cells that have multi-line text in them. I am using xlrd 1.2.0. But when I print, or even write the cell text to a .txt file, it doesn't preserve line breaks or tabs, i.e. \n or \t.
Input:
File URL:
Excel file
Code:
import xlrd
filenamedotxlsx = '16.xlsx'
gall_artists = xlrd.open_workbook(filenamedotxlsx)
sheet = gall_artists.sheet_by_index(0)
bio = sheet.cell_value(0,1)
print(bio)
Output:
"Biography 2018-2019 Manoeuvre Textiles Atelier, Gent, Belgium 2017-2018 Thalielab, Brussels, Belgium 2017 Laboratoires d'Aubervilliers, Paris 2014-2015 Galveston Artist Residency (GAR), Texas 2014 MACBA, Barcelona & L'appartment 22, Morocco - Residency 2013 International Residence Recollets, Paris 2007 Gulbenkian & RSA Residency, BBC Natural History Dept, UK 2004-2006 Delfina Studios, UK Studio Award, London 1998-2000 De Ateliers, Post-grad Residency, Amsterdam 1995-1998 BA (Hons) Textile Art, Winchester School of Art UK "
Expected Output:
1975 Born in Hangzhou, Zhejiang, China
1980 Started to learn Chinese ink painting
2000 BA, Major in Oil Painting, China Academy of Art, Hangzhou, China
Curator, Hangzhou group exhibition for 6 female artists Untitled, 2000 Present
2007 MA, New Media, China Academy of Art, Hangzhou, China, studied under Jiao Jian
Lecturer, Department of Art, Zhejiang University, Hangzhou, China
2015 PhD, Calligraphy, China Academy of Art, Hangzhou, China, studied under Wang Dongling
Jury, 25th National Photographic Art Exhibition, China Millennium Monument, Beijing, China
2016 Guest professor, Faculty of Humanities, Zhejiang University, Hangzhou, China
Associate professor, Research Centre of Modern Calligraphy, China Academy of Art, Hangzhou, China
Researcher, Lanting Calligraphy Commune, Zhejiang, China
2017 Christie's produced a video about Chu Chu's art
2018 Featured by Poetry Calligraphy Painting Quarterly No.2, Beijing, China
Present Vice Secretary, Lanting Calligraphy Society, Hangzhou, China
Vice President, Zhejiang Female Calligraphers Association, Hangzhou, China
I have also used repr() to see if there are \n characters or not, but there aren't any.
Below is a scraper that uses Beautiful Soup to scrape physician information off of this webpage. As you can see from the html code directly below, each physician has an individual profile on the webpage that displays the physician's name, clinic, profession, taxonomy, and city.
<div class="views-field views-field-title practitioner__name" >Marilyn Adams</div>
<div class="views-field views-field-field-pract-clinic practitioner__clinic" >Fortius Sport & Health</div>
<div class="views-field views-field-field-pract-profession practitioner__profession" >Physiotherapist</div>
<div class="views-field views-field-taxonomy-vocabulary-5 practitioner__region" >Fraser River Delta</div>
<div class="views-field views-field-city practitioner__city" ></div>
As you can see from the sample html code, the physician profiles occasionally have information missing. If this occurs, I would like the scraper to print 'N/A'. I need the scraper to print 'N/A' because I would eventually like to put each div class category (name, clinic, profession, etc.) into an array where the lengths of each column are exactly the same so I can properly export the data to a CSV file. Here is an example of what I want the output to look like compared to what is actually showing up.
Actual                  Expected
[Names]                 [Names]
Greg                    Greg
Bob                     Bob
[Clinic]                [Clinic]
Sport/Health            Sport/Health
                        N/A
[Profession]            [Profession]
Physical Therapist      Physical Therapist
Physical Therapist      Physical Therapist
[Taxonomy]              [Taxonomy]
Fraser River            Fraser River
                        N/A
[City]                  [City]
Vancouver               Vancouver
Vancouver               Vancouver
I have tried writing an if statement nested within each for loop, but the code does not seem to be looping correctly, as the "N/A" only shows up once for each div class section. Does anyone know how to properly nest an if statement within a for loop so I get the proper number of "N/A"s in each column? Thanks in advance!
import requests
import re
from bs4 import BeautifulSoup

page = requests.get('https://sportmedbc.com/practitioners')
soup = BeautifulSoup(page.text, 'html.parser')

#Find Doctor Info
for doctor in soup.find_all('div', attrs={'class': 'views-field views-field-title practitioner__name'}):
    for a in doctor.find_all('a'):
        print(a.text)

for clinic_name in soup.find_all('div', attrs={'class': 'views-field views-field-field-pract-clinic practitioner__clinic'}):
    for b in clinic_name.find_all('a'):
        if b == (''):
            print('N/A')

profession_links = soup.findAll('div', attrs={'class': 'views-field views-field-field-pract-profession practitioner__profession'})
for profession in profession_links:
    if profession.text == (''):
        print('N/A')
    print(profession.text)

taxonomy_links = soup.findAll('div', attrs={'class': 'views-field views-field-taxonomy-vocabulary-5 practitioner__region'})
for taxonomy in taxonomy_links:
    if taxonomy.text == (''):
        print('N/A')
    print(taxonomy.text)

city_links = soup.findAll('div', attrs={'class': 'views-field views-field-taxonomy-vocabulary-5 practitioner__region'})
for city in city_links:
    if city.text == (''):
        print('N/A')
    print(city.text)
For this problem you can use ChainMap from the collections module (docs here). That way you can define your default values, in this case 'n/a', and only grab the information that exists for each doctor:
from bs4 import BeautifulSoup
import requests
from collections import ChainMap

url = 'https://sportmedbc.com/practitioners'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

def get_data(soup):
    default_data = {'name': 'n/a', 'clinic': 'n/a', 'profession': 'n/a', 'region': 'n/a', 'city': 'n/a'}

    for doctor in soup.select('.view-practitioners .practitioner'):
        doctor_data = {}
        if doctor.select_one('.practitioner__name').text.strip():
            doctor_data['name'] = doctor.select_one('.practitioner__name').text
        if doctor.select_one('.practitioner__clinic').text.strip():
            doctor_data['clinic'] = doctor.select_one('.practitioner__clinic').text
        if doctor.select_one('.practitioner__profession').text.strip():
            doctor_data['profession'] = doctor.select_one('.practitioner__profession').text
        if doctor.select_one('.practitioner__region').text.strip():
            doctor_data['region'] = doctor.select_one('.practitioner__region').text
        if doctor.select_one('.practitioner__city').text.strip():
            doctor_data['city'] = doctor.select_one('.practitioner__city').text

        yield ChainMap(doctor_data, default_data)

for doctor in get_data(soup):
    print('name:\t\t', doctor['name'])
    print('clinic:\t\t', doctor['clinic'])
    print('profession:\t', doctor['profession'])
    print('city:\t\t', doctor['city'])
    print('region:\t\t', doctor['region'])
    print('-' * 80)
Prints:
name: Jaimie Ackerman
clinic: n/a
profession: n/a
city: n/a
region: n/a
--------------------------------------------------------------------------------
name: Marilyn Adams
clinic: Fortius Sport & Health
profession: Physiotherapist
city: n/a
region: Fraser River Delta
--------------------------------------------------------------------------------
name: Mahsa Ahmadi
clinic: Wellpoint Acupuncture (Sports Medicine)
profession: Acupuncturist
city: Vancouver
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Tracie Albisser
clinic: Pacific Sport Northern BC, Tracie Albisser
profession: Strength and Conditioning Specialist, Exercise Physiologist
city: n/a
region: Cariboo - North East
--------------------------------------------------------------------------------
name: Christine Alder
clinic: n/a
profession: n/a
city: Vancouver
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Steacy Alexander
clinic: Go! Physiotherapy Sports and Wellness Centre
profession: Physiotherapist
city: Vancouver
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Page Allison
clinic: AET Clinic, .
profession: Athletic Therapist
city: Victoria
region: Vancouver Island - Central Coast
--------------------------------------------------------------------------------
name: Dana Alumbaugh
clinic: n/a
profession: Podiatrist
city: Squamish
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Manouch Amel
clinic: Mountainview Kinesiology Ltd.
profession: Strength and Conditioning Specialist
city: Anmore
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Janet Ames
clinic: Dr. Janet Ames
profession: Physician
city: Prince George
region: Cariboo - North East
--------------------------------------------------------------------------------
name: Sandi Anderson
clinic: n/a
profession: n/a
city: Coquitlam
region: Fraser Valley
--------------------------------------------------------------------------------
name: Greg Anderson
clinic: University of the Fraser Valley
profession: Exercise Physiologist
city: Mission
region: Fraser Valley
--------------------------------------------------------------------------------
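The defaults show up because of ChainMap's lookup order: a key is searched in doctor_data first and only falls back to default_data when it is missing there. A minimal illustration:

from collections import ChainMap

default_data = {'name': 'n/a', 'clinic': 'n/a'}
doctor_data = {'name': 'Marilyn Adams'}  # only the fields actually found on the page

merged = ChainMap(doctor_data, default_data)
print(merged['name'])    # Marilyn Adams -> found in doctor_data
print(merged['clinic'])  # n/a           -> falls back to default_data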
EDIT:
For getting the output in columns, you can use this example:
def print_data(header_text, data, key):
    print(header_text)
    for d in data:
        print(d[key])
    print()

data = list(get_data(soup))

print_data('[Names]', data, 'name')
print_data('[Clinic]', data, 'clinic')
print_data('[Profession]', data, 'profession')
print_data('[Taxonomy]', data, 'region')
print_data('[City]', data, 'city')
This prints:
[Names]
Jaimie Ackerman
Marilyn Adams
Mahsa Ahmadi
Tracie Albisser
Christine Alder
Steacy Alexander
Page Allison
Dana Alumbaugh
Manouch Amel
Janet Ames
Sandi Anderson
Greg Anderson
[Clinic]
n/a
Fortius Sport & Health
Wellpoint Acupuncture (Sports Medicine)
Pacific Sport Northern BC, Tracie Albisser
n/a
Go! Physiotherapy Sports and Wellness Centre
AET Clinic, .
n/a
Mountainview Kinesiology Ltd.
Dr. Janet Ames
n/a
University of the Fraser Valley
[Profession]
n/a
Physiotherapist
Acupuncturist
Strength and Conditioning Specialist, Exercise Physiologist
n/a
Physiotherapist
Athletic Therapist
Podiatrist
Strength and Conditioning Specialist
Physician
n/a
Exercise Physiologist
[Taxonomy]
n/a
Fraser River Delta
Vancouver & Sea to Sky
Cariboo - North East
Vancouver & Sea to Sky
Vancouver & Sea to Sky
Vancouver Island - Central Coast
Vancouver & Sea to Sky
Vancouver & Sea to Sky
Cariboo - North East
Fraser Valley
Fraser Valley
[City]
n/a
n/a
Vancouver
n/a
Vancouver
Vancouver
Victoria
Squamish
Anmore
Prince George
Coquitlam
Mission
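If the end goal is the CSV export mentioned in the question, the same generator plugs straight into csv.DictWriter; a small sketch, assuming the get_data function and soup object defined above:

import csv

fieldnames = ['name', 'clinic', 'profession', 'region', 'city']

with open('practitioners.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for doctor in get_data(soup):
        # dict() flattens the ChainMap (found values plus the 'n/a' defaults) into a plain dict
        writer.writerow(dict(doctor))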