Parsing through HTML in a dictionary - python

I'm trying to pull table data from the following website: https://msih.bgu.ac.il/md-program/residency-placements/
While there are no table tags, I found that the common tag wrapping each segment of the table is div class="accord-con".
I made a dictionary where the keys are the graduation years (i.e., 2019, 2018, etc.) and the values are the HTML from each div class="accord-con".
I'm stuck on how to parse the HTML within the dictionary. My goal is to have separate lists of the specialty, hospital, and location for each year, but I don't know how to move forward.
Below is my working code:
import numpy as np
import bs4 as bs
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
sauce = urllib.request.urlopen('https://msih.bgu.ac.il/md-program/residency-placements/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
headers = soup.find_all('div', class_='accord-head')
grad_yr_list = []
for header in headers:
    grad_yr_list.append(header.h2.text[-4:])
rez_classes = soup.find_all('div', class_='accord-con')
data_dict = dict(zip(grad_yr_list, rez_classes))
Here is a sample of what my dictionary looks like:
{'2019': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>University at Buffalo School of Medicine, Buffalo, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Aventura Hospital, Aventura, Fl</li></ul><h4>Family Medicine</h4><ul><li>Louisiana State University School of Medicine, New Orleans, LA</li><li>UT St Thomas Hospitals, Murfreesboro, TN</li><li>Sea Mar Community Health Center, Seattle, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>St Joseph Hospital, Denver, CO </li></ul><h4>Obstetrics-Gynecology</h4><ul><li>Jersey City Medical Center, Jersey City, NJ</li><li>New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY</li></ul><h4>Pediatrics</h4><ul><li>St Louis Children’s Hospital, St Louis, MO</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>St Christopher’s Hospital, Philadelphia, PA</li></ul><h4>Surgery</h4><ul><li>Mountain Area Health Education Center, Asheville, NC</li></ul><p></p></div>,
'2018': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>NYU School of Medicine, New York, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Kent Hospital, Warwick, Rhode Island</li><li>University of Connecticut School of Medicine, Farmington, CT</li><li>University of Texas Health Science Center at San Antonio, San Antonio, TX</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Family Medicine</h4><ul><li>University of Kansas Medical Center, Wichita, KS</li><li>Ellis Hospital, Schenectady, NY</li><li>Harrison Medical Center, Seattle, WA</li><li>St Francis Hospital, Wilmington, DE </li><li>University of Virginia, Charlottesville, VA</li><li>Valley Medical Center, Renton, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>Virginia Commonwealth University Health Systems, Richmond, VA</li><li>University of Chicago Medical Center, Chicago, IL</li></ul><h4>Obstetrics-Gynecology</h4><ul><li>St Francis Hospital, Hartford, CT</li></ul><h4>Pediatrics</h4><ul><li>Case Western University Hospitals Cleveland Medical Center, Cleveland, OH</li><li>Jersey Shore University Medical Center, Neptune City, NJ</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>University of Virginia, Charlottesville, VA</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Preliminary Medicine Neurology</h4><ul><li>Howard University Hospital, Washington, DC</li></ul><h4>Preliminary Medicine Radiology</h4><ul><li>Maimonides Medical Center, Bronx, NY</li></ul><h4>Preliminary Medicine Surgery</h4><ul><li>Providence Park Hospital, Southfield, MI</li></ul><h4>Psychiatry</h4><ul><li>University of Maryland Medical Center, Baltimore, MI</li></ul><p></p></div>,
My ultimate goal is to pull this data into a pandas dataframe with the following columns: grad year, specialty, hospital, location

Your code is quite close to the end result. Once you have paired the years with the student placement data, simply apply an extraction function to the latter:
from bs4 import BeautifulSoup as soup
import re
from selenium import webdriver
_d = webdriver.Chrome('/path/to/chromedriver')
_d.get('https://msih.bgu.ac.il/md-program/residency-placements/')
d = soup(_d.page_source, 'html.parser')
def placement(block):
    r = block.find_all(re.compile('ul|h4'))
    return {r[i].text: [b.text for b in r[i+1].find_all('li')] for i in range(0, len(r)-1, 2)}
result = {i.h2.text:placement(i) for i in d.find_all('div', {'class':'accord-head'})}
print(result['Class of 2019'])
Output:
{'Anesthesiology': ['University at Buffalo School of Medicine, Buffalo, NY'], 'Emergency Medicine': ['Aventura Hospital, Aventura, Fl'], 'Family Medicine': ['Louisiana State University School of Medicine, New Orleans, LA', 'UT St Thomas Hospitals, Murfreesboro, TN', 'Sea Mar Community Health Center, Seattle, WA'], 'Internal Medicine': ['Oregon Health and Science University, Portland, OR', 'St Joseph Hospital, Denver, CO\xa0'], 'Obstetrics-Gynecology': ['Jersey City Medical Center, Jersey City, NJ', 'New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY'], 'Pediatrics': ['St Louis Children’s Hospital, St Louis, MO', 'University of Maryland Medical Center, Baltimore, MD', 'St Christopher’s Hospital, Philadelphia, PA'], 'Surgery': ['Mountain Area Health Education Center, Asheville, NC']}
Note: I ended up using Selenium because, for me, the HTML returned by requests.get did not include the rendered student placement data.
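From result, the OP's target table is one flattening pass away. A hedged sketch on a trimmed copy of the output above; the rsplit assumes every entry ends with ", City, ST", which the sample data suggests but real entries may not always follow:

```python
# Flatten a {year_label: {specialty: [placement, ...]}} dict into tidy rows.
result = {
    'Class of 2019': {
        'Anesthesiology': ['University at Buffalo School of Medicine, Buffalo, NY'],
        'Surgery': ['Mountain Area Health Education Center, Asheville, NC'],
    },
}

rows = []
for year_label, placements in result.items():
    year = year_label[-4:]  # 'Class of 2019' -> '2019'
    for specialty, entries in placements.items():
        for entry in entries:
            # assumes 'Hospital, City, ST'; strip guards against trailing \xa0
            hospital, city, state = entry.strip().rsplit(', ', 2)
            rows.append({'grad_year': year, 'specialty': specialty,
                         'hospital': hospital, 'location': f'{city}, {state}'})

print(rows[0]['hospital'])  # -> University at Buffalo School of Medicine
```

The rows list can then be handed directly to pandas via pd.DataFrame(rows).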

You have a dictionary of BeautifulSoup elements ('bs4.element.Tag'), so you don't have to parse them again.
You can call find(), find_all(), etc. on them directly.
for key, value in data_dict.items():
    print(type(value), key, value.find('h4').text)
Result
<class 'bs4.element.Tag'> 2019 Anesthesiology
<class 'bs4.element.Tag'> 2018 Anesthesiology
<class 'bs4.element.Tag'> 2017 Anesthesiology
<class 'bs4.element.Tag'> 2016 Emergency Medicine
<class 'bs4.element.Tag'> 2015 Emergency Medicine
<class 'bs4.element.Tag'> 2014 Anesthesiology
<class 'bs4.element.Tag'> 2013 Anesthesiology
<class 'bs4.element.Tag'> 2012 Emergency Medicine
<class 'bs4.element.Tag'> 2011 Emergency Medicine
<class 'bs4.element.Tag'> 2010 Dermatology
<class 'bs4.element.Tag'> 2009 Emergency Medicine
<class 'bs4.element.Tag'> 2008 Family Medicine
<class 'bs4.element.Tag'> 2007 Anesthesiology
<class 'bs4.element.Tag'> 2006 Triple Board (Pediatrics/Adult Psychiatry/Child Psychiatry)
<class 'bs4.element.Tag'> 2005 Family Medicine
<class 'bs4.element.Tag'> 2004 Anesthesiology
<class 'bs4.element.Tag'> 2003 Emergency Medicine
<class 'bs4.element.Tag'> 2002 Family Medicine
Full code:
import urllib.request
import bs4 as bs
sauce = urllib.request.urlopen('https://msih.bgu.ac.il/md-program/residency-placements/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
headers = soup.find_all('div', class_='accord-head')
grad_yr_list = []
for header in headers:
    grad_yr_list.append(header.h2.text[-4:])
rez_classes = soup.find_all('div', class_='accord-con')
data_dict = dict(zip(grad_yr_list, rez_classes))
for key, value in data_dict.items():
    print(type(value), key, value.find('h4').text)
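To carry this through to the OP's four target columns, one way to finish (a sketch run on a trimmed inline copy of one accord-con block rather than a live data_dict value, and assuming each li reads "Hospital, City, ST") is to pair each h4 with the ul that follows it:

```python
from bs4 import BeautifulSoup

# Inline stand-in for one data_dict value (a bs4.element.Tag).
html = '''<div class="accord-con"><h4>Anesthesiology</h4>
<ul><li>University at Buffalo School of Medicine, Buffalo, NY</li></ul>
<h4>Surgery</h4><ul><li>Mountain Area Health Education Center, Asheville, NC</li></ul></div>'''
tag = BeautifulSoup(html, 'html.parser').div

rows = []
for h4 in tag.find_all('h4'):
    ul = h4.find_next_sibling('ul')  # the placement list for this specialty
    if ul is None:
        continue
    for li in ul.find_all('li'):
        # assumes 'Hospital, City, ST'; extra commas stay in the hospital name
        hospital, city, state = li.get_text(strip=True).rsplit(', ', 2)
        rows.append({'grad_year': '2019', 'specialty': h4.text,
                     'hospital': hospital, 'location': f'{city}, {state}'})

print(rows)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows).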

You can move to pandas as soon as you have the soup, then parse out the necessary information:
df = pd.DataFrame(zip(headers, rez_classes))  # pair each accord-head with its accord-con
df['grad_year'] = df[0].map(lambda x: x.text[-4:])
df['specialty'] = df[1].map(lambda x: [i.text for i in x.find_all('h4')])
df['hospital'] = df[1].map(lambda x: [i.text for i in x.find_all('li')])
df['location'] = df[1].map(lambda x: [''.join(i.text.split(',')[1:]) for i in x.find_all('li')])
You will have to do some pandas magic after that
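Part of that pandas magic can be sketched as follows: hospital and location are parallel lists (both derived from the same find_all('li')), so they can be exploded together; multi-column explode assumes pandas >= 1.3. The specialty lists do not line up one-to-one with the li entries, so they would need an h4/ul pairing pass first.

```python
import pandas as pd

# Made-up single-year frame mirroring the list-valued columns built above.
df = pd.DataFrame({
    'grad_year': ['2019'],
    'hospital': [['St Joseph Hospital', 'Jersey City Medical Center']],
    'location': [['Denver, CO', 'Jersey City, NJ']],
})
# explode the two parallel list columns together: one row per placement
tidy = df.explode(['hospital', 'location']).reset_index(drop=True)
print(tidy)
```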

I don't know pandas. The following code can get the data in the table. I don't know if this will meet your needs.
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
url = 'https://msih.bgu.ac.il/md-program/residency-placements/'
response = requests.get(url)
doc = SimplifiedDoc(response.text)
divs = doc.getElementsByClass('accord-head')
datas = {}
for div in divs:
    grad_year = div.h2.text[-4:]
    rez_classe = div.getElementByClass('accord-con')
    h4s = rez_classe.h4s  # get the h4 elements
    for h4 in h4s:
        if not h4.next:
            continue
        lis = h4.next.lis
        specialty = h4.text
        hospital = [li.text for li in lis]
        # append rather than assign, so each year keeps all of its specialties
        datas.setdefault(grad_year, []).append({'specialty': specialty, 'hospital': hospital})
for data in datas:
    print(data, datas[data])

Related

Scrape Certain elements from HTML using Python and Beautifulsoup

So this is the HTML I'm working with:
<hr>
<b>1914 December 12 - </b>.
<ul>
<li>
<b>Birth of Herbert Hans Guendel</b> - .
<i>Nation</i>:
Germany,
USA.
<i>Related Persons</i>:
Guendel.
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York..
</li>
</ul>
I would like for it to look like this:
Birth of Herbert Hans Guendel
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York.
Here's my code:
from bs4 import BeautifulSoup
import requests
import linkMaker as linkMaker
url = linkMaker.link
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
with open("test1.txt", "w") as file:
    hrs = soup.find_all('hr')
    for hr in hrs:
        lis = soup.find_all('li')
        for li in lis:
            file.write(str(li.text) + str(hr.text) + "\n" + "\n" + "\n")
Here's what it's returning:
Birth of Herbert Hans Guendel - .
: Germany,
USA.
Related Persons: Guendel.
German-American engineer in WW2, member of the Rocket Team in the United States thereafter. German expert in guided missiles during WW2. As of January 1947, working at Fort Bliss, Texas. Died at Boston, New York..
My ultimate Goal is to get those two parts of the html tags to tweet them out.
Looking at the HTML snippet, for the title you can search for the first <b> inside the <li> tag. For the text you can take the last element of the <li> tag's .contents:
from bs4 import BeautifulSoup
html_doc = """\
<hr>
<b>1914 December 12 - </b>.
<ul>
<li>
<b>Birth of Herbert Hans Guendel</b> - .
<i>Nation</i>:
Germany,
USA.
<i>Related Persons</i>:
Guendel.
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York..
</li>
</ul>"""
soup = BeautifulSoup(html_doc, "html.parser")
title = soup.find("li").b.text
text = soup.find("li").contents[-1].strip(" .\n")
print(title)
print(text)
Prints:
Birth of Herbert Hans Guendel
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York

How do I remove the <a href... tags from my web scrapper

So, right now, I'm trying to scrape a table from rottentomatoes.com, but every time I run the code I'm facing an issue: it just prints <a href> tags. For now, all I want are the movie titles, numbered.
This is my code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {"Accept-Language": "en-US, en;q=0.5"}
titles = []
year_released = []
def get_requests():
    try:
        result = requests.get(url=url)
        soup = BeautifulSoup(result.text, 'html.parser')
        table = soup.find('table', class_='table')
        for name in table:
            td = soup.find_all('a', class_='unstyled articleLink')
            titles.append(td)
            print(titles)
            break
    except:
        print("The result could not get fetched")
And this is my output:
[[Opening This Week, Top Box Office, Coming Soon to Theaters, Weekend Earnings, Certified Fresh Movies, On Dvd & Streaming, VUDU, Netflix Streaming, iTunes, Amazon and Amazon Prime, Top DVD & Streaming, New Releases, Coming Soon to DVD, Certified Fresh Movies, Browse All, Top Movies, Trailers, Forums,
View All
,
View All
, Top TV Shows, Certified Fresh TV, 24 Frames, All-Time Lists, Binge Guide, Comics on TV, Countdown, Critics Consensus, Five Favorite Films, Now Streaming, Parental Guidance, Red Carpet Roundup, Scorecards, Sub-Cult, Total Recall, Video Interviews, Weekend Box Office, Weekly Ketchup, What to Watch, The Zeros, View All, View All, View All,
It Happened One Night (1934),
Citizen Kane (1941),
The Wizard of Oz (1939),
Modern Times (1936),
Black Panther (2018),
Parasite (Gisaengchung) (2019),
Avengers: Endgame (2019),
Casablanca (1942),
Knives Out (2019),
Us (2019),
Toy Story 4 (2019),
Lady Bird (2017),
Mission: Impossible - Fallout (2018),
BlacKkKlansman (2018),
Get Out (2017),
The Irishman (2019),
The Godfather (1972),
Mad Max: Fury Road (2015),
Spider-Man: Into the Spider-Verse (2018),
Moonlight (2016),
Sunset Boulevard (1950),
All About Eve (1950),
The Cabinet of Dr. Caligari (Das Cabinet des Dr. Caligari) (1920),
The Philadelphia Story (1940),
Roma (2018),
Wonder Woman (2017),
A Star Is Born (2018),
Inside Out (2015),
A Quiet Place (2018),
One Night in Miami (2020),
Eighth Grade (2018),
Rebecca (1940),
Booksmart (2019),
Logan (2017),
His Girl Friday (1940),
Portrait of a Lady on Fire (Portrait de la jeune fille en feu) (2020),
Coco (2017),
Dunkirk (2017),
Star Wars: The Last Jedi (2017),
A Night at the Opera (1935),
The Shape of Water (2017),
Thor: Ragnarok (2017),
Spotlight (2015),
The Farewell (2019),
Selma (2014),
The Third Man (1949),
Rear Window (1954),
E.T. The Extra-Terrestrial (1982),
Seven Samurai (Shichinin no Samurai) (1956),
La Grande illusion (Grand Illusion) (1938),
Arrival (2016),
Singin' in the Rain (1952),
The Favourite (2018),
Double Indemnity (1944),
All Quiet on the Western Front (1930),
Snow White and the Seven Dwarfs (1937),
Marriage Story (2019),
The Big Sick (2017),
On the Waterfront (1954),
Star Wars: Episode VII - The Force Awakens (2015),
An American in Paris (1951),
The Best Years of Our Lives (1946),
Metropolis (1927),
Boyhood (2014),
Gravity (2013),
Leave No Trace (2018),
The Maltese Falcon (1941),
The Invisible Man (2020),
12 Years a Slave (2013),
Once Upon a Time In Hollywood (2019),
Argo (2012),
Soul (2020),
Ma Rainey's Black Bottom (2020),
The Kid (1921),
Manchester by the Sea (2016),
Nosferatu, a Symphony of Horror (Nosferatu, eine Symphonie des Grauens) (Nosferatu the Vampire) (1922),
The Adventures of Robin Hood (1938),
La La Land (2016),
North by Northwest (1959),
Laura (1944),
Spider-Man: Far From Home (2019),
Incredibles 2 (2018),
Zootopia (2016),
Alien (1979),
King Kong (1933),
Shadow of a Doubt (1943),
Call Me by Your Name (2018),
Psycho (1960),
1917 (2020),
L.A. Confidential (1997),
The Florida Project (2017),
War for the Planet of the Apes (2017),
Paddington 2 (2018),
A Hard Day's Night (1964),
Widows (2018),
Never Rarely Sometimes Always (2020),
Baby Driver (2017),
Spider-Man: Homecoming (2017),
The Godfather, Part II (1974),
The Battle of Algiers (La Battaglia di Algeri) (1967), View All, View All]]
Reading tables via pandas.read_html() as suggested by @F.Hoque would probably be the leaner approach, but you can also get your results with BeautifulSoup only.
Iterate over all <tr> of the <table>, pick the information from the tags via .text / .get_text(), and store it structured in a list of dicts:
data = []
for row in soup.select('table.table tr')[1:]:
    data.append({
        'rank': row.td.text,
        'title': row.a.text.split(' (')[0].strip(),
        'releaseYear': row.a.text.split(' (')[1][:-1]
    })
Example
import requests
from bs4 import BeautifulSoup
url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {"Accept-Language": "en-US, en;q=0.5"}
result = requests.get(url=url)
soup = BeautifulSoup(result.text, 'html.parser')
data = []
for row in soup.select('table.table tr')[1:]:
    data.append({
        'rank': row.td.text,
        'title': row.a.text.split(' (')[0].strip(),
        'releaseYear': row.a.text.split(' (')[1][:-1]
    })
data
Output
[{'rank': '1.', 'title': 'It Happened One Night', 'releaseYear': '1934'},
{'rank': '2.', 'title': 'Citizen Kane', 'releaseYear': '1941'},
{'rank': '3.', 'title': 'The Wizard of Oz', 'releaseYear': '1939'},
{'rank': '4.', 'title': 'Modern Times', 'releaseYear': '1936'},
{'rank': '5.', 'title': 'Black Panther', 'releaseYear': '2018'},...]
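For completeness, a minimal sketch of the read_html route mentioned above, run on an inline table rather than the live page (whose markup may have changed since; read_html also needs an HTML parser such as lxml or html5lib installed):

```python
import io
import pandas as pd

# Tiny table mirroring the structure scraped above; the <th> row becomes the header.
html = '''<table class="table">
<tr><th>Rank</th><th>Title</th></tr>
<tr><td>1.</td><td>It Happened One Night (1934)</td></tr>
<tr><td>2.</td><td>Citizen Kane (1941)</td></tr>
</table>'''
df = pd.read_html(io.StringIO(html))[0]
print(df)
```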

BeautifulSoup not getting web data

I'm creating a web scraper in order to pull the name of a company from a chamber of commerce website directory.
I'm using BeautifulSoup. The page and soup objects appear to be working, but when I scrape the HTML content, an empty list is returned when it should be filled with the directory names on the page.
Web page trying to scrape: https://www.austinchamber.com/directory
Here is the HTML:
<div>
  <ul class="item-list item-list--small">
    <li>
      <div class='item-content'>
        <div class='item-description'>
          <h5 class='h5'>Women Helping Women LLC</h5>
Here is the python code:
def pageRequest(url):
    page = requests.get(url)
    return page
def htmlSoup(page):
    soup = BeautifulSoup(page.content, "html.parser")
    return soup
def getNames(soup):
    name = soup.find_all('h5', class_='h5')
    return name
page = pageRequest("https://www.austinchamber.com/directory")
soup = htmlSoup(page)
name = getNames(soup)
for n in name:
    print(n)
The data is loaded dynamically via Ajax. To get the data, you can use this script:
import json
import requests
url = 'https://www.austinchamber.com/api/v1/directory?filter[categories]=&filter[show]=all&page={page}&limit=24'
for page in range(1, 10):
    print('Page {}..'.format(page))
    data = requests.get(url.format(page=page)).json()
    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))
    for d in data['data']:
        print(d['title'])
Prints:
...
Indeed
Austin Telco Federal Credit Union - Taos
Green Bank
Seton Medical Center Austin
Austin Telco Federal Credit Union - Jollyville
Page 42..
Texas State SBDC - San Marcos Office
PlainsCapital Bank - Motor Bank
University of Texas - Thompson Conference Center
Lamb's Tire & Automotive Centers - #2 Research & Braker
AT&T Labs
Prosperity Bank - Rollingwood
Kerbey Lane Cafe - Central
Lamb's Tire & Automotive Centers - #9 Bee Caves
Seton Medical Center Hays
PlainsCapital Bank - North Austin
Ellis & Salazar Body Shop
aLamb's Tire & Automotive Centers - #6 Lake Creek
Rudy's Country Store and BarBQ
...
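The response handling can be seen offline on a minimal made-up payload (the shape is inferred from the loop above; the real entries carry more fields per item):

```python
import json

# The endpoint returns JSON whose 'data' key holds the directory entries.
payload = json.loads('{"data": [{"title": "Indeed"}, {"title": "Green Bank"}]}')
titles = [d['title'] for d in payload['data']]
print(titles)  # -> ['Indeed', 'Green Bank']
```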

How to pull links from within an 'a' tag

I have attempted several methods to pull links from the following webpage, but can't seem to find the desired links. From this webpage (https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1) I am attempting to extract all of the links for the "gamecast" button. The first one I would be attempting to get is: https://www.espn.com/college-football/game/_/gameId/401110723
When I try to just pull all links on the page, I do not even seem to get the desired ones at all, so I'm confused about where I'm going wrong. A few attempts I have made are below. The first method I tried:
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1')
soup = BeautifulSoup(page.text, 'html.parser')
# game_id = soup.find(name_='&lpos=college-football:scoreboard:gamecast')
game_id = soup.find('a',class_='button-alt sm')
Here is a second method I tried. Any help is greatly appreciated.
for a in soup.find_all('a'):
    if 'college-football' in a['href']:
        print(a['href'])
Edit: as a clarification I am attempting to pull all links that contain a gameID as in the example link.
The button with the link you are trying to get is loaded with JavaScript. The requests module does not execute the JavaScript in the HTML it retrieves, so you cannot scrape the button directly to find the links you desire (without a browser-automation tool like Selenium). However, I found JSON data in the HTML that contains the scoreboard data in which the link is located. If you are also looking to scrape more information (times, etc.) from this page, I highly recommend looking through the JSON data in the variable json_scoreboard in the code.
Code
import requests, re, json
from bs4 import BeautifulSoup
r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1')
soup = BeautifulSoup(r.text, 'html.parser')
scripts_head = soup.find('head').find_all('script')
all_links = {}
for script in scripts_head:
    if 'window.espn.scoreboardData' in script.text:
        json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
        for event in json_scoreboard['events']:
            name = event['name']
            for link in event['links']:
                if link['text'] == 'Gamecast':
                    gamecast = link['href']
                    all_links[name] = gamecast
print(all_links)
Output
{'Miami Hurricanes at Florida Gators': 'http://www.espn.com/college-football/game/_/gameId/401110723', 'Georgia Tech Yellow Jackets at Clemson Tigers': 'http://www.espn.com/college-football/game/_/gameId/401111653', 'Texas State Bobcats at Texas A&M Aggies': 'http://www.espn.com/college-football/game/_/gameId/401110731', 'Utah Utes at BYU Cougars': 'http://www.espn.com/college-football/game/_/gameId/401114223', 'Florida A&M Rattlers at UCF Knights': 'http://www.espn.com/college-football/game/_/gameId/401117853', 'Tulsa Golden Hurricane at Michigan State Spartans': 'http://www.espn.com/college-football/game/_/gameId/401112212', 'Wisconsin Badgers at South Florida Bulls': 'http://www.espn.com/college-football/game/_/gameId/401117856', 'Duke Blue Devils at Alabama Crimson Tide': 'http://www.espn.com/college-football/game/_/gameId/401110720', 'Georgia Bulldogs at Vanderbilt Commodores': 'http://www.espn.com/college-football/game/_/gameId/401110732', 'Florida Atlantic Owls at Ohio State Buckeyes': 'http://www.espn.com/college-football/game/_/gameId/401112251', 'Georgia Southern Eagles at LSU Tigers': 'http://www.espn.com/college-football/game/_/gameId/401110725', 'Middle Tennessee Blue Raiders at Michigan Wolverines': 'http://www.espn.com/college-football/game/_/gameId/401112222', 'Louisiana Tech Bulldogs at Texas Longhorns': 'http://www.espn.com/college-football/game/_/gameId/401112135', 'Oregon Ducks at Auburn Tigers': 'http://www.espn.com/college-football/game/_/gameId/401110722', 'Eastern Washington Eagles at Washington Huskies': 'http://www.espn.com/college-football/game/_/gameId/401114233', 'Idaho Vandals at Penn State Nittany Lions': 'http://www.espn.com/college-football/game/_/gameId/401112257', 'Miami (OH) RedHawks at Iowa Hawkeyes': 'http://www.espn.com/college-football/game/_/gameId/401112191', 'Northern Iowa Panthers at Iowa State Cyclones': 'http://www.espn.com/college-football/game/_/gameId/401112085', 'Syracuse Orange at Liberty Flames': 
'http://www.espn.com/college-football/game/_/gameId/401112434', 'New Mexico State Aggies at Washington State Cougars': 'http://www.espn.com/college-football/game/_/gameId/401114228', 'South Alabama Jaguars at Nebraska Cornhuskers': 'http://www.espn.com/college-football/game/_/gameId/401112238', 'Northwestern Wildcats at Stanford Cardinal': 'http://www.espn.com/college-football/game/_/gameId/401112245', 'Houston Cougars at Oklahoma Sooners': 'http://www.espn.com/college-football/game/_/gameId/401112114', 'Notre Dame Fighting Irish at Louisville Cardinals': 'http://www.espn.com/college-football/game/_/gameId/401112436'}
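The extraction step above can be shown in isolation (a minimal offline sketch with a made-up script body; the real scoreboardData object is far larger):

```python
import json
import re

# Pull a JSON object assigned to a JavaScript variable out of a <script>
# body with a regex, then parse it.
script_text = 'window.espn.scoreboardData = {"events": [{"name": "Demo Game"}]};'
match = re.search(r'({.*?});', script_text)
data = json.loads(match.group(1))
print(data['events'][0]['name'])  # -> Demo Game
```

Note that the non-greedy ({.*?}); stops at the first '};' in the text, which works here but would truncate the match if '};' ever appeared inside a string value.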

Python: html table content

I am trying to scrape this website but I keep getting an error when I try to print out just the content of the table.
soup = BeautifulSoup(urllib2.urlopen('http://clinicaltrials.gov/show/NCT01718158').read())
print soup('table')[6].prettify()
for row in soup('table')[6].findAll('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
IndexError Traceback (most recent call last)
<ipython-input-70-da84e74ab3b1> in <module>()
1 for row in soup('table')[6].findAll('tr'):
2 tds = row('td')
3 print tds[0].string,tds[1].string
4
IndexError: list index out of range
The table has a header row, with <th> header elements rather than <td> cells. Your code assumes there will always be <td> elements in each row, and that fails for the first row.
You could skip the row with not enough <td> elements:
for row in soup('table')[6].findAll('tr'):
    tds = row('td')
    if len(tds) < 2:
        continue
    print tds[0].string, tds[1].string
at which point you get output:
>>> for row in soup('table')[6].findAll('tr'):
...     tds = row('td')
...     if len(tds) < 2:
...         continue
...     print tds[0].string, tds[1].string
...
Responsible Party: Bristol-Myers Squibb
ClinicalTrials.gov Identifier: None
Other Study ID Numbers: AI452-021, 2011‐005409‐65
Study First Received: October 29, 2012
Last Updated: November 7, 2014
Health Authority: None
The last row contains text interspersed with <br/> elements; you could use the element.strings generator to extract all the strings and join them with newlines; I'd strip each string first, though:
>>> for row in soup('table')[6].findAll('tr'):
...     tds = row('td')
...     if len(tds) < 2:
...         continue
...     print tds[0].string, '\n'.join(filter(unicode.strip, tds[1].strings))
...
Responsible Party: Bristol-Myers Squibb
ClinicalTrials.gov Identifier: NCT01718158
History of Changes
Other Study ID Numbers: AI452-021, 2011‐005409‐65
Study First Received: October 29, 2012
Last Updated: November 7, 2014
Health Authority: United States: Institutional Review Board
United States: Food and Drug Administration
Argentina: Administracion Nacional de Medicamentos, Alimentos y Tecnologia Medica
France: Afssaps - Agence française de sécurité sanitaire des produits de santé (Saint-Denis)
Germany: Federal Institute for Drugs and Medical Devices
Germany: Ministry of Health
Israel: Israeli Health Ministry Pharmaceutical Administration
Israel: Ministry of Health
Italy: Ministry of Health
Italy: National Bioethics Committee
Italy: National Institute of Health
Italy: National Monitoring Centre for Clinical Trials - Ministry of Health
Italy: The Italian Medicines Agency
Japan: Pharmaceuticals and Medical Devices Agency
Japan: Ministry of Health, Labor and Welfare
Korea: Food and Drug Administration
Poland: National Institute of Medicines
Poland: Ministry of Health
Poland: Ministry of Science and Higher Education
Poland: Office for Registration of Medicinal Products, Medical Devices and Biocidal Products
Russia: FSI Scientific Center of Expertise of Medical Application
Russia: Ethics Committee
Russia: Ministry of Health of the Russian Federation
Spain: Spanish Agency of Medicines
Taiwan: Department of Health
Taiwan: National Bureau of Controlled Drugs
United Kingdom: Medicines and Healthcare Products Regulatory Agency
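The same row-skipping idea in Python 3 syntax, on a minimal inline table (hypothetical data mirroring the output above): the header row holds <th> cells, so it produces no <td>s and is skipped.

```python
from bs4 import BeautifulSoup

html = '''<table>
<tr><th>Field</th><th>Value</th></tr>
<tr><td>Responsible Party:</td><td>Bristol-Myers Squibb</td></tr>
</table>'''
soup = BeautifulSoup(html, 'html.parser')
pairs = []
for row in soup.find('table').find_all('tr'):
    tds = row('td')
    if len(tds) < 2:   # header row yields no <td> cells
        continue
    pairs.append((tds[0].string, tds[1].string))
print(pairs)
```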
