This is the HTML I'm working with:
<hr>
<b>1914 December 12 - </b>.
<ul>
<li>
<b>Birth of Herbert Hans Guendel</b> - .
<i>Nation</i>:
Germany,
USA.
<i>Related Persons</i>:
Guendel.
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York..
</li>
</ul>
I would like for it to look like this:
Birth of Herbert Hans Guendel
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York.
Here's my code:
from bs4 import BeautifulSoup
import requests
import linkMaker as linkMaker
url = linkMaker.link
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
with open("test1.txt", "w") as file:
    hrs = soup.find_all('hr')
    for hr in hrs:
        lis = soup.find_all('li')
        for li in lis:
            file.write(str(li.text) + str(hr.text) + "\n\n\n")
Here's what it's returning:
Birth of Herbert Hans Guendel - .
: Germany,
USA.
Related Persons: Guendel.
German-American engineer in WW2, member of the Rocket Team in the United States thereafter. German expert in guided missiles during WW2. As of January 1947, working at Fort Bliss, Texas. Died at Boston, New York..
My ultimate goal is to extract those two parts of the HTML so I can tweet them out.
Looking at the HTML snippet, for the title you can grab the first <b> inside the <li> tag. For the text you can take the last element of the <li> tag's .contents:
from bs4 import BeautifulSoup
html_doc = """\
<hr>
<b>1914 December 12 - </b>.
<ul>
<li>
<b>Birth of Herbert Hans Guendel</b> - .
<i>Nation</i>:
Germany,
USA.
<i>Related Persons</i>:
Guendel.
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York..
</li>
</ul>"""
soup = BeautifulSoup(html_doc, "html.parser")
title = soup.find("li").b.text
text = soup.find("li").contents[-1].strip(" .\n")
print(title)
print(text)
Prints:
Birth of Herbert Hans Guendel
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York
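Since the ultimate goal is tweeting these out, here is a minimal sketch of combining the two extracted parts into one tweet-sized string. The 280-character limit, the helper name, and the truncation strategy are my assumptions, not part of the question:

```python
def make_tweet(title, text, limit=280):
    """Combine title and body text, truncating with an ellipsis if too long."""
    tweet = title + "\n" + text
    if len(tweet) > limit:
        # leave room for the ellipsis character
        tweet = tweet[:limit - 1] + "…"
    return tweet

print(make_tweet("Birth of Herbert Hans Guendel",
                 "German-American engineer in WW2, member of the Rocket Team "
                 "in the United States thereafter."))
```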
Right now I'm trying to scrape a table from rottentomatoes.com, but every time I run the code it just prints the <a href> tags. For now, all I want are the movie titles, numbered.
This is my code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {"Accept-Language": "en-US, en;q=0.5"}
titles = []
year_released = []
def get_requests():
    try:
        result = requests.get(url=url)
        soup = BeautifulSoup(result.text, 'html.parser')
        table = soup.find('table', class_='table')
        for name in table:
            td = soup.find_all('a', class_='unstyled articleLink')
            titles.append(td)
            print(titles)
            break
    except:
        print("The result could not get fetched")
And this is my output:
[[Opening This Week, Top Box Office, Coming Soon to Theaters, Weekend Earnings, Certified Fresh Movies, On Dvd & Streaming, VUDU, Netflix Streaming, iTunes, Amazon and Amazon Prime, Top DVD & Streaming, New Releases, Coming Soon to DVD, Certified Fresh Movies, Browse All, Top Movies, Trailers, Forums,
View All
,
View All
, Top TV Shows, Certified Fresh TV, 24 Frames, All-Time Lists, Binge Guide, Comics on TV, Countdown, Critics Consensus, Five Favorite Films, Now Streaming, Parental Guidance, Red Carpet Roundup, Scorecards, Sub-Cult, Total Recall, Video Interviews, Weekend Box Office, Weekly Ketchup, What to Watch, The Zeros, View All, View All, View All,
It Happened One Night (1934),
Citizen Kane (1941),
The Wizard of Oz (1939),
Modern Times (1936),
Black Panther (2018),
Parasite (Gisaengchung) (2019),
Avengers: Endgame (2019),
Casablanca (1942),
Knives Out (2019),
Us (2019),
Toy Story 4 (2019),
Lady Bird (2017),
Mission: Impossible - Fallout (2018),
BlacKkKlansman (2018),
Get Out (2017),
The Irishman (2019),
The Godfather (1972),
Mad Max: Fury Road (2015),
Spider-Man: Into the Spider-Verse (2018),
Moonlight (2016),
Sunset Boulevard (1950),
All About Eve (1950),
The Cabinet of Dr. Caligari (Das Cabinet des Dr. Caligari) (1920),
The Philadelphia Story (1940),
Roma (2018),
Wonder Woman (2017),
A Star Is Born (2018),
Inside Out (2015),
A Quiet Place (2018),
One Night in Miami (2020),
Eighth Grade (2018),
Rebecca (1940),
Booksmart (2019),
Logan (2017),
His Girl Friday (1940),
Portrait of a Lady on Fire (Portrait de la jeune fille en feu) (2020),
Coco (2017),
Dunkirk (2017),
Star Wars: The Last Jedi (2017),
A Night at the Opera (1935),
The Shape of Water (2017),
Thor: Ragnarok (2017),
Spotlight (2015),
The Farewell (2019),
Selma (2014),
The Third Man (1949),
Rear Window (1954),
E.T. The Extra-Terrestrial (1982),
Seven Samurai (Shichinin no Samurai) (1956),
La Grande illusion (Grand Illusion) (1938),
Arrival (2016),
Singin' in the Rain (1952),
The Favourite (2018),
Double Indemnity (1944),
All Quiet on the Western Front (1930),
Snow White and the Seven Dwarfs (1937),
Marriage Story (2019),
The Big Sick (2017),
On the Waterfront (1954),
Star Wars: Episode VII - The Force Awakens (2015),
An American in Paris (1951),
The Best Years of Our Lives (1946),
Metropolis (1927),
Boyhood (2014),
Gravity (2013),
Leave No Trace (2018),
The Maltese Falcon (1941),
The Invisible Man (2020),
12 Years a Slave (2013),
Once Upon a Time In Hollywood (2019),
Argo (2012),
Soul (2020),
Ma Rainey's Black Bottom (2020),
The Kid (1921),
Manchester by the Sea (2016),
Nosferatu, a Symphony of Horror (Nosferatu, eine Symphonie des Grauens) (Nosferatu the Vampire) (1922),
The Adventures of Robin Hood (1938),
La La Land (2016),
North by Northwest (1959),
Laura (1944),
Spider-Man: Far From Home (2019),
Incredibles 2 (2018),
Zootopia (2016),
Alien (1979),
King Kong (1933),
Shadow of a Doubt (1943),
Call Me by Your Name (2018),
Psycho (1960),
1917 (2020),
L.A. Confidential (1997),
The Florida Project (2017),
War for the Planet of the Apes (2017),
Paddington 2 (2018),
A Hard Day's Night (1964),
Widows (2018),
Never Rarely Sometimes Always (2020),
Baby Driver (2017),
Spider-Man: Homecoming (2017),
The Godfather, Part II (1974),
The Battle of Algiers (La Battaglia di Algeri) (1967), View All, View All]]
Reading tables via pandas.read_html() as suggested by @F.Hoque would probably be the leaner approach, but you can also get your results with BeautifulSoup only.
Iterate over all <tr> of the <table>, pick the information from the tags via .text / .get_text(), and store it structured in a list of dicts:
data = []
for row in soup.select('table.table tr')[1:]:
    data.append({
        'rank': row.td.text,
        'title': row.a.text.split(' (')[0].strip(),
        'releaseYear': row.a.text.split(' (')[1][:-1]
    })
Example
import requests
from bs4 import BeautifulSoup
url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {"Accept-Language": "en-US, en;q=0.5"}
result = requests.get(url=url)
soup = BeautifulSoup(result.text, 'html.parser')
data = []
for row in soup.select('table.table tr')[1:]:
    data.append({
        'rank': row.td.text,
        'title': row.a.text.split(' (')[0].strip(),
        'releaseYear': row.a.text.split(' (')[1][:-1]
    })
data
Output
[{'rank': '1.', 'title': 'It Happened One Night', 'releaseYear': '1934'},
{'rank': '2.', 'title': 'Citizen Kane', 'releaseYear': '1941'},
{'rank': '3.', 'title': 'The Wizard of Oz', 'releaseYear': '1939'},
{'rank': '4.', 'title': 'Modern Times', 'releaseYear': '1936'},
{'rank': '5.', 'title': 'Black Panther', 'releaseYear': '2018'},...]
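One caveat with the split(' (') parsing above: it assumes the title itself contains no parentheses, which breaks for entries like Parasite (Gisaengchung) (2019). A regex anchored at the end of the string is a more robust sketch (the helper name is mine):

```python
import re

def split_title_year(text):
    """Split 'Title (Year)' where the year is the final parenthesised group."""
    m = re.match(r'^(.*)\s\((\d{4})\)$', text.strip())
    if m:
        return m.group(1), m.group(2)
    return text.strip(), None  # no trailing (year) found

print(split_title_year("Parasite (Gisaengchung) (2019)"))
# → ('Parasite (Gisaengchung)', '2019')
```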
I'm creating a web scraper to pull the names of companies from a chamber of commerce website directory.
I'm using BeautifulSoup. The page and soup objects appear to be working, but when I scrape the HTML content, an empty list is returned when it should be filled with the directory names on the page.
Web page trying to scrape: https://www.austinchamber.com/directory
Here is the HTML:
<div>
    <ul class="item-list item-list--small">
        <li>
            <div class='item-content'>
                <div class='item-description'>
                    <h5 class='h5'>Women Helping Women LLC</h5>
Here is the python code:
def pageRequest(url):
    page = requests.get(url)
    return page

def htmlSoup(page):
    soup = BeautifulSoup(page.content, "html.parser")
    return soup

def getNames(soup):
    name = soup.find_all('h5', class_='h5')
    return name

page = pageRequest("https://www.austinchamber.com/directory")
soup = htmlSoup(page)
name = getNames(soup)
for n in name:
    print(n)
The data is loaded dynamically via Ajax. To get the data, you can use this script:
import json
import requests

url = 'https://www.austinchamber.com/api/v1/directory?filter[categories]=&filter[show]=all&page={page}&limit=24'

for page in range(1, 10):
    print('Page {}..'.format(page))
    data = requests.get(url.format(page=page)).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for d in data['data']:
        print(d['title'])
Prints:
...
Indeed
Austin Telco Federal Credit Union - Taos
Green Bank
Seton Medical Center Austin
Austin Telco Federal Credit Union - Jollyville
Page 42..
Texas State SBDC - San Marcos Office
PlainsCapital Bank - Motor Bank
University of Texas - Thompson Conference Center
Lamb's Tire & Automotive Centers - #2 Research & Braker
AT&T Labs
Prosperity Bank - Rollingwood
Kerbey Lane Cafe - Central
Lamb's Tire & Automotive Centers - #9 Bee Caves
Seton Medical Center Hays
PlainsCapital Bank - North Austin
Ellis & Salazar Body Shop
aLamb's Tire & Automotive Centers - #6 Lake Creek
Rudy's Country Store and BarBQ
...
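Rather than hard-coding range(1, 10), you could keep requesting pages until the API returns an empty data list. A sketch with the fetch call stubbed out so it runs offline; in the real script, fetch_page would be lambda p: requests.get(url.format(page=p)).json():

```python
def collect_titles(fetch_page, max_pages=1000):
    """Request successive pages until one comes back with no results."""
    titles = []
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        if not data.get('data'):   # empty page -> we've run out of results
            break
        titles.extend(d['title'] for d in data['data'])
    return titles

# Stubbed example: two pages of results, then empty pages after that.
pages = {1: {'data': [{'title': 'Indeed'}, {'title': 'Green Bank'}]},
         2: {'data': [{'title': 'AT&T Labs'}]}}
print(collect_titles(lambda p: pages.get(p, {'data': []})))
# → ['Indeed', 'Green Bank', 'AT&T Labs']
```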
I have attempted several methods to pull links from the following webpage, but can't seem to find the desired links. From this webpage (https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1) I am attempting to extract all of the links for the "Gamecast" button. The first one I would be attempting to get is this: https://www.espn.com/college-football/game/_/gameId/401110723
When I try to just pull all links on the page I do not even seem to get the desired ones at all, so I'm confused where I'm going wrong here. A few attempts I have made below that don't seem to be pulling in what I want. First method I tried below.
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1')
soup = BeautifulSoup(page.text, 'html.parser')
# game_id = soup.find(name_='&lpos=college-football:scoreboard:gamecast')
game_id = soup.find('a',class_='button-alt sm')
Here is a second method I tried. Any help is greatly appreciated.
for a in soup.find_all('a'):
    if 'college-football' in a.get('href', ''):
        print(a['href'])
Edit: as a clarification I am attempting to pull all links that contain a gameID as in the example link.
The button with the link you are trying to get is loaded with JavaScript. The requests module does not execute the JavaScript in the HTML it fetches, so you cannot scrape the button directly to find the links you desire (without a browser-automation tool like Selenium). However, I found JSON data in the HTML that contains the scoreboard data the links are located in. If you are also looking to scrape more information (times, etc.) from this page, I highly recommend looking through the JSON data in the variable json_scoreboard in the code.
Code
import requests, re, json
from bs4 import BeautifulSoup

r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1')
soup = BeautifulSoup(r.text, 'html.parser')
scripts_head = soup.find('head').find_all('script')
all_links = {}
for script in scripts_head:
    if 'window.espn.scoreboardData' in script.text:
        json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
        for event in json_scoreboard['events']:
            name = event['name']
            for link in event['links']:
                if link['text'] == 'Gamecast':
                    gamecast = link['href']
                    all_links[name] = gamecast
print(all_links)
Output
{'Miami Hurricanes at Florida Gators': 'http://www.espn.com/college-football/game/_/gameId/401110723', 'Georgia Tech Yellow Jackets at Clemson Tigers': 'http://www.espn.com/college-football/game/_/gameId/401111653', 'Texas State Bobcats at Texas A&M Aggies': 'http://www.espn.com/college-football/game/_/gameId/401110731', 'Utah Utes at BYU Cougars': 'http://www.espn.com/college-football/game/_/gameId/401114223', 'Florida A&M Rattlers at UCF Knights': 'http://www.espn.com/college-football/game/_/gameId/401117853', 'Tulsa Golden Hurricane at Michigan State Spartans': 'http://www.espn.com/college-football/game/_/gameId/401112212', 'Wisconsin Badgers at South Florida Bulls': 'http://www.espn.com/college-football/game/_/gameId/401117856', 'Duke Blue Devils at Alabama Crimson Tide': 'http://www.espn.com/college-football/game/_/gameId/401110720', 'Georgia Bulldogs at Vanderbilt Commodores': 'http://www.espn.com/college-football/game/_/gameId/401110732', 'Florida Atlantic Owls at Ohio State Buckeyes': 'http://www.espn.com/college-football/game/_/gameId/401112251', 'Georgia Southern Eagles at LSU Tigers': 'http://www.espn.com/college-football/game/_/gameId/401110725', 'Middle Tennessee Blue Raiders at Michigan Wolverines': 'http://www.espn.com/college-football/game/_/gameId/401112222', 'Louisiana Tech Bulldogs at Texas Longhorns': 'http://www.espn.com/college-football/game/_/gameId/401112135', 'Oregon Ducks at Auburn Tigers': 'http://www.espn.com/college-football/game/_/gameId/401110722', 'Eastern Washington Eagles at Washington Huskies': 'http://www.espn.com/college-football/game/_/gameId/401114233', 'Idaho Vandals at Penn State Nittany Lions': 'http://www.espn.com/college-football/game/_/gameId/401112257', 'Miami (OH) RedHawks at Iowa Hawkeyes': 'http://www.espn.com/college-football/game/_/gameId/401112191', 'Northern Iowa Panthers at Iowa State Cyclones': 'http://www.espn.com/college-football/game/_/gameId/401112085', 'Syracuse Orange at Liberty Flames': 
'http://www.espn.com/college-football/game/_/gameId/401112434', 'New Mexico State Aggies at Washington State Cougars': 'http://www.espn.com/college-football/game/_/gameId/401114228', 'South Alabama Jaguars at Nebraska Cornhuskers': 'http://www.espn.com/college-football/game/_/gameId/401112238', 'Northwestern Wildcats at Stanford Cardinal': 'http://www.espn.com/college-football/game/_/gameId/401112245', 'Houston Cougars at Oklahoma Sooners': 'http://www.espn.com/college-football/game/_/gameId/401112114', 'Notre Dame Fighting Irish at Louisville Cardinals': 'http://www.espn.com/college-football/game/_/gameId/401112436'}
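Since the clarification says the real target is the links containing a gameId, the numeric IDs can also be pulled out of the collected hrefs with a small regex; a sketch (the helper name is mine):

```python
import re

def extract_game_ids(hrefs):
    """Pull the numeric gameId out of each Gamecast href."""
    ids = []
    for href in hrefs:
        m = re.search(r'/gameId/(\d+)', href)
        if m:
            ids.append(m.group(1))
    return ids

hrefs = ['http://www.espn.com/college-football/game/_/gameId/401110723',
         'http://www.espn.com/college-football/game/_/gameId/401111653']
print(extract_game_ids(hrefs))
# → ['401110723', '401111653']
```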
I am trying to scrape this website, but I keep getting an error when I try to print out just the content of the table.
soup = BeautifulSoup(urllib2.urlopen('http://clinicaltrials.gov/show/NCT01718158').read())
print soup('table')[6].prettify()

for row in soup('table')[6].findAll('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
IndexError                                Traceback (most recent call last)
<ipython-input-70-da84e74ab3b1> in <module>()
      1 for row in soup('table')[6].findAll('tr'):
      2     tds = row('td')
----> 3     print tds[0].string, tds[1].string
      4
IndexError: list index out of range
The table has a header row, with <th> header elements rather than <td> cells. Your code assumes there will always be <td> elements in each row, and that fails for the first row.
You could skip the row with not enough <td> elements:
for row in soup('table')[6].findAll('tr'):
    tds = row('td')
    if len(tds) < 2:
        continue
    print tds[0].string, tds[1].string
at which point you get output:
>>> for row in soup('table')[6].findAll('tr'):
...     tds = row('td')
...     if len(tds) < 2:
...         continue
...     print tds[0].string, tds[1].string
...
Responsible Party: Bristol-Myers Squibb
ClinicalTrials.gov Identifier: None
Other Study ID Numbers: AI452-021, 2011‐005409‐65
Study First Received: October 29, 2012
Last Updated: November 7, 2014
Health Authority: None
The last row contains text interspersed with <br/> elements; you could use the element.strings generator to extract all strings and perhaps join them into newlines; I'd strip each string first though:
>>> for row in soup('table')[6].findAll('tr'):
...     tds = row('td')
...     if len(tds) < 2:
...         continue
...     print tds[0].string, '\n'.join(filter(unicode.strip, tds[1].strings))
...
Responsible Party: Bristol-Myers Squibb
ClinicalTrials.gov Identifier: NCT01718158
History of Changes
Other Study ID Numbers: AI452-021, 2011‐005409‐65
Study First Received: October 29, 2012
Last Updated: November 7, 2014
Health Authority: United States: Institutional Review Board
United States: Food and Drug Administration
Argentina: Administracion Nacional de Medicamentos, Alimentos y Tecnologia Medica
France: Afssaps - Agence française de sécurité sanitaire des produits de santé (Saint-Denis)
Germany: Federal Institute for Drugs and Medical Devices
Germany: Ministry of Health
Israel: Israeli Health Ministry Pharmaceutical Administration
Israel: Ministry of Health
Italy: Ministry of Health
Italy: National Bioethics Committee
Italy: National Institute of Health
Italy: National Monitoring Centre for Clinical Trials - Ministry of Health
Italy: The Italian Medicines Agency
Japan: Pharmaceuticals and Medical Devices Agency
Japan: Ministry of Health, Labor and Welfare
Korea: Food and Drug Administration
Poland: National Institute of Medicines
Poland: Ministry of Health
Poland: Ministry of Science and Higher Education
Poland: Office for Registration of Medicinal Products, Medical Devices and Biocidal Products
Russia: FSI Scientific Center of Expertise of Medical Application
Russia: Ethics Committee
Russia: Ministry of Health of the Russian Federation
Spain: Spanish Agency of Medicines
Taiwan: Department of Health
Taiwan: National Bureau of Controlled Drugs
United Kingdom: Medicines and Healthcare Products Regulatory Agency
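To see the .strings technique in isolation, here is a self-contained Python 3 sketch on a minimal table cell containing <br/> elements (str.strip replaces unicode.strip, which only exists in Python 2):

```python
from bs4 import BeautifulSoup

html = """<table><tr>
<td>Health Authority:</td>
<td> United States: Institutional Review Board<br/>
Argentina: ANMAT<br/>
</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")
tds = soup('tr')[0]('td')
# .strings yields every text fragment around the <br/> tags; drop the
# whitespace-only pieces, strip the rest, and join with newlines.
value = '\n'.join(s.strip() for s in tds[1].strings if s.strip())
print(tds[0].string, value)
```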