BeautifulSoup not getting web data - python

I'm creating a web scraper in order to pull the name of a company from a chamber of commerce website directory.
I'm using BeautifulSoup. The page and soup objects appear to be working, but when I scrape the HTML content, an empty list is returned when it should be filled with the directory names on the page.
Web page trying to scrape: https://www.austinchamber.com/directory
Here is the HTML:
<div>
<ul class="item-list item-list--small">
<li>
<div class='item-content'>
<div class='item-description'>
<h5 class = 'h5'>Women Helping Women LLC</h5>
Here is the python code:
import requests
from bs4 import BeautifulSoup

def pageRequest(url):
    page = requests.get(url)
    return page

def htmlSoup(page):
    soup = BeautifulSoup(page.content, "html.parser")
    return soup

def getNames(soup):
    name = soup.find_all('h5', class_='h5')
    return name

page = pageRequest("https://www.austinchamber.com/directory")
soup = htmlSoup(page)
name = getNames(soup)
for n in name:
    print(n)

The data is loaded dynamically via Ajax. To get the data, you can use this script:
import json
import requests
url = 'https://www.austinchamber.com/api/v1/directory?filter[categories]=&filter[show]=all&page={page}&limit=24'
for page in range(1, 10):
    print('Page {}..'.format(page))
    data = requests.get(url.format(page=page)).json()
    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))
    for d in data['data']:
        print(d['title'])
Prints:
...
Indeed
Austin Telco Federal Credit Union - Taos
Green Bank
Seton Medical Center Austin
Austin Telco Federal Credit Union - Jollyville
Page 42..
Texas State SBDC - San Marcos Office
PlainsCapital Bank - Motor Bank
University of Texas - Thompson Conference Center
Lamb's Tire & Automotive Centers - #2 Research & Braker
AT&T Labs
Prosperity Bank - Rollingwood
Kerbey Lane Cafe - Central
Lamb's Tire & Automotive Centers - #9 Bee Caves
Seton Medical Center Hays
PlainsCapital Bank - North Austin
Ellis & Salazar Body Shop
Lamb's Tire & Automotive Centers - #6 Lake Creek
Rudy's Country Store and BarBQ
...
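The script above hard-codes the page range. As a minimal sketch, assuming the API returns an empty `data` list once the last page is passed (not verified against the real endpoint), the loop can stop by itself. `iter_titles` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the real request so the sketch runs offline.

```python
from typing import Callable, Iterator


def iter_titles(fetch_page: Callable[[int], dict], max_pages: int = 1000) -> Iterator[str]:
    """Yield every title, stopping at the first page with no entries."""
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        entries = payload.get("data", [])
        if not entries:  # assumed end-of-results signal
            break
        for entry in entries:
            yield entry["title"]


# Hypothetical stand-in for the real request. With requests it could be:
#   lambda p: requests.get(url.format(page=p)).json()
def fake_fetch(page: int) -> dict:
    pages = {
        1: [{"title": "Indeed"}, {"title": "Green Bank"}],
        2: [{"title": "AT&T Labs"}],
    }
    return {"data": pages.get(page, [])}


titles = list(iter_titles(fake_fetch))
print(titles)
```

The `max_pages` cap is only a safety net in case the assumed stop condition never triggers.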

Related

Scrape Certain elements from HTML using Python and Beautifulsoup

So this is the html I'm working with
<hr>
<b>1914 December 12 - </b>.
<ul>
<li>
<b>Birth of Herbert Hans Guendel</b> - .
<i>Nation</i>:
Germany,
USA.
<i>Related Persons</i>:
Guendel.
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York..
</li>
</ul>
I would like for it to look like this:
Birth of Herbert Hans Guendel
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York.
Here's my code:
from bs4 import BeautifulSoup
import requests
import linkMaker as linkMaker
url = linkMaker.link
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
with open("test1.txt", "w") as file:
    hrs = soup.find_all('hr')
    for hr in hrs:
        lis = soup.find_all('li')
        for li in lis:
            file.write(str(li.text)+str(hr.text)+"\n"+"\n"+"\n")
Here's what it's returning:
Birth of Herbert Hans Guendel - .
: Germany,
USA.
Related Persons: Guendel.
German-American engineer in WW2, member of the Rocket Team in the United States thereafter. German expert in guided missiles during WW2. As of January 1947, working at Fort Bliss, Texas. Died at Boston, New York..
My ultimate goal is to get those two parts of the HTML so I can tweet them out.
Looking at the HTML snippet, the title is the first <b> inside the <li> tag, and the description is the last item in the <li> tag's .contents:
from bs4 import BeautifulSoup
html_doc = """\
<hr>
<b>1914 December 12 - </b>.
<ul>
<li>
<b>Birth of Herbert Hans Guendel</b> - .
<i>Nation</i>:
Germany,
USA.
<i>Related Persons</i>:
Guendel.
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York..
</li>
</ul>"""
soup = BeautifulSoup(html_doc, "html.parser")
title = soup.find("li").b.text
text = soup.find("li").contents[-1].strip(" .\n")
print(title)
print(text)
Prints:
Birth of Herbert Hans Guendel
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York
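Since the stated goal is to tweet the two parts, here is a small sketch of joining them into one message. It assumes a 280-character limit and uses only the standard library; `make_tweet` and `TWEET_LIMIT` are hypothetical names, and posting the message is out of scope.

```python
import textwrap

TWEET_LIMIT = 280  # assumed character limit for a standard post


def make_tweet(title: str, text: str, limit: int = TWEET_LIMIT) -> str:
    """Join title and description, cutting at a word boundary if too long."""
    combined = f"{title}: {text}"
    # shorten() collapses runs of whitespace (including newlines) and
    # appends the placeholder only when it has to cut the text.
    return textwrap.shorten(combined, width=limit, placeholder="...")


title = "Birth of Herbert Hans Guendel"
text = (
    "German-American engineer in WW2, member of the Rocket Team in the United\n"
    "States thereafter. German expert in guided missiles during WW2."
)
tweet = make_tweet(title, text)
print(tweet)
```

Because `shorten()` also flattens the line breaks that come from the HTML, the result is a single line.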

python BeautifulSoup Wikipedia Webscraping - learning

I am learning Python and BeautifulSoup.
I am trying to do some web scraping.
Let me first describe what I am trying to do.
The wiki page: https://en.m.wikipedia.org/wiki/List_of_largest_banks
I am trying to print out the
<span class="mw-headline" id="By_market_capitalization" tabindex="0" role="button" aria-controls="content-collapsible-block-1" aria-expanded="true">By market capitalization</span>
I want to print out the text: By market capitalization
Then the text of the table of the banks:
Example:
By market capitalization
Rank
Bank
Cap Rate
1
JP Morgan
466.1
2
Bank of China
300
all the way to 50
My code starts out like this:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
# text = soup.find('span', class_='mw-headline', id='By_market_capitalization').text
Ak_soup = soup.find_all('section', class_='mf-section-2 collapsible-block open-block', id='content-collapsible-block-1')
print(Ak_soup)
I believe my problem is more on the html side of things:
But I am completely lost:
I inspected the element and the tags that I believe to look for are
{section class_='mf-section-2 collapsible-block open-block'}
You're close to your goal. Find the heading, then its next table, and convert that table to a DataFrame with pandas.read_html().
header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(str(header.find_next('table')))[0]
or
header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(html_text, match='Market cap')[0]
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
header = soup.select_one('h2:has(>#By_market_capitalization)')
print(header.span.text)
print(pd.read_html(str(header.find_next('table')))[0].to_markdown(index=False))
Output
By market capitalization
| Rank | Bank name | Market cap(US$ billion) |
|------|-----------|-------------------------|
| 1 | JPMorgan Chase | 466.21[5] |
| 2 | Industrial and Commercial Bank of China | 295.65 |
| 3 | Bank of America | 279.73 |
| 4 | Wells Fargo | 214.34 |
| 5 | China Construction Bank | 207.98 |
| 6 | Agricultural Bank of China | 181.49 |
| 7 | HSBC Holdings PLC | 169.47 |
| 8 | Citigroup Inc. | 163.58 |
| 9 | Bank of China | 151.15 |
| 10 | China Merchants Bank | 133.37 |
| 11 | Royal Bank of Canada | 113.80 |
| 12 | Toronto-Dominion Bank | 106.61 |
...
Since you already know the desired header, you can print it directly. Then, with pandas, you can use a unique search term from the target table as a more direct way to select it:
import pandas as pd
df = pd.read_html('https://en.m.wikipedia.org/wiki/List_of_largest_banks', match = 'Market cap')[0].reset_index(level = 0, drop = True)
print('By market capitalization')
print()
print(df.to_markdown(index = False))
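The `match` argument can be tried offline on a small inline document. The sketch below is only an illustration: the HTML and the bank names in it are made up, and it assumes pandas has a parser backend such as lxml installed. Wrapping the markup in `StringIO` avoids the deprecation warning that recent pandas versions raise when literal HTML is passed as a string. The final loop prints one cell per line, as in the question's example.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for the Wikipedia page, with two tables.
html = """
<h2><span id="By_market_capitalization">By market capitalization</span></h2>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
  <tr><td>1</td><td>Example Bank A</td><td>100.5</td></tr>
  <tr><td>2</td><td>Example Bank B</td><td>90.0</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Total assets</th></tr>
  <tr><td>1</td><td>Example Bank C</td><td>5000</td></tr>
</table>
"""

# match= keeps only the tables whose text contains the given string.
tables = pd.read_html(StringIO(html), match="Market cap")
df = tables[0]

print("By market capitalization")
print(*df.columns, sep="\n")
for row in df.itertuples(index=False):
    print(*row, sep="\n")
```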

Webscraping past a show more button that extends the page

I'm trying to scrape data from Elle.com under a search term. I noticed when I click the button, it sends a request that updates the &page=2 in the url. However, the following code just gets me a lot of duplicate entries. I need help finding a way to set a start point for each iteration of the loop (I think). Any ideas?
import requests,nltk,pandas as pd
from bs4 import BeautifulSoup as bs
def get_hits(url):
    r = requests.get(url)
    soup = bs(r.content, 'html')
    body = []
    for p in soup.find_all('p',{'class':'body-text'}):
        sentences = nltk.sent_tokenize(p.text)
        result1 = [s for s in sentences if 'kim' in s]
        body.append(result1)
        result2 = [s for s in sentences if 'kanye' in s]
        body.append(result2)
    body = [a for a in body if a!=[]]
    if body == []:
        body.append("no hits")
    return body
titles =[]
key_hits = []
urls = []
counter = 1
for i in range(1,10):
    url = f'https://www.elle.com/search/?page={i}&q=kanye'
    r = requests.get(url)
    soup = bs(r.content, 'html')
    groups = soup.find_all('div',{'class':'simple-item grid-simple-item'})
    for j in range(len(groups)):
        urls.append('https://www.elle.com'+ groups[j].find('a')['href'])
        titles.append(groups[j].find('div',{'class':'simple-item-title item-title'}).text)
        key_hits.append(get_hits('https://www.elle.com'+ groups[j].find('a')['href']))
        if (counter == 100):
            break
        counter+=1
data = pd.DataFrame({
    'Title':titles,
    'Body':key_hits,
    'Links':urls
})
data.head()
Let me know if there's something I don't understand that I probably should. Just a marketing researcher trying to learn powerful tools here.
To get pagination working on the site, you can use their infinite-scroll API URL (this example will print 9*42 titles):
import requests
from bs4 import BeautifulSoup
api_url = "https://www.elle.com/ajax/infiniteload/"
params = {
    "id": "search",
    "class": "CoreModels\\search\\TagQueryModel",
    "viewset": "search",
    "trackingId": "search-results",
    "trackingLabel": "kanye",
    "params": '{"input":"kanye","page_size":"42"}',
    "page": "1",
    "cachebuster": "undefined",
}
all_titles = set()
for page in range(1, 10):
    params["page"] = page
    soup = BeautifulSoup(
        requests.get(api_url, params=params).content, "html.parser"
    )
    for title in soup.select(".item-title"):
        print(title.text)
        all_titles.add(title.text)
print()
print("Unique titles:", len(all_titles)) # <-- 9 * 42 = 378
Prints:
...
Kim Kardashian and Kanye West Respond to Those Divorce Rumors
People Are Noticing Something Fishy About Taylor Swift's Response to Kim Kardashian
Kim Kardashian Just Went on an Intense Twitter Rant Defending Kanye West
Trump Is Finally Able to Secure a Meeting With a Kim
Kim Kardashian West is Modeling Yeezy on the Street Again
Aziz Ansari's Willing to Model Kanye's Clothes
Unique titles: 378
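A set removes repeated titles but loses the page order and the links. If the duplicates from the question need to be filtered while keeping both, one possible sketch is to track the URLs already seen. `unique_in_order` is a hypothetical helper, and the records below are made up for illustration.

```python
def unique_in_order(records):
    """Drop repeated (title, url) records, keeping first-seen order."""
    seen = set()
    unique = []
    for title, url in records:
        if url in seen:  # the URL is treated as the identity of an article
            continue
        seen.add(url)
        unique.append((title, url))
    return unique


# Made-up records: the second page repeats one article from the first.
scraped = [
    ("Article one", "https://www.elle.com/a1/"),
    ("Article two", "https://www.elle.com/a2/"),
    ("Article one", "https://www.elle.com/a1/"),
    ("Article three", "https://www.elle.com/a3/"),
]
for title, url in unique_in_order(scraped):
    print(title, url)
```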
The "load more" pagination is driven by API calls that return plain HTML. Each article link in that response is a relative URL, so I convert it to an absolute URL with the urljoin method. The pagination itself is built into the api_urls list.
Code:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
api_urls = ["https://www.elle.com/ajax/infiniteload/?id=search&class=CoreModels%5Csearch%5CTagQueryModel&viewset=search&trackingId=search-results&trackingLabel=kanye&params=%7B%22input%22%3A%22kanye%22%2C%22page_size%22%3A%2242%22%7D&page="+str(x)+"&cachebuster=undefined" for x in range(1,4)]
# The relative links belong to elle.com, so that is the base to join against
Base_url = "https://www.elle.com"
for url in api_urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.content,"lxml")
    cards = soup.select("div.simple-item.grid-simple-item")
    for card in cards:
        title = card.select_one("div.simple-item-title.item-title")
        p = card.select_one("a")
        l=p['href']
        abs_link=urljoin(Base_url,l)
        print("Title:" + title.text + " Links: " + abs_link)
    print("-" * 80)
Output:
Title:Inside Kim Kardashian and Kanye West’s Current Relationship Amid Dinner Sighting Links: https://www.elle.com/culture/celebrities/a37833256/kim-kardashian-kanye-west-reconciled/
Title:Kim Kardashian And Ex Kanye West Left For SNL Together Amid Reports of Reconciliation Efforts Links: https://www.elle.com/culture/celebrities/a37919434/kim-kardashian-kanye-west-leave-for-snl-together-reconciliation/
Title:Kim Kardashian Wore a Purple Catsuit for Dinner With Kanye West Amid Reports She's Open to Reconciling Links: https://www.elle.com/culture/celebrities/a37822625/kim-kardashian-kanye-west-nobu-dinner-september-2021/
Title:How Kim Kardashian Really Feels About Kanye West Saying He ‘Wants Her Back’ Now Links: https://www.elle.com/culture/celebrities/a37463258/kim-kardashian-kanye-west-reconciliation-feelings-september-2021/
Title:Why Irina Shayk and Kanye West Called Off Their Two-Month Romance Links: https://www.elle.com/culture/celebrities/a37366860/why-irina-shayk-kanye-west-broke-up-august-2021/
Title:Kim Kardashian and Kanye West Reportedly Are ‘Working on Rebuilding’ Relationship and May Call Off Divorce Links: https://www.elle.com/culture/celebrities/a37421190/kim-kardashian-kanye-west-repairing-relationship-divorce-august-2021/
Title:What Kim Kardashian and Kanye West's ‘Donda’ Wedding Moment Really Means for Their Relationship Links: https://www.elle.com/culture/celebrities/a37415557/kim-kardashian-kanye-west-donda-wedding-moment-explained/
Title:What Kim Kardashian and Kanye West's Relationship Is Like Now: ‘The Tension Has Subsided’ Links: https://www.elle.com/culture/celebrities/a37383301/kim-kardashian-kanye-west-relationship-details-august-2021/
Title:How Kim Kardashian and Kanye West’s Relationship as Co-Parents Has Evolved Links: https://www.elle.com/culture/celebrities/a37250155/kim-kardashian-kanye-west-co-parents/
Title:Kim Kardashian Went Out in a Giant Shaggy Coat and a Black Wrap Top for Dinner in NYC Links: https://www.elle.com/culture/celebrities/a37882897/kim-kardashian-shaggy-coat-black-outfit-nyc-dinner/
Title:Kim Kardashian Wore Two Insane, Winter-Ready Outfits in One Warm NYC Day Links: https://www.elle.com/culture/celebrities/a37906750/kim-kardashian-overdressed-fall-outfits-october-2021/
Title:Kim Kardashian Dressed Like a Superhero for Justin Bieber's 2021 Met Gala After Party Links: https://www.elle.com/culture/celebrities/a37593656/kim-kardashian-superhero-outfit-met-gala-after-party-2021/
Title:Kim Kardashian Killed It In Her Debut as a Saturday Night Live Host Links: https://www.elle.com/culture/celebrities/a37918950/kim-kardashian-saturday-night-live-best-sketches/
Title:Kim Kardashian Has Been Working ‘20 Hours a Day’ For Her Appearance On SNL Links: https://www.elle.com/culture/celebrities/a37915962/kim-kardashian-saturday-night-live-preperation/
Title:Why Taylor Swift and Joe Alwyn Skipped the 2021 Met Gala Links: https://www.elle.com/culture/celebrities/a37446411/why-taylor-swift-joe-alwyn-skipped-met-gala-2021/
Title:Kim Kardashian Says North West Still Wants to Be an Only Child Five Years Into Having Siblings Links: https://www.elle.com/culture/celebrities/a37620539/kim-kardashian-north-west-only-child-comment-september-2021/
Title:How Kim Kardashian's Incognito 2021 Met Gala Glam Came Together Links: https://www.elle.com/beauty/makeup-skin-care/a37584576/kim-kardashians-incognito-2021-met-gala-beauty-breakdown/
Title:Kim Kardashian Completely Covered Her Face and Everything in a Black Balenciaga Look at the 2021 Met Gala Links: https://www.elle.com/culture/celebrities/a37578520/kim-kardashian-faceless-outfit-met-gala-2021/
Title:How Kim Kardashian Feels About Kanye West Singing About Their Divorce and ‘Losing My Family’ on Donda Album Links: https://www.elle.com/culture/celebrities/a37113130/kim-kardashian-kanye-west-divorce-song-donda-album-feelings/
Title:Kanye West Teases New Song In Beats By Dre Commercial Starring Sha'Carri Richardson Links: https://www.elle.com/culture/celebrities/a37090223/kanye-west-teases-new-song-in-beats-by-dre-commercial-starring-shacarri-richardson/
Title:Inside Kim Kardashian and Kanye West's Relationship Amid His Irina Shayk Romance Links: https://www.elle.com/culture/celebrities/a37077662/kim-kardashian-kanye-west-relationship-irina-shayk-romance-july-2021/
and ... so on
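The long hand-encoded API URL can also be generated from the readable parameters, which makes the page number and search term easier to change. This is a sketch using only the standard library; `build_api_url` is a hypothetical helper, and the parameter values are copied from the answers above rather than from any documented API.

```python
import json
from urllib.parse import urlencode, urljoin

API_URL = "https://www.elle.com/ajax/infiniteload/"
BASE_URL = "https://www.elle.com"


def build_api_url(page: int, query: str = "kanye") -> str:
    """Return the infinite-scroll URL for one page of search results."""
    params = {
        "id": "search",
        "class": "CoreModels\\search\\TagQueryModel",
        "viewset": "search",
        "trackingId": "search-results",
        "trackingLabel": query,
        # Compact JSON, matching the form used in the answers above.
        "params": json.dumps({"input": query, "page_size": "42"}, separators=(",", ":")),
        "page": page,
        "cachebuster": "undefined",
    }
    return f"{API_URL}?{urlencode(params)}"


api_urls = [build_api_url(page) for page in range(1, 4)]
print(api_urls[0])

# Relative hrefs from the response resolve against the site root.
href = "/culture/celebrities/a37833256/kim-kardashian-kanye-west-reconciled/"
print(urljoin(BASE_URL, href))
```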

How to pull links from within an 'a' tag

I have attempted several methods to pull links from the following webpage, but can't seem to find the desired links. From this webpage (https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1) I am attempting to extract all of the links for the "gamecast" button. The example of the first one I would be attempting to get is this: https://www.espn.com/college-football/game/_/gameId/401110723
When I try to just pull all links on the page I do not even seem to get the desired ones at all, so I'm confused where I'm going wrong here. A few attempts I have made below that don't seem to be pulling in what I want. First method I tried below.
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1')
soup = BeautifulSoup(page.text, 'html.parser')
# game_id = soup.find(name_='&lpos=college-football:scoreboard:gamecast')
game_id = soup.find('a',class_='button-alt sm')
Here is a second method I tried. Any help is greatly appreciated.
for a in soup.find_all('a'):
    if 'college-football' in a['href']:
        print(a['href'])
Edit: as a clarification I am attempting to pull all links that contain a gameID as in the example link.
The button with the link you want is rendered by JavaScript. The requests module only fetches the raw HTML and does not execute JavaScript, so you cannot scrape the button directly (without a browser automation tool like Selenium). However, the HTML does contain JSON data with the scoreboard, and the links are in there. If you also want to scrape more information (times, etc.) from this page, I highly recommend looking through the JSON data in the variable json_scoreboard in the code.
Code
import requests, re, json
from bs4 import BeautifulSoup
r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1')
soup = BeautifulSoup(r.text, 'html.parser')
scripts_head = soup.find('head').find_all('script')
all_links = {}
for script in scripts_head:
    if 'window.espn.scoreboardData' in script.text:
        json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
        for event in json_scoreboard['events']:
            name = event['name']
            for link in event['links']:
                if link['text'] == 'Gamecast':
                    gamecast = link['href']
                    all_links[name] = gamecast
print(all_links)
Output
{'Miami Hurricanes at Florida Gators': 'http://www.espn.com/college-football/game/_/gameId/401110723',
 'Georgia Tech Yellow Jackets at Clemson Tigers': 'http://www.espn.com/college-football/game/_/gameId/401111653',
 'Texas State Bobcats at Texas A&M Aggies': 'http://www.espn.com/college-football/game/_/gameId/401110731',
 'Utah Utes at BYU Cougars': 'http://www.espn.com/college-football/game/_/gameId/401114223',
 'Florida A&M Rattlers at UCF Knights': 'http://www.espn.com/college-football/game/_/gameId/401117853',
 'Tulsa Golden Hurricane at Michigan State Spartans': 'http://www.espn.com/college-football/game/_/gameId/401112212',
 'Wisconsin Badgers at South Florida Bulls': 'http://www.espn.com/college-football/game/_/gameId/401117856',
 'Duke Blue Devils at Alabama Crimson Tide': 'http://www.espn.com/college-football/game/_/gameId/401110720',
 'Georgia Bulldogs at Vanderbilt Commodores': 'http://www.espn.com/college-football/game/_/gameId/401110732',
 'Florida Atlantic Owls at Ohio State Buckeyes': 'http://www.espn.com/college-football/game/_/gameId/401112251',
 'Georgia Southern Eagles at LSU Tigers': 'http://www.espn.com/college-football/game/_/gameId/401110725',
 'Middle Tennessee Blue Raiders at Michigan Wolverines': 'http://www.espn.com/college-football/game/_/gameId/401112222',
 'Louisiana Tech Bulldogs at Texas Longhorns': 'http://www.espn.com/college-football/game/_/gameId/401112135',
 'Oregon Ducks at Auburn Tigers': 'http://www.espn.com/college-football/game/_/gameId/401110722',
 'Eastern Washington Eagles at Washington Huskies': 'http://www.espn.com/college-football/game/_/gameId/401114233',
 'Idaho Vandals at Penn State Nittany Lions': 'http://www.espn.com/college-football/game/_/gameId/401112257',
 'Miami (OH) RedHawks at Iowa Hawkeyes': 'http://www.espn.com/college-football/game/_/gameId/401112191',
 'Northern Iowa Panthers at Iowa State Cyclones': 'http://www.espn.com/college-football/game/_/gameId/401112085',
 'Syracuse Orange at Liberty Flames': 'http://www.espn.com/college-football/game/_/gameId/401112434',
 'New Mexico State Aggies at Washington State Cougars': 'http://www.espn.com/college-football/game/_/gameId/401114228',
 'South Alabama Jaguars at Nebraska Cornhuskers': 'http://www.espn.com/college-football/game/_/gameId/401112238',
 'Northwestern Wildcats at Stanford Cardinal': 'http://www.espn.com/college-football/game/_/gameId/401112245',
 'Houston Cougars at Oklahoma Sooners': 'http://www.espn.com/college-football/game/_/gameId/401112114',
 'Notre Dame Fighting Irish at Louisville Cardinals': 'http://www.espn.com/college-football/game/_/gameId/401112436'}
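The non-greedy pattern `({.*?});` stops at the first `};` it meets, which could cut the JSON short if that sequence ever appears inside a string value. As an alternative sketch (a swapped-in technique, not what the answer uses), `json.JSONDecoder.raw_decode` parses exactly one JSON value from a given position and ignores the JavaScript after it. `extract_js_object` is a hypothetical helper, and the script text below is made up.

```python
import json
import re


def extract_js_object(script_text: str, variable: str) -> dict:
    """Decode the JSON object assigned to `variable` inside a script body."""
    match = re.search(re.escape(variable) + r"\s*=\s*", script_text)
    if match is None:
        raise ValueError(f"{variable} not found")
    # raw_decode parses one JSON value starting at the given index and
    # returns it together with the index where the value ended.
    obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return obj


# Made-up script body; the "note" value contains "};" on purpose.
script = (
    'window.espn.scoreboardData = {"events": [{"name": "Team A at Team B", '
    '"note": "braces }; inside a string", '
    '"links": [{"text": "Gamecast", "href": "http://example.com/game/1"}]}]};'
    "window.espn.other = {};"
)
data = extract_js_object(script, "window.espn.scoreboardData")
links = {
    event["name"]: link["href"]
    for event in data["events"]
    for link in event["links"]
    if link["text"] == "Gamecast"
}
print(links)
```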

How can I webscrape a Website for the Winners

Hi I am trying to scrape this website with Python 3 and noticed that in the source code it does not give a clear indication of how I would scrape the names of the winners in these primary elections. Can you show me how to scrape a list of all the winners in every MD primary election with this website?
https://elections2018.news.baltimoresun.com/results/
The parsing is a little bit complicated, because the results are spread across many subpages. This script collects them and prints the result (all data is stored in the variable data):
from bs4 import BeautifulSoup
import requests
url = "https://elections2018.news.baltimoresun.com/results/"
r = requests.get(url)
data = {}
soup = BeautifulSoup(r.text, 'lxml')
for race in soup.select('div[id^=race]'):
    r = requests.get(f"https://elections2018.news.baltimoresun.com/results/contests/{race['id'].split('-')[1]}.html")
    s = BeautifulSoup(r.text, 'lxml')
    l = []
    data[(s.find('h3').text, s.find('div', {'class': 'party-header'}).text)] = l
    for candidate, votes, percent in zip(s.select('td.candidate'), s.select('td.votes'), s.select('td.percent')):
        l.append((candidate.text, votes.text, percent.text))
print('Winners:')
for (race, party), v in data.items():
    print(race, party, v[0])
# print(data)
Outputs:
Winners:
Governor / Lt. Governor Democrat ('Ben Jealous and Susan Turnbull', '227,764', '39.6%')
U.S. Senator Republican ('Tony Campbell', '50,915', '29.2%')
U.S. Senator Democrat ('Ben Cardin', '468,909', '80.4%')
State's Attorney Democrat ('Marilyn J. Mosby', '39,519', '49.4%')
County Executive Democrat ('John "Johnny O" Olszewski, Jr.', '27,270', '32.9%')
County Executive Republican ('Al Redmer, Jr.', '17,772', '55.7%')
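Printing `v[0]` relies on each results table listing the winner first. If that ordering cannot be taken for granted, a sketch like the following picks the row with the most votes instead. `parse_votes` and `pick_winner` are hypothetical helpers, and the second candidate row is made up for illustration.

```python
def parse_votes(votes: str) -> int:
    """Turn a formatted count such as '468,909' into an integer."""
    return int(votes.replace(",", ""))


def pick_winner(candidates):
    """Return the (name, votes, percent) row with the highest vote count."""
    return max(candidates, key=lambda row: parse_votes(row[1]))


rows = [
    ("Hypothetical Challenger", "12,345", "2.1%"),  # made-up row, listed first
    ("Ben Cardin", "468,909", "80.4%"),
]
print(pick_winner(rows))
```

In the script above this would replace `v[0]` with `pick_winner(v)`.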
