How do I grab the main headline at CNN? - Python

I'm trying to grab only the headline "Ambassador calls Trump inept", but I can't seem to land in that area. I have tried pulling "h2" and its class, as well as "strong" tags, but can't find anything. I left the code below as is; it's the only thing I can get to display.
import requests
from bs4 import BeautifulSoup

data = requests.get('https://edition.cnn.com')  # URL assumed from the question context
soup = BeautifulSoup(data.text, 'html.parser')
for rows in soup.find_all('li'):
    for x in rows.findChildren('div'):
        print(x)

The page loads the data dynamically. If you inspect which URLs the page requests (e.g. in Firefox Developer Tools), you will find that the data lives at a different URL. Unfortunately, this URL (https://edition.cnn.com/data/ocs/section/index.html:intl_homepage1-zone-1/views/zones/common/zone-manager.izl) is constructed dynamically:
import requests
from bs4 import BeautifulSoup

# fetch the dynamically loaded fragment directly; its first <h2> is the headline
url = 'https://edition.cnn.com/data/ocs/section/index.html:intl_homepage1-zone-1/views/zones/common/zone-manager.izl'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
print(soup.h2.text)
Prints:
UK ambassador calls Trump 'inept' and 'insecure'

Related

How can I scrape for certain HTML classes in Python

I'm trying to scrape a random site and get all the text with a certain class off of a page.
from bs4 import BeautifulSoup
import requests

sources = ['https://cnn.com']
for source in sources:
    page = requests.get(source)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find_all("div", class_='cd_content')
    for result in results:
        title = result.find('span', class_="cd__headline-text vid-left-enabled")
        print(title)
From what I found online, this should work but for some reason, it can't find anything and results is empty. Any help is greatly appreciated.
Upon inspecting the network calls, you can see that the page is loaded dynamically by sending a GET request to:
https://www.cnn.com/data/ocs/section/index.html:homepage1-zone-1/views/zones/common/zone-manager.izl
The HTML is available within the html key of the returned JSON:
import requests
from bs4 import BeautifulSoup
URL = "https://www.cnn.com/data/ocs/section/index.html:homepage1-zone-1/views/zones/common/zone-manager.izl"
response = requests.get(URL).json()["html"]
soup = BeautifulSoup(response, "html.parser")
for tag in soup.find_all(class_="cd__headline-text vid-left-enabled"):
    print(tag.text)
Output (truncated):
This is the first Covid-19 vaccine in the US authorized for use in younger teens and adolescents
When the US could see Covid cases and deaths plummet
'Truly, madly, deeply false': Keilar fact-checks Ron Johnson's vaccine claim
These are the states with the highest and lowest vaccination rates

Python BeautifulSoup trouble extracting titles from a page with JS

I'm having some serious issues trying to extract the titles from a webpage. I've done this before on some other sites but this one seems to be an issue because of the Javascript.
The test link is "https://www.thomasnet.com/products/adhesives-393009-1.html"
The first title I want extracted is "Toagosei America, Inc."
Here is my code:
import requests
from bs4 import BeautifulSoup
url = ("https://www.thomasnet.com/products/adhesives-393009-1.html")
r = requests.get(url).content
soup = BeautifulSoup(r, "html.parser")
print(soup.get_text())
Now if I run it like this, with get_text, I can find the titles in the result; however, as soon as I change it to find_all or find, the titles are lost. I can't find them using the web browser's inspect tool, because it's all JS-generated.
Any advice would be greatly appreciated.
You have to specify what to find, in this case <h2>, to get the first title:
import requests
from bs4 import BeautifulSoup
url = 'https://www.thomasnet.com/products/adhesives-393009-1.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
first_title = soup.find('h2')
print(first_title.text)
Prints:
Toagosei America, Inc.
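
If you want every supplier title rather than just the first, a minimal extension of the same idea (assuming each listing title is rendered as an <h2>, as the first one is):

import requests
from bs4 import BeautifulSoup

url = 'https://www.thomasnet.com/products/adhesives-393009-1.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# collect every <h2> on the page - each supplier name in the
# listing appears to be rendered as an <h2> heading
for h2 in soup.find_all('h2'):
    print(h2.text.strip())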

ESPN Table does not have TR/TD tags

[Screenshot: ESPN draft page view]
I'd like to pull live auction/draft data from ESPN into a Python script that adjusts player valuations / probability of being picked. The table on the page, though, doesn't have td/tr tags; it's just a lot of divs with classes. When trying different variations of find/find_all on the classes I see in Chrome's inspector, I never seem to get any results back.
import requests, bs4
url = "https://fantasy.espn.com/football/draft?leagueId=93589772&seasonId=2019&teamId=17&memberId={19AD42D6-8125-489D-B045-1E535CFC02E4}"
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'lxml')
table = soup.find("main", {"class": "jsx-2236042501 draftContainer"})
print(table)
These draft links only last so long, so unfortunately this one won't be live for much longer.
The contents of the table are loaded with JavaScript, so requests alone never sees them. You must use browser automation such as Selenium to extract the DOM after JavaScript has rendered the page contents.
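A minimal sketch of that approach with Selenium (assuming a local chromedriver; the crude time.sleep is for brevity, a WebDriverWait would be more robust, and the class name from the question may have changed):

import time
import bs4
from selenium import webdriver

url = ("https://fantasy.espn.com/football/draft?leagueId=93589772"
       "&seasonId=2019&teamId=17&memberId={19AD42D6-8125-489D-B045-1E535CFC02E4}")

driver = webdriver.Chrome()
try:
    driver.get(url)
    time.sleep(10)  # crude wait for the JavaScript to render the draft table
    # hand the rendered DOM to BeautifulSoup and search it as before
    soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
    table = soup.find("main", {"class": "jsx-2236042501 draftContainer"})
    print(table)
finally:
    driver.quit()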

Identifying DJIA data using Beautiful Soup

I'm fairly new to coding and am trying to write a script that pulls market data at timed intervals while running, compares the delta between each pull, and notifies the user of the change - looking for simple shifts, let's say >0.1% in any interval.
My initial approach is to run a Beautiful Soup script to obtain posted market data, using either Yahoo Finance or Barron's, as both seem to have the data available in the HTML code:
https://finance.yahoo.com/calendar
http://www.barrons.com/mdc/public/page/9_3000.html?mod=bol_mdc_topnav_9_3000
This is as far as I've gotten, and I'm not having much luck; the find function doesn't seem to be returning anything from the site. I'm looking for any nudge that might help me get on the right track with this:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
url = 'https://finance.yahoo.com/calendar'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
soup.find("span")
I would expect this to return the first span tag so I could later hone in on the DJIA data:
<span class="Trsdu(0.3s) Fz(s) Mt(4px) Mb(0px) Fw(b) D(ib)" data-reactid="31">26,430.14</span>
but the script runs and returns nothing
You can use the same URL that the last of your listed URLs (the Barron's page) uses to source the quote:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://quotes.wsj.com/index/DJIA?mod=mdc_uss_dtabnk')
soup = bs(r.content, 'lxml')
djia = soup.select_one('#quote_val').text
print(djia)
This becomes clear when you inspect the network traffic of the Barron's URL you listed and focus on this request:
http://www.barrons.com/mdc/public/js/9_3001_Refresh.js?
which contains the JavaScript for refreshing that value. There you can see the source URL used for the quote.
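
To build the timed pulls and delta check described in the question, here is a minimal sketch on top of that quote source (the 60-second interval and the 0.1% threshold are assumptions to adjust as needed):

import time
import requests
from bs4 import BeautifulSoup as bs

def get_djia():
    # fetch the current DJIA value from the WSJ quote page
    r = requests.get('https://quotes.wsj.com/index/DJIA?mod=mdc_uss_dtabnk')
    soup = bs(r.content, 'lxml')
    return float(soup.select_one('#quote_val').text.replace(',', ''))

previous = get_djia()
while True:
    time.sleep(60)                     # polling interval (assumed)
    current = get_djia()
    change = (current - previous) / previous * 100
    if abs(change) > 0.1:              # notify on shifts greater than 0.1%
        print(f"DJIA moved {change:+.2f}%: {previous:,.2f} -> {current:,.2f}")
    previous = current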

Website Scraping Specific Forms

For an extracurricular school project, I'm learning how to scrape a website. As you can see from the code below, I am able to scrape a form called 'elqFormRow' off of one page.
How would one go about scraping all occurrences of 'elqFormRow' on the whole website? I'd like to return the URL of where each form was located into a list, but am running into trouble while doing so because I don't know how, lol.
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
for div in soup.find_all('div', class_='elqFormRow'):
    print(div.text.strip())
You can grab the URLs from a page and follow them to (presumably) scrape the whole site. Something like this, which will require a little massaging depending on where you want to start and which pages you want:
import bs4 as bs
import requests

domain = "engage.hpe.com"
initial_url = 'http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage'

# get urls to scrape
text = requests.get(initial_url).text
initial_soup = bs.BeautifulSoup(text, 'lxml')
tags = initial_soup.find_all('a', href=True)
urls = []
for tag in tags:
    if domain in tag['href']:
        urls.append(tag['href'])
urls.append(initial_url)
print(urls)

# function to grab your info
def scrape_desired_info(url):
    out = []
    text = requests.get(url).text
    soup = bs.BeautifulSoup(text, 'lxml')
    for div in soup.find_all('div', class_='elqFormRow'):
        out.append(div.text.strip())
    return out

info = [scrape_desired_info(url) for url in urls if domain in url]
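
If you also want to know which URL each form came from, as the question asks, a dict comprehension in place of that last list keeps the pairing (a hypothetical tweak, not part of the original answer):

# map each crawled URL to the elqFormRow text found there,
# then drop pages where no form was found
info_by_url = {url: scrape_desired_info(url) for url in urls if domain in url}
forms_found = {url: forms for url, forms in info_by_url.items() if forms}
print(forms_found)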
urllib stinks; use requests. If you need to go multiple levels down into the site, put the URL-finding section in a function and call it X times, where X is the number of levels of links you want to traverse (see the sketch after the next paragraph).
Scrape responsibly: try not to get into a sorcerer's apprentice situation where you're hitting the site over and over in a loop, or following links external to the site. In general, I'd also avoid putting the page you want to scrape in the question.
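
A minimal sketch of that multi-level idea, assuming you only want same-domain links and a fixed depth (the depth of 2 and the set-based deduplication are assumptions):

import bs4 as bs
import requests

domain = "engage.hpe.com"

def find_urls(url):
    # return the same-domain links found on one page
    soup = bs.BeautifulSoup(requests.get(url).text, 'lxml')
    return [a['href'] for a in soup.find_all('a', href=True) if domain in a['href']]

def crawl(start_url, depth=2):
    seen = {start_url}
    frontier = [start_url]
    for _ in range(depth):             # one pass per level of links
        next_frontier = []
        for url in frontier:
            for link in find_urls(url):
                if link not in seen:   # avoid re-fetching pages
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen

all_urls = crawl('http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage')
print(all_urls)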
