I want to scrape news from this website:
https://www.bbc.com/news
You can see that the website has categories such as Home, US Election, Coronavirus, etc.
For example, if I go to a specific news article such as:
https://www.bbc.com/news/election-us-2020-54912611
I can write a scraper that will give me the headline; this is the code:
import requests
from bs4 import BeautifulSoup

# minimal headers so the request is not rejected as coming from a bot
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get("https://www.bbc.com/news/election-us-2020-54912611", headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.select("header h1")   # list of matching <h1> tags
print(title)
There are hundreds of news articles on this website, so my question is: is there a way to access every news article on the site (all categories) starting from the home page URL? On the home page I can see only some of the articles, not all of them, so is there a way to load the HTML for the whole website so that I can easily get all the news headlines with:
soup.select("header h1")
OK, after getting these headlines you will also find other links on each page; you open those links in turn and fetch information from them. It can look like this:
visited = set()
headlines = []
links = [...]                                 # seed with the home page URL
while links:
    link_for_fetch = links.pop()
    if link_for_fetch in visited:
        continue
    content = get_contents(link_for_fetch)    # download the page
    headlines += parse_headlines(content)     # collect headlines from it
    links += parse_links(content)             # queue further links found on it
    visited.add(link_for_fetch)
It's just pseudocode; you can write it in any programming language. But crawling the whole site like this can take a lot of time :( and the site's anti-bot protection can block your IP address.
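A minimal runnable sketch of that idea, assuming the headline selector header h1 from the question still matches and that article links contain /news/ in their href (both are assumptions about the current BBC markup):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.bbc.com"
headers = {"User-Agent": "Mozilla/5.0"}        # assumed minimal header

visited = set()
links = [BASE + "/news"]
headlines = []

while links and len(visited) < 50:             # cap the crawl so we don't hammer the site
    url = links.pop()
    if url in visited:
        continue
    visited.add(url)
    soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
    for h1 in soup.select("header h1"):                 # headline selector from the question
        headlines.append(h1.get_text(strip=True))
    for a in soup.select('a[href*="/news/"]'):          # assumed article URL pattern
        links.append(urljoin(BASE, a["href"]))

print(headlines)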
So I'm just learning how to scrape websites for data, and I have one I want to scrape to populate a database for practice. The site doesn't have all the information posted on one big page I can scrape; instead the information is broken up into multiple "sets", with each set having its own page/collection of the data I want. All the "sets" are listed on a single page, though, with each listed set linking to its individual page. I figured my best bet would be to scrape the "sets" page for their URLs, and then request each "set" page to collect the data I'm after. Checking the HTML, each set is listed in a container, with the URL being the first thing listed within each section, like this:
<td class="flexbox">
<a href="url_i_need">
<more stuff i don't need>
</td>
<repeats_as_above_for_next_set>
What I've tried is:
import requests
from bs4 import BeautifulSoup

response = requests.get('site_url')
content = response.content
soup = BeautifulSoup(content, 'html.parser')
data = soup.find_all('td', 'flexbox')
This seems to do the trick of scraping each of the td sections, but nothing I try lets me narrow the data down further to just the portion I need. After narrowing my search to the general section I care about, how do I scrape the URL out of each of those sections?
You can loop over data and for each item find nested tag a:
for item in data:
    link = item.find('a')
    url = link['href']
By the way, is this line correct?
soup = BeautifulSoup.find_all(content, 'html.parser')
The standard way is:
soup = BeautifulSoup(content, 'html.parser')
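Putting the pieces together, a minimal sketch (the 'site_url' placeholder and the flexbox class come from the question; the guard against cells without a link is an added assumption):
import requests
from bs4 import BeautifulSoup

response = requests.get('site_url')                  # placeholder URL from the question
soup = BeautifulSoup(response.content, 'html.parser')

urls = []
for item in soup.find_all('td', 'flexbox'):
    link = item.find('a')
    if link is not None and link.has_attr('href'):   # skip cells that have no link
        urls.append(link['href'])

print(urls)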
I am trying to scrape the profile URLs from a Yelp search results page using Beautiful Soup. This is the code I currently have:
url="https://www.yelp.com/search?find_desc=tree+-+removal+-+&find_loc=Baltimore+MD&start=40"
response=requests.get(url)
data=response.text
soup = BeautifulSoup(data,'lxml')
for a in soup.find_all('a', href=True):
with open(r'C:\Users\my.name\Desktop\Yelp-URLs.csv',"a") as f:
print(a,file=f)
This gives me every href link on the page, not just the profile URLs. Additionally, I am getting the full anchor tag (a class="lemon....") when I just need the business profile URLs.
Please help.
You can narrow down the matches by using select with a CSS attribute selector:
with open(r'/Users/my.name/Desktop/Yelp-URLs.csv', "a") as f:       # open the file once
    for a in soup.select('a[href^="/biz/"]'):                       # only hrefs starting with /biz/
        print(a.attrs['href'], file=f)
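If you want absolute URLs and a proper CSV file, a small variant (the csv module and the base URL are my additions, not part of the original answer):
import csv
from urllib.parse import urljoin

with open(r'/Users/my.name/Desktop/Yelp-URLs.csv', "a", newline="") as f:
    writer = csv.writer(f)
    for a in soup.select('a[href^="/biz/"]'):
        # turn the relative /biz/... path into an absolute profile URL
        writer.writerow([urljoin("https://www.yelp.com", a['href'])])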
I am trying to create two Python lists - one of book titles, and one of the books' authors, from a publisher's "coming soon" website.
I have tried a similar approach on other publishers' sites with success, but on this site it does not seem to be working. I am new to parsing HTML, so I am obviously missing something; I just can't figure out what. The find_all function just returns an empty list, so my titles and authors lists are empty too.
For reference, this is what the HTML shows when I click "inspect" in my browser for the first title and author, respectively. I've looked through the BS4 documentation and still can't figure out what I'm doing wrong here.
<h3 class="sp__the-title">Flame</h3>
<p class="sp__the-author">Donna Grant</p>
Thanks for your help!
import requests
from bs4 import BeautifulSoup

page = 'https://us.macmillan.com/search?collection=coming-soon'
page_response = requests.get(page)
soup = BeautifulSoup(page_response.content, "html.parser")

titles = []
for tag in soup.find_all("h3", {"class": "sp__the-title"}):
    print(tag.text)
    titles.append(tag.text)

authors = []
for tag in soup.find_all("p", {"class": "sp__the-author"}):
    print(tag.text)
    authors.append(tag.text)
[Screenshot: ESPN website view]
I'd like to pull live auction/draft data from ESPN into a Python script that adjusts player valuations / probability of being picked. The table on the page, though, doesn't have td/tr tags; it is just a lot of divs with classes. When trying different variations of find/find_all for the classes I see in Chrome's inspector, I never seem to get any results.
import requests, bs4
url = "https://fantasy.espn.com/football/draft?leagueId=93589772&seasonId=2019&teamId=17&memberId={19AD42D6-8125-489D-B045-1E535CFC02E4}"
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'lxml')
table = soup.find("main", {"class": "jsx-2236042501 draftContainer"})
print (table)
These draft links only last so long, so unfortunately it won't be live for much longer.
The contents of the table are loaded with JavaScript. You must use browser automation such as Selenium to extract the DOM after JavaScript has rendered the page contents.
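A minimal sketch of that approach, assuming Chrome is installed and that the draftContainer class from the question still exists (the draft page itself may have expired by now):
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://fantasy.espn.com/football/draft?leagueId=93589772&seasonId=2019&teamId=17&memberId={19AD42D6-8125-489D-B045-1E535CFC02E4}"

driver = webdriver.Chrome()                    # any Selenium-supported browser works
driver.get(url)
# wait until JavaScript has rendered the draft container (class name taken from the question)
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CLASS_NAME, "draftContainer"))
)
soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

table = soup.find("main", {"class": "draftContainer"})
print(table)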
For an extracurricular school project, I'm learning how to scrape a website. As you can see from the code below, I am able to scrape a form called 'elqFormRow' off of one page.
How would one go about scraping all occurrences of 'elqFormRow' across the whole website? I'd like to collect the URL of each page where that form was found into a list, but I'm running into trouble because I don't know how.
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

for div in soup.find_all('div', class_='elqFormRow'):
    print(div.text.strip())
You can grab the URLs from a page and follow them to (presumably) scrape the whole site. Something like this, which will require a little massaging depending on where you want to start and what pages you want:
import bs4 as bs
import requests

domain = "engage.hpe.com"
initial_url = 'http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage'

# get urls to scrape
text = requests.get(initial_url).text
initial_soup = bs.BeautifulSoup(text, 'lxml')
tags = initial_soup.find_all('a', href=True)

urls = []
for tag in tags:
    if domain in tag['href']:          # keep only links that stay on the site
        urls.append(tag['href'])
urls.append(initial_url)
print(urls)

# function to grab your info
def scrape_desired_info(url):
    out = []
    text = requests.get(url).text
    soup = bs.BeautifulSoup(text, 'lxml')
    for div in soup.find_all('div', class_='elqFormRow'):
        out.append(div.text.strip())
    return out

info = [scrape_desired_info(url) for url in urls if domain in url]
urllib is clunky; use requests. If you need to go multiple levels down into the site, put the URL-finding section in a function and call it X times, where X is the number of levels of links you want to traverse; see the sketch below.
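A rough sketch, assuming the same domain and starting page as above (the find_links helper and the depth limit are illustrative additions, not part of the original answer):
import bs4 as bs
import requests

domain = "engage.hpe.com"
start = 'http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage'

def find_links(url):
    # return all same-domain links found on one page
    soup = bs.BeautifulSoup(requests.get(url).text, 'lxml')
    return [a['href'] for a in soup.find_all('a', href=True) if domain in a['href']]

levels = 2                                   # X: how many levels of links to traverse
urls, frontier = {start}, [start]
for _ in range(levels):
    next_frontier = []
    for url in frontier:
        for link in find_links(url):
            if link not in urls:
                urls.add(link)
                next_frontier.append(link)
    frontier = next_frontier

print(urls)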
Scrape responsibly. Try not to get into a sorcerer's-apprentice situation where you're hitting the site over and over in a loop, or following links external to the site. In general, I'd also avoid putting the page you want to scrape in the question.