Extracting book title and author from website - Python 3

I am trying to create two Python lists from a publisher's "coming soon" page: one of book titles and one of the books' authors.
I have tried a similar approach on other publishers' sites with success, but on this site it does not seem to be working. I am new to parsing HTML, so I am obviously missing something; I just can't figure out what. The find_all function just returns an empty list, so my titles and authors lists end up empty too.
For reference, this is what the HTML shows when I click "inspect" in my browser for the first title and author, respectively. I've looked through the BS4 documentation and still can't figure out what I'm doing wrong here.
<h3 class="sp__the-title">Flame</h3>
<p class="sp__the-author">Donna Grant</p>
Thanks for your help!
import requests
from bs4 import BeautifulSoup

page = 'https://us.macmillan.com/search?collection=coming-soon'
page_response = requests.get(page)
soup = BeautifulSoup(page_response.content, "html.parser")

titles = []
for tag in soup.find_all("h3", {"class": "sp__the-title"}):
    print(tag.text)
    titles.append(tag.text)

authors = []
for tag in soup.find_all("p", {"class": "sp__the-author"}):
    print(tag.text)
    authors.append(tag.text)
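Update: one quick sanity check is whether those class names appear in the raw HTML at all; if they don't, the page is presumably filled in by JavaScript after load, which would explain the empty lists. A minimal debugging sketch:
import requests

# Does the raw (pre-JavaScript) HTML contain the classes at all?
page = 'https://us.macmillan.com/search?collection=coming-soon'
html = requests.get(page).text
print('sp__the-title' in html)   # False suggests the content is rendered client-side
print('sp__the-author' in html)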

Related

Accessing all elements from main website page with Beautiful Soup

I want to scrape news from this website:
https://www.bbc.com/news
You can see that the website has categories such as Home, US Election, Coronavirus, etc.
For example, if I go to a specific news article such as:
https://www.bbc.com/news/election-us-2020-54912611
I can write a scraper that will give me the headline; this is the code:
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; the original headers weren't shown
response = requests.get("https://www.bbc.com/news/election-us-2020-54912611", headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.select("header h1")
print(title)
There are hundreds of news articles on this website, so my question is: is there a way to access every article (all categories) starting from the home page URL? On the home page I can't see all of the articles, only some of them, so is there a way to load the HTML for the whole site, so that I can easily get all the news headlines with:
soup.select("header h1")
OK, and after getting those headlines, the pages you fetch will contain further links; you just open those links as well and fetch information from them. It can look like this:
visited = set()
links = [....]  # seed URLs
headlines = []

while links:
    link_for_fetch = links.pop()
    if link_for_fetch in visited:
        continue
    content = get_contents(link_for_fetch)
    headlines += parse_headlines(content)
    links += parse_links(content)
    visited.add(link_for_fetch)
It's just pseudocode; you can write it in any programming language. But crawling a whole site this way can take a lot of time :( and the site's robot protection can block your IP address.
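As a rough Python version of that loop for the BBC front page (the '/news/' URL filter, the 50-request cap, and the header h1 selector are assumptions; the site's markup can change):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

headers = {"User-Agent": "Mozilla/5.0"}

visited = set()
links = ["https://www.bbc.com/news"]   # seed with the home page
headlines = []

while links and len(visited) < 50:     # cap requests to keep the crawl polite
    link_for_fetch = links.pop()
    if link_for_fetch in visited:
        continue
    visited.add(link_for_fetch)
    soup = BeautifulSoup(requests.get(link_for_fetch, headers=headers).content,
                         "html.parser")
    for h1 in soup.select("header h1"):          # parse_headlines()
        headlines.append(h1.get_text(strip=True))
    for a in soup.find_all("a", href=True):      # parse_links()
        url = urljoin(link_for_fetch, a["href"])
        if "/news/" in url and url not in visited:
            links.append(url)

print(headlines)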

How do I extract just the blog content and exclude other elements using Beautiful Soup

I am trying to get the blog content from this blog post, and by content I just mean the first six paragraphs. This is what I've come up with so far:
import requests
from bs4 import BeautifulSoup

res = requests.get(url).text  # url is the post's address; BeautifulSoup can't fetch a URL itself
soup = BeautifulSoup(res, 'lxml')
body = soup.find('div', class_='post-body')
Printing body will also include other stuff under the main div tag.
Try this:
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/being-proud-too-soon.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div#post-body-604825342214355274"):
    print(item.text.strip())
Use this: the [id^='post-body-'] selector matches any div whose id starts with post-body-, so it keeps working even though the numeric suffix differs from post to post.
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div[id^='post-body-']"):
    print(item.text)
I found this solution very interesting: Scrape multiple pages with BeautifulSoup and Python. However, I haven't found any query string parameters to work with here; maybe you can start something out of this approach.
What seems most obvious to do right now is something like this (see the sketch after this list):
1. Scrape through every month and year and get all titles from the Blog Archive part of the pages (e.g. on http://www.fashionpulis.com/2017/03/ and so on)
2. Build the URLs using the titles and the corresponding months/years (the URL is always http://www.fashionpulis.com/$YEAR/$MONTH/$TITLE.html)
3. Scrape the text as described by Shahin in the previous answer
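A minimal sketch of steps 1 and 2, assuming the monthly archive pages link directly to posts and that post URLs follow the http://www.fashionpulis.com/$YEAR/$MONTH/$TITLE.html pattern (the year range and the link filter are guesses against the real markup):
import requests
from bs4 import BeautifulSoup

# Collect post URLs by walking the monthly archive pages.
post_urls = set()
for year in (2016, 2017):
    for month in range(1, 13):
        prefix = "http://www.fashionpulis.com/{}/{:02d}/".format(year, month)
        soup = BeautifulSoup(requests.get(prefix).text, "html.parser")
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # post URLs start with the archive prefix and end in .html
            if href.startswith(prefix) and href.endswith(".html"):
                post_urls.add(href)

print(len(post_urls), "posts found")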

A general way to scrape link titles from any site in Python?

Is there a "general" way to scrape link titles from any website in Python? For example, if I use the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "https://news.google.com"
html = urlopen(site)
soup = BeautifulSoup(html.read(), 'lxml')
titles = soup.findAll('span', attrs={'class': 'titletext'})
for title in titles:
    print(title.contents)
I am able to extract nearly every headline title from news.google.com. However, if I use the same code at www.yahoo.com, I can't, because the HTML is formatted differently.
Is there a more general way to do this so that it works for most sites?
No; each site is different, and a more general scraper will collect data far less specific than every headline title.
For instance, the following would get every headline title from Google, and would probably get them from Yahoo as well.
titles = soup.find_all('a')
for title in titles:
    print(title.get_text())
However, it would also get you all of the headers and other links, which would muddy up your results (there are approximately 150 links on that Google page that aren't headlines).
No; that's why we need CSS selectors and XPath. But if there is only a small number of sites, there is a convenient way to do it:
import requests
from bs4 import BeautifulSoup

site = "https://news.google.com"
if 'google' in site:
    filters = {'name': 'span', 'attrs': {'class': 'titletext'}}
elif 'yahoo' in site:
    filters = {'name': 'blala', 'attrs': {'class': 'blala'}}  # placeholder selectors

soup = BeautifulSoup(requests.get(site).text, 'lxml')
titles = soup.findAll(**filters)
for title in titles:
    print(title.contents)
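If the list of sites grows, one way to keep that branching manageable (just a sketch; the Yahoo selectors are the same placeholders as above) is a table of per-site filters keyed by a domain substring:
# Per-site selector table instead of an if/elif chain.
SITE_FILTERS = {
    'google': {'name': 'span', 'attrs': {'class': 'titletext'}},
    'yahoo':  {'name': 'blala', 'attrs': {'class': 'blala'}},  # placeholder
}

def filters_for(site):
    for key, f in SITE_FILTERS.items():
        if key in site:
            return f
    raise ValueError("no filters configured for %s" % site)

titles = soup.findAll(**filters_for(site))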

Website Scraping Specific Forms

For an extracurricular school project, I'm learning how to scrape a website. As you can see from the code below, I am able to scrape a form called 'elqFormRow' from one page.
How would one go about scraping all occurrences of 'elqFormRow' across the whole website? I'd like to return the URLs of the pages where the form was found in a list, but I'm running into trouble because I don't know how.
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

for div in soup.find_all('div', class_='elqFormRow'):
    print(div.text.strip())
You can grab the URLs from a page and follow them to (presumably) scrape the whole site. Something like this, which will require a little massaging depending on where you want to start and what pages you want:
import bs4 as bs
import requests

domain = "engage.hpe.com"
initial_url = 'http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage'

# get urls to scrape
text = requests.get(initial_url).text
initial_soup = bs.BeautifulSoup(text, 'lxml')
tags = initial_soup.findAll('a', href=True)
urls = []
for tag in tags:
    if domain in tag['href']:   # check the href string, not the Tag object
        urls.append(tag['href'])
urls.append(initial_url)
print(urls)

# function to grab your info
def scrape_desired_info(url):
    out = []
    text = requests.get(url).text
    soup = bs.BeautifulSoup(text, 'lxml')
    for div in soup.find_all('div', class_='elqFormRow'):
        out.append(div.text.strip())
    return out

info = [scrape_desired_info(url) for url in urls if domain in url]
urllib stinks; use requests. If you need to go multiple levels down into the site, put the URL-finding section in a function and call it X times, where X is the number of levels of links you want to traverse.
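A rough sketch of that idea, reusing scrape_desired_info and the names from the code above (the depth of 2 is an arbitrary choice):
# Wrap the link collection in a function and call it once per level of depth.
import bs4 as bs
import requests

def find_links(url, domain):
    soup = bs.BeautifulSoup(requests.get(url).text, 'lxml')
    return [a['href'] for a in soup.findAll('a', href=True) if domain in a['href']]

domain = "engage.hpe.com"
initial_url = 'http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage'

urls = {initial_url}
fetched = set()
for level in range(2):               # X = 2 levels of links
    for url in list(urls - fetched): # only fetch pages we haven't seen yet
        fetched.add(url)
        urls.update(find_links(url, domain))

info = [scrape_desired_info(url) for url in urls]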
Scrape responsibly: try not to get into a sorcerer's apprentice situation where you're hitting the site over and over in a loop, or following links external to the site. In general, I'd also avoid putting the page you want to scrape in the question.

problems scraping web page using python

Hi, I'm quite new to Python and my boss has asked me to scrape this data; it is not my strong point, so I was wondering how I would go about it.
The text that I'm after (inside the quote marks) also changes every few minutes, so I'm not sure how to locate it.
I am using Beautiful Soup and lxml at the moment; however, if there are better alternatives, I'm happy to try them.
This is the inspected element of the webpage:
<div class="sometext">
  <h3> somemoretext </h3>
  <p>
    <span class="title" title="text i want">text i want</span>
    <br>
  </p>
</div>
I have tried using:
from lxml import html
import requests
from bs4 import BeautifulSoup
page = requests.get('the url')
soup = BeautifulSoup(page.text)
r = soup.findAll('//span[@class="title"]/text()')
print r
Thank you in advance, any help would be appreciated!
First do this to get what you are looking at in the soup:
soup = BeautifulSoup(page.text)
print soup
That way you can double check that you are actually dealing will what you think you are dealing with.
Then do this:
r = soup.findAll('span', attrs={"class": "title"})
for span in r:
    print span.text
This will get all the span tags with class="title", and .text will then print out all the text in between the tags.
Edited to Add
Note that esecules' answer will get you the title within the tag (<span class = "title" title="text i want">) whereas mine will get the title from the text (<span class = "title" >text i want</span>)
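Concretely, for the markup in the question (a small self-contained illustration, not taken from either answer):
from bs4 import BeautifulSoup

html = '''<div class="sometext"><h3> somemoretext </h3>
<p><span class="title" title="text i want">text i want</span><br></p></div>'''
soup = BeautifulSoup(html, 'html.parser')

span = soup.find('div', 'sometext').find('span', 'title')
print(span['title'])  # from the title attribute: "text i want"
print(span.text)      # from the element's text:  "text i want"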
Perhaps find is the method you really need, since you're only ever looking for one element: docs
r = soup.find('div', 'sometext').find('span', 'title')['title']
If you're familiar with XPath and you don't need features specific to BeautifulSoup, then lxml alone is enough (and maybe even better, since lxml is known to be faster):
from lxml import html
import requests

page = requests.get('the url')
root = html.fromstring(page.text)
r = root.xpath('//span[@class="title"]/text()')
print r
