Using BeautifulSoup to extract the title of a link - python

I'm trying to extract the title of a link using BeautifulSoup. The code that I'm working with is as follows:
url = "http://www.example.com"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "lxml")
for link in soup.findAll('a', {'class': 'a-link-normal s-access-detail-page  a-text-normal'}):
    title = link.get('title')
    print(title)
Now, an example link element contains the following:
<a class="a-link-normal s-access-detail-page a-text-normal" href="http://www.amazon.in/Introduction-Computation-Programming-Using-Python/dp/8120348664" title="Introduction To Computation And Programming Using Python"><h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal">Introduction To Computation And Programming Using <strong>Python</strong></h2></a>
However, nothing gets displayed after I run the above code. How can I extract the value stored inside the title attribute of the anchor tag stored in link?

Well, it seems you have put two spaces between s-access-detail-page and a-text-normal, so BeautifulSoup cannot find any matching link. Try it with the correct number of spaces, then print the number of links found. You can also print the tag itself with print(link):
import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.in/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=python"
source_code = requests.get(url)
plain_text = source_code.content
soup = BeautifulSoup(plain_text, "lxml")
links = soup.findAll('a', {'class': 'a-link-normal s-access-detail-page a-text-normal'})
print(len(links))
for link in links:
    title = link.get('title')
    print(title)

You are searching for an exact string here by using multiple classes. In that case the class string has to match exactly, with single spaces.
See the Searching by CSS class section in the documentation:
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
But searching for variants of the string value won’t work:
css_soup.find_all("p", class_="strikeout body")
# []
You'd have a better time searching for individual classes:
soup.find_all('a', class_='a-link-normal')
If you must match more than one class, use a CSS selector:
soup.select('a.a-link-normal.s-access-detail-page.a-text-normal')
and it won't matter in what order you list the classes.
Demo:
>>> from bs4 import BeautifulSoup
>>> plain_text = u'<a class="a-link-normal s-access-detail-page a-text-normal" href="http://www.amazon.in/Introduction-Computation-Programming-Using-Python/dp/8120348664" title="Introduction To Computation And Programming Using Python"><h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal">Introduction To Computation And Programming Using <strong>Python</strong></h2></a>'
>>> soup = BeautifulSoup(plain_text, 'html.parser')
>>> for link in soup.find_all('a', class_='a-link-normal'):
...     print(link.text)
...
Introduction To Computation And Programming Using Python
>>> for link in soup.select('a.a-link-normal.s-access-detail-page.a-text-normal'):
...     print(link.text)
...
Introduction To Computation And Programming Using Python

Related

Retrieving multiple values with Beautiful Soup

I have a script which I am using to parse the articles in the reference section of a Wikipedia page. It currently returns the URLs of every item in the reference section.
I'm trying to get it to export both the link (which it does currently) and the text of the link, either on a single line:
https://this.is.the.url "And this is the article header"
or over consecutive lines:
https://this.is.the.url
"And this is the article header"
Link Sample
<a
rel="nofollow"
class="external text"
href="https://www.mmajunkie.usatoday.com/2020/08/gerald-meerschaert-tests-positive-covid-19-ed-herman-fight-off-ufc-on-espn-plus-31/amp">
"Gerald Meerschaert tests positive for COVID-19; Ed Herman fight off UFC on ESPN+ 31"
</a>
Scraper
import requests
import sys
from bs4 import BeautifulSoup
session = requests.Session()
selectWikiPage = "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Waterson_vs._Hill"
if "wikipedia" in selectWikiPage:
    html = session.post(selectWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")
    references = bsObj.find('ol', {'class': 'references'})
    href = BeautifulSoup(str(references), "html.parser")
    links = [a["href"] for a in href.find_all("a", class_="external text", href=True)]
    title = [a["href"] for a in href.find_all("a", class_="external text", href=True)]
    for link in links:
        print(link)
else:
    print("Error: Please enter a valid Wikipedia URL")
Fixed it:
import requests
import sys
from bs4 import BeautifulSoup
session = requests.Session()
selectWikiPage = "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Waterson_vs._Hill"
if "wikipedia" in selectWikiPage:
    html = session.post(selectWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")
    references = bsObj.find('ol', {'class': 'references'})
    href = BeautifulSoup(str(references), "html.parser")
    for a in href.find_all("a", class_="external text", href=True):
        listitem = [a["href"], a.getText()]
        print(listitem)
else:
    print("Error: Please enter a valid Wikipedia URL")
Instead of only getting the href attribute of the anchor tag, you can also get the text of the link.
This can be done simply by:
links = [(a["href"], a.text)
         for a in href.find_all("a", class_="external text", href=True)]
for link, title in links:
    print(link, title)
Now each element of links is a tuple with the link and the title, and you can display it however you want.
Also, a.text can equivalently be written as a.getText() or a.get_text(), so choose whichever suits your code style.
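A quick check, on a made-up anchor, that all three spellings return the same string:

```python
from bs4 import BeautifulSoup

# Made-up anchor purely to illustrate the three equivalent accessors.
a = BeautifulSoup('<a href="https://example.com">Example link</a>', "html.parser").a

# .text is a property; getText() and get_text() are the method spellings.
print(a.text, a.getText(), a.get_text())
```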

BS4 Get text from within all DIV tags but not children

I am crawling multiple webpages, but am having an issue with some websites whose content/text is in div tags rather than p or span tags. Previously the script worked fine getting text from p and span tags, but a snippet of the markup can look like this:
<div>Hello<p>this is a test</p></div>
Using find_all('div') and .getText() provides the following output:
Hello this is a test
I am looking to get just Hello as the result. This will allow me to determine what content is in which tags. I have tried using recursive=False, but it doesn't appear to work on a whole webpage with multiple div tags that have content in them.
ADDED SNIPPET OF CODE
import urllib.request
from bs4 import BeautifulSoup

req = urllib.request.Request("https://www.healthline.com/health/fitness-exercise/pushups-everyday", headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read().decode("utf-8").lower()
soup = BeautifulSoup(html, 'html.parser')
divTag = soup.find_all('div')
text = []
for div in divTag:
    i = div.getText()
    text.append(i)
print(text)
Thanks in advance.
Based on your information, this is answered here: how to get text from within a tag, but ignore other child tags
That would lead to something like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div'):
    print(div.find(text=True, recursive=False))
EDIT:
you just have to change
i = div.getText()
to
i = div.find(text=True, recursive=False)
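On the HTML from the question, this is what the non-recursive search returns (a minimal, self-contained sketch):

```python
from bs4 import BeautifulSoup

html = "<div>Hello<p>this is a test</p></div>"
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")
# recursive=False only looks at the div's direct children, so the text
# inside the nested <p> tag is skipped.
direct_text = div.find(text=True, recursive=False)
print(direct_text)  # Hello
```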
Here is a possible solution: we extract all the p tags from the soup before reading the text.
from bs4 import BeautifulSoup
html = "<div>Hello<p>this is a test</p></div>"
soup = BeautifulSoup(html, 'html.parser')
for p in soup.find_all('p'):
    p.extract()
print(soup.text)  # prints: Hello

Search in Sub-Pages of a Main Webpage using BeautifulSoup

I am trying to search for a div with class='class', but I need to find all matches on the main page as well as on its sub (or child) pages. How can I do this using BeautifulSoup or anything else?
The closest answer I have found is this search:
Search the frequency of words in the sub pages of a webpage using Python
but that method only retrieved a partial result; the page of interest has many more subpages. Is there another way of doing this?
My code so far:
from bs4 import BeautifulSoup
import requests

page = requests.get('https://www.mainpage.nl/')
soup = BeautifulSoup(page.content, 'html.parser')
subpages = []
for anchor in soup.find_all('a', href=True):
    string = 'https://www.mainpage.nl/' + str(anchor['href'])
    subpages.append(string)
for subpage in subpages:
    try:
        soup_sub = BeautifulSoup(requests.get(subpage).content, 'html.parser')
        promotie = soup_sub.find_all('strong', class_='c-action-banner__subtitle')
        if len(promotie) > 0:
            print(promotie)
    except Exception:
        pass
Thanks!
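One way to make the subpage list more robust (a sketch, not tested against the real site): build absolute URLs with urljoin instead of string concatenation, so both relative and absolute hrefs work, and deduplicate with a set so each subpage is fetched only once. The HTML below is a made-up stand-in for the main page's content:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical main-page HTML, standing in for requests.get(mainpage).content.
main_html = '''
<a href="/aanbiedingen">Aanbiedingen</a>
<a href="https://www.mainpage.nl/contact">Contact</a>
<a href="/aanbiedingen">Aanbiedingen (duplicate)</a>
'''

mainpage = "https://www.mainpage.nl/"
soup = BeautifulSoup(main_html, "html.parser")

# urljoin resolves relative hrefs against the main page and leaves absolute
# URLs untouched; the set drops duplicate links.
subpages = sorted({urljoin(mainpage, a["href"]) for a in soup.find_all("a", href=True)})
print(subpages)
```

Each URL in subpages can then be fetched and searched exactly as in the loop above.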

How to get a specific HTML element of a page (bs4)

import requests, bs4, webbrowser
url = 'https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords='
keywords = "keyboard"
full_link = url + keywords
res = requests.get(full_link)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
webbrowser.open(full_link)
a = soup.find('a', {'class': 'a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'})
print(a)
Hi, I'm trying to get a very specific HTML element that is buried deep in divs, but to no avail. Here is the HTML:
<a class="a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal" title="AmazonBasics Wired Keyboard" href="https://rads.stackoverflow.com/amzn/click/com/B005EOWBHC" rel="nofollow noreferrer"><h2 data-attribute="AmazonBasics Wired Keyboard" data-max-rows="0" class="a-size-medium s-inline s-access-title a-text-normal">AmazonBasics Wired Keyboard</h2></a>
and this is buried pretty deep. I want to get the href of this element, but currently my variable a returns None.
You need to use findAll and supply the classes as a list. For example:
a = soup.findAll('a', {'class': ['a-link-normal', 's-access-detail-page', 's-color-twister-title-link', 'a-text-normal']})
But I would also recommend against such specific class selection. The only one you really need is probably s-access-detail-page.
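A self-contained check of the single-class approach, run against the anchor from the question (trimmed slightly):

```python
from bs4 import BeautifulSoup

# The anchor from the question, with the <h2> attributes trimmed.
html = ('<a class="a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal" '
        'title="AmazonBasics Wired Keyboard" '
        'href="https://rads.stackoverflow.com/amzn/click/com/B005EOWBHC">'
        '<h2>AmazonBasics Wired Keyboard</h2></a>')

soup = BeautifulSoup(html, "html.parser")
# bs4 checks multi-valued class attributes one class at a time, so matching
# on the single distinctive class is enough.
a = soup.find("a", class_="s-access-detail-page")
print(a["href"])
```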

BeautifulSoup, findAll after findAll?

I'm pretty new to Python and mainly need it for getting information from websites.
Here I tried to get the short headlines from the bottom of the website, but can't quite get them.
from bfs4 import BeautifulSoup
import requests
url = "http://some-website"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
nachrichten = soup.findAll('ul', {'class':'list'})
Now I would need another findAll to get all the links/anchors from the variable nachrichten, but how can I do this?
Use a css selector with select if you want all the links in a single list:
anchors = soup.select('ul.list a')
If you want individual lists:
anchors = [ul.find_all('a') for ul in soup.find_all('ul', {'class': 'list'})]
Also if you want the hrefs you can make sure you only find the anchors with href attributes and extract:
hrefs = [a["href"] for a in soup.select('ul.list a[href]')]
With find_all, set href=True, i.e. ul.find_all('a', href=True).
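A self-contained version of the selector approach, on made-up markup mirroring the ticker's list structure:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the n-tv ticker page.
html = '''
<ul class="list">
  <li><a href="/ticker/one">Headline one</a></li>
  <li><a href="/ticker/two">Headline two</a></li>
  <li><a>No href here</a></li>
</ul>
'''

soup = BeautifulSoup(html, "html.parser")
# The attribute selector a[href] skips anchors that have no href at all.
hrefs = [a["href"] for a in soup.select("ul.list a[href]")]
print(hrefs)
```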
from bs4 import BeautifulSoup
import requests

url = "http://www.n-tv.de/ticker/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
nachrichten = soup.findAll('ul', {'class': 'list'})
links = []
for ul in nachrichten:
    links.extend(ul.findAll('a'))
print(len(links))
Hope this solves your problem. Also, I think the import is bs4; I have never heard of bfs4.
