How to get a specific HTML element of a page (bs4)

import requests, bs4, webbrowser
url = 'https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords='
keywords = "keyboard"
full_link = url + keywords
res = requests.get(full_link)
soup = bs4.BeautifulSoup(res.text)
webbrowser.open(full_link)
a = soup.find('a', {'class': 'a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'})
print(a)
Hi, I'm trying to get a very specific html element that is buried deep in divs but to no avail. Here is the HTML:
<a class="a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal" title="AmazonBasics Wired Keyboard" href="https://rads.stackoverflow.com/amzn/click/com/B005EOWBHC" rel="nofollow noreferrer"><h2 data-attribute="AmazonBasics Wired Keyboard" data-max-rows="0" class="a-size-medium s-inline s-access-title a-text-normal">AmazonBasics Wired Keyboard</h2></a>
The element is nested pretty deep in the page. I want to get its href, but currently my variable a is None.

You need to use findAll and supply the classes as a list. Note that findAll returns a list of matching tags, so loop over it or index into it to read the href. For example:
a = soup.findAll('a', {'class': ['a-link-normal', 's-access-detail-page', 's-color-twister-title-link', 'a-text-normal']})
But I would also recommend against such a specific class selection. The only class you really need is probably s-access-detail-page.
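
As a minimal sketch of the single-class approach, assuming a trimmed copy of the anchor from the question stands in for the live page (the real site may serve different markup to a script than to a browser), matching on one class is enough to reach the href:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the element from the question, used here instead of a live request.
html = (
    '<div><div>'
    '<a class="a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal" '
    'title="AmazonBasics Wired Keyboard" '
    'href="https://rads.stackoverflow.com/amzn/click/com/B005EOWBHC">'
    '<h2>AmazonBasics Wired Keyboard</h2></a>'
    '</div></div>'
)

soup = BeautifulSoup(html, 'html.parser')

# class_ with a single class name matches any tag that carries that class,
# regardless of which other classes it has.
a = soup.find('a', class_='s-access-detail-page')

# find() returns None when nothing matches, so guard before indexing.
href = a['href'] if a is not None else None
print(href)
```

Keep in mind that a list of classes matches tags carrying any one of the listed classes, so it is broader than the exact class string from the question.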

Related

Using Beautifulsoup to get a tags and attributes of these a tags

I just started using beautifulsoup and am stuck on an issue regarding getting attributes of tags inside other tags. I am using the whitehouse.gov/briefing-room/ for practice. What I'm trying to do right now is just get all the links on this page and append them to an empty list. This is my code right now:
result = requests.get("https://www.whitehouse.gov/briefing-room/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
urls = []
for h2_tags in soup.find_all('h2'):
    a_tag = h2_tags.find('a')
    urls.append(a_tag.attr['href']) # This is where I get the NoneType error
This code finds the <a> tags, but the first result and the last three are None, so I get a TypeError when I try to access the attributes to read the href of those <a> tags.

The problem is that some <h2> tags don't contain an <a> tag, so find('a') returns None for them. You either have to check for that case, or select only the <a> tags that are under an <h2> using a CSS selector:
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.whitehouse.gov/briefing-room/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
urls = []
for a_tag in soup.select('h2 a'): # <-- select <A> tags that are under <H2> tags
    urls.append(a_tag.attrs['href'])
print(*urls, sep='\n')
Prints:
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/10/statement-by-nsc-spokesperson-emily-horne-on-national-security-advisor-jake-sullivan-leading-the-first-virtual-meeting-of-the-u-s-israel-strategic-consultative-group/
https://www.whitehouse.gov/briefing-room/press-briefings/2021/03/09/press-briefing-by-press-secretary-jen-psaki-and-deputy-director-of-the-national-economic-council-bharat-ramamurti-march-9-2021/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/readout-of-the-white-houses-meeting-with-climate-finance-leaders/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/readout-of-vice-president-kamala-harris-call-with-prime-minister-erna-solberg-of-norway/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/nomination-sent-to-the-senate-3/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/president-biden-announces-key-hire-for-the-office-of-management-and-budget/
https://www.whitehouse.gov/briefing-room/speeches-remarks/2021/03/09/remarks-by-president-biden-during-tour-of-w-s-jenks-son/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/president-joseph-r-biden-jr-approves-louisiana-disaster-declaration/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/statement-by-president-joe-biden-on-the-house-taking-up-the-pro-act/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/white-house-announces-additional-staff/
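
The other option mentioned in this answer, checking each <h2> for a missing <a>, could look roughly like the sketch below. The HTML snippet and its paths are made up for illustration and stand in for the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one heading without a link, one link without an href.
html = """
<h2>No link here</h2>
<h2><a href="/briefing-room/one/">One</a></h2>
<h2><a href="/briefing-room/two/">Two</a></h2>
<h2><a>No href</a></h2>
"""

soup = BeautifulSoup(html, "html.parser")

urls = []
for h2_tag in soup.find_all("h2"):
    a_tag = h2_tag.find("a")  # None when the heading has no <a>
    if a_tag is not None and a_tag.has_attr("href"):
        urls.append(a_tag["href"])

print(*urls, sep="\n")
```

The explicit check keeps the loop over headings, which helps if you also need something else from each <h2>; the CSS selector version is shorter when you only want the links.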

Python + BeautifulSoup: How to get ‘href’ attribute of ‘a’ element?

I have the following:
html =
'''<div class=“file-one”>
<a href=“/file-one/additional” class=“file-link">
<h3 class=“file-name”>File One</h3>
</a>
<div class=“location”>
Down
</div>
</div>'''
And would like to get just the text of href which is /file-one/additional. So I did:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
link_text = “”
for a in soup.find_all(‘a’, href=True, text=True):
link_text = a[‘href’]
print “Link: “ + link_text
But it just prints a blank, nothing. Just Link:. So I tested it out on another site but with a different HTML, and it worked.
What could I be doing wrong? Or is it possible that the site is intentionally programmed not to return the href?
Thank you in advance and will be sure to upvote/accept answer!

The 'a' tag in your HTML does not contain any text directly; it contains an 'h3' tag that has the text. This means the tag's .string is None, so .find_all() with text=True fails to select it. In general, do not use the text parameter if a tag contains other HTML elements besides its text content.
You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements. Then add a condition in the loop to check if they contain text.
soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True):
    if a.text:
        links_with_text.append(a['href'])
Or you could use a list comprehension, if you prefer one-liners.
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
Or you could pass a lambda to .find_all().
tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
If you want to collect all links whether they have text or not, just select all 'a' tags that have a 'href' attribute. Anchor tags usually have links but that's not a requirement, so I think it's best to use the href argument.
Using .find_all().
links = [a['href'] for a in soup.find_all('a', href=True)]
Using .select() with CSS selectors.
links = [a['href'] for a in soup.select('a[href]')]
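
To see why the text filter rejects the tag, here is a small sketch (using a straight-quoted, shortened copy of the HTML from the question) that compares .string with get_text() on the same tag:

```python
from bs4 import BeautifulSoup

html = '''<div class="file-one">
<a href="/file-one/additional" class="file-link">
<h3 class="file-name">File One</h3>
</a>
</div>'''

soup = BeautifulSoup(html, "html.parser")
a = soup.find("a", href=True)

# The <a> has several children (whitespace plus the <h3>), so it has no
# single string of its own and .string is None.
a_string = a.string

# get_text() walks all descendants, so the nested text is still reachable.
a_text = a.get_text(strip=True)

# The <h3> has exactly one child, a string, so its .string is set.
h3_string = a.h3.string

print(repr(a_string), repr(a_text), repr(h3_string))
```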

You can also filter on the href attribute with a regular expression and read its value through attrs:
import re
soup.find('a', href = re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
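
As a hedged illustration of the regex filter, the sketch below keeps only links whose href starts with a slash. The sample HTML and the pattern are assumptions chosen for the example, not part of the original answer:

```python
import re
from bs4 import BeautifulSoup

# Made-up markup: one relative link, one absolute link, one anchor without href.
html = (
    '<a href="/file-one/additional">One</a>'
    '<a href="https://example.com/">External</a>'
    '<a>No href</a>'
)

soup = BeautifulSoup(html, "html.parser")

# A compiled pattern passed as an attribute filter is applied with search(),
# so "^/" anchors the match to the start of the href value.
relative_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/"))]

print(relative_links)
```

Tags without an href never match the pattern, so no separate href=True check is needed here.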

First of all, use a different text editor that doesn't use curly quotes.
Second, remove the text=True flag from the soup.find_all call.

You could solve this with just a couple lines of gazpacho:
from gazpacho import Soup
html = """\
<div class="file-one">
<a href="/file-one/additional" class="file-link">
<h3 class="file-name">File One</h3>
</a>
<div class="location">
Down
</div>
</div>
"""
soup = Soup(html)
soup.find("a", {"class": "file-link"}).attrs['href']
Which would output:
'/file-one/additional'

A bit late to the party, but I had the same issue recently while scraping some recipes and got mine printing cleanly by doing this:
from bs4 import BeautifulSoup
import requests
source = requests.get('url for website')
soup = BeautifulSoup(source.text, 'lxml')  # pass the response body, not the Response object
for article in soup.find_all('article'):
    link_tag = article.find('a', href=True)
    if link_tag is not None:  # skip articles that contain no link
        print(link_tag['href'])

Getting href using Beautiful Soup

I am trying to extract a specific link for this html code
<a class="pageNum taLnk" data-offset="10" data-page-number="1"
href="www.blahblahblah.com/bb32123">Page 1 </a>
<a class="pageNum taLnk" data-offset="20" data-page-number="2"
href="www.blahblahblah.com/bb45135">Page 2 </a>
As you can see, the links (href) are irregular and follow no pattern I can use, so I need to extract the href with BeautifulSoup.
I specifically want to get Page 2's href.
This is the code I have now.
from bs4 import BeautifulSoup
import urllib.request
url = 'https://www.tripadvisor.com/ShowUserReviews-g293917-d539542-r447460956-Duangtawan_Hotel_Chiang_Mai-Chiang_Mai.html#REVIEWS'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
for link in soup.find_all('a', attrs = {'class' : 'pageNum taLnk'}):
    print(link)
As you can see, I am stuck trying to obtain the href specifically for Page 2. Is there any way to select a tag using the extra information inside it, such as data-page-number="2" or data-offset="20"?

page_2 = soup.find('a', attrs = {'data-page-number' : '2'})
This only gets you page 2. If you want the next page no matter what the current page is, find the next-page link instead:
next_page = soup.find('a', attrs = {'class' : 'nav next rndBtn ui_button primary taLnk'})
Some attributes, like the data-* attributes in HTML 5, have names that
can’t be used as the names of keyword arguments:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You can use these attributes in searches by putting them into a
dictionary and passing the dictionary into find_all() as the attrs
argument:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
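
Applied to the pagination links from the question, a minimal sketch could look like this. It uses the question's HTML inline rather than the live page, and shows both the attrs dictionary and an equivalent CSS attribute selector:

```python
from bs4 import BeautifulSoup

html = '''<a class="pageNum taLnk" data-offset="10" data-page-number="1"
href="www.blahblahblah.com/bb32123">Page 1 </a>
<a class="pageNum taLnk" data-offset="20" data-page-number="2"
href="www.blahblahblah.com/bb45135">Page 2 </a>'''

soup = BeautifulSoup(html, "html.parser")

# data-* names are not valid keyword arguments, so pass them through attrs.
page_2 = soup.find("a", attrs={"data-page-number": "2"})
href_from_find = page_2["href"] if page_2 is not None else None

# The same lookup written as a CSS attribute selector.
page_2_css = soup.select_one('a[data-page-number="2"]')
href_from_css = page_2_css["href"] if page_2_css is not None else None

print(href_from_find)
print(href_from_css)
```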

Using BeautifulSoup to extract the title of a link

I'm trying to extract the title of a link using BeautifulSoup. The code that I'm working with is as follows:
url = "http://www.example.com"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "lxml")
for link in soup.findAll('a', {'class': 'a-link-normal s-access-detail-page a-text-normal'}):
    title = link.get('title')
    print(title)
Now, an example link element contains the following:
<a class="a-link-normal s-access-detail-page a-text-normal" href="http://www.amazon.in/Introduction-Computation-Programming-Using-Python/dp/8120348664" title="Introduction To Computation And Programming Using Python"><h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal">Introduction To Computation And Programming Using <strong>Python</strong></h2></a>
However, nothing gets displayed after I run the above code. How can I extract the value stored inside the title attribute of the anchor tag stored in link?

It seems you have put two spaces between s-access-detail-page and a-text-normal, so the class string does not match any link. Try it with single spaces, then print the number of links found. You can also print the tag itself with print(link).
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.in/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=python"
source_code = requests.get(url)
plain_text = source_code.content
soup = BeautifulSoup(plain_text, "lxml")
links = soup.findAll('a', {'class': 'a-link-normal s-access-detail-page a-text-normal'})
print(len(links))
for link in links:
    title = link.get('title')
    print(title)

You are searching for an exact string here, by using multiple classes. In that case the class string has to match exactly, with single spaces.
See the Searching by CSS class section in the documentation:
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
But searching for variants of the string value won’t work:
css_soup.find_all("p", class_="strikeout body")
# []
You'd have a better time searching for individual classes:
soup.find_all('a', class_='a-link-normal')
If you must match more than one class, use a CSS selector:
soup.select('a.a-link-normal.s-access-detail-page.a-text-normal')
and it won't matter in what order you list the classes.
Demo:
>>> from bs4 import BeautifulSoup
>>> plain_text = u'<a class="a-link-normal s-access-detail-page a-text-normal" href="http://www.amazon.in/Introduction-Computation-Programming-Using-Python/dp/8120348664" title="Introduction To Computation And Programming Using Python"><h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal">Introduction To Computation And Programming Using <strong>Python</strong></h2></a>'
>>> soup = BeautifulSoup(plain_text)
>>> for link in soup.find_all('a', class_='a-link-normal'):
...     print(link.text)
...
Introduction To Computation And Programming Using Python
>>> for link in soup.select('a.a-link-normal.s-access-detail-page.a-text-normal'):
...     print(link.text)
...
Introduction To Computation And Programming Using Python
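
The three matching behaviours described in this answer can be compared side by side. This is only a sketch on a one-tag document modelled on the documentation example, counting how many tags each search finds:

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# Exact class string: matches only when the order and spacing are identical.
exact = css_soup.find_all("p", class_="body strikeout")

# Same classes in a different order: treated as a different string, no match.
reordered = css_soup.find_all("p", class_="strikeout body")

# A single class matches regardless of the tag's other classes.
single = css_soup.find_all("p", class_="strikeout")

# A CSS selector requires both classes but ignores their order.
selector = css_soup.select("p.strikeout.body")

print(len(exact), len(reordered), len(single), len(selector))
```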

Improving a python snippet

I'm working on a python script to do some web scraping. I want to find the base URL of a given section on a web page that looks like this:
<div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
...
</div>
So, I just need to get everything from the first href besides the number('webpage-category/page/') and I have the following working code:
pages = [l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href'])]
s = pages[0]
f = ''.join([i for i in s if not i.isdigit()])
The problem is, generating this list is a waste, since I just need the first href. I think a Generator would be the answer but I couldn't pull this off. Maybe you guys could help me to make this code more concise?

What about this:
from bs4 import BeautifulSoup
html = """ <div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
</div>"""
soup = BeautifulSoup(html)
link = soup.find('div', {'class': 'pagination'}).find('a')['href']
print('/'.join(link.split('/')[:-1]))
prints:
webpage-category/page
Just FYI, regarding the code you've provided: you can use next() with a generator expression instead of a list comprehension:
s = next(l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href']))
UPD (using the website link provided):
from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup
url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urlopen(url))
links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')
print(next('/'.join(link['href'].split('/')[:-1]) for link in links
           if link.text.isdigit() and link.text != "1"))
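
One caveat with next() is that it raises StopIteration when nothing matches. A sketch of the same idea with a default value follows, assuming a small inline snippet in place of the real site; the pageSub link is a made-up example of an entry the filter should skip:

```python
import re
from bs4 import BeautifulSoup

html = """<div class='pagination'>
<a href='webpage-category/pageSub/9'>sub</a>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Generator expression: hrefs are produced lazily, so only the first
# qualifying link is ever examined.
hrefs = (
    a["href"]
    for div in soup.find_all("div", class_="pagination")
    for a in div.find_all("a", href=True)
    if not re.search("pageSub", a["href"])
)

# The second argument is returned instead of raising StopIteration.
first = next(hrefs, None)

# Drop the trailing page number but keep the final slash.
base = first.rsplit("/", 1)[0] + "/" if first else None
print(base)
```

Using rsplit on the last slash avoids stripping digits that might legitimately appear elsewhere in the URL.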
