Get href links from a tag - python

This is just one part of the HTML, and there are multiple products on the page with the same HTML construction.
I want all the hrefs for all the products on the page:
<div class="row product-layout-category product-layout-list">
<div class="product-col wow fadeIn animated" style="visibility: visible;">
<a href="the link I want" class="product-item">
<div class="product-item-image">
<img data-src="link to an image" alt="name of the product" title="name of the product" class="img-responsive lazy" src="link to an image">
</div>
<div class="product-item-desc">
<p><span><strong>brand</strong></span></p>
<p><span class="font-size-16">name of the product</span></p>
<p class="product-item-price">
<span>product price</span></p>
</div>
</a>
</div>
.
.
.
With the code I wrote, I only get None printed a bunch of times:
from bs4 import BeautifulSoup
import requests
url = 'link to the site'
response = requests.get(url)
page = response.content
soup = BeautifulSoup(page, 'html.parser')
## this includes the part that I gave you
items = soup.find('div', {'class': 'product-layout-category'})
allItems = items.find_all('a')
for n in allItems:
    print(n.href)
How can I get it to print all the hrefs in there?

Looking at your HTML, you can use the CSS selector a.product-item. This will select all <a> tags with class="product-item":
from bs4 import BeautifulSoup
html_text = """
<div class="row product-layout-category product-layout-list">
<div class="product-col wow fadeIn animated" style="visibility: visible;">
<a href="the link I want" class="product-item">
<div class="product-item-image">
<img data-src="link to an image" alt="name of the product" title="name of the product" class="img-responsive lazy" src="link to an image">
</div>
<div class="product-item-desc">
<p><span><strong>brand</strong></span></p>
<p><span class="font-size-16">name of the product</span></p>
<p class="product-item-price">
<span>product price</span></p>
</div>
</a>
</div>
"""
soup = BeautifulSoup(html_text, "html.parser")
for link in soup.select("a.product-item"):
    print(link.get("href"))  # or link["href"]
Prints:
the link I want
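A side note: if the site returns relative paths in href, you can turn them into absolute links with urllib.parse.urljoin. A minimal sketch — the URLs below are placeholders, not values from the actual site:

```python
from urllib.parse import urljoin

base_url = "https://example.com/category/page"   # placeholder for the real page URL
hrefs = ["/product/123", "https://example.com/product/456"]  # e.g. values from link.get("href")

# urljoin resolves relative paths against the page URL and
# leaves already-absolute URLs unchanged
absolute = [urljoin(base_url, h) for h in hrefs]
print(absolute)  # ['https://example.com/product/123', 'https://example.com/product/456']
```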

Related

how to get specific links with BeautifulSoup?

I am trying to crawl HTML source with Python using BeautifulSoup.
I need to get the href of specific link <a> tags.
This is my test code. I want to get links <a href="/example/test/link/activity1~10" target="_blank">
<div class="listArea">
<div class="activity_sticky" id="activity">
.
.
</div>
<div class="activity_content activity_loaded">
<div class="activity-list-item activity_item__1fhpg">
<div class="activity-list-item_activity__3FmEX">
<div>...</div>
<a href="/example/test/link/activity1" target="_blank">
<div class="activity-list-item_addr">
<span> 0x1292311</span>
</div>
</a>
</div>
</div>
<div class="activity-list-item activity_item__1fhpg">
<div class="activity-list-item_activity__3FmEX">
<div>...</div>
<a href="/example/test/link/activity2" target="_blank">
<div class="activity-list-item_addr">
<span> 0x1292312</span>
</div>
</a>
</div>
</div>
.
.
.
</div>
</div>
Check the main page of the bs4 documentation:
for link in soup.find_all('a'):
    print(link.get('href'))
Here is code for the problem: find all the <a> tags, then get the value of each href.
soup = BeautifulSoup(html, 'html.parser')
for i in soup.find_all('a'):
    if i.get('target') == "_blank":  # .get() avoids a KeyError on links without a target attribute
        print(i['href'])
Hope my answer helps you.
To select those specific <a> tags - as an alternative to @Mason Ma's answer - you can also use CSS selectors:
soup.select('.activity_content a')
or select by the target attribute:
soup.select('.activity_content a[target="_blank"]')
Example
Will give you a list of links, matching your condition:
import requests
from bs4 import BeautifulSoup
html = '''
<div class="activity_content activity_loaded">
<div class="activity-list-item activity_item__1fhpg">
<div class="activity-list-item_activity__3FmEX">
<div>...</div>
<a href="/example/test/link/activity1" target="_blank">
<div class="activity-list-item_addr">
<span> 0x1292311</span>
</div>
</a>
</div>
</div>
<div class="activity-list-item activity_item__1fhpg">
<div class="activity-list-item_activity__3FmEX">
<div>...</div>
<a href="/example/test/link/activity2" target="_blank">
<div class="activity-list-item_addr">
<span> 0x1292312</span>
</div>
</a>
</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
[x['href'] for x in soup.select('.activity_content a[target="_blank"]')]
Output
['/example/test/link/activity1', '/example/test/link/activity2']
Based on my understanding of your question, you're trying to extract the links (href) from anchor tags where the target value is _blank. You can do this by searching for all anchor tags, then narrowing down to those whose target == '_blank':
links = soup.findAll('a', attrs={'target': '_blank'})
for link in links:
    print(link.get('href'))

Beautifulsoup: Replace all <div> with aria-level attributes with <h> tags of the same level

I have an HTML source where <div> elements serve as headings. Using BeautifulSoup and the aria-level attribute, I would like to replace all <div> elements with <h> tags of the same level. My code kind of works for my purpose, but it seems inelegant, and ideally the attributes of the former <div> elements would be removed.
import bs4
html = '''<div id="container">
<div role="heading" aria-level="1">The main page heading</div>
<p>This article is about showing a page structure.</p>
<div role="heading" aria-level="2">Introduction</div>
<p>An introductory text.</p>
<div role="heading" aria-level="2">Chapter 1</div>
<p>Text</p>
<div role="heading" aria-level="3">Chapter 1.1</div>
<p>More text in a sub section.</p>
</div>'''
soup = bs4.BeautifulSoup(html, "html.parser")
for divheader in soup.find_all("div", {"aria-level": "1"}):
    divheader.name = "h1"
for divheader in soup.find_all("div", {"aria-level": "2"}):
    divheader.name = "h2"
for divheader in soup.find_all("div", {"aria-level": "3"}):
    divheader.name = "h3"
print(soup)
Output:
<div id="container">
<h1 aria-level="1" role="heading">The main page heading</h1>
<p>This article is about showing a page structure.</p>
<h2 aria-level="2" role="heading">Introduction</h2>
<p>An introductory text.</p>
<h2 aria-level="2" role="heading">Chapter 1</h2>
<p>Text</p>
<h3 aria-level="3" role="heading">Chapter 1.1</h3>
<p>More text in a sub section.</p>
</div>
What it should look like:
<div id="container">
<h1>The main page heading</h1>
<p>This article is about showing a page structure.</p>
<h2>Introduction</h2>
<p>An introductory text.</p>
<h2>Chapter 1</h2>
<p>Text</p>
<h3>Chapter 1.1</h3>
<p>More text in a sub section.</p>
</div>
You can clear tag.attrs to delete all attributes from a tag:
for div in soup.select("div[aria-level]"):
    div.name = f'h{div["aria-level"]}'
    div.attrs = {}
print(soup)
Prints:
<div id="container">
<h1>The main page heading</h1>
<p>This article is about showing a page structure.</p>
<h2>Introduction</h2>
<p>An introductory text.</p>
<h2>Chapter 1</h2>
<p>Text</p>
<h3>Chapter 1.1</h3>
<p>More text in a sub section.</p>
</div>
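If the divs might carry other attributes worth keeping, a variation on the same idea pops only the two heading-related keys instead of clearing everything. A sketch — the data-x attribute is hypothetical, for illustration only:

```python
import bs4

html = '<div id="container"><div role="heading" aria-level="2" data-x="keep">Intro</div></div>'
soup = bs4.BeautifulSoup(html, "html.parser")

for div in soup.select("div[aria-level]"):
    div.name = f'h{div["aria-level"]}'
    # drop only the heading-related attributes; anything else survives
    div.attrs.pop("role", None)
    div.attrs.pop("aria-level", None)

print(soup)
```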

BeautifulSoup - Find link in this HTML

Here's my code to get html
from bs4 import BeautifulSoup
import urllib.request
from fake_useragent import UserAgent
url = "https://blahblah.com"
ua = UserAgent()
ran_header = ua.random
req = urllib.request.Request(url,data=None,headers={'User-Agent': ran_header})
uClient = urllib.request.urlopen(req)
page_html = uClient.read()
uClient.close()
html_source = BeautifulSoup(page_html, "html.parser")
results = html_source.findAll("a",{"onclick":"googleTag('click-listings-item-image');"})
From here results contains various listings containing different info. If I then print(results[0]):
<a href="https://blahblah.com//link//asdfqwersdf" onclick="googleTag('click-listings-item-image');">
<div class="results-panel-new col-sm-12">
<div class="row">
<div class="col-xs-12 col-sm-3 col-lg-2 text-center thumb-table-cell">
<span class="eq-table-new text-center"><img class="img-thumbnail" src="//images/120x90/7831a94157234bc6.jpg" /></span>
</div>
<div class="col-xs-12 hidden-sm hidden-md col-lg-1 text-center thumb-table-cell">
<span class="eq-table-new text-center"><span class="hidden-sm hidden-md hidden-lg">Year: </span>2000</span>
</div>
<div class="col-xs-12 hidden-sm hidden-md col-lg-2 text-center thumb-table-cell">
<span class="eq-table-new text-center">Fake City, USA</span>
</div>
<div class="col-xs-12 col-sm-3 col-lg-2 text-center thumb-table-cell">
<span class="eq-table-new text-center"><span class="hidden-sm hidden-md hidden-lg">Price: </span>$900</span>
</div>
</div>
<div class="row">
<div class="hidden-xs col-sm-12 table_details_new"><span>Descriptive details</span></div>
</div>
</div><!-- results-panel-new -->
</a>
I can get the image, Year, Location, and Price by doing a variation of this:
ModelYear = results[0].div.find("div",{"class":"col-xs-12 hidden-sm hidden-md col-lg-1 text-center thumb-table-cell"}).span.text
How do I get the very first href from results[0]?
Based on the chat discussion, the href is available directly via results[0]['href'].
You can use find_all() with href=True to search the page for anchors that actually have an href, e.g.:
html_source.find_all('a', href=True)[0]
Your selector is returning an <a> tag element, as you can see in the printout. So yes, you can simply access the href directly with results[0]['href']. You can also tell this because the entire panel (the card displaying the listing) on the page is a clickable element. If you wanted to make this clearer, you could change your selector for results to #js_thumb_view ~ a. This is also a faster selector.
results = html_source.select('#js_thumb_view ~ a')
Then all links, for example, with
links = [result['href'] for result in results]
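One defensive detail, in case a selector ever matches <a> tags without an href: passing href=True to find_all restricts results to tags that actually carry the attribute, so the comprehension can't raise a KeyError. A standalone sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = '<a href="/a">one</a><a onclick="x()">no link</a><a href="/b">two</a>'
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)  # ['/a', '/b']
```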

Using BeautifulSoup to extract specific nested div

I have this HTML code which I'm creating the script for:
http://imgur.com/a/dPNYI
I would like to extract the highlighted text ("some text") and print it.
I tried going through every nested div on the way down to the div I needed, like this:
import requests
from bs4 import BeautifulSoup
url = "the url this is from"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
for div in soup.find_all("div", {"id": "main"}):
    for div2 in div.find_all("div", {"id": "app"}):
        for div3 in div2.find_all("div", {"id": "right-sidebar"}):
            for div4 in div3.find_all("div", {"id": "chat"}):
                for div5 in div4.find_all("div", {"id": "chat-messages"}):
                    for div6 in div5.find_all("div", {"class": "chat-message"}):
                        for div7 in div6.find_all("div", {"class": "chat-message-content selectable"}):
                            print(div7.text.strip())
I implemented what I've seen in guides and similar questions online, but I bet this is not even close and there must be a much easier way. This doesn't work: it doesn't print anything, and I'm a bit lost. How can I print the highlighted line (which is essentially the very first div child of the div with the id "chat-messages")?
HTML CODE:
<!DOCTYPE html>
<html>
<head>
<title>
</title>
</head>
<body>
<div id="main">
<div data-reactroot="" id="app">
<div class="top-bar-authenticated" id="top-bar">
</div>
<div class="closed" id="navigation-bar">
</div>
<div id="right-sidebar">
<div id="chat">
<div id="chat-head">
</div>
<div id="chat-title">
</div>
<div id="chat-messages">
<div class="chat-message">
<div class="chat-message-avatar" style="background-image: url('https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/65/657dcec97cc00bc378629930ecae1776c0d981e0.jpg');">
</div>
<a class="chat-message-username clickable">
<div class="iron-color">
aloe
</div></a>
<div class="chat-message-content selectable">
<!-- react-text: 2532 -->some text<!-- /react-text -->
</div>
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
Using the lxml parser (i.e. soup = BeautifulSoup(data, 'lxml')), you can use .find with multiple classes just as simply as with a single class to find nested divs:
soup.find('div', {'class': 'chat-message-content selectable'}).text
The line above should work for you as long as that class combination occurs only once in the HTML.
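For what it's worth, this also works with the built-in html.parser via a CSS selector that requires both classes (a compound selector matches only elements carrying every listed class). A standalone sketch using the chat markup from the question:

```python
from bs4 import BeautifulSoup

html = '''
<div id="chat-messages">
  <div class="chat-message">
    <div class="chat-message-content selectable">some text</div>
  </div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# div.chat-message-content.selectable requires BOTH classes on the element
msg = soup.select_one("div.chat-message-content.selectable")
print(msg.get_text(strip=True))  # some text
```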

Python RegEx with Beautifulsoup 4 not working

I want to find all div tags which have a certain pattern in their class name, but my code is not working as desired.
This is the code snippet
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div', attrs={'class': re.compile(r'common text .*')})
where html_doc is the string with the following html
<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>
But all_findings is coming out as an empty list while it should have found one item.
It works in the case of an exact match:
all_findings = soup.findAll('div',attrs={'class':re.compile(r'hide-c')})
I am using bs4.
Instead of using a regular expression, put the classes you are looking for in a list:
all_findings = soup.findAll('div', attrs={'class': ['common', 'text']})
Example code:
from bs4 import BeautifulSoup
html_doc = """<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div', attrs={'class': ['common', 'text']})
print(all_findings)
This outputs:
[<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>]
To extend @Andy's answer, you can make a list of class names and compiled regular expressions:
soup.find_all('div', {'class': ["common", "text", re.compile(r'sighting_\d{5}')]})
Note that, in this case, you'll get the div elements with one of the specified classes/patterns - in other words, it's common or text or sighting_ followed by five digits.
If you want to have them joined with "and", one option would be to turn off the special treatment for "class" attributes by having the document parsed as "xml":
soup = BeautifulSoup(html_doc, 'xml')
all_findings = soup.find_all('div', class_=re.compile(r'common text sighting_\d{5}'))
print(all_findings)
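Another option that keeps html.parser and gives genuine "and" semantics is a function filter: bs4 calls a callable passed to find_all once per tag, so you can inspect the full class list yourself. (The original regex finds nothing because, in most bs4 versions, a class pattern is matched against each class value individually, never against the joined "common text sighting_…" string.) A sketch:

```python
import re
from bs4 import BeautifulSoup

html_doc = '<div class="common text sighting_4619012"><p class="small">x</p></div>'
soup = BeautifulSoup(html_doc, 'html.parser')

def is_sighting(tag):
    classes = tag.get("class") or []   # a multi-valued class comes back as a list
    return (tag.name == "div"
            and "common" in classes
            and "text" in classes
            and any(re.fullmatch(r"sighting_\d+", c) for c in classes))

all_findings = soup.find_all(is_sighting)
print(all_findings)
```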
