Analyze and edit links in html code with BeautifulSoup

I have a fragment of an HTML page. I need to find all outgoing links in it and replace each one with the marker <can_be_link>.
The code below does almost everything I want, but it fails on links that span several lines when those lines start with tabs (in my example, the link to http://bad.com).
How can I solve this correctly?
# -*- coding: utf-8 -*-
import BeautifulSoup
import re
if __name__=="__main__":
    body = """
good link
<ul>
<li class="FOLLOW">
<a href="http://bad.com" target="_blank">
<em></em>
<span>
<strong class="FOLLOW-text">Follow On</strong>
<strong class="FOLLOW-logo"></strong>
</span>
</a>
</li>
</ul>
"""
    metka_link = '<can_be_link>'
    soup = BeautifulSoup.BeautifulSoup(body)
    hrefs = soup.findAll(name = 'a', attrs = { 'href': re.compile('\.*') })
    repl = {}
    for t in hrefs:
        line = str(t)
        # print '\n'*2, line
        if not t.has_key('href'):
            continue
        href = t['href'].lower()
        if href.find('http') == 0 or href.find('//') == 0:
            body = body.replace(line, metka_link)
    print body
The result is:
<can_be_link>
<ul>
<li class="FOLLOW">
<a href="http://bad.com" target="_blank">
<em></em>
<span>
<strong class="FOLLOW-text">Follow On</strong>
<strong class="FOLLOW-logo"></strong>
</span>
</a>
</li>
</ul>
But the desired result is:
<can_be_link>
<ul>
<li class="FOLLOW">
<can_be_link>
</li>
</ul>

Use the replace_with() method:
PageElement.replace_with() removes a tag or string from the tree, and
replaces it with the tag or string of your choice
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
body = """
good link
<ul>
<li class="FOLLOW">
<a href="http://bad.com" target="_blank">
<em></em>
<span>
<strong class="FOLLOW-text">Follow On</strong>
<strong class="FOLLOW-logo"></strong>
</span>
</a>
</li>
</ul>
"""
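soup = BeautifulSoup(body, 'html.parser')
links = soup.find_all('a')
for link in links:
    # replace_with() swaps the tag for the marker string inside the tree
    link.replace_with('<can_be_link>')
# formatter=None keeps the marker's angle brackets unescaped in the output
print(soup.prettify(formatter=None))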
prints:
<can_be_link>
<ul>
<li class="FOLLOW">
<can_be_link>
</li>
</ul>
Note the import statement: use Beautiful Soup 4, since Beautiful Soup 3 is no longer being developed and Beautiful Soup 4 is recommended for all new projects.
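The question asked to replace only outgoing links, while the answer's loop replaces every <a> tag. A minimal sketch of that filter follows, assuming "outgoing" means an href that begins with http or // (the same test the question's code used). The function name mask_external_links and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

MARKER = "<can_be_link>"  # marker text taken from the question


def mask_external_links(html, marker=MARKER):
    """Replace each <a> whose href starts with http or // by `marker`."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that have no href attribute at all
    for link in soup.find_all("a", href=True):
        href = link["href"].lower()
        if href.startswith(("http", "//")):
            link.replace_with(marker)
    # formatter=None keeps the marker's angle brackets unescaped
    return soup.decode(formatter=None)


# Hypothetical sample: one multi-line external link, one local link
sample = (
    '<ul><li><a href="http://bad.com" target="_blank">\n'
    "<span>Follow On</span>\n</a></li>"
    '<li><a href="/local/page">local</a></li></ul>'
)
masked = mask_external_links(sample)
print(masked)
```

Because the replacement happens in the parsed tree rather than in the raw string, line breaks and leading whitespace inside the link no longer matter.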

Related

How to find ahref link inside a <li>

<div class="body">
<ul class = "graph">
<li>
<a href = "Address one"> Text1
</a>
</li>
<li>
<a href = "Address two"> Text2
</a>
</li>
<li>
<a href = "Address three"> Text3
</a>
</li>
</ul>
</div>
I am working on a web scraping project and I am having trouble extracting the href links above.
Right now I have:
from bs4 import BeautifulSoup as soup
import requests
page = requests.get(url)
content = soup(page.content, "html.parser")
I tried using find_all('a') and get('href'), but they don't seem to work in this situation.

Hope this helps:
for x in content.find_all('li'):
    href = x.find('a').get('href')
    print(href)
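If some <li> elements contain no link, x.find('a') returns None and the .get() call raises AttributeError. A sketch that sidesteps this with a CSS selector, assuming the links sit under ul.graph as in the question. The sample markup is abbreviated, and the link-free <li> is added only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="body">
<ul class="graph">
<li><a href="Address one"> Text1</a></li>
<li>no link here</li>
<li><a href="Address two"> Text2</a></li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# a[href] matches only anchors that actually carry an href attribute
hrefs = [a["href"] for a in soup.select("ul.graph li a[href]")]
print(hrefs)
```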

Select specific tag on BS4 Python

I have the following HTML
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428007" class="product-size__option product-size__option--no-stock">
I DONT WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
I use this code to get the data:
linksize = soup.find_all('li', class_='product-size__option-wrapper')
productsize = []
for size in linksize:
    for size_available in size.find_all('a', {'class':['product-size__option']}):
        productsize.append(size_available.text.strip())
But it returns both tags, since they share the same class (product-size__option). How can I get only the information I need?
Thanks

The data you don't want has the CSS class product-size__option--no-stock. You can skip any element that carries this class with the following check: if 'product-size__option--no-stock' not in size_available.attrs['class']
For example:
from bs4 import BeautifulSoup
html = '''<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428007" class="product-size__option product-size__option--no-stock">
I DONT WANT THIS</a>
</li>'''
soup = BeautifulSoup(html, 'html.parser')
linksize = soup.find_all('li', class_='product-size__option-wrapper')
productsize = []
for size in linksize:
    for size_available in size.find_all('a', {'class':['product-size__option']}):
        if 'product-size__option--no-stock' not in size_available.attrs['class']:
            productsize.append(size_available.text.strip())
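The same filter can also be written as a single CSS selector with :not(), which Beautiful Soup's select() supports. This is an alternative sketch rather than the answer's method. The markup is shortened from the question, with the onclick attribute dropped for brevity.

```python
from bs4 import BeautifulSoup

html = """
<li class="product-size__option-wrapper">
<a data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a data-option-code="000000000196428007" class="product-size__option product-size__option--no-stock">
I DONT WANT THIS</a>
</li>
"""
soup = BeautifulSoup(html, "html.parser")
# Keep option links, but exclude those that also carry the no-stock class
selector = (
    "li.product-size__option-wrapper "
    "a.product-size__option:not(.product-size__option--no-stock)"
)
productsize = [a.get_text(strip=True) for a in soup.select(selector)]
print(productsize)
```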

Targeting the third list item with beautiful soup

I'm scraping a website with Beautiful Soup and am having trouble targeting an item in a span tag nested within an li tag. The website I'm trying to scrape uses the same classes for every list item, which makes it harder. The HTML looks something like this:
<div class="bigger-container">
<div class="smaller-container">
<ul class="ulclass">
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item">**This is the only tag I want to scrape**</span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
</ul>
My first thought was to target it using nth-of-type(). I found a similar question here, but it hasn't helped. I've been playing with it for a while, and my code basically looks like this:
import requests
from bs4 import BeautifulSoup
url = "url of website I'm scraping"
headers = {"User-Agent": "User-Agent Header"}  # placeholder value
for page in range(1):
    r = requests.get(url, headers = headers)
    soup = BeautifulSoup(r.content, features="lxml")
    scrape = soup.find_all('div', class_ = 'even_bigger_container_not_included_in_html_above')
    for item in scrape:
        condition = soup.find('li:nth-of-type(2)', 'span:nth-of-type(1)').text
        print(condition)
Any help is greatly appreciated!

To use a CSS Selector, use the select() method, not find().
So to get the third <li>, use li:nth-of-type(3) as a CSS Selector:
from bs4 import BeautifulSoup
html = """<div class="bigger-container">
<div class="smaller-container">
<ul class="ulclass">
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item">**This is the only tag I want to scrape**</span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("li:nth-of-type(3)").get_text(strip=True))
Output:
**This is the only tag I want to scrape**
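The selector above returns the text of the whole third <li>, which would include the description span if it were not empty. A sketch that narrows the match to the span with class item, assuming the list structure from the question. The span contents here are invented so that the result is visible.

```python
from bs4 import BeautifulSoup

html = """
<ul class="ulclass">
<li><span class="description">a</span><span class="item">first</span></li>
<li><span class="description">b</span><span class="item">second</span></li>
<li><span class="description">c</span><span class="item">third</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
# Third <li> directly under the list, then its span.item child
item = soup.select_one("ul.ulclass > li:nth-of-type(3) > span.item")
# select_one() returns None when nothing matches, so guard before reading
text = item.get_text(strip=True) if item is not None else None
print(text)
```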

Python 3 Beautifulsoup: Get span tag value with specific text which is also randomly placed within the html tree

I searched for this here but couldn't find an answer. It should be fairly easy to do with Selenium, but since performance is an important factor, I was thinking of doing it with Beautiful Soup instead.
Scenario: I need to scrape the prices of different items which are generated in a random fashion depending on user input, see code below:
<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Third Party Liability</span>
<span>€756.62</span>
</li>
<li>
<span>Fire & Theft</span>
<span>€15.59</span>
</li>
</ul>
</div>
If these options were static and would always be displayed in the same position within the html, it would be easy to scrape the prices but since these could be placed anywhere within the div sk-expander-content, I'm not sure how to find these in a dynamic way.
The best approach would be to write a method to pass in the text of the span we are looking for and return the value in Euro. The structure of the span tags is always the same, the first span is always the name of the item and the second one is always the price.
The first thing that came to mind is the following code, but I'm not sure if this is even robust enough or if it makes sense:
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
div_i_need = soup.find_all("div", class_="sk-expander-content")[1]
def price_scraper(text_to_find):
    for el in div_i_need.find_all(['ul', 'li', 'span']):
        if el.name == 'span':
            if el[0].text == text_to_find:
                return(el[1].text)
Your help will be much appreciated.

Use a regular expression.
from bs4 import BeautifulSoup
import re
html='''<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Third Party Liability</span>
<span>€756.62</span>
</li>
<li>
<span>Fire & Theft</span>
<span>€15.59</span>
</li>
</ul>
</div>
<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Fire & Theft</span>
<span>€756.62</span>
</li>
<li>
<span>Third Party Liability</span>
<span>€15.59</span>
</li>
</ul>
</div>'''
soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all(class_="sk-expander-content"):
    # Raw string with an escaped dot, so only amounts like €756.62 match
    for span in item.find_all('span', string=re.compile(r"€\d+\.\d+")):
        print(span.find_previous_sibling('span').text)
        print(span.text)
Output:
Third Party Liability
€756.62
Fire & Theft
€15.59
Fire & Theft
€756.62
Third Party Liability
€15.59
UPDATE:
If you only want the values from the first matching div, use find() instead of find_all():
from bs4 import BeautifulSoup
import re
html='''<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Third Party Liability</span>
<span>€756.62</span>
</li>
<li>
<span>Fire & Theft</span>
<span>€15.59</span>
</li>
</ul>
</div>
<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Fire & Theft</span>
<span>€756.62</span>
</li>
<li>
<span>Third Party Liability</span>
<span>€15.59</span>
</li>
</ul>
</div>'''
soup = BeautifulSoup(html, "html.parser")
# find() returns only the first div; find_all() then scans the spans inside it
for span in soup.find(class_="sk-expander-content").find_all('span', string=re.compile(r"€\d+\.\d+")):
    print(span.find_previous_sibling('span').text)
    print(span.text)

from bs4 import BeautifulSoup
import re
html = """
<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Third Party Liability</span>
<span>€756.62</span>
</li>
<li>
<span>Fire & Theft</span>
<span>€15.59</span>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
target = soup.select("div.sk-expander-content")
for tar in target:
    # Collect the text of every span that contains a euro sign
    data = [item.text for item in tar.find_all("span", string=re.compile("€"))]
    print(data)
Output:
['€756.62', '€15.59']
Note: I used select(), which returns a ResultSet, to find every matching div.
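The question also asked for a method that takes the item name and returns its price. A minimal sketch of such a helper, assuming (as the question states) that the first span in each <li> is the name and the second is the price. The function name price_for is hypothetical, and the search covers every div.sk-expander-content rather than one chosen by index.

```python
from bs4 import BeautifulSoup

html = """
<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Third Party Liability</span>
<span>€756.62</span>
</li>
<li>
<span>Fire &amp; Theft</span>
<span>€15.59</span>
</li>
</ul>
</div>
"""


def price_for(soup, label):
    """Return the price text next to `label`, or None if it is absent."""
    for li in soup.select("div.sk-expander-content li"):
        spans = li.find_all("span")
        # First span holds the item name, second span holds the price
        if len(spans) >= 2 and spans[0].get_text(strip=True) == label:
            return spans[1].get_text(strip=True)
    return None


soup = BeautifulSoup(html, "html.parser")
liability = price_for(soup, "Third Party Liability")
fire = price_for(soup, "Fire & Theft")
missing = price_for(soup, "Windscreen")  # hypothetical label not on the page
print(liability, fire, missing)
```

Matching on the label instead of on position means the lookup keeps working when the items appear in a different order.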

Python : Extract HTML content

Is there any way to get "Data to be extracted" content by extracting the following html, using BeautifulSoup or any library
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
Thanks in advance for any help !! :)

There are certainly multiple options. For starters, you can find the p element with class="class_label" and get the next p sibling:
from bs4 import BeautifulSoup
data = """
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
print(soup.find('p', class_='class_label').find_next_sibling('p').text)
Or, using a CSS selector:
soup.select('div ul.main li p.class_label + p')[0].text
Or, relying on the User Name text:
soup.find(text='User Name').parent.find_next_sibling('p').text
Or, relying on the p element's position inside the li tag:
soup.select('div ul.main li p')[1].text
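To check that the four approaches agree, they can be run side by side on the question's markup. This is only a verification sketch. It names the parser explicitly and uses string= in place of text=, which is the newer name for the same find() argument.

```python
from bs4 import BeautifulSoup

data = """
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
results = [
    # 1. next <p> sibling of the labelled <p>
    soup.find("p", class_="class_label").find_next_sibling("p").text,
    # 2. CSS adjacent-sibling selector
    soup.select("div ul.main li p.class_label + p")[0].text,
    # 3. locate the label by its text, then step to the next <p>
    soup.find(string="User Name").parent.find_next_sibling("p").text,
    # 4. second <p> inside the <li>, by position
    soup.select("div ul.main li p")[1].text,
]
print(results)
```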
