How to find ahref link inside a <li> - python

<div class="body>
<ul class = "graph">
<li>
<a href = "Address one"> Text1
</a>
</li>
<li>
<a href = "Address two"> Text2
</a>
</li>
<li>
<a href = "Address three"> Text3
</a>
</li>
</ul>
</div>
I am doing a web scraping project right now and I am having trouble extracting these ahref links above.
right now I have
from bs4 import BeautifulSoup as soup
import requests
page = requests.get(url)
content = soup(page.content, "html.parser")
I tried using the find_all('a') and get('href') functions but they dont seem to work in this situation.

Hope this helps:
for x in content.find_all('li'):
href = x.find('a').get('href')
print(href)

Related

web scraping with python/beautiful soup

So i am learning how to web scrape. I am currently trying to find all of the social links in this code
<ul class="socials">
<li class="social instagram">
<b>
Instagram:
</b>
<a href="https://www.instagram.com/keithgalli/">
https://www.instagram.com/keithgalli/
</a>
</li>
<li class="social twitter">
<b>
Twitter:
</b>
<a href="https://twitter.com/keithgalli">
https://twitter.com/keithgalli
</a>
</li>
<li class="social linkedin">
<b>
LinkedIn:
</b>
<a href="https://www.linkedin.com/in/keithgalli/">
https://www.linkedin.com/in/keithgalli/
</a>
</li>
<li class="social tiktok">
<b>
TikTok:
</b>
<a href="https://www.tiktok.com/#keithgalli">
https://www.tiktok.com/#keithgalli
</a>
</li>
It is clearly the links in the anchor tags but i am having issues with the find_all command and when i try to use it i am only getting back one of the social links. The code im putting in is
href = soup.find_all("a")
print(href)
and the out put is
[keithgalli.github.io/web-scraping/webpage.html]
I am not exactly sure on what i am doing wrong. I thought that if i targeted the href that it would grab all of the hrefs..Any hints or direction would be greatly appreciated.
try this:
for href in soup.find_all("a"):
print(href)

Select specific tag on BS4 Python

I have the following HTML
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428007" class="product-size__option product-size__option--no-stock">
I DONT WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
I use this code to get the data
linksize =soup.find_all('li', class_='product-size__option-wrapper')
productsize = []
for size in linksize:
for size_available in size.find_all('a', {'class':['product-size__option']}):
productsize.append(size_available.text.strip())
But it gets both tags, since it shares the same class (product-size__option), how can I get only the information I need?
Thanks
The data you don't want has a CSS class product-size__option--no-stock. You can check if the element does not contain this class, by doing the following check: if 'product-size__option--no-stock' not in size_available.attrs['class']
For example:
from bs4 import BeautifulSoup
html = '''<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428007" class="product-size__option product-size__option--no-stock">
I DONT WANT THIS</a>
</li>'''
soup = BeautifulSoup(html, 'html.parser')
linksize =soup.find_all('li', class_='product-size__option-wrapper')
productsize = []
for size in linksize:
for size_available in size.find_all('a', {'class':['product-size__option']}):
if 'product-size__option--no-stock' not in size_available.attrs['class']:
productsize.append(size_available.text.strip())

Targeting the third list item with beautiful soup

I'm scraping a website with Beautiful Soup and am having trouble trying to target an item in a span tag nested within an li tag. The website I'm trying to scrape is using the same classes for each list items which is making it harder. The HTML looks something like this:
<div class="bigger-container">
<div class="smaller-container">
<ul class="ulclass">
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item">**This is the only tag I want to scrape**</span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
</ul>
My first thought was to try and target it using "nth-of-type() - I found a similar questions here but it hasn't helped. I've tried playing with it for a while now but my code basically looks like this:
import requests
from bs4 import BeautifulSoup
url = 'url of website I'm scraping'
headers = {User-Agent Header}
for page in range(1):
r = requests.get(url, headers = headers)
soup = BeautifulSoup(r.content, features="lxml")
scrape = soup.find_all('div', class_ = 'even_bigger_container_not_included_in_html_above')
for item in scrape:
condition = soup.find('li:nth-of-type(2)', 'span:nth-of-type(1)').text
print(condition)
Any help is greatly appreciated!
To use a CSS Selector, use the select() method, not find().
So to get the third <li>, use li:nth-of-type(3) as a CSS Selector:
from bs4 import BeautifulSoup
html = """<div class="bigger-container">
<div class="smaller-container">
<ul class="ulclass">
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item">**This is the only tag I want to scrape**</span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("li:nth-of-type(3)").get_text(strip=True))
Output:
**This is the only tag I want to scrape**

Python : Extract HTML content

Is there any way to get "Data to be extracted" content by extracting the following html, using BeautifulSoup or any library
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
Thanks in advance for any help !! :)
There are certainly multiple options. For starters, you can find the p element with class="class_label" and get the next p sibling:
from bs4 import BeautifulSoup
data = """
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(data)
print soup.find('p', class_='class_label').find_next_sibling('p').text
Or, using a CSS selector:
soup.select('div ul.main li p.class_label + p')[0].text
Or, relying on the User Name text:
soup.find(text='User Name').parent.find_next_sibling('p').text
Or, relying on the p element's position inside the li tag:
soup.select('div ul.main li p')[1].text

Analyze and edit links in html code with BeautifulSoup

I have a part of html page. I have to find all out links from it and replace them with the mark <can_be_link>.
Next code do almost all what I want, but it fails on links that are located on some lines (not on one) and that lines starts with tabs (in my example this is link with http://bad.com).
How to solve this issue correctly?
# -*- coding: utf-8 -*-
import BeautifulSoup
import re
if __name__=="__main__":
body = """
good link
<ul>
<li class="FOLLOW">
<a href="http://bad.com" target="_blank">
<em></em>
<span>
<strong class="FOLLOW-text">Follow On</strong>
<strong class="FOLLOW-logo"></strong>
</span>
</a>
</li>
</ul>
"""
metka_link = '<can_be_link>'
soup = BeautifulSoup.BeautifulSoup(body)
hrefs = soup.findAll(name = 'a', attrs = { 'href': re.compile('\.*') })
repl = {}
for t in hrefs:
line = str(t)
# print '\n'*2, line
if not t.has_key('href'):
continue
href = t['href'].lower()
if href.find('http') == 0 or href.find('//') == 0:
body = body.replace(line, metka_link)
print body
The rezult is
<can_be_link>
<ul>
<li class="FOLLOW">
<a href="http://bad.com" target="_blank">
<em></em>
<span>
<strong class="FOLLOW-text">Follow On</strong>
<strong class="FOLLOW-logo"></strong>
</span>
</a>
</li>
</ul>
But the desired result must be
<can_be_link>
<ul>
<li class="FOLLOW">
<can_be_link>
</li>
</ul>
Use replace_with() method:
PageElement.replace_with() removes a tag or string from the tree, and
replaces it with the tag or string of your choice
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
body = """
good link
<ul>
<li class="FOLLOW">
<a href="http://bad.com" target="_blank">
<em></em>
<span>
<strong class="FOLLOW-text">Follow On</strong>
<strong class="FOLLOW-logo"></strong>
</span>
</a>
</li>
</ul>
"""
soup = BeautifulSoup(body, 'html.parser')
links = soup.find_all('a')
for link in links:
link = link.replace_with('<can_be_link>')
print soup.prettify(formatter=None)
prints:
<can_be_link>
<ul>
<li class="FOLLOW">
<can_be_link>
</li>
</ul>
Note the import statement - use the 4th BeautifulSoup version since Beautiful Soup 3 is no longer being developed, and that Beautiful Soup 4 is recommended for all new projects.

Categories